Biometrics measures people’s unique physical and behavioral characteristics to recognize the identity of an individual. Gait, the walking pattern of an individual, is one of the biometric modalities, alongside face, fingerprint, and iris. Gait recognition has the advantage that it can operate at a distance without the user’s cooperation, and gait is difficult to camouflage. Because of these advantages, gait recognition is useful in many applications, such as person identification, criminal investigation, and healthcare.
As with other recognition problems, gait data can be captured by five types of sensors: RGB cameras, RGB-D cameras [68, 79], accelerometers, floor sensors, and continuous-wave radars. Among them, the RGB camera is not only the most popular, owing to its low cost, but also the most challenging, since RGB pixels might not be effective in capturing motion cues. This work studies gait recognition from RGB cameras.
The core of gait recognition lies in extracting gait features from the video frames of a walking person, where prior work can be categorized into two types: appearance-based and model-based methods. Appearance-based methods, e.g., Gait Energy Image (GEI), take the averaged silhouette image as the gait feature. While they have a low computational cost and can handle low-resolution imagery, they can be sensitive to variations such as clothing change, carrying, viewing angle, and walking speed [53, 5, 69, 32, 2, 55, 72]. Model-based methods use the articulated body skeleton from pose estimation as the gait feature. They are more robust to the aforementioned variations, but at the price of a higher computational cost and a dependency on pose estimation accuracy [3, 15, 23].
The central challenge in designing a gait feature is that it must be invariant to appearance variation due to clothing, viewing angle, carrying, etc. Therefore, our desire is to disentangle the gait feature from the non-gait-related appearance of the walking person. For both appearance-based and model-based methods, such disentanglement is achieved by manually handcrafting GEI-like [28, 5] or body skeleton-like [3, 23, 15] features, since neither has color or texture information. However, we argue that these manual disentanglements may be sensitive to changes in walking condition. In other words, they can lose certain gait information or create redundant gait information. For example, GEI-like features have distinct silhouettes for the same subject wearing different clothes. For skeleton-like features, when the subject carries accessories (e.g., a bag or an umbrella), certain body joints such as the hands may have fixed positions, and hence carry information that is redundant for gait.
To remedy the aforementioned issues in handcrafted features, as shown in Fig. 3 (a), this paper proposes a novel approach to learn gait representations directly from RGB video. Specifically, we aim to automatically disentangle dynamic pose features (the trajectory of gait) from pose-irrelevant features. To further distill identity information from the pose-irrelevant features, we disentangle them into appearance (i.e., clothing) and canonical features. Here, the canonical feature refers to a standard and unique representation of the human body, such as body ratio, width, and limb lengths. The pose features and canonical features are discriminative in identity and are used for gait recognition. Fig. 3 (b) visualizes the three disentangled features.
This disentanglement is realized by designing an autoencoder-based Convolutional Neural Network (CNN), GaitNet, with novel loss functions. For each video frame, the encoder estimates three latent representations, namely the pose, canonical, and appearance features, by employing three loss functions: 1) the cross reconstruction loss enforces that the canonical and appearance features of one frame, fused with the pose feature of another frame, can be decoded to the latter frame; 2) the pose similarity loss forces the sequences of pose features extracted from videos of the same subject to be similar even under different conditions; 3) the canonical consistency loss favors consistent canonical features among videos of the same subject under different conditions. Finally, the pose features of a sequence are fed into a multi-layer LSTM, trained with our incremental identity loss, to generate the sequence-based dynamic gait feature. The average of the canonical features gives the sequence-based static gait feature. Given two gait videos, the cosine distances between their respective dynamic and static gait features are computed, and their summation is the final video-to-video gait similarity metric.
In addition, most prior work [28, 5, 3, 6, 17, 74, 45, 55, 60, 47] chooses the side-view walking video, which has the richest gait information, as the gallery sequence. However, in practice other viewing angles, such as the frontal view, can be very common when pedestrians walk toward or away from a surveillance camera. Also, the prior works [57, 9, 10, 49] that focus on the frontal view are often based on RGB-D videos, which carry depth information in addition to RGB. Therefore, to encourage gait recognition from frontal-view RGB videos, which generally carry the least gait information, we collect a high-definition (HD) Frontal-View Gait database, named FVG, with a wide range of variations. It has three frontal-view angles, where the subject walks toward the camera from the left of, along, and from the right of its optical axis. For each of the three angles, different variants are explicitly captured, including walking speed, clothing, carrying, and multiple people.
In summary, this paper makes the following contributions:
Our proposed GaitNet directly learns disentangled representations from RGB videos, which is in sharp contrast to the conventional appearance-based or model-based methods.
We introduce a Frontal-View Gait database with variations in viewing angle, walking speed, carrying, clothing, background, and time gap. This is the first HD gait database, with nearly twice the number of subjects compared to existing RGB gait databases.
Our proposed method outperforms the state of the art on three benchmarks: the CASIA-B, USF, and FVG datasets.
We demonstrate the strength of gait recognition over face recognition in the task of person recognition from surveillance-quality videos.
2 Related Work
Gait Representation. Most prior works are based on two types of gait representations. In appearance-based methods, the gait energy image (GEI) or gait entropy image (GEnI) is computed from extracted silhouette masks. Specifically, GEI uses an averaged silhouette image as the gait representation for a video. These methods are popular in the gait recognition community for their simplicity and effectiveness. However, they often suffer from sizeable intra-subject appearance changes due to covariates such as clothing, carrying, views, and walking speed. On the other hand, model-based methods [15, 23] fit articulated body models to images and extract kinematic features such as body joint positions. While they are robust to some covariates such as clothing and speed, they require a relatively higher image resolution for reliable pose estimation and incur higher computational costs.
In contrast, our approach learns the gait representation directly from raw RGB video frames, which contain richer information and thus offer a higher potential for extracting discriminative gait features. The most relevant prior work learns gait features from RGB images via a Conditional Random Field. Compared to that work, our proposed approach learns two complementary features, the dynamic and static gait features, and has the advantage of being able to leverage a large amount of training data and to learn a more discriminative representation from data with multiple covariates. In addition, some recent works [69, 74, 75, 45, 72] use CNNs to learn more discriminative features from GEI. However, the source of the learning, GEI, has already lost dynamic information, since a random shuffle of the video frames results in the identical GEI feature. In contrast, the proposed GaitNet learns features from RGB imagery, which allows the network to explore richer information for representation learning. This is demonstrated by our comparison with [12, 69] in Sec. 5.2.1 and Sec. 5.2.3.
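To make the shuffling argument concrete, the following minimal Python sketch averages toy binary silhouettes into a GEI and checks that reordering the frames leaves the result unchanged. The helper name `gait_energy_image` and the toy frames are illustrative assumptions, not taken from any released code.

```python
def gait_energy_image(frames):
    """Average equally sized binary silhouettes pixel by pixel (toy GEI)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]

# Three toy 2x3 silhouettes (1 = person pixel, 0 = background).
frames = [
    [[1, 0, 0], [1, 1, 0]],
    [[0, 1, 0], [1, 1, 0]],
    [[0, 0, 1], [0, 1, 1]],
]

# Play the same frames in reverse order: the temporal dynamics differ.
reordered = frames[::-1]

gei_ordered = gait_energy_image(frames)
gei_reordered = gait_energy_image(reordered)

# The average ignores frame order, so both GEIs are identical.
print(gei_ordered == gei_reordered)
```

Because the average is order independent, any model that consumes only the GEI cannot recover the temporal order of the walking cycle.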
Gait Databases. There are many classic gait databases, such as the SOTON Large dataset, USF, CASIA-B, OU-ISIR, and TUM GAID. We compare our FVG database with the widely used ones in Tab. I. CASIA-B is a large multi-view gait database with three variations: viewing angle, clothing, and carrying. Each subject is captured from multiple views under three conditions: normal walking (NM), walking in coats (CL), and walking while carrying bags (BG). For each view, several videos are captured under each of the NM, CL, and BG conditions. The USF database covers five variations per subject: two viewing angles (left and right), two ground surfaces (grass and concrete), shoe change, carrying condition, and time. While OU-ISIR-LP and OU-ISIR-LP-Bag are large databases, only silhouettes are publicly released for both of them. In contrast, our FVG focuses on the frontal view, with different near-frontal viewing angles toward the camera, and other variations including walking speed, carrying, clothing, multiple people, and time.
Disentanglement Learning. Besides model-based approaches that represent data with semantic latent vectors [62, 63, 61, 42], data-driven disentangled representation learning approaches are gaining popularity in the computer vision community. DrNet disentangles content and pose vectors with a two-encoder architecture, which removes content information from the pose vector by generative adversarial training. Another work segments foreground masks of body parts from pose joints via U-Net and then transforms the body parts to the desired motion with adversarial training. Similarly, a further work utilizes U-Net and a Variational Autoencoder (VAE) to disentangle an image into appearance and shape. DR-GAN [64, 65] achieves SOTA performance on pose-invariant face recognition by explicitly disentangling pose variation with a multi-task GAN. Different from [20, 4, 21], our method has only one encoder to disentangle the three latent features, through the design of novel loss functions and without the need for adversarial training. Further, pose labels are used in DR-GAN training so as to disentangle the identity feature from the pose. However, to disentangle pose and appearance features from RGB, our method has neither pose nor appearance labels to utilize, since it is nontrivial to define the types of walking patterns or clothes as discrete classes.
Gait vs. Face recognition. Both gait and face are popular biometric modalities, especially in covert identification-at-a-distance applications. Hence, it is valuable to understand the pros and cons of each modality when SOTA gait recognition and face recognition algorithms are deployed. Along this direction, most prior works focus on the fusion of both modalities and evaluate on relatively small datasets [54, 36, 78]. In contrast, we conduct comprehensive evaluations using SOTA face and gait recognition algorithms, across various conditions of the CASIA-B and FVG databases. Further, the performance is measured along the video duration to explore the impact of person-to-camera distance.
Table II: Symbols and notation.
|Type||Description|
|scalar||Index of subject|
|scalar||Time step in a video|
|scalar||Number of frames in a video|
|matrices||Gait video under a given condition|
|matrix||Frame of a video|
|matrix||Reconstructed frame via the decoder|
|vector||Dynamic gait feature|
|vector||Static gait feature|
|vector||The output of LSTM at step t|
|-||Pose similarity loss|
|-||Canonical consistency loss|
|-||Incremental identity loss|
3 Proposed Approach
Let us start with a simple example. Assume there are three videos, where videos 1 and 2 capture subject A wearing a t-shirt and a long down coat respectively, and in video 3 subject B wears the same long down coat as in video 2. The objective is to design an algorithm for which the gait features of videos 1 and 2 are the same, while those of videos 2 and 3 are different. Clearly, this is a challenging objective, as the long down coat can easily dominate the extracted feature, which would make videos 2 and 3 more similar than videos 1 and 2 in the latent space of gait features. Indeed, the core challenge, as well as the objective, of gait recognition is to extract gait features that are discriminative among subjects but invariant to confounding factors, such as viewing angle, walking speed, and clothing change. Table II summarizes the symbols and notation used in this paper.
Our approach to achieving this objective is feature disentanglement. In our preliminary work, we disentangle features into two components: pose and “appearance” features. However, further research revealed that the “appearance” feature still contains certain discriminative information, which can be useful for identity classification. For instance, as in Fig. 6, if we ignore the body pose, e.g., the positions of the arms and legs, and the clothing information, e.g., the color and texture of the clothes, we may still tell different subjects apart by their inherent body characteristics, which can include categories of overall body shape (e.g., rectangle, triangle, inverted triangle, and hourglass), arm length, torso vs. leg ratio, etc. In other words, even when different people wear exactly the same clothing and stand still, these characteristics are still subject dependent. Meanwhile, for the same subject under various conditions, these characteristics are relatively constant. In this work, we term the feature describing these characteristics the canonical feature. Hence, given a walking video under a certain condition, our framework disentangles the encoded feature into three components: the pose feature, the appearance feature, and the canonical feature. We also term the concatenation of the appearance and canonical features the pose-irrelevant feature, which is conceptually equivalent to the “appearance” feature in our preliminary work. The pose feature describes the positions of body parts, and its dynamics over time are essentially the core element of gait; the canonical feature defines the unique characteristics of an individual body; and the appearance feature describes the subject’s clothing.
The above feature disentanglement can be naturally implemented as an encoder-decoder network. Specifically, as depicted in Fig. 7, the input to our GaitNet is a video sequence, with background removed using any off-the-shelf pedestrian detection and segmentation method [29, 8, 7]. With carefully designed loss functions, an encoder is learned to disentangle the pose, canonical and appearance features for each video frame. Then, a multi-layer LSTM explores the temporal dynamics of pose features and aggregates them to a sequence-based dynamic gait feature. In the meantime, the average of all the canonical features is defined as the static gait feature. Measuring distances of both dynamic and static features between the gallery and probe walking videos provides the final matching score. In this section, we first present the feature disentanglement, followed by temporal aggregation, model inference and finally implementation details.
3.2 Feature Disentanglement
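Table III: Properties of the three disentangled features.
|Feature||Constant Across Frames||Constant Across Conditions||Discriminative|
|Appearance||Yes||No||No|
|Canonical||Yes||Yes||Yes|
|Pose||No||Yes||Yes for the dynamics of pose features over time|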
For the majority of gait datasets, there is limited intra-subject appearance variation. Hence, appearance could be a discriminative cue for identification during training as many subjects can be easily distinguished by their clothes. Unfortunately, any feature extractors relying on appearance will not generalize well on the test set or in practice, due to potentially diverse clothing or appearance between two videos of the same subject. This limitation on training sets also prevents us from learning ideal feature extractors if solely relying on identification objective. Hence we propose to learn to disentangle the canonical and pose feature from the visual appearance. Since a video is composed of frames, disentanglement should be conducted at the frame level first.
Before presenting the details of how we conduct disentanglement, let us first understand the various properties of three types of features, as summarized in Tab. III. These properties are crucial in guiding us to define effective loss functions for disentanglement. The appearance feature mainly describes the clothing information of the subject. Hence it is constant within a video sequence, but often different across different conditions. Of course it is not discriminative among individuals. The canonical feature is subject-specific, and is therefore constant across both video frames, and conditions. The pose feature is obviously different across video frames, but is assumed to be constant across conditions. Since the pose feature is the manifestation of video-based gait information at a specific frame, the pose feature itself might not be discriminative. However, the dynamics of pose features over time will constitute the dynamic gait feature, which is discriminative among individuals.
To this end, we propose to use an encoder-decoder network architecture with carefully designed loss functions to disentangle the pose feature and the canonical feature from the appearance feature. The encoder $\mathcal{E}$ encodes a feature representation of each frame $\mathbf{x}$ and explicitly splits it into three components, namely the appearance feature $\mathbf{f}_a$, the canonical feature $\mathbf{f}_c$, and the pose feature $\mathbf{f}_p$:
$$\mathbf{f}_a, \mathbf{f}_c, \mathbf{f}_p = \mathcal{E}(\mathbf{x}).$$
Collectively, these three features are expected to fully describe the original input image, since they can be decoded back to the original input through a decoder $\mathcal{D}$:
$$\hat{\mathbf{x}} = \mathcal{D}(\mathbf{f}_a, \mathbf{f}_c, \mathbf{f}_p).$$
We now define the loss functions used to jointly learn the encoder $\mathcal{E}$ and the decoder $\mathcal{D}$.
Cross Reconstruction Loss. The reconstructed image should be close to the original input. However, enforcing a self-reconstruction loss as in a typical auto-encoder cannot ensure the meaningful disentanglement that our design requires. Hence, we propose the cross reconstruction loss, which uses the appearance feature and canonical feature of one frame and the pose feature of another frame of the same video to reconstruct the latter frame.
The cross reconstruction loss, on one hand, can act as a self-reconstruction loss to make sure the three features are sufficiently representative to reconstruct a video frame. On the other hand, as we can pair the pose feature of the current frame with the canonical and appearance features of any frame in the same video to reconstruct the same target, it forces both the canonical and appearance features to be similar across all frames within a video. Indeed, according to Tab. III, the main distinction between the pose-irrelevant feature and the pose feature is that the former is constant across frames while the latter is not. This is the basis for designing our cross reconstruction loss.
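The cross reconstruction idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the real encoder and decoder are CNNs, whereas here `encode` simply splits a flat vector into three chunks and `decode` concatenates them; the squared-error distance and all names (`encode`, `decode`, `cross_reconstruction_loss`) are hypothetical choices made for the example.

```python
def encode(frame, dims=(2, 2, 2)):
    """Toy stand-in for the encoder: split a flat frame vector into
    (appearance, canonical, pose) chunks. The real encoder is a CNN."""
    a, c, _ = dims
    return frame[:a], frame[a:a + c], frame[a + c:]

def decode(f_a, f_c, f_p):
    """Toy stand-in for the decoder: concatenate the three features."""
    return list(f_a) + list(f_c) + list(f_p)

def squared_error(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def cross_reconstruction_loss(frame_t1, frame_t2):
    """Decode (appearance, canonical) of frame t1 together with the pose
    of frame t2, and compare the result against frame t2."""
    f_a1, f_c1, _ = encode(frame_t1)
    _, _, f_p2 = encode(frame_t2)
    return squared_error(decode(f_a1, f_c1, f_p2), frame_t2)

# Two frames whose pose-irrelevant parts agree and whose poses differ.
frame_t1 = [0.2, 0.4, 0.6, 0.8, 0.1, 0.9]
frame_t2 = [0.2, 0.4, 0.6, 0.8, 0.7, 0.3]
# A frame whose "appearance" chunk drifted between the two time steps.
drifted_t2 = [0.5, 0.4, 0.6, 0.8, 0.7, 0.3]

loss_clean = cross_reconstruction_loss(frame_t1, frame_t2)
loss_drifted = cross_reconstruction_loss(frame_t1, drifted_t2)
```

In this toy setting the loss vanishes only when the appearance and canonical chunks agree across the two frames, which mirrors the constancy-across-frames property the loss is meant to enforce.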
Pose Similarity Loss. The cross reconstruction loss is able to prevent the pose-irrelevant feature from being contaminated by pose information that changes across frames. Otherwise, i.e., if the appearance or canonical feature contained some pose information, the decoded frame and the target frame would have different poses. However, clothing/texture and body information may still leak into the pose feature. In the extreme case, the appearance and canonical features could be constant vectors while the pose feature encodes all the information of a video frame.
To encourage the pose feature to include only pose information, we leverage multiple videos of the same subject. Two videos of the same subject captured under two different conditions differ in the person’s appearance, e.g., through clothing changes. Despite the appearance changes, the gait information is assumed to be constant between the two videos. Since it is almost impossible to enforce similarity of the pose feature between individual video frames, as this would require precise frame-level alignment, we instead minimize the distance between the two videos’ averaged pose features.
According to Tab. III, the pose feature is constant across conditions, which is the basis of our pose similarity loss.
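A minimal Python sketch of this idea follows. It assumes a squared Euclidean distance between the two time-averaged pose features; the distance choice and the names `mean_vector` and `pose_similarity_loss` are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Average a list of equal-length feature vectors over time."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def pose_similarity_loss(pose_feats_v1, pose_feats_v2):
    """Squared distance between the time-averaged pose features of two
    videos of the same subject under different conditions."""
    m1 = mean_vector(pose_feats_v1)
    m2 = mean_vector(pose_feats_v2)
    return sum((a - b) ** 2 for a, b in zip(m1, m2))

# Same toy gait cycle, but the second video starts at a different phase.
video_nm = [[0.0, 1.0], [1.0, 0.0]]
video_cl_shifted = [[1.0, 0.0], [0.0, 1.0]]
# A video whose pose features absorbed some condition-specific offset.
video_contaminated = [[1.0, 1.0], [1.0, 1.0]]

loss_same_gait = pose_similarity_loss(video_nm, video_cl_shifted)
loss_contaminated = pose_similarity_loss(video_nm, video_contaminated)
```

Averaging over time before comparing is what removes the need for frame-level alignment: the phase-shifted video yields zero loss here even though no frame pair matches.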
Canonical Consistency Loss. The canonical feature describes the subject’s body characteristics, which are the same over all video frames. Specifically, for two videos of the same subject under two different conditions, the canonical feature is constant across both frames and conditions, as listed in Tab. III. Tab. III also states that the canonical feature is discriminative across subjects. Hence, to enforce both constancies and the discriminativeness, we define the canonical consistency loss as a sum of three terms.
The three terms measure the consistency across frames within a single video, the consistency across different videos of the same subject, and the identity classification error of a classifier applied to the canonical feature, respectively.
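The three terms can be sketched as follows. This is a hedged illustration rather than the exact formulation: the squared Euclidean distances, the normalization by the number of frame pairs, the equal video lengths, and the names `canonical_consistency_loss` and `log_softmax` are all assumptions, and the classifier is represented only by its per-frame class scores.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def canonical_consistency_loss(fc_v1, fc_v2, logits_v1, label):
    """fc_v1, fc_v2: per-frame canonical features of one subject under two
    conditions (assumed equal length). logits_v1: per-frame class scores
    from a hypothetical classifier. label: index of the true subject."""
    T = len(fc_v1)
    pairs = [(i, j) for i in range(T) for j in range(T) if i != j]
    # Term 1: consistency across frames within a single video.
    within = sum(sq_dist(fc_v1[i], fc_v1[j]) for i, j in pairs) / max(len(pairs), 1)
    # Term 2: consistency across the two videos of the same subject.
    across = sum(sq_dist(a, b) for a, b in zip(fc_v1, fc_v2)) / T
    # Term 3: identity classification (negative log likelihood).
    nll = sum(-log_softmax(z)[label] for z in logits_v1) / T
    return within + across + nll

consistent = [[1.0, 2.0], [1.0, 2.0]]
drifting = [[1.0, 2.0], [3.0, 2.0]]
flat_logits = [[0.0, 0.0], [0.0, 0.0]]

loss_consistent = canonical_consistency_loss(consistent, consistent, flat_logits, label=0)
loss_drifting = canonical_consistency_loss(drifting, consistent, flat_logits, label=0)
```

With uninformative class scores, the consistent pair is penalized only by the classification term, while the drifting canonical feature is additionally penalized by both consistency terms.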
3.3 Gait Feature Learning and Aggregation
Even when we can disentangle pose, canonical, and appearance information for each video frame, the canonical and pose features have to be aggregated over time, since 1) gait recognition is conducted between two videos instead of two images; 2) the canonical features of individual frames are not guaranteed to carry exactly the same canonical information; 3) the pose feature of the current frame only represents the walking pose of the person at a specific instant, which can be similar to that of a different individual at another instant. Here, we are looking for discriminative characteristics in a person’s walking pattern. Therefore, modeling the aggregation of the canonical feature and the temporal change of the pose feature is critical.
3.3.1 Static Gait Feature via Canonical Feature Aggregation
After learning the canonical feature for every single frame as described in Sec. 3.2, we explore the best representation of the canonical features across all frames of a video sequence. Since the canonical feature is assumed to be constant over time, we average it over all frames as a way to aggregate the canonical features over time. Given that the canonical feature describes the body characteristics as if we froze the gait, we call the aggregated canonical feature the static gait feature.
3.3.2 Dynamic Gait Feature via Pose Feature Aggregation
Temporal modeling of poses is where architectures such as the recurrent neural network or long short-term memory (LSTM) work best. Specifically, in this work, we utilize a multi-layer LSTM structure to explore the temporal information of pose features, e.g., how the trajectory of the subject’s body parts changes over time. As shown in Fig. 7, the pose features extracted from one video sequence are fed into the multi-layer LSTM. The output of the LSTM is connected to a classifier, in this case a linear classifier, to classify the subject’s identity.
Consider the output of the LSTM at time step t, which accumulates information from all pose features fed into it up to that step.
Now we define the loss function for the LSTM. A trivial option for identification is to add a classification loss on top of the LSTM output of the final time step, namely the negative log likelihood that the classifier correctly identifies the final output as its identity label.
Identification with Averaged Feature. By the nature of the LSTM, its output can be greatly affected by its last input. Hence, the LSTM output could be unstable across time steps. With a desire to obtain a gait feature that is robust to the final instant of a walking cycle, we choose to use the averaged LSTM output as our gait feature for identification, and the identification loss is rewritten to classify this averaged output instead of the final one.
Incremental Identity Loss. The LSTM is expected to learn that the longer the video sequence, the more walking information it processes, and thus the more confidently it identifies the subject. Instead of minimizing the loss only at the final time step, we propose to use the intermediate outputs of every time step, each weighted by a time-dependent weight.
Different choices of this weight yield similar performance. In the experiments, we ablate the impact of the three options for the classification loss: the final-step loss, the averaged-output loss, and the incremental identity loss. The overall loss function combines the incremental identity loss with weighted cross reconstruction, pose similarity, and canonical consistency losses.
The entire system, including the encoder, decoder, and LSTM, is jointly trained. Updating the encoder to optimize the identity loss also helps it generate pose features that carry identity information and from which the LSTM is able to explore temporal dynamics.
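The incremental identity loss can be sketched in Python as below. The sketch is built on assumptions: the classifier is passed in as a plain function, the loss averages weighted negative log likelihoods of the running mean of the LSTM outputs, and the default weight `w(t) = t` is only an illustrative choice, since the text above does not fix the weight used here. All names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def incremental_identity_loss(lstm_outputs, classify, label, weight=lambda t: t):
    """Weighted identification loss over every time step.

    lstm_outputs: list of per-step LSTM output vectors.
    classify: function mapping a feature vector to class scores.
    weight: time-dependent weight (illustrative default, an assumption).
    """
    T = len(lstm_outputs)
    running_sum = [0.0] * len(lstm_outputs[0])
    total = 0.0
    for t, h in enumerate(lstm_outputs, start=1):
        running_sum = [s + x for s, x in zip(running_sum, h)]
        averaged = [s / t for s in running_sum]  # averaged output up to step t
        total += weight(t) * -log_softmax(classify(averaged))[label]
    return total / T

# Toy classifier: use the feature itself as the two class scores.
identity_classifier = lambda feature: feature

flat_outputs = [[0.0, 0.0], [0.0, 0.0]]
confident_outputs = [[5.0, 0.0], [5.0, 0.0]]

loss_flat = incremental_identity_loss(flat_outputs, identity_classifier, label=0)
loss_confident = incremental_identity_loss(confident_outputs, identity_classifier, label=0)
```

Setting `weight` to zero everywhere except the last step recovers a loss on the final averaged output only, which is one way to compare the classification loss options in an ablation.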
3.4 Model Inference
Since GaitNet takes one video sequence as input and outputs the dynamic and static gait features, as shown in Fig. 7, one single score is needed to measure the similarity between the gallery and probe videos for either gait authentication or identification. During testing, both the dynamic and static gait features are used as identity features for score calculation. We use cosine similarity scores, normalized via min-max. The static and dynamic scores of a gallery-probe pair are finally fused by a weighted sum rule.
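The scoring step can be sketched as follows. The sketch assumes that min-max normalization maps each score list to the range [0, 1], that normalization is done over the gallery for one probe, and that the fusion weight `alpha` is a free parameter; none of these specifics, nor the names `fuse_scores` and `min_max`, are taken from the text.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def min_max(scores):
    """Rescale scores to [0, 1] (assumed target range)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_scores(probe, gallery, alpha=0.5):
    """probe: (dynamic, static) feature pair; gallery: list of such pairs.
    Returns one fused similarity score per gallery entry."""
    dyn = min_max([cosine_similarity(probe[0], g[0]) for g in gallery])
    sta = min_max([cosine_similarity(probe[1], g[1]) for g in gallery])
    return [(1 - alpha) * d + alpha * s for d, s in zip(dyn, sta)]

probe = ([1.0, 0.0], [0.0, 1.0])
gallery = [
    ([1.0, 0.0], [0.0, 1.0]),  # same direction as the probe
    ([0.0, 1.0], [1.0, 0.0]),  # orthogonal to the probe
    ([1.0, 1.0], [1.0, 1.0]),  # in between
]

scores = fuse_scores(probe, gallery)
best_match = scores.index(max(scores))
```

For identification the gallery entry with the highest fused score is returned; for authentication the fused score of a single pair would instead be compared against a threshold.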
Table IV: Network structure of the encoder and decoder (columns: layers, filters/stride, output size).
3.5 Implementation Details
Detection and Segmentation. Our GaitNet receives video frames with the person of interest segmented. The foreground mask is obtained from the SOTA instance segmentation algorithm, Mask R-CNN. Instead of using a zero-one mask obtained by hard thresholding, we keep the soft mask returned by the network, where each pixel indicates the probability of belonging to a person. This is partially due to the difficulty of choosing a threshold that suits multiple databases. It also reduces the loss of information caused by mask estimation errors. We use a bounding box with a fixed width-to-height ratio, with the absolute height and center location given by Mask R-CNN. The input of GaitNet is obtained by pixel-wise multiplication between the mask and the normalized RGB values, followed by resizing to a fixed resolution. This applies to all the experiments on the CASIA-B, USF, and FVG datasets in Sec. 5.
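The difference between soft and hard masking can be illustrated with a small Python sketch. It assumes RGB values already scaled to [0, 1] and a hypothetical threshold of 0.5 for the hard mask; the function names are illustrative, and the mask here is a hand-written toy rather than the output of a segmentation network.

```python
def apply_soft_mask(rgb, mask):
    """Pixel-wise product of an H x W x 3 image and an H x W mask."""
    return [[[channel * m for channel in pixel]
             for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(rgb, mask)]

def hard_threshold(mask, threshold=0.5):
    """Binarize a probability mask (threshold value is an assumption)."""
    return [[1.0 if m >= threshold else 0.0 for m in row] for row in mask]

# A 1 x 2 toy image: one confident person pixel, one uncertain pixel.
rgb = [[[1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]]
soft_mask = [[0.9, 0.4]]

soft_input = apply_soft_mask(rgb, soft_mask)
hard_input = apply_soft_mask(rgb, hard_threshold(soft_mask))
```

The uncertain pixel is attenuated but still present in `soft_input`, whereas hard thresholding removes it entirely, which is the information loss the soft mask is meant to limit.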
Network Structure and Hyperparameter. Our encoder-decoder network is a typical CNN, illustrated in Tab. IV. Different from our preliminary work, we replace the strided convolution layers with convolution layers followed by max pooling layers, since we find that the latter achieves similar results with less hyper-parameter searching across different training scenarios. Each convolution layer is followed by Batch Normalization and Leaky ReLU activation. The decoder is built from transposed convolution, Batch Normalization, and Leaky ReLU layers. The final layer is a Sigmoid activation, which maps the output into the same value range as the input. All transposed convolutions use a stride that upsamples the images, and all Leaky ReLU layers share the same slope. The classification part is a stacked multi-layer LSTM. The lengths of the three disentangled features are listed in Tab. II.
The Adam optimizer is initialized with a fixed learning rate and momentum. To prevent over-fitting, weight decay is applied in all experiments, and the learning rate is decayed by a multiplicative factor at a fixed interval of iterations. For each batch, we use video frames from a number of different clips that depends on the experimental protocol. Since video lengths vary, a random crop of a fixed-length frame sequence is applied during training; all shorter videos are discarded. The loss weights in Eqn. 12 are all set to the same value in all experiments.
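The clip sampling described above (random fixed-length crop, shorter videos discarded) can be sketched as follows. The clip length, the seeded generator, and the name `sample_clips` are illustrative assumptions.

```python
import random

def sample_clips(videos, clip_len, rng):
    """Crop one random contiguous clip of clip_len frames from every
    video that is long enough; videos shorter than clip_len are skipped."""
    clips = []
    for frames in videos:
        if len(frames) < clip_len:
            continue  # discard videos shorter than the crop length
        start = rng.randrange(len(frames) - clip_len + 1)
        clips.append(frames[start:start + clip_len])
    return clips

# Frame indices stand in for actual frames.
videos = [list(range(10)), list(range(3)), list(range(4))]
clips = sample_clips(videos, clip_len=4, rng=random.Random(0))
```

Passing the random generator explicitly keeps the sampling reproducible, which is convenient when comparing training runs.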
4 Front-View Gait (FVG) Database
Collection. To facilitate research on gait recognition from frontal-view angles, we collect the Front-View Gait (FVG) database over the course of two years. During capture, we place the camera (a Logitech webcam or a GoPro Hero) on a tripod at a fixed height. Each subject walks toward the camera multiple times, starting from a fixed distance away, which yields a fixed number of videos per subject. The videos are captured in high definition at a fixed frame rate. These walks combine three angles toward the camera (from the left of, along, and from the right of the optical axis of the camera) and four variations. As detailed in Tab. V, FVG is collected in three sessions with five variations: normal, walking speed (slow and fast), clothing changes, carrying/wearing change (bag or hat), and cluttered background (multiple persons). The five variations are well balanced across the three sessions. Fig. 8 shows exemplar images from FVG.
Table V: The FVG database per session (rows: number of subjects, viewing angle, fast / slow walking, carrying bag / hat), with the probe sets of each protocol listed in its top part.
Protocols. Different from prior gait databases, subjects in FVG walk toward the camera, which makes exploiting gait information very challenging, as the visual difference between consecutive frames is normally much smaller than in side-view walking. We focus our evaluation on variations that are either challenging, e.g., different clothes or carrying a bag while wearing a hat, or not present in prior databases, e.g., multiple persons. To benchmark research on FVG, we define five evaluation protocols, which share two commonalities: the first group of subjects is used for training and the remaining subjects for testing, and the normal frontal-view walking video is always used as the gallery. The protocols differ in their respective probe data, which cover the variations of Walking Speed (WS), Carrying Bag while Wearing a Hat (BGHT), Changing Clothes (CL), Multiple Persons (MP), and all variations (ALL). The top part of Tab. V lists the detailed probe sets of all protocols, specified per session. In all protocols, the performance metric is the True Accept Rate (TAR) at fixed False Alarm Rates (FAR).
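As a reference for the metric, the sketch below computes the TAR at a given FAR from lists of genuine (same-subject) and impostor (different-subject) similarity scores. The thresholding convention (accept scores strictly above the threshold), the toy scores, and the name `tar_at_far` are assumptions for illustration; the FAR operating points are left as a parameter.

```python
def tar_at_far(genuine, impostor, far):
    """True Accept Rate at the strictest threshold whose False Alarm
    Rate on the impostor scores does not exceed `far`."""
    ranked = sorted(impostor, reverse=True)
    # Number of impostor accepts tolerated (small epsilon guards rounding).
    allowed = int(far * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        threshold = float("-inf")
    else:
        threshold = ranked[allowed]
    accepted = sum(1 for s in genuine if s > threshold)
    return accepted / len(genuine)

impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine_scores = [0.99, 0.92, 0.85, 0.5]

tar_loose = tar_at_far(genuine_scores, impostor_scores, far=0.1)
tar_strict = tar_at_far(genuine_scores, impostor_scores, far=0.0)
```

Lowering the tolerated FAR raises the threshold, so the TAR can only stay the same or drop, as the two toy operating points show.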
5 Experimental Results
We evaluate the proposed approach on three gait databases: CASIA-B, USF, and FVG. As mentioned in Sec. 2, CASIA-B and USF are the most widely used gait databases, which allows a comprehensive comparison with prior works. We compare our method with [69, 12, 40, 39] on these two databases, following the respective experimental protocols of the baselines. These are either the most recent and SOTA works or classic gait recognition methods. The OU-ISIR database is not evaluated, and related results are not compared, since our work consumes RGB video input but OU-ISIR only releases silhouettes. Finally, we also conduct experiments to compare our gait recognition with the state-of-the-art face recognition method ArcFace on the CASIA-B and FVG datasets.
5.1 Ablation Study
5.1.1 Feature Visualization Through Synthesis
Although our decoder is only used in training and not in model inference, it enables us to visualize the disentangled features as synthetic images, by feeding either a feature itself or a concatenation of features from different frames to the learned decoder. This synthesis helps us gain more understanding of the feature disentanglement.
Visualization of Features in One Frame. Our decoder requires the concatenation of three vectors for synthesis. Hence, to visualize each individual feature, we concatenate it with two vectors of zeros and then feed the result to the decoder. In Fig. 9, we show the disentanglement visualization of four subjects (two frontal and two side views), each under the NM and CL conditions. First of all, the canonical feature discovers a standard body pose that is consistent across subjects, which is more visible in the side view. Under such a standard body pose, the canonical feature then depicts the unique body shape, which is consistent within a subject but different between subjects. The appearance feature faithfully recovers the color and texture of the clothing, at the standard body pose specified by the canonical feature. The pose feature captures the walking pose of the input frame. Finally, combining all three features closely reconstructs the original input. This shows that our disentanglement not only preserves all information of the input, but also fulfills all the desired properties described in Tab. III.
Visualization of Features in Two Frames. As shown in Fig. 10, each result is generated by pairing the pose-irrelevant feature of the frame in the first column with the pose feature of the frame in the first row. The synthesized images show that the pose-irrelevant feature indeed contributes all the appearance and body information, e.g., clothing and body width, as they are consistent across each row. Meanwhile, the pose feature contributes all the pose information, e.g., the positions of hands and feet, which is shared across each column. Although concatenating vectors from different subjects may create samples outside the input distribution of the decoder, the visual quality of the synthetic images shows that the decoder generalizes to these new samples.
5.1.2 Feature Visualization Through t-SNE
To gain more insight into the frame-level features , , and sequence-level LSTM feature aggregation, we apply t-SNE  to these features to visualize their distribution in a D space. With the learnt models in Sec. 5.1.1, we randomly select two videos under NM and CL conditions for each of subjects.
Fig. 15 (a,b) visualizes the appearance and canonical features. For the appearance feature, the margin between intra-class and inter-class distances is small, which shows that it has limited discriminative power. In contrast, the canonical feature has both compact intra-class variation and separable inter-class differences, which is useful for identity classification. In addition, we visualize the from and its corresponding at each time step in Fig. 15 (c-d). As defined in Eqn. 4, we enforce the averaged of the same subject to be consistent under different conditions. Since Eqn. 4 only minimizes the intra-class distance, it cannot guarantee discrimination among subjects. However, after aggregation by the LSTM network, the inter-class distances between points at longer time durations are substantially enlarged.
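As a numerical complement to the t-SNE plots, the margin between intra-class and inter-class distances can be checked directly. The following is a minimal sketch rather than the procedure used in this work: the function name `mean_intra_inter_distance`, the Euclidean metric, and the toy features are assumptions for illustration.

```python
import math
from itertools import combinations
from typing import Hashable, List, Sequence, Tuple


def mean_intra_inter_distance(
    features: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (mean intra-class distance, mean inter-class distance)."""
    intra: List[float] = []
    inter: List[float] = []
    for (fa, la), (fb, lb) in combinations(zip(features, labels), 2):
        distance = math.dist(fa, fb)  # Euclidean distance between two features
        (intra if la == lb else inter).append(distance)

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Toy features for two subjects, two samples each.
toy_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_labels = ["A", "A", "B", "B"]
intra_mean, inter_mean = mean_intra_inter_distance(toy_features, toy_labels)
```

A feature with good discriminative power would show an inter-class mean that is clearly larger than the intra-class mean.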
5.1.3 Loss Function’s Impact on Performance
Disentanglement with Pose Similarity Loss. With the cross reconstruction loss, the appearance feature and canonical feature can be enforced to represent static information that is shared across the video. However, as discussed, the pose feature could be contaminated by appearance information or even encode the entire video frame. Here we show the benefit of the pose similarity loss to feature disentanglement. Fig. 16 shows the cross visualization of two different models learned with and without the pose similarity loss. Without it, the decoded image shares some appearance and body characteristics, e.g., clothing style and contour, with . Meanwhile, with the pose similarity loss, the appearance better matches and .
[Table VI column headers: Disentanglement Loss | Classification Loss | Classification Feature | Rank- accuracy. Table body not recovered.]
Loss Function’s Impact on Recognition Performance. As there are various options in designing our framework, we ablate their effect on the final recognition performance from three perspectives: the disentanglement loss, the classification loss, and the classification feature. Tab. VI reports the Rank- recognition accuracy of different variants of our framework on CASIA-B under NM vs. CL and lateral view. The model is trained with all videos of the first subjects and tested on the remaining subjects.
We first explore the effects of different disentanglement losses applied to and use only for classification. Using as the classification loss, we train different variants of our framework: a baseline without any disentanglement losses, a model with and our model with both and . The baseline achieves the accuracy of . Adding slightly improves the accuracy to . By combining with , our model significantly improves the accuracy to . Between and , the pose similarity loss plays a more critical role as is mainly designed to constrain the appearance feature, which does not directly benefit identification.
We also compare the effects of different classification losses applied to . Even though the classification loss only affects , we report the performance with both and for a direct comparison with our full model in the last row. With the disentanglement loss of , and , we benchmark different options of the classification loss as presented in Sec. 3.2, as well as the autoencoder loss by Srivastava et al. . The model using the conventional identity loss on the final LSTM output achieves the rank- accuracy of . Using the average output of LSTM as the identity feature, improves the accuracy to . The autoencoder loss  achieves a good performance of . However, it is still far from our proposed incremental identity loss ’s performance at . Fig. 23 further visualizes the over time, for two models learnt with and loss respectively. Clearly, even with less than frames, the model with shows more discriminativeness, which also increases rapidly as time progresses.
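The difference between taking the final LSTM output, the average output, and an identity feature evaluated at every time step can be illustrated with a small sketch. The exact form and weighting of the incremental identity loss are defined in Sec. 3.2 and are not reproduced here; the running averages below are only one plausible reading of "evaluating the feature at each step", and all function names are hypothetical.

```python
from typing import List, Sequence


def final_output(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Identity feature taken from the last time step only."""
    return list(outputs[-1])


def average_output(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Identity feature taken as the mean over all time steps."""
    steps = len(outputs)
    return [sum(column) / steps for column in zip(*outputs)]


def running_averages(outputs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Mean of the outputs up to each time step t (one feature per step)."""
    sums = [0.0] * len(outputs[0])
    result: List[List[float]] = []
    for t, output in enumerate(outputs, start=1):
        sums = [s + v for s, v in zip(sums, output)]
        result.append([s / t for s in sums])
    return result


# Toy per-step outputs of a recurrent network (3 steps, 2 dimensions).
lstm_outputs = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
per_step_features = running_averages(lstm_outputs)
```

A loss applied to every entry of `per_step_features` would supervise short prefixes of the sequence as well as the full sequence, which is the intuition behind recognizing a subject from only a few frames.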
Finally, we compare different features in computing the final classification score. The performance is based on the model with full disentanglement losses and as the classification loss. When is utilized in cosine distance calculation, the rank- accuracy is merely , while and achieve and respectively. The results prove the learnt and are effective for classification while has limited discriminative power. Also, by combining both and features, the recognition performance can be further improved to . We believe that such performance gain is owing to the complementary discriminative information offered by w.r.t. .
5.1.4 Dynamic vs. Static Gait Features
Since and are complementary in classification, it is interesting to understand their relative contributions, especially in the various scenarios of gait recognition. This amounts to exploring a global weight for the GaitNet on various training data, where ranges from to . There are three protocols on CASIA-B and hence three GaitNet models are trained respectively. We calculate the weighted score of all three models on the training data of protocol , since it is the most comprehensive and representative protocol covering all the viewing angles and conditions. The same experiment is conducted on “ALL” protocol of the FVG dataset.
As shown in Fig. 24, GaitNet has the best average performance on CASIA-B when is around , while on FVG is around . According to Eqn. 3.4, has relatively more classification contributions on CASIA-B. One potential reason is that it is more challenging to match dynamic walking poses under large range of viewing angles. In comparison, FVG favors . Since FVG is an all-frontal-walking dataset containing varying distances or resolutions, dynamic gait is relatively easier to learn with the fixed view, while might be sensitive to resolution changes.
Nevertheless, note that in the two extreme cases, where only or is used, there is a relatively small performance gap between them. This means that either feature alone is effective for classification. Considering this observation and the balance between databases, we choose to set , which will be used in all subsequent experiments.
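The role of the global weight can be illustrated with a minimal sketch of score fusion. The scoring equation is not reproduced in this section, so the convex combination below, the choice of which term the weight multiplies, and the names `cosine_similarity` and `fused_score` are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_score(
    static_gallery: Sequence[float],
    static_probe: Sequence[float],
    dynamic_gallery: Sequence[float],
    dynamic_probe: Sequence[float],
    weight: float,
) -> float:
    """Assumed fusion: (1 - weight) * static score + weight * dynamic score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    static_score = cosine_similarity(static_gallery, static_probe)
    dynamic_score = cosine_similarity(dynamic_gallery, dynamic_probe)
    return (1.0 - weight) * static_score + weight * dynamic_score


# Toy example: static features agree, dynamic features are orthogonal.
score = fused_score([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], weight=0.5)
```

Under this assumed form, the two extreme weights reduce to using only the static or only the dynamic feature, which corresponds to the two extreme cases discussed above.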
5.1.5 Gait Recognition Over Time
One interesting question is how many video frames are needed to achieve reliable gait recognition. To answer this question, we compare the performance with different feature scores (, and their fusion) for identification, with different video lengths. As shown in Fig. 27, both dynamic and static features achieve stable performance starting from about frames, after which the gain in performance is relatively small. At FPS, a clip of frames is equivalent to merely seconds of walking. Further, the static gait feature has notably good performance even with a single video frame. This result shows the strength of our GaitNet in processing very short clips. Finally, for most of the frames in this duration, the fusion outperforms both the static and the dynamic gait feature alone.
[Table: gallery NM #-, - (exclude identical viewing angle); probes NM #-, BG #-, and CL #1-2, each with a Mean column; compared methods include the D MT network. Table body not recovered.]
5.2 Evaluation on Benchmark Datasets
Since various experimental protocols have been defined on CASIA-B, for a fair comparison, we strictly follow the respective protocols in the baseline methods. Following , Protocol uses the first subjects for training and remaining for testing, regarding variations of NM (normal), BG (carrying bag) and CL (wearing a coat) with crossing viewing angles of to . Three models are trained for comparison in Tab. VII. For the detailed protocol, please refer to . Here we mainly compare our work to Wu et al. , along with other methods [34, 75]. We denote our preliminary work  as GaitNet-pre and this work as GaitNet. Under multiple viewing angles and across three variations, GaitNet achieves the best performance compared to all SOTA methods and GaitNet-pre since can distill more discriminative information under various viewing angles and conditions.
Recently, Chen et al.  proposed new protocols that unify training and testing, where only a single model is trained for each protocol. Protocol focuses on walking direction variations, where all videos used are in the NM subset. The training set includes videos of the first subjects in all viewing angles. The remaining subjects are used for testing. The gallery is made of four videos at view for each subject. The first two videos from the remaining viewing angles are the probe. The Rank- recognition accuracies are reported in Tab. VIII. GaitNet achieves the best average accuracy of across viewing angles, with significant improvement on extreme views compared to our preliminary work . For example, at viewing angles of , and , the improvement margins are both . This shows that more discriminative gait information, such as canonical body shape information, is learned under different views in , which contributes to the final recognition accuracy.
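Given gallery and probe features, a rank-k identification accuracy can be computed as in the following sketch. This is an illustrative implementation under assumptions: cosine similarity as the matching score, ranking over individual gallery videos, and the hypothetical names `cosine_similarity` and `rank_k_accuracy`.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[float]]  # (subject id, feature vector)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_k_accuracy(gallery: List[Sample], probes: List[Sample], k: int = 1) -> float:
    """Fraction of probes whose subject appears among the k best gallery matches."""
    hits = 0
    for probe_id, probe_feature in probes:
        ranked = sorted(
            gallery,
            key=lambda item: cosine_similarity(probe_feature, item[1]),
            reverse=True,
        )
        top_ids = [subject_id for subject_id, _ in ranked[:k]]
        hits += probe_id in top_ids
    return hits / len(probes)


# Toy gallery with one video per subject and three probe videos.
toy_gallery = [("s1", [1.0, 0.0]), ("s2", [0.0, 1.0])]
toy_probes = [("s1", [0.9, 0.1]), ("s2", [0.2, 0.8]), ("s2", [0.7, 0.3])]
top1 = rank_k_accuracy(toy_gallery, toy_probes, k=1)
```

In this toy example the third probe is closer to the wrong subject, so it counts as a miss at k=1 and as a hit at k=2.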
Protocol focuses on appearance variations. Training sets have videos under BG and CL. There are subjects in total with to viewing angles. Different test sets are made with different combinations of the viewing angles of the gallery and probe as well as the appearance condition (BG or CL). The results are presented in Tab. IX. Our preliminary work has comparable performance to the SOTA method L-CRF  on the BG subset while significantly outperforming it on the CL subset. The proposed GaitNet outperforms it on both subsets. Note that due to the challenge of the CL protocol, there is a significant performance gap between BG and CL for all methods except ours, which is further evidence that our gait feature has strong invariance to all major gait variations.
Across all evaluation protocols, GaitNet consistently outperforms the state of the art. This shows the superiority of GaitNet in learning a robust representation under different variations, which we attribute to our ability to disentangle pose/gait information from appearance variations. Compared with our preliminary work, the canonical feature adds discriminative power that further improves the recognition performance.
The original protocol of USF  and the methods [11, 70, 27, 1] do not define a training set, which makes them inapplicable to our method, as well as , which require data to train the models. Hence, we follow the experiment setting in , which randomly partitions the dataset into non-overlapping training and test sets, each with half of the subjects. We test on Probe A, defined in , where the probe differs from the gallery in viewpoint. We achieve an identification accuracy of , which is better than of our preliminary work GaitNet-pre , the reported of LB network , and of multi-task GAN .
Given that FVG is a newly collected database with no reported performance from prior work, we implement classic or SOTA gait recognition methods [28, 55, 2, 69] as baselines. Furthermore, given the large amount of effort in human pose estimation , aggregating joint locations over time can be a good candidate for gait features. Therefore, we define another baseline, named PE-LSTM, which uses pose estimation results as the input to the same LSTM and classification loss as ours. Using SOTA D pose estimation , we extract joints’ locations, feed them to the -layer-LSTM, and train with our proposed LSTM incremental loss. For each of baselines and our GaitNet, one model is trained with the -subject training set and tested on all protocols.
As shown in Tab. X, our method shows state-of-the-art performance compared with the baselines, including the recent CNN-based methods. Among the protocols, CL is the most challenging variation, as in CASIA-B. Compared with the other methods, GEI-based methods suffer in the frontal view due to the lack of walking information. Again, thanks to the discriminative canonical feature , GaitNet achieves better recognition accuracies than GaitNet-pre. Also, the superior performance of our GaitNet over PE-LSTM demonstrates that our features and do explore more discriminative information than the joints’ locations alone.
5.3 Comparison to Face Recognition
Face recognition aims to identify subjects by extracting discriminative identity features, or representations, from face images. Due to vigorous development in the past few years, face recognition systems are among the most studied and deployed systems in the vision community, and are even superior to humans on some tasks .
However, the challenge is particularly prominent in the video surveillance scenario, where low-resolution and/or non-frontal faces are acquired at a distance. In contrast, gait, as a behavioral biometric, might have more advantages in those scenarios, since the dynamic information can be more robust at lower resolutions and different viewing angles. Especially for GaitNet, and can have complementary contributions under changing distances, resolutions and viewing angles. Therefore, to explore the advantages and disadvantages of gait recognition and face recognition in the surveillance scenario, we compare our GaitNet with the most recent SOTA face recognition method, ArcFace , on the CASIA-B and FVG databases.
Specifically, for face recognition, we first employ the SOTA face detection algorithm RetinaFace  to detect faces and ArcFace to extract features for each frame of the gallery and probe videos. Then the features over all frames of a video are aggregated by average pooling, an effective scheme used in prior video-based face recognition work . We measure the similarity of features by their cosine distance. For consistency with the above gait recognition experiments, both face and gait report TAR at FAR for FVG and Rank- score for CASIA-B. To evaluate the effect of time, we use the entire sequence as gallery and a partial (e.g., %) sequence as probe on points on the time axis ranging from % to %.
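The aggregation and verification metric described above can be sketched as follows. This is an illustrative implementation, not the evaluation code of this work: the names `average_pool` and `tar_at_far`, the strict-threshold convention, and the toy scores are assumptions.

```python
import math
from typing import List, Sequence


def average_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate per-frame features of one video into a single video feature."""
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float], far: float) -> float:
    """True accept rate at the strictest threshold whose false accept rate <= far.

    A pair is accepted when its similarity score is strictly above the threshold.
    """
    ranked_impostor = sorted(impostor, reverse=True)
    # Number of impostor pairs that may be accepted without exceeding the target FAR.
    allowed = math.floor(far * len(ranked_impostor))
    if allowed < len(ranked_impostor):
        threshold = ranked_impostor[allowed]
    else:
        threshold = float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)


# Toy similarity scores for same-subject (genuine) and different-subject pairs.
genuine_scores = [0.95, 0.8, 0.5, 0.3]
impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.9]
tar = tar_at_far(genuine_scores, impostor_scores, far=0.2)  # 0.75
video_feature = average_pool([[1.0, 2.0], [3.0, 4.0]])
```

Because acceptance uses a strict inequality, tied impostor scores at the threshold are rejected, so the realized false accept rate never exceeds the target.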
5.3.1 Gait vs. Face Recognition on CASIA-B
In this experiment, we select the videos of the NM as gallery and both CL and BG are probes. We compare gait and face recognition in three scenarios: frontal-frontal, side-side and side-frontal viewing angles. Fig. 32 shows the Rank- scores over the time duration. As the video begins, GaitNet is significantly superior to face in all scenarios since our can capture discriminative information such as body shape in low-resolution images, as mentioned in Sec. 5.1.5, while faces are of too low resolution to perform meaningful recognition. As time progresses, GaitNet is stable to the resolution change and view variations, with increasing accuracy. In comparison, face recognition always has lower accuracies throughout the entire duration, except the frontal-frontal view face recognition slightly outperforms gait in the last of the duration, which is expected as this is toward the ideal scenario for face recognition to shine. Unfortunately, for side-side or side-frontal views, face recognition continues to struggle even at the end of the duration.
5.3.2 Gait vs. Face Recognition on FVG
We further compare GaitNet with ArcFace on FVG with NM-BGHT and NM-ALL* protocols. Note that the videos of NM-BGHT contain variations in carrying bags and wearing hat. The videos of ALL*, different from ALL in Tab. X, include all the variations in FVG except carrying and wearing hat variations (refer to Tab. V for details). As shown in Fig. 32, on the BGHT protocol, gait outperforms face in the entire duration, since wearing hat dramatically affects face recognition but not gait recognition. For ALL* protocol, face outperforms gait in the last duration because by then low resolution is not an issue and FVG has frontal-view faces.
Figure 33 shows some examples in CASIA-B and FVG that are incorrectly recognized by face recognition. We also show some images (video frames) that our GaitNet fails to recognize in Fig. 34. The low resolution and illumination conditions in these videos are the main reasons for failure. Note that while video-based alignment [43, 59] or super-resolution approaches might help to enhance the image quality, their impact on recognition is beyond the scope of this work.
5.4 Runtime Speed
System efficiency is an essential metric for many vision systems, including gait recognition. We measure the efficiency of each of the gait recognition methods when processing one video of the FVG dataset on the same desktop with a GeForce GTX 1080 Ti GPU. All code is implemented in Python with the PyTorch framework. Parallel batch processing on the GPU is enabled for all inference models, where the batch size is the number of samples in the probe. AlphaPose and Mask R-CNN take a batch size of as input in inference. As shown in Tab. XI, our method is faster than the pose estimation method because 1) an accurate, yet slow, version of AlphaPose  is required for the model-based gait recognition method; 2) only a low-resolution input of pixels is needed for GaitNet. Further, our method has similar efficiency to the recent CNN-based gait recognition methods.
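A simple way to measure this kind of throughput is sketched below. It is a generic timing harness under assumptions, not the measurement code of this work: `frames_per_second` and the stand-in workload are hypothetical, and when timing GPU inference in PyTorch one would additionally call `torch.cuda.synchronize()` before reading the clock, since GPU kernels are launched asynchronously.

```python
import time
from typing import Callable, Sequence


def frames_per_second(
    process_video: Callable[[Sequence], object],
    frames: Sequence,
    repeats: int = 3,
) -> float:
    """Best-of-N throughput (frames per second) of a video-processing callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process_video(frames)
        best = min(best, time.perf_counter() - start)
    return len(frames) / best if best > 0 else float("inf")


def dummy_model(frames: Sequence) -> int:
    """Stand-in workload: sums the pixel values of every frame."""
    return sum(sum(frame) for frame in frames)


# One hypothetical video of 100 frames, each with 1,000 pixel values.
toy_video = [[1] * 1000 for _ in range(100)]
fps = frames_per_second(dummy_model, toy_video)
```

Taking the best of several runs reduces the influence of one-off costs such as caching or warm-up.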
This paper presents an autoencoder-based method termed GaitNet that disentangles appearance and gait feature representations from raw RGB frames, and utilizes a multi-layer LSTM structure to further leverage temporal information to generate a gait representation for each video sequence. We compare our method extensively with the state of the art on CASIA-B, USF, and our collected FVG datasets. The superior results show the generalization and promise of the proposed feature disentanglement approach. We hope that in the future, this disentanglement approach will be a viable option for other vision problems where motion dynamics need to be extracted while being invariant to confounding factors, e.g., expression recognition with invariance to facial appearance, or activity recognition with invariance to clothing.
This work was partially sponsored by the Ford-MSU Alliance program, and the Army Research Office under Grant Number W911NF-18-1-0330. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
Covariate conscious approach for gait recognition based upon Zernike moment invariants. IEEE Transactions on Cognitive and Developmental Systems (TCDS) 10 (2), pp. 397–407. Cited by: §5.2.2.
-  (2017) Improved Gait recognition based on specialized deep convolutional neural networks. Computer Vision and Image Understanding (CVIU) 164, pp. 103–110. Cited by: §1, §5.2.3, TABLE X, TABLE XI.
-  (2012) Marionette mass-spring model for 3D gait biometrics. In International Conference on Biometrics (ICB), Cited by: §1, §1, §1.
-  (2018) Synthesizing Images of Humans in Unseen Poses. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2010) Gait Recognition Using Gait Entropy Image. In International Conference on Imaging for Crime Detection and Prevention (ICDP), Cited by: §1, §1, §1, §2.
-  (2001) Gait Recognition Using Static, Activity-Specific Parameters. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2019) Pedestrian Detection with Autoregressive Network Phases. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
-  (2017) Illuminating Pedestrians via Simultaneous Detection and Segmentation. In International Conference on Computer Vision (ICCV), Cited by: §3.1.
-  (2014) Pose Depth Volume extraction from RGB-D streams for frontal gait recognition. Journal of Visual Communication and Image Representation 25 (1), pp. 53–63. Cited by: §1.
-  (2014) Frontal Gait Recognition From Incomplete Sequences Using RGB-D Camera. IEEE Transactions on Information Forensics and Security 9 (11), pp. 1843–1856. Cited by: §1.
-  (2011) Human identification using temporal information preserving gait template. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34 (11), pp. 2164–2176. Cited by: §5.2.2.
-  (2017) Multi-Gait Recognition Based on Attribute Discovery. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (7), pp. 1697–1710. Cited by: §2, §5.2.1, TABLE VIII, TABLE IX, §5.
-  (2018) FSRNet: end-to-end learning face super-resolution with facial priors. In Computer Vision and Pattern Recognition (CVPR), Cited by: §5.3.2.
-  (2003) Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
-  (2019) Skeleton-based gait recognition via robust Frame-level matching. IEEE Transactions on Information Forensics and Security 14 (10), pp. 2577–2592. Cited by: §1, §1, §2.
-  (2006) Body shape assessment scale: instrument development for analyzing female figures. Clothing and Textiles Research Journal (CTRJ) 24 (2), pp. 80–95. Cited by: §3.1.
-  (2003) Automatic extraction and description of human gait models for recognition purposes. Computer Vision and Image Understanding (CVIU) 90 (1), pp. 1–41. Cited by: §1.
-  (2019) Arcface: Additive angular margin loss for deep face recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §5.3, §5.
-  (2019) RetinaFace: single-stage dense face localisation in the wild. In arXiv preprint arXiv:1905.00641, Cited by: §5.3.
-  (2017) Unsupervised Learning of Disentangled Representations from Video. In Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2018) A Variational U-Net for Conditional Appearance and Shape Generation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2017) RMPE: Regional Multi-Person Pose Estimation. In International Conference on Computer Vision (ICCV), Cited by: §5.2.3, §5.4.
-  (2016) Learning Effective Gait Features Using LSTM. In International Conference on Pattern Recognition (ICPR), Cited by: §1, §1, §2, §5.2.3.
-  (2000) Learning to forget: continual prediction with LSTM. Neural Computation 12 (10), pp. 2451–2471. Cited by: §3.5.
-  (2019) Low quality video face recognition: multi-mode aggregation recurrent network (MARN). In International Conference on Computer Vision Workshops (ICCVW), Cited by: §5.3.
-  (2014) Generative Adversarial Nets. In Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2014) On reducing the effect of covariate factors in gait recognition: a classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 37 (7), pp. 1521–1528. Cited by: §5.2.2.
-  (2005) Individual Recognition Using Gait Energy Image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28 (2), pp. 316–322. Cited by: §1, §1, §1, §2, §5.2.3, TABLE X.
-  (2017) Mask R-CNN. In International Conference on Computer Vision (ICCV), Cited by: §3.1, §3.5.
-  (2018) Multi-Task GANs for View-Specific Feature Learning in Gait Recognition. IEEE Transactions on Information Forensics and Security 14 (1), pp. 102–113. Cited by: §5.2.2.
-  (2014) The TUM Gait from Audio, Image and Depth (GAID) database: Multimodal recognition of subjects and traits. Journal of Visual Communication and Image Representation 25 (1), pp. 195–206. Cited by: §2.
-  (2010) Clothing-invariant gait identification using part-based clothing categorization and adaptive weight control. Pattern Recognition 43 (6), pp. 2281–2291. Cited by: §1.
Enhanced Gabor Feature Based Classification Using a Regularized Locally Tensor Discriminant Model for Multiview Gait Recognition. IEEE Transactions on Circuits and Systems for Video Technology 23 (7), pp. 1274–1286. Cited by: TABLE IX.
-  (2013) View-Invariant Discriminative Projection for Multi-View Gait-Based Human Identification. IEEE Transactions on Information Forensics and Security 8 (12), pp. 2034–2045. Cited by: §5.2.1, TABLE VII, TABLE VIII.
-  (2012) The OU-ISIR Gait Database Comprising the Large Population Dataset and Performance Evaluation of Gait Recognition. IEEE Transactions on Information Forensics and Security 7 (5), pp. 1511–1521. Cited by: TABLE I.
-  (2004) Fusion of gait and face for human identification. In Acoustics, Speech, and Signal Processing (ICASSP), Cited by: §2.
-  (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
-  (2013) Recognizing Gaits Across Views Through Correlated Motion Co-Clustering. IEEE Transactions on Image Processing 23 (2), pp. 696–709. Cited by: TABLE VIII, §5.
Support Vector Regression for Multi-View Gait Recognition based on Local Motion Feature Selection. In Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE VIII, §5.
-  (2014) Recognizing Gaits on Spatio-Temporal Feature Domain. IEEE Transactions on Information Forensics and Security 9 (9), pp. 1416–1423. Cited by: TABLE VIII.
-  (2018) Disentangling Features in 3D Face Shapes for Joint Face Reconstruction and Recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2010) Video-based face model fitting using adaptive active appearance model. Image and Vision Computing 28 (7), pp. 1162–1172. Cited by: §5.3.2.
Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.1.2.
-  (2018) Gait recognition by deformable registration. In Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: §1, §2.
-  (2012) The OU-ISIR Gait Database Comprising the Treadmill Dataset. IPSJ Transactions on Computer Vision and Applications 4, pp. 53–62. Cited by: TABLE I, §2, §5.
-  (2017) Joint Intensity and Spatial Metric Learning for Robust Gait Recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §5.
-  (2005) A floor sensor system for gait recognition. In Workshop on Automatic Identification Advanced Technologies (AutoID), Cited by: §1.
-  (2012) Frontal Gait Recognition Combining 2D and 3D Data. In ACM Workshop on Multimedia and Security, Cited by: §1.
-  (2010) Human Identification Based on Gait. Springer Science & Business Media. Cited by: §1.
-  (2016) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), Cited by: §3.5.
-  (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cited by: §2.
-  (2005) The Human ID Gait Challenge Problem: Data Sets, Performance, and Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27 (2), pp. 162–177. Cited by: §1, TABLE I, §2, §5.2.2, §5.
-  (2001) Integrated face and gait recognition from multiple views. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2016) GEINet: View-Invariant Gait Recognition Using a Convolutional Neural Network. In International Conference on Biometrics (ICB), Cited by: §1, §1, §5.2.3, TABLE X, TABLE XI.
-  (2004) On a Large Sequence-Based Human Gait Database. In Applications and Science in Soft Computing, Cited by: §2.
-  (2011) Gait Energy Volumes and Frontal Gait Recognition using Depth Images. In International Joint Conference on Biometrics (IJCB), Cited by: §1.
-  (2015) Unsupervised Learning of Video Representations using LSTMs. In International Conference on Machine Learning (ICML), Cited by: §5.1.3, TABLE VI.
Towards highly accurate and stable face alignment for high-resolution videos. AAAI Conference on Artificial Intelligence (AAAI), Vol. 33, pp. 8893–8900. Cited by: §5.3.2.
-  (2007) General tensor discriminant analysis and gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29 (10), pp. 1700–1715. Cited by: §1.
-  (2019) Towards High-fidelity Nonlinear 3D Face Morphable Model. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018) Nonlinear 3D Face Morphable Model. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) On learning 3D face morphable model from in-the-wild images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Note: doi: 10.1109/tpami.2019.2927975 Cited by: §2.
-  (2017) Disentangled Representation Learning GAN for Pose-Invariant Face Recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018) Representation Learning by Rotating Your Faces. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Note: doi: 10.1109/tpami.2018.2868350 Cited by: §2.
-  (2018) A survey on gait recognition. ACM Computing Surveys (CSUR) 51 (5),