On Learning Disentangled Representations for Gait Recognition

09/05/2019 ∙ by Ziyuan Zhang, et al. ∙ Michigan State University

Gait, the walking pattern of individuals, is one of the important biometric modalities. Most existing gait recognition methods take silhouettes or articulated body models as gait features. These methods suffer from degraded recognition performance when handling confounding variables such as clothing, carrying, and viewing angle. To remedy this issue, we propose a novel autoencoder framework, GaitNet, to explicitly disentangle appearance, canonical, and pose features from RGB imagery. An LSTM integrates pose features over time as a dynamic gait feature, while canonical features are averaged as a static gait feature. Both are used as classification features. In addition, we collect a Frontal-View Gait (FVG) dataset to focus on gait recognition from frontal-view walking, which is a challenging problem since it contains minimal gait cues compared to other views. FVG also includes other important variations, e.g., walking speed, carrying, and clothing. With extensive experiments on the CASIA-B, USF, and FVG datasets, our method demonstrates superior performance to the state of the art (SOTA) quantitatively, the ability of feature disentanglement qualitatively, and promising computational efficiency. We further compare GaitNet with state-of-the-art face recognition to demonstrate the advantages of gait biometric identification under certain scenarios, e.g., long distance/lower resolutions and cross viewing angles.




1 Introduction

Biometrics measures people’s unique physical and behavioral characteristics to recognize the identity of an individual. Gait [50], the walking pattern of an individual, is one of the biometric modalities, alongside face, fingerprint, iris, etc. Gait recognition has the advantage that it can operate at a distance without users’ cooperation. It is also difficult to camouflage. Due to these advantages, gait recognition is applicable to many applications such as person identification, criminal investigation, and healthcare.

As in other recognition problems, gait data can usually be captured by five types of sensors [66], i.e., RGB camera, RGB-D camera [68, 79], accelerometer [76], floor sensor [48], and continuous-wave radar [67]. Among them, the RGB camera is not only the most popular one due to its low sensor cost, but also the most challenging one, since RGB pixels might not be effective in capturing motion cues. This work studies gait recognition from RGB cameras.

The core of gait recognition lies in extracting gait features from the video frames of a walking person, where the prior work can be categorized into two types: appearance-based and model-based methods. The appearance-based methods, e.g., Gait Energy Image (GEI) [28], take the averaged silhouette image as the gait feature. While having a low computational cost and being able to handle low-resolution imagery, they can be sensitive to variations such as clothing change, carrying, viewing angles and walking speed [53, 5, 69, 32, 2, 55, 72]. The model-based methods use the articulated body skeleton from pose estimation as the gait feature. They show more robustness to the aforementioned variations, but at the price of a higher computational cost and a dependency on pose estimation accuracy [3, 15, 23].



Fig. 3: (a) While conventional gait databases capture side-view imagery, we collect a new gait database (FVG) with focus on more challenging frontal views. We propose a novel CNN-based model, termed GaitNet, to directly learn the disentangled appearance, canonical and pose features from walking videos, as opposed to handcrafted GEI or skeleton features. (b) Given videos of Subject and video of Subject , feature visualizations by our decoder in Fig. 7 show that, the appearance feature is video-specific capturing clothing information; the canonical feature is subject-specific capturing the overall body shape at a standard pose; the pose feature is frame-specific capturing body poses at individual frames.

The challenge in designing a gait feature is the necessity of being invariant to appearance variation due to clothing, viewing angle, carrying, etc. Therefore, our desire is to disentangle the gait feature from the non-gait-related appearance of the walking person. For both appearance-based and model-based methods, such disentanglement is achieved by manually handcrafting the GEI-like [28, 5] or body skeleton-like [3, 23, 15] features, since neither has color or texture information. However, we argue that these manual disentanglements may be sensitive to changes in walking condition. In other words, they can lose certain gait information or create redundant gait information. For example, GEI-like features have distinct silhouettes for the same subject wearing different clothes. For skeleton-like features, when the subject carries accessories (e.g., bags, umbrella), certain body joints such as hands may have fixed positions, and hence are redundant information for gait.

To remedy the aforementioned issues in handcrafted features, as shown in Fig. 3 (a), this paper proposes a novel approach to learn gait representations from the RGB video directly. Specifically, we aim to automatically disentangle dynamic pose features (trajectory of gait) from pose-irrelevant features. To further distill identity information from pose-irrelevant features, we disentangle the pose-irrelevant features into appearance (i.e., clothing) and canonical features. Here, the canonical feature refers to a standard and unique representation of human body, such as body ratio, width and limb lengths, etc. The pose features and canonical features are discriminative in identity and are used for gait recognition. Fig. 3 (b) visualizes the three disentangled features.

This disentanglement is realized by designing an autoencoder-based Convolutional Neural Network (CNN), GaitNet, with novel loss functions. For each video frame, the encoder estimates three latent representations: pose, canonical and appearance features, by employing three loss functions: 1) the cross reconstruction loss enforces that the canonical and appearance features of one frame, fused with the pose feature of another frame, can be decoded to the latter frame; 2) the pose similarity loss forces the sequences of pose features extracted from videos of the same subject to be similar even under different conditions; 3) the canonical consistency loss favors consistent canonical features among videos of the same subject under different conditions. Finally, the pose features of a sequence are fed into a multi-layer LSTM with our designed incremental identity loss to generate the sequence-based dynamic gait feature. The average of the canonical features results in the sequence-based static gait feature. Given two gait videos, the cosine distances between their respective dynamic and static gait features are computed and their summation is the final video-to-video gait similarity metric.

In addition, most prior works [28, 5, 3, 6, 17, 74, 45, 55, 60, 47] choose the side-view walking video, which has the richest gait information, as the gallery sequence. However, in practice, other viewing angles, such as the frontal view, can be very common when pedestrians walk toward or away from the surveillance camera. Also, the prior works [57, 9, 10, 49] that focus on the frontal view are often based on RGB-D videos, which provide depth information in addition to RGB. Therefore, to encourage gait recognition from frontal-view RGB videos, which generally contain the minimal amount of gait information, we collect a high-definition (HD) Frontal-View Gait database, named FVG, with a wide range of variations. It has three near frontal-view angles, where the subject walks from the left, the middle, and the right relative to the optical axis of the camera. For each of the three angles, different variants are explicitly captured, including walking speed, clothing, carrying, multiple people, etc.

A preliminary version of this work was published in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [77]. We extend the work from three aspects. 1) Instead of disentangling features in two components: pose and pose-irrelevant [77], we further decouple the pose-irrelevant features into discriminative canonical feature and appearance feature. By devising an effective canonical consistency loss, the canonical feature helps to improve gait recognition accuracy. 2) We conduct more insightful ablation studies to analyze the relationship between our disentanglement losses and features, gait recognition over time, and contributions of dynamic and static gait features. 3) We perform side-by-side comparison between gait recognition and the state-of-the-art (SOTA) face recognition on the same dataset.

In summary, this paper makes the following contributions:

Our proposed GaitNet directly learns disentangled representations from RGB videos, which is in sharp contrast to the conventional appearance-based or model-based methods.

We introduce a Frontal-View Gait database, including various variations of viewing angles, walking speeds, carrying, clothing changes, background and time gaps. This is the first HD gait database, with nearly twice the number of subjects compared to existing RGB gait databases.

Our proposed method outperforms the state of the art on three benchmarks: the CASIA-B, USF, and FVG datasets.

We demonstrate the strength of gait recognition over face recognition in the task of person recognition from surveillance-quality videos.

2 Related Work

Dataset #Subjects #Videos Environment FPS Resolution Format Variations
CASIA-B [73] Indoor RGB
View, Clothing, Carrying
USF [53] Outdoor RGB
View, Ground Surface, Shoes, Carrying, Time
OU-ISIR-LP [35] - Indoor - Silhouette
OU-ISIR-LP-Bag [46] - Indoor - Silhouette
FVG (ours) Outdoor RGB
View, Walking Speed, Carrying, Clothing, Multiple people, Time
TABLE I: Comparison of existing gait databases and our collected FVG database.

Gait Representation. Most prior works are based on two types of gait representations. In appearance-based methods, gait energy image (GEI) [28] or gait entropy image (GEnI) [5] are defined by extracting silhouette masks. Specifically, GEI uses an averaged silhouette image as the gait representation for a video. These methods are popular in the gait recognition community for their simplicity and effectiveness. However, they often suffer from sizeable intra-subject appearance changes due to covariates such as clothing, carrying, views, and walking speed. On the other hand, model-based methods [15, 23] fit articulated body models to images and extract kinematic features such as D body joints. While they are robust to some covariates such as clothing and speed, they require a relatively higher image resolution for reliable pose estimation and higher computational costs.

In contrast, our approach learns gait representation directly from raw RGB video frames, which contain richer information and thus offer a higher potential for extracting more discriminative gait features. The most relevant work to ours is [12], which learns gait features from RGB images via Conditional Random Field. Compared to [12], our proposed approach learns two complementary features, the dynamic and static gait features, and has the advantage of being able to leverage a large amount of training data and to learn more discriminative representations from data with multiple covariates. In addition, some recent works [69, 74, 75, 45, 72] use CNNs to learn more discriminative features from GEI. However, the source of the learning, GEI, already loses dynamic information, since a random shuffle of video frames results in the identical GEI feature. In contrast, the proposed GaitNet learns features from RGB imagery instead, which allows the network to explore richer information for representation learning. This is demonstrated by our comparison with [12, 69] in Sec. 5.2.1 and Sec. 5.2.3.
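To make the shuffle argument concrete, the following minimal Python sketch computes a GEI as the pixel-wise average of binary silhouettes and compares it with the GEI of the same frames in a random order. The toy 2x3 silhouettes and the helper name `gait_energy_image` are illustrative assumptions and are not taken from any released code.

```python
import random


def gait_energy_image(silhouettes):
    """Average a list of binary silhouette frames (lists of rows) pixel-wise."""
    n = len(silhouettes)
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    return [
        [sum(frame[r][c] for frame in silhouettes) / n for c in range(width)]
        for r in range(height)
    ]


# Toy 2x3 silhouettes standing in for real segmented frames (assumption).
frames = [
    [[0, 1, 0], [1, 1, 0]],
    [[0, 1, 1], [0, 1, 0]],
    [[1, 1, 0], [0, 1, 1]],
    [[0, 1, 0], [1, 1, 1]],
]

# Shuffle a copy of the frame order with a fixed seed.
shuffled = frames[:]
random.Random(0).shuffle(shuffled)

gei_original = gait_energy_image(frames)
gei_shuffled = gait_energy_image(shuffled)

# The average ignores frame order, so both GEIs are identical.
print(gei_original == gei_shuffled)
```

Because the average is order-independent, any temporal cue in the sequence is invisible to a model that only sees the GEI.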

Gait Databases. There are many classic gait databases such as SOTON Large dataset [56], USF [53], CASIA-B [73], OU-ISIR [46], and TUM GAID [31]. We compare our FVG database with the widely used ones in Tab. I. CASIA-B is a large multi-view gait database with three variations: viewing angle, clothing, and carrying. Each subject is captured from views under three conditions: normal walking (NM), walking in coats (CL) and walking while carrying bags (BG). For each view, , and videos are captured in NM, CL and BG conditions, respectively. USF database has subjects with five variations, totaling conditions per subject. It contains two viewing angles (left and right), two ground surfaces (grass and concrete), shoe change, carrying condition and time. While OU-ISIR-LP and OU-ISIR-LP-Bag are large databases, only silhouettes are publicly released in both of them. In contrast, our FVG focuses on the frontal view, with different near frontal-view angles toward the camera, and other variations including walking speed, carrying, clothing, multiple people and time.

Disentanglement Learning. Besides model-based approaches representing data with semantic latent vectors [62, 63, 61, 42], data-driven disentangled representation learning approaches are gaining popularity in the computer vision community. DrNet [20] disentangles content and pose vectors with a two-encoders architecture, which removes content information in the pose vector by generative adversarial training. The work of [4] segments foreground masks of body parts by D pose joints via U-Net [52] and then transforms body parts to desired motion with adversarial training. Similarly, [21] utilizes U-net and Variational Auto Encoder (VAE) [38] to disentangle an image into appearance and shape. DR-GAN [64, 65] achieves SOTA performances on pose-invariant face recognition by explicitly disentangling pose variation with a multi-task GAN [26]. Different from [20, 4, 21], our method has only one encoder to disentangle the three latent features, through the design of novel loss functions without the need for adversarial training. Further, pose labels are used in DR-GAN training so as to disentangle identity feature from the pose. However, to disentangle pose and appearance features from RGB, there is no pose nor appearance label to be utilized for our method, since it is nontrivial to define the types of walking pattern or clothes as discrete classes.

Gait vs. Face recognition. Both gait and face are popular biometrics modalities, especially in covert identification-at-a-distance applications. Hence, it is valuable to understand the pros and cons of each modality if the SOTA gait recognition and face recognition algorithms are deployed. Along this direction, most of the prior works focus on the fusion of both modalities and evaluate on relatively small datasets [54, 36, 78]. In contrast, we conduct comprehensive evaluations using SOTA face and gait recognition algorithms, across various conditions of CASIA-B and FVG databases. Further, the performances are measured along the video duration to explore the impact of person-to-camera distances.

Symbol Dim. Notation
scalar Index of subject
scalar Condition
scalar Time step in a video
scalar Number of frames in a video
matrices Gait video under condition
matrix Frame of video
matrix Reconstructed frame via
- Encoder network
- Decoder network
- Classifier for
- Classifier for
Pose feature
Canonical feature
Appearance feature
Dynamic gait feature
Static gait feature
The output of LSTM at step t
- Reconstruction loss
- Pose similarity loss
- Canonical similarity loss
- Incremental identity loss
TABLE II: Symbols and notations.

3 Proposed Approach

3.1 Overview

Let us start with a simple example. Assume there are three videos, where videos 1 and 2 capture subject A wearing a t-shirt and a long down coat respectively, and in video 3 subject B wears the same long down coat as in video 2. The objective is to design an algorithm for which the gait features of videos 1 and 2 are the same, while those of videos 2 and 3 are different. Clearly, this is a challenging objective, as the long down coat can easily dominate the extracted feature, which would make videos 2 and 3 more similar than videos 1 and 2 in the latent space of gait features. Indeed, the core challenge, as well as the objective, of gait recognition is to extract gait features that are discriminative among subjects but invariant to different confounding factors, such as viewing angles, walking speeds and changing clothes. Table II summarizes the symbols and notation used in this paper.

Our approach to achieve this objective is feature disentanglement. In our preliminary work [77], we disentangle features into two components: pose and “appearance” features. However, further research discovered that the “appearance” feature still contains certain discriminative information, which can be useful for identity classification. For instance, as in Fig. 6, if we ignore the body pose, e.g., the position of arms and legs, and the clothing information, e.g., the color and texture of clothes, we may still tell apart different subjects by their inherent body characteristics, which can include categories of overall body shape (e.g., rectangle, triangle, inverted triangle, and hourglass [16]), arm length, torso vs. leg ratio [14], etc. In other words, even when different people wear exactly the same clothing and stand still, these characteristics are still subject dependent. In the meantime, for the same subject under various conditions, these characteristics are relatively constant. In this work, we term the feature describing these characteristics the canonical feature. Hence, given a walking video under some condition, our framework disentangles the encoded feature into three components: the pose feature, the appearance feature and the canonical feature. We also term the concatenation of the appearance and canonical features the pose-irrelevant feature, which is conceptually equivalent to the “appearance” feature in [77]. The pose feature describes the positions of body parts, and its dynamics over time are essentially the core element of gait; the canonical feature defines the unique characteristics of an individual body; and the appearance feature describes the subject’s clothing.

(a) The same subject
(b) Different subjects
Fig. 6: If we may ignore the differences in color/texture of clothing and the body pose, there are inherent body characteristics that are different across subjects (b), and invariant within the same subject (a). These include overall body shape, arm length, torso vs. leg ratio, etc. We define canonical feature to specifically describe these characteristics.

The above feature disentanglement can be naturally implemented as an encoder-decoder network. Specifically, as depicted in Fig. 7, the input to our GaitNet is a video sequence, with background removed using any off-the-shelf pedestrian detection and segmentation method [29, 8, 7]. With carefully designed loss functions, an encoder is learned to disentangle the pose, canonical and appearance features for each video frame. Then, a multi-layer LSTM explores the temporal dynamics of pose features and aggregates them to a sequence-based dynamic gait feature. In the meantime, the average of all the canonical features is defined as the static gait feature. Measuring distances of both dynamic and static features between the gallery and probe walking videos provides the final matching score. In this section, we first present the feature disentanglement, followed by temporal aggregation, model inference and finally implementation details.

3.2 Feature Disentanglement

Feature | Constant Across Frames | Constant Across Conditions | Discriminative
Appearance | Yes | No | No
Canonical | Yes | Yes | Yes
Pose | No | Yes | Yes, for the pose features over time
TABLE III: The properties of the three disentangled features in terms of their constancy across frames and conditions, and their discriminativeness. These properties are the basis for us to design loss functions for feature disentanglement.

For the majority of gait datasets, there is limited intra-subject appearance variation. Hence, appearance could be a discriminative cue for identification during training as many subjects can be easily distinguished by their clothes. Unfortunately, any feature extractors relying on appearance will not generalize well on the test set or in practice, due to potentially diverse clothing or appearance between two videos of the same subject. This limitation on training sets also prevents us from learning ideal feature extractors if solely relying on identification objective. Hence we propose to learn to disentangle the canonical and pose feature from the visual appearance. Since a video is composed of frames, disentanglement should be conducted at the frame level first.

Before presenting the details of how we conduct disentanglement, let us first understand the various properties of three types of features, as summarized in Tab. III. These properties are crucial in guiding us to define effective loss functions for disentanglement. The appearance feature mainly describes the clothing information of the subject. Hence it is constant within a video sequence, but often different across different conditions. Of course it is not discriminative among individuals. The canonical feature is subject-specific, and is therefore constant across both video frames, and conditions. The pose feature is obviously different across video frames, but is assumed to be constant across conditions. Since the pose feature is the manifestation of video-based gait information at a specific frame, the pose feature itself might not be discriminative. However, the dynamics of pose features over time will constitute the dynamic gait feature, which is discriminative among individuals.

Fig. 7: The overall architecture of proposed GaitNet. The bottom right block indicates the inference process, while the remaining illustrates the training process with the four color-coded loss functions.

To this end, we propose to use an encoder-decoder network architecture with carefully designed loss functions to disentangle the pose feature and canonical feature from the appearance feature. The encoder encodes a feature representation of each frame and explicitly splits it into three components, namely the appearance feature, the canonical feature and the pose feature.

Collectively, these three features are expected to fully describe the original input image, as they can be decoded back to the original input through a decoder.
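The split and the decoder input can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature dimensions, the helper names `split_latent` and `fuse`, and the use of plain concatenation before decoding are hypothetical, since the actual sizes and fusion are defined by the networks in Tab. IV.

```python
def split_latent(latent, dim_appearance, dim_canonical, dim_pose):
    """Split one encoded frame vector into (appearance, canonical, pose)."""
    expected = dim_appearance + dim_canonical + dim_pose
    if len(latent) != expected:
        raise ValueError(f"expected {expected} values, got {len(latent)}")
    a_end = dim_appearance
    c_end = a_end + dim_canonical
    return latent[:a_end], latent[a_end:c_end], latent[c_end:]


def fuse(appearance, canonical, pose):
    """Concatenate the three features into the vector a decoder would consume."""
    return list(appearance) + list(canonical) + list(pose)


# Hypothetical sizes, chosen only for the example.
latent = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
f_a, f_c, f_p = split_latent(latent, dim_appearance=2, dim_canonical=2, dim_pose=3)
restored = fuse(f_a, f_c, f_p)
```

Keeping the three parts as explicit slices of one latent vector is what allows the losses that follow to constrain each part separately while a single encoder produces all of them.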


We now define the various loss functions to jointly learn the encoder and decoder.

Cross Reconstruction Loss. The reconstructed image should be close to the original input. However, enforcing a self-reconstruction loss as in a typical auto-encoder cannot ensure the meaningful disentanglement in our design. Hence, we propose the cross reconstruction loss, which uses the appearance feature and canonical feature of one frame and the pose feature of another frame of the same video to reconstruct the latter frame.

The cross reconstruction loss, on one hand, can act as the self-reconstruction loss to make sure the three features are sufficiently representative to reconstruct a video frame. On the other hand, as we can pair the pose feature of a current frame with the canonical and appearance features of any frame in the same video to reconstruct the same target, it enforces both the canonical and appearance features to be similar across all frames within a video. Indeed, according to Tab. III, the main distinction between the pose-irrelevant feature and the pose feature is that the former is constant across frames while the latter is not. This is the basis for designing our cross reconstruction loss.
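A toy Python sketch of this pairing is given below. It assumes a squared-error reconstruction penalty and uses trivial stand-ins for the networks: `encode` merely slices a 6-value "frame" and `decode` concatenates the parts. Both functions and all sizes are hypothetical placeholders for the learned encoder and decoder.

```python
def encode(frame):
    """Toy stand-in for the encoder: slice a frame into three features."""
    return frame[:2], frame[2:4], frame[4:]  # appearance, canonical, pose


def decode(appearance, canonical, pose):
    """Toy stand-in for the decoder: concatenate the three features."""
    return list(appearance) + list(canonical) + list(pose)


def cross_reconstruction_loss(frame_t1, frame_t2):
    """Rebuild frame t2 from the pose of t2 and the other features of t1."""
    appearance_t1, canonical_t1, _ = encode(frame_t1)
    _, _, pose_t2 = encode(frame_t2)
    reconstruction = decode(appearance_t1, canonical_t1, pose_t2)
    # Squared error against the target frame (assumed penalty).
    return sum((r - x) ** 2 for r, x in zip(reconstruction, frame_t2))


# Two frames of one video: same pose-irrelevant part, different pose part.
frame_a = [0.2, 0.4, 0.5, 0.1, 0.9, 0.3]
frame_b = [0.2, 0.4, 0.5, 0.1, 0.7, 0.8]
loss_consistent = cross_reconstruction_loss(frame_a, frame_b)

# If the pose-irrelevant part drifts between frames, the loss becomes positive.
frame_drift = [0.2, 0.9, 0.5, 0.1, 0.7, 0.8]
loss_drift = cross_reconstruction_loss(frame_a, frame_drift)
```

In this toy setting the loss is zero only when the appearance and canonical parts agree across the two frames, which mirrors how the loss pushes frame-varying information into the pose feature.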

Pose Similarity Loss. The cross reconstruction loss is able to prevent the pose-irrelevant feature from being contaminated by the pose information that changes across frames. Otherwise, i.e., if the appearance or canonical feature contained some pose information, the reconstructed frame and the target frame would have different poses. However, clothing/texture and body information may still be leaked into the pose feature. In the extreme case, the appearance and canonical features could be constant vectors while the pose feature encodes all the information of a video frame.

To encourage the pose feature to include only the pose information, we leverage multiple videos of the same subject. Consider two videos of the same subject in two different conditions, which differ in the person’s appearance, e.g., clothing changes. Despite the appearance changes, the gait information is assumed to be constant between the two videos. Since it is almost impossible to enforce similarity on the pose features between individual video frames, as it requires precise frame-level alignment, we minimize the distance between the two videos’ averaged pose features.


According to Tab. III, the pose feature is constant across conditions, which is the basis of our pose similarity loss.
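The comparison of averaged pose features can be sketched as follows. The squared Euclidean distance and the helper names are assumptions made for illustration; the point is only that the two videos may have different lengths and need no frame-level alignment.

```python
def mean_vector(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def pose_similarity_loss(pose_feats_c1, pose_feats_c2):
    """Squared distance between the two videos' averaged pose features."""
    mean_c1 = mean_vector(pose_feats_c1)
    mean_c2 = mean_vector(pose_feats_c2)
    return sum((a - b) ** 2 for a, b in zip(mean_c1, mean_c2))


# Hypothetical per-frame pose features of one subject in two conditions.
video_c1 = [[1.0, 0.0], [0.0, 1.0]]                # 2 frames
video_c2 = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]    # 3 frames
loss_same_mean = pose_similarity_loss(video_c1, video_c2)

# A video whose averaged pose feature differs yields a positive loss.
loss_different = pose_similarity_loss([[1.0, 1.0]], [[0.0, 0.0]])
```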

Canonical Consistency Loss. The canonical feature describes the subject’s body characteristics, which are constant over all video frames. To be specific, for two videos of the same subject in two different conditions, the canonical feature is constant across both frames and conditions, as illustrated in Tab. III. Tab. III also states that the canonical feature is discriminative across subjects. Hence, to enforce both kinds of constancy as well as the discriminativeness, we define the canonical consistency loss with three terms, which measure the consistency across frames in a single video, the consistency across different videos of the same subject, and identity classification using a classifier, respectively.
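The three terms can be sketched in Python as below. Everything concrete here is an assumption made for illustration: squared distances for the two consistency terms, a linear softmax classifier for the identity term, equal weighting of the terms, and pairing frames of the two videos by index.

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def frame_consistency(feats):
    """Mean squared distance over all ordered pairs of distinct frames (needs >= 2 frames)."""
    n = len(feats)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(sq_dist(feats[i], feats[j]) for i, j in pairs) / len(pairs)


def condition_consistency(feats_c1, feats_c2):
    """Mean squared distance between index-paired frames of two videos."""
    n = min(len(feats_c1), len(feats_c2))
    return sum(sq_dist(feats_c1[i], feats_c2[i]) for i in range(n)) / n


def identity_nll(feats, class_weights, label):
    """Mean negative log likelihood of a linear softmax classifier."""
    total = 0.0
    for feat in feats:
        logits = [sum(w * x for w, x in zip(row, feat)) for row in class_weights]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[label]
    return total / len(feats)


def canonical_consistency_loss(feats_c1, feats_c2, class_weights, label):
    return (frame_consistency(feats_c1)
            + condition_consistency(feats_c1, feats_c2)
            + identity_nll(feats_c1, class_weights, label))


# Hypothetical canonical features: identical across frames and conditions.
stable = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-class linear classifier
loss_stable = canonical_consistency_loss(stable, stable, weights, label=0)

# Features that vary across frames add a positive consistency penalty.
varying = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
loss_varying = canonical_consistency_loss(varying, stable, weights, label=0)
```

With perfectly constant features, only the classification term remains, so the loss can be driven lower only by making the canonical feature more discriminative.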

3.3 Gait Feature Learning and Aggregation

Even when we can disentangle pose, canonical and appearance information for each video frame, the canonical and pose features have to be aggregated over time, since 1) gait recognition is conducted between two videos instead of two images; 2) the canonical features from individual frames are not guaranteed to carry the same canonical information; 3) the pose feature of a single frame only represents the walking pose of the person at a specific instant, which can resemble the pose of a different individual at another instant. Here, we are looking for discriminative characteristics in a person’s walking pattern. Therefore, modeling the aggregation of canonical features and the temporal change of pose features is critical.

3.3.1 Static Gait Feature via Canonical Feature Aggregation

After learning the canonical feature for every single frame as defined in Eqn. 3.2, we explore the best representation of these features across all frames of a video sequence. Since the canonical feature is assumed to be constant over time, we average the canonical features over time to aggregate them. Given that the canonical feature describes the body characteristics as if we freeze the gait, we call the aggregated canonical feature the static gait feature.


3.3.2 Dynamic Gait Feature via Pose Feature Aggregation

For temporal modeling of poses, temporal modeling architectures such as the recurrent neural network or long short-term memory (LSTM) are well suited. Specifically, in this work, we utilize a multi-layer LSTM structure to explore the temporal information of pose features, e.g., how the trajectory of subjects’ body parts changes over time. As shown in Fig. 7, pose features extracted from one video sequence are fed into a multi-layer LSTM. The output of the LSTM is connected to a classifier, in this case a linear classifier, to classify the subject’s identity.

The output of the LSTM at time step t is cumulative, in that it reflects all t pose features fed into the LSTM so far.


Now we define the loss function for the LSTM. A trivial option for identification is to add the classification loss on top of the LSTM output of the final time step. This loss is the negative log likelihood that the classifier correctly identifies the final output as its identity label.

Identification with Averaged Feature. By the nature of LSTM, the output can be greatly affected by its last input. Hence the LSTM output could be unstable across time steps. With a desire to obtain a gait feature that is robust to the final instance of a walking cycle, we choose to use the averaged LSTM output as our gait feature for identification. The identification loss is then computed on this averaged feature instead of the final output.


Incremental Identity Loss. The LSTM is expected to learn that the longer the video sequence, the more walking information it processes, and thus the more confidently it identifies the subject. Instead of minimizing the loss only at the final time step, we propose to use the intermediate outputs of every time step, each weighted by a per-step weight:


where we set and other options such as also yield similar performance. In the experiments, we will ablate the impact of three options in classification loss: , , and . To this end, the overall loss function is:


The entire system, including the encoder, decoder, and LSTM, is jointly trained. Updating the encoder to optimize the identity loss also helps to generate pose features that carry identity information and from which the LSTM is able to explore temporal dynamics.
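The incremental identity loss described above can be sketched in Python as follows. The sketch takes the per-step LSTM outputs as given and assumes a linear softmax classifier, normalization by the number of steps, and a user-supplied `step_weight` function; the concrete weighting used in the experiments is not assumed here, and all names are hypothetical.

```python
import math


def negative_log_likelihood(feature, class_weights, label):
    """NLL of a linear softmax classifier for one feature vector."""
    logits = [sum(w * x for w, x in zip(row, feature)) for row in class_weights]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def running_averages(outputs):
    """Average of the first t outputs, for every time step t."""
    averages, acc = [], [0.0] * len(outputs[0])
    for t, output in enumerate(outputs, start=1):
        acc = [a + x for a, x in zip(acc, output)]
        averages.append([a / t for a in acc])
    return averages


def incremental_identity_loss(outputs, class_weights, label, step_weight):
    """Weighted sum of per-step NLLs on the running-average LSTM output."""
    total = 0.0
    for t, averaged in enumerate(running_averages(outputs), start=1):
        total += step_weight(t) * negative_log_likelihood(averaged, class_weights, label)
    return total / len(outputs)


# Hypothetical LSTM outputs for a 3-frame clip and a toy 2-class classifier.
lstm_outputs = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
classifier = [[1.0, 0.0], [0.0, 1.0]]

# Weights that grow with the time step (one possible choice, an assumption).
loss_growing = incremental_identity_loss(lstm_outputs, classifier, 0, lambda t: t)

# Putting all weight on the last step recovers a loss on the full-clip average.
loss_last_only = incremental_identity_loss(
    lstm_outputs, classifier, 0, lambda t: 3.0 if t == 3 else 0.0)
```

Weighting every intermediate step means that short prefixes of a walk also contribute a training signal, while later steps can be emphasized through the weight function.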

3.4 Model Inference

Since GaitNet takes one video sequence as input and outputs a dynamic and a static gait feature, as shown in Fig. 7, one single score is needed to measure the similarity between the gallery and probe videos for either gait authentication or identification. During testing, both features are used as the identity features for score calculation. We use the cosine similarity scores, normalized via min-max scaling. The static and dynamic scores between the gallery and probe videos are finally fused by a weighted sum rule.
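A minimal Python sketch of this scoring step is shown below. It assumes that min-max normalization is applied over the scores of one probe against all gallery entries and that the two normalized scores are combined with a single mixing weight; the function names, the dictionary layout and the weight value are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fused_scores(probe, gallery, static_weight):
    """Weighted sum of normalized static and dynamic cosine scores."""
    static = min_max([cosine_similarity(probe["static"], g["static"]) for g in gallery])
    dynamic = min_max([cosine_similarity(probe["dynamic"], g["dynamic"]) for g in gallery])
    return [static_weight * s + (1.0 - static_weight) * d
            for s, d in zip(static, dynamic)]


# Hypothetical 2-D gait features for one probe and three gallery videos.
probe = {"static": [1.0, 0.0], "dynamic": [0.0, 1.0]}
gallery = [
    {"static": [1.0, 0.0], "dynamic": [0.0, 1.0]},  # same direction as the probe
    {"static": [0.0, 1.0], "dynamic": [1.0, 0.0]},  # orthogonal to the probe
    {"static": [1.0, 1.0], "dynamic": [1.0, 1.0]},  # in between
]
scores = fused_scores(probe, gallery, static_weight=0.5)
best_match = scores.index(max(scores))
```

For identification the gallery entry with the highest fused score is returned; for authentication the fused score would instead be compared against a threshold.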



Encoder layers: Conv1, MaxPool1, Conv2, MaxPool2, Conv3, (Conv4), MaxPool3, FC
Decoder layers: FC, FCConv1, FCConv2, FCConv3, FCConv4
TABLE IV: The architecture of the encoder and decoder networks. Note the layer in parentheses is removed for experiments with small training sets, i.e., all ablation studies in Sec. 5.1, to prevent overfitting.

3.5 Implementation Details

Detection and Segmentation. Our GaitNet receives video frames with the person of interest segmented. The foreground mask is obtained from the SOTA instance segmentation algorithm, Mask R-CNN [29]. Instead of using a zero-one mask obtained by hard thresholding, we maintain the soft mask returned by the network, where each pixel indicates the probability of being a person. This is partially due to the difficulty of choosing a threshold suitable for multiple databases. It also remedies the loss of information due to mask estimation errors. We use a bounding box with a fixed width-to-height ratio, with the absolute height and center location given by the Mask R-CNN network. The input of GaitNet is obtained by pixel-wise multiplication between the mask and the normalized RGB values, followed by resizing to a fixed resolution. This applies to all the experiments on the CASIA-B, USF and FVG datasets in Sec. 5.
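The soft-mask step can be illustrated with a few lines of Python operating on nested lists. The sketch assumes 8-bit RGB values scaled to [0, 1] by dividing by 255 and omits cropping and resizing; the function name and the tiny example frame are hypothetical.

```python
def apply_soft_mask(rgb_frame, soft_mask):
    """Scale RGB to [0, 1] and weight every pixel by its person probability."""
    masked = []
    for row_pixels, row_mask in zip(rgb_frame, soft_mask):
        masked.append([
            [channel / 255.0 * prob for channel in pixel]
            for pixel, prob in zip(row_pixels, row_mask)
        ])
    return masked


# A 1x2 frame: the first pixel is surely a person, the second surely background.
frame = [[(255, 0, 51), (10, 20, 30)]]
mask = [[1.0, 0.0]]
network_input = apply_soft_mask(frame, mask)
```

Compared with a hard zero-one mask, a pixel with an intermediate probability is attenuated rather than removed, so a segmentation error near the body boundary does not discard the pixel entirely.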

Network Structure and Hyperparameter. Our encoder-decoder network is a typical CNN, illustrated in Tab. IV. Different from our preliminary work [77], we replace stride- convolution layers with stride- convolution layers and max pooling layers, since we find the latter is able to achieve the similar results with less hyper-parameter searching for different training scenarios. Each convolution layer is followed by Batch Normalization and Leaky ReLU activation. The decoder structure, similar to [51], is built from transposed D convolution, Batch Normalization and Leaky ReLU layers. The final layer is a Sigmoid activation which can output the value into range as the input. All the transposed convolutions are with stride of to up sample images and all the Leaky ReLU are with slope of . The classification part is a stacked -layer LSTM [24], which has hidden units in each cell. The length of , and is , and respectively, as shown in Tab. II.

The Adam optimizer [37] is initialized with the learning rate of , and the momentum of . To prevent over-fitting, the weights decay of is applied to all the experiments, and the learning rate decays by multiplying in every iterations. For each batch, we use video frames from or different clips depending on different experiment protocols. Since video lengths are varied, a random crop of -frame sequence is applied during training; all shorter videos are discarded. The , and in Eqn. 12 are all set to in all experiments.
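The clip sampling rule in the paragraph above (random fixed-length crop, shorter videos discarded) can be sketched as follows. The clip length of 10 and the helper name are placeholders chosen for the example, not the values used in the experiments.

```python
import random


def sample_training_clips(videos, clip_length, rng):
    """Randomly crop one contiguous clip per video; skip videos that are too short."""
    clips = []
    for frames in videos:
        if len(frames) < clip_length:
            continue  # shorter videos are discarded
        start = rng.randint(0, len(frames) - clip_length)  # inclusive bounds
        clips.append(frames[start:start + clip_length])
    return clips


# Frame indices stand in for real frames; lengths are hypothetical.
videos = [list(range(30)), list(range(5)), list(range(12))]
clips = sample_training_clips(videos, clip_length=10, rng=random.Random(0))
```

A contiguous crop preserves the temporal order that the LSTM relies on, while the random start position exposes the network to different phases of the walking cycle.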

Fig. 8: Examples of FVG Dataset. (a) Samples of the near frontal middle, left and right walking viewing angles in Session () of the first subject (). - is the same subject in Session . (b) Samples of slow and fast walking speed for another subject in Session . Frames in the second row are normal and in the third row are fast walking. Carrying bag and wearing hat sample is shown below. (c) Samples of changing clothes and with multiple people background from one subject in Session .

4 Front-View Gait (FVG) Database

Collection. To facilitate the research of gait recognition from frontal-view angles, we collect the Front-View Gait (FVG) database over the course of two years ( and ). During the capturing, we place the camera (Logitech C Pro Webcam or GoPro Hero ) on a tripod at the height of meters. We require each of subjects to walk toward the camera times starting from around meters away from the camera, which results in videos per subject. The videos are captured at resolution with FPS and the average length of seconds. The height of body in the video ranges from to pixels, and the height of faces ranges from to pixels. These walks have the combination of three angles toward the camera (, , off the optical axis of the camera), and four variations. As detailed in Tab. V, FVG is collected in three sessions with five variations: normal, walking speed (slow and fast), clothing changes, carrying/wearing change (bag or hat), and clutter background (multiple persons). The five variations are well balanced in three sessions. Fig. 8 shows exemplar images from FVG.

Collection Year
Number of Subjects
Viewing Angle () - - -
Fast / Slow Walking / / /
Carrying Bag / Hat - - - - - -
Change Clothes - - -
Multiple Person - - -
TABLE V: The FVG database. The last rows show the specific variations that are captured by each of videos per subject.

Protocols. Different from prior gait databases, subjects in FVG are walking toward the camera, which creates a great challenge in exploiting gait information as the visual difference in consecutive frames is normally much smaller than in side-view walking. We focus our evaluation on variations that are challenging, e.g., different clothes, carrying a bag while wearing a hat, or are not present in prior databases, e.g., multi-person. To benchmark research on FVG, we define evaluation protocols, among which there are two commonalities: the first and remaining subjects are used for training and testing respectively; the video , the normal frontal-view walking, is always used as the gallery. The protocols differ in their respective probe data, which cover the variations of Walking Speed (WS), Carrying Bag while Wearing a Hat (BGHT), Changing Clothes (CL), Multiple Persons (MP), and all variations (ALL). At the top part of Tab. X, we list the detailed probe sets for all protocols. For instance, for the WS protocol, the probes are video in Session and video in Session . In all protocols, the performance metrics are the True Accept Rate (TAR) at and False Alarm Rate (FAR).
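
For readers unfamiliar with the metric, a TAR at a fixed FAR can be computed from similarity scores roughly as follows. This is a generic sketch, not the authors' evaluation script: the function name is hypothetical, scores are assumed to be similarities (higher means more alike), and the threshold is chosen as the most permissive impostor score that keeps the false alarm rate at or below the target.

```python
from typing import Sequence


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float], far: float) -> float:
    """True Accept Rate at a threshold whose False Alarm Rate does not exceed `far`.

    `genuine` holds scores of same-subject pairs, `impostor` scores of
    different-subject pairs. A pair is accepted when its score is strictly
    above the threshold.
    """
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor score")
    ranked = sorted(impostor, reverse=True)
    # Number of impostor pairs we may falsely accept (small epsilon guards float rounding).
    allowed = int(far * len(ranked) + 1e-9)
    # Accepting only scores strictly above the (allowed + 1)-th highest impostor
    # score accepts at most `allowed` impostor pairs.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(score > threshold for score in genuine) / len(genuine)
```

For example, with ten impostor scores and a target FAR of 0.1, at most one impostor pair may be accepted, so the threshold is set at the second-highest impostor score.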

5 Experimental Results

We evaluate the proposed approach on three gait databases, CASIA-B [73], USF [53] and FVG. As mentioned in Sec. 2, CASIA-B and USF are the most widely used gait databases, which helps us make a comprehensive comparison with prior works. We compare our method with [69, 12, 40, 39] on these two databases, by following the respective experimental protocols of the baselines. These are either the most recent and SOTA work, or classic gait recognition methods. The OU-ISIR database [46] is not evaluated, and related results [47] are not compared, since our work consumes RGB video input whereas OU-ISIR releases only silhouettes. Finally, we also conduct experiments to compare our gait recognition with the state-of-the-art face recognition method ArcFace [18] on the CASIA-B and FVG datasets.

Fig. 9: Synthesis by decoding three features individually, , and , and their concatenation. Left and right parts are two learnt models on frontal and side views of CASIA-B. The top two rows are two frames of the same subject under different conditions (NM vs. CL) and the bottom two are another subject. The reconstructed frames closely match the original input. shows consistent body shape for the same subject while different for different subjects. recovers the appearance of clothes, at the pose specified by . The body pose of matches with the input frame.
Fig. 10: Synthesis by decoding pairs of pose features and pose-irrelevant features, . Left and right parts are examples of frontal and side views of CASIA-B. In either part, each of synthetic images is , where is extracted from images in the first column and is from the top row. The synthetic images resemble the appearance of the first column and the pose of the top row.

5.1 Ablation Study

5.1.1 Feature Visualization Through Synthesis

Although our decoder is used only in training and not in model inference, it enables us to visualize the disentangled features as a synthetic image, by feeding either the feature itself, or their random concatenation, to our learned decoder . This synthesis helps to gain more understanding of the feature disentanglement.

Visualization of Features in One Frame. Our decoder requires the concatenation of three vectors for synthesis. Hence, to visualize each individual feature, we concatenate it with two vectors of zeros and then feed the result to the decoder. In Fig. 9, we show the disentanglement visualization of subjects (two frontal and two side views), each under the NM and CL conditions. First of all, the canonical feature discovers a standard body pose that is consistent across both subjects, which is more visible in the side view. Under such a standard body pose, the canonical feature then depicts the unique body shape, which is consistent within a subject but different between subjects. The appearance feature faithfully recovers the color and texture of clothing, at the standard body pose specified by the canonical feature. The pose feature captures the walking pose of the input frame. Finally, combining all three features can closely reconstruct the original input. This shows that our disentanglement not only preserves all information of the input, but also fulfills all the desired properties described in Tab. III.
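
The zero-padding trick used to visualize a single feature can be sketched as below. The function name, the feature keys and the concatenation order are assumptions made for illustration; the actual order is whatever the trained decoder expects.

```python
from typing import Dict, List, Sequence


def decoder_input(features: Dict[str, List[float]], keep: Sequence[str],
                  order: Sequence[str] = ("appearance", "canonical", "pose")) -> List[float]:
    """Concatenate the three feature vectors, replacing those not in `keep` with zeros.

    The result always has the full length the decoder expects, so a single
    feature can be decoded in isolation.
    """
    vector: List[float] = []
    for name in order:
        feature = features[name]
        vector.extend(feature if name in keep else [0.0] * len(feature))
    return vector
```

Passing all three names in `keep` reproduces the ordinary reconstruction input, while passing one name yields the single-feature visualization described above.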

Visualization of Features in Two Frames. As shown in Fig. 10, each result is generated by pairing the pose-irrelevant feature in the first column, and the pose feature in the first row. The synthesized images show that the pose-irrelevant feature indeed contributes all the appearance and body information, e.g., clothing, body width, as they are consistent across each row. Meanwhile, contributes all the pose information, e.g., positions of hand and feet, which share similarity across columns. Although concatenating vectors from different subjects may create samples outside the input distribution of , the visual quality of the synthetic images shows that generalizes to these new samples.

Fig. 15: The t-SNE visualization of (a) appearance features , (b) canonical features , (c) pose features , and (d) dynamic gait features . We select subjects each with two videos of NM vs. CL conditions. Each point represents a single frame, whose color is for subject ID, shape of ‘dot’ and ‘cross’ is NM and CL respectively, and size is frame index. We see that and are far more discriminative than and .

5.1.2 Feature Visualization Through t-SNE

To gain more insight into the frame-level features , , and sequence-level LSTM feature aggregation, we apply t-SNE [44] to these features to visualize their distribution in a D space. With the learnt models in Sec. 5.1.1, we randomly select two videos under NM and CL conditions for each of subjects.

Fig. 15 (a,b) visualizes the and features. For the appearance feature , the margin between intra-class and inter-class distances is small, which shows that has limited discrimination power. In contrast, the canonical feature has both compact intra-class variations and separable inter-class differences, which is useful for identity classification. In addition, we visualize the from and its corresponding at each time step in Fig. 15 (c-d). As defined in Eqn. 4, we enforce the averaged of the same subject to be consistent under different conditions. Since Eqn. 4 only minimizes the intra-class distance, it cannot guarantee the discrimination among subjects. However, after aggregation by the LSTM network, the inter-class distances between points at later time steps are substantially enlarged.
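
The qualitative notion of compact intra-class and separable inter-class distances can be made concrete with a small numeric check. The sketch below is an illustrative companion to the t-SNE plots, not part of the original analysis: the function name is hypothetical and plain Euclidean distance is assumed.

```python
import math
from itertools import combinations
from typing import Hashable, Sequence, Tuple


def mean_intra_inter_distance(features: Sequence[Sequence[float]],
                              labels: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (mean intra-class distance, mean inter-class distance) over all pairs."""
    intra, inter = [], []
    for (feat_a, label_a), (feat_b, label_b) in combinations(zip(features, labels), 2):
        distance = math.dist(feat_a, feat_b)  # Euclidean distance (Python 3.8+)
        (intra if label_a == label_b else inter).append(distance)
    if not intra or not inter:
        raise ValueError("need at least one same-class and one different-class pair")
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

A discriminative feature should give a mean inter-class distance clearly larger than the mean intra-class distance.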

5.1.3 Loss Function’s Impact on Performance

Disentanglement with Pose Similarity Loss. With the cross reconstruction loss, the appearance feature and canonical feature can be enforced to represent static information that is shared across the video. However, as discussed, could be contaminated by the appearance information or even encode the entire video frame. Here we show the benefit of the pose similarity loss to feature disentanglement. Fig. 16 shows the cross visualization of two different models learned with and without . Without the decoded image shares some appearance and body characteristics, e.g., clothing style, contour, with . Meanwhile, with , appearance better matches with and .

Fig. 16: Synthesis on CASIA-B by decoding pose-irrelevant feature and pose feature from videos under NM vs. CL conditions. Left and right parts are two examples. For each example, is extracted from the first column (CL) and is from the top row (NM). Top row synthetic images are generated from model trained without loss, bottom row is with the loss. To show the difference, details in synthetic images are magnified.
Disentanglement Loss Classification Loss Classification Feature Rank-
 [58] &
TABLE VI: Ablation study on various options of the disentanglement loss, classification loss, and classification features. A GaitNet model is trained on NM and CL conditions of lateral view with the first subjects of CASIA-B and tested on remaining subjects.
Fig. 23: The t-SNE visualization of from subjects, each with videos (NM vs. CL). The symbols are defined the same as Fig. 15. The top and bottom rows are two models learnt with and loss respectively. From left to right, the points are of the first frames, - frames, and - frames. Learning with leads to more discriminative dynamic features for the entire duration.

Loss Function’s Impact on Recognition Performance. As there are various options in designing our framework, we ablate their effect on the final recognition performance from three perspectives: the disentanglement loss, the classification loss, and the classification feature. Tab. VI reports the Rank- recognition accuracy of different variants of our framework on CASIA-B under NM vs. CL and lateral view. The model is trained with all videos of the first subjects and tested on the remaining subjects.

We first explore the effects of different disentanglement losses applied to and use only for classification. Using as the classification loss, we train different variants of our framework: a baseline without any disentanglement losses, a model with and our model with both and . The baseline achieves the accuracy of . Adding slightly improves the accuracy to . By combining with , our model significantly improves the accuracy to . Between and , the pose similarity loss plays a more critical role as is mainly designed to constrain the appearance feature, which does not directly benefit identification.

We also compare the effects of different classification losses applied to . Even though the classification loss only affects , we report the performance with both and for a direct comparison with our full model in the last row. With the disentanglement loss of , and , we benchmark different options of the classification loss as presented in Sec. 3.2, as well as the autoencoder loss by Srivastava et al. [58]. The model using the conventional identity loss on the final LSTM output achieves the rank- accuracy of . Using the average output of the LSTM as the identity feature improves the accuracy to . The autoencoder loss [58] achieves a good performance of . However, it is still far from our proposed incremental identity loss ’s performance at . Fig. 23 further visualizes the over time, for two models learnt with and loss respectively. Clearly, even with less than frames, the model with is more discriminative, and this advantage increases rapidly as time progresses.

Finally, we compare different features in computing the final classification score. The performance is based on the model with full disentanglement losses and as the classification loss. When is utilized in cosine distance calculation, the rank- accuracy is merely , while and achieve and respectively. The results show that the learnt and are effective for classification while has limited discriminative power. Also, by combining both and features, the recognition performance can be further improved to . We believe that such performance gain is due to the complementary discriminative information offered by w.r.t. .
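
Rank-1 identification with a cosine-based score can be sketched as follows. This is a generic illustration rather than the authors' evaluation code: the function names are hypothetical, each gallery subject is assumed to be represented by one feature vector, and cosine similarity (higher is better) is used in place of cosine distance, which ranks candidates identically.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank1_accuracy(gallery: Dict[str, Sequence[float]],
                   probes: List[Tuple[str, Sequence[float]]]) -> float:
    """Fraction of probes whose most similar gallery entry has the correct identity."""
    correct = 0
    for true_id, feature in probes:
        best_id = max(gallery, key=lambda gid: cosine_similarity(gallery[gid], feature))
        correct += best_id == true_id
    return correct / len(probes)
```

The same routine can be run separately on each feature type to reproduce the kind of per-feature comparison discussed above.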

Fig. 24: Recognition by fusing and scores with different weights as defined in Eqn. 3.4. Rank- accuracy and TAR@ FAR are calculated for CASIA-B and FVG, respectively.

5.1.4 Dynamic vs. Static Gait Features

Since and are complementary in classification, it is interesting to understand their relative contributions, especially in the various scenarios of gait recognition. This amounts to exploring a global weight for the GaitNet on various training data, where ranges from to . There are three protocols on CASIA-B and hence three GaitNet models are trained respectively. We calculate the weighted score of all three models on the training data of protocol , since it is the most comprehensive and representative protocol covering all the viewing angles and conditions. The same experiment is conducted on “ALL” protocol of the FVG dataset.

As shown in Fig. 24, GaitNet has the best average performance on CASIA-B when is around , while on FVG is around . According to Eqn. 3.4, has relatively more classification contributions on CASIA-B. One potential reason is that it is more challenging to match dynamic walking poses under a large range of viewing angles. In comparison, FVG favors . Since FVG is an all-frontal-walking dataset containing varying distances or resolutions, dynamic gait is relatively easier to learn with the fixed view, while might be sensitive to resolution changes.

Nevertheless, note that in the two extreme cases, where only or is used, there is a relatively small performance gap between them. This means that either feature is effective in classification. Considering this observation and the balance between databases, we choose to set , which will be used in all subsequent experiments.
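
The weight sweep can be sketched as follows. The exact form of the fusion is given by the equation referenced above and is not restated here, so this sketch assumes a simple convex combination of a static and a dynamic score; which score the weight multiplies, the grid resolution and the function names are all assumptions.

```python
from typing import Callable, Tuple


def fuse(static_score: float, dynamic_score: float, weight: float) -> float:
    """Convex combination of two scores (assumed form; `weight` applies to the dynamic score)."""
    return (1.0 - weight) * static_score + weight * dynamic_score


def sweep_weights(evaluate: Callable[[float], float], num_steps: int = 11) -> Tuple[float, float]:
    """Evaluate accuracy on an even grid of weights in [0, 1] and return (best_weight, best_accuracy).

    `evaluate` is a user-supplied function mapping a weight to a recognition accuracy.
    """
    weights = [i / (num_steps - 1) for i in range(num_steps)]
    return max(((w, evaluate(w)) for w in weights), key=lambda pair: pair[1])
```

The two endpoints of the grid correspond to using only one of the two features, which is the extreme-case comparison made in the text.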

5.1.5 Gait Recognition Over Time

One interesting question is how many video frames are needed to achieve reliable gait recognition. To answer this question, we compare the performance with different feature scores (, and their fusion) for identification, with different video lengths. As shown in Fig. 27, both dynamic and static features achieve stable performance starting from about frames, after which the gain in performance is relatively small. At FPS, a clip of frames is equivalent to merely seconds of walking. Further, the static gait feature has notably good performance even with a single video frame. This impressive result shows the strength of our GaitNet in processing very short clips. Finally, for most of the frames in this duration, the fusion outperforms both the static and dynamic gait feature alone.

Fig. 27: Recognition performance at different video lengths. We use different feature scores (, , and their fusion) on NM-CL,BG conditions of CASIA-B. is on frontal-frontal view and is on side-side views.
Gallery NM #- - (exclude identical viewing angle)
Probe NM #- Mean
ViDP [34] - - - - - - - - -
LB [69]
D MT network  [69]
J-CNN [75]
GaitNet-pre [77] 92.0 90.5 86.9 93.5 90.9
Probe BG #- Mean
LB-subGEI [69]
J-CNN [75]
GaitNet-pre [77] 87.8 88.3 82.6 89.5 86.1
Probe CL #1-2 Mean
LB-subGEI [69]
J-CNN [75] 58.4 64.4 55.5 54.7 53.3 51.3 39.9 54.01
GaitNet-pre [77]
TABLE VII: Comparison on CASIA-B with cross view and conditions. Three models are trained for NM-NM, NM-BG, NM-CL. Average accuracies are calculated excluding probe viewing angles.
Methods Average
CPM [12]
GEI-SVR [40]
CMCC [41]
ViDP [34]
STIP+NN [39] - - - - - - - - -
LB [69]
L-CRF [12]
GaitNet-pre [77] 74
TABLE VIII: Recognition accuracy cross views under NM on CASIA-B dataset. One single GaitNet model is trained for all the viewing angles.

5.2 Evaluation on Benchmark Datasets

5.2.1 CASIA-B

Since various experimental protocols have been defined on CASIA-B, for a fair comparison, we strictly follow the respective protocols in the baseline methods. Following [69], Protocol uses the first subjects for training and remaining for testing, regarding variations of NM (normal), BG (carrying bag) and CL (wearing a coat) with crossing viewing angles of to . Three models are trained for comparison in Tab. VII. For the detailed protocol, please refer to [69]. Here we mainly compare our work to Wu et al. [69], along with other methods [34, 75]. We denote our preliminary work [77] as GaitNet-pre and this work as GaitNet. Under multiple viewing angles and across three variations, GaitNet achieves the best performance compared to all SOTA methods and GaitNet-pre since can distill more discriminative information under various viewing angles and conditions.

Recently, Chen et al. [12] propose new protocols to unify the training and testing where only one single model is trained for each protocol. Protocol focuses on walking direction variations, where all videos used are in the NM subset. The training set includes videos of the first subjects in all viewing angles. The remaining subjects are for testing. The gallery is made of four videos at view for each subject. The first two videos from the remaining viewing angles are the probe. The Rank- recognition accuracies are reported in Tab. VIII. GaitNet achieves the best average accuracy of across viewing angles, with significant improvement on extreme views compared to our preliminary work [77]. For example, at viewing angles of , and , the improvement margins are both . This shows that more discriminative gait information, such as canonical body shape information, under different views is learned in , which contributes to the final recognition accuracy.

Probe Gallery
Subset BG
Subset CL
Mean -
TABLE IX: Comparison with [12] and [69] under different walking conditions on CASIA-B by accuracies. One single GaitNet model is trained with all gallery and probe views and the two conditions.

Protocol focuses on appearance variations. Training sets have videos under BG and CL. There are subjects in total with to viewing angles. Different test sets are made with different combinations of viewing angles of the gallery and probe as well as the appearance condition (BG or CL). The results are presented in Tab. IX. Our preliminary work performs comparably to the SOTA method L-CRF [12] on the BG subset while significantly outperforming it on the CL subset. The proposed GaitNet outperforms on both subsets. Note that due to the challenge of the CL protocol, there is a significant performance gap between BG and CL for all methods except ours, which is further evidence that our gait feature has strong invariance to all major gait variations.

Across all evaluation protocols, GaitNet consistently outperforms the state of the art. This shows the superiority of GaitNet in learning a robust representation under different variations. It is attributed to our ability to disentangle pose/gait information from appearance variations. Compared with our preliminary work, the canonical feature contains discriminative power which can further improve the recognition performance.

Index of Gallery & Probe videos
Session - - - - - - ,-
Session - - - - - ,-
Session - - - - - - - - -
GEI [28]
GEINet [55]
DCNN [2]
LB [69]
GaitNet-pre [77]
TABLE X: Definition of FVG protocols and performance comparison. Under each of the protocols, the first/second columns indicate the indexes of videos used in gallery/probe.

5.2.2 USF

The original protocol of USF [53] and the methods [11, 70, 27, 1] do not define a training set, which is not applicable to our method, as well as [69], that require data to train the models. Hence we follow the experimental setting in [69], which randomly partitions the dataset into non-overlapping training and test sets, each with half of the subjects. We test on Probe A, defined in [69], where the probe is different from the gallery by the viewpoint. We achieve the identification accuracy of , which is better than of our preliminary work GaitNet-pre [77], the reported of LB network [69], and of multi-task GAN [30].
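
A subject-disjoint random split of this kind can be sketched in a few lines. The function name and the explicit seeding are assumptions for illustration; the actual partition used in [69] is not reproduced here.

```python
import random
from typing import Hashable, Iterable, Set, Tuple


def split_subjects(subject_ids: Iterable[Hashable],
                   rng: random.Random) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Randomly partition subjects into non-overlapping training and test halves."""
    ids = sorted(set(subject_ids))  # sort first so the split depends only on the seed
    rng.shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])
```

Splitting by subject rather than by video ensures that no identity seen during training appears in the gallery or probe sets.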

5.2.3 FVG

Given that FVG is a newly collected database with no reported performance from prior work, we implement classic or SOTA gait recognition methods [28, 55, 2, 69]. Furthermore, given the large amount of effort in human pose estimation [23], aggregating joint locations over time can be a good candidate for gait features. Therefore we define another baseline, named PE-LSTM, using pose estimation results as the input to the same LSTM and classification loss as ours. Using SOTA D pose estimation [22], we extract joints’ locations, feed them to the -layer LSTM, and train with our proposed LSTM incremental loss. For each of baselines and our GaitNet, one model is trained with the -subject training set and tested on all protocols.

As shown in Tab. X, our method shows state-of-the-art performance compared with baselines, including the recent CNN-based methods. Among protocols, CL is the most challenging variation, as in CASIA-B. Among the compared methods, GEI-based methods suffer in the frontal view due to the lack of walking information. Again, thanks to the discriminative canonical feature , GaitNet achieves better recognition accuracies than GaitNet-pre. Also, the superior performance of our GaitNet over PE-LSTM demonstrates that our features and do explore more discriminative information than the joints’ locations alone.

5.3 Comparison to Face Recognition

Face recognition aims to identify subjects by extracting discriminative identity features, or representation, from face images. Due to the vigorous development in the past few years, face recognition systems are among the most studied and deployed systems in the vision community, even superior to humans on some tasks [71].

However, the challenge is particularly prominent in the video surveillance scenario, where low-resolution and/or non-frontal faces are acquired at a distance. Gait, as a behavioral biometric, might have more advantages than face in those scenarios, since the dynamic information can be more resistant even at a lower resolution and different viewing angles. Especially for GaitNet, and can have complementary contributions in changing distances, resolutions and viewing angles. Therefore, to explore the advantages and disadvantages of gait recognition and face recognition in the surveillance scenario, we compare our GaitNet with the most recent SOTA face recognition method, ArcFace [18], on the CASIA-B and FVG databases.

Specifically, for face recognition, we first employ the SOTA face detection algorithm RetinaFace [19] to detect faces and ArcFace to extract features for each frame of gallery and probe videos. Then the features over all frames of a video are aggregated by average pooling, an effective scheme used in prior video-based face recognition work [25]. We measure the similarity of features by their cosine distance. To be consistent with the gait recognition experiments above, both face and gait report TAR at FAR for FVG and Rank- score for CASIA-B. To evaluate effects of time, we use the entire sequence as gallery and partial (e.g., %) sequence as probe at points on the time axis ranging from % to %.
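
The average pooling of per-frame features and the construction of partial-sequence probes can be sketched as follows. The function names are hypothetical, and taking the partial sequence as a prefix of the video (its first frames) is an assumption consistent with the time-axis evaluation described above.

```python
from typing import List, Sequence


def average_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one video-level feature."""
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


def partial_probe(frame_features: Sequence[Sequence[float]], fraction: float) -> list:
    """Return the first `fraction` of the frames (at least one frame)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(frame_features) * fraction + 0.5))  # round to nearest frame count
    return list(frame_features[:keep])
```

A probe feature at a given point on the time axis is then `average_pool(partial_probe(features, fraction))`, compared against the gallery feature pooled over the entire sequence.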

Fig. 32: Comparison of gait and face recognition on CASIA-B and FVG. Classification accuracy scores along with video duration percentage are calculated. (a) In CASIA-B, both gait and face recognition are performed in three scenarios: frontal-frontal ( vs. ), side-side ( vs. ) and frontal-side ( vs. ). (b) In FVG, both recognitions use NM vs. BGHT and NM vs. ALL* protocols. Detected face examples are shown on the top of each frontal and side view plots under various video duration percentage.

5.3.1 Gait vs. Face Recognition on CASIA-B

In this experiment, we select the NM videos as the gallery and both CL and BG videos as probes. We compare gait and face recognition in three scenarios: frontal-frontal, side-side and side-frontal viewing angles. Fig. 32 shows the Rank- scores over the time duration. As the video begins, GaitNet is significantly superior to face in all scenarios since our can capture discriminative information such as body shape in low-resolution images, as mentioned in Sec. 5.1.5, while faces are of too low resolution to perform meaningful recognition. As time progresses, GaitNet is stable to the resolution change and view variations, with increasing accuracy. In comparison, face recognition has lower accuracy throughout the entire duration, except that frontal-frontal face recognition slightly outperforms gait in the last of the duration, which is expected as this approaches the ideal scenario for face recognition. Unfortunately, for side-side or side-frontal views, face recognition continues to struggle even at the end of the duration.

5.3.2 Gait vs. Face Recognition on FVG

We further compare GaitNet with ArcFace on FVG with NM-BGHT and NM-ALL* protocols. Note that the videos of NM-BGHT contain variations in carrying bags and wearing a hat. The videos of ALL*, different from ALL in Tab. X, include all the variations in FVG except carrying and wearing a hat (refer to Tab. V for details). As shown in Fig. 32, on the BGHT protocol, gait outperforms face in the entire duration, since wearing a hat dramatically affects face recognition but not gait recognition. For ALL* protocol, face outperforms gait in the last duration because by then low resolution is not an issue and FVG has frontal-view faces.

Figure 33 shows some examples in CASIA-B and FVG that are incorrectly recognized by face recognition. We also show, in Fig. 34, some images (video frames) that our GaitNet fails to recognize. The low resolution and illumination conditions in these videos are the main reasons for failure. Note that while video-based alignment [43, 59] or super-resolution approaches [13] might help to enhance the image quality, their impact on recognition is beyond the scope of this work.

Fig. 33: Examples in CASIA-B and FVG where the SOTA face recognizer ArcFace fails. The first row is the image of probe set; the second row is the recognized wrong person in gallery; and the third row shows the genuine gallery. The first three columns are three scenarios of CASIA-B and the last two columns are two protocols of FVG.
Fig. 34: Failure cases of GaitNet on CASIA-B and FVG due to blur and illumination conditions. The rows and columns are defined the same as Fig. 33.

5.4 Runtime Speed

System efficiency is an essential metric for many vision systems, including gait recognition. We measure the efficiency of each of the gait recognition methods while processing one video of the FVG dataset on the same desktop with a GeForce GTX 1080 Ti GPU. All code is implemented in Python with the PyTorch framework. Parallel batch processing is enabled on the GPU for all the inference models, where the batch size is the number of samples in the probe. AlphaPose and Mask R-CNN take a batch size of as input in inference. As shown in Tab. XI, our method is faster than the pose estimation method because 1) an accurate, yet slow, version of AlphaPose [22] is required for the model-based gait recognition method; 2) only a low-resolution input of pixels is needed for GaitNet. Further, our method has similar efficiency to the recent CNN-based gait recognition methods.
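
A per-frame runtime of this kind can be measured with a small timing harness such as the one below. This is a generic sketch, not the authors' benchmarking code: the function name, the number of repeats and the use of the best run are assumptions.

```python
import time
from typing import Callable, Sequence


def ms_per_frame(process_video: Callable[[Sequence], object], frames: Sequence,
                 repeats: int = 3) -> float:
    """Return the best-of-`repeats` wall-clock time in milliseconds per frame.

    `process_video` is any callable that runs a method on one video; it must
    finish all of its work before returning.
    """
    if not frames:
        raise ValueError("need at least one frame")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process_video(frames)
        best = min(best, time.perf_counter() - start)
    return 1000.0 * best / len(frames)
```

When the workload runs on a GPU, kernels are typically launched asynchronously, so the callable should synchronize with the device (for instance with `torch.cuda.synchronize()` in PyTorch) before it returns; otherwise the measured time covers only the kernel launches.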

Methods Pre-processing Inference Total
GEINet [55]
DCNN [2]
LB [69]
GaitNet (ours)
TABLE XI: Runtime (ms per frame) comparison on FVG dataset.

6 Conclusion

This paper presents an autoencoder-based method termed GaitNet that can disentangle appearance and gait feature representation from raw RGB frames, and utilizes a multi-layer LSTM structure to further leverage temporal information to generate a gait representation for each video sequence. We compare our method extensively with the state of the art on CASIA-B, USF, and our collected FVG datasets. The superior results show the generalization and promise of the proposed feature disentanglement approach. We hope that in the future, this disentanglement approach will be a viable option for other vision problems where motion dynamics need to be extracted while being invariant to confounding factors, e.g., expression recognition with invariance to facial appearance, activity recognition with invariance to clothing.


This work was partially sponsored by the Ford-MSU Alliance program, and the Army Research Office under Grant Number W911NF-18-1-0330. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.


  • [1] H. Aggarwal and D. K. Vishwakarma (2017) Covariate conscious approach for gait recognition based upon Zernike moment invariants. IEEE Transactions on Cognitive and Developmental Systems (TCDS) 10 (2), pp. 397–407. Cited by: §5.2.2.
  • [2] M. Alotaibi and A. Mahmood (2017) Improved Gait recognition based on specialized deep convolutional neural networks. Computer Vision and Image Understanding (CVIU) 164, pp. 103–110. Cited by: §1, §5.2.3, TABLE X, TABLE XI.
  • [3] G. Ariyanto and M. S. Nixon (2012) Marionette mass-spring model for 3D gait biometrics. In International Conference on Biometrics (ICB), Cited by: §1, §1, §1.
  • [4] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag (2018) Synthesizing Images of Humans in Unseen Poses. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [5] K. Bashir, T. Xiang, and S. Gong (2010) Gait Recognition Using Gait Entropy Image. In International Conference on Imaging for Crime Detection and Prevention (ICDP), Cited by: §1, §1, §1, §2.
  • [6] A. F. Bobick and A. Y. Johnson (2001) Gait Recognition Using Static, Activity-Specific Parameters. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [7] G. Brazil and X. Liu (2019) Pedestrian Detection with Autoregressive Network Phases. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • [8] G. Brazil, X. Yin, and X. Liu (2017) Illuminating Pedestrians via Simultaneous Detection and Segmentation. In International Conference on Computer Vision (ICCV), Cited by: §3.1.
  • [9] P. Chattopadhyay, A. Roy, S. Sural, and J. Mukhopadhyay (2014) Pose Depth Volume extraction from RGB-D streams for frontal gait recognition. Journal of Visual Communication and Image Representation 25 (1), pp. 53–63. Cited by: §1.
  • [10] P. Chattopadhyay, S. Sural, and J. Mukherjee (2014) Frontal Gait Recognition From Incomplete Sequences Using RGB-D Camera. IEEE Transactions on Information Forensics and Security 9 (11), pp. 1843–1856. Cited by: §1.
  • [11] W. Chen, J. Zhang, L. Wang, J. Pu, and X. Yuan (2011) Human identification using temporal information preserving gait template. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34 (11), pp. 2164–2176. Cited by: §5.2.2.
  • [12] X. Chen, J. Weng, W. Lu, and J. Xu (2017) Multi-Gait Recognition Based on Attribute Discovery. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (7), pp. 1697–1710. Cited by: §2, §5.2.1, TABLE VIII, TABLE IX, §5.
  • [13] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang (2018) FSRNet: end-to-end learning face super-resolution with facial priors. In Computer Vision and Pattern Recognition (CVPR), Cited by: §5.3.2.
  • [14] K. Cheung, S. Baker, and T. Kanade (2003) Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • [15] S. Choi, J. Kim, W. Kim, and C. Kim (2019) Skeleton-based gait recognition via robust Frame-level matching. IEEE Transactions on Information Forensics and Security 14 (10), pp. 2577–2592. Cited by: §1, §1, §2.
  • [16] L. J. Connell, P. V. Ulrich, E. L. Brannon, M. Alexander, and A. B. Presley (2006) Body shape assessment scale: instrument development for analyzing female figures. Clothing and Textiles Research Journal (CTRJ) 24 (2), pp. 80–95. Cited by: §3.1.
  • [17] D. Cunado, M. S. Nixon, and J. N. Carter (2003) Automatic extraction and description of human gait models for recognition purposes. Computer Vision and Image Understanding (CVIU) 90 (1), pp. 1–41. Cited by: §1.
  • [18] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: Additive angular margin loss for deep face recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §5.3, §5.
  • [19] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019) RetinaFace: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641. Cited by: §5.3.
  • [20] E. Denton and V. Birodkar (2017) Unsupervised Learning of Disentangled Representations from Video. In Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [21] P. Esser, E. Sutter, and B. Ommer (2018) A Variational U-Net for Conditional Appearance and Shape Generation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [22] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) RMPE: Regional Multi-Person Pose Estimation. In International Conference on Computer Vision (ICCV), Cited by: §5.2.3, §5.4.
  • [23] Y. Feng, Y. Li, and J. Luo (2016) Learning Effective Gait Features Using LSTM. In International Conference on Pattern Recognition (ICPR), Cited by: §1, §1, §2, §5.2.3.
  • [24] F. A. Gers, J. Schmidhuber, and F. Cummins (2000) Learning to forget: continual prediction with LSTM. Neural Computation 12 (10), pp. 2451–2471. Cited by: §3.5.
  • [25] S. Gong, Y. Shi, and A. K. Jain (2019) Low quality video face recognition: multi-mode aggregation recurrent network (MARN). In International Conference on Computer Vision Workshops (ICCVW), Cited by: §5.3.
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Nets. In Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [27] Y. Guan, C. Li, and F. Roli (2014) On reducing the effect of covariate factors in gait recognition: a classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 37 (7), pp. 1521–1528. Cited by: §5.2.2.
  • [28] J. Han and B. Bhanu (2005) Individual Recognition Using Gait Energy Image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28 (2), pp. 316–322. Cited by: §1, §1, §1, §2, §5.2.3, TABLE X.
  • [29] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In International Conference on Computer Vision (ICCV), Cited by: §3.1, §3.5.
  • [30] Y. He, J. Zhang, H. Shan, and L. Wang (2018) Multi-Task GANs for View-Specific Feature Learning in Gait Recognition. IEEE Transactions on Information Forensics and Security 14 (1), pp. 102–113. Cited by: §5.2.2.
  • [31] M. Hofmann, J. Geiger, S. Bachmann, B. Schuller, and G. Rigoll (2014) The TUM Gait from Audio, Image and Depth (GAID) database: Multimodal recognition of subjects and traits. Journal of Visual Communication and Image Representation 25 (1), pp. 195–206. Cited by: §2.
  • [32] M. A. Hossain, Y. Makihara, J. Wang, and Y. Yagi (2010) Clothing-invariant gait identification using part-based clothing categorization and adaptive weight control. Pattern Recognition 43 (6), pp. 2281–2291. Cited by: §1.
  • [33] H. Hu (2013) Enhanced Gabor Feature Based Classification Using a Regularized Locally Tensor Discriminant Model for Multiview Gait Recognition. IEEE Transactions on Circuits and Systems for Video Technology 23 (7), pp. 1274–1286. Cited by: TABLE IX.
  • [34] M. Hu, Y. Wang, Z. Zhang, J. J. Little, and D. Huang (2013) View-Invariant Discriminative Projection for Multi-View Gait-Based Human Identification. IEEE Transactions on Information Forensics and Security 8 (12), pp. 2034–2045. Cited by: §5.2.1, TABLE VII, TABLE VIII.
  • [35] H. Iwama, M. Okumura, Y. Makihara, and Y. Yagi (2012) The OU-ISIR Gait Database Comprising the Large Population Dataset and Performance Evaluation of Gait Recognition. IEEE Transactions on Information Forensics and Security 7 (5), pp. 1511–1521. Cited by: TABLE I.
  • [36] A. Kale, A. K. RoyChowdhury, and R. Chellappa (2004) Fusion of gait and face for human identification. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Cited by: §2.
  • [37] D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
  • [38] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [39] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang (2013) Recognizing Gaits Across Views Through Correlated Motion Co-Clustering. IEEE Transactions on Image Processing 23 (2), pp. 696–709. Cited by: TABLE VIII, §5.
  • [40] W. Kusakunniran, Q. Wu, J. Zhang, and H. Li (2010) Support Vector Regression for Multi-View Gait Recognition based on Local Motion Feature Selection. In Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE VIII, §5.
  • [41] W. Kusakunniran (2014) Recognizing Gaits on Spatio-Temporal Feature Domain. IEEE Transactions on Information Forensics and Security 9 (9), pp. 1416–1423. Cited by: TABLE VIII.
  • [42] F. Liu, D. Zeng, Q. Zhao, and X. Liu (2018) Disentangling Features in 3D Face Shapes for Joint Face Reconstruction and Recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [43] X. Liu (2010) Video-based face model fitting using adaptive active appearance model. Image and Vision Computing 28 (7), pp. 1162–1172. Cited by: §5.3.2.
  • [44] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.1.2.
  • [45] Y. Makihara, D. Adachi, C. Xu, and Y. Yagi (2018) Gait recognition by deformable registration. In Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: §1, §2.
  • [46] Y. Makihara, H. Mannami, A. Tsuji, M. A. Hossain, K. Sugiura, A. Mori, and Y. Yagi (2012) The OU-ISIR Gait Database Comprising the Treadmill Dataset. IPSJ Transactions on Computer Vision and Applications 4, pp. 53–62. Cited by: TABLE I, §2, §5.
  • [47] Y. Makihara, A. Suzuki, D. Muramatsu, X. Li, and Y. Yagi (2017) Joint Intensity and Spatial Metric Learning for Robust Gait Recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §5.
  • [48] L. Middleton, A. A. Buss, A. Bazin, and M. S. Nixon (2005) A floor sensor system for gait recognition. In Workshop on Automatic Identification Advanced Technologies (AutoID), Cited by: §1.
  • [49] A. M. Nambiar, P. L. Correia, and L. D. Soares (2012) Frontal Gait Recognition Combining 2D and 3D Data. In ACM Workshop on Multimedia and Security, Cited by: §1.
  • [50] M. S. Nixon, T. Tan, and R. Chellappa (2010) Human Identification Based on Gait. Springer Science & Business Media. Cited by: §1.
  • [51] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), Cited by: §3.5.
  • [52] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cited by: §2.
  • [53] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, and K. W. Bowyer (2005) The Human ID Gait Challenge Problem: Data Sets, Performance, and Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27 (2), pp. 162–177. Cited by: §1, TABLE I, §2, §5.2.2, §5.
  • [54] G. Shakhnarovich, L. Lee, and T. Darrell (2001) Integrated face and gait recognition from multiple views. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [55] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi (2016) GEINet: View-Invariant Gait Recognition Using a Convolutional Neural Network. In International Conference on Biometrics (ICB), Cited by: §1, §1, §5.2.3, TABLE X, TABLE XI.
  • [56] J. D. Shutler, M. G. Grant, M. S. Nixon, and J. N. Carter (2004) On a Large Sequence-Based Human Gait Database. In Applications and Science in Soft Computing, Cited by: §2.
  • [57] S. Sivapalan, D. Chen, S. Denman, S. Sridharan, and C. Fookes (2011) Gait Energy Volumes and Frontal Gait Recognition using Depth Images. In International Joint Conference on Biometrics (IJCB), Cited by: §1.
  • [58] N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised Learning of Video Representations using LSTMs. In International Conference on Machine Learning (ICML), Cited by: §5.1.3, TABLE VI.
  • [59] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen (2019) Towards highly accurate and stable face alignment for high-resolution videos. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 33, pp. 8893–8900. Cited by: §5.3.2.
  • [60] D. Tao, X. Li, X. Wu, and S. J. Maybank (2007) General tensor discriminant analysis and gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29 (10), pp. 1700–1715. Cited by: §1.
  • [61] L. Tran, F. Liu, and X. Liu (2019) Towards High-fidelity Nonlinear 3D Face Morphable Model. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [62] L. Tran and X. Liu (2018) Nonlinear 3D Face Morphable Model. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [63] L. Tran and X. Liu (2019) On learning 3D face morphable model from in-the-wild images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Note: doi: 10.1109/tpami.2019.2927975 Cited by: §2.
  • [64] L. Tran, X. Yin, and X. Liu (2017) Disentangled Representation Learning GAN for Pose-Invariant Face Recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [65] L. Tran, X. Yin, and X. Liu (2018) Representation Learning by Rotating Your Faces. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Note: doi: 10.1109/tpami.2018.2868350 Cited by: §2.
  • [66] C. Wan, L. Wang, and V. V. Phoha (2018) A survey on gait recognition. ACM Computing Surveys (CSUR) 51 (5),