Robust Cross-View Gait Identification with Evidence: A Discriminant Gait GAN (DiGGAN) Approach on 10000 People

11/26/2018 ∙ by BingZhang Hu, et al. ∙ University of Oxford, Georgia Institute of Technology, Newcastle University

Gait is an important biometric trait for surveillance and forensic applications, which can be used to identify individuals at a large distance through CCTV cameras. However, it is very difficult to develop robust automated gait recognition systems, since gait may be affected by many covariate factors such as clothing, walking surface, walking speed, camera view angle, etc. Among them, a large view-angle change is deemed the most challenging factor since it may alter the overall gait appearance substantially. Recently, some deep learning approaches (such as CNNs) have been employed to extract view-invariant features, and achieved encouraging results on small datasets. However, they do not scale well to large datasets, and the performance decreases significantly w.r.t. the number of subjects, which is impractical for large-scale surveillance applications. To address this issue, in this work we propose a Discriminant Gait Generative Adversarial Network (DiGGAN) framework, which not only can learn view-invariant gait features for cross-view gait recognition tasks, but also can be used to reconstruct the gait templates in all views, serving as important evidence for forensic applications. We evaluated our DiGGAN framework on the world's largest multi-view OU-MVLP dataset (which includes more than 10000 subjects), and our method outperforms state-of-the-art algorithms significantly on various cross-view gait identification scenarios (e.g., cooperative/uncooperative mode). Our DiGGAN framework also has the best results on the popular CASIA-B dataset, and it shows great generalisation capability across different datasets.




1 Introduction

Gait is a behavioural biometric characteristic which can be used for remote human identification. Compared with recognition technologies based on other biometric characteristics such as fingerprint or iris, gait recognition can be applied at a much longer distance without the subjects' cooperation. Nowadays, surveillance cameras are widely installed in public places such as airports, government buildings, streets and shopping malls, which makes gait recognition a useful tool for crime prevention and law enforcement. Gait analysis has contributed to evidence for convictions in criminal cases in some countries such as Denmark [18] and the UK [5]. For automated gait recognition, there are two main approaches: model-based and appearance-based. Model-based methods aim to model the human body structure parameters, while appearance-based approaches extract gait features directly from gait sequences regardless of the underlying body structure. This work falls in the latter category, which can also work well on low-quality gait videos, where the body structure parameters are difficult to extract precisely.

The average silhouette over one gait cycle, known as the Gait Energy Image (GEI, as shown in Fig. 1), is widely used in recent appearance-based gait recognition systems because of its simplicity and effectiveness [10]. In [13], several gait templates were evaluated on a gait dataset consisting of more than 3000 subjects, and it was found that directly matching GEIs can yield very good performance when gaits from the probe (i.e., query gait) and gallery (i.e., reference gait) are captured under the same walking conditions. Yet in real-world scenarios there exist covariate factors such as shoe type, carrying condition, clothing, speed, or camera viewpoint, which may affect the recognition performance significantly. Various machine learning algorithms were proposed [8][21][1] to learn gait features that are robust to covariates which only partially change the gait appearance. A large change in camera angle, however, is considered the most challenging factor, because it may affect the gait features in a global manner. Fig. 1 shows several GEI samples from the OU-MVLP dataset [27] at different view angles, and we can see that view changes may substantially alter the visual features of gaits, causing recognition difficulties. Recently, various deep learning approaches (e.g., [26][28]) were applied to learn view-invariant features, which show superb performance on datasets with a small number of subjects (e.g., on CASIA-B [30]). However, the performance of these deep approaches does not scale well to large datasets, e.g., the OU-MVLP dataset [27] with more than 10000 subjects. Moreover, the black-box nature of these deep CNNs makes them hard to use in real-world applications, e.g., when evidence is required. To address these issues, in this work we propose a Discriminant Gait Generative Adversarial Network (DiGGAN) framework, which not only scales well to large-scale cross-view gait identification tasks, but can also generate all the possible views for evidence. Our contributions can be summarised as follows:

  • Algorithm: A generative adversarial network based model (named DiGGAN) is proposed in this paper. With two independent discriminators, the proposed network can generate GEIs at unseen views while preserving the identity information. In addition, a triplet loss is introduced in our framework to enhance the discriminability of the learned features.

  • Application: Large-scale cross-view gait identification is challenging, and our proposed DiGGAN addresses it effectively. Moreover, it can generate all-view evidence, which is important for forensic applications.

  • Performance: On the world’s largest OU-MVLP dataset (with more than 10000 subjects), our method outperforms other algorithms significantly on many real-world gait identification scenarios (e.g., cooperative/uncooperative mode). It also has the best results on the popular CASIA-B dataset and shows strong generalisation ability across datasets.

Figure 1: Gait Energy Images (GEIs) in OU-MVLP dataset.
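As an illustration of the GEI definition used above (the average silhouette over one gait cycle), the following is a minimal sketch in plain Python. It assumes the silhouettes are already size-normalised, aligned and binary; the function name `gait_energy_image` and the toy frames are hypothetical and are not taken from the paper's implementation.

```python
from typing import List

Silhouette = List[List[int]]  # binary frame: 1 = body pixel, 0 = background


def gait_energy_image(frames: List[Silhouette]) -> List[List[float]]:
    """Pixel-wise mean of aligned binary silhouettes over one gait cycle."""
    if not frames:
        raise ValueError("at least one silhouette frame is required")
    height, width = len(frames[0]), len(frames[0][0])
    gei = [[0.0] * width for _ in range(height)]
    for frame in frames:
        for r in range(height):
            for c in range(width):
                gei[r][c] += frame[r][c]
    n = len(frames)
    return [[value / n for value in row] for row in gei]


# Toy 2x2 "gait cycle" of three frames (hypothetical data).
cycle = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
]
gei = gait_energy_image(cycle)
```

Pixels that belong to the body in every frame get the value 1, while pixels covered only during part of the cycle get intermediate values, which is what gives a GEI its characteristic grey limbs.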

2 Related Work

2.1 Cross-view Gait Recognition

Cross-view gait recognition methods can be roughly divided into three categories. Methods belonging to the first category (e.g., [2]) are based on 3D reconstruction through images from multiple calibrated cameras. However, these methods require a fully controlled and cooperative multi-camera environment, which limits their application in real-world surveillance scenarios. Methods in the second category perform view normalization on gait features before matching. In [6], after estimating the poses of the lower limbs, Goffredo et al. extracted the rectified angular measurements and trunk spatial displacements as gait features. Kusakunniran et al. [17] proposed a view normalization framework based on domain transformation obtained through Transform Invariant Low-rank Textures (TILT), in which gaits from different views are normalised to the side view for matching. These methods yielded reasonable recognition accuracies in some cross-view recognition tasks, yet they are sensitive to views in which gait features are hard to estimate (e.g., frontal/back views).

The third category learns the mapping/projection relationships of gaits across views. The learning process relies on training data that covers the views appearing in the gallery and probe. Through the learned metric(s), gaits from two different views can be projected into a common subspace for matching. In [19], Makihara et al. introduced the SVD-based View Transformation Model (VTM) to project gait features from one view into another. After pointing out the limitations of the SVD-based VTM, the method in [14] reformulated VTM construction as a regression problem. Instead of using global features (e.g., [19]), a local Region of Interest (ROI) was selected based on local motion relationships to build VTMs through Support Vector Regression (SVR). In [15], the performance was further improved by replacing SVR with Sparse Regression (SR). Instead of projecting gait features into a common space, Bashir et al. [4] used Canonical Correlation Analysis (CCA) to project gaits from two different views into two subspaces with maximal correlation. The correlation strength was employed as the similarity measure for identification. In [16], after arguing that there may exist weakly correlated or uncorrelated information in the global gaits across views [4], motion co-clustering was carried out to partition the global gaits into multiple groups of gait segments. For feature extraction, CCA was performed on these groups instead of on the global gait features as in [4]. Different from most works (e.g., [19][16][4]), which train multiple projection matrices for different view pairs, Hu et al. proposed a unitary linear projection named View-invariant Discriminative Projection (ViDP) [12]. The unitary nature of ViDP allows cross-view gait recognition to be performed without knowing the query gait views.

Most recently, deep learning approaches [26][28][29][11] were applied to gait recognition, which can model the non-linear relationship between different views. In [26], a basic CNN framework named GEINet was applied to a large gait dataset, and the experimental results suggested its effectiveness when the view angle changes between probe and gallery are small. To combat large view changes, a number of CNN structures were studied in [28] on the CASIA-B dataset (with 11 views from 0° to 180°), and Siamese-like structures were found to yield the highest accuracies. However, this dataset only includes 124 subjects, and the most recent work [27] found that these CNN structures do not generalise well to a large number of subjects. In [29][11], GAN approaches are applied to generate gait features/images at a common view or a target view for matching. However, the generative nature of both GAN models limits the recognition accuracies, although they are more interpretable than the discriminant CNN-based approaches [11].

2.2 Generative Adversarial Networks

Generative Adversarial Networks (GANs) [7] introduce a novel self-upgrading system. By keeping a balanced competition between a generator and a discriminator, fake data can be synthesised. While early work focuses on preventing low-quality output, instability and mode collapse, e.g. WGANs [3, 9] and DCGANs [24], recent applications utilise various forms of supervision to control the generated data. Conditional GANs [22] can generate samples according to provided label information. The assumption is that the data is generated by interpolating conditional variations along a low-dimensional manifold. By modifying the manifold assumption, GANs have been successfully applied to interpolating facial poses and ages. GaitGAN [29] and MGANs [11] are closely related works that use GANs for gait recognition. However, compared with these methods, our method can 1) extract more discriminant view-invariant features, which are robust for large cross-view gait recognition tasks, and 2) generate GEI images at unseen view angles, which can be used as important evidence for forensic applications.

3 Methodology

In this section, we describe the framework of the proposed DiGGAN and discuss the details of each component in turn. For convenience, in the rest of the paper each GEI image is indexed by its subject and by the angle at which it was captured, where the subject index runs over the number of subjects and the angle runs over the number of views in the dataset.

Figure 2: The illustration of the proposed DiGGAN

3.1 Framework Overview

Fig. 2 illustrates the pipeline of the proposed DiGGAN. The network is trained to transfer a GEI image at an arbitrary view to the GEI image of the same subject at the target view. As the input GEI image and the target GEI image are supposed to share the same identity, an auto-encoder is first applied to the input image to disentangle the view angle information from the identity information, thus projecting the image into an identity-preserving latent space and yielding the latent code. To inject the target view information, the latent code is concatenated with a one-hot view label, and a generator takes the concatenated vector as input and generates the image at the target view. Finally, two discriminators, one on angle and one on identity, are employed to enforce the angle and identity information in the generated image. Additionally, to enhance the discriminability of the embeddings in the latent space, a triplet loss is introduced to constrain the latent code.
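The conditioning step described above (latent code concatenated with a one-hot view label before it enters the generator) can be sketched as follows. This is only an illustration in plain Python: the helper names `one_hot` and `generator_input`, the three-dimensional latent code and the four-view list are assumptions made for the example, not details taken from the paper.

```python
from typing import List, Sequence


def one_hot(index: int, num_classes: int) -> List[float]:
    """One-hot vector with a single 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def generator_input(latent: Sequence[float], target_view: int,
                    views: Sequence[int]) -> List[float]:
    """Concatenate an identity latent code with the one-hot target-view label."""
    label = one_hot(list(views).index(target_view), len(views))
    return list(latent) + label


VIEWS = [0, 30, 60, 90]   # hypothetical set of training views (degrees)
z = [0.2, -1.3, 0.7]      # hypothetical 3-dimensional latent code
g_in = generator_input(z, 60, VIEWS)
```

Because the identity information lives only in the latent part of the vector, changing the target view amounts to swapping the one-hot suffix while keeping the latent code fixed.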

3.2 Angle Sensitive Discriminator

We assume that the GEI image is sampled from a low-dimensional manifold where the identity and angle change smoothly along their respective dimensions. As the latent code is constrained to contain the identity information only, we can easily manipulate the angle of the generated image by concatenating different angle labels to the latent code. Thus it is intuitive to employ a conditional discriminator to check the view angle of the generated image. Mathematically, for a given training pair consisting of an input image and a target image of the same subject, the angle sensitive discriminator is trained with the conditional adversarial objective of Eq. 1.


It is worth noting that in the angle sensitive discriminator, the one-hot view label is concatenated after the first convolutional layer, which gives better performance according to [23].
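One way to write the objective of the angle sensitive discriminator, assuming the standard conditional GAN formulation of [22], is sketched below. All symbols are assumptions introduced for illustration: $x^{\beta}$ is a real GEI at the target view $\beta$, $l_{\beta}$ is its one-hot view label, $z$ is the latent code of the input image, $G$ is the generator and $D_{a}$ is the angle sensitive discriminator. The discriminator maximises this quantity while the generator minimises its second term; the exact form used by the authors may differ.

```latex
\mathcal{L}_{D_a} =
  \mathbb{E}\big[\log D_a(x^{\beta}, l_{\beta})\big]
  + \mathbb{E}\big[\log\big(1 - D_a(G(z, l_{\beta}), l_{\beta})\big)\big]
```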

3.3 Identity Preserving Discriminator

One of the drawbacks of the original GANs is the poor diversity of the generated samples: for example, the model tends to memorise samples in the training set and hence outputs averaged images without differentiating identities. To tackle this problem, we introduce an identity preserving discriminator in our framework. Following the same idea as the angle sensitive discriminator, the identity preserving discriminator is designed as a conditional discriminator which takes two images as input and is expected to predict whether or not the two inputs share the same identity. Its objective function (Eq. 2) follows the same conditional adversarial form.
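Under the same assumptions as a standard conditional GAN, a pairwise objective for the identity preserving discriminator could look like the sketch below. The symbols are hypothetical: $x^{\alpha}$ is the input GEI, $x^{\beta}$ is a real GEI of the same subject at the target view, $G(z, l_{\beta})$ is the generated image and $D_{id}$ is the identity preserving discriminator. The real pair is treated as "same identity" and the pair containing the generated image as the negative case.

```latex
\mathcal{L}_{D_{id}} =
  \mathbb{E}\big[\log D_{id}(x^{\alpha}, x^{\beta})\big]
  + \mathbb{E}\big[\log\big(1 - D_{id}(x^{\alpha}, G(z, l_{\beta}))\big)\big]
```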

Figure 3: The illustration of the triplet loss employed in DiGGAN. The triplet loss is introduced to push the negative samples away from the anchor samples while pulling the positive samples closer.

3.4 Triplet Constraints on the Latent Code

Although the generated images can be directly used for gait recognition, e.g. by direct matching on the pixels [30], searching in the latent space has been widely adopted by most existing works [28][11] for its higher performance and efficiency. However, the identity preserving discriminator does not directly constrain the latent code, which may result in the distribution of the latent code exhibiting a 'hole'. Inspired by [32], which employs an extra discriminator to impose a uniform distribution on the latent code, we introduce the triplet loss to enhance the discriminability of the latent code. Concretely, as shown in Fig. 3, a triplet sample consists of an anchor, a positive and a negative sample, where the positive sample shares the same identity with the anchor while the negative has a different one. A hinge loss is employed here to push the negative sample away from the anchor and, at the same time, to pull the positive closer. For a given triplet, the hinge objective of Eq. 3 is minimized, where the distance is measured between latent codes and the margin is the gap to be ensured between the anchor-positive and anchor-negative distances.
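The hinge form of the triplet loss described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the Euclidean distance and the margin value of 2.0 are choices made for the example (the text only requires some distance and some margin), and the function names are hypothetical.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two latent codes (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(anchor: Sequence[float], positive: Sequence[float],
                       negative: Sequence[float], margin: float) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]                      # distance 5 from the anchor
easy_negative = [6.0, 8.0]                 # distance 10: already far enough
hard_negative = [0.0, 6.0]                 # distance 6: violates the margin

loss_easy = triplet_hinge_loss(anchor, positive, easy_negative, margin=2.0)
loss_hard = triplet_hinge_loss(anchor, positive, hard_negative, margin=2.0)
```

Only triplets whose negative lies within the margin of the positive contribute to the loss, so training effort concentrates on identities that are still confusable in the latent space.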

3.5 Objective Function and Training Strategies

Reconstruction loss: Besides the adversarial losses, a pixel-wise reconstruction loss between the generated image and the target image (Eq. 4) is also introduced to enhance the sharpness of the generated image.


Overall objective function: Based on Eq. 1 to 4, the overall objective function (Eq. 5) combines the two adversarial losses, the triplet loss and the reconstruction loss.
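For orientation, one plausible way to write the reconstruction term and the combined objective is sketched below. Everything here is an assumption made for illustration: the $\ell_1$ norm for the pixel-wise loss, the weights $\lambda_1, \lambda_2, \lambda_3$, and the symbols $G(z, l_{\beta})$ (generated image), $x^{\beta}$ (target image), $\mathcal{L}_{D_a}$, $\mathcal{L}_{D_{id}}$ and $\mathcal{L}_{tri}$ (the losses of Eq. 1 to 3).

```latex
\mathcal{L}_{rec} = \big\lVert G(z, l_{\beta}) - x^{\beta} \big\rVert_{1}

\mathcal{L} = \mathcal{L}_{D_a}
            + \lambda_{1}\,\mathcal{L}_{D_{id}}
            + \lambda_{2}\,\mathcal{L}_{tri}
            + \lambda_{3}\,\mathcal{L}_{rec}
```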


Training strategies: Empirically, training such a model with the multiple loss functions in Eq. 5 is challenging and often leads to poor results. To tackle this difficulty, we propose a step-by-step training strategy. In the first step, we train only the angle sensitive discriminator with artificial batches built from real GEI images. Specifically, we randomly sample GEI images from the training set to form a batch and train the angle sensitive discriminator on it. In each batch, half of the images are assigned wrong angle labels while the rest are assigned the correct ones. Training with real images rather than generated ones helps the angle sensitive discriminator to converge quickly. After it converges, we train the network without the triplet loss in two sub-stages. In the first sub-stage, the target image is set equal to the input image, i.e., the same image is fed into the network as the (input, ground truth) pair, which lets the network learn to recover the input image first. In the second sub-stage, we feed pairs of different images to teach the model to generate images at different angles. Finally, we add the triplet loss and fine-tune the whole network. Fig. 4 shows the generated images at different stages of the training process. The model learns to generate averaged images at the initial stage. After that, with different images being fed into the network, the model learns to generate images at new angles. Finally, the model learns to generate images with more identity details thanks to the triplet loss.
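The first training step (real images, half of them deliberately mislabelled) can be illustrated with the sketch below. It is a plain-Python illustration, not the authors' code: the function name `make_angle_batch`, the `(image_id, view)` sample format and the 0/1 targets are assumptions for the example.

```python
import random
from typing import List, Sequence, Tuple


def make_angle_batch(samples: Sequence[Tuple[str, int]], views: Sequence[int],
                     rng: random.Random) -> List[Tuple[str, int, int]]:
    """Build (image_id, assigned_view, target) triples from real GEIs.

    Half of the images receive a wrong view label (target 0), the rest keep
    their true view label (target 1).
    """
    batch = []
    half = len(samples) // 2
    for k, (image_id, true_view) in enumerate(samples):
        if k < half:
            wrong = rng.choice([v for v in views if v != true_view])
            batch.append((image_id, wrong, 0))
        else:
            batch.append((image_id, true_view, 1))
    rng.shuffle(batch)  # avoid a fixed wrong/correct ordering inside the batch
    return batch


samples = [("a", 0), ("b", 30), ("c", 60), ("d", 90)]  # hypothetical GEI ids
batch = make_angle_batch(samples, [0, 30, 60, 90], random.Random(0))
```

Since every image in such a batch is real, the discriminator can only separate the two halves by checking whether the image content agrees with the attached view label, which is exactly the behaviour wanted from it later.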

Figure 4: The generated images at different stages of the training process. First row: initial stage of the model, where the model outputs averaged images. Second row: the model learns to generate images at new angles from pairs of different images. Third row: after adding the triplet loss to training, the model learns more identity details. Last row: the model converges.
(a) VTM [19]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 68.8    0.5    0.2    0.1   17.4
30°                 0.7   82.2    2.1    0.8   21.4
60°                 0.3    3.2   77.6    5.4   21.6
90°                 0.2    1.1    4.2   80.9   21.6
Mean               17.5   21.7   21.0   21.8   20.5

(b) GEINet [26]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 75.9   32.1    7.0    7.4   30.6
30°                17.3   89.6   43.7   22.7   43.3
60°                 4.0   43.4   86.5   55.4   47.3
90°                 3.4   21.5   50.2   90.7   41.5
Mean               25.2   46.6   46.8   44.0   40.7

(c) Siamese [31]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 52.7   23.7   11.1   11.3   24.7
30°                18.4   78.6   32.6   27.6   39.3
60°                 8.0   33.5   76.1   39.6   39.3
90°                 7.9   26.5   36.5   82.1   38.2
Mean               21.8   40.6   39.1   40.1   35.4

(d) CNN-MT [28]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 70.7   16.7    4.4    3.9   23.9
30°                14.1   88.1   36.9   17.0   39.0
60°                 4.0   39.2   85.7   44.2   43.3
90°                 3.2   16.2   43.4   89.3   38.0
Mean               23.0   40.0   42.6   38.6   36.1

(e) CNN-LB [28]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 74.4   16.5    3.5    2.8   24.3
30°                13.6   89.3   36.0   16.2   38.8
60°                 2.9   36.2   88.4   44.7   43.0
90°                 2.2   14.0   41.2   91.7   37.3
Mean               23.3   39.0   42.3   38.9   35.9

(f) DM [30]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 68.8    0.8    0.1    0.0   17.4
30°                 1.2   82.2    1.4    0.3   21.3
60°                 0.1    1.1   77.5    5.6   21.1
90°                 0.0    0.2    4.1   80.9   21.3
Mean               17.5   21.1   20.8   21.7   20.3

(g) MGANs [11]
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 72.0    9.6    6.8    2.4   22.7
30°                 9.4   83.2   30.3   10.7   33.4
60°                 5.3   30.6   80.3   21.0   34.3
90°                 2.1   12.0   22.0   85.9   30.5
Mean               22.2   33.8   34.8   30.0   30.2

(h) DiGGAN (ours)
Gallery \ Probe      0°    30°    60°    90°   Mean
0°                 79.0   62.1   46.5   47.7   58.8
30°                58.1   89.8   64.8   58.5   67.8
60°                44.1   66.0   88.7   67.2   66.5
90°                44.6   58.9   66.0   90.0   64.8
Mean               56.4   69.2   66.5   65.8   64.5

Table 1: Rank 1 identification rate (%) for all baselines in the cooperative setting on the OU-MVLP dataset. Rows are gallery views and columns are probe views.

4 Experiments

In this section, we systematically evaluated our method on two datasets, the OU-ISIR Gait Database, Multi-View Large Population Dataset (OU-MVLP) [27] and CASIA-B [33]. It is worth noting that the OU-LP [13] and USF [25] datasets are not used in this paper due to lack of large view changes.

To evaluate the performance of our proposed method, we mainly focus on cross-view identification under the cooperative setting [27], where the gallery has a uniform camera view angle. We also study the uncooperative setting [27], where the gallery contains unknown views; following [27], we randomly select one out of all the view angles for each test subject in the gallery. Moreover, we explore the effect of the triplet loss on the performance of our framework. In addition, we show the generated gait images for unseen views, which may serve as important evidence for forensic applications. To the best of our knowledge, this is the first work that is flexible (any-to-any view generation) at such a fine level.

In the following, we will in turn introduce each of them.

Datasets: OU-MVLP is the world's largest cross-view gait dataset [27]. It contains 10,307 subjects (5,114 males and 5,193 females of various ages, ranging from 2 to 87 years) and 14 different view angles (0°, 15°, ..., 90° and 180°, 195°, ..., 270°). The subjects repeat forward and backward walking twice each, such that two sequences are generated in each view. The clothing of the subjects varies because data collection spanned different seasons. The GEIs used in this paper are size-normalized. Some examples from the OU-MVLP dataset are illustrated in Fig. 1. CASIA-B is another widely used cross-view gait dataset that consists of 124 subjects with 11 different view angles ranging from 0° to 180° at intervals of 18° [33]. For each subject, there are six sequences of normal walking, two sequences with bags and two sequences with different clothes.

Settings: For the experiments on OU-MVLP, we follow the settings in [27]. The 10,307 subjects in the OU-MVLP dataset are split into two disjoint groups: 5,153 subjects for training our DiGGAN model and 5,154 for testing (i.e., probe and gallery). Similarly, for the CASIA-B dataset, we choose the first 62 subjects for training and the remaining 62 subjects for testing.

Technical Details: Due to the page limitation, the details of our network architecture as well as the implementation code can be found at our GitHub repository after the review. For the parameters, the dimension of the latent code is set separately for OU-MVLP and for CASIA-B, and the margin in Eq. 3 is kept fixed for all the experiments.

Performance Measurement: Rank-1 identification rate (i.e., recognition accuracy) is used as the evaluation metric. Features are extracted from the trained DiGGAN, and a nearest-neighbour classifier is then applied for the different cross-view gait recognition tasks.
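The evaluation protocol above (nearest-neighbour matching of extracted features, scored by rank-1 identification rate) can be sketched as follows. This is a minimal illustration in plain Python; the squared Euclidean distance, the `(subject_id, feature)` data layout and the toy features are assumptions for the example.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[float]]  # (subject_id, feature vector)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (assumed metric for matching)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank1_rate(probe: List[Sample], gallery: List[Sample]) -> float:
    """Percentage of probes whose nearest gallery feature has the same identity."""
    hits = 0
    for subject_id, feature in probe:
        nearest_id, _ = min(gallery, key=lambda item: sq_dist(feature, item[1]))
        hits += nearest_id == subject_id
    return 100.0 * hits / len(probe)


# Hypothetical 2-D latent features for three subjects.
gallery = [("A", [0.0, 0.0]), ("B", [1.0, 1.0]), ("C", [5.0, 5.0])]
probe = [("A", [0.1, 0.0]), ("B", [0.9, 1.2]), ("C", [1.1, 1.0])]
rate = rank1_rate(probe, gallery)  # subject C is matched to B, so 2 of 3 hits
```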

4.1 Experimental Results on Cooperative Setting

Experimental Results on OU-MVLP: Since two GEIs with a 180° view difference are mostly considered as a same-view pair under the perspective projection assumption [20], we focus on four typical view angles (0°, 30°, 60°, 90°) in this section. We compared our DiGGAN framework with several state-of-the-art baselines, including classical ones: direct matching (DM) [30] and VTM [19]; CNN-based methods: GEINet [26], Siamese [31], CNN-MT [28] and CNN-LB [28]; and the most recent GAN-based approach: MGANs [11]. In the cooperative mode, the rank 1 identification rates for all four view angles are reported in Table 1, from which we can see:

  • Our method outperforms other methods significantly on cross-view gait identification tasks. Our overall rank-1 accuracy is 64.5%, which is 23.8 percentage points higher than the second best, GEINet (40.7%).

  • Our method is more robust on cross-view gait identification. In this cooperative mode, although accuracy decreases as the view angle difference increases, the drop is less significant than for other algorithms. Our DiGGAN yields very competitive performance even when the view difference is 90°, which indicates that our method can extract robust view-invariant features.

  • Most of the methods suffer when the gallery is at view 0°, yet our DiGGAN achieves a reasonable mean accuracy of 58.8% there (Table 1), much higher than the second best.

In Table 2, we also report the average rank 1 accuracies on cross-view gait identification excluding the identical views (between probe and gallery). We can see that other algorithms do not generalise well in this large-scale cross-view gait recognition evaluation, while our DiGGAN still achieves very competitive results.

Method                0°    30°    60°    90°   Mean
VTM [19]             0.4    1.6    2.2    2.1    1.6
GEINet [26]          8.2   32.3   33.6   33.6   26.9
Siamese [31]        11.4   27.9   26.7   26.2   23.1
CNN-MT [28]          7.1   24.0   28.2   21.7   20.3
CNN-LB [28]          6.2   22.2   26.9   21.2   19.1
DM [30]              0.4    0.7    1.9    2.0    1.3
MGANs [11]           5.6   17.4   19.7   11.4   13.5
DiGGAN (ours)       48.9   62.3   59.1   57.8   57.0
Table 2: Average rank 1 identification rates (%) under probe views 0°, 30°, 60° and 90°, excluding the identical view (cooperative mode), on the OU-MVLP dataset.
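The relation between Table 1 and Table 2 can be checked with a short calculation: for each probe view, average the Table 1 entries over all gallery views except the identical one. The sketch below does this for the DiGGAN block of Table 1, reading rows as gallery views and columns as probe views; the function name `cross_view_means` is hypothetical.

```python
# DiGGAN block of Table 1: rows = gallery view, columns = probe view
# (views 0, 30, 60, 90 degrees in both directions).
DIGGAN = [
    [79.0, 62.1, 46.5, 47.7],
    [58.1, 89.8, 64.8, 58.5],
    [44.1, 66.0, 88.7, 67.2],
    [44.6, 58.9, 66.0, 90.0],
]


def cross_view_means(table):
    """Per-probe mean over gallery views, skipping the identical-view entry."""
    n = len(table)
    means = []
    for probe in range(n):
        others = [table[g][probe] for g in range(n) if g != probe]
        means.append(sum(others) / len(others))
    return means


per_probe = cross_view_means(DIGGAN)
overall = sum(per_probe) / len(per_probe)
rounded = [round(m, 1) for m in per_probe]  # [48.9, 62.3, 59.1, 57.8]
print(rounded, round(overall, 1))           # overall rounds to 57.0
```

The rounded values reproduce the DiGGAN row of Table 2, which confirms that Table 2 is the column-wise mean of Table 1 with the diagonal (identical-view) entries removed.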

Effect of Triplet Loss and Identity Discriminator: To explore the effect of the triplet loss, we trained two separate models on OU-MVLP: one with the triplet constraints on the latent code and another without them. We compared them with the state-of-the-art method GEINet [26]. The results are shown in Table 3. Even without the triplet loss our method still outperforms the state of the art, and introducing the triplet loss brings a further significant improvement (from 48.1% to 64.5% mean rank 1 accuracy).

Method                0°    30°    60°    90°   Mean
GEINet [26]         25.2   46.6   46.8   44.0   40.7
DiGGAN (w/o T)      37.6   50.8   52.7   51.3   48.1
DiGGAN              56.4   69.2   66.5   65.8   64.5
Table 3: Average rank 1 identification rates (%) per probe view on OU-MVLP. (w/o T) indicates the model without triplet loss.

Experimental Results on CASIA-B: CASIA-B is a relatively small dataset. We evaluated our model and report the average recognition accuracies on CASIA-B in Table 4. The comparison is conducted under the probe views 54°, 90° and 126° against several methods: VTM [14], C3A [4], ViDP [12], CNN [28] and MGANs [11]. The results show that our method yields competitive performance under probe 54° while achieving improvements under probes 90° and 126°, which indicates that our framework also works well on small-scale datasets.

Method               54°    90°   126°   Mean
VTM [19]            55.0   46.0   54.0   51.0
C3A [4]             75.7   63.7   74.8   71.4
ViDP [12]           64.2   60.4   65.0   63.2
CNN [28]            94.6   88.3   93.8   92.2
MGANs [11]          84.2   72.3   83.0   79.8
DiGGAN (ours)       94.4   91.2   93.9   93.2
Table 4: Average rank 1 identification rates (%) under probe views 54°, 90° and 126°, excluding the identical view (cooperative mode), on the CASIA-B dataset.

Cross Dataset Evaluation: In this section, we evaluate the generalisation ability of our model. We trained three models: the first is trained on the OU-MVLP dataset only, the second is trained on the CASIA-B dataset only, and the last is first trained on OU-MVLP and then fine-tuned on CASIA-B. We report the average rank 1 identification rates of each model on the 62 subjects in CASIA-B's test set, and the results are shown in Table 5. We can see that the model trained on OU-MVLP only yields a promising identification rate on the CASIA-B dataset. We can also see that pre-training on the OU-MVLP dataset helps the model to achieve the best results among the three, because of its massive number of training samples. However, we noticed that the fine-tuned model does not benefit much from the large pre-training set (93.3% versus 93.2% mean accuracy). A possible reason is that the view angles as well as the nationalities of the subjects in OU-MVLP and CASIA-B are very different. Nevertheless, the experimental results suggest that it is not harmful to use the large OU-MVLP dataset for representation learning. In fact, based on the learned representation, even without local fine-tuning, our model outperforms all the existing methods except the CNN [28], which shows that our framework has a very strong generalisation ability.

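Model                                           54°    90°   126°   Mean
Trained on OU-MVLP only                        86.2   82.2   84.7   84.4
Trained on CASIA-B only                        94.4   91.2   93.9   93.2
OU-MVLP pre-training + CASIA-B fine-tuning     94.6   91.3   93.9   93.3
Table 5: Average rank 1 identification rates (%) under probe views 54°, 90° and 126°, excluding the identical view (cooperative mode), on CASIA-B.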

4.2 In-depth Analysis

To better understand the success of our proposed model, this section provides detailed discussions and verifies some key statements in our methodology. All experimental results are based on OU-MVLP dataset.

Uncooperative Setting Results: Compared with the cooperative mode, this scenario is more challenging since the gallery views are non-uniform. Following the settings in [27], we randomly select one of the 14 view angles for each test subject in the gallery. Furthermore, considering the cost of collecting full-view training samples, it would be more practical to train the model with fewer views while generalising to more. In this paper, we thus add an extra challenge and use the same model that is trained on only 4 angles, so that the remaining 10 angles in the test gallery are unseen. To the best of our knowledge, this is the first attempt to match gait images from unseen view angles in the test gallery. In Table 6, our model significantly outperforms state-of-the-art approaches that are trained on the full 14 views.

Method                0°    30°    60°    90°   Mean
GEINet [26]         15.7   41.0   39.7   39.5   34.0
Siamese [31]        15.6   36.2   33.1   36.5   30.3
CNN-LB [28]         14.2   32.7   32.3   34.6   28.5
CNN-MT [28]         11.1   31.5   31.1   29.8   25.9
DM [30]              7.1    7.4    7.5    9.7    7.9
DiGGAN (ours)       30.8   43.6   41.3   42.5   39.6
Table 6: Rank 1 identification rate (%) per probe view for all baselines in the uncooperative setting on the OU-MVLP dataset.
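The uncooperative gallery used for Table 6 is built by drawing one view at random for each test subject. A minimal sketch of that sampling step is given below. The view list follows the public description of OU-MVLP (0° to 90° and 180° to 270° at 15° steps), which is an assumption stated here rather than taken from this text, and the function name and the fixed seed are hypothetical.

```python
import random
from typing import Dict, Iterable, Sequence

# Assumed OU-MVLP view angles in degrees: 0..90 and 180..270 at 15-degree steps.
OU_MVLP_VIEWS = list(range(0, 91, 15)) + list(range(180, 271, 15))


def build_uncooperative_gallery(test_subjects: Iterable[int],
                                views: Sequence[int],
                                rng: random.Random) -> Dict[int, int]:
    """Assign one randomly chosen gallery view to every test subject."""
    return {subject: rng.choice(list(views)) for subject in test_subjects}


# Five hypothetical subject ids; a seeded generator keeps the draw repeatable.
gallery_views = build_uncooperative_gallery(range(5), OU_MVLP_VIEWS,
                                            random.Random(0))
```

With a model trained on 4 views only, most of the views drawn this way are unseen at training time, which is what makes this setting harder than the cooperative one.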

Performance on Small-scale Gallery: In many realistic applications, such as an indoor office, the gallery size can be smaller. In Fig. 5, we can see that the performance tends to be higher with a smaller gallery. At the 100-identity scale, the accuracies under all views exceed 90%, which is in line with the experimental results on the small-scale CASIA-B. Given this high performance, our model has considerable potential for industrial applications.

Figure 5: Performance w.r.t. the size of gallery (in cooperative mode) on OU-MVLP.
Figure 6: Generated images at 0°, 30°, 60° and 90° with different input views. The top row shows the ground truth GEIs from the target views in the gallery. The first column shows the input GEIs from the probe. The images in the bottom right matrix are the generated GEIs.
Figure 7: Gait view generation: 6 generated GEI images with corresponding views. One of the views is completely unseen during training.
Figure 8: Qualitative analysis of the evidence generation. The first column, marked with a red box, shows three different GEIs in the probe. The remaining five images in each row are GEIs generated from the 5 most similar reference templates (i.e. latent codes).

Any-to-Any View Gait Evidence Generation: One of the advantages of our proposed model is that it can generate gait images from an arbitrary view angle to all target angles, whereas existing approaches can only achieve 1-to-1 generation [11]. Such an extension helps humans understand an identification that is based on latent features, and thus improves the user's trust.

Figure 9: The generated images of 14 views from a single input image, which is indicated by the blue box at the top left.

Fig. 9 shows the generated 14 views (0°-90°, 180°-270°) given a single input image. We also show the generated gait images of four typical views (0°, 30°, 60°, 90°) using these four angles as input (Fig. 6). We can see that the generated gait images have a high similarity with the ground truth, even for a large view variance. These cross-view generated gait images can be used as evidence in surveillance and forensic applications.

We also consider a new scenario that has not been studied in previous works, in which the training dataset does not contain certain views that appear in the test dataset. In Fig. 7, we show a generated view which is completely unseen during training. We can see that both the identity and angle information can be generated, although some details (e.g. hands and feet) are missing.

Consistency Evaluation between Latent Code Searching and Evidence: To evaluate the effectiveness of our generated evidence, we combine it with the results of latent code searching (rank 5). As Fig. 8 shows, the generated gait evidence of the subjects identified by searching in the latent space has high similarity with the input probe image, which indicates that good consistency is achieved between the latent space and the generated image level. Moreover, this suggests that the generated evidence is effective and could be applied to real-world forensic situations. On the other hand, we can see that the generated images for the five nearest subjects are similar (Fig. 8), which means that if the latent codes are close to each other, the generated images also look similar. This further supports the effectiveness of our model.

5 Conclusion

This paper studied a challenging large-scale cross-view gait recognition problem. Using GANs to generate different views, the learnt latent embedding achieved remarkable cross-view transferability. The model effectively incorporated three modules. The angle sensitive discriminator loss provided an interactive interface through which a given arbitrary view could be used to generate all of the other views. The identity preserving discriminator loss preserved identity-sensitive information in the generated images. To further discriminate a large number of identities, a triplet constraint was introduced on the latent embedding. Moreover, since the triplet training incorporated images from different views, the inter-identity distance was enlarged, which further de-correlated the effects of the cross-view problem. Extensive experiments demonstrated promising improvements over the state of the art. Our method also achieved the best results in the uncooperative scenario, which has non-uniform views in the gallery. Reliable performance was also achieved on the small-scale dataset (i.e. CASIA-B), and we further showed that our DiGGAN framework can effectively take advantage of a large dataset for cross-dataset generalisation. A detailed training strategy was discussed so that the model can benefit both the gait recognition community and practitioners in other domains who use GANs to solve their problems. Overall, this paper made a breakthrough towards reliable cross-view gait recognition at a very large scale with generated evidence for practical applications.


References

  • [1] H. Aggarwal and D. K. Vishwakarma. Covariate conscious approach for gait recognition based upon Zernike moment invariants. IEEE Transactions on Cognitive and Developmental Systems, 10(2):397–407, June 2018.
  • [2] G. Ariyanto and M. Nixon. Model-based 3d gait biometrics. In IJCB, pages 1–7, 2011.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017.
  • [4] K. Bashir, T. Xiang, and S. Gong. Cross-view gait recognition using correlation strength. In BMVC, pages 1–11, 2010.
  • [5] I. Bouchrika, M. Goffredo, J. Carter, and M. S. Nixon. On using gait in forensic biometrics. Journal of Forensic Sciences, 56(4):882–889, 2011.
  • [6] M. Goffredo, I. Bouchrika, J. Carter, and M. Nixon. Self-calibrating view-invariant gait biometrics. IEEE TSMC, 40(4):997–1008, Aug 2010.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [8] Y. Guan, C. Li, and F. Roli. On reducing the effect of covariate factors in gait recognition: a classifier ensemble method. IEEE TPAMI, 37(7):1521–1528, July 2015.
  • [9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NIPS, pages 5767–5777, 2017.
  • [10] J. Han and B. Bhanu. Individual recognition using gait energy image. IEEE TPAMI, 28(2):316–322, Feb. 2006.
  • [11] Y. He, J. Zhang, H. Shan, and L. Wang. Multi-task gans for view-specific feature learning in gait recognition. IEEE TIFS, 14(1):102–113, 2019.
  • [12] M. Hu, Y. Wang, Z. Zhang, J. Little, and D. Huang. View-invariant discriminative projection for multi-view gait-based human identification. IEEE TIFS, 8(12):2034–2045, Dec 2013.
  • [13] H. Iwama, M. Okumura, Y. Makihara, and Y. Yagi. The ou-isir gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE TIFS, 7(5):1511–1521, 2012.
  • [14] W. Kusakunniran, Q. Wu, J. Zhang, and H. Li. Support vector regression for multi-view gait recognition based on local motion feature selection. In CVPR, pages 974–981, 2010.
  • [15] W. Kusakunniran, Q. Wu, J. Zhang, and H. Li. Gait recognition under various viewing angles based on correlated motion regression. IEEE TCSVT, 22(6):966–980, 2012.
  • [16] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang. Recognizing gaits across views through correlated motion co-clustering. IEEE TIP, 23(2):696–709, Feb 2014.
  • [17] W. Kusakunniran, Q. Wu, J. Zhang, Y. Ma, and H. Li. A new view-invariant feature for cross-view gait recognition. IEEE TIFS, 8(10):1642–1653, Oct 2013.
  • [18] P. Larsen, E. Simonsen, and N. Lynnerup. Gait analysis in forensic medicine. Journal of Forensic Sciences, 53:1149–1153, 2008.
  • [19] Y. Makihara, R. Sagawa, Y. Mukaigawa, T. Echigo, and Y. Yagi. Gait recognition using a view transformation model in the frequency domain. In ECCV, volume 3953, pages 151–163, 2006.
  • [20] Y. Makihara, R. Sagawa, Y. Mukaigawa, T. Echigo, and Y. Yagi. Which reference view is effective for gait identification using a view transformation model? In CVPRW, pages 45–45. IEEE, 2006.
  • [21] Y. Makihara, A. Suzuki, D. Muramatsu, X. Li, and Y. Yagi. Joint intensity and spatial metric learning for robust gait recognition. In CVPR, pages 6786–6796, July 2017.
  • [22] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv, 2014.
  • [23] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. arXiv, 2016.
  • [24] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015.
  • [25] S. Sarkar, P. Phillips, Z. Liu, I. Vega, P. Grother, and E. Ortiz. The humanid gait challenge problem: data sets, performance, and analysis. IEEE TPAMI, 27(2):162–177, 2005.
  • [26] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi. GEINet: View-invariant gait recognition using a convolutional neural network. In ICB, pages 1–8, 2016.
  • [27] N. Takemura, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Transactions on Computer Vision and Applications, 10(1):4, Feb 2018.
  • [28] Z. Wu, Y. Huang, L. Wang, X. Wang, and T. Tan. A comprehensive study on cross-view gait based human identification with deep cnns. IEEE TPAMI, 39(2):209–226, Feb 2017.
  • [29] S. Yu, H. Chen, E. B. G. Reyes, and N. Poh. Gaitgan: Invariant gait feature extraction using generative adversarial networks. In CVPRW, pages 532–539, July 2017.
  • [30] S. Yu, D. Tan, and T. Tan. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In ICPR, volume 4, pages 441–444, 2006.
  • [31] C. Zhang, W. Liu, H. Ma, and H. Fu. Siamese neural network based gait recognition for human identification. In ICASSP, pages 2832–2836, March 2016.
  • [32] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, volume 2, 2017.
  • [33] S. Zheng, J. Zhang, K. Huang, R. He, and T. Tan. Robust view transformation model for gait recognition. In ICIP, pages 2073–2076, 2011.