VGR-Net: A View Invariant Gait Recognition Network

10/13/2017 ∙ by Daksh Thapar, et al. ∙ IIT Rajasthan IIT Mandi 0

Biometric identification systems have become immensely popular and important because of their high reliability and efficiency. However, person identification at a distance still remains a challenging problem. Gait can be seen as an essential biometric feature for human recognition and identification. It can be easily acquired from a distance and does not require any user cooperation, thus making it suitable for surveillance. But the task of recognizing an individual using gait can be adversely affected by varying viewpoints, which makes the task considerably more challenging. Our proposed approach tackles this problem by learning spatio-temporal features, supported by extensive experimentation and a dedicated training mechanism. In this paper, we propose a 3-D convolutional deep neural network for person identification using gait under multiple views. It is a 2-stage network, in which a classification network first identifies the viewing angle. After that, another set of networks (one for each angle) is trained to identify the person under a particular viewing angle. We have tested this network on the publicly available CASIA-B database and have achieved state-of-the-art results. The proposed system is much more efficient in terms of time and space and performs better for almost all angles.




1 Introduction

In today’s rapidly changing security scenarios, continuous personal authentication is essential for handling any socially harmful activity. Unlike traditional forms of identification and authentication such as passwords, ID cards and tokens, biometric traits do not need to be memorized and cannot be shared, as they are unique to an individual. Hence they have started to replace traditional methods in the recent past. Biometrics can be divided into two classes: 1) physiological and 2) behavioral. The former includes traits like face, fingerprint, iris, ear, knuckle and palm. The latter includes signature, voice and gait. Physiological traits are usually unique and highly discriminant. But these traits require cooperation from the subject, along with a comprehensive controlled environmental setup, for efficient and accurate authentication. Hence, on their own these traits are not very useful, especially in surveillance systems. On the other hand, behavioral traits such as gait enable us to recognize a person using standard cameras, even at a distance. Hence, its easy accessibility and low susceptibility to noise make gait a good biometric trait for video surveillance. Even within a controlled environment, fusing behavioral biometrics such as gait with physiological biometrics has been shown to give very promising results.

1.1 Motivation and Challenges

Human gait has many advantages over conventional biometric traits (like fingerprint, ear and iris), such as its non-invasive nature and the fact that it can be observed at a distance. Therefore, gait recognition has applications in various areas, such as authentication and detection of impostor behavior during video surveillance, personal authentication, and several other security-related fields. Since gait features are key behavioural characteristics of a person, they are believed to be hard to circumvent and can be monitored through any surveillance system, which makes gait an essential biometric characteristic. Unfortunately, gait recognition can be severely affected by many factors, including viewing angle, clothes, the presence of bags and the illumination of the surroundings, making it a really challenging problem. Multi-view gait recognition in particular is a major problem: efficient and accurate recognition from any viewing angle or camera angle is extremely hard.

Multi-view recognition is essential because, most of the time, the input from surveillance systems does not match the viewing angle available to the enquirer, which makes recognition very hard. To tackle these challenges, we propose an efficient, 3-D CNN based network for multi-view gait recognition, as shown in Fig. 1.

Figure 1: Our proposed multi-view gait recognition network

1.2 Related Work

A huge amount of work has been done to advance the state of the art in gait recognition, starting with distinguishing human locomotion from other locomotion [5]. Gait feature extraction and matching can broadly be classified into two approaches: a) handcrafted features and matching and b) deep-learning based approaches.

Handcrafted Features : Liu et al. [10] used frieze patterns and combined them with dynamic time warping. This approach worked well with similar viewpoints. Kale et al. [6] used the person silhouette and applied a hidden Markov model over it to optimize the performance. Another set of approaches used the Gait Energy Image (GEI), which is an average of silhouettes over a whole gait cycle. Man et al. [11] and Hofmann and Rigoll [3] further used GEI for recognition. In the approaches of [18], [9] and [12], gait sequences are transformed into a particular desired viewpoint.

Deep Learning : Recently researchers have started to exploit 3D features for multi-view gait recognition, advancing the problem from 2D image classification to 3D video classification. Karpathy et al. [7] use a multi-resolution, foveal architecture by applying 3D convolutions on different time frames of a video. Similar to [7], Tran et al. [15] have designed a CNN using 3D convolutions with a deeper structure that can fully exploit spatio-temporal features useful for video classification. Sun et al. [14] showed how informative optical flow can be for classification and identification in videos. Wolf et al. [16] use 3D convolutional neural networks with a three-channel input, with gray-scale, optical flow in the X direction and optical flow in the Y direction as the three input channels. They also demonstrated how varying the training and testing data can give better results. Yu et al. [17] showed how GANs can be used to correctly generate GEI images (of other views) and further learn a person’s gait features. Feng et al. [2] have used LSTM networks to memorize and extract the walking pattern of a person by correctly extracting the heat-map information.

1.3 Contribution

In this work we have used only the silhouettes of persons instead of other useful but expensive information such as optical flow or GEI. The proposed network has been trained on a smaller amount of data, without any overlapping clips. This also results in faster training and testing of the network. To achieve viewpoint-invariant recognition, we use a multi-stage network that initially performs viewing angle classification and later fine-level subject classification. Multi-level classification has been performed by using voting at the clip level. Finally, we have also used a stereo image data representation to train our network in order to enhance our results.

The remainder of the paper is organized as follows. The next section describes the proposed architecture. Section 3 presents the experimental analysis, and the last section concludes the proposed work.

Figure 2: Hierarchically Generalized Multi-View Gait Recognition Architecture

2 Proposed Architecture

In this work we have utilized gait video sequences for recognition, captured at various angles. Hence we have considered spatio-temporal features. By analyzing the temporal domain, one can use CNNs to extract many features from the surroundings that can be further used for video classification. Our proposed approach is a hierarchical two-step process. In both steps deep 3D CNNs have been trained to learn the spatio-temporal features among the video frames. The network details are shown in Fig. 2. The first network learns to estimate the viewing angle, and the second stage then performs subject identification. To realize this network, we started with a basic 3D convolutional network architecture and added a few dense and pooling layers. Later we fine-tuned the network to optimize its performance.
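The two-step hierarchy described above can be sketched as a simple dispatch: the angle classifier picks a view, and the clip is then routed to the subject classifier trained for that view. This is only an illustrative sketch, not the authors' implementation; the names `identify`, `angle_model` and `subject_models` are hypothetical, and each model is assumed to be any callable that maps a clip to a vector of class probabilities.

```python
from typing import Callable, Dict, Tuple

import numpy as np

# Hypothetical alias: a "model" is any callable mapping a clip to class probabilities.
Model = Callable[[np.ndarray], np.ndarray]

# Viewing angles as listed in Table 3: 0, 18, ..., 180 degrees.
ANGLES = list(range(0, 181, 18))


def identify(clip: np.ndarray,
             angle_model: Model,
             subject_models: Dict[int, Model]) -> Tuple[int, int]:
    """Stage 1 predicts the viewing angle; Stage 2 uses that angle's network."""
    angle_probs = angle_model(clip)
    angle = ANGLES[int(np.argmax(angle_probs))]
    # Route the same clip to the subject classifier for the predicted angle.
    subject_probs = subject_models[angle](clip)
    return angle, int(np.argmax(subject_probs))
```

A consequence of this design is that Stage-2 accuracy depends on Stage-1 being correct, since a wrong angle prediction sends the clip to a network trained on a different view.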

2.1 Spatio-Temporal Features

2D convolutional neural networks can only analyze and learn the spatial features present in images. To learn temporal gait features, we need to explore the additional temporal dimension using a 3D CNN. Spatio-temporal features relate space and time together (having both spatial extent and temporal duration).

2.2 C3D : 3D CNN Architecture

C3D (Convolutional 3D) [15], [4], [7] is a network based on 3D convolutions which has achieved state-of-the-art results in action recognition and scene classification. It is established that when 2D convolutions are applied over successive frames of a video, the temporal information present in the frames is lost. Thus 3D convolutions and pooling are necessary to retain the time-variant information in the video.
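To make the role of the temporal dimension concrete, here is a minimal NumPy sketch of a naive 3D convolution (implemented as cross-correlation, as in most deep learning libraries). It is a toy illustration, not C3D itself: the function name `conv3d_valid`, the toy clip and the frame-differencing kernel are all assumptions chosen for the example.

```python
import numpy as np


def conv3d_valid(video: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 3D cross-correlation over (time, height, width)."""
    t, h, w = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                patch = video[i:i + kt, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(patch * kernel)
    return out


# Toy clip: a single bright pixel moving one step to the right per frame.
video = np.zeros((4, 5, 5))
for frame in range(4):
    video[frame, 2, frame] = 1.0

# Kernel with temporal extent 2 and spatial extent 1: next frame minus current frame.
temporal_diff = np.array([-1.0, 1.0]).reshape(2, 1, 1)

motion = conv3d_valid(video, temporal_diff)
# The same first frame repeated four times, i.e. a clip with no motion.
static = conv3d_valid(np.repeat(video[:1], 4, axis=0), temporal_diff)
```

The kernel responds only when consecutive frames differ, so `motion` is non-zero while `static` is all zeros. A 2D kernel applied to each frame separately has no temporal extent and cannot express this difference.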

[a] The network takes a frames clip each of size as an input. It contains convolution layers, pooling layers, fully connected layers followed by a softmax classification layer. Each of the convolution layers has ReLU activation applied over it with padding to preserve the spatial dimensions.

[b] The first layer consists of filters each of size followed by a pooling layer of with a stride of where corresponds to the temporal domain. Then another layer of filters each of size has been applied followed by a pooling layer of pool size and stride of .

[c] After these initial layers two similar layers, each consisting of filters each of size followed by a max pooling of pool size and stride have been introduced.

[d] Now two sets of layers are stacked after this each with two followed by a pooling layer. All layer have to learn filters, each of kernel size , followed by a pooling layer of with a stride of .

[e] Finally the output is flattened and two fully connected layers each of neurons have been added with a dropout of to handle over-fitting. At last a dense layer of has been applied with a softmax activation to classify the video.

This model has been pre-trained on the Sports-1M dataset [8], which is one of the largest video classification benchmarks and contains million sports videos.

2.3 Model Architecture

  1. Stage-1 : The first stage of the proposed network uses convolutions, to identify the viewing angle of any gait video. Sixteen frames are uniformly sampled from the video and replicated into three channels to pass through the network. The proposed network consists of a slightly tweaked version of C3D model which was already trained on Sports-1M data-set and takes as input frames each of size , as shown in left side of Fig. 2 and discussed below.

    [i] The last layers of the C3D model have been removed as we need only the basic temporal features and not the action specific features ( [13]).


    [ii] The C3D flatten layer output gives a feature vector of size . Later we have added a fully connected layer of neurons to the above output and apply activation to achieve non-linear sparsity. A dropout of has again been applied to avoid over-fitting and force neurons to learn generic features instead of the features specific to a given training data.

    [iii] A fully connected layer of neurons is applied as a second last layer over the output with ReLU activation and a dropout of , so as to achieve the best possible latent representation.

    [iv] Finally, an output layer of neurons has been added to classify any given video into different viewing angles ranging from 0 to 180 degrees in steps of 18 degrees. Since this is a classification problem, a Softmax activation is applied over the output to get the probabilities corresponding to each of the viewing angles.

    Parameter       | Value
    ----------------|------------------------------------------------------
    Optimizer       | Stochastic Gradient Descent optimizer
    Epochs          | 90
    Learning rate   | Dual: 0.005 (for 1st 40 epochs), 0.003 (for rest 50 epochs)
    Mini batch size | 16
    Table 1: Summarizing the Person Identification Model Parameters
  2. Stage-2 : In the second stage for each of the different viewing angles a separate network has been trained only on the videos of that angle corresponding to all the subjects. This network is also inspired by the C3D model trained on the Sports-1M data-set, as shown in right side of Fig. 2 and discussed below.

    [i] The last five layers of the network have been removed and the output of the flatten layer of the model is taken.

    [ii] Seven blocks, each consisting of a fully connected layer of neurons with a ReLU activation and a dropout of are stacked sequentially after the output of the flatten layer.

    [iii] Finally a fully connected layer of neurons is added with Softmax activation to get the class probabilities for each of subjects. Since frames can be fed into the C3D model at a time, original video is cut into clips of frames each.

    Voting Scheme : During training, each of these clips is treated as a separate video, and the error is calculated and back-propagated by comparing the prediction with the ground truth of each clip. During testing, the trained network predicts the subject for each clip independently, and the predictions are combined under a voting scheme. The subject that receives the highest number of votes from the clips is taken as the final prediction.
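The frame handling and voting steps of the two stages can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names (`sample_uniform`, `to_three_channels`, `split_clips`, `vote`) are hypothetical, a channels-last layout is assumed, leftover frames that do not fill a whole clip are simply dropped, and ties in the vote are broken in favour of the subject predicted first.

```python
from collections import Counter
from typing import List, Sequence

import numpy as np

CLIP_LEN = 16  # the text states that sixteen frames are fed to the network


def sample_uniform(frames: np.ndarray, n: int = CLIP_LEN) -> np.ndarray:
    """Pick n frames at (approximately) evenly spaced positions (Stage-1 input)."""
    idx = np.linspace(0, len(frames) - 1, num=n).round().astype(int)
    return frames[idx]


def to_three_channels(clip: np.ndarray) -> np.ndarray:
    """Replicate single-channel silhouettes into three channels (channels last)."""
    return np.repeat(clip[..., np.newaxis], 3, axis=-1)


def split_clips(frames: np.ndarray, clip_len: int = CLIP_LEN) -> List[np.ndarray]:
    """Cut a video into consecutive non-overlapping clips (Stage-2 input)."""
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def vote(clip_predictions: Sequence[int]) -> int:
    """Return the subject label that received the most clip-level votes."""
    return Counter(clip_predictions).most_common(1)[0][0]
```

For example, a 50-frame silhouette sequence yields three non-overlapping 16-frame clips, and clip-level predictions `[7, 3, 7]` give subject 7 as the final decision.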

Parameter       | Value
----------------|------------------------------------------------------
Optimizer       | Stochastic Gradient Descent optimizer
Epochs          | 115
Learning rate   | Multiple: 0.001 (for 1st 25 epochs), 0.005 (for next 40 epochs), 0.003 (for last 50 epochs)
Mini batch size | 16
Table 2: Summarizing the Model Parameters for Stereo Training
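The step-wise learning rates in Tables 1 and 2 amount to a piecewise-constant schedule. The sketch below encodes them as (number of epochs, rate) pairs; the function name `learning_rate` and the 0-based epoch convention are assumptions made for illustration, and the function is framework-agnostic rather than tied to any particular training library.

```python
from typing import List, Tuple

# (number of epochs, learning rate) segments, copied from Tables 1 and 2.
STAGE2_SCHEDULE: List[Tuple[int, float]] = [(40, 0.005), (50, 0.003)]
STEREO_SCHEDULE: List[Tuple[int, float]] = [(25, 0.001), (40, 0.005), (50, 0.003)]


def learning_rate(epoch: int, schedule: List[Tuple[int, float]]) -> float:
    """Return the learning rate for a 0-based epoch index."""
    start = 0
    for n_epochs, lr in schedule:
        if epoch < start + n_epochs:
            return lr
        start += n_epochs
    raise ValueError(f"epoch {epoch} is beyond the {start}-epoch schedule")
```

Such a function could be passed to a per-epoch callback in whichever framework is used for training.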

3 Experimental Analysis

We have conducted experiments on the CASIA-B dataset [1] to show the utility of our model. The Correct Classification Rate (CCR%) is computed for performance evaluation. The results are computed under two scenarios, namely (a) Stage-1 classification: predicting the viewpoint angle and (b) Stage-2 classification: personal identification. We have compared our 2-stage deep 3D convolutional neural network with the current state-of-the-art network proposed in [16]. We demonstrate that the proposed network is efficient in terms of time and space and also outperforms [16] in terms of performance (i.e. CCR%) for most angles.

Figure 3: Sample silhouettes of a subject at five viewing angles.

3.1 Database Specification

The CASIA-B dataset [1] is a large multi-view gait database. This dataset consists of gait images corresponding to subjects, and the gait data has been captured from views ranging from 0 to 180 degrees. There are six sequences of normal walking for each of the subjects and for all the different angles. Instead of the original video, we have used human silhouettes extracted from the videos for our experiments, as shown in Fig. 3.

3.2 Performance Parameters

We have used the correct classification rate (CCR) as our performance parameter. It is the percentage of test videos for which the viewpoint angle or the person has been correctly identified.
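As a small worked example, CCR can be computed as the percentage of predictions that match the ground truth. The function name `ccr` is a hypothetical helper introduced here for illustration.

```python
import numpy as np


def ccr(predicted, actual) -> float:
    """Correct classification rate, in percent."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same shape")
    # Fraction of exact matches, scaled to a percentage.
    return 100.0 * float(np.mean(predicted == actual))
```

For instance, three correct predictions out of four give a CCR of 75%.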

3.3 Training and Testing Protocol

For angle identification network, we have used subjects for training that corresponds to of the total data whereas the remaining subject videos have been used for testing corresponding to of the total dataset. For person identification network, we have used first four sequences from each subject for training whereas two sequences have been used for testing.

Figure 4: Formation of stereo images using silhouettes

Training hyper-parameters and strategy : Person identification networks (Stage-2) have been trained using the stochastic gradient descent optimizer with a learning rate of 0.005 for the first 40 epochs. Later we decreased it to 0.003 for the next 50 epochs, so as to use a dual diminishing learning rate for fast convergence and accuracy, with a fixed momentum throughout. All model parameters are shown in Table 1. After training our networks on the gray-scale silhouettes, we fine-tuned them on stereo images. These are created by stacking one image over the other and finding the difference at the pixel level of the two images, as shown in Fig. 4. Hence two consecutive video frames can be fed one by one to create a fused stereo image. On these images, we fine-tuned the model that had already been trained on silhouettes, with a learning rate of 0.001 and fixed momentum for 25 epochs. We then trained the model for the next 40 epochs at a learning rate of 0.005 and the last 50 epochs at a learning rate of 0.003. The model parameters for stereo training are shown in Table 2.
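The text does not fully specify how a stereo image is fused, so the following is only one possible reading: each pair of consecutive silhouettes is combined by taking the absolute pixel-wise difference. The function names `fuse_pair` and `stereo_sequence`, and the choice of absolute difference as the fusion rule, are assumptions made for illustration.

```python
import numpy as np


def fuse_pair(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Absolute pixel-wise difference of two uint8 silhouettes (assumed fusion rule)."""
    # Cast to a signed type first so the subtraction cannot wrap around in uint8.
    a = prev_frame.astype(np.int16)
    b = next_frame.astype(np.int16)
    return np.abs(a - b).astype(np.uint8)


def stereo_sequence(frames: np.ndarray) -> np.ndarray:
    """Fuse every pair of consecutive frames; N frames give N - 1 fused images."""
    return np.stack([fuse_pair(frames[i], frames[i + 1])
                     for i in range(len(frames) - 1)])
```

Under this reading, pixels that stay constant between frames cancel out, so the fused image highlights where the silhouette has moved, which is loosely similar in spirit to the optical flow input used in [16].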

3.4 Analysis

In this subsection we have presented a detailed and rigorous multi-stage as well as stereo/partial overlapping performance analysis.

3.4.1 Stage-1 : View Point Angle Identification

The training and testing at stage-1 have been performed on images acquired from the same type of sensor. Our model has achieved perfect accuracy for angle identification. It can be clearly observed from Fig. 5 that the stage-1 network has successfully learned the underlying latent feature representation at different layers of the hierarchy. The perfect angle identification is explained by the discriminative spatio-temporal representations shown in Fig. 5. We believe that this contributes most significantly to the efficiency and accuracy of our overall multi-stage network.

Figure 5: Visualizations of various convolution layers for different angles, for proposed angle detection network
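Angle (degrees) | Wolf et al. [16] | VGR-Net | VGR-Net (stereo images)
----------------|------------------|---------|------------------------
0               | 96.30%           | 98.33%  | 98.67%
18              | 98.20%           | 99.17%  | 99.55%
36              | 98.50%           | 99.17%  | 100%
54              | 95.40%           | 96.67%  | 99.55%
72              | 94.30%           | 97.92%  | 97.78%
90              | 99.90%           | 97.08%  | 97.79%
108             | 98.60%           | 97.91%  | 98.67%
126             | 97%              | 97.08%  | 96.90%
144             | 97.40%           | 96.25%  | 96.38%
162             | 99.20%           | 96.67%  | 96.10%
180             | 96.10%           | 97.08%  | 97.69%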
Table 3: The Comparative Analysis of Quality Performance in CCR(%) on proposed architecture and the previous architecture for different View-Point Angles for Gray-Scale and Stereo Images. Green colored cell indicates highest accuracy for that angle.

3.4.2 Stage-2 : Person Identification

In this stage, personal identifications task has been performed for a given view point angle. Table 3, indicates the computed performance of the proposed architecture on CASIA-B dataset [1]. It has been observed that the proposed architecture performs better than state-of-the-art model [16], in out of viewpoint angles.

Angular performance analysis : For obtuse viewing angles, we have visually observed that the subject’s velocity with respect to the camera is much greater than for acute angles. This makes the frame overlap between the clips (used for classification) very small for obtuse angles. In [16], a huge number of overlapping clips were used for training/testing, resulting in better performance for obtuse angles but at the expense of more time and space. In this work, to optimize time and space, we have used non-overlapping clips in addition to the multi-stage network. This strategy enables us to achieve better performance for acute angles as well as some obtuse angles.

Figure 6: Visualizations of the second model for two different subjects at 90 degree and 36 degree

Experimentation with stereo images : Since [16] also uses optical flow for prediction, we have also tested our model on stereo images. The results are shown in Table 3, which shows that using stereo images instead of normal silhouettes leads to higher accuracy, especially for acute angles, primarily for the reason mentioned above.

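Angle (degrees) | Wolf et al. [16] | VGR-Net | VGR-Net (partial overlap)
----------------|------------------|---------|--------------------------
90              | 99.90%           | 97.08%  | 98.55%
108             | 98.60%           | 97.91%  | 98.75%
144             | 97.40%           | 96.25%  | 97.50%
162             | 99.20%           | 96.67%  | 98.33%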

Table 4: The Comparative Analysis of Quality Performance in CCR(%) on proposed architecture and the previous architecture for different View-Point Angles for Gray-Scale and with partial overlap. Green colored cell indicates highest accuracy for that angle.

Experiment with “partial overlap” : It is well understood that overlapping clips can handle the problem adequately, but at the expense of more time and space. Hence we have also tested our network with a minimal overlap (almost 50% as considered in [16]). In our model the overlapping is done as follows: frames 1 to 16 form the 1st clip, frames 8 to 24 form the 2nd clip, and so on. This is in contrast to [16], which uses complete overlap between the clips, resulting in very slow training and testing. We have tested the model with “partial overlap” only at 90, 108, 144 and 162 degrees to justify the above arguments. The “partial overlap” results are shown in Table 4. One can observe that with “partial overlap” the performance of our model increases at all four angles and surpasses [16] at 108 and 144 degrees.
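The partial-overlap clipping can be sketched as a sliding window over frame indices. This sketch assumes a clip length of 16 and a stride of 8 (about 50% overlap), using 0-based half-open index ranges; the function name `overlapping_clips` is hypothetical, and the exact frame boundaries used by the authors may differ slightly from this reading.

```python
from typing import List, Tuple


def overlapping_clips(n_frames: int,
                      clip_len: int = 16,
                      stride: int = 8) -> List[Tuple[int, int]]:
    """Return (start, end) frame index ranges, end exclusive, for each clip."""
    return [(start, start + clip_len)
            for start in range(0, n_frames - clip_len + 1, stride)]
```

Setting `stride` equal to `clip_len` recovers the non-overlapping case, while a stride of 1 corresponds to complete overlap. For a 40-frame video, the default stride of 8 produces four clips instead of the two obtained without overlap, which illustrates the extra time and space cost.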

3.5 Visualizations

We have visualized the features extracted at different layers of our proposed network, since these show what each layer actually sees. We have analyzed them critically to understand what our network has learned and how it interprets an image/clip.

Angle Identification : The Fig. 5 depicts outputs of , and convolutional layers of the network model, at different angles, for angle identification network. Clearly it is visible that the filters have learned the spatial as well as the temporal features that can easily discriminate the angles. The first convolution layer consists of filters of size and they are trained to learn the motion in three adjacent video frames. Since at 0 degree, the person comes towards the camera, that appears to be just one silhouette that overlaps the previous frames silhouettes. Since till now no pooling has been done in spatial or temporal domain, the silhouette edges are clearly visible. In second convolution layer one can notice that the filters learn deeper and much more complex and higher level features. Since pooling has been done in the spatial domain, the edges seem to fade a little and only the basic outline of the silhouettes tend to remain. However, as yet no pooling has been done in the time variant zone, therefore similar to the first layer, only features from adjacent three frames of the video are learned by a filter.

From the output of this layer it is clearly visible that degree can be quite easy to differentiate from the rest of the angles because the successive silhouettes (for other angles) overlap with the existing ones and thus overshadow them, leaving the latest silhouettes visible. The output of the convolution layer aggregates the features in all three dimensions, as just before it a max pooling of 2 with a stride of 2 has been done. Hence filters in this layer learn features at another scale. The filters learn motion in consecutive frames, discarding all other irrelevant spatial information. Since the aim of this network is to identify the view angle, these deeper filters learn the direction of flow in the frames and lose the information about the person’s identity that may be present in the spatial domain.

Personal Identification : In our stage-2 model (personal identification network), we have noticed that the neurons try to learn not only the movement but also the salient spatial features in the silhouettes. This is because the network now needs to learn those latent features that can effectively differentiate between various subjects, as shown in Fig. 6. During the initial convolution layers like , and the filters tend to learn local features such as the basic outline of the silhouettes and the aggregated motion in the successive silhouettes, which can help the later classifiers to distinguish the different subjects. As we move on to the deeper convolution layers in the network such as we have observed that the neurons tend to learn more global features. Here only those neurons which are responsible for the classification of that subject tend to fire. It can be observed in Fig. 6 that only some of the neurons fire, because they are the classification neurons for that particular subject. Fig. 6 shows a comparative visualization of two subjects at two different angles, 90 and 36 degrees. At the former angle (90 degrees) our results are slightly lower, as the features seen by the network tend to become more general and ambiguous for some subjects at the complete profile angle, whereas sparse and unique features are extracted by the network at 36 degrees.

4 Conclusion

We have proposed an architecture for multi-view gait recognition that is efficient (in terms of time and space) and accurate (in terms of performance) in comparison to the present state-of-the-art network [16]. Our network is viewpoint invariant due to its multi-stage architecture. We have used neither completely overlapping clips nor pre-computed optical-flow features, as the latter need to be computed outside the network and slow it down significantly. Instead we have used only the basic silhouette, which considerably decreases the running time of the whole system while achieving better or comparable results to the present state-of-the-art system. To show that overlapping and optical-flow-like information can boost the system performance, we have also utilized stereo images as well as partially overlapping clips, which improve performance but at additional cost.


  • [1] Casia gait dataset b : A large multi-view gait database., 2005.
  • [2] Y. Feng, Y. Li, and J. Luo. Learning effective gait features using lstm. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 325–330. IEEE, 2016.
  • [3] M. Hofmann and G. Rigoll. Exploiting gradient histograms for gait-based person identification. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 4171–4175. IEEE, 2013.
  • [4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
  • [5] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception & psychophysics, 14(2):201–211, 1973.
  • [6] A. Kale, A. Sundaresan, A. Rajagopalan, N. P. Cuntoor, A. K. Roy-Chowdhury, V. Kruger, and R. Chellappa. Identification of humans using gait. IEEE Transactions on image processing, 13(9):1163–1173, 2004.
  • [7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [9] W. Kusakunniran, Q. Wu, J. Zhang, and H. Li. Gait recognition under various viewing angles based on correlated motion regression. IEEE transactions on circuits and systems for video technology, 22(6):966–980, 2012.
  • [10] Y. Liu, R. Collins, and Y. Tsin. Gait sequence analysis using frieze patterns. In European Conference on Computer Vision, pages 657–671. Springer, 2002.
  • [11] J. Man and B. Bhanu. Individual recognition using gait energy image. IEEE transactions on pattern analysis and machine intelligence, 28(2):316–322, 2006.
  • [12] D. Muramatsu, A. Shiraishi, Y. Makihara, M. Z. Uddin, and Y. Yagi. Gait-based person recognition using arbitrary view transformation model. IEEE Transactions on Image Processing, 24(1):140–154, 2015.
  • [13] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • [14] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2432–2439. IEEE, 2010.
  • [15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [16] T. Wolf, M. Babaee, and G. Rigoll. Multi-view gait recognition using 3d convolutional neural networks. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 4165–4169. IEEE, 2016.
  • [17] S. Yu, H. Chen, E. B. G. Reyes, A. Center, and N. Poh. Gaitgan: Invariant gait feature extraction using generative adversarial networks. 2017.
  • [18] S. Zheng, J. Zhang, K. Huang, R. He, and T. Tan. Robust view transformation model for gait recognition. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 2073–2076. IEEE, 2011.