Biometrics such as the face, iris, fingerprint and signature are widely used for human identity authentication. One major limitation of these biometrics is that acquiring them requires subject cooperation, which is difficult to obtain in an uncooperative environment. Gait, as an important biometric cue, overcomes this limitation since it can be easily captured by a surveillance camera at a long distance in uncontrolled scenarios without subject cooperation.
Nowadays, a large number of surveillance cameras are installed in almost every corner of cities, such as shopping malls, hotels, airports and rail stations. These cameras provide a large volume of useful data for crime investigation and forensic identification. Among the techniques used in surveillance and forensic identification, gait recognition is one of the most powerful. It has already been applied in a real case to convict criminals [5]. However, gait recognition is still a challenging task due to large variations in walking speed, clothing, viewpoint and carrying conditions. Many methods have been proposed to address these problems, and most of them can be grouped into model-based and appearance-based approaches.
The model-based approaches try to build models that reconstruct the underlying structure of the human body from video sequences. For example, [4] and [13] represent the structure of a body with a small set of static parameters, including the height of the body, the distance between the head and the pelvis, the distance between the pelvis and each foot, and the distance between the left and right foot. Gait recognition is then performed based on these parameters. Ariyanto et al. [1] used 3D gait data reconstructed from multiple cameras to perform recognition. View variation may not be an issue when multiple cameras are available. 3D data conveys more information than 2D data and can therefore achieve high accuracy. However, 3D data acquisition is costly and must be conducted in a controlled environment, which limits its application.
The appearance-based approaches take surveillance image sequences as input instead of modeling the underlying structure of the human body. To reduce the impact of clothing, silhouette-based representation is prevalent within the gait recognition community [17, 19, 20]. The first step of silhouette-based representation is extracting a binary silhouette sequence from a video sequence. Several methods can then be used to aggregate the gait silhouette sequence into one image. Gait Energy Image (GEI) [20] is one of the most popular representations; it is obtained by averaging the silhouette sequence over one or more complete gait cycles. Although only a single image is generated, GEI encodes the spatial and temporal information of a gait cycle and thus achieves promising results. Based on GEI, various approaches have been proposed to enhance the performance of gait recognition. Tao et al. [29] proposed Gabor features, which are obtained by convolving the averaged gait image with Gabor filters. Xu et al. [32] proposed a patch distribution feature that represents each GEI as a set of local augmented Gabor features. Similarly, Guan and Li [10] convolved GEI with Gabor filters at five scales and eight orientations to generate a Gabor-GEI feature template.
Other features have also been developed to represent the motion and/or appearance information of gait silhouette sequences. Inspired by the Motion History Image (MHI) [3], which was developed for human action recognition, Lam and Lee [16] proposed the Motion Silhouettes Image (MSI) to embed the spatial and temporal information of gait silhouettes. Later, Lam et al. [17] introduced the Gait Flow Image (GFI) for gait recognition. Bashir et al. [2] proposed the Gait Entropy Image (GEnI), which captures most of the motion information and encodes it in a single image.
The most intractable problem in gait biometrics is cross-view gait recognition, which has been an active research direction for years. Numerous studies have tried to tackle this problem. As mentioned above, model-based methods, especially 3D model-based methods, are good solutions despite their high cost. Appearance-based methods either focus on extracting view-invariant gait features or project features extracted from different viewpoints into a subspace that minimizes the variance caused by view change [7, 27, 31, 35]. Reviews of gait recognition can be found in [9] and [24].
Recently, deep learning has achieved great success in several areas of computer vision, including human pose estimation and face recognition. In these research areas, deep learning methods, especially Convolutional Neural Networks (CNNs), have made significant progress by learning rich features from large volumes of training data.
Deep learning methods have also been explored for gait recognition and have achieved remarkable improvements. A CNN can automatically extract hierarchical features from a given image, which is far more efficient than designing features by hand. In addition to feature extraction, deep learning based similarity measures have also been proposed, among which the Siamese network is the most popular. The Siamese architecture learns a similarity metric between a pair of inputs by learning feature representations that make the intra-class distance small and the inter-class distance large [8, 28].
A Siamese neural network contains two parallel branches sharing the same weights. In the training stage, pairs of similar and dissimilar data are fed into the two branches separately. The outputs of the two branches are then combined by matching layers to compute the contrastive loss, and the model is trained with the back-propagation algorithm. In the testing stage, the Siamese network calculates the distance between the query input and every gallery sample, and chooses the closest gallery sample as the result. Siamese networks have been applied to gait recognition in two previous works, whose inputs are GEI and CGI respectively. The Siamese neural network serves two purposes: the first is extracting features from the input image; the second is mapping the features to the target space defined by the specific task. In this work, we argue that useful information can be learned directly from the raw data, namely the binary silhouette sequence, and fused at the feature level instead of the data level. To achieve this, we propose an improved Siamese neural network that learns features directly from raw silhouette sequence images and fuses them by layer-wise pooling. Subsequently, additional convolutional layers are applied to map the fused features into the task space. The method can be used for cross-view gait recognition. We test it on the OU-ISIR large population dataset [11] and obtain promising performance that is comparable with the state of the art.
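As an illustration of the generic Siamese training objective described above, the following sketch computes one common form of the contrastive loss for a single pair of branch outputs. The margin value, the 2-D toy embeddings and the function names are assumptions made for this example; they are not taken from the networks discussed in this paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(feat_a, feat_b, same, margin=1.0):
    """Contrastive loss for one pair of branch outputs.

    same=True  -> penalise any distance (pull the pair together)
    same=False -> penalise only pairs closer than `margin` (push apart)
    The margin of 1.0 is an assumed value for illustration.
    """
    d = euclidean(feat_a, feat_b)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# Toy branch outputs (hypothetical 2-D embeddings).
anchor = [0.0, 0.0]
near = [0.3, 0.4]   # distance 0.5 from the anchor
far = [3.0, 4.0]    # distance 5.0 from the anchor

loss_similar_near = contrastive_loss(anchor, near, same=True)
loss_dissimilar_near = contrastive_loss(anchor, near, same=False)
loss_dissimilar_far = contrastive_loss(anchor, far, same=False)
```

A dissimilar pair that is already farther apart than the margin contributes no loss, which is why `loss_dissimilar_far` is zero in this toy case.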
2 Proposed method
The overall architecture of our method is shown in Figure 1. Each silhouette and the difference between two adjacent frames are fed to a convolutional neural network (fCNN for short) that extracts features representing the gait information at that moment. Inspired by the spatial pooling within a feature layer and the fact that adjacent frames are highly correlated, we explore a layer-wise pooling method to fuse the outputs of fCNN. Layer-wise pooling converts gait sequences of arbitrary length into fixed-size feature maps that preserve spatial and temporal gait information. What follows is a classical Siamese network architecture. Given a pair of sequences, the corresponding feature maps are obtained through fCNN and layer-wise pooling. Then the absolute difference between the two fixed-size feature maps is computed. A one-layer CNN (mCNN for short) maps the difference features into a vector; more layers are also feasible. A fully connected module converts this vector into two probabilities indicating whether or not the paired inputs belong to the same person.
To predict the identity of a given probe sample, the whole network infers the similarity between the probe gait sequence and every gallery gait sequence. The identity of the most similar gallery sequence is then assigned to the probe sample.
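The identification rule above can be sketched as a search for the best-scoring gallery entry. In this minimal example the trained network is replaced by a hypothetical `toy_similarity` function on small vectors; the names and data are illustrative assumptions only.

```python
def identify(probe, gallery, similarity):
    """Return the gallery identity whose sequence is most similar to the probe.

    gallery: list of (identity, sequence) tuples
    similarity: callable(probe, sequence) -> float, higher means more alike
    """
    best_id, best_score = None, float("-inf")
    for identity, sequence in gallery:
        score = similarity(probe, sequence)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id, best_score

# Stand-in for the trained network: a hypothetical similarity on toy vectors.
def toy_similarity(a, b):
    return -sum(abs(x - y) for x, y in zip(a, b))

gallery = [
    ("subject_A", [0.1, 0.9]),
    ("subject_B", [0.8, 0.2]),
    ("subject_C", [0.5, 0.5]),
]
predicted_id, score = identify([0.75, 0.25], gallery, toy_similarity)
```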
As shown in Figure 1, the silhouette image and the difference image between two silhouettes are processed by fCNN. The detailed parameters of fCNN are shown in the left part of Figure 2. fCNN contains two convolutional layers: the first has 16 filters and the second has 64 filters. All filters are of size 7×7 and are applied with a stride of one. Spatial pooling and local response normalization (LRN) are appended after each convolutional layer. The spatial pooling operations are applied over 2×2 neighborhoods. The LRN arguments are set to the values recommended in [15]. After the first convolutional and pooling layers we obtain 16 feature maps of size 60×60, and after the second convolutional and pooling layers we obtain 64 feature maps. For notational simplicity, we refer to fCNN as a function $f$, which takes a gray silhouette image and a difference image as input and produces feature maps as output. The size of $f(x)$ is 64×27×27. Let $X = \{x_1, \dots, x_T\}$ be the input sequence data of length $T$, where one channel of $x_t$ is the image at time $t$ and the other channel is the difference between the image at $t$ and the image at $t-1$. The silhouette image at time 0 is ignored because it has no previous image. Note that the layer-wise pooling introduced in the following subsection can fuse an arbitrary number of frames, so the length $T$ is not fixed. Each element $x_t$ goes through fCNN to produce the feature maps $f(x_t)$.
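The feature map sizes quoted for fCNN can be checked with simple shape arithmetic. The sketch below assumes unpadded convolutions and non-overlapping 2×2 pooling, which is the setting that reproduces the stated 64×27×27 output from the 126×126 silhouettes used in the experiments; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Output side length of non-overlapping square pooling."""
    return size // window

def fcnn_shapes(input_size=126):
    """Trace (channels, height, width) after each of the two fCNN stages."""
    shapes = []
    size = input_size
    for channels in (16, 64):
        size = conv_out(size, kernel=7)  # 7x7 filters, stride 1, no padding (assumed)
        size = pool_out(size, window=2)  # 2x2 pooling with stride 2 (assumed)
        shapes.append((channels, size, size))
    return shapes

stage_shapes = fcnn_shapes()
```

LRN is omitted from the trace because it normalizes values without changing the shape of the feature maps.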
2.2 Feature map pooling
One straightforward way to handle a temporal sequence is to use recurrent neural networks to encode information across time. Another widely used method is the 3D CNN, which was developed by Ji et al. [12] for action recognition. Inspired by the spatial pooling used in CNNs and the fact that adjacent frames in a video are highly redundant, we propose to use feature map pooling to aggregate the features extracted from the frames of a sequence.
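Feature maps for each frame can be obtained through fCNN. Similar to spatial pooling, there are two ways to aggregate these features: max pooling and mean pooling.
Max pooling keeps, at each position, the largest response over all frames:
$$F^{i}(x, y) = \max_{t \in \{1, \dots, T\}} f^{i}_{t}(x, y),$$
where $F^{i}(x, y)$ is the value of the $i$-th fused feature map at position $(x, y)$, and $f^{i}_{t}(x, y)$ is the value of the $i$-th feature map of frame $t$ at $(x, y)$. In this way, an arbitrary number of feature maps are merged into one whose size is 64×27×27.
Mean pooling is also a commonly used aggregation strategy. It is used here to produce a single set of feature maps averaged over all the extracted feature maps, as follows:
$$F^{i}(x, y) = \frac{1}{T} \sum_{t=1}^{T} f^{i}_{t}(x, y).$$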
In this paper, we tested both pooling methods. In our experiments, mean pooling was around 5% worse than max pooling in average cross-view identification precision, so max pooling was chosen.
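The two aggregation strategies can be illustrated on toy data. The sketch below fuses per-frame feature maps element-wise across time using nested Python lists; the function name `fuse` and the tiny 1×2×2 feature maps are assumptions for illustration, whereas the real maps are 64×27×27.

```python
def fuse(frames, mode="max"):
    """Fuse per-frame feature maps element-wise across time.

    frames: list of T feature-map stacks, each indexed [channel][row][col]
    Returns one stack of the same shape, whatever the value of T.
    """
    if not frames:
        raise ValueError("need at least one frame")
    reduce_fn = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    channels = len(frames[0])
    rows = len(frames[0][0])
    cols = len(frames[0][0][0])
    return [
        [
            [reduce_fn([frame[c][r][k] for frame in frames]) for k in range(cols)]
            for r in range(rows)
        ]
        for c in range(channels)
    ]

# Three toy frames, each with 1 channel of size 2x2.
frames = [
    [[[1.0, 0.0], [2.0, 5.0]]],
    [[[3.0, 1.0], [2.0, 4.0]]],
    [[[2.0, 2.0], [2.0, 0.0]]],
]
fused_max = fuse(frames, "max")
fused_mean = fuse(frames, "mean")
fused_two_frames = fuse(frames[:2], "max")  # shorter sequence, same output shape
```

Because the reduction runs over the frame axis only, sequences of different lengths yield fused maps of identical size, which is what allows the later layers to have a fixed input shape.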
2.3 Similarity measurement
Given a pair of fused feature maps for two sequences, the task is to determine whether the two sequences represent the same person. To this end, a network similar to a Siamese neural network is employed to measure the similarity between the two fused feature maps. First, the absolute difference between the two fused feature maps is computed. The resulting difference feature maps have the same size as the fused maps, which is 64×27×27. Then a mapping convolutional layer, mCNN, is applied to project the difference features to a similarity vector. mCNN is a one-layer convolutional network with 256 filters of size 7×7. The details of the mCNN module are shown in the middle part of Figure 2. The output size of mCNN is 256×21×21, which is reshaped to a one-dimensional vector with 112896 elements. The following fully connected layers take this vector as input and produce the final result. Details of the fully connected module are shown in the right part of Figure 2.
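A small sketch of the first step and of the size bookkeeping in this subsection follows. It computes the element-wise absolute difference on toy maps and checks the length of the flattened mCNN output, assuming a stride of one and no padding; the function names are hypothetical.

```python
def abs_difference(fused_a, fused_b):
    """Element-wise |a - b| for two stacks indexed [channel][row][col]."""
    return [
        [
            [abs(x - y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(ch_a, ch_b)
        ]
        for ch_a, ch_b in zip(fused_a, fused_b)
    ]

def mcnn_flat_length(in_size=27, kernel=7, filters=256):
    """Length of the flattened mCNN output, assuming stride 1 and no padding."""
    out_size = in_size - kernel + 1
    return filters * out_size * out_size

# Toy fused maps with 1 channel of size 2x2.
diff = abs_difference(
    [[[1.0, 4.0], [0.5, 2.0]]],
    [[[3.0, 1.0], [0.5, 5.0]]],
)
flat_length = mcnn_flat_length()
```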
We test our method on the OU-ISIR large population dataset [11], as it is the largest gait dataset suitable for training deep neural networks. There are two versions of the OU-ISIR large population dataset: OULP-C1V1 and OULP-C1V2 (http://www.am.sanken.osaka-u.ac.jp/BiometricDB/GaitLP.html). OULP-C1V1 contains 4,007 subjects, while OULP-C1V2 includes 4,016 subjects. Aside from this difference, OULP-C1V2 has a more accurate bounding box for each silhouette and differs in the size of the moving-average filter applied in the size-normalized silhouette creation process. In this work, the first version, OULP-C1V1, was used to evaluate the performance of our method. Figure 3 shows the full silhouette images of subject OULP-C1V1-6218964 at four observation views: 55, 65, 75 and 85 degrees.
Following the protocol of previous work, only a subset of OULP-C1V1 is used to test our method. The subset contains 1912 subjects. Each subject has probe and gallery gait sequences at four view angles (55, 65, 75 and 85 degrees), giving 8 sequences per subject. The sequence length ranges from 19 to 43 frames.
The network inputs are pairs of gait sequences of arbitrary length. Each silhouette is resized to 126×126. In each mini-batch, half of the input pairs have the same identity. For a probe sequence, we pick its corresponding gallery sequence at a random view angle to form a positive training pair. Similarly, a gallery sequence with a different identity is selected to form a negative training pair.
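The pair construction described above might be sketched as follows. The function `sample_pairs`, the representation of a sequence by a `(subject_id, view)` key, and the random choice of view for negative gallery sequences are assumptions for illustration; the source only states that positive gallery views are chosen at random and that half of each mini-batch is positive.

```python
import random

VIEWS = (55, 65, 75, 85)

def sample_pairs(subject_ids, batch_size, rng):
    """Build a mini-batch of (probe_key, gallery_key, label) tuples.

    A key is (subject_id, view); label 1 = same identity, 0 = different.
    Half of the batch is positive and the other half negative.
    """
    pairs = []
    for i in range(batch_size):
        probe_id = rng.choice(subject_ids)
        probe_view = rng.choice(VIEWS)
        if i < batch_size // 2:
            gallery_id, label = probe_id, 1
        else:
            # Any subject other than the probe subject (assumed uniform choice).
            gallery_id = rng.choice([s for s in subject_ids if s != probe_id])
            label = 0
        gallery_view = rng.choice(VIEWS)
        pairs.append(((probe_id, probe_view), (gallery_id, gallery_view), label))
    rng.shuffle(pairs)
    return pairs

batch = sample_pairs(list(range(20)), batch_size=8, rng=random.Random(0))
```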
We use the negative log-likelihood loss and stochastic gradient descent to train our network.
The 1912 subjects are divided into two groups of equal size for training and testing, without overlap (http://www.am.sanken.osaka-u.ac.jp/~mansur/files/list_train_test.txt). Images are resized to 126×126 before being fed to the networks. We randomly select 100 subjects from the training set as a validation set, so 856 subjects are left for training. Note that no data augmentation is used during training. The mini-batch size was set to 128, the learning rate to 0.001 and the momentum to 0.0. The LRN parameters were set to the defaults suggested in [15]. The networks were written in Torch 7 and trained on an NVIDIA GeForce GTX Titan X. A validation test, which takes several minutes, is run every 100 iterations. Training ran for up to 1.8 million iterations and lasted 7 days. The model with the highest recognition rate was chosen for evaluation. Figure 4 shows the loss curves and the increase in precision with respect to the number of iterations.
Given a probe gait sequence and a gallery gait set, the similarity between the probe sequence and each sequence in the gallery set is evaluated by the trained network. The probe gait sequence is assigned the identity of the most similar sequence in the gallery. For cross-view recognition, 16 recognition tasks must be carried out, one for each combination of the four probe views and four gallery views. Computing the cross-view gait recognition rate on the test set of 956 subjects is very slow, because each task requires a similarity measurement for every probe-gallery pair of subjects (956 × 956 pairs). To speed up the calculation, we store the fused features of each silhouette sequence and compute similarities only between fused features. This drastically reduces computing time: the 16 cross-view recognition tasks can be completed within 5 hours on a GPU.
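The caching idea can be sketched as follows: each sequence is fused once, and only the cheap comparison step is repeated for every probe-gallery pair. The function names and the toy stand-ins for the fusion and comparison stages are assumptions; in the actual system these would be fCNN with feature map pooling and mCNN with the fully connected module.

```python
def cross_view_scores(probes, galleries, fuse_fn, compare_fn):
    """Score every probe against every gallery entry, fusing each sequence once.

    probes, galleries: dict mapping a key to a raw sequence
    fuse_fn: sequence -> fused feature (the expensive per-sequence step)
    compare_fn: (fused_probe, fused_gallery) -> similarity
    """
    fused_probes = {k: fuse_fn(seq) for k, seq in probes.items()}
    fused_galleries = {k: fuse_fn(seq) for k, seq in galleries.items()}
    return {
        (p, g): compare_fn(fp, fg)
        for p, fp in fused_probes.items()
        for g, fg in fused_galleries.items()
    }

# Toy stand-ins that count how often the expensive step runs.
calls = {"fuse": 0}

def toy_fuse(seq):
    calls["fuse"] += 1
    return max(seq)

def toy_compare(a, b):
    return -abs(a - b)

probes = {"p1": [1, 4, 2], "p2": [7, 3]}
galleries = {"g1": [4, 0], "g2": [6, 7], "g3": [2, 2]}
scores = cross_view_scores(probes, galleries, toy_fuse, toy_compare)
```

With P probes and G gallery sequences, the expensive step runs P + G times instead of 2 × P × G times, while the number of comparisons stays at P × G.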
3.3 Impact of sequence length
Our method can take sequences of arbitrary length as input. However, a longer sequence may improve recognition performance. To evaluate this, we conduct experiments on sequence lengths ranging from 1 frame to 43 frames. The results are shown in Figure 5. The precision and EERs are averaged across the different view angles between probe and gallery sets.
We follow the protocol of previous work to test our method. Only a subset of 856 subjects is used for training, and evaluation is conducted on 956 subjects. Table 1 reports the performance of our method in terms of Rank-1, Rank-2 and Rank-5 recognition rates. Table 2 lists the equal error rates (EERs). These two tables show that the proposed method achieves promising results on the OULP-C1V1 gait dataset. To the best of our knowledge, no previous work reports the full set of cross-view EERs and recognition rates; only the EERs and recognition rates between an 85 degree gallery and each of the 55, 65 and 75 degree views were reported in previous work.
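For readers unfamiliar with the two metrics, the sketch below shows one simple way to compute a Rank-k recognition rate and an approximate EER from a matrix of similarity scores. The threshold scan and the toy scores are assumptions for illustration and are not the evaluation code used for the tables.

```python
def rank_k_rate(score_rows, true_index, k):
    """Fraction of probes whose true gallery entry is among the k best scores.

    score_rows[i][j]: similarity of probe i to gallery j (higher = more alike)
    true_index[i]: gallery index holding probe i's identity
    """
    hits = 0
    for scores, truth in zip(score_rows, true_index):
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if truth in order[:k]:
            hits += 1
    return hits / len(score_rows)

def equal_error_rate(genuine, impostor):
    """Approximate EER by scanning thresholds taken from the observed scores."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

score_rows = [
    [0.9, 0.5, 0.1],  # probe 0, true match is gallery 0
    [0.6, 0.4, 0.3],  # probe 1, true match is gallery 1 (ranked second)
    [0.1, 0.2, 0.8],  # probe 2, true match is gallery 2
]
true_index = [0, 1, 2]
rank1 = rank_k_rate(score_rows, true_index, k=1)
rank2 = rank_k_rate(score_rows, true_index, k=2)

genuine = [row[t] for row, t in zip(score_rows, true_index)]
impostor = [
    s
    for row, t in zip(score_rows, true_index)
    for j, s in enumerate(row)
    if j != t
]
eer = equal_error_rate(genuine, impostor)
```

Rank-k measures identification (is the right person near the top of the list?), whereas the EER measures verification (how well does a single threshold separate genuine from impostor pairs?), which is why a method can rank differently on the two.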
Furthermore, we compare our method with LDA, DATER, MvDA, GMLDA and CCA to demonstrate its effectiveness, as shown in Table 3. Our method outperforms these methods significantly in terms of EER, which is an important verification indicator. Both GMLDA and MvDA require view information as input and thus achieve a better identification rate than our method. Our method, in contrast, is blind to view angle information, yet it still performs well in terms of Rank-1 recognition rate.
This paper presents a novel CNN-based gait recognition method. The proposed network architecture combines the advantages of the convolutional neural network and the Siamese network, and evaluates the similarity between two silhouette sequences of arbitrary length instead of between GEIs. First, CNNs are used to extract features from each frame of a sequence and from its difference with the previous frame. Inspired by the spatial pooling used within feature maps, feature map pooling is employed to aggregate the features extracted from different frames. Subsequently, a one-layer CNN maps the difference of two fused features into the task space. Finally, fully connected layers perform recognition. Experiments on cross-view gait recognition are conducted on the OU-ISIR large population dataset. Our method outperforms other methods significantly in terms of EER. Specifically, it is approximately two times better than the other methods in verification accuracy.
This work was supported by the National Key Research and Development Plan (Grant No. 2016YFC0801002).
- [1] G. Ariyanto and M. S. Nixon. Model-based 3d gait biometrics. In International Joint Conference on Biometrics, pages 1–7. IEEE, 2011.
- [2] K. Bashir, T. Xiang, and S. Gong. Gait recognition using gait entropy image. In International Conference on Imaging for Crime Detection and Prevention. IET, 2009.
- [3] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.
- [4] A. F. Bobick and A. Y. Johnson. Gait recognition using static, activity-specific parameters. In Computer Vision and Pattern Recognition, volume 1, pages I–I. IEEE, 2001.
- [5] I. Bouchrika, M. Goffredo, J. Carter, and M. Nixon. On using gait in forensic biometrics. Journal of Forensic Sciences, 56(4):882–889, 2011.
- [6] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pages 717–732. Springer, 2016.
- [7] X. Chen, T. Yang, and J. Xu. Cross-view gait recognition based on human walking trajectory. Journal of Visual Communication and Image Representation, 25(8):1842–1855, 2014.
- [8] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, volume 1, pages 539–546. IEEE, 2005.
- [9] T. Connie, K. O. M. Goh, and A. B. J. Teoh. A review for gait recognition across view. In Information and Communication Technology (ICoICT), 2015 3rd International Conference on, pages 574–577. IEEE, 2015.
- [10] Y. Guan and C.-T. Li. A robust speed-invariant gait recognition system for walker and runner identification. In International Conference on Biometrics, pages 1–8. IEEE, 2013.
- [11] H. Iwama, M. Okumura, Y. Makihara, and Y. Yagi. The ou-isir gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE Transactions on Information Forensics and Security, 7(5):1511–1521, 2012.
- [12] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
- [13] A. Y. Johnson and A. F. Bobick. A multi-view method for gait recognition using static body parameters. In International Conference on Audio-and Video-Based Biometric Person Authentication, pages 301–311. Springer, 2001.
- [14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
- [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [16] T. Lam and R. Lee. A new representation for human gait recognition: Motion Silhouettes Image (MSI). Advances in Biometrics, pages 612–618, 2005.
- [17] T. H. Lam, K. H. Cheung, and J. N. Liu. Gait flow image: A silhouette-based gait representation for human identification. Pattern Recognition, 44(4):973–987, 2011.
- [18] N. Liu, J. Lu, and Y.-P. Tan. Joint subspace learning for view-invariant gait recognition. IEEE Signal Processing Letters, 18(7):431–434, 2011.
- [19] Z. Liu and S. Sarkar. Simplest representation yet for gait recognition: Averaged silhouette. In International Conference on Pattern Recognition, volume 4, pages 211–214. IEEE, 2004.
- [20] J. Man and B. Bhanu. Individual recognition using gait energy image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(2):316–322, 2006.
- [21] A. Mansur, Y. Makihara, D. Muramatsu, and Y. Yagi. Cross-view gait recognition using view-dependent discriminative analysis. In International Joint Conference on Biometrics, pages 1–8. IEEE, 2014.
- [22] N. McLaughlin, J. Martinez del Rincon, and P. Miller. Recurrent convolutional network for video-based person re-identification. In Computer Vision and Pattern Recognition, pages 1325–1334, 2016.
- [23] N. Otsu. Optimal linear and nonlinear solutions for least-square discriminant feature extraction. In International Conference on Pattern Recognition, pages 557–560, 1982.
- [24] C. Prakash, R. Kumar, and N. Mittal. Recent developments in human gait research: parameters, approaches, applications, machine learning techniques, datasets and challenges. Artificial Intelligence Review, pages 1–40, 2016.
- [25] R. D. Seely, S. Samangooei, M. Lee, J. N. Carter, and M. S. Nixon. The university of southampton multi-biometric tunnel and introducing a novel 3d gait dataset. In IEEE International Conference on Biometrics: Theory, Applications and Systems, pages 1–6. IEEE, 2008.
- [26] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs. Generalized multiview analysis: A discriminative latent space. In Computer Vision and Pattern Recognition, pages 2160–2167. IEEE, 2012.
- [27] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi. Geinet: View-invariant gait recognition using a convolutional neural network. In International Conference on Biometrics, pages 1–8. IEEE, 2016.
- [28] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
- [29] D. Tao, X. Li, X. Wu, and S. J. Maybank. General tensor discriminant analysis and gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007.
- [30] C. Wang, J. Zhang, J. Pu, X. Yuan, and L. Wang. Chrono-gait image: A novel temporal template for gait recognition. In European Conference on Computer Vision, pages 257–270. Springer, 2010.
- [31] Z. Wu, Y. Huang, L. Wang, X. Wang, and T. Tan. A comprehensive study on cross-view gait based human identification with deep cnns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
- [32] D. Xu, Y. Huang, Z. Zeng, and X. Xu. Human gait recognition using patch distribution feature and locality-constrained group sparse representation. IEEE Transactions on Image Processing, 21(1):316–326, 2012.
- [33] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.-J. Zhang. Discriminant analysis with tensor representation. In Computer Vision and Pattern Recognition, volume 1, pages 526–532. IEEE, 2005.
- [34] C. Zhang, W. Liu, H. Ma, and H. Fu. Siamese neural network based gait recognition for human identification. In International Conference on Acoustics, Speech and Signal Processing, pages 2832–2836. IEEE, 2016.
- [35] S. Zheng, J. Zhang, K. Huang, R. He, and T. Tan. Robust view transformation model for gait recognition. In International Conference on Image Processing, pages 2073–2076. IEEE, 2011.