Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identification

by Zhigang Chang, et al.

Video-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras. Traditional video-based person Re-ID methods focus on appearance information and are therefore vulnerable to illumination changes, scene noise, camera parameters, and especially clothing and carrying variations. Gait recognition provides an implicit biometric solution that alleviates these problems. Nonetheless, its performance degrades severely as the camera view varies. To address these problems, we propose a framework that uses the sequence masks (SeqMasks) of a video to tightly integrate appearance information and gait modeling. To validate the effectiveness of our method, we build a novel dataset named Mask-MARS based on MARS. Comprehensive experiments on Mask-MARS, our large-scale in-the-wild video Re-ID dataset, demonstrate the strong performance and generalization capability of our method. Validation on the gait recognition benchmark CASIA-B further demonstrates the capability of our hybrid model.




I Introduction

Video-based person re-identification (Re-ID) [5, 11, 17, 4, 2, 15] has drawn growing attention from researchers because of its wide range of applications, from cross-camera object tracking and pedestrian behavior analysis to video surveillance and criminal investigation. Traditional RGB-based video person Re-ID methods, which seek to build a discriminative appearance model, have gained momentum in recent years. However, their prerequisites are demanding: an appearance model is vulnerable to illumination changes, camera parameters, and scene clutter; it stays valid only for a short time because people change clothes; and uniforms and camouflage can easily defeat the system. It is therefore reasonable to investigate auxiliary information or characteristics.

Gait [16, 12, 14], which reflects the walking style of a pedestrian, offers such a characteristic. It is insensitive to resolution variation and extremely difficult to impersonate, which motivates using gait features for person Re-ID tasks [12]. Some gait recognition methods [3] extract implicit features from videos based on contours or articulated body representations (e.g., UV maps and 2D keypoints), where semantic segmentation (Mask R-CNN [7]) and human pose estimation (DensePose [6] and OpenPose [1]) can be used to model discriminative representations. However, this line of work is limited by pedestrian speed, camera perspective, video frame rate, and other factors, leading to low gait recognition performance, especially in practical scenarios. Video-based person Re-ID methods treat re-identification as sequence matching, which is likewise challenging because of random camera view angles and wild scenarios. The goals and settings of these two branches are strongly consistent, which naturally leads us to integrate them for better Re-ID performance in wild scenarios.

Fig. 1: Motivation

Fig. 2: Detailed structure of our Framework

In traditional video-based Re-ID, 3D convolution, RNN models, and temporal pooling are commonly used to build an embedding for a video sequence that integrates multiple kinds of information, such as appearance, motion (gait and other motions), and optical flow. These representations are tightly entangled, which makes the embedding hard to interpret and potentially hard to learn. Our motivation is to partially disentangle the gait representation by hand, by adding a gait branch to a video Re-ID framework. As shown in Fig. 1, the enhanced gait representation serves as a strong prior that joins the final embedding.

In this paper, we use the foreground masks of the video sequence to bridge the appearance branch and the gait branch. For the gait branch, we adopt a variant of the advanced gait recognition network GaitSet [3], with the foreground sequence masks as its inputs. For the appearance branch, the foreground sequence masks act as saliency maps that highlight the human foreground and suppress background interference. We also keep the original global branch to preserve the global information of the sequence. Finally, we concatenate the features from all branches and integrate them into a fused representation.

In summary, our contributions are three-fold. First, we propose a novel end-to-end video person Re-ID framework that exploits both appearance and gait information. Second, we build a novel large-scale video Re-ID dataset named Mask-MARS. Third, extensive experiments demonstrate the validity and effectiveness of our proposed model on both the Mask-MARS and CASIA-B datasets.

II Method

The overall network architecture is shown in Fig. 2. The network contains three modules: the Appearance Module, the Gait Module, and the Feature Fusion Module. The Appearance Module comprises the global branch and the foreground branch. For each input video sequence of T frames, we use a foreground extraction method (such as Mask R-CNN) to extract the pedestrian foreground sequence masks.

In the global branch of the Appearance Module, the backbone network extracts the feature maps of each frame of the sequence. Global average pooling and temporal average pooling then flatten and aggregate the feature maps of one sequence into a 2048-dimensional vector, which a bottleneck block reduces to 512 dimensions. The foreground branch reuses the output feature maps of the global branch backbone and multiplies them element-wise with the resized foreground masks, so that only the foreground region of the feature maps is average-pooled; the result is the foreground feature (also 2048 dimensions). Another bottleneck block reduces the foreground vector to 512 dimensions in the same way.
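The two pooling paths of the Appearance Module can be illustrated on a toy sequence. This is a minimal sketch in plain Python, not the authors' implementation: the function names, the small epsilon that guards against an empty mask, and the toy shapes (2 frames, 2 channels, a 2 × 2 grid instead of 2048 channels on 16 × 8) are all assumptions made for illustration.

```python
def global_average_pool(feature_map):
    """Average a C x H x W feature map over H and W, giving C values."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def masked_average_pool(feature_map, mask, eps=1e-6):
    """Average each channel over foreground positions only (mask value 1)."""
    area = sum(sum(row) for row in mask)
    pooled = []
    for channel in feature_map:
        weighted = sum(
            value * m
            for feat_row, mask_row in zip(channel, mask)
            for value, m in zip(feat_row, mask_row)
        )
        pooled.append(weighted / (area + eps))  # eps is an assumed safeguard
    return pooled


def temporal_average_pool(frame_vectors):
    """Average per-frame vectors over the T frames of a sequence."""
    t = len(frame_vectors)
    return [sum(values) / t for values in zip(*frame_vectors)]


# Toy sequence: T = 2 frames, C = 2 channels, 2 x 2 spatial grid.
frames = [
    [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]],
    [[[3.0, 5.0], [7.0, 9.0]], [[4.0, 4.0], [4.0, 4.0]]],
]
# Foreground masks already resized to the feature-map grid (left column).
masks = [
    [[1, 0], [1, 0]],
    [[1, 0], [1, 0]],
]

global_feature = temporal_average_pool([global_average_pool(f) for f in frames])
foreground_feature = temporal_average_pool(
    [masked_average_pool(f, m) for f, m in zip(frames, masks)]
)
```

Both branches read the same backbone feature maps; only the spatial weighting differs, which is why the foreground branch adds little extra computation.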

Meanwhile, the Gait Module randomly samples K frames of the foreground sequence masks as its inputs. The Gait Module (a variant of GaitSet [3]) outputs a 512-dimensional gait feature for each mask sequence.

Finally, the global appearance feature, the foreground appearance feature, and the gait feature are concatenated into a 1536-dimensional feature vector. The Feature Fusion Module turns it into a fused feature vector (also 1536 dimensions), which serves as the final representation of the video. The following subsections describe each module in detail.

II-A Backbone network and bottleneck structure

We adopt ResNet-50 [8] as the backbone network. By removing the final fully connected layer and changing the last stride from 2 to 1, we obtain a larger feature map that carries richer feature information. For a typical input image size of 256 × 128, the output of the modified ResNet-50 is 16 × 8.
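The effect of the stride change on the output size can be checked with a short calculation. The sketch below assumes the standard ResNet-50 stride layout (stem convolution, max-pooling, then four residual stages) and uses integer division, which is exact for the power-of-two sizes used here; the constant and function names are illustrative only.

```python
from math import prod

# Strides of the standard ResNet-50: stem conv, max-pool, four residual stages.
STANDARD_STRIDES = (2, 2, 1, 2, 2, 2)
# Same layout with the last stride changed from 2 to 1, as described in the text.
MODIFIED_STRIDES = (2, 2, 1, 2, 2, 1)


def output_size(height, width, strides):
    """Spatial size of the final feature map for a given input size."""
    total_stride = prod(strides)
    return height // total_stride, width // total_stride


standard = output_size(256, 128, STANDARD_STRIDES)  # total stride 32
modified = output_size(256, 128, MODIFIED_STRIDES)  # total stride 16
```

With the last stride set to 1, the total stride drops from 32 to 16, so the feature map grows from 8 × 4 to the 16 × 8 grid that the foreground masks are resized to.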

Both the global branch and the foreground branch adopt the bottleneck structure. The bottleneck, depicted in Fig. 2 (3), comprises two fully connected layers, each followed by a normalization layer and a ReLU function. Compared to a single fully connected layer (2048 × 512), the bottleneck structure notably reduces the number of network parameters.
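The parameter saving can be made concrete with a rough weight count. The hidden width of the bottleneck is not stated in the text, so the value 128 below is purely hypothetical, and biases and normalization parameters are ignored; the sketch only shows how the comparison works.

```python
def linear_weights(n_in, n_out):
    """Weight count of one fully connected layer (bias omitted)."""
    return n_in * n_out


def bottleneck_weights(n_in, hidden, n_out):
    """Weight count of two stacked fully connected layers."""
    return linear_weights(n_in, hidden) + linear_weights(hidden, n_out)


HIDDEN = 128  # hypothetical hidden width; the paper does not give this value

direct = linear_weights(2048, 512)
bottleneck = bottleneck_weights(2048, HIDDEN, 512)
saving = 1.0 - bottleneck / direct
```

Under this assumption the two-layer bottleneck needs 327,680 weights instead of 1,048,576. In general the two-layer form is smaller whenever the hidden width is below 2048 × 512 / (2048 + 512) ≈ 410.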

II-B Gait Branch

We adopt and modify the state-of-the-art gait recognition method GaitSet [3] to obtain the gait feature. Unlike template-based and sequence-based methods, GaitSet treats the input as an unordered set of pedestrian contour images. Because pedestrian contours at different time frames have visibly different shapes, a shuffled contour sequence can still be rearranged into the correct order from the shapes alone. A set-based method must therefore be permutation invariant: the result does not depend on the order of the input contours.

Compared with the original GaitSet, the modified network uses fewer horizontal strips at the end of the network, which reduces the feature dimension at the cost of a slight drop in performance. The modified GaitSet architecture is shown in Fig. 3. The architecture before Global Pooling is the same as in the original network (also shown in detail in Fig. 2). The Set Pooling (SP) module integrates the frame-level features into a set-level feature. A deeper convolutional layer has a larger receptive field: shallow feature maps contain more local, fine-grained information, while deep feature maps contain global, coarse-grained information. The Multilayer Global Pipeline (MGP) is therefore designed to aggregate the set-level features from the outputs of different layers. The MGP has the same convolutional and pooling layer design as the main branch, but does not share parameters with it.

We obtain two 128-dimension vectors from the main branch and MGP after global pooling. After two independent fully connected layers, vectors are mapped to 256 dimensions. During the training phase, these two feature vectors are associated with two separate training losses. In the inference stage, they will be concatenated into a 512-dimensional feature vector.

Set Pooling is designed as shown in Fig. 3. It uses three order-independent statistical operations (max, mean, and median in the original GaitSet [3]). A 1 × 1 convolutional layer fuses the concatenated statistical features. An attention mechanism and a residual structure are also used to aggregate the features and maintain information capacity while accelerating and stabilizing convergence. Global Pooling computes the sum of Global Average Pooling and Global Max Pooling.
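The order independence of Set Pooling can be demonstrated on a toy set of frame features. This sketch assumes the three statistics are element-wise max, mean, and median, as in the original GaitSet; the 1 × 1 convolution, attention, and residual parts are omitted, and all names are illustrative.

```python
import random
from statistics import mean, median


def set_statistics(frame_features):
    """Element-wise max, mean and median over an unordered set of frames."""
    columns = list(zip(*frame_features))
    return (
        [max(c) for c in columns],
        [mean(c) for c in columns],
        [median(c) for c in columns],
    )


def global_pool(feature_map):
    """Sum of global average pooling and global max pooling over one map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values) + max(values)


# Three frames, each described by a 3-dimensional feature.
frames = [[0.0, 2.0, 5.0], [1.0, 4.0, 3.0], [6.0, 0.0, 1.0]]
shuffled = frames[:]
random.Random(0).shuffle(shuffled)

stats = set_statistics(frames)
order_independent = stats == set_statistics(shuffled)
pooled = global_pool([[1.0, 3.0], [0.0, 4.0]])  # average 2.0 plus maximum 4.0
```

Because every statistic is computed per feature dimension across the whole set, shuffling the frames leaves the set-level feature unchanged.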

Fig. 3: The design of the GaitSet

II-C Feature Fusion Module (FFM)

Following the channel attention mechanism of SENet, we design a feature fusion network, shown in Fig. 2. The global appearance feature, the foreground appearance feature, and the gait feature are concatenated into a 1536-dimensional vector. A bottleneck (reduction ratio = 8) then forms a channel attention mechanism that promotes the exchange of information between the different features. A residual structure is used to accelerate and stabilize convergence. Finally, a 1536-dimensional fused feature vector is obtained.
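A squeeze-and-excitation style fusion over the concatenated vector could look like the sketch below. It is an assumption-laden illustration in plain Python: the weights are random stand-ins for learned parameters, the residual is taken to be "input plus gated input", and biases and normalization are left out; none of these details are specified in the text.

```python
import math
import random


def linear(weights, vector):
    """Matrix-vector product with a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def se_fusion(features, w_reduce, w_expand):
    """Channel attention with a bottleneck, followed by a residual sum."""
    hidden = [max(0.0, h) for h in linear(w_reduce, features)]  # reduce + ReLU
    gates = [1.0 / (1.0 + math.exp(-g)) for g in linear(w_expand, hidden)]
    return [x + x * g for x, g in zip(features, gates)]  # assumed residual form


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, RATIO = 1536, 8  # 3 x 512 features, bottleneck of 1536 / 8 = 192 units

global_feat = [rng.random() for _ in range(512)]
foreground_feat = [rng.random() for _ in range(512)]
gait_feat = [rng.random() for _ in range(512)]
concat = global_feat + foreground_feat + gait_feat

w_reduce = random_matrix(DIM // RATIO, DIM, rng)
w_expand = random_matrix(DIM, DIM // RATIO, rng)
fused = se_fusion(concat, w_reduce, w_expand)
```

Each gate depends on all 1536 inputs through the 192-unit bottleneck, which is how appearance and gait channels can influence each other while the output keeps the same dimension.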

II-D Loss Functions

During the training phase, the batch-all triplet loss [9], the batch-hard triplet loss [9], and the softmax loss with Label-Smoothing Regularization (LSR) [13] are employed to train the network. The loss of the Appearance Module is given by Eq. 1; the losses of the Gait Module and the Feature Fusion Module are given by Eq. 2.


The total loss is given by Eq. 3.
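Two of the loss terms named above can be sketched in a few lines. This follows the usual definitions of the batch-hard triplet loss [9] and label smoothing [13]; the margin of 0.3 and the smoothing factor of 0.1 are hypothetical values, and the sketch does not show how the paper weights or combines the terms.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Per anchor: hinge on (farthest positive - closest negative + margin)."""
    losses = []
    for i, anchor in enumerate(embeddings):
        positives, negatives = [], []
        for j, other in enumerate(embeddings):
            if i == j:
                continue
            distance = euclidean(anchor, other)
            (positives if labels[i] == labels[j] else negatives).append(distance)
        if positives and negatives:
            losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0


def smoothed_targets(label, num_classes, epsilon=0.1):
    """Label-smoothed target: (1 - eps) on the true class plus eps / K on all."""
    return [
        (1.0 - epsilon if k == label else 0.0) + epsilon / num_classes
        for k in range(num_classes)
    ]


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for t, v in zip(targets, logits))


embeddings = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
labels = [0, 0, 1, 1]
triplet = batch_hard_triplet_loss(embeddings, labels)

targets = smoothed_targets(label=1, num_classes=4)
softmax_lsr = cross_entropy([0.0, 0.0, 0.0, 0.0], targets)
```

In the toy batch every anchor has its farthest positive at distance 2 and its closest negative at distance 1, so each anchor contributes 0.3 + 2 − 1 = 1.3 to the triplet loss.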


III Datasets and Data preprocessing

In this paper, both sequence images and foreground sequence masks are needed as inputs for experiments. To obtain data in wild scenarios, we build a dataset named Mask-MARS based on the MARS dataset [17], a large-scale dataset for video-based person Re-ID. We also conduct experiments on the CASIA-B dataset [16].

  • Mask-MARS: We create the Mask-MARS dataset by computing the foreground mask of each RGB image in the MARS dataset with a strong instance segmentation method. We require a video sequence to contain at least 8 effective foreground masks (not necessarily consecutive), where a mask is effective if the foreground area covers at least 15% of the original image. After screening with these two simple rules, the Mask-MARS dataset contains a total of 1250 IDs and 14764 video sequences. The training set contains 624 IDs and 5726 video sequences; the query set contains 626 IDs and 1819 video sequences; the gallery set contains 621 IDs and 7170 video sequences. The length of the video sequences varies from 8 to 920 frames, with an average of 70 frames.

  • CASIA-B: CASIA-B [16] is a popular gait dataset with 124 IDs, 11 view angles (0, 18, 36, …, 180 degrees), and 3 walking conditions: normal (NM), with 6 sequences per pedestrian; carrying a bag (BG), with 2 sequences per pedestrian; and wearing a coat or jacket (CL), with 2 sequences per pedestrian. Each pedestrian therefore has (6 + 2 + 2) × 11 = 110 sequences. We use the first 74 IDs as the training set and the last 50 IDs as the test set. At test time, the first 4 NM sequences (NM1-4) are kept in the gallery subset, and the remaining 6 sequences are divided into 3 query subsets (NM5-6, BG1-2, and CL1-2).
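The two screening rules used to build Mask-MARS translate directly into a filter. The sketch below uses binary masks stored as lists of rows; the helper `toy_mask` and the constant names are hypothetical and exist only to make the example runnable.

```python
MIN_FOREGROUND_RATIO = 0.15  # a mask is effective at 15% foreground or more
MIN_EFFECTIVE_MASKS = 8      # a sequence needs at least 8 effective masks


def foreground_ratio(mask):
    """Fraction of pixels marked as foreground (value 1) in a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def is_effective(mask):
    return foreground_ratio(mask) >= MIN_FOREGROUND_RATIO


def keep_sequence(masks):
    """Keep a sequence if enough masks are effective (not necessarily consecutive)."""
    return sum(is_effective(m) for m in masks) >= MIN_EFFECTIVE_MASKS


def toy_mask(foreground_pixels, height=10, width=10):
    """Hypothetical helper: a mask with the given number of foreground pixels."""
    flat = [1] * foreground_pixels + [0] * (height * width - foreground_pixels)
    return [flat[r * width:(r + 1) * width] for r in range(height)]


good, bad = toy_mask(20), toy_mask(10)  # 20% and 10% foreground

kept = keep_sequence([good] * 8 + [bad] * 3)      # 8 effective masks
dropped = keep_sequence([good] * 7 + [bad] * 20)  # only 7 effective masks
```

Sequence length alone does not matter: the second toy sequence has 27 frames but is still rejected because only 7 of its masks are effective.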

We adopt the method in [14] to preprocess and align the foreground masks. In our experiments, the aligned mask image is 64 × 64. We crop 10 pixels from both the left and the right side to obtain a 64 × 44 input for the GaitSet network.
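The horizontal crop is a simple slicing operation. In this minimal sketch the mask is a list of rows and the function name is illustrative; the alignment step from [14] is assumed to have been applied already.

```python
def crop_sides(mask, margin=10):
    """Remove `margin` columns from the left and right of a list-of-rows mask."""
    return [row[margin:len(row) - margin] for row in mask]


aligned = [[0] * 64 for _ in range(64)]  # an aligned 64 x 64 mask
cropped = crop_sides(aligned)
shape = (len(cropped), len(cropped[0]))  # height is unchanged, width shrinks by 20
```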

In the training phase, we preprocess the color image sequence and its corresponding masks as follows: (1) Random sequence crop: the cropped color image is 256 × 128, the corresponding mask matches the 16 × 8 output feature map of the last layer of ResNet-50, and the crop is applied with a fixed random probability. (2) Random sequence flip: the flip is likewise applied with a fixed random probability. (3) Image standardization: we normalize the image with the per-channel mean and variance statistics of the ImageNet dataset.
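A sequence-level flip has to treat all frames and masks of a sequence identically, otherwise the masks no longer line up with the images. The sketch below illustrates that idea together with ImageNet standardization. The probability is left as a parameter because its value is not available here, the mean and standard deviation are the commonly used ImageNet channel statistics for pixel values in [0, 1], and the toy frames are single-channel for brevity.

```python
import random

# Commonly used ImageNet channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def flip_horizontal(image):
    """Mirror a list-of-rows image or mask left to right."""
    return [row[::-1] for row in image]


def random_sequence_flip(frames, masks, probability, rng):
    """Flip every frame and mask of a sequence together, or none of them."""
    if rng.random() < probability:
        return (
            [flip_horizontal(f) for f in frames],
            [flip_horizontal(m) for m in masks],
        )
    return frames, masks


def standardize_pixel(rgb):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


rng = random.Random(0)
frames = [[[1, 2, 3]], [[4, 5, 6]]]  # two toy 1 x 3 frames
masks = [[[1, 0, 0]], [[1, 1, 0]]]   # their foreground masks

flipped_frames, flipped_masks = random_sequence_flip(frames, masks, 1.0, rng)
same_frames, same_masks = random_sequence_flip(frames, masks, 0.0, rng)
centered = standardize_pixel(IMAGENET_MEAN)
```

Drawing one random number per sequence, rather than one per frame, is what keeps the walking direction consistent across the frames fed to the gait branch.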

IV Experiments

IV-A Progressive results and analysis

For simplicity and clarity, we name the baseline and the other module combinations as follows:

  • Appearance Baseline: the global branch of the Appearance Module, trained individually; 512-dimensional feature (without FFM).
  • Appearance Baseline + Foreground Branch: the whole Appearance Module, trained jointly; 1024-dimensional feature (without FFM).
  • Modified GaitSet: the whole Gait Module, trained individually; 512-dimensional feature (without FFM).
  • Fusion Network: our full system; 1536-dimensional feature (with FFM). It has two versions: one fine-tuned from the individually pre-trained parts (finetune) and one trained end to end (end2end).

As shown in Tab. I, on Mask-MARS, adding the foreground branch raises Rank1 from 84.7% to 86.5% and mAP from 78.9% to 80.7% over the baseline model. MARS is a large-scale wild-scene dataset with random camera views and heavy background clutter. Using only the gait branch, the performance is very poor (mAP 10.7% and Rank1 17.7%). Even though the gait model alone does not perform well, our Fusion Network (end2end) still outperforms Baseline + Foreground Branch by a notable margin: Rank1 and mAP increase by 0.8 and 3.0 percentage points, respectively. Compared to the Baseline, the improvement is even larger (Rank1 +2.6 and mAP +4.8 percentage points).

On the CASIA-B dataset, Tab. II reports the results with and without the query's view angle included in the gallery. Adding appearance features makes the model more robust to view angle variation. Because the dataset is quite simple and has essentially no background clutter, the appearance-based models perform very well in the NM and BG cases. However, when clothes change (CL case), our Fusion Network (end2end) outperforms all other models, achieving 74.891% Rank1 accuracy when the same angle is excluded, far exceeding the Modified GaitSet (+19.86 percentage points) and the Appearance Baseline (+29.49 percentage points). This shows that when the appearance of pedestrians changes significantly, the fused features are more discriminative than single-modal features. In every case on CASIA-B (NM, BG, CL), the fusion model achieves the best performance compared to the single-modal models.
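The Rank-k and mAP figures reported in the tables are computed from ranked gallery lists. The sketch below shows the basic definitions on toy data; it is a simplification that assumes each query already comes with its gallery IDs sorted by increasing distance, and it omits dataset-specific filtering such as removing same-camera matches.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if a correct match appears among the top-k gallery results."""
    return query_id in ranked_ids[:k]


def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each position holding a correct match."""
    hits, precisions = 0, []
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(rankings, k=1):
    """rankings: list of (query_id, gallery IDs sorted by increasing distance)."""
    rank_k = sum(rank_k_hit(r, q, k) for q, r in rankings) / len(rankings)
    mean_ap = sum(average_precision(r, q) for q, r in rankings) / len(rankings)
    return rank_k, mean_ap


rankings = [
    ("a", ["a", "b", "a", "c"]),  # AP = (1/1 + 2/3) / 2
    ("b", ["c", "b", "a", "b"]),  # AP = (1/2 + 2/4) / 2
]
rank1, mean_ap = evaluate(rankings)
```

Rank1 only looks at the top result, while mAP rewards retrieving all sequences of the same identity early, which is why the two metrics can move by different amounts between models.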

Model | Rank1 | Rank5 | Rank10 | Rank20 | mAP
Appearance Baseline | 84.7 | 94.6 | 95.9 | 97.2 | 78.9
Appearance Baseline + Foreground Branch | 86.5 | 95.5 | 96.3 | 97.1 | 80.7
Modified GaitSet | 17.7 | 34.1 | 43.1 | 51.8 | 10.7
FusionNetwork (finetune) | 87.1 | 95.2 | 96.3 | 97.2 | 81.1
FusionNetwork (end2end) | 87.3 | 95.6 | 96.5 | 97.9 | 83.7
TABLE I: Progressive results on Mask-MARS dataset
Model | Incl. NM | Incl. BG | Incl. CL | Excl. NM | Excl. BG | Excl. CL
Appearance Baseline | 98.669 | 96.630 | 45.884 | 98.536 | 96.475 | 45.400
Appearance Baseline + Foreground Branch | 98.405 | 95.960 | 48.843 | 98.245 | 95.729 | 48.364
Modified GaitSet | 83.570 | 71.284 | 55.934 | 81.964 | 69.168 | 55.036
FusionNetwork (finetune) | 99.455 | 96.782 | 73.634 | 99.437 | 96.579 | 69.495
FusionNetwork (end2end) | 99.620 | 96.822 | 75.950 | 99.582 | 96.623 | 74.891
TABLE II: Progressive results on CASIA-B dataset (Rank1 accuracy, %), including (Incl.) and excluding (Excl.) the same angle as the query in the gallery

IV-B Comparison of concatenation and fusion effects of different features

This experiment verifies the effectiveness of the Feature Fusion Module (FFM). We name the models with different settings as follows. GGConcat: concatenated feature from 2 branches (global branch and gait branch) without FFM; GGFusion: fused feature from the same 2 branches with FFM; AGConcat: concatenated feature from 3 branches (global branch, foreground branch, and gait branch) without FFM; AGFusion: fused feature from the same 3 branches with FFM. The experimental results on the Mask-MARS and CASIA-B datasets are shown in Tab. III.

IV-C Comparison with other advanced video-based Re-ID methods

We also reproduce several recent advanced algorithms [11, 10, 4] on the dataset we created to validate the effectiveness of our hybrid model; the results are shown in Tab. IV. Our method achieves the highest Rank1 (87.3%) and mAP (83.7%), exceeding the best appearance-only method by 1.0 and 6.5 percentage points, respectively.

Model | Mask-MARS rank1 | Mask-MARS rank5 | Mask-MARS rank10 | Mask-MARS mAP | CASIA-B NM | CASIA-B BG | CASIA-B CL
AGConcat | 86.6 | 95.2 | 96.4 | 81.0 | 99.736 | 97.905 | 68.785
AGFusion | 87.1 | 95.2 | 96.3 | 81.1 | 99.620 | 96.822 | 75.950
GGConcat | 87.0 | 95.4 | 96.5 | 81.0 | 99.810 | 97.806 | 75.099
GGFusion | 87.3 | 95.3 | 96.2 | 80.7 | 99.149 | 96.740 | 75.992

TABLE III: Comparison of concatenation and fusion effects of different features
Model | Rank1 | Rank5 | Rank10 | Rank20 | mAP
Non-local+C3D [11] | 84.1 | 94.5 | 96.0 | 97.3 | 77.2
STAN [10] | 82.3 | 92.9 | 94.6 | 96.8 | 65.7
Snipped [4] | 81.2 | 92.1 | 94.6 | 96.5 | 69.4
Snipped+ [4] | 86.3 | 94.7 | 95.7 | 98.2 | 76.1
Ours (end2end) | 87.3 | 95.6 | 96.5 | 97.9 | 83.7
TABLE IV: Comparison with the state-of-the-art video-based Re-ID methods on Mask-MARS dataset.

V Conclusion

This paper proposes an end-to-end framework that uses the sequence masks (SeqMasks) of each video to jointly exploit appearance and gait in video Re-ID. Experiments on the Mask-MARS dataset demonstrate the favorable performance and generalization ability of the proposed algorithm. Further validation on the gait recognition benchmark CASIA-B highlights the performance of our hybrid model.


  • [1] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I.
  • [2] Z. Chang, Z. Qin, H. Fan, H. Su, H. Yang, S. Zheng, and H. Ling (2020) Weighted bilinear coding over salient body parts for person re-identification. Neurocomputing 407, pp. 454–464. Cited by: §I.
  • [3] H. Chao, Y. He, J. Zhang, and J. Feng (2018) GaitSet: regarding gait as a set for cross-view gait recognition. Cited by: §I, §I, §II.
  • [4] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang (2018) Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1169–1178. Cited by: §I, §IV-C, TABLE IV.
  • [5] J. Gao and R. Nevatia (2018) Revisiting temporal modeling for video-based person reid. arXiv preprint arXiv:1805.02104. Cited by: §I.
  • [6] R. A. Güler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. Cited by: §I.
  • [7] K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §I.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §II-A.
  • [9] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §II-D.
  • [10] S. Li, S. Bak, P. Carr, and X. Wang (2018) Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 369–378. Cited by: §IV-C, TABLE IV.
  • [11] X. Liao, L. He, Z. Yang, and C. Zhang (2018) Video-based person re-identification via 3d convolutional networks and non-local attention. In Asian Conference on Computer Vision, pp. 620–634. Cited by: §I, §IV-C, TABLE IV.
  • [12] A. Nambiar, A. Bernardino, and J. C. Nascimento (2019) Gait-based person re-identification: a survey. ACM Computing Surveys (CSUR) 52 (2), pp. 1–34. Cited by: §I.
  • [13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2015) Rethinking the inception architecture for computer vision. Cited by: §II-D.
  • [14] N. Takemura, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi (2018) Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. Ipsj Transactions on Computer Vision and Applications 10 (1), pp. 4. Cited by: §I, §III.
  • [15] Z. Yang, Z. Chang, and S. Zheng (2019) Large-scale video-based person re-identification via non-local attention and feature erasing. In International Forum on Digital TV and Wireless Multimedia Communications, pp. 327–339. Cited by: §I.
  • [16] S. Yu, D. Tan, and T. Tan (2006) A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In 18th International Conference on Pattern Recognition (ICPR 2006), 20-24 August 2006, Hong Kong, China, Cited by: §I, 2nd item, §III.
  • [17] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) Mars: a video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pp. 868–884. Cited by: §I, §III.