Person re-identification (re-id) has attracted widespread attention recently, as it underpins various critical applications such as video surveillance, pedestrian tracking and searching. Given a target person appearing in a surveillance camera, a re-id system generally aims to identify that person in the other cameras across the whole camera network, i.e., to determine whether instances captured by different cameras belong to the same person. However, due to the influence of cluttered backgrounds, occlusions and viewpoint variations across camera views, this task is quite challenging.
A re-id system may take an image or a video as input for feature extraction. Since only limited information can be exploited from a single image, it is difficult to overcome the occlusion, camera-view and pose variation problems and to capture the varying appearance of a pedestrian performing different action primitives. It is therefore preferable to address the video-based re-id problem, as videos inherently contain more temporal information about the moving person than an independent image, not to mention that in many practical applications the input is video to begin with. Besides, a video is a sequence of images, so spatial and temporal cues are more abundant in a video than in a single image, which facilitates extracting more features.
In spite of the rich space-time information provided by a video sequence, more challenges come along. So far, only a few video-based methods have been presented. Most of them focus on the temporal information related to a person’s motion, such as gait, and perhaps even the patterns of how bodies and clothes move. Although such movement is one type of behavioral biometric, a large number of persons unfortunately share similar walking manners and related behavior. Moreover, since gait is considered a biometric that is not affected by a person’s appearance, most approaches try to exploit it by working with silhouettes, which are difficult to extract, especially from surveillance data with cluttered backgrounds and occlusions. Besides, time-series analysis usually requires extracting information at different timescales. In the person re-id problem, gait information is often available only for a short time, so the information provided by movement descriptors is limited. In some cases, it is even harder to distinguish the video representations of different identities than their still-image appearance.
Unlike previous work, in this paper we intend to extract a compact appearance representation from several representative frames rather than from all frames for video-based re-id. Compared to the temporal-based methods, the proposed appearance model works more similarly to the human visual system. This is because visual perception studies on appearance (e.g., color, texture) and motion stimuli have shown that pattern detection thresholds are much lower than motion detection thresholds. Hence, humans perform better at identifying the appearance of a human body or belongings than the manner in which a person walks. In most cases, people can be distinguished more easily by appearance, such as clothes and bags on their shoulders, than by gait and pose, which are generally similar among different persons, as shown in Fig. 2. Videos are therefore highly redundant for this purpose, and it is unnecessary to incorporate all frames for person re-id. Our study shows that several typical frames with appropriate feature extraction can offer competitive or even better identification performance.
More specifically, given a walking sequence, we first split it into several segments corresponding to different action primitives of a walking cycle. The most representative frames are selected from a walking cycle by exploiting the local maxima and minima of the Flow Energy Profile (FEP) signal. For each frame, we propose a CNN to learn features based on the person’s joint appearance information. Since different frames may have different discriminative features for recognition, an appearance-pooling layer is introduced so that the salient appearance features of multiple frames are preserved to form a discriminative feature descriptor for the whole video sequence. The central point of our algorithm lies in the exploration of the key appearance information of a video, in contrast to conventional methods, which rely heavily on accurate temporal information.
2 Related work
Person re-identification has been a hot topic in the computer vision community for decades. In general, the key is to generate discriminative signatures for pedestrian representation across different cameras. The most frequently used low-level features are color, texture, gradient, and combinations of them, extracted either from the whole body area or from regions of interest.
Another popular way of synthesizing feature descriptors is through deep learning, which has shown great potential in various computer vision tasks, such as object detection, image classification, and face and pose recognition. In these areas, deep neural networks have largely replaced traditional computer vision pipelines based on hand-crafted features. As for image-based person re-id, different CNNs have been used for learning a joint representation of images and the similarity between image pairs or triplets directly from the pixels of the input images.
Recently, attention has been moving to the video-based re-id problem, and most efforts have been spent on exploiting temporal cues for pedestrian modeling. Specifically, Wang et al. employed HOG3D as a descriptor for action and activity recognition. Liu et al. developed a spatio-temporal alignment of video segments by tracking the gait information of pedestrians. Zheng et al. attempted to extract the motion by exploiting the HOG3D feature and the Gait Energy Image (GEI) feature. Some efforts were even spent on hybrid tools such as RNNs and optical flow for temporal information extraction. However, as mentioned above, temporal cues such as gait and motion are often unreliable in a walking sequence, which is often short and of low quality in practical surveillance videos.
As an alternative, the proposed method is more like an image-based re-id algorithm. Following the human visual system, this work intends to solve the video re-id problem by pooling the distinctive features from several representative frames. To select the representative frames automatically, we do need some temporal cues to extract the walking cycle. Compared to conventional temporal methods such as optical flow, this step is much simpler and does not need to be very accurate. As shown in Fig. 3, a rough approximation of the motion profile of consecutive frames is good enough. This can also be regarded as an implicit and more efficient way of using the temporal information, which may relieve the burden of accurate motion or gait extraction in video re-id.
3 Proposed method
As illustrated in Fig. 2, our method proceeds in three steps: frame selection, feature pooling and identification. Given a video sequence, some representative frames are selected automatically based on the walking profile. Each representative frame is then processed by a CNN to extract reliable features. To compile all features into a compact yet informative description, a feature pooling layer is incorporated. Finally, we employ distance metric learning for identification, which maximizes the distance between features of different people and minimizes the distance between features of the same person.
3.1 Representative frame extraction
To automatically select the most representative frames, we first extract the Flow Energy Profile (FEP) as proposed in earlier work, which is a one-dimensional signal that approximates the motion energy intensity profile of the consecutive frames in a video sequence. Ideally, a local maximum of the FEP corresponds to the postures in which the person’s two legs overlap, while at a local minimum the two legs are the farthest apart. However, as shown in Fig. 3, the FEP can only provide a rough approximation of the walking cycle, as its estimation is sensitive to noisy backgrounds and occlusions. Following earlier work, the discrete Fourier transform is further employed to transform the FEP signal into the frequency domain, so that the walking cycles can be better indicated by the dominant frequencies.
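As a minimal sketch of the frequency-domain step, assuming the FEP is already available as a list of per-frame values, the following computes a plain DFT and reports the period of the strongest non-DC frequency as an estimate of the walking-cycle length. The function name `dominant_period`, the mean removal and the use of a single strongest bin are illustrative assumptions, not details taken from the method itself.

```python
import cmath
import math


def dominant_period(signal):
    """Return the period (in frames) of the strongest non-DC frequency.

    `signal` is a hypothetical FEP-like list of per-frame values.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_k, best_power = 1, -1.0
    # For a real-valued signal only bins 1 .. n // 2 are distinct.
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k
```

For a noisy profile, the estimated period can then be used to cut the sequence into cycles of roughly that length before looking for extrema within each cycle.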
A full cycle of the walking action contains two consecutive sinusoid curves, one step from each leg. Since it is extremely difficult to distinguish between the two, each sinusoid curve of a single step is regarded as a walking cycle. Given a walking cycle, we can obtain the key frames corresponding to the different action primitives. As illustrated in Fig. 3, the frames with the maximum and minimum FEP values are the best candidates for representing a walking cycle. The other frames can be sampled at equal intervals between the maximum and minimum of the cycle. The studies in Table 1 show that four frames sampled from one cycle give the best identification result. Adding more frames does not help, as most of the appearance information is already included.
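The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: `fep_cycle` is a hypothetical list holding the FEP values of one walking cycle, and the helper name `select_key_frames` and the rounding of sample positions are not specified in the text.

```python
def select_key_frames(fep_cycle, num_frames=4):
    """Pick `num_frames` frame indices from one walking cycle.

    The frames at the FEP maximum and minimum are always used as the two
    end points; the remaining frames are spaced equally between them.
    """
    indices = range(len(fep_cycle))
    i_max = max(indices, key=fep_cycle.__getitem__)
    i_min = min(indices, key=fep_cycle.__getitem__)
    if num_frames == 1:
        return [i_max]
    lo, hi = sorted((i_max, i_min))
    step = (hi - lo) / (num_frames - 1)
    # Equally spaced positions from one extremum to the other, both included.
    return [round(lo + j * step) for j in range(num_frames)]
```

If the two extrema are closer together than `num_frames - 1` frames, some indices repeat; a real implementation would need to decide how to handle such short cycles.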
3.2 CNN-based feature extraction
3.2.1 Network architecture
The proposed network consists of five convolutional layers followed by two fully connected layers and a softmax classification layer, which is similar to the VGG-M network. The detailed structure is given in Fig. 4. To aggregate the features from the extracted representative frames into a single compact descriptor, a feature pooling layer is introduced into the network. Besides, Rectified Linear Unit (ReLU) layers are used before each layer, which can accelerate convergence and avoid manually tweaking the weights and biases.
The parameters of the network are initialized from the pre-trained VGG-M model and then fine-tuned on the target pedestrian training sequences. In the training phase, all selected representative frames of each walking cycle are first rescaled to the input size of the network, and then fed into the CNNs along with their corresponding labels to train the network.
In the testing phase, the proposed network serves as a feature extractor. Specifically, each rescaled frame is first fed into the CNN to obtain its features from the convolutional layers. The learnt descriptors are then aggregated by the feature pooling layer and finally turned into a 4096-dimensional representation at the fully connected layers. Note that the features taken from one particular fully connected layer gave the best performance in experiments, so the layers after it are discarded after training.
3.2.2 Feature pooling
In this section, we focus on aggregating the key information from different views into a single, compact feature descriptor. After the representative frames are fed in, the proposed CNN architecture yields multiple feature maps, as shown in Fig. 4. Simply averaging these features is straightforward, but often leads to inferior performance. A feature pooling layer is therefore added to the proposed CNNs. As shown in Table 3, max pooling across the feature maps obtained from multiple CNNs produced the best re-id results.
Specifically, as illustrated in Fig. 5, although a CNN is able to capture information from each frame, the discriminative appearance features of a pedestrian may appear in any frame, i.e., the desired discriminative features are scattered among different frames. However, by using the element-wise maximum operation among the feature maps, the strongest features from different views can be integrated to form an informative description of the pedestrian. Theoretically, this pooling layer can be inserted anywhere in the proposed network, yet the experimental results show that it performs best when placed between the last convolutional layer and the first fully connected layer.
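The element-wise maximum can be illustrated with a small sketch. It assumes each frame's feature maps have already been flattened into a list of floats of equal length; the function names and the flattened representation are illustrative assumptions, and average pooling is included only for comparison.

```python
def max_pool_frames(frame_features):
    """Element-wise maximum over per-frame feature vectors of equal length."""
    if not frame_features:
        raise ValueError("need at least one frame")
    length = len(frame_features[0])
    if any(len(f) != length for f in frame_features):
        raise ValueError("all frames must have the same feature length")
    # zip(*...) groups the values that share a position across frames.
    return [max(values) for values in zip(*frame_features)]


def avg_pool_frames(frame_features):
    """Element-wise mean over per-frame feature vectors, for comparison."""
    n = len(frame_features)
    return [sum(values) / n for values in zip(*frame_features)]
```

With max pooling, a strong response that occurs in only one frame survives in the pooled descriptor, whereas averaging dilutes it by the number of frames.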
3.3 Distance metric learning
After feature extraction and pooling, we learn a metric on the training set using distance metric learning approaches in order to compare the final representations. Specifically, for each pedestrian representation with a set of feature vectors from the query set and each representation with a set of feature vectors from the gallery set, the minimum distance over all feature pairs is adopted as the distance between them (Eq. (1)).
An alternative is to use the average of the minimum distances as the distance measurement between them (Eq. (2)).
Empirically, the latter measurement is found to give better performance. Besides, PCA is first performed to reduce the dimension of the original representation before distance metric learning, and the reduced dimension is set to 100 in all of our experiments. More analysis and discussion of distance learning and dimension reduction can be found in Section 4.3.
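The two set-to-set measures can be sketched as below. Several points are assumptions made for illustration: a query is written as a set of vectors x_1, ..., x_m and a gallery entry as y_1, ..., y_n; the minimum measure is read as the smallest d(x_i, y_j) over all pairs; and the average measure is read as the mean over i of the smallest d(x_i, y_j) over j. Plain Euclidean distance stands in for the learned metric.

```python
import math


def euclidean(u, v):
    """Plain Euclidean distance; a learned metric could be passed instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def min_distance(query, gallery, dist=euclidean):
    """Smallest distance over all (query vector, gallery vector) pairs."""
    return min(dist(x, y) for x in query for y in gallery)


def avg_min_distance(query, gallery, dist=euclidean):
    """Mean, over query vectors, of the distance to the closest gallery vector."""
    return sum(min(dist(x, y) for y in gallery) for x in query) / len(query)
```

Under this reading the average measure is not symmetric in its two arguments, and a single unusually close pair (for example two similarly occluded frames) influences it less than it influences the minimum measure.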
4 Experimental results
In this section, we conduct experiments on benchmark video re-identification datasets and compare the proposed method with state-of-the-art approaches.
4.1 Datasets and settings
Experiments were conducted on three person re-id datasets: the PRID 2011 dataset, the iLIDS-VID dataset and the SDU-VID dataset. The PRID 2011 dataset includes 400 image sequences for 200 persons, captured by two non-overlapping cameras, and the average length of each sequence is 100. This dataset was captured in uncrowded outdoor scenes with relatively simple and clean backgrounds. The iLIDS-VID dataset contains 600 image sequences for 300 randomly sampled persons, with an average length of 73. This dataset was captured by two non-overlapping cameras in an airport hall under a multi-camera CCTV network. Subject to quite large illumination changes, occlusions, and viewpoint variations across camera views, this dataset is more challenging. The SDU-VID dataset contains 600 image sequences for 300 persons captured by two non-overlapping cameras. There are more image frames in each video sequence, and the average length is 130. This is also a challenging dataset due to the cluttered backgrounds, occlusions and viewpoint variations.
In our experiments, each dataset is randomly divided in half into a training set and a testing set, with no overlap between them. During testing, we treat the sequences from the first camera as the query set and those from the other camera as the gallery set. For each walking cycle extracted from the video sequences, four representative frames are selected automatically as the inputs to four independent CNNs, which finally output a 4096-D descriptor for the whole walking cycle. Since different video sequences may contain different numbers of walking cycles, a different number of feature descriptors may be extracted from each sequence. We use all of them as query or gallery descriptors and learn a metric to determine the distance between two sets of descriptors extracted from two sequences. The widely used Cumulative Matching Characteristics (CMC) curve is employed for quantitative measurement. All tests are repeated 10 times and the average rates are reported to ensure statistically reliable evaluation.
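For reference, a CMC curve can be computed from a query-by-gallery distance matrix as in the sketch below. It assumes that every query identity appears exactly once in the gallery; the function name and argument layout are illustrative and not taken from any particular evaluation toolkit.

```python
def cmc_curve(distances, query_ids, gallery_ids, max_rank):
    """Return the matching rates at ranks 1 .. max_rank.

    distances[i][j] is the distance between query i and gallery entry j.
    Assumes each query identity occurs exactly once in `gallery_ids`.
    """
    hits = [0] * max_rank
    for row, qid in zip(distances, query_ids):
        # Gallery indices sorted from the closest to the farthest entry.
        order = sorted(range(len(row)), key=row.__getitem__)
        ranked_ids = [gallery_ids[j] for j in order]
        first_hit = ranked_ids.index(qid)  # 0-based rank of the true match
        # A match found at rank r counts as a hit for every rank >= r.
        for k in range(first_hit, max_rank):
            hits[k] += 1
    return [h / len(query_ids) for h in hits]
```

The first entry of the returned list is the rank-1 (R-1) rate reported in the tables; averaging the curves over the 10 random splits gives the final numbers.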
4.2 Results of feature learning
|Frame selection method||R-1||R-1||R-1|
|Proposed sampling method||60.2||83.3||89.3|
Randomly sample frames from the whole sequence; the number of frames is determined automatically as stated in Section 3.1.
Divide the video equally into segments and sample one frame from each.
Representative frames extraction:
As described in Section 3.1, frames are sampled as the representative ones from each walking cycle for feature learning. To study the influence of the number of frames on re-id, experiments were carried out with different numbers of frames (1 to 10) sampled at equal intervals within each walking cycle. Note that the parameters of the CNNs are shared across all frames, which means the descriptions of all frames are generated by the same feature-extraction network.
Using the feature maps of the first frame for description.
The results are given in Fig. 6 and Table 1. Roughly speaking, the performance with different numbers of sampled frames is comparable, which supports our claim that it is unnecessary to use all frames for video re-id. For all datasets, four-frame sampling produced the best results. This is because the four frames are sampled at the maximum, the minimum and the positions in between within a cycle, and thus contain all distinctive walking poses, as illustrated in Fig. 3. In most cases, one or two frames give poor results, as they offer too little information. It is interesting to see that adding more frames does not help, because the information for identification is already redundant and more outliers may be introduced in feature learning.
To further validate the effectiveness of the representative frames, experiments were conducted to compare the proposed frame sampling method with other baseline sampling methods. As shown in Tab. 2, by dividing the sequence into walking cycles, the proposed method can select the most representative frames, and it performed better than random selection in terms of re-id accuracy. It also shows superiority over using all frames, which again supports the observation that there is no need to extract features from all frames in video re-id.
Feature pooling settings:
In this work, each sampled frame of a walking cycle is fed into the proposed network separately for feature extraction, and the results are aggregated at the feature pooling layer as shown in Fig. 4. Hence, the feature pooling layer has an important impact on feature aggregation as well as on the final identification. As mentioned in Section 3.2.2, there are mainly two kinds of pooling strategies: max-pooling and average-pooling.
As shown in Table 3, experiments were carried out to test the performance of max-pooling and average-pooling for the proposed network. The performance of using the features of the first frame (i.e., without pooling) is also provided as a baseline for comparison. Evidently, accumulating features from multiple frames via pooling provides gains for re-id. Also, max-pooling shows superiority over average-pooling. This is unsurprising because average-pooling is usually employed in cases where all input frames are considered equally important, while max-pooling cares more about the strongest (most distinctive) information of each frame.
Besides, we have also considered different locations for the feature pooling layer in the proposed network. The performance does not change much when pooling is placed in the later layers, but it decreases evidently when pooling is placed among the first few layers. Generally, we observed that pooling between the last convolutional layer and the first fully connected layer works slightly better, and this setting was therefore used for all experiments.
Description layer evaluation:
Table 4 shows the re-id results with different layers used for feature description after feature pooling. Roughly, the earlier layers perform better than the later ones. Besides, as illustrated before, the ReLU layers are also considered, since ReLU serves as the neuron activation function after each layer. However, there is only a slight difference between each layer and its corresponding ReLU layer. Hence, the best-performing layer in Table 4 is used for the feature description in this work.
Other base networks:
Experiments have also been carried out with some other base networks, such as VGG-19, CaffeNet and ResNet-50. As shown in Tab. 5, VGG-M (the proposed method) is better than the other networks on the provided datasets.
Feature map visualization:
To validate the proposed appearance representation, we visualize the learned intermediate features. Fig. 7 shows two examples, each presenting some feature maps produced by intermediate layers of the network. As expected, most representative features, including silhouettes and distinctive appearance such as clothes and bags, can be captured by the proposed learning model. The feature pooling layer is capable of combining all representative features learned at different walking states (frames) into a joint representation for more effective person identification.
4.3 Results of distance learning
|Distance metric learning||R-1||R-1||R-1|
Metric learning evaluation:
In this experiment, we combine the proposed network with different supervised distance metric learning methods such as KISSME, Local Fisher Discriminant Analysis (LFDA) and Cross-view Quadratic Discriminant Analysis (XQDA). As shown in Table 6, among the three methods, KISSME performed best in most cases and was thus chosen as the default method for distance metric learning.
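For readers unfamiliar with KISSME, the following is a minimal sketch of its core idea as it is usually described: the metric is the difference between the inverse covariance of feature differences for same-identity pairs and that for different-identity pairs, clipped to be positive semi-definite. The use of NumPy, the exhaustive pair enumeration and the function names are assumptions for illustration; this is not the training code used in the experiments.

```python
import numpy as np


def kissme_metric(features, labels):
    """Estimate a KISSME-style Mahalanobis matrix from labelled features."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n, dim = features.shape
    cov_same = np.zeros((dim, dim))
    cov_diff = np.zeros((dim, dim))
    n_same = n_diff = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = features[i] - features[j]
            if labels[i] == labels[j]:
                cov_same += np.outer(d, d)
                n_same += 1
            else:
                cov_diff += np.outer(d, d)
                n_diff += 1
    # Difference of inverse covariances of similar and dissimilar pairs.
    metric = np.linalg.inv(cov_same / n_same) - np.linalg.inv(cov_diff / n_diff)
    # Clip negative eigenvalues so the result is a valid (PSD) metric.
    eigvals, eigvecs = np.linalg.eigh(metric)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T


def mahalanobis_sq(metric, x, y):
    """Squared distance between x and y under the learned metric."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ metric @ d)
```

Both covariance matrices must be invertible, which is one practical reason for reducing the feature dimension (for example with PCA) before the metric is learned.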
Distance measure evaluation:
Table 6 also gives the results of classifiers with different distance measures: the minimum distance in Eq. (1) and the average distance in Eq. (2). It is observed that the classifier with the average distance performs better than the one with the minimum distance, especially on the iLIDS-VID dataset. This is mainly because the average classifier is more resilient to noise caused by occlusions and lighting changes, which occur more frequently in this dataset.
Dimension reduction evaluation:
Appropriate dimension reduction not only helps preserve discriminative information, but also helps filter out noise in the features. The effect of dimension reduction using PCA is studied in Table 7. The optimal performance is obtained with the dimension reduced to 100 using PCA.
4.4 Comparison to state-of-the-art
In this section, we compare the performance of the proposed method with existing video-based re-id approaches, as shown in Table 8. It can be observed that the proposed algorithm achieves state-of-the-art performance on the public benchmark datasets. For iLIDS-VID, our algorithm outperforms the second-best method, RNN+OF. For PRID 2011, our algorithm outperforms the second-best method, CNN+XQDA. For SDU-VID, only the results of STA and RNN are provided, and our method produces significant gains over them. It should be stressed that the above methods take all the frames as input, and their performance mostly relies on motion features extracted using hybrid tools, e.g., RNN, optical flow, HOG3D and GEI. In contrast, the proposed method yields superior results by pooling the image features from only a few frames.
4.5 Limitations and discussions
Besides iLIDS-VID, PRID 2011 and SDU-VID, a new dataset, MARS, was developed recently, and it differs considerably from the other three datasets. As shown in Fig. 8, the pedestrians in the earlier public datasets were mostly captured from a side view, while the camera viewpoints and poses in MARS vary greatly, and the tracklets are much shorter. Besides, as shown in Fig. 8(d), since pedestrian detection and tracking were performed automatically, the quality of the cropped pedestrian images is poor. As mentioned by the authors, quite a number of distractor tracklets were produced by false detection or tracking results. All these issues make it very difficult for our method to extract walking cycles and representative frames. The feature pooling is also fragile to the ambiguity introduced by a large portion of background or other scene objects. Since extracting the representative frames by walking cycle is intractable here, we split each tracklet in half and randomly select four frames as the representative ones for feature learning and pooling. The results are given in Table 9. Our method is inferior to CNN+XQDA in this case. This is reasonable, as CNN+XQDA takes all frames of the different sequences. Without the representative frames, our method can only process a constant number of randomly sampled frames (e.g., four frames) and is thus more sensitive to the above issues in MARS. The above limitations are shared by most methods, as shown in Tab. 9.
5 Conclusion
In this paper, we presented a novel video-based person re-id framework based on deep CNNs. Unlike previous work focusing on extracting motion cues, our efforts were spent on extracting compact but discriminative appearance features from typical frames of a video sequence. The proposed appearance model was built with a deep CNN architecture incorporating feature pooling. Extensive experimental results on benchmark datasets confirmed the superiority of the proposed appearance model for video-based re-id.
-  S. Bak, E. Corvee, F. Bremond, and M. Thonnat. Person re-identification using spatial covariance regions of human body parts. In Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on, 2010.
-  A. Bedagkar-Gala and S. K. Shah. Part-based spatio-temporal model for multi-person re-identification. Pattern Recognition Letters, 33(14), 2012.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
-  S.-Z. Chen, C.-C. Guo, and J.-H. Lai. Deep ranking for person re-identification via joint representation learning. IEEE Transactions on Image Processing, 25(5):2353–2367, 2016.
-  A. M. Derrington, D. R. Badcock, and G. B. Henning. Discriminating the direction of second-order motion at short stimulus durations. Vision Research, 33(13):1785–1794, 1993.
-  S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, 2015.
-  J. Han and B. Bhanu. Individual recognition using gait energy image. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(2):316–322, 2006.
-  M. Hirzer, C. Beleznai, P. Roth, and H. Bischof. Person re-identification by descriptive and discriminative classification. In SCIA. 2011.
-  A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
-  M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
-  W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
-  S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206, 2015.
-  D. T. Lindsey and D. Y. Teller. Motion at isoluminance: discrimination/detection ratios for moving isoluminant gratings. Vision Research, 30(11):1751–1761, 1990.
-  C. Liu, S. Gong, C. C. Loy, and X. Lin. Person re-identification: What features are important? In European Conference on Computer Vision, pages 391–401. Springer, 2012.
-  K. Liu, B. Ma, W. Zhang, and R. Huang. A spatio-temporal appearance representation for video-based pedestrian re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3810–3818. http://www.vsislab.com/projects/MLAI/PedestrianRepresentation/.
-  B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. In IJCAI, volume 81, pages 674–679, 1981.
-  B. Ma, Y. Su, and F. Jurie. BiCov: a novel image representation for person re-identification and face verification. In BMVC, 2012.
-  J. Han and B. Bhanu. Individual recognition using gait energy image. IEEE transactions on pattern analysis and machine intelligence, 28(2):316–322, 2006.
-  R. Martin-Felez and T. Xiang. Gait recognition by ranking. In ECCV. 2012.
-  N. McLaughlin, J. Martinez del Rincon, and P. Miller. Recurrent convolutional network for video-based person re-identification.
-  C. Nakajima, M. Pontil, B. Heisele, and T. Poggio. Full-body person recognition system. Pattern recognition, 36(9), 2003.
-  W. Ouyang, X. Zeng, and X. Wang. Learning mutual visibility relationship for pedestrian detection with a deep model. International Journal of Computer Vision, pages 1–14, 2016.
-  S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local fisher discriminant analysis for pedestrian re-identification. In CVPR, 2013.
-  A. E. Seiffert and P. Cavanagh. Position-based motion perception for color and texture stimuli: effects of contrast and speed. Vision research, 39(25):4172–4185, 1999.
-  H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
-  T. Wang, S. Gong, X. Zhu, and S. Wang. Person re-identification by video ranking. In ECCV. 2014.
-  T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. arXiv preprint arXiv:1604.07528, 2016.
-  Y. Xu, L. Lin, W.-S. Zheng, and X. Liu. Human re-identification by matching compositional template with cluster sampling. In ICCV, 2013.
-  J. You, A. Wu, X. Li, and W.-S. Zheng. Top-push video-based person re-identification. arXiv preprint arXiv:1604.08683, 2016.
-  R. Zhao, W. Ouyang, and X. Wang. Person re-identification by salience matching. In ICCV, 2013.
-  L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pages 868–884. Springer, 2016.