Person re-identification (Re-Id) is useful in various intelligent video surveillance applications. The task can be considered as an image retrieval problem, where a query image of a person (probe) is given and we search for the person in a set of images extracted from different cameras (gallery). The query can be a single image or multiple images. Multi-image queries often use early fusion of the images to generate an average query image; such a method thus consumes more computational power than single image-based methods. Advanced hardware and efficient learning frameworks have encouraged researchers to focus on designing Re-Id systems applicable to videos. However, video-based re-identification research is still in its infancy [4, 5]. Even though the existing video Re-Id applications seem promising, such methods often fail in low-resolution videos, crowded environments, or in the presence of significant camera angle variations. It has also been observed that the query image or video has to be selected judiciously to obtain good retrieval results; choosing an improper image or video may lead to poor-quality retrieval. In this paper, we detect and track humans in movement and construct spatio-temporal tubes that are used in the re-identification framework. We also propose a method for selecting an optimum set of key pose images and use a 3-stage learning framework to re-identify persons appearing in different cameras. To accomplish this, we make the following contributions in this paper:
We propose a learning-based method to select an optimum set of key pose images that reconstructs the query tube while minimizing its length in terms of the number of frames.
We propose a 3-stage hierarchical framework that has been built using (i) SVDNet guided Re-Id architecture, (ii) self-similarity estimation, and (iii) temporal correlation analysis to rank the tubes of the gallery.
We introduce a new video dataset, named Tube-based Re-identification Video Dataset (TRiViD), which has been prepared with an aim to help the re-identification research community.
The rest of the paper is organized as follows. In Section 2, we discuss the state of the art of person re-identification research. Section 3 presents the proposed Re-Id framework with its various components. Experimental results are presented in Section 4. Conclusion and future work are presented in Section 5.
II Related Work
Person re-identification applications are growing rapidly in number. However, the enormous growth in CCTV surveillance has thrown up various challenges to the re-identification research community. The primary challenges are handling large volumes of data [6, 7], tracking in complex environments [8, 9], the presence of groups, occlusion, and varying pose and style across different cameras [2, 12, 13, 14]. The process of Re-Id can be categorized as image-guided [15, 16, 10, 2] and video-guided [4, 5, 17, 18, 19]. The image-guided methods typically use deep neural networks for feature representation and re-identification, whereas the video-guided methods typically use recurrent convolutional networks (RNNs) to embed temporal information such as optical flow, pose sequences, etc. Table I summarizes recent progress in person re-identification. In recent years, late fusion of different scores [15, 20] has shown significant improvement in the final ranking. Our method is similar to a typical delayed or late fusion guided method: we refine search results obtained using convolutional neural networks with the help of temporal correlation analysis.
|Author|Contribution|Title|
|Lv et al.|Motion and image-based features|Recurrent Convolutional Network for Video-based Person Re-identification|
|Barman et al.| | |
|Chang et al.|Visual appearance and multiple semantic-level features|Multi-Level Factorization Net for Person Re-Identification|
|Chen et al.| | |
|Chen et al.| | |
|Chung et al.| | |
|Deng et al.| | |
|He et al.| | |
|Huang et al.| | |
|Kalayeh et al.| | |
|Li et al.| | |
|Li et al.| | |
|Liu et al.|Augmented person poses used to generate the training set for re-identification|Pose Transferrable Person Re-Identification|
|Liu et al.| | |
|Lv et al.| | |
|Fu et al.|Multi-scale feature representation with selection of the correct scale for matching|Multi-scale Deep Learning Architectures for Person Re-identification|
|Tomasi et al.|Selection of good features for re-identification|Features for Multi-Target Multi-Camera Tracking and Re-Identification|
|Roy et al.| | |
|Sarfraz et al.| | |
|Shen et al.| | |
|Shen et al.| | |
|Si et al.| | |
|Wang et al.| | |
|Wu et al.| | |
|Xu et al.|Body parts-based attention network for re-identification|Attention-Aware Compositional Network for Person Re-identification|
|Xu et al.| | |
|Zhang et al.|Sequential decision making used to identify each frame in a video|Multi-shot Pedestrian Re-identification via Sequential Decision Making|
|Zhong et al.|Style transfer across cameras to improve re-identification|Camera Style Adaptation for Person Re-identification|
III Proposed Approach
Our method can be regarded as tracking followed by re-identification. Moving persons are tracked using Simple Online Deep Tracking (SODT), which has been developed using the YOLO framework. A tube is defined as the sequence of spatio-temporal frames of a moving person. Training is done using the videos captured by one camera, while videos captured by the other cameras are used to construct the gallery of tubes. Assume a gallery contains a set of tubes as given in (1).
Suppose a tube in the gallery contains a sequence of frames as given in (2).
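The gallery and tube notions above can be sketched as plain containers. This is a minimal illustrative sketch; the class and field names (`Tube`, `Gallery`, `frames`) are our own and not part of the proposed framework.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tube:
    """A spatio-temporal tube: the frame sequence of one tracked person."""
    tube_id: int
    frames: List[str] = field(default_factory=list)  # e.g. frame file paths

    def __len__(self) -> int:
        return len(self.frames)

@dataclass
class Gallery:
    """A gallery is simply a collection of tubes from the gallery cameras."""
    tubes: List[Tube] = field(default_factory=list)

    def add(self, tube: Tube) -> None:
        self.tubes.append(tube)

# A toy gallery with two short tubes
g = Gallery()
g.add(Tube(0, ["t0_f0.jpg", "t0_f1.jpg"]))
g.add(Tube(1, ["t1_f0.jpg"]))
print(len(g.tubes), len(g.tubes[0]))  # 2 tubes, first tube has 2 frames
```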
At the time of re-identification, a query tube is given as a probe. First, the noisy frames are eliminated and the query tube is minimized. Next, the frames of the revised query tube are passed through a 3-stage hierarchical re-ranking process to obtain the final ranking of the tubes in the gallery. The method is depicted in Figure 1.
III-A Query Minimization
Re-identification using multiple images usually performs better than single image-based frameworks. However, the former consumes more computational power. Also, selecting a set of frames that can uniquely represent a tube can be challenging. To address this, we use a deep similarity matching architecture to select a set of representative frames based on pose dissimilarity. First, a query tube is passed through a binary classifier to remove noisy frames (blurry, cropped, low-quality, etc.). Next, a ResNet50 framework is trained using a few query tubes containing similar-looking images. The similarity cost is calculated using (3).
The output query tube contains fewer images than the input tube. The images in the optimized query tube can be represented using (4).
The pairwise query cost function for a given frame and any other frame is defined in (5).
The loss of energy is defined as given in (6).
The optimal query energy is defined in (7), where the first term ranges over the set of images that are not included in the optimized query tube, and the query threshold is a weighting parameter between 0 and 1. A larger query threshold produces a higher number of images in the optimized query tube.
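The selection behaviour described in (3)-(7) can be approximated by a simple greedy sketch: a frame joins the optimized query only if it is sufficiently dissimilar to every frame already kept. This is an illustrative simplification, not the paper's exact energy minimization; the function names and the use of cosine distance over frame embeddings are our assumptions.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Pose-dissimilarity proxy: 1 - cosine similarity of frame embeddings."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def minimize_query(features: np.ndarray, threshold: float) -> list:
    """Greedy key-frame selection: keep a frame only if it is dissimilar
    to all frames kept so far.  A larger `threshold` keeps more frames,
    mirroring the query-threshold behaviour described in the text."""
    keep = [0]  # always keep the first frame
    for i in range(1, len(features)):
        dists = [cosine_distance(features[i], features[j]) for j in keep]
        # keep the frame only when it contributes new pose information
        if min(dists) > (1.0 - threshold):
            keep.append(i)
    return keep

feats = np.eye(4)  # four mutually orthogonal "poses"
print(minimize_query(feats, 0.4))  # [0, 1, 2, 3]: all four frames are kept
```

With identical frames (e.g. `np.ones((3, 4))`) only the first frame survives, which is the intended minimization effect.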
Figure 2 depicts the steps and the minimized query images from the TRiViD dataset.
III-B Image Re-identification using SVDNet
Our proposed method uses single image-based re-identification at the top layer of the hierarchy. We use Singular Value Decomposition Network (SVDNet) as the baseline. It uses a convolutional neural network and an eigenlayer before the fully connected layer; the eigenlayer consists of a set of weights. Figure 3 demonstrates the architecture of a typical SVDNet. The outputs of SVDNet are a set of retrieved images with ranks up to a chosen cutoff, as given in (8).
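The key idea behind SVDNet's eigenlayer is to replace the layer's weight matrix W with U diag(S) from its decomposition W = U S Vᵀ, which makes the weight vectors orthogonal while preserving their span. A minimal NumPy sketch of that decorrelation step (shapes and names are illustrative):

```python
import numpy as np

def orthogonalize_weights(w: np.ndarray) -> np.ndarray:
    """Replace a linear layer's weight matrix W with U @ diag(S), where
    W = U S V^T.  The column space is unchanged, but the new weight
    vectors (columns) become mutually orthogonal."""
    u, s, _vt = np.linalg.svd(w, full_matrices=False)
    return u * s  # broadcasting: same as u @ np.diag(s)

# Dense random "eigenlayer" weights: 128-d features -> 64 weight vectors
w = np.random.default_rng(0).normal(size=(128, 64))
w_ortho = orthogonalize_weights(w)

# The Gram matrix of the new weights is (numerically) diagonal
gram = w_ortho.T @ w_ortho
off_diag = gram - np.diag(np.diag(gram))
print(np.allclose(off_diag, 0.0, atol=1e-8))  # True
```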
III-C Self-Similarity Guided Re-ranking
In the next step, we aggregate the self-similarity scores with the SVDNet outputs. A typical ResNet50 architecture is trained to learn self-similarity scores using the tubes of the query set, under the assumption that the images available in a tube are similar. Next, a similarity score between the query image and every output image of the SVD network, up to a given rank, is calculated. Finally, the scores are averaged and the images are re-ranked. This step ensures that dissimilar images get pushed toward the end of the ranked sequence of retrieved images. Figure 4 illustrates this method.
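The score fusion described above can be sketched in a few lines: average the two score vectors per retrieved image, then sort. The function name and the toy scores are illustrative only.

```python
import numpy as np

def rerank_with_self_similarity(svd_scores: np.ndarray,
                                sim_scores: np.ndarray) -> np.ndarray:
    """Average the SVDNet retrieval score with the learned self-similarity
    score for each retrieved image, then sort descending to re-rank."""
    fused = (svd_scores + sim_scores) / 2.0
    return np.argsort(-fused)  # indices of retrieved images, best first

svd = np.array([0.9, 0.8, 0.7])  # SVDNet scores for rank-1..3 images
sim = np.array([0.2, 0.9, 0.8])  # the rank-1 image is dissimilar to the query
order = rerank_with_self_similarity(svd, sim)
print(order.tolist())  # [1, 2, 0]: the dissimilar image drops to the end
```

This reproduces the intended behaviour: an image with a high SVDNet score but low self-similarity is pushed toward the end of the ranked list.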
III-D Tube Ranking by Temporal Correlation
The final step of the proposed method is to rank the tubes by temporal correlation among the retrieved images. We assume that the images belonging to a single tube are temporally correlated, as they are extracted by detection and tracking. Consider the result matrix for the query tube after the first two stages, up to a given rank; the weight of an image in this matrix can be estimated using (9).
Similarly, the weight of a tube can be estimated using (10).
Finally, the temporal correlation cost of an image in the result matrix can be estimated as given in (11).
Based on the temporal correlation, the retrieved tubes are ranked. Let the ranked tubes be represented using (12), where higher-ranked tubes have higher weights.
The final ranked images are extracted by taking the highest-scoring images from the tubes, as given in (13). Figure 5 explains the whole process of tube ranking and selection of the final set of frames.
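A minimal sketch of the tube-ranking step, under simplifying assumptions: here a tube's weight is the mean score of its retrieved images, tubes are ranked by weight, and the best image of each top tube is returned. The exact weighting in (9)-(11) may differ; the function name and data layout are ours.

```python
from collections import defaultdict

def rank_tubes(retrieved, top_k=3):
    """retrieved: list of (tube_id, image_id, score) triples after the
    first two stages.  Rank tubes by mean image score and return the
    best-scoring image of each of the top_k tubes."""
    by_tube = defaultdict(list)
    for tube_id, image_id, score in retrieved:
        by_tube[tube_id].append((score, image_id))
    tube_weight = {t: sum(s for s, _ in imgs) / len(imgs)
                   for t, imgs in by_tube.items()}
    ranked = sorted(tube_weight, key=tube_weight.get, reverse=True)[:top_k]
    return [max(by_tube[t])[1] for t in ranked]  # best image per top tube

results = [(0, "a", 0.9), (0, "b", 0.3),
           (1, "c", 0.8), (1, "d", 0.7),
           (2, "e", 0.2)]
print(rank_tubes(results, top_k=2))  # ['c', 'a']
```

Note how tube 1 outranks tube 0 despite tube 0 holding the single best image: consistent scores across a tube (the temporal-correlation intuition) beat one isolated high score.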
IV Experimental Results
We have evaluated our proposed approach on two public datasets, iLIDS-VID and PRID-11, that are often used for testing video-based re-identification frameworks. In addition, we have prepared a new re-identification dataset. It has been recorded using 2 cameras in an indoor environment featuring human movement in a moderately dense crowd (more than 10 people appearing within 4-6 sq. m), varying camera angles, and persons with similar clothing. Such situations have not yet been covered in existing re-identification video datasets. Details about these datasets are presented in Table II. Several experiments have been conducted to validate our method, and a thorough comparative analysis has been performed.
|PRID-11 ||2||245||475||Large volume|
|iLIDS-VID ||2||119||300||Clothing Similarity|
|TRiViD||2||47||342||Dense, Tracking, Similarity|
Evaluation Metrics and Strategy: We have followed well-known experimental protocols for evaluating the method. For the iLIDS-VID and TRiViD datasets, the tubes are randomly split into 50% for training and 50% for testing. For PRID-11, we have followed the experimental setup proposed in [38, 36, 3, 39, 5]: only the first 200 persons who appear in both cameras of the PRID-11 dataset have been used in our experiments. A 10-fold cross-validation scheme has been adopted and the average results are reported. We have prepared Cumulative Matching Characteristics (CMC) and mean average precision (mAP) curves to evaluate and compare the performance.
IV-A Comparative Analysis
Our work, though unique in design, has some similarities with the video re-id methods proposed in [40, 38], the multiple query-based method, and the re-ranking method. Therefore, we have compared our approach with these recently proposed methods. It has been observed that the proposed method can achieve a gain of up to 9.6% over the state-of-the-art methods when top-rank accuracy is estimated. Even when we compute the accuracy up to rank 20, our method retains a margin of 3%. We consider this the key strength of the proposed method at this stage. The gain arises because our method reduces the number of false positives, an aspect that has received comparatively little attention in the re-identification research community. Figures 6-8 present the CMC curves and Table III summarizes the mAP up to rank 20 across the three datasets. Figure 9 shows a typical query and response on the PRID-11 dataset.
|SVDNet (Single Image)||76.44||69||79.11|
|SVDNet (Multiple Images)||79.21||66.71||82.66|
|SVDNet + Re-Rank||77.25||69.2||78.6|
IV-B Computational Complexity Analysis
Re-identification in real time is a challenging task. Research work carried out so far presumes the gallery is a pre-recorded set of images and tries to rank the best 5, 10, 15, or 20 images from the set. However, executing a single query takes considerable time when multiple images are involved in the query. We have carried out a comparative analysis of computational complexity across various re-identification frameworks, including the proposed scheme. An NVIDIA Quadro P5000 series GPU has been used to implement the frameworks. The results are reported in Figure 10. We have observed that the proposed tube-based re-identification framework takes less time than the video re-id framework and the multiple images-based re-id using SVDNet.
IV-C Effect of the Query Threshold
Our proposed method depends on the query threshold. In this section, we analyze its effect on the results. Figure 11 depicts the average number of query images generated from various query tubes. It may be observed that a higher query threshold produces more query images.
Figure 12 depicts the average CMC for varying query thresholds. It may be observed that the accuracy does not increase significantly when the threshold is increased above 0.4.
Figure 13 presents the execution time (in seconds) for varying query thresholds. It can also be observed that an increase in the threshold leads to a higher response time. Therefore, we have used a query threshold of 0.4 in our experiments.
IV-D Results After Various Stages
In this section, we present the effect of the various stages of the overall framework on re-identification results. Table IV shows the accuracy (CMC) after each step of the proposed method. It may be observed that the proposed method gains 11% rank-1 accuracy after the first stage and 7% rank-1 accuracy after the second stage, along with 7% rank-20 accuracy in the first stage and 6% rank-20 accuracy after the second stage. Figure 14 shows an example of scores (true positives and false positives) during the self-similarity fusion. It may be observed that SVDNet output scores and similarity scores are both high for true positives, whereas similarity scores are relatively low for false positives. More results can be found in the supplementary data.
|Method||PRID-11 (R1/R5/R10/R20)||iLIDS-VID (R1/R5/R10/R20)||TRiViD (R1/R5/R10/R20)|
|SVDNet (Multi Image)||66/76/84/89||56/68/76/86||68/71/74/89|
V Conclusion
In this paper, we propose a new person re-identification framework that outperforms existing re-identification schemes when applied to videos or sequences of frames. The method uses a CNN-based framework (SVDNet) at the beginning, a self-similarity layer to refine the SVDNet scores, and finally a temporal correlation layer to aggregate multiple query outputs and match tubes. A query optimization scheme has also been proposed to select an optimum set of images for a query tube. Our study reveals that the proposed method outperforms the state-of-the-art single image-based, multiple images-based, and video-based re-identification methods in several cases. The computational cost is also reasonably low.
Acknowledgments
The work has been funded under KIST Flagship Project (Project No.XXXX) and Global Knowledge Platform (GKP) of Indo-Korea Science and Technology Center (IKST) executed at IIT Bhubaneswar under the Project Code: XXX. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for this research.
-  Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” in ICCV. IEEE, 2017, pp. 3820–3828.
-  W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification,” in CVPR, vol. 1, no. 2, 2018, p. 6.
-  S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou, “Jointly attentive spatial-temporal pooling networks for video-based person re-identification,” in ICCV. IEEE, 2017, pp. 4743–4752.
-  J. Lv, W. Chen, Q. Li, and C. Yang, “Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns,” in CVPR, 2018, pp. 7948–7956.
-  D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang, “Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding,” in CVPR, 2018, pp. 1169–1178.
-  L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for large-scale person re-identification,” in ECCV. Springer, 2016, pp. 868–884.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015, pp. 1116–1124.
-  E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in ECCV. Springer, 2016, pp. 17–35.
-  L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian et al., “Person re-identification in the wild.” in CVPR, vol. 1, 2017, p. 2.
-  D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, “Group consistent similarity learning via deep crf for person re-identification,” in CVPR, 2018, pp. 8649–8658.
-  H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, “Adversarially occluded samples for person re-identification,” in CVPR, 2018, pp. 5098–5107.
-  J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, “Pose transferrable person re-identification,” in CVPR, 2018, pp. 4099–4108.
-  M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in CVPR, 2018.
-  Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adaptation for person re-identification,” in CVPR, 2018, pp. 5157–5166.
-  A. Barman and S. K. Shah, “Shape: A novel graph theoretic algorithm for making consensus-based decisions in person re-identification systems,” in ICCV. IEEE, 2017, pp. 1124–1133.
-  X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” in CVPR, vol. 1, 2018, p. 2.
-  D. Chung, K. Tahboub, and E. J. Delp, “A two stream siamese convolutional neural network for person re-identification,” in ICCV, 2017.
-  Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning,” in CVPR, 2018, pp. 5177–5186.
-  J. Zhang, N. Wang, and L. Zhang, “Multi-shot pedestrian re-identification via sequential decision making,” in CVPR, 2018.
-  S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel, “Learning to rank in person re-identification with metric ensembles,” in CVPR, 2015, pp. 1846–1855.
-  L. He, J. Liang, H. Li, and Z. Sun, “Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach,” in CVPR, 2018, pp. 7073–7082.
-  M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah, “Human semantic parsing for person re-identification,” in CVPR, 2018, pp. 1062–1071.
-  S. Li, S. Bak, P. Carr, and X. Wang, “Diversity regularized spatiotemporal attention for video-based person re-identification,” in CVPR, 2018, pp. 369–378.
-  W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in CVPR, vol. 1, 2018, p. 2.
-  Z. Liu, D. Wang, and H. Lu, “Stepwise metric promotion for unsupervised video person re-identification,” in ICCV. IEEE, 2017, pp. 2448–2457.
-  X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue, “Multi-scale deep learning architectures for person re-identification,” in ICCV, 2017.
-  E. Ristani and C. Tomasi, “Features for multi-target multi-camera tracking and re-identification,” in CVPR, 2018.
-  S. Roy, S. Paul, N. E. Young, and A. K. Roy-Chowdhury, “Exploiting transitivity for learning person re-identification models on a budget,” in CVPR, 2018, pp. 7064–7072.
-  Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang, “Deep group-shuffling random walk for person re-identification,” in CVPR, 2018, pp. 2265–2274.
-  Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “End-to-end deep kronecker-product matching for person re-identification,” in CVPR, 2018, pp. 6886–6895.
-  J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang, “Dual attention matching network for context-aware feature sequence based person re-identification,” arXiv preprint arXiv:1803.09937, 2018.
-  Y. Wang, Z. Chen, F. Wu, and G. Wang, “Person re-identification with cascaded pairwise convolutions,” in CVPR, 2018, pp. 1470–1478.
-  J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-aware compositional network for person re-identification,” in CVPR, 2018.
-  M. B. Jensen, K. Nasrollahi, and T. B. Moeslund, “Evaluating state-of-the-art object detector on challenging traffic light data,” in CVPRW. IEEE, 2017, pp. 882–888.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
-  T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by video ranking,” in ECCV. Springer, 2014, pp. 688–703.
-  M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person re-identification by descriptive and discriminative classification,” in SCIA. Springer, 2011, pp. 91–102.
-  N. McLaughlin, J. Martinez del Rincon, and P. Miller, “Recurrent convolutional network for video-based person re-identification,” in CVPR, 2016, pp. 1325–1334.
-  Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification,” in CVPR. IEEE, 2017, pp. 6776–6785.
-  J. You, A. Wu, X. Li, and W.-S. Zheng, “Top-push video-based person re-identification,” in CVPR, 2016, pp. 1345–1353.