Person Re-identification in Videos by Analyzing Spatio-Temporal Tubes

Typical person re-identification frameworks search for the k best matches in a gallery of images that are often collected under varying conditions. The gallery may contain image sequences when re-identification is done on videos. However, such a process is time consuming, as re-identification has to be carried out multiple times. In this paper, we extract spatio-temporal sequences of frames (referred to as tubes) of moving persons and apply multi-stage processing to match a given query tube with a gallery of stored tubes recorded through other cameras. Initially, we apply a binary classifier to remove noisy images from the input query tube. In the next step, we use key-pose detection-based query minimization, which reduces the length of the query tube by removing redundant frames. Finally, a 3-stage hierarchical re-identification framework is used to rank the output tubes by their matching scores. Experiments with publicly available video re-identification datasets reveal that our framework performs better than state-of-the-art methods. It ranks the tubes with an increased CMC accuracy of 6-8% and reduces the number of false positives. A new video re-identification dataset, named Tube-based Re-identification Video Dataset (TRiViD), has been prepared with an aim to help the re-identification research community.




I Introduction

Person re-identification (Re-Id) is useful in various intelligent video surveillance applications. The task can be considered an image retrieval problem: a query image of a person (probe) is given, and we search for that person in a set of images extracted from different cameras (gallery). The query can be a single image [1] or multiple images [2]. Multi-image queries often use early fusion of the images to generate an average query image [3]. Such methods consume more computational power than single image-based methods. Advanced hardware and efficient learning frameworks have encouraged researchers to focus on designing Re-Id systems applicable to videos. However, video-based re-identification research is still in its infancy [4, 5]. Even though the existing video Re-Id applications seem promising, such methods often fail on low-resolution videos, in crowded environments, or in the presence of significant camera angle variations. It has also been observed that the query image or video has to be selected judiciously to obtain good retrieval results; an improper choice may lead to poor retrieval quality. In this paper, we detect and track moving humans and construct spatio-temporal tubes that are used in the re-identification framework. We also propose a method for selecting an optimum set of key-pose images and use a 3-stage learning framework to re-identify persons appearing in different cameras. To accomplish this, we make the following contributions:

  • We propose a learning-based method to select an optimum set of key pose images to reconstruct the query tube by minimizing its length in terms of number of frames.

  • We propose a 3-stage hierarchical framework that has been built using (i) SVDNet guided Re-Id architecture, (ii) self-similarity estimation, and (iii) temporal correlation analysis to rank the tubes of the gallery.

  • We introduce a new video dataset, named Tube-based Re-identification Video Dataset (TRiViD) that has been prepared with an aim to help the re-identification research community.

The rest of the paper is organized as follows. In Section II, we discuss the state of the art in person re-identification research. Section III presents the proposed Re-Id framework and its components. Experimental results are presented in Section IV. Conclusions and future work are presented in Section V.

II Related Work

Person re-identification applications are growing rapidly in number. However, the enormous growth in CCTV surveillance has thrown up various challenges for the re-identification research community. The primary challenges are handling large volumes of data [6, 7], tracking in complex environments [8, 9], the presence of groups [10], occlusion [11], and varying pose and style across different cameras [2, 12, 13, 14]. Re-Id methods can be categorized as image-guided [15, 16, 10, 2] and video-guided [4, 5, 17, 18, 19]. The image-guided methods typically use deep neural networks for feature representation and re-identification, whereas the video-guided methods typically use recurrent convolutional networks to embed temporal information such as optical flow [17] or sequences of poses. Table I summarizes recent progress in person re-identification. In recent years, late fusion of different scores [15, 20] has shown significant improvement in the final ranking. Our method is similar to a typical delayed or late fusion guided method: we refine search results obtained using convolutional neural networks with the help of temporal correlation analysis.

| Reference | Method overview | Paper title |
|---|---|---|
| McLaughlin et al. [38] | Motion and image-based features | Recurrent Convolutional Network for Video-based Person Re-identification |
| Barman et al. [15] | Graph theory and multiple-algorithm fusion-based algorithm | SHaPE: A Novel Graph Theoretic Algorithm for Making Consensus-based Decisions in Person Re-identification Systems |
| Chang et al. [16] | Visual appearance and multiple semantic-level features | Multi-Level Factorization Net for Person Re-Identification |
| Chen et al. [10] | Fusion of local similarity and group similarity-based DNN and CRF | Group Consistent Similarity Learning via Deep CRF for Person Re-Identification |
| Chen et al. [5] | Divides a long person sequence into short snippets and matches snippets for re-identification | Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding |
| Chung et al. [17] | Learns spatial and temporal similarity and uses weighted fusion | A Two Stream Siamese Convolutional Neural Network For Person Re-Identification |
| Deng et al. [2] | Learns self-similarity and domain dissimilarity | Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification |
| He et al. [21] | Deep pixel-level CNN for person re-identification from partially observed images | Deep Spatial Feature Reconstruction for Partial Person Re-identification: Alignment-free Approach |
| Huang et al. [11] | Augmented training data generation for person re-identification | Adversarially Occluded Samples for Person Re-identification |
| Kalayeh et al. [22] | Human semantic parts model used to train state-of-the-art deep networks and calculate a weighted average | Human Semantic Parsing for Person Re-identification |
| Li et al. [23] | Distinct body parts-based attention model for re-identification | Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification |
| Li et al. [24] | Harmonious attention network using pixel-level and bounding box-level attention as features | Harmonious Attention Network for Person Re-Identification |
| Liu et al. [12] | Augments the poses of persons to generate a training set used to re-identify persons | Pose Transferrable Person Re-Identification |
| Liu et al. [25] | Tracklets are used for training and re-identification | Stepwise Metric Promotion for Unsupervised Video Person Re-identification |
| Lv et al. [4] | Transfer learning is used to learn spatio-temporal patterns in an unsupervised manner | Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatial-Temporal Patterns |
| Fu et al. [26] | Multi-scale feature representation with selection of the correct scale for matching | Multi-scale Deep Learning Architectures for Person Re-identification |
| Tomasi et al. [27] | Method for selecting good features for re-identification | Features for Multi-Target Multi-Camera Tracking and Re-Identification |
| Roy et al. [28] | Minimizes the labeling effort by choosing a minimum set of images for the labeling task in re-identification | Exploiting Transitivity for Learning Person Re-identification Models on a Budget |
| Sarfraz et al. [13] | Fine and coarse pose information for deep re-identification | A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking |
| Shen et al. [29] | Group-shuffling random walk network for fully utilizing train and test images | Deep Group-shuffling Random Walk for Person Re-identification |
| Shen et al. [30] | Kronecker Product Matching module to match feature maps of different persons in an end-to-end trainable deep neural network | End-to-End Deep Kronecker-Product Matching for Person Re-identification |
| Si et al. [31] | Learns context-aware feature sequences and performs attentive sequence comparison simultaneously | Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification |
| Wang et al. [32] | Deep architecture named BraidNet, whose cascaded WConv structure learns to extract comparison features of images | Person Re-identification with Cascaded Pairwise Convolutions |
| Wu et al. [18] | Exploits unsupervised Convolutional Neural Network (CNN) feature representation via stepwise learning | Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning |
| Xu et al. [33] | Body parts-based attention network for re-identification | Attention-Aware Compositional Network for Person Re-identification |
| Xu et al. [3] | Joint Spatial and Temporal Attention Pooling Network (ASTPN) applied to video sequences | Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification |
| Zhang et al. [19] | Sequential decision making is used to identify each frame in a video | Multi-shot Pedestrian Re-identification via Sequential Decision Making |
| Zhong et al. [14] | Style transfer across different cameras to improve re-identification | Camera Style Adaptation for Person Re-identification |

TABLE I: Recent progress in person re-identification research
Fig. 1: The proposed method for tube-to-tube re-identification. Our contributions are marked with circles. The method takes a tube as the query and ranks the gallery tubes by best match.

III Proposed Approach

Our method can be regarded as tracking followed by re-identification. Moving persons are tracked using Simple Online Deep Tracking (SODT), which has been developed using the YOLO [34] framework. A tube is defined as the sequence of spatio-temporal frames of a moving person. Training is done using the videos captured by one camera, and videos captured by the other cameras are used to construct the gallery of tubes. Assume that the gallery contains a set of tubes, as given in (1).


Suppose that each tube in the gallery contains a sequence of frames, as given in (2).
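
For concreteness, the gallery and tube definitions referred to as (1) and (2) could be written as follows. The symbols $\mathcal{G}$, $T_i$, $f_{i,j}$, $n$, and $m_i$ are hypothetical notation chosen here for illustration and may differ from the paper's own.

```latex
\begin{align*}
\mathcal{G} &= \{T_1, T_2, \ldots, T_n\}, \\
T_i &= \{f_{i,1}, f_{i,2}, \ldots, f_{i,m_i}\}, \qquad 1 \le i \le n .
\end{align*}
```

Under this reading, $n$ is the number of tubes in the gallery, $m_i$ is the number of frames in the $i$-th tube, and the frames of each tube are kept in temporal order.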


At the time of re-identification, a query tube is given as a probe. First, the noisy frames are eliminated and the query tube is minimized. Next, frames of the revised query tube are passed through a 3-stage hierarchical re-ranking process to get the final ranking of the tubes in the gallery. The method is depicted in Figure 1.

III-A Query Minimization

Re-identification using multiple images usually performs better than single image-based frameworks. However, it consumes more computational power. Also, selecting a set of frames that can uniquely represent a tube can be challenging. To address this, we have used a deep similarity matching architecture to select a set of representative frames based on pose dissimilarity. First, a query tube is passed through a binary classifier to remove noisy frames (e.g., blurry, cropped, or low-quality frames). Next, a ResNet50 [35] framework is trained using a few query tubes containing similar-looking images. The similarity cost is calculated using (3).


The output query tube contains no more images than the input tube. The images in the optimized query tube can be represented using (4).


The pairwise query cost function between a given frame and another frame is defined in (5).


The loss of energy is defined as given in (6).


The optimal query energy is defined in (7). It involves the set of images that are not included in the optimized query tube and a weighting parameter called the query threshold (between 0 and 1). A larger query threshold produces a higher number of images in the optimized query tube.


Figure 2 depicts these steps and the minimized query images for the TRiViD dataset.

Fig. 2: Examples of original tube (first row), detected noisy frames (second row), tube after noise removal (third row), and minimized tube for query execution (fourth row) taken from the TRiViD dataset.
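
The query minimization idea can be illustrated with a minimal sketch. It is not the energy formulation of (5)-(7): it assumes a simple greedy rule in which a frame is kept only if its cosine similarity to every frame already kept is below the query threshold. The function name `minimize_query`, the toy features, and the greedy rule itself are assumptions made for illustration.

```python
import numpy as np

def minimize_query(features: np.ndarray, threshold: float) -> list:
    """Greedily pick indices of frames whose pose features differ enough.

    A frame is kept only if its cosine similarity to every frame kept so far
    is below `threshold`, so a larger threshold keeps more frames.
    This greedy rule is an illustrative assumption, not the paper's energy.
    """
    if features.ndim != 2 or len(features) == 0:
        raise ValueError("features must be a non-empty (frames, dims) array")
    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    kept = [0]  # always keep the first frame as a reference pose
    for idx in range(1, len(unit)):
        similarity = unit[kept] @ unit[idx]
        if similarity.max() < threshold:
            kept.append(idx)
    return kept

# Toy tube: frames 0-1 share one pose, frames 2-3 share another.
toy_tube = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
selected = minimize_query(toy_tube, threshold=0.9)
```

With the toy tube above, one frame per pose survives at a threshold of 0.9, while a threshold of 1.0 keeps all four frames. This matches the qualitative behaviour described in the text, where a larger query threshold yields more query images.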

III-B Image Re-identification using SVDNet

Our proposed method uses single image-based re-identification at the top layer of the hierarchy. We have used the Singular Vector Decomposition Network (SVDNet) [1] as the baseline. It uses a convolutional neural network with an eigenlayer before the fully connected layer. The eigenlayer consists of a set of weights. Figure 3 demonstrates the architecture of a typical SVDNet. The output of SVDNet is a set of retrieved images ranked up to a chosen rank, as given in (8).

Fig. 3: Architecture of the SVDNet used in the first stage of the re-identification framework shown in Figure 1. It contains an eigenlayer before the fully connected layer. The eigenlayer contains the weights to be used during training.
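
The weight decorrelation idea behind the eigenlayer, as we understand it from [1], can be sketched with NumPy: the weight matrix W is decomposed as W = U S V^T and replaced by U S, which makes its columns orthogonal without changing distances between projected features. The layer shape and the random data below are illustrative assumptions, and the full training procedure of [1] is not reproduced.

```python
import numpy as np

def decorrelate_weights(weight: np.ndarray) -> np.ndarray:
    """Replace W with U @ diag(S), where W = U @ diag(S) @ Vt."""
    u, s, _vt = np.linalg.svd(weight, full_matrices=False)
    return u * s  # scales column j of U by singular value s[j]

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 4))  # hypothetical layer: 8 inputs, 4 outputs
new_weight = decorrelate_weights(weight)

# Columns of the new weight matrix are mutually orthogonal.
gram = new_weight.T @ new_weight
off_diagonal = gram - np.diag(np.diag(gram))

# Distances between projected features are unchanged, because V is orthogonal.
a, b = rng.normal(size=8), rng.normal(size=8)
dist_before = np.linalg.norm(weight.T @ (a - b))
dist_after = np.linalg.norm(new_weight.T @ (a - b))
```

Because the Euclidean distances are preserved, the replacement does not by itself change retrieval results; it only removes the correlation between the weight vectors before further training.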

III-C Self Similarity Guided Re-ranking

In the next step, we aggregate the self-similarity scores with the SVDNet outputs. A typical ResNet50 [35] architecture has been trained to learn self-similarity scores using the tubes of the query set. We assume that the images available in a tube are similar. Next, a similarity score is calculated between the query image and every output image of the SVDNet up to the chosen rank. Finally, the scores are averaged and the images are re-ranked. This step ensures that dissimilar images are pushed toward the end of the ranked sequence of retrieved images. Figure 4 illustrates this method.

Fig. 4: The self similarity estimation layer. It learns to measure self-similarity during training. We use ResNet50 [35] as the baseline. It takes a set of ranked images (SVDNet outputs) and produces a set of ranked images by introducing self-similarities between the query image and the retrieved images.
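
The averaging and re-ranking step can be sketched as follows. The sketch assumes that the SVDNet scores and the self-similarity scores are already on a comparable scale (for example, both in [0, 1]); the function name and the example scores are hypothetical.

```python
import numpy as np

def rerank_by_self_similarity(svd_scores, similarity_scores):
    """Average two score lists and return (order, fused_scores).

    `order[i]` is the index (into the SVDNet ranked list) of the image that
    ends up at position i after re-ranking; higher fused score ranks first.
    """
    svd_scores = np.asarray(svd_scores, dtype=float)
    similarity_scores = np.asarray(similarity_scores, dtype=float)
    if svd_scores.shape != similarity_scores.shape:
        raise ValueError("both score arrays must have the same shape")
    fused = (svd_scores + similarity_scores) / 2.0
    # Stable sort on negated scores keeps the SVDNet order for ties.
    order = np.argsort(-fused, kind="stable")
    return order, fused[order]

svd_scores = [0.90, 0.85, 0.80, 0.70]         # SVDNet scores, already ranked
similarity_scores = [0.95, 0.20, 0.90, 0.75]  # self-similarity to the query
order, fused = rerank_by_self_similarity(svd_scores, similarity_scores)
```

In this example the second SVDNet result has a low self-similarity score, so it drops to the last position after fusion, which is the intended effect of pushing dissimilar images toward the end of the list.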

III-D Tube Ranking by Temporal Correlation

The final step of the proposed method is to rank the tubes by temporal correlation among the retrieved images. We assume that the images belonging to a single tube are temporally correlated, as they are extracted by detection and tracking. Consider the result matrix, up to the chosen rank, obtained for the query tube after the first two stages. The weight of an image in this result matrix can be estimated using (9).


Similarly, the weight of a tube can be estimated using (10).


Finally, the temporal correlation cost of an image in the result matrix can be estimated as given in (11).


Based on the temporal correlation, the retrieved tubes are ranked. Let the ranked tubes up to the chosen rank be represented using (12), where higher-ranked tubes have higher weights.


The final ranked images are extracted by taking the highest scoring images from the tubes. The final ranked images are given in (13). Figure 5 explains the whole process of tube ranking and selection of final set of frames.

Fig. 5: Illustration of the proposed 3-stage re-identification framework depicted in Figure 1.
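
The tube ranking stage can be illustrated with a simplified sketch. It does not implement (9)-(13): it assumes that the weight of a tube is the sum of the fused scores of the retrieved images that belong to it, and that the final image of each tube is its highest-scoring retrieved image. The function name, identifiers, and scores are hypothetical.

```python
from collections import defaultdict

def rank_tubes(retrieved):
    """Rank gallery tubes from (tube_id, image_id, score) retrievals.

    Returns a list of (tube_id, tube_weight, best_image_id), sorted so that
    the tube with the largest accumulated score comes first.
    """
    tube_weight = defaultdict(float)
    best_image = {}
    for tube_id, image_id, score in retrieved:
        tube_weight[tube_id] += score
        if tube_id not in best_image or score > best_image[tube_id][1]:
            best_image[tube_id] = (image_id, score)
    ranked = sorted(tube_weight.items(), key=lambda item: item[1], reverse=True)
    return [(tube, weight, best_image[tube][0]) for tube, weight in ranked]

retrieved = [
    ("tube_A", "a3", 0.9), ("tube_B", "b1", 0.8), ("tube_A", "a7", 0.7),
    ("tube_C", "c2", 0.6), ("tube_A", "a1", 0.5), ("tube_B", "b4", 0.4),
]
ranking = rank_tubes(retrieved)
```

Images retrieved from the same tube reinforce each other, so a tube that contributes several results outranks a tube that contributes a single isolated image. This is one way an isolated false positive can be pushed down the final ranking.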

IV Experiments

We have evaluated our proposed approach on two public datasets, iLIDS-VID [36] and PRID-11 [37], which are often used for testing video-based re-identification frameworks. In addition, we have prepared a new re-identification dataset. It has been recorded using 2 cameras in an indoor environment and features human movement in a moderately dense crowd (more than 10 people appearing within 4-6 sq. m), varying camera angles, and persons with similar clothing. Such situations have not yet been covered by existing re-identification video datasets. Details of these datasets are presented in Table II. Several experiments have been conducted to validate our method, and a thorough comparative analysis has been performed.

| Dataset | Number of | | | Challenges |
|---|---|---|---|---|
| PRID-11 [37] | 2 | 245 | 475 | Large volume |
| iLIDS-VID [36] | 2 | 119 | 300 | Clothing similarity |
| TRiViD | 2 | 47 | 342 | Dense, tracking, similarity |

TABLE II: Datasets used in our experiments. Only the TRiViD dataset is tracked to extract tubes. In the other datasets, the given sequences of images are considered as tubes.

Evaluation Metrics and Strategy: We have followed well-known experimental protocols for evaluating the method. For the iLIDS-VID and TRiViD datasets, the tubes are randomly split into 50% for training and 50% for testing. For PRID-11, we have followed the experimental setup proposed in [38, 36, 3, 39, 5]. Only the first 200 persons who appear in both cameras of the PRID-11 dataset have been used in our experiments. A 10-fold cross-validation scheme has been adopted and the average results are reported. We report Cumulative Matching Characteristics (CMC) curves and mean average precision (mAP) to evaluate and compare the performance.
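
The two metrics can be computed as in the following minimal sketch, which uses their standard definitions. The example ranks and relevance flags are made up for illustration, and truncating each ranked list (for example, to rank 20) before calling `average_precision` is an assumption about how "mAP up to rank 20" would be obtained.

```python
import numpy as np

def cmc_curve(first_match_ranks, max_rank):
    """CMC[k-1] = fraction of queries whose first correct match has rank <= k.

    `first_match_ranks` holds one 1-based rank per query.
    """
    ranks = np.asarray(first_match_ranks)
    return np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])

def average_precision(relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

# Five hypothetical queries; the last one has no correct match within rank 5.
cmc = cmc_curve([1, 3, 2, 1, 6], max_rank=5)
# One hypothetical ranked list with correct matches at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0])
```

The mAP is then the mean of `average_precision` over all queries, while the CMC curve is non-decreasing in the rank by construction.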

IV-A Comparative Analysis

Although our work is unique in design, it has some similarities with the video Re-Id methods proposed in [40, 38], the multiple query-based method [1], and the re-ranking method [20]. Therefore, we have compared our approach with these three groups of recently proposed methods. The proposed method achieves a gain of up to 9.6% over the state-of-the-art methods in top-rank accuracy. Even when accuracy is computed up to rank 20, our method leads by a margin of 3%. This is the key strength of the proposed method. It arises because our method tries to reduce the number of false positives, which has not yet been addressed by the re-identification research community. Figures 6-8 present the CMC curves, and Table III summarizes the mAP up to rank 20 across the three datasets. Figure 9 shows a typical query and response on the PRID-11 dataset.

Fig. 6: The accuracy (CMC) in PRID-11 dataset using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20].
Fig. 7: The accuracy (CMC) in iLIDS dataset using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20].
Fig. 8: The accuracy (CMC) using the TRiViD dataset with the help of RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20].
Fig. 9: Typical results obtained using PRID-11 dataset using single image query [1], video sequence [38], and using the proposed method. Green box indicates a correct retrieval.
| Method / Dataset | PRID | iLIDS | TRiViD (New) |
|---|---|---|---|
| RCNN [38] | 81.2 | 74.6 | 79.11 |
| TDL [40] | 78.2 | 74.1 | 80 |
| Video-ReId [38] | 73.31 | 64.29 | 83.22 |
| SVD Net [1] (Single Image) | 76.44 | 69 | 79.11 |
| SVD Net [1] (Multiple Images) | 79.21 | 66.71 | 82.66 |
| SVD Net+Re Rank [20] | 77.25 | 69.2 | 78.6 |
| Proposed | 86.17 | 79.22 | 91.66 |

TABLE III: mAP (%) up to rank 20 across the three video datasets

IV-B Computational Complexity Analysis

Re-identification in real time is a challenging task. All research work carried out so far presumes the gallery to be a pre-recorded set of images and tries to rank the best 5, 10, 15, or 20 images from the set. However, executing a single query takes considerable time when multiple images are involved in the query. We have carried out a comparative analysis of the computational complexity of various re-identification frameworks, including the proposed scheme. An NVIDIA Quadro P5000 series GPU has been used to implement the frameworks. The results are reported in Figure 10. We have observed that the proposed tube-based re-identification framework takes less time than the video Re-Id framework proposed in [38] and the multiple images-based Re-Id using SVDNet [1].

Fig. 10: Average response time (in seconds) for a given query across the datasets. We have taken 100 query tubes at random and calculated the average response time with the help of RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20].

IV-C Effect of the Query Threshold

Our proposed method depends on the query threshold. In this section, we present an analysis of the effect of the query threshold on the results. Figure 11 depicts the average number of query images generated from various query tubes. It may be observed that a higher query threshold produces more query images.

Fig. 11: Average number of query images by varying the query threshold. We have taken 100 query sequences at random, and the average number of optimized images is reported. It may be observed that a higher query threshold produces more query images.

Figure 12 depicts the average CMC accuracy by varying the query threshold. It may be observed that the accuracy does not increase significantly when the query threshold is increased above 0.4.

Fig. 12: Accuracy (CMC) by varying the query threshold. We have taken 100 query sequences at random, and the average is reported. It may be observed that a higher query threshold may not produce higher accuracy.

Figure 13 presents the execution time (in seconds) by varying the query threshold. It can also be observed that an increase in the query threshold leads to a higher response time. Based on these observations, we have fixed the query threshold in our experiments.

Fig. 13: Execution time by varying the query threshold. It may be observed that a higher query threshold takes more time to execute, as it produces more query images.

IV-D Results After Various Stages

In this section, we present the effect of the various stages of the overall framework on the re-identification results. Table IV shows the accuracy (CMC) at each step of the proposed method. It may be observed that the proposed method gains 11% rank-1 accuracy after the first stage and 7% rank-1 accuracy after the second stage. The method gains 7% rank-20 accuracy in the first stage and 6% rank-20 accuracy after the second stage. Figure 14 shows an example of scores (true positives and false positives) during the self-similarity fusion. It may be observed that both the SVDNet output scores and the similarity scores are high for true positives, whereas the similarity scores are relatively low for false positives. More results can be found in the supplementary data.

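| Method / Top rank | PRID11 [37] rank 1 | PRID11 [37] rank 5 | PRID11 [37] rank 10 | PRID11 [37] rank 20 | iLIDS [36] rank 1 | iLIDS [36] rank 5 | iLIDS [36] rank 10 | iLIDS [36] rank 20 | TRiViD rank 1 | TRiViD rank 5 | TRiViD rank 10 | TRiViD rank 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVD Net (Multi Image) | 66 | 76 | 84 | 89 | 56 | 68 | 76 | 86 | 68 | 71 | 74 | 89 |
| SVD Net+Self-similarity | 69 | 77 | 84 | 89 | 61 | 71 | 79 | 86 | 71 | 77 | 76 | 91 |
| SVD Net+Self-similarity+Temporal Correlation (Proposed) | 78 | 89 | 92 | 91 | 67 | 84 | 91 | 96 | 79 | 88 | 91 | 98 |

TABLE IV: Accuracy (CMC) in each step of the proposed method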
Fig. 14: Typical examples of SVDNet outputs and self-similarity scores in TRiViD (first two rows) and PRID-11 [37] (last row).

V Conclusion

In this paper, we propose a new person re-identification framework that is able to outperform existing re-identification schemes when applied to videos or sequences of frames. The method uses a CNN-based framework (SVDNet) at the beginning. A self-similarity layer is used to refine the SVDNet scores. Finally, a temporal correlation layer is used to aggregate multiple query outputs and to match tubes. A query optimization step has also been proposed to select an optimum set of images for a query tube. Our study reveals that the proposed method outperforms the state-of-the-art single image-based, multiple images-based, and video-based re-identification methods in several cases. The computational cost is also reasonably low.

One straightforward extension of the present work is to fuse methods such as camera pose-based [2], video-based [38], and description-based [16] approaches. This may lead to higher accuracy in complex situations. Also, group re-identification can be attempted using a similar concept of tube-guided analysis.


The work has been funded under KIST Flagship Project (Project No.XXXX) and Global Knowledge Platform (GKP) of Indo-Korea Science and Technology Center (IKST) executed at IIT Bhubaneswar under the Project Code: XXX. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for this research.


  • [1] Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” in ICCV.   IEEE, 2017, pp. 3820–3828.
  • [2] W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification,” in CVPR, vol. 1, no. 2, 2018, p. 6.
  • [3] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou, “Jointly attentive spatial-temporal pooling networks for video-based person re-identification,” in ICCV.   IEEE, 2017, pp. 4743–4752.
  • [4] J. Lv, W. Chen, Q. Li, and C. Yang, “Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns,” in CVPR, 2018, pp. 7948–7956.
  • [5] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang, “Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding,” in CVPR, 2018, pp. 1169–1178.
  • [6] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for large-scale person re-identification,” in ECCV.   Springer, 2016, pp. 868–884.
  • [7] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015, pp. 1116–1124.
  • [8] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in ECCV.   Springer, 2016, pp. 17–35.
  • [9] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian et al., “Person re-identification in the wild.” in CVPR, vol. 1, 2017, p. 2.
  • [10] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, “Group consistent similarity learning via deep crf for person re-identification,” in CVPR, 2018, pp. 8649–8658.
  • [11] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, “Adversarially occluded samples for person re-identification,” in CVPR, 2018, pp. 5098–5107.
  • [12] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, “Pose transferrable person re-identification,” in CVPR, 2018, pp. 4099–4108.
  • [13] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in CVPR, 2018.
  • [14] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adaptation for person re-identification,” in CVPR, 2018, pp. 5157–5166.
  • [15] A. Barman and S. K. Shah, “Shape: A novel graph theoretic algorithm for making consensus-based decisions in person re-identification systems,” in ICCV.   IEEE, 2017, pp. 1124–1133.
  • [16] X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” in CVPR, vol. 1, 2018, p. 2.
  • [17] D. Chung, K. Tahboub, and E. J. Delp, “A two stream siamese convolutional neural network for person re-identification,” in ICCV, 2017.
  • [18] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning,” in CVPR, 2018, pp. 5177–5186.
  • [19] J. Zhang, N. Wang, and L. Zhang, “Multi-shot pedestrian re-identification via sequential decision making,” in CVPR, 2018.
  • [20] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel, “Learning to rank in person re-identification with metric ensembles,” in CVPR, 2015, pp. 1846–1855.
  • [21] L. He, J. Liang, H. Li, and Z. Sun, “Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach,” in CVPR, 2018, pp. 7073–7082.
  • [22] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah, “Human semantic parsing for person re-identification,” in CVPR, 2018, pp. 1062–1071.
  • [23] S. Li, S. Bak, P. Carr, and X. Wang, “Diversity regularized spatiotemporal attention for video-based person re-identification,” in CVPR, 2018, pp. 369–378.
  • [24] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in CVPR, vol. 1, 2018, p. 2.
  • [25] Z. Liu, D. Wang, and H. Lu, “Stepwise metric promotion for unsupervised video person re-identification,” in ICCV.   IEEE, 2017, pp. 2448–2457.
  • [26] X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue, “Multi-scale deep learning architectures for person re-identification,” in ICCV, 2017.
  • [27] E. Ristani and C. Tomasi, “Features for multi-target multi-camera tracking and re-identification,” in CVPR, 2018.
  • [28] S. Roy, S. Paul, N. E. Young, and A. K. Roy-Chowdhury, “Exploiting transitivity for learning person re-identification models on a budget,” in CVPR, 2018, pp. 7064–7072.
  • [29] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang, “Deep group-shuffling random walk for person re-identification,” in CVPR, 2018, pp. 2265–2274.
  • [30] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “End-to-end deep kronecker-product matching for person re-identification,” in CVPR, 2018, pp. 6886–6895.
  • [31] J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang, “Dual attention matching network for context-aware feature sequence based person re-identification,” arXiv preprint arXiv:1803.09937, 2018.
  • [32] Y. Wang, Z. Chen, F. Wu, and G. Wang, “Person re-identification with cascaded pairwise convolutions,” in CVPR, 2018, pp. 1470–1478.
  • [33] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-aware compositional network for person re-identification,” in CVPR, 2018.
  • [34] M. B. Jensen, K. Nasrollahi, and T. B. Moeslund, “Evaluating state-of-the-art object detector on challenging traffic light data,” in CVPRW.   IEEE, 2017, pp. 882–888.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [36] T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by video ranking,” in ECCV.   Springer, 2014, pp. 688–703.
  • [37] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person re-identification by descriptive and discriminative classification,” in SCIA.   Springer, 2011, pp. 91–102.
  • [38] N. McLaughlin, J. Martinez del Rincon, and P. Miller, “Recurrent convolutional network for video-based person re-identification,” in CVPR, 2016, pp. 1325–1334.
  • [39] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification,” in CVPR.   IEEE, 2017, pp. 6776–6785.
  • [40] J. You, A. Wu, X. Li, and W.-S. Zheng, “Top-push video-based person re-identification,” in CVPR, 2016, pp. 1345–1353.