Ordered or Orderless: A Revisit for Video based Person Re-Identification

12/24/2019 ∙ by Le Zhang, et al.

Is a recurrent network really necessary for learning a good visual representation for video based person re-identification (VPRe-id)? In this paper, we first show that the common practice of employing recurrent neural networks (RNNs) to aggregate spatial-temporal features may not be optimal. Specifically, through a diagnostic analysis, we show that the recurrent structure may be less effective at learning temporal dependencies than expected and implicitly yields an orderless representation. Based on this observation, we then present a simple yet surprisingly powerful approach for VPRe-id, in which we treat VPRe-id as an efficient orderless ensemble of image based person re-identification problems. More specifically, we divide videos into individual images and re-identify persons with an ensemble of image based rankers. Under the i.i.d. assumption, we provide an error bound that sheds light upon how VPRe-id can be improved. Our work also presents a promising way to bridge the gap between video and image based person re-identification. Comprehensive experimental evaluations demonstrate that the proposed solution achieves state-of-the-art performance on multiple widely used datasets (iLIDS-VID, PRID 2011, and MARS).




1 Introduction

Person re-identification (Re-id) addresses the problem of re-associating persons across disjoint camera views. In this paper, we consider the more practical scenario of video based person re-identification (VPRe-id), in which a video of a person, as seen in one camera, must be matched against a gallery of videos captured by a different, non-overlapping camera. VPRe-id is an active research topic in computer vision due to its wide-ranging applications, including visual surveillance and forensics. Since the pioneering work [14], several visual features [12, 25] and learning methods [18, 19, 57] have consistently improved the matching performance, leading the research community to address more challenging scenarios in complex datasets [18, 52, 61]. However, significant hurdles due to variations in appearance, viewpoint, illumination, and occlusion stand in the way of solving the problem.

Fig. 1: Motivation. First two rows: existing methods adopt RNNs to model temporal dependencies for VPRe-id. Last two rows: human beings can easily perform this task on image sequences with a random order.

Recently, deep convolutional neural networks (ConvNets) stand at the forefront of several vision tasks, including image classification [24, 49, 46, 17], segmentation [34], pose estimation, face recognition [41], crowd counting [44], and image based person re-identification [28], to mention a few. Deep learning for VPRe-id is likewise witnessing vast popularity. Typically, prior methods [35, 57, 53, 6, 56, 62, 4] process videos by using RNNs to temporally aggregate spatial information extracted from ConvNets.

However, unlike other sequential modelling tasks [3], little is known about the benefit of using RNNs to model temporally extracted spatial information for VPRe-id. Although theoretically fascinating, we show that a typical recurrent structure may be less effective at capturing temporal dependencies than assumed. Through a simplified analysis, we find that the pioneering work [35] of using RNNs for VPRe-id, as well as its follow-ups, leads to an orderless representation. This motivates us to reconsider the task of VPRe-id as illustrated in Fig. 1. According to the separable visual pathway hypothesis [16], the human visual cortex contains the ventral stream and the dorsal stream, which recognize objects and how they move, respectively. Existing methods [35] employing RNNs overemphasize the temporal dependencies related to a person's motion. Unfortunately, these behavioral biometrics suffer from large intra/inter-class variations, making it difficult for existing RNNs to generalize. In this work we postulate that the appearance information from a single still image plays a more important role in re-associating persons from different cameras. Considering the example in Fig. 1, human beings can easily classify whether the images in two cameras come from the same identity, even when the image frames in the second camera have been manually shuffled to remove temporal dependencies.

The benefits of orderless encoding in video analytics are not controversial, and we are certainly not the first to inject orderless encoding into video based tasks. Even in tasks such as human action recognition, where the temporal dependency between video frames is considered to be vital, orderless representations [51, 15, 63] can still demonstrate their superiority over recurrent structures [11]. While we take inspiration from these works and the recent success of image based person re-identification, we are the first to dive deep into the effect of modelling temporal information for VPRe-id. More specifically, we present a simple yet surprisingly powerful approach that divides videos into individual images and performs VPRe-id with an orderless ensemble of shared image based rankers. We show that, under the i.i.d. assumption, the ensemble of rankers is consistent and the error bound of VPRe-id decreases exponentially with the number of image frames, provided each base ranker is better than a random guess. Our work opens a path towards putting more emphasis on appearance information for VPRe-id. It not only achieves state-of-the-art performance on multiple benchmarks (iLIDS-VID, PRID 2011, and MARS) but also bridges the gap between video and image based person re-identification.

2 Related Work

2.1 Recurrent Structures for VPRe-id

Recurrent structures, e.g., RNNs, LSTMs, and GRUs, have been widely used to temporally aggregate spatial information extracted from each video frame. Yan et al. [57] adopt Long Short-Term Memory (LSTM) networks to sequentially fuse hand-crafted features. In the same way, recurrent networks [35] are used to aggregate discriminative features extracted by ConvNets from individual video frames. Wu et al. [53] consider convolutional activations at low levels, which are embedded into Gated Recurrent Units (GRUs) to capture temporal patterns. Bidirectional RNNs are introduced in [59] to integrate convolutional features. Deep Feature Guided Pooling (DFGP) is proposed in [30] to jointly use deep features and LOMO hand-crafted features. A similar "CNN-RNN" recipe is trained in [6] on full-body and part-body image sequences respectively to learn complementary representations from holistic and local perspectives. The Spatial and Temporal Attention Pooling Network (ASTPN) [56] extends standard CNN-RNNs by decomposing pooling into spatial pooling on feature maps from the CNN and attentive temporal pooling on the output of the RNN. Spatial recurrent models and temporal attention models are proposed in [62] to select discriminative frames and exploit contextual information when measuring similarity. A recurrent-based competitive snippet-similarity aggregation and co-attentive snippet embedding is introduced in [4]. [48] extracts spatial-temporal features using a novel recurrent-based spatial-temporal synergic residual network.

2.2 Image based Person Re-id

Compared with VPRe-id, image based person re-identification has been more extensively studied. A filter pairing neural network is designed in [28] to jointly handle misalignment and geometric transformations. A cross-input difference CNN is proposed in [1] to capture local relationships between two input images based on mid-level features from each. Xiao et al. introduce a domain guided drop-out technique to mitigate the domain gaps across various datasets [55]. Other works aim at designing more robust loss functions, such as the triplet loss [10, 8, 60] to enforce the correct order of relative distances among image triplets, or the quadruplet loss [7] to combine the advantages of the contrastive and triplet losses. The similarities between different gallery images are reported to be beneficial for re-identification in [43, 5]. Song et al. [47] propose a novel region-level triplet loss that pulls features from the full image and the body region close while pushing features from backgrounds away.

2.3 Bridging the Research Gap

Despite the abundant work using RNNs for VPRe-id, almost none of it has demonstrated that the improved accuracy of VPRe-id using RNNs can indeed be traced back to the recurrent structure encoding temporal information that is missing in image based person re-id. Instead, the community has largely taken it for granted that recurrent structures successfully capture the temporal dependency in VPRe-id. We make the first attempt to investigate the effect of leveraging temporal information by using recurrent structures for VPRe-id. Note that other non-recurrent methods for VPRe-id include [58], which solves VPRe-id by reinforcement learning; however, it does not study the effect of frame order in a systematic manner. Another example is [2], in which the authors derive an approximation of the recurrent connections and achieve faster yet competitive results with a simpler feed-forward architecture. Other methods [62] use recurrent structures to model context information, whose effectiveness has been verified recently [45]; however, they still lack a systematic study of the capability of RNN context modelling with respect to frame order. Our diagnostic analysis, presented in Sec. 4, shows the deficiency of the commonly used "CNN-RNN" pipeline: the recurrent structure is less effective in modelling temporal information than expected. We then speculate that orderless approaches may be more pertinent for the task and introduce a simple yet powerful solution to bridge these two research domains. We formulate VPRe-id as an orderless ensemble of rankers and employ image based person re-identification methods as base rankers on each individual frame. Theoretical analysis is provided to further guarantee the performance of the proposed method. Our approach differs significantly from previous work based on recurrent structures, as it does not rely on the temporal information in the video.

3 Notation and Problem Definition

Denote by V = {(p_i, g_i)}_{i=1}^N a collection of pedestrian videos, where p_i (probe video) and g_i (gallery video) stand for the pedestrian videos captured by two disjoint cameras A and B, and N is the number of videos (for simplicity we consider only two camera views; the extension to more views is straightforward). Each video is defined as a set of consecutive frames, i.e., v^c = {x_1^c, ..., x_T^c}, where T is the number of frames for the video captured by camera c. We will denote a generic data point by x, ignoring the subscript and superscript for convenience. In the same way, a generic video input sequence is denoted by v = {x_1, ..., x_T}. We consider video frames in a D-dimensional space R^(H×W×C), where H, W, and C denote the height, width, and number of channels of an input x, respectively.

We now present the objective of VPRe-id and several definitions that will be frequently used later.

VPRe-id Objective.  Our objective is to classify whether the videos p_i and g_j come from the same identity. This is usually achieved by learning a mapping function f such that:

d(f(p_i), f(g_i)) < d(f(p_i), f(g_j)),  ∀ j ≠ i,   (1)

where d(·, ·) denotes a distance measurement such as the commonly used Euclidean distance.

Definition 1. The following network is called a typical RNN:

o_t = W_i φ(x_t) + W_s s_(t−1),   s_t = tanh(o_t),   (2)

where the output o_t at each time-step is a linear combination of the input and the RNN's state s_(t−1) at the previous time-step, φ(·) stands for a generic feature extractor, and x_t stands for the t-th frame of a generic video sequence.
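As a concrete sketch, the recurrent update of Definition 1 can be written in a few lines of NumPy. The function name, the dimensions, and the stand-in feature extractor below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def rnn_forward(frames, W_i, W_s, feat):
    """Run the typical RNN of Definition 1 over a frame sequence.

    o_t = W_i @ feat(x_t) + W_s @ s_{t-1};  s_t = tanh(o_t).
    Returns the outputs at every time-step, stacked into one array.
    """
    d = W_s.shape[0]
    s = np.zeros(d)                    # initial state s_0 = 0
    outputs = []
    for x in frames:
        o = W_i @ feat(x) + W_s @ s    # linear combination of input and previous state
        s = np.tanh(o)
        outputs.append(o)
    return np.stack(outputs)
```

Temporal average pooling, as used in [35], would simply average these outputs over the time dimension.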

Definition 2. An ordinal video sequence is defined as v = {x_1, ..., x_T}. A shuffled video sequence is defined as π(v), where π is a random permutation operator that randomly permutes (or shuffles) the order of frames in the video sequence.

Fig. 2: Diagnostic analysis of a typical "CNN-RNN" framework in [35]. We consider a single RNN without temporal average pooling. "S-CNN-RNN-Ordinal" and "S-CNN-RNN-Shuffled" stand for a single RNN fed with ordinal and shuffled videos, respectively. "S-CNN-RNN-Stack" means an RNN fed with a stack of the same frame.

4 Diagnostic Analysis of “CNN-RNN”

As a proof of concept, we first provide a diagnostic analysis of the recurrent network for VPRe-id. Our analysis is conducted based on the method presented in [35], which has been widely adopted in many of its follow-up works [59, 53, 57, 6, 62, 56]. We will omit the camera superscript and write v for a video sequence from a generic camera view for convenience.

Observation 1. A typical CNN-RNN method, as done in [35], is not effective in capturing temporal dependencies as expected, and implicitly results in an orderless representation.

We illustrate this by evaluating (without retraining) the "CNN-RNN" architecture on both ordinal and shuffled video sequences. The results are shown in Fig. 3. It is surprising to see that the commonly used "CNN-RNN" architecture performs better on shuffled video sequences. This suggests that the sequential dependency assumption on video frames might be invalid for VPRe-id, so that the RNN learns to forget its previous state. In this case, there are many more diversified input images in the shuffled scenario. For example, considering a video sequence with T frames and an RNN receiving k frames in total as input, there exist T!/(T−k)! input proposals in the shuffled scenario, whereas in the ordinal case one may only have T−k+1 input proposals.
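The counting argument above can be made concrete. Assuming a video of T frames and RNN inputs of k frames, the ordinal case admits only consecutive windows while shuffling admits any ordered selection of k distinct frames (the values 20 and 16 below are purely illustrative):

```python
from math import perm

def num_input_proposals(T, k):
    """Count distinct k-frame inputs drawn from a T-frame video.

    Ordinal: only the T - k + 1 consecutive windows.
    Shuffled: any ordered arrangement of k distinct frames, T!/(T-k)!.
    """
    return T - k + 1, perm(T, k)

# e.g. a 20-frame video fed to an RNN in 16-frame chunks
ordinal, shuffled = num_input_proposals(20, 16)
```

Even for modest T and k, the shuffled count dwarfs the ordinal one, which explains the increased input diversity.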

Method  Train     Test      Rank 1  Rank 5  Rank 10  Rank 20
[35]    Ordinal   Ordinal     58      84      91       96
[35]    Ordinal   Shuffled    59      88      94       97
[35]    Shuffled  Shuffled    58      87      92       97
[35]    Shuffled  Ordinal     58      87      92       97
RNN     Ordinal   Ordinal     46      74      85       94
RNN     Ordinal   Shuffled    46      74      86       94
RNN     Shuffled  Shuffled    54      78      88       95
RNN     Shuffled  Ordinal     56      78      87       94
LSTM    Ordinal   Ordinal     41      68      82       91
LSTM    Ordinal   Shuffled    44      72      84       91
LSTM    Shuffled  Shuffled    49      73      85       92
LSTM    Shuffled  Ordinal     52      76      86       93
GRU     Ordinal   Ordinal     44      74      83       93
GRU     Ordinal   Shuffled    46      76      85       92
GRU     Shuffled  Shuffled    51      74      86       93
GRU     Shuffled  Ordinal     54      74      87       94
Fig. 3: Detailed results of different "CNN-RNN" recipes under different training and testing settings. For the results of [35], we follow the original protocol and use optical flow pre-computed on the original ordinal video sequences. For the remaining methods, the optical flow input is disabled.

We further show that a single RNN in [35] usually leads to a negligible performance difference across various input scenarios: ordinal, shuffled, and a stack of the same frame. More specifically, we remove the temporal pooling and take the last output of the RNN in [35] to further understand the merits of RNNs for VPRe-id. In the first two scenarios, we feed ordinal and shuffled video frames (the same number of frames as in [35]) into the RNN. In the third scenario, we repeatedly stack one randomly selected video frame to generate a "fake" video snippet of the same length. All experiments are repeated ten times and the results are illustrated in Fig. 2. It can be seen that the RNN leads to almost the same results, which implies that the previous inputs to the RNN, whether in ordinal or shuffled order, or even the same frame repeated, matter little. One may argue that forgetting its previous state is an algorithmic deficiency of the vanilla RNN; however, we empirically find that more advanced recurrent units, such as LSTMs, also lead to similar results. For more detailed results, please refer to Fig. 3.
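The three input scenarios of this diagnostic can be sketched as follows; `make_inputs` and its arguments are hypothetical names used for illustration only:

```python
import random

def make_inputs(frames, k, mode, seed=0):
    """Build the three diagnostic input scenarios for the RNN:
    'ordinal'  - the first k frames in their original order,
    'shuffled' - the same k frames, randomly permuted,
    'stack'    - one randomly chosen frame repeated k times."""
    rng = random.Random(seed)
    if mode == "ordinal":
        return list(frames[:k])
    if mode == "shuffled":
        sub = list(frames[:k])
        rng.shuffle(sub)
        return sub
    if mode == "stack":
        return [rng.choice(frames)] * k
    raise ValueError(f"unknown mode: {mode}")
```

Feeding these three variants to the same trained network and comparing the final RNN outputs reproduces the comparison in Fig. 2.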

Observation 2. The same CNN-RNN methods, when trained on randomly shuffled video sequences, perform comparably well to those trained on ordinal video sequences.

To further understand the merit of "CNN-RNN" and clarify our motivation, we re-train the "CNN-RNN" system of [35], but randomly shuffle the input videos in each mini-batch, using the publicly available implementation of [35]. The results are summarized in Fig. 3, together with the results of the above analysis for more detailed comparisons. In the first setting, the "CNN-RNN" is trained on the original ordinal video sequences and also evaluated on ordinal ones, reproducing [35]. The second setting, "Train-ordinal, Test-shuffled", summarizes our previous investigation: we train "CNN-RNN" on the original ordinal video sequences and evaluate it on randomly shuffled ones. In the last two rows, we randomly shuffle the video sequences in each epoch during the training stage, and then test the network on both randomly shuffled videos and the original ordinal sequences. Our results show that the RNN fails to capture temporal dependencies as expected.

One may argue that the optical flow used in [35] plays an essential role in these settings. To investigate this, we conduct another set of experiments in which the optical flow input is entirely disabled. Furthermore, more advanced recurrent structures such as GRUs and LSTMs are investigated as well. We observe that: i) the optical flow is beneficial for this task, especially when the system is trained on ordinal sequences, although this advantage becomes less apparent when the system is trained on shuffled sequences; ii) the frame order, which is widely believed to encode temporal information for VPRe-id, is not beneficial for any of the three recurrent structures studied here. For more observations and a detailed statistical significance analysis, please refer to the supplementary files.

5 Proposed Solution

5.1 Ensemble Ranker

It has been widely believed that the temporal information encoded in the frame order is vital for VPRe-id, and hence the community has largely taken it for granted that using recurrent structures is good practice. For the first time, we dive deep into the effect of modelling temporal information for VPRe-id. Our analysis shows that the commonly used recurrent structure can be less effective than expected in capturing sequential dependencies for VPRe-id. Indeed, by temporally pooling the RNN's output at different time-steps, it essentially learns an orderless representation, which we believe is more pertinent for VPRe-id. Moreover, at each time-step, the output of the RNN is dominated by the current input features.

Motivated by this, we believe that it is beneficial to explicitly solve VPRe-id in an orderless manner, which further motivates us to approach VPRe-id from a different angle. We propose a simple yet powerful solution by regarding VPRe-id as a task of orderless ensemble ranking, where each base ranker is embodied by a person re-identifier operating on a single image frame. Multiple ranking results are aggregated by the Kemeny–Young method [26]. From an ensemble learning point of view [40], strength and diversity are both essential; in our case they are guaranteed by the accuracy of image based person re-identification approaches and by the differences between image frames within a video, respectively. Our method provides many more diversified image pairs than the video pairs of conventional methods. All these image pairs contribute to the final re-id accuracy by providing a more detailed ranking under the Kemeny–Young method, and they are all well trained in our setting. Through two examples in Sec. 5.2, we show that the proposed solution can be easily integrated with any image based person re-identification approach.

Rationale. More formally, for each input video p in the probe set, our solution returns an ensemble of ranking lists, composed of base lists r_1, ..., r_T, where each ranking r_t is generated by a base ranker whose final objective is:

d(φ(x_t), φ(g_i)) < d(φ(x_t), φ(g_j)),  ∀ j ≠ i,   (3)

where, as defined in Eqn. (1), d(·, ·) denotes a distance measurement such as the commonly used Euclidean distance.

Different from conventional ensemble learning approaches, where different base models are employed, here each base ranker for one video is embodied by the same re-identification network, which receives different frames (using different rankers, as done in conventional ensemble methods, is less appealing in our view because it significantly increases the model complexity by requiring multiple different models).

By defining the distance of the t-th image of the query video to the i-th gallery video as:

d_{t,i} = d(φ(x_t), φ(g_i)),   (4)

the base ranking list for the t-th frame can be obtained by:

r_t = Γ(d_{t,1}, ..., d_{t,N}),   (5)

where Γ is a ranking operator that returns the indices of the gallery videos after ranking their distances to the t-th image in the probe video. Without loss of generality, we assume that videos are ranked with increasing distance, i.e., the one with the smallest distance gets the lowest rank. Each base ranking list in Eqn. (5) is actually a permutation on {1, ..., N}. Moreover, each base ranking list is returned by a base ranker, which is introduced in Sec. 5.2.
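A minimal sketch of this base ranking step: each probe frame is embedded, compared against pre-computed gallery features by Euclidean distance, and argsort plays the role of the ranking operator. The embedding function here is a stand-in for the actual image based re-id network, not the authors' code:

```python
import numpy as np

def base_rankings(probe_frames, gallery_feats, embed):
    """One base ranking list per probe frame.

    probe_frames:  iterable of raw frames of the query video
    gallery_feats: (N, d) array of pre-computed gallery video features
    embed:         frame -> (d,) feature vector (the shared re-id network)
    """
    rankings = []
    for x in probe_frames:
        f = embed(x)
        # Euclidean distance from this frame to every gallery video
        dists = np.linalg.norm(gallery_feats - f, axis=1)
        # the ranking operator: smallest distance gets the lowest rank
        rankings.append(np.argsort(dists))
    return np.stack(rankings)  # (T, N): one permutation of {0..N-1} per frame
```

Because the same network embeds every frame, the ensemble adds no model parameters; only the inputs differ across base rankers.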

For each pair of gallery videos (g_i, g_j), denote by S_{ij} the number of times that g_i is ranked in front of g_j, that is:

S_{ij} = Σ_{t=1}^{T} 1(r_t(i) < r_t(j)),   (6)

where 1(·) is an indicator function (1(a) = 1 iff a is true). We further define g_i ≻ g_j to be true iff S_{ij} > T/2. Moreover, supposing that the ground-truth ranking list for the video is known, then based on Hoeffding's inequality [42] we have the following result:

Proposition 1. Let ε denote the error rate of each base ranker. Assume that ε < 1/2. Under the i.i.d. assumption we have:

P(g_i ⊁ g_j | the ground truth ranks g_i before g_j) ≤ exp(−2T(1/2 − ε)^2).   (7)
Although the i.i.d. assumption for Hoeffding’s inequality may not hold in practice, it allows for an analytical study of VPRe-id and sheds light upon how we could improve it.
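The bound can be evaluated numerically. Writing the per-frame error rate as eps, a sketch under the stated i.i.d. assumption (the function name and example values are illustrative):

```python
from math import exp

def hoeffding_bound(T, eps):
    """Upper bound on the probability that an ensemble of T i.i.d. base
    rankers, each with error rate eps < 1/2, mis-orders a gallery pair:
    exp(-2 * T * (1/2 - eps)^2)."""
    assert 0 <= eps < 0.5, "each base ranker must beat a random guess"
    return exp(-2 * T * (0.5 - eps) ** 2)

for T in (1, 16, 64, 128):
    print(T, hoeffding_bound(T, eps=0.4))  # the bound decays exponentially in T
```

Even with a weak base ranker (eps = 0.4), the bound shrinks quickly as more frames are added, which is the quantitative motivation for using longer videos and stronger base rankers.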

Discussion on consistency. For the gallery videos, in order to aggregate multiple results from each base ranker, the relations g_i ≻ g_j must be transitive, that is, g_i ≻ g_k is true if both g_i ≻ g_j and g_j ≻ g_k are true. When this holds for all triples, the base rankers are said to be consistent [20]. Theoretical analysis in [20] shows that, under the same conditions (i.i.d. and each base ranker better than a random guess), the base rankers are consistent with probability 1 as T goes to infinity. This naturally allows us to use the Kemeny–Young method [26] to aggregate the final ranking list. Specifically, the Kemeny–Young method uses preferential ballots from each base ranker r_t. We first create a 0-1 binary matrix B, where B_{t,i,j} = 1 iff r_t(i) < r_t(j), and B_{t,i,j} = 0 otherwise. A summation along the t and j dimensions then returns a vector whose i-th entry summarizes how frequently the i-th identity in the gallery set is favoured across all the base rankers. The final ranking list is obtained by sorting this vector in decreasing order.

5.2 Base Ranker

Generally, our solution is applicable to any image based person re-identification method. However, as indicated in Eqn. (7) of Proposition 1, under the i.i.d. assumption the error bound of the final ensemble decreases exponentially with the number of frames T, so it is beneficial in practice to use the best-performing base ranker. Motivated by this, we build on state-of-the-art image based person re-identification approaches. To illustrate our point, we present two realizations of the proposed method: Ensemble-RW and Ensemble-CRF. The former uses the random walk based method in [43] as the base ranker, while the latter employs the CRF based method in [5]. We choose these two methods because of their superior performance on image based person re-identification. Their main techniques are briefly introduced below; interested readers are referred to the original papers for more details.

Ensemble-RW. This method trains the person re-identification model with the technique in [43]. More specifically, it adopts a novel group-shuffling random walk operation that fully utilizes the affinities between gallery images to refine the affinities between probe and gallery images, integrating the random walk operation into the training process of deep neural networks. Apart from the ground-truth identity label, richer supervision is provided by grouping and shuffling the features through the random walk operation. The method reports state-of-the-art top-1 accuracies on CUHK03, Market-1501, and DukeMTMC [43].

Ensemble-CRF. This method employs the technique in [5]. It uses a novel similarity learning approach for person re-identification that combines a CRF model with deep neural networks. It models the similarities between images in a group via a unified graphical model, and learns local similarities with the aid of group similarities in a multi-scale manner. As more inter-image relations are considered, the learned similarity metric is reported to be more robust and consistent under challenging scenarios. The method likewise reports state-of-the-art top-1 accuracies on CUHK03, Market-1501, and DukeMTMC [5].

5.3 Difference with conventional ensemble learning

Existing ensemble re-id methods [39, 37] mainly approach the task from an image based point of view. The work in [39] trains an ensemble RankSVM, while the method in [37] uses multiple features to obtain multiple metrics and combines them through a weighted average. In contrast, we investigate the feasibility of ensemble learning for VPRe-id under the umbrella of deep learning, in which efficiency and robustness are guaranteed by a shared base ranker and the Kemeny–Young method, respectively. More importantly, we show the effectiveness of the proposed method both empirically and theoretically. Other conventional ensemble methods [40] typically employ different base models and yield much larger model complexity, and are hence less appealing here.

6 Experimental Results

In this section, we evaluate our proposed approach to video re-identification on three challenging benchmark datasets: iLIDS-VID [52], PRID-2011 [18], and MARS [61]. The iLIDS-VID dataset contains 300 persons, where each person is represented by two video sequences captured by non-overlapping cameras. The PRID-2011 dataset contains persons captured by two non-overlapping cameras. Following the protocol used in [52], sequences with more than 21 frames are selected, leading to 178 identities. For iLIDS-VID and PRID-2011, each dataset is randomly split into 50% of persons for training and 50% of persons for testing. The MARS dataset is the largest video-based person re-identification benchmark, with 1,261 identities and around 20,000 video sequences generated by the DPM detector [13] and GMMCP tracker [9]. Each identity is captured by at least 2 cameras and has 13.2 sequences on average. There are 3,248 distractor sequences in the dataset.

Both Ensemble-RW and Ensemble-CRF are trained with stochastic gradient descent; the remaining hyper-parameters are set to the same values as in the original papers. We train the networks with two Nvidia Titan GPUs. For the first two datasets, during testing we consider the first camera as the probe and the second camera as the gallery [52]. For MARS, we follow exactly the evaluation protocol in [61]. A probe sequence is passed through the network to obtain feature vectors, which are ranked by Euclidean distance against pre-computed feature vectors from all gallery sequences. We then use the Kemeny–Young method to aggregate the results into a final ranking list.

Fig. 6: CMC Rank 1 accuracy of Ensemble-RW and Ensemble-CRF on PRID-2011 and iLIDS-VID for different sample rates n. We show how the sample rate impacts the accuracy of our solution; the largest reported n corresponds to the case where only the first image in the video is sampled. The experimental results show that the accuracy increases as n decreases.

6.1 Ablation Studies

We first provide some ablation studies to investigate two key aspects in Eqn. (7). We conduct those experiments on iLIDS-VID and PRID-2011.

Effect of T. Our theoretical analysis shows that the error bound of the final system decreases exponentially with the number of image frames under certain conditions. In practice, however, we cannot leverage unlimited video lengths due to constrained storage, transmission, and computation resources. To investigate the effect of the video length, we report the results of both Ensemble-RW and Ensemble-CRF on the PRID-2011 and iLIDS-VID datasets. More specifically, we sample one image from every n images in each video sequence. The largest n is chosen so that only the first frame is used, reducing the task to image based person re-identification as done in [61]; when n = 1, we use all the available frames. Results are illustrated in Fig. 6. It is clear that n has an inverse effect on the final performance, that is, the accuracy increases as n decreases.
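The sampling scheme used in this ablation amounts to a simple stride over the frame list (`sample_frames` is an illustrative name, not the authors' code):

```python
def sample_frames(frames, n):
    """Keep one frame out of every n consecutive frames (sample rate n).
    n = 1 keeps every frame; n >= len(frames) keeps only the first frame,
    degenerating to image based re-identification."""
    assert n >= 1
    return list(frames)[::n]
```

The retained frames are then fed to the base rankers exactly as before, so varying n directly varies the number of rankers T in the ensemble.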

With respect to the "probe-to-gallery" pattern, person re-identification can be mainly categorized into two strategies: image-to-image and video-to-video. The first mode is mostly studied in the literature, while the video-to-video setting has been investigated only recently. Other scenarios, such as image-to-video and video-to-image, are less common and can be regarded as special cases of video-to-video. Our findings verify the common belief that the video-to-video pattern, which is also the focus of this paper, is more favourable because both probe and gallery units contain much richer visual information than single images. The following section further shows that other common practices, such as modelling temporal dynamics in videos for VPRe-id, may not help to improve the performance.

Effect of ε. The error bound in Eqn. (7) suggests improving the final accuracy from another angle: reducing the error of each base ranker. To show this, we train the baseline model of [43]. This baseline does not use the group-shuffling random walk operation and thus cannot refine the affinities between probe and gallery images; it has been reported to be worse on all the datasets used in [43]. We call this method "Ensemble-Baseline" and summarize the results in Fig. 9. It is clear that using stronger base rankers leads to better overall accuracy in all cases.

Fig. 9: Comparison of different base rankers on PRID-2011 and iLIDS-VID. We report the results of three different rankers. The baseline method in [43], which disables the functionality of refining the affinities between probe and gallery images, is integrated into our system as "Ensemble-Baseline". Our results on the two datasets show that an ensemble of stronger base rankers performs better.
Method                 Rank 1  Rank 5  Rank 20  mAP
Liu et al. [31]          68.3    81.4    90.6   52.9
Zheng et al. [61]        68.3    82.6    89.4   49.3
Zhou et al. [62]         70.6    90.0    97.6   50.7
Chen et al. [6]          71.0    89.0    96.0   -
Zhang et al. [58]        71.2    85.7    94.3   -
Li et al. [27]           82.3    -       -      65.8
Wu et al. [54]           80.8    92.1    96.1   67.4
Ensemble-CRF             85.3    95.0    97.1   79.3
Ensemble-RW              86.8    94.6    97.5   80.1
Ensemble-CRF (MC)        85.1    94.8    97.0   79.2
Ensemble-RW (MC)         86.3    94.4    97.4   80.1
Fig. 10: Comparison with other state-of-the-art methods on MARS.

6.2 Comparisons with State-of-the-Art

                         PRID-2011                     iLIDS-VID
Method                  R1    R5    R10   R20         R1    R5    R10   R20
Karanam et al. [22]     35.1  59.4  69.8  79.7        24.9  44.5  55.6  66.2
Karanam et al. [21]     40.6  69.7  77.8  85.6        25.9  48.2  57.3  68.9
Wang et al. [52]        41.7  64.5  77.5  88.8        34.5  56.7  67.5  77.5
Li et al. [29]          43.0  72.7  84.6  91.9        37.5  62.7  73.0  81.8
Wu et al. [53]          49.8  77.4  90.7  94.6        42.6  70.2  86.4  92.3
Li et al. [30]          51.6  83.1  91.0  95.5        34.5  63.3  74.5  84.4
Yan et al. [57]         58.2  85.8  93.4  97.9        49.3  76.8  85.3  90.0
Liu et al. [32]         64.1  87.3  89.9  92.0        44.3  71.7  83.7  91.7
McLaughlin et al. [35]  70.0  90.0  95.0  97.0        58.0  84.0  91.0  96.0
Zhang et al. [59]       72.8  92.0  95.1  97.6        55.3  85.0  91.7  95.1
Chen et al. [6]         77.0  93.0  95.0  98.0        61.0  85.0  94.0  97.0
Xu et al. [56]          77.0  95.0  99.0  99.0        62.0  86.0  94.0  98.0
Zhou et al. [62]        79.4  94.4  -     99.3        55.2  86.5  -     97.0
Liu et al. [31]         83.7  98.3  99.4  100.0       68.7  94.3  98.3  99.3
Zhang et al. [58]       85.2  97.1  98.9  99.6        60.2  84.7  91.7  95.2
Liu et al. [33]         90.3  98.2  99.3  100.0       68.0  86.8  95.4  97.4
Khan et al. [23]        92.5  99.3  100.0 100.0       79.5  95.1  97.6  99.1
Li et al. [27]          93.2  -     -     -           80.2  -     -     -
Ensemble-CRF            95.5  99.6  100.0 100.0       90.4  98.3  98.4  99.2
Ensemble-RW             97.1  99.5  100.0 100.0       91.3  98.1  100.0 100.0
Ensemble-CRF (MC)       94.7  98.9  100.0 100.0       89.7  97.7  97.6  99.1
Ensemble-RW (MC)        96.8  99.3  100.0 100.0       91.3  97.9  100.0 100.0
Fig. 11: Comparison with other state-of-the-art methods on the PRID-2011 and iLIDS-VID datasets.

Notable recent works on VPRe-id that have improved the state-of-the-art results are: 1) McLaughlin et al.'s method [35], which uses temporal average pooling to aggregate RNN outputs at each time step; 2) Chen et al.'s method [6], which adopts a similar network structure to [35] but fuses both CNN and RNN features; 3) Xu et al.'s method [56], which uses spatial pyramid pooling in the CNN and attention models for more discriminative features; 4) Zhang et al.'s method [59], which integrates CNNs and bidirectional RNNs (BRNNs); 5) Zhou et al.'s method [62], which uses a temporal attention model, recurrent units, and a more elaborate triplet loss; 6) Liu et al.'s method [31], whose AMOC network jointly learns appearance representation and motion context from a collection of adjacent frames using a two-stream convolutional architecture; 7) Li et al.'s method [27], which uses a spatial attention ensemble to discover the same body parts; and 8) Zhang et al.'s method [58], which employs interpretable reinforcement learning to train an agent that verifies a pair of images at each time step. From Fig. 11, we can observe that both of the proposed methods outperform "CNN-RNN" and its several advanced extensions.

Fig. 10 compares the proposed method with other state-of-the-art methods on MARS. Following the standard evaluation protocol [61], we report both CMC and mAP as evaluation metrics. The Average Precision (AP) is calculated from the ranking result of each query, and the mean Average Precision (mAP), computed across all queries, serves as the final re-identification accuracy. For ease of presentation, we omit the results for CMC Rank 10. Again, our proposed solution outperforms all of these methods in terms of both CMC rank and mAP scores. In addition, we include variants of the proposed method that use the Cross-Entropy Monte Carlo algorithm [38] to aggregate multiple rankers; with this alternative aggregation method, our proposed solutions still outperform all of the compared methods in terms of both CMC rank and mAP scores.
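The CMC and mAP computation for a single query can be sketched as follows. Variable names are illustrative, and the full MARS protocol of [61] additionally handles distractor and junk gallery entries, which this sketch omits.

```python
import numpy as np

def cmc_map(ranked_gallery_ids, query_id, max_rank=20):
    """CMC match vector and AP for a single query.

    ranked_gallery_ids: gallery identity labels sorted by descending score
    query_id          : ground-truth identity of the probe
    Averaging the CMC vectors and APs over all queries yields the
    CMC curve and the mAP reported in the tables.
    """
    matches = (np.asarray(ranked_gallery_ids) == query_id)
    # CMC: 1 at rank r if a correct match occurred at rank <= r
    first_hit = np.argmax(matches) if matches.any() else max_rank
    cmc = (np.arange(max_rank) >= first_hit).astype(float)
    # AP: mean of the precision values at each correct-match position
    hits = np.flatnonzero(matches)
    precisions = [(i + 1) / (pos + 1) for i, pos in enumerate(hits)]
    ap = float(np.mean(precisions)) if precisions else 0.0
    return cmc, ap
```

For example, a ranking whose correct matches sit at positions 2 and 3 yields precisions 1/2 and 2/3, so AP = 7/12, while its CMC is 0 at Rank 1 and 1 from Rank 2 onward.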

Here we provide some open discussion as well. Eqn. (7) suggests a good way to improve person re-identification performance: applying the best image-based person re-identification methods to all the available video frames. However, this is non-trivial and deserves future research on the following problems: 1) how to satisfy the i.i.d. assumption; and 2) given a limited computational budget, how to achieve a good trade-off between the number of frames used and the accuracy of each base ranker. These may be investigated from various aspects: (1) enhanced dataset processing in which dependencies among individual frames are reduced, such as random frame skipping or key frame selection [36]; however, this also reduces the effective video length; (2) using more compute-intensive methods to improve the accuracy of each base ranker.
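A minimal sketch of point (1) is stratified frame sampling: split the video into equal segments and draw one frame per segment, so that adjacent, highly correlated frames are unlikely to be picked together and the sampled frames sit closer to the i.i.d. assumption behind Eqn. (7). The segment-based scheme and parameter names below are our own illustration, not the key-frame selector of [36].

```python
import random

def sample_frames(video_len, budget, seed=None):
    """Pick `budget` frame indices, one from each of `budget` equal
    segments of the video. Neighbouring frames are rarely chosen
    together, reducing dependencies among the sampled frames."""
    rng = random.Random(seed)
    if video_len <= budget:             # short video: keep every frame
        return list(range(video_len))
    bounds = [round(i * video_len / budget) for i in range(budget + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(budget)]
```

The `budget` parameter exposes the trade-off discussed above: fewer sampled frames cut the cost per video, leaving more of the computational budget for a stronger per-frame ranker.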

7 Conclusion

We first provide a diagnostic analysis of the commonly used "CNN-RNN" recipes for VPRe-id. We show that the RNNs used in the literature may not be effective in capturing temporal dependencies and implicitly learn an orderless representation, which we believe is more pertinent for VPRe-id. Based on this observation, we then propose a simple yet surprisingly powerful ensemble approach for VPRe-id. Moreover, we theoretically prove that both employing strong image-based person re-identification methods and using long video sequences are beneficial. We demonstrate this idea with two examples, and our proposed solution significantly outperforms the existing state-of-the-art methods on multiple widely used datasets.


  • [1] E. Ahmed, M. Jones, and T. K. Marks (2015) An improved deep learning architecture for person re-identification. In IEEE CVPR, Cited by: §2.2.
  • [2] J. Boin, A. Araujo, and B. Girod (2019) Recurrent neural networks for person re-identification revisited. In IEEE MIPR, pp. 147–152. Cited by: §2.3.
  • [3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In IEEE CVPR, Cited by: §1.
  • [4] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang (2018) Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In IEEE CVPR, Cited by: §1, §2.1.
  • [5] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang (2018) Group consistent similarity learning via deep crf for person re-identification. In IEEE CVPR, Cited by: §2.2, §5.2, §5.2.
  • [6] L. Chen, H. Yang, J. Zhu, Q. Zhou, S. Wu, and Z. Gao (2017) Deep spatial-temporal fusion network for video-based person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., Cited by: §1, §2.1, §4, Fig. 10, Fig. 11, §6.2.
  • [7] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In IEEE CVPR, Cited by: §2.2.
  • [8] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In IEEE CVPR, Cited by: §2.2.
  • [9] A. Dehghan, S. Modiri Assari, and M. Shah (2015) Gmmcp tracker: globally optimal generalized maximum multi clique problem for multiple object tracking. In IEEE CVPR, Cited by: §6.
  • [10] S. Ding, L. Lin, G. Wang, and H. Chao (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48 (10), pp. 2993–3003. Cited by: §2.2.
  • [11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In IEEE CVPR, Cited by: §1.
  • [12] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani (2010) Person re-identification by symmetry-driven accumulation of local features. In IEEE CVPR, Cited by: §1.
  • [13] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32 (9), pp. 1627–1645. Cited by: §6.
  • [14] N. Gheissari, T. B. Sebastian, and R. Hartley (2006) Person reidentification using spatiotemporal appearance. In IEEE CVPR, Cited by: §1.
  • [15] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell (2017) ActionVLAD: learning spatio-temporal aggregation for action classification. IEEE CVPR. Cited by: §1.
  • [16] M. A. Goodale and A. D. Milner (1992) Separate visual pathways for perception and action. Trends in neurosciences 15 (1), pp. 20–25. Cited by: §1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE CVPR, Cited by: §1.
  • [18] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof (2011) Person re-identification by descriptive and discriminative classification. In SCIA, Cited by: §1, §6.
  • [19] M. Hirzer, P. M. Roth, M. Köstinger, and H. Bischof (2012) Relaxed pairwise learned metric for person re-identification. In Eur. Conf. Comput. Vis., Cited by: §1.
  • [20] K. Jong, J. Mary, A. Cornuéjols, E. Marchiori, and M. Sebag (2004) Ensemble feature ranking. In PKDD, pp. 267–278. Cited by: §5.1.
  • [21] S. Karanam, Y. Li, and R. J. Radke (2015) Person re-identification with discriminatively trained viewpoint invariant dictionaries. In Int. Conf. Comput. Vis., Cited by: Fig. 11.
  • [22] S. Karanam, Y. Li, and R. J. Radke (2015) Sparse re-id: block sparsity for person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., Cited by: Fig. 11.
  • [23] F. M. Khan and F. Brèmond (2017) Multi-shot person re-identification using part appearance mixture. In WACV, Cited by: Fig. 11.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • [25] I. Kviatkovsky, A. Adam, and E. Rivlin (2013) Color invariants for person reidentification. IEEE Trans. Pattern Anal. Mach. Intell. 35 (7), pp. 1622–1634. Cited by: §1.
  • [26] J. Levin and B. Nalebuff (1995) An introduction to vote-counting schemes. Journal of Economic Perspectives 9 (1), pp. 3–26. Cited by: §5.1, §5.1.
  • [27] S. Li, S. Bak, P. Carr, and X. Wang (2018) Diversity regularized spatiotemporal attention for video-based person re-identification. In IEEE CVPR, Cited by: Fig. 10, Fig. 11, §6.2.
  • [28] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In IEEE CVPR, Cited by: §1, §2.2.
  • [29] Y. Li, Z. Wu, S. Karanam, and R. J. Radke (2015) Multi-shot human re-identification using adaptive fisher discriminant analysis.. In Brit. Mach. Vis. Conf., Cited by: Fig. 11.
  • [30] Y. Li, L. Zhuo, J. Li, J. Zhang, X. Liang, and Q. Tian (2017) Video-based person re-identification by deep feature guided pooling. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., Cited by: §2.1, Fig. 11.
  • [31] H. Liu, Z. Jie, K. Jayashree, M. Qi, J. Jiang, S. Yan, and J. Feng (2018) Video-based person re-identification with accumulative motion context. IEEE TCSVT 28 (10), pp. 2788–2802. Cited by: Fig. 10, Fig. 11.
  • [32] K. Liu, B. Ma, W. Zhang, and R. Huang (2015) A spatio-temporal appearance representation for video-based pedestrian re-identification. In Int. Conf. Comput. Vis., Cited by: Fig. 11.
  • [33] Y. Liu, J. Yan, and W. Ouyang (2017) Quality aware network for set to set recognition. In IEEE CVPR, Cited by: Fig. 11.
  • [34] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE CVPR, Cited by: §1.
  • [35] N. McLaughlin, J. Martinez del Rincon, and P. Miller (2016) Recurrent convolutional network for video-based person re-identification. In IEEE CVPR, Cited by: §1, §1, §2.1, Fig. 2, Fig. 3, §4, §4, §4, §4, §4, Fig. 11, §6.2.
  • [36] J. Meng, H. Wang, J. Yuan, and Y. Tan (2016) From keyframes to key objects: video summarization by representative object proposal selection. In IEEE CVPR, pp. 1039–1048. Cited by: §6.2.
  • [37] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel (2015) Learning to rank in person re-identification with metric ensembles. In IEEE CVPR, pp. 1846–1855. Cited by: §5.3.
  • [38] V. Pihur, S. Datta, and S. Datta (2009) RankAggreg, an r package for weighted rank aggregation. BMC bioinformatics 10 (1), pp. 62. Cited by: §6.2.
  • [39] B. J. Prosser, W. Zheng, S. Gong, T. Xiang, and Q. Mary (2010) Person re-identification by support vector ranking.. In Brit. Mach. Vis. Conf., Vol. 2, pp. 6. Cited by: §5.3.
  • [40] Y. Ren, L. Zhang, and P. N. Suganthan (2016) Ensemble classification and regression-recent developments, applications and future directions. IEEE CIM 11 (1), pp. 41–53. Cited by: §5.1, §5.3.
  • [41] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In IEEE CVPR, Cited by: §1.
  • [42] R. J. Serfling (1974) Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, pp. 39–48. Cited by: §5.1.
  • [43] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang (2018) Deep group-shuffling random walk for person re-identification. In IEEE CVPR, Cited by: §2.2, §5.2, §5.2, Fig. 9, §6.1.
  • [44] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng (2018) Crowd counting with deep negative correlation learning. In IEEE CVPR, Cited by: §1.
  • [45] J. Si, H. Zhang, C. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In IEEE CVPR, pp. 5363–5372. Cited by: §2.3.
  • [46] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §1.
  • [47] C. Song, Y. Huang, W. Ouyang, and L. Wang (2018) Mask-guided contrastive attention model for person re-identification. In IEEE CVPR, pp. 1179–1188. Cited by: §2.2.
  • [48] X. Su, Y. Zou, Y. Cheng, S. Xu, M. Yu, and P. Zhou (2018) Spatial-temporal synergic residual learning for video person re-identification. arXiv preprint arXiv:1807.05799. Cited by: §2.1.
  • [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In IEEE CVPR, Cited by: §1.
  • [50] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In IEEE CVPR, Cited by: §1.
  • [51] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Eur. Conf. Comput. Vis., Cited by: §1.
  • [52] T. Wang, S. Gong, X. Zhu, and S. Wang (2014) Person re-identification by video ranking. In Eur. Conf. Comput. Vis., Cited by: §1, Fig. 11, §6, §6.
  • [53] L. Wu, C. Shen, and A. v. d. Hengel (2016) Deep recurrent convolutional networks for video-based person re-identification: an end-to-end approach. arXiv preprint arXiv:1606.01609. Cited by: §1, §2.1, §4, Fig. 11.
  • [54] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Exploit the unknown gradually: one-shot video-based person re-identification by stepwise learning. In IEEE CVPR, pp. 5177–5186. Cited by: Fig. 10, §6.2.
  • [55] T. Xiao, H. Li, W. Ouyang, and X. Wang (2016) Learning deep feature representations with domain guided dropout for person re-identification. In IEEE CVPR, pp. 1249–1258. Cited by: §2.2.
  • [56] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou (2017) Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In Int. Conf. Comput. Vis., Cited by: §1, §2.1, §4, Fig. 11, §6.2.
  • [57] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang (2016) Person re-identification via recurrent feature aggregation. In Eur. Conf. Comput. Vis., Cited by: §1, §1, §2.1, §4, Fig. 11.
  • [58] J. Zhang, N. Wang, and L. Zhang (2018) Multi-shot pedestrian re-identification via sequential decision making. In IEEE CVPR, pp. 6781–6789. Cited by: §2.3, Fig. 10, Fig. 11.
  • [59] W. Zhang, X. Yu, and X. He (2017) Learning bidirectional temporal cues for video-based person re-identification. IEEE TCSVT 28 (10), pp. 2768–2776. Cited by: §2.1, §4, Fig. 11, §6.2.
  • [60] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017) Deeply-learned part-aligned representations for person re-identification.. In Int. Conf. Comput. Vis., Cited by: §2.2.
  • [61] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) Mars: a video benchmark for large-scale person re-identification. In Eur. Conf. Comput. Vis., Cited by: §1, Fig. 10, §6.1, §6.2, §6, §6.
  • [62] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan (2017) See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification,. In IEEE CVPR, Cited by: §1, §2.1, §2.3, §4, Fig. 10, Fig. 11, §6.2.
  • [63] W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao (2016) A key volume mining deep framework for action recognition. In IEEE CVPR, Cited by: §1.