Given an image/tracklet taken from one camera, person re-identification (re-ID) is the process of matching the person against images/tracklets of interest in another view. In recent years, person re-ID has attracted increasing attention in the computer vision field. A large number of person re-ID approaches have emerged, owing to its wide range of potential applications, such as criminal spotting, pedestrian searching and cross-camera tracking.
Currently, the most prominent progress in person re-ID has been achieved in the static image setting, which only uses single images and spatial information. Most existing works in this setting focus on designing robust feature representations [4, 5, 6, 7, 8, 9, 10, 11, 12], learning discriminative distance metrics [13, 14, 15, 16, 17, 11, 18, 19, 20], or combining them under a deep convolutional neural network (CNN) based framework [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. Because the video setting is closer to the practical scenario, researchers have shifted their attention to video re-ID [31, 32, 33, 34, 35, 36, 37]. Video re-ID has several advantages. First, videos are the first-hand materials captured by surveillance cameras, and pedestrian tracklets can be automatically detected by existing detectors. Second, compared to static image re-ID, videos contain more information, e.g., temporal cues, pose variations and multi-view observations. In addition, videos can provide more training samples, since a tracklet sequence usually consists of multiple images. More importantly, the ubiquity of video offers society far-reaching benefits in terms of security and law enforcement. Therefore, we tackle the video setting in this paper.
Video re-ID also faces several severe challenges. Samples in static image re-ID are usually captured under well-controlled visual conditions or even framed by professional photographers. However, as video acquisition is much less constrained, the image quality of video frames tends to be rather low and pedestrians exhibit a large range of pose variations, as can be observed in Fig. 1. In particular, pedestrians in videos are usually moving, resulting in serious defocus, blurring and scale variations. Furthermore, the tracklets automatically produced by existing pedestrian detectors may contain cluttered backgrounds and poor alignments, which makes the re-ID problem even more difficult.
Therefore, the following issues in video re-ID need to be carefully considered: 1) How to construct appropriate person representations that effectively incorporate the temporal information available in videos? 2) How to effectively harness the temporal dependency to keep pedestrians better aligned, so that the influence of sample noise and cluttered backgrounds is reduced?
To deal with the first issue, recent works in video re-ID have tended to utilize recurrent neural networks (RNNs), which take consecutive frames as inputs and adaptively incorporate temporal information [33, 38, 32, 39, 34]. These methods first extract frame-wise features with deep CNNs. The extracted features are then fed into several RNNs to capture the temporal structure information. Finally, average or max temporal pooling is conducted on the outputs of the RNNs to aggregate the features. However, the average pooling operation only considers the generic features of pedestrian sequences, so the specific features of individual samples in a sequence are neglected. The max pooling operation concentrates on finding the local salient features, so a lot of useful information may be discarded. A key innovation of this work is a new technical solution that fully leverages the temporal information. Specifically, we propose a temporal residual learning (TRL) module equipped with two bi-directional LSTMs (BiLSTMs) to simultaneously learn the generic and specific features of a video sequence. The generic features indicate the common temporal structure of a video, which is captured by averaging the outputs of the first BiLSTM. The specific features, extracted by the second BiLSTM, characterize the deviations and properties of specific sample frames. The joint features provide complementary information for person descriptions and increase the representation power for a video. In addition, the two BiLSTMs in our framework make the generic and the specific information flow forward and backward in a flexible manner, allowing the underlying temporal information interaction to be fully exploited.
To alleviate the sample noise and poor alignments that appear in videos, we propose to take advantage of the spatial transformer network (STN). The STN is a learnable module that explicitly allows spatial manipulations of data within CNNs. The STN has achieved impressive performance on fine-grained recognition [41, 42, 43, 44] and image-level person re-ID [45, 46]. The main difficulty in applying the STN to video tasks is how to ensure that the spatial transformation is continuous. Consecutive video frames are very similar, without significant changes or fluctuations, which indicates that the spatial transformation should also vary smoothly over time in order to preserve temporal continuity. To address this problem, we extend the STN and propose a novel spatial-temporal transformer network (ST²N). The ST²N includes a recurrent structure in the temporal dimension to automatically learn the optimal spatial transformation parameters for consecutive frames. Specifically, it utilizes a BiLSTM that leverages knowledge of consecutive frames to ensure accurate spatial alignments.
In summary, our main contributions include the following:
We propose a new temporal information learning method, i.e., the TRL module, to jointly learn the generic and specific features of a video sequence. The complementary features improve on the representation capability of conventional RNN based temporal pooling methods.
We extend the classical STN and propose the spatial-temporal transformer network (ST²N) module to leverage the context knowledge of consecutive frames. The ST²N module helps to ensure smooth person alignments in videos.
The rest of this paper is organized as follows. In Section II, we give an overview of the video based re-ID and spatial transformer network. Then we introduce the proposed learning approach in Section III. In Section IV, we evaluate and analyze the proposed method by extensive experiments and comparisons with other methods. Finally, we provide the conclusion and future work in Section V.
II Related Work
Video based re-ID.
In recent years, video based re-ID has drawn increasing attention due to its numerous applications. Existing video re-ID methods can be roughly categorized into two classes: hand-crafted feature based methods [49, 51, 52, 53, 54, 55, 56] and deep learning based methods [31, 32, 33, 34, 35, 36].
The majority of conventional algorithms develop their solutions from two aspects: extracting reliable feature representations [49, 51, 52, 55, 56] or learning robust distance metrics [53, 54]. For instance, Wang et al. select the discriminative video fragments from noisy sequences by estimating the flow energy profile (FEP). The fragments are represented by the HOG3D feature and the average color histogram. In another work, the periodicity of a walking person is exploited to generate spatial-temporal body-action units, which are then represented by Fisher vectors. Cho et al. conduct multi-shot matching by efficiently estimating the target poses. You et al. propose the top-push distance learning model (TPDL), which integrates a top-push constraint for matching video features of persons.
Among the deep learning based methods, two representative works exploit the siamese network architecture for re-ID, where the convolutional layer, the recurrent layer, and the temporal pooling layer are jointly trained to act as a feature extractor. The difference is that the former takes advantage of RNNs to capture the temporal information based on fixed-layer CNN features, while the latter leverages the feature maps at all levels of a CNN and models the temporal information by convolutional gated recurrent units (GRUs). In addition, since the quality of each sample cannot be guaranteed and poor images hurt the accuracy, Liu et al. propose to learn the quality of each sample automatically. Quality scores and features of all samples are aggregated to get the final feature representation. Zhou et al. propose the temporal attention model (TAM) to focus on discriminative frames. The TAM is jointly learned with the spatial recurrent model (SRM) to integrate the surrounding information at different spatial locations for better similarity evaluation. Xu et al. present an attentive spatial-temporal pooling network (ASTPN) to select key regions or frames from the sequences for feature representation learning.
In a nutshell, existing frameworks directly aggregate the frame-level features by conducting the average, max or attention temporal pooling. However, the average temporal pooling only summarizes the generic features, the max temporal pooling focuses on the local salient regions and the attention temporal pooling concerns the informative frames. The specific characteristics of samples in video sequences are not explored, which may lead to a suboptimal video representation for video re-ID. To take advantage of the temporal information and appearance speciality of pedestrians, we propose the TRL module to simultaneously extract the generic and specific features of samples within a video.
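The contrast between average and max temporal pooling can be made concrete with a minimal sketch. The code below is only an illustration under our own assumptions: frame-level features are plain Python lists, and the function names and toy values are hypothetical rather than taken from any of the cited methods.

```python
from typing import List

Vector = List[float]


def average_temporal_pooling(frames: List[Vector]) -> Vector:
    """Mean of each feature dimension over all time steps (generic summary)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def max_temporal_pooling(frames: List[Vector]) -> Vector:
    """Largest response of each feature dimension over all time steps."""
    return [max(column) for column in zip(*frames)]


# Toy frame-level features: 3 frames, 2 dimensions (illustrative values only).
frames = [[1.0, 0.0], [3.0, 0.0], [2.0, 6.0]]
avg = average_temporal_pooling(frames)
mx = max_temporal_pooling(frames)
```

In this toy sequence the second dimension fires only in the last frame. Average pooling dilutes that response across the sequence, while max pooling keeps it but discards how the other frames behaved, which is the gap the frame-wise residuals of the TRL module are meant to address.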
Spatial transformer network.
To build more spatial invariance into deep networks, Jaderberg et al. propose the spatial transformer module, which can automatically learn the optimal transformation parameters. The spatial transformer module is differentiable and can be inserted into existing convolutional architectures, giving neural networks the ability to actively transform the spatial feature maps. The resulting network improves the feature invariance to translation, scale, rotation and more generic warping. Hence, the STN has been widely applied in fine-grained image recognition, e.g., face recognition [43, 44] and static image re-ID [45, 46]. Sønderby et al. extend the STN with RNNs in the spatial dimension for digit sequence recognition within an image. Our proposed method shares a similar idea with their work, but we infuse the RNNs into the STN in the temporal dimension. The resulting ST²N allows the learned transformation parameters to change smoothly across consecutive frames. Although the efforts in [45, 46] also take advantage of the STN to perform pedestrian alignment, our model differs from them in two aspects. First, their methods are developed for static image re-ID, while our proposed module is designed to address the video setting. Second, static image re-ID in [45, 46] has no temporal context to consider, whereas in the video setting we need to control the transformation parameters along the temporal dimension.
III The Proposed Approach
In this section, we first describe the overall architecture of our framework. Then we give the details of spatial-temporal transformer network module. After that, we elaborate the temporal residual learning in detail. At last, we describe the inference process of our proposed model.
III-A Overview of Network Architecture
The overall architecture of our proposed model is illustrated in Fig. 2. GoogLeNet is selected as the base network, although other convolutional networks are also feasible. The model consists of two streams: the main stream (upper) and the alignment stream (lower), which process the original input sequences and the aligned sequences, respectively. The two streams have the same architecture and share the same front-end (Convs in Fig. 2). From the Inception-1 layers to the end, the two streams do not share parameters. The main reason for this design is that the main stream and the alignment stream focus on different features of the sequences. Keeping the parameters of the two streams separate allows each stream to adapt to its own input.
In detail, each frame of a pedestrian sequence is first fed into the GoogLeNet to extract multi-level convolutional features. As pointed out in previous work, feature maps in high-level layers can encode the attention region and semantic cues. Inspired by this fact, we utilize the high-level feature maps (the outputs of the Inception-2 layers in the main stream) as inputs of the spatial-temporal transformer network (ST²N) module. The ST²N predicts the transformation parameters by leveraging the spatial information of the current frame and the temporal context knowledge from consecutive frames. The grid generator of the ST²N uses the predicted transformation parameters to construct a sampling grid. The sampling grid is a set of points at which the low-level feature maps (the outputs of Convs) should be sampled to produce the aligned outputs. The bilinear sampler takes the outputs of Convs and the sampling grid as inputs to produce the transformed feature maps. Then, we extract the feature vectors of video frames from the original feature maps and the transformed feature maps via global average pooling (GAP). We name the extracted features the original sequence descriptors (OSD) and the aligned sequence descriptors (ASD), respectively. Both the OSD and the ASD are sent, each in its own stream, into the following TRL module to get the video features.
We take the aligned pedestrian sequence as an example to demonstrate the processing procedure of the TRL module. First, we feed the ASD to the first BiLSTM of the TRL module. The outputs of the first BiLSTM are aggregated by a temporal pooling unit. The aggregated results act as the generic features. Then, the differences between the generic features and the outputs of the first BiLSTM are formulated as the specific residuals. The specific residuals are fed into the second BiLSTM, the outputs of which are summarized by the second temporal pooling. The aggregated results are the specific features of the aligned sequence. The final representations of the aligned sequences are obtained by fusing the generic features and the specific features.
III-B Spatial-Temporal Transformer Network Module
The proposed ST²N module (shown in Fig. 2) consists of three parts: the localization network, the grid generator and the bilinear sampler. The localization network includes a shallow CNN to aggregate high-level features, a BiLSTM to capture temporal context information and a fully connected layer to predict the transformation parameters. In detail, the shallow CNN follows a previously proposed structure. For the BiLSTM, we follow previous works [42, 32] and set the hidden unit size and dropout rate to 256 and 0.5, respectively. The number of neural units in the fully connected layer is set to 4, matching the number of transformation parameters.
Formally, we describe the proposed spatial-temporal transformer process as follows. Given an input frame at a certain time step, we first extract its high-level feature maps (the outputs of the Inception-2 layers in the main stream). The shallow CNN of the localization network then aggregates these feature maps into a single aggregated feature. Taking the aggregated feature together with the bi-directional contexts of the previous and next frames, the forward and backward processes of the BiLSTM capture the bi-directional temporal context of the current frame. After that, the fully connected layer predicts the transformation parameters, conditioned on this forward and backward temporal context.
In addition, because pedestrians almost always stand upright in videos, scale and translation transformations are enough to ensure person alignment. Thus, we only consider scale and translation, and perform the spatial transformation with an affine matrix whose free entries are the scale and translation parameters along the width and height directions.
Similar to the STN, our ST²N also utilizes the predicted transformation parameters to construct a sampling grid. The bilinear sampler takes the input feature maps (the outputs of the Convs in Fig. 2) and the sampling grid as inputs, and produces the spatially transformed features, which are sampled from the inputs at the grid points. Note that our proposed ST²N essentially searches for a target region and adaptively extracts the corresponding features based on the spatial-temporal information. The forward and backward contexts provide bi-directional information for the ST²N, which not only preserves the temporal continuity but also allows smooth and robust spatial transformations between consecutive frames. We apply the spatial transformation to the feature maps instead of the input image, which greatly reduces the computational cost. Based on the ST²N, we extract the features of the original pedestrian sequence and the aligned pedestrian sequence. The resulting OSD and ASD are sent into the temporal residual learning module to extract complementary features that represent pedestrian sequences.
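To make the grid generator and the bilinear sampler more tangible, the following is a minimal single-channel sketch. It is not the implementation used in this work: the names `affine_grid` and `bilinear_sample`, the symbols `sx`, `sy`, `tx`, `ty`, the normalised [-1, 1] coordinate convention and the zero padding outside the map are all assumptions made for illustration. The only property taken from the text is that the affine matrix is restricted to scale and translation, i.e., it has the form [[sx, 0, tx], [0, sy, ty]].

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]                   # one channel, indexed [row][col]
SamplingGrid = List[List[Tuple[float, float]]]   # (x, y) in normalised [-1, 1] coords


def affine_grid(sx: float, sy: float, tx: float, ty: float,
                out_h: int, out_w: int) -> SamplingGrid:
    """Source coordinates for every output cell under [[sx, 0, tx], [0, sy, ty]]."""
    def lin(n: int, i: int) -> float:
        # i-th of n evenly spaced target coordinates in [-1, 1]
        return 0.0 if n == 1 else -1.0 + 2.0 * i / (n - 1)
    return [[(sx * lin(out_w, j) + tx, sy * lin(out_h, i) + ty)
             for j in range(out_w)]
            for i in range(out_h)]


def bilinear_sample(fmap: FeatureMap, grid: SamplingGrid) -> FeatureMap:
    """Sample fmap at the grid points with bilinear interpolation."""
    h, w = len(fmap), len(fmap[0])

    def at(r: int, c: int) -> float:
        # Assumed zero padding for points outside the feature map.
        return fmap[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = []
    for grid_row in grid:
        out_row = []
        for x, y in grid_row:
            px = (x + 1.0) * (w - 1) / 2.0   # normalised -> pixel column
            py = (y + 1.0) * (h - 1) / 2.0   # normalised -> pixel row
            x0, y0 = math.floor(px), math.floor(py)
            ax, ay = px - x0, py - y0        # fractional offsets
            out_row.append((1 - ax) * (1 - ay) * at(y0, x0)
                           + ax * (1 - ay) * at(y0, x0 + 1)
                           + (1 - ax) * ay * at(y0 + 1, x0)
                           + ax * ay * at(y0 + 1, x0 + 1))
        out.append(out_row)
    return out


# Toy usage: zoom into the centre of a 3x3 map (scale 0.5, no translation).
fmap = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
zoomed = bilinear_sample(fmap, affine_grid(0.5, 0.5, 0.0, 0.0, 3, 3))
```

With scale factors below one the grid covers a sub-region of the input, so the sampler effectively crops and enlarges the target region; the translation terms shift that region. In the proposed module the four parameters would come from the localization network for each frame rather than being fixed by hand as in this toy call.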
III-C Temporal Residual Learning
The relationship of consecutive frames provides important hints for person recognition. Existing re-ID methods mainly introduce the forward RNNs and temporal pooling methods to capture temporal structure information. However, the plain forward RNNs have several obvious drawbacks that may hamper the performance of video re-ID. First, the forward RNNs merely summarize the information of previous inputs. Thus, the outputs of forward RNNs may be biased towards the later time-steps, making the later frames more dominant than the earlier ones. Second, the average or max temporal pooling used with forward RNNs only focuses on generic features or local salient parts of a sequence, which may neglect the valuable and specific information of samples in videos.
To address the aforementioned drawbacks, we propose the temporal residual learning (TRL) module. The TRL module can simultaneously learn the generic and specific features of pedestrian sequences. Specifically, the TRL module contains two BiLSTMs. Each of the BiLSTMs is followed by an average temporal pooling layer. The employed BiLSTM inherently has forward connections and backward connections, allowing information to flow backward and forward over the temporal dimension. Thus, it is reasonable to use the BiLSTM to extract the temporal structures of both the original sequences and the aligned sequences. In addition, the aligned sequences can provide complementary information for the original sequences. Therefore, the enhanced sequence features have better representation capability and robustness.
We take the aligned video sequence as an example to demonstrate the implementation details of the TRL module. Assume the aligned video sequence contains a given number of frames, and each frame has a corresponding ASD extracted from the alignment stream. For notational simplicity, we subsequently drop the superscript and consider each ASD independently. Feeding the descriptors into the first BiLSTM yields the bi-directional temporal features through the forward and backward processes of the first BiLSTM in the TRL module. The hidden state of the forward LSTM contains the context information of previous frames, while the hidden state of the backward LSTM captures the context information of future frames.
We concatenate the forward and backward hidden states as the temporal features of each frame.
These per-frame features are then aggregated by the temporal pooling layer to extract the generic representation of a pedestrian sequence.
Most existing methods use this generic representation to perform supervised learning for video re-ID. However, average temporal pooling over the overall features only concerns the generic features of a video sequence; the specific features are not explicitly highlighted. Therefore, we introduce another BiLSTM to model the specific features. The input to the second BiLSTM at each frame is the difference between the generic feature and the bi-directional feature of that frame, and the second BiLSTM processes these residuals in the same bi-directional manner as the first.
We note that the feature residuals capture the characteristics of the individual frames, i.e., they contain the specific features of each sample in a sequence. We also aggregate the outputs of the second BiLSTM by temporal pooling to obtain the specific representation of the video sequence.
The final representation of a sequence is then acquired by weighting the generic and specific features, where a trade-off parameter balances the importance of the two. In the following experiments, we set this parameter so that the two kinds of features have equal importance.
The above procedure can also be conducted on the original video sequence, yielding the corresponding features of the main stream.
From the above derivation, one can see that our proposed TRL forces the temporal module not only to perceive the generic features of the overall sequence, but also to exploit the specific features of individual frames. The resulting representation makes full use of the temporal structure cues and has more representation power than temporal pooling alone. We note that the BiLSTMs used in the two streams do not share weights, even though we use the same symbols to represent them.
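The data flow of the TRL module can be summarised in a short sketch. It is a schematic under explicit assumptions rather than the trained model: the two BiLSTMs are replaced by arbitrary sequence-to-sequence callables, the residual is taken as the generic feature minus the frame feature (following the wording above), and the fusion is assumed to be `generic + lam * specific` with `lam = 1.0` standing in for "equal importance". All names are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sequence = List[Vector]
Encoder = Callable[[Sequence], Sequence]  # stand-in for a BiLSTM: sequence in, sequence out


def mean_pool(seq: Sequence) -> Vector:
    """Average temporal pooling over the frames of a sequence."""
    return [sum(column) / len(seq) for column in zip(*seq)]


def temporal_residual_features(descriptors: Sequence,
                               first_encoder: Encoder,
                               second_encoder: Encoder,
                               lam: float = 1.0) -> Tuple[Vector, Vector, Vector]:
    """Return (generic, specific, fused) features for one sequence."""
    temporal = first_encoder(descriptors)        # per-frame temporal features
    generic = mean_pool(temporal)                # generic feature of the sequence
    # Residual of each frame: generic feature minus that frame's feature.
    residuals = [[g - h for g, h in zip(generic, frame)] for frame in temporal]
    specific = mean_pool(second_encoder(residuals))
    # Assumed fusion form; the trade-off weight lam is hypothetical.
    fused = [g + lam * s for g, s in zip(generic, specific)]
    return generic, specific, fused


# Toy usage with trivial stand-in encoders (NOT recurrent networks).
def identity(seq: Sequence) -> Sequence:
    return [list(v) for v in seq]


def magnitude(seq: Sequence) -> Sequence:
    return [[abs(x) for x in v] for v in seq]


generic, specific, fused = temporal_residual_features(
    [[1.0, 2.0], [3.0, 4.0]], identity, magnitude)
```

Note that the raw residuals always average to zero, so the second encoder has to be non-linear for the pooled specific feature to carry information; in the proposed module this role is played by the second BiLSTM.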
III-D Overall Feature Representation and Inference
In the training phase, we employ the softmax cross-entropy loss to supervise the feature learning of the main stream and the alignment stream. In the testing phase, for a new video sequence, we follow an existing feature fusion strategy to perform the identification matching. More specifically, given the features extracted from the original video sequence and the aligned video sequence, the overall sequence representation combines the L2-normalized features of the two streams, with a hyperparameter controlling the importance of each term. In this paper, we set this hyperparameter so that the main stream and the alignment stream have equal importance. In the experiments, we will show that the two streams provide complementary information and promote each other to achieve better performance.
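As an illustration of the two-stream fusion at test time, the sketch below L2-normalises each stream's feature and combines them with a weight. The exact fusion form is an assumption on our part: a weighted concatenation is shown, with the hypothetical parameter `beta = 0.5` standing in for equal importance, and a weighted sum would be equally compatible with the description above.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_streams(f_orig: Vector, f_aligned: Vector, beta: float = 0.5) -> Vector:
    """Assumed fusion: weighted concatenation of the L2-normalised stream features."""
    main_part = [beta * x for x in l2_normalize(f_orig)]
    aligned_part = [(1.0 - beta) * x for x in l2_normalize(f_aligned)]
    return main_part + aligned_part


# Toy usage with illustrative 2-D features from each stream.
overall = fuse_streams([3.0, 4.0], [0.0, 2.0])
```

Normalising before fusion keeps a stream with larger raw activations from dominating the distance computation, so the weight alone decides how much each stream contributes.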
IV Experiments
In this section, we thoroughly evaluate the proposed framework on several public video re-ID datasets. We first analyze the importance of each component of the proposed model. To validate the effectiveness and superiority of our method, we then make extensive comparisons with other state-of-the-art video re-ID methods. Finally, we conduct cross-dataset experiments to evaluate the generalization of the proposed model, and perform a hyperparameter analysis to demonstrate its insensitivity to the hyperparameter choices.
IV-A Datasets
The MARS dataset is a large-scale video dataset for person re-ID. It is captured by six near-synchronized cameras on the Tsinghua campus. As an extension of the Market-1501 dataset, MARS contains 1,261 different identities forming a total of 20,478 tracklets. These tracklets are automatically collected via the DPM detector and the GMMCP tracker. Each identity is captured by at least 2 cameras and has 13.2 tracklets on average. Among all the tracklets, there are 3,278 distractor tracklets caused by false detections and associations, which makes it very challenging to achieve high performance on this dataset. Following previous works, we divide this dataset into 625 persons for training and the rest for testing.
The PRID 2011 dataset is collected in several uncrowded outdoor scenes with relatively simple backgrounds and rare occlusions. Image frames of this dataset are captured by two static non-overlapping surveillance cameras. The two camera views record different numbers of identities, and only a subset of the pedestrians appears in both views. The person sequences vary in length. To guarantee the effective length of image sequences, we only select identities whose sequences exceed a minimum number of frames, following previous work.
The ILIDS-VID dataset is captured at an airport arrival hall under a multi-camera CCTV network. The dataset consists of 300 distinct individuals observed in two non-overlapping views, forming 600 pedestrian sequences. Each person has 23 to 192 images and the average number per person is 73. It is very challenging because of the large clothing similarities among pedestrians, lighting and viewpoint variations across views, as well as cluttered backgrounds and random occlusions.
The SDU-VID dataset is collected by two non-overlapping camera views in outdoor scenes. The dataset contains 600 pedestrian sequences for 300 different identities. Each sequence has a variable length from 16 to 346 image frames and the average number is 130. There are more image frames in each pedestrian sequence compared with ILIDS-VID and PRID 2011. SDU-VID is also a challenging dataset due to the cluttered backgrounds, occlusions and viewpoint variations.
IV-B Experimental Setup
To evaluate the performance, we employ the cumulative matching characteristics (CMC) curves and the mean average precision (mAP) as the evaluation criteria. For the MARS dataset, we follow the standard protocol and divide the dataset into two subsets, with 625 persons for training and the rest for testing. For the PRID 2011, ILIDS-VID and SDU-VID datasets, we adopt the protocol of previous work: each of the three datasets is randomly split by half into a training subset and a testing subset with non-overlapping identities. During testing, we regard the pedestrians in the first camera as the probes and those in the second camera as the gallery. The performance on these three datasets is evaluated by the CMC score over 10 trials with different train/test splits, and the average results of the 10 splits are reported.
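For readers unfamiliar with the two criteria, the following is a simplified sketch of how CMC scores and mAP can be computed from a probe-to-gallery distance matrix. It is an illustration only: the function name and toy data are hypothetical, and the sketch omits dataset-specific rules such as discarding gallery samples from the probe's own camera or handling distractors.

```python
from typing import List, Tuple


def cmc_and_map(dist: List[List[float]],
                query_ids: List[int],
                gallery_ids: List[int],
                max_rank: int) -> Tuple[List[float], float]:
    """dist[i][j] is the distance between probe i and gallery item j."""
    cmc_hits = [0] * max_rank
    average_precisions = []
    for i, qid in enumerate(query_ids):
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # probes without any true match are skipped
        first_hit = matches.index(True)
        for r in range(first_hit, max_rank):
            cmc_hits[r] += 1  # counted as correct at every rank >= first hit
        hits, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    n = len(average_precisions)
    if n == 0:
        raise ValueError("no probe has a matching gallery item")
    return [c / n for c in cmc_hits], sum(average_precisions) / n


# Toy usage: 2 probes, 3 gallery items (illustrative distances and identities).
dist = [[0.1, 0.5, 0.3],
        [0.2, 0.9, 0.4]]
cmc, mean_ap = cmc_and_map(dist, query_ids=[1, 2], gallery_ids=[2, 1, 1], max_rank=3)
```

The CMC score at rank k is the fraction of probes whose first correct match appears within the top k gallery items, whereas mAP also rewards ranking every correct match highly, which matters when an identity has several gallery tracklets, as on MARS.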
Network parameter settings.
We perform all the experiments on the TensorFlow platform. Most of the network parameter settings can be found in Fig. 2. The GoogLeNet pre-trained on the ImageNet dataset is selected as the CNN feature extractor. During training, we resize all input images to a uniform size. The number of neural units in each of the two temporal BiLSTMs is set to 512. For each sequence, we randomly choose 10 consecutive frames for training, and the mini-batch size is 12. For the convolutional and fully connected layers, the weights are initialized from a Gaussian distribution with mean 0 and variance 0.01. The weights of the BiLSTMs are initialized by the orthogonal initialization method. To prevent exploding gradients, we clip all gradients so that they lie in a fixed interval. The standard Adam algorithm is utilized to optimize the proposed framework.
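Two of the training details above, sampling consecutive frames and clipping gradients, can be sketched as follows. This is a minimal illustration under our own assumptions: the helper names are hypothetical, sequences shorter than the clip length are simply rejected, and the clipping interval is assumed to be symmetric with a caller-supplied `bound`, since the concrete interval is not specified here.

```python
import random
from typing import List


def sample_consecutive_frames(num_frames: int, clip_len: int = 10,
                              rng: random.Random = random.Random()) -> List[int]:
    """Indices of clip_len consecutive frames at a random start position."""
    if num_frames < clip_len:
        raise ValueError("sequence is shorter than the requested clip length")
    start = rng.randrange(num_frames - clip_len + 1)
    return list(range(start, start + clip_len))


def clip_by_value(grads: List[float], bound: float) -> List[float]:
    """Clip every gradient entry to the assumed symmetric interval [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]


# Toy usage with a seeded generator and a hypothetical bound of 5.0.
frame_indices = sample_consecutive_frames(25, clip_len=10, rng=random.Random(0))
clipped = clip_by_value([-7.5, 0.2, 12.0], bound=5.0)
```

Sampling a contiguous window, rather than arbitrary frames, keeps the temporal order that the BiLSTMs rely on, while value clipping bounds each gradient entry independently of the others.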
Because we introduce the spatial-temporal transformer (ST²N) module for spatial alignment, which deeply interacts with the feature extraction part, the training strategy of the proposed framework needs to be carefully designed. In other words, effective learning of the complementary features requires a well pre-trained ST²N. Thus, we train the proposed framework in a stage-wise fashion. 1) Pre-training the ST²N module on the large-scale MARS dataset. In detail, we pre-train the Convs and the ST²N module with frame-level supervision. The image-level feature vectors after the GAP are utilized to predict the identities of pedestrians in a sequence. In this stage, we train for a maximum of 10,000 iterations. 2) Updating the whole network for complementary feature learning. After pre-training the ST²N module, we use the learned parameters to initialize the whole network. Then the Convs, ST²N and TRL modules are jointly optimized under sequence-level supervision. In this stage, we use a decreased learning rate and a maximum of 10,000, 8,000, 8,000 and 5,000 iterations on the MARS, PRID 2011, ILIDS-VID and SDU-VID datasets, respectively. Both the frame-level supervision and the sequence-level supervision are optimized with the softmax cross-entropy loss.
IV-C Ablation Analysis
In this subsection, we perform in-depth studies on the MARS dataset to validate the contribution of each component of the proposed model. Table I summarizes the ablation results of our model with the same training hyperparameters described in the previous subsections. The first four rows are conducted on the original videos, and the last six rows are conducted on both the original videos and the aligned videos. In the table, the CNN component refers to the GoogLeNet, which extracts the convolutional features for each frame in a pedestrian sequence. The single-direction LSTM is employed to extract the temporal features, while the BiLSTM used in this paper exploits bi-directional information to extract the complementary features. The STN entry is the classical spatial transformer network, which predicts the spatial transformation parameters on single images, while the ST²N entry is the proposed spatial-temporal transformer module, which predicts the parameters conditioned on extra temporal context information. The generic and specific features are reported separately.
Effectiveness of the bi-directional LSTM structure.
From the first two rows of Table I, one can observe that the bi-directional model outperforms the single-direction LSTM in terms of the CMC scores at rank-1, 5, 20 and mAP. This indicates that, compared to the conventional single-directional structure, the bi-directional information flow helps each frame to interact with the other frames. The BiLSTMg has a better ability to capture the temporal information of pedestrian sequences. Therefore, in the following experiments we employ the BiLSTMg as the base structure for extracting temporal information.
Effectiveness of the STN and ST²N.
Comparing the results of rows 2-7 in Table I, we can see that performing spatial alignment with spatial transformers consistently improves the matching performance. In particular, the models with the STN outperform the plain models by a large margin (about 5% improvement) in terms of the mAP metric. From the results of rows 8-10, we can see that our proposed spatial-temporal transformer network (ST²N) further boosts the matching performance by about 2.0%, 1.0%, 2.0% and 1.5% in terms of the CMC scores at rank-1, 5, 20 and mAP. The results validate the effectiveness of incorporating the temporal context information for spatial transformation in practical video applications.
In addition, to clarify the effectiveness of the ST²N module, we visualize the aligned images and transformed feature maps of a randomly selected pedestrian video in Fig. 3. This figure shows that the proposed ST²N module helps to align images with smooth spatial variations. The transformed feature maps help to focus on the target attention regions and to suppress some of the cluttered backgrounds.
Effectiveness of the generic and specific features in TRL.
In the TRL module, the extracted generic and specific features essentially capture different characteristics of pedestrians. To verify their complementary behavior, we also perform feature-level experiments. As shown in rows 2-9 of Table I, the models already achieve very impressive results when using only one kind of feature. The models with only generic features consistently outperform the ones with only specific features. This may explain why most previous works, which only use the generic features, can achieve remarkable performance in video re-ID. Though the specific features are inferior to the generic features, the combined features achieve higher performance (about 2% improvement) in all compared metrics. The three groups of experiments demonstrate the effectiveness and superiority of the proposed TRL module.
Complementarity of two streams.
In our proposed model, the main stream is utilized to address the original video sequences, while the alignment stream deals with the aligned video sequences. To verify the effectiveness of each stream, we perform additional experiments on the MARS, PRID 2011, ILIDS-VID and SDU-VID datasets. The performances of the main stream, the alignment stream and the complete model are shown in Fig. 4. It can be seen that the alignment stream achieves comparable performance to the main stream, and the fusion of the two streams performs better than the individual streams on the four video datasets. Specifically, the fusion results achieve higher CMC matching rates at rank-1 than the main stream on all four datasets. The complete model also clearly outperforms the alignment stream alone. The consistent improvements indicate that the two streams provide complementary information for each other. When they are taken together, the model achieves the best results in all evaluation metrics.
[Fig. 4 panels: (a) MARS, (b) PRID 2011, (c) ILIDS-VID, (d) SDU]
IV-D Comparison with State-of-the-art Methods
To demonstrate the superiority of our approach, we compare the proposed video re-ID model with several state-of-the-art methods on the large-scale MARS dataset and three small datasets, i.e., PRID 2011, ILIDS-VID and SDU-VID.
Results on the large-scale dataset.
On the MARS dataset, we compare our method with seven state-of-the-art methods, including the ASTPN , CAR , CNN+XQDA , CNN+TAM+SRM , MSCAN , QAN  and TriNet . The detailed results are summarized in Table II. As can be seen, our baseline model achieves the rank-1 accuracy of and mAP of , which is better than most of the existing state-of-the-art methods except for the TriNet . Our proposed model with Euclidean distance achieves rank-1 accuracy of and mAP of . The results outperform the state-of-the-art method QAN  by a large margin. The improvements are and in terms of the rank-1 and mAP, respectively. The TriNet  is currently one of the strongest metric learning methods, and our model obtains competitive performance. The results motivate us to employ distance metric learning to further improve the performance. When our method is combined with KISSME , we arrive at rank-1 accuracy and mAP, with nearly and relative accuracy gains. The proposed model with XQDA  achieves the best results with , , and for the CMC scores at rank-1, 5, 20 and mAP, which outperform all compared methods. These results confirm the effectiveness of the proposed model on the large-scale, automatically detected MARS dataset.
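As a rough illustration of how a learned metric can replace the Euclidean distance, the sketch below follows the standard KISSME formulation, in which a Mahalanobis matrix is estimated as the difference between the inverse covariance of similar-pair differences and that of dissimilar-pair differences. It is a minimal NumPy sketch under stated assumptions rather than the exact implementation behind the reported numbers: the function names, the ridge term `eps` and the toy data are all hypothetical, and XQDA is not covered.

```python
import numpy as np


def kissme_metric(feats, labels, eps=1e-6):
    """Estimate a KISSME-style Mahalanobis matrix M from labelled features.

    feats: (n, d) array of sequence-level features (hypothetical input).
    labels: (n,) array of identity labels.
    eps: small ridge added to each covariance for invertibility (assumption).
    """
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    n, d = feats.shape
    # All pairwise differences x_i - x_j for i < j.
    i_idx, j_idx = np.triu_indices(n, k=1)
    diffs = feats[i_idx] - feats[j_idx]
    same = labels[i_idx] == labels[j_idx]
    # Covariance of the differences for similar and dissimilar pairs.
    cov_s = diffs[same].T @ diffs[same] / max(same.sum(), 1) + eps * np.eye(d)
    cov_d = diffs[~same].T @ diffs[~same] / max((~same).sum(), 1) + eps * np.eye(d)
    m = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    # Clip negative eigenvalues so that M is positive semi-definite.
    w, v = np.linalg.eigh((m + m.T) / 2.0)
    return (v * np.clip(w, 0.0, None)) @ v.T


def mahalanobis_sq(x, y, m):
    """Squared distance (x - y)^T M (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ m @ diff)


# Toy data: 3 identities, 4 samples each, 5-dimensional features.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 5)) * 3.0
toy_labels = np.repeat(np.arange(3), 4)
toy_feats = centers[toy_labels] + rng.normal(size=(12, 5)) * 0.3
M = kissme_metric(toy_feats, toy_labels)
```

At test time, `mahalanobis_sq(query_feat, gallery_feat, M)` would take the place of the squared Euclidean distance when ranking gallery sequences.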
Results on three small datasets.
In addition, we conduct experiments on three small video re-ID datasets. On these datasets, we compare with twenty state-of-the-art methods, including VR , DVR , DVDL , STFV3D , AFDA , LFDA , STFV3D+KISSME , PaMM , TDL , SIIDL , RFA , RNN , RNN+OF , RCN+KISSME , CNN+BRNN , ASTPN , CNN+XQDA , CNN+SRM+TAM , CAR  and QAN . Among the compared approaches, the first ten use hand-crafted features and the rest are deep-learning-based models. The CMC scores at rank-1, 5, 20 are reported in Table III.
Results in Table III demonstrate the clear superiority of the proposed model over existing state-of-the-art methods on the SDU-VID dataset. Specifically, our model achieves the rank-1 accuracy of , and can fully distinguish different queries at rank-5. It surpasses the previous best approach CAR  by and in terms of rank-1 and rank-5, respectively. For the PRID 2011 dataset, the proposed approach obtains and at rank-1 and rank-5, respectively. Apart from the QAN  model, our model achieves the best performance among all compared methods. On the more challenging ILIDS-VID dataset, the advantage of our method is less obvious than on the MARS and SDU-VID datasets, but we obtain results comparable to the CAR  and the RNN+OF . The QAN model still achieves the best performance on the ILIDS-VID dataset. The reason may be that training samples in the ILIDS-VID dataset are very limited, consisting of only 150 persons with 300 training sequences. In addition, the dataset contains large variations, cluttered backgrounds and low-resolution images. Those factors make it challenging to discriminate between different pedestrians in the dataset. The QAN  model takes full advantage of the labeled information by utilizing both image-level and sequence-level supervision for training. The results provide evidence that it is a suitable practice to include image-level supervision when very limited labeled videos are available. It is interesting to note that the performance margins between the hand-crafted features and the deep features are not very large, which suggests that hand-crafted features remain well suited to small datasets.
In addition, Table III shows that the proposed model outperforms the baseline model by a large margin. Our TRL module improves the rank-1 accuracy by , and on the PRID 2011, ILIDS-VID and SDU-VID datasets, respectively. Moreover, we arrive at a new state-of-the-art performance on the SDU-VID dataset, rank second on the PRID 2011 dataset and achieve competitive results on the challenging ILIDS-VID dataset. These results indicate that the proposed model is effective on small datasets as well.
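The CMC scores and mAP used in these comparisons can be computed from a query-to-gallery distance matrix. The following pure-Python sketch shows one common way to do so; it is an illustrative simplification, not the exact evaluation script. In particular, it assumes that no same-camera filtering is applied and that queries without a true match in the gallery are skipped, and the function name `cmc_and_map` and the toy data are hypothetical.

```python
def cmc_and_map(dist, query_ids, gallery_ids, ranks=(1, 5, 20)):
    """Compute CMC scores at the given ranks and mAP from a distance matrix.

    dist[i][j] is the distance between query i and gallery item j.
    Simplification (assumption): no same-camera filtering is applied, and
    queries without any true match in the gallery are skipped.
    """
    hits = {r: 0 for r in ranks}
    ap_sum = 0.0
    valid = 0
    for i, qid in enumerate(query_ids):
        # Gallery indices sorted by increasing distance to the query.
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue
        valid += 1
        first_hit = matches.index(True)  # 0-based position of first correct match
        for r in ranks:
            if first_hit < r:
                hits[r] += 1
        # Average precision: mean of the precision at each correct match.
        found, precisions = 0, []
        for k, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / k)
        ap_sum += sum(precisions) / len(precisions)
    if valid == 0:
        raise ValueError("no query has a true match in the gallery")
    cmc = {r: hits[r] / valid for r in ranks}
    return cmc, ap_sum / valid


# Toy example: 2 queries (ids 1 and 2), 4 gallery items (ids 1, 2, 1, 3).
toy_dist = [
    [0.1, 0.9, 0.5, 0.7],
    [0.6, 0.8, 0.2, 0.4],
]
toy_cmc, toy_map = cmc_and_map(toy_dist, [1, 2], [1, 2, 1, 3], ranks=(1, 2, 4))
```

In the toy example, the first query finds a correct match at rank 1, while the second finds its only match at rank 4, so the rank-1 score is 0.5 and the rank-4 score is 1.0.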
IV-E Cross-dataset Generalization
In practice, different datasets are usually collected under different visual conditions. Models trained on one dataset can perform badly on a different dataset, which means that dataset bias is inevitable. Hence, it is important to design models that generalize well to new environments. Cross-dataset testing is an effective way to evaluate this ability. To better understand the generalization performance of our model, we perform two kinds of cross-dataset experiments. The first trains on the large-scale MARS dataset and tests on the three small datasets. The second trains on one of the three small datasets and tests on the other two. Experimental results are listed in Table IV.
The results in Table IV reveal that the proposed model generalizes better when trained on the large-scale MARS dataset. In particular, we achieve , and of the CMC scores at rank-1, 5 and 20 on the SDU-VID dataset. The results are better than current state-of-the-art methods, which can be observed from Table III. However, the CMC scores for the PRID 2011 and the ILIDS-VID datasets are less impressive. The rank-1 scores for the two datasets are and , respectively. The main reason may be that the sample distributions of the PRID 2011 and ILIDS-VID datasets are very different from that of MARS, while the pedestrians in the SDU-VID dataset are relatively simple and have similar appearances to those in MARS.
When the model is trained on one of the three small datasets, the matching rates for the other two datasets decline by nearly half. Specifically, when we train the model on the PRID 2011 dataset, we achieve rank-1 accuracy for the ILIDS-VID and for the SDU-VID dataset. If the challenging ILIDS-VID dataset is selected as the training dataset, our model reports and rank-1 score for the PRID 2011 and the SDU-VID dataset, respectively. When training on the SDU-VID, the CMC scores for the PRID 2011 and the ILIDS-VID are the lowest. These results demonstrate that dataset biases do affect the matching rates. Our model generalizes to some extent in cross-dataset testing, but needs to be further optimized to improve the generalization performance.
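The two kinds of cross-dataset experiments can be enumerated as (train, test) pairs. The sketch below lists these pairs and runs a generic train-once, test-many loop. The callables `train_fn` and `eval_fn` are hypothetical placeholders for a training routine and an evaluation routine; they are not part of the original method description.

```python
from itertools import permutations

LARGE = "MARS"
SMALL = ["PRID 2011", "ILIDS-VID", "SDU-VID"]


def cross_dataset_pairs(large=LARGE, small=SMALL):
    """List (train, test) pairs for the two kinds of cross-dataset experiments."""
    # Kind 1: train on the large-scale dataset, test on each small dataset.
    pairs = [(large, s) for s in small]
    # Kind 2: train on one small dataset, test on each of the other two.
    pairs += list(permutations(small, 2))
    return pairs


def run_protocol(train_fn, eval_fn, pairs):
    """Train once per source dataset and evaluate on every paired target.

    train_fn(name) -> model and eval_fn(model, name) -> score are
    hypothetical callables supplied by the user.
    """
    models, results = {}, {}
    for train_name, test_name in pairs:
        if train_name not in models:
            models[train_name] = train_fn(train_name)
        results[(train_name, test_name)] = eval_fn(models[train_name], test_name)
    return results


pairs = cross_dataset_pairs()
```

With one large and three small datasets this yields nine (train, test) pairs from four trained models.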
IV-F Trade-off Parameter Analysis
There are two trade-off parameters in our approach: and . We conduct empirical analysis on the MARS dataset to analyze the parameter sensitivity based on the person re-ID accuracy. When analyzing one parameter, we keep the other one fixed. We report the CMC scores at rank-1, 5, 10, 20 as well as the mAP with and varying in the range of . Experimental results are illustrated in Fig. 5.
The parameter balances the importance between the generic and specific features of pedestrian sequences. When , it means that we merely consider the specific features. indicates that we only utilize the generic features to describe pedestrians. Fig. 5 (a) demonstrates that when we take both terms into consideration, we achieve better performance. The parameter controls the significance of the main stream and the alignment stream. It can be observed from Fig. 5 (b) that both the main stream and the alignment stream are important and can complement each other. It can also be noticed that the matching rates fluctuate only slightly when and vary. The results indicate that the proposed algorithm does not rely on parameter tuning to obtain outstanding performance. However, when we choose and , we can get the highest results. Therefore, in experiments, we set to give equal importance to each of the balance terms in the proposed model.
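To make the role of the two trade-off parameters concrete, the sketch below shows one plausible reading of the analysis: each parameter acts as a convex weight, the first between generic and specific features and the second between the main and alignment streams, and the sensitivity study varies one weight while the other stays fixed. The convex-combination form, the names `alpha` and `beta`, the default of 0.5 and the `evaluate` callable are all assumptions made for illustration; the exact formulation is the one defined in the method section.

```python
def blend(a, b, weight):
    """Convex combination weight * a + (1 - weight) * b of two vectors."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def fuse(generic_main, specific_main, generic_aligned, specific_aligned,
         alpha=0.5, beta=0.5):
    """Fuse features within each stream (alpha), then across streams (beta).

    alpha = 1 keeps only generic features, alpha = 0 only specific ones;
    beta = 1 keeps only the main stream, beta = 0 only the alignment stream.
    These conventions and the 0.5 defaults are assumptions of this sketch.
    """
    main = blend(generic_main, specific_main, alpha)
    aligned = blend(generic_aligned, specific_aligned, alpha)
    return blend(main, aligned, beta)


def sweep(evaluate, values, fixed=0.5):
    """Vary one parameter at a time while the other stays at `fixed`.

    evaluate(alpha, beta) -> score is a hypothetical callable.
    """
    alpha_curve = [(v, evaluate(v, fixed)) for v in values]
    beta_curve = [(v, evaluate(fixed, v)) for v in values]
    return alpha_curve, beta_curve


# Toy 2-dimensional features for the two streams.
g_m, s_m = [1.0, 0.0], [0.0, 1.0]
g_a, s_a = [2.0, 2.0], [0.0, 0.0]
fused = fuse(g_m, s_m, g_a, s_a, alpha=0.5, beta=0.5)
```

Calling `sweep(evaluate, values)` with a function that returns, for example, rank-1 accuracy would produce the two sensitivity curves, one per parameter.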
In this work, we focus on the problem of unconstrained video person re-ID, where video sequences pose severe challenges such as out-of-focus targets and cluttered backgrounds. To address these challenges, we propose a novel spatial-temporal transformer network (STN), which leverages temporal information to keep spatial alignments smooth between consecutive frames. In addition, to take full advantage of the extra temporal information, we propose a temporal residual learning (TRL) module to learn the generic as well as the specific features of pedestrian sequences. The TRL module incorporates bi-directional LSTM structures so that temporal information propagates not only from front to back but also in the reverse direction. Extensive experimental results on the MARS, PRID 2011, ILIDS-VID and SDU-VID datasets demonstrate the effectiveness and superiority of the proposed method over other state-of-the-art methods.
-  S. Gong, M. Cristani, S. Yan, and C. C. Loy, Eds., Person Re-Identification, ser. Advances in Computer Vision and Pattern Recognition. Springer, 2014.
-  C. C. Loy, T. Xiang, and S. Gong, “Multi-camera activity correlation analysis,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
-  X. Wang, “Intelligent multi-camera video surveillance: A review,” Pattern Recognition Letters, vol. 34, no. 1, pp. 3–19, 2013.
-  D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in Proceedings of European Conference on Computer Vision, 2008.
-  M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-identification by symmetry-driven accumulation of local features,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  B. Ma, Y. Su, and F. Jurie, “Bicov: a novel image representation for person re-identification and face verification,” in Proceedings of British Machine Vision Conference, 2012.
-  I. Kviatkovsky, A. Adam, and E. Rivlin, “Color invariants for person reidentification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1622–1634, 2013.
-  B. Ma, Y. Su, and F. Jurie, “Covariance descriptor based on bio-inspired features for person re-identification and face verification,” Image and Vision Computing, vol. 32, no. 6, pp. 379–390, 2014.
-  Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li, “Salient color names for person re-identification,” in Proceedings of European Conference on Computer Vision, 2014.
-  T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato, “Hierarchical gaussian descriptor for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  J. Dai, Y. Zhang, H. Lu, and H. Wang, “Cross-view semantic projection learning for person re-identification,” Pattern Recognition, 2017.
-  W. Zheng, S. Gong, and T. Xiang, “Person re-identification by probabilistic relative distance comparison,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
-  A. Mignon and F. Jurie, “Pcca: A new approach for distance learning from sparse pairwise constraints,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, “Learning locally-adaptive decision functions for person verification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  S. Pedagadi, J. Orwell, S. A. Velastin, and B. A. Boghossian, “Local fisher discriminant analysis for pedestrian re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  S. Paisitkriangkrai, C. Shen, and A. van den Hengel, “Learning to rank in person re-identification with metric ensembles,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  Y. Zhang, B. Li, H. Lu, A. Irie, and X. Ruan, “Sample-specific svm learning for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  L. Zhang, T. Xiang, and S. Gong, “Learning a discriminative null space for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
-  J. Hu, J. Lu, and Y. Tan, “Deep transfer metric learning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person re-identification,” Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
-  T. Xiao, H. Li, W. Ouyang, and X. Wang, “Learning deep feature representations with domain guided dropout for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based cnn with improved triplet loss function,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  S. Zhou, J. Wang, J. Wang, Y. Gong, and N. Zheng, “Point to set similarity based deep feature learning for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  D. Li, X. Chen, Z. Zhang, and K. Huang, “Learning deep context-aware features over body and latent parts for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  S. Z. Chen, C. C. Guo, and J. H. Lai, “Deep ranking for person re-identification via joint representation learning,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2353–2367, 2016.
-  A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv:1703.07737, 2017.
-  H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, “Spindle net: Person re-identification with human body region guided feature decomposition and fusion,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  N. McLaughlin, J. Martinez del Rincon, and P. Miller, “Recurrent convolutional network for video-based person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  L. Wu, C. Shen, and A. van den Hengel, “Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach.”
-  L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for large-scale person re-identification,” in Proceedings of European Conference on Computer Vision, 2016.
-  Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou, “Jointly attentive spatial-temporal pooling networks for video-based person re-identification,” in Proceedings of IEEE International Conference on Computer Vision, 2017.
-  W. Zhang, S. Hu, and K. Liu, “Learning compact appearance representation for video-based person re-identification,” arXiv:1702.06294.
-  W. Zhang, X. Yu, and X. He, “Learning bidirectional temporal cues for video-based person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
-  Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang, “Person re-identification via recurrent feature aggregation,” in Proceedings of European Conference on Computer Vision, 2016.
-  Y. Liu, J. Yan, and W. Ouyang, “Quality aware network for set to set recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks. IEEE Press, 1997.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015.
-  S. K. Sønderby, C. K. Sønderby, L. Maaløe, and O. Winther, “Recurrent spatial transformer networks,” arXiv:1509.05329, 2015.
-  W. Wu, M. Kan, X. Liu, Y. Yang, S. Shan, and X. Chen, “Recursive spatial transformer (rest) for alignment-free face recognition,” in Proceedings of IEEE International Conference on Computer Vision, 2017.
-  Y. Zhong, J. Chen, and B. Huang, “Toward end-to-end face recognition through alignment learning,” IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1213–1217, 2017.
-  D. Li, X. Chen, Z. Zhang, and K. Huang, “Learning deep context-aware features over body and latent parts for person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Z. Zheng, L. Zheng, and Y. Yang, “Pedestrian alignment network for large-scale person re-identification,” arXiv:1707.00408, 2017.
-  M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person re-identification by descriptive and discriminative classification,” 2011.
-  T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by video ranking,” in Proceedings of European Conference on Computer Vision, 2014.
-  K. Liu, B. Ma, W. Zhang, and R. Huang, “A spatio-temporal appearance representation for video-based pedestrian re-identification,” in Proceedings of IEEE International Conference on Computer Vision, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  Y. Li, Z. Wu, S. Karanam, and R. J. Radke, “Multi-shot human re-identification using adaptive fisher discriminant analysis,” in Proceedings of British Machine Vision Conference, 2015.
-  T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by discriminative selection in video ranking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2501–2514, 2016.
-  X. Zhu, X. Jing, F. Wu, and H. Feng, “Video-based person re-identification by simultaneously learning intra-video and inter-video distance metrics,” in Proceedings of International Joint Conference on Artificial Intelligence, 2016.
-  J. You, A. Wu, X. Li, and W.-S. Zheng, “Top-push video-based person re-identification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  Y.-J. Cho and K.-J. Yoon, “Improving person re-identification via pose-aware multi-shot matching,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K. M. Lam, and Y. Zhong, “Person re-identification by unsupervised video matching,” Pattern Recognition, vol. 65, no. C, pp. 197–210, 2017.
-  A. Kläser, M. Marszalek, and C. Schmid, “A spatio-temporal descriptor based on 3d-gradients,” in Proceedings of British Machine Vision Conference, 2008.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proceedings of IEEE International Conference on Computer Vision, 2015.
-  P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
-  A. Dehghan, S. M. Assari, and M. Shah, “GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior, “The relation between the roc curve and the cmc,” in Automatic Identification Advanced Technologies, Fourth IEEE Workshop on, 2005.
-  “TensorFlow: Large-scale machine learning on heterogeneous systems.” [Online]. Available: http://tensorflow.org/
-  M. Henaff, A. Szlam, and Y. LeCun, “Recurrent orthogonal networks and long-memory tasks,” arXiv preprint arXiv:1602.06662, 2016.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
-  S. Karanam, Y. Li, and R. J. Radke, “Person re-identification with discriminatively trained viewpoint invariant dictionaries,” in Proceedings of IEEE International Conference on Computer Vision, 2015.
-  W. Zhang, B. Ma, K. Liu, and R. Huang, “Video-based pedestrian re-identification by adaptive spatio-temporal appearance model,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2042–2054, 2017.