Convolutional Temporal Attention Model for Video-based Person Re-identification

The goal of video-based person re-identification is to match two input videos, such that the distance between the two videos is small if they contain the same person. A common approach is to first extract image features for all frames in a video and then aggregate them into a video-level feature. The video-level features of two videos can then be used to calculate the distance between them. In this paper, we propose a temporal attention approach for aggregating frame-level features into a video-level feature vector for re-identification. Our method is motivated by the fact that not all frames in a video are equally informative. We propose a fully convolutional temporal attention model for generating the attention scores. Fully convolutional networks (FCNs) have been widely used in semantic segmentation for generating 2D output maps. In this paper, we formulate video-based person re-identification as a sequence labeling problem like semantic segmentation. We establish a connection between the two problems and modify FCN to generate attention scores that represent the importance of each frame. Extensive experiments on three benchmark datasets (i.e. iLIDS-VID, PRID-2011 and SDU-VID) show that our proposed method outperforms other state-of-the-art approaches.




1 Introduction

Person re-identification is an active area of research in computer vision. Earlier work (e.g. [1, 2]) in this area focuses on image-based re-identification. Given an input query image (called the probe image) of a person, the goal is to find this person in a collection of gallery images. Recently, there has been work [3, 4] exploring video-based person re-identification. Compared with images, video-based person re-identification is a more realistic setting for real-world applications. In this paper, we focus on video-based person re-identification. Given a query video of a person, we would like to identify the person by matching the query video to a collection of gallery videos.

Most work in person re-identification uses some form of metric learning. Given two images (or videos), we would like their distance to be small if they contain the same person, and the distance to be large otherwise. See Figure 1 for an illustration in the case of video-based person re-identification. In image-based person re-identification, convolutional neural networks (CNNs) are often used to learn this distance metric. For example, in [1, 2], CNNs are used to extract features from images in a way that the distance between the extracted features can be used as the distance metric. Most existing approaches [5, 3, 4] in video-based person re-identification follow a similar strategy. First, image features are extracted from each frame in the video. These frame-level image features of a video are then aggregated to form a video-level feature with fixed length. Finally, the distance between two videos is calculated based on their video-level features.

Figure 1: Illustration of video-based person re-identification problem. The problem can be formulated as learning the distance between two input videos. If the two videos contain the same person, their distance should be small. Otherwise, the distance should be large.

As observed in previous work [3, 4], not all frames in a video are equally informative. For example, the person might be heavily occluded in some frames. Ideally, we would like to pay attention to “good” frames when constructing the video-level feature representation. Previous works [4, 3] have used recurrent neural networks (RNNs) to assign a temporal attention score to each frame in a video. The attention score indicates how informative a frame is. Intuitively, the learning algorithm should assign small attention scores to frames where the person is heavily occluded. The video-level feature is obtained by the summation of frame-level features weighted by their attention scores.

Although RNN has been popular in many sequence labeling problems, it has some inherent limitations. The computation involved in RNN is sequential, i.e. we cannot process a frame until all previous frames have been processed. Due to the sequential nature of RNN, it is difficult to take advantage of the GPU hardware and fully parallelize the computation involved in RNN. The same observation has also been made in natural language processing [6, 7]. Some recent works [6, 7] have advocated using convolutional models instead of recurrent models for sequence labeling tasks. Similar to RNN, convolutional models can also capture contextual dependencies in sequential data via their effective receptive field. But different from RNN, convolutional models can better exploit the GPU hardware.

Our contributions include: (1) To the best of our knowledge, this is the first attempt to formulate video-based person re-identification as a sequence labeling problem like semantic segmentation. We propose a fully convolutional temporal model for generating the attention scores of frames in a video. (2) Unlike previous work (e.g. [4]) that uses RNN to generate the attentions, our model directly generates attentions based on frame-based features. As a consequence, the computation of the attentions is much simpler and can be easily parallelized. (3) Extensive experiments on three benchmark datasets show that our proposed model outperforms other state-of-the-art methods.

2 Related Work

The problem of person re-identification can be divided into two categories: image-based and video-based. Previous methods for person re-identification from static images focus on two tasks: (1) extracting features from input images and (2) measuring their similarity or distance to determine whether the images belong to the same person or not. Discriminative features play a vital role in handling environmental and viewpoint changes. [8] and [9] propose methods that consider patch appearance statistics to localize the important parts of an individual person. In [10], an ensemble of spatial and color information is used to increase robustness to viewpoint variations. After extracting features from images, distance metric learning is applied so that the distance between different persons increases while the distance between images of the same person decreases. In deep learning, both feature extraction and distance metric learning are applied in an end-to-end fashion for person re-identification.

In recent years, researchers have started to pay more attention to video-based person re-identification, partly because this is a more realistic setting for real-world applications. A previous method [11] considers frame-level similarity to identify the same person. Recently, deep learning based approaches have been gaining popularity for video re-identification. Most of them use a Siamese architecture where each branch contains an RNN to capture temporal information. McLaughlin et al. [5] propose a method which collects temporal information using optical flow, a recurrent neural network (RNN) and temporal pooling layers. Following [5], Xu et al. [3] propose a Spatial and Temporal Attention Pooling Network (ASTPN) for learning interdependence information. Our work is motivated by the recent success of attention-based models [12, 13]. In this paper, we generate an attention score for each frame which indicates the importance of that frame within the input video sequence. The main contribution of our method is establishing the connection between video-based person re-identification and semantic segmentation. Instead of RNN, here we adopt a fully convolutional network (FCN) for generating attention over the video frames.

3 Our Approach

Our proposed approach uses a Siamese network architecture (see Figure 2). Our network takes a pair of video sequences as its input. It outputs a scalar value indicating how likely it is that these two videos contain the same person. The network architecture has two identical branches with shared parameters. Each branch of the network takes a video sequence as the input and extracts per-frame features using a CNN. Then we compute an attention score for each frame using a Fully Convolutional Network (FCN). The attention score indicates the importance of the frame for the re-identification task. The video-level feature representation is obtained by aggregating the frame-level features weighted by the corresponding attention scores. Finally, the video-level features of the two input videos are used to compute their distance for re-identification.

Figure 2: Illustration of our proposed network architecture for video-based person re-identification. The network takes a pair of input video sequences, where each sequence consists of $N$ frames. Each frame in a video is passed to a Convolutional Neural Network (CNN) to generate a 128-dimensional frame-level feature. In other words, each video is represented as a feature matrix of dimensions $N \times 128$. We then use a fully convolutional attention module that takes the feature matrix and generates a vector of $N$ attention scores, one for each frame in the video. Then we use the attention scores to re-weight the frame-level features and produce a 128-dimensional video-level feature vector using a temporal pooling layer. The video-level feature vector is then normalized and used to compute the distance between the two input videos. We use a squared hinge loss on the distance for learning. During training, we also use the video-level feature vector to classify the identity of the person in a video and use a standard softmax loss for the classification.

3.1 Frame Feature Extraction Module

Following [5], we use both RGB color and optical flow channels for extracting frame-level features. Color channels give information about a person’s appearance, while optical flow gives motion-related information. As a preprocessing step, we first convert input video frames from RGB to YUV color space. Each color channel is then normalized to have zero mean and unit variance. Both vertical and horizontal optical flow channels for each frame are calculated using the Lucas-Kanade algorithm [14]. In the end, each input frame is represented as a tensor with 5 channels, where the 5 channels correspond to 3 color channels and 2 optical flow channels. We use a CNN architecture similar to [5] to extract frame-level features. The CNN architecture consists of three stages of convolution, max-pooling, and nonlinear activations. Each convolution layer is applied with a fixed kernel size, stride, and zero padding. In the end, the CNN produces a 128-dimensional feature vector (i.e. $z_i \in \mathbb{R}^{128}$) to represent each frame ($i = 1, 2, \ldots, N$) in the input video.
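The color preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conversion uses the standard BT.601 analog YUV coefficients, the normalization statistics are computed per frame (the text does not say whether they are computed per frame or over the whole dataset), and the helper names `rgb_to_yuv`, `normalize_channel` and `preprocess_frame` are hypothetical. Optical flow is not covered here.

```python
import math


def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV (BT.601 analog coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def normalize_channel(values):
    """Shift and scale a flat list of values to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(var) or 1.0  # guard against a constant channel
    return [(x - mean) / std for x in values]


def preprocess_frame(pixels):
    """pixels: list of (r, g, b) tuples for one frame.

    Returns three lists (Y, U, V), each normalized independently.
    Assumption: statistics are computed per frame.
    """
    yuv = [rgb_to_yuv(*p) for p in pixels]
    return [normalize_channel([p[c] for p in yuv]) for c in range(3)]


channels = preprocess_frame([(255, 0, 0), (0, 255, 0), (0, 0, 255), (10, 20, 30)])
```

In a full pipeline the two optical flow channels would be appended to these three color channels to form the 5-channel input.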

3.2 Fully Convolutional Attention Module

Motivated by the recent attention-based models (e.g. [12, 13]), we introduce an attention-based approach for re-identifying persons from video sequences. The attention-based approach is inspired by the visual processing of the human brain, which often pays attention to discriminative regions of different frames instead of the whole video when trying to re-identify persons [3]. In this paper, we focus on temporal attention by formulating it as a sequence labeling problem like semantic segmentation. In particular, we adopt a 1D temporal version of the Fully Convolutional Network (FCN) [15] to generate the temporal attentions. FCN is a widely used network architecture for semantic segmentation. Let $I$ be an input image with spatial dimensions $H \times W$ and 3 color channels. FCN first uses an encoder network to extract a feature map from the image $I$. The feature map is then used to produce an output map with the same spatial dimensions as the input image, i.e. $H \times W$. Each entry of the output map represents the semantic label at the corresponding pixel location in the image. In summary, FCN processes a 2D image (with 3 channels) and produces a 2D map as the output.

Figure 3: Detailed architecture of our fully convolutional attention module for generating temporal attentions on frames in a video. The input of the attention module is the set of feature vectors generated by the frame-level CNN (see Sec. 3.1), where each frame is represented as a 128-dimensional vector. We adopt a fully convolutional network (FCN) that takes the frame features ($z_1, z_2, \ldots, z_N$) as input and generates attention scores. The convolution layers are annotated with the window sizes of the convolutions. The attention scores are then normalized using the Sigmoid activation function and multiplied with the frame-level feature vectors to generate a weighted feature for the entire video.

The main insight of this paper is as follows. Suppose each frame in a video has been processed (see Sec. 3.1) and represented as a 128-dimensional feature vector. We can treat a video with $N$ frames as a 1D input “image” of length $N$ with 128 channels. If we want to assign an attention score to each frame, we can treat these attention scores as a 1D output map of length $N$. In other words, we can adopt FCN to generate attention scores for frames in a video by making the following modifications: 1) instead of taking an input image of size $H \times W$ with 3 channels, we now take an input of length $N$ with 128 channels; 2) instead of producing an $H \times W$ output map, we now produce an output map of length $N$; 3) instead of 2D convolution and 2D pooling, we perform 1D convolution and 1D pooling operations. We call this the fully convolutional attention module.
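The switch from 2D to 1D convolution can be illustrated with a minimal sketch of a single temporal convolution layer. It is not the FCN32s architecture used in the paper: it has one layer, one output channel, and no pooling or upsampling. The function name `conv1d_same`, the odd kernel length, and the toy 2-channel features (instead of 128) are assumptions made for the example. As in most deep learning frameworks, the operation is a cross-correlation.

```python
def conv1d_same(features, kernel, bias=0.0):
    """Temporal 1D convolution with zero padding and one output channel.

    features: N frames, each a list of C channel values.
    kernel:   K taps (K odd), each a list of C weights.
    Returns a list of N scores, one per frame.
    """
    n, k = len(features), len(kernel)
    half = k // 2
    scores = []
    for t in range(n):
        total = bias
        for j in range(k):
            src = t + j - half
            if 0 <= src < n:  # positions outside the video act as zeros
                total += sum(w * x for w, x in zip(kernel[j], features[src]))
        scores.append(total)
    return scores


# Toy video: 4 frames with 2 channels each.
features = [[1, 9], [2, 9], [3, 9], [4, 9]]
# Window of size 3 that sums the first channel and ignores the second.
kernel = [[1, 0], [1, 0], [1, 0]]
scores = conv1d_same(features, kernel)
```

Every output score depends only on a local window of frames, so all positions can be computed independently, which is the property that makes this kind of model easy to parallelize.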

Figure 3 shows the detailed architecture of the fully convolutional temporal attention module. Each frame of an input video is represented as a 128-dimensional frame-level feature vector (see Sec. 3.1). The sequence of frame-level feature vectors is passed to the fully convolutional attention module to generate temporal attention scores. In this paper, we adopt FCN32s [15], which is the basic fully convolutional network used in semantic segmentation. The input to the attention module is a tensor of dimension $N \times 128$, where $N$ corresponds to the number of input video frames. The output is a vector of $N$ scores, which is followed by a Sigmoid function. In the end, we obtain an attention score $\lambda_i$ for each feature vector $z_i$, where $i$ denotes the $i$-th frame in the video sequence. We can express the attention scores using the following equations:

\begin{align}
\alpha_1, \alpha_2, \ldots, \alpha_N &= \mathrm{FCN}(z_1, z_2, \ldots, z_N) \\
\lambda_i &= \frac{1}{1 + \exp(-\alpha_i)}, \quad \text{where } i = 1, 2, \ldots, N
\end{align}

where $\mathrm{FCN}(\cdot)$ represents the FCN network that takes frame-level features and produces a vector of unnormalized attention scores ($\alpha_1, \ldots, \alpha_N$). This is followed by a Sigmoid function to produce normalized attention scores ($\lambda_1, \ldots, \lambda_N$).

The frame-level attention scores ($\lambda_i$) are then combined with the per-frame feature vectors ($z_i$) via attention pooling to generate a weighted feature vector as follows:

$$f_{att} = \sum_{i=1}^{N} \lambda_i z_i$$

Here $f_{att}$ can be treated as a feature vector of the entire video sequence. We have also tried Softmax in place of Sigmoid for normalizing the attention scores and found that it does not perform as well as Sigmoid. This is consistent with the observation in previous work [16].

In our experiments, we find that the video-level feature obtained by a regular temporal pooling layer [5] can also help to improve the performance. So our final video-level feature is the average of the feature vectors obtained from the attention pooling and the regular temporal pooling. We apply normalization on the final video-level feature in the end.
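The aggregation steps of this section can be put together in a short sketch. Two details are assumptions, since the text does not pin them down: the regular temporal pooling is taken to be a mean over frames, and the final normalization is taken to be an L2 normalization. The function names `sigmoid` and `video_feature` are hypothetical, and the raw scores stand in for the output of the attention module.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def video_feature(frame_features, raw_scores):
    """Combine frame features into one video-level feature.

    frame_features: N lists of equal length (one per frame).
    raw_scores:     N unnormalized attention scores (alpha_i).
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    weights = [sigmoid(a) for a in raw_scores]  # lambda_i
    # Attention pooling: weighted sum of the frame features.
    attended = [sum(weights[i] * frame_features[i][d] for i in range(n))
                for d in range(dim)]
    # Regular temporal pooling (assumption: mean over frames).
    pooled = [sum(f[d] for f in frame_features) / n for d in range(dim)]
    # Average of both features, then normalization (assumption: L2).
    combined = [(a + p) / 2.0 for a, p in zip(attended, pooled)]
    norm = math.sqrt(sum(x * x for x in combined)) or 1.0
    return [x / norm for x in combined]


frames = [[1.0, 0.0], [0.0, 1.0]]
equal = video_feature(frames, [0.0, 0.0])
skewed = video_feature(frames, [10.0, -10.0])
```

With equal scores both frames contribute equally. With a high score on the first frame and a low score on the second, the video-level feature leans towards the first frame.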

3.3 Model Learning

In this section, we describe the details of learning the parameters of our model. Let $f_i$ and $f_j$ be the feature vectors of two input videos from the Siamese network. Following [5, 3], we calculate the Euclidean distance between the feature vectors and apply the squared hinge loss ($L_{hinge}$) as follows:

$$L_{hinge}(f_i, f_j) = \begin{cases} \frac{1}{2}\,\|f_i - f_j\|^2, & p_i = p_j \\ \frac{1}{2}\left[\max\left(0,\; m - \|f_i - f_j\|\right)\right]^2, & p_i \neq p_j \end{cases}$$

where the hyper-parameter $m$ represents the margin separating the two classes in $L_{hinge}$. Here we use $p_i$ and $p_j$ to represent the identities of the persons in the two input videos. The idea is that if the two videos contain the same person (i.e. $p_i = p_j$), the distance between the feature vectors should be small. Otherwise, the distance should be large if the persons are different (i.e. $p_i \neq p_j$).
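A minimal sketch of a squared hinge loss on the Euclidean distance is given below. It assumes the contrastive form commonly used with Siamese networks, including the factor 1/2, which may differ in detail from the exact loss used here. The function name and the margin values in the usage lines are hypothetical.

```python
import math


def squared_hinge_loss(f_a, f_b, same_person, margin):
    """Contrastive-style squared hinge loss on the Euclidean distance.

    Assumed form: 0.5 * d^2 for a matching pair and
    0.5 * max(0, margin - d)^2 for a non-matching pair.
    """
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(f_a, f_b)))
    if same_person:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


f_a, f_b = [0.0, 0.0], [3.0, 4.0]  # Euclidean distance 5
loss_same = squared_hinge_loss(f_a, f_b, True, margin=2.0)
loss_far = squared_hinge_loss(f_a, f_b, False, margin=2.0)
loss_near = squared_hinge_loss(f_a, f_b, False, margin=7.0)
```

A matching pair is penalized in proportion to its squared distance, while a non-matching pair is penalized only when it falls inside the margin.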

Similar to [5], we also apply another loss (i.e. the identity loss $L_{id}$) to each branch of the Siamese network to predict the person’s identity. We use a linear classifier to predict the person’s identity from the feature vector extracted through each branch of the Siamese network. We then apply a Softmax loss over the prediction for each Siamese branch. The final loss is the combination of the two identity losses (i.e. $L_{id}(f_i)$ and $L_{id}(f_j)$) from each Siamese branch and the hinge loss as follows:

$$L = L_{id}(f_i) + L_{id}(f_j) + L_{hinge}(f_i, f_j)$$
We use stochastic gradient descent to optimize the final loss function defined above. After training, we remove all loss functions including the identity and hinge losses. During testing, we only use the feature vectors to compute the distance between two input videos for re-identification.
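The combination of the identity losses with the hinge loss can be sketched as follows. The sketch assumes an unweighted sum of the three terms, which is one plausible reading of "combination", and all names (`linear_logits`, `softmax_cross_entropy`, `total_loss`) and the toy classifier weights are hypothetical.

```python
import math


def linear_logits(feature, weights, biases):
    """Linear classifier: one score per identity."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, biases)]


def softmax_cross_entropy(logits, label):
    """Softmax loss for one sample, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def total_loss(hinge, logits_a, label_a, logits_b, label_b):
    """Assumption: hinge loss plus the two identity losses, unweighted."""
    return (hinge
            + softmax_cross_entropy(logits_a, label_a)
            + softmax_cross_entropy(logits_b, label_b))


# Toy classifier over 2 identities and 2-dimensional features.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
logits_a = linear_logits([2.0, 0.0], weights, biases)
logits_b = linear_logits([0.0, 2.0], weights, biases)
loss = total_loss(0.5, logits_a, 0, logits_b, 1)
```

At test time none of these terms is needed, since only the video-level features and their distance are used.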

4 Experiments

In this paper, we use three benchmark datasets (i.e. iLIDS-VID [17], PRID-2011 [18] and SDU-VID [19]) for evaluating our proposed method. We first describe the experiment setup and some implementation details (Sec. 4.1). Then, we present the experimental results and compare with previous work (Sec. 4.2).

Figure 4: Visualization of the attention scores learned by our approach. Each row shows selected frames in a video. The color of the box enclosing a frame indicates the attention score of that frame. Warm colors correspond to high attention scores. We can see that frames with high attention scores tend to have less background clutter and fewer occlusions.

4.1 Setup and Implementation Details

We follow the experimental protocol of McLaughlin et al. [5] on the iLIDS-VID and PRID-2011 datasets. We randomly split each dataset into two equal parts: one part is used for training and the remaining part is used for testing. We repeat all experiments 10 times for stable results. We use the Cumulative Matching Characteristics (CMC) curve as the evaluation metric. We use equal numbers of positive and negative samples during training to alleviate the effect of class imbalance. The margin hyper-parameter in the hinge loss is kept fixed. The network is trained for 1400 epochs with a batch size of one. The learning rate is decreased from its initial value in steps. For the iLIDS-VID dataset, we decrease the learning rate by a factor of 10 after 1300 epochs. On the PRID-2011 dataset, we decrease the learning rate by a factor of 10 after 800 and 1100 epochs. For the SDU-VID dataset, we follow the experimental protocol of Zhang et al. [20]. For this dataset, we decrease the learning rate by a factor of 10 after 1200 epochs.
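The step schedules described above can be written as a small helper. The milestones are taken from the text, while the function name `learning_rate`, the dictionary `MILESTONES`, and the initial learning rate used in the usage line are placeholders. The sketch also assumes that `epoch` counts completed epochs, so the drop takes effect once that many epochs have finished.

```python
def learning_rate(epoch, initial_lr, milestones, factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr / (factor ** drops)


# Epochs after which the rate is decreased, per dataset (from the text).
MILESTONES = {
    "iLIDS-VID": [1300],
    "PRID-2011": [800, 1100],
    "SDU-VID": [1200],
}

# Placeholder initial rate, for illustration only.
lr = learning_rate(900, 1e-3, MILESTONES["PRID-2011"])
```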

4.2 Experimental Results

We show the experimental results and comparisons with previous methods on the three datasets in Table 1, Table 2 and Table 3, respectively. We can see that our proposed method outperforms previous methods in terms of rank-1 CMC accuracy on all three datasets. The comparison with [4] is particularly interesting since [4] uses a similar temporal attention approach. The difference is that [4] uses RNN to generate attention scores, while our method uses a fully convolutional network (FCN) to generate attention scores. This shows that convolutional models provide a competitive alternative to RNN for temporal attention. In addition to temporal attention, [4] uses a recurrent model to generate spatial attention. In contrast, our model only uses temporal attention and is much simpler, yet still achieves much better performance. Figure 4 shows some visualizations of the attention scores on frames within the same videos. Intuitively, frames with high attention scores tend to be the ones with less background clutter and fewer occlusions.
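For reference, the rank-k CMC accuracy reported in the tables can be computed from a probe-gallery distance matrix as sketched below. The sketch assumes that each probe has exactly one true match in the gallery, stored at the same index, and that ties are resolved in favor of the true match. The function name `cmc_curve` and the toy distances are hypothetical.

```python
def cmc_curve(distances, max_rank):
    """Cumulative Matching Characteristics in percent.

    distances[i][j]: distance between probe i and gallery item j.
    Assumption: the true match of probe i is gallery item i.
    Returns a list where entry k-1 is the rank-k accuracy.
    """
    n = len(distances)
    hits = [0] * max_rank
    for i, row in enumerate(distances):
        # Rank of the true match: gallery items strictly closer, plus one.
        rank = 1 + sum(1 for j, d in enumerate(row) if j != i and d < row[i])
        for k in range(rank - 1, max_rank):
            hits[k] += 1
    return [100.0 * h / n for h in hits]


toy_distances = [
    [0.1, 0.5, 0.9],  # true match is closest: rank 1
    [0.4, 0.3, 0.2],  # one gallery item is closer: rank 2
    [0.7, 0.2, 0.6],  # one gallery item is closer: rank 2
]
cmc = cmc_curve(toy_distances, max_rank=3)
```

In this toy case one of three probes is matched at rank 1 and all three are matched within rank 2.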

| Method | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|
| Ours | 64 | 92 | 96 | 98 |
| Xu et al. [3] | 62 | 86 | 94 | 98 |
| Zhou et al. [4] | 55.2 | 86.5 | - | 97.0 |
| McLaughlin et al. [5] | 58 | 84 | 91 | 96 |
| Yan et al. [21] | 49 | 77 | 85 | 92 |

Table 1: Performance comparison of our proposed method with other state-of-the-art methods on the iLIDS-VID dataset in terms of CMC (%) ranking metric.

| Method | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|
| Ours | 90 | 98 | 98 | 99 |
| Xu et al. [3] | 77 | 95 | 99 | 99 |
| Zhou et al. [4] | 79.4 | 94.4 | - | 99.3 |
| McLaughlin et al. [5] | 70 | 90 | 95 | 97 |
| Yan et al. [21] | 64 | 86 | 93 | 98 |

Table 2: Performance comparison of our proposed method with other state-of-the-art methods on the PRID-2011 dataset in terms of CMC (%) ranking metric.

| Method | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|
| Ours | 87 | 97 | 98 | 100 |
| Zhang et al. [20] | 85.6 | 97 | 98.3 | 99.6 |
| RNN [5] | 75.0 | 86.7 | - | 90.8 |
| STA+KISSME [19] | 73.3 | 92.7 | 95.3 | 96.0 |
| Liu et al. [19] | 62.0 | 81.3 | - | 92.7 |

Table 3: Performance comparison of our proposed method with other state-of-the-art methods on the SDU-VID dataset in terms of CMC (%) ranking metric.

4.3 Cross-Dataset Testing

In this section, we perform cross-dataset testing to further test the generalizability of our method. A system usually performs better when it is trained and tested on the same dataset due to dataset bias. But in real-world applications, test data are usually very different from the data used during training. To estimate the real-world performance of a system, a better way is to do cross-dataset testing, where we train the model on one dataset and test it on a completely different dataset. Following previous work [5], we use 50% of the larger and more challenging iLIDS-VID dataset for training our network, and use 50% of the PRID-2011 dataset for testing. For this type of cross-dataset testing, previous methods [5, 3] have used two different settings: single-shot re-identification and multi-shot re-identification. In single-shot re-identification, only one frame of a video is used. In our case, we cannot apply the single-shot setting, as our method generates attention scores based on frame features and later combines frame-level feature vectors to produce the video-level features. If we only use one image, the method is equivalent to simply rescaling a frame-based feature vector by a constant and then using it for re-identification. We compare our results with other methods in the multi-shot cross-dataset testing scenario in Table 4. Again, our method outperforms other methods in terms of CMC ranking accuracy.
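The single-frame argument can be made explicit with a short derivation. It is a sketch that assumes the final normalization is an L2 normalization (the text only says "normalization"). Here $\lambda_1$ and $z_1$ are the attention score and the feature of the only frame, and $f_{pool}$ is a label introduced for the output of the regular temporal pooling, which equals $z_1$ when there is a single frame.

```latex
\begin{align}
f_{att} &= \lambda_1 z_1, \qquad f_{pool} = z_1 \\
f &= \tfrac{1}{2}\left(f_{att} + f_{pool}\right) = \tfrac{1 + \lambda_1}{2}\, z_1 \\
\frac{f}{\|f\|_2} &= \frac{z_1}{\|z_1\|_2}
  \quad \text{since } \tfrac{1 + \lambda_1}{2} > 0
\end{align}
```

Under these assumptions the attention score has no effect on the normalized feature of a single frame, so the comparison reduces to the distance between normalized frame-level features.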

| Method | Trained on | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|---|---|---|---|---|---|
| Ours | iLIDS-VID | 32 | 60 | 72 | 86 |
| [3] | iLIDS-VID | 30 | 58 | 71 | 85 |
| [5] | iLIDS-VID | 28 | 57 | 69 | 81 |

Table 4: CMC rank accuracy (%) using cross-dataset testing (with multi-shot re-identification) on the PRID-2011 dataset. The model is trained on the iLIDS-VID dataset.

5 Conclusion

In this paper, we have proposed a temporal attention approach for video-based person re-identification. The novelty of our model is that we use a fully convolutional model for generating the temporal attentions. Fully convolutional networks (FCNs) have been widely used to produce 2D outputs (e.g. in semantic segmentation). Our proposed method modifies the traditional FCN to produce a 1D output (i.e. temporal attentions). Through extensive experiments, we have demonstrated that the proposed method outperforms existing state-of-the-art video-based person re-identification methods.

Acknowledgments: The authors acknowledge financial support from NSERC, MGS and UMGF funding. We also thank NVIDIA for donating some of the GPUs used in this work.


  • [1] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in CVPR, 2014.
  • [2] Ejaz Ahmed, Michael Jones, and Tim K Marks, “An improved deep learning architecture for person re-identification,” in CVPR, 2015, pp. 3908–3916.
  • [3] Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou, “Jointly attentive spatial-temporal pooling networks for video-based person re-identification,” in ICCV, 2017.
  • [4] Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification,” in CVPR, 2017.
  • [5] Niall McLaughlin, Jesus Martinez del Rincon, and Paul Miller, “Recurrent convolutional neural network for video-based person re-identification,” in CVPR, 2016.
  • [6] Shaojie Bai, J Zico Kolter, and Vladlen Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [7] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin, “Convolutional sequence to sequence learning,” ICML, 2017.
  • [8] Hanxiao Wang, Shaogang Gong, and Tao Xiang, “Unsupervised learning of generative topic saliency for person re-identification,” in BMVC, 2014.
  • [9] Rui Zhao, Wanli Ouyang, and Xiaogang Wang, “Unsupervised salience learning for person re-identification,” in CVPR, 2013.
  • [10] Douglas Gray and Hai Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in ECCV, 2008.
  • [11] Damien Simonnet, Michal Lewandowski, Sergio A Velastin, James Orwell, and Esin Turkbeyler, “Re-identification of pedestrians in crowds using dynamic time warping,” in ECCV, 2012.
  • [12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
  • [13] Wenpeng Yin, Hinrich Schutze, Bing Xiang, and Bowen Zhou, “ABCNN: Attention-based convolutional neural networks for modeling sentence pairs,” TACL, 2016.
  • [14] Bruce D Lucas and Takeo Kanade, “An iterative image registration technique with an application to stereo vision,” IJCAI, 1981.
  • [15] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [16] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang, “Deeply-learned part-aligned representations for person re-identification,” in CVPR, 2017.
  • [17] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang, “Person re-identification by video ranking,” in ECCV, 2014.
  • [18] Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof, “Person re-identification by descriptive and discriminative classification,” in SCIA, 2011.
  • [19] Kan Liu, Bingpeng Ma, Wei Zhang, and Rui Huang, “A spatio-temporal appearance representation for video-based pedestrian re-identification,” in ICCV, 2015.
  • [20] Wei Zhang, Xiaodong Yu, and Xuanyu He, “Learning bidirectional temporal cues for video-based person re-identification,” TCSVT, 2017.
  • [21] Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang, “Person re-identification via recurrent feature aggregation,” in ECCV, 2016.