Rethinking Temporal Fusion for Video-based Person Re-identification on Semantic and Time Aspect

11/28/2019 ∙ by Xinyang Jiang, et al. ∙ Tencent

Recently, research interest in person re-identification (ReID) has gradually turned to video-based methods, which acquire a person representation by aggregating the frame features of an entire video. However, existing video-based ReID methods do not consider the semantic differences between the outputs of different network stages, which potentially compromises the information richness of the person features. Furthermore, traditional methods ignore important relationships among frames, which causes information redundancy in fusion along the time axis. To address these issues, we propose a novel general temporal fusion framework that aggregates frame features on both the semantic aspect and the time aspect. On the semantic aspect, a multi-stage fusion network is explored to fuse richer frame features at multiple semantic levels, which effectively reduces the information loss caused by traditional single-stage fusion. On the time aspect, the existing intra-frame attention method is improved by adding a novel inter-frame attention module, which effectively reduces the information redundancy in temporal fusion by taking the relationships among frames into consideration. The experimental results show that our approach effectively improves video-based re-identification accuracy and achieves state-of-the-art performance.





Figure 1: Illustration of temporal fusion. The cuboids represent features of different temporal and semantic levels. The time axis (vertical) represents the image frames in chronological order. The semantic axis (horizontal) represents the frame feature maps extracted from different stages of a CNN, from low-level local feature to high-level semantics. We conduct temporal and semantic fusion to select and weight features from different time and different semantic levels and aggregate the selected features.

Person re-identification (ReID) is an important technology for matching images of pedestrians across different, non-overlapping cameras. During the past few years, person re-identification has drawn increasing attention due to its wide applications in surveillance, tracking, smart retail, etc. Unlike standard image-based re-identification approaches, video-based re-identification directly takes a video/tracklet (i.e., a sequence of images) as input and learns a feature that represents the entire tracklet in an end-to-end fashion. This captures more information from the multiple frames in the tracklet, such as temporal cues, varying views, and poses.

One of the key problems in video-based re-identification is temporal fusion, which aggregates the features from each time frame into a comprehensive representation of the tracklet. This paper rethinks the temporal fusion problem along the dimensions of time and semantics, and proposes a unified temporal fusion framework based on temporal and semantic attention. As shown in Figure 1, we model the temporal fusion process in a rectangular coordinate system. Frame features are first extracted from different stages of a CNN-based network (the outputs of different layers in the CNN). The temporal fusion of frame features is performed on the time axis, selectively aggregating the features of different frames, while the semantic fusion aggregates the temporally fused features that are output by different stages of the CNN-based network into a comprehensive tracklet feature. The objective of temporal fusion is to select and weight distinctive frame features, while semantic fusion aggregates tracklet features of different semantic scope. In summary, the feature fusion is conducted on two key aspects: time and semantics.

On the semantic axis, aggregating frame features at different stages captures semantic information from different levels. Fusing frame information at an early stage yields richer temporal information in the low-level structural information, while aggregating frame information at a late stage yields more information in the high-level semantics. Thus, a good temporal fusion method should be able to aggregate frame information at multiple semantic levels, i.e., to fuse frame feature-maps at multiple CNN stages. As shown in Figure 1, the feature maps of four network stages are fused with different importance weights on the semantic axis.

Most existing video-based ReID methods aggregate frame features at a single stage, such as the late fusion ReID methods in [38, 18] and the early fusion methods in video classification and action recognition [28, 12]. We propose a novel multi-stage fusion method to fuse feature-maps at multiple semantic levels. Furthermore, a novel semantic attention module is proposed to adaptively assign importance weights to the different semantic levels based on the content of the tracklets.

On the time axis, due to the visual similarity between consecutive frames, a tracklet usually contains a large amount of redundant information, which can cause redundant and unimportant frames to receive large importance weights. A good temporal fusion method should be able to select the important frames on the temporal axis while giving the redundant frames lower weights. As shown in Figure 1, the frames of the first two rows have a similar visual appearance, so they are assigned lower attention weights, while the third frame, with a side view, is assigned a higher weight.

State-of-the-art video-based ReID approaches use attention to assign different weights to different time frames during temporal fusion [2, 15]. Most existing attention-based approaches obtain a frame's attention from its own content alone and do not consider the relationships among frames, so they cannot lower the importance weights of redundant frames or reward frames with distinct features. In this paper, we propose an inter-frame attention method that obtains attention weights based on a frame's relationship with the others.

In conclusion, our goal is to design a temporal fusion method for video-based ReID that has low information loss on the semantic aspect and low information redundancy on the temporal aspect. To achieve this goal, we propose a multi-stage fusion framework to select appropriate frame features along both the semantic and time axes. Our contributions are as follows. 1) On the semantic aspect, we propose a novel multi-stage fusion method that uses a semantic-level attention module to select and fuse appropriate features from all semantic levels. 2) On the temporal aspect, we propose a novel intra/inter-frame attention method, which is the first attention-based fusion method to consider the inter-relationship among frames.

We verify the effectiveness of our proposed method on three public datasets. The experiments show the multi-stage fusion framework and the intra/inter-frame attention can effectively improve the performance of video-based re-identification.

The rest of the paper is organized as follows. In section 2 we give a brief overview of the existing video-based ReID methods. Then we elaborate our proposed multi-stage fusion framework and the intra/inter-frame attention approach in section 3. In section 4 we report our experiment results.

Figure 2: Illustration of Multi-stage Temporal Fusion with intra/inter-frame attention. On the semantic axis, the backbone is broken into three parts. First, on the time axis, each of their outputs is fused using a weighted average with the importance weights from the intra/inter-frame attention module. Then the fused feature-maps are encoded into tracklet features at different semantic levels. Finally, on the semantic axis, these tracklet features are fused into the final tracklet feature using a weighted average with the importance weights from the semantic attention module.

Related Works

Derived from multi-camera tracking [10], person re-identification as an independent computer vision task was first proposed in [6]. Through years of progress in the community's understanding of the topic [37] and the development of some fundamental techniques, person ReID has grown into an active research domain with promising real-world applications.

Image-based person ReID. Since the problem can be simplified as finding the most similar image in the database given a query image, two components play key roles in the process: pedestrian description [6] and distance metric learning [34, 13]. As CNN-based models prevail, two types of models are commonly deployed. The first type treats the problem as image classification [1], while the second uses a Siamese model that takes image pairs [24] or triplets [26, 7] as input. More recently, explicitly leveraging local image stripes [30] or implicitly using attention schemes has shown decent results [31][27] on many datasets.

Temporal attention for video-based person ReID. Unlike the image-based task, video-based analysis [12, 28] takes a sequence of images as input and can thus leverage the extra temporal information. In video-based person ReID, attention models highlight informative frames by assigning them higher scores. Zhou et al. [38] combine spatial and temporal information to jointly learn features and metrics. Liu et al. [20] propose a multi-directional attention module to exploit the global and local contents for image-based person ReID. Li et al. [15] and Fu et al. [4] propose spatial-temporal attention models that automatically discover a diverse set of distinctive body parts. VRSTC [9] computes attention weights based on patches in adjacent frames and thus considers the relationship between neighboring frames, whereas our method considers the relationships among all frames and uses a Relation Network to mine complex relationship features instead of a simple predefined feature similarity. In this paper, we propose using attention on both the time (frame-level) axis and the semantic axis (model stages), and design a novel intra/inter-frame attention module.

Feature aggregation for video-based person ReID. Fusing features at an early or late stage with average or max pooling to enhance the expressive power of the features is widely used in video analysis [12, 29]. Recurrent neural networks have also been used to integrate the surrounding information [22, 38, 33]. K L et al. [23] perform operator-in-the-loop feature fusion from multiple camera images for person re-identification. Johnson et al. [11] fuse handcrafted and deep features to complement the global body features. In this work, we propose a novel multi-stage fusion framework that integrates features from multiple stages into a two-branch structure.


We propose a new video-based person re-identification model with a novel temporal fusion method that has advantages on both the temporal and semantic aspects. To reduce the information redundancy, we propose an intra/inter-frame attention model. To reduce the information loss on the semantic aspect, we propose a multi-stage fusion structure.

We first briefly introduce the general structure of the proposed method. As shown in Figure 2, the input of a video-based person re-identification model is an image sequence containing a certain person, called a tracklet. Our framework contains multiple branches, each of which performs image-level fusion on the time axis at a different network stage, from early to late. Then, on the semantic axis, the fused tracklet features of all the branches are fused together to obtain the final feature of the tracklet. We use a novel intra/inter-frame attention module to obtain the importance weights for fusion on the time axis, and a semantic attention module to obtain the importance weights for fusion on the semantic axis.

Figure 3: Inter-Frame Attention Module with Relation Network

Time Aspect: Intra/Inter-Frame Attention

On the time axis, we need to conduct an image-level fusion to merge the features of all the images in a tracklet into one tracklet feature. In order to filter out redundant frames and emphasize important frames, we propose a novel intra/inter-frame attention method.

To fuse image features in a person tracklet, many existing methods adopt average or max pooling across the frames. However, as quality and content vary drastically across frames, it is essential to weaken the impact of noisy, low-quality frames and strengthen the impact of high-quality, informative frames. Thus, each frame should be assigned an importance weight in temporal fusion. We call this kind of fusion method an attention-based method. Given a tracklet with $N$ frames whose frame features are $f_1, \dots, f_N$, to get a fused tracklet feature denoted as $\tilde{f}$, attention-based approaches adopt a weighted average pooling operation to fuse the image features:

$$\tilde{f} = \sum_{i=1}^{N} a_i f_i, \tag{1}$$

where the attention weights $a_i$ are normalized over the tracklet. In this paper, the attention weight $a_i$ of the $i$-th frame is the average of its intra-frame attention $a_i^{\text{intra}}$ and its inter-frame attention $a_i^{\text{inter}}$:

$$a_i = \frac{1}{2}\left(a_i^{\text{intra}} + a_i^{\text{inter}}\right). \tag{2}$$

In the following subsections, we elaborate the proposed intra/inter-frame attention method and how intra-frame and inter-frame attention is computed.
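As a rough illustration of the attention-based temporal fusion described above, the following NumPy sketch averages per-frame intra- and inter-frame scores and uses the result as weights in a weighted average of frame features. The function name `fuse_tracklet`, the toy inputs, and the choice of sum-normalization are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def fuse_tracklet(frame_feats, intra, inter):
    """Weighted-average temporal fusion of per-frame features.

    frame_feats: (N, D) array, one row per frame.
    intra, inter: (N,) non-negative attention scores per frame.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    intra = np.asarray(intra, dtype=float)
    inter = np.asarray(inter, dtype=float)
    # The combined attention is the plain average of the two scores.
    weights = 0.5 * (intra + inter)
    # Assumption: normalize by the sum so the weights add up to one.
    weights = weights / weights.sum()
    # Weighted average over the time axis gives one tracklet feature.
    return weights @ frame_feats, weights

# Toy tracklet: two near-duplicate frames and one distinct frame.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
fused, w = fuse_tracklet(feats, intra=[0.5, 0.5, 0.5], inter=[0.1, 0.1, 0.7])
print(w)      # normalized per-frame weights
print(fused)  # fused tracklet feature
```

In this toy case the distinct third frame receives half of the total weight, while the two redundant frames share the other half.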

Intra-frame Attention

Most existing attention-based approaches obtain the attention of a frame based on its own quality and content (e.g., resolution, occlusion, camera angle). We call this type of attention intra-frame attention.

Our implementation of intra-frame attention is as follows. Given a video tracklet containing $N$ frames, we first use a frame-level backbone network to extract a feature for each frame $I_i$, denoted as $f_i$. Then, a binary regressor, denoted here as $\phi$, is used to predict an importance score for each frame:

$$a_i^{\text{intra}} = \phi(f_i). \tag{3}$$

Inter-frame Attention

To focus more on frames with distinct features and reduce redundancy, frames with a similar visual appearance should be assigned lower attention weights, while visually distinct frames should be assigned higher attention weights. As a result, the importance of a frame depends not only on its own content, but also on its relationship to and differences from the other frames in the tracklet. We call this kind of relationship-based attention inter-frame attention.

Figure 3 illustrates our proposed relation-network-based inter-frame attention module. As with intra-frame attention, we use the same frame-level backbone network to extract a frame-level feature for each frame in the tracklet. The most straightforward way to obtain the correlation between two frames is to use a predefined measure, such as the Euclidean distance or the cosine distance. Thus, for any frame feature in a tracklet of length $N$, denoted as $f_i$, its inter-frame attention is the mean distance between $f_i$ and every other frame in the tracklet:

$$a_i^{\text{inter}} = \frac{1}{N-1} \sum_{j \neq i} d(f_i, f_j), \tag{4}$$

where $d(\cdot, \cdot)$ is a predefined measure; the Euclidean distance is used in our experiments.
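The distance-based inter-frame attention above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the score of a frame is its mean Euclidean distance to the other frames of the tracklet; the function name `euclidean_inter_attention` and the toy features are hypothetical.

```python
import numpy as np

def euclidean_inter_attention(frame_feats):
    """Mean Euclidean distance from each frame to every other frame."""
    f = np.asarray(frame_feats, dtype=float)
    n = f.shape[0]
    if n < 2:
        raise ValueError("need at least two frames")
    # Pairwise differences via broadcasting: (N, 1, D) - (1, N, D).
    diff = f[:, None, :] - f[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # (N, N), zero diagonal
    # The diagonal is zero, so the row sum only counts the other frames.
    return dist.sum(axis=1) / (n - 1)

# Two identical frames and one frame far away from both.
feats = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
scores = euclidean_inter_attention(feats)
print(scores)
```

The isolated frame obtains the largest score, which matches the intent of giving visually distinct frames more weight than redundant ones.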

Furthermore, we argue that, beyond simple similarity, much richer correlation information is needed for inter-frame attention. Thus, we propose to apply a relation network to obtain a relation embedding for each pair of frames in the tracklet. Given a video tracklet containing $N$ frame features, the inputs of the relation network are generated by concatenating each pair of the $D$-dimensional frame features. Then, the resulting $N^2$ vectors with $2D$ dimensions are embedded into a $K$-dimensional relation space by a multi-layer perceptron. As a result, the relation network (denoted as function $g$) generates a pair-wise relation embedding $r_{ij}$ for each pair of frame features $f_i$, $f_j$, as follows:

$$r_{ij} = g([f_i, f_j]) + g([f_j, f_i]), \tag{5}$$

where $[\cdot, \cdot]$ denotes concatenation. Note that the relation embeddings from both directions are added together in order to obtain a symmetric attention matrix.

The relation embeddings are then reshaped into an $N \times N \times K$ tensor $R$. We feed the tensor into a convolutional layer with one output channel and obtain an $N \times N$ attention matrix $A$. Each element $A_{ij}$ of the matrix indicates the attention weight of the $i$-th frame based on its relationship with the $j$-th frame. The inter-frame attention of the $i$-th frame is the mean of its attention weights with respect to the other frames, which is equivalent to applying a column-wise average operation on the (symmetric) attention matrix:

$$a_i^{\text{inter}} = \frac{1}{N} \sum_{j=1}^{N} A_{ij}, \qquad A = \mathrm{Conv}(R; \theta), \tag{6}$$

where $\theta$ is the parameter of the convolution layer.
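The relation-network variant can be sketched as follows. This is only a forward-pass illustration under several assumptions that the text does not specify: the relation network is taken to be a two-layer perceptron with a ReLU, the convolution is taken to be a 1x1 convolution with a single output channel (which reduces to a dot product over the channel axis), and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

def relation_inter_attention(frame_feats, w1, b1, w2, b2, conv_w, conv_b=0.0):
    """Inter-frame attention from pair-wise relation embeddings."""
    f = np.asarray(frame_feats, dtype=float)
    n = f.shape[0]
    # Build all ordered pairs: pairs[i, j] = concat(f[i], f[j]), shape (N, N, 2D).
    left = np.repeat(f[:, None, :], n, axis=1)
    right = np.repeat(f[None, :, :], n, axis=0)
    pairs = np.concatenate([left, right], axis=-1)
    # Assumed relation network: two-layer perceptron with ReLU.
    hidden = np.maximum(pairs @ w1 + b1, 0.0)
    rel = hidden @ w2 + b2                      # (N, N, K)
    # Add both directions so the embedding tensor is symmetric in (i, j).
    rel = rel + rel.transpose(1, 0, 2)
    # Assumed 1x1 convolution with one output channel: dot product over K.
    attn = rel @ conv_w + conv_b                # (N, N)
    # Average over the other index to get one score per frame.
    return attn.mean(axis=1), attn

rng = np.random.default_rng(0)
N, D, H, K = 4, 8, 16, 5  # hypothetical sizes
feats = rng.normal(size=(N, D))
params = dict(
    w1=rng.normal(size=(2 * D, H)), b1=np.zeros(H),
    w2=rng.normal(size=(H, K)), b2=np.zeros(K),
    conv_w=rng.normal(size=K),
)
inter, attn = relation_inter_attention(feats, **params)
print(attn.shape, inter.shape)
```

Because the two directions are summed before the convolution, the attention matrix is symmetric, so averaging over rows or over columns gives the same per-frame score.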

Semantic Aspect: Multi-Stage

(a) Late Fusion
(b) Early Fusion
(c) Multi-stage Fusion
Figure 4: The illustration of different fusion methods.

Multi-stage Fusion

To reduce the information loss caused by aggregating image features at a single semantic level, we propose to conduct the aforementioned intra/inter-frame fusion at multiple semantic levels. The fused tracklet feature vectors from the multiple semantic levels are then fused once more along the semantic axis.

Traditional temporal fusion methods fuse image-level features at only one stage of the backbone network. As shown in Figure 4 (a), late-stage fusion fuses the image features at the last stage of the network. This ensures that the information within each frame is fully analyzed by a large number of network layers, but the information among the frames of the tracklet is not explored enough. On the other hand, early-stage fusion fuses the image feature-maps at an earlier stage of the network and uses more layers to analyze the temporal information in the whole tracklet, at the cost of insufficient frame-level information extraction, as shown in Figure 4 (b).

Instead of trading off between late fusion and early fusion, we propose a fusion framework that takes advantage of both, namely Multi-Stage Fusion. The temporal fusion of feature maps is conducted at multiple stages, from early to late. For example, in Figure 4 (c), there are four fusion branches, which fuse the output feature-maps of layers at different stages of a backbone network, from early to late. In this paper, we use the same backbone network to encode the tracklet feature-maps into tracklet feature vectors, i.e., each of the four branches feeds its fused tracklet feature-map back into the subsequent layers of the original backbone network to obtain its tracklet feature vector.
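The multi-stage idea of fusing at every stage and feeding each fused result back through the remaining layers can be sketched with a toy backbone. The example below is a simplification under stated assumptions: each stage is a hypothetical linear layer with a ReLU acting on feature vectors rather than spatial feature maps, and the per-frame fusion weights are given as input instead of being produced by an attention module.

```python
import numpy as np

def multi_stage_features(frames, stages, weights):
    """Return one tracklet feature per fusion stage.

    frames: (N, D) per-frame inputs.
    stages: list of callables, applied in order (the toy backbone).
    weights: (N,) temporal fusion weights that sum to one.
    """
    x = np.asarray(frames, dtype=float)
    branch_feats = []
    for s, stage in enumerate(stages):
        x = stage(x)              # per-frame output of stage s
        fused = weights @ x       # temporal fusion at this stage
        # Feed the fused feature through the remaining backbone stages.
        for later in stages[s + 1:]:
            fused = later(fused)
        branch_feats.append(fused)
    return branch_feats

rng = np.random.default_rng(0)
mats = [rng.normal(size=(6, 6)) for _ in range(4)]
# Each toy stage works on both (N, D) batches and single (D,) vectors.
stages = [lambda x, m=m: np.maximum(x @ m, 0.0) for m in mats]
frames = rng.normal(size=(5, 6))
uniform = np.full(5, 1 / 5)
branches = multi_stage_features(frames, stages, uniform)
print(len(branches), branches[0].shape)
```

The last branch coincides with late fusion (all stages run per frame, then fused), while the first branch is the earliest fusion; the branches differ because the stages are nonlinear.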

Semantic Attention Module

In this sub-section, we elaborate on our novel attention-based semantic fusion mechanism. As with fusion on the time axis, we believe that the importance of the different semantic levels depends heavily on the content of the tracklet. We therefore propose a novel semantic attention module that assigns different attention weights to the feature vectors at different semantic levels.

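Our implementation of the semantic attention is as follows. Given video tracklet features from different semantic levels, for each of the branch output features, denoted as $t_i$, we use a softmax classifier to predict an importance score for every branch. For the $i$-th branch feature, the importance weight of the $j$-th branch (denoted as $s_{ij}$) is computed as follows:

$$s_{ij} = \frac{\exp\left(c_j(t_i)\right)}{\sum_{k=1}^{S} \exp\left(c_k(t_i)\right)}, \tag{7}$$

where $c_j(t_i)$ is the classifier output for the $j$-th branch. As a result, the overall importance of the $j$-th branch is:

$$a_j^{\text{sem}} = \sum_{i=1}^{S} s_{ij}. \tag{8}$$

The final tracklet feature $F$ of a tracklet is computed as follows:

$$F = \frac{1}{S} \sum_{j=1}^{S} a_j^{\text{sem}} \, t_j, \tag{9}$$

where $S$ is the number of semantic branches. The pipeline of our method is shown in Algorithm 1.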
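A minimal NumPy sketch of the semantic attention fusion follows. It assumes a single linear softmax classifier shared by all branch features, assumes that all branch features have the same dimension, and averages the per-branch predictions so that the final weights sum to one; these choices, the function names, and the toy inputs are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_fusion(branch_feats, cls_w, cls_b):
    """Fuse S branch features with content-dependent branch weights.

    branch_feats: (S, D) tracklet features, one per semantic level.
    cls_w: (D, S) and cls_b: (S,) parameters of the assumed linear classifier.
    """
    t = np.asarray(branch_feats, dtype=float)
    # scores[i, j]: importance of branch j predicted from branch feature i.
    scores = softmax(t @ cls_w + cls_b)
    # Assumption: average over the predicting branches, so weights sum to one.
    branch_weight = scores.mean(axis=0)
    # Weighted average of the branch features along the semantic axis.
    return branch_weight @ t, branch_weight

# Four toy branch features; a zero classifier yields uniform branch weights.
t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
fused, w = semantic_fusion(t, np.zeros((2, 4)), np.zeros(4))
print(w, fused)
```

With a trained classifier the weights would depend on the tracklet content; the zero-weight case simply reduces to average pooling over the branches.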


Input: Video tracklet containing N frames
Output: Tracklet feature
  for each frame in the tracklet with index i do
     Extract the frame-level feature
     Compute the intra-frame attention with Eq. 3
     Compute the inter-frame attention with Eq. 6
     Compute the image-level attention with Eq. 2
  end for
  Normalize the image-level attention
  for each network stage with index s do
     Obtain the output of the layer at network stage s
     Compute the tracklet-level feature-map with Eq. 1
     Obtain the tracklet feature by feeding the feature-map back into the subsequent layers of the network
     Obtain the semantic attention with Eq. 7 and Eq. 8
  end for
  Obtain the final output with Eq. 9
Algorithm 1 Multi-stage Fusion with Intra/Inter Frame Attention


Datasets and Evaluation Protocol

We evaluate the proposed algorithm on three benchmark datasets: PRID2011 [8], iLIDS-VID [14][21], and MARS [36]. PRID2011 consists of 749 people from two camera views, 178 of whom appear in both cameras. iLIDS-VID consists of 600 tracklets of 300 persons, where each person has two tracklets from different cameras, with lengths of 23 to 192 frames. The MARS dataset contains over 20,000 tracklets of 1261 persons. Each person appears in at least two cameras and has an average of 13.2 tracklets.

For the PRID2011 and iLIDS-VID datasets, following [32], the dataset is randomly split into a training set and a testing set 10 times, each containing half of the data. For PRID2011, only the identities that appear in both cameras are used. The averaged accuracy is computed over the 10 different training/testing splits. For the MARS dataset, we use the training/testing split in [36], where 631 identities are used for training and the rest are used for testing. The re-identification performance is measured by mean average precision (MAP) and rank-n accuracy.
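For readers unfamiliar with the two metrics, the sketch below computes mean average precision and rank-n accuracy from a query-by-gallery distance matrix. It is a simplified illustration: the function name `evaluate_ranking` is hypothetical, and the camera-based filtering of gallery entries that ReID benchmarks usually apply is deliberately omitted.

```python
import numpy as np

def evaluate_ranking(dist, query_ids, gallery_ids, ranks=(1, 5)):
    """Mean average precision and rank-n accuracy from a distance matrix."""
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    hits = {r: 0 for r in ranks}
    average_precisions = []
    for q in range(dist.shape[0]):
        # Sort the gallery from most to least similar to this query.
        order = np.argsort(dist[q], kind="stable")
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue  # queries without a true match are skipped
        hit_positions = np.flatnonzero(matches)
        # Rank-n: is the first correct match within the top n results?
        for r in ranks:
            hits[r] += int(hit_positions[0] < r)
        # Average precision: mean of precision at each correct match.
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    if not average_precisions:
        raise ValueError("no query has a matching gallery entry")
    n_valid = len(average_precisions)
    rank_acc = {r: hits[r] / n_valid for r in ranks}
    return float(np.mean(average_precisions)), rank_acc

# Two queries against three gallery tracklets (toy distances).
dist = np.array([[0.1, 0.5, 0.9],
                 [0.2, 0.8, 0.3]])
mean_ap, rank_acc = evaluate_ranking(dist, query_ids=[1, 2], gallery_ids=[1, 2, 2])
print(mean_ap, rank_acc)
```

In the toy example the first query is matched at rank 1, while the second query finds its matches only at ranks 2 and 3, which lowers both its average precision and the rank-1 accuracy.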

Implementation details

We used a pre-trained ResNet-50 image-based person re-identification network as the backbone network. The image-based ReID model was first trained on image-based ReID training sets including CUHK03 [16], DukeMTMC [25], and Market1501. Then the model was fine-tuned independently on PRID2011, iLIDS-VID, and MARS. After that, we used the trained image-based network as the backbone of the multi-stage fusion framework with attention modules and continued fine-tuning the model in an end-to-end, video-based way. On the semantic axis, we fuse tracklet features from four stages. The input images were resized to 256 × 128. We used the stochastic gradient descent algorithm to update the weights, with a staircase schedule in which the learning rate decayed by a factor of 0.8 every 20 epochs. For each tracklet, the model outputs a 768-dimensional feature vector.
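The staircase learning-rate schedule can be written as a one-line function. The sketch assumes that "decayed 0.8 every 20 epochs" means multiplying the rate by 0.8 at every 20th epoch; the base learning rate of 0.01 is a hypothetical value, since the text does not state it.

```python
def staircase_lr(base_lr, epoch, decay=0.8, step=20):
    """Learning rate held constant within each block of `step` epochs."""
    # Integer division counts how many decay steps have been completed.
    return base_lr * decay ** (epoch // step)

# Hypothetical base learning rate; not given in the text.
schedule = [staircase_lr(0.01, e) for e in (0, 19, 20, 45)]
print(schedule)
```

The rate stays flat within each 20-epoch block and drops only at block boundaries, which is what distinguishes a staircase schedule from a smooth exponential decay.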

Ablation Study

| Methods | MAP | top1 | top5 |
|---|---|---|---|
| Feature Average | 70.3 | 78.1 | 91.4 |
| Early Fusion | 74.9 | 82.2 | 93.6 |
| Late Fusion | 75.4 | 82.5 | 93.5 |
| MS Fusion (Average) | 75.6 | 83.3 | 94.1 |
| MS Fusion (Semantic Attention) | | | |

Table 1: Performance comparisons of different fusion methods on MARS.

| Attention Methods | MAP | top1 | top5 |
|---|---|---|---|
| Average Pooling (late-fusion) | 75.4 | 82.5 | 93.5 |
| Intra-frame | 76.6 | 85.3 | 94.4 |
| Inter-frame (Euclidean) | 75.7 | 83.9 | 94.1 |
| Inter-frame (RN) | 76.8 | 84.2 | 94.2 |
| Intra/Inter-frame (Euclidean) | 79.1 | 84.0 | 94.6 |
| Intra/Inter-frame (RN) | 82.4 | 85.8 | 95.7 |
| Intra/Inter-frame (RN) + multi-stage | 85.2 | 87.1 | 96.8 |

Table 2: Performance comparisons of different attention methods on MARS.

In this sub-section, we verify the effectiveness of our proposed fusion and attention components by ablation study.

To verify the effectiveness of multi-stage fusion, we compare the performance of different fusion methods. Table 1 reports the performance of different fusion approaches on MARS. The following approaches are compared:

  • Feature Average. Training an image-based model and using the average of the image features as the tracklet feature, without any end-to-end video-based training.

  • Early Fusion. Training an end-to-end video-based model with early fusion after the 2nd res-block.

  • Late Fusion. Training an end-to-end video-based model with late fusion after the 4th res-block.

  • Multi-Stage (MS) Fusion (Average). Tracklet features from multiple semantic stages are fused with average pooling.

  • Multi-Stage (MS) Fusion with Semantic Attention. Tracklet features from multiple semantic stages are fused with semantic attention.


| Methods | Fusion Type | Attention Type | PRID top1 | i-LIDS-VID top1 | MARS MAP | MARS top1 |
|---|---|---|---|---|---|---|
| See-Forest [38] | Late | NA | 79.4 | 55.2 | 50.7 | 70.6 |
| AMOC+epicFlow [18] | Late | NA | 83.7 | 68.7 | 52.9 | 68.3 |
| Spatial-temporal [15] | Late | Intra-Frame | 93.2 | 80.2 | 65.8 | 82.3 |
| LSTM [5] | Late | NA | - | - | 73.9 | 81.6 |
| Non-local [17] | Multistage | NA | 91.2 | 81.3 | 77.0 | 84.3 |
| STA [4] | Late | Intra-Frame | - | - | 80.4 | 85.5 |
| Attribute Driven [35] | Late | Intra-Frame Attention | 93.9 | 86.3 | 78.2 | 87.0 |
| VRSTC [9] | Late | Inter-Frame | - | 83.4 | 82.3 | |
| Ours | Multistage | Intra-Inter-Frame / Semantic | | | | 87.1 |

Table 3: The comparisons of our method to the state-of-the-art methods on PRID2011, iLIDS-VID, and MARS datasets.
Figure 5: Feature-maps computed by early fusion, late fusion and multi-stage fusion.
Figure 6: Example of intra/inter-frame attention (left) and inter-frame attention (right) for same tracklet.

From Table 1, we can make the following observations. 1) Compared to the image-based model, all end-to-end temporal fusion methods achieve higher performance, which verifies the advantage of adopting an end-to-end video-based approach. 2) The multi-stage fusion methods outperform the single-stage early fusion and late fusion methods, showing that taking advantage of multi-level semantics from both early and late stages boosts video ReID performance. 3) Adding semantic attention improves the multi-stage fusion by 2 percent, which supports the benefit of adaptively assigning importance weights to the different stages.

To verify the effectiveness of the intra/inter-frame attention module, we compare the performance of different attention methods. Table 2 shows the performance comparison of the different attention-based methods. We compare the following attention approaches:

  • Average Pooling. A late-fusion model that uses image features with average pooling. Equal attention weights are assigned to every frame.

  • Intra-frame Attention. A late-fusion model with an intra-frame attention module.

  • Inter-frame Attention (Euclidean). The Euclidean-distance-based inter-frame attention module (Eq. (4)) is integrated into the late fusion baseline.

  • Inter-frame Attention (RN). A Relation Network based attention module (Eq. (6)) is added to the baseline.

  • Intra/Inter-frame Attention (Euclidean). We add the Euclidean-distance-based inter-frame attention module to the intra-frame attention baseline.

  • Intra/Inter-frame Attention (RN). We add an RN-based inter-frame attention module to the intra-frame attention baseline.

  • Intra/Inter-frame (RN) + Multi-Stage. Image-level fusion with the intra/inter-frame attention module is applied at multiple stages, and then multi-level semantic fusion is applied with the semantic attention module.

Pooling vs. Attention. From Table 2, we observe that the attention-based approaches perform better than plain average pooling, which verifies the effectiveness of adding attention weights during fusion.

Different Attention Approaches. From Table 2, we observe that adding an extra inter-frame attention module outperforms the intra-frame attention module alone by 2 to 3 percentage points in terms of MAP, which verifies that obtaining attention based on the relationships among frames effectively boosts the performance of video-based ReID. Furthermore, using only the inter-frame attention module cannot achieve performance as high as using both inter- and intra-frame attention. These results indicate that although the relation network in the inter-frame attention can discover the relationships among frames, its ability to discover the information within a single frame is limited. Therefore, an intra/inter-frame two-branch structure is necessary to make up for this disadvantage.

RN vs. Predefined Correlation. Compared to a predefined cross correlation (i.e., the Euclidean distance in Table 2), we observe that applying a Relation Network in inter-frame attention achieves better performance. This is because the RN is able to mine more complex relations among frames by using deep neural structures rather than a simple similarity between feature vectors, which is essential for inter-frame attention.

Time Axis vs. Semantic Axis. After adding multi-stage fusion to the intra/inter-frame attention based fusion method, we achieve the best performance (i.e., Intra/Inter-frame (RN) + multi-stage in Table 2), which verifies the effectiveness of conducting feature selection and fusion on both the time and semantic axes.

To demonstrate the advantage of multi-stage fusion, we visualize the feature maps from different semantic levels in Figure 5. Feature maps fused at the earlier stage tend to be sparser and focus more on distinct structural information, while feature maps from the later stage tend to respond strongly to a wider range of structures. Compared to using only the feature from the late stage, fusing it with the feature map from an earlier stage enables the model to represent the tracklet more comprehensively. Take the first case, the woman in white, as an example: while the late fusion feature map does not respond strongly to the umbrella, a distinctive feature of the tracklet, the feature map from the early stage captures the structural information of the umbrella and helps the fused feature map to better represent the tracklet.

Figure 6 shows an example of a tracklet and its attention weights computed by the inter-frame attention module and the intra-frame attention module. We observe that the inter-frame attention module assigns a higher weight to the third frame of the tracklet because it is the only side-view frame in the tracklet and is therefore more distinct. On the other hand, intra-frame attention assigns similar weights to every frame. This example shows that inter-frame attention is able to lower the importance of redundant frames and increase the weight of frames with distinct visual features.

Comparison with the State-of-the-Art Method

In Table 3, we compare our approach with state-of-the-art temporal fusion methods for video ReID. All the compared approaches use only video tracklets as inputs, and no re-ranking or multi-query strategy is used in post-processing.

As shown in Table 3, we compare our method with state-of-the-art approaches that adopt various ways of feature fusion and attention generation. We compare our method with SOTA late fusion methods including See-Forest [38], AMOC+epicFlow [18], and [5], which uses an LSTM to fuse image features. We also compare our method with SOTA attention-based methods ([3], [19], and STA [4]), which use both temporal and spatial attention, and with an attribute-driven method [35] that includes extra body attributes to boost the accuracy of attention prediction. Non-local [17] does not use late fusion, but adopts a 3D CNN with a non-local attention module. Table 3 shows that our approach outperforms the state-of-the-art approaches on PRID2011 and i-LIDS-VID, and achieves the highest MAP score on MARS. Note that, on the MARS dataset, VRSTC achieves a top-1 accuracy 1 percentage point higher than our method, probably because it also considers the inter-frame relationship between adjacent frames and uses extra spatial attention to obtain attention weights on local patches. Although it does not apply spatial attention, our method outperforms VRSTC by 2.9 percentage points in terms of MAP because 1) our inter-frame attention considers the relationships among all possible frame pairs rather than only adjacent frames, and 2) we use a Relation Network to discover complex relationship features instead of a predefined similarity measure.


This paper proposes a novel temporal fusion method for video-based re-identification. We propose a general temporal fusion method that automatically selects features on both the semantic and time aspects. On the semantic aspect, our method aggregates feature maps at multiple semantic levels with a multi-branch structure. On the time aspect, we propose an intra/inter-frame attention module that takes the relationships between frames into consideration. The experiments verify the effectiveness of our two novel model components, and our approach achieves state-of-the-art performance on video-based re-identification benchmarks.


  • [1] E. Ahmed, M. Jones, and T. K. Marks (2015)

    An improved deep learning architecture for person re-identification

    In CVPR, pp. 3908–3916. Cited by: Related Works.
  • [2] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang (2018) Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pp. 1169–1178. Cited by: Introduction.
  • [3] Y. Fu, X. Wang, Y. Wei, and T. Huang (2018) STA: spatial-temporal attention for large-scale video-based person re-identification. arXiv preprint arXiv:1811.04129. Cited by: Comparison with the State-of-the-Art Method.
  • [4] Y. Fu, X. Wang, Y. Wei, and T. Huang (2019) STA: spatial-temporal attention for large-scale video-based person re-identification. In AAAI, Cited by: Related Works, Comparison with the State-of-the-Art Method, Table 3.
  • [5] J. Gao and R. Nevatia (2018) Revisiting temporal modeling for video-based person reid. arXiv preprint arXiv:1805.02104. Cited by: Comparison with the State-of-the-Art Method, Table 3.
  • [6] N. Gheissari, T. B. Sebastian, and R. Hartley (2006) Person reidentification using spatiotemporal appearance. In CVPR, Vol. 2, pp. 1528–1535. Cited by: Related Works, Related Works.
  • [7] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: Related Works.
  • [8] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof (2011) Person re-identification by descriptive and discriminative classification. In Scandinavian conference on Image analysis, pp. 91–102. Cited by: Datasets and Evaluation Protocol.
  • [9] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen (2019) VRSTC: occlusion-free video person re-identification. In CVPR, pp. 7183–7192. Cited by: Related Works, Table 3.
  • [10] T. Huang and S. Russell (1997) Object identification in a bayesian context. In IJCAI, pp. 1276–1282. External Links: ISBN 1-555860-480-4 Cited by: Related Works.
  • [11] J. Johnson, S. Yasugi, Y. Sugino, S. Pranata, and S. Shen (2018) Person re-identification with fusion of hand-crafted and deep pose-based body region features. arXiv preprint arXiv:1803.10630. Cited by: Related Works.
  • [12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In CVPR, pp. 1725–1732. Cited by: Introduction, Related Works, Related Works.
  • [13] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. In CVPR, pp. 2288–2295. Cited by: Related Works.
  • [14] M. Li, X. Zhu, and S. Gong (2018) Unsupervised person re-identification by deep learning tracklet association. In ECCV, pp. 737–753. Cited by: Datasets and Evaluation Protocol.
  • [15] S. Li, S. Bak, P. Carr, and X. Wang (2018) Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pp. 369–378. Cited by: Introduction, Related Works, Table 3.
  • [16] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In CVPR, pp. 152–159. Cited by: Implementation details.
  • [17] X. Liao, L. He, and Z. Yang (2018) Video-based person re-identification via 3d convolutional networks and non-local attention. arXiv preprint arXiv:1807.05073. Cited by: Comparison with the State-of-the-Art Method, Table 3.
  • [18] H. Liu, Z. Jie, K. Jayashree, M. Qi, J. Jiang, S. Yan, and J. Feng (2018) Video-based person re-identification with accumulative motion context. t-CSVT 28 (10), pp. 2788–2802. Cited by: Introduction, Comparison with the State-of-the-Art Method, Table 3.
  • [19] K. Liu, B. Ma, W. Zhang, and R. Huang (2015) A spatio-temporal appearance representation for viceo-based pedestrian re-identification. In ICCV, pp. 3810–3818. Cited by: Comparison with the State-of-the-Art Method.
  • [20] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang (2017) Hydraplus-net: attentive deep features for pedestrian analysis. In ICCV, pp. 350–359. Cited by: Related Works.
  • [21] X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K. Lam, and Y. Zhong (2017) Person re-identification by unsupervised video matching. Pattern Recognition 65, pp. 197–210. Cited by: Datasets and Evaluation Protocol.
  • [22] N. McLaughlin, J. Martinez del Rincon, and P. Miller (2016) Recurrent convolutional network for video-based person re-identification. In CVPR, pp. 1325–1334. Cited by: Related Works.
  • [23] N. Murthy, R. K. Sarvadevabhatla, R. V. Babu, and A. Chakraborty (2018) Deep sequential multi-camera feature fusion for person re-identification. arXiv preprint arXiv:1807.07295. Cited by: Related Works.
  • [24] F. Radenović, G. Tolias, and O. Chum (2016) CNN image retrieval learns from bow: unsupervised fine-tuning with hard examples. In ECCV, pp. 3–20. Cited by: Related Works.
  • [25] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pp. 17–35. Cited by: Implementation details.
  • [26] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823. Cited by: Related Works.
  • [27] J. Si, H. Zhang, C. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang (2018) Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, pp. 5363–5372. Cited by: Related Works.
  • [28] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: Introduction, Related Works.
  • [29] C. G. Snoek, M. Worring, and A. W. Smeulders (2005) Early versus late fusion in semantic video analysis. In ACM-MM, pp. 399–402. Cited by: Related Works.
  • [30] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, pp. 480–496. Cited by: Related Works.
  • [31] C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang (2018) Mancs: a multi-task attentional network with curriculum sampling for person re-identification. In ECCV, pp. 365–381. Cited by: Related Works.
  • [32] T. Wang, S. Gong, X. Zhu, and S. Wang (2014) Person re-identification by video ranking. In ECCV, pp. 688–703. Cited by: Datasets and Evaluation Protocol.
  • [33] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou (2017) Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pp. 4733–4742. Cited by: Related Works.
  • [34] L. Yang and R. Jin (2006) Distance metric learning: a comprehensive survey. Michigan State University 2 (2), pp. 4. Cited by: Related Works.
  • [35] Y. Zhao, X. Shen, Z. Jin, H. Lu, and X. Hua (2019) Attribute-driven feature disentangling and temporal aggregation for video person re-identification. In CVPR, pp. 4913–4922. Cited by: Comparison with the State-of-the-Art Method, Table 3.
  • [36] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) MARS: a video benchmark for large-scale person re-identification. In ECCV, Cited by: Datasets and Evaluation Protocol, Datasets and Evaluation Protocol.
  • [37] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: Related Works.
  • [38] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan (2017) See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pp. 4747–4756. Cited by: Introduction, Related Works, Related Works, Comparison with the State-of-the-Art Method, Table 3.