A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification

04/05/2021, by Xuehu Liu, et al.

Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras. Previous methods usually focus on limited views, such as the spatial, temporal or spatial-temporal view, and lack observations in different feature domains. To capture richer perceptions and extract more comprehensive video representations, in this paper we propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we design a trigeminal feature extractor to jointly transform raw video data into the spatial, temporal and spatial-temporal domains. Besides, inspired by the great success of vision transformers, we introduce the transformer structure to video-based person Re-ID. In our work, three self-view transformers are proposed to exploit the relationships between local features for information enhancement in the spatial, temporal and spatial-temporal domains. Moreover, a cross-view transformer is proposed to aggregate the multi-view features into comprehensive video representations. The experimental results indicate that our approach achieves better performance than other state-of-the-art approaches on public Re-ID benchmarks. We will release the code for model reproduction.




1 Introduction

Person re-identification (Re-ID) aims to retrieve given pedestrians across different times and places. Recently, due to the needs of safe communities, intelligent surveillance and criminal investigation, this task has become a hot research topic. Meanwhile, with the widespread deployment of video surveillance in cities and the massive amount of pedestrian video data, researchers are paying more and more attention to video-based person Re-ID. Similar to image-based person Re-ID, video-based person Re-ID also faces many challenges, such as illumination changes, viewpoint differences, complicated backgrounds and person occlusions. Different from image-based person Re-ID, video-based person Re-ID utilizes multiple image frames rather than a single image as input, which contains additional motion cues, pose variations and multi-view observations. Although this information is conducive to pedestrian recognition, it also brings more noise and misalignment. Thus, how to fully utilize the abundant information in sequences is worthy of research in video-based person Re-ID.

Figure 1: (a), (b) and (c) show previous single-view works. (d) shows the trigeminal views in our work.

In video-based person Re-ID, researchers have explored various methods to process video data. Some previous methods utilize spatial feature extractors to obtain attentive feature maps from the spatial view. Li et al. [22] design a set of diverse spatial attention modules to extract aligned local features across multiple images. Meanwhile, temporal learning networks, such as the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM), are usually designed for discovering temporal relationships in the temporal domain. For example, McLaughlin et al. [26] introduce a recurrent architecture to model temporal cues across frames. Liu et al. [25] propose a refining recurrent unit to integrate video representations. Besides, for spatial-temporal observation, Li et al. [21] construct a multi-scale 3D network to learn multi-scale spatial-temporal cues in video sequences. However, previous methods often focus on a single view, which lacks the multi-view observations from different view domains. There are a thousand Hamlets in a thousand people's eyes: by observing video data from different perspectives, a model can obtain more comprehensive and robust video representations. In our work, we attempt to capture three different observations from the spatial, temporal and spatial-temporal domains simultaneously. The intuitive comparisons are shown in Fig. 1.

Recently, transformers [31, 9, 2] have shown strong representation ability and achieved great success in Natural Language Processing (NLP). Nowadays, researchers are extending transformers to numerous computer-vision applications. Transformers in vision are mainly developing in two directions: pure-transformer methods [10, 15] and “CNN + transformer” methods [29, 3]. Although pure-transformer methods show great potential and are seen as an alternative to CNNs, they are limited by the need for large amounts of pre-training data. Thus, they are difficult to deploy quickly on a variety of visual tasks. The “CNN + transformer” methods retain the powerful spatial feature extraction capability of CNNs while introducing transformers to model the relationships of local features in a high-dimensional space. This has become a popular paradigm for computer vision. Inspired by this, in this work, we combine CNNs and transformers for video-based person Re-ID. Actually, the interactions in transformers help to assign different attention weights to local features in the spatial-temporal domain for better video representations.

In this paper, we propose a novel Trigeminal Transformer (TMT) for video-based person Re-ID. Our method attempts to capture three observations from the spatial, temporal and spatial-temporal domains simultaneously. Then, in each view domain, we explore the relationships among local features. Meanwhile, in order to aggregate multi-view cues and enhance the final representation, cross-view interactions are taken into consideration in our work. Specifically, our framework mainly consists of three key modules. First, we design a trigeminal feature extractor to obtain three different features under the spatial, temporal and spatial-temporal views simultaneously. Second, we introduce three independent transformers as self-view transformers for the different views. Each transformer takes the coarse features as inputs and exploits the relationships between local features for information enhancement. Finally, we propose a novel cross-view transformer, which models the interactions among the features of different views and aggregates them into the final video representation. Based on the above modules, our TMT can not only capture different perceptions in the spatial, temporal and spatial-temporal view domains, but also fully aggregate multi-view information to generate more comprehensive video representations. Extensive experiments on public benchmarks demonstrate that our approach outperforms several state-of-the-art methods.

In summary, our contributions are four-fold:

  • We propose a novel Trigeminal Transformer (TMT) framework for video-based person Re-ID.

  • We design a trigeminal feature extractor to transform raw video data into spatial, temporal and spatial-temporal view domains for different observations.

  • We introduce transformer structures for video-based person Re-ID. Specifically, we propose a self-view transformer to refine single-view features and a cross-view transformer to aggregate multi-view features.

  • Extensive experiments on public benchmarks demonstrate that our framework attains better overall performance than several state-of-the-art methods.

Figure 2: The overall structure of our proposed method. Firstly, the trigeminal feature extractor is utilized to transform raw video data into different view domains individually. Then, the self-view transformers are introduced for feature enhancement. After that, a cross-view transformer is designed to model the interactions between multiple views for comprehensive observations.

2 Related works

2.1 Video-based person re-identification

In recent years, with the rise of deep learning 

[8, 19], person Re-ID has achieved great success and recognition accuracy has improved significantly. Recently, due to the urgent need for video matching in intelligent community systems, video-based person Re-ID has drawn more and more researchers' interest. Compared with static images, videos contain more views worth observing, such as the spatial view, the temporal view and the spatial-temporal view. Thus, in video-based person Re-ID, some existing works [22, 39, 11, 40] concentrate on extracting attentive spatial features in the spatial view. Meanwhile, some methods [26, 7, 25, 17] attempt to obtain temporal observations via temporal learning mechanisms. Besides, some approaches [23, 21, 34, 13] utilize 3D-CNNs to jointly explore spatial-temporal cues. For example, for spatial information, Li et al. [22] extract aligned spatial features across multiple images with diverse spatial attention modules. Zhao et al. [40] disentangle frame-wise features into various attribute-aware representations in the spatial domain. For temporal information, McLaughlin et al. [26] utilize recurrent neural networks across frames for temporal learning. Liu et al. [25] propose a refining recurrent unit to integrate temporal cues frame by frame. For spatial-temporal information, Gu et al. [13] propose an appearance-preserving 3D convolutional network to address appearance destruction and model temporal information. Meanwhile, beyond a single view, Li et al. [21] design a two-stream convolutional network to explicitly leverage spatial and temporal cues. Different from previous methods, in this paper, we design a trigeminal network to transform raw video data into spatial, temporal and spatial-temporal feature spaces for three different views. Besides, the self-view transformer and the cross-view transformer are proposed to fully explore discriminative cues within each self-view domain and to aggregate diverse observations across multi-view domains for final comprehensive and robust representations.

2.2 Transformer in vision

Transformer [31] is initially proposed for NLP tasks and brings significant improvements in many tasks [9, 2]. Inspired by the powerful ability of transformers to handle sequential data, transformer-based vision models are springing up rapidly, showing great potential in vision fields. Researchers have extended transformers to computer vision tasks, such as image and video classification [12, 10, 6], object detection [3], semantic segmentation [33], video inpainting [38] and so on. For example, Girdhar et al. [12] introduce an action transformer model to recognize and localize human behaviors in videos. Carion et al. [3] utilize transformers to redesign an end-to-end object detector. Dosovitskiy et al. [10] propose the Vision Transformer (ViT), which applies a transformer directly to sequences of image patches and achieves promising results. Recently, He et al. [15] utilize a pure transformer with a jigsaw patch module for image-based Re-ID. Compared with existing transformers in vision, our transformers are constructed in the spatial, temporal and spatial-temporal domains for video-based Re-ID. Besides, we propose a novel cross-view transformer to aggregate multi-view cues for comprehensive video representations.

3 Proposed method

In this section, we introduce the proposed Trigeminal Transformers (TMT). We first give an overview of the proposed TMT. Then, we elaborate on the key modules in the following subsections.

3.1 Overview

The TMT is shown in Fig. 2. The overall framework mainly consists of three key modules: the trigeminal feature extractor, the self-view transformers and the cross-view transformer. To begin with, we adopt Restricted Random Sampling (RRS) [22] to generate sequential frames as inputs. Then, we use the designed trigeminal feature extractor, which consists of three non-shared embedding branches for multi-view observations, to transform the raw video data into different high-dimensional spaces. For the extracted initial features in each single-view domain, we utilize a self-view transformer to explore the relationships between local features for information enhancement. Afterwards, the cross-view transformer is proposed to capture the interactions among multi-view features and adaptively aggregate them into comprehensive representations. Finally, to train our model, we introduce the Online Instance Matching (OIM) [35] loss and a verification loss. In the test stage, the feature vectors of the different views are concatenated to build the retrieval list.
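The Restricted Random Sampling step can be illustrated with a short sketch. Following the common formulation of RRS [22], the sequence is divided into T equal-duration chunks and one frame index is sampled at random from each chunk; the function below is an illustrative implementation, not the authors' released code.

```python
import random

def restricted_random_sampling(num_frames, t=8, seed=None):
    """Restricted Random Sampling (RRS): split the sequence into `t`
    equal-length chunks and randomly pick one frame index per chunk."""
    rng = random.Random(seed)
    chunk = num_frames / t
    indices = []
    for i in range(t):
        lo = int(i * chunk)
        hi = max(lo, int((i + 1) * chunk) - 1)
        indices.append(rng.randint(lo, min(hi, num_frames - 1)))
    return indices
```

Sampling one frame per chunk keeps the selected frames spread over the whole sequence while still injecting randomness during training.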

3.2 Trigeminal feature extractor

Multi-stream networks are usually used to handle sequential data in action recognition and video classification [21, 37, 27]. Actually, the success of multi-stream networks can be attributed to assembling multiple observations from different views, which helps to extract more comprehensive representations. Inspired by this, in our work, we design a trigeminal feature extractor for the spatial, temporal and spatial-temporal views.

The structure of our proposed trigeminal feature extractor is shown on the left of Fig. 2. Formally, given a long sequence, we first sample $T$ frames as the inputs of our network, where $T$ is the length of the sampled sequence. In our work, ResNet-50 [14] is used as the basic backbone for frame-wise feature maps. Different from previous multi-stream works, we deploy the parameter-shared shallow residual blocks from conv1 to conv3x of ResNet-50 to reduce network parameters. Three non-shared conv4x blocks are deployed after the shared CNN as embedding networks and output three video feature cubes of the same size $T\times H\times W\times C$, where $H$, $W$ and $C$ represent the height, the width and the number of channels, respectively. To further separate the feature cubes in the high-dimensional space, we propose a self-attention pooling to project the features into the spatial and temporal domains for different views.

Figure 3: The temporal self-attention pooling (a) and the spatial self-attention pooling (b).
Figure 4: The self-view transformer (left) and the cross-view transformer (right) in the temporal domain.

Self-attention pooling. The spatial self-attention pooling and the temporal self-attention pooling have similar structures, as shown in Fig. 3. Given a feature cube $\mathbf{F} \in \mathbb{R}^{T\times H\times W\times C}$, we utilize the temporal self-attention pooling to project $\mathbf{F}$ into the spatial domain. To begin with, a linear projection is applied to each spatial local feature $\mathbf{f}_{h,w} \in \mathbb{R}^{T\times C}$ to generate $\mathbf{q}_{h,w}$:

$$\mathbf{q}_{h,w} = \mathbf{f}_{h,w}\mathbf{W},$$

where $\mathbf{W} \in \mathbb{R}^{C\times C}$ is the parameter of the linear projection. Then, we apply a matrix multiplication between $\mathbf{q}_{h,w}$ and its transposition to generate the self-attention matrix $\mathbf{A}_{h,w} \in \mathbb{R}^{T\times T}$:

$$\mathbf{A}_{h,w} = \mathbf{q}_{h,w}\mathbf{q}_{h,w}^{\top},$$

where $\top$ indicates the transpose operation. After that, we sum the self-attention matrix along one dimension and then perform a softmax operation along the other dimension to infer the temporal attention vector $\mathbf{a}_{h,w} \in \mathbb{R}^{T}$:

$$\mathbf{a}_{h,w} = \mathrm{softmax}\Big(\sum_{j=1}^{T}\mathbf{A}_{h,w}[:,j]\Big).$$

Then, a dot product is applied between each local spatial feature $\mathbf{f}_{h,w}$ and its temporal attention vector $\mathbf{a}_{h,w}$. In this way, we obtain the attentive spatial feature

$$\mathbf{s}_{h,w} = \sum_{t=1}^{T}\mathbf{a}_{h,w}[t]\,\mathbf{f}_{h,w}[t] \in \mathbb{R}^{C}.$$

Finally, the set of local spatial features $\{\mathbf{s}_{h,w}\}$ ($h=1,\dots,H$, $w=1,\dots,W$) forms the output of the temporal self-attention pooling.

The resulting spatial features can be regarded as temporally weighted features, which exploit the relationships across frames. In this way, we pool the feature cube $\mathbf{F} \in \mathbb{R}^{T\times H\times W\times C}$ into $\mathbf{F}_s \in \mathbb{R}^{H\times W\times C}$. Note that the spatial self-attention pooling follows a similar process to the temporal self-attention pooling; the difference lies in the size of the self-attention matrices. In the spatial self-attention pooling, the size of the spatial attention matrix is $HW\times HW$, and the output can be represented by $\mathbf{F}_t \in \mathbb{R}^{T\times C}$. Thus, our trigeminal feature extractor utilizes a partially shared network to project raw video data into independent spatial, temporal and spatial-temporal domains for multi-view observations. After that, self-view transformers and a cross-view transformer are deployed for further global feature enhancement.
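As a concrete illustration of the temporal self-attention pooling described above, the NumPy sketch below projects each spatial location's per-frame features, builds the $T\times T$ self-attention matrix, derives a temporal attention vector via sum-then-softmax, and pools over frames. The exact projection shape and normalization follow our reading of the text and are assumptions.

```python
import numpy as np

def temporal_self_attention_pool(feat, w):
    """Temporal self-attention pooling (a sketch).
    feat: (T, H, W, C) feature cube; w: (C, C) linear projection.
    Returns an attentively pooled spatial map of shape (H, W, C)."""
    T, H, W, C = feat.shape
    x = feat.reshape(T, H * W, C).transpose(1, 0, 2)   # (HW, T, C): per-location frame features
    q = x @ w                                          # linear projection, (HW, T, C)
    attn = q @ q.transpose(0, 2, 1)                    # (HW, T, T) self-attention matrices
    scores = attn.sum(axis=2)                          # sum along one dimension -> (HW, T)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability for softmax
    a = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over T
    pooled = (a[..., None] * x).sum(axis=1)            # weighted sum over frames
    return pooled.reshape(H, W, C)
```

With a constant input the attention is uniform and the pooling reduces to a temporal average, which is a quick sanity check of the implementation.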

3.3 Self-view transformer

Based on the above trigeminal feature extractor, we can easily obtain particular features in different views. However, these features lack global observations in each self-view domain. Recently, transformer models have been proposed to explore the interactions among contextual information and have delivered great success. Actually, the interactions among global features are beneficial for mining more discriminative cues. Inspired by the strong capacity of transformers and their great potential in vision fields, in our work, we introduce the transformer structure into different view domains for feature enhancement. Self-view transformers are multi-layer architectures formed by stacking blocks on top of one another. Each block is composed of a multi-head self-attention layer, a position-wise feed-forward network, layer normalization modules and residual connections. For the temporal domain, the obtained feature $\mathbf{F}_t \in \mathbb{R}^{T\times C}$ from the trigeminal feature extractor is passed into the self-view transformer (the left part of Fig. 4), which can be formulated as follows. First, a positional encoding is added to the original feature for positional information in the temporal dimension; it can be trainable as in [10]. Then, the position-encoded feature $\mathbf{X}$ is passed through the multi-head self-attention layer. In each head $i$, the feature is first fed into three linear transformations to generate $\mathbf{Q}_i$, $\mathbf{K}_i$ and $\mathbf{V}_i \in \mathbb{R}^{T\times d}$, where $d = C/N_h$ and $N_h$ is the number of heads. The self-attention operation is defined as:

$$\mathrm{Att}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) = \mathrm{softmax}\Big(\frac{\mathbf{Q}_i\mathbf{K}_i^{\top}}{\sqrt{d}}\Big)\mathbf{V}_i.$$
The outputs of the multiple heads, $\{\mathrm{Att}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)\}_{i=1}^{N_h}$, are concatenated together as the output of the multi-head self-attention layer, $\mathrm{MHSA}(\mathbf{X})$. The input and output of the multi-head self-attention layer are connected by a residual connection and a normalization layer:

$$\mathbf{X}' = \mathrm{LN}\big(\mathbf{X} + \mathrm{MHSA}(\mathbf{X})\big).$$
The position-wise feed-forward network (FFN) is applied after the multi-head self-attention layer. Specifically, the FFN consists of two linear transformation layers with a ReLU activation function between them, which can be denoted as:

$$\mathrm{FFN}(\mathbf{x}) = \sigma(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2,$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the parameters of the two linear transformation layers and $\sigma$ is the ReLU activation function. The FFN is likewise wrapped with a residual connection and layer normalization.

In the spatial and spatial-temporal views, the features $\mathbf{F}_s$ and $\mathbf{F}_{st}$ obtained from the trigeminal feature extractor are passed through the spatial transformer and the spatial-temporal transformer individually, which have the same structure as the temporal transformer. In our work, we deploy a transformer in each self-view domain, called self-view transformers, to model the relationships among local features for further feature enhancement.
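The computation inside one self-view transformer block can be sketched as follows (single head for brevity; the multi-head case concatenates several such attention outputs). All weight shapes here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-feature layer normalization."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_view_block(x, wq, wk, wv, w1, b1, w2, b2):
    """One (single-head) self-view transformer block: scaled dot-product
    self-attention with residual connection and layer norm, followed by a
    two-layer FFN with ReLU. x: (T, C) sequence of local features."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d)) @ v       # scaled dot-product attention
    x = layer_norm(x + attn)                       # residual + LayerNorm
    ffn = np.maximum(0.0, x @ w1 + b1) @ w2 + b2   # FFN(x) = ReLU(x W1 + b1) W2 + b2
    return layer_norm(x + ffn)                     # residual + LayerNorm
```

Stacking this block a small number of times (depth 2 in the paper's default setting) yields the self-view transformer for one view domain.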

3.4 Cross-view transformer

In each domain, self-view transformers help to refine the initial feature by exploiting the attention relationships within the current view. Actually, the attention relationships across multiple views are also instructive for feature refinement. Considering this, in this paper, a cross-view transformer is constructed across different view domains, which takes the spatial, temporal and spatial-temporal views simultaneously. The right part of Fig. 4 shows the detailed structure of the proposed cross-view transformer, which consists of a multi-head cross-attention layer, domain-wise feed-forward networks, layer normalization modules and residual connections. Formally, in the temporal-based cross-attention head, six linear projections are applied to generate six features: $\mathbf{Q}^{s}$ and $\mathbf{Q}^{st}$ are from the temporal feature $\mathbf{F}_t$, $\mathbf{K}^{s}$ and $\mathbf{V}^{s}$ are from the spatial feature $\mathbf{F}_s$, and $\mathbf{K}^{st}$ and $\mathbf{V}^{st}$ are from the spatial-temporal feature $\mathbf{F}_{st}$, each with $d = C/N_h$ channels, where $N_h$ is the number of heads. In one head, the cross-attention is formulated as:

$$\mathrm{CrossAtt}^{s} = \mathrm{softmax}\Big(\frac{\mathbf{Q}^{s}(\mathbf{K}^{s})^{\top}}{\sqrt{d}}\Big)\mathbf{V}^{s}, \quad \mathrm{CrossAtt}^{st} = \mathrm{softmax}\Big(\frac{\mathbf{Q}^{st}(\mathbf{K}^{st})^{\top}}{\sqrt{d}}\Big)\mathbf{V}^{st},$$
where $\top$ denotes the transposition. The concatenations of the multi-head outputs can be represented by $\mathbf{M}^{s}$ and $\mathbf{M}^{st}$, which have the same size as $\mathbf{F}_t$. Then, the output of the multi-head cross-attention layer can be expressed as:

$$\mathbf{O}_t = \mathrm{LN}\big(\mathbf{F}_t + \mathbf{M}^{s} + \mathbf{M}^{st}\big).$$
After that, similar to the self-view transformer, a domain-wise feed-forward network is applied:

$$\mathbf{Y}_t = \mathrm{LN}\big(\mathbf{O}_t + \sigma(\mathbf{O}_t\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\big),$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the parameters of the two linear transformation layers and $\sigma$ represents the ReLU activation function. In this way, the temporal-domain feature can aggregate related information from the other views. For the spatial and spatial-temporal views, the interactions are explored in similar cross-view transformers. Finally, the features from different views are adaptively integrated by the proposed cross-view transformer to generate more comprehensive representations.
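A minimal sketch of the temporal-branch cross-view attention, assuming queries come from the temporal feature and keys/values from the spatial and spatial-temporal features; how the two attended outputs are merged (summed here before the residual connection) is an assumption about the design rather than a confirmed detail.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def _layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def cross_view_attention(x_t, x_s, x_st, wq_s, wq_st, wk_s, wv_s, wk_st, wv_st):
    """Cross-attention for the temporal branch (single head, a sketch).
    x_t: (T, C) temporal feature; x_s: (Ns, C) spatial feature;
    x_st: (Nst, C) spatial-temporal feature; all weights are (C, C)."""
    d = wq_s.shape[1]
    # queries from the temporal view, keys/values from the other two views
    a_s = _softmax((x_t @ wq_s) @ (x_s @ wk_s).T / np.sqrt(d)) @ (x_s @ wv_s)
    a_st = _softmax((x_t @ wq_st) @ (x_st @ wk_st).T / np.sqrt(d)) @ (x_st @ wv_st)
    # merge the two attended outputs (summation is an assumption),
    # then residual connection + layer norm
    return _layer_norm(x_t + a_s + a_st)
```

The symmetric spatial and spatial-temporal branches would swap which view supplies the queries.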

|  | MARS |  |  |  | iLIDS-VID |  |  | PRID2011 |  |  |
| Method | mAP | Rank-1 | Rank-5 | Rank-20 | Rank-1 | Rank-5 | Rank-20 | Rank-1 | Rank-5 | Rank-20 |
| Baseline | 81.2 | 88.5 | 95.5 | 97.9 | 88.2 | 97.6 | 99.7 | 93.5 | 98.7 | 99.7 |
| + Three branches | 82.0 | 88.9 | 95.6 | 98.0 | 89.3 | 97.8 | 99.6 | 94.4 | 99.1 | 99.8 |
| + Trigeminal feature extractor | 83.9 | 89.7 | 96.1 | 98.2 | 89.7 | 98.4 | 99.7 | 94.8 | 99.1 | 100 |
| Spatial view | 80.2 | 86.9 | 95.2 | 97.5 | 86.7 | 97.7 | 99.5 | 91.3 | 98.2 | 99.8 |
| Temporal view | 82.8 | 89.3 | 95.8 | 98.1 | 86.3 | 98.0 | 99.8 | 93.2 | 98.7 | 100 |
| Spatial-temporal view | 83.4 | 89.5 | 95.7 | 98.1 | 87.6 | 97.9 | 99.6 | 92.7 | 98.6 | 99.8 |
| + Self-view transformer | 85.6 | 90.8 | 96.7 | 98.4 | 90.6 | 98.2 | 100 | 95.7 | 99.0 | 99.8 |
| Spatial view | 83.4 | 90.0 | 96.1 | 98.2 | 88.9 | 97.8 | 99.7 | 93.3 | 98.5 | 99.8 |
| Temporal view | 83.5 | 89.6 | 95.7 | 98.0 | 88.4 | 97.5 | 99.8 | 94.7 | 98.8 | 100 |
| Spatial-temporal view | 84.6 | 90.2 | 96.6 | 98.2 | 90.2 | 98.2 | 99.9 | 93.7 | 98.7 | 99.6 |
| + Cross-view transformer | 85.8 | 91.2 | 97.3 | 98.8 | 91.3 | 98.6 | 100 | 96.4 | 99.3 | 100 |
Table 1: Ablation results of key components on MARS, iLIDS-VID and PRID2011.
Figure 5: Ablation results on the depth of the self-view transformers and the cross-view transformer on MARS.
|  | Temporal Transformer |  |  | Spatial-temporal Transformer |  |  |
| Length (T) | mAP | Rank-1 | Rank-20 | mAP | Rank-1 | Rank-20 |
| 6 | 83.4 | 88.6 | 98.0 | 82.6 | 89.5 | 97.8 |
| 8 | 84.0 | 88.7 | 98.1 | 84.0 | 89.8 | 98.5 |
| 10 | 83.2 | 89.1 | 97.8 | 83.8 | 89.2 | 97.9 |
| 12 | 83.0 | 88.6 | 97.9 | 83.7 | 88.7 | 98.0 |
Table 2: Ablation results on the length of sequences on MARS.
|  | Spatial Transformer |  |  | Spatial-temporal Transformer |  |  |
| Size (H×W) | mAP | Rank-1 | Rank-20 | mAP | Rank-1 | Rank-20 |
| 8×4 | 82.4 | 89.2 | 98.0 | 83.2 | 90.0 | 97.8 |
| 16×8 | 84.4 | 89.9 | 98.4 | 84.0 | 89.8 | 98.5 |
Table 3: Ablation results on the size of spatial feature maps on MARS.

4 Experiments

4.1 Datasets and evaluation protocols

In this paper, we adopt three widely used benchmarks to evaluate our proposed method, i.e., iLIDS-VID [32], PRID-2011 [16] and MARS [41]. iLIDS-VID [32] and PRID-2011 [16] are two small datasets, both collected with two cameras. iLIDS-VID [32] has 600 video sequences of 300 different identities. PRID-2011 [16] consists of 400 image sequences of 200 identities from two non-overlapping cameras. MARS [41] is a large-scale dataset consisting of around 18,000 video sequences of 1,261 identities. All the video sequences are captured by at least two cameras. Note that there are around 3,200 distractor sequences in the dataset to simulate actual detection conditions. For evaluation, the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) are adopted for the MARS dataset, following previous works. For more details, we refer readers to the original paper [42]. For iLIDS-VID and PRID-2011, there is a single correct match in the gallery set, so only the cumulative re-identification accuracy is reported.
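For completeness, the evaluation protocol can be sketched as a simplified CMC and mAP computation over a query-by-gallery distance matrix (single-shot, ignoring the same-camera/same-tracklet filtering used on MARS in practice).

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids):
    """Simplified CMC curve and mAP from a (num_query, num_gallery)
    distance matrix; q_ids/g_ids are identity labels."""
    num_q, num_g = dist.shape
    cmc = np.zeros(num_g)
    aps = []
    for i in range(num_q):
        order = np.argsort(dist[i])                       # rank gallery by distance
        matches = (g_ids[order] == q_ids[i]).astype(float)
        if not matches.any():
            continue
        first = matches.argmax()                          # rank of first correct match
        cmc[first:] += 1
        hits = np.cumsum(matches)
        precision = hits / (np.arange(num_g) + 1)         # precision at each rank
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / num_q, float(np.mean(aps))
```

Rank-k accuracy is then `cmc[k - 1]`; on iLIDS-VID and PRID-2011 each query has exactly one gallery match, so mAP coincides with the mean reciprocal rank.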

4.2 Implementation details

We implement our framework based on the PyTorch (https://pytorch.org/) toolbox. The experimental devices include an Intel i7-4790 CPU and two NVIDIA RTX 3090 GPUs (24 GB memory). Experimentally, we set the batch size to 16. In our work, if not specified, we set the length of the sequence T = 8 and the depth of the transformers to 2. Each image in a sequence is resized to 256×128 and augmented by random cropping, horizontal flipping and random erasing. The ResNet-50 [14] pre-trained on the ImageNet dataset [8] is used as our backbone network. Following previous works [30], we remove the last spatial down-sampling operation to increase the feature resolution. In a mini-batch, we select a number of positive and negative sequence pairs for the verification loss. Besides, we utilize frame-wise and video-wise OIM losses to supervise the whole network following [4]. In experiments, the outputs of the backbone in the three branches are also supervised to strengthen the learning of the underlying network. During training, we train our network for 50 epochs with the stochastic gradient descent [1] algorithm with Nesterov momentum of 0.9; the learning rate is decayed by a factor of 10 every 15 epochs. We will release the source code for model reproduction.

|  |  | MARS |  |  |  | iLIDS-VID |  |  | PRID2011 |  |  |
| Methods | Source | mAP | Rank-1 | Rank-5 | Rank-20 | Rank-1 | Rank-5 | Rank-20 | Rank-1 | Rank-5 | Rank-20 |
| SeeForest [43] | CVPR17 | 50.7 | 70.6 | 90.0 | 97.6 | 55.2 | 86.5 | 97.0 | 79.4 | 94.4 | 99.3 |
| ASTPN [36] | ICCV17 | - | 44 | 70 | 81 | 62 | 86 | 98 | 77 | 95 | 99 |
| Snippet [4] | CVPR18 | 76.1 | 86.3 | 94.7 | 98.2 | 85.4 | 96.7 | 99.5 | 93.0 | 99.3 | 100 |
| STAN [22] | CVPR18 | 65.8 | 82.3 | - | - | 80.2 | - | - | 93.2 | - | - |
| STMP [25] | AAAI19 | 72.7 | 84.4 | 93.2 | 96.3 | 84.3 | 96.8 | 99.5 | 92.7 | 98.8 | 99.8 |
| M3D [21] | AAAI19 | 74.0 | 84.3 | 93.8 | 97.7 | 74.0 | 94.3 | - | 94.4 | 100 | - |
| Attribute [40] | CVPR19 | 78.2 | 87.0 | 95.4 | 98.7 | 86.3 | 87.4 | 99.7 | 93.9 | 99.5 | 100 |
| VRSTC [18] | CVPR19 | 82.3 | 88.5 | 96.5 | 97.4 | 83.4 | 95.5 | 99.5 | - | - | - |
| GLTR [20] | ICCV19 | 78.5 | 87.0 | 95.8 | 98.2 | 86.0 | 98.0 | - | 95.5 | 100 | - |
| COSAM [28] | ICCV19 | 79.9 | 84.9 | 95.5 | 97.9 | 79.6 | 95.3 | - | - | - | - |
| MGRA [39] | CVPR20 | 85.9 | 88.8 | 97.0 | 98.5 | 88.6 | 98.0 | 99.7 | 95.9 | 99.7 | 100 |
| STGCN [37] | CVPR20 | 83.7 | 89.9 | - | - | - | - | - | - | - | - |
| AFA [5] | ECCV20 | 82.9 | 90.2 | 96.6 | - | 88.5 | 96.8 | 99.7 | - | - | - |
| TCLNet [17] | ECCV20 | 85.1 | 89.8 | - | - | 86.6 | - | - | - | - | - |
| GRL [24] | CVPR21 | 84.8 | 91.0 | 96.7 | 98.4 | 90.4 | 98.3 | 99.8 | 96.2 | 99.7 | 100 |
| TMT (Ours) | - | 85.8 | 91.2 | 97.3 | 98.8 | 91.3 | 98.6 | 100 | 96.4 | 99.3 | 100 |
Table 4: Comparison with state-of-the-art video-based person re-identification methods on MARS, iLIDS-VID and PRID2011.

4.3 Ablation study

To investigate the effectiveness of our TMT, we conduct a series of experiments on three public benchmarks: MARS, iLIDS-VID and PRID2011.

Effectiveness of key components. We gradually add our modules to the backbone for ablation analysis. The results are shown in Tab. 1. In this table, “Baseline” denotes the ResNet-50 feature extractor followed directly by spatial and temporal average pooling for the final video representation. “+ Three branches” means that three non-shared Res4x blocks of ResNet-50 are deployed in the same way as in the trigeminal feature extractor, which brings slight improvements. Compared with “+ Three branches”, “+ Trigeminal feature extractor” replaces the direct average pooling in the spatial and temporal branches with the proposed temporal and spatial self-attention pooling, respectively. We can see that our proposed self-attention pooling is beneficial for transforming the same features into different domains and improves the mAP by 1.9% on MARS. Moreover, we add three self-view transformers in the spatial, temporal and spatial-temporal domains for “+ Self-view transformer”. The individual view-wise features attain significant improvements over “+ Trigeminal feature extractor” on the three benchmarks. Compared with “Baseline”, our proposed three self-view transformers improve the mAP by 4.4% and the Rank-1 accuracy by 2.3% on MARS. The improvements indicate that self-view transformers help to model the relationships among global features for better performance. Note that the combination of three views yields higher results than any single view, which supports our statement: a video is worth three views. Last but not least, we apply the proposed cross-view transformer in “+ Cross-view transformer”. The performance is further improved on the three benchmarks, gaining 0.4%, 0.7% and 0.7% in terms of Rank-1 accuracy on MARS, iLIDS-VID and PRID2011, respectively. This indicates that the cross-view transformer is beneficial for aggregating multi-view observations into comprehensive video representations.

Effect of the length of sequences. In Tab. 2, the ablation results show the influence of different sequence lengths on MARS. The temporal or spatial-temporal transformer is applied to the baseline for single-view observation. We can see that the temporal and spatial-temporal transformers are sensitive to the length of the sequences. When the length of the sequence is set to 8, both the temporal and spatial-temporal transformers obtain their best mAP.

Effect of the size of spatial feature maps. We also perform ablation experiments to investigate the effect of varying the spatial size of the feature maps. The standard ResNet-50 generates frame-wise feature maps with a spatial size of 8×4 for 256×128 inputs. When we remove the last spatial down-sampling operation of ResNet-50, the size of the feature maps increases to 16×8. In this way, the proposed spatial or spatial-temporal transformer can be applied to features of different sizes. The ablation results for the spatial and spatial-temporal transformers are reported in Tab. 3. The results show that increasing the spatial size brings significant improvements.

Effect of the depth of transformer. The depth of the transformer is an important factor for exploring the interactions of contextual information. Increasing the depth improves the representation capacity of the transformer, but it also brings training difficulties. In Fig. 5, performances with different depths are reported for the different view-wise transformers. The experiments are conducted on MARS. We add the spatial, temporal or spatial-temporal transformer to “Baseline” and the cross-view transformer to “+ Three branches”, respectively. From the results, we can see that the depth achieving the best performance differs from one transformer to another. Besides, for the cross-view transformer, a two-layer structure performs better than shallower or deeper ones.

4.4 Compared with the state of the arts

In this section, our TMT is compared with state-of-the-art methods on MARS, iLIDS-VID and PRID2011. The results are reported in Tab. 4. One can observe that, on the MARS dataset, our TMT attains comparable or even better performance than the compared methods. In addition, our method achieves the highest Rank-1 accuracy on all three benchmarks. Compared with existing methods, our method integrates multi-view cues to obtain more comprehensive video representations. MGRA [39] extracts multi-granularity spatial cues under the guidance of a global view, which gains remarkable mAP on the MARS dataset. Attribute [40] mines various attribute-aware features in the spatial domain for alignment. These methods focus on capturing diverse spatial features and gain remarkable performances. Even so, compared with MGRA [39], our method improves the Rank-1 accuracy by 2.4% and 2.7% on MARS and iLIDS-VID, respectively. GLTR [20] attempts to model multi-granular temporal dependencies for short- and long-term temporal cues. GRL [24] utilizes bi-directional temporal learning to refine and accumulate disentangled spatial features. These methods are indeed helpful for capturing discriminative temporal cues for better recognition accuracy. In particular, GRL [24] attains an impressive Rank-1 accuracy on the iLIDS-VID dataset. Our method still gains about 0.9% Rank-1 accuracy over GRL [24] on iLIDS-VID. Besides, STA [11] introduces a parameter-free spatial-temporal attention module to weight local features in the spatial-temporal domain. Different from the above single-view methods, our proposed TMT assembles spatial, temporal and spatial-temporal observations and achieves better results on the three public datasets. Meanwhile, it is worth noting that some methods construct two-stream networks for different feature representations. For example, M3D [21] combines 3D-CNN and 2D-CNN to explicitly leverage spatial and temporal cues. STGCN [37] constructs two parallel graph convolutional networks to explore the relations in the spatial and temporal domains. Compared with these two-stream approaches, our method introduces a self-view transformer into each single-view domain for feature enhancement, and a cross-view transformer to aggregate multiple views for better representations. In this way, our method surpasses STGCN [37] by 2.1% and 1.3% in terms of mAP and Rank-1 accuracy on MARS. In summary, our method performs better than most existing state-of-the-art methods. These results validate the superiority of our method.

5 Conclusion

In this paper, we propose a novel framework named Trigeminal Transformer for video-based person Re-ID. A trigeminal feature extractor is designed to capture the spatial, temporal and spatial-temporal view domains, in which the proposed temporal self-attention pooling and spatial self-attention pooling are applied to transform features into the spatial and temporal domains, respectively. Besides, self-view transformers are introduced to explore the relationships among global features for feature enhancement within each self-view domain. In addition, a cross-view transformer is proposed to model the interactions among multi-view features for more comprehensive representations. Based on our proposed modules, we model a video from the spatial, temporal and spatial-temporal views. Extensive experiments on three public benchmarks demonstrate that our approach performs better than state-of-the-art methods.


  • [1] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pp. 177–186. Cited by: §4.2.
  • [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1, §2.2.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In ECCV, pp. 213–229. Cited by: §1, §2.2.
  • [4] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang (2018) Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pp. 1169–1178. Cited by: §4.2, Table 4.
  • [5] G. Chen, Y. Rao, J. Lu, and J. Zhou (2020) Temporal coherence or temporal motion: which is more critical for video-based person re-identification?. In ECCV, pp. 660–676. Cited by: Table 4.
  • [6] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In ICML, pp. 1691–1703. Cited by: §2.2.
  • [7] J. Dai, P. Zhang, D. Wang, H. Lu, and H. Wang (2019) Video person re-identification by temporal residual learning. TIP 28, pp. 1366–1377. Cited by: §2.1.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.1, §4.2.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.2.
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1, §2.2, §3.3.
  • [11] Y. Fu, X. Wang, Y. Wei, and T. Huang (2019) STA: spatial-temporal attention for large-scale video-based person re-identification. In AAAI. Cited by: §2.1, §4.4.
  • [12] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In CVPR, pp. 244–253. Cited by: §2.2.
  • [13] X. Gu, H. Chang, B. Ma, H. Zhang, and X. Chen (2020) Appearance-preserving 3d convolution for video-based person re-identification. In ECCV, pp. 228–243. Cited by: §2.1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2, §4.2.
  • [15] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang (2021) TransReID: transformer-based object re-identification. arXiv preprint arXiv:2102.04378. Cited by: §1, §2.2.
  • [16] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof (2011) Person re-identification by descriptive and discriminative classification. In SCIA, pp. 91–102. Cited by: §4.1.
  • [17] R. Hou, H. Chang, B. Ma, S. Shan, and X. Chen (2020) Temporal complementary learning for video person re-identification. arXiv preprint arXiv:2007.09357. Cited by: §2.1, Table 4.
  • [18] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen (2019) VRSTC: occlusion-free video person re-identification. In CVPR, pp. 7183–7192. Cited by: Table 4.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167. Cited by: §2.1.
  • [20] J. Li, J. Wang, Q. Tian, W. Gao, and S. Zhang (2019) Global-local temporal representations for video person re-identification. In ICCV, pp. 3958–3967. Cited by: §4.4, Table 4.
  • [21] J. Li, S. Zhang, and T. Huang (2019) Multi-scale 3d convolution network for video based person re-identification. In AAAI, pp. 8618–8625. Cited by: §1, §2.1, §3.2, §4.4, Table 4.
  • [22] S. Li, S. Bak, P. Carr, and X. Wang (2018) Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pp. 369–378. Cited by: §1, §2.1, §3.1, Table 4.
  • [23] J. Liu, Z. Zha, X. Chen, Z. Wang, and Y. Zhang (2019) Dense 3d-convolutional neural network for person re-identification in videos. TOMM 15, pp. 1–19. Cited by: §2.1.
  • [24] X. Liu, P. Zhang, C. Yu, H. Lu, and X. Yang (2021) Watching you: global-guided reciprocal learning for video-based person re-identification. Cited by: §4.4, Table 4.
  • [25] Y. Liu, Z. Yuan, W. Zhou, and H. Li (2019) Spatial and temporal mutual promotion for video-based person re-identification. In AAAI, pp. 8786–8793. Cited by: §1, §2.1, Table 4.
  • [26] N. McLaughlin, J. Martinez del Rincon, and P. Miller (2016) Recurrent convolutional network for video-based person re-identification. In CVPR, pp. 1325–1334. Cited by: §1, §2.1.
  • [27] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199. Cited by: §3.2.
  • [28] A. Subramaniam, A. Nambiar, and A. Mittal (2019) Co-segmentation inspired attention networks for video-based person re-identification. In ICCV, pp. 562–572. Cited by: Table 4.
  • [29] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo (2020) TransTrack: multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460. Cited by: §1.
  • [30] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, pp. 480–496. Cited by: §4.2.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §2.2.
  • [32] T. Wang, S. Gong, X. Zhu, and S. Wang (2014) Person re-identification by video ranking. In ECCV, pp. 688–703. Cited by: §4.1.
  • [33] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia (2020) End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503. Cited by: §2.2.
  • [34] L. Wu, Y. Wang, L. Shao, and M. Wang (2019) 3-d personvlad: learning deep global representations for video-based person reidentification. NNLS 30, pp. 3347–3359. Cited by: §2.1.
  • [35] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In CVPR, pp. 3415–3424. Cited by: §3.1.
  • [36] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou (2017) Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pp. 4733–4742. Cited by: Table 4.
  • [37] J. Yang, W. Zheng, Q. Yang, Y. Chen, and Q. Tian (2020) Spatial-temporal graph convolutional network for video-based person re-identification. In CVPR, pp. 3289–3299. Cited by: §3.2, §4.4, Table 4.
  • [38] Y. Zeng, J. Fu, and H. Chao (2020) Learning joint spatial-temporal transformations for video inpainting. In ECCV, pp. 528–543. Cited by: §2.2.
  • [39] Z. Zhang, C. Lan, W. Zeng, and Z. Chen (2020) Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In CVPR, pp. 10407–10416. Cited by: §2.1, §4.4, Table 4.
  • [40] Y. Zhao, X. Shen, Z. Jin, H. Lu, and X. Hua (2019) Attribute-driven feature disentangling and temporal aggregation for video person re-identification. In CVPR, pp. 4913–4922. Cited by: §2.1, §4.4, Table 4.
  • [41] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) Mars: a video benchmark for large-scale person re-identification. In ECCV, pp. 868–884. Cited by: §4.1.
  • [42] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv:1610.02984. Cited by: §4.1.
  • [43] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan (2017) See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pp. 4747–4756. Cited by: Table 4.