Person re-identification (Re-ID) aims to retrieve specific pedestrians across different cameras at different times and places. Recently, this task has become a hot research topic due to its importance in advanced applications, such as safe communities, intelligent surveillance and criminal investigation. In the past decade, researchers have made great progress in this field and have derived several related tasks, such as image-based person Re-ID, video-based person Re-ID and person search. Compared with other Re-ID tasks, video-based person Re-ID takes a video, rather than a single image, as the retrieval input. Although videos provide comprehensive appearance information, motion cues and pose variations over time, they also bring more illumination changes, complicated backgrounds and person occlusions within a clip. Thus, there are still many challenges for researchers to handle in video-based person Re-ID.
Previous methods [xu2017jointly, wu2016deep, liu2015spatio] can be coarsely summarized into two steps: spatial feature extraction and temporal feature aggregation. First, Convolutional Neural Networks (CNNs) are utilized to extract frame-level spatial features from each single image. Then, the frame-level spatial features are temporally aggregated into a feature vector, which serves as the video representation for computing similarity scores. Naturally, how to fully explore the discriminative spatial-temporal cues from multiple frames is seen as the key to tackling video-based person Re-ID. Technically, the video representation obtained by Global Average Pooling (GAP) followed by Temporal Average Pooling (TAP) can directly focus on the important information of persons in the sequence. However, the average operation has some obvious drawbacks, such as the inability to handle temporal misalignment, the pollution from background noise, and the difficulty of capturing small but meaningful subjects in videos. To address these drawbacks, researchers have recently proposed rigid-partition-based or soft-attention-based methods to replace the direct average operation. These methods help to learn more discriminative and diverse local features, resulting in higher performance for video-based person Re-ID. However, while strengthening the local features, previous methods generally ignore the role of global features in recognizing the whole person. For example, Li et al. [li2018diversity] do not consider the relationships between global and local features, and their method lacks global guidance for the local feature learning on each frame. Based on this consideration, Zhang et al. [zhang2020multi] utilize the local affinities with respect to reference global features to help assign different weights to local features. Different from the guidance for estimating the weights of each node feature in [zhang2020multi], we correlate the video-level global features with the pixel-level local features in a frame to generate two correlation maps.
Unlike previous approaches that use a spatial self-attention mechanism, we utilize the global information to estimate the correlation degree within a frame. High- and low-correlation areas are adaptively mined according to the learned correlation maps. Intuitively, the higher the correlation values are, the more temporally stable and spatially important the visual cues are. If the correlation values are low, the visual cues may not be temporally continuous.
Based on the above observation, we propose a novel Global-guided Reciprocal Learning (GRL) framework for video-based person Re-ID. The whole framework mainly consists of two key modules. To begin with, we propose a Global-guided Correlation Estimation (GCE) module to generate the correlation values of frame-level local features under the global guidance. With this new module, the frame-level feature maps are disentangled into two kinds of discriminative features with distinct correlation degrees. The one with high correlation usually covers the most conspicuous and continuous visual information. The other, with inverse correlation, serves as a supplement and is exploited to mine the sub-critical cues. Besides, we propose a novel Temporal Reciprocal Learning (TRL) module to fully exploit all the discriminative features in the forward and backward processes. More specifically, for high-correlation features, we adopt a semantic reinforcement strategy to mine spatially conspicuous and temporally aligned features. For low-correlation features, we introduce a temporal memory strategy to accumulate the discontinuous but discriminative cues frame by frame. In this way, our proposed method can not only explore the most conspicuous information from the highlighted regions in a sequence, but also capture the meaningful sub-critical information from the low-correlation regions. Extensive experiments on public benchmarks demonstrate that our framework delivers better results than other state-of-the-art approaches.
In summary, our contributions are fourfold:
We propose a novel Global-guided Reciprocal Learning (GRL) framework for video-based person Re-ID.
We propose a new Global-guided Correlation Estimation (GCE) module to generate the feature correlation degrees under the guidance of video-level representations.
We introduce a Temporal Reciprocal Learning (TRL) module to effectively capture the conspicuous information and the fine-grained cues in videos.
Extensive experiments on public benchmarks demonstrate that our framework attains better overall performance than several state-of-the-art methods.
2 Related Works
2.1 Video-based Person Re-identification
In recent years, with the rise of deep learning [deng2009imagenet, ioffe2015batch], person Re-ID has achieved great success and its performance has improved significantly. In the early stage of person Re-ID, researchers paid most attention to image-based person Re-ID. Recently, video-based person Re-ID, which is seen as a generalization of the image-based task, has drawn increasing interest from researchers. Generally, videos contain richer spatial and temporal information than still images. Thus, on the one hand, some existing methods [wang2014person, mclaughlin2016recurrent, li2019multi] attempt to capture temporal information to strengthen the video representations. On the other hand, some works [li2018diversity, si2018dual, song2017region, fu2019sta] concentrate on extracting attentive spatial features. For example, Li et al. [li2018diversity] employ a diverse set of spatial attention modules to consistently extract similar local patches across multiple images. Fu et al. [fu2019sta] design an attention module to weight horizontal parts using a spatial-temporal map for more robust clip-level feature representations. Zhao et al. [zhao2019attribute] propose an attribute-driven method for feature disentangling to learn various attribute-aware features. Liu et al. [liu2019jointpyramid] propose a soft-parsing attention network and jointly utilize a spatial pyramid non-local block to learn multiple semantic-aware aligned video representations. Zhang et al. [zhang2020multi] utilize a representative set of reference feature nodes for modeling the global relations and multi-granularity level attention to capture semantics at different levels. In this paper, we attempt to estimate the correlation maps guided by the whole video representation, which helps to cover the conspicuous visual cues in each frame. Besides, a novel temporal reciprocal learning mechanism is proposed to explore more discriminative information for video-based person Re-ID.
2.2 Temporal Feature Learning
For video-related tasks, such as video-based person Re-ID, action recognition and video segmentation [miao2020memory], temporal feature learning is seen as the core module in most algorithms. Typically, temporal information modeling methods encode the temporal relations or utilize the temporal cues for video representation learning. Most existing video-based methods exploit optical flow [chung2017two, chen2018video], Recurrent Neural Networks (RNNs) [mclaughlin2016recurrent, hochreiter1997long], or temporal pooling [zheng2016mars] for temporal feature learning. For action recognition, Weng et al. [weng2020temporal] introduce a progressive enhancement module to sequentially excite the discriminative channels of frames. In the video-based person Re-ID field, McLaughlin et al. [mclaughlin2016recurrent] introduce a recurrent architecture that passes the feature message of each frame to aggregate temporal information. Xu et al. [xu2017jointly] propose a joint spatial CNN and temporal RNN model for video-based person Re-ID. Dai et al. [dai2019video] design a temporal residual learning module to simultaneously extract the generic and specific features from consecutive frames. Zhang et al. [zhang2018multi] introduce a reinforcement learning method for pairwise decision making. Liu et al. [liu2019spatial] design a refining recurrent unit and a spatial-temporal integration module to integrate abundant spatial-temporal information. Compared with existing methods, our method adopts temporal reciprocal learning for bi-directional semantic feature enhancement and temporal information accumulation. Thus, the global-guided spatial features can focus on complementary objects, such as the moving human body and key accessories.
3 Proposed Method
In this section, we introduce the proposed Global-guided Reciprocal Learning (GRL) framework. We first give an overview of the proposed GRL. Then, we elaborate on the key modules in the following subsections.
3.1 Overview
The overall architecture of our proposed GRL is shown in Fig. 1. Our approach consists of frame-level feature extraction, global-guided feature disentanglement and temporal reciprocal learning. Given a video, we first use Restricted Random Sampling (RRS) [li2018diversity] to generate training image frames. Then, we extract frame-level features with a pre-trained backbone network (ResNet-50 [he2016deep] in our work). Next, we apply a Temporal Average Pooling (TAP) and a Global Average Pooling (GAP) to the frame-level features to generate video-level representations. With the guidance of the video-level representations, we design a Global-guided Correlation Estimation (GCE) module to generate the correlation map, which disentangles the features into high- and low-correlation features according to the correlation degree. Afterwards, the Temporal Reciprocal Learning (TRL) module is introduced to enhance and accumulate the disentangled features in the forward and backward directions. Finally, to train our model, we introduce the Online Instance Matching (OIM) [xiao2017joint] loss and a verification loss to optimize the whole network. With the GRL, our method can not only capture the conspicuous information but also mine more meaningful fine-grained cues in sequences. In the test stage, the average of the frame-level feature vectors at different time steps and the video-level feature vector at the last time step are concatenated and used to generate the retrieval list.
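The sampling step of the overview can be illustrated with a minimal Python sketch. It assumes the common reading of RRS, in which a sequence is split into chunks of equal duration and one frame index is drawn at random from each chunk. The function name `restricted_random_sampling`, its parameters and the fallback for sequences shorter than the number of chunks are illustrative assumptions, not the authors' implementation.

```python
import random


def restricted_random_sampling(num_frames, num_chunks=8, rng=random):
    """Return one frame index per equal-duration chunk (hypothetical helper)."""
    if num_frames <= 0 or num_chunks <= 0:
        raise ValueError("num_frames and num_chunks must be positive")
    indices = []
    for k in range(num_chunks):
        # Integer chunk boundaries [start, end) covering the whole sequence.
        start = k * num_frames // num_chunks
        end = (k + 1) * num_frames // num_chunks
        if end <= start:
            # Assumption: for very short sequences an empty chunk reuses
            # the nearest valid frame instead of failing.
            indices.append(min(start, num_frames - 1))
        else:
            indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # An 80-frame tracklet split into 8 chunks of 10 frames each.
    print(restricted_random_sampling(80, 8, random.Random(0)))
```

Because exactly one index is drawn per chunk, the sampled frames stay in temporal order and span the whole sequence, which is what the later forward and backward processing relies on.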
3.2 Global-guided Correlation Estimation
The attention mechanism has been widely adopted to tackle the misalignment problem in video-based person Re-ID. However, computing attention scores from a single frame lacks a global perception of the whole video. To relieve this issue, we propose the GCE module to highlight the conspicuous features in each frame and ignore the distracting information. Fig. 2 shows the structure of the proposed GCE module. Formally, given a video, we first sample T frames as the input of our network. The ResNet-50 is used as the single-frame feature extractor to obtain a set of T frame-level feature maps, each of size H × W × C, where H, W and C represent the height, width and number of channels, respectively. Then, we sequentially apply a TAP and a GAP to obtain the video-level representation, a C-dimensional feature vector that coarsely represents the whole video.
Based on the video-level representation, the proposed GCE takes the frame-level features and the video-level feature vector as inputs. To guide the feature learning, the video-level feature vector is first expanded to the same shape as each frame-level feature map, and the expanded features are concatenated with that feature map. Then, we integrate the global and local features and jointly infer their degree of correlation. The correlation map of each frame under the global guidance is computed by applying a 1×1 convolution with learnable weights to the concatenated features, followed by a sigmoid activation function. By reversing the obtained correlation map, we obtain a low-correlation map. Then, the correlation maps are multiplied element-wise with the original frame-level features to activate the distinct local regions of images. Finally, under this guidance, we disentangle the frame-level features into high-correlation features and low-correlation features. This two-way disentanglement differs from previous methods for local feature extraction.
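The GCE steps described above can be sketched in plain Python on a toy input. This is only an illustration under stated assumptions: features are nested lists indexed as frame, row, column, channel; the 1×1 convolution is modeled as a single weight vector over the concatenated local and global channels (so the correlation map has one value per pixel); and "reversing" the map is taken to mean one minus the map. The function names and the optional bias are hypothetical.

```python
import math


def video_level_vector(frames):
    """TAP followed by GAP: average each channel over all frames and pixels."""
    channels = len(frames[0][0][0])
    total = [0.0] * channels
    count = 0
    for frame in frames:          # T frames
        for row in frame:         # H rows
            for pixel in row:     # W pixels, each a list of C channel values
                for c, value in enumerate(pixel):
                    total[c] += value
                count += 1
    return [s / count for s in total]


def correlation_map(frame, global_vec, weights, bias=0.0):
    """One score in (0, 1) per pixel: sigmoid(w . [local, global] + b)."""
    corr = []
    for row in frame:
        corr_row = []
        for pixel in row:
            joint = pixel + global_vec  # concatenate local and global channels
            logit = sum(w * x for w, x in zip(weights, joint)) + bias
            corr_row.append(1.0 / (1.0 + math.exp(-logit)))
        corr.append(corr_row)
    return corr


def disentangle(frame, corr):
    """Split a frame into high- and low-correlation features."""
    high = [[[r * x for x in pixel] for pixel, r in zip(row, corr_row)]
            for row, corr_row in zip(frame, corr)]
    # Assumption: the low-correlation map is the complement 1 - r.
    low = [[[(1.0 - r) * x for x in pixel] for pixel, r in zip(row, corr_row)]
           for row, corr_row in zip(frame, corr)]
    return high, low


if __name__ == "__main__":
    # T = 2 frames, H = 1, W = 2, C = 2.
    frames = [[[[1.0, 0.0], [0.0, 1.0]]], [[[1.0, 2.0], [2.0, 1.0]]]]
    g = video_level_vector(frames)
    corr = correlation_map(frames[0], g, weights=[1.0, -1.0, 0.0, 0.0])
    high, low = disentangle(frames[0], corr)
    print(g, corr, high, low)
```

Under the complement assumption the two outputs always sum back to the input features, so the disentanglement redistributes information between the two branches rather than discarding it.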
3.3 Temporal Reciprocal Learning
Temporal feature aggregation plays an important role in video-based person Re-ID. The proposed GCE can highlight the informative regions under a global view. However, the sub-critical, misaligned fine-grained cues are easily missed due to the visual variations in a long sequence. To address this problem, we propose a novel Temporal Reciprocal Learning (TRL) mechanism to fully exploit the discriminative information from the high- and low-correlation features disentangled by GCE. Considering the frame order in videos, the TRL is designed to operate in both the forward and backward directions. More specifically, we introduce Enhancement and Memory Units (EMUs) to enhance the high-correlation features and accumulate the low-correlation features. Finally, the features passed through the forward and backward directions are integrated as the outputs of the TRL module.
Enhancement and Memory Unit. As illustrated in Fig. 3, at each time step t, the EMU takes three inputs: the high-correlation features and the low-correlation features of the current frame, and the features accumulated from previous time steps. We use the difference between feature maps to model the difference in semantics, so that the semantic-aware features repeated between the high-correlation features and the accumulated representation can be suppressed. In the difference operation, two individual convolution operations with ReLU activation are applied before the difference is taken. Then, the difference maps are aggregated by GAP to obtain the overall response for each channel. We introduce channel attention for feature selection, in which learnable parameters generate the channel weights that yield the final enhanced features. To fully exploit the low-correlation features, we design a memory block to accumulate the sub-critical cues. Specifically, we first concatenate the enhanced features with the low-correlation features of the current frame. Then, a residual block [he2016deep] is utilized to integrate the current low-correlation features, and the accumulated features are passed to the next EMU.
where is the residual block in [he2016deep]. In the first time step, is initialized with the mean of .
Bi-directional Information Integration. As described above, we design a bi-directional learning mechanism for assembling more robust representations. With the outputs of the EMUs in the forward and backward directions, we integrate them into the final video-level representation. Specifically, the enhanced feature maps and the accumulated features from the forward and backward directions are concatenated after GAP. Then, a fully connected layer is utilized to integrate the concatenated representations.
With the proposed temporal reciprocal learning mechanism, our method can progressively enhance the conspicuous features from high-correlation regions and adaptively mine the sub-critical details from low-correlation regions.
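The control flow of the bi-directional scheme can be sketched as two recurrent scans over the frame sequence, one in each temporal direction. The sketch below is structural only: the unit itself is passed in as a callable, the initial accumulated state is an explicit argument, and both directions share one unit, all of which are assumptions made for illustration. The `toy_emu` function is a deliberately simple stand-in and is not the EMU described in the text.

```python
def scan_direction(high_feats, low_feats, init_memory, emu):
    """Run the unit frame by frame, carrying the accumulated features along."""
    memory = init_memory
    enhanced = []
    for high_t, low_t in zip(high_feats, low_feats):
        enhanced_t, memory = emu(high_t, low_t, memory)
        enhanced.append(enhanced_t)
    return enhanced, memory


def temporal_reciprocal_learning(high_feats, low_feats, init_memory, emu):
    """Scan the sequence forward and backward and return both results."""
    fwd_enh, fwd_mem = scan_direction(high_feats, low_feats, init_memory, emu)
    bwd_enh, bwd_mem = scan_direction(
        high_feats[::-1], low_feats[::-1], init_memory, emu
    )
    bwd_enh.reverse()  # realign backward outputs with the original frame order
    return {"forward": (fwd_enh, fwd_mem), "backward": (bwd_enh, bwd_mem)}


def toy_emu(high_t, low_t, memory):
    """Toy stand-in unit (hypothetical): subtract what memory already holds,
    then add the current low-correlation features to memory."""
    enhanced_t = [h - m for h, m in zip(high_t, memory)]
    new_memory = [m + l for m, l in zip(memory, low_t)]
    return enhanced_t, new_memory


if __name__ == "__main__":
    high = [[1.0], [2.0], [3.0]]
    low = [[0.5], [0.5], [0.5]]
    out = temporal_reciprocal_learning(high, low, [0.0], toy_emu)
    print(out["forward"], out["backward"])
```

Even with this toy unit, the per-frame outputs of the two directions differ because each frame sees a different accumulated history, which is the property the integration step exploits.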
3.4 Training Schemes
In this paper, we adopt a binary cross-entropy loss and the Online Instance Matching (OIM) loss [xiao2017joint] to train the whole network, following [chen2018video]. For the probe-gallery sequence vector pairs in a training mini-batch, the binary cross-entropy loss is averaged over the sampled sequence pairs. Each pair is scored by a similarity estimation function, and its ground-truth label is 1 if the two sequences belong to the same person and 0 otherwise.
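A minimal Python sketch of such a verification loss is given below, assuming the standard binary cross-entropy form averaged over the sampled pairs and a similarity score already mapped into the interval (0, 1). The function name and the clamping constant `eps` are illustrative choices, not taken from the text.

```python
import math


def verification_loss(similarities, labels, eps=1e-12):
    """Mean binary cross-entropy over sequence pairs.

    similarities: estimated similarity of each pair, assumed to lie in (0, 1).
    labels: 1 if the pair shows the same person, 0 otherwise.
    """
    if not similarities or len(similarities) != len(labels):
        raise ValueError("need one label per similarity score")
    total = 0.0
    for s, y in zip(similarities, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(similarities)


if __name__ == "__main__":
    # One positive and one negative pair, both scored with high confidence.
    print(verification_loss([0.9, 0.1], [1, 0]))
```

The loss is small when positive pairs receive scores near 1 and negative pairs receive scores near 0, and it grows quickly for confident mistakes.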
Meanwhile, we use a multi-level training objective to deeply supervise our proposed modules, which consists of the frame-level OIM loss and the video-level OIM loss. Instead of the conventional cross-entropy with a multi-class softmax layer, the OIM loss function uses a lookup table to store the features of all identities in the training set. In the temporal reciprocal learning, the feature enhanced by the enhancement block at time index t is supervised by the frame-level OIM loss, which is related to the high-correlation map and aims to learn informative and continuous features from different frames. The frame-level OIM loss is computed from the enhanced high-correlation feature vector of the t-th image in the n-th video, with a label that is 1 if this image belongs to the i-th person and 0 otherwise. The lookup-table entry associated with the feature embedding of the i-th person is updated online with the frame-wise feature vectors of that person. Meanwhile, the feature accumulated by the memory block at the final time step is supervised by the video-level OIM loss, which attempts to progressively collect all the sub-critical details from the low-correlation regions.
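The lookup-table idea behind the OIM loss can be sketched as follows. This is a simplified illustration that keeps only the labeled lookup table (the circular queue for unlabeled identities in the original OIM formulation is omitted), and the temperature and momentum values are placeholder assumptions rather than settings reported here. All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def oim_loss(feature, target, lookup_table, temperature=0.1):
    """Negative log-probability of the target identity.

    Probabilities come from a softmax over the similarities between the
    feature and every identity entry stored in the lookup table.
    """
    logits = [
        sum(v * x for v, x in zip(entry, feature)) / temperature
        for entry in lookup_table
    ]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def update_lookup_table(feature, target, lookup_table, momentum=0.5):
    """Online update: blend the stored entry with the new feature, renormalize."""
    old = lookup_table[target]
    mixed = [momentum * v + (1.0 - momentum) * x for v, x in zip(old, feature)]
    lookup_table[target] = l2_normalize(mixed)


if __name__ == "__main__":
    table = [[1.0, 0.0], [0.0, 1.0]]       # two identities, 2-D embeddings
    x = l2_normalize([0.8, 0.6])           # feature of identity 0
    print(oim_loss(x, 0, table))
    update_lookup_table(x, 0, table)
    print(table[0])
```

Because the table is updated rather than learned by gradient descent, every identity keeps a running summary of its features even when it appears in only a few mini-batches.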
The total loss for training is a combination of the frame-level OIM loss, the video-level OIM loss and the verification loss.
4 Experiments
4.1 Datasets and Evaluation Protocols
To evaluate the performance of our proposed method, we adopt three widely used benchmarks, i.e., iLIDS-VID [wang2014person], PRID-2011 [hirzer2011person] and MARS [zheng2016mars]. The iLIDS-VID [wang2014person] dataset is a small dataset, which consists of 600 video sequences of 300 different identities. Two cameras are used to collect the images. Each video sequence contains 23 to 192 frames. The PRID-2011 [hirzer2011person] dataset consists of 400 image sequences for 200 identities from two non-overlapping cameras. The sequence lengths range from 5 to 675 frames, with an average of 100. Following previous practice [wang2014person], we only utilize the sequence pairs with more than 21 frames. MARS [zheng2016mars] is one of the large-scale datasets, and consists of 1,261 identities and around 18,000 video sequences. All the video sequences are captured by at least 2 cameras. To simulate actual detection conditions, the dataset includes around 3,200 distractor sequences.
For evaluation, we follow previous works and adopt the Cumulative Matching Characteristic (CMC) table and mean Average Precision (mAP) to evaluate the performance on all the datasets. For more details, we refer readers to the original paper [zheng2016person]. For iLIDS-VID and PRID2011, we only report the cumulative re-identification accuracy, because the gallery set contains only a single correct match for each query.
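The two metrics can be made concrete with a short Python sketch. It assumes that, for each query, the gallery has already been sorted by decreasing similarity and reduced to a list of match flags; protocol details such as filtering gallery sequences from the same camera as the query are left out. The function names are illustrative.

```python
def cmc_at_k(ranked_matches, k):
    """Fraction of queries with at least one correct match in the top k."""
    hits = sum(1 for flags in ranked_matches if any(flags[:k]))
    return hits / len(ranked_matches)


def average_precision(flags):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precisions = []
    for rank, is_match in enumerate(flags, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_matches):
    """Average precision averaged over all queries."""
    return sum(average_precision(f) for f in ranked_matches) / len(ranked_matches)


if __name__ == "__main__":
    # Two queries; True marks a gallery item with the same identity.
    ranked = [[False, True, True], [True, False, False]]
    print(cmc_at_k(ranked, 1), cmc_at_k(ranked, 2), mean_average_precision(ranked))
```

When every query has exactly one correct gallery match, the average precision reduces to the reciprocal of that match's rank, so the CMC curve already summarizes the ranking, which is consistent with reporting only CMC on iLIDS-VID and PRID2011.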
4.2 Implementation Details
We implement our framework based on the PyTorch (https://pytorch.org/) toolbox. The experimental devices include an Intel i4790 CPU and two NVIDIA GTX 2080Ti GPUs (12G memory). To generate training sequences, we employ the RRS strategy [fu2019sta], and divide each video sequence into 8 chunks of equal duration. Experimentally, we set the batch size to 16 and the sequence length T to 8. Each image in a sequence is resized to 256×128, and the input sequences are augmented by random cropping, horizontal flipping and random erasing. To provide a number of positive and negative sequence pairs in each training mini-batch, we first randomly sample half a batch of sequences. For each sampled sequence, we then select another sequence with the same identity but under a different camera to fill the batch. In this way, there is at least one positive sample for any sequence in a mini-batch. The ResNet-50 [he2016deep] pre-trained on the ImageNet dataset [deng2009imagenet] is used as our backbone network. Following previous works [sun2018beyond], we remove the last spatial down-sampling operation to increase the feature resolution. We train our network for 50 epochs with the combination of the multi-level OIM losses and the binary cross-entropy loss. The whole network is updated by the stochastic gradient descent [bottou2010large] algorithm with weight decay and a Nesterov momentum of 0.9. The learning rate is decayed by a factor of 10 every 15 epochs. We will release the source code for model reproduction.
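Two pieces of the training setup above lend themselves to a short Python sketch: the pair-based mini-batch construction and the step-wise learning-rate decay. The sketch assumes each sequence is described by a tuple of sequence id, person id and camera id, and it leaves the base learning rate as a free parameter. The function names and the data layout are hypothetical.

```python
import random


def sample_pair_batch(sequences, batch_size, rng=random):
    """Build a mini-batch in which every sequence has a positive partner.

    sequences: list of (sequence_id, person_id, camera_id) tuples.
    Half of the batch is drawn at random; each drawn sequence is followed
    by another sequence of the same person seen by a different camera.
    """
    def partners(anchor):
        return [s for s in sequences
                if s[1] == anchor[1] and s[2] != anchor[2]]

    # Only sequences with at least one cross-camera partner can be anchors.
    candidates = [s for s in sequences if partners(s)]
    anchors = rng.sample(candidates, batch_size // 2)
    batch = []
    for anchor in anchors:
        batch.extend([anchor, rng.choice(partners(anchor))])
    return batch


def step_decay_lr(base_lr, epoch, step=15, factor=0.1):
    """Learning rate divided by 10 every `step` epochs (epochs start at 0)."""
    return base_lr * factor ** (epoch // step)


if __name__ == "__main__":
    seqs = [(f"p{p}_c{c}", p, c) for p in range(4) for c in range(2)]
    print(sample_pair_batch(seqs, 4, random.Random(0)))
    print([step_decay_lr(1.0, e) for e in (0, 14, 15, 30, 49)])
```

With 50 epochs and a step of 15, this schedule applies three decays, so the last five epochs run at one thousandth of the base learning rate.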
| Methods | Feat. to test | mAP | Rank-1 | Rank-1 | Rank-5 | Rank-1 | Rank-5 |
4.3 Ablation Study
In this subsection, we conduct experiments to investigate the effectiveness of the proposed approach. The network is trained and evaluated on the MARS, iLIDS-VID and PRID2011 datasets. The results are shown in Tab. 1, 2 and 3. In these tables, “Baseline” represents the backbone trained only with the video-level OIM loss on the global branch, where the direct spatial-temporal average pooling operation is applied to the multiple frame feature maps obtained from the backbone. “GRL” represents our proposed global-guided reciprocal learning approach.
Effectiveness of Key Components. The ablation results of the key components on the three benchmark datasets are reported in Tab. 1. In this table, three feature vectors can be used for testing: the global feature vector, which is obtained without correlation estimation and trained with the video-level OIM loss; the final feature vector corresponding to the low-correlation maps, which is also supervised with the video-level OIM loss; and the final feature vector corresponding to the high-correlation maps, which is trained with the frame-level OIM loss. Compared with “Baseline”, “+ GCE” means that we add the global-guided correlation estimation to guide the learning of spatial features. We can see that, when the correlation maps are utilized, the performance improves significantly. The improvements indicate that it is beneficial to estimate the correlations from a global view.
Considering that the low-correlation regions in a frame could be meaningful for identification, we design a novel temporal reciprocal learning mechanism to execute different learning strategies. “+ TRL” means that the bi-directional temporal reciprocal learning module is used to enhance and accumulate temporal information. In Tab. 1, we find that, compared with “+ GCE”, the performance drops slightly when the low-correlation feature vector alone is used for testing. This is because most of the discriminative information is inhibited in the low-correlation regions. However, the high-correlation feature after reciprocal learning performs better than the global-guided approach alone, which verifies the effectiveness of the designed semantic enhancement block. Moreover, combining the low- and high-correlation feature vectors yields better performance than either one alone. The gains largely come from jointly exploiting the conspicuous and informative characteristics and the sub-critical details. Compared with the “+ GCE” model, our proposed TRL mechanism further improves both the mAP and the Rank-1 accuracy on MARS. The results demonstrate the effectiveness of our global-guided reciprocal learning.
Influence of the Temporal Direction. We perform experiments to investigate the effect of varying the direction of the temporal process. As shown in Tab. 2, the proposed temporal learning achieves similar results in the forward and backward directions. The bi-directional reciprocal learning shows higher performance, which benefits from combining forward and backward temporal learning. This shows that the features aggregated by reciprocal learning are more robust.
Influence of Different Sequence Lengths. We train and test our bi-directional temporal reciprocal learning module with various sequence lengths, and the results are shown in Tab. 3. From this table, we can see that increasing the sequence length improves the performance. One possible reason is that our temporal reciprocal learning collects more meaningful cues as the sequence length increases. However, excessively long sequences are detrimental to training the temporal reciprocal learning module.
4.4 Visualization Analysis
We visualize the learned high-correlation maps and the channel activation maps in Fig. 4. The second row of this figure shows the high-correlation maps guided by the global view. It can be clearly observed that the high-correlation maps focus on the foreground regions. The more spatially significant and temporally consecutive the visual information is, the higher the correlation value is. For example, the upper body usually has higher correlation values than the lower body. The third and fourth rows visualize the channel activation maps of the accumulated features after a residual block. Compared with the features learned from the high-correlation maps, the features from the low-correlation maps in the forward or backward process can capture the incoherent but meaningful cues, such as the shoes or bags marked with red bounding boxes in Fig. 4. Meanwhile, we find that, at the same time step, the features from the forward and backward processes differ to some extent, which is useful for assembling more discriminative information for identification and for improving the performance. The visual maps further validate that our method can simultaneously highlight the most significant and temporally aligned characteristics and pay more attention to the spatially sub-critical subjects.
4.5 Comparison with State-of-the-arts
In this section, the proposed approach is compared with state-of-the-art methods on three video-based person Re-ID benchmarks. The results are reported in Tab. 4, which lists the mAP and Rank-1 accuracy of our method on the MARS dataset and its Rank-1 accuracy on the iLIDS-VID and PRID2011 datasets. We can see that the Rank-1 accuracy of our method outperforms all the compared methods, showing a significant improvement over the best existing state-of-the-art method. We notice that MGRA [zhang2020multi] also employs the global view to aid the video-based person Re-ID task, and achieves a remarkable mAP on the MARS dataset. Different from MGRA, which uses a reference representation to model the relations between the global feature and all the node features, our method utilizes the global vectors to estimate two correlation maps for disentangling the spatial features. Together with the proposed reciprocal learning, our method can take full advantage of the disentangled features, and can explore more informative and fine-grained cues via the high- and low-correlation maps. Thereby, our method surpasses MGRA in terms of Rank-1 accuracy on MARS, iLIDS-VID and PRID2011. Meanwhile, it is worth noting that ASTPN [mclaughlin2016recurrent], STMP [liu2019spatial] and GLTR [li2019global] explore temporal learning for video-based person Re-ID. ASTPN [mclaughlin2016recurrent] utilizes a temporal RNN to model the temporal information for video representations. STMP [liu2019spatial] introduces a refining recurrent unit to recover the missing parts by referring to historical frames. GLTR [li2019global] employs dilated temporal convolutions to capture the multi-granular temporal dependencies and aggregates the obtained short- and long-term temporal cues into a global-local temporal representation.
Compared with these temporal learning methods, our proposed method achieves better results on the three public datasets. Specifically, our method improves both the mAP and the Rank-1 accuracy on the MARS dataset. In summary, compared to existing methods, our method utilizes the global information to guide the feature disentanglement. In addition, we adopt two strategies for mining richer cues in temporal learning, which can fully exploit the spatial-temporal information for more discriminative video representations. These experimental results validate the superiority of our method.
5 Conclusion
In this paper, we propose a novel global-guided reciprocal learning framework for video-based person Re-ID. We present a GCE module to estimate the correlation degree of each pixel-level feature in one frame. Then, the spatial features are disentangled into high- and low-correlation features. To take full advantage of the abundant visual information in sequences for identification, we propose a novel TRL module. In the temporal reciprocal learning module, multiple enhancement and memory units are designed and connected in series for temporal learning. Besides, a bi-directional process is arranged for forward and backward learning to assemble more robust representations. Based on the proposed modules, our approach can not only enhance the conspicuous and aligned information from the high-correlation features, but also accumulate sub-critical and fine-grained cues from the low-correlation features. Extensive experiments on public benchmarks show that our framework outperforms state-of-the-art methods.