PyTorch implementation of "Appearance-Preserving 3D Convolution for Video-based Person Re-identification"
Due to imperfect person detection results and posture changes, temporal appearance misalignment is unavoidable in video-based person re-identification (ReID). In this case, 3D convolution may destroy the appearance representation of person video clips, which is harmful to ReID. To address this problem, we propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel. With APM aligning the adjacent feature maps at the pixel level, the following 3D convolution can model temporal information while maintaining the appearance representation quality. It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds. Extensive experiments demonstrate the effectiveness of AP3D for video-based ReID, and the results on three widely used datasets surpass the state of the art. Code is available at: https://github.com/guxinqian/AP3D.
Video-based person re-identification (ReID) [32, 11, 13] plays a crucial role in intelligent video surveillance systems. Compared with image-based ReID [28, 12], the main difference is that the query and gallery in video-based ReID are both videos and contain additional temporal information. Therefore, dealing effectively with the temporal relations between video frames is of central importance in video-based ReID.
The most commonly used temporal information modeling methods in computer vision include LSTM [10, 23], 3D convolution [29, 2, 24], and the Non-local operation. LSTM and 3D convolution are adept at dealing with local temporal relations and encoding the relative position. Some researchers have demonstrated that 3D convolution is superior to CNN+LSTM on video classification tasks. In contrast, the Non-local operation does not encode the relative position, but it can model long-range temporal dependencies. These methods are complementary to each other. In this paper, we mainly focus on improving existing 3D convolution to make it more suitable for video-based ReID.
Recently, some researchers [19, 17] have tried to introduce 3D convolution to video-based ReID. However, they neglect that, compared with other video-based tasks, the video sample in video-based ReID consists of a sequence of bounding boxes produced by a pedestrian detector [25, 35] (see Figure 1), not the original video frames. Due to the imperfect person detection algorithm, some resulting bounding boxes are smaller (see Figure 1 (a)) or bigger (see Figure 1 (b)) than the ground truths. In this case, because of the resizing operation before feeding into a neural network, the same spatial positions in adjacent frames may belong to different body parts, and the same body parts in adjacent frames may be scaled to different sizes. Even when the detection results are accurate, the misalignment problem may still exist due to posture changes of the target person (see Figure 1 (c)). Note that one 3D convolution kernel processes the features at the same spatial position in adjacent frames into one value. When temporal appearance misalignment exists, 3D convolution may mix features belonging to different body parts in adjacent frames into one feature, which destroys the appearance representations of person videos. Since the performance of video-based ReID relies heavily on the appearance representation, this appearance destruction is harmful. Therefore, it is desirable to develop a new 3D convolution method which can model temporal relations while maintaining appearance representation quality.
In this paper, we propose Appearance-Preserving 3D convolution (AP3D) to address the appearance destruction problem of existing 3D convolution. As shown in Figure 1 (d), AP3D is composed of an Appearance-Preserving Module (APM) and a 3D convolution kernel. For each central feature map, APM reconstructs its adjacent feature maps according to the cross-pixel semantic similarity and guarantees the temporal appearance alignment between the reconstructed and central feature maps. The reconstruction process of APM can be considered as feature map registration between two frames. As for the problem of asymmetric appearance information (e.g., in Figure 1 (a), the first frame does not contain foot region, thus can not be aligned with the second frame perfectly), Contrastive Attention is proposed to find the unmatched regions between the reconstructed and central feature maps. Then, the learned attention mask is imposed on the reconstructed feature map to avoid error propagation. With APM guaranteeing the appearance alignment, the following 3D convolution can model the spatiotemporal information more effectively and enhance the video representation with higher discriminative ability but no appearance destruction. Consequently, the performance of video-based ReID can be greatly improved. Note that the learning process of APM is unsupervised. In other words, no extra correspondence annotations are required, and the model can be trained only with identification supervision.
The proposed AP3D can be easily combined with existing 3D ConvNets (e.g., I3D and P3D) just by replacing the original 3D convolution kernels with AP3Ds. Extensive ablation studies on two widely used datasets indicate that AP3D outperforms existing 3D convolution significantly. Using RGB information only and without any bells and whistles (e.g., optical flow, complex feature matching strategy), AP3D achieves state-of-the-art results on both datasets.
In summary, the main contributions of our work lie in three aspects: (1) finding that existing 3D convolution is problematic for extracting appearance representation when misalignment exists; (2) proposing an AP3D method to address this problem by aligning the feature maps in pixel level according to semantic similarity before convolution operation; (3) achieving superior performance on video-based ReID compared with state-of-the-art methods.
Video-based ReID. Compared with image-based ReID, the samples in video-based ReID contain more frames and additional temporal information. Therefore, some existing methods [17, 32, 4, 22] attempt to model the additional temporal information to enhance the video representations. In contrast, other methods [21, 18, 27, 3] extract video frame features just using image-based ReID model and explore how to integrate or match multi-frame features. In this paper, we try to solve video-based ReID through developing an improved 3D convolution model for better spatiotemporal feature representation.
Temporal Information Modeling. The widely used temporal information modeling methods in computer vision include LSTM [10, 23], 3D convolution [29, 2], and the Non-local operation. LSTM and 3D convolution are adept at modeling local temporal relations and encoding the relative position, while the Non-local operation can deal with long-range temporal relations. They are complementary to each other. Zisserman et al. have demonstrated that 3D convolution outperforms CNN+LSTM on the video classification task. In this paper, we mainly improve the original 3D convolution to avoid the appearance destruction problem and also attempt to combine the proposed AP3D with some existing 3D ConvNets.
Image Registration. Image registration aligns two or more images of the same scene. These images may be obtained at different times, from different viewpoints or different modalities. The spatial relations between these images may be estimated using rigid, affine, or complex deformation models. As for the proposed method, the alignment operation of APM can be considered as feature map registration. The feature maps are obtained at sequential times and the subject, a person, is non-rigid.
In this section, we first illustrate the overall framework of the proposed AP3D. Then, the details of the core module, i.e., the Appearance-Preserving Module (APM), are explained, followed by a discussion. Finally, we introduce how to combine AP3D with existing 3D ConvNets.
3D convolution is widely used for video classification and achieves state-of-the-art performance. Recently, some researchers [19, 17] have introduced it to video-based ReID. However, they neglect that the performance of ReID tasks is highly dependent on the appearance representation, instead of the motion representation. Due to imperfect detection results or posture changes, appearance misalignment is unavoidable in video-based ReID samples. In this case, existing 3D convolutions, which process the same spatial position across adjacent frames as a whole, may destroy the appearance representation of person videos and are therefore harmful to ReID.
In this paper, we propose a novel AP3D method to address the above problem. The proposed AP3D is composed of an APM and a following 3D convolution. An example of AP3D with a temporal kernel size of 3 is shown in Figure 2. Specifically, given an input tensor with $T$ frames, each frame is considered as the central frame. We first sample two neighbors for each frame and obtain $2T$ adjacent feature maps in total after padding zeros. Secondly, APM is used to reconstruct each adjacent feature map to guarantee the appearance alignment with the corresponding central feature map. Then, we integrate the reconstructed adjacent feature maps and the original input feature maps to form a temporary tensor with $3T$ frames. Finally, the convolution with stride (3, 1, 1) is performed and an output tensor with $T$ frames is produced. With APM guaranteeing appearance alignment, the following 3D convolution can model temporal relations without appearance destruction. The details of APM are presented in the next subsection.
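The frame arrangement described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the function name `build_temporary_tensor`, the optional `reconstruct` callable standing in for APM, the 3×3 spatial kernel size and the toy tensor sizes are all assumptions made for the example. Only the two zero-padded neighbors per frame and the stride of (3, 1, 1) come from the text.

```python
import torch
import torch.nn as nn


def build_temporary_tensor(x, reconstruct=None):
    """Interleave each frame with its two temporal neighbors.

    x: tensor of shape (B, C, T, H, W).
    reconstruct: optional callable (central, adjacent) -> aligned adjacent,
        a stand-in for APM (hypothetical hook); identity when None.
    Returns a tensor of shape (B, C, 3T, H, W) ordered as
    [prev_1, x_1, next_1, prev_2, x_2, next_2, ...].
    """
    b, c, t, h, w = x.shape
    zeros = torch.zeros_like(x[:, :, :1])
    prev = torch.cat([zeros, x[:, :, :-1]], dim=2)  # neighbor at t-1, zero-padded
    nxt = torch.cat([x[:, :, 1:], zeros], dim=2)    # neighbor at t+1, zero-padded
    if reconstruct is not None:
        prev = reconstruct(x, prev)
        nxt = reconstruct(x, nxt)
    stacked = torch.stack([prev, x, nxt], dim=3)    # (B, C, T, 3, H, W)
    return stacked.reshape(b, c, 3 * t, h, w)


x = torch.randn(2, 8, 4, 16, 8)                     # toy clip with T = 4 frames
temp = build_temporary_tensor(x)
# Temporal stride 3 makes each output frame see exactly one (prev, central, next) triple.
conv = nn.Conv3d(8, 8, kernel_size=(3, 3, 3), stride=(3, 1, 1), padding=(0, 1, 1))
out = conv(temp)
print(temp.shape, out.shape)
```

Because the temporal kernel size equals the temporal stride, the convolution never mixes features from two different central frames, which is why the output again has one frame per input frame.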
Feature Map Registration. The objective of APM is to reconstruct each adjacent feature map so that the same spatial position on the reconstructed and corresponding central feature maps belongs to the same body part. It can be considered as a graph matching or registration task between each pair of feature maps. On one hand, since the human body is a non-rigid object, a simple affine transformation cannot achieve this goal. On the other hand, existing video-based ReID datasets do not have extra correspondence annotations. Therefore, the registration process is not straightforward.
We notice that the middle-level features from a ConvNet contain some semantic information. In general, features with the same appearance have higher cosine similarity, while features with different appearances have lower cosine similarity [1, 13]. As shown in Figure 3, the red crosses indicate the same position on the central (Figure 3 (a)) and adjacent (Figure 3 (b)) frames, but they belong to different body parts. We compute the cross-pixel cosine similarities between the marked position on the central feature map and all positions on the adjacent feature map. After normalization, the similarity distribution is visualized in Figure 3 (c). It can be seen that the region with the same appearance is highlighted. Hence, in this paper, we locate the corresponding positions in adjacent frames according to the cross-pixel similarities to achieve feature map registration.
Since the scales of the same body part on the adjacent feature maps may be different, one position on the central feature map may have several corresponding pixels on its adjacent feature map, and vice versa. Therefore, filling the corresponding position on the reconstructed feature map with only the most similar position on the original adjacent feature map is not accurate. To include all pixels with the same appearance, we compute the response $y_i$ at each position on the reconstructed adjacent feature map as a weighted sum of the features $x_j$ at all positions on the original adjacent feature map:

$$y_i = \frac{\sum_j f(c_i, x_j)\, x_j}{\sum_j f(c_i, x_j)},$$

where $c_i$ is the feature on the central feature map with the same spatial position as $y_i$, and $f(c_i, x_j)$ is defined as the cosine similarity between $g(c_i)$ and $g(x_j)$ with a scale factor $s > 0$:

$$f(c_i, x_j) = \exp\!\left(s \cdot \frac{g(c_i) \cdot g(x_j)}{\lVert g(c_i) \rVert \, \lVert g(x_j) \rVert}\right).$$

Here $g(\cdot)$ is a linear transformation that maps the features to a low-dimensional space. The scale factor $s$ is used to adjust the range of cosine similarities. A large $s$ makes the relatively high similarities even higher and the relatively low similarities lower. As shown in Figure 3 (c), with a reasonable scale factor $s$, APM can locate the corresponding region on the adjacent feature map precisely. In this paper, we set the scale factor to 4.
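The registration step can be sketched as a similarity-weighted sum in PyTorch. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `register_adjacent` and the optional embedding argument `g` are hypothetical, and normalizing the scaled cosine similarities with a softmax over the adjacent positions is an assumed choice that matches the description of a normalized, scale-sharpened similarity distribution. The default scale of 4 is the value stated in the text.

```python
import torch
import torch.nn.functional as F


def register_adjacent(central, adjacent, g=None, scale=4.0):
    """Reconstruct `adjacent` so that it aligns spatially with `central`.

    central, adjacent: tensors of shape (B, C, H, W).
    g: optional embedding module applied to both maps before comparison
       (hypothetical hook; features are compared directly when None).
    scale: factor applied to the cosine similarities before normalization.
    """
    b, c, h, w = central.shape
    q = central if g is None else g(central)
    k = adjacent if g is None else g(adjacent)
    q = F.normalize(q.flatten(2), dim=1)           # (B, C', HW), unit norm per position
    k = F.normalize(k.flatten(2), dim=1)
    sim = torch.bmm(q.transpose(1, 2), k)          # (B, HW, HW) cosine similarities
    weights = F.softmax(scale * sim, dim=-1)       # normalize over adjacent positions
    values = adjacent.flatten(2).transpose(1, 2)   # (B, HW, C)
    out = torch.bmm(weights, values)               # weighted sum for each central position
    return out.transpose(1, 2).reshape(b, c, h, w)


torch.manual_seed(0)
central = torch.randn(1, 32, 4, 4)
adjacent = torch.roll(central, shifts=1, dims=3)   # same content, shifted by one pixel
aligned = register_adjacent(central, adjacent, scale=200.0)
print(torch.allclose(aligned, central, atol=1e-3))
```

In the toy usage, a very large scale turns the weighted sum into a near hard assignment, so the shifted map is pulled back onto the central one; a moderate scale instead blends several similar positions, which is the behavior the text argues for when body parts differ in size.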
Contrastive Attention. Due to pedestrian detection errors, some regressed bounding boxes are smaller than the ground truths, so some body parts may be lost in the adjacent frames (see Figure 1 (a)). In this case, the adjacent feature maps cannot align with the central feature map perfectly. To avoid error propagation caused by imperfect registration, Contrastive Attention is proposed to find the unmatched regions between the reconstructed and central feature maps. Then, the learned attention mask is imposed on the reconstructed feature map. The final response $z_i$ at each position on the reconstructed feature map is defined as:

$$z_i = \mathrm{ContrastiveAtt}(c_i, y_i) \cdot y_i.$$

Here $\mathrm{ContrastiveAtt}(c_i, y_i)$ produces an attention value in $[0, 1]$ according to the semantic similarity between $c_i$ and $y_i$:

$$\mathrm{ContrastiveAtt}(c_i, y_i) = \mathrm{sigmoid}\big(w^{\top}(\theta(c_i) \odot \phi(y_i))\big),$$

where $w$ is a learnable weight vector implemented by a $1 \times 1$ convolution, and $\odot$ is the Hadamard product. Since $c_i$ and $y_i$ are from the central and reconstructed feature maps respectively, we use two asymmetric mapping functions $\theta(\cdot)$ and $\phi(\cdot)$ to map $c_i$ and $y_i$ to a shared low-dimensional semantic space.
The registration and contrastive attention of APM are illustrated in Figure 4. All three semantic mappings, i.e., $g$, $\theta$ and $\phi$, are implemented by $1 \times 1$ convolution layers. To reduce the computation, these convolution layers use a reduced number of output channels.
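A contrastive attention mask of this kind can be sketched as a small PyTorch module. The sketch makes several assumptions that are not confirmed by the text as extracted: the class name `ContrastiveAttention`, the use of a sigmoid to squash the score into an attention value, and the channel `reduction` ratio of 16 are illustrative choices. What follows the text is the use of two separate mappings for the central and reconstructed maps, a Hadamard product of their outputs, and a learnable weight vector realized as a convolution.

```python
import torch
import torch.nn as nn


class ContrastiveAttention(nn.Module):
    """Mask the reconstructed map where it disagrees with the central map."""

    def __init__(self, channels, reduction=16):  # reduction ratio is an assumed value
        super().__init__()
        inner = max(channels // reduction, 1)
        self.theta = nn.Conv2d(channels, inner, kernel_size=1)  # maps central features
        self.phi = nn.Conv2d(channels, inner, kernel_size=1)    # maps reconstructed features
        self.w = nn.Conv2d(inner, 1, kernel_size=1)             # learnable weight vector

    def forward(self, central, reconstructed):
        joint = self.theta(central) * self.phi(reconstructed)   # Hadamard product
        mask = torch.sigmoid(self.w(joint))                     # (B, 1, H, W), values in (0, 1)
        return mask * reconstructed, mask


torch.manual_seed(0)
attention = ContrastiveAttention(channels=64)
central = torch.randn(2, 64, 16, 8)
reconstructed = torch.randn(2, 64, 16, 8)
masked, mask = attention(central, reconstructed)
print(masked.shape, mask.shape)
```

The mask has a single channel, so every feature channel at a given position is attenuated by the same amount; positions judged unmatched contribute less to the following 3D convolution.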
Relations between APM and Non-local. APM and Non-local (NL) operation can be viewed as two graph neural network modules. Both modules consider the feature at each position on feature maps as a node in graph and use weighted sum to estimate the feature. But they have many differences:
(a) NL aims to use spatiotemporal information to enhance feature and its essence is graph convolution or self-attention on a spatiotemporal graph. In contrast, APM aims to reconstruct adjacent feature maps to avoid appearance destruction by the following 3D Conv. Its essence is graph matching or registration between two spatial graphs.
(b) The weights in the weighted sum in NL are used for building dependencies between each pair of nodes only and do not have specific meaning. In contrast, APM defines the weights using cosine similarity with a reasonable scale factor, in order to find the positions with the same appearance on the adjacent feature maps accurately (see Figure 3).
(c) After APM, the integrated feature maps in Figure 2 can still maintain spatiotemporal relative relations to be encoded by the following 3D Conv, while NL cannot.
(d) Given a spatiotemporal graph with $N$ frames, the computational complexity of NL is $O(N^2)$, while the computational complexity of APM is only $O(N)$, much lower than NL.
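The difference in cost can be made concrete by counting frame-to-frame comparisons. The sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, and it assumes that NL relates every frame to every frame while APM relates each frame only to its two temporal neighbors, as described in the framework above. Per-pixel costs within a frame pair are the same order for both and are left out.

```python
def nonlocal_frame_pairs(n_frames):
    # Every frame attends to every frame (including itself): quadratic growth.
    return n_frames * n_frames


def apm_frame_pairs(n_frames, neighbors=2):
    # Every frame is registered against a fixed number of neighbors: linear growth.
    return neighbors * n_frames


for n in (4, 8, 16):
    print(n, nonlocal_frame_pairs(n), apm_frame_pairs(n))
```

Doubling the clip length doubles the APM count but quadruples the NL count, which is the gap the complexity comparison refers to.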
Relations between Contrastive Attention and Spatial Attention. The Contrastive Attention in APM aims to find the unmatched regions between two frames to avoid error propagation caused by imperfect registration, while the widely used spatial attention in ReID aims to locate more discriminative regions for each frame. As for formulation, Contrastive Attention takes two feature maps as inputs and is imposed on the reconstructed feature map, while Spatial Attention takes one feature map as input and is imposed on itself.
To leverage successful 3D ConvNet designs, we combine the proposed AP3D with I3D and P3D Residual blocks. Converting I3D and P3D Residual blocks to their AP3D versions only requires replacing the original temporal convolution kernel with an AP3D of the same kernel size. The C2D, AP-I3D and AP-P3D versions of Residual blocks are shown in Figure 5.
To investigate the effectiveness of AP3D for video-based ReID, we use the 2D ConvNet (C2D) from prior work as our baseline method and extend it into an AP3D ConvNet with the proposed AP3D. The details of the network architectures are described in Section 4.1, and the loss function we use is introduced in Section 4.2.
C2D baseline. We use ResNet-50 pre-trained on ImageNet as the backbone and remove one of its down-sampling operations to enrich the granularity. Given an input video clip with $T$ frames, it outputs a feature tensor for the clip, and a normalization operation [14] is used to normalize the resulting feature. The C2D baseline does not involve any temporal operations except the final temporal average pooling.
AP3D ConvNet. We replace some 2D Residual blocks with AP3D Residual blocks to turn C2D into an AP3D ConvNet for spatiotemporal feature learning. Specifically, we investigate replacing one, half, or all of the Residual blocks in one stage of ResNet, and the results are reported in Section 5.4.
Datasets. We evaluate the proposed method on three video-based ReID datasets, i.e., MARS, DukeMTMC-VideoReID and iLIDS-VID. Since MARS and DukeMTMC-VideoReID have fixed train/test splits, for convenience, we perform ablation studies mainly on these two datasets. In addition, we report the final results on iLIDS-VID to compare with the state of the art.
Training. In the training stage, for each video tracklet, we randomly sample 4 frames with a stride of 8 frames to form a video clip. Each batch contains 8 persons, each person with 4 video clips. We resize all the video frames to a fixed size and use horizontal flip for data augmentation. As for the optimizer, Adam with weight decay 0.0005 is adopted to update the parameters. We train the model for 240 epochs in total. The learning rate is multiplied by 0.1 after every 60 epochs.
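The clip sampling and the step learning-rate schedule can be sketched in plain Python. The function names, the handling of tracklets shorter than the sampling span, and the `base_lr` argument are assumptions made for illustration; the initial learning rate itself is not recoverable from the text here, so it is left as a parameter. The values taken from the text are 4 frames per clip, a stride of 8, and a decay by 0.1 every 60 epochs over 240 epochs.

```python
import random


def sample_clip_indices(tracklet_len, num_frames=4, stride=8, rng=random):
    """Pick `num_frames` frame indices spaced `stride` apart from a random start."""
    span = (num_frames - 1) * stride + 1
    if tracklet_len >= span:
        start = rng.randint(0, tracklet_len - span)
        return [start + i * stride for i in range(num_frames)]
    # Short tracklet (assumed handling): wrap around so that indices stay valid.
    start = rng.randint(0, tracklet_len - 1)
    return [(start + i * stride) % tracklet_len for i in range(num_frames)]


def learning_rate(epoch, base_lr, step=60, gamma=0.1):
    """Step schedule: multiply the learning rate by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


print(sample_clip_indices(100))
print([learning_rate(e, base_lr=1.0) for e in (0, 60, 120, 180)])
```

With 8 persons per batch and 4 clips per person, each training batch under this scheme holds 32 clips, or 128 frames in total.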
In the test phase, for each video tracklet, we first split it into several 32-frame video clips. Then we extract the feature representation for each video clip and the final video feature is the averaged representation of all clips. After feature extraction, the cosine distances between the query and gallery features are computed, based on which the retrieval is performed.
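The test-time procedure can be sketched with NumPy as follows. The helper names are hypothetical, and keeping a shorter final clip when the tracklet length is not a multiple of 32 is an assumption, since the text does not say how the remainder is handled. The feature extractor itself is out of scope here; clip features are taken as given vectors.

```python
import numpy as np


def split_into_clips(num_frames, clip_len=32):
    """Split frame indices into consecutive clips; the last clip may be shorter."""
    return [list(range(s, min(s + clip_len, num_frames)))
            for s in range(0, num_frames, clip_len)]


def video_feature(clip_features):
    """Average the per-clip feature vectors into one video-level feature."""
    return np.mean(np.stack(clip_features), axis=0)


def cosine_distance(query, gallery):
    """Pairwise cosine distance between rows of `query` and rows of `gallery`."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return 1.0 - q @ g.T


def rank_gallery(query, gallery):
    """Gallery indices sorted from most to least similar for each query."""
    return np.argsort(cosine_distance(query, gallery), axis=1)


clips = split_into_clips(70)
feature = video_feature([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
query = np.array([[1.0, 0.0]])
gallery = np.array([[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])
print([len(c) for c in clips], feature, rank_gallery(query, gallery))
```

Because cosine distance ignores vector length, the second gallery entry ranks first in the toy example even though its norm differs from the query's.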
AP3D vs. original 3D convolution. To verify the effectiveness and generalization ability of the proposed AP3D, we implement I3D and P3D residual blocks using AP3D and the original 3D convolution, respectively. Then, we replace one 2D block with a 3D block for every 2 residual blocks in stage$_2$ and stage$_3$ of C2D ConvNets, so that 5 residual blocks in total are replaced. As shown in Table 1, compared with the C2D baseline, I3D and P3D show close or lower results due to appearance destruction. With APM aligning the appearance representation, the corresponding AP3D versions improve the performance significantly and consistently on both datasets with few additional parameters and little extra computational complexity. Specifically, AP3D increases top-1 by about 1% and mAP by about 2% over I3D and P3D on the MARS dataset. Note that the mAP improvement on DukeMTMC-VideoReID is smaller than that on MARS. One possible explanation is that the bounding boxes of video samples in the DukeMTMC-VideoReID dataset are manually annotated, so the appearance misalignment is less serious and the improvement from AP3D is smaller.
Compared with the other variants, AP-P3D-C achieves the best performance in most settings. We therefore conduct the following experiments based on AP-P3D-C (denoted as AP3D for short) unless otherwise noted.
|Method||Params (M)||GFLOPs||MARS top-1||MARS mAP||Duke top-1||Duke mAP|
|Deformable 3D Conv||27.75||19.53||88.5||81.9||95.2||95.0|
AP3D vs. Non-local. Both APM in AP3D and Non-local (NL) are graph-based methods. We insert the same 5 NL blocks into C2D ConvNets and compare AP3D with NL in Table 2. It can be seen that, with fewer parameters and less computational complexity, AP3D outperforms NL on both datasets.
For a fairer comparison, we also implement Contrastive Attention embedded Non-local (CA-NL) and the combination of NL and P3D (NL-P3D). As shown in Table 2, CA-NL achieves the same result as NL on MARS and is still inferior to AP3D. On DukeMTMC-VideoReID, the top-1 of CA-NL is even lower than that of NL. A likely reason is that the Contrastive Attention in APM is designed to avoid error propagation caused by imperfect registration, whereas the essence of NL is graph convolution on a spatiotemporal graph, not graph registration. So NL cannot co-work with Contrastive Attention. Besides, since P3D cannot handle appearance misalignment in video-based ReID, NL-P3D shows results close to NL and is also inferior to AP3D. With APM aligning the appearance, further improvement is achieved by NL-AP3D. This result demonstrates that AP3D and NL are complementary to each other.
AP3D vs. other methods for temporal information modeling. We also compare AP3D with Deformable 3D convolution and CNN+LSTM. For a fair comparison, the same backbone and hyper-parameters are used. As shown in Table 2, AP3D outperforms these two methods significantly on both datasets. This comparison further demonstrates the effectiveness of AP3D for learning temporal cues.
Effective positions to place AP3D blocks. Table 3 compares the results of replacing a residual block with an AP3D block in different stages of the C2D ConvNet. In each of these stages, the second-to-last residual block is replaced with the AP3D block. It can be seen that the improvements from placing the AP3D block in stage$_2$ and stage$_3$ are similar. Notably, the results of placing only one AP3D block in stage$_2$ or stage$_3$ surpass the results of placing 5 P3D blocks in stage$_2$ and stage$_3$. However, the results of placing the AP3D block in stage$_1$ or stage$_4$ are worse than the C2D baseline. It is likely that the low-level features in stage$_1$ are insufficient to provide precise semantic information, so APM in AP3D cannot align the appearance representation very well. In contrast, the features in stage$_4$ are insufficient to provide precise spatial information, so the improvement from appearance alignment is also limited. Hence, we only consider replacing the residual blocks in stage$_2$ and stage$_3$.
How many blocks should be replaced by AP3D? Table 3 also shows the results with more AP3D blocks. We investigate replacing 2 blocks (1 for each stage), 5 blocks (half of the residual blocks in stage$_2$ and stage$_3$) and 10 blocks (all residual blocks in stage$_2$ and stage$_3$) in the C2D ConvNet. It can be seen that more AP3D blocks generally lead to higher performance. We argue that more AP3D blocks can perform more temporal communication, which can hardly be realized by the C2D model. As for the results with 10 blocks, the performance drop may be due to overfitting caused by the excessive parameters.
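The three replacement settings can be checked against the standard ResNet-50 layout, which has 3, 4, 6 and 3 residual blocks in its four stages. The sketch below is illustrative only: the stage labels, the function name `blocks_to_replace`, and the choice of which block in each pair is replaced under the "half" setting are assumptions. It confirms that the two middle stages give 2, 5 and 10 replaced blocks for the three settings.

```python
RESNET50_BLOCKS = {"stage1": 3, "stage2": 4, "stage3": 6, "stage4": 3}


def blocks_to_replace(num_blocks, mode):
    """Return the indices of residual blocks in one stage to convert to AP3D."""
    if mode == "one":
        return [num_blocks - 2]               # second-to-last block of the stage
    if mode == "half":
        return list(range(0, num_blocks, 2))  # one block out of every two (assumed: the first)
    if mode == "all":
        return list(range(num_blocks))
    raise ValueError(f"unknown mode: {mode}")


for mode in ("one", "half", "all"):
    plan = {stage: blocks_to_replace(RESNET50_BLOCKS[stage], mode)
            for stage in ("stage2", "stage3")}
    print(mode, plan, sum(len(v) for v in plan.values()))
```

The same helper applied to ResNet-18 or ResNet-34 would only need a different block-count table.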
Effectiveness of AP3D across different backbones. We also investigate the effectiveness and generalization ability of AP3D across different backbones. Specifically, we replace half of the residual blocks in stage$_2$ and stage$_3$ of ResNet-18 and ResNet-34 with AP3D blocks. As shown in Table 5, AP3D improves the results of these two architectures significantly and consistently on both datasets. In particular, on the MARS dataset, AP3D-ResNet-18 is superior to both its ResNet-18 counterparts (C2D and P3D) and the deeper ResNet-34, a model which has almost double the number of parameters and computational complexity. This comparison shows that the effectiveness of AP3D does not rely on additional parameters and computational load.
The effectiveness of Contrastive Attention. As described in Section 3.2, we use Contrastive Attention to avoid error propagation of imperfect registration caused by asymmetric appearance information. To verify its effectiveness, we reproduce AP3D with and without Contrastive Attention (CA); the experimental results on MARS, a dataset produced by a pedestrian detector, are shown in Table 5. It can be seen that, without Contrastive Attention, AP-I3D and AP-P3D can still increase the performance of the I3D and P3D baselines by a considerable margin. With Contrastive Attention applied on the reconstructed feature map, the results of AP-I3D and AP-P3D are further improved.
The influence of the scale factor $s$. As discussed in Section 3.2, the larger the scale factor $s$, the higher the weights of pixels with high similarity. We show the experimental results with varying $s$ on the MARS dataset in Figure 7. It can be seen that AP3D with different scale factors consistently improves over the baseline and the best performance is achieved when $s = 4$.
|Method||Modality||MARS top-1||MARS mAP||Duke top-1||Duke mAP||iLIDS-VID top-1|
|Snippet||RGB + Flow||86.3||76.1||-||-||85.4|
|AttDriven||RGB + Att.||87.0||78.2||-||-||86.3|
We select some misaligned samples and visualize the original feature maps and the reconstructed feature maps after APM in Figure 7. It can be seen that, before APM, the highlighted regions of the central feature map and the adjacent feature map each focus on their own foreground and are misaligned. After APM, the highlighted regions of the reconstructed feature maps are aligned with respect to the foreground of the corresponding central frame. This further validates the alignment mechanism of APM.
We compare the proposed method with state-of-the-art video-based ReID methods which use the same backbone on MARS, DukeMTMC-VideoReID, and iLIDS-VID datasets. The results are summarized in Table 6. Note that these comparison methods differ in many aspects, e.g., using information from different modalities. Nevertheless, using RGB only and with a simple feature integration strategy (i.e. temporal average pooling), the proposed AP3D surpasses all these methods consistently on these three datasets. Especially, AP3D achieves 85.1% mAP on MARS dataset. When combined with Non-local, further improvement can be obtained.
In this paper, we propose a novel AP3D method for video-based ReID. AP3D consists of an APM and a 3D convolution kernel. With APM guaranteeing the appearance alignment across adjacent feature maps, the following 3D convolution can model temporal information while maintaining the appearance representation quality. In this way, the proposed AP3D addresses the appearance destruction problem of the original 3D convolution. It is easy to combine AP3D with existing 3D ConvNets. Extensive experiments verify the effectiveness and generalization ability of AP3D, which surpasses state-of-the-art methods on three widely used datasets. As future work, we will extend AP3D to make it a basic operation in deep neural networks for various video-based recognition tasks.
Acknowledgement This work is partially supported by Natural Science Foundation of China (NSFC): 61876171 and 61976203.
A two stream siamese convolutional neural network for person re-identification. In ICCV.
CosFace: large margin cosine loss for deep face recognition. In CVPR.