Person re-identification (Re-ID) tackles the problem of retrieving pedestrian images/videos across non-overlapping cameras. Previous approaches mostly focus on image-based Re-ID, where each pedestrian possesses multiple images for retrieval [market, trip, cuhk03, viper, dukefeature, msmt17, grid, caviar]. Recently, video-based Re-ID has drawn significant attention in the literature since retrieving pedestrian videos is more realistic and critical in real-world surveillance applications [prid2011, ilidsvid, mars, dukevideo]. With the emergence of large-scale video-based Re-ID datasets [mars, dukevideo], researchers have designed deep neural networks to learn robust representations for videos [mars, rcnn, forest, diversity, snip, sta].
To perform video-based Re-ID, typical methods learn a mapping function that projects video sequences into a low-dimensional feature space, where Re-ID is performed by comparing distances between samples. As demonstrated by numerous works, Convolutional Neural Networks (CNNs) trained as the mapping function have come to dominate classic methods based on hand-crafted features [sdalf, lomo, kissme]. Usually, these methods obtain the feature of a sequence by aggregating image features with average or maximum pooling [rcnn, mars]. However, such approaches fail to handle occlusion or spatial misalignment in video sequences since they treat all images in a sequence with equal importance [snip]. In order to distill relevant information for Re-ID, some works integrate Recurrent Neural Networks to learn the spatial-temporal dependency in an end-to-end manner [rcnn, recurrent, deepfusion]. Recently, several works propose attention mechanisms that weight the importance of different frames or different spatial locations to aggregate a better representation [diversity, snip, sta]. While these methods successfully capture both the spatial and temporal characteristics of video sequences, they only explore the aggregation of high-level features, which might not be sufficiently robust for fine-grained classification tasks such as Re-ID [HACNN, yu2018hierarchical, shih2017deep].
In this paper, we first aim to improve the representation of video sequences by exploiting spatial and temporal characteristics in both low-level and high-level features. Inspired by Wang et al. [non-local], we propose a Non-local Video Attention Network (NVAN) by introducing the non-local attention layer into an image classification CNN model. The non-local attention layer enriches the local image feature with global sequence information by generating attention masks according to features of different frames and different spatial locations. By inserting non-local attention layers at different feature levels, NVAN explores the spatial and temporal diversity of a sequence and refines the feature representation itself, rather than merely combining individual image features with a set of weights as in previous works. Our NVAN model surpasses all state-of-the-art video-based Re-ID methods by a large margin on the challenging MARS [mars] dataset, demonstrating that exploiting global information for multi-level features is crucial for learning representations of video sequences.
While applying non-local attention layers to multi-level features significantly improves Re-ID performance, it comes at a great cost in computational complexity. In fact, it increases the total floating point operations (FLOP) by , making it difficult to scale up to practical applications. To alleviate this challenge, we take advantage of the space-time redundancy in pedestrian videos and propose a Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN). We first reduce the granularity of attention masks in non-local attention layers by exploiting the spatial redundancy exhibited in pedestrian images. In addition, we exploit the temporal redundancy between video frames to aggregate image-wise information into a representative video feature with a hierarchical structure. By reducing the computational complexity both spatially and temporally, our STE-NVAN cuts down of FLOP compared to the original NVAN with only drop in rank-1 accuracy on the MARS dataset. Our proposed STE-NVAN demonstrates a much better trade-off between performance and complexity than existing video-based Re-ID methods. The contributions of our work can be summarized as follows:
We introduce the non-local attention operation into the backbone CNN at multiple feature levels to incorporate both spatial and temporal characteristics of pedestrian videos into the representation.
We significantly reduce the computation count of our Non-local Video Attention Network by exploiting the spatial and temporal redundancy present in pedestrian videos.
Extensive experiments validate that our proposed model not only outperforms state-of-the-art methods in Re-ID accuracy but also requires less computation than existing attention methods for video-based Re-ID.
2 Related Work
In this section, we briefly review the related works regarding image-based person Re-ID, video-based person Re-ID and the usage of attention mechanisms for the Re-ID problem.
Image-based person Re-ID has been extensively studied over the years. With the success of CNNs [domaindrop, svdnet, IDE, trip, HACNN], deep features learned by these networks have replaced hand-crafted features [viper, market, caviar, lomo] for representing pedestrian images. As suggested by Zheng et al. [ppp], these networks can be categorized into discriminative learning and metric learning. Discriminative learning learns deep features for identity classification with the help of the cross-entropy loss [domaindrop, svdnet, IDE]. As for metric learning, Hermans et al. [trip] use the triplet loss to teach the network to pull together features of the same person and push apart features of different people. In this work, we utilize both loss functions to train our network for video-based person Re-ID.
Video-based person Re-ID is an extension of image-based person Re-ID. Zheng et al. [mars] introduce a large-scale dataset to enable the learning of deep features for video-based Re-ID. They first train a CNN to extract image features and then aggregate them into a sequence feature with average/maximum pooling. Other works [rcnn, forest, recurrent] adopt Recurrent Neural Networks to summarize image-wise features into a single feature by exploiting the temporal relations within a sequence.
Recently, attention mechanisms have been introduced to capture the spatial and temporal characteristics of pedestrian sequences within deep features. Xu et al. [jointatten] introduce the joint attentive spatial and temporal pooling network to extract sequence features by jointly considering the query and gallery pairs with an affinity matrix. Li et al. [diversity] learn attention weights to combine features of different spatial locations and different temporal frames into a sequence feature. Chen et al. [snip] utilize techniques from [attention_is_all] to perform self-attention on each video snippet and co-attention between video snippets for learning sequence features. Fu et al. [sta] learn sequence features by mining features of discriminative regions and selecting important frames with a parameter-free attention scheme. While these works achieve promising results by introducing spatial and temporal attention on top of high-level features obtained from image-based CNNs, they overlook the importance of utilizing video characteristics at intermediate feature levels. In contrast, our proposed NVAN is able to refine intermediate features with the spatial and temporal information of videos, and our efficient STE-NVAN model substantially reduces the computation cost of incorporating video characteristics at lower feature levels.
3 Proposed Method
Given an image sequence of a pedestrian, we aim to learn a CNN that extracts a feature representation enabling video-based person Re-ID in the embedding space. The key to learning a representative feature for a sequence is to incorporate video characteristics into the feature itself. To this end, we introduce the non-local attention layer into the CNN to explore the spatial and temporal dependency of a video sequence. We propose a Non-local Video Attention Network (NVAN) in Sec. 3.1 to apply such operations at different feature levels. However, we observe very high computational complexity with the introduction of attention mechanisms. Hence, we further propose the Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN) in Sec. 3.2 to alleviate the computation cost by exploiting the spatial and temporal redundancy that exists in pedestrian videos.
3.1 Non-local Video Attention Network
To extract features for an image sequence, we take as input a subset of video frames selected by the restricted random sampling (RRS) strategy and forward it through a backbone CNN incorporating non-local attention layers, followed by a feature pooling layer (FPL), to obtain the representation vector for video-based Re-ID, as shown in Figure 1 (b).
Restricted Random Sampling (RRS).
There are several ways to handle the long-range temporal structure. To balance speed and accuracy, we adopt the restricted random sampling strategy [diversity, wang2016temporal]. Given an input video, we divide it into chunks of equal duration. For training, we randomly sample an image from each chunk. For testing, we use the first image of each chunk. The video is then represented by the ordered set of sampled frames.
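The sampling rule above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the integer rounding of chunk boundaries, and the error raised for videos shorter than the number of chunks are all assumptions.

```python
import random


def restricted_random_sampling(num_frames, num_chunks, training, rng=None):
    """Return one frame index per chunk of a video with `num_frames` frames."""
    if num_frames < num_chunks:
        # Assumption: the source does not say how very short videos are handled.
        raise ValueError("video has fewer frames than chunks")
    rng = rng or random.Random()
    # Chunk i covers [bounds[i], bounds[i + 1]); lengths differ by at most one frame.
    bounds = [i * num_frames // num_chunks for i in range(num_chunks + 1)]
    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if training:
            indices.append(rng.randrange(start, end))  # random frame within the chunk
        else:
            indices.append(start)  # first frame of the chunk at test time
    return indices


# Hypothetical example: a 59-frame track split into 8 chunks, test-time sampling.
print(restricted_random_sampling(59, 8, training=False))
```

Because every chunk contributes exactly one frame, the sampled indices stay in temporal order and cover the whole track regardless of its length.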
Non-local Attention Layer.
To embed video characteristics into the features, we introduce the non-local layer proposed by Wang et al. [non-local] into the backbone CNN, as illustrated in Figure 1 (a). Given an input feature tensor $X \in \mathbb{R}^{C \times T \times H \times W}$ obtained from a sequence of $T$ feature maps of size $C \times H \times W$, we desire to exchange information between features across all spatial locations and frames. Let $x_i \in \mathbb{R}^{C}$ be sampled from $X$; the corresponding output $y_i$ of the non-local operation can be formulated as follows:

$$y_i = \frac{1}{\sum_{\forall j} e^{\theta(x_i)^{\top} \phi(x_j)}} \sum_{\forall j} e^{\theta(x_i)^{\top} \phi(x_j)}\, g(x_j) \qquad (1)$$

Here, $i$ and $j$ index all locations across a feature map and all frames. We first project $x$ to a lower-dimensional embedding space by using linear transformation functions $\theta$, $\phi$, and $g$ (convolutions). Then, the response at each location $i$ is computed as the weighted average over all positions $j$ using the Embedded Gaussian instantiation. Equation 1 is a self-attention mechanism, as also noted in [non-local]. The overall non-local layer is finally formulated as $Z = W_z Y + X$, where the output $Y$ of the non-local operation is added to the original feature tensor $X$ after a transformation $W_z$ (a convolution) that maps $Y$ back to the original feature space. The intuition behind the non-local operation is that when extracting features at a specific location at a specific time, the network should consider the spatial and temporal dependencies within a sequence by attending to the non-local context. In our person Re-ID scheme, we embed five non-local layers into our backbone CNN, a ResNet-50 network [resnet], to capture the semantic relations present in videos, as shown in Figure 1 (b).
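To make the data flow of the non-local layer concrete, the following NumPy sketch applies an embedded Gaussian non-local operation to a single feature tensor. It is an illustration under stated assumptions, not the network used in the experiments: the learned projections are modeled as per-position linear maps (plain weight matrices), and the names `w_theta`, `w_phi`, `w_g`, `w_z` and all tensor sizes are hypothetical.

```python
import numpy as np


def non_local_layer(x, w_theta, w_phi, w_g, w_z):
    """Embedded-Gaussian non-local layer on a (C, T, H, W) feature tensor.

    w_theta, w_phi, w_g have shape (C_emb, C); w_z has shape (C, C_emb).
    Assumption: per-position linear maps stand in for the learned convolutions.
    """
    c, t, h, w = x.shape
    flat = x.reshape(c, t * h * w)                # one column per space-time position
    theta, phi, g = w_theta @ flat, w_phi @ flat, w_g @ flat
    logits = theta.T @ phi                        # (THW, THW) pairwise affinities
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax numerically
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)       # each row sums to one
    y = g @ attn.T                                # weighted average over all positions
    z = w_z @ y + flat                            # project back and add the residual
    return z.reshape(c, t, h, w)


# Hypothetical sizes: 16 channels, 4 frames, 8x4 feature maps, 8 embedding channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4, 8, 4))
w_theta, w_phi, w_g = (rng.standard_normal((8, 16)) * 0.1 for _ in range(3))
out = non_local_layer(x, w_theta, w_phi, w_g, np.zeros((16, 8)))
print(out.shape)
```

With an all-zero `w_z`, as in the usage above, the layer reduces to the identity, which shows that the attention output only refines the input through the residual path.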
Feature Pooling Layer (FPL).
After passing the image sequence through the backbone CNN and non-local attention layers, we employ the feature pooling layer to obtain the final feature for Re-ID, as shown in Figure 1 (b). We apply 3D average pooling (3DAP) along the spatial and temporal dimensions to aggregate the output features of each image into a representative vector, followed by a batch normalization (BN) layer [bn]. We train the network by jointly optimizing the cross-entropy loss and the soft-margin batch-hard triplet loss [trip]. Interestingly, we empirically find that optimizing the cross-entropy loss on the final feature while optimizing the triplet loss on the feature before BN results in the best Re-ID performance. A plausible explanation is that the embedding space without normalization is more suitable for distance metric learning such as the triplet loss, while the normalized feature space forces the model to classify samples in a more constrained angular space with the cross-entropy loss [trip, sphereface, arcface, cosface].
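The pooling step and the triplet term can be illustrated with a short NumPy sketch. It assumes the usual batch-hard formulation with a softplus in place of a fixed margin and non-squared Euclidean distances; the function names and the toy batch are hypothetical, and batch normalization and the cross-entropy branch are left out.

```python
import numpy as np


def average_pool_3d(x):
    """3D average pooling of a (C, T, H, W) tensor into a C-dimensional vector."""
    return x.mean(axis=(1, 2, 3))


def soft_margin_batch_hard_triplet(features, labels):
    """Mean over anchors of softplus(hardest-positive dist - hardest-negative dist)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))                       # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)   # farthest same-identity sample
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)    # closest other-identity sample
    # softplus(x) = log(1 + exp(x)), evaluated stably with logaddexp
    return np.logaddexp(0.0, hardest_pos - hardest_neg).mean()


# Hypothetical toy batch: two identities with two pre-BN features each.
feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 4.0]]
ids = [0, 0, 1, 1]
print(soft_margin_batch_hard_triplet(feats, ids))
```

In the setting described above, this loss would be evaluated on the pooled feature before BN, while the classifier trained with cross-entropy would consume the feature after BN.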
3.2 Spatially and Temporally Efficient Non-local Video Attention Network
While our proposed NVAN is able to capture sophisticated properties of video sequences with the help of non-local operations, we observe a significant increase in computational complexity as shown in Table 1, where FLOP ramp up from G to G. To scale NVAN to practical usage scenarios, we introduce two complexity reduction techniques to cut down the computation count.
Spatial Reduction with Pedestrian Part Characteristics.
Originally, the introduced non-local operations perform dense affinity calculation between features of all positions to obtain a fine attention mask. This results in heavy computation for each non-local attention layer, since the number of pairwise affinities grows quadratically with the number of spatio-temporal positions. Applying the non-local attention layer to lower feature levels incurs even larger complexity since low-level features typically have higher spatial resolution. To alleviate this effect, we group the features along the horizontal direction to form a more compact representation of the feature tensor. The intuition is that pixels of the same horizontal stripe tend to share similar characteristics, which can be utilized to generate a coarse representation of the image. It is worth noting that while similar ideas have been explored in the Re-ID literature [pcb, deepfusion, cheng2016person, lomo], they use this concept to generate finer features for Re-ID. In contrast, we exploit this redundancy to obtain a coarser representation. We partition the original feature tensor into horizontal groups by adding the “Make stripe” module at the input of non-local operations. With the resulting tensor, the cost of the operation is independent of the spatial size of the feature maps. This dramatically reduces the computational complexity and enables us to deploy the non-local operation at lower feature levels with constant computation cost. We name it the Spatial Reduction Non-local Layer and illustrate the idea in Figure 2 (a).
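The effect of grouping features into horizontal stripes can be sketched as follows. The sketch assumes that each stripe is summarized by average pooling and uses hypothetical sizes (8 frames, 16x8 feature maps, 4 stripes); the source does not specify the pooling operator, so treat both as illustrative choices.

```python
import numpy as np


def make_stripes(x, num_stripes):
    """Average a (C, T, H, W) tensor into (C, T, num_stripes) horizontal stripes.

    Assumption: each stripe is summarized by its mean over rows and columns.
    """
    c, t, h, w = x.shape
    if h % num_stripes:
        raise ValueError("height must be divisible by the number of stripes")
    rows = h // num_stripes
    # Split the height axis into (stripe, row-in-stripe), then average each stripe.
    return x.reshape(c, t, num_stripes, rows, w).mean(axis=(3, 4))


# Hypothetical sizes: 2 channels, 8 frames, 16x8 feature maps, 4 stripes.
x = np.arange(2 * 8 * 16 * 8, dtype=float).reshape(2, 8, 16, 8)
stripes = make_stripes(x, num_stripes=4)

# Number of pairwise affinities the attention has to compute in each case.
dense = (8 * 16 * 8) ** 2
reduced = (stripes.shape[1] * stripes.shape[2]) ** 2
print(stripes.shape, dense, reduced)
```

The affinity matrix on the stripe tensor depends only on the number of frames and stripes, so its size stays the same even if the feature maps are larger at lower levels of the network.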
Temporal Reduction with Hierarchical Structure.
During our experiments, we observe that features refined by non-local operations are often temporally similar, as the non-local operation aims to embed global temporal information into the features. Inspired by this observation, we exploit the temporal redundancy between features of different frames and propose a hierarchical structure to reduce the heavy computation of extracting the sequence feature. We illustrate this idea in Figure 2. After passing a sequence of images through a series of convolutions (Residual blocks) and non-local attention layers, we apply max pooling across features of adjacent frames and reduce the temporal feature dimension by a factor of 2. We perform the same reduction operation after another stack of Residual blocks until the temporal dimension is reduced to 2, and the result is then sent to the FPL for final feature summarization. This temporal reduction technique cuts down the computation required for extracting the sequence feature with Residual blocks and non-local attention layers. By applying both the Spatial Reduction Non-local Layers and the Hierarchical Temporal Reduction structure, we arrive at the final Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN) for video-based person Re-ID.
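A minimal sketch of the temporal reduction step is given below. It only shows the pooling across adjacent frames; the Residual blocks and non-local layers that sit between the pooling steps are omitted, and the sequence length of 8 frames is a hypothetical choice for illustration.

```python
import numpy as np


def temporal_max_pool(x):
    """Halve the temporal length of a (C, T, H, W) tensor by max pooling adjacent frames."""
    c, t, h, w = x.shape
    if t % 2:
        raise ValueError("temporal length must be even")
    # Pair up frames (0, 1), (2, 3), ... and keep the element-wise maximum of each pair.
    return x.reshape(c, t // 2, 2, h, w).max(axis=2)


# Hypothetical sequence of 8 frames: two reductions bring the temporal length to 2.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 16, 8))
after_first = temporal_max_pool(feats)
after_second = temporal_max_pool(after_first)
print(after_first.shape, after_second.shape)
```

Every layer placed after a pooling step processes half as many frames, which is where the savings in the later stages of the network come from.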
4 Experiments
We evaluate our approach on two large-scale video-based person Re-ID datasets, MARS [mars] and DukeMTMC-VideoReID [dukevideo]. We conduct ablation studies to validate the effectiveness of non-local operations and the two proposed reduction methods. We then compare our NVAN and STE-NVAN models with existing state-of-the-art methods to demonstrate that our proposed models achieve superior performance while requiring less computation.
4.1 Experimental Setup
Datasets and Evaluation Protocol.
MARS [mars] is a large-scale video-based person Re-ID dataset, consisting of 17,503 tracks and 1,261 identities. Each track has 59 frames on average. The Deformable Part Model [dpm] is employed to detect pedestrians and GMCP [gmcp] is used to track them. To make the dataset even more challenging, 3,248 distractor tracks are included. DukeMTMC-VideoReID [dukevideo] is another large-scale benchmark recently introduced for video-based person Re-ID. It comprises 4,832 tracks, 1,404 identities, and 408 distractor identities. Each track contains 168 frames on average. Detection and tracking ground truth are manually labeled. In the following, DukeMTMC-VideoReID is abbreviated as “DukeV” for convenience. In our experiments, we adopt the standard train/test split and report both rank-1 accuracy (R1) and mean Average Precision (mAP) to evaluate the Re-ID performance.
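The two reported metrics can be computed from ranked retrieval results as in the following simplified sketch. It assumes that each query already comes with a gallery ranking marked as correct or incorrect, and it leaves out any gallery filtering that the benchmark protocols apply; the function names are hypothetical.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if the i-th ranked item is correct."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at each correct retrieval
    return precision_sum / hits if hits else 0.0


def evaluate(all_ranked_matches):
    """Return (rank-1 accuracy, mean average precision) over all queries."""
    num_queries = len(all_ranked_matches)
    r1 = sum(bool(m[0]) for m in all_ranked_matches) / num_queries
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / num_queries
    return r1, mean_ap


# Hypothetical rankings for two queries over a three-item gallery.
r1, mean_ap = evaluate([[True, False, True], [False, True, False]])
print(r1, mean_ap)
```

R1 only checks the top-ranked gallery item, whereas mAP rewards placing all correct items early in the ranking, which is why the two numbers can differ substantially.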
For the RRS strategy described in Sec. 3.1, we segment each video into chunks and sample images as the input sequence. Each frame is resized to and synchronously augmented with random horizontal flip for each track. We adopt the ImageNet pre-trained ResNet-50 [resnet] as our backbone network, and modify to stride 1 instead of stride 2 to better adapt to the Re-ID task. For our NVAN, we insert 2 non-local attention layers after and another 3 after respectively. As for STE-NVAN, we set in the Spatial Reduction Non-local layer and perform max-pooling right after the second and the fifth non-local attention layers to reduce the temporal dimension. We train our network for 200 epochs with both the cross-entropy loss and the triplet loss [trip], using the Adam [adam] optimizer with an initial learning rate of and decaying it by 10 every epochs. Following the suggestion in [trip], we sample 8 identities, each with 4 tracks, to form a batch of size images.
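The identity-balanced batch construction (8 identities with 4 tracks each) can be sketched as below. The function name, the dictionary layout of the dataset, and the fallback to sampling with replacement for identities with too few tracks are assumptions made for illustration.

```python
import random


def sample_identity_batch(tracks_by_id, num_ids, tracks_per_id, rng=None):
    """Pick `num_ids` identities and `tracks_per_id` tracks for each of them."""
    rng = rng or random.Random()
    chosen_ids = rng.sample(sorted(tracks_by_id), num_ids)
    batch = []
    for pid in chosen_ids:
        tracks = tracks_by_id[pid]
        if len(tracks) >= tracks_per_id:
            chosen = rng.sample(tracks, tracks_per_id)
        else:
            # Assumption: identities with too few tracks are sampled with replacement.
            chosen = rng.choices(tracks, k=tracks_per_id)
        batch.extend((pid, track) for track in chosen)
    return batch


# Hypothetical dataset: 10 identities with 5 tracks each.
dataset = {pid: [f"id{pid}_track{k}" for k in range(5)] for pid in range(10)}
batch = sample_identity_batch(dataset, num_ids=8, tracks_per_id=4, rng=random.Random(0))
print(len(batch))
```

Guaranteeing several tracks per identity in every batch ensures that each anchor has at least one positive and many negatives, which the batch-hard triplet loss requires.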
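| Model | | MARS R1 | MARS mAP | DukeV R1 | DukeV mAP | FLOP |
|---|---|---|---|---|---|---|
| NVAN+Spatial Reduc. | FPL | 89.7 | 82.5 | 96.3 | 94.7 | 30.4 G |
| NVAN+Temporal Reduc. | FPL | 89.2 | 81.2 | 95.6 | 93.7 | 40.4 G |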
4.2 Ablation Studies
Effectiveness of Non-local Attention Layer and Two Reduction Methods.
We first compare our NVAN model with two baseline models to demonstrate the power of non-local operations. The two baseline models (ResNet-50) use the same backbone network as NVAN but without non-local attention layers. The only difference between the two baselines is that one replaces the 3DAP in FPL with a 3D maximum pooling operation. The first three rows in Table 1 show the results. They reveal that non-local operations improve R1 and mAP significantly by on MARS and on DukeV. The improvement confirms the effectiveness of incorporating spatial and temporal characteristics into sequence features at different semantic levels. However, we observe a dramatic increase in FLOP accompanying the introduction of non-local operations. Therefore, we propose two reduction techniques that exploit the spatial and temporal redundancy in pedestrian videos. Table 1 shows that our spatial reduction strategy cuts down the FLOP to approximately the same level as the baseline networks while only incurring R1/mAP drop on MARS and mAP drop on DukeV. As for temporal reduction, we save of FLOP from NVAN and sustain only R1 loss on both datasets and and mAP loss. Finally, by applying both spatial and temporal reduction techniques to NVAN, which yields our STE-NVAN, we achieve FLOP reduction compared to NVAN and require less FLOP compared to the baseline that does not employ any attention mechanism. This shows that our proposed STE-NVAN not only improves the Re-ID performance but also provides a more efficient way of extracting sequence features.
Analysis of NVAN.
To better understand the property of non-local operations, we conduct analysis on NVAN regarding RRS strategy and number of inserted non-local attention layers. In Table 5, we discover that by increasing the number of frames sampled from a sequence in RRS, Re-ID performance increases steadily as more frames provide richer information about a pedestrian. We pick for our NVAN and STE-NVAN in consideration of the memory capacity of our machine. On the other hand, we observe performance gain as we insert more non-local attention layers. In Table 5, we insert a non-local layer at for “1 layer” and insert 3 non-local layers at for “3 layers”. We insert 5 non-local layers for NVAN and STE-NVAN since it performs the best.
Analysis of STE-NVAN.
Next, we investigate the parameters for designing STE-NVAN. Starting from NVAN, we apply the spatial reduction technique to group features into horizontal stripes in the non-local attention layers. Table 5 shows that increasing the number of stripes slightly improves the Re-ID performance without introducing excessive additional FLOP. To analyze temporal reduction, we increase the number of pooling operations throughout the network. For comparison, “in 3DAP” in Table 5 is the NVAN model that pools all features after the last convolutional layer. By employing additional pooling after the non-local layers located in (“+ stage 4”), we reduce of FLOP from NVAN. By introducing another additional pooling after the non-local layers at (“+ stage 3”), we remove of FLOP from NVAN while only dropping and of R1 on MARS and DukeV.
4.3 Comparison with State-of-the-art Approaches
Table 6 reports the comparison of our NVAN and STE-NVAN with state-of-the-art video-based person Re-ID approaches. For STA [sta], we report their results with 8 sampled images per sequence for a fair comparison with our method. On MARS, our NVAN achieves in R1 and in mAP, surpassing all methods by a large margin. Our efficient STE-NVAN also performs better than all methods in R1 and breaks even with STA in mAP despite using less FLOP than NVAN. Our NVAN and STE-NVAN also display competitive results on DukeV, where Re-ID is easier than on MARS since detections are manually annotated. The superior Re-ID performance on the two benchmark datasets demonstrates the value of applying non-local operations for extracting a better representation of videos.
To take the computational complexity into consideration, we compare our method with existing methods that also use attention mechanisms on the performance-computation plot in Figure 3. We visualize mAP on the MARS dataset for performance and #FLOP for computation count. For STA, we report three variants of their method with different numbers of sampled frames per sequence to better demonstrate their trade-off. Results show that our proposed STE-NVAN exhibits a much better mAP-FLOP trade-off than current state-of-the-art methods. STAN [diversity] and CSACSE+OF [snip] even land outside of the plot since their mAP and FLOP are beyond its scale. The results not only indicate the advantage of our proposed spatial and temporal reduction techniques but also reveal the importance of considering computational complexity when designing feature extractors for video sequences.
| Method | Venue | MARS R1 | MARS mAP | DukeV R1 | DukeV mAP |
|---|---|---|---|---|---|
| STA (N=8) [sta] | AAAI19 | 86.2 | 81.2 | 96.0 | 95.0 |
We introduce a Non-local Video Attention Network (NVAN), which incorporates multiple non-local attention layers to extract spatial and temporal video characteristics from low to high feature levels and thereby enriches the representation of videos for person re-identification. To alleviate the computation cost, we propose a Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN), which reduces the non-local operation spatially by utilizing pedestrian part characteristics and temporally with a hierarchical structure. Extensive experiments show that our STE-NVAN achieves a superior trade-off between performance and computation.
This research was supported in part by the Ministry of Science and Technology of Taiwan (MOST 108-2633-E-002-001), National Taiwan University (NTU-108L104039), Intel Corporation, Delta Electronics and Compal Electronics.