Person Re-Identification (Re-ID) has been extensively applied as a retrieval technique in large-scale person tracking and related intelligent video surveillance scenarios. Given a query sample of a specific person, it aims to match the same person in a gallery set of samples, which may be captured by cameras from different viewpoints against different backgrounds [69, 61]. Driven by the increasing demand for public safety, the practical importance of person Re-ID has attracted growing attention from the community. Although significant progress has been made during the last decade, several challenging problems remain in the study of person Re-ID, e.g., partial occlusions [19, 16], drastic deformations of human pose [65, 39], complex environments and background clutter, etc.
Recently, deep learning based models have proven quite effective at tackling the aforementioned problems [69, 43, 34]. Pre-trained Convolutional Neural Network (CNN) models such as ResNet and InceptionNet serve as strong backbones for extracting representative visual features from images. Such approaches generally follow a common framework: a fine-tuning stage on labeled samples delivers a model that can discriminate between the persons within the training set. The intermediate features before the final classification layer are then retained as the deep feature embedding. These low-dimensional but representative features, instead of the original images, can be exploited for efficient matching and retrieval of unknown persons in a new set of instances.
In contrast to conventional vision tasks with large-scale datasets, existing datasets for person Re-ID normally require the algorithm to learn a classification model over a relatively large number of classes (persons) with limited samples (images). Also, because the final goal of feature embedding differs slightly from that of classification learning, simply using existing backbone models in Re-ID tends to deliver intermediate features without sufficient generalization. Many efforts have been devoted to alleviating this problem, including 1) specially designed loss functions to discover discriminative features, e.g., sphere loss, triplet loss and center loss; 2) adopting affiliated structures such as local parts [43, 20] or multiple branches [60, 67, 51] to learn fine-grained features with higher diversity; 3) introducing attention mechanisms to emphasize feature correlations and improve the efficiency of the model. However, recent works focus on tediously adding extra structures to backbone networks for performance gains, which raises an interesting question: are these structures really effective, or can performance also be improved by simply increasing the number of parameters, e.g., with larger backbones or ensemble techniques?
In this paper, we jointly consider the performance and efficiency of person Re-ID models, and propose a compact model consisting of a backbone and a Feature Pyramid Branch (FPB). FPB is mainly inspired by the Feature Pyramid Network (FPN) structure from the field of object detection. As a common structure in prevailing object detectors [30, 45, 3], FPN has proven quite effective in aggregating features at different scales. Although many works in the literature have shown that person Re-ID also requires feature extraction at different granularities [43, 50], the differences between the Re-ID and object detection tasks make it challenging to exploit FPN in a person Re-ID architecture.
Here we address this problem by proposing a bidirectional pyramid structure, supported by attentive auxiliary modules, as a lightweight branch solution. In Fig. 1, we compare our proposed model with other ResNet based state-of-the-art methods. We choose ResNet50 as a typical backbone since it is extensively adopted by existing Re-ID methods and shows high compatibility with various structures. Based on ResNet50, a backbone with 25.56M parameters, our proposed FPB based method achieves the best performance on all benchmarks while introducing less than 1.5M extra parameters.
From Fig. 1 one can see that: 1) compared with methods whose number of parameters is comparable to that of FPB (28M), e.g., Bag-Of-Tricks, PCB and KPM, the mean Average Precision (mAP) is improved by more than 5% on all three datasets. 2) Other leading methods with similar performance normally require models with more than 32M parameters, which implies that at least 4 times as many extra parameters are added to the backbone. 3) Furthermore, FPB also outperforms methods with much higher complexity such as Pyramid, ABD and LocalCNN, which use models with more than 50M learnable parameters. Our main contributions in this paper can be summarized as follows:
We propose a lightweight FPB that can be plugged into a backbone network to form an asymmetrical multi-branch architecture. Diverse features from different scales are extracted and integrated for final matching. As far as we know, we are the first to successfully apply a feature pyramid network to a person Re-ID model.
To further improve the performance of FPB, self-attention modules are carefully evaluated as auxiliary modules and inserted at different positions of the network. We also propose an extra cross orthogonality regularization over features from two layers of FPB. This penalty effectively reduces the correlation of feature maps, especially after attention modules, and thus improves the diversity of the resulting features.
Extensive validation on different person Re-ID datasets demonstrates that the FPB structure delivers significant improvement with a trivial increase in computational cost (1.5M extra parameters). It outperforms other leading methods and achieves a new state of the art on all prevailing benchmarks with an efficient implementation. Our results also provide empirical evidence that FPN could be a promising structure for related feature embedding tasks.
II Related Works
II-A Person Re-Identification
Recently, deep learning based person Re-ID methods have shown clear advantages over conventional methods that still rely on handcrafted features [22, 23]. There exist two paradigms for learning deep models with the feature embedding capability needed for final matching. The first is to adopt a pre-trained backbone, e.g., ResNet, and fine-tune the parameters of a specific architecture on person Re-ID datasets [43, 34, 60, 67, 6, 62]. The other is to design a novel architecture specifically for person Re-ID and train the model from scratch [27, 73, 24, 59]. Although specific models such as Omni-Scale Network (OSNet) and Harmonious Attention CNN (HA-CNN) are normally more efficient, with far fewer parameters, we still focus on the former paradigm based on generic backbones in this paper. This is because we observe that these models deliver higher performance, especially on large-scale benchmarks, e.g., MSMT17. They are also easier to modify further and to distribute in realistic scenarios.
On the other hand, different mechanisms are also introduced into the training stage to improve performance. Data augmentation methods, such as random erasing and random patch, are commonly adopted. Building on random erasing applied to every single sample, the random block dropping methods [7, 57] drop the same part of all samples within a batch to learn more attentive local features. Besides augmentation, other strategies such as stochastic weight averaging and extra regularization have also proven effective in improving the capability of learned models. In this paper, since we focus on the structure of FPB, only essential training strategies are adopted for a fair comparison.
II-B Diversity of Features
For the reasons mentioned above, improving the generalization capacity of feature embedding models is a critical issue in person Re-ID. An effective way is to increase the diversity of extracted features. The conventional output feature of deep models normally results from a direct averaging operation over the whole feature map. Conversely, part based models extract local features from different parts for higher diversity in final matching. Besides part based models, extracting features from multiple levels [10, 37] is another potential way to deliver more representative features. Features from shallow layers of networks naturally reflect detailed local information in images. Moreover, recent research demonstrates that a complementary feature consisting of both global and local features can be more diverse for feature matching. Thus the multi-branch structure has become a prevailing architecture in the person Re-ID literature [6, 60, 59, 67].
An inevitable problem brought by the multi-branch structure is increased complexity. As a basic approach, [6, 62] construct a dual-branch structure by duplicating part of the backbone with separate learnable parameters. For more branches, since most existing works tend to design symmetrical multi-branch structures, multiple duplications of the last part of the backbone lead to a significant increase in parameters. For some lightweight backbones, multiple branches can even introduce more extra parameters than the original backbone contains [24, 59]. Our proposed approach follows the paradigm of multi-branch networks. However, we carefully restrain the complexity of the local part branch by using a lightweight pyramid network structure rather than direct duplication of the backbone. The pyramid network also extracts features from multiple levels of the backbone, so detailed information from shallow layers is naturally retained. Finally, complementary feature embeddings at different granularities are learned jointly in an end-to-end framework.
II-C Feature Pyramid Network
The FPN structure is extensively exploited in the field of object detection. The first successful application of FPN proposed a top-down pathway to output detection results at multiple scales simultaneously. In this way, the conventional pyramid over the original input image is replaced by a CNN structure to achieve multi-scale object detection more efficiently. Following this idea, PANet further proposed a bidirectional FPN consisting of a top-down and a bottom-up path to aggregate features at each layer of the FPN. M2Det adopted a block of alternating joint U-shape modules to fuse multi-level features. EfficientDet introduced the down-sampling structure from ResNet and proposed a bidirectional FPN (BiFPN). State-of-the-art object detectors, including both one-stage approaches (SSD, YOLO, EfficientDet) and two-stage approaches (Mask RCNN, DetNet), all exploit the FPN structure to tackle the scale variation problem. In this paper, we introduce FPN into the task of person Re-ID since it also requires feature matching at both global and local scales. Despite the differences between person Re-ID and object detection, our results empirically show that, with careful design, FPN can bring significant benefit with a trivial increase in model complexity.
II-D Attention Mechanisms
There are various implementations of attention modules in the literature. The original attention module was proposed for Natural Language Processing (NLP) tasks [1, 47] and is normally referred to as multi-head attention. It focuses on reducing the ambiguity of input features by using their context information. Recently, multi-head attention as well as the transformer architecture have also proven effective in various vision tasks. Differing from the original multi-head attention module in NLP, the most extensively applied attention module in vision tasks is the so-called Non-Local Neural Networks structure, which likewise encodes the correlation between features at different positions to output more attentive features. Another design of attention, namely the Position Attention Module (PAM), is also extensively adopted in person Re-ID tasks. PAM can be treated as a simplified version of multi-head attention without dimension reduction or a multi-head structure on the value branch. It also focuses on attending to features at different positions, extracting correlation information as a position affinity matrix to reweigh features.
Besides PAM, there exists another series of attention modules, namely the Channel Attention Module (CAM), which aims to extract correlation information over the different channels of feature maps. Typical implementations of CAM include the Squeeze-and-Excitation block and Efficient Channel Attention. Since no extra parameter is required in the implementation of CAM, it can be deployed as an efficient mechanism to extract channel-wise responses over features at trivial cost. Many variants of CAM serve as building blocks in different CNN architectures, specifically in lightweight designs such as OSNet and EfficientNet. CAM is also extensively deployed in existing person Re-ID frameworks such as HA-CNN and Attentive but Diverse Network (ABD-Net).
In this paper, our attention module consists of a PAM concatenated with a CAM. This structure is analogous to an existing attention module design. We observe that the PAM delivers slightly better performance than the prevailing Non-Local Neural Networks in our architecture. Meanwhile, the CAM also brings a clear gain with a negligible increase in parameters.
III Proposed Method
As illustrated in Fig. 2, our proposed method is a dual-branch framework consisting of a global branch and a feature pyramid branch. The global branch is mainly based on the Bag-Of-Tricks method, which contains a modified version of ResNet50 as the backbone network. Differing from the standard ResNet50 based ID-Discriminative Embedding (IDE), here the last down-sampling operation in layer 4 of ResNet50 is removed to increase the size of the output. After the Global Average Pooling (GAP) operation over the output, a 2048-dimensional vector is delivered as the global feature. A BNNeck is also adopted to deliver a normalized version of the global feature. More details can be found in the description of the loss functions in Section III-D.
Based on the backbone network, we propose a lightweight Feature Pyramid Branch (FPB) to enrich the diversity of features for person Re-ID. The outer structure of this branch is inspired by the Part-based Convolutional Baseline (PCB): the output feature map is divided into parts to emphasize local information. The difference is that we take shallower features from layer 2 and layer 3 of the backbone as input. These shallow features retain more local details from the image. At the same time, features with lower dimensions reduce the number of learnable parameters within the branch.
Specifically, the output feature map of FPB is processed by different strategies during training and inference. An average pooling operation is adopted to obtain 1024-dimensional part feature vectors. During training, these are further mapped to 256-dimensional vectors to optimize the classification loss on the branch. The concatenation of the part features and the output feature from the backbone forms the final feature of the whole model. This feature represents the original image for efficient matching and retrieval of images of the same person among a large number of candidates with unknown identities.
The integration of multiple branches here is reminiscent of AsNet, which also improved the diversity of the resulting features with asymmetrical branches. However, our implementation is much more compact, since we propose FPB as a replacement for the simple duplication of layer 4 of ResNet used there. Note that we choose ResNet50 as the backbone in this work only because of its popularity and flexibility with other structures. FPB can also be used as a compatible plugin for other common feature extraction backbones, such as DenseNet and InceptionNet.
III-A Feature Pyramid Branch
The pivotal structure of FPB is a two-layer FPN, as shown in Fig. 2. This affiliated branch network begins with lateral convolutional filters that convert feature maps with different channel numbers to a unified 256 channels. The lateral filters consist of standard convolutional filters followed by BN and ReLU. Within the FPB, four low-dimensional convolutional filters fulfill the aggregation of features at different scales. The structures of these filters are similar to those of the lateral filters, except that they adopt a larger kernel.
There exist two cross-scale connections between feature maps with different spatial resolutions in FPB. One top-down connection is implemented by nearest-neighbor interpolation to increase the size of the feature map. Conversely, one bottom-up connection is implemented by max pooling. There also exist two down-sampling operations as extra edges from the input to the output node at each layer. As illustrated by the curved arrows in Fig. 2, we implement down-sampling analogously to the residual structure in ResNet. As the output of FPB, we take the feature at the deeper layer of FPB and restore the channel number to 1024 for subsequent processing. Hence, each layer within FPB can be treated as a small bottleneck structure, which extracts information by filters with a relatively larger kernel and fewer channels.
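The two cross-scale connections can be illustrated on a single-channel map. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `upsample_nearest` and `max_pool` are hypothetical, and the resampling factor of 2 is an assumption, since the kernel size is not stated in the text above.

```python
def upsample_nearest(grid, factor=2):
    """Nearest-neighbour upsampling: repeat every value factor x factor times."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def max_pool(grid, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


deep = [[1.0, 2.0], [3.0, 4.0]]        # coarse map from the deeper layer
shallow = upsample_nearest(deep)       # top-down path: 2x2 -> 4x4 (assumed factor 2)
recovered = max_pool(shallow)          # bottom-up path: 4x4 -> 2x2
```

In a real network both operations act per channel on a full tensor and are followed by the convolutional filters described above; the sketch only shows how the spatial sizes of the two layers are made to agree.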
BiFPN proposed adding an extra edge from the input to the output node at the same scale. Our empirical results also confirm the efficacy and simplicity of this structure. However, note that there exist two essential differences between the structures of FPB and BiFPN. First, FPN aims to handle objects of different sizes by aggregating features at different scales, while FPB aims to integrate diverse features from different scales into final matching. Therefore FPB aggregates features into a single output that feeds the average pooling operation, rather than producing multiple outputs at each layer as BiFPN does. Second, we implement the inner nodes as well as the down-sampling connections with wider filters than BiFPN. This is because the task of Re-ID requires complicated information from different scales for classification over a relatively large number of identities. In contrast, object detection focuses on a classification problem over only tens of categories, taking the COCO detection dataset as a typical instance. Thus filters with more channels are adopted here to fit the complexity of the problem. Moreover, there are some other subtle modifications in FPB based on our empirical results. For example, we found that weighted feature fusion with learnable parameters at each node is inappropriate in Re-ID. A more detailed analysis can be found in the ablation study in Section IV-D.
Attention modules have proven to be an effective mechanism in various machine learning scenarios. They exploit the correlation between features to help the model focus on more relevant features and reduce the ambiguity of features for the final task. In person Re-ID, the main goal is to train the model to discriminate persons using samples in the training set, and then to extract representative features from images of unknown identities in the testing set for final matching. This divergence between training and inference requires the model to extract generic features rather than biased features tied to specific instances in the training set. Self-attention modules can effectively improve the generalization capacity of the model by forcing it to focus on the relationships between features. Based on this observation, we insert two self-attention modules, one at the backbone and one at the feature pyramid branch. Both self-attention modules consist of a PAM followed by a CAM. Their structures are shown in Fig. 3.
Position Attention Module: Our implementation of PAM is analogous to ABD-Net, and can be treated as a simplification of the multi-head attention mechanism extensively applied in Natural Language Processing (NLP). Let the input feature maps be $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ are the channel number, height and width, respectively. The PAM projects and reshapes the feature at every position onto two lower-dimensional subspaces, resulting in the query $Q \in \mathbb{R}^{N \times C/r}$ and the key $K \in \mathbb{R}^{N \times C/r}$. Here $N = H \times W$ is the spatial size of the feature map and $r$ is a hyperparameter that controls the dimension of the subspace. We follow the usual choice and simply set $r$ to 8 for all experiments. Specifically, the projections from $X$ to $Q$ and $K$ are implemented by two convolutions with kernel size $1 \times 1$. Then the attention of $X$ can be calculated from the query $Q$, key $K$ and value $V$ as
$$A = \sigma\left(QK^{\top}\right)V,$$
where $\sigma$ is the Softmax function and the value $V \in \mathbb{R}^{N \times C}$ is another projection of $X$ with equivalent dimension. Note that if we ignore all learnable parameters here, the position affinity matrix $QK^{\top}$ reduces to a Gram matrix, which measures the correlation between features at different positions of $X$. From this perspective, the essential goal of position attention is to reweigh each feature by its correlations with other features. As shown in Fig. 3(a), we also adopt a residual structure together with a learnable parameter to adjust the impact of attention.
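To make the data flow of position attention concrete, here is a minimal dependency-free sketch, assuming the standard form softmax(Q Kᵀ)·V followed by a scaled residual connection. It is an illustration rather than the authors' code: the names `position_attention`, `w_q`, `w_k`, `w_v` and `gamma` are hypothetical, the weights are random, and the toy sizes (C = 16, a 2 x 3 map) are chosen only to keep the example small. Features are stored as an N x C list of rows, so each 1 x 1 convolution becomes a plain matrix product.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def position_attention(x, w_q, w_k, w_v, gamma):
    """x: N x C features, one row per spatial position."""
    q = matmul(x, w_q)                        # N x C/r query
    k = matmul(x, w_k)                        # N x C/r key
    v = matmul(x, w_v)                        # N x C value
    k_t = [list(col) for col in zip(*k)]      # C/r x N
    affinity = softmax_rows(matmul(q, k_t))   # N x N position affinity
    attended = matmul(affinity, v)            # N x C
    # Residual connection scaled by the (learnable) scalar gamma.
    out = [[gamma * a + xi for a, xi in zip(ar, xr)] for ar, xr in zip(attended, x)]
    return out, affinity


rng = random.Random(0)
C, r, N = 16, 8, 6                            # N = H * W for a 2 x 3 map


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(N, C)
w_q, w_k, w_v = rand_matrix(C, C // r), rand_matrix(C, C // r), rand_matrix(C, C)
out, affinity = position_attention(x, w_q, w_k, w_v, gamma=0.5)
```

With `gamma` set to zero the module reduces to the identity, which is one reason a scaled residual is a convenient way to add attention to a pre-trained backbone.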
Channel Attention Module: Based on a similar motivation, we also implement a CAM to extract attention over the different channels of the features. As illustrated in Fig. 3(b), the channel affinity matrix is calculated directly from the input features, without any projection.
A learnable parameter is also adopted here to adjust the impact of attention in the final sum with the original features. Compared with the PAM, the CAM requires only a few parameters while bringing a clear improvement in performance.
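A channel attention step of this kind can be sketched in the same dependency-free style. The code below is an assumption-laden illustration, not the authors' implementation: it takes the channel affinity to be a row-wise softmax over raw inner products between channels, which is a common design but is not confirmed by the text above, and the names `channel_attention` and `gamma` are hypothetical.

```python
import math


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def channel_attention(x, gamma):
    """x: C x N features, one row per channel (spatial positions flattened)."""
    # Channel affinity from inner products between channels; no projection.
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in x] for ci in x]  # C x C
    affinity = softmax_rows(gram)             # assumed normalisation
    n_pos = len(x[0])
    attended = [
        [sum(w * x[j][n] for j, w in enumerate(weights)) for n in range(n_pos)]
        for weights in affinity
    ]
    # Scaled residual sum with the original features; gamma is the only parameter.
    out = [[gamma * a + xi for a, xi in zip(ar, xr)] for ar, xr in zip(attended, x)]
    return out, affinity


x = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]   # 3 channels, 3 positions
out, affinity = channel_attention(x, gamma=0.1)
```

Because no projection is involved, the only trainable quantity in this sketch is the scalar `gamma`, which matches the claim that the module adds very few parameters.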
III-C Cross Orthogonality Regularization
On top of the self-attention modules, we further enforce the diversity of features by orthogonality regularization. Orthogonality regularization aims to improve the representational efficiency of features by reducing the correlation between different channels. Its influence is especially obvious on features after attention modules. Given a feature map, conventional hard regularization normally relies on the Singular Value Decomposition (SVD), which is computationally expensive, especially for high-dimensional features. A substitute is the soft orthogonality regularization, which optimizes the condition number instead, expressed through the largest and smallest eigenvalues of the associated matrix together with a small constant. With a fast iterative algorithm for computing eigenvalues, the orthogonality regularization can be implemented efficiently during training.
Rather than merely applying orthogonality regularization on a single feature map, we propose the Cross Orthogonality Regularization (COR), as shown in Fig. 2. The two feature maps after the attention modules are considered jointly. Feature maps with different resolutions are unified by a max pooling operation and then concatenated into one higher-dimensional feature map. The orthogonality regularization is then applied to enforce orthogonality over all channels of the features from the different positions. The motivation for COR, as opposed to a separate orthogonality regularization on each feature map, is our observation that standard back propagation naturally reduces the correlation between features, which intrinsically ensures the efficiency of deep models. In this case, the effect of simple orthogonality regularization on a single feature map partially overlaps with the learning of the entire model, while COR is more complementary since it affects multiple branches simultaneously. Without adding any computation at inference, COR brings a small but clear improvement to the final performance.
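The COR pipeline described above (pool, concatenate, then regularize the joint channel correlations) can be sketched as follows. This is a hedged illustration only: the penalty form `(lam_max - lam_min) ** 2` is an assumption standing in for the paper's own definition, the pooling factor of 2 is assumed, the function names are hypothetical, and power iteration with a spectral shift is used as one simple example of an iterative eigenvalue method.

```python
import math
import random


def max_pool(grid, size=2):
    """Non-overlapping max pooling on one 2-D channel."""
    return [
        [max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
         for c in range(0, len(grid[0]) - size + 1, size)]
        for r in range(0, len(grid) - size + 1, size)
    ]


def flatten(channels):
    """Turn a list of 2-D channel maps into a C x N matrix."""
    return [[v for row in ch for v in row] for ch in channels]


def gram(f):
    """Channel Gram matrix F F^T of a C x N matrix."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in f] for ri in f]


def power_iteration(m, steps=200, seed=0):
    """Dominant eigenvalue of a symmetric positive semi-definite matrix."""
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.5) for _ in m]
    for _ in range(steps):
        w = [sum(a * b for a, b in zip(row, v)) for row in m]
        norm = math.sqrt(sum(c * c for c in w))
        if norm < 1e-12:                      # matrix is (numerically) zero
            return 0.0
        v = [c / norm for c in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in m]
    return sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient


def extreme_eigenvalues(g):
    lam_max = power_iteration(g)
    # Shift trick: the dominant eigenvalue of (lam_max * I - G) is lam_max - lam_min.
    n = len(g)
    shifted = [[(lam_max if i == j else 0.0) - g[i][j] for j in range(n)] for i in range(n)]
    return lam_max, lam_max - power_iteration(shifted)


def cross_orthogonality_penalty(shallow, deep):
    """shallow: channels at twice the resolution of deep (assumed factor)."""
    pooled = [max_pool(ch) for ch in shallow]  # unify resolutions
    joint = flatten(pooled) + flatten(deep)    # concatenate along channels
    lam_max, lam_min = extreme_eigenvalues(gram(joint))
    return (lam_max - lam_min) ** 2            # assumed penalty form


shallow = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]]   # one 2x4 channel
orthogonal = cross_orthogonality_penalty(shallow, [[[0.0, 1.0]]])
correlated = cross_orthogonality_penalty(shallow, [[[1.0, 0.0]]])
```

In this toy case the penalty vanishes when the pooled shallow channel and the deep channel are orthogonal and grows when they coincide, which is the behaviour a cross-branch regularizer is meant to encourage. Power iteration converges slowly when the leading eigenvalues are close, so a production implementation would likely use a tuned routine.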
III-D Loss Functions
At the training stage, we obtain four types of output features as shown in Fig. 2: the global feature, its normalized version, and two sets of part features. Each of the latter two consists of multiple features, one from each part after average pooling. All parameters within the model are learned by optimizing a loss function, defined over the training samples, that combines three terms.
The first term is the hard mining triplet loss between a sample and another sample within a batch, computed on the concatenation of the global feature and the part features. The second term is the cross entropy loss, computed through the FC layers that follow the normalized global feature and each part feature. The third term is the COR version of the orthogonality regularization in Equation 3. A hyperparameter is adopted to balance the different losses. The use of the cross entropy loss follows the conventional framework for learning person Re-ID models [69, 34], while the triplet loss improves the generalization capacity of the model by ensuring a larger distance between output features from samples of different identities than between those of the same identity.
As listed in Equation 4, two kinds of global features, delivered by the BNNeck mechanism on the global branch, are included in the loss function. The BNNeck deploys a BN layer after the original global feature to obtain its normalized version. The normalized feature is used to optimize the cross entropy loss on the global branch during training, and forms part of the final feature during inference. The original global feature, in turn, is used as part of the output feature to optimize the triplet loss during training. Our experimental results demonstrate that deploying the BNNeck only on the global branch, rather than on both branches, delivers the optimal performance, since it naturally forces an asymmetrical structure over the outputs and thus ensures the diversity of features from different paths.
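For readers unfamiliar with hard mining, the triplet term can be illustrated with a small batch-hard sketch. This is a generic formulation under stated assumptions, not the authors' code: Euclidean distance and a margin of 0.3 are assumed (neither is given in the text above), and the function name `batch_hard_triplet_loss` is hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Mean over anchors of max(0, margin + hardest_positive - hardest_negative)."""
    losses = []
    for i, (anchor, label) in enumerate(zip(features, labels)):
        positives = [
            euclidean(anchor, f)
            for j, (f, l) in enumerate(zip(features, labels))
            if l == label and j != i
        ]
        negatives = [euclidean(anchor, f) for f, l in zip(features, labels) if l != label]
        if not positives or not negatives:
            continue                          # no valid triplet for this anchor
        # Hardest positive is the farthest same-identity sample,
        # hardest negative is the closest different-identity sample.
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]
separated = batch_hard_triplet_loss([[0.0], [1.0], [10.0], [11.0]], labels)
entangled = batch_hard_triplet_loss([[0.0], [5.0], [1.0], [6.0]], labels)
```

The loss is zero once every identity cluster is separated from the others by more than the margin, and positive when samples of different identities are closer than samples of the same identity.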
IV Empirical Results
In this section, we conduct a series of experiments to analyze the performance of our proposed FPB and compare it with other state-of-the-art works. Four prevailing person Re-ID datasets are considered here: Market1501, DukeMTMC, CUHK03 and MSMT17.
Market1501 consists of 32,668 images of 1,501 identities captured by six cameras, in which each identity is captured by at least two cameras with multiple images. The training set contains 12,936 images of 751 identities, i.e., an average of 17.2 training samples per person. The testing set contains 19,732 images of 750 other identities, of which 3,368 images are used as the probe set while the rest are used as the gallery set.
DukeMTMC consists of 36,411 images from 1,404 identities captured by more than two cameras, and 408 identities captured by only one camera as distractors. The training set contains 16,522 images from 702 identities. The testing set contains 17,661 images from 702 other identities, of which 2,228 images are used as the probe set while the remaining images from the 702 identities, together with the distractors, are used as the gallery set.
CUHK03  consists of images from 1467 identities captured by five cameras, in which 767 identities are used as training set and 700 other identities are used as testing set. The dataset contains two tasks, person Re-ID with labeled images and with detected images. The labeled dataset has 7,368 images for training and 6,728 images for testing. The detected dataset has 7,365 images for training and 7,732 images for testing.
MSMT17 is larger than the three aforementioned datasets. It consists of 126,441 images from 4,101 identities captured by a 15-camera network (12 outdoor, 3 indoor). The training set contains 32,621 images from 1,041 identities. The testing set contains 93,820 images from 3,060 other identities, of which 11,659 images are used as the probe set while the rest are used as the gallery set.
|Adaptive L2 ||88.9||95.6||81.0||90.2||-||-||-||-|
IV-B Implementation Details
First, training starts from the backbone pre-trained on ImageNet. For the remaining part of the architecture in Fig. 2, we adopt the widely applied He method as initialization. Standard augmentation methods, including random horizontal flip, random crop, random erasing and random patch, are also adopted during training. We fine-tune the model with the Adam optimizer for 120 epochs. A linear warmup strategy is used, in which the learning rate is initialized at 3.5e-5 and increased to 3.5e-4 over 20 epochs. The learning rate is then decayed by a factor of 0.1 after 60 and 90 epochs, respectively.
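The learning rate schedule described above can be written as a small function. This is a sketch under assumptions rather than the authors' training script: epochs are taken to be zero-indexed, the warmup is interpolated linearly per epoch, and the decay is applied from epochs 60 and 90 onward; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=3.5e-4, warmup_start=3.5e-5,
                  warmup_epochs=20, milestones=(60, 90), decay=0.1):
    """Linear warmup followed by step decay, using the values quoted in the text."""
    if epoch < warmup_epochs:
        fraction = epoch / warmup_epochs
        return warmup_start + fraction * (base_lr - warmup_start)
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= decay                       # multiply by 0.1 at each milestone
    return lr


schedule = [learning_rate(epoch) for epoch in range(120)]   # 120 training epochs
```

Printing `schedule` shows the rate rising from 3.5e-5 to 3.5e-4 during the first 20 epochs, then dropping by a factor of ten at each of the two milestones.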
For Market1501, DukeMTMC and CUHK03, input images are resized to a fixed size. Experiments are executed on an Intel E5-2680 CPU at 2.4GHz and a single NVIDIA Tesla P40 GPU. The model is trained with a batch size of 64 drawn from 16 identities. For MSMT17, to improve the efficiency of learning on this massive dataset, we adopt multiple GPUs and keep the batch size on each GPU at 64.
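A batch of 64 images from 16 identities corresponds to 4 images per identity if the split is even, which is the usual identity-balanced sampling needed by the triplet loss. The sketch below illustrates one way to draw such a batch; the even split, the fallback to sampling with replacement, the toy file names and the function name `sample_pk_batch` are all assumptions made for illustration.

```python
import random


def sample_pk_batch(images_by_id, num_ids=16, per_id=4, rng=None):
    """Draw num_ids identities and per_id images for each of them."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(images_by_id), num_ids)
    batch = []
    for pid in chosen:
        pool = images_by_id[pid]
        if len(pool) >= per_id:
            picks = rng.sample(pool, per_id)
        else:
            # Too few images for this identity: sample with replacement.
            picks = [rng.choice(pool) for _ in range(per_id)]
        batch.extend((pid, image) for image in picks)
    return batch


# Toy dataset: 40 identities with 3 to 7 images each (hypothetical file names).
toy = {pid: [f"id{pid}_img{k}.jpg" for k in range(3 + pid % 5)] for pid in range(40)}
batch = sample_pk_batch(toy, rng=random.Random(0))
```

Every anchor in such a batch is guaranteed to have at least one positive and many negatives, so hard mining within the batch is always well defined.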
Evaluation Protocol: For quantitative comparison of different methods, we consider the mean Average Precision (mAP) and the top-1 accuracy (rank-1) of the Cumulative Matching Characteristics (CMC) as standard metrics. All results are obtained without any re-ranking or multi-query fusion techniques.
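The two metrics can be computed from ranked match lists as sketched below. This is a simplified illustration with hypothetical function names: each query is represented by a list of booleans marking which gallery entries, in ranked order, share its identity, and the usual filtering of same-camera gallery images is omitted.

```python
def average_precision(matches):
    """matches: ranked gallery results for one query, True where identity matches."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)    # precision at each correct hit
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(ranked_matches_per_query):
    """Return (mAP, rank-1 accuracy) over all queries."""
    count = len(ranked_matches_per_query)
    aps = [average_precision(m) for m in ranked_matches_per_query]
    top1 = [1.0 if m and m[0] else 0.0 for m in ranked_matches_per_query]
    return sum(aps) / count, sum(top1) / count


queries = [
    [True, False, True],     # correct at ranks 1 and 3
    [False, True, False],    # correct only at rank 2
]
mean_ap, rank1 = evaluate(queries)
```

Rank-1 only checks the top result, whereas mAP rewards placing every correct gallery image early, which is why the two metrics can rank methods differently.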
IV-C Comparison with State-of-the-art Methods
In this section, we compare our proposed approach with other state-of-the-art methods. In Table I, we list the performance of different methods on four tasks: Market1501, DukeMTMC, CUHK03 (Labeled) and CUHK03 (Detected). One can see that our proposed scheme outperforms other approaches by clear margins. The only exception is the rank-1 accuracy of HOReID on Market1501. Here we adopt the result of HOReID with ResNet50 as backbone and an extra Global Hard Identity Searching (GHIS) method during training. Without this augmentation, the rank-1 accuracy of HOReID drops to 95.74%, which is worse than FPB. For mAP, FPB exceeds the second best approaches by 0.6%, 1.8%, 2.9% and 3.8% on the four tasks, respectively. From our observations in experiments, we treat mAP as the more reliable indicator in person Re-ID scenarios, which agrees with observations in prior work. Note that the identical architecture with 27.04M learnable parameters is used for all experiments on the different datasets. It adds less than 1.5M extra parameters to the ResNet50 baseline, which demonstrates the efficacy of FPB in extracting representative features.
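The parameter budget can be checked with simple arithmetic. The snippet below uses only figures quoted in this paper (25.56M for ResNet50, 27.04M for the full model, and the 32M lower bound mentioned in the introduction for other leading methods); the variable names are illustrative.

```python
backbone_m = 25.56    # ResNet50 parameters, in millions
fpb_total_m = 27.04   # full model with FPB, in millions
other_min_m = 32.0    # lower bound quoted for other leading methods, in millions

extra_fpb = fpb_total_m - backbone_m          # parameters added by FPB
extra_other = other_min_m - backbone_m        # parameters added by other methods
relative_overhead = extra_fpb / backbone_m    # FPB overhead relative to the backbone
ratio = extra_other / extra_fpb               # how many times more the others add

print(f"FPB adds {extra_fpb:.2f}M parameters ({100 * relative_overhead:.1f}% of the backbone)")
print(f"Other leading methods add at least {extra_other:.2f}M, {ratio:.1f}x as many")
```

The numbers are consistent with the claims of less than 1.5M extra parameters for FPB and at least four times as many for competing methods.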
|Adaptive L2 ||ResNet50||59.4||79.6||-|
|Adaptive L2 ||ResNet101||61.9||81.3||-|
|Adaptive L2 ||ResNet152||62.2||81.7||-|
MSMT17: For the more challenging large-scale person Re-ID dataset MSMT17, we compare different configurations of our proposed FPB with other state-of-the-art methods in Table II. Here we implement two versions of FPB, with ResNet50 and ResNet101 as backbones respectively. In terms of mAP, the ResNet50 based implementation of FPB outperforms other methods by a clear gap. It even exceeds the second best approach, the ResNet152 based Adaptive L2 with more than 60M parameters, improving mAP from 62.2% to 63.5%. Meanwhile, the ResNet101 based FPB achieves comparable performance with its larger backbone. In terms of rank-1 accuracy, however, the ResNet50 based FPB is worse than the best result, from ABD-Net. This gap is closed by the larger ResNet101 backbone, with which FPB delivers the best rank-1 accuracy. Note that, even with the ResNet101 backbone, the number of parameters of our proposed FPB is still comparable to that of ABD-Net.
IV-D Ablation Study
We carefully construct the final structure of the proposed FPB step by step, starting from the IDE baseline. The influence of each attempt is listed in Table III. First, we empirically verify the effectiveness of previously proposed tricks; then we focus on configurations of the feature pyramid structure. Typical FPNs in object detection frameworks, e.g., Single Shot Detector (SSD) and PANet, normally retain a structure with more than three layers. Therefore we construct a three-layer FPB as the starting point. It takes the feature maps after layers 1, 2 and 3 of the backbone as input. As listed in Table III, our final comparison demonstrates that an FPB with two layers, after layers 2 and 3 of the backbone, delivers the optimal performance, which implies that overly detailed local features from a small receptive field might not be useful for feature matching in Re-ID. Due to variations in view and pose, these features could be too unstable to be fixed at specific positions in the resulting features. This is a major difference between person Re-ID and object detection, in which detailed edge information is required for object localization.
We also compared other configurations in Table III. Here the down-sampling mechanism refers to the extra edges from input to output node at each layer in FPB. This operation also bring obvious improvement based on simple connections. In contrast to the observation in , the weighted feature fusion at each node in FPB causes reduction of both mAP nd top-1 accuracy of models. At last, we compare different widths of convolutional filters including 256 and 512 channel numbers in the FPB. It turns out wider filters with extra computation consumptions can merely brings trivial impact here. Hence, filters with 256 channels become to our final choice in subsequent experiments.
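The two fusion options compared above can be contrasted with a small sketch. It assumes that "weighted feature fusion" means the fast normalized fusion with learnable non-negative weights used in BiFPN-style pyramids; this is an assumption of the sketch, and the function names are hypothetical. Feature maps are flattened to plain lists for brevity.

```python
from typing import List, Sequence


def sum_fusion(inputs: Sequence[Sequence[float]]) -> List[float]:
    """Simple fusion: element-wise sum of all incoming feature maps."""
    return [sum(values) for values in zip(*inputs)]


def weighted_fusion(inputs: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    eps: float = 1e-4) -> List[float]:
    """Assumed weighted fusion: one learnable scalar per incoming edge.

    Weights are clipped at zero (ReLU) and normalized by their sum, so
    each output element is a convex-like combination of the inputs.
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight per input feature map is required")
    clipped = [max(0.0, w) for w in weights]
    norm = eps + sum(clipped)
    if norm == 0.0:
        raise ValueError("all weights are zero and eps is zero")
    return [sum(w * v for w, v in zip(clipped, values)) / norm
            for values in zip(*inputs)]
```

With equal weights the weighted variant reduces to an average of the inputs, so it differs from the plain sum only by a scale factor; any benefit has to come from the learned weights departing from equality.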
Self-Attention Module: Although attention module has been proven as an effective mechanism in various vision tasks. We observe that only deploying them at carefully selected positions can bring positive influence to the whole model. In FPB, two self-attention modules are injected into the architecture at positions as illustrated in Fig. 2. We list the improvements brought by them in Table IV. The deployment of attention modules after the layer 2 of backbone and the lateral filter at the shallower layer of FPB finally delivers the optimal performance. This observation agrees with strategy of attention modules in , although there is only one attention module in AsNet. The extra attention module at FPB further prompts the performance with trivial increase of parameters, since channel number is severely reduced by lateral filters.
|+ Attention on backbone||89.3||95.2||98.3|
|+ Attention on FPB||90.2||95.6||98.7|
Cross Orthogonality Regularization: In Table IV, we also list the improvement brought by our proposed COR, which delivers the final performance of FPB. One can see that this joint constraint on the features from two layers brings a small but non-trivial improvement to the performance of the whole model. We also illustrate the learning curves in Fig. 4, which reflect the variation of the correlations over the feature maps during training. Here we take the feature maps at the two positions in Fig. 2 into account and compare three different strategies of orthogonality regularization during training. In the first case (No OR), we simply output the sum of the correlations without optimizing any regularization term. In the second case, we apply conventional OR to the two feature maps separately. In the third case, we apply our proposed COR to the two feature maps. This yields three learning curves.
Note that even without any explicit regularization, the correlation still declines during training. Reducing the correlation of the resulting feature maps is a natural effect of standard back-propagation learning, and it results in a more efficient model as well as more representative output features. The OR mechanism accelerates this process, helping the feature correlation reach a lower point more quickly at the beginning of training. Our proposed COR starts from a higher point than the other strategies because it is calculated differently, yet it finally delivers the lowest correlation between different channels of the resulting feature maps.
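The difference between regularizing the two feature maps separately and jointly can be illustrated with a minimal sketch. It is not the exact formulation of our OR or COR terms: it assumes a soft orthogonality penalty, the squared Frobenius distance between the Gram matrix of L2-normalized channels and the identity, and a joint variant that applies the same penalty to the stacked channels of both maps. All function names are hypothetical, each feature map is a list of flattened channels, and both maps are assumed to share the same spatial size.

```python
import math
from typing import List, Sequence

Channels = Sequence[Sequence[float]]  # one flattened H*W vector per channel


def _normalize(feat: Channels) -> List[List[float]]:
    """Scale every channel to unit L2 norm (all-zero channels stay zero)."""
    rows = []
    for row in feat:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows


def orthogonality_penalty(feat: Channels) -> float:
    """Assumed soft OR: squared Frobenius norm of (Gram matrix - identity)."""
    rows = _normalize(feat)
    penalty = 0.0
    for i, a in enumerate(rows):
        for j, b in enumerate(rows):
            gram_ij = sum(x * y for x, y in zip(a, b))
            target = 1.0 if i == j else 0.0
            penalty += (gram_ij - target) ** 2
    return penalty


def separate_penalty(feat_a: Channels, feat_b: Channels) -> float:
    """Conventional OR applied to each feature map on its own."""
    return orthogonality_penalty(feat_a) + orthogonality_penalty(feat_b)


def cross_orthogonality_penalty(feat_a: Channels, feat_b: Channels) -> float:
    """Assumed joint variant: penalize the stacked channels of both maps,
    so correlations between the two maps are counted as well."""
    return orthogonality_penalty(list(feat_a) + list(feat_b))
```

If both maps are internally orthogonal but carry the same channels, the separate penalty is zero while the joint penalty is not, which is the kind of shared redundancy that only a joint constraint can detect.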
This observation also helps to explain why injecting attention modules into relatively shallow levels of the network is more effective. The forward inference of CNNs naturally reduces the feature correlation layer by layer. Thus the information available in Equations 1 and 2 for the calculation of attention also becomes less and less.
From Fig. 5, we can observe this process more intuitively through the correlation of the output feature maps. From the correlation matrices in the first and second rows, one can see that the self-attention modules introduce extra correlation while enhancing the related features for embedding. This observation agrees with the effect of attention in . In Fig. 5(c), the OR mechanism significantly reduces the correlation within each of the two feature maps, improving the efficiency of these features. However, from the correlation matrix of  we can see that correlation between the two feature maps still exists. Our proposed COR finally reduces this co-redundancy and further improves the diversity between different branches of the model. Another observation from Fig. 5(b) is that the correlation of  is larger than that of . This observation also agrees with our aforementioned assumption that the correlation of features naturally shrinks during the inference of CNNs to ensure a more efficient extraction.
Influence of Part Number: For the output features of our proposed FPB, we adopt a structure similar to , in which several features represent the corresponding parts of the image respectively. In Fig. 6, we study how mAP and top-1 accuracy vary with the number of parts. On three different datasets, the tendency is consistent and explicit. The configuration of three parts in  proves to be optimal and is adopted in all of our experiments. This configuration is the same as , while differing from the six parts in the original part-based framework for person Re-ID . It implies that a smaller number of parts in the part branch can work better with the help of the global branch.
To further study the influence of our proposed dual-branch structure, we illustrate some feature embedding samples with their activation maps  in Fig. 7. For each input image, we compare the activation maps from three different configurations. The first one is generated by IDE as the baseline. The second one is the activation map of the output feature map after layer 4 of the global branch in Fig. 2. The third one is taken at the output of the feature pyramid branch before average pooling.
Fig. 7 shows two observations: (1) Rather than the small highlighted points in IDE, the output feature maps of the global branch show higher activation over the main part of the person. This is mainly because of the utilization of self-attention modules. (2) The attentive feature branch serves a complementary function, focusing on more detailed local parts of the person. Such diversity is ensured by two mechanisms. First, the asymmetrical architecture between branches naturally leads to different processing of features, which helps the branches focus on different parts during training. Second, the triplet loss over the concatenated output features from both branches encourages more diverse information to be extracted.
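Correlation matrices of the kind visualized in Fig. 5 can be computed, in a minimal sketch, as the Pearson correlation between flattened channels, with the mean absolute off-diagonal entry as a scalar summary. Whether Fig. 5 uses exactly this normalization is an assumption of the sketch, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0
    return cov / (std_x * std_y)


def correlation_matrix(channels: Sequence[Sequence[float]]) -> List[List[float]]:
    """Channel-by-channel correlation of a feature map with flattened channels."""
    return [[pearson(a, b) for b in channels] for a in channels]


def mean_abs_off_diagonal(matrix: Sequence[Sequence[float]]) -> float:
    """Average magnitude of the correlation between distinct channels."""
    n = len(matrix)
    if n < 2:
        return 0.0
    total = sum(abs(matrix[i][j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

A lower summary value corresponds to a visually sparser off-diagonal pattern in the matrix, i.e. less redundant channels.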
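The part number studied here controls how the output feature map is divided before pooling. As a minimal sketch, assuming the parts are uniform horizontal stripes that are average-pooled per channel (as in common part-based Re-ID models), the split can be written as follows. The function name `part_features` and the nested-list layout `feature_map[channel][row][column]` are illustrative only.

```python
from typing import List, Sequence

FeatureMap = Sequence[Sequence[Sequence[float]]]  # [channel][row][column]


def part_features(feature_map: FeatureMap, num_parts: int) -> List[List[float]]:
    """Split the height into `num_parts` stripes and average-pool each one.

    Returns one vector per part, each with one entry per channel.
    """
    height = len(feature_map[0])
    if not 1 <= num_parts <= height:
        raise ValueError("num_parts must be between 1 and the map height")
    parts = []
    for p in range(num_parts):
        # Integer boundaries keep the stripes as even as possible.
        top = p * height // num_parts
        bottom = (p + 1) * height // num_parts
        vector = []
        for channel in feature_map:
            values = [v for row in channel[top:bottom] for v in row]
            vector.append(sum(values) / len(values))
        parts.append(vector)
    return parts
```

With `num_parts=1` the function reduces to global average pooling, while larger values yield finer but less position-stable stripes.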
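The second mechanism can be made concrete with a short sketch of a triplet loss computed on the concatenation of the two branch outputs. The batch-hard mining strategy and the margin value of 0.3 are assumptions of this sketch rather than a statement of our training configuration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def concat(global_feat: Sequence[float], pyramid_feat: Sequence[float]) -> List[float]:
    """Concatenate the outputs of the global and feature pyramid branches."""
    return list(global_feat) + list(pyramid_feat)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings: Sequence[Sequence[float]],
                            labels: Sequence[int],
                            margin: float = 0.3) -> float:
    """Mean hinge loss over anchors, using the farthest positive and the
    nearest negative of each anchor within the batch."""
    losses = []
    for i, anchor in enumerate(embeddings):
        positives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if j != i and labels[j] == labels[i]]
        negatives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] != labels[i]]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

Because the distance is taken over the concatenated vector, a triplet that one branch already separates contributes little gradient, so the remaining signal tends to push the other branch toward information the first one misses.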
V Conclusion and Future Work
We propose a novel structure called feature pyramid branch for the task of person Re-ID. The structure can serve as an affiliated branch to complement the backbone. Features at different scales are extracted and aggregated in this branch to produce a final embedding with higher diversity. Combined with the attention mechanism and orthogonality regularization, our proposed FPB structure can significantly improve model performance with a trivial increase in computational complexity. For future work, we will continue to investigate the potential of the feature pyramid structure in various scenarios of feature embedding, specifically its compatibility with more lightweight backbones.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §II-D.
-  (2017-07) Scalable person re-identification on supervised smoothed manifold. pp. 3356–3365. Cited by: §IV-B.
-  (2020) YOLOv4: optimal speed and accuracy of object detection. Cited by: §I, §II-C.
-  (2019) Mixed high-order attention network for person re-identification. arXiv preprint arXiv:1908.05819. Cited by: TABLE I.
-  (2021) Learning 3D shape feature for texture-insensitive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8146–8155. Cited by: TABLE I.
-  (2019) ABD-Net: attentive but diverse person re-identification. In 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), pp. 8351–8361. Cited by: §I, §II-A, §II-B, §II-B, §II-D, §II-D, §II-D, §III-B, §III-C, §IV-C, §IV-C, §IV-D, TABLE I, TABLE II.
-  (2019) Batch DropBlock network for person re-identification and beyond. In 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), pp. 3691–3701. Cited by: §II-A, TABLE I.
-  (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §II-D.
-  (2019) SphereReID: deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation 60, pp. 51–58. Cited by: §I.
-  (2018) Efficient and deep person re-identification using multi-level similarity. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2335–2344. Cited by: §II-B.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. Cited by: §IV-B.
-  (2017) Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §II-C.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §III, §IV-B.
-  (2017) In defense of the triplet loss for person reidentification. arXiv preprint arXiv:1703.07737. Cited by: §I, §III-D.
-  (2019) Interaction-and-aggregation network for person re-identification. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9309–9318. Cited by: TABLE II.
-  (2019) VRSTC: occlusion-free video person re-identification. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7176–7185. Cited by: §I.
-  (2020) Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (8), pp. 2011–2023. Cited by: §II-D.
-  (2017) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. Cited by: §III.
-  (2018) Adversarially occluded samples for person re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5098–5107. Cited by: §I.
-  (2020) Improve person re-identification with part awareness learning. IEEE Transactions on Image Processing 29, pp. 7468–7481. Cited by: §I, TABLE II.
-  (2018) Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407. Cited by: §II-A.
-  (2014) Joint learning for attribute-consistent person re-identification. In European Conference on Computer Vision, pp. 134–146. Cited by: §II-A.
-  (2012) Large scale metric learning from equivalence constraints. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2288–2295. Cited by: §II-A.
-  (2019) Attention network robustification for person reid. arXiv preprint arXiv:1910.07038. Cited by: §II-A, §II-B.
-  (2021) Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6729–6738. Cited by: TABLE I, TABLE II.
-  (2014-06) DeepReID: deep filter pairing neural network for person re-identification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 152–159. Cited by: §IV-A, §IV-A.
-  (2018) Harmonious attention network for person re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2285–2294. Cited by: §II-A, §II-D.
-  (2021) Diverse part discovery: occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2898–2907. Cited by: TABLE I.
-  (2018-09) DetNet: design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §II-C.
-  (2017) Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944. Cited by: §I, §II-C, §III-A.
-  (2014) Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755. Cited by: §III-A.
-  (2018) Path aggregation network for instance segmentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8759–8768. Cited by: §II-C, §III-A, §IV-D.
-  (2016) SSD: single shot multibox detector. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 21–37. Cited by: §II-C, §IV-D.
-  (2019) Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §I, §II-A, §III-D, §III-D, §III, §IV-B, §IV-D, TABLE I.
-  (2020) Adaptive L2 regularization in person re-identification. Cited by: §II-A, §IV-C, TABLE I, TABLE II.
-  (2015) Learning to rank in person re-identification with metric ensembles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1846–1855. Cited by: §I.
-  (2017) Multi-scale deep learning architectures for person re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5409–5418. Cited by: §II-B.
-  (2016) Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 17–35. Cited by: §IV-A, §IV-A.
-  (2018) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 420–429. Cited by: §I.
-  (2018) End-to-end deep Kronecker-product matching for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6886–6895. Cited by: TABLE I.
-  (2018) Mask-guided contrastive attention model for person re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1179–1188. Cited by: §I.
-  (2017) Pose-driven deep convolutional model for person re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3980–3989. Cited by: TABLE II.
-  (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 480–496. Cited by: §I, §I, §I, §II-A, §II-B, §III, §IV-B, §IV-D, TABLE I.
-  (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §I, §III.
-  (2020) EfficientDet: scalable and efficient object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10778–10787. Cited by: §I, §II-C, §III-A, §IV-D.
-  (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §II-D.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §II-D, §III-B.
-  (2018) Mancs: a multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–381. Cited by: TABLE I.
-  (2018) Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 274–282. Cited by: TABLE I.
-  (2018) Parameter-free spatial attention network for person re-identification. arXiv preprint arXiv:1811.12150. Cited by: §I.
-  (2021) HOReID: deep high-order mapping enhances pose alignment for person re-identification. IEEE Transactions on Image Processing 30, pp. 2908–2922. Cited by: §I, §IV-C, TABLE I, TABLE II.
-  (2019) ECA-Net: efficient channel attention for deep convolutional neural networks. Cited by: §II-D.
-  (2018) Non-local neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §II-D.
-  (2018) Person transfer GAN to bridge domain gap for person re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 79–88. Cited by: §II-A, §IV-A, §IV-A.
-  (2017) GLAD: global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 25th ACM international conference on Multimedia, pp. 420–428. Cited by: TABLE II.
-  (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §I.
-  (2020) Diversity-achieving slow-dropblock network for person re-identification. Cited by: §II-A.
-  (2019) Second-order non-local attention networks for person re-identification. In 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), pp. 3760–3769. Cited by: TABLE I.
-  (2020) Learning diverse features with part-level resolution for person re-identification. In Pattern Recognition and Computer Vision, Y. Peng, Q. Liu, H. Lu, Z. Sun, C. Liu, X. Chen, H. Zha, and J. Yang (Eds.), Cham, pp. 16–28. Cited by: §II-A, §II-B, §II-B.
-  (2019-06) Towards rich feature discovery with class activation maps augmentation for person re-identification. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1389–1398. Cited by: §I, §II-A, §II-B, §IV-E, TABLE I.
-  (2021) Deep learning for person re-identification: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §I.
-  (2020) AsNet: asymmetrical network for learning rich features in person re-identification. IEEE Signal Processing Letters 27, pp. 850–854. Cited by: §II-A, §II-B, §III, §IV-D, §IV-D, TABLE I.
-  (2019) Learning incremental triplet margin for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9243–9250. Cited by: §IV-C.
-  (2020) Relation-aware global attention for person re-identification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3183–3192. Cited by: TABLE I, TABLE II.
-  (2017) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 907–915. Cited by: §I.
-  (2019) M2det: a single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 9259–9266. Cited by: §II-C.
-  (2019) Pyramidal person re-identification via multi-loss dynamic training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8514–8522. Cited by: §I, §II-A, §II-B, TABLE I.
-  (2015-12) Scalable person re-identification: a benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124. Cited by: §IV-A, §IV-A.
-  (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: §I, §I, §III-D, §III, §IV-B.
-  (2019) Joint discriminative and generative learning for person re-identification. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2133–2142. Cited by: TABLE II.
-  (2017-07) Re-ranking person re-identification with k-reciprocal encoding. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3652–3661. Cited by: §IV-B.
-  (2020) Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 13001–13008. Cited by: §II-A, §IV-B.
-  (2019) Omni-scale feature learning for person re-identification. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3701–3711. Cited by: §II-A, §II-A, §II-D, §IV-B, TABLE II.