Person re-identification (PReID) is the task of recognizing the same person across multiple surveillance cameras. Given a query image, the aim is to retrieve all images of the depicted person from a gallery dataset. This task has attracted the attention of many researchers in computer vision because of its importance in applications such as video surveillance for public security. With the recent success of deep convolutional neural networks (CNNs), PReID performance has progressed significantly: deep representations provide high discriminative ability, especially when aggregated from part-based deep local features.
Current studies in PReID can be categorized into global feature-based and local part-based models. The local part-based models cope better with certain variations such as partial occlusion. Sun et al., for instance, presented the part-based convolutional baseline (PCB), which horizontally divides the last feature map into multiple stripes, each containing part of the person's body in the input image. A refinement mechanism is then applied to each piece to guarantee that its feature map focuses on the correct body part. PCB is a simple and effective framework that outperforms other part-based models. However, it does not consider global features, which play an important role in recognition and identification tasks and are normally robust to multiple variations. Moreover, since its stripes do not overlap, PCB loses important information that may lie at the boundaries of the divided stripes.
Global feature-based models focus on contour, shape, and texture representations. For example, Wang et al. built the DaReNet model based only on global information, using a multiple granularity network to extract global features at different resolutions. Hermans et al. presented a ResNet-50 based classifier that uses global information. Shen et al. combined global features with a random-walk algorithm. Li et al. proposed an attention-based model. Luo et al. reported a strong CNN-based model with a bag of training tricks, including augmentation and regularization. However, these methods may fail in the presence of object occlusion, multiple poses, and lighting variations, and they usually depend on pre- and post-processing steps to boost their performance.
To address the above problems, other groups combined both global and local features. Li et al. fused local and global features with mutual learning, but did not train the model with multiple loss functions. He et al. used an attention-aware model that combines global and local features. Quan et al. introduced neural architecture search to PReID, searching for the best CNN structure with a part-aware module in the search space that employs both part and global information.
Different loss functions have also been proposed to boost the performance of PReID models. Two are widely used: the triplet loss and the cross-entropy loss. The triplet loss operates on feature-space distances, while the cross-entropy loss performs classification with fully connected (FC) layers. Hermans et al. and Zhang et al. modified the triplet loss to improve training performance. Fan et al. presented a classification model based on an extended version of the cross-entropy loss and a warm-up learning rate to learn a hypersphere manifold embedding. Recently, several models [4, 11] have been trained using a combination of the triplet and cross-entropy losses.
In this paper, we propose a multi-resolution overlapping stripes (MROS) model that combines global and local information at multiple resolutions with different loss functions. First, based on the residual network (ResNet-50), multiple levels are created, each with a different resolution. Inspired by the PCB model, the feature map of each level is divided horizontally into multiple stripes, which are then processed in overlapping pairs rather than individually. The overlapping avoids the loss of information at the boundaries of the stripes that usually occurs with part-based models. Second, instead of using the features from all multi-resolution levels for classification, only the features from the last two levels are considered, because the later levels of the network learn more semantic representations than the early layers. Third, local and global features are combined using different loss functions. Experiments on the Market-1501 dataset, a large-scale person dataset widely used for the PReID task, show the effectiveness of the presented approach.
2 Multi-Resolution Overlapping Stripes Model
Given a collection of images divided into query, gallery and training sets, PReID aims to find the images of each pedestrian from a query set in the gallery set. To address this problem, we propose a multi-resolution overlapping stripes (MROS) model as shown in Fig. 1.
The MROS model is constructed as follows. Firstly, inspired by DaReNet, we construct a multi-level feature model. Instead of using every feature level, we only use the last two (Section 2.1); this reduces the computational complexity and increases the model performance. Secondly, the local features are extracted by extending the PCB network: instead of fixed, non-overlapping stripes, an overlapping partitioning technique is employed based on pairs of stripes rather than individual ones (Section 2.1). This helps our method avoid missing features at the boundaries of the individual stripes. Lastly, inspired by the recent success of local and global feature fusion [8, 7, 4] and loss function fusion [9, 4, 11], several loss functions based on local and global features are employed to boost the performance of the model (Section 2.2).
2.1 Network Architecture
As shown in Figure 1, the backbone network of our model is ResNet-50, a CNN trained on more than a million images from the ImageNet database and consisting of four convolutional blocks. To build a multi-resolution model, we only consider the outputs of the last two conv blocks. The part-based model is constructed by dividing each of the two feature tensors into equal horizontal stripes; adjacent stripes are then grouped in pairs, and global average pooling (GAP) is applied to each overlapping pair, generating one local feature vector per pair. After that, batch normalization (BN) layers are applied to the pooled vectors to overcome overfitting and boost the performance of the system.

For classification, an FC layer is added after each local feature vector. The FC layers attached to the last conv block are 2048-dimensional, while those attached to the preceding block are 1024-dimensional. The final feature descriptor is defined by concatenating the local feature vectors of the two levels. The stripe-level feature vectors are used during training, while the concatenated descriptor is used at testing.
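The overlapping pooling step described above can be sketched in NumPy. This is an illustrative version only: the stripe count (six) and the feature-map size are assumptions, since the paper's exact values are not stated here.

```python
import numpy as np

def overlapping_stripe_pool(feat, n_stripes=6):
    """Pool a C x H x W feature map into overlapping stripe descriptors.

    The map is first cut into `n_stripes` equal horizontal stripes;
    adjacent stripes are then grouped in pairs, so each of the
    n_stripes - 1 descriptors covers two neighbouring stripes and the
    boundary between them is never lost.
    """
    C, H, W = feat.shape
    assert H % n_stripes == 0, "height must divide evenly into stripes"
    h = H // n_stripes
    descriptors = []
    for i in range(n_stripes - 1):
        region = feat[:, i * h:(i + 2) * h, :]        # two adjacent stripes
        descriptors.append(region.mean(axis=(1, 2)))  # global average pooling
    return np.stack(descriptors)                      # (n_stripes - 1, C)

# toy example with an assumed 2048 x 24 x 8 conv5-style feature map
feat = np.random.rand(2048, 24, 8)
local = overlapping_stripe_pool(feat, n_stripes=6)
print(local.shape)  # (5, 2048)
```

Each descriptor spans two stripe heights, so features at the seam between stripes i and i+1 contribute to descriptor i, which is the motivation for the overlap.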
2.2 Loss Functions
During training, the MROS model is optimized by minimizing a fusion of three loss functions: the triplet loss combined with the center loss for metric learning, and the cross-entropy loss for classification.
Firstly, instead of calculating individual losses for each stripe, a global feature vector $g$ is defined by concatenating the stripe-level feature vectors. The batch-hard triplet loss is then applied on $g$ as follows:

$$ L_{tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \alpha + \max_{p=1..K} \left\| g_a^{(i)} - g_p^{(i)} \right\|_2 - \min_{\substack{j=1..P,\, j \ne i \\ n=1..K}} \left\| g_a^{(i)} - g_n^{(j)} \right\|_2 \right]_{+} \quad (1) $$

where $P$ is the number of identities in a batch, $K$ is the number of images per identity in a batch, $\alpha$ is the loss margin, and $g_a$, $g_p$, $g_n$ are the feature vectors of the anchor, positive, and negative samples.
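The batch-hard mining can be sketched in NumPy as follows. The margin value is illustrative, and the loss is averaged over anchors here rather than summed; neither choice is asserted to be the paper's.

```python
import numpy as np

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss (Hermans et al.) on L2 distances.

    For every anchor, take the farthest positive and the closest
    negative inside the batch, then hinge on the margin.
    margin=0.3 is a common choice, not necessarily the paper's value.
    """
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    loss = 0.0
    for a in range(len(feats)):
        pos = labels == labels[a]
        neg = ~pos
        hardest_pos = dist[a][pos].max()  # farthest same-identity sample
        hardest_neg = dist[a][neg].min()  # closest different-identity sample
        loss += max(0.0, margin + hardest_pos - hardest_neg)
    return loss / len(feats)

# two identities, two images each (1-D features for readability)
loss = batch_hard_triplet_loss([[0.0], [2.0], [1.0], [3.0]], [0, 0, 1, 1])
print(loss)  # 1.3: positives are farther apart than negatives here

# well-separated identities give zero loss
loss0 = batch_hard_triplet_loss([[0.0], [1.0], [10.0], [11.0]], [0, 0, 1, 1])
print(loss0)  # 0.0
```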
At this stage, the center loss is also applied on the global feature vector to tighten the feature distribution in the feature space:

$$ L_{c} = \frac{1}{2} \sum_{i=1}^{B} \left\| g_i - c_{y_i} \right\|_2^2 \quad (2) $$

where $B$ is the batch size and $c_{y_i}$ is the class-center vector of the $i$-th sample's identity $y_i$.
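A minimal NumPy sketch of the center-loss computation follows; in practice the class centers are learned parameters updated during training, whereas here they are passed in as a fixed array.

```python
import numpy as np

def center_loss(feats, labels, centers):
    """Center loss (Wen et al.): half the summed squared distance
    between each feature and the center of its class.
    `centers` has one row per class; it would be updated during
    training, which is omitted in this sketch."""
    feats = np.asarray(feats, dtype=float)
    diffs = feats - centers[np.asarray(labels)]
    return 0.5 * (diffs ** 2).sum()

# two samples, one per class, both centers at the origin
centers = np.array([[0.0, 0.0], [0.0, 0.0]])
loss = center_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1], centers)
print(loss)  # 1.0
```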
Secondly, the cross-entropy loss is computed for each stripe $s$ of the local feature vectors:

$$ L_{ce}^{(s)} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left( W_{y_i}^{\top} f_i^{(s)} + b_{y_i} \right)}{\sum_{c=1}^{C} \exp\left( W_c^{\top} f_i^{(s)} + b_c \right)} \quad (3) $$

where $B$ is the batch size, $C$ is the number of classes in the training set, $W_c$ is the weight vector of the FC layer for class $c$, and $b_c$ is the corresponding bias. The total cross-entropy loss is then the mean over all $S$ stripes:

$$ L_{ce} = \frac{1}{S} \sum_{s=1}^{S} L_{ce}^{(s)} \quad (4) $$
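The per-stripe classification loss and its mean can be sketched in NumPy; the FC layers are represented here simply by pre-computed logits of shape (stripes, batch, classes).

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def stripe_cross_entropy(stripe_logits, labels):
    """Mean cross-entropy over all stripes and the batch.

    stripe_logits: (S, B, C) array, one classifier output per stripe.
    """
    S, B, _ = stripe_logits.shape
    total = 0.0
    for s in range(S):
        for i in range(B):
            total += softmax_cross_entropy(stripe_logits[s, i], labels[i])
    return total / (S * B)

# uniform logits over 2 classes give a loss of ln 2 per sample
logits = np.zeros((3, 2, 2))  # 3 stripes, batch of 2, 2 classes
loss = stripe_cross_entropy(logits, [0, 1])
print(loss)  # ~0.6931
```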
The label smoothing (LS) technique is applied to the cross-entropy targets to improve accuracy and prevent classification overfitting. The overall training objective fuses the three losses:

$$ L = L_{tri} + \beta \, L_{c} + L_{ce} \quad (5) $$

where $\beta$ is the weight of the center loss.
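Label smoothing replaces the one-hot classification target with a softened distribution. A minimal sketch follows; eps = 0.1 is the commonly used value, not necessarily the paper's.

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Label smoothing (Szegedy et al.): mix the one-hot target with a
    uniform distribution, giving (1 - eps) + eps/n_classes to the true
    class and eps/n_classes to every other class."""
    labels = np.asarray(labels)
    targets = np.full((len(labels), n_classes), eps / n_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - eps
    return targets

t = smooth_labels([1, 0], n_classes=4, eps=0.1)
print(t[0])  # true class gets 0.925, the rest get 0.025 each
```

Because the target never reaches exactly 1, the classifier is discouraged from producing over-confident logits, which is the intended regularization effect.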
3 Experiments

The MROS model is evaluated on Market-1501, the large-scale person dataset most widely used for PReID. It was collected from six cameras with overlapping fields of view, five with HD resolution and one with SD resolution, and contains bounding boxes of the individuals generated by a person detector. Following the standard protocol, the dataset is split into a training set and a testing set. Single-query mode is used, i.e. each query image is searched in the gallery set individually.
The mean average precision (mAP) and the Rank-1, Rank-5, and Rank-10 accuracies are used to evaluate the MROS performance. The area under the precision-recall curve, also known as average precision (AP), is calculated for each query image; the mAP is the mean of the APs over all queries.
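The mAP computation just described can be written in a few lines of plain Python, given each query's gallery ranking as a list of match flags:

```python
def average_precision(ranked_matches):
    """AP of one query: `ranked_matches` is the ranked gallery as a
    list of booleans (True = same identity as the query)."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """mAP: mean of the per-query APs."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

ap = average_precision([True, False, True])          # (1/1 + 2/3) / 2
m = mean_average_precision([[True], [False, True]])  # (1.0 + 0.5) / 2
print(ap, m)
```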
3.1 Experimental Setup
Two Nvidia GeForce GTX Ti GPUs are used for implementation. All implementations are done in Python with the PyTorch library.
Data augmentation is used to combat overfitting by artificially enlarging the training set with class-preserving transformations. In our experiments, several types of data augmentation are employed: zero padding followed by random cropping, random horizontal flipping, and image normalization with the same mean and standard deviation values as the ImageNet dataset. Random erasing is also applied, filling the erased region with the ImageNet pixel mean values.
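The random-erasing step can be sketched in NumPy. The hyper-parameters below (erase probability, area range, aspect-ratio range) are the defaults of the original random-erasing paper, not necessarily those used by MROS, and the fill values are the ImageNet channel means.

```python
import numpy as np

def random_erase(img, p=0.5, area_range=(0.02, 0.4),
                 fill=(0.485, 0.456, 0.406), rng=None):
    """Random erasing (Zhong et al.): with probability p, occlude a
    random rectangle of the image with the ImageNet channel means.
    img: H x W x 3 float array in [0, 1]."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() > p:
        return img
    H, W, _ = img.shape
    area = rng.uniform(*area_range) * H * W
    aspect = rng.uniform(0.3, 1 / 0.3)
    h = min(H, max(1, int(round(np.sqrt(area * aspect)))))
    w = min(W, max(1, int(round(np.sqrt(area / aspect)))))
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    out = img.copy()
    out[top:top + h, left:left + w] = fill  # broadcast fill over channels
    return out

img = np.zeros((64, 64, 3))
out = random_erase(img, p=1.0, rng=np.random.default_rng(0))
```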
The model is trained with the Adam optimizer over a fixed number of epochs with a warm-up period. The learning rate is then reduced using a staircase function by a constant factor every fixed number of epochs. Each batch is composed of $P$ identities with $K$ images per identity (Equation 1); the margin $\alpha$ in Equation 1 and the weight $\beta$ of the center loss in Equation 5 are fixed hyper-parameters. The ECN method is used for re-ranking.
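The staircase schedule amounts to multiplying the learning rate by a constant factor at fixed intervals. A one-line sketch follows; `base_lr`, `factor`, and `step` are illustrative placeholders, not the paper's settings.

```python
def staircase_lr(epoch, base_lr=3.5e-4, factor=0.1, step=20):
    """Staircase learning-rate schedule: multiply by `factor` every
    `step` epochs. All defaults are illustrative values only."""
    return base_lr * factor ** (epoch // step)

for e in (0, 25, 40):
    print(e, staircase_lr(e))
```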
3.2 Experimental Results
This section presents the performance evaluation of different settings of the MROS model on Market-1501 dataset. It also includes comparisons with the state-of-the-art methods.
To evaluate the effectiveness of each step of the presented model, we incrementally measure the accuracy as follows.
Setting I is the baseline model, constructed using part-based features with non-overlapping stripes to generate the local feature vectors. In this experiment, all loss functions (triplet, center, and cross-entropy) are applied on the local feature vectors.
Setting II is similar to Setting I except that it uses overlapping stripes.
Setting III evaluates the effectiveness of combining global and local features by generating the global feature vector and applying the triplet and center losses on it, while the cross-entropy loss is applied on the local stripe features.
Setting IV evaluates the effectiveness of the multi-level features by considering the features of the last two levels.
| Setting | Description | mAP | Rank-1 |
|---|---|---|---|
| II | Overlapping Stripes (OS) | 82.8 | 93.5 |
| III | OS with Global Features | 84.0 | 94.2 |
Table 1 presents these settings along with the experimental results. The baseline Setting I achieves promising results, and Setting II increases the performance by using overlapping stripes. Combining global and local features boosts the performance further in Setting III. Finally, the best results are obtained by Setting IV, which combines all previous settings with multi-resolution features.
| Method | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| Strong ReID | 85.9 | 94.5 | - | - |
A comparison of the experimental results between MROS in single-query mode and the related methods is presented in Table 2 and Table 3, without and with re-ranking, respectively. The MROS model achieves mAP = 84.2% and Rank-1 = 94.4% without re-ranking, and mAP = 93.5% and Rank-1 = 95.5% with re-ranking. The results in Table 2 show that the proposed MROS model without re-ranking achieves competitive performance. Most of the re-ranked PReID models in Table 3 report Rank-1 results within a small margin of each other; as can be observed from the table, our MROS model outperforms the state-of-the-art models. This is because MROS is better able to learn body parts and the spatial correlation between them by employing overlapping stripes, and to learn discriminative features by employing multiple resolutions and different loss functions.
| Method | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| Strong ReID | 94.2 | 95.4 | - | - |
4 Conclusion

This paper extended the part-based convolutional baseline (PCB) and the multi-resolution model to solve the problem of pedestrian retrieval. Using the residual network (ResNet-50) as the backbone, multiple levels with different resolutions are created to generate feature maps. A simple uniform partitioning is then applied to the feature maps of the last two conv blocks, and the generated stripe features are combined with overlapping. Using different types of loss functions, both global and local representations are considered for classification. Experimental results show that our approach outperforms the state-of-the-art methods.
References

- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
- X. Fan, W. Jiang, H. Luo and M. Fei (2019) SphereReID: deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation 60, pp. 51–58.
- K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2019) MFBN: an efficient base model for person re-identification. In Proceedings of the 2019 4th International Conference on Mathematics and Artificial Intelligence, pp. 44–50.
- A. Hermans, L. Beyer and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737.
- D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
- (2019) Pedestrian re-identification based on tree branch network with local and global learning. arXiv:1904.00355.
- W. Li, X. Zhu and S. Gong (2018) Harmonious attention network for person re-identification. In CVPR, pp. 2285–2294.
- H. Luo, Y. Gu, X. Liao, S. Lai and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In CVPR Workshops.
- A. Paszke et al. (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- R. Quan, X. Dong, Y. Wu, L. Zhu and Y. Yang (2019) Auto-ReID: searching for a part-aware ConvNet for person re-identification. arXiv:1903.09776.
- M. S. Sarfraz, A. Schumann, A. Eberle and R. Stiefelhagen (2018) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, pp. 420–429.
- F. Schroff, D. Kalenichenko and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823.
- Y. Shen et al. (2018) Deep group-shuffling random walk for person re-identification. In CVPR, pp. 2265–2274.
- Y. Sun, L. Zheng, Y. Yang, Q. Tian and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, pp. 501–518.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826.
- Y. Wang et al. (2018) Resource aware person re-identification across multiple resolutions. In CVPR, pp. 8042–8051.
- Y. Wen, K. Zhang, Z. Li and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515.
- Zhang et al. (2019) Learning incremental triplet margin for person re-identification. In AAAI, Vol. 33, pp. 9243–9250.
- L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang and Q. Tian (2015) Scalable person re-identification: a benchmark. In ICCV, pp. 1116–1124.
- Z. Zhong, L. Zheng, G. Kang, S. Li and Y. Yang (2017) Random erasing data augmentation. arXiv:1708.04896.