Multi-Resolution Overlapping Stripes Network for Person Re-Identification

10/27/2019 ∙ by Arda Efe Okay, et al. ∙ 0

This paper addresses the person re-identification (PReID) problem by combining global and local information at multiple feature resolutions with different loss functions. Many previous studies address this problem using either part-based features or global features. Part-based representations do not consider the spatial correlation between parts, while global representations are not sensitive to spatial variations. This paper presents a part-based model with a multi-resolution network that uses features from different levels. The outputs of the last two conv blocks are partitioned horizontally and processed in pairs as overlapping stripes to cover important information that might lie between parts. We use different loss functions to combine local and global information for classification. Experimental results on a benchmark dataset demonstrate that the presented method outperforms the state-of-the-art methods.


1 Introduction

Person re-identification (PReID) is the task of identifying the presence of a person across multiple surveillance cameras. Given a query image, the aim is to retrieve all images of the specified person from a gallery dataset. This task has attracted the attention of many researchers in computer vision because of its importance in applications such as video surveillance for public security. With the recent success of deep convolutional neural networks (CNNs), PReID performance has made significant progress. Deep representations provide high discriminative ability, especially when aggregated from part-based deep local features.

Current related studies in PReID can be categorized into global feature-based and local part-based models. The local part-based models perform better under certain variations such as partial occlusion. Sun et al.[15], for instance, presented the part-based convolutional baseline (PCB), which horizontally divides the last feature maps into multiple stripes, each of which contains part of the person's body in the input image. A refinement mechanism is then applied to each piece to ensure that the feature map of this part focuses on the correct body part. PCB is a simple and effective framework that outperforms the other part-based models. However, it does not consider global features, which play an important role in recognition and identification tasks and are normally robust to multiple variations. In addition, because its stripes do not overlap, it loses important information that might lie at the edges of the divided stripes.

Global feature-based models focus on contour, shape, and texture representations. For example, Wang et al.[17] built the DaReNet model based only on global information, using a multiple granularity network to extract global features at different resolutions. Hermans et al.[5] presented a ResNet-50 based classifier which uses global information. Shen et al.[14] combined global features with a random walk algorithm. Li et al.[8] proposed an attention-based model. Luo et al.[9] reported a strong CNN-based model with a bag of learning tricks including augmentation and regularization. However, those methods may fail in the presence of object occlusion, multiple poses and lighting variations, and they usually depend on pre- and post-processing steps to boost their performance.

To address the above problems, other groups combined both global and local features. Li et al.[7] fused local and global features using mutual learning, but they did not train the model with multiple loss functions. He et al.[4] used an attention-aware model that combines global and local features. Quan et al.[11] introduced neural architecture search to PReID by searching for the best CNN structure, and added a part-aware module to the PReID search space that employs both part and global information.

Figure 1: Illustration of the multi-resolution overlapping stripes (MROS) model for PReID. The backbone network is ResNet-50 [3] with four conv blocks. The input image goes through ResNet-50 to generate two feature tensors, taken from the last two conv blocks, which are used by the overlapping stripe approach to obtain pooled feature vectors. The global feature vector is obtained by concatenating these pooled vectors, while the local feature vectors are obtained by applying a BN layer after them. Triplet and center losses are applied to the global feature vector, and cross-entropy loss is applied to the local feature vectors. The model is trained with the total loss in Equation 5.

Different loss functions have also been presented to boost the performance of PReID models. Two loss functions are widely used: triplet loss [13] and cross-entropy loss [18]. Triplet loss is based on distances between feature embeddings, while cross-entropy loss is based on classification with fully connected (FC) layers. Hermans et al.[5] and Zhang et al.[19] modified triplet loss to improve training performance. Fan et al.[2] presented a classification model based on an extended version of the cross-entropy loss function and a warm-up learning rate to learn a hypersphere manifold embedding. Recently, several models [4, 11] have been trained using a combination of triplet loss and cross-entropy loss.

In this paper, we propose a multi-resolution overlapping stripes (MROS) model that combines global and local information at multiple resolutions with different loss functions. First, based on the residual network (ResNet50) [3], multiple levels are created, each of which has a different resolution. Inspired by the PCB model [15], the feature map from each level is divided horizontally into multiple stripes, which are later processed in overlapping pairs rather than individually. The overlap avoids the loss of information at the boundaries of stripes that usually occurs in part-based models. Secondly, instead of using the features from all multi-resolution levels for classification as in [17], only the features from the last two levels are considered. This is because the later levels of the model learn more semantic representations than the early layers. Thirdly, local and global features are combined using different loss functions. Experiments on the Market-1501 [20] dataset, a large-scale person dataset widely used for the PReID task, show the effectiveness of the presented approach.

2 Multi-Resolution Overlapping Stripes Model

Given a collection of images divided into query, gallery and training sets, PReID aims to find the images of each pedestrian from a query set in the gallery set. To address this problem, we propose a multi-resolution overlapping stripes (MROS) model as shown in Fig. 1.

The MROS model is constructed as follows. Firstly, inspired by the model presented in DaReNet [17], we construct a multi-level feature model. Instead of using every feature level, we use only the last two feature levels (Section 2.1). This reduces the computational complexity and increases the model performance. Secondly, the local features are extracted by extending the PCB [15] network. Instead of using stationary and non-overlapping stripes, an overlapping partitioning technique is employed based on pairs of stripes rather than individual ones (Section 2.1). This technique helps our method avoid missing features at the boundaries of the individual stripes. Lastly, inspired by the recent successes of local and global feature fusion [8, 7, 4] and loss function fusion [9, 4, 11], various loss functions based on local and global features are employed in this work to boost the performance of the model (Section 2.2).

2.1 Network Architecture

As shown in Figure 1, the backbone network for our model is ResNet-50. It is a CNN trained on more than a million images from the ImageNet [1] database and consists of four convolutional blocks. To build a multi-resolution model, we consider only the outputs of the last two conv blocks. The part-based model is constructed by dividing each of these two feature tensors into equal horizontal stripes; adjacent stripes are then grouped in pairs, and global average pooling (GAP) is applied to each of the resulting overlapping stripes. For each of the two tensors, the GAP operation thus generates a set of new feature vectors, one per overlapping stripe. After that, batch normalization (BN) layers are applied to these vectors to obtain the local features, which reduces overfitting and boosts the performance of the system.
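The partitioning step can be illustrated with a minimal pure-Python sketch. It assumes a feature tensor stored as nested lists indexed [channel][row][column], a height that divides evenly into the stripes, and that every pair of adjacent stripes forms one overlapping region, so n stripes yield n − 1 pooled vectors. The function name `overlapping_stripe_gap` and this pairing rule are illustrative assumptions, not the authors' implementation.

```python
def overlapping_stripe_gap(feature_map, num_stripes):
    """Average-pool each pair of adjacent horizontal stripes.

    feature_map: nested lists indexed [channel][row][column].
    Returns num_stripes - 1 vectors, each with one value per channel.
    (Assumed pairing rule: stripes (0,1), (1,2), ..., so neighbours overlap.)
    """
    height = len(feature_map[0])
    if num_stripes < 2 or height % num_stripes != 0:
        raise ValueError("height must split evenly into at least two stripes")
    stripe_height = height // num_stripes

    pooled = []
    for s in range(num_stripes - 1):
        top = s * stripe_height
        bottom = top + 2 * stripe_height  # two adjacent stripes
        vector = []
        for channel in feature_map:
            values = [v for row in channel[top:bottom] for v in row]
            vector.append(sum(values) / len(values))  # global average pooling
        pooled.append(vector)
    return pooled


# Toy tensor: 1 channel, 4 rows, 2 columns.
toy = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]]
pooled_vectors = overlapping_stripe_gap(toy, num_stripes=4)
```

With four stripes the toy tensor produces three pooled vectors, and each interior row contributes to two of them, which is how the overlap keeps boundary information.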

For classification, FC layers are added after the local features. Note that the FC layers for the features from the last conv block are 2048-dimensional, while those for the features from the preceding block are 1024-dimensional. The feature descriptor is defined by concatenating the pooled feature vectors. The local features and the feature descriptor are used during training, while the feature descriptor alone is used at testing.

2.2 Loss Functions

During the training, the MROS model is optimized by minimizing the fusion of three different loss functions including the triplet loss combined with center loss for metric learning and the cross-entropy loss for classification.

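Firstly, instead of calculating individual losses for each stripe, a global feature vector $f$ is defined by concatenating the pooled feature vectors from the two levels. The batch-hard triplet loss [5] is then applied to $f$ as follows:

$$L_{\mathrm{triplet}} = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[\, m + \max_{p=1,\dots,K}\big\lVert f_a^{(i)} - f_p^{(i)}\big\rVert_2 - \min_{\substack{j=1,\dots,P,\; j\neq i \\ n=1,\dots,K}}\big\lVert f_a^{(i)} - f_n^{(j)}\big\rVert_2 \Big]_+ \tag{1}$$

where $P$ is the number of identities in a batch, $K$ is the number of images of each identity in a batch, $m$ is the loss margin, $f_a$, $f_p$ and $f_n$ are the feature vectors of the anchor, positive and negative samples, the superscript denotes the identity, and $[\cdot]_+ = \max(\cdot, 0)$.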
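As a rough illustration of batch-hard mining, the sketch below computes the loss for a small batch in pure Python. It follows the general formulation in [5] (hardest positive and hardest negative per anchor, Euclidean distance, hinge with a margin); the function name, the choice to sum rather than average over anchors, and the toy values are assumptions made for this example.

```python
import math


def batch_hard_triplet_loss(features, labels, margin):
    """Sum of hinge losses using the hardest positive and negative per anchor."""
    total = 0.0
    for a, anchor in enumerate(features):
        # Distances to other samples of the same identity.
        positives = [math.dist(anchor, f) for i, f in enumerate(features)
                     if labels[i] == labels[a] and i != a]
        # Distances to samples of every other identity.
        negatives = [math.dist(anchor, f) for i, f in enumerate(features)
                     if labels[i] != labels[a]]
        if positives and negatives:
            # Hardest positive is the farthest one; hardest negative is the closest.
            total += max(0.0, margin + max(positives) - min(negatives))
    return total


# Toy batch: two identities, two images each (values are illustrative only).
toy_features = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
toy_labels = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(toy_features, toy_labels, margin=2.0)
```

Only the two samples nearest the other identity violate the margin here, so they alone contribute to the loss.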

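At this stage, the center loss [18] is also applied to the global feature vector to reduce the intra-class spread of the features in the feature space, as follows:

$$L_{\mathrm{center}} = \frac{1}{2}\sum_{j=1}^{B}\big\lVert f_j - c_{y_j}\big\rVert_2^2 \tag{2}$$

where $B$ is the batch size, $f_j$ is the global feature vector of the $j$-th sample, $y_j$ is its class label, and $c_{y_j}$ is the $y_j$-th class center vector for the features.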
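The center loss of [18] penalizes the squared distance between each feature and the center of its class. The following minimal sketch evaluates that penalty for given centers; in [18] the centers are updated during training, which is omitted here, and the function name and toy values are assumptions for illustration.

```python
def center_loss(features, labels, centers):
    """Half the summed squared distance between each feature and its class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += sum((x - c) ** 2 for x, c in zip(feature, center))
    return 0.5 * total


# Toy batch with fixed (not learned) centers, for illustration only.
toy_feats = [[1.0, 0.0], [0.0, 2.0]]
toy_ids = [0, 1]
toy_centers = {0: [0.0, 0.0], 1: [0.0, 0.0]}
loss_value = center_loss(toy_feats, toy_ids, toy_centers)
```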

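Secondly, the cross-entropy loss is computed for each stripe $s$ of the local feature vectors as follows:

$$L_{\mathrm{CE}}^{(s)} = -\sum_{i=1}^{N}\log\frac{\exp\big(W_{y_i}^{\top} h_i^{(s)} + b_{y_i}\big)}{\sum_{k=1}^{C}\exp\big(W_k^{\top} h_i^{(s)} + b_k\big)} \tag{3}$$

where $N$ is the batch size, $C$ is the number of classes in the training set, $h_i^{(s)}$ is the local feature vector of the $i$-th sample for stripe $s$, $y_i$ is its class label, $W_k$ is the weight vector of the FC layer for class $k$, and $b_k$ is the bias. The total cross-entropy loss is calculated as the mean of the cross-entropy losses over all $S$ stripes:

$$L_{\mathrm{CE}} = \frac{1}{S}\sum_{s=1}^{S} L_{\mathrm{CE}}^{(s)} \tag{4}$$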

The label smoothing (LS) [16] technique is applied to improve the accuracy and prevent classification overfitting.
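To make the effect of label smoothing concrete, the sketch below computes a smoothed cross-entropy for a single sample in pure Python. It uses the common formulation from [16], in which the one-hot target is mixed with a uniform distribution over the classes; the function name and the smoothing value used in the example are assumptions, not the settings of this paper.

```python
import math


def smoothed_cross_entropy(logits, target, epsilon):
    """Cross-entropy against a target smoothed toward the uniform distribution."""
    num_classes = len(logits)
    # Log-softmax, shifted by the largest logit for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]

    loss = 0.0
    for k, log_p in enumerate(log_probs):
        # Smoothed target: epsilon / C everywhere, plus 1 - epsilon on the true class.
        q = epsilon / num_classes + ((1.0 - epsilon) if k == target else 0.0)
        loss -= q * log_p
    return loss


# Hypothetical logits for a 4-class problem; epsilon = 0.1 is an example value.
example_loss = smoothed_cross_entropy([2.0, 0.5, 0.1, -1.0], target=0, epsilon=0.1)
```

Setting `epsilon` to zero recovers the standard cross-entropy, so the smoothing acts purely as a regularizer on the target distribution.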

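Finally, the total loss is calculated as the weighted sum of the losses in Equations 1, 2 and 4, where the triplet loss and the cross-entropy loss are given equal priority:

$$L = L_{\mathrm{triplet}} + L_{\mathrm{CE}} + \beta\, L_{\mathrm{center}} \tag{5}$$

where $L_{\mathrm{triplet}}$, $L_{\mathrm{center}}$ and $L_{\mathrm{CE}}$ are the losses of Equations 1, 2 and 4, respectively, and $\beta$ is the weight of the center loss.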

3 Experiments

The MROS model is evaluated on Market-1501 [20], a large-scale person dataset widely used for PReID. It was collected from six different cameras with overlapping fields of view, of which five have HD resolution and one has SD resolution. The bounding boxes in the dataset were generated using a person detector. Following [20], the dataset is split into a training set and a testing set. Single-query mode is used, in which each query image is searched for in the gallery set individually.

The mean average precision (mAP) [20] and the Rank-1, Rank-5 and Rank-10 accuracies are used to evaluate the MROS performance. The area under the precision-recall curve, also known as average precision (AP), is calculated for each query image. The mean value of the APs over all queries is then reported as mAP.
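These metrics can be sketched in a few lines of pure Python. The example assumes that each query has already been turned into a ranked list of 0/1 flags marking whether each gallery result is a correct match, and it uses the common non-interpolated form of AP (mean precision at each correct match). The function names are illustrative, and protocol-specific filtering of gallery images is not modeled.

```python
def average_precision(relevance):
    """Non-interpolated AP: mean of the precision values at each correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(relevance, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_relevance):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


def rank_k_accuracy(all_relevance, k):
    """Fraction of queries with at least one correct match in the top k results."""
    found = sum(1 for r in all_relevance if any(r[:k]))
    return found / len(all_relevance)


# Two toy queries; 1 marks a gallery image of the same identity as the query.
toy_rankings = [[1, 0, 1], [0, 1, 0]]
toy_map = mean_average_precision(toy_rankings)
toy_rank1 = rank_k_accuracy(toy_rankings, k=1)
```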

3.1 Experimental Setup

We use two Nvidia GeForce GTX Ti GPUs for the implementation, which is written in Python with the PyTorch [10] library.

Data augmentation is used to reduce overfitting by artificially enlarging the training set with class-preserving transformations. In our experiments, several types of data augmentation are employed, including zero padding, random cropping, random horizontal flipping, and image normalization with the same mean and standard deviation values as the ImageNet dataset [1]. Random erasing [21] is also applied, with the erased regions filled using the ImageNet pixel mean values.

We use the Adam method [6] as our optimizer together with the warm-up learning rate technique [2]. The learning rate is reduced by a constant factor at regular epoch intervals using a staircase function. The batch size, the two hyperparameters of Equation 1, and the weight of the center loss in Equation 5 are set to fixed values. The ECN [12] is used as the re-ranking method.
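A warm-up schedule followed by staircase decay can be sketched as below. Every numeric value in the example is hypothetical and chosen only to make the behaviour visible; the paper's actual settings are not reproduced here. The linear form of the warm-up and the reading of the warm-up coefficient as the starting fraction of the base learning rate are also assumptions.

```python
def learning_rate(epoch, base_lr, warmup_epochs, warmup_factor, step_size, gamma):
    """Linear warm-up from base_lr * warmup_factor, then staircase decay by gamma."""
    if epoch < warmup_epochs:
        progress = epoch / warmup_epochs
        scale = warmup_factor * (1.0 - progress) + progress
    else:
        scale = 1.0
    # Staircase: multiply by gamma once for every completed step_size epochs.
    return base_lr * scale * gamma ** (epoch // step_size)


# Hypothetical settings, for illustration only.
schedule = [learning_rate(e, base_lr=1.0, warmup_epochs=10, warmup_factor=0.1,
                          step_size=40, gamma=0.1) for e in range(120)]
```

In this toy configuration the rate climbs linearly over the first ten epochs, stays flat, and then drops by a factor of ten at epochs 40 and 80.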

3.2 Experimental Results

This section presents the performance evaluation of different settings of the MROS model on Market-1501 dataset. It also includes comparisons with the state-of-the-art methods.

To evaluate the effectiveness of each step of the presented model, we incrementally measure the accuracy as follows.

  • Setting I is the baseline model, constructed from part-based features using the non-overlapping stripes method to generate the pooled and local feature vectors. In this experiment, all loss functions (triplet, center, and cross-entropy) are applied to the local feature vectors.

  • Setting II is similar to Setting I except that it uses overlapping stripes.

  • Setting III evaluates the effectiveness of combining global and local features by generating the global feature vector and using it with the triplet and center losses, while the cross-entropy loss is used with the local feature vectors.

  • Setting IV evaluates the effectiveness of the multi-level features by considering the features from the last two levels, i.e. the outputs of the last two conv blocks.

#     MROS Settings              mAP (%)   Rank-1 (%)
I     Non-Overlapping Stripes    81.8      93.2
II    Overlapping Stripes (OS)   82.8      93.5
III   OS with Global Features    84.0      94.2
IV    Complete MROS              84.2      94.4
Table 1: Comparison of the four MROS settings on the Market-1501 [20] dataset.

Table 1 presents these settings along with the experimental results. The baseline Setting I achieves promising results, and Setting II improves on it by using overlapping stripes. Combining global and local features further boosts the performance in Setting III. Finally, the best results are obtained by Setting IV, which adds multi-resolution features to the previous settings.

Model              mAP (%)   Rank-1 (%)   Rank-5 (%)   Rank-10 (%)
DaRe [17]          76.0      89.0         -            -
HA-CNN [8]         75.7      91.2         -            -
PCB [15]           77.4      92.3         97.2         98.2
TBN+ [7]           83.0      93.2         -            -
PCB+RPP [15]       81.6      93.8         97.5         98.5
MFBN [4]           84.9      93.9         -            -
SphereReID [2]     83.6      94.4         98.0         98.7
Auto-ReID [11]     85.1      94.5         98.5         99.0
Strong ReID [9]    85.9      94.5         -            -
Proposed MROS      84.2      94.4         97.8         98.7
Table 2: Comparison with the state-of-the-art results on the Market-1501 [20] dataset without re-ranking.

A comparison between MROS in single-query mode and the related methods is presented in Table 2 and Table 3, without and with re-ranking, respectively. The MROS model achieves mAP = 84.2% and Rank-1 = 94.4% without re-ranking, and mAP = 93.5% and Rank-1 = 95.5% with re-ranking [12]. The results in Table 2 show that the proposed MROS model achieves competitive performance without re-ranking. Most of the re-ranked PReID models in Table 3 report Rank-1 results within a small margin of one another. As can be observed from Table 3, our MROS model achieves the highest Rank-1 accuracy among the compared models, while its mAP is slightly below that of Auto-ReID [11] and Strong ReID [9]. We attribute the Rank-1 result to MROS being better able to learn body parts and the spatial correlation between them by employing overlapping stripes, and to learn discriminative features by employing multiple resolutions and different loss functions.

Model              mAP (%)   Rank-1 (%)   Rank-5 (%)   Rank-10 (%)
DaRe [17]          86.7      89.0         -            -
GSRW [14]          82.5      92.7         96.9         98.1
PCB+RPP [15]       81.9      95.1         -            -
MFBN [4]           93.2      95.2         -            -
TBN+ [7]           91.3      95.4         -            -
Auto-ReID [11]     94.2      95.4         97.9         98.5
Strong ReID [9]    94.2      95.4         -            -
Proposed MROS      93.5      95.5         97.2         97.8
Table 3: Comparison with the state-of-the-art results on the Market-1501 [20] dataset with re-ranking.

4 Conclusions

This paper extended the part-based convolutional baseline (PCB) and the multi-resolution model to solve the problem of pedestrian retrieval. Using the residual network (ResNet50) as the backbone network, multiple levels with different resolutions are created to generate feature maps. A simple uniform partition technique is then applied to the outputs of the last two conv blocks, and the resulting stripes are combined in overlapping pairs. Using different types of loss functions, both global and local representations are considered for classification. Experimental results show that our approach outperforms the state-of-the-art methods.

References

  • [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • [2] X. Fan, W. Jiang, H. Luo, and M. Fei (2019) SphereReID: deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation 60, pp. 51–58. ISSN 1047-3203.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [4] S. He, J. Wu, and Y. Li (2019) MFBN: an efficient base model for person re-identification. In Proceedings of the 2019 4th International Conference on Mathematics and Artificial Intelligence, pp. 44–50.
  • [5] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. ArXiv abs/1703.07737.
  • [6] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR 1412.6980.
  • [7] H. Li, M. Yang, Z. Lai, W. Zheng, and Z. Yu (2019) Pedestrian re-identification based on tree branch network with local and global learning. arXiv preprint arXiv:1904.00355.
  • [8] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2285–2294.
  • [9] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [10] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
  • [11] R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang (2019) Auto-reid: searching for a part-aware convnet for person re-identification. ArXiv abs/1903.09776.
  • [12] M. Saquib Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen (2018) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 420–429.
  • [13] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823.
  • [14] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang (2018) Deep group-shuffling random walk for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2265–2274.
  • [15] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Computer Vision – ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), pp. 501–518.
  • [16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2015) Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.
  • [17] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger (2018) Resource aware person re-identification across multiple resolutions. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8042–8051.
  • [18] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515.
  • [19] Y. Zhang, Q. Zhong, L. Ma, D. Xie, and S. Pu (2019) Learning incremental triplet margin for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9243–9250.
  • [20] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124.
  • [21] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. ArXiv abs/1708.04896.