MagnifierNet: Towards Semantic Regularization and Fusion for Person Re-identification

02/25/2020 ∙ by Yushi Lan, et al. ∙ SenseTime Corporation ∙ Nanyang Technological University ∙ The Chinese University of Hong Kong

Although person re-identification (ReID) has achieved significant improvement recently by enforcing part alignment, it remains a challenging task when it comes to distinguishing visually similar identities or identifying occluded persons. In these scenarios, magnifying the details in each part feature and selectively fusing them together may provide a feasible solution. In this paper, we propose MagnifierNet, a novel network which accurately mines details for each semantic region and selectively fuses all semantic feature representations. Apart from the conventional global branch, our proposed network is composed of a Semantic Regularization Branch (SRB) acting as a learning regularizer and a Semantic Fusion Branch (SFB) for selective semantic fusion. The SRB learns with a limited number of semantic regions randomly sampled in each batch, which forces the network to learn a detailed representation for each semantic region, while the SFB selectively fuses semantic region information in a sequential manner, focusing on beneficial information and neglecting irrelevant features or noise. In addition, we introduce a novel loss function, the Semantic Diversity Loss (SD Loss), to facilitate feature diversity and improve regularization among all semantic regions. State-of-the-art performance has been achieved on multiple datasets by large margins. Notably, we improve the state of the art on the CUHK03-Labeled dataset by 12.6% and outperform existing works on the CUHK03-Detected dataset by 13.2% and 7.8%, which demonstrates the effectiveness of our method.


1 Introduction

Person Re-IDentification (ReID) has attracted significant attention for its critical role in video surveillance and public security. Specifically, given a query image, the ReID system aims to retrieve all images of the same identity from a large gallery according to their semantic similarity, where the gallery is typically captured by distinctly different cameras from various viewpoints.

Figure 1: Negative pairs with minor differences zoomed in by a magnifier. The left pair has similar global representations but distinctive glasses; the right pair is hard to distinguish except for the shirt texture and white bag. Our proposed network is able to discover tiny differences in semantic regions.
Figure 2: The overall architecture of the proposed MagnifierNet. The Semantic Regularization Branch (blue) explores the hidden details in each semantic region by limiting the information available for training. The Semantic Fusion Branch (red) encodes beneficial features from each semantic region sequentially while filtering out noise and irrelevant information. The SD Loss further improves diversity among all semantic features.

Although advancement has been witnessed thanks to the advent of large-scale datasets as well as improved deep feature extraction methods [7, 12], several challenging issues remain unsolved. The ReID task is often challenged by large human pose variation, body occlusion, and body-part misalignment under different camera views. Many algorithms have been proposed to tackle these problems, including attention mechanisms [26, 51], body feature cropping [50, 36] and human parsing [23], to align person bodies across images and improve accuracy. However, as most previous approaches mainly focus on enhancing person feature alignment, how to accurately mine the fine-grained details in each semantic region and filter out irrelevant noise from an input image remains unaddressed. As shown in Figure 1, some identities differ only slightly in certain semantic regions. These scenarios require the model to magnify regional details and ignore misleading noise, for which pure alignment-based methods may fail to produce satisfying results.

To effectively address the challenges mentioned above, we propose MagnifierNet, a novel triple-branch network that not only extracts aligned representations but also magnifies fine-grained details and selectively fuses each semantic region. Apart from the holistic representation provided by the conventional global branch, the Semantic Regularization Branch (SRB) learns a fine-grained representation for each semantic region, and the Semantic Fusion Branch (SFB) fuses the semantic features selectively to focus only on beneficial information. Meanwhile, a light-weight segmentation module is applied to impose an alignment constraint on the feature map and provide segmentation masks during inference. In this way, our model is able to distinguish similar identities and recognize occluded samples with semantic details, as well as avoid negative influence from noise.

As high-level features have overlapping receptive fields, different semantic representations tend to contain redundant information. Hence, we propose a novel Semantic Diversity Loss (SD Loss) to improve diversity among them, which has further enhanced the network performance as shown in our experiments.

Our main contribution can be summarized as follows: (1) We propose a novel Semantic Regularization Branch that magnifies fine-grained details in each semantic region. (2) We design a novel Semantic Fusion Branch which selectively fuses semantic information sequentially, focusing on beneficial features while filtering out noises. (3) We further improve semantic-level region diversity by introducing a novel Semantic Diversity Loss. The proposed MagnifierNet achieves state-of-the-art performance on three benchmark datasets including Market-1501, DukeMTMC-reID and CUHK03-NP.

Figure 3: Visualization of feature map activation with different network components.

2 Related Work

2.1 ReID and Part Based Methods

The person ReID task aims to retrieve target images belonging to the same person based on their similarity. In terms of feature representation, the great success of deep convolutional networks has pushed ReID benchmarks to a new level [36, 56, 4, 51]. Recently, researchers have focused intensively on extracting local person features to enforce part alignment and strengthen feature representation. For instance, Zhao et al. [50] proposed a body region proposal network which utilized human landmark information to compute aligned part representations. Zhao et al. [51] shared a similar idea, but with feature representations generated from K part detectors. Sun et al. [36] proposed the Part-based Convolutional Baseline (PCB) network, which learns consistent part-level features from uniform partitions with refined stripe pooling. Based on PCB, the Multiple Granularity Network (MGN) [38] and the Coarse-to-Fine Pyramidal Model (Pyramid) [52] explored multi-branch networks to learn features of different granularities, attempting to incorporate global and local information. However, these methods result in complicated structures and increased parameters, and also lack the ability to represent accurate human regions. In comparison, our approach aggregates multi-granularity features in a more efficient way, resulting in a smaller network size and better performance.

2.2 Semantic Assistance for ReID

Considering the highly structured composition of the human body, more and more researchers aim to push the limits of ReID by exploiting useful semantic information. The development of human parsing [16, 23], pose estimation [43, 2, 45] and person semantic attribute [40, 29] methods facilitates the use of human semantic information as external cues to improve ReID performance. Xu et al. [45] resort to predicted keypoint confidence maps to extract aligned human part representations. Similarly, Huang et al. [22] proposed part-aligned pooling based on delimited regions, which shows significant improvement in cross-domain ReID. Kalayeh et al. [23] adopt predefined body regions as supervision to drive the model to learn aligned representations automatically. In [27], Lin et al. argued that human attributes provide helpful information for ReID and provided attribute annotations for two ReID datasets, Market-1501 [53] and DukeMTMC-reID [35]. However, these methods treat each semantic region equally and do not further explore the details within each region.

2.3 Metric Learning for ReID

With the advent of large datasets and the adoption of deeper convolutional networks, the person ReID task has gradually evolved from a classification problem into a metric learning problem, which aims to find robust feature representations for input images in order to retrieve target images based on their semantic similarity. For this purpose, recent works in metric learning have paid intensive attention to loss function design, such as triplet loss [11, 7], lifted structure loss [32], center loss [41] and margin-based losses [28, 39, 10]. Apart from loss functions, Cheng et al. [8] proposed a hard triplet mining strategy to improve metric embedding performance in ReID. Margin sample mining [44] and distance-weighted sampling [42] also showed nontrivial advantages in person retrieval. Recently, Dai et al. [9] introduced the Batch DropBlock Network (BDB), which utilized batch-hard triplet mining and a feature dropping branch. Nevertheless, the above approaches mostly focus on image-level feature refinement, whereas our proposed SD Loss refines features at the intra-sample level among semantic regions.

3 The Proposed Method

In this section, we first elaborate on the overall structure of the proposed MagnifierNet and then introduce each network component in detail. As shown in Figure 2, MagnifierNet consists of three main branches, namely the Semantic Regularization Branch (SRB), the Semantic Fusion Branch (SFB) and the Global Branch. A segmentation module with an optional attribute facilitator is deployed to obtain semantic representations. We adopt ResNet-50 [18] as our backbone feature extractor; both the SRB and the SFB are connected to the output of ResNet stage 3 to maintain more spatial information, with an additional ResNet bottleneck block deployed as in [9]. For fair comparison with recent works [9, 39], we remove the final down-sampling layer in stage 4. The feature embeddings from all branches are concatenated as the final person representation during inference.
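The branch topology can be summarized with a minimal PyTorch sketch. The module names, embedding sizes and the simplified stand-ins for the SRB and SFB heads below are our own assumptions for illustration, not the authors' released implementation; only the layout (shared stage-3 trunk, global stage-4 branch without the final down-sampling, light-weight segmentation head, and concatenated embeddings at inference) follows the description above.

```python
import torch
import torch.nn as nn
import torchvision


class MagnifierNetSketch(nn.Module):
    """Illustrative three-branch layout; names and dimensions are assumptions."""

    def __init__(self, num_classes=751, num_regions=7, embed_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50()
        # Shared trunk: conv1 .. layer3 (stage 3), 1024-channel output.
        self.trunk = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )
        # Global branch: stage 4 with the final down-sampling removed (stride 1).
        self.global_branch = resnet.layer4
        self.global_branch[0].conv2.stride = (1, 1)
        self.global_branch[0].downsample[0].stride = (1, 1)
        self.global_pool = nn.AdaptiveMaxPool2d(1)
        self.global_embed = nn.Linear(2048, embed_dim)
        # Light-weight segmentation head on stage-3 features (S regions + background).
        self.seg_head = nn.Conv2d(1024, num_regions + 1, kernel_size=1)
        # Simplified stand-ins for the SRB / SFB heads described in Secs. 3.2-3.3.
        self.srb_embed = nn.Linear(1024, embed_dim)
        self.sfb_lstm = nn.LSTM(1024, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim * 3, num_classes)

    def masked_max_pool(self, feat, mask):
        # feat: (B, C, H, W); mask: (B, 1, H, W) binary -> (B, C)
        very_neg = torch.finfo(feat.dtype).min
        return feat.masked_fill(mask < 0.5, very_neg).flatten(2).max(dim=2).values

    def forward(self, images):
        feat3 = self.trunk(images)                        # (B, 1024, H, W)
        seg_logits = self.seg_head(feat3)                 # (B, S+1, H, W)
        masks = seg_logits.argmax(dim=1)                  # predicted region index map
        # Global branch.
        g = self.global_embed(
            self.global_pool(self.global_branch(feat3)).flatten(1))
        # Per-region features shared by the SRB and the SFB.
        region_feats = []
        for s in range(seg_logits.shape[1]):
            m = (masks == s).unsqueeze(1).float()
            region_feats.append(self.masked_max_pool(feat3, m))
        region_feats = torch.stack(region_feats, dim=1)   # (B, S+1, 1024)
        srb = self.srb_embed(region_feats.mean(dim=1))    # stand-in for SRB output
        _, (h_n, _) = self.sfb_lstm(region_feats)         # sequential fusion (SFB)
        sfb = h_n[-1]
        embedding = torch.cat([g, srb, sfb], dim=1)       # final person representation
        return self.classifier(embedding), embedding
```

During training, the segmentation masks would come from the pre-trained parser rather than the network's own head (see Sec. 4.2); the sketch uses the predicted masks only to keep the example self-contained.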

3.1 Semantic Aligned Pooling

The previous state-of-the-art network [23] demonstrates the effectiveness of human semantic parsing due to its pixel-level accuracy and robustness to pose variation. Motivated by this, we propose Semantic Aligned Pooling, which applies fine-grained regional pooling to aggregate pixel-wise representations belonging to the same human semantic landmark. We describe the computation below.

To obtain semantic aligned representations, suppose the semantic label map L from the human segmentation model contains S+1 different regions, denoted $L_1, L_2, \dots, L_S$ for the human semantic regions and $L_0$ for the background region. During training, given an input image I, we feed it into our backbone network [18] for feature extraction to obtain a feature map F of shape $C \times H \times W$, where H and W are the height and width of the feature map and C is the number of channels respectively. After rescaling L to the same size as F using bilinear interpolation, the aligned semantic region can be represented by:

$$\tilde{L}_s = \Phi(L_s), \quad s = 0, 1, \dots, S \tag{1}$$

where $\tilde{L}_s$ denotes the aligned binary representation of semantic region $s$, $L_s$ is the binary mask of semantic region $s$, and $\Phi(\cdot)$ is the bilinear rescaling operation. Therefore, the aligned ID feature can be represented as follows,

$$F_{\mathcal{S}} = F \odot \sum_{s \in \mathcal{S}} \tilde{L}_s \tag{2}$$

where $\mathcal{S}$ is the group of predefined semantic region(s) belonging to the same identity and $\odot$ denotes element-wise multiplication broadcast over channels. Max pooling is then applied on the semantic aligned identity representation $F_{\mathcal{S}}$ to obtain the most discriminative feature of each semantic region.
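A minimal sketch of this pooling step is given below, assuming the label map arrives as an integer index map; the function name and tensor layout are illustrative, not the authors' exact implementation, and it follows Eqs. (1)-(2) above.

```python
import torch
import torch.nn.functional as F


def semantic_aligned_pooling(feat, label_map, region_ids):
    """Sketch of Semantic Aligned Pooling (Eqs. 1-2).

    feat:       (B, C, H, W) backbone feature map.
    label_map:  (B, h, w) integer semantic label map from the parser.
    region_ids: iterable of region indices forming the group S (e.g. [1, 3]).
    Returns a (B, C) max-pooled feature restricted to the selected regions.
    """
    B, C, H, W = feat.shape
    # One binary mask per selected region, rescaled to the feature-map size
    # with bilinear interpolation (Eq. 1), then merged into a single group mask.
    group_mask = torch.zeros(B, 1, H, W, device=feat.device)
    for s in region_ids:
        m = (label_map == s).float().unsqueeze(1)                   # (B, 1, h, w)
        m = F.interpolate(m, size=(H, W), mode="bilinear", align_corners=False)
        group_mask = torch.maximum(group_mask, (m > 0.5).float())
    # Eq. 2: keep only the aligned pixels, then max-pool for the most
    # discriminative response per channel.
    masked = feat.masked_fill(group_mask < 0.5, torch.finfo(feat.dtype).min)
    return masked.flatten(2).max(dim=2).values                      # (B, C)
```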

3.2 Semantic Regularization Branch

Although the features have been aligned via Semantic Aligned Pooling, we argue that detailed feature representations have not yet been captured for all semantic regions. Further magnification of semantic details is necessary to maintain robust model performance, especially in crowded venues where identities are usually only partially visible. To address this issue, we propose the Semantic Regularization Branch (SRB) to drive the network to learn a discriminative representation for each semantic region. Hence, identities with limited visible semantic regions can be better retrieved.

Figure 4: Top 1 search results with Baseline and Baseline + SRB.

Regularization With Partial Semantic Learning. As illustrated in Figure 2, we sample a semantic region group $\mathcal{P}$ made up of aligned semantic regions formulated in Equation 1. Besides, the background category is also utilized in our network to maintain useful information and augment the original representation. Therefore, the feature map of a partially visible identity can be represented by

$$F_{\mathcal{P}} = F \odot \Big( \tilde{L}_0 + \sum_{s \in \mathcal{P}} \tilde{L}_s \Big) \tag{3}$$

where $\mathcal{P} \subset \{1, \dots, S\}$ and $F_{\mathcal{P}}$ is the representation of an identity when only a portion of his/her semantic regions is visible. The sampled semantic regions should maintain sufficient yet dynamic cues to represent the correct identity, so as to drive the network to learn fine-grained representations for all semantic regions. Therefore, we manually split the aligned semantic regions into upper torso regions {head, upper arm, lower arm, chest} and lower torso regions {upper leg, lower leg, foot}, and $\mathcal{P}$ contains $N_u$ upper torso regions and $N_l$ lower torso regions respectively.

As $N_u$ and $N_l$ regions are dynamically sampled in each batch, only a limited number of semantic regions is used to represent an identity. Hence, the network is regularized to capture fine-grained details in each semantic region instead of relying on certain region(s). In this way, the model learns a more detailed and attentive representation of visible semantic regions, as shown in Figure 4. We argue the superiority of our approach over the BDB network [9]: we extract aligned semantic regions for each identity in each batch and drive the network to magnify discriminative features for the visible parts, which naturally alleviates the occlusion problem. The influence of different combinations of $N_u$ and $N_l$ is studied further in Table 3.
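A small sketch of this sampling scheme is shown below. The region indices, the UPPER/LOWER groupings and the helper names are assumptions for illustration; only the idea of drawing $N_u$ upper-torso and $N_l$ lower-torso regions per batch (with the background always kept) and masking the feature map as in Eq. (3) follows the text.

```python
import random
import torch

# Hypothetical region indices; the actual index convention depends on the parser.
UPPER = [1, 2, 3, 4]   # head, upper arm, lower arm, chest
LOWER = [5, 6, 7]      # upper leg, lower leg, foot
BACKGROUND = 0


def sample_partial_regions(n_upper=2, n_lower=2):
    """Draw the region group P for one batch: N_u upper-torso and N_l
    lower-torso regions, with the background always included."""
    chosen = random.sample(UPPER, n_upper) + random.sample(LOWER, n_lower)
    return [BACKGROUND] + chosen


def partial_feature_map(feat, aligned_masks, region_group):
    """feat: (B, C, H, W); aligned_masks: (B, S+1, H, W) binary masks already
    rescaled to the feature size (Eq. 1). Returns the masked feature map F_P."""
    keep = aligned_masks[:, region_group].amax(dim=1, keepdim=True)  # union of masks
    return feat * keep
```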

3.3 Semantic Fusion Branch

In order to yield correct predictions even in cases of similar identities or noisy backgrounds, the model is supposed to focus more on informative semantic regions instead of equally considering noisy parts with misleading information.

To address this problem, we propose the Semantic Fusion Branch (SFB) to selectively fuse all semantic regions together, focusing on informative features while filtering out noise. Specifically, we first perform Semantic Aligned Pooling (Equation 2) for all semantic regions, including the background, to obtain their aligned features. The gating mechanism of the LSTM [15] is able to estimate the importance of a given semantic feature conditioned on itself and the other encoded features. Hence, it is natural to apply an LSTM to encode these semantic features sequentially so as to focus more on relevant information.

Figure 5: Feature activation of “ablation branch” (left) and SFB (right).

We feed the semantic features sequentially into a one-layer LSTM cell and adopt the last LSTM output as the semantic fusion of the input image. To compute the loss, we adopt the "BNNeck" method proposed by [31] to jointly minimize the cross-entropy loss and the triplet loss on this encoding. In addition, we apply ReID supervision to each semantic feature to stabilize training, but the separate features are not utilized during inference.
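The fusion step can be sketched as below, assuming the per-region features have already been produced by Semantic Aligned Pooling. The class name, dimensions and head layout are illustrative; the BNNeck placement (triplet loss on the raw fused feature, cross-entropy on the batch-normalized feature) follows [31].

```python
import torch
import torch.nn as nn


class SemanticFusion(nn.Module):
    """Sketch of the SFB fusion step: region features are fed sequentially into a
    one-layer LSTM and the last output is taken as the fused embedding."""

    def __init__(self, in_dim=1024, hidden_dim=512, num_classes=751):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=1, batch_first=True)
        self.bnneck = nn.BatchNorm1d(hidden_dim)      # BNNeck before the ID classifier
        self.id_head = nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, region_feats):
        # region_feats: (B, S+1, in_dim) semantic features from Semantic Aligned Pooling.
        outputs, _ = self.lstm(region_feats)
        fused = outputs[:, -1]                        # last LSTM output = fused embedding
        neck = self.bnneck(fused)
        logits = self.id_head(neck)                   # cross-entropy on the neck feature,
        return fused, logits                          # triplet loss on `fused` (before BN)
```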

It is notable that the SFB and the SRB share the same backbone features and segmentation masks. Hence, the SFB benefits significantly from the SRB's semantic detail magnification. As the feature representation under each semantic mask grows more informative, the SFB is able to capture this information accurately and selectively fuse it into a better feature embedding.

3.4 Semantic Feature Diversification

As Semantic Aligned Pooling is performed on high-level feature maps, features under different semantic masks might share overlapping information due to large receptive fields, which reduces the network's ability to capture details from different semantic regions. To improve diversity among the semantic representations, we propose a novel Semantic Diversity Loss (SD Loss) to remove redundancy among region features.

For a pair of semantic regions, we aim to diversify them by increasing their distance in the feature space. Specifically, we minimize the pair-wise cosine similarity among all semantic features, as shown in Equation 4:

$$\mathcal{L}_{SD} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{M} \sum_{j=i+1}^{M} \frac{f_i^{\top} f_j}{\lVert f_i \rVert \, \lVert f_j \rVert + \epsilon} \tag{4}$$

where $f_i$ and $f_j$ are a pair of semantic region feature vectors, $N$ is the batch size, $M$ is the number of semantic regions, $\top$ is the matrix transpose, $\lVert \cdot \rVert$ is the vector magnitude and $\epsilon$ is a small positive value used to avoid division by zero.
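A batched sketch of this loss is given below; the function name and tensor layout are assumptions, and the exact normalization over region pairs may differ from the authors' implementation.

```python
import torch


def semantic_diversity_loss(region_feats, eps=1e-8):
    """Sketch of the SD Loss (Eq. 4): pair-wise cosine similarity over all
    distinct pairs of semantic region features, averaged over the batch.

    region_feats: (N, M, D) with N the batch size, M the number of semantic
    regions and D the feature dimension.
    """
    norms = region_feats.norm(dim=2, keepdim=True)                   # (N, M, 1)
    sim = torch.bmm(region_feats, region_feats.transpose(1, 2))      # f_i^T f_j
    sim = sim / (torch.bmm(norms, norms.transpose(1, 2)) + eps)      # cosine similarity
    m = region_feats.shape[1]
    off_diag = ~torch.eye(m, dtype=torch.bool, device=sim.device)    # drop i == j terms
    return sim[:, off_diag].sum(dim=1).mean() / 2                    # each pair counted once
```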

As the SFB generates individual semantic features via Semantic Aligned Pooling, we embed the SD Loss into the SFB and calculate it on these pooled semantic features. Notably, as the SFB and the SRB share the same backbone layers and segmentation masks, minimizing the SD Loss in the SFB forces the backbone to extract diverse features, which in turn supports the SRB in further magnifying semantic details. The overall loss function of the network is shown in Equation 5 and is shared by all branches.

$$\mathcal{L} = \mathcal{L}_{ID} + \mathcal{L}_{tri} + \lambda_{SD}\,\mathcal{L}_{SD} + \lambda_{seg}\,\mathcal{L}_{seg} \tag{5}$$

where the first two terms (the cross-entropy ID loss and the triplet loss) optimize all branches, $\mathcal{L}_{SD}$ diversifies the semantic features and $\mathcal{L}_{seg}$ distills the segmentation information that is used during inference. The weight coefficients $\lambda_{SD}$ and $\lambda_{seg}$ balance the importance of semantic diversification and segmentation respectively.

3.5 Optional Attribute Facilitator

Although most segmentation predictions are generally accurate, there may be corner cases where high-quality segmentation masks are not generated. Meanwhile, attributes may provide complementary latent information that is only weakly exploited in human semantic regions. Therefore, we design an optional attribute facilitator to compensate for the affected masks in the SFB and provide extra important cues.

Given the annotated or externally predicted attributes, we group them into head-region, upper-body-region, lower-body-region and general attributes. We learn an attribute activation mask for the head, upper-body and lower-body groups with a 1x1 convolution, a Sigmoid activation and a fully-connected classifier, as proposed in [29]. These activation masks are then added element-wise to the semantic masks: the head semantic mask shares the head attribute activation, the upper-body semantic masks share the upper-body attribute activation, and the lower-body semantic masks share the lower-body attribute activation.
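One attribute-group branch could be sketched as follows. The channel count, number of attributes, pooling choice and class name are our own assumptions; only the 1x1 convolution + Sigmoid activation mask, the fully-connected classifier, and the element-wise addition to the semantic mask follow the description above and [29].

```python
import torch
import torch.nn as nn


class AttributeFacilitator(nn.Module):
    """Sketch of one attribute-group branch (head / upper body / lower body)."""

    def __init__(self, in_channels=1024, num_attrs=5):
        super().__init__()
        self.mask_conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.sigmoid = nn.Sigmoid()
        self.classifier = nn.Linear(in_channels, num_attrs)

    def forward(self, feat, semantic_mask):
        # feat: (B, C, H, W); semantic_mask: (B, 1, H, W) binary mask of this group.
        attr_mask = self.sigmoid(self.mask_conv(feat))        # (B, 1, H, W)
        # Attribute-weighted pooling feeds the attribute classifier.
        pooled = (feat * attr_mask).flatten(2).mean(dim=2)    # (B, C)
        attr_logits = self.classifier(pooled)
        # The activation mask is added element-wise to the semantic mask(s)
        # of its body region, compensating for low-quality segmentation.
        fused_mask = semantic_mask + attr_mask
        return fused_mask, attr_logits
```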

Note that this is merely an optional module; we did not apply it to the CUHK03-NP dataset for lack of attribute annotations, yet we still achieve distinctive results, which shows the advantage and robustness of MagnifierNet.

CUHK03-Labeled CUHK03-Detected DukeMTMC-reID Market1501
Method Publication Rank-1 mAP Rank-1 mAP Rank-1 mAP Rank-1 mAP
MGN[38] ACM 2018 68.0 67.4 66.8 66.0 88.7 78.4 95.7 86.9
Local-CNN[46] ACM 2018 58.7 53.8 56.8 51.6 82.2 66.0 95.9 87.4
PCB[36] ECCV 2018 - - 63.7 57.5 83.3 69.2 93.8 81.6
DG-Net[55] CVPR 2019 - - 65.6 61.1 86.6 74.8 94.8 86.0
DSA[49] CVPR 2019 78.9 75.2 78.2 73.1 86.2 74.3 95.7 87.6
CAMA[47] CVPR 2019 70.1 66.5 66.6 64.2 85.8 72.9 94.7 84.5
AA-Net[37] CVPR 2019 - - - - 87.7 74.3 93.9 83.4
IANet[21] CVPR 2019 - - - - 87.1 73.4 94.4 83.1
CASN[54] CVPR 2019 73.7 68.0 71.5 64.4 87.7 73.7 94.4 82.8
Pyramid[52] CVPR 2019 78.9 76.9 78.9 74.8 89.0 79.0 95.7 88.2
MHN[3] ICCV 2019 77.2 72.4 71.7 65.4 89.1 77.2 95.1 85.0
P²-Net[17] ICCV 2019 78.3 73.6 74.9 68.9 86.5 73.1 95.2 85.6
BDB[9] ICCV 2019 79.4 76.7 76.4 73.5 89.0 76.0 95.3 86.7
OS-Net[58] ICCV 2019 - - 72.3 67.8 88.6 73.5 94.8 84.9
BAT-Net[13] ICCV 2019 78.6 76.1 76.2 73.2 87.7 77.3 95.1 87.4
ABD-Net[6] ICCV 2019 - - - - 89.0 78.6 95.6 88.3
FPR[19] ICCV 2019 - - 76.1 72.3 88.6 78.4 95.4 86.6
SCAL[5] ICCV 2019 74.8 72.3 71.1 68.6 89.0 79.6 95.8 89.3
CAR[59] ICCV 2019 - - - - 86.3 73.1 96.1 84.7
AlignedRe-ID[48]* Arxiv 2017 - - - - - - 94.4 90.7
MGN[38]* ACM 2018 - - - - - - 96.6 94.2
TriNet[20]* CVPR 2019 - - - - - - 86.6 81.1
AA-Net* CVPR 2019 - - - - 90.4 86.9 95.1 92.4
SPT[30]* ICCV 2019 - - - - 88.3 83.3 93.5 90.6
MagnifierNet (Ours) - 82.6 79.9 80.4 77.3 90.0 81.1 95.8 89.7
MagnifierNet (Ours)* - 87.8 89.3 86.7 88.0 91.8 90.6 96.5 95.1
Table 1: Comparison with other State-Of-the-Art methods. Re-ranking [57, 34] is applied for method with “*”. Top performances on each dataset with and without re-ranking are in bold.

4 Experiments

4.1 Dataset

The Market-1501 dataset contains 32,688 images of 1,501 identities captured by 6 cameras. The training set contains 751 identities with 12,936 images; the testing set contains 750 identities with 3,368 query images and 15,913 gallery images.

The CUHK03-NP dataset contains two subsets, a labeled set and a detected set, according to how the person bounding boxes were generated. The labeled set contains 14,096 images and the detected set contains 14,097 images. We adopt the new train/test split protocol proposed by [25], which contains 767 identities for training and 700 identities for testing. The detected subset contains 7,365 training images, 5,332 gallery images and 1,400 query images; the labeled subset contains 7,368 training images, 5,328 gallery images and 1,400 query images.

The DukeMTMC-reID dataset contains 1,812 identities with 36,411 images captured by eight cameras. 408 identities appear in only one camera and are commonly treated as distractors. The remaining 1,404 identities are divided such that 702 identities with 16,522 images are used for training, and the other 702 identities are merged with the distractors to form the testing set. There are 2,228 query images and 17,661 gallery images in the testing set.

4.2 Implementation Details

During training and testing, the input images are resized to a fixed resolution. We adopt the Adam [24] optimizer with an initial learning rate of 0.00035. The training procedure is divided into two stages: we first train the model to near convergence without the SD Loss, and then add the loss and fine-tune the model until convergence. The number of epochs in each stage and the loss weight coefficients for each dataset are given in Table 2. We apply a learning rate decay of 0.1 at epochs 40 and 70 for Market1501/DukeMTMC-reID, and at epoch 40 for CUHK03-NP.

Dataset | Epochs (stage 1) | Epochs (stage 2) | λ_SD | λ_seg
CUHK03-Labeled | 560 | 40 | 0.01 | 2
CUHK03-Detected | 560 | 40 | 0.002 | 2
Market1501 | 460 | 40 | 0.001 | 2
DukeMTMC-reID | 460 | 40 | 0.05 | 0.5
Table 2: Training configurations on each dataset.
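The optimization setup and two-stage objective described above can be sketched as follows; the helper names are illustrative, the milestones follow the stated learning-rate schedule, and the loss composition follows Eq. (5) with the SD term enabled only in the second stage.

```python
import torch


def build_optimizer_and_scheduler(model, milestones=(40, 70)):
    """Adam with an initial learning rate of 3.5e-4 and step decay of 0.1 at the
    given epochs (40/70 for Market1501 and DukeMTMC-reID, 40 for CUHK03-NP)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=list(milestones), gamma=0.1)
    return optimizer, scheduler


def total_loss(id_loss, tri_loss, sd_loss, seg_loss,
               lambda_sd, lambda_seg, use_sd=False):
    """Two-stage objective (Eq. 5): the SD term is switched on only in the
    second, fine-tuning stage."""
    loss = id_loss + tri_loss + lambda_seg * seg_loss
    if use_sd:
        loss = loss + lambda_sd * sd_loss
    return loss
```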

We apply the recently published segmentation architecture DANet [14] to predict the semantic human regions for all training images in the three benchmarks before training. Semantic labels are transformed from the DensePose [1] dataset as utilized in [22].

All of our implementations are based on the PyTorch framework [33]. We apply commonly adopted data augmentation techniques, including flipping, cropping and erasing, during training. As in the Strong Baseline [31], label smoothing and learning-rate warm-up are also applied to facilitate training. During training, the segmentation masks generated by the pre-trained segmentation model are used for the SRB and SFB; during testing, the model predicts its own segmentation masks for the two branches.

Market1501 DukeMTMC-reID
N_u N_l Rank-1 N_u N_l Rank-1
1 1 95.0 1 1 89.0
2 2 95.7 2 2 89.2
3 3 95.3 3 3 89.0
All 95.4 All 88.9
CUHK03-Labeled CUHK03-Detected
N_u N_l Rank-1 N_u N_l Rank-1
1 1 80.6 1 1 77.8
2 2 82.6 2 2 78.6
3 3 82.4 3 3 78.9
All 81.6 All 78.3
Table 3: Effect of different numbers of selected semantic regions (N_u upper-torso and N_l lower-torso regions; "All" uses every semantic region).
Components CUHK03-Labeled CUHK03-Detected DukeMTMC-reID Market1501
Rank1 mAP Rank1 mAP Rank1 mAP Rank1 mAP
G (baseline) 73.0 70.0 70.2 67.3 88.6 77.9 94.7 86.8
G + SEG 73.0 71.0 68.9 66.7 89.1 78.9 94.9 86.6
G + SEG + SRB 78.0 75.6 74.6 71.6 89.4 81.0 95.0 88.7
G + SEG + SFB 79.1 77.6 76.6 74.4 88.7 79.6 95.2 88.4
G + SEG + SRB + SFB 82.6 79.5 78.6 76.4 89.6 80.8 95.3 89.4
G + SEG + SRB + SFB + A - - - - 89.2 81.0 95.7 89.6
Full model (+ SD Loss) 82.6 79.9 80.4 77.3 90.0 81.1 95.8 89.7
Table 4: Gain from each network component (G: Global Branch, SEG: segmentation distillation, SRB: Semantic Regularization Branch, SFB: Semantic Fusion Branch, A: attribute facilitator, SD: Semantic Diversity Loss).

4.3 Comparison with other State-of-the-Arts

We compare our method against 19 recent state-of-the-art methods and present the results in Table 1, with the widely adopted Rank-1 and mAP as the evaluation criteria. Our method achieves the best results on all datasets except for Rank-1 on Market1501. Notably, our method outperforms other networks by a large margin on the small and challenging CUHK03-NP dataset, which suffers from relatively heavy viewpoint limitation and occlusion. Furthermore, MagnifierNet outperforms part-alignment-related methods including BDB and P²-Net, which validates the ability of our method to go beyond alignment and further improve ReID performance via semantic regularization and fusion.

4.4 Ablation Study

We have carried out extensive ablation studies to validate the effectiveness of each module in MagnifierNet, which will be covered in the following sections.

4.4.1 Improvement from each Network Component

We take the Global Branch (G) as the baseline and add network components onto it one by one. The performance improvement from each component is presented in Table 4. It shows that SRB and SFB boost performance by a large margin both individually and jointly. Although the semantic distillation task (SEG) does not improve performance over the baseline on CUHK03-Detected, its advantages outweigh the drawbacks as it is the foundation of the SRB and SFB. The proposed SD Loss attains a further gain on all datasets, which validates its effectiveness across different domains. On a side note, the attribute facilitator (A) is an optional module, and the results show that our method already achieves competitive results without it; its use can therefore be decided based on the real-world application scenario.

In addition, to further illustrate the benefit of each component qualitatively, we visualize the feature map activations after ResNet block 3 for some randomly selected images under the different settings in Table 4. As shown in Figure 3, the Global baseline alone only captures a coarse representation of the image. Both SRB and SFB significantly improve the model's activation on each semantic region, while SRB tends to highlight every region and SFB tends to focus selectively on certain semantic parts. When applying SRB and SFB simultaneously, their individual advantages complement each other, yielding a joint representation that selectively highlights hidden details in each informative region.

4.4.2 Effect of Semantic Regularization

We visualize the top-1 query results of the Global baseline and Baseline + SRB for two probe images in Figure 4. As shown by the feature activations, SRB significantly improves the model's sensitivity to semantic details. For the first query (top), the SRB model magnifies the top-1 gallery image's upper body and foot regions for comparison. For the second query (bottom), the SRB model captures its top-1 match's collar and back regions when retrieving. In contrast, the baseline produces incorrect results on these cases due to its inability to highlight critical details, as shown by its coarse activation maps.

In addition, we analyze different sampling strategies for the partial semantic representation. Specifically, we train MagnifierNet with different $N_u$ and $N_l$ in the SRB. Based on the results in Table 3, improvement is observed whenever partial semantic representation is applied, while the 2-2 combination yields the best results on most datasets. The training is done without the SD Loss to highlight the impact of semantic regularization itself.

4.4.3 Effect of Semantic Fusion

To validate the benefit of the SFB, we replace it with an "ablation branch" (AB) and compare the resulting performance with MagnifierNet. To construct the ablation branch, instead of feeding the semantic features into an LSTM sequentially, we directly apply a convolution to reduce their dimensions and concatenate them together. Note that the dimension reduction ensures the ablation branch has the same output feature dimension as the SFB. We train both models until convergence without the SD Loss to validate the effectiveness of the network structure alone. As shown in Table 5, the proposed SFB surpasses the ablation branch on all datasets, which validates its positive influence on our framework.

CUHK03-Labeled CUHK03-Detected
Method Rank-1 mAP Rank-1 mAP
AB 81.1 78.6 77.4 76.0
SFB 82.6 79.5 78.6 76.4
Market1501 DukeMTMC-reID
Method Rank-1 mAP Rank-1 mAP
AB 95.3 89.6 89.7 80.4
SFB 95.7 89.6 90.0 81.1
Table 5: Comparison between SFB and Ablation Branch (AB).

In addition, we also compare the feature activations after the bottleneck between the SFB and the ablation branch qualitatively. As shown in Figure 5, the activation from the ablation branch is unfocused and includes noisy information, while the SFB is able to filter out irrelevant features and focus on the most beneficial information.

4.4.4 Coefficients of Semantic Diversity Loss and Segmentation Loss

As shown quantitatively above, the SD Loss further improves model performance by diversifying semantic features. However, as deep learning models are sensitive to variation in feature distribution, the optimal SD Loss coefficient $\lambda_{SD}$ needs to be determined empirically. We present the mAP score on the challenging CUHK03-Detected dataset for different $\lambda_{SD}$ values in Figure 6 (left), where the SD Loss steadily benefits the model, demonstrating its effectiveness.

We also investigate the impact of the segmentation loss weight $\lambda_{seg}$. We present the mAP score for different $\lambda_{seg}$ values on CUHK03-Detected in Figure 6 (right), where all results surpass other state-of-the-art methods, which validates the robustness of our semantic segmentation module. Note that these results are obtained without the SD Loss to study the impact of $\lambda_{seg}$ alone.

Figure 6: Model performance with different $\lambda_{SD}$ and $\lambda_{seg}$.

5 Conclusion

In this paper, we propose MagnifierNet, a novel network that improves ReID performance beyond pure alignment. The Semantic Regularization Branch mines the fine-grained details in each semantic region by learning with limited semantic representations, and the Semantic Fusion Branch selectively encodes semantic features by filtering out noise and focusing only on beneficial information. We further improve performance by introducing a novel Semantic Diversity Loss which promotes feature diversity among semantic regions. MagnifierNet achieves state-of-the-art performance on three major datasets, Market1501, DukeMTMC-reID and CUHK03-NP, which showcases the effectiveness of our method.

References

  • [1] R. Alp Güler, N. Neverova, and I. Kokkinos (2018) Densepose: dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7297–7306. Cited by: §4.2.
  • [2] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017-07) Realtime multi-person 2d pose estimation using part affinity fields. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [3] B. Chen, W. Deng, and J. Hu (2019) Mixed high-order attention network for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 371–381. Cited by: Table 1.
  • [4] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang (2018) Group consistent similarity learning via deep crf for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8649–8658. Cited by: §2.1.
  • [5] G. Chen, C. Lin, L. Ren, J. Lu, and J. Zhou (2019) Self-critical attention learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9637–9646. Cited by: Table 1.
  • [6] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang (2019) ABD-net: attentive but diverse person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8351–8361. Cited by: Table 1.
  • [7] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) A multi-task deep network for person re-identification. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1, §2.3.
  • [8] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the iEEE conference on computer vision and pattern recognition, pp. 1335–1344. Cited by: §2.3.
  • [9] Z. Dai, M. Chen, X. Gu, S. Zhu, and P. Tan (2019) Batch dropblock network for person re-identification and beyond. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3691–3701. Cited by: §2.3, §3.2, Table 1, §3.
  • [10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.3.
  • [11] S. Ding, L. Lin, G. Wang, and H. Chao (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48 (10), pp. 2993–3003. Cited by: §2.3.
  • [12] S. Ding, L. Lin, G. Wang, and H. Chao (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48 (10), pp. 2993–3003. Cited by: §1.
  • [13] P. Fang, J. Zhou, S. K. Roy, L. Petersson, and M. Harandi (2019) Bilinear attention networks for person retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8030–8039. Cited by: Table 1.
  • [14] J. Fu et al. (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §4.2.
  • [15] F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with lstm. Cited by: §3.3.
  • [16] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin (2017-07) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [17] J. Guo, Y. Yuan, L. Huang, C. Zhang, J. Yao, and K. Han (2019) Beyond human parts: dual part-aligned representations for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3642–3651. Cited by: Table 1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §3.1, §3.
  • [19] L. He, Y. Wang, W. Liu, X. Liao, H. Zhao, Z. Sun, and J. Feng (2019) Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. arXiv preprint arXiv:1904.04975. Cited by: Table 1.
  • [20] A. Hermans*, L. Beyer*, and B. Leibe (2017) In Defense of the Triplet Loss for Person Re-Identification. arXiv preprint arXiv:1703.07737. Cited by: Table 1.
  • [21] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen (2019) Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9317–9326. Cited by: Table 1.
  • [22] H. Huang, W. Yang, X. Chen, X. Zhao, K. Huang, J. Lin, G. Huang, and D. Du (2018) EANet: enhancing alignment for cross-domain person re-identification. arXiv preprint arXiv:1812.11369. Cited by: §2.2, §4.2.
  • [23] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah (2018-06) Human semantic parsing for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2, §3.1.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.2.
  • [25] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 152–159. Cited by: §4.1.
  • [26] W. Li, X. Zhu, and S. Gong (2018) Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285–2294. Cited by: §1.
  • [27] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Z. Hu, C. Yan, and Y. Yang (2019) Improving person re-identification by attribute and identity learning. Pattern Recognition. Cited by: §2.2.
  • [28] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017-07) SphereFace: deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
  • [29] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang (2017-10) HydraPlus-net: attentive deep features for pedestrian analysis. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2, §3.5.
  • [30] C. Luo, Y. Chen, N. Wang, and Z. Zhang (2019) Spectral feature transformation for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4976–4985. Cited by: Table 1.
  • [31] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019-06) Bag of tricks and a strong baseline for deep person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §3.3, §4.2.
  • [32] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §2.3.
  • [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [34] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool (2011) Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors. In CVPR 2011, pp. 777–784. Cited by: Table 1.
  • [35] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, Cited by: §2.2.
  • [36] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018-09) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.1, Table 1.
  • [37] C. Tay, S. Roy, and K. Yap (2019) AANet: attribute attention network for person re-identifications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7134–7143. Cited by: Table 1.
  • [38] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 274–282. Cited by: §2.1, Table 1.
  • [39] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018-06) CosFace: large margin cosine loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3, §3.
  • [40] J. Wang, X. Zhu, S. Gong, and W. Li (2017-10) Attribute recognition by joint recurrent learning of context and correlation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
  • [41] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §2.3.
  • [42] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §2.3.
  • [43] B. Xiao, H. Wu, and Y. Wei (2018-09) Simple baselines for human pose estimation and tracking. In The European Conference on Computer Vision (ECCV), Cited by: §2.2.
  • [44] Q. Xiao, H. Luo, and C. Zhang (2017) Margin sample mining loss: a deep learning based method for person re-identification. arXiv preprint arXiv:1710.00478. Cited by: §2.3.
  • [45] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang (2018-06) Attention-aware compositional network for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [46] J. Yang, X. Shen, X. Tian, H. Li, J. Huang, and X. Hua (2018) Local convolutional neural networks for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 1074–1082. Cited by: Table 1.
  • [47] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang (2019) Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1389–1398. Cited by: Table 1.
  • [48] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun (2017) Alignedreid: surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184. Cited by: Table 1.
  • [49] Z. Zhang, C. Lan, W. Zeng, and Z. Chen (2019) Densely semantically aligned person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 667–676. Cited by: Table 1.
  • [50] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang (2017-07) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1.
  • [51] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017) Deeply-learned part-aligned representations for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3219–3228. Cited by: §1, §2.1.
  • [52] F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji (2018) Pyramidal person re-identification via multi-loss dynamic training. arXiv preprint arXiv:1810.12193. Cited by: §2.1, Table 1.
  • [53] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.2.
  • [54] M. Zheng, S. Karanam, Z. Wu, and R. J. Radke (2019) Re-identification with consistent attentive siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5735–5744. Cited by: Table 1.
  • [55] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz (2019) Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2138–2147. Cited by: Table 1.
  • [56] Z. Zheng, L. Zheng, and Y. Yang (2018) A discriminatively learned cnn embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (1), pp. 13. Cited by: §2.1.
  • [57] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327. Cited by: Table 1.
  • [58] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang (2019) Omni-scale feature learning for person re-identification. arXiv preprint arXiv:1905.00953. Cited by: Table 1.
  • [59] S. Zhou, F. Wang, Z. Huang, and J. Wang (2019) Discriminative feature learning with consistent attention regularization for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8040–8049. Cited by: Table 1.