Diversity-Achieving Slow-DropBlock Network for Person Re-Identification

02/09/2020 ∙ by Xiaofu Wu, et al. ∙ Huazhong University of Science & Technology

A big challenge of person re-identification (Re-ID) using a multi-branch network architecture is to learn diverse features from the ID-labeled dataset. The 2-branch Batch DropBlock (BDB) network was recently proposed for achieving diversity between the global branch and the feature-dropping branch. In this paper, we propose to move the dropping operation from the intermediate feature layer towards the input (image dropping). Since it may drop a large portion of each input image, this makes the training hard to converge, and we propose a novel double-batch-split co-training approach to remedy this problem. In particular, we show that feature diversity can be well achieved with the use of multiple dropping branches by setting an individual dropping ratio for each branch. Empirical evidence demonstrates that the proposed method performs better than BDB on popular person Re-ID datasets, including Market-1501, DukeMTMC-reID and CUHK03, and that the use of more dropping branches can further boost the performance.


I. Introduction

Person re-identification (Re-ID) focuses on matching images of the same person taken by the same or different cameras at different angles, times, or locations. It has attracted significant interest due to its fundamental role in emerging computer vision applications such as surveillance, human identity validation and authentication, and human-robot interaction [1, 2, 3, 4, 5, 6, 7, 8, 9]. Despite enormous progress, identifying the person of interest accurately and reliably is still very challenging under large variations in lighting, human pose, background, camera viewpoint, etc. With an end-to-end training approach, one of the main goals in the field of person Re-ID is to discover features as rich as possible from the labeled dataset.

Fig. 1: Visualization of Class Activation Maps (CAMs) for diverse SDB branches. The proposed local dropping branches with different dropping ratios $r_h$, along with the global branch of SDB, allow the model to learn diverse features (marked in orange).

In general, person Re-ID can be considered as a feature-embedding (or metric-learning) problem [10, 11, 12], where the distance between intra-class samples (associated with the same person) should be less than the distance between inter-class ones (associated with different persons) by at least a margin. Unfortunately, most existing feature-embedding solutions require grouping samples in a pairwise manner, which is known to be computationally intensive. In practice, the classification task is often employed to find the feature-embedding solution due to its clear advantage in implementation complexity for training. Today, most state-of-the-art methods [13, 2, 6, 14, 15] for person Re-ID have evolved from a single metric-learning problem or a single discriminative classification problem to a multi-task problem, where both the discriminative loss and the triplet loss are employed [16]. As each sample image is only labeled with the person ID, an end-to-end training approach usually has difficulty learning diverse and rich features without an elaborate design of the underlying neural network and the further use of regularization techniques.

In the past years, various part-based approaches [17, 18, 19] and dropout-based approaches [20] have been proposed in order to learn rich features from the ID-labeled dataset. Differing from conventional pose-based Re-ID approaches [10, 21, 22, 23], part-based approaches usually first locate a number of body parts and then force each part to meet an individual identification loss in order to obtain discriminative part feature representations [24, 25, 26, 27]. Dropout-based approaches, in contrast, intend to discover rich features either by enlarging the dataset with various dropout-based data-augmentation methods, such as cutout [28] and random erasing [29], or by dropping intermediate features from feature-extracting networks, such as Batch DropBlock [2].

Drawbacks of Part-based Methods: The performance of part-based methods relies heavily on the employed partition mechanism. Semantic partitions may offer stable cues to good alignment but are prone to noisy pose detections, since they require human body parts to be accurately identified and located. The uniform horizontal partition, widely employed in [18, 25] along with a multi-branch network architecture, provides only limited performance improvement because of the lack of semantic support and the difficulty of determining the appropriate number of partitions.

Drawbacks of Dropout-based Methods: Dropout-based methods have been widely used in person Re-ID, including various data dropping methods (for the purpose of data augmentation) and feature dropping methods. Data dropping methods, such as random erasing [29], cutout [28] and DropBlock [30], have been shown to be effective for extracting rich features for person Re-ID. One of their main drawbacks is that the proportion of the dropping pattern should be kept small enough to ensure the convergence of training, which may hamper the discovery of more diverse features. As a typical feature dropping method, Batch DropBlock (BDB) [2] has been proven effective for person Re-ID. However, the dropping pattern in BDB is fixed within only one iteration (one batch of samples), where the network may have difficulty learning the corresponding structure. One way to improve the diversity of feature discovery is to increase the number of branches. Unfortunately, BDB is restricted to a two-branch architecture. Currently, it remains unknown how to extend the existing two-branch architecture to one with an arbitrary number of branches, e.g., $N$-branch ($N > 2$), for achieving improved diversity.

This motivates the work in this paper, where we propose a novel multi-branch architecture for discovering rich features in person Re-ID. We briefly summarize the main contributions of this paper as follows:

  1. Based on the 2-branch BDB network, we propose to move the dropping operation from the intermediate feature layer towards the input image. In particular, we propose a Slow-DropBlock (SDB) method, a slower version of Batch DropBlock, for constructing a multi-branch network for person Re-ID. Our method allows the dropping pattern to remain fixed over a number ($T$) of batches, which facilitates the learning of stable features over an extended time duration.

  2. By dropping a large portion of each input image, the proposed SDB often makes the training process hard to converge. To address this issue, we propose a novel double-batch-split co-training approach with guaranteed convergence.

  3. One of the main challenges in designing multi-branch networks for person Re-ID is to ensure feature diversity among individual branches. We show that feature diversity can be achieved with the use of multiple SDB branches by setting an individual dropping ratio for each branch, as demonstrated in Figure 1.

  4. The proposed SDB, along with the double-batch-split co-training approach, proves very effective for achieving state-of-the-art performance on three popular person Re-ID datasets: Market-1501, DukeMTMC-reID and CUHK03. For example, SDB achieves a rank-1 accuracy of 90.2% on DukeMTMC-reID without using re-ranking. (Source code is available at https://github.com/AI-NERC-NUPT/SDB.)

II. Related Work

II-A Variants of Dropout Techniques

Dropout is a standard tool for avoiding overfitting [31], which randomly discards the output of each hidden neuron with some probability during training and forces the neural network to learn more diverse features. In recent years, novel dropout-based techniques have been proposed and further adopted in the field of person Re-ID.

Cutout: Cutout [28] is a simple data-augmentation technique that removes contiguous sections of input images, effectively augmenting the dataset with partially occluded versions of existing samples. It randomly masks out square regions of the input during training.

Random Erasing: Instead of generating occluded samples, Random Erasing [29] randomly selects a rectangle region in an image and erases its pixels with random values.

DropBlock: For a batch of input tensors (samples or features), DropBlock [30] randomly drops a contiguous region from each input tensor.

Spatial Dropout: Spatial Dropout [32] randomly zeroes whole channels of the input tensor. The channels to zero-out are randomized on every forward call.

Batch DropBlock: Batch DropBlock [2] randomly drops the same region of a batch of input feature maps and reinforces the attentive feature learning of the remaining parts.

Note that among the various dropout methods, Spatial Dropout [32] and Batch DropBlock [2] are performed over intermediate feature maps, while Cutout, Random Erasing, and DropBlock [30] are data-augmentation methods applied to input images.
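To make this distinction concrete, the following minimal sketch (assuming PyTorch and torchvision; the parameter values are illustrative only) contrasts an input-level method with a feature-level one:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Input-level dropping (data augmentation): Random Erasing [29]
# operates on image tensors during data loading.
augment = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.33), value='random'),
])

# Feature-level dropping: Spatial Dropout [32] zeroes whole channels
# of an intermediate feature map inside the network.
spatial_dropout = nn.Dropout2d(p=0.1)
features = torch.randn(32, 2048, 24, 8)  # (batch, channels, height, width)
dropped = spatial_dropout(features)
```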

II-B Two-Branch vs. Multi-Branch for Diverse Person Re-ID

To obtain diverse features from an end-to-end training approach, multi-branch network architectures have been widely employed [18, 2, 33], where a shared-net is followed by $N$ branches of subnetworks for achieving diversity in feature spaces. It is natural to ask how many branches are necessary for person Re-ID; however, there is no easy answer to this question.

Two-Branch Network ($N=2$): Two-branch network architectures, such as BDB [2], ABD [6] and SONA [34], have been successfully employed for person Re-ID. All these networks consist of two branches: a global branch and a local branch. To achieve feature diversity between branches, distinct mechanisms should be imposed over the global and local branches, such as attention [6] and feature dropping [2, 34].

Multi-Branch Network ($N>2$): PCB [18] employed a 6-branch network by dividing the whole body into 6 horizontal stripes. Following PCB, MGN [24] employed a multi-branch network architecture consisting of one branch for global feature representations and two branches for local feature representations. A 4-branch network architecture was adopted in Auto Re-ID [35]. MHN [3] employed 6-branch high-order attention modules for achieving the best possible performance, while CAMA [4] used a 3-branch network enforcing an end-to-end overlapped activation penalty. The number of branches is 21 in Pyramid [36], built on the basic 6 horizontal stripes of PCB.

Despite the recent efforts on multi-branch network architectures, there is no clear advantage in adopting $N>2$ over the two-branch counterpart. For example, the 2-branch solution of SONA [34] is superior to the 6-branch solution of MHN [3], both focusing on the attention mechanism. Hence, it remains open how to obtain rich feature diversity with a multi-branch solution ($N>2$).

Fig. 2: Comparison of (a) DropBlock, (b) Batch DropBlock, and (c) Slow-DropBlock ($T=2$) for a batch size of 2.
Fig. 3: The overall $N$-branch network architecture for the proposed SDB-$N$. With double-batch-split co-training for SDB-2, a super-batch of $2B$ images is input to the network, half of which are conventionally augmented for the global branch, while the other half are further augmented with Slow-DropBlock for the local branch. During testing, the feature embedding concatenated from both the global branch and the local branch is used for the final matching distance computation.

III. Slow-DropBlock Network with Double-Batch-Split Co-Training

III-A Slow-DropBlock

As shown in Figure 2, Slow-DropBlock randomly generates a contiguous block pattern and erases its pixels over $T$ consecutive batches of input images. “Slow” means that the generated block pattern remains unchanged for at least $T$ batches. In the case of $T=1$, this is equivalent to BDB [2] applied for the purpose of data augmentation.

The only parameter in SDB is the erased height ratio $r_h$; the erased width ratio is always set to 1.
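A minimal PyTorch sketch of SDB as we read the description above (an illustrative reconstruction, not the authors' released code): a full-width block of height $r_h H$ is erased, and its vertical offset is re-drawn only once every $T$ batches.

```python
import torch

class SlowDropBlock:
    """Erase a full-width horizontal block, fixed for T consecutive batches."""
    def __init__(self, r_h: float, T: int):
        self.r_h, self.T = r_h, T
        self.count, self.top = 0, None

    def __call__(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W); the erased width ratio is fixed to 1.
        B, C, H, W = images.shape
        block_h = int(self.r_h * H)
        if self.count % self.T == 0:  # re-draw the pattern every T batches
            self.top = torch.randint(0, H - block_h + 1, (1,)).item()
        self.count += 1
        out = images.clone()
        out[:, :, self.top:self.top + block_h, :] = 0.0
        return out

sdb = SlowDropBlock(r_h=0.3, T=5)
batch = torch.randn(32, 3, 384, 128)  # example input size
dropped = sdb(batch)
```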

III-B Proposed Network Architecture

As shown in Figure 3, we employ an $N$-branch neural network architecture, called SDB-$N$, by modifying the commonly-used ResNet-50 baseline. Figure 3 shows the overall network architecture, which includes a backbone network, a global branch (orange colored arrows), and $N-1$ local branches (blue colored arrows). The main difference between SDB-2 and BDB is the use of attention modules in our proposed architecture. The use of both spatial and channel attention modules in the shared-net follows the work of [5]; in addition, we insert an extra Spatial Attention Module (SAM) in the subsequent global and local branches. As for the dimension-reduction layer, experiments show that its usefulness depends on the underlying dataset: it is required for CUHK03, but not necessarily for Market-1501 or DukeMTMC-reID.

III-B1 Attention Modules

Given an input feature map $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$, where $H$, $W$ and $C$ denote the height, the width and the number of channels of the feature map, respectively, and following the work in [37, 6, 38], the employed spatial-attention module (SAM) first computes a spatial correlation matrix

$\mathbf{S} = \bar{\mathbf{X}}^{T} \bar{\mathbf{X}} / \tau$,   (1)

where the input feature map is reshaped into a tensor $\bar{\mathbf{X}}$ of size $C \times L$ with $L = H W$, and $\tau$ is a scaling parameter, which we simply keep fixed throughout this paper. Then, the affinity matrix can be obtained as

$\mathbf{A} = \operatorname{softmax}(\mathbf{S})$,   (2)

where the softmax is applied row-wise. Finally, the output of the spatial attention module is

$\mathbf{Y} = \mathbf{X} + \gamma \bar{\mathbf{X}} \mathbf{A}$,   (3)

where $\gamma$ is a learnable parameter and $\bar{\mathbf{X}}\mathbf{A}$ is reshaped back to $C \times H \times W$.

Compared to SAM, the channel attention module (CAM) performs similarly, but focuses on the channel dimension instead of the spatial dimension.
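For concreteness, here is a minimal PyTorch sketch of a SAM layer consistent with Eqs. (1)-(3); the scaling constant $\tau=\sqrt{C}$ and the zero initialization of $\gamma$ are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable, init 0 (assumed)
        self.tau = channels ** 0.5                 # assumed scaling constant

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        x_flat = x.view(B, C, H * W)                              # \bar{X}: B x C x L
        s = torch.bmm(x_flat.transpose(1, 2), x_flat) / self.tau  # Eq. (1): L x L
        a = torch.softmax(s, dim=-1)                              # Eq. (2), row-wise
        y = torch.bmm(x_flat, a)                                  # \bar{X} A: B x C x L
        return x + self.gamma * y.view(B, C, H, W)                # Eq. (3)
```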

III-B2 Shared-Net

We use the popular ResNet-50 as the backbone network for feature extraction. For a fair comparison with recent works [2, 3, 4], we also modify the backbone ResNet-50 slightly: the down-sampling operation at the beginning of stage 4 is removed. In this way, we obtain a larger feature map. As shown in Figure 3, we insert SAM + CAM modules in both stage 3 and stage 4 of the shared-net.

III-B3 Global Branch

The global branch consists of the stage-4 layers, a bottleneck block, a SAM module and a global average pooling (GAP) layer producing a 2048-dimensional vector, followed by a feature-reduction module containing a $1\times 1$ convolutional layer, a 2-D batch-normalization layer and a ReLU layer that reduces the dimension to 512, providing a compact global feature representation for both the triplet loss and the cross-entropy loss.

III-B4 Diverse Local SDB Branches

The local branch has a similar layer structure, but with global max pooling (GMP) in place of GAP. Furthermore, the reduction module following the GMP layer is a fully-connected (fc) layer, followed by a batch-normalization layer and a ReLU layer.

To achieve feature diversity over local branches, we employ multiple local branches with different settings of the erased height ratio $r_h$. A main contribution of this paper is to show that feature diversity can be achieved with multiple local SDB branches of different $r_h$. For example, by assigning a distinct $r_h$ to each of 3 local SDB branches, we can concatenate features from the 3 local branches with the global branch for achieving feature diversity, as sketched below.
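A minimal sketch, reusing the SlowDropBlock class from the sketch in Section III-A; the concrete ratios below are illustrative placeholders, not the paper's tuned values.

```python
import torch

# One SlowDropBlock per local branch, each with its own r_h.
local_ratios = [0.2, 0.3, 0.4]                 # hypothetical per-branch r_h
local_sdbs = [SlowDropBlock(r_h=r, T=5) for r in local_ratios]

# At test time, the global feature and all local features are
# concatenated into one embedding (512-d per branch in this sketch).
f_global = torch.randn(32, 512)
f_locals = [torch.randn(32, 512) for _ in local_ratios]
embedding = torch.cat([f_global] + f_locals, dim=1)   # shape: (32, 2048)
```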

III-B5 Loss Functions

The feature vectors from the global and local branches are concatenated as the final embedded feature for the person Re-ID task. The loss function is the sum of the identification loss (softmax loss), the soft-margin batch-hard triplet loss [15] and the center loss [39] over both the global branch and the local branches, namely,

$L = \sum_{b} \left( L_{\mathrm{id}}^{(b)} + \lambda_{1} L_{\mathrm{tri}}^{(b)} + \lambda_{2} L_{\mathrm{ctr}}^{(b)} \right)$,   (4)

where the sum runs over all branches and $\lambda_{1}, \lambda_{2}$ are weighting factors.
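For concreteness, a per-branch term of Eq. (4) could be assembled as follows, assuming off-the-shelf implementations of the batch-hard soft-margin triplet loss [15] and center loss [39]; the weighting values shown are placeholders, not the paper's.

```python
import torch
import torch.nn.functional as F

def branch_loss(logits, feat, labels, triplet_fn, center_fn,
                lam1=1.0, lam2=5e-4):
    """One term of Eq. (4) for a single branch."""
    l_id = F.cross_entropy(logits, labels)   # softmax ID loss
    l_tri = triplet_fn(feat, labels)         # soft-margin batch-hard triplet [15]
    l_ctr = center_fn(feat, labels)          # center loss [39]
    return l_id + lam1 * l_tri + lam2 * l_ctr

# The total loss sums branch_loss over the global branch and each local branch.
```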

III-C Double-Batch-Split Co-Training

Firstly, we focus on the training of a 2-branch SDB network (SDB-2). Experiments show that the use of SDB over input batches of images makes the training process difficult to converge. Therefore, we propose a novel double-batch-split co-training scheme to address this problem. During training, the network accepts a super-batch of images of size $2B$, where a super-batch is composed of 2 batches of samples, one usual batch and one dropping batch, both of size $B$. The usual batch undergoes conventional data augmentation, including horizontal flip, normalization, and cutout. The dropping batch is further augmented with SDB, where the dropping pattern is randomly generated but kept fixed for $T$ consecutive batches.

This super-batch is first fed into the shared-net, comprising the stage-1 to stage-4 layers. The resulting super-batch of features is then split into two sub-batches, one for the global branch and the other for the local branch, so the feature batch for each individual branch has size $B$. The global branch uses global average pooling (GAP), while global max pooling (GMP) is employed in the local branch, because GMP encourages the network to identify comparatively weak salient features after the most discriminative part is dropped. A minimal sketch of one such training step is given below.
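The following sketch assumes PyTorch; for simplicity the dropping batch is derived here from the same $B$ images as the usual batch, and the module names are hypothetical, not from the released code.

```python
import torch

def co_training_step(shared_net, global_branch, local_branch, sdb,
                     usual_batch, labels):
    # usual_batch: (B, 3, H, W), already conventionally augmented.
    dropping_batch = sdb(usual_batch)                  # SDB augmentation
    super_batch = torch.cat([usual_batch, dropping_batch], dim=0)  # (2B, ...)
    feats = shared_net(super_batch)                    # shared stage-1..4
    B = usual_batch.size(0)
    f_global = global_branch(feats[:B])                # GAP inside this branch
    f_local = local_branch(feats[B:])                  # GMP inside this branch
    return f_global, f_local, labels
```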

Secondly, for an SDB-$N$ with $N > 2$, we run the above co-training process $N-1$ times, one for each individual SDB-2 network composed of the global branch and the $i$-th local branch, $i = 1, \ldots, N-1$.

During testing, features from the global branch and all the local branches are concatenated as the embedding vector of a pedestrian image.

IV. Experiments

Extensive experiments have been performed to evaluate the effectiveness of the proposed approach on three public person Re-ID datasets: Market-1501, DukeMTMC-reID and CUHK03. The results are compared with state-of-the-art methods.

IV-A Datasets

The Market-1501 dataset [40] has 1,501 identities collected by six cameras and a total of 32,668 pedestrian images. Following [40], the dataset is split into a training set with 12,936 images of 751 identities and a testing set of 3,368 query images and 15,913 gallery images of 750 identities.

The DukeMTMC-reID dataset [41] contains 1,404 identities captured by more than 2 cameras and a total of 36,411 images. The training subset contains 702 identities with 16,522 images, and the testing subset contains the other 702 identities.

The CUHK03 dataset [42] contains 14,096 labeled and 14,097 detected images of a total of 1,467 identities captured by two camera views. Following the split of [40], a non-overlapping set of 767 identities is used for training and 700 identities for testing. The labeled dataset contains 7,368 training images, 5,328 gallery images and 1,400 query images for testing, while the detected dataset contains 7,365 training images, 5,332 gallery images and 1,400 query images for testing.

IV-B Implementation Details

Our network is trained using a single Nvidia Tesla P100 GPU with a batch size of 32 ($B=32$); hence, the super-batch has size $2B=64$. Each identity contributes 4 instance images in a batch, so there are 8 identities per batch. The backbone ResNet-50 is initialized from the ImageNet pre-trained model. We use the batch-hard soft-margin triplet loss. The total number of epochs is set to 120 [150], namely 120 for both Market-1501 and DukeMTMC-reID, and 150 for CUHK03, respectively. We use the Adam optimizer with the base learning rate initialized to 3.5e-5. With a linear warm-up strategy in the first 10 [40] epochs, the learning rate increases to 3.5e-4. Then, the learning rate is decayed to 3.5e-5 after 40 [100] epochs, and further decayed to 3.5e-6 after 65 [125] epochs. For SDB, we use the default setting of $T=5$. A single erased height ratio $r_h$ is used for SDB-2, while individual ratios are set for the three local branches of SDB-4.

For training, the input images are re-sized to a fixed resolution and then augmented by random horizontal flip, cutout, random erasing, and normalization. The testing images are re-sized to the same resolution with normalization.
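The learning-rate schedule described above can be summarized by the following sketch (Market-1501/DukeMTMC-reID epoch boundaries; for CUHK03 the bracketed boundaries 40/100/125 would replace 10/40/65):

```python
def learning_rate(epoch: int) -> float:
    """Warm-up plus step-decay schedule, per the settings stated above."""
    if epoch < 10:  # linear warm-up from 3.5e-5 to 3.5e-4 over 10 epochs
        return 3.5e-5 + (3.5e-4 - 3.5e-5) * epoch / 10
    if epoch < 40:
        return 3.5e-4
    if epoch < 65:
        return 3.5e-5   # decayed after 40 epochs
    return 3.5e-6       # further decayed after 65 epochs
```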

IV-C Comparison with State-of-the-art Methods

We compare our work with state-of-the-art methods, with particular emphasis on recent remarkable works (CVPR’19 and ICCV’19) on person Re-ID, over the popular benchmark datasets Market-1501, DukeMTMC-reID and CUHK03. All reported results are obtained without any re-ranking [43, 44] or multi-query fusion [40] techniques. The comparison results are listed in Table I, Table II and Table III. From these tables, one can observe that our proposed method performs competitively among various state-of-the-art methods, including Beyond Part models (PCB) [18], Batch DropBlock (BDB) [2], Mixed High-Order Attention Network (MHN) [3], CAMA [4], Interaction-and-Aggregation Network (IAN) [5], and ABD-Net [6].

As shown, SDB-4 performs consistently better than SDB-2, which means that feature diversity can indeed be achieved with a multi-branch SDB architecture. For DukeMTMC-reID, our SDB-4 performs the best among various state-of-the-art methods, with 1.2%/3.5% improvements in Rank-1/mAP over BDB, demonstrating the effectiveness of the proposed SDB. By pushing the DropBlock operation from intermediate features to the inputs, SDB shows its superiority over BDB [2] on all datasets. Compared to SONA, another variant of BDB with attention modules, SDB still performs better on all datasets.

Method mAP rank-1
KPM [45](CVPR’18) 75.3 90.1
MLFN [46](CVPR’18) 74.4 90.0
CRF [47](CVPR’18) 81.6 93.5
PCB [18](ECCV’18) 81.6 93.8
SNL [48](ACM’18) 73.43 88.27
HDLF[49](ACM MM’18) 79.10 93.30
MGN [24](ACM MM’18) 86.9 95.7
Local CNN[50](ACM MM’18) 87.4 95.9
IAN [5] (CVPR’19) 83.1 94.4
CAMA [4](CVPR’19) 84.5 94.7
MHN [3](CVPR’19) 85.0 95.1
Pyramid [36](CVPR’19) 88.2 95.7
SONA [34] (ICCV’19) 88.67 95.68
ABD [6] (ICCV’19) 88.28 95.6
BDB [2] (ICCV’19) 86.7 95.3
SDB-2 88.2  95.7
SDB-4 88.7 95.9
TABLE I: Comparison of our proposed method with state-of-the-art methods for the Market-1501 dataset
Method mAP rank-1
MLFN [46](CVPR’18) 62.8 81.2
GP-Re-ID [51] (CVPR’18) 72.8 85.2
PCB [18] (ECCV’18) 69.2 83.3
MGN [24](ACM MM’18) 78.40 88.7
Local CNN[50](ACM MM’18) 66.04 82.23
IAN [5] (CVPR’19) 73.4 87.1
CAMA [4] (CVPR’19) 72.9 85.8
MHN [3] (CVPR’19) 77.2 89.1
SONA [34] (ICCV’19) 78.05 89.25
BDB [2] (ICCV’19) 76.0 89.0
SDB-2 78.9 89.8
SDB-4 79.5 90.2
TABLE II: Comparison of our proposed method with state-of-the-art methods for the DukeMTMC-reID dataset
Method Labeled Detected
mAP rank-1 mAP rank-1
DaRe+RE [52](CVPR’18) 61.6 66.1 59.0 63.3
MLFN [46](CVPR’18) 49.2 54.7 47.8 52.8
HA-CNN [53](CVPR’18) 41.0 44.4 38.6 41.7
Local-CNN [50](ACM MM’18) 53.83 58.69 51.55 56.76
PCB [18](ECCV’18) - - 57.5 63.7
MGN [24](ACM MM’18) 67.4 68.0 66.0 68.0
MHN [3] (CVPR’19) 72.4 77.2 65.4 71.7
Pyramid [36](CVPR’19) 76.9 78.9 74.8 78.9
SONA [34] (ICCV’19) 79.23 81.85 76.35 79.10
BDB [2] (ICCV’19) 76.7 79.4 73.5 76.4
SDB-2 78.0 80.4 74.8 77.2
SDB-4 80.7 82.6 77.4 79.5
TABLE III: Comparison of our proposed method with state-of-the-art methods for the CUHK03 dataset

IV-D Visualization

Fig. 4: Visualization of class activation maps (CAMs) for a global branch and multiple diverse local SDB branches. The proposed local branches with different dropping ratios $r_h$ for SDB allow the model to learn diverse features (marked in orange).
Fig. 5: Three Re-ID examples of SDB-2 and BDB on Market-1501. Left: query image. Upper-Right: top-10 results of SDB. Lower-Right: top-10 results of BDB. Images in red boxes are negative results. SDB-2 boosts the retrieval performance.

Diverse Feature Visualization: We show in Figure 4 that diverse features can be clearly observed with the use of the proposed local SDB branches with different dropping ratios $r_h$, where the global branch and 5 local branches with different $r_h$ are highlighted.

Re-ID Visual Retrieval Results: We compare SDB-2 with BDB more directly via visual retrieval results. Three retrieved examples are shown in Figure 5. We can see that BDB fails to retrieve several correct images among the top-10 results. Taking the first query as an example, SDB-2 is able to find correct images of the same identity in the top-10 results, whilst BDB returns 4 incorrect ones.

IV-E Ablation Studies

Method Market-1501 DukeMTMC CUHK03-Labeled CUHK03-Detected
mAP rank-1 mAP rank-1 mAP rank-1 mAP rank-1
BDB 83.9 94.2 76.5 88.6 76.8 79.2 72.3 76.0
SDB-2 88.2 95.7 78.9 89.8 78.0 80.4 74.8 77.2
TABLE IV: Comparison with Batch DropBlock [2] under the same SDB-2 network architecture

IV-E1 Comparison with BDB

Since the proposed SDB is very similar to BDB, it is interesting to compare them in a fair way. Hence, we perform experiments over the SDB-2 network architecture by placing the Batch DropBlock before the GMP module on the local branch, just as in BDB. Instead of using double-batch-split co-training, we employ the regular single-batch training approach of BDB. The results are shown in Table IV. As shown, SDB-2 clearly outperforms BDB on all three datasets. For example, the improvement of SDB-2 over BDB in mAP is about 4.3% on Market-1501.

IV-E2 Benefit of Co-Training

Co-Training Market-1501 DukeMTMC CUHK03-Labeled CUHK03-Detected
mAP rank-1 mAP rank-1 mAP rank-1 mAP rank-1
No 86.4 94.4 78.0 89.0 75.1 77.2 71.8 74.2
Yes 88.2 95.7 78.9 89.8 78.0 80.4 74.8 77.2
TABLE V: Benefit of co-training over three datasets

With the use of co-training, SDB-2 performs clearly better, as shown in Table V, on all three datasets. For example, co-training surpasses its counterpart by about 3.0% in mAP on CUHK03. The motivation behind co-training is that the use of SDB data augmentation for the local branch alone may hamper the learning of the shared-net. This suggests that the global branch and the local branch reinforce each other, both contributing to the final performance. It should be pointed out that SDB-2 still performs better than BDB even without the use of co-training.

T mAP rank-1 rank-5 rank-10
1 78.8 89.5 95.1 96.5
5 78.9 89.8 95.2 96.5
10 78.5 89.5 95.0 96.3
TABLE VI: Rate of changes $T$ for SDB-2 over DukeMTMC-reID
Attention Modules Market-1501 DukeMTMC CUHK03-Labeled CUHK03-Detected
mAP rank-1 mAP rank-1 mAP rank-1 mAP rank-1
No 87.0 95.1 77.5 88.8 77.3 79.3 73.3 75.8
Yes 88.2 95.7 78.9 89.8 78.0 80.4 74.8 77.2
TABLE VII: Benefit of Attention Modules
$r_h$ 0.1 0.2 0.3 0.4 0.5 0.6
mAP 77.2 77.9 78.9 78.3 77.4 76.1
rank-1 88.8 89.6 89.8 89.4 88.6 88.4
TABLE VIII: Influence of Dropping-Ratio for SDB-2 over DukeMTMC-reID

IV-E3 Rate of Changes for Slow-DropBlock

Compared to BDB, the proposed SDB has the freedom to adjust the rate of changes $T$. The results for SDB-2 over DukeMTMC-reID are shown in Table VI. The use of $T=5$ achieves an improvement of 0.3% in Rank-1 accuracy over $T=1$. However, the use of $T=10$ does not bring any further improvement. This may be because, with the total number of training epochs fixed, the whole image cannot be fully traversed by randomly-generated dropping patterns as $T$ increases. For SDB-4, we also observe from experiments that $T=5$ achieves consistent performance improvement over $T=1$.

IV-E4 Benefit of Attention Modules

With the insertion of attention modules, SDB-2 performs clearly better, as shown in Table VII, on all three datasets, with about a 1% improvement in mAP on each dataset.

IV-E5 Influence of Dropping-Ratio for SDB

Table VIII studies the impact of the dropping ratio on the performance of the SDB-2 network. Since the erased width ratio is fixed to 1.0, the dropping ratio of SDB, defined as the ratio between the dropped area and the area of the whole image, is fully determined by the erased height ratio $r_h$. We can see that the best performance is achieved at $r_h = 0.3$ for DukeMTMC-reID.
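Concretely, since the erased block always spans the full width, its area fraction reduces to the height ratio alone:

```latex
\text{dropping ratio} = \frac{(r_h H) \cdot (1.0\, W)}{H \cdot W} = r_h .
```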

IV-E6 On the Use of Dimension-Reduction Layer

Use of DR Layer mAP rank-1 rank-5 rank-10
No 67.5 70.4 86.2 91.1
Yes 78.0 80.4 92.1 95.1
TABLE IX: On the use of the dimension-reduction (DR) layer for CUHK03-Labeled

We also perform experiments to investigate the role of the dimension-reduction layer. For CUHK03-Labeled, the results are shown in Table IX. Clearly, a large performance gap is observed: the rank-1 accuracy is 80.4% with the dimension-reduction layer, surpassing its counterpart by 10.0%. This phenomenon, however, is only observed on CUHK03, not on Market-1501 or DukeMTMC-reID. Therefore, it deserves further investigation.

V. Conclusion

In this paper, we propose a novel diversity-achieving Slow-DropBlock multi-branch network for person Re-ID. As an alternative to the recently proposed Batch DropBlock network, we push the dropping operation to the input images, which, however, makes training convergence difficult. We then propose a novel double-batch-split co-training approach with guaranteed training convergence. Experiments show that the proposed diverse SDB network, along with the co-training approach, achieves state-of-the-art performance on several popular person Re-ID datasets, including Market-1501, DukeMTMC-reID and CUHK03.

References

  • [1] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person re-identification: Past, present and future,” arXiv preprint arXiv:1610.02984, 2016.
  • [2] Z. Dai, M. Chen, X. Gu, S. Zhu, and P. Tan, “Batch dropblock network for person re-identification and beyond,” in 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), 2019, pp. 3691–3701.
  • [3] B. Chen, W. Deng, and J. Hu, “Mixed high-order attention network for person re-identification,” in 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), 2019, pp. 371–381.
  • [4] W. Yang, H. Huang, Z. Zhang, X. Chen, K. Huang, and S. Zhang, “Towards rich feature discovery with class activation maps augmentation for person re-identification,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 1389–1398.
  • [5] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen, “Interaction-and-aggregation network for person re-identification,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9317–9326.
  • [6] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang, “Abd-net: Attentive but diverse person re-identification,” in 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), 2019, pp. 8351–8361.
  • [7] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a strong baseline for deep person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition WorkShops (CVPRW), 2019, pp. 4321–4329.
  • [8] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian, “Glad: Global-local-alignment descriptor for scalable person re-identification,” IEEE Trans. Multimedia, vol. 21, pp. 986 – 999, Apr. 2019.
  • [9] C. Zhao, X. Lv, Z. Zhang, W. Zuo, J. Wu, and D. Miao, “Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification,” IEEE Trans. Multimedia, vol. -, pp. 1 – 16, Early-Access 2020.
  • [10] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose-driven deep convolutional model for person re-identification,” in 2017 IEEE Proceedings on International Conference on Computer Vision (ICCV), 2017, pp. 3960–3969.
  • [11] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: A deep quadruplet network for person re-identification,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 1320–1329.
  • [12] S. Bai, X. Bai, and Q. Tian, “Scalable person re-identification on supervised smoothed manifold,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 3356–3365.
  • [13] H. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, “Bag of tricks for image classification with convolutional neural networks,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 558–567.
  • [14] Z. Zheng, L. Zheng, and Y. Yang, “A discriminatively learned cnn embedding for person reidentification,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 1, p. 13, 2018.
  • [15] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv:1703.07737, 2017.
  • [16] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Deep attributes driven multi-camera person re-identification,” in Proceedings of the European Conference on Computer Vision (ECCV).   Springer, 2016, pp. 475–491.
  • [17] H. Yao, S. Zhang, R. Hong, Y. Zhang, C. Xu, and Q. Tian, “Deep representation learning with part loss for person re-identification,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2860–2871, June 2019.
  • [18] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 480–496.
  • [19] L. Zhao, X. Li, Y. Zhuang, and J. Wang, “Deeply-learned part-aligned representations for person re-identification,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 3239–3248.
  • [20] Z. Dai, M. Chen, S. Zhu, and P. Tan, “Batch feature erasing for person re-identification and beyond,” arXiv preprint arXiv:1811.07130, 2018.
  • [21] V. Kumar, A. Namboodiri, M. Paluri, and C. V. Jawahar, “Pose-aware person recognition,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6797–6806.
  • [22] L. Zheng, Y. Huang, H. Lu, and Y. Yang, “Pose-invariant embedding for deep person re-identification,” IEEE Transactions on Image Processing, vol. 28, no. 9, pp. 4500–4509, Sep. 2019.
  • [23] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue, “Pose-normalized image generation for person re-identification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 650–667.
  • [24] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in 2018 ACM Multimedia Conference on Multimedia Conference.   ACM, 2018, pp. 274–282.
  • [25] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee, “Part-aligned bilinear representations for person re-identification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 402–419.
  • [26] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based cnn with improved triplet loss function,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1335–1344.
  • [27] X. Fan, H. Luo, X. Zhang, L. He, C. Zhang, and W. Jiang, “Scpnet: Spatial-channel parallelism network for joint holistic and partial person re-identification,” in Asian Conference on Computer Vision.   Springer, 2018, pp. 19–34.
  • [28] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” in arXiv:1708.04552, 2017.
  • [29] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.
  • [30] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,” in Advances in Neural Information Processing Systems, 2018, pp. 10727–10737.
  • [31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [32] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 648–656.
  • [33] W. Chen, X. Chen, J. Zhang, and K. Huang, “A multi-task deep network for person re-identification,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [34] B. N. Xia, Y. Gong, Y. Zhang, and C. Poellabauer, “Second-order non-local attention networks for person re-identification,” in 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), 2019, pp. 3760–3769.
  • [35] R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang, “Auto-reid: Searching for a part-aware convnet for person re-identification,” in 2019 IEEE Proceedings on International Conference on Computer Vision (ICCV), 2019.
  • [36] F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji, “Pyramidal person re-identification via multi-loss dynamic training,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [38] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-aware compositional network for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 2119–2128.
  • [39] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 499–515.
  • [40] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 1116–1124.
  • [41] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Proceedings of the European Conference on Computer Vision (ECCV).   Springer, 2016, pp. 17–35.
  • [42] W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014, pp. 152–159.
  • [43] Z. Zhong, L. Zheng, D. Cao, and S. Li, “Re-ranking person re-identification with k-reciprocal encoding,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 3652–3661.
  • [44] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 420–429.
  • [45] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “End-to-end deep kronecker-product matching for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 6886–6895.
  • [46] X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 2109–2118.
  • [47] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, “Group consistent similarity learning via deep crf for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 8649–8658.
  • [48] K. Li, Z. Ding, K. Li, Y. Zhang, and Y. Fu, “Support neighbor loss for person re-identification,” in 2018 ACM Multimedia Conference on Multimedia Conference.   ACM, 2018, pp. 1492–1500.
  • [49] M. Zeng, C. Tian, and Z. Wu, “Person re-identification with hierarchical deep learning feature and efficient xqda metric,” in 2018 ACM Multimedia Conference on Multimedia Conference.   ACM, 2018, pp. 1838–1846.
  • [50] J. Yang, X. Shen, X. Tian, H. Li, J. Huang, and X.-S. Hua, “Local convolutional neural networks for person re-identification,” in 2018 ACM Multimedia Conference on Multimedia Conference.   ACM, 2018, pp. 1074–1082.
  • [51] J. Almazan, B. Gajic, N. Murray, and D. Larlus, “Re-id done right: towards good practices for person re-identification,” arXiv:1801.05339, 2018.
  • [52] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger, “Resource aware person re-identification across multiple resolutions,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 8042–8051.
  • [53] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 2285–2294.