Person re-identification (ReID) aims to retrieve images of the same person from a gallery of pedestrian images taken by different cameras. It is a critical task in computer vision due to its broad applications in domains such as multi-camera tracking, video surveillance, and forensic search. One significant challenge in person ReID is that different persons may share similar facial appearance, clothing, or accessories, so we need to examine the multifaceted features expressed in different parts of the detected persons to perform the ReID task effectively.
Many ReID methods have been introduced over the years, of which deep learning-based methods xiao2016learning; cheng2016person; sun2018pcb; dai2019batch; huang2018adversarially; wang2018mancs; zheng2019joint have gained significant improvement over traditional methods due to their remarkable ability to learn semantic-rich features. In particular, the striping-based deep methods sun2018pcb; cheng2016person; wang2018learning; zhai2019defense; yao2019deep; hou2019interaction, which build multi-branch neural networks to learn local features in each of the predefined parts of the identities, with one branch dedicated to each part, are introduced to learn the aforementioned multifaceted features. They are often the best performers on different benchmark datasets. However, these methods result in complex models that involve a large number of parameters and computationally expensive training. They are therefore difficult to train well, especially when the labeled training datasets are small.
To address this issue, in this work we propose to learn the multifaceted features in a simple unified singleton neural network. A number of studies hermans2017defense; huang2018adversarially; wang2018mancs; luo2019bag; ristani2018features; shen2018deep have explored the use of a single network structure for the ReID task. However, their primary objective is to learn universally effective global features. As a result, although they successfully learn discriminative global features, the models are typically attentive to only a single small discriminative part of the identity, leading to substantially less expressive features than those gained by striping methods.
We introduce a novel framework, termed Unified Multifaceted Feature Learning (UMFL), to achieve the goal of learning complex multifaceted features with substantially simpler neural networks. The immediate gain is that, given the same amount of data, the resulting simplified neural networks can be trained significantly more effectively and efficiently than the complex ones. Specifically, as shown in Figure 1, UMFL consists of two key collaborative modules, compound batch image erasing and hierarchical structured loss. The compound batch erasing is composed of two different types of image erasing operations, Batch-constant Erasing (BcE) and Random Erasing (RE), which generate identity images with the chosen body part erased in all batch images (i.e., BcE) and random patches erased in each image of the batch (i.e., RE). RE zhong2017random is now a key data augmentation ingredient and has shown a remarkable enabling power to improve the performance of person ReID luo2019bag; chen2019abd, since it enables the model to focus on globally discriminative features. However, the RE-enabled models attend to single dominant discriminative image patches only. The use of the BcE-based augmented images forces the models to search for non-erased body parts that are critical to person ReID. Since the erased body part differs across batches, the ReID models are forced to pay attention to the multifaceted features expressed in diverse discriminative body parts. Additionally, modeling with a mixture of BcE and RE-based augmented images helps reinforce the learned multifaceted features.
To effectively learn the underlying multifaceted features in our compound batch erased images, we further introduce the hierarchical structured loss, which structures the BcE and RE-based augmented images into a two-level hierarchy per training batch and applies separate losses to the augmented images in different groups of the hierarchy. In particular, each of our training batches consists of two sub-batches, one containing the BcE-based augmented images and the other containing the RE-based augmented images, resulting in three groups of image samples: the BcE augmented sub-batch, the RE augmented sub-batch, and the full batch. Separate losses are then applied to these three groups per batch to learn the multifaceted features.
Note that the trained UMFL-based single-branch network can use exactly the same network architecture as one branch of the multi-branch complex networks that aim to learn global features. Therefore, the UMFL model can also be applied as a base block to plug into the complex networks to improve their performance.
Additionally, UMFL is very different from the recently proposed batch dropblock network (BDB) dai2019batch. The batch-constant block dropping in BDB is similar to BcE in UMFL, but they are actually two different operations since BDB performs the dropping in the feature map layer while UMFL applies BcE to the original image. This is the only similarity between BDB and UMFL. UMFL and BDB use completely different approaches (single-branch network with hierarchical structured loss vs. multi-branch networks with single loss). As a result, they have extremely different properties. For example, BDB involves substantially more parameters and it is much more difficult to train than UMFL; UMFL is generic and can plug into other types of approaches, whereas BDB does not have this flexibility. As we show in our experiments, UMFL can easily plug into BDB and other recent state-of-the-art methods such as ABD chen2019abd to achieve new performance benchmarks. In summary, this work makes the following three main contributions.
We propose a novel framework termed Unified Multifaceted Feature Learning to learn diverse discriminative features expressed in different parts of the identities using single-branch neural networks. As we show in our experiments on four benchmark datasets, despite the use of significantly simplified network architectures, the resulting model is able to capture multifaceted features that can often be captured only by state-of-the-art complex multi-branch networks.
Our UMFL model is generic and can be easily incorporated into state-of-the-art complex methods to substantially improve their performance.
Beyond person ReID, UMFL can also generalize to other similar tasks and achieve state-of-the-art performance on two vehicle ReID benchmark datasets.
2 Related Work
Due to the remarkable feature learning capacity of CNNs, many deep learning based ReID methods xiao2016learning; cheng2016person; sun2018pcb; dai2019batch; huang2018adversarially; wang2018mancs; zheng2019joint have been proposed recently das2014consistent; li2013learning; liao2015person; ma2013domain; pedagadi2013local; zheng2012reid. These methods can be divided into supervised and unsupervised approaches. Since our method uses labels for training, we only review the supervised methods.
Multifaceted Feature Learning. Among supervised ReID methods, the majority of deep learning based methods focus on designing network structures that divide the image into several stripes and learn local features within each stripe sun2018pcb; cheng2016person; dai2019batch; wang2018learning; zhai2019defense; yao2019deep; hou2019interaction. They force the learner to pay more attention to different parts of the identities by combining the striping local features. At the testing stage, the part features and global features are concatenated as the final representation. This is one of the most effective approaches, but the networks are usually complex, and thus they are computationally expensive and often difficult to train well. Some attention-aware methods li2018harmonious; xu2018attention are proposed to enhance the attentiveness of CNNs, and they also benefit from the complex striping structure.
The other methods focus on data augmentation or the loss function. Data augmentation includes GAN based approaches, which use GANs to generate more data for training wei2018person; zhong2018camstyle, mask or pose guided frameworks kalayeh2018human; saquib2018pose that utilize extra semantic information from pose estimation or segmentation models, as well as the random erasing approach that randomly erases a small area of input images zhong2017random. Some other data augmentation studies zhang2017mixup; inoue2018data; verma2018manifold have shown that combinations of examples and labels of training data can also effectively regularize the neural network. Random erasing uses no extra information and is arguably the simplest effective method.
The triplet loss is now probably the most popular loss used in current state-of-the-art ReID methods. Some advanced versions have been proposed, such as batch-hard triplet mining hermans2017defense and batch-soft triplet mining ristani2018features. In the batch-hard triplet loss, only the hardest positive and negative samples are selected for each anchor to form a triplet. In the batch-soft triplet loss, the triplets are re-weighted to reduce the influence of outlier samples. Some other loss functions have also been used to boost the performance, such as the quadruplet loss chen2017beyond, in which quadruplet networks with quadruplet sampling are used for training.
3 Unified Multifaceted Feature Learning
3.1 The Proposed Framework
In a person ReID system, given a set of training images and the corresponding one-hot encoding of the identity/class set, the task is to learn a mapping function that projects the original data onto a new feature space, such that the distance between images of the same person is small while the distance between images of different persons is large. Given a query image, the system first computes the distance between the query and each image in a gallery image set, and then returns the gallery images that have the smallest distance to the query image. Note that the training set and the gallery set are normally assumed to have no overlapping identities.
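The retrieval step described above can be illustrated with a minimal sketch. It assumes features have already been extracted by some mapping function and uses the Euclidean distance; the function name `rank_gallery` and the toy feature values are hypothetical and only serve the illustration.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by increasing Euclidean distance to the query."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists), dists

# Toy example: three gallery images embedded in a 2-D feature space.
gallery = np.array([[0.0, 1.0], [1.0, 0.0], [0.8, 0.2]])
query = np.array([1.0, 0.1])
order, dists = rank_gallery(query, gallery)
print(order)  # gallery indices, closest first
```

Any other distance (e.g., cosine distance) could be substituted without changing the overall procedure.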
Our proposed Unified Multifaceted Feature Learning (UMFL) framework leverages two collaborative modules, compound batch erasing and hierarchical structured loss, to learn the diverse features expressed in different parts of persons using a simple single-branch neural network. They are collaborative in the sense that combining these two modules in UMFL works substantially better than replacing either of them with other functions. The procedure of UMFL is presented in Figure 2. Specifically, the compound batch erasing first generates a structured batch with two sub-batches, one generated by BcE and the other generated by RE. The samples are then projected onto the feature space by a single-branch network, and the hierarchical structured loss is applied to the samples of each batch, which are structured into a two-level hierarchy (i.e., two sub-batches in the first level and the full batch in the second level), to learn the multifaceted features carried by the sub-batch and cross-sub-batch samples.
3.2 Compound Batch Erasing
Compound batch erasing is composed of the widely used random erasing (RE) and our proposed batch-constant erasing (BcE), which together generate two sub-batches of augmented identity images. Specifically, for each batch, its two sub-batches contain exactly the same raw image set, but RE is applied to one and BcE to the other, so that the sub-batch images have different erased areas. This enables more effective learning of the multifaceted features using a fixed number of identities. To achieve that, we first sample a sub-batch of images, then duplicate it and concatenate the two copies to form the full batch; RE and BcE are then applied to the two copies, respectively.
3.2.1 Random Erasing (RE)
By randomly erasing a small rectangular patch of training identity images, the RE-based augmentation substantially improves ReID methods in learning discriminative features luo2019bag. RE is extremely simple and works as follows. For each image in the RE sub-batch, a rectangular area is erased with a 50% chance; the size and shape of the rectangle are determined by an area ratio, drawn from a predefined range, and an aspect ratio, and these hyperparameters are used with the recommended settings in luo2019bag.
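As an illustration of the RE procedure, the following is a minimal NumPy sketch rather than the authors' implementation. The default values (`area_range=(0.02, 0.4)`, `min_aspect=0.3`), the constant fill value, and the retry loop are assumptions taken from commonly used random erasing settings; the function name `random_erase` is hypothetical.

```python
import numpy as np

def random_erase(img, rng, p=0.5, area_range=(0.02, 0.4), min_aspect=0.3,
                 fill=0.0, max_tries=100):
    """With probability p, erase one random rectangle of an (H, W, C) image."""
    if rng.random() >= p:
        return img  # image left untouched
    h, w = img.shape[:2]
    out = img.copy()
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * h * w             # erased area in pixels
        aspect = rng.uniform(min_aspect, 1.0 / min_aspect)  # height / width
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh < h and 0 < ew < w:                       # rectangle must fit
            top = rng.integers(0, h - eh + 1)
            left = rng.integers(0, w - ew + 1)
            out[top:top + eh, left:left + ew] = fill
            return out
    return out  # no fitting rectangle found; return the unchanged copy

rng = np.random.default_rng(0)
sub_batch = np.ones((8, 64, 32, 3), dtype=np.float32)  # toy sub-batch of 8 images
erased = np.stack([random_erase(img, rng) for img in sub_batch])
```

Because the rectangle is drawn independently for every image, the erased location generally differs from image to image within the sub-batch.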
3.2.2 Batch-constant Erasing (BcE)
Unlike random erasing and other erasing methods singh2017hide; devries2017improved, which erase only a small rectangular area that is randomly chosen per image, our batch-constant erasing method erases a striping part of the image, and the erased part is fixed and applied to all the images in the same sub-batch. The erased part per sub-batch is randomly chosen. More specifically, we first spatially divide the image into a fixed number of horizontal parts, and then randomly choose one part and erase that same part in all the images of the sub-batch. Applying a proper loss to the resulting sub-batches with different constant erased parts effectively forces the model to learn the similarities and differences of the non-erased parts, resulting in diverse attention to different parts of the identities.
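The BcE operation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the images are stored as an (N, H, W, C) array, the stripes are of (near-)equal height, erased pixels are filled with a constant, and the names `batch_constant_erase` and `num_parts` are hypothetical.

```python
import numpy as np

def batch_constant_erase(sub_batch, num_parts, rng, fill=0.0):
    """Erase the same randomly chosen horizontal stripe in every image.

    sub_batch: array of shape (N, H, W, C). Returns the erased copy and
    the index of the stripe that was erased.
    """
    h = sub_batch.shape[1]
    # Row boundaries of the num_parts horizontal stripes.
    edges = np.round(np.linspace(0, h, num_parts + 1)).astype(int)
    part = int(rng.integers(0, num_parts))      # one draw for the whole sub-batch
    out = sub_batch.copy()
    out[:, edges[part]:edges[part + 1]] = fill  # same rows in all images
    return out, part

rng = np.random.default_rng(0)
batch = np.ones((4, 60, 20, 3), dtype=np.float32)
out, part = batch_constant_erase(batch, num_parts=6, rng=rng)
```

The single random draw per call is what distinguishes this from RE: the stripe changes from one sub-batch to the next but is constant within a sub-batch.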
3.3 Hierarchical Structured Loss
Separate losses are then applied to the two sub-batches and the full batch to learn three types of features: the globally discriminative features from the RE sub-batch, the diverse discriminative features in different parts from the BcE sub-batch, and the mixed global and part features from the full batch. These separate losses are termed the hierarchical structured loss to emphasize the importance of using the two-level hierarchical structure of the batch and the correspondingly applied losses, since we cannot capture such rich features otherwise. Below we introduce each separate loss in detail.
3.3.1 Hard-triplet Loss on sub-batches
Triplet loss shows remarkable enabling power in current state-of-the-art ReID methods and is now probably the most widely used loss in the ReID task. It takes three samples as a triplet to learn feature representations: an anchor $x_a$, a positive sample $x_p$ that comes from the same person as $x_a$, and a negative sample $x_n$ taken from a person different from that of the anchor. In the learning process, it enforces that the inter-person distances are greater than the intra-person distances by at least a predefined margin $m$. The generic triplet loss is defined as follows:
$$L_{tri}(x_a, x_p, x_n) = \big[\, m + D(f(x_a), f(x_p)) - D(f(x_a), f(x_n)) \,\big]_+ \tag{1}$$
where $f(x)$ denotes the learned feature representation of $x$, $D(\cdot,\cdot)$ is the distance of two samples, $m$ is a predefined margin and $[\cdot]_+$ represents $\max(\cdot, 0)$. Convolutional networks are often employed to instantiate the function $f$. However, the vanilla triplet loss in Eqn. (1) works well only when carefully selected triplets are provided, which is often impossible to perform at scale. An advanced triplet loss known as hard triplet mining liao2015efficient; su2016deep; liu2017end is used to deal with this problem, and is defined as:
$$L_{hard}(x_a) = \Big[\, m + \max_{x_p \in P(a)} D(f(x_a), f(x_p)) - \min_{x_n \in N(a)} D(f(x_a), f(x_n)) \,\Big]_+ \tag{2}$$
where $P(a)$ ($N(a)$) is the collection of positive (negative) samples of the anchor $x_a$ in a batch. This loss enables better results than the original triplet loss schroff2015facenet. The role of the hinge function $[\cdot]_+$ is to avoid correcting ‘already correct’ triplets, but, as shown in hermans2017defense, replacing the hinge function with the softplus function can work in a similar way while avoiding the use of the hard cut-off margin $m$. So we adopt the softplus function and obtain the following hard-triplet loss:
$$L_{ht}(x_a) = \ln\Big(1 + \exp\big(\max_{x_p \in P(a)} D(f(x_a), f(x_p)) - \min_{x_n \in N(a)} D(f(x_a), f(x_n))\big)\Big) \tag{3}$$
The obtained hard-triplet loss is applied to the two sub-batches, the RE sub-batch and the BcE sub-batch, separately, which yields one hard-triplet loss term per sub-batch.
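To make the hard-triplet loss concrete, the sketch below computes a batch-hard triplet loss with the softplus function in NumPy. It is an illustration under assumptions rather than the authors' implementation: it uses the Euclidean distance, averages the per-anchor losses over the batch, and the function name `batch_hard_softplus_triplet` is hypothetical. Applying it to a sub-batch simply means passing only that sub-batch's features and labels.

```python
import numpy as np

def batch_hard_softplus_triplet(features, labels):
    """Batch-hard triplet loss with softplus, averaged over all anchors."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (N, N).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    # Hardest positive: farthest sample with the same identity.
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # Hardest negative: closest sample with a different identity.
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    # softplus(x) = ln(1 + exp(x)), computed stably with logaddexp.
    return np.logaddexp(0.0, hardest_pos - hardest_neg).mean()

# Toy sub-batch: two identities, two images each, well separated.
features = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [0, 0, 1, 1]
loss = batch_hard_softplus_triplet(features, labels)
```

In this toy case every anchor has a hardest positive at distance 1 and a hardest negative at distance 5, so the loss equals ln(1 + exp(-4)), which is small because the identities are already well separated.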
The loss term on the BcE sub-batch enables the learning of discriminative multifaceted features from diverse parts. Intuitively, for a sub-batch of image samples with a highly discriminative part removed, minimizing the hard-triplet loss forces the model to learn features from other discriminative parts. For example, as shown in Figure 1, the anchor image is from one identity, and the black trousers of this identity are one of the most discriminative parts for differentiating its images from those of another identity whose appearance is extremely similar except for the black trousers. When the images are presented with the black trouser part erased, enforcing the hard-triplet loss on this sub-batch teaches the model to attend to other discriminative information in other parts, such as the head or body part.
However, using only the loss on the BcE sub-batch may result in the loss of globally discriminative features, since a large fixed striping part is always blocked in each BcE sub-batch. Therefore, we apply the same hard-triplet loss to the other sub-batch, which is augmented by RE. Since only a small, randomly chosen area is erased and the erased area is often different per image in the RE sub-batch, imposing the hard-triplet loss on it complements the BcE term in learning globally discriminative features.
3.3.2 Triplet and Focal Losses on the Full Batch
The two sub-batch losses focus on learning features carried by the erased images within each sub-batch, but they cannot capture the features in the same/different identities across the full batch. We therefore apply two additional losses, the hard-triplet loss and an adapted focal loss, to the full batch to learn such features. They are added to learn finer-grained discriminative features (via the hard-triplet loss) and additional important features (via the adapted focal loss).
Specifically, as with the two sub-batch losses, the hard-triplet loss is also applied to the full batch. This helps learn finer-grained discriminative features than those learned from the two sub-batches, as it is applied to images with more areas erased.
The focal loss is applied to learn from hard negative samples, i.e., image pairs that are from different identities and become exceedingly similar due to the compound batch erasing. This happens because, in the full batch, each image has two versions, one with a striping part erased and another with a small random area erased. As a result, the erased negative images, especially the ones with the striping part erased, may become indistinguishable from the erased anchor images, which decreases the distance between the negative samples and the anchor samples. It is difficult for the hard-triplet loss to cope with this problem, because our erasing operation can also result in large distances between images of the same identity. Therefore, using the distance between the anchor and positive samples to guide the penalization on the negative samples may be misleading. An adapted focal loss is therefore introduced to address this issue.
Focal loss is a dynamically scaled cross entropy loss, in which a scaling factor can automatically down-weight the contribution of easy examples during training, enabling the model to rapidly focus on hard examples lin2017focal. The original focal loss is defined as follows.
where is the probability and is the scaling factor. It automatically reduces the importance of the samples having small . To adapt the focal loss to our task, an adaptive sigmoid based loss function is defined as follows to transform pairwise distances between identity image samples to probabilities:
where is the distance between samples and in the representation space. Eq. (7) is defined to transform the range in the original adaptive sigmoid function wang2013nonlinear to .
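The down-weighting behaviour of the original focal loss lin2017focal can be checked numerically with the short sketch below. It implements only the standard form, -(1 - p)^gamma * log(p), and not the adapted distance-based variant used in this work; the symbol `gamma` for the scaling factor and its default value of 2 are assumptions made for the illustration.

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    """Standard focal loss for the probability p assigned to the true class."""
    p = np.clip(np.asarray(p, dtype=np.float64), 1e-12, 1.0)
    return -((1.0 - p) ** gamma) * np.log(p)

easy, hard = 0.9, 0.1  # probabilities of the true class
for p in (easy, hard):
    ce = -np.log(p)    # plain cross entropy, i.e. gamma = 0
    print(f"p={p}: cross entropy={ce:.4f}, focal={float(focal_loss(p)):.4f}")
```

With gamma set to 0 the function reduces to the plain cross entropy, and increasing gamma shrinks the contribution of well-classified (easy) samples much more than that of hard ones.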
Overall, our model is driven by the collaboration of the following four losses.
where is a classification loss based on cross entropy between the prediction and the ground truth as in most ReID methods dai2019batch; huang2018adversarially; wang2018mancs; zheng2019joint.
We evaluate the performance on four widely used person ReID datasets drawn from CUHK03 li2014deepreid (in its detected and labeled versions), Market1501 zheng2015scalable, and DukeMTMC-ReID zheng2017unlabeled.
CUHK03 contains 14,096 images of 1,467 identities captured by six cameras on the CUHK campus, and each person has only two camera views. Following zhong2017re, we apply the CUHK03-NP splits, in which 767 identities are selected for training and the other 700 for testing. Two versions of this dataset are built based on the way the bounding boxes are created. Specifically, CUHK03-Detected uses pedestrian detectors to create the bounding boxes, while the bounding boxes of CUHK03-Labeled are manually labeled.
Market1501 is a large person ReID dataset containing 12,936 images from 751 identities in the training data, and 3,368 query images and 19,732 gallery images from 750 identities in the testing data. These images were captured from 6 different camera viewpoints with manual bounding boxes.
DukeMTMC-ReID is a subset of DukeMTMC ristani2016performance for person ReID. The images are cropped by hand-drawn bounding boxes. The data was taken from 8 cameras of 1,404 identities with respective 16,522, 2,228 and 17,661 images in the training, query and gallery sets.
4.2 Evaluation Protocol
Following the standard protocol in song2018mask; wang2018mancs; sun2018pcb; chen2018person; sun2019perceive, we use Cumulative Matching Characteristics (CMC) and mean average precision (mAP) to evaluate the performance on all datasets. We report the cumulative matching accuracy at rank 1 (R-1 for short) and the mAP value of the retrieval performance. Note that none of the reported results involve re-ranking, though it may be used as an extra step to further improve the accuracy.
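For reference, the two reported metrics can be computed from ranked retrieval results as in the following sketch. It is a simplified illustration: each query is represented by a list of booleans marking which ranked gallery images share the query identity, and the filtering of same-camera gallery images that standard ReID protocols perform is assumed to have been done beforehand. The function names are hypothetical.

```python
def average_precision(ranked_matches):
    """AP of one query, given match flags in ranked order (best first)."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(all_ranked_matches):
    """Return (rank-1 accuracy, mAP) over all queries."""
    n = len(all_ranked_matches)
    rank1 = sum(bool(m[0]) for m in all_ranked_matches) / n
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / n
    return rank1, mean_ap

# Two toy queries: the first has correct matches at ranks 1 and 3,
# the second has its only correct match at rank 2.
queries = [[True, False, True], [False, True, False]]
rank1, mean_ap = evaluate(queries)
```

Here the first query has an AP of (1/1 + 2/3) / 2 and the second an AP of 1/2, so R-1 is 0.5 and mAP is 2/3.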
|RE + HT||58.3||60.1||59.1||62.1||85.5||94.1||76.2||86.6|
|RE (double) + HT||59.8||61.0||63.9||64.4||86.0||94.2||76.2||86.6|
|REBcE + +||67.7||70.1||71.1||73.1||87.0||94.7||76.8||87.5|
|REBcE + + +||68.7||71.1||71.9||73.9||87.5||94.8||77.2||88.0|
4.3 UMFL-enabled Single-branch Network
4.3.1 Experimental Settings
UMFL is proposed to learn multifaceted features in a unified framework to increase the generalization of CNNs. Following the ASB method luo2019bag, here the simple single-branch network ResNet-50 is used to evaluate the effectiveness of UMFL. For compound batch erasing, we first randomly choose 16 identities with four image samples per identity as a sub-batch and duplicate the sub-batch before applying any erasing operation. Then we apply BcE and RE respectively to the two sub-batches, and . The in the sub-batch is set to a random integer in the range and the in the sub-batch are respectively set to zhong2017re. Overall, our person ReID model replaces the data augmentation and the simple loss of ASB with the compound batch erasing and the hierarchical structured loss respectively. So, our model is termed UMFL-enabled ASB below.
In this section, our UMFL-enabled ASB is compared with state-of-the-art methods that use the single-branch network ResNet-50, including four data augmentation based methods zhong2018camstyle; qian2018pose; kalayeh2018human; liu2019view and four global feature based methods ristani2018features; huang2018adversarially; wang2018mancs; shen2018deep. ASB is used as a baseline.
The performance of the single-branch network-based methods is shown in the top part of Table 1. The results show that our method significantly outperforms all the competing methods in both mAP and R-1 on all datasets, especially on the CUHK03 detected and labeled datasets, where UMFL-enabled ASB achieves more than 17.5% improvement over the best competing methods. The CUHK03 datasets present substantially more challenges than the other two datasets, because they contain more identities than Market1501 and DukeMTMC but only about half as many image samples per identity. UMFL excels at handling such challenging datasets, since its compound batch erasing effectively augments the data and its hierarchical structured loss can leverage these augmented data to learn diverse and discriminative features. Due to the collaboration of these two modules, the data augmentation in our UMFL-enabled ASB is more effective than the GAN, segmentation, or view confusion-based data augmentation methods zhong2018camstyle; qian2018pose; kalayeh2018human; liu2019view. Although ASB luo2019bag is the most effective global feature-based method here, our UMFL-enabled ASB still achieves substantial improvement over it. This is mainly because our unified multifaceted feature learning approach drives the model to attend to diverse discriminative body parts, whereas ASB mainly pays attention to only a single highly discriminative part (see Figure 3 for details).
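The batch construction described above (16 identities with four images each, duplicated to form the full batch) can be sketched with the standard library as follows. This is an illustrative sketch, not the authors' sampler: the function name `sample_pk_sub_batch`, the toy label list, and the choice to skip identities with fewer than four images are assumptions.

```python
import random
from collections import defaultdict

def sample_pk_sub_batch(labels, num_ids=16, imgs_per_id=4, seed=None):
    """Sample image indices for num_ids identities, imgs_per_id images each."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    # Assumption: identities with too few images are simply skipped.
    eligible = [pid for pid, idxs in by_id.items() if len(idxs) >= imgs_per_id]
    batch = []
    for pid in rng.sample(eligible, num_ids):
        batch.extend(rng.sample(by_id[pid], imgs_per_id))
    return batch

# Toy dataset: 20 identities with 6 images each.
labels = [pid for pid in range(20) for _ in range(6)]
sub_batch = sample_pk_sub_batch(labels, seed=0)  # 64 image indices
# The sub-batch is duplicated before erasing: one copy is later augmented
# with RE and the other with BcE, giving a full batch of 128 samples.
full_batch = sub_batch + sub_batch
```

Duplicating the indices before any erasing guarantees that both sub-batches contain exactly the same raw images, as the compound batch erasing requires.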
4.4 UMFL-enabled Striping Methods
4.4.1 Experimental Settings
Our UMFL can also be used in the striping methods to improve the CNN’s generalization. We show in this section that UMFL can be applied to substantially improve the two most recent state-of-the-art striping methods, BDB dai2019batch and ABD chen2019abd. We only replace the data augmentation and the loss of BDB and ABD with the respective compound batch erasing and hierarchical structured loss of UMFL, with all other parts fixed; the resulting models are termed UMFL-enabled BDB and ABD. Eight other state-of-the-art striping ReID methods chang2018multi; sun2018pcb; yao2019deep; hou2019interaction; zhai2019defense; li2018harmonious; xu2018attention; song2018mask are also used as competing methods, of which li2018harmonious; xu2018attention; song2018mask are attention based methods.
The performance of the striping-based methods is shown in the lower part of Table 1. It is clear that ABD tends to obtain better performance on the Market1501 and DukeMTMC datasets but is less effective on the two challenging CUHK03 datasets. Impressively, UMFL-enabled ABD achieves 4.9% - 5.3% and 4.0% - 6.0% improvement over the original ABD on these two challenging datasets, CUHK03 detected and labeled, and obtains the best performance on all four datasets. Similarly, UMFL-enabled BDB also substantially improves the original BDB on the Market1501 and DukeMTMC datasets, on which BDB performs less effectively, achieving 2.1% - 2.7% and 2.1% - 2.7% on mAP and R-1, respectively; it performs comparably to the original BDB on the CUHK03 datasets.
Comparing across the full table, it is remarkable that our UMFL-enabled ASB, which uses a single-branch ResNet-50 backbone, can outperform most striping based methods; it even performs better than BDB on 3 out of 4 datasets in both mAP and R-1. ABD generally performs better than the UMFL-enabled ASB, but it is a significantly more complex method, involving nearly three times as many parameters as the UMFL-enabled ASB (i.e., 69M vs. 25M).
4.5 Understanding the Effectiveness of UMFL
4.5.1 Ablation Study
We evaluate the importance of three key components to UMFL, including compound batch erasing, hierarchical structured hard-triplet loss and the adapted focal loss. The ablation evaluation is performed via the UMFL-enabled ASB. The results are provided in Table 2.
Compound Batch Erasing:REBcE. The large margin of the performance between ‘Base + RE + hard-triplet’ (the full ASB) and ‘Base + hard-triplet’ (ASB without RE) justifies the important contribution of random erasing to the ReID performance. When we include both RE and BcE by using separate hard-triplet losses on both sub-batches, i.e., ‘Base + REBcE + ’, we achieve significant improvement over ‘Base + RE + hard-triplet’ across all four datasets. This is mainly due to the collaborative effect of the compound batch erasing and the separate losses on the sub-batches. As we discuss in Section 3.3.1, this collaboration enables our model to learn multifaceted discriminative features from different parts, resulting in the significantly better ReID performance.
Hierarchical Structured Hard-triplet Loss. We further apply the hard-triplet loss to the full batch. Combining with the hard-triplet losses on the two sub-batches, we have a hierarchical hard-triplet loss, i.e., ‘Base + REBcE + + ’. Enforcing the same hard-triplet loss to the full batch as that in sub-batches enables the learning of finer-grained features. Thus, it can achieve consistently additional improvement over ‘Base + REBcE + ’.
Adapted Focal Loss. Lastly the adapted focal loss is added, i.e., ‘Base + REBcE + + +’, which helps obtain further consistent improvement over ‘Base + REBcE + + ’. This demonstrates that the adapted focal loss indeed learns additional important features from hard negative samples, which the previous three hard-triplet losses fail to capture.
4.5.2 Visualization of Attention Maps
We conduct a set of attention visualizations using the Grad-CAM visualization method selvaraju2017grad, which outputs the attention map on the last output feature maps. The results of ASB luo2019bag (Baseline) and our UMFL-enabled ASB (ours) are shown in Figure 3, from which we can see that the feature maps of ASB highlight single discriminative parts only. By contrast, the UMFL-enabled ASB can effectively ignore background or other noise information and attend to diverse discriminative parts in different cases, e.g., identity images taken from different angles, identities with different accessories, and occluded identities. For example, in the 1st and 3rd rows in Figure 3, although the persons are occluded by obstacles of different sizes, our method can still focus on the identities and pay attention to different parts, while ASB focuses on small highly discriminative areas only.
4.6 Beyond Person ReID: Enabling Vehicle ReID
To further evaluate the capability of our method, we evaluate the performance of the UMFL-enabled ASB on two vehicle ReID datasets, VeRi-776 liu2016eccv and VehicleID liu2016deep. VeRi-776 contains about 50,000 images of 776 vehicles across 20 cameras. The VehicleID dataset contains 221,763 images of 26,267 vehicles. There are three test subsets of different sizes, and we use the large test set, which contains 20,038 images of 2,400 vehicles. We compare our method with six state-of-the-art vehicle ReID methods lou2019veri; khorramshahi2019dual; he2019part; luo2019bag; wang2017orientation; kanaci2018vehicle, with ASB as the baseline. The results are shown in Table 3. The UMFL-enabled ASB outperforms most vehicle ReID methods by a large margin. Compared to ASB, our method achieves 2.7% improvement on mAP and 0.7% - 0.8% improvement on R-1. This demonstrates that the proposed UMFL approach can effectively generalize to the vehicle ReID task.
This paper introduces a simple and effective Unified Multifaceted Feature Learning (UMFL) approach to learn diverse discriminative features expressed in different parts of identities. Learning such features can often be achieved only by using complex multi-branch networks. We show that the two key collaborative modules of UMFL, compound batch erasing and hierarchical structured loss, can effectively work together to achieve this goal using only a simple single-branch network. Also, the two key modules of UMFL are generic in that (i) they can be plugged into complex networks to further enhance their performance in the person ReID task and (ii) they can effectively generalize to the vehicle ReID task.