Person re-identification (ReID) is the task of matching bounding boxes of persons across photos taken by multiple non-overlapping cameras. Given a query image, ReID must retrieve all images with the same ID as the query, that is, all images of the same person. ReID has been widely adopted in computer vision applications such as monitoring, activity analysis, and people tracking [chen2019mixed], so it is critical to design a robust and efficient ReID algorithm.
Owing to the high discriminative ability of deep-learned representations, significant progress has been made in ReID [sun2018beyond, li2017learning, zhao2017spindle, chen2017beyond, cheng2016person, xiao2017margin, qian2018pose]. Much research treats ReID as a classification problem by taking the person IDs as different classes, using the softmax loss function to differentiate identities and learn the feature representations [sun2018beyond, li2017learning, zhao2017spindle, qian2018pose]. In contrast, other work directly leverages the triplet loss as a metric to learn the feature representations [chen2017beyond, cheng2016person, xiao2017margin].
Basically, the triplet loss tries to pull learned features of the same identity closer and push features of different identities apart. Compared with softmax loss, triplet loss directly controls the learning process in the embedding space, ensuring that features of the same identity are closer than others by a threshold margin. In this way, triplet loss can learn finer-grained differences than softmax cross-entropy loss. However, current applications of triplet loss in ReID are prone to overfitting because ReID datasets lack sufficient samples and the samples are imbalanced across identities. Several studies [zheng2017unlabeled, qian2018pose, liu2018pose, chen2019instance] synthesize training images with Generative Adversarial Nets (GAN) [goodfellow2014generative] to enlarge the training set and improve the generalization capability of the model. Though the GAN approach is helpful for data augmentation, it also requires much more effort to generate the samples, which might not be feasible in certain cases, especially when there are too many identities. Guo et al. [guo2017one] proposed feature normalization to overcome the data imbalance. However, normalization leads to performance degradation, as also reported in TriNet [hermans2017defense].
ReID datasets are usually captured by multiple cameras. The inherent changes in lighting and perspective therefore lead to inevitable camera-caused gaps in ReID datasets, which make the samples very noisy. To alleviate the gaps, Zhong et al. [Zhong_2018_CVPR] take advantage of GAN to transfer images from one camera to another, and Wei et al. [wei2018person] from one dataset to another. Both methods spend a lot of time on image generation. Therefore, it is necessary to learn a more robust feature extractor that avoids the expensive data augmentation process.
To address these challenges, this paper proposes ReadNet, a ReID-oriented adversarial camera network (ACN) with angular triplet loss (ATL). The new loss function, ATL, takes angle-distance as the distance metric and applies a linear angular margin to it. Because angle-distance is not affected by the feature length, feature normalization is still applicable. More importantly, due to the linear angular margin in the embedding space, a linear decision boundary can be guaranteed without performance degradation compared to existing methods. The ACN is designed to address the problem of the camera-caused gaps. ACN consists of a feature extractor and a camera discriminator, which play a minimax game: the camera discriminator tries to identify which camera the extracted feature came from, while the feature extractor tries to extract features without camera information to fight against the discriminator. In this way, it is possible to learn a pedestrian-discriminative-sensitive and multi-camera-invariant feature extractor.
Both ATL and ACN are straightforward and efficient, and they can be deployed independently or together. The prototype of ReadNet is implemented with PyTorch [paszke2017automatic] and is evaluated on three widely adopted ReID datasets. The experimental results show that either ATL or ACN alone outperforms the baseline as well as many existing methods, while their combination delivers the best prediction accuracy in most test cases.
The main contributions of the paper are as follows:
We propose ATL to leverage “angle-distance” in contrast to the typical triplet loss to ensure a linear decision boundary to address the data imbalance problem due to limited samples.
We propose ACN to filter the camera information, ensuring the feature extractor can concentrate on the pedestrian information to bridge the gaps stemming from camera noises.
2 Related Work
This section discusses how ReadNet, specifically ATL and ACN, relates to state-of-the-art deep learning based ReID research.
2.1 Triplet loss and its Variants
Schroff et al. [schroff2015facenet] propose the triplet loss for face recognition and clustering, with promising results. A potential issue with triplet loss is that it can be difficult to converge, and many sampling strategies have been introduced to solve this problem. For example, Song et al. [oh2016deep] use all pairwise distances in a batch to take full advantage of it. Chen et al. [chen2017beyond] adopt a quadruplet loss with two negative samples for better generalization capability. Hermans et al. [hermans2017defense] propose TriNet with a $PK$-style sampling method and hardest example mining, which proved to have no convergence problem on most ReID datasets. TriNet also provides a soft-margin method to pull samples from the same class as close as possible. Ristani et al. [ristani2018features] claim that most hard example mining methods only consider the hardest or semi-hard triplets, but that it can be beneficial to take easy triplets as well. They propose an adaptive weights triplet loss that assigns high and low weights to hard and easy triplets, respectively.
2.2 Angular Margin-based Softmax
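As the most widely used loss function, softmax loss can be formulated as in Equation (1):

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}} \tag{1}$$

where $x_i$ denotes the feature of the $i$-th sample with label $y_i$, $W_j$ is the $j$-th column of the weight $W$, and $b_j$ is the bias term [deng2019arcface]; $N$ is the batch size and $n$ is the number of classes. SphereFace [liu2017sphereface] proposes the angular softmax (A-Softmax) loss that enables convolutional neural networks (CNN) to learn angularly discriminative features by reformulating $W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$, where $\theta_j$ is the angle between $W_j$ and $x_i$. If we fix $b_j = 0$, normalize $\|W_j\|$ to 1, normalize $\|x_i\|$ to $s$, and multiply the target angle by an angular margin $m$, the softmax loss can be modified to Equation (2):

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(m\theta_{y_i})}}{e^{s\cos(m\theta_{y_i})} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{2}$$

Many new angular-based softmax loss functions build on A-Softmax. For example, CosFace [wang2018cosface] introduces Equation (3):

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{3}$$

ArcFace [deng2019arcface] devises a similar loss, as in Equation (4):

$$L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{4}$$

The most significant difference between the three angular-based softmax loss functions is the position of the margin $m$. Although the margins look very similar across these equations, different types of margin produce totally different geometric attributes, because the margin has an exact correspondence to the geodesic distance. Though geodesic distance does not make much sense for ReID, Equation (4) inspires the design of our angular triplet loss (ATL), in which ReID-related issues are taken into consideration.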
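To make the effect of the margin position concrete, the following minimal sketch evaluates only the target-class logit of the three losses for a given angle. The function name `target_logit` and the argument names are ours and purely illustrative; the plain `cos(m * theta)` form for SphereFace is a simplification that is only meaningful while `m * theta` stays within $[0, \pi]$.

```python
import math

def target_logit(theta, m, s, kind):
    """Scaled target-class logit for an angle theta (radians) between a
    feature and its class weight, under three margin placements."""
    if kind == "sphereface":   # multiplicative angular margin: s * cos(m * theta)
        return s * math.cos(m * theta)
    if kind == "cosface":      # additive cosine margin: s * (cos(theta) - m)
        return s * (math.cos(theta) - m)
    if kind == "arcface":      # additive angular margin: s * cos(theta + m)
        return s * math.cos(theta + m)
    raise ValueError(f"unknown margin type: {kind}")

# The same angle yields different target logits depending on where m enters.
theta = math.pi / 3
for kind, m in (("sphereface", 2), ("cosface", 0.35), ("arcface", 0.5)):
    print(kind, round(target_logit(theta, m, s=30.0, kind=kind), 3))
```

The margin values in the loop are arbitrary illustration values, not settings taken from any of the cited papers.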
2.3 ReID with GAN
Generative Adversarial Nets (GAN) [goodfellow2014generative] are among the most popular networks in deep learning. One dilemma of ReID is the lack of training data, and GAN is inherently useful for generating samples. To our knowledge, Zheng et al. [zheng2017unlabeled] is the first ReID-related work to use GAN to generate random unlabeled samples, with label smoothing applied because the synthesized images have no labels. Since person re-identification suffers from image style variations caused by multiple cameras, Zhong et al. [Zhong_2018_CVPR] use CycleGAN [zhu2017unpaired] to transfer images from one camera to another to bridge the gaps. Wei et al. [wei2018person] propose a Person Transfer Generative Adversarial Network that transfers images from one dataset to another to bridge the domain gap between datasets. Additionally, GAN is also used for synthesizing training images to reduce the impact of pose variation [qian2018pose, liu2018pose]. Chen et al. [chen2019instance] propose CR-GAN for synthesizing person images in the unlabeled target domain. ReadNet not only avoids the generation of additional samples, but also bridges the camera-caused gaps by removing camera-related information with an adversarial network.
ReadNet consists of two independent parts: the angular triplet loss function (ATL) and an adversarial camera network (ACN). This section will first look back at triplet loss and then present the design and algorithms of ATL and ACN.
3.1 Triplet Loss
Triplet loss [schroff2015facenet] is one of the most popular loss functions in metric learning. For a triplet $(a, p, n)$, triplet loss is formulated as Equation (5):

$$L_{tri} = \left[\, d(f_a, f_p) - d(f_a, f_n) + m \,\right]_+ \tag{5}$$

where $d(x, y)$ is a metric function measuring the distance or similarity between $x$ and $y$ in the embedding space, $a$ is the anchor sample, $p$ is a positive sample with the same ID as $a$, and $n$ denotes a negative sample. $F$ is the feature extractor with parameter $\theta_F$. For the sake of clarity, $f_a$ is used as a shortcut for $F(a; \theta_F)$, where $\theta_F$ is omitted. $m$ is the margin threshold: $d(f_a, f_p)$ must be less than $d(f_a, f_n)$ by at least $m$. The notation $[\cdot]_+$ means $\max(\cdot, 0)$.
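As a minimal sketch of Equation (5), assuming Euclidean distance as the metric and plain Python lists as features (the helper names `euclidean` and `triplet_loss` are ours, not part of the ReadNet prototype):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_loss(f_a, f_p, f_n, margin, dist=euclidean):
    """Hinge triplet loss: max(d(a, p) - d(a, n) + margin, 0)."""
    return max(dist(f_a, f_p) - dist(f_a, f_n) + margin, 0.0)

# The negative is already farther than the positive by more than the margin,
# so this triplet contributes no loss.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=1.0))
# The negative is closer than the positive, so the loss is positive.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0))
```

Passing a different `dist` function is how the same hinge can be reused with another metric.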
The batch-hard method with $PK$-style sampling in TriNet [hermans2017defense] picks $P$ classes randomly, and then samples $K$ images for each class randomly to create $PK$ anchors in a mini-batch. Then, it chooses the hardest positive and the hardest negative for each anchor to form $PK$ triplet terms contributing to the triplet loss in a mini-batch.
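The batch-hard selection described above can be sketched as follows. This is an illustrative pure-Python version under our own naming (`batch_hard_triplet_loss`), using Euclidean distance and excluding the anchor itself from its positives; it is not the TriNet or ReadNet implementation.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def batch_hard_triplet_loss(features, labels, margin):
    """For each anchor, pair the farthest positive with the closest negative,
    then average the hinge losses over all anchors that have both."""
    losses = []
    for i, (f_a, y_a) in enumerate(zip(features, labels)):
        pos = [euclidean(f_a, f) for j, (f, y) in enumerate(zip(features, labels))
               if y == y_a and j != i]
        neg = [euclidean(f_a, f) for f, y in zip(features, labels) if y != y_a]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        hardest_pos, hardest_neg = max(pos), min(neg)
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch with P = 2 identities and K = 2 one-dimensional features each.
features = [[0.0], [2.0], [1.0], [5.0]]
labels = [0, 0, 1, 1]
print(batch_hard_triplet_loss(features, labels, margin=0.5))
```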
Triplet loss is designed to pull the positives closer and simultaneously push the negatives away with a threshold margin, so that each anchor ends up closer to its positives than to its negatives by at least the margin. Many ReID studies [ding2015deep, schroff2015facenet, chen2017beyond, ristani2018features, sohn2016improved] train the model with triplet loss, taking the $L_2$-norm distance as the distance metric function. However, they suffer from the issues discussed in Section 3.2.
3.2 Angular Triplet Loss
The essential challenge of ReID is encoding images into robust discriminative features. Though feature normalization is a straightforward way to alleviate the data imbalance in ReID datasets to some extent [guo2017one], it hurts performance, because normalizing before computing the Euclidean distance discards information that is captured by the $L_2$-norm of the features. This is the inspiration and motivation of ATL. After normalization, the Euclidean distance is equivalent to cosine-similarity, so by using cosine-similarity as the metric function, the triplet loss could be formulated as Equation (6):

$$L_{cos} = \left[\, \cos\theta_{an} - \cos\theta_{ap} + m \,\right]_+ \tag{6}$$

where $\theta_{ap}$ and $\theta_{an}$ are the angles between the anchor feature $f_a$ and the positive feature $f_p$, and between $f_a$ and the negative feature $f_n$, respectively.
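The equivalence between Euclidean distance and cosine-similarity after normalization can be checked in one line. Assuming the anchor feature $f_a$ and the positive feature $f_p$ are normalized to unit length and $\theta_{ap}$ is the angle between them (the symbol names are our shorthand):

```latex
\|f_a - f_p\|_2^2
  = \|f_a\|_2^2 + \|f_p\|_2^2 - 2 f_a^{T} f_p
  = 2 - 2\cos\theta_{ap}
```

So on unit-length features a smaller Euclidean distance corresponds exactly to a larger cosine-similarity, and ranking by either quantity gives the same order.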
Like ArcFace [deng2019arcface], we transform the cosine-similarity triplet loss to as in Equation (7). is very different from because has a linear angular margin, which could result in a linear decision boundary. Unfortunately, might be difficult to converge due to the small gradients, especially when is small. Both and become very quickly in the experiments, which makes the model hard to optimize.
Actually, by leveraging , the expression could be eliminated to ensure and avoid the potential convergence issue eventually. Note that , then is truncated from to to avoid the denominator to be 0. Since is a very small, it is reasonable to set in the experiments. Therefore, the triplet loss (embedding loss) with angular margin becomes Equation (8).
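As a rough illustration of a triplet loss with a linear angular margin, the sketch below applies the hinge directly to the difference of two angles. The exact form, the function names, and the clamping constant `eps` are our assumptions for illustration, and the sketch does not reproduce the final embedding loss of Equation (8). The clamp keeps the cosine strictly inside $(-1, 1)$, where the derivative of `acos` stays finite.

```python
import math

def angle(x, y, eps=1e-7):
    """Angle in radians between x and y, with the cosine clamped to
    [-1 + eps, 1 - eps] (eps is an assumed illustration value)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0 + eps, min(1.0 - eps, dot / norm))
    return math.acos(cos)

def angular_triplet_loss(f_a, f_p, f_n, margin):
    """Hinge on the angle difference: max(theta_ap - theta_an + margin, 0)."""
    return max(angle(f_a, f_p) - angle(f_a, f_n) + margin, 0.0)

# The angle ignores feature length: rescaling either vector leaves it unchanged.
print(angle([1.0, 0.0], [1.0, 1.0]), angle([5.0, 0.0], [3.0, 3.0]))
print(angular_triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], margin=0.1))
```

Because only angles enter the loss, the margin is measured in radians rather than in units of feature distance.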
As ATL only considers the relative distance but not absolute distance, we add a regularization term to limit the norm of features, making the features gather in the embedding space. Finally, the ATL is shown in Equation (9):
where a hyper-parameter controls the weight of the regularization term. Figure 1 illustrates how ATL behaves differently from the original triplet loss. Figure 1(a) and Figure 1(b) illustrate the feature distributions learned on the testing set by the original triplet loss and the angular triplet loss, respectively. Unsurprisingly, the results show that features of the same class are clustered by angle with ATL, while the original triplet loss clusters features by Euclidean distance. More importantly, ATL learns wider linear decision boundaries than the original loss.
3.3 Adversarial Camera Network
As discussed in Section 1, ReID images are usually taken by multiple cameras, which causes differences in perspective, surroundings, and pose and makes it hard to learn a robust model. The camera-related noise is also encoded into the extracted features, which is harmful for identifying the person-ID. Therefore, the challenge is how to remove such camera information from the feature representations. This can be accomplished by an adversarial network with a camera discriminator.
As illustrated in Figure 2, we define a feature extractor $F$ with parameter $\theta_F$ and a camera discriminator $D$ with parameter $\theta_D$. The responsibility of $F$ is representation learning, namely, extracting perspective-invariant and distinguishable feature representations. $D$ acts like the discriminator in GAN [goodfellow2014generative], trying to distinguish the camera-ID. The sole goal of $D$ is to guide the learning process of $F$ toward perspective-invariant feature representations. Label smoothing cross entropy [Szegedy_2016_CVPR] is employed for camera-ID prediction. Equation (11) depicts the loss for one sample, ignoring the triplet sampling strategy:

$$L_{cam} = -\sum_{k=1}^{K} q_k \log p_k \tag{11}$$

where $K$ is the number of classes, i.e. the number of cameras in a ReID dataset, $p_k$ is the predicted probability of the $k$-th class, and $y_k$ is the indicator variable, which is 1 when $k$ is the ground-truth camera and 0 otherwise. $q_k$ is the smoothed label of $y_k$, as in Equation (12):

$$q_k = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K} \tag{12}$$

with $\varepsilon$ a hyperparameter.
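The label smoothing cross entropy used for camera-ID prediction can be sketched in a few lines. This follows the standard formulation of label smoothing (smoothed label = `(1 - eps) * one_hot + eps / K`); the function names and the example logits are ours and only illustrative.

```python
import math

def smooth_labels(true_class, num_classes, eps):
    """Mix the one-hot label with a uniform distribution over all classes."""
    return [(1.0 - eps) * (1.0 if k == true_class else 0.0) + eps / num_classes
            for k in range(num_classes)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_smoothing_ce(logits, true_class, eps):
    """Cross entropy between the smoothed label and the predicted distribution."""
    probs = softmax(logits)
    q = smooth_labels(true_class, len(logits), eps)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, probs))

# Six cameras (as in Market1501); the discriminator favors camera 2.
logits = [0.1, 0.3, 2.0, -0.5, 0.0, 0.2]
print(label_smoothing_ce(logits, true_class=2, eps=0.1))
```

With `eps = 0` the function reduces to the ordinary cross entropy on the one-hot camera label.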
In practice, the input of the camera discriminator $D$ is the output of the feature extractor $F$, $F$ is much deeper than $D$, and $F$ is often pretrained while $D$ is not, so it is difficult for $D$ to compete with $F$. A hyper-parameter, $\lambda$, is introduced to reduce the weight of the camera loss $L_{cam}$ and make the game between $F$ and $D$ more balanced. The final adversarial loss combines the triplet loss $L_{tri}$ and $L_{cam}$, as shown in Equation (13):

$$L_{adv} = L_{tri} - \lambda L_{cam} \tag{13}$$

As Equation (13) shows, training the feature extractor minimizes the triplet loss and maximizes the camera discriminator loss at the same time, as in Equation (14). The camera discriminator learns to distinguish cameras by minimizing $L_{cam}$, which forms an adversarial relationship with the feature extractor, as in Equation (15):

$$\hat{\theta}_F = \arg\min_{\theta_F} \left( L_{tri} - \lambda L_{cam} \right) \tag{14}$$

$$\hat{\theta}_D = \arg\min_{\theta_D} L_{cam} \tag{15}$$
Since the goals of the two objective functions are opposite, the training process of the minimax game can be divided into two sub-processes. One sub-process optimizes the feature extractor, and the other optimizes the camera discriminator. Both sub-processes can be implemented with Adam [kingma2014adam]. In our experiments, we train the feature extractor for 1 step after a fixed number of discriminator steps, as shown in Algorithm 1.
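Algorithm 1: Alternating training of the feature extractor and the camera discriminator. Each step operates on the extracted features for the current batch, together with the camera labels and person labels of that batch.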
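The alternation between the two sub-processes can be sketched as a simple schedule. Everything named here (`alternating_schedule`, `extractor_objective`, the weight `lam`) is hypothetical and only mirrors the description in the text: the discriminator is updated a fixed number of times, then the feature extractor is updated once on the triplet loss minus the weighted camera loss.

```python
def alternating_schedule(num_extractor_steps, disc_steps_per_extractor_step):
    """Yield 'D' for each discriminator update and 'F' for each
    feature-extractor update, in the order they would be executed."""
    for _ in range(num_extractor_steps):
        for _ in range(disc_steps_per_extractor_step):
            yield "D"
        yield "F"

def extractor_objective(l_tri, l_cam, lam):
    """Objective minimized by the feature extractor: the triplet loss minus
    the down-weighted camera loss, so confusing the discriminator is rewarded."""
    return l_tri - lam * l_cam

# One discriminator step per extractor step gives strict alternation.
print(list(alternating_schedule(3, 1)))
print(extractor_objective(l_tri=1.0, l_cam=2.0, lam=0.25))
```

In a real training loop each `'D'` would trigger an optimizer step on the discriminator parameters only, and each `'F'` a step on the extractor parameters only.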
Due to the flexible design of ReadNet, ATL and ACN could be deployed independently, so this section discusses the evaluation results of ATL, ACN as well as the combination of ATL+ACN.
ReadNet is evaluated against 3 widely used ReID datasets: Market1501 [zheng2015scalable], DukeMTMC-ReID [Zheng_2017_ICCV] and CUHK03 [li2014deepreid], in which the first 2 datasets are large and the last one is relatively small. The number of cameras varies across different datasets. Table 1 presents the details of the datasets.
Market1501 contains 32,668 images of 1,501 person identities, captured by 6 cameras (5 high-resolution cameras and 1 low-resolution camera). There are 751 identities for training and 750 for testing, with 12,936 training images and 19,732 gallery images detected by DPM [felzenszwalb2008discriminatively], and 3,368 manually cropped query images.
DukeMTMC-ReID consists of 1,404 identities captured by 8 cameras. All 36,411 bounding boxes are manually labeled. We adopt the evaluation protocol of Zheng et al. [Zheng_2017_ICCV]: the training set has 16,522 images of 702 identities, and the testing set has 702 identities, with 17,661 gallery images and 2,228 query images.
CUHK03 is another dataset, captured by 5 pairs of cameras, that includes 1,467 identities, each observed by two disjoint camera views. The bounding boxes are obtained in two ways: manually cropped (labeled) and DPM-detected (detected). We focus on the results of the labeled set, but also report the results of the detected set. We adopt the training and testing protocol for CUHK03 of Zhong et al. [zhong2017re]. For the labeled set, there are 767 identities with 7,368 images in the training set, and 700 identities in the testing set with 1,400 query images and 5,328 gallery images. For the detected set, there are 767 identities with 7,365 images in the training set, and 700 identities in the testing set with 1,400 query images and 5,332 gallery images.
All experiments share the same global configuration except the margin and the camera loss weight. The prototype is implemented with PyTorch, and all of the models are trained on a single NVIDIA TITAN Xp.
4.2.1 Training Parameters
$PK$-style batches are employed in the experiments. For a fair comparison, the batch size is set to 128 to match TriNet [hermans2017defense] by setting $P$ to 32 and $K$ to 4, so a batch contains 32 identities and 4 images per identity. All images are resized to a common resolution, and only random horizontal flips are used for data augmentation.
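A batch of 32 identities with 4 images each can be drawn as in the following sketch. The function `sample_pk_batch` and the fallback of sampling with replacement when an identity has fewer than `K` images are our assumptions for illustration, not details confirmed by the text.

```python
import random
from collections import defaultdict

def sample_pk_batch(labels, P, K, rng):
    """Return dataset indices for P random identities with K images each.
    Falls back to sampling with replacement for identities with < K images."""
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    chosen_ids = rng.sample(sorted(by_id), P)
    batch = []
    for pid in chosen_ids:
        pool = by_id[pid]
        if len(pool) >= K:
            batch.extend(rng.sample(pool, K))
        else:
            batch.extend(rng.choices(pool, k=K))
    return batch

# Toy dataset: 40 identities with 5 images each.
labels = [i // 5 for i in range(200)]
batch = sample_pk_batch(labels, P=32, K=4, rng=random.Random(0))
print(len(batch), len({labels[i] for i in batch}))
```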
Adam [kingma2014adam] is chosen as the optimizer for both the feature extractor and the discriminator network, each with its own base learning rate and a shared weight decay factor. All other optimizer hyper-parameters are the PyTorch defaults. The number of epochs is set to 600, and the learning rate decays by a constant factor at epochs 200 and 400.
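The step decay at epochs 200 and 400 can be written as a small function. The name `step_lr` is ours, and the base learning rate and decay factor `gamma` are left as required arguments because their values are not recoverable from the text; the numbers in the usage lines are placeholders only.

```python
def step_lr(base_lr, epoch, gamma, milestones=(200, 400)):
    """Learning rate at a given epoch: multiplied by gamma once for every
    milestone that has been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Placeholder values for base_lr and gamma, chosen only to show the shape.
for epoch in (0, 199, 200, 399, 400, 599):
    print(epoch, step_lr(base_lr=1.0, epoch=epoch, gamma=0.5))
```

This matches the behavior of a multi-step schedule in which the decay takes effect from the milestone epoch onward.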
Features are normalized when computing and , but not normalized for and original triplet loss . It is observed that Euclidean-margin can reach its best performance at , so the Euclidean-margin is set to in all the experiments. The angular-margin is set to on DukeMTMC-ReID, while for the other two datasets when the best accuracy is reached. In addition, the weight is set at in loss in all experiments. Because the dimensions are different, the weight is set at and for Euclidean-distance-based triplet loss and ATL, respectively.
4.2.2 Network Architecture
The baseline is a reimplementation of TriNet following its published description [hermans2017defense], where a pretrained ResNet-50 is used as the feature extractor with the weights provided by He et al. [he2016deep]. The baseline is abbreviated as Basel. The camera discriminator contains 2 fully-connected layers, and ReLU [glorot2011deep] and Dropout [srivastava2014dropout] are applied after each layer. The output widths of the 2 layers are set relative to the number of cameras in the dataset.
4.2.3 Training Strategy
The feature extractor and the camera discriminator are trained alternately, that is, the number of discriminator steps per feature-extractor step in Algorithm 1 is set to 1. As a result, on Market1501, there are approximately 7,000 iterations for the feature extractor and another 7,000 for the discriminator, for a total of about 14,000 iterations. The same calculation applies to the other datasets. In our configuration, training usually takes about 1 hour for the feature extractor and another hour for the camera discriminator.
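The figure of roughly 7,000 iterations per network can be reproduced under two assumptions that the text does not state explicitly: an epoch is one pass over the training identities in batches of 32 identities (incomplete batches dropped), and the 600 epochs are split evenly between the two networks. The function name is ours.

```python
def iterations_per_network(num_train_ids, ids_per_batch, total_epochs, num_networks=2):
    """Estimated update steps per network, assuming identity-based epochs
    and an even split of epochs between the alternating networks."""
    batches_per_epoch = num_train_ids // ids_per_batch  # drop the incomplete batch
    epochs_each = total_epochs // num_networks
    return batches_per_epoch * epochs_each

# Market1501: 751 training identities, 32 identities per batch, 600 epochs.
print(iterations_per_network(751, 32, 600))
```

Under these assumptions Market1501 gives 23 batches per epoch and 300 epochs per network, which is close to the approximately 7,000 iterations reported.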
Mean average precision (mAP) score and cumulative matching curve (CMC) are basic evaluation metrics commonly used in related research [hermans2017defense, zhong2017re, chen2017beyond, wang2017adversarial]. Since ReID is usually regarded as a ranking problem, CMC at rank-1 is reported along with the mAP score to make the results more convincing. Single query mode is used in all the experiments.
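Both metrics can be computed from per-query ranked match lists, as in the sketch below. It is a simplified illustration with our own function names: each query is represented by a list of booleans over the gallery sorted by similarity, and the usual ReID step of discarding gallery images from the same camera as the query is assumed to have been applied beforehand.

```python
def average_precision(ranked_matches):
    """Average precision for one query; ranked_matches[i] is True when the
    gallery image at rank i + 1 has the same identity as the query."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(all_ranked_matches, rank=1):
    """Return (mAP, CMC at the given rank) over all queries."""
    n = len(all_ranked_matches)
    aps = [average_precision(r) for r in all_ranked_matches]
    cmc = [1.0 if any(r[:rank]) else 0.0 for r in all_ranked_matches]
    return sum(aps) / n, sum(cmc) / n

# Two toy queries: the first has its best match at rank 1, the second at rank 2.
queries = [[True, False, True], [False, True, False]]
mean_ap, rank1 = evaluate(queries)
print(mean_ap, rank1)
```

Rank-1 only checks the top result, while mAP rewards placing every correct gallery image near the top, which is why the two are reported together.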
4.3.1 Comparison with Baseline
The results illustrated in Table 2 show improvement over baseline for either ATL or ACN in most cases, while ATL+ACN beats the others in all cases. Particularly, ATL+ACN increases mAP from 61.28% to 63.50%, and rank-1 accuracy from 77.53% to 79.26% on DukeMTMC-ReID. In the meantime, ATL+ACN delivers 3.40% and 2.86% improvement in mAP and rank-1 accuracy, respectively on CUHK03 (labeled), and gains 3.02% and 2.47% improvement in mAP and rank-1 accuracy on Market1501, respectively. This indicates that ReadNet consolidates the benefits of ATL and ACN.
4.3.2 Comparison with Existing Methods
Table 3 presents the results of ReadNet and well-known existing methods on CUHK03. As CUHK03 contains the fewest training images among the three datasets, it is regarded as the hardest on which to learn a robust deep representation. However, the results indicate that ATL and ACN work very well on CUHK03, and again, the combination ATL+ACN is the best in most cases, surpassing many existing methods with 57.20% mAP and 60.00% rank-1 accuracy on CUHK03 (labeled).
Table 3: Comparison with existing methods on CUHK03 (labeled) and CUHK03 (detected).
The results on Market1501 and DukeMTMC-ReID are reported in Table 4. Particularly, ATL itself outperforms many current methods with 63.17% in mAP and 79.08% in rank-1 accuracy on DukeMTMC-ReID. ATL+ACN achieves competitive metrics with 74.05% in mAP and 88.78% in rank-1 accuracy on Market1501. This means that ACN contributes less than ATL in these datasets.
It can be observed that ReadNet outperforms some recent methods published in the past 2 years, such as Liu et al. [liu2018pose], Qian et al. [qian2018pose], and Yao et al. [yao2019deep]. On all three datasets, both ATL and ACN achieve competitive performance, while ATL+ACN usually reaches the highest scores. This strongly suggests that both ATL and ACN help to learn pedestrian-discriminative-sensitive and multi-camera-invariant representations, and that ReadNet, as their combination, can leverage both simultaneously.
4.3.3 Comparison of Various Margins
As an important hyper-parameter in triplet loss, the margin can affect the results significantly, so it is necessary to evaluate its influence. Since ACN is independent of the margin, experiments are conducted with ATL only for a fair comparison. The baseline and ATL are each evaluated over their own set of margin values. Table 5 shows the results in detail. ATL outperforms the Euclidean-distance-based triplet loss in all metrics.
To address the data imbalance and domain gap challenges in ReID applications, this paper proposed ReadNet, an adversarial camera network with an angular triplet loss. ATL outperforms Euclidean-distance-based triplet loss functions on various datasets and mitigates the effect of data limitation as well as data imbalance. For the domain gaps introduced by independent cameras, the adversarial camera network is devised to filter out useless multi-camera information, which encourages the feature extractor to learn pedestrian-discriminative-sensitive and multi-camera-invariant feature representations. The resulting model is more robust to the noise introduced by cameras. Though ATL and ACN initially target ReID, they could be ported to other application domains, especially triplet-loss-related or multi-view-related use cases. In the future, we will extend our work to address the potential training instability problem in ReID.