Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification

10/02/2017 · Qiqi Xiao et al. · Megvii Technology Limited, Zhejiang University, Carnegie Mellon University

Person re-identification (ReID) is an important task in computer vision. Recently, deep learning with a metric learning loss has become a common framework for ReID. In this paper, we propose a new metric learning loss with hard sample mining, called margin sample mining loss (MSML), which achieves better accuracy than other metric learning losses such as the triplet loss. In experiments, our proposed method outperforms most state-of-the-art algorithms on Market1501, MARS, CUHK03 and CUHK-SYSU.

1 Introduction

Person re-identification (ReID) is an important and challenging task in computer vision. It has many applications in surveillance video, such as tracking a person across multiple cameras and searching for a person in a large gallery. However, several issues make the task difficult, such as large variations in pose, viewpoint, illumination, background and occlusion. The similarity of appearance among different persons further increases the difficulty.

Some traditional ReID approaches focus on low-level features such as colors, shapes and local descriptors [8, 10]. With the development of deep learning, the convolutional neural network (CNN) is now commonly used for feature representation [26, 38, 5]. CNN based methods can learn high-level features and thus improve the performance of person ReID. In supervised learning, current methods can be divided into representation learning and metric learning according to the target loss. In representation learning, ReID is treated as a verification or identification problem. For instance, in [57] the authors compare a verification baseline with an identification baseline: (1) the former judges whether two images belong to the same person; (2) the latter treats each identity as a category and minimizes a softmax loss. In later work, Lin et al. combined the verification loss with an attribute loss [19], while Matsukawa et al. combined the identification loss with an attribute loss [26]. Representation learning based methods have clear advantages: they give reasonable performance and are easy to train and reproduce. However, they do not directly model the similarity of image pairs, which makes it difficult to separate pairs of the same person from pairs of different persons. To mitigate this problem, various distance losses have been proposed, such as the contrastive loss [38], triplet loss [21], improved triplet loss [5] and quadruplet loss [3]; in addition, [12] proposes hard batches built by sampling hard image pairs. These methods directly evaluate the similarity of two input images from their embedding features. Although distance losses are sensitive to the sampled pairs, which increases the training difficulty, they generally achieve better performance than representation learning based methods.

Figure 1:

Framework of our method. The input data are grouped by identity. A distance matrix of the features extracted by the CNN is computed, and the minimum negative-pair distance and the maximum positive-pair distance are fed into the loss function.

In this paper, we propose a novel metric learning loss with hard sample mining called margin sample mining loss (MSML). It minimizes the distance of positive pairs while maximizing the distance of negative pairs in the feature embedding space. For the original triplet or quadruplet loss, the pairs are randomly sampled. In our method, we group several images of each sampled identity into a batch and calculate an N × N distance matrix, where N denotes the batch size. We then take the maximum distance over positive pairs and the minimum distance over negative pairs to compute the final loss. In this way, we select the most dissimilar positive pair and the most similar negative pair, which are the hardest to distinguish in the batch. On Market1501, MARS, CUHK03 and CUHK-SYSU, our method outperforms most state-of-the-art methods.
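As a concrete illustration of the batch-wise computation described above, the following is a minimal PyTorch sketch of the pairwise distance matrix; the function name and shapes are our own illustration, not code from the paper.

```python
import torch

def pairwise_distances(features):
    """Return an N x N matrix of Euclidean distances between row features."""
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, clamped at zero for numerical safety
    sq = (features ** 2).sum(dim=1, keepdim=True)           # N x 1
    dist_sq = sq - 2.0 * features @ features.t() + sq.t()   # N x N
    return dist_sq.clamp(min=0).sqrt()
```

The hardest positive and negative pairs used by MSML are then read off this matrix, as sketched in Section 3.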

In the following, we outline the main components of our method and summarize the contributions:

  • We propose a new loss with extremely hard sample mining, named margin sample mining loss, which outperforms other metric learning losses on the person ReID task.

  • Our method achieves strong performance on the four datasets above, surpassing most state-of-the-art methods.

The paper is organized as follows: related work is discussed in more detail in Section 2. In Section 3, we introduce our MSML. Datasets and experiments are presented in Section 4. Conclusions and outlook are given in Section 5.

2 Related Work

2.1 Deep convolutional networks

Several popular deep networks, including AlexNet (CaffeNet) [15], GoogLeNet [36] and ResNet [11], have been proposed in the past few years. Many works show that ResNet outperforms other baseline models on the person ReID task [60, 55, 59]. Most current papers choose ResNet50 pre-trained on the ImageNet LSVRC image classification dataset [32] as the baseline network. In this paper, we also choose ResNet50 as our baseline network, but modify its structure.

ResNet is the original deep residual network, and several improved versions exist, such as ResNeXt [45], DenseNet [13] and ShuffleNet [51]. All of these works use efficient channel-wise convolutions or group convolutions in their building blocks to balance representation capability and computational cost. Unlike regular convolutions, group convolutions divide the feature maps into several groups that are convolved separately and concatenated afterwards. Channel-wise (depthwise) convolution, in which the number of groups equals the number of channels, is a special case of group convolution and can effectively reduce computational cost. Xception [6] uses a large number of channel-wise convolutions. Replacing the regular convolutions in ResNet with building blocks designed around group and channel-wise convolutions is a popular strategy that improves accuracy at lower computational cost.
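To make the distinction between regular, group and channel-wise convolutions concrete, here is a minimal PyTorch sketch of a depthwise-separable block of the kind these architectures build on; the module name and channel arguments are illustrative and not taken from any of the cited networks.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 channel-wise (depthwise) convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # groups=in_ch gives one filter per channel: the channel-wise special case of group convolution
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```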

2.2 Deep metric learning

Before deep learning, most traditional metric learning methods focused on learning a Mahalanobis distance in Euclidean space. Cross-view Quadratic Discriminant Analysis (XQDA) [18] and Keep It Simple and Straightforward Metric Learning (KISSME) [14] are two classic metric learning methods for person ReID. Deep metric learning methods, in contrast, transform raw images into embedding features and compute similarity scores or feature distances directly in Euclidean space.

In deep metric learning, two images of the same person form a positive pair, while two images of different persons form a negative pair. The triplet loss is motivated by a margin enforced between positive and negative pairs. The improved triplet loss adds a distance loss on positive pairs to reinforce the clustering of images of the same person in the feature space. In a triplet, the positive pair and the negative pair share a common image, so a triplet involves only two identities. The quadruplet loss adds a new negative pair, so a quadruplet samples four images from three identities; its additional term enforces a margin between positive pairs of one identity and negative pairs of the other two identities. Deep metric learning methods are sensitive to how pairs are sampled, and selecting suitable training samples by hard mining has been shown to be effective [12, 3]. A common practice is to pick out dissimilar positive pairs and similar negative pairs according to similarity scores. Compared with identification or verification losses, distance losses for metric learning can enforce a margin between inter-class and intra-class distances. Combining a softmax loss with a distance loss to speed up convergence is also popular.

2.3 Other proposed ReID methods

Some successful unsupervised or transfer learning methods have been proposed recently [7, 30, 29, 48]. One important concern is the bias among datasets collected in different environments. Another problem is the lack of labeled data, which easily causes overfitting. Although supervised CNN based methods have been successful on a given dataset, a network trained on that dataset may perform poorly on other datasets. Therefore, one transfer learning strategy is to train a model on one task and dataset and then fine-tune it for another task on another dataset. For example, a model trained on one dataset can be used to cluster another dataset and predict pseudo labels, which are then used to fine-tune the model [7]. In [29], an unsupervised multi-task dictionary learning method is proposed to address the dataset bias problem.

In addition, some papers focus on obtaining better global or local features. For instance, pose invariant embedding (PIE) aligns pedestrians to a standard pose to reduce the impact of pose variation [55]. Natural language descriptions [16] and images generated by generative adversarial networks (GANs) [59] have been used as additional inputs to the network. Besides the image-based methods above, there are video-based person ReID works that exploit sequence information such as motion or optical flow [41, 47, 46, 24, 27, 54, 22]. RNN architectures and attention models have also been applied to embed sequence features.

After extracting image features, most current works use the L2 Euclidean distance to compute similarity scores for ranking or retrieval. In [40, 60, 1], re-ranking methods are proposed that clearly improve ReID accuracy.

3 Our Method

Besides the deep network used for feature extraction, our method includes a metric learning loss with hard sample mining called MSML.

3.1 Margin Sample Mining Loss for Metric Learning

The goal of metric embedding learning is to learn a function $f_\theta(x): \mathbb{R}^F \to \mathbb{R}^D$ which maps semantically similar instances from the data manifold in $\mathbb{R}^F$ onto metrically close points in $\mathbb{R}^D$ [12]. Deep metric learning aims to find this function by minimizing a metric loss over the training data. We therefore need a metric function $D(\cdot,\cdot)$ to measure distances in the embedding space; these distances are then used to re-identify person images.

3.1.1 Related Metric Learning Methods

One of the most widely used metric learning losses is the triplet loss [33], which produces features that are more discriminative than those learned with a softmax classification loss alone. It is trained on groups of triplets. A triplet contains three images $\{I_A, I_B, I_C\}$, where $I_A$ and $I_B$ are images of the same identity and $I_C$ is an image of a different identity. Each image is mapped to a feature vector by the deep network, and the triplet of $L_2$-normalized features $\{f_A, f_B, f_C\}$ is used to compute distances. The triplet loss is formulated as:

$$ L_{trp} = \sum_{A,B,C} \big( \|f_A - f_B\|_2 - \|f_A - f_C\|_2 + \alpha \big)_+ \qquad (1) $$

where $(z)_+ = \max(z, 0)$ [31] and $\alpha$ is the margin that allows the network to distinguish positive samples from negative ones. The first term shortens the distances of positive pairs, while the second term enlarges the distances of negative pairs. In the triplet loss, each positive pair and negative pair share one common image, so the loss mainly encourages correct ordering of pairs with respect to the same probe image. As a result, it generalizes poorly and is difficult to apply to tasks such as tracking.
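For reference, a minimal PyTorch sketch of (1) follows; the function name and margin value are placeholders of ours, not the paper's settings.

```python
import torch.nn.functional as F

def triplet_loss(f_a, f_b, f_c, margin=0.5):
    """Eq. (1): (f_a, f_b) is a positive pair, (f_a, f_c) a negative pair sharing the probe f_a."""
    d_pos = F.pairwise_distance(f_a, f_b)           # distances of positive pairs
    d_neg = F.pairwise_distance(f_a, f_c)           # distances of negative pairs
    return F.relu(d_pos - d_neg + margin).mean()    # hinge with margin, averaged over triplets
```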

The quadruplet loss [3] extends the triplet loss by adding a second, different negative pair. A quadruplet contains four images $\{I_A, I_B, I_C, I_D\}$, where $I_A$ and $I_B$ are images of the same identity while $I_C$ and $I_D$ are images of two other, distinct identities. Accordingly, a quadruplet of $L_2$-normalized features $\{f_A, f_B, f_C, f_D\}$ is used to compute distances. The quadruplet loss is formulated as:

$$ L_{quad} = \sum_{A,B,C} \big( \|f_A - f_B\|_2 - \|f_A - f_C\|_2 + \alpha \big)_+ + \sum_{A,B,C,D} \big( \|f_A - f_B\|_2 - \|f_C - f_D\|_2 + \beta \big)_+ \qquad (2) $$

where $\alpha$ and $\beta$ are the margins of the two terms. The first term is the same as (1) and focuses on the distance between positive pairs and negative pairs that contain the same probe image. The second term considers the distance between positive pairs and negative pairs with different probe images. With this second constraint, an inter-class distance is encouraged to be larger than an intra-class distance. In [3], the margin $\beta$ is set smaller than $\alpha$ to impose a relatively weaker constraint, so that the second term does not play the leading role.
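A corresponding sketch of (2), again with illustrative margins rather than the values used in [3]:

```python
import torch.nn.functional as F

def quadruplet_loss(f_a, f_b, f_c, f_d, alpha=0.5, beta=0.3):
    """Eq. (2): (f_a, f_b) positive pair; f_c and f_d come from two other identities."""
    d_ab = F.pairwise_distance(f_a, f_b)
    d_ac = F.pairwise_distance(f_a, f_c)
    d_cd = F.pairwise_distance(f_c, f_d)
    relative = F.relu(d_ab - d_ac + alpha)   # same probe image, as in the triplet loss
    absolute = F.relu(d_ab - d_cd + beta)    # different probe images, with beta < alpha
    return (relative + absolute).mean()
```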

However, we can combine these two terms into one and extend (2) to:

$$ L = \sum_{A,B,C,D} \big( \|f_A - f_B\|_2 - \|f_C - f_D\|_2 + \alpha \big)_+ \qquad (3) $$

where $I_C$ can share the same identity with $I_A$ or not.

A direct application of the loss in (3) does not achieve good performance. The reason is that the number of possible quadruplets grows rapidly as the dataset gets larger, and the number of pairs generated from these quadruplets increases accordingly. Most of the samples are relatively easy, especially the negative pairs, whose number grows quadratically compared to the positive ones. Although a margin restricts the distance between positive and negative pairs, most samples are still too easy for the network, so the "precious" hard samples are overwhelmed and the model performance is limited. To relieve this, we apply hard sample mining as in [12]. The triplet loss with hard sample mining processes a batch of samples together. Each batch contains several different identities, each with the same number of samples. For each sample, the most dissimilar sample of the same identity and the most similar sample of a different identity are picked to form a triplet. In [12], the triplet loss with hard sample mining is formulated as:

$$ L_{trihard} = \frac{1}{N} \sum_{A=1}^{N} \Big( \max_{y_B = y_A} \|f_A - f_B\|_2 - \min_{y_C \neq y_A} \|f_A - f_C\|_2 + \alpha \Big)_+ \qquad (4) $$

where $N$ is the batch size and $y_A$ denotes the identity of sample $A$. With hard sample mining, easy samples are filtered out, which improves the robustness of the model.
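A batch-level sketch of (4), reusing the pairwise_distances helper from the introduction; the mask handling and margin value are our own illustration.

```python
import torch

def trihard_loss(features, labels, margin=0.5):
    """Eq. (4): for each anchor, the hardest positive and hardest negative in the batch."""
    dist = pairwise_distances(features)                  # N x N distance matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # N x N identity mask
    # hardest positive: the most distant sample sharing the anchor's label
    d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    # hardest negative: the closest sample with a different label
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.relu(d_pos - d_neg + margin).mean()
```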

3.1.2 Margin sample mining loss

We apply a new hard example mining strategy to (3), named margin sample mining loss (MSML). It picks the most dissimilar positive pair and the most similar negative pair in the whole batch:

$$ L_{msml} = \Big( \max_{y_A = y_B} \|f_A - f_B\|_2 - \min_{y_C \neq y_D} \|f_C - f_D\|_2 + \alpha \Big)_+ \qquad (5) $$

where $(A, B)$ ranges over the positive pairs and $(C, D)$ over the negative pairs in the batch, and $C$ and $D$ can share the same identity with $A$ or not. HardNet [28] also proposed a loss that maximizes the distance between the closest positive and the closest negative patch in a batch, and shows strong performance on some other tasks.
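A minimal sketch of (5), again building on the pairwise_distances helper sketched earlier; variable names and the margin value are ours.

```python
import torch

def msml_loss(features, labels, margin=0.5):
    """Eq. (5): hardest positive pair vs. hardest negative pair over the whole batch."""
    dist = pairwise_distances(features)                   # N x N distance matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    hard_pos = dist[same & ~eye].max()   # most dissimilar positive pair in the batch
    hard_neg = dist[~same].min()         # most similar negative pair in the batch
    return torch.relu(hard_pos - hard_neg + margin)
```

Only two pairs contribute to the loss of each batch, which matches the sparse connections illustrated in Figure 2.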

(a) Relative distance
(b) Absolute distance
Figure 2: Two examples of edge mining samples.

As shown in Figure 2, the connections are extremely sparse: only two pairs in a batch participate in the training phase. The figure shows two example cases of our margin sample mining loss. In Figure 2(a), the positive pair and the negative pair share one identity, which corresponds to a relative distance; this covers the samples that the triplet loss (with or without hard sample mining) can use. In Figure 2(b), the positive pair and the negative pair do not share any identity, which corresponds to an absolute distance; this covers the second term of the quadruplet loss. It may seem that we waste a lot of training data, but the two chosen pairs are determined by all the data in the batch. As the loss decreases, the separation between positive-pair and negative-pair distances grows for most pairs, not only for the two chosen ones. In addition, we randomly sample the training data in each batch, which keeps the pairs diverse as training progresses.

In (5), the first term is an upper bound on the distances of all positive pairs in a batch, and the second term is a lower bound on the distances of all negative pairs. Unlike other metric learning losses, which push positive and negative pairs apart sample by sample, our MSML pushes apart the bounds of the two sets within a batch. As training proceeds, a sharp demarcation emerges between positive pairs and negative pairs in the feature embedding space, which we believe is a useful property for certain tasks.

In summary, compared with other metric learning losses, our MSML has the following advantages. First, MSML considers not only the relative distances between positive and negative pairs containing the same probe sample, but also the absolute distances between positive and negative pairs from different probe samples. Second, it inherits the advantages of hard sample mining and extends it to edge mining, which leads to better performance. Finally, MSML is easy to implement and to combine with other methods.

4 Experiments

We first conduct two sets of experiments: 1) comparing different networks on person ReID tasks, and 2) evaluating the performance of different losses. We then compare the proposed approach with other state-of-the-art methods. Note that we train a single model using all datasets, as in [42, 53].

4.1 Datasets

We use public datasets including CUHK03 [17], CUHK-SYSU [43], Market1501 [56] and MARS [35] in our experiments.

CUHK03 contains 14,097 images of 1,467 identities. It provides both bounding boxes detected by the deformable part model (DPM) and manually labeled bounding boxes. In this paper, we evaluate our method on the labeled set. Following the evaluation procedure in [17], we randomly pick 100 identities for testing. Since we train one single model for all benchmarks, our protocol differs slightly from the standard one, which randomly splits the dataset 20 times; we split the dataset only once for training and testing.

| Base model | Method | Market1501 mAP | Market1501 r=1 | Market1501 r=5 | MARS mAP | MARS r=1 | MARS r=5 | CUHK-SYSU mAP | CUHK-SYSU r=1 | CUHK-SYSU r=5 | CUHK03 r=1 | CUHK03 r=5 | CUHK03 r=10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resnet50 | Cls | 41.3 | 65.8 | 83.5 | 43.3 | 59.3 | 75.2 | 70.7 | 75.0 | 88.1 | 51.2 | 72.6 | 81.8 |
| Resnet50 | Tri | 54.8 | 75.9 | 89.6 | 62.1 | 76.1 | 89.6 | 82.6 | 85.1 | 94.1 | 73.0 | 92.0 | 96.0 |
| Resnet50 | Quad | 61.1 | 80.0 | 91.8 | 62.1 | 74.9 | 88.9 | 85.6 | 87.8 | 95.7 | 79.1 | 95.3 | 97.9 |
| Resnet50 | TriHard | 68.0 | 83.8 | 93.1 | 71.3 | 82.5 | 92.1 | 82.4 | 85.1 | 94.7 | 79.5 | 95.0 | 98.0 |
| Resnet50 | MSML | 69.6 | 85.2 | 93.7 | 72.0 | 83.0 | 92.6 | 87.2 | 89.3 | 96.4 | 84.0 | 96.7 | 98.2 |
| Inception-v2 | Cls | 40.7 | 66.3 | 84.1 | 45.0 | 62.6 | 77.9 | 74.2 | 78.2 | 89.7 | 50.5 | 68.8 | 77.4 |
| Inception-v2 | Tri | 57.9 | 78.3 | 91.8 | 55.5 | 70.7 | 85.2 | 87.7 | 89.7 | 96.6 | 76.9 | 93.7 | 97.2 |
| Inception-v2 | Quad | 66.2 | 83.9 | 93.6 | 65.3 | 77.8 | 89.9 | 88.3 | 90.2 | 96.6 | 81.9 | 96.1 | 98.3 |
| Inception-v2 | TriHard | 73.2 | 86.8 | 95.4 | 74.3 | 84.1 | 93.5 | 83.5 | 86.1 | 95.2 | 85.5 | 97.2 | 98.7 |
| Inception-v2 | MSML | 73.4 | 87.7 | 95.2 | 74.6 | 84.2 | 95.1 | 88.4 | 90.4 | 96.8 | 86.3 | 97.5 | 98.7 |
| Resnet50-X | Cls | 46.5 | 70.8 | 87.0 | 48.0 | 63.8 | 80.2 | 74.2 | 78.2 | 89.7 | 57.2 | 77.7 | 85.6 |
| Resnet50-X | Tri | 69.2 | 86.2 | 94.7 | 68.2 | 79.5 | 91.7 | 89.6 | 91.4 | 97.0 | 82.0 | 96.3 | 98.4 |
| Resnet50-X | Quad | 64.8 | 83.3 | 93.8 | 63.6 | 77.7 | 89.4 | 87.3 | 89.6 | 96.2 | 80.7 | 94.9 | 97.9 |
| Resnet50-X | TriHard | 71.6 | 86.9 | 94.7 | 69.9 | 82.5 | 92.4 | 86.4 | 88.8 | 96.3 | 82.8 | 96.1 | 98.1 |
| Resnet50-X | MSML | 76.7 | 88.9 | 95.6 | 72.0 | 83.4 | 93.3 | 89.6 | 90.9 | 97.4 | 87.5 | 97.7 | 98.9 |

Table 1: Comparison of different methods. Cls stands for classification loss, Tri for triplet loss [33], TriHard for triplet loss with hard sample mining [12], Quad for quadruplet loss [3] and MSML for our margin sample mining loss. Each metric learning loss is combined with the classification loss.

CUHK-SYSU is a large scale benchmark for person search, containing 18,184 images of 8,432 identities. The dataset is close to real-world application scenarios because the person images are cropped from whole scene images. The training set contains 11,206 images of 5,532 query persons, while the test set contains 6,978 images of 2,900 persons.

Market1501 contains more than 25,000 images of 1,501 labeled persons captured by 6 camera views. There are 751 identities in the training set and 750 identities in the testing set. On average, each identity has 17.2 images with different appearances. All images are detected by the DPM detector, and the dataset includes 2,793 false alarms to mimic a realistic scenario. MARS (Motion Analysis and Re-identification Set) is an extended version of the Market1501 dataset and a large scale video-based person ReID dataset. Since all bounding boxes and tracklets are generated automatically, it contains distractors, and each identity may have more than one tracklet. In total, MARS has 20,478 tracklets of 1,261 identities across 6 camera views.

We evaluate our method with rank-1, rank-5 and rank-10 accuracy and mean average precision (mAP), where the rank-i accuracy is the fraction of queries for which an image of the same identity appears in the top-i results. For each query we compute the average precision (AP), and the mean of these average precisions (mAP) measures the performance from another perspective.
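The following is a simplified sketch of this evaluation protocol (one ranking per query, no camera-id filtering); the function name and array layout are our own convention, not the benchmark toolkits' code.

```python
import numpy as np

def evaluate(query_feats, query_ids, gallery_feats, gallery_ids, topk=(1, 5, 10)):
    """Simplified CMC rank-k and mAP over a set of queries."""
    cmc = np.zeros(max(topk))
    aps = []
    for qf, qid in zip(query_feats, query_ids):
        dists = np.linalg.norm(gallery_feats - qf, axis=1)      # L2 distance to every gallery image
        matches = gallery_ids[np.argsort(dists)] == qid          # sorted by similarity
        hit_ranks = np.where(matches)[0]
        if len(hit_ranks) == 0:
            continue
        cmc[hit_ranks[0]:] += 1                                  # rank-k: a true match in the top k
        hits = np.cumsum(matches)
        precisions = hits[matches] / (hit_ranks + 1)             # precision at each relevant rank
        aps.append(precisions.mean())                            # average precision for this query
    n = len(query_ids)
    return {f"rank-{k}": cmc[k - 1] / n for k in topk}, float(np.mean(aps))
```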

4.2 Implementation Details

Each image is resized to a fixed resolution and augmented with random horizontal flipping, shifting, zooming and blurring. The base models (Resnet50, Inception-v2 and Resnet50-Xception (Resnet50-X)) are pre-trained on the ImageNet dataset, and their final features are mapped to a common embedding dimension through a fully-connected layer. The triplet loss, the quadruplet loss (with its two margins), the triplet loss with hard mining and our loss with edge mining all use fixed margin values. The Adam optimizer is used with a staged schedule: the initial learning rate is kept for the first epochs and then decreased step by step until convergence. The batch size is fixed.
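A sketch of such a training configuration in PyTorch follows. Since the concrete resolution, embedding dimension, margins, learning rates and epoch counts are not reproduced here, every numeric value below is a placeholder assumption, not the paper's setting.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Placeholder hyperparameters -- the paper's exact values are not reproduced here.
IMAGE_SIZE, EMBED_DIM, BATCH_SIZE, BASE_LR = (256, 128), 128, 128, 1e-4

train_transform = T.Compose([
    T.Resize(IMAGE_SIZE),
    T.RandomHorizontalFlip(),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),  # shifting and zooming
    T.GaussianBlur(kernel_size=3),                                      # blurring
    T.ToTensor(),
])

model = resnet50(pretrained=True)                             # ImageNet pre-trained base model
model.fc = torch.nn.Linear(model.fc.in_features, EMBED_DIM)   # fully-connected embedding layer

optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)
# staged learning-rate decay, then train until convergence
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 200], gamma=0.1)
```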

We use Resnet50, Inception-v2 and Resnet50-X as base models with the different loss functions. The results of these comparison experiments are shown in Table 1.

4.3 Results analysis of Different Losses

(a) TriHard
(b) MSML
Figure 3: Distance distributions of two different metric learning losses. Blue boxes are positive pairs and red boxes are negative pairs. The direction arrows are only for visualization.

We conduct experiments with different losses to illustrate the effectiveness of our proposed MSML; the results are shown in Table 1. Cls (classification loss) is the baseline. We then combine the different metric learning losses with the classification loss. For Tri (triplet loss), mAP and rank-1 accuracy increase substantially over the baseline. TriHard (triplet loss with hard sample mining) and Quad (quadruplet loss) both outperform the triplet loss: TriHard is slightly better on Market1501, MARS and CUHK03, while Quad does better on CUHK-SYSU. Finally, our MSML achieves the best accuracy on most datasets for all base models.

In terms of accuracy, TriHard and MSML both reach high scores. We further visualize the distance distributions of some randomly chosen image pairs in Figure 3, where the values below the image pairs are the distances of their features in the embedding space. As shown, under TriHard the distances of some negative pairs can be smaller than those of positive pairs, because TriHard does not constrain absolute distances. In contrast, our MSML learns a finer metric in the feature embedding space.

For Quad and TriHard, some experiments could not reach their best accuracy under the same settings, and in the Inception-v2 and Resnet50-X experiments they can even perform worse than Tri. Compared with them, our MSML consistently gives the best performance.

| Methods | mAP | r=1 |
|---|---|---|
| Temporal [25] | 22.3 | 47.9 |
| Learning [49] | 35.7 | 61.0 |
| Gated [38] | 39.6 | 65.9 |
| Person [4] | 45.5 | 71.8 |
| Pose [55] | 56.0 | 79.3 |
| Scalable [1] | 68.8 | 82.2 |
| Improving [19] | 64.7 | 84.3 |
| In [12] | 69.1 | 84.9 |
| Spindle [53] | - | 76.9 |
| Deep [52] | 68.8 | 87.7 |
| Ours | 76.7 | 88.9 |

Table 2: Comparison on Market1501 with single query
| Methods | mAP | r=1 |
|---|---|---|
| Re-ranking [60] | 68.5 | 73.9 |
| Learning [50] | - | 55.5 |
| Multi [37] | - | 68.2 |
| Mars [35] | 49.3 | 68.3 |
| In [12] | 67.7 | 79.8 |
| Quality [23] | 51.7 | 73.7 |
| See [61] | 50.7 | 70.6 |
| Ours | 74.6 | 84.2 |

Table 3: Comparison on MARS with single query

4.4 Comparison with state-of-the-art methods

We compare our method with representative ReID methods on several benchmark datasets (* marks methods that are on arXiv but not formally published). The results are shown in Tables 2, 3, 4 and 5. Methods that apply re-ranking [60] are not included.

| Methods | r=1 | r=5 | r=10 |
|---|---|---|---|
| Person [18] | 44.6 | - | - |
| Learning [49] | 62.6 | 90.0 | 94.8 |
| Gated [38] | 61.8 | - | - |
| A [39] | 57.3 | 80.1 | 88.3 |
| In [12] | 75.5 | 95.2 | 99.2 |
| Joint [44] | 77.5 | - | - |
| Deep [9] | 84.1 | - | - |
| Looking [2] | 72.4 | 95.2 | 95.8 |
| Unlabeled [59] | 84.6 | 97.6 | 98.9 |
| A [58] | 83.4 | 97.1 | 98.7 |
| Spindle [53] | 88.5 | 97.8 | 98.6 |
| Ours | 87.5 | 97.7 | 98.9 |

Table 4: Comparison with existing methods on CUHK03
| Methods | mAP | r=1 |
|---|---|---|
| End [43] | 55.7 | 62.7 |
| Neural [20] | 77.9 | 81.2 |
| Deep [34] | 74.0 | 76.7 |
| Ours | 89.6 | 90.9 |

Table 5: Comparison with existing methods on CUHK-SYSU

5 Conclusion

In this paper, we propose a new metric learning loss with hard sample mining, named MSML, for person re-identification (ReID). For the triplet and quadruplet losses, positive and negative pairs are randomly sampled; with hard sample mining, easy samples are filtered out, which improves the robustness of the model. In our method, we calculate a distance matrix over the batch and then take the maximum distance of positive pairs and the minimum distance of negative pairs to compute the final loss. In this way, MSML uses the most dissimilar positive pair and the most similar negative pair to train the model.

We use Resnet50, Inception-v2 and Resnet50-X as base models in comparison experiments with different metric learning losses. The results show that our MSML achieves the best performance and learns a finer metric in the feature embedding space. We also compare our method with state-of-the-art methods: on several benchmark datasets, including Market1501, MARS, CUHK-SYSU and CUHK03, our method performs better than most other methods.

References