Adversarial Open-World Person Re-Identification

07/27/2018 ∙ by Xiang Li, et al. ∙ Sun Yat-Sen University

In a typical real-world application of re-id, a watch-list (gallery set) of a handful of target people (e.g. suspects) must be tracked across camera views among a large volume of non-target people; this is called open-world person re-id. Different from conventional (closed-world) person re-id, in the open-world setting a large portion of probe samples do not come from target people. Moreover, a non-target person often looks similar to a target one and can therefore seriously challenge a re-id system. In this work, we introduce a deep open-world group-based person re-id model based on adversarial learning to alleviate the attack problem caused by similar non-target people. The main idea is to learn to attack the feature extractor on the target people by using a GAN to generate very target-like images (imposters), while in the meantime the model makes the feature extractor learn to tolerate the attack by discriminative learning, so as to realize group-based verification. We call the proposed framework adversarial open-world person re-identification, and it is realized by our Adversarial PersonNet (APN), which jointly learns a generator, a person discriminator, a target discriminator and a feature extractor, where the feature extractor and target discriminator share the same weights so that the feature extractor learns to tolerate the attack by imposters for better group-based verification. While open-world person re-id is challenging, we show for the first time that an adversarial-based approach helps stabilize a person re-id system under imposter attack more effectively.


1 Introduction

Figure 1: Overview of adversarial open-world person re-identification. The goal of the generator is to generate target-like images, while we have two discriminators. The person discriminator discriminates whether the generated images are from the source dataset (i.e. pedestrian-like), and the target discriminator discriminates whether the generated images are of target people. By adversarial learning, we aim to generate images beneficial for training a better feature extractor that tells target person images apart from non-target ones.

Person re-identification (re-id), which is to match a pedestrian across disjoint camera views in diverse scenes, is practical and useful for many fields, such as public security applications, and has gained increasing interest in recent years [3, 42, 11, 6, 37, 22, 10, 40, 36, 45, 4, 34]. Rather than re-identifying every person in a multi-camera network, a typical real-world application is to re-identify or track only a handful of target people on a watch list (gallery set); this is called the open-world person re-id problem [42, 4, 46]. While target people reappear in the camera network at different views, a large volume of non-target people, some of whom could be very similar to target people, appear as well. This contradicts the conventional closed-world person re-id setting, in which all probe queries belong to target people on the watch list. In comparison, open-world person re-id is extremely challenging because both target and non-target (irrelevant) people are included in the probe set.

However, the majority of current person re-identification models are designed for the closed-world setting [43, 37, 6, 33, 40, 36, 45, 38] rather than the open-world one. Without considering the discrimination between target and non-target people during learning, these approaches are not stable and often fail to reject a query image whose identity is not included in the gallery set. Zheng et al. [42] considered this problem and proposed an open-world group-based verification model. Their model is based on hand-crafted features and transfer-learning-based metric learning with auxiliary data, but the results are still far from solving this challenge. More importantly, the optimal feature representation and target-person-specific information for the open-world setting have not been learned.

In this work, we present an adversarial open-world person re-identification framework for 1) learning features that are suitable for open-world person re-id, and 2) learning to attack the feature extractor by generating very target-like imposters and making the person re-id system learn to tolerate the attack for better verification. An end-to-end deep neural network is designed to realize the above two objectives, and an overview of this pipeline is shown in Fig. 1. The feature learning and the adversarial learning are mutually related and learned jointly; meanwhile, the generator and the feature extractor learn from each other iteratively to enhance both the quality of the generated images and the discriminability of the feature extractor. To use the unlabeled generated images, we further incorporate a label smoothing regularization for imposters (LSRI) into this adversarial learning process. LSRI allocates to the generated target-like imposters equal probabilities of being any non-target person and zero probability of being a target person, which further improves the ability of the feature extractor to distinguish real target people from fake ones (imposters).

GAN has recently been applied to person re-id models [43, 44, 8] for generating images adapted from a source dataset so as to enrich the training data for the target task. However, our objective is beyond this conventional usage. By sharing the weights between the feature extractor and the target discriminator (see Fig. 2), our adversarial learning makes the generator and the feature extractor interact with each other in an end-to-end framework. This interaction not only makes the generator produce imposters that look like target people but, more importantly, also makes the feature extractor learn to tolerate the attack by imposters for better group-based verification.

In summary, our contributions center on solving the open-world challenge in person re-identification. To our knowledge, this is the first work to formulate open-world group-based person re-identification under an adversarial learning framework. By learning to attack and learning to defend, we realize four advances in a unified framework: generating very target-like imposters, mimicking imposter attacks, discriminating imposters from target images, and learning discriminative re-id features. Our investigation suggests that adversarial learning is an effective way to stabilize a person re-id system under imposter attack.

2 Related Work


Person Re-Identification:

Since person re-identification aims to identify different people, better feature representations have been studied in a great deal of recent research. Some works seek more discriminative/reliable hand-crafted features [10, 17, 25, 26, 22, 37, 23]. Besides that, learning the best matching metric [14, 18, 27, 33, 40, 5, 6] is also widely studied for handling cross-view changes in different environments. With the rapid development of deep learning, learning representations directly from images [1, 7, 9, 20] has attracted attention for person re-id; in particular, Xiao et al. [38] came up with a domain guided dropout model for training a CNN with multiple domains so as to improve the feature learning procedure. Recent deep approaches in person re-identification also unify feature learning and metric learning [36, 45, 1, 31]. Although these deep learning methods are expressive on large-scale datasets, they tend to be fragile under noise and incapable of distinguishing non-target people from target ones, and thus become unsuitable for the open-world setting. In comparison, our deep model aims to model the effect of non-target people during training and to optimize person re-id in the open-world setting.

Towards Open-World Person Re-Identification:

Although the majority of works on person re-id focus on the closed-world setting, a few works have addressed the open-world setting. The work of Candela et al. [4] is based on Conditional Random Field (CRF) inference, attempting to build connections between cameras towards open-world person re-identification. However, their work lacks the ability to distinguish very similar identities, whereas with the advent of deep CNN models, features from multiple camera views can be expressed well by joint camera learning. Wang et al. [34] proposed a new subspace learning model suitable for the open-world scenario. However, the group-based setting and defense against interference are not considered, and their model requires a large volume of extra unlabeled data. Zhu et al. [46] proposed a novel hashing method for fast search in the open-world setting. However, they aimed at large-scale open-world re-identification where efficiency is the primary concern, and robustness to noise is not taken into account. The work most closely related to this paper is by Zheng et al. [42, 41], who proposed group-based verification towards open-world person re-identification. They came up with a transfer relative distance comparison model (t-LRDC), learning a distance metric and transferring non-target data to target data in order to overcome data sparsity. Different from the above works, we present the first end-to-end learning model that unifies feature learning and verification modelling to address the open-world setting. Moreover, our work does not require extra auxiliary datasets to mimic the attack of imposters, but integrates an adversarial process to make the re-id model learn to tolerate the attack.

Adversarial Learning:

In 2014, Szegedy et al. [32] found that tiny perturbations of samples can lead deep classifiers to misclassify, even though such adversarial samples are easily recognized by humans. Since then, many researchers have worked on adversarial training. Seyed-Mohsen et al. [28] proposed DeepFool, which uses the gradient of an image to produce a minimal noise that fools deep networks. However, their adversarial samples target individuals, and the relation between target and non-target groups is not modelled; thus, the method does not fit well into the group-based setting. Papernot et al. [29] formulated a class of algorithms that use knowledge of the deep neural network (DNN) architecture to craft adversarial samples. Rather than forming a general algorithm for DNNs, our method is specific to group-based person verification, and the imposter samples generated are more effective in this scenario. Later, SafetyNet by Lu et al. [24] was proposed, with an RBF-SVM in the fully-connected layer to detect adversarial samples. In contrast, we perform adversarial learning at the feature level to better attack the learned features.

3 Adversarial PersonNet

3.1 Problem Statement

In this work, we concentrate on open-world person re-id by group-based verification. Group-based verification requires a re-id system to determine whether a query person image comes from the target people on the watch list. In this scenario, people outside this list/group are defined as non-target people.

Our objective is to unify feature learning by deep convolutional networks and adversarial learning, so as to make the extracted features robust and resistant to noise when discriminating between target and non-target people. The adversarial learning generates target-like imposter images to attack the feature extraction process and simultaneously makes the whole model learn to distinguish these attacks. For this purpose, we propose a novel deep learning model called Adversarial PersonNet (APN) that suits open-world person re-id.

To clarify the setting used in the following sections, suppose the target training images constitute a target sample set $X^t = \{x^t_1, \ldots, x^t_{N^t}\}$ sampled from target people, where $x^t_i$ indicates the $i$th target image and $y^t_i$ represents the corresponding person/class label. The label set is denoted by $\mathcal{Y}^t$, and there are $C^t$ target classes in total. Similarly, we are given a set of non-target training classes containing $N^{nt}$ images, denoted as $X^{nt} = \{x^{nt}_1, \ldots, x^{nt}_{N^{nt}}\}$, where $x^{nt}_j$ is the $j$th non-target image, $y^{nt}_j$ is the class of $x^{nt}_j$, and $y^{nt}_j \in \mathcal{Y}^{nt}$. Note that there is no identity overlap between target and non-target people; under the open-world setting, $\mathcal{Y}^t \cap \mathcal{Y}^{nt} = \emptyset$. The problem is to better determine whether a person is on the target list; that is, for a given image $x$ without knowing its class $y$, determine whether $y \in \mathcal{Y}^t$. We use $\phi(x; W)$ to represent the feature extracted from image $x$, where $W$ is the weight of the feature extraction part of the CNN.
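To make this decision rule concrete, the following is a minimal sketch of group-based verification in PyTorch. It is not the paper's code: the function name, the Euclidean metric, and the threshold handling are our assumptions (threshold selection is discussed in Sec. 4.1).

```python
import torch

def verify_query(phi, query_img, gallery_feats, tau):
    """Group-based verification sketch: accept the query as a target person
    if its minimal distance to the gallery (target) features is below tau.
    `phi`, the Euclidean metric, and `tau` are illustrative assumptions."""
    with torch.no_grad():
        q = phi(query_img.unsqueeze(0))        # (1, d) feature of the query
    dists = torch.cdist(q, gallery_feats)      # (1, m) distances to gallery
    return dists.min().item() < tau            # True -> verified as target
```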

Figure 2: Adversarial PersonNet structure. The two discriminators $D_p$ and $D_t$ accept samples from both the datasets and the generator $G$. Since $D_t$ shares the same weights with the feature extractor, we represent them as the same cuboid in this figure.

3.2 Learning to Attack by Adversarial Networks

GANs are conventionally designed to generate images similar to those in the source set, which here consists of both the target and non-target image sets. A generator $G$ and a discriminator are trained adversarially: the generator normally only generates images looking like the ones in the source set, and the discriminator distinguishes the generated images from the source ones. In our case, the source datasets are all pedestrian images, so we call such a discriminator the person discriminator $D_p$, reflecting its role of determining whether an image is pedestrian-like.

$D_p$ is trained by minimizing the following loss function:

$$\mathcal{L}_{D_p} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\log D_p(x_i) + \log\big(1 - D_p(G(z_i))\big)\Big] \qquad (1)$$

where $n$ is the number of samples, $x_i$ represents an image from the source dataset, and $z_i$ is randomly generated noise.
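As a minimal sketch of this loss, assuming the standard binary cross-entropy GAN form that Eq. (1) describes (function and variable names are ours, not the paper's code, and $D_p$ is assumed to output sigmoid scores):

```python
import torch
import torch.nn.functional as F

def person_discriminator_loss(D_p, G, real_imgs, z):
    """Standard GAN discriminator loss for the person discriminator D_p
    (our reading of Eq. (1)): real pedestrian images from the source set
    should score high, generated images low."""
    real_scores = D_p(real_imgs)              # D_p(x_i) in (0, 1)
    fake_scores = D_p(G(z).detach())          # stop gradients into G
    return F.binary_cross_entropy(real_scores, torch.ones_like(real_scores)) + \
           F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores))
```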

Suppose there is a pre-trained feature extractor $\phi$ for the person re-id task. To steer the generator $G$ to produce images that are not only pedestrian-like but also attack this feature extractor, we design a parallel discriminator $D_t$ with the following definition:

$$D_t(x) = f_t\big(\phi(x; W)\big) \qquad (2)$$

The discriminator $D_t$ determines whether an image will be regarded as a target image by the feature extractor. Here $\phi$ indicates that part of $D_t$ has the same network structure as the feature extractor and shares the same weights $W$ (in fact, the feature extractor can be regarded as a part of $D_t$), and $f_t$ is a fully-connected layer following the feature extractor, separate from the one connected to the original CNN (the fc layer used to pre-train the feature extractor). Thus $D_t$ shares the same ability of target-person discrimination as the feature extractor. To induce the generator to produce target-like images for attacking, and to ensure the discriminator can tell non-target images and generated imposters apart from target ones, we formulate a parallel adversarial training of $G$ and $D_t$ as

$$\min_{G}\max_{D_t} \; \mathbb{E}_{x \sim p_t}\big[\log D_t(x)\big] + \mathbb{E}_{x \sim p_{nt}}\big[\log(1 - D_t(x))\big] + \mathbb{E}_{z}\big[\log(1 - D_t(G(z)))\big] \qquad (3)$$

We train $D_t$ to maximize $D_t(x)$ when given a target image, but to minimize it when given a non-target image or an imposter image generated by $G$. Notice that this process only trains the final layer $f_t$ of $D_t$ without updating the feature extractor weights $W$, so as to prevent the feature extractor from being affected by discriminator learning while the generated images are not yet good enough. We call $D_t$ the target discriminator, and propose the following loss function for training it:

$$\mathcal{L}_{D_t} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\log D_t(x^t_i) + \log\big(1 - D_t(x^{nt}_i)\big) + \log\big(1 - D_t(G(z_i))\big)\Big] \qquad (4)$$
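A sketch of this loss under the same assumptions as above; freezing the shared weights $W$ is left to the caller, e.g. by giving the optimizer only the parameters of $f_t$:

```python
import torch
import torch.nn.functional as F

def target_discriminator_loss(D_t, G, target_imgs, nontarget_imgs, z):
    """Our reading of Eq. (4): push D_t towards 1 on target images and
    towards 0 on non-target images and on imposters produced by G."""
    s_t = D_t(target_imgs)                    # target images -> 1
    s_nt = D_t(nontarget_imgs)                # non-target images -> 0
    s_imp = D_t(G(z).detach())                # generated imposters -> 0
    return F.binary_cross_entropy(s_t, torch.ones_like(s_t)) + \
           F.binary_cross_entropy(s_nt, torch.zeros_like(s_nt)) + \
           F.binary_cross_entropy(s_imp, torch.zeros_like(s_imp))
```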

We integrate the above into a standard GAN framework as follows:

$$\min_{G}\max_{D_p, D_t} \; \mathbb{E}_{x}\big[\log D_p(x)\big] + \mathbb{E}_{z}\big[\log(1 - D_p(G(z)))\big] + \mathbb{E}_{x \sim p_t}\big[\log D_t(x)\big] + \mathbb{E}_{x \sim p_{nt}}\big[\log(1 - D_t(x))\big] + \mathbb{E}_{z}\big[\log(1 - D_t(G(z)))\big] \qquad (5)$$

The collaboration of the generator and the coupled discriminators is illustrated in Fig. 2. While a GAN with only the person discriminator $D_p$ forces the generator to produce source-like person images, incorporating the loss of the target discriminator $D_t$ further guides $G$ to produce very target-like imposter images. These target-like imposters, generated based on the discriminating ability of the feature extractor, are suited to attacking that feature extractor. Examples of images generated by APN are shown in Fig. 3, together with the target images and the images generated by controlled groups (APN without the target discriminator $D_t$ and APN without the person discriminator $D_p$), indicating that our network indeed has the ability to generate target-like images. The generator is trained to fool the target discriminator in the feature space, so that the generated adversarial images can attack the re-id system, while the target discriminator mainly learns to tell these attacks apart from the target people so as to defend the re-id system.
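The corresponding generator update can be sketched as follows; the non-saturating form and the unweighted sum of the two fooling terms are our assumptions, as the text does not specify a balancing weight:

```python
import torch
import torch.nn.functional as F

def generator_loss(D_p, D_t, G, z):
    """G tries to make its samples look both pedestrian-like (fool D_p)
    and target-like (fool D_t), matching the coupled objective above."""
    fake = G(z)
    s_p = D_p(fake)                           # want D_p(G(z)) -> 1
    s_t = D_t(fake)                           # want D_t(G(z)) -> 1
    return F.binary_cross_entropy(s_p, torch.ones_like(s_p)) + \
           F.binary_cross_entropy(s_t, torch.ones_like(s_t))
```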

Figure 3: Examples of generated images. Although the images produced by the generator are based on random noise, we can see that the imposters generated by APN are very similar to targets. The similarities are mostly based on clothes, colors and postures (e.g. the fifth column). Moreover, surroundings are also learned by APN, as shown in the seventh column in the red circle.

3.3 Joint Learning of Feature Representation and Adversarial Modelling

We finally aim to learn robust person features that are tolerant to imposter attacks for open-world group-based person re-id. To further utilize the generated person images to enhance performance, we jointly learn the feature representation and the adversarial modelling in a semi-supervised way.

Although the generated images look similar to target images, they are regarded as imposter samples, and we wish to incorporate these unlabeled generated samples into training. Inspired by the label smoothing regularization [43], we modify LSRO [43] to make it more suitable for group-based verification by setting the probability that an unlabeled generated imposter sample belongs to an existing known class $c$ as follows:

$$q_{\mathrm{LSRI}}(c) = \begin{cases} 0, & c \in \mathcal{Y}^t \\ 1/C^{nt}, & c \in \mathcal{Y}^{nt} \end{cases} \qquad (6)$$

where $C^{nt}$ is the number of non-target training classes. Compared to LSRO, we do not allocate a uniform distribution over all classes (both target and non-target ones) to each unlabeled sample, but only a uniform distribution over the non-target classes. This matters because we attempt to separate imposter samples from target classes; the modification is exactly a defense against the attack of imposter samples. Under this regularization, the generated imposters tend to stay far away from target classes and have equal chances of being any non-target person. We call the modified regularization in Eq. (6) label smoothing regularization for imposters (LSRI).

Hence, for each input sample $x$, we set its ground-truth class distribution as:

$$q(c) = \begin{cases} \mathbb{1}[c = y], & x \text{ is a real image with label } y \\ q_{\mathrm{LSRI}}(c), & x \in \mathcal{G} \end{cases} \qquad (7)$$

where $y$ is the corresponding label of a real image $x$, and $\mathcal{G}$ denotes the set of generated imposter samples (with $g_k$ the $k$th generated image). With Eq. (7), we can learn from the generated samples together with our feature extractor (i.e. the weights $W$). By such joint learning, the feature learning part becomes more discriminative between target images and target-like imposter images.
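A small sketch of how the LSRI targets of Eqs. (6)-(7) can be materialized and plugged into a cross-entropy loss; the class indexing convention and function names are our assumptions:

```python
import torch

def lsri_targets(labels, n_target, n_nontarget, is_generated):
    """Ground-truth class distributions under LSRI. Classes [0, n_target)
    are target classes, the rest are non-target. Real images get one-hot
    labels; generated imposters get zero mass on all target classes and
    uniform mass 1/n_nontarget on the non-target ones."""
    q = torch.zeros(len(labels), n_target + n_nontarget)
    for i, (y, gen) in enumerate(zip(labels, is_generated)):
        if gen:
            q[i, n_target:] = 1.0 / n_nontarget  # uniform over non-target
        else:
            q[i, y] = 1.0                        # one-hot for real images
    return q

def lsri_loss(logits, q):
    # Cross-entropy between predicted distribution and the LSRI targets.
    logp = torch.log_softmax(logits, dim=1)
    return -(q * logp).sum(dim=1).mean()
```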

3.4 Network Structure

We now detail the network structure. As shown in Fig. 2, our network consists of two parts: 1) learning a robust feature representation, and 2) learning to attack by adversarial networks. For the first part, we train the feature extractor on the source datasets and the generated attacking samples; here, features are trained to be robust and resistant to imposter samples, and LSRI is applied to differentiate imposters from target people. A fully-connected layer is attached to the feature extractor at this stage, which we call the feature fc layer. For the second part, as shown in Fig. 2, our learning-to-attack adversarial network is a modification of DCGAN [30]: we combine the modified DCGAN with the coupled discriminators to form an adversarial network. The generator is modified to produce target-like imposters specifically as an attacker, and the target discriminator defends by discriminating target from non-target people. In this discriminator, a new layer is attached to the tail of the feature extractor $\phi$; we denote it $f_t$, also called the target fc layer, used to discriminate target from non-target images during the adversarial learning. By Eq. (2), $D_t$ is the combination of $\phi$ and the target fc layer $f_t$.
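The wiring described above can be sketched as follows, assuming a torchvision ResNet-50 backbone; the class and attribute names are ours, and the single-logit sigmoid head for $f_t$ is an assumption:

```python
import torch.nn as nn
from torchvision.models import resnet50

class APNHeads(nn.Module):
    """Sketch of the APN heads: one shared ResNet-50 feature extractor phi,
    a feature fc layer f_cls for identity classification (trained with
    LSRI), and a target fc layer f_t that, composed with phi, forms the
    target discriminator D_t = f_t(phi(x))."""
    def __init__(self, n_classes):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()               # keep the 2048-d features
        self.phi = backbone                        # shared feature extractor
        self.f_cls = nn.Linear(2048, n_classes)    # feature fc layer
        self.f_t = nn.Sequential(nn.Linear(2048, 1), nn.Sigmoid())  # target fc

    def forward(self, x):
        feat = self.phi(x)
        return self.f_cls(feat), self.f_t(feat)    # class logits, D_t score
```

Because `self.phi` is the single shared module, gradients from the LSRI classification loss update the same weights $W$ that $D_t$ reads, which is exactly the weight-sharing design of Fig. 2.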

4 Experiments¹

¹We correct a bug in our code and conduct a better finetune using new implementations in PyTorch for all deep learning methods, including ours and the compared ones. Our model is still effective and overall the best. Please refer to the new results.

4.1 Group-based Verification Setting

We followed the criterion defined in [42] for evaluating open-world group-based person re-id. How well a true target can be verified correctly and how badly a non-target can be incorrectly verified as a target are indicated by the true target rate (TTR) and the false target rate (FTR), defined as follows:

$$\mathrm{TTR} = \frac{\#\mathcal{TTQ}}{\#\mathcal{TQ}}, \qquad \mathrm{FTR} = \frac{\#\mathcal{FNTQ}}{\#\mathcal{NTQ}} \qquad (8)$$

where $\mathcal{TQ}$ is the set of query target images from target people, $\mathcal{NTQ}$ is the set of query non-target images from non-target people, $\mathcal{TTQ}$ is the set of query target images that are verified as target people, and $\mathcal{FNTQ}$ is the set of query non-target images that are verified as target people.

To obtain TTR and FTR, we follow two steps: 1) for each target person, there is a set of images (single-shot or multi-shot) in the gallery set; given a query sample $x$, the distance between $x$ and a set is the minimal distance between $x$ and any target sample of that set; 2) whether a query image is verified as a target person is determined by comparing this distance to a threshold $\tau$. By varying the threshold $\tau$, a set of TTR and FTR values can be obtained. A higher TTR value is preferred when FTR is small.
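A minimal sketch of these two steps plus the TTR/FTR computation of Eq. (8); the Euclidean metric and function names are our assumptions:

```python
import torch

def min_gallery_distance(q_feats, gallery_feats):
    # Step 1: the distance between a query and a gallery set is the
    # minimal distance between the query and any sample of that set.
    return torch.cdist(q_feats, gallery_feats).min(dim=1).values

def ttr_ftr(target_d, nontarget_d, tau):
    # Step 2 + Eq. (8): verify a query as target if its distance < tau.
    ttr = (target_d < tau).float().mean().item()      # #TTQ / #TQ
    ftr = (nontarget_d < tau).float().mean().item()   # #FNTQ / #NTQ
    return ttr, ftr
```

Sweeping `tau` then traces out the TTR-versus-FTR operating points reported in the tables below.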

In our experiments, we conducted the two kinds of verification defined in [42], namely Set Verification (whether a query is one of the persons in the target set, where the target set contains all target people) and Individual Verification (whether a query is the true target person; for each target query image, the target set contains only that person). Set Verification is the more difficult of the two: although determining whether a person image belongs to a group of target people may seem easier, it also gives imposters more chances to cheat the classifier, producing more false matches [42]. Also note that it is more important to compare TTR values when FTR is low, since more true matches are expected when false matches are few.

4.2 Datasets & Settings

We evaluated our method on three datasets: Market-1501 [39], CUHK01 [19], and CUHK03 [21]. For each dataset, we randomly selected 1% of people as target people and the rest as non-target. Similar to [42], for target people, we split the images of each target person into training and testing sets by half. Since only four images are available per person in CUHK01, we chose one for training, two for the gallery (reduced to one in the single-shot case) and one for the probe. Our division guaranteed that probe and gallery images come from different cameras for each person. Non-target people were divided into training and testing sets by half at the person/class level to ensure no overlap in identity. In the testing phase, two images of each target person in the testing set were randomly selected to form the gallery set, and the remaining images formed the query set. In the default setting, all images of non-target people in the testing set were put into the query set. The data split was kept the same for all evaluations of our method and the compared methods. Specifically, the data split is summarized below:


CUHK01

CUHK01 contains 3,884 images of 971 identities from two camera views. In our experiment, 9 people were marked as target, and 1,888 images of 472 people were selected to form the non-target training set. The testing set of non-target people contains 1,960 images of 490 people.

CUHK03

CUHK03 is larger than CUHK01, and some of its images were automatically detected. A total of 1,467 identities were divided into 14 target people, 712 training non-target people and 741 testing non-target people. The numbers of training and testing non-target images are 6,829 and 7,129 respectively.

Market-1501

Market-1501 is a large-scale dataset containing a total of 32,668 images of 1,501 identities. We randomly selected 15 people as target and 728 people as non-target to form the training set, containing a total of 17,316 images; the testing non-target set contains 758 identities with 14,573 images.

Under the above settings, we evaluated our model together with selected popular re-id models. When evaluating metric learning methods such as t-LRDC [42], XICE [46], XQDA [22] and CRAFT [6], we used the traditional hand-crafted features suggested in their original papers³.

³We applied these metric learning methods to features extracted by ResNet-50 in the previous version to show our improvement on ResNet-50, but we found that, without combining the whole process into end-to-end learning, the performance of metric learning methods based on ResNet-50 features decreases largely.

4.3 Implementation Details

In our APN, we used ResNet-50 [12] as the feature extractor in the target discriminator. The generator and person discriminator are based on DCGAN [30]. In the first step of our procedure, we pre-trained the feature extractor on auxiliary datasets: 3DPeS [2], iLIDS [35], PRID2011 [13] and Shinpuhkan [15]. These datasets were only used in the pre-training stage of the feature extractor. In pre-training, we used stochastic gradient descent with momentum 0.9; the learning rate started at 0.1 and was multiplied by 0.1 every 10 epochs. Then, the adversarial part of APN was trained using the ADAM optimizer [16]. Using the target dataset for evaluation, the person discriminator $D_p$ and generator $G$ were pre-trained for 20 epochs. Then the target discriminator $D_t$, the person discriminator $D_p$ and the generator $G$ were trained jointly, where $G$ is optimized twice in each iteration to prevent the discriminator losses from going to zero. Finally, the feature extractor was trained again with a lower learning rate starting from 0.001 and multiplied by 0.1 every 10 epochs. The above procedure was executed repeatedly as an adversarial process.
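A condensed sketch of this schedule, assuming the modules defined earlier; hyperparameters not stated in the text (e.g. the ADAM betas and the joint-training epoch counts) are left at library defaults:

```python
import torch

def build_optimizers(phi, G, D_p, f_t):
    """Optimizers following the schedule described above. Values the text
    does not state (ADAM betas, epoch counts) use library defaults."""
    # Feature-extractor pre-training: SGD, momentum 0.9, lr 0.1, x0.1 / 10 epochs.
    feat_opt = torch.optim.SGD(phi.parameters(), lr=0.1, momentum=0.9)
    feat_sched = torch.optim.lr_scheduler.StepLR(feat_opt, step_size=10, gamma=0.1)
    # Adversarial part: ADAM for G, D_p, and the target fc layer f_t
    # (D_t trains only f_t; the shared feature weights W stay frozen here).
    g_opt = torch.optim.Adam(G.parameters())
    dp_opt = torch.optim.Adam(D_p.parameters())
    dt_opt = torch.optim.Adam(f_t.parameters())
    # Feature-extractor re-training: lower lr 0.001, same decay policy.
    ft_opt = torch.optim.SGD(phi.parameters(), lr=0.001, momentum=0.9)
    ft_sched = torch.optim.lr_scheduler.StepLR(ft_opt, step_size=10, gamma=0.1)
    return feat_opt, feat_sched, g_opt, dp_opt, dt_opt, ft_opt, ft_sched
```

In each joint-training iteration the generator would then be stepped twice per discriminator step, matching the note above that this keeps the discriminator losses away from zero.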

Dataset Market-1501 CUHK01 CUHK03
FTR 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30%
Evaluation Set Verification
t-LRDC [42] 3.00 18.88 42.06 51.07 65.24 75.54 5.56 5.56 38.89 50.00 66.67 83.33 8.87 20.16 32.66 37.90 49.60 58.06
XICE [46] 6.77 21.69 45.17 58.88 73.68 81.80 11.11 33.33 44.44 55.56 72.22 83.33 3.14 13.03 31.65 44.36 59.70 69.98
GOG+XQDA [22] 0.43 2.15 9.01 17.17 28.76 39.06 0 5.56 33.33 38.89 66.67 72.22 10.48 16.53 27.02 36.69 53.63 60.48
LOMO+XQDA [22] 4.72 14.16 35.62 46.35 58.80 65.67 0 5.56 38.89 44.44 72.22 88.89 25.81 38.31 51.61 61.69 71.77 83.47
hiphop+CRAFT [6] 2.15 9.44 27.04 38.63 48.07 55.36 11.11 22.22 38.89 55.56 77.78 83.33 23.79 33.87 42.74 47.58 55.65 59.68
JSTL-DGD [38] 26.92 61.54 80.00 88.46 92.31 94.61 33.33 33.33 33.33 55.56 55.56 66.67 38.10 59.52 71.43 76.19 88.10 92.86
ResNet-50 [12] 34.62 80.00 93.85 96.92 98.46 99.23 44.44 55.56 55.56 55.56 77.78 77.78 61.90 73.81 90.48 95.24 95.24 95.24
DCGAN+LSRO [43] 36.15 78.46 94.62 96.15 99.23 99.23 44.44 55.56 55.56 55.56 55.56 77.78 64.29 71.43 88.10 90.48 92.86 95.24
DeepFool [28] 34.62 78.46 94.63 96.92 96.92 99.23 44.44 55.56 55.56 55.56 66.67 77.78 64.29 76.19 90.48 95.24 95.24 95.24
APN 43.85 82.31 96.92 98.46 99.23 100 55.56 55.56 55.56 66.67 77.78 77.78 66.67 78.57 92.86 95.24 95.24 95.24
Evaluation Individual Verification
t-LRDC [42] 15.54 39.89 51.44 68.49 78.10 87.63 15.23 32.15 51.82 67.56 73.54 89.13 16.57 37.40 48.98 58.83 70.76 90.17
XICE [46] 34.64 61.58 84.87 90.68 96.86 97.21 33.33 36.11 55.56 55.56 72.22 88.89 18.63 48.94 71.89 81.64 89.71 97.43
GOG+XQDA [22] 10.49 30.60 51.83 62.77 77.29 86.14 25.00 55.56 88.89 91.67 97.22 100 33.93 45.73 64.93 77.85 87.40 91.99
LOMO+XQDA [22] 25.32 59.10 81.98 86.96 92.96 94.84 5.56 36.11 80.56 88.89 88.89 88.89 40.72 57.79 77.78 86.90 94.44 96.03
hiphop+CRAFT [6] 31.75 62.59 84.09 91.42 93.55 96.30 50.00 72.22 100 100 100 100 42.39 63.27 77.38 89.58 95.68 99.18
JSTL-DGD [38] 47.23 63.85 86.92 93.73 93.73 97.53 33.33 48.15 59.26 72.84 72.84 82.72 53.74 78.18 81.67 92.15 92.15 94.87
ResNet-50 [12] 82.26 95.86 98.54 99.38 99.58 99.58 44.44 50.00 72.22 77.78 83.33 83.33 76.19 91.67 95.24 95.24 95.24 95.24
DCGAN+LSRO [43] 81.71 95.36 98.33 98.96 99.17 99.58 44.44 61.11 72.22 77.78 83.33 88.83 73.81 90.48 95.24 95.24 95.24 95.24
DeepFool [28] 82.26 95.86 95.86 98.96 99.17 99.58 44.44 61.11 72.22 77.78 83.33 83.33 75.61 91.67 95.24 95.24 95.24 95.24
APN 84.00 96.72 98.69 99.58 99.58 99.58 44.44 61.11 77.78 77.78 83.33 88.89 79.54 94.05 95.24 95.24 97.15 97.15
Table 1: Comparison with typical person re-identification: TTR (%) against FTR

4.4 Comparison with Open-world Re-id Methods

Open-world re-id is still understudied; t-LRDC [42] and XICE [46] are two representative existing methods designed for the open-world setting in person re-id. The results are reported in Table 1. Our APN outperformed t-LRDC and XICE in almost all cases, with an especially large margin on CUHK03. Compared to t-LRDC and XICE, our APN is an end-to-end learning framework that takes adversarial learning into account during feature learning, so that APN is more tolerant to attacks by samples of non-target people.

4.5 Comparison with Closed-world Re-id Methods

We compared our method with related popular re-id methods developed for closed-world person re-identification, mainly ResNet-50 [12], XQDA [22], CRAFT [6], DCGAN+LSRO [43], and JSTL-DGD [38], all evaluated under the same setting as our APN. As shown in Table 1, these approaches, optimized for the closed-world scenario, cannot adapt well to the open-world setting. On Market-1501 and CUHK03, the proposed APN achieved overall better performance, especially on Set Verification when FTR is 0.1%. On Market-1501, APN obtained a 7.7% higher matching rate than the second-place DCGAN+LSRO, which is also an application of GAN to the re-id problem, when FTR is 0.1% on Set Verification, and it also outperformed DCGAN+LSRO in all conditions on Individual Verification. Although deep learning methods may overfit on the small CUHK01 dataset, as seen in Individual Verification on CUHK01 where hiphop+CRAFT performed best, APN outperformed all other deep learning methods in all conditions on CUHK01. When FTR is 0.1% on CUHK01 Set Verification, APN gained an 11.12% higher matching rate than DCGAN+LSRO, and a 5.56% higher rate when FTR is 5% on Individual Verification. The compared closed-world models were designed under the assumption that the same identities appear in gallery and probe sets, so the relation between target and non-target people is not modelled, whereas our APN is designed for open-world group-based verification, discriminating target from non-target people.

Dataset Market-1501 CUHK01 CUHK03
FTR 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30%
Evaluation Set Verification
APN 43.85 82.31 96.92 98.46 99.23 100 55.56 55.56 55.56 66.67 77.78 77.78 66.67 78.57 92.86 95.24 95.24 95.24
APN w/o $D_p$ 36.92 79.23 93.85 97.69 99.23 99.23 44.44 55.56 55.56 55.56 55.56 66.67 61.90 73.81 88.10 90.48 92.86 95.24
APN w/o $D_t$ 35.38 78.46 93.85 95.38 99.23 99.23 44.44 55.56 55.56 55.56 55.56 66.67 61.90 71.43 88.10 92.86 95.24 95.24
APN w/o WS 28.46 65.38 84.62 89.23 96.15 96.15 44.44 55.56 55.56 55.56 55.56 66.67 57.14 73.81 83.33 90.48 92.86 92.86
No Imposter 34.62 80.00 93.85 96.92 98.46 99.23 44.44 55.56 55.56 55.56 77.78 77.78 61.90 73.81 90.48 95.24 95.24 95.24
Evaluation Individual Verification
APN 84.00 96.72 98.69 99.58 99.58 99.58 44.44 61.11 77.78 77.78 83.33 88.89 79.54 94.05 95.24 95.24 97.15 97.15
APN w/o $D_p$ 82.92 95.40 98.13 98.75 99.58 99.58 44.44 55.56 77.78 77.78 83.33 83.33 73.81 90.48 95.24 95.24 95.24 95.24
APN w/o $D_t$ 82.26 95.06 98.13 98.75 99.58 99.58 44.44 50.00 77.78 77.78 83.33 83.33 72.62 90.48 91.67 95.24 95.24 95.24
APN w/o WS 72.67 91.11 97.36 97.36 97.78 97.78 44.44 61.11 77.78 77.78 83.33 83.33 69.05 84.52 94.05 95.24 95.24 95.24
No Imposter 82.26 95.86 98.54 99.38 99.58 99.58 44.44 50.00 72.22 77.78 83.33 83.33 76.19 91.67 95.24 95.24 95.24 95.24
Table 2: Different generated imposter sources

4.6 Comparison with Related Adversarial Generation

We compared our model with a fine-tuned ResNet-50 trained with adversarial samples generated by DeepFool [28], which is likewise a method using extra generated samples. DeepFool produces adversarial samples that fool the network by adding noise computed from gradients. As shown in Table 1, our APN performed much better than DeepFool in all conditions. DeepFool cannot adapt well to open-world re-id because its adversarial samples are produced separately from the classifier learning, so the relation between the generated samples and the target set is not modelled for group-based verification, whereas our APN aims to generate target-like samples so that adversarial learning facilitates learning better features.

Method APN ResNet-50
FTR 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30%
Dataset Market-1501
single-shot 23.85 66.15 86.92 94.62 97.69 99.23 20.77 65.38 86.15 90.77 95.38 97.69
multi-shot 43.85 82.31 96.92 98.46 99.23 100 34.62 80.00 93.85 96.92 98.46 99.23
Dataset CUHK01
single-shot 33.33 33.33 55.56 66.67 77.78 77.78 33.33 33.33 44.44 55.56 55.56 77.78
multi-shot 55.56 55.56 55.56 66.67 77.78 77.78 44.44 55.56 55.56 55.56 77.78 77.78
Dataset CUHK03
single-shot 59.52 73.81 88.10 95.24 95.24 95.24 57.14 71.43 88.10 90.48 95.24 95.24
multi-shot 66.67 78.57 92.86 95.24 95.24 95.24 61.90 73.81 90.48 95.24 95.24 95.24
Table 3: Number of shots on Set Verification
Method APN ResNet-50
FTR 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30%
Dataset Market-1501
single-shot 80.04 98.08 99.17 100 100 100 80.00 95.03 99.17 99.17 100 100
multi-shot 84.00 96.72 98.69 99.58 99.58 99.58 82.26 95.86 98.54 99.38 99.58 99.58
Dataset CUHK01
single-shot 33.33 66.67 88.89 88.89 88.89 88.89 33.33 55.56 88.89 88.89 88.89 88.89
multi-shot 44.44 61.11 77.78 77.78 83.33 88.89 44.44 50.00 72.22 77.78 83.33 83.33
Dataset CUHK03
single-shot 78.57 92.86 95.24 95.24 95.24 95.24 76.19 88.10 95.24 95.24 95.24 95.24
multi-shot 79.54 94.05 95.24 95.24 97.15 97.15 76.19 91.67 95.24 95.24 95.24 95.24
Table 4: Number of shots on Individual Verification
Method APN ResNet-50
FTR 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30%
TP. 0.5% 95.08 100 100 100 100 100 91.80 100 100 100 100 100
TP. 1% 43.85 82.31 96.92 98.46 99.23 100 34.62 80.00 93.85 96.92 98.46 99.23
TP. 3% 41.95 70.98 86.02 91.56 95.51 96.31 37.20 63.32 84.96 90.77 93.93 95.51
TP. 5% 38.82 65.23 85.86 91.78 96.38 97.70 35.53 63.32 82.89 89.97 94.74 96.86
Table 5: Different target proportions of Market-1501 on Set Verification (TP. stands for Target Proportion)
Evaluation Set Verification Individual Verification
FTR 0.1% 1% 5% 10% 20% 30% 0.1% 1% 5% 10% 20% 30%
Dataset Market-1501
LSRI 43.85 82.31 96.92 98.46 99.23 100 84.00 96.72 98.69 99.58 99.58 99.58
LSRO [43] 40.00 80.00 93.85 96.92 98.46 99.23 82.00 95.36 98.33 98.75 99.58 99.58
Dataset CUHK01
LSRI 55.56 55.56 55.56 66.67 77.78 77.78 44.44 61.11 77.78 77.78 83.33 88.89
LSRO [43] 44.44 55.56 55.56 55.56 55.56 66.67 44.44 55.56 77.78 77.78 83.33 83.33
Dataset CUHK03
LSRI 66.67 78.57 92.86 95.24 95.24 95.24 79.54 94.05 95.24 95.24 97.15 97.15
LSRO [43] 64.29 73.81 88.10 90.48 92.86 95.24 75.00 91.67 95.24 95.24 95.24 95.24
Table 6: LSRI vs. LSRO

4.7 Further Evaluation of Our Method


Effect of Generated Imposters.

Training with the imposters generated by APN achieves a large improvement compared to training without them, because these imposters are target-like and improve the discriminating ability of the features. The results of the baseline without any generated imposters are shown in the rows labelled "No Imposter" in Table 2. In detail, on Set Verification at FTR = 0.1%, APN achieved a 9.23% higher matching rate on Market-1501, 11.12% higher on CUHK01 and 4.77% higher on CUHK03. On Individual Verification, APN also performs better in all cases.

Effect of Weight Sharing.

The weight sharing between the target discriminator and the feature extractor ensures that the generator can learn from the feature extractor and generate more target-like attack samples; without the sharing, there is no connection between generation and feature extraction. Taking Set Verification on Market-1501 for instance, our result degrades from 82.31% to 65.38% without sharing (the row "APN w/o WS") when FTR = 1% in Table 2.

Effect of Person Discriminator and Target Discriminator.

Our APN is built on a GAN consisting of the generator $G$, the person discriminator $D_p$ and the target discriminator $D_t$. To evaluate them further, we compared APN without the person discriminator (APN w/o $D_p$) and APN without the target discriminator (APN w/o $D_t$). APN without the target discriminator can be regarded as two independent components, a DCGAN and a feature extraction network. For a fair comparison, LSRI was applied to the generated samples in these variants as in APN. The results are reported in Table 2. Our full APN is clearly the most effective among the compared cases; generating imposters without the person discriminator or the target discriminator sometimes even degrades performance compared to using no imposters at all. When the target discriminator is discarded, person-like images can still be generated, but they are not similar to target people and thus pose no serious attack on the features for group-based verification. Without the person discriminator, the generator even fails to generate person-like images (see Fig. 3), so performance is also degraded in most cases. This indicates that the person discriminator plays an important role in generating person-like images, while the target discriminator is essential for helping the generator produce better target-like imposters, so that the feature extractor can benefit more from distinguishing these imposters.

LSRI vs LSRO.

We verified that our modification of LSRO, namely LSRI in Eq. (6), is more suitable for optimizing open-world re-id. A comparison of our LSRI with the original LSRO is reported in Table 6. It shows that, under the same FTR, the feature extractor discriminates target people more reliably with LSRI. This supports that LSRI is more appropriate for the open-world re-id scenario: the imposters are allocated equal probabilities of being non-target people only, in line with group-based verification modelling, so they are pushed farther away from target person samples, leading to a more discriminative feature representation for target people; in LSRO, by contrast, the imposters are allocated equal probabilities of being target as well as non-target.

Effect of target proportion.

The evaluation results for different target proportions are reported in Table 5. We varied the percentage of people marked as target. This evaluation was conducted on Market-1501, with the original ResNet-50 for comparison. While TTR declines as the target proportion grows, due to more target people to verify, our APN still outperformed the original ResNet-50 in all cases.

Effect of the Number of Shots.

The performance under multi-shot and single-shot settings was also compared in our experiments. For the multi-shot setting, we randomly selected two images of each target person as the gallery set; for the single-shot setting, we selected only one. As shown in Table 3 and Table 4, in both settings our APN outperformed ResNet-50 in all conditions on Market-1501, CUHK01, and CUHK03. In particular, on Set Verification for CUHK01, when FTR is 10%, APN outperformed ResNet-50 by 11.11% under both single-shot and multi-shot settings.

5 Conclusion

For the first time, we have demonstrated how adversarial learning can be used to solve the open-world group-based person re-id problem. The proposed adversarial person re-id enables mutually related and cooperative progress among learning to represent, learning to generate, learning to attack, and learning to defend. This adversarial modelling is further improved by a label smoothing regularization for imposters under semi-supervised learning.

Acknowledgment

This work was supported partially by the National Key Research and Development Program of China (2016YFB1001002), the NSFC (61522115, 61472456), the Guangdong Programme (2016TX03X157), and the Royal Society Newton Advanced Fellowship (NA150459). The corresponding author of this paper is Wei-Shi Zheng.

References