Person re-identification (re-ID) is a challenging problem that specialises in matching pedestrians across a network of cameras. It remains unsolved principally because of the significant visual changes caused by colour, background, camera viewpoint and human pose. Recent state-of-the-art methods are built on supervised deep neural networks [1, 2, 3, 4, 5, 6, 7, 8, 9] to learn representations that are robust and discriminative against visual variations. However, training deep architectures requires a large number of labeled image pairs across multiple camera views, which is prohibitively expensive and does not scale to real-world scenarios. To combat that challenge, a number of semi-supervised and unsupervised methods have been developed [10, 11, 12, 13, 14, 15, 16, 17]. Some of them seek feature invariance by designing robust hand-crafted features [10, 12, 11]. However, without the supervision of labeled data, the discrimination and specificity tied to camera-pair changes are not captured. Also, unsupervised methods treat samples from different views indiscriminately, and the effect of view-specific inference is not considered. On the other hand, some unsupervised methods introduce graph structures or clustering centroids [17, 14, 15, 16, 18] to keep visually similar people close in the projected space. Nonetheless, this is insufficient to explore the discriminative space, because the view-specific projections into a shared subspace are optimised independently.
Matching pedestrian snapshots across camera views (probe and gallery) can be achieved by seeking a common subspace and jointly optimising a measure for each pair of cross-view images. Siamese networks with deep convolutions have shown promise in person re-ID [3, 19, 20] by learning a set of nonlinear transformations that align the correlation of layer activations in deep neural networks. However, Siamese networks impose layer-wise equality constraints on deep layered representations, commonly realised within convolutional networks through weight sharing. The idea is to enforce exact consistency between the probe and the gallery mappings, and learning symmetric transformations reduces the number of parameters in the deep model. Unfortunately, this may render the optimisation poorly conditioned because the same network must handle images from two disjoint distributions.
This paper approaches person re-ID through deep view adaptation, in the sense that the non-linear transformations that map paired observations into a common feature space should be asymmetric. This asymmetric architecture is necessary to characterise view-specific characteristics, and the optimisation of the asymmetric mappings should be conditioned on each other to capture the identity interplay between the probe and gallery views. Also, a re-ID system processes paired images and requires a comparable metric to determine the similarity of each pair. To meet these requirements, we present a deep feature learning approach that optimises a feature space in which the invariance between the probe and gallery distributions is maximal (cross-view invariance), and we jointly learn similarity metrics for paired images. Our approach is appealing in its ability to learn asymmetric mappings that characterise cross-view images without enforcing any sharing constraints. The view discrepancy is minimised by adversarial learning with entropy regularisation, which performs cross-entropy minimisation over cross-view samples. To jointly learn a comparable metric, the adversarial framework is equipped with a discriminator network that distinguishes positive pairs from negative ones. More importantly, unlike existing deep learning methods [5, 19, 21], the network training does not require a large number of training samples, because we introduce batch-based adaptive weights on the positive/negative pairs that emphasise the most difficult ones.
Our framework differs from studies on domain adaptation for person re-ID [22, 1, 23, 24]. First, that line of work typically reuses models pre-trained on a closely related dataset with a large number of samples (source), and then designs the training towards the much smaller dataset of interest (target). Here instead, we are interested in adapting adversarial learning to cross-view invariant feature learning, i.e., adversarial view adaptation, to effectively address the view discrepancy in re-identifying persons. Secondly, a common problem of existing domain adaptation approaches is that a principled alignment between the source and target is missing, and thus they are unable to penalise the correlated domain misalignment in practical terms. In contrast, our method explicitly minimises the view discrepancy through the proposed view-adversarial objective. Our method is also distinct from existing methods based on adversarial losses. For instance, SPGAN uses a GAN loss to update the target domain with respect to the source, whereas our method uses a cross-entropy loss to optimise the view confusion objective.
I-B Our Approach and Contributions
Our approach builds on a view adaptation scheme that learns asymmetric deep neural transformations to map view-specific distributions into a common feature space. To this end, we introduce adversarial learning into a view discriminator, which is optimised through a cross-entropy based view confusion objective. This objective confuses the view discriminator so that it perceives the two distributions identically, thereby minimising the cross-view discrepancy. Specifically, we develop an adaptive learning framework that produces asymmetric mappings over two views through view-adversarial training. When the view discriminator cannot determine whether a sample comes from the probe or the gallery view, the feature inference becomes optimal in terms of creating a view-invariant space. In this adversarial learning regime, view adaptation can be seen as a generative adversarial network in which it is not necessary to generate samples. Instead, the discriminator is confused by the view confusion objective and cannot tell whether samples come from the probe or the gallery distribution once the network is optimal in creating a latent space conditioned on the two views.
To address similarity learning, we additionally enforce semantic similarity by learning a distance metric jointly with the features. This similarity is maintained in the discriminative base model of the adversarial networks by a contrastive loss (i.e., the similarity discriminator), which pulls the images in positive pairs closer while pushing the images in negative pairs apart. Thus, our network is end-to-end trainable: it processes paired samples and outputs a similarity value that determines whether the pair shows the same identity. The overview of our framework is shown in Fig. 1. However, training on paired inputs raises an imbalance between within-identity and between-identity samples. Hence, we introduce adaptive weighting that emphasises the most difficult positive/negative pairs, which leads to an optimised re-ID rank loss and quick convergence.
The major contributions of this paper can be summarised as follows.
We propose a principled adversarial feature learning approach to person re-ID to jointly produce a latent view-invariant feature space and its corresponding distance metric which maintains high discrimination on positive pairs from negatives.
Our method is differentiated from the literature in conceptualising cross-view matching through asymmetric mappings followed by explicit view adaptation. This is achieved by presenting view-adversarial learning to train cross-view embedding whilst confusing a view discriminator in a cross-entropy objective function.
We provide insights into adaptive weighting, which assigns larger weights to difficult samples so that the positive/negative class imbalance is effectively addressed.
Extensive validations of the proposed method against the state-of-the-art are performed to demonstrate the competence of our model.
II Related Work
II-A Person Re-identification
Most existing re-ID models are developed in a supervised manner to learn discriminative features [2, 7, 5, 27, 28] or distance metrics [29, 8, 21]. However, these models commonly rely on substantial labeled training data, which hinders their application to large camera networks. Semi-supervised and unsupervised methods have been presented to overcome this scalability issue by using a limited number of labeled samples or no label information at all. These techniques often focus on designing handcrafted features (e.g., colour, texture) [10, 11, 12] that should be robust to visual changes in imaging conditions. However, low-level features are not expressive in terms of view invariance because they are not learned to adapt to view-specific bias. On the other hand, transfer learning has been applied to re-ID [30, 31, 13, 15, 10]; these methods learn a model on large labeled datasets and then transfer the discriminative knowledge to the unlabelled target pairs. For example, they can learn a cross-view metric either by asymmetric clustering on person images or by a transferable colour metric from a single pair of images. However, there is still a considerable performance gap relative to supervised learning approaches, because these methods lack a principled way to fully explore the discriminative space in the context of source-to-target image translation. In contrast to existing approaches that derive a metric independently from images of people, we learn a deep metric jointly with feature learning from few labeled training pairs in an adversarial manner.
II-B Generative Adversarial Networks
Generative Adversarial Networks (GANs) consist of a generator and a discriminator that compete during learning: the generator learns to map samples from a latent distribution to samples close to real data so as to confuse the discriminator, while the discriminator tries to distinguish between real and generated samples. The most popular variation of GAN is the Deep Convolutional GAN (DCGAN) introduced by Radford et al. DCGAN improved the overall quality of generated images by adapting the Convolutional Neural Network (CNN) into the GAN architecture. Since then, GANs have been extensively studied and widely used in several applications including realistic image generation, image-to-image translation, domain adaptation, and cross-modal retrieval.
Recently, GANs were adopted in the person re-ID community by Zhong et al., who introduce a semi-supervised pipeline that integrates GAN-generated samples into CNN learning. Following up on that work, Deng et al. present an unsupervised domain adaptation method (SPGAN) that preserves similarity after translation and then trains re-ID models on the translated images using supervised feature learning methods. In contrast to such sample-generating pipelines, we do not focus on generating samples for person re-ID. Instead, we formulate adversarial networks into the feature learning process to produce a view-invariant subspace and to jointly learn a similarity metric. Our method also differs from SPGAN in that we learn asymmetric transformations with respect to the view discrepancy rather than adapting a target domain to match the source domain.
II-C Adversarial Adaptation Methods
Deep convolutional neural networks trained on large-scale datasets can learn representations that are generically useful across different tasks and visual domains [37, 38]. However, due to domain shift/bias, generalising well-trained recognition models to novel tasks typically requires fine-tuning these networks. Because it is difficult to obtain enough labelled data to properly fine-tune the large number of parameters in deep networks, recent deep adaptation methods attempt to mitigate the difficulty by learning deep neural transformations that map both domains into a common feature space. This can generally be achieved by optimising the representations to align the source and target sets [39, 40, 41]. For instance, several methods use the maximum mean discrepancy loss to measure the difference between the source and the target feature distributions [39, 42, 43]. Inspired by the idea of adapting higher-order statistics of the two distributions [44, 45, 46, 47], some methods propose a transformation that minimises the distance between the covariance representations of the source and target datasets to ultimately achieve correlation alignment [40, 41]. These approaches are unsupervised domain adaptation methods that do not need any target data labels, but they require large amounts of target training samples, which may not always be available. Also, semantic alignment of classes is difficult without a shared feature space, which can be sought by creating positive and negative pairs from the source and target data [48, 49, 50, 51, 52].
Our framework is closely related to adversarial adaptive methods [53, 54, 55], particularly in the use of a view confusion loss (i.e., a cross-entropy loss). However, these works on domain adaptation address unlabelled target domains, and their ultimate goal is to regulate the learning of the source and target mappings so as to minimise the distance between the empirical mapping distributions. They choose an adversarial loss to minimise domain shift, learning a representation that is discriminative of source labels while being unable to distinguish between domains. Our method is not designed to match the target distribution to the source through an adversarial loss. Instead, we allow individual mappings, which are not forced to share weights or satisfy any consistency constraint, to characterise view shifts, and the adaptation is achieved through a view-adversarial loss.
III Our Approach
III-A Problem Formulation
We consider a training task in which the input space contains person images captured by two disjoint cameras, namely the probe view (source) and the gallery view (target). Specifically, the model is trained on labeled pairs in correspondence, where the two images of a pair are examples of the same person across camera views. To address the view variance, we formulate the problem in an adversarial adaptive manner: the main goal is to regularise the learning of the source and target mappings so as to minimise the distance between the empirical source and target mapping distributions. Under this setting, the similarity discriminator learns to determine directly whether an input pair belongs to the same person, eliminating the cross-view variance.
Standard generative adversarial learning pits two networks against each other: a discriminator and a generator. The generator is in principle trained to produce images in a way that confuses the discriminator, which in turn tries to distinguish them from real image examples. In our case of cross-view adaptation for matching persons, this principle is employed to ensure that the network cannot distinguish between the distributions of its probe view and gallery view [56, 53, 55]. In other words, a view discriminator is adopted to classify whether an example is from the source or the target view. However, a generator is not needed in our network because generative modelling of the input image distributions is unnecessary: the ultimate task is to learn discriminative representations regarding identities. Moreover, asymmetric mappings can model the differences between camera views better than symmetric ones. Therefore, we first learn a pair of asymmetric mappings conditioned on each other through view-adversarial training to produce a view-invariant feature space. Then, a similarity estimator with margin-based separability is optimised on the Euclidean distances of positive/negative pairs to learn effective similarity metrics.
III-B Discriminator Networks
In our full adversarial adaptation framework, we have a view discriminator, which classifies whether a data point is drawn from the probe or the gallery domain. The view discriminator can thus be optimised with a supervised loss, Eq.(1), in which the label indicates the origin domain and the inputs pass through individual probe and gallery mappings. The two mappings are both parameterised in the supervised training with their asymmetric structures. This strategy differs from existing discriminative domain adaptation approaches, which generally consider a separate adaptation: the probe mapping is first learned through supervised losses, and the target mapping is then initialised and adapted to the probe. By contrast, we aim to minimise the distance between the probe and gallery domains under their respective mappings, while crucially keeping both mappings semantically discriminative. To effectively minimise the view discrepancy, we design the view-adversarial mapping loss (as defined in Eq.(3)), which suits the case where we initially use independent mappings and the gallery mapping is then updated adversarially to match the probe mapping.
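The supervised view-discriminator loss described above can be illustrated with a minimal sketch. It assumes a standard binary cross-entropy in which the discriminator outputs the probability that a feature comes from the probe view; the function name, the label convention (probe as the positive class) and the `eps` guard are illustrative choices, not taken from the paper.

```python
import math

def view_discriminator_loss(probe_scores, gallery_scores, eps=1e-12):
    """Binary cross-entropy for a view discriminator (illustrative sketch).

    probe_scores / gallery_scores: the discriminator's predicted probability
    of "probe view" for mapped probe and gallery samples, respectively.
    Assumed label convention: probe = 1, gallery = 0.
    """
    # Probe samples should be scored close to 1.
    loss_probe = -sum(math.log(p + eps) for p in probe_scores) / len(probe_scores)
    # Gallery samples should be scored close to 0.
    loss_gallery = -sum(math.log(1.0 - p + eps) for p in gallery_scores) / len(gallery_scores)
    return loss_probe + loss_gallery
```

Under this convention the loss is small when the discriminator separates the two views well, and equals roughly 2 log 2 when it outputs 0.5 for every sample.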
An effective re-ID system requires a metric to estimate the similarity of paired pedestrian snapshots. This amounts to learning discriminative representations that are able to distinguish positive pairs from negative ones. Thus, to equip the view-adaptation framework with discriminative capability, we propose to optimise a view-invariant feature space in which data examples with the same identity are closer than those with different identities. In this work, we perform end-to-end training on paired images from their RGB values and optimise the view discrepancy jointly with the similarity metric. As a result, we can estimate similarity values for persons simply by computing the Euclidean distances between their embeddings.
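As a small illustration of matching by Euclidean distance in the learned space, the following sketch ranks gallery embeddings for one probe embedding. The function names and the plain-list representation of embeddings are assumptions made for this example only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_gallery(probe_embedding, gallery_embeddings):
    """Return gallery indices ordered from smallest to largest distance."""
    distances = [euclidean(probe_embedding, g) for g in gallery_embeddings]
    return sorted(range(len(gallery_embeddings)), key=lambda i: distances[i])
```

The first index in the returned list is the gallery image judged most similar to the probe.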
To generate the embedding for each pair (we omit the superscript to simplify the notation) and the corresponding similarity metric, we adopt a similarity discriminator network that maps semantically similar examples onto metrically close points while mapping semantically different examples onto metrically distant points in the embedding space. Hence, we formulate the similarity discriminator to minimise a contrastive loss, Eq.(2). The loss involves a binary label assigned to each pair, which marks the pair as positive or negative, and the number of identities in training. Its distance term is the Euclidean distance between two input vectors. A margin defines the separability in the embedding space, and a weighting parameter controls the relative importance of the two loss terms. Both are set empirically in our experiments (see the empirical evaluations in Section IV). The scheme of the similarity discriminator is illustrated in Fig.3.
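To make the margin-based similarity loss concrete, here is a minimal sketch of a standard contrastive loss. It is not the paper's exact Eq.(2): the labels 1/0 for positive/negative pairs, the placement of the weighting parameter `alpha` on the negative term, averaging over pairs, and the default values of `margin` and `alpha` are all assumptions, since the specific values and normalisation are not recoverable from the text.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(pairs, labels, margin=1.0, alpha=1.0):
    """Contrastive loss over (probe, gallery) embedding pairs (illustrative).

    pairs:  list of (probe_embedding, gallery_embedding) tuples.
    labels: 1 for a positive (same identity) pair, 0 for a negative pair.
    margin, alpha: placeholder defaults, not the values used in the paper.
    """
    total = 0.0
    for (u, v), y in zip(pairs, labels):
        d = euclidean(u, v)
        pull = y * d ** 2                             # draw positives together
        push = (1 - y) * max(0.0, margin - d) ** 2    # push negatives past the margin
        total += pull + alpha * push
    return total / len(pairs)
```

Negative pairs that are already farther apart than the margin contribute nothing, so the loss concentrates on pairs that are still confusable.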
III-C Adversarial View Adaptation with Cross-Entropy Loss
In our framework, we need to minimise the distance between the probe and gallery representations by alternating the minimisation between two functions. Thereby, the probe and gallery mappings should be optimised according to a constrained adversarial objective, Eq.(3).
Intuitively, the loss function in Eq.(3) is a view confusion objective, under which the mappings can be trained using a cross-entropy loss against a uniform distribution. This loss ensures that the adversarial discriminator views the two domains identically. Finally, the full objective function, Eq.(4), is formulated as an unconstrained optimisation.
The components of the objective function Eq.(4) can be interpreted as follows.
Asymmetric mappings: We allow independent view mappings without enforcing weight sharing. This introduces a more flexible learning paradigm in which view-specific feature extraction can be learned. Siamese-like networks in person re-ID [3, 8, 21] have layer-wise equality constraints, thus enforcing exact consistency between the probe and gallery mappings. Indeed, learning a symmetric transformation reduces the number of parameters in the model and ensures that the mapping is view-invariant once the optimisation has converged. However, this may render the optimisation poorly conditioned because the same network has to deal with images from two separate distributions.
View confusion loss: In the setting where both mappings are changing, the standard GAN loss cannot be applied, because in the GAN setting the source distribution remains fixed while the target distribution is learned to match it. Thus, we optimise the view confusion objective, and the mappings are updated using a cross-entropy loss against a uniform distribution.
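The view confusion objective, a cross-entropy against a uniform distribution over the two views, can be sketched as follows. This is an illustrative reading of the description rather than the paper's exact Eq.(3); the function name and the `eps` guard are assumptions.

```python
import math

def view_confusion_loss(probe_probs, eps=1e-12):
    """Cross-entropy between the discriminator output and a uniform target.

    probe_probs: for each sample, the view discriminator's predicted
    probability that the sample comes from the probe view. The target
    distribution puts 1/2 on "probe" and 1/2 on "gallery".
    """
    total = 0.0
    for p in probe_probs:
        total += -0.5 * (math.log(p + eps) + math.log(1.0 - p + eps))
    return total / len(probe_probs)
```

The loss is minimised when the discriminator outputs 0.5 for every sample, which is the point where it can no longer tell the two views apart. Both mappings receive gradients from this loss, in contrast to a standard GAN loss where one distribution stays fixed.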
III-D Network Training
We optimise the objective function Eq.(4) in stages. The overall network has three components to be trained: a similarity discriminator, a view discriminator, and the mappings across views. First, the two mappings are initialised by two deep models, M-Net and D-Net, which are effective in independent feature detection and extraction [3, 59]. Then, the similarity discriminator is modelled by stacked fully-connected layers: 1024 hidden units, 2048 hidden units, and the final similarity output. With the exception of the similarity output layer, these fully-connected layers use a ReLU activation function. However, because the model requires access to all pairs as input, there would be a severe imbalance between the number of within-identity pairs and the much greater number of between-identity pairs. Thus, our first improvement is to introduce an adaptive weighted loss into the similarity discriminator to counter this imbalance.
III-D1 Adaptive Weighted Loss
The challenge in learning effective features with a balanced model is to assign larger weights to difficult positive and negative samples. We improve the similarity loss in Eq. (2) by introducing an adaptive weight distribution over the positive and negative classes, so that Eq.(2) is rewritten with separate weights on its positive and negative terms.
In the weighted loss, each gallery sample is either positive or negative with respect to the probe sample, and separate weights are assigned to the positive and negative pairs. Through this adaptive weighted loss, the positive/negative class imbalance is alleviated because it is explicitly reflected in the weight distribution. The advantage of adaptive weighting is that it retains the contribution of hard samples, whereas the original loss with uniform weights can wash out their effect and is thus likely to fall into local minima driven by easy samples. In our implementation, the positive and negative weights are defined by soft-max/min weight distributions.
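One common reading of soft-max/min weighting is sketched below: within a batch, positives are weighted by a soft-max over their distances to the probe (far positives are hard), and negatives by a soft-min over theirs (near negatives are hard). Whether the paper uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable soft-max over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_weights(pos_dists, neg_dists):
    """Batch-based weights for one probe sample (illustrative sketch).

    pos_dists: distances from the probe to its positive gallery samples.
    neg_dists: distances from the probe to its negative gallery samples.
    """
    w_pos = softmax(pos_dists)                  # larger distance -> larger weight
    w_neg = softmax([-d for d in neg_dists])    # smaller distance -> larger weight
    return w_pos, w_neg
```

Each weight vector sums to one, so the positive and negative terms contribute on the same scale regardless of how many negatives the batch contains, which is what counters the class imbalance.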
In training, the VGG-16 network pre-trained on ImageNet is used as the base feature architecture. Following the conventional fine-tuning strategy, the last fully-connected layer is modified to have one output neuron per class, where the number of classes equals the number of training persons. Once fine-tuning is done, the convolutional layers of the VGG architecture serve as the non-linear transformations for the two mappings. As the network takes paired inputs, the two mappings do not share weights, which keeps the network asymmetric. The outputs for each pair are concatenated before being passed into the similarity discriminator. Once the mappings and the similarity discriminator are trained, the next step is to train the view discriminator to classify images as coming from the probe or the gallery view. We model the view discriminator with two fully-connected layers and a soft-max activation in the last layer to optimise the loss function of Eq. (1). This is implemented by freezing the mappings and updating the parameters of the view discriminator. Then, the network is trained to confuse the view discriminator, with the cross-entropy loss computed by optimising Eq. (3). The training process is illustrated in Fig.2, and the procedure is summarised in Algorithm 1.
To address the imbalance in training pairs, we improve the optimisation of the similarity discriminator by introducing the adaptive weighted loss. To construct a batch during training and calculate the adaptive weights, we follow the batch construction of MTMCT. Specifically, during a training epoch each identity is selected in turn into its own batch, and the remaining identities of the batch are selected at random. The samples for each identity are also selected at random.
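The batch construction described above can be sketched as follows. The numbers of identities per batch and samples per identity are not recoverable from the text, so they are left as parameters; the function name, the dictionary layout and the fallback to sampling with replacement are assumptions of this example.

```python
import random

def build_batches(images_by_identity, ids_per_batch, samples_per_id, seed=0):
    """Build one batch per identity for a training epoch (illustrative).

    images_by_identity: dict mapping identity -> list of image references.
    Each identity anchors its own batch; the remaining identities in the
    batch and the samples of every identity are drawn at random.
    """
    rng = random.Random(seed)
    identities = list(images_by_identity)
    batches = []
    for anchor_id in identities:
        others = [i for i in identities if i != anchor_id]
        chosen = [anchor_id] + rng.sample(others, ids_per_batch - 1)
        batch = []
        for pid in chosen:
            pool = images_by_identity[pid]
            if len(pool) >= samples_per_id:
                picks = rng.sample(pool, samples_per_id)
            else:
                # Too few images: sample with replacement (an assumption).
                picks = [rng.choice(pool) for _ in range(samples_per_id)]
            batch.extend((pid, img) for img in picks)
        batches.append(batch)
    return batches
```

Every identity is therefore seen at least once per epoch, and each batch contains both positives and negatives for its samples, which the adaptive weights need.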
IV-A Datasets and Evaluation
We perform extensive experiments and comparative studies to evaluate our approach on four benchmark datasets: VIPeR, CUHK03, Market-1501, and DukeMTMC-reID [64, 14]. Example images are shown in Fig.4.
The VIPeR dataset contains 632 individuals captured by two cameras with arbitrary viewpoints and varying illumination conditions. The 632 individuals are randomly divided into two equal halves, one for training and the other for testing.
The CUHK03 dataset includes 13,164 images of 1,360 pedestrians. The whole dataset is captured with six surveillance cameras. Each identity is observed by two disjoint camera views, yielding an average of 4.8 images in each view. This dataset provides both manually labeled and detected pedestrian bounding boxes. Our experiments report results on the labeled dataset.
The Market-1501 dataset  contains 32,668 fully annotated boxes of 1501 pedestrians. Each identity is captured by at most six cameras and boxes of person are obtained by running a state-of-the-art detector, the Deformable Part Model (DPM) . The dataset is randomly divided into training and testing sets, containing 750 and 751 identities, respectively.
The DukeMTMC-reID dataset is a re-ID version of the DukeMTMC dataset . It contains 34,183 image boxes of 1,404 identities of which 702 are used for training and the remaining 702 for testing. The probe and gallery images are 2,228 and 17,661, respectively.
We evaluate all approaches with Cumulative Matching Characteristic (CMC) results under the single-shot setting. The CMC curve characterises the ranking of the gallery images for each probe image. We also use mean Average Precision (mAP) as a performance measure on CUHK03, Market-1501, and DukeMTMC-reID.
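For reference, the two measures can be computed from ranked gallery lists as in the sketch below. It is a simplified illustration: the function names are hypothetical, identities are plain labels, and benchmark-specific rules such as excluding same-camera gallery images are omitted.

```python
def cmc_curve(rankings, probe_ids, max_rank):
    """CMC: fraction of probes whose first correct match is within rank k.

    rankings:  for each probe, gallery identities sorted by increasing distance.
    probe_ids: the true identity of each probe.
    """
    hits = [0] * max_rank
    for ranking, pid in zip(rankings, probe_ids):
        if pid in ranking[:max_rank]:
            first = ranking.index(pid)
            for k in range(first, max_rank):
                hits[k] += 1
    return [h / len(probe_ids) for h in hits]

def average_precision(ranking, pid):
    """Average precision of one ranked list with respect to identity pid."""
    found, precision_sum = 0, 0.0
    for position, gid in enumerate(ranking, start=1):
        if gid == pid:
            found += 1
            precision_sum += found / position
    return precision_sum / found if found else 0.0

def mean_average_precision(rankings, probe_ids):
    """mAP: mean of the per-probe average precisions."""
    aps = [average_precision(r, p) for r, p in zip(rankings, probe_ids)]
    return sum(aps) / len(aps)
```

CMC only looks at the first correct match, whereas mAP accounts for all correct matches in the gallery, which is why it is reported on the datasets with several gallery images per identity.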
Since our network has access to only a few examples of each person during training, as we do not perform data augmentation, it is necessary to cross-validate the hyper-parameters to improve generalisation to unseen observations. We therefore split the training identities of each dataset into two disjoint sets of classes, a training subset and a validation subset. For example, the VIPeR dataset is randomly divided into three subsets for training (216 persons), validation (100 persons), and testing (316 persons). The details of the train/validation/test division on the four datasets are given in Table I. There are up to six cameras for CUHK03 and Market-1501, and thus for each person we randomly select two cameras to be the probe and gallery views. Then, each person's images across views are selected as samples in pairs.
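The identity-disjoint division can be sketched as a shuffled split over identity labels. The function name and fixed seed are assumptions of this example; the sizes in the usage note are the VIPeR figures quoted above.

```python
import random

def split_identities(identity_ids, n_train, n_val, seed=0):
    """Randomly split identities into disjoint train/validation/test sets."""
    ids = list(identity_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]    # everything left over is used for testing
    return train, val, test
```

For VIPeR, `split_identities(range(632), 216, 100)` yields 216 training, 100 validation and 316 test identities with no identity shared between the subsets.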
We use the VGG architecture and its variants M-Net and D-Net as the feature bases, which are initialised from weights pre-trained on ImageNet and fine-tuned on each target re-ID dataset. Once fine-tuning is done, the convolutional layers of each network are used as the two view mappings, and three fully-connected layers with ReLU activations are used as the similarity discriminator. Its hidden layers have dimensionalities of 1,024 and 2,048, respectively. The learning rate starts at 0.001 and is divided by 10 every 10 epochs. The network uses a batch size of 128 images. Training is stopped when the loss on the validation set stops decreasing.
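The step schedule stated above (start at 0.001, divide by 10 every 10 epochs) can be written as a one-line function. Zero-based epoch indexing and the function name are assumptions of this sketch.

```python
def learning_rate(epoch, base_lr=0.001, drop_every=10, factor=0.1):
    """Step decay: multiply the base rate by `factor` every `drop_every` epochs.

    epoch is assumed to be zero-based, so epochs 0-9 use base_lr.
    """
    return base_lr * factor ** (epoch // drop_every)
```

With these defaults the rate is 0.001 for epochs 0 to 9, 0.0001 for epochs 10 to 19, and so on.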
IV-C Experimental Results
In this section, we compare the proposed method with recent un/semi-supervised and supervised models on four datasets. The comparison results, measured by rank-R accuracies of CMC, are shown in Fig.5. The respective rank-R values on the four datasets are given in Table II, Table III, Table IV, and Table V. We also conduct self-ablation evaluations on parameter sensitivity and network architecture.
Comparison to Un/semi-supervised Methods
We compared our method with several unsupervised re-ID models, including local salience learning based models (GST and eSDC), transfer-learning based models (t-LRDC, PUL, and UMDL), metric learning methods (OSML, CAMEL, OL-MANS), and the semi-supervised method LSRO.
On the VIPeR dataset, Table II shows that our method outperforms the other models in the case where there is only one example of each person in each view. For example, our method achieves rank-1=51.3, a noticeable improvement over OL-MANS with rank-1=44.9. The main reason is that assumptions made without supervision cannot provide view-specific inference, which prevents these unsupervised methods from achieving higher accuracies. In contrast, the proposed method is based on adversarial learning, which is able to effectively minimise the view discrepancy without requiring large numbers of labeled training examples. Moreover, the improved variant of our approach with the adaptive weighted loss can emphasise the most difficult samples in a batch, and thus outperforms the state-of-the-art SpindleNet in rank-1 value.
On the CUHK03 dataset, Table III shows that our method outperforms the state-of-the-art by large margins. For instance, the rank-1 value is improved by 25% compared to OL-MANS. The illumination changes in CUHK03 are extremely severe, and even human beings may find it difficult to identify persons across views. Without the aid of supervision, unsupervised methods cannot retain appearance robustness against visual variations. By comparison, our approach addresses this issue by training a discriminative distance metric jointly with the view-invariant feature learning, which leads to better performance. Also, our method with adaptive weighting outperforms the state-of-the-art SpindleNet, which builds discriminative representations through extensive body region decomposition.
Table IV and Table V report the comparison results on the Market-1501 and DukeMTMC-reID datasets, respectively. Our method has achieved notable performance gain on the two datasets in comparison with these un/semi-supervised methods. These empirical evaluations on different benchmark datasets demonstrate the effectiveness of our model in cross-view person re-ID owing to the effective view adaptation while learning discriminative metrics in the context of view-aligned feature space.
Comparison to Supervised Methods
We compared the proposed method against recent state-of-the-art supervised models: DCSL, JSTL, DNSL, Deep-Embed, SpindleNet, Part-Aligned, MSCAN, SI-CI, PIE, JLML, MTMCT, SPReID, SVDNet, and DPFL. Comparison results on the four datasets are reported in Table II, Table III, Table IV, and Table V, respectively. Our method is competitive with these supervised methods, and outperforms them when the adaptive weighting is applied. In Table II, we obtain rank-1=51.3 (55.9 with weighted adaptation) on the VIPeR dataset, an improvement of 2.1% in rank-1 over SpindleNet. In Table III, we obtain rank-1=86.6 (88.9 after weighted adaptation), as opposed to rank-1=88.5 for SpindleNet. We remark that SpindleNet is a fully-supervised method that needs annotations on each body region in order to extract the local features that describe each person. Annotating each body region is very cumbersome and does not scale to large camera networks. On Market-1501, the comparison results in Table IV show that our method greatly improves the rank-1 accuracy. For example, in comparison with MSCAN, a state-of-the-art method based on fully-supervised body region encoding, the rank-1 accuracy goes from 80.3% up to 87.2%. The gain is particularly pronounced on Market-1501, where each person has up to 10 samples and our approach is able to address the view misalignment more carefully. Our approach with the weight adaptation loss further improves the rank-1 accuracy to 89.1%, which is better than the state-of-the-art SPReID* (rank-1=88.3%) and DPFL (rank-1=88.6%). (Please note that all results of SPReID are reported using reduced data augmentation with a ResNet-152 backbone.) Experimental results on the DukeMTMC-reID dataset are reported in Table V. Our method outperforms the state-of-the-art DPFL by 1% in rank-1 accuracy. This shows that the adaptive weighting scheme is very effective in training a balanced model on the DukeMTMC-reID dataset, which has a severe size imbalance between the probe set (2,228 images) and the gallery set (17,661 images).
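For reference, the rank-1 values quoted above are points on the cumulative matching characteristic (CMC) curve. The following is a minimal sketch of how such a curve can be computed from a probe-to-gallery distance matrix. It assumes a simplified protocol in which every gallery image is a valid candidate for every probe; the function name `cmc_curve` and the toy distances are illustrative assumptions, and this is not the evaluation code used in the experiments (benchmark protocols may additionally exclude some gallery images per query, which is omitted here).

```python
from typing import List, Sequence


def cmc_curve(dist: Sequence[Sequence[float]],
              probe_ids: Sequence[int],
              gallery_ids: Sequence[int],
              max_rank: int = 5) -> List[float]:
    """Fraction of probes whose true identity appears among the
    top-k gallery entries, for k = 1..max_rank (index 0 is rank-1)."""
    hits = [0] * max_rank
    for row, pid in zip(dist, probe_ids):
        # Gallery indices sorted by increasing distance to this probe.
        order = sorted(range(len(gallery_ids)), key=lambda j: row[j])
        ranked_ids = [gallery_ids[j] for j in order]
        if pid not in ranked_ids:
            continue  # no true match in the gallery: never counted as a hit
        first_hit = ranked_ids.index(pid)  # 0-based position of first correct match
        for k in range(first_hit, max_rank):
            hits[k] += 1
    n = len(probe_ids)
    return [h / n for h in hits]


# Toy example (hypothetical distances): 3 probes, 3 gallery images.
toy_dist = [
    [0.1, 0.9, 0.5],  # probe 0: correct match is the nearest gallery image
    [0.8, 0.7, 0.2],  # probe 1: correct match is second nearest
    [0.3, 0.6, 0.4],  # probe 2: correct match is second nearest
]
toy_curve = cmc_curve(toy_dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
rank1 = toy_curve[0]
```

In this toy case only one of the three probes is matched correctly at the top position, so the rank-1 entry is one third, while all probes are matched within the top two positions.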
[Figure: four panels, (a) VIPeR, (b) CUHK03, (c) Market-1501, (d) DukeMTMC-reID.]
|Architectures|VIPeR (R=1)|CUHK03 (R=1)|
|---|---|---|
|M-Net, D-Net|51.3|86.6|
|D-Net, M-Net|49.7|83.9|
|ResNet, ResNet|51.3|86.4|
We first study the sensitivity of our model to the key parameter in Eq. (2); the results are shown in Fig. 7. This parameter balances the relative importance of the discriminative distance metric. Rank-1 accuracy improves up to a certain value of this parameter, while a larger value does not bring further gains. We therefore set it empirically and keep it fixed in all experiments. We also study different network architectures to inspect the importance of the backbone networks. In our experiment, we consider VGG-16 and ResNet. Specifically, two VGG variants, M-Net and D-Net, are used to initialise the network pair, and two identical ResNet networks are employed to initialise the pair as a comparison. Experimental results are shown in Table VI. Two identical ResNet networks do not outperform the asymmetric M-Net/D-Net pair: they tie on VIPeR (51.3) and are slightly lower on CUHK03 (86.4 vs. 86.6). Thus, we use M-Net and D-Net as the default backbone networks.
IV-D Comparison to Other Few-Shot Methods
We also compare against two recently proposed few-shot learning methods: matching networks and model regression. Matching networks adopt a nearest-neighbour approach that trains an embedding end-to-end for few-shot learning. Model regression trains a small MLP to regress from a classifier trained on a small dataset to the classifier trained on the full dataset. Both techniques have high capacity for learning from few examples and facilitate recognition in the small-sample-size regime on a broad range of tasks, including domain adaptation and fine-grained recognition. Comparison results are shown in Fig. 6. In terms of overall performance, our method consistently outperforms the two competitors on the two datasets. Matching networks perform similarly to our method; however, they are based on nearest neighbours and keep the entire training set in memory, and are therefore more expensive at test time than our method and model regression.
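The nearest-neighbour character of matching networks, and the reason their test-time cost grows with the size of the stored support set, can be seen in a minimal sketch. It assumes that embeddings have already been computed and are non-zero, and it uses a softmax over cosine similarities as the attention kernel. The names `cosine` and `matching_predict` and the toy vectors are hypothetical; this is not the implementation evaluated in Fig. 6.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_predict(query: Sequence[float],
                     support: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> Dict[int, float]:
    """Label distribution for `query` as an attention-weighted vote
    over all stored support embeddings."""
    # One similarity per support example: cost is linear in the support size.
    sims = [cosine(query, s) for s in support]
    # Numerically stable softmax over the similarities.
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    probs: Dict[int, float] = {}
    for weight, label in zip(exps, labels):
        probs[label] = probs.get(label, 0.0) + weight / total
    return probs


# Toy example (hypothetical embeddings): two identities, three support images.
toy_support = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
toy_labels = [0, 1, 0]
toy_probs = matching_predict([0.9, 0.1], toy_support, toy_labels)
predicted = max(toy_probs, key=toy_probs.get)
```

Every prediction touches all support embeddings, which is why this family of methods must keep the support set in memory at test time.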
IV-E Comparison with View Adaptation Methods
In this experiment, we validate our approach to view adaptation by comparing it with recent domain adaptation methods that are not limited to person re-ID: SPGAN, Deep Adaptation Networks (DAN), Adversarial Discriminative Domain Adaptation (ADDA), and CoGANs. Experimental results are provided in Table VII. For the domain adaptation methods DAN, ADDA, and CoGANs, training is modified to adapt the gallery view (target) to match the probe view (source). For instance, CoGANs can learn a joint distribution of multiple-domain data, so the learning can be conducted using two generative models with an identical architecture, corresponding to the probe and the gallery images of a person. Then, through weight sharing, CoGANs are able to encode high-level semantics regarding identities into the low-level feature extraction. Our approach achieves the highest rank-1 value on the three datasets, despite being a considerably simpler model trained without a deep generator. This provides evidence that generating images is not necessarily required for effective view adaptation. This finding is consistent with ADDA, which does not use a generative model yet also shows convincing results in comparison with CoGANs. CoGANs are sometimes hard to bring to convergence, e.g. on CUHK03, where the view changes are very disparate and the coupled generators cannot be trained simultaneously.
V Conclusion and Future Work
In this paper, we introduce an effective view adaptation model for person re-identification that produces asymmetric transformations able to fully characterise view specificity. The approach is based on adversarial learning and minimises view discrepancy through a view confusion objective with an entropy regularisation, which aligns the views and forms a view-invariant feature space. The network is trained with a cross-entropy loss to optimise the view confusion objective, jointly with a discriminative distance metric learned through a margin-based separability criterion. In addition, training imbalance is explicitly described as a weight distribution on hard samples, and the proposed adaptive weighting loss addresses it more effectively. Experimental results show that the adversarial neural networks are able to produce a feature space in which cross-view variations are reduced. The proposed approach works effectively for labeled training samples with large visual divergence, and it shows clear promise as it sets new state-of-the-art performance in our experiments.
In future work, we will explore view adaptation in the case where such training pairs are not given. One possibility is to learn a probe-to-gallery encoder-decoder under a generative adversarial objective with a reconstruction term, which can be applied to predict the clothing people are wearing. Another interesting direction is to apply few-shot learning principles to learn to match persons with more powerful augmented memory networks.
Meng Wang was supported by the National Key Research and Development Program of China under grant 2018YFB0804200; NSFC 61725203, 61732008. Lin Wu was supported by ARC LP160101797 Improved Pathology by Fusion of Digital Microscopy and Plain Text Reports. Yang Wang was supported by the National Natural Science Foundation of China under Grant No. 61806035.
-  T. Xiao, H. Li, W. Ouyang, and X. Wang, “Learning deep feature representation with domain guided dropout for person re-identification,” in CVPR, 2016.
-  L. Wu, Y. Wang, J. Gao, and X. Li, “Deep adaptive feature embedding with local sample distributions for person re-identification,” Pattern Recognition, vol. 73, pp. 275–288, 2018.
-  L. Wu, Y. Wang, X. Li, and J. Gao, “What-and-where to match: Deep spatially multiplicative integration networks for person re-identification,” Pattern Recognition, vol. 76, pp. 727–738, 2018.
-  H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, “Spindle net: Person re-identification with human body region guided feature decomposition and fusion,” in CVPR, 2017.
-  L. Zhao, X. Li, Y. Zhuang, and J. Wang, “Deeply-learned part-aligned representations for person re-identification,” in ICCV, 2017.
-  D. Li, X. Chen, Z. Zhang, and K. Huang, “Learning deep context-aware features over body and latent parts for person re-identification,” in CVPR, 2017.
-  L. Zheng, Y. Huang, H. Lu, and Y. Yang, “Pose invariant embedding for deep person re-identification,” in arXiv:1701.07732, 2017.
-  F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang, “Joint learning of single-image and cross-image representations for person re-identification,” in CVPR, 2016, pp. 1288–1296.
-  L. Wu, C. Shen, and A. van den Hengel, “Deep linear discriminant analysis on fisher networks: A hybrid architecture for person re-identification,” Pattern Recognition, vol. 65, pp. 238–250, 2017.
-  S. Bak and P. Carr, “One-shot metric learning for person re-identification,” in CVPR, 2017.
-  H. Wang, S. Gong, and T. Xiang, “Unsupervised learning of generative topic saliency for person re-identification,” in BMVC, 2014.
-  R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in CVPR, 2013.
-  W.-S. Zheng, S. Gong, and T. Xiang, “Towards open-world person re-identification by one-shot group-based verification,” TPAMI, vol. 38, no. 3, pp. 591–606, March 2016.
-  Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in ICCV, 2017.
-  H.-X. Yu, A. Wu, and W.-S. Zheng, “Cross-view asymmetric metric learning for unsupervised person re-identification,” in ICCV, 2017.
-  J. Zhou, P. Yu, W. Tang, and Y. Wu, “Efficient online local metric adaptation via negative samples for person re-identification,” in ICCV, 2017.
-  E. Kodirov, T. Xiang, Z. Fu, and S. Gong, “Person re-identification by unsupervised l1 graph learning,” in ECCV, 2016.
-  Y. Wang, X. Lin, L. Wu, W. Zhang, Q. Zhang, and X. Huang, “Robust subspace clustering for multi-view data by exploiting correlation consensus,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3939–3949, 2015.
-  W. Li, R. Zhao, X. Tang, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in CVPR, 2014.
-  L. Wu, Y. Wang, J. Gao, and X. Li, “Where-and-when to look: Deep siamese attention networks for video-based person re-identification,” IEEE Transactions on Multimedia, 2018.
-  W. Li, X. Zhu, and S. Gong, “Person re-identification by deep joint learning of multi-loss classification,” in IJCAI, 2017.
-  W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” in CVPR, 2018.
-  N. Martinel, M. Dunnhofer, G. L. Foresti, and C. Micheloni, “Person re-identification via unsupervised transfer of learned visual representations,” in International Conference on Distributed Smart Cameras, 2017.
-  M. Geng, Y. Wang, T. Xiang, and Y. Tian, “Deep transfer learning for person re-identification,” in arXiv:1611.05244, 2016.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
-  E. Ristani and C. Tomasi, “Features for multi-target multi-camera tracking and person re-identification,” in CVPR, 2018.
-  K. Liu, B. Ma, W. Zhang, and R. Huang, “A spatio-temporal appearance representation for video-based person re-identification,” in ICCV, 2015.
-  L. Wu, Y. Wang, L. Shao, and M. Wang, “3d personvlad: Learning deep global representations for video-based person re-identification,” IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2019.2891244, 2019.
-  L. Zhang, T. Xiang, and S. Gong, “Learning a discriminative null space for person re-identification,” in CVPR, 2016.
-  P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, “Unsupervised cross-dataset transfer learning for person re-identification,” in CVPR, 2016.
-  H. Fan, L. Zheng, and Y. Yang, “Unsupervised person re-identification: Clustering and fine-tuning,” in Arxiv, 2017.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in ICLR, 2016.
-  A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox, “Learning to generate chairs, tables and cars with convolutional networks,” in arXiv:1411.5928, 2017.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
-  S. Motiian, Q. Jones, S. M. Iranmanesh, and G. Doretto, “Few-shot adversarial domain adaptation,” in NIPS, 2017.
-  L. Wu, Y. Wang, and L. Shao, “Cycle-consistent deep generative hashing for cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1602–1612, 2019.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: a deep convolutional activation feature for generic visual recognition,” in ICML, 2014.
-  C. Zhang, L. Wu, and Y. Wang, “Crossing generative adversarial networks for cross-view person re-identification,” Neurocomputing, vol. 340, no. 7, pp. 259–269, 2019.
-  B. Sun and K. Saenko, “Deep coral: correlation alignment for deep domain adaptation,” in ECCV Workshop, 2016.
-  B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in AAAI, 2016.
-  P. Morerio, J. Cavazza, and V. Murino, “Minimal-entropy correlation alignment for unsupervised deep domain adaptation,” in ICLR, 2018.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Deep domain confusion: maximizing for domain invariance,” in arXiv:1412.3474, 2014.
-  M. Long and J. Wang, “Learning transferable features with deep adaptation networks,” in ICML, 2015.
-  G.-J. Qi, C. C. Aggarwal, and T. Huang, “Link prediction across networks by biased cross-network sampling,” in ICDE, 2013, pp. 793–804.
-  ——, “Online community detection in social sensing,” in ACM international conference on Web search and data mining, 2013, pp. 617–626.
-  X.-S. Hua and G.-J. Qi, “Online multi-label active annotation: towards large-scale content-based video search,” in ACM Multimedia, 2008, pp. 141–150.
-  G.-J. Qi, C. C. Aggarwal, and T. Huang, “On clustering heterogeneous social media objects with outlier links,” in ACM international conference on Web search and data mining, 2012, pp. 553–562.
-  S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, “Unified deep supervised domain adaptation and generalization,” in ICCV, 2017.
-  S. Chang, G.-J. Qi, C. C. Aggarwal, J. Zhou, M. Wang, and T. S. Huang, “Factorized similarity learning in networks,” in ICDM, 2014, pp. 60–69.
-  X. Wang, T. Zhang, G.-J. Qi, J. Tang, and J. Wang, “Supervised quantization for similarity search,” in CVPR, 2016, pp. 2018–2026.
-  J. Tang, X.-S. Hua, G.-J. Qi, and X. Wu, “Typicality ranking via semi-supervised multiple-instance learning.” in ACM Multimedia, 2007, pp. 297–300.
-  J. Wang, Z. Zhao, J. Zhou, H. Wang, B. Cui, and G. Qi, “Recommending flickr groups with social topic model,” Inf. Retr., vol. 15, no. 3-4, pp. 278–295, 2012.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in ICCV, 2015.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in CVPR, 2017.
-  Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015.
-  M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in NIPS, 2016.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in BMVC, 2014.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
-  L. Wu, Y. Wang, X. Li, and J. Gao, “Deep attention-based spatially recursive networks for fine-grained visual recognition,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1791–1802, 2019.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” in arXiv preprint, arXiv:1409.0575, 2014.
-  P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” in arXiv:1611.07004, 2016.
-  D. Gray, S. Brennan, and H. Tao, “Evaluating appearance models for recognition, reacquisition, and tracking,” in IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2007.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015.
-  E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in ECCV Workshop on Benchmarking Multi-Target Tracking, 2016.
-  B. Huang, J. Chen, Y. Wang, C. Liang, Z. Wang, and K. Sun, “Sparsity-based occlusion handling method for person re-identification,” in Multimedia Modeling, 2015.
-  Y. Zhang, X. Li, L. Zhao, and Z. Zhang, “Semantics-aware deep correspondence structure learning for robust person re-identification,” in IJCAI, 2016.
-  M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M. Shah, “Human semantic parsing for person re-identification,” in CVPR, 2018.
-  Y. Sun, L. Zheng, W. Deng, and S. Wang, “Svdnet for pedestrian retrieval,” in ICCV, 2017.
-  Y. Chen, X. Zhu, and S. Gong, “Person re-identification by deep learning multi-scale representations,” in ICCV workshops, 2017.
-  S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in CVPR, 2015, pp. 2197–2206.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in arXiv:1606.04080, 2016.
-  Y. Wang and M. Hebert, “Learning to learn: model regression networks for easy small sample learning,” in ECCV, 2016.