Adversarial Binary Coding for Efficient Person Re-identification

03/29/2018 ∙ by Zheng Liu, et al. ∙ 0

Person re-identification (ReID) aims at matching persons across different views/scenes. In addition to accuracy, matching efficiency has received increasing attention because of demanding applications using large-scale data. Several binary coding based methods have been proposed for efficient ReID, which either learn projections to map high-dimensional features to compact binary codes, or directly adopt deep neural networks by simply inserting an additional fully-connected layer with tanh-like activations. However, the former approach requires time-consuming hand-crafted feature extraction and complicated (discrete) optimizations; the latter loses much of the necessary discriminative information because of the straightforward activation functions. In this paper, we propose a simple yet effective framework for efficient ReID inspired by the recent advances in adversarial learning. Specifically, instead of learning explicit projections or adding fully-connected mapping layers, the proposed Adversarial Binary Coding (ABC) framework guides the extraction of binary codes implicitly and effectively. The discriminability of the extracted codes is further enhanced by equipping the ABC with a deep triplet network for the ReID task. More importantly, the ABC and triplet network are simultaneously optimized in an end-to-end manner. Extensive experiments on three large-scale ReID benchmarks demonstrate the superiority of our approach over the state-of-the-art methods.




1 Introduction

(a) Learning Projections
(b) Deep Hashing
(c) Ours
Figure 1: Different binary coding schemes for efficient ReID. Our method avoids time-consuming projection learning and results in high-quality binary codes in an intuitive way, via adversarial training without using tanh-like activation.

Given one or multiple images of a pedestrian, person re-identification (ReID) aims to retrieve the person with the same identity from a large collection of images captured in different scenes and from various viewpoints. ReID enables various potential applications, such as long-term cross-scenario tracking and criminal retrieval. The task, however, still remains challenging due to the significant variations in poses, viewpoints and illuminations across different cameras.

Numerous ReID methods have been proposed, most of which adopt high-dimensional (usually thousands or more) features [1, 2, 3, 4, 5, 6] in order to represent persons comprehensively with various cues (colors, textures, and spatial-temporal cues). This directly brings much higher computational complexity to the subsequent similarity measurement (metric learning). Besides, current large-scale ReID benchmarks contain numerous identities and cameras to simulate real-world scenarios, making existing state-of-the-art ReID approaches computationally unaffordable [7]. Therefore, despite the noteworthy improvement in matching accuracy, the computational and memory requirements have at the same time become more challenging.

Binary coding (hashing), adopted by [8, 9], maps high-dimensional features into compact binary codes and efficiently measures similarities in the low-dimensional Hamming space. It is one of the promising solutions for efficient ReID. The hashing based ReID methods can be mainly divided into two categories: 1) The method shown in Fig. 1(a) learns multiple projection matrices to concurrently map original features to a low-dimensional and discriminative Hamming space. However, its objective is generally a non-convex joint function of several sub-tasks (similarity-preserving mapping and binary transformation), which requires the explicit design of sophisticated functions and time-consuming non-convex (discrete) optimizations. The memory storage and computational efficiency are serious issues, especially when dealing with large-scale data. 2) Fig. 1(b) shows a deep neural network based method, which is able to process large-scale data much more efficiently compared with traditional methods by using mini-batch learning algorithms and advanced GPUs. The binary codes here are generated by inserting hashing layers at the end of the networks. However, the hashing layer is simply a fully-connected layer followed by a tanh-like activation to force the outputs into binary form. This straightforward scheme hardly constrains the outputs under the important principles of hashing (balancedness and independence [10]) to obtain high-quality binary codes. Moreover, the outputs of the hashing layers tend to lie in the approximately linear part of the tanh-like functions for preserving discriminability. Therefore, directly binarizing the outputs by the sign function will lose the discriminative information.

To address the above issues, this paper proposes a unified end-to-end deep learning framework for efficient ReID, aiming to jointly learn a discriminative feature representation, an accurate similarity measurement and an implicit binary transformation. In particular, we propose Adversarial Binary Coding (ABC) by adopting a Generative Adversarial Net (GAN) [11, 12] to regularize features into binary form without loss of discriminability (see Fig. 2). Instead of explicit projections, the adversarial learning makes the target distribution (in binary form) an ‘expert’ that implicitly guides the network to generate samples under the same distribution. Specifically, we employ the Bernoulli distribution to guide a CNN to generate discrete features. Benefiting from the nature of the Bernoulli distribution, our ABC can generate high-quality discriminative codes complying with the important principle of hashing, balancedness. As shown in Fig. 1(c), our strategy avoids both time-consuming explicit projection learning and low-quality codes with the simple tanh-like activation. More importantly, our ABC can be flexibly embedded into any similarity regressive network (e.g., deep triplet networks) and optimized jointly with the network in an end-to-end manner. The main contributions of this paper are summarized as follows:

1) We propose a binary transformation strategy based on deep adversarial learning. The proposed architecture is composed of a CNN for feature extraction and a discriminator network for distinguishing real-valued and binary features, where the CNN is guided to generate features in binary form to confuse the discriminator. Thus, the features are implicitly regularized into binary codes.

2) An end-to-end deep neural network that seamlessly accommodates the above adversarial binary coding module is built for efficient ReID. We jointly optimize the binary transformation and similarity measurement. Consequently, the discriminative information is largely preserved during feature binarization.

3) Extensive experiments on three large-scale ReID benchmarks (CUHK03 [13], Market-1501 [14], and DukeMTMC-reID [15]) clearly demonstrate the superiority of our framework both in terms of accuracy and efficiency, compared with other binary coding based and the state-of-the-art ReID methods.

2 Related work

Person re-identification:

Traditional approaches usually propose certain feature learning algorithms for ReID, including low-level color features [1, 16, 17] and local gradients [18, 19, 4], and high-level features [2, 20, 3]. Owing to the breakthrough performance of deep neural networks, an increasing number of deep learning based ReID methods [13, 21, 22, 23, 24, 25, 26] have been proposed. For instance, siamese CNNs [27, 28, 29] and triplet CNNs [30, 31, 6] are widely used for similarity measurement. Very recently, several binary coding based approaches [32, 33, 8, 9] have emerged to deal with the high computation and storage costs existing in ReID problems.

Generative adversarial nets:

GANs [11, 12] provide a methodology to map random variables from a simple distribution to a certain complex one, and have been widely used in image generation [12, 34, 35, 36], style transfer [37, 38] and latent feature learning [39, 40, 41]. To stabilize and quantify the training of GANs, a breakthrough named Wasserstein GAN (WGAN) was proposed in [42] and improved in [43]. More recently, GANs have also been utilized for image retrieval problems. In [44], GANs were adopted to distinguish synthetic and real images, aiming to improve the discriminability of binary codes. GANs were also employed to enhance the intermediate representation of the generator in [45]. However, these studies still simply adopted tanh-like activations for binarization. To the best of our knowledge, our ABC is the first work that intuitively adopts the spirit of adversarial learning to perform binary transformation for efficient ReID.

3 Approach

The proposed framework transforms high-dimensional real-valued features to compact binary codes mainly based on the adversarial learning. In the following, we first briefly review the principles of GANs in Section 3.1. Then, we introduce the adversarial binary coding (ABC) in detail in Section 3.2. In Section 3.3, we present the joint end-to-end framework with triplet networks for efficient ReID.

3.1 A Brief Review of GANs

In the framework of GAN, there is a generator $G$ which competes against an adversary $D$, a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. To learn the generator’s distribution $p_g$ over data $x$, a prior on input noise variables is defined as $p_z(z)$, then a mapping to data space is denoted as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a deep neural network with parameters $\theta_g$. Meanwhile, $D(x)$ represents the probability that $x$ comes from the data rather than $p_g$. $D$ is trained to maximize the probability of assigning the correct label to both samples from real data and samples from $G$. Simultaneously, $G$ is trained to make $D$ confused. The formal loss function is defined as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

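However, GANs are difficult to train, so the generator may fail to generate either real-looking or diverse samples. Arjovsky et al. [46, 42] addressed this problem by introducing the WGAN, which optimizes the Wasserstein loss instead of the Jensen-Shannon divergence to evaluate the similarity between distributions. The Wasserstein loss is defined as follows:

$$\min_G \max_{D \in \mathcal{D}} \mathbb{E}_{x \sim p_{data}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))]$$

where $\mathcal{D}$ denotes the set of 1-Lipschitz functions.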

It provides stronger stability of gradients based on the Wasserstein-1 distance (also called the Earth-Mover distance). Moreover, WGAN provides meaningful learning curves useful for debugging and hyper-parameter searching. Therefore, in this work, we adopt the training strategy of WGAN for adversarial learning.
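The two sides of the Wasserstein objective can be illustrated numerically. The following is a minimal sketch in plain Python, assuming the critic has already produced scalar scores for a batch; the function names `critic_loss` and `generator_loss` are hypothetical and not taken from the authors' code, and the Lipschitz constraint on the critic (for example, weight clipping in the original WGAN) is omitted.

```python
from statistics import fmean

def critic_loss(real_scores, fake_scores):
    """Loss minimized by the critic D: E[D(fake)] - E[D(real)]."""
    return fmean(fake_scores) - fmean(real_scores)

def generator_loss(fake_scores):
    """Loss minimized by the generator G: -E[D(fake)]."""
    return -fmean(fake_scores)

# Toy critic scores: real samples score high, generated samples score low.
real_scores = [0.9, 1.1, 1.0]
fake_scores = [-0.5, -0.3, -0.4]
print(critic_loss(real_scores, fake_scores))
print(generator_loss(fake_scores))
```

A more negative critic loss means the critic separates the two sets of samples well; the generator lowers its own loss by raising the scores of its samples.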

Figure 2: Illustration of the adversarial binary coding (ABC) framework. The discriminator network receives sampled binary codes as positive samples and extracted real-valued features as negative samples. The feature extractor network and discriminator network are jointly optimized under the Wasserstein loss (W loss), such that the extractor is forced to generate features in binary form.

3.2 Adversarial Binary Coding

Our binary coding scheme is intuitively inspired by GANs. Instead of formulating explicit hashing functions (learning explicit projections), we implicitly guide a deep neural network to directly learn the transformation of data from the original distribution (images) to a distribution of binary vectors in a GAN framework. In this section, we focus on the binary transformation module of the end-to-end efficient ReID framework. How to keep the semantic/discriminative information during the transformation procedure is explained in Section 3.3.


The proposed framework of adversarial binary coding is illustrated in Fig. 2. The feature extractor can be any CNN architecture (ResNet-50 [47] is adopted in this work), which finally represents the images as feature vectors. Meanwhile, a binary code sampler performs random sampling for every bit of the binary vectors. To satisfy the principle of effective binary coding mentioned in [10], we sample from the Bernoulli distribution, under which each bit has a 50% chance of being 0 or 1, and different bits are independent of each other. The discriminator is expected to classify the binary vectors as positive samples and the real-valued feature vectors as negative samples. Thus, the extractor is trained to generate feature vectors that follow the same distribution as the positive samples using the Wasserstein loss (W loss) introduced in Section 3.1.


Formally, we denote a batch of images as $x$ under the distribution $p_{data}(x)$. The feature extractor is denoted as a mapping function $G$, which plays the role of the generator in the definition of GANs under an encoding distribution $q(z|x)$, where $z$ denotes the extracted feature vectors. $G$ aims to transform data from the original distribution $p_{data}(x)$ to a target distribution $q(z)$:

$$q(z) = \int_x q(z|x) \, p_{data}(x) \, dx$$

Since a binomial distribution is equivalent to multiple Bernoulli samplings with the same probability, the extractor is essentially regularized by matching the posterior $q(z)$ to a prior binomial distribution using the Wasserstein distance.
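The binary code sampler described above can be sketched as follows. This is only an illustration in plain Python, assuming independent Bernoulli(0.5) bits; the function name `sample_binary_codes` and its parameters are hypothetical and do not come from the paper.

```python
import random

def sample_binary_codes(batch_size, code_length, p=0.5, seed=None):
    """Draw independent Bernoulli(p) bits for every code in the batch."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(code_length)]
            for _ in range(batch_size)]

codes = sample_binary_codes(batch_size=4, code_length=16, seed=0)
# The number of ones in each code follows a Binomial(16, 0.5) distribution.
ones_per_code = [sum(code) for code in codes]
print(ones_per_code)
```

With p = 0.5, roughly half of the bits are set on average, which corresponds to the balancedness principle, and drawing each bit separately corresponds to the independence principle.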

As mentioned above, we use ResNet-50 [47] as the backbone model, where the Rectified Linear Unit (ReLU) is adopted as the activation function. Hence, we represent every bit of the binary codes by $\{0, 1\}$ instead of $\{-1, 1\}$ [33, 9] due to the non-negative outputs of ReLU. We further find that the performance severely deteriorates if the feature vectors and binary codes are directly fed into the discriminator and the similarity regressive loss (triplet loss in Fig. 3) without normalization, due to the contradiction between the expected 0 or 1 outputs and the learning algorithm. More concretely, the weights of the neural network are generally initialized to very small values (much smaller than 1). Meanwhile, the learning algorithm carefully controls the scale of the weights (through learning rates and the weight decay mechanism) to avoid vanishing or exploding gradients under the loss function. As a result, the features extracted by the network will also be very close to 0, since they share the same scale as the weights. On the contrary, our ABC expects every dimension of the output features to be constrained near 0 or 1. As a consequence, the optimization process becomes unstable without normalization.

To address the above issue, we normalize both the output feature vectors and the sampled binary codes to the same scale. As for the real-valued features, we adopt the standard $\ell_2$-Norm operation. In terms of the binary codes, we perform the normalization specifically as follows. Given a batch of random binary vectors $\{b_1, \ldots, b_n\}$, where $b_i \in \{0, 1\}^l$ and $l$ is the code length, the binary vectors can be directly normalized as follows:

$$\hat{b}_i = \frac{b_i}{\|b_i\|_2}$$

However, the $\ell_2$-Norm of every vector could be different, since every binary vector may contain a different number of bits assigned to $1$. In other words, the values of non-zero entries in the normalized vectors will be different. This leads to an unstable training process, where the losses are unable to guide the optimization clearly. Therefore, in this study, we adopt the expectation of the Bernoulli distribution to calculate the $\ell_2$-Norm of binary vectors. Specifically, we calculate a uniform normalization factor as:

$$\eta = \sqrt{l \cdot \mathbb{E}(b)}$$

where $\mathbb{E}(b)$ represents the expectation of Bernoulli random variables, and thus the binary vectors can be normalized as $\hat{b}_i = b_i / \eta$.
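The effect of a uniform normalization factor can be checked on a toy example. The sketch below assumes the norm in question is the L2 norm and that the factor equals the square root of the code length times the Bernoulli expectation (0.5); both the formula and the helper names `bernoulli_norm_factor` and `normalize_codes` are assumptions made for illustration.

```python
import math

def bernoulli_norm_factor(code_length, p=0.5):
    """Uniform factor sqrt(l * E[b]) used in place of the per-vector L2 norm."""
    return math.sqrt(code_length * p)

def normalize_codes(codes, p=0.5):
    """Scale every binary vector by the same factor."""
    eta = bernoulli_norm_factor(len(codes[0]), p)
    return [[bit / eta for bit in code] for code in codes]

codes = [[1, 0, 1, 1], [0, 0, 1, 0]]
normalized = normalize_codes(codes)
# All non-zero entries share one value (1 / sqrt(2)), whereas per-vector L2
# normalization would give 1 / sqrt(3) for the first code and 1 for the second.
print(normalized)
print(bernoulli_norm_factor(2048))  # sqrt(1024) = 32.0
```

Because every non-zero entry maps to the same value, the discriminator sees positive samples on a single, fixed scale regardless of how many bits happen to be set in a particular code.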

3.3 Triplet Loss based Efficient ReID Framework

Figure 3: Illustration of the triplet loss based adversarial binary coding embedded efficient ReID framework. The feature extraction network is a pre-trained ResNet-50 model. The network is first fine-tuned by the Cross Entropy Error (CEE) loss on training images and then simultaneously trained by the triplet loss and the Wasserstein loss to generate discriminative binary features.

To not only transform features to binary form, but also measure similarities between binary codes, the ABC is further embedded into a triplet network to ensure the discriminability of the learned binary codes. The triplet loss [48] is formulated as follows:

$$L_{tri} = \max\big(0, \; m + d(f_a, f_p) - d(f_a, f_n)\big)$$

where $f_a$, $f_p$, and $f_n$ are input features, $m$ is the imposed distance margin between positive and negative pairs, and $d(\cdot, \cdot)$ measures the similarity distance. $f_a$ and $f_p$ are features from the same class (same identity in ReID), and $f_n$ is from another class (different identity). The triplet loss forces the distances between samples in negative pairs to be larger than those in positive pairs. Therefore, it is widely used in tasks that aim to retrieve data with high relevance.
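A small numeric example may help to see when the triplet loss is active. The sketch below is written in plain Python and assumes the plain (non-squared) Euclidean distance as the distance function; the helper names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

anchor, positive, negative = [0.0, 0.0], [0.3, 0.4], [0.6, 0.8]
# d(anchor, positive) = 0.5 and d(anchor, negative) = 1.0
loss_small_margin = triplet_loss(anchor, positive, negative, margin=0.2)
loss_large_margin = triplet_loss(anchor, positive, negative, margin=0.7)
print(loss_small_margin, loss_large_margin)
```

With a margin of 0.2 the negative is already far enough away and the loss vanishes; with a margin of 0.7 the constraint is violated and the loss is positive, which is why increasing the margin during training keeps the constraint informative.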

The overall framework for efficient ReID is shown in Fig. 3. ResNet-50 pre-trained on ImageNet [49] is adopted as the backbone model, where the fixed average pooling layer is replaced by an adaptive average pooling layer to fit different input sizes, followed by a feature embedding (fully-connected) layer that reduces the feature dimension to the expected length. At the beginning of training, we fine-tune the model on pedestrian images with the Cross Entropy Error (CEE) loss by solving a conventional classification problem in which each class contains the images of one person. Fine-tuning models pre-trained on a large image collection on small datasets has been verified as an effective approach for knowledge transfer. It also helps deep networks with less data to find the optimal parameters more easily and to converge faster than training from scratch. Note that the outputs of the last layer are not normalized by the $\ell_2$-norm in this phase, just as in conventional CNNs for image classification. After that, we train the model with normalization as shown in Fig. 3, by jointly optimizing the Wasserstein loss for binary coding and the triplet loss for similarity measuring. Specifically, to compose triplet batches, we randomly select different persons and pick two images from different views of each person as the anchor and positive samples. Then we randomly select an image of a person different from the anchor as the negative sample in each triplet.
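The triplet composition described above can be sketched as follows. The nested dictionary layout (person, then camera, then images) and the function name `sample_triplets` are hypothetical choices made for this illustration; the paper does not specify a data structure.

```python
import random

def sample_triplets(images_by_person, num_triplets, seed=None):
    """images_by_person: {person_id: {camera_id: [image, ...]}} (assumed layout).

    num_triplets must not exceed the number of persons seen by two or more cameras.
    """
    rng = random.Random(seed)
    # Anchors need at least two views so the positive can come from another camera.
    eligible = [p for p, views in images_by_person.items() if len(views) >= 2]
    triplets = []
    for person in rng.sample(eligible, num_triplets):
        cam_a, cam_p = rng.sample(sorted(images_by_person[person]), 2)
        anchor = rng.choice(images_by_person[person][cam_a])
        positive = rng.choice(images_by_person[person][cam_p])
        # The negative is any image of a different person.
        other = rng.choice([p for p in images_by_person if p != person])
        other_cam = rng.choice(sorted(images_by_person[other]))
        negative = rng.choice(images_by_person[other][other_cam])
        triplets.append((anchor, positive, negative))
    return triplets

toy = {
    "id1": {"camA": ["id1_a1"], "camB": ["id1_b1", "id1_b2"]},
    "id2": {"camA": ["id2_a1"], "camB": ["id2_b1"]},
    "id3": {"camA": ["id3_a1"], "camB": ["id3_b1"]},
}
print(sample_triplets(toy, num_triplets=2, seed=0))
```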

Particularly, in the training phase, we adopt the Euclidean distance to measure similarities between real-valued features for the triplet loss without binarization, because the Euclidean distance provides conspicuously more stable gradients than the Hamming distance while yielding distance measurements equivalent to those of the Hamming distance. In this way, the triplet loss focuses on reducing intra-class distances and enlarging inter-class distances in terms of the real-valued features, whilst the Wasserstein loss focuses on the binary transformation of real-valued features.
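The equivalence between the two distances can be made explicit for exactly binary vectors. The short derivation below uses our own notation, which is an assumption: $b, b' \in \{0, 1\}^l$ are binary codes of length $l$, $d_H$ is the Hamming distance, and $\eta$ is a shared normalization factor.

```latex
\|b - b'\|_2^2 = \sum_{i=1}^{l} (b_i - b'_i)^2 = \sum_{i=1}^{l} |b_i - b'_i| = d_H(b, b'),
\qquad
\left\| \frac{b}{\eta} - \frac{b'}{\eta} \right\|_2^2 = \frac{d_H(b, b')}{\eta^2}.
```

The second equality holds because each difference $b_i - b'_i$ lies in $\{-1, 0, 1\}$. The squared Euclidean distance between uniformly scaled binary vectors is therefore a monotone function of the Hamming distance, so both distances produce the same ranking once the features are close to binary.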

In the testing phase, images are sent into the trained CNN to obtain the real-valued features, of which every entry should be very close to one of the two binary values, deviating from it by only a very small value $\epsilon$. Finally, we binarize the features as follows:

$$b_i = \begin{cases} 1, & f_i \geq \tau \\ 0, & \text{otherwise} \end{cases}$$

where $f_i$ is the value of the $i$-th entry of a real-valued feature $f$ extracted by $G$, $b_i$ is the binary bit of $f_i$ after binarization, and $\tau$ is the threshold separating the two groups of values. The Hamming distances between queries and the gallery set are then computed using extremely fast XOR operations to measure similarities.
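Thresholding followed by XOR-based matching can be sketched in a few lines. The example below packs each code into a Python integer; the threshold of 0.5 on unnormalized toy features and the helper names are assumptions for illustration and are not stated by the paper.

```python
def binarize(features, threshold):
    """Pack a real-valued feature vector into an int, one bit per entry."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value >= threshold else 0)
    return code

def hamming_distance(code_a, code_b):
    """Count differing bits with XOR followed by a population count."""
    return bin(code_a ^ code_b).count("1")

query = binarize([0.01, 0.98, 0.97, 0.02], threshold=0.5)      # bits 0110
gallery = [
    binarize([0.99, 0.03, 0.02, 0.98], threshold=0.5),         # bits 1001
    binarize([0.02, 0.97, 0.01, 0.03], threshold=0.5),         # bits 0100
]
# Rank gallery indices by increasing Hamming distance to the query.
ranking = sorted(range(len(gallery)), key=lambda i: hamming_distance(query, gallery[i]))
print(ranking)  # [1, 0]
```

The second gallery code differs from the query in one bit and the first in four bits, so the second one is ranked first.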

4 Experiments

We evaluate the performance of our method on three large-scale ReID datasets: CUHK03 [13], Market-1501 [14], and DukeMTMC-reID [50, 15]. The goal of our experiments is mainly to answer the following three research questions:

  • Q1: How efficient in computation and storage are our learned binary codes, compared with their real-valued counterparts? (Sec. 4.2)

  • Q2: How does our method perform compared to the binary coding based ReID approaches (Sec. 4.3), and the state-of-the-art non-hashing ReID approaches (Sec. 4.4)?

  • Q3: How does our performance change with different configurations (different similarity networks, with and without $\ell_2$-norm/fine-tuning)? (Sec. 4.5)

4.1 Datasets and Settings

CUHK03 contains 14,096 images of 1,467 identities captured by six surveillance cameras. The dataset provides both manually labeled and automatically detected bounding boxes of varying sizes. In the experiments, we resize the images to 160×60, and the 20 training/testing splits reported in [13] are used. The number of training iterations is set to 6,000, and the margin of the triplet loss is initialized to 0.2 and increased to 0.3 after 1,000 iterations, 0.4 after 2,500 iterations, and 0.5 after 4,000 iterations.

Market-1501 contains 32,668 automatically detected 128×64 bounding boxes of 1,501 pedestrians under six cameras and provides a fixed evaluation protocol. In the experiments, the number of training iterations is set to 8,000, and the margin of the triplet loss is initialized to 0.2 and increased to 0.3 after 1,000 iterations and 0.4 after 4,000 iterations.

DukeMTMC-reID contains 36,411 manually annotated bounding boxes of 1,404 identities under 8 cameras and provides a fixed training/testing split. In the experiments, the images are resized to 128×64. The number of training iterations is set to 8,000, and the margin of the triplet loss is initialized to 0.2 and increased to 0.3 after 2,000 iterations and 0.4 after 5,000 iterations.
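The per-dataset margin schedules can be written down as a small lookup. The sketch below assumes that a new margin takes effect once the stated number of iterations has been completed; the exact switching convention is not given in the text, and the names `MARGIN_SCHEDULES` and `margin_at` are hypothetical.

```python
# (iterations completed, margin) pairs taken from the dataset descriptions above.
MARGIN_SCHEDULES = {
    "CUHK03": [(0, 0.2), (1000, 0.3), (2500, 0.4), (4000, 0.5)],
    "Market-1501": [(0, 0.2), (1000, 0.3), (4000, 0.4)],
    "DukeMTMC-reID": [(0, 0.2), (2000, 0.3), (5000, 0.4)],
}

def margin_at(dataset, iteration):
    """Return the triplet margin in effect after `iteration` completed iterations."""
    margin = MARGIN_SCHEDULES[dataset][0][1]
    for start, value in MARGIN_SCHEDULES[dataset]:
        if iteration >= start:
            margin = value
    return margin

print(margin_at("CUHK03", 999), margin_at("CUHK03", 1000), margin_at("CUHK03", 4500))
```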

(a) Market-1501
(b) DukeMTMC-reID
Figure 4: Comparison of matching time and memory costs of real-valued/binary features in terms of different bit lengths.
| Bit length | CUHK03 (labelled), real-valued | CUHK03 (labelled), binary | CUHK03 (detected), real-valued | CUHK03 (detected), binary | Market-1501, real-valued | Market-1501, binary | DukeMTMC-reID, real-valued | DukeMTMC-reID, binary |
|---|---|---|---|---|---|---|---|---|
| 64-bit | 51.7 | 42.0 | 49.5 | 43.1 | 52.2 | 37.1 | 58.0 | 48.8 |
| 128-bit | 55.3 | 46.2 | 54.5 | 48.4 | 57.7 | 44.5 | 67.9 | 60.3 |
| 256-bit | 57.6 | 52.9 | 56.8 | 50.3 | 61.6 | 49.6 | 71.4 | 63.5 |
| 512-bit | 60.5 | 58.2 | 61.4 | 57.8 | 67.5 | 59.3 | 75.1 | 65.5 |
| 1024-bit | 61.7 | 60.4 | 65.6 | 61.2 | 70.7 | 66.8 | 76.9 | 69.7 |
| 2048-bit | 69.4 | 68.8 | 68.9 | 68.1 | 75.8 | 73.5 | 80.3 | 77.6 |

Table 1: Rank-1 matching rates (%) of real-valued and binary features on different datasets by using the proposed ABC embedded triplet network.

We implement our framework based on the PyTorch deep learning library. The hardware environment is a PC with Intel Core CPUs (3.4GHz), 32 GB memory, and an NVIDIA GTX TITAN X GPU. For all the datasets, the images are horizontally flipped to augment training samples. The batch size is set to 64 in the pre-training phase and altered to 128 in the subsequent training. The learning rate of the extractors in the experiments is initialized to 0.001 and decreased to 0.0001 with the iterations. The learning rate of the discriminator is consistently set to 0.01. To ensure the stability, we update the GAN 10 iterations alone after every 20 global optimization iterations. Every GAN iteration consists of 5 iterations of discriminator updating and 1 iteration of generator (extractor) updating. In the experiments, we re-run the comparison methods if the codes are publicly available to evaluate their efficiencies for fair comparisons.
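The alternating update pattern described above can be made concrete with a small generator. This is only one reading of the schedule: it assumes that a "global optimization iteration" is a single joint update and that the GAN-only phase runs right after every 20th global iteration. The function and label names are hypothetical.

```python
def training_schedule(total_global_iters, gan_every=20, gan_iters=10,
                      d_steps=5, g_steps=1):
    """Yield the sequence of update types implied by the described schedule."""
    for it in range(1, total_global_iters + 1):
        yield "global"                      # joint update of the whole framework
        if it % gan_every == 0:
            for _ in range(gan_iters):      # GAN-only phase
                for _ in range(d_steps):
                    yield "discriminator"
                for _ in range(g_steps):
                    yield "extractor"

steps = list(training_schedule(40))
print(steps.count("global"), steps.count("discriminator"), steps.count("extractor"))
```

For 40 global iterations this yields two GAN-only phases, that is 100 discriminator updates and 20 extractor updates in addition to the 40 joint updates.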

4.2 Evaluation of Computation and Storage Efficiency

(a) Triplet loss
(b) W loss of D
(c) W loss of G
Figure 5: Curves of training losses on CUHK03, Market-1501, and DukeMTMC-reID.

We first evaluate the efficiency of our method with different bit lengths, since shorter binary codes are more efficient but may reduce accuracy, while longer codes do the opposite. The time of retrieving a query (Q. Time) and the memory for storing the gallery features (Mem.) are shown in Fig. 4. As can be seen, the query time and memory consumed by binary codes are far less than those of real-valued features. The matching time and memory of real-valued features also grow significantly faster with the bit length than those of the binary features. Besides, we compare the rank-1 matching rates of real-valued and binary features in Table 1. As we can see from the last two rows of the table, binarized features with more bits (1024 or 2048) perform only slightly worse than the corresponding real-valued features, which demonstrates that with sufficient capacity, the discriminative information is well preserved by the binary features using our method. It is also noteworthy that even with 2048 bits, our binary features require much less query time and memory than their real-valued counterparts.
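For reference, the rank-k matching rate and the average precision of a single query can be computed from a ranked gallery list as sketched below. This is a simplified illustration with hypothetical helper names; benchmark protocols add further rules (for example, which gallery images are excluded for a given query) that are not modeled here.

```python
def rank_k_hit(ranked_ids, query_id, k=1):
    """True if a gallery item with the query identity appears in the top k."""
    return query_id in ranked_ids[:k]

def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each rank where a correct match occurs."""
    hits, precisions = 0, []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Gallery identities sorted by increasing distance to the query.
ranked_ids = ["p7", "p3", "p7", "p1"]
print(rank_k_hit(ranked_ids, "p7"), average_precision(ranked_ids, "p7"))
```

The rank-1 matching rate is the fraction of queries for which `rank_k_hit(..., k=1)` is true, and mAP is the mean of the per-query average precision.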

In addition to the matching time, the time consumed by feature extraction (F. Time) should also be taken into consideration. With the data scale in ReID getting larger, it is necessary to process a large number of queries in a short time. Therefore, we compare the time consumed by extracting features of our method with two state-of-the-art approaches, namely Local Maximal Occurrence Representation (LOMO) [2] and Hierarchical Gaussian Descriptor (GOG) [3], which are widely adopted by traditional ReID methods. As shown in Table 2, our method extracts features much faster than LOMO and GOG.

Furthermore, we provide the training losses of our framework with a bit length of 2048 in Fig. 5. We can observe that the losses corresponding to the GAN descend steadily as training proceeds. The triplet loss on Market-1501 is well optimized at every margin value. The triplet losses on CUHK03 and DukeMTMC-reID fluctuate at certain margin values; nevertheless, they reach a steady state by the end of training.

| Method | CUHK03 | Market-1501 | DukeMTMC-reID |
|---|---|---|---|
| LOMO [2] | 7.50e+00 | 2.96e+02 | 2.65e+02 |
| GOG [3] | 7.10e+02 | 2.80e+04 | 2.51e+04 |
| 64-bit ABC+triplet | 5.70e-01 | 5.93e+00 | 5.05e+00 |
| 128-bit ABC+triplet | 5.70e-01 | 6.11e+00 | 5.09e+00 |
| 256-bit ABC+triplet | 5.71e-01 | 6.25e+00 | 5.17e+00 |
| 512-bit ABC+triplet | 5.72e-01 | 6.30e+00 | 5.19e+00 |
| 1024-bit ABC+triplet | 5.73e-01 | 6.38e+00 | 5.32e+00 |
| 2048-bit ABC+triplet | 5.77e-01 | 6.49e+00 | 5.33e+00 |

Table 2: Time costs (seconds) of feature extraction for the gallery images.

| Method | Rank 1 | Rank 5 | Rank 20 | mAP | F. Time (s) | Q. Time (s) | Mem. (MB) |
|---|---|---|---|---|---|---|---|
| DRSCH [33] | 22.0 | 48.4 | 81.0 | - | - | - | - |
| DSRH [32] | 14.4 | 43.4 | 79.2 | - | - | - | - |
| CSBT [9] | 55.5 | 84.3 | 98.0 | - | 7.50e+00 | 8.07e-04 | 1.01e-02 |
| 512-bit ABC+triplet (Ours) | 58.2 | 85.7 | 98.2 | 59.7 | 5.72e-01 | 8.07e-04 | 1.01e-02 |

Table 3: Matching rates (%), mAP (%), feature extraction time (seconds), average query time (seconds) and memory usage (megabytes) for storing gallery data, compared with hashing-based ReID methods on CUHK03 (labelled).

| Method | Rank 1 | mAP | F. Time (s) | Q. Time (s) | Mem. (MB) |
|---|---|---|---|---|---|
| CSBT [9] | 42.9 | 20.3 | 2.96e+02 | 2.54e-02 | 1.52e+00 |
| 512-bit ABC+triplet (Ours) | 59.3 | 43.8 | 6.30e+00 | 2.54e-02 | 1.52e+00 |

Table 4: Matching rates (%), mAP (%), feature extraction time (seconds), average query time (seconds) and memory usage (megabytes) for storing gallery data, compared with hashing-based ReID methods on Market-1501.

4.3 Comparison with Binary Coding based Methods

Here we compare our framework with the following state-of-the-art binary coding (hashing) based ReID methods: 1) Deep hashing: Deep Regularized Similarity Comparison Hashing (DRSCH) [33], Deep Semantic Ranking based Hashing (DSRH) [32], and 2) Non-deep hashing: Cross-camera Semantic Binary Transformation (CSBT) [9]. Since CSBT has already significantly outperformed other hashing methods (SePH [51], COSDISH [52] and SDH [53]) in ReID according to the results reported in [9], we mainly compare our method with CSBT. The results on different datasets are shown in Tables 3 and 4, respectively, where the best performance is highlighted in red and the second best in blue.

From Table 3, we can observe that DRSCH and DSRH perform poorly on CUHK03, falling behind the other methods. Our method outperforms the state-of-the-art hashing based ReID method CSBT. The superiority of our method over CSBT becomes greater on the larger dataset, namely Market-1501, as can be seen in Table 4. Our rank-1 accuracy is 16.4 percentage points higher (59.3% vs. 42.9%), and our mean average precision (mAP) is more than double that of CSBT (43.8% vs. 20.3%). This is probably because Market-1501 has much more training data than CUHK03, and the projection learning based CSBT can hardly handle such an amount of data at once. In contrast, our method optimizes the network with mini-batch learning, which makes it possible to train the model on large amounts of data. Moreover, CSBT requires extracting LOMO features in advance, which is much slower than extracting binary codes using our method.

4.4 Comparison with the State-of-the-Art Methods

| Method | Rank 1 | Rank 5 | Rank 20 | mAP | F. Time (s) | Q. Time (s) | Mem. (MB) |
|---|---|---|---|---|---|---|---|
| DeepReID [13] | 19.9 | 49.8 | 78.2 | - | - | - | - |
| Improved Deep [21] | 44.9 | 76.4 | 93.6 | - | - | - | - |
| NSL [54] | 54.7 | 84.8 | 95.2 | - | - | - | 2.11e+01 |
| Gated CNN [29] | 61.8 | 80.9 | - | 51.3 | - | - | - |
| EDM [28] | 51.9 | 83.6 | - | - | - | 2.85e-02 | 4.90e-01 |
| SIR+CIR [6] | 52.2 | - | - | - | - | 1.26e-02 | 5.16e-01 |
| Re-ranking [55] | 58.5 | - | - | 64.7 | 7.50e+00 | - | 1.03e+02 |
| SSM [56] | 72.7 | 92.4 | - | - | 7.18e+02 | 6.00e-01 | 2.08e+02 |
| Part-aligned [57] | 81.6 | 97.3 | 99.5 | - | - | - | - |
| MuDeep [58] | 75.6 | 94.4 | - | - | - | - | - |
| PDC [59] | 78.3 | 94.8 | 98.4 | - | - | - | - |
| ABC+triplet (Ours) | 68.1 | 90.3 | 98.3 | 61.6 | 5.77e-01 | 1.85e-03 | 1.22e-01 |

Table 5: Matching rates (%), mAP (%), feature extraction time (seconds), average query time (seconds) and memory usage (megabytes) for storing gallery data, compared with state-of-the-art non-hashing methods on CUHK03 (detected).

| Method | Rank 1 | mAP | F. Time (s) | Q. Time (s) | Mem. (MB) |
|---|---|---|---|---|---|
| SDALF [1] | 20.5 | 8.2 | - | 2.53e+02 | 9.31e+01 |
| KISSME [16] | 40.5 | 19.0 | - | - | - |
| eSDC [60] | 33.5 | 13.5 | 1.58e+04 | 7.47e+02 | 1.45e+03 |
| BoW (best) [14] | 44.4 | 21.9 | - | 2.08e+00 | 1.54e+01 |
| LOMO+NSL [54] | 55.4 | 29.9 | 2.96e+02 | - | 4.06e+03 |
| Gated CNN [29] | 65.9 | 39.6 | - | - | 4.32e+01 |
| SCSP [61] | 51.9 | 26.4 | - | - | 1.21e+02 |
| SpindleNet [62] | 76.9 | - | - | - | - |
| Re-ranking [55] | 77.1 | 63.6 | 2.96e+02 | - | 4.06e+03 |
| SSM [56] | 82.2 | 68.8 | 2.83e+04 | 1.68e+02 | 8.21e+03 |
| Part-aligned [57] | 81.0 | 63.4 | - | - | - |
| PDC [59] | 84.1 | 63.4 | - | - | - |
| ABC+triplet (Ours) | 73.5 | 52.9 | 6.49e+00 | 7.32e-02 | 4.82e+00 |

Table 6: Matching rates (%), mAP (%), feature extraction time (seconds), average query time (seconds) and memory usage (megabytes) for storing gallery data, compared with state-of-the-art non-hashing methods on Market-1501.

| Method | Rank 1 | mAP | F. Time (s) | Q. Time (s) | Mem. (MB) |
|---|---|---|---|---|---|
| BoW+KISSME [14] | 25.1 | 12.1 | - | - | - |
| LOMO+XQDA [2] | 30.8 | 17.0 | 2.65e+02 | - | 3.62e+03 |
| Basel.+LSRO [15] | 67.7 | 47.1 | - | - | - |
| Basel.+OIM [63] | 68.1 | - | - | - | - |
| SVDNet [64] | 76.7 | 56.8 | - | - | - |
| ABC+triplet (Ours) | 77.6 | 47.9 | 5.33e+00 | 6.58e-02 | 4.31e+00 |

Table 7: Matching rates (%), mAP (%), feature extraction time (seconds), average query time (seconds) and memory usage (megabytes) for storing gallery data, compared with state-of-the-art non-hashing methods on DukeMTMC-reID.

We also compare our method (2048-bit ABC+triplet) with the state-of-the-art non-hashing ReID methods. The methods mainly include: 1) Deep learning based methods, such as DeepReID [13], Improved Deep [21], SIR+CIR [6], EDM [28], Gated CNN [29], SpindleNet [62], SVDNet [64], Deeply-learned Part-Aligned Representation [57], Multi-scale Deep Learning Architectures [58], and Pose-Driven Deep Convolutional Model [59]; 2) Metric learning based methods, such as KISSME [16], XQDA [2], and NSL [54]; 3) Local patch matching based methods, such as eSDC [60] and BoW [14]; 4) Other ReID methods.

The comparison results on the three datasets are shown in Tables 5, 6, and 7, respectively. Our framework achieves competitive matching accuracy compared to the state-of-the-art methods, which adopt high-dimensional real-valued features. It is also obvious that our framework not only outperforms many existing non-hashing approaches, but also achieves significant advantages in terms of matching efficiency. The advantages become more pronounced as the gallery set grows. For instance, the query time of ABC is at least dozens of times shorter than that of the non-hashing methods on Market-1501 with 19,732 gallery samples. Several methods adopt LOMO, which represents images as 26,960-dim real-valued features. In contrast, our method represents images as 2048-bit binary codes, which require far less memory.
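The memory gap can be reproduced with simple arithmetic. The sketch below assumes that one MB equals 2^20 bytes and that LOMO features are stored as 8-byte floats; neither is stated in the text, but under these assumptions the results agree with the Market-1501 memory figures of roughly 4.82 MB for our codes and 4.06e+03 MB for LOMO-based methods.

```python
def gallery_megabytes(num_samples, bytes_per_sample):
    """Gallery storage in MB, assuming 1 MB = 2**20 bytes."""
    return num_samples * bytes_per_sample / 2**20

GALLERY_SIZE = 19_732                                      # Market-1501 gallery samples
binary_mb = gallery_megabytes(GALLERY_SIZE, 2048 // 8)     # 2048 bits = 256 bytes
lomo_mb = gallery_megabytes(GALLERY_SIZE, 26_960 * 8)      # assumes 8-byte floats
print(f"{binary_mb:.2f} MB vs {lomo_mb:.0f} MB")           # prints 4.82 MB vs 4059 MB
```

Under these assumptions a 2048-bit code needs 256 bytes per image, compared with about 216 KB for one LOMO descriptor, a difference of more than 800 times.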

4.5 Effects of Different Network Settings

In this section, we embed the proposed ABC into different similarity measuring networks and evaluate the performance under different settings. We first evaluate two types of networks widely used to measure similarity, namely the siamese network [28, 27] and the triplet network. A siamese network receives a pair of images and minimizes the distance between them if they are from the same class and maximizes the distance if they have different labels. The siamese network evaluated in our experiments adopts the same ResNet-50 backbone model as the triplet network and employs the contrastive loss to measure similarity.

From Table 8, we can observe that the siamese network performs worse than the triplet network. This is because the loss of the siamese network is too strict: it forces images of one identity to be projected onto a single point in the subspace. In contrast, the triplet loss allows the images of one person to lie on a manifold, while enforcing larger distances between different persons' images. We can also observe that embedding the ABC into a triplet network achieves better results than embedding it into a siamese network.
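
The difference in strictness can be illustrated with a small Python sketch. It assumes the commonly used forms of the two losses (a squared-distance contrastive loss with a margin on negative pairs, and a hinge-style triplet loss on Euclidean distances); the exact formulations, margins, and toy 2-D features used here are assumptions and may differ from those in our experiments.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def contrastive_loss(x, y, same_identity: bool, margin: float = 1.0) -> float:
    """Pull same-identity pairs to distance 0; push others beyond the margin."""
    d = euclidean(x, y)
    if same_identity:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Only require the negative to be farther than the positive by a margin."""
    return max(0.0,
               euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D features (hypothetical): the positive is close to the anchor but
# not identical, and the negative is far away.
anchor, positive, negative = (0.0, 0.0), (0.3, 0.4), (3.0, 4.0)

pair_loss = contrastive_loss(anchor, positive, same_identity=True)
trip_loss = triplet_loss(anchor, positive, negative)

print(pair_loss, trip_loss)
```

In this toy case the contrastive loss on the positive pair stays positive until the two features coincide, whereas the triplet loss is already zero because the negative is farther away than the positive by more than the margin.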

                         CUHK03 (labelled)  CUHK03 (detected)  Market-1501        DukeMTMC-reID
Method                   Rank 1   mAP       Rank 1   mAP       Rank 1   mAP       Rank 1   mAP
2048-dim siamese         62.0     59.4      61.5     56.1      67.2     48.8      75.0     43.8
2048-dim triplet         70.8     66.6      69.7     63.1      75.2     54.1      82.0     48.8
2048-bit ABC+siamese     49.2     35.4      45.6     31.2      52.7     21.8      56.9     29.7
2048-bit ABC+triplet     52.3     43.7      50.9     38.1      55.8     27.5      60.3     27.6
2048-bit ABC+siamese+    61.7     60.4      61.6     60.2      65.7     48.1      70.9     41.7
2048-bit ABC+triplet+    68.8     64.5      68.1     61.6      73.5     52.9      77.6     47.9
Table 8: Matching rates (%) and mAP (%) of the proposed ABC embedded networks in terms of different settings.
Figure 6: Triplet losses with and without the fine-tuning phase on pre-trained models on (a) CUHK03, (b) Market-1501, and (c) DukeMTMC-reID. The bit length is 2,048, and the backbone model is ResNet-50.

As explained in Section 3.2, we normalize both the generated features and the binary codes to the same scale to eliminate the conflict between the two modules. Here we also compare the networks with and without this normalization. As can be seen from Table 8, the performance is significantly improved after the normalization.

In addition, we evaluate the effect of fine-tuning on the three datasets. Fig. 6 shows that the triplet losses with fine-tuning converge faster than those without fine-tuning. Since the employed ResNet-50 network is pre-trained on ImageNet, it already captures a variety of useful image features. Fine-tuning the network further enables learning features specialized for person representation more efficiently than training the network from scratch.

5 Conclusion

In this work, we proposed the adversarial binary coding (ABC) framework for efficient person re-identification, which generates discriminative and efficient binary features from pedestrian images. Specifically, ABC trains a discriminator network to distinguish real-valued features from binary ones, in order to guide the feature extractor network to generate features in binary form under the Wasserstein loss. The ABC framework was further embedded into a deep triplet network to preserve the semantic information of the binary features for the ReID task. Extensive experiments on three large-scale ReID datasets showed that our method outperformed the state-of-the-art hashing based ReID approaches and was competitive with the state-of-the-art non-hashing ReID approaches, whilst significantly reducing time and memory costs. Considering that the triplet network has been overtaken by more recently proposed network architectures, one possible future improvement of this work is to combine the ABC framework with other, more sophisticated similarity measuring frameworks.


  • [1] Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by symmetry-driven accumulation of local features. In: CVPR. (2010)
  • [2] Liao, S., Hu, Y., Zhu, X., Li, S.Z.: Person re-identification by local maximal occurrence representation and metric learning. In: CVPR. (2015)
  • [3] Matsukawa, T., Okabe, T., Suzuki, E., Sato, Y.: Hierarchical gaussian descriptor for person re-identification. In: CVPR. (2016)
  • [4] Liu, K., Ma, B., Zhang, W., Huang, R.: A spatio-temporal appearance representation for video-based pedestrian re-identification. In: ICCV. (2015)
  • [5] Shi, Z., Hospedales, T.M., Xiang, T.: Transferring a semantic representation for person re-identification and search. In: CVPR. (2015)
  • [6] Wang, F., Zuo, W., Lin, L., Zhang, D., Zhang, L.: Joint learning of single-image and cross-image representations for person re-identification. In: CVPR. (2016)
  • [7] Zheng, L., Yang, Y., Hauptmann, A.G.: Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984 (2016)
  • [8] Zheng, F., Shao, L.: Learning cross-view binary identities for fast person re-identification. In: IJCAI. (2016) 2399–2406
  • [9] Chen, J., Wang, Y., Qin, J., Liu, L., Shao, L.: Fast person re-identification via cross-camera semantic binary transformation. In: CVPR. (2017)
  • [10] Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS. (2009)
  • [11] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
  • [12] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • [13] Li, W., Zhao, R., Xiao, T., Wang, X.: Deepreid: Deep filter pairing neural network for person re-identification. In: CVPR. (2014)
  • [14] Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: A benchmark. In: ICCV. (2015)
  • [15] Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: ICCV. (2017)
  • [16] Koestinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H.: Large scale metric learning from equivalence constraints. In: CVPR. (2012)
  • [17] Pedagadi, S., Orwell, J., Velastin, S., Boghossian, B.: Local fisher discriminant analysis for pedestrian re-identification. In: CVPR. (2013)
  • [18] Prosser, B., Zheng, W.S., Gong, S., Xiang, T.: Person re-identification by support vector ranking. In: BMVC. Volume 2. (2010)
  • [19] Lisanti, G., Masi, I., Bagdanov, A.D., Del Bimbo, A.: Person re-identification by iterative re-weighted sparse ranking. IEEE TPAMI 37(8) (2015) 1629–1642
  • [20] Lan, R., Zhou, Y., Tang, Y.Y.: Quaternionic local ranking binary pattern: A local descriptor of color images. IEEE TIP 25(2) (2016) 566–579
  • [21] Ahmed, E., Jones, M., Marks, T.K.: An improved deep learning architecture for person re-identification. In: CVPR. (2015)
  • [22] Xiao, T., Li, H., Ouyang, W., Wang, X.: Learning deep feature representations with domain guided dropout for person re-identification. In: CVPR. (2016)
  • [23] Zhou, S., Wang, J., Wang, J., Gong, Y., Zheng, N.: Point to set similarity based deep feature learning for person re-identification. In: CVPR. (2017)
  • [24] Lin, J., Ren, L., Lu, J., Feng, J., Zhou, J.: Consistent-aware deep learning for person re-identification in a camera network. In: CVPR. (2017)
  • [25] Panda, R., Bhuiyan, A., Murino, V., Roy-Chowdhury, A.K.: Unsupervised adaptive re-identification in open world dynamic camera networks. In: CVPR. (2017)
  • [26] Chen, W., Chen, X., Zhang, J., Huang, K.: Beyond triplet loss: A deep quadruplet network for person re-identification. In: CVPR. (2017)
  • [27] Yi, D., Lei, Z., Liao, S., Li, S.Z.: Deep metric learning for person re-identification. In: ICPR. (2014)
  • [28] Shi, H., Yang, Y., Zhu, X., Liao, S., Lei, Z., Zheng, W., Li, S.Z.: Embedding deep metric for person re-identification: A study against large variations. In: ECCV. (2016)
  • [29] Varior, R.R., Haloi, M., Wang, G.: Gated siamese convolutional neural network architecture for human re-identification. In: ECCV. (2016)
  • [30] Chen, S.Z., Guo, C.C., Lai, J.H.: Deep ranking for person re-identification via joint representation learning. IEEE TIP 25(5) (2016) 2353–2367
  • [31] Cheng, D., Gong, Y., Zhou, S., Wang, J., Zheng, N.: Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: CVPR. (2016)
  • [32] Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for multi-label image retrieval. In: CVPR. (2015)
  • [33] Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE TIP 24(12) (2015) 4766–4779
  • [34] Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., Belongie, S.: Stacked generative adversarial networks. In: CVPR. (2017)
  • [35] Huang, R., Zhang, S., Li, T., He, R., et al.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086 (2017)
  • [36] Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: ICCV. (2017)
  • [37] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. (2017)
  • [38] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. (2017)
  • [39] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
  • [40] Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. arXiv preprint arXiv:1605.09782 (2016)
  • [41] Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A.: Adversarially learned inference. arXiv preprint arXiv:1606.00704 (2016)
  • [42] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein gan. arXiv preprint arXiv:1701.07875 (2017)
  • [43] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017)
  • [44] Qiu, Z., Pan, Y., Yao, T., Mei, T.: Deep semantic hashing with generative adversarial networks. In: SIGIR, ACM (2017)
  • [45] Song, J.: Binary generative adversarial networks for image retrieval. In: AAAI. (2018)
  • [46] Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017)
  • [47] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
  • [48] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC. (2016)
  • [49] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR. (2009)
  • [50] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: ECCV Workshop. (2016)
  • [51] Lin, Z., Ding, G., Hu, M., Wang, J.: Semantics-preserving hashing for cross-view retrieval. In: CVPR. (2015)
  • [52] Kang, W.C., Li, W.J., Zhou, Z.H.: Column sampling based discrete supervised hashing. In: AAAI. (2016) 1230–1236
  • [53] Shen, F., Shen, C., Liu, W., Shen, H.T.: Supervised discrete hashing. In: CVPR. Volume 2. (2015)
  • [54] Zhang, L., Xiang, T., Gong, S.: Learning a discriminative null space for person re-identification. In: CVPR. (2016)
  • [55] Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. In: CVPR. (2017)
  • [56] Bai, S., Bai, X., Tian, Q.: Scalable person re-identification on supervised smoothed manifold. In: CVPR. (2017)
  • [57] Zhao, L., Li, X., Zhuang, Y., Wang, J.: Deeply-learned part-aligned representations for person re-identification. In: ICCV. (2017)
  • [58] Qian, X., Fu, Y., Jiang, Y.G., Xiang, T., Xue, X.: Multi-scale deep learning architectures for person re-identification. In: ICCV. (2017)
  • [59] Su, C., Li, J., Zhang, S., Xing, J., Gao, W., Tian, Q.: Pose-driven deep convolutional model for person re-identification. In: ICCV. (2017)
  • [60] Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: CVPR. (2013)
  • [61] Chen, D., Yuan, Z., Chen, B., Zheng, N.: Similarity learning with spatial constraints for person re-identification. In: CVPR. (2016)
  • [62] Zhao, H., Tian, M., Sun, S., Shao, J., Yan, J., Yi, S., Wang, X., Tang, X.: Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In: CVPR. (2017)
  • [63] Xiao, T., Li, S., Wang, B., Lin, L., Wang, X.: Joint detection and identification feature learning for person search. In: CVPR. (2017)
  • [64] Sun, Y., Zheng, L., Deng, W., Wang, S.: Svdnet for pedestrian retrieval. In: ICCV. (2017)