Attributes Guided Feature Learning for Vehicle Re-identification

05/22/2019 · by Aihua Zheng et al.

Vehicle Re-ID has recently attracted enthusiastic attention due to its potential applications in smart cities and urban surveillance. However, it suffers from large intra-class variation caused by view variations and illumination changes, and from inter-class similarity, especially for different identities with similar appearance. To handle these issues, in this paper we propose a novel deep network architecture guided by meaningful attributes, including camera views, vehicle types and colors, for vehicle Re-ID. In particular, our network is trained end-to-end and contains three subnetworks of deep features embedded by the corresponding attributes (i.e., camera view, vehicle type and vehicle color). Moreover, to overcome the shortage of vehicle images from different views, we design a view-specified generative adversarial network to generate multi-view vehicle images. For network training, we annotate the view labels on the VeRi-776 dataset. Note that one can directly adopt the pre-trained view (as well as type and color) subnetworks on other datasets with only ID information, which demonstrates the generalization ability of our model. Extensive experiments on the benchmark datasets VeRi-776 and VehicleID show that the proposed approach achieves promising performance and sets a new state-of-the-art for vehicle Re-ID.


I Introduction

Vehicle re-identification (Re-ID) is a frontier and important research problem in computer vision with many potential applications, such as intelligent transportation, urban surveillance and security, since the vehicle is one of the most important objects in urban surveillance. The aim of vehicle Re-ID is to identify the same vehicle across non-overlapping cameras. Although the license plate can uniquely identify a vehicle, it is often unrecognizable in real-life surveillance due to motion blur, challenging camera views, low resolution, etc. Some researchers have explored spatial-temporal information [31, 25, 18, 21] to boost the performance of appearance-based vehicle Re-ID. However, it is difficult to obtain complete spatial-temporal information since a vehicle may appear in only a few cameras of a large-scale camera network. Therefore, prevalent vehicle Re-ID methods still focus on appearance-based models.

Fig. 1: Demonstration of two major challenges in vehicle Re-ID: 1) Intra-class difference: the same vehicle "ID1" appears totally different under Camera1 and Camera3 due to the view variance. 2) Inter-class similarity: the two different vehicles "ID1" under Camera1 and "ID2" under Camera2 have extremely similar visual appearance, especially for vehicles from the same manufacturer.

Extensive works have been dedicated to person Re-ID in the past decade [42, 13, 19, 32, 4, 2, 27, 9], which fall into two mainstreams: (1) Appearance modelling [13, 19, 32], which develops robust feature descriptors to cope with the various changes and occlusions across different camera views. (2) Learning-based methods [42, 1, 16, 2, 27, 9], which learn metric distances to mitigate the gap between low-level features and high-level semantics. Recently, deep neural networks have made marvellous progress on both feature learning [42, 27, 9] and metric learning [1, 16, 2] for person Re-ID. However, directly employing person Re-ID models for vehicle Re-ID cannot guarantee satisfactory performance, since the appearance of pedestrians and vehicles varies in different manners across viewpoints.

Although much progress has been made on vehicle Re-ID [20, 15, 18, 21, 37, 47, 25], it encounters additional challenges beyond the common ones in person Re-ID, such as occlusion and illumination. The first crucial challenge of vehicle Re-ID is the large intra-class variation caused by viewpoint variation across different cameras, which has been widely explored in person Re-ID [24, 38, 39, 26]. This issue is even more challenging in vehicle Re-ID, since most vehicle images under a certain camera are captured in almost the same viewpoint due to the rigid motion of vehicles, as shown in Fig. 1, and directly employing methods from person Re-ID cannot guarantee satisfactory performance because the appearance of persons and vehicles is distributed quite differently. Some vehicle Re-ID methods [45, 46] use adversarial learning schemes to generate multi-view images or features from a single image, and can thus address the challenge of view variation to some extent. However, they may struggle to distinguish different vehicles with very similar appearance. Furthermore, they neglect attribute information, such as type and color, which would provide critical cues for boosting the performance of vehicle Re-ID.

Fig. 2: Benefits of camera views, types and colors for vehicle Re-ID. The brown dashed box shows five vehicles characterized by distinct camera views, types and colors, their visualized feature representations and the 2D feature projections of the images from the corresponding identities, where different colors represent different vehicles. The blue dashed box shows several ranking results of conventional vehicle Re-ID based on ResNet-50 [24], where the red and green solid boxes among the first 15 ranks indicate wrong and right matchings, respectively. The results show that extra semantics or attributes play a critical role in handling the challenges of vehicle Re-ID.

The second challenge is the high inter-class similarity, especially for different identities with similar appearance, as shown in Fig. 1. Incorporating attribute information has been shown to yield more discriminative representations for person Re-ID [27, 28, 9, 11]. Therefore, it is essential to learn deep features under the supervision of attributes in vehicle Re-ID, enforcing the same identity to have consistent attributes. Li et al. [15] introduce attribute recognition into the vehicle Re-ID framework and use extra semantic information to assist vehicle identification, especially for different identities with similar appearance. However, none of these methods handles both challenges (intra-class difference and inter-class similarity) simultaneously, and the performance of vehicle Re-ID is thus limited.

In this paper, we attempt to handle the above issues in a unified deep convolutional framework that jointly learns Deep Feature representations guided by meaningful attributes, including Camera Views, vehicle Types and Colors (DF-CVTC), for vehicle Re-ID. Attribute information has been successfully investigated as mid-level semantics to boost person Re-ID, and it can also help vehicle Re-ID in challenging scenarios. First of all, the camera view is one of the key attributes and challenges in Re-ID. As shown in Fig. 2, the query vehicle image may have completely different views from its counterparts under other cameras, such as queries Q1 and Q2 and their right ranks marked with green solid boxes. Second, vehicle types and colors, as the representative attributes of vehicles, play an important role in vehicle Re-ID, especially for different vehicles with similar appearance. As shown in Fig. 2, the wrong hits of queries Q3 and Q4, which present similar appearance, could be effectively evaded by the vehicle type attribute. Furthermore, vehicles with different colors may present similar shapes (such as the wrong hits at rank 1 and rank 2 of query Q1), similar overall appearance (such as the wrong hits of query Q4), or even similar color (such as query Q5 in white while the wrong hits at ranks 2-4 are gray) due to illumination changes. Integrating the color attribute may relieve this inter-class similarity. These challenges motivate us to utilize the above attributes to help the classifier distinguish different vehicles with very similar appearance and also identify the same vehicle under different viewpoints.

Meanwhile, as we observe, most vehicle images under a certain camera are captured in almost the same viewpoint due to the rigid motion of vehicles, as shown in Fig. 1, and thus the number of vehicle images with different views is very limited, which brings a big challenge to training deep networks. To handle this issue, we design a view-specified generative adversarial network (VS-GAN) to generate multi-view vehicle images. It is worth noting that we jointly learn deep features, camera views, vehicle types and colors in an end-to-end framework. Finally, we annotate the view labels on the benchmark dataset for network training, and the trained subnetworks can be directly used on other datasets with only ID labels.

To the best of our knowledge, this is the first work to collaboratively learn deep features guided by three attributes simultaneously for vehicle Re-ID. In summary, this paper makes the following contributions to vehicle Re-ID and related applications:

  • It proposes a unified attributes guided deep learning framework that jointly learns Deep Feature representations, Camera Views, vehicle Types and Colors (DF-CVTC) for vehicle Re-ID. These components collaborate with each other and thus boost the discrimination ability of the learnt representations.

  • To enhance the diversity of view data, it develops a vehicle generation model, i.e., VS-GAN, to generate multi-view vehicle images. With both the synthesized multi-view images and the original single-view images, our network can better mitigate the view differences caused by cross-view cameras.

  • In addition to the type and color labels, we annotate the view labels for the benchmark vehicle Re-ID dataset VeRi-776 for view predictor training, which can be easily employed in situations with only ID information in vehicle Re-ID. We will release the view label annotations to the public for free academic usage.

  • Comprehensive experimental evaluations on two benchmark datasets, i.e., VeRi-776 and VehicleID, demonstrate the promising performance of the proposed method, which sets a new state-of-the-art for vehicle Re-ID.

II Related Work

II-A Vehicle Re-ID

With the great progress in person Re-ID [44, 43, 3, 33, 48, 30], vehicle Re-ID has gradually gained a lot of attention recently, since vehicles are the most important objects in urban surveillance. Liu et al. [20, 18, 21] released the benchmark dataset VeRi-776 and treated vehicle Re-ID as a progressive recognition process using visual features, license plates and spatial-temporal information. Liu et al. [18] released another large surveillance-nature dataset (VehicleID) and designed a coupled clusters loss to measure the distance between two arbitrary similar vehicles. Zhang et al. [37] designed an improved triplet-wise training scheme with a classification-oriented loss. Li et al. [15] integrated identification, attribute recognition, verification and triplet tasks into a unified CNN framework. Guo et al. [6] proposed a coarse-to-fine ranking method consisting of a vehicle model classification loss, a coarse-grained ranking loss, a fine-grained ranking loss and a pairwise loss.

In addition to appearance information, Shen et al. [25] combined visual spatio-temporal path information for regularization, and Wang et al. [31] introduced spatial-temporal regularization into their orientation-invariant feature embedding module. However, the large intra-class variation caused by viewpoint variation and the high inter-class similarity between different vehicles have not been well addressed in existing works.

II-B View-aware Re-ID

Viewpoint changes introduce a large portion of the intra-class variation in person Re-ID. Zhao et al. [38] proposed a human body region guided method for person Re-ID, which boosts performance considerably. Wu et al. [34] proposed an approach based on pose priors to make identification more robust to viewpoint. Zheng et al. [40] introduced the PoseBox structure, which is generated through pose estimation followed by affine transformations.

This issue is even more crucial in vehicle Re-ID, since the viewpoints of the images under a certain camera are almost the same due to the rigid motion of vehicles. Wang et al. [31] proposed an orientation-invariant feature embedding to address the influence of viewpoint variation. Prokaj et al. [22] presented a method based on pose estimation to deal with multiple viewpoints.

II-C Attribute Embedded Re-ID

Attributes have been extensively investigated as mid-level semantic information to boost person Re-ID. Su et al. [27] introduced a low-rank attribute embedding into a multi-task learning framework for person Re-ID. Khamis et al. [9] jointly optimized an attribute classification loss and a triplet loss for person Re-ID. Lin et al. [17] integrated the identification loss and attribute prediction into a simple ResNet framework and annotated pedestrian attributes on two benchmark person Re-ID datasets, Market-1501 and DukeMTMC-reID. Su et al. [29] proposed a weakly supervised multi-type attribute learning framework based on the triplet loss by pre-training the attribute predictor on independent data. In contrast to previous works focusing on image-based queries, Li et al. [12] and Yin et al. [36] investigated attribute-based queries for person retrieval and Re-ID tasks.

In vehicle Re-ID, Li et al. [15] introduced attribute recognition into the vehicle Re-ID framework together with a verification loss and a triplet loss. However, how to incorporate view-aware identification and attribute recognition into a unified framework has not yet been investigated.

II-D GAN-based Re-ID

As one of the hottest research directions in the current deep learning field, GAN [5] has been intensely explored for image generation [14], data enhancement [7], style transfer [44] and other aspects. Recently, some works have also started to apply GANs to person Re-ID. Zheng et al. [42] explored GANs to generate new unlabeled samples for data augmentation in person Re-ID. Zhong et al. [44] introduced a camera style (CamStyle) method, which can be viewed as a data augmentation approach that smooths camera style disparities. Qian et al. [23] used a GAN to generate eight pre-defined poses for each image, which augments the data and addresses viewpoint variation to some extent. Liu et al. [19] transferred various person pose instances from one dataset to another to improve the generalization ability of the model. Wei et al. [32, 4] proposed GAN models to bridge the domain gap among different person Re-ID datasets.

Some researchers proposed to use GANs to generate multi-view images or features to relieve view variation in vehicle Re-ID. Zhou et al. [45] designed a conditional generative network to obtain cross-view images from input view pairs for vehicle Re-ID. Later on, Zhou et al. [46] proposed a Viewpoint-aware Attentive Multi-view Inference (VAMI) model to infer multi-view features from single-view image inputs. They used image pairs for training, while our method employs a classification CNN and jointly learns deep features together with camera views, vehicle types and colors. By learning view- and attribute-specified deep features, our method is superior to the above methods.

Fig. 3: Overview of our DF-CVTC. The view transform model generates multi-view images based on a view-specified GAN. Both the original (blue box) and the generated images (red boxes) are fed into the vehicle Re-ID model, which consists of one backbone (the first three blocks of ResNet-50), three subnetworks and one embedding network.

III Proposed Network Architecture

In this paper, we propose a novel deep network architecture that embeds attribute information, including camera views, vehicle types and colors, for vehicle Re-ID. We elaborate the proposed method in this section.

III-A Architecture Overview

The overall architecture is demonstrated in Fig. 3. Our proposed architecture mainly consists of two parts: the view transform model and the vehicle Re-ID model. The view transform model consists of a view-specified GAN to generate multi-view vehicle images to relieve the view variations. The vehicle Re-ID model is composed of one backbone, three guiding subnetworks, and the embedding layers. We discuss these parts one by one for clarity.

III-B Backbone

In our work, we adopt ResNet-50 as the baseline network for the backbone. One can also configure other networks, such as Inception-v4, VGG16 and MobileNet, without limitation. For ResNet-50, we use its first three residual blocks as our backbone, as shown in Fig. 3, due to its compelling performance with deeper layers via residual learning. We denote the parameters of this backbone network as $\theta_b$.
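For concreteness, the following is a minimal PyTorch sketch (not the authors' released code) of how such a backbone could be assembled from torchvision's ResNet-50; the function name, pretrained flag and input size are illustrative assumptions.

```python
# A hedged sketch: build the backbone from the stem and the first three
# residual blocks (layer1-layer3) of a torchvision ResNet-50.
import torch
import torch.nn as nn
from torchvision import models

def build_backbone(pretrained: bool = True) -> nn.Sequential:
    resnet = models.resnet50(pretrained=pretrained)
    # Block-4 (layer4) is intentionally left out; it is reused later inside
    # each subnetwork's feature extractor.
    return nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1, resnet.layer2, resnet.layer3,
    )

if __name__ == "__main__":
    backbone = build_backbone(pretrained=False)
    x = torch.randn(2, 3, 224, 224)    # dummy image batch
    print(backbone(x).shape)           # torch.Size([2, 1024, 14, 14])
```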

III-C Subnetworks

As shown in Fig. 3, each subnetwork consists of a predictor part and a feature extraction part, whose inputs are the feature maps from Block-1 and Block-3 of the backbone, respectively.

The predictor is composed of three convolutional (Conv) layers and one fully-connected (FC) layer, and outputs a probability distribution over the corresponding (view, type or color) values. The strides of the three Conv layers are 3, 2 and 1, respectively. We use ReLU activation in all three layers and add a batch normalization layer after each Conv layer. The resulting feature vector is fed into the following FC layer to predict the attribute scores via a $K$-way softmax, where $K$ is the number of classes of the corresponding view or attribute.

The feature extractor is composed of $K$ units, each of which is a Conv-net responsible for extracting high-level features corresponding to one of the view or attribute classes. We use Block-4 of ResNet-50 as the feature extractor.
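As a concrete illustration, below is a hedged PyTorch sketch of such an attribute predictor. The strides (3, 2, 1), the BN+ReLU after each Conv layer and the $K$-way softmax follow the description above, while the kernel sizes, channel widths and global pooling are our own assumptions, since the exact values are not recoverable from the text.

```python
# Sketch of an attribute predictor: three strided Conv layers with BN + ReLU,
# global average pooling, and an FC layer producing K-way softmax scores.
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    def __init__(self, in_channels: int = 256, num_classes: int = 5):
        super().__init__()
        def conv_block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.convs = nn.Sequential(
            conv_block(in_channels, 256, stride=3),
            conv_block(256, 256, stride=2),
            conv_block(256, 256, stride=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # -> feature vector
        self.fc = nn.Linear(256, num_classes)    # K-way attribute scores

    def forward(self, x):
        z = self.pool(self.convs(x)).flatten(1)
        return torch.softmax(self.fc(z), dim=1)  # probabilities over K classes

# e.g. a view predictor on Block-1 feature maps (256 channels for ResNet-50)
view_predictor = AttributePredictor(in_channels=256, num_classes=5)
```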

The features from each specific feature extractor can be formulated as

$f^{a}_{k} = \mathcal{E}^{a}_{k}(x;\, \theta^{a}_{\mathcal{E}}), \quad a \in \{v, t, c\}, \; k = 1, \dots, K_a,$   (1)

where $v$, $t$ and $c$ index the view, type and color subnetworks, and $K_a$ is the number of corresponding units, which also indicates the number of possible classes of each view or attribute. $x$ is an image, and $\theta^{a}_{\mathcal{E}}$ denotes the parameters of the feature extractor $\mathcal{E}^{a}$.

The probability distribution over the corresponding view or attribute values from the predictor network is

$p^{a} = \mathcal{P}^{a}(x;\, \theta^{a}_{\mathcal{P}}) = [\,p^{a}_{1}, \dots, p^{a}_{K_a}\,],$   (2)

where $\theta^{a}_{\mathcal{P}}$ denotes the parameters of $\mathcal{P}^{a}$, which is learnt using the cross-entropy loss $L^{a}$,

$L^{a} = -\sum_{k=1}^{K_a} y^{a}_{k} \log p^{a}_{k},$   (3)

where $y^{a}$ is a one-hot vector of the ground truth of the corresponding view or attribute values.

After progressively learning the three subnetworks, we obtain the specific feature maps via

$F^{a} = \bigoplus_{k=1}^{K_a} \left( p^{a}_{k} \odot f^{a}_{k} \right),$   (4)

where $\oplus$ denotes the element-wise sum and $\odot$ denotes the element-wise multiplication.

The joint deep features with camera view, type and color are obtained as the fusion of the feature maps of the three subnetworks,

$F = F^{v} \oplus F^{t} \oplus F^{c}.$   (5)

$F$ is the fused deep feature containing the complementary view and attribute information. Next, we describe the details of each subnetwork.
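To make Eqs. (1)-(5) concrete, the sketch below (our own illustration under the stated assumptions, not the authors' implementation) weights the $K$ per-class feature maps by the predictor's softmax scores and fuses the three subnetwork outputs by element-wise summation; AttributePredictor is the hypothetical module sketched above, and the per-class units stand in for copies of ResNet-50 Block-4.

```python
# Sketch of one attribute subnetwork (Eqs. 1-4) and the fusion of Eq. 5.
import torch
import torch.nn as nn

class AttributeSubnetwork(nn.Module):
    def __init__(self, predictor: nn.Module, units: nn.ModuleList):
        super().__init__()
        self.predictor = predictor   # P^a: K-way softmax over Block-1 maps
        self.units = units           # E^a_k: one feature extractor per class

    def forward(self, block1_feat, block3_feat):
        p = self.predictor(block1_feat)                       # (B, K), Eq. (2)
        feats = torch.stack(                                  # (B, K, C, H, W), Eq. (1)
            [unit(block3_feat) for unit in self.units], dim=1)
        w = p.view(p.size(0), p.size(1), 1, 1, 1)             # broadcast scores
        return (w * feats).sum(dim=1)                         # Eq. (4)

def fuse(f_view, f_type, f_color):
    """Eq. (5): element-wise fusion of the three subnetwork feature maps."""
    return f_view + f_type + f_color
```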

III-C1 View subnetwork

Viewpoint changes bring a crucial challenge to the Re-ID task. We use the view subnetwork to incorporate view information into the Re-ID model. The view predictor predicts $K_v$-way softmax scores, which are used to weight the output of each corresponding view unit. In this paper, $K_v = 5$, indicating the five annotated viewpoints. For instance, for a training sample in the rear orientation to the camera, the corresponding view unit will be assigned a strong weight and updated strongly during back propagation.

III-C2 Type subnetwork

Type is useful for distinguishing vehicles with similar appearance, which can relieve the inter-class similarity. In the same manner as the view subnetwork, we use the type subnetwork to learn attribute-specific deep features. The scores predicted by the type predictor are used to weight the output of each corresponding type unit in the type-specific feature extractor. In this paper, we set $K_t = 9$, indicating the 9 vehicle types annotated in VeRi-776 (sedan, SUV, van, hatchback, MPV, pickup, bus, truck and estate). Similar to the view-specific feature extractor, each type unit learns a feature map specialized for one of the types.

III-C3 Color subnetwork

Color is another discriminative attribute of vehicles. Therefore, we analogously use the color subnetwork to learn color-specific features. The color predictor predicts the color scores of the vehicle, which are then used to weight each color unit. In our implementation, we set $K_c = 10$, denoting the 10 vehicle colors annotated in VeRi-776 (yellow, orange, green, gray, red, blue, white, golden, brown and black). The color-specific feature extractor is designed in the same manner as in the view and type subnetworks.

III-D Embedding Layers

The embedding layers consist of two FC layers (we denote the parameters of this embedding as $\theta_e$). They embed the fused feature $F$ in Eq. (5) into a higher-level joint deep feature $F_e$, which is used for the final Re-ID task.

In order to train the Re-ID model, we add a softmax layer into the embedding network for ID classification. We use the cross-entropy loss $L_{ID}$ for model training,

$L_{ID} = -\sum_{i=1}^{N} y_{i} \log \hat{p}_{i},$   (6)

where $N$ is the number of vehicle IDs in the training set, $y$ is the one-hot ground truth of the ID label of the vehicle, and $\hat{p}_{i}$ is the predicted probability that the input vehicle image belongs to identity $i$,

$\hat{p}_{i} = \frac{\exp(W_{i}^{\top} F_{e})}{\sum_{j=1}^{N} \exp(W_{j}^{\top} F_{e})},$   (7)

where $W_{i}$ denotes the weights of the softmax classification layer for identity $i$.
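A minimal sketch of the embedding layers and the ID head of Eqs. (6)-(7) could look as follows; the layer widths, the global average pooling of the fused feature map and the number of identities are assumptions of ours, not values from the paper.

```python
# Sketch of the embedding layers (two FC layers) plus the softmax ID classifier.
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    def __init__(self, in_dim: int = 2048, emb_dim: int = 1024, num_ids: int = 576):
        super().__init__()
        self.embed = nn.Sequential(              # two FC layers -> joint feature F_e
            nn.Linear(in_dim, emb_dim), nn.ReLU(inplace=True),
            nn.Linear(emb_dim, emb_dim),
        )
        self.classifier = nn.Linear(emb_dim, num_ids)   # softmax ID layer (Eq. 7)

    def forward(self, fused_map):
        f = fused_map.mean(dim=(2, 3))           # assumed global average pooling
        f = self.embed(f)                        # joint deep feature F_e
        return f, self.classifier(f)             # logits for the ID softmax

criterion = nn.CrossEntropyLoss()                # implements the loss of Eq. (6)
```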

Fig. 4 demonstrates the effectiveness of the jointly learnt deep features of the proposed DF-CVTC. We can observe that the vehicle images of the same identity fall into the same cluster regardless of the distinct visible appearance caused by different camera views (Fig. 4 (a) and (b)) or illumination changes (Fig. 4 (c)).

Fig. 4: Demonstration of the DF-CVTC features. (a), (b) and (c) denote three vehicle pairs sampled from the VeRi-776 dataset under two distinct camera views, together with their corresponding learned probabilities of the camera views, types and colors, where the visible appearances are distinct due to different camera views or illumination changes. (d) illustrates the 2D feature projections of the vehicle images learnt by the proposed DF-CVTC. (e) lists the corresponding annotation categories of camera views, types and colors.

III-E View-specified GAN

Due to the rigid motion of vehicles, the images under a certain camera are almost in a single viewpoint, which brings a big challenge to vehicle Re-ID in wild conditions. Therefore, we design a view-specified GAN to generate multi-view images. In this paper, we simply employ pix2pix [8] for its generality. The generation architecture is illustrated in Fig. 5. Specifically, given an input vehicle image $x$ and a target vehicle image $y$ with a different view, our VS-GAN aims to generate a new vehicle image $G(x, y)$ with the same view as $y$. VS-GAN consists of a generator $G$, which learns a mapping conditioned on the given target, and a discriminator $D$, which discriminates real samples from generated samples, such that the distribution of the generated images is indistinguishable from the distribution of the target images. The loss function can be expressed as

$G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G),$   (8)

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, the $L1$ distance $\mathcal{L}_{L1}(G)$ is used to encourage less blurring, and $\lambda$ is the weighting coefficient. Fig. 6 demonstrates several examples of synthesizing vehicle images from one view to another on the VeRi-776 dataset via VS-GAN. One can generate multi-view images by altering the target images with different views.
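The objective of Eq. (8) can be sketched as below (our hedged illustration of a pix2pix-style loss, not the paper's code); for brevity the conditioning of the discriminator on the input image is omitted, and D, fake, real and lambda_l1 are placeholders.

```python
# Sketch of the cGAN + weighted L1 objective used by a pix2pix-style VS-GAN.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial loss on discriminator logits
l1 = nn.L1Loss()               # reconstruction term that discourages blurring

def generator_loss(D, fake, target, lambda_l1: float = 100.0):
    # The generator wants D to label its output as real, plus an L1 penalty.
    logits = D(fake)
    adv = bce(logits, torch.ones_like(logits))
    return adv + lambda_l1 * l1(fake, target)

def discriminator_loss(D, fake, real):
    # The discriminator separates real target images from generated ones.
    real_logits, fake_logits = D(real), D(fake.detach())
    real_loss = bce(real_logits, torch.ones_like(real_logits))
    fake_loss = bce(fake_logits, torch.zeros_like(fake_logits))
    return 0.5 * (real_loss + fake_loss)
```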

Fig. 5: The architecture of VS-GAN, based on the architecture of pix2pix [8]. Given an input single-view vehicle image, VS-GAN aims to synthesize a vehicle image with the same view as the target vehicle image.

III-F Difference from Previous Work

Our method is significantly different from [45, 46, 24] in the following aspects. 1) [45, 46] infer multi-view images or features using adversarial learning. However, they render vehicle Re-ID as a verification task, while our method employs a classification CNN to learn deep features. Furthermore, our learnt features embed attribute information (type and color) in addition to view information. 2) [24] incorporates both fine and coarse pose/view information to learn a feature representation and proposes a novel re-ranking method for person Re-ID, while our DF-CVTC further integrates attribute information and jointly learns deep features embedded with camera views, vehicle types and colors in an end-to-end framework.

IV Training Details

IV-A Progressive Learning

We progressively learn the three subnetworks and then fine-tune the Re-ID model, which achieves performance comparable to multi-task learning (minimizing the combination of the four losses) while significantly reducing the computational complexity.

IV-A1 View subnetwork training

We fine-tune the backbone network pre-trained on ImageNet classification [10], and the rest of the Re-ID model is initialized from scratch. First, we minimize the view prediction loss $L^{v}$ in Eq. (3) to obtain $\theta^{v}_{\mathcal{P}}$; then we minimize the ID loss $L_{ID}$ to obtain $\theta^{v}_{\mathcal{E}}$ while fixing all the other parameters of the Re-ID model.

IV-A2 Type subnetwork training

We first minimize the type prediction loss $L^{t}$ to obtain $\theta^{t}_{\mathcal{P}}$, and then minimize $L_{ID}$ using the fused view and type features to obtain $\theta^{t}_{\mathcal{E}}$ while fixing all the other parameters of the Re-ID model.

IV-A3 Color subnetwork training

In the same manner, we first minimize the color prediction loss $L^{c}$ to obtain $\theta^{c}_{\mathcal{P}}$, and then minimize $L_{ID}$ using the fused features of all three subnetworks to obtain $\theta^{c}_{\mathcal{E}}$ while fixing all the other parameters of the Re-ID model.

IV-A4 Joint learning

After training the three subnetworks, we fine-tune the backbone parameters $\theta_{b}$ and the embedding parameters $\theta_{e}$ of the whole Re-ID model by minimizing $L_{ID}$ until convergence.
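The staged procedure above could be organized as in the following schematic; it is a sketch under our assumptions (module names, loss handles and epoch counts are hypothetical), not the authors' training script.

```python
# Schematic of progressive learning: each stage optimizes only a subset of
# parameters (the rest are frozen, or simply not passed to the optimizer).
import itertools
import torch

def train_stage(params, loss_fn, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss_fn(batch).backward()
            opt.step()

# Stages 1-3, for each attribute a in (view, type, color):
#   (i)  train the predictor P^a with its cross-entropy loss L^a (Eq. 3);
#   (ii) train the extractor units E^a with the ID loss L_ID, others fixed.
# Stage 4: fine-tune backbone + embedding with L_ID until convergence, e.g.
# train_stage(itertools.chain(backbone.parameters(), embedding.parameters()),
#             id_loss_fn, train_loader)
```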

IV-B Implementation Details

In practice, we use a stochastic approximation of the objective since the training set is quite large. The training set is stochastically divided into mini-batches of 16 samples. The network performs forward propagation on the current mini-batch, followed by backpropagation to compute the gradients of the cross-entropy loss for updating the network parameters. We use the Adam optimizer with the recommended parameters, an initial learning rate of 0.0001 and a decay of 0.96 every epoch. With more passes over the training data, the model improves until it converges. To reduce overfitting, we artificially augment the data by resizing each original image and performing random 2D translation, following the same protocol as in [13]. We also horizontally flip each image.
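The stated optimization and augmentation settings could be configured as in the sketch below; the image sizes and dataset objects are placeholders, since the original resolutions are not given here, while the batch size, Adam learning rate and per-epoch decay follow the text.

```python
# Hedged sketch of the training setup: Adam (lr 1e-4), 0.96 decay per epoch,
# mini-batches of 16, random translation (resize + random crop) and flips.
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),          # assumed working resolution
    transforms.RandomCrop((224, 224)),      # random 2D translation
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

def make_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # multiply the learning rate by 0.96 after every epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)
    return optimizer, scheduler

# loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)
```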

Fig. 6: Examples of synthesizing vehicle images from one view to another on the VeRi-776 dataset via VS-GAN. The first and second rows show the vehicle images in the original view and the synthesized view, respectively.
Fig. 7: Examples of ranking results on VeRi-776 dataset. The green and red boxes indicate the right matchings and the wrong matchings respectively.

V Experiments

We carry out a comprehensive evaluation of the proposed DF-CVTC against state-of-the-art methods on two public vehicle Re-ID datasets, VeRi-776 [20] and VehicleID [18]. We use the Cumulative Matching Characteristics (CMC) curves and mAP to evaluate our results [18]. The type and color labels are available in VeRi-776; therefore, we additionally annotate the view labels for network training. On VehicleID, we directly employ the view, type and color subnetworks pre-trained on VeRi-776, and only ID labels are used.

V-A State-of-the-art Methods

All the compared state-of-the-art methods are briefly introduced as follows:

(1) LOMO [16]. Local Maximal Occurrence (LOMO) proposes an effective feature representation that is robust to viewpoint changes for person Re-ID.

(2) BOW-CN [41]. Bag-of-Words descriptor based on Color Names (CN).

(3) GoogLeNet [35]. Pre-trained on ImageNet [10] and fine-tuned on the CompCars dataset [35] to extract discriminative semantic feature representations.

(4) FACT [20]. Fusion of Attributes and Color features discriminates vehicles by jointly learning low-level color features and high-level semantic attributes, such as SIFT, Color Name and GoogLeNet features.

(5) FACT+Plate-SNN+STR [18]. FACT [20] with additional plate verification based on a Siamese Neural Network and spatio-temporal relations (STR).

(6) Siamese-Visual [25]. Only visual (appearance) information is used to compute the similarity between the input pair.

(7) Siamese-CNN+Path-LSTM [25]. Combines a Siamese CNN with a Path-LSTM that estimates the validity score of the visual-spatiotemporal path.

(8) NuFACT [21]. The null-space-based Fusion of Attribute and Color features method, which integrates the multi-level appearance features of vehicles, i.e., texture, color and high-level attribute features.

(9) VAMI [46]. VAMI aims to transform the single-view feature into a global feature representation that contains multi-view feature information, followed by the distance metric learning on the global feature space.

(10) C2F-Rank [6]. C2F-Rank designs the coarse-to-fine ranking loss consisting of a vehicle model classification loss, a coarse-grained ranking loss, a fine-grained ranking loss and a pairwise loss.

Fig. 8: Examples of ranking results on VehicleID dataset. The green boxes indicate the right matchings. Note that there is only one ground truth vehicle image in gallery set in the VehicleID dataset.
Method | mAP | rank 1 | rank 5 | reference
(1) LOMO [16] | 9.64 | 25.33 | 46.48 | CVPR 2015
(2) BOW-CN [41] | 12.20 | 33.91 | 53.69 | ICCV 2015
(3) GoogLeNet [35] | 17.89 | 52.32 | 72.17 | CVPR 2015
(4) FACT [20] | 18.49 | 50.95 | 73.48 | ICME 2016
(5) FACT+Plate-SNN+STR [18] | 27.70 | 61.44 | 78.78 | ECCV 2016
(6) Siamese-Visual [25] | 29.48 | 41.12 | 60.31 | ICCV 2017
(7) Siamese-CNN+Path-LSTM [25] | 58.27 | 83.49 | 90.04 | ICCV 2017
(8) NuFACT [21] | 48.47 | 76.76 | 91.42 | TMM 2018
(9) VAMI [46] | 50.13 | 77.03 | 90.82 | CVPR 2018
ResNet-50 (baseline) | 51.58 | 86.71 | 92.43 | Ours
  +view | 54.52 | 89.69 | 94.40 | Ours
  +view+type | 60.47 | 91.66 | 95.59 | Ours
  +view+type+color (DF-CVTC) | 61.06 | 91.36 | 95.77 | Ours
TABLE I: Comparisons with state-of-the-art Re-ID methods on VeRi-776 (in %). The top three results are highlighted in red, green and blue, respectively.
Method | Test Size = 800 (mAP / rank 1 / rank 5) | Test Size = 1600 (mAP / rank 1 / rank 5) | Test Size = 2400 (mAP / rank 1 / rank 5) | reference
(1) LOMO [16] | - / 19.76 / 32.01 | - / 18.85 / 29.18 | - / 15.32 / 25.29 | CVPR 2015
(2) BOW-CN [41] | - / 13.14 / 22.69 | - / 12.94 / 21.09 | - / 10.20 / 17.89 | ICCV 2015
(3) GoogLeNet [35] | 46.20 / 47.88 / 67.18 | 44.00 / 43.40 / 63.86 | 38.10 / 38.27 / 59.39 | CVPR 2015
(4) FACT [20] | - / 49.53 / 68.07 | - / 44.59 / 64.57 | - / 39.92 / 60.32 | ICME 2016
(8) NuFACT [21] | - / 48.90 / 69.51 | - / 43.64 / 65.34 | - / 38.63 / 60.72 | TMM 2018
(9) VAMI [46] | - / 63.12 / 83.25 | - / 52.87 / 75.12 | - / 47.34 / 70.29 | CVPR 2018
(10) C2F-Rank [6] | 63.50 / 61.10 / 81.70 | 60.00 / 56.20 / 76.20 | 53.00 / 51.40 / 72.20 | AAAI 2018
ResNet-50 (baseline) | 70.50 / 67.75 / 79.13 | 68.48 / 65.79 / 76.64 | 66.19 / 63.45 / 74.70 | Ours
  +view | 75.44 / 72.63 / 84.82 | 72.41 / 69.62 / 81.36 | 70.71 / 68.02 / 79.19 | Ours
  +view+type | 76.06 / 73.14 / 86.25 | 73.39 / 70.77 / 81.75 | 71.75 / 69.10 / 80.40 | Ours
  +view+type+color (DF-CVTC) | 78.03 / 75.23 / 88.11 | 74.87 / 72.15 / 84.37 | 73.15 / 70.46 / 82.13 | Ours
TABLE II: Comparisons with state-of-the-art Re-ID methods on VehicleID (in %). The top three results are highlighted in red, green and blue, respectively.

V-B Experiments on VeRi-776 Dataset

V-B1 Setting

The VeRi-776 dataset contains 776 identities collected by 20 cameras in a real-world traffic surveillance environment. The whole dataset is split into 576 identities with 37,778 images for training and 200 identities with 11,579 images for testing. An additional set of 1,678 images selected from the test identities is used as query images. In order to evaluate the view subnetwork, we annotate all the vehicle images in the VeRi-776 dataset with five viewpoint categories. We follow the evaluation protocol in [18]. We use the mean average precision (mAP) metric for evaluation: we first calculate the average precision for each query, and then obtain the mAP as the mean of the per-query average precisions. The cumulative match characteristic (CMC) curve is also used for evaluation: we sort the Euclidean distances between each query and the gallery images in ascending order, and the CMC curve is obtained by averaging the resulting per-query match curves. Note that only vehicles from non-overlapping cameras are counted during evaluation.
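The mAP/CMC protocol described above can be summarized by the following sketch (our simplified illustration; junk-image filtering and the exclusion of same-camera matches used in the official VeRi-776 protocol are omitted, and the function name is ours).

```python
# Simplified CMC / mAP evaluation: rank the gallery by Euclidean distance for
# each query, then average per-query match curves and average precisions.
import numpy as np

def evaluate(dist, q_ids, g_ids, max_rank=20):
    """dist: (num_query, num_gallery) distance matrix."""
    cmc_all, ap_all = [], []
    for qi in range(dist.shape[0]):
        order = np.argsort(dist[qi])                   # ascending distance
        matches = (g_ids[order] == q_ids[qi]).astype(np.int32)
        if not matches.any():
            continue
        cmc = matches.cumsum()                         # CMC: 1 from first hit on
        cmc[cmc > 1] = 1
        cmc_all.append(cmc[:max_rank])
        hits = np.where(matches == 1)[0]               # AP: precision at each hit
        precisions = [(i + 1) / (r + 1) for i, r in enumerate(hits)]
        ap_all.append(np.mean(precisions))
    return np.mean(cmc_all, axis=0), np.mean(ap_all)   # CMC curve, mAP
```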

V-B2 Qualitative examples

Fig. 7 shows six qualitative ranking results of our DF-CVTC on the VeRi-776 dataset. We can observe that our method successfully hits vehicles with large view variations relative to the query, such as rank 2 and ranks 10-11 in Fig. 7 (a), rank 4 and ranks 8-9 in Fig. 7 (b), rank 8 in Fig. 7 (c), and rank 1 and rank 6 in Fig. 7 (d). The wrong hits generally result from the high inter-class similarity with homologous visual appearance, such as ranks 4-5, rank 9 and rank 12 in Fig. 7 (a), rank 10 in Fig. 7 (c), ranks 7-9 in Fig. 7 (e), and ranks 2-3 and rank 5 in Fig. 7 (f). Fig. 11 shows a further qualitative ranking example of our DF-CVTC on the VeRi-776 dataset: Fig. 11 (b) shows the view/attribute probabilities predicted by each subnetwork, and Fig. 11 (c) shows the ranking results. From Fig. 11, we can see that the ranking results improve as each subnetwork is introduced progressively.

V-B3 Quantitative results

Table I reports the performance of our approach compared with the published state-of-the-art methods on the VeRi-776 dataset, while Fig. 9 (a) shows the corresponding CMC curves. Our DF-CVTC significantly surpasses the state-of-the-art methods. Note that we have not utilized any license plate or spatial-temporal information as in Siamese-CNN+Path-LSTM [25] and FACT+Plate-SNN+STR [18]. Even so, our method still achieves superior mAP and ranking accuracies by a large margin. We also investigate the contribution of the view, type and color subnetworks in our model. By progressively introducing these components, both mAP and ranking accuracies increase, verifying the clear contribution of each component.

V-C Experiments on VehicleID Dataset

V-C1 Setting

The VehicleID dataset [18] consists of a training set with 110,178 images of 13,134 vehicles and a test set with 111,585 images of 13,133 vehicles. Following the protocol in [18], we test on VehicleID in three distinct settings with test sizes of 800, 1600 and 2400. Specifically, since some of the type and color information is missing and no view labels are provided in this dataset, we adopt the view, type and color subnetworks pre-trained on the VeRi-776 dataset and fine-tune them during Re-ID training, which in turn means one can easily apply our model to datasets with only ID information. The mean average precision (mAP) and cumulative match characteristic (CMC) are used as evaluation metrics in the same manner as on VeRi-776. The only difference is that we randomly select one image of each identity from the test set as the gallery and consider the remaining test images as queries. The experimental results are averaged over 10 random trials.

V-C2 Qualitative examples

Fig. 8 shows six ranking results of our DF-CVTC on VehicleID. We can observe that our method successfully hits the right matchings despite large intra-class differences caused by illumination/color changes, such as in Fig. 8 (a), (c) and (f), as well as viewpoint changes, such as in Fig. 8 (b), (c), (d) and (f). The wrong hits at rank 1 in Fig. 8 (b) and (e) result from the inter-class similarity between vehicles; despite this, our method still hits the right matchings in the early ranks. Note that there is only one ground truth vehicle image in the gallery set of the VehicleID dataset.

Fig. 9: CMC curves on VeRi-776 and VehicleID datasets comparing to the state-of-the-art methods where the curves of variants of our methods are plotted in red color but with different markers.

V-C3 Quantitative results

Table II reports the performance of our method against the state-of-the-art methods on the VehicleID dataset, while Fig. 9 (b)(c)(d) shows the corresponding CMC curves for each test size. Clearly, our method significantly beats the existing state-of-the-art methods in mAP, rank 1 and rank 5. By introducing the view, type and color subnetworks progressively, the performance of our method is consistently improved. With the ResNet-50 base model, our DF-CVTC beats VAMI by a large margin, increasing rank 1 by 12.11%, 19.28% and 23.12% on the three test sets of different scales, respectively. Our DF-CVTC also clearly beats C2F-Rank, increasing mAP by 14.53%, 14.87% and 20.15% on the three test sets, respectively.

V-D Ablation Study

V-D1 Analysis on VS-GAN

The designed VS-GAN suffices to generate vehicle images in the other four viewpoints, as shown in Fig. 3. Due to the computational complexity, we have simply transferred 1,400 single-view vehicle images into images of another view for training data augmentation on the VeRi-776 dataset, as shown in Fig. 6. Fig. 10 shows the ablation study of VS-GAN. We can see that, by adding even only 1,400 synthetic multi-view images to the 37,729 training samples, VS-GAN benefits the Re-ID model with its various components. Moreover, it further boosts the contribution of the view subnetwork in both mAP and rank 1, compared with the improvements of DF-CVTC without VS-GAN. We believe that more generated images covering more viewpoints will further boost the performance.

Fig. 10: Ablation study of VS-GAN on the VeRi-776 dataset. (a) and (b) show the mAP and rank 1 scores of the proposed DF-CVTC and its variants without and with VS-GAN, respectively. The digits on top of the last three bars for each metric indicate the improvement obtained by progressively introducing view, type and color, compared with the first blue bar of the ResNet-50 baseline.
Fig. 11: An example of ranking results of DF-CVTC on ResNet-50 backbone by progressively introducing the view, type and color subnetworks on VeRi-776 dataset. The green and red boxes indicate the right matchings and the wrong matchings respectively. The histograms denote the probability distributions learnt from the view, type and color subnetworks respectively.
Component | (a) | (b) | (c) | (d)
view | | ✓ | ✓ | ✓
type | | | ✓ | ✓
color | | | | ✓
Backbone | (a) mAP / rank 1 / rank 5 | (b) mAP / rank 1 / rank 5 | (c) mAP / rank 1 / rank 5 | (d) mAP / rank 1 / rank 5
ResNet-50 | 51.58 / 86.71 / 92.43 | 54.52 / 89.69 / 94.40 | 60.47 / 91.66 / 95.59 | 61.06 / 91.36 / 95.77
VGG16 | 42.35 / 77.77 / 88.14 | 44.17 / 80.63 / 89.57 | 45.43 / 81.17 / 90.35 | 45.62 / 81.76 / 91.12
MobileNet | 52.55 / 86.23 / 94.10 | 54.48 / 87.60 / 93.92 | 58.49 / 89.15 / 94.64 | 59.23 / 89.45 / 94.87
Inception-v4 | 49.78 / 84.62 / 91.90 | 52.74 / 87.66 / 93.68 | 59.49 / 89.27 / 94.76 | 60.50 / 89.51 / 95.47
TABLE III: Ablation study on different backbones with varying components on the VeRi-776 dataset (in %). The top three results are highlighted in red, green and blue, respectively.

V-D2 Analysis on backbones

As mentioned in Section III-B, any other CNN architecture can be configured instead of ResNet-50 without limitation. We further evaluate three prevalent CNN architectures, Inception-v4, VGG16 and MobileNet, as the backbone while keeping the other parts of the proposed model unchanged. The results compared with the ResNet-50 backbone on the VeRi-776 dataset are reported in Table III. We can see that all four CNN counterparts achieve satisfactory performance. Specifically, Inception-v4 and MobileNet achieve competitive performance on all metrics. VGG16 works slightly worse than the other three architectures, but is still competitive with the state-of-the-art methods, which demonstrates that the high performance of the proposed model is not solely due to the superiority of ResNet-50. Furthermore, by progressively introducing the view, type and color subnetworks, the performance of the corresponding variants on all backbones consistently improves, which verifies the contribution of the proposed joint learning model. Fig. 11 shows an example of ranking results of the proposed DF-CVTC for a query from the VeRi-776 dataset, obtained by progressively introducing the view, type and color subnetworks into the ResNet-50 backbone. We observe that: 1) By introducing the view subnetwork, the model can eliminate wrong ranks with very similar visible appearance, especially those with views similar to the query, such as rank 1 and rank 2 in Fig. 11 (c1). 2) By further introducing the type subnetwork, it can eliminate wrong ranks with obviously distinct types, such as rank 2 and rank 9 in Fig. 11 (c2). 3) Our full model DF-CVTC (Fig. 11 (c4)) hits the most right ranks by progressively introducing the three subnetworks.

V-E Other Discussion

We further evaluate attribute recognition on the 1,678 query vehicle images from the VeRi-776 dataset. The recognition rates of the three subnetworks for the view, type and color attributes all improve after fine-tuning with the Re-ID loss compared with their original values. We observe that the attributes guided feature learning model can further benefit the attribute recognition rates, especially for camera views. Meanwhile, the recognition rate of the vehicle type has also been slightly improved. The recognition rate of the vehicle color attribute shows no significant change, mainly because most query images are of similar colors and the original recognition rate of the color attribute is nearly saturated.

VI Conclusion

In this paper, we have proposed a novel end-to-end deep convolutional network to jointly learn deep features, camera views, vehicle types and colors for vehicle Re-ID. We extend the ResNet-50 backbone with three consolidated subnetworks incorporating the view, type and color cues, respectively. These three tasks benefit each other and learn an informative and discriminative representation for vehicle Re-ID. Furthermore, we have increased the diversity of vehicle image views via a view-specified generative adversarial network. By jointly learning the deep features, camera views, vehicle types and vehicle colors in a single unified framework, our method achieves superior performance compared with the state-of-the-art methods. Comprehensive evaluations on two benchmark datasets demonstrate the clear contribution of each subnetwork and the capability of the learnt representation for vehicle Re-ID.

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (61502006, 61702002 and 61872005).

References

  • [1] E. Ahmed, M. Jones, and T. K. Marks (2015) An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3908–3916.
  • [2] P. Chen, X. Xu, and C. Deng (2018) Deep view-aware metric learning for person re-identification. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 620–626.
  • [3] Y. Chen, X. Zhu, W. Zheng, and J. Lai (2018) Person re-identification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 392–408.
  • [4] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 994–1003.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In International Conference on Neural Information Processing Systems (NIPS), pp. 2672–2680.
  • [6] H. Guo, C. Zhao, Z. Liu, J. Wang, H. Lu, et al. (2018) Learning coarse-to-fine structured feature embedding for vehicle re-identification. In Association for the Advancement of Artificial Intelligence (AAAI), pp. 1–8.
  • [7] R. Huang, S. Zhang, T. Li, and R. He (2017) Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In IEEE International Conference on Computer Vision (ICCV).
  • [8] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint.
  • [9] S. Khamis, C. Kuo, V. K. Singh, V. D. Shet, and L. S. Davis (2014) Joint learning for attribute-consistent person re-identification. In European Conference on Computer Vision (ECCV), pp. 134–146.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [11] R. Layne, T. M. Hospedales, S. Gong, and Q. Mary (2012) Person re-identification by attributes. In British Machine Vision Conference (BMVC), pp. 8.
  • [12] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang (2017) Person search with natural language description. arXiv preprint arXiv:1702.05729.
  • [13] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) DeepReID: deep filter pairing neural network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 152–159.
  • [14] Y. Li, L. Song, X. Wu, R. He, and T. Tan (2018) Anti-makeup: learning a bi-level adversarial network for makeup-invariant face verification.
  • [15] Y. Li, Y. Li, H. Yan, and J. Liu (2017) Deep joint discriminative learning for vehicle re-identification and retrieval. In IEEE International Conference on Image Processing (ICIP), pp. 395–399.
  • [16] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2197–2206.
  • [17] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang (2017) Improving person re-identification by attribute and identity learning. arXiv preprint arXiv:1703.07220.
  • [18] H. Liu, Y. Tian, Y. Yang, L. Pang, and T. Huang (2016) Deep relative distance learning: tell the difference between similar vehicles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2167–2175.
  • [19] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu (2018) Pose transferrable person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4099–4108.
  • [20] X. Liu, W. Liu, H. Ma, and H. Fu (2016) Large-scale vehicle re-identification in urban surveillance videos. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
  • [21] X. Liu, W. Liu, T. Mei, and H. Ma (2018) PROVID: progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Transactions on Multimedia, pp. 645–658.
  • [22] J. Prokaj and G. Medioni (2009) 3-D model based vehicle recognition. In Applications of Computer Vision (WACV), 2009 Workshop on, pp. 1–7.
  • [23] X. Qian, Y. Fu, W. Wang, T. Xiang, Y. Wu, Y. Jiang, and X. Xue (2017) Pose-normalized image generation for person re-identification. arXiv preprint arXiv:1712.02225.
  • [24] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen (2017) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. arXiv preprint arXiv:1711.10378.
  • [25] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang (2017) Learning deep neural networks for vehicle re-ID with visual-spatio-temporal path proposals. In IEEE International Conference on Computer Vision (ICCV), pp. 1918–1927.
  • [26] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian (2017) Pose-driven deep convolutional model for person re-identification. In IEEE International Conference on Computer Vision (ICCV), pp. 3980–3989.
  • [27] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao (2015) Multi-task learning with low rank attribute embedding for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3739–3747.
  • [28] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian (2016) Deep attributes driven multi-camera person re-identification. In European Conference on Computer Vision (ECCV), pp. 475–491.
  • [29] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian (2018) Multi-type attributes driven multi-camera person re-identification. Pattern Recognition, pp. 77–89.
  • [30] X. Wang, S. Zheng, R. Yang, B. Luo, and J. Tang (2019) Pedestrian attribute recognition: a survey. arXiv preprint arXiv:1901.07474.
  • [31] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang (2017) Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 379–387.
  • [32] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer GAN to bridge domain gap for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 79–88.
  • [33] A. Wu, W. Zheng, and J. Lai (2017) Robust depth-based person re-identification. IEEE Transactions on Image Processing, pp. 2588–2603.
  • [34] Z. Wu, Y. Li, and R. J. Radke (2015) Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (5), pp. 1095–1108.
  • [35] L. Yang, P. Luo, C. Change Loy, and X. Tang (2015) A large-scale car dataset for fine-grained categorization and verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3973–3981.
  • [36] Z. Yin, W. Zheng, A. Wu, H. Yu, H. Wan, X. Guo, F. Huang, and J. Lai (2018) Adversarial attribute-image person re-identification. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 1100–1106.
  • [37] Y. Zhang, D. Liu, and Z. Zha (2017) Improving triplet-wise training of convolutional neural network for vehicle re-identification. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1386–1391.
  • [38] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang (2017) Spindle Net: person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1077–1085.
  • [39] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017) Deeply-learned part-aligned representations for person re-identification. In IEEE International Conference on Computer Vision (ICCV), pp. 3239–3248.
  • [40] L. Zheng, Y. Huang, H. Lu, and Y. Yang (2017) Pose invariant embedding for deep person re-identification. arXiv preprint arXiv:1701.07732.
  • [41] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124.
  • [42] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In IEEE International Conference on Computer Vision (ICCV).
  • [43] Z. Zheng, L. Zheng, and Y. Yang (2017) A discriminatively learned CNN embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), pp. 13.
  • [44] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang (2018) Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166.
  • [45] Y. Zhou and L. Shao (2017) Cross-view GAN based vehicle generation for re-identification. In Proceedings of the British Machine Vision Conference (BMVC), pp. 1–12.
  • [46] Y. Zhou and L. Shao (2018) Viewpoint-aware attentive multi-view inference for vehicle re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6489–6498.
  • [47] J. Zhu, Y. Du, Y. Hu, L. Zheng, and C. Cai (2018) VRSDNet: vehicle re-identification with a shortly and densely connected convolutional neural network. Multimedia Tools and Applications, pp. 1–15.
  • [48] X. Zhu, B. Wu, D. Huang, and W. Zheng (2018) Fast open-world person re-identification. IEEE Transactions on Image Processing, pp. 2286–2300.