AttributeNet: Attribute Enhanced Vehicle Re-Identification

02/07/2021 · Rodolfo Quispe, et al. · Microsoft, University of Campinas

Vehicle Re-Identification (V-ReID) is a critical task that associates the same vehicle across images from different camera viewpoints. Many works explore attribute clues to enhance V-ReID; however, there is usually a lack of effective interaction between the attribute-related modules and the final V-ReID objective. In this work, we propose a new method to efficiently explore discriminative information from vehicle attributes (e.g., color and type). We introduce AttributeNet (ANet), which jointly extracts identity-relevant features and attribute features. We enable the interaction by distilling the ReID-helpful attribute feature and adding it into the general ReID feature to increase the discrimination power. Moreover, we propose a constraint, named Amelioration Constraint (AC), which encourages the feature obtained after adding the attribute feature onto the general ReID feature to be more discriminative than the original general ReID feature. We validate the effectiveness of our framework on three challenging datasets. Experimental results show that our method achieves state-of-the-art performance.


1 Introduction

Vehicle Re-Identification (V-ReID) aims to match/associate the same vehicle across images. It has many applications in vehicle tracking and retrieval, and has gained increasing attention in the computer vision community (naphade20192019; khan2019survey). This task is challenging due to drastic changes in viewpoint and illumination, resulting in small inter-class and large intra-class differences.

Figure 1: Four images of vehicles used for V-ReID. The first and second images belong to the same vehicle; in this case, the color attribute can help overcome the illumination issue and match them. The third and fourth images belong to different vehicles with very similar appearance; in this case, the vehicle brand or type can help differentiate them.

Recently, there has been a trend toward exploring additional clues for better V-ReID, such as semantic maps (meng2020parsing), attributes (e.g., type, color) (zheng2019attributes; tang2019pamtri; wang2020attribute; qian2020stripe; lee2020strdan), viewpoints (chu2019vehicle), and vehicle parts (chu2019vehicle; zhang2019part; liu2020beyond). In this work, we focus on the exploration of attributes to enhance the discrimination power of feature representations. Attributes are in general invariant to viewpoint changes and robust to environment alterations (see the examples in Figure 1).

Most of the previous attribute-based works (lee2020strdan; qian2020stripe; tang2019pamtri; wang2020attribute; zheng2019attributes; liu2018ram; liu2016deep) share a common characteristic in their design: a global feature representation is extracted from an input image using a backbone network (e.g., ResNet (he2016deep)), where this feature is followed by two types of heads, one for re-identification (ReID), and the other for attribute recognition. We refer to this design as the Vanilla-Attribute Design (VAD) and illustrate a representative VAD based Network (VAN) in Figure 2. One direct way to use the VAD for V-ReID is to concatenate the embedding features generated from the backbone (i.e., global feature) and the attribute-based modules (qian2020stripe; liu2018ram).

VAD aims to drive the network to learn features that are discriminative for both V-ReID and attribute recognition, where the attributes are in general invariant to viewpoint and illumination changes. However, there is a lack of effective interaction between the attribute-based branches and the V-ReID branch: the attribute modules learn features for attribute recognition but are not explicitly designed to serve V-ReID. Wang et al. (wang2020attribute) explore attributes to generate attention masks, but these masks are used only to filter the information from the global feature instead of introducing the rich attribute representation into the final feature representation.

Figure 2: Illustration of VAD based Network (VAN) for V-ReID. It is composed of a backbone network that learns to extract information from an input image and branches to predict attributes based on attention modules. We use this VAN in our ANet as the first part of our framework.

We propose AttributeNet (ANet) to enrich the interaction between the attribute features and the V-ReID feature. ANet is designed to distill attribute information and add it into the global representation (from the backbone) to generate more discriminative features. Figures 2 and 3 (with input feature maps obtained from the VAN as illustrated in Figure 2) present the proposed ANet. Particularly, we combine the feature maps of the different attribute branches to obtain a unique and generic representation $A$ of all the attributes. We distill the helpful attribute feature $D$ from $A$ and compensate it onto the global V-ReID feature $F$ to obtain the final feature map $J$, where the spatially average pooled feature of $J$ is the final ReID feature for matching. Moreover, we introduce a new supervision objective, named Amelioration Constraint (AC), which encourages the compensated V-ReID feature to be more discriminative than the V-ReID feature before the compensation from the attribute feature.

Figure 3: Illustration of the Joint Module. Note that the network used to extract the feature maps $F, A_1, \dots, A_N$ is shown in Figure 2 and is not repeated here. The module distills the helpful attribute feature $D$ from $A$ and compensates it onto the global V-ReID feature $F$ to obtain the final feature map $J$, whose spatially average pooled feature is the final ReID feature for matching. The Amelioration Constraint (AC) encourages the compensated V-ReID feature to be more discriminative than the V-ReID feature before the compensation.

The main contributions of this work are:

  • We propose a new architecture, named ANet, for effective V-ReID, which enhances the interaction between the attribute-supervised modules and the V-ReID branch. This encourages the distilled attribute features to serve V-ReID.

  • We introduce an Amelioration Constraint (AC), which encourages the attribute compensated feature to be more discriminative than the V-ReID feature before compensation.

Experiments on three challenging datasets demonstrate the effectiveness of our ANet, which outperforms the baselines significantly and achieves state-of-the-art performance.

2 Related Work

For vehicle ReID, many approaches explore Generative Adversarial Networks (GANs) (khorramshahi2020devil), graph networks (GNs) (liu2020beyond; shen2020exploring), semantic parsing (SP) (meng2020parsing) and vehicle part detection (VPD) (he2019part; zhang2019part) to improve performance. Some of them tend to describe vehicle details (khorramshahi2020devil) and local regions (he2019part; zhang2019part). PRND (he2019part) and PGAN (zhang2019part) detect predefined regions (e.g., back mirrors, lights, wheels) and describe them with deep features. SAVER (khorramshahi2020devil) modifies the input image by erasing the vehicle details using a GAN. Then, this synthetic image is combined with the input image to create a new version with the details visually enhanced for ReID. Some works aim to handle the drastic viewpoint changes (liu2020beyond; meng2020parsing). Liu et al. (liu2020beyond) describe each vehicle view based on semantic parsing and also encode the spatial relationship between views using GNs.

Some works exploit attribute information (qian2020stripe; wang2020attribute; guo2018learning; liu2016deep; liu2018ram) or combine attributes with other clues (lee2020strdan; tang2019pamtri; zheng2019attributes). Most of the previous attribute-based works use attribute information to regularize the feature learning (lee2020strdan; qian2020stripe; tang2019pamtri; wang2020attribute; zheng2019attributes; liu2018ram; liu2016deep). In general, they regress the attribute classes from the backbone features, along with the ReID supervision based on the same backbone features. However, using separate heads for different tasks ignores the interaction between the two tasks, even though the attribute branches should ultimately serve ReID.

Our work explores attribute clues by enabling effective interaction between attribute regression and V-ReID. Different from previous methods, we distill helpful attribute information and compensate it into the ReID feature representation to obtain a more discriminative representation.

3 Proposed ANet

Our proposed ANet is designed to exploit attribute information for effective V-ReID. In previous works that use attributes, there is a lack of interaction between the global V-ReID head and the attribute regression heads, which means that the attribute information is not effectively exploited for V-ReID.

To address this issue, we propose ANet (as shown in Figures 2 and 3). It consists of two parts: the VAD based Network (VAN) and the Joint Module (JM). VAN is based on a backbone with two types of heads, one to learn global V-ReID features and the other to regress attributes. VAN outputs an initial V-ReID feature representation and multiple attribute features from the input image. Then, the JM distills V-ReID-helpful attribute information and compensates it into the global features, promoting the interaction between the attribute branches and the V-ReID branch. Furthermore, we propose an Amelioration Constraint (AC), which encourages the attribute-compensated feature to be more discriminative than the original V-ReID feature before the compensation.

3.1 VAD based Network

VAD based Network (VAN), shown in Figure 2, aims to learn V-ReID features and regress attributes. This design is similar to previous works in the literature, where the attribute branches are expected to drive the learning of robust features, since attributes are in general invariant to illumination, viewpoints, etc.

Backbone. A backbone network is used to extract a feature map $F \in \mathbb{R}^{H \times W \times C}$ from an input image $I$, where $H$, $W$ and $C$ are the height, width and number of channels of $F$, respectively. We follow previous works and use ResNet (he2016deep) as the backbone.
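
For illustration, a minimal PyTorch sketch of this backbone stage is shown below. A plain torchvision ResNet-50 is used here as a stand-in for the IBN variant described in Section 4.2; only the truncation before the global pooling layer reflects the design described above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Truncated ResNet-50 that returns the spatial feature map F (no global pooling).

    A plain torchvision ResNet-50 is used here only as a placeholder for the
    IBN-based variant mentioned in the implementation details.
    """
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        # keep everything up to layer4; drop the average pooling and the classifier
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layers = nn.Sequential(net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, x):                 # x: (B, 3, 256, 256)
        return self.layers(self.stem(x))  # F: (B, 2048, 8, 8)
```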

V-ReID Head/Branch. On top of the backbone feature $F$, we append a spatial global average pooling (GAP) layer followed by a fully-connected (FC) layer to generate the V-ReID feature $f_r$ as

$f_r = W_r \, \mathrm{GAP}(F) + b_r$ (1)

where $W_r$ and $b_r$ denote the weights and bias of the FC layer used to reduce the dimension of the pooled feature, and $f_r \in \mathbb{R}^{d_r}$, where $d_r$ is the predefined dimension of the output. $f_r$ is supervised by a Triplet Loss $L_{tri}$ and a Cross Entropy Loss $L_{ce}$.
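
A minimal sketch of this head, following the GAP + FC structure of Equation (1); the dimensions used below (d_r = 512, 576 identities) are illustrative placeholders rather than the values used in our experiments.

```python
import torch.nn as nn

class ReIDHead(nn.Module):
    """GAP followed by an FC layer, producing the V-ReID feature f_r (Eq. 1)."""
    def __init__(self, channels=2048, d_r=512, num_ids=576):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # spatial GAP
        self.fc = nn.Linear(channels, d_r)         # W_r, b_r
        self.classifier = nn.Linear(d_r, num_ids)  # identity logits for L_ce

    def forward(self, F):                          # F: (B, C, H, W)
        f_r = self.fc(self.pool(F).flatten(1))     # (B, d_r)
        return f_r, self.classifier(f_r)           # f_r for L_tri, logits for L_ce
```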

Attribute Heads/Branches. On top of the backbone feature $F$, we add $N$ attribute branches for attribute classification, where $N$ is the number of available attributes in the training dataset, with one branch per attribute. For the $i$-th attribute branch, we use a spatial and channel attention module to obtain the attribute-related feature $A_i$ as

$A_i = M_i(F) \odot F$ (2)

where $M_i(F)$ denotes the response of the attention module.

To make a classification for the $i$-th attribute, we apply GAP and an FC layer to get a feature vector $f_{a_i}$ as

$f_{a_i} = W_{a_i} \, \mathrm{GAP}(A_i) + b_{a_i}$ (3)

where $W_{a_i}$ and $b_{a_i}$ denote the weights and bias of the FC layer, and $f_{a_i} \in \mathbb{R}^{d_a}$, where $d_a$ is the predefined size of the output. $f_{a_i}$ is followed by a classifier with a cross entropy loss $L_{ce}^{a_i}$ to recognize which class the image belongs to for the $i$-th attribute.
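
The attribute branches can be sketched as below, using an SE block (as in Section 4.2) for the attention module; the feature size d_a and the number of attribute classes are hypothetical placeholders.

```python
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation channel attention with reduction ratio 16."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, F):                          # F: (B, C, H, W)
        w = self.fc(F.mean(dim=(2, 3)))            # channel response M_i(F)
        return F * w[:, :, None, None]             # A_i = M_i(F) applied to F (Eq. 2)

class AttributeBranch(nn.Module):
    """One attribute head: attention, GAP, FC (Eq. 3) and a classifier for L_ce^{a_i}."""
    def __init__(self, channels=2048, d_a=128, num_classes=10):
        super().__init__()
        self.attn = SEAttention(channels)
        self.fc = nn.Linear(channels, d_a)         # W_{a_i}, b_{a_i}
        self.classifier = nn.Linear(d_a, num_classes)

    def forward(self, F):
        A_i = self.attn(F)
        f_ai = self.fc(A_i.mean(dim=(2, 3)))       # GAP + FC
        return A_i, f_ai, self.classifier(f_ai)
```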

In summary, VAN is trained by minimizing the loss

$L_{VAN} = L_{tri}(f_r) + L_{ce}(f_r) + \lambda \sum_{i=1}^{N} L_{ce}^{a_i}$ (4)

where $\lambda$ is a hyper-parameter for balancing the importance of the V-ReID loss and the attribute-related losses.
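
As a sketch, the VAN objective of Equation (4) can be assembled as follows; `triplet_loss` stands for any batch-hard triplet criterion and `lam` for the balancing weight, both placeholders.

```python
import torch.nn as nn

def van_loss(f_r, id_logits, attr_logits, id_labels, attr_labels,
             triplet_loss, lam=1.0):
    """L_VAN = L_tri(f_r) + L_ce(f_r) + lam * sum_i L_ce(f_{a_i})  (Eq. 4).

    attr_logits and attr_labels are lists with one entry per attribute branch.
    """
    ce = nn.CrossEntropyLoss()
    loss = triplet_loss(f_r, id_labels) + ce(id_logits, id_labels)
    for logits_i, labels_i in zip(attr_logits, attr_labels):
        loss = loss + lam * ce(logits_i, labels_i)
    return loss
```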

3.2 Joint Module

The Joint Module (JM) is illustrated in Figure 3. JM aims to distill V-ReID-helpful information from the attribute features and compensate it to the V-ReID feature for the final feature matching. First, we merge the attribute feature maps from the multiple branches to obtain a unified attribute feature map $A$. Then, we distill discriminative, V-ReID-helpful information from $A$ and compensate it onto $F$ to create a Joint Feature $J$. To encourage a higher discriminative capability of the Joint Feature, we introduce an Amelioration Constraint (AC), which drives the distillation of discriminative information from $A$ to enhance the original V-ReID feature $F$. The JM promotes the interaction between the attribute and V-ReID information to improve the V-ReID performance.

Attribute Feature $A$. To facilitate the distillation of helpful attribute features, we combine all the attribute feature maps $A_i$, $i = 1, \dots, N$, into a unified attribute feature map $A$. We achieve this by summing the attribute feature maps, followed by a convolution layer and a residual connection, as

$A = \phi\!\left(\sum_{i=1}^{N} A_i\right) + \sum_{i=1}^{N} A_i$ (5)

where $\phi(\cdot)$ is implemented by a convolutional layer followed by batch normalization (BN) and ReLU activation, i.e., $\phi(x) = \mathrm{ReLU}(\mathrm{Conv}(x))$. We omit BN to simplify the notation.
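
The merge of Equation (5) amounts to a sum of the branch outputs, a small convolutional block and a residual connection; the sketch below assumes a 1x1 kernel for $\phi$, which is our assumption rather than a detail stated here.

```python
import torch
import torch.nn as nn

class AttributeMerge(nn.Module):
    """A = phi(sum_i A_i) + sum_i A_i  (Eq. 5); the 1x1 kernel size is an assumption."""
    def __init__(self, channels=2048):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, attr_maps):                      # list of (B, C, H, W) maps
        s = torch.stack(attr_maps, dim=0).sum(dim=0)   # sum_i A_i
        return self.phi(s) + s                         # conv block + residual
```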

For the combined attribute feature map $A$, we add supervision from the attributes to preserve the attribute information. Given $N$ attributes, let $n_i$ be the number of classes for the $i$-th attribute; there are in total $\prod_{i=1}^{N} n_i$ attribute patterns. We apply a GAP layer on $A$ to get the feature vector $f_a$. Then, a Triplet Loss $L_{tri}^{attr}$ is used as supervision to pull together the features of the same attribute pattern and push apart the features of different attribute patterns. We name this supervision the Attribute-based Triplet Loss.
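
In practice, the attribute pattern can be encoded as a single label so that a standard triplet loss can operate on it; the mixed-radix encoding below is one possible implementation of this idea, not necessarily the one used in our experiments.

```python
import torch

def attribute_pattern_labels(attr_labels, num_classes):
    """Combine N per-attribute labels into one pattern id for the attribute-based
    triplet loss. attr_labels: list of N (B,) tensors; num_classes: list of N ints.
    Example: 10 colors and 6 types give 60 possible patterns.
    """
    pattern = torch.zeros_like(attr_labels[0])
    for labels_i, n_i in zip(attr_labels, num_classes):
        pattern = pattern * n_i + labels_i   # mixed-radix encoding
    return pattern

# usage sketch: f_a = A.mean(dim=(2, 3)); loss = triplet_loss(f_a, pattern)
```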

Joint Feature $J$. To distill V-ReID-helpful attribute information from $A$ to enhance $F$, we use two convolution layers to obtain the distilled feature $D$ as

$D = \phi_2(\phi_1(A))$ (6)

where $\phi_1$ and $\phi_2$ are implemented similarly to $\phi$ but with a different convolutional kernel size, and $D \in \mathbb{R}^{H \times W \times C}$.

By adding $D$ onto the V-ReID feature $F$, we have the Joint Feature $J$ as

$J = F + D$ (7)

$J$ combines the V-ReID information from $F$ and the relevant V-ReID-helpful information from the attributes in $D$. Similar to the supervision on $f_r$, we add a Triplet Loss and a Cross Entropy Loss on the spatially average pooled feature $f_j$, where $f_j$ is obtained as

$f_j = W_j \, \mathrm{GAP}(J) + b_j$ (8)

where $W_j$ and $b_j$ represent the weights and bias of an FC layer, and $f_j \in \mathbb{R}^{d_r}$, with $d_r$ the predefined dimension of the output. JM is trained by minimizing

$L_{JM} = L_{tri}(f_j) + L_{ce}(f_j) + \beta \, L_{tri}^{attr}(f_a)$ (9)

where $\beta$ is a hyperparameter balancing the importance of the compensated V-ReID loss and the attribute-related loss.
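
A compact sketch of the Joint Module, covering Equations (6) to (8); the 3x3 kernel used for $\phi_1$ and $\phi_2$ and the output sizes are assumptions made for illustration.

```python
import torch.nn as nn

def conv_block(channels, kernel_size=3):
    """phi_k: convolution + BN + ReLU; the 3x3 kernel size is an assumption."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True))

class JointModule(nn.Module):
    """D = phi_2(phi_1(A)) (Eq. 6), J = F + D (Eq. 7), f_j from GAP + FC (Eq. 8)."""
    def __init__(self, channels=2048, d_r=512, num_ids=576):
        super().__init__()
        self.distill = nn.Sequential(conv_block(channels), conv_block(channels))
        self.fc = nn.Linear(channels, d_r)         # W_j, b_j
        self.classifier = nn.Linear(d_r, num_ids)  # identity logits for L_ce(f_j)

    def forward(self, F, A):
        D = self.distill(A)                        # V-ReID-helpful attribute feature
        J = F + D                                  # compensated feature map
        f_j = self.fc(J.mean(dim=(2, 3)))          # GAP + FC
        return f_j, self.classifier(f_j), D, J
```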

Finally, we can train the entire ANet end-to-end by minimizing

$L_{ANet} = L_{VAN} + \gamma \, L_{JM}$ (10)

where $\gamma$ is a hyperparameter to balance the importance of $L_{VAN}$ and $L_{JM}$.

Amelioration Constraint. To further boost the capabilities of the network, we define the Amelioration Constraint (AC). AC aims to explicitly encourage $f_j$ to be more discriminative than $f_r$. We apply AC separately to the cross entropy loss and the triplet loss.

AC for Cross Entropy Loss: For an image $I$, we define it as

$L_{AC}^{ce} = g\big(L_{ce}(f_j) - L_{ce}(f_r)\big)$ (11)

where $g(\cdot)$ is a monotonically increasing function that helps to reduce the optimization difficulty by avoiding negative values (jin2020style), and $L_{ce}(f_j)$ and $L_{ce}(f_r)$ represent the identity cross entropy loss with respect to the features $f_j$ and $f_r$, respectively. Minimizing $L_{AC}^{ce}$ encourages the network to have a lower classification error for $f_j$ than for $f_r$.

AC for Triplet Loss: We seek $f_j$ to represent an enhanced version of $f_r$, where $f_j$ has a higher discriminative capability than $f_r$. Thus, we encourage the feature distance between an anchor sample/image and a positive sample to be smaller w.r.t. feature $f_j$ than feature $f_r$. Similarly, we encourage the feature distance between an anchor sample/image and a negative sample to be larger w.r.t. feature $f_j$ than feature $f_r$. Then, AC for the triplet loss is defined as

$L_{AC}^{tri} = g\big(d(f_j^{a}, f_j^{p}) - d(f_r^{a}, f_r^{p})\big) + g\big(d(f_r^{a}, f_r^{n}) - d(f_j^{a}, f_j^{n})\big)$ (12)

where $d(\cdot,\cdot)$ is the feature distance and the superscripts $a$, $p$ and $n$ denote the anchor, positive and negative samples, respectively.
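
The two AC terms can be sketched as below; softplus is used here as the non-negative, monotonically increasing function g(.), and the anchor/positive/negative index tensors are assumed to come from an external triplet miner, both assumptions made for illustration.

```python
import torch.nn.functional as F_nn

def ac_cross_entropy(logits_j, logits_r, id_labels):
    """AC for cross entropy (Eq. 11): f_j should classify the identity at least as
    well as f_r; softplus keeps the penalty non-negative."""
    ce_j = F_nn.cross_entropy(logits_j, id_labels)
    ce_r = F_nn.cross_entropy(logits_r, id_labels)
    return F_nn.softplus(ce_j - ce_r)

def ac_triplet(f_j, f_r, anchor, positive, negative):
    """AC for triplet (Eq. 12): anchor-positive distances should shrink and
    anchor-negative distances should grow when moving from f_r to f_j."""
    def d(x, a, b):                                   # squared Euclidean distance
        return (x[a] - x[b]).pow(2).sum(dim=1)
    ap = F_nn.softplus(d(f_j, anchor, positive) - d(f_r, anchor, positive))
    an = F_nn.softplus(d(f_r, anchor, negative) - d(f_j, anchor, negative))
    return (ap + an).mean()
```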

We notice that training with the AC losses in an end-to-end manner leads to unstable learning. Thus, we follow two steps in training. In the first step, we minimize $L_{ANet}$. In the second step, we freeze the backbone (i.e., all operations before $F$) and minimize $L_{AC}$. Compared with $L_{ANet}$ in (10), the AC losses are enabled and the losses on feature $f_r$ are disabled in $L_{AC}$, as

$L_{AC} = \lambda \sum_{i=1}^{N} L_{ce}^{a_i} + \gamma \, L_{JM} + L_{AC}^{ce} + L_{AC}^{tri}$ (13)
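
The second training step can be sketched as freezing the backbone parameters and rebuilding the optimizer over the remaining modules; the module name `backbone` and the learning rate reset below are hypothetical choices for illustration.

```python
import torch

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def start_second_step(model, base_lr=0.0006):
    """Step 2: freeze everything before F and minimize L_AC (Eq. 13).
    `model.backbone` is a hypothetical attribute name for the layers producing F."""
    set_requires_grad(model.backbone, False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=base_lr, amsgrad=True)
```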

4 Experiments

In this section, we present the datasets used in our experiments, the implementation details, an ablation study and a comparison against the state of the art to validate our proposed method.

4.1 Datasets

We evaluate our vehicle re-identification method on three challenging benchmark datasets.

  • VeRi776 (liu2017provid): It contains over 50,000 images of 776 vehicles captured by 20 camera views. It includes attribute labels for color and type. It uses 576 vehicles for training and 200 vehicles for testing.

  • VeRi-Wild (lou2019veri): This is the largest vehicle re-identification dataset, with 174 camera views, 416,314 images and 40,671 IDs. It includes attribute labels for vehicle model, color and type. The testing set is divided into three sets with 3,000 (small), 5,000 (medium) and 10,000 (large) IDs. This is the most challenging dataset because the images were captured over a period of one month and include severe changes in background, illumination and viewpoint, as well as occlusions.

                        VeRi776        Vehicle-ID                                     VeRi-Wild
                                       Small         Medium        Large              Small         Medium        Large
    Method              mAP    R1      R1     R5     R1     R5     R1     R5          mAP    R1     mAP    R1     mAP    R1
    Baseline            78.1   96.1    81.3   94.4   77.7   90.6   75.8   88.5        78.1   94.6   72.2   92.5   64.0   88.7
    VAN ($f_r$)         78.1   96.6    84.1   96.5   80.4   93.6   78.4   91.8        83.1   94.5   78.3   93.5   70.6   90.0
    VAN ($f_{cat}$)     77.3   96.5    81.5   95.0   78.5   92.0   76.3   89.6        81.9   94.1   76.9   93.1   69.2   89.4
    ANet ($f_j$) w/o AC 79.8   96.9    85.0   96.7   80.9   94.1   79.0   91.8        84.6   96.1   79.9   94.4   72.9   91.5
    ANet ($f_j$)        80.1   96.9    86.0   97.4   81.9   95.1   79.6   92.7        85.8   95.9   81.0   94.5   73.9   91.6
    Table 1: Ablation study on the effectiveness of our designs. The feature vector used for testing is indicated in parentheses; $f_{cat}$ denotes the concatenation of $f_r$ and the attribute features $f_{a_i}$.
                        VeRi776        Vehicle-ID     VeRi-Wild
    Method              mAP    R1      R1     R5      mAP    R1
    fc                  76.7   95.8    83.3   96.0    82.1   94.3
    att                 78.1   96.6    84.1   96.5    83.1   94.5
    Table 2: Comparison of implementation choices for the attribute branches of the attribute-based baseline. fc represents an implementation using fully connected layers and att an implementation using SE attention blocks. Results for Vehicle-ID and VeRi-Wild are reported on their small-scale test sets.
  • Vehicle-ID (liu2016deep): It includes 221,763 images of 26,267 vehicles, captured from either front or back views. The training set contains 110,178 images of 13,134 vehicles and the test set contains 111,585 images of 13,133 vehicles. The testing data is further divided into three sets with 800 (small), 1,600 (medium) and 2,400 (large) vehicles. Some images in this dataset have attribute labels for vehicle color and type, but not all of them.

For the first two datasets, the evaluation protocol is based on mean Average Precision (mAP) and the Cumulative Matching Curve (CMC) at rank-1 (R1) and rank-5 (R5), as they have fixed gallery and query sets. For Vehicle-ID, we follow the protocol proposed by the authors of the dataset, which randomly chooses one image of each vehicle ID for the gallery and uses the rest as queries. The final R1 and R5 results are reported after repeating this process 10 times.
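
One trial of the Vehicle-ID protocol can be sketched as follows; the (image, id) list format is an assumption, and rank-1/rank-5 are averaged over 10 such trials.

```python
import random
from collections import defaultdict

def split_vehicleid_trial(samples, seed=0):
    """For every vehicle ID, one random image goes to the gallery and the rest
    become queries. `samples` is a list of (image_path, vehicle_id) pairs."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for path, vid in samples:
        by_id[vid].append(path)
    gallery, query = [], []
    for vid, paths in by_id.items():
        g = rng.choice(paths)
        gallery.append((g, vid))
        query.extend((p, vid) for p in paths if p != g)
    return query, gallery
```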

4.2 Implementation Details

We follow other works in the literature to implement the backbone for a fair comparison. We use a modified version of ResNet-50 (he2016deep) with Instance-Batch Normalization (pan2018two) and remove the last pooling layer to obtain the feature map $F$ for an image $I$. Each attention module is based on SE (hu2018squeeze) with a reduction ratio of 16. For the FC layers, we fix the output dimensions $d_r$ and $d_a$.

We use a cross entropy loss with label smoothing regularization (szegedy2016rethinking) and a triplet loss with hard positive-negative mining (hermans2017defense), following the Bag-of-Tricks (luo2019bag). For simplicity, we set the balancing hyperparameters $\lambda$, $\beta$ and $\gamma$ so as to give the same importance to all branches in the network.

In one of the datasets, not all input images have attribute labels. For these samples, we simply do not backpropagate the losses from the attribute branches, i.e., $L_{ce}^{a_i}$ and $L_{tri}^{attr}$. We found this works well since we use a batch size of 512 (4 images per ID) and the missing labels are alleviated by the other IDs in the batch. Note that these missing labels do not affect the ReID losses $L_{tri}$ and $L_{ce}$, so ANet can still learn from those cases.
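
A simple way to skip the attribute losses for unlabeled samples is to mask them out before computing the cross entropy, as sketched below; encoding missing labels as -1 is an assumption made here for illustration.

```python
import torch.nn.functional as F_nn

def masked_attribute_ce(logits, labels, ignore_index=-1):
    """Attribute cross entropy that ignores samples without an attribute label.
    If no sample in the batch is labeled, the term contributes zero gradient."""
    valid = labels != ignore_index
    if valid.sum() == 0:
        return logits.sum() * 0.0   # keeps the graph, contributes no gradient
    return F_nn.cross_entropy(logits[valid], labels[valid])
```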

The input images are resized to 256×256 pixels and augmented by random horizontal flipping, random zooming and random input erasing (ghiasi2018dropblock; torchreid; zhou2019osnet; zhou2019learning). All models are trained on 8 V100 GPUs with NVLink for 210 epochs with the AMSGrad optimizer. The initial learning rate is set to 0.0006 and is decayed by 0.1 at epochs 60, 120 and 150. The first learning step minimizes $L_{ANet}$ for the first 150 epochs, then the second step optimizes $L_{AC}$ for the remaining 60 epochs. We use $N = 2$ for all datasets, where we consider vehicle color (e.g., red, yellow, gray) and type (e.g., sedan, truck). During testing, the feature vectors are L2-normalized for matching.
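
The optimizer and learning-rate schedule described above can be configured as follows; this is a sketch of the stated settings, with AMSGrad realized through Adam's `amsgrad` flag.

```python
import torch

def build_optimizer_and_scheduler(model):
    """AMSGrad with initial lr 0.0006, decayed by 0.1 at epochs 60, 120 and 150."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0006, amsgrad=True)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 150], gamma=0.1)
    return optimizer, scheduler

# at test time: f = torch.nn.functional.normalize(f, dim=1) before matching
```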

Figure 4: Comparison of activation maps. The first row shows the input images; the second and third rows show their corresponding activation maps for $A$ (attribute features) and $D$ (attribute features oriented to V-ReID), respectively. The first column is the query image; the second to sixth columns show the vehicles retrieved at rank-1 to rank-5.

4.3 Ablation Study

4.3.1 Effectiveness of using Attributes on V-ReID

We first evaluate the effect of using attributes in V-ReID and show the comparisons in Table 1. Baseline denotes the scheme which generates the feature using only the backbone, without any attribute-related designs. VAN denotes the vanilla scheme that explores attributes as shown in Figure 2, using the same backbone as Baseline. For our VAN, we can use the V-ReID feature $f_r$ in inference (i.e., VAN ($f_r$)), or use the concatenation of $f_r$ and the attribute features (i.e., VAN ($f_{cat}$)). We can see that: 1) VAN ($f_r$), where the attributes regularize the feature learning, outperforms Baseline significantly on Vehicle-ID and VeRi-Wild. Specifically, using attributes improves rank-1 by 0.5% on VeRi776, rank-1 by 2.8% and rank-5 by 3.3% on Vehicle-ID, and mAP by 6.6% and rank-1 by 1.3% on VeRi-Wild; 2) VAN ($f_{cat}$) has lower performance than VAN ($f_r$). This is because not all the attribute information is equally important for V-ReID; allocating the relative contribution of each attribute is needed to obtain satisfactory results. Hence, how to distill task-oriented attribute information to efficiently benefit V-ReID is important, which is what our ANet aims to address.

We use VAN as our attribute-based baseline, which is similar to previous works exploiting vehicle attributes. However, previous works usually use simple FC layers instead of attention blocks for the attribute branches. Using attention facilitates the distillation of attribute features. As shown in Table 2, using attention outperforms using FC layers by 0.8% in rank-1 on Vehicle-ID, and by 1.4% and 1.0% in mAP on VeRi776 and VeRi-Wild, respectively.

Method                          Clues            mAP    R1     R5
PAMAL (tumrani2020partial)      attributes       45.0   72.0   88.8
MADVR (jiang2018multi)          attributes       61.1   89.2   94.7
DF-CVTC (zheng2019attributes)   attributes       61.0   91.3   95.7
PAMTRI (tang2019pamtri)         attributes       71.8   92.8   96.9
AGNet (wang2020attribute)       attributes       71.5   95.6   96.5
SAN (qian2020stripe)            attributes       72.5   93.3   97.1
StRDAN (lee2020strdan)          attributes       76.1   –      –
VAnet (chu2019vehicle)          viewpoint        66.3   89.7   95.9
PRND (he2019part)               veh. parts       74.3   94.3   98.6
UMTS (jin2020uncertainty)       TS               75.9   95.8   –
PCRNet (liu2020beyond)          GN + parsing     78.6   95.4   98.4
SAVER (khorramshahi2020devil)   GAN              79.6   96.4   98.6
PVEN (meng2020parsing)          parsing          79.5   95.6   98.4
HPGN (shen2020exploring)        GN               80.1   96.7   –
VKD (calderararobust)           viewpoint + TS   82.2   95.2   98.0
Baseline                        attributes       78.1   96.1   98.3
ANet (Ours)                     attributes       80.1   97.1   98.6
FastReid (he2020fastreid)       backbone         81.0   97.1   98.3
ANet + FastReid (Ours)          attributes       81.2   96.8   98.4
Table 3: Comparison of our proposed method against the state of the art on VeRi776. The first and second best results are marked by bold and underline, respectively.

4.3.2 ANet: A Superior Way to Distill Attribute Information

We propose ANet to distill attribute information for more effective V-ReID. Here we study the effectiveness of our Joint Module design and the AC losses. Table 1 shows the comparisons. We can see that: (i) our final scheme ANet ($f_j$) significantly outperforms the basic network VAN ($f_r$), by 2.0% in mAP on VeRi776, by 1.9%/1.5%/1.2% in rank-1 on the Small/Medium/Large scales of Vehicle-ID, and by 2.7%/2.7%/3.3% in mAP on the Small/Medium/Large scales of VeRi-Wild; (ii) our proposed AC losses, which encourage higher discrimination after the compensation of the distilled attribute feature than before, are very helpful in promoting the distillation of discriminative information from the attribute features for V-ReID.

These results show that the interaction between the V-ReID and attribute features of VAN improves the network performance, thanks to the distillation of V-ReID-oriented attribute features.

To better understand the effects of ANet, we visualize the activation maps of $A$ and $D$ and show some examples in Figure 4. $A$ encodes generic features of the attributes, where the activations are flatter and do not have a special focus on particular vehicle parts. In contrast, $D$ represents the portion of the information in $A$ that is helpful for V-ReID; we can observe that its activation maps focus more on the vehicle.

                                                 Small          Medium         Large
Method                          Clues            R1     R5      R1     R5      R1     R5
PAMAL (tumrani2020partial)      attributes       67.7   87.9    61.5   82.7    54.5   77.2
AGNet (wang2020attribute)       attributes       71.1   83.7    69.2   81.4    65.7   78.2
DF-CVTC (zheng2019attributes)   attributes       75.2   88.1    72.1   84.3    70.4   82.1
SAN (qian2020stripe)            attributes       79.7   94.3    78.4   91.3    75.6   88.3
PRND (he2019part)               veh. parts       78.4   92.3    75.0   88.3    74.2   86.4
SAVER (khorramshahi2020devil)   GAN              79.9   95.2    77.6   91.1    75.3   88.3
UMTS (jin2020uncertainty)       TS               80.9   –       78.8   –       76.1   –
PVEN (meng2020parsing)          parsing          84.7   97.0    80.6   94.5    77.8   92.0
PCRNet (liu2020beyond)          GN + parsing     86.6   98.1    82.2   96.3    80.4   94.2
VAnet (chu2019vehicle)          viewpoint        88.1   97.2    83.1   95.1    80.3   92.9
HPGN (shen2020exploring)        GN               89.6   –       79.9   –       77.3   –
Baseline                        attributes       81.3   94.4    77.7   90.6    75.8   88.5
ANet (Ours)                     attributes       86.0   97.4    81.9   95.1    79.6   92.7
FastReid (he2020fastreid)       backbone         85.5   97.4    81.8   95.3    79.9   93.8
ANet + FastReid (Ours)          attributes       87.9   97.8    82.8   96.2    80.5   94.6
Table 4: Comparison of our proposed method against the state of the art on Vehicle-ID. The first and second best results are marked by bold and underline, respectively.
                                                 Small                 Medium                Large
Method                          Clues            mAP    R1     R5      mAP    R1     R5      mAP    R1     R5
UMTS (jin2020uncertainty)       TS               82.8   84.5   –       66.1   79.3   –       54.2   72.8   –
HPGN (shen2020exploring)        GN               80.4   91.3   –       75.1   88.2   –       65.0   82.6   –
PCRNet (liu2020beyond)          GN + parsing     81.2   92.5   –       75.3   89.6   –       67.1   85.0   –
SAVER (khorramshahi2020devil)   GAN              80.9   94.5   98.1    75.3   92.7   97.4    67.7   89.5   95.8
PVEN (meng2020parsing)          parsing          82.5   96.7   99.2    77.0   95.4   98.8    69.7   93.4   97.8
Baseline                        attributes       78.1   94.6   98.5    72.2   92.5   97.3    64.0   88.7   95.6
ANet (Ours)                     attributes       85.8   95.9   99.0    81.0   94.5   98.1    73.9   91.6   96.7
FastReid (he2020fastreid)       backbone         84.8   95.7   98.9    80.0   94.5   98.1    73.2   91.5   96.7
ANet + FastReid (Ours)          attributes       86.9   96.5   99.2    82.5   95.2   98.3    75.9   92.5   97.2
Table 5: Comparison of our proposed method against the state of the art on VeRi-Wild. The first and second best results are marked by bold and underline, respectively.

4.4 Comparison with State-of-the-Art Methods

We compare our method with approaches that also use attribute information (zheng2019attributes; jiang2018multi; tang2019pamtri; wang2020attribute; qian2020stripe; lee2020strdan). We also compare it with the most recent approaches that leverage other clues/techniques, such as vehicle parsing maps (meng2020parsing), vehicle parts (zhang2019part; he2019part), GANs (khorramshahi2020devil), Teacher-Student (TS) distillation (jin2020uncertainty; calderararobust), camera viewpoints (chu2019vehicle; calderararobust), and Graph Networks (GN) (shen2020exploring; liu2020beyond). HPGN creates a pyramid of spatial graph networks to explore the spatial significance of the backbone tensor. PCRNet studies the correlation between parsed vehicle parts through a graph network. VAnet (chu2019vehicle) learns two metrics, one for similar viewpoints and one for different viewpoints, in two feature spaces.

We also compare against FastReid (he2020fastreid), a strong baseline network for re-identification that performs an extensive search of hyperparameters and augmentation methods and uses several architecture design tricks to achieve excellent performance. We also implement our design on top of it by taking it as our backbone, which we name ANet + FastReid. Note that the reported results of FastReid were obtained by running their released code.

Tables 3,  4 and  5 show the comparisons on VeRi776, Vehicle-ID, and VeRi-Wild, respectively.

VeRi776. Compared with the attribute-based methods (first group in Table 3), our schemes outperform the best results in this group by 5.1% in mAP and by 1.5% at rank-1 and rank-5. Compared with methods that do not use attributes, ours achieves the second-best mAP and the best rank-1 and rank-5. VKD (calderararobust) is better than ours in mAP but inferior at rank-1 and rank-5; VKD uses camera labels during training to become viewpoint-invariant and trains a model based on the Teacher-Student framework.

Vehicle-ID. Our method outperforms the attribute-based methods (first group in Table 4) consistently. At rank-1, our scheme ANet + FastReid outperforms the best attribute-based method by 8.2%, 4.4% and 4.9% on the small, medium and large scales, respectively. When compared with methods using other clues, ours achieves the best results on the large set and competitive performance on the other sets.

VeRi-Wild. Previous attribute-based methods have not reported results on this recent dataset. From Table 5, we can see that our schemes ANet and ANet + FastReid achieve the best performance in mAP. PVEN (meng2020parsing), a method based on semantic parsing that describes each vehicle view and region, obtains better rank-1/rank-5 results, but it is not as competitive as on the two previous datasets.

We observed that none of the existing methods consistently achieves the best results on all the datasets. This may be because different datasets pose different main challenges. Our proposed ANet shows a more consistent state-of-the-art performance across all the datasets, thanks to the generic nature of attributes for V-ReID.

5 Conclusions

In this work, we proposed ANet, a novel framework to leverage attribute information for vehicle re-identification. ANet addresses the lack of interaction between the V-ReID features and attribute features in previous methods. In particular, we encourage the network to distill task-oriented information from the attribute branches and compensate it into the global V-ReID feature to enhance its discrimination capability. Evaluations on three datasets show the effectiveness of our method.

Acknowledgments

This work was done while the first author was affiliated with Microsoft Corp. We are thankful to Microsoft Research, the São Paulo Research Foundation (FAPESP grant #2017/12646-3), the National Council for Scientific and Technological Development (CNPq grant #309330/2018-1) and the Coordination for the Improvement of Higher Education Personnel (CAPES) for their financial support.

References