
Viewpoint-Aware Channel-Wise Attentive Network for Vehicle Re-Identification

Vehicle re-identification (re-ID) matches images of the same vehicle across different cameras. It is fundamentally challenging because the dramatic appearance changes caused by different viewpoints can make a framework fail to match two images of the same identity. Most existing works address the problem by extracting viewpoint-aware features via a spatial attention mechanism, which, however, usually suffers from noisy generated attention maps or otherwise requires expensive keypoint labels to improve the quality. In this work, we propose the Viewpoint-aware Channel-wise Attention Mechanism (VCAM), which approaches attention from a different perspective. Our VCAM enables the feature learning framework to channel-wisely reweigh the importance of each feature map according to the "viewpoint" of the input vehicle. Extensive experiments validate the effectiveness of the proposed method and show that we perform favorably against state-of-the-art methods on the public VeRi-776 dataset and obtain promising results in the 2020 AI City Challenge. We also conduct further experiments to demonstrate the interpretability of how our VCAM practically assists the learning framework.


1 Introduction

Vehicle re-identification (re-ID) aims to match images of the same vehicle captured by a camera network. Recently, this task has drawn increasing attention because of its wide applications, such as analyzing and predicting traffic flow. While several existing works have obtained great success with the aid of Convolutional Neural Networks (CNNs) [17, 18, 24], various challenges still hinder the performance of vehicle re-ID. One of them is that a vehicle captured from different viewpoints usually has dramatically different visual appearances. To reduce this intra-class variation, some works [25, 11, 34] guide the feature learning framework with a spatial attention mechanism to extract viewpoint-aware features at meaningful spatial locations. However, the underlying drawback is that the capability of the learned network usually suffers from noisy generated spatial attention maps. Moreover, more powerful spatial attentive models may rely on expensive pixel-level annotations, such as vehicle keypoint labels, which are impractical in real-world scenarios. In view of the above observations, we choose to explore another type of attention mechanism in our framework, one that relates only to high-level vehicle semantics.

Recently, a number of works have adopted the channel-wise attention mechanism [8, 3, 26, 29] and achieved great success in several different tasks. Since a channel-wise feature map is essentially a detector of a corresponding semantic attribute, channel-wise attention can be viewed as the process of selecting semantic attributes which are meaningful or potentially helpful for achieving the goal. Such a characteristic is favorable in the task of vehicle re-ID. Specifically, channel-wise feature maps usually act as detectors of discriminative vehicle parts, such as the rear windshield or tires. Considering that vehicle parts are not always clearly visible in the image, with the aid of a channel-wise attention mechanism, the framework should learn to assign larger attentive weights to, and consequently emphasize, the channel-wise feature maps extracted from the parts that are visible in the image. Nonetheless, the typical implementation of the channel-wise attention mechanism [8, 3] generates the attentive weight of each stage (specifically, each bottleneck) based on the representation extracted from that stage of the CNN backbone. We find that the lack of semantic information in the low-level representations extracted from the earlier stages may result in undesirable attentive weights, which limits the performance in vehicle re-ID.

As an alternative solution, in this paper, we propose a novel attention mechanism, named the Viewpoint-aware Channel-wise Attention Mechanism (VCAM), which adopts high-level information, the "viewpoint" of the captured image, to generate the attentive weights. The motivation is that the visibility of a vehicle part usually depends on the viewpoint of the vehicle image. As shown in Fig. 1, with our VCAM, the framework successfully focuses on the clearly visible vehicle parts which are relatively beneficial to re-ID matching. Combined with VCAM, our feature learning framework works as follows. For every given image, the framework first estimates the viewpoint of the input vehicle image. Afterwards, based on the viewpoint information, VCAM generates the attentive weight of each channel of the convolutional features. The re-ID feature extraction module is then incorporated with the channel-wise attention mechanism to extract the final viewpoint-aware feature for re-ID matching.

Extensive experiments show that our method outperforms the state-of-the-art on the large-scale vehicle re-ID benchmark VeRi-776 [17, 18] and achieves promising results in the 2020 NVIDIA AI City Challenge (https://www.aicitychallenge.org/), which holds a competition on another large-scale benchmark, CityFlow-ReID [24]. We additionally analyze the attentive weights generated by VCAM in an interpretability study to explain how VCAM helps to solve the re-ID problem in practice. We highlight our contributions as follows:

  • We propose a novel framework which can benefit from channel-wise attention mechanism and extract viewpoint-aware feature for vehicle re-ID matching.

  • To the best of our knowledge, we are the first to show that viewpoint-aware channel-wise attention mechanism can obtain great improvement in the vehicle re-ID problem.

  • Extensive experiments on public datasets increase the interpretability of our method and also demonstrate that the proposed framework performs favorably against state-of-the-art approaches.

Figure 2: Architecture of our proposed framework. 

2 Related Work

Vehicle Re-Identification.

Vehicle re-ID has received increasing attention over the past few years due to the release of large-scale annotated vehicle re-ID datasets, such as the VeRi-776 [17, 18] and CityFlow [24] datasets. As early work, Liu et al. [17] showed the advantage of using CNN models to tackle the vehicle re-ID problem. However, vehicles captured from different viewpoints usually have dramatically different visual appearances, which can impede the model's capability for re-ID matching.

Viewpoint-aware Attention.

To reduce the impact caused by such intra-class variation, numerous works [25, 11, 34, 23, 2, 9, 4] proposed viewpoint-aware feature learning frameworks that adapt to the viewpoint of the input image. Specifically, most of them utilized a "spatial" attention mechanism to extract local features from the regions that are relatively salient. For example, Wang et al. [25] and Khorramshahi et al. [11] generated spatial attention maps for 20 vehicle keypoints to guide their networks to emphasize the most discriminative vehicle parts. While they were the first to show that viewpoint-aware features could aid vehicle re-ID, the required vehicle keypoint labels are expensive to obtain in real-world scenarios. To avoid this problem, Zhou et al. [34] proposed a weakly-supervised viewpoint-aware attention mechanism which generates spatial attention maps for five different viewpoints of a vehicle. Instead of utilizing pixel-level annotations, they only require image-level orientation information for training. However, due to the lack of strong supervision on the generation of attention maps, the attention outcomes may become noisy and affect network learning. Considering the general disadvantages of spatial attention mechanisms mentioned above, we turn to a different aspect of attention to tackle the vehicle re-ID problem.

Channel-wise Attention.

Channel-wise attention can be treated as a mechanism to reassess the importance of each channel of the feature maps. The benefits brought by such a mechanism have been shown across a range of tasks, such as image classification [8], image captioning [3], object detection [26], and image super-resolution [29]. Among existing works, the typical implementation of channel-wise attention reweighs the channel-wise features with attentive weights generated from the representation extracted at each stage of the CNN backbone. However, as mentioned in Sec. 1, the lack of semantic information in the low-level representations extracted from the earlier stages may fail to generate meaningful attentive weights. Accordingly, we exploit the high-level information, the "viewpoint" of the image, to better help the model emphasize the semantically important channel-wise feature maps.

3 Proposed Method

The whole feature learning framework is depicted in Fig. 2. For every given image, a viewpoint estimation module first evaluates the viewpoint of the image and generates the viewpoint vector V. According to V, our viewpoint-aware channel-wise attention mechanism (VCAM) then generates the attentive weights for the channel-wise feature maps extracted from each stage of the re-ID feature extraction module. Specifically, the CNN backbone of the re-ID feature extraction module is divided into several stages, and the attentive weight vector A^s generated by VCAM indicates the importance of the channel-wise feature maps of the intermediate representation extracted from the s-th stage of the re-ID feature extraction module. Finally, the re-ID feature extraction module, combined with the channel-wise attention mechanism, generates the representative feature f used for re-ID matching. We give more details about the viewpoint estimation module in Sec. 3.1, the viewpoint-aware channel-wise attention mechanism (VCAM) in Sec. 3.2, the re-ID feature extraction module in Sec. 3.3, and the overall training procedure of our framework in Sec. 3.4.

3.1 Viewpoint Estimation Module

To better guide the VCAM in generating the attentive weights of the channel-wise feature maps with high-level semantic information, we utilize a viewpoint estimation module to embed the whole image into one representative viewpoint feature V for every input image. To ensure that the feature V explicitly indicates the viewpoint of the image, we first define the target of the viewpoint by two properties of the captured vehicle image: the angle of depression θ and the orientation φ. The angle of depression θ represents the angle between the horizontal line and the line of camera sight. It can be easily obtained from the camera height h and the horizontal distance d between the object and the camera as:

θ = tan⁻¹(h / d).     (1)

The orientation φ indicates the rotation degree of the vehicle (from 0° to 360°). However, we find that the discontinuity of the orientation seriously affects the learning of the viewpoint estimation module. Specifically, for an image with orientation close to 360°, the module would be mistakenly punished by a huge loss when it predicts an orientation close to 0°, even though the angular error between the real and predicted orientation is only a few degrees. As a remedy, sin(φ) and cos(φ) are used to jointly represent the orientation, which guarantees a continuous representation for two similar viewpoints. Overall, the target of the viewpoint feature is defined as:

V_target = [θ, sin(φ), cos(φ)].     (2)

With the target V_target, we then apply the viewpoint loss

L_view = ‖V − V_target‖²,     (3)

which is the mean square error (MSE) between the prediction and the target of the viewpoint feature, to optimize our viewpoint estimation module.
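To make the viewpoint target and loss concrete, the following is a minimal sketch of Eqs. (1)-(3) in PyTorch; it is not the authors' code, and the function names and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def angle_of_depression(height, distance):
    """Eq. (1): angle between the horizontal line and the camera sight line,
    computed from the camera height and the horizontal object-camera distance."""
    return torch.atan2(height, distance)

def viewpoint_target(theta, phi):
    """Eq. (2): build the 3-dim target [theta, sin(phi), cos(phi)] from the
    angle of depression theta and the orientation phi (both in radians)."""
    return torch.stack([theta, torch.sin(phi), torch.cos(phi)], dim=-1)

def viewpoint_loss(v_pred, theta, phi):
    """Eq. (3): MSE between the predicted viewpoint feature V and its target."""
    return F.mse_loss(v_pred, viewpoint_target(theta, phi))
```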

3.2 Viewpoint-aware Channel-wise Attention Mechanism (VCAM)

Based on the viewpoint feature V extracted from the viewpoint estimation module, VCAM generates a set of attentive weights to reassess the importance of each channel-wise feature map. Compared to the typical implementation of the channel-wise attention mechanism, which uses the representations extracted from the stages of the CNN backbone as the reference for generating attentive weights, our VCAM uses the viewpoint information instead; the reason is that we expect the generated channel-wise attentive weight to be positively related to the visibility of the corresponding vehicle part, and that part visibility is usually determined by the viewpoint of the input vehicle image. For example, in Fig. 1, the attentive weight of a channel which acts as a detector of tires should be larger if the side face of the vehicle is clearly captured in the image. All in all, given the viewpoint feature V with only three dimensions, our VCAM generates the attentive weights A^s for the s-th stage by a simple transfer function with one fully-connected (FC) layer:

A^s = σ(W^s V),     (4)

where W^s denotes the parameters of the FC layer and σ(·) refers to the sigmoid function.
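A minimal sketch of Eq. (4) is given below, assuming (as in Sec. 4.2) one FC layer per backbone stage; the class name and the default list of channel widths are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class VCAM(nn.Module):
    """One FC layer + sigmoid per stage, mapping the 3-dim viewpoint
    feature V to channel-wise attentive weights A^s (Eq. 4)."""
    def __init__(self, stage_channels=(256, 512, 1024, 2048)):
        super().__init__()
        self.fcs = nn.ModuleList(nn.Linear(3, c) for c in stage_channels)

    def forward(self, v):
        # v: (batch, 3) viewpoint feature; returns one weight vector per stage.
        return [torch.sigmoid(fc(v)) for fc in self.fcs]
```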

3.3 Re-ID Feature Extraction Module

As shown in Fig. 2, the main purpose of the re-ID feature extraction module is to embed the final representation for re-ID matching with the aid of the channel-wise attention mechanism. Based on the viewpoint-aware attentive weights A generated by VCAM, the module refines the channel-wise features of the representations extracted from its stages. Similar to previous works [8, 3], we use channel-wise multiplication between the feature maps and the attentive weights to implement the channel-wise attention mechanism:

x̃^s = F^s(x^{s−1}) ⊙ A^s,     (5)

where F^s(·) represents the convolution operations of the s-th stage, ⊙ denotes channel-wise multiplication, and x̃^s is the reweighted feature which is fed into the next CNN stage for further feature extraction.

After obtaining the feature extracted from the last stage, denoted x̃^S, the module first applies adaptive pooling to squeeze the feature into a vector. To fully exploit the viewpoint information, the pooled feature is then concatenated with the viewpoint feature V and passed through one fully-connected layer to obtain the final representative feature f used for re-ID matching.
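The sketch below illustrates how the reweighting of Eq. (5) and the final embedding step could be wired together; `stages`, `last_channels`, and `feat_dim` are hypothetical names, and the module is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn

class ReIDExtractor(nn.Module):
    def __init__(self, stages, last_channels=2048, feat_dim=512):
        super().__init__()
        self.stages = nn.ModuleList(stages)        # e.g. four ResNeXt stages
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(last_channels + 3, feat_dim)

    def forward(self, x, attn, v):
        # attn: list of per-stage attentive weights A^s produced by VCAM.
        for stage, a in zip(self.stages, attn):
            x = stage(x) * a[:, :, None, None]     # Eq. (5): channel-wise reweighting
        x = self.pool(x).flatten(1)                # adaptive pooling of the last stage
        return self.fc(torch.cat([x, v], dim=1))   # concatenate viewpoint feature V
```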

3.4 Model Learning Scheme

The learning scheme for our feature learning framework consists of two steps. In the first step, we utilize the large-scale synthetic vehicle image dataset released by Yao et al. [28] to optimize our viewpoint estimation module with the viewpoint loss (L_view) defined in Eq. 3:

L_step1 = L_view.     (6)

In the second step, we jointly fine-tune the viewpoint estimation module and fully optimize the rest of our network, including VCAM and the re-ID feature extraction module, on the target dataset with two common re-ID losses. The first one, for metric learning, is the triplet loss (L_trip) [22]; the other, for discriminative learning, is the identity classification loss (L_ID) [32]. The overall loss is computed as follows:

L_step2 = L_trip + L_ID.     (7)
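A compact sketch of the two-step objective follows; the triplet-loss implementation itself is passed in as a placeholder, and the exact weighting of the terms is an assumption (an unweighted sum, as written in Eq. 7).

```python
import torch.nn.functional as F

def step1_loss(v_pred, v_target):
    """Eq. (6): only the viewpoint loss is used in the first step."""
    return F.mse_loss(v_pred, v_target)

def step2_loss(feat, logits, labels, triplet_loss):
    """Eq. (7): triplet loss plus identity classification loss in the second step."""
    return triplet_loss(feat, labels) + F.cross_entropy(logits, labels)
```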

4 Experiments

4.1 Datasets and Evaluation Metrics

Our framework is evaluated on two benchmarks, VeRi-776 [17, 18] and CityFlow-ReID [24]. The VeRi-776 dataset contains images of 776 different vehicles, split into 576 vehicles with 37,778 images for training and 200 vehicles with 11,579 images for testing. CityFlow-ReID is a subset of images sampled from the CityFlow dataset [24], and it also serves as the competition dataset for Track 2 of the 2020 AI City Challenge. It consists of 36,935 images of 333 identities in the training set and 18,290 images of another 333 identities in the testing set. It has the largest spatial coverage and number of cameras among all existing vehicle re-ID datasets.

As in previous vehicle re-ID works, we employ the standard metrics, namely the cumulative matching curve (CMC) and the mean average precision (mAP) [30], to evaluate the results. We report the rank-1 accuracy (R-1) of the CMC and the mAP for the testing set of both datasets. Note that for the CityFlow-ReID dataset, the listed results are computed with a rank list of size 100 on 50% of the testing set, as displayed by the AI City Challenge Evaluation System.

4.2 Implementation Details

We respectively adopt ResNet-34 [6] and ResNeXt-101 32x8d [27] as the CNN backbones of the viewpoint estimation module and the re-ID feature extraction module (both networks are pre-trained on the ImageNet [5] dataset). For the re-ID feature extraction module, we split the whole ResNeXt-101 into four stages, whose output representations have 256, 512, 1024, and 2048 channels respectively. Hence, the VCAM is composed of four independent networks which all take the 3-dim viewpoint feature V as input and generate attentive weight vectors A of 256, 512, 1024, and 2048 dimensions.
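As a possible (assumed, not the authors') way to obtain these four stages, a torchvision ResNeXt-101 32x8d can be regrouped as follows:

```python
import torch.nn as nn
from torchvision.models import resnext101_32x8d

backbone = resnext101_32x8d(pretrained=True)  # ImageNet pre-trained weights
stages = [
    nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                  backbone.maxpool, backbone.layer1),  # 256-channel output
    backbone.layer2,                                   # 512-channel output
    backbone.layer3,                                   # 1024-channel output
    backbone.layer4,                                   # 2048-channel output
]
```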

For the training process of the feature learning framework, we first optimize the viewpoint estimation module with L_view in advance on the large-scale synthetic vehicle image dataset released by Yao et al. [28], where viewpoint information is available. Afterwards, we optimize the rest of the framework, including VCAM and the re-ID feature extraction module, and fine-tune the viewpoint estimation module (with a 10 times smaller learning rate) with L_step2 on the target dataset. For optimization with the triplet loss (L_trip), we adopt the training strategy of [7], where we sample P different vehicles and K images for each vehicle in a batch of size P×K. In addition, for training with the identity classification loss (L_ID), we adopt a BatchNorm layer [21] and a fully-connected layer to construct the classifier, as in [20, 14]. We choose the SGD optimizer with an initial learning rate of 0.005, decayed by a factor of 10 every 15,000 iterations, and train the network for 40,000 iterations.
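The following sketch shows plausible PyTorch equivalents of the described optimizer schedule and of the P×K identity sampling of [7]; `model` is a stand-in, and the values of P and K are placeholders since they are not given here.

```python
import random
from collections import defaultdict
import torch
import torch.nn as nn

model = nn.Linear(8, 4)  # stand-in for the full framework
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
# Decay the learning rate by a factor of 10 every 15,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15000, gamma=0.1)

def pk_batch(labels, P=16, K=4):
    """Sample P identities and K images per identity (batch size P*K);
    returns dataset indices for one batch."""
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    pids = random.sample(list(by_id), P)
    return [i for pid in pids for i in random.choices(by_id[pid], k=K)]
```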

4.3 Ablation Study

In this section, we conduct experiments on both the VeRi-776 and CityFlow-ReID datasets to assess the effectiveness of our Viewpoint-aware Channel-wise Attention Mechanism (VCAM) and show the results in Table 1. We first train ResNeXt-101 without any attention mechanism as the baseline model and list the result in the first row. We also compare our VCAM with the typical implementation of the channel-wise attention mechanism, listed in the second row. For this experiment, the backbone is replaced with SE-ResNeXt-101 [8], which shares a similar network architecture with ResNeXt-101 except for adding extra SE-blocks, proposed by Hu et al. [8], after each bottleneck block of ResNeXt-101. Compared to the baseline model, performance is boosted in all cases with the help of channel-wise attention. However, while SE-ResNeXt-101 achieves only a limited improvement (+1.7% and +1.6% mAP on VeRi-776 and CityFlow-ReID, respectively), our proposed framework achieves a much larger improvement on both datasets (+7.1% and +9.5% mAP on VeRi-776 and CityFlow-ReID, respectively). This verifies that, by exploiting the viewpoint information, our VCAM generates attentive weights that are more beneficial to re-ID matching than those produced by the typical channel-wise attention mechanism.

Model            VeRi-776        CityFlow-ReID
                 mAP    R-1      mAP    R-1
ResNeXt-101      61.5   93.2     37.3   54.1
SE-ResNeXt-101   63.2   93.8     38.9   55.2
VCAM (Ours)      68.6   94.4     46.8   63.3
Table 1: Ablation study of our proposed VCAM (%).
Method           VeRi-776
                 mAP    R-1    R-5
OIFE [25]        48.0   68.3   89.7
VAMI [34]        50.1   77.0   90.8
RAM [15]         61.5   88.6   94.0
AAVER [11]       61.2   89.0   94.7
GRF-GGL [16]     61.7   89.4   95.0

GSTE [1]         59.5   -      -
EALN [19]        57.4   84.4   94.1
MoV1+BS [12]     67.6   90.2   96.4
MTML [10]        64.6   92.3   95.7
VCAM (Ours)      68.6   94.4   96.9
Table 2: Comparison with state-of-the-art re-ID methods on VeRi-776 (%). Upper group: attentive feature learning methods. Lower group: the others. Note that all listed scores are from methods without adopting spatio-temporal information [18] or extra post-processing such as re-ranking [33].
Figure 3: Distribution of channel-wise attentive weights. We categorize vehicle images into five viewpoints and, for each viewpoint, sample 100 images and plot the average 2048-dim attentive weight vector of the fourth stage, namely A^4. We assign and color each channel with a front, side, or rear vehicle face label according to whether the weight value of the front, side, or rear viewpoint is relatively larger. We can then find that the channels emphasized by our proposed VCAM usually belong to the visible vehicle face(s).

4.4 Comparison with the State-of-the-Arts

We compare our method with existing state-of-the-art methods on the VeRi-776 dataset in Table 2. Previous vehicle re-ID methods can be roughly summarized into three categories: attentive feature learning [25, 34, 15, 11, 16], distance metric learning [19, 1], and multi-task learning [10]. For attentive feature learning, which is the most extensively studied category, the existing methods all adopted a "spatial" attention mechanism to guide the network to focus on regional features that may be useful for distinguishing two vehicles. Nevertheless, unfavorably generated attention masks hinder their re-ID performance on the benchmark. In contrast, our proposed VCAM, which is the first to adopt a channel-wise attention mechanism for vehicle re-ID, achieves a clear gain of 6.9% mAP on the VeRi-776 dataset over GRF-GGL [16], the strongest of the listed spatially attentive methods. This indicates that our framework can fully exploit the viewpoint information and favorably benefit from the channel-wise attention mechanism. Moreover, our proposed framework outperforms the other state-of-the-art methods on the VeRi-776 dataset.

4.5 Interpretability Study and Visualization

While our proposed framework has been empirically shown to improve the performance of the vehicle re-ID task, in this section we further conduct an experiment to illustrate how VCAM practically assists in solving the re-ID task. We first categorize the viewpoint into five classes: front, side, rear, front-side, and rear-side; example images of these five classes are shown in Fig. 3. For each class, we then sample 100 images and compute the average 2048-dim attentive weight vector of the fourth stage, namely A^4. We uniformly select forty channels among the 2048 dimensions and plot the results in Fig. 3. In order to increase readability, we first analyze the attentive weights of the three non-overlapping viewpoints: front, side, and rear. We assign and color each channel with a front, side, or rear vehicle face label if the weight value of the corresponding viewpoint is relatively larger than the other two. Taking the channel highlighted in Fig. 3 as an example, it belongs to the front face and is consequently marked in blue because the attentive weight of the front viewpoint is larger than those of both the side and rear viewpoints. The physical meaning of assigning a vehicle face label to each channel is that the channel-wise feature maps are essentially detectors of vehicle parts, such as the rear windshield and tires as illustrated in Fig. 1, and the visibility of such a vehicle part is usually determined by whether the corresponding face is captured; for example, the presence of the rear windshield in an image depends on whether the rear face is visible. Hence, for each channel, we assign one of the front, side, and rear vehicle face labels.

With this assignment of vehicle face labels, we make the following observation from the experiment results over all five viewpoints. For the attentive weight vector of each viewpoint, the relatively emphasized channels (those with larger attentive weight values) usually belong to the face(s) which can be seen in the image. For example, for images with a front-side viewpoint, VCAM generates larger attentive weights for the channels belonging to the front or side face. Based on this observation, we explain the behavior learned by our VCAM as follows: VCAM usually generates larger weights for the channel-wise feature maps extracted from clearly visible parts, which are potentially beneficial to re-ID matching.
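A rough sketch of this analysis is shown below; `estimate_viewpoint` and `vcam` stand for the trained viewpoint estimation module and VCAM, and the grouping of images by viewpoint class is assumed to be done beforehand.

```python
import torch

def average_attention_by_view(images_by_view, estimate_viewpoint, vcam, stage=3):
    """images_by_view: dict mapping a viewpoint class (e.g. 'front-side') to a
    batch of images; returns the mean fourth-stage weight vector per class."""
    averages = {}
    for view, imgs in images_by_view.items():
        v = estimate_viewpoint(imgs)   # (N, 3) viewpoint features
        a4 = vcam(v)[stage]            # (N, 2048) fourth-stage attentive weights A^4
        averages[view] = a4.mean(dim=0)
    return averages
```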

4.6 Submission on the 2020 AI City Challenge

We also submit our proposed method to the 2020 AI City Challenge, which holds a vehicle re-ID competition on the CityFlow-ReID dataset. As a supplement to our proposed method, we employ some additional techniques for the final submission:

Synthetic dataset and Two-stage Learning

Different from the challenges held in previous years, the organizers released a large-scale synthetic vehicle re-ID dataset which consists of 192,151 images of 1,362 identities. All images of the synthetic dataset are generated by a vehicle generation engine called VehicleX, proposed by Yao et al. [28], which enables users to edit attributes such as the color and type of vehicle, illumination, and viewpoint to generate a desired synthetic dataset. With this engine, the attributes of synthetic images can be obtained easily without manual annotation, which would otherwise require considerable or even prohibitive effort. In this paper, we exploit the viewpoint information of the synthetic dataset to train the viewpoint estimation module and its identity information to enhance the learning of the re-ID framework. To better utilize the identity information of the large-scale auxiliary dataset (here, the synthetic dataset), we adopt the two-stage learning strategy proposed by Zheng et al. [31] as our training scheme. The framework is first trained with the auxiliary dataset; when the learning converges, the classification FC layer used for training is replaced by a new one, and the framework is subsequently trained with the target dataset. Based on the results displayed on the AI City Challenge Evaluation System, with the help of the large-scale auxiliary dataset, we achieve an improvement of 5.3% mAP on the validation set of CityFlow-ReID (from 46.8% to 52.1%).
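The classifier swap at the heart of this two-stage strategy can be sketched as follows, under the simplifying assumption that the classifier is a single FC layer as described in Sec. 4.2:

```python
import torch.nn as nn

def swap_classifier(classifier_fc: nn.Linear, num_target_ids: int) -> nn.Linear:
    """Replace the classification FC layer trained on the auxiliary (synthetic)
    dataset with a freshly initialized one sized for the target identities."""
    return nn.Linear(classifier_fc.in_features, num_target_ids)
```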

Track-based Feature Compression and Re-ranking

Track-based feature compression was first proposed by Liu et al. [13]. It is an algorithm for a video-based inference scheme that exploits the additional tracking information of each image. The whole algorithm includes two steps: merge and decompress. First, all image features of the same track in the gallery are merged into one summarized feature vector by average pooling to represent their video track. Then, in the decompression step, the summarized feature vector is directly used as the representative feature for all images belonging to that video track. With track-based feature compression, the rank list can be refined with the help of tracking information during inference. Finally, we perform the typical re-ID scheme to rank the modified image features in the gallery according to the query image feature and adopt the k-reciprocal re-ranking method proposed by Zhong et al. [33] to re-rank our re-ID results. Benefiting from track-based feature compression and the re-ranking strategy, we gain another improvement of 5.6% mAP on the validation set of CityFlow-ReID (from 52.1% to 57.7%).
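A minimal sketch of the merge/decompress steps is given below; the variable names are illustrative, and the original implementation in [13] may differ.

```python
import torch

def track_feature_compression(gallery_feats, track_ids):
    """gallery_feats: (N, D) tensor; track_ids: length-N list of track labels.
    Each gallery feature is replaced by the average feature of its track."""
    compressed = gallery_feats.clone()
    for tid in set(track_ids):
        idx = [i for i, t in enumerate(track_ids) if t == tid]
        merged = gallery_feats[idx].mean(dim=0)  # merge step: average pooling
        compressed[idx] = merged                 # decompress step: broadcast back
    return compressed
```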

Different from the validation results listed above, the score of our final submission to Track 2 of the 2020 AI City Challenge is calculated on 100% of the testing set. With our VCAM and the techniques mentioned above, our final submission is evaluated by mAP at a rank list size of 100 (rank100-mAP) and ranks competitively among all participating teams.

5 Conclusion

In this paper, we present a novel Viewpoint-aware Channel-wise Attention Mechanism (VCAM), which is the first to adopt a channel-wise attention mechanism for the task of vehicle re-ID. Our newly-designed VCAM adequately leverages the viewpoint information of the input vehicle image and accordingly reassesses the importance of each channel, which is shown to be more beneficial to re-ID matching. Extensive experiments are conducted to increase the interpretability of VCAM and also show that our proposed method performs favorably against existing vehicle re-ID works.

Acknowledgment

This research was supported in part by the Ministry of Science and Technology of Taiwan (MOST 108-2633-E-002-001), National Taiwan University (NTU-108L104039), Intel Corporation, Delta Electronics and Compal Electronics.

References

  • [1] Y. Bai, Y. Lou, F. Gao, S. Wang, Y. Wu, and L. Duan (2018) Group-sensitive triplet embedding for vehicle reidentification. IEEE Transactions on Multimedia 20 (9), pp. 2385–2399. Cited by: §4.4, Table 2.
  • [2] H. Chen, B. Lagadec, and F. Bremond (2019)

    Partition and reunion: a two-branch neural network for vehicle re-identification

    .
    In Proc. CVPR Workshops, pp. 184–192. Cited by: §2.
  • [3] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    ,
    pp. 5659–5667. Cited by: §1, §2, §3.3.
  • [4] Y. Chen, L. Jing, E. Vahdani, L. Zhang, M. He, and Y. Tian (2019) Multi-camera vehicle tracking and re-identification on ai city challenge 2019. In Proc. CVPR Workshops, Cited by: §2.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.2.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.2.
  • [7] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv 1703.07737. Cited by: §4.2.
  • [8] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1, §2, §3.3, §4.3.
  • [9] P. Huang, R. Huang, J. Huang, R. Yangchen, Z. He, X. Li, and J. Chen (2019) Deep feature fusion with multiple granularity for vehicle re-identification. In Proc. CVPR Workshops, pp. 80–88. Cited by: §2.
  • [10] A. Kanaci, M. Li, S. Gong, and G. Rajamanoharan (2019) Multi-task mutual learning for vehicle re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, pp. 62–70. Cited by: §4.4, Table 2.
  • [11] P. Khorramshahi, A. Kumar, N. Peri, S. S. Rambhatla, J. Chen, and R. Chellappa (2019) A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6132–6141. Cited by: §1, §2, §4.4, Table 2.
  • [12] R. Kuma, E. Weill, F. Aghdasi, and P. Sriram (2019) Vehicle re-identification: an efficient baseline using triplet embedding. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–9. Cited by: Table 2.
  • [13] C. Liu, M. Lee, C. Wu, B. Chen, T. Chen, Y. Hsu, S. Chien, and N. I. Center (2019) Supervised joint domain learning for vehicle re-identification. In Proc. CVPR Workshops, pp. 45–52. Cited by: §4.6.
  • [14] C. Liu, C. Wu, Y. F. Wang, and S. Chien (2019) Spatially and temporally efficient non-local attention network for video-based person re-identification. In British Machine Vision Conference (BMVC), Cited by: §4.2.
  • [15] X. Liu, S. Zhang, Q. Huang, and W. Gao (2018) Ram: a region-aware deep model for vehicle re-identification. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §4.4, Table 2.
  • [16] X. Liu, S. Zhang, X. Wang, R. Hong, and Q. Tian (2019) Group-group loss-based global-regional feature learning for vehicle re-identification. IEEE Transactions on Image Processing 29, pp. 2638–2652. Cited by: §4.4, Table 2.
  • [17] X. Liu, W. Liu, H. Ma, and H. Fu (2016) Large-scale vehicle re-identification in urban surveillance videos. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1, §1, §2, §4.1.
  • [18] X. Liu, W. Liu, T. Mei, and H. Ma (2016)

    A deep learning-based approach to progressive vehicle re-identification for urban surveillance

    .
    In European Conference on Computer Vision (ECCV), pp. 869–884. Cited by: §1, §1, §2, §4.1, Table 2.
  • [19] Y. Lou, Y. Bai, J. Liu, S. Wang, and L. Duan (2019) Embedding adversarial learning for vehicle re-identification. IEEE Transactions on Image Processing 28 (8), pp. 3794–3807. Cited by: §4.4, Table 2.
  • [20] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, Cited by: §4.2.
  • [21] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019) Bag of tricks and a strong baseline for deep person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, Cited by: §4.2.
  • [22] E. Ristani and C. Tomasi (2018) Features for multi-target multi-camera tracking and re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6036–6046. Cited by: §3.4.
  • [23] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, and X. Yang (2019) Pamtri: pose-aware multi-task learning for vehicle re-identification using highly randomized synthetic data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 211–220. Cited by: §2.
  • [24] Z. Tang, M. Naphade, M. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J. Hwang (2019) Cityflow: a city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8797–8806. Cited by: §1, §1, §2, §4.1.
  • [25] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang (2017) Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 379–387. Cited by: §1, §2, §4.4, Table 2.
  • [26] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §2.
  • [27] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §4.2.
  • [28] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon (2019) Simulating content consistent vehicle datasets with attribute descent. arXiv preprint arXiv:1912.08855. Cited by: §3.4, §4.2, §4.6.
  • [29] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301. Cited by: §1, §2.
  • [30] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124. Cited by: §4.1.
  • [31] Z. Zheng, T. Ruan, Y. Wei, and Y. Yang (2019) VehicleNet: learning robust feature representation for vehicle re-identification. In Proc. CVPR Workshops, Cited by: §4.6.
  • [32] Z. Zheng, L. Zheng, and Y. Yang (2018) A discriminatively learned cnn embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (1), pp. 13. Cited by: §3.4.
  • [33] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327. Cited by: §4.6, Table 2.
  • [34] Y. Zhou and L. Shao (2018) Viewpoint-aware attentive multi-view inference for vehicle re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6489–6498. Cited by: §1, §2, §4.4, Table 2.