
Camera-Tracklet-Aware Contrastive Learning for Unsupervised Vehicle Re-Identification

Recently, vehicle re-identification methods based on deep learning have achieved remarkable results. However, these results require large-scale and well-annotated datasets. In constructing such a dataset, assigning globally available identities (Ids) to vehicles captured by a great number of cameras is labour-intensive, because annotators need to consider subtle appearance differences and viewpoint variations. In this paper, we propose camera-tracklet-aware contrastive learning (CTACL), which uses multi-camera tracklet information without vehicle identity labels. The proposed CTACL divides an unlabelled domain, i.e., the entire set of vehicle images, into multiple camera-level subdomains and conducts contrastive learning within and beyond the subdomains. The positive and negative samples for contrastive learning are defined using the tracklet Ids of each camera. Additionally, domain adaptation across camera networks is introduced to improve the generalisation performance of the learnt representations and to alleviate the performance degradation resulting from the domain gap between the subdomains. We demonstrate the effectiveness of our approach on video-based and image-based vehicle Re-ID datasets. Experimental results show that the proposed method outperforms recent state-of-the-art unsupervised vehicle Re-ID methods. The source code for this paper is publicly available.


I Introduction

Vehicle re-identification (re-id) is the task of identifying vehicles of the same identity across cameras. It is an essential procedure in modern intelligent traffic management systems. The primary challenge for vehicle re-id is to derive a robust representation model that can cover the various details of captured vehicle images and distinguish the differences among them. In the past few years, with the rise of deep learning, supervised methods based on deep learning have improved vehicle re-id performance dramatically [19, 18, 31, 30, 29, 24]. Those methods achieve outstanding performance compared with earlier hand-crafted feature-based methods [15, 38]. However, supervised methods require a well-labelled dataset, which may not be available at a large scale.

To overcome the dependence on labelled datasets, various unsupervised learning approaches have been proposed [12, 10, 22, 33]. The predominant approach is transfer learning or domain adaptation (DA), which trains a model on a pre-labelled dataset (a.k.a. source domain) before adapting it to unlabelled datasets (a.k.a. target domain) [12, 10, 22]. However, those approaches still require laborious annotations for the source domain and may fail when the domain gap is large [31].

Recently, fully unsupervised re-id methods [5, 6, 27, 17], which employ pseudo-label generation to create supervisory signals for training their models, have been proposed. However, those approaches methodologically cannot perfectly filter false-positive predictions during the pseudo-label generation, which can significantly degrade re-id performance.

Therefore, in this paper, we present camera-tracklet-aware contrastive learning (CTACL) to boost the performance of vehicle re-id without explicit vehicle identity labels or a pre-labelled source dataset. CTACL uses only camera and tracklet identities, which reduces the risk of false-positive predictions. Camera Ids are part of the general information contained in re-id datasets, and tracklets are a cost-free by-product, since tracking is commonly used when collecting person or vehicle images in re-id studies [31, 20, 37, 35]. Thus, this problem setting is reasonable.

When unlabelled vehicle images, camera Ids, and tracklet Ids are given, we divide the entire image set into camera-level subdomains and conduct contrastive learning within each subdomain. Positive and negative samples for contrastive learning are decided using the tracklet Ids. A camera-tracklet-aware memory (CTAM) is introduced to store and manage the extracted features and the Ids. Additionally, as the proposed contrastive learning mainly operates within each subdomain, we introduce DA across cameras to prevent the model from learning a camera-specific representation.

Our model is mainly evaluated on the VVeRI-901 dataset [35], which is the first video-based vehicle re-id evaluation benchmark. Additionally, we reorganise image-based vehicle re-id datasets, i.e., the VeRi-776 dataset [19] and the VeRi-Wild dataset [31], to simulate the video vehicle re-id scenario for the evaluation. Compared with existing state-of-the-art unsupervised vehicle re-id methods, including DA-based methods [12, 10, 22], our method achieves state-of-the-art vehicle re-id performance by large margins. Our method achieves rank-1 accuracies of 89.3 and 38.2 on the VeRi-776 and VVeRI-901 datasets, respectively. These are improvements of 11.9 and 9.6 percentage points over the best-performing competitors: the second-ranked methods are VACP-DA [36] on VeRi-776 with a rank-1 accuracy of 77.4, and SSL [17] on VVeRI-901 with a rank-1 accuracy of 28.6. Consequently, the proposed method demonstrates that it can provide promising performance without any type of labelled data.

Fig. 1: The training process of the proposed CTACL and DA using the CTAM. The positive and negative samples of the CTACL for each subdomain are deterministically defined using the tracklet Id within the subdomain specified by the camera Id. In this process, to improve the generalisation performance of the model, potential positive samples located beyond the subdomain are added through positive sample mining. In addition, the DA across cameras is performed to explicitly improve the generalisation performance of the model.

II Preliminary

II-A Contrastive learning

Contrastive learning aims to learn an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Triplet loss [23], noise-contrastive estimation (NCE) [7], and InfoNCE [21] are well-known contrastive learning approaches. Recently, contrastive learning combined with self-supervised tasks has been shown to be powerful in learning a robust representation model, with impressive performance in various visual recognition tasks [2, 26, 13].

With an encoder $f_\theta(\cdot)$ and a single batch of images $\{x_i\}_{i=1}^{N}$, where $N$ is the number of images in the batch, contrastive learning is performed in a self-supervised manner (e.g., [2, 26, 11]), and the loss is defined as follows:

$$\mathcal{L}_{SSCL} = -\sum_{i=1}^{N} \log \frac{\exp(z_i \cdot z_{j(i)} / \tau)}{\sum_{k \neq i} \exp(z_i \cdot z_k / \tau)} \tag{1}$$

Here, $z_i = f_\theta(x_i) \in \mathbb{R}^{d}$, where $d$ denotes the dimensionality of the latent feature space. The symbol $\cdot$ indicates the inner product. $\tau$ is a scalar temperature parameter, and the sum in the denominator runs over all indices $k \neq i$. The feature $z_i$ is called the anchor feature, $z_{j(i)}$ is called the positive feature related to the anchor, and all the others are called negative features. The positive feature corresponding to an anchor is usually obtained from an image augmented from the anchor image [13]. Simple image augmentation techniques such as random cropping [26] and rotating [11] have been used.
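The self-supervised contrastive loss described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `dot` and `sscl_loss`, and the `positive_index` argument that tells each anchor which batch element is its augmented view, are hypothetical, and the loss is summed over anchors with the denominator covering every other feature in the batch.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def sscl_loss(features, positive_index, tau=0.07):
    """Self-supervised contrastive loss summed over all anchors.

    features: list of L2-normalised feature vectors (one per image).
    positive_index: positive_index[i] is the index of the positive of anchor i
                    (hypothetical bookkeeping, e.g. the augmented view).
    tau: temperature parameter.
    """
    total = 0.0
    for i, anchor in enumerate(features):
        # Logits against every other feature in the batch (k != i).
        logits = [dot(anchor, f) / tau for k, f in enumerate(features) if k != i]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        positive_logit = dot(anchor, features[positive_index[i]]) / tau
        total += -(positive_logit - log_denominator)
    return total


# Two pairs of near-duplicate unit vectors: pairing each with its twin gives a lower loss.
feats = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]]
aligned = sscl_loss(feats, [1, 0, 3, 2])
misaligned = sscl_loss(feats, [2, 3, 0, 1])
```

Because each anchor has exactly one positive, any other image of the same vehicle in the batch is pushed away as a negative, which is the limitation the multi-positive formulation later in the paper addresses.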

III The proposed approach

III-A Camera-tracklet-aware contrastive learning

Our motivation is as follows. Even if identity labels for vehicle images, which could verify vehicle Ids across non-overlapping cameras, are not provided, camera and tracklet Ids allow us to define multiple labelled subdomains: we divide the unlabelled domain (i.e., the entire image set) into camera-level subdomains and assign temporal labels using the tracklets. Based on this insight, we derive camera-tracklet-aware contrastive learning. Fig. 1 illustrates the entire workflow of the proposed approach.

We assume that a re-id model can use not only the vehicle images $\{x_i\}_{i=1}^{N}$ but also their camera Ids $\{c_i\}_{i=1}^{N}$ and tracklet Ids $\{t_i\}_{i=1}^{N}$, where $N$ is the number of vehicle images. The proposed approach initially extracts a latent feature $z_i$ from each image $x_i$ using an encoder function $f_\theta(\cdot)$. The extracted features are regularised by $\ell_2$-normalisation to improve the scale consistency of the features when applying contrastive learning and DA.

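After extracting the features, a camera-tracklet-aware memory (CTAM) is constructed. As shown in Fig. 1, the CTAM is a memory bank storing the extracted features, and it is accessible by the camera and tracklet Ids. The CTAM is defined as follows:

$$\mathcal{M} = \{\mathcal{C}_c\}_{c=1}^{N_c}, \quad \mathcal{C}_c = \{\mathcal{T}_{c,t}\}_{t=1}^{N_t^{c}}, \quad \mathcal{T}_{c,t} = \{\hat{z}_{c,t,n}\}_{n=1}^{N_{c,t}}$$

where $\mathcal{C}_c$ and $\mathcal{T}_{c,t}$ denote the feature set of the $c$-th camera and the feature set of the $t$-th tracklet included in $\mathcal{C}_c$, respectively. $N_c$, $N_t^{c}$, and $N_{c,t}$ denote the numbers of cameras, tracklets on the $c$-th camera, and vehicle images in the $t$-th tracklet of the $c$-th camera, respectively. $\hat{z}$ denotes a feature stored in the CTAM.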
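The nested structure of the memory (cameras, then tracklets, then features) maps naturally onto a two-level dictionary. The sketch below is one possible layout under that assumption; the class name `CameraTrackletMemory` and its method names are hypothetical, and it only covers storing and looking up features, not the update rule used during training.

```python
from collections import defaultdict


class CameraTrackletMemory:
    """Memory bank indexed by camera Id, then tracklet Id (illustrative sketch)."""

    def __init__(self):
        # camera Id -> tracklet Id -> list of stored feature vectors
        self._bank = defaultdict(lambda: defaultdict(list))

    def add(self, camera_id, tracklet_id, feature):
        """Store one extracted feature under its camera and tracklet Ids."""
        self._bank[camera_id][tracklet_id].append(list(feature))

    def tracklet_features(self, camera_id, tracklet_id):
        """All features of one tracklet on one camera."""
        return list(self._bank.get(camera_id, {}).get(tracklet_id, []))

    def camera_features(self, camera_id):
        """All features of one camera-level subdomain."""
        tracklets = self._bank.get(camera_id, {})
        return [f for feats in tracklets.values() for f in feats]

    def counts(self):
        """Number of stored features per (camera, tracklet)."""
        return {
            cam: {trk: len(feats) for trk, feats in tracklets.items()}
            for cam, tracklets in self._bank.items()
        }


memory = CameraTrackletMemory()
memory.add(camera_id=0, tracklet_id=3, feature=[1.0, 0.0])
memory.add(camera_id=0, tracklet_id=3, feature=[0.96, 0.28])
memory.add(camera_id=0, tracklet_id=7, feature=[0.0, 1.0])
memory.add(camera_id=1, tracklet_id=2, feature=[0.6, 0.8])
```

A dictionary keeps the bank non-parametric: cameras and tracklets of any number and size can be added without changing a fixed-size model output.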

During training, is updated every training step as follow:


where indicates training step.

Based on the CTAM, we apply contrastive learning using camera and tracklet Ids (CTACL). One possible option is to directly apply self-supervised contrastive learning (SSCL) (Eq. (1)) to vehicle re-id. However, it may be unsuitable for the video re-id setting, since the loss function does not account for the existence of multiple positive samples. If a re-id model incorrectly treats actual positive samples as negative samples during training, the re-id performance would be significantly degraded.

As mentioned above, by using camera and tracklet Ids, we can transform the unlabelled domain into a set of labelled subdomains. With this transformation, although it is limited to each subdomain, we can deterministically distinguish whether a sample is positive or negative. Therefore, we propose a contrastive learning loss with multiple positive samples.

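Given an unlabelled vehicle image $x$ with camera Id $c$ and tracklet Id $t$, the loss function of the CTACL based on the CTAM is defined as follows:

$$\mathcal{L}_{CTACL} = -\frac{1}{|\mathcal{T}_{c,t}|} \sum_{\hat{z}^{+} \in \mathcal{T}_{c,t}} \log \frac{\exp(z \cdot \hat{z}^{+} / \tau)}{\sum_{\hat{z} \in \mathcal{C}_c} \exp(z \cdot \hat{z} / \tau)} \tag{4}$$

where $z$ is the latent feature of $x$ extracted by the encoder $f_\theta$. $\tau$ is the temperature parameter for contrastive learning, $\mathcal{T}_{c,t}$ is the set of features specified by the tracklet Id $t$ and the camera Id $c$, and $\mathcal{C}_c$ is the set of features stored for the camera Id $c$. $|\mathcal{T}_{c,t}|$ indicates the cardinality of the set $\mathcal{T}_{c,t}$.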

In computing the loss, the CTACL determines the subdomain through the camera Id, and it determines the set of positive samples using the tracklet Id. This means that only samples within the same subdomain are used to calculate the loss function. However, to derive robust vehicle re-id, it is essential to learn a generalised representation that covers all other subdomains.
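To make the within-subdomain loss concrete, the following sketch computes a multi-positive contrastive loss for a single anchor. It rests on assumptions that the text alone does not settle: the per-positive log-ratios are averaged outside the logarithm (in the style of supervised contrastive learning [13]), and the denominator runs over every stored feature of the anchor's camera, positives included. The names `dot` and `ctacl_loss` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def ctacl_loss(z, positives, subdomain, tau=0.07):
    """Multi-positive contrastive loss for one anchor inside one camera subdomain.

    z: L2-normalised anchor feature.
    positives: stored features sharing the anchor's camera Id and tracklet Id.
    subdomain: all stored features of the anchor's camera (assumed to include positives).
    tau: temperature parameter.
    """
    logits = [dot(z, f) / tau for f in subdomain]
    # Numerically stable log-sum-exp over the whole camera subdomain.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Average the negative log-likelihood over all positives of the tracklet.
    return -sum(dot(z, p) / tau - log_denominator for p in positives) / len(positives)


anchor = [1.0, 0.0]
same_tracklet = [[1.0, 0.0], [0.96, 0.28]]
other_tracklets = [[0.0, 1.0], [-1.0, 0.0]]
camera_subdomain = same_tracklet + other_tracklets

loss_correct = ctacl_loss(anchor, same_tracklet, camera_subdomain)
loss_wrong = ctacl_loss(anchor, other_tracklets, camera_subdomain)
```

In this toy example the loss is small when the features labelled as positives really are close to the anchor, and large when dissimilar features are labelled as positives.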

One straightforward way to overcome this issue is to interconnect the subdomains by sampling positive and negative samples from different subdomains for cross-domain contrastive learning. We define two types of positive samples: easy and hard positives. Among the samples of all other subdomains, the easy positive samples are defined as the $k$-nearest samples of the input image, as follows:


where is the sorted set of features based on pair-wise similarities between and other features stored in the CTAM (), and is the set of selected features as the easy positive. The easy positive samples mean samples taken from different cameras, which are similar to the input sample.

On the other hand, the hard positive samples are samples taken from different cameras that are maximally different from the input sample. They are defined as the $k$-nearest samples of the sample that has the same tracklet Id as the input sample but is the farthest from it in the feature space, represented as follows:


where indicates the sample having the same tracklet Id and the farthest from the input sample. and are the sorted set of features using and the set of selected features as the hard positives, respectively.

In selecting negative samples from other subdomains, we establish a grey zone, an area from which negative samples are not drawn, to reduce the possibility of false-negative selection. After positive sample mining is completed, the remaining samples are sorted in descending order of similarity score. Then, a fixed percentage of the remaining samples is excluded from the top of the sorted list, with the number of excluded samples rounded up by the ceiling function, and the remainder are designated as negative samples.
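The mining procedure described in the preceding paragraphs (easy positives, hard positives, and grey-zone negatives) can be sketched as follows. Several details are assumptions made only for illustration: similarity is the inner product, the remaining samples are ranked by similarity to the input sample, and the grey zone is passed as a fraction `grey_ratio` rather than a percentage. The names `k_nearest` and `mine_samples` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def k_nearest(query, candidates, k):
    """Indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: dot(query, candidates[i]), reverse=True)
    return order[:k]


def mine_samples(z, same_tracklet, other_cameras, k, grey_ratio):
    """Return (positive indices, negative indices) into other_cameras.

    z: input feature; same_tracklet: features sharing its camera and tracklet Ids;
    other_cameras: features stored for all other camera subdomains.
    """
    # Easy positives: nearest neighbours of the input sample itself.
    easy = set(k_nearest(z, other_cameras, k))
    # Hard positives: nearest neighbours of the same-tracklet sample farthest from z.
    farthest = min(same_tracklet, key=lambda f: dot(z, f))
    hard = set(k_nearest(farthest, other_cameras, k))
    positives = easy | hard
    # Rank what is left by similarity to z (assumption) and skip the grey zone.
    remaining = [i for i in range(len(other_cameras)) if i not in positives]
    remaining.sort(key=lambda i: dot(z, other_cameras[i]), reverse=True)
    skipped = math.ceil(grey_ratio * len(remaining))
    return sorted(positives), remaining[skipped:]


z = [1.0, 0.0]
same_tracklet = [[1.0, 0.0], [0.0, 1.0]]
other_cameras = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-1.0, 0.0], [0.0, -1.0]]
positives, negatives = mine_samples(z, same_tracklet, other_cameras, k=1, grey_ratio=0.5)
```

With these toy vectors, index 0 is mined as the easy positive, index 1 as the hard positive, the two most similar leftovers fall into the grey zone, and only index 3 is kept as a negative.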

By using the above two positive sample sets ( and ) and the negative samples (), we extend the CTACL as follows:


where denotes the union set of the two positive sample sets and . represents the negative sample set.

CTACL using CTAM enables generalised non-parametric representation learning on problems where the domain varies. For example, the number of tracklets and the number of images in each tracklet can vary. When a parametric approach (e.g., classification based on neural networks) is used, parameter settings such as the output dimensions must be revised. However, CTACL using CTAM provides a learning approach without any parametric model, which gives better methodological flexibility.

III-B Domain adaptation across camera networks

The CTACL includes a process for taking positive and negative samples from outside the subdomains to improve generalisation performance. However, it may not provide sufficient generalisation performance for the entire camera network. It is highly likely that the number of positive samples selected by the mining is much smaller than the number of true positive samples. We can manually increase $k$ to take as many positive samples as possible from beyond each subdomain, but this approach can also generate a great number of false positives, which can degrade the re-id model performance.

To resolve this issue, we propose camera-level DA using CTAM to explicitly improve the generalisation performance of the learnt representation. As shown in Fig. 1, the proposed DA aims to uniformise the likelihood of camera Id classification. Intuitively, uniformising the likelihood of camera Id classification can be interpreted as maximising uncertainty of camera-specified information, and it can be considered as reducing the informational bias obtained in each subdomain.

To do this, we define the likelihood that the input image $x$ will be classified as the $c$-th camera Id as follows:


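where $z$ is the latent feature obtained by the encoder, and $\mu_c$ indicates the centre point of the feature distribution having the $c$-th camera Id. It is defined as the average of the features having the same camera Id, as follows:

$$\mu_c = \frac{1}{n_c} \sum_{\hat{z} \in \mathcal{C}_c} \hat{z}$$

where $\mathcal{C}_c$ is the set of stored features having the $c$-th camera Id, and $n_c$ denotes the number of features classified by the $c$-th camera Id. Like all other features, the averaged features are also $\ell_2$-normalised after being computed.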

The loss function for the DA between cameras is formulated based on Kullback–Leibler (KL) divergence between the likelihoods and the uniform distribution, which is defined as follows:


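Here, the uniform distribution vector $u$ is defined according to the number of cameras $N_c$ as follows:

$$u = \left[\frac{1}{N_c}, \frac{1}{N_c}, \dots, \frac{1}{N_c}\right] \in \mathbb{R}^{N_c}$$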
By minimising the KL divergence between the uniform distribution and the likelihoods for camera Id classification, our re-id method maximises the uncertainty in distinguishing a specific camera Id. Consequently, it reduces the risk of subdomain-specific representation learning and improves generalisation performance.
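The camera-level DA loss can be illustrated with a short sketch. The exact form of the likelihood and the direction of the KL divergence are not recoverable from the text here, so the code assumes a softmax over inner products between the feature and the normalised camera centres, and computes the divergence from the uniform distribution to that likelihood. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def camera_centres(features_by_camera):
    """L2-normalised mean feature per camera Id (assumes a non-zero mean)."""
    centres = {}
    for cam, feats in features_by_camera.items():
        dim = len(feats[0])
        mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        norm = math.sqrt(dot(mean, mean))
        centres[cam] = [a / norm for a in mean]
    return centres


def camera_likelihood(z, centres):
    """Assumed likelihood: softmax over similarities to the camera centres."""
    logits = [dot(z, centres[cam]) for cam in sorted(centres)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def da_loss(z, centres):
    """KL divergence from the uniform distribution to the camera likelihood (assumed direction)."""
    p = camera_likelihood(z, centres)
    u = 1.0 / len(p)
    return sum(u * math.log(u / p_c) for p_c in p)


centres = camera_centres({0: [[1.0, 0.0]], 1: [[0.0, 1.0]]})
camera_agnostic = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
camera_specific = [1.0, 0.0]
loss_agnostic = da_loss(camera_agnostic, centres)
loss_specific = da_loss(camera_specific, centres)
```

A feature that is equally similar to every camera centre yields a uniform likelihood and a loss of zero, whereas a feature that sits on one camera's centre is penalised.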

The objective function for joint learning using the CTACL and the proposed DA is defined as follows:

$$\mathcal{L} = \mathcal{L}_{CTACL} + \lambda \mathcal{L}_{DA}$$

where $\lambda$ indicates a balancing weight for the DA loss.

| Dataset | Imgs | Ids | Imgs/Id | Cams | Time (H) |
|---|---|---|---|---|---|
| VVeRI-901 | 488,195 | 901 | 541.83 | 11 | 33 |
| Veri-Wild | 416,314 | 40,671 | 10.23 | 174 | 125,280 |
| VeRi-776 | 49,360 | 776 | 63.60 | 18 | 18 |

TABLE I: Key properties of the vehicle re-id datasets. ‘Imgs’, ‘Ids’, ‘Imgs/Id’, ‘Cams’, and ‘Time’ denote the number of images, vehicle identities, images per identity, cameras, and the recording time (in hours) of each dataset, respectively.

IV Experiments

IV-A Dataset and Evaluation metrics

We use three publicly available datasets, VVeRI-901 [35], VeRi-776 [19], and Veri-Wild [31], for the ablation study and the performance comparison with recent state-of-the-art methods for unsupervised vehicle re-id. Table I shows the key properties of those datasets. Since we assume that the proposed approach can leverage camera and tracklet Ids, the VVeRI-901 dataset, which was proposed as a benchmark for the video-based vehicle re-id task, is used as the main benchmark to evaluate our method.

Additionally, we reorganised the VeRi-776 and Veri-Wild datasets, which are image-based vehicle re-id datasets, to assign virtual tracklet Ids to each sample. Since those two datasets already provide camera Ids, we create the virtual tracklet Ids by mapping the vehicle Ids included in each camera to new Ids that are not shared between cameras, in order to secure the uniqueness of the tracklet Ids across the entire camera network. Our experiments are conducted based on the standard experimental protocol for unsupervised vehicle re-id [19, 18, 31, 33]. Cumulative Matching Characteristics (CMC) and Mean Average Precision (mAP) are used as performance evaluation metrics.
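The virtual tracklet Ids described above amount to giving every (camera Id, vehicle Id) pair its own fresh Id, so that the same vehicle seen by two cameras receives two unrelated tracklet Ids. A minimal sketch, assuming each sample is available as a `(camera_id, vehicle_id)` tuple (the function name and the input layout are hypothetical):

```python
def assign_virtual_tracklet_ids(samples):
    """Map each (camera Id, vehicle Id) pair to a tracklet Id unique across all cameras.

    samples: iterable of (camera_id, vehicle_id) tuples, one per image.
    Returns a list of virtual tracklet Ids aligned with the input order.
    """
    mapping = {}
    tracklet_ids = []
    for camera_id, vehicle_id in samples:
        key = (camera_id, vehicle_id)
        if key not in mapping:
            # A new pair gets the next unused Id, so Ids are never shared between cameras.
            mapping[key] = len(mapping)
        tracklet_ids.append(mapping[key])
    return tracklet_ids


samples = [(0, "car_a"), (0, "car_a"), (1, "car_a"), (0, "car_b")]
virtual_ids = assign_virtual_tracklet_ids(samples)
```

In the example, the two images of `car_a` on camera 0 share one tracklet Id, while the image of the same vehicle on camera 1 gets a different one, so no cross-camera identity information leaks into training.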

Fig. 2: Performance analysis depending on the values of $k$ and the grey zone scale on the VVeRI-901 and VeRi-776 datasets. (a) shows the rank-1 accuracies and mAP depending on $k$. (b) represents the trends of the rank-1 accuracies and mAP according to the grey zone scale.

IV-B Implementation

All images are resized to 256×128. Stochastic gradient descent (SGD) with a momentum of 0.9 is used for model optimisation. The model is trained for 50 epochs. At the beginning of model training, the learning rate is set to 0.1, and it is decayed by a factor of 0.1 every 10 epochs. The batch size is 256. ResNet-50 [9] followed by an $\ell_2$-normalisation layer is used as the encoder function. The encoder model is pre-trained on ImageNet [14]. The output dimensionality of the encoder function is 2,048. We used simple data augmentations (such as random crop, rotation, and colour jitter) to boost the generality of the learnt representations. Initially, the CTACL is applied within subdomains (Eq. (4)) for 5 epochs to ensure a minimum level of representation learning performance; after that, the extended CTACL (Eq. (7)) with the DA is applied. The CTAM is completely rebuilt every 5 epochs during training to improve the consistency between features. Following the results of Wang et al. [28] and Khosla et al. [13], the temperature parameter is set to 0.07. The balancing weight and the grey zone scale are fixed to 0.2 and 0.01, respectively (based on the ablation study).
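The training schedule stated above (step-wise learning-rate decay, a 5-epoch warm-up with the within-subdomain loss, and a periodic rebuild of the memory) can be written down as three small helper functions. This is a sketch with hypothetical function names; it assumes epochs are counted from 0 and that the memory rebuild falls on epochs divisible by 5, neither of which is stated explicitly.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.1, step=10):
    """Learning rate for a zero-indexed epoch: multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


def uses_extended_loss(epoch, warmup_epochs=5):
    """True once the warm-up with the within-subdomain loss is over."""
    return epoch >= warmup_epochs


def should_rebuild_memory(epoch, period=5):
    """True on epochs where the memory bank is rebuilt from scratch (assumed alignment)."""
    return epoch % period == 0


# Print the schedule for a 50-epoch run at a few representative epochs.
for epoch in (0, 4, 5, 10, 49):
    print(epoch, learning_rate(epoch), uses_extended_loss(epoch), should_rebuild_memory(epoch))
```

Under these assumptions the learning rate takes five values over 50 epochs, from 0.1 down to roughly 1e-5, and the extended loss with DA is active from the sixth epoch onwards.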

IV-C Ablation Study

The performance of our method is affected by the following three hyper-parameters: the number of mined neighbours $k$, the grey zone scale, and the balancing weight for the DA loss. We observe the performance changes according to the variations of those parameters. We also demonstrate the effectiveness of the CTACL and DA by comparing them with various loss functions. All experiments are conducted with unsupervised vehicle re-id settings on the VVeRI-901 and VeRi-776 datasets. Parameters that are not under analysis are fixed during the experiments. The parameter value achieving the best performance is fixed for further experiments.

Parameter analysis on $k$: The $k$ in Eq. (5) and Eq. (6) decides how many potential positive samples are selected from outside the subdomains. We observe the rank-1 accuracy and mAP as $k$ varies. The experimental results in Fig. 2(a) show that, in the interval where the $k$ value increases from 0 to 1000, the performance increases rapidly, and after that, the performance gradually decreases. These experimental results show that using the positive samples mined from other subdomains can improve the vehicle re-id performance by giving the model a chance to learn a more general representation. However, as $k$ gets larger, the possibility of wrongly mined results also increases, which degrades the performance. The best performance is achieved with a $k$ of 5.

Parameter analysis on the grey zone scale: The grey zone scale decides the range of the grey zone, which is the area skipped when selecting positive and negative samples from beyond the subdomains. As shown in Fig. 2(b), when the grey zone is not considered, i.e., the scale is 0, the performances are the lowest. The best performances are achieved with a scale of 0.01. The performance decreases slightly as the scale gets larger, but the decrease is not significant compared with the performance increment between scales of 0 and 0.01. The trends of the rank-1 accuracy and mAP shown in Fig. 2(b) suggest that the advantage of contrastive learning using weak supervisory signals, namely camera and tracklet Ids, can be degraded by false positives. The presence of the grey zone has a significant impact on performance, but its size does not.

| Dataset | Metric | 0.0 | 0.01 | 0.1 | 0.2 | 0.5 | 1.0 |
|---|---|---|---|---|---|---|---|
| VeRi-776 | Rank-1 | 81.0 | 87.4 | 89.1 | 89.3 | 88.7 | 88.2 |
| VeRi-776 | mAP | 43.9 | 53.6 | 55.4 | 55.2 | 54.8 | 54.9 |
| VVeRI-901 | Rank-1 | 33.6 | 36.7 | 36.8 | 38.2 | 37.5 | 37.3 |
| VVeRI-901 | mAP | 25.8 | 26.6 | 26.7 | 29.0 | 28.2 | 28.5 |

TABLE II: Performance analysis depending on the setting of the balancing weight for the DA loss (column headers). The bolded figures denote the best performance.

| Loss function | VeRi-776 Rank-1 | VeRi-776 mAP | VVeRI-901 Rank-1 | VVeRI-901 mAP |
|---|---|---|---|---|
| SSCL (Eq. 1) | 23.6 | 10.4 | 26.1 | 15.9 |
| Softmax CE+GT | 94.8 | 79.8 | 43.5 | 42.8 |
| Softmax CE+Tracklets | 50.8 | 16.1 | 28.6 | 16.5 |
| CTACL | 81.6 | 44.2 | 33.7 | 26.0 |
| CTACL+DA | 89.3 | 55.2 | 38.2 | 28.1 |
| CTACL+GT | 92.3 | 68.2 | 41.1 | 29.5 |
| CTACL+DA+GT | 92.8 | 70.7 | 41.8 | 29.9 |

TABLE III: Performance comparison of various loss function settings. Self-supervised contrastive learning (SSCL) and softmax cross-entropy (CE) are used. ‘GT’ indicates that the model is trained using the vehicle class labels. ‘Tracklets’ denotes that the tracklets of each subdomain are used as labels. ‘DA’ means that the domain adaptation has been applied in the training phase.

Effectiveness of DA with the balancing weight: When the balancing weight is 0, the DA is not considered during model training. On the other hand, when it is 1.0, the gradient of the DA loss is weighted equally with the gradient of the contrastive learning loss (Eq. (7)). We conduct the performance evaluation with six different values of the balancing weight between 0 and 1, and the results are shown in Table II.

The best performance is obtained with a balancing weight of 0.2, which gives a rank-1 accuracy of 89.3 and an mAP of 55.2 on the VeRi-776 dataset and a rank-1 accuracy of 38.2 and an mAP of 29.0 on the VVeRI-901 dataset. The lowest performance in our experiments is obtained with a weight of 0, which means the DA is not applied. These experimental results show that the domain adaptation across the camera networks improves the vehicle re-id performance.

| Method | Settings | Rank-1 | Rank-5 | Rank-10 | mAP |
|---|---|---|---|---|---|
| GoogLeNet [31] | SU | 40.8 | 59.6 | 65.3 | 41.4 |
| ID Loss [39] | SU | 37.2 | 52.4 | 60.5 | 36.5 |
| TCLNET-tri [6] | SU | 45.5 | 58.0 | 67.1 | 44.0 |
| MGH [32] | SU | 44.3 | 61.8 | 67.8 | 44.5 |
| Triplet [34] | SU | 35.6 | 51.2 | 54.3 | 33.7 |
| PhD [35] | SU | 47.1 | 67.6 | 74.7 | 47.2 |
| BOW [38] | UN | 13.2 | 15.1 | 20.1 | 3.7 |
| BUC [16] | UN | 15.2 | 17.6 | 23.7 | 5.8 |
| SSL [17] | UN | 28.6 | 36.8 | 38.1 | 19.2 |
| MMLP [27] | UN | 26.3 | 39.1 | 40.2 | 18.1 |
| SSML [33] | UN | 27.9 | 36.7 | 39.6 | 19.6 |
| CTACL | UN | 33.7 | 37.6 | 40.3 | 26.0 |
| CTACL+DA | UN | 38.2 | 42.1 | 43.4 | 29.0 |

TABLE IV: Comparison of vehicle re-id performance on VVeRI-901. ‘SU’ and ‘UN’ indicate that a method is based on supervised and fully unsupervised learning, respectively. The underlined figures and bolded figures indicate the best performance among the supervised learning-based and unsupervised learning-based methods, respectively.

| Methods | Year | Settings | Source | VeRi-776 Rank-1 | VeRi-776 Rank-5 | VeRi-776 mAP | Veri-Wild (Small) Rank-1 | Veri-Wild (Small) Rank-5 | Veri-Wild (Small) mAP | Veri-Wild (Medium) Rank-1 | Veri-Wild (Medium) Rank-5 | Veri-Wild (Medium) mAP | Veri-Wild (Large) Rank-1 | Veri-Wild (Large) Rank-5 | Veri-Wild (Large) mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPGAN [3] | 2018 | DA | VehicleID | 57.4 | 70.0 | 16.4 | 59.1 | 76.2 | 24.1 | 55.0 | 74.5 | 21.6 | 47.4 | 66.1 | 17.5 |
| VR-PROUD [1] | 2019 | DA | VehicleID | 55.7 | 70.0 | 22.7 | - | - | - | - | - | - | - | - | - |
| ECN [39] | 2019 | DA | VehicleID | 60.8 | 70.9 | 27.7 | 73.4 | 88.8 | 34.7 | 68.6 | 84.6 | 30.6 | 61.0 | 78.2 | 24.7 |
| PAL [8] | 2020 | DA | VehicleID | 68.2 | 79.9 | 42.0 | - | - | - | - | - | - | - | - | - |
| UDAP [25] | 2020 | DA | VehicleID | 76.9 | 85.8 | 35.8 | 68.4 | 85.3 | 30.0 | 62.5 | 81.8 | 26.2 | 53.7 | 73.9 | 20.8 |
| VACP-DA [36] | 2020 | DA | VehicleID | 77.4 | 84.6 | 40.3 | 75.3 | 89.0 | 39.7 | 69.0 | 85.5 | 34.5 | 61.0 | 79.7 | 27.4 |
| AE [4] | 2020 | DA | VehicleID | 73.4 | 82.5 | 26.2 | 68.5 | 87.0 | 29.9 | 61.8 | 81.5 | 26.2 | 53.1 | 73.7 | 20.9 |
| LOMO [15] | 2015 | UN | - | 42.1 | 62.2 | 12.2 | 25.7 | 44.7 | 8.9 | 23.6 | 40.6 | 8.1 | 18.8 | 34.4 | 5.9 |
| BOW [38] | 2015 | UN | - | 44.7 | 66.4 | 14.5 | 28.5 | 43.6 | 9.4 | 25.4 | 40.7 | 8.6 | 18.3 | 38.6 | 6.6 |
| BUC [16] | 2019 | UN | - | 54.7 | 70.4 | 21.2 | 37.5 | 53.0 | 15.2 | 33.8 | 51.1 | 14.8 | 25.2 | 41.6 | 9.2 |
| SSL [17] | 2020 | UN | - | 69.3 | 72.1 | 23.8 | 38.5 | 58.1 | 16.1 | 36.4 | 56.0 | 17.9 | 32.7 | 48.2 | 13.6 |
| MMLP [27] | 2020 | UN | - | 71.8 | 75.9 | 24.2 | 40.1 | 63.5 | 15.9 | 39.1 | 60.4 | 19.2 | 33.1 | 50.4 | 14.1 |
| SSML [33] | 2021 | UN | - | 74.5 | 80.3 | 26.7 | 49.6 | 71.0 | 23.7 | 43.9 | 64.9 | 20.4 | 34.7 | 55.4 | 15.8 |
| CTACL | 2021 | UN | - | 81.6 | 89.5 | 44.2 | 71.05 | 86.6 | 58.2 | 69.2 | 83.7 | 49.2 | 60.1 | 81.5 | 41.2 |
| CTACL+DA | 2021 | UN | - | 89.3 | 93.9 | 55.2 | 79.2 | 93.6 | 65.0 | 73.1 | 89.5 | 56.2 | 63.6 | 83.5 | 44.9 |

TABLE V: Performance comparison on unsupervised vehicle re-id with state-of-the-art methods on the VeRi-776 dataset [19] and the Veri-Wild dataset [31]. ‘DA’ and ‘UN’ denote that a method is based on domain adaptation or fully unsupervised learning, respectively. ‘-’ denotes that the results are not provided. The underlined results indicate the best performance among the DA-based methods. The bolded results indicate the best performance in the comparison.

Effectiveness of CTACL: We compare the CTACL with the softmax cross-entropy (CE) loss and the SSCL loss (Eq. (1)). Based on the softmax CE, we derive vehicle re-id models using the ground truth, i.e., vehicle class labels, and using the tracklets. We also evaluate the performance of the CTACL trained with the ground truth. In this case, samples having the same vehicle class label are considered positive samples, and the remaining samples are all negatives; i.e., the positive sample mining is not used. Table III contains the experimental results for this ablation study.

The lowest performance is obtained by the SSCL loss. The best performance is achieved by the softmax CE with the ground truth. It shows a rank-1 accuracy of 94.8 and an mAP of 79.8 on the VeRi-776 dataset, and a rank-1 accuracy of 43.5 and an mAP of 42.8 on the VVeRI-901 dataset. The CTACLs using the ground truth also show similar performances. However, the experimental results show that when only a limited supervisory signal (e.g., tracklet Ids) is available, softmax CE may not be a suitable solution. Whereas the CTACL produces a rank-1 accuracy of over 80 on the VeRi-776 dataset without explicit labels for vehicle Ids, the performance of the vehicle re-id model trained by softmax CE drops significantly (to a rank-1 accuracy of 50.8) when only tracklet Ids are given.

IV-D Comparison with state-of-the-art methods

We compare the CTACL with various state-of-the-art methods for unsupervised vehicle re-id. Unfortunately, only a few fully unsupervised vehicle re-id methods have been proposed with reported performances. In particular, on the VVeRI-901 dataset, only the performances of supervised methods have been reported. Accordingly, we conducted additional experiments with several methods recently presented for fully unsupervised person re-id. The following studies are used for the performance comparison: LOMO [15], BOW [38], BUC [16], SSL [17], MMLP [27], and SSML [33]. Those studies provide publicly available source code, so the performance evaluation is carried out with that source code.

VVeRI-901 dataset: Table IV shows the quantitative performance comparison on the VVeRI-901 dataset. The CTACL with DA achieves a rank-1 accuracy of 38.2 and an mAP of 29.0. Those are the best performances among the unsupervised methods. The CTACL outperforms the other unsupervised vehicle re-id methods with a minimum rank-1 gap of 9.6 percentage points. In comparison with supervised vehicle re-id methods, the performance of the CTACL is higher than that of several supervised learning-based models [39, 34]. The best performance on the VVeRI-901 dataset is achieved by PhD learning [35], whose rank-1 accuracy is about 8.9 percentage points higher than ours.

The experimental results on the VVeRI-901 dataset can be interpreted as follows. Since the VVeRI-901 dataset contains a lot of noisy information, such as occlusion between vehicle detection results, the unsupervised vehicle re-id methods perform worse than the supervised vehicle re-id methods. Obviously, using vehicle Ids is a clear advantage in deriving a robust vehicle re-id model. However, even though explicit vehicle Ids are not given, the CTACL can obtain promising performance by using contrastive learning with weak supervisory signals such as camera and tracklet Ids.

VeRi-776 and Veri-Wild datasets: The experimental results on the VeRi-776 and Veri-Wild datasets also demonstrate the effectiveness of the CTACL. Table V shows the quantitative performance comparison between the CTACL and other unsupervised methods on the VeRi-776 and Veri-Wild datasets. The CTACL with DA achieves state-of-the-art performance in our experiments. It achieves a rank-1 accuracy of 89.3 and an mAP of 55.2 on the VeRi-776 dataset. On the ‘Small’ test set of the Veri-Wild dataset, it achieves a rank-1 accuracy of 79.2 and an mAP of 65.0. For the ‘Medium’ and ‘Large’ test sets, it produces a rank-1 accuracy of 73.1 and an mAP of 56.2, and a rank-1 accuracy of 63.6 and an mAP of 44.9, respectively. These figures are a minimum of 2.9% improved over VACP-DA [36], which is the second-ranked, DA-based method.

In comparison with the fully unsupervised methods [16, 17, 27, 33], the CTACL outperforms other fully unsupervised methods with significantly large margins. SSML [33], which is the second-ranked method among the fully unsupervised methods, produces rank-1 accuracy of 49.6 and mAP of 23.7 on the ‘Small’ test set, rank-1 accuracy of 43.9 and mAP of 20.4 on the ‘Medium’ test set, and rank-1 accuracy of 34.7 and mAP of 15.8 on the ‘Large’ test set.

The overall experimental results demonstrate that the CTACL not only outperforms the existing state-of-the-art unsupervised vehicle re-id methods but also achieves performance comparable to that of the supervised learning-based methods. Additionally, the performance comparisons between the CTACL and the DA-based methods indicate that, if weak supervisory signals are available, we can derive a robust vehicle re-id method without a labelled source dataset.

V Conclusion

In this paper, we have proposed camera-tracklet-aware contrastive learning (CTACL). The proposed CTACL uses camera and tracklet information, which is easily obtained when a vehicle re-id dataset is constructed. Based on these two Ids, we divide an unlabelled domain (i.e., the entire image set) into multiple camera-level subdomains. The tracklet Ids corresponding to each subdomain are used to decide the positive and negative samples for computing the CTACL loss. Also, we have applied DA across camera networks to improve the generalisation performance of the learnt representation. The ablation studies have demonstrated the effectiveness of the CTACL and DA in boosting unsupervised vehicle re-id performance. In comparison with various existing state-of-the-art methods for unsupervised vehicle re-id, the CTACL has outperformed the other unsupervised methods, including the DA-based methods.


  • [1] R. M. S. Bashir, M. Shahzad, and M. Fraz (2019) VR-proud: vehicle re-identification using progressive unsupervised deep architecture. Pattern Recognition 90, pp. 52–65. Cited by: TABLE V.
  • [2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §II-A, §II-A.
  • [3] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 994–1003. Cited by: TABLE V.
  • [4] Y. Ding, H. Fan, M. Xu, and Y. Yang (2020) Adaptive exploration for unsupervised person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications 16 (1), pp. 3:1–3:19. Cited by: TABLE V.
  • [5] H. Fan, L. Zheng, C. Yan, and Y. Yang (2018) Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 83. Cited by: §I.
  • [6] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In ICCV, Cited by: §I, TABLE IV.
  • [7] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 297–304. Cited by: §II-A.
  • [8] B. He, J. Li, Y. Zhao, and Y. Tian (2019) Part-regularized near-duplicate vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3997–4005. Cited by: TABLE V.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-B.
  • [10] S. He, H. Luo, W. Chen, M. Zhang, Y. Zhang, F. Wang, H. Li, and W. Jiang (2020) Multi-domain learning and identity mining for vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 582–583. Cited by: §I, §I.
  • [11] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. Cited by: §II-A.
  • [12] Y. Huang, B. Liang, W. Xie, Y. Liao, Z. Kuang, Y. Zhuang, and X. Ding (2020) Dual domain multi-task model for vehicle re-identification. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I, §I.
  • [13] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Cited by: §II-A, §II-A, §IV-B.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §IV-B.
  • [15] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In IEEE CVPR, pp. 2197–2206. Cited by: §I, §IV-D, TABLE V.
  • [16] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang (2019) A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8738–8745. Cited by: §IV-D, §IV-D, TABLE IV, TABLE V.
  • [17] Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian (2020) Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3390–3399. Cited by: §I, §I, §IV-D, §IV-D, TABLE IV, TABLE V.
  • [18] H. Liu, Y. Tian, Y. Yang, L. Pang, and T. Huang (2016) Deep relative distance learning: tell the difference between similar vehicles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2167–2175. Cited by: §I, §IV-A.
  • [19] X. Liu, W. Liu, T. Mei, and H. Ma (2016) A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In European conference on computer vision, pp. 869–884. Cited by: §I, §I, §IV-A, §IV-A, TABLE V.
  • [20] Y. Lou, Y. Bai, J. Liu, S. Wang, and L. Duan (2019) Veri-wild: a large dataset and a new method for vehicle re-identification in the wild. In IEEE CVPR, pp. 3235–3243. Cited by: §I.
  • [21] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §II-A.
  • [22] J. Peng, Y. Wang, H. Wang, Z. Zhang, X. Fu, and M. Wang (2020) Unsupervised vehicle re-identification with progressive adaptation. arXiv preprint arXiv:2006.11486. Cited by: §I, §I.
  • [23] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §II-A.
  • [24] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang (2017) Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In IEEE ICCV, pp. 1918–1927. Cited by: §I.
  • [25] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang (2020) Unsupervised domain adaptive re-identification: theory and practice. Pattern Recognition 102, pp. 107173. Cited by: TABLE V.
  • [26] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 776–794. Cited by: §II-A, §II-A.
  • [27] D. Wang and S. Zhang (2020) Unsupervised person re-identification via multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10981–10990. Cited by: §I, §IV-D, §IV-D, TABLE IV, TABLE V.
  • [28] F. Wang and H. Liu (2021) Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2495–2504. Cited by: §IV-B.
  • [29] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang (2017) Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In IEEE ICCV, pp. 379–387. Cited by: §I.
  • [30] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang (2017) Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 379–387. Cited by: §I.
  • [31] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 79–88. Cited by: §I, §I, §I, §I, §IV-A, §IV-A, TABLE IV, TABLE V.
  • [32] Y. Yan, J. Qin, J. Chen, L. Liu, F. Zhu, Y. Tai, and L. Shao (2020) Learning multi-granular hypergraphs for video-based person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2899–2908. Cited by: TABLE IV.
  • [33] J. Yu and H. Oh (2021) Unsupervised vehicle re-identification via self-supervised metric learning using feature dictionary. arXiv preprint arXiv:2103.02250. Cited by: §I, §IV-A, §IV-D, §IV-D, TABLE IV, TABLE V.
  • [34] F. Zhao, S. Liao, G. Xie, J. Zhao, K. Zhang, and L. Shao (2020) Unsupervised domain adaptation with noise resistible mutual-training for person re-identification. In European Conference on Computer Vision (ECCV), Glasgow, UK, pp. 1–18. Cited by: §IV-D, TABLE IV.
  • [35] J. Zhao, F. Qi, G. Ren, and L. Xu (2021) PhD learning: learning with pompeiu-hausdorff distances for video-based vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2225–2235. Cited by: §I, §I, §IV-A, §IV-D, TABLE IV.
  • [36] A. Zheng, X. Sun, C. Li, and J. Tang (2020) Aware progressive clustering for unsupervised vehicle re-identification. arXiv preprint arXiv:2011.09099. Cited by: §I, §IV-D, TABLE V.
  • [37] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) Mars: a video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pp. 868–884. Cited by: §I.
  • [38] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In IEEE ICCV, pp. 1116–1124. Cited by: §I, §IV-D, TABLE IV, TABLE V.
  • [39] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 598–607. Cited by: §IV-D, TABLE IV, TABLE V.