In the current era of advanced transportation systems, video surveillance plays a pivotal role in providing security and safety in traffic control. With the growing global demand for “smart cities”, investments are being made to develop robust Intelligent Transportation Systems (ITS). To meet this need, government authorities and industry representatives are motivated to set up surveillance environments that monitor daily traffic activity. To monitor vehicle movements, surveillance cameras are commonly installed at prominent locations in cities, such as highways, junctions, and gated campuses, where traffic violations are likely. These surveillance videos serve as evidence for penalizing the owners of vehicles that have violated traffic rules or caused accidents. The data acquired by surveillance cameras are valuable for computer vision tasks such as object counting, object detection, semantic segmentation, object re-identification, and tracking.
Over the years, re-identification has made its mark in tasks such as crowd monitoring and anomaly detection. As an application of ITS, vehicle re-identification has attracted many researchers in the field of computer vision. Vehicle re-identification aims to match a vehicle observed in one camera with images of the same vehicle appearing in the same camera or in different non-overlapping cameras. Compared to person re-identification, vehicle re-identification poses several challenges: (a) apart from vehicle color, there are limited appearance cues, which makes it difficult to differentiate vehicles of the same make and color, and (b) identical vehicles appearing at multiple cameras are seen from different viewpoints, which leads to false re-identification of the vehicle of interest.
Several Convolutional Neural Network (CNN)-based algorithms have been designed specifically to address vehicle re-identification. These methods stack convolutional layers, pooling layers, and linear layers with non-linear activation functions. However, CNN models developed for re-identification tend to focus on a few discriminative regions, depending on the chosen receptive field. During convolution, the CNN models downsample the spatial resolution of the output feature maps at successive convolutional layers. As a result, the network fails to distinguish similar-looking vehicles that differ only in minor appearance details. To capture long-range dependencies, attention-based CNN models were introduced. These methods rely on vehicle key points, trained in a supervised fashion, to learn the discriminative parts of the vehicle. Their performance depends on the selection of key points, which are obtained either by manual annotation or by automated detection.
With the increasing popularity of transformers in natural language processing (NLP), researchers have applied transformer models to computer vision tasks. Using multi-head attention, transformers capture long-range dependencies by attending to various vehicle parts more effectively than CNN models. In contrast to CNN models, transformer models process images at the patch level across several layers without downsampling them, which enables them to learn more local information about vehicles. Although transformer models have been shown to outperform existing CNN architectures, they require larger datasets to achieve comparable scores.
Inspired by these observations, this study proposes a vehicle re-identification framework that performs re-identification using vehicle features computed by a CNN and a transformer model. Specifically, a ResNetmid CNN model and a Swin Transformer are used to learn representations of vehicles observed across non-overlapping cameras. For a given query vehicle, its presence is verified across the gallery images using the features generated by both ResNetmid and the Swin transformer. The features from the individual models are fused to encapsulate both global and local representations of vehicles. The proposed framework is evaluated on 81 vehicle identities observed across 20 CCTV cameras. To the best of our knowledge, this is the first study to use both CNN and transformer models for vehicle re-identification.
The contributions of the paper are summarized as follows:
- A first-of-its-kind study that performs vehicle re-identification by fusing the vehicle features of a CNN and a transformer model.
- A performance evaluation of the vehicle re-identification framework using features learned by the standalone CNN, the standalone transformer model, and the fused feature representation (CNN+Transformer).
II Related Work
Computer vision-based approaches have been used in applications such as precision agriculture, environmental monitoring, and surveillance, to name a few. Studies on object re-identification have mainly focused on persons and vehicles. This section summarizes works that address vehicle re-identification.
Recent works on object re-identification commonly use triplet loss as the loss function. The authors of one such work proposed a two-branch deep convolutional network that projects vehicle images into a Euclidean space to measure the similarity of two vehicles. To learn discriminative features from deep feature embeddings, the network uses a Coupled-Cluster Loss (CCL).
The authors of another work introduced a batch sampling strategy with triplet loss to perform vehicle re-identification. The batch sample and batch weighted variants were evaluated against the standard batch hard and batch all variants.
An unsupervised metric learning model has been developed that leverages pairwise and triplet constraints to train a re-identification model with triplet loss. The vehicle features are transformed from the input dimension into a feature space where vehicles of the same identity are close together and vehicles of different identities are far apart. A single-shot detector identifies the vehicles appearing in a scene, and each detection is assigned to an existing or a new tracklet. The detector is built on a VGG-16 backbone network and is trained on the COCO dataset using only the vehicle class. The tracklets are grouped by the location of the videos. To compare two vehicles, the middle frame of each tracklet is selected and the similarity score is computed using Euclidean distance.
A vehicle re-identification and abnormality detection framework has also been proposed. The framework comprises three steps to obtain discriminative vehicle features. A deep metric embedding module extracts discriminative vehicle features. A classifier module addresses how vehicle features can be learned when vehicles differ in pose and color. A Faster R-CNN detects the vehicles appearing in the scene. The features of the detected vehicles are learned using a ResNet-50 trained with triplet loss. As a post-processing step, the authors re-rank the candidate images for a given query using a bag-of-words approach.
To alleviate the need for labeled data, another work proposed an adaptive feature learning method for re-identification. A re-identification network trained on existing datasets is adapted to a different target dataset by fine-tuning the feature extractor module. The framework consists of three stages: vehicle proposal, single-camera tracking, and feature extraction for re-identification.
Although transformer architectures have found significant success in NLP, their application to re-identification tasks in computer vision remains limited. One work developed a transformer-based object re-identification framework comprising two modules. These modules are designed to acquire more robust discriminative features of the vehicle and to mitigate the similarity discrepancy between inter-camera and intra-camera matching. The authors evaluated the framework on existing vehicle re-identification datasets.
Most of the methods discussed above perform re-identification using standard CNN architectures. With the emergence of transformer-based architectures, the re-identification problem can be addressed more effectively. Only a limited set of studies conduct re-identification using both CNN and transformer-based architectures. Hence, there is scope to perform vehicle re-identification by jointly using CNN and transformer models, which can help overcome hurdles in vehicle re-identification such as illumination changes, viewpoint changes, and occlusion.
In the present work, re-identification is conducted for vehicles observed across a network of surveillance (CCTV) cameras. Vehicles observed by CCTV cameras may exhibit appearance changes and variations in scale, which makes them challenging to re-identify. Hence, in this work, a novel vehicle re-identification method is developed that fuses the feature representations learned by a ResNetmid network and a Swin transformer. The CNN learns both the semantic and global features of vehicles. The transformer encodes the vehicle images at multiple resolutions and hierarchically fuses the learned representations at different stages. Both networks are trained independently using triplet loss. During inference (Figure 1), for a given query vehicle identity observed in the CCTV surveillance system, its presence is verified across the gallery set, which comprises images of identical vehicles observed across the network of cameras. To do so, feature embeddings are generated with both the ResNetmid and the Swin transformer networks for the query vehicle and for the vehicles observed by the CCTV cameras. The two embeddings of each image are fused, and a similarity score is computed between the input query vehicle and every gallery vehicle. Using this score, the gallery identities are ranked so that the vehicles most closely resembling the query appear at the top of the list. The following sections describe the architecture of the ResNetmid (Section III-B) and the Swin transformer (Section III-C) used to learn the vehicle representations.
To learn the semantic and global features of the vehicles, a ResNetmid backbone architecture is used in this work. The architecture uses the ResNet50 variant, which comprises five residual blocks. These five blocks are initialized with pre-trained ImageNet weights. As reported in prior work, features learned by the middle layers of the network are beneficial for learning the semantic representation of vehicles. Different from that work, here the final feature embeddings of residual block 4 are extracted and global average pooling is applied to them. This captures the semantic information of the vehicles learned by the earlier residual blocks of ResNet50. Global average pooling is also applied to the feature map generated by the final residual block of ResNet50; the resulting feature vector provides a high-level global representation of the vehicle. The semantic feature vector and the high-level feature vector are then concatenated to form the final feature embedding, which incorporates both semantic and global vehicle representations. A dense layer is added to predict the probability that a vehicle belongs to a known vehicle identity class. The network is trained using triplet loss, with the final feature embedding vector given as input to the loss module. Triplet loss operates on triplets, each consisting of an anchor image, a positive image that is similar to the anchor, and a negative image that is dissimilar to the anchor.
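The pooling-and-concatenation step described above can be illustrated with a minimal, dependency-free sketch. The function names and the toy feature maps are hypothetical, and the nested-list layout `[channels][height][width]` is an assumption made for readability; a real implementation would operate on framework tensors produced by the ResNet50 blocks.

```python
def global_average_pool(feature_map):
    """Average each channel of a [C][H][W] nested list down to one value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def resnetmid_embedding(mid_map, final_map):
    """Concatenate the pooled mid-level and final-level feature vectors."""
    return global_average_pool(mid_map) + global_average_pool(final_map)


# Toy feature maps (assumed shapes): 2 mid-level channels of 2x2,
# and 3 final-level channels of 1x2.
mid_map = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
final_map = [[[4.0, 6.0]], [[1.0, 1.0]], [[0.0, 10.0]]]

embedding = resnetmid_embedding(mid_map, final_map)
print(embedding)  # [4.0, 1.0, 5.0, 1.0, 5.0]
```

The length of the resulting embedding is the sum of the two channel counts, which is why the concatenated vector carries both the mid-level (semantic) and the final (global) information.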
III-C Swin Transformer
The Swin transformer processes an image at the patch level by decomposing it into several patches. It consists of four stages, and in each stage it generates feature maps of a different size, corresponding to a different scale. The transformer first partitions the input image into non-overlapping patches using a patch partition module. In this work, a patch size of $4 \times 4$ is chosen, and thus the feature dimension of each patch is $4 \times 4 \times 3 = 48$. Each patch is treated as a “token” and is projected to an arbitrary dimension $C$ using a linear embedding layer. In this work, the Swin-S variant of the transformer is used, where $C = 96$. Swin transformer blocks with self-attention are applied to the tokens. Between stages, neighboring patches are merged, so that self-attention is performed at different scales. Together, the stages produce a hierarchical representation made of feature maps of different resolutions. The numbers of layers (depths) of the four stages are 2, 2, 18, and 2, respectively. At the last stage, a global average pooling layer is applied to the output feature map, and its output is used to train the network with triplet loss.
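To make the hierarchical structure concrete, the following sketch computes the token grid and channel width at each of the four stages. It is only an illustration under stated assumptions: the patch size of 4, the embedding dimension of 96, and the 224 × 224 input size are taken from the publicly documented Swin-S configuration rather than from this study's training setup, and the function name is hypothetical. The depths (2, 2, 18, 2) are those given in the text.

```python
def swin_stage_shapes(height, width, patch_size=4, embed_dim=96,
                      depths=(2, 2, 18, 2)):
    """Return the token grid and channel width produced by each stage."""
    # Patch partition: one token per non-overlapping patch.
    h, w, c = height // patch_size, width // patch_size, embed_dim
    shapes = []
    for stage, depth in enumerate(depths, start=1):
        if stage > 1:
            # Patch merging: each 2x2 group of neighbouring tokens is merged,
            # halving the grid in each direction and doubling the channels.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append({"stage": stage, "blocks": depth,
                       "tokens": (h, w), "channels": c})
    return shapes


# Raw feature dimension of one RGB patch before the linear embedding.
raw_patch_dim = 4 * 4 * 3

shapes = swin_stage_shapes(224, 224)  # 224x224 input is an assumption
for s in shapes:
    print(s)
```

Under these assumptions the final stage yields a 7 × 7 grid of 768-channel tokens, so global average pooling over that grid gives a 768-dimensional embedding.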
During inference, both networks (ResNetmid and Swin Transformer) are used to generate feature embeddings for a given set of query vehicle images and for the candidate vehicle images in the gallery set. Specifically, $Q$ is the set of $N_q$ query vehicle images and $G$ denotes the collection of $N_g$ gallery images. For each of the query and gallery sets, the feature embeddings are obtained using the trained ResNetmid and Swin transformer models. For ResNetmid, the concatenated layer that carries the semantic and global information of vehicles is used to obtain the feature embeddings of both query and gallery images. Similarly, for the Swin transformer, the global average pooling layer of the last stage is used to generate the feature embeddings of both query and gallery images. $F^{R}_{Q} \in \mathbb{R}^{N_q \times d_R}$ and $F^{S}_{Q} \in \mathbb{R}^{N_q \times d_S}$ denote the feature embeddings of the query set for ResNetmid and Swin Transformer, respectively. Similarly, $F^{R}_{G} \in \mathbb{R}^{N_g \times d_R}$ and $F^{S}_{G} \in \mathbb{R}^{N_g \times d_S}$ denote the feature embeddings of the entire gallery set. Here $d_R$ and $d_S$ are the dimensions of the feature embedding layers of ResNetmid and Swin transformer, respectively. These feature representations are then fused (concatenated) to generate the final feature vectors. Using these feature vectors, a similarity score is computed for each query vehicle image against the vehicle images in the gallery. Euclidean distance is used to determine the similarity score between a query vehicle image and a candidate vehicle image in the gallery. The computed scores are then sorted so that the gallery vehicles most similar to the given query appear at the top of the ranked list.
As illustrated in Figure 1, during inference the feature embeddings of the gallery and query sets are computed in parallel by ResNetmid and Swin Transformer and are then concatenated. This process is defined in equations (1) and (2):

$$F_{Q} = \left[\, F^{R}_{Q} \;\Vert\; F^{S}_{Q} \,\right] \qquad (1)$$

$$F_{G} = \left[\, F^{R}_{G} \;\Vert\; F^{S}_{G} \,\right] \qquad (2)$$

where $\Vert$ denotes concatenation along the feature dimension, $F_{Q} \in \mathbb{R}^{N_q \times (d_R + d_S)}$, and $F_{G} \in \mathbb{R}^{N_g \times (d_R + d_S)}$. Both $F_{Q}$ and $F_{G}$ are used to compute the similarity score for each given query image.
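The fusion and ranking step can be sketched as follows. This is a minimal illustration with toy vectors, not the implementation used in the study; the helper names `fuse` and `rank_gallery` and the example dimensions are assumptions.

```python
import math


def fuse(resnet_feat, swin_feat):
    """Concatenate the two per-image embeddings into one feature vector."""
    return list(resnet_feat) + list(swin_feat)


def rank_gallery(query_vec, gallery_vecs):
    """Sort gallery indices by Euclidean distance to the query (closest first)."""
    distances = [math.dist(query_vec, g) for g in gallery_vecs]
    order = sorted(range(len(gallery_vecs)), key=lambda i: distances[i])
    return order, distances


# Toy embeddings: 2-D "ResNetmid" part and 3-D "Swin" part (assumed sizes).
query = fuse([0.0, 1.0], [1.0, 0.0, 0.0])
gallery = [
    fuse([3.0, 1.0], [1.0, 4.0, 0.0]),  # distance 5.0
    fuse([0.0, 1.0], [1.0, 0.0, 1.0]),  # distance 1.0
    fuse([0.0, 1.0], [1.0, 2.0, 0.0]),  # distance 2.0
]

order, distances = rank_gallery(query, gallery)
print(order)      # [1, 2, 0]
print(distances)  # [5.0, 1.0, 2.0]
```

Because a smaller Euclidean distance means a closer match, the ranked list is sorted in ascending order of distance.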
IV Results and Discussion
This section outlines the data gathering process used to conduct vehicle re-identification. It then presents a detailed analysis of vehicle re-identification using the standalone ResNetmid, the standalone Swin transformer, and the fused (concatenated) feature representation obtained jointly from ResNetmid and Swin transformer.
IV-A Experimental Setup
To evaluate the re-identification framework, surveillance data were acquired using CCTV cameras on the campus of Manipal Institute of Technology, Manipal, India. Of all the cameras available on the campus (total area: 188 acres), those considered for this study were chosen such that the probability of traffic movement is non-uniform across them. Camera locations include the campus entry/exit, the academic section, and hostel premises. The data were collected over 2 days using Hikvision surveillance cameras with a resolution of 1920 × 1080 at 20 fps. A total of 81 vehicle identities were identified, which are used during the training and inference stages of re-identification. The dataset is summarized in Table I. When a similar vehicle is observed on two different days, it is assigned a different vehicle identity, on the assumption that the vehicle may have undergone appearance changes.
Table I: Summary of the dataset.
| Parameter | Value |
| --- | --- |
| CCTV cameras used for experiment | 20 cameras |
| CCTV camera frame resolution | 1920 × 1080 |
| Frame rate | 20 fps |
| Duration of CCTV videos | 15 to 40 minutes |
Processing every frame of the CCTV videos is time-consuming and redundant, as the videos are acquired at 20 fps. Hence, as a pre-processing step, a shot boundary detection algorithm is applied to generate keyframes. In shot boundary detection, the histogram difference is computed on RGB images that are divided into a grid of blocks. A shot boundary is identified when the histogram difference between two successive frames exceeds a threshold value, which is experimentally set to 0.20 in this paper. The identical vehicles are manually labelled using the Microsoft VoTT annotation tool.
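A minimal sketch of block-wise histogram differencing for keyframe selection is given below. Only the threshold of 0.20 comes from the text; the 2 × 2 grid, the 8 bins per channel, the normalisation of the difference to the range 0 to 1, and the rule of keeping the first frame plus every frame that follows a detected boundary are all assumptions made for illustration. Frames are represented as nested lists of `(r, g, b)` integer tuples to keep the example dependency-free.

```python
def block_histograms(frame, grid=(2, 2), bins=8):
    """Return one normalised RGB histogram per grid cell of the frame."""
    rows, cols = len(frame), len(frame[0])
    grid_rows, grid_cols = grid
    hists = []
    for bi in range(grid_rows):
        for bj in range(grid_cols):
            hist = [0.0] * (3 * bins)
            r0, r1 = bi * rows // grid_rows, (bi + 1) * rows // grid_rows
            c0, c1 = bj * cols // grid_cols, (bj + 1) * cols // grid_cols
            for y in range(r0, r1):
                for x in range(c0, c1):
                    for ch, value in enumerate(frame[y][x]):
                        hist[ch * bins + value * bins // 256] += 1.0
            total = sum(hist) or 1.0  # guard against empty cells
            hists.append([h / total for h in hist])
    return hists


def histogram_difference(frame_a, frame_b, grid=(2, 2), bins=8):
    """Mean over cells of half the L1 histogram distance (0 = same, 1 = disjoint)."""
    ha = block_histograms(frame_a, grid, bins)
    hb = block_histograms(frame_b, grid, bins)
    per_cell = [0.5 * sum(abs(a - b) for a, b in zip(cell_a, cell_b))
                for cell_a, cell_b in zip(ha, hb)]
    return sum(per_cell) / len(per_cell)


def select_keyframes(frames, threshold=0.20):
    """Keep frame 0 and every frame whose difference to its predecessor exceeds the threshold."""
    keyframes = [0] if frames else []
    for i in range(1, len(frames)):
        if histogram_difference(frames[i - 1], frames[i]) > threshold:
            keyframes.append(i)
    return keyframes


def solid(color, size=4):
    """Build a size x size frame filled with a single RGB colour."""
    return [[color for _ in range(size)] for _ in range(size)]


dark, bright = solid((10, 10, 10)), solid((240, 240, 240))
frames = [dark, dark, bright, bright, dark]
keyframes = select_keyframes(frames)
print(keyframes)  # [0, 2, 4]
```

Computing the histograms per grid cell rather than over the whole frame keeps some spatial information, so a change confined to one region of the image still contributes to the difference.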
IV-B Vehicle Re-identification Results
The proposed re-identification framework uses features learned by both a CNN (ResNetmid) and a transformer (Swin) model. Each network is trained individually and is then used during inference to obtain vehicle representations, which are fused to compute re-identification scores. Of the 81 vehicle identities identified across the 20 surveillance cameras, 46 are used for training the re-identification network, with a total of 1,317 annotated vehicle images. A Batch Hard triplet loss variant is used, with its two parameters set to 3 and 4, respectively. The network is trained for 200 epochs with an initial learning rate of 0.001 and a decay factor of 5e-4. During inference, the remaining 35 vehicle identities detected across the 20 surveillance cameras are considered. A total of 35 images of each vehicle identity are considered as the query set. The presence of each query vehicle instance is verified across the 983 vehicle images in the gallery. For each query image, a score is computed against every gallery image, and the gallery images are ranked so that the image most similar to the query appears at the top of the list.
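For readers unfamiliar with the Batch Hard variant, the sketch below shows the general idea: for every anchor in a batch, the farthest same-identity sample and the closest different-identity sample are selected, and a hinge is applied to their distance gap. This is a generic illustration, not the training code of this study; the margin of 0.3, the toy batch, and the function name are assumptions.

```python
import math


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean hinge loss over anchors using the hardest positive and negative."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [math.dist(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] == label and j != i]
        negatives = [math.dist(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] != label]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet in the batch
        hardest_positive = max(positives)  # farthest same-identity sample
        hardest_negative = min(negatives)  # closest other-identity sample
        losses.append(max(0.0, hardest_positive - hardest_negative + margin))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]

# Well-separated identities: every hinge term is zero.
separated = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
loss_separated = batch_hard_triplet_loss(separated, labels)

# Interleaved identities: each anchor's hardest negative is closer
# than its hardest positive, so the loss is positive.
interleaved = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
loss_interleaved = batch_hard_triplet_loss(interleaved, labels)

print(loss_separated)               # 0.0
print(round(loss_interleaved, 3))   # 1.3
```

Minimising this loss pulls same-identity embeddings together and pushes different identities apart by at least the margin, which is what makes the later distance-based ranking meaningful.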
The performance of the framework is assessed using mean average precision (mAP) and rank-k accuracy. Table II shows the re-identification scores computed for each experiment. The following discussion interprets the mAP scores in Table II and the top-5 retrieval visualization (Figure 2) obtained for each query vehicle.
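As a reminder of how these two metrics are computed from ranked retrieval lists, a small sketch follows. The input format (one list of booleans per query, ordered from best to worst match) and the toy data are assumptions chosen for illustration; they are not results from this study.

```python
def average_precision(ranked_matches):
    """AP for one query; entry i is True if the i-th ranked image is a correct match."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_ranked_matches):
    """Mean of the per-query average precision values."""
    return (sum(average_precision(r) for r in all_ranked_matches)
            / len(all_ranked_matches))


def rank_k_accuracy(all_ranked_matches, k):
    """Fraction of queries with at least one correct match in the top k."""
    return (sum(any(r[:k]) for r in all_ranked_matches)
            / len(all_ranked_matches))


# Two toy queries with four ranked gallery images each.
results = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = 1/2
]

map_score = mean_average_precision(results)
rank1 = rank_k_accuracy(results, 1)
rank5 = rank_k_accuracy(results, 5)
print(round(map_score, 4), rank1, rank5)  # 0.6667 0.5 1.0
```

Rank-k accuracy only asks whether any correct match appears in the top k, whereas mAP rewards placing all correct matches early, so the two metrics can diverge when an identity has many gallery images.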
Vehicle re-identification using ResNetmid:
For the developed re-identification framework, a modified ResNet50 architecture is used to learn the semantic and global features of vehicles. When this network alone is used to infer the presence of query vehicle identities, the mAP reported in Table II is obtained. From Figure 2, it can be observed that for query 1 the network retrieves a single candidate image belonging to the same identity as the query vehicle. The falsely matched candidate images resemble the query in color. The network fails to learn the discriminative features of vehicles and instead focuses on their global appearance. For query 2, ResNetmid fails to retrieve any candidate image of the query identity within the top-5 ranks; the retrieved top-5 candidates again resemble the query in color. The rank-1, rank-5, rank-10, and rank-20 scores computed for the query set using ResNetmid are listed in Table II.
Vehicle re-identification using Swin Transformer:
To learn the discriminative features of the vehicle, the Swin-S variant of the transformer is used. At each stage of the network, self-attention scores are computed on feature maps of a different resolution. During inference on the given set of query vehicle images, the mAP reported in Table II is obtained. From Figure 2, it can be observed that for both queries the network retrieves candidate vehicle images from the gallery that are similar to the query vehicle image. The Swin transformer processes images as a collection of patches/tokens. Each patch participates in computing attention scores with neighboring patches, whereby the discriminative, part-level features of vehicles are learned effectively. Hence, the network can retrieve vehicle images that match the given query even when they are observed from different viewpoints. The rank-1, rank-5, rank-10, and rank-20 scores computed for the query set using the Swin Transformer are listed in Table II.
Vehicle re-identification using ResNetmid+Swin Transformer:
As outlined in Figure 1, the proposed re-identification framework fuses the vehicle feature representations generated by the two sub-networks, ResNetmid and Swin Transformer. The fused representation therefore contains both the global features and the discriminative features learned by the individual networks. For the given set of query images, the mAP score reported in Table II is obtained, which is significantly better than the mAP scores obtained using either individual network. Using the fused representations, it is observed that for both query images the network retrieves more candidate images belonging to the query identity. The concatenated feature representation also yields higher rank-k accuracy; the rank-1, rank-5, rank-10, and rank-20 scores are listed in Table II.
Vehicle re-identification is an open problem in computer vision that aims to re-identify vehicles across multiple cameras. Currently, vehicle re-identification is carried out using either CNN or attention models. These models fail to capture long-range dependencies because the image is downsampled across deep layers, whereby prominent information is lost. Transformer architectures are emerging for computer vision tasks and offer greater scope for solving re-identification problems. These models process images at the patch/token level. Attention is computed among patches to capture long-range dependencies and to weigh the importance of each patch relative to the others in the image. Transformer models are computationally expensive and require larger datasets to obtain comparable results. Based on these observations, a vehicle re-identification framework is presented that fuses the vehicle representations learned by a ResNetmid CNN and a Swin Transformer. Using the fused vehicle representation, higher mAP and higher rank-1, rank-5, rank-10, and rank-20 accuracies are obtained. These scores are significantly better than the re-identification scores obtained individually with the ResNetmid and Swin architectures. The fused representation contains both global (ResNetmid) and discriminative (Swin Transformer) features of vehicles, which are useful for re-identifying vehicles that are partially occluded or subject to viewpoint and illumination changes.
- Unsupervised vehicle re-identification using triplet networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 166–171.
- (2019-10) Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- (2021) Semantic segmentation with enhanced temporal smoothness using CRF in aerial videos. In 2021 IEEE Madras Section Conference (MASCON), pp. 1–5.
- (2019) Performance analysis of semantic segmentation algorithms for finely annotated new UAV aerial video dataset (ManipalUAVid). IEEE Access 7, pp. 136239–136253.
- (2020) A survey on vision transformer. arXiv preprint arXiv:2012.12556.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2021) TransReID: transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15013–15022.
- (2020) Efficient vehicle counting by eliminating identical vehicles in UAV aerial videos. In 2020 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER), pp. 246–251.
- (2019) A survey of advances in vision-based vehicle re-identification. Computer Vision and Image Understanding 182, pp. 50–63.
- (2019) Vehicle re-identification: an efficient baseline using triplet embedding. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–9.
- (2022) Transformer-based attention network for vehicle re-identification. Electronics 11 (7), pp. 1016.
- (2016) Deep relative distance learning: tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2167–2175.
- (2019) Pose-guided complementary features learning for Amur tiger re-identification. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 286–293.
- (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
- (2019) Vehicle re-identification with learned representation and spatial verification and abnormality detection with multi-adaptive vehicle detectors for traffic video analysis. In CVPR Workshops, pp. 363–372.
- (2020) Adaptive L2 regularization in person re-identification.
- (2021) UVid-Net: enhanced semantic segmentation of UAV aerial videos by embedding temporal information. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14, pp. 4115–4127.
- FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
- (2019-06) Comparative study on various losses for vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2021) DeepRivWidth: deep learning based semantic segmentation approach for river identification and width measurement in SAR images of coastal Karnataka. Computers & Geosciences 154, pp. 104805.
- (2015) Segmentation of tomatoes in open field images with shape and temporal constraints. In Pattern Recognition Applications and Methods, A. Fred, M. De Marsico, and A. Tabbone (Eds.), Cham, pp. 162–178.
- (2019) A survey of vehicle re-identification based on deep learning. IEEE Access 7, pp. 172443–172469.
- (2018) Vehicle re-identification with the space-time prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 121–128.
- (2021-06) A multi-camera vehicle tracking system based on city-scale vehicle re-ID and spatial-temporal information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4077–4086.
- (2017) The devil is in the middle: exploiting mid-level representations for cross-domain instance matching. arXiv preprint arXiv:1711.08106.
- (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems 30 (11), pp. 3212–3232.