Robust, Extensible, and Fast: Teamed Classifiers for Vehicle Tracking in Multi-Camera Networks

As camera networks have become more ubiquitous over the past decade, the research interest in video management has shifted to analytics on multi-camera networks. This includes performing tasks such as object detection, attribute identification, and vehicle/person tracking across different cameras without overlap. Current frameworks for management are designed for multi-camera networks in a closed dataset environment where there is limited variability in cameras and characteristics of the surveillance environment are well known. Furthermore, current frameworks are designed for offline analytics with guidance from human operators for forensic applications. This paper presents a teamed classifier framework for video analytics in heterogeneous many-camera networks with adversarial conditions such as multi-scale, multi-resolution cameras capturing the environment with varying occlusion, blur, and orientations. We describe an implementation for vehicle tracking and surveillance, where we implement a system that performs automated tracking of all vehicles all the time. Our evaluations show the teamed classifier framework is robust to adversarial conditions, extensible to changing video characteristics such as new vehicle types/brands and new cameras, and offers real-time performance compared to current offline video analytics approaches.


I Introduction

The ubiquity of large-scale camera networks has coincided with the emergence of real-time video management platforms such as Chameleon [1], Kestrel [2], VideoStorm [3], and BriefCam [4]. These tools allow human users to manage camera networks of thousands to hundreds of thousands of cameras, and query them manually to obtain live and archived video data summarization, mainly for forensics applications.

From the big data and machine learning (ML) research point of view, a major research challenge is the automation of video analytics to detect and track interesting objects and events. Metadata extraction and object tracking must be robust to streams with varying resolution and scale, along with camera artifacts such as different levels of blur and orientations. For practical applications, ML models also need to be extensible, so more classes of objects can be added as they become interesting. Finally, for real-time analytics, such tracking requires fast models due to the amount of data to be analyzed within a limited time (milliseconds per frame).

A typical vehicle tracking system consists of: (i) vehicle detection from frames, (ii) collection and integration of detections and metadata on the identified vehicle, and (iii) vehicle re-identification across different frames and cameras. Various approaches have been proposed (we cover Related Work in Section II), but they have performance issues due to significant computation requirements and extensibility issues due to assumptions on the vehicles, cameras, or other system components. To automate the process, effective knowledge acquisition models are necessary to detect and track relevant objects, infer missing metadata, and enable automated event detection.

In this paper, we propose reframing the typical large-scale video analytics pipeline from the common single-model approach to a novel teamed-classifier approach to deal with real-world video datasets with dynamic distributions. In a teamed-classifier approach, we build teams of models where each model, or a subset of models, is assigned to a subspace in the data distribution. We can contrast with ensembles. In a traditional bagging or stacked ensemble, each model in the ensemble is applied to the entire data space and weighted on either training performance or live drift-detected performance (dynamic weighted ensembles for drifting data are presented in [5, 6]). The prediction for an input $x$ is then $\hat{y}(x) = \sum_{i=1}^{n} w_i f_i(x)$, where $w_i$ is a static weight for model $f_i$, with $w_i$ usually assigned empirically. In contrast, our teamed classifier approach assigns an expert, or a family of expert classifiers, to a region of the input data space. We use a gating function $g$ to dynamically construct an ensemble during inference, similar to [7]:

$$\hat{y}(x) = \sum_{i=1}^{n} g_i(x)\, f_i(x) \qquad (1)$$
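The contrast between a static ensemble and a gated team can be illustrated with a minimal sketch. The function names, the two toy experts, and the hard gate below are hypothetical and chosen only for illustration; the sketch assumes each model returns a fixed-length score vector and that the gate returns one weight per model, with zero meaning "not selected".

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Model = Callable[[Sequence[float]], Vector]
Gate = Callable[[Sequence[float]], Sequence[float]]


def static_ensemble(models: Sequence[Model], weights: Sequence[float],
                    x: Sequence[float]) -> Vector:
    """Traditional ensemble: every model runs on x; weights are fixed in advance."""
    outputs = [model(x) for model in models]
    dim = len(outputs[0])
    return [sum(w * out[j] for w, out in zip(weights, outputs)) for j in range(dim)]


def teamed_prediction(models: Sequence[Model], gate: Gate,
                      x: Sequence[float]) -> Vector:
    """Teamed classifier: weights depend on x, and zero-weight experts never run."""
    active = [(g, model(x)) for g, model in zip(gate(x), models) if g != 0.0]
    if not active:
        raise ValueError("gate selected no expert for this input")
    dim = len(active[0][1])
    return [sum(g * out[j] for g, out in active) for j in range(dim)]


# Hypothetical toy setup: one expert per half of a 1-D input space.
calls: Dict[str, int] = {"neg": 0, "pos": 0}


def expert_neg(x: Sequence[float]) -> Vector:
    calls["neg"] += 1
    return [1.0, 0.0]


def expert_pos(x: Sequence[float]) -> Vector:
    calls["pos"] += 1
    return [0.0, 1.0]


def hard_gate(x: Sequence[float]) -> Sequence[float]:
    # Sparse, input-dependent weights: exactly one expert is selected.
    return [1.0, 0.0] if x[0] < 0 else [0.0, 1.0]


team = [expert_neg, expert_pos]
print(static_ensemble(team, [0.5, 0.5], [3.0]))  # both experts are evaluated
print(teamed_prediction(team, hard_gate, [3.0]))  # only expert_pos is evaluated
print(calls)
```

In this toy setting the static ensemble pays for every model on every input, while the gated team evaluates only the expert that owns the input's region, which is the property the sparse gating function is meant to provide.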


Our contributions can be summarized as follows:

  • A teamed classifier approach for video analytics, with focus on the vehicle re-identification task.

  • A general framework for re-identification that uses the naturally induced sparsity of the re-id task to build a sparse gating model with supervision. We evaluate our gating model on the Cars196 zero-shot learning task, where the goal is to cluster vehicle brands and models. We achieve state-of-the-art performance with normalized mutual information (NMI, used to evaluate clustering quality) of 66.03, compared to the previous SoTA of NMI 64.4 [8] and NMI 64.90 [9].

  • A simple and strong baseline algorithm for the re-id task that can operate on the subspaces identified by the gating function. Our simple and strong base model is competitive with current state-of-the-art with an order of magnitude fewer parameters: we use approximately 12M parameters for our base model to achieve 64.4 mAP compared with more than 100M parameters for MTML-OSG [10] with 62.6 mAP (mAP, or mean average precision, is a metric for evaluating ranking and retrieval).

The teamed classifier approach addresses two intertwined technical challenges of high inter-class similarity and high intra-class variability:

  • High inter-class similarity: Two different vehicles of the same model/year and color are visually similar due to the manufacturing process of vehicles. Therefore, identifying the vehicle model/year and color (the best a pixel-based image analysis algorithm can do) would still occasionally be insufficient when two such vehicles appear in the same frame.

  • High intra-class variability: Images of the same vehicle may look very different due to different orientations or environmental occlusion. For example, a front-view image of a sedan looks quite different from a rear-view image of the same sedan. Pure pixel-based image analysis may have difficulties with such differences.

II Related Work

II-A Classic Vehicle Tracking Approaches and Research Issues

A typical vehicle tracking system consists of several stages:

  • Object detection: Dense object detection in video streams has long been an integral part of video analytics. Various approaches, such as Mask-RCNN [11], YOLO [12], and SSD-on-MobileNets [13], have been devised for real-time detection of a wide variety of object types, including vehicles, people, animals, and traffic signs.

  • Vehicle metadata: An important part of traffic management is tracking vehicle speed to ensure traffic safety and detect speed limit infractions. While some specialized cameras in a surveillance network may be equipped with speed radar, such functionality is usually not found in common surveillance camera models. Recently, some proposed approaches perform speed detection in common cameras by tracking the vehicles in 3D space [14]. Other metadata include vehicle type, brand, and color.

  • Vehicle re-identification: The re-identification task requires tracking vehicles across cameras and assigning them to correct identities. Challenges lie in re-id under adversarial conditions where vehicles need to be tracked with multi-orientation, multi-scale, multi-resolution values alongside possible occlusion and blur. The vehicle re-identification problem has seen significant work in the past few years due to advances in the general one-shot learning problem [10, 15, 16, 17, 18, 19, 20].

  • Event detection: Automated event detection remains a difficult challenge due to the lack of labeled real-world or synthetic data and the absence of frameworks for video-based anomaly detection. A few approaches have been tested on simpler, small-scale data, such as LSTM-based [21] or predictive coding [22] approaches.

There have been several advancements towards some of these tasks. The video-management platforms mentioned above also perform object detection using off-the-shelf models. For example, [4] uses pretrained detection models and performs timestamp-based summarization, similar to the use of pretrained YOLO for vehicle detection in [2]. The approach in [1] uses both YOLO and Mask-RCNN for object detection.

Fig. 1: The typical vehicle re-identification pipeline. Each layer extracts progressively finer features.

Vehicle re-identification. More recently, there have been approaches for end-to-end vehicle metadata extraction and re-identification in [10, 16, 17, 23, 19, 20]. OIFE [23] proposed stacked convolutional networks (SCN) to extract fine-grained features in conjunction with global features. Twenty keypoints on vehicles, such as headlights, mirrors, and emblems, were labeled and extracted by SCNs to build feature masks. Global features and masked features are combined to create orientation-invariant features for re-id. RAM [16] approaches fine-grained feature extraction by splitting vehicle images into three regions and extracting features from each region separately. Features are combined with a fully-connected network for re-id. VAMI [19] adds additional supervision to fine-grained feature extraction by using the viewpoint information of vehicles. Subnetworks are built for each vehicle viewpoint, and features from viewpoint subnetworks are combined for re-id. EALN [17] proposes addressing inter-class similarity with intra-class variability by using generated negatives: by using GANs to reconstruct images of existing vehicles, EALN can create potentially infinite training samples from a small dataset to improve inter-class similarity discrimination. MTML [10] combines ideas from RAM and VAMI by creating subnetworks for different orientations, scales, and color corrections. Features from subnetworks are combined for re-id. Finally, QD-DLF [20] proposes retaining spatial information in features by extracting diagonal and anti-diagonal features. Instead of only flattening convolutional features, diagonal feature values are also used to improve re-id.

Single Models in Video Analytics. A common theme in the typical methods is the use of a single network for each task (Figure 1). Kestrel [2] uses the same model for all vehicle tracking. While Chameleon uses differently-sized models for low-fps, medium-fps, and high-fps streams, there is little variability beyond this: a single Mask-RCNN model is used for all high-fps videos, for example. Finally, most re-identification models propose a single network for all types of vehicles to perform simultaneous vehicle attribute extraction and identification. Each of the re-id models discussed uses end-to-end training; while a model may have subnetworks, they are used as a single model for each sample. Such approaches are effective in small-scale datasets without much variability. However, large-scale, real-world video networks have a variety of adversarial conditions that can limit single-model effectiveness. Such degradation under adversarial conditions is demonstrated in [24], where the authors examine some state-of-the-art re-identification models on adversarial re-id datasets with multi-scale, multi-resolution images along with occlusion and motion blur, and find performance deterioration due to high dataset variability. Similarly, recent research in domain adaptation [25, 26] shows that datasets that appear visually similar to humans encode artifacts that can cause significant model deterioration. The authors of BlazeIt [27], a video querying framework, also make such an observation: video drift due to changes in the data distribution can lead to model performance degradation unless new models are added.

Open and Closed Datasets. One of the important issues in automated video analytics is the distinction between closed datasets that have finite underlying features under non-adversarial conditions and open datasets that are continuously evolving with potentially infinite underlying features. Most datasets used to train vehicle re-id or object detection models are closed datasets: their class distributions are fixed, and they encode a static set of features. This can lead to development of models that do not generalize to real-world data. Findings in generalizability studies in [28] and [29] (on CIFAR-10 and ImageNet, respectively) show that model iteration on closed datasets leads to architectures and model weights that perform well on their respective test sets without generalizing to real-world data in the same domain. This supports the findings in [25], where models trained on one person re-id dataset significantly underperform on another, visually indistinguishable person re-id dataset. It is evident, then, that real-world analytics must take into account the open nature of real-world data where the underlying feature distribution is continuously evolving [7] and dataset drift is commonplace [30, 27, 31].

II-B Research Issues in Teamed Classifiers

Fig. 2: Teamed Classifiers for Vehicle Identification System; ICS: Inter-Class Similarity; ICV: Intra-Class Variability.

Team Sparsity. An important consideration in teamed classifiers is sparsity of weight assignments from the gating function in Equation 1, to ensure any single sample x uses only a subset of models. Unlike the recently proposed sparse mixture-of-experts model [32], subspace model assignments in teamed classifiers are supervised. In the mixture-of-experts approach, the submodels and the gating function are trained together, and sparsity is enforced with a penalty term in the loss function. In our teamed classifier approach, we enforce sparsity by exploiting naturally induced sparsity in our input space; for example, vehicle re-identification has sparsity naturally induced by the manufacturing process: a vehicle must be of a single type (sedan or SUV) and of a single brand (Toyota or Mazda). So, we construct a supervised gating function that ensures sparsity using this natural sparsity in the input space by detecting vehicle brand, then using a brand-specific expert. We again contrast with the sparse mixture-of-experts model, where gating functions and experts are trained together to let the gating function learn the subspace assignments without supervision [32, 33, 34]. Adding new subspaces or changing existing subspaces, as is the case with real-world drift [7, 31], requires retraining the entire mixture-of-experts. In the teamed classifier approach, we can train the gating function and experts independently, allowing us to more easily extend to new subspaces by creating new experts as and when required and training them independently of existing experts. Changes to an existing subspace require only updating that subspace’s assigned models.
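The extensibility argument above can be made concrete with a small sketch. Everything here is hypothetical: `ExpertTeam`, `detect_brand`, and the toy experts stand in for a trained brand detector and brand-specific re-id models, and samples are plain dictionaries rather than images. The point is only that registering a new expert does not touch existing ones.

```python
from typing import Any, Callable, Dict, Tuple

Sample = Dict[str, Any]
Gate = Callable[[Sample], str]
Expert = Callable[[Sample], str]


class ExpertTeam:
    """Routes each sample to the single expert that owns its subspace."""

    def __init__(self, gate: Gate) -> None:
        self._gate = gate
        self._experts: Dict[str, Expert] = {}

    def register(self, subspace: str, expert: Expert) -> None:
        # Adding or replacing one expert leaves every other expert untouched.
        self._experts[subspace] = expert

    def predict(self, sample: Sample) -> Tuple[str, str]:
        subspace = self._gate(sample)
        if subspace not in self._experts:
            raise KeyError(f"no expert registered for subspace {subspace!r}")
        return subspace, self._experts[subspace](sample)


def detect_brand(sample: Sample) -> str:
    # Stand-in for a supervised brand classifier (assumption: label is given).
    return sample["brand"]


def toyota_expert(sample: Sample) -> str:
    return "toyota:" + sample["track"]


def mazda_expert(sample: Sample) -> str:
    return "mazda:" + sample["track"]


team = ExpertTeam(detect_brand)
team.register("toyota", toyota_expert)
print(team.predict({"brand": "toyota", "track": "t1"}))

# A new brand appears in the stream: add one expert, retrain nothing else.
team.register("mazda", mazda_expert)
print(team.predict({"brand": "mazda", "track": "m7"}))
```

A jointly trained mixture-of-experts would not allow this kind of isolated update, because its gate and experts share one training objective.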

Naturally Induced Sparsity. We consider the naturally induced sparsity in vehicle tracking. The vehicle tracking task requires clustering vehicle identities into disjoint groups such that all images of a single identity are identified as such. This research task involves two technical challenges: high inter-class similarity (two vehicles of the same model/year and color are visually the same by manufacturing process), and high intra-class variability (images of the same vehicle from different perspectives can look very different).

The naturally induced sparsity of the re-id task lies in high inter-class similarity, since we observe that the inter-class similarity problem is precisely due to the underlying manufacturing process; some examples of inter-class similarity clusters include groups of Toyota Corollas, black SUVs, or red vehicles. Conversely, existing vehicle re-id datasets such as VeRi-776 [35] and VeRi-Wild [36] primarily focus on intra-class variability. Current approaches in vehicle re-id attempt to address inter-class similarity and intra-class variability in the same end-to-end model [16, 17, 18, 20]. This creates models that sacrifice performance on solving edge cases in intra-class variability to increase discriminative ability for inter-class similarity across the entire data space.

III The Teamed Classifier Approach

We introduce our teamed classifier approach for video analytics, specifically for vehicle re-identification. We will first describe a typical video analytics pipeline for re-identification. Then we describe our teamed classifier approach for vehicle re-identification and the advantages it brings over traditional single-model pipelines.

III-A Typical Pipeline for Vehicle Re-ID

A typical video surveillance framework for re-id comprises a pipeline of increasingly fine-grained feature extractors. We show in Figure 1 a standard vehicle surveillance pipeline. Data enters the pipeline through a deployed camera network in the form of video streams. Each layer performs progressively finer-grained feature extraction for knowledge acquisition.

In the Object Detector layer, pretrained object detectors such as YOLO [12] or Mask-RCNN [11] are commonly used for vehicle, person, and sign detection. As we have discussed, the usual approach in current systems such as Kestrel and VideoStorm is to use a single model type for the entire data space. A notable exception is Chameleon [1], which uses a small team of detectors for changes in detection quality requirements: if high-quality detections are requested, a pretrained YOLO detector is used. If low-quality detections are requested, then simpler detectors like SIFT are used. Object detectors extract very coarse features, namely labels.

The Re-ID Model layer is the focus of typical re-id approaches, where a single end-to-end model is developed for fine-grained vehicle identity clustering. Details about these end-to-end models are provided in Related Work. Here we observe that some approaches do use submodels, such as OIFE [23]; however, these submodels are trained together and are each designed for the entire input space. Each submodel’s features are subsequently combined with an additional dense neural network to obtain final re-identification features. Re-id features are clustered to identify unique vehicle identities.

III-B Teamed Classifiers for Vehicle Re-ID

We now present our teamed classifier-based pipeline for a vehicle identification system in Figure 2. Our approach differs from the current methods described in Figure 1 by employing classifier teams as feature extractors, where each member of the team is assigned to a different region of the input space. We exploit the naturally induced sparsity of the input space to create disjoint teams with supervision.

Detector Team. We employ detector teams in lieu of cross-domain adaptive object detector models. A challenge in real-world object detection on multi-stream video networks is the sharp difference between frame artifacts generated by each camera or set of cameras. As the analysis of cross-domain performance in [37] shows, even visually similar images are difficult for feature extractors if they are captured in different environments. While there have been some studies in developing domain-adaptive techniques or more generalizable universal detector models [38], we employ the student-teacher model for object detection from [39]. We use a pretrained full YOLOv3 model as the teacher, and train smaller, specialized detectors for each camera. The specialized models are built on SSD-MobileNets and can be deployed on embedded devices [40, 13]. Specialized detectors are covered in recent approaches; we focus on the identification layers.

Inter-Class Similarity Team. For vehicle re-id, significant effort has gone into developing models that can handle both inter-class similarity and intra-class variability, shown in Figure 3. In the former, vehicles with different identities (i.e., license plates) look very similar because they may be from the same brand, same vehicle type, or same color. A Re-ID model must therefore distinguish visually similar vehicles in the same camera using camera-specific artifacts such as spatio-temporal constraints or background information, while also capturing cross-camera features. In terms of implementation, a Re-ID model must generate a set of features for vehicles such that similar vehicles across multiple orientations, resolutions, and scales are projected to the same cluster, while ensuring features of different vehicles that have high inter-class similarity are projected to different clusters.

Fig. 3: Inter-class similarity & intra-class variability. Vehicle 1 is visually similar to Vehicle 2 and can be differentiated only by looking at windows and bumper. This is an example of inter-class similarity in white SUVs

This naturally imposes orthogonal constraints on a re-id model, and fine-grained feature extraction is necessary to ensure a model can address both constraints. Thus, a model must be able to capture the full range of feature combinations in vehicles across multiple brands, orientations, colors, resolutions, and scales, as have been proposed in existing approaches. Consequently, existing approaches build complex networks that perform inter-class similarity discrimination, and intra-class variability minimization in the same model: OIFE [23] uses 20 stacked convolutional networks to extract human-labeled keypoints; RAM [16] builds three sub-networks to evaluate each section of a vehicle (roof, body, chassis); VAMI [19] creates multiple sub-models for each orientation; and QD-DLF [20] builds four networks to extract diagonal features.

While such approaches partially address the inter-class similarity and intra-class variability constraints, they make simple mistakes: we show in Figure 4 some mistakes in vehicle re-id provided by the Group Sensitive Triplet Embedding approach in [41]. Similar examples are provided in other papers. We observe that forcing a re-id model to learn both inter-class similarity discrimination and intra-class variability minimization enforces a learning burden that reduces overall performance.

Fig. 4: Mistakes in state-of-the-art Group Sensitive Triplet Embedding [41]; images taken directly from the original paper. The first row shows retrievals of sedans for a truck query. The second row shows retrievals of an SUV, a sedan, and a blue truck for a black truck query.

We again propose exploiting the naturally induced sparsity of the input space to reduce the burden of learning orthogonal feature extraction for the re-id model. Concretely, our teamed-classifier approach uses two layers of feature extractors: an inter-class similarity team to perform coarse clustering of vehicle images using natural feature descriptions, followed by an intra-class variability team that assigns one re-id model to each cluster from the inter-class similarity team. This allows the re-id models in the intra-class variability team to focus on a subset of the input space of vehicles without enforcing a generalization constraint to address inter-class similarity.

Naturally Induced Sparsity. We use our observations from Figure 4 to build the inter-class similarity team; we select three key coarse features for enforcing the intra-class variability team sparsity: vehicle color, vehicle type, and vehicle model. We focus on vehicle model discrimination, since vehicle color and type are coarser, finite features addressed with simpler image classifiers as in the BoxCars116K models [42]. For vehicle model discrimination, we consider the related zero-shot learning task. The zero-shot learning task requires learning feature extractors that can discriminate between classes seen during training and generalize to classes not seen during training. We specifically focus on the Cars196 dataset, since it requires identifying unseen vehicle models using feature extraction on seen vehicle models. This is useful in re-id since new vehicle brands and updated vehicle models are continuously introduced, adding dataset drift to the input space.

We develop a zero-shot learning model that achieves state-of-the-art performance on the Cars196 dataset and use it for model discrimination. Our model implicitly learns relevant features for the unsupervised clustering of vehicle models. We describe our vehicle brand discriminator in Section IV-B.

Intra-Class Variability Team. With an inter-class similarity team to perform coarse-grained clustering, we can build our re-id models to focus on minimizing intra-class variability only. This provides two advantages:

  • Since our models only need to address intra-class variability on a limited subset of the true input space, we achieve higher performance in mAP and rank-1 retrieval compared to recent approaches.

  • We can build smaller models compared to recent approaches. As such, each member of the intra-class variability team uses a single ResNet-18 backbone and can operate in near real-time, compared to 20 stacked convolutional networks in [23], 5 ResNet backbones in [16], and 4 ResNet50 backbones in [20].

We describe our intra-class variability team’s base model in Section V-B.

IV Inter-Class Similarity Team

We develop an end-to-end model to deploy as a submodel in the inter-class similarity team using a single backbone network. Our approach implicitly learns relevant local and global features for unsupervised clustering without relying on data and feature augmentation or synthetic data. We first describe the Cars196 dataset we use for evaluating our inter-class similarity team’s brand discrimination models.

IV-A Dataset and Evaluation

The Cars196 dataset, introduced in [43], contains 196 classes of vehicles. It is challenging due to few images per class (on average, Cars196 has 82 images per class). Furthermore, vehicles exhibit a high degree of inter-class similarity as described in Section III-B, since most vehicles fit into a few form factors. We evaluate our models with two metrics: the normalized mutual information measure and the top-1 retrieval rate (we also show results for top-5 retrieval).

Normalized Mutual Information (NMI). NMI measures the agreement between a predicted cluster set and the ground truth cluster set; it is the harmonic mean of the two ratios of mutual information to the entropy of each clustering. Given a set of predicted clusters $\Omega = \{\omega_1, \ldots, \omega_k\}$ under a $k$-means clustering, we say that each $\omega_i$ contains instances determined to be of the same class. With ground truth clusters $C = \{c_1, \ldots, c_m\}$, we calculate NMI as:

$$\mathrm{NMI}(\Omega, C) = \frac{2\, I(\Omega; C)}{H(\Omega) + H(C)}$$

where $H(\cdot)$ is the entropy and $I(\Omega; C)$ is the mutual information between $\Omega$ and $C$. Since NMI is invariant to label index, no alignment is necessary.
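As a worked illustration of the NMI definition, the following sketch computes entropy, mutual information, and NMI from two flat label lists using only the standard library. It assumes the normalization by the sum of the two entropies (the harmonic-mean form) and natural logarithms; the function names are illustrative rather than taken from any evaluation toolkit.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Mutual information between two labelings of the same items."""
    n = len(pred)
    joint = Counter(zip(pred, truth))
    pred_count = Counter(pred)
    truth_count = Counter(truth)
    return sum(
        (c / n) * math.log(c * n / (pred_count[p] * truth_count[t]))
        for (p, t), c in joint.items()
    )


def nmi(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """NMI = 2 * I(pred; truth) / (H(pred) + H(truth))."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("labelings must be non-empty and of equal length")
    denom = entropy(pred) + entropy(truth)
    if denom == 0.0:
        return 1.0  # both labelings place every item in a single cluster
    return 2.0 * mutual_information(pred, truth) / denom


# Same partition with swapped cluster ids: no label alignment is needed.
print(nmi([0, 0, 1, 1], ["b", "b", "a", "a"]))
# Predicted clusters that are independent of the ground truth.
print(nmi([0, 1, 0, 1], ["a", "a", "b", "b"]))
```

The first call shows the label-invariance property mentioned above: a clustering that matches the ground truth up to renaming scores as a perfect match, while the second, statistically independent clustering scores zero.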

Top-k Retrieval. We use the standard top-$k$ ranking retrieval accuracy, calculated as the percentage of classes correctly retrieved at the first rank, and of those missed, the percentage retrieved correctly at the second rank, and so on.
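One common reading of top-$k$ retrieval accuracy is Recall@k: a query counts as a hit if any of its $k$ nearest neighbors (excluding itself) shares its class. The sketch below follows that reading, which is an assumption about the exact protocol; it uses squared Euclidean distance on toy one-dimensional embeddings, and the function names are hypothetical.

```python
from typing import Hashable, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def recall_at_k(embeddings: Sequence[Sequence[float]],
                labels: Sequence[Hashable], k: int) -> float:
    """Fraction of queries with a same-class item among their k nearest neighbors."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank every other item by distance to the query.
        ranked = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: sq_dist(query, embeddings[j]),
        )
        if any(labels[j] == labels[i] for j in ranked[:k]):
            hits += 1
    return hits / len(embeddings)


# Toy data: the last item of class "b" sits closer to the class "a" cluster.
toy_embeddings = [[0.0], [0.1], [1.0], [1.1], [0.45]]
toy_labels = ["a", "a", "b", "b", "b"]
for k in (1, 2, 3):
    print(k, recall_at_k(toy_embeddings, toy_labels, k))
```

Because hits accumulate as $k$ grows, the value is non-decreasing in $k$, which matches the cumulative description of first-rank hits plus items recovered at later ranks.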

IV-B Model for Brand Discrimination

We make the following observations in creating our inter-class similarity team’s submodel for brand discrimination:

  • It is well known that the earlier kernels in a convolutional network learn abstract, simple features such as colors. Some kernels also learn basic geometric shapes corresponding to image features in the low frequency range of images.

  • Later kernels learn more class-specific details and extract detailed features corresponding to higher-frequency features. While these are useful for traditional image classification, they may overfit on the ZSL task.

  • Convolutional layers are used for feature extraction, and subsequent dense, or fully-connected layers, used for feature interpretation. This forces the dense layers to learn image feature discrimination, instead of relying on convolutional filters. Since convolutional filters focus on nearby pixels with a spatial constraint, we believe relying on convolutional filters for feature interpretation to be more effective in tracking image invariant features.

Since the earlier layers learn coarse features, we propose using the early kernels with attention modules from [44] to improve feature extraction. Since the later layers in a convolutional network already learn fine-grained features, they do not require augmentation; otherwise they would begin to overfit on the training data and fail to generalize to unseen brands. Thus, we use convolutional attention on the early layers only, as opposed to attention throughout. We also address the loss of spatial image features in dense layers by removing them entirely and only use convolutional layers for both feature extraction and interpretation: given query and target, we evaluate their similarity on only convolutional features. In contrast to current approaches that use dense layers after the convolutional backbone to perform feature interpretation, we force the convolutional network to also learn feature interpretation simultaneously with feature extraction.

We show our overall inter-class similarity model in Figure 5. We now describe the backbone and attention modules (CBAM [44] and our novel Global Attention module).

Fig. 5: Overall architecture of the inter-class similarity model for brand discrimination

Backbone. We use ResNet-18 as the backbone for the inter-class similarity model. Each ResNet-18 backbone consists of several "basic blocks" chained together. We apply targeted attention to these basic blocks, as opposed to each convolutional layer.

Convolutional Attention. We use the CBAM attention module from [44] to add discriminative ability to the backbone. Since the earlier filters learn coarser features and the later filters learn fine-grained features, adding convolutional attention to all layers improves classification accuracy in general. However, in the brand discrimination task, we need to generalize to unseen brands; so CBAM on later layers causes networks to overfit on fine-grained features of the training set, reducing overall performance (we examine this performance drop in Section VI-B). We add attention only to the first basic block to learn discriminative coarse-grained filters for better unseen class separation and metric learning, improving performance on Cars196 and in brand discrimination in general.

Global Attention. The CBAM block is not sufficient to improve generalization due to skewed feature maps in early convolutional layers. The first convolutional layer is crucial in feature extraction since it occurs at the beginning of the network. We find that many feature maps at the first layers do not track any useful features; instead they either output random noise or focus on irrelevant features such as shadows. Therefore, we develop the Global Attention module (GA), shown in Figure 5, to perform feature map regularization. The GA module reduces feature map skew by re-weighting feature weights. Whereas CBAM separates channel and spatial attention, GA combines them to ensure spatial features are learned together. GA uses two conv layers with Leaky ReLU activation to retain negative weights from the first conv layer in the backbone. The output is passed through a sigmoid activation and element-wise multiplied with the input feature maps (see GA inset in Figure 5).

We avoid max and average pooling since they cause loss of information, and we want to preserve discriminative features for the basic blocks in the architecture core. We show an example of feature map correction in Figure 6: the top row shows the original feature maps, which are skewed towards the shadow under the vehicle; the bottom row shows the corrected feature maps without skew after applying GA.
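The gating described above can be illustrated with a minimal pure-Python sketch. It is not our implementation: it assumes 1x1 convolutions (the kernel size is chosen only for brevity), a Leaky ReLU slope of 0.01, no bias terms, and a sigmoid applied to the output of the second conv layer. The names `conv1x1`, `map_values`, and `global_attention` are hypothetical.

```python
import math

def leaky_relu(x, slope=0.01):
    # Keeps a scaled copy of negative activations instead of zeroing them.
    return x if x >= 0 else slope * x

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def conv1x1(fmap, weights):
    # fmap: [C_in][H][W]; weights: [C_out][C_in]. Mixes channels per pixel.
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w_out[c] * fmap[c][i][j] for c in range(len(fmap)))
          for j in range(width)]
         for i in range(height)]
        for w_out in weights
    ]

def map_values(fmap, fn):
    # Apply fn to every scalar of a [C][H][W] nested list.
    return [[[fn(v) for v in row] for row in channel] for channel in fmap]

def global_attention(fmap, w1, w2):
    # Assumed order: conv -> Leaky ReLU -> conv -> sigmoid -> gate the input.
    # No pooling is used, so the mask has one weight per channel and pixel.
    hidden = map_values(conv1x1(fmap, w1), leaky_relu)
    mask = map_values(conv1x1(hidden, w2), sigmoid)
    return [[[m * v for m, v in zip(m_row, v_row)]
             for m_row, v_row in zip(m_ch, v_ch)]
            for m_ch, v_ch in zip(mask, fmap)]
```

Because the mask lies strictly between 0 and 1, the module can only attenuate a feature map entry, never amplify it, which is one way to read "re-weighting" in this sketch.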

Fig. 6: Feature map correction after Global Attention (GA). Top features are skewed towards dark shadow, which does not carry any information. GA drives attention towards more useful details and corrects the skew.

ProxyNCA Loss. Our brand discrimination model’s target is to learn a distance metric on vehicle brands such that features for vehicle brands are clustered in the output feature space. We train with the ProxyNCA loss for distance metric learning introduced in [9]. With ProxyNCA, a model learns a set of proxies that map to training classes. Features from training-set images are mapped to the proxy-domain and the distance is maximized amongst the proxy clusters themselves. During inference, the proxies are ignored, and the true features are used to cluster vehicle brands.
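As a rough illustration of proxy-based metric learning, the sketch below computes the loss for a single embedding in the form we understand from [9]: the squared distance to the proxy of the true class, contrasted against the proxies of all other classes. The function names, the two-dimensional embeddings, and the class keys are hypothetical, and normalization or scaling steps used in practice are omitted.

```python
import math

def sq_dist(u, v):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def proxy_nca_loss(embedding, label, proxies):
    # -log( exp(-d(x, p_y)) / sum_{z != y} exp(-d(x, p_z)) )
    # which simplifies to d(x, p_y) + log(sum_{z != y} exp(-d(x, p_z))).
    positive = sq_dist(embedding, proxies[label])
    negatives = [math.exp(-sq_dist(embedding, proxy))
                 for key, proxy in proxies.items() if key != label]
    return positive + math.log(sum(negatives))

# Hypothetical proxies for two training classes.
proxies = {"brand_a": [1.0, 0.0], "brand_b": [0.0, 1.0]}
near = proxy_nca_loss([0.9, 0.1], "brand_a", proxies)
far = proxy_nca_loss([0.1, 0.9], "brand_a", proxies)
```

An embedding close to its own proxy (`near`) yields a lower loss than one close to another class's proxy (`far`). At inference the proxies are discarded and only the embeddings are compared.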

V Intra-Class Variability Team

We now describe our base re-id model for the intra-class variability team. First, we describe the VeRi-776 dataset we use for evaluating our model. Then we cover our intra-class variability results.

V-A VeRi-776 Dataset

The VeRi-776 dataset for vehicle re-id was introduced in [35] to promote research in the field. It contains 776 unique vehicle identities, with 576 used for training and the remaining 200 used for testing. During testing, the target is to retrieve the unseen identities given a query, with performance evaluated on the ranking. The dataset contains mostly intra-class variability, with each identity having several images from multiple cameras in different environmental conditions.

V-B Base Model for Re-ID

We now describe our base re-id model for the intra-class variability team. Since we offload inter-class similarity discrimination to the inter-class similarity team, our re-id models are simpler and smaller than typical re-id models, with better performance. The intra-class variability base model is robust, extensible, and fast, as we will show. We call it REF-GLAMOR, for reference-GLAMOR, where GLAMOR stands for Global Attention Module for Re-ID.

Base Model Construction. In designing the re-id models, we follow principles similar to those used for the inter-class similarity brand discrimination model. Specifically, we use the ResNet-18 backbone with GA. Unlike the inter-class similarity models:

  • We do not use CBAM on the re-id models. Since CBAM has already been applied to the first basic block in inter-class similarity models, CBAM on the first basic block in a re-id model would perform redundant feature extraction. CBAM on the last basic block would cause the re-id model to overfit on the training data.

  • We use the warmup learning rate suggested in [45] to gradually increase the learning rate during training and improve feature extraction on pretrained backbones.

  • We use the Random Erasing Augmentation proposed in [46]. Surprisingly, we found this method, which has been shown to be effective in face recognition, has not been applied to vehicle re-id.
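The two training choices in the list above can be sketched as follows. This is a minimal illustration, not the exact procedures of [45] or [46]: it assumes a linear warmup, and the base learning rate, warmup length, erase probability, and maximum erase fraction are hypothetical values. The function names are also hypothetical.

```python
import random

def warmup_lr(epoch, base_lr=3.5e-4, warmup_epochs=10):
    # Linearly ramp the learning rate up to base_lr, then hold it constant.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

def random_erase(image, prob=0.5, max_frac=0.4, rng=random):
    # image: list of rows of floats. Returns a copy in which, with
    # probability prob, one random rectangle is overwritten with noise.
    out = [row[:] for row in image]
    if rng.random() >= prob:
        return out
    height, width = len(image), len(image[0])
    erase_h = rng.randint(1, max(1, int(height * max_frac)))
    erase_w = rng.randint(1, max(1, int(width * max_frac)))
    top = rng.randint(0, height - erase_h)
    left = rng.randint(0, width - erase_w)
    for i in range(top, top + erase_h):
        for j in range(left, left + erase_w):
            out[i][j] = rng.random()
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which is convenient when comparing runs.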

Triplet Loss. We use the standard triplet loss for distance metric learning on the features:


$$\mathcal{L}_{tri}(a, p, n) = \max\left(0,\; \lVert f(a) - f(p) \rVert_2 - \lVert f(a) - f(n) \rVert_2 + m\right)$$

where $a$, $p$, and $n$ are the anchor, positive, and negative of a triplet, $f$ is the embedding network, and $m$ is the margin constraint enforcing the minimum distance difference between two images from the same identity (anchor and positive) compared to two images from distinct identities (anchor and negative). The triplet loss generates a mapping such that images of the same identity are mapped close together, ideally to the same point.
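A minimal sketch of the standard triplet loss on precomputed embeddings follows. It assumes Euclidean distance and a hypothetical margin of 0.3; the embedding network itself is omitted, so the inputs stand in for its outputs on the anchor, positive, and negative images.

```python
import math

def euclidean(u, v):
    # Euclidean distance between two equal-length embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(f_anchor, f_positive, f_negative, margin=0.3):
    # Zero once the negative is at least `margin` farther from the anchor
    # than the positive is; otherwise penalize the shortfall.
    return max(0.0, euclidean(f_anchor, f_positive)
               - euclidean(f_anchor, f_negative) + margin)

anchor, positive = [0.0, 0.0], [0.0, 0.1]
easy = triplet_loss(anchor, positive, [1.0, 0.0])   # negative is far away
hard = triplet_loss(anchor, positive, [0.2, 0.0])   # negative is close
```

Here `easy` is zero because the margin is already satisfied, while `hard` is positive and would contribute a gradient during training.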

VI Evaluation

To validate our approach, we evaluate each novel contribution: the inter-class similarity team for feature discrimination, and the intra-class variability team for re-id feature extraction. We compare our approaches to the state-of-the-art.

VI-A Brand Detection: Evaluation on Cars196

Each submodel of the inter-class similarity team performs inter-class similarity clustering on natural features to enforce naturally induced sparsity on the subsequent intra-class variability team. As described in Section IV-A, we evaluate on the well-known Cars196 dataset with inter-class similarity.

Evaluation Results. We evaluate NMI and Rank-1 on the Cars196 dataset and compare to recent approaches in Table I. To support our final model construction, we examine the impact of CBAM on different basic blocks by testing three architectures: CBAM on all blocks, CBAM on the first block, and CBAM on the final block of ResNet.

Interestingly, adding CBAM throughout the network reduces performance, since the later basic blocks begin overfitting on the fine-grained features that appear exclusively in the training set. We support this observation with results from CBAM-1, where CBAM is applied to the first basic block, and CBAM-4, where CBAM is applied to the final (fourth) basic block. The results support our observations in Section IV-B: CBAM-4 has even worse performance, while CBAM-1 increases performance beyond CBAM everywhere.

With our model using CBAM-1 and global attention, we achieve state-of-the-art results (Table I), with NMI of 66.03% and Rank-1 of 82.75% (about 9.5 percentage points better than Proxy-NCA [9]).

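| Method | NMI | R-1 | R-2 | R-4 | R-8 |
|---|---|---|---|---|---|
| DML [47] | 56.70 | 49.50 | - | - | - |
| DSP [48] | 64.40 | 72.90 | 81.60 | 88.80 | - |
| Proxy-NCA [9] | 64.90 | 73.22 | 82.42 | 86.36 | 88.68 |
| Baseline (B) | 54.63 | 72.28 | 81.63 | 88.52 | 93.27 |
| B+CBAM | 56.71 | 73.55 | 82.09 | 87.99 | 92.26 |
| B+CBAM-4 | 31.39 | 22.94 | 32.72 | 43.56 | 56.08 |
| B+CBAM-1 | 63.04 | 80.56 | 87.86 | 92.52 | 95.73 |
| B+CBAM-1+GA (Ours) | 66.03 | 82.75 | 89.68 | 93.72 | 96.41 |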
TABLE I: Experimental results compared to recent approaches on the Cars196 dataset

VI-B Re-ID: Evaluation on VeRi-776

We now show the performance of REF-GLAMOR for intra-class variability minimization. We evaluate our overall model described in Section V-B on the VeRi-776 dataset and compare to recent approaches. To evaluate the impact of GA, we compare the re-id base model without global attention and with global attention in Table II.

| Approach | mAP | CMC-1 | CMC-5 |
|---|---|---|---|
| DAVR [18] | 26.35 | 62.21 | 73.66 |
| OIFE+ST [23] | 51.42 | 68.30 | 89.70 |
| Hard-View-EALN [17] | 57.44 | 84.39 | 94.05 |
| (label missing) | 58.27 | 83.49 | 90.04 |
| GSTRE [41] | 59.47 | 96.24 | 98.97 |
| VAMI+ST [19] | 61.32 | 85.92 | 97.70 |
| RAM [16] | 61.50 | 88.60 | 94.00 |
| Quadruplet [20] | 61.83 | 88.50 | 94.46 |
| MTML-OSG [10] | 64.6 | 92.30 | 94.2 |
| (label missing) | 64.48 | 63.90 | 86.20 |
| REF-GLAMOR (Ours) | 71.08 | 89.21 | 95.47 |
TABLE II: Performance of REF-GLAMOR base model on VeRi-776 re-id dataset

REF-GLAMOR uses only a ResNet-18 backbone and achieves impressive performance compared to existing approaches that use multiple feature extractors. Since we do not need to perform inter-class similarity discrimination, REF-GLAMOR based models perform well on their respective subsets of the input space. Furthermore, performance is significantly improved with the inclusion of the GA module, since the first convolutional layer in the backbone has reduced feature map skew. We show an example in Figure 7, where GA reduces noisy or bad kernels in the first layer.

Fig. 7: Feature map skew correction with Global Attention; loss of information due to poor filters is corrected in the first convolutional layer.

VII Discussion

Here we discuss qualitative aspects of our teamed classifier approach for a vehicle ID framework.

VII-A Robustness of Re-ID Team

| Approach | Construction | Parameters | Total | mAP |
|---|---|---|---|---|
| OIFE [23] | 20 stacked convolutional networks (SCN) for feature maps, passed through 6 modified Inception networks, with features combined with dense layers | 20x SCN: 26M params; 1x Modified Inception: 1M params; 1x Output FC: 1280x256 dense layer | 521M | 51.42 |
| RAM [16] | 3 ResNet50 branches, a normalization branch, and a global features branch, with features combined with dense layers | 3x ResNet50: 26M params; 1x Norm. branch: 80M params; 1x Features branch: 80M params; 1x Output FC: 5120x1024 dense layer | 244M | 61.5 |
| MTML-OSG [10] | 4 convolutional feature extractors, each using ResNet50, with features combined with dense (FC) layers | 4x ResNet50: 26M params; 3x FC: 2048x575 dense layers; 1x Output FC: 8192x575 dense layer | 110M | 64.6 |
| REF-GLAMOR | ResNet18 backbone + GA module | 1x ResNet18: 11M params; 1x GA Module: 200K params | 11M | 71.08 |
TABLE III: Approximate number of parameters in current re-id approaches compared to REF-GLAMOR
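The totals in Table III can be approximately reproduced by summing the listed components. The sketch below is a back-of-the-envelope check under two assumptions: each "N M params" entry is per component, and each dense layer contributes only its weight matrix (biases ignored). The dictionary keys and the helper name `total_params` are ours, for illustration only.

```python
M = 1_000_000

def total_params(components):
    # components: list of (count, parameters per component) pairs.
    return sum(count * size for count, size in components)

approaches = {
    "OIFE [23]": [(20, 26 * M), (1, 1 * M), (1, 1280 * 256)],
    "RAM [16]": [(3, 26 * M), (1, 80 * M), (1, 80 * M), (1, 5120 * 1024)],
    "4x ResNet50 approach (mAP 64.6)": [(4, 26 * M), (3, 2048 * 575), (1, 8192 * 575)],
    "REF-GLAMOR": [(1, 11 * M), (1, 200_000)],
}

for name, components in approaches.items():
    print(f"{name}: {total_params(components) / M:.1f}M")
```

Under these assumptions the sums come to roughly 521M, 243M, 112M, and 11.2M parameters, consistent with the rounded totals in the table, and the largest model is about 46 times the size of REF-GLAMOR.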

Our team for vehicle re-id is robust to multi-scale, multi-orientation images. This is evident from our results in Table II, where we show state-of-the-art mAP compared to existing approaches. While our CMC-1 is lower than that of a few methods such as GSTRE [41] and MTML-OSG [10], mAP is a better measure of robustness. Whereas top-k retrieval (CMC-k) measures whether a correct match appears within the first k results, mAP measures overall ranking quality by averaging, across all queries, the precision of the retrieved true positives. A higher CMC-1 indicates that the first result retrieved is relevant; a higher mAP indicates that most top-ranked results are relevant. As such, mAP is a stronger measure of robustness. Our model with GA achieves robust overall performance at mAP 71.08, compared to the next best mAP of 64.6 from [10].
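The difference between the two metrics can be seen in a small worked example. The sketch below uses the common definitions of CMC-k and average precision on ranked lists of relevance flags; the two ranked lists are invented for illustration and are not drawn from our experiments.

```python
def average_precision(ranked_relevance):
    # ranked_relevance: booleans in ranked order, True for a correct match.
    # Assumes every correct match for the query appears somewhere in the list.
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

def cmc_at_k(queries, k):
    # Fraction of queries with at least one correct match in the top k.
    return sum(any(q[:k]) for q in queries) / len(queries)

T, F = True, False
model_a = [[T, F, F, F, F, F, F, F, T, T]]  # first hit correct, rest ranked late
model_b = [[F, T, T, T, F, F, F, F, F, F]]  # first hit wrong, rest ranked early
```

On this single query, `model_a` wins on CMC-1 (1.0 versus 0.0) but loses on mAP (about 0.51 versus about 0.64), because mAP rewards placing all correct matches near the top rather than only the first one.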

VII-B Extensibility of Teamed Classifiers

We have already discussed existing ensemble-based approaches (boosting/stacked ensembles) and the more recent mixture-of-experts ensembles. Our motivation in proposing teamed classifiers comes from our observation that several real-world domains have naturally induced sparsity.

Compared to the sparse mixture-of-experts model, which must learn the underlying sparse regions of the input space, our teamed classifier approach uses supervision to guarantee sparsity in the classifier teams. Furthermore, by separately training the gating function for brands/color and the classifier teams for re-id, our pipeline is more extensible to new knowledge. Whereas the sparse mixture-of-experts must be retrained to handle new types of knowledge, our approach simply adds a new gate in the form of a new member of the inter-class similarity team, and corresponding classifiers for that gate to the re-id team. Further, as we demonstrated in the inter-class similarity evaluation, our individual team members handle unseen concepts.

Consequently, the re-id models in the re-id team need only address intra-class variability. Once inter-class similarity discrimination is performed by the inter-class similarity team, the gating models select which members of the re-id team will be used for re-id. We show an example in the teamed classifier pipeline in Figure 2, where we depict three re-id teams (among many): the Toyota Brand Team, the Red Vehicle Team, and the Black SUV Team. If a color discriminator model identifies a red vehicle, its respective re-id team is used to extract identification features. This allows us to reduce instances of re-id mistakes, as shown in Figure 4.
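The supervised gating described above can be sketched as a simple lookup from predicted attributes to re-id teams. This is an illustrative sketch only: the attribute keys, team names, and the function `select_reid_teams` are hypothetical, and the attribute values stand in for the outputs of the brand and color discriminator models.

```python
def select_reid_teams(attributes, team_gates):
    # A team is selected when every attribute required by its gate matches
    # the attributes predicted for the detected vehicle.
    return [name for name, gate in team_gates.items()
            if all(attributes.get(key) == value for key, value in gate.items())]

team_gates = {
    "toyota_brand_team": {"brand": "toyota"},
    "red_vehicle_team": {"color": "red"},
    "black_suv_team": {"color": "black", "type": "suv"},
}

# A red sedan of an unrecognized brand is routed to the red vehicle team only.
selected = select_reid_teams({"color": "red", "type": "sedan"}, team_gates)

# Extending the pipeline is a matter of registering one more gate; the
# existing gates and their re-id models are left untouched.
team_gates["white_truck_team"] = {"color": "white", "type": "truck"}
```

In this sketch extensibility amounts to adding a dictionary entry, mirroring the claim that a new gate and its classifiers can be added without retraining the rest of the pipeline.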

VII-C Real-time Performance

By offloading inter-class similarity discrimination to the inter-class similarity team, we are able to make our re-id models smaller than existing approaches. Table III shows the approximate number of parameters in current approaches and in our own, along with overall mAP.

Since our re-id models use ResNet18, they can deliver real-time performance for vehicle tracking without GPUs; a reduced-parameter ResNet18 [50] achieves 50 FPS on CPU.

VIII Conclusion

We have presented a new approach for conditional computation with teamed classifiers for vehicle tracking and identification. We describe an end-to-end approach for vehicle tracking, attribute extraction, and identification using teamed classifiers. With our teamed classifier approach, we build dynamic ensembles for feature extraction that are selected at inference time. Similar to the mixture-of-experts model, we build a conditional network with sparsity that gates access to the dynamic ensembles. During pipeline construction, we build teams of models such that each member is assigned to a region of the input space. During inference, we determine the region of the input space an input belongs to and dynamically select the team members for feature extraction. Unlike the mixture-of-experts model, where the sparsity constraint is enforced through training, our teamed classifier approach exploits the naturally induced sparsity of the input space in vehicle tracking to perform supervised team generation and gating.

We implement teamed classifiers for object detection (detector team models with camera-specialized detectors), vehicle attribute extraction, and vehicle identification. Since we adapt student-teacher networks for the detector team and standard image classifiers for some attribute extractors, we focus evaluation on the novel team models: the brand discrimination team and the re-id models. We demonstrate state-of-the-art performance on each task and show the advantages of our teamed classifier approach in Section VII, where we contrast the performance improvement of our approach with its reduced number of parameters (and consequently, operations) compared with current methods.


This research has been partially funded by the National Science Foundation through CISE’s SAVI/RCN (1402266, 1550379), CNS (1421561), CRISP (1541074), and SaTC (1564097) programs, an REU supplement (1545173), and gifts, grants, or contracts from Fujitsu, HP, Intel, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.