Locality Aware Appearance Metric for Multi-Target Multi-Camera Tracking

11/27/2019 ∙ by Yunzhong Hou, et al. ∙ Tsinghua University ∙ Australian National University

Multi-target multi-camera tracking (MTMCT) systems track targets across cameras. Due to the continuity of target trajectories, tracking systems usually restrict their data association to a local neighborhood. In single camera tracking, the local neighborhood refers to consecutive frames; in multi-camera tracking, it refers to neighboring cameras in which the target may appear successively. For similarity estimation, tracking systems often adopt appearance features learned from the re-identification (re-ID) perspective. Unlike tracking, re-ID usually does not have access to the trajectory cues that can limit the search space to a local neighborhood. Due to its global matching property, the re-ID perspective requires learning global appearance features. We argue that the mismatch between the local matching procedure in tracking and the global nature of re-ID appearance features may compromise MTMCT performance. To fit the local matching procedure in MTMCT, in this work, we introduce the locality aware appearance metric (LAAM). Specifically, we design an intra-camera metric for single camera tracking, and an inter-camera metric for multi-camera tracking. Both metrics are trained with data pairs sampled from their corresponding local neighborhoods, as opposed to global sampling in the re-ID perspective. We show that the locally learned metrics can be successfully applied on top of several globally learned re-ID features. With the proposed method, we report new state-of-the-art performance on the DukeMTMC dataset, and a substantial improvement on the CityFlow dataset.




1 Introduction

Figure 1: Difference between re-ID and MTMCT. Given a query, re-ID searches the global gallery for true matches from all cameras. In comparison, MTMCT matches within a local neighborhood, in single camera tracking (SCT) and multi-camera tracking (MCT). Specifically, in MCT, when the target is in camera 2, we do not consider camera 3, since targets never appear in these two cameras successively (cameras may be too far away).
Figure 2: (A) A global metric learned from the entire train set considers all data. This metric is relatively robust but has a slack decision boundary (errors often exist). (B) This paper proposes a locally learned metric. It has a tight decision boundary and is more sensitive. In MTMCT, data association is usually within a local neighborhood, as opposed to global matching in re-ID, so local metric learning is a better fit. The proposed locality aware appearance metric (LAAM) has (C) an intra-camera metric for SCT and (D) an inter-camera metric for MCT. The former is learned on tracklet pairs within a time period in the same camera. The latter is learned on tracklet pairs across neighboring cameras (in which the target may appear successively).

Multi-target multi-camera tracking (MTMCT) aims to identify and locate all targets in a multi-camera system at all times. In a closely related task, given a probe image, re-identification (re-ID) systems search the gallery to retrieve images of the same identity.

An MTMCT system is composed of several components, including detection, similarity estimation, and data association. Based on similarity estimations, detected object bounding boxes are associated first in single camera tracking (SCT), and then in multi-camera tracking (MCT). Since the targets, e.g., pedestrians and vehicles, have continuous trajectories, most tracking systems only search the local neighborhood for data association. For example, temporal sliding window techniques are employed in many MTMCT systems [37, 36, 46]. In SCT, temporal sliding windows restrict the local matching neighborhood to the consecutive frames within a camera. In MCT, these windows restrict the local matching neighborhood to the neighboring cameras in which the target may appear successively.

The appearance feature is a driving force in MTMCT. Currently, the tracking community shares very similar appearance representations and deep learning architectures with the re-ID community. That is, the feature is learned globally from the entire train set, and then applied to both SCT and MCT [37, 54].

However, MTMCT and re-ID have their differences. First, re-ID systems usually do not have access to trajectories, camera topology, and other spatial-temporal cues. Second, SCT is a very important part of MTMCT. In contrast, re-ID ignores candidates of the same camera as the probe in its evaluation [55].

In this study, we further explore the third and most important difference: local matching in tracking versus global matching in re-ID. As shown in Fig. 1, MTMCT only associates data within a local neighborhood (smaller appearance variances). Specifically, in SCT, only consecutive frames within a single camera video are searched. In MCT, we match in a limited camera pool, as the search scope is narrowed by temporal sliding windows. On the other hand, given a query image, re-ID searches a global gallery covering all cameras (large appearance variances). This local versus global difference is non-trivial. When applying the re-ID features directly, the mismatch between local matching in tracking and global re-ID appearance features may compromise the MTMCT performance.

In fact, we believe this local versus global mismatch is the reason for the phenomenon Ristani et al. noticed. In their work [37], it is found that high-performing re-ID features do not necessarily lead to good MTMCT performance. In fact, re-ID models learn to deal with all kinds of environmental variances. However, in SCT, we only need to match consecutive frames that have relatively small (compared to cross-camera) appearance changes. In MCT, we still do not need to consider all environmental variance. For example, features for MCT do not need to be robust against viewpoint variance and low-resolution simultaneously (Fig. 1), since targets never appear in these cameras successively. In such cases, a stronger re-ID appearance feature does not necessarily lead to a higher MTMCT result.

To fit the local matching procedure in tracking, this paper proposes a locality aware appearance metric (LAAM). Specifically, for SCT, we sample training data pairs from consecutive frames within a single camera. For MCT, the training data pairs are selected from neighboring cameras (in which the target may appear successively). Using the two sampling strategies, we obtain an intra-camera metric for SCT and an inter-camera metric for MCT (see Fig. 2).

We show that LAAM can effectively improve tracking accuracy on multiple datasets, including a pedestrian dataset, DukeMTMC [36], and a vehicle dataset, CityFlow [44]. It can also be applied on top of multiple re-ID features, such as IDE [56], PCB [43] and the triplet feature [21]. With a competitive tracker [37], we report state-of-the-art accuracy on the DukeMTMC dataset.

Figure 3: MTMCT system overview. Given object bounding boxes, we first connect the bounding boxes into short but reliable tracklets. Then, the tracklets are merged into single camera trajectories. Finally, single camera trajectories are associated to form cross-camera tracks. The proposed LAAM includes an intra-camera metric and an inter-camera metric. The intra- and inter-camera metrics are applied to generate single camera trajectories and cross-camera tracks, respectively.

2 Related Work

Multi-object tracking in a single camera. In MTMCT, the SCT step is inspired by multi-object tracking (MOT) [27, 33, 28]. There are both online and offline methods. Online tracking methods cannot use data from future time slots. They usually associate detections to tracklets in a greedy manner [15, 9]. Offline methods can benefit from future information. They usually formulate the problem as batch optimization, such as shortest path [3, 13, 50, 9], bipartite graph [4, 5], and pairwise terms [29, 49, 18, 53, 24, 25, 7, 10, 11, 12]. To reduce computational complexity, some employ a hierarchical approach [24, 42, 41], or temporal sliding windows [38, 34, 9].

Cross-camera tracking in MTMCT. Cross-camera tracking is a unique feature of MTMCT [46, 32, 37, 52, 54, 23]. Offline methods [37, 36, 46, 54] usually employ batch optimization techniques for higher accuracy, which is similar to MOT trackers. Vehicle MTMCT is also studied. In [45], Tang et al. use multiple cues to accommodate the similar appearance, heavy occlusion, and large viewing angle variation in vehicle tracking.

Re-ID features and their application in MTMCT. Re-ID originated from cross-camera tracking [56]. Recently, many competitive CNN structures have been proposed in this area [55, 43]. Loss functions and training techniques are studied, such as the contrastive loss [48], triplet loss [39, 8, 30] and hard negative mining [21]. Data augmentation methods are explored to enrich the database [1, 58]. The advancement in re-ID has been pushing forward the state-of-the-art in MTMCT [37, 54, 23]. In [37], Ristani et al. propose a global feature learning method to improve the performance on both re-ID and MTMCT.

Metric learning for multi-camera tasks. Metric learning algorithms have been studied in re-ID [56]. Metric learning is also investigated in tracking to compute the similarity between observations. Unlike predefined distance metrics, these learned metrics can automatically adapt to a specific scenario and yield higher accuracy [2]. For example, Leal-Taixé et al. [26] train a Siamese network to aggregate pixel values and optical flow. In [51], Xiang et al. jointly learn a global feature representation and a distance metric for multi-object tracking. In [47], Thoreau et al. learn a Siamese network from re-ID datasets for similarity estimation in online tracking.

Departing from existing works, this paper studies the intrinsic dissimilarities between MTMCT and re-ID. Instead of directly learning a global feature representation/metric, we investigate locality aware appearance metric (LAAM) to meet the local matching in MTMCT data association.

3 MTMCT System Overview

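Problem formulation. We follow the graph-based problem formulation introduced in [37]. We represent observations (bounding boxes, tracklets, trajectories) as nodes $V$, and the similarities between them as weighted edges $E$ in a graph $G = (V, E)$. For a pair of nodes $i, j \in V$, $w_{ij}$ refers to the estimated similarity between them, and $x_{ij} \in \{0, 1\}$ indicates whether they are of the same identity. The optimization problem is formulated as follows,

$$\max_{x} \sum_{(i,j) \in E} w_{ij}\, x_{ij} \quad \text{s.t.} \quad x_{ij} + x_{jk} \le 1 + x_{ik}, \;\; \forall\, i, j, k \in V. \tag{1}$$
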
Eq. 1 maximizes intra-group similarity, minimizes inter-group similarity, and enforces transitivity (two data should be of same identity if both of them share identity with a third data point). In fact, a better performing similarity estimation will make this optimization problem easier, and improve association accuracy.
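
As a rough illustration of this objective, the sketch below brute-forces a tiny instance: it enumerates every grouping of four observations and keeps the one with the largest sum of within-group similarities. Grouping by partition satisfies transitivity by construction. The function names and the signed toy weights are assumptions made for this example; a real tracker would use an optimizer rather than enumeration.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing group in turn ...
        for k, group in enumerate(smaller):
            yield smaller[:k] + [[first] + group] + smaller[k + 1:]
        # ... or open a new group for it.
        yield [[first]] + smaller


def objective(groups, weights):
    """Sum of pairwise similarities over pairs placed in the same group."""
    total = 0.0
    for group in groups:
        for i, j in combinations(sorted(group), 2):
            total += weights[(i, j)]
    return total


def best_partition(nodes, weights):
    """Exhaustive search; only feasible for a handful of nodes."""
    return max(partitions(list(nodes)), key=lambda g: objective(g, weights))


# Toy signed similarities (hypothetical): positive = likely same identity.
weights = {
    (0, 1): 0.8, (0, 2): -0.6, (0, 3): -0.7,
    (1, 2): -0.4, (1, 3): -0.5, (2, 3): 0.9,
}
groups = best_partition(range(4), weights)
print(sorted(sorted(g) for g in groups))
```

Because similarities are signed, merging two dissimilar observations lowers the objective, which is how the same sum both rewards intra-group similarity and penalizes inter-group similarity.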

Detection. For the DukeMTMC dataset, we adopt the OpenPose [6] detector following [37]. For the CityFlow dataset, we use the SSD [31] detector provided by the AI-City 2019 challenge [40].

Similarity estimation. In the baseline, given a pair of CNN features and , their appearance similarity score is computed as,


where is a distance metric, and we simply employ Euclidean distance here. , and . and denote the average feature distance of the same and different identities, respectively.

Data association. Figure 3 depicts the overall data association procedure. First, object bounding boxes are connected into tracklets. Then, the tracklets are matched into single camera trajectories. At last, the single camera trajectories are associated to form cross-camera tracks.

For SCT, we use a short temporal sliding window to associate tracklets. For MCT, a much longer temporal sliding window is used in data association, due to the long walking time of targets across cameras.
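
The effect of a temporal window on the candidate set can be sketched as follows. This is a simplification under stated assumptions: it keeps only pairs whose start frames differ by at most `window` frames, whereas an actual tracker processes overlapping windows in sequence. The `start` field and the function name are hypothetical.

```python
def candidate_pairs(tracklets, window):
    """Return index pairs (i, j), i < j, whose start frames are within `window`."""
    order = sorted(range(len(tracklets)), key=lambda k: tracklets[k]["start"])
    pairs = []
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if tracklets[j]["start"] - tracklets[i]["start"] > window:
                break  # tracklets later in time are even further away
            pairs.append((min(i, j), max(i, j)))
    return pairs


tracklets = [{"start": 0}, {"start": 40}, {"start": 100}, {"start": 500}]
print(candidate_pairs(tracklets, window=120))   # short window: few candidates
print(len(candidate_pairs(tracklets, window=1000)))  # long window: all pairs
```

A short window (as in SCT) leaves only temporally adjacent candidates, while a long window (as in MCT) admits many more pairs to the association step.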

Figure 4: The proposed data sampling strategy for training the intra- and inter-camera metrics. Camera labels of the same person are colored the same. A shorter data sampling window is used to sample within one camera for the intra-camera metric (A). On the other hand, the inter-camera metric uses a longer data sampling window to sample positive data pairs from a different camera, and negative pairs from random cameras (B).
Figure 5: Structure of the metric network for LAAM. It has three fully connected layers and a 2-dim softmax output layer. The network takes the absolute difference vector between a pair of features as input and outputs the confidence score of the input pair belonging to the same person.

4 Locality Aware Appearance Metric

As mentioned in Section 3, a good similarity estimation substantially improves association accuracy. In this section, we present a novel locality aware appearance metric (LAAM) that focuses on local neighboring samples. Different from a re-ID metric, which aims at retrieving images from a global gallery, the learned locality aware appearance metric focuses on matching local neighboring candidates, which better fits the local matching tasks in MTMCT. LAAM is composed of a metric network (not our contribution) and a novel data sampling strategy (main contribution). Descriptions are provided below.

4.1 Metric Network Structure

The metric network is used to compute similarity scores between a pair of tracklets or trajectories, both of which are generated by global average pooling bounding box features. In our method, it replaces the Euclidean distance based similarity estimation (Eq. 2).
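
A minimal sketch of the pooling step, assuming each bounding box feature is a plain list of floats of equal length (the function name is illustrative, not taken from the original implementation):

```python
def average_pool(box_features):
    """Average a list of per-box feature vectors into one tracklet feature."""
    n = len(box_features)
    # zip(*...) walks the vectors dimension by dimension.
    return [sum(column) / n for column in zip(*box_features)]


tracklet_feature = average_pool([[1.0, 2.0], [3.0, 4.0]])
print(tracklet_feature)
```

The same operation applied to all boxes of a trajectory gives the trajectory feature.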

As shown in Fig. 5, the network is a 3-layer perceptron [17]. Each hidden fully connected layer is followed by a ReLU activation. Given a pair of features, their absolute difference is used as the network input. The output of the metric network is a 2-dim softmaxed vector, whose two entries encode the probability of the input pair being of different identities or the same identity, respectively.

During training, the re-ID feature extractor is fixed, and only the metric network is updated. The metric network is trained as a classification problem using a cross-entropy loss function. During testing, we apply a scaling factor to the softmax layer, to prevent the appearance similarity score from overshadowing other cues, e.g., spatial-temporal cues. The similarity score for the proposed metric is computed by,


This similarity value should be positive if the data pair belongs to the same identity, and negative if otherwise.
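
The forward pass described in this section can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: the layer sizes, the random initialization, and the reading of "3-layer" as two hidden layers plus a 2-unit output layer are all hypothetical, and the test-time scaling factor and the final signed similarity score are not modeled.

```python
import math
import random


def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Hypothetical random initialization, for illustration only."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def metric_forward(feat_a, feat_b, layers):
    """Return [p_different, p_same] for a pair of appearance features."""
    x = [abs(a - b) for a, b in zip(feat_a, feat_b)]  # absolute difference input
    for weight, bias in layers[:-1]:
        x = relu(linear(x, weight, bias))
    weight, bias = layers[-1]
    return softmax(linear(x, weight, bias))


rng = random.Random(0)
# Assumed sizes: 4-dim features, 8 hidden units, 2 outputs.
layers = [init_layer(rng, 4, 8), init_layer(rng, 8, 8), init_layer(rng, 8, 2)]
feat_a = [0.1, 0.4, 0.3, 0.9]
feat_b = [0.2, 0.1, 0.3, 0.5]
probs = metric_forward(feat_a, feat_b, layers)
print(probs)
```

Using the absolute difference as input makes the output symmetric in the two features, so the order in which a pair is presented does not matter.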

4.2 Intra-Camera Metric and Inter-Camera Metric

LAAM has an intra-camera metric and an inter-camera metric, for SCT and MCT, respectively (Fig. 2). Both of the metrics are trained with local neighboring data pairs. Similar to data association, we find that temporal windows can effectively find the corresponding local neighborhood intra-camera or inter-camera. Hence, we use temporal windows for sampling training data pairs in the proposed LAAM.

Intra-camera metric. For data associations in SCT, we train an intra-camera metric to provide similarity estimation between tracklets. The metric network takes tracklet features as input for both training and testing. During training, the tracklet features are computed on ground truth images, while during testing, the tracklet features are computed on pedestrian detections.

In training, we sample local neighboring data pairs within a short temporal window from the target camera. As shown in Fig. 4, for every tracklet (yellow box) and its feature, we first randomly decide whether its pair will be of the same identity or not. Same-identity and different-identity data pairs are denoted as positive pairs and negative pairs, respectively. Note that the positive/negative pairs are generated with a fixed ratio for data balance. We then choose the second tracklet feature for the positive/negative pair. For positive pairs, we sample a tracklet (green box) that belongs to the same identity as the first tracklet, within the data sampling window. For negative pairs, we sample a tracklet (red box) that belongs to a different identity, within the same data sampling window. Either way, we end up with a pair of tracklet features. Finally, we feed their absolute difference vector into the metric network as input.

Figure 6: Matching error comparison between the global metric and the proposed LAAM. We report false positive rate and false negative rate on the validation set.

Inter-camera metric. For data association in MCT, we train an inter-camera metric to provide similarity estimation between single camera trajectories. The metric network takes tracklet/trajectory features as input for training and testing, respectively. During training, we use tracklet features instead of trajectory features, due to the scarcity of single camera trajectories. Same as intra-camera metric, tracklet features for training are computed on ground truth images. During testing, trajectory features are computed on pedestrian detections.

The construction of cross-camera data pairs is very similar to that of within-camera data pairs, except for the following differences. First, the data pair is chosen within a data sampling window that spans cameras. This cross-camera sampling window is usually longer than the within-camera sampling window. Second, the positive/negative data pair sampling mechanism is different. For positive data pairs, we sample a tracklet (green box) that belongs to the same identity as the first tracklet within the cross-camera window, and we require the positive tracklet to be sampled from a different camera. For negative data pairs, we sample a tracklet (red box) that belongs to a different identity, from a random camera within the temporal sampling window.
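
The two sampling strategies can be sketched side by side. This is a minimal illustration under assumptions: tracklets are dictionaries with hypothetical `id`, `cam` and `frame` fields, positives and negatives are drawn with equal probability (the actual ratio is a training setting), and the window lengths in the usage lines are arbitrary.

```python
import random


def sample_pair(anchor, tracklets, window, mode, rng):
    """Sample (anchor, partner, label) for metric training, or None if no partner.

    mode="intra": the partner comes from the anchor's camera.
    mode="inter": positives come from a different camera,
                  negatives from any camera.
    """
    label = rng.choice([1, 0])  # 1 = same identity, 0 = different identity

    def eligible(t):
        if t is anchor or abs(t["frame"] - anchor["frame"]) > window:
            return False  # outside the data sampling window
        if (t["id"] == anchor["id"]) != bool(label):
            return False  # wrong identity relation for this label
        if mode == "intra":
            return t["cam"] == anchor["cam"]
        return label == 0 or t["cam"] != anchor["cam"]

    pool = [t for t in tracklets if eligible(t)]
    if not pool:
        return None
    return anchor, rng.choice(pool), label


tracklets = [
    {"id": 1, "cam": 1, "frame": 0},
    {"id": 1, "cam": 1, "frame": 50},
    {"id": 2, "cam": 1, "frame": 30},
    {"id": 1, "cam": 2, "frame": 900},
    {"id": 3, "cam": 2, "frame": 950},
    {"id": 2, "cam": 3, "frame": 5000},
]
rng = random.Random(0)
print(sample_pair(tracklets[0], tracklets, 100, "intra", rng))
print(sample_pair(tracklets[0], tracklets, 1000, "inter", rng))
```

Each sampled pair would then be turned into an absolute difference of the two tracklet features and fed to the metric network with `label` as the classification target.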

4.3 Discussion

Comparison between data sampling windows and temporal sliding windows. Data sampling windows and temporal sliding windows share some similarities. Both of them are used to restrict the data pool to its local neighborhood. However, there is a major difference. Temporal sliding windows are used in data association. In contrast, data sampling windows are used in data sampling to train the locality aware appearance metric (LAAM).

Data sampling window lengths. The lengths of the data sampling window in LAAM are different for the intra-camera metric and the inter-camera metric. The window length for within-camera data sampling is similar to the average trajectory duration. On the other hand, the window length for cross-camera data sampling should be long enough to cover local neighboring trajectories of the same identity in different cameras. The influence of sampling window length on LAAM training is shown in Fig. 7.

Comparison between global metric and LAAM. We validate LAAM through statistics comparisons. The matching errors during SCT and MCT are shown in Fig. 6. We use the intra-camera metric for SCT and inter-camera metric for MCT. Using the proposed method, the false positive rate is significantly lower than the global metric, while the false negative rate remains very similar.

Extreme cases. First, when the video frame rate is extremely low, unless they are returning, each target will only appear in one camera once. Under the circumstances, SCT will be completely removed, and thus the intra-camera metric will be obsolete. However, since the trajectory continuity still holds and the topology does not change, the locality in MCT will not be influenced. Thus, the inter-camera metric is still useful. Second, when the scenario is in open topology, targets travel to all cameras at the same probability. This time, the inter-camera metric will fall back to the global metric. However, the SCT data association is still local, and thus the intra-camera metric is useful.

5 Experiment

5.1 Dataset and Evaluation Protocol

Dataset. This paper uses the DukeMTMC dataset [36] and CityFlow dataset [44] to evaluate the proposed metric. DukeMTMC is a pedestrian tracking dataset. It contains 1080p 60fps videos from 8 cameras on a school campus. CityFlow is a vehicle tracking dataset. It has a low frame rate (10fps), severe occlusion, and fast-moving vehicles from 40 cameras, spanning over 2km. For simplicity, if not mentioned, we refer to the “validation set”, “test (easy)”, and “test (hard)” as that of the DukeMTMC dataset.

We also use the DukeMTMC-reID [57] and Market-1501 [55] datasets to evaluate re-ID appearance features.

Training and validation sets for DukeMTMC. In the DukeMTMC experiment, we use the first 40 minutes of the ground truth as the train set and the remaining 10 minutes as the validation set. For both validation and online testing, we only train on the train set.

Evaluation protocol. For MTMCT, following [36], we use IDF1, IDP, and IDR as evaluation metrics. Note that CityFlow only evaluates MCT. For both DukeMTMC and CityFlow datasets, we evaluate on their online test set. For re-ID evaluation, we adopt the rank-1 accuracy and mean average precision (mAP) [55] evaluation protocol.
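
For reference, the three identity measures are simple ratios of identity-level true positives (IDTP), false positives (IDFP) and false negatives (IDFN). The sketch below assumes these counts are already available from an identity-matching step, which is not shown; the function name and the toy counts are illustrative.

```python
def id_scores(idtp, idfp, idfn):
    """Return (IDF1, IDP, IDR) from identity-level match counts."""
    idp = idtp / (idtp + idfp)                   # identity precision
    idr = idtp / (idtp + idfn)                   # identity recall
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)   # harmonic mean of IDP and IDR
    return idf1, idp, idr


idf1, idp, idr = id_scores(idtp=60, idfp=10, idfn=40)
print(f"IDF1={idf1:.3f} IDP={idp:.3f} IDR={idr:.3f}")
```

Since IDF1 is the harmonic mean of IDP and IDR, a method cannot score well by trading one heavily for the other.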

| Method/variant | SCT similarity | MCT similarity |
| --- | --- | --- |
| “Baseline” | Euclidean distance based | Euclidean distance based |
| “Global metric” | Global metric | Global metric |
| “intra/global” | Intra-camera metric | Global metric |
| “global/inter” | Global metric | Inter-camera metric |
| “intra/intra” | Intra-camera metric | Intra-camera metric |
| “inter/inter” | Inter-camera metric | Inter-camera metric |
| “inter/intra” | Inter-camera metric | Intra-camera metric |
| “intra/inter” (ours) | Intra-camera metric | Inter-camera metric |

Table 1: Methods/variants compared in our experiment.

5.2 Implementation Details

Re-ID features. On DukeMTMC, we use three globally learned re-ID features, namely, the ID-discriminative embedding (IDE) [55], the triplet feature [21], and the part-based convolutional baseline (PCB) [43]. To train the three networks, we use the following settings. The input image is resized to . Random erasing [58] is employed for data augmentation. We use 1/60 of the ground truth images (1 frame every second) as training data for faster convergence. In fact, training re-ID features with fewer frames 1) enables fast convergence and 2) does not lead to an accuracy drop. We use ResNet-50 [20] pre-trained on ImageNet [14] as the backbone for the three models.

On CityFlow, we use a DenseNet-121 [22] based re-ID feature with softmax and triplet loss. We train on the provided vehicle re-ID dataset for CityFlow.

Baseline MTMCT tracker. Our baseline is developed from DeepCC [37], with the one modification. We allow a target returning to the same camera. This also helps recognize a target after long-time occlusion, which is difficult for SCT with a short temporal sliding window. On DukeMTMC, each tracklet has frames. The temporal sliding window lengths for SCT and MCT are frames and frames, respectively. On CityFlow, we set the tracklet length to 10 frames. Temporal sliding windows for SCT and MCT are set to 500-frame-long and 2400-frame-long, respectively. and are calculated from the train set in both datasets.

Metric learning settings. The proposed locality aware appearance metric is trained with tracklet features, average pooled from ground truth image re-ID features. The learning rate is set to

for the first 30 epochs, and then decays to

for the last 10 epochs. The batch size is set to . We use the cross-entropy loss to train the metric network. On DukeMTMC, within-camera data sampling window length is frames, whereas cross-camera data sampling window length is frames. On CityFlow, we set and .

Method variants and notations. In Table 1, we present some descriptions and notations of the methods to be evaluated in the experiment. The baseline uses the Euclidean distance for similarity estimation (Eq. 2). Similarity estimation in all the other variants is calculated as Eq. 3. The global metric also adopts the structure in Fig. 5. “intra/inter” is the proposed full system.

5.3 Evaluation of Re-ID Features

The performance of re-ID features used in existing MTMCT works and in our paper is summarized in Table 2. First, we find the accuracy of IDE is on par with the triplet feature and is lower than PCB. It is consistent with the observation in [43]. Second, comparing with the re-ID descriptors used in previous works, our feature extractors are competitive on both the DukeMTMC-reID [57] and Market-1501 [55] datasets. For example, on Market-1501, the rank-1 accuracy is 92.0% for PCB, which is consistent with the reports in [43] and is close to the accuracy in [54]. We further assess their influence on LAAM in Table 3. In the following experiment, if not specified, we use IDE as the default pedestrian descriptor due to its good accuracy and easy implementation.

| Features | DukeMTMC-reID rank-1 | DukeMTMC-reID mAP | Market-1501 rank-1 | Market-1501 mAP |
| --- | --- | --- | --- | --- |
| DeepCC [37] | 79.8 | 63.4 | 89.5 | 75.7 |
| MTMC_ReID [54] | 81.9 | N/A | 93.9 | N/A |
| TAREIDMTMC [23] | 81.6 | 72.3 | 87.2 | 76.4 |
| IDE | 79.7 | 62.9 | 87.6 | 72.2 |
| Triplet | 81.3 | 66.4 | 89.3 | 76.3 |
| PCB | 82.9 | 68.6 | 92.0 | 78.2 |

Table 2: Rank-1 accuracy (%) and mAP (%) of re-ID features on the Market-1501 and DukeMTMC-reID datasets. The three features (IDE, Triplet and PCB) we use in this paper have competitive accuracy in re-ID.

5.4 Evaluation of the Proposed Method

In this section, we summarize the results obtained by the proposed LAAM and compare it with its variants and the state-of-the-art methods.

Improvement over the baseline tracker and global metric learning. We first compare our method against the baseline and the global metric. Results are shown in Table 3, Table 4, and Table 5. We have two observations.

First, the global metric learning does not improve over the baseline. For example, on the validation set, compared with the Euclidean distance based baseline, applying global metric on IDE feature changes the IDF1 by -0.5% in SCT and by +0.2% in MCT. Under the same setting, on the easy test set, IDF1 accuracy of the global metric is equal to the baseline in SCT and is +0.3% higher in MCT. On the CityFlow dataset, global metric improves the MCT IDF1 by +0.5%. These results indicate that global metric learning does not bring significant benefits. This is because both the baseline and global metric are trained on the global train set, so their discriminative abilities are very close.

Second, the full LAAM method brings a consistent and non-trivial improvement over the baseline and global metric. On the validation set, for example, the full method “LAAM (intra/inter)” has a +2.4% IDF1 improvement over the baseline on the MCT task using IDE as the feature. On test (hard), our method using IDE exceeds the baseline by +6.9% in terms of MCT IDF1, a significant improvement. On CityFlow, the proposed method improves MCT IDF1 by +6.4%. These results demonstrate the effectiveness and generalization ability of the full method in improving baseline accuracy, thus validating the proposed metric to some extent.

Impact of different re-ID features. Tracking accuracy based on different re-ID features is summarized in Table 3. Under both the SCT and MCT task, we find that the tracking performance of IDE, the triplet feature, and PCB is similar. This finding is consistent with a previous report [37]: improvement in re-ID accuracy can have a diminishing improvement on the MTMCT system. The main reason is that the appearance variation in tracking is much smaller than that in re-ID. For example, in MCT, the gallery in a temporal sliding window might have dozens of images, while that in re-ID has over 10k images. With a much smaller gallery, there is less requirement on feature’s discriminative ability, and PCB would have a similar matching accuracy with IDE. Moreover, MTMCT also has several other components besides feature-based matching. Imperfectness in these components reduces the improvement brought about by the re-ID features.

| Variant | IDE SCT | IDE MCT | triplet SCT | triplet MCT | PCB SCT | PCB MCT |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 86.4 | 81.4 | 86.2 | 80.9 | 85.8 | 80.6 |
| Global metric | 85.9 | 81.6 | 84.1 | 79.7 | 87.4 | 81.6 |
| LAAM (intra/global) | 87.8 | 83.1 | 87.6 | 83.9 | 87.1 | 82.4 |
| LAAM (global/inter) | 86.0 | 81.6 | 84.5 | 79.9 | 87.9 | 82.5 |
| LAAM (intra/intra) | 87.8 | 83.4 | 87.8 | 84.2 | 87.7 | 82.4 |
| LAAM (inter/inter) | 86.9 | 82.5 | 87.4 | 83.9 | 87.5 | 82.5 |
| LAAM (inter/intra) | 86.3 | 81.6 | 85.6 | 82.1 | 87.5 | 82.1 |
| LAAM (intra/inter) | 87.9 | 83.8 | 87.9 | 84.5 | 87.7 | 82.9 |

Table 3: IDF1 accuracy (%) on the DukeMTMC validation set. Three re-ID features are evaluated under various methods.

Comparison with variants and ablation study. We replace the intra-camera metric with the global metric or the inter-camera metric; we also replace the inter-camera metric with the global metric or the intra-camera metric. Results are shown in Table 3 and Table 5.

First, we show both metrics are necessary. In Table 3, when replacing intra-camera metric with the global metric, IDF1 based on the IDE feature drops by 1.9% and 2.2% on SCT and MCT, respectively. A similar but smaller accuracy drop can be observed when the inter-camera metric is replaced with the global metric. The drop is consistent when using different re-ID features. These results show that both the intra-camera and inter-camera metrics are necessary components in our system.

Second, from the ablation studies, the removal of the intra-camera metric causes a larger accuracy drop. A possible reason is that the variance gap between local data association in tracking and global matching in re-ID is bigger in SCT and smaller in MCT. Within a single camera, the appearance variance of a target is very small. Between neighboring camera pairs, the appearance has a larger variance (still smaller than global). In a global sense, the appearance changes are the largest. Since the gap between SCT (local) and re-ID (global) matching is the largest, the intra-camera metric brings a larger improvement.

| Variant | MCT IDF1 | MCT IDP | MCT IDR |
| --- | --- | --- | --- |
| Baseline | 56.6 | 53.3 | 60.7 |
| Global metric | 57.1 | 54.4 | 60.7 |
| LAAM (intra/inter) | 63.0 | 60.7 | 66.0 |

Table 4: CityFlow online test set results. Note that the CityFlow dataset only evaluates multi-camera tracking. The proposed method yields a substantial accuracy increase.

| Tracker | Detector | Easy SCT IDF1 | Easy SCT IDP | Easy SCT IDR | Easy MCT IDF1 | Easy MCT IDP | Easy MCT IDR | Hard SCT IDF1 | Hard SCT IDP | Hard SCT IDR | Hard MCT IDF1 | Hard MCT IDP | Hard MCT IDR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BIPCC [36] | DPM [16] | 70.1 | 83.6 | 60.4 | 56.2 | 67.0 | 48.4 | 64.5 | 81.2 | 53.5 | 47.3 | 59.6 | 39.2 |
| MTMC_CDSC [46] | DPM | 77.0 | 87.6 | 68.6 | 60.0 | 68.3 | 53.5 | 65.5 | 81.4 | 54.7 | 50.9 | 63.2 | 42.6 |
| MYTRACKER [52] | DPM | 80.3 | 87.3 | 74.4 | 65.4 | 71.1 | 60.6 | 63.5 | 73.9 | 55.6 | 50.1 | 58.3 | 43.9 |
| MTMC_ReIDp [54] | DPM | 79.2 | 89.9 | 70.7 | 74.4 | 84.4 | 66.4 | 71.6 | 85.3 | 61.7 | 65.6 | 78.1 | 56.5 |
| TAREIDMTMC [23] | Mask R-CNN [19] | 83.8 | 87.6 | 80.4 | 68.8 | 71.8 | 66.0 | 77.9 | 86.6 | 70.7 | 61.2 | 68.0 | 55.5 |
| DeepCC [37] | OpenPose [6] | 89.2 | 91.7 | 86.7 | 82.0 | 84.4 | 79.8 | 79.0 | 87.4 | 72.0 | 68.5 | 75.9 | 62.4 |
| MTMC_ReID [54] | Faster R-CNN [35] | 89.8 | 92.0 | 87.7 | 83.2 | 85.2 | 81.2 | 81.2 | 89.4 | 74.5 | 74.0 | 81.4 | 67.8 |
| Baseline | OpenPose | 91.3 | 91.8 | 90.9 | 87.4 | 87.8 | 87.0 | 83.7 | 88.8 | 79.1 | 75.4 | 80.0 | 71.3 |
| Global metric | OpenPose | 91.3 | 92.2 | 90.4 | 87.7 | 88.6 | 86.8 | 82.7 | 89.2 | 77.1 | 76.2 | 82.2 | 71.0 |
| LAAM (intra/inter) | OpenPose | 92.5 | 93.0 | 92.0 | 88.6 | 89.0 | 88.1 | 85.8 | 91.1 | 81.1 | 82.3 | 87.4 | 77.8 |

Table 5: DukeMTMC online test set results (%). Methods with are online tracking methods. “LAAM (intra/inter)” refers to the proposed method. On both the easy and hard test sets, our method yields very competitive accuracy.

Third, we show that the two metrics are not interchangeable. In Table 3, when we replace the intra-camera metric with the inter-camera metric, as in “LAAM (inter/inter)”, IDF1 drops by 1.0% and 1.3% on the validation set for SCT and MCT, respectively. Conversely, when we replace the inter-camera metric with the intra-camera metric, as in “LAAM (intra/intra)”, SCT and MCT IDF1 drop by 0.1% and 0.4%, respectively. When we swap the intra-camera metric and the inter-camera metric, IDF1 drops by 1.6% on SCT and 2.2% on MCT. These drops are consistent across different re-ID features. This suggests that the two metrics work best on SCT and MCT, respectively, which is consistent with their respective data sampling methods (Section 4.2).
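
The drops quoted for the IDE feature can be re-derived from the Table 3 values. The short check below copies the IDE numbers as (SCT, MCT) pairs, which assumes the first value of each pair in the table is SCT and the second is MCT; the dictionary and function names are illustrative.

```python
# IDE columns of Table 3 as (SCT IDF1, MCT IDF1), copied from the table.
table3_ide = {
    "global/inter": (86.0, 81.6),
    "intra/intra": (87.8, 83.4),
    "inter/inter": (86.9, 82.5),
    "inter/intra": (86.3, 81.6),
    "intra/inter": (87.9, 83.8),  # full method
}


def drop_from_full(variant):
    """IDF1 drop (SCT, MCT) of a variant relative to the full method."""
    full = table3_ide["intra/inter"]
    return tuple(round(f - v, 1) for f, v in zip(full, table3_ide[variant]))


for name in ("global/inter", "inter/inter", "intra/intra", "inter/intra"):
    print(name, drop_from_full(name))
```

The largest drops occur when the intra-camera metric is the one replaced, in line with the ablation discussion.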

Comparison with the state-of-the-art methods. In Table 5, we compare our method using the IDE features with state-of-the-art methods. We have two observations. First, the baseline tracker is very competitive. The baseline itself surpasses many existing methods like [37, 54]. This demonstrates the effectiveness of our modified tracker and re-ID feature. Second, LAAM further improves over the competitive baseline tracker and achieves new state-of-the-art accuracy on both the easy and hard test sets. On the easy test set, we obtain 92.5% and 88.6% in IDF1 on SCT and MCT, respectively. These numbers are +2.7% and +5.4% higher than the second-best results [54]. On the hard test set, our IDF1 scores are 85.8% and 82.3% on SCT and MCT, respectively. This is +4.6% and +8.3% higher than the second-best method [54]. These comparisons indicate that our method is particularly advantageous in MCT and in challenging scenarios.

Parameter analysis. We assess the impact of the data sampling window lengths in Fig. 7. A short sampling window may significantly reduce the pool of candidate training pairs, leaving the metric more prone to overfitting. On the other hand, a long sampling window no longer enforces locality. From the results, the best within-camera and cross-camera sampling window sizes are and , respectively.

We observe two other phenomena worth noting. First, from Fig. 7, the inter-camera metric designed for MCT improves SCT as well. Our tracker allows targets to return to a camera, so correctly labeling these returning targets improves SCT accuracy. Second, from Fig. 7, the inter-camera metric is inferior to the global metric under short windows. This is because when the cross-camera sampling window is shorter than the camera transition time, there are not enough cross-camera training samples.
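The effect of the window length on the pool of training pairs can be illustrated with a minimal sketch. This is not the sampling code used in our experiments: the `Detection` record, the `sample_local_pairs` helper, the frame-based window on a clock shared by all cameras, and the toy data are assumptions made only for illustration.

```python
import random
from collections import namedtuple

# Hypothetical record for one detection: identity, camera, and frame index.
# Frame indices are assumed to share one clock across cameras.
Detection = namedtuple("Detection", ["pid", "cam", "frame"])


def sample_local_pairs(detections, window, cross_camera, num_pairs, seed=0):
    """Sample (a, b, same_identity) pairs whose frame gap is at most `window`.

    cross_camera=False keeps pairs from the same camera (intra-camera style);
    cross_camera=True keeps pairs from different cameras (inter-camera style).
    """
    rng = random.Random(seed)
    candidates = []
    for i, a in enumerate(detections):
        for b in detections[i + 1:]:
            if abs(a.frame - b.frame) > window:
                continue  # outside the local neighborhood
            if (a.cam != b.cam) != cross_camera:
                continue  # wrong camera relation for this metric
            candidates.append((a, b, a.pid == b.pid))
    rng.shuffle(candidates)
    return candidates[:num_pairs]


# Toy data: person 1 walks from camera 1 to camera 2 in about 500 frames.
demo = [
    Detection(pid=1, cam=1, frame=0),
    Detection(pid=1, cam=1, frame=10),
    Detection(pid=2, cam=1, frame=5),
    Detection(pid=1, cam=2, frame=500),
]

short_cross = sample_local_pairs(demo, window=50, cross_camera=True, num_pairs=10)
long_cross = sample_local_pairs(demo, window=600, cross_camera=True, num_pairs=10)
intra = sample_local_pairs(demo, window=50, cross_camera=False, num_pairs=10)
print(len(short_cross), len(long_cross), len(intra))
```

On this toy data, a 50-frame cross-camera window produces no pairs at all, because the only camera transition takes about 500 frames, whereas a 600-frame window produces three. The same 50-frame window is already enough to yield three within-camera pairs.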

Figure 7: Influence of data sampling window length of LAAM on SCT and MCT. The dashed line is the accuracy of the global metric. IDF1 on the validation set is reported.

Computation complexity. The metric network takes 20 minutes to train on a server with a GTX 1080 Ti GPU. During testing, the CNN features are extracted on the GPU, and the tracker, including the metric similarity score, runs on a 3.2 GHz Intel Xeon CPU. The metric network stays on the CPU because, in practice, frequently calling the GPU for the 3-layer network takes more time. In testing, creating tracklets takes 912 seconds. Computing single camera trajectories takes 447 seconds in the baseline and 520 seconds in LAAM. Computing cross-camera tracks takes 105 seconds in both the baseline and our method. Overall, the baseline spends 1,464 seconds in testing, whereas LAAM spends 1,537 seconds. Our method adds about 5% to the testing time, which is acceptable.
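The totals and the relative overhead follow directly from the per-stage timings above. The short check below only re-adds the reported numbers; the variable names are illustrative.

```python
# Stage timings in seconds, as reported in the text.
tracklet_time = 912
sct_time = {"baseline": 447, "LAAM": 520}
mct_time = 105  # identical for the baseline and LAAM

# Total testing time per method: tracklets + SCT + MCT.
totals = {name: tracklet_time + t + mct_time for name, t in sct_time.items()}

# Relative increase in testing time caused by LAAM.
overhead = totals["LAAM"] / totals["baseline"] - 1.0
print(totals, f"{overhead:.1%}")
```

The sums reproduce the 1,464 and 1,537 seconds stated above, and the 73-second difference, all of it in the SCT stage, corresponds to roughly 5% of the baseline testing time.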

In MTMCT, better similarity estimation usually makes data association easier. Therefore, although LAAM consumes more time in similarity estimation, it saves time in data association by providing more accurate similarity scores. In SCT, data association is relatively easy, and most of the time is spent in similarity computation. In MCT, data association is more difficult and dominates the computation time. As a result, compared with the baseline, our method is slower in SCT and takes a similar amount of time in MCT.

6 Conclusion

This paper offers new insight into the inherent difference between re-ID and MTMCT: re-ID is a global matching problem, while MTMCT is based on local matching. This difference compromises the effectiveness of directly applying global re-ID appearance features to local matching in MTMCT. This paper investigates how to effectively fit global appearance features into local matching in tracking. To this end, we propose the locality aware appearance metric (LAAM), which uses a novel training data sampling strategy. Given globally learned re-ID features, pairs of training data are sampled from their local neighborhood. For single camera tracking (SCT), the local neighborhood refers to consecutive frames within a single camera; for multi-camera tracking (MCT), it refers to neighboring cameras in which a target may appear successively. On two MTMCT datasets, we show that LAAM leads to significant improvements over the baseline, and we report new state-of-the-art tracking accuracy on DukeMTMC.


  • [1] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis (2018) Looking beyond appearances: synthetic training data for deep cnns in re-identification. Computer Vision and Image Understanding 167, pp. 50–62. Cited by: §2.
  • [2] A. Bellet, A. Habrard, and M. Sebban (2013) A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709. Cited by: §2.
  • [3] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua (2011) Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence 33 (9), pp. 1806–1819. Cited by: §2.
  • [4] W. Brendel, M. Amer, and S. Todorovic (2011) Multiobject tracking as maximum weight independent set. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1273–1280. Cited by: §2.
  • [5] Y. Cai and G. Medioni (2014) Exploring context information for inter-camera multiple target tracking. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pp. 761–768. Cited by: §2.
  • [6] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008. Cited by: §3, Table 5.
  • [7] V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic (2015) On pairwise costs for network flow multi-object tracking.. In CVPR, Vol. 20, pp. 15. Cited by: §2.
  • [8] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1335–1344. Cited by: §2.
  • [9] W. Choi (2015) Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE international conference on computer vision, pp. 3029–3037. Cited by: §2.
  • [10] R. T. Collins (2012) Multitarget data association with higher-order motion models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1744–1751. Cited by: §2.
  • [11] A. Das, A. Chakraborty, and A. K. Roy-Chowdhury (2014) Consistent re-identification in a camera network. In European Conference on Computer Vision, pp. 330–345. Cited by: §2.
  • [12] A. Dehghan, S. Modiri Assari, and M. Shah (2015) Gmmcp tracker: globally optimal generalized maximum multi clique problem for multiple object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4091–4099. Cited by: §2.
  • [13] A. Dehghan, Y. Tian, P. H. Torr, and M. Shah (2015) Target identity-aware network flow for online multiple target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1146–1154. Cited by: §2.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §5.2.
  • [15] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle (2016) Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In European Conference on Computer Vision, pp. 774–790. Cited by: §2.
  • [16] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: Table 5.
  • [17] M. W. Gardner and S. Dorling (1998) Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment 32 (14-15), pp. 2627–2636. Cited by: §4.1.
  • [18] S. Hamid Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid (2015) Joint probabilistic data association revisited. In Proceedings of the IEEE international conference on computer vision, pp. 3047–3055. Cited by: §2.
  • [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: Table 5.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.2.
  • [21] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §1, §2, §5.2.
  • [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §5.2.
  • [23] N. Jiang, S. Bai, Y. Xu, C. Xing, Z. Zhou, and W. Wu (2018) Online inter-camera trajectory association exploiting person re-identification and camera topology. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 1457–1465. Cited by: §2, §2, Table 2, Table 5.
  • [24] S. Joo and R. Chellappa (2007) A multiple-hypothesis approach for multiobject visual tracking. IEEE Transactions on Image Processing 16 (11), pp. 2849–2854. Cited by: §2.
  • [25] R. Kumar, G. Charpiat, and M. Thonnat (2014) Multiple object tracking by efficient graph partitioning. In Asian Conference on Computer Vision, pp. 445–460. Cited by: §2.
  • [26] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler (2016) Learning by tracking: siamese cnn for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–40. Cited by: §2.
  • [27] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015) Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942. Cited by: §2.
  • [28] L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth (2017) Tracking the trackers: an analysis of the state of the art in multiple object tracking. arXiv preprint arXiv:1704.02781. Cited by: §2.
  • [29] B. Leibe, K. Schindler, and L. Van Gool (2007) Coupled detection and trajectory estimation for multi-object tracking. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1–8. Cited by: §2.
  • [30] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan (2017) End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing 26 (7), pp. 3492–3506. Cited by: §2.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §3.
  • [32] A. Maksai, X. Wang, F. Fleuret, and P. Fua (2017) Non-markovian globally consistent multi-object tracking. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2563–2573. Cited by: §2.
  • [33] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §2.
  • [34] A. Milan, S. Roth, and K. Schindler (2014) Continuous energy minimization for multitarget tracking.. IEEE Trans. Pattern Anal. Mach. Intell. 36 (1), pp. 58–72. Cited by: §2.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: Table 5.
  • [36] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pp. 17–35. Cited by: §1, §1, §2, §5.1, §5.1, Table 5.
  • [37] E. Ristani and C. Tomasi (2018) Features for multi-target multi-camera tracking and re-identification. arXiv preprint arXiv:1803.10859. Cited by: §1, §1, §1, §1, §2, §2, §3, §3, §5.2, §5.4, §5.4, Table 2, Table 5.
  • [38] A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 300–311. Cited by: §2.
  • [39] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2.
  • [40] L. Shine, A. Edison, and C. Jiji (2019) A comparative study of faster r-cnn models for anomaly detection in 2019 ai city challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 306–314. Cited by: §3.
  • [41] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua (2014) Multi-commodity network flow for tracking multiple people. IEEE transactions on pattern analysis and machine intelligence 36 (8), pp. 1614–1627. Cited by: §2.
  • [42] V. K. Singh, B. Wu, and R. Nevatia (2008) Pedestrian tracking by associating tracklets using detection residuals. In Motion and video Computing, 2008. WMVC 2008. IEEE Workshop on, pp. 1–8. Cited by: §2.
  • [43] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, Cited by: §1, §2, §5.2, §5.3.
  • [44] Z. Tang, M. Naphade, M. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J. Hwang (2019) Cityflow: a city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8797–8806. Cited by: §1, §5.1.
  • [45] Z. Tang, G. Wang, H. Xiao, A. Zheng, and J. Hwang (2018) Single-camera and inter-camera vehicle tracking and 3d speed estimation based on fusion of visual and semantic features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 108–115. Cited by: §2.
  • [46] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah (2017) Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv preprint arXiv:1706.06196. Cited by: §1, §2, Table 5.
  • [47] M. Thoreau and N. Kottege (2018) Improving online multiple object tracking with deep metric learning. arXiv preprint arXiv:1806.07592. Cited by: §2.
  • [48] R. R. Varior, M. Haloi, and G. Wang (2016) Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision, pp. 791–808. Cited by: §2.
  • [49] B. Wang, G. Wang, K. Luk Chan, and L. Wang (2014) Tracklet association with online target-specific metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1234–1241. Cited by: §2.
  • [50] X. Wang, E. Türetken, F. Fleuret, and P. Fua (2016) Tracking interacting objects using intertwined flows. IEEE transactions on pattern analysis and machine intelligence 38 (EPFL-ARTICLE-210040), pp. 2312–2326. Cited by: §2.
  • [51] J. Xiang, G. Zhang, J. Hou, N. Sang, and R. Huang (2018) Multiple target tracking by learning feature representation and distance metric jointly. arXiv preprint arXiv:1802.03252. Cited by: §2.
  • [52] K. Yoon, Y. Song, and M. Jeon (2018) Multiple hypothesis tracking algorithm for multi-target multi-camera tracking with disjoint views. IET Image Processing 12 (7), pp. 1175–1184. Cited by: §2, Table 5.
  • [53] S. Yu, D. Meng, W. Zuo, and A. Hauptmann (2016) The solution path algorithm for identity-aware multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3871–3879. Cited by: §2.
  • [54] Z. Zhang, J. Wu, X. Zhang, and C. Zhang (2017) Multi-target, multi-camera tracking by hierarchical clustering: recent progress on dukemtmc project. arXiv preprint arXiv:1712.09531. Cited by: §1, §2, §2, §5.3, §5.4, Table 2, Table 5.
  • [55] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124. Cited by: §1, §2, §5.1, §5.1, §5.2, §5.3.
  • [56] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: §1, §2, §2.
  • [57] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §5.1, §5.3.
  • [58] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896. Cited by: §2, §5.2.