CAN: Composite Appearance Network and a Novel Evaluation Metric for Person Tracking

by Neeti Narayan, et al.
University at Buffalo

Tracking multiple people across multiple cameras is an open problem. It is typically divided into two tasks: (i) single-camera tracking (SCT), identifying trajectories within the same scene, and (ii) inter-camera tracking (ICT), identifying trajectories across cameras in real surveillance scenes. Many existing methods cater to single-camera person tracking, while inter-camera tracking remains a challenge. In this paper, we propose a tracking method that uses motion cues and a feature aggregation network for template-based person re-identification, incorporating metadata such as person bounding boxes and camera information. We present an architecture called the Composite Appearance Network (CAN) to address this problem. Its key component is a network called EvalNet, which attends to each feature vector independently and learns to weight them, based on gradients it receives for the overall template, for optimal re-identification performance. We demonstrate the effectiveness of our approach with experiments on the challenging, large-scale multi-camera tracking dataset DukeMTMC, comparing results to the baseline approach. We also survey existing tracking measures and present an online error metric called "Inference Error" (IE) that provides a better estimate of tracking/re-identification error by treating within-camera and inter-camera errors uniformly.




1 Introduction

Person tracking is one of the fundamental problems in computer vision. Determining multi-camera person trajectories enables applications such as video surveillance, sports analysis, security, behavior analysis and anomaly detection. Maintaining the correct trajectory of a person across cameras is difficult because of occlusion, illumination change, background clutter, and blind spots due to non-overlapping cameras, which are often placed far apart to reduce costs. Methods like [11] focus on target tracking using low-level handcrafted features. Although they achieve good tracking performance, they are inefficient at overcoming the obstacles mentioned above and are limited to scenarios where targets are known a priori. Some methods model people's entry and exit points on the ground or image plane explicitly [3, 7]. A few others [2, 26] exploit data fusion from partially or completely overlapping views as a pre-processing step.

Person re-identification, which estimates the similarity of person images from data comprised of only image pairs (one probe and one gallery), is relatively straightforward with metric learning methods [13, 25] and feature representation learning [12]. However, in unconstrained tracking (Figure 1), a person tracklet contains multiple images and therefore requires a way to fuse features/attributes into a single feature vector representative of the tracklet. We design a spatio-temporal attention model, the Composite Appearance Network (CAN), in which the network looks at each individual feature vector of a trajectory in the gallery and predicts how important it is to the final representative feature vector, incorporating metadata such as the person bounding box and camera information. We also discuss the Inference Error (IE) evaluation criterion for measuring the performance of online person tracking.

The main contributions of this paper include:

  • Finding the optimization function that best exploits the variance among appearance features (extracted from a CNN or any other embedding system) using additional metadata, yielding optimal aggregation weights for pooling.

  • A study of the performance measures of a tracking system, demonstrating how they differ when evaluating/characterizing online and real-time tracking performance. We formally introduce the "Inference Error" metric and emphasize de-duplication.

2 Related Work

Multi-person pose estimation in videos has been studied in [10]; pose estimation and tracking are performed jointly, but the approach is limited to single-camera videos. For multi-camera object tracking, input detections can be regarded as nodes of a graph, with edges between detections weighted by similarity [4]. When merging tracklets, trackers depend on the feature extraction/representation technique to avoid high false negatives. For example, some trackers adopt LOMO [13] appearance features and the Hankel-matrix-based IHTLS [6] algorithm for motion features. A hard mining scheme and an adaptive weighted triplet loss are proposed in [27] to learn person appearance features from individual detections.

In [21], an online method for tracking is proposed using Long Short-Term Memory (LSTM) networks and multiple cues such as appearance, motion, and interaction. However, this solution is for single-camera tracking, and there is no evidence of how well the system scales to a multi-camera environment. In [17], the authors address the multi-person multi-camera tracking problem by reformulating it as a pure re-identification task. The objective is to minimize misassociations at every timestep, and to this end the authors propose a new evaluation metric called the "Inference Error" (IE). The method, however, does not encode long-term spatial and temporal dependencies, which are critical for correcting data-association errors or recovering from an occluded state. We discuss in detail the advantages of using IE and how it adapts to both online and offline tracking systems. An online tracking method extending [17] is presented in [18]: an LSTM-based space-time tracker is developed that uses the history of locations and visual features to learn from past association errors and make better associations in the future. However, the model is not trained end-to-end, and the LSTM module needs to be retrained on every new dataset to learn the space-time continuum of the camera network.

Fig. 1: Example of unconstrained tracking. Find matching trajectory for probe image.

In [1], local features such as Fisher vectors are compared with deep features for aggregation. Traditional hand-crafted features have different distributions of pairwise similarities, which requires careful evaluation of aggregation methods. In [29], a temporal attention network (TAN) is proposed for multi-object tracking; it adaptively allocates different degrees of attention to different observations and filters out unreliable samples in a trajectory. Recently, feature pooling has been explored to assess the quality of facial images in a set, sometimes relying on carefully defined "weighting functions" to produce intelligent weights. Neural Aggregation Networks (NAN) [24] fuse face features with a set of content-adaptive weights using a cascaded attention mechanism to produce a compact representation. An online unsupervised method for face identity learning from unconstrained video streams is proposed in [28] by coupling CNN-based face detection and descriptors with a memory-based learning mechanism.

In video-based person re-identification, researchers explore temporal information related to person motion. In [23], features are extracted by combining color, optical-flow information, recurrent layers and temporal pooling in a Siamese network architecture. The Quality Aware Network (QAN) [14] learns the concept of quality for each sample in a set, using a quality generation unit and a feature generation part for set-to-set recognition, and learns the metric between two image sets. However, most of these methods learn each trajectory's representation separately and invariably rely on the individual features of person detection sequences, without considering the influence of the trajectory being associated with it. Also, many use Recurrent Neural Networks (RNNs) to handle sequential input and output; these can be avoided by borrowing their attention mechanism/differentiable memory handling into a simple feature aggregation framework. Hence, in order to draw different attention when associating different pairs of trajectories, we propose to use orthogonal attributes such as metadata, which are learned using different target objectives.

The framework proposed in this paper is inspired by [22], where the authors propose a set-based feature aggregation network (FAN) for face verification. By generating representative template features using metadata such as yaw, pitch and face size, their system outperforms traditional pooling approaches. Instead of building a Siamese network, we focus on learning the pooled feature for the gallery template (trajectory) while coupling the probe metadata with the gallery metadata. This way, we derive each trajectory's representation by considering the impact of the probe detection in the context of data association.

3 Our Approach

The input to the multi-person multi-camera tracking (MPMCT) system is a set of videos $\{V_1, \ldots, V_C\}$, where $C$ is the number of cameras. The output is a set of trajectories across cameras. Here, we take the tracking solution a step further by proposing a simple feature aggregation approach that can distinguish identical pairs of features from non-identical ones, maintaining a margin between within-identity and between-identity distances.

3.1 Preliminary Appearance Cues

Given frame-level person detections in a single camera or across multiple cameras over a time period, we extract the set of appearance feature maps using DenseNet [9]. We train DenseNet on a portion of the training set (called trainval) of the DukeMTMC dataset and extract features from the last dense layer. Training uses stochastic gradient descent with momentum 0.9, dropping the learning rate by a factor of 10 as the number of remaining epochs halves.

3.2 CAN Architecture

The Composite Appearance Network (CAN) is shown in Figure 2. The main component of CAN is EvalNet, which evaluates the quality of every feature vector (FV) as a part of the template (tracklet) by looking at the appearance attribute and metadata simultaneously, and outputs a weight denoting the "importance" of that FV.

The terms "gallery" and "probe" in the tracking scenario refer, respectively, to tracklets identified dynamically as time progresses and to a new detection to be associated with an existing tracklet. The gallery FV list, one input to the network, comprises the appearance attributes corresponding to a trajectory in the gallery. The gallery metadata list is the other input. Metadata for a person detection includes bounding-box information as well as external details such as the camera number. The probe metadata is concatenated with the gallery metadata, which in turn constitutes the gallery metadata list; in this way, the probe's influence is also taken into account when learning aggregation weights. The probe FV is the probe detection's appearance attribute. Consider $N$ detections in a trajectory with corresponding appearance attributes of length $d$. Then the gallery FV list has shape $N \times d$, the probe FV has shape $1 \times d$, and the metadata list has shape $N \times m$, where $m$ is the length of each concatenated metadata vector.

EvalNet is a fully connected network (FCN) with several fully connected layers, batch normalization and ReLU activations. This FCN block produces weights $w_i$ such that $\sum_{i=1}^{N} w_i = 1$, where $i$ indexes the feature vectors and $N$ is the number of detections belonging to a tracklet. These weight predictions are then applied to the corresponding appearance features to obtain the aggregated appearance feature for every tracklet. The similarity between the probe and the aggregated FV is computed using cosine similarity. The cosine similarity loss layer and the softmax loss layer are optimized against the match label and the class label, respectively. During real-time tracking, we do not use the entire CAN; instead, we use only EvalNet to obtain the aggregated FV given the appearance attributes and metadata for every tracklet.
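As a concrete illustration, the weight-and-pool step above can be sketched in a few lines of NumPy. The toy score vector below stands in for the learned EvalNet FCN (which would consume features and metadata); all names and values are hypothetical.

```python
import numpy as np

def aggregate_template(gallery_fvs, weights):
    """Weighted aggregation of gallery feature vectors into one template FV.

    gallery_fvs: (N, d) appearance features for one trajectory.
    weights:     (N,) non-negative weights summing to 1 (EvalNet's output).
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "EvalNet weights must sum to 1"
    return weights @ gallery_fvs  # (d,)

def cosine_similarity(a, b):
    """Association score between the aggregated template and the probe FV."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: 3 detections with 4-dim features; a softmax over toy quality
# scores stands in for the learned FCN that consumes features + metadata.
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])  # last row: noisy/occluded detection
scores = np.array([2.0, 2.0, -2.0])         # hypothetical EvalNet outputs
w = np.exp(scores) / np.exp(scores).sum()   # normalize so weights sum to 1
template = aggregate_template(gallery, w)
probe = np.array([1.0, 0.05, 0.0, 0.0])
sim = cosine_similarity(template, probe)
```

Because the noisy third detection receives a near-zero weight, the template stays close to the probe's appearance, which is the intended effect of the learned weighting.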

Fig. 2: Composite Appearance Network architecture

3.3 CAN Training

Let $x_i$ and $m_i$ be the $i$-th gallery FV and its corresponding metadata vector in a trajectory. If $\bar{x}$ denotes the final composite or aggregated appearance FV, it can be defined as:

$$\bar{x} = \sum_{i=1}^{N} w_i \, x_i, \qquad w_i = \phi(x_i, m_i; \theta),$$

where $\phi$ is the EvalNet function on the gallery FV and metadata, parameterized by $\theta$, predicting a weight $w_i$ that evaluates the relative importance of the $i$-th feature vector using its accompanying metadata, and $N$ is the number of feature vectors/detections in the trajectory. We use orthogonal information, namely probe metadata alongside gallery metadata, to yield additional context that helps discriminate appearance features of an identity.


If $p$ is the probe FV, the mean-squared loss is defined over the cosine similarity $s(p, \bar{x})$ as:

$$L_{mse} = \left( y - s(p, \bar{x}) \right)^2,$$

and the categorical cross-entropy loss is defined as:

$$L_{ce} = -\sum_{c=1}^{K} t_c \log \hat{t}_c,$$

where $y$ is the ground-truth match label, $K$ is the number of classes (unique identities), $t_c$ is the $c$-th entry of the one-hot encoded ground-truth class label, and $\hat{t}_c$ is the normalized probability assigned to that class. Given gallery-probe vectors, the objective of CAN is to find the optimal parameters $\theta^{*}$ that minimize the following cost function:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N_p N_g} \sum_{i=1}^{N_p} \sum_{j=1}^{N_g} \left( L_{mse}^{(i,j)} + L_{ce}^{(i,j)} \right),$$

where $N_p$ is the number of examples in the probe set and $N_g$ is the number of examples in the gallery set.

Batch processing: Since each trajectory may comprise a variable number of person detections, CAN would otherwise have to be trained on a single gallery trajectory per iteration, as making batches of trajectories would be difficult. We work around this problem by coupling an additional input, the trajectory index $k$, which maps each person (features and metadata) to the corresponding unique identity in the batch. This allows us to gather multiple sets of trajectories as a batch, enabling CAN to converge faster. Each aggregated trajectory representation is computed using only the corresponding features as indexed by $k$.
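The index-based batching above amounts to a segment-wise weighted aggregation. The NumPy sketch below is illustrative rather than the paper's implementation: `np.add.at` accumulates each detection's weighted feature into the row selected by its trajectory index, then renormalizes per trajectory.

```python
import numpy as np

def batched_aggregate(fvs, weights, traj_idx, num_traj):
    """Aggregate a batch of detections into per-trajectory template FVs.

    fvs:      (B, d) features from several trajectories mixed in one batch.
    weights:  (B,) per-detection weights (EvalNet outputs, not yet normalized).
    traj_idx: (B,) index k mapping each detection to its trajectory.
    """
    d = fvs.shape[1]
    templates = np.zeros((num_traj, d))
    norm = np.zeros(num_traj)
    # Unbuffered scatter-add: detections of the same trajectory accumulate.
    np.add.at(templates, traj_idx, weights[:, None] * fvs)
    np.add.at(norm, traj_idx, weights)
    return templates / norm[:, None]  # weights renormalized per trajectory

# Two trajectories mixed in one batch of three detections.
fvs = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
weights = np.array([1.0, 1.0, 2.0])
traj_idx = np.array([0, 0, 1])
templates = batched_aggregate(fvs, weights, traj_idx, num_traj=2)
```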

3.4 Tracking

We also include the motion of each individual in our framework for within-camera tracking. This helps address one key challenge, handling noisy detections, because an individual's velocity can appear non-linear due to occlusion or detection noise. We propose a simple method to aggregate detections into single-camera trajectories and then merge these trajectories using the multi-camera CAN framework.

3.4.1 Single-camera Motion and Appearance Correlation

Single-camera trajectories are computed online by merging detections of adjacent frames in overlapping sliding temporal windows. Let $d_p$ denote a previous detection and $d_c$ a current detection at timestep $t$. If two bounding boxes detected in neighboring windows are likely to belong to one target, the factors that affect this likelihood are (i) appearance similarity $A(d_p, d_c)$ and (ii) location closeness $L(d_p, d_c)$, combined as:

$$\Lambda(d_p, d_c) = A(d_p, d_c) \cdot L(d_p, d_c),$$

where $A(d_p, d_c)$ is the appearance similarity between the two detections, given by the cosine-distance formulation, and $L(d_p, d_c)$ measures how close the two bounding boxes are in the image plane.

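A minimal sketch of this correlation step, assuming the product form above and, hypothetically, intersection-over-union as the location-closeness term (the exact formulation in our system may differ):

```python
import numpy as np

def appearance_similarity(f_prev, f_curr):
    """Cosine similarity between the appearance features of two detections."""
    return float(f_prev @ f_curr /
                 (np.linalg.norm(f_prev) * np.linalg.norm(f_curr)))

def location_closeness(box_prev, box_curr):
    """Hypothetical closeness term: IoU of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_prev[0], box_curr[0]); y1 = max(box_prev[1], box_curr[1])
    x2 = min(box_prev[2], box_curr[2]); y2 = min(box_prev[3], box_curr[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_prev) + area(box_curr) - inter)

def link_likelihood(f_prev, f_curr, box_prev, box_curr):
    """Product of appearance and location terms, as in Section 3.4.1."""
    return appearance_similarity(f_prev, f_curr) * \
           location_closeness(box_prev, box_curr)
```

Distant boxes drive the location term (and hence the product) to zero, so only nearby, similar-looking detections get linked.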
The output of this step will include a set of single-camera trajectories from all cameras. The number of trajectories is not a constant and varies from one camera to another.

3.4.2 Multi-camera

Once we have associated detections to tracklets or trajectories, we use CAN as the next step for tracking. Tracking in a multi-camera environment is challenging owing to variations in lighting conditions, changes in human posture, and the presence of occlusions or blind spots. To handle all of these challenges in a principled way, we use CAN to weaken the noisy features corresponding to a trajectory and narrow the variances for an identity, thus making the trajectory representation more discriminative.

The input during testing is the set of trajectories, and the output of this step is an association matrix whose scores are determined using CAN. For every pair of trajectories, the appearance feature maps and metadata sequences of one are treated as the gallery FV list and gallery metadata, respectively, while the other forms the probe FV. Probe metadata is appended to each occurrence of gallery metadata. Using EvalNet, we obtain the aggregated gallery FV that represents the gallery trajectory. The association score between the aggregated FV and probe FV is determined using cosine similarity. In this manner, we perform pairwise trajectory feature similarity comparisons. Final matching/associations are based on a simple greedy algorithm [17], equivalent to sequentially selecting the largest association score. This controls the proliferation of fragmented identities, i.e., a subject is associated with only one identity.
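The greedy association over the score matrix can be sketched as follows; the 0.5 threshold is a hypothetical cut-off for declaring "no association".

```python
import numpy as np

def greedy_associate(scores, threshold=0.5):
    """Greedy matching on a trajectory association matrix [17]:
    repeatedly pick the largest remaining score above threshold and commit
    that (gallery, probe) pair, so each trajectory is matched at most once."""
    scores = scores.astype(float).copy()
    matches = []
    while True:
        g, p = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[g, p] < threshold:
            break
        matches.append((int(g), int(p)))
        scores[g, :] = -np.inf  # gallery trajectory g is now taken
        scores[:, p] = -np.inf  # probe trajectory p is now taken
    return matches

# Toy association matrix: rows = gallery trajectories, cols = probes.
S = np.array([[0.9, 0.2],
              [0.8, 0.7]])
# 0.9 links pair (0, 0) first; with row 0 and column 0 removed,
# the best remaining score 0.7 links pair (1, 1).
```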

4 Experiments and Results

4.1 Performance Measures

In this section, we survey existing tracking measures and discuss their use cases. Two criteria are crucial to an MPMCT system: (i) the single-camera (within-camera) tracking module and (ii) the inter-camera (handover) tracking module. Some metrics evaluate the two separately; others handle both simultaneously. Based on the application, measures can also be classified as (i) event-based, which count how often a tracker makes mistakes and determine where and why mistakes occur, and (ii) identity-based, which evaluate how well computed identities conform to true identities.

4.1.1 Multi Object Tracking Accuracy (MOTA)

MOTA is typically used to evaluate single-camera, multi-person tracking performance. It is defined as:

$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{Frag}}{T},$$

where FN is the number of false negatives (true targets missed by the tracker), FP is the number of false positives (associations with no counterpart in the ground truth), Frag is the number of identity switches, and T is the number of detections. However, MOTA penalizes detection errors, under-reports across-camera errors [4], and can be negative due to false positives.
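Computing MOTA from the error counts is a one-liner; the second example below shows how heavy false-positive and false-negative counts push the score negative.

```python
def mota(fn, fp, frag, total_dets):
    """MOTA = 1 - (FN + FP + Frag) / T.

    fn, fp, frag: false negatives, false positives, identity switches.
    total_dets:   T, the number of detections.
    Can go negative when the total error count exceeds T.
    """
    return 1.0 - (fn + fp + frag) / total_dets

# 20 errors over 100 detections -> MOTA of 0.8.
good = mota(fn=10, fp=5, frag=5, total_dets=100)
# 160 errors over 100 detections -> negative MOTA.
bad = mota(fn=80, fp=80, frag=0, total_dets=100)
```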

4.1.2 Identity-based Metrics

We evaluate our framework on DukeMTMC using the measures proposed in [19]: Identification F-measure (IDF1), Identification Precision (IDP) and Identification Recall (IDR). Ristani et al. [19] propose to measure performance by how long a tracker correctly identifies targets, since some users are most interested in how well they can determine who is where at all times. Critical aspects such as where or why mismatches occur are disregarded. IDP is the fraction of computed detections that are correctly identified; IDR is the fraction of ground-truth detections that are correctly identified; IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections.
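Given identity-level true-positive, false-positive and false-negative counts (IDTP, IDFP, IDFN in the notation of [19]), the three definitions above can be computed directly; the counts in the example are hypothetical.

```python
def id_scores(idtp, idfp, idfn):
    """Identity-based scores [19] from identity-level TP/FP/FN counts."""
    idp = idtp / (idtp + idfp)                # precision over computed detections
    idr = idtp / (idtp + idfn)                # recall over ground-truth detections
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)  # harmonic mean of IDP and IDR
    return idf1, idp, idr

# 80 correctly identified detections, 20 spurious, 20 missed.
idf1, idp, idr = id_scores(idtp=80, idfp=20, idfn=20)
```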

4.1.3 Multi-camera Object Tracking Accuracy (MCTA)

Another existing multi-camera evaluation metric is Multi-Camera object Tracking Accuracy (MCTA) [4]. Here, both SCT and ICT are considered equally important in the final performance measurement, so all aspects of system performance are condensed into one measure:

$$\mathrm{MCTA} = \frac{2PR}{P+R} \left( 1 - \frac{\sum_t mme_t}{\sum_t tp_t} \right) \left( 1 - \frac{\sum_t mme_t^{h}}{\sum_t tp_t^{h}} \right),$$

where the first term, the $F_1$-score, measures detection power (the harmonic mean of precision $P$ and recall $R$); the second term penalizes within-camera mismatches ($mme_t$, the ID switches), normalized by true-positive detections ($tp_t$); and the final term penalizes inter-camera mismatches ($mme_t^{h}$), normalized by true inter-camera detections ($tp_t^{h}$). MCTA combines single-camera mismatches, inter-camera mismatches, false positives and false negatives through a product instead of a sum. Consequently, the error in each term is multiplied by the product of the other two terms, which can drastically change the performance measure.
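The product structure of MCTA can be sketched directly from the formula above; note how an error in any single factor scales the entire score.

```python
def mcta(precision, recall, mme_sct, tp_sct, mme_ict, tp_ict):
    """MCTA [4]: detection F1 times within- and inter-camera handover terms.

    mme_sct / tp_sct: within-camera mismatches over true-positive detections.
    mme_ict / tp_ict: inter-camera mismatches over true handover detections.
    """
    f1 = 2 * precision * recall / (precision + recall)
    return f1 * (1 - mme_sct / tp_sct) * (1 - mme_ict / tp_ict)

perfect = mcta(1.0, 1.0, mme_sct=0, tp_sct=10, mme_ict=0, tp_ict=10)
# One mismatch of each kind plus imperfect detection multiply together:
degraded = mcta(0.8, 0.8, mme_sct=1, tp_sct=10, mme_ict=1, tp_ict=10)
```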

Fig. 3: Duplication effect

4.1.4 Inference Error

Traditional biometric measures [15] such as FMR (False Match Rate) and FNMR (False Non-match Rate) assume that the occurrence of an error is a static event that cannot impact future associations. However, in a tracking-by-re-identification system, the reference gallery evolves dynamically as new tracks are created (following "no association" outcomes) or existing tracks are updated (following "association" outcomes).

For this reason, we propose the "Inference Error" measure for evaluating tracking by continuous re-identification, defined as:

$$\mathrm{IE} = \frac{1}{T} \sum_{t=1}^{T} \frac{m_t}{n_t},$$

where $m_t$ is the number of misassociations at time $t$ and $n_t$ is the number of current detections. The inference error is a real-time evaluation paradigm that measures how often a target is incorrectly associated, normalized by the number of detections. It handles multiple simultaneous observations at any given time, as well as identities reappearing in the same camera or in different cameras. In [5], similar metrics, false dynamic match (FDM) and false dynamic non-match (FDNM), are proposed for biometric re-identification systems. We do not claim that one measure is better than another, but only suggest that different error metrics suit different applications.
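A short sketch of the definition above, assuming per-timestep normalization followed by averaging over timesteps; the counts are hypothetical.

```python
def inference_error(misassociations, detections):
    """Inference Error: mean per-timestep fraction of misassociated targets.

    misassociations[t]: number of wrong associations m_t at timestep t.
    detections[t]:      number of current detections n_t at timestep t.
    """
    ratios = [m / n for m, n in zip(misassociations, detections) if n > 0]
    return sum(ratios) / len(ratios)

# Two timesteps with 4 detections each; one misassociation at t = 0.
ie = inference_error(misassociations=[1, 0], detections=[4, 4])
```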

The main intuition behind counting the number of misassociations at every timestep is to discourage the creation of duplicate trajectories for an identity. Consider one true identity as illustrated in Figure 3, with time along the horizontal direction. In case (i), a tracker mistakenly identifies two trajectories for the same person, while in case (ii) three trajectories are mistakenly identified for the same person. If the spurious trajectories cover comparable fractions of the true path, an identification measure charges a similar cost to both cases, whereas inference error assigns a higher penalty to case (ii) owing to the additional fragmentation. In a real-time, large-scale tracking scenario with multiple simultaneous observations across multiple cameras, system designers and security professionals will want a tracker that maintains the identity of a person and updates the reference gallery effectively without letting it grow unboundedly. Identifying where and when trajectories are broken is therefore essential for de-duplication. De-duplication is analogous to the tracking framework, with the advantage of better describing duplication errors, i.e., instances of incorrect "non-match" outcomes that produce duplicate records for what should remain a single identity.

4.2 Experiments on DukeMTMC

4.2.1 DukeMTMC Dataset

The DukeMTMC dataset consists of more than 2,700 identities, with over 2 million frames of 1080p, 60 fps video, approximately 85 minutes of video for each of its 8 cameras [19]. We report results across all cameras to demonstrate the effectiveness of our approach for MPMCT. We use the ground truth available for the training set (called trainval) of DukeMTMC for evaluation. Only a portion of this set (which we call 'net-train') is used for CNN and CAN training; the remainder ('net-test') is used for testing.

4.2.2 Comparison with Baseline

The reference (baseline) approach [19] for DukeMTMC is based on correlation clustering using a binary integer program (BIPCC). Tracking is formulated as a graph-partitioning problem, with correlations computed from both appearance descriptors and simple temporal reasoning.

4.2.3 Single-camera Multi-person Tracking (SCT) Results

Our system aggregates detection responses into short tracklets using motion and appearance correlation as described in Section 3.4.1. These tracklets are further aggregated into single-camera trajectories using CAN. Tables I and II report SCT performance on each camera, comparing our method quantitatively with the baseline. For a fair comparison, the input to our method is the set of person detections from the MOTChallenge DukeMTMC benchmark. Averaged over all cameras, our approach improves on BIPCC in the IDF1, IDP and IDR metrics, demonstrating that the proposed CAN approach improves within-camera tracking performance.

Method | Cam 1 | Cam 2 | Cam 3 | Cam 4
Ours | 94.5 98.3 96.4 | 94.5 97.9 96.2 | 98.7 97.6 98.1 | 98.9 99.6 99.2
TABLE I: SCT ID score comparison (IDF1, IDP, IDR per camera)
Method Cam 5 Cam 6 Cam 7 Cam 8
Ours 93.8 96.5 90.1 93.2 99.3 97.1 97.6 99.2 96.0 98.2
TABLE II: SCT ID score comparison
Method | Identification Score | Our Measure (IE)
BIPCC [19] | | 1.3
Ours | 72.9 | 0.6
TABLE III: ICT (across all cameras) performance

4.2.4 Inter-camera Multi-person Tracking (ICT) Results

We use CAN for ICT as well, demonstrating its ability to scale from SCT to ICT. We compare our results with BIPCC using the Inference Error metric and identification scores; tracking performance across all cameras is presented in Table III. Our approach improves on BIPCC in both identification score and inference error, demonstrating that the proposed CAN approach improves multi-camera tracking performance.

5 Conclusion

For representation, instead of using one arbitrary feature vector, we use the set of feature vectors available for a trajectory. Using CAN, we can measure the relative quality of every feature map (spatial) in a trajectory for aggregation, based on location and camera information (temporal). Moreover, CAN can be incorporated into any existing re-identification feature representation framework to obtain an optimal template feature for SCT, ICT or both.

In addition, we study the error measures of a tracking system and define a new measure that emphasizes tracker mismatches at every timestep. A traditional re-identification system aims to match person images against a fixed gallery and reports performance using CMC [16] and ROC [8] curves. However, with a dynamically growing reference set (gallery), it is important to investigate mismatches. We hope this survey will benefit researchers working on video surveillance and tracking problems.


This material is based upon work supported by the National Science Foundation under Grant IIP #1266183.


  • [1] A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. In ICCV, 2015.
  • [2] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence, 2011.
  • [3] Y. Cai and G. Medioni. Exploring context information for inter-camera multiple target tracking. In Winter Conference on Applications of Computer Vision (WACV), 2014.
  • [4] W. Chen, L. Cao, X. Chen, and K. Huang. An equalized global graph model-based approach for multicamera object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
  • [5] B. DeCann and A. Ross. Modelling errors in a biometric re-identification system. IET Biometrics, 2015.
  • [6] C. Dicle, O. I. Camps, and M. Sznaier. The way they move: Tracking multiple targets with similar appearance. In ICCV, 2013.
  • [7] A. Gilbert and R. Bowden. Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In ECCV, 2006.
  • [8] O. Hamdoun, F. Moutarde, B. Stanciulescu, and B. Steux. Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences. In 2nd ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), 2008.
  • [9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [10] U. Iqbal, A. Milan, and J. Gall. Pose-track: Joint multi-person pose estimation and tracking. In CVPR, 2016.
  • [11] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence, 2012.
  • [12] I. Kviatkovsky, A. Adam, and E. Rivlin. Color invariants for person reidentification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
  • [13] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
  • [14] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In CVPR, 2017.
  • [15] A. J. Mansfield and J. L. Wayman. Best practices in testing and reporting performance of biometric devices. CESG, Nat. Phys. Lab., Teddington, U.K., NPL Tech. Rep. CMSC 14/02, 2002.
  • [16] N. Martinel and C. Micheloni. Re-identify people in wide area camera network. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, 2012.
  • [17] N. Narayan, N. Sankaran, D. Arpit, K. Dantu, S. Setlur, and V. Govindaraju. Person re-identification for improved multi-person multi-camera tracking by continuous entity association. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, 2017.
  • [18] N. Narayan, N. Sankaran, S. Setlur, and V. Govindaraju. Re-identification for online person tracking by modeling space-time continuum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [19] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
  • [20] E. Ristani and C. Tomasi. Tracking multiple people online and in real time. In ACCV, 2014.
  • [21] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
  • [22] N. Sankaran, S. Setlur, S. Tulyakov, and V. Govindaraju. Metadata-based feature aggregation network for face recognition. In ICB, 2017.
  • [23] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, 2017.
  • [24] J. Yang and P. Ren. Neural aggregation network for video face recognition. In CVPR, 2017.
  • [25] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
  • [26] S. Zhang, Y. Zhu, and A. Roy-Chowdhury. Tracking multiple interacting targets in a camera network. Computer Vision and Image Understanding, 2015.
  • [27] E. Ristani and C. Tomasi. Features for multi-target multi-camera tracking and re-identification. In CVPR, 2018.
  • [28] F. Pernici, F. Bartoli, M. Bruni, and A. Del Bimbo. Memory based online learning of deep representations from video streams. In CVPR, 2018.
  • [29] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M.-H. Yang. Online multi-object tracking with dual matching attention networks. In ECCV, 2018.