Accurately tracking objects of interest such as pedestrians and vehicles in video streams is an important problem with widespread applications in many fields, including surveillance, robotics and industrial safety. The problem of Multiple Object Tracking (MOT) in video has mostly been addressed in recent literature using the ‘tracking by detection’ framework, in which detections are combined to estimate the trajectories of tracked objects. Solutions can generally be grouped into online and batch processes: online solutions use measurements as they arrive, while a batch process may build globally optimal trajectories by considering measurements at all times.
Motivated by the large amounts of labelled data now available for pedestrian re-identification problems, the proposed method uses a deep-learning approach to appearance modelling. We present a convolutional neural network, trained in a Siamese configuration to produce a discriminative appearance similarity metric for pedestrians.
We present three ways in which this deep appearance metric learning can be used in MOT and propose a method using two of these components to achieve competitive performance. We compare our results to those of other methods and selectively evaluate each use of the proposed appearance metric. First, we show how a learned appearance metric can be used to improve the assignment of candidate detections to form short tracks (tracklets) as the first step in creating longer optimal tracks. Next, we show how the same metric learner can perform detection boosting to reduce false negatives where detections are missing within a person’s track. Lastly, the deep appearance metric is used to perform iterative appearance-based merging of tracklets to form longer tracks, a process we call tracklet association. We accomplish this as an online process, with a playback delay of only a few seconds, at a frame rate suitable for real-time applications.
The rest of the paper is organised as follows: section II describes related approaches in the literature, section III introduces the proposed Siamese Deep Metric Tracker, section IV evaluates the proposed method against the publicly available MOT16 dataset of the MOTChallenge, section V discusses the evaluation results and section VI concludes the paper.
II. Related Work
Solutions to the multiple object tracking problem fall into two distinct categories: batch and online processing. In batch processing, detections are combined in a global sense, rather than frame by frame, to form optimal tracks [3, 4, 5]. Despite the apparent performance advantages of batch methods as described by Luo et al. in their extensive literature review, we consider only online methods in this work, where filtered tracks are available with little to no delay, motivated by potential real-time applications in surveillance, robotics and industrial safety.
In some online approaches, tracklet states are estimated by a probabilistic model such as a Kalman filter [7, 8]. Others have used deep learning to estimate the motion of tracked objects from data, including estimating the birth and death of tracks.
A difficult aspect of tracking by detection is solving the data association problem that arises when grouping detections or merging tracklets. Some works in the literature use confidence estimation to aid data association by prioritising high-confidence tracks [10, 11]. Others leverage image information, where even simple appearance modelling has been shown to make data association more robust. Appearance modelling plays a larger role in single object tracking, where only appearance is used to track objects given a prior.
More recently, the availability of labelled data has motivated methods utilising deep learning for appearance modelling. Siamese networks have been used for single object tracking to great effect by Feichtenhofer et al. and Liu et al., where the deep appearance model is used to search successive frames [14, 15].
In multiple object tracking, online learning has been used to discriminate between tracked objects based on appearance, albeit at limited speed due to computational complexity.
Some methods using deep learning have achieved outstanding results on the MOT Challenge [16, 17]. For example, Leal-Taixé et al. achieve good results with a Siamese network, though their approach requires a gallery of images from past tracks to be stored for re-identification. Wojke et al. solve this by using a deep similarity metric learning network to store a gallery of metrics. He et al. go one step further and use a deep recurrent network to compute an appearance metric which incorporates temporal information from the tracked object, to good effect.
Informed by the available literature, we have developed our proposed Siamese Deep Metric Tracker to address the problem of online multiple object tracking, and evaluate how deep appearance modelling may best be used in real-time applications.
III. Siamese Deep Metric Tracker
Here we present our proposed Siamese Deep Metric Tracker to perform online multiple object tracking. A strong appearance model is central to this proposed method. We use a single deep neural network, detailed in section III-B, to enable or assist three components of our object tracking algorithm shown at a high level in figure 2. We solve the problem in multiple stages; firstly, detections are Assigned to tracklets, as detailed in section III-D; detections are then Boosted as described in section III-F; and finally tracklets are Associated as described in section III-G.
We use the following notation in all equations, explanations and algorithm listings in this paper. Let the set of estimated tracklets be T = {T_1, …, T_n}, containing tracklets T_i. Let the estimated state of tracklet T_i at time t be T_i^t, and the predicted state of tracklet T_i be T̂_i^t. Let the set of detections at time t be D^t = {d_1, …, d_m}, containing detections d_j.
III-B. Deep Similarity Metric
A robust appearance model can improve simple object tracking by preventing tracks from drifting to false positive detections, and by enabling objects to be tracked through occlusion.
At each time step we compute a feature vector for each candidate detection in a single batch. Computing all features in a batch is an efficient use of GPU resources, even for a typical batch of 40 image patches. The network, with layers listed in table I, uses pre-trained convolutional layers from VGG-16, followed by two fully connected layers with batch normalisation and normalisation of the output layer. The use of pre-trained networks as feature extractors in Siamese/triplet networks has been shown to reduce the number of iterations required for convergence and improve accuracy. Euclidean distance between feature vectors lying within a unit hyper-sphere measures the distance between two input patches in the appearance similarity space, and the appearance affinity between two patches is derived from this distance. We use an optimal affinity threshold, determined offline, to separate similar and dissimilar pairs.
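As a concrete illustration, the sketch below computes an appearance affinity from L2-normalised feature vectors. Because the features lie on a unit hyper-sphere, the Euclidean distance between two vectors lies in [0, 2]; the mapping a = 1 − d/2 used here to turn that distance into a [0, 1] affinity is our own assumption, since the paper's exact affinity expression was lost in extraction.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # project feature vectors onto the unit hyper-sphere
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def appearance_affinity(f1, f2):
    # features are assumed already normalised, so the Euclidean
    # distance lies in [0, 2]; a = 1 - d/2 maps it into [0, 1]
    d = np.linalg.norm(f1 - f2)
    return 1.0 - d / 2.0
```

A pair of identical patches then scores an affinity of 1, and diametrically opposed feature vectors score 0; a fixed threshold on this score separates similar and dissimilar pairs.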
An alternative implementation uses a learned softmax classifier to give a similarity score between the input images. In our approach, we compute the feature vector for a detection and store it with the state of the tracklet at the time of the detection, meaning that we do not have to store a gallery of images for each tracked object. We assume that the appearance metric computed from the detection will closely match the appearance metric of the true bounding box of the subject. This assumption appears to hold during testing, as bounding boxes are usually well regressed to the true bounding box of the detected object.
The deep similarity network was trained on the Market-1501 pedestrian re-identification dataset, containing 32,000 annotated images of 1,501 unique pedestrians in six camera views. Triplet loss has recently been used to good effect in training networks for pedestrian re-identification. Networks using triplet loss have been known to be difficult to train, due to a stagnating training loss. Batch-hard example mining has been shown to improve convergence when training with triplet loss. Our approach uses batch-hard sampling to train our network in a Siamese fashion using margin contrastive loss in a large batch. We sample 4 images each from 32 identities, compute their feature vectors in a forward pass and, for each of the 128 images, select the hardest pairings: those maximising Euclidean distance between feature vectors for positive pairs and minimising distance for negative pairs.
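A minimal numpy sketch of batch-hard mining with a margin contrastive loss, under the sampling scheme described above (multiple images per identity in each batch); the margin value and the exact form of the loss are our assumptions, not taken from the paper.

```python
import numpy as np

def pairwise_dist(emb):
    # Euclidean distance matrix between all embeddings in the batch
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(-1) + 1e-12)

def batch_hard_contrastive(emb, labels, margin=0.5):
    """Batch-hard margin contrastive loss (numpy sketch).
    For each anchor: hardest positive = farthest same-identity sample,
    hardest negative = closest different-identity sample.
    Assumes every identity has at least two samples in the batch."""
    d = pairwise_dist(emb)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos = np.where(same & not_self, d, -np.inf).max(axis=1)
    neg = np.where(~same, d, np.inf).min(axis=1)
    # pull hardest positives in; push hardest negatives out past margin
    loss = pos + np.maximum(0.0, margin - neg)
    return loss.mean()
```

With well-separated identity clusters the loss approaches zero; enlarging the margin re-penalises negatives that sit inside the new margin, which is what drives the embedding apart during training.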
III-D. Detection Assignment
Detections are combined across time to estimate the trajectory of a tracked object. This algorithm is shown in listing 1 and detailed below. The motion of small track segments, tracklets, is estimated via a Kalman filter with a constant velocity constraint. Tracklet states are predicted at each time step, but a tracklet is considered inactive after two predictions without being assigned a detection. A tracklet’s state is predicted for another 90 steps for tracklet association, discussed in section III-G. The association of new detections to the set of active tracklets is solved as a data association problem using the Hungarian algorithm.
The Hungarian algorithm maximises the affinity between tracklets and assigned detections, provided in the affinity matrix, and creates entries in the matching matrix. The affinity used to assign candidate detections to tracklets is a combination of motion affinity, preferencing detections close to the predicted position of the tracklet, and appearance affinity, which attempts to match the tracklet with a detection whose appearance is closest to stored appearance information. Motion affinity is implemented as the Intersection over Union (IoU) between a candidate detection and the predicted bounding box of the tracklet, as shown in equation 1. Motion affinity is constrained to be strictly greater than a motion affinity threshold for assignment.
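The IoU-based motion affinity of equation 1 is the standard Intersection over Union for axis-aligned boxes, which can be written directly as:

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```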
Appearance affinity is computed as the mean affinity between a candidate detection’s feature vector and the stored feature vectors for a tracklet, as shown in equation 2. A subset of N past states of the tracklet is used for computational tractability.
The total affinity, shown in equation 4, is a combination of appearance and motion affinity, balanced by a weighting parameter. An extreme value of this parameter may be used to ignore appearance entirely when computing assignments, potentially improving frame rate under some implementations.
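Putting the pieces together, the sketch below solves the assignment with SciPy's Hungarian solver. The convex-combination form of the total affinity (with `lam` standing in for the paper's balancing parameter) and the threshold values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def total_affinity(motion_aff, appearance_aff, lam=0.5):
    # convex combination of motion (IoU) and appearance affinity;
    # in this convention lam = 1 weights motion only, ignoring appearance
    return lam * motion_aff + (1.0 - lam) * appearance_aff

def assign_detections(motion, appearance, lam=0.5, motion_thresh=0.1):
    """Hungarian assignment maximising total affinity over a
    tracklets-by-detections matrix. Pairs failing the motion-affinity
    gate are discarded after solving."""
    aff = total_affinity(motion, appearance, lam)
    rows, cols = linear_sum_assignment(-aff)  # negate to maximise
    return [(i, j) for i, j in zip(rows, cols) if motion[i, j] > motion_thresh]
```

Gating on motion affinity after solving prevents a tracklet from being matched to a detection far from its predicted position merely because it was the best of a bad set.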
III-E. Tracklet Confidence
A minimum length requirement is imposed on tracklets for them to be considered positive: tracks containing fewer than six states are considered negative and therefore are not reported. The mean confidence of detections assigned to a given tracklet is also used to filter out low-confidence tracklets, with a minimum mean confidence threshold used in practice. The average cost of assigning detections to a tracklet is used to estimate confidence in it being positive; a tracklet with a high mean assignment cost is likely to be varying in appearance or in motion and is considered negative. Tracklet association and boosting consider only positive tracks to avoid joining false positives with true positives.
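The rules above might be sketched as a simple predicate; the six-state minimum comes from the text, while the confidence and cost thresholds here are illustrative assumptions (the paper's values were lost in extraction).

```python
def is_positive(tracklet_states, det_confidences, assignment_costs,
                min_len=6, min_conf=0.3, max_cost=0.7):
    """Heuristic tracklet filter: a tracklet is positive only if it is
    long enough, its detections are confident enough on average, and
    its mean assignment cost is low (stable appearance and motion).
    min_conf and max_cost are illustrative, not the paper's values."""
    if len(tracklet_states) < min_len:
        return False
    if sum(det_confidences) / len(det_confidences) < min_conf:
        return False
    if sum(assignment_costs) / len(assignment_costs) > max_cost:
        return False
    return True
```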
III-F. Detection Boosting
When, in a given frame, there exists no detection which matches a tracklet, but the tracked object is not occluded or out of frame, we wish to re-identify that person. Using the predicted location of the object as a prior, we perform dense sampling around the prediction and select the candidate bounding box which maximises appearance affinity and satisfies the appearance affinity constraint. This detection is added to the detection set and association is performed again, as shown in figure 2. In order to prevent track drift, boosting is limited to no more than once per two frames per track. To stop partial detections from drifting to a true person via boosting and therefore adding false positives, Non-Maximum Suppression (NMS) is performed on the detections.
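A sketch of the dense sampling step; the shift/scale grid and the affinity threshold are illustrative assumptions, and `affinity_fn` stands in for the deep appearance affinity between a candidate patch and the tracklet's stored features.

```python
def dense_candidates(pred_box, shifts=(-8, -4, 0, 4, 8),
                     scales=(0.9, 1.0, 1.1)):
    """Candidate boxes sampled around a predicted (x1, y1, x2, y2)
    box. The shift/scale grid is an illustrative assumption."""
    x1, y1, x2, y2 = pred_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    boxes = []
    for dx in shifts:
        for dy in shifts:
            for s in scales:
                hw, hh = w * s / 2, h * s / 2
                boxes.append((cx + dx - hw, cy + dy - hh,
                              cx + dx + hw, cy + dy + hh))
    return boxes

def boost(pred_box, affinity_fn, affinity_thresh=0.6):
    """Pick the candidate maximising appearance affinity, provided it
    clears the affinity constraint (threshold value is assumed)."""
    best = max(dense_candidates(pred_box), key=affinity_fn)
    return best if affinity_fn(best) > affinity_thresh else None
```

The boosted box, if any, is then appended to the detection set before the NMS and assignment steps run again.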
III-G. Tracklet Association
Targets may be tracked through occlusion by matching tracklets across time using their appearance. Our association algorithm is shown in listing 2 and described below. Due to uncertainties in camera and target motion, a much looser motion constraint is used to associate tracklets, requiring only a small overlap between the predicted bounding box of the older tracklet and the first bounding box of the newer track.
As tracking is done in the image plane, changes in camera motion may frequently violate the constant velocity constraint imposed by our Kalman-filter-based tracking. By building small tracklets with a stricter motion constraint and linking high-confidence tracklets into longer tracks with a looser motion constraint, our tracking may intuitively be robust to changes in camera motion. Tracklet association need not run at every time step; running it once every 20 time steps is sufficient to not impact performance, resulting in a higher refresh rate.
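The iterative merging can be sketched as a greedy loop over temporally ordered tracklets; the `affinity` helper (which would combine the loose motion gate with appearance affinity) and the threshold are assumptions standing in for the details of listing 2.

```python
def associate(tracklets, affinity, aff_thresh=0.6):
    """Iterative tracklet merging (greedy sketch). Each tracklet is a
    dict with 'start'/'end' frame indices and a 'states' list;
    affinity(a, b) is an assumed helper combining the loose motion
    gate with appearance affinity. Merging repeats until no pair of
    non-overlapping tracklets clears the threshold."""
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(tracklets):
            for b in tracklets[i + 1:]:
                if a['end'] < b['start'] and affinity(a, b) > aff_thresh:
                    a['end'] = b['end']        # absorb b into a
                    a['states'] += b['states']
                    tracklets.remove(b)
                    changed = True
                    break
            if changed:
                break   # restart scan after a merge
    return tracklets
```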
After tracklets have been merged, temporal gaps are filled by interpolation with a constant velocity, giving a reasonable estimate for the state of the object while it is occluded; this can be seen in figure 1.
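Constant-velocity gap filling reduces to linear interpolation of the bounding box between the last state of the older tracklet and the first state of the newer one:

```python
def fill_gap(box_a, box_b, t_a, t_b):
    """Linearly interpolate (x1, y1, x2, y2) boxes for frames strictly
    between t_a and t_b, i.e. a constant-velocity estimate for the
    object's state through the occlusion."""
    out = {}
    for t in range(t_a + 1, t_b):
        w = (t - t_a) / (t_b - t_a)
        out[t] = tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b))
    return out
```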
| Method | MOTA↑ | FP↓ | FN↓ | IDs↓ | FPS↑ |
| --- | --- | --- | --- | --- | --- |
| SDMT (w/ boosting) | 34.2 | 6,743 | 65,533 | 334 | 25.9 |
| SDMT (w/o association) | 32.4 | 3,968 | 69,965 | 686 | 31.6 |
| SDMT (w/o appearance modelling) | 32.9 | 4,069 | 69,468 | 587 | 96.8 |
The Siamese deep metric network was validated on a subset of the Market-1501 dataset not used for training. The network achieved a high area under the receiver operating characteristic curve after training, evaluated with an equal mix of positive and negative pairs and with distractors sampled from the background.
IV-A. CLEAR MOT Metrics
The CLEAR MOT metrics are used here to compare our performance to others, as well as to compare the benefits of each of the uses of our deep appearance model. The specific metrics we use (↑ denotes metrics in which a higher score is better, ↓ denotes metrics in which a lower score is better):
MOTA ↑, combines FP, FN and IDs to give a single metric to summarise accuracy.
FP ↓, is the number of false positive bounding boxes.
FN ↓, is the number of false negative bounding boxes.
IDs ↓, is the number of times tracked targets swap IDs.
FPS ↑, is the update frequency, an important metric for real-time applications.
IV-B. MOT16 Results
A selection of methods suitable for real-time applications was made for comparison. Online approaches that achieve an update rate of greater than 10 Hz on the computationally intensive test set using the public detections are shown in table III compared to our approach.
Among the selected approaches, our method achieves a competitive tracking accuracy (MOTA) for a relatively simple method. At only 602, our method achieves the second lowest number of ID switches (IDs), second only to a method with significantly more false negatives.
We performed repeated testing while enabling or disabling certain aspects of our algorithm, presented in table II. The best-performing method from this ablative testing was used in the testing presented in table III. The best method did not include boosting and used an intermediate lambda value; changing lambda to 0 or 1 reduced accuracy on the training set. Adding boosting to the optimal method reduced false negatives but significantly increased false positives. Removing tracklet association, or appearance modelling entirely, significantly reduced tracking accuracy. The method without any appearance modelling removed the need to compute feature vectors for each detection, significantly increasing the update rate.
We found that using our deep appearance metric for detection assignment and tracklet association improved the performance of multiple object tracking. Detection boosting was found to hurt the accuracy of our tracking, despite reducing the number of false negatives as intended. This was likely due to the high recall rate of 43% but relatively low precision of the DPM v5 detections provided with the test sequences. Boosting is most useful when there exists no detection for a given target, yet the target is not occluded or out of frame. It is possible that this case does not occur often in the MOT16 dataset, limiting the effect of reducing false negatives. Tracklets built from false positive detections that contain some part of a true object may be boosted, causing drift towards the true object. This may lead to the tracks being merged, increasing false positives.
We presented three uses of deep appearance metric learning for improving multiple object tracking, and demonstrated how two of these uses significantly improved tracking accuracy. Our method achieved competitive results for online methods suitable for real-time applications. Our ablative testing may be used to inform further use of deep appearance metrics in multiple object tracking.
The authors would like to thank Benjamin Tam, Lachlan Tychsen-Smith and Nicholas Panitz for their assistance during this work.
-  L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking,” arXiv:1504.01942 [cs], Apr. 2015.
-  A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: A Benchmark for Multi-Object Tracking,” Mar. 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497 [cs], June 2015.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” arXiv:1512.02325 [cs], vol. 9905, pp. 21–37, 2016.
-  L. Tychsen-Smith and L. Petersson, “DeNet: Scalable Real-time Object Detection with Directed Sparse Sampling,” arXiv:1703.10295 [cs], Mar. 2017.
-  W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, X. Zhao, and T.-K. Kim, “Multiple Object Tracking: A Literature Review,” arXiv:1409.7618 [cs], Sept. 2014.
-  E. Bochinski, V. Eiselein, and T. Sikora, “High-Speed tracking-by-detection without using image information,” in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Aug. 2017, pp. 1–6.
-  A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple Online and Realtime Tracking,” arXiv:1602.00763 [cs], pp. 3464–3468, Sept. 2016.
-  A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, and K. Schindler, “Online Multi-Target Tracking Using Recurrent Neural Networks,” Apr. 2016.
-  S. H. Bae and K. J. Yoon, “Confidence-Based Data Association and Discriminative Deep Appearance Learning for Robust Online Multi-Object Tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2017.
-  ——, “Robust Online Multi-object Tracking Based on Tracklet Confidence and Online Discriminative Appearance Learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
-  V. Takala and M. Pietikainen, “Multi-object tracking using color, texture and motion,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–7.
-  D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2010, pp. 2544–2550.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman, “Detect to Track and Track to Detect,” arXiv:1710.03958 [cs], Oct. 2017.
-  R. Tao, E. Gavves, and A. W. M. Smeulders, “Siamese Instance Search for Tracking,” May 2016.
-  H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan, “End-to-End Comparative Attention Networks for Person Re-identification,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3492–3506, July 2017.
-  F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan, “POI: Multiple Object Tracking with High Performance Detection and Appearance Feature,” arXiv:1610.06136 [cs], Oct. 2016.
-  L. Leal-Taixé, C. C. Ferrer, and K. Schindler, “Learning by tracking: Siamese CNN for robust target association,” Apr. 2016.
-  N. Wojke, A. Bewley, and D. Paulus, “Simple Online and Realtime Tracking with a Deep Association Metric,” Mar. 2017.
-  Q. He, J. Wu, G. Yu, and C. Zhang, “SOT for MOT,” arXiv:1712.01059 [cs], Dec. 2017.
-  K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556 [cs], Sept. 2014.
-  A. Hermans, L. Beyer, and B. Leibe, “In Defense of the Triplet Loss for Person Re-Identification,” arXiv:1703.07737 [cs], Mar. 2017.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable Person Re-identification: A Benchmark,” in IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 1116–1124.
-  H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, pp. 83–97, 1955.
-  J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, “UnitBox: An Advanced Object Detection Network,” arXiv:1608.01471 [cs], pp. 516–520, 2016.
-  R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, “The CLEAR 2006 Evaluation,” in Multimodal Technologies for Perception of Humans, ser. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Apr. 2006, pp. 1–44.