Robust tracking of multiple objects is a challenging problem in computer vision and an important component of many real-world applications. It aims to reliably recover trajectories and maintain identities of objects of interest in an image sequence. State-of-the-art Multi-Object Tracking (MOT) methods mostly utilize the tracking-by-detection strategy because of its robustness against tracking drift. Such a strategy generates per-frame object detection results from the image sequence and associates the detections into object trajectories, which also lets it handle newly appearing objects. Tracking-by-detection methods can be categorized into offline and online methods. Offline methods use detection results from both past and future frames, with global optimization techniques for linking detections into object trajectories. Online methods, on the other hand, use only detection results up to the current time to incrementally generate object trajectories. Our proposed method focuses on online MOT, which is more suitable for real-time applications such as autonomous driving and intelligent surveillance.
In MOT methods, the tracked objects usually show consistent or slowly varying appearance across time. Visual features of the objects are therefore important cues for associating detection boxes into tracklets. In recent years, deep learning techniques have shown great potential in learning discriminative visual features for single-object and multi-object tracking. However, visual cues alone cannot guarantee robust tracking results. When tracked objects with similar appearances occlude or are close to each other, their trajectories might be wrongly associated with other objects. In addition, imperfect object detectors produce missed or inaccurate detections. Such difficulties escalate when the camera is held by hand or fixed on a car, because each object then moves according to its own movement pattern as well as the global camera motion. Previous work addressed such problems by modeling interactions between tracked objects in the optimization model. For online MOT methods, there were investigations on modeling inter-object interactions with social force models [5, 6, 7], relative spatial and speed differences [8, 9, 10], and relative motion constraints [3, 11]. Most of the previous methods model pairwise inter-object interactions in symmetric mathematical forms, i.e., pairs of objects influence each other with the same magnitude.
However, such pairwise object interactions should be directional and modeled in an asymmetric form. For instance, large-size detection boxes are more likely to be noisy (if measured in actual pixels). Smaller boxes should therefore influence larger boxes more than larger boxes influence smaller ones, because the smaller ones usually provide more accurate localization for objects. Similarly, high-confidence trajectories should influence low-confidence ones more, and low-confidence ones should have minimal impact on high-confidence ones. In this way, the more accurate detections or trajectories can help correct errors of the inaccurate ones without being much affected by them. Moreover, in existing methods, individual objects' movements and inter-object interactions are usually modeled separately. The relations between the two terms are mostly manually tuned and not effectively studied in a unified framework.
To tackle these difficulties, we propose a Deep Continuous Conditional Random Field (DCCRF) with asymmetric inter-object constraints for solving the problem of online MOT. The DCCRF takes as inputs a pair of consecutive images at time $t$ and time $t+1$, and the tracked objects' past trajectories up to time $t$. It estimates locations of the tracked objects at time $t+1$. The DCCRF optimizes an objective function with two kinds of terms: the unary terms, which estimate individual objects' movement patterns, and the asymmetric pairwise terms, which model interactions between tracked objects. The unary terms are modeled by a deep Convolutional Neural Network (CNN), which is trained to estimate each individual object's displacement between time $t$ and time $t+1$ from the object's visual appearance. The asymmetric pairwise terms aim to tackle the problems caused by object occlusions, object mis-detections and global camera motion. For two neighboring tracked trajectories, the pairwise influence is different along each direction so that the high-confidence trajectory assists the low-confidence one more. Our proposed DCCRF utilizes mean-field approximation for inference and is trained in an end-to-end manner to estimate the optimal displacement for each tracked object. Based on such estimated object locations, a final visual-similarity CNN is proposed for generating the final detection association results.
The contribution of our proposed online MOT framework is two-fold. (1) A novel DCCRF model is proposed for solving the online MOT problem. Each object's individual movement patterns as well as inter-object interactions are studied in a unified framework and trained in an end-to-end manner. In this way, the unary terms and pairwise terms of our DCCRF can better adapt to each other to achieve more accurate tracking performance. (2) An asymmetric inter-object interaction term is proposed to model the directional influence between pairs of objects, which aims to correct errors of low-confidence trajectories while maintaining the estimated displacements of the high-confidence ones. Extensive experiments on two public datasets show the effectiveness of our proposed MOT framework.
II Related Work
A large number of methods address the multi-object tracking problem. We focus on reviewing online MOT methods that utilize interactive constraints, as well as single-object and multi-object tracking algorithms with deep neural networks.
Interaction models for MOT. Social force models were adopted in MOT methods [5, 6, 7] to model pairwise interactions (attraction and repulsion) between objects. These methods required objects’ 3D positions for modeling inter-object interactions, which were obtained by visual odometry.
Grabner et al. assumed that the relative positions between feature points and objects were more or less fixed over short time intervals. The generalized Hough transform was therefore used to predict each target's location with the assistance of supporter feature points. Duan et al. proposed mutual relation models to describe the spatial relations between tracked objects and to handle occluded objects. Such constraints are learned by an online structured SVM. Zhang and Maaten incorporated spatial constraints between objects into an MOT framework to track objects with similar appearances.
The CRF algorithm was used frequently in segmentation tasks to model the relationship between different pixels in the spatial domain. There were also many works that modeled the multi-object tracking problem with CRF models. Yang and Nevatia proposed an online-learned CRF model for MOT, and assumed linear and smooth motion of the objects to associate past and future tracklets. Andriyenko et al. modeled multi-object tracking as optimizing discrete and continuous CRF models. A continuous CRF was used for enforcing motion smoothness, and a discrete CRF with a temporal interaction pairwise term was optimized for data association. Milan et al. designed new CRF potentials for modeling spatio-temporal constraints between pairs of trajectories to tackle detection-level and trajectory-level occlusions.
Deep learning based object tracking. Most existing deep learning based tracking methods focused on single-object tracking, because deep neural networks were able to learn powerful visual features for distinguishing the tracked objects from the background and other similar objects. Early single-object tracking methods with deep learning focused on learning discriminative appearance features for online training. However, due to the large learning capacity of deep neural networks, it is easy to overfit the data. Other methods pretrained deep convolutional neural networks on large-scale image datasets to learn discriminative visual features, and updated the classifier online with new training samples. More recently, methods that did not require model updating were proposed. Tao et al. utilized Siamese CNNs to determine visual similarities between image patches for tracking. Bertinetto et al. changed the network into a fully convolutional setting and achieved real-time running speed.
Recently, deep models have been applied to multi-object tracking. Milan et al. proposed an online MOT framework with two RNNs. One RNN was used for state (object locations, motions, etc.) prediction and update, and the other for associating objects across time. However, this method did not utilize any visual features and relied solely on spatial locations of the detection results. Other methods replaced hand-crafted features (e.g., color histograms) with features learned between image patches by a Siamese CNN, which increases the discriminative ability. However, those methods focused on modeling individual objects' movement patterns with deep learning. Inter-object relations were not integrated into deep neural networks.
The overall framework of our proposed MOT method is illustrated in Fig. 1. We propose a Deep Continuous Conditional Random Field (DCCRF) model for solving the online MOT problem. At each time $t$, the framework takes past tracklets up to time $t$ and detection boxes at time $t+1$ as inputs, and generates new tracklets up to time $t+1$. At each time step, new tracklets are also initialized and current tracklets are terminated if tracked objects disappear from the scene.
The core components of the proposed DCCRF consist of unary terms and asymmetric pairwise terms. The unary terms of our DCCRF are modeled by a deep CNN that estimates each individual tracked object's displacement between consecutive times $t$ and $t+1$. The asymmetric pairwise terms aim to model inter-object interactions, which consider differences of speeds, visual confidences, and object sizes between neighboring objects. Unlike interaction terms in existing MOT methods, which treat inter-object interactions in a symmetric way, asymmetric relationship terms are proposed in our DCCRF. For pairs of tracklets in our DCCRF model, the proposed asymmetric pairwise term models the two directions differently, so that high-confidence trajectories with small-size detection boxes can help correct errors of low-confidence trajectories with large-size detection boxes. Based on the object displacements estimated by the DCCRF, we adopt a visual-similarity CNN and the Hungarian algorithm to obtain the final tracklet-detection associations.
III-A Deep Continuous Conditional Random Field (DCCRF)
The proposed DCCRF takes object trajectories up to time $t$ and the video frame at time $t+1$ as inputs, and outputs each tracked object's displacement between time $t$ and time $t+1$. Let $\mathbf{r}$ represent a random field defined over a set of variables $\{r_1, \dots, r_n\}$, where each of the variables represents the visual and motion information of an object tracklet. Let $\mathbf{d}$ represent another random field defined over variables $\{d_1, \dots, d_n\}$, where each variable represents the displacement of an object between time $t$ and time $t+1$. The domain of each variable is the two-dimensional space $\mathbb{R}^2$, denoting the $x$- and $y$-dimensional displacements of tracked objects. Let $I$ represent the new video frame at time $t+1$.
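The goal of our conditional random field is to maximize the following conditional distribution,
$$P(\mathbf{d} \mid \mathbf{r}, I) = \frac{1}{Z(\mathbf{r}, I)} \exp\big(-E(\mathbf{d}, \mathbf{r}, I)\big), \tag{1}$$
where $E(\mathbf{d}, \mathbf{r}, I)$ represents the Gibbs energy and $Z(\mathbf{r}, I)$ is the partition function. Maximizing the conditional distribution w.r.t. $\mathbf{d}$ is equivalent to minimizing the Gibbs energy function,
$$E(\mathbf{d}, \mathbf{r}, I) = \sum_{i} \phi(d_i, r_i, I) + \sum_{i,j} \psi(d_i, d_j, r_i, r_j, I), \tag{2}$$
where $\phi$ and $\psi$ are the unary terms and pairwise terms.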
After the displacements of tracked objects between time $t$ and time $t+1$ are obtained, each object's estimated location at time $t+1$ can be easily calculated for associating tracklets and detection boxes to generate tracklets up to time $t+1$. Such displacements are then iteratively calculated for the following time frames. Without loss of generality, we only discuss the approach for optimizing object displacements between time $t$ and time $t+1$ in this section.
III-A1 Unary terms
For the $i$th object tracklet, the unary term of our DCCRF model is defined as
$$\phi(d_i, r_i, I) = w_i \big(d_i - f_d(r_i, I)\big)^2. \tag{3}$$
This term penalizes the quadratic deviation between the final output displacement $d_i$ and the displacement estimated by a visual displacement estimation function $f_d$. $w_i$ is an online adaptive parameter for the $i$th object that controls whether to trust more the displacement estimated from the $i$th object's visual cues (the unary terms) or from inter-object relations (the pairwise terms). Intuitively, when the visual displacement estimator $f_d$ has higher confidence in its estimated displacement, $w_i$ should be larger to bias the final output towards the visually inferred displacement. On the other hand, when $f_d$ has lower confidence in its estimation, due to object occlusion or the appearance of similar objects, $w_i$ should be smaller to let the final displacement be mainly inferred from inter-object constraints.
In our framework, the visual displacement estimation function is modeled as a deep Convolutional Neural Network (CNN) that utilizes only the tracked object's visual information for estimating its location displacement between time $t$ and time $t+1$. For each tracked object, our visual-displacement CNN takes a pair of image patches cropped from frames $t$ and $t+1$ as inputs, and outputs the object's inferred displacement. A network structure similar to ResNet-101 (except for the topmost layer) is adopted for our visual-displacement CNN.
The network inputs and outputs are illustrated in Fig. 2. For the inputs, given the currently tracked object's bounding box at time $t$, a larger bounding box centered at the same location is first created. Two image patches are cropped at this same spatial location but from the two different frames at time $t$ and time $t+1$. They are then concatenated along the channel dimension to serve as the inputs for our visual-displacement CNN. The larger bounding box is used instead of the original box to tolerate large possible displacements between the two consecutive frames and to incorporate more visual context of the object for more accurate displacement estimation. After training with thousands of such pairs, the visual-displacement CNN is able to capture important visual cues from image-patch pairs to infer object displacements between time $t$ and time $t+1$.
For the CNN outputs, instead of directly estimating objects' $x$- and $y$-dimensional displacements, we discretize possible 2D continuous displacements into a 2D discrete grid (bottom-right part in Fig. 2), where $\delta_{i,m}$ represents the displacement corresponding to the $m$th bin of the $i$th object. The visual-displacement CNN is trained to output confidence scores $c_i$ for the displacement bins with a softmax function. The cross-entropy loss is therefore used to train the CNN, and the final estimated displacement for the tracked object is calculated as the confidence-weighted average of all possible displacements, $f_d(r_i, I) = \sum_m c_{i,m}\,\delta_{i,m}$, where $c_{i,m}$ is the softmax confidence of the $m$th bin. In practice, we discretize displacements into bins, which is a good trade-off between discretization accuracy and robustness. Note that there are existing tracking methods that also utilize pairs of image patches as inputs to directly estimate object displacements. However, in our method, we propose to use the cross-entropy loss for estimating displacements and find in our experiments that it yields more accurate and robust displacement estimations. More importantly, it provides displacement confidence scores for calculating the adaptive parameter in Eq. (3) to weight the unary and pairwise terms.
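The bin-averaging step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 3x3 grid `BINS`, its 8-pixel spacing, and the function names are assumptions chosen for the example, and raw scores stand in for the CNN's pre-softmax outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_displacement(scores, bins):
    """Return ((dx, dy), confidences): the confidence-weighted mean of bin centers."""
    conf = softmax(scores)
    dx = sum(c * b[0] for c, b in zip(conf, bins))
    dy = sum(c * b[1] for c, b in zip(conf, bins))
    return (dx, dy), conf

# Hypothetical 3x3 grid of displacement bins in pixels (row-major, y outer loop).
BINS = [(x, y) for y in (-8, 0, 8) for x in (-8, 0, 8)]
```

With uniform scores the estimate is the grid center; a sharply peaked score vector returns approximately the corresponding bin center, and the maximum of the returned confidences is the quantity that later drives the adaptive weighting.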
The confidence weight $w_i$ is obtained by the following equation,
$$w_i = \sigma\big(a \max(c_i) + b\big), \tag{4}$$
where $\sigma$ is the sigmoid function constraining the range of $w_i$ to be between 0 and 1, $\max(c_i)$ obtains the maximal confidence of $c_i$, and $a$ and $b$ are learnable scalar parameters. In our experiments, the learned parameter $a$ is generally positive after training, which means that, if the visual-displacement CNN is more confident about its displacement estimation, the value of $w_i$ is larger and the final output displacement is biased more towards the visually inferred displacement. Otherwise, the final displacement is biased towards being inferred from inter-object constraints.
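As a small illustration of this confidence weighting, the sketch below maps the maximal bin confidence through a sigmoid of an affine function. The parameter values `a=4.0` and `b=-2.0` are hypothetical placeholders (in the paper they are learned), and `unary_weight` is a name introduced here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unary_weight(confidences, a=4.0, b=-2.0):
    """Adaptive unary weight in (0, 1) from the maximal bin confidence.

    a and b are hypothetical values; a > 0 mirrors the sign reported after training.
    """
    return sigmoid(a * max(confidences) + b)
```

With a positive `a`, a peaked confidence distribution yields a larger weight than a flat one, so the visual estimate dominates only when the CNN is confident.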
If the energy function in Eq. (2) consists of only the unary terms, the final output displacement depends solely on each tracked object's visual information without considering inter-object constraints.
III-A2 Asymmetric pairwise terms
The pairwise terms in Eq. (2) are utilized to model asymmetric inter-object relations between object tracklets for regularizing the final displacement results. To handle global camera motion, we assume that from time $t-1$ to time $t$, the speed differences between two tracked objects should be maintained, i.e.,
$$\psi(d_i, d_j, r_i, r_j, I) = \sum_k w^{(k)}_{i,j} \big(\Delta d_{i,j} - \Delta d^{t-1}_{i,j}\big)^2, \tag{5}$$
where $\Delta d_{i,j} = d_i - d_j$ is the displacement (which can be viewed as speed) difference between objects $i$ and $j$ at time $t$, $\Delta d^{t-1}_{i,j}$ is the speed difference at the previous time $t-1$, and $w^{(k)}_{i,j}$ are a series of weighting functions (two in our experiments) that control the directional influences between the pair of objects.
To better model inter-object relations, two important observations are made to define the asymmetric weighting functions. 1) For detection boxes, in terms of localization accuracy, larger object detection boxes are more likely to be noisy, while smaller ones tend to be more stable (as shown in Fig. 3). This is because the displacements of both large and small detection boxes are all recorded in pixels in our tracking framework. Noisy large detection boxes would significantly influence the displacement estimation for other boxes. This problem is illustrated in Fig. 4. The two targets in Fig. 4(a) have accurate locations and speeds, which can be used to build inter-object constraints at the earlier time step. When the detector outputs roughly accurate bounding boxes for both targets at the next time step, symmetric inter-object constraints can refine the objects' locations well (see Fig. 4(b)). However, since the larger-size detection boxes are more likely to be noisy, using symmetric inter-object constraints would significantly affect tracking results of the small-size objects (see Fig. 4(c)). In contrast, small-size objects have smaller localization errors and can better infer larger-size objects' locations. Asymmetric small-to-large-size inter-object constraints are robust, even when the smaller-size detection box is noisy (see Fig. 4(d)). Therefore, between a pair of tracked objects, the one with the smaller detection box should have more influence on the inferred displacement of the one with the larger detection box, and the object with the larger box should have less chance to deteriorate the displacement estimation of the smaller one.
[Fig. 4: panels (a) to (d), each labeled by time; column labels: symmetric influence, asymmetric small-to-large influence.]
2) If the visual-displacement CNN described above has high confidence in an object's displacement, this object's visually inferred displacement should be used more to infer other objects' displacements. On the other hand, objects with low confidence in their visually inferred displacements should not affect other objects with high-confidence displacements. Based on the two observations, we model the weighting function by a product of a size-based weighting function and a confidence-based weighting function between a pair of tracked objects as
where $\sigma$ denotes the sigmoid function, $s_i$ denotes the size of the $i$th tracked object at time $t$, $\max(c_i)$ obtains the maximal displacement confidence from the confidences $c_i$ produced by our proposed visual-displacement CNN, and the four remaining coefficients are learnable scalar parameters. In our DCCRF, these parameters can be learned by the back-propagation algorithm with mean-field approximation. When the mean-field approximation is used for DCCRF inference, the influence from object $i$ to object $j$ and that from $j$ to $i$ are different (see next subsection for details). After training, the learned signs of these parameters are such that smaller sizes and larger confidences lead to greater weights. This validates the observations above that objects with smaller sizes and greater visual-displacement confidences should have greater influence on other objects, but not the other way around.
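To make the directional behavior concrete, here is one possible form of such a weighting function. The exact expression is an assumption for illustration only: the log-ratio arguments, the function name `pairwise_weight`, and the parameter values are hypothetical; the text above only fixes that the weight is a product of a size-based and a confidence-based sigmoid term with learnable scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_weight(size_src, size_dst, conf_src, conf_dst,
                    a_s=-1.0, b_s=0.0, a_c=1.0, b_c=0.0):
    """Weight of the influence of a source object on a destination object.

    Assumed form: a size term times a confidence term. a_s < 0 favors smaller
    sources, a_c > 0 favors more confident sources (hypothetical values).
    """
    size_term = sigmoid(a_s * math.log(size_src / size_dst) + b_s)
    conf_term = sigmoid(a_c * math.log(conf_src / conf_dst) + b_c)
    return size_term * conf_term
```

Swapping source and destination changes the value, which is the asymmetry the pairwise terms rely on: a small, confidently localized object receives a much larger weight when acting on a large, uncertain one than in the reverse direction.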
In Fig. 5, we show example values of one learned weighting function . In Fig. 5(a), compared with object 6, objects 2-4 are of smaller sizes and also higher visual confidences. With the directional weighting functions, they have greater influence to correct errors of tracking object 6 (red vs. green rectangles of object 6) and are not affected much by the erroneous estimation of object 6. Similar directional weighting function values can be found in Fig. 5(b), where objects 1, 3, 4 with high visual-displacement confidences are able to correct tracking errors of object 5 with low visual-displacement confidence.
For our unary terms, we utilize forward propagation of the visual-displacement CNN for calculating objects' estimated displacements and displacement confidences. After the unary term inference, the overall maximum posterior marginal inference is achieved by mean-field approximation. This approximation yields an iterative message passing for approximate inference. Our unary terms and pairwise terms are both of quadratic form. The energy function is convex and the optimal displacement is obtained as the mean value of the energy function,
In each iteration, node $i$ receives messages from all other objects to update its displacement estimation. The mean-field approximation usually converges in 5-10 iterations. The above displacement update equation clearly shows the differences between the messages transmitted from object $i$ to object $j$ and those from object $j$ to object $i$ because of the asymmetric weighting functions. For a pair of objects, $w^{(k)}_{i,j}$ and $w^{(k)}_{j,i}$ are generally different. Even if , when , object has greater influence to than that from to .
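The iterative message passing can be sketched as below for one displacement axis. This is a simplified illustration under stated assumptions, not the paper's exact update: it assumes a single weighting function per direction, uses the weighted-mean form implied by quadratic unary and pairwise terms, and the names `mean_field`, `w_pair`, and `delta_prev` are introduced here. `w_pair[i][j]` is the weight of the message that object `j` sends to object `i`, and `delta_prev[i][j]` is the previous displacement difference between `i` and `j`.

```python
def mean_field(f, w_u, w_pair, delta_prev, iterations=5):
    """Iteratively refine displacements from unary estimates and pairwise messages.

    f          : visually estimated displacement of each object (one axis)
    w_u        : unary confidence weight of each object (must be > 0)
    w_pair     : w_pair[i][j] weights the message from object j to object i
    delta_prev : delta_prev[i][j] = previous displacement of i minus that of j
    """
    n = len(f)
    d = list(f)  # start from the visual estimates
    for _ in range(iterations):
        new_d = []
        for i in range(n):
            num = w_u[i] * f[i]
            den = w_u[i]
            for j in range(n):
                if j == i:
                    continue
                # Object j suggests that i keeps the previous speed difference.
                num += w_pair[i][j] * (d[j] + delta_prev[i][j])
                den += w_pair[i][j]
            new_d.append(num / den)
        d = new_d
    return d
```

In a two-object example where object 0 is confident (unary weight 1.0, estimate 5.0) and object 1 is not (unary weight 0.01, estimate 0.0), a strong message from 0 to 1 and a weak one from 1 to 0 pull object 1 towards the displacement of object 0 while leaving object 0 almost unchanged.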
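A detailed derivation of Eq. (7) is given as follows. The mean-field method approximates the distribution $P(\mathbf{d} \mid \mathbf{r}, I)$ with a distribution $Q(\mathbf{d} \mid \mathbf{r}, I)$, which can be expressed as a product of independent marginals, $Q(\mathbf{d} \mid \mathbf{r}, I) = \prod_i Q_i(d_i \mid \mathbf{r}, I)$. The optimal approximation is obtained by minimizing the Kullback-Leibler (KL) divergence between $Q$ and $P$. The solution for $Q_i$ has the following form,
$$\log Q_i(d_i \mid \mathbf{r}, I) = -\mathbb{E}_{j \neq i}\big[E(\mathbf{d}, \mathbf{r}, I)\big] + \mathrm{const},$$
where $\mathbb{E}_{j \neq i}$ denotes expectation under the $Q$ distributions over all variables $d_j$ for $j \neq i$. The inference is formulated as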
Each $\log Q_i$ is a quadratic form with respect to $d_i$, and its mean is therefore
The inference task is to minimize . Since we approximate conditional distribution with product of independent marginals, an estimate of each is obtained as the expected value of the corresponding quadratic function,
III-B The Overall MOT Algorithm
The overall algorithm with our proposed DCCRF is shown in Algorithm 1. At each time $t$, the DCCRF inputs are existing tracklets up to time $t$, and consecutive frames at time $t$ and time $t+1$. It outputs each tracklet's displacement estimation. After obtaining displacement estimations for each tracklet by DCCRF, its estimated location at time $t+1$ can be simply calculated as the summation of its location $p_i^{t}$ at time $t$ and its estimated displacement $d_i$, i.e.,
$$p_i^{t+1} = p_i^{t} + d_i. \tag{12}$$
Based on such estimated locations, we utilize a visual-similarity CNN (Section III-B1) as well as the Intersection-over-Union value as the criterion for tracklet-detection association to generate longer tracklets (Section III-B2). To make our online MOT system complete, we also specify our detailed strategies for tracklet initialization (Section III-B3), occlusion handling and tracklet termination (Section III-B4).
III-B1 Visual-similarity CNN
The tracklet-detection associations need to be determined based on visual cues and spatial cues simultaneously. We propose a visual-similarity CNN for calculating visual similarities between image patches cropped at bounding box locations in the same frame. The visual-similarity CNN has a network structure similar to that of our visual-displacement CNN in Section III-A1. However, the network takes image patches in the same video frame as inputs and outputs the confidence that the input pair represents the same object. It is therefore trained with a binary cross-entropy loss. In addition, the training samples are generated differently for the visual-similarity CNN. Instead of cropping two consecutive video frames at the same bounding box locations as for the visual-displacement CNN, the visual-similarity CNN requires positive pairs to be cropped at different locations of the same object at any time in the same video, and negative pairs to be image patches belonging to different objects. For cropping image patches, we do not enlarge the object's bounding box, which is also different from our visual-displacement CNN. During training, the ratio between positive and negative pairs is set to : and the network is trained similarly to the visual-displacement CNN.
III-B2 Tracklet-detection association
Given the estimated tracklet locations and the detection boxes at time $t+1$, tracklets are associated with detection boxes based on the visual and spatial similarities between them. The associated detection boxes can then be appended to their corresponding tracklets to form longer ones up to time $t+1$. Consider the $i$th tracklet's estimated location and the $j$th detection box at time $t+1$. Their visual similarity is calculated by the visual-similarity CNN in Section III-B1. The spatial similarity between the estimated tracklet location and the detection box is measured as their box Intersection-over-Union (IoU) value. If a detection box could be associated with multiple tracklets, the Hungarian algorithm is utilized to determine the optimal associations with the following overall similarity,
where $\lambda$ is the weight balancing the visual and spatial similarities and is set to 1 in our experiments. After the box association by the Hungarian algorithm, if a tracklet is associated with a detection box that has an IoU value greater than 0.5 with it, the associated detection box is directly appended to the end of the tracklet. If the IoU value is between 0.3 and 0.5, the average of the associated detection box and the estimated tracklet box is appended to the tracklet to compensate for the possibly noisy detection box. If the IoU value is smaller than 0.3, the tracklet might be considered as being terminated or temporarily occluded (Section III-B4).
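The association and box-update rules can be sketched as follows. Several parts are assumptions made for illustration: the overall similarity is taken to be the visual similarity plus the weight times the IoU, boxes are `(x1, y1, x2, y2)` tuples, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same optimum, but is only practical for a handful of boxes).

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(pred_boxes, det_boxes, visual_sim, lam=1.0):
    """Return, for each tracklet, the index of its assigned detection.

    Brute-force stand-in for the Hungarian algorithm; requires
    len(pred_boxes) <= len(det_boxes). The additive score is an assumed form.
    """
    n, m = len(pred_boxes), len(det_boxes)
    score = [[visual_sim[i][j] + lam * iou(pred_boxes[i], det_boxes[j])
              for j in range(m)] for i in range(n)]
    best = max(permutations(range(m), n),
               key=lambda p: sum(score[i][p[i]] for i in range(n)))
    return list(best)

def updated_box(pred, det):
    """Apply the IoU thresholds from the text to pick the box to append."""
    overlap = iou(pred, det)
    if overlap > 0.5:
        return det  # trust the detection
    if overlap >= 0.3:
        return tuple((p + q) / 2 for p, q in zip(pred, det))  # average both boxes
    return None  # left to occlusion handling or termination
```

Returning `None` for low overlap keeps the decision about occlusion or termination outside the association step, matching the separation into subsections in the text.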
III-B3 Tracklet initialization
If an object detection box at time $t$ is not associated to any tracklet in the above tracklet-detection association step, it is treated as a candidate box for initializing new tracklets. For each such candidate box at time $t$, its visually inferred displacement between time $t$ and $t+1$ is first obtained by our visual-displacement CNN in Section III-A1. Its estimated box location can be easily calculated following Eq. (12). The visual similarities and spatial similarities between the estimated box at $t+1$ and candidate boxes at $t+1$ are calculated. To form a new candidate tracklet, the candidate box at time $t$ is only associated with a candidate box at time $t+1$ that has 1) an IoU greater than 0.3 and 2) a visual similarity greater than 0.8 with its estimated box location. If there are multiple candidate associations, the Hungarian algorithm is utilized to associate the candidate box at $t$ to its optimal candidate association at $t+1$ according to the overall similarities (Eq. (13)). If none of the candidate associations at time $t+1$ satisfies the above two conditions with the candidate box at $t$, the candidate box is ignored and is not used for tracking initialization. Such operations are iterated over time to generate longer candidate tracklets. If a candidate tracklet is over frames ( for pedestrian tracking with 25-fps videos), it is initialized as a new tracklet.
III-B4 Occlusion handling and tracklet termination
If a past tracklet is not associated to any detection box at time $t+1$, the tracked object is considered as being possibly occluded or temporarily missed. For a possibly occluded object, we directly associate its past tracklet to its location at time $t+1$ estimated by our DCCRF to create a virtual tracklet. The same operation is iterated for frames, i.e., if the virtual tracklet is not associated to any detection box for more than time steps, the virtual tracklet is terminated. For pedestrian tracking, we empirically set .
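The virtual-tracklet bookkeeping can be illustrated with a small state holder. This is a sketch under assumptions: the class name `Tracklet`, its fields, and the `max_missed` limit are hypothetical (the text sets the limit empirically), and resetting the miss counter on a successful association is a design choice made here.

```python
class Tracklet:
    """Minimal tracklet state with virtual extension during missed detections."""

    def __init__(self, box, max_missed):
        self.boxes = [box]
        self.missed = 0               # consecutive steps without a detection
        self.max_missed = max_missed  # hypothetical limit on virtual steps
        self.active = True

    def step(self, detection, predicted_box):
        """Advance one time step; detection is None when nothing was associated."""
        if detection is not None:
            self.boxes.append(detection)
            self.missed = 0
            return
        self.missed += 1
        if self.missed > self.max_missed:
            self.active = False  # terminate the virtual tracklet
        else:
            self.boxes.append(predicted_box)  # extend with the estimated location
```

A caller would pass the associated detection (or `None`) together with the location estimated for that step, and drop tracklets whose `active` flag turns false.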
In this section, we present experimental results of the proposed online MOT algorithm. We first introduce evaluation datasets and implementation details for our proposed framework in Sections IV-A and IV-B. In Section IV-C, we compare the proposed method with state-of-the-art approaches on the public MOT datasets. The individual components of our proposed method are evaluated in Section IV-D.
IV-A Datasets and Evaluation Metric
We conduct experiments on the 2DMOT15 and 2DMOT16 benchmarks, which are widely used to evaluate the performance of MOT methods. Both of them have two tracks: public detection boxes [2, 3, 24] and private detection boxes [30, 31]. To compare only the performance of the tracking algorithms, we evaluate our method with the provided public detection boxes.
2DMOT15 is one of the largest datasets with moving or static cameras, different viewpoints and different weather conditions. It contains a total of 22 sequences, half for training and half for testing, with a total of 11286 frames (or 996 seconds). The training sequences contain over 5500 frames, 500 annotated trajectories and 39905 annotated bounding boxes. The testing sequences contain over 5700 frames, 721 annotated trajectories and 61440 annotated bounding boxes. The public detection boxes in 2DMOT15 are generated with aggregated channel features (ACF).
2DMOT16 is an extension to 2DMOT15. Compared to 2DMOT15, new sequences are added and the dataset contains almost 3 times more bounding boxes for training and testing. Most sequences are in high resolution, and the average number of pedestrians in each video frame is 3 times higher than that of 2DMOT15. In 2DMOT16, deformable part model (DPM) based methods are used to generate public detection boxes, which are more accurate than the boxes in 2DMOT15.
IV-A3 Evaluation Metric
For the quantitative evaluation, we adopt the popular CLEAR MOT metrics , which include:
MOTA: Multiple Object Tracking Accuracy. This metric is usually chosen as the main performance indicator for MOT methods. It combines three types of errors: false positives, false negatives, and identity switches.
MOTP: Multiple Object Tracking Precision. The misalignment between the annotated and the predicted bounding boxes.
MT: Mostly Tracked targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span.
ML: Mostly Lost targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span.
FP: The total number of false positives.
FN: The total number of false negatives (missed targets).
ID Sw: The total number of identity switches. Please note that we follow the stricter definition of identity switches as described in MOT challenge.
Frag: The total number of times a trajectory is fragmented (i.e., interrupted during tracking).
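For reference, MOTA in the CLEAR MOT metrics combines the three error counts listed above relative to the total number of ground-truth boxes. The helper below is a minimal sketch; the function and argument names are chosen here for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """MOTA = 1 - (FN + FP + ID Sw) / (total number of ground-truth boxes)."""
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes
```

Because the error count can exceed the number of ground-truth boxes, MOTA can be negative; it is usually reported as a percentage.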
IV-B Implementation details
IV-B1 Training schemes and setting
For both the visual-displacement CNN and the visual-similarity CNN, we adopt ResNet-101 as the network structure and replace the topmost layer to output displacement confidence or same-object confidence. Both CNNs are pretrained on the ImageNet dataset. For cropping image patches from , we enlarge each detection box by a factor of 5 in width and 2 in height to obtain . Image patches for the two CNNs are cropped at the same locations from consecutive frames as described in Section III-A1, which are then resized to as the CNN inputs.
We train our proposed DCCRF in three stages. In the first stage, the proposed visual-displacement CNN is trained with the cross-entropy loss and mini-batch Stochastic Gradient Descent (SGD) with a batch size of 5. The initial learning rate is set to and is decreased by a factor of 1/10 every 50,000 iterations. The training generally converges after 600,000 iterations. In the second stage, the learned visual-displacement CNN from stage 1 is fixed and the other parameters in our DCCRF are trained with loss,
where and are estimated displacements and the ground-truth displacements for the th tracked object. In the final stage, the DCCRF is trained in an end-to-end manner with the above loss and the cross-entropy loss for visual-displacement CNN in unary terms. We find that 5 iterations of the mean-field approximation generate satisfactory results. The DCCRF is trained with an initial learning rate of
, which is decreased by a factor of 1/3 every 5,000 iterations. The training typically converges after 3 epochs.
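The two learning-rate schedules described above are step decays, which can be sketched as below. The base learning rate `BASE_LR` is a hypothetical placeholder, since the initial values are not recoverable from the text; only the step sizes and decay factors come from the description.

```python
def step_lr(base_lr, iteration, step_size, gamma):
    """Step decay: multiply the learning rate by gamma once every step_size iterations."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    return base_lr * gamma ** (iteration // step_size)


BASE_LR = 1e-3  # hypothetical value, for illustration only

# Visual-displacement CNN: multiplied by 1/10 every 50,000 iterations.
cnn_lrs = [step_lr(BASE_LR, it, 50_000, 0.1) for it in (0, 50_000, 100_000)]

# DCCRF: multiplied by 1/3 every 5,000 iterations.
crf_lrs = [step_lr(BASE_LR, it, 5_000, 1.0 / 3.0) for it in (0, 5_000, 10_000)]
```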
Our code is implemented with MATLAB and Caffe. The overall tracking speed of the proposed method on the MOT16 test sequences is 0.1 fps using a 2.4 GHz CPU and a Maxwell TITAN X GPU, without any acceleration libraries.
IV-B2 Data augmentation
To introduce more variation into the training data and thus reduce possible overfitting, we augment the training data. For pre-training the visual-displacement CNN, the inputs are image patches centered at detection boxes. We augment the training samples by random flipping and by randomly shifting the cropping positions by a bounded fraction of the detection box width or height in the x and y dimensions, respectively. For end-to-end training of the DCCRF, in addition to random flipping of whole video frames, the time interval between the two input video frames is randomly sampled from a fixed range to generate more frame pairs with larger possible displacements between them.
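The augmentation steps can be sketched as follows. This is a minimal illustration under several assumptions: the maximum shift fraction `max_frac` and the maximum frame gap `max_gap` are hypothetical parameters (their actual values are not recoverable from the text), flipping is assumed to be horizontal, and all function names are illustrative.

```python
import random


def jitter_crop_center(cx, cy, box_w, box_h, max_frac, rng):
    """Shift the crop center by at most max_frac of the box width (x) and height (y)."""
    dx = rng.uniform(-max_frac, max_frac) * box_w
    dy = rng.uniform(-max_frac, max_frac) * box_h
    return cx + dx, cy + dy


def flip_box_horizontally(x, w, img_w):
    """Left edge of a box after mirroring an image of width img_w left-right."""
    return img_w - (x + w)


def sample_frame_pair(num_frames, max_gap, rng):
    """Pick two frame indices whose time interval is drawn uniformly from 1..max_gap."""
    if num_frames <= max_gap:
        raise ValueError("sequence too short for the requested gap")
    gap = rng.randint(1, max_gap)            # inclusive on both ends
    t = rng.randrange(0, num_frames - gap)   # ensures t + gap is a valid index
    return t, t + gap


rng = random.Random(0)  # fixed seed so the sketch is reproducible
center = jitter_crop_center(50.0, 60.0, 20.0, 40.0, max_frac=0.1, rng=rng)
pair = sample_frame_pair(num_frames=30, max_gap=5, rng=rng)
```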
Tables I and II: Quantitative results on 2DMOT15 and 2DMOT16.
|Tracking Mode||Method||MOTA||MOTP||MT||ML||FP||FN||ID Sw||Frag|
Fig. 6: Qualitative results on (a) MOT16-03, (b) MOT16-06, (c) MOT16-07, (d) MOT16-08, (e) MOT16-12, (f) MOT16-14.
IV-C Quantitative results on 2DMOT15 and 2DMOT16
On the MOT2015 and MOT2016 datasets, we test our proposed method and compare it with state-of-the-art MOT methods, including SMOT, MDP, SCEA, CEM, RNN_LSTM, RMOT, TC_ODAL, CNNTCM, SiameseCNN, oICF, NOMT, and CDA_DDAL. (Note that only methods in peer-reviewed publications are compared in this paper; arXiv papers that have not undergone peer review are not included.) The results of the compared methods are listed in Tables I and II. We focus on the MOTA value as the main performance indicator, which is a weighted combination of false negatives (FN), false positives (FP) and identity switches (ID Sw). Offline methods generally have higher MOTA than online methods because they can utilize not only past but also future information for object tracking; they are listed here only for reference. Our proposed online MOT method outperforms all compared online methods and most offline methods [2, 3, 39, 24, 25]. As shown by the quantitative results, our proposed method is able to alleviate the difficulties caused by object mis-detections, noisy detections, and short-term occlusion. The qualitative results are shown in Fig. 6.
Compared with SCEA, which also models inter-object interactions and speed differences to handle mis-detections caused by global camera motion, our learned DCCRF shows better performance, especially in FN, because our more accurate displacement prediction is able to recover more mis-detections. Our proposed method also outperforms MDP in terms of MOTA and FP by a large margin. MDP learns to predict four target states (active, tracked, lost and inactive) for each tracked object. However, it models each tracked object’s movement pattern only with a constant speed assumption, which is likely to result in false tracklet-detection associations and thus increases FP. CDA_DDAL focuses on using discriminative visual features learned by a siamese CNN for tracklet-detection associations, which is not robust to occlusions and tends to increase FN. Compared with DCO_X and LTTSC-CRF, which also use conditional random field approximation to solve MOT problems, our proposed DCCRF achieves a clearly higher MOTA.
However, our method produces more ID switches than some of the compared methods, which is due to long-term occlusions that our method cannot handle.
|Method||MOTA||FP||FN||ID Sw|
|DCCRF w/o size-asym||43.6%||8063||93724||1035|
|DCCRF w/o cfd-asym||43.8%||7353||94163||969|
|DCCRF w/ symmetry||43.4%||9100||93076||1104|
IV-D Component analysis on 2DMOT16
To analyze the effectiveness of different components in our proposed framework, we also design a series of baseline methods for comparison. The results of these baselines and our final method are reported in Table III. As in the above experiments, we focus on the MOTA value as the main performance indicator.
1) Unary-only: this baseline utilizes only the unary terms of our DCCRF, i.e., the visual-displacement CNN, within our overall MOT algorithm. Such a baseline considers only the tracked objects’ appearance information. Compared with our proposed DCCRF, it has a MOTA drop, which indicates that the inter-object relations are crucial for regularizing each object’s estimated displacement and should not be ignored.
2) Unary-only+L1-loss (reg): since our visual-displacement CNN is trained with the proposed cross-entropy loss instead of the conventional L1 or L2 losses for regression problems, we train a visual-displacement CNN with the smooth L1 loss and test it in the same way as the unary-only baseline. Compared with the unary-only baseline, it has a significant MOTA drop, which demonstrates that our proposed cross-entropy loss results in much better displacement estimation accuracy.
3) DCCRF w/o cfd-asym and DCCRF w/o size-asym: the weighting functions of the pairwise term in our proposed DCCRF have two terms, a confidence-asymmetric term and a size-asymmetric term. Each of these baselines uses only one of the two terms in the DCCRF’s pairwise terms. Both baselines show a MOTA drop compared with our proposed DCCRF, which validates the need for both terms in the weighting functions.
4) DCCRF w/ symmetry: this baseline replaces the asymmetric pairwise term in our DCCRF with a symmetric one, whose weights are Gaussian kernels on the objects’ center positions with learnable bandwidth parameters. Such a symmetric term assumes that the speed differences between close-by objects should be better maintained across time, while those between far-away objects are less regularized. There is a MOTA drop compared with our proposed DCCRF, which shows that our asymmetric term is beneficial for the final performance. We also try to directly replace the sigmoid function in Eq. (5) with a Gaussian-like function in the weighting function (Eq. (15)), which results in even worse performance.
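To illustrate what "symmetric" means for the last baseline, the sketch below computes pairwise weights from a Gaussian kernel on the distance between object centers. The exact functional form of the baseline is not recoverable from the text, so the per-axis bandwidths `sigma_x` and `sigma_y` (learnable in the paper, fixed constants here) and the function name are assumptions.

```python
import math


def symmetric_weights(centers, sigma_x, sigma_y):
    """Pairwise weights w[i][j] from a Gaussian kernel on center distances.

    The weight depends only on the offset between two centers, so w[i][j] == w[j][i]:
    both objects in a pair influence each other with the same magnitude.
    """
    n = len(centers)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-interaction
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            w[i][j] = math.exp(-(dx * dx / (2 * sigma_x ** 2) + dy * dy / (2 * sigma_y ** 2)))
    return w


# Two close-by objects and one far-away object (coordinates in pixels, made up).
weights = symmetric_weights([(0.0, 0.0), (10.0, 0.0), (200.0, 0.0)], sigma_x=50.0, sigma_y=50.0)
```

Close-by pairs receive weights near 1 and far-away pairs receive weights near 0, which matches the stated assumption that speed differences between close-by objects are regularized more strongly.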
In addition to the above, we also conduct experiments to analyze the effects of different hyper-parameters and to show the robustness of our DCCRF. 1) The first hyper-parameter controls the weight between the visual-similarity term and the DCCRF location prediction term for tracklet-detection association in Eq. (13). We test three different values and report the results in Table IV, which show that the final performance is not sensitive to this value. 2) The second hyper-parameter is the length that a candidate tracklet must reach to create an actual tracklet (Section III-B3). We additionally test another value in Table V, which shows a slight performance drop, because a larger value causes more low-confidence detections to be ignored. 3) The third hyper-parameter is the number of consecutive frames in which an object is missing before its associated tracklet is terminated (Section III-B4). We additionally test another value, and the results in Table VI show that the performance is not sensitive to this choice.
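The roles of the second and third hyper-parameters can be illustrated with a small tracklet life-cycle sketch. It is not the paper's implementation: the class name, the parameter names `init_len` and `max_missed`, the example values, and the choice to apply the same termination rule to candidate tracklets are all assumptions made for illustration.

```python
class TrackletState:
    """Minimal tracklet life cycle: candidate -> active -> terminated."""

    def __init__(self, init_len, max_missed):
        self.init_len = init_len      # matches needed before a candidate becomes an actual tracklet
        self.max_missed = max_missed  # consecutive missed frames before termination
        self.status = "candidate"
        self.hits = 0
        self.missed = 0

    def update(self, matched):
        """Advance one frame; matched says whether a detection was associated."""
        if self.status == "terminated":
            return self.status
        if matched:
            self.hits += 1
            self.missed = 0  # the miss counter only counts consecutive frames
            if self.status == "candidate" and self.hits >= self.init_len:
                self.status = "active"
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.status = "terminated"
        return self.status


track = TrackletState(init_len=3, max_missed=2)  # hypothetical values
history = [track.update(m) for m in (True, True, True, False, False)]
print(history)  # ['candidate', 'candidate', 'active', 'active', 'terminated']
```

A larger `init_len` delays the creation of actual tracklets, which is consistent with the observation that more low-confidence detections are then ignored.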
In this paper, we present the Deep Continuous Conditional Random Field (DCCRF) model with asymmetric inter-object constraints for solving the MOT problem. The unary terms are modeled as a visual-displacement CNN that estimates object displacements across time with visual information. The asymmetric pairwise terms regularize inter-object speed differences across time with both size-based and confidence-based weighting functions to weight more on high-confidence tracklets to correct tracking errors. By jointly training the two terms in DCCRF, the relations between objects’ individual movement patterns and complex inter-object constraints can be better modeled and regularized to achieve more accurate tracking performance. Extensive experiments demonstrate the effectiveness of our proposed MOT framework as well as the individual components of our DCCRF.
-  W. Luo, J. Xing, X. Zhang, X. Zhao, and T. K. Kim, “Multiple object tracking: A literature review,” arXiv preprint arXiv:1409.7618, 2014.
-  Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4705–4713.
-  J. Hong Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, “Online multi-object tracking via structural constraint event aggregation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1392–1400.
-  S. Tang, B. Andres, M. Andriluka, and B. Schiele, “Multi-person tracking by multicut and deep matching,” arXiv preprint arXiv:1608.05404, 2016.
-  S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 261–268.
-  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
-  L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn, “Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 120–127.
-  X. Chen, Z. Qin, L. An, and B. Bhanu, “Multiperson tracking by online learned grouping model with nonlinear motion context,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 12, pp. 2226–2239, 2016.
-  L. Zhang and L. Van Der Maaten, “Preserving structure in model-free tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 756–769, 2014.
-  G. Duan, H. Ai, S. Cao, and S. Lao, “Group tracking: Exploring mutual relations for multiple object tracking,” Computer Vision–ECCV 2012, pp. 129–143, 2012.
-  J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon, “Bayesian multi-object tracking using motion context from multiple objects,” in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 33–40.
-  H. Grabner, J. Matas, L. Van Gool, and P. Cattin, “Tracking the invisible: Learning where the object might be,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1285–1292.
-  S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
-  B. Yang and R. Nevatia, “An online learned crf model for multi-target tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2034–2041.
-  A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1926–1933.
-  A. Milan, K. Schindler, and S. Roth, “Detection- and trajectory-level exclusion in multiple object tracking,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
-  H. Li, Y. Li, and F. Porikli, “Robust online visual tracking with a single convolutional neural network,” in Asian Conference on Computer Vision. Springer, 2014, pp. 194–209.
-  N. Wang and D.-Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Advances in neural information processing systems, 2013, pp. 809–817.
-  S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” in International Conference on Machine Learning, 2015, pp. 597–606.
-  L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
-  R. Tao, E. Gavves, and A. W. Smeulders, “Siamese instance search for tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1420–1429.
-  L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” arXiv preprint arXiv:1606.09549, 2016.
-  A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler, “Online multi-target tracking using recurrent neural networks.” in AAAI, 2017, pp. 4225–4232.
-  S.-H. Bae and K.-J. Yoon, “Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler, “Learning by tracking: Siamese cnn for robust target association,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 33–40.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 fps with deep regression networks,” in European Conference on Computer Vision. Springer, 2016, pp. 749–765.
-  A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
-  L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “Motchallenge 2015: Towards a benchmark for multi-target tracking,” arXiv preprint arXiv:1504.01942, 2015.
-  R. Henschel, L. Leal-Taixé, D. Cremers, and B. Rosenhahn, “A novel multi-detector fusion framework for multi-object tracking,” arXiv preprint arXiv:1705.08314, 2017.
-  F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan, “Poi: Multiple object tracking with high performance detection and appearance feature,” in European Conference on Computer Vision Workshops, 2016, pp. 36–42.
-  X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang et al., “Crafting gbd-net for object detection,” arXiv preprint arXiv:1610.02579, 2016.
-  C. Dicle, O. I. Camps, and M. Sznaier, “The way they move: Tracking multiple targets with similar appearance,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2304–2311.
-  A. Milan, S. Roth, and K. Schindler, “Continuous energy minimization for multitarget tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 58–72, 2014.
-  A. Milan, K. Schindler, and S. Roth, “Multi-target tracking by discrete-continuous energy minimization,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2054–2068, 2016.
-  B. Wang, L. Wang, B. Shuai, Z. Zuo, T. Liu, K. Luk Chan, and G. Wang, “Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 1–8.
-  W. Choi, “Near-online multi-target tracking with aggregated local flow descriptor,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3029–3037.
-  S.-H. Bae and K.-J. Yoon, “Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1218–1225.
-  H. Kieritz, S. Becker, W. Hübner, and M. Arens, “Online multi-person tracking using integral channel features,” in Advanced Video and Signal Based Surveillance (AVSS), 2016 13th IEEE International Conference on. IEEE, 2016, pp. 122–130.
-  A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun, “3d traffic scene understanding from movable platforms,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 5, pp. 1012–1025, 2014.
-  N. Le, A. Heili, and J.-M. Odobez, “Long-term time-sensitive costs for crf-based tracking by detection,” in Computer Vision–ECCV 2016 Workshops, Part II, vol. 9914, no. EPFL-CONF-221401. Springer International Publishing AG, 2016, pp. 43–51.
-  L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle, “Improving multi-frame data association with sparse representations for robust near-online multi-object tracking.” in ECCV (8), 2016, pp. 774–790.
-  C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, “Multiple hypothesis tracking revisited,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4696–4704.
-  Y. Ban, S. Ba, X. Alameda-Pineda, and R. Horaud, “Tracking multiple persons based on a variational bayesian model,” in ECCV Workshop on Benchmarking Multiple Object Tracking, 2016.
-  R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, “Online multi-target tracking with strong and weak detections.” in ECCV Workshops (2), 2016, pp. 84–99.