I. Introduction
Robust tracking of multiple objects [1]
is a challenging problem in computer vision and acts as an important component of many real-world applications. It aims to reliably recover trajectories and maintain identities of objects of interest in an image sequence. State-of-the-art Multi-Object Tracking (MOT) methods
[2], [3] mostly utilize the tracking-by-detection strategy because of its robustness against tracking drift. Such a strategy generates per-frame object detection results from the image sequence and associates the detections into object trajectories, which also makes it able to handle newly appearing objects. Tracking-by-detection methods can be categorized into offline and online methods. The offline methods [4] use detection results from both past and future frames with global optimization techniques for linking detections into object trajectories. The online methods, on the other hand, use only detection results up to the current time to incrementally generate object trajectories. Our proposed method focuses on online MOT, which is more suitable for real-time applications such as autonomous driving and intelligent surveillance.

In MOT, tracked objects usually show consistent or slowly varying appearance across time. Visual features of the objects are therefore important cues for associating detection boxes into tracklets. In recent years, deep learning techniques have shown great potential in learning discriminative visual features for single-object and multi-object tracking. However, visual cues alone cannot guarantee robust tracking results. When tracked objects with similar appearances occlude or are close to each other, their trajectories might be wrongly associated with other objects. In addition, there also exist misdetections or inaccurate detections caused by imperfect object detectors. Such difficulties escalate when the camera is hand-held or mounted on a car: each object then moves according to its own movement pattern as well as the global camera motion. Previous works explored these problems by modeling interactions between tracked objects in the optimization model. For online MOT methods, there were investigations on modeling inter-object interactions with social force models
[5, 6, 7], relative spatial and speed differences [8, 9, 10], and relative motion constraints [3, 11]. Most previous methods model pairwise inter-object interactions in symmetric mathematical forms, i.e., pairs of objects influence each other with the same magnitude.

However, such pairwise object interactions should be directional and modeled in an asymmetric form. For instance, large-size detection boxes are more likely to be noisy (if measured in actual pixels). Smaller boxes should therefore influence larger boxes more than large ones influence small ones, because the smaller ones usually provide more accurate localization for objects. Similarly, high-confidence trajectories should influence low-confidence ones more, and low-confidence ones should have minimal impact on high-confidence ones. In this way, the more accurate detections or trajectories can help correct errors of the inaccurate ones without being much affected by them. Moreover, in existing methods, individual objects' movements and inter-object interactions are usually modeled separately; the relation between the two terms is mostly manually tuned and not effectively studied in a unified framework.
To tackle these difficulties, we propose a Deep Continuous Conditional Random Field (DCCRF) with asymmetric inter-object constraints for solving the problem of online MOT. The DCCRF inputs a pair of consecutive images at time t−1 and time t, and the tracked objects' past trajectories up to time t−1. It estimates the locations of the tracked objects at time t. The DCCRF optimizes an objective function with two terms: the unary terms, which estimate individual objects' movement patterns, and the asymmetric pairwise terms, which model interactions between tracked objects. The unary terms are modeled by a deep Convolutional Neural Network (CNN), which is trained to estimate each individual object's displacement between time t−1 and time t from the object's visual appearance. The asymmetric pairwise terms aim to tackle the problems caused by object occlusions, object misdetections and global camera motion. For two neighboring tracked trajectories, the pairwise influence is different along each direction, letting the high-confidence trajectory assist the low-confidence one more. Our proposed DCCRF utilizes mean-field approximation for inference and is trained in an end-to-end manner to estimate the optimal displacement for each tracked object. Based on such estimated object locations, a final visual-similarity CNN is proposed for generating the final detection association results.
The contribution of our proposed online MOT framework is twofold. (1) A novel DCCRF model is proposed for solving the online MOT problem. Each object's individual movement patterns as well as inter-object interactions are studied in a unified framework and trained in an end-to-end manner. In this way, the unary terms and pairwise terms of our DCCRF can better adapt to each other to achieve more accurate tracking performance. (2) An asymmetric inter-object interaction term is proposed to model the directional influence between pairs of objects, which aims to correct errors of low-confidence trajectories while maintaining the estimated displacements of high-confidence ones. Extensive experiments on two public datasets show the effectiveness of our proposed MOT framework.
II. Related Work
A large number of methods have been proposed for solving the multi-object tracking problem. We focus on reviewing online MOT methods that utilize interaction constraints, as well as single-object and multi-object tracking algorithms based on deep neural networks.
Interaction models for MOT. Social force models were adopted in MOT methods [5, 6, 7] to model pairwise interactions (attraction and repulsion) between objects. These methods required objects' 3D positions for modeling inter-object interactions, which were obtained by visual odometry.
Grabner et al. [12] assumed that the relative positions between feature points and objects were more or less fixed over short time intervals. A generalized Hough transform was therefore used to predict each target's location with the assistance of supporting feature points. Duan et al. [10] proposed mutual relation models to describe the spatial relations between tracked objects and to handle occluded objects. Such constraints were learned by an online structured SVM. Zhang and van der Maaten [9] incorporated spatial constraints between objects into an MOT framework to track objects with similar appearances.
The CRF algorithm [13] has been used frequently in segmentation tasks to model the relationships between different pixels in the spatial domain. There were also many works that modeled the multi-object tracking problem with CRF models. Yang and Nevatia [14] proposed an online-learned CRF model for MOT, and assumed linear and smooth motion of the objects to associate past and future tracklets. Andriyenko et al. [15] modeled multi-object tracking as optimizing discrete and continuous CRF models: a continuous CRF was used for enforcing motion smoothness, and a discrete CRF with a temporal-interaction pairwise term was optimized for data association. Milan et al. [16] designed new CRF potentials for modeling spatio-temporal constraints between pairs of trajectories to tackle detection- and trajectory-level occlusions.
Deep learning based object tracking. Most existing deep learning based tracking methods focused on single-object tracking, because deep neural networks are able to learn powerful visual features for distinguishing the tracked object from the background and other similar objects. Early single-object tracking methods [17], [18] with deep learning focused on learning discriminative appearance features by online training. However, due to the large learning capacity of deep neural networks, such models easily overfit the data. The methods of [19], [20]
pretrained deep convolutional neural networks on large-scale image datasets to learn discriminative visual features, and updated the classifier online with new training samples. More recently, methods that do not require model updating were proposed. Tao et al.
[21] utilized Siamese CNNs to determine visual similarities between image patches for tracking. Bertinetto et al. [22] changed the network into a fully convolutional setting and achieved real-time running speed.

Recently, deep models have also been applied to multi-object tracking. Milan et al. [23] proposed an online MOT framework with two RNNs. One RNN was used for state (object locations, motions, etc.) prediction and update, and the other for associating objects across time. However, this method did not utilize any visual features and relied solely on the spatial locations of the detection results. The methods of [24], [25] replaced hand-crafted features (e.g., color histograms) with features learned from image patches by a Siamese CNN, which increased the discriminative ability. However, those methods focused on modeling individual objects' movement patterns with deep learning; inter-object relations were not integrated into the deep neural networks.
III. Method
The overall framework of our proposed MOT method is illustrated in Fig. 1. We propose a Deep Continuous Conditional Random Field (DCCRF) model for solving the online MOT problem. At each time t, the framework takes past tracklets up to time t−1 and detection boxes at time t as inputs, and generates new tracklets up to time t. At each time step, new tracklets are also initialized and existing tracklets are terminated if tracked objects disappear from the scene.
The core components of the proposed DCCRF are its unary terms and asymmetric pairwise terms. The unary terms of our DCCRF are modeled by a deep CNN that estimates each tracked object's displacement between consecutive times t−1 and t. The asymmetric pairwise terms aim to model inter-object interactions, which consider differences of speed, visual confidence, and object size between neighboring objects. Unlike the interaction terms in existing MOT methods, which treat inter-object interactions in a symmetric way, asymmetric relation terms are proposed in our DCCRF. For pairs of tracklets in our DCCRF model, the proposed asymmetric pairwise term models the two directions differently, so that high-confidence trajectories with small-size detection boxes can help correct errors of low-confidence trajectories with large-size detection boxes. Based on the object displacements estimated by the DCCRF, we adopt a visual-similarity CNN and the Hungarian algorithm to obtain the final tracklet-detection associations.
III-A Deep Continuous Conditional Random Field (DCCRF)
The proposed DCCRF takes object trajectories up to time t−1 and the video frame at time t as inputs, and outputs each tracked object's displacement between time t−1 and time t. Let X = {x_1, …, x_N} be a random field defined over a set of variables, where each variable x_i represents the visual and motion information of one object tracklet. Let Y = {y_1, …, y_N} be another random field, where each variable y_i represents the displacement of an object between time t−1 and time t. The domain of each variable y_i is the two-dimensional space ℝ², denoting the x- and y-dimensional displacements of tracked objects. Let I denote the new video frame at time t.
The goal of our conditional random field is to maximize the following conditional distribution,
P(Y | X, I) = (1 / Z(X, I)) exp( −E(Y | X, I) ),  (1)
where E(Y | X, I) represents the Gibbs energy and Z(X, I) is the partition function. Maximizing the conditional distribution w.r.t. Y is equivalent to minimizing the Gibbs energy function,
E(Y | X, I) = Σ_i ψ_u(y_i) + Σ_i Σ_{j≠i} ψ_p(y_i, y_j),  (2)
where ψ_u(y_i) and ψ_p(y_i, y_j) are the unary terms and pairwise terms, respectively.
After the displacements of tracked objects between time t−1 and time t are obtained, each object's estimated location at time t can be easily calculated for associating tracklets and detection boxes to generate tracklets up to time t. Such displacements are then iteratively calculated for the following frames. Without loss of generality, we only discuss the approach for optimizing object displacements between time t−1 and time t in this section.
III-A1 Unary terms
For the i-th object tracklet, the unary term of our DCCRF model is defined as
ψ_u(y_i) = w_i ‖ y_i − f(x_i) ‖²,  (3)
This term penalizes quadratic deviations between the final output displacement y_i and the displacement f(x_i) estimated by a visual displacement estimation function f. w_i is an online adaptive parameter for the i-th object that controls whether to trust more the displacement estimated from the i-th object's visual cues (the unary term) or the one inferred from inter-object relations (the pairwise terms). Intuitively, when the visual displacement estimator f has higher confidence in its estimated displacement, w_i should be larger to bias the final output towards the visually inferred displacement. On the other hand, when f has lower confidence in its estimation, e.g., due to object occlusion or the appearance of similar objects, w_i should be smaller to let the final displacement be mainly inferred from inter-object constraints.
In our framework, the visual displacement estimation function f is modeled as a deep Convolutional Neural Network (CNN) that utilizes only the tracked object's visual information to estimate its location displacement between time t−1 and time t. For each tracked object i, our visual-displacement CNN takes a pair of image patches cropped from the frames at time t−1 and time t as inputs, and outputs the object's inferred displacement. A network structure similar to ResNet-101 [26] (except for the topmost layer) is adopted for our visual-displacement CNN.
The network inputs and outputs are illustrated in Fig. 2. For the inputs, given currently tracked object i's bounding box location at time t−1, a larger bounding box centered at the same position is first created. Two image patches are cropped at this same spatial location from the frames at time t−1 and time t. They are then concatenated along the channel dimension to serve as the inputs of our visual-displacement CNN. The reasons for using a larger bounding box instead of the original box are to tolerate large possible displacements between the two consecutive frames and to incorporate more visual context of the object for more accurate displacement estimation. After training with thousands of such pairs, the visual-displacement CNN is able to capture important visual cues from image-patch pairs to infer object displacements between time t−1 and time t.
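As an illustration, the patch-pair construction described above can be sketched as follows. This is a minimal NumPy sketch; the function name, the (cx, cy, w, h) box format, and the boundary clamping are our own assumptions, while the enlargement factors (5x in width, 2x in height) are taken from the implementation details.

```python
import numpy as np

def crop_pair(frame_prev, frame_cur, box, wf=5.0, hf=2.0):
    """Crop the same enlarged region from two consecutive frames and stack
    the crops along the channel axis, forming the visual-displacement CNN
    input. box = (cx, cy, w, h) is the tracked object's box at time t-1.
    """
    cx, cy, w, h = box
    W, H = int(round(w * wf)), int(round(h * hf))
    x1 = int(round(cx - W / 2))
    y1 = int(round(cy - H / 2))
    # Clamp the enlarged crop to the image bounds (our own choice).
    Himg, Wimg = frame_prev.shape[:2]
    x1 = max(0, min(x1, Wimg - W))
    y1 = max(0, min(y1, Himg - H))
    p_prev = frame_prev[y1:y1 + H, x1:x1 + W]
    p_cur = frame_cur[y1:y1 + H, x1:x1 + W]
    # (H, W, 6) for a pair of RGB frames.
    return np.concatenate([p_prev, p_cur], axis=2)
```

In practice both crops would additionally be resized to the CNN's fixed input resolution.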
For the CNN outputs, instead of directly regressing an object's two-dimensional x- and y-displacements, we discretize the possible 2D continuous displacements into a 2D discrete grid of bins (bottom-right part of Fig. 2), where b_{i,k} represents the displacement corresponding to the k-th bin of the i-th object. The visual-displacement CNN is trained to output confidence scores c_{i,k} for the displacement bins with a softmax function. The cross-entropy loss is therefore used to train the CNN, and the final estimated displacement for the tracked object is calculated as the weighted average of all possible displacements, f(x_i) = Σ_k c_{i,k} b_{i,k}, where Σ_k c_{i,k} = 1. In practice, the chosen grid resolution is a good tradeoff between discretization accuracy and robustness. Note that there are existing tracking methods [22], [27] that also utilize pairs of image patches as inputs to directly regress object displacements. However, in our method, we propose to use the cross-entropy loss for estimating displacements and find that it achieves more accurate and robust displacement estimations in our experiments. More importantly, it provides displacement confidence scores for calculating the adaptive parameter w_i in Eq. (3) to weight the unary and pairwise terms.
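To make the weighted-average readout concrete, the following sketch converts per-bin scores into one continuous 2-D displacement. The grid size and displacement range here are placeholder values, not the paper's actual settings.

```python
import numpy as np

def expected_displacement(logits, grid_size=21, max_disp=32.0):
    """Turn raw per-bin scores into a continuous 2-D displacement.

    logits: (grid_size * grid_size,) raw scores, one per displacement bin.
    The bin centers tile [-max_disp, max_disp] in both x and y.
    """
    # Softmax over bins -> confidence scores c_k that sum to 1.
    c = np.exp(logits - logits.max())
    c /= c.sum()

    # 2-D grid of bin-center displacements.
    ticks = np.linspace(-max_disp, max_disp, grid_size)
    bx, by = np.meshgrid(ticks, ticks)
    bins = np.stack([bx.ravel(), by.ravel()], axis=1)  # (grid*grid, 2)

    # Weighted average of all bin displacements.
    return c @ bins
```

A sharply peaked score distribution reproduces a single bin center, while a flat one averages out to zero displacement.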
The confidence weight w_i is obtained by the following equation,
w_i = σ( λ₁ max_k(c_{i,k}) + λ₂ ),  (4)
where σ(·) is the sigmoid function constraining the range of w_i to be between 0 and 1, max_k(c_{i,k}) obtains the maximal confidence among object i's displacement bins, and λ₁ and λ₂ are learnable scalar parameters. In our experiments, the learned parameter λ₁ is generally positive after training, which denotes that, if the visual-displacement CNN is more confident about its displacement estimation, the value of w_i is larger and the final output displacement is biased more towards the visually inferred displacement f(x_i). Otherwise, the final displacement is biased to be inferred from inter-object constraints.

If the energy function in Eq. (2) consisted of only the unary terms ψ_u, the final output displacements would depend solely on each tracked object's visual information, without considering inter-object constraints.
III-A2 Asymmetric pairwise terms
The pairwise terms ψ_p in Eq. (2) are utilized to model asymmetric inter-object relations between object tracklets for regularizing the final displacement results. To handle global camera motion, we assume that from time t−1 to time t, the speed differences between two tracked objects should be maintained, i.e.,
ψ_p(y_i, y_j) = g(j, i) ‖ (y_i − y_j) − (d_i^{t−1} − d_j^{t−1}) ‖²,  (5)
where y_i − y_j is the displacement (which can be viewed as speed) difference between objects i and j at time t, d_i^{t−1} − d_j^{t−1} is the speed difference at the previous time t−1, and g(j, i) is a weighting function (a product of two factors in our experiments) that controls the directional influence from object j to object i.
For better modeling inter-object relations, two important observations are made to define the asymmetric weighting functions g. 1) For detection boxes, in terms of localization accuracy, larger object detection boxes are more likely to be noisy, while smaller ones tend to be more stable (as shown in Fig. 3). This is because the displacements of both large and small detection boxes are all recorded in pixels in our tracking framework. Noisy large detection boxes would significantly influence the displacement estimation for other boxes. This problem is illustrated in Fig. 4. The two targets in Fig. 4(a) have accurate locations and speeds, which can be used to build inter-object constraints at time t−1. When the detector outputs roughly accurate bounding boxes for both targets at time t, symmetric inter-object constraints can refine the objects' locations well (see Fig. 4(b)). However, since larger-size detection boxes are more likely to be noisy, using symmetric inter-object constraints would significantly affect the tracking results of small-size objects (see Fig. 4(c)). In contrast, small-size objects have smaller localization errors and can better infer larger-size objects' locations. Asymmetric small-to-large-size inter-object constraints are robust even when the smaller-size detection box is noisy (see Fig. 4(d)). Therefore, between a pair of tracked objects, the one with the smaller detection box should have more influence in inferring the displacement of the one with the larger detection box, and the object with the larger box should have less chance to deteriorate the displacement estimation of the smaller one.
[Fig. 4: (a) accurate locations at time t−1; (b) symmetric influence at time t with accurate detections; (c) symmetric influence at time t with a noisy large-size detection; (d) asymmetric small-to-large influence at time t.]
2) If our above-mentioned visual-displacement CNN has high confidence in an object's displacement, this object's visually inferred displacement should be used more to infer other objects' displacements. On the other hand, objects with low confidences in their visually inferred displacements should not affect objects with high-confidence displacements. Based on the two observations, we model the weighting function g(j, i) by the product of a size-based weighting function and a confidence-based weighting function between a pair of tracked objects as
g(j, i) = σ( λ₃ s_j + λ₄ ) · σ( λ₅ max_k(c_{j,k}) + λ₆ ),  (6)
where σ(·) denotes the sigmoid function, s_j denotes the size of the j-th tracked object at time t−1, max_k(c_{j,k}) obtains the maximal displacement confidence output by our proposed visual-displacement CNN for object j, and λ₃, λ₄, λ₅, λ₆ are learnable scalar parameters. In our DCCRF, these parameters are learned by the back-propagation algorithm with mean-field approximation. When the mean-field approximation is used for DCCRF inference, the influence from object j to i and that from i to j are different (see the next subsection for details). After training, we observe that λ₃ < 0 and λ₅ > 0, which means that smaller sizes s_j and larger confidences max_k(c_{j,k}) lead to greater weights. This validates our above-mentioned observations that objects with smaller sizes and greater visual-displacement confidences should have greater influence on other objects, but not the other way around.
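A minimal sketch of this directional weight follows. All four scalar parameters here are placeholders standing in for the learned values; only their signs follow the discussion above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_weight(size_j, conf_j, lam3=-0.05, lam4=2.0, lam5=4.0, lam6=-2.0):
    """Directional influence of object j on object i.

    size_j: box size of the influencing object j (e.g. height in pixels);
    conf_j: j's maximal displacement-bin confidence from the
    visual-displacement CNN. With lam3 < 0 and lam5 > 0, smaller and
    more confident objects exert more influence.
    """
    return sigmoid(lam3 * size_j + lam4) * sigmoid(lam5 * conf_j + lam6)
```

Because the weight depends only on the influencing object j, the weight from j to i generally differs from the weight from i to j, which is exactly the intended asymmetry.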
In Fig. 5, we show example values of one learned weighting function g. In Fig. 5(a), compared with object 6, objects 2-4 are of smaller size and also higher visual confidence. With the directional weighting functions, they have greater influence in correcting errors of tracked object 6 (red vs. green rectangles of object 6) and are not much affected by the erroneous estimation of object 6. Similar directional weighting-function values can be found in Fig. 5(b), where objects 1, 3, 4 with high visual-displacement confidences are able to correct tracking errors of object 5, which has low visual-displacement confidence.
III-A3 Inference
For the unary terms, we utilize forward propagation of the visual-displacement CNN to calculate the objects' estimated displacements f(x_i) and displacement confidences c_{i,k}. After the unary-term inference, the overall maximum posterior marginal inference is achieved by mean-field approximation, which yields an iterative message-passing scheme for approximate inference. Our unary terms and pairwise terms are both of quadratic form, so the energy function is convex and the optimal displacement of each object is obtained as the mean value of its approximate marginal,
ȳ_i = ( w_i f(x_i) + Σ_{j≠i} g(j, i) ( ȳ_j + d_i^{t−1} − d_j^{t−1} ) ) / ( w_i + Σ_{j≠i} g(j, i) ).  (7)
In each iteration, node i receives messages from all other objects to update its displacement estimation. The mean-field approximation usually converges in 5-10 iterations. The above displacement update clearly shows the difference between the messages transmitted from object j to i and those from object i to j, because of the asymmetric weighting functions. For a pair of objects, g(i, j) and g(j, i) are generally different. Even if s_i = s_j, when max_k(c_{j,k}) > max_k(c_{i,k}), object j has greater influence on i than i has on j.
A detailed derivation of Eq. (7) is given as follows. The mean-field method approximates the distribution P(Y | X, I) with a distribution Q(Y) that can be expressed as a product of independent marginals, Q(Y) = Π_i q_i(y_i). The optimal approximation is obtained by minimizing the Kullback-Leibler (KL) divergence between Q and P. The solution for q_i has the following form,

q_i(y_i) ∝ exp( −E_{j≠i}[ E(Y | X, I) ] ),  (8)

where E_{j≠i}[·] denotes the expectation under the distributions q_j over all variables y_j for j ≠ i. Substituting the energy of Eq. (2), the inference is formulated as

q_i(y_i) ∝ exp( −ψ_u(y_i) − Σ_{j≠i} E_{q_j}[ ψ_p(y_i, y_j) ] ).  (9)

Each q_i is a quadratic (Gaussian) form with respect to y_i, and its mean is therefore

ȳ_i = argmin_{y_i} ( w_i ‖ y_i − f(x_i) ‖² + Σ_{j≠i} g(j, i) ‖ (y_i − ȳ_j) − (d_i^{t−1} − d_j^{t−1}) ‖² ).  (10)

The inference task is to minimize the KL divergence. Since we approximate the conditional distribution with a product of independent marginals, an estimate of each y_i is obtained as the expected value of the corresponding quadratic function. Setting the derivative of Eq. (10) with respect to y_i to zero gives

ȳ_i = ( w_i f(x_i) + Σ_{j≠i} g(j, i) ( ȳ_j + d_i^{t−1} − d_j^{t−1} ) ) / ( w_i + Σ_{j≠i} g(j, i) ),  (11)

which is exactly Eq. (7).
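The iterative update can be sketched in a few lines. This is a NumPy sketch under the assumption that g[j, i] stores the influence of object j on object i; initializing from the unary estimates is our own choice.

```python
import numpy as np

def mean_field_displacements(f, w, g, d_prev, n_iters=5):
    """Iterative mean-field update for the DCCRF displacements.

    f:      (N, 2) visually inferred displacements f(x_i) from the unary CNN.
    w:      (N,)   adaptive unary confidence weights w_i.
    g:      (N, N) asymmetric pairwise weights; g[j, i] is the influence
                   of object j on object i (the diagonal is ignored).
    d_prev: (N, 2) displacements at the previous time step, used to keep
                   speed differences consistent across frames.
    """
    n = len(f)
    y = f.copy()  # start from the unary estimates
    for _ in range(n_iters):
        y_new = np.empty_like(y)
        for i in range(n):
            num = w[i] * f[i]
            den = w[i]
            for j in range(n):
                if j == i:
                    continue
                # Message from j to i: j's current displacement plus the
                # speed difference observed at the previous time step.
                num = num + g[j, i] * (y[j] + d_prev[i] - d_prev[j])
                den = den + g[j, i]
            y_new[i] = num / den
        y = y_new
    return y
```

With all pairwise weights at zero the update degenerates to the purely visual estimates; with a large one-directional weight, the influenced object's displacement is pulled toward the constraint while the influencing object is left untouched.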
III-B The Overall MOT Algorithm
The overall algorithm with our proposed DCCRF is shown in Algorithm 1. At each time t, the DCCRF inputs are the existing tracklets up to time t−1 and the consecutive frames at time t−1 and time t. It outputs each tracklet's displacement estimation. After obtaining the displacement estimation ȳ_i for each tracklet by DCCRF, its estimated location at time t can be simply calculated as the summation of its location at time t−1 and its estimated displacement, i.e.,
p_i^t = p_i^{t−1} + ȳ_i,  (12)

where p_i^t denotes the i-th tracklet's estimated box location at time t.
Based on such estimated locations, we utilize a visual-similarity CNN (Section III-B1) as well as the Intersection-over-Union (IoU) value as the criteria for tracklet-detection association to generate longer tracklets (Section III-B2). To make our online MOT system complete, we also specify our detailed strategies for tracklet initialization (Section III-B3), and occlusion handling and tracklet termination (Section III-B4).
III-B1 Visual-similarity CNN
The tracklet-detection associations need to be determined based on visual and spatial cues simultaneously. We propose a visual-similarity CNN for calculating visual similarities between image patches cropped at bounding box locations in the same frame. The visual-similarity CNN has a similar network structure to our visual-displacement CNN in Section III-A1. However, this network takes image patches from the same video frame as inputs and outputs the confidence of whether the input pair represents the same object. It is therefore trained with a binary cross-entropy loss. In addition, the training samples are generated differently for the visual-similarity CNN. Instead of cropping two consecutive video frames at the same bounding box locations as for the visual-displacement CNN, the visual-similarity CNN requires positive pairs to be cropped at different locations of the same object at any time in the same video, while the negative pairs are image patches belonging to different objects. For cropping image patches, we do not enlarge the object's bounding box, which also differs from our visual-displacement CNN. During training, the ratio between positive and negative pairs is kept fixed and the network is trained similarly to the visual-displacement CNN.
III-B2 Tracklet-detection association
Given the estimated tracklet locations and the detection boxes at time t, tracklets are associated with detection boxes based on the visual and spatial similarities between them. The associated detection boxes can then be appended to their corresponding tracklets to form longer ones up to time t. Let p̂_i^t and b_j^t denote the i-th tracklet's estimated location and the j-th detection box at time t. Their visual similarity calculated by the visual-similarity CNN in Section III-B1 is denoted as S_v(i, j). The spatial similarity between the estimated tracklet location and the detection box is measured as their box Intersection-over-Union value IoU(i, j). If a detection box could be associated with multiple tracklets, the Hungarian algorithm is utilized to determine the optimal associations with the following overall similarity,

S(i, j) = S_v(i, j) + λ · IoU(i, j),  (13)
where λ is the weight balancing the visual and spatial similarities and is set to 1 in our experiments. After the box association by the Hungarian algorithm, if a tracklet is associated with a detection box that has an IoU value greater than 0.5 with it, the associated detection box is directly appended to the end of the tracklet. If the IoU value is between 0.3 and 0.5, the average of the associated detection box and the estimated tracklet box is appended to the tracklet to compensate for a possibly noisy detection box. If the IoU value is smaller than 0.3, the tracklet might be considered as terminated or temporarily occluded (Section III-B4).
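The association step can be sketched as follows. This is a Python sketch using SciPy's `linear_sum_assignment` as the Hungarian solver; the (x1, y1, x2, y2) box format and the action labels are our own conventions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracklet_boxes, det_boxes, vis_sim, lam=1.0):
    """Tracklet-detection association.

    vis_sim[i, j] is the visual-similarity CNN score between tracklet i's
    estimated box and detection j. Returns (i, j, action) triples, where
    the action follows the IoU thresholds described above.
    """
    iou_mat = np.array([[iou(t, d) for d in det_boxes]
                        for t in tracklet_boxes])
    sim = vis_sim + lam * iou_mat
    rows, cols = linear_sum_assignment(-sim)  # maximize overall similarity
    out = []
    for i, j in zip(rows, cols):
        if iou_mat[i, j] > 0.5:
            out.append((i, j, 'append'))   # trust the detection box
        elif iou_mat[i, j] >= 0.3:
            out.append((i, j, 'average'))  # average detection and estimate
        else:
            out.append((i, j, 'reject'))   # possibly occluded / terminated
    return out
```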
III-B3 Tracklet initialization
If an object detection box at time t is not associated with any tracklet in the above tracklet-detection association step, it is treated as a candidate box for initializing new tracklets. For each such candidate box at time t, its visually inferred displacement between time t and time t+1 is first obtained by our visual-displacement CNN in Section III-A1. Its estimated box location at time t+1 can then be easily calculated following Eq. (12). The visual and spatial similarities between the estimated box at time t+1 and the candidate boxes at time t+1 are calculated. To form a new candidate tracklet, the candidate box at time t is only associated with a candidate box at time t+1 that has 1) a greater-than-0.3 IoU and 2) a greater-than-0.8 visual similarity with its estimated box location. If there are multiple candidate associations, the Hungarian algorithm is utilized to associate the candidate box at time t to its optimal candidate at time t+1 according to the overall similarities (Eq. (13)). If none of the candidate associations at time t+1 satisfies the above two conditions with the candidate box at time t, the candidate box is ignored and is not used for tracklet initialization. Such operations are iterated over time to generate longer candidate tracklets. If a candidate tracklet exceeds a minimum number of frames (for pedestrian tracking with 25-fps videos), it is initialized as a new tracklet.
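For the common single-candidate case, the two initialization criteria can be checked as below. This is a minimal sketch; when several new detections compete for the same candidates, the Hungarian algorithm would be used instead of the greedy maximum taken here.

```python
def _iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def best_init_match(est_box, cand_boxes, vis_scores):
    """Pick the candidate box that satisfies both initialization criteria
    (IoU > 0.3 with the estimated box and visual similarity > 0.8) with
    the highest overall similarity; return None if no candidate qualifies.
    """
    best, best_s = None, -1.0
    for k, (box, vis) in enumerate(zip(cand_boxes, vis_scores)):
        o = _iou(est_box, box)
        if o > 0.3 and vis > 0.8:
            s = vis + o  # overall similarity with lambda = 1, as in Eq. (13)
            if s > best_s:
                best, best_s = k, s
    return best
```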
III-B4 Occlusion handling and tracklet termination
If a past tracklet is not associated with any detection box at time t, the tracked object is considered as possibly occluded or temporarily missed. For a possibly occluded object, we directly extend its past tracklet with its location estimated by our DCCRF at time t to create a virtual tracklet. The same operation is iterated for a bounded number of frames, i.e., if the virtual tracklet is not associated with any detection box for more than this number of time steps, the virtual tracklet is terminated. For pedestrian tracking, the threshold is set empirically.
IV. Experiments
In this section, we present experimental results of the proposed online MOT algorithm. We first introduce the evaluation datasets and implementation details for our proposed framework in Sections IV-A and IV-B. In Section IV-C, we compare the proposed method with state-of-the-art approaches on the public MOT datasets. The individual components of our proposed method are evaluated in Section IV-D.
IV-A Datasets and Evaluation Metric
We conduct experiments on the 2DMOT15 [29] and 2DMOT16 [28] benchmarks, which are widely used to evaluate the performance of MOT methods. Both of them have two tracks: public detection boxes [2, 3, 24] and private detection boxes [30, 31]. For comparing with only the performance of tracking algorithms, we evaluate our method with the provided public detection boxes.
IV-A1 2DMOT15
This dataset is one of the largest datasets with moving or static cameras, different viewpoints and different weather conditions. It contains a total of 22 sequences, half for training and half for testing, with a total of 11286 frames (or 996 seconds). The training sequences contain over 5500 frames, 500 annotated trajectories and 39905 annotated bounding boxes. The testing sequences contain over 5700 frames, 721 annotated trajectories and 61440 annotated bounding boxes. The public detection boxes in 2DMOT15 are generated with aggregated channel features (ACF).
IV-A2 2DMOT16
This dataset is an extension to 2DMOT15. Compared to 2DMOT15, new sequences are added and the dataset contains almost 3 times more bounding boxes for training and testing. Most sequences are in high resolution, and the average pedestrian number in each video frame is 3 times higher than that of the 2DMOT15. In 2DMOT16, deformable part models (DPM) based methods are used to generate public detection boxes, which are more accurate than boxes in 2DMOT15.
IV-A3 Evaluation Metric
For the quantitative evaluation, we adopt the popular CLEAR MOT metrics [29], which include:

MOTA: Multiple Object Tracking Accuracy. This metric is usually chosen as the main performance indicator for MOT methods. It combines three types of errors: false positives, false negatives, and identity switches.

MOTP: Multiple Object Tracking Precision. The misalignment between the annotated and the predicted bounding boxes.

MT: Mostly Tracked targets. The ratio of groundtruth trajectories that are covered by a track hypothesis for at least 80% of their respective life span.

ML: Mostly Lost targets. The ratio of groundtruth trajectories that are covered by a track hypothesis for at most 20% of their respective life span.

FP: The total number of false positives.

FN: The total number of false negatives (missed targets).

ID Sw: The total number of identity switches. Please note that we follow the stricter definition of identity switches as described in MOT challenge.

Frag: The total number of times a trajectory is fragmented (i.e., interrupted during tracking).
IV-B Implementation details
IV-B1 Training schemes and settings
For the visual-displacement and visual-similarity CNNs, we adopt ResNet-101 [26, 32] as the network structure and replace the topmost layer to output displacement confidences or same-object confidences. Both CNNs are pretrained on the ImageNet dataset. For cropping image patches, we enlarge each detection box by a factor of 5 in width and 2 in height. Image patches for the two CNNs are cropped at the same locations from consecutive frames as described in Section III-A1, and are then resized to a fixed size as the CNN inputs.

We train our proposed DCCRF in three stages. In the first stage, the proposed visual-displacement CNN is trained with the cross-entropy loss using batch Stochastic Gradient Descent (SGD) with a batch size of 5. The learning rate is decreased by a factor of 1/10 every 50,000 iterations, and training generally converges after 600,000 iterations. In the second stage, the learned visual-displacement CNN from stage 1 is fixed and the other parameters of our DCCRF are trained with the loss

L = Σ_i ‖ d̂_i − d_i ‖²,  (14)

where d̂_i and d_i are the estimated and the ground-truth displacements of the i-th tracked object. In the final stage, the DCCRF is trained in an end-to-end manner with the above loss together with the cross-entropy loss of the visual-displacement CNN in the unary terms. We find that 5 iterations of the mean-field approximation generate satisfactory results. The DCCRF is trained with a learning rate that is decreased by a factor of 1/3 every 5,000 iterations; training typically converges after 3 epochs.
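The mean-field iterations mentioned above can be illustrated with a generic damped-update sketch (illustrative only: the pairwise weights and the update form below are simplified stand-ins for the paper's asymmetric weighting functions, not the actual DCCRF inference):

```python
import numpy as np

def mean_field_displacements(unary, prev_speed, weights, n_iters=5):
    """Illustrative mean-field-style refinement of per-object displacements.

    unary:      (N, 2) displacement estimates from a unary (CNN) term
    prev_speed: (N, 2) each object's speed on the previous frame
    weights:    (N, N) pairwise weights (hypothetical stand-in for the
                paper's size/confidence-based weighting functions)

    Each iteration pulls object i's displacement toward its unary estimate
    and toward keeping its speed difference with each neighbour unchanged.
    """
    d = unary.copy()
    for _ in range(n_iters):
        for i in range(len(d)):
            w = weights[i]
            # per-neighbour targets: d_i - d_j should equal v_i - v_j
            pairwise = d + (prev_speed[i] - prev_speed)
            # closed-form update for a quadratic CRF, excluding j = i
            num = unary[i] + (w[:, None] * pairwise).sum(axis=0) - w[i] * pairwise[i]
            den = 1.0 + w.sum() - w[i]
            d[i] = num / den
    return d
```

With all pairwise weights set to zero, the update degenerates to the unary estimates, which matches the "unary-only" baseline behavior discussed later.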
Our code is implemented in MATLAB and Caffe. The overall tracking speed of the proposed method on the MOT16 test sequences is 0.1 fps on a 2.4 GHz CPU and a Maxwell TITAN X GPU, without any acceleration libraries.
IV-B2 Data augmentation
To introduce more variation into the training data and thus reduce possible overfitting, we augment the training data. For pretraining the visual-displacement CNN, the input images are image patches centered at detection boxes. We augment the training samples by random flipping, as well as by randomly shifting the cropping positions by at most a fixed fraction of the detection box width and height in the horizontal and vertical dimensions, respectively. For end-to-end training of the DCCRF, besides random flipping of whole video frames, the time intervals between the two input video frames are randomly sampled to generate more frame pairs with larger possible displacements between them.
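The augmentation described above, random flips plus bounded jitter of the crop position, might be sketched as follows (`max_shift_frac` is a hypothetical placeholder, since the exact shift bound is not reproduced here):

```python
import numpy as np

def augment_patch(image, box, max_shift_frac=0.1, rng=None):
    """Randomly jitter a crop's position and flip it horizontally.

    image: (H, W, C) array;  box: (x, y, w, h) detection box.
    The crop position is shifted by at most `max_shift_frac` of the box
    width/height in each dimension, mimicking the augmentation above.
    """
    rng = rng or np.random.default_rng()
    x, y, w, h = box
    # shift the crop position by a bounded random fraction of the box size
    dx = rng.uniform(-max_shift_frac, max_shift_frac) * w
    dy = rng.uniform(-max_shift_frac, max_shift_frac) * h
    x0 = int(max(0, x + dx))
    y0 = int(max(0, y + dy))
    patch = image[y0:y0 + int(h), x0:x0 + int(w)]
    if rng.random() < 0.5:          # random horizontal flip
        patch = patch[:, ::-1]
    return patch
```

Passing an explicit `rng` makes the augmentation reproducible across training runs.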
Table I: Tracking results on the 2DMOT15 test set.
Tracking Mode  Method  MOTA  MOTP  MT  ML  FP  FN  ID Sw  Frag
Offline  SMOT [33]  18.2%  71.2%  2.8%  54.8%  8780  40310  1148  2132 
Offline  CEM [34]  19.3%  70.7%  8.5%  46.5%  14180  34591  813  1023 
Offline  DCO_X [35]  19.6%  71.4%  5.1%  54.9%  10652  38232  521  819 
Offline  SiameseCNN [25]  29.0%  71.2%  8.5%  48.4%  5160  37798  639  1316 
Offline  CNNTCM [36]  29.6%  71.8%  11.2%  44.0%  7786  34733  712  943 
Offline  NOMT [37]  33.7%  71.9%  12.2%  44.0%  7762  32547  442  823 
Online  TC_ODAL [38]  15.1%  70.5%  3.2%  55.8%  12970  38538  637  1716 
Online  RNN_LSTM [23]  19.0%  71.0%  5.5%  45.6%  11578  36706  1490  2081 
Online  RMOT [11]  18.6%  69.6%  5.3%  53.3%  12473  36835  684  1282 
Online  oICF [39]  27.1%  70.0%  6.4%  48.7%  7594  36757  454  1660 
Online  SCEA [3]  29.1%  71.1%  8.9%  47.3%  6060  36912  604  1182 
Online  MDP [2]  30.3%  71.3%  13.0%  38.4%  9717  32422  680  1500 
Online  CDA_DDAL [24]  32.8%  70.7%  9.7%  42.2%  4983  35690  614  1583 
Online  Proposed Method  33.6%  70.9%  10.4%  37.6%  5917  34002  866  1566 
Table II: Tracking results on the MOT16 test set.
Tracking Mode  Method  MOTA  MOTP  MT  ML  FP  FN  ID Sw  Frag

Offline  TBD [40]  33.7%  76.5%  7.2%  54.2%  5804  112587  2418  2252 
Offline  LTTSCCRF [41]  37.6%  75.9%  9.6%  55.2%  11969  101343  481  1012 
Offline  LINF [42]  41.0%  74.8%  11.6%  51.3%  7896  99224  430  963 
Offline  MHT_DAM [43]  42.9%  76.6%  13.6%  46.9%  5668  97919  499  659 
Offline  JMC [4]  46.3%  75.7%  15.5%  39.7%  6373  90914  657  1114 
Offline  NOMT [37]  46.4%  76.6%  18.3%  41.4%  9753  87565  359  504 
Online  OVBT [44]  38.4%  75.4%  7.5%  47.3%  11517  99463  1321  2140 
Online  EAMTT_pub [45]  38.8%  75.1%  7.9%  49.1%  8114  102452  965  1657 
Online  oICF [39]  43.2%  74.3%  11.3%  48.5%  6651  96515  381  1404 
Online  CDA_DDAL [24]  43.9%  74.7%  10.7%  44.4%  6450  95175  676  1795 
Online  Proposed Method  44.8%  75.6%  14.1%  42.3%  5613  94125  968  1378 
Fig. 6: Qualitative results on the MOT16 test sequences: (a) MOT16-03, (b) MOT16-06, (c) MOT16-07, (d) MOT16-08, (e) MOT16-12, (f) MOT16-14.
IV-C Quantitative results on 2DMOT15 and 2DMOT16
On the 2DMOT15 and MOT16 datasets, we test our proposed method and compare it with state-of-the-art MOT methods (note that only methods in peer-reviewed publications are compared in this paper; arXiv papers that have not undergone peer review are not included), including SMOT [33], MDP [2], SCEA [3], CEM [34], RNN_LSTM [23], RMOT [11], TC_ODAL [38], CNNTCM [36], SiameseCNN [25], oICF [39], NOMT [37], and CDA_DDAL [24]. The results of the compared methods are listed in Tables I and II. We focus on the MOTA value as the main performance indicator; it is a weighted combination of false negatives (FN), false positives (FP), and identity switches (ID Sw). Note that offline methods generally achieve higher MOTA than online methods because they can utilize not only past but also future information for object tracking, and they are listed only for reference here. Our proposed online MOT method outperforms all compared online methods and most offline methods [2, 3, 39, 24, 25]. As shown by the quantitative results, our proposed method alleviates the difficulties caused by object misdetections, noisy detections, and short-term occlusion. Qualitative results are shown in Fig. 6.
Compared with SCEA [3], which also models inter-object interactions and speed differences to handle misdetections caused by global camera motion, our learned DCCRF shows better performance, especially in FN, thanks to its more accurate displacement prediction, which recovers more misdetections. Our proposed method also outperforms MDP [2] in terms of MOTA and FP by a large margin. MDP learns to predict four target states (active, tracked, lost, and inactive) for each tracked object. However, it models each tracked object's movement pattern only with a constant-speed assumption, which is likely to result in false tracklet-detection associations and thus increases FP. CDA_DDAL [24] focuses on discriminative visual features learned by a siamese CNN for tracklet-detection association, which is not robust to occlusions and tends to increase FN. Compared with DCO_X [35] and LTTSCCRF [41], which also use conditional random field formulations to solve the MOT problem, our proposed DCCRF shows clear advantages in MOTA over these CRF-based methods.
However, our method produces more ID switches than some compared methods, due to long-term occlusions that our method cannot handle.
Table III: Component analysis on the MOT16 test set.
Method  MOTA  FP  FN  ID Sw

Proposed DCCRF  44.8%  5613  94125  968 
Unaryonly  41.9%  7392  97618  876 
Unaryonly+loss (reg)  34.2%  12089  104810  3134 
DCCRF w/o sizeasym  43.6%  8063  93724  1035 
DCCRF w/o cfdasym  43.8%  7353  94163  969 
DCCRF w/ symmetry  43.4%  9100  93076  1104 
IV-D Component analysis on 2DMOT16
To analyze the effectiveness of different components in our proposed framework, we design a series of baseline methods for comparison. The results of these baselines and our final method are reported in Table III. As in the above experiments, we focus on the MOTA value as the main performance indicator. 1) Unary-only: this baseline utilizes only the unary terms of our DCCRF, i.e., the visual-displacement CNN, within our overall MOT algorithm. Such a baseline considers only the tracked objects' appearance information. Compared with our proposed DCCRF, it shows a 2.9% MOTA drop, which indicates that inter-object relations are crucial for regularizing each object's estimated displacement and should not be ignored. 2) Unary-only+loss (reg): since our visual-displacement CNN is trained with the proposed cross-entropy loss instead of conventional ℓ1 or ℓ2 regression losses, we train a visual-displacement CNN with a smooth ℓ1 loss and test it in the same way as the unary-only baseline. Compared with the unary-only baseline, it shows a significant 7.7% MOTA drop, which demonstrates that our proposed cross-entropy loss results in much better displacement estimation accuracy. 3) DCCRF w/o cfd-asym and DCCRF w/o size-asym: the weighting functions of the pairwise term in our proposed DCCRF have two components, a confidence-asymmetric term and a size-asymmetric term. We test using only one of them in the DCCRF's pairwise terms. The results show a MOTA drop of about 1% for both baselines compared with our proposed DCCRF, which validates the need for both terms in the weighting functions. 4) DCCRF w/ symmetry: this baseline replaces the asymmetric pairwise term in our DCCRF with a symmetric one,
w(i, j) = exp( −(x_i − x_j)² / (2σ_x²) − (y_i − y_j)² / (2σ_y²) ),  (15)

where (x_i, y_i) are the coordinates of the i-th object's center position and σ_x, σ_y are learnable Gaussian kernel bandwidth parameters. Such a symmetric term assumes that the speed differences between close-by objects should be better maintained across time, while those between far-away objects are less regularized. This baseline shows a 1.4% MOTA drop compared with our proposed DCCRF, which shows that our asymmetric term is beneficial to the final performance. We also tried directly replacing the sigmoid function in Eq. (5) with a Gaussian-like function in the weighting function (Eq. (15)), which results in even worse performance.
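To make the distinction concrete, a symmetric Gaussian weight of the form in Eq. (15) and a sigmoid-based asymmetric confidence weight can be contrasted in a small sketch (the parameter values and the exact asymmetric form are assumptions for illustration, not the paper's learned functions):

```python
import math

def symmetric_weight(p_i, p_j, sigma=30.0):
    """Gaussian kernel on inter-object distance: close-by object pairs get
    strongly regularized speed differences, far-away pairs almost none.
    `sigma` stands in for the learnable bandwidths of Eq. (15)."""
    dist2 = (p_i[0] - p_j[0]) ** 2 + (p_i[1] - p_j[1]) ** 2
    return math.exp(-dist2 / (2.0 * sigma ** 2))

def confidence_asymmetric_weight(conf_i, conf_j, steepness=10.0):
    """Illustrative asymmetric counterpart: a sigmoid of the confidence gap,
    so a high-confidence tracklet j influences a low-confidence tracklet i
    more than the other way round (hypothetical form, parameters assumed)."""
    return 1.0 / (1.0 + math.exp(-steepness * (conf_j - conf_i)))
```

The Gaussian weight is identical in both directions, whereas the sigmoid weight deliberately breaks this symmetry, which is the property the ablation above isolates.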
Table IV: MOTA under different values of the association weight.

Table V: MOTA, FP, and FN under different candidate-tracklet length thresholds.

Table VI: MOTA, FP, and FN under different tracklet-termination thresholds.
In addition to the above, we conduct experiments to analyze the effects of different hyper-parameters and show the robustness of our DCCRF. 1) The first hyper-parameter controls the weight between the visual-similarity term and the DCCRF location-prediction term for tracklet-detection association in Eq. (13). We test three different values and report the results in Table IV; the final performance is not sensitive to this value. 2) The second hyper-parameter is the minimum length a candidate tracklet must reach to become an actual tracklet (Section III-B3). We additionally test a larger value in Table V, which shows a slight performance drop, because a larger threshold causes more low-confidence detections to be ignored. 3) The third hyper-parameter denotes the number of consecutive frames an object may go missing before its associated tracklet is terminated (Section III-B4). We additionally test other values, and the results in Table VI show that the performance is not sensitive to this choice.
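The two tracklet-length thresholds discussed in 2) and 3) can be illustrated with a small lifecycle sketch (both threshold values are hypothetical placeholders for the paper's hyper-parameters):

```python
def tracklet_lifecycle(observations, create_after=3, terminate_after=10):
    """Sketch of the two length thresholds: a candidate tracklet is promoted
    to an actual tracklet after `create_after` consecutive detections and
    terminated after `terminate_after` consecutive misses.

    observations: iterable of booleans (object detected in frame or not).
    Returns the per-frame states: 'candidate', 'active', or 'terminated'.
    """
    states, hits, misses, active = [], 0, 0, False
    for det in observations:
        if det:
            hits, misses = hits + 1, 0
            if hits >= create_after:
                active = True          # promote candidate to actual tracklet
        else:
            hits, misses = 0, misses + 1
        if active and misses >= terminate_after:
            states.append('terminated')  # too many consecutive misses
            active, hits = False, 0
        else:
            states.append('active' if active else 'candidate')
    return states
```

Raising `create_after` filters out more low-confidence detections (the FN-vs-FP trade-off in Table V), while `terminate_after` trades ID switches against false tracks kept alive too long.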
V Conclusion
In this paper, we present the Deep Continuous Conditional Random Field (DCCRF) model with asymmetric inter-object constraints for solving the MOT problem. The unary terms are modeled as a visual-displacement CNN that estimates object displacements across time from visual information. The asymmetric pairwise terms regularize inter-object speed differences across time with both size-based and confidence-based weighting functions, placing more weight on high-confidence tracklets to correct tracking errors. By jointly training the two terms in the DCCRF, the relations between objects' individual movement patterns and complex inter-object constraints can be better modeled and regularized to achieve more accurate tracking performance. Extensive experiments demonstrate the effectiveness of our proposed MOT framework as well as the individual components of our DCCRF.
References
 [1] W. Luo, J. Xing, X. Zhang, X. Zhao, and T. K. Kim, “Multiple object tracking: A literature review,” arXiv preprint arXiv:1409.7618, 2014.
 [2] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multiobject tracking by decision making,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4705–4713.

 [3] J. Hong Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, “Online multi-object tracking via structural constraint event aggregation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1392–1400.
 [4] S. Tang, B. Andres, M. Andriluka, and B. Schiele, “Multi-person tracking by multicut and deep matching,” arXiv preprint arXiv:1608.05404, 2016.
 [5] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multitarget tracking,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 261–268.
 [6] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. FeiFei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.

 [7] L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn, “Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 120–127.
 [8] X. Chen, Z. Qin, L. An, and B. Bhanu, “Multi-person tracking by online learned grouping model with nonlinear motion context,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 12, pp. 2226–2239, 2016.
 [9] L. Zhang and L. Van Der Maaten, “Preserving structure in modelfree tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 756–769, 2014.
 [10] G. Duan, H. Ai, S. Cao, and S. Lao, “Group tracking: Exploring mutual relations for multiple object tracking,” Computer Vision–ECCV 2012, pp. 129–143, 2012.
 [11] J. H. Yoon, M.H. Yang, J. Lim, and K.J. Yoon, “Bayesian multiobject tracking using motion context from multiple objects,” in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 33–40.
 [12] H. Grabner, J. Matas, L. Van Gool, and P. Cattin, “Tracking the invisible: Learning where the object might be,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1285–1292.

 [13] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1529–1537.
 [14] B. Yang and R. Nevatia, “An online learned CRF model for multi-target tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2034–2041.
 [15] A. Andriyenko, K. Schindler, and S. Roth, “Discretecontinuous optimization for multitarget tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1926–1933.
 [16] A. Milan, K. Schindler, and S. Roth, “Detection and trajectorylevel exclusion in multiple object tracking,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
 [17] H. Li, Y. Li, and F. Porikli, “Robust online visual tracking with a single convolutional neural network,” in Asian Conference on Computer Vision. Springer, 2014, pp. 194–209.
 [18] N. Wang and D.Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Advances in neural information processing systems, 2013, pp. 809–817.

 [19] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” in International Conference on Machine Learning, 2015, pp. 597–606.
 [20] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
 [21] R. Tao, E. Gavves, and A. W. Smeulders, “Siamese instance search for tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1420–1429.
 [22] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fullyconvolutional siamese networks for object tracking,” arXiv preprint arXiv:1606.09549, 2016.
 [23] A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler, “Online multitarget tracking using recurrent neural networks.” in AAAI, 2017, pp. 4225–4232.
 [24] S.H. Bae and K.J. Yoon, “Confidencebased data association and discriminative deep appearance learning for robust online multiobject tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [25] L. LealTaixé, C. CantonFerrer, and K. Schindler, “Learning by tracking: Siamese cnn for robust target association,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 33–40.
 [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [27] D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 fps with deep regression networks,” in European Conference on Computer Vision. Springer, 2016, pp. 749–765.
 [28] A. Milan, L. LealTaixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multiobject tracking,” arXiv preprint arXiv:1603.00831, 2016.
 [29] L. LealTaixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “Motchallenge 2015: Towards a benchmark for multitarget tracking,” arXiv preprint arXiv:1504.01942, 2015.
 [30] H. Roberto, L.T. Laura, C. Daniel, and R. Bodo, “A novel multidetector fusion framework for multiobject tracking,” arXiv preprint arxiv:1705.08314, 2017.
 [31] F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan, “Poi: Multiple object tracking with high performance detection and appearance feature,” in European Conference on Computer Vision Workshops, 2016, pp. 36–42.
 [32] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang et al., “Crafting gbdnet for object detection,” arXiv preprint arXiv:1610.02579, 2016.
 [33] C. Dicle, O. I. Camps, and M. Sznaier, “The way they move: Tracking multiple targets with similar appearance,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2304–2311.
 [34] A. Milan, S. Roth, and K. Schindler, “Continuous energy minimization for multitarget tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 58–72, 2014.
 [35] A. Milan, K. Schindler, and S. Roth, “Multitarget tracking by discretecontinuous energy minimization,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2054–2068, 2016.
 [36] B. Wang, L. Wang, B. Shuai, Z. Zuo, T. Liu, K. Luk Chan, and G. Wang, “Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 1–8.
 [37] W. Choi, “Nearonline multitarget tracking with aggregated local flow descriptor,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3029–3037.
 [38] S.H. Bae and K.J. Yoon, “Robust online multiobject tracking based on tracklet confidence and online discriminative appearance learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1218–1225.
 [39] H. Kieritz, S. Becker, W. Hübner, and M. Arens, “Online multiperson tracking using integral channel features,” in Advanced Video and Signal Based Surveillance (AVSS), 2016 13th IEEE International Conference on. IEEE, 2016, pp. 122–130.

 [40] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun, “3D traffic scene understanding from movable platforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 1012–1025, 2014.
 [41] N. Le, A. Heili, and J.-M. Odobez, “Long-term time-sensitive costs for CRF-based tracking by detection,” in Computer Vision – ECCV 2016 Workshops, Part II. Springer, 2016, pp. 43–51.
 [42] L. FagotBouquet, R. Audigier, Y. Dhome, and F. Lerasle, “Improving multiframe data association with sparse representations for robust nearonline multiobject tracking.” in ECCV (8), 2016, pp. 774–790.
 [43] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, “Multiple hypothesis tracking revisited,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4696–4704.
 [44] Y. Ban, S. Ba, X. AlamedaPineda, and R. Horaud, “Tracking multiple persons based on a variational bayesian model,” in ECCV Workshop on Benchmarking Mutliple Object Tracking, 2016.
 [45] R. SanchezMatilla, F. Poiesi, and A. Cavallaro, “Online multitarget tracking with strong and weak detections.” in ECCV Workshops (2), 2016, pp. 84–99.