1 Introduction
Multi-Person Tracking (MPT), as a key component of several intelligent applications such as autonomous driving and video surveillance, has attracted special attention beyond general object tracking. The goal of MPT is to estimate the states of multiple observed persons while preserving their identities under appearance variation over time. Existing MPT methods are mainly developed within the detection-to-association paradigm, where humans in each frame are usually detected by pretrained classifiers and then associated to identify the trajectories of persons throughout video sequences. Recently proposed MPT methods have shown impressive performance improvements thanks to the development of object (pedestrian) detectors (e.g., deep learning based models). Nevertheless, the problem remains unsolved in complex scenes (see Fig. 1 for examples) for the following reasons:
Mutual interactions and occlusions of moving persons usually degrade the performance of human detectors, and the resulting false positive detections increase the difficulty of preserving person identities.

It is quite difficult to handle ambiguities caused by person appearance and motion variations throughout sequences. Offline methods (i.e., methods that exploit detections from a span of deferred observations) are usually adopted to this end, but they are not suitable for realistic applications (i.e., those working with less observed data).
To address the above-mentioned issues, in this work we propose to amend the traditional detection-to-association paradigm by learning instance-aware person representations. Unlike existing methods that usually employ generic (category-level) human detectors, our approach assigns each moving person a specific tracker to reduce ambiguities in complex scenes. Additionally, modern advances in deep feature representation learning LinWZF016pami ; XieDZWF17pami ; WuYL17pr for object appearance have created new opportunities for MPT methods, which partially motivate us to learn instance-level object representations with deep neural nets. Therefore, we develop a multi-branch neural network (MBN) that dynamically learns instance-level representations of tracked persons at a low cost, which facilitates robust online data association for multiple target tracking and thus gives rise to our INstance-Aware Representation Learning and Association (INARLA) framework.
The proposed MBN architecture consists of three main components: i) a shared backbone-net for extracting convolutional features of input regions, ii) a det-pruning-subnet for rejecting noisy regions among the human detection proposals, and iii) a variable number of instance-subnets for measuring the confidence of the remaining candidate regions with respect to the tracked targets. Each instance-subnet explicitly corresponds to an individual in the scene and can be updated online by mining hard examples. Moreover, new instance-subnets can be dynamically created to handle newly appearing targets. In this way, our MBN improves the trackers’ robustness by adaptively capturing appearance variations of all the targets over time. It also relieves the burden of the subsequent data association step. Traditional detection-to-association trackers usually rely on an expensive step for associating observed data with trajectories (identities) by establishing spatio-temporal coherence, especially offline methods milan2015joint ; wang2015learning . In contrast, our INARLA framework handles it in a simple and efficient way, thanks to the MBN that provides powerful instance-level affinity measures for the observed regions. Specifically, we construct a joint association matrix based on the outputs of the MBN. This matrix can be divided into four blocks that represent meaningful states of tracked persons (e.g., being tracked or disappearing from the scene), and it results in a standard assignment problem that can be solved efficiently by the Hungarian algorithm munkres1957algorithms . In sum, our approach handles the problem of online multi-person tracking with the following steps: i) initializing generic human detections in an input video frame; ii) pruning low-confidence human detections via the det-pruning-subnet; iii) predicting the location of each tracked individual via its corresponding instance-subnet; iv) inferring the states of all targets by constructing an association matrix with the results of steps ii) and iii); v) updating the MBN network according to the inferred states of the targets.
The main contributions of this paper are summarized as follows. First, it presents a novel deep multi-branch neural network that enables dynamic instance-aware representation learning to address realistic challenges in multi-person tracking. Second, it presents a simple yet effective solver for data association based on the deep architecture, which is capable of inferring the states of tracked individuals in a frame-by-frame way. Experimental results on a standard benchmark demonstrate our method’s favorable performance in comparison with existing multi-person tracking methods.
2 Related Work
In the literature, much effort has been devoted to multi-object tracking (MOT), and we review existing works according to their main technical components, i.e., object representation and data association.
2.1 Object representation
How to represent objects plays an important role in MOT for affinity computation or linking object detections across frames. Many different cues have been presented in the literature, e.g., appearance, location and motion.
Earlier MOT works mostly adopt hand-crafted features for object representation ChoiS10eccv ; DalalT05cvpr ; Andriyenko2011Multi ; QianYG13pr . Color histograms are commonly used to represent object appearance in multi-object tracking ChoiS10eccv ; Bae2014Robust , and histograms of oriented gradients (HOG) DalalT05cvpr are also a popular choice BenfoldR11cvpr ; LiWZLLW17tip . In Andriyenko2011Multi , optical flow that reflects motion information is incorporated for object representation. In addition, appropriate fusion of multiple cues can yield improved results KimKFH12accv ; lenz2015followme ; WuGCW18nca . Moreover, sophisticated machine learning techniques Bae2014Robust ; FelzenszwalbGMR10pami are introduced to better describe object appearance models. However, conventional object representation methods are often badly affected by challenging factors such as illumination variations, object deformation and background clutter, which limits their performance and generalization ability in complex scenarios.
Recently, researchers have actively learned object appearance features with deep learning based models due to their powerful representation learning ability, e.g., convolutional neural networks (CNNs) Wang2015Visual ; LiWLL18jcam and recurrent neural networks (RNNs) Cui2016CVPR ; Ondruska2016Deep . A fully convolutional neural network is adopted in Wang2015Visual for object tracking, where features from top and lower layers that characterize the target from different perspectives are jointly used with a switch mechanism. In Cui2016CVPR , a recurrently target-attending tracking method is presented, which attempts to identify and exploit reliable parts that are beneficial for the tracking process. However, these deep learning based methods mainly focus on single object tracking, with the object indicated in the first video frame. As for MOT, Leal-Taixé et al. Leal2016CVPRWorkshops exploit a Siamese CNN for pairwise pedestrian similarity measurement in offline tracking, while Gaidon and Vig gaidon2015online take advantage of convolutional features for online domain adaptation between instances and category in a Bayesian tracking framework. Different from these methods, in this paper we employ an MBN network for instance-aware object representations, in which a backbone-subnet is trained with a novel multi-task loss and instance-subnets are dynamically initialized from a det-pruning-subnet and trained discriminatively online.
2.2 Data association
To address the data association problem, existing MOT works can be roughly divided into two categories: offline methods milan2015joint ; wang2015learning ; lenz2015followme and online methods KimKFH12accv ; gaidon2015online ; yoon2016online .
Most MOT methods belong to the first category and process the video in an offline way, where the data association is optimized over the whole video or a span of frames and requires future frames to determine objects’ states in the current frame. Network flow-based MOT methods Tang2017Multiple ; WangTFF16pami are quite typical in this category, and they generally solve the MOT problem using minimum-cost flow optimization. In Tang2017Multiple , linking person hypotheses over time is formulated as a minimum cost lifted multicut problem. In order to track interacting objects well, Wang et al. WangTFF16pami propose intertwined flows. Integer programming is also often used for formulating data association in MOT LeibeSG07iccv ; WangTFF14eccv . In LeibeSG07iccv , the quadratic integer program formulation is solved to local optimality by custom heuristics based on recursive search. A mixed integer program is introduced to handle the interaction of multiple objects in WangTFF14eccv . In MaksaiWFF17iccv , a non-Markovian approach is proposed to impose global consistency by using behavioral patterns to guide the association. These offline methods generally yield better performance by incorporating future frames into formulation and optimization, but this characteristic and the resulting high complexity also greatly constrain their application.
The online methods only use information up to the current frame and require no deferred processing, which makes them more practical in real-world applications. In KimKFH12accv , the data association between consecutive frames is formulated as bipartite matching and solved by structural support vector machines. Bae et al. Bae2014Robust perform online multi-object tracking by combining local and global association based on tracklet confidence. Recently, more sophisticated learning methods have been introduced to handle this problem. In xiang2015learning , the online association is modeled as a Markov Decision Process (MDP) with reinforcement learning. In MilanRDSR16 , RNNs are employed to learn the data association from data for online multi-object tracking. While these recent works spend costly computation on online joint association, this paper introduces an efficient solver for the online association based on the outputs of the MBN network.
3 Instance-Aware Representation Learning
Our INARLA framework incorporates instance-aware representation learning into joint association for online multi-person tracking and can be combined with any human detector. As shown in Fig. 2, we train a multi-branch neural network (MBN) for instance-aware representation learning. In a new frame, our approach embeds the MBN network’s outputs in an association matrix to jointly infer the objects’ states, which are then fed back to the MBN network.
3.1 Multi-branch neural network
The architecture of our MBN network is illustrated in Fig. 2. It consists of three main components: a shared backbone-subnet, a det-pruning-subnet and a variable number of instance-subnets. The backbone-subnet is fully convolutional and can take an image of arbitrary size as input to extract convolutional features. Among the branch subnets, the det-pruning-subnet is designed to evaluate and reject the noisy person proposals from a public human detector and also to initialize instance-subnets, while each instance-subnet predicts the location of its tracked person and also outputs the confidence score of a candidate being the target.
We build the MBN network from the Fast R-CNN model girshick2015fast using CaffeNet krizhevsky2012imagenet . We borrow the lower five layers from the Fast R-CNN architecture as our backbone-subnet, while the branch subnet structure is specially defined to accommodate our task. All branch subnets share the same structure definition. In order to handle the online learning of tracked instances with few examples, we define a lightweight branch subnet architecture, which comprises a region-of-interest (RoI) layer and three fully connected layers of sizes 256, 256 and 2, respectively.
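As a rough illustration of the lightweight branch subnet, the sketch below runs a forward pass through three fully connected layers of sizes 256, 256 and 2 followed by a softmax. It is not the Caffe implementation used in this work: the RoI-pooled feature is assumed to be already available as a flat vector, the toy input dimension of 64, the ReLU activations between layers and the helper names (`init_fc`, `make_branch`, `branch_forward`) are all assumptions made for illustration. Weights are drawn from a zero-mean Gaussian with standard deviation 0.01, the initialization reported for the det-pruning-subnet.

```python
import math
import random


def init_fc(n_in, n_out, rng, std=0.01):
    """Return (weights, biases) of a fully connected layer with Gaussian init."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def fc(x, layer):
    """Apply one fully connected layer to the vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def make_branch(feature_dim, rng):
    """Three FC layers of sizes 256, 256 and 2, as in the branch subnet."""
    return [init_fc(feature_dim, 256, rng), init_fc(256, 256, rng), init_fc(256, 2, rng)]


def branch_forward(feature, branch):
    """Return [P(non-target), P(target)] for one pooled RoI feature."""
    h = relu(fc(feature, branch[0]))  # ReLU between layers is an assumption
    h = relu(fc(h, branch[1]))
    return softmax(fc(h, branch[2]))


rng = random.Random(0)
branch = make_branch(64, rng)  # 64 is a toy feature dimension
feature = [rng.random() for _ in range(64)]
probs = branch_forward(feature, branch)
```

Because every branch has the same structure, adding a tracker for a new person amounts to creating one more such branch on top of the shared backbone features.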
3.2 Network learning
For concise description, we use to denote the backbone-subnet and to denote the th branch subnet. The 0th branch is the det-pruning-subnet and the th branch () is the th instance-subnet; the number of instance-subnets changes dynamically with the number of maintained persons. In addition, denotes the corresponding network that is formed by subnet and the th branch subnet (i.e. ).
The backbone-subnet is initialized from the Fast R-CNN model trained on the large-scale VOC datasets girshick2015fast . We initialize the det-pruning-subnet from zero-mean Gaussian distributions with standard deviation 0.01.
We train the network offline, and employ a multi-task loss on each labeled RoI to jointly optimize for classification and distance metric embedding:
(1) 
where
is defined as the log loss function over two classes.
is computed by a softmax over the 2 outputs in the final fully connected layer, and =1 indicates the target and =0 otherwise.
As illustrated in Fig. 3, we add an auxiliary subnet (in the dashed-line box), consisting of two fully connected layers with sizes 4096 and 1, respectively. A triplet-like loss is used: . Here and are positive examples of the same human object (e.g., sampled nearby or at different frames), while denotes a set of negative examples. denotes the 4096-dimensional feature vector, and is the norm distance (i.e. ). The function is defined as YunRV14nips .
This triplet-like loss drives similar examples close to each other and dissimilar examples apart in the feature space. Optimizing the multi-task loss in Eq. (1) makes the features extracted by the backbone-subnet suitable for discriminating both human/non-human objects and different humans, which is helpful for later instance-subnet training and prediction. To maintain the balance of positive and negative examples, we set the cardinality of as 2. Thus the batch size for optimization is a multiple of 4. The hyperparameter in Eq. (1) is set as 0.7 in our experiments.
In the optimization process, the gradients of the triplet-like loss with respect to the vector can be calculated based on the chain rule:
(2) 
where and .
We train the network in a hard-example-mining scheme Shrivastava2016CVPR . Specifically, we start with a dataset of positive examples and a random set of negative examples. The network is trained to convergence on this dataset and subsequently applied to a larger dataset to harvest false positives. Then the network is trained again on the training set augmented with these false positives. The auxiliary subnet is removed when training is finished.
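The training loop described above can be sketched with a deliberately simple stand-in for the network. In this toy version, which is an assumption made purely for illustration, each example is a single scalar score, "training" places a decision threshold midway between the class means, and a false positive is any negative above the threshold. The function names (`train`, `false_positives`, `mine_hard_examples`) and the data are hypothetical.

```python
def train(positives, negatives):
    """Toy stand-in for network training: threshold midway between class means."""
    pos_mean = sum(positives) / len(positives)
    neg_mean = sum(negatives) / len(negatives)
    return (pos_mean + neg_mean) / 2.0


def false_positives(threshold, negative_pool):
    """Negatives that the current model wrongly accepts as positives."""
    return [x for x in negative_pool if x > threshold]


def mine_hard_examples(positives, initial_negatives, negative_pool, rounds=2):
    """Train, harvest false positives from a larger pool, then retrain."""
    negatives = list(initial_negatives)
    threshold = train(positives, negatives)
    for _ in range(rounds):
        hard = [x for x in false_positives(threshold, negative_pool)
                if x not in negatives]
        if not hard:
            break  # no new hard negatives to add
        negatives.extend(hard)
        threshold = train(positives, negatives)
    return threshold, negatives


positives = [0.8, 0.9, 1.0]
initial_negatives = [0.0, 0.1]                # small random subset
negative_pool = [0.0, 0.1, 0.2, 0.5, 0.55]    # larger set used for harvesting

first = train(positives, initial_negatives)
final, negatives = mine_hard_examples(positives, initial_negatives, negative_pool)
```

In this toy run the first model accepts the two hardest negatives; after they are added to the training set, the retrained threshold rejects the whole pool.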
In the test stage, the instance network () is created dynamically by adding a new branch instance-subnet, which is trained online when a person is newly detected. The new instance-subnet is initialized from subnet , and further trained using only the classification loss by setting in Eq. (1).
We collect (=500) positive samples and (=256) negative samples. The intersection-over-union (IoU) overlap ratios of positive and negative samples with this target’s detection bounding box are greater than (=0.5) and less than (=0.3), respectively. Beyond that, we collect positive samples from every other object as negative samples for this new target to make its specific subnet more discriminative. During updating, we exploit hard negative examples for online training in the hard-example-mining scheme. Given a sample , the score measures the similarity between the sample and the person target .
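The IoU-based labeling of training samples can be sketched as follows. The thresholds 0.5 and 0.3 are the values given above; the box format `(x, y, w, h)` with a top-left corner, the function names, and the choice to discard samples that fall between the two thresholds are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_sample(sample, target, pos_thresh=0.5, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or None (ambiguous, not used)."""
    overlap = iou(sample, target)
    if overlap > pos_thresh:
        return 1
    if overlap < neg_thresh:
        return 0
    return None


target = (0.0, 0.0, 10.0, 10.0)
near = (1.0, 0.0, 10.0, 10.0)    # IoU = 90 / 110
middle = (4.0, 0.0, 10.0, 10.0)  # IoU = 60 / 140
far = (8.0, 0.0, 10.0, 10.0)     # IoU = 20 / 180
labels = [label_sample(b, target) for b in (near, middle, far)]
```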
3.3 Instance prediction
In frame , we apply the proposed MBN network for instance prediction tasks. An instance-subnet independently predicts the corresponding target’s location , which consists of center coordinates , width and height . We sample candidates varying in displacement and scale for each target from its previous location . Specifically, a candidate is denoted as , with drawn from a normal distribution whose mean is and covariance is a diagonal matrix with diagonal vector . The candidates of the target will pass the network and get their scores . Most previous works select the candidate with the maximum score as the optimal location. However, this strategy yields unstable predictions, because our features are extracted from a down-sampling layer: candidates with similar locations may be projected to the same region in the feature map and thus get the same feature after RoI pooling. Such instability is more severe for small-sized objects. We use a simple and effective scheme to overcome this problem by averaging all the locations whose scores are over . The predicted location of target is thus calculated as
(3) 
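A minimal sketch of this prediction scheme is given below. Candidates are drawn from an axis-aligned Gaussian around the previous box, and the prediction is the plain average of the candidates whose score exceeds a threshold. The threshold value, the per-dimension standard deviations, the function names and the behavior when no candidate passes (returning `None`) are assumptions; in the actual method the scores come from the target's instance-subnet.

```python
import random


def sample_candidates(prev_box, sigmas, n, rng):
    """Draw n boxes (cx, cy, w, h) around prev_box.

    sigmas holds per-dimension standard deviations, i.e. the square roots of
    the diagonal entries of the covariance matrix.
    """
    return [tuple(rng.gauss(m, s) for m, s in zip(prev_box, sigmas))
            for _ in range(n)]


def predict_location(candidates, scores, score_thresh):
    """Average all candidate boxes whose score is above score_thresh."""
    kept = [c for c, s in zip(candidates, scores) if s > score_thresh]
    if not kept:
        return None  # assumption: no confident candidate in this frame
    return tuple(sum(c[k] for c in kept) / len(kept) for k in range(4))


rng = random.Random(0)
sampled = sample_candidates((100.0, 50.0, 20.0, 40.0), (2.0, 2.0, 0.5, 1.0), 8, rng)

# Hand-made candidates and scores to show the averaging step.
candidates = [(10.0, 10.0, 4.0, 8.0), (12.0, 10.0, 4.0, 8.0), (50.0, 50.0, 4.0, 8.0)]
scores = [0.9, 0.8, 0.1]
location = predict_location(candidates, scores, score_thresh=0.5)
```

Averaging over several confident candidates smooths out the quantization introduced by RoI pooling, where nearby candidates can receive identical features and scores.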
4 Joint State Inference for Tracking
Different states are employed to describe a person target in the video, and Fig. 4 shows the state transitions. A person in the “New” state has been newly detected, and a new identity will be assigned to it (a new instance-subnet will be initialized as well) before it transitions to the “Tracked” state. When a “Tracked” person is not found in a frame, its state is changed to “Lost”. A “Lost” person is still maintained and continues to be searched for, and it transitions back to the “Tracked” state if it is found. However, if a “Lost” person stays in this state for a certain number of frames, it is changed to the “Discarded” state, and all its information (identity and instance-subnet) is removed. Based on the outputs of the MBN, we propose an efficient solver for the joint state inference.
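The state transitions can be written as a small state machine. The sketch below is illustrative only: the class and method names are hypothetical, `max_lost` stands for the maximum number of consecutive lost frames (its value is a preset constant of the method and is chosen arbitrarily here), and counting the first missed frame as lost frame 1 is an assumption.

```python
from enum import Enum


class State(Enum):
    NEW = "New"
    TRACKED = "Tracked"
    LOST = "Lost"
    DISCARDED = "Discarded"


class Track:
    def __init__(self, identity):
        self.identity = identity
        self.state = State.NEW
        self.lost_frames = 0

    def activate(self):
        """New -> Tracked, once identity and instance-subnet are initialized."""
        if self.state is State.NEW:
            self.state = State.TRACKED
        return self.state

    def update(self, found, max_lost):
        """Advance the state by one frame given whether the target was found."""
        if self.state in (State.NEW, State.DISCARDED):
            return self.state
        if found:
            self.state = State.TRACKED
            self.lost_frames = 0
        else:
            self.state = State.LOST
            self.lost_frames += 1
            if self.lost_frames >= max_lost:
                self.state = State.DISCARDED
        return self.state


track = Track(identity=1)
history = [track.activate()]
for found in (False, True, False, False, False, True):
    history.append(track.update(found, max_lost=3))
```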
4.1 Joint association matrix construction
Assume that we maintain tracked person targets and there exist new person observations in frame after applying the proposed MBN network. Let be the targets’ predictions and () be the person observations.
As shown in Fig. 5(a), a conventional association matrix can be constructed, with each element reflecting the pairwise relationship between a prediction and an observation. The association matrix is equivalent to a bipartite graph, with the predictions and observations as nodes and the matrix elements as edge weights. The association problem can thus be solved to obtain the matching pairs with lowest cost via graph optimization methods such as max-flow or the Hungarian algorithm. In our context, a prediction with a matched observation is considered successfully tracked. A prediction (observation) with no match is considered as lost (a new target). However, the aforementioned association matrix easily runs the risk of generating incorrect pairs of prediction and observation.
Therefore, we propose to construct a novel joint association matrix that bridges the joint association optimization with a standard assignment problem. In our formulation, as illustrated in Fig. 5(b), the rows and columns both comprise predictions and observations, and thus predictions (observations) can be assigned not only to their counterparts but also explicitly to themselves. In this way, the joint association matrix can be divided into 4 blocks, each of which has a meaningful interpretation when one of its elements is chosen (i.e., lost, tracked or new target).
To be specific, matrix is defined below:
(4) 
where is a square matrix, with row and column indices representing predictions and new observations. Matrix is composed of 4 blocks, where an element chosen in the submatrix , and implies that the corresponding target’s state is judged as “Lost”, “Tracked” and “New”, respectively. denotes the transpose of .
A type of function , is introduced to measure the pairwise relationship. A larger value of indicates stronger correlation.
In block , we define its element as follows:
(5) 
Here, when a prediction is highly self-associated, we consider it to be lost. For two predictions of different person targets, we do not assign any coupling evidence and set the value to be .
In block , we define its element as follows:
(6) 
where and . The element definition indicates that a target is successfully tracked when it is highly coupled with a person observation.
In block , we define its element as follows:
(7) 
Similar to the definition of the elements in , a person observation that is highly self-associated is considered as a new target. We also do not assign any coupling evidence between any two person observations and set the corresponding value to be .
The essential issue is how to define the functions so that the aforementioned requirements can be satisfied. Many criteria based on multiple cues in the literature, such as appearance and motion, can be exploited. In this paper, we propose to use measurements tightly associated with our MBN network. We define as the sum of two terms:
(8) 
where and are related to the confidence and location outputs of the MBN network, respectively. The three parameters , are preset constants.
In particular, we define
(9) 
where denotes the output confidence by feeding observation into the th instance detector. is an intersectionoverunion function which returns the area ratios of intersection and union between the bounding boxes of and .
Then the terms , , and are defined as follows:
(10)  
(11)  
Specifically, Eq. (10) indicates that a target is considered self-associated (or lost) when its own instance-subnet outputs low confidence and the predicted location is weakly coupled with the observations. Likewise, a person observation is considered self-associated (or a new object), as implied by Eq. (11), when it retrieves low evidence from all available instance-subnets and their predicted locations. We note that the terms and , are all in the range .
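The block structure of the joint association matrix can be illustrated with the following sketch. It takes the pairwise scores as given inputs rather than computing them from the MBN outputs; the argument names (`lost_scores`, `pair_scores`, `new_scores`) are hypothetical, and using negative infinity for the entries that carry no coupling evidence is an assumption.

```python
NEG_INF = float("-inf")


def build_joint_matrix(lost_scores, pair_scores, new_scores):
    """Assemble the (m + n) x (m + n) joint association matrix.

    lost_scores[i]    : self-association score of prediction i ("Lost")
    pair_scores[i][j] : score between prediction i and observation j ("Tracked")
    new_scores[j]     : self-association score of observation j ("New")
    """
    m, n = len(lost_scores), len(new_scores)
    size = m + n
    matrix = [[NEG_INF] * size for _ in range(size)]
    for i in range(m):
        matrix[i][i] = lost_scores[i]              # top-left block, diagonal only
        for j in range(n):
            matrix[i][m + j] = pair_scores[i][j]   # top-right block
            matrix[m + j][i] = pair_scores[i][j]   # bottom-left block (transpose)
    for j in range(n):
        matrix[m + j][m + j] = new_scores[j]       # bottom-right block, diagonal only
    return matrix


joint = build_joint_matrix(
    lost_scores=[0.2, 0.7],
    pair_scores=[[0.9], [0.1]],
    new_scores=[0.3],
)
```

With two predictions and one observation this gives a 3 x 3 matrix that is symmetric by construction, with the off-diagonal entries of the two diagonal blocks excluded from selection.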
4.2 Joint state inference
By constructing the joint association matrix, the joint tracking inference problem of all targets can be converted to an assignment problem by finding an optimal permutation vector consisting of . The energy function is formulated as:
(12) 
where is the th element of and denotes the matrix element in row and column of . Let to be the maximum element of , and replace each element with to obtain the matrix . Then Eq. (12) is equivalent to
(13) 
We solve this energy function efficiently via the Hungarian algorithm munkres1957algorithms .
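The conversion from the maximization in Eq. (12) to the minimization in Eq. (13), and the decoding of target states from the chosen permutation, can be checked on a toy example. In this sketch an exhaustive search over permutations stands in for the Hungarian algorithm, so it is only practical for very small matrices; the matrix values, the function names and the use of negative infinity for forbidden entries are assumptions.

```python
from itertools import permutations

NEG_INF = float("-inf")


def to_cost(matrix):
    """Replace every element by (maximum element - element)."""
    m_max = max(v for row in matrix for v in row)
    return [[m_max - v for v in row] for row in matrix]


def best_permutation(matrix, maximize):
    """Exhaustive assignment search (stand-in for the Hungarian algorithm)."""
    size = len(matrix)
    pick = max if maximize else min
    return pick(permutations(range(size)),
                key=lambda p: sum(matrix[i][p[i]] for i in range(size)))


def decode_states(perm, m):
    """Map a permutation to states; the first m rows are predictions."""
    states = {}
    for row, col in enumerate(perm):
        if row < m and col == row:
            states[("target", row)] = ("Lost", None)
        elif row < m and col >= m:
            states[("target", row)] = ("Tracked", col - m)
        elif row >= m and col == row:
            states[("observation", row - m)] = ("New", None)
    return states


# Two predictions (rows 0-1) and two observations (rows 2-3).
joint = [
    [0.2, NEG_INF, 0.9, 0.1],
    [NEG_INF, 0.7, 0.2, 0.1],
    [0.9, 0.2, 0.1, NEG_INF],
    [0.1, 0.1, NEG_INF, 0.6],
]
perm_max = best_permutation(joint, maximize=True)
perm_min = best_permutation(to_cost(joint), maximize=False)
states = decode_states(perm_max, m=2)
```

Both formulations select the same permutation here: target 0 is matched with observation 0, target 1 is declared lost, and observation 1 starts a new target.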
We will update the instance-subnet when the target is in the “Tracked” state but with . For a person observation that is inferred as “New”, a corresponding branch subnet will be initialized for it. For a target judged to be in the “Lost” state, if it has been in this state for consecutive frames, it will be transferred to the “Discarded” state. Otherwise it will continue to be predicted and participate in the joint inference in the next frame.
Algorithm 1 depicts the procedure of the proposed INARLA framework.
4.3 Assumption validation
There exists a key assumption of selection in . That is, we have to ensure that once the elements in are chosen, the symmetric elements in must be chosen as well, because we incorporate both predictions and observations in rows and columns and thus a matched pair should take two symmetric elements simultaneously. Fortunately, due to the special structure of , this assumption can be validated.
Let us take the joint matrix in Fig. 5(b) for explanation. It can be observed that the elements marked in red form a potential optimal solution, with each occupying a distinct row and column and the elements being symmetric. However, the two elements marked in green on the left and the three elements marked in red on the right also seem to form a plausible optimal solution. We show that this is not the case in our formulation. Assume such an asymmetric solution to be optimal. Let be the sum of elements chosen in and be the sum of elements chosen in . If , it is obvious that we can choose the elements in that are symmetric to those chosen in to get a better solution. This contradicts the optimality assumption. The case is similar when . It is almost impossible that because the matrix elements are floating-point numbers. In the extreme situation that , the problem has multiple optimal solutions even not expressed in our joint matrix. In practice, extensive experimental results show that the optimal solution is symmetric.
5 Experiments
5.1 Experimental settings
Dataset The proposed method is evaluated on the 2D MOT 2015 benchmark dataset leal2015motchallenge , which contains 11 sequences for training and 11 sequences for testing, consisting of sequences filmed by both static and moving cameras in unconstrained environments. The MOT benchmark releases ground truth for the training sequences. The human detection results provided by the benchmark dataset, which were generated by the ACF detector dollar2014fast , are used in our evaluation so as to provide fair comparison with other MPT methods.
Evaluation metrics Multiple metrics are used to evaluate the tracking performance as suggested by the MOT research community bernardin2008evaluating ; RistaniSZCT16eccv , including Multiple Object Tracking Accuracy (MOTA, taking FN, FP and IDS into account), ID F1 Score (IDF1, the ratio of correctly identified detections over the average number of ground-truth and computed detections), Mostly Tracked targets (MT, the ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span), Mostly Lost targets (ML, the ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span), the total number of False Positives (FP), the total number of False Negatives (FN), the total number of ID Switches (IDS), the total number of times a trajectory is Fragmented (Frag), and processing speed (Hz, in frames per second excluding the detector) on the benchmark.
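To make the role of FN, FP and IDS in MOTA concrete, the sketch below uses the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDS) / GT, where GT is the total number of ground-truth boxes. The error counts are those reported for INARLA in Table 1; the ground-truth total of 61440 boxes for the 2D MOT 2015 test set is not stated in this paper and is an assumption here.

```python
def mota(fp, fn, ids, num_gt):
    """Multiple Object Tracking Accuracy (CLEAR MOT definition)."""
    return 1.0 - (fp + fn + ids) / num_gt


NUM_GT_TEST = 61440  # assumed total number of ground-truth boxes on the test set

inarla = 100.0 * mota(fp=9855, fn=29158, ids=1112, num_gt=NUM_GT_TEST)
mdp = 100.0 * mota(fp=9717, fn=32422, ids=680, num_gt=NUM_GT_TEST)
```

Under this assumption the counts reproduce the reported MOTA values (34.7% for INARLA and 30.3% for MDP) after rounding to one decimal place.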
MBN architecture As mentioned in Sect. 3.1, the structure of the backbone-subnet is the same as the lower five layers of CaffeNet used in Fast R-CNN girshick2015fast . Specifically, the five convolutional layers have 96 kernels of size 11×11, 256 kernels of size 5×5, 384 kernels of size 3×3, 384 kernels of size 3×3 and 256 kernels of size 3×3, respectively. The output feature maps of the first two convolutional layers are max-pooled (3×3 kernel) and normalized before being fed into the next layer. Moreover, the outputs of all five layers are immediately filtered by a rectified linear unit (ReLU) before any pooling or normalization operation. Branch subnets, including the det-pruning-subnet and instance-subnets, have the same structure, consisting of an RoI layer and three fully connected layers of sizes 256, 256 and 2, respectively.
Implementation details Our algorithm is implemented in Python using the Caffe platform. The network (backbone-subnet with det-pruning-subnet) is trained on the training set from leal2015motchallenge for 40K SGD iterations and the learning rate is lowered by 0.1 in the last 10K iterations. We double the learning rate for training the instance network for fast adaptation and run for 50 iterations. The images in both the training and testing phases are rescaled so that their shorter side is 600 pixels. We set , , , , , and in the experiments by empirical study. We will further discuss important parameter settings in the ablation study (Sect. 5.3). Our algorithm runs on a PC with 8 cores of 3.70 GHz CPU, and a Tesla K40 GPU.
Algorithm  MOTA(%)↑  IDF1(%)↑  MT(%)↑  ML(%)↓  FP↓  FN↓  IDS↓  Frag↓  Hz↑
SiameseCNN (2016) LealTaixeCS16cvpr  29.0  34.3  8.5  48.4  5160  37798  639  1316  52.8
CNNTCM (2016) WangWSZLCW16cvpr  29.6  36.8  11.2  44.0  7786  34733  712  943  1.7
QuadMOT (2017) SonBCH17cvpr  33.8  40.4  12.9  36.9  7898  32061  703  1430  3.7
TSDA_OAL (2017) JuIET2017  18.6  36.1  9.4  42.3  16350  32853  806  1544  19.7
RNN_LSTM (2016) MilanRDSR16  19.0  17.1  5.5  45.6  11578  36706  1490  2081  165.2
OMT_DFH (2017) ju2017online  21.2  37.3  7.1  46.5  13218  34657  563  1255  28.6
EAMTTpub (2016) eccvMatilla16  22.3  32.8  5.4  52.7  7924  38982  833  1485  12.2
oICF (2016) avssKieritzBHA16  27.1  40.5  6.4  48.7  7594  36757  454  1660  1.4
SCEA (2016) yoon2016online  29.1  37.2  8.9  47.3  6060  36912  604  1182  6.8
MDP (2015) xiang2015learning  30.3  44.7  13.0  38.4  9717  32422  680  1500  1.1
DCCRF (2018) Zhou2018CSVT  33.6  39.1  10.4  37.6  5917  34002  866  1566  0.1
AM (2017) ChuOLWLY17iccv  34.3  48.3  11.4  43.4  5154  34848  348  1463  0.5
INARLA (Ours)  34.7  42.1  12.5  30.0  9855  29158  1112  2848  2.6
The first three rows (SiameseCNN, CNNTCM and QuadMOT) are offline methods.
Sequence  Density (OPF)  Speed (FPS)
ETH-Crossing  4.6  6.1
ETH-Jelmoli  5.8  4.6
ETH-Linthescher  7.5  5.4
KITTI-19  5  4.2
TUD-Crossing  5.5  4.8
KITTI-16  8.1  3.2
ADL-Rundle-3  16.3  2.0
Venice-1  10.1  3.3
ADL-Rundle-1  18.6  1.3
PETS09-S2L2  22.1  1.0
AVG-TownCentre  15.9  1.0
5.2 Benchmark evaluation
We compare our INARLA tracker with nine recent online MPT methods that published their results on the 2D MOT 2015 benchmark, including TSDA_OAL JuIET2017 , RNN_LSTM MilanRDSR16 , OMT_DFH ju2017online , EAMTTpub eccvMatilla16 , oICF avssKieritzBHA16 , SCEA yoon2016online , MDP xiang2015learning , DCCRF Zhou2018CSVT and AM ChuOLWLY17iccv . Among them, RNN_LSTM, DCCRF and AM are deep learning-based methods. We also include three recent deep learning-based offline MPT methods (i.e., SiameseCNN LealTaixeCS16cvpr , CNNTCM WangWSZLCW16cvpr and QuadMOT SonBCH17cvpr ) for comparison. Table 1 summarizes the quantitative comparison results, and the best result in each metric is marked in bold font. The up-arrow next to a metric indicates that higher values are better, while the down-arrow indicates that lower values are better.
Among these metrics, MOTA is an integrated metric that summarizes multiple aspects of tracking performance and is used by the MOT benchmark for ranking the trackers. Our method achieves the highest MOTA among these recent methods, including the deep learning-based ones. Moreover, our method also achieves the best performance in terms of ML and FN, since our network remains robust in the presence of missing detections. This performance demonstrates the advantages of our MBN network and joint state inference solver. However, working in a frame-by-frame way, our method recovers targets judged as “Lost” many times, resulting in a high Frag value. This could be addressed by introducing a proper post-processing strategy. Fig. 1 and Fig. 6 illustrate our tracking results on the test set of the MOT benchmark in static and dynamic scenes, respectively.
Our algorithm runs at around 2.6 frames per second without code optimization. Note that the number of tracked objects actually affects the running speed. Therefore, we show in Table 2 the relationship between the density (objects per frame, OPF) and the processing speed (frames per second, FPS) on each sequence of the test set. It can be inferred from Table 2 that the speed of a single instance tracker roughly ranges from 20 to 30 fps. Due to the properties of our MBN, we are confident that improved processing efficiency can be achieved by parallel implementation in branch subnets.
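The per-instance speed estimate can be reproduced from Table 2 with a back-of-the-envelope calculation. The sketch below assumes that the per-frame cost scales linearly with the number of tracked persons and ignores the shared backbone cost, so that the throughput of a single instance tracker is approximately density times frame rate. This simplification is ours and serves only as a sanity check.

```python
from statistics import median

# (density in objects per frame, speed in frames per second) from Table 2
table2 = {
    "ETH-Crossing": (4.6, 6.1),
    "ETH-Jelmoli": (5.8, 4.6),
    "ETH-Linthescher": (7.5, 5.4),
    "KITTI-19": (5.0, 4.2),
    "TUD-Crossing": (5.5, 4.8),
    "KITTI-16": (8.1, 3.2),
    "ADL-Rundle-3": (16.3, 2.0),
    "Venice-1": (10.1, 3.3),
    "ADL-Rundle-1": (18.6, 1.3),
    "PETS09-S2L2": (22.1, 1.0),
    "AVG-TownCentre": (15.9, 1.0),
}

# Instance evaluations per second under the linear-scaling assumption.
per_instance = {name: density * fps for name, (density, fps) in table2.items()}

typical = median(per_instance.values())
in_range = sum(1 for v in per_instance.values() if 20.0 <= v <= 30.0)
```

Under this assumption the median is about 26 instance evaluations per second, and 7 of the 11 sequences fall between 20 and 30, with AVG-TownCentre (about 16) and ETH-Linthescher (about 40) as the extremes.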
5.3 Ablation study
The contributions of different components in our method are assessed on the 2D MOT 2015 benchmark. The ablation study is conducted on the training set because the annotations of the test set are not released and the benchmark webpage limits evaluation submissions (a user can only post a submission every three days and submit no more than 3 times in total). The 11 training sequences are partitioned into training and validation subsets to analyze the proposed algorithm, with 5 sequences (TUD-Stadtmitte, ETH-Bahnhof, ADL-Rundle-8, PETS09-S2L1, KITTI-13) for training and the rest for validation.
Table 3 reports the quantitative evaluation results of different versions of our MPT method in ablation study. The results of the full version of our method, which contains all the proposed components, are shown in the last row of the table. Below we evaluate and analyze each component of the proposed MPT method in detail.
Version          MOTA(%)  IDF1(%)  MT(%)  ML(%)  FP    FN     IDS  Frag
no_aux_loss      40.2     48.3     20.9   33.9   3286  9233   223  510
no_pruning       32.2     29.9     17.0   35.2   3549  10286  605  779
no_update        38.7     45.5     18.3   34.8   3451  9404   216  488
only_IoU         38.5     44.1     19.6   36.5   3164  9711   237  458
only_confidence  36.4     43.1     16.5   39.1   3220  10066  267  464
balance_learned  39.7     47.6     20.4   35.7   3370  9258   219  467
greedy           25.4     31.8     17.4   37.8   5579  9726   597  753
with vgg16       39.0     46.7     19.1   36.1   3058  9708   242  489
with vgg_m       40.6     47.3     19.1   36.5   3031  9397   237  487
full             41.1     48.7     21.7   35.7   3097  9248   201  461
1) MBN network: The offline training of our MBN network is augmented with an auxiliary subnet in a multi-task optimization scheme, as described in Sect. 3.2, which aims to make the MBN network more discriminative for our MPT task. To evaluate its effectiveness, we remove the auxiliary subnet (disabling its loss term in Eq. (1)) during offline model training, and this version of our method is termed “no_aux_loss”. From Table 3, we can observe that its MOTA drops by about 1 percentage point (from 41.1% to 40.2%), with most of the other metrics also degraded. The increase in FP (from 3097 to 3286) indicates that more false human detections are retained. These results demonstrate the positive role of the auxiliary subnet.
The “no_pruning” version of our method omits the det-pruning-subnet, which aims to filter out false human detections. As can be observed, its MOTA drops dramatically to 32.2%, a decrease of 8.9 percentage points. The FP metric increases from 3097 to 3549. A sharp degradation can be seen in most of the metrics, which demonstrates the effectiveness of the det-pruning-subnet.
The instance-subnets of our MBN network are dynamically added and trained online. They are also updated during tracking so as to adapt to appearance changes of the corresponding human instances. In the “no_update” version, an instance-subnet is not updated after its initial training. As shown in Table 3, the deterioration in all the metrics except ML reveals the importance of the online update.
2) Association matrix: The second group of rows in Table 3 evaluates the effectiveness of our data association component, which builds upon the constructed association matrix. As given by Eq. (8), each element of the association matrix involves two terms (i.e., the output confidence and the IoU) and three parameters. We carry out experiments to evaluate their influence on our method’s performance.
The “only_confidence” and “only_IoU” versions of our method keep only the confidence term or only the IoU-related term of Eq. (8), respectively, by setting the parameters so that the other term is disabled. Table 3 shows performance degradation for both versions in nearly all the metrics; the only exception is the Frag value of “only_IoU” (458), which is marginally lower than that of “full” (461). We can also infer that the IoU-related term has a larger impact on our method’s performance, because “only_IoU” performs better than “only_confidence” in every evaluation metric.
We further discuss the problem of balancing the two terms in Eq. (8), i.e., choosing the best values for its parameters. We tried a balance-learning scheme to find the optimal parameter setting. The scheme is designed as follows. Given an initial parameter setting, the proposed algorithm is run on the training set. In every frame, we check the ground truth for each pair evaluated by the function in Eq. (8), and set the expected output of the function to 1 if the pair is matched and 0 otherwise. We then learn the parameters by minimizing the sum of squared errors between the actual and expected outputs. The process is repeated for several iterations, with the learned parameter values used as the new initial setting. The best results of this balance-learning scheme are shown in Table 3 as “balance_learned”. As can be seen, this scheme does not work well: it performs worse than the “full” version, in which the parameters are set manually by empirical study. In future work, we will try new schemes to handle this problem.
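One iteration of such a least-squares fit can be sketched as follows. This is only an illustration under a strong assumption: the score is taken to be a plain weighted sum of the confidence and IoU terms, which need not match the actual form of Eq. (8). The names `fit_balance_weights`, `w_conf` and `w_iou` are hypothetical.

```python
from typing import List, Tuple


def fit_balance_weights(samples: List[Tuple[float, float, int]]) -> Tuple[float, float]:
    """Least-squares fit of (w_conf, w_iou) for score = w_conf*conf + w_iou*iou.

    Each sample is (confidence, iou, label), where label is 1 for a
    ground-truth match and 0 otherwise. The linear score form is an
    assumption of this sketch, not the paper's Eq. (8).
    """
    s_cc = sum(c * c for c, _, _ in samples)
    s_uu = sum(u * u for _, u, _ in samples)
    s_cu = sum(c * u for c, u, _ in samples)
    s_cy = sum(c * y for c, _, y in samples)
    s_uy = sum(u * y for _, u, y in samples)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s_cc * s_uu - s_cu * s_cu
    if abs(det) < 1e-12:
        raise ValueError("degenerate samples: normal equations are singular")
    w_conf = (s_cy * s_uu - s_cu * s_uy) / det
    w_iou = (s_cc * s_uy - s_cu * s_cy) / det
    return w_conf, w_iou


# Toy samples, for illustration only.
toy = [(0.9, 0.8, 1), (0.2, 0.1, 0), (0.7, 0.6, 1), (0.4, 0.05, 0)]
print(fit_balance_weights(toy))
```

In the iterative scheme described above, the tracker would be re-run with the fitted weights to collect a new set of samples before the next fit.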
To further analyze the contribution of our association component, we replace it with a simple greedy association algorithm: in the association stage, a new person observation is assigned to the tracked target whose bounding box has the largest IoU with it. This version of our method is termed “greedy”. As shown in Table 3, its performance worsens sharply in all the metrics, which in turn reveals the significant role of the proposed association component.
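The greedy baseline can be sketched roughly as below. The box format `(x1, y1, x2, y2)`, the `min_iou` cut-off, and the choice to let several observations map to the same target are assumptions of this sketch; the text does not specify them.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def greedy_associate(
    detections: List[Box], targets: Dict[int, Box], min_iou: float = 0.0
) -> Dict[int, int]:
    """Assign each detection to the target with the largest IoU.

    Returns {detection index: target id}. A detection whose best IoU does
    not exceed min_iou (a hypothetical cut-off) stays unassigned.
    """
    assignment: Dict[int, int] = {}
    for d_idx, det in enumerate(detections):
        best_id: Optional[int] = None
        best_iou = min_iou
        for t_id, box in targets.items():
            overlap = iou(det, box)
            if overlap > best_iou:
                best_id, best_iou = t_id, overlap
        if best_id is not None:
            assignment[d_idx] = best_id
    return assignment


# Toy boxes, for illustration only.
tracked = {1: (0.0, 0.0, 10.0, 10.0), 2: (20.0, 20.0, 30.0, 30.0)}
observed = [(1.0, 1.0, 11.0, 11.0), (21.0, 19.0, 31.0, 29.0), (100.0, 100.0, 110.0, 110.0)]
print(greedy_associate(observed, tracked))
```

Each observation is decided independently here, so nothing prevents conflicting assignments; this is the kind of ambiguity that a joint association over the whole matrix is meant to resolve.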
3) Choices of backbone-subnet: As described in Sect. 5.1, the backbone-subnet of our MBN network is CaffeNet, a small-scale neural network. Here we try other choices for the backbone-subnet to evaluate their impact on performance. Specifically, we use the vgg_cnn_m_1024 ChatfieldSVZ14bmvc and vgg16 SimonyanZ14acorr network models as the backbone-subnet. The vgg_cnn_m_1024 model is as deep as CaffeNet but wider, and the vgg16 model is very deep, with 16 layers. The corresponding versions of our method are termed “with vgg_m” and “with vgg16” in Table 3. It can be observed that “with vgg_m” performs almost the same as “full”, with a 0.5% decrease in MOTA. However, “with vgg16” shows a larger degradation, with MOTA decreased by 2.1%. We visualized the tracking results and analyzed them in depth, and found that the “with vgg16” version did not work well on small-sized persons. This may be because a small image region contains fewer of the appearance details that are important for discriminating instances of the same category (e.g., human), and the features extracted by the deep vgg16 model reflect those details less well, since the vgg16 architecture suppresses subtle features more strongly (e.g., it has more max-pooling layers than CaffeNet), as also reported in previous work Wang2015Visual ; LiWLL18jcam . It is also worth noting that the “full”, “with vgg_m” and “with vgg16” versions run at about 2.7, 2.2 and 1.6 FPS on average on the validation set, respectively. This comparison shows that the “full” version offers the best MOTA and the highest speed among the three versions.
4) Update threshold: To evaluate the influence of the update threshold on our method’s performance, we change its value while fixing the values of the other parameters. The results are plotted in Fig. 7 with four metrics (MOTA, IDF1, MT and ML), expressed as percentages. As can be observed, the performance is best when the update threshold is around 0.5, but it does not change sharply as the threshold varies.
6 Conclusion
In this paper, we have introduced a novel deep learning-based online multi-person tracking approach that emphasizes instance-aware representation learning with the MBN network. While the backbone-subnet provides robust deeply learned image features, the instance-subnets perform instance-level appearance discrimination to reduce ambiguities between different targets and relieve the burden of the subsequent data association. We construct an association matrix based on the outputs of the MBN network for joint state inference of the targets, for which a simple yet effective solver is developed thanks to the powerful support of the MBN. The effectiveness of our approach is verified through extensive experimental comparison with recent MPT methods.
There are several directions in which the proposed INARLA framework can be improved in the future. First, the backbone-subnet of our MBN network will be enhanced to make its extracted features more robust and discriminative. In particular, our approach could handle small-sized objects better if the feature extraction process adapted to different object sizes. Second, a more efficient model should be designed for the instance-subnet, because we found in experiments that online training and updating of the instance-subnets often take more than half of the total processing time, even though the instance-subnet in our MBN network has a lightweight structure. Recent works show that correlation filter models can achieve good accuracy at high running speed in single object tracking. We will investigate incorporating such models into our MBN network, since they also involve convolution. Third, more effort will be devoted to the state inference procedure. We will investigate more effective terms for composing the elements of the association matrix and explore new data association algorithms for the online MPT task. Moreover, we intend to extend our work to incorporate full-category detection and form a unified framework.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (61876045, U1811463), Zhujiang Science and Technology New Star Project of Guangzhou (201906010057), the Major Program of Science and Technology Planning Project of Guangdong Province (2017B010116003), and Guangdong Natural Science Foundation (2016A030313285). The authors would like to thank Shiyi Hu and Xu Cai who partly joined this work when they were graduate students at Sun Yat-sen University.
References
 (1) L. Lin, G. Wang, W. Zuo, X. Feng, L. Zhang, Cross-domain visual matching via generalized similarity measure and feature learning, TPAMI 39 (6) (2016) 1089–1102.
 (2) J. Xie, G. Dai, F. Zhu, E. K. Wong, Y. Fang, Deepshape: Deep-learned shape descriptor for 3D shape retrieval, TPAMI 39 (7) (2017) 1335–1345.
 (3) Y. Wu, F. Yin, C. Liu, Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models, Pattern Recognition 65 (2017) 251–264.
 (4) A. Milan, L. Leal-Taixé, K. Schindler, I. Reid, Joint tracking and segmentation of multiple targets, in: CVPR, 2015, pp. 5397–5406.
 (5) S. Wang, C. Fowlkes, Learning optimal parameters for multi-target tracking, in: BMVC, 2015.
 (6) J. Munkres, Algorithms for the assignment and transportation problems, Journal of the Society for Industrial and Applied Mathematics 5 (1) (1957) 32–38.
 (7) W. Choi, S. Savarese, Multiple target tracking in world coordinate with single, minimally calibrated camera, in: ECCV, 2010, pp. 553–567.
 (8) N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005, pp. 886–893.
 (9) A. Andriyenko, K. Schindler, Multi-target tracking by continuous energy minimization, in: CVPR, 2011, pp. 1265–1272.
 (10) J. Qian, J. Yang, G. Gao, Discriminative histograms of local dominant orientation (DHLDO) for biometric image feature extraction, Pattern Recognition 46 (10) (2013) 2724–2739.
 (11) S. H. Bae, K. J. Yoon, Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning, in: CVPR, 2014, pp. 1218–1225.
 (12) B. Benfold, I. D. Reid, Stable multi-target tracking in real-time surveillance video, in: CVPR, 2011, pp. 3457–3464.
 (13) H. Li, H. Wu, H. Zhang, S. Lin, X. Luo, R. Wang, Distortion-aware correlation tracking, IEEE TIP 26 (11) (2017) 5421–5434.
 (14) S. Kim, S. Kwak, J. Feyereisl, B. Han, Online multi-target tracking by large margin structured learning, in: ACCV, 2012, pp. 98–111.
 (15) P. Lenz, A. Geiger, R. Urtasun, et al., FollowMe: Efficient online min-cost flow tracking with bounded memory and computation, in: ICCV, 2015, pp. 4364–4372.
 (16) H. Wu, C. Gao, Y. Cui, R. Wang, Multipoint infrared laser-based detection and tracking for people counting, Neural Computing and Applications 29 (5) (2018) 1405–1416.
 (17) P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, TPAMI 32 (9) (2010) 1627–1645.
 (18) L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convolutional networks, in: ICCV, 2015, pp. 3119–3127.
 (19) H. Li, H. Wu, S. Lin, X. Luo, Coupling deep correlation filter and online discriminative learning for visual object tracking, Journal of Computational and Applied Mathematics 329 (2018) 191–201.
 (20) Z. Cui, S. Xiao, J. Feng, S. Yan, Recurrently target-attending tracking, in: CVPR, 2016.
 (21) P. Ondruska, I. Posner, Deep tracking: Seeing beyond seeing using recurrent neural networks, in: AAAI, 2016, pp. 3361–3368.
 (22) L. Leal-Taixé, C. Canton-Ferrer, K. Schindler, Learning by tracking: Siamese CNN for robust target association, in: CVPR Workshops, 2016.
 (23) A. Gaidon, E. Vig, Online domain adaptation for multi-object tracking, in: BMVC, 2015, pp. 1–13.
 (24) J. H. Yoon, C.-R. Lee, M.-H. Yang, K.-J. Yoon, Online multi-object tracking via structural constraint event aggregation, in: CVPR, 2016.
 (25) S. Tang, M. Andriluka, B. Andres, B. Schiele, Multiple people tracking by lifted multicut and person re-identification, in: CVPR, 2017.
 (26) X. Wang, E. Türetken, F. Fleuret, P. Fua, Tracking interacting objects using intertwined flows, TPAMI 38 (11) (2016) 2312–2326.
 (27) B. Leibe, K. Schindler, L. J. V. Gool, Coupled detection and trajectory estimation for multi-object tracking, in: ICCV, 2007, pp. 1–8.
 (28) X. Wang, E. Türetken, F. Fleuret, P. Fua, Tracking interacting objects optimally using integer programming, in: ECCV, 2014, pp. 17–32.
 (29) A. Maksai, X. Wang, F. Fleuret, P. Fua, Non-Markovian globally consistent multi-object tracking, in: ICCV, 2017, pp. 2563–2573.
 (30) Y. Xiang, A. Alahi, S. Savarese, et al., Learning to track: Online multi-object tracking by decision making, in: ICCV, 2015, pp. 4705–4713.
 (31) A. Milan, S. H. Rezatofighi, A. R. Dick, K. Schindler, I. D. Reid, Online multi-target tracking using recurrent neural networks, ArXiv (2016) abs/1604.03635.
 (32) R. Girshick, Fast R-CNN, in: ICCV, 2015, pp. 1440–1448.
 (33) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
 (34) H. Yun, P. Raman, S. V. N. Vishwanathan, Ranking via robust binary classification, in: NIPS, 2014, pp. 2582–2590.
 (35) A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: CVPR, 2016.
 (36) L. Leal-Taixé, A. Milan, I. Reid, S. Roth, K. Schindler, MOTChallenge 2015: Towards a benchmark for multi-target tracking, ArXiv (2015) abs/1504.01942.
 (37) P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection, TPAMI 36 (8) (2014) 1532–1545.
 (38) K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, EURASIP Journal on Image and Video Processing (2008) 1–10.
 (39) E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, C. Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: ECCV Workshops, 2016, pp. 17–35.
 (40) L. Leal-Taixé, C. Canton-Ferrer, K. Schindler, Learning by tracking: Siamese CNN for robust target association, in: CVPR Workshops, 2016.
 (41) B. Wang, L. Wang, B. Shuai, Z. Zuo, T. Liu, K. L. Chan, G. Wang, Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association, in: CVPR Workshops, 2016.
 (42) J. Son, M. Baek, M. Cho, B. Han, Multi-object tracking with quadruplet convolutional neural networks, in: CVPR, 2017, pp. 3786–3795.
 (43) J. Ju, D. Kim, B. Ku, D. K. Han, H. Ko, Online multi-person tracking with two-stage data association and online appearance model learning, IET Computer Vision 11 (2017) 87–95.
 (44) J. Ju, D. Kim, B. Ku, D. K. Han, H. Ko, Online multi-object tracking with efficient track drift and fragmentation handling, J. Opt. Soc. Am. A Opt. Image Sci. Vis. 34 (2) (2017) 280–293.
 (45) R. Sanchez-Matilla, F. Poiesi, A. Cavallaro, Online multi-target tracking with strong and weak detections, in: ECCV Workshops, 2016, pp. 84–99.
 (46) H. Kieritz, S. Becker, W. Hübner, M. Arens, Online multi-person tracking using integral channel features, in: AVSS, 2016, pp. 122–130.
 (47) H. Zhou, W. Ouyang, J. Cheng, X. Wang, H. Li, Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking, IEEE TCSVT (2018) 1–12, online available.
 (48) Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, N. Yu, Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism, in: ICCV, 2017, pp. 4846–4855.
 (49) K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: BMVC, 2014.
 (50) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, ArXiv (2014) abs/1409.1556.