Introduction
Spatiotemporal data filtering plays a key role in numerous security, remote sensing, surveillance, automation and forecasting algorithms. As one of the most important steps in a sequential filtering task, the prediction (estimation) of the state variables provides important insight about past, present and future data. In a multitarget filtering and tracking (MTFT) problem in particular, the prediction step conveys past information about the latent state variables and suggests target proposals. At the core of the Bayesian filtering paradigm, motion models are used to perform this task. The correction (update) stage is then applied over these predictions through the state-to-measurement space mapping. The Kalman filter assumes linear motion models with Gaussian distributions for both the prediction and update steps. Using the Taylor series expansion and deterministic approximation of non-Gaussian distributions, non-linearity and non-Gaussian behaviour are addressed by the Extended and Unscented Kalman Filters (EKF, UKF), respectively. Using the importance sampling principle, particle filters are also used to estimate the likelihood and posterior densities, addressing non-linearity and non-Gaussian behaviour
[Vo, Singh, and Doucet 2005; Moratuwage et al. 2014]. Mahler proposed random finite sets (RFS) [Mahler 2003], which provide an encapsulated formulation of multitarget filtering, incorporating clutter densities and the probabilities of target detection, survival and birth. To this end, targets and measurements are assumed to form sets with random, variable cardinalities. One approach to representing the target state is to use Probability Hypothesis Density (PHD) maps [Vo, Singh, and Doucet 2005; Vo and Ma 2006]. Vo and Ma proposed the Gaussian Mixture PHD (GM-PHD) filter, which propagates the first-order statistical moments to estimate the posterior as a mixture of Gaussians [Vo and Ma 2006]. While GM-PHD is based on Gaussian distributions, a particle filter-based solution is proposed by the Sequential Monte Carlo PHD (SMC-PHD) filter to address non-Gaussian distributions [Vo, Singh, and Doucet 2005]. Since a large number of particles has to be propagated during SMC-PHD, the computational complexity can be high and gating might hence be necessary [Moratuwage et al. 2014]. The Cardinalised PHD (CPHD) filter was proposed to also propagate the RFS cardinality over time [Mahler 2007], while Nagappa et al. (2017) addressed its intractability [Nagappa et al. 2017]. The Labelled Multi-Bernoulli (LMB) filter [Reuter et al. 2014] performs track-to-track association and outperforms previous algorithms in the sense of not relying on a high signal-to-noise ratio (SNR). Vo et al. proposed the Generalized Labelled Multi-Bernoulli (GLMB) filter as a labelled multitarget filter [Vo, Vo, and Phung 2014].

Since the incoming data is (usually) noisy, cluttered and varies with time, an a priori definition of a motion model applicable to all of the targets is not always straightforward. Failure to accurately formulate the targets' state-to-state transition functions used by the Bayesian filters explained above can cause erroneous predictions, especially for complex target motions (or when projected onto the measurement frames). A robust filtering algorithm should hence be capable of learning
such (multitarget) motion behaviour, enabling accurate predictions for the following time steps.

Recently, machine learning has seen the rise of deep learning methods, achieving state-of-the-art results in many fields, from image classification via convolutional neural networks (CNNs) [Krizhevsky, Sutskever, and Hinton 2012] to natural language processing via recurrent neural networks (RNNs) [Graves, Mohamed, and Hinton 2013]. CNNs can learn the underlying spatial information by sliding learnable filters over the input data. On the other hand, as non-linear dynamical systems [Bay, Lepsoy, and Magli 2016], RNNs can store and exploit past information through feedback loops. The cyclic connections between units make RNNs suitable for predicting temporal sequences, training the network in such a way that it uses its current outputs as inputs for the following steps. The long short-term memory (LSTM) architecture [Hochreiter and Schmidhuber 1997] was introduced to resolve the vanishing gradient phenomenon during the training of RNNs. While RNN and CNN networks are capable of learning the temporal and spatial information from the input signals, respectively, their use for multitarget data analysis is not straightforward. Since in an MTFT problem the number of targets is constantly changing, the motion cannot easily be modelled using a network with a fixed architecture (neural networks usually have a fixed and predefined number of input and output neurons). One solution is to allocate an LSTM network to each target
[Emambakhsh, Bay, and Vazquez 2018], which can significantly increase the computational complexity.

In this paper, we propose a solution which addresses both of the above problems, namely: (1) the use of fixed models by the Bayesian filtering methods; and (2) the difficulty of utilising deep neural networks for problems with variable input/output sizes (cardinalities), such as MTFT. We reformulate the MTFT problem from an explicit discrete multi-state vector estimation to an implicit multi-dimensional spatiotemporal prediction. While the former treats MTFT as a variable (unknown) discrete number of nodes over the state space, the latter performs MTFT implicitly over a continuous state space (a similar explicit vs. implicit intuition exists in Snakes vs. level-set active contours for image segmentation [Emambakhsh, Ebrahimnezhad, and Sedaaghi 2010], or k-means vs. hierarchical clustering for pattern recognition). Our approach is based on defining probability density difference (PDD) maps, which encapsulate the multitarget information flow over the state space. Then, using a network of deep convolutional LSTMs (ConvLSTM [Shi et al. 2015]) as a regression block, the spatiotemporal prediction is learned and estimated: the spatial dependencies over the multitarget state space are modelled by the convolutional filters, while the temporal dependencies are learned using the LSTM's recurrence. Our extensive experimental results over several pedestrian tracking benchmarks show the remarkable potential of our MTFT algorithm. The definitions of the metrics used, the hyperparameter evaluation and video samples are provided in our Supplementary Material.

Scientific contributions:
• Compared to RFS Bayesian algorithms [Mahler 2007; Nagappa et al. 2017; Reuter et al. 2014; Vo, Vo, and Phung 2014; Vo, Vo, and Hoang 2017], our proposed method models the multitarget motion by learning from the incoming data.
• The use of the state space, the PDD maps and LSTM networks enables our algorithm to memorise long-term dependencies, as opposed to detect-to-track methods [Henriques et al. 2015]. To the best of our knowledge, our proposed MTFT algorithm is one of the first methods which implicitly performs multitarget spatiotemporal prediction by integrating RFS and ConvLSTM.
• Unlike [Emambakhsh, Bay, and Vazquez 2018], which performs prediction by allocating an LSTM network to each target, our MTFT approach simultaneously estimates the state variables for all targets, which significantly increases the computational speed (fps). A qualitative performance of our algorithm at a glance is shown in Fig. 1.
On the mathematical notation: Throughout this paper, we use italic notation for scalars, RFSs, probability density functions (PDFs) and PDD maps. We use bold text for vectors, matrices and tuples. Subscripts and superscripts indicate the time steps for RFSs and for scalars/vectors/matrices, respectively.
Probability density difference (PDD) maps
Let us define the target state RFS at the $k^{th}$ time step as $X_k = \{\mathbf{x}^k_1, \mathbf{x}^k_2, \ldots, \mathbf{x}^k_{N_k}\}$, where $N_k$ represents the number of targets (set cardinality). Each $\mathbf{x}^k_i = \left(\mathbf{m}^k_i, \mathbf{P}^k_i, w^k_i, \ell^k_i, a^k_i, \mathbf{v}^k_i\right)$ is the $i^{th}$ target state tuple containing the mean state vector $\mathbf{m}^k_i$ over a $d$-dimensional state space, covariance matrix $\mathbf{P}^k_i$, Gaussian mixture weight $w^k_i$, integer track label $\ell^k_i$, target age $a^k_i$ (the higher the age, the longer the target has survived) and motion vector $\mathbf{v}^k_i$ (along the target's movement direction).
The target state RFS $X_k$ can be used to create a density function over the (hypothetical) continuous target state $\mathbf{x}$, as $D_k(\mathbf{x}) = \sum_{i=1}^{N_k} w^k_i\, \mathcal{N}\!\left(\mathbf{x}; \mathbf{m}^k_i, \mathbf{P}^k_i\right)$, in which $\mathcal{N}(\cdot)$ is a Normal distribution over $\mathbf{x}$ [Vo and Ma 2006]. $D_k(\mathbf{x})$ peaks where the target RFS is located. Moreover, each $w^k_i$ is assigned such that the following condition is satisfied,

$$\int D_k(\mathbf{x})\, d\mathbf{x} = N_k, \qquad (1)$$

which indicates that the expected number of targets $N_k$ is obtained when $D_k(\mathbf{x})$ is integrated over the target state space [Mahler 2003; Vo and Ma 2006]. When both of these properties are satisfied, $D_k(\mathbf{x})$ represents a GM-PHD function [Vo and Ma 2006]. We define a PDD as the difference between two consecutive GM-PHD functions as follows,
$$S_k(\mathbf{x}) = D_k(\mathbf{x}) - D_{k-1}(\mathbf{x}). \qquad (2)$$
While the PHD function $D_k(\mathbf{x})$ conveys the latent target state information up to the $k^{th}$ iteration, the PDD function $S_k(\mathbf{x})$ contains the target state flow information between the two consecutive time steps $k-1$ and $k$, emphasising the most recent variations.
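As a concrete illustration, the sketch below evaluates a GM-PHD intensity on a discretised 2D state-space grid and forms the PDD map of (2). This is a minimal NumPy sketch, not the paper's implementation: the grid resolution, component parameters and function names (`gmphd_map`, `pdd_map`) are illustrative assumptions.

```python
import numpy as np

def gmphd_map(means, covs, weights, grid_x, grid_y):
    """Evaluate a GM-PHD intensity D_k on a 2D grid.

    Each component is a weighted 2D Gaussian; on a fine enough grid the
    map integrates (approximately) to the sum of the weights, i.e. the
    expected number of targets -- condition (1).
    """
    xx, yy = np.meshgrid(grid_x, grid_y)
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1)      # (P, 2) grid points
    D = np.zeros(pts.shape[0])
    for m, P, w in zip(means, covs, weights):
        diff = pts - m
        Pinv = np.linalg.inv(P)
        # quadratic form (x - m)^T P^{-1} (x - m) for every grid point
        expo = -0.5 * np.einsum('pi,ij,pj->p', diff, Pinv, diff)
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(P)))
        D += w * norm * np.exp(expo)
    return D.reshape(xx.shape)

def pdd_map(D_k, D_km1):
    """PDD of (2): difference of two consecutive GM-PHD maps."""
    return D_k - D_km1
```

Summing the map times the grid-cell area approximates the integral in (1), so the expected number of targets can be recovered from the discretised map.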
MTFT pipeline
There is a temporal correlation between a sequence of consecutive PDD maps. Also, assuming a 2D target state, a PDD can be viewed as a texture image, in which the pixel values are functions of their location. The core of our proposed MTFT algorithm is to learn this latent spatiotemporal information within the PDD maps using a ConvLSTM network: the spatial dependencies between the hypothetical target states are learned using convolutional filters, while the network's recurrence extracts the temporal dependencies. To be more specific, we use the ConvLSTM as a spatiotemporal regression block, predicting the next PDD at every time step. Using this approach, motions (both linear and non-linear) are learned by the network, simultaneously for all targets. The overall pipeline of our proposed MTFT algorithm is illustrated in Fig. 2 and explained in detail in the following sections.
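To make the regression block concrete, below is a minimal single-channel ConvLSTM cell step in NumPy, following the gating equations of [Shi et al. 2015]. The actual model is built with Keras/TensorFlow; the 3x3 kernels, single hidden channel and scalar biases here are simplifying assumptions.

```python
import numpy as np

def conv2d_same(img, kernel):
    """'Same'-padded 2D cross-correlation for a single-channel map."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, params):
    """One ConvLSTM time step on a 2D map (e.g. a PDD map).

    The gates are convolutions of the input x and the hidden state h, so
    the cell state keeps the spatial layout of the state space while the
    recurrence carries temporal information between consecutive maps.
    """
    i = sigmoid(conv2d_same(x, params['Wxi']) + conv2d_same(h, params['Whi']) + params['bi'])
    f = sigmoid(conv2d_same(x, params['Wxf']) + conv2d_same(h, params['Whf']) + params['bf'])
    o = sigmoid(conv2d_same(x, params['Wxo']) + conv2d_same(h, params['Who']) + params['bo'])
    g = np.tanh(conv2d_same(x, params['Wxg']) + conv2d_same(h, params['Whg']) + params['bg'])
    c_new = f * c + i * g          # convolutional cell-state update
    h_new = o * np.tanh(c_new)     # hidden state, same spatial size as x
    return h_new, c_new
```

Stacking such steps over a batch of PDD maps is what lets the network regress the next map from the previous ones.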
Online training step
The multitarget motion model is learned during the online training step. A batch of PDD maps is used to train a ConvLSTM model, as shown in Fig. 3. During training, a Kullback-Leibler (KL) divergence loss is minimised, defined as follows,
$$\mathcal{L}_{\mathrm{KL}} = \int \bar{S}_k(\mathbf{x}) \log \frac{\bar{S}_k(\mathbf{x})}{\bar{\hat{S}}_k(\mathbf{x})}\, d\mathbf{x}, \qquad (3)$$

in which $\bar{S}_k$ and $\bar{\hat{S}}_k$ are the normalised (integrating to one) versions of $S_k$ and $\hat{S}_k$, respectively. $S_k$ is computed via (2), using the PHD functions at the $(k-1)^{th}$ and $k^{th}$ time steps, and $\hat{S}_k$ is the PDD predicted by the network at the current epoch. $\mathcal{L}_{\mathrm{KL}}$ treats its inputs as PDFs and computes the relative entropy between the two distributions. This trained ConvLSTM network (which is effectively a spatiotemporal regression block) is then used to predict the multitarget state map, as explained in the next section.
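A discretised sketch of the loss in (3) is given below. Since PDD values can be negative, the maps are clipped to be non-negative before normalisation; this clipping scheme and the numerical epsilon are assumptions, not details from the paper.

```python
import numpy as np

def kl_loss(pdd_true, pdd_pred, eps=1e-12):
    """KL divergence between two PDD maps, treated as PDFs.

    Both maps are clipped to be non-negative and normalised to sum to
    one, matching the 'normalised versions' used in the training loss.
    """
    p = np.clip(pdd_true, 0.0, None).ravel() + eps
    q = np.clip(pdd_pred, 0.0, None).ravel() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

The loss is zero when the predicted map matches the target map and strictly positive otherwise, which is what drives the online training.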
Prediction and implicit to explicit representation
$S_k$ (i.e. the output PDD of Fig. 3) is given to the trained ConvLSTM network to compute an initial predicted PDD $\hat{S}_{k+1}$ (i.e. the input for Fig. 2). When $\hat{S}_{k+1}$ is summed with the PHD function $D_k$, it gives the (initial) predicted PHD map (see (2)). Due to the non-linearity and the weight/bias multiplications/additions imposed on the input data by the neural network, the output may not satisfy the PHD conditions (1). Also, because of the padding effect caused by the convolutional filters, artifacts may be added to the borders of the output data. In order to resolve these issues, the boundary values of the predicted map are replaced with the median of its inner parts. Moreover, we assume that the prediction step does not alter the number of targets. Since the number of targets is equal to the integral of the PHD function (see (1)), after the median filtering is applied, the output map is normalised such that it integrates to $N_k$. The result of this post-processing step is the predicted PHD function $D_{k+1|k}(\mathbf{x})$, which is then used to extract the predicted target state RFS $X_{k+1|k}$. The peaks of the PHD function correspond to the current target mean vectors. Therefore, in order to extract the explicit target states from $D_{k+1|k}(\mathbf{x})$, its peaks are first found as follows,

$$\mathcal{M}_{k+1|k} = \mathrm{peaks}\!\left(D_{k+1|k}(\mathbf{x}),\, N_k\right), \qquad (4)$$
where $\mathrm{peaks}(\cdot, N_k)$ computes the $N_k$ highest peaks of the input PHD function. $\mathcal{M}_{k+1|k}$ is an RFS with cardinality $N_k$, containing the predicted $d$-dimensional target state mean vectors. The peak values of $D_{k+1|k}(\mathbf{x})$ correspond to the GM-PHD weights, which are computed as follows,

$$\mathcal{W}_{k+1|k} = D_{k+1|k}\!\left(\mathcal{M}_{k+1|k}\right), \qquad (5)$$

where $\mathcal{W}_{k+1|k}$ is an RFS containing the GM-PHD peaks for the $N_k$ targets. In order to compute the covariance RFS, we have examined two approaches. The first uses $\mathcal{M}_{k+1|k}$ and $\mathcal{W}_{k+1|k}$ as the locations and heights of a Gaussian mixture, respectively, and then fits a 2D mixture of Gaussian functions to the PHD map to compute the covariance matrices. The other solution is based on finding the corresponding pairs between the predicted mean RFS and the previous means using combinatorial optimisation; each covariance is then assigned from its corresponding element in the previous time step. We have observed that both approaches generate similar results, while the latter is significantly faster, as it does not optimise over a continuous parameter space (unlike the 2D Gaussian fitting) and is less vulnerable to stopping at local minima. The overall approach explained above can be interpreted as a mapping from an implicit representation ($D_{k+1|k}(\mathbf{x})$) to an explicit target state representation ($X_{k+1|k}$).
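The post-processing and implicit-to-explicit mapping described above can be sketched as follows; the border width, the 3x3 local-maximum test and the grid-based peak representation are assumptions.

```python
import numpy as np

def postprocess_phd(D, n_targets, cell_area, border=2):
    """Repair borders and re-normalise a predicted PHD map.

    Border values (convolution padding artifacts) are replaced by the
    median of the inner region, then the map is scaled so that it
    integrates to the expected number of targets (condition (1)).
    """
    D = D.copy()
    inner_median = np.median(D[border:-border, border:-border])
    D[:border, :] = inner_median
    D[-border:, :] = inner_median
    D[:, :border] = inner_median
    D[:, -border:] = inner_median
    D *= n_targets / (D.sum() * cell_area)
    return D

def extract_peaks(D, n_targets):
    """Return the n_targets highest local maxima of the map: the grid
    positions act as the predicted means (4), the peak heights as the
    GM-PHD weights (5)."""
    H, W = D.shape
    cand = []
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            patch = D[i - 1:i + 2, j - 1:j + 2]
            if D[i, j] == patch.max():       # 3x3 local maximum
                cand.append((D[i, j], i, j))
    cand.sort(reverse=True)
    peaks = cand[:n_targets]
    means = [(i, j) for _, i, j in peaks]
    weights = [v for v, _, _ in peaks]
    return means, weights
```

After this step, the explicit per-target means and weights can be packed back into state tuples for the subsequent birth and update stages.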
The union of $X_{k+1|k}$ and the birth RFS $B_{k+1}$, which is assigned using the current measurement RFS $Z_{k+1}$, is then computed as follows,

$$X_{k+1|k} \leftarrow X_{k+1|k} \cup B_{k+1}, \qquad (6)$$

where each element of $B_{k+1}$ is a birth target tuple, initialised with a birth covariance matrix $\mathbf{P}_B$, birth weight $w_B$, birth label identifier $\ell_B$, birth age $a_B$ and birth motion vector $\mathbf{v}_B$. The predicted RFS is then updated using the measurement RFS $Z_{k+1}$, as explained in the next section.
Update step
Assuming $\mathbf{z} \in Z_{k+1}$ is a $d_z$-dimensional measurement vector, the updated GM-PHD mean, covariance matrix and Gaussian weights are computed as follows [Vo and Ma 2006],

$$\mathbf{m}^{k+1}_i(\mathbf{z}) = \mathbf{m}^{k+1|k}_i + \mathbf{K}_i\left(\mathbf{z} - \mathbf{H}\,\mathbf{m}^{k+1|k}_i\right), \quad \mathbf{P}^{k+1}_i = \left(\mathbf{I} - \mathbf{K}_i \mathbf{H}\right)\mathbf{P}^{k+1|k}_i, \quad w^{k+1}_i(\mathbf{z}) = \frac{p_D\, w^{k+1|k}_i\, q_i(\mathbf{z})}{\kappa(\mathbf{z}) + p_D \sum_j w^{k+1|k}_j\, q_j(\mathbf{z})}, \qquad (7)$$

where $\mathbf{I}$ is an identity matrix, and $\mathbf{m}^{k+1|k}_i$, $\mathbf{P}^{k+1|k}_i$ and $w^{k+1|k}_i$ are the mean vector, covariance matrix and Gaussian weight of the $i^{th}$ member of the predicted RFS, respectively. $\mathbf{H}$ is the prediction-to-measurement space mapping matrix, $p_D$ is the probability of detection and $\kappa(\mathbf{z})$ is the clutter intensity at time $k+1$. $q_i(\mathbf{z})$ is a Gaussian distribution over the measurement space at time $k+1$, with mean and covariance matrix obtained from the $i^{th}$ predicted target, i.e.

$$q_i(\mathbf{z}) = \mathcal{N}\!\left(\mathbf{z};\; \mathbf{H}\,\mathbf{m}^{k+1|k}_i,\; \mathbf{H}\,\mathbf{P}^{k+1|k}_i\,\mathbf{H}^T + \mathbf{R}\right), \qquad (8)$$

where $\mathbf{R}$ is the covariance of the measurement noise. $\mathbf{K}_i$ is the Kalman gain matrix for the $i^{th}$ predicted target, computed as

$$\mathbf{K}_i = \mathbf{P}^{k+1|k}_i\,\mathbf{H}^T \left(\mathbf{H}\,\mathbf{P}^{k+1|k}_i\,\mathbf{H}^T + \mathbf{R}\right)^{-1}.$$
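A numerical sketch of the per-measurement Gaussian-component update of (7)-(8) is given below, under the stated linear-Gaussian assumptions; the handling of missed detections and of multiple measurements per scan is omitted for brevity.

```python
import numpy as np

def gmphd_update(means, covs, weights, z, H, R, p_d, kappa):
    """GM-PHD update of predicted Gaussian components with one
    measurement z, following (7)-(8) [Vo and Ma 2006]."""
    d = H.shape[0]
    new_m, new_P, likelihoods = [], [], []
    for m, P in zip(means, covs):
        S = H @ P @ H.T + R                      # innovation covariance of (8)
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        resid = z - H @ m
        new_m.append(m + K @ resid)
        new_P.append((np.eye(P.shape[0]) - K @ H) @ P)
        # measurement likelihood q_i(z) of (8)
        q = np.exp(-0.5 * resid @ np.linalg.solve(S, resid)) \
            / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        likelihoods.append(q)
    # weight update of (7): detection, clutter and normalisation terms
    num = p_d * np.asarray(weights) * np.asarray(likelihoods)
    new_w = num / (kappa + num.sum())
    return new_m, new_P, new_w
```

Components close to the measurement receive most of the updated weight, while the clutter intensity `kappa` bounds the total mass assigned to any single measurement.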
After computing the updated state tuples, the Gaussian mixture pruning and merging steps explained in [Vo and Ma 2006] are performed over the targets. In order to allow the propagation of mean state vectors corresponding to sudden target birth, the "maximum allowable number of Gaussian terms" ($J_{max}$ in [Vo and Ma 2006]) is selected by computing the maximum between $N_k$ (the number of targets in the previous time step) and a sample from a Poisson distribution.

Target state extraction
After the update step is performed, those targets whose weights are higher than a threshold are selected as the initial updated target RFS. Using this RFS, a track-to-track association is performed over the selected targets. The following distance metric is computed between the $i^{th}$ target in the previous time step and the $j^{th}$ (yet unlabelled) target computed at the current time step,
(9) 
where IoU computes the intersection over union between the two targets and $d_{i,j}$ is the resulting distance measure between them; $a^k_i$ is the age of the $i^{th}$ target at the previous time step and is multiplied with the IoU to increase the importance of those targets with longer temporal presence. Computing $d_{i,j}$ between all the previous and current targets constitutes a distance matrix, which is given to the Hungarian assignment algorithm to determine the survival, birth and death of targets, as detailed in the pseudocode of Algorithm 1. Finally, the most mature targets, whose age is higher than a threshold, are selected as the output target RFS $X_{k+1}$. $X_{k+1}$ is then used to compute the PHD function $D_{k+1}(\mathbf{x})$. The PDD map $S_{k+1}$ is then calculated and appended to the training batch to train the ConvLSTM network for the next time step. In Algorithm 1, incrementing and decrementing the ages of the targets are performed as follows,
(10) 
where $\lfloor \cdot \rfloor$ computes the integer part and $(\alpha, \beta)$ are the target age amplification and attenuation factors, respectively.
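The track-to-track association step can be sketched with the Hungarian solver as follows. The conversion of the age-weighted IoU of (9) into a minimisable cost (here, simple negation) and the (x1, y1, x2, y2) box format are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def associate(prev_boxes, prev_ages, curr_boxes):
    """Hungarian assignment on an age-weighted IoU matrix.

    Older targets get amplified similarity, so tracks with a longer
    temporal presence are preferred during matching.
    """
    cost = np.zeros((len(prev_boxes), len(curr_boxes)))
    for i, (pb, age) in enumerate(zip(prev_boxes, prev_ages)):
        for j, cb in enumerate(curr_boxes):
            cost[i, j] = -age * iou(pb, cb)     # negate: solver minimises
    rows, cols = linear_sum_assignment(cost)
    # pairs with zero overlap are left unmatched (candidates for birth/death)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 0.0]
```

Unmatched previous targets are candidates for death (age attenuation), while unmatched current targets are candidates for birth, mirroring Algorithm 1.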
Experimental results
Datasets
We have evaluated our algorithm over the Multiple Object Tracking 2015 (MOT15) and 2017 (MOT17) datasets [Leal-Taixé et al. 2015], which contain 11 and 14 video sequences, respectively, captured via both fixed and moving platforms from various crowded scenes for pedestrian tracking. The pedestrian detection is performed by: Aggregated Channel Features (ACF) [Dollár et al. 2014] for MOT15; and Deformable Parts Model (DPM) [Felzenszwalb et al. 2010], Faster Region-based CNN (FRCNN) [Ren et al. 2015] and Scale Dependent Pooling (SDP) [Yang, Choi, and Lin 2016] for MOT17. It should be mentioned that we have particularly used the MOT15 dataset to evaluate the multitarget filtering performance, because of the high intensity of clutter generated by ACF over this dataset. We have also used the Performance Evaluation of Tracking and Surveillance 2009 S2 (PETS09) sequences (http://www.cvg.reading.ac.uk/PETS2009/) with an ACF pedestrian detector. The PNNL Parking Lot 1 sequence [Shu et al. 2012], the only sequence of the PNNL dataset with publicly available detections, is also utilised. The proposed MTFT algorithm is implemented, end-to-end, in Python 2.7. Keras with a TensorFlow backend is used for the ConvLSTM implementation, over an NVIDIA GeForce GTX 1080 Ti GPU, where the average frame rate of the proposed algorithm is
(fps). Our following results are obtained using a ReLU activation function, one block of sixteen convolutional filters, 20 training epochs and, for all detection algorithms, the same values of the remaining hyperparameters. During the training of the ConvLSTM network we used the ADAM optimiser with default parameters (learning rate, $\beta_1$ and $\beta_2$ set to 0.001, 0.9 and 0.99, respectively).

One time step at a glance
In this section, we briefly show an overview of the outputs of the major blocks in Fig. 2. Figure 4b shows the output PDD map from the ConvLSTM prediction step for the current image at the $k^{th}$ time step, shown in Fig. 4a. The peaks in the PDD indicate those regions corresponding to "faster" target movement. Here, by faster we mean how quickly a target is moving with respect to its covariance matrix. For such targets, the corresponding peak in the next time step's PHD function will be farther away, resulting in a high peak after subtraction and creating the bright yellow regions in Fig. 4b. On the other hand, the darker regions indicate those targets which are mostly stationary.

Using (2) and the post-processing step explained in the MTFT pipeline, the corresponding predicted PHD function is calculated, which is shown in Fig. 4c, where its peaks correspond to the expected locations of targets. After obtaining the measurements at the $(k+1)^{th}$ time step, the updated PHD function is calculated. An overlaid plot of this PHD function and the image is illustrated in Fig. 5, where the peaks show the expected locations of the targets. This new PHD function is used to compute the next PDD map, which is appended to the previous batch to (online) train the ConvLSTM and predict for the next time step.
Filtering performance
We compute the Optimal Sub-Pattern Assignment (OSPA) distance [Schuhmacher, Vo, and Vo 2008], which has been extensively utilised for evaluating the accuracy of multitarget filtering algorithms [Vo, Vo, and Hoang 2017; Fantacci et al. 2018; Meyer et al. 2017; Beard et al. 2017] (the mathematical definition of OSPA is provided in the Supplementary Material). As OSPA assumes point targets, here we used the centres of the bounding boxes to represent each target (in the next section, however, we evaluate the tracking performance using a bounding box representation for the targets).
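For reference, the OSPA distance between two point-target sets can be computed as below (the formal definition is in the Supplementary Material); the cut-off `c` and order `p` are the usual OSPA parameters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=100.0, p=2):
    """OSPA distance between two sets of point targets
    [Schuhmacher, Vo, and Vo 2008], with cut-off c and order p."""
    if len(X) > len(Y):
        X, Y = Y, X                      # ensure |X| <= |Y|
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0
    # cut-off distance matrix, optimally assigned (localisation term)
    D = np.array([[min(c, np.linalg.norm(np.asarray(x) - np.asarray(y)))
                   for y in Y] for x in X])
    rows, cols = linear_sum_assignment(D ** p)
    loc = (D[rows, cols] ** p).sum()
    card = (c ** p) * (n - m)            # cardinality penalty
    return ((loc + card) / n) ** (1.0 / p)
```

The localisation (Loc) and cardinality (Card) contributions reported in the tables are the two terms of this sum, which is why the Loc error can be recovered from the overall and Card errors.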
Tables 1 and 2 show the comparative performance of several MTFT algorithms over the MOT15 and MOT17 datasets, respectively. As OSPA is computed at each time step, the results in these tables are averaged over all time steps for all video sequences (it should be mentioned that the Loc OSPA can be easily computed by subtracting the Card error from the overall error). Considering the PHD-based algorithms (PHD-EKF, PHD-SMC and PHD-UKF) as a baseline, our proposed ConvLSTM MTFT algorithm shows significantly better performance over both datasets; in particular, there is a remarkable reduction in the cardinality error. We have also compared our method with four tracking algorithms: SORT [Bewley et al. 2016], one of the fastest online tracking approaches reported over MOT15; DeepSORT [Wojke, Bewley, and Paulus 2017], an extension of SORT with deep associations; Re3 [Gordon, Farhadi, and Fox 2018], a deep RNN-based multitarget tracker; and the RNN_LSTM algorithm [Milan et al. 2017], one of the pioneering algorithms using RNNs for multitarget tracking. Our overall average OSPA error is lower than those of RNN_LSTM, SORT, DeepSORT and Re3. For MOT17, the overall average error is higher than the MOT15 results for all algorithms (Table 2). The reason is that, compared to MOT15, there is a significantly higher number of annotated objects in this dataset. However, similar to MOT15, our algorithm outperforms the other methods in terms of overall OSPA, with a Card error of 53.79 and an overall OSPA of 62.29.
Methods  OSPA Card  Overall OSPA 

PHD-EKF  
PHD-SMC  
PHD-UKF  
CPHD-EKF  
CPHD-SMC  
CPHD-UKF  
LMB-EKF  
LMB-SMC  
LMB-UKF  
GLMB-EKF  
GLMB-SMC  
GLMB-UKF  
SORT  
DeepSORT  
Re3  
RNN_LSTM  
ConvLSTM  19.16  40.40 
Methods  OSPA Card  Overall OSPA 

PHD-EKF  
PHD-SMC  
PHD-UKF  
CPHD-EKF  
CPHD-SMC  
CPHD-UKF  
LMB-EKF  
LMB-SMC  
LMB-UKF  
GLMB-EKF  
GLMB-SMC  
GLMB-UKF  
SORT  
DeepSORT  
Re3  
RNN_LSTM  
ConvLSTM  53.79  62.29 
Algorithm  Rcll  Prcn  MOTA  MOTAL 

SORT-DPM  37.0  76.6  24.7  25.7 
SORT-FRCNN  50.5  97.3  48.5  49.2 
SORT-SDP  63.0  98.3  61.0  61.8 
DeepSORT-DPM  32.4  91.3  28.2  29.3 
DeepSORT-FRCNN  51.7  95.7  48.6  49.4 
DeepSORT-SDP  64.7  97.3  61.6  62.9 
Re3-DPM  37.0  76.2  24.3  25.4 
Re3-FRCNN  50.6  97.1  48.4  49.0 
Re3-SDP  62.9  98.0  60.7  61.6 
RNN_LSTM-DPM  32.8  83.6  25.6  26.4 
RNN_LSTM-FRCNN  44.7  88.9  38.4  39.1 
RNN_LSTM-SDP  49.7  87.7  41.5  42.7 
ConvLSTM-DPM  38.9  69.9  20.3  22.1 
ConvLSTM-FRCNN  53.3  92.6  48.1  49.0 
ConvLSTM-SDP  67.1  94.9  62.0  63.5 
Algorithm  Rcll  Prcn  MOTA  MOTAL 

SORT  71.4  98.5  69.0  70.3 
DeepSORT  74.3  97.5  70.5  72.1 
Re3  70.9  97.8  68.0  69.3 
ConvLSTM  77.7  93.3  71.0  72.1 
Algorithm  Rcll  Prcn  MOTA  MOTAL 

SORT  75.0  87.4  61.9  64.1 
DeepSORT  88.0  83.0  65.2  62.9 
Re3  73.1  83.6  56.5  58.8 
RNN_LSTM  91.1  68.1  43.9  48.3 
ConvLSTM  80.9  86.8  66.9  68.5 
Tracking performance
The widely used CLEAR MOT multitarget tracking metrics explained in [Bernardin and Stiefelhagen 2008] are used to quantitatively evaluate the proposed ConvLSTM algorithm. Three datasets are used: MOT17, PNNL Parking Lot and PETS09. Unlike the OSPA results, for which point targets were assumed, here each target is represented as a bounding box. The multitarget tracking results for MOT17, PNNL Parking Lot and PETS09 are shown in Tables 3, 4 and 5, respectively. For MOT17, all three publicly available detections are used. ConvLSTM's performance is highest when the SDP detector is used, with 62.0 MOTA, 67.1 recall and 63.5 MOTAL. Also, ConvLSTM generates 71.0 and 66.9 MOTA over the Parking Lot and PETS09 sequences, when the detection method in [Shu et al. 2012] and ACF are used, respectively. For the Parking Lot sequence, ground truth is only available for the first 748 frames, and hence we have evaluated the algorithms over these frames only. Our ConvLSTM approach generates a lower miss rate (fewer false negatives), resulting in higher recall, as can be seen in Tables 3 and 4.
Conclusions
This paper detailed a spatiotemporal data prediction approach applicable to MTFT problems. The prediction is simultaneously performed for all of the targets, over an implicit continuous hypothetical target space, via a ConvLSTM neural network. The proposed approach not only significantly improves on the baseline RFS filters, but also shows substantial potential when compared with other state-of-the-art MTFT algorithms. Our algorithm, which makes an important step towards an end-to-end learning model for online MTFT over realistic scenarios, can be enhanced in several aspects. An immediate improvement would be to include the update step within the ConvLSTM framework. Also, as an alternative to the ConvLSTM network (which is a discriminative algorithm approximating the posterior densities), generative neural networks could be utilised to directly estimate the joint PDF, eliminating the need for a separate covariance estimation at the prediction step.
References
 [Bay, Lepsoy, and Magli2016] Bay, A.; Lepsoy, S.; and Magli, E. 2016. Stable limit cycles in recurrent neural networks. In 2016 International Conference on Communications (COMM), 89–92.
 [Beard et al.2017] Beard, M.; Vo, B. T.; Vo, B. N.; and Arulampalam, S. 2017. Void probabilities and CauchySchwarz divergence for generalized labeled multiBernoulli models. IEEE Transactions on Signal Processing 65(19):5047–5061.
 [Bernardin and Stiefelhagen2008] Bernardin, K., and Stiefelhagen, R. 2008. Evaluating multiple object tracking performance: the CLEAR MOT metrics. Journal on Image and Video Processing 2008:1.
 [Bewley et al.2016] Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; and Upcroft, B. 2016. Simple online and realtime tracking. In ICIP, 3464–3468.
 [Dollár et al.2014] Dollár, P.; Appel, R.; Belongie, S.; and Perona, P. 2014. Fast feature pyramids for object detection. IEEE Transactions on PAMI 36(8):1532–1545.
 [Emambakhsh, Bay, and Vazquez2018] Emambakhsh, M.; Bay, A.; and Vazquez, E. 2018. Deep recurrent neural network for multitarget filtering. arXiv preprint arXiv:1806.06594.
 [Emambakhsh, Ebrahimnezhad, and Sedaaghi2010] Emambakhsh, M.; Ebrahimnezhad, H.; and Sedaaghi, M. 2010. Integrated regionbased segmentation using color components and texture features with prior shape knowledge. International Journal of Applied Mathematics and Computer Science 20(4):711–726.
 [Fantacci et al.2018] Fantacci, C.; Vo, B. N.; Vo, B. T.; Battistelli, G.; and Chisci, L. 2018. Robust fusion for multisensor multiobject tracking. IEEE Signal Processing Letters 25(5):640–644.
 [Felzenszwalb et al.2010] Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained partbased models. IEEE transactions on pattern analysis and machine intelligence 32(9):1627–1645.
 [Gordon, Farhadi, and Fox2018] Gordon, D.; Farhadi, A.; and Fox, D. 2018. Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters 3(2):788–795.
 [Graves, Mohamed, and Hinton2013] Graves, A.; Mohamed, A.-r.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6645–6649. IEEE.
 [Henriques et al.2015] Henriques, J. F.; Caseiro, R.; Martins, P.; and Batista, J. 2015. Highspeed tracking with kernelized correlation filters. IEEE Transactions on PAMI 37(3):583–596.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long shortterm memory. Neural computation 9(8):1735–1780.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
 [Leal-Taixé et al.2015] Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; and Schindler, K. 2015. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942.
 [Mahler2003] Mahler, R. P. S. 2003. Multitarget Bayes filtering via firstorder multitarget moments. IEEE Transactions on Aerospace and Electronic Systems 39(4):1152–1178.
 [Mahler2007] Mahler, R. 2007. PHD filters of higher order in target number. IEEE Transactions on Aerospace and Electronic Systems 43(4):1523–1543.
 [Meyer et al.2017] Meyer, F.; Braca, P.; Willett, P.; and Hlawatsch, F. 2017. A scalable algorithm for tracking an unknown number of targets using multiple sensors. IEEE Transactions on Signal Processing 65(13):3478–3493.
 [Milan et al.2017] Milan, A.; Rezatofighi, S.; Dick, A.; Reid, I.; and Schindler, K. 2017. Online multitarget tracking using recurrent neural networks. In AAAI.
 [Moratuwage et al.2014] Moratuwage, D.; Wang, D.; Rao, A.; Senarathne, N.; and Wang, H. 2014. RFS collaborative multivehicle SLAM: SLAM in dynamic highclutter environments. IEEE Robotics Automation Magazine 21(2):53–59.
 [Nagappa et al.2017] Nagappa, S.; Delande, E. D.; Clark, D. E.; and Houssineau, J. 2017. A tractable forwardbackward CPHD smoother. IEEE Transactions on Aerospace and Electronic Systems 53(1):201–217.
 [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster RCNN: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, 91–99.
 [Reuter et al.2014] Reuter, S.; Vo, B. T.; Vo, B. N.; and Dietmayer, K. 2014. The labeled multiBernoulli filter. IEEE Transactions on Signal Processing 62(12):3246–3260.
 [Schuhmacher, Vo, and Vo2008] Schuhmacher, D.; Vo, B. T.; and Vo, B. N. 2008. A consistent metric for performance evaluation of multiobject filters. IEEE Transactions on Signal Processing 56(8):3447–3457.
 [Shi et al.2015] Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, 802–810.
 [Shu et al.2012] Shu, G.; Dehghan, A.; Oreifej, O.; Hand, E.; and Shah, M. 2012. Partbased multipleperson tracking with partial occlusion handling. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 1815–1821. IEEE.
 [Vo and Ma2006] Vo, B. N., and Ma, W. K. 2006. The Gaussian mixture probability hypothesis density filter. IEEE Transactions on Signal Processing 54(11):4091–4104.
 [Vo, Singh, and Doucet2005] Vo, B. N.; Singh, S.; and Doucet, A. 2005. Sequential Monte Carlo methods for multitarget filtering with random finite sets. IEEE Transactions on Aerospace and Electronic Systems 41(4):1224–1245.
 [Vo, Vo, and Hoang2017] Vo, B. N.; Vo, B. T.; and Hoang, H. G. 2017. An efficient implementation of the generalized labeled multiBernoulli filter. IEEE Transactions on Signal Processing 65(8):1975–1987.
 [Vo, Vo, and Phung2014] Vo, B. N.; Vo, B. T.; and Phung, D. 2014. Labeled random finite sets and the Bayes multitarget tracking filter. IEEE Transactions on Signal Processing 62(24):6554–6567.
 [Wojke, Bewley, and Paulus2017] Wojke, N.; Bewley, A.; and Paulus, D. 2017. Simple online and realtime tracking with a deep association metric. In ICIP, 3645–3649.

 [Yang, Choi, and Lin2016] Yang, F.; Choi, W.; and Lin, Y. 2016. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2129–2137.