Convolutional Recurrent Predictor: Implicit Representation for Multi-target Filtering and Tracking

11/01/2018 ∙ by Mehryar Emambakhsh, et al. ∙ Cortexica Vision Systems Ltd

Defining a multi-target motion model, which is an important step of tracking algorithms, can be very challenging. Using fixed models (as in several generative Bayesian algorithms, such as Kalman filters) can fail to accurately predict sophisticated target motions. On the other hand, sequential learning of the motion model (for example, using recurrent neural networks) can be computationally complex and difficult due to the variable unknown number of targets. In this paper, we propose a multi-target filtering and tracking (MTFT) algorithm which learns the motion model, simultaneously for all targets, from an implicitly represented state map and performs spatio-temporal data prediction. To this end, the multi-target state is modelled over a continuous hypothetical target space, using random finite sets and Gaussian mixture probability hypothesis density formulations. The prediction step is recursively performed using a deep convolutional recurrent neural network with a long short-term memory architecture, which is trained as a regression block, on the fly, over "probability density difference" maps. Our approach is evaluated over widely used pedestrian tracking benchmarks, remarkably outperforming state-of-the-art multi-target filtering algorithms, while giving competitive results when compared with other tracking approaches.





Spatio-temporal data filtering plays a key role in numerous security, remote sensing, surveillance, automation and forecasting algorithms. As one of the most important steps in a sequential filtering task, prediction (estimation) of the state variables provides important insight about the past, present and future data. Particularly for a multi-target filtering and tracking (MTFT) problem, the prediction step conveys the past information about the latent state variables and suggests target


. As the core of a Bayesian filtering paradigm, motion models are used to perform this task. Then, the correction (update) stage is applied over these predictions through the state-to-measurement space mapping. The Kalman filter assumes linear motion models with Gaussian distributions for both the prediction and update steps. Using the Taylor series expansion and a deterministic approximation of non-Gaussian distributions, non-linearity and non-Gaussian behaviour are addressed by the Extended and Unscented Kalman Filters (EKF, UKF), respectively. Using the importance sampling principle, particle filters are also used to estimate the likelihood and posterior densities, addressing non-linearity and non-Gaussian behaviour [Vo, Singh, and Doucet2005, Moratuwage et al.2014]. Mahler proposed random finite sets (RFS) [Mahler2003], which provide an encapsulated formulation of multi-target filtering, incorporating clutter densities and the probabilities of target detection, survival and birth. To this end, targets and measurements are assumed to form sets with variable random cardinalities. One approach to represent the target state is to use Probability Hypothesis Density (PHD) maps [Vo, Singh, and Doucet2005, Vo and Ma2006].

Figure 1: Qualitative results: Row 1: our result; Row 2: SORT [Bewley et al.2016]; Row 3: DeepSORT [Wojke, Bewley, and Paulus2017]; Row 4: Re3 [Gordon, Farhadi, and Fox2018]; Row 5: RNN_LSTM [Milan et al.2017]. While the other algorithms lose the occluded target, our proposed MTFT approach maintains its ID until it reappears.

Vo and Ma proposed the Gaussian Mixture PHD (GM-PHD) filter, which propagates the first-order statistical moments to estimate the posterior as a mixture of Gaussians [Vo and Ma2006]. While GM-PHD is based on Gaussian distributions, a particle filter-based solution is proposed by the Sequential Monte Carlo PHD (SMC-PHD) filter to address non-Gaussian distributions [Vo, Singh, and Doucet2005]. Since a large number of particles must be propagated during SMC-PHD, the computational complexity can be high and hence gating might be necessary [Moratuwage et al.2014]. The Cardinalised PHD (CPHD) filter was proposed to also propagate the RFS cardinality over time [Mahler2007], while Nagappa et al. addressed its intractability [Nagappa et al.2017]. The Labelled Multi-Bernoulli Filter (LMB) [Reuter et al.2014] performs track-to-track association and outperforms previous algorithms in the sense of not relying on a high signal to noise ratio (SNR). Vo et al. proposed the Generalized Labelled Multi-Bernoulli (GLMB) filter as a labelled multi-target filter [Vo, Vo, and Phung2014].

Since the incoming data is (usually) noisy, cluttered and varies with time, an a priori definition of a motion model applicable to all of the targets is not always straightforward. Failure to accurately formulate the targets’ state-to-state transition functions used by the Bayesian filters explained above can cause erroneous predictions, especially for complex target motions (or when projected over the measurement frames). A robust filtering algorithm should hence be capable of learning such (multi-target) motion behaviour, enabling accurate predictions for the following time steps.

Recently, machine learning has seen the rise of deep learning methods, achieving state-of-the-art results in many fields, from image classification tasks via convolutional neural networks (CNNs) [Krizhevsky, Sutskever, and Hinton2012] to natural language processing via recurrent neural networks (RNNs) [Graves, Mohamed, and Hinton2013]. CNNs can learn the underlying spatial information by sliding learnable filters over the input data. On the other hand, as a non-linear dynamical system [Bay, Lepsoy, and Magli2016], RNNs can store and exploit past information through feedback loops. The cyclic connections between units make RNNs suitable for predicting temporal sequences, training the network in such a way that it uses its current outputs as input for the following steps. The long short-term memory (LSTM) architecture [Hochreiter and Schmidhuber1997] was introduced to resolve the vanishing gradient phenomenon during the training of RNNs. While RNNs and CNNs are capable of learning the temporal and spatial information from the input signals, respectively, their use for multi-target data analysis is not straightforward. Since the number of targets in an MTFT problem is constantly changing, the motion cannot be easily modelled using a network with a fixed architecture (neural networks usually have a fixed and predefined number of input and output neurons). One solution is to allocate an LSTM network to each target [Emambakhsh, Bay, and Vazquez2018], which can significantly increase the computational complexity.

In this paper, we propose a solution which addresses both of the above problems: (1) the use of fixed models by the Bayesian filtering methods and (2) the difficulties in utilising deep neural networks for problems with variable input/output sizes (cardinalities), such as in MTFT. We reformulate the MTFT problem from an explicit discrete multi-state vector estimation to an implicit multi-dimensional spatio-temporal prediction. While the former introduces MTFT as a variable (unknown) discrete number of nodes over the state space, the latter performs MTFT implicitly over a continuous state space (a similar explicit vs. implicit intuition exists in Snakes vs. level sets active contours for image segmentation [Emambakhsh, Ebrahimnezhad, and Sedaaghi2010], or k-means vs. hierarchical clustering for pattern recognition). Our approach is based on defining probability density difference (PDD) maps, which encapsulate the multi-target information flow over the state space. Then, using a network of deep convolutional LSTMs (ConvLSTM [Shi et al.2015]) as a regression block, the spatio-temporal prediction is learned and estimated: the spatial dependencies over the multi-target state space are modelled by the convolutional filters, while the temporal dependencies are learned using the LSTM’s recurrence. Our extensive experimental results over several pedestrian tracking benchmarks show the remarkable potential of our MTFT algorithm. The definition of the metrics used, the hyper-parameter evaluation and video samples are provided in our Supplementary Material.

Scientific contributions: Compared to RFS Bayesian algorithms [Mahler2007, Nagappa et al.2017, Reuter et al.2014, Vo, Vo, and Phung2014, Vo, Vo, and Hoang2017], our proposed method models the multi-target motion by learning from the incoming data. The use of the state space, the PDD maps and LSTM networks enables our algorithm to memorise long-term dependencies, as opposed to the detect-to-track tracking methods [Henriques et al.2015]. To the best of our knowledge, our proposed MTFT algorithm is one of the first methods that implicitly performs multi-target spatio-temporal prediction by integrating RFS and ConvLSTM. Unlike [Emambakhsh, Bay, and Vazquez2018], which performs prediction by allocating an LSTM network to each target, our MTFT approach simultaneously estimates the state variables for all targets, which significantly increases the computational speed (in fps). The qualitative performance of our algorithm at a glance is shown in Fig. 1.

On the mathematical notation:

Throughout this paper, we use italic notation for scalars, RFSs, probability density functions (PDFs) and PDD maps. We use bold text for vectors, matrices and tuples. The subscripts and superscripts indicate the time steps for RFSs and scalars/vectors/matrices, respectively.

Probability density difference (PDD) maps

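Let us define the target state RFS at the $k$-th time step as $X_k = \{\mathbf{x}_1^{(k)}, \mathbf{x}_2^{(k)}, \ldots, \mathbf{x}_{M^{(k)}}^{(k)}\}$, where $M^{(k)}$ represents the number of targets (set cardinality). Each $\mathbf{x}_i^{(k)} = (\mathbf{m}_i^{(k)}, \mathbf{\Sigma}_i^{(k)}, \omega_i^{(k)}, \ell_i^{(k)}, a_i^{(k)}, \mathbf{v}_i^{(k)})$ is the $i$-th target state tuple containing the mean state vector $\mathbf{m}_i^{(k)}$ over a $d$-dimensional state space, covariance matrix $\mathbf{\Sigma}_i^{(k)}$, Gaussian mixture weight $\omega_i^{(k)}$, integer track label $\ell_i^{(k)}$, target age $a_i^{(k)}$ (the higher the age, the longer the target has survived) and motion vector $\mathbf{v}_i^{(k)}$ (along the target’s movement direction).

Figure 2: Overall MTFT pipeline: Once trained, ConvLSTM predicts the target state, implicitly, over PDD maps. The update is then performed by extracting individual target states, which are then given to the track-to-track associator.

The target state RFS can be used to create the density function over the (hypothetical) continuous target state $\mathbf{x}$, as $D_k(\mathbf{x}) = \sum_{i=1}^{M^{(k)}} \omega_i^{(k)} \, \mathcal{N}(\mathbf{x}; \mathbf{m}_i^{(k)}, \mathbf{\Sigma}_i^{(k)})$, in which $\mathcal{N}(\mathbf{x}; \mathbf{m}, \mathbf{\Sigma})$ is a Normal distribution over $\mathbf{x}$ [Vo and Ma2006]. $D_k$ peaks where the targets of the RFS are located. Moreover, $\omega_i^{(k)}$ is assigned such that the following condition is satisfied,

$$\int D_k(\mathbf{x}) \, d\mathbf{x} = \sum_{i=1}^{M^{(k)}} \omega_i^{(k)} \approx M^{(k)}, \tag{1}$$

which indicates that the expected number of targets should be given when $D_k$ is integrated over the target state space [Mahler2003, Vo and Ma2006]. When both of these properties are satisfied, $D_k$ represents a GM-PHD function [Vo and Ma2006]. We define a PDD as the difference between two consecutive GM-PHD functions as follows,

$$\Delta_k(\mathbf{x}) = D_k(\mathbf{x}) - D_{k-1}(\mathbf{x}). \tag{2}$$

While the PHD function $D_k$ conveys the latent target state information until the $k$-th iteration, the PDD function $\Delta_k$ contains target state flow information between the two consecutive time steps $k-1$ and $k$, emphasizing the most recent variations.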
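As an illustration of the two definitions above, the following minimal sketch builds a GM-PHD function and a PDD map over a one-dimensional state grid. The grid, the target positions, the variances and all function names are hypothetical choices made for this example only; they are not taken from the paper's implementation.

```python
import math

def gaussian(x, mean, var):
    """Value of a 1D Normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gm_phd(grid, targets):
    """Evaluate a Gaussian-mixture PHD on a grid.

    targets: list of (weight, mean, variance) tuples, one per target.
    """
    return [sum(w * gaussian(x, m, v) for w, m, v in targets) for x in grid]

def integrate(values, step):
    """Riemann-sum approximation of the integral over the grid."""
    return sum(values) * step

step = 0.05
grid = [i * step for i in range(401)]            # state space [0, 20]

# Two targets with unit weights: one moves from 5 to 6, one stays at 12.
prev_targets = [(1.0, 5.0, 0.5), (1.0, 12.0, 0.5)]
curr_targets = [(1.0, 6.0, 0.5), (1.0, 12.0, 0.5)]

phd_prev = gm_phd(grid, prev_targets)
phd_curr = gm_phd(grid, curr_targets)

# PDD: difference between two consecutive PHD functions.
pdd = [c - p for c, p in zip(phd_curr, phd_prev)]

expected_targets = integrate(phd_curr, step)     # close to the cardinality, 2
pdd_mass = integrate(pdd, step)                  # close to 0: cardinality unchanged
```

In this toy setting the stationary target cancels out of the PDD, while the moving target leaves a negative lobe near its old position and a positive lobe near its new one, which is the "flow" information the map is meant to carry.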

MTFT pipeline

There is a temporal correlation between a sequence of consecutive PDD maps. Also, assuming a 2D target state, a PDD can be viewed as a texture image whose pixel values are functions of their location. The core of our proposed MTFT algorithm is to learn this latent spatio-temporal information within the PDD maps, using a ConvLSTM network: the spatial dependencies between the hypothetical target states are learned using convolutional filters, while the network’s recurrence extracts the temporal dependencies. To be more specific, we use the ConvLSTM as a spatio-temporal regression block, predicting the next PDD at every time step. Using this approach, both linear and non-linear motions are learned by the network, simultaneously for all targets. The overall pipeline of our proposed MTFT algorithm is illustrated in Fig. 2 and explained in detail in the following sections.

Figure 3: The online training step using a ConvLSTM network over a batch of PDD maps: the inputs and target outputs are shown on the left and right, respectively.

Online training step

The multi-target motion model is learned during the online training step. A batch of PDD maps is used to train a ConvLSTM model, as shown in Fig. 3. During training, a Kullback-Leibler (KL) divergence loss is minimised between the normalised (integrated to one) versions of the target PDD map and the predicted PDD map. The target PDD is computed via (2), using the PHD functions at two consecutive time steps, while the predicted PDD is the network output at the current training epoch. The KL divergence treats its inputs as PDFs and computes the relative entropy between the two distributions. This trained ConvLSTM network (which is effectively a spatio-temporal regression block) is then used to predict the multi-target state map, which is explained in the next section.
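The loss described above can be sketched for two discretised maps as follows. This is only an illustration under stated assumptions: both maps are taken to be non-negative with positive total mass (how negative PDD values are handled before normalisation is not recoverable from the text), the target map is used as the reference distribution, and `normalise`, `kl_divergence` and the `eps` guard are hypothetical names and choices.

```python
import math

def normalise(values):
    """Scale a non-negative map so that it sums to one."""
    total = sum(values)
    if total <= 0.0:
        raise ValueError("map must have positive total mass")
    return [v / total for v in values]

def kl_divergence(target, predicted, eps=1e-12):
    """Relative entropy KL(target || predicted) between two normalised maps.

    eps guards against division by zero and log(0); it is an assumption
    of this sketch, not a value from the paper.
    """
    p = normalise(target)
    q = normalise(predicted)
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

target_map = [1.0, 2.0, 3.0, 4.0]
good_prediction = [2.0, 4.0, 6.0, 8.0]   # same shape, different scale
poor_prediction = [4.0, 3.0, 2.0, 1.0]   # mass in the wrong place

loss_good = kl_divergence(target_map, good_prediction)
loss_poor = kl_divergence(target_map, poor_prediction)
```

Because both maps are normalised first, the loss only penalises differences in shape: a prediction that is a rescaled copy of the target gives a loss of zero, while misplaced mass gives a strictly positive loss.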

Prediction and implicit to explicit representation

The output PDD of Fig. 3 is given to the trained ConvLSTM network to compute an initial predicted PDD (i.e. the input for Fig. 2). When this initial prediction is summed with the previous PHD function, it gives the (initial) predicted PHD map (see (2)). Due to the non-linearity and the weight/bias multiplications and additions imposed by the neural network on the input data, the output may not satisfy the PHD condition (1). Also, because of the padding effect caused by the convolutional filters, artifacts may be added to the borders of the output data. To resolve these issues, the boundary values of the predicted map are replaced with the median of its inner part. Moreover, we assume that the prediction step does not alter the number of targets. Since the number of targets is equal to the integral of the PHD function (see (1)), after the median filtering is applied, the output map is normalised such that it integrates to the number of targets at the previous time step. The result of this post-processing step is the predicted PHD function, which is then used to extract the predicted target state RFS.

The peaks of the PHD function correspond to the current target mean vectors. Therefore, to extract the explicit target states from the predicted PHD function, its peaks are found first: a peak-finding operator returns the highest peaks of the input PHD function, as many as there are targets. The result is an RFS containing the predicted target state mean vectors. The values of the predicted PHD function at these peaks correspond to the GM-PHD weights, which form an RFS containing one weight per target.

In order to compute the covariance RFS, we have examined two approaches. The first one uses the extracted means and weights as the locations and heights of a Gaussian mixture, respectively, and then fits a 2D mixture of Gaussian functions to the PHD map to compute the covariance matrices. The second is based on finding the corresponding pairs between the predicted mean RFS and the mean RFS of the previous time step using combinatorial optimisation; each predicted target is then assigned the covariance matrix of its corresponding previous target. We have observed that both of these approaches generate similar results, while the latter is significantly faster, as it does not optimise over a continuous parameter space (unlike the 2D Gaussian fitting) and is less likely to stop at local minima. The overall approach explained above can be interpreted as a mapping from an implicit representation (the predicted PHD map) to an explicit target state representation (the predicted target state RFS).
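The post-processing and peak extraction described above can be sketched on a one-dimensional map as follows. The border width, the toy values and the function names are hypothetical, and the strict/non-strict local maximum test is just one simple way to define a peak; the paper's own peak finder is not specified in the text.

```python
import statistics

def postprocess(phd, num_targets, step, border=1):
    """Replace border values with the inner median, then renormalise
    so that the map integrates to the assumed number of targets."""
    inner = phd[border:len(phd) - border]
    fill = statistics.median(inner)
    cleaned = [fill] * border + list(inner) + [fill] * border
    mass = sum(cleaned) * step
    return [v * num_targets / mass for v in cleaned]

def extract_peaks(phd, num_targets):
    """Return indices and heights of the highest local maxima."""
    maxima = [i for i in range(1, len(phd) - 1)
              if phd[i - 1] < phd[i] >= phd[i + 1]]
    maxima.sort(key=lambda i: phd[i], reverse=True)
    best = sorted(maxima[:num_targets])
    return best, [phd[i] for i in best]

# Toy network output with padding artifacts at both borders.
raw_map = [9.0, 0.1, 0.2, 0.9, 0.3, 0.1, 0.4, 1.2, 0.5, 0.1, 7.0]
step = 1.0
num_targets = 2   # assumed unchanged by the prediction step

predicted_phd = postprocess(raw_map, num_targets, step)
peak_indices, peak_weights = extract_peaks(predicted_phd, num_targets)
```

Here the two large border values are suppressed before normalisation, so the two retained peaks are the interior ones; their grid positions play the role of the predicted means and their heights the role of the GM-PHD weights.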

The predicted target state RFS is then merged with the birth RFS, whose elements are assigned using the current measurement RFS, by taking the union of the two sets. Each birth target tuple is initialised with a birth covariance matrix, birth weight, birth label identifier, birth age and birth motion vector. The predicted RFS is then updated using the measurement RFS, which is explained in the next section.

Update step

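Assuming $\mathbf{z}$ is a $d_z$-dimensional measurement vector from the measurement RFS $Z_k$, the updated GM-PHD mean, covariance matrix and Gaussian weights are computed as follows [Vo and Ma2006],

$$\mathbf{m}_i^{(k)}(\mathbf{z}) = \mathbf{m}_i^{(k|k-1)} + \mathbf{K}_i^{(k)} \left(\mathbf{z} - \mathbf{H}\mathbf{m}_i^{(k|k-1)}\right),$$

$$\mathbf{\Sigma}_i^{(k)} = \left(\mathbf{I} - \mathbf{K}_i^{(k)} \mathbf{H}\right) \mathbf{\Sigma}_i^{(k|k-1)},$$

$$\omega_i^{(k)}(\mathbf{z}) = \frac{p_D \, \omega_i^{(k|k-1)} \, q_i^{(k)}(\mathbf{z})}{\kappa^{(k)}(\mathbf{z}) + p_D \sum_{j} \omega_j^{(k|k-1)} \, q_j^{(k)}(\mathbf{z})},$$

where $\mathbf{I}$ is a $d \times d$ identity matrix ($d$ being the state dimension), and $\mathbf{m}_i^{(k|k-1)}$, $\mathbf{\Sigma}_i^{(k|k-1)}$ and $\omega_i^{(k|k-1)}$ are the mean vector, covariance matrix and Gaussian weight of the $i$-th member of the predicted RFS, respectively. $\mathbf{H}$ is a prediction to measurement space mapping matrix. $p_D$ is the probability of detection and $\kappa^{(k)}$ is the clutter intensity at time $k$. $q_i^{(k)}$ is a Gaussian distribution over the measurement space at time $k$, with mean and covariance matrix computed using the $i$-th target, i.e.

$$q_i^{(k)}(\mathbf{z}) = \mathcal{N}\left(\mathbf{z}; \mathbf{H}\mathbf{m}_i^{(k|k-1)}, \mathbf{R} + \mathbf{H}\mathbf{\Sigma}_i^{(k|k-1)}\mathbf{H}^{T}\right),$$

where $\mathbf{R}$ is the covariance of the measurement noise. $\mathbf{K}_i^{(k)}$ is the Kalman gain matrix for the $i$-th predicted target, computed as:

$$\mathbf{K}_i^{(k)} = \mathbf{\Sigma}_i^{(k|k-1)} \mathbf{H}^{T} \left(\mathbf{H}\mathbf{\Sigma}_i^{(k|k-1)}\mathbf{H}^{T} + \mathbf{R}\right)^{-1}.$$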
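To make the update concrete, the sketch below applies the standard GM-PHD measurement update of [Vo and Ma2006] to a scalar state with a scalar measurement mapping, for a single measurement. All numerical values (detection probability, clutter intensity, noise variance) and the function names are hypothetical, and the missed-detection terms of the full filter are left out for brevity.

```python
import math

def gaussian(x, mean, var):
    """Value of a 1D Normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gm_phd_update(predicted, z, noise_var, p_detect, clutter, h=1.0):
    """Update predicted Gaussian components with one measurement z.

    predicted: list of (weight, mean, variance) tuples.
    Returns the updated (weight, mean, variance) tuples.
    """
    unnormalised = []
    for w, m, p in predicted:
        s = h * p * h + noise_var          # innovation variance
        k = p * h / s                      # Kalman gain
        q = gaussian(z, h * m, s)          # measurement likelihood
        unnormalised.append((p_detect * w * q,
                             m + k * (z - h * m),
                             (1.0 - k * h) * p))
    norm = clutter + sum(w for w, _, _ in unnormalised)
    return [(w / norm, m, p) for w, m, p in unnormalised]

predicted = [(1.0, 0.0, 1.0), (1.0, 10.0, 1.0)]   # two predicted targets
updated = gm_phd_update(predicted, z=0.5, noise_var=1.0,
                        p_detect=0.9, clutter=0.01)
```

With these values the measurement at 0.5 is explained almost entirely by the first component, whose mean moves halfway towards the measurement and whose variance is halved, while the weight of the distant second component collapses towards zero.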

After computing the update state tuples, the Gaussian mixture pruning and merging steps explained in [Vo and Ma2006] are performed over the targets. In order to allow propagating mean state vectors corresponding to sudden target birth, the “maximum allowable number of Gaussian terms” ( in [Vo and Ma2006]) is selected by computing the maximum between

(the number of targets in the previous time step) and a sample from a Poisson distribution with


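% The Hungarian Assignment outputs a binary assignment matrix
for each assigned pair of previous and current targets do
        % the target survives:
        Increment age, keep the label and append to the associated targets, update motion vector with the associated track;
end for
for each current target without an assigned previous target do
        % the target is a birth:
        Assign the birth label, birth age and birth motion vector and append to the associated targets;
end for
for each previous target without an assigned current target do
        % the target is a decaying target:
        Decrement age, keep the label and append to the associated targets, do not update the motion vector;
end for
% Most mature targets: iterate over the associated targets
for each associated target do
        if its age is higher than the age threshold then
                Append to the extracted target states;
        end if
end for
Algorithm 1 Target extraction algorithm: The pseudo code explains how the track-to-track (t2t) association and mature targets extraction are performed to obtain the extracted target states. Three helper functions compute the row-wise summation, the column-wise summation and the Hungarian Assignment, respectively.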

Target state extraction

After the update step is performed, those targets whose weights are higher than a threshold are selected as the initial updated target RFS. Using the target RFS of the previous time step, a track-to-track association is performed over the selected targets. A distance metric is computed between each target in the previous time step and each (yet unlabelled) target computed at the current time step. The metric combines the intersection over union (IoU) between the two targets with a distance measure between them; the IoU is multiplied by the age of the target at the previous time step to increase the importance of those targets with longer temporal presence. Computing this metric between all the previous and current targets constitutes a matrix, which is given to the Hungarian Assignment algorithm to determine the survival, birth and death of targets, as explained in the pseudo-code of Algorithm 1. Finally, the most mature targets, whose age is higher than a threshold, are selected as the extracted target states. These are then used to compute the PHD function at the current time step. The PDD map is then calculated and appended to the training batch to train the ConvLSTM network for the next time step. In Algorithm 1, the age of a target is incremented and decremented using the target age amplification and attenuation factors, respectively, keeping the integer part of the result.
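The association logic of Algorithm 1 can be sketched as below. This is a simplified stand-in rather than the paper's method: the score is an age-weighted IoU only (the exact distance metric is not recoverable from the text), the optimal assignment is found by brute force over permutations instead of the Hungarian algorithm (practical only for a handful of targets), and the `min_iou` gate and all names are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0

def associate(previous, current, ages, min_iou=0.1):
    """Split targets into survivors, births and decaying targets."""
    n_prev, n_cur = len(previous), len(current)
    overlap = [[iou(p, c) for c in current] for p in previous]
    size = max(n_prev, n_cur)
    best_total, best_perm = -1.0, ()
    for perm in permutations(range(size)):     # brute-force assignment
        total = sum(ages[i] * overlap[i][perm[i]]
                    for i in range(n_prev) if perm[i] < n_cur)
        if total > best_total:
            best_total, best_perm = total, perm
    survivors = [(i, best_perm[i]) for i in range(n_prev)
                 if best_perm[i] < n_cur and overlap[i][best_perm[i]] >= min_iou]
    matched_prev = {i for i, _ in survivors}
    matched_cur = {j for _, j in survivors}
    births = [j for j in range(n_cur) if j not in matched_cur]
    decaying = [i for i in range(n_prev) if i not in matched_prev]
    return survivors, births, decaying

previous_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
previous_ages = [3, 1]
current_boxes = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]

survivors, births, decaying = associate(previous_boxes, current_boxes, previous_ages)
```

In this example both previous targets are matched to overlapping current boxes (survivors), the isolated third box becomes a birth, and no previous target is left decaying.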

Experimental results


We have evaluated our algorithm over the Multiple Object Tracking 2015 (MOT15) and 2017 (MOT17) datasets [Leal-Taixé et al.2015], which contain 11 and 14 video sequences, respectively, captured via both fixed and moving platforms from various crowded scenes for pedestrian tracking. Pedestrian detection is performed by Aggregated Channel Features (ACF) [Dollár et al.2014] for MOT15, and by the Deformable Parts Model (DPM) [Felzenszwalb et al.2010], Faster Region-based CNN (R-CNN) [Ren et al.2015] and Scale Dependent Pooling (SDP) [Yang, Choi, and Lin2016] for MOT17. We have particularly used the MOT15 dataset to evaluate the multi-target filtering performance, because of the high intensity of clutter generated by ACF over this dataset. We have also used the Performance Evaluation of Tracking and Surveillance 2009 S2 (PETS09) sequences with an ACF pedestrian detector. PNNL Parking Lot 1 [Shu et al.2012], the only sequence of the PNNL dataset with publicly available detections, is also utilised. The proposed MTFT algorithm is implemented, end-to-end, in Python 2.7. Keras with a TensorFlow backend is used for the ConvLSTM implementation, running on an NVIDIA GeForce GTX 1080 Ti GPU. Our results are obtained using the ReLU activation function, one block of sixteen convolutional filters and 20 training epochs. During the training of the ConvLSTM network we used the ADAM optimiser with the following parameters: learning rate, $\beta_1$ and $\beta_2$ as 0.001, 0.9 and 0.99, respectively.

One time step at a glance

In this section, we give a brief overview of the output of the major blocks in Fig. 2. Figure 4-b shows the output PDD map from the ConvLSTM prediction step for the image at the current time step, shown in Fig. 4-a. The peaks in this map indicate those regions corresponding to “faster” target movement. Here, by faster we mean how quickly a target is moving with respect to its covariance matrix. For such targets, the corresponding peak in the next time step’s PHD function will be farther away, resulting in a high peak after subtraction and creating the bright yellow regions in Fig. 4-b. On the other hand, the darker regions indicate those targets which are mostly stationary.

Figure 4: Prediction step: (a) ; (b) Image at the iteration; (c)

Using (2) and the post-processing step explained in the MTFT pipeline, the corresponding predicted PHD function is calculated, which is shown in Fig. 4-c, where its peaks correspond to the expected locations of the targets. After obtaining the measurements at the current time step, the updated PHD function is calculated. An overlaid plot of the updated PHD function and the image is illustrated in Fig. 5, where the peaks show the expected locations of the targets. This new PHD function is used to compute the next PDD map, which is appended to the previous batch to (online) train the ConvLSTM and predict for the next time step.

Figure 5: Overlaid plot of over the current image.

Filtering performance

We compute the Optimal Sub-Pattern Assignment (OSPA, [Schuhmacher, Vo, and Vo2008]) distance, which has been extensively utilised for evaluating the accuracy of multi-target filtering algorithms [Vo, Vo, and Hoang2017, Fantacci et al.2018, Meyer et al.2017, Beard et al.2017] (the mathematical definition of OSPA is provided in Supplementary Material). As OSPA assumes point targets, here we used the centre of the bounding boxes to represent each target (in the next section, however, we evaluate the tracking performance using bounding box representation for targets).
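For reference, a compact sketch of the OSPA distance for point targets is given below, following the standard definition in [Schuhmacher, Vo, and Vo2008]. The cut-off `c`, the order `p` and the brute-force search over assignments are choices made for this illustration (the values used in the experiments are not stated in this section), and the search is only practical for small sets.

```python
import math
from itertools import permutations

def ospa(xs, ys, c=100.0, p=1.0):
    """Return (overall OSPA, cardinality component) between two point sets.

    xs, ys: lists of coordinate tuples; c: cut-off distance; p: order.
    """
    if len(xs) > len(ys):
        xs, ys = ys, xs                    # make xs the smaller set
    m, n = len(xs), len(ys)
    if n == 0:
        return 0.0, 0.0                    # both sets empty
    # Best assignment of the m points in xs to distinct points in ys.
    best = min(sum(min(c, math.dist(x, ys[j])) ** p
                   for x, j in zip(xs, perm))
               for perm in permutations(range(n), m))
    penalty = (c ** p) * (n - m)           # unmatched points cost c each
    overall = ((best + penalty) / n) ** (1.0 / p)
    card = (penalty / n) ** (1.0 / p)
    return overall, card

estimates = [(0.0, 0.0)]
ground_truth = [(3.0, 4.0), (50.0, 50.0)]
overall, card = ospa(estimates, ground_truth, c=10.0, p=1.0)
```

In this example one estimate is matched at a distance of 5 and one ground-truth point is missed, so with `c = 10` and `p = 1` the overall error is 7.5, of which 5.0 is the cardinality component; for `p = 1` the localisation part is simply the difference between the two.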

Tables 1 and 2 show the comparative performance of several MTFT algorithms over the MOT15 and MOT17 datasets, respectively. As OSPA is computed at each time step, the results in these tables are averaged over all time steps for all video sequences (the Loc OSPA can be computed by subtracting the Card error from the overall error). Considering the PHD-based algorithms (PHD-EKF, PHD-SMC and PHD-UKF) as the baseline, our proposed ConvLSTM MTFT algorithm shows significantly better performance over both datasets. In particular, there is a remarkable reduction in the cardinality error. We have also compared our method with four tracking algorithms: SORT [Bewley et al.2016], which is one of the fastest online tracking approaches reported over MOT15; DeepSORT [Wojke, Bewley, and Paulus2017], an extension of SORT with deep associations; Re3 [Gordon, Farhadi, and Fox2018], a deep RNN-based multi-target tracker; and the RNN_LSTM algorithm [Milan et al.2017], one of the pioneering algorithms using RNNs for multi-target tracking. Our overall average OSPA error is lower than those of RNN_LSTM, SORT, DeepSORT and Re3. For MOT17, the overall average error is higher than the MOT15 results for all algorithms (Table 2). The reason is that, compared to MOT15, there is a significantly higher number of annotated objects in this dataset. However, similar to MOT15, our algorithm outperforms the other methods in terms of overall OSPA.

| Methods | OSPA Card | Overall OSPA |
|---|---|---|
| ConvLSTM | 19.16 | 40.40 |

Table 1: OSPA error on MOT15: Comparison against PHD [Mahler2007, Vo and Ma2006], CPHD [Nagappa et al.2017, Mahler2007], LMB [Reuter et al.2014] and GLMB [Vo, Vo, and Phung2014, Vo, Vo, and Hoang2017], with EKF, SMC, and UKF prediction and update steps, and other multi-target tracking algorithms.

| Methods | OSPA Card | Overall OSPA |
|---|---|---|
| ConvLSTM | 53.79 | 62.29 |

Table 2: OSPA error on the MOT17 dataset.

| Algorithm | Rcll | Prcn | MOTA | MOTAL |
|---|---|---|---|---|
| SORT-DPM | 37.0 | 76.6 | 24.7 | 25.7 |
| SORT-FRCNN | 50.5 | 97.3 | 48.5 | 49.2 |
| SORT-SDP | 63.0 | 98.3 | 61.0 | 61.8 |
| DeepSORT-DPM | 32.4 | 91.3 | 28.2 | 29.3 |
| DeepSORT-FRCNN | 51.7 | 95.7 | 48.6 | 49.4 |
| DeepSORT-SDP | 64.7 | 97.3 | 61.6 | 62.9 |
| Re3-DPM | 37.0 | 76.2 | 24.3 | 25.4 |
| Re3-FRCNN | 50.6 | 97.1 | 48.4 | 49.0 |
| Re3-SDP | 62.9 | 98.0 | 60.7 | 61.6 |
| RNN_LSTM-DPM | 32.8 | 83.6 | 25.6 | 26.4 |
| RNN_LSTM-FRCNN | 44.7 | 88.9 | 38.4 | 39.1 |
| RNN_LSTM-SDP | 49.7 | 87.7 | 41.5 | 42.7 |
| ConvLSTM-DPM | 38.9 | 69.9 | 20.3 | 22.1 |
| ConvLSTM-FRCNN | 53.3 | 92.6 | 48.1 | 49.0 |
| ConvLSTM-SDP | 67.1 | 94.9 | 62.0 | 63.5 |

Table 3: Multi-target tracking performance over MOT17.

| Algorithm | Rcll | Prcn | MOTA | MOTAL |
|---|---|---|---|---|
| SORT | 71.4 | 98.5 | 69.0 | 70.3 |
| DeepSORT | 74.3 | 97.5 | 70.5 | 72.1 |
| Re3 | 70.9 | 97.8 | 68.0 | 69.3 |
| ConvLSTM | 77.7 | 93.3 | 71.0 | 72.1 |

Table 4: Multi-target tracking performance over PNNL Parking Lot dataset using [Shu et al.2012] detector.

| Algorithm | Rcll | Prcn | MOTA | MOTAL |
|---|---|---|---|---|
| SORT | 75.0 | 87.4 | 61.9 | 64.1 |
| DeepSORT | 88.0 | 83.0 | 65.2 | 62.9 |
| Re3 | 73.1 | 83.6 | 56.5 | 58.8 |
| RNN_LSTM | 91.1 | 68.1 | 43.9 | 48.3 |
| ConvLSTM | 80.9 | 86.8 | 66.9 | 68.5 |

Table 5: Multi-target tracking performance over the PETS09 dataset using ACF [Dollár et al.2014] as the detector.

Tracking performance

The widely used multi-target tracking CLEAR MOT metrics explained in [Bernardin and Stiefelhagen2008] are used to quantitatively evaluate the proposed ConvLSTM algorithm. Three datasets are used: MOT17, PNNL Parking Lot and PETS09. Unlike the OSPA results, for which point targets were assumed, here each target is represented as a bounding box. The multi-target tracking results for MOT17, PNNL Parking Lot and PETS09 are shown in Tables 3, 4 and 5, respectively. For MOT17, all three publicly available detections are used. ConvLSTM’s performance is highest when the SDP detector is used, with a MOTA of 62.0, a recall of 67.1 and a MOTAL of 63.5 (Table 3). Also, ConvLSTM achieves a MOTA of 71.0 and 66.9 over the Parking Lot and PETS09 sequences, when the detection method in [Shu et al.2012] and ACF are used, respectively (Tables 4 and 5). For the Parking Lot sequence, ground truth is only available for the first 748 frames, and hence we have evaluated the algorithms over these frames only. Our ConvLSTM approach generates a lower miss rate (fewer false negatives), resulting in higher recall, as can be seen in Tables 3 and 4.


This paper detailed a spatio-temporal data prediction approach applicable to MTFT problems. The prediction is performed simultaneously for all of the targets, over an implicit continuous hypothetical target space, via a ConvLSTM neural network. The proposed approach not only significantly improves on the baseline RFS filters, but also shows substantial potential when compared with other state-of-the-art MTFT algorithms. Our algorithm, which makes an important step towards an end-to-end learning model for online MTFT over realistic scenarios, can be enhanced in several aspects. An immediate improvement would be to include the update step within the ConvLSTM framework. Also, as an alternative to the ConvLSTM network (which is a discriminative algorithm approximating the posterior densities), generative neural networks can be utilised to directly estimate the joint PDF, eliminating the need for a separate covariance estimation at the prediction step.


  • [Bay, Lepsoy, and Magli2016] Bay, A.; Lepsoy, S.; and Magli, E. 2016. Stable limit cycles in recurrent neural networks. In 2016 International Conference on Communications (COMM), 89–92.
  • [Beard et al.2017] Beard, M.; Vo, B. T.; Vo, B. N.; and Arulampalam, S. 2017. Void probabilities and Cauchy-Schwarz divergence for generalized labeled multi-Bernoulli models. IEEE Transactions on Signal Processing 65(19):5047–5061.
  • [Bernardin and Stiefelhagen2008] Bernardin, K., and Stiefelhagen, R. 2008. Evaluating multiple object tracking performance: the CLEAR MOT metrics. Journal on Image and Video Processing 2008:1.
  • [Bewley et al.2016] Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; and Upcroft, B. 2016. Simple online and realtime tracking. In ICIP, 3464–3468.
  • [Dollár et al.2014] Dollár, P.; Appel, R.; Belongie, S.; and Perona, P. 2014. Fast feature pyramids for object detection. IEEE Transactions on PAMI 36(8):1532–1545.
  • [Emambakhsh, Bay, and Vazquez2018] Emambakhsh, M.; Bay, A.; and Vazquez, E. 2018. Deep recurrent neural network for multi-target filtering. arXiv preprint arXiv:1806.06594.
  • [Emambakhsh, Ebrahimnezhad, and Sedaaghi2010] Emambakhsh, M.; Ebrahimnezhad, H.; and Sedaaghi, M. 2010. Integrated region-based segmentation using color components and texture features with prior shape knowledge. International Journal of Applied Mathematics and Computer Science 20(4):711–726.
  • [Fantacci et al.2018] Fantacci, C.; Vo, B. N.; Vo, B. T.; Battistelli, G.; and Chisci, L. 2018. Robust fusion for multisensor multiobject tracking. IEEE Signal Processing Letters 25(5):640–644.
  • [Felzenszwalb et al.2010] Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32(9):1627–1645.
  • [Gordon, Farhadi, and Fox2018] Gordon, D.; Farhadi, A.; and Fox, D. 2018. Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters 3(2):788–795.
  • [Graves, Mohamed, and Hinton2013] Graves, A.; Mohamed, A.-r.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, 6645–6649. IEEE.
  • [Henriques et al.2015] Henriques, J. F.; Caseiro, R.; Martins, P.; and Batista, J. 2015. High-speed tracking with kernelized correlation filters. IEEE Transactions on PAMI 37(3):583–596.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
  • [Leal-Taixé et al.2015] Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; and Schindler, K. 2015. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942.
  • [Mahler2003] Mahler, R. P. S. 2003. Multitarget Bayes filtering via first-order multitarget moments. IEEE Transactions on Aerospace and Electronic Systems 39(4):1152–1178.
  • [Mahler2007] Mahler, R. 2007. PHD filters of higher order in target number. IEEE Transactions on Aerospace and Electronic Systems 43(4):1523–1543.
  • [Meyer et al.2017] Meyer, F.; Braca, P.; Willett, P.; and Hlawatsch, F. 2017. A scalable algorithm for tracking an unknown number of targets using multiple sensors. IEEE Transactions on Signal Processing 65(13):3478–3493.
  • [Milan et al.2017] Milan, A.; Rezatofighi, S.; Dick, A.; Reid, I.; and Schindler, K. 2017. Online multi-target tracking using recurrent neural networks. In AAAI.
  • [Moratuwage et al.2014] Moratuwage, D.; Wang, D.; Rao, A.; Senarathne, N.; and Wang, H. 2014. RFS collaborative multivehicle SLAM: SLAM in dynamic high-clutter environments. IEEE Robotics Automation Magazine 21(2):53–59.
  • [Nagappa et al.2017] Nagappa, S.; Delande, E. D.; Clark, D. E.; and Houssineau, J. 2017. A tractable forward-backward CPHD smoother. IEEE Transactions on Aerospace and Electronic Systems 53(1):201–217.
  • [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 91–99.
  • [Reuter et al.2014] Reuter, S.; Vo, B. T.; Vo, B. N.; and Dietmayer, K. 2014. The labeled multi-Bernoulli filter. IEEE Transactions on Signal Processing 62(12):3246–3260.
  • [Schuhmacher, Vo, and Vo2008] Schuhmacher, D.; Vo, B. T.; and Vo, B. N. 2008. A consistent metric for performance evaluation of multi-object filters. IEEE Transactions on Signal Processing 56(8):3447–3457.
  • [Shi et al.2015] Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, 802–810.
  • [Shu et al.2012] Shu, G.; Dehghan, A.; Oreifej, O.; Hand, E.; and Shah, M. 2012. Part-based multiple-person tracking with partial occlusion handling. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 1815–1821. IEEE.
  • [Vo and Ma2006] Vo, B. N., and Ma, W. K. 2006. The Gaussian mixture probability hypothesis density filter. IEEE Transactions on Signal Processing 54(11):4091–4104.
  • [Vo, Singh, and Doucet2005] Vo, B. N.; Singh, S.; and Doucet, A. 2005. Sequential Monte Carlo methods for multitarget filtering with random finite sets. IEEE Transactions on Aerospace and Electronic Systems 41(4):1224–1245.
  • [Vo, Vo, and Hoang2017] Vo, B. N.; Vo, B. T.; and Hoang, H. G. 2017. An efficient implementation of the generalized labeled multi-Bernoulli filter. IEEE Transactions on Signal Processing 65(8):1975–1987.
  • [Vo, Vo, and Phung2014] Vo, B. N.; Vo, B. T.; and Phung, D. 2014. Labeled random finite sets and the Bayes multi-target tracking filter. IEEE Transactions on Signal Processing 62(24):6554–6567.
  • [Wojke, Bewley, and Paulus2017] Wojke, N.; Bewley, A.; and Paulus, D. 2017. Simple online and realtime tracking with a deep association metric. In ICIP, 3645–3649.
  • [Yang, Choi, and Lin2016] Yang, F.; Choi, W.; and Lin, Y. 2016. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2129–2137.