I Introduction
Multitarget filtering (MTF) is the process of obtaining true positive samples from a cluttered and noisy data sequence. It has numerous applications in tracking [1, 2, 3], radar/LiDAR signal processing [4], simultaneous localization and mapping (SLAM) and occupancy grid computation in robotics, and sensor fusion [5, 6, 7, 8, 9, 10].
Defining a robust motion model is a key step for MTF algorithms [11, 12, 13]. Briefly, motion models formulate the prior knowledge about the variations over the state (latent) space. In a Bayesian framework, they are used to predict the target states, which are then corrected using the obtained measurements (observations). A weak motion model can deteriorate the filtering performance by propagating a wrong prediction over the state space. Such an issue can be even more salient for fixed motion models, which do not adapt themselves to the incoming data.
On the other hand, learning such motion patterns can be difficult, because: (1) In an MTF problem, the number of targets is usually variable and unknown, making the model design very difficult; (2) Since the incoming data sequence is usually highly cluttered and noisy, learning-based models can be trained on false positive samples, creating misinformation propagation; (3) Due to the high number of parameters influencing filtering problems, assigning separate train/validation and test scenarios is very difficult and can lead the model to overfit. The motion model is expected to be learned online, using the incoming data until the current time step; (4) Speed is crucial; a learning-based method should be computationally comparable with its fixed motion model-based rivals.
Considering these challenges, in this paper we propose online learning of the motion models (OLMM) to perform MTF from the incoming sequence of data. This is performed, on the fly, by training recurrent neural networks (RNN) with a long short-term memory (LSTM) architecture, used as a regression block, over the target state space. The filtering and update are then performed by a novel data association algorithm. Our implementation allows GPU memory reusability by placeholder utilisation and facilitates transfer learning by initialising the LSTM state predictors with weights and biases reused from other targets. We have evaluated the algorithm over two datasets containing point targets: (1) A commonly used synthetic dataset, which contains numerous MTF challenges, such as nonlinear motion, birth, spawn, merge and death of targets; (2) The bird's-eye view of the Duke Multi-Target, Multi-Camera (DukeMTMC) pedestrian tracking dataset. Our experimental results show a remarkable performance of our algorithm when compared with previous filtering approaches.
Contributions. Unlike the MTF algorithms in [14, 15], which use fixed motion models, the proposed algorithm learns the motion model from the incoming data. The proposed data association algorithm has linear complexity and, compared with RFS MTF algorithms [14, 15], relies on a significantly smaller number of hyperparameters. For example, [16] requires hyperparameters to perform pruning, merging and truncation of the output density function, in addition to the clutter distribution, survival and detection probabilities. As opposed to previous neural network-based methods [17], OLMM does not rely on separate training and test steps and is trained on the fly. To the best of our knowledge, the proposed algorithm is one of the first of its kind capable of applying an LSTM to filter a densely cluttered sequence of data. It should be mentioned that the current paper is an extension of our recent work in [18], which has been significantly enhanced in the following aspects: (1) More extensive experimental results are provided over both synthetic and real data; (2) Compared to our initial paper, a significantly improved and unified mathematical framework is provided. The algorithm is explained via several diagrams and a pseudo code for immediate implementation is provided; (3) Complexity analysis and elapsed time for each time step are reported.

In this paper, we use italic font for scalar, tuple and random finite set (RFS) parameters/variables, while bold font is used for vectors and matrices. Also, we use $k$, $i$ and $j$ to indicate the time step and the sample indexes from the target and measurement RFS, respectively.

II Related work
II-A Fixed models: Bayesian paradigms
Prior modelling of the targets' behaviour can be based on appearance or motion (kinematics) equations. Using these models, the state vectors defined for each target are predicted. The predictions are then mapped onto the measurement space to perform correction. In a Bayesian formulation of single-target filtering, the goal is to estimate the (hidden) target state from a set of observations. Filtering is a recursive problem; the state estimation at the $k^{\text{th}}$ time step is usually obtained by the Maximum A Posteriori (MAP) criterion over the state space given the past observations. The Kalman filter is arguably the most popular online filtering approach. It assumes linear motion models with Gaussian distributions for both the prediction and update steps. Nonlinearity and non-Gaussian behaviour are addressed by the Extended and Unscented Kalman Filters (EKF, UKF), respectively.
Mahler proposed RFS [19], an encapsulated formulation for MTF, incorporating clutter density, probabilities of detection, survival and birth of targets [20, 19, 21]. Targets and measurements are assumed to form sets, with variable random cardinalities. Using Finite Set Statistics [19], the posterior distribution for a single target can be extended from vectors to RFS. Facilitated by the RFS formulation, Probability Hypothesis Density (PHD) maps [20, 16]
are proposed to represent target states. These maps have two basic features: 1) Their peaks correspond to the locations of targets; 2) Their integral gives the expected number of targets at each time step. In their seminal paper, Vo and Ma proposed the Gaussian Mixture PHD (GM-PHD) filter, which propagates the first-order statistical moments to estimate the posterior density as a mixture of Gaussians [16]. While GM-PHD represents the hypothetical target state via a mixture of Gaussians, a particle filter-based solution is proposed by the Sequential Monte Carlo PHD (SMC-PHD) filter to address non-Gaussian distributions [20]. The Cardinalised PHD (CPHD) filter is proposed by Mahler to also propagate the cardinality of the targets over time [21], while its intractability is addressed in [22]. Also, Lu et al. proposed an algorithm addressing missed detections, enhancing track continuity [23]. On the other hand, target spawning within the CPHD framework is addressed in [24]. A PHD and CPHD filter which propagates the second-order statistics in parallel with the mean is proposed by Schlangen et al. [25], which significantly outperforms CPHD in terms of computational cost.
The Labelled Multi-Bernoulli (LMB) filter is introduced in [14], which performs track-to-track association and outperforms previous algorithms in that it does not rely on a high signal-to-noise ratio. Vo et al. proposed the Generalized Labelled Multi-Bernoulli (GLMB) filter as a labelled MTF [15], while García-Fernández et al. introduced an approach to derive the Poisson LMB filter without using probability generating functionals [26]. Since a large number of particles needs to be propagated by Monte Carlo-based methods, the computational complexity can be high and hence gating might be necessary. An inaccurate gating, however, can filter out legitimate targets and increase the false negative rate.
II-B Neural filtering
Parisini and Zoppoli reformulated the state estimation process of MTF algorithms as a nonlinear programming problem [27]. Neural network-based sequential learning solutions have long been infamous for their vulnerability to small datasets, their tendency to under/overfit, and their slow computational speed during the training and test phases. With the advances in computational power in recent years, however, neural network-based approaches have become capable of learning from a large number of sequences. This has opened a new window for MTF, as these methods can model the latent information within the data sequence in parallel with filtering its false positive samples. RNNs are neural networks with feedback loops, through which past information can be stored and exploited. They offer promising solutions to difficult tasks such as system identification, prediction, pattern classification, and stochastic sequence modelling [28]. RNNs are known to be particularly hard to train, especially when long temporal dependencies are involved, due to the so-called vanishing gradient phenomenon. Learning motion models via neural filters can be difficult because of the varying number of targets in a cluttered scene, which is quite common in an MTF problem. This can make the model design very difficult, especially since neural networks usually have a fixed architecture. Also, since the incoming data sequence is usually highly cluttered and noisy, learning-based models can be trained on false positive samples, creating misinformation propagation. Assigning separate train/validation and test scenarios is very difficult for filtering scenarios, which makes neural filtering algorithms vulnerable to overfitting. And finally, the motion model is expected to be learned online, using the available data at the current time step, which is another challenge for neural filtering.

List of symbols

$k$: time step
$i$: target index
$j$: measurement (observation) index
$T^{(i)}_k$: $i^{\text{th}}$ target tuple at time $k$
$\mathcal{T}_k$: target RFS at $k$
$M_k$: number of targets at $k$
$\mathbf{X}^{(i)}_k$: $i^{\text{th}}$ target state matrix at $k$
$n^{(i)}_k$: number of collected samples for the $i^{\text{th}}$ target at $k$
$d$: dimensionality of the state space
$a^{(i)}_k$: age of the $i^{\text{th}}$ target at $k$
$g^{(i)}_k$: genuinity error of the $i^{\text{th}}$ target at $k$
$f^{(i)}_k$: freeze state of the $i^{\text{th}}$ target at $k$
$l$: layer index of the LSTM model
$L$: number of LSTM network hidden layers
$\mathbf{h}^{l}$: hidden state of the $l^{\text{th}}$ layer of the LSTM model
$\mathbf{i}^{l}$: input gate of the $l^{\text{th}}$ layer of the LSTM model
$\mathbf{t}^{l}$: transform gate of the $l^{\text{th}}$ layer of the LSTM model
$\mathbf{f}^{l}$: forget gate of the $l^{\text{th}}$ layer of the LSTM model
$\mathbf{o}^{l}$: output gate of the $l^{\text{th}}$ layer of the LSTM model
$\mathbf{c}^{l}$: memory cell of the $l^{\text{th}}$ layer of the LSTM model
$\Omega_k$: LSTM model tuple at $k$
$\hat{\mathbf{x}}^{(i)}$: $i^{\text{th}}$ target's estimated state
$m$: sample index within the target state matrix
$\tilde{M}$: number of predicted targets
$\tilde{T}^{(i)}$: $i^{\text{th}}$ target predicted tuple
$\tilde{\mathcal{T}}$: predicted target RFS
$\mathcal{Z}_k$: measurement RFS at $k$
$N_k$: number of measurements at $k$
$r^{(i,j)}$: residual tuple computed using the $i^{\text{th}}$ target and $j^{\text{th}}$ measurement at $k$
$\mathcal{R}_k$: residual RFS at $k$
$\epsilon^{(i,j)}$: targetness error calculated using the $i^{\text{th}}$ target and $j^{\text{th}}$ measurement at $k$
$\mathbf{z}^{(j)}_k$: $j^{\text{th}}$ measurement at $k$
$\mathbf{E}_k$: targetness error matrix at $k$
$u_j$: index of the closest target to the $j^{\text{th}}$ measurement at $k$
$v_i$: index of the closest measurement to the $i^{\text{th}}$ target at $k$
$\delta^{z}_{j}$: distance of the closest target to the $j^{\text{th}}$ measurement at $k$
$\delta^{t}_{i}$: distance of the closest measurement to the $i^{\text{th}}$ target at $k$
$\mathbf{u}$: vector containing each $u_j$ for $j = 1, \dots, N_k$
$\mathbf{v}$: vector containing each $v_i$ for $i = 1, \dots, \tilde{M}$
$\mathbf{h}^{u}$: histogram of $\mathbf{u}$
$\mathbf{h}^{v}$: histogram of $\mathbf{v}$
$\lambda$: mean of the Poisson distribution
$\sigma_r$: standard deviation of the radial detection error
$\sigma_\theta$: standard deviation of the bearing detection error
$a_{\min}$: minimum target age
$g_{\min}$: minimum target genuinity error
$g_{\max}$: maximum target genuinity error
$\mathcal{T}^{u}_{k}$: updated target RFS at $k$
$\mathcal{T}^{b}_{k}$: birth target RFS at $k$
III Target representation
Let us define $T^{(i)}_k$, the $i^{\text{th}}$ target tuple at the $k^{\text{th}}$ time step, as a member of the RFS $\mathcal{T}_k$ as follows,

$T^{(i)}_k = \left(\mathbf{X}^{(i)}_k, a^{(i)}_k, g^{(i)}_k, f^{(i)}_k\right)$,   (1)

where $\mathbf{X}^{(i)}_k$ (an $n^{(i)}_k \times d$ matrix) contains the target state of $n^{(i)}_k$ samples over a $d$-dimensional state space. $a^{(i)}_k$ is an integer indicating the age of the target (the higher the age, the longer the target has survived). $g^{(i)}_k$ is a real positive number containing the target's genuinity error; it quantifies how legitimate the current target is over a continuous space, where higher values correspond to a higher likelihood of false positivity. $f^{(i)}_k$ is a binary freeze state variable, which is: 1 (True), if there is no associated measurement for this target (due to occlusions, false positivity or detection failure); or 0 (False), when there is at least one measurement associated with this target (a "surviving" target). $\mathcal{T}_k$ is an RFS with $M_k$ cardinality, which contains all the target tuples at $k$.
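To make the representation in (1) concrete, the target tuple can be sketched as a small container type. This is an illustrative sketch only; the field names below are our own, not the paper's.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TargetTuple:
    """One target tuple: state matrix, age, genuinity error, freeze state."""
    X: np.ndarray           # (n, d) state matrix: n collected samples in a d-dim state space
    age: int = 1            # number of time steps the target has survived
    genuinity: float = 0.0  # higher values -> higher likelihood of false positivity
    frozen: bool = False    # True when no measurement was associated at this step

# The target RFS is then simply a collection of such tuples with variable cardinality.
targets = [TargetTuple(X=np.zeros((1, 2)))]
```

A list stands in for the RFS here, since only the variable cardinality matters for this sketch.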
IV OLMM pipeline
The overall MTF pipeline in one time step is illustrated as a block diagram in Fig. 1. The LSTM network is trained using the available data for each target and then used to predict target state. Next, a set of residuals is computed over the predictions and current measurement sets. Filtering and data association are finally performed to assign target survival and birth sets. In the following sections, each of these steps is explained in detail.
IV-A Online motion modelling
The target state variations over video frames can be seen as a sequential learning problem. Thus, we apply an LSTM network to learn a global motion model, since dedicating one LSTM to each target leads to memory management issues. The network is trained online for each target using its past measurements, transferring the learned weights and biases from one target to the other. Formally, we used an $L$-layer LSTM defined as,

$\begin{pmatrix} \mathbf{i}^{l}_{m} \\ \mathbf{f}^{l}_{m} \\ \mathbf{o}^{l}_{m} \\ \mathbf{t}^{l}_{m} \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( \mathbf{W}^{l} \begin{pmatrix} \mathbf{h}^{l-1}_{m} \\ \mathbf{h}^{l}_{m-1} \end{pmatrix} + \mathbf{b}^{l} \right)$,   (2)

$\mathbf{c}^{l}_{m} = \mathbf{f}^{l}_{m} \odot \mathbf{c}^{l}_{m-1} + \mathbf{i}^{l}_{m} \odot \mathbf{t}^{l}_{m}, \qquad \mathbf{h}^{l}_{m} = \mathbf{o}^{l}_{m} \odot \tanh\left(\mathbf{c}^{l}_{m}\right)$,   (3)

where, for each layer $l$, each hidden block computes the hidden state $\mathbf{h}^{l}_{m}$ over the sample index $m$, using four gates $\mathbf{i}^{l}_{m}$, $\mathbf{t}^{l}_{m}$, $\mathbf{f}^{l}_{m}$, $\mathbf{o}^{l}_{m}$ (i.e. input, transform, forget, and output gates, respectively). The nonlinear elementwise activation functions are defined as $\sigma$ (logistic sigmoid) and $\tanh$, while $\mathbf{c}^{l}_{m}$ is the memory cell and $\odot$ denotes the elementwise product. Then, the network estimates the next target state as $\hat{\mathbf{x}}^{(i)}_{m+1} = \phi\left(\mathbf{W}^{L+1}\mathbf{h}^{L}_{m} + \mathbf{b}^{L+1}\right)$, for $m = 1, \dots, n^{(i)}_k - 1$, using the last hidden state $\mathbf{h}^{L}_{m}$ and a linear elementwise activation function $\phi$. Therefore, the network is completely described by the model tuple $\Omega = (\mathbf{W}, \mathbf{b})$, containing the weights and biases of the network (for simplicity of notation, we have omitted the layer indexes for the weight matrices and bias vectors within the model tuple). Thus, given the model tuple $\Omega_{k-1}$ at the previous time step, the network is updated as a regression block to minimise a mean square error loss function as follows,

$\mathcal{L}\left(\Omega\right) = \dfrac{1}{n^{(i)}_k - 1} \sum_{m=1}^{n^{(i)}_k - 1} \left\| \hat{\mathbf{x}}^{(i)}_{m+1} - \mathbf{x}^{(i)}_{m+1} \right\|^{2}_{2}$,   (4)

which is calculated over the newly estimated target state $\hat{\mathbf{x}}^{(i)}_{m+1}$ and the expected $\mathbf{x}^{(i)}_{m+1}$, for $m = 1, \dots, n^{(i)}_k - 1$. Minimising $\mathcal{L}$ gives the new model parameters $\Omega_k$, which contain the updated weights and biases of the LSTM.
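As a sanity check of the recurrence in (2)-(3), the following numpy sketch runs one LSTM hidden block over a short sample sequence. The weight initialisation, input dimensionality and gate slicing order are illustrative placeholders (20 hidden units, as used later in the experiments), not the paper's trained parameters.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM block: gate computation (2), then memory cell and hidden state (3).

    x: input vector; h_prev/c_prev: previous hidden and cell states;
    W: (4*H, D+H) stacked gate weights; b: (4*H,) stacked biases.
    """
    H = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sigmoid(z[0:H])         # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    o = sigmoid(z[2 * H:3 * H]) # output gate
    t = np.tanh(z[3 * H:4 * H]) # transform gate
    c = f * c_prev + i * t      # memory cell update, eq. (3)
    h = o * np.tanh(c)          # new hidden state, eq. (3)
    return h, c

rng = np.random.default_rng(0)
D, H = 2, 20                    # state dimensionality and hidden units (placeholders)
W = rng.standard_normal((4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):  # iterate over a short sample sequence
    h, c = lstm_step(x, h, c, W, b)
```

In the actual pipeline this recurrence is unrolled by the deep learning framework, and the regression head of (4) maps the last hidden state to the next target state.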
IV-B Predicting the target state
After the LSTM network is trained, we use the updated weights $\Omega_k$ and the latest target state vector to compute the predicted target state $\tilde{\mathbf{x}}^{(i)}$. This procedure is repeated for all targets in $\mathcal{T}_{k-1}$, resulting in the following predicted RFS,

$\tilde{\mathcal{T}} = \left\{ \tilde{T}^{(1)}, \tilde{T}^{(2)}, \dots, \tilde{T}^{(\tilde{M})} \right\}$,   (5)

where $\tilde{T}^{(i)}$ is the $i^{\text{th}}$ prediction tuple and $\tilde{\mathcal{T}}$ is the predicted RFS.
IV-C Filtering and update
Computing residuals. At the $k^{\text{th}}$ time step, a set of residuals is calculated using the obtained measurement RFS $\mathcal{Z}_k$. If $\mathcal{Z}_k$ has $N_k$ cardinality, assuming no gating is performed, there will be $\tilde{M} \times N_k$ residuals, which are stored as $\mathcal{R}_k$, where its $(i,j)^{\text{th}}$ tuple contains the residual information between the $i^{\text{th}}$ target and the $j^{\text{th}}$ measurement as follows,

$r^{(i,j)} = \left( \mathbf{z}^{(j)}_k, \epsilon^{(i,j)} \right)$,   (6)

in which $\mathbf{z}^{(j)}_k \in \mathcal{Z}_k$ is the $j^{\text{th}}$ measurement vector and $\epsilon^{(i,j)}$ is the targetness error parameter, which is computed as the second norm between the measurement and the predicted target state as follows,

$\epsilon^{(i,j)} = \left\| \tilde{\mathbf{x}}^{(i)} - \mathbf{z}^{(j)}_k \right\|_2$.   (7)

$\epsilon^{(i,j)}$ is a distance metric between the predicted target vector $\tilde{\mathbf{x}}^{(i)}$ and the measurement vector $\mathbf{z}^{(j)}_k$.

$\mathcal{R}_k$ is used to perform the filtering step, at which the survival of targets is determined, new births are assigned, and false positive targets and measurements are removed. To do this, first, using (7), an $N_k \times \tilde{M}$ matrix $\mathbf{E}_k$ is constructed, whose element at the $j^{\text{th}}$ row and $i^{\text{th}}$ column gives $\epsilon^{(i,j)}$. $\mathbf{E}_k$ contains the targetness errors between all measurements and target states. In the next section, we detail how $\mathbf{E}_k$ is used to perform data association.
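The construction of the targetness error matrix of (7) can be sketched in a few lines of numpy; the example predictions and measurements below are made up for illustration (two predicted targets, three measurements, one of them clutter).

```python
import numpy as np

def targetness_matrix(predictions, measurements):
    """E[j, i] = || x_tilde_i - z_j ||_2, the targetness error of (7)
    between the i-th predicted target state and the j-th measurement."""
    diff = measurements[:, None, :] - predictions[None, :, :]  # (N, M, d)
    return np.linalg.norm(diff, axis=2)                        # (N, M)

preds = np.array([[0.0, 0.0], [10.0, 10.0]])                   # M = 2 predicted states
meas = np.array([[0.0, 1.0], [10.0, 10.0], [50.0, 50.0]])      # N = 3 measurements
E = targetness_matrix(preds, meas)
```

With no gating, all N x M pairs are evaluated, which matches the residual count stated above.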
Data association. For each row and column of $\mathbf{E}_k$, the target and measurement indexes corresponding to the lowest targetness error are computed, respectively, as follows,

$u_j = \operatorname*{arg\,min}_{i} \mathbf{E}_k(j, i), \qquad v_i = \operatorname*{arg\,min}_{j} \mathbf{E}_k(j, i)$,   (8)

where $\mathbf{u}$ and $\mathbf{v}$ are $N_k \times 1$ and $\tilde{M} \times 1$ vectors, containing the indexes of the closest target and measurement, respectively. In addition to their indexes, the corresponding minimum values of each row and column of $\mathbf{E}_k$ are also computed,

$\delta^{z}_{j} = \min_{i} \mathbf{E}_k(j, i), \qquad \delta^{t}_{i} = \min_{j} \mathbf{E}_k(j, i)$,   (9)

where $\boldsymbol{\delta}^{z}$ and $\boldsymbol{\delta}^{t}$ are $N_k \times 1$ and $\tilde{M} \times 1$ vectors, respectively. Each element of $\boldsymbol{\delta}^{z}$ and $\boldsymbol{\delta}^{t}$ quantifies the measurement-to-target and target-to-measurement closest distance, respectively. In other words, $\delta^{z}_{j}$ and $\delta^{t}_{i}$ are the measurement and target errors for the $j^{\text{th}}$ measurement and $i^{\text{th}}$ target, respectively, which quantitatively indicate how genuine the found associated sample is.

Next, the histograms of $\mathbf{u}$ and $\mathbf{v}$ are computed as $\mathbf{h}^{u}$ and $\mathbf{h}^{v}$, respectively, as follows,

$\mathbf{h}^{u} = \operatorname{hist}_{\tilde{M}}\left(\mathbf{u}\right), \qquad \mathbf{h}^{v} = \operatorname{hist}_{N_k}\left(\mathbf{v}\right)$,   (10)

where $\operatorname{hist}_{\tilde{M}}$ and $\operatorname{hist}_{N_k}$ compute the histogram of the input vectors by filling $\tilde{M}$ and $N_k$ bins, respectively. The $i^{\text{th}}$ element of $\mathbf{h}^{u}$ (i.e. $h^{u}_{i}$) shows the number of associations for the $i^{\text{th}}$ predicted target. On the other hand, $h^{v}_{j}$ indicates the number of associations to the $j^{\text{th}}$ measurement.
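The row/column argmins of (8), their values in (9) and the association histograms of (10) can be sketched as follows; the small matrix is a made-up example with three measurements and two predicted targets.

```python
import numpy as np

def associate(E):
    """Row/column argmins (8), their minimum values (9) and histograms (10)."""
    N, M = E.shape
    u = E.argmin(axis=1)               # index of the closest target per measurement
    v = E.argmin(axis=0)               # index of the closest measurement per target
    d_z = E.min(axis=1)                # measurement-to-target closest distances
    d_t = E.min(axis=0)                # target-to-measurement closest distances
    h_u = np.bincount(u, minlength=M)  # number of associations per predicted target
    h_v = np.bincount(v, minlength=N)  # number of associations per measurement
    return u, v, d_z, d_t, h_u, h_v

E = np.array([[1.0, 9.0],
              [8.0, 0.5],
              [7.0, 6.0]])             # N = 3 measurements, M = 2 targets
u, v, d_z, d_t, h_u, h_v = associate(E)
```

Here the third measurement attaches to the second target but with a large distance, which is exactly the situation the genuinity checks of the next paragraph are designed to catch.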
Decay, survival and birth of targets. The target tuples are then updated using the data association approach explained as pseudo code in Algorithm 1. Basically, one of the following three hypotheses (cases in Algorithm 1) is assigned to each filtered target: Case (1), decaying status, which indicates the target has no association; Case (2), survival status, for the targets with at least one measurement association; Case (3), birth status, for those (isolated) measurements without any association.

For Case (1), shown in Fig. 2a, the freeze state $f^{(i)}_k$ is set to one, meaning that the association step failed to find a measurement sample for the current target (Fig. 2a, top), possibly due to occlusion, measurement failure, or due to the fact that the target itself is a false positive (Fig. 2a, bottom, where the associated measurement is far from the target). In the absence of an associated measurement sample, the predicted target state $\tilde{\mathbf{x}}^{(i)}$ is appended to $\mathbf{X}^{(i)}_{k-1}$ to create the new state matrix. For Case (2), illustrated in Fig. 2b, the freeze state of the target is set to zero, as the target is associated with at least one measurement ($h^{u}_{i} > 0$). Its target state matrix is updated by appending the associated measurement vector $\mathbf{z}^{(j)}_k$. For both cases, to optimise memory allocation we define a maximum batch size. If the number of rows in $\mathbf{X}^{(i)}_k$ (i.e. $n^{(i)}_k$) is greater than the maximum assigned batch size, the first row of $\mathbf{X}^{(i)}_k$, which corresponds to the oldest saved prediction or measurement, is removed. The assigned target tuples form the updated RFS $\mathcal{T}^{u}_{k}$, as explained in Algorithm 1.

In parallel with the above two procedures, the third case (Fig. 2c) is evaluated to determine the birth of targets. For a measurement with no target association (Fig. 2c, top) or an isolated measurement lying far from every predicted target (Fig. 2c, bottom), a new target tuple is assigned. Concatenating all of these tuples forms the target birth RFS $\mathcal{T}^{b}_{k}$. The target RFS at the $k^{\text{th}}$ time step is calculated as the union of births and survivals, i.e. $\mathcal{T}_k = \mathcal{T}^{u}_{k} \cup \mathcal{T}^{b}_{k}$, which has $M_k$ cardinality.
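The three cases can be sketched as follows. This is a simplified illustration of Algorithm 1 rather than a reproduction of it: targets are kept as plain lists of state vectors, ages and freeze states are omitted, and `g_max` stands in for a hypothetical genuinity threshold for isolated measurements.

```python
import numpy as np

def update_targets(targets, preds, Z, E, g_max=5.0):
    """Sketch of the three hypotheses: decay, survival and birth.

    targets: list of state histories (lists of vectors); preds: predicted states;
    Z: measurements; E: (N, M) targetness error matrix; g_max: hypothetical threshold.
    """
    N, M = E.shape
    u = E.argmin(axis=1)               # closest target per measurement, eq. (8)
    h_u = np.bincount(u, minlength=M)  # associations per target, eq. (10)
    d_z = E.min(axis=1)                # closest-target distances, eq. (9)
    updated = []
    for i, hist in enumerate(targets):
        if h_u[i] == 0:                # Case 1: decay -- append the prediction
            updated.append(hist + [preds[i]])
        else:                          # Case 2: survival -- append closest measurement
            j = int(E[:, i].argmin())
            updated.append(hist + [Z[j]])
    # Case 3: birth from measurements lying far from every predicted target
    births = [[Z[j]] for j in range(N) if d_z[j] > g_max]
    return updated + births            # union of survivals and births

preds = np.array([[0.0, 0.0], [10.0, 10.0]])
Z = np.array([[0.0, 1.0], [9.0, 10.0], [50.0, 50.0]])
E = np.linalg.norm(Z[:, None, :] - preds[None, :, :], axis=2)
tracks = update_targets([[preds[0]], [preds[1]]], preds, Z, E)
```

Both existing targets survive with an associated measurement, while the isolated measurement at (50, 50) spawns a birth, so three tracks remain.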
On the data association algorithm complexity. The complexity of similar assignment methods, such as the Hungarian matching derivations, can reach $O(n^3)$ [29]. Also, assuming no measurement gating is performed, the computational complexity of the GM-PHD filter grows with the product of the number of mixture components and measurements [16]. On the other hand, the proposed data association has $O(\tilde{M} N_k)$ complexity, while implementing the two loops in Algorithm 1 in parallel can reduce the complexity to $O(\max(\tilde{M}, N_k))$, i.e. linear in the number of targets and measurements.
V Experimental results
V-A Datasets and evaluation metric
The proposed MTF algorithm is evaluated over two data sequences: (1) a controlled MTF simulation introduced by Vo and Ma [14, 15]; (2) a bird's-eye view representation of the targets in the DukeMTMC dataset. We compute the Optimal Sub-Pattern Assignment (OSPA, [30]) distance to quantitatively evaluate the proposed algorithm. The OSPA error consists of two terms: one is related to the difference in the cardinality of the compared sets (cardinality (Card) error); the other relates to the localisation (Loc) error, which is the smallest pairwise distance among all the elements in the two sets. In our work, we have used the Hungarian assignment to compute this minimal distance. OSPA has been widely used for evaluating the accuracy of point target filtering algorithms [31, 7]. The overall pipeline is implemented (end-to-end) in Python 2.7, and all the experiments are tested using an NVIDIA GeForce GTX 1080 GPU. We have used a 3-layer LSTM network ($L = 3$), each layer having 20 hidden units, outputting a fully-connected layer, with $\phi$ as an identity function. The network is trained online over the currently updated patch for each target, minimising the mean square error as the loss function and using the Adam optimisation method. The training procedure is performed with a 0.001 learning rate and the default Adam optimiser parameters. The OSPA cut-off and order parameters are chosen as in [30].
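For reference, the OSPA distance used throughout this section can be sketched as follows for order $p = 1$; the brute-force optimal assignment below stands in for the Hungarian algorithm used in our pipeline and is only suitable for small sets, and the default cut-off value is an illustrative placeholder.

```python
import itertools
import numpy as np

def ospa(X, Y, c=100.0):
    """OSPA distance of order p = 1 between two finite point sets [30].

    Returns (overall, Loc term, Card term). The optimal assignment is found
    by brute force, adequate only for small sets.
    """
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0, 0.0, 0.0
    if m > n:                           # convention: |X| <= |Y|
        X, Y, m, n = Y, X, n, m
    if m == 0:
        return c, 0.0, c                # pure cardinality error
    D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2), c)
    loc_sum = min(sum(D[i, j] for i, j in enumerate(perm))
                  for perm in itertools.permutations(range(n), m))
    loc = loc_sum / n                   # localisation (Loc) term
    card = c * (n - m) / n              # cardinality (Card) term
    return loc + card, loc, card

total, loc, card = ospa(np.array([[0.0, 0.0], [5.0, 5.0]]),
                        np.array([[0.0, 1.0], [5.0, 5.0]]))
```

With equal cardinalities the Card term vanishes and the overall error reduces to the averaged optimal pairwise distance, as in the example above.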
The dominant per-target cost is the process of initialising the LSTM by allocating GPU memory via TensorFlow, training it as a regression block and predicting the output sample, while fine-tuning the LSTM network is considerably cheaper (per target). To be more specific, as described in Section IV-A, at every time step we transfer the weights and biases ($\Omega$) learned from the motion trajectories of other targets to the current one and only fine-tune its weights and biases using a smaller number of epochs. This significantly reduces the per-target computation time, since the full procedure (initialising the LSTM, allocating the GPU memory, training as a regression block using 50 epochs and predicting the output sample) is replaced by a fine-tuning step with only 20 epochs (please see our supplementary material for video samples).

V-B Results on synthetic data
In this scenario, there are 10 targets appearing in the scene at different times, having various birth times and lifespans (Fig. 3). The measurements are obtained by computing the range and bearing (azimuth) of a target from the origin. They also contain clutter with uniform distribution along range and azimuth, with a random intensity sampled from a Poisson distribution with mean $\lambda$. The obtained measurements are degraded by Gaussian noise with zero mean and standard deviations $\sigma_r$ (unit distance) and $\sigma_\theta$ (rad) along the range and bearing directions, respectively. The problem is to perform online MTF to recover true positives from clutter. In our first experiment, we compute the OSPA error for a fixed clutter intensity.

Algorithm  OSPA Card  OSPA Loc  OSPA

PHD-EKF
PHD-SMC
PHD-UKF
CPHD-EKF
CPHD-SMC
CPHD-UKF
LMB-EKF
LMB-SMC
LMB-UKF
GLMB-EKF
GLMB-SMC
GLMB-UKF
OLMM  17.26
In Table I, we report the average overall OSPA and its two terms related to the cardinality error (OSPA Card) and the optimal Hungarian distance (OSPA Loc). We compare our method with the PHD, CPHD, LMB, and GLMB algorithms, when EKF, SMC, and UKF are used as the basis for the prediction and update steps (the following Matlab implementation of these algorithms is used: http://ba-tuong.vo-au.com/codes.html). Our method outperforms all the other algorithms in terms of overall OSPA. In particular, this is due to a significant drop in the Loc error, while the cardinality error is comparable with most of the others.
The resulting trajectories of this experiment for our method are illustrated in Fig. 3. The red dots represent the predicted location of the targets at every time step, filtered out from the measurement clutter (black dots). They almost overlap with the ground truth (green dots), except for very few (only three) false positives (predicted but no ground truth) and false negatives (ground truth but no prediction). The target and clutter data projections onto the horizontal and vertical axes at each time step are also plotted in Fig. 6.
Moreover, in Fig. 5 we show the overall OSPA at every time step. During the initial time steps, our OSPA error is higher. This is mostly due to the under-fitting of the LSTM model because of the lack of data. However, after the early time steps our OSPA error becomes significantly lower than that of the other approaches, with a lower overall average. The impulsive peaks correspond to those time steps when births of targets occur, at which the Card error suddenly increases. In order to show the robustness of our algorithm for higher clutter densities, in the second experiment we increase the clutter intensity $\lambda$ and find the average OSPA over all time steps. Figure 4 shows the results of this experiment. Our filtering algorithm provides a relatively constant and comparably lower overall OSPA error even when the clutter intensity is increased to 50. Both of the SMC-based algorithms (GLMB-SMC and LMB-SMC) generate the highest OSPA error, which can be due to particle filter divergence. On the other hand, the lower OSPA errors generated by the LMB with an EKF model show how successfully this particular simulated scenario can be modelled using such a nonlinear filter. It should be mentioned, however, that our method does not rely on any prior motion model, while still being capable of learning the nonlinearity within the data sequence.
V-C Results on the Duke dataset
DukeMTMC is a pedestrian tracking dataset, captured using 8 synchronised cameras [32]. In our experiments, we use its 177840 frames, for which the ground truth is provided. In order to evaluate our point target filtering algorithm, we map the coordinates of the bottom centre of each bounding box to an aerial perspective. Each of these points is first mapped from the image plane to the world coordinate system, and then to the aerial map (the bird's-eye view map). We repeat the same procedure over the provided OpenPose [33] detection results (used as measurement sets in our experiments) to obtain their corresponding aerial view representation.
The computed trajectories of this experiment for our method are shown in Fig. 7. The red dots represent the target locations given by OLMM at every time step, filtering the highly cluttered measurements (black dots). The filtered point targets almost entirely overlap with the ground truth (green dots), except for some false positive targets. As the camera locations are fixed, such targets are mainly caused by persistent false detections.
The MTF performance of several algorithms is quantitatively illustrated in Table II in terms of their OSPA errors. For each algorithm, the OSPA errors are calculated for each data frame and then averaged over the whole 177840-frame sequence. OLMM outperforms all the other algorithms (PHD, CPHD, LMB, and GLMB, when EKF, SMC, and UKF are used as the basis for the prediction and update steps). OLMM generates the third lowest cardinality error, while simultaneously maintaining a low OSPA Loc error, resulting in the lowest overall OSPA.
Algorithm  OSPA Card  OSPA Loc  OSPA
PHD-EKF
PHD-SMC
PHD-UKF
CPHD-EKF
CPHD-SMC
CPHD-UKF
LMB-EKF
LMB-SMC
LMB-UKF
GLMB-EKF
GLMB-SMC
GLMB-UKF
OLMM  56.61
VI Conclusions
This paper proposes an MTF algorithm which learns the motion models, on the fly, using an RNN with an LSTM architecture, as a regression problem. The target state predictions are then corrected using a novel data association algorithm, with a low computational complexity. The proposed algorithm is evaluated over synthetic and real point target filtering scenarios, demonstrating a remarkable performance over highly cluttered data sequences.
The proposed OLMM algorithm can be applied to various applications where point targets are obtained by the detectors. Some examples can be target tracking from satellite images, keypoint filtering for 3D scene mapping, radar point scatterer detection and tracking, and LiDAR signal processing. Also, as the proposed approach does not assign limits over the state space dimensionality, in addition to point target filtering, the algorithm’s potential to filter extended targets can be investigated.
References
 [1] S. Li, W. Yi, R. Hoseinnezhad, B. Wang, and L. Kong, “Multiobject tracking for generic observation model using labeled random finite sets,” IEEE Transactions on Signal Processing, vol. 66, no. 2, pp. 368–383, 2018.
 [2] Y. G. Punchihewa, B. Vo, B. Vo, and D. Y. Kim, “Multiple object tracking in unknown backgrounds with labeled random finite sets,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 3040–3055, 2018.
 [3] A. Roy and D. Mitra, "Multi-target trackers using cubature Kalman filter for Doppler radar tracking in clutter," IET Signal Processing, vol. 10, pp. 888–901(13), 2016.
 [4] G. Y. Kulikov and M. V. Kulikova, “The accurate continuousdiscrete extended Kalman filter for radar tracking,” IEEE Transactions on Signal Processing, vol. 64, no. 4, pp. 948–958, 2016.
 [5] C. Evers and P. A. Naylor, "Optimized self-localization for SLAM in dynamic scenes using probability hypothesis density filters," IEEE Transactions on Signal Processing, vol. 66, no. 4, pp. 863–878, 2018.
 [6] K. Y. K. Leung, F. Inostroza, and M. Adams, “Relating random vector and random finite set estimation in navigation, mapping, and tracking,” IEEE Transactions on Signal Processing, vol. 65, no. 17, pp. 4609–4623, 2017.
 [7] C. Fantacci, B. N. Vo, B. T. Vo, G. Battistelli, and L. Chisci, “Robust fusion for multisensor multiobject tracking,” IEEE Signal Processing Letters, vol. 25, no. 5, pp. 640–644, 2018.
 [8] S. Li, W. Yi, R. Hoseinnezhad, G. Battistelli, B. Wang, and L. Kong, “Robust distributed fusion with labeled random finite sets,” IEEE Transactions on Signal Processing, vol. 66, no. 2, pp. 278–293, 2018.
 [9] Z. Xing and Y. Xia, “Comparison of centralised scaled unscented Kalman filter and extended Kalman filter for multisensor data fusion architectures,” IET Signal Processing, vol. 10, pp. 359–365(6), 2016.
 [10] L. Yan, L. Jiang, J. Liu, Y. Xia, and M. Fu, “Optimal distributed Kalman filtering fusion for multirate multisensor dynamic systems with correlated noise and unreliable measurements,” IET Signal Processing, vol. 12, pp. 522–531(9), 2018.
 [11] M. Roth, G. Hendeby, and F. Gustafsson, “EKF/UKF maneuvering target tracking using coordinated turn models with polar/Cartesian velocity,” in 17th International Conference on Information Fusion (FUSION). IEEE, 2014, pp. 1–8.
 [12] X. R. Li and V. P. Jilkov, “Survey of maneuvering target tracking: dynamic models,” in Signal and Data Processing of Small Targets 2000, vol. 4048. International Society for Optics and Photonics, 2000, pp. 212–236.
 [13] G. Zhai, H. Meng, and X. Wang, “A constant speed changing rate and constant turn rate model for maneuvering target tracking,” Sensors, vol. 14, no. 3, pp. 5239–5253, 2014.
 [14] S. Reuter, B. T. Vo, B. N. Vo, and K. Dietmayer, "The labeled multi-Bernoulli filter," IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3246–3260, 2014.
 [15] B. N. Vo, B. T. Vo, and D. Phung, "Labeled random finite sets and the Bayes multi-target tracking filter," IEEE Transactions on Signal Processing, vol. 62, no. 24, pp. 6554–6567, 2014.
 [16] B. N. Vo and W. K. Ma, “The Gaussian mixture probability hypothesis density filter,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4091–4104, 2006.
 [17] A. Milan, S. Rezatofighi, A. Dick, I. Reid, and K. Schindler, "Online multi-target tracking using recurrent neural networks," in AAAI, 2017.
 [18] M. Emambakhsh, A. Bay, and E. Vazquez, "Deep recurrent neural network for multi-target filtering," in International Conference on MultiMedia Modeling (MMM), 2019, pp. 519–531.
 [19] R. P. S. Mahler, "Multitarget Bayes filtering via first-order multitarget moments," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1152–1178, 2003.
 [20] B. N. Vo, S. Singh, and A. Doucet, “Sequential Monte Carlo methods for multitarget filtering with random finite sets,” IEEE Transactions on Aerospace and Electronic Systems, vol. 41, no. 4, pp. 1224–1245, 2005.
 [21] R. Mahler, “PHD filters of higher order in target number,” IEEE Transactions on Aerospace and Electronic Systems, vol. 43, no. 4, pp. 1523–1543, 2007.
 [22] S. Nagappa, E. D. Delande, D. E. Clark, and J. Houssineau, "A tractable forward-backward CPHD smoother," IEEE Transactions on Aerospace and Electronic Systems, vol. 53, no. 1, pp. 201–217, 2017.
 [23] Z. Lu, W. Hu, and T. Kirubarajan, “Labeled random finite sets with moment approximation,” IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3384–3398, 2017.
 [24] D. S. Bryant, E. D. Delande, S. Gehly, J. Houssineau, D. E. Clark, and B. A. Jones, “The CPHD filter with target spawning,” IEEE Transactions on Signal Processing, vol. 65, no. 5, pp. 13 124–13 138, 2017.

 [25] I. Schlangen, E. D. Delande, J. Houssineau, and D. E. Clark, "A second-order PHD filter with mean and variance in target number," IEEE Transactions on Signal Processing, vol. 66, no. 1, pp. 48–63, 2018.
 [26] Á. F. García-Fernández, J. L. Williams, K. Granström, and L. Svensson, "Poisson multi-Bernoulli mixture filter: Direct derivation and implementation," IEEE Transactions on Aerospace and Electronic Systems, vol. 54, no. 4, pp. 1883–1901, 2018.
 [27] T. Parisini and R. Zoppoli, “Neural networks for nonlinear state estimation,” International Journal of Robust and Nonlinear Control, vol. 4, no. 2, pp. 231–248, 1994.
 [28] A. Bay, S. Lepsoy, and E. Magli, “Stable limit cycles in recurrent neural networks,” in 2016 International Conference on Communications (COMM), 2016, pp. 89–92.
 [29] L. Liu and D. Shell, “Assessing optimal assignment under uncertainty: An intervalbased algorithm,” in Proceedings of Robotics: Science and Systems, 2010.
 [30] D. Schuhmacher, B. T. Vo, and B. N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3447–3457, 2008.
 [31] B. N. Vo, B. T. Vo, and H. G. Hoang, "An efficient implementation of the generalized labeled multi-Bernoulli filter," IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 1975–1987, 2017.

 [32] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
 [33] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR, 2017.