Multi-target filtering consists of automatically excluding clutter from (usually unlabelled) input data sequences. It has numerous applications in denoising spatio-temporal data, object detection and recognition, tracking, and data-, object- and track-level sensor fusion [14, 16, 4]. It is also frequently used in military applications (target recognition), automation pipelines, autonomous vehicles, and localisation and occupancy-grid mapping for robotics (using optical sensors, e.g. stereo cameras or LiDAR, or radar). For a tracking problem in particular, the performance of a multi-target tracking algorithm relies heavily on its filtering step. A robust filtering algorithm is capable of handling occlusions, the probability of detection, the possibility of target birth and spawning, and clutter densities. An accurate filtering algorithm can improve the lifespan of the generated tracklets and the localisation of targets. Target state estimation by a filtering algorithm can be achieved by considering prior motion and measurement models. In a Bayesian filtering framework, motion models are used to predict the location of the target at the next time step, while measurement models map predictions to the measurement space to perform correction (the update step). Due to the high complexity of target motion, however, the use of fixed models may not produce satisfactory outputs and can therefore deteriorate the filtering performance.
Considering this challenge, and inspired by the recurrent neural network tracker of Milan et al. and the random finite sets (RFS) multi-target filtering paradigm [8, 7, 13], in this work we propose a novel algorithm that performs multi-target filtering while simultaneously learning the motion model. To this end, a long short-term memory (LSTM) recurrent neural network (RNN) architecture is defined over sets of target tuples and trained online using the incoming data sequences. The prediction step is performed by applying the trained LSTM network to data patches generated from the targets. This LSTM network (which is then trained using the newly updated states) gradually learns a global, transferable motion model of the detected targets. After obtaining the measurements, we use a novel data association algorithm, compatible with the generated tracklet tuples, to assign survivals, deaths and births of new targets. The updated patches for each target are then used to train the LSTM model in the following time steps. To evaluate our algorithm we have designed a multi-target simulation scenario. During the simulation, we evaluate the filtering robustness by increasing the clutter (false positive) intensity. This work, one of the first to address the multi-target filtering task with recurrent neural networks, shows remarkable potential while outperforming well-known multi-target filtering approaches.
This paper is organised as follows: First, related work in the literature is reviewed in Section 2 and the overall pipeline is briefly explained in Section 3. The incorporation of recurrent neural networks for motion modelling is explained in Section 4. Tracklet tuples and data association are explained in Section 5. Section 6 is dedicated to the experimental results. We conclude the paper in Section 7 and give future research directions.
1.1 Scientific contribution
The proposed algorithm is capable of filtering multiple targets with non-linear motion and non-Gaussian error models. Unlike [11, 16], no prior motion modelling is performed, and the mapping from the state to the observation space is learned from the incoming data sequence. Moreover, unlike the RNN tracker of Milan et al., the proposed algorithm is trained online and does not rely on a separate training phase. Since the predicted targets are concatenated over time within the target tuples, a higher Markov order is preserved, enabling longer-term target state memorisation. Compared with RFS multi-target filtering algorithms [11, 16], the proposed algorithm has significantly fewer hyper-parameters. For example, GM-PHD requires hyper-parameters to perform pruning, merging and truncation of the output density function, in addition to the clutter distribution and the survival and detection probabilities. To the best of our knowledge, this is one of the first papers addressing the multi-target filtering task with recurrent neural networks, particularly without the need to pre-train the network.
2 Related work
Model-based approaches: In a Bayesian formulation of multi-target filtering, the goal is to estimate the (hidden) target state $x_k$ at the $k$-th time step from the set of observations $z_{1:k}$ up to that step, i.e.,

$$p(x_k \mid z_{1:k}) \propto g_k(z_k \mid x_k) \int f_{k|k-1}(x_k \mid x_{k-1})\, p(x_{k-1} \mid z_{1:k-1})\, dx_{k-1}, \qquad (1)$$

where $g_k(z_k \mid x_k)$ and $f_{k|k-1}(x_k \mid x_{k-1})$ are the likelihood and transition densities, respectively. From (1), it is clear that this is a recursive problem. The state estimate at the $k$-th iteration is usually obtained by the Maximum A Posteriori (MAP) criterion. The Kalman filter is arguably the most popular online filtering approach. It assumes linear motion models with Gaussian distributions for both the prediction and update steps. Using Taylor series expansion and deterministic approximations of non-Gaussian distributions, non-linearity and non-Gaussian behaviour are addressed by the Extended and Unscented Kalman Filters (EKF, UKF), respectively. Using the importance sampling principle, particle filters can also be used to estimate the likelihood and posterior densities. Particle filters are among the most widely used multi-target filtering algorithms capable of addressing non-linearity and non-Gaussian motion [17, 14].
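As a concrete reference point for the linear-Gaussian case described above, the Kalman prediction and update steps can be sketched in a few lines; the constant-velocity model and noise values below are illustrative assumptions, not the models used later in the paper.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Prediction: propagate the state mean and covariance through the motion model."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update: correct the prediction with the measurement z."""
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(len(x)) - K @ H) @ P_pred
    return x, P

# 1D constant-velocity example (illustrative values)
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state: [position, velocity]
H = np.array([[1.0, 0.0]])               # only the position is observed
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.zeros(2), np.eye(2)
x, P = kf_predict(x, P, F, Q)
x, P = kf_update(x, P, np.array([1.0]), H, R)
```

With these values, the update pulls the predicted position most of the way towards the measurement, since the prior covariance dominates the measurement noise.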
Mahler proposed the random finite sets (RFS) formulation for multi-target filtering. RFS provides an encapsulated formulation for multi-target filtering, incorporating the clutter density and the probabilities of detection, survival and birth of targets [14, 8, 7]. To this end, targets and measurements are assumed to form sets with random, variable cardinalities. Using Finite Set Statistics, the posterior distribution in (1) can be extended from vectors to RFS as follows,

$$p(X_k \mid Z_{1:k}) \propto g_k(Z_k \mid X_k) \int f_{k|k-1}(X_k \mid X_{k-1})\, p(X_{k-1} \mid Z_{1:k-1})\, \mu(dX_{k-1}), \qquad (2)$$

where $Z_k$ and $X_k$ are the measurement (containing both clutter and true positives) and target RFS, respectively, and $\mu$ is an appropriate reference measure. One approach to represent targets is to use Probability Hypothesis Density (PHD) maps [14, 13]. These maps have two basic features: 1) their peaks correspond to the locations of targets; 2) their integral gives the expected number of targets at each time step. Vo and Ma proposed the Gaussian Mixture PHD (GM-PHD) filter, which propagates the first-order statistical moments to estimate the posterior in (2) as a mixture of Gaussians.
Non-model based approaches: While GM-PHD is based on Gaussian distributions, a particle filter-based solution is provided by the Sequential Monte Carlo PHD (SMC-PHD) filter to address non-Gaussian distributions. Since a large number of particles must be propagated in SMC-PHD, the computational complexity can be high, and hence gating may be necessary.
The Cardinalised PHD (CPHD) filter was proposed by Mahler to also propagate the cardinality of the targets over time, while its intractability is addressed in later work. The Labelled Multi-Bernoulli (LMB) filter performs track-to-track association and outperforms previous algorithms in the sense of not relying on a high signal-to-noise ratio (SNR). Vo et al. proposed the Generalized Labelled Multi-Bernoulli (GLMB) filter as a labelled multi-target filtering approach.
RNNs and LSTM networks: RNNs are neural networks with feedback loops, through which past information can be stored and exploited. They offer promising solutions to difficult tasks such as system identification, prediction, pattern classification, and stochastic sequence modelling. Unfortunately, RNNs are known to be particularly hard to train, especially when long temporal dependencies are involved, due to the so-called vanishing gradient phenomenon. Many attempts have been made to address this problem, from choosing an appropriate initial configuration of the weights to exploiting orthogonality in the hidden-to-hidden weight matrix. Architectural modifications have also been proposed to sidestep this problem through gating mechanisms, which enhance the memory of the network.
The latter case includes LSTMs, which is the architecture we consider in this paper as well. Formally, an LSTM is defined by four gates (input, candidate, forget and output, respectively), i.e.

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$

for each time step $t$, where $\sigma$ and $\tanh$ represent element-wise non-linear activation functions (with $\sigma$ the logistic sigmoid), $W_\ast$, $U_\ast$ and $b_\ast$ are the learnable weight matrices and bias vectors, $x_t$ is the input, and $h_{t-1}$ is the hidden state at the previous time step. These gates are then combined to update the memory cell unit $c_t$ and compute the new hidden state $h_t$ as follows,

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t),$$

where $\odot$ represents the element-wise product. Finally, the hidden state is mapped through a fully-connected layer to estimate the predicted output $y_t$,

$$y_t = \phi(W_y h_t + b_y),$$

where, similarly to the RNN output equation, $\phi$ is the element-wise output function and $W_y$ and $b_y$ are the learnable weight matrix and bias vector, respectively.
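The gate equations above can be checked with a minimal NumPy forward pass; the weights are random illustrative values (in this paper they are instead trained online), and the 20 hidden units match the configuration used in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by gate name:
    'i' input, 'g' candidate, 'f' forget, 'o' output."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    c = f * c_prev + i * g          # memory cell update (element-wise products)
    h = o * np.tanh(c)              # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 20                   # 20 hidden units, as in the experiments
W = {k: 0.1 * rng.standard_normal((d_h, d_in)) for k in 'igfo'}
U = {k: 0.1 * rng.standard_normal((d_h, d_h)) for k in 'igfo'}
b = {k: np.zeros(d_h) for k in 'igfo'}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.standard_normal(d_in), h, c, W, U, b)
```

Since $h_t = o_t \odot \tanh(c_t)$ with both factors bounded by one in magnitude, every component of the hidden state stays in $(-1, 1)$.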
3 Overall pipeline
The overall pipeline is shown in Fig. 1 as a block diagram. First, the predicted locations of the target tuples from the previous time step are computed. Then, given the current measurements, a set of "residuals" is calculated for each target. These residuals are then used to perform filtering (rejecting the false positives and obtaining survivals) and birth assignment. The union of the resulting birth and survival tuple sets is finally used as the set of targets for the next iteration. In the following sections, each of these steps is detailed.
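The loop described above can be summarised in a 1D toy sketch. A simple linear extrapolation stands in for the trained LSTM of Section 4, a hypothetical distance gate replaces the data association of Section 5, and maturity counters and genuinity errors are omitted.

```python
def run_filter_step(targets, measurements, predict, gate=1.0):
    """One pipeline iteration (toy 1D sketch): predict, compute residuals,
    keep/update survivals, and spawn births from unexplained measurements."""
    predictions = [predict(t) for t in targets]
    survivals, used = [], set()
    for t, p in zip(targets, predictions):
        # residuals: distance from this prediction to every measurement
        dists = [(abs(p - z), j) for j, z in enumerate(measurements)]
        d, j = min(dists) if dists else (float('inf'), None)
        if d <= gate:
            survivals.append(t + [measurements[j]])   # update patch with measurement
            used.add(j)
        else:
            survivals.append(t + [p])                 # no association: append prediction
    # births: measurements no prediction explains start new target patches
    births = [[z] for j, z in enumerate(measurements) if j not in used]
    return survivals + births

# toy usage: constant-velocity 1D targets, last-difference extrapolation as "LSTM"
predict = lambda patch: 2 * patch[-1] - patch[-2] if len(patch) > 1 else patch[-1]
targets = [[0.0, 1.0], [5.0, 5.5]]
new_targets = run_filter_step(targets, [2.1, 6.0, 9.0], predict)
```

Here both existing targets find a nearby measurement, while the measurement at 9.0 is not explained by any prediction and becomes a birth candidate.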
4 Online motion modelling via LSTM
The target state variations over video frames can be seen as a sequential learning problem. The target tuples are predicted using an LSTM network. After each target is updated with its associated measurement, its training patch is updated and used to re-train the very same LSTM model. As a result of this recursive process, motion modelling of the incoming data is performed. Using this approach, a non-linear/non-Gaussian input is learned without incorporating any prior knowledge about the motion. In our work, we first investigated assigning a separate LSTM network to each target, trained over that target's predictions from the previous time steps in order to predict its next position. The main issue with this approach is (GPU) memory management and re-usability: as the number of targets increases, it becomes infeasible to release the (GPU) memory allocated to absent targets and re-allocate it to new targets.
One solution is to define a single LSTM network as a graph, whose nodes are simply "placeholder" pointer variables. Every time a target is present in the scene, the placeholder nodes are updated with the weights and biases corresponding to this target. Once the prediction is performed, the memory is released. Such memory allocation enables learning a global target motion over the video sequence, which can be useful in crowd behaviour detection, but provides poor results when analysing the targets separately.
Thus, as a solution we propose online LSTM training, in which each target shares the same LSTM weights and biases with the other targets. During the online training step, these weights and biases are shared across targets and fine-tuned based on the past measurements. In other words, we fine-tune the LSTM weights and biases for each target by transferring the learned weights and biases from the other targets. This allows us to save memory storage even further, to gradually reduce the number of re-training epochs, and to obtain predictions that are mainly driven by the recent information for each specific target. In the following sections, we describe the filtering pipeline and show how the data is prepared for the LSTM network's training and prediction steps.
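The sharing scheme can be illustrated with a toy stand-in: a single linear weight replaces the LSTM parameters, and `SharedLSTM` and its methods are hypothetical names, used only to show one global parameter set being fine-tuned in turn on each target's patch.

```python
import numpy as np

class SharedLSTM:
    """Toy stand-in for the shared-weight LSTM: a single parameter vector
    shared by ALL targets, fine-tuned per target by gradient steps on a
    one-step-ahead linear predictor (illustrative, not an actual LSTM)."""
    def __init__(self, dim):
        self.w = np.zeros(dim)              # weights shared across targets

    def fine_tune(self, patch, lr=0.01, epochs=50):
        X, y = patch[:-1], patch[1:]        # predict the next state from the current one
        for _ in range(epochs):
            grad = 2 * X * (self.w * X - y)  # gradient of the mean-square error
            self.w -= lr * grad.mean()

    def predict(self, patch):
        return self.w * patch[-1]

model = SharedLSTM(1)
# two "targets", both following the same doubling motion y = 2x
for patch in [np.array([1.0, 2.0, 4.0]), np.array([3.0, 6.0, 12.0])]:
    model.fine_tune(patch)                  # each target refines the SAME weights
    _ = model.predict(patch)
```

Because both targets follow the same underlying motion, the shared weight converges to the common model (here, a factor of 2), and each target's fine-tuning starts from what the others have already learned.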
5 Tracklet tuples and data association
We define a tracklet tuple as a member of the target RFS at time step $k$, containing the following four components: a target state matrix over a $d$-dimensional space, whose rows are the previous predictions appended for this target and which is used as the training patch for the LSTM network; an integer indicating the target maturity up to the current iteration; a real positive number containing the target genuinity error; and a binary freeze state variable, which is 0 when an observed measurement is used to update the state of the target, and 1 when the update is performed without any measurement, using the target's past states only. The target and measurement RFS at time $k$ are random finite sets with variable cardinalities, containing all the target tuples and all the measurements, respectively.
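A direct way to hold these four components is a small record type; the field names below are illustrative, not the paper's notation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TrackletTuple:
    """One target tuple: state patch, maturity, genuinity error, freeze flag."""
    states: np.ndarray        # M x d matrix of past states (LSTM training patch)
    maturity: int = 0         # integer target maturity
    genuinity: float = 0.0    # real positive target genuinity error
    freeze: int = 0           # 0: updated from a measurement; 1: from prediction only

# a target with two saved 2D states
t = TrackletTuple(states=np.array([[0.0, 0.0], [1.0, 0.5]]))
```

The target RFS at each time step is then simply a variable-length collection of such tuples.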
5.1 Filtering and birth assignment
The LSTM architecture explained in Section 4 is used to predict each target's next state. This is performed by first sequentially training the LSTM network on the samples in the tuple's state matrix. Then the sample in the last row of the state matrix is given to the trained network to predict the next state, which is appended to the input tuple to create the predicted tuple. The resulting set of predicted tuples is hence an RFS with the same cardinality as the target RFS. At the next time step, a set of residuals is calculated using the obtained measurement RFS. Assuming no gating is performed, there is one residual per (predicted target, measurement) pair, each stored with the following structure,
in which the targetness error parameter is computed from the residual; its value shows how close the predicted target is to the current measurement.
The residuals are used to perform the filtering step, in which the survival of targets is determined, new births are assigned, and false positive targets and measurements are removed. To do this, a targetness matrix is first constructed, whose element at the $i$-th row and $j$-th column is the second ($\ell_2$) norm between the $i$-th measurement and the $j$-th predicted target. In the next section, we detail how this matrix is used to perform data association.
5.1.1 Data association
For each row and column of the targetness matrix, the associated target and measurement indexes are computed, respectively: the row-wise minima give, for each measurement, the index of its closest predicted target, while the column-wise minima give, for each predicted target, the index of its closest measurement. In addition, the minimum value of each row and column is stored, giving the measurement and target genuinity errors, respectively; in other words, the genuinity error of a measurement (target) is its distance to the closest predicted target (measurement). In the next step, the histograms of both index vectors are computed. Each element of the target-index histogram shows how many measurements are associated with the corresponding predicted target, while each element of the measurement-index histogram indicates the number of targets associated with the corresponding measurement. The states of the targets are then updated using the data association approach given as pseudo-code in Algorithm 1.
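Under the convention that rows of the targetness matrix index measurements and columns index predicted targets, the quantities above can be sketched as follows (a toy 2D example; thresholds and Algorithm 1's case analysis are omitted):

```python
import numpy as np

def associate(measurements, predictions):
    """D[i, j] is the L2 distance between the i-th measurement and the j-th
    predicted target; argmins give the associated indexes, mins the genuinity
    errors, and bincounts the association histograms."""
    D = np.linalg.norm(measurements[:, None, :] - predictions[None, :, :], axis=2)
    target_idx = D.argmin(axis=1)    # closest target for each measurement (row-wise)
    meas_idx = D.argmin(axis=0)      # closest measurement for each target (column-wise)
    meas_err = D.min(axis=1)         # measurement genuinity errors
    target_err = D.min(axis=0)       # target genuinity errors
    target_hist = np.bincount(target_idx, minlength=D.shape[1])  # associations per target
    meas_hist = np.bincount(meas_idx, minlength=D.shape[0])      # associations per measurement
    return D, meas_idx, target_idx, meas_err, target_err, meas_hist, target_hist

Z = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]])   # 3 measurements
X = np.array([[0.1, 0.0], [5.2, 5.0]])               # 2 predicted targets
D, meas_idx, target_idx, meas_err, target_err, meas_hist, target_hist = associate(Z, X)
```

In this example the third measurement receives zero associations in the measurement histogram, marking it as a clutter or birth candidate for Algorithm 1 to resolve.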
5.2 Update survivals and assign births
The survived targets form an RFS. A set member has the same structure as the tracklet tuple in (8), with the only difference that its state matrix, maturity, genuinity error and freeze state are updated according to the cases explained in Algorithm 1. During the update stage (shown in the block diagram of Fig. 1), if the freeze state of a target is zero, meaning that a measurement has successfully been associated with the target, its state matrix is updated by appending the associated measurement. On the other hand, if the freeze state is one, meaning that the association step failed to find a measurement for the current target (possibly due to occlusion, a measurement failure, or the target itself being a false positive), the predicted target state is appended instead to create the new state matrix. In both cases, to optimise memory allocation we define a maximum batch size: if the number of rows in the state matrix exceeds the batch size, the first row, which corresponds to the oldest saved prediction or measurement, is removed. Using the updated target states, an RFS is generated for the survived targets. Each of its members is a tuple having the same structure as (7), with states updated according to the data association.
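The append-and-truncate update of the state matrix can be sketched as follows; the batch size value is an assumed placeholder.

```python
import numpy as np

def update_patch(states, new_row, batch_size=10):
    """Append the associated measurement (freeze = 0) or the prediction
    (freeze = 1) to the target's state matrix, dropping the oldest row
    once the batch size is exceeded (batch_size is an assumed value)."""
    states = np.vstack([states, new_row])
    if states.shape[0] > batch_size:
        states = states[1:]       # remove the oldest saved prediction/measurement
    return states

patch = np.zeros((10, 2))                          # already at the batch size
patch = update_patch(patch, np.array([1.0, 2.0]))  # oldest row is dropped
```

The truncation keeps the training patch bounded, so per-target fine-tuning cost stays constant as the track lengthens.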
In parallel with the above procedure, a target tuple is assigned to each birth vector. The target tuple set at the next time step is then calculated as the union of births and survivals, whose cardinality is the sum of the two.
6 Experimental results
In this section we present the experimental results of our method in a controlled simulation on synthetic multi-target data. To quantitatively evaluate the proposed algorithm, we compute the Optimal Sub-Pattern Assignment (OSPA) distance, an improved version of the Optimal Mass Transfer (OMAT) metric. Assuming two sets $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ with $m \le n$, the OSPA distance of order $p$ and cut-off $c$ is defined as,

$$\bar{d}_p^{(c)}(X, Y) = \left( \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d^{(c)}\big(x_i, y_{\pi(i)}\big)^p + c^p (n - m) \right) \right)^{1/p}, \qquad (12)$$

where $d^{(c)}(x, y) = \min\big(c, \lVert x - y \rVert\big)$ and $\Pi_n$ is the set of permutations of $\{1, \ldots, n\}$,
where $m$ and $n$ are the number of elements in $X$ and $Y$, respectively. It essentially consists of two terms: one related to the difference in the number of elements in the two sets (cardinality error), and the other related to the localisation (Loc), which is the smallest pair-wise distance among all the elements in the two sets. In our work, we have used the Hungarian assignment to compute this minimal distance. Following the original OSPA work, we fix the order $p$ and cut-off $c$. OSPA has been widely used for evaluating the accuracy of filtering and tracking algorithms [15, 4]. The overall pipeline is implemented (end-to-end) in Python 2.7, and all the experiments are run using an NVIDIA GeForce GTX 1080 GPU. We have used a 3-layer LSTM network, each layer having 20 hidden units, followed by a fully-connected output layer (6) whose output function is the identity. The network is trained online at each time step for 50 epochs over the currently updated patch of each target, minimising the mean square error loss and using the Adam optimisation method.
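A minimal implementation of the OSPA distance in (12), following Schuhmacher et al. and using SciPy's Hungarian solver, returning the overall distance together with its Loc and Card terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, p=1.0, c=100.0):
    """OSPA distance of order p with cut-off c between two point sets (rows).
    Returns (overall, localisation term, cardinality term)."""
    m, n = len(X), len(Y)
    if m > n:
        X, Y, m, n = Y, X, n, m          # convention: |X| <= |Y|
    if n == 0:
        return 0.0, 0.0, 0.0             # both sets empty
    loc_sum = 0.0
    if m > 0:
        # cut-off base distance d^(c)(x, y) = min(c, ||x - y||)
        D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2), c)
        rows, cols = linear_sum_assignment(D ** p)   # optimal (Hungarian) assignment
        loc_sum = float((D[rows, cols] ** p).sum())
    card_sum = (c ** p) * (n - m)
    overall = ((loc_sum + card_sum) / n) ** (1.0 / p)
    return overall, (loc_sum / n) ** (1.0 / p), (card_sum / n) ** (1.0 / p)

ground_truth = np.array([[0.0, 0.0], [10.0, 10.0]])
estimate = np.array([[0.0, 0.0]])        # one target missed entirely
overall, loc, card = ospa(estimate, ground_truth)
```

A missed target contributes the full cut-off penalty to the cardinality term, while a perfectly localised one contributes nothing to Loc; for $p = 1$ the overall distance is simply the sum of the two terms.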
In order to evaluate our algorithm over different filtering problems, such as occlusions, birth and death of targets, non-linear motion and spawning, we have used the multi-target simulation introduced by Vo and Ma [11, 16]. In this scenario, there are 10 targets appearing in the scene with various birth times and lifespans. Each measurement is obtained by computing the range and bearing (azimuth) of a target from the origin. The measurement set also contains clutter, distributed uniformly over range and azimuth, with a random intensity sampled from a Poisson distribution with a fixed mean. The obtained measurements are degraded by zero-mean Gaussian noise on the range (unit distance) and azimuth (rad), respectively. The problem is to perform online multi-target filtering to recover the true positives from the clutter. In our first experiment, we compute the OSPA error assuming a fixed clutter intensity.
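A sketch of such a range/bearing measurement model with uniform Poisson clutter; all numeric parameters below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_measurements(positions, lam=10, sigma_r=1.0, sigma_b=0.01, r_max=1000.0):
    """Range/bearing measurements from 2D target positions, degraded by
    zero-mean Gaussian noise, plus uniform clutter with Poisson intensity."""
    Z = []
    for x, y in positions:
        r = np.hypot(x, y) + sigma_r * rng.standard_normal()     # noisy range
        b = np.arctan2(y, x) + sigma_b * rng.standard_normal()   # noisy bearing
        Z.append((r, b))
    n_clutter = rng.poisson(lam)                                 # clutter count
    for _ in range(n_clutter):
        Z.append((rng.uniform(0.0, r_max), rng.uniform(-np.pi, np.pi)))
    return np.array(Z)

Z = make_measurements([(100.0, 50.0), (-200.0, 300.0)])
```

The filter only ever sees the unlabelled set `Z`; recovering which rows are true positives is exactly the task evaluated here.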
Table 1: Average and standard deviation of the cardinality (OSPA Card), localisation (OSPA Loc) and overall OSPA errors for each algorithm.
In Table 1, we report the average and standard deviation of the overall OSPA (see (12)) and of its two terms, the cardinality error (OSPA Card) and the optimal Hungarian distance (OSPA Loc). We compare our method with the PHD, CPHD, LMB, and GLMB algorithms, with EKF, SMC, and UKF used as the basis for their prediction and update steps (the following Matlab implementation of these algorithms is used: http://ba-tuong.vo-au.com/codes.html). Our method outperforms all the other algorithms in terms of overall OSPA. In particular, this is due to a significant drop in the Loc error, while the cardinality error is comparable with most of the others. For example, although our algorithm has a higher average OSPA cardinality error than LMB-UKF, both our Loc and overall OSPA distances are lower. The resulting trajectories of this experiment for our method are illustrated in Fig. 2. The red dots represent the predicted locations of the targets at every time step, filtered out from the measurement clutter (black dots). They almost overlap with the ground truth (green dots), except for very few (only three) false positives (predicted but no ground truth) and false negatives (ground truth but no prediction).
Moreover, in Fig. 3, we show the overall OSPA at every time step. During the initial time steps, our OSPA error is higher, mostly due to under-fitting of the LSTM model caused by the lack of data. After these initial iterations, however, our OSPA error becomes significantly lower than that of the other approaches, with a lower overall average than every other algorithm. To show the robustness of our algorithm at higher clutter densities, in the second experiment we increase the clutter intensity and report the average OSPA over all time steps. Figure 4 shows the results of this experiment over the tested range of clutter intensities. Our filtering algorithm provides a relatively constant and comparably lower overall OSPA error even when the clutter intensity is increased to 50. The two SMC-based algorithms (GLMB-SMC and LMB-SMC) generate the highest OSPA errors, which can be due to divergence of the particle filter. On the other hand, the lower OSPA errors generated by the LMB filter with an EKF model show how successfully this particular simulated scenario can be modelled by such a non-linear filter. It should be noted, however, that our method does not rely on any prior motion model, yet is capable of learning the non-linearity within the data sequence.
7 Conclusion
This paper addressed the problem of fixed motion and measurement models in multi-target filtering using an adaptive deep learning framework. This is performed by defining target tuples with random finite set terminology and utilising LSTM networks, learning to model the target motion while simultaneously filtering the clutter. We defined a novel data association algorithm compatible with the predicted tracklet tuples, enabling the update of occluded targets in addition to the assignment of birth, survival and death of targets. Finally, the algorithm was evaluated over a commonly used filtering scenario via OSPA metric computation.
Our algorithm can be extended by investigating an end-to-end solution for tracking, encapsulating the data association step within the recurrent neural network architecture.
-  Abadi, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
-  Bay, A., Lepsoy, S., Magli, E.: Stable limit cycles in recurrent neural networks. In: 2016 International Conference on Communications (COMM). pp. 89–92 (2016)
-  Emambakhsh, M., Evans, A.: Nasal patches and curves for expression-robust 3D face recognition. IEEE Transactions on PAMI 39(5), 995–1007 (2017)
-  Fantacci, C., Vo, B.N., Vo, B.T., Battistelli, G., Chisci, L.: Robust fusion for multisensor multiobject tracking. IEEE Signal Processing Letters 25(5), 640–644 (2018)
-  Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Mahler, R.: PHD filters of higher order in target number. IEEE Transactions on Aerospace and Electronic Systems 43(4), 1523–1543 (2007)
-  Mahler, R.P.S.: Multitarget Bayes filtering via first-order multitarget moments. IEEE Transactions on Aerospace and Electronic Systems 39(4), 1152–1178 (2003)
-  Milan, A., Rezatofighi, S., Dick, A., Reid, I., Schindler, K.: Online multi-target tracking using recurrent neural networks. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
-  Nagappa, S., Delande, E.D., Clark, D.E., Houssineau, J.: A tractable forward-backward CPHD smoother. IEEE Transactions on Aerospace and Electronic Systems 53(1), 201–217 (2017)
-  Reuter, S., Vo, B.T., Vo, B.N., Dietmayer, K.: The labeled multi-Bernoulli filter. IEEE Transactions on Signal Processing 62(12), 3246–3260 (2014)
-  Schuhmacher, D., Vo, B.T., Vo, B.N.: A consistent metric for performance evaluation of multi-object filters. IEEE Transactions on Signal Processing 56(8), 3447–3457 (2008)
-  Vo, B.N., Ma, W.K.: The Gaussian mixture probability hypothesis density filter. IEEE Transactions on Signal Processing 54(11), 4091–4104 (2006)
-  Vo, B.N., Singh, S., Doucet, A.: Sequential Monte Carlo methods for multitarget filtering with random finite sets. IEEE Transactions on Aerospace and Electronic Systems 41(4), 1224–1245 (2005)
-  Vo, B.N., Vo, B.T., Hoang, H.G.: An efficient implementation of the generalized labeled multi-Bernoulli filter. IEEE Transactions on Signal Processing 65(8), 1975–1987 (2017)
-  Vo, B.N., Vo, B.T., Phung, D.: Labeled random finite sets and the Bayes multi-target tracking filter. IEEE Transactions on Signal Processing 62(24), 6554–6567 (2014)
-  Vo, B.N., Singh, S., Doucet, A.: Sequential Monte Carlo implementation of the PHD filter for multi-target tracking. In: Sixth International Conference on Information Fusion. vol. 2, pp. 792–799 (2003)
-  Vorontsov, E., Trabelsi, C., Kadoury, S., Pal, C.: On orthogonality and learning recurrent networks with long term dependencies. arXiv preprint arXiv:1702.00071 (2017)