Autonomous vehicles need to operate in rich environments that contain a large variety of interacting objects. This variety motivates the need for class-agnostic object trackers, which break with the popular tracking-by-detection paradigm [1, 2, 3, 4]. In tracking-by-detection, static video frames are first analysed by an object detector, typically a pre-trained deep CNN such as YOLO (Redmon et al.), and then the detected objects are linked across frames. Algorithms from this family can achieve high accuracy, provided that sufficient labelled data is available to train the object detector and that all encountered objects can be associated with known classes.
HART is a recently proposed alternative for single-object tracking (SOT), where an arbitrary object can be tracked from an initial video frame (Kosiorek et al.). Since the initial bounding-box is user-provided and may be placed over any part of the image, regardless of whether it corresponds to an object and of which class that object belongs to, HART can track arbitrary objects. HART efficiently processes just the relevant part of an image using spatial attention; it also integrates object detection, feature extraction, and motion modelling into one network, which is trained fully end-to-end. Contrary to tracking-by-detection, where only one video frame is typically processed at any given time to generate bounding box proposals, end-to-end learning in HART allows discovering complex visual and spatio-temporal patterns in videos, which is conducive to inferring what an object is and how it moves.
In the original formulation, HART is limited to the single-object modality, as are other existing end-to-end trackers [7, 8, 9]. In this work, we present MOHART, a class-agnostic tracker with complex relational reasoning capabilities provided by a multi-headed self-attention module (Vaswani et al., Lee et al.). MOHART infers the latent state of every tracked object in parallel, and uses self-attention to inform per-object states about other tracked objects. This helps to avoid performance loss under self-occlusions of tracked objects or strong ego-motion. Moreover, since the model is trained end-to-end, it is able to learn how to manage faulty or missing sensor inputs. It can also use the inferred objects’ states to predict their future trajectories, which depend on interactions between different objects.
After describing related work in Section 2 and the methodology in Section 3, we employ the algorithm on toy domains to validate its efficacy in Section 4. By controlling the stochasticity of toy environments, we show that single-object tracking is sufficient in some cases, even those featuring strong long-range interactions, while it may fail in other cases. This may hint at a similar phenomenon in the real world: tracking objects or predicting their future motion independently may be possible in most (but not all) cases, while solving the remaining corner cases might require taking interactions between objects into account. It is these corner cases that motivate our work. In Section 5, we test MOHART on three real-world datasets (MOTChallenge, UA-DETRAC, Stanford Drone dataset) and show that relational reasoning between objects is most important on the MOTChallenge dataset. We hypothesise that this is due to its richness in ego-motion, occlusions and crowded scenes, a result supported by our ablation study. Furthermore, we show that MOHART is able to gracefully handle missing sensory inputs without any architectural changes. In this case, it falls back on its internal motion model, which also allows for accurate prediction of object locations multiple time steps into the future, learned in a data-driven manner.
2 Related Work
Vision-based tracking approaches typically follow a tracking-by-detection paradigm: objects are first detected in each frame independently, and then a tracking algorithm links the detections from different frames to propose a coherent trajectory [1, 2, 3, 4]. Motion models and appearance are often used to improve the association between detected bounding-boxes and multiple trackers in a post-processing step. Recently, elements of this pipeline have been replaced with learning-based approaches such as deep learning [15, 16, 4, 3, 17]. Some approaches are targeted towards robustness across domains, for example by using a category-agnostic object detector and performing classification only in a post-processing step [18, 19].
A newly established and much less explored stream of work approaches tracking in an end-to-end fashion. A key difficulty here is that extracting an image crop (according to bounding-boxes provided by a detector) is non-differentiable and results in high-variance gradient estimators. Kahou et al. propose an end-to-end tracker with soft spatial-attention using a 2D grid of Gaussians instead of a hard bounding-box. HART draws inspiration from this idea, employs an additional attention mechanism, and shows promising performance on the real-world KITTI dataset. HART, which forms the foundation of this work, is explained in detail in Section 3. It has also been extended to incorporate depth information from RGB-D cameras. Gordon et al. propose an approach in which the crop corresponds to the scaled-up previous bounding-box. This simplifies the approach, but does not allow the model to learn where to look, since no gradient is backpropagated through the crop coordinates. To the best of our knowledge, there are no successful implementations of any such end-to-end approaches for multi-object tracking beyond SQAIR (Kosiorek et al.), which works only on datasets with static backgrounds. On real-world data, the only end-to-end approaches correspond to applying multiple single-object trackers in parallel, a method which does not leverage the potential of scene context or inter-object interactions.
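To make the idea of a soft, differentiable crop concrete, the following is a minimal NumPy sketch of attention with a 2D grid of Gaussians. It is an illustration only and not the exact parameterisation used by Kahou et al. or by HART: the function names, the single shared `sigma`, and the example values are assumptions. Because every output pixel is a smooth function of the grid centre and stride, an automatic-differentiation framework could propagate gradients to those parameters, which a hard crop does not allow.

```python
import numpy as np


def gaussian_filterbank(centre, stride, sigma, n_out, n_in):
    """Return an (n_out, n_in) matrix whose rows are normalised 1D Gaussians.

    The Gaussian means form a regular grid of n_out points, spaced by
    `stride` pixels and centred on `centre` (hypothetical parameter names).
    """
    means = centre + (np.arange(n_out) - (n_out - 1) / 2.0) * stride
    pixels = np.arange(n_in)
    weights = np.exp(-0.5 * ((pixels[None, :] - means[:, None]) / sigma) ** 2)
    # Normalise each row so that a glimpse pixel is a weighted average.
    return weights / (weights.sum(axis=1, keepdims=True) + 1e-8)


def extract_glimpse(image, centre_yx, stride_yx, sigma, out_hw):
    """Soft crop of a 2D image: each output pixel is a Gaussian-weighted average."""
    height, width = image.shape
    f_y = gaussian_filterbank(centre_yx[0], stride_yx[0], sigma, out_hw[0], height)
    f_x = gaussian_filterbank(centre_yx[1], stride_yx[1], sigma, out_hw[1], width)
    return f_y @ image @ f_x.T


# Toy example: a bright 4x4 "object" on a black 32x32 image.
image = np.zeros((32, 32))
image[10:14, 20:24] = 1.0
on_object = extract_glimpse(image, (11.5, 21.5), (1.0, 1.0), 0.5, (4, 4))
off_object = extract_glimpse(image, (5.0, 5.0), (1.0, 1.0), 0.5, (4, 4))
```

Here `on_object` is a 4x4 glimpse dominated by the bright patch, while `off_object` is essentially black; shifting the centre continuously interpolates between the two.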
Pedestrian trajectory prediction
Predicting pedestrian trajectories has a long history in computer vision and robotics. Initial research modelled social forces using hand-crafted features [21, 22, 23, 24] or MDP-based motion transition models, while more recent approaches learn from context information, e.g., positions of other pedestrians or landmarks in the environment. Social-LSTM employs an LSTM to predict pedestrian trajectories and uses max-pooling to model global social context. Attention mechanisms have been employed to query the most relevant information, such as neighbouring pedestrians, in a learnable fashion [27, 28, 29]. Apart from relational learning, context, periodical time information, and constant motion priors have proven effective in predicting long-term trajectories.
Our work stands apart from this prior art by not relying on ground truth tracklets. Instead, it addresses the more challenging task of working directly with visual input, performing tracking, modelling interactions, and, depending on the application scenario, simultaneously predicting future motions. As such, it can also be compared to Visual Interaction Networks (VIN), which use a CNN to encode three consecutive frames into state vectors (one per object) and feed these into an RNN, which has an Interaction Network at its core. VINs are able to make accurate predictions in physical scenarios, but, to the best of our knowledge, have not been applied to real-world data.
3 Recurrent Multi-Object Tracking with Self-Attention
We start by describing the HART algorithm, and then follow with an extension of HART to tracking multiple objects, where multiple instances of HART communicate with each other using multi-headed attention to facilitate relational reasoning. We also explain how this method can be extended to trajectory prediction instead of just tracking.
Hierarchical Attentive Recurrent Tracking (HART)
HART is an attention-based recurrent algorithm, which can efficiently track single objects in a video. It uses a spatial attention mechanism to extract a glimpse, which corresponds to a small crop of the image at the current time-step, containing the object of interest. This allows it to dispense with the processing of the whole image and can significantly decrease the amount of computation required. HART uses a CNN to convert the glimpse into features, which then update the hidden state of an LSTM core. The hidden state is used to estimate the current bounding-box, the spatial attention parameters for the next time-step, as well as object appearance. Importantly, the recurrent core can learn to predict complicated motion conditioned on the past history of the tracked object, which leads to relatively small attention glimpses: contrary to CNN-based approaches (Held et al., Valmadre et al.), HART does not need to analyse large regions-of-interest to search for tracked objects. In the original paper, HART processes the glimpse with an additional ventral and dorsal stream on top of the feature extractor. Early experiments have shown that this does not improve performance on the MOTChallenge dataset, presumably due to the oftentimes small objects and overall small amount of training data. Figure 1 illustrates HART; further details are provided in Appendix A.
The algorithm is initialised with a bounding-box for the first time-step (either a ground-truth bounding-box or one provided by an external detector can be used; the only requirement is that it contains the object of interest), and operates on a sequence of raw images. For all subsequent time-steps, it recursively outputs bounding-box estimates for the current time-step and predicted attention parameters for the next time-step. The performance of both HART and MOHART is measured as intersection-over-union (IoU) averaged over all time steps in which an object is present, excluding the first time step.
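The evaluation metric described above can be sketched in a few lines of plain Python. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the example values are assumptions made for illustration; the paper does not specify them here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_sequence_iou(predicted, ground_truth, present):
    """Average IoU over time steps t >= 1 in which the object is present."""
    scores = [
        iou(p, g)
        for t, (p, g, is_present) in enumerate(zip(predicted, ground_truth, present))
        if t > 0 and is_present  # skip the initialisation frame and absent objects
    ]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy sequence of four time steps; the object is absent in the last one.
predicted = [(0, 0, 2, 2), (0, 0, 2, 2), (1, 0, 3, 2), (5, 5, 6, 6)]
ground_truth = [(9, 9, 10, 10), (0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 1, 1)]
present = [True, True, True, False]
score = mean_sequence_iou(predicted, ground_truth, present)
```

In this toy sequence only the second and third time steps count, with IoUs of 1 and 1/3, so `score` is 2/3.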
Although HART can track arbitrary objects, it is limited to tracking one object at a time. While it can be deployed on several objects in parallel, different HART instances have no means of communication. This results in performance loss, as it is more difficult to identify occlusions, ego-motion and object interactions. Below, we propose an extension of HART which remedies these shortcomings.
Multi-Object Hierarchical Attentive Recurrent Tracking (MOHART)
Multi-object support in HART requires the following modifications. Firstly, in order to handle a dynamically changing number of objects, we apply HART to multiple objects in parallel, where all parameters between HART instances are shared. We refer to each HART instance as a tracker. Secondly, we introduce a presence variable for each object. It is used to mark whether an object should interact with other objects, as well as to mask the loss function for the given object when it is not present. In this setup, parallel trackers cannot exchange information and are conceptually still single-object trackers, which we use as a baseline, referred to as HART (despite it being an extension of the original algorithm). Finally, to facilitate communication between trackers, we augment HART with an additional step between feature extraction and the LSTM.
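As an illustration of how a presence variable can mask a loss, the sketch below averages per-object losses only over entries where the object is present. The array layout (time steps by objects), the function name, and the choice to normalise by the number of present entries are assumptions, not details taken from the paper.

```python
import numpy as np


def masked_mean_loss(per_object_loss, presence):
    """Average a (T, M) array of losses over the entries where presence == 1."""
    per_object_loss = np.asarray(per_object_loss, dtype=float)
    presence = np.asarray(presence, dtype=float)
    n_present = presence.sum()
    if n_present == 0:
        return 0.0  # nothing to supervise in this sequence
    # Absent objects are multiplied by zero and do not contribute.
    return float((per_object_loss * presence).sum() / n_present)


# Two time steps, three tracker slots; the third slot is never occupied.
loss = masked_mean_loss(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[1, 1, 0], [1, 0, 0]],
)
```

Only the three present entries (1, 2 and 4) enter the average, so empty tracker slots cannot influence training.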
Let $f_m$ be the feature vector extracted from the glimpse corresponding to the $m$-th object, and let $f_{1:M}$ be the set of such features extracted from all glimpses. Since different objects can interact with each other, it is necessary to use a method that can inform each object about the effects of its interactions with other objects. Moreover, since features extracted from different objects comprise a set, this method should be permutation-equivariant, i.e., the results should not depend on the order in which object features are processed. Therefore, we use the multi-head self-attention block (SAB, Lee et al.), which is able to account for higher-order interactions between set elements when computing their representations, thereby allowing rich information exchange, and it can do so in a permutation-equivariant manner. Intuitively, in our case, SAB allows any of the trackers to query other trackers about attributes of their respective objects, e.g., the distance between objects, their direction of movement, or their relation to the robot. This is implemented as follows,

$$Q = f_{1:M} W^q + b^q, \qquad K = f_{1:M} W^k + b^k, \qquad V = f_{1:M} W^v + b^v, \tag{1}$$

$$O_i = \mathrm{softmax}\left(Q_i K_i^\top\right) V_i, \qquad i = 1, \dots, H, \tag{2}$$

$$o_{1:M} = \mathrm{concat}\left(O_1, \dots, O_H\right), \tag{3}$$

where $o_m$ is the output of the relational reasoning module for object $m$. Time-step subscripts are dropped to decrease clutter. In Equation 1, each of the extracted features $f_m$ is linearly projected into a triplet of key $k_m$, query $q_m$ and value $v_m$ vectors. Together, they comprise the $K$, $Q$ and $V$ matrices with $M$ rows and $d_k$, $d_q$ and $d_v$ columns, respectively. $K$, $Q$ and $V$ are then split up into $H$ heads, which allows querying different attributes by comparing and aggregating different projections of the features. Multiplying $Q_i K_i^\top$ in Equation 2 compares every query vector to all key vectors, where the value of the corresponding dot-product represents the degree of similarity. Similarities are then normalised via a softmax operation and used to aggregate the values $V_i$. Finally, outputs of different attention heads are concatenated in Equation 3. SAB produces $M$ output vectors, one for each input, which are then concatenated with the corresponding inputs and fed into separate LSTMs for further processing, as in HART (see Figure 1).
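The three steps described above (linear projection, per-head dot-product attention, concatenation of heads) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the parameter names, dimensions and random weights are hypothetical, the dot-products are left unscaled to follow the description in the text (Vaswani et al. additionally divide by the square root of the head dimension), and the residual and feed-forward parts of a full SAB are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(features, params, n_heads):
    """Map (M, d_in) object features to (M, d_model) relational outputs."""
    # Equation 1: linear projections to queries, keys and values.
    q = features @ params["w_q"] + params["b_q"]
    k = features @ params["w_k"] + params["b_k"]
    v = features @ params["w_v"] + params["b_v"]
    heads = []
    # Split the columns into n_heads equal chunks, one chunk per attention head.
    for q_i, k_i, v_i in zip(*(np.split(x, n_heads, axis=1) for x in (q, k, v))):
        # Equation 2: compare every query with every key, normalise, aggregate values.
        weights = softmax(q_i @ k_i.T, axis=1)  # (M, M), each row sums to one
        heads.append(weights @ v_i)
    # Equation 3: concatenate the outputs of all heads.
    return np.concatenate(heads, axis=1)


# Hypothetical sizes: 5 tracked objects, 8-dimensional features, 2 heads.
rng = np.random.default_rng(0)
n_objects, d_in, d_model, n_heads = 5, 8, 8, 2
params = {
    "w_q": rng.normal(scale=0.1, size=(d_in, d_model)), "b_q": np.zeros(d_model),
    "w_k": rng.normal(scale=0.1, size=(d_in, d_model)), "b_k": np.zeros(d_model),
    "w_v": rng.normal(scale=0.1, size=(d_in, d_model)), "b_v": np.zeros(d_model),
}
features = rng.normal(size=(n_objects, d_in))
outputs = self_attention(features, params, n_heads)
# The attention output is concatenated with the input features of each object.
lstm_inputs = np.concatenate([features, outputs], axis=1)
```

Reordering the rows of `features` reorders the rows of `outputs` in the same way, which is the permutation-equivariance property the text asks for.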
MOHART is trained fully end-to-end, contrary to other tracking approaches [1, 2, 3, 4]. It maintains a hidden state, which can contain information about the object’s motion. One benefit is that in order to predict future trajectories, one can simply feed black frames into the model. Our experiments show that the model learns to fall back on the motion model captured by the LSTM in this case.
4 Validation on Simulated Data
To test the efficacy of the proposed algorithm, we conduct experiments on a toy domain. First, we show that HART as an end-to-end single-object tracker is able to capture complex motion patterns and leverage these to make accurate predictions. Second, we create a scenario which is not solvable for a single-object tracker as it requires knowledge about the state of the other objects and relational reasoning. We show that MOHART, using self-attention for relational reasoning, is able to capture these interactions with high accuracy and compare it to other possible implementations of MOHART (e.g., using max-pooling instead of self-attention). In order to accurately investigate the model’s understanding of motion patterns and interactions between objects, in contrast to traditional tracking, the model is not trained to predict the current location of the object, but its location in a future time step. The domain we create for this purpose is a two-dimensional square box. It contains circular objects with approximated elastic collisions (energy and momentum conservation) between objects and with walls (see Figures 2 and 3).
In the first scenario (Figure 2), four circles each exert repulsive forces on each other, where the magnitude of the force depends on the distance between the two circles. HART is applied four times in parallel and is trained to predict the location of each circle three time steps into the future. The different forces from different objects lead to a non-trivial force field at each time step. Accurately predicting the future location using just the previous motion of one object is therefore challenging (Figure 2 shows that each spatial attention box covers only the current object). Surprisingly, the single-object tracker solves this task, as measured by IoU averaged over sequences of 15 time steps. This shows the efficacy of end-to-end tracking in capturing complex motion patterns and using them to predict future locations. This, of course, could also be used to generate robust bounding boxes for a tracking task.
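A toy environment of this kind can be approximated with a short simulation step. The sketch below is not the authors' simulator: the repulsive force law (magnitude `strength / r`), the step size, the box and circle sizes, and all names are assumptions chosen for illustration, and collisions between circles are omitted for brevity.

```python
import numpy as np


def step(pos, vel, dt=0.05, box=1.0, radius=0.05, strength=0.01):
    """Advance N circles by one time step. pos and vel have shape (N, 2)."""
    diff = pos[:, None, :] - pos[None, :, :]  # diff[i, j] = pos_i - pos_j
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a circle exerts no force on itself
    # Assumed force law: magnitude strength / r, pointing away from the other circle.
    force = (strength * diff / dist[..., None] ** 2).sum(axis=1)
    vel = vel + dt * force
    pos = pos + dt * vel
    # Elastic wall collisions: mirror the position and flip the velocity component.
    low, high = radius, box - radius
    hit_low, hit_high = pos < low, pos > high
    pos = np.where(hit_low, 2 * low - pos, pos)
    pos = np.where(hit_high, 2 * high - pos, pos)
    vel = np.where(hit_low | hit_high, -vel, vel)
    return pos, vel


# Two circles at rest start to drift apart under mutual repulsion.
pos0 = np.array([[0.4, 0.5], [0.6, 0.5]])
vel0 = np.zeros((2, 2))
pos1, vel1 = step(pos0, vel0)
```

Since the pairwise forces are equal and opposite, total momentum is conserved as long as no circle touches a wall; this can serve as a quick sanity check when changing the force law.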
The second scenario (Figure 3) is constructed to be impossible to solve without exchanging information between objects. This is achieved by introducing two colour-coded identities. Agents of the same identity repel each other, while agents of different identities attract each other. Crucially, each agent is randomly assigned its identity in each time step. Hence, the algorithm can no longer infer the forces exerted on one object without knowledge of the state of the other objects in the current time step. The forces in this scenario again depend on the distance between objects, and the algorithm was trained to predict one time step into the future. HART is indeed unable to predict the future location of the objects accurately (Figure 3, top). The achieved average IoU is only slightly higher than that obtained by predicting the objects to have the same position in the next time step as in the current one. A possible interpretation of the qualitative results (green boxes in Figure 3, top) is that the model uses the momentum of each object to extrapolate into the future. This sometimes works well (bottom right object in frame 31) and sometimes not (top right object in frame 30). Using the relational reasoning module (Figure 3, bottom), the model is now able to make meaningful predictions. Interestingly, in each frame, the attention scores have a strong correlation with the interaction strength (which directly scales with distance). Although this is not necessary for the relational reasoning module to work, it is an interesting side-product, as the attention scores did not receive any direct supervision.
Figure 4 (left) shows a quantitative comparison of augmenting HART with different relational reasoning modules when identities are re-assigned in every time step. Exchanging information between trackers of different objects in the latent space with an MLP leads to slightly worse performance than the SOT baseline, while simple max-pooling performs significantly better. This can be explained through the permutation invariance of the problem: the list of latent representations of the different objects has no meaningful order and the output of the model should therefore be invariant to the ordering of the objects. The MLP is in itself not permutation invariant and therefore prone to overfit to the (meaningless) order of the objects in the training data. Max-pooling, however, is permutation invariant and can in theory, despite its simplicity, be used to approximate any permutation invariant function, given a sufficiently large latent space [37, 38]. Max-pooling is often used to exchange information between different tracklets, e.g., in the trajectory prediction domain [26, 39]. However, self-attention, allowing for learned querying and encoding of information, solves the relational reasoning task significantly more accurately. In Figure 4 (right), the frequency with which object identities are reassigned randomly is varied. The results show that, in a deterministic environment, tracking does not necessarily profit from relational reasoning, even in the presence of long-range interactions. The less random the environment, the more static the force field, and a static force field can be inferred from a small number of observations (see Figure 2). This does of course not mean that all stochastic environments profit from relational reasoning.
What these experiments indicate is that tracking cannot be expected to profit from relational reasoning by default in any environment, but instead in environments which feature (potentially non-deterministic) dynamics and predictable interactions.
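The permutation argument can be checked numerically. The sketch below compares a small MLP applied to the concatenated per-object states with element-wise max-pooling over objects; the sizes, random weights and function names are hypothetical and only serve to show that the former depends on the object order while the latter does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, d = 4, 16
states = rng.normal(size=(n_objects, d))  # one latent state per tracker
w_1 = rng.normal(scale=0.1, size=(n_objects * d, 32))
w_2 = rng.normal(scale=0.1, size=(32, d))


def mlp_context(s):
    """Two-layer MLP on the concatenated states: depends on the object order."""
    return np.tanh(s.reshape(-1) @ w_1) @ w_2


def maxpool_context(s):
    """Element-wise maximum over objects: independent of the object order."""
    return s.max(axis=0)


shuffled = states[[2, 0, 3, 1]]  # same objects, different order
same_maxpool = np.allclose(maxpool_context(states), maxpool_context(shuffled))
same_mlp = np.allclose(mlp_context(states), mlp_context(shuffled))
```

With these random weights `same_maxpool` is true and `same_mlp` is false, so an MLP has to learn order invariance from data, whereas max-pooling (like self-attention without positional information) has it built in.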
5 Relational Reasoning in Real-World Tracking
Having established that MOHART is capable of performing complex relational reasoning, we now test the algorithm on three real-world datasets and analyse the effects of relational reasoning on performance depending on dataset and task. We find consistent improvements of MOHART compared to HART throughout. Relational reasoning yields particularly high gains for scenes with ego-motion, crowded scenes, and simulated faulty sensor inputs.
5.1 Experimental Details
In order to increase scene dynamics and make the tracking/prediction problems more challenging, we sub-sample some of the high-framerate scenes with a stride of two. Training and architecture details are given in Appendix A and Appendix B. We conduct experiments in three different modes:
Tracking. The model is initialised with the ground truth bounding boxes for a set of objects in the first frame. It then consecutively sees the following frames and predicts the bounding boxes. The sequence length is 30 time steps and the performance is measured as intersection over union (IoU) averaged over the entire sequence excluding the first frame. This algorithm is either applied to the entire dataset or subsets of it to study the influence of certain properties of the data.
Camera Blackout. This simulates unsteady or faulty sensor inputs. The setup is the same as in Tracking, but sub-sequences of the images are blacked out. The algorithm is expected to recognise that no new information is available and that it should resort to its internal motion model.
Prediction. To test MOHART’s ability to capture motion patterns, only the first two frames are shown to the model, followed by three black frames. IoU is measured separately for each time step.
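The inputs for the Camera Blackout and Prediction modes can be prepared with simple array operations, as in the sketch below. The function names, the `(T, H, W, C)` frame layout, and the choice of which sub-sequence is blacked out are assumptions; the text does not specify where or for how long the blackout occurs.

```python
import numpy as np


def black_out(frames, start, length):
    """Return a copy of a (T, H, W, C) sequence with frames[start:start+length] set to black."""
    out = frames.copy()
    out[start:start + length] = 0
    return out


def prediction_input(frames, n_visible=2, n_black=3):
    """Keep the first n_visible frames and append n_black black frames."""
    visible = frames[:n_visible]
    black = np.zeros((n_black,) + frames.shape[1:], dtype=frames.dtype)
    return np.concatenate([visible, black], axis=0)


# Hypothetical sequence of 30 tiny all-white RGB frames.
frames = np.ones((30, 4, 4, 3), dtype=np.float32)
blackout_seq = black_out(frames, start=10, length=5)
prediction_seq = prediction_input(frames)
```

In both cases the tracker receives frames of the usual shape, so no architectural change is needed; it only has to learn that an all-black frame carries no new information.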
| | Entire Dataset | Only Ego-Motion | No Ego-Motion | Crowded Scenes | Camera Blackout |
|---|---|---|---|---|---|
| MOHART | 68.5% | 66.9% | 64.7% | 69.1% | 63.6% |
| HART | 66.6% | 64.0% | 62.9% | 66.9% | 60.6% |
| Difference | 1.9% | 2.9% | 1.8% | 2.2% | 3.0% |

| All | Crowded Scenes | Camera Blackout |

| All | Camera Blackout | CamBlack Bikes |
5.2 Results and Analysis
On the MOTChallenge dataset, HART achieves 66.6% intersection over union (see Table 3), which in itself is impressive given the small amount of training data of only 5225 training frames and no pre-training. MOHART achieves 68.5% (both numbers are averaged over 5 runs, and the difference was assessed with an independent-samples t-test). The performance gain increases when only considering ego-motion data. This is readily explained: movements of objects in the image space due to ego-motion are correlated and can therefore be better understood when combining information from movements of multiple objects, i.e., performing relational reasoning. In another ablation, we filtered for only crowded scenes by requiring five objects to be present for, on average, 90% of the frames in a sub-sequence. For the MOTChallenge dataset, this only leads to a minor increase of the performance gain of MOHART, indicating that the dataset already exhibits a sufficient density of objects to learn interactions. The biggest benefit from relational reasoning can be observed in the camera blackout experiments (setup explained in Section 5.1). Both HART and MOHART learn to rely on their internal motion models when confronted with black frames and propagate the bounding boxes according to the previous movement of the objects. It is unsurprising that this scenario profits particularly from relational reasoning. Qualitative tracking and camera blackout results are shown in Figure 5 and in Appendix C, respectively.
Tracking performance on the UA-DETRAC dataset only profits from relational reasoning when filtering for crowded scenes (see Table 3). The fact that the performance of MOHART is slightly worse on the vanilla dataset can be explained by more overfitting. As there is no exchange between trackers in HART, each object constitutes an independent training sample.
The Stanford Drone dataset (see Table 3) is qualitatively different from the other two as it is filmed from a top-down view. The scenes are more crowded and each object only covers a small number of pixels, rendering it a difficult problem for tracking. The dataset was designed for trajectory prediction, a problem setup where an algorithm is typically provided with ground truth tracklets in coordinate space and potentially an image as context information. The task is then to extrapolate these tracklets into the future. The tracking performance profits from relational reasoning more than on the UA-DETRAC dataset but less than on the MOTChallenge dataset. The performance gain in the camera blackout experiments is particularly strong when only considering cyclists.
In the results from the prediction experiments (see Figure 6), MOHART consistently outperforms HART. On both datasets, the model outperforms a baseline which uses momentum to linearly extrapolate the bounding boxes from the first two frames. This shows that even from just two frames, the model learns to capture motion models which are more complex than what could be observed from just the bounding boxes (i.e., momentum), suggesting that it uses visual information (HART and MOHART) as well as relational reasoning (MOHART). The strong performance gain of MOHART compared to HART on the UA-DETRAC dataset, despite the small differences for tracking on this dataset, can be explained as follows: this dataset features few interactions but strong correlations in motion. Hence, when only having access to the first two frames, MOHART profits from estimating the velocities of multiple cars simultaneously.
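The momentum baseline mentioned above can be sketched as a constant-velocity continuation of the first two bounding boxes. The function name and the box format `(x_min, y_min, x_max, y_max)` are assumptions; extrapolating each coordinate independently, as done here, also extrapolates the box size linearly, which may differ from the exact baseline used in the experiments.

```python
def extrapolate_boxes(box_0, box_1, n_steps):
    """Linearly continue two consecutive boxes for n_steps future time steps."""
    # Per-coordinate displacement between the two observed frames.
    velocity = [b1 - b0 for b0, b1 in zip(box_0, box_1)]
    return [
        tuple(b1 + (k + 1) * v for b1, v in zip(box_1, velocity))
        for k in range(n_steps)
    ]


# A box moving one pixel to the right per frame, predicted three steps ahead.
future = extrapolate_boxes((0, 0, 2, 2), (1, 0, 3, 2), n_steps=3)
```

For this example `future` is `[(2, 0, 4, 2), (3, 0, 5, 2), (4, 0, 6, 2)]`; a learned model has to beat this simple continuation to demonstrate that it uses more than bounding-box momentum.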
6 Conclusion
With MOHART, we introduce an end-to-end multi-object tracker that is capable of capturing complex interactions and leveraging these for precise predictions, as experiments on both toy and real-world data show. However, the experiments also show that the benefit of relational reasoning strongly depends on the nature of the data. The toy experiments showed that in an entirely deterministic world relational reasoning was much less important than in a stochastic environment. Amongst the real-world datasets, the highest performance gains from relational reasoning were achieved on the MOTChallenge dataset, which features crowded scenes, ego-motion and occlusions.
Acknowledgements
We thank Stefan Saftescu for his contributions, particularly for integrating the Stanford Drone Dataset, and Adam Golinski as well as Stefan Saftescu for proof-reading. This research was funded by the EPSRC AIMS Centre for Doctoral Training at Oxford University and an EPSRC Programme Grant (EP/M019918/1). We acknowledge use of Hartree Centre resources in this work. The STFC Hartree Centre is a research collaboratory in association with IBM providing High Performance Computing platforms funded by the UK’s investment in e-Infrastructure. The Centre aims to develop and demonstrate next generation software, optimised to take advantage of the move towards exa-scale computing.
- Zhang et al.  L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. CVPR, 2008.
- Milan et al.  A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. PAMI, 2014.
- Bae and Yoon  S.-H. Bae and K.-J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Keuper et al.  M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele. Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
Redmon et al. 
J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi.
You only look once: Unified, real-time object detection.
Conference on Computer Vision and Pattern Recognition, 2016.
- Kosiorek et al.  A. R. Kosiorek, A. Bewley, and I. Posner. Hierarchical attentive recurrent tracking. Neural Information Processing Systems, 2017.
- Kahou et al.  S. E. Kahou, V. Michalski, and R. Memisevic. RATM: recurrent attentive tracking model. IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
Rasouli Danesh et al. 
M. Rasouli Danesh, S. Yadav, S. Herath, Y. Vaghei, and S. Payandeh.
Deep attention models for human tracking using rgbd.Sensors, 19:750, 02 2019.
- Gordon et al.  D. Gordon, A. Farhadi, and D. Fox. Re3 : Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects. RA-L, 2018.
- Vaswani et al.  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Neural Information Processing Systems, 2017.
Lee et al. 
J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh.
International Conference on Machine Learning, 2019.
- Milan et al.  A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. 2016. arXiv: 1603.00831.
- Wen et al.  L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu. DETRAC: A new benchmark and protocol for multi-object tracking. arXiv, 1511.04136, 2015.
- Robicquet et al.  A. Robicquet, A. Sadeghian, A. Alahi, and S. Savaresei. Learning social etiquette: Human trajectory prediction in crowded scenes. European Conference on Computer Vision, 2016.
Nam and Han 
H. Nam and B. Han.
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking.CVPR, 2016.
- Ning et al.  G. Ning, Z. Zhang, C. Huang, Z. He, X. Ren, and H. Wang. Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking. ISCAS, 2017.
- Xiang et al.  Y. Xiang, A. Alahi, and S. Savarese. Learning to Track: Online Multi- Object Tracking by Decision Making Multi-Object Tracking. ICCV, 2015.
- Ošep et al.  A. Ošep, W. Mehner, P. Voigtlaender, and B. Leibe. Track, then Decide: Category-Agnostic Vision-based Multi-Object Tracking. ICRA, 2018.
- Ondruska and Posner  P. Ondruska and I. Posner. Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks. AAAI, 2016.
- Kosiorek et al.  A. Kosiorek, H. Kim, Y. W. Teh, and I. Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems, pages 8606–8616, 2018.
- Lerner et al.  A. Lerner, Y. Chrysanthou, and D. Lischinski. Crowds by example. In Computer Graphics Forum, 2007.
-  S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In ICCV 2009.
- Trautman and Krause  P. Trautman and A. Krause. Unfreezing the robot: Navigation in dense, interacting crowds. In IROS, 2010.
- Yamaguchi et al.  K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In CVPR, 2011.
- Rudenko et al.  A. Rudenko, L. Palmieri, and K. O. Arras. Joint long-term prediction of human motion using a planning-based social force approach. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
- Alahi et al.  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
Su et al. 
H. Su, Y. Dong, J. Zhu, H. Ling, and B. Zhang.
Crowd scene understanding with coherent recurrent neural networks.2016.
- Fernando et al.  T. Fernando, S. Denman, S. Sridharan, and C. Fookes. Soft+ hardwired attention: An lstm framework for human trajectory prediction and abnormal event detection. Neural networks, 2018.
- Sadeghian et al.  A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Varshneya and Srinivasaraghavan  D. Varshneya and G. Srinivasaraghavan. Human trajectory prediction using spatially aware deep attention models. arXiv preprint:1705.09436, 2017.
- Sun et al.  L. Sun, Z. Yan, S. M. Mellado, M. Hanheide, and T. Duckett. 3dof pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data. In 2018 IEEE International Conference on Robotics and Automation. IEEE, 2018.
- Schöller et al.  C. Schöller, V. Aravantinos, F. Lay, and A. Knoll. The simpler the better: Constant velocity for pedestrian motion prediction. arXiv preprint arXiv:1903.07933, 2019.
- Watters et al.  N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tacchetti. Visual Interaction Networks: Learning a Physics Simulator from Video. NIPS, 2017.
- Battaglia et al.  P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu. Interaction Networks for Learning about Objects, Relations and Physics. NIPS, 2016.
- Held et al.  D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, 2016.
- Valmadre et al.  J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Zaheer et al.  M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. Smola. Deep Sets. In Advances in Neural Information Processing Systems, 2017.
- Wagstaff et al.  E. Wagstaff, F. B. Fuchs, M. Engelcke, I. Posner, and M. A. Osborne. On the limitations of representing functions on sets. International Conference on Machine Learning, 2019.
- Gupta et al.  A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: socially acceptable trajectories with generative adversarial networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
Appendix A Architecture Details
The architecture details were chosen to optimise HART performance on the MOTChallenge dataset. They deviate from the original HART implementation as follows: The presence variable is predicted with a binary cross-entropy loss. The maximum number of objects to be tracked simultaneously was set to 5 for the UA-DETRAC and MOTChallenge datasets. For the more crowded Stanford Drone dataset, this number was set to 10. The feature extractor is a three-layer convolutional network with a kernel size of 5, a stride of 2 in the first and last layers, 32 channels in the first two layers, 64 channels in the last layer, ELU activations, and skip connections. This converts the initial glimpse into a feature representation. It is followed by a fully connected layer with a 128-dimensional output and an ELU activation. The spatial attention parameters are linearly projected onto 128 dimensions and added to this feature representation, serving as a positional encoding. The LSTM has a hidden state size of 128. The self-attention unit in MOHART linearly projects the inputs to dimensionality 128 for each of the keys, queries and values. For the real-world experiments, in addition to the features extracted from the glimpse, the LSTM hidden states from the previous time step are also fed as an input by concatenating them with the features. In all cases, the output of the attention module is concatenated to the input features of the respective object.
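The relational step described above can be illustrated with a minimal sketch. It is not the MOHART implementation: it uses a single attention head rather than the multi-headed module, random matrices stand in for the learned projections, and the helper names (`matmul`, `softmax`, `random_matrix`, `self_attention`) are hypothetical. Only the projection size of 128 and the concatenation of the attention output to the per-object input features are taken from the text.

```python
import math
import random

DIM = 128  # projection size for keys, queries and values, as stated in the text


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projection matrix."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over per-object features.

    Returns each object's input features concatenated with its attention
    output, together with the attention weights.
    """
    queries = matmul(features, w_q)
    keys = matmul(features, w_k)
    values = matmul(features, w_v)
    keys_t = [list(col) for col in zip(*keys)]  # transpose of the key matrix
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, values)
    outputs = [f + a for f, a in zip(features, attended)]  # list concatenation
    return outputs, weights


rng = random.Random(0)
# Five tracked objects, each with a 128-dimensional feature vector.
features = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng) for _ in range(3))
outputs, weights = self_attention(features, w_q, w_k, w_v)
```

Under these assumptions, every object ends up with a 256-dimensional vector: its own 128 features followed by a 128-dimensional summary of all tracked objects.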
As an optimizer, we used RMSProp with momentum set to and learning rate . For the MOTChallenge and UA-DETRAC datasets, the models were trained for 100,000 iterations with a batch size of 10, and the reported IoU is exponentially smoothed over iterations to achieve lower variance. For the Stanford Drone dataset, the batch size was increased to 32, which reduced the time to convergence and hence shortened training to 50,000 iterations.
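The reported metric can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used here: the box format `(x_min, y_min, x_max, y_max)` and the smoothing factor `alpha` are hypothetical, since the text specifies neither.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def exponential_smoothing(values, alpha=0.1):
    """Exponentially smooth a sequence of per-iteration values.

    alpha is an assumed smoothing factor; smaller values smooth more strongly.
    """
    smoothed = []
    running = None
    for value in values:
        if running is None:
            running = value  # initialise with the first observation
        else:
            running = (1 - alpha) * running + alpha * value
        smoothed.append(running)
    return smoothed
```

Smoothing trades responsiveness for lower variance: a smaller `alpha` averages over more iterations and reacts more slowly to recent changes.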
Appendix B Experimental Details
The MOTChallenge and UA-DETRAC datasets discussed in this section are intended to be used as a benchmark suite for multi-object tracking in a tracking-by-detection paradigm. Therefore, ground-truth bounding boxes are only available for the training datasets. Users are encouraged to upload a model that performs tracking in a data-association paradigm, leveraging the bounding-box proposals provided by an external object detector. As we are interested in a different analysis (IoU given initial bounding boxes), we divide the training data further into training and test sequences. To make up for the smaller amount of training data, we extend the MOTChallenge 2017 dataset with three sequences from the 2015 dataset (ETH-Sunnyday, PETS09-S2L1, ETH-Bahnhof). We use the first 70% of the frames of each of the ten sequences for training and the rest for testing. Sequences with high frame rates (30Hz) are sub-sampled with a stride of two. For the UA-DETRAC dataset, we split the 60 available sequences into 44 training sequences and 16 test sequences. For the considerably larger Stanford Drone dataset, we took three videos of the scene deathCircle for training and the remaining two videos from the same scene for testing. The videos of the drone dataset were also sub-sampled with a stride of two to increase scene dynamics.
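The per-sequence split for the MOTChallenge data can be sketched as below. The function name `split_sequence` is hypothetical, and the order of operations (sub-sampling first, then splitting) is an assumption, as the text does not state it.

```python
def split_sequence(frames, train_percent=70, stride=1):
    """Sub-sample a sequence, then split it chronologically into train and test.

    frames: list of frames (or frame indices) in temporal order.
    train_percent: share of the sub-sampled frames used for training.
    stride: 2 for high-frame-rate (30Hz) sequences, 1 otherwise.
    """
    sampled = frames[::stride]
    cut = len(sampled) * train_percent // 100  # integer arithmetic avoids rounding issues
    return sampled[:cut], sampled[cut:]


# Example: a 20-frame sequence recorded at 30Hz, sub-sampled with a stride of two.
train_frames, test_frames = split_sequence(list(range(20)), train_percent=70, stride=2)
```

Splitting chronologically, rather than at random, keeps the test frames strictly later in time than the training frames of the same sequence.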
Appendix C Camera Blackout Experiments
In Section 5, we conducted a set of camera blackout experiments to test MOHART’s capability of dealing with faulty sensor inputs. While traditional pipeline methods require careful consideration of different types of corner cases to properly handle erroneous sensor inputs, MOHART is able to capture these automatically, especially when confronted with similar issues in the training scenarios. To simulate this, we replace subsequences of the images with black frames. Figure 7 and Figure 8 show two such examples from the test data together with the model’s prediction. MOHART learns not to update its internal model when confronted with black frames and instead uses the LSTM to propagate the bounding boxes. When proper sensor input is available again, the model uses this to make a rapid adjustment to its predicted location and ‘snap’ back onto the object. This works remarkably well in the presence of both occlusion (Figure 7) and ego-motion (Figure 8). Tables 3, 3 and 3 show that the benefit of relational reasoning is particularly high in these scenarios. These experiments can also be seen as a proof of concept of MOHART’s capabilities of predicting future trajectories, and of how this benefits from relational reasoning.
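The blackout simulation can be sketched as follows. This is a minimal illustration, not the code used in the experiments: frames are represented as nested lists of pixel intensities for simplicity, and the function name `apply_blackout` and the choice of `start` and `length` are hypothetical, since the text does not specify how the blacked-out subsequences were selected.

```python
def apply_blackout(frames, start, length):
    """Return a copy of `frames` with a contiguous subsequence replaced by black frames.

    frames: list of images, each a list of rows of pixel intensities.
    start: index of the first blacked-out frame.
    length: number of consecutive frames to black out.
    """
    output = []
    for t, frame in enumerate(frames):
        if start <= t < start + length:
            # Black frame with the same shape as the original image.
            output.append([[0 for _ in row] for row in frame])
        else:
            output.append([list(row) for row in frame])
    return output


# Example: five tiny 1x2 frames, with frames 1 and 2 blacked out.
video = [[[t + 1, t + 1]] for t in range(5)]
corrupted = apply_blackout(video, start=1, length=2)
```

The ground-truth bounding boxes are left untouched by such a corruption, so the tracker is still evaluated against the true object positions during the blackout.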