Neural relational inference for interacting systems - pytorch
Interacting systems are prevalent in nature, from dynamical systems in physics to complex societal dynamics. The interplay of components can give rise to complex behavior, which can often be explained using a simple model of the system's constituent parts. In this work, we introduce the neural relational inference (NRI) model: an unsupervised model that learns to infer interactions while simultaneously learning the dynamics purely from observational data. Our model takes the form of a variational auto-encoder, in which the latent code represents the underlying interaction graph and the reconstruction is based on graph neural networks. In experiments on simulated physical systems, we show that our NRI model can accurately recover ground-truth interactions in an unsupervised manner. We further demonstrate that we can find an interpretable structure and predict complex dynamics in real motion capture and sports tracking data.READ FULL TEXT VIEW PDF
Many complex natural and cultural phenomena are well modelled by systems...
Multi-interaction systems abound in nature, from colloidal suspensions t...
Graph neural networks are promising architecture for learning and infere...
Real-world dynamical systems often consist of multiple stochastic subsys...
Understanding and interacting with everyday physical scenes requires ric...
The roundworm C. elegans exhibits robust escape behavior in response to
We propose a framework for the completely unsupervised learning of laten...
Neural relational inference for interacting systems - pytorch
A wide range of dynamical systems in physics, biology, sports, and other areas can be seen as groups of interacting components, giving rise to complex dynamics at the level of individual constituents and in the system as a whole. Modeling these type of dynamics is challenging: often, we only have access to individual trajectories, without knowledge of the underlying interactions or dynamical model.
As a motivating example, let us take the movement of basketball players on the court. It is clear that the dynamics of a single basketball player are influenced by the other players, and observing these dynamics as a human, we are able to reason about the different types of interactions that might arise, e.g. defending a player or setting a screen for a teammate. It might be feasible, though tedious, to manually annotate certain interactions given a task of interest. It is more promising to learn the underlying interactions, perhaps shared across many tasks, in an unsupervised fashion.
Recently there has been a considerable amount of work on learning the dynamical model of interacting systems using implicit interaction models (Sukhbaatar et al., 2016; Guttenberg et al., 2016; Santoro et al., 2017; Watters et al., 2017; Hoshen, 2017; van Steenkiste et al., 2018). These models can be seen as graph neural networks (GNNs) that send messages over the fully-connected graph, where the interactions are modeled implicitly by the message passing function (Sukhbaatar et al., 2016; Guttenberg et al., 2016; Santoro et al., 2017; Watters et al., 2017) or with an attention mechanism (Hoshen, 2017; van Steenkiste et al., 2018).
In this work, we address the problem of inferring an explicit interaction structure while simultaneously learning the dynamical model of the interacting system in an unsupervised way. Our neural relational inference (NRI) model learns the dynamics with a GNN over a discrete latent graph, and we perform inference over these latent variables. The inferred edge types correspond to a clustering of the interactions. Using a probabilistic model allows us to incorporate prior beliefs about the graph structure, such as sparsity, in a principled manner.
In a range of experiments on physical simulations, we show that our NRI model possesses a favorable inductive bias that allows it to discover ground-truth physical interactions with high accuracy in a completely unsupervised way. We further show on real motion capture and NBA basketball data that our model can learn a very small number of edge types that enable it to accurately predict the dynamics many time steps into the future.
We start by giving a brief introduction to a recent class of neural networks that operate directly on graph-structured data by passing local messages (Scarselli et al., 2009; Li et al., 2016; Gilmer et al., 2017). We refer to these models as graph neural networks (GNN). Variants of GNNs have been shown to be to be highly effective at relational reasoning tasks (Santoro et al., 2017), modeling interacting or multi-agent systems (Sukhbaatar et al., 2016; Battaglia et al., 2016), classification of graphs (Bruna et al., 2014; Duvenaud et al., 2015; Dai et al., 2016; Niepert et al., 2016; Defferrard et al., 2016; Kearnes et al., 2016) and classification of nodes in large graphs (Kipf & Welling, 2017; Hamilton et al., 2017).
Given a graph with vertices and edges 111Undirected graphs can be modeled by explicitly assigning two directed edges in opposite direction for each undirected edge., we define a single node-to-node message passing operation in a GNN as follows, similar to Gilmer et al. (2017):
where is the embedding of node in layer , is an embedding of the edge , and and summarize initial (or auxiliary) node and edge features, respectively (e.g. node input and edge type). denotes the set of indices of neighbor nodes connected by an incoming edge. and represent a node-to-edge () and an edge-to-node () mapping, respectively (see Figure 2).
In practice, it is often convenient to choose the following functional form for and :
denotes concatenation of vectors, andand are node- and edge-specific neural networks (e.g. small MLPs), respectively. It is also common in practice to replace the sum in Eq. (4) with an average. Eqs. (3)–(4) allow for the composition of models that map from edge to node representations or vice-versa via multiple rounds of message passing. Note that Eqs. (3)–(4) share similarities with the definition of set convolutions (Zaheer et al., 2017) which are related to GNNs.
In the original GNN formulation from Scarselli et al. (2009) the node embedding depends only on , the embedding of the sending node, and the edge type, but not on , the embedding of the receiving node. This is of course a special case of this formulation, and more recent works such as interaction networks (Battaglia et al., 2016) or message passing neural networks (Gilmer et al., 2017) are in line with our more general formulation. We further note that some recent works factor into a product of two separate functions, one of which acts as a gating or attention mechanism (Monti et al., 2017; Duan et al., 2017; Hoshen, 2017; Veličković et al., 2018; Garcia & Bruna, 2018; van Steenkiste et al., 2018) which in some cases can have computational benefits or introduce favorable inductive biases.
Our NRI model consists of two parts trained jointly: An encoder that predicts the interactions given the trajectories, and a decoder that learns the dynamical model given the interaction graph.
More formally, our input consists of trajectories of objects. We denote by the feature vector of object at time , e.g. location and velocity. We denote by the set of features of all objects at time , and we denote by the trajectory of object , where is the total number of time steps. Lastly, we mark the whole trajectories by . We assume that the dynamics can be modeled by a GNN given an unknown graph where represents the discrete edge type between objects and . The task is to simultaneously learn to predict the edge types and learn the dynamical model in an unsupervised way.
We formalize our model as a variational autoencoder (VAE)(Kingma & Welling, 2014) that maximizes the ELBO:
The encoder returns a factorized distribution of , where
is a discrete categorical variable representing the edge type between objectand . We use a one-hot representation of the interaction types for .
models with a GNN given the latent graph structure .
is a factorized uniform distribution over edges types. If one edge type is “hard coded” to represent “non-edge” (no messages being passed along this edge type), we can use an alternative prior with higher probability on the “non-edge” label. This will encourage sparser graphs.
There are some notable differences between our model and the original formulation of the VAE (Kingma & Welling, 2014). First, in order to avoid the common issue in VAEs of the decoder ignoring the latent code (Chen et al., 2017), we train the decoder to predict multiple time steps and not a single step as the VAE formulation requires. This is necessary since interactions often only have a small effect in the time scale of a single time step. Second, the latent distribution is discrete, so we use a continuous relaxation (Jang et al., 2017; Maddison et al., 2017) in order to use the reparameterization trick. Lastly, we note that we do not learn the probability (i.e. for ) as we are interested in the dynamics and interactions, and this does not have any effect on either (but would be easy to include if there was a need).
The overall model is schematically depicted in Figure 3. In the following, we describe the encoder and decoder components of the model in detail.
At a high level, the goal of the encoder is to infer pairwise interaction types given observed trajectories . Since we do not know the underlying graph, we can use a GNN on the fully-connected graph to predict the latent graph structure.
More formally, we model the encoder as , where is a GNN acting on the fully-connected graph (without self-loops). Given input trajectories our encoder computes the following message passing operations:
Finally, we model the edge type posterior as:
where summarizes the parameters of the neural networks in Eqs. (7) to (10). The use of multiple passes, two in the model presented here, allows the model to “disentangle” multiple interactions while still using only binary terms. The functions are neural networks that map between the respective representations. In our experiments we used either fully-connected networks (MLPs) or 1D convolutional networks (CNNs) with attentive pooling similar to (Lin et al., 2017) for the functions. See appendix for further details.
While this model falls into the general framework presented in Sec. 3, there is a conceptual difference in how are interpreted. Unlike in a typical GNN, the messages are no longer considered just a transient part of the computation, but an integral part of the model that represents the edge embedding used to perform edge classification.
While it is straightforward to sample from , we cannot use the reparametrization trick (Kingma & Welling, 2014)
and backpropagate though the sampling as our latent variables are discrete. A recently popular approach to handle this difficulty is to sample from a continuous approximation of the discrete distribution(Maddison et al., 2017; Jang et al., 2017) and use the repramatrization trick to get (biased) gradients from this approximation. We used the concrete distribution (Maddison et al., 2017) where samples are drawn as:
where is a vector of i.i.d. samples drawn from a distribution and (softmax temperature) is a parameter that controls the “smoothness” of the samples. This distribution converges to one-hot samples from our categorical distribution when .
The task of the decoder is to predict the future continuation of the interacting system’s dynamics . Since the decoder is conditioned on the graph we can in general use any GNN algorithm as our decoder.
For physics simulations the dynamics is Markovian , if the state is location and velocity and is the ground-truth graph. For this reason we use a GNN similar to interaction networks (Battaglia et al., 2016); unlike interaction networks we have a separate neural network for each edge type. More formally:
Note that denotes the -th element of the vector and
is a fixed variance. Whenis a discrete one-hot sample the messages are for the selected edge type , and for the continuous relaxation we get a weighted sum. Also note that since in Eq. 14 we add the present state our model only learns the change in state .
If we look at the ELBO, Eq. 5, the reconstruction loss term has the form which involves only single step predictions. One issue with optimizing this objective is that the interactions can have a small effect on short-term dynamics. For example, in physics simulations a fixed velocity assumption can be a good approximation for a short time period. This leads to a sub-optimal decoder that ignores the latent edges completely and achieves only a marginally worse reconstruction loss.
We address this issue in two ways: First, we predict multiple steps into the future, where a “degenerate” decoder (which ignores the latent edges) would perform much worse. Second, instead of having one neural network that computes the messages given , as was done in (Battaglia et al., 2016), we have a separate MLP for each edge type. This makes the dependence on the edge type more explicit and harder to be ignored by the model.
Predicting multiple steps is implemented by replacing the correct input , with the predicted mean for steps (we used in our experiments), then feed in the correct previous step and reiterate. More formally, if we denote our decoder as then we have:
We are backpropagating through this whole process, and since the errors accumulate for steps the degenerate decoder is now highly suboptimal.
In many applications the Markovian assumption used in Sec. 3.3 does not hold. To handle such applications we use a recurrent decoder that can model . Our recurrent decoder adds a GRU (Cho et al., 2014) unit to the GNN message passing operation. More formally:
The input to the message passing operation is the recurrent hidden state at the previous time step. denotes an output transformation, modeled by a small MLP. For each node the input to the GRU update is the concatenation of the aggregated messages , the current input , and the previous hidden state .
If we wish to predict multiple time steps in the recurrent setting, the method suggested in Sec. 3.4 will be problematic. Feeding in the predicted (potentially incorrect) path and then periodically jumping back to the true path will generate artifacts in the learned trajectories. In order to avoid this issue we provide the correct input in the first steps, and only utilize our predicted mean as input at the last time steps.
Now that we have described all the elements, the training goes as follows: Given training example we first run the encoder and compute , then we sample from the concrete reparameterizable approximation of . We then run the decoder to compute . The ELBO objective, Eq. 5, has two terms: the reconstruction error and KL divergence
. The reconstruction error is estimated by:
while the KL term for a uniform prior is just the sum of entropies (plus a constant):
Due to the fact that we use a reparameterizable approximation, we can compute gradients by backpropagation and optimize.
Several recent works have studied the problem of learning the dynamics of a physical system from simulated trajectories (Battaglia et al., 2016; Guttenberg et al., 2016; Chang et al., 2017) and from generated video data (Watters et al., 2017; van Steenkiste et al., 2018) with a graph neural network. Unlike our work they either assume a known graph structure or infer interactions implicitly.
A number of recent works (Monti et al., 2017; Duan et al., 2017; Hoshen, 2017; Veličković et al., 2018; Garcia & Bruna, 2018; van Steenkiste et al., 2018) parameterize messages in GNNs with a soft attention mechanism (Luong et al., 2015; Bahdanau et al., 2015). This equips these models with the ability to focus on specific interactions with neighbors when aggregating messages. Our work is different from this line of research, as we explicitly perform inference over the latent graph structure. This allows for the incorporation of prior beliefs (such as sparsity) and for an interpretable discrete structure with multiple relation types.
Recent related works on graph-based methods for human motion prediction include (Alahi et al., 2016) where the graph is not learned but is based on proximity and (Le et al., 2017) tries to cluster agents into roles.
An alternative approach to inferring interactions is Granger causality (Granger, 1969). Granger causality however assumes a linear dynamical model and only tries to infer the existence of a causal relation and not to infer relation types.
Our encoder implementation uses fully-connected networks (MLPs) or 1D CNNs with attentive pooling as our message passing function. For our decoder we used fully-connected networks or alternatively a recurrent decoder. Optimization was performed using the Adam algorithm (Kingma & Ba, 2015)
. We provide full implementation details in the appendix. Our implementation uses PyTorch(Paszke et al., 2017) and is available online222https://github.com/ethanfetaya/nri.
We experimented with three simulated systems: particles connected by springs, charged particles and phase-coupled oscillators (Kuramoto model) (Kuramoto, 1975). These settings allow us to attempt to learn the dynamics and interactions when the interactions are known. These systems, controlled by simple rules, can exhibit complex dynamics. For the springs and Kuramoto experiments the objects do or do not interact with equal probability. For the charged particles experiment they attract or repel with equal probability. Example trajectories can be seen in Fig. 4. We generate 50k training examples, and 10k validation and test examples for all tasks. Further details on the data generation and implementation are in the appendix.
We note that the simulations are differentiable and so we can use it as a ground-truth decoder to train the encoder.333We used an external code base (Laszuk, 2017) for stable integration of the Kuramoto ODE and therefore do not have access to gradient information in this particular simulation. The charged particles simulation, however, suffers from instability which led to some performance issues when calculating gradients; see appendix for further details.
|NRI (full graph)||1.66e-5||1.64e-3||6.31e-3||1.09e-3||3.78e-3||9.24e-3||2.15e-2||5.19e-2||8.96e-2|
|NRI (true graph)||1.69e-11||1.32e-9||7.06e-6||1.04e-3||3.03e-3||5.71e-3||1.35e-2||1.54e-2||2.19e-2|
We ran our NRI model on all three simulated physical systems and compared our performance, both in future state prediction and in accuracy of estimating the edge type in an unsupervised manner.
For edge prediction, we compare to the “gold standard” i.e. training our encoder in a supervised way given the ground-truth labels. We also compare to the following baselines: Our NRI model with the ground-truth simulation decoder, NRI (sim.), and two correlation based baselines, Corr. (path) and Corr. (LSTM). Corr. (path) estimates the interaction graph by thresholding the matrix of correlations between trajectory feature vectors. Corr. (LSTM) trains an LSTM (Hochreiter & Schmidhuber, 1997) with shared parameters to model each trajectory individually and calculates correlations between the final hidden states to arrive at an interaction matrix after thresholding. We provide further details on these baselines in the appendix
Results for the unsupervised interaction recovery task are summarized in Table 1. As can be seen, the unsupervised NRI model, NRI (learned), greatly surpasses the baselines and recovers the ground-truth interaction graph with high accuracy on most tasks. For the springs model our unsupervised method is comparable to the supervised “gold standard” benchmark. We note that our supervised baseline is similar to the work by (Santoro et al., 2017), with the difference being that we perform multiple rounds of message passing in the graph.
For future state prediction we compare to the static baseline, i.e. , two LSTM baselines, and a full graph baseline. One LSTM baseline, marked as “single”, runs a separate LSTM (with shared weights) for each object. The second, marked as “joint” concatenates all state vectors and feeds it into one LSTM that is trained to predict all future states simultaneously. Note that the latter will only be able to operate on a fixed number of objects (in contrast to the other models). In the full graph baseline, we use our message passing decoder on the fully-connected graph without edge types, i.e. without inferring edges. This is similar to the model used in (Watters et al., 2017). We also compare to the “gold standard” model, denoted as NRI (true graph), which is training only a decoder using the ground-truth graph as input. The latter baseline is comparable to previous works such as interaction networks (Battaglia et al., 2016).
In order to have a fair comparison, we generate longer test trajectories and only evaluate on the last part unseen by the encoder. Specifically, we run the encoder on the first 49 time steps (same as in training and validation), then predict with our decoder the following 20 unseen time steps. For the LSTM baselines, we first have a “burn-in” phase where we feed the LSTM the first 49 time steps, and then predict the next 20 time steps. This way both algorithms have access to the first 49 steps when predicting the next 20 steps. We show mean squared error (MSE) results in Table 2, and note that our results are better than using LSTM for long term prediction. Example trajectories predicted by our NRI (learned) model for up to 50 time steps are shown in Fig. 5.
For the Kuramoto model, we observe that the LSTM baselines excel at smoothly continuing the shape of the waveform for short time frames, but fail to model the long-term dynamics of the interacting system. We provide further qualitative analysis for these results in the appendix.
It is interesting to note that the charged particles experiment achieves an MSE score which is on par with the NRI model given the true graph, while only predicting 82.6% of the edges accurately. This is explained by the fact that far away particles have weak interactions, which have only small effects on future prediction. An example can be seen in Fig. 5 in the top row where the blue particle is repelled instead of being attracted.
The CMU Motion Capture Database (CMU, 2003) is a large collection of motion capture recordings for various tasks (such as walking, running, and dancing) performed by human subjects. We here focus on recorded walking motion data of a single subject (subject #35). The data is in the form of 31 3D trajectories, each tracking a single joint. We split the different walking trials into non-overlapping training (11 trials), validation (4 trials) and test sets (7 trials). We provide both position and velocity data. See appendix for further details. We train our NRI model with an MLP encoder and RNN decoder on this data using 2 or 4 edge types where one edge type is “hard-coded” as non-edge, i.e. messages are only passed on the other edge types. We found that experiments with 2 and 4 edge types give almost identical results, with two edge types being comparable in capacity to the fully connected graph baseline while four edge types (with sparsity prior) are more interpretable and allow for easier visualization.
We find that the learned graph depends on the particular phase of the motion (Fig. 5), which indicates that the ideal underlying graph is dynamic. To account for this, we dynamically re-evaluate the NRI encoder for every time step during testing, effectively resulting in a dynamically changing latent graph that the decoder can utilize for more accurate predictions. In initial experiments we found that this dynamic re-evaluation strategy significantly improves predictive performance in this setting.
The qualitative results for our method and the same baselines used in Sec. 5.1 can be seen in Fig. 6. As one can see, we outperform the fully-connected graph setting in long-term predictions, and both models outperform the LSTM baselines. One interesting observation is that the skeleton graph is quite suboptimal, which is surprising as the skeleton is the “natural” graph. When examining the edges found by our model (trained with 4 edge types and a sparsity prior) we see an edge type that mostly connects a hand to other extremities, especially the opposite hand, as seen in Fig. 5. This can seem counter-intuitive as one might assume that the important connections are local, however we note that some leading approaches for modeling motion capture data (Jain et al., 2016) do indeed include hand to hand interactions.
The first edge type is “hard-coded” as non-edge and was trained with a prior probability of 0.91. All other edge types received a prior of 0.03 to favor sparse graphs that are easier to visualize. We visualize test data not seen during training.. Skeleton shown for reference. Red arrowheads denote directionality of a learned edge. The edge type shown favors a specific hand depending on the state of the movement and gathers information mostly from other extremities.
The National Basketball Association (NBA) uses the SportVU tracking system to collect player tracking data, where each frame contains the location of all ten players and the ball. Similar to our previous experiments, we test our model on the task of future trajectory prediction. Since the interactions between players are dynamic, and our current formulation assumes fixed interactions during training, we focus on the short Pick and Roll (PnR) instances of the games. PnR is one of the most common offensive tactics in the NBA where an offensive player sets a screen for the ball handler, attempting to create separation between the ball handler and his matchup.
We extracted 12k segments from the 2016 season and used 10k, 1k, 1k for training, validation, and testing respectively. The segments are 25 frames long (i.e. 4 seconds) and consist of only 5 nodes: the ball, ball hander, screener, and defensive matchup for each of the players.
We trained a CNN encoder and a RNN decoder with 2 edge types. For fair comparison, and because the trajectory continuation is not PnR anymore, the encoder is trained on only the first 15 time steps (as deployed in testing). Further details are in the appendix. Results for test MSE are shown in Figure 6. Our model outperforms a baseline LSTM model, and is on par with the full graph.
To understand the latent edge types we show in Fig. 8 how they are distributed between the players and the ball. As we can see, one edge type mostly connects ball and ball handler (off-ball) to all other players, while the other is mostly inner connections between the other three players. As the ball and ball handler are the key elements in the PnR play, we see that our model does learn an important semantic structure by separating them from the rest.
In this work we introduced NRI, a method to simultaneously infer relational structure while learning the dynamical model of an interacting system. In a range of experiments with physical simulations we demonstrate that our NRI model is highly effective at unsupervised recovery of ground-truth interaction graphs. We further found that it can model the dynamics of interacting physical systems, of real motion tracking and of sports analytics data at a high precision, while learning reasonably interpretable edge types.
Many real-world examples, in particular multi-agent systems such as traffic, can be understood as an interacting system where the interactions are dynamic. While our model is trained to discover static interaction graphs, we demonstrate that it is possible to apply a trained NRI model to this evolving case by dynamically re-estimating the latent graph. Nonetheless, our solution is limited to static graphs during training and future work will investigate an extension of the NRI model that can explicitly account for dynamic latent interactions even at training time.
The authors would like to thank the Toronto Raptors and the NBA for the use of the SportVU data. We would further like to thank Christos Louizos and Elise van der Pol for helpful discussions. This project is supported by the SAP Innovation Center Network.
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
One-shot imitation learning.In Advances in Neural Information Processing Systems (NIPS), 2017.
Structural-rnn: Deep learning on spatio-temporal graphs.In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.In International Conference on Learning Representations (ICLR), 2017.
Relational neural expectation maximization: Unsupervised discovery of objects and their interactions.In International Conference on Learning Representations (ICLR), 2018.
From the results in our main paper it became evident that a simple LSTM model excel at predicting the dynamics of a network of phase-coupled oscillators (Kuramoto model) for short periods of time, while predictive performance deteriorates for longer sequences. It is interesting to compare the qualitative predictive behavior of this fully recurrent model with our NRI (learned) model that models the state at time solely based on the state at time t and the learned latent interaction graph. In Fig. 9 we provide visualizations of model predictions for the LSTM (joint) and the NRI (learned) model, compared to the ground truth continuation of the simulation.
It can be seen that the LSTM model correctly captures the shape of the sinusoidal waveform but fails to model the phase dynamics that arise due to the interactions between the oscillators. Our NRI (learned) model captures the qualitative behavior of the original coupled model at a very high precision and only in some cases slightly misses the phase dynamics (e.g. in the purple and green curve in the lower right plot). The LSTM model rarely matches the phase of the ground truth trajectory in the last few time steps and often completely goes “out of sync” by up to half a wavelength.
In Fig. 10 we visualize predictions of a trained NRI model with learned latent graph for the motion capture dataset. We show 30 predicted time steps of future movement, conditioned on 49 time steps that are provided as ground truth to the model. It can be seen that the model can capture the overall form of the movement with high precision. Mistakes (e.g. the misplaced toe node in frame 30) are possible due to the accumulation of small errors when predicting over long sequences with little chance of recovery. Curriculum learning schemes where noise is gradually added to training sequences can potentially alleviate this issue.
We show examples of three pick and roll trajectories in Fig. 11. In the left column we show the ground truth, in the middle we show our prediction and in the right we show the edges that where sampled by our encoder. As we can see even when our model does not predict the true future path, which is extremely challenging for this data, it still makes semantically reasonable predictions. For example in the middle row it predicts that the player defending the ball handler passes between him and the screener (going over the screen) which is a reasonable outcome even though in reality the defenders switched players.
We simulate particles (point masses) in a 2D box with no external forces (besides elastic collisions with the box). We randomly connect, with probability , each pair of particles with a spring. The particles connected by springs interact via forces given by Hooke’s law where is the force applied to particle by particle , is the spring constant and is the 2D location vector of particle . The initial location is sampled from a Gaussian , and the initial velocity is a random vector of norm . Given the initial locations and velocity we can simulate the trajectories by solving Newton’s equations of motion PDE. We do this by leapfrog integration using a step size of 0.001 and then subsample each 100 steps to get our training and testing trajectories.
We note that since the leapfrog integration is differentiable, we are able to use it as a ground-truth decoder and back-propagate through it to train the encoder. We implemented the leapfrog integration in PyTorch, which allows us to compare model performance with a learned decoder versus the ground-truth simulation decoder.
Similar to the springs model, we simulate particles in a 2D box, but instead of springs now our particles carry positive or negative charges , sampled with uniform probability, and interact via Coulomb forces: where is some constant. Unlike the springs simulations, here every two particles interact, although the interaction might be weak if they stay far apart, but they can either attract or repel each other.
Since the forces diverge when the distance between particles goes to zero, this can cause issues when integrating with a fixed step size. The problem might be solved by using a much smaller step size, but this would slow the generation considerably. To circumvent this problem, we clip the forces to some maximum absolute value. While not being exactly physically accurate, the trajectories are indistinguishable to a human observer and the generation process is now stable.
The force clipping does, however, create a problem for the simulation ground-truth decoder, as gradients become zero when the forces are clipped during the simulation. We attempted to fix this by using “soft” clipping with a function in the differentiable simulation decoder, but this similarly resulted in vanishing gradients once the model gets stuck in an unfavorable regime with large forces.
The Kuramoto model is a nonlinear system of phase-coupled oscillators that can exhibit a range of complicated dynamics based on the distribution of the oscillators’ internal frequencies and their coupling strengths. We use the common form for the Kuramoto model given by the following differential equation:
with phases , coupling constants , and intrinsic frequencies . We simulate 1D trajectories by solving Eq. (23) with a fourth-order Runge-Kutta integrator with step size .
We simulate phase-coupled oscillators in 1D with intrinsic frequencies and initial phases sampled uniformly from and , respectively. We randomly, with probability of , connect pairs of oscillators and (undirected) with a coupling constant . All other coupling constants are set to 0. We subsample the simulated by a factor of 10 and create trajectories by concatenating , , and the intrinsic frequencies (copied for every time step as are static).
We will describe here the details of our encoder and decoder implementations.
The message passing operations and can be evaluated in parallel for all nodes (or edges) in the graph and allow for an efficient vectorized implementation. More specifically, the node-to-edge message passing function can be vectorized as:
with and defined analogously (layer index omitted), where and are the total number of features and edges, respectively. denotes transposition. Both message passing matrices are dependent on the graph structure and can be computed in advance if the underlying graph is static. is a sparse binary matrix with when the -th node is connected to the -th edge (arbitrary ordering) via an incoming link and otherwise. is defined analogously for outgoing edges.
Similarly, we can vectorize the edge-to-node message passing function as:
with . For large sparse graphs (e.g. by constraining interactions to nearest neighbors), it can be beneficial to make use of sparse-dense matrix multiplications, effectively allowing for an algorithm.
The basic building block of our MLP encoder is a 2-layer MLP with hidden and output dimension of 256, with batch normalization, dropout, and ELU activations. Given this, the forward model for our encoder is given by the code snippet in Fig.12. The node2edge module returns for each edge the concatenation of the receiver and sender features. The edge2node module accumulates all incoming edge features via a sum.
The CNN encoder uses another block which performs 1D convolutions with attention. This allows for encoding with changing trajectory size, and is also appropriate for tasks like the charged particle simulations when the interaction can be strong for a small fraction of time. The forward computation of this module is presented in Fig. 13 and the overall decoder in Fig. 14.
In Fig. 15 we present the code for a single time-step prediction using our MLP decoder for Markovian data.
All experiments were run using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0005, decayed by a factor of 0.5 every 200 epochs. Unless otherwise noted, we train with a batch size of 128. The concrete distribution is used with. During testing, we replace the concrete distribution with a categorical distribution to obtain discrete latent edge types. Physical simulation and sports tracking experiments were run for 500 training epochs. For motion capture data we used 200 training epochs, as models tended to converge earlier. We saved model checkpoints after every epoch whenever the validation set performance (measured by path prediction MSE) improved and loaded the best performing model for test set evaluation. We observed that using significantly higher learning rates than 0.0005 often produced suboptimal decoders that ignored the latent graph structure.
The springs, charged particles and Kuramoto datasets each contain 50k training instances and 10k validation and test instances. Training and validation trajectories where of length 49 while test trajectories continue for another 20 time steps (50 for visualization). We train an MLP encoder for the springs experiment, and CNN encoder for the charged particles and Kuramoto experiments. All experiments used MLP decoders and two edge types. For the Kuramoto model experiments, we explicitly hard-coded the first edge type as a “non-edge”, i.e. no messages are passed along edges of this type.
As noted previously, all of our MLPs have hidden and output dimension of 256. The overall input/output dimension of our model is 4 for the springs and charged particles experiments (2D position and velocity) and 3 for the Kuramoto model experiments (phase-difference, amplitude and intrinsic frequency). During training, we use teacher forcing in every 10-th time step (i.e. every 10th time step, the model receives a ground truth input, otherwise it receives its previous prediction as input). As we always have two edge types in these experiments and their ordering is abitrary (apart from the Kuramoto model where we assign a special role to edge type 1), we choose the ordering for which the accuracy is highest.
In edge recovery experiments, we report the following baselines along with the performance of our NRI (learned) model:
Corr. (path): We calculate a correlation matrix , where with being the covariance between all trajectories and (for objects and ) in the training and validation sets. We determine an ideal threshold so that if and otherwise, based on predictive accuracy on the combined training and validation set. denotes the presence of an interaction edge (arbitrary type) between object and . We repeat the same procedure for the absolute value of , i.e. if and otherwise. Lastly, we pick whichever of the two ( or ) produced the best match with the ground truth graph (i.e. highest accuracy score) and report test set accuracy with this setting.
Corr. (LSTM): Here, we train a two-layer LSTM with shared parameters and 256 hidden units that models each trajectory individually. It is trained to predict the position and velocity for every time step directly and is conditioned on the previous time steps. The input to the model is passed through a two-layer MLP (256 hidden units and ReLU activations) before it is passed to the LSTM, similarly we pass the LSTM output (last time step) through a two-layer MLP (256 hidden units and ReLU activation on the hidden layer). We provide ground truth trajectory information as input at every time step. We train to minimize MSE between model prediction and ground truth path. We train this model for 10 epochs and finally apply the same correlation matrix procedure as in Corr. (path), but this time calculating correlations between the output of the second LSTM layer at the last time step (instead of using the raw trajectory features). The LSTM is only trained on the training set. The optimal correlation threshold is estimated using the combined training and validation set.
NRI (sim.): In this setting, we replace the decoder of the NRI model with the ground-truth simulator (i.e. the integrator of the Newtonian equations of motion). We implement both the charged particle and the springs simulator in PyTorch which gives us access to gradient information. We train the overall model with the same settings as the original NRI (learned) model by backpropagating directly through the simulator. We find that for the springs simulation, a single leap-frog integration step is sufficient to closely approximate the trajectory of the original simulation, which was generated with 100 leap-frog steps per time step. For the charged particle simulation, 100 leap-frog steps per time step are necessary to match the original trajectory when testing the simulation decoder in isolation. We find, however, that due to the force clipping necessary to stabilize the original charged particle simulation, gradients will often become zero, making model training difficult or infeasible.
Supervised: For this baseline, we train the encoder in isolation and provide ground-truth interaction graphs as labels. We train using a cross-entropy error and monitor the validation accuracy (edge prediction) for model checkpointing. We train with dropout of on the hidden layer representation of every MLP in the encoder model, in order to avoid overfitting.
Here, we use the following baselines along with our NRI (learned) model:
Static: This baseline simply copies the previous state vector .
LSTM (single): Same as the LSTM model in Corr. (LSTM), but trained to predict the state vector difference at every time step (as in the NRI model). Instead of providing ground truth input at every time step, we use the same training protocol as for an NRI model with recurrent decoder (see main paper).
LSTM (joint): This baseline differs from LSTM (single) in that it concatenates the input representations from all objects after passing them through the input MLP. This concatenated representation is fed into a single LSTM where the hidden unit number is multiplied by the number of objects—otherwise same setting as LSTM (single). The output of the second LSTM layer at the last time step is then divided into vectors of same size, one for each object, and fed through the output MLP to predict the state difference for each object separately. LSTM (joint) is trained with same training protocol as the LSTM (single) model.
NRI (full graph): For this model, we keep the latent graph fixed (fully-connected on edge type 2; note that edge types are exclusive, i.e. edges of type 1 are not present in this case) and train the decoder in isolation in the otherwise same setting as the NRI (learned) model.
NRI (true graph): Here, we train the decoder in isolation and provide the ground truth interaction graph as latent graph representation.
Our extracted motion capture dataset has a total size of 8,063 frames for 31 tracked points each. We normalize all features (position/velocity) to maximum absolute value of 1. Training and validation set samples are 49 frames long (non-overlapping segments extracted from the respective trials). Test set samples are 99 frames long. In the main paper, we report results on the last 50 frames of this test set data.
We choose the same hyperparameter settings as in the physical simulation experiments, with the exception that we train models for 200 epochs and with a batch size of 8. Our model here uses an MLP encoder and an RNN decoder (as the dynamics are not Markovian). We further take samples from the discrete distribution during the forward pass in training and calculate gradients via the concrete relaxation. The baselines are identical to before (path prediction experiments for physical simulations) with the following exception: For LSTM (joint) we choose a smaller hidden layer size of 128 units and train with a batch size of 1, as the model did otherwise not fit in GPU memory.
For the NBA data each example is a 25 step trajectory of a pick and roll (PnR) instance, subsampled from the original 25 frames-per-second SportVU data. Unlike the physical simulation where the dynamics of the interactions do not change over time and the motion capture data where the dynamics are approximately periodic, the dynamics here change considerably over time. The middle of the trajectory is, more or less, the pick and roll itself and the behavior before and after are quite different. This poses a problem for fair comparison, as it is wrong to just look at the next time steps, i.e. after the PnR event, for our test data since they are quite different from our training data. If we train our model on full sequences and at test time only show the encoder the first 17 and predict the last 8 steps, we get fairly bad results. This is not surprising as there is a big gap between the way we train and test the model. To overcome this issue and still get a fair comparison we show our encoder only the first 17 time steps during training and then try to reconstruct all 25 time steps.
We used a CNN encoder and RNN decoder with two edge types to have comparable capacity to the full graph model. If we “hard code” one edge type to represent “non-edge” then our model learns the full graph as all players are highly connected. We also experimented with 10 and 20 edge types which did not perform as well on validation data, probably due to over-fitting.