1 Introduction
In categorical probabilistic forecasting, we seek to predict a discrete probability distribution at some instantaneous time, based on observed timestamped data (Gneiting and Katzfuss, 2014). Consider the example of forecasting the most likely locations of the next earthquake over a finite set of locations at a given time, using the history of earthquake times and locations. We can view the locations as vertices of a graph whose edges represent adjacency. Specifically, the probability of an earthquake at a node in the near future is mostly influenced by the probability of earthquakes at nodes within its neighborhood. This type of graphical structure also appears in other problems, including traffic forecasting (Yu et al., 2018), information diffusion in social networks (Bakshy et al., 2012), epidemic diffusion (Wang et al., 2021; Huang et al., 2010), and urban conflict patterns (Linderman and Adams, 2014), and is an example of a marked temporal point process (Daley and Vere-Jones, 2003).
In this paper, we consider categorical probabilistic forecasts where there is a graphical structure to inform us of the local dynamics governing
the forecast over time. We formalize the intuition that each component of the probability vector obeys local dynamics using the differential equation
(1) 
which we use to inform our model’s inductive bias. Here, governs the local dynamics, denotes the set of neighboring nodes of , and denotes the probability at node . To capture the equivariant local dynamics of our forecast, we propose Gopher, a model that learns a neural ODE (Chen et al., 2018) with graph neural network (GNN) (Wu et al., 2021) dynamics.
Our method Gopher introduces two inductive biases to aid probabilistic forecasting over graph-structured categories: 1) utilizing graph structure explicitly and 2) introducing temporal evolution through a neural ODE. To disentangle the benefits of these two biases, we introduce two baseline models, each ablating one bias. We find that utilizing the known graph structure is key, yielding 10x improvements in accuracy and sample efficiency. On the other hand, explicitly modelling the temporal dynamics surprisingly yields little benefit.
2 Gopher: Forecasting with temporal dynamics and graph structure
Let be a graph, and let denote the timestamp of an event at node . Given and an irregularly sampled dataset , we want to learn the probability vector of each at any time . We wish to model the dynamics of such that the change in the probability at node depends only on the neighborhood around , as described in Equation 1. However, directly parameterizing from Equation 1 with a neural ODE can violate conservation of probability: .
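To make the conservation issue concrete, the sketch below evaluates an unconstrained local rate function on a toy 4-node ring and checks whether the componentwise derivatives sum to zero. The function `local_rate`, the graph, and the probability values are illustrative assumptions and are not taken from the paper's model.

```python
import math

# Toy undirected ring graph on 4 nodes: node -> list of neighbors (assumed example).
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

# A valid probability vector over the 4 nodes.
p = [0.7, 0.1, 0.1, 0.1]


def local_rate(p_i, p_neighbors):
    """Hypothetical unconstrained local dynamics: uses only node i and its neighbors."""
    return math.tanh(sum(p_neighbors) - p_i)


# dp_i/dt for every node, computed from local information only.
dp_dt = [local_rate(p[i], [p[j] for j in neighbors[i]]) for i in neighbors]

# Conservation of probability requires these derivatives to sum to zero.
total_rate = sum(dp_dt)
print(f"sum of dp_i/dt = {total_rate:.4f}")
```

Because nothing ties the per-node rates together, the printed sum is generally nonzero, so integrating such dynamics would move the vector off the probability simplex.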
Instead of explicitly enforcing the sum constraint in our neural ODE, we model the dynamics in a continuous-time embedding space from which we derive the dynamics . Specifically, let , where denotes the embedding space dimension. We use to denote row of at initial time , corresponding to the embedding of node . We then model the dynamics of the continuous-time embeddings via
(2) 
where is the learned graph neural network (GNN) dynamics. To map to a probability space while preserving equivariance, we learn a shared projection such that
(3) 
Provided that and are differentiable, which can be ensured by using smooth activation functions, our model implicitly models the local temporal dynamics of our problem in Equation 1. Finally, we train our model Gopher by maximizing the log likelihood with respect to the parameters of , , and the initial condition .
Incorporating node attributes.
In some cases may have node attributes for each node that affect the interaction dynamics, such as the geographical coordinates of each node in a spatial graph or the demographics of a user in a social network. Node attributes can be easily incorporated by letting the initial node embeddings be a learned function of the attributes, , and optimizing with respect to the parameters of .
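The embedding-space construction of Equations 2 and 3 can be illustrated with a small sketch: node embeddings evolve under neighbor-dependent dynamics, a single projection is applied to every node, and the results are normalized over nodes. Everything concrete here is an assumption made for illustration: the softmax normalization, the forward Euler solver, the stand-in `embedding_rate` function (a real model would use a learned GNN), and all numeric values.

```python
import math

# Assumed toy setup: 3-node path graph, embedding dimension d = 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
Z0 = [[0.5, -0.2], [0.0, 0.3], [-0.4, 0.1]]  # initial embeddings, one row per node
w_proj = [1.0, -1.0]  # shared projection weights (hypothetical; would be learned)


def embedding_rate(Z):
    """Stand-in for GNN dynamics: each row changes based on itself and its neighbors."""
    rate = []
    for i, z_i in enumerate(Z):
        agg = [sum(Z[j][k] for j in neighbors[i]) for k in range(len(z_i))]
        rate.append([math.tanh(a - z) for a, z in zip(agg, z_i)])
    return rate


def integrate(Z, t_end, steps=100):
    """Forward Euler integration of the embedding ODE from time 0 to t_end."""
    dt = t_end / steps
    for _ in range(steps):
        dZ = embedding_rate(Z)
        Z = [[z + dt * dz for z, dz in zip(z_i, dz_i)] for z_i, dz_i in zip(Z, dZ)]
    return Z


def node_probabilities(Z):
    """Apply the same projection to every node, then softmax over nodes."""
    logits = [sum(w * z for w, z in zip(w_proj, z_i)) for z_i in Z]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(events):
    """Sum of log-probabilities of the observed node at each event time."""
    return sum(math.log(node_probabilities(integrate(Z0, t))[v]) for t, v in events)


p = node_probabilities(integrate(Z0, t_end=1.0))
ll = log_likelihood([(0.2, 0), (0.7, 1), (1.0, 2)])
print(p, ll)
```

Whatever the embedding dynamics do, the normalization step guarantees a valid probability vector at every time, which is the reason for working in embedding space rather than constraining the ODE directly.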
Related works.
Our paper lies at the intersection of probabilistic forecasting, neural ODEs, and graph neural networks (GNNs), and can be seen as the discrete analogue of continuous normalizing flows (Chen et al., 2018; Grathwohl et al., 2018; Chen et al., 2020) on manifolds (Lou et al., 2020; Mathieu and Nickel, 2020). Probabilistic forecasting seeks to predict a full distribution at each time step (Hyndman and Athanasopoulos, 2018; Gneiting and Katzfuss, 2014), with contemporary methods often relying on deep probabilistic models (Salinas et al., 2020; Wang et al., 2019; Rasul et al., 2020; Rangapuram et al., 2018). A direct application of categorical probabilistic forecasts is marked temporal point processes, which learn the rate of an event type at time , summarized by the conditional intensity function (Daley and Vere-Jones, 2003). The inductive bias of a learnable ODE with GNN dynamics has also been explored in the context of other problems, including graph generation (Deng et al., 2019), node classification (Poli et al., 2021; Chamberlain et al., 2021), multi-particle trajectory prediction (Poli et al., 2021), learning partial differential equations (Iakovlev et al., 2020), and knowledge graph forecasting (Han et al., 2021).
3 Results: I Can’t Believe Temporal Dynamics Don’t Matter!
Synthetic datasets.
We apply our method to model the mark component of a marked temporal point process (TPP) occurring on the nodes of a graph such that is the probability of an event occurring on vertex at time . We create a synthetic dataset where events occur over time on a directed graph , with node probabilities that obey graph advection as an example of local dynamics (Chapman and Mesbahi, 2011). Graph advection conserves the total probability by ensuring . We represent the graph by the weighted adjacency matrix , where for . We sample sequences of events over time from a homogeneous Poisson process with constant temporal intensity and temporal node probability governed by the graph advection equation (Chapman and Mesbahi, 2011)
(4) 
Here, denotes the out-degree graph Laplacian, and denotes the diagonal out-degree matrix with .
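As a rough illustration of graph advection, the sketch below builds an out-degree Laplacian for a small directed ring and integrates the resulting linear ODE with forward Euler. The indexing convention (`A[i][j]` is the weight of the edge from node j to node i), the unit edge weights, the sign and orientation of the Laplacian, and the solver are all assumptions chosen so that total probability is conserved; they may differ from the exact form of Equation 4.

```python
# Assumed convention: A[i][j] > 0 is the weight of the directed edge j -> i,
# so probability mass flows from node j into node i.
n = 4
A = [[0.0] * n for _ in range(n)]
for j in range(n):
    A[(j + 1) % n][j] = 1.0  # directed ring: j -> j + 1

# Diagonal out-degree entries: total weight leaving each node j.
out_degree = [sum(A[i][j] for i in range(n)) for j in range(n)]

# Out-degree Laplacian L = D_out - A under the convention above.
L = [[(out_degree[i] if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]


def advection_rate(p):
    """Right-hand side dp/dt = -L p."""
    return [-sum(L[i][j] * p[j] for j in range(n)) for i in range(n)]


def simulate(p, t_end, steps=1000):
    """Forward Euler integration of the advection ODE from time 0 to t_end."""
    dt = t_end / steps
    for _ in range(steps):
        dp = advection_rate(p)
        p = [x + dt * d for x, d in zip(p, dp)]
    return p


p0 = [1.0, 0.0, 0.0, 0.0]  # all mass starts on node 0
p1 = simulate(p0, t_end=1.0)
print(p1, sum(p1))
```

Each column of `L` sums to zero under this convention, so the entries of `advection_rate(p)` sum to zero for any `p` and the total probability stays at one while mass is transported around the ring.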
We create two graph structures for our synthetic datasets, a ring graph and a random geometric graph, and visualize their advection on the graph over time in Figure 1; see Appendix D for more details on their construction. We also visualize the advection dynamics of each component of for the ring graph in Figure 9 of Appendix D. We use seconds for the ring graph dataset and second for the geometric graph dataset. Since the timestamps are sampled from a Poisson process and not equidistantly spaced over , the continuous-time aspect of the problem is clearly evident in the dataset.
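The sampling scheme described above, with event times from a homogeneous Poisson process and the node of each event drawn from a time-dependent distribution, can be sketched as follows. The function names, the rate, the time horizon, and the toy distribution `toy_probs` are hypothetical placeholders; in the paper's setting the node distribution would come from the advection dynamics instead.

```python
import random


def sample_events(node_probs, rate, t_end, seed=0):
    """Sample (time, node) pairs: Poisson arrival times, node drawn from node_probs(t)."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival times
        if t > t_end:
            return events
        probs = node_probs(t)
        node = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        events.append((t, node))


def toy_probs(t):
    """Hypothetical time-varying distribution over 3 nodes (not the advection solution)."""
    a = min(max(t, 0.0), 1.0)
    return [1.0 - a, 0.5 * a, 0.5 * a]


events = sample_events(toy_probs, rate=50.0, t_end=1.0)
print(len(events), events[:3])
```

The resulting timestamps are irregularly spaced by construction, which is why a model must be able to produce a forecast at arbitrary times rather than on a fixed grid.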
Evaluating each inductive bias of Gopher.
We evaluate the accuracy and sample-efficiency improvements from incorporating graph structure and modelling temporal dynamics in Gopher. To disentangle the effects of these two inductive biases, we compare our model to two baseline models. The first is a two-layer MLP that acts on node embeddings concatenated with time, which has neither of the above inductive biases. The second is a single-layer GNN that also acts on node embeddings concatenated with time. The GNN incorporates the explicit graph structure, but does not incorporate dynamical systems structure. We refer to the models as NaiveMLP and NaiveGNN respectively. In our experiments, Gopher learns using a Graph Isomorphism Network (GIN) layer (Xu et al., 2018) parameterized by a two-layer MLP; we use another two-layer MLP for the projection . NaiveGNN uses the same GIN architecture and projection except that it does not learn a differential equation. Finally, NaiveMLP replaces the GIN layer with a two-layer MLP. See Appendix D for further details on our experiment hyperparameters.
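The comparisons in this section rest on the KL divergence between ground-truth and predicted node distributions, aggregated over time with a geometric mean. A minimal sketch of that metric is given here; the direction of the divergence (truth relative to prediction), the clipping constant `eps`, and the example distributions are assumptions for illustration only.

```python
import math


def kl_divergence(p_true, p_pred, eps=1e-12):
    """KL(p_true || p_pred) for two discrete distributions over the same nodes."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(p_true, p_pred) if p > 0)


def geometric_mean(values):
    """Geometric mean via the mean of logs; all values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical ground truth and predictions at three time points.
truth = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.3, 0.3]]
preds = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]

kls = [kl_divergence(p, q) for p, q in zip(truth, preds)]
summary = geometric_mean(kls)
print(kls, summary)
```

When errors span several orders of magnitude, an arithmetic mean is dominated by the largest values, whereas the geometric mean averages in log space and so reflects the typical error scale.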
Figure 2 shows the KL divergence between the ground truth and the learned predictions over time for the ring graph. We summarize the KL divergence in Figure 3 by its geometric mean over time, since the error varies over multiple orders of magnitude (Finzi et al., 2020). In both figures, we show the 95% confidence intervals over 3 seeds. For both datasets, there is a 10x difference in accuracy between the graph-structured models and NaiveMLP, indicating that utilizing the graph structure is greatly beneficial. Though NaiveGNN does not explicitly model the local temporal dynamics of the datasets, it performs nearly identically to our model Gopher in fitting over the training interval. In principle, Gopher has the best chance of extrapolating to the time period not seen during training since Gopher explicitly models the local dynamics. However, Gopher’s poor extrapolation ability suggests that its learned dynamics do not actually reflect the true dynamics. Indeed, in Figure 7 of Appendix C we show that although Gopher can fit the training data well, it is brittle to edge deletions, further indicating that Gopher does not learn the true dynamics.
Real-world dataset.
We use data released publicly by the New York Times (Times, 2021) on daily COVID-19 cases in the state of New Jersey to construct a real-world categorical probabilistic forecasting dataset, following the preprocessing script of Chen et al. (2020). We aggregate the cases by county and form a graph with 21 nodes where each node is a county and each edge is a county border. Using the train/test split from Chen et al. (2020), we obtain per-event log likelihoods with one standard deviation of for NaiveMLP, for NaiveGNN, and for Gopher over 3 seeds. However, these likelihoods are not representative of the model differences since we find a large distribution shift between the train and test distributions, shown in Figure 5 of Appendix A. This distribution shift causes the models to perform equally poorly on the test set. In actuality, NaiveMLP completely fails to capture variations in the forecast over time, as shown in Figure 4.
4 Discussion
Although the inductive biases of Gopher directly reflect properties of categorical forecasting with local continuous-time dynamics, our experiments find that, surprisingly, explicitly modelling the temporal dynamics does not improve performance. Most of the performance gains of Gopher come from incorporating graph structure, which can be done with a simple baseline model like NaiveGNN. The failure of Gopher can be attributed to the fact that the learned dynamics in the embedding space do not accurately reflect the ground truth dynamics in probability space.
References
 Bakshy et al. [2012] Eytan Bakshy, Itamar Rosenn, Cameron Marlow, and Lada Adamic. The role of social networks in information diffusion. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 519–528, New York, NY, USA, April 2012. Association for Computing Machinery. ISBN 9781450312295. doi: 10.1145/2187836.2187907.

 Chamberlain et al. [2021] Ben Chamberlain, James Rowbottom, Maria I. Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi. GRAND: Graph Neural Diffusion. In International Conference on Machine Learning, pages 1407–1418. PMLR, July 2021.
 Chapman and Mesbahi [2011] Airlie Chapman and Mehran Mesbahi. Advection on graphs. IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), 50:1461–1466, December 2011.
 Chen et al. [2018] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
 Chen et al. [2020] Ricky T. Q. Chen, Brandon Amos, and Maximilian Nickel. Neural Spatio-Temporal Point Processes. In International Conference on Learning Representations, September 2020.
 Daley and Vere-Jones [2003] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Probability and Its Applications, An Introduction to the Theory of Point Processes. Springer-Verlag, New York, second edition, 2003. ISBN 9780387955414. doi: 10.1007/b97277.
 Deng et al. [2019] Zhiwei Deng, Megha Nawhal, Lili Meng, and Greg Mori. Continuous Graph Flow. arXiv:1908.02436 [cs, stat], September 2019.
 Dupont et al. [2019] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented Neural ODEs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
 Finzi et al. [2020] Marc Finzi, Ke Alexander Wang, and Andrew G. Wilson. Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints. Advances in Neural Information Processing Systems, 33:13880–13889, 2020.
 Gneiting and Katzfuss [2014] Tilmann Gneiting and Matthias Katzfuss. Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1(1):125–151, 2014. doi: 10.1146/annurev-statistics-062713-085831.
 Grathwohl et al. [2018] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models. In International Conference on Learning Representations, September 2018.
 Han et al. [2021] Zhen Han, Zifeng Ding, Yunpu Ma, Yujia Gu, and Volker Tresp. Temporal Knowledge Graph Forecasting with Neural ODE. arXiv:2101.05151 [cs], August 2021.
 Huang et al. [2010] Wenzhang Huang, Maoan Han, and Kaiyu Liu. Dynamics of an SIS reaction-diffusion epidemic model for disease transmission. Mathematical Biosciences & Engineering, 7(1):51, 2010. doi: 10.3934/mbe.2010.7.51.
 Hyndman and Athanasopoulos [2018] Robin John Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, Australia, 2nd edition, 2018.
 Iakovlev et al. [2020] Valerii Iakovlev, Markus Heinonen, and Harri Lähdesmäki. Learning continuous-time PDEs from sparse data with graph neural networks. In International Conference on Learning Representations, September 2020.
 Linderman and Adams [2014] Scott Linderman and Ryan Adams. Discovering Latent Network Structure in Point Process Data. In International Conference on Machine Learning, pages 1413–1421. PMLR, June 2014.
 Lou et al. [2020] Aaron Lou, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser Nam Lim, and Christopher M De Sa. Neural manifold ordinary differential equations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17548–17558. Curran Associates, Inc., 2020.
 Mathieu and Nickel [2020] Emile Mathieu and Maximilian Nickel. Riemannian Continuous Normalizing Flows. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2503–2515. Curran Associates, Inc., 2020.
 Poli et al. [2021] Michael Poli, Stefano Massaroli, Clayton M. Rabideau, Junyoung Park, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Continuous-Depth Neural Models for Dynamic Graph Prediction. arXiv:2106.11581 [cs, stat], June 2021.
 Rangapuram et al. [2018] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep State Space Models for Time Series Forecasting. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
 Rasul et al. [2020] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs M. Bergmann, and Roland Vollgraf. Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows. In International Conference on Learning Representations, September 2020.
 Salinas et al. [2020] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, July 2020. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2019.07.001.
 Times [2021] The New York Times. Coronavirus (Covid-19) Data in the United States, 2021. URL https://github.com/nytimes/covid-19-data.
 Wang et al. [2021] Rui Wang, Danielle Maddix, Christos Faloutsos, Yuyang Wang, and Rose Yu. Bridging physics-based and data-driven modeling for learning dynamical systems. In Ali Jadbabaie, John Lygeros, George J. Pappas, Pablo A. Parrilo, Benjamin Recht, Claire J. Tomlin, and Melanie N. Zeilinger, editors, Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pages 385–398. PMLR, 07 – 08 June 2021. URL https://proceedings.mlr.press/v144/wang21a.html.
 Wang et al. [2019] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. Deep Factors for Forecasting. In International Conference on Machine Learning, pages 6607–6617. PMLR, May 2019.
 Wu et al. [2021] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, January 2021. ISSN 2162-237X, 2162-2388. doi: 10.1109/TNNLS.2020.2978386.
 Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations, September 2018.

 Yu et al. [2018] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pages 3634–3640, Stockholm, Sweden, July 2018. AAAI Press. ISBN 9780999241127.
Appendix A Distribution shift in our realworld dataset
Appendix B Learned forecasts on COVID19 dataset
Appendix C Model robustness
Appendix D Implementation details
We use two-layer MLPs with 64 hidden units per layer whenever we use an MLP. We use Swish activations to ensure smoothness of our dynamics. We also use an augmented neural ODE (Dupont et al., 2019), with 16 of the 64 hidden dimensions serving as augmented dimensions.
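The following sketch shows one way these ingredients could fit together: a Swish activation, a two-layer MLP with 64 units, and a state augmented with 16 extra dimensions. The initialization scheme, the reading of "two-layer" as two linear maps, the zero-initialized augmentation, and the 48 + 16 split of the 64 dimensions are assumptions made for the example.

```python
import math
import random


def swish(x):
    """Swish activation x * sigmoid(x), which is smooth everywhere."""
    return x / (1.0 + math.exp(-x))


def init_linear(n_in, n_out, rng):
    """Random weights and zero biases for one linear layer (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Affine map: weights times x plus bias."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp(layers, x):
    """Apply Swish after every layer except the last."""
    for layer in layers[:-1]:
        x = [swish(v) for v in linear(layer, x)]
    return linear(layers[-1], x)


def augment(z, n_aug=16):
    """Augmented ODE state: append n_aug zero-initialized dimensions."""
    return z + [0.0] * n_aug


rng = random.Random(0)
layers = [init_linear(64, 64, rng), init_linear(64, 64, rng)]
z = augment([0.1] * 48)  # 48 embedding dimensions + 16 augmented = 64
out = mlp(layers, z)
print(len(z), len(out))
```

Swish is infinitely differentiable, unlike ReLU, so a network built from it defines a smooth vector field for the ODE solver.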
D.1 Model architectures.
Gopher.
We parameterize the dynamics from Equation 2 using one graph isomorphism network layer parameterized by an MLP. We also use an MLP to model the projection .
NaiveGNN.
We use the same architecture as Gopher, namely a GIN layer followed by a projection . However, instead of using the model to parameterize ODE dynamics, we directly input the node embeddings concatenated with time through the GNN.
NaiveMLP.
We replace the GIN layer of NaiveGNN with an MLP, keeping all else the same.
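The difference between the three architectures can be summarized in a short sketch. The GIN update used here is the standard one from Xu et al. (2018), where each node's embedding is scaled by (1 + eps), added to the sum of its neighbors' embeddings, and passed through an MLP. The toy graph, the fixed `tiny_mlp` stand-in for a learned MLP, and the single Euler step are illustrative assumptions; the shared projection to probabilities is omitted.

```python
import math

# Toy graph and embeddings (assumed example values).
neighbors = {0: [1, 2], 1: [0], 2: [0]}
H = [[0.2, -0.1], [0.4, 0.0], [-0.3, 0.5]]


def tiny_mlp(x):
    """Stand-in for the learned MLP: maps an input of any length to 2 outputs."""
    s = sum(x)
    return [math.tanh(s), math.tanh(0.5 * s)]


def gin_layer(H, eps=0.0):
    """GIN update: h_i' = MLP((1 + eps) * h_i + sum of neighbor embeddings)."""
    out = []
    for i, h_i in enumerate(H):
        agg = [
            (1.0 + eps) * h_i[k] + sum(H[j][k] for j in neighbors[i])
            for k in range(len(h_i))
        ]
        out.append(tiny_mlp(agg))
    return out


def with_time(H, t):
    """Concatenate the query time onto every node embedding."""
    return [h_i + [t] for h_i in H]


def naive_gnn(H, t):
    """NaiveGNN-style: one GIN pass over [embedding, t]; no ODE is solved."""
    return gin_layer(with_time(H, t))


def naive_mlp(H, t):
    """NaiveMLP-style: the same per-node MLP, with neighbors ignored."""
    return [tiny_mlp(x) for x in with_time(H, t)]


def gopher_step(H, dt):
    """Gopher-style: treat the GIN output as dH/dt and take one Euler step."""
    dH = gin_layer(H)
    return [[h + dt * d for h, d in zip(h_i, d_i)] for h_i, d_i in zip(H, dH)]


print(naive_mlp(H, 0.5))
print(naive_gnn(H, 0.5))
print(gopher_step(H, 0.1))
```

In this framing the two baselines receive time as an extra input feature, whereas the Gopher-style model reaches time t only by integrating its learned dynamics forward from the initial embeddings.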
D.2 Training procedure and dataset details.
To maximize hardware parallelism, we parallelize our neural ODE computation across sequence timesteps and across sequences using the time-reparameterization trick outlined in Chen et al. (2020). For the synthetic datasets, we use the AdamW optimizer with a learning rate of 0.01 and batch size 64 for 30 epochs. For the COVID-19 dataset, we use a learning rate and batch size 4 for 15 epochs. Here, each batch consists of multiple sequences drawn from the training period . We use for the ring graph, for the geometric graph, and for the New Jersey counties graph. We generate the ring graph dataset by using hand-set coefficients for the edge weights to allow for counterclockwise transport. We generate the geometric graph dataset by generating a random geometric graph via the networkx Python package and drawing a random sample of .
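For readers unfamiliar with random geometric graphs, the construction can be sketched in plain Python: points are placed uniformly at random in the unit square and two nodes are connected when they lie within a given distance of each other. This is a self-contained stand-in for the networkx generator mentioned above, and the node count, radius, and seed are hypothetical values, not the paper's settings.

```python
import math
import random


def random_geometric_graph(n, radius, seed=0):
    """Place n points uniformly in the unit square; connect pairs within `radius`."""
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(pos[i], pos[j]) <= radius
    ]
    return pos, edges


pos, edges = random_geometric_graph(n=10, radius=0.4)
print(len(pos), len(edges))
```

Because edges depend only on spatial proximity, neighborhoods in such a graph correspond to nearby locations, which matches the local-dynamics assumption of the forecasting problem.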