GOPHER: Categorical probabilistic forecasting with graph structure via local continuous-time dynamics

12/18/2021
by Ke Alexander Wang, et al.
Amazon, Stanford University

We consider the problem of probabilistic forecasting over categories with graph structure, where the dynamics at a vertex depend on its local connectivity structure. We present GOPHER, a method that combines the inductive bias of graph neural networks with neural ODEs to capture the intrinsic local continuous-time dynamics of our probabilistic forecasts. We study the benefits of these two inductive biases by comparing against baseline models that help disentangle the benefits of each. We find that capturing the graph structure is crucial for accurate in-domain probabilistic predictions and for more sample-efficient models. Surprisingly, our experiments demonstrate that the continuous-time evolution inductive bias brings little to no benefit, despite reflecting the true probability dynamics.


1 Introduction

In categorical probabilistic forecasting, we seek to predict a discrete probability distribution p(t) at some instantaneous time t, based on observed time-stamped data (Gneiting and Katzfuss, 2014). Consider the example of forecasting the most likely locations of the next earthquake over a finite set of locations at time t, given the history of earthquake times and locations. We can view locations as vertices on a graph with edges that represent adjacency. Specifically, the probability of an earthquake at node v in the near future is mostly influenced by the probability of earthquakes at nodes within its neighborhood. This type of graphical structure also appears in other problems, including traffic forecasting (Yu et al., 2018), information diffusion in social networks (Bakshy et al., 2012), epidemic diffusion (Wang et al., 2021; Huang et al., 2010), and urban conflict patterns (Linderman and Adams, 2014), and is an example of a marked temporal point process (Daley and Vere-Jones, 2003).

In this paper, we consider categorical probabilistic forecasts where there is a graphical structure to inform us of the local dynamics governing p(t) over time. We formalize the intuition that each component p_v(t) of the probability vector p(t) obeys local dynamics using the differential equation

    dp_v(t)/dt = f({p_u(t) : u ∈ N(v)}),     (1)

which we use to inform our model's inductive bias. Here, f governs the local dynamics, N(v) denotes the set of neighboring nodes of v, and p_v(t) denotes the probability at node v. To capture the equivariant local dynamics of our forecast p(t), we propose Gopher, a model that learns a neural ODE (Chen et al., 2018) with graph neural network (GNN) (Wu et al., 2021) dynamics.

Our method Gopher introduces two inductive biases to aid with probabilistic forecasting over graph-structured categories: 1) it utilizes the graph structure explicitly, and 2) it introduces temporal evolution through a neural ODE. To disentangle the benefits of these two biases, we introduce two baseline models, each ablating one bias. We find that utilizing the known graph structure is key, yielding 10x improvements in accuracy and sample efficiency. On the other hand, explicitly modelling the temporal dynamics surprisingly yields little benefit.

2 Gopher: Forecasting with temporal dynamics and graph structure

Let G = (V, E) be a graph, and let t_i denote the timestamp of an event at node v_i ∈ V. Given G and an irregularly sampled dataset {(t_i, v_i)}, we want to learn the probability p_v(t) of each node v at any time t. We wish to model the dynamics of p(t) such that the change in the probability at node v depends only on the neighborhood around v, as described in Equation 1. However, directly parameterizing f from Equation 1 with a neural ODE can violate conservation of probability: Σ_v p_v(t) = 1.

Instead of explicitly enforcing the sum constraint in our neural ODE, we model the dynamics in a continuous-time embedding space, from which we derive the dynamics of p(t). Specifically, let Z(t) ∈ R^{|V| × d}, where d denotes the embedding space dimension. We use z_v(t_0) to denote row v of Z(t_0) at initial time t_0, corresponding to the embedding of node v. We then model the dynamics of the continuous-time embeddings via

    dZ(t)/dt = g(Z(t), G),     (2)

where g is the learned graph neural network (GNN) dynamics. To map Z(t) to a probability space while preserving equivariance, we learn a shared projection h such that

    p_v(t) = exp(h(z_v(t))) / Σ_{u ∈ V} exp(h(z_u(t))).     (3)

Provided that g and h are differentiable, which can be done using smooth activation functions, our model implicitly models the local temporal dynamics of our problem in Equation 1. Finally, we train our model Gopher by maximizing the log likelihood with respect to the parameters of g, h, and the initial condition Z(t_0).
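To make the construction concrete, the following is a minimal PyTorch sketch of Equations 2 and 3, assuming the torchdiffeq package for ODE integration. The class and variable names (GNNDynamics, Gopher, hidden_dim) are ours, and the sum-aggregation message-passing layer is a simple stand-in for the GIN layer used in our experiments.

```python
# A minimal sketch of Gopher (Equations 2 and 3), assuming torchdiffeq.
# GNNDynamics, Gopher, hidden_dim, etc. are illustrative names; the
# sum-aggregation layer below stands in for the GIN layer used in the paper.
import torch
import torch.nn as nn
from torchdiffeq import odeint


class GNNDynamics(nn.Module):
    """dZ(t)/dt = g(Z(t), G): each node updates from itself and its neighbors."""

    def __init__(self, adj: torch.Tensor, hidden_dim: int = 64):
        super().__init__()
        self.register_buffer("adj", adj)              # |V| x |V| weighted adjacency
        self.mlp = nn.Sequential(                     # smooth activations keep g differentiable
            nn.Linear(2 * hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, t, z):                          # z: |V| x hidden_dim
        return self.mlp(torch.cat([z, self.adj @ z], dim=-1))


class Gopher(nn.Module):
    def __init__(self, adj: torch.Tensor, hidden_dim: int = 64):
        super().__init__()
        num_nodes = adj.shape[0]
        self.z0 = nn.Parameter(0.1 * torch.randn(num_nodes, hidden_dim))  # learned Z(t_0)
        self.dynamics = GNNDynamics(adj, hidden_dim)
        self.proj = nn.Sequential(                    # shared projection h, applied per node
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, 1),
        )

    def log_likelihood(self, times: torch.Tensor, nodes: torch.Tensor):
        """Sum of log p_{v_i}(t_i) for events (t_i, v_i); times sorted and > 0."""
        ts = torch.cat([times.new_zeros(1), times])   # integrate from t_0 = 0
        zs = odeint(self.dynamics, self.z0, ts)[1:]   # len(times) x |V| x hidden_dim
        logits = self.proj(zs).squeeze(-1)            # len(times) x |V|
        log_p = torch.log_softmax(logits, dim=-1)     # Equation 3 at each event time
        return log_p[torch.arange(len(nodes)), nodes].sum()
```

Training then amounts to maximizing `log_likelihood` over observed event sequences with a standard optimizer such as AdamW.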

Incorporating node attributes.

In some cases G may have node attributes x_v for each node v that affect the interaction dynamics, such as the geographical coordinates of each node in a spatial graph or the demographics of a user in a social network. Node attributes can be easily incorporated by letting the initial node embeddings be a learned function of the attributes, z_v(t_0) = e(x_v), and optimizing with respect to the parameters of e, as sketched below.
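A small illustration of this option, under the same assumptions and naming as the sketch above; AttributeEncoder and attr_dim are hypothetical names:

```python
# Hedged sketch: replace the free parameter Z(t_0) with a learned function of
# node attributes X (one row x_v per node). AttributeEncoder is our name.
import torch.nn as nn


class AttributeEncoder(nn.Module):
    def __init__(self, attr_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x):       # x: |V| x attr_dim node attributes
        return self.net(x)      # initial embeddings Z(t_0), one row per node
```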

Related works.

Our paper lies at the intersection of probabilistic forecasting, neural ODEs, and graph neural networks (GNNs), and can be seen as the discrete analogue of continuous normalizing flows (Chen et al., 2018; Grathwohl et al., 2018; Chen et al., 2020) on manifolds (Lou et al., 2020; Mathieu and Nickel, 2020). Probabilistic forecasting seeks to predict a full distribution at each time step (Hyndman and Athanasopoulos, 2018; Gneiting and Katzfuss, 2014), with contemporary methods often relying on deep probabilistic models (Salinas et al., 2020; Wang et al., 2019; Rasul et al., 2020; Rangapuram et al., 2018). A direct application of categorical probabilistic forecasts is marked temporal point processes, which learn the rate of each event type at time t, summarized by the conditional intensity function (Daley and Vere-Jones, 2003). The inductive bias of a learnable ODE with GNN dynamics has also been explored in the context of other problems, including graph generation (Deng et al., 2019), node classification (Poli et al., 2021; Chamberlain et al., 2021), multi-particle trajectory prediction (Poli et al., 2021), learning partial differential equations (Iakovlev et al., 2020), and knowledge graph forecasting (Han et al., 2021).

3 Results: I Can’t Believe Temporal Dynamics Don’t Matter!

Synthetic datasets.

Figure 1: (Top) Advection on a cyclic graph with edge weights chosen so that the probability mass is transported more strongly in the counter-clockwise direction. (Bottom) Advection on a random geometric graph with randomly chosen edge weights. (Both) Colors are shown in log scale to make the dynamics more visually apparent. Light-gray coloring corresponds to the steady-state probability mass for each node.

We apply our method to model the mark component of a marked temporal point process (TPP) occurring on the nodes of a graph, such that p_v(t) is the probability of an event occurring on vertex v at time t. We create a synthetic dataset where events occur over time on a directed graph G, with node probabilities that obey graph advection as an example of local dynamics (Chapman and Mesbahi, 2011). Graph advection conserves the total probability by ensuring d/dt Σ_v p_v(t) = 0. We represent the graph by the weighted adjacency matrix A, where A_{uv} > 0 for each edge (u, v). We sample sequences of events over time from a homogeneous Poisson process with constant temporal intensity and temporal node probability p(t) governed by the graph advection equation (Chapman and Mesbahi, 2011)

    dp(t)/dt = −L_out^T p(t).     (4)

Here, L_out = D_out − A denotes the out-degree graph Laplacian, and D_out denotes the diagonal out-degree matrix with (D_out)_{vv} = Σ_u A_{vu}.
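For concreteness, the following NumPy/SciPy sketch generates data in the spirit of this setup, under our reading of Equation 4; the graph size, edge weights, Poisson rate, and horizon below are illustrative placeholders rather than the values used in our experiments.

```python
# Sketch of the synthetic data generation: node probabilities p(t) advect on a
# weighted directed graph (Equation 4) while event times come from a
# homogeneous Poisson process. All numerical settings below are illustrative.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Small directed ring with stronger counter-clockwise than clockwise weights.
n = 6
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = 1.0          # counter-clockwise edge
    A[v, (v - 1) % n] = 0.2          # weaker clockwise edge

D_out = np.diag(A.sum(axis=1))       # diagonal out-degree matrix
L_out = D_out - A                    # out-degree graph Laplacian

def node_probs(t, p0):
    """Solve dp/dt = -L_out^T p exactly with the matrix exponential."""
    return expm(-L_out.T * t) @ p0

# One event sequence: Poisson arrival times on [0, T], marks drawn from p(t).
T, rate = 4.0, 5.0
p0 = rng.dirichlet(np.ones(n))                       # initial node distribution
num_events = rng.poisson(rate * T)
times = np.sort(rng.uniform(0.0, T, size=num_events))
nodes = []
for t in times:
    p = node_probs(t, p0)
    nodes.append(rng.choice(n, p=p / p.sum()))       # renormalize against round-off
print(list(zip(times.round(2), nodes)))
```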

We create two graph structures for our synthetic datasets, a ring graph and a random geometric graph, and visualize advection on each graph over time in Figure 1; see Appendix D for more details on their construction. We also visualize the advection dynamics of each component of p(t) for the ring graph in Figure 9 of Appendix D. We use a time horizon of a few seconds for the ring graph dataset and on the order of a second for the geometric graph dataset. Since the timestamps are sampled from a Poisson process and are not equidistantly spaced over the horizon, the continuous-time aspect of the problem is clearly evident in the dataset.

Evaluating each inductive bias of Gopher.


Figure 2: KL divergence between the true probabilities and the predicted probabilities trained on 1024 sampled sequences. Gray is the training set time interval and red is the extrapolation region beyond the training set.

We evaluate the accuracy and sample-efficiency improvements from incorporating graph structure and modelling temporal dynamics in Gopher. To disentangle the effects of these two inductive biases, we compare our model to two baseline models. The first is a two-layer MLP that acts on node embeddings concatenated with time, which has neither of the above inductive biases. The second is a single-layer GNN that also acts on node embeddings concatenated with time; it incorporates the explicit graph structure but not the dynamical systems structure. We refer to these models as NaiveMLP and NaiveGNN respectively. In our experiments, Gopher learns g using a Graph Isomorphism Network (GIN) layer (Xu et al., 2018) parameterized by a two-layer MLP; we use another two-layer MLP for the projection h. NaiveGNN uses the same GIN architecture and projection, except that it does not learn a differential equation. Finally, NaiveMLP replaces the GIN layer with a two-layer MLP. See Appendix D for further details on our experiment hyperparameters.
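A minimal sketch of the NaiveGNN baseline follows, reusing the assumptions and naming of the earlier Gopher sketch; NaiveMLP is obtained by replacing the neighbor aggregation with a plain MLP on the same time-concatenated embeddings.

```python
# Hedged sketch of the NaiveGNN baseline: learned node embeddings concatenated
# with the query time t are mapped directly to node probabilities, with no ODE.
# The sum-aggregation layer again stands in for the GIN layer of the experiments.
import torch
import torch.nn as nn


class NaiveGNN(nn.Module):
    def __init__(self, adj: torch.Tensor, hidden_dim: int = 64):
        super().__init__()
        self.register_buffer("adj", adj)
        self.emb = nn.Parameter(0.1 * torch.randn(adj.shape[0], hidden_dim))
        self.gnn = nn.Sequential(
            nn.Linear(2 * (hidden_dim + 1), hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, t: torch.Tensor):                        # t: scalar tensor
        h = torch.cat([self.emb, t.expand(len(self.emb), 1)], dim=-1)
        h = self.gnn(torch.cat([h, self.adj @ h], dim=-1))     # one neighbor-aggregation layer
        return torch.log_softmax(self.proj(h).squeeze(-1), dim=-1)  # log p(t) over nodes
```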

Figure 2 shows the KL divergence between the ground truth and the learned predictions over time for the ring graph. In Figure 3 we summarize the KL divergence over the evaluation period by its geometric mean, since the error varies over multiple orders of magnitude over time (Finzi et al., 2020). In both figures, we show 95% confidence intervals over 3 seeds. For both datasets, there is a 10x difference in accuracy between the graph-structured models and NaiveMLP, indicating that utilizing the graph structure is greatly beneficial. Though NaiveGNN does not explicitly model the local temporal dynamics of the datasets, it performs nearly identically to our model Gopher in fitting p(t) over the training interval. In principle, Gopher has the best chance of extrapolating to the time period not seen during training, since it explicitly models the local dynamics. However, Gopher's poor extrapolation ability suggests that its learned dynamics do not actually reflect the true dynamics. Indeed, in Figure 7 of Appendix C we show that although Gopher can fit the training data well, it is brittle to edge deletions, further indicating that it does not learn the true dynamics.
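For reference, a small sketch of the summary metric as we understand it, i.e., the geometric mean over evaluation times of the KL divergence between the true and predicted node distributions:

```python
# Geometric mean over time of KL(p_true(t) || p_model(t)); eps guards the logs.
import numpy as np

def geometric_mean_kl(p_true, p_model, eps=1e-12):
    """p_true, p_model: arrays of shape (num_times, num_nodes) of probabilities."""
    kl = np.sum(p_true * (np.log(p_true + eps) - np.log(p_model + eps)), axis=1)
    return np.exp(np.mean(np.log(kl + eps)))
```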


Figure 3: Sample complexity of each model on the geometric graph dataset (left) and cyclic graph dataset (right). The error is measured by the geometric mean of the KL divergence over the evaluation time period. For reference, predicting the uniform distribution corresponds to a constant geometric-mean KL divergence on each graph.

Real-world dataset.


Figure 4: Learned p_v(t) over a subset of the counties, fit using the COVID-19 dataset preprocessed by Chen et al. (2020). Only the graph-based models are able to capture the variations over time. See Appendix A for the empirical distribution of the data.

We use data released publicly by The New York Times (Times, 2021) on daily COVID-19 cases in the state of New Jersey to construct a real-world categorical probabilistic forecasting dataset, following the preprocessing script of Chen et al. (2020). We aggregate the cases by county and form a graph with 21 nodes, where each node is a county and each edge is a county border. Using the train/test split from Chen et al. (2020), we obtain similar per-event log likelihoods (with one-standard-deviation error bars over 3 seeds) for NaiveMLP, NaiveGNN, and Gopher. However, these likelihoods are not representative of the model differences, since we find a large distribution shift between the train and test distributions, shown in Figure 5 of Appendix A. This distribution shift causes the models to perform equally poorly on the test set. In actuality, NaiveMLP completely fails to capture variations in p(t) over time, as shown in Figure 4.

4 Discussion

Although the inductive biases of Gopher directly reflect properties of categorical forecasting with local continuous-time dynamics, our experiments find that, surprisingly, explicitly modelling the temporal dynamics does not improve performance. Most of the performance gains of Gopher come from incorporating graph structure, which can be done with a simple baseline model like NaiveGNN. The failure of Gopher can be attributed to the fact that the learned dynamics in the embedding space do not accurately reflect the ground-truth dynamics in probability space.

References

  • Bakshy et al. [2012] Eytan Bakshy, Itamar Rosenn, Cameron Marlow, and Lada Adamic. The role of social networks in information diffusion. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 519–528, New York, NY, USA, April 2012. Association for Computing Machinery. ISBN 978-1-4503-1229-5. doi: 10.1145/2187836.2187907.
  • Chamberlain et al. [2021] Ben Chamberlain, James Rowbottom, Maria I. Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi. GRAND: Graph Neural Diffusion. In International Conference on Machine Learning, pages 1407–1418. PMLR, July 2021.
  • Chapman and Mesbahi [2011] Airlie Chapman and Mehran Mesbahi. Advection on graphs. IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), 50:1461–1466, December 2011.
  • Chen et al. [2018] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Chen et al. [2020] Ricky T. Q. Chen, Brandon Amos, and Maximilian Nickel. Neural Spatio-Temporal Point Processes. In International Conference on Learning Representations, September 2020.
  • Daley and Vere-Jones [2003] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Probability and Its Applications, An Introduction to the Theory of Point Processes. Springer-Verlag, New York, second edition, 2003. ISBN 978-0-387-95541-4. doi: 10.1007/b97277.
  • Deng et al. [2019] Zhiwei Deng, Megha Nawhal, Lili Meng, and Greg Mori. Continuous Graph Flow. arXiv:1908.02436 [cs, stat], September 2019.
  • Dupont et al. [2019] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented Neural ODEs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Finzi et al. [2020] Marc Finzi, Ke Alexander Wang, and Andrew G. Wilson. Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints. Advances in Neural Information Processing Systems, 33:13880–13889, 2020.
  • Gneiting and Katzfuss [2014] Tilmann Gneiting and Matthias Katzfuss. Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1(1):125–151, 2014. doi: 10.1146/annurev-statistics-062713-085831.
  • Grathwohl et al. [2018] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models. In International Conference on Learning Representations, September 2018.
  • Han et al. [2021] Zhen Han, Zifeng Ding, Yunpu Ma, Yujia Gu, and Volker Tresp. Temporal Knowledge Graph Forecasting with Neural ODE. arXiv:2101.05151 [cs], August 2021.
  • Huang et al. [2010] Wenzhang Huang, Maoan Han, and Kaiyu Liu. Dynamics of an SIS reaction-diffusion epidemic model for disease transmission. Mathematical Biosciences & Engineering, 7(1):51, 2010. doi: 10.3934/mbe.2010.7.51.
  • Hyndman and Athanasopoulos [2018] Robin John Hyndman and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, Australia, 2nd edition, 2018.
  • Iakovlev et al. [2020] Valerii Iakovlev, Markus Heinonen, and Harri Lähdesmäki. Learning continuous-time PDEs from sparse data with graph neural networks. In International Conference on Learning Representations, September 2020.
  • Linderman and Adams [2014] Scott Linderman and Ryan Adams. Discovering Latent Network Structure in Point Process Data. In International Conference on Machine Learning, pages 1413–1421. PMLR, June 2014.
  • Lou et al. [2020] Aaron Lou, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser Nam Lim, and Christopher M De Sa. Neural manifold ordinary differential equations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17548–17558. Curran Associates, Inc., 2020.
  • Mathieu and Nickel [2020] Emile Mathieu and Maximilian Nickel. Riemannian Continuous Normalizing Flows. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2503–2515. Curran Associates, Inc., 2020.
  • Poli et al. [2021] Michael Poli, Stefano Massaroli, Clayton M. Rabideau, Junyoung Park, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Continuous-Depth Neural Models for Dynamic Graph Prediction. arXiv:2106.11581 [cs, stat], June 2021.
  • Rangapuram et al. [2018] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep State Space Models for Time Series Forecasting. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Rasul et al. [2020] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs M. Bergmann, and Roland Vollgraf. Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows. In International Conference on Learning Representations, September 2020.
  • Salinas et al. [2020] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, July 2020. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2019.07.001.
  • Times [2021] The New York Times. Coronavirus (Covid-19) Data in the United States, 2021. URL https://github.com/nytimes/covid-19-data.
  • Wang et al. [2021] Rui Wang, Danielle Maddix, Christos Faloutsos, Yuyang Wang, and Rose Yu. Bridging physics-based and data-driven modeling for learning dynamical systems. In Ali Jadbabaie, John Lygeros, George J. Pappas, Pablo A. Parrilo, Benjamin Recht, Claire J. Tomlin, and Melanie N. Zeilinger, editors, Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pages 385–398. PMLR, 07 – 08 June 2021. URL https://proceedings.mlr.press/v144/wang21a.html.
  • Wang et al. [2019] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. Deep Factors for Forecasting. In International Conference on Machine Learning, pages 6607–6617. PMLR, May 2019.
  • Wu et al. [2021] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, January 2021. ISSN 2162-237X, 2162-2388. doi: 10.1109/TNNLS.2020.2978386.
  • Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations, September 2018.
  • Yu et al. [2018] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, pages 3634–3640, Stockholm, Sweden, July 2018. AAAI Press. ISBN 978-0-9992411-2-7.

Appendix A Distribution shift in our real-world dataset


Figure 5: Distribution shift between the train and test distributions of the New Jersey COVID-19 cases created by Chen et al. (2020), when binned by the 21 counties. For each of the two distributions, the height of each violin plot is normalized by the total count of observations in that split, i.e., the size of the training set or the size of the test set. For most vertices, there are fewer COVID cases later in the 7-day interval in the test set than in the training set.

Appendix B Learned forecasts on COVID-19 dataset


Figure 6: A copy of Figure 4 for easier comparison to the empirical distribution in Figure 5.

Appendix C Model robustness

Figure 7: (Top) The learned and ground-truth p_v(t) for the ring graph dataset. (Bottom) The predicted p_v(t) after we remove all edges from the ring graph. Notice that the ground truth for a completely disconnected graph is dp_v(t)/dt = 0 for all v, i.e., constant node probabilities. However, all of the models fail completely on this new disconnected graph, suggesting that they do not learn the true dependence of p(t) on the graph structure.

Appendix D Implementation details


Figure 8: Architecture of Gopher.

We use two-layer MLPs with 64 hidden units per layer whenever we use an MLP. We use Swish activations to ensure smoothness of our dynamics. We also use an augmented neural ODE (Dupont et al., 2019), treating 16 of the 64 hidden dimensions as augmented dimensions.
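As a brief illustration of how we read the augmentation step (zero-padding is the standard choice from Dupont et al. (2019), not necessarily the exact implementation here):

```python
# Augmented-neural-ODE trick (Dupont et al., 2019): pad the node state with
# extra zero dimensions so the ODE evolves in a higher-dimensional space.
import torch

def augment_initial_state(z0: torch.Tensor, num_aug: int = 16) -> torch.Tensor:
    """z0: |V| x d node embeddings; returns |V| x (d + num_aug)."""
    return torch.cat([z0, torch.zeros(z0.shape[0], num_aug)], dim=-1)
```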

D.1 Model architectures.

Gopher.

We parameterize the dynamics g from Equation 2 using one Graph Isomorphism Network (GIN) layer parameterized by an MLP. We also use an MLP to model the projection h.

NaiveGNN.

We use the same architecture as Gopher, namely a GIN layer followed by a projection h. However, instead of using the model to parameterize ODE dynamics, we directly pass the node embeddings concatenated with time through the GNN.

NaiveMLP.

We replace the GIN layer of NaiveGNN with an MLP, keeping all else the same.

D.2 Training procedure and dataset details.

To maximize hardware parallelism, we parallelize our neural ODE computation across sequence timesteps and across sequences using the time-reparameterization trick outlined in Chen et al. (2020). For the synthetic datasets, we use the AdamW optimizer with a 0.01 learning rate and batch size 64 for 30 epochs. For the COVID-19 dataset, we use batch size 4 for 15 epochs. Here, each batch consists of multiple sequences drawn from the training period. We use a separate time horizon for the ring graph, the geometric graph, and the New Jersey counties graph. We generate the ring graph dataset by using hand-set coefficients for the edge weights to allow for counter-clockwise transport. We generate the geometric graph dataset by generating a random geometric graph via the networkx python package and drawing a random sample of edge weights, as sketched below.
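A minimal sketch of this graph construction, assuming the networkx API; the node count, connection radius, and weight distribution are illustrative rather than our exact settings:

```python
# Random geometric graph with randomly drawn directed edge weights, yielding a
# weighted adjacency matrix A suitable for the advection dynamics of Eq. (4).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
G = nx.random_geometric_graph(n=20, radius=0.4, seed=0)   # undirected spatial graph
G = G.to_directed()                                       # advection uses directed edges
for u, v in G.edges():
    G[u][v]["weight"] = rng.uniform(0.1, 1.0)             # random positive edge weight
A = nx.to_numpy_array(G, weight="weight")                 # weighted adjacency matrix
```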


Figure 9: The dynamics of each component of p(t) on the cyclic graph. Each component corresponds to the probability of a vertex on the graph.