NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data and Prior Knowledge

We propose a score-based DAG structure learning method for time-series data that captures linear, nonlinear, lagged and instantaneous relations among variables while ensuring acyclicity throughout the entire graph. The proposed method extends nonparametric NOTEARS, a recent continuous optimization approach for learning nonparametric instantaneous DAGs. The proposed method is faster than constraint-based methods using nonlinear conditional independence tests. We also promote the use of optimization constraints to incorporate prior knowledge into the structure learning process. A broad set of experiments with simulated data demonstrates that the proposed method discovers better DAG structures than several recent comparison methods. We also evaluate the proposed method on complex real-world data acquired from NHL ice hockey games containing a mixture of continuous and discrete variables. The code is available at https://github.com/xiangyu-sun-789/NTS-NOTEARS/.

1 Introduction

The paper addresses the structure learning problem of learning directed acyclic graphs (DAGs) from observational time-series data. While both undirected graphs and directed acyclic graphs provide insights into a system, the directed edges in DAGs can additionally describe the data generation mechanisms of the system by following the direction of the arrows Pearl (2009). When the underlying dynamics of a system are unknown, it is essential to identify them using structure learning methods. Structure learning has many applications in real-world domains such as biology Sachs et al. (2005), finance Sanford and Moosa (2012), economics Appiah (2018) and sports analytics Wu et al. (2021).

Many traditional score-based or constraint-based structure learning methods require users to have the expertise to choose the right method based on their knowledge of the underlying true data distributions or the restricted function classes of the data. This creates a burden for non-experts and limits the practical use of these methods in real-world domains where such knowledge is unavailable. Neural-network-based methods, on the other hand, are convenient because they do not require users to have such knowledge.

There are two types of DAGs: instantaneous DAGs and temporal DAGs. The paper addresses the structure learning problem in the latter case. In instantaneous DAGs, there are no temporal dependencies among variables; the data samples are considered to be independent and identically distributed draws from a joint distribution. In temporal DAGs, there are temporal dependencies among variables. We consider time-series data where data samples across timesteps are dependent, and where multiple time series may also be dependent within a single timestep. For neural-network-based methods, while acyclicity in the lagged steps can easily be ensured by following the direction of time in the data, it is hard to ensure acyclicity within the instantaneous step. To the best of our knowledge, all existing neural-network-based structure learning methods for time-series data either possibly produce cycles in the instantaneous step Nauta et al. (2019) or do not estimate instantaneous relations at all Tank et al. (2021); Khanna and Tan (2020); Marcinkevics and Vogt (2021). However, ignoring instantaneous relations may lead to incorrect estimation of lagged relations Hyvärinen et al. (2010).

Based on a recent algebraic acyclicity characterization due to nonparametric NOTEARS Zheng et al. (2020) for learning nonparametric instantaneous DAGs from independent and identically distributed data samples, we propose a score-based structure learning method for learning nonparametric temporal DAGs from time-series data using 1D convolutional neural networks (CNNs). As with nonparametric NOTEARS, which uses multilayer perceptrons (MLPs), although neural networks are generally considered parametric, the estimated DAGs of the proposed method are nonparametric in the sense that they do not assume any particular parametric form for the underlying true models from which the data is generated. The proposed method is easy to use and ensures acyclicity across all timesteps. It also allows prior knowledge to be added to the learning process via optimization constraints.

Contributions

Our contributions are as follows:

  • We propose NTS-NOTEARS, a continuous optimization approach that learns nonparametric temporal DAGs that capture linear, nonlinear, lagged and instantaneous relations among variables in time-series data. It ensures acyclicity in both lagged and instantaneous timesteps and is faster than methods based on nonlinear conditional independence tests.

  • We promote the use of optimization constraints on convolutional layers to incorporate prior knowledge into the learning process and demonstrate with examples.

  • We show the superior performance of the proposed method by comparing it with other state-of-the-art methods using simulated time-series data.

  • To demonstrate the practical use of the proposed method, we apply it with real-world data that contains a mixture of binary, categorical and continuous variables acquired from NHL ice hockey games.

The paper is structured as follows. We review background and related work in Section 2, describe the proposed method in Section 3 and explain how to incorporate prior knowledge into the proposed method in Section 4. In Section 5, we evaluate the proposed method on simulated and real data. Finally, we conclude in Section 6.

2 Background and Related Works

2.1 Background

A DAG is a pair $G = (V, E)$ where $V$ represents vertices (i.e. nodes) and $E$ represents directed edges. Each edge points from one vertex $u$ to another vertex $v$, written as $u \rightarrow v$. DAGs can be used to represent data generation mechanisms Pearl (2009). For instantaneous DAGs that have no time structure, each vertex corresponds to a random variable $X_j$. An edge $X_i \rightarrow X_j$ denotes that the variable $X_j$ is generated depending on the value of variable $X_i$. For temporal DAGs, each vertex corresponds to a time series $X_j$ at a discrete timestep $t$. An edge $X_i^{t-k} \rightarrow X_j^{t}$ for $k \geq 0$ represents that the time series $X_j$ at discrete timestep $t$ is generated depending on the value of time series $X_i$ at timestep $t-k$. While the time sequence may be infinite, we only consider a finite number of timesteps in practice. The edges within the last timestep of a temporal DAG model the relations among time series within a single timestep, which arise due to a coarse sampling rate over time. The edges from all the other timesteps $t-k$ for $k \geq 1$ model the relations across timesteps. Therefore, we call the last timestep in a temporal DAG the instantaneous step and all the other timesteps the lags.

The structure learning problem addressed is as follows. Given a sequence $\mathbf{X} = \{x^1, \dots, x^T\}$ of $d$ stationary time series of length $T$, the goal is to learn a temporal DAG that captures the dependencies in $\mathbf{X}$. In general, there exist multiple DAGs that encode the same set of dependencies in $\mathbf{X}$ Peters et al. (2017). Therefore, given a set of observational data $\mathbf{X}$, the solution may not be unique. On the other hand, there are also identifiable cases where the solution is unique, obtained by restricting the data generation mechanisms to particular function forms Peters et al. (2017).

A recent algebraic acyclicity constraint is presented in Zheng et al. (2018), which proposes instantaneous linear NOTEARS to learn instantaneous DAGs in the linear case. The authors show that the estimated directed graph is acyclic when the acyclicity constraint

$$h(W) \;=\; \operatorname{tr}\!\left(e^{\,W \circ W}\right) - d \;=\; 0$$

is satisfied, where $W \in \mathbb{R}^{d \times d}$ is the adjacency matrix of the estimated graph over $d$ variables, $\operatorname{tr}(\cdot)$ and $e^{(\cdot)}$ are the trace and matrix exponential, respectively, and $\circ$ is the element-wise product. Later, instantaneous nonparametric NOTEARS Zheng et al. (2020) was proposed for learning instantaneous DAGs in the nonlinear case using MLPs. To extend the adjacency matrix to the nonlinear case, the authors define a surrogate adjacency matrix

$$[W(\theta)]_{ij} \;=\; \big\|\theta_j^{(1,i)}\big\|_2,$$

where $\theta$ denotes the parameters of the MLPs and $\theta_j^{(1,i)}$ denotes the first-layer parameters of the MLP that predicts the target variable $X_j$ and that are connected to input $X_i$. Let $\ell$ denote the least-squares loss. To learn the DAGs, the following objective function is optimized:

$$\min_{\theta} \;\sum_{j=1}^{d} \ell\big(x_j,\, \mathrm{MLP}_j(\mathbf{X}; \theta_j)\big) \;+\; \lambda \big\|\theta^{(1)}\big\|_1 \qquad \text{subject to } h\big(W(\theta)\big) = 0,$$

where $h\big(W(\theta)\big) = \operatorname{tr}\!\left(e^{\,W(\theta) \circ W(\theta)}\right) - d$. The augmented Lagrangian method can be applied to convert the constrained optimization problem into an unconstrained one, which is optimized by the L-BFGS-B algorithm Byrd et al. (1995); Zhu et al. (1997):

$$\min_{\theta} \;\sum_{j=1}^{d} \ell\big(x_j,\, \mathrm{MLP}_j(\mathbf{X}; \theta_j)\big) \;+\; \lambda \big\|\theta^{(1)}\big\|_1 \;+\; \frac{\rho}{2}\, h\big(W(\theta)\big)^2 \;+\; \alpha\, h\big(W(\theta)\big).$$
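To make the constraint concrete, the following short sketch (ours, not part of the released code) evaluates $h(W)$ with NumPy and SciPy and shows that it is (numerically) zero for an acyclic adjacency matrix and strictly positive once a cycle is added.

```python
# A minimal sketch of the NOTEARS acyclicity function h(W) = tr(exp(W ∘ W)) - d:
# it is zero exactly when the graph encoded by W is acyclic.
import numpy as np
from scipy.linalg import expm

def notears_h(W: np.ndarray) -> float:
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)   # W * W is the element-wise product

# Example: the chain 0 -> 1 -> 2 is acyclic, so h is (numerically) zero;
# adding the edge 2 -> 0 closes a cycle and makes h strictly positive.
W_acyclic = np.array([[0., 1., 0.],
                      [0., 0., 1.],
                      [0., 0., 0.]])
W_cyclic = W_acyclic.copy()
W_cyclic[2, 0] = 1.
print(notears_h(W_acyclic))   # ~0.0
print(notears_h(W_cyclic))    # > 0
```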

2.2 Related Works

Besides the advances in structure learning methods for instantaneous DAGs, learning temporal DAGs with multiple timesteps from time-series data has also become a popular topic. DYNOTEARS Pamfil et al. (2020) extends instantaneous linear NOTEARS to the temporal linear case using linear autoregression. VAR-LINGAM Hyvärinen et al. (2010) extends LINGAM Shimizu et al. (2006, 2011), a restricted instantaneous linear model class with additive non-Gaussian noise, to the temporal case. PCMCI+ Runge (2020) and LPCMCI Gerhardus and Runge (2020) are constraint-based methods for time-series data that utilize conditional independence (CI) tests with and without the assumption of causal sufficiency, respectively. Under the assumption of causal sufficiency, PCMCI+ outputs a completed partially directed acyclic graph (CPDAG) with multiple timesteps; LPCMCI outputs a partial ancestral graph (PAG) that takes latent confounders into account. They are linear or nonlinear depending on the CI tests chosen by the users. However, nonlinear CI tests are computationally expensive Zhang et al. (2011); Runge (2018); Zheng et al. (2020); Runge et al. (2019).

There are also several neural-network-based structure learning methods for time-series data. Unlike our method, most of the methods do not estimate edges in the instantaneous step. For example, cMLP and cLSTM Tank et al. (2021) use MLPs and LSTMs, respectively, to estimate temporal DAGs without edges in the instantaneous step. Economy-SRU Khanna and Tan (2020) is an RNN-based method that learns temporal DAGs without instantaneous edges. GVAR Marcinkevics and Vogt (2021) estimates summary graphs using self-explaining neural networks and the estimated summary graphs do not have instantaneous edges, either. TCDF Nauta et al. (2019) uses attention-based CNNs to estimate temporal DAGs. The estimated temporal DAGs contain edges in both lagged and instantaneous steps. Under the assumption of no spurious correlations, the authors claim the method detects latent confounders. However, it may introduce cycles in the instantaneous step whereas our method ensures acyclicity throughout the entire estimated graph.

3 NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data

3.1 Formulation

Figure 1: The architecture of NTS-NOTEARS for one target variable $x_j$. The convolutional weights with respect to the target variable $x_j$ in the instantaneous step are set to 0.

We propose NTS-NOTEARS, a structure learning method that captures linear and nonlinear relations in time-series data, estimates edges in both the lagged and instantaneous steps, ensures acyclicity throughout the entire graph and is computationally much faster than constraint-based methods with nonlinear CI tests. We adapt the acyclicity constraint from instantaneous nonparametric NOTEARS to time-series data using 1D CNNs.

Let $K$ be the number of lags in the estimated graph and $\mathbf{X} = \{x^1, \dots, x^T\}$ be a sequence of $d$ stationary time series of length $T$. We train $d$ CNNs, where the $j$-th CNN is trained to predict the target variable $x_j^t$ at each timestep $t$ given the input variables $x^{t-K}, \dots, x^{t-1}, x^t$. The first layer of each CNN is a 1D convolutional layer with $m$ kernels, stride equal to 1 and no padding. The shape of each convolutional kernel is $d \times (K+1)$, where $K+1$ is the number of timesteps in the estimated graph, i.e. $K$ lags plus the instantaneous step. Let $\theta = (\phi, \psi)$ denote all the parameters of the CNNs, where $\phi$ denotes the parameters in the convolutional layers and $\psi$ denotes the parameters in the fully-connected layers. Let $\phi_j$ denote the parameters in the convolutional layer of the CNN that predicts the target variable $x_j^t$, and let $\phi_j^{(k)}$, for $k = 0, \dots, K$, represent the parameters in the $k$-th columns of the convolutional kernels of that CNN. For example, $\phi_j^{(0)}$ denotes the first columns of the convolutional kernels, which represent the earliest lag, and $\phi_j^{(K)}$ denotes the last columns of the convolutional kernels, which represent the instantaneous step. Let $\ell$ denote the least-squares loss. The training objective function is defined as:

$$\min_{\theta} \; \sum_{j=1}^{d} \ell\big(x_j,\, \mathrm{CNN}_j(\mathbf{X}; \theta_j)\big) \;+\; \lambda_1 \|\phi\|_1 \;+\; \frac{\lambda_2}{2} \|\phi\|_2^2 \qquad \text{subject to } h\big(W(\phi)\big) = 0, \qquad (1)$$

where the loss is summed over all timesteps $t$ at which a full input window of $K$ lags is available and $h$ is the acyclicity constraint defined in equation (3). We define a surrogate function that takes the L2-norm of the corresponding parameters to denote how strong the estimated nonlinear dependency on the edge from node $x_i$ at the $k$-th timestep to node $x_j^t$ is:

$$[W(\phi)]^{(k)}_{ij} \;=\; \big\|\phi_j^{(k,i)}\big\|_2, \qquad (2)$$

where $\phi_j^{(k,i)}$ denotes the parameters in the $k$-th columns of the convolutional kernels of the $j$-th CNN that are connected to input variable $x_i$. The value of each $[W(\phi)]^{(k)}_{ij}$ represents how strong the estimated dependency between nodes $x_i$ and $x_j$ is. It is not an edge coefficient as in the linear case.
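The following sketch (ours, with illustrative values of $d$, $K$ and $m$; it is not the released implementation) shows how the surrogate in equation (2) can be read off the first-layer convolutional weights of the $d$ CNNs in PyTorch.

```python
# A minimal sketch of the first-layer construction and the dependency-strength
# surrogate: the strength of the edge from variable i at kernel column k
# (column K being the instantaneous step) to target variable j is the L2-norm,
# taken over the m kernels, of the corresponding convolutional weights.
import torch
import torch.nn as nn

d, K, m = 5, 2, 8   # number of variables, lags and kernels (illustrative values)
first_layers = [nn.Conv1d(in_channels=d, out_channels=m,
                          kernel_size=K + 1, stride=1, padding=0)
                for _ in range(d)]   # first conv layer of each of the d CNNs

# As in Figure 1, the instantaneous-column weights that connect a CNN to its
# own target variable are fixed to 0 (no self-loop in the instantaneous step).
with torch.no_grad():
    for j, conv in enumerate(first_layers):
        conv.weight[:, j, K] = 0.0

def dependency_strengths(first_layers):
    """Return a (K+1, d, d) tensor S with S[k, i, j] equal to the estimated
    strength of the edge from variable i at kernel column k to target j."""
    S = torch.zeros(K + 1, d, d)
    for j, conv in enumerate(first_layers):
        w = conv.weight                  # shape (m, d, K+1)
        S[:, :, j] = w.norm(dim=0).T     # L2-norm over the m kernels
    return S

S = dependency_strengths(first_layers)
W_inst = S[K]   # d x d instantaneous block that is fed to h(.) in equation (3)
```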

Figure 2: (a, b, c): a random true graph. (d, e, f): the estimated graph using the proposed method. In the true graph, there are 3 timesteps (2 lags + 1 instantaneous), 5 nodes per timestep and 16 edges in total. The Structural Hamming Distance (SHD) between the true graph and the estimated graph is 2.

Since the model predicts each target variable $x_j^t$ in the instantaneous step $t$, every estimated edge points to a node in the instantaneous step $t$. Therefore, there can be no cycles involving the lagged steps by construction, and we only need to ensure acyclicity within the instantaneous step $t$. Let $W^{(K)}(\phi)$ denote the $d \times d$ matrix of estimated strengths of edges in the instantaneous step of an estimated temporal DAG. Following Zheng et al. (2020) for instantaneous DAGs, we define the following acyclicity constraint to ensure acyclicity in the temporal DAG:

$$h\big(W^{(K)}(\phi)\big) \;=\; \operatorname{tr}\!\left(e^{\,W^{(K)}(\phi) \,\circ\, W^{(K)}(\phi)}\right) - d \;=\; 0. \qquad (3)$$

The augmented Lagrangian method can again be applied here to convert the constrained optimization problem into an unconstrained one, which can be optimized using the L-BFGS-B algorithm Byrd et al. (1995); Zhu et al. (1997):

$$\min_{\theta} \; \sum_{j=1}^{d} \ell\big(x_j,\, \mathrm{CNN}_j(\mathbf{X}; \theta_j)\big) \;+\; \lambda_1 \|\phi\|_1 \;+\; \frac{\lambda_2}{2} \|\phi\|_2^2 \;+\; \frac{\rho}{2}\, h\big(W^{(K)}(\phi)\big)^2 \;+\; \alpha\, h\big(W^{(K)}(\phi)\big). \qquad (4)$$

Figure 1 shows the architecture of the proposed model for a single target variable. The overall model for the entire dataset contains $d$ such architectures and can be trained jointly as one single network. Weight thresholds can be applied to each timestep in the estimated graph to prune away insignificant edges with weak dependency strengths, which also guards against violations of acyclicity caused by machine-precision issues in equation (3) Zheng et al. (2018). An example of an estimated graph with dependency strengths on the edges is shown in Figure 5.
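For concreteness, the sketch below (ours) outlines the augmented-Lagrangian schedule of equation (4) and the final thresholding step; `solve_subproblem`, the update factor and the threshold value are placeholders rather than the authors' exact settings.

```python
# A minimal sketch of the outer augmented-Lagrangian loop and edge pruning.
import numpy as np
from scipy.linalg import expm

def h_inst(W_inst: np.ndarray) -> float:
    """Acyclicity value of the d x d instantaneous strength matrix (equation (3))."""
    return float(np.trace(expm(W_inst * W_inst)) - W_inst.shape[0])

def augmented_lagrangian_fit(solve_subproblem, max_outer=20,
                             rho=1.0, alpha=0.0, h_tol=1e-8, rho_max=1e16):
    """solve_subproblem(rho, alpha, warm_start) minimizes
    F(theta) + rho/2 * h^2 + alpha * h (e.g. with L-BFGS-B) and returns
    (theta, W_inst), where W_inst is the instantaneous strength matrix."""
    theta, W_inst = None, None
    for _ in range(max_outer):
        theta, W_inst = solve_subproblem(rho, alpha, warm_start=theta)
        h_val = h_inst(W_inst)
        if h_val <= h_tol or rho >= rho_max:
            break
        alpha += rho * h_val   # dual (multiplier) update
        rho *= 10.0            # strengthen the quadratic penalty
    return theta, W_inst

def prune(W: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Remove edges whose estimated dependency strength falls below the threshold."""
    return np.where(W >= threshold, W, 0.0)
```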

3.2 Assumptions and Identifiability

Throughout the entire article, we assume there are no latent confounders in the data, since latent confounders may introduce spurious dependencies between nodes. We also assume the underlying data generating process is fixed and stationary over time.

Like most other optimization problems solved with neural networks, the optimization problem (4) is non-convex and solving it to global optimality is generally infeasible. Furthermore, although the proposed method is a general method applicable (but not limited) to any identifiable function class, we do not make any identifiability claims or assume any parametric form for the true models. Despite the non-convexity and the absence of identifiability guarantees, in Section 5 we show that the proposed method outperforms other methods on simulated identifiable true models and works well on real-world data. For demonstration purposes, a pair consisting of a true graph and a graph estimated by the proposed method is given in Figure 2.

Figure 3: The means and standard deviations over 10 runs for each setting. The top and bottom rows correspond to the two sequence lengths compared in Section 5.1. The number of timesteps = 2. Lower SHD is better. NTS-NOTEARS is the proposed method.

4 Optimization Constraints as Prior Knowledge

Adding prior knowledge about the true graph into the learning process increases not only the accuracy but also the speed of learning, since the number of parameters that need to be learned is reduced Shimizu et al. (2011). We promote the use of the L-BFGS-B algorithm Byrd et al. (1995); Zhu et al. (1997) with optimization constraints on the convolutional layers to encode various kinds of prior knowledge into the learning process (e.g. absence or presence of directed edges and specification of dependency strengths between nodes). The L-BFGS-B algorithm is a second-order memory-efficient nonlinear optimization algorithm that allows bound constraints on variables. The algorithm allows users to define which set of variables are free and which are constrained. For each constrained variable, the users provide a lower bound and an upper bound. At every step, the algorithm optimizes its objective function while keeping all the constrained variables within their bounds. By utilizing the bound constraints of the L-BFGS-B algorithm, we integrate prior knowledge into the proposed method. We give the formulation here. The demonstration showing the benefits from having prior knowledge is given in Section 5.2.

Let $\phi = (\phi_{\mathrm{free}}, \phi_{\mathrm{con}})$, where $\phi_{\mathrm{free}}$ denotes the free parameters without constraints and $\phi_{\mathrm{con}}$ denotes the constrained parameters with lower bounds $l$ and upper bounds $u$. The objective function (4) becomes

$$\min_{\theta} \; \sum_{j=1}^{d} \ell\big(x_j,\, \mathrm{CNN}_j(\mathbf{X}; \theta_j)\big) \;+\; \lambda_1 \|\phi\|_1 \;+\; \frac{\lambda_2}{2} \|\phi\|_2^2 \;+\; \frac{\rho}{2}\, h\big(W^{(K)}(\phi)\big)^2 \;+\; \alpha\, h\big(W^{(K)}(\phi)\big) \qquad \text{subject to } l \leq \phi_{\mathrm{con}} \leq u.$$

Since the presence and absence of an edge in an estimated DAG are encoded by the parameters of the convolutional layer, the constraints only need to be added to the convolutional layer. For example, to add prior knowledge that enforces the absence of the edge from $x_i$ in the most recent lag to $x_j$ in the instantaneous step, the following constraint can be applied:

$$0 \;\leq\; \phi_j^{(K-1,\, i)} \;\leq\; 0,$$

where $\phi_j^{(K-1,\, i)}$ represents the weights across all kernels in the $(K-1)$-th columns of the convolutional kernels for the target variable $x_j$ that are connected to input variable $x_i$. The minimum value of a bound is 0 for the proposed method, because we take the L2-norm of the parameters in equation (2) and the estimated dependency strengths on edges are non-negative. Similarly, to add prior knowledge that enforces the presence of an edge from $x_i$ in the earliest lag to $x_j$ in the instantaneous step, the following constraint can be applied:

$$\beta_1 \;\leq\; \phi_j^{(0,\, i)} \;\leq\; \beta_2,$$

where $\beta_1$ and $\beta_2$ are positive numbers with $\beta_1 \leq \beta_2$. The upper bound $\beta_2$ is optional; when $\beta_2$ is not specified, there is no upper bound.

The positive constraints $\beta_1$ and $\beta_2$ correspond to the parameters of the convolutional layer, not directly to the estimated dependency strength on the target edge. According to equation (2), the estimated dependency strength on the target edge is equal to the L2-norm of those parameters. In order for the estimated dependency strength of the proposed method to fall between $\beta_1$ and $\beta_2$, the constraints need to be scaled before they are passed to the L-BFGS-B algorithm. Let $\beta$ denote a positive constraint and $m$ be the number of kernels of the convolutional layer of each CNN. Each $\beta$ is scaled in the following way before being applied to the L-BFGS-B algorithm:

$$\tilde{\beta} \;=\; \frac{\beta}{\sqrt{m}},$$

so that when all $m$ constrained kernel weights equal the per-weight bound $\tilde{\beta}$, the L2-norm in equation (2), and hence the estimated dependency strength, equals $\beta$.

We provide simple APIs in our code for users to specify the dependency strengths without knowing the technical details.
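As an illustration, the sketch below (a hypothetical helper of ours, not the API shipped with the code) builds L-BFGS-B bounds for the first convolutional layer of one CNN, covering both the edge-absence and edge-presence constraints with the $\beta/\sqrt{m}$ scaling described above.

```python
# Forbidding the edge from input variable i at kernel column k pins the m
# corresponding weights to [0, 0]; requiring it with strength in [beta1, beta2]
# gives each weight the scaled bounds [beta1/sqrt(m), beta2/sqrt(m)], so the
# L2-norm in equation (2) lands in the requested interval.
import numpy as np

def conv_weight_bounds(d, K, m, forbidden=(), required=()):
    """Per-parameter (low, high) bounds, flattened in (kernel, variable, column) order.

    forbidden: iterable of (i, k) pairs whose edge must be absent.
    required:  iterable of (i, k, beta1, beta2) tuples whose edge must be present
               with strength in [beta1, beta2] (beta2=None means no upper bound).
    """
    bounds = np.full((m, d, K + 1, 2), [-np.inf, np.inf])
    for i, k in forbidden:
        bounds[:, i, k] = [0.0, 0.0]
    for i, k, beta1, beta2 in required:
        high = np.inf if beta2 is None else beta2 / np.sqrt(m)
        bounds[:, i, k] = [beta1 / np.sqrt(m), high]
    return [tuple(b) for b in bounds.reshape(-1, 2)]

# Example with d=5 variables, K=2 lags and m=8 kernels: forbid the edge from
# variable 2 at the most recent lag (column K-1) and require the edge from
# variable 0 at the earliest lag (column 0) with strength at least 0.5. The
# result can be passed as `bounds=` to scipy.optimize.minimize(method="L-BFGS-B").
bounds = conv_weight_bounds(d=5, K=2, m=8,
                            forbidden=[(2, 1)], required=[(0, 0, 0.5, None)])
```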

Figure 4: Each matrix is shown in temporal order from left-to-right and top-to-bottom. Each row represents a parent variable and each column represents a child variable; a coloured cell represents an edge. (a): the true graph. (b): the estimated graph without prior knowledge; SHD = 4. (c): the estimated graph with prior knowledge that encodes the presence of one true edge, which helps to correctly estimate that edge (in the red box) from the true graph; SHD = 3. (d): the estimated graph with prior knowledge that encodes the absence of one incorrect edge, which helps to correctly estimate two edges (in the red boxes) from the true graph; SHD = 2.

5 Evaluation

All the experiments are performed on a computer equipped with an Intel Core i7-6850K CPU at 3.60GHz, 32GB of memory and an Nvidia GeForce GTX 1080Ti GPU. Unless otherwise stated, throughout the entire section we use the same fixed settings for the number of kernels, the number of hidden layers and the regularization coefficients.

5.1 Simulated Data

To validate the proposed method, we generate simulated data based on random graphs and identifiable nonlinear structural equation models (SEMs) Peters et al. (2017), then compare the proposed method with several recent structure learning methods: TCDF Nauta et al. (2019), DYNOTEARS Pamfil et al. (2020), VAR-LINGAM Hyvärinen et al. (2010) and PCMCI+ Runge (2020). Methods that do not estimate edges in the instantaneous step are not directly comparable with the proposed method; therefore we exclude methods such as economy-SRU Khanna and Tan (2020), GVAR Marcinkevics and Vogt (2021), cMLP and cLSTM Tank et al. (2021) from the comparison.

Structural Hamming Distance (SHD) is the metric we use to measure the accuracy of the methods. SHD is defined as the number of edge additions, removals and reversals needed to convert the estimated graph to the true graph (lower is better).
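For reference, a minimal sketch (ours) of SHD on binary adjacency matrices, counting one operation per vertex pair at which the two graphs disagree, is given below.

```python
# One edge addition, removal or reversal is counted for every vertex pair at
# which the estimated and true graphs disagree. Assumes both graphs are DAGs
# (so no pair contains edges in both directions).
import numpy as np

def shd(B_true: np.ndarray, B_est: np.ndarray) -> int:
    d = B_true.shape[0]
    count = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (B_true[i, j], B_true[j, i]) != (B_est[i, j], B_est[j, i]):
                count += 1   # fixing this pair takes exactly one operation
    return count
```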

The true DAGs are sampled uniformly given the number of nodes, the number of edges and the number of lags. Given a true DAG, the data is simulated based on one of the following identifiable SEMs: additive noise models (ANM) Peters et al. (2017), index models (IM) Yuan (2011); Alquier and Biau (2013), and generalized linear models (GLM) with Poisson distribution Park and Park (2019). Although we do not make any causal claims for the proposed method, the data generation mechanisms are causal. While the data simulated with the GLM with Poisson distribution is categorical, the data simulated with the other models is continuous with Gaussian noise.
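As an illustration of this setup, the sketch below (ours) simulates time-series data from a given temporal DAG with an additive noise model; the specific nonlinearity (tanh) and noise scale are illustrative choices, not the exact SEMs used in the experiments.

```python
# Each variable at time t is a nonlinear function of its lagged and
# instantaneous parents plus independent Gaussian noise; instantaneous parents
# are generated in a topological order so the values are well defined.
import numpy as np

def simulate_anm(A_lag, A_inst, T, topo_order, seed=0):
    """A_lag: (K, d, d) binary lagged adjacency, A_lag[k, i, j] = 1 meaning
    x_i at lag k+1 is a parent of x_j at time t.
    A_inst: (d, d) binary instantaneous adjacency (must be acyclic).
    topo_order: a topological order of A_inst.
    Returns an array of shape (T, d)."""
    rng = np.random.default_rng(seed)
    K, d, _ = A_lag.shape
    X = rng.standard_normal((T, d)) * 0.1    # burn-in values for the first K steps
    for t in range(K, T):
        for j in topo_order:
            parent_sum = X[t] @ A_inst[:, j]                    # instantaneous parents
            for k in range(K):
                parent_sum += X[t - (k + 1)] @ A_lag[k, :, j]   # lagged parents
            X[t, j] = np.tanh(parent_sum) + 0.5 * rng.standard_normal()
    return X
```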

In Figure 3, we compare the methods at two sequence lengths, representing the cases where there is an inadequate or adequate amount of data, respectively. For each method and each sequence length, we perform a grid search on the hyperparameters using data generated with the index model; the best sets of hyperparameters are chosen and their results are reported. For PCMCI+, we measure the performance with both a linear and a nonlinear CI test, i.e. the partial correlation test (ParCorr) and the Gaussian process regression plus distance correlation test (GPDC) Runge et al. (2019), respectively. Nonparametric CI tests, such as CMIknn and CMIsymb Runge et al. (2019), as well as LPCMCI Gerhardus and Runge (2020), are infeasible to include in the experiments; please see Appendix A for details. For the proposed method, the hyperparameter values selected by the grid search are used for the two sequence lengths, respectively. Please see Appendix A for details on the chosen hyperparameters of the other methods. We examine each sequence length and each SEM with four different numbers of nodes in the graph. The proposed method is outperformed by PCMCI+ with the GPDC CI test in one ANM setting. However, PCMCI+ outputs CPDAGs that may contain undirected edges, which we always evaluate as correct regardless of the true direction of the corresponding edges in the true graph. Furthermore, the GPDC CI test requires continuous variables with additive noise, whereas the proposed method does not assume any parametric form for the models. In all other cases, the proposed method outperforms every comparison method. Also, with longer data sequences, all the methods tend to perform better.

5.2 Prior Knowledge With Simulated Data

Adding prior knowledge to the learning process of the proposed method facilitates the recovery of the true DAG. The parts of the DAG explicitly encoded by the prior knowledge are guaranteed to be estimated as the user specifies. Beyond that, prior knowledge may also help to recover edges that are not explicitly encoded by it. In Figure 4, we apply the proposed method to a DAG containing 2 timesteps and 7 nodes per timestep. Without prior knowledge, the method recovers most of the true edges, but two true edges are estimated in the reversed direction and two true edges are missed because their estimated strengths fall below the threshold. The SHD between the true graph and the estimated graph is 4. By simply adding a lower-bound constraint (with no upper bound) that encodes the prior knowledge that one of the missed edges exists, the method recovers that true edge; the SHD is reduced to 3. Solely with one constraint that encodes the prior knowledge that one of the incorrectly estimated edges is absent, the method recovers not only the true edge that is directly relevant to the provided prior knowledge but also another true edge that is not directly relevant; with this constraint alone, the SHD is reduced to 2. This shows that providing prior knowledge to the proposed method may help it recover edges that are not explicitly encoded by the prior knowledge.

Figure 5: The DAG estimated by the proposed method with the prior knowledge that encodes the absence of outgoing edges from goal(t) or goal(t-1). The number on each edge is not an edge coefficient as in the linear case; instead, it represents how strong the estimated (nonlinear) dependency between the two nodes is.

5.3 Running Time

In Figure 6, we compare the running time of the methods that capture nonlinear relations. The neural networks and the GPDC CI test are accelerated by the same GPU. Although PCMCI+ with the GPDC CI test outperforms the proposed method in some cases of the SHD comparison, it is far more computationally expensive, which makes it unscalable to high dimensions. PCMCI+ with the nonparametric CMIsymb CI test for discrete variables raises a memory error when the number of nodes is large, while PCMCI+ with the nonparametric CMIknn CI test for continuous variables is too computationally expensive, as shown, which makes them infeasible to include in the SHD comparison. The proposed method achieves better scalability compared to recent constraint-based temporal methods with nonlinear CI tests.

Figure 6: The running time of the nonlinear methods, measured in minutes, with the index model. The heights of the bars are on a logarithmic scale. Besides the neural networks, the GPDC CI test is also performed with GPU acceleration. NTS-NOTEARS is the proposed method.

5.4 Real Data

We apply the proposed method to real-world data collected by Sportlogiq from ice hockey games in the 2018-2019 NHL season. The proprietary dataset contains a mixture of continuous, binary and categorical variables. The true distribution of each variable is not given to the proposed method, and the continuous variables may or may not follow Gaussian distributions. Please see Table 1 and Figure 7 in the appendix for the data description and distributions. The proposed method is easy to use for users without knowledge of CI tests, the underlying data distributions, restricted function classes, or variable types. Figure 5 shows the DAG learned using the proposed method with the same network architecture as in Section 5.1; we intentionally choose larger weight thresholds to increase the sparsity of the learned DAG for an easy and clear interpretation. Since the play restarts after a goal is scored (i.e. with a face-off), we incorporate the prior knowledge that encodes the absence of outgoing edges from goal(t) or goal(t-1) to any other nodes. The estimated DAG captures many meaningful relationships between the variables. An interesting question to ask in ice hockey is "what contributes to a goal?" Sun et al. (2020); Schulte et al. (2017a, b). By identifying the parents of the node goal(t) in the estimated DAG, we can answer the question: the preceding shot, the duration of the shot, the distance between the shot and the net (i.e. xAdjCoord), the manpower situation and the velocity of the puck are important for scoring a goal.

6 Conclusion

In this paper, we propose NTS-NOTEARS, a score-based structure learning method using 1D-CNNs for time-series data, either with or without prior knowledge, to learn nonparametric DAGs containing both lagged and instantaneous edges. The learned DAGs are ensured to be acyclic. With simulated data, we demonstrate the superior accuracy and running speed of the proposed method compared to several state-of-the-art methods. We also apply the proposed method to complex real-world data that contains a mixture of continuous and discrete variables without knowing the true underlying data distribution. A next step could be extending the proposed method to account for latent variables in the data. Besides sports, it will also be interesting to apply the proposed method to diverse real-world domains. Another topic for future work is to incorporate the proposed method into model-based reinforcement learning.

References

  • P. Alquier and G. Biau (2013) Sparse single-index model. J. Mach. Learn. Res. 14 (1), pp. 243–280. External Links: Link Cited by: §5.1.
  • M. O. Appiah (2018) Investigating the multivariate granger causality between energy consumption, economic growth and co2 emissions in ghana. Energy Policy 112, pp. 198–208. Cited by: §1.
  • R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu (1995) A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16 (5), pp. 1190–1208. External Links: Link, Document Cited by: §2.1, §3.1, §4.
  • A. Gerhardus and J. Runge (2020) High-recall causal discovery for autocorrelated time series with latent confounders. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix A, §2.2, §5.1.
  • A. Hyvärinen, K. Zhang, S. Shimizu, and P. O. Hoyer (2010) Estimation of a structural vector autoregression model using non-gaussianity. J. Mach. Learn. Res. 11, pp. 1709–1731. External Links: Link Cited by: §1, §2.2, §5.1.
  • S. Khanna and V. Y. F. Tan (2020) Economy statistical recurrent units for inferring nonlinear granger causality. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §2.2, §5.1.
  • R. Marcinkevics and J. E. Vogt (2021) Interpretable models for granger causality using self-explaining neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §1, §2.2, §5.1.
  • M. Nauta, D. Bucur, and C. Seifert (2019) Causal discovery with attention-based convolutional neural networks. Mach. Learn. Knowl. Extr. 1 (1), pp. 312–340. External Links: Link, Document Cited by: §1, §2.2, §5.1.
  • R. Pamfil, N. Sriwattanaworachai, S. Desai, P. Pilgerstorfer, K. Georgatzis, P. Beaumont, and B. Aragam (2020) DYNOTEARS: structure learning from time-series data. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 1595–1605. External Links: Link Cited by: §2.2, §5.1.
  • G. Park and S. Park (2019) High-dimensional poisson structural equation model learning via $\ell_1$-regularized regression. J. Mach. Learn. Res. 20, pp. 95:1–95:41. External Links: Link Cited by: §5.1.
  • J. Pearl (2009) Causality. Cambridge university press. Cited by: §1, §2.1.
  • J. Peters, D. Janzing, and B. Schölkopf (2017) Elements of causal inference: foundations and learning algorithms. The MIT Press. Cited by: §2.1, §5.1, §5.1.
  • J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, and D. Sejdinovic (2019) Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances 5 (11), pp. eaau4996. Cited by: Appendix A, §2.2, §5.1.
  • J. Runge (2018) Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, A. J. Storkey and F. Pérez-Cruz (Eds.), Proceedings of Machine Learning Research, Vol. 84, pp. 938–947. External Links: Link Cited by: §2.2.
  • J. Runge (2020) Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, R. P. Adams and V. Gogate (Eds.), Proceedings of Machine Learning Research, Vol. 124, pp. 1388–1397. External Links: Link Cited by: Appendix A, §2.2, §5.1.
  • K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science 308 (5721), pp. 523–529. Cited by: §1.
  • A. D. Sanford and I. A. Moosa (2012) A bayesian network structure for operational risk modelling in structured finance operations. Journal of the Operational Research Society 63 (4), pp. 431–444. Cited by: §1.
  • O. Schulte, M. Khademi, S. Gholami, Z. Zhao, M. J. Roshtkhari, and P. Desaulniers (2017a) A markov game model for valuing actions, locations, and team performance in ice hockey. Data Min. Knowl. Discov. 31 (6), pp. 1735–1757. External Links: Link, Document Cited by: §5.4.
  • O. Schulte, Z. Zhao, M. Javan, and P. Desaulniers (2017b) Apples-to-apples: clustering and ranking nhl players using location information and scoring impact. In Proceedings of the MIT Sloan Sports Analytics Conference, Cited by: §5.4.
  • S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. J. Kerminen (2006) A linear non-gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, pp. 2003–2030. External Links: Link Cited by: §2.2.
  • S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen (2011) DirectLiNGAM: A direct method for learning a linear non-gaussian structural equation model. J. Mach. Learn. Res. 12, pp. 1225–1248. External Links: Link Cited by: §2.2, §4.
  • X. Sun, J. Davis, O. Schulte, and G. Liu (2020) Cracking the black box: distilling deep sports analytics. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 3154–3162. External Links: Link, Document Cited by: §5.4.
  • A. Tank, I. Covert, N. Foti, A. Shojaie, and E. B. Fox (2021) Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: ISSN 1939-3539, Link, Document Cited by: §1, §2.2, §5.1.
  • L. Y. Wu, A. J. Danielson, X. J. Hu, and T. B. Swartz (2021) A contextual analysis of crossing the ball in soccer. Journal of Quantitative Analysis in Sports 17 (1), pp. 57–66. Cited by: §1.
  • M. Yuan (2011) On the identifiability of additive index models. Statistica Sinica, pp. 1901–1911. Cited by: §5.1.
  • K. Zhang, J. Peters, D. Janzing, and B. Schölkopf (2011) Kernel-based conditional independence test and application in causal discovery. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011, F. G. Cozman and A. Pfeffer (Eds.), pp. 804–813. External Links: Link Cited by: §2.2.
  • X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing (2018) DAGs with NO TEARS: continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9492–9503. External Links: Link Cited by: §2.1, §3.1.
  • X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. P. Xing (2020) Learning sparse nonparametric dags. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 3414–3425. External Links: Link Cited by: NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data and Prior Knowledge, §1, §2.1, §2.2, §3.1.
  • C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal (1997) Algorithm 778: L-BFGS-B: fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23 (4), pp. 550–560. External Links: Link, Document Cited by: §2.1, §3.1, §4.

Appendix A Hyperparameters of Methods

We perform an extensive grid search on the hyperparameters of each method to find the sets of hyperparameters that give the lowest SHD for each method. The chosen set of hyperparameters for the proposed method is given in Section 5. The following are the chosen sets for the other methods we compare with.

  • PCMCI+ with ParCorr

    • number of lags

    • for , respectively

  • PCMCI+ with GPDC

    • number of lags

    • for , respectively

  • TCDF

    • significance = 0.8

    • learning rate = 0.001

    • epochs = 1000

    • levels = 2

    • kernel size = number of lags

    • dilation coefficient = number of lags

  • VAR-LINGAM

    • base model = Direct-LINGAM

    • weight thresholds = 0.1

  • DYNOTEARS

    • weight thresholds = 0.01

We do not include LPCMCI Gerhardus and Runge (2020) in the SHD comparison because it does not assume the absence of latent confounders and outputs a partial ancestral graph (PAG), while the proposed method does make such an assumption and there are no latent confounders in the simulated ground-truth DAGs. Also, LPCMCI with a nonlinear CI test is computationally too expensive to compare when the number of nodes is large, as shown in Figure 6. We include PCMCI+ Runge (2020) in the comparison, which works under the same assumption as the proposed method. For CI tests, we do not consider CMIknn or CMIsymb Runge et al. (2019), which are based on conditional mutual information. Although the two CI tests are nonparametric, they are computationally expensive when the number of nodes is large, which makes them infeasible to include in the SHD experiments. We provide more details about running time in Section 5.3.

Appendix B Ice Hockey Data

Variables Type Range
time remaining in seconds continuous [0, 3600]
adjusted x coordinate of puck continuous [-100, 100]
adjusted y coordinate of puck continuous [-42.5, 42.5]
score differential categorical (, )
manpower situation categorical {even strength, short handed, power play}
x velocity of puck continuous (, )
y velocity of puck continuous (, )
event duration continuous [0, )
angle between puck and net continuous [, ]
home team taking possession binary {true, false}
shot binary {true, false}
goal binary {true, false}
Table 1: The variables in the ice hockey dataset.
Figure 7: The distributions of two non-Gaussian continuous variables and two discrete variables in the ice hockey dataset.