1 Introduction
The paper addresses the structure learning problem of learning directed acyclic graphs (DAGs) from observational time-series data. While both undirected graphs and DAGs provide insights into a system, the directed edges in DAGs can additionally describe the data generation mechanisms of the system by following the direction of the arrows Pearl (2009). When the underlying dynamics of a system are unknown, it is essential to identify them using structure learning methods. There are plenty of applications in real-world domains such as biology Sachs et al. (2005), finance Sanford and Moosa (2012), economics Appiah (2018) and sports analytics Wu et al. (2021).
Many traditional score-based or constraint-based structure learning methods require users to have the expertise to choose the correct method based on their knowledge about the underlying true data distributions or restricted function classes of the data. This creates a burden for newcomers and limits the practical usage of the methods in real-world domains where such knowledge is unavailable. On the other hand, neural-network-based methods are handy because they do not require users to have such knowledge.
There are two types of DAGs: instantaneous DAGs and temporal DAGs. The paper addresses the structure learning problem in the latter case. In instantaneous DAGs, there are no temporal dependencies among variables. The data samples are considered to be independent and identically distributed from a joint distribution. In temporal DAGs, there are temporal dependencies among variables. We consider time-series data where data samples across timesteps are dependent. We also consider the setting where multiple time series may be dependent within a single timestep. For neural-network-based methods, while acyclicity in the lagged steps can be easily ensured by following the direction of time in the data, it is hard to ensure acyclicity within the instantaneous step. To the best of our knowledge, all existing neural-network-based structure learning methods for time-series data either possibly produce cycles in the instantaneous step
Nauta et al. (2019) or do not estimate instantaneous relations at all Tank et al. (2021); Khanna and Tan (2020); Marcinkevics and Vogt (2021). However, ignoring instantaneous relations may lead to incorrect estimation of lagged relations Hyvärinen et al. (2010). Based on a recent algebraic acyclicity characterization due to nonparametric NOTEARS Zheng et al. (2020)
for learning nonparametric instantaneous DAGs with independent and identically distributed data samples, we propose a score-based structure learning method for learning nonparametric temporal DAGs with time-series data using 1D convolutional neural networks (CNNs). Similar to nonparametric NOTEARS, which uses multilayer perceptrons (MLPs), although neural networks are generally considered parametric, the estimated DAGs of the proposed method are nonparametric in the sense that they do not assume any particular parametric form of the underlying true models from which the data comes. The proposed method is easy to use and ensures acyclicity among all timesteps. It also allows prior knowledge to be added to the learning process using optimization constraints.
Contributions
Our contributions are as follows:

We propose NTS-NOTEARS, a continuous optimization approach that learns nonparametric temporal DAGs capturing linear, nonlinear, lagged and instantaneous relations among variables in time-series data. It ensures acyclicity in both the lagged and instantaneous timesteps and is faster than methods based on nonlinear conditional independence tests.

We promote the use of optimization constraints on convolutional layers to incorporate prior knowledge into the learning process and demonstrate with examples.

We show the superior performance of the proposed method by comparing it with other state-of-the-art methods using simulated time-series data.

To demonstrate the practical use of the proposed method, we apply it to real-world data that contains a mixture of binary, categorical and continuous variables acquired from NHL ice hockey games.
The paper is structured as follows. We cover preliminary background and related works in Section 2, describe the proposed method in Section 3 and explain how to incorporate prior knowledge into the proposed method in Section 4. In Section 5, we evaluate the proposed method with simulated data and real data. Finally, we conclude in Section 6.
2 Background and Related Works
2.1 Background
A DAG is a pair G = (V, E) where V represents vertices (i.e. nodes) and E represents directed edges. Each edge points from one vertex u to another vertex v, written as u → v. DAGs can be used to represent data generation mechanisms Pearl (2009). For instantaneous DAGs that have no time structure, each vertex corresponds to a random variable X_i. An edge X_j → X_i denotes that the variable X_i is generated depending on the value of variable X_j. For temporal DAGs, each vertex corresponds to a time series X_i at a discrete timestep t. An edge X_j(t − k) → X_i(t) for k ≥ 0 represents that the time series X_i at discrete timestep t is generated depending on the value of time series X_j at timestep t − k. While the time sequence may be infinite, we only consider finite timesteps in practice. The edges in the last timestep of a temporal DAG model the relations among time series within a single timestep, which arise due to a coarse sampling rate over time. The edges in all the other timesteps model the relations across timesteps. Therefore, we call the last timestep in a temporal DAG the instantaneous step and all the other timesteps the lags.

The structure learning problem addressed is as follows. Given a sequence of d stationary time series of length T, the goal is to learn a temporal DAG that captures the dependencies in the data. In general, there exist multiple DAGs that encode the same set of dependencies Peters et al. (2017). Therefore, given a set of observational data, the solution may not be unique. On the other hand, there are also identifiable cases where the solution is unique, obtained by restricting the data generation mechanisms to particular function forms Peters et al. (2017).
A recent algebraic acyclicity constraint is presented in Zheng et al. (2018), which proposes instantaneous linear NOTEARS to learn instantaneous DAGs in the linear case. The authors show that when the acyclicity constraint is satisfied, that is, when

h(W) = tr(e^{W ∘ W}) − d = 0,

the estimated directed graph is acyclic, where W ∈ R^{d×d} is the weighted adjacency matrix of the estimated graph, tr(·) and e^{(·)} are the trace and matrix exponential of a matrix, respectively, and ∘ is the elementwise (Hadamard) product. Later, instantaneous nonparametric NOTEARS Zheng et al. (2020) for learning instantaneous DAGs in the nonlinear case was proposed using MLPs. To extend the adjacency matrix to the nonlinear case, the authors define a surrogate function for W:
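As a concrete illustration, the constraint above can be evaluated numerically. The minimal sketch below (using NumPy and SciPy, with an illustrative 2-variable graph) returns a value of zero exactly when the weighted graph is acyclic:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def notears_acyclicity(W):
    """h(W) = tr(exp(W ∘ W)) − d: zero iff the graph with weights W is acyclic."""
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)

# Acyclic: a single edge 0 -> 1.
W_acyclic = np.array([[0.0, 1.5],
                      [0.0, 0.0]])
# Cyclic: edges 0 -> 1 and 1 -> 0.
W_cyclic = np.array([[0.0, 1.5],
                     [0.7, 0.0]])
```

Squaring W elementwise makes the penalty insensitive to the sign of the weights, and the trace of the matrix exponential counts weighted closed walks of every length, so any cycle strictly increases h.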
[W(θ)]_{j i} = ||θ_i^{(1)}(j)||_2, where θ is the set of parameters of the MLPs and θ_i^{(1)} is the set of parameters in the first layer of the MLP that predicts a target variable X_i. Let L denote the least-squares loss. To learn the DAGs, the following objective function is optimized:

min_θ L(θ) + λ ||θ^{(1)}||_1 subject to h(W(θ)) = 0,

where h is the acyclicity constraint above and λ controls the sparsity of the estimated graph.
2.2 Related Works
Besides the advances in structure learning methods for instantaneous DAGs, learning temporal DAGs with multiple timesteps from time-series data is also a popular topic. DYNOTEARS Pamfil et al. (2020) extends instantaneous linear NOTEARS to the temporal linear case using linear autoregression. VAR-LiNGAM Hyvärinen et al. (2010) extends LiNGAM Shimizu et al. (2006, 2011), a restricted instantaneous linear model class with additive non-Gaussian noise, to the temporal case. PCMCI+ Runge (2020) and LPCMCI Gerhardus and Runge (2020) are constraint-based methods for time-series data that utilize conditional independence (CI) tests with and without the assumption of causal sufficiency, respectively. Under the assumption of causal sufficiency, PCMCI+ outputs a completed partially directed acyclic graph (CPDAG) with multiple timesteps; LPCMCI outputs a partial ancestral graph (PAG) that takes latent confounders into account. They are linear or nonlinear depending on the CI tests chosen by the users. However, nonlinear CI tests are computationally expensive Zhang et al. (2011); Runge (2018); Zheng et al. (2020); Runge et al. (2019).
There are also several neural-network-based structure learning methods for time-series data. Unlike our method, most of them do not estimate edges in the instantaneous step. For example, cMLP and cLSTM Tank et al. (2021) use MLPs and LSTMs, respectively, to estimate temporal DAGs without edges in the instantaneous step. Economy-SRU Khanna and Tan (2020) is an RNN-based method that learns temporal DAGs without instantaneous edges. GVAR Marcinkevics and Vogt (2021) estimates summary graphs using self-explaining neural networks, and the estimated summary graphs do not have instantaneous edges either. TCDF Nauta et al. (2019) uses attention-based CNNs to estimate temporal DAGs. The estimated temporal DAGs contain edges in both the lagged and instantaneous steps. Under the assumption of no spurious correlations, the authors claim the method detects latent confounders. However, it may introduce cycles in the instantaneous step, whereas our method ensures acyclicity throughout the entire estimated graph.
3 NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data
3.1 Formulation
We propose NTS-NOTEARS, a structure learning method that captures linear and nonlinear relations in time-series data, estimates edges in both the lagged and instantaneous steps, ensures acyclicity throughout the entire graph and is computationally much faster than constraint-based methods with nonlinear CI tests. We adapt the acyclicity constraint from instantaneous nonparametric NOTEARS to time-series data using 1D CNNs.
Let K be the number of lags in the estimated graph and X = {x(1), ..., x(T)} be a sequence of d stationary time series of length T. We train d CNNs, where the i-th CNN is trained to predict the target variable x_i(t) at each timestep t given the input variables x(t − K), ..., x(t). The first layer of each CNN is a 1D convolution layer with m1 kernels, a stride equal to 1 and no padding. The shape of each convolutional kernel is d × (K + 1), where K + 1 is the number of timesteps in the estimated graph, i.e. K lags plus 1 instantaneous step. Let θ = (θ^conv, θ^fc) denote all the parameters of the CNNs, where θ^conv denotes the parameters in the convolutional layers and θ^fc denotes the parameters in the fully-connected layers. Let θ_i^conv denote the parameters in the convolutional layer of the CNN that predicts the target variable x_i. Let θ_i^conv(·, k) represent the parameters on the k-th columns of the convolutional kernels of the CNN that predicts the target variable x_i. For example, θ_i^conv(·, 1) denotes the first columns of the convolutional kernels, which represent the earliest lag, and θ_i^conv(·, K + 1) denotes the last columns of the convolutional kernels, which represent the instantaneous step. Let L denote the least-squares loss. The training objective function is defined as:

min_θ L(θ; X) + λ1 ||θ^conv||_1 + λ2 ||θ^conv||_2²   (1)

subject to the acyclicity constraint defined in equation (3) below, where λ1 and λ2 control the sparsity and the magnitude of the convolutional parameters, respectively.
We define a surrogate function that takes the L2 norm of the corresponding parameters to denote how strong the estimated nonlinear dependency on an edge is:

[W_k(θ)]_{j i} = ||θ_i^conv(j, k)||_2,   (2)

where θ_i^conv(j, k) is the set of parameters at input channel j on the k-th columns of the convolutional kernels of the CNN that predicts x_i. The value of each [W_k(θ)]_{j i} represents how strong the estimated dependency between node x_j at the k-th timestep and node x_i is. It is not an edge coefficient as in the linear case.
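To make equation (2) concrete, the sketch below computes the surrogate strengths from first-layer kernel weights. The tensor layout (kernels × input variables × timesteps) and the sizes are illustrative assumptions, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, m1 = 5, 2, 8  # variables, lags, number of kernels (illustrative sizes)
# First-layer kernels of the CNN predicting variable x_i: m1 x d x (K+1).
theta_conv_i = rng.normal(size=(m1, d, K + 1))

def dependency_strengths(theta):
    """Eq. (2): the strength of the edge from variable j at timestep k to the
    target is the L2 norm, taken over kernels, of the weights at (j, k)."""
    return np.sqrt((theta ** 2).sum(axis=0))  # shape (d, K+1)

W_i = dependency_strengths(theta_conv_i)  # column -1 holds instantaneous edges
```

A strength of exactly zero means every kernel ignores that input variable at that timestep, i.e. no edge.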
Since the model predicts each x_i(t) in the instantaneous step, every estimated edge points to a node in the instantaneous step. Therefore, there can be no cycles involving the lagged steps, and we only need to ensure acyclicity within the instantaneous step. Let W_{K+1}(θ) denote the estimated strengths of the edges in the instantaneous step of an estimated temporal DAG. Following Zheng et al. (2020) for instantaneous DAGs, we define the following acyclicity constraint to ensure acyclicity in the temporal DAG:

h(W_{K+1}(θ)) = tr(e^{W_{K+1}(θ) ∘ W_{K+1}(θ)}) − d = 0.   (3)
The augmented Lagrangian method can be applied here to convert the constrained optimization problem into an unconstrained one, which can be optimized using the L-BFGS-B algorithm Byrd et al. (1995); Zhu et al. (1997):

min_θ L(θ; X) + λ1 ||θ^conv||_1 + λ2 ||θ^conv||_2² + (ρ/2) h(W_{K+1}(θ))² + α h(W_{K+1}(θ)),   (4)

where ρ > 0 is the penalty parameter and α is the Lagrange multiplier estimate.
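The penalty-plus-multiplier structure of the augmented Lagrangian can be sketched as follows; here `loss` stands in for the least-squares and regularization terms, and the names `rho` and `alpha` follow the standard augmented-Lagrangian convention (an illustration, not the paper's implementation):

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """h(W) = tr(exp(W ∘ W)) − d on the instantaneous-step strengths."""
    return float(np.trace(expm(W * W)) - W.shape[0])

def augmented_lagrangian(loss, W_inst, rho, alpha):
    """Unconstrained surrogate: loss + (rho/2) * h^2 + alpha * h."""
    h = acyclicity(W_inst)
    return loss + 0.5 * rho * h ** 2 + alpha * h

# When the instantaneous slice is acyclic, the penalty terms vanish.
W_inst = np.array([[0.0, 0.3],
                   [0.0, 0.0]])
value = augmented_lagrangian(1.0, W_inst, rho=10.0, alpha=2.0)
```

In the usual scheme, ρ is increased and α is updated between rounds of minimization until h is driven to (numerical) zero.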
Figure 1 shows the architecture of the proposed model for one single target variable. The overall model for the entire dataset contains d such architectures and can be trained jointly as one single network. Weight thresholds can be applied to each timestep in the estimated graph to prune off insignificant edges with weak dependency strengths and to further ensure acyclicity despite machine-precision issues in equation (3) Zheng et al. (2018). An example of an estimated graph with dependency strengths on the edges is shown in Figure 5.
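Pruning by weight threshold amounts to zeroing entries of the estimated strength matrix; a minimal sketch (the matrix values and the threshold are illustrative):

```python
import numpy as np

def prune(W, threshold):
    """Zero out edges whose estimated dependency strength is below the threshold."""
    W = W.copy()
    W[W < threshold] = 0.0
    return W

W_est = np.array([[0.0,  0.02, 1.3],
                  [0.0,  0.0,  0.4],
                  [0.01, 0.0,  0.0]])
W_pruned = prune(W_est, threshold=0.1)  # keeps only the two strong edges
```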
3.2 Assumptions and Identifiability
Throughout the entire article, we assume there are no latent confounders in the data, since latent confounders may introduce spurious dependencies between nodes. We also assume the underlying data generating process is fixed and stationary over time.
Like most other optimization problems solved using neural networks, the optimization problem (4) is nonconvex, and solving it to global optimality is nearly impossible. Furthermore, although the proposed method is a general method applicable to (but not limited to) identifiable function classes, we do not make any identifiability claims or assume any parametric forms of the true models. Despite the nonconvexity and the lack of identifiability claims, in Section 5 we show that the proposed method outperforms other methods on simulated identifiable true models and works well with real-world data. For demonstration purposes, a pair of a true graph and a graph estimated using the proposed method is given in Figure 2.


(Figure 3 caption) The means and standard deviations over 10 runs for each setting. The number of timesteps is 2. Lower SHD is better. NTS-NOTEARS is the proposed method.

4 Optimization Constraints as Prior Knowledge
Adding prior knowledge about the true graph to the learning process increases not only the accuracy but also the speed of learning, since the number of parameters that need to be learned is reduced Shimizu et al. (2011). We promote the use of the L-BFGS-B algorithm Byrd et al. (1995); Zhu et al. (1997) with optimization constraints on the convolutional layers to encode various kinds of prior knowledge into the learning process (e.g. the absence or presence of directed edges and the specification of dependency strengths between nodes). The L-BFGS-B algorithm is a second-order, memory-efficient nonlinear optimization algorithm that allows bound constraints on variables. The algorithm allows users to define which variables are free and which are constrained. For each constrained variable, the user provides a lower bound and an upper bound. At every step, the algorithm optimizes its objective function while keeping all the constrained variables within their bounds. By utilizing the bound constraints of the L-BFGS-B algorithm, we integrate prior knowledge into the proposed method. We give the formulation here. A demonstration of the benefits of having prior knowledge is given in Section 5.2.
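SciPy exposes exactly this kind of bound-constrained optimization. The toy problem below (a quadratic objective with an illustrative target vector) pins one parameter at zero, forces another above a lower bound and leaves a third free:

```python
import numpy as np
from scipy.optimize import minimize

target = np.array([0.2, 3.0, -1.0])  # illustrative optimum of the toy objective

def objective(x):
    return float(((x - target) ** 2).sum())

bounds = [(0.5, None),    # constrained: lower bound 0.5, no upper bound
          (0.0, 0.0),     # constrained: pinned to 0 (e.g. "edge absent")
          (None, None)]   # free parameter

res = minimize(objective, x0=np.array([1.0, 0.0, 0.0]),
               method="L-BFGS-B", bounds=bounds)
# res.x minimizes the objective while respecting every bound.
```

The first component settles at its lower bound 0.5 (the unconstrained optimum 0.2 is infeasible), the second stays at 0, and the third converges to its unconstrained optimum.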
Let θ = (θ_free, θ_con), where θ_free denotes the free parameters without constraints and θ_con denotes the constrained parameters with lower bounds l and upper bounds u. The objective function (4) becomes bound-constrained: it is minimized subject to l ≤ θ_con ≤ u.
Since the presence and absence of an edge in an estimated DAG are encoded by the parameters in the convolutional layer, the constraints only need to be added to the convolutional layer. For example, to add prior knowledge that enforces the absence of an edge from x_j in the most recent lag to x_i in the instantaneous step, the following constraint can be applied:

0 ≤ θ_i^conv(j, K) ≤ 0,

where θ_i^conv(j, K) represents the weights at input channel j on the K-th columns of the convolutional kernels for the target variable x_i, which correspond to the most recent lag. The minimum value of a bound is 0 for the proposed method, because we take the L2 norm of the parameters in equation (2) and the estimated dependency strengths on edges are non-negative. Similarly, to add prior knowledge that enforces the presence of an edge from x_j in the earliest lag to x_i in the instantaneous step, the following constraint can be applied:

δ_l ≤ θ_i^conv(j, 1) ≤ δ_u,

where δ_l and δ_u are positive numbers with δ_l ≤ δ_u. The upper bound is optional; when δ_u is not specified, there is no upper bound.
The positive constraints δ_l and δ_u apply to the parameters of the convolutional layer, not directly to the estimated dependency strength on the target edge. According to equation (2), the estimated dependency strength on the target edge equals the L2 norm of the parameters. In order for the estimated dependency strength of the proposed method to fall between δ_l and δ_u, the constraints need to be scaled before they are applied to the L-BFGS-B algorithm. Let δ denote a positive constraint and m1 be the number of kernels in the convolutional layer of each CNN. Each δ is scaled as follows before being applied to the L-BFGS-B algorithm:

δ ← δ / √m1,

so that when all m1 parameters of a kernel column sit at the scaled bound, their L2 norm equals the original δ.
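The scaling step can be checked numerically. The sketch below implements the division-by-√m1 reconstruction stated above, with an illustrative kernel count; it is not the paper's exact code:

```python
import numpy as np

def scale_bound(delta, num_kernels):
    """Per-parameter bound such that, when all num_kernels column parameters
    sit at the bound, their L2 norm equals the desired strength delta.
    (Reconstruction of the elided scaling, assuming one weight per kernel
    per column.)"""
    return delta / np.sqrt(num_kernels)

m1 = 8                           # illustrative number of kernels
per_param = scale_bound(2.0, m1)
column = np.full(m1, per_param)  # all parameters at the scaled bound
```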
We provide simple APIs in our code for users to specify the dependency strengths without knowing the technical details.
5 Evaluation
All the experiments are performed on a computer equipped with an Intel Core i7-6850K CPU at 3.60 GHz, 32 GB of memory and an Nvidia GeForce GTX 1080 Ti GPU. Unless otherwise stated, throughout the entire section we set , the number of hidden layers , and .
5.1 Simulated Data
To validate the proposed method, we generate simulated data based on random graphs and identifiable nonlinear structural equation models (SEMs) Peters et al. (2017), then compare the proposed method with several recent structure learning methods: TCDF Nauta et al. (2019), DYNOTEARS Pamfil et al. (2020), VAR-LiNGAM Hyvärinen et al. (2010) and PCMCI+ Runge (2020). Methods that do not estimate edges in the instantaneous step are not comparable with the proposed method; therefore, we exclude methods such as economy-SRU Khanna and Tan (2020), GVAR Marcinkevics and Vogt (2021), cMLP and cLSTM Tank et al. (2021) from the comparison.
Structural Hamming Distance (SHD) is the metric we use to measure the accuracy of the methods. SHD is defined as the number of edge additions, removals and reversals needed to convert the estimated graph to the true graph (lower is better).
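A minimal sketch of SHD for binary adjacency matrices, counting a single reversed edge as one error (the matrices below are illustrative):

```python
import numpy as np

def shd(B_true, B_est):
    """Edge additions + removals + reversals needed to turn the estimated
    binary adjacency matrix into the true one (a reversal counts once)."""
    errors = 0
    d = B_true.shape[0]
    for i in range(d):
        for j in range(i + 1, d):
            t = (int(B_true[i, j]), int(B_true[j, i]))
            e = (int(B_est[i, j]), int(B_est[j, i]))
            if t == e:
                continue
            if {t, e} == {(1, 0), (0, 1)}:  # single edge, reversed direction
                errors += 1
            else:                           # additions/removals count per edge
                errors += abs(t[0] - e[0]) + abs(t[1] - e[1])
    return errors

B_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])      # 0 -> 1 -> 2
B_missing = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])   # edge 1 -> 2 removed
B_reversed = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0]])  # edge 0 -> 1 reversed
```

For a temporal DAG, the same count can be applied per timestep slice and summed.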
The true DAGs are sampled uniformly given the number of nodes, the number of edges and the number of lags. Given a true DAG, the data is simulated based on one of the following identifiable SEMs: additive noise models (ANM) Peters et al. (2017), index models (IM) Yuan (2011); Alquier and Biau (2013), and generalized linear models (GLM) with Poisson distribution
Park and Park (2019). Although we make no causal claims for the proposed method, the data generation mechanisms are causal. While the data simulated with the GLM with Poisson distribution is categorical, the data simulated with the other models is continuous with Gaussian noise.

In Figure 3, we compare the methods with two sequence lengths representing the cases of an inadequate and an adequate amount of data, respectively. For each method and each sequence length, we perform a grid search on the hyperparameters of the methods using data generated with the index model, and the best sets of hyperparameters are chosen and their results reported. For PCMCI+, we measure the performance with both linear and nonlinear CI tests, i.e. the partial correlation test (ParCorr) and the Gaussian process regression plus distance correlation test (GPDC) Runge et al. (2019), respectively. Nonparametric CI tests, such as CMIknn and CMIsymb Runge et al. (2019), and LPCMCI Gerhardus and Runge (2020) are infeasible to include in the experiments; please see appendix A for details. For the proposed method, are used for , respectively. Please see appendix A for details on the chosen hyperparameters of the other methods. We examine each sequence length and each SEM with four different numbers of nodes in the graph. The proposed method is outperformed by PCMCI+ with the GPDC CI test in the setting of ANM with . However, PCMCI+ outputs CPDAGs that may have undirected edges, which we always evaluate as correct regardless of the true direction of the edges in the true graph. Furthermore, the GPDC CI test requires continuous variables with additive noise, whereas the proposed method does not assume any parametric forms of the models. Apart from that case, the proposed method outperforms every other method in all cases. Also, with longer data sequences, all the methods tend to perform better.

5.2 Prior Knowledge With Simulated Data
Adding prior knowledge into the learning process of the proposed method facilitates the recovery of the true DAGs. The parts of the DAG explicitly encoded by the prior knowledge are assured to be estimated as expected by the users. On the other hand, it may also help to recover edges that are not explicitly encoded by the prior knowledge. In Figure 4, we apply the proposed method to a DAG containing 2 timesteps and 7 nodes per timestep. Without prior knowledge, the method recovers most of the true edges except (reversed as ), (reversed as ), (missed, estimated strength ) and (missed, estimated strength ). The SHD between the true graph and the estimated graph is 4. By simply adding a lower bound constraint (no upper bound) that encodes the prior knowledge that there exists an edge , the method recovers the true edge . The SHD is reduced to 3. Solely with one constraint that encodes the prior knowledge of no edge , the method recovers not only the true edge that is directly relevant to the provided prior knowledge but also another true edge that is not directly relevant. With this constraint alone, the SHD is reduced to 2. It shows that providing prior knowledge to the proposed method may help it to recover edges that are not explicitly encoded by the prior knowledge.
5.3 Running Time
In Figure 6, we compare the running times of the methods that capture nonlinear relations. The neural networks and the GPDC CI test are accelerated by the same GPU. Although PCMCI+ with the GPDC CI test outperforms the proposed method in some cases of the SHD comparison, it is more computationally expensive, which makes it unscalable to high dimensions. PCMCI+ with the nonparametric CMIsymb CI test for discrete variables raises a memory error when the number of nodes is , while PCMCI+ with the nonparametric CMIknn CI test for continuous variables is too computationally expensive as shown, which makes them infeasible to include in the SHD comparison. The proposed method achieves better scalability compared to recent constraint-based temporal methods with nonlinear CI tests.
5.4 Real Data
We apply the proposed method to real-world data collected by Sportlogiq from ice hockey games in the 2018-2019 NHL season. The proprietary dataset contains a mixture of continuous, binary and categorical variables. The true distribution of each variable is unspecified to the proposed method, and the continuous variables may or may not follow Gaussian distributions. Please see Table 1 and Figure 7 in the appendix for the data description and distributions. The proposed method is easy to use for users without knowledge about CI tests, underlying data distributions, restricted function classes, or variable types. Figure 5 shows the DAG learned using the proposed method with the same network architecture as in Section 5.1 and with hyperparameters , and . We intentionally choose larger weight thresholds to enhance sparsity in the learned DAG for easy and clear interpretation. Since the play restarts after a goal is scored (i.e. a faceoff), we incorporate prior knowledge that encodes the absence of outgoing edges from goal(t) or goal(t−1) to any other nodes. The estimated DAG captures many meaningful relationships between variables. An interesting question to ask in ice hockey is “what contributes to a goal?” Sun et al. (2020); Schulte et al. (2017a, b). By identifying the predecessors of the node goal(t) in the estimated DAG, we can answer the question: the precedent shot, the duration of the shot, the distance between the shot and the net (i.e. xAdjCoord), the manpower situation and the velocity of the puck are important for scoring a goal.

6 Conclusion
In this paper, we propose NTS-NOTEARS, a score-based structure learning method using 1D CNNs for time-series data, with or without prior knowledge, to learn nonparametric DAGs containing both lagged and instantaneous edges. The learned DAGs are ensured to be acyclic. With simulated data, we demonstrate the superior accuracy and running speed of the proposed method compared to several state-of-the-art methods. We also apply the proposed method to complex real-world data that contains a mixture of continuous and discrete variables without knowing the true underlying data distributions. A next step could be extending the proposed method to account for latent variables in the data. Besides sports, it would also be interesting to apply the proposed method to diverse real-world domains. Another topic for future work is to incorporate the proposed method into model-based reinforcement learning.
References
 Sparse single-index model. J. Mach. Learn. Res. 14 (1), pp. 243–280. External Links: Link Cited by: §5.1.
 Investigating the multivariate granger causality between energy consumption, economic growth and CO2 emissions in ghana. Energy Policy 112, pp. 198–208. Cited by: §1.
 A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16 (5), pp. 1190–1208. External Links: Link, Document Cited by: §2.1, §3.1, §4.
 High-recall causal discovery for autocorrelated time series with latent confounders. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix A, §2.2, §5.1.

 Estimation of a structural vector autoregression model using non-gaussianity. J. Mach. Learn. Res. 11, pp. 1709–1731. External Links: Link Cited by: §1, §2.2, §5.1.
 Economy statistical recurrent units for inferring nonlinear granger causality. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §2.2, §5.1.
 Interpretable models for granger causality using self-explaining neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §1, §2.2, §5.1.
 Causal discovery with attention-based convolutional neural networks. Mach. Learn. Knowl. Extr. 1 (1), pp. 312–340. External Links: Link, Document Cited by: §1, §2.2, §5.1.

 DYNOTEARS: structure learning from time-series data. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 1595–1605. External Links: Link Cited by: §2.2, §5.1.
 High-dimensional poisson structural equation model learning via ℓ1-regularized regression. J. Mach. Learn. Res. 20, pp. 95:1–95:41. External Links: Link Cited by: §5.1.
 Causality. Cambridge university press. Cited by: §1, §2.1.
 Elements of causal inference: foundations and learning algorithms. The MIT Press. Cited by: §2.1, §5.1, §5.1.
 Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances 5 (11), pp. eaau4996. Cited by: Appendix A, §2.2, §5.1.
 Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, A. J. Storkey and F. Pérez-Cruz (Eds.), Proceedings of Machine Learning Research, Vol. 84, pp. 938–947. External Links: Link Cited by: §2.2.
 Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, R. P. Adams and V. Gogate (Eds.), Proceedings of Machine Learning Research, Vol. 124, pp. 1388–1397. External Links: Link Cited by: Appendix A, §2.2, §5.1.
 Causal protein-signaling networks derived from multiparameter single-cell data. Science 308 (5721), pp. 523–529. Cited by: §1.

 A bayesian network structure for operational risk modelling in structured finance operations. Journal of the Operational Research Society 63 (4), pp. 431–444. Cited by: §1.
 A markov game model for valuing actions, locations, and team performance in ice hockey. Data Min. Knowl. Discov. 31 (6), pp. 1735–1757. External Links: Link, Document Cited by: §5.4.
 Apples-to-apples: clustering and ranking nhl players using location information and scoring impact. In Proceedings of the MIT Sloan Sports Analytics Conference, Cited by: §5.4.
 A linear non-gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, pp. 2003–2030. External Links: Link Cited by: §2.2.
 DirectLiNGAM: a direct method for learning a linear non-gaussian structural equation model. J. Mach. Learn. Res. 12, pp. 1225–1248. External Links: Link Cited by: §2.2, §4.
 Cracking the black box: distilling deep sports analytics. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 3154–3162. External Links: Link, Document Cited by: §5.4.
 Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: ISSN 1939-3539, Link, Document Cited by: §1, §2.2, §5.1.
 A contextual analysis of crossing the ball in soccer. Journal of Quantitative Analysis in Sports 17 (1), pp. 57–66. Cited by: §1.
 On the identifiability of additive index models. Statistica Sinica, pp. 1901–1911. Cited by: §5.1.
 Kernel-based conditional independence test and application in causal discovery. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011, F. G. Cozman and A. Pfeffer (Eds.), pp. 804–813. External Links: Link Cited by: §2.2.
 DAGs with NO TEARS: continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9492–9503. External Links: Link Cited by: §2.1, §3.1.
 Learning sparse nonparametric dags. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 3414–3425. External Links: Link Cited by: NTS-NOTEARS: Learning Nonparametric Temporal DAGs With Time-Series Data and Prior Knowledge, §1, §2.1, §2.2, §3.1.
 Algorithm 778: L-BFGS-B: fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23 (4), pp. 550–560. External Links: Link, Document Cited by: §2.1, §3.1, §4.
Appendix A Hyperparameters of Methods
We perform an extensive grid search on the hyperparameters of each method to find the sets of hyperparameters that give the lowest SHD for each method. The chosen set of hyperparameters for the proposed method is given in Section 5. The following are the chosen sets for the other methods we compare with.

PCMCI+ with ParCorr


number of lags

for , respectively


PCMCI+ with GPDC


number of lags

for , respectively


TCDF

significance = 0.8

learning rate = 0.001

epochs = 1000

levels = 2

kernel size = number of lags

dilation coefficient = number of lags


VARLINGAM

base model = DirectLiNGAM

weight thresholds = 0.1


DYNOTEARS


weight thresholds = 0.01

We do not include LPCMCI Gerhardus and Runge (2020) in the SHD comparison because it does not assume the absence of latent confounders and outputs a partial ancestral graph (PAG), while the proposed method does make such an assumption and there are no latent confounders in the simulated ground-truth DAGs. Also, LPCMCI with a nonlinear CI test is computationally too expensive to compare when the number of nodes is large, as shown in Figure 6. We include PCMCI+ Runge (2020) in the comparison, which works under the same assumptions as the proposed method. For CI tests, we do not consider CMIknn or CMIsymb Runge et al. (2019), which are based on conditional mutual information. Although the two CI tests are nonparametric, they are computationally expensive when the number of nodes is large, which makes them infeasible to include in the SHD experiments. We provide more details about running time in Section 5.3.
Appendix B Ice Hockey Data
Variables  Type  Range 

time remaining in seconds  continuous  [0, 3600] 
adjusted x coordinate of puck  continuous  [−100, 100] 
adjusted y coordinate of puck  continuous  [−42.5, 42.5] 
score differential  categorical  (, ) 
manpower situation  categorical  {even strength, short handed, power play} 
x velocity of puck  continuous  (, ) 
y velocity of puck  continuous  (, ) 
event duration  continuous  [0, ) 
angle between puck and net  continuous  [, ] 
home team taking possession  binary  {true, false} 
shot  binary  {true, false} 
goal  binary  {true, false} 