## 1 Introduction

Learning causal interactions from observational multivariate time series is generally impossible. Among many challenges [shojaie2021granger], two of the most important ones are that i) the data acquisition rate may be much slower than the underlying rate of changes; and ii) there may be unmeasured confounders [Clark2019, Reid2019]. First, due to the cost or technological constraints, the data acquisition rate may be much slower than the underlying rate of changes. In such settings, the most commonly used procedure for inferring interactions among time series, Granger causality, may both miss true interactions and identify spurious ones [BREITUNG2002, Silvestrini2008, tank2019]. Second, the available data may only include a small fraction of potentially relevant variables, leading to unmeasured confounders. Naïve connectivity estimators that ignore these confounding effects can produce highly biased results [soudry2014shotgun]. Therefore, reliably distinguishing causal connections between pairs of observed processes from correlations induced by common inputs from unobserved confounders remains a key challenge.

Learning causal interactions between neurons is critical to understanding cognitive functions. Many existing neuroscience data, such as data collected using functional magnetic resonance imaging (fMRI), have relatively low temporal resolutions, and are thus of limited utility for causal discovery

[Lin2014]. This is because many important neuronal processes and interactions happen at finer time scales [Zhou2014AnalysisOS]. New technologies, such as calcium florescent imaging that generate spike train data, make it possible to collect “live” data that are at high temporal resolutions [Prevedel2014]. The spike train data, which are multivariate point processes containing spiking times of a collection of neurons, are increasingly used to learn the latent brain connectivity networks and to glean insight into how neurons respond to external stimuli [Okatan2005]. For example, Boldingeaat6904 collected spike train data on neurons in mouse olfactory bulb region at 30kHz under multiple laser intensity levels to study the odor identification mechanism. Despite progress in recording the activity of massive populations of neurons [Antal2013], simultaneously monitoring a complete network of spiking neurons at high temporal resolutions is still beyond the reach of the current technology. In fact, most experiments only collect data on a small fraction of neurons, leaving many unobserved neurons [trong_rieke_2008, Tchumatchenko2011, Huang2015]. These hidden neurons may potentially interact with the neurons inside the observed set and cannot be ignored. Nevertheless, given its high temporal resolution, spike train data provide an opportunity for causal discovery if we can account for the unmeasured confounders.When unobserved confounders are a concern, causal effects among the observed variables can be learned using causal structural learning approaches, such as the Fast Causal Inference (FCI) algorithm and its variants [Spirtes2000, Clark2019]. However, these algorithms may not identify all causal edges. Specifically, instead of learning the directed acyclic graph (DAG) of causal interactions, FCI learns the maximally ancestral graph (MAG). This graph includes causal interactions between variables that are connected by directed edges, but also bi-directed edges among some other variables, leaving the corresponding causal relationships undetermined. As a result, causality discovery using these algorithms is not always satisfactory. For example, Malinsky2018 recently applied FCI to infer causal network of time series and found a low recall for identifying the true casual relationships. Additionally, despite recent efforts [chen2021causal], causal structure learning remains computationally intensive, because the space of candidate causal graphs grows super-exponentially with the number of network nodes.

The Hawkes process [Hawkes1971]

is a popular model for analyzing multivariate point process data. In this model, the probability of future events for each component can depend on the entire history of events of other components. Under straightforward conditions, the multivariate Hawkes process reveals Granger causal interactions among multivariate point processes

[eichler2017graphical]. Moreover, assuming that all relevant processes are observed in a linear Hawkes process, causal interactions among components can also be inferred [Bacry2016]. The Hawkes process thus provides a flexible and interpretable framework for investigating the latent network of point processes and is widely used in neuroscience applications [Brillinger1988, Johnson1996, krumin_reutsky_shoham_2010, Pernice2011, Reynaud2013, TRUCCOLO2016336, LAMBERT20189].In modern applications, it is common for the number of measured components, e.g., the number of neurons, to be large compared to the observed period, e.g., the duration of neuroscience experiments. The high-dimensional nature of data in such applications poses challenges to learning the connectivity network of a multivariate Hawkes process. To address this challenge, hansen2015 and Shizhe2017 proposed -regularized estimation procedures and wang2020statistical recently developed a high-dimensional inference procedure to characterize the uncertainty of these regularized estimators. However, due to the confounding from unobserved neurons in practice, existing estimation and inference procedures assuming complete observation from all components, may not provide reliable estimates.

Accounting for unobserved confounders in high-dimensional regression has been the subject of recent research. Two such examples are HIVE [bing2020adaptive] and trim regression [cevid2020spectral]

, which facilitate causal discovery using high-dimensional regression with unobserved confounders. However, these methods are designed for linear regression with independent observations and do not apply to the long-history temporal dependency setting of Hawkes processes. Moreover, they rely on specific assumptions on observed and unobsvered causal effects, which are not clear to hold in neuronal network settings.

In this paper, we consider learning causal interactions among high-dimensional point processes with (potentially many) hidden confounders. Considering the generalization of the above two approaches to the setting of Hawkes processes, we show that the assumption required by trim regression is more likely to hold in a stable point process network, especially when the confounders affect many observed nodes. Motivated by this finding, we propose a generalization of the trim regression, termed hp-trim, for causal discovery from high-dimensional point processes in the presence of (potentially many) hidden confounders. We establish a non-asymptotic convergence rate in estimating the network edges using this procedure. Unlike the previous result for independent data [cevid2020spectral], our result considers both the temporal dependence of the Hawkes processes as well as the network sparsity. Using simulated and real data, we also show that hp-trim has superior finite-sample performance compared to the corresponding generalization of HIVE for point processes and/or the naïve approach that ignores the unobserved confounders.

## 2 The Hawkes Processes with Unobserved Components

### 2.1 The Hawkes Process

Let

be a sequence of real-valued random variables, taking values in

, with and almost surely. Here, time is a reference point in time, e.g., the start of an experiment, and is the duration of the experiment. A simple point process on is defined as a family , where denotes the Borel -field of the real line and . The process is essentially a simple counting process with isolated jumps of unit height that occur at . We write as , where denotes an arbitrarily small increment of .Let be a -variate counting process , where, as above, satisfies for with denoting the event times of . Let be the history of prior to time . The intensity process is a -variate -predictable process, defined as

(1) |

Hawkes1971 proposed a class of point process models in which past events can affect the probability of future events. The process is a linear Hawkes process if the intensity function for each unit takes the form

(2) |

where

(3) |

Here, is the background intensity of unit and is the transfer function. In particular, represents the influence from the th event of unit on the intensity of unit at time .

Motivated by neuroscience applications [Linderman2014, MAGRANSDEABRIL2018120], we consider a parametric transfer function of the form

(4) |

with a transition kernel that captures the decay of the dependence on past events. This leads to , where the integrated stochastic process

(5) |

summarizes the entire history of unit of the multivariate Hawkes processes. A commonly used example is the exponential transition kernel, [Bacry2015].

Assuming that the model holds and all relevant processes are observed, it follows from Bacry2015 that the connectivity coefficient represents the strength of the causal dependence of unit ’s intensity on unit ’s past events. A positive implies that past events of unit excite future events of unit and is often considered in the literature [see, e.g., Bacry2015, Etesami2016]. However, we might also wish to allow for negative values to represent inhibitory effects [Shizhe2017, costa2018], which are expected in neuroscience applications [Purves2001].

Denoting and , we can write

(6) |

Furthermore, let and . Then the linear Hawkes process can be written compactly as

(7) |

### 2.2 The Confounded Hawkes Process

Because of technology constraints, neuroscience experiments usually collect data from only a small portion of neurons. As a result, many other neurons that potentially interact with the observed neurons will be unobserved. Consider a network of counting processes, where we only observe the first components. The number of unobserved neurons, , is usually unknown and likely much greater than . Extending (7) to include the unobserved components, we obtain the confounded Hawkes model,

(8) |

in which denotes the integrated processes of the hidden components, and denotes the connectivity coefficients from the unobserved components to unit .

Unless the observed and unobserved processes are independent, the naïve estimator that ignores the unobserved components will produce misleading conclusion about the causal relationship among the observed components. This is illustrated in the simple linear vector autoregressive process of Figure

1. This example includes three continuous random variables generated according to the following set of equations

where are mean zero innovation or error terms. The Granger causal network corresponding to the above process is shown in Figure 1A. Figure 1B shows that if is not observed, the conditional means of the observed variables and , namely,

leads to incorrect Granger causal conclusions—in this case, a spurious autoregressive effect from the past values of . The same phenomenon occurs in Hawkes processes with unobserved components.

Throughout this paper, we assume that the confounded linear Hawkes model in (8) is stationary, meaning that for all units , the spontaneous rates and strengths of transition are constant over the time range [Bremaud1996, Daley2003].

## 3 Estimating Causal Effects in Confounded Hawkes Processes

### 3.1 Extending trim regression to Hawkes Processes

Let be the projection coefficient of onto such that

(9) |

We can write the confounded linear Hawkes model in (8) in the form of the perturbed linear model [cevid2020spectral]:

(10) |

where . By the construction of , is uncorrelated with the observed processes and represents the bias, or the perturbation, due to the confounding from . In general, unless .

The perturbed model in (10) is generally unidentifiable because we can only estimate from the observed data, e.g., by regressing on . The trim regression [cevid2020spectral] is a two-step deconfounding procedure to estimate

for independent and Gaussian-distributed data. The method first applies a simple spectral transformation, called trim transformation (described below), to the observed data. It then estimates

, using penalized regression. When is sufficiently small, the method consistently estimates . Although this condition is generally not valid for Gaussian-distributed data, previous work on Hawkes processes [Shizhe2017] implies that the confounding magnitude cannot be large when the underlying network is stable, particularly when the confounders affect many observed components (see the discussion following Corollary 1 in Section 4). This allows us to generalize the trim regression to learn the network of multivariate Hawkes processes.Assume, without loss of generality, that the first components are observed at times indexed from to . Let be the design matrix of the observed integrated process and be the vector of observed outcomes. Further, let

be the singular value decomposition on

, where , and ; here, is the rank of . Denoting the non-zero diagonal entries of by , the spectral transformation is given by(11) |

Denoting by a diagonal matrix with entries , the first step of hp-trim involves applying the spectral transformation to the observed data to obtain

(12) | ||||

(13) |

The spectral transformation is designed to reduce the magnitude of confounding. In particular, when aligns with the top eigen-vectors of , for an appropriate , e.g., , the magnitude of is small compared with . Here, is a threshold parameter and the trim transformation is a special case of the spectral transformation when . See cevid2020spectral for additional details.

In the second step, we then estimate the network connectivities using the transformed data by solving the following optimization problem

(14) |

which is an instance of lasso regression

[TibsLasso1996] and can be solved separately for each .### 3.2 An alternative approach

HIdden Variable adjustment Estimation (HIVE) [bing2020adaptive] is an alternative method for estimating coefficients of a linear model with independent and Gaussian-distributed data in the presence of latent variables. Adapted to the network of multivariate point processes, HIVE first estimates the latent column space of the unobserved connectivity matrix, , with defined in (8). It then projects the outcome vector, , onto the space orthogonal to the column space of . Assuming that the column space of the observed connectivity matrix, is orthogonal to that of , HIVE consistently estimates using the transformed data. While the orthogonality assumption might be satisfied when the hidden processes are external, such as experimental perturbations in genetic studies [shawn2017], it might be too stringent in a network setting. However, when the orthogonality assumption fails, HIVE may lead to poor edge selection performance, and potentially worse than the naïve method that ignores the hidden processes. HIVE also requires the number of hidden variables to be known. Although methods in selecting the number of hidden variables have been proposed, the resulting theoretical guarantees would only be asymptotic. An over- or under-estimated number can either miss the true edges or generate false ones. Given these limitations, we outline the extension of HIVE for Hawkes processes in Appendix A and refer the interested reader to bing2020adaptive for details.

## 4 Theoretical Properties

In this section we establish the recovery of the network connectivity in the presence of hidden processes. Technical proofs for the results in this section are given in Appendix B.

We start by stating our assumptions. For a square matrix , let and

be its maximum and minimum eigenvalues, respectively.

###### Assumption 1.

Let with entries . There exists a constant such that .

Assumption 1 is necessary for stationarity of a Hawkes process [Shizhe2017]. The constant does not depend on the dimension . For any fixed dimension, Bremaud1996 show that given this assumption the intensity process of the form (6) is stable in distribution and, thus, a stationary process exists. Since our connectivity coefficients of interest are ill-defined without stationarity, this assumption provides the necessary context for our estimation framework.

###### Assumption 2.

There exists and such that

for all .

Assumption 2 requires that the intensity rate is strictly bounded, which prevents degenerate processes for all components of the multivariate Hawkes processes. This assumption has been considered in the previous analysis of Hawkes processes [hansen2015, costa2018, Shizhe2017, wang2020statistical, cai2020latent].

###### Assumption 3.

The transition kernel is bounded and integrable over , for .

###### Assumption 4.

There exists constants and such that

Assumption 3 implies that the integrated process in (5) is bounded. Assumption 4 requires maximum in- and out- intensity flows to be bounded, which provides a sufficient condition for bounding the eigenvalues of the cross-covariance of [wang2020statistical]. A similar assumption is considered by Basu2015 in the context of VAR models. Together, Assumptions 3 and 4 imply that the model parameters are bounded, which is often required in time-series analysis [Safikhani2017JointSB]. Specifically, these assumptions restrict the influence of the hidden processes from being too large.

Define the set of active indices among the observed components, , and and . Let , and and . Our first result provides a fixed sample bound on the error of estimating the connectivity coefficients.

###### Theorem 1.

Compared to the case with independent and Gaussian-distributed data [cevid2020spectral, Theorem 2], we obtain a slower convergence rate because of the complex dependency of the Hawkes processes. Our rate takes into account the network sparsity among the observed components. It also does not depend on the size of unobserved components, , which is critical in neuroscience experiments because is often unknown and potentially very large.

The result in Theorem 1 is different from the corresponding result obtained when all processes are observed [wang2020statistical, Lemma 10]. More specifically, our result includes an extra error term, , which captures the effect of unobserved processes. Next, we show that when is sufficiently small, we obtain a similar rate of convergence as the one obtained when all processes are observed.

###### Corollary 1.

Under the same assumptions in Theorem 1, suppose, in addition, ,

with probability at least , where depending on the model parameters and the transition kernel.

The spectral transformation empirically reduces the magnitude of , especially when the confounding vector, , stays in the sub-space spanned by top right singular vectors of ; however, this is not guaranteed to hold for arbitrary . Corollary 1 specifies a condition on that leads to consistent estimation of , regardless of the empirical performance of the spectral transformation. While the condition does not always hold for arbitrary stochastic process, it is satisfied for a stable network of high-dimensional multivariate Hawkes processes when the confounding is dense. Specifically, by the construction of in (9), Assumption 4 implies that . When the confounding effects are relatively dense—i.e., , meaning that there are large number of interactions from unobserved nodes to the observed ones—we obtain . Therefore, the constraint on is likely satisfied under a high-dimensional network, when . The high-dimensional network setting is common in modern neuroscience experiments where the number of neurons is often large compared to the duration of experiments.

Next we introduce an additional assumption to establish the edge selection consistency. To this end, we consider the thresholded connectivity estimator,

Thresholded estimators are used for variable selections in high-dimensional network estimation [shojaie2012adaptive] as they alleviate the need for restrictive irrepresentability assumptions [vandeGeer2011EJS].

###### Assumption 5.

There exists such that

Assumption 5 is called the - condition [buhlmann2013] and requires sufficient signal strength for the true edges in order to distinguish them from . Let the estimated edge set and the true edge set . The next result shows that the estimated edge set consistently recovers the true edge set.

###### Theorem 2.

Theorem 2 guarantees the recovery of causal interactions among the observed components. As before, the result is valid irrespsective of the number of unobserved components, which is important in neuroscience applications.

## 5 Simulation Studies

We compare our proposed method, hp-trim, with two alternatives, HIVE and the naïve approach that ignores the unobserved nodes. To this end, we compare the methods in terms of their abilities to identify the correct causal interactions among the observed components.

We consider a point process network consisting of nodes with half of the nodes being observed; that is . The observed nodes are connected in blocks of five nodes, and half of the blocks are connected with the unobserved nodes (see Figure 2a). This setting exemplifies neuroscience applications, where the orthogonality assumption of HIVE is violated. As a sensitivity analysis, we also consider a second setting similar to the first, in which we remove the connections of the blocks that are not connected with the unobserved nodes This setting, shown in Figure 3a, satisfies HIVE’s orthogonality assumption.

To generate point process data, we consider and in the setting of Figure 2a, and and in the setting of Figure 3b. The background intensity, , is set to in both settings. The transfer kernel function is chosen to be . These settings satisfy the assumptions of stationary Hawkes processes. In both settings, we set the length of the time series to .

The results in Figure 2b shown that hp-trim offers superior performance for both small and large sample sizes in the first setting. HIVE performs poorly, worse than the naïve approach, because the orthogonality condition is violated. When the orthogonality condition is satisfied (Figure 3b), HIVE shows the best performance. However, this advantage requires knowledge of the correct number of latent features. When the number of latent features is unknown and estimated from data, HIVE’s performance deteriorates, especially with an insufficient sample size. In contrast, hp-trim’s performance with both moderate and large sample sizes is close to the oracle version of HIVE (HIVE-oracle).

## 6 Analysis of Mouse Spike Train Data

We consider the task of learning causal interactions among the observed population of neurons, using the spike train data from Boldingeaat6904. In this experiment, spike times are recorded at 30 kHz on a region of the mice olfactory bulb (OB), while a laser pulse is applied directly on the OB cells of the subject mouse. The laser pulse has been applied at increasing intensities from 0 to 50 (). The laser pulse at each intensity level lasts 10 seconds and is repeated 10 times on the same set of neuron cells of the subject mouse.

The experiment consists of spike train data multiple mice and we consider data from the subject mouse with the most detected neurons () under laser (20 ) and no laser conditions. In particular, we use the spike train data from one laser pulse at each intensity level. Since one laser pulse spans 10 seconds and the spike train data is recorded at 30 kHz, there are 300,000 time points per experimental replicate.

The population of observed neurons is a small subset of all the neurons in mouse’s brain. Therefore, to discover causal interactions among the observed neurons, we apply our estimation procedure, hp-trim, along with HIVE and naïve approaches, separately for each intensity level, and obtain the estimated connectivity coefficients for the observed neurons. For ease of comparison, the tuning parameters for both methods are chosen to have about 30 estimated edges; moreover, for HIVE, is estimated following the procedure in bing2020adaptive, which is based on the maximum decrease in eigenvalue of the covariance matrix of the errors, in (15).

Figure 4 shows the estimated connectivity coefficients specific to each laser condition in a graph representation. In this representation, each node represents a neuron, and a directed edge indicates a non-zero estimated connectivity coefficient. We see different network connectivity structures when laser stimulus is applied, which agrees with the observation by neuroscientists that the OB response is sensitive to the external stimuli [Boldingeaat6904]. Compared to our proposed method, the naïve approach generates a more similar network than HIVE under both laser and no-laser conditions, which is likely an indication that the naïve estimate is incorrect in this application.

As discussed in Section 4, our inference procedure is asymptotically valid. In other words, with large enough sample size, if the other assumptions in Section 4 are satisfied, the estimated edges should represent the true edges. Assessing the validity of the assumptions and selecting the true edges in real data applications is challenging. However, we can assess the sample size requirement and the validity of assumptions by estimating the edges over a subset of neurons as if the other removed neurons are unobserved. If the sample size is sufficient and the other assumptions are satisfied, we should obtain similar connectivities among the observed subset of neurons, even when some neurons are hidden. Figure 5 shows the result of such a stability analysis for the laser condition using hp-trim. Comparing the connectivities in this graph with those in Figure 4 indicates that the estimated edges using the subset of neurons are consistent with those estimated using all neurons. Thus, the assumptions are likely satisfied in this application.

## 7 Discussion

We proposed a causal-estimation procedure with theoretical guarantees for high-dimensional network of multivariate Hawkes processes in the presence of hidden confounders. Our method extends the trim regression [cevid2020spectral] to the setting of point process data. The choice of trim regression as the starting point was motivated by the fact that its assumptions are less stringent than conditions required for the alternative HIVE procedure, especially for a stable point process network with dense confounding effects. Empirically, our procedure shows superior edge-selection performance compared with HIVE and a naïve method that ignores the unobserved nodes.

Our estimates assume a linear Hawkes process with a particular parametric form of the transition function. Thus, the proposed method identifies causal effects only if these modeling assumptions are valid. When the modeling assumptions are violated, the estimated effects may not be causal. In other words, the method is primarily designed to generate causal hypotheses—or facilitate causal discovery—and the results should be interpreted with caution. Extending the proposed approach to model the transition function nonparametrically and learn its form adaptively from data would thus be an important future research direction. In addition, given that non-linear link functions are often used when analyzing spike train data [PANINSKI2007, Pillow2008], it would also be of interest to develop casual-estimation procedure for non-linear Hawkes processes.

## References

## Appendix A Additional Details on HIVE

We introduce additional notations before illustrating the method.

Let , , and . Then, we rewrite (8) simultaneously for all components:

(15) |

where and are connectivity matrix between the observed and unobserved components, respectively. is the vector of spontaneous rate.

To illustrate the confounding induced by the hidden process, we project onto the space spanned by as

(16) |

where is the projection matrix, representing the cross-sectional correlation between and . Then, (15) becomes

(17) |

where

From the above, it is easy to see that the correlations between the observed and unobserved processes determine the strength the confounding. Specifically, unless —i.e., when the observed and unobserved processes are independent, directly regressing on produces biased estimates on . Under the condition that —i.e., the column space of is orthogonal to the column space of , HIVE gets around this issue by finding a projection matrix, , that projects onto its orthogonal space —i.e., . Moreover, because of the orthogonality assumption, . Therefore, when multiplying both sides in (15) by , the unobserved term disappears. Specifically, letting , (15) becomes

(18) |

Consequently, regressing on

produces unbiased estimates on

(using penalized regression with -penalty on under the high-dimensional setting when is allowed to grow with the sample size ). In order to obtain , HIVE first calculates in (17) and then implement heteroPCA algorithm [zhang2019heteroskedastic] to estimate the latent column space of thus to obtain . Then, the method obtains the corresponding orthogonal project as . We refer the interested readers to bing2020adaptive for details about the method.## Appendix B Proof of Main Results

Since our focus is on the estimation error for , we consider the perturbation model in (10) in the following.

Let be the true model parameter and be the optimizer for (14). Recall that the set of active indices, , and and . Because optimization problem (14) can be solved separately for each component process, in the follows we focus on the estimation consistency for one component process. For ease of notation, we drop the subscript ; that is, we use for , for , for , for , for , for and for .

Next, we state two lemmas that will be used in the proof of main results.

###### Lemma 1 (vandegeer1995).

Suppose there exists such that where is the intensity function of Hawkes process defined in (2). Let be a bounded function that is -predictable. Then, for any ,

with probability at least , for some constant .

###### Lemma 2 (wang2020statistical).

Proof of Theorem 1 : While the skeleton of the proof follows from cevid2020spectral, the following two conditions are needed because of the Hawkes process data’s unique dependency structure.

###### Condition 1.

There exist constants such that

where .

Condition 1 is referred as the restrict strong convexity (RSC) [Negahban2012]. Lemma 2 by wang2020statistical has shown Condition 1 holds when under Assumption 1- 4. Since the min eigenvalue of stays the same with our choice of , Condition 1 holds for .

###### Condition 2.

Under the two conditions, we achieve the conclusion as follows.

Because is the optimizer for (14),

Letting and ,

Next, we discuss in two conditions: i) and ii) .

First, when ,

The above implies

which means for .

Taking ,

where the second inequality is by Condition 1 and the last step is by using twice. Therefore, we get

When ,

Combining the two cases, we always have

Thus, taking and dividing both sides by , we achieve the conclusion that

Proof of Corollary 1 : Notice that

with probability at least , where the second inequality is by Lemma 2.

Proof of Theorem 2: Recall and . To establish selection consistency, we need two parts. First, we show that our estimates on the true zero and non-zero coefficients can be separated with high probability; that is, there exists some constant such that for and

Comments

There are no comments yet.