1 Introduction
Understanding causal relationships [24, 36] among different processes is an ubiquitous task in many scientific disciplines as well as engineering (e.g., in the context of climate research [10, 32, 19, 18, 13, 30], econometrics [12, 21], or molecular dynamics [11]). Yet, the common approach to gaining causal knowledge by conducting experiments is often infeasible or unethical, for example in Earth sciences. All that is often given is a set of time series describing these processes with no specific knowledge about the direction and form of their causal relationships available. The challenge, termed causal discovery, is then to reconstruct the underlying graph of causal relationships from time series data [30]. Based on that graph the processes that generated the data can then be modelled in the framework of structural causal models (SCMs)[24] to further understand causal relations, predict the effect of interventions, and for forecasting.
Today’s evergrowing abundance of time series datasets promises many application scenarios for datadriven causal discovery methods, but many challenges emerging from the dynamic nature of such datasets have not yet been met. Further, causal knowledge cannot be gained from data alone and each method comes with its particular set of assumptions [36] about properties of the underlying processes and the observed data. See [30] for an overview of methodological frameworks, challenges, and application scenarios.
A particular and widespread challenge is regimedependence, a common property of nonlinear dynamical systems that can also be described as one form of nonstationary behaviour. Regimedependence means that the causal relationships between the considered processes vary depending on some prevailing background regime that may be modelled as switching between different states. Further, often such regimes have strong persistence, that is, they operate and affect causal relations on much longer time scales than the causal relations among the individual processes. In the climate system, for instance, several cases of such regimedependencies exist. For example, rainfall in India in summer is known to be influenced by the socalled El Niño Southern Oscillation (ENSO), an important mode of variability in the tropical Pacific affecting the largescale atmospheric circulation and thereby weather patterns around the globe [39, 34]. It is, however, generally assumed that ENSO does only marginally affect Indian rainfall in winter [23]. Thus, the causal relationships between ENSO and rainfall over India change dependent on the season that here defines the backgroundregime and operates on a longer time scale (several months) than the causal relations among ENSO and Indian rainfall (several weeks).
1.1 Existing work
Causal discovery has seen a steep rise with a plethora of novel approaches and methods in recent years. Each approach has different underlying assumptions and targets different real world challenges as discussed in [30]
. In general, causal (network) discovery methods can be classified into classical Granger causality approaches
[12, 3], constraintbased causal network learning algorithms [36], scorebased Bayesian network learning methods
[17, 6], structural causal models [26], and statespace reconstruction methods [37, 2].Here we focus on the constraintbased framework which has the advantage that it can flexibly account for nonlinear causal relations and different datatypes (continuous and categorical, univariate and multivariate). PCMCI adapts this framework to the time series case yielding high detection power also in highdimensional and strongly autocorrelated time series settings. However, one of the general assumptions of PCMCI (as well as of other causal discovery algorithms) is stationarity, i.e., that at least the existence or absence of a causal link does not change over the considered time series segment [29]. While known changes in the background signal can be accounted for by restricting the time series to the stationary regimes, PCMCI cannot handle unknown background regimes which constitute a particular case of latent confounding.
Some recent work addresses causal discovery in the presence of nonstationarity. [20] model nonstationarity in the form of (continuous) stochastic trends in a linear autoregressive framework. [41] account for nonstationarity in the more general constraintbased framework. However, both address the case of a (smoothly) varying continuous background variable that continuously changes causal relations among the observed variables. This means that these methods will not output regimedependent causal graphs, but a “summary” graph that accounts for regimes modelled as latent drivers. In [25, 7]
assumed known nonstationary regimes are exploited to estimate causal relations also in the presence of general latent confounders.
Currently few methods exist that address the case of a discrete regime variable leading to distinct causal regimes that may be physically interpreted. For example, in the climate science context, regimedependent autoregressive models (RAM) were introduced already in 1990
[42]. These can yield physically well interpretable results that, however, require wellchosen ancillary variables and a seasonal index which are not learned from data. Thus, RAM requires a priori knowledge of the regimes, which one often aims to learn rather than enforce. Furthermore, the autoregressive framework only permits linear relationships. In the context of discrete state spaces regime dependent causal discovery has been considered in [11]. Another approach that has been proposed to model time dependent Granger (non)causality is based on a Markov Switching VAR ansatz with an economics application in mind [21]. Specifically, the regime assignments are computed by sampling from a Markov chain.
A more general framework to handle discrete regimes is the Markovswitching ansatz of [9], which flexibly models regimedependence utilizing the assumption of a finite number of regimes and a level of persistency in the transitions between different regimes. This ansatz has been successfully realised in combination with many different model assumptions (e.g., see [14]) here we want to explore it for causal networks and combine it with PCMCI [28], a constraintbased time series causal discovery method [36]. We call our method RegimePCMCI.
The remainder of the paper is structured as follows: In section 2 the underlying mathematical problem, concepts, and key assumptions are formalised, and a motivating example is discussed to provide some intuition. Our novel method RegimePCMCI is then presented in section 3. These theoretical and algorithmic parts are complemented by a thorough numerical investigation of the proposed method in various artificial settings in section 4. Finally, in section 5, RegimePCMCI is applied to a realworld dataset from climate science, addressing the changing relationships of ENSO and rainfall over India.
2 Problem setting
Let be a sequence of realvalued
dimensional random variables
where is associated with time. A realisation over the time interval of this stochastic process is denoted and we assume that it is possible to obtain observations of these realisations. We assume that the underlying process is modelled by a regimestationary discretetime structural causal model (SCM)(1) 
Here the measurable functions depend nontrivially on all their arguments, the noise variables are jointly independent and are assumed to be stationary, i.e., for all for some distribution , and the sets define the causal parents of . Here we assume lagged relationships, but this is not a necessity. In contrast to approaches assuming stationarity, both and are allowed to depend on regimes in time as further formalized in Assumption 1 (section 2.2). Then the problem setting considered in this manuscript is of the nature of the following inverse problem
(2) 
with where belong to an appropriate functions space for each and . is the maximum considered time lag. In other words, the aim is to fit a set of unknown parameters on the basis of an observed time series . In the next section we will discuss the particular structure of the parameters we are interested in.
2.1 Causal graphs
Representing causal relations between different processes as graphs (also referred to as networks) is common practice in the context of causal inference and causal discovery [24, 36]. For time series, we use the concept of time series graphs. The nodes in the time series graph associated with the SCM (1) are the individual timedependent variables with at each time . Variables and for a time lag and a given are connected by a lagspecific directed link “” if for a particular . We denote the maximum ground truth time lag of any parent as .
For a more detailed introduction the reader is referred to [28]. In the following we will use graphs and networks interchangeably.
The collection of parent sets for all components at time is denoted . This set of parents is part of the unknown parameters we want to infer. Note that their dimensionality is assumed finite, but not known a priori. The other quantity of interest is the functional form of the causal relations in SCM (1) corresponding to these links which we here restrict to an appropriate function class as modelled in Eq. (2). If we assume linear functions with coefficients , then the inverse problem Eq. (2) simplifies to
(3) 
Thus for a given time series and with and functional the aim is to find the unknown parameters .
2.2 Persistence
As mentioned above, in many application areas nonstationarity may be modelled not in form of abrupt or continuous changes, but via piecewise constant regimes [27, 40, 11]. These regimes will further exhibit a certain persistent behaviour. In order to capture nonstationary systems with these properties we will restrict our inference to regimedependent persistent dynamics.
Assumption 1: Denote the parents and functional dependency of a given variable for a regime as and . We call a regime persistent if the parents and functional dependencies are stationary for an average of consecutive time steps . Further, we assume that there is a finite number of regimes on the whole time domain, i.e., .
Note that persistence enters here via a regime average persistence , which naturally implies a finite number of regimes .
Under Assumption 1 the considered linear inverse problem (3) reduces to finding a set of parameters
and the change points between the regimes given by the regime assigning process
with . For example, component of the regime assigning process can be of the form . Regime is active for all time steps for which .
2.3 Motivating example
Before we introduce our novel regime detecting causal discovery algorithm, we illustrate the underlying challenges of causal discovery in the face of regimedependence by giving a simple example. Consider the case of two background regimes and two time series and and the associated causal graphs as shown in Figure 1,a. Variable linearly influences but the sign changes in time, alternating between a positive (during regime 1) and a negative (during regime 2) influence. Here the two regimes alternate equidistantly. The crosscorrelation of and
over the whole timeperiod is zero because the opposite sign effects cancel each other out in the linear regression. Thus, any linear causal discovery method would fail in detecting the influence of
on when no a priori knowledge on the two background regimes exists. For example, applying a linear version of PCMCI on the whole time sample would give a network of disconnected variables (Figure 1,b).In contrast, if the regimes are known and PCMCI is applied to samples from both regimes separately, the positive and negative links are correctly detected (not shown). To deal with such problems automatically, our algorithm needs to learn both the regimes as well as the regimedependent causal relations.
3 Method
Our approach is designed to alternate between learning the regimes and the causal graphs for each regime in an iterative fashion. In principle, any causal discovery method that yields a causal graph can be used. Here we chose PCMCI [28] as a method that adapts the constraintbased causal discovery framework to the time series case.
3.1 Causal discovery
The constraintbased framework has the advantage that it can flexibly account for nonlinear causal relations and different datatypes (continuous and categorical, univariate and multivariate) since it is based on conditional independence defined as follows. Two variables and are conditionally independent given a (potentially multivariate ) variable , denoted , if
(4) 
where
denotes associated probability density functions.
There exist a large variety of conditional independence tests, see [29, 28] for a discussion. If relationships are assumed linear, as is the case in practical examples of the present work, a partial correlation can be used.
As is explained in detail in [28], PCMCI is based on a variant of the PC algorithm (names after its inventors Peter Spirtes and Clark Glymour [36]) combined with the momentary conditional independence (MCI) test. It consists of two stages: (i) PC condition selection to identify relevant conditions for all time series variables and (ii) the MCI test to test whether with
(5) 
Thus, MCI conditions on both the parents of and the timeshifted parents of . These two stages serve the following purposes: PC is a timelagged causal discovery algorithm based on the PCstable algorithm [8] that removes irrelevant lagged conditions (up to some ) for each variable by iterative conditional independence testing. A liberal significance level in the tests lets PC adaptively converge to typically only few relevant conditions that include the causal parents with high probability, but might also include some false positives. The MCI test then addresses false positive control for the highlyinterdependent time series case, which is why we chose it here. A causal interpretation of the relationships estimated with PCMCI comes from the standard assumptions in the constraintbased framework [36, 29, 28], namely causal sufficiency, the Causal Markov condition, Faithfulness, noncontemporaneous effects, and stationarity within the regimes as further discussed below. As demonstrated in [28], PCMCI has high detection power and controlled false positives also in highdimensional and strongly autocorrelated time series settings.
The main free parameters of PCMCI are the chosen conditional independence test, the maximum time lag , and the significance levels in MCI and in PC. We discuss the selection of these parameters in section 3.6.
PCMCI is applied to sample subsets of the time series pertaining to an estimated regime in an iterative step of our method, initialised with some random regime assignment. Given a significance level , the output of PCMCI is the set of parents for all time series variables for that regime,
(6) 
Based on these parents with associated causal links, causal effects that quantify the strength of a link can be estimated.
3.2 Regime learning
Given an estimated set of parents for each regime, the regime variable for the next iteration is updated assuming a particular nonstationary setting of finite metastable regimes as defined in Assumption 1 (section 2.2). This learning approach is based on ideas first proposed in [14] and later extended to many different models [9].
In the following we focus on the linear setting, the nonlinear extension is discussed in section 3.5. To learn the regime parameters for the inverse problem (3) introduced in section 2,
(7) 
given (the output from PCMCI), we define a cost functional
(8) 
subject to constraints
(9) 
and
(10) 
where is a distance measure such as the squared euclidean distance and is a regime assigning process describing the weight of the individual networks at each time .
The format of relies on the assumption that the system associated with the considered data exhibits metastability in time (see Assumption 1, that translates in the summation over ). Note that the persistence enters the functional in form of a regularization (see Constraint 10). An alternative option is to add a regularisation term that enforces some form of smoothness of (e.g., Tikhonov regularisation [38]).
The free tuning parameter is related to the average regime duration of time steps as follows: an average regime duration of in all regimes is implemented by choosing . Note that in practice, the average regime switching time might not be exactly known. However, we expect in many application areas that prior domain knowledge on reasonable time scales of regime switching is available. The choice of parameters, including choices of value , will be discussed in section 3.6.
3.3 Algorithm
The RegimePCMCI algorithm iterates over two major estimation steps: (Step 1) causal discovery to obtain and fit the coefficients (or a nonlinear function) and (Step 2) regime learning to update the regime variable . In the following, indicates the current iteration. The superscript is added combined with brackets to the variables updated in each loop. The details of the consecutive subroutines are laid out below.
3.3.1 Step 1: Causal discovery and model estimation
The first step is to estimate a set of parents and coefficients with on the basis of a fixed obtained in step 2 of the previous iteration (see lines of Algorithm 1 and section 3.3.2). In the first iteration, the regimes are assigned randomly. and are estimated on the basis of a subset of the time series with
(11) 
for each regime . The regimedependent parents are estimated via PCMCI.
To solve Eq. (3), we assume a functional relationship that relates each variable to its parents (for each regime). Here we assume linear functions implying that the coefficients can be estimated from the following regression model for each fixed :
(12) 
for . In other words for every the following optimisation has to be solved
(13) 
for . Note that the coefficients not indicated as relevant via the parent set are defined to be zero, i.e., for .
3.3.2 Step 2: Regime learning
Step 2 is to determine an optimal regime assigning process given the current estimates for the parents and coefficients (see second bullet in loop of Algorithm 1). For this the following optimisation problem needs to be solved
(14) 
subject to the constraints (9) and (10), and where for each
(15) 
Since the first time steps cannot be predicted, we choose to set those to and to not consider this portion of the time series in the algorithm evaluation.
In order to search for the global minimum, the algorithm is run for a number of different initializations of (annealing). The annealing run with the lowest cost functional objective is chosen as optimal fit. Note that the individual annealing steps are embarrassingly parallelizable.
3.4 Reconstruction of time series
3.5 Nonlinear functions
It is important to mention that the choice of functions in the learning problem (2) should be determined according to the considered applications and on assumptions on the data. Further, the conditional independence test used in PCMCI should cover at least an equally expressive functional dependency class. For example, if Gaussian processes are used to estimate , then the Gaussian Process Distance Correlation (GPDC) test (see [28]) can be used.
Consequently, a nonlinear version of the presented RegimePCMCI would require a different cost functional. The complexity of the assumed model would increase significantly due to the twofold presence of nonlinearity (one through the regimedependence and the other one via nonlinear causal relations). Therefore, we here restricted the focus to linear functions . Addressing nonlinearity in combination with the considered nonstationarity will be explored in subsequent research.
3.6 Parameter selection
RegimePCMCI involves a number of parameters that need to be chosen. They can be separated into parameters of the causal discovery method PCMCI and those of the regime learning part.
The main free parameters of PCMCI are the chosen conditional independence test, the maximum time lag , and the significance levels in MCI and in PC. should be regarded as a hyperparameter and can be chosen based on modelselection criteria such as the Akaike Information Criterion (AIC) [1] or crossvalidation. could be incorporated into this model selection. But since PCMCI is not very sensitive to this parameter [28] (as opposed to, e.g., Granger causality), its choice can be based on lagged correlation functions, see [28] for a discussion. The choice of conditional independence test is a modelling assumption guided by the assumed nonlinearity of the underlying process and also finite sample considerations. Finally, is chosen based on the desired level of false positives.
Determining a suitable choice of the unknown number of regimes
is a difficult task. In particular, it is hard to find the right balance between avoiding to overfit and to choose appropriately complex models to describe a specific dataset and thus the underlying dynamics well. One way to assess this balance heuristically is to employ an information criterion (IC)
[5] which has been derived in the context of regression models and since been adapted to various other model scenarios including graphs [35].An IC is designed to capture the goodness of fit penalised by the number of parameters in order to prefer models with as few parameters as possibles, to avoid overfitting (parsimony). Here the number of parameters is defined as
(17) 
The first term in Eq. (17) relates to the number of parameters required to describe which can be fully determined via the change points. The second term in Eq. (17) counts the number of relevant parents, that is, the nonzero coefficients . Here we use the corrected Akaike Information criterion (AICc) first proposed in [15] to estimate . Note that we use the corrected version of the original AIC [1] to correct for small samples sizes relative to the number of parameters
(18) 
where is the maximum value of the likelihood function for the model one assumes for the residuals (see [22] for a more detailed discussion). The choice of is numerically investigated in section 4.3.
The number of iteration steps should be chosen to ensure that the optimisation process converges. In our experiments we found with exploratory testings that shows convergence after about 1020 iterations for all examples investigated. The number of annealing steps should be chosen to ensure we can span a large number of local solutions to this nonconvex optimisation problem (Eq. (8)). However, computational time will set a limit to a too high parameter. Note, however, that this part is embarrassingly parallelisable.
4 Numerical investigation
In the following we investigate the performance of RegimePCMCI by means of several toy examples. The artificial data is designed to test the methods robustness and accuracy with respect to various potential scenarios that could occur in real applications. At first low dimensional () causal relations are studied as the results can be interpreted more easily. Next, we also consider higher dimensional settings (). The reference time series are generated with the following SCM time series model:
(19)  
with predefined , , and . Note that the reference set of parents is specified by the nonzero coefficients .
4.1 Low dimensional data with two underlying regimes
First we focus on a simple setting of two regimes, i.e. , and a two dimensional underlying process (i.e., ). Our aim is to test the performance of RegimePCMCI for different elemental features that can change between regimes. For brevity, links will be called autolinks or autodependencies for and cross links for . We consider the following scenarios as summarised in Table 1: sign change of coefficient (in auto link and cross variables link), lag change (in cross link), coefficient change (in auto link) and childparent inversion defined via an assortment of linear functions and associated coefficients. In all examples, each variable is also autolinked at lag 1, which is a realistic yet challenging assumption for many algorithms.
4.1.1 Experiment settings
We design five toy models, in network terms, corresponding to different sets of parents defined via the references parameters given in columns to of Table 1. Further, synthetic regime assigning processes are generated for all examples. More specifically, is designed to consist of alternating windows, i.e., regime transitions. The length of these windows is randomly selected to be between and and the constraint (9) imposes . The final length of the time series is capped at to ensure equallylong regime assignment time series.
Then an artificial time series via (19) with is generated. Note that the stochastic process (19) can be exactly reconstructed via the coefficients , their activation and a specific realisation of the innovation term .
The PCMCI parameters are chosen as follows: partial correlation as a conditional independence test, , as recommended in [31], , and masking type ‘y’ (see the documentation of tigramite for the definition of masking types). The number of regimes was set to and the maximum number of regime transitions is , i.e., correct guess on number of regimes and switches (model selection for is investigated in section 4.3). The number of iterations is and the number of annealings is . A summary of the parameters is shown in Table 2. We generate time series realisations for each example.
Example  

arrow direction  
causal effect  
lag  
sign  
sign  
CI test  

ParCorr 
4.1.2 Results
The ability of the proposed method to recover the networks and the regimes on the basis of the artificially designed time series are presented in the following. Figures 26 present results for each case in Table 1, focusing on one of the synthetic datasets. Table 3 shows summary statistics over all runs.
The case sign is discussed in detail. The groundtruth regime evolution and networks are shown in the top part of panels a and b in Figure 2; in the middle part of both panels their RegimePCMCI reconstruction is shown; in the bottom part the difference between reconstructed and true regimes are presented to visually inspect the accuracy. The reconstructed regime assigning process for each regime matches the truth in 99.6% of time steps (97% average value over , see Table 3). The corresponding networks have all and only the correct links (TPR = and FPR = average value over ); their linear causal effect is also well estimated with each link correct up to (averaged error per link is (9%)).
The other four cases are presented in Figures 36. The case causal effect, and to a lesser extent lag change, are hardest to detect. This is because the difference between the individual regimes and a mixed state of the two is not very large and thus the detection is more challenging. This adds to the general challenge of nonconvexity of the functional we are optimising, which we mitigate by the annealing steps as mentioned in section 3.6. A similar challenge is found for some high dimensional runs for which we refer to section 4.4.
Table 3 shows the summary results over the realisations. The estimation errors are presented in terms of the regime assigning process (second column), the network structure (third to sixth column), the causal effects of links (seventh to tenth columns) and the overall reconstructed time series (last column). The second column, , is the average percentage of wrongly estimated time steps per regime (the lower the better, note that this value is the same for , by construction). In terms of networks, the link detection performance is evaluated via the true positive (TPR) and false positive rates (FPR). Further, we compare these with the reference FPR and TPR (superscript ref) if PCMCI is run with the groundtruth regime variable known (but causal structure unknown). The accuracy in links’ causal effects is assessed via , the average difference between the reconstructed linear coefficient and the reference values of the ground truth links. is also expressed as percentage, i.e. each difference is weighted by the absolute value of the ground truth coefficient. The last column, , is the expected prediction error per variable and per time step and is computed as with defined in Eq (8) and with and referring to the number of variables, here two, and the length of the time series respectively. The precise definition of all the above statistics can be found in Appendix A.
Example  

arrow direction  
causal effect  
lag  
sign  
sign 
In summary, Table 3 shows that :

: on average, the regime assigning process is reconstructed correctly in of the time steps for all cases except causal effect. The causal effect and lag examples are the hardest to infer, with causal effect being particularly deficient. In these examples a mixedregime state (e.g., arising from assigning a considerable fraction of wrong time steps to a regime) is still quite close to any of the true regimes. Therefore the algorithm struggles to decide which time steps belong to which regime, since they could fit both to some degree. Yet, there are 7 instances where (one presented in Figure 6) and those, as expected from PCMCI, give very good network fit. We notice that these runs do not correspond to the lowest objective values of the set (i.e. better fit) which shows that runs that end up in mixed states can still fit the data quite well. Also, we notice that the causal effect setup reaches local minima in 16% of the 100 runs, thus in 84% of the runs the algorithm cannot easily find a stable solution which points at a weaker confidence in the output.

TPR: despite some errors in reconstructing the regime assigning process, the TPR is always very close to 1. This can indicate that the true signals, dynamic wise, are strong enough to be detectable.

FPR: Ideally the false positive rate should be upperbounded by . This is also the case if we assume the correct regimes (see column ). However, if the regimes are learned, in most of the examples the FPR value is higher due to errors in learning the regimes. If a wrong regime is learned, then both false positives and false negatives can occur. False negatives, i.e., missing links in the PC step of PCMCI can lead to false positives in the MCI step.

: Errors in parents’ detection (either due to false positives (FPR) or to false negatives (missed links, FNR = 1TPR)) surely impact the estimation of link effects. Since the TPR and FPR are good, except for the causal effects case, we expect to obtain also good results for the linear coefficients. This is indeed the case, as the difference is of order implying a relative error of about .
4.2 Low dimensional data with three underlying regimes
To illustrate how RegimePCMCI deals with more than two regimes, we also considered a toy time series based on three different causal regimes. It is of course possible to consider the case , yet in applications it is often desirable to infer a few prominent and relevant regimes rather than having too many that are not interpretable anymore. In other words, the aim is to avoid overfitting and to increase the information gain by reducing the complexity of the assumed model (parsimony).
4.2.1 Experiment settings
The artificial time series is generated via a regime dependent causal graph that is designed by combining two of the regimes settings presented in section 4.1, namely sign change and arrow inversion (for details see Table 4). The regime assigning reference process is generated by randomly choosing between different persistence lengths of , and timesteps and iterating for times. The algorithm is run with free parameters in Table 5.
example  

sign  
and arrow  
direction 
CI test  

ParCorr 
4.2.2 Results
Figure 7 shows the results. There are only minimal deviations from the true reference values which confirms that the proposed method is capable to deal with . This also holds for the summary results over runs presented in Table 6. Yet, it is important to note that we chose a combination of causal graphs that performed well for , i.e, causal effect changes would also be difficult to detect for .
4.3 Regime parameter selection
We investigate how parameter selection of the number of regimes affects the results by means of the AICc scores defined in (18). We investigate two test scenarios of for a selection of the examples defined in the previous sections 4.1 and 4.2. The PCMCI parameters are as in those sections while , and . The resulting AICc values are displayed in Figure 8. The value is changed adaptively for each to ensure a similar value for the different number of regimes, i.e.,
(20) 
for . The reference value for the number of switches is on average (due to randomisation of ) for both .
We note that the lowest at which the AICc plateaus is the groundtruth one. The plateau itself occurs due to the fact that only the links with nonzero causal effect values are counted towards the number of parameters. Thus a higher number of regimes does not necessarily result in an increase of the total number of parameters. In other words the penalisation is not becoming stronger with higher values of . Concluding, it is clearly visible that no significant improvement is gained by increasing the number of beyond the reference number of regimes. Since the entry point to the plateau reveals the reference number of regimes, it seems possible to face scenarios where the true number of regimes is unknown.
4.4 High dimensional linear network
In this section the algorithm is evaluated on highdimensional datasets, with each dataset consisting of interacting variables. The background regimes are generated with two regular alternating regimes of 300 time steps each, for a total length . The network structures are randomly generated from a family of linear networks defined via the parameters shown in Table 7, where is the number of randomly drawn cross variable links with random coefficients from the third column. Note that each variable is also autolinked at lag 1 with coefficient randomly drawn from the fourth column. The time series are generated with model (19) and for realisations. RegimePCMCI is then run with the settings shown in Table 8.
L  max lag  

10  30  [0.4, 0.4]  [ 0.2, 0.5, 0.9]  3 
CI test  

ParCorr 
The results are shown in Table 9, which is structured like Table 3. RegimePCMCI performs very well even in this challenging setting. Notably, individual runs can perform extremely well, with reaching as low as , and a total of 53 runs below total average of (second row in Table 9). The other 7 runs are responsible for most of the deviation of the average statistics from the reference values (first row).
As in the causal effect case, there is a mismatch between runs with the lowest prediction errors and the lowest error on regimeassigning process , meaning that we cannot use a filtering on
to find the best performing runs. This behaviour can be explained from the tendency of the algorithm to still overfit when too many degrees of freedom are available, as well as from the complexity of distinguishing different causal effects (a challenge already manifested in the
causal effect case).Selection  n. runs  

all  0.85  70  
%  0.70 
4.5 Computational complexity
Table 10 shows some indicators of performance of the method: the fraction of runs that correspond to a (local) minima, the average number of qiterations needed to reach a local minima and the runtime for the whole set of runs (the code run parallel over the annealings and using 4 to 6 CPUs per job).
Most of the examples reach local minima in more than 50% of the runs, while the percentage is very low for causal effect (second column). We note that examples with high percentage of local minima correspond also to quick convergence in terms of iterations steps (third column). They are also associated with better regimes reconstruction (see Table 3, 6,9), confirming that a clear cost functional minimum (as shown from the second and third column) is linked to better detection. Finally, the runtime is quite fast: the low dimensional examples take between and minutes for and 45 minutes for to complete runs. The high dimensional example takes just below hours for runs.
Example  n. local minima/ %  iterations to minima  runtime () 

arrow direction  %  600  
causal effect  %  970  
lag  %  1,130  
sign  %  970  
sign  %  700  
sign and arrow  %  2,670  
high dimensional  10,780 
5 A realworld example: the effect of
El Niño Southern Oscillation on Indian rainfall
We finally test the performance of RegimePCMCI on realworld data, and apply it to address the nonstationary relationship of El Niño Southern Oscillation (ENSO) and allIndia rainfall (AIR) mentioned in the introduction. We are interested in if, for given time series of ENSO and AIR, our method is able to distinguish between the winter and summer months, i.e. the backgroundregimes, and to detect a reported link from ENSO to AIR during summer.
This example can be considered a difficult case since the expected signal from ENSO to AIR is likely small compared to natural variability [39]. Further, climate data is typically very noisy with causal relationships being diluted by other, often unknown processes given a complex coupled climate system [40].
Our input data consist of monthly observations of ENSO and AIR, for the years 1871 to 2016, resulting in two time series consisting of 1740 monthly values each. More precisely, ENSO is represented by the socalled relative Nino3.4 index provided by the National Oceanic and Atmospheric Administration (NOAA) [4]^{1}^{1}1 http://climexp.climexpknmi.surfhosted.nl/getindices.cgi?WMO=NCDCData/ersst_nino3.4a_rel&STATION=NINO3.4_rel&TYPE=i&id=someone@somewhere. Data for AIR anomalies (with the climatology subtracted) are provided by the Indian Institute of Tropical Meteorology (IITM) [16]^{2}^{2}2 http://climexp.climexpknmi.surfhosted.nl/getindices.cgi?WMO=IITMData/ALLIN&STATION=AllIndia_Rainfall&TYPE=p&id=someone@somewhere. We choose the following parameters of RegimePCMCI: For the regime part, we set and , which is equivalent to assuming two seasons per year. For the PCMCI settings, we use a significance level (). Further, we use a maximum timelag of two months, i.e. . The optimisation is run annealing times, to span many local minima, with each annealing allowed for up to iteration steps to converge.
Among the annealing steps, which correspond to different random initial guesses on the regimeassigning process , some clearly performed better in terms of fitting the data. We estimate the average prediction error associated with each annealing, (A.4), and Figure 9,a shows it for all annealings (ranked according to ). A red box highlights the top performing cluster (13 runs).
All of the top 13 annealing find a link from ENSO to AIR during one of their two regimes only (for simplicity hereafter called regime 1). In the following we present results averaged over these annealings and plot links that surpass a strength of .
The causal link from ENSO to AIR in regime 1 has an average standardized linear effect of
, meaning that a one standard deviation increase in ENSO results in a reduction of
standard deviations in AIR (Figure 9,c) . This negative dependence is well documented in the literature [39]. During regime 2, in contrast, ENSO and AIR are, on average, almost independent, with only a very weak link (, not shown) detected from AIR to ENSO. More importantly, our results indicate a clear seasonal dependence. Figure 9,d shows the number of months assigned to each regime (normalised by the number one would expect on the hypothesis of no seasonality, see figure caption). A clear peak in summer months is found for regime 1. More precisely, most of the months between June to September are assigned to regime 1 (). These are the months in which the Indian summer Monsoon is active and for which a robust influence from ENSO has been shown. In contrast, months assigned to regime 2 are predominantly winter months ( of all December to March months). Thus, despite the relatively weak mean causal effect of ENSO on AIR during summer, and the large interannual variability, our algorithm successfully reconstructed this welldocumented relationship given allyear time series of ENSO and AIR.Overall, these results are promising and show the potential of RegimePCMCI to detect regimedependent causal structures in a system as complex as the climate system. On the other hand, it also shows that domain knowledge is required to assure a suitable choice of parameters ( and ) and an interpretation of the results. This is yet a common caveat to many datadriven approaches, which we nevertheless want to stress strongly.
6 Discussion and conclusions
Causal discovery is emerging as an important framework across many disciplines in science and engineering, but each discipline has particular challenges that novel methods need to address [32]. We introduced a novel method, RegimePCMCI, to learn regimedependent causal relations, overcoming one of the key drawbacks of current causal recovery methods. The performance of RegimePCMCI was analysed for many different artificially generated causal scenarios and for varying regimes showing that the method covers a wide range of settings (see Figures 25 and Table 3). The performance of the algorithm is maintained also for highdimensional settings with 10 variables (see Table 9) as well as for more than two regimes (see Figure 7 and Table 6). We found limitations of the method for the case where only the causal effect strength of a link changes between regimes (see Figure 6) which seems to be hard to detect with our optimisation scheme and requires further investigation. Further, the capability of RegimePCMCI was verified by means of a well documented climate example using real data of ENSO and Indian rainfall (see Figure 9). Overall, the proposed method presents itself as a promising approach in the context of nonlinear causal links manifested in regime changes in time.
Note that a causal interpretation of estimated links in our observational causal discovery framework still assumes causal sufficiency, that is, no unobserved common causes. However, estimated noncausality (zero coefficients) do not require this assumption [29] and can be interpreted as an absence of a causal relation already under the weaker Faithfulness assumption [28]. While for PCMCI asymptotic consistency was shown[28], this is a more difficult task for RegimePCMCI and deferred to further research.
There are several interesting aspects that could be explored in the future, building on the present work. These extensions can build on other causal discovery algorithms or extensions of PCMCI in the causal discovery step of our method. For example, as already discussed in section 3.5, the PCMCI algorithm allows for nonlinear causal links [28] and thus a nonlinear extension of the RegimePCMCI is a logical next step. Recent extensions of PCMCI to the case of not only lagged, but also contemporaneous causal relations can also be integrated [33]. Moreover, potentially it is also possible to better capture the causal effectcase and it might be possible to learn a regimedependence of the noise term.
With respect to applications, it would be interesting to utilise the proposed method to study other links in the climate system that are likely regimedependent, but less understood than the presented El NiñoIndian rainfall example.
7 Author’s contribution
E.S., J.dW., M.K. and J.R. designed the research, E.S. mainly performed the research, E.S., J.dW., M.K. and J.R. analyzed the results and wrote the manuscript.
Acknowledgements
E.S. was supported by the Centre for Doctoral Training in Mathematics of Planet Earth, UK EPSRC funded (grant EP/L016613/1). M.K. and J.dW. have been partially funded by the ERC Advanced Grant ACRCC (grant 339390). The research of J.dW. has been partially funded by Deutsche Forschungsgemeinschaft (DFG) (SFB1294/1318763901) and by the Simons CRM ScholarinResidence Program. M.K. has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie SkłodowskaCurie grant agreement (No 841902). The optimisation software by Gurobi™was used for this work. The authors would like to thank Giorgia di Capua for discussions on the climate example.
Data Availability Statement
The data that support the analysis of the first four sections of this study have been synthetically generated by the authors and can be fully reproduced using the equations and parameters described in the article. The data that support the findings of the last section of this study are openly available on the KNMI Climate Explorer at https://climexp.knmi.nl/
(specific urls and original data source in references). RegimePCMCI is part of the opensource Python package
tigramte available at https://github.com/jakobrunge/tigramite.Appendix A Definition of result statistics
a.1 Regime assigning process
a.2 Link detection
TPR