1 Introduction
The field of causal inference has developed massively in recent decades. Most of this literature concerns central quantities such as expectations, but certain causal mechanisms manifest themselves only in rare events and or may simplify in distribution tails. Standard methods of causal inference are illsuited for such situations, and recent work has begun to link causality and extreme value theory. Examples are Gissibl.Kluppelberg:2018, who define recursive maxlinear models on directed acyclic graphs, Kluppelberg21, who propose a scaling technique to determine the causal order of the variables in such graphs, kiriliouk
, who use multivariate generalized Pareto distributions to study probabilities of necessary and sufficient causation as defined in the counterfactual theory of Pearl, and
mhalla2018, who construct a causal inference method for tail quantities relying on Kolmogorov complexity of extreme conditional quantiles. See surveys by
naveau2020 on extreme event attribution and by Engelke2020 on the detection and modeling of sparse patterns in extremes.Our work stems from that of Gnecco2019, who propose an estimator of the causal tail coefficient and an algorithm that, under mild conditions, consistently retrieves a causal order on an underlying graph even in the presence of hidden confounders. Such an order helps to exclude some causal structures, but does not provide evidence for the existence of a specific structure, as in general a given order is causal for several possible graphs; in particular, all orders are causal for the empty graph corresponding to absence of causality. Although it is asymptotically invariant to hidden confounders, this estimator can suffer from confounding in finite samples when inference on the direct relationship between two variables is needed, when these effects are too strong or when the confounders have heavier tails than the two variables.
This paper addresses a central challenge in causal inference: the presence of confounders. It is often assumed that all the relevant variables are observed and can be included in the model, but typically this cannot be ensured. The available variables are often subject to external influences, observed or unobserved, that affect the variables of interest and can make it harder or even impossible to infer a correct causal relationship. Our goals are to mitigate the effect of a known confounder on an extremal causal analysis by treating it as a covariate, and to present a permutation test for direct causality between the two observed variables. Such a model enables causal discovery and inference for a greater variety of situations.
Our work was stimulated by average daily discharge data from 68 gauging stations along the Rhine and Aare catchments in Switzerland, see Figure 1. The data were collected by by the Swiss Federal Office for the Environment (hydrodaten.admin.ch), but were provided by the authors of Engelke2020, with some useful preliminary insights. We focus on the causal relationship between extreme discharges, for which precipitation is an obvious confounder, and use daily precipitation data from meteorological stations, provided by the Swiss Federal Office of Meteorology and Climatology, MeteoSwiss (gate.meteoswiss.ch/idaweb). Unlike in our simulation experiments, we know neither the true tail properties of the discharges and precipitation nor the effect of the confounder. We use precipitation as a covariate and show that it allows correct inference on the direct causal relationships between discharges for the majority of the station pairs, with at least estimated confidence, which was impossible without our proposed approach.
The paper is organised as follows. Section 2 discusses the causal tail coefficient, its interpretation and its properties. Section 3 introduces a new parametric estimator for it based on generalized Pareto modelling of threshold excesses, which allows a known confounder to be used as a covariate. A simulation study in Section 4 underlines the strengths and limitations of the two estimators. Section 5 presents a permutation test intended to detect direct causality between two heavytailed variables, which is also assessed via simulation. Section 6 applies the methodology to the river discharges, and Section LABEL:s:conclu gives a brief discussion.
2 Causal tail coefficient and its estimation
We first give some basic notions needed to describe the setting in which causal relationships between random variables can be recovered.
Definition 2.1.
A linear structural causal model (LSCM) over a set of random variables satisfies
where is a set of nodes representing the corresponding random variables, is the set of parents of , is called the causal weight of node on node , and are jointly independent noise variables. We suppose that the associated graph , in which the directed edge belongs to if and only if , is a directed acyclic graph (DAG).
In a DAG , we say that is an ancestor of in , if there exists a directed path from to . The set of the ancestors of in is denoted by , and we define . In a LSCM over random variables , with associated DAG , we say that causes , if . We call a confounder (or common cause) of and if there exist directed paths from to and from to in that do not include and , respectively. We say that there is no causal link between and if . For any we let denote the sum of the products of the causal weights along the distinct directed paths from vertex to vertex ; we set and if .
Let and be random variables from a LSCM with respective distributions and . The causal (upper) tail coefficient of a random variable on another random variable is defined as (Gnecco2019)
(1) 
if the limit exists. This coefficient lies between zero and one and captures the causal influence of on in their upper tails: if has a linear causal effect on , will be close to unity. The coefficient is asymmetric, as extremes of need not lead to extremes of , and in that case, will be appreciably smaller than . As only depends on the rescaled margins of the variables, it is invariant to monotone increasing marginal transformations.
If both tails are of interest, the causal tail coefficient can be generalized to capture the causal effects in both directions, by considering the symmetric causal tail coefficient of on , i.e.,
if the limit exists, where . As ,
The interpretation and properties of are similar to those of . The symmetric version captures the causal influence of on in both of their tails.
For simplicity we focus on in this paper, though all of our results and methods can be generalized to both tails by considering instead, if the assumptions for the upper tails are also satisfied in the lower tails of the variables considered.
Before stating the theorem that describes how the underlying causal relationships in a set of random variables can be recovered, we define the concept of regular variation.
Definition 2.2.
A positive measurable function is said to be regularly varying with index , written , if for all , . If , then is said to be slowly varying.
Definition 2.3.
The random variable is said to be regularly varying with index , if, for some , as .
Independent regularly varying random variables are said to have comparable upper tails if there exist , and such that, for each , as .
The following theorem describes how the causal relationships underlying a set of random variables can be recovered from their causal tail coefficients.
Theorem 2.4 (Gnecco2019, Gnecco2019).
Let be random variables from a LSCM, with associated directed acyclic graph and suppose that

the coefficients of the linear structural causal relationship are strictly positive for all and , and

the realvalued noise variables are independent and regularly varying with comparable upper tails.
Then the values of and allow one to distinguish between the different possible causal relationships between and summarized in Table 1.
causes  
causes  common cause only  
no causal link 
Under the theorem’s assumptions, the blank entries in Table 1 cannot occur. Theorem 2.4 is generalizable to the variant of the coefficient and possibly negative values if the assumptions are also satisfied in the lower tails of the variables.
Gnecco2019 show that under the setup and assumptions of Theorem 2.4, the causal tail coefficient (1) for any distinct , and with , is
(2) 
If are independent replicates of , with the random variables and from the LSCM, then the nonparametric estimator of is defined to be
(3) 
for some , where denotes the indicator function, denotes the order statistic and
is the empirical cumulative distribution function of
, i.e.,This estimator is the empirical counterpart to (1), as is a quantile of the corresponding empirical distribution. The value of controls the number of data pairs in the upper tail of that contribute to the estimator. Under the assumptions of Theorem 2.4 and a “very mild assumption that is satisfied by most univariate regularly varying distributions of interest”, estimator (3) is consistent as , for a choice of such that and (Gnecco2019).
A strength of the causal tail coefficient approach is its asymptotic robustness to hidden confounders. A frequent assumption in causality is that all the relevant variables are observed, but this is usually moot. Theorem 2.4 holds even when some variables in the underlying LSCM are unobserved. The capacity to deal with confounders both when studying the causal relationship between two variables and when retrieving a causal order is not generally shared by other approaches in causal inference, as argued by Gnecco2019, but this valuable property has the limitation that the unobserved variables must satisfy the regular variation assumption of Theorem 2.4, which cannot be checked in practice and may be unrealistic. In our motivating setting, for example, the tail of the precipitation variable, the confounder, may not behave like the tails of the river discharges. Moreover, it may be difficult to distinguish between causal cases using empirical estimates. In particular, an increase in the strength of the causal effect of a common confounder of and will increase , making it harder to tell whether a high value of indicates that or that . This is illustrated in the simulations in Section 4.
3 Parametric Tail Causality and Confounder Dependence
The causal tail coefficient (1) is useful when trying to understand a causal relationship between two variables, as it allows inference on the underlying extremal causal structure and has attractive asymptotic properties. However, the nonparametric estimator (3) is inappropriate when the underlying assumptions are not met or when a hidden confounder has a strong influence. This problem generally occurs when the confounder is different in nature from the variables of interest, as its tail is then rarely comparable to theirs. In particular the tail of the confounder may be heavier, giving it an even larger unwanted effect on the coefficients. Thus a useful modification would be to allow conditioning on the value of a known confounder in order to remove or at least reduce its effect on the estimated causal tail coefficient . In this section, we propose an approach to this through a peaksoverthreshold method. Another useful modification would be a reliable statistical test for direct causality. We propose and discuss such a test, based on a permutation resampling method, in Section 5.
3.1 Generalized Pareto Causal Tail Coefficient
We first model the tails of the variables using the generalized Pareto distribution (GPD) to model the excesses (IExtr, Chapter 4). This has scale and shape parameters that are typically estimated using maximum likelihood and that can depend on other variables (davison1990). For , and under mild conditions on , for a threshold large enough, the threshold excesses follow approximately a generalized Pareto distribution. That is,
(4) 
with a scale parameter and a shape parameter that determines :

corresponds to lighttailed distributions, and then lies in the maximum domain of attraction of the Gumbel distribution;

corresponds to heavytailed distributions, and then lies in the maximum domain of attraction of the Fréchet distribution; and

corresponds to distributions with bounded upper tails, and then lies in the maximum domain of attraction of the (reverse) Weibull distribution.
Any random variable satisfying the assumptions of Theorem 2.4 satisfies (4), as a regularly varying random variable with index lies in the Fréchet maximum domain of attraction. If the threshold is chosen to be the quantile of for some , then we can write
and using the empirical distribution to estimate and maximum likelihood estimation using the excesses of to obtain and yields an estimator of the distribution function of , i.e.,
Using hybrid estimators for and for an integer yields the parametric GPD causal tail coefficient estimator for ,
(5) 
where . Unlike with the nonparametric estimator (3), the number of data pairs used in (5) may not equal , as it depends on the fit of .
The GPD model can be reparametrized to allow dependence on time or on another variable of interest. More precisely, the model’s parameters can be rewritten in the form , where denotes either , or both, is an inverse link function,
is a vector of parameters and
contains the values of explanatory variables on which the model might depend (IExtr, Chapter 6).We wish to reparametrise the model to reduce or remove the effect on of a potential confounder of and . If appears linearly in the LSCM then under the setup in Section 2 it is straightforward to show that affects the scale parameters of the GPD model that applies to and above high thresholds, and not their shapes.Thus we write
where is the replicate of corresponding to the observations of . This yields, for , the parametric conditional linear generalized Pareto distribution (LGPD) causal tail coefficient estimator for ,
(6) 
Estimation of , and is performed by maximum likelihood. In applications it is preferable to center and rescale the values of
to have unit variance and be centered around zero.
3.2 The positive linear scale issue
Linear modelling of the GPD scale parameter may not yield positive scale estimates for each and . The use of a nonlinear link function to ensure that the scale estimates were positive would not agree with the assumption of extremal linearity of the causal relationships, as the effect of on the scale is also necessarily linear. We now describe two different solutions to this problem, which we compare by simulation in Section 4.
The first solution, postfit correction, replaces in (6) by for an arbitrarily small positive . The second solution, the constrained approach, applies the constraints
(7) 
to the estimates when maximizing the likelihood. When the data have a known distribution, box constraints can replace the linear ones. For example, if , and have distributions, then and . Thus, if , then
(8) 
where the lower and upper bounds are needed for positive and negative , respectively.
4 Simulation Study
Here we perform a simulation study using the Student
, Pareto and lognormal distributions. The first two lie in the Fréchet maximum domain of attraction and are regularly varying with index
. As the Pareto distribution exactly satisfies Definition 2.3, one might expect better behaviour with this than with the Studentdistribution. The lognormal distribution lies in the maximum domain of attraction of the Gumbel distribution and is not regularly varying, but finite samples from it can appear to be heavytailed.
We focus on the behaviour of the causal tail coefficient estimators (3) and (6) between two variables and in their causal configurations, as shown in Figure 2. As we study the estimators of causal effects of both on and of on , we generated simulations only for the four causal cases, A, B, C and D. The LSCM causal weights , and were chosen to equal unity, by default, for each existing edge in all four cases. Hence, in D, is caused by and with equal strength, even though has a second effect on through .
Unless stated otherwise, each estimate is based on a random sample of triples , of which were chosen based on Gnecco2019, who found that the optimal fractional exponent of for choosing seems to lie between and . The factor doubles the number of data pairs used in the estimator, thus decreasing its variability, but does not introduce much bias for such large values of . One thousand independent replicate estimates were calculated for each of the four causal configurations and three distributions.
We present only the highlights of the study; the code and all the results are available from github.com/opasche/ExtremalCausalModelling.
4.1 Variables with comparable tails
Detailed results for variables with comparable tails are presented in Section LABEL:sm:comparableTails of the Supplementary Material. In this case it is essentially always possible to infer the existence and direction of any causality between and , based on the nonparametric or conditional LGPD estimators, (3) or (6), of and alone. When the causal effects of on and , i.e., and , are increased relative to the noise variance and any causal effect of on , both and increase in configuration B, and increases in configurations C and D. This increase is more marked with the nonparametric estimators of and , which are biased upwards in these configurations. When the confounder has a high causal impact, inference based on the nonparametric estimator (3) for direct causal link between and is sometimes impossible, as and their difference is close to zero in configurations B and D.
Use of the conditional LGPD estimator (6) greatly reduces the effect of on the coefficient estimates in configurations B and D. For Pareto and lognormal data, the results are indistinguishable from those without the confounder, both in terms of location and variability, as if the effect of had been entirely removed. For the Student distribution, the estimates are also shifted to around the same values as in the corresponding confounderfree configurations, though their upper tails are marginally heavier. These few greater values remain appreciably lower than without as a covariate. For configurations A and C, unlike for B and D, the estimator is almost unaffected by the addition of as a covariate when it is not a confounder. This is also a useful property, as it could allow tests of whether a specific covariate is a confounder of two variables, based on changes to the estimated coefficient when it is added to the model.
4.2 Confounder with a different tail
One generalisation allows the tail of the distribution of to be heavier or lighter than those of and . A lighter tail does not negatively affect whether the nonparametric and conditional LGPD estimators can infer a direct causal relationship between and , since the tails of and are then dominant. Figure 3 shows the sampling distributions of and for all four causal structures when the tail of is heavier than those of and . The true coefficient values are unknown, as assumption (b) of Theorem 2.4 is not satisfied, though the coefficient for comparable tails, (2), is shown for comparison.
When has a heavier tail than and , the nonparametric estimators and in configuration B and in configuration D are shifted well towards unity. With an even heaviertailed, Student , distribution for (not shown here), the Student results resemble those for the Pareto and lognormal distributions. In all these cases it becomes impossible to infer a direct causal relationship between and , due to the effect of the heavier confounder tail on the nonparametric estimators.
Figure 4 shows the sample distributions of and with postfit correction, for all four causal configurations, when the tail of is heavier than those of and .
Figure 3 shows that in configurations B and D the nonparametric estimator is badly affected by the heavier tail of : all estimates lie very close to unity, rendering inference on the direct causal relationship between and impossible. In the conditional LGPD context, the use of as a covariate solves this problem: the estimates shift towards the coefficient values in the corresponding confounderfree cases, and consistently yield positive values of the difference of estimates for configuration D and differences centred at zero for configuration B; see also Section LABEL:sm:comparableTails of the Supplementary Material. The estimates in configurations A and C, without the confounder causal effect, are barely changed by the addition of as a covariate.
Simulation results for and with the constrained fit are very similar to those for postfit correction for the Pareto and lognormal distributions, but not for the Student distribution. Figure 5 shows the sample distribution of and with the constrained fit, for a heavier confounder tail.
For the Student distribution, the effect of the confounder on the estimator is appreciably less reduced for the constrained fit than for postfit correction, compared to the nonparametric results. As the Student distribution is heavy in both tails, the lower inequality constraint in (7) forces () to have an appreciably smaller slope, explaining this reduced effect. In configurations with a confounder, the absolute values of the constained may be up ten times smaller than those for postfit correction. With both approaches rarely differs greatly from zero for configurations without a confounder.
In the case of the Student distribution, both types of constraint yield very similar estimates; see github.com/opasche/ExtremalCausalModelling.
To summarize, the simulations show that both the nonparametric estimator (3) and the conditional LGPD estimator (6) perform well when the theoretical assumptions are met and the influence of a hidden confounder is not too strong. When this influence grows, it becomes increasingly difficult to confidently infer the causal relationship between the variables using the nonparametric estimator, but the conditional LGPD estimator allows us to detect this relationship by reducing the effect of the confounding.
5 Testing for Direct Causality
5.1 Permutation Test
In situations such as the causal analysis presented in Section 6, the distributions of the and estimators must be estimated to be used for inference. One way to obtain such distributions would be bootstrap resampling, but the extremal nature of the causal tail coefficient would require an unrealistically large sample size for its bootstrap distributions to be trustworthy, as these distributions tend to be too discrete in the extremes.
We therefore propose a permutation test (DavisonBootstrap, Chapter 4) for direct causality between two observed variables, measuring the asymmetry in their direct causal relationship. Suppose we have a sample
from a LSCM and wish to test the null hypothesis of no direct causal relationship between
and , , versus the alternative that causes , . Our proposed procedure is as follows:
Rescale values (, ), where known confounders can be used in the distribution estimator , as for .

For , obtain and by randomly permuting the indices for each pair ().

Compute on and on .

Obtain the Monte Carlo
value, by comparing the value of the test statistic on the original rescaled data with the permutation distribution,
If there are no asymmetric confounding effects on the two variables, then under , whereas under ; see Theorem 2.4. In the first case, the direct causal relationship is symmetric, i.e., is as likely to take extreme values when is extreme as is when is extreme. If so, then permutations such as those performed in step 2. are equally likely, have a common distribution centered around zero, and
will be uniformly distributed. Under the alternative, the direct causal relationship is “asymmetric”, as
is more likely to be extreme when is extreme than conversely; then is more likely to lie in the upper tail of . Thus the distribution ofwill become increasingly skewed towards zero as the causal strength of
on increases.If all asymmetric confounding effects are captured in , and have comparable tails and causal effects behave linearly in the extremes, then the proposed procedure should provide a reliable value for testing direct causality of on .
5.2 Simulations
We used simulation from different data distributions and for different causal configurations involving and a potential confounder to assess our proposed test. We used values of for the causal strength of on , with confounding effects both present and absent. Symmetric () and asymmetric ( and , or and ) confounding effects were considered, and the noise variable were Pareto, Student and lognormal. We generated replicate samples of independent triples for each causal configuration and noise distribution. Three versions of the permutation test were performed for each sample, corresponding to the causal tail coefficient estimators discussed in Sections 2 and 3: nonparametric (3) and conditional LGPD (6) with either postfit correction or constrained fit. Each used permutations and the estimator hyperparameters were set to and .
Figure 6 shows uniform QQplots of for the Pareto and Student distributions, in the case of heavier confounder tail, with symmetric effects. In the absence of confounding the test behaves as expected in both cases, and adding dependence on the independent variable in the modelling through the parametric estimators has no visible effect on the distribution of compared to the nonparametric approach. For the Pareto distribution, the test has power of almost for a direct causal strength of , and it behaves perfectly for higher causal strengths. For the Student distribution, the test reaches a power of for a direct causal strength of , of for causal strength of and of near 1.0 for a causal strength of .
When the confounding effects are added, the test based on the nonparametric estimator fails for the Pareto distribution, as most of the then lie outside the confidence bands, indicating that the distribution of is highly nonuniform. This is corrected when the value of the confounder is taken into account using the parametric approaches, and power , for a direct causal strength of only one twentieth of the confounder’s marginal effects. In the Student case, seems to be close to uniformity in the absence of direct causality (the difference in tail shape is much greater in the Pareto case), but postfit correction increases the power from below to above for a direct causal strength of one fifth of the confounder’s marginal effects. Similar conclusions to those of Section 4.2 about the constrained fit for distributions with both tails heavy apply, as the constrained fit estimator is not significantly better than the nonparametric estimator compared to postfit correction.
Figure 7 shows the uniform QQplots with asymmetric confounding effects for the Pareto distribution with comparable tails.
Unlike in the corresponding symmetric case, the test here fails when using the nonparametric estimator owing to the asymmetry induced by the confounder, but both parametric approaches remove this unwanted effect by enough that nearly has a uniform distribution, with almost perfect power, for a causal strength of one sixteenth and one twentieth of the marginal confounding effects.
6 Application to Swiss rivers
We now illustrate how our method can discover direct causal relationships between the discharge extremes of pairs of river stations. This illustrates our method on a real example for which we know the ‘ground truth’ of extremal causality, but unlike in the simulations of Section 4, we cannot control and do not know the true tail behaviour of the stations discharges and their potential confounders.
6.1 Data sources and additional collection
We use the average daily discharges of the Swiss gauging stations shown in Figure 1, and add daily precipitation data from meteorological stations. Some additional information, such as the stations’ elevation, their catchment surface area and mean elevation, their glaciation percent and their coordinates, was collected from the Federal Office for the Environment’s website. To reduce any seasonal effects due to unobserved confounders, we only consider data during June, July and August, as the more extreme observations happen during this period when mountain rivers are less likely to be frozen.
Figure 8 shows relationships between the estimates, station altitudes and average discharges. Although altitude does not greatly affect the estimates, the shape parameter estimates broadly decrease with increased average river discharge volume.
6.2 Choice of stations and comonotonicity
For the causal analysis, we consider pairs of stations with known direct causal relationships, and pairs with no direct causal relationship. Causal pairs are ordered by the flow of water, with one downstream of the other. The river volumes for the pairs should be as similar as possible, as our exploratory analysis indicated different tail behaviours for rivers with very different average discharges. There should also be enough confluences between the two stations, otherwise one would observe comonotonicity, i.e., almost perfect dependence, between their discharges. If there is comonotonicity between and , then , for all , and it is impossible to know which variable causes which based on the data alone, even if one is certain of direct causality. Confluences between the two stations reduce comonotonicity and make it possible to detect the direction of causality.
As we shall use precipitation as the confounding covariate, the stations must share likely meteorological effects and must lie in regions where precipitation data is available. Based on these criteria, we chose causal station pairs for the analysis: , , , , , , .
The noncausal station pairs were selected to have similar average volume and similar shape parameter estimates. Both pairs with stations separated by long distances and pairs relatively close to each other were considered. The pairs selected are , , , , , , , , , , , , .
The choice of covariate for the causal pairs was the mean daily precipitation among the meteorological stations in the area and the catchment of the two stations. The choice of covariate was less meaningful for the noncausal pairs with large separating distances, as they have different meteorological conditions, so the average daily precipitation over the whole country was used. For the pair , which has the closest stations and local precipitation data available, the daily average in the local catchments was also considered. In the latter case, the pair will be highlighted with an asterisk to avoid confusion.
6.3 Causal Analysis Results
For each station pair, the permutation test for direct causality was performed using the nonparametric (3) and conditional LGPD (6) estimators with postfit correction or constraints, with permutations and estimator hyperparameters and . Table 6.3 shows the values of , the covariate shape estimate and its estimated extremal linear effects for the two stations, the latter estimated without constraints. The number of common observations for the pairs varies from to , and lies between and . With precipitation covariates added, the number of common observations ranges from to , and lies between and .
Comments
There are no comments yet.