Streaming data is used in a wide array of contemporary research fields, driven by the Internet of Things (IoT) and continuous sensor observation. The data is acquired continuously and usually at a fast pace. Typically one cannot store all of the data ever streamed due to memory constraints, and it is infeasible to repeat statistical analyses on the entire stream as it grows indefinitely over time. To deal with this problem, estimation methods for many different statistical analyses on such data streams have been proposed (Gama, 2010; Aggarwal and Philip, 2007). Some of these techniques use statistical summaries, where only a succinct number of carefully selected elements that have entered the data stream are stored. One of the most popular analyses that the literature has focused on is quantile estimation for a univariate variable observed in a data stream (including median estimation) (Buragohain and Suri, 2009). Recently, Gregory (2018) proposed an algorithm to generate an approximation to the bivariate empirical copula function (a popular method of nonparametric dependence modelling) using a succinct statistical summary; this approximation has a guaranteed error bound.
This paper builds on the work in Gregory (2018) and considers an important use of copulas: computing tail dependence coefficients between random variables. Tail dependence coefficients between random variables quantify their correlation in the tails of their marginals. For example, two random variables may be weakly correlated in the vast majority of their probability space, yet extremely correlated in the tails of this space. This behaviour is often seen in financial analysis (Rodriguez, 2007), where sometimes two assets exhibit sharp price increases and decreases at similar times, but tend to have relatively uncorrelated typical daily price movements. The importance of tail dependence in fields such as hydrology (Poulin et al., 2007) and energy (Reboredo, 2011) has also been studied. For the purpose of estimating empirical tail dependence coefficients between streams of data, the type of summary that was proposed in Gregory (2018) is not sufficient, since it returned uniform error over the entire marginal distribution. Empirical tail dependence coefficient approximations require that the error can be reduced in relative terms at the tails, so that the error stays constant as the number of elements in the data stream grows. The copula summary in Gregory (2018) instead results in an approximation to the tail dependence coefficient with linearly growing error w.r.t. the number of elements in the data stream. To remedy this, relative accuracy quantile summaries from the univariate literature (Cormode et al., 2005) are employed within the copula summary, allowing suitable properties of the modified summaries’ approximation error to be proved. These properties lead to an approximation of the tail dependence coefficient with constant error w.r.t. the number of elements in the data stream.
This article is structured as follows. The next section introduces the empirical copula and how it can be used to construct empirical tail dependence coefficients between random variables observed through data. Next, Sec. 3 describes the challenges associated with computing empirical copula approximations when the data is streamed sequentially, and how the copula summary proposed in Gregory (2018) can be used to provide an approximation to copulas in this streaming regime. Then Sec. 4 introduces how one can adapt the summary to compute accurate approximations to empirical tail dependence coefficients for streaming data. Finally, Secs. 5 and 6 provide theoretical and numerical analyses of the approximation, respectively.
2 Empirical copulas and the coefficients of tail dependence
A copula is a dependence model between two or more different random variables. For the remainder of this paper, we will focus on the case where one has only two random variables due to the bivariate form of the work presented in Gregory (2018). That paper did present an example of how the proposed work could be extended to higher dimensions, although the associated evidence was empirical and therefore falls outside the theoretical scope of the current study. More specifically, a bivariate copula function $C(u, v)$, for $u, v \in [0, 1]$, is the joint distribution function between the random variables $X$ and $Y$ where both marginals are uniformly distributed. The bivariate copula function is given by,
$$C(u, v) = F\big(F_X^{-1}(u), F_Y^{-1}(v)\big), \qquad (1)$$
where $F$ is the joint cumulative distribution function (CDF) of $X$ and $Y$, and $F_X^{-1}$ and $F_Y^{-1}$ are the generalized inverse CDFs (quantile functions). They are defined by,
$$F_X^{-1}(u) = \inf\{x : F_X(x) \geq u\}, \qquad F_Y^{-1}(v) = \inf\{y : F_Y(y) \geq v\},$$
respectively (Charpentier et al., 2007). In most applications one has access to data and simulated from and respectively. In this case, an empirical copula function is typically found to represent the dependence between the two data-sets in . The empirical copula (Deheuvels, 1980) converges to the true copula function in (1), within the limit of . It is defined by,
where is the cardinality of the set , such that correspond to elements in that satisfy , and is the ’th order statistic of . Also here, is the empirical quantile function (Ma et al., 2011) of , for which we will use the approximation,
Finally, is the empirical CDF given by
where is the ’th element of and is the indicator function. For more information on the statistical explanation behind the approximation in (2) and what the terms within it approximate, turn to Gregory (2018).
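As a point of reference for the non-streaming case, the following is a minimal sketch of a rank-based empirical copula evaluated on a stored sample. It assumes the standard Deheuvels-type form (the fraction of pairs whose scaled ranks fall in $[0, u] \times [0, v]$), which may differ in detail from the exact expression in (2); the function names `ranks` and `empirical_copula` are hypothetical.

```python
def ranks(values):
    """Return the 1-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result


def empirical_copula(x, y, u, v):
    """Assumed rank-based form: share of pairs with rank_x <= u*n and rank_y <= v*n."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    count = sum(1 for a, b in zip(rank_x, rank_y) if a <= u * n and b <= v * n)
    return count / n


# Usage: perfectly dependent versus perfectly opposed samples.
x = list(range(1, 11))
print(empirical_copula(x, x, 0.5, 0.5))        # comonotone sample
print(empirical_copula(x, x[::-1], 0.5, 0.5))  # countermonotone sample
```

This version needs the whole sample in memory and a full sort, which is exactly what becomes infeasible once the data arrives as a stream.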
One of the by-products of a copula is the computation of the tail dependence coefficients (there is an upper and a lower one). These coefficients allow one to study the dependence in the tails of each of the marginals of $X$ and $Y$. For example, $X$ and $Y$ may have low correlation over their entire probability space, yet have very high correlation when both $X$ and $Y$ take extreme values. This aspect of dependence is important in many applications, for example in financial analysis, where it is crucial to recognise whether two assets have a high relative probability of both crashing at similar times. The lower tail dependence coefficient between $X$ and $Y$ can be computed directly via the copula function,
$$\lambda_L = \lim_{u \to 0^+} \frac{C(u, u)}{u}, \qquad (4)$$
and so too the upper tail dependence coefficient,
$$\lambda_U = \lim_{u \to 1^-} \frac{1 - 2u + C(u, u)}{1 - u}. \qquad (5)$$
There are many estimators for the tail dependence coefficients, some of which assume a parametric form for the copula . There are also many nonparametric estimators that fall into the scope of this paper considering empirical copulas ; this paper is not concerned with the positives/drawbacks with any particular estimate. For a detailed account of these estimates, see Frahm et al. (2005). Using the empirical copula , one estimate of the empirical lower tail dependence coefficient is given by (Caillault and Guegan, 2005),
and one estimate of the empirical upper tail dependence coefficient is given by,
These are consistent with the tail dependence coefficients in (4) and (5) respectively as , since the empirical copula is also consistent (Deheuvels, 1980). Empirically one cannot take this limit and therefore it suffices to study the following functions,
For the remainder of this study, we will just concentrate on the lower tail dependence coefficient for brevity, however it should be noted that the following framework and theoretical analysis (see Sec. 5) can be readily extended to the case of the upper tail dependence coefficient. For , the function in (6) describes the path of as tends to 1 (Caillault and Guegan, 2005). It has been proposed to evaluate the function in (6) with the minimum value of that the function is decreasing for (Caillault and Guegan, 2005). However for the scope of this paper, which will estimate these functions for an arbitrary fixed value of , this particular selection is not justified further. This paper will instead concentrate on the estimation of the function in (6) when the empirical copula function must be constructed over a data stream. The next section will propose an approximation to the tail dependence functions using an approximation to the empirical copula, formed via a succinct summary of the data stream.
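To make the fixed-index estimate concrete, here is a small sketch of the lower tail dependence function on a stored sample. It assumes that the function in (6) takes the form of the empirical copula at $(i/n, i/n)$ divided by $i/n$, following Caillault and Guegan (2005), and that the empirical copula is the standard rank-based one; under those assumptions the ratio reduces to the number of pairs with both ranks at most $i$, divided by $i$. The names `ranks` and `lower_tail_dependence` are hypothetical.

```python
def ranks(values):
    """Return the 1-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result


def lower_tail_dependence(x, y, i):
    """Assumed form C_n(i/n, i/n) / (i/n) for a fixed index i with 1 <= i <= n."""
    rank_x, rank_y = ranks(x), ranks(y)
    # Pairs that sit among the i smallest values of both components.
    joint = sum(1 for a, b in zip(rank_x, rank_y) if a <= i and b <= i)
    # (joint / n) / (i / n) simplifies to joint / i.
    return joint / i


# Usage: the i = 3 smallest observations of each component.
x = list(range(1, 11))
print(lower_tail_dependence(x, x, 3))        # tails coincide
print(lower_tail_dependence(x, x[::-1], 3))  # tails never coincide
```

Because the denominator is $i$ rather than $n$, any error in the copula value at the tail is magnified, which is the difficulty the streaming approximation has to address.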
3 Streaming data and the copula summary
Streaming data is the scenario in which, say, the bivariate data stream , is added to sequentially over (possibly indefinite) time. In the context of streaming data, it is not possible to store all of the data points in the stream, or to repeatedly re-compute the order statistics for the empirical copula function (see previous section). This is typically due to restrictions on runtime and memory/storage. Quantile summaries are a popular way of maintaining an approximation to the empirical quantile function in (3) as a univariate data stream is added to, whilst only storing a succinct number of elements from the stream in memory (Greenwald and Khanna, 2001). On this note, define an -approximate quantile summary to be an approximation to the quantile function , that returns a value , for , where . An algorithm to construct such an summary was proposed in Greenwald and Khanna (2001).
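The idea of a uniform-error quantile summary can be sketched as follows. This is a simplified variant in the spirit of Greenwald and Khanna (2001), not their full algorithm: it omits the band structure used to bound the summary size, and the class name `GKSummary`, its method names and the compression schedule are assumptions made for illustration. Each stored tuple is `[value, g, delta]`, where the running sum of `g` gives a minimum rank and adding `delta` gives a maximum rank.

```python
import math
import random


class GKSummary:
    """Simplified uniform-error quantile summary (illustrative sketch)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [value, g, delta], kept sorted by value

    def insert(self, v):
        # Index of the first stored value strictly greater than v.
        idx = 0
        while idx < len(self.tuples) and self.tuples[idx][0] <= v:
            idx += 1
        if idx == 0 or idx == len(self.tuples):
            delta = 0  # new minimum or maximum: its rank is known exactly
        else:
            succ = self.tuples[idx]
            delta = succ[1] + succ[2] - 1  # inherit the successor's uncertainty
        self.tuples.insert(idx, [v, 1, delta])
        self.n += 1
        if self.n % max(1, int(1 / (2 * self.eps))) == 0:
            self.compress()

    def compress(self):
        # Merge a tuple into its successor while the rank uncertainty
        # of the merged tuple stays within 2 * eps * n.
        budget = math.floor(2 * self.eps * self.n)
        i = len(self.tuples) - 2
        while i >= 1:  # index 0 (the minimum) is never removed
            cur, nxt = self.tuples[i], self.tuples[i + 1]
            if cur[1] + nxt[1] + nxt[2] <= budget:
                nxt[1] += cur[1]
                del self.tuples[i]
            i -= 1

    def query(self, phi):
        """Return a stored value whose rank is within eps * n of ceil(phi * n).

        Assumes at least one element has been inserted.
        """
        r = max(1, math.ceil(phi * self.n))
        e = self.eps * self.n
        rmin = 0
        for idx, (_, g, delta) in enumerate(self.tuples):
            rmin += g
            if rmin + delta > r + e:  # first tuple whose maximum rank overshoots
                return self.tuples[idx - 1][0]
        return self.tuples[-1][0]


# Usage: a shuffled stream of 1..n, so each value equals its own rank.
random.seed(0)
n = 2000
stream = list(range(1, n + 1))
random.shuffle(stream)
summary = GKSummary(eps=0.05)
for value in stream:
    summary.insert(value)
print(len(summary.tuples), summary.query(0.5))
```

Note that the rank tolerance here is the same at every quantile, including the tails, which is the property that Sec. 4 sets out to change.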
The work in Gregory (2018) proposed another summary , made up of -approximate separate quantile summaries. This copula summary maintained an approximation to the bivariate empirical copula function in (2) over the data stream . It was shown that an approximation within can be achieved. Just like the univariate quantile summaries that it is composed from, the copula summary was space-efficient and stored only a succinct number of elements from the data stream. The extension to this summary, proposed in Sec. 4.1, will allow an approximation to tail dependence coefficients between the random variables and to be estimated from the data stream.
4 Estimating the coefficient of lower tail dependence for data streams
The copula summary presented in Gregory (2018) was not suitable to find such coefficients of tail dependence for one main reason: the error of the approximation was uniform over a grid of evaluation points and . Therefore the resolution of the approximation would be as refined on the tails of the two marginals as it would be for the medians of both marginals. This results in an approximation to the tail dependence coefficient (replacing respectively in (6)) that has error growing linearly with . One can see this from the following error bound of the lower tail dependence coefficient function for fixed ,
Simply refining the prescribed error would be insufficient; one would need to sequentially refine as the stream gets longer, tending towards 0. The work presented in this paper is inspired by Cormode et al. (2005), which considered biased quantile estimation and modified -approximate quantile summaries to refine the error sufficiently at the tails, at the expense of error not at the tails. This will, as is apparent from the error analysis later in the paper, guarantee that the error from the approximation of the lower tail dependence coefficient stays fixed as the number of elements in the stream is increased.
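The argument above can be illustrated numerically. The sketch below assumes only that the tail dependence function divides the copula value at $(i/n, i/n)$ by $i/n$, so that any error in the copula value is multiplied by $n/i$; the constants in the paper's bounds are ignored, and the function name `tail_coefficient_error` is hypothetical.

```python
def tail_coefficient_error(copula_error, i, n):
    """Error in C(i/n, i/n) / (i/n) induced by an error in the copula value."""
    return copula_error / (i / n)


eps, i = 0.01, 50
for n in (10**3, 10**4, 10**5):
    # Uniform summary: the copula error stays at eps regardless of the query.
    uniform = tail_coefficient_error(eps, i, n)
    # Relative summary: the copula error shrinks in proportion to i/n.
    relative = tail_coefficient_error(eps * i / n, i, n)
    print(n, uniform, relative)
```

For a fixed index the first column of errors grows in proportion to the stream length, while the second stays at roughly `eps`, which is why a relative-error guarantee at the tails is the property sought.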
4.1 Modifications to the copula summary
This section details the specific modifications to the -approximate quantile summary introduced in Sec. 3 (Cormode et al., 2005), and therefore the copula summary, in order to obtain a suitable approximation to tail dependence coefficients. The proposed algorithm in Cormode et al. (2005) constructed a summary to maintain an approximation, with guaranteed error bounds, to the ‘biased’ quantiles , for 2, , and . An approximation to the biased quantiles should have error relative to the quantile query, such that the approximation to should have an error of rather than the uniform error of from the standard quantile summary. This relative error allows one to refine the quantile approximation within the tails of an univariate distribution, and therefore is suited to the problem considered in this paper. On this note, define a -approximate quantile summary to be an approximation to , for , which returns the value , where
Recall that the copula summary is composed of quantile summaries, . As is proved later in the analysis of this method, it suffices to modify the summaries into -approximate summaries in order to obtain a suitable approximation to the lower tail dependence coefficient. To modify them, the manner in which each summary is maintained and queried (see Secs. 4.2, 4.3 and 4.4) is changed from that explained in Gregory (2018). First, the basic make-up of each quantile summary remains the same as in the initial summary proposed in Greenwald and Khanna (2001). Each summary is made up of tuples, . The values , where , are elements that have been seen in the data stream , for , so far. These values are maintained by the summary as ‘cover’ for a range of quantiles that one may query. That is, the value will be returned as an approximation to a nearby quantile. The values and control the range of quantiles that the value is returned as an approximation to. They do this by governing the minimum, , and maximum, rank that the value takes in the original data stream. We define , and then,
One also knows the length of the data stream at any one time via . In order to guarantee that the -approximate quantile summary maintains an approximation which satisfies (9) these minimum and maximum ranks must satisfy (Cormode et al., 2005),
In the copula summary, each of the subsummaries , , corresponds to an element inside of (therefore has the cardinality ). Whilst the summary contains elements (and information about their ranks) from the first component of the data stream, i.e. , the summaries , , contain elements (and information about their ranks) from the second component of the data stream, i.e. . On this note, let a tuple in be denoted by where , for , and let a tuple in be denoted by where , for (therefore has the cardinality ). The parameters within these summaries are changed carefully over time as new elements are added to the data stream via the operations defined in the following three sections.
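The rank bookkeeping described in Sec. 4.1 can be sketched for a single quantile summary as follows. The sketch assumes the usual Greenwald and Khanna convention that the minimum rank of a stored value is the running sum of the `g` entries and the maximum rank adds `delta` to it, and it assumes that the relative condition in (10) bounds the gap between the maximum rank of one tuple and the minimum rank of its predecessor by a multiple of that minimum rank, as in Cormode et al. (2005). The exact form of (10), the floor of 1 and the function names are assumptions.

```python
def rank_bounds(tuples):
    """tuples: list of (value, g, delta) sorted by value.

    Returns a list of (value, rmin, rmax).
    """
    bounds, rmin = [], 0
    for value, g, delta in tuples:
        rmin += g
        bounds.append((value, rmin, rmin + delta))
    return bounds


def satisfies_relative_invariant(tuples, eps):
    """Check rmax_{i+1} - rmin_i <= max(1, 2 * eps * rmin_i) for neighbours (assumed form)."""
    bounds = rank_bounds(tuples)
    for (_, rmin_i, _), (_, _, rmax_next) in zip(bounds, bounds[1:]):
        if rmax_next - rmin_i > max(1, 2 * eps * rmin_i):
            return False
    return True


# Usage: exact tuples at the lower tail, coarser tuples further up.
summary = [(1.0, 1, 0), (2.0, 1, 0), (3.0, 1, 0), (4.0, 1, 0), (6.0, 2, 0), (9.0, 2, 1)]
print(rank_bounds(summary))
print(satisfies_relative_invariant(summary, eps=0.25))
```

Under a condition of this kind, tuples covering small ranks must be nearly exact, while tuples covering larger ranks may carry proportionally more uncertainty.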
4.2 Inserting an element into the copula summary
When the element enters the data stream, it should be inserted into the copula summary. This is done by inserting the tuple into . If , then we insert the tuple at the start of , and let . Conversely if , then we insert the tuple at the end of and let . If , then insert the tuple in between and and let . Now for the second component of the new element, let be a new quantile summary. This summary gets inserted between and in the copula summary if , between and if or at the end of the copula summary if . Finally, increase by 1.
4.3 Combining tuples in the copula summary
Combining the tuples in the summaries is occasionally required to remove unnecessary tuples from the summaries, whilst maintaining the elements required for the approximation to be of the desired accuracy. Providing that , sequentially for each element in we find the index satisfying
Once this value is found, the tuples can be combined into the new tuple . We use the condition on in (11) in order to guarantee (10) is satisfied. In addition to combining those tuples, we merge the tuples into a new tuple . See Sec. 5.1 for the implementation of this merging and the bound on the approximation error. Finally, combine unnecessary tuples (e.g. for ) inside of this merged summary , in the manner described earlier in this section. Insert this new summary in the place of in the copula summary, such that the copula summary now is .
4.4 Querying the copula summary
This section now describes how to query the copula summary (maintained over time using the operations in Sec. 4.2 and 4.3) for an approximation to . We will denote this approximation by , as opposed to the approximation from the copula summary proposed in Gregory (2018) composed of -approximate quantile summaries, . First, we compute the approximation to the empirical quantile function using the -approximate quantile summary ; denote this approximation by . Let be equal to the value of that satisfies , and find the total number of elements in the stream that have entered into the first subsummaries ,
Suppose that the indices of the elements to have entered into the first subsummaries form the set ; note this is an approximation to the set introduced in Sec. 2.
Next, let be a merged summary composed of all the subsummaries (again for the implementation details of this merge, see Sec. 5.1). This summary can be queried for an approximation to ; denote this approximation by . Finally, let be a merged summary of the subsummaries . Then define the approximation to the empirical CDF
to be an inverse query of the summary . The implementation details of this query, and a guarantee on its error with respect to the empirical CDF, are provided later in Sec. 5.2. In total, the copula summary approximation , to the empirical copula function is given by,
5 Analysis of the modified approximation
This section provides a theoretical analysis of the approximation in (13) to the empirical copula function, and the resulting approximation to the empirical lower tail dependence coefficient. First, it is important to clarify error bounds for merged -approximate quantile summaries, and an inverse query of a -approximate quantile summary. Recall from (10) that the summary is a -approximate quantile summary if two neighbouring elements and in satisfy,
The next two sections cover two preliminary bounds, before Sec. 5.3 outlines the error bound of the lower tail dependence coefficient approximation using the modified copula summary, proposed in this paper.
5.1 Merging -approximate quantile summaries
We recall from Greenwald (2004) that one can merge -approximate quantile summaries (length ) and (length ) to obtain the quantile summary , containing the elements , which is also -approximate itself. It does this via the following method. Suppose , for , is an element from which exists in . If it exists, let be the largest element in that is less than or equal to . Also, if it exists, let be the smallest element in that is greater than . Then set,
It will now be proved that if and are -approximate then is a -approximate summary as well. It is therefore necessary to show that
For the case where and are from the same summary, let them equal and (say in the summary ). If there exists both the elements (largest element in that is less than or equal to ) and (smallest element in that is greater than ) then we know that and are consecutive elements in . Thus,
If doesn’t exist, then
as and if doesn’t exist, then
For the case where and come from different summaries, say from labelled by , and from labelled by . Then w.l.o.g. let be the smallest element in greater than , and be the largest element in less than or equal to . Then,
We now have the condition for a -approximate summary in (14) for all cases of membership to and for the elements and . Therefore any -approximate summaries merged together will also be -approximate too.
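The merging rule recalled from Greenwald (2004) can be sketched as below. Each summary is represented here directly by its rank bounds, as a list of `(value, rmin, rmax)` tuples; for an element of one summary, the largest element of the other summary that is less than or equal to it contributes its minimum rank, and the smallest element of the other summary that is greater contributes its maximum rank minus one. The representation, the function name `merge_summaries` and the assumption of distinct values across the two streams are choices made for this illustration.

```python
import bisect


def merge_summaries(a, b):
    """Merge two summaries given as lists of (value, rmin, rmax) sorted by value."""

    def rebound(own, other):
        other_values = [t[0] for t in other]
        out = []
        for value, rmin, rmax in own:
            k = bisect.bisect_right(other_values, value)
            below = other[k - 1] if k > 0 else None        # largest other element <= value
            above = other[k] if k < len(other) else None   # smallest other element > value
            new_rmin = rmin + (below[1] if below is not None else 0)
            if above is not None:
                new_rmax = rmax + above[2] - 1
            elif below is not None:
                new_rmax = rmax + below[2]
            else:
                new_rmax = rmax  # the other summary is empty
            out.append((value, new_rmin, new_rmax))
        return out

    return sorted(rebound(a, b) + rebound(b, a))


# Usage: two small summaries with exact ranks within their own streams.
first = [(0, 1, 1), (6, 4, 4), (12, 7, 7)]    # from the stream 0, 2, 4, ..., 12
second = [(1, 1, 1), (9, 5, 5), (11, 6, 6)]   # from the stream 1, 3, 5, ..., 11
for entry in merge_summaries(first, second):
    print(entry)
```

The merged bounds bracket the rank of each stored value in the combined stream, with the width of the bracket inherited from the gaps between neighbouring elements in the two input summaries.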
5.2 Inversely querying -approximate quantile summaries
In this section, we would like to bound the approximation , for and , to the empirical CDF using a -approximate quantile summary of the data stream . This is a simple extension of the proof in Lall (2015) for inversely querying a -approximate quantile summary. Firstly, let , meaning , where and . Let , then we know that using the quantile summary we keep an approximation, , to the ’th order statistic of the data stream; this approximation satisfies . Note that as the summary is -approximate, we have . Also recall from Sec. 4.1 that we can only access the minimum and maximum values that can take, and not actually itself. To find an approximation to we can simply search all values in , for , for the that satisfies (with ) and take as the approximation to . If , of course take . Now we know that , and therefore . Due to the triangle inequality we have that and finally that
5.3 Bounding the error of the lower tail dependence coefficient approximation
Now that we have covered some necessary bounds, we can derive the guaranteed error bound of the modified copula summary and therefore the lower tail dependence coefficient approximation. Recall that the -approximate empirical copula approximation is given by,
and therefore the approximation to the lower tail dependence coefficient (for a fixed ) is given by,
Let denote the -approximate copula summary approximation, given in (19), to the bivariate empirical copula , then one can bound the error of this approximation by,
Therefore the approximation to the lower tail dependence function in (20) can be bounded by,
The approximation error is therefore constant with increasing .
To prove this bound, we shall follow the steps of the proof in Gregory (2018) with some modifications.
We shall split the error
into three contributing parts via the triangle inequality and prove each individually,