Streaming data arises in many applications where data is acquired continuously. This characteristic, together with any memory constraints of the user, makes such data a challenge to analyse. As data is acquired, the analyst must use each element before the next one arrives, and the entire stream cannot be stored. Therefore, given a particular statistical quantity of the data, a summary of the data with respect to this quantity must be maintained over time. This summary is typically much smaller than the entire stream, and it allows an approximation of the desired statistical quantity to be made at any time with only a single pass over the data.
The quantiles of a data stream are a popular example of such a statistical quantity (Buragohain and Suri, 2009). A host of studies (Arandjelović et al., 2015; Greenwald and Khanna, 2001; Munro and Paterson, 1978; Manku et al., 1998) propose methods to construct succinct summaries of univariate data that can be queried at any time to obtain approximate quantiles within a guaranteed error bound (e.g. $\epsilon$-approximate quantile summaries). However, data is rarely univariate. Empirical copula functions are a natural way to model the dependencies between multiple streams of data. This paper adapts the aforementioned Greenwald and Khanna algorithm (Greenwald and Khanna, 2001) to construct an alternative bivariate data summary that answers queries of the empirical copula function with guaranteed error bounds. Whilst the paper does not directly extend the summary to higher dimensions, one can construct models of dependence for such multidimensional data using sets of pair-wise copulas (Aas et al., 2009; Mazo et al., 2015). Therefore, approximations to such a copula can be found by using the $\epsilon$-accurate bivariate copula functions considered here.
This work is related to other studies that also consider the construction of summaries for multidimensional data. These summaries have been used to query multidimensional ranks and ranges (Hershberger et al., 2004; Suri et al., 2006; Yiu et al., 2006). Querying multidimensional ranges, such as a rectangle of points on the plane, is analogous to finding empirical copulas, except that it considers the actual data points on the plane rather than the marginal quantiles. This is where our motivation differs from that of Suri et al. (2006) and Hershberger et al. (2004). Another closely related work is Xiao (2017), which considers the online computation of pair-wise nonparametric correlations. However, that work does not provide any theoretical error bounds on the summarized statistical approximations.
Due to the vast range of industries that use copulas to model dependent data, this application of copula models to streaming data is an important contribution to the data science community. The paper is structured as follows. A background on empirical copulas is given in the next section. In Sec. 3.1, an algorithm to construct the summary used to obtain approximations of empirical copula functions is presented. This is followed by a theoretical and numerical assessment of the approximation from the algorithm in Secs. 4 and 6 respectively. Section 5 gives an example of how higher dimensional copulas, framed as sets of bivariate copulas, can be approximated using the $\epsilon$-approximate copulas presented in this paper. A discussion then concludes the paper.
Copulas represent a joint probability distribution of a multidimensional random variable, and therefore can capture the dependence structure between components. The joint distribution is such that the marginal probability distributions of each component are uniform. Suppose we have two random variables $X$ and $Y$, with marginal cumulative distribution functions (CDFs) $F_X$ and $F_Y$ respectively. (For now we only consider bivariate copulas; higher dimensional copulas are considered later in the paper.) Then the copula function is defined by

$$C(u, v) = F_{X,Y}\big(F_X^{-1}(u), F_Y^{-1}(v)\big),$$

where $u, v \in [0, 1]$ and $F_{X,Y}$ is the joint CDF of $X$ and $Y$. Here $F_X^{-1}$ and $F_Y^{-1}$ are the inverse marginal CDFs (quantile functions). In the case where there do not exist unique values $x$ and $y$ that satisfy $F_X(x) = u$ and $F_Y(y) = v$, generalized inverse CDFs are used, where these are defined by

$$F_X^{-1}(u) = \inf\{x : F_X(x) \geq u\}, \qquad F_Y^{-1}(v) = \inf\{y : F_Y(y) \geq v\},$$

respectively (Charpentier et al., 2007). There exist families of analytical copulas, such as the Gaussian copula and Archimedean copulas, which can be fit to data streams $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$ sampled from $X$ and $Y$. Typically, this involves estimating the parameters within the copula using the data. For example, the Gaussian copula between $X$ and $Y$ is given by

$$C(u, v) = \Phi_{\Sigma}\big(\Phi_X^{-1}(u), \Phi_Y^{-1}(v)\big),$$

where $\Phi_{\Sigma}$ is the joint CDF corresponding to the Gaussian distribution, $\Phi_X^{-1}$ is the inverse marginal Gaussian CDF of $X$ and $\Phi_Y^{-1}$ is the inverse marginal Gaussian CDF of $Y$. Here $\Sigma$ is the covariance matrix between $X$ and $Y$, and can be estimated directly from the data. The mean and variance for the marginal Gaussian CDFs can also be estimated from the data stream. This would be a suitable copula model to use if one knew the dependence structure between $X$ and $Y$ to be Gaussian.
2.1 Empirical copulas
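For many data sets one wishes to compute an empirical copula where the dependence structure is unknown in advance. This empirical copula is based on concordant and discordant ranks of data points and is therefore linked to Kendall's Tau correlation. Suppose $\mathbb{1}(a \leq b)$ is the indicator function, taking the value of 1 if $a \leq b$, and 0 if $a > b$. Also let $x_{(1)}, \ldots, x_{(n)}$ be the order statistics (ranked data) of the data stream $\{x_i\}_{i=1}^{n}$, such that $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$. An empirical copula (Deheuvels, 1979) of the bivariate data stream $\{(x_i, y_i)\}_{i=1}^{n}$ is given by

$$C_n(u, v) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big(x_i \leq x_{(\lceil nu \rceil)}\big)\, \mathbb{1}\big(y_i \leq y_{(\lceil nv \rceil)}\big).$$

This copula weakly converges (with the number of samples $n$) to the true underlying dependence structure between the two components of the data stream (Deheuvels, 1979). There are a variety of different approximations to the quantile function $F_X^{-1}(u)$, for $u \in (0, 1]$ (Ma et al., 2011). A commonly used one shown in (3) is obtained by the piecewise constant function of the order statistics, $\hat{F}_X^{-1}(u) = x_{(i)}$, for $u \in ((i-1)/n, i/n]$. The ceiling function in (3) is used to construct this piecewise function by noting that $\lceil nu \rceil = i$ for $u \in ((i-1)/n, i/n]$. To simplify the analysis later on in the paper, let $\tilde{x}_u = \hat{F}_X^{-1}(u)$, and let $S_u = \{y_i : x_i \leq \tilde{x}_u\}$. The dependence of $S_u$ on $u$ (taking the empirical inverse CDF of $X$ into account) allows (2) to be expressed as

$$C_n(u, v) = \frac{1}{n} \sum_{j=1}^{|S_u|} \mathbb{1}\big(s_j \leq \hat{F}_Y^{-1}(v)\big),$$

where $s_j$ is the $j$'th element in the set $S_u$. As with the inverse CDF, one can approximate the CDF empirically via

$$\hat{F}_{S_u}(y) = \frac{1}{|S_u|} \sum_{j=1}^{|S_u|} \mathbb{1}(s_j \leq y).$$

Then one can state

$$C_n(u, v) = \frac{|S_u|}{n}\, \hat{F}_{S_u}\big(\hat{F}_Y^{-1}(v)\big). \qquad (5)$$

This form of (2) frames the expression in terms of empirical CDFs and inverse CDFs. The problem that this paper considers is updating (5) when elements are continuously added to the stream. The next section considers bivariate copulas in this streaming data scenario, and proposes a methodology to approximate bivariate empirical copulas for such data.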
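As a point of reference for the streaming approximation, the empirical copula can be computed exactly when the whole data set is held in memory. The following is a minimal Python sketch, not the streaming algorithm of this paper. It evaluates the copula in two ways: directly, as the fraction of points lying below both marginal empirical quantiles, and through the empirical CDF of the second components whose first component is small enough. The function names are hypothetical and chosen only for illustration, and the empirical quantile is assumed to be the $\lceil nu \rceil$-th order statistic.

```python
import math


def _order_stat(values, q):
    """Empirical quantile: the ceil(n*q)-th order statistic (1-indexed)."""
    if not 0 < q <= 1:
        raise ValueError("quantile level must lie in (0, 1]")
    n = len(values)
    return sorted(values)[math.ceil(n * q) - 1]


def empirical_copula(xs, ys, u, v):
    """Direct form: fraction of points below both marginal empirical quantiles."""
    n = len(xs)
    x_q, y_q = _order_stat(xs, u), _order_stat(ys, v)
    return sum(1 for x, y in zip(xs, ys) if x <= x_q and y <= y_q) / n


def empirical_copula_cdf_form(xs, ys, u, v):
    """Same quantity, written with an empirical CDF over a subset of y-values."""
    n = len(xs)
    x_q, y_q = _order_stat(xs, u), _order_stat(ys, v)
    # Second components of all points whose first component is at most x_q.
    subset = [y for x, y in zip(xs, ys) if x <= x_q]
    # Empirical CDF of that subset, evaluated at the v-quantile of all y-values.
    cdf_at_y_q = sum(1 for y in subset if y <= y_q) / len(subset)
    return (len(subset) / n) * cdf_at_y_q


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 6, 5, 8, 7]
print(empirical_copula(xs, ys, 0.5, 0.5))            # 0.5
print(empirical_copula_cdf_form(xs, ys, 0.5, 0.25))  # 0.25
```

Both functions need the sorted data, which is exactly what cannot be stored in the streaming setting; the second form isolates the quantile and CDF evaluations that a summary would have to approximate.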
3 Bivariate copulas for streaming data
In the streaming data scenario, one does not wish to store the entire stream of data. Therefore, the estimation of copulas in this setting has to operate in an online manner. In the case of parametric copulas, a suitable approach would be to iteratively update the parameters within the copula model. In some cases this online estimation is exactly equivalent to computing the parameters over the entire stream at once. For example, there are several parameters to estimate within the Gaussian copula model considered earlier. These are the mean and (population) standard deviation of each of the marginals, as well as the covariance between $X$ and $Y$. The means can easily be updated as a new element $(x_{n+1}, y_{n+1})$ is added to the stream by

$$\mu_Z^{(n+1)} = \mu_Z^{(n)} + \frac{z_{n+1} - \mu_Z^{(n)}}{n + 1},$$

for $Z \in \{X, Y\}$, where $z_{n+1}$ denotes the corresponding component of the new element. One can also follow similar updates for the standard deviation and covariance. For many other parametric copulas, Kendall's Tau is used to estimate the parameters within the copula model. An online computation of Kendall's Tau is available in Xiao (2017). Therefore, online estimation of the parameters within parametric copulas could follow. On the other hand, in the case of empirical copulas one cannot feasibly store the entire data stream to compute the order statistics of $\{x_i\}$ and $\{y_i\}$. Therefore the methodology presented in the following sections can be implemented to iteratively maintain an approximation to the bivariate empirical copula, $C_n(u, v)$, over the data stream.
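The online parameter updates for the Gaussian copula can be sketched as follows. This is a minimal Python illustration using Welford-style running updates, which is one standard choice and not necessarily the update the authors have in mind; the class name `OnlineMoments` and its attributes are hypothetical. Population (divide by $n$) moments are used, matching the text.

```python
class OnlineMoments:
    """Running means, population standard deviations and covariance of (x, y)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # running sum of squared deviations of x
        self.m2_y = 0.0   # running sum of squared deviations of y
        self.c_xy = 0.0   # running sum of cross deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        dy = y - self.mean_y          # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each increment pairs an old-mean deviation with a new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def std_x(self):
        return (self.m2_x / self.n) ** 0.5

    def std_y(self):
        return (self.m2_y / self.n) ** 0.5

    def cov_xy(self):
        return self.c_xy / self.n


moments = OnlineMoments()
for x, y in [(1.0, 2.0), (2.0, 1.0), (4.0, 5.0), (7.0, 3.0)]:
    moments.update(x, y)
print(moments.mean_x, moments.mean_y, moments.cov_xy())
```

Each update uses constant memory, and the results agree with the batch formulas up to floating-point rounding, which is the sense in which online estimation is equivalent to estimation over the whole stream.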
3.1 Bivariate empirical copulas for streaming data
An approximation to the bivariate empirical copula can be maintained over the course of the data stream by carefully updating a particular data structure, typically referred to as a statistical summary. The data structure proposed in this section is similar to those used in Suri et al. (2006) and Hershberger et al. (2004) for the estimation of multidimensional ranges in data streams. As in the former study, the data structure proposed in this work for a copula summary stores multiple versions of the quantile summary that was used in Greenwald and Khanna (2001): lists of certain values seen in a data stream, where each value 'covers' the empirical quantiles in (3) within a different range. The size of these quantile ranges depends on the approximation error $\epsilon$ that the user prescribes. On this note, define an $\epsilon$-approximate quantile summary as one that can be queried for the $\phi$-empirical quantile $x_{(\lceil \phi n \rceil)}$, and return a value $x_{(r)}$, where $|r - \lceil \phi n \rceil| \leq \epsilon n$. The next paragraph describes how to construct the quantile summary, and the paragraph that follows discusses how another summary that approximates bivariate empirical copulas can be formed from multiple versions of the quantile summary.
3.1.1 Quantile summary
The quantile summary $S$ is composed of tuples $(v_i, g_i, \Delta_i)$, for $i = 1, \ldots, L$. The values $v_i$, where $v_1 \leq v_2 \leq \ldots \leq v_L$, are a selection of data points that have been seen in the data stream so far. The parameters $g_i$ and $\Delta_i$ in all tuples within the summary are required to infer the range of empirical quantiles that each element in the summary, $v_i$, covers. On this note, let $r_{\min}(v_i)$ and $r_{\max}(v_i)$ be the ranks in the stream that correspond to the minimum and maximum empirical quantiles covered by the summary value $v_i$ respectively. The parameters $g_i$ and $\Delta_i$ infer these ranks via the governing equations,

$$r_{\min}(v_i) = \sum_{j \leq i} g_j, \qquad r_{\max}(v_i) = \sum_{j \leq i} g_j + \Delta_i.$$

The values of $r_{\min}(v_i)$ and $r_{\max}(v_i)$ are minimum and maximum bounds on the rank that the element $v_i$ took in the original stream. This means that the upper bound on the number of elements in the original stream between $v_{i-1}$ and $v_i$ is $g_i + \Delta_i - 1$. The Greenwald and Khanna algorithm updates the quantile summary in a manner that guarantees that

$$\max_{i} \, (g_i + \Delta_i) \leq 2 \epsilon n \qquad (6)$$

at all times. Due to this guarantee, it follows that a query for the element of rank $r$ in the original stream, where $1 \leq r \leq n$, can be answered to within an $\epsilon n$ tolerance (Greenwald and Khanna, 2001).
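The quantile summary described above can be sketched in Python as follows. This is a simplified illustration in the spirit of Greenwald and Khanna (2001), not a faithful reproduction: the combine step merges adjacent tuples whenever the merged tuple still covers at most $2\epsilon n$ ranks and omits the band structure of the original algorithm, so the space guarantee of the original paper is not claimed for it. The class name `GKSummary`, the compression period and the use of $\lfloor 2\epsilon n \rfloor - 1$ for the inserted uncertainty are assumptions made for this sketch.

```python
import bisect
import math
import random


class GKSummary:
    """Simplified epsilon-approximate quantile summary of tuples [v, g, delta]."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                           # kept sorted by value v
        self.period = max(1, int(1 / (2 * eps)))   # inserts between combines

    def insert(self, v):
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        if i == 0 or i == len(self.tuples):
            delta = 0                              # new minimum or maximum: rank known
        else:
            delta = max(0, math.floor(2 * self.eps * self.n) - 1)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % self.period == 0:
            self.combine()

    def combine(self):
        threshold = math.floor(2 * self.eps * self.n)
        i = len(self.tuples) - 2
        while i >= 1:                              # never remove the minimum (index 0)
            g = self.tuples[i][1]
            nxt = self.tuples[i + 1]
            if g + nxt[1] + nxt[2] <= threshold:
                nxt[1] += g                        # successor absorbs the covered ranks
                del self.tuples[i]
            i -= 1

    def query(self, phi):
        """Return the stored value whose rank bounds lie closest to ceil(phi * n)."""
        r = math.ceil(phi * self.n)
        best, best_err, r_min = None, None, 0
        for v, g, delta in self.tuples:
            r_min += g                             # r_min(v_i) = sum of g_j for j <= i
            err = max(r - r_min, r_min + delta - r)
            if best_err is None or err < best_err:
                best, best_err = v, err
        return best


stream = list(range(1, 2001))
random.Random(0).shuffle(stream)
summary = GKSummary(eps=0.01)
for value in stream:
    summary.insert(value)
print(len(summary.tuples), summary.query(0.5))
```

Because the values in this example are the integers 1 to 2000, the returned value equals its own rank, so the rank error of a query can be read off directly and compared with the tolerance $\epsilon n = 20$.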
3.1.2 Copula summary
Given a bivariate data stream $\{(x_i, y_i)\}_{i=1}^{n}$, the structure of the proposed copula summary, formed using multiple versions of the quantile summaries explained above, is now described. It starts by maintaining an $\epsilon$-approximate quantile summary, $S^X$, for the first components of the elements in the bivariate data stream. Suppose this summary is $L$ elements long. The summary is composed from the following tuples: $(v^x_i, g_i, \Delta_i)$, for $i = 1, \ldots, L$. To accompany each of the elements in this summary are $L$ different $\epsilon$-approximate quantile (sub)summaries $S^Y_i$ of length $L_i$, for $i = 1, \ldots, L$. Here, $v^x_i$ is the first component of a data point seen in the stream so far. As described above, the parameters $g_i$ and $\Delta_i$ determine the range of quantiles that each element $v^x_i$ covers in the stream $\{x_i\}_{i=1}^{n}$. Finally, each $S^Y_i$ is a quantile summary for the second component of a selection of the data points seen in the stream so far. These points will not in general correspond to points with the first component $v^x_i$ (i.e. the coupling between the two components of each point is lost); however, this is permissible for the purposes of this paper. Each subsummary is formed of tuples $(v^y_{i,j}, g_{i,j}, \Delta_{i,j})$, for $j = 1, \ldots, L_i$, where $v^y_{i,1} \leq \ldots \leq v^y_{i,L_i}$. Once again, the parameters $g_{i,j}$ and $\Delta_{i,j}$ work in the same way as $g_i$ and $\Delta_i$ in determining ranges of quantiles. In total then, there are $L + 1$ different $\epsilon$-approximate quantile summaries stored. This data structure resembles a grid of the joint ranks of the data, and is analogous to the grid of quantiles used in Xiao (2017). The collection of summaries will henceforth be referred to as the copula summary. The following subsections describe how this copula summary can be updated as further elements join the bivariate data stream, and how it is used to answer empirical copula function queries to a particular error tolerance.
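The layout of the copula summary described above can be sketched as a nested data structure: one list of first-component tuples, each carrying its own subsummary of second-component tuples. The Python sketch below only illustrates this layout and the rank bookkeeping; it does not implement the insert or combine operations. All class and field names (`QTuple`, `CopulaEntry`, `CopulaSummary`) are hypothetical, and the example values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QTuple:
    value: float   # a data value seen in the stream
    g: int         # gap in minimum rank to the previous tuple
    delta: int     # uncertainty between minimum and maximum rank


@dataclass
class CopulaEntry:
    x: QTuple                                         # tuple in the first-component summary
    sub: List[QTuple] = field(default_factory=list)   # subsummary of second components


@dataclass
class CopulaSummary:
    entries: List[CopulaEntry] = field(default_factory=list)

    def num_summaries(self) -> int:
        """One first-component summary plus one subsummary per entry."""
        return 1 + len(self.entries)

    def num_tuples(self) -> int:
        """Total number of stored tuples across all summaries."""
        return sum(1 + len(e.sub) for e in self.entries)

    def rank_bounds(self) -> List[Tuple[int, int]]:
        """(r_min, r_max) of each first-component value, from g and delta."""
        bounds, r_min = [], 0
        for e in self.entries:
            r_min += e.x.g
            bounds.append((r_min, r_min + e.x.delta))
        return bounds


demo = CopulaSummary(entries=[
    CopulaEntry(QTuple(0.1, 1, 0), [QTuple(2.0, 1, 0)]),
    CopulaEntry(QTuple(0.4, 2, 1), [QTuple(1.5, 1, 0), QTuple(3.0, 1, 0)]),
    CopulaEntry(QTuple(0.7, 3, 0), [QTuple(0.5, 1, 0), QTuple(2.2, 2, 0)]),
    CopulaEntry(QTuple(0.9, 1, 0), [QTuple(2.5, 1, 0)]),
])
print(demo.num_summaries(), demo.num_tuples(), demo.rank_bounds())
```

Storing a subsummary per first-component tuple is what makes the total size the product of two summary lengths, which is the cost analysed later in the paper.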
3.2 Updating the copula summary
Two operations (insert and combine) are used to maintain the standard $\epsilon$-approximate quantile summaries in Greenwald and Khanna (2001) when new elements are added to the data stream, whilst guaranteeing (6). These can be modified to update the copula summary.
When a new element is added to the bivariate data stream, the insert operation adds a tuple for its first component to the quantile summary of first components, and a new subsummary accompanying that tuple is added to the copula summary. For more details on this operation see A.1.1.
After a particular number of elements have been added to the copula summary (using the insert operation described above), it is necessary to combine tuples within the summary. This keeps the copula summary succinct, so that it does not store every element in the data stream. In general, successive tuples are merged into a single tuple if the range of quantiles they jointly cover, in either the first component summary or the second component subsummaries, still satisfies the bound in (6). This operation therefore ensures that the first component summary and each of the second component subsummaries remain $\epsilon$-approximate marginal quantile summaries for the first and second components of the data stream respectively. For more details on this operation see A.1.2.
3.3 Querying the copula summary
The copula summary can be updated after new elements are added to the bivariate data stream using the operations described in the previous section. Now the following section explains how this summary can be queried at any time to return an approximation to the empirical copula function. The section sequentially describes approximations to the different components of the empirical copula in (5). First recall that is an -approximate quantile summary for the first component of the bivariate data stream. This means that one can query the summary for the -quantile of and have an approximation returned, where (Greenwald and Khanna, 2001). For full details on how to implement such a query see A.3. Denote this query, an approximation to , by .
Recall that , for , are the elements in the summary . The value is an approximation to , defined in Sec. 2.1. Next we take advantage of the fact that multiple subsummaries , for , can be merged into one -approximate quantile summary by using the methodology in Greenwald and Khanna (2004) and described in A.2. In the present work this allows one to approximate the -quantile of by querying the -approximate quantile summary . Denote this query by . Finally, one can also find an approximation to the empirical CDF that appears in the empirical copula approximation in (5) via an ‘inverse’ summary query described in Lall (2015) and A.4. Denote this inverse query on the merged -approximate summary by . Combining all of the different queries described above together, we have
as the copula summary query and the approximation to the empirical copula . This query is described in more detail in A.5. The next section provides a theoretical analysis of the error of this approximation.
4 Error and efficiency analysis
This section provides a theoretical analysis on the error and efficiency of the approximation . The bound on the error of this approximation away from (2) is now stated and proved in the following theorems.
Theorem 1 (Error bound).
Let be the empirical copula function of the bivariate stream of data evaluated at . Also suppose that is as it is defined in (9), then
The error can therefore be framed as the sum of the errors from steps (3) and (4) in Sec. 3.2.2 (A and B) and the error from step (1) in Sec. 3.2.2 (C). Each of these contributing errors is now bounded.
Theorem 2 (Error bound on (A)).
Suppose , then
This is the guaranteed error bound for inversely querying an -approximate summary, from Lall (2015). ∎
Theorem 3 (Error bound on (B)).
Suppose , then
Let . Then suppose the element returned by querying the -approximate summary for the -quantile is . Therefore . Now define
Recall from Sec. 3.3 that is simply the count of all samples in less than or equal to . For , let this count be denoted by . For , let this count be denoted by . As , we have also. Therefore
Theorem 4 (Error bound on (C)).
Suppose , then
Recall the definition of from the previous proof. Let . Recall from Sec. 2.1 that are the elements that have corresponding values with ranks less than or equal to in the original stream. We assume without loss of generality that if then , and vice-versa if . Define to be the count of all elements in that are less than or equal to , which is equivalent to . Then by the fact that is an -approximate quantile summary of , the count of all elements in that are less than or equal to , which is equivalent to , is within the interval . Therefore,
The main benefit of this algorithm is that, for streams of bivariate data acquired continuously, one can compute the approximation to the empirical copula function, with the error bounded in the theorems above, by maintaining a succinct summary of the data. It does this by storing a separate quantile subsummary for each element of a single quantile summary of the first components, all of which are $\epsilon$-approximate. From Greenwald and Khanna (2001), the worst-case length of an $\epsilon$-approximate summary constructed using the insert and combine operations discussed in Sec. 3.2.1 and 3.2.2 is $O\big(\tfrac{1}{\epsilon} \log(\epsilon n)\big)$. Therefore, for the algorithm considered in this paper, the worst-case number of tuples stored in the copula summary at any one time is $O\big(\tfrac{1}{\epsilon^2} \log^2(\epsilon n)\big)$. This is the same complexity as that obtained for the very similar data structures used for multidimensional range counting in Suri et al. (2006) and Hershberger et al. (2004). It is worth noting that, as seen in Greenwald and Khanna (2001), the memory used by a single quantile summary is much lower than this worst case in practice. In many cases, such as when one implements the combine operation after every element is added to a single quantile summary rather than after every $1/(2\epsilon)$ steps, the memory used is independent of $n$.
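To give a feel for these worst-case sizes, the short Python sketch below evaluates the order-of-magnitude expressions for a single quantile summary and for the copula summary. It assumes the known $O\big(\tfrac{1}{\epsilon}\log(\epsilon n)\big)$ worst-case length of a Greenwald and Khanna summary, drops all constant factors, and takes the logarithm to base 2; both of these choices are assumptions of the sketch, so the numbers indicate scaling only. The function names are hypothetical.

```python
import math


def single_summary_scale(eps, n):
    """Order-of-magnitude worst-case length of one summary: (1/eps) * log2(eps * n)."""
    return (1.0 / eps) * math.log2(eps * n)


def copula_summary_scale(eps, n):
    """One subsummary per element of the first-component summary: the square."""
    return single_summary_scale(eps, n) ** 2


for n in (10**6, 10**9, 10**12):
    single = single_summary_scale(0.01, n)
    total = copula_summary_scale(0.01, n)
    # Compare the worst-case tuple count with the length of the stream itself.
    print(n, round(single), round(total), total < n)
```

The worst-case scale grows only polylogarithmically in $n$, so the comparison with the stream length becomes more favourable as the stream gets longer.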
The -approximate quantile summaries utilised in this paper are uniformly accurate across all quantiles . It is possible to adjust the condition in (6) to allow for certain quantile approximations to be more accurate than others using an -approximate quantile summary (Cormode et al., 2005). Commonly the high quantiles , , , …, are of interest. These have been referred to as biased quantile approximations. One could extend this methodology to the copula summaries presented in this paper by adjusting the insert and combine operations in Sec. 3.2.1 and 3.2.2 acting on and . This is of particular relevance to the field of copulas, as one is often interested in computing the tail dependence (coefficient) between two random variables (Schmidt and Stadtmüller, 2006). Based on the analysis above, it is apparent that a direct extension of this algorithm to a higher dimension , where such a data structure uses space-memory, would be infeasible. This is noted in Hershberger et al. (2004)
for a very similar data structure. However, the next section gives an explanation of how one may model the dependence structure of high dimensional data streams by utilising these bivariate copula summaries.
5 Higher dimensional copulas
So far this paper has only discussed bivariate copulas for two streams of data. However, this section now gives a brief example of how the approximations from bivariate copula summaries can be used to construct approximations to higher dimensional copulas. It is well known that higher dimensional copulas , for , can be framed as decompositions containing sets of (conditional) bivariate pair-copulas (Aas et al., 2009; Mazo et al., 2015; Bedford and Cooke, 2002) (e.g. pair-copula construction). This corresponds to each of the components being a node in a fully connected dependence graph. For high dimensions, there are many different decompositions for the copula , and therefore often vines are a useful tool. Given a copula modelling 5 random variables, there are 240 possible decompositions. For more information on these, turn to Aas et al. (2009). Whilst these decompositions provide complexity in deriving conditional copula densities, they offer a very adaptable framework for constructing higher dimensional copulas. Denote , where , to be the (joint) distribution function for the random variables , and similarly to be the conditional distribution function of . A possible decomposition (known as the -vine) of is
Here it is common practice to make the simplifying assumption that the conditional copulas are constant over . Let , then the conditional is given by (for ),
In the same way let , and . Therefore each conditional copula density can be framed as a recursion of the expressions above as elements are removed from , until and is an unconditional bivariate copula. For example, using the framework above a possible decomposition of the copula is given by
Now suppose we have a data stream for the random variables . In the case where the bivariate copulas used in the decomposition above are empirical copulas, the unconditional bivariate copulas, e.g. , can be simply computed via (5). The conditional pair copulas, e.g. , are required to be computed using ‘pseudo-observations’ (Nagler et al., 2017). These are given by (where ),
and vice-versa for where . These integrals can be computed numerically (e.g. trapezoidal rule). From these pseudo-observations, one can compute the ‘pseudo-data’
using the empirical inverse marginal CDFs in (3). Finally the pseudo-data can be used to construct the (conditional) empirical copula via (5). All conditional empirical copulas are then recursively computed to obtain all components of the decomposition in (12); on this note let a decomposition of the higher dimensional empirical copula be given by