An Online Algorithm for Nonparametric Correlations

12/05/2017, by Wei Xiao, et al. (SAS)

Nonparametric correlations such as Spearman's rank correlation and Kendall's tau correlation are widely applied in scientific and engineering fields. This paper investigates the problem of computing nonparametric correlations on the fly for streaming data. Standard batch algorithms are generally too slow to handle real-world big data applications. They also require too much memory because all the data need to be stored in memory before processing. This paper proposes a novel online algorithm for computing nonparametric correlations. The algorithm has O(1) time complexity and O(1) memory cost and is quite suitable for edge devices, where only limited memory and processing power are available. You can seek a balance between speed and accuracy by changing the number of cutpoints specified in the algorithm. The online algorithm can compute the nonparametric correlations 10 to 1,000 times faster than the corresponding batch algorithm, and it can compute them based either on all past observations or on fixed-size sliding windows.


1 Introduction

Robust statistics and related methods are widely applied in a variety of fields [1, 2, 3, 4]. Nonparametric correlations such as Spearman's rank (SR) correlation and Kendall's tau (KT) correlation are commonly used robust statistics. They are often used as a replacement for the classic Pearson correlation to measure the relationship between two random variables when the data contain outliers or come from heavy-tailed distributions. Applications include estimating the correlation structure of financial returns [5], comparing diets in fish [6], and studying the relationship between summer temperature and latewood density in trees [7].

Nonparametric correlations have the following beneficial properties that the standard Pearson correlation does not possess. First, nonparametric correlations can work on incomplete data (where only ordinal information about the data is available). Second, SR and KT equal 1 whenever $Y$ is a monotonically increasing function of $X$. Third, SR and KT are more robust against outliers or heavy-tailed errors. The latter two properties are demonstrated in Figure 1. Previous works have shown that the influence function of the Pearson correlation is unbounded, whereas the influence functions of SR and KT are both bounded [8, 9]. This fact shows that the Pearson correlation lacks robustness. Furthermore, even though the Pearson correlation is the most efficient (in terms of asymptotic variance) for a normal distribution, the efficiencies of SR and KT are both above 70% for all values of the population correlation coefficient [8].

(a) $Y$ is a monotonically increasing function of $X$
(b) Data with outliers
Figure 1: Comparison of Pearson correlation, Spearman’s rank correlation, and Kendall’s tau correlation on simulated data.

One drawback of SR and KT compared with the Pearson correlation is that they require more computational time. The computation of SR and KT requires sorting (finding the ranks of) the $x$ and $y$ sequences, which is a very time-consuming step when the sample size is large. The minimum time complexity of the batch algorithms for SR and KT is $O(n \log n)$ [10], whereas the time complexity of the batch algorithm for the Pearson correlation is $O(n)$.

In practice, you sometimes want to analyze the correlation between variables in dynamic environments where the data are streaming in. Such environments include network monitoring, sensor networks, and financial analysis. A good algorithm should make it easy to incorporate new data and process the input sequence in a serial fashion. Such algorithms are called online algorithms in this paper. Online algorithms have interesting applications in various fields [11, 12, 13, 14]. A standard online algorithm exists for computing the Pearson correlation by using the idea of sufficient statistics [13]. The time complexity of this algorithm is $O(1)$ per observation, and its memory cost is also $O(1)$. However, because the Pearson correlation is not robust against outliers, it is not the desirable method for some applications. For example, suppose you collect data from a huge sensor network of a complex system and you want to analyze the correlations in order to detect highly correlated sensor pairs. Outliers in the sensor readings might occur because of noise, different temperature conditions, or failures of sensors or communication. The Pearson correlation would not be robust enough for such an analysis.
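For reference, a minimal sketch of this sufficient-statistics idea (our own illustration, not code from the paper) keeps only a handful of running sums, so each update costs constant time and memory:

```python
class OnlinePearson:
    """Online Pearson correlation via running sums (sufficient statistics)."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0    # running sums of x and y
        self.sxx = self.syy = 0.0  # running sums of x^2 and y^2
        self.sxy = 0.0             # running sum of x*y

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def correlation(self):
        if self.n < 2:
            return float("nan")
        n = self.n
        cov = self.sxy - self.sx * self.sy / n
        vx = self.sxx - self.sx ** 2 / n
        vy = self.syy - self.sy ** 2 / n
        return cov / (vx * vy) ** 0.5
```

(The naive running-sum form can lose precision on long streams; a Welford-style update is numerically more stable, but the constant-time, constant-memory structure is the same.)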

This type of analysis demonstrates the need for an online algorithm for nonparametric correlations (such as SR and KT). However, the way the Pearson correlation is computed online cannot be directly carried over to SR and KT. It cannot be carried over for SR because new data can change the ranks of all historical observations that were used to compute the correlation. For KT, on the other hand, new data need to be compared with all historical data in the computation of the correlation. In order to compute SR and KT exactly on streaming data, it is necessary to keep the entire history in memory, which is impossible because the data streams can be unbounded in length.

This paper proposes an efficient online algorithm for SR and KT. The time complexity of the algorithm is $O(1)$, and its memory cost is also $O(1)$. Although the algorithm computes SR and KT only approximately, this paper shows through extensive simulation studies and real applications that the approximation is good enough for most cases. To the best of the authors' knowledge, the algorithm developed in this work is the first online algorithm for nonparametric correlations.

2 Online Algorithms for Nonparametric Correlations

Let $(x_t, y_t)$, $t = 1, 2, \ldots$, denote the streaming inputs of two time series $X$ and $Y$. At time $t$, the Pearson correlation ($\gamma_t$), Spearman's rank correlation ($r_t$), and Kendall's tau correlation ($\tau_t$) computed based on all previous observations are defined as:

$$\gamma_t = \frac{\sum_{i=1}^{t}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{t}(x_i - \bar{x})^2 \sum_{i=1}^{t}(y_i - \bar{y})^2}}, \qquad
r_t = \frac{\sum_{i=1}^{t}(a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{t}(a_i - \bar{a})^2 \sum_{i=1}^{t}(b_i - \bar{b})^2}}, \qquad
\tau_t = \frac{P - Q}{\sqrt{(P + Q + T_x)(P + Q + T_y)}},$$

where $a_i$ is the rank of $x_i$, $b_i$ is the rank of $y_i$, $P$ is the number of concordant pairs, $Q$ is the number of discordant pairs, $T_x$ is the number of ties only in $x$, and $T_y$ is the number of ties only in $y$.
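For concreteness, the batch versions of these three quantities can be computed with SciPy (the library used as the batch baseline in Section 3); the snippet below is only an illustration of the definitions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.6 * x + 0.8 * rng.normal(size=1000)  # Pearson correlation about 0.6

print(stats.pearsonr(x, y)[0])    # Pearson correlation
print(stats.spearmanr(x, y)[0])   # Spearman's rank correlation
print(stats.kendalltau(x, y)[0])  # Kendall's tau-b (handles ties)
```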

The main idea in designing the online algorithms for nonparametric correlations is to coarsen the bivariate distribution of $(X, Y)$. The coarsened joint distribution can be represented by a small matrix. Assume that both $X$ and $Y$ are continuous random variables. You provide $m_1$ and $m_2$ distinct cutpoints in ascending order for $X$ and $Y$ respectively, where both $m_1$ and $m_2$ are nonnegative integers. The cutpoints for $X$ are denoted as $c^x_1, \ldots, c^x_{m_1}$, where $c^x_1 < \cdots < c^x_{m_1}$. Similarly, the cutpoints for $Y$ are represented as $c^y_1, \ldots, c^y_{m_2}$. Two default cutpoints are added for $X$: $c^x_0 = -\infty$ and $c^x_{m_1+1} = +\infty$ (and likewise for $Y$). The cutpoints discretize $X$ into $m_1 + 1$ ranges $(c^x_{i-1}, c^x_i]$, $i = 1, \ldots, m_1 + 1$. The same is done for $Y$, whose cutpoints discretize it into $m_2 + 1$ ranges $(c^y_{j-1}, c^y_j]$, $j = 1, \ldots, m_2 + 1$. The count matrix $M$ is then constructed, where $M_{ij}$ stores the number of observations that fall into the range $(c^x_{i-1}, c^x_i] \times (c^y_{j-1}, c^y_j]$. An example of an $M$ matrix is shown in Figure 2, where three cutpoints are chosen for $X$ and four cutpoints are chosen for $Y$. Using the count matrix has two advantages. First, instead of the entire series (which may be unbounded in length) being stored, the information is stored in a matrix of fixed size. Second, when the observations are discretized and stored in $M$, they are naturally sorted, and fast algorithms exist to quickly compute Spearman's rank correlation and Kendall's tau correlation from $M$. This paper proves that the time complexity of these algorithms is proportional to the number of cells in $M$, that is, $O(m_1 m_2)$, for both Spearman's rank correlation and Kendall's tau correlation. Because both $m_1$ and $m_2$ are fixed integers, the algorithms for both Spearman's rank correlation and Kendall's tau have $O(1)$ time complexity and $O(1)$ memory cost. This makes the implementation of these algorithms quite attractive on edge devices, where only limited memory and processing power are available. In practice, cutpoints for $X$ and $Y$ need to be chosen. One good choice of cutpoints is the equally spaced quantiles of the random variable. For example, to choose 9 cutpoints for $X$, we can use the sample quantiles of $X$ that correspond to the probabilities $0.1, 0.2, \ldots, 0.9$.
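As an illustration of the coarsening step, the sketch below (with variable names of our own choosing) selects quantile cutpoints and bins streaming pairs into the count matrix $M$ by binary search:

```python
import numpy as np

def make_cutpoints(sample, m):
    """m equally spaced quantiles of a pilot sample, used as cutpoints."""
    probs = np.arange(1, m + 1) / (m + 1)
    return np.quantile(sample, probs)

def update_count_matrix(M, cx, cy, x, y):
    """Add one observation (x, y) to the (m1+1) x (m2+1) count matrix M."""
    i = np.searchsorted(cx, x)  # row index: number of cutpoints below x (binary search)
    j = np.searchsorted(cy, y)  # column index for y
    M[i, j] += 1

# toy usage: 9 cutpoints for X and for Y, so M is 10 x 10
rng = np.random.default_rng(1)
pilot = rng.normal(size=10_000)
cx = make_cutpoints(pilot, 9)
cy = make_cutpoints(pilot, 9)
M = np.zeros((len(cx) + 1, len(cy) + 1), dtype=np.int64)
for x, y in rng.normal(size=(1000, 2)):
    update_count_matrix(M, cx, cy, x, y)
```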

Figure 2: Example of count matrix $M$ with three cutpoints for $X$ and four cutpoints for $Y$.

The preceding discussion assumes that both $X$ and $Y$ are continuous. When $X$ and $Y$ are discrete or ordinal, cutpoints can be selected so that each pair of consecutive cutpoints brackets only one level of the random variable. When both $X$ and $Y$ are discrete or ordinal, no information is lost by using $M$ to approximate the bivariate distribution of $(X, Y)$, and the algorithm's result is the exact nonparametric correlation between $X$ and $Y$.

The general online algorithm for SR and KT is given in Algorithm 1. To expedite the computation, not only is the matrix $M$ tracked, but also its row sums, its column sums, and its total count. The algorithm is also designed to compute the nonparametric correlation and return the result every $s$ new observations, where $s \ge 1$ is a user-specified parameter. When the observation index $t \bmod s$ is not equal to 0, only $M$, its row sums, its column sums, and its total count need to be updated (steps 3-6), and the nonparametric correlation does not need to be computed (step 8) in that iteration. When $t \bmod s$ is equal to 0, it is necessary both to update these quantities and to compute the nonparametric correlation. Unlike the step that computes the nonparametric correlation, the updating steps can be done very efficiently, with time complexity $O(\log m_1 + \log m_2)$ from the two binary searches.

1:  Input: $(x_t, y_t)$ (streaming observations), $c^x_1, \ldots, c^x_{m_1}$ (cutpoints for $X$), $c^y_1, \ldots, c^y_{m_2}$ (cutpoints for $Y$), update period $s$.
2:  for $t = 1$ to $n$ do
3:      Reveal the sample $(x_t, y_t)$.
4:      Compare $x_t$ with $c^x_1, \ldots, c^x_{m_1}$ and find the row index $i$ of $M$ that corresponds to the observation via a binary search.
5:      Compare $y_t$ with $c^y_1, \ldots, c^y_{m_2}$ and find the column index $j$ of $M$ that corresponds to the observation via a binary search.
6:      Increase $M_{ij}$ by 1 and update the corresponding row sum, column sum, and total count.
7:      if $t \bmod s = 0$ then
8:          Compute the nonparametric correlation based on $M$, its row sums, its column sums, and its total count, and save the result.
9:  return the saved correlation estimates
Algorithm 1: General online algorithm for nonparametric correlation

Step 8 in Algorithm 1 is described in detail for SR and KT in Algorithms 2 and 3, respectively, where the nonparametric correlations are quickly computed based on the matrix $M$, its row sums, and its column sums. It is easy to verify that the time complexity of both algorithms is linearly proportional to the number of cells in the matrix $M$.

1:  Input: the count matrix $M$, its row sums, its column sums, and the total count.
2:  # iteratively compute the rank that corresponds to each row of $M$
3:  for $k = 1$ to (number of rows of $M$) do
4:      update the cumulative row count and assign row $k$ its (average) rank
5:  # iteratively compute the rank that corresponds to each column of $M$
6:  for $k = 1$ to (number of columns of $M$) do
7:      update the cumulative column count and assign column $k$ its (average) rank
8:  Combine the row ranks, the column ranks, and the cell counts of $M$ to obtain Spearman's rank correlation; return the result.
Algorithm 2: Compute Spearman's rank correlation based on matrix $M$
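The sketch below is one way to realize the idea behind Algorithm 2 in Python (it is our reconstruction, not the paper's code): each row and column of $M$ is assigned the average rank of the observations that fall into it, and Spearman's rank correlation is obtained as the count-weighted Pearson correlation of those ranks.

```python
import numpy as np

def spearman_from_counts(M):
    """Approximate Spearman's rank correlation from a count matrix M.

    Each row (column) of M is assigned the average rank of the observations
    that fall into it; the result is the count-weighted Pearson correlation
    of those row and column ranks (mid-ranks handle the induced ties).
    """
    M = np.asarray(M, dtype=float)
    row = M.sum(axis=1)   # row sums
    col = M.sum(axis=0)   # column sums
    n = M.sum()           # total count

    # average rank of the observations in each row / column of M
    row_rank = np.cumsum(row) - row + (row + 1) / 2.0
    col_rank = np.cumsum(col) - col + (col + 1) / 2.0

    # count-weighted Pearson correlation of the ranks
    mean_r = (row * row_rank).sum() / n
    mean_c = (col * col_rank).sum() / n
    dr = row_rank - mean_r
    dc = col_rank - mean_c
    cov = (M * np.outer(dr, dc)).sum()
    var_r = (row * dr ** 2).sum()
    var_c = (col * dc ** 2).sum()
    return cov / np.sqrt(var_r * var_c)
```

Its output should agree with `scipy.stats.spearmanr` applied to the bin indices of the same observations, and it approximates the exact Spearman correlation of the raw pairs.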
1:  Input: the count matrix $M$, its row sums, its column sums, and the total count.
2:  Initialize an auxiliary cumulative-count matrix of the same size as $M$ with zeros.
3:  # compute $P$ (the number of concordant pairs)
4:  for $i = 2$ to (number of rows of $M$) do
5:      for $j = 2$ to (number of columns of $M$) do
6:          update the cumulative count of observations that fall strictly above and to the left of cell $(i, j)$
7:          add $M_{ij}$ times that cumulative count to $P$
8:  Compute $T_x$ (the number of ties only in $x$) from the row sums and the cells of $M$.
9:  Compute $T_y$ (the number of ties only in $y$) from the column sums and the cells of $M$.
10: Compute $T_{xy}$ (the number of ties in both $x$ and $y$) from the cells of $M$.
11: Compute $Q$ (the number of discordant pairs) from the total number of pairs, $P$, and the tie counts.
12: return Kendall's tau computed from $P$, $Q$, $T_x$, and $T_y$.
Algorithm 3: Compute Kendall's tau correlation based on matrix $M$
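Likewise, the sketch below computes the quantities used in Algorithm 3 directly from $M$ (our reconstruction; a plain double loop is used for clarity, whereas the paper accumulates the partial sums incrementally so that the cost stays linear in the number of cells of $M$):

```python
import numpy as np

def kendall_from_counts(M):
    """Approximate Kendall's tau-b from a count matrix M.

    P: concordant pairs, Q: discordant pairs,
    Tx: pairs tied only in x, Ty: pairs tied only in y.
    """
    M = np.asarray(M, dtype=float)
    nrow, ncol = M.shape

    P = Q = 0.0
    for i in range(nrow):
        for j in range(ncol):
            if M[i, j] == 0:
                continue
            # cells with strictly smaller row and column index are concordant
            # with cell (i, j); smaller row but larger column index: discordant
            P += M[i, j] * M[:i, :j].sum()
            Q += M[i, j] * M[:i, j + 1:].sum()

    row = M.sum(axis=1)
    col = M.sum(axis=0)
    Tx = ((row ** 2).sum() - (M ** 2).sum()) / 2.0  # tied in x but not in y
    Ty = ((col ** 2).sum() - (M ** 2).sum()) / 2.0  # tied in y but not in x

    return (P - Q) / np.sqrt((P + Q + Tx) * (P + Q + Ty))
```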

The parameters $m_1$ and $m_2$ control a tradeoff between the accuracy and efficiency of the online algorithms. The more cutpoints that are chosen for $X$ and $Y$ (larger $m_1$ and $m_2$), the more accurate the approximation of the bivariate distribution of $(X, Y)$ by $M$, and the more accurate the result that is generally achieved. However, increasing $m_1$ and $m_2$ also decreases the speed of the online algorithm. Based on the extensive numerical studies of the next section, a rule-of-thumb choice is on the order of 50 cutpoints for SR and at least 100 cutpoints for KT.

In practice, the proposed online algorithms for nonparametric correlations usually work well for the following reasons. First, as $m_1$ and $m_2$ increase to infinity, the result of the online algorithms converges to the true value. Second, for reasonably large $m_1$ and $m_2$, each cell of the matrix $M$ represents only a very local area of the distribution. Positive errors and negative errors can cancel each other out when summed together.

Algorithms 2 and 3 compute SR and KT over all past data. Frequently, you are interested in computing the statistics only over the recent past. Specifically, you might want to compute SR and KT over a sliding window of fixed size $w$. Algorithm 1 can be easily modified to handle such cases. The only change needed is to add some steps after step 6: in the added steps, $M$ is updated by first finding the row index and the column index that correspond to the observation that drops out of the window and then decreasing the corresponding cell of $M$ by 1. Here $w$ represents the size of the sliding window. The details are left to the reader.
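A minimal sketch of this sliding-window bookkeeping, under our own assumptions about the implementation (the cell index of each observation in the window is stored in a FIFO queue, so the extra memory grows with the window size $w$ rather than with the length of the stream):

```python
import numpy as np
from collections import deque

class SlidingCountMatrix:
    """Count matrix M maintained over a sliding window of size w."""

    def __init__(self, cx, cy, w):
        self.cx, self.cy, self.w = cx, cy, w
        self.M = np.zeros((len(cx) + 1, len(cy) + 1), dtype=np.int64)
        self.window = deque()  # (row, column) index of each observation in the window

    def update(self, x, y):
        i = np.searchsorted(self.cx, x)
        j = np.searchsorted(self.cy, y)
        if len(self.window) == self.w:        # evict the oldest observation
            i_old, j_old = self.window.popleft()
            self.M[i_old, j_old] -= 1
        self.M[i, j] += 1
        self.window.append((i, j))
```

Storing the bin indices (rather than the raw values) of the windowed observations is one possible design choice; either way, the correlation itself is still computed from $M$ exactly as in Algorithms 2 and 3.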

3 Simulation Studies

In the following simulation studies, equal numbers of cutpoints are always chosen for both $X$ and $Y$, and all cutpoints are chosen as equally spaced quantiles of a standard normal distribution. We implement the batch and online nonparametric correlation algorithms in Python. Users can download the package from https://github.com/wxiao0421/onlineNPCORR.git.

3.1 Simulation Study with Nonparametric Correlations Computed over All Past Observations

This simulation evaluates the online algorithms for SR and KT (computed over all past observations) by comparing them with the corresponding batch algorithms. The $x_t$, $t = 1, \ldots, n$, are generated as independent and identically distributed (iid) standard normal random variables. Let $y_t = \rho x_t + \sqrt{1 - \rho^2}\, e_t$, where the $e_t$ are iid standard normal random variables independent of the $x_t$. It is easy to verify that both $x_t$ and $y_t$ are iid standard normal with a Pearson correlation coefficient of $\rho$.
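This data-generating step can be reproduced in a few lines of NumPy (our reading of the construction described above):

```python
import numpy as np

def simulate_pairs(n, rho, seed=0):
    """Generate n iid N(0, 1) pairs (x, y) with Pearson correlation rho."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    e = rng.normal(size=n)                    # noise independent of x
    y = rho * x + np.sqrt(1.0 - rho ** 2) * e
    return x, y

x, y = simulate_pairs(100_000, rho=0.5)
```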

The results for SR are shown in Figure 3. All numbers are averaged over 10 replications, and all results in subplots (a) and (b) are based on the same parameter setting. Subplot (a) compares the run times of the batch algorithm and the online algorithms with different numbers of cutpoints. It is clear that for the online algorithms, increasing the number of cutpoints decreases the speed of the algorithm. Furthermore, as the number of observations $n$ increases, the difference between the run times of the batch algorithm and the online algorithm also increases dramatically. The SR online algorithm with 20 cutpoints (online SR (20)) takes less than 10 seconds to run the largest case shown, whereas the batch algorithm takes more than 1,000 seconds. As $n$ increases further, the batch algorithm becomes too slow to handle such cases (its time complexity is $O(n \log n)$), whereas it is easy to show, both numerically and theoretically, that the total run time of the online algorithm is proportional to $n$. Subplot (b) compares the L1 errors of the estimated Spearman's rank correlation from the online algorithm with different numbers of cutpoints. The L1 error does not seem to increase with $n$, and it generally decreases with the number of cutpoints. In all cases, the L1 error stays below 0.004. Subplot (c) compares the run times of the online algorithm (with 50 cutpoints) for two values of the update period; the larger update period yields roughly a 30-fold speed increase for all $n$. Last, the batch algorithm is implemented very efficiently in C (through the Python SciPy package), whereas the online algorithm is pure Python code. If the online algorithm were also implemented in C, you would likely see another 10- to 100-fold speed increase.

(a) Run times with different numbers of cutpoints
(b) L1 errors
(c) Run times with different update periods (online SR (50))
Figure 3: Comparison of the online algorithm and the batch algorithm for Spearman's rank correlation, where "online SR (K)" represents the online algorithm with K cutpoints for both the $X$ and $Y$ variables. All cutpoints of $X$ and $Y$ are chosen as equally spaced quantiles of a standard normal distribution.

The results for KT are shown in Figure 4. Subplot (a) compares the run times of the batch algorithm and the online algorithm with different numbers of cutpoints. The observed pattern is similar to that of SR: the run times of the online algorithm increase with the number of cutpoints. Furthermore, as the number of observations $n$ increases, the difference between the run times of the batch algorithm and the online algorithm also increases dramatically. The KT online algorithm with 100 cutpoints (online KT (100)) takes approximately 10 seconds to run the largest case shown, whereas the batch algorithm takes more than 1,000 seconds. As $n$ increases further, the batch algorithm becomes too slow to handle such cases, whereas the online algorithm can still finish the computation in a very short period of time. Subplot (b) compares the L1 errors of the estimated Kendall's tau correlation from the online algorithm with different numbers of cutpoints. The L1 error does not seem to change much with $n$, and it decreases with the number of cutpoints. In all cases in which the number of cutpoints is larger than 50, the L1 error is below 0.01. Subplot (c) compares the run times of the online algorithm with 50 cutpoints for two values of the update period; the larger update period yields a speed increase of approximately 60 to 70 times for all $n$.

(a) Run times with different numbers of cutpoints
(b) L1 errors
(c) Run times with different update periods (online KT (50))
Figure 4: Comparison of the online algorithm and the batch algorithm for Kendall's tau correlation, where "online KT (K)" represents the online algorithm with K cutpoints for both the $X$ and $Y$ variables. All cutpoints of $X$ and $Y$ are chosen as equally spaced quantiles of a standard normal distribution.

3.2 Simulation Study with Nonparametric Correlations Computed Based on Sliding Windows

This simulation study compares the batch and online algorithms for SR and KT based on sliding windows. The $x_t$, $t = 1, \ldots, n$, are generated as iid standard normal random variables, and the $y_t$ are constructed from the $x_t$ and independent iid standard normal noise as in the previous study. The sample size, the window size, and the number of cutpoints are set separately for the SR and KT experiments.

(a) Spearman's rank correlation
(b) Kendall's tau correlation
Figure 5: Comparison of the online and batch algorithms for Spearman's rank correlation and Kendall's tau correlation based on sliding windows, where "online SR (K)" and "online KT (K)" represent the online algorithms for SR and KT with K cutpoints (for both the $X$ and $Y$ variables). All cutpoints of $X$ and $Y$ are chosen as equally spaced quantiles of a standard normal distribution.

The results are shown in Figure 5. The SR online algorithm generates a very accurate estimate of Spearman's rank correlation even when the number of cutpoints is small. The KT online algorithm seems to generate a very accurate estimate of Kendall's tau correlation when the absolute value of the correlation is small, but it seems to generate a more biased estimate when the absolute value of the correlation is large. This is because when $x_t$ and $y_t$ are highly correlated, the pairs are likely to be concentrated on the diagonal of the matrix $M$. This concentration leads to a poor approximation of the bivariate distribution of $(X, Y)$ by the matrix $M$, which in turn leads to biased estimates of Kendall's tau correlation. For the KT online algorithm, we suggest keeping the number of cutpoints above 100 in order to achieve a more accurate result.

4 Application to Sensor Data Generated in an Industrial Plant

This section uses the proposed online algorithms to compute nonparametric correlations based on sensor data that were generated in an industrial plant for the 2015 Prognostics and Health Management Society Competition [15]. The data contain sensor readings from 50 plants. For each plant, the data provide readings from four standard sensors (S1-S4) and four control sensors (R1-R4). We use the sensor readings of the first component in the first plant for our experiment, where we compute nonparametric correlations based on sliding windows with a window size of 35,040 (which corresponds to a one-year window).

First, we compute the nonparametric correlation between a first pair of sensor variables. One variable contains 10 unique values, and we choose 9 cutpoints so that each unique value has its own cell in the matrix $M$. The other variable has 121 unique values, and we experiment with two methods of choosing its cutpoints. In the first method, we choose 120 cutpoints so that each unique value has its own cell in $M$. In the second method, the cutpoints are chosen by first computing sample quantiles at a grid of equally spaced probabilities and then keeping only the unique values; this leads to 19 cutpoints. The results are shown in Figure 6. We refer to the results of the online algorithm with 19 cutpoints as online SR (approximate) and online KT (approximate), respectively, because the returned nonparametric correlations only approximately equal the true values. We see a 20- to 50-fold speedup for SR and a 5- to 20-fold speedup for KT.
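The second method can be sketched as follows; the exact probability grid is an assumption on our part, since the paper states only that duplicate quantiles are dropped:

```python
import numpy as np

def quantile_cutpoints(values, n_probs=20):
    """Candidate cutpoints: equally spaced sample quantiles, duplicates removed."""
    probs = np.arange(1, n_probs) / n_probs    # e.g., 0.05, 0.10, ..., 0.95
    return np.unique(np.quantile(values, probs))
```

For a variable with many repeated values, the deduplication step can reduce the candidate quantiles to far fewer distinct cutpoints.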

(a) Spearman’s rank correlation
(b) Kendall’s tau correlation
Figure 6: Computation of nonparametric correlations between the first pair of sensor variables with fixed-size sliding windows.

Then, we compute the nonparametric correlation between a second pair of sensor variables. We choose 7 cutpoints for one variable and 11 cutpoints for the other so that each unique value of each variable has its own cell in the matrix $M$. The results are shown in Figure 7. The online algorithm returns the same result as the batch algorithm, with a 20- to 40-fold speedup.

(a) Spearman’s rank correlation
(b) Kendall’s tau correlation
Figure 7: Computation of nonparametric correlations between the second pair of sensor variables with fixed-size sliding windows.

5 Conclusion

This paper proposes a novel online algorithm for the computation of Spearman's rank correlation and Kendall's tau correlation. The algorithm has $O(1)$ time complexity and $O(1)$ memory cost, and it is quite suitable for edge devices, where only limited memory and processing power are available. By changing the number of cutpoints specified in the algorithm, users can seek a balance between speed and accuracy. The new online algorithm is very fast and can compute the correlations 10 to 1,000 times faster than the corresponding batch algorithm (the exact factor varies with the settings of the problem). The online algorithm can compute nonparametric correlations based either on all past observations or on fixed-size sliding windows.

References

  • [1] P. J. Huber, Robust Statistics. Springer, 2011.
  • [2] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, vol. 589. John Wiley & Sons, 2005.
  • [3] A. Zaman, P. J. Rousseeuw, and M. Orhan, “Econometric applications of high-breakdown robust regression techniques,” Economics Letters, vol. 71, no. 1, pp. 1–8, 2001.
  • [4] W. Xiao, H. H. Zhang, and W. Lu, “Robust regression for optimal individualized treatment rules,” arXiv preprint arXiv:1604.03648, 2016.
  • [5] O. Grothe et al., “Estimating correlation and covariance matrices by weighting of market similarity,” tech. rep., arXiv.org, 2010.
  • [6] E. S. Fritz, “Total diet comparison in fishes by Spearman rank correlation coefficients,” Copeia, pp. 210–214, 1974.
  • [7] T. Franceschini, J.-D. Bontemps, V. Perez, and J.-M. Leban, “Divergence in latewood density response of Norway spruce to temperature is not resolved by enlarged sets of climatic predictors and their non-linearities,” Agricultural and Forest Meteorology, vol. 180, pp. 132–141, 2013.
  • [8] C. Croux and C. Dehon, “Influence functions of the Spearman and Kendall correlation measures,” Statistical Methods & Applications, vol. 19, no. 4, pp. 497–515, 2010.
  • [9] S. J. Devlin, R. Gnanadesikan, and J. R. Kettenring, “Robust estimation and outlier detection with correlation coefficients,” Biometrika, vol. 62, no. 3, pp. 531–545, 1975.
  • [10] W. R. Knight, “A computer method for calculating Kendall’s tau with ungrouped data,” Journal of the American Statistical Association, vol. 61, no. 314, pp. 436–439, 1966.
  • [11] K. Crammer and Y. Singer, “Ultraconservative online algorithms for multiclass problems,” Journal of Machine Learning Research, vol. 3, pp. 951–991, 2003.
  • [12] N. C. Oza, “Online bagging and boosting,” in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 2340–2345, IEEE, 2005.
  • [13] J. Gama, Knowledge Discovery from Data Streams. CRC Press, 2010.
  • [14] W. Xiao, X. Huang, J. Silva, S. Emrani, and A. Chaudhuri, “Online robust principal component analysis with change point detection,” arXiv preprint arXiv:1702.05698, 2017.
  • [15] J. Rosca, Z. Song, N. Willard, and N. Eklund, “PHM15 Challenge Competition and Data Set: Fault Prognostics,” NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field, CA, 2015.