PhiK
Phi_K correlation analyzer library
A prescription is presented for a new and practical correlation coefficient, ϕ_K, based on several refinements to Pearson's hypothesis test of independence of two variables. The combined features of ϕ_K form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables. Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution. These are useful features when studying the correlation between variables with mixed types. Particular emphasis is placed on the proper evaluation of the statistical significance of correlations and on the interpretation of variable relationships in a contingency table, in particular in case of low-statistics samples and significant dependencies. Three practical applications are discussed. The presented algorithms are easy to use and available through a public Python library.
The calculation of correlation coefficients between paired data variables is a standard tool of analysis for every data analyst. Pearson's correlation coefficient [1] is a de facto standard in most fields, but by construction only works for interval variables. While many coefficients of association exist, each with different strengths, we have not been able to identify a correlation coefficient^{1} with Pearson-like characteristics and a sound statistical interpretation that works for interval, ordinal and categorical variable types alike.
^{1} The convention adopted here is that a correlation coefficient is bound, e.g. in the range [0, 1] or [−1, +1], and that a coefficient of association is not.
This paper describes a novel correlation coefficient, ϕ_K, with properties that – taken together – form an advantage over existing methods. Broadly, it covers three related topics typically encountered in data analysis:
Calculation of the correlation coefficient, ϕ_K, for each variable-pair of interest.
The correlation ϕ_K follows a uniform treatment for interval, ordinal and categorical variables. This is particularly useful in modern-day analysis when studying the dependencies between a set of variables with mixed types, where some variables are categorical. The values of ϕ_K are bound in the range [0, 1], with 0 for no association and +1 for complete association. By construction, the interpretation is similar to Pearson's correlation coefficient, and the two are equivalent in case of a bi-variate normal input distribution. Unlike Pearson's coefficient, which describes the average linear dependency between two variables, ϕ_K also captures non-linear relations. Finally, ϕ_K is extendable to more than two variables.
Evaluation of the statistical significance of each correlation.
The correlation ϕ_K is derived from Pearson's χ² contingency test [2], i.e. the hypothesis test of independence between two (or more) variables in a contingency table, henceforth called the factorization assumption. In a contingency table each row is a category of one variable and each column a category of a second variable; each cell holds the number of records occurring in both categories at the same time. The asymptotic approximation commonly advertised to evaluate the statistical significance of the hypothesis test, e.g. by statistics libraries such as R [3] and scipy [4], makes particular assumptions on the number of degrees of freedom and the shape of the χ² distribution. This approach is unusable for sparse data samples, which may happen for two variables with a strong correlation and for low- to medium-statistics data samples, leading to incorrect p-values. (Examples follow in Section 5.) Presented here is a robust and practical statistical prescription for the significance evaluation of the level of variable association, based on an adjustment of the χ² distribution when using the G-test statistic [5].
Insights in the correlation of each variable-pair, by studying outliers and their significances.
To help interpret any relationship found, we provide a method for the detection of significant excesses or deficits of records with respect to the expected values in a contingency table, so-called outliers, using a statistically independent evaluation of the expected frequency of records. We evaluate the significance of each outlier frequency, putting particular emphasis on the statistical uncertainty on the expected number of records and on the scenario of low-statistics data samples.
The methods presented in this work can be applied to many analysis problems. Insights into variable dependencies serve as useful input to all forms of model building, be it classification- or regression-based, such as the identification of customer groups, outlier detection for predictive maintenance or fraud analytics, and decision-making engines. More generally, they can be used to find correlations across (big) data sets, and correlations over time (in correlograms). Three use-cases are discussed: the study of numbers of insurance claims, survey responses, and clustering compatibility.
This document is organized as follows. A brief overview of existing correlation coefficients is provided in Section 2. Section 3 describes the contingency test, which serves as input for Section 4, detailing the derivation of the correlation coefficient ϕ_K. The statistical significance evaluation of the contingency test is discussed in Section 5. In Section 6 we zoom in on the interpretation of the dependency between a specific pair of variables, where the significance evaluation of outlier frequencies in a contingency table is presented. Three practical applications can be found in Section 7. Section 8 describes the implementation of the presented algorithms in publicly available Python code, before concluding in Section 9.
A correlation coefficient quantifies the level of mutual, statistical dependence between two variables. Multiple types of correlation coefficients exist in probability theory, each with its own definition and features. Some focus on linear relationships where others are sensitive to any dependency, some are robust against outliers, etc. Typically their values range from 0 to +1 or from −1 to +1, where 0 means no statistical association, +1 means the strongest possible association, and −1 the strongest possible negative relation. In general, different correlation coefficients are used to describe dependencies between interval, ordinal, and categorical variables.

This section briefly discusses existing correlation coefficients and other measures of variable association, separately for interval, ordinal, and categorical variables. In addition, several related concepts used throughout this work are presented.
An interval variable, sometimes called a continuous or real-valued variable, has well-defined intervals between its values. Examples are distance or temperature measurements. The Pearson correlation coefficient is a de facto standard to quantify the level of association between two interval variables. For a sample of size N with variables x and y, it is defined as the covariance of the two variables divided by the product of their standard deviations:
r(x, y) = ∑_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / √[ ∑_{i=1}^{N} (x_i − x̄)² · ∑_{i=1}^{N} (y_i − ȳ)² ]    (1)
where x̄ and ȳ are the sample means. Notably, r(x, y) is symmetric in x and y, and bound in the range −1 ≤ r ≤ +1.
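As a quick numerical check of Eqn. 1, a minimal sketch in Python (the function name is ours, for illustration only):

```python
import numpy as np

def pearson_r(x, y):
    """Eqn. 1: covariance of x and y over the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
```

For a perfectly linear sample this returns ±1, and the result agrees with numpy's built-in `np.corrcoef`.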
Extending this to a set of input variables, Pearson's correlation matrix, containing the r values of all variable pairs, is obtained from the covariance matrix C as:
ρ_ij = C_ij / √(C_ii C_jj)    (2)
where i and j are the indices of a variable pair.
The Pearson correlation coefficient measures the strength and direction of the linear relationship between two interval variables; a well-known limitation is therefore that non-linear dependencies are not (well) captured. In addition, r is known to be too sensitive to outlier records. Pearson's correlation coefficient, like many statistics formulas, requires interval variables as input, which can be unbinned or binned. It cannot be evaluated for categorical variables, and ordinal variables can only be used when ranked (see below).
A direct relationship exists between r and the correlation parameter ρ of a bi-variate normal distribution:

f(x, y) = 1 / (2π σ_x σ_y √(1 − ρ²)) · exp{ −[ x²/σ_x² + y²/σ_y² − 2ρxy/(σ_x σ_y) ] / [2(1 − ρ²)] }    (3)

where σ_x (σ_y) is the width of the probability distribution in x (y), and the correlation parameter ρ signifies the linear tilt between x and y. We use this relation in Section 4 to derive the correlation coefficient ϕ_K.

Another measure is the global correlation coefficient g_k [6], which is a number between zero and one obtained from the covariance matrix C that gives the highest possible correlation between variable k and the linear combination of all other variables:
g_k = √( 1 − [C_kk (C⁻¹)_kk]⁻¹ )    (4)
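A sketch of Eqn. 4, assuming an invertible covariance matrix (the function name is ours):

```python
import numpy as np

def global_correlation(cov):
    """Eqn. 4: g_k = sqrt(1 - 1 / (C_kk * (C^-1)_kk)) for every variable k."""
    cov = np.asarray(cov, dtype=float)
    inv = np.linalg.inv(cov)   # requires an invertible covariance matrix
    return np.sqrt(1.0 - 1.0 / (np.diag(cov) * np.diag(inv)))
```

For two variables with unit variances and covariance ρ this reduces to g_1 = g_2 = |ρ|.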
An ordinal variable has two or more categories with a clear ordering of these categories. For example, take the variable "level of education" with six categories: no education, elementary school graduate, high school graduate, college graduate, university graduate, and PhD. A rank correlation measures the statistical relationship between two variables that can be ordered. The rank of a variable is its index in the ordered sequence of values. For ordinal variables a numbering is assigned to the categories, e.g. 0, 1, 2, 3. Note the equidistant spacing between the categorical values.
Examples of rank correlation coefficients are Spearman's ρ [7], Kendall's τ [8], Goodman–Kruskal's λ [9, 10, 11, 12], and the polychoric correlation [13]. The definition of Spearman's ρ is simply Eqn. 1, using the ranks of x and y as inputs, essentially treating the ranks as interval variables. This makes Spearman's ρ very robust against outliers. Noteworthy, Goodman–Kruskal's λ is dependent on the order of the two input variables, resulting in an asymmetric correlation matrix.
Although ranking is regular practice, the assumption of equidistant intervals – often made implicitly – can sometimes be difficult to justify. Adding the category of “MBA” to the above example increases the distance between “PhD” and “no education”, where one could argue that this distance should be independent of the number of educational categories.
A categorical variable, sometimes called a nominal or class variable, has two or more categories with no intrinsic ordering. An example is the variable gender, with two categories: male and female. Multiple measures of association exist that quantify the mutual dependence between two (or more) categorical variables, including Pearson's χ² contingency test [2], the G-test statistic [5], mutual information [14], Fisher's exact test [15, 16], and Barnard's test [17, 18]. For an overview see Ref. [19]. These measures determine how similar the joint distribution P(x, y) is to the product of the factorized marginal distributions P(x)P(y). Each measure of association consists of a sum of contributions, one from each cell of the contingency table, and therefore does not depend on the intrinsic ordering of the cells.

Though typically limited to categorical variables, these test statistics can also be applied to interval and ordinal type variables. However, their values are not bound in the range [0, 1], and can become large. Moreover, their interpretation is often complex, as their values depend not only on the level of association, but also on the number of categories or intervals and the number of records.
Most comparable to this work is Cramér's φ [20], a correlation coefficient meant for two categorical variables, denoted here as φ_C, based on Pearson's χ² test statistic, and with values between 0 (no association) and +1 (complete association):
φ_C = √( χ² / (N · min(r − 1, k − 1)) )    (5)
where r (k) is the number of rows (columns) in the contingency table and N is the number of records. Notably, with a relatively small number of records, comparable with the number of cells, statistical fluctuations can result in large values of φ_C without strong evidence of a meaningful correlation. (An example of this follows in Fig. 4a.)
Cramér's φ_C can also be used for ordinal and binned interval variables. Fig. 1 shows φ_C for a binned bi-variate normal input distribution with correlation parameter ρ. Compared to Pearson's r, φ_C shows relatively low values for most values of ρ, and only shoots up to one for values of ρ close to one. Moreover, the value found for φ_C depends on the binning chosen per variable, as also seen in the figure. This effect makes φ_C difficult to interpret, and essentially unsuitable for interval variables.
One more alternative is the contingency coefficient C, which suffers from the disadvantage that its maximum value depends on the numbers of categories r and k, and does not reach a maximum of one. The recommendation [21] is not to use C to compare correlations in tables with variables that have different numbers of categories (i.e. when r ≠ k).
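For reference, Cramér's φ_C of Eqn. 5 can be sketched on top of scipy (the function name is ours; scipy's Yates continuity correction is switched off so the plain χ² statistic is used):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_phi(table):
    """Eqn. 5: Cramer's phi_C from an r x k contingency table of observed counts."""
    table = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, k - 1)))
```

A diagonal table (complete association) gives φ_C = 1, and a uniform table gives φ_C = 0.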
The contingency test, also called the test of variable independence, determines whether a significant relationship exists between two (or more) categorical variables. Though usually performed on two categorical variables, the test can equally be applied to ordinal and binned interval variables, and can be extended to an arbitrary number of variables. Specifically, the contingency test indicates how well the joint data distribution P(x, y) of variables x and y is described by the product of its factorized marginal distributions P(x)P(y).
Throughout this paper we employ two contingency tests, where each compares the observed frequency of each category for one variable with the expectation across the categories of the second variable:
Pearson's χ² test:
χ² = ∑_{ij} (O_ij − E_ij)² / E_ij    (6)
which is used to define the correlation coefficient ϕ_K in Section 4. Pearson's χ² test is the standard test for variable independence.
The G-test, sometimes called the log-likelihood ratio test:
G = 2 ∑_{ij} O_ij ln(O_ij / E_ij)    (7)
which is used to evaluate the significance of the contingency test in Section 5. The sum is taken over all non-empty cells.
In both formulas, O_ij (E_ij) is the observed (expected) frequency of records for row i and column j of the contingency table. The stronger the dependency between x and y, the less well modeled is their distribution P(x, y) by the factorized distribution P(x)P(y), and the larger each test statistic value.
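The two test statistics of Eqns. 6 and 7 can be sketched as follows (function names ours; scipy's `expected_freq` computes the standard row-total times column-total over N expected frequencies):

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def pearson_chi2(obs, exp):
    """Eqn. 6: Pearson's chi2 test statistic."""
    return np.sum((obs - exp) ** 2 / exp)

def g_test(obs, exp):
    """Eqn. 7: G-test statistic; the sum runs over non-empty cells only."""
    mask = obs > 0
    return 2.0 * np.sum(obs[mask] * np.log(obs[mask] / exp[mask]))

obs = np.array([[20.0, 5.0], [10.0, 15.0]])
exp = expected_freq(obs)   # row-total x column-total / N expected frequencies
```

Both agree with scipy's `chi2_contingency` (with `correction=False`, and `lambda_="log-likelihood"` for the G-test).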
Under the factorization assumption, the expected frequencies E_ij can be obtained in two ways: statistically dependent and statistically independent.
The default method of frequency estimation for row i and column j includes the observed frequency O_ij, so E_ij is statistically dependent on the observed frequency of its bin. The expected value of the two nominal variables is calculated as:
E_ij = N p_i q_j, with p_i = (1/N) ∑_l O_il and q_j = (1/N) ∑_k O_kj    (8)
where p_i (q_j) is the i-th (j-th) bin of the row-projected (column-projected) marginal probability mass function (p.m.f.) and N is the number of records. The statistical dependency between E_ij and O_ij arises as the expectation for cell (i, j) includes the cell's observation O_ij in both the sum over columns and the sum over rows, and as part of N. The formula can easily be extended to an arbitrary number of variables.
The second method of estimation of E_ij excludes O_ij, i.e. E_ij is statistically independent of the observed frequency of records for row i and column j. This estimate, known in high energy physics as the ABCD formula [22], is given by:
E_ij = B_ij C_ij / D_ij = ( ∑_{l≠j} O_il ) · ( ∑_{k≠i} O_kj ) / ∑_{k≠i, l≠j} O_kl    (9)
where by construction O_ij is not part of E_ij, which allows for an objective comparison between observed and expected frequencies per bin. This formula can also be extended to more variables, except that the denominator of Eqn. 9, which is different for each pair of indices (i, j), can easily become zero for low-statistics samples.
Note that B_ij, C_ij, and D_ij are sums of frequencies, each obeying Poisson statistics, and are statistically independent. Consequently, the statistical uncertainty on E_ij is evaluated with straightforward error propagation [23] as:
σ_{E_ij} = √[ (C_ij σ_{B_ij} / D_ij)² + (B_ij σ_{C_ij} / D_ij)² + (B_ij C_ij σ_{D_ij} / D_ij²)² ]    (10)
For an observed frequency of n records, σ_n = √n, except when n = 0, in which case we set σ_n = 1. By doing so, when B_ij or C_ij is zero, and thus E_ij = 0, this approach still results in a non-zero error on E_ij. The statistical uncertainty on the expected frequency, σ_{E_ij}, is only zero when both B_ij and C_ij are zero.
The expectation from Eqn. 9 is built from fewer records than that of Eqn. 8 and is thus slightly less accurate. Another difference is that the ABCD formula is not a true product of two (or more) factorized marginal distributions, i.e. the relative predictions for one row are not identical to those for another row, as is the case for the dependent frequency estimates.
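A sketch of the independent (ABCD) estimate of Eqns. 9 and 10 for a single cell, with the Poisson errors of the B, C and D sums set to 1 when a sum is empty, as described above (the function name is ours; the zero-denominator case is left unhandled):

```python
import numpy as np

def abcd_estimate(table, i, j):
    """Eqns. 9-10: expected frequency E_ij excluding O_ij, and its uncertainty."""
    t = np.asarray(table, dtype=float)
    b = np.delete(t[i, :], j).sum()                          # row i without column j
    c = np.delete(t[:, j], i).sum()                          # column j without row i
    d = np.delete(np.delete(t, i, axis=0), j, axis=1).sum()  # all but row i and column j
    e = b * c / d
    # Poisson errors on the sums; an empty sum gets error 1, so sigma_E stays non-zero
    sb = np.sqrt(b) if b > 0 else 1.0
    sc = np.sqrt(c) if c > 0 else 1.0
    sd = np.sqrt(d) if d > 0 else 1.0
    sigma_e = np.sqrt((c * sb / d) ** 2 + (b * sc / d) ** 2 + (b * c * sd / d**2) ** 2)
    return e, sigma_e
```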
The correlation coefficient ϕ_K is obtained by inverting the χ² contingency test statistic through the steps outlined below. Although the procedure can be extended to more variables, the method is described with two variables for simplicity.
We define the bi-variate normal distribution of Eqn. 3 with correlation parameter ρ, unit widths, centered around the origin, and in the range [−5, 5] for both variables. Using uniform binning for the two interval variables, with r rows and k columns, results in a corresponding bi-variate p.m.f. With N records, the observed frequencies, O_ij, are set equal to the probability per bin multiplied by N. The expected frequencies, E_ij, are set to the predictions from the bi-variate normal distribution with ρ = 0, with N records and the same binning. We then evaluate the χ² value of Eqn. 6.
Let us define this χ² function explicitly. First, we perform the integral of the bi-variate normal distribution over the area A_ij of bin (i, j):

F_ij(ρ) = ∫_{A_ij} f(x, y; ρ) dx dy    (11)

leading to the sum over bins:

χ²(ρ, N, r, k) = N ∑_{ij} [F_ij(ρ) − F_ij(0)]² / F_ij(0)    (12)
This χ² value explicitly ignores statistical fluctuations in observed frequencies, and is a function of the number of records N, the numbers of rows and columns, r and k, and the value of ρ.
To account for statistical noise, we introduce a sample-specific pedestal related to a simple estimate of the effective number of degrees of freedom of the bi-variate sample, n̂_dof:

n̂_dof = (r − 1)(k − 1) − n_empty    (13)

with numbers of rows and columns r and k, and where n_empty is the number of empty bins of the dependent frequency estimates of the sample. The pedestal is defined as:

χ²_ped = n̂_dof + c · √(2 n̂_dof)    (14)
The noise pedestal is configurable through the parameter c. See Section 4.4 for the impact of the noise pedestal on ϕ_K and Section 5 for a discussion on the effective number of degrees of freedom.
The maximum possible value [20] of the χ² contingency test statistic is:

χ²_max = N · min(r − 1, k − 1)    (15)

which depends only on the number of records N and the numbers of rows r and columns k, and is reached when there is a one-to-one dependency between the two variables. Specifically note that χ²_max is independent of the shape of the distribution^{2}.
^{2} Note that the G-test does not have this useful feature, making the G-test unsuitable for the calculation of ϕ_K.
We scale Eqn. 12 such that it equals χ²_ped for ρ = 0 and χ²_max for ρ = 1:

χ̃²(ρ, N, r, k) = χ²_ped + χ²(ρ, N, r, k) · (χ²_max − χ²_ped) / χ²_max    (16)

This function is symmetric under ρ → −ρ, and increases monotonically from χ²_ped to χ²_max as |ρ| goes from zero to one.
We can now perform the necessary steps to obtain the correlation coefficient ϕ_K:
In case of unbinned interval variables, apply a binning to each one. A reasonable binning is generally use-case specific. As a default setting we take 10 uniform bins per variable.
Fill the contingency table for a chosen variable pair; it contains N records and has r rows and k columns.
Evaluate the χ² value of the contingency table (Eqn. 6), and interpret it as coming from a bi-variate normal distribution without statistical fluctuations, using Eqn. 16.
If the χ² value is below the noise pedestal χ²_ped of Eqn. 14, set ρ to zero.
Else, with N, r and k fixed, invert the function of Eqn. 16, e.g. using Brent's method [24], and numerically solve for ρ in the range [0, 1].
The solution for ρ defines the correlation coefficient ϕ_K.
The procedure can be extended to more variables by using a multi-variate Gaussian instead of a bi-variate one.
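The steps above can be sketched as follows, ignoring the noise pedestal of Eqn. 14 for brevity, so that the χ² of Eqn. 12 is inverted directly; all names are ours and the square binning grid is an assumption, not the public library's implementation:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal

def bin_probabilities(rho, edges):
    """Eqn. 11: bivariate-normal probability per grid cell, via CDF
    inclusion-exclusion over the four corners of each bin."""
    mvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    x, y = np.meshgrid(edges, edges, indexing="ij")
    cdf = mvn.cdf(np.dstack([x, y]))
    return cdf[1:, 1:] - cdf[:-1, 1:] - cdf[1:, :-1] + cdf[:-1, :-1]

def chi2_bvn(rho, n, edges):
    """Eqn. 12: chi2 between the binned bivariate normal with correlation rho
    and its factorized (rho = 0) prediction, for n records."""
    f_rho = bin_probabilities(rho, edges)
    f_0 = bin_probabilities(0.0, edges)
    return n * np.sum((f_rho - f_0) ** 2 / f_0)

def phik_from_chi2(chi2_obs, n, edges):
    """Invert chi2_bvn with Brent's method to solve for rho = phi_K."""
    if chi2_obs <= 0.0:
        return 0.0
    # upper bound slightly below 1 to keep the covariance matrix non-singular
    return brentq(lambda rho: chi2_bvn(rho, n, edges) - chi2_obs, 0.0, 0.999)
```

Feeding the χ² of a perfectly binned bi-variate normal sample back into `phik_from_chi2` recovers its correlation parameter.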
In summary, we interpret the χ² value found in data as coming from a bi-variate normal distribution with a fixed amount of statistical noise and with correlation parameter ρ. Non-linear relations are captured by ϕ_K through the χ² test of variable independence. The correlation ϕ_K reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution, with uniformly binned interval variables. Unlike Cramér's φ_C, the value of ϕ_K is stable against the number of bins chosen per interval variable, making it unambiguous to interpret. (In Fig. 1, overlaying the ϕ_K values evaluated with (a)symmetric binning gives a line indistinguishable from Pearson's r.) Like Cramér's φ_C, ϕ_K is affected by statistical fluctuations, which is relevant when the number of records is comparable with the number of cells (or lower); however, unlike Cramér's φ_C, ϕ_K has a correction for the statistical noise. Note that ϕ_K is independent of the order of the two input variables, and that the procedure can be extended to more than two variables^{3}.
^{3} For more than two variables, follow the same procedure and assume a common correlation ρ for each variable pair of the multivariate normal input distribution.
All coefficients presented in Section 2 are computationally inexpensive to evaluate. The calculation of ϕ_K is computationally expensive because of the integrals of the correlated bi-variate normal distribution evaluated in Eqn. 16, but remains well-doable on any modern laptop, typically taking only a fraction of a second per calculation.
A comparison with alternative correlation coefficients based on benchmark samples is given in Fig. 2. By construction, the interpretation of ϕ_K is similar to that of Pearson's correlation coefficient, in particular for the bi-variate normal input distributions and the linear shapes, shown in the left and middle columns. Unlike Pearson, however, ϕ_K also captures non-linear relations as shown in the right column. Moreover, ϕ_K can be determined for categorical, ordinal, and interval variables alike. Note that Cramér's φ_C gives relatively low values for all samples.
When studying the dependencies of a set of variables with a mixture of types, one can now calculate the correlation matrix for all variable pairs, filled with ϕ_K values, which is a useful overview for a data analyst to have.
For illustration purposes a synthetic data set with car insurance data has been created. The data set consists of 2000 records. Each record contains 5 (correlated) variables of mixed variable types, see Table 1. These data are used throughout the paper to provide insights into the practical application of the methods introduced in this work. The ϕ_K correlation matrix measured on the car insurance data set is shown in Fig. 3.
Besides the variable-pair information available from the correlation matrix in Fig. 3a, it is also interesting to evaluate per variable the global correlation coefficient, g_k, of Eqn. 4. Strictly speaking, g_k is only defined for interval variables, as it requires a covariance matrix. Here, we set the variances of all variable types to one (anyhow undefined for categorical variables)^{4} and use the ϕ_K correlation matrix. Example global correlations measured in the car insurance data are shown in Fig. 3b. They give a tenable estimate of how well each variable can be modeled in terms of all other variables, irrespective of variable type.
^{4} Interval variables can always be re-scaled to have unit variance.

The calculation of ϕ_K contains a correction for statistical fluctuations: for any χ² value below the sample-specific noise threshold of Eqn. 14, indicating that no meaningful correlation can be determined, ϕ_K is set to 0 by construction.
The impact of the noise correction is seen in Fig. 4a, showing the absolute value of Pearson's r, Cramér's φ_C, and ϕ_K measured for 1000 synthetic data sets of only 500 records each, simulated from a bi-variate normal distribution with no correlation, and each binned in a 10x10 grid. Without the absolute-value function applied, the distribution of Pearson's r values would be centered around zero, as expected. The calculation of Cramér's φ_C results in a seemingly significant non-zero bump. This cannot be interpreted as a meaningful correlation, but results from the statistical noise contributing to each sample's χ² value.
For ϕ_K, the calculation only kicks into action when the χ² value exceeds the noise pedestal χ²_ped. The noise threshold is set such that about 50% of the simulated samples gets assigned ϕ_K = 0. The remaining samples result in a wide distribution of ϕ_K values^{5}.
^{5} Without noise correction, the ϕ_K distribution shows a similar peak as Cramér's φ_C, at value 0.5.
Fig. 4b shows ϕ_K as a function of the true correlation, where ϕ_K is obtained from the median χ² value of 1000 synthetic data sets with 500 data points each. The median gives the most representative, single synthetic data sample. In the calculation of ϕ_K three configurations for the noise pedestal of Eqn. 14 are tested: no pedestal, and two non-zero settings of the pedestal parameter. No pedestal gives ϕ_K values that overshoot the true correlation significantly at low values. The larger pedestal setting undershoots: the calculation of ϕ_K turns on too late. The other, default setting follows the true correlation most closely. The residual differences disappear for larger data samples, and we deem this acceptable for this level of statistics (with on average only 5 records per bin).
The sample-specific noise threshold depends mostly on the number of filled cells, and stabilizes for larger sample sizes. Consequently, its impact is rather limited for large samples with a meaningful, non-zero correlation, for which χ² typically lies well above the noise pedestal. For small sample sizes, as is also obvious from Fig. 4b, any correlation coefficient value should first be held up against the significance of the hypothesis test of variable independence – the topic of Section 5 – before further interpretation.
In practice, when exploring a data set for variable dependencies, the studies of correlations and their significances are equally relevant: a large correlation may be statistically insignificant, and vice versa a small correlation may be very significant.
Both Pearson's χ² test and the G-test asymptotically approach the χ² distribution [2]. For samples of a reasonable size (Cochran's rule on what defines "reasonable size" follows below), the default approach to obtain the p-value for the hypothesis test of variable independence is to integrate the χ² probability density function over all values equal to or greater than the observed test statistic value t_obs (the integral runs up to infinity, even though the contingency test has a maximum test statistic value; in practice the difference is negligible):
p = ∫_{t_obs}^{∞} f(x; n_dof) dx    (17)
with f(x; n_dof) the p.d.f. of the χ² distribution:
f(x; n_dof) = x^{n_dof/2 − 1} e^{−x/2} / ( 2^{n_dof/2} Γ(n_dof/2) )    (18)
where x ≥ 0, Γ is the gamma function, and n_dof is set to the number of degrees of freedom. The solution of this integral is expressed as the regularized gamma function. This approach holds for samples of a reasonable size, and when using the χ² test statistic or G-test.
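In code, Eqn. 17 is the survival function of the χ² distribution, e.g. with scipy (the function name is ours):

```python
from scipy.stats import chi2

def pvalue_asymptotic(t_obs, ndof):
    """Eqn. 17: upper chi2 tail beyond the observed test statistic value,
    i.e. the regularized upper incomplete gamma function."""
    return chi2.sf(t_obs, ndof)
```

For n_dof = 2 the survival function is exp(−t/2), a convenient closed-form check.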
For the independence test of n variables, the number of degrees of freedom is normally presented [25] as the difference between the number of bins and the number of model parameters:

n_dof = ∏_{m=1}^{n} R_m − [ ∑_{m=1}^{n} (R_m − 1) + 1 ]    (19)

where R_m is the number of categories of variable m. Explained using Eqn. 8, each dimension m requires R_m − 1 parameters to model its p.m.f., which is normalized to one, and the p.m.f. product is scaled to the total number of records, which requires one more parameter. For just two variables this reduces to:

n_dof = rk − [(r − 1) + (k − 1) + 1] = (r − 1)(k − 1)    (20)
In practice Eqn. 19 does not hold for many data sets, in particular for distributions with unevenly filled or unfilled bins, for example in the case of two (binned) interval variables with a strong dependency. The effective number of degrees of freedom, n_dof^eff, is often smaller than the advocated value, n_dof, and can even take on floating point values, because the number of available bins is effectively reduced.
The asymptotic approximation, Eqns. 17-18, breaks down for sparse data sets, for example for two (interval) variables with a strong correlation, and for low-statistics data sets. The literature on evaluating the quality of this approximation is extensive; for an overview see Refs. [26, 27]. Cochran's rule of thumb is that at least 80% of the expected cell frequencies is 5 counts or more, and that no expected cell frequency is less than 1 count. For a 2x2 contingency table, Cochran recommends [28, 29] that the χ² test should be used only if the expected frequency in each cell is at least 5 counts.
How to properly evaluate the p-value if the test statistic does not follow the χ² distribution and hence Eqn. 17 cannot be safely applied? A reasonable approach is to evaluate Eqn. 17 directly with Monte Carlo data sets, sampled randomly from the distribution of expected frequencies. However, this approach quickly becomes cumbersome for p-values smaller than 10⁻³, i.e. once more than 1000 simulations are needed for a decent p-value estimate, and practically impossible when at least a million simulations are needed. Given that variable dependencies can be very significant, we prefer a common approach that works for both strong and weak dependencies and both low- and high-statistics samples.
In this section we propose another option: a hybrid approach in which a limited number of Monte Carlo simulations is used to fit an analytical, empirical description of the χ² distribution. Specifically, we describe two corrections to Eqn. 17:
A procedure to evaluate the effective number of degrees of freedom for a contingency test;
A correction to Eqn. 18 for low-statistics samples, when using the G-test statistic.
We conclude the section with a prescription to evaluate the statistical significance of the hypothesis test of variable independence, and a brief overview of sampling methods to help evaluate the -value.
To obtain the effective number of degrees of freedom of any sample, we use the property of Eqn. 18 that, for a test statistic distribution obeying f(x; n_dof), to good approximation the average value of the distribution equals n_dof.
The effective number of degrees of freedom for any sample is obtained as follows:
For the two variables under study, the dependent frequency estimates form the factorized distribution most accurately describing the observed data. Using Monte Carlo sampling techniques, this distribution is used to randomly generate independent synthetic data sets with the same number of records as in the observed data set.
Optionally, sampling with fixed row and/or column totals may be chosen. A short discussion of sampling methods is held in Section 5.5.
For each synthetic data set, evaluate the G-test statistic using the statistically dependent frequency estimates, as detailed in Section 3.
The effective number of degrees of freedom, n_dof^eff, is taken as the average value of the G-test distribution of all generated Monte Carlo samples.
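A sketch of this procedure, with unconstrained multinomial sampling (one of the sampling options of Section 5.5) and names of our own choosing:

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def g_statistic(obs, exp):
    """Eqn. 7: G-test statistic, summed over non-empty cells."""
    mask = obs > 0
    return 2.0 * np.sum(obs[mask] * np.log(obs[mask] / exp[mask]))

def effective_ndof(table, n_sim=500, seed=None):
    """Average G-test value over synthetic data sets drawn from the
    dependent frequency estimates (Eqn. 8) of the observed table."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=float)
    n = int(table.sum())
    probs = expected_freq(table).ravel()
    probs = probs / probs.sum()            # factorized p.m.f. of the table
    stats = []
    for _ in range(n_sim):
        sim = rng.multinomial(n, probs).reshape(table.shape).astype(float)
        stats.append(g_statistic(sim, expected_freq(sim)))
    return float(np.mean(stats))
```

For a well-filled table the result approaches the advocated (r − 1)(k − 1); for sparse tables it drops below it, as in Fig. 6.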
Fig. 5 shows a “smiley” data set of two interval variables, consisting of two blobs and a wide parabola, which are binned into a 20x20 histogram, for which we can generate an arbitrary number of records.
The bottom two curves in Fig. 6 show n_dof^eff obtained for this sample, as a function of the number of records N in the data set, evaluated using the G-test and χ² test statistic. Using Eqn. 20, the advocated number of degrees of freedom of this sample equals 19 × 19 = 361. For both test statistics this number is only reached for very large sample sizes, and n_dof^eff drops significantly for smaller values of N, where the drop is slightly steeper for the G-test statistic. The top two curves show the same data set on top of a uniform background of 1 record per cell, ensuring that each bin is always filled, again evaluated using the G-test or χ² test statistic. Now the G-test overshoots, and the χ² test statistic happens to level out at the expected value.
To understand the behavior of under- and overshooting, realize that n_dof^eff relates directly to the distribution of dependent frequency estimates. By construction, the dependent frequency estimates of Eqn. 8 make non-zero predictions for each bin in the distribution, as long as the input data set contains at least one record per row and column. Under the assumption of variable independence, each bin in the distribution is thus expected to be filled.
First consider the bottom two curves of Fig. 6. For an uneven input distribution, for example two strongly correlated interval variables, one may expect many bins with low frequency estimates. A data set sampled randomly from a distribution with very low frequency estimates, such as the data set in Fig. 5, is likely to contain empty bins. On average, each high-statistics bin contributes one unit to the G-test or χ² test statistic, but the low-statistics bins do not obey this regime. As an example, let us focus on the empty bins. By construction their contribution to the G-test is zero. The contribution to the χ² test statistic is non-zero: ∑ E_ij, where the sum runs over all empty bins. It is clear, however, when E_ij ≪ 1, that this sum is relatively small and contributes only marginally. Taken over many randomly sampled data sets, this effect reduces the average value of the G-test or χ² test statistic distribution, and likewise decreases n_dof^eff compared with n_dof.
For the top two curves, by construction E_ij ≥ 1 for each bin, bringing them closer to the nominal χ² regime and increasing the G-test and χ² test statistics. For a discussion of the contribution of low-statistics contingency table cells to the χ² test statistic, see Ref. [30].
In summary, depending on the shape and statistics of the input data set, the effective number of degrees of freedom of a contingency table can differ from the advocated value of Eqn. 19. To be certain of the effective value to use, it is best derived as the average value of the test statistic distribution, obtained with Monte Carlo simulations of the expected frequency distribution.
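The Monte Carlo estimate of the effective number of degrees of freedom can be sketched as follows. This is a simplified illustration with numpy, using multinomial sampling under the independence hypothesis; the helper name `effective_ndof` is ours, and the library's actual implementation may differ in details.

```python
import numpy as np

rng = np.random.default_rng(42)

def effective_ndof(expected, n_sim=500):
    """Estimate the effective number of degrees of freedom as the average
    G-test value over synthetic data sets sampled multinomially from the
    expected frequencies (the independence hypothesis)."""
    probs = expected / expected.sum()
    n_records = int(round(expected.sum()))
    g_values = []
    for _ in range(n_sim):
        # draw one synthetic contingency table with only the grand total fixed
        sample = rng.multinomial(n_records, probs.ravel()).reshape(expected.shape)
        # dependent frequency estimates of the synthetic table: outer product
        # of its marginals, i.e. the expectation under independence
        exp_sim = np.outer(sample.sum(axis=1), sample.sum(axis=0)) / sample.sum()
        mask = sample > 0  # empty bins contribute zero to the G-test
        g = 2.0 * np.sum(sample[mask] * np.log(sample[mask] / exp_sim[mask]))
        g_values.append(g)
    return float(np.mean(g_values))

# toy example: an independent, uniform 10x10 table with ~5 entries per cell;
# the theoretical value (r-1)(c-1) = 81, and the effective value is close to
# it at this sample size, dropping for sparser tables
expected = np.full((10, 10), 5.0)
ndof_eff = effective_ndof(expected)
```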
Given a large enough data sample, and given the hypothesis that the observed frequencies result from a random sampling from the distribution of expected frequencies, the G-test statistic can be approximated by Pearson's χ².^{7}^{7}7The approximation is obtained with a second-order Taylor expansion of the logarithm around 1. In this scenario both the G-test and χ² value are described by the χ² distribution of Eqn. 18, with the same number of degrees of freedom, and applying either test leads to the same conclusions.
For low statistics samples – to be more specific, samples with many bins of low expected and observed frequencies – the distributions of the G-test and χ² statistics start to differ, and both distributions diverge from the nominal χ² distribution. This can be seen in Fig. 7, which uses the smiley data set of Fig. 5 as input. The simulated distribution of test statistics is wider than the χ² distribution in case of the Pearson χ² test statistic (Fig. 7a) and narrower than the χ² distribution in case of the G-test statistic (Fig. 7b). This results in p-value distributions with elevated frequencies around zero and one for the Pearson χ² test statistic (Fig. 7c) and lower frequencies near zero and one for the G-test statistic (Fig. 7d). Note that the effective number of degrees of freedom is much lower than the theoretical value; using the theoretical value in the p-value calculation results in an uneven distribution peaked towards one.
This section addresses the question whether the test statistic distribution for the contingency test can be modeled for all sample sizes, knowing that Eqn. 18 cannot be safely used for low statistics data sets. In particular we are interested in assessing the -values of large test statistic values, coming from possibly strong variable dependencies. To evaluate these correctly, it is important to properly model the high-end tail of the test statistic distribution.
We observe empirically that for low-statistics samples the G-test statistic distribution converges towards a Gaussian distribution N(x; μ, σ), with mean μ and width σ. For high-statistics samples the distribution is modeled by the theoretical χ²(x; n_dof) distribution, with n_dof degrees of freedom. Experimentally we find that, for any sample size, the G-test statistic distribution can be well described by the combined probability density function h(x):

    h(x) = f · χ²(x; n_dof) + (1 − f) · N(x; μ, σ) ,    (21)

where the parameters of χ² and N are fixed as above, and f is a floating fraction parameter between 0 and 1.
Below we use h(x) as the modified χ² p.d.f. to model the G-test statistic distribution for any data set.
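A minimal sketch of the combined p.d.f. of Eqn. 21 with scipy (parameter values below are illustrative only):

```python
import numpy as np
from scipy import stats

def h_pdf(x, f, ndof, mu, sigma):
    """Modified p.d.f. of Eqn. 21: the chi2 term dominates for high-statistics
    samples (f -> 1), the Gaussian term for low statistics (f -> 0).
    ndof, mu and sigma are fixed beforehand; f is the floating fraction."""
    return f * stats.chi2.pdf(x, ndof) + (1.0 - f) * stats.norm.pdf(x, mu, sigma)

# sanity check: h(x) integrates to one for any fraction f in [0, 1]
x = np.linspace(0.0, 300.0, 30001)
dx = x[1] - x[0]
area = h_pdf(x, 0.3, ndof=50, mu=45.0, sigma=12.0).sum() * dx
```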
Fig. 8 shows the results of binned log-likelihood fits of h(x) to two G-test statistic distributions, each with 10k entries generated with the procedure of Section 5.1, using the smiley data set with 20x20 bins, for a small (a) and a larger (b) number of records per simulated data set. Clearly, these distributions are not well modeled using χ²(x; n_dof) or N(x; μ, σ) alone. The fit of h(x) can separate the two component p.d.f.'s because the RMS of the χ² component differs from the fixed width of the Gaussian. At low statistics the distribution is dominated by the Gaussian, and at high statistics by the theoretical χ² distribution. Note that the Gaussian, when present, contributes to the core of the distribution while the χ² dominates in the tails.
Fig. 9 uses a similar setup, with 20x20 or 50x50 bins, where the fit fraction f is shown as a function of a) the number of records per simulated data set, N, and b) the average number of records per cell. The fraction rises as a function of sample size, such that h(x) turns into χ²(x; n_dof) for large enough data sets. Note that the fraction f reaches one well before the effective number of degrees of freedom reaches the advocated value in Fig. 6.
In summary, to assess the p-value for the hypothesis test of variable independence, in this work we choose to work with the G-test statistic, and not Pearson's χ², primarily because we manage to describe the G-test statistic distribution successfully for any sample size.
The statistical significance of the hypothesis test of variable independence is obtained with the following procedure:
Calculate the average number of entries per cell; based on this, set the number of synthetic data sets to generate (2000 in this work).
Follow the procedure of Section 5.1 to generate synthetic data sets based on the dependent frequency estimates of the input data set. For each synthetic data set evaluate its G-test value. Take the average of the G-test distribution to obtain the effective number of degrees of freedom.
For a low-statistics sample, obtain the fraction f by fitting the probability density function h(x) to the G-test distribution, with n_dof, μ and σ fixed. For a high-statistics sample, skip the fit and set f = 1.
With this fraction, use Eqn. 17 with h(x) as modified χ² distribution to obtain the p-value for the hypothesis test, using the G-test value from data as input.
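The steps above can be sketched end-to-end as follows. This is an illustrative implementation only: the helper name, and choices such as fixing μ and σ to the simulated mean and RMS, are our assumptions, not necessarily what the library does.

```python
import numpy as np
from scipy import stats, optimize

def significance_from_simulation(g_obs, g_sim):
    """Given the observed G-test value and G-test values of synthetic data
    sets, estimate the p-value from the modified distribution of Eqn. 21:
    h(x) = f*chi2(x; ndof) + (1-f)*N(x; mu, sigma)."""
    g_sim = np.asarray(g_sim, dtype=float)
    ndof_eff = g_sim.mean()                # effective number of degrees of freedom
    mu, sigma = g_sim.mean(), g_sim.std()  # Gaussian component, fixed from MC

    def nll(f):
        # binned fits are used in the paper; an unbinned likelihood is simpler here
        pdf = f * stats.chi2.pdf(g_sim, ndof_eff) + (1 - f) * stats.norm.pdf(g_sim, mu, sigma)
        return -np.sum(np.log(np.clip(pdf, 1e-300, None)))

    res = optimize.minimize_scalar(nll, bounds=(0.0, 1.0), method="bounded")
    f = float(res.x)
    # p-value: tail probability of h(x) beyond the observed G-test value
    p = f * stats.chi2.sf(g_obs, ndof_eff) + (1 - f) * stats.norm.sf(g_obs, mu, sigma)
    return float(p), f, float(ndof_eff)

# example: synthetic G-test values drawn from a chi2 with 50 degrees of freedom
g_sim = stats.chi2.rvs(50, size=2000, random_state=42)
p, f, ndof_eff = significance_from_simulation(g_obs=50.0, g_sim=g_sim)
```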
The p-value is converted to a normal Z-score:

    Z = Φ⁻¹(1 − p) ,    (22)

where Φ⁻¹ is the quantile (inverse of the cumulative distribution) of the standard Gaussian, i.e. Z is the significance in 1-sided Gaussian standard deviations. For example, the threshold p-value of 0.05 (95% confidence level) corresponds to Z = 1.64. When the p-value is too small to evaluate Eqn. 22 numerically – anyhow a very strong variable dependency – Z is estimated using Chernoff's bound [31] to ensure a finite value. Let z = G/n_dof; Chernoff states, when z > 1:
    P ≤ (z e^{1−z})^{n_dof/2} ,    (23)
where we safely ignore the contribution from the narrow Gaussian in h(x). This bound is converted to Z with the approximation (valid for large Z):
    Z ≈ √(u − ln u) ,  with u = −2 ln(√(2π) P) .    (24)
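A sketch of the conversion, assuming the reconstructed forms of Eqns. 22-24 above (function names are ours):

```python
import numpy as np
from scipy import stats

def z_value(p):
    """Eqn. 22: one-sided Gaussian significance, Z = Phi^-1(1 - p)."""
    return float(stats.norm.isf(p))

def z_from_chernoff(g, ndof):
    """Eqns. 23-24: when the p-value underflows numerically, bound the chi2
    tail with Chernoff's inequality, P <= (z e^{1-z})^{ndof/2}, z = g/ndof > 1,
    then convert the log of the bound via Z ~ sqrt(u - ln u),
    u = -2 ln(sqrt(2 pi) P), valid for large Z."""
    z = g / ndof
    assert z > 1.0, "Chernoff's bound applies for g > ndof"
    log_p = 0.5 * ndof * (np.log(z) + 1.0 - z)       # ln P of Eqn. 23
    u = -2.0 * (0.5 * np.log(2.0 * np.pi) + log_p)   # finite even when P underflows
    return float(np.sqrt(u - np.log(u)))
```

For moderate inputs the bound-based estimate lies close to (and just below) the exact conversion, since Chernoff's inequality slightly over-estimates the tail probability.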
The significance procedure is illustrated in Fig. 10, which shows the p-value distributions of the two G-test distributions of Fig. 8, for the low and high statistics samples. The two p-value distributions in each figure have been calculated in two ways.
Using the original χ² distribution of Eqn. 17, with the effective number of degrees of freedom. This results in the blue distributions.
Fitting each test statistic distribution with h(x) of Eqn. 21, and using that to calculate the p-values, resulting in the red distributions.
The blue distributions drop around zero and one, in particular for the low statistics sample. This is because the G-test distribution is narrower than the χ² distribution, as shown in Fig. 7. The red p-value distributions, evaluated with h(x), are uniform, as desired in both setups.
Let us apply the statistical procedure to a low-statistics data sample: a smiley data set with 100 entries, in a histogram with 20x20 bins, with its measured correlation value ϕ_K and G-test statistic value. The significance calculation is done in three consecutively more refined ways:
The asymptotic approximation: using the advocated number of degrees of freedom and the asymptotic χ² distribution of Eqn. 18;
Effective number of degrees of freedom: using the effective number of degrees of freedom and the asymptotic χ² distribution;
Modified χ² distribution: using the effective number of degrees of freedom, the modified distribution h(x), and the fitted fraction f.
In this example, between the three approaches the Z-value increases by more than 8 units! Typically, using the effective number of degrees of freedom gives the largest correction to Z, and the modified χ² distribution only gives a small correction on top of that.
The choice of 2000 synthetic data sets for the fit of h(x) is a compromise between accuracy and speed: with this number the resulting Z-value varies only marginally between runs, and is calculated in just a fraction of a second.
Based on our findings, for any sample size we recommend the p-value to be calculated with the modified χ² distribution h(x), using the effective number of degrees of freedom. If not, the p-value may over-cover for strong variable dependencies and at low statistics, resulting in a Z-value that is too small, possibly by multiple units. This is important to know, as it can lead to rather incorrect conclusions regarding the studied variable dependency.
In practice a correlation value may be small but its statistical significance can still be large, and vice versa. For this reason, when exploring a data set, the levels of correlation and significance should always be studied together.
Based on the statistically dependent frequency estimates, three sampling approaches are offered to generate synthetic data sets for testing the hypothesis of no variable association:
Multinomial sampling: with only the total number of records fixed. The hypothesis of no association is independent of the row and column variables.
Product-multinomial sampling: with the row or column totals fixed in the sampling. The hypothesis of no association is also called homogeneity of proportions. This approach is commonly used in cohort and case-control studies.
Hypergeometric sampling: both the row and column totals are fixed in the sampling. This approach is also known as Fisher's exact test. We use Patefield's algorithm [32] to generate the samples.
There is an ongoing debate about sampling design for tests of variable independence. In practice, however, most analysts are not too worried about the sampling approach, at least not in the high-statistics regime, because asymptotically the different approaches lead to the same result. The default approach used in this paper is multinomial sampling. For a discussion and further references see Ref. [33].
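The first two sampling schemes can be sketched with numpy alone (the table below is illustrative). For the hypergeometric case with both margins fixed, recent SciPy versions provide `scipy.stats.random_table`, which implements Patefield's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = np.array([[20, 10], [5, 15]])
n_total = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# expected cell probabilities under independence: outer product of marginals
probs = np.outer(row_totals, col_totals) / n_total**2

# 1) multinomial sampling: only the grand total is fixed
table_multi = rng.multinomial(n_total, probs.ravel()).reshape(observed.shape)

# 2) product-multinomial sampling: row totals fixed, columns drawn per row
table_prod = np.vstack([rng.multinomial(r, col_totals / n_total) for r in row_totals])
```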
After the evaluation of ϕ_K and its significance, the specific relationship between two variables is typically inspected. To facilitate the interpretation of any dependency found, the significance of observed excesses or deficits of records with respect to expected values in the contingency table is discussed here.
The statistical significance for each cell in the table is obtained from a hypothesis test between a background-only and a signal-plus-background hypothesis for a Poisson process. Such hypothesis tests, i.e. for the presence of new sources of (Poisson) counts on top of known “background” processes, are frequently performed in many branches of science, for example gamma ray astronomy and high energy physics, and have been discussed extensively in the literature [34].
We employ a measure of statistical significance commonly used in both fields, one that accounts for the mean background rate having a non-negligible uncertainty. The background estimate and its uncertainty have been derived from an auxiliary or side-band measurement, typically assumed to be a Poisson counting setup, as in the case of the ABCD estimate of Section 3.2. Here we use as background estimate the statistically independent frequency estimate (and related uncertainty) of Eqn. 9 (10).
The hybrid Bayesian-frequentist method from Linnemann [35] is used to evaluate the probability of the hypothesis test (p-value). Per cell, Linnemann's probability calculation requires the observed count n_o, the expected count n_e, and the uncertainty on the expectation σ_e:
    p = B(1/(1+τ); n_o, n_e τ + 1) ,    (25)

where B is the incomplete Beta function, and τ = n_e/σ_e².
We apply four corrections on top of this calculation:
The incomplete Beta function returns no number for n_o = 0, when by construction the p-value should be one.
The incomplete Beta function is undefined when σ_e = 0, in which case we simply revert to the standard Poisson distribution.
The incomplete Beta function always returns one when n_e = 0, irrespective of n_o and σ_e. The scenarios n_o = 0 and σ_e = 0 are captured by the previous two fixes. In all other cases we set n_e to a small positive value before evaluating Eqn. 25. In particular, this procedure prevents (minus) infinite significances for low statistics cells where, uncertainty-wise, these are not expected^{8}^{8}8When a row or column total is zero in Eqn. 9, the expected frequency n_e is zero, while Eqn. 10 typically gives a small but non-zero uncertainty. The small value assigned to n_e then introduces only a modest systematic shift in Z. For the purpose of outlier detection, we deem this level of systematic error acceptable..
As we combine an integer-valued measurement (namely the observed frequency) with a continuous expectation frequency and uncertainty, resulting in a continuous (combined) test statistic, we correct to Lancaster's mid-p value [36], which is the null probability of more extreme results plus only half the probability of the observed result^{9}^{9}9The standard p-value definition is: p = P(n ≥ n_o).:
    p = P(n > n_o) + ½ P(n = n_o) ,    (26)
with n the integrated-over number of cell counts. This p-value is then translated into a Z-value using Eqn. 22. When observing the expected frequency, by construction Lancaster's mid-p value (Z-value) is close to 0.5 (zero), even at low statistics. Likewise, for background-only samples the Lancaster mid-p correction centers the Z distribution around zero.
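A minimal sketch of the per-cell significance, implementing the reconstructed Eqns. 25, 26 and 22 with SciPy's regularized incomplete Beta function. The function names are ours, and only the first of the four corrections is shown; the others (e.g. σ_e = 0) are omitted.

```python
from scipy.special import betainc
from scipy.stats import norm

def p_geq(n_obs, n_exp, sigma_exp):
    """P(n >= n_obs) for a Poisson count with uncertain mean (Eqn. 25),
    following Linnemann's hybrid Bayesian-frequentist recipe."""
    if n_obs == 0:
        return 1.0                    # correction 1: observing >= 0 is certain
    tau = n_exp / sigma_exp**2
    return float(betainc(n_obs, n_exp * tau + 1.0, 1.0 / (1.0 + tau)))

def cell_z(n_obs, n_exp, sigma_exp):
    """Z-value of one contingency table cell, using Lancaster's mid-p
    (Eqn. 26): P(n > n_obs) + 0.5 * P(n = n_obs)."""
    p_mid = 0.5 * (p_geq(n_obs, n_exp, sigma_exp) + p_geq(n_obs + 1, n_exp, sigma_exp))
    return float(norm.isf(p_mid))     # Eqn. 22
```

Observing exactly the expected frequency yields a Z-value close to zero; clear excesses give positive and clear deficits negative Z-values.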
Fig. 12a shows the Z distribution from 1000 randomly generated samples of two variables obeying a uniform probability mass distribution, i.e. the samples have no variable dependency. Each sample contains only 200 records collected in a contingency table, so on average each cell contains only a few records. As can be seen from the Gaussian curve, even for such low statistics samples the Z distribution is fairly consistent with a normal distribution, albeit slightly shifted towards negative values. Fig. 12b shows a similar distribution, built from samples with a larger average number of records per contingency table cell. Clearly, with more statistics the Z distribution converges to the normal distribution relatively quickly^{10}^{10}10With an average of less than one record per bin, the Z distribution gets more distorted, and breaks up into individual peaks corresponding to 0, 1, 2, etc. observed entries per cell. The distribution peaks at negative Z values, corresponding to no observations, and the tail at negative Z gets truncated. Relevant here: the mean of the Z distribution remains close to zero, its width is (slightly less than) one, and the positive tail is similar to that of a normal distribution..
To filter out significant excesses or deficits of records over expected values in the contingency table, one simply demands |Z| to be greater than a specified value, e.g. 5 standard deviations. For two variables with a dependency, note that excesses and deficits always show up together, since the frequency estimates of Section 3.2 smooth the input distribution.
Two example contingency tables are shown in Fig. 13, one for a combination of categorical variables, and one for the combination of an interval and an ordinal variable, both based on the synthetic car insurance data of Table 1. Per cell each figure shows the evaluated Z-value. For example, black-colored cars occur significantly more often in suburbs and significantly less often down-town, and XXL-sized cars have significantly higher mileage.
In practice these turn out to be valuable plots to help interpret correlations, in particular between categorical variables. In essence, for a data sample with a dependency, the contingency table cells with large |Z| values reveal the variable dependency.
Given a set of mixed-type variables and using the methods described in this work, one can:
Find variable pairs that have (un)expected correlations;
Evaluate the statistical significance of each correlation;
Interpret the dependency between each pair of variables.
The methods presented in this work can be applied to many analysis problems, and in particular they are useful for model building purposes. Three interesting applications using the methods presented in this paper are briefly discussed below.
One interesting application is the modeling of numbers of expected insurance claims, e.g. car damage claims as a function of car type, type of residential area, mileage, age of driver, etc. – a set of variables with a mixture of types.
The aggregate loss incurred by an insurer is the total amount paid out in claims over a fixed time period: S = X_1 + … + X_N, where X_i is an individual claim amount, known as the severity, and N is the total number of claims paid out in the time period. Traditionally it is assumed that the individual claim amounts are mutually independent, and that N does not depend on the values of the claims. The total expected severity is then expressed as a product of the expected number of claims times the average claim amount: E[S] = E[N] · E[X], where each term can be estimated separately. When a vector of variables is available at the individual claim level, this information is incorporated through two independent generalized linear models (GLMs): one for the claim frequency N, and the other for the severity X. See Ref. [37] for more information.
Here we focus on the GLM for modeling the frequency of insurance claims. Suppose that claims data are available for different classes of policy holders, and that class i has N_i claims. Assume the claim frequency for each class is Poisson distributed, N_i ∼ Poisson(λ_i), where λ_i is the expectation for N_i. Let x_i be the vector of variables for class i at claim level, with the baseline convention that x_{i0} = 1. One writes:
    g(λ_i) = x_iᵀ β ,    (27)

where β is a vector of regression constants^{11}^{11}11Sometimes the ratio of claims to no claims per class of policy holders is modeled instead.. In GLM terminology g is the link function. When the frequency GLM uses a logarithmic link function, Eqn. 27 simplifies to:
    λ_i = exp(x_iᵀ β) = e^{β_0} · e^{β_1 x_{i1}} ⋯ e^{β_p x_{ip}} ,    (28)

yielding a simple rating structure which ensures that λ_i > 0. The logarithmic link function reflects the common practice that each variable alters the baseline claim rate by a multiplicative factor.
In initial GLM risk models, no relations are typically assumed between the input variables, and each variable category (or interval bin) is one-hot-encoded, x_{ij} ∈ {0, 1}, and assigned one model parameter. The number of regression parameters per variable equals its number of categories or interval bins. Note that it is common practice to merge low-statistics categories until they contain sufficient records.
Take the example variables of residential area and car type, each with multiple categories. Three classes of policy holders could be: “city, small car”, “city, SUV”, and “countryside, SUV”, where the first two share the regression parameter for “city”, and the last two the regression parameter for “SUV”. The predicted, factorized number of claims for class “city, SUV” simply reads: λ = λ_0 · e^{β_city} · e^{β_SUV}, where λ_0 = e^{β_0} is the nominal number of claims shared between all classes.
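The worked example can be written out numerically as follows; the regression constants below are hypothetical, chosen for illustration only.

```python
import numpy as np

# hypothetical regression constants (illustrative, not fitted to any data)
beta0 = np.log(0.10)                    # baseline: 0.10 claims per policy-year
beta = {"city": 0.20, "countryside": 0.0, "small": 0.0, "SUV": 0.35}

def claim_rate(area, car_type):
    """Factorized frequency of Eqn. 28: every one-hot-encoded category
    multiplies the baseline rate by its own factor exp(beta)."""
    return float(np.exp(beta0) * np.exp(beta[area]) * np.exp(beta[car_type]))

rate_city_suv = claim_rate("city", "SUV")   # = 0.10 * e^0.20 * e^0.35
```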
In a refinement modeling step, to improve the factorized estimates, cross-terms between categories of variable pairs can be added to the linear sum in the power of Eqn. 28. However, there is an exponentially large number of cross-terms to choose from. Practical modeling questions are: which are the most relevant terms to add? And can they be picked in an effective way that limits their number?
To help answer these questions, realize that the shape of Eqn. 28 and the assumption of variable independence are identical to the factorization assumption of Eqn. 8. A practical approach can then be:
Using the ϕ_K values and their significances, select the variable pairs with the strongest correlations.
The most relevant model cross-terms for each variable pair, i.e. those having the largest impact on the model's likelihood, can be identified by studying the outliers in the correlation plots of Section 6.
Cross-terms can also be included in a manner that limits the number of extra regression parameters. For example, for a given variable pair, introduce one cross-term parameter that affects only the contingency table cells with a Z-value greater than a predefined value (and one for those with a smaller Z-value). To model those outlier cells, use Eqn. 10: the cross-term for each selected cell should scale with the uncertainty on the statistically independent estimate for that cell.
When interpreting questionnaires one is often interested in finding all “unexpected” correlations between ordinal or categorical answers given to a set of survey questions (the definition of what constitutes an unexpected correlation is typically survey specific). The methods presented in this paper can help to do so:
By selecting question-pairs that have an interesting (“unexpected”) correlation and significance on the one hand;
And selecting those with relatively high |Z| values in the contingency tables of their respective answers on the other hand.
This allows one to compile a list with all answer-pairs significantly deviating from the norm of no correlation, of which the unexpected pairs are a subset.
When looking for groups of similar data records, a typical approach is to run multiple unsupervised clustering algorithms to cluster the data, and study the results. In trying to understand the compatibility in clusters created by the various algorithms, the methods presented in this work come in useful.
For each data record, store the cluster-ID assigned by each clustering algorithm. Using this information, one can now:
Calculate the correlation matrix between the various clustering algorithms;
For two specific algorithms, study where the two sets of predicted clusters overlap and deviate.
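The second step amounts to building a contingency table of the two sets of cluster IDs, which can be sketched with pandas (the cluster assignments below are hypothetical):

```python
import pandas as pd

# hypothetical cluster IDs assigned by two algorithms to the same records
df = pd.DataFrame({
    "kmeans_id": [0, 0, 1, 1, 2, 2, 0, 1],
    "dbscan_id": ["a", "a", "b", "b", "b", "c", "a", "b"],
})

# contingency table of the two labelings: cells with unexpectedly large or
# small counts show where the algorithms agree or deviate
overlap = pd.crosstab(df["kmeans_id"], df["dbscan_id"])
```

The per-cell significances of Section 6 can then be applied to this table, just as for any other variable pair.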
The correlation analyzer code is publicly available as a Python library through the PyPi server, and from GitHub at https://github.com/KaveIO/PhiK. Install it with the command:
pip install phik
The web-page https://phik.readthedocs.io contains a description of the source code, a tutorial on how to set up an analysis, and working examples of how to use and run the code.
We have presented a new correlation coefficient, ϕ_K, based on the χ² contingency test, with Pearson-like behavior and the practical feature that it applies to all variable types alike. Compared to Cramér's ϕ, the calculation of ϕ_K is stable against the binning per interval variable, making it easy to interpret, and contains a noise correction against statistical fluctuations.
The asymptotic χ² approximation breaks down for sparse and low-statistics data sets. To evaluate the statistical significance of the hypothesis test of variable independence, a hybrid approach is proposed where, using the G-test statistic, a number of Monte Carlo simulations is used to determine the effective number of degrees of freedom and to fit an analytical, empirical description of the test statistic distribution. We have also evaluated the statistical significance of outlier frequencies with respect to the factorization assumption, a helpful technique for interpreting any dependency found, e.g. between categorical variables.
Three practical use-cases are discussed, studying the numbers of insurance claims, survey responses, and clustering compatibility, but plenty of other applications exist. The methods described are easy to apply through a Python analysis library that is publicly available.
We thank our colleagues of KPMG’s Advanced Analytics & Big Data team for fruitful discussions and in particular Ralph de Wit for carefully reading the manuscript. This work is supported by KPMG Advisory N.V.