A new Gini correlation between quantitative and qualitative variables

09/26/2018 ∙ by Xin Dang, et al. ∙ The University of Mississippi

We propose a new Gini correlation to measure dependence between a categorical and a numerical variable. Analogous to Pearson R^2 in the ANOVA model, the Gini correlation is interpreted as the ratio of the between-group variation to the total variation, but it also characterizes independence (the Gini correlation is zero if and only if the variables are independent). Closely related to the distance correlation, the Gini correlation has a simple formulation because it takes the nature of the categorical variable into account. As a result, the proposed Gini correlation has a lower computational cost than the distance correlation and makes inference more straightforward. Simulations and real applications are conducted to demonstrate the advantages.


1 Introduction

Measuring the strength of association or dependence between two variables or two sets of variables is of vital importance in many research fields. Various correlation notions have been developed and studied [15, 20]. The widely used Pearson product-moment correlation measures linear relationships. Rank-based or copula-based correlations such as Spearman's $\rho$ [33] and Kendall's $\tau$ [16] explore monotonic relationships. Gini correlation [26, 28] is based on the covariance of one variable and the rank of the other. A symmetric version of Gini correlation is proposed by Sang, Dang and Sang (2016) [24]. Other robust correlation measures are surveyed in [6, 31] and explored in detail in [32]. Distance correlation, proposed by Székely and Rizzo (2009) [37], characterizes dependence for multivariate data. Those correlations, however, are only defined for numerical and/or ordinal variables. They cannot be applied to a categorical variable.

If both variables are nominal, Cramér's $V$ [3] and Tschuprow's $T$ [41], both based on the $\chi^2$ test statistic, can be used to measure their association. Grounded in information theory, mutual information is popular due to its easy computation for two discrete variables. However, mutual information correlation [23, 11] loses its computational attractiveness for measuring dependence between categorical and numerical variables, especially when the numerical variable is high-dimensional.

For this case, two approaches are typically used for defining association measures. The first one treats the continuous numerical variable $X$ as the response variable and the categorical variable $Y$ as the predictor. Pearson $R^2$ of the analysis of variance (ANOVA), or its counterpart in MANOVA, is then the measure of correlation between them. The second approach considers $Y$ as the response and $X$ as the explanatory variable(s). A pseudo-$R^2$ of the logistic or other generalized regression model serves as a measure of correlation [42]. If $X$ and $Y$ are independent, those correlation parameters are zero. However, the converse is not true in general: those correlations do not characterize independence. In this paper, we propose a so-called Gini distance correlation (denoted as $\rho_g$) for measuring dependence between categorical and numerical variables.

The contributions of this paper are as follows.

  • A new dependence measure between categorical and numerical variables. The proposed Gini correlation characterizes independence: the correlation is zero if and only if the variables are independent. It also has a nice interpretation as the ratio of the between-group Gini variation to the total Gini variation.

  • Limiting distributions of sample Gini correlation obtained under independence and dependence cases.

  • Extension of the distance correlation for dependence measure between categorical and numerical variables.

  • Comparison of Gini correlation and distance correlation. Compared with the distance correlation, the Gini correlation has a simpler form, leading to simpler computation and easier inference.

The remainder of the paper is organized as follows. Section 2 begins with a motivation of the proposed correlation by considering a dependence measure between a one-dimensional numerical variable and a categorical variable. The connection to the Gini mean difference leads to a natural generalization and a nice interpretation. The properties of the generalized Gini correlation are studied in Section 2.3. The relationship to the distance correlation is treated in Section 2.4 and three examples are given in Section 2.5. Section 3 is devoted to inference for the Gini correlation, where the asymptotic behavior of the sample Gini correlation is explored. In Section 4, we conduct experimental studies by simulation and real data applications to demonstrate advantages of the Gini correlation over the distance correlation. We conclude and discuss future work in Section 5.

2 Categorical Gini Correlation

2.1 Motivation

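We consider measuring the association between a numerical variable $X$ in $\mathbb{R}$ and a categorical variable $Y$. Suppose that $Y$ takes values $L_1, \dots, L_K$. Assume the categorical distribution of $Y$ is $P(Y = L_k) = p_k > 0$ and the conditional distribution of $X$ given $Y = L_k$ is $F_k$. Then the joint distribution of $(X, Y)$ is $P(X \le x, Y = L_k) = p_k F_k(x)$ and the marginal distribution of $X$ is

$$F(x) = \sum_{k=1}^{K} p_k F_k(x).$$

When the conditional distribution of $X$ given $Y$ is the same as the marginal distribution of $X$, $X$ and $Y$ are independent. In that case, we say there is no correlation between them. However, when they are dependent, i.e., $F_k \neq F$ for some $k$, we would like to measure this dependence. Intuitively, the larger the difference between the marginal distribution and the conditional distributions is, the stronger the association should be. With that consideration, a natural correlation measure should be proportional to

$$\sum_{k=1}^{K} p_k \int_{\mathbb{R}} \left(F_k(x) - F(x)\right)^2 dx, \qquad (1)$$

the expectation of the integrated squared difference between the conditional and marginal distribution functions, provided that (1) is finite.

Clearly, the corresponding correlation is non-negative, just like Pearson-type correlations. It has, however, the advantage that the correlation is zero if and only if $X$ and $Y$ are independent, while for Pearson-type correlations a zero value does not imply independence.

Next, we need to find the standardization term so that the corresponding correlation has a range of $[0, 1]$, a desired property for a dependence measure [21]. In other words, under some condition on $F$, we want to obtain the maximum of (1) among all $F_k$ and $p_k$, which can be formulated as the following optimization problem.

$$\max_{\{p_k, F_k\}} \sum_{k=1}^{K} p_k \int_{\mathbb{R}} \left(F_k(x) - F(x)\right)^2 dx \quad \text{subject to} \quad \sum_{k=1}^{K} p_k F_k(x) = F(x). \qquad (2)$$

Note that $\sum_{k} p_k \left(F_k(x) - F(x)\right)^2 = \sum_{k} p_k F_k^2(x) - F^2(x)$ for any $x$. Since $F_k$ is a cumulative distribution function, $F_k^2(x) \le F_k(x)$, and we have

$$\sum_{k=1}^{K} p_k \int_{\mathbb{R}} \left(F_k(x) - F(x)\right)^2 dx \le \int_{\mathbb{R}} \left(F(x) - F^2(x)\right) dx = \int_{\mathbb{R}} F(x)\left(1 - F(x)\right) dx.$$

The equality holds if and only if each $F_k$ is a single point mass distribution. In that case, $F$ is a discrete distribution with at most $K$ distinct values almost surely. Assuming that $0 < \int_{\mathbb{R}} F(x)(1 - F(x))\,dx < \infty$, we propose the correlation between $X$ and $Y$ as

$$\rho_g(X, Y) = \frac{\sum_{k=1}^{K} p_k \int_{\mathbb{R}} \left(F_k(x) - F(x)\right)^2 dx}{\int_{\mathbb{R}} F(x)\left(1 - F(x)\right) dx}. \qquad (3)$$

From the discussion above, we have the following immediate results.

  1. $0 \le \rho_g(X, Y) \le 1$.

  2. $\rho_g(X, Y) = 0$ if and only if $X$ and $Y$ are independent.

  3. $\rho_g(X, Y) = 1$ if and only if each $F_k$ is a single point mass distribution.

The assumption $\int_{\mathbb{R}} F(x)(1 - F(x))\,dx > 0$ implies that $F$ is not a point mass distribution and hence $X$ is non-degenerate. The assumption $\int_{\mathbb{R}} F(x)(1 - F(x))\,dx < \infty$ means $E|X| < \infty$, which we will see in the next subsection. Further, $\rho_g(X, Y)$ can be written as

$$\rho_g(X, Y) = \frac{\int_{\mathbb{R}} F(x)\left(1 - F(x)\right) dx - \sum_{k=1}^{K} p_k \int_{\mathbb{R}} F_k(x)\left(1 - F_k(x)\right) dx}{\int_{\mathbb{R}} F(x)\left(1 - F(x)\right) dx}. \qquad (4)$$

This formulation provides a Gini mean difference representation of the proposed correlation.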

2.2 Gini distance representation

Gini mean difference (GMD) was introduced as an alternative measure of variability to the usual standard deviation ([12], [5], [43]). Let $X_1$ and $X_2$ be independent random variables from a distribution $F$ with finite first moment in $\mathbb{R}$. The GMD of $F$ is

$$\Delta = \Delta(F) = E|X_1 - X_2|, \qquad (5)$$

the expected distance between two independent random variables. Dorfman (1979) [7] proved that for non-negative random variables,

$$\Delta = 2\int F(x)\left(1 - F(x)\right) dx. \qquad (6)$$

The proof can be easily extended to any random variable with $E|X| < \infty$. Note that (6) also holds for discrete random variables. Hence, we can write the correlation of (4) as

$$\rho_g(X, Y) = \frac{\Delta - \sum_{k=1}^{K} p_k \Delta_k}{\Delta}, \qquad (7)$$

where $\Delta$ is the Gini mean difference (GMD) of $F$ and $\Delta_k$ is the GMD of $F_k$. We call it the Gini correlation and denote it as $\rho_g(X, Y)$ or $\mathrm{gCor}(X, Y)$.
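As an illustration, the following minimal sketch evaluates the Gini correlation for small finite distributions in two ways: through Gini mean differences, assuming the form (total GMD minus weighted within-group GMD) divided by total GMD, and through integrated squared differences of distribution functions. It also checks Dorfman's identity numerically. The dictionary representation `{value: probability}` and all function names are assumptions made for this example only.

```python
def gmd(dist):
    """Gini mean difference E|X1 - X2| of a finite distribution {value: prob}."""
    return sum(p * q * abs(x - y) for x, p in dist.items() for y, q in dist.items())

def mixture(weights, conds):
    """Marginal distribution F = sum_k p_k F_k as a {value: prob} dict."""
    mix = {}
    for w, d in zip(weights, conds):
        for x, p in d.items():
            mix[x] = mix.get(x, 0.0) + w * p
    return mix

def cdf(dist, t):
    """Distribution function of a finite distribution evaluated at t."""
    return sum(p for x, p in dist.items() if x <= t)

def step_integral(fn, grid):
    """Integrate a function that is constant on [grid[i], grid[i+1]) and zero elsewhere."""
    return sum(fn(a) * (b - a) for a, b in zip(grid, grid[1:]))

def gini_corr_gmd(weights, conds):
    """Gini correlation as (total GMD - weighted within-group GMD) / total GMD."""
    total = gmd(mixture(weights, conds))
    within = sum(w * gmd(d) for w, d in zip(weights, conds))
    return (total - within) / total

def gini_corr_cdf(weights, conds):
    """Gini correlation from integrated squared differences of distribution functions."""
    mix = mixture(weights, conds)
    grid = sorted(mix)  # all atoms; the integrands vanish outside [min, max]
    num = sum(
        w * step_integral(lambda t, d=d: (cdf(d, t) - cdf(mix, t)) ** 2, grid)
        for w, d in zip(weights, conds)
    )
    den = step_integral(lambda t: cdf(mix, t) * (1 - cdf(mix, t)), grid)
    return num / den

# Two equally likely groups: uniform on {0, 1} and uniform on {1, 2}.
weights = [0.5, 0.5]
conds = [{0: 0.5, 1: 0.5}, {1: 0.5, 2: 0.5}]
mix = mixture(weights, conds)
dorfman = 2 * step_integral(lambda t: cdf(mix, t) * (1 - cdf(mix, t)), sorted(mix))

print("GMD of F:", gmd(mix), "Dorfman form:", dorfman)
print("Gini correlation:", gini_corr_gmd(weights, conds), gini_corr_cdf(weights, conds))
```

Both routes should agree up to floating-point error. Two distinct point masses give a correlation of 1, and identical conditional distributions give 0.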

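The representation of (7) allows another inspiring interpretation. $\sum_{k} p_k \Delta_k$, the weighted average of Gini mean differences, is a measure of within-group variation, and $\Delta - \sum_{k} p_k \Delta_k$ is the corresponding between-group variation. The proposed correlation is the ratio of the between-group Gini variation to the total Gini variation, analogous to the Pearson correlation in ANOVA (Analysis of Variance). The squared Pearson correlation is defined to be the ratio of the between variance to the total variance. Denote $\mu_k$, $\sigma_k^2$ and $\mu$, $\sigma^2$ as the mean and variance of $F_k$ and $F$, respectively. The variance of $X$ can be partitioned into the within variation and the between variation as below,

$$\sigma^2 = \sum_{k=1}^{K} p_k \sigma_k^2 + \sum_{k=1}^{K} p_k (\mu_k - \mu)^2.$$

And the squared Pearson correlation, denoted as $\rho_p^2$, is

$$\rho_p^2 = \frac{\sum_{k=1}^{K} p_k (\mu_k - \mu)^2}{\sigma^2} = \frac{\sigma^2 - \sum_{k=1}^{K} p_k \sigma_k^2}{\sigma^2}.$$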

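Let $(X_1, X_2)$, $(X_1^{(k)}, X_2^{(k)})$ and $(X_1^{(l)}, X_2^{(l)})$ be independent pairs of variables drawn independently from $F$, $F_k$ and $F_l$, respectively. It is easy to derive that

$$\Delta = \sum_{k=1}^{K} p_k^2 \Delta_k + 2 \sum_{1 \le k < l \le K} p_k p_l \Delta_{kl}, \qquad (8)$$

where $\Delta_k = E|X_1^{(k)} - X_2^{(k)}|$ and $\Delta_{kl} = E|X_1^{(k)} - X_1^{(l)}|$. Then the between Gini variation, denoted as the Gini distance covariance between $X$ and $Y$, is

$$\mathrm{gCov}(X, Y) = \Delta - \sum_{k=1}^{K} p_k \Delta_k = \sum_{1 \le k < l \le K} p_k p_l \left(2\Delta_{kl} - \Delta_k - \Delta_l\right), \qquad (9)$$

and the Gini distance correlation between $X$ and $Y$ is

$$\rho_g(X, Y) = \frac{\mathrm{gCov}(X, Y)}{\Delta}. \qquad (10)$$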

The total Gini variation is partitioned into the within and the between Gini variation. The proposed Gini correlation is the ratio of the between variation to the total variation. Frick et al. (2006) [10] consider another decomposition of the Gini variation, which is represented by four components, i.e., within Gini variation, between Gini variation among group means, and two effects of overlapping among groups. Although the extra terms provide some insight into the extent of group intertwining, their decomposition is complicated. Our representation of the total Gini variation is not only simple and easy to interpret, but also extends naturally to the multivariate case.
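To make the two decompositions concrete, here is a small worked example under an assumed toy setup (not taken from the paper): two equally likely groups, with the first conditional distribution uniform on {0, 1} and the second uniform on {1, 2}. It uses the Gini correlation in the form (total Gini variation minus within-group variation) over total, next to the usual variance partition.

```latex
% Assumed toy setup: p_1 = p_2 = 1/2, F_1 uniform on {0,1}, F_2 uniform on {1,2},
% so the marginal F puts mass 1/4, 1/2, 1/4 on the points 0, 1, 2.
\begin{align*}
\Delta &= E|X_1 - X_2|
        = 2\left(\tfrac14\cdot\tfrac12\cdot 1 + \tfrac14\cdot\tfrac14\cdot 2
          + \tfrac12\cdot\tfrac14\cdot 1\right) = \tfrac34, \\
\Delta_1 &= \Delta_2 = 2\cdot\tfrac12\cdot\tfrac12\cdot 1 = \tfrac12, \\
\rho_g &= \frac{\Delta - \sum_k p_k \Delta_k}{\Delta}
        = \frac{3/4 - 1/2}{3/4} = \tfrac13, \\
\sigma^2 &= \tfrac12, \qquad \sigma_1^2 = \sigma_2^2 = \tfrac14, \qquad
  \sum_k p_k(\mu_k - \mu)^2 = \tfrac14, \\
\rho_p^2 &= \frac{1/4}{1/2} = \tfrac12.
\end{align*}
```

In this toy case the between-group share of the Gini variation (1/3) is smaller than the between-group share of the variance (1/2).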

2.3 Generalized Gini Correlations

There are two multivariate generalizations of the Gini mean difference. One is the Gini covariance matrix proposed by Dang, Sang and Weatherall (2016) [4]. Along this line, one may extend the Gini correlation based on an analog of Wilks' lambda or the Hotelling-Lawley trace in MANOVA. That is left for future work. Here we explore another generalization defined in [17]. That is, the Gini mean difference of a distribution $F$ in $\mathbb{R}^d$ is

$$\Delta = E\|X_1 - X_2\|,$$

or even more generally, for some $\alpha$,

$$\Delta(\alpha) = E\|X_1 - X_2\|^{\alpha}, \qquad (11)$$

where $\|x\|$ is the Euclidean norm of $x$. With this generalized multivariate Gini mean difference (11), we can define the Gini correlation in (4) as follows.

Definition 2.1

For a non-degenerate random vector $X$ in $\mathbb{R}^d$ and a categorical variable $Y$, if $E\|X\|^{\alpha} < \infty$ for $\alpha \in (0, 2)$, the Gini correlation of $X$ and $Y$ is defined as

$$\rho_g(X, Y; \alpha) = \frac{\Delta(\alpha) - \sum_{k=1}^{K} p_k \Delta_k(\alpha)}{\Delta(\alpha)}, \qquad (12)$$

where $\Delta(\alpha)$ and $\Delta_k(\alpha)$ are the generalized Gini differences of the distributions $F$ and $F_k$, respectively.

Remark 2.1

Note that a small $\alpha$ requires only the weak moment assumption $E\|X\|^{\alpha} < \infty$ on the distributions, which allows applications of the Gini correlation to heavy-tailed distributions.

Remark 2.2

The requirement of $\alpha \in (0, 2)$ is for desired properties of the Gini correlation.

The next theorem states the properties of the proposed Gini correlation.

Theorem 2.1

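For a categorical variable $Y$ and a continuous random vector $X$ in $\mathbb{R}^d$ with $E\|X\|^{\alpha} < \infty$ for $\alpha \in (0, 2)$, $\rho_g(X, Y; \alpha)$ has the following properties.

  1. $0 \le \rho_g(X, Y; \alpha) \le 1$.

  2. $\rho_g(X, Y; \alpha) = 0$ if and only if $X$ and $Y$ are independent.

  3. $\rho_g(X, Y; \alpha) = 1$ if and only if $F_k$ is a single point mass distribution for $k = 1, \dots, K$.

  4. $\rho_g(bOX + c, Y; \alpha) = \rho_g(X, Y; \alpha)$ for any orthonormal matrix $O$, nonzero constant $b$ and vector $c$.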

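Properties 3 and 4 immediately follow from the definition. First of all, $\Delta_k(\alpha) \ge 0$, so we have $\rho_g(X, Y; \alpha) \le 1$. It is obvious that $\rho_g(X, Y; \alpha) = 1$ if and only if $\Delta_k(\alpha) = 0$ for each $k$, which is equivalent to $F_k$ being a singleton distribution. The orthogonal invariance in Property 4 results from the Euclidean distance used in the Gini correlation: it remains invariant under rotation, translation and homogeneous scale change. The remaining part of the proof has two steps. In Step 1, let $(X_1, X_2)$ and $(X_1^{(k)}, X_2^{(k)})$ be independent pairs from $F$ and $F_k$, respectively. We can write

$$\Delta(\alpha) - \sum_{k=1}^{K} p_k \Delta_k(\alpha) = \sum_{k=1}^{K} p_k T(F_k, F; \alpha), \qquad (13)$$

where $T(F_k, F; \alpha) = 2E\|X_1^{(k)} - X_1\|^{\alpha} - E\|X_1^{(k)} - X_2^{(k)}\|^{\alpha} - E\|X_1 - X_2\|^{\alpha}$. This is because

$$\sum_{k=1}^{K} p_k E\|X_1^{(k)} - X_1\|^{\alpha} = E\|X_1 - X_2\|^{\alpha} = \Delta(\alpha).$$

In Step 2, one recognizes that $T(F_k, F; \alpha)$ is the energy distance between $F_k$ and $F$ defined in [38]. Applying Proposition 2 of [38], for $\alpha \in (0, 2)$, we have

$$T(F_k, F; \alpha) = c(d, \alpha) \int_{\mathbb{R}^d} \frac{|\psi_k(t) - \psi(t)|^2}{\|t\|^{d + \alpha}}\, dt, \qquad (14)$$

where $\psi_k$ and $\psi$ are the characteristic functions of $F_k$ and $F$, respectively, and $c(d, \alpha)$ is a constant depending only on $d$ and $\alpha$, i.e.,

$$c(d, \alpha) = \frac{\alpha 2^{\alpha} \Gamma((d + \alpha)/2)}{2 \pi^{d/2} \Gamma(1 - \alpha/2)}.$$

Results of (13) and (14) show that for all $k$, we have $T(F_k, F; \alpha) \ge 0$ and hence $\rho_g(X, Y; \alpha) \ge 0$, with equality to zero if and only if $F_k$ and $F$ are identical for all $k$.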
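The invariance in Property 4 can be checked numerically. The sketch below computes the generalized Gini correlation exactly for finite distributions in the plane, assuming the form (total generalized GMD minus weighted within-group generalized GMD) over total, and compares the value before and after a rotation by 90 degrees, a scale change and a shift. The data layout (a list of `(points, probs)` pairs) and the helper names are assumptions made for this example.

```python
import math

def gen_gmd(points, probs, alpha=1.0):
    """Generalized Gini mean difference E||X1 - X2||^alpha of a finite distribution."""
    return sum(
        p * q * math.dist(x, y) ** alpha
        for x, p in zip(points, probs)
        for y, q in zip(points, probs)
    )

def gini_corr(weights, groups, alpha=1.0):
    """Gini correlation for groups given as (points, probs) pairs with weights p_k."""
    # The marginal distribution pools all atoms with probabilities p_k * prob.
    pts = [x for g_pts, _ in groups for x in g_pts]
    prs = [w * p for w, (_, g_prs) in zip(weights, groups) for p in g_prs]
    total = gen_gmd(pts, prs, alpha)
    within = sum(w * gen_gmd(g_pts, g_prs, alpha) for w, (g_pts, g_prs) in zip(weights, groups))
    return (total - within) / total

def transform(x, b=2.0, c=(1.0, -1.0)):
    """Rotate by 90 degrees (an orthonormal map), scale by b, then shift by c."""
    return (b * (-x[1]) + c[0], b * x[0] + c[1])

weights = [0.3, 0.7]
groups = [
    ([(0.0, 0.0), (1.0, 0.0)], [0.5, 0.5]),
    ([(0.0, 1.0), (2.0, 2.0)], [0.5, 0.5]),
]
moved = [([transform(x) for x in pts], prs) for pts, prs in groups]

for alpha in (0.5, 1.0, 1.5):
    print(alpha, gini_corr(weights, groups, alpha), gini_corr(weights, moved, alpha))
```

For each value of `alpha` the two printed correlations should coincide up to rounding, since every pairwise distance is multiplied by the same factor in the numerator and the denominator.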

Remark 2.3

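$T(F_k, F; \alpha)$ is the energy distance of $F_k$ and $F$, which is the weighted $L_2$ distance between the characteristic functions of $F_k$ and $F$. For $d = 1$ and $\alpha = 1$, $T(F_k, F; \alpha)$ is also the $L_2$ distance between the distribution functions $F_k$ and $F$ multiplied by a constant. However, such a relationship does not hold for $d > 1$.

Remark 2.4

The Gini covariance of $X$ and $Y$ is the weighted average of the energy distances between $F_k$ and $F$. It is also a linear combination of the energy distances between $F_k$ and $F_l$ for $k < l$. That is, $\mathrm{gCov}(X, Y; \alpha) = \sum_{1 \le k < l \le K} p_k p_l T(F_k, F_l; \alpha)$.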
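The two representations of the Gini covariance in terms of energy distances can be verified on a small example. The following sketch, for one-dimensional finite distributions with exponent 1, computes the covariance directly as total minus within Gini variation, as a weighted average of energy distances to the marginal, and as a pairwise combination with weights equal to the product of the two group probabilities (a weighting that follows from expanding the total Gini mean difference over pairs of groups). Names and the toy distributions are assumptions made for illustration.

```python
from itertools import combinations

def mean_abs(d1, d2):
    """E|U - V| for independent U ~ d1 and V ~ d2, both given as {value: prob}."""
    return sum(p * q * abs(x - y) for x, p in d1.items() for y, q in d2.items())

def energy(d1, d2):
    """Energy distance 2E|U - V| - E|U - U'| - E|V - V'| with exponent 1."""
    return 2 * mean_abs(d1, d2) - mean_abs(d1, d1) - mean_abs(d2, d2)

def mixture(weights, conds):
    """Marginal distribution sum_k p_k F_k as a {value: prob} dict."""
    mix = {}
    for w, d in zip(weights, conds):
        for x, p in d.items():
            mix[x] = mix.get(x, 0.0) + w * p
    return mix

weights = [0.2, 0.3, 0.5]
conds = [{0: 0.5, 1: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]
mix = mixture(weights, conds)

# (i) total Gini variation minus within-group Gini variation
gcov_direct = mean_abs(mix, mix) - sum(w * mean_abs(d, d) for w, d in zip(weights, conds))
# (ii) weighted average of energy distances between each F_k and the marginal F
gcov_marginal = sum(w * energy(d, mix) for w, d in zip(weights, conds))
# (iii) pairwise combination of energy distances between F_k and F_l, k < l
gcov_pairwise = sum(
    weights[k] * weights[l] * energy(conds[k], conds[l])
    for k, l in combinations(range(len(conds)), 2)
)

print(gcov_direct, gcov_marginal, gcov_pairwise)
```

The three printed numbers should agree up to floating-point error.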

Particularly for $K = 2$, the between variation $\Delta(\alpha) - \sum_{k} p_k \Delta_k(\alpha)$ simplifies to

$$p_1 p_2\, T(F_1, F_2; \alpha),$$

which is proportional to $T(F_1, F_2; \alpha)$, the energy distance used in [38, 39]. Székely and Rizzo [34] considered the special case $\alpha = 1$ of the energy distance and proposed a test for the equality of two distributions $F_1$ and $F_2$, which is also studied in [1]. The test is equivalent to testing $\rho_g = 0$. The test of $\rho_g = 0$ is also used for the $K$-sample problem. In that case, it is equivalent to the test of DISCO (DIStance COmponent) analysis in [22]. The test statistic in DISCO takes the ratio of the between and the within group Gini variations for the $K$-sample problem. Testing $\rho_g = 0$ is equivalent to their one-way DISCO analysis. What we contribute to the dependence test is that our test is able to provide power analysis for a particular alternative, which is specified as $\rho_g = \rho_0$ where $\rho_0 > 0$. Also, we can have a test which controls the Type-II error rather than the Type-I error.

2.4 Connection to Distance Correlation

The proposed Gini correlation is closely related to but different from the distance correlation studied by Székely, Rizzo and Bakirov (2007) [36] and Székely and Rizzo (2009) [37]. Their distance correlation considers correlation between two sets of continuous random variables. Later the distance covariance and distance correlation were extended from Euclidean space to general metric spaces by Lyons (2013) [19]. Based on that idea, we define the discrete metric

$$d(y, y') = I(y \neq y'),$$

where $I(\cdot)$ is the indicator function. Equipped with this set difference metric on the support of $Y$ and the Euclidean distance on the support of $X$, the corresponding distance covariance and distance correlation for numerical and categorical variables are as follows.

(15)
(16)
Remark 2.5

As expected, , where are i.i.d.

The proofs of this identity in Remark 2.5 along with (16) are given in Appendix. Comparing (15) with (13) and (14), it is easy to make the following conclusions.

Remark 2.6

.

Remark 2.7

. They are equal if and only if and are independent with both being zero.

Remark 2.8

When , .

Remark 2.9

For , and .

Remark 2.10

For the case of , is studied in [8] and

(17)

Comparison of Remark 2.10 and (1) explains the difference between our Gini approach and the distance correlation approach in the one-dimensional case. The distance covariance of $X$ and $Y$ is based on the squared difference between the joint distribution and the product of the marginal distributions, while the Gini one is based on the squared difference between the conditional distribution $F_k$ and the marginal distribution $F$. Our Gini dependence measure takes the categorical nature of $Y$ into account and has a simpler formulation than the distance correlation, leading to simpler inference and computation.
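For a concrete comparison, the sketch below computes a plug-in (V-statistic) distance covariance for a numerical sample and categorical labels, using absolute differences for the numerical variable and the 0/1 discrete metric for the labels, via the standard double-centering recipe. It then compares it with sums of energy distances between each empirical group distribution and the pooled empirical distribution, weighted by the sample proportion raised to the power 1 (a Gini-type covariance) or the power 2. The agreement of the power-2 sum with the double-centered value is obtained by expanding the definition and is offered here as a numerical check, not as a quotation of the paper's formula; naming conventions for the distance covariance (squared or not) vary across references, and the function names are assumptions.

```python
def centered(m):
    """Double-center a symmetric matrix: subtract row and column means, add the grand mean."""
    n = len(m)
    row = [sum(r) / n for r in m]
    grand = sum(row) / n
    return [[m[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def discrete_dcov(x, y):
    """Plug-in distance covariance with |x - x'| on x and the 0/1 metric on labels y."""
    n = len(x)
    a = centered([[abs(xi - xj) for xj in x] for xi in x])
    b = centered([[float(yi != yj) for yj in y] for yi in y])
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2

def mean_abs(u, v):
    """Average of |s - t| over all pairs (s, t) from samples u and v."""
    return sum(abs(s - t) for s in u for t in v) / (len(u) * len(v))

def weighted_energy(x, y, power):
    """Sum over groups of (sample proportion)^power times the energy distance to the pooled sample."""
    n = len(x)
    out = 0.0
    for label in sorted(set(y)):
        xk = [xi for xi, yi in zip(x, y) if yi == label]
        t = 2 * mean_abs(xk, x) - mean_abs(xk, xk) - mean_abs(x, x)
        out += (len(xk) / n) ** power * t
    return out

x = [0.1, 0.5, 0.9, 1.4, 2.0, 2.2, 3.1]
y = ["a", "a", "b", "b", "b", "c", "c"]

print("distance covariance (double centering):", discrete_dcov(x, y))
print("proportion^2-weighted energy sum:      ", weighted_energy(x, y, 2))
print("proportion-weighted energy sum (Gini): ", weighted_energy(x, y, 1))
```

The Gini-type version needs only group-wise and pooled pairwise distances, whereas the double-centered version builds and centers two full distance matrices.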

Before we discuss their computation and inference, let us first demonstrate the Gini correlation and the distance correlation in several examples.

2.5 Examples

Three examples for , and are provided. Denote as .

Figure 1: Correlation coefficients in the mixture exponential distribution, panels (a) and (b).

Example 1. Let and . We have

As we see, the formula of is complicated for the 2-component exponential mixture distribution. The correlations are given as follows.

Figure 1 demonstrates Gini correlation, distance correlation and squared Pearson correlation in the exponential mixtures. The cases of or in (a) and in (b) have zero Gini, zero distance and zero Pearson correlation coefficients, corresponding to the case of independence of and . The value of the Gini correlation is between the squared Pearson correlation and distance correlation.

Figure 2: Correlation coefficients in the mixture normal distribution, panels (a) and (b).

Example 2. Let , and . We have

where $\phi$ and $\Phi$ are the density and cumulative distribution functions of the standard normal distribution, respectively. However, it is too complicated to derive the formula of the distance correlation when $X$ is from a mixture of two normal distributions. In this case, we are only able to derive the Gini correlation and the squared Pearson correlation as follows.

For a mixture of two normal distributions with the same standard deviation but different means, independence of $X$ and $Y$ corresponds to the zero-correlation cases in panels (a) and (b) of Figure 2 for both correlations. For dependence cases, the squared Pearson correlation is larger than the Gini correlation.

Figure 3: (a) Gini correlation vs the mixing proportion in the mixture normal distribution for different ratios of standard deviations; (b) Gini correlation vs the ratio of standard deviations in the mixture normal distribution for different mixing proportions.

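Example 3. Consider a mixture of two normal distributions with the same mean but possibly different standard deviations. Again, it is too complicated to derive the formula of the distance correlation in this example. Since the two distributions have the same mean, the squared Pearson correlation is always 0 and hence it completely fails to measure the difference between the two distributions when their standard deviations differ. For the Gini correlation, we have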

Then

Figure 3(a) plots the Gini correlation against the mixing proportion for normal mixtures under different ratios of standard deviations, and (b) plots the Gini correlation against the ratio of standard deviations under different mixing proportions. When the mixing proportion is 0 or 1 in (a), or the ratio is 1 in (b), the Gini correlation is 0, corresponding to the independence of $X$ and $Y$.
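The behavior described for this example can be reproduced with a short closed-form calculation. The sketch below assumes a two-component mixture of centered normal distributions with standard deviations `sigma1` and `sigma2` and mixing proportion `p` (parameter names chosen for illustration), and uses the fact that the difference of two independent centered normals is again centered normal, so that the mean absolute value of a centered normal with variance v is sqrt(2v/pi). The Gini correlation is taken with exponent 1 in the form (total minus within) over total.

```python
import math

def mean_abs_normal(var):
    """E|Z| for Z ~ N(0, var)."""
    return math.sqrt(2.0 * var / math.pi)

def gini_corr_scale_mixture(p, sigma1, sigma2):
    """Gini correlation (exponent 1) for the mixture p*N(0, sigma1^2) + (1-p)*N(0, sigma2^2)."""
    d1 = mean_abs_normal(2.0 * sigma1 ** 2)            # GMD within component 1
    d2 = mean_abs_normal(2.0 * sigma2 ** 2)            # GMD within component 2
    d12 = mean_abs_normal(sigma1 ** 2 + sigma2 ** 2)   # mean distance across components
    total = p * p * d1 + (1 - p) ** 2 * d2 + 2 * p * (1 - p) * d12
    within = p * d1 + (1 - p) * d2
    return (total - within) / total

for ratio in (1.0, 2.0, 3.0, 5.0):
    print("sd ratio", ratio, "Gini correlation", gini_corr_scale_mixture(0.5, 1.0, ratio))
```

With equal standard deviations, or with a mixing proportion of 0 or 1, the function returns zero (up to rounding), while unequal standard deviations give a positive value even though the two group means coincide.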

3 Inference

3.1 Estimation

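Suppose sample data $\{(x_i, y_i)\}$ for $i = 1, \dots, n$ are available. The sample counterparts can be easily computed. Let $I_k$ be the index set of sample points with $y_i = L_k$; then $p_k$ is estimated by the sample proportion of that category, that is,

$$\hat{p}_k = \frac{n_k}{n},$$

where $n_k$ is the number of elements in $I_k$. With a given $\alpha \in (0, 2)$, a point estimator of $\rho_g(\alpha)$ is given as follows.

$$\hat{\rho}_g(\alpha) = \frac{\hat{\Delta}(\alpha) - \sum_{k=1}^{K} \hat{p}_k \hat{\Delta}_k(\alpha)}{\hat{\Delta}(\alpha)}, \quad \hat{\Delta}_k(\alpha) = \binom{n_k}{2}^{-1} \sum_{i < j \in I_k} \|x_i - x_j\|^{\alpha}, \quad \hat{\Delta}(\alpha) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \|x_i - x_j\|^{\alpha}. \qquad (18)$$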
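A minimal sketch of such a sample estimator is given below. It assumes the estimator takes the form (pooled U-statistic minus proportion-weighted within-group U-statistics) divided by the pooled U-statistic, where each U-statistic averages the pairwise Euclidean distances raised to a chosen exponent. The function names are illustrative, observations are passed as coordinate tuples, and every group needs at least two observations.

```python
import math
from itertools import combinations

def u_gmd(points, alpha=1.0):
    """U-statistic estimate of E||X1 - X2||^alpha: average over all unordered pairs."""
    n = len(points)
    n_pairs = n * (n - 1) / 2
    return sum(math.dist(a, b) ** alpha for a, b in combinations(points, 2)) / n_pairs

def sample_gini_corr(x, y, alpha=1.0):
    """Sample Gini correlation of observations x (coordinate tuples) and labels y."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    total = u_gmd(x, alpha)
    # Each group must contain at least two points for its U-statistic to be defined.
    within = sum(len(g) / n * u_gmd(g, alpha) for g in groups.values())
    return (total - within) / total

labels = ["a", "a", "b", "b"]
separated = [(0.0,), (0.0,), (1.0,), (1.0,)]    # groups sit on distinct points
interleaved = [(0.0,), (1.0,), (0.0,), (1.0,)]  # both groups have the same values

print(sample_gini_corr(separated, labels))
print(sample_gini_corr(interleaved, labels))
```

In the second toy sample the estimate is negative: unlike the population quantity, this U-statistic form is not forced to be non-negative in finite samples. The cost is dominated by the pooled pairwise distances, which are quadratic in the sample size.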

Clearly, $\hat{\Delta}_k(\alpha)$ and $\hat{\Delta}(\alpha)$ are U-statistics of size 2. Applying the U-statistic theorem [13, 27], we are able to establish asymptotic properties of the sample Gini covariance and the sample Gini correlation. The limiting distribution of the sample Gini correlation is obtained, depending on whether the underlying U-statistic is degenerate. We have the following theorems.

Theorem 3.1

If and for all , then almost surely

Proof: By the SLLN, $\hat{p}_k$ converges to $p_k$ with probability 1. Also, by the almost sure behavior of U-statistics [29], $\hat{\Delta}_k(\alpha)$ and $\hat{\Delta}(\alpha)$ converge with probability 1 to $\Delta_k(\alpha)$ and $\Delta(\alpha)$, respectively. The sample Gini correlation is a continuous function of these quantities whenever $\Delta(\alpha) > 0$, so its strong consistency follows.

Theorem 3.2

Suppose that , for all and . We have

where is the asymptotic variance given in the proof.

Proof: Let be and its sample version be . Let and . We first provide the limiting distribution of , then by Slutsky’s theorem, and are obtained since and are consistent estimators for and , respectively.

Let . With the U-statistic theorem, we have