Some notes on Goodman's marginal-free correspondence analysis

02/03/2022
by Vartan Choulakian, et al.

In his seminal paper, Goodman (1996) introduced marginal-free correspondence analysis, with the principal aim of reconciling Pearson's correlation measure with Yule's association measure in the analysis of contingency tables. We show that marginal-free correspondence analysis is a particular case of correspondence analysis with prespecified weights, studied in the early 1980s by Benzécri and his students. Furthermore, we show that it is also a particular first-order approximation of logratio analysis with uniform weights. Key words: Marginal-free correspondence analysis; logratio analysis; interactions; scale invariance; taxicab singular value decomposition.


1 Introduction

Correspondence analysis (CA) and logratio analysis (LRA) are two popular methods for the analysis and visualization of a contingency table (two-way frequency count data with $I$ rows and $J$ columns) or a compositional data set ($I$ individuals, also named samples, described by $J$ compositional parts). The reference book on CA is Benzécri (1973); Beh and Lombardo (2014) present a panoramic review of CA and its variants.

LRA includes two independently well-developed methods: RC association models for the analysis of contingency tables by Goodman (1979, 1981a, 1981b, 1991, 1996) and compositional data analysis (CoDA) by Aitchison (1986). CA and LRA are based on three different principles: CA on Benzécri’s distributional equivalence principle, RC association models on Yule’s scale invariance principle, and CoDA on Aitchison’s subcompositional coherence principle.

From a statistical point of view, there is a fundamental difference between the structures of a two-way contingency table and a compositional data set, both indexed by $i = 1, \ldots, I$ and $j = 1, \ldots, J$; from a mathematical point of view, however, the form of the statistical equations arising from different departure assumptions may be identical in Goodman’s RC association models and Aitchison’s CoDA.

In his seminal paper, Goodman (1996, equation (46)) introduced marginal-free correspondence analysis (mfCA), with the principal aim of reconciling Pearson's correlation measure with Yule’s association measure in the analysis of contingency tables. In this paper, we show that mfCA is a particular case of CA with prespecified weights, which was studied in the early 1980s under the direction of Benzécri. The following papers appeared in Les Cahiers de l’Analyse des Données, the journal edited by Benzécri: Madre (1980), Cholakian (1980, 1984), Benzécri (1983a, 1983b), Benzécri et al. (1980) and Moussaoui (1987). Furthermore, we show that mfCA is also a particular first-order approximation of LRA with uniform weights.

This paper is organized as follows: section 2 presents three different basic ways of representing the concept of interaction in a contingency table; section 3 discusses the important consequences of Yule’s scale invariance association index; section 4 presents Goodman’s marginal-free CA; section 5 discusses two examples; section 6 presents the R code to do the computations; finally, we conclude in section 7.

2 Preliminaries on analysis of contingency tables

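Let $\mathbf{P} = \mathbf{N}/n = (p_{ij})$ of size $I \times J$ be the associated correspondence matrix (probability table) of a contingency table $\mathbf{N}$ with grand total $n$. We define as usual $p_{i*} = \sum_{j=1}^{J} p_{ij}$ and $p_{*j} = \sum_{i=1}^{I} p_{ij}$, the vector $\mathbf{r} = (p_{i*}) \in \mathbb{R}^{I}$, the vector $\mathbf{c} = (p_{*j}) \in \mathbb{R}^{J}$, and the diagonal matrix $\mathbf{D}_I = \mathrm{Diag}(\mathbf{r})$ having diagonal elements $p_{i*}$, and similarly $\mathbf{D}_J = \mathrm{Diag}(\mathbf{c})$. We suppose that $\mathbf{D}_I$ and $\mathbf{D}_J$ are positive definite metric matrices of size $I \times I$ and $J \times J$, respectively; this means that the diagonal elements of $\mathbf{D}_I$ and $\mathbf{D}_J$ are strictly positive.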

2.1 Independence of the row and column categories

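a) The row categories are independent of the column categories,

$$\sigma_{ij} = p_{ij} - p_{i*}\,p_{*j} = 0, \qquad (1)$$

where $(\sigma_{ij})$ is the residual matrix of $\mathbf{P}$ with respect to the independence model $(p_{i*}\,p_{*j})$.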

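Remark 1: The contingency table $\mathbf{N}$ can also be represented (coded) as an indicator matrix $\mathbf{Z} = [\mathbf{Z}_I \;\; \mathbf{Z}_J]$ of size $n$ by $(I + J)$, where $\mathbf{Z}_I = (z_{\alpha i})$ with $z_{\alpha i} = 0$ if individual $\alpha$ does not have level $i$ of the row variable and $z_{\alpha i} = 1$ if individual $\alpha$ has level $i$ of the row variable; similarly, $\mathbf{Z}_J = (z_{\alpha j})$ with $z_{\alpha j} = 0$ if individual $\alpha$ does not have level $j$ of the column variable and $z_{\alpha j} = 1$ if individual $\alpha$ has level $j$ of the column variable. Note that $\mathbf{N} = \mathbf{Z}_I' \mathbf{Z}_J$ and that $\sigma_{ij} = p_{ij} - p_{i*}\,p_{*j}$ is the covariance between the $i$-th column of $\mathbf{Z}_I$ and the $j$-th column of $\mathbf{Z}_J$.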
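The indicator coding of Remark 1 can be checked numerically. The following base R sketch uses a small hypothetical 2 by 3 table (the counts are illustrative, not taken from the paper) and assumes that "covariance" means the population covariance with divisor $n$; the names `Z_I`, `Z_J` and `residual` are introduced here for illustration only.

```r
# Small hypothetical 2 x 3 contingency table (illustrative counts only).
N <- matrix(c(3, 1, 2,
              1, 4, 1), nrow = 2, byrow = TRUE)
n <- sum(N)

# One record per individual: its row level and its column level.
cells     <- which(N > 0, arr.ind = TRUE)
counts    <- N[cells]
row_level <- rep(cells[, "row"], times = counts)
col_level <- rep(cells[, "col"], times = counts)

# Indicator (dummy) matrices: Z_I is n x I and Z_J is n x J.
Z_I <- outer(row_level, seq_len(nrow(N)), "==") * 1
Z_J <- outer(col_level, seq_len(ncol(N)), "==") * 1

# The cross-product of the two indicator blocks recovers the table.
N_rebuilt <- crossprod(Z_I, Z_J)

# Population covariance (divisor n, an assumption of this sketch) between
# indicator columns, compared with the residual p_ij - p_i* p_*j.
P        <- N / n
residual <- P - outer(rowSums(P), colSums(P))
cov_pop  <- cov(Z_I, Z_J) * (n - 1) / n
```

Under these assumptions, `N_rebuilt` coincides with `N`, and `cov_pop` coincides with `residual` up to rounding error.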

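b) The independence assumption can also be interpreted in another way as

$$\frac{p_{ij}}{p_{i*}\,p_{*j}} - 1 = 0, \quad \text{equivalently} \quad \frac{p_{ij}}{p_{*j}} = p_{i*} \ \text{ and } \ \frac{p_{ij}}{p_{i*}} = p_{*j}; \qquad (2)$$

these are the column and row homogeneity models. Benzécri (1973, p.31) named the conditional probability vector ($p_{ij}/p_{*j}$ for $i = 1, \ldots, I$ and $j$ fixed) the profile of the $j$-th column, and the element $p_{ij}/(p_{i*}\,p_{*j})$ the density function of the probability measure $(p_{ij})$ with respect to the product measure $(p_{i*}\,p_{*j})$. The element $p_{ij}/(p_{i*}\,p_{*j})$ is named the Pearson ratio in Goodman (1996) and Beh and Lombardo (2014, p.123).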

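c) A third way to represent the independence assumption and the row and column homogeneity models is via the $(w_i^I, w_j^J)$ weighted loglinear formulation, equation (3), assuming $p_{ij} > 0$ and defining $G_{ij} = \log p_{ij}$:

$$\lambda_{ij} = G_{ij} - G_{i+} - G_{+j} + G_{++} = 0, \qquad (3)$$

where $G_{i+} = \sum_{j=1}^{J} w_j^J G_{ij}$, $G_{+j} = \sum_{i=1}^{I} w_i^I G_{ij}$ and $G_{++} = \sum_{i=1}^{I}\sum_{j=1}^{J} w_i^I w_j^J G_{ij}$; and $w_i^I > 0$ and $w_j^J > 0$, satisfying $\sum_{i=1}^{I} w_i^I = \sum_{j=1}^{J} w_j^J = 1$, are a priori fixed probability weights. Two popular choices of weights are marginal ($w_i^I = p_{i*}$, $w_j^J = p_{*j}$) and uniform ($w_i^I = 1/I$, $w_j^J = 1/J$). This is implicit in equation 7 in Goodman (1996) or equation 2.2.6 in Goodman (1991), and explicit in Egozcue et al. (2015).

Equation (3) is equivalent to the logratios

$$\log\frac{p_{ij}\,p_{i_1 j_1}}{p_{i j_1}\,p_{i_1 j}} = 0 \quad \text{for } i \neq i_1 \text{ and } j \neq j_1,$$

which Goodman (1979, equation 2.2) names the ”null association” model.

Equation (3) is also equivalent to

$$p_{ij} = \frac{\exp(G_{i+})\,\exp(G_{+j})}{\exp(G_{++})},$$

from which we deduce that, under the independence assumption, the marginal row probability vector $(p_{i*})$ is proportional to the vector of weighted geometric means $(\exp(G_{i+}))$, and a similar property is true also for the columns; see for instance Egozcue et al. (2015).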
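The three ways of expressing independence can be compared on a small example. The base R sketch below assumes the usual forms of the three indices: the residual $p_{ij} - p_{i*}p_{*j}$ for (1), the Pearson ratio minus one for (2), and the doubly centered logarithm of $p_{ij}$ with prespecified weights for (3). The table `P` and the helper `association_index` are hypothetical names introduced for illustration.

```r
# Hypothetical 3 x 3 probability table with strictly positive cells.
P <- matrix(c(0.20, 0.10, 0.05,
              0.05, 0.15, 0.10,
              0.05, 0.10, 0.20), nrow = 3, byrow = TRUE)
r  <- rowSums(P)   # row margins
cc <- colSums(P)   # column margins

# Index (1): departure from independence (residual, covariance type).
nonindependence <- P - outer(r, cc)

# Index (2): departure from homogeneity (Pearson ratio minus one).
nonhomogeneity <- P / outer(r, cc) - 1

# Index (3): weighted double centering of log(p_ij).
association_index <- function(P, w_row, w_col) {
  G     <- log(P)
  G_row <- as.vector(G %*% w_col)   # weighted row means of G
  G_col <- as.vector(w_row %*% G)   # weighted column means of G
  G_all <- sum(w_row * G_row)       # weighted grand mean of G
  sweep(sweep(G, 1, G_row, "-"), 2, G_col, "-") + G_all
}

uniform_w_row <- rep(1 / nrow(P), nrow(P))
uniform_w_col <- rep(1 / ncol(P), ncol(P))

association_uniform  <- association_index(P, uniform_w_row, uniform_w_col)
association_marginal <- association_index(P, r, cc)

# For an exactly independent table all three indices vanish.
P_indep      <- outer(r, cc)
assoc_indep  <- association_index(P_indep, uniform_w_row, uniform_w_col)
```

In this sketch the weighted row and column means of the association index are zero by construction, whichever admissible weights are chosen.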

2.2 Interaction factorization

Suppose the independence-homogeneity-null association models are not true, then each of the three equivalent model formulations (1,2,3) can be generalized to explain the nonindependence-nonhomogeneity-association, named interaction, among the rows and the columns by adding bilinear terms, where . We designate any one of the interaction indices (1,2,3) by

Benzécri (1973, Vol.1, p. 31-32) emphasized the importance of row and column weights or metrics in multidimensional data analysis; this is why, in French data analysis circles, any study starts with a triplet , where X represents the data set, is the metric defined on the rows and the metric defined on the columns. We follow the same procedure where:

a) In covariance analysis, and

b) In CA, and

c) In LRA, and with

We factorize the interactions in (1,2,3) by singular value decomposition (SVD) or taxicab SVD (TSVD) as

(4)

In the SVD case the parameters satisfy the conditions: for

In the TSVD case the parameters satisfy the conditions: for

A description of TSVD can be found, among others, in Choulakian (2006, 2016).
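For the SVD case, the factorization of the CA interaction can be sketched in base R as follows. This is a minimal illustration under the standard CA conventions (SVD of the standardized residuals, with the row and column margins as metrics); the table `P` is hypothetical, and the taxicab variant (TSVD) is not covered by this sketch.

```r
# Hypothetical 3 x 3 probability table.
P <- matrix(c(0.20, 0.10, 0.05,
              0.05, 0.15, 0.10,
              0.05, 0.10, 0.20), nrow = 3, byrow = TRUE)
r  <- rowSums(P)
cc <- colSums(P)

# Standardized residuals: (p_ij - r_i c_j) / sqrt(r_i c_j).
S <- (P - outer(r, cc)) / sqrt(outer(r, cc))

# Ordinary SVD; keep only the numerically nonzero singular values.
dec   <- svd(S)
k     <- sum(dec$d > 1e-12)
delta <- dec$d[seq_len(k)]

# Principal coordinates of rows and columns.
row_principal <- (dec$u[, seq_len(k), drop = FALSE] / sqrt(r)) %*%
  diag(delta, nrow = k)
col_principal <- (dec$v[, seq_len(k), drop = FALSE] / sqrt(cc)) %*%
  diag(delta, nrow = k)

# Total inertia is the sum of squared singular values.
total_inertia <- sum(delta^2)

# Reconstitution of the Pearson ratios from the bilinear terms.
ratio_rebuilt <- 1 + row_principal %*% diag(1 / delta, nrow = k) %*%
  t(col_principal)
```

Here `total_inertia` equals Pearson's chi-squared statistic divided by the sample size, and `ratio_rebuilt` reproduces the Pearson ratios $p_{ij}/(p_{i*}p_{*j})$.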

Remark 2

a) In the case , the bilinear decomposition (4) is also named interbattery analysis, first proposed by Tucker (1958); later on, Tenenhaus and Augendre (1996) reintroduced it within correspondence analysis circles, where they showed that, on some correspondence tables, the Tucker decomposition by SVD produced a more interesting and more interpretable structure than CA.

b) In the case , the CA decomposition has many interpretations. Essentially, for data analysis purposes, Benzécri (1973) interpreted it as weighted principal components analysis of row and column profiles. Another useful interpretation of CA, comparable to Tucker interbattery analysis, is Hotelling’s (1936) canonical correlation analysis; see Lancaster (1958) and Goodman (1991, 1996).

3 Yule’s principle of scale invariance

We start by quoting Goodman (1996, section 10) to really understand Yule’s principle of scale invariance: ”Pearson’s approach to the analysis of cross-classified data was based primarily on the bivariate normal. He assumed that the row and column classifications arise from underlying continuous random variables having a bivariate normal distribution, so that the sample contingency table comes from a discretized bivariate normal; and he then was concerned with the estimation of the correlation coefficient for the underlying bivariate normal. On the other hand, Yule felt that, for many kinds of contingency tables, it was not desirable in scientific work to introduce assumptions about an underlying bivariate normal in the analysis of these tables; and for such tables, he used, to a great extent, coefficients based on the odds-ratios (for example, Yule’s Q and Y), coefficients that did not require any assumptions about underlying distributions. The Pearson approach and the Yule approach appear to be wholly different, but a kind of reconciliation of the two perspectives was obtained in Goodman (1981a)”. An elementary exposition of these ideas with examples can also be found in Mosteller (1968).

In the notation of our paper, Goodman’s reconciliation is based on defining the a priori weights in the association index (3), where by its decomposition into bilinear terms, mwLRA will correspond to Pearson’s approach, while uwLRA to Yule’s approach. Because log-odds

(5)

To have a clear picture of LRA with general a priori prescribed weights (, we first study the properties of the association index that distinguishes it from interaction indices (2,3).

3.1 Scale invariance of an interaction index

We are concerned with the property of scale dependence or independence of the three interaction indices (1,2,3). We note that in (1,2,3), depends on To emphasize this dependence, we express an interaction index by where: in the case of the association index is defined in (3), in the case of the nonhomogeneity index is defined in (2), and in the case of the nonindependence index is defined in (1). Following Yule (1912), we state the following

Definition 1: An interaction index is scale invariant if for scales and .

It is important to note that Yule’s principle of scale invariance concerns a function of four interaction terms, see equation (5); while in Definition 1 the invariance concerns each interaction term.

It is evident that the interaction indices (1) and (2) are not scale invariant, because they are marginal-dependent.

Concerning the association index (3) we have

Lemma 1: The association index (3) is scale invariant.

Proof: Let then

(6)
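Lemma 1 can be illustrated numerically. The base R sketch below assumes that the association index (3) is the doubly centered logarithm of $p_{ij}$ with a priori fixed weights (uniform weights here), and that a change of scale multiplies cell $(i, j)$ by $a_i b_j$ with $a_i, b_j > 0$. The table `P`, the scales `a` and `b`, and the helper `association_index` are hypothetical.

```r
# Doubly centered log(p_ij) with prespecified weights (assumed form of (3)).
association_index <- function(P, w_row, w_col) {
  G     <- log(P)
  G_row <- as.vector(G %*% w_col)
  G_col <- as.vector(w_row %*% G)
  G_all <- sum(w_row * G_row)
  sweep(sweep(G, 1, G_row, "-"), 2, G_col, "-") + G_all
}

P <- matrix(c(0.20, 0.10, 0.05,
              0.05, 0.15, 0.10,
              0.05, 0.10, 0.20), nrow = 3, byrow = TRUE)
w_row <- rep(1 / 3, 3)   # uniform a priori weights
w_col <- rep(1 / 3, 3)

# Arbitrary strictly positive row and column scales.
a <- c(2, 0.5, 3)
b <- c(0.1, 4, 1)
P_scaled <- outer(a, b) * P
P_scaled <- P_scaled / sum(P_scaled)   # renormalize to a probability table

lambda        <- association_index(P, w_row, w_col)
lambda_scaled <- association_index(P_scaled, w_row, w_col)

# The Pearson-ratio index (2), by contrast, depends on the margins.
pearson_ratio <- function(P) P / outer(rowSums(P), colSums(P)) - 1
ratio         <- pearson_ratio(P)
ratio_scaled  <- pearson_ratio(P_scaled)
```

With fixed weights, `lambda` and `lambda_scaled` agree up to rounding, because the row and column scales only add row and column constants to $\log p_{ij}$, which the double centering removes; `ratio` and `ratio_scaled` differ.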

Lemma 2: To a first-order approximation,

Proof: The average value of the density function with respect to the product measure is 1; so the values are distributed around 1. By Taylor series expansion of in the neighborhood of , we have to a first-order Putting and in (6), and by using

which is the required result.

Remark 3: Lemma 2 provides a first order approximation to mwTLRA and uwTLRA, where we see that both first-order approximations are marginal-dependent but in different ways.

a) In the case and in Lemma 2, which implies that CA (or TCA) is a first-order approximation of mwLRA (or mwTLRA), a result stated in Cuadras et al. (2006).

b) In the case and in Lemma 2, which implies that the bilinear expansion of the right side by TSVD (or SVD) is a first-order approximation of uwTLRA (or uwLRA).
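Case a) of Remark 3 can be checked on a table close to independence. The base R sketch below assumes that, with marginal weights, the first-order approximation of the association index (3) is the Pearson ratio minus one, which follows from $\log x \approx x - 1$ near $x = 1$. The margins and the perturbation matrix `E` are hypothetical values chosen for illustration.

```r
# Weighted double centering of a matrix X.
double_center <- function(X, w_row, w_col) {
  X_row <- as.vector(X %*% w_col)
  X_col <- as.vector(w_row %*% X)
  X_all <- sum(w_row * X_row)
  sweep(sweep(X, 1, X_row, "-"), 2, X_col, "-") + X_all
}

# Near-independent table: independence part plus a small perturbation
# with zero row and column sums, so the margins stay equal to r and cc.
r  <- c(0.3, 0.3, 0.4)
cc <- c(0.3, 0.3, 0.4)
E  <- 0.01 * matrix(c( 1, -1, 0,
                      -1,  1, 0,
                       0,  0, 0), nrow = 3, byrow = TRUE)
P <- outer(r, cc) + E

# Association index (3) with marginal weights (mwLRA interaction).
lambda_marginal <- double_center(log(P), r, cc)

# CA interaction: Pearson ratio minus one.
ca_interaction <- P / outer(r, cc) - 1

# With marginal weights the CA interaction is already doubly centered.
ca_centered <- double_center(ca_interaction, r, cc)

# Size of the approximation error on this example.
max_abs_error <- max(abs(lambda_marginal - ca_interaction))
```

On this example `ca_centered` equals `ca_interaction`, and `max_abs_error` is small relative to the largest interaction term; the error grows as the table moves away from independence.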

In this subsection, we discussed the approximation of LRA by CA-related methods. Greenacre (2009) posed the reciprocal question: when do CA-related methods converge to LRA? He stated two results, for which we provide a proof in the following subsection.

3.2 Box-Cox transformation

Theoretically, CA and LRA have been presented in a unified mathematical framework via the Box-Cox transformation by Goodman (1996), where the bilinear terms have been estimated by SVD. Goodman’s framework was further considered, among others, by Cuadras et al. (2006), Greenacre (2009, 2010), and Cuadras and Cuadras (2015).

Consider the triplet (X, Q, D), where with represents the data set, and with Let be a nonnegative real number. Following Goodman (1996, equations (3,4,5)), we define the interaction index,

(7)

Using the well-known limit obtained from L’Hôpital’s rule, $\lim_{\alpha \to 0} (x^{\alpha} - 1)/\alpha = \log x$ for $x > 0$, (7) converges to

(8)

We consider two cases of (7, 8):

a) which is the interaction term of mwLRA, and equivalent to Result 2 in Greenacre (2010).

b)   which is the interaction term of uwLRA; this is similar to Result 1 in Greenacre (2010).

Equation (7) can also be applied differently, where:

In (7) we replace by by we see that lim similarly lim Then we get

In particular lim.
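The limit that underlies this subsection can be checked numerically. The base R sketch below assumes the standard Box-Cox form $(x^{\alpha} - 1)/\alpha$ with a power parameter named `alpha` (the name is an assumption of this sketch); it is applied to a few hypothetical Pearson-ratio values.

```r
# Box-Cox transformation, with the logarithm as its limit at alpha = 0.
box_cox <- function(x, alpha) {
  if (alpha == 0) log(x) else (x^alpha - 1) / alpha
}

# Hypothetical Pearson ratios scattered around 1.
x <- c(0.5, 0.8, 1.0, 1.25, 2.0)

# alpha = 1 gives x - 1, the CA-type interaction term.
at_one <- box_cox(x, 1)

# As alpha decreases to 0 the transform approaches log(x).
alphas <- c(1, 0.5, 0.1, 0.01, 0.001)
max_gap_to_log <- sapply(alphas, function(a) max(abs(box_cox(x, a) - log(x))))
```

The vector `max_gap_to_log` decreases along `alphas`, which illustrates how a CA-type index (`alpha = 1`) moves towards an LRA-type index as the power tends to zero.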

4 CA with prescribed weights and Goodman’s mfCA

CA with prescribed weights is done in two steps in the following way: We observe a probability table of size by . Let of size by be an unknown probability table with known marginals and The two steps are:

Step1: We construct Q which is in a sense ”nearest to . Two general criteria are: based on (3) and min based on (2).

Step 2: We apply CA to the constructed probability Q

which represents CA of with prespecified weights (. Cholakian (1980) presents an example, where both criteria have been applied and similar results have been obtained.

In the particular case, where we get Goodman’s mfCA, see Goodman (1996, equation (46)). is related to via the strictly positive scales ( that keeps the association between the -th row and the -th column unchanged. The famous iterative proportional fitting algorithm (IPF) is used to construct Q. That is, the constructed probability table ( has uniform marginals and So in Step 2, CA representation is

(9)

which represents a first-order approximation to both uwLRA and mwLRA by Remark 3. Furthermore, by Remark 2 we see that mfCA can be interpreted both as Tucker and Hotelling decompositions.
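The construction of Q with uniform margins can be sketched in base R without extra packages. This is a minimal iterative proportional fitting loop, assuming a strictly positive table; it is not the `ipfr` routine used in section 6, and the function name `ipf_uniform`, the tolerance and the example table are hypothetical.

```r
# Iterative proportional fitting towards uniform margins 1/I and 1/J.
ipf_uniform <- function(P, tol = 1e-10, max_iter = 1000) {
  stopifnot(all(P > 0))   # assumption of this sketch: no zero cells
  n_row <- nrow(P)
  n_col <- ncol(P)
  Q <- P / sum(P)
  for (iter in seq_len(max_iter)) {
    Q <- sweep(Q, 1, (1 / n_row) / rowSums(Q), "*")   # fit row margins
    Q <- sweep(Q, 2, (1 / n_col) / colSums(Q), "*")   # fit column margins
    if (max(abs(rowSums(Q) - 1 / n_row)) < tol) break
  }
  Q
}

P <- matrix(c(0.20, 0.10, 0.05,
              0.05, 0.15, 0.10,
              0.05, 0.10, 0.20), nrow = 3, byrow = TRUE)
Q <- ipf_uniform(P)

# Each step multiplies cell (i, j) by a row scale and a column scale,
# so odds ratios (Yule's association) are left unchanged.
odds_ratio <- function(M, i, i1, j, j1) {
  (M[i, j] * M[i1, j1]) / (M[i, j1] * M[i1, j])
}
or_P <- odds_ratio(P, 1, 2, 1, 2)
or_Q <- odds_ratio(Q, 1, 2, 1, 2)
```

Step 2 then amounts to applying CA to `Q`; tables with zero cells need extra care, which this sketch does not attempt.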

5 Examples

We present the analysis of two datasets for comparative purposes.

5.1 Example 1

This dataset is contrived and taken from Goodman (1991, Table 10(1)), which we reproduce below.

According to Goodman, LRA has one principal dimension, while CA has 2 principal dimensions. Here we compare the dispersion results of the 4 methods: CA, TCA, mfCA and mfTCA.

In CA: and

In mfCA: and

In TCA: and

In mfTCA: and

5.2 Example 2

We consider the rodent data set of size 28 by 9 found in the R package TaxicabCA. This is an abundance data set of 9 kinds of rats in 28 cities in California. It can be considered both a contingency table and a compositional data set. Choulakian (2017) analyzed it by comparing the CA and TCA maps; furthermore, Choulakian (2021) showed that it has a quasi-2-blocks structure. Here we compare the dispersion results of the first 2 principal dimensions in the 4 methods: CA, TCA, mfCA and mfTCA:

In CA: and

In mfCA: and

In TCA: and

In mfTCA: and

The curious reader can apply the R code below to compare the 4 maps: CA, mfCA, TCA and mfTCA.

6 R code

# Install the required packages (only needed once)
install.packages(c("ipfr", "ca", "TaxicabCA"))

# Load the rodent data set shipped with TaxicabCA
library(TaxicabCA)
dataMatrix <- as.matrix(rodent)
nRow <- nrow(dataMatrix)
nCol <- ncol(dataMatrix)
ssize <- sum(dataMatrix)

# Computation of the Q matrix of rodent:
# iterative proportional fitting towards uniform row and column totals
library(ipfr)
mtx <- dataMatrix
row_targets <- rep(ssize / nRow, nRow)
column_targets <- rep(ssize / nCol, nCol)
QMatrix <- ipu_matrix(mtx, row_targets, column_targets)
rownames(QMatrix) <- paste("", 1:nRow, sep = "")
colnames(QMatrix) <- paste("C", 1:nCol, sep = "")

# CA map of the rodent dataset
library(ca)
plot(ca(dataMatrix))

# mfCA map of rodent: CA applied to the fitted table Q
plot(ca(QMatrix))

# TCA map of rodent
tca.Data <- tca(dataMatrix, nAxes = 2, algorithm = "exhaustive")
plot(
  tca.Data,
  axes = c(1, 2),
  labels.rc = c(1, 1),
  col.rc = c("blue", "red"),
  pch.rc = c(5, 5, 0.3, 0.3),
  mass.rc = c(F, F),
  cex.rc = c(0.6, 0.6),
  jitter = c(F, T)
)

# mfTCA map of rodent, dimensions 1-2: TCA applied to the fitted table Q
tca.Data <- tca(QMatrix, nAxes = 2, algorithm = "exhaustive")
plot(
  tca.Data,
  axes = c(1, 2),
  labels.rc = c(1, 1),
  col.rc = c("blue", "red"),
  pch.rc = c(5, 5, 0.3, 0.3),
  mass.rc = c(F, T),
  cex.rc = c(0.6, 0.6),
  jitter = c(F, T)
)

# mfTCA map of rodent, dimensions 2-3
# (three axes are computed here so that axis 3 is available for plotting)
tca.Data <- tca(QMatrix, nAxes = 3, algorithm = "exhaustive")
plot(
  tca.Data,
  axes = c(2, 3),
  labels.rc = c(1, 1),
  col.rc = c("blue", "red"),
  pch.rc = c(5, 5, 0.3, 0.3),
  mass.rc = c(F, T),
  cex.rc = c(0.6, 0.6),
  jitter = c(F, T)
)

7 Conclusion

In his seminal paper, Goodman (1996) introduced marginal-free correspondence analysis, with the principal aim of reconciling Pearson's correlation measure with Yule’s association measure in the analysis of contingency tables. We showed that marginal-free correspondence analysis is a particular case of correspondence analysis with prespecified weights, studied in the early 1980s by Benzécri and his students. Furthermore, we showed that it is also a particular first-order approximation of logratio analysis with uniform weights.

Acknowledgements.

Choulakian’s research has been supported by NSERC of Canada.

References

Aitchison J (1986) The Statistical Analysis of Compositional Data. London: Chapman and Hall.

Beh E and Lombardo R (2014) Correspondence Analysis: Theory, Practice and New Strategies. N.Y: Wiley.

Benzécri JP (1973) L’Analyse des Données: Vol. 2: L’Analyse des Correspondances. Paris: Dunod.

Benzécri JP (1983a) Ajustement d’un tableau à des marges sous l’hypothèse d’absence d’interaction ternaire. Les Cahiers de l’Analyse des Données, 8(2), 227-233

Benzécri JP (1983b) Sur une généralisation du problème de l’ajustement d’une mesure à des marges. Les Cahiers de l’Analyse des Données, 8(3), 359-370

Benzécri JP, Bourgarit C, Madre JL (1980) Problème: ajustement d’un tableau à ses marges d’après la formule de reconstitution. Les Cahiers de l’Analyse des Données, 5(1), 163-172

Cholakian V (1980) Un exemple d’application de diverses méthodes d’ajustement d’un tableau à des marges imposées. Les Cahiers de l’Analyse des Données, 5(2), 173-176

Cholakian V (1984) Méthodes et critères pour l’ajustement d’un tableau à des marges imposées. Les Cahiers de l’Analyse des Données, 9(1), 113-117

Choulakian V (2006) Taxicab correspondence analysis. Psychometrika, 71, 333-345.

Choulakian V (2016) Matrix factorizations based on induced norms. Statistics, Optimization and Information Computing, 4, 1-14.

Choulakian V (2017) Taxicab correspondence analysis of sparse contingency tables. Italian Journal of Applied Statistics, 29 (2-3), 153-179.

Choulakian V (2021) Quantification of intrinsic quality of a principal dimension in correspondence analysis and taxicab correspondence analysis. Available on arXiv:2108.10685.

Cuadras CM, Cuadras D, Greenacre M (2006) A comparison of different methods for representing categorical data. Communications in Statistics - Simulation and Computation, 35(2), 447-459.

Cuadras CM, Cuadras D (2015) A unified approach for the multivariate analysis of contingency tables. Open Journal of Statistics, 5, 223-232

Egozcue JJ, Pawlowsky-Glahn V, Templ M, Hron K (2015) Independence in contingency tables using simplicial geometry. Communications in Statistics - Theory and Methods, 44:18, 3978-3996

Goodman LA (1979) Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74, 537-552

Goodman LA (1981a) Association models and the bivariate normal for contingency tables with ordered categories. Biometrika, 68, 347-355

Goodman LA (1981b) Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association, 76, 320-334

Goodman LA (1991) Measures, models, and graphical displays in the analysis of cross-classified data. Journal of the American Statistical Association, 86 (4), 1085-1111

Goodman LA (1996) A single general method for the analysis of cross-classified data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. Journal of the American Statistical Association, 91, 408-428

Greenacre M (2009) Power transformations in correspondence analysis. Computational Statistics & Data Analysis, 53(8), 3107-3116

Greenacre M (2010) Log-ratio analysis is a limiting case of correspondence analysis. Mathematical Geosciences, 42, 129-134

Madre JL (1980) Méthodes d’ajustement d’un tableau à des marges. Les Cahiers de l’Analyse des Données, 5 (1), 87-99

Mosteller F (1968) Association and estimation in contingency tables, Journal of the American Statistical Association, 63, 1-28

Moussaoui AE (1987) Sur la reconstruction approchée d’un tableau de correspondance à partir du tableau cumulé par blocs suivant deux partitions des ensembles I et J. Les Cahiers de l’Analyse des Données, 12(3), 365-370

Yule GU (1912) On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75, 579-652