Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are well-known multivariate data reduction techniques. Both methods analyze a correlation or variance–covariance matrix of a multivariate quantitative dataset. Correspondence Analysis (CA), by contrast, is an analysis of categorical data (CD).
Also, CA is a statistical visualization method for picturing the associations between the levels of a two-way contingency table. To illustrate CA, consider the contingency table presented in Table 1, which shows the well-known Fisher's data for the eye and hair color of people in Caithness, Scotland. The CA of these data yields the graphical display presented in Figure 1, which shows the correspondence between eye and hair color.
PCA is the eigenvalue decomposition of a covariance matrix. CCA is a singular value decomposition (SVD) based on the correlation matrix, with adjustments for the mean and for inhomogeneity. CA can be regarded as an SVD of the covariance of CD with adjustments for the mean and inhomogeneity analogous to those of CCA. The Gini-index offers another way of addressing the variance of CD. Okada [11, 10] reported that defining the Gini-index using one-hot encoding on an appropriately rotated space yields reasonable values for the covariance. This research shows that the rotated Gini-index, with an adjustment for inhomogeneity, is equivalent to CA.
Picca et al. [13] introduced a nonlinear kernel extension of CA. Niitsuma and Okada [9] introduced a nonlinear kernel extension of the rotated Gini-index. The equivalence between CA and the modified Gini-index allows us to define a more general nonlinear kernel extension of CA. Calling this extension kernel Correspondence Analysis (KCA), we show that it yields well-known analyses for CD and natural language processing by specializing the kernels. For example, KCA can give the G-test, skip-gram with negative-sampling (SGNS) [8, 6], and GloVe [12] as special cases.
We apply KCA to natural language processing, specifically examining vector representations of words.
Mikolov et al. [8] introduced dense vector representations of words referred to as word2vec. These vector representations give meaning to subtraction and addition operations on the vectors. For example, the analogy "king is to queen as man is to woman" is encoded in the vector space by the vector equation: vec("king") − vec("man") + vec("woman") ≈ vec("queen").
CA sometimes exhibits a similar property: a comparable vector relation among the eye-color and hair-color points can be observed in Figure 1.
Levy et al. [6] showed that skip-gram with negative-sampling (SGNS), one of the word2vec models, can be regarded as an analysis of a two-way contingency table in which each row represents a word and each column represents a context. They showed that SGNS is an implicit matrix factorization of the pointwise mutual information (PMI) of the contingency table.
PMI can be regarded as a log-scale representation of a contingency table.
We formulate this mechanism within KCA by providing the appropriate kernels.
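As a concrete illustration, the PMI of a contingency table can be computed in a few lines of numpy. This sketch is not from the original paper; the function name and the toy counts are our own.

```python
import numpy as np

def pmi(F):
    """Pointwise mutual information of a contingency table F.

    PMI[i, j] = log( p(i, j) / (p(i) p(j)) ); cells with zero counts
    come out as -inf, which is exactly the log-scale view of the table.
    """
    F = np.asarray(F, dtype=float)
    P = F / F.sum()                     # joint probabilities
    r = P.sum(axis=1, keepdims=True)    # row marginals
    c = P.sum(axis=0, keepdims=True)    # column marginals
    with np.errstate(divide="ignore"):
        return np.log(P / (r * c))

# hypothetical 2x2 word/context counts
M = pmi(np.array([[10, 0], [5, 5]]))
```

Zero cells map to −∞ on the log scale, which is the difficulty that the G-test specialization discussed later avoids.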
2 Correspondence Analysis
CA is a generalized singular value decomposition (GSVD; see http://forrest.psych.unc.edu/research/vista-frames/pdf/chap11.pdf) based on a contingency table for two categorical variables:

$$ P - r c^{\mathsf{T}} = U \Sigma V^{\mathsf{T}}, \qquad U^{\mathsf{T}} D_r^{-1} U = V^{\mathsf{T}} D_c^{-1} V = I, $$

where $F$ is an $n_x \times n_y$ contingency table whose entries give the frequency with which a level of the row categorical variable $x$ occurs together with a level of the column categorical variable $y$; $P = F/n$ is the correspondence matrix; $r = P\mathbf{1}$ is the vector of row marginals; $c = P^{\mathsf{T}}\mathbf{1}$ is the vector of column marginals; and $n$ stands for the total number of observations. $I$ denotes an identity matrix, and $D_v$ stands for the diagonal matrix whose diagonal entries are the components of a vector $v$. Usually, CA uses the principal coordinates $D_r^{-1} U \Sigma$ and $D_c^{-1} V \Sigma$ for the visualization instead of using $U$ and $V$ directly.
For example, let us consider the contingency table presented in Table 1. The table presents the joint population distribution of the categorical variable $x$ for eye color and the categorical variable $y$ for hair color. For these data, $r$, $c$, and $n$ are obtained from the row sums, column sums, and grand total of Table 1. The CA of Fisher's data then amounts to the GSVD described above.
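The computation can be sketched in numpy as follows; this is a minimal illustration assuming the standard residual-SVD formulation of CA, with a small hypothetical table standing in for the Fisher data.

```python
import numpy as np

def correspondence_analysis(F):
    """CA of a two-way contingency table F via the SVD of the
    standardized residual matrix D_r^{-1/2} (P - r c^T) D_c^{-1/2}."""
    P = np.asarray(F, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)                       # row marginals
    c = P.sum(axis=0)                       # column marginals
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row = (U * s) / np.sqrt(r)[:, None]     # principal row coordinates
    col = (Vt.T * s) / np.sqrt(c)[:, None]  # principal column coordinates
    return row, col, s

# hypothetical 3x3 eye-colour x hair-colour table
F = np.array([[20, 10, 5],
              [10, 30, 10],
              [5, 10, 25]])
rows, cols, sv = correspondence_analysis(F)
```

Plotting the first two columns of `rows` and `cols` in one plane gives a display of the kind shown in Figure 1.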
3 Gini’s definition of Variance and its Extension
A contingency table can be constructed as the matrix product of two indicator matrices: $F = Z_x^{\mathsf{T}} Z_y$. Here $Z_x$ is the indicator matrix of $x$: its $i$-th row is the one-hot encoding of the value of $x$ in the $i$-th of the $n$ observations. $Z_y$ is the indicator matrix of $y$, defined analogously from the one-hot encodings of $y$.
For example, in Fisher's data, the eye color of each observation is one-hot encoded, and stacking these encodings row by row gives the indicator matrix $Z_x$; $Z_y$ can be constructed similarly from the hair colors. The contingency table of Fisher's data is then $F = Z_x^{\mathsf{T}} Z_y$.
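A minimal numpy sketch of this construction, with hypothetical category codes in place of the Fisher observations:

```python
import numpy as np

# hypothetical observations: category codes of x and y for 5 observations
x = np.array([0, 0, 1, 1, 2])   # e.g. eye-colour codes
y = np.array([0, 1, 1, 1, 2])   # e.g. hair-colour codes

# indicator matrices: row i is the one-hot encoding of observation i
Zx = np.eye(3)[x]
Zy = np.eye(3)[y]

# the contingency table is the product of the two indicator matrices
F = Zx.T @ Zy
```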
Based on pairwise differences between observations, we can define the Gini-index as

$$ \mathrm{Gini}(x) = \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| z^x_i - z^x_j \right\|^2, $$

where $z^x_i$ denotes the one-hot encoding of the $i$-th observation of $x$. The Gini-index is the variance of a categorical variable; when the $z^x_i$ are real numbers, the definition reduces to the ordinary variance. For example, the Gini-index of the eye-color variable in Fisher's data can be computed directly from the category proportions in Table 1.
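The pairwise definition agrees with the familiar form $1 - \sum_k p_k^2$ when the encodings are one-hot; the short check below (our own sketch on hypothetical data) verifies this numerically.

```python
import numpy as np

def gini_pairwise(y, k):
    """Gini-index via the pairwise definition (1/2n^2) sum ||z_i - z_j||^2."""
    Z = np.eye(k)[y]                       # one-hot encodings, one row each
    n = len(y)
    D = Z[:, None, :] - Z[None, :, :]      # all pairwise differences
    return np.sum(D ** 2) / (2 * n * n)

def gini_classic(y, k):
    """Classical form 1 - sum_k p_k^2 of the same index."""
    p = np.bincount(y, minlength=k) / len(y)
    return 1.0 - np.sum(p ** 2)

obs = np.array([0, 0, 1, 2, 2, 2])         # hypothetical category codes
a, b = gini_pairwise(obs, 3), gini_classic(obs, 3)
```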
A simple extension of this definition to the covariance, obtained by replacing one of the two one-hot differences with that of the other variable, does not give reasonable values for the covariance. Okada [11, 10] showed that using a rotated one-hot encoding provides reasonable values for the covariance of CD:

$$ \mathrm{Cov}(x, y) = \max_{R} \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( z^x_i - z^x_j \right)^{\mathsf{T}} R \left( z^y_i - z^y_j \right), \quad (3) $$

where $R$ is a rotation matrix chosen to maximize the covariance. We can rewrite Eq. (3) using the matrix $P - r c^{\mathsf{T}}$, which is used in CA. Here we use the relation

$$ \frac{1}{2n^2} \sum_{i,j} \left( z^x_i - z^x_j \right)^{\mathsf{T}} R \left( z^y_i - z^y_j \right) = \mathrm{tr}\!\left[ R \left( P - r c^{\mathsf{T}} \right)^{\mathsf{T}} \right], $$

which follows from $P = Z_x^{\mathsf{T}} Z_y / n$, $r = Z_x^{\mathsf{T}} \mathbf{1} / n$, and $c = Z_y^{\mathsf{T}} \mathbf{1} / n$.
One can solve the maximization problem by differentiating the Lagrangian

$$ \mathcal{L} = \mathrm{tr}\!\left[ R \, C^{\mathsf{T}} \right] - \frac{1}{2} \mathrm{tr}\!\left[ \Lambda \left( R^{\mathsf{T}} R - I \right) \right], \qquad C = P - r c^{\mathsf{T}}, $$

where $\Lambda$ is a matrix of Lagrange multipliers. The condition for this Lagrangian to be stationary with respect to $R$ is $C = R \Lambda$ with $\Lambda = \Lambda^{\mathsf{T}}$, which demonstrates that $R^{\mathsf{T}} C$ must be a symmetric matrix. We can give a solution to this maximization problem using the SVD $C = U \Sigma V^{\mathsf{T}}$. It is readily apparent that $R = U V^{\mathsf{T}}$ satisfies all necessary conditions, since $R^{\mathsf{T}} C = V \Sigma V^{\mathsf{T}}$ is symmetric and the objective attains $\sum_k \sigma_k$. One can conclude that $R = U V^{\mathsf{T}}$ is the solution of problem (3).
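This is the classical orthogonal-Procrustes solution; a small numerical check (our own sketch on random data) confirms that the rotation built from the SVD attains the maximum $\sum_k \sigma_k$.

```python
import numpy as np

def best_rotation(C):
    """Orthogonal R maximizing trace(R @ C.T); the maximum equals the
    sum of the singular values of C (orthogonal Procrustes)."""
    U, s, Vt = np.linalg.svd(C)
    return U @ Vt, s.sum()

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 4))        # stand-in for P - r c^T
R, best = best_rotation(C)
```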
Recalling that CA is the GSVD of $P - r c^{\mathsf{T}}$, the GSVD can be computed using the SVD of the standardized matrix, as shown below:

$$ D_r^{-1/2} \left( P - r c^{\mathsf{T}} \right) D_c^{-1/2} = \tilde{U} \Sigma \tilde{V}^{\mathsf{T}}, \qquad U = D_r^{1/2} \tilde{U}, \quad V = D_c^{1/2} \tilde{V}. $$

Substituting this relation into Eq. (3) and rescaling produces the problem

$$ \max_{R} \; \mathrm{tr}\!\left[ R \left( D_r^{-1/2} \left( P - r c^{\mathsf{T}} \right) D_c^{-1/2} \right)^{\mathsf{T}} \right]. \quad (21) $$

By the same argument as above, the solution of this optimization problem is $R = \tilde{U} \tilde{V}^{\mathsf{T}}$, so we can say that CA is equivalent to problem (21).
Based on these relations, we introduce the scaled one-hot encodings $\tilde{z}^x_i = D_r^{-1/2} z^x_i$ and $\tilde{z}^y_i = D_c^{-1/2} z^y_i$. It is noteworthy that

$$ \frac{1}{n} \sum_{i=1}^{n} \tilde{z}^x_i \left( \tilde{z}^y_i \right)^{\mathsf{T}} - \bar{\tilde{z}}^x \left( \bar{\tilde{z}}^y \right)^{\mathsf{T}} = D_r^{-1/2} \left( P - r c^{\mathsf{T}} \right) D_c^{-1/2}. $$

Substituting this relation into problem (21) yields

$$ \max_{R} \; \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \tilde{z}^x_i - \tilde{z}^x_j \right)^{\mathsf{T}} R \left( \tilde{z}^y_i - \tilde{z}^y_j \right). \quad (29) $$

It is readily apparent that this problem defines the rotated Gini-index using the scaled one-hot encoding. This generalized Gini-index is equivalent to CA.
4 Non-linear extension
This section presents an attempt to generalize CA as far as possible while keeping the relation between the Gini-index and CA. Let us introduce a nonlinear mapping $\phi$ of the scaled one-hot encodings, together with operators acting on the nonlinearly mapped space. Rewriting problem (29) in terms of the mapped encodings $\phi(\tilde{z}^x_i)$ and $\phi(\tilde{z}^y_i)$ yields problem (30), the kernelized counterpart of the rotated Gini-index.
Consider the expansion of this expression. The expanded expression includes operators that are applied to the mapped encodings but are not defined directly on the mapped space. To evaluate such an expression, we move these operators outside the parentheses. Rules (31)–(36) serve this purpose: rewriting the left-hand side of each rule to its right-hand side moves an operator outside the parentheses. Applying these rewriting rules to problem (30) yields the following.
Introducing kernel matrices $K_x$ and $K_y$, built from inner products of the mapped encodings, the problem can be expressed entirely in terms of these kernels. Specifying the operators and kernel matrices then provides various well-known analyses for CD and natural language processing. Table 2 presents the relation between each specification and the corresponding known method. Linear CA is simple CA without the nonlinear extension. Levy et al. [7] showed that GloVe is a matrix factorization of shifted PMI; based on that discussion, one can regard GloVe as a special case of KCA, as shown in Table 2. When some element of a contingency table is zero, its PMI diverges to $-\infty$. The G-test for contingency tables avoids this problem because each term of the G-test statistic, $2 F_{ij} \log(F_{ij}/E_{ij})$ with $E_{ij}$ the expected frequency, vanishes as $F_{ij} \to 0$. In this research, we call the SVD of the matrix of these G-test terms simply the G-test.
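To make the SGNS row of Table 2 concrete, the shifted positive PMI factorization of Levy and Goldberg can be sketched as follows; the function name, toy counts, and truncation dimension are our own choices.

```python
import numpy as np

def sppmi_vectors(F, k=5, dim=2):
    """Word vectors from the truncated SVD of max(PMI - log k, 0),
    following the SGNS-as-matrix-factorization view."""
    F = np.asarray(F, dtype=float)
    N = F.sum()
    r = F.sum(axis=1, keepdims=True)
    c = F.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(F * N / (r * c))
    sppmi = np.maximum(pmi - np.log(k), 0.0)   # shifted positive PMI
    U, s, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(s[:dim])       # symmetric SVD weighting

# hypothetical word/context counts
W = sppmi_vectors(np.array([[40, 2, 1], [3, 50, 2], [1, 2, 30]]))
```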
5 Kernels for Natural language processing
In this section, we specifically examine vector representations of words. We consider a contingency table in which entry $F_{ij}$ is the number of times that word $i$ appears in context $j$. Using this contingency table, we can formulate SGNS as one specialized KCA, as shown in Table 2. Linear CA and the G-test also provide vector representations of words using the same contingency table.
KCA can introduce tunable weights for stop words (SW) through a diagonal kernel whose $i$-th entry carries an added weight when the $i$-th word is a stop word. We designate this mechanism as the stop-word kernel.
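As an illustration of the idea (our own minimal sketch, not the paper's exact kernel), a diagonal matrix with a tunable weight on stop-word rows down-weights their contribution before the decomposition:

```python
import numpy as np

# hypothetical vocabulary; "the" and "of" are treated as stop words
vocab = ["the", "cat", "of", "dog"]
stop = {"the", "of"}
alpha = 0.1                                   # tunable stop-word weight

# diagonal stop-word kernel
w = np.array([alpha if v in stop else 1.0 for v in vocab])
K = np.diag(w)

# applying the kernel to a word-context count matrix scales the
# stop-word rows by alpha and leaves the other rows unchanged
F = np.arange(1, 13, dtype=float).reshape(4, 3)
Fw = K @ F
```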
When a dataset of similarity scores between word pairs is available, the scores can be encoded into the vector representations of words using our kernel mechanism. We can encode such scores by specializing the operators in Eq. (30). The weights given to the similarity scores are tunable parameters; for a particular choice of these parameters, problem (30) reduces to a decomposition involving the Hadamard product of the similarity-score matrix with the matrix being decomposed. We designate this mechanism as the word-similarity kernel.
6 Experiments

This section explains experimental evaluations of vector representations of words based on our proposed system. To calculate the contingency table, we use the program code (https://bitbucket.org/omerlevy/hyperwords) used in [7]. For the SW kernel, we use the stop-words set in NLTK [1]. For WS, we use the scores of the MEN dataset reported by Bruni et al. [2]. First, 20% of the Text8 corpus (http://mattmahoney.net/dc/text8.zip) is used as the training text data. The stop-words set and the word-similarity data are small compared to the entire Text8 corpus; consequently, the stop-word kernel and the word-similarity kernel have only a small effect when the whole Text8 corpus is used.
The 100-dimensional word vector representations are evaluated by comparing the ranking of their cosine similarities with the ranking in the test dataset. The rankings are compared using Spearman's rank correlation coefficient. We also show the SGNS result for comparison. Table 3 presents the evaluation results. Linear CA provides vector representations of two types, and we present the results of both representations; the same applies to the G-test. The results for Linear CA and the G-test with the SW kernel and the WS kernel are shown as "with SW" or "with WS" in the table.
| Method | | | |
|---|---|---|---|
| Linear CA with SW | 0.198 | 0.088 | 0.212 |
| Linear CA with SW | 0.222 | 0.087 | 0.421 |
| Linear CA with WS | 0.190 | 0.084 | 0.201 |
| Linear CA with WS | 0.213 | 0.074 | 0.407 |
| G-test with SW | 0.178 | 0.078 | 0.179 |
| G-test with SW | 0.186 | 0.065 | 0.369 |
| G-test with WS | 0.139 | 0.074 | 0.171 |
| G-test with WS | 0.170 | 0.055 | 0.365 |
Although Linear CA and the G-test have no tunable parameters, their results are comparable to those of SGNS. In most cases, the SW kernel enhances accuracy. The WS kernel enhances accuracy slightly in some cases.
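The evaluation procedure described above can be sketched as follows; this minimal version (our own, ignoring rank ties) computes Spearman's coefficient between cosine similarities and gold scores.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks
    (ties are ignored in this sketch)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

def evaluate(vectors, pairs, gold):
    """Rank-correlate cosine similarities of word pairs with gold scores."""
    sims = []
    for i, j in pairs:
        u, v = vectors[i], vectors[j]
        sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return spearman(np.array(sims), np.array(gold))
```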
We demonstrated that Linear Correspondence Analysis (CA) is equivalent to defining the Gini-index with a rotated and scaled one-hot encoding. Moreover, we generalized CA as far as possible while maintaining the relation between the Gini-index and CA. Kernel Correspondence Analysis (KCA) was introduced based on this nonlinear generalization of CA. KCA yields various known analyses for categorical data and natural language processing by specializing kernels; for example, KCA can give the G-test, skip-gram with negative-sampling (SGNS), and GloVe as special cases. We introduced two kernels for natural language processing based on KCA and evaluated the proposed mechanism on the problem of vector representations of words. Although Linear CA and the G-test have no tunable parameters, their results are comparable to those obtained for SGNS. Additionally, we showed that kernels with tunable parameters can enhance accuracy.
-  S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, 2009.
-  E. Bruni, G. Boleda, M. Baroni, and N. K. Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
-  L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1):116–131, Jan. 2002.
-  R. A. Fisher. The precision of discriminant functions. Annals of Eugenics (London), 10:422–429, 1940.
-  C. Gini. Variability and Mutability, Contribution to the Study of Statistical Distributions and Relations. Studi Economico-Giuridici della R. Universita de Cagliari, 1912. Reviewed in: R. J. Light and B. H. Margolin. An analysis of variance for categorical data. J. American Statistical Association, 66:534–544, 1971.
-  O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2177–2185, 2014.
-  O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
-  H. Niitsuma and T. Okada. Kernel PCA for categorical data. IEICE Technical Report: Artificial Intelligence and Knowledge-Based Processing, 103(305):13–17, Sep. 2003.
-  H. Niitsuma and T. Okada. Covariance and PCA for categorical variables. In Advances in Knowledge Discovery and Data Mining, 9th Pacific-Asia Conference, PAKDD 2005, Hanoi, Vietnam, May 18-20, 2005, Proceedings, pages 523–528, 2005.
-  T. Okada. A note on covariances for categorical data. In K. Leung, L. Chan, and H. Meng, editors, Intelligent Data Engineering and Automated Learning - IDEAL 2000, 2000.
-  J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
-  D. Picca, B. Curdy, and F. Bavaud. Non-linear correspondence analysis in text retrieval: A kernel view. In Proceedings of JADT, 2006.
-  K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 337–346, New York, NY, USA, 2011. ACM.