1 Introduction
Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are well-known multivariate data reduction techniques. These two methods are analyses of a correlation or variance–covariance matrix of a multivariate quantitative dataset. Correspondence Analysis (CA) is the analogous analysis for categorical data (CD).
Also, CA is a statistical visualization method for picturing the associations between the levels of a two-way contingency table. To illustrate CA, one might consider the contingency table presented in Table 1, which shows the data well known as Fisher's data [4] for the eye and hair color of people in Caithness, Scotland. The CA of these data yields the graphical display presented in Figure 1, which shows the correspondence between eye and hair color.

PCA is the eigenvalue decomposition of a covariance matrix. CCA is the singular value decomposition (SVD) of a matrix representing correlation, with modification of the average and the inhomogeneity. CA can be regarded as SVD of the covariance of CD with modification of the average and inhomogeneity, like that of CCA. The Gini index is another way of addressing the variance of CD. Okada [11, 10] reported that defining the Gini index using one-hot encoding on an appropriately rotated space yields reasonable values for the covariance. This research shows that the rotated Gini index with modified inhomogeneity is equivalent to CA.

Picca et al. [13] introduced a nonlinear kernel extension of CA. Niitsuma et al. [9] introduced a nonlinear kernel extension of the rotated Gini index. The equivalence between CA and the modified Gini index lets us define a more general nonlinear kernel extension of CA. Calling this nonlinear kernel extension Kernel Correspondence Analysis (KCA), we note that it yields well-known analyses for CD and for natural language processing by specializing the kernels. For example, our KCA can give the G-test, skip-gram with negative sampling (SGNS) [8, 6], and GloVe [12] as special cases.
We apply KCA to natural language processing.
In particular, we examine vector representations of words.
Mikolov et al. [8] introduced dense vector representations of words referred to as word2vec. The vector representations give meaning to subtraction and addition operations on the vectors. For example, the analogy "King is to queen as man is to woman." is encoded in the vector space by a vector equation of the form
$\overrightarrow{\mathrm{king}} - \overrightarrow{\mathrm{man}} + \overrightarrow{\mathrm{woman}} \approx \overrightarrow{\mathrm{queen}}$.
Sometimes CA also has a similar property. For example, a similar vector relation among the category points can be seen in Figure 1.
Levy et al. [6] showed that skip-gram with negative sampling (SGNS), which is one of the word2vec models, can be regarded as an analysis of the two-way contingency table in which the rows represent words and the columns represent contexts. They showed that SGNS is a matrix factorization of the pointwise mutual information (PMI) of the contingency table. Actually, PMI can be regarded as a log-scale representation of a contingency table.
We formulate this mechanism within KCA by providing the appropriate kernels.
2 Correspondence Analysis
Actually, CA is a generalized singular value decomposition (GSVD; see http://forrest.psych.unc.edu/research/vistaframes/pdf/chap11.pdf) based on a contingency table for two categorical variables such that
(1)  $P - r c^\top = U \Sigma V^\top$
(2)  $U^\top D_r^{-1} U = I$
(3)  $V^\top D_c^{-1} V = I$
where $F$ is an $n_x \times n_y$ contingency table whose entry $F_{ij}$ gives the frequency with which the row category $x_i$ occurs together with the column category $y_j$. In addition, $x$ is the categorical variable representing the row side of the contingency table $F$, and $y$ is the categorical variable representing the column side. $r = P\mathbf{1}$ denotes the vector of row marginals, $c = P^\top\mathbf{1}$ is the vector of column marginals, and $n$ stands for the total number of observations, where $P = F/n$. $I$ denotes an identity matrix.
$D_v$ stands for the diagonal matrix whose diagonal entries are the components of the vector $v$. Eqs. (1)–(3) are the GSVD of $P - r c^\top$, so that $\Sigma$ is the diagonal matrix of generalized singular values. Usually, CA uses the following matrices for the visualization instead of using $U$ and $V$:
(4)  $X = D_r^{-1} U \Sigma$
(5)  $Y = D_c^{-1} V \Sigma$
For example, let us consider the contingency table presented in Table 1. The table presents the joint population distribution of the categorical variable $x$ for eye color and the categorical variable $y$ for hair color. For these data, $r$, $c$, and $n$ are obtained from the row sums, the column sums, and the grand total of Table 1. CA of Fisher's data is associated with the following GSVD.
(6) 
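To make the computation concrete, the following is a minimal numerical sketch (ours, not the authors' code) of the CA decomposition described above, written in Python with numpy. The Fisher table values used here are those commonly reproduced for this dataset (for example, the caith data in R's MASS package), so they should agree with Table 1; the exact normalization of the plotted coordinates may differ from the one used for Figure 1.

import numpy as np

def correspondence_analysis(F):
    """Minimal CA sketch: GSVD of P - r c^T computed via an ordinary SVD
    of the standardized matrix D_r^{-1/2} (P - r c^T) D_c^{-1/2}."""
    F = np.asarray(F, dtype=float)
    n = F.sum()                      # total number of observations
    P = F / n                        # correspondence matrix
    r = P.sum(axis=1)                # row marginals
    c = P.sum(axis=0)                # column marginals
    S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
    U0, sigma, V0t = np.linalg.svd(S, full_matrices=False)
    U = np.diag(r ** 0.5) @ U0       # GSVD left factor
    V = np.diag(c ** 0.5) @ V0t.T    # GSVD right factor
    X = np.diag(1 / r) @ U @ np.diag(sigma)   # row coordinates for plotting
    Y = np.diag(1 / c) @ V @ np.diag(sigma)   # column coordinates
    return X, Y, sigma

# Fisher's Caithness eye/hair data (values as commonly reproduced, e.g. R's caith)
F = np.array([[326,  38, 241, 110,   3],
              [688, 116, 584, 188,   4],
              [343,  84, 909, 412,  26],
              [ 98,  48, 403, 681,  85]])
X, Y, sigma = correspondence_analysis(F)

The columns of X and Y give the row and column coordinates used in displays such as Figure 1.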
3 Gini’s definition of Variance and its Extension
A contingency table can be constructed using the matrix product of two indicator matrices:
(7)  $F = Z_x^\top Z_y$
where
(8)  $Z_x = (\mathbf{u}(x_1), \mathbf{u}(x_2), \dots, \mathbf{u}(x_n))^\top, \quad Z_y = (\mathbf{v}(y_1), \mathbf{v}(y_2), \dots, \mathbf{v}(y_n))^\top .$
$Z_x$ is the indicator matrix of $x$, $\mathbf{u}(x_i)$ is the one-hot encoding of $x_i$, and $(x_1, y_1), \dots, (x_n, y_n)$ represent the $n$ observations, $x_i$ being the value of $x$ for the $i$-th observation. Similarly, $Z_y$ is the indicator matrix of $y$, $\mathbf{v}(y_i)$ is the one-hot encoding of $y_i$, and $y_i$ is the value of $y$ for the $i$-th observation.
For example, the one-hot encodings and the indicator matrix $Z_x$ of the eye-color variable $x$ in Fisher's data are
(9) 
$Z_y$ can be constructed similarly. The contingency table of Fisher's data is
(10) 
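As an illustration of Eqs. (7)–(8), the following sketch (our own, with hypothetical toy observations rather than the full Fisher data) builds the indicator matrices by one-hot encoding and recovers the contingency table as their product.

import numpy as np

def indicator_matrix(values, categories):
    """One-hot encode a list of categorical observations into an indicator
    matrix: row i is the one-hot encoding of the i-th observation."""
    index = {cat: k for k, cat in enumerate(categories)}
    Z = np.zeros((len(values), len(categories)))
    for i, v in enumerate(values):
        Z[i, index[v]] = 1.0
    return Z

# Toy observations (hypothetical, for illustration only)
x = ["blue", "light", "blue", "dark"]       # eye colour per observation
y = ["fair", "fair", "medium", "dark"]      # hair colour per observation
Zx = indicator_matrix(x, ["blue", "light", "medium", "dark"])
Zy = indicator_matrix(y, ["fair", "red", "medium", "dark", "black"])

F = Zx.T @ Zy   # contingency table as the product of the two indicator matrices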
Based on each observation, we can define the Gini index as
(11)  $\mathrm{Gini}(x) = \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d(x_i, x_j)$
where
(12)  $d(x_i, x_j) = \| \mathbf{u}(x_i) - \mathbf{u}(x_j) \|^2 .$
The Gini index is the variance of a categorical variable [5]. For example, the Gini index of the eye-color variable $x$ in Fisher's data can be computed directly from the indicator matrix $Z_x$, as shown in the sketch below.
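The sketch below computes the Gini index from an indicator matrix, assuming the pairwise form of Eqs. (11)–(12) as reconstructed above; for one-hot rows this reduces to the familiar expression $1 - \sum_k p_k^2$, which provides a convenient check.

import numpy as np

def gini_index(Z):
    """Gini index of a categorical variable from its indicator matrix Z
    (one one-hot row per observation): (1 / 2n^2) * sum_{i,j} ||z_i - z_j||^2."""
    n = Z.shape[0]
    diffs = Z[:, None, :] - Z[None, :, :]   # all pairwise differences
    return np.sum(diffs ** 2) / (2 * n * n)

# Small one-hot example: 4 observations over 3 categories
Z = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

# For one-hot rows the pairwise form reduces to 1 - sum_k p_k^2:
p = Z.mean(axis=0)
assert np.isclose(gini_index(Z), 1 - np.sum(p ** 2))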
A simple extension of this definition to the covariance, obtained by replacing one of the two difference vectors $\mathbf{u}(x_i) - \mathbf{u}(x_j)$ with $\mathbf{v}(y_i) - \mathbf{v}(y_j)$, does not give reasonable values for the covariance [11]. Okada [11, 10] showed that using a rotated one-hot encoding provides reasonable values for the covariance of CD:
(13)  $\mathrm{Cov}(x,y) = \max_{R} \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\mathbf{u}(x_i) - \mathbf{u}(x_j))^\top R \, (\mathbf{v}(y_i) - \mathbf{v}(y_j))$  subject to  $R R^\top = I$
where $R$ is a rotation matrix chosen to maximize the covariance. We can rewrite Eq. (13) using the contingency table $F$, which is used in CA:
(14)  $\mathrm{Cov}(x,y) = \max_{R} \mathrm{tr}\!\left( R^\top (P - r c^\top) \right)$  subject to  $R R^\top = I .$
Here we use the following relation
(15)  $\frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\mathbf{u}(x_i) - \mathbf{u}(x_j))^\top R \, (\mathbf{v}(y_i) - \mathbf{v}(y_j)) = \mathrm{tr}\!\left( R^\top \left( \tfrac{1}{n} Z_x^\top Z_y - r c^\top \right) \right) = \mathrm{tr}\!\left( R^\top (P - r c^\top) \right)$
One can solve the maximization problem by differentiating the following Lagrangian.
(16)  $L(R, \Lambda) = \mathrm{tr}\!\left( R^\top (P - r c^\top) \right) - \mathrm{tr}\!\left( \Lambda (R R^\top - I) \right)$
Therein, $\Lambda$ is a matrix of Lagrange multipliers. The condition for this Lagrangian to be stationary with respect to $R$ is $(P - r c^\top) = (\Lambda + \Lambda^\top) R$. This result demonstrates that $(P - r c^\top) R^\top$ must be a symmetric matrix. We can give a solution to this maximization problem using the following SVD.
(17)  $P - r c^\top = U_0 \Sigma_0 V_0^\top$
It is readily apparent that $R = U_0 V_0^\top$ satisfies all necessary conditions. One can conclude that $R = U_0 V_0^\top$ is the solution of problem (13).
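This maximization has the form of an orthogonal Procrustes-type problem, so the SVD-based solution can be checked numerically; in the sketch below a random matrix simply stands in for $P - r c^\top$.

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 5))                     # stands in for P - r c^T

# Solution of max_R tr(R^T M) subject to R R^T = I via the SVD of M
U0, sigma, V0t = np.linalg.svd(M, full_matrices=False)
R = U0 @ V0t                                    # optimal rotation
assert np.allclose(R @ R.T, np.eye(4))
assert np.isclose(np.trace(R.T @ M), sigma.sum())   # maximum = sum of singular values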
Recalling that CA is the GSVD of $P - r c^\top$, the GSVD can be computed using the ordinary SVD of
(18)  $S = D_r^{-1/2} (P - r c^\top) D_c^{-1/2}$
(19)  $S = \tilde{U} \Sigma \tilde{V}^\top$
as shown below.
(20)  $U = D_r^{1/2} \tilde{U}, \quad V = D_c^{1/2} \tilde{V}$
Substituting this relation into Eq. (14) produces the following.
(21)  $\mathrm{Cov}(x,y) = \max_{R} \mathrm{tr}\!\left( R^\top D_r^{1/2} S D_c^{1/2} \right)$  subject to  $R R^\top = I$
The following relations show the solution of this optimization problem. Thus, we can say that CA is equivalent to the problem (21).
(22)  
(23)  
(24)  
(25)  
(26) 
Based on these relations, we introduce the scaled one-hot encoding as follows.
(27)  $\tilde{\mathbf{u}}(x_i) = D_r^{-1/2} \mathbf{u}(x_i), \quad \tilde{\mathbf{v}}(y_i) = D_c^{-1/2} \mathbf{v}(y_i)$
It is noteworthy that
(28)  $\frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\tilde{\mathbf{u}}(x_i) - \tilde{\mathbf{u}}(x_j))^\top R \, (\tilde{\mathbf{v}}(y_i) - \tilde{\mathbf{v}}(y_j)) = \mathrm{tr}\!\left( R^\top D_r^{-1/2} (P - r c^\top) D_c^{-1/2} \right) = \mathrm{tr}( R^\top S )$
Substituting this relation into problem (21) yields the following.
(29)  $\max_{R} \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\tilde{\mathbf{u}}(x_i) - \tilde{\mathbf{u}}(x_j))^\top R \, (\tilde{\mathbf{v}}(y_i) - \tilde{\mathbf{v}}(y_j))$  subject to  $R R^\top = I$
It is readily apparent that this problem defines the rotated Gini index using the scaled one-hot encoding. This generalized Gini index is equivalent to CA.
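A small numerical check of this equivalence, under the $D_r^{-1/2}$, $D_c^{-1/2}$ scaling used in our reconstruction of Eq. (27): the covariance matrix of the centered, scaled one-hot encodings coincides with the standardized matrix $S$ whose SVD defines CA, so both yield the same singular value spectrum. The example table is arbitrary.

import numpy as np

F = np.array([[10., 2., 3.],
              [ 4., 8., 1.],
              [ 2., 3., 9.]])                       # a small example table
n_obs = int(F.sum())
rows, cols = np.nonzero(F)
x = np.repeat(rows, F[rows, cols].astype(int))      # expand the table into observations
y = np.repeat(cols, F[rows, cols].astype(int))
Zx = np.eye(F.shape[0])[x]                          # one-hot encodings of x
Zy = np.eye(F.shape[1])[y]                          # one-hot encodings of y

P = F / n_obs
r, c = P.sum(axis=1), P.sum(axis=0)
Zx_s = Zx @ np.diag(r ** -0.5)                      # scaled one-hot encodings
Zy_s = Zy @ np.diag(c ** -0.5)

# Covariance of the centered, scaled encodings ...
S_gini = Zx_s.T @ Zy_s / n_obs - np.outer(Zx_s.mean(axis=0), Zy_s.mean(axis=0))
# ... equals the CA matrix S, so the singular values agree
S_ca = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
assert np.allclose(np.linalg.svd(S_gini, compute_uv=False),
                   np.linalg.svd(S_ca, compute_uv=False))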
4 Nonlinear extension
This section presents an attempt to generalize CA as far as possible while keeping the relation between the Gini index and CA. Let us introduce nonlinear mappings and the corresponding operators on the nonlinearly mapped spaces. Rewriting the problem (29) using this nonlinear extension yields the following.
subject to  (30) 
Consider expanding this expression using the rules below:
(31)  
(32)  
(33)  
(34)  
(35)  
(36)  
(37)  
(38) 
We must consider how to evaluate an expression that includes operators which are not defined directly. To do so, we move these operators outside the parentheses, for which the rules (31)–(36) are useful: in these rules, rewriting the left-hand side to the right-hand side moves the operators outside the parentheses. Applying these rewriting rules to the problem (30) yields the following.
subject to  (39) 
where
(40) 
When the nonlinear mappings enter only through inner products, introducing the kernel matrices $K_x$ and $K_y$ to be
(41) 
gives
(42)  
subject to  (43) 
Specifying the operators and the kernel matrices provides various well-known analyses for CD and natural language processing. Table 2 presents the relation between these specifications and known methods. Linear CA is simple CA without the nonlinear extension. Levy et al. [7] showed that GloVe [12] is a matrix factorization of shifted PMI. Based on that discussion, one can regard GloVe [12] as a special case of KCA, as shown in Table 2. When some element of a contingency table is zero, its logarithm, and hence the corresponding PMI entry, diverges to $-\infty$. The G-test for contingency tables can avoid this problem because its element-wise terms vanish for zero counts. In this research, we call the SVD of the matrix of these element-wise G-test terms the G-test.
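The following helper functions are our own illustration of two of the specializations listed in Table 2, not the authors' implementation: word vectors obtained from the SVD of a PMI matrix (clipped here to positive PMI to handle zero counts) and from the SVD of the matrix of element-wise G-test terms.

import numpy as np

def pmi_vectors(F, dim=2):
    """Vectors from the (shifted/positive) PMI view: SVD of the PMI matrix of
    a word-context contingency table F."""
    n = F.sum()
    P = F / n
    r = P.sum(axis=1, keepdims=True)
    c = P.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(P / (r @ c))
    pmi = np.maximum(pmi, 0.0)        # positive PMI avoids -inf at zero counts
    U, s, Vt = np.linalg.svd(pmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(s[:dim])

def gtest_vectors(F, dim=2):
    """Vectors from the G-test view: SVD of the matrix of element-wise G-test
    terms F_ij * log(F_ij / E_ij), which vanish at zero counts."""
    n = F.sum()
    E = np.outer(F.sum(axis=1), F.sum(axis=0)) / n   # expected frequencies
    G = np.where(F > 0, F * np.log(np.where(F > 0, F, 1.0) / E), 0.0)
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :dim] * np.sqrt(s[:dim])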
5 Kernels for Natural language processing
In this section, we specifically examine vector representations of words. We consider the contingency table
(44)  $F_{wc} = \#(w, c)$
where $F_{wc}$ is the number of times that word $w$ appears in context $c$. Using this contingency table, we can formulate SGNS as one specialized KCA, as shown in Table 2. Linear CA and the G-test also provide vector representations of words using the contingency table.
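For completeness, here is a minimal sketch of how such a word–context contingency table can be collected from a token stream with a sliding window; the experiments in Section 6 instead use the hyperwords code of [7].

from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """F_wc = #(w, c): number of times context word c appears within `window`
    positions of word w in the token stream."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

# Tiny usage example on a toy token stream
tokens = "the king and the queen rule the land".split()
F = cooccurrence_counts(tokens, window=2)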
KCA can introduce tunable weights for stop words (SW) using the following kernel:
(45) 
where
(46) 
where the vector in Eq. (46) has the tunable weight added to its $i$-th component when the $i$-th word is an SW. We designate this mechanism as the stop word kernel.
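Because Eqs. (45)–(46) are not reproduced here, the following sketch shows only one possible reading of the stop word kernel: a diagonal word-side matrix in which a tunable weight is added to the entries of stop words. The function name and the exact functional form are our assumptions.

import numpy as np

def stop_word_kernel(vocab, stop_words, weight=0.1):
    # Hypothetical reading: start from the identity and add a tunable weight
    # to the diagonal entries corresponding to stop words.
    s = np.array([weight if w in stop_words else 0.0 for w in vocab])
    return np.eye(len(vocab)) + np.diag(s)

vocab = ["the", "king", "queen", "of", "land"]
K_sw = stop_word_kernel(vocab, stop_words={"the", "of"}, weight=0.1)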
When we have datasets with similarity scores between pairs of words, such as WordSim353 [3], the scores can be encoded into the vector representations of words using our kernel mechanism.
We can encode such scores by specializing the operators in Eq. (30) as presented below.
(47) 
where
(48) 
are tunable parameters. With this choice, the problem (30) becomes
subject to  
(49) 
where $\odot$ is the Hadamard product. We designate this mechanism as the word similarity kernel.
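Similarly, since Eqs. (47)–(49) are not reproduced here, the sketch below is only a hypothetical illustration of the word similarity kernel: external similarity scores are collected into a symmetric word-side matrix that can then be combined by a Hadamard product as in problem (49); alpha and beta stand in for the tunable parameters of Eq. (48).

import numpy as np

def word_similarity_kernel(vocab, sim_scores, alpha=1.0, beta=0.0):
    # Hypothetical word-side matrix of external similarity scores; alpha and
    # beta stand in for the tunable parameters, whose exact roles we do not
    # reproduce. The matrix is meant to enter through a Hadamard product.
    index = {w: i for i, w in enumerate(vocab)}
    W = np.full((len(vocab), len(vocab)), beta)
    np.fill_diagonal(W, 1.0)
    for (w1, w2), score in sim_scores.items():
        if w1 in index and w2 in index:
            W[index[w1], index[w2]] = W[index[w2], index[w1]] = alpha * score
    return W

vocab = ["king", "queen", "man", "woman"]
scores = {("king", "queen"): 0.9, ("man", "woman"): 0.8}   # e.g. MEN-style scores rescaled to [0, 1]
K_ws = word_similarity_kernel(vocab, scores)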
6 Experiments
This section explains experimental evaluations of vector representations of words based on our proposed method. To calculate the contingency table, we use the program code (https://bitbucket.org/omerlevy/hyperwords) used in [7]. For the SW kernel, we use the stop word set in NLTK [1]. For WS, we use the scores of the MEN dataset [2] reported by Bruni et al. First, 20% of the Text8 corpus (http://mattmahoney.net/dc/text8.zip) is used as training text data. The stop word set and the word similarity data are too small compared to the entire Text8 corpus; consequently, the stop word kernel and the word similarity kernel have only a small effect when the whole Text8 corpus is used.
For evaluation, we use the WordSim353 dataset [3], the MEN dataset [2] of Bruni et al., and the Mechanical Turk dataset [14] reported by Radinsky et al.
The 100-dimensional word vector representations are evaluated by comparing the ranking of their cosine similarities with the ranking in the test dataset. The rankings are compared using Spearman's $\rho$ (a minimal sketch of this ranking comparison is given at the end of this section). We also show the SGNS result for comparison. Table 3 presents the evaluation results. Linear CA provides vector representations of two types, and we present the results for both. We also show G-test results. The results for Linear CA and the G-test with the SW kernel or the WS kernel are shown as "with SW" or "with WS" in the table.

Name  WordSim353[3]  Bruni[2]  Radinsky[14]

SGNS[7]  0.192  0.061  0.314 
Linear CA  0.189  0.081  0.201 
Linear CA  0.213  0.072  0.407 
Linear CA with SW  0.198  0.088  0.212 
Linear CA with SW  0.222  0.087  0.421 
Linear CA with WS  0.190  0.084  0.201 
Linear CA with WS  0.213  0.074  0.407 
G-test  0.139  0.074  0.156
G-test  0.169  0.051  0.365
G-test with SW  0.178  0.078  0.179
G-test with SW  0.186  0.065  0.369
G-test with WS  0.139  0.074  0.171
G-test with WS  0.170  0.055  0.365
Although Linear CA and the G-test have no tunable parameters, their results are comparable to SGNS. In most cases, the SW kernel enhances accuracy. The WS kernel enhances accuracy slightly in some cases.
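For reference, here is a minimal sketch of the evaluation protocol described above (the function name is ours): word pairs are scored by the cosine similarity of their vectors, and the resulting ranking is compared with the human ranking by Spearman's $\rho$.

import numpy as np
from scipy.stats import spearmanr

def evaluate_word_vectors(vectors, index, pairs_with_scores):
    # Spearman's rho between the cosine-similarity ranking of the word vectors
    # and the human ranking in the test set; out-of-vocabulary pairs are skipped.
    cos, gold = [], []
    for w1, w2, score in pairs_with_scores:
        if w1 in index and w2 in index:
            a, b = vectors[index[w1]], vectors[index[w2]]
            cos.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            gold.append(score)
    return spearmanr(cos, gold).correlation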
7 Conclusion
We demonstrate that Linear Correspondence Analysis (CA) is equivalent to defining the Gini index with a rotated and scaled one-hot encoding. Moreover, we attempted to generalize CA as far as possible while maintaining the relation between the Gini index and CA. Kernel Correspondence Analysis (KCA) is introduced based on this nonlinear generalization of CA. KCA gives various known analyses for categorical data and natural language processing by specializing the kernels. For example, KCA can give the G-test, skip-gram with negative sampling (SGNS), and GloVe as special cases. We introduce two kernels for natural language processing based on KCA. The proposed mechanism is evaluated by applying it to the problem of vector representations of words. Although Linear CA and the G-test have no tunable parameters, their results are comparable to those obtained for SGNS. Additionally, we show that kernels with tunable parameters can enhance accuracy.
References
 [1] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, 2009.
 [2] E. Bruni, G. Boleda, M. Baroni, and N. K. Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
 [3] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1):116–131, Jan. 2002.
 [4] R. A. Fisher. The precision of discriminant functions. Annals of Eugenics (London), 10:422–429, 1940.
 [5] C. Gini. Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Universita de Cagliari (1912). Reviewed in: Light, R. J. and Margolin, B. H.: An analysis of variance for categorical data. J. American Statistical Association, 66:534–544, 1971.
 [6] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 2177–2185, 2014.
 [7] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.
 [8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.

 [9] H. Niitsuma and T. Okada. Kernel PCA for categorical data. IEICE Technical Report. Artificial Intelligence and Knowledge-Based Processing, 103(305):13–17, Sep. 2003.
 [10] H. Niitsuma and T. Okada. Covariance and PCA for categorical variables. In Advances in Knowledge Discovery and Data Mining, 9th Pacific-Asia Conference, PAKDD 2005, Hanoi, Vietnam, May 18-20, 2005, Proceedings, pages 523–528, 2005.
 [11] T. Okada. A note on covariances for categorical data. In K. Leung, L. Chan, and H. Meng, editors, Intelligent Data Engineering and Automated Learning  IDEAL 2000, 2000.
 [12] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
 [13] D. Picca, B. Curdy, and F. Bavaud. Nonlinear correspondence analysis in text retrieval: A kernel view. In Proceedings of JADT, 2006.
 [14] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 337–346, New York, NY, USA, 2011. ACM.