Word2Vec is a special case of Kernel Correspondence Analysis and Kernels for Natural Language Processing

We show that Correspondence Analysis (CA) is equivalent to defining the Gini-index with an appropriately scaled one-hot encoding. Using this relation, we introduce a non-linear kernel extension of CA. The extended CA reproduces well-known analyses for categorical data (CD) and natural language processing by specializing its kernels. For example, our formulation yields the G-test, skip-gram with negative sampling (SGNS), and GloVe as special cases. We introduce two kernels for natural language processing based on our formulation. The first is a stop-word (SW) kernel; the second is a word-similarity (WS) kernel. The SW kernel introduces appropriate weights for stop words. The WS kernel makes it possible to use word-similarity test data as training data for vector-space representations of words. We show that these kernels enhance accuracy when the training data are not sufficiently large.


1 Introduction

Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are well-known multivariate data reduction techniques. These two methods analyze a correlation or variance–covariance matrix of a multivariate quantitative dataset. Correspondence Analysis (CA), in contrast, is an analysis of categorical data (CD).

Also, CA is a statistical visualization method for picturing the associations between the levels of a two-way contingency table. To illustrate CA, one might consider the contingency table presented in Table 1, which shows the data well known as Fisher's data [4] for the eye and hair color of people in Caithness, Scotland. The CA of these data yields the graphical display presented in Figure 1, which shows the correspondence between eye and hair color.

PCA is the eigenvalue decomposition of a covariance matrix. CCA is the singular value decomposition (SVD) of a matrix representing correlation, with corrections for the average and for inhomogeneity. CA can likewise be regarded as an SVD of the covariance of CD with corrections for the average and for inhomogeneity, like those of CCA. The Gini-index is another way of expressing the variance of CD. Okada [11, 10] reported that defining the Gini-index using one-hot encodings in an appropriately rotated space yields reasonable values for the covariance. This research shows that the rotated Gini-index with an inhomogeneity correction is equivalent to CA.

Picca et al. [13] introduced a nonlinear kernel extension of CA. Niitsuma et al. [9] introduced a nonlinear kernel extension of the rotated Gini-index. The equivalence between CA and the modified Gini-index allows a more general nonlinear kernel extension of CA. We call this extension Kernel Correspondence Analysis (KCA) and note that it reproduces well-known analyses for CD and natural language processing by specializing its kernels. For example, KCA yields the G-test, skip-gram with negative sampling (SGNS) [8, 6], and GloVe [12] as special cases.

We apply KCA to natural language processing, specifically to vector representations of words. Mikolov et al. [8] introduced dense vector representations of words referred to as word2vec. These vector representations give meaning to subtraction and addition of word vectors. For example, the analogy "King is to queen as man is to woman." is encoded in the vector space by the vector equation king - man + woman ≈ queen. CA sometimes has a similar property; a comparable vector relation between the points can be seen in Figure 1.

Levy and Goldberg [6] showed that skip-gram with negative sampling (SGNS), one of the word2vec models, can be regarded as an analysis of a two-way contingency table in which the rows represent words and the columns represent contexts. They showed that SGNS is a matrix factorization of the pointwise mutual information (PMI) of the contingency table; PMI can be regarded as a log-scale representation of a contingency table. We formulate this mechanism within KCA by providing the appropriate kernels.

2 Correspondence Analysis

eye \ hair   fair   red   medium   dark   black
blue          326    38      241    110       3
light         688   116      584    188       4
medium        343    84      909    412      26
dark           98    48      403    681      85

Table 1: Fisher's data.
Figure 1: Visualizing Fisher's data.

Actually, CA is a generalized singular value decomposition (GSVD) (see http://forrest.psych.unc.edu/research/vista-frames/pdf/chap11.pdf) based on a contingency table for two categorical variables such that

$D_r^{-1} \left( \tfrac{1}{n} N - r c^{\top} \right) D_c^{-1} = X \Sigma Y^{\top}$    (1)

$X^{\top} D_r X = I$    (2)

$Y^{\top} D_c Y = I$    (3)

where $N$ is a contingency table whose entry $N_{ij}$ gives the frequency with which the $i$-th row category occurs together with the $j$-th column category. In addition, $x$ is the categorical variable representing the row side of the contingency table $N$, and $y$ is the categorical variable representing the column side. $r = \frac{1}{n} N \mathbf{1}$ denotes the vector of row marginals, $c = \frac{1}{n} N^{\top} \mathbf{1}$ is the vector of column marginals, and $n$ stands for the total number of observations, where $n = \sum_{ij} N_{ij}$.

$I$ denotes an identity matrix.

$D_v$ stands for the diagonal matrix whose diagonal entries are the components of the vector $v$; in particular, $D_r$ and $D_c$ are the diagonal matrices of the row and column marginals. Eq. (1) is the GSVD of $D_r^{-1} ( \frac{1}{n} N - r c^{\top} ) D_c^{-1}$, so that $X$, $\Sigma$, and $Y$ satisfy the constraints (2) and (3). Usually, CA uses the following matrices for the visualization instead of using $X$ and $Y$.

$F = X \Sigma$    (4)

$G = Y \Sigma$    (5)

For example, let us consider the contingency table presented in Table 1. The table presents the joint population distribution of the categorical variable $x$ for eye color and the categorical variable $y$ for hair color. For these data, $n = 5387$, $r = \frac{1}{5387}(718, 1580, 1774, 1315)^{\top}$, and $c = \frac{1}{5387}(1455, 286, 2137, 1391, 118)^{\top}$. CA of Fisher's data is associated with the GSVD of Eq. (1) applied to the matrix $N$ of Table 1:

$D_r^{-1} \left( \tfrac{1}{5387} N - r c^{\top} \right) D_c^{-1} = X \Sigma Y^{\top}$    (6)
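To make this decomposition concrete, the sketch below runs textbook CA on Table 1 through the equivalent SVD of the standardized residual matrix and prints the leading two coordinates of Eqs. (4)-(5), which is what Figure 1 plots. It is a minimal illustration under the scaling convention used above, not the code behind the paper's figure.

```python
import numpy as np

# Fisher's data (Table 1): rows = eye colour, columns = hair colour.
N = np.array([[326,  38, 241, 110,   3],    # blue
              [688, 116, 584, 188,   4],    # light
              [343,  84, 909, 412,  26],    # medium
              [ 98,  48, 403, 681,  85]])   # dark

n = N.sum()
P = N / n                       # relative frequencies N / n
r = P.sum(axis=1)               # row marginals
c = P.sum(axis=0)               # column marginals
Dr_isqrt = np.diag(1.0 / np.sqrt(r))
Dc_isqrt = np.diag(1.0 / np.sqrt(c))

# SVD of the standardized residuals gives the CA solution.
S = Dr_isqrt @ (P - np.outer(r, c)) @ Dc_isqrt
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates (Eqs. (4)-(5)) used for the visualization.
row_coords = Dr_isqrt @ U * sigma        # eye colours
col_coords = Dc_isqrt @ Vt.T * sigma     # hair colours
print(np.round(row_coords[:, :2], 3))
print(np.round(col_coords[:, :2], 3))
```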

3 Gini’s definition of Variance and its Extension

A contingency table can be constructed as the matrix product of two indicator matrices:

$N = Z_x^{\top} Z_y$    (7)

where

$Z_x = \begin{pmatrix} z_x(x^{(1)}) & \cdots & z_x(x^{(n)}) \end{pmatrix}^{\top}, \qquad Z_y = \begin{pmatrix} z_y(y^{(1)}) & \cdots & z_y(y^{(n)}) \end{pmatrix}^{\top}$    (8)

$Z_x$ is the indicator matrix of $x$, and $z_x(\cdot)$ is the one-hot encoding of $x$. $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$ represent the observations, and $x^{(i)}$ is the value of $x$ for the $i$-th observation. $Z_y$ is the indicator matrix of $y$, $z_y(\cdot)$ is the one-hot encoding of $y$, and $y^{(i)}$ is the value of $y$ for the $i$-th observation.

For example, the one-hot encodings and the indicator matrix of $x$ (eye color) in Fisher's data are

$z_x(\mathrm{blue}) = (1,0,0,0)^{\top}$, $z_x(\mathrm{light}) = (0,1,0,0)^{\top}$, $z_x(\mathrm{medium}) = (0,0,1,0)^{\top}$, $z_x(\mathrm{dark}) = (0,0,0,1)^{\top}$, $\qquad Z_x = \begin{pmatrix} z_x(x^{(1)}) & \cdots & z_x(x^{(n)}) \end{pmatrix}^{\top}$    (9)

$Z_y$ can be constructed similarly. The contingency table of Fisher's data is

$N = Z_x^{\top} Z_y = \begin{pmatrix} 326 & 38 & 241 & 110 & 3 \\ 688 & 116 & 584 & 188 & 4 \\ 343 & 84 & 909 & 412 & 26 \\ 98 & 48 & 403 & 681 & 85 \end{pmatrix}$    (10)
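A small sketch of Eqs. (7)-(10): build the two indicator matrices from raw observations and recover the contingency table as their product. The four toy observations below are made up for illustration; with the full Caithness data the product reproduces Table 1.

```python
import numpy as np

# Toy sample of paired categorical observations (made-up for illustration).
eye  = ["blue", "light", "blue", "dark"]
hair = ["fair", "fair", "medium", "black"]
eye_levels  = ["blue", "light", "medium", "dark"]
hair_levels = ["fair", "red", "medium", "dark", "black"]

def one_hot(values, levels):
    """Indicator matrix: one row per observation, one column per level."""
    Z = np.zeros((len(values), len(levels)))
    for i, v in enumerate(values):
        Z[i, levels.index(v)] = 1.0
    return Z

Zx = one_hot(eye, eye_levels)     # indicator matrix of the row variable
Zy = one_hot(hair, hair_levels)   # indicator matrix of the column variable

# The contingency table is the product of the two indicator matrices.
N = Zx.T @ Zy
print(N)
```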

Based on each observation, we can define the Gini-index as

$\mathrm{Gini}(x) = \frac{1}{2 n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| z_x(x^{(i)}) - z_x(x^{(j)}) \right\|^{2}$    (11)

where

$\| v \|^{2} = v^{\top} v$    (12)

The Gini-index is the variance of a categorical variable [5]. For example, for the eye-color variable $x$ of Fisher's data, Eq. (11) gives $\mathrm{Gini}(x) = 1 - \sum_{k} p_{k}^{2} \approx 0.728$, where $p_k$ are the relative frequencies of the eye colors.
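A minimal sketch of this quantity, assuming the standard identity that the pairwise definition of Eq. (11) reduces to 1 - Σ_k p_k² for a single categorical variable; the marginal counts are taken from Table 1.

```python
import numpy as np

# Fisher's data (Table 1).
N = np.array([[326,  38, 241, 110,   3],
              [688, 116, 584, 188,   4],
              [343,  84, 909, 412,  26],
              [ 98,  48, 403, 681,  85]])

def gini_index(counts):
    """Gini-index of a categorical variable: 1 - sum_k p_k^2,
    i.e. the variance of the variable in Gini's sense."""
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print("eye colour :", round(gini_index(N.sum(axis=1)), 4))
print("hair colour:", round(gini_index(N.sum(axis=0)), 4))
```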

A simple extension of this definition to the covariance, obtained by replacing one of the two one-hot encodings in Eq. (11) with that of the other variable, does not give reasonable values for the covariance [11]. Okada [11, 10] showed that using a rotated one-hot encoding provides reasonable values for the covariance of CD:

$\max_{R} \; \frac{1}{2 n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( z_x(x^{(i)}) - z_x(x^{(j)}) \right)^{\top} R \left( z_y(y^{(i)}) - z_y(y^{(j)}) \right)$ subject to $R R^{\top} = I$    (13)

where $R$ is a rotation matrix (a matrix with orthonormal rows) chosen to maximize the covariance. We can rewrite the problem (13) using the matrix $\frac{1}{n} N - r c^{\top}$, which is used in CA:

$\max_{R} \; \operatorname{tr} \left( R^{\top} \left( \tfrac{1}{n} N - r c^{\top} \right) \right)$ subject to $R R^{\top} = I$    (14)

Here we use the following relation:

$\frac{1}{2 n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( z_x(x^{(i)}) - z_x(x^{(j)}) \right) \left( z_y(y^{(i)}) - z_y(y^{(j)}) \right)^{\top} = \frac{1}{n} Z_x^{\top} Z_y - r c^{\top} = \frac{1}{n} N - r c^{\top}$    (15)

One can solve the maximization problem by differentiating the following Lagrangian:

$L(R, \Lambda) = \operatorname{tr} \left( R^{\top} \left( \tfrac{1}{n} N - r c^{\top} \right) \right) - \operatorname{tr} \left( \Lambda \left( R R^{\top} - I \right) \right)$    (16)

Therein, $\Lambda$ is a matrix of Lagrange multipliers. The condition for this Lagrangian to be stationary with respect to $R$ is $\frac{1}{n} N - r c^{\top} = (\Lambda + \Lambda^{\top}) R$. This result demonstrates that $(\frac{1}{n} N - r c^{\top}) R^{\top}$ must be a symmetric matrix. We can give a solution to this maximization problem using the following SVD:

$\frac{1}{n} N - r c^{\top} = U_{0} \Sigma_{0} V_{0}^{\top}$    (17)

It is readily apparent that $R = U_{0} V_{0}^{\top}$ satisfies all of the necessary conditions. One can conclude that $R = U_{0} V_{0}^{\top}$ is the solution of the problem (13).

Recalling that CA is the GSVD of $D_r^{-1} ( \frac{1}{n} N - r c^{\top} ) D_c^{-1}$, the GSVD of Eqs. (1)–(3) can be computed using the SVD of

$\tilde{M} = D_r^{-1/2} \left( \tfrac{1}{n} N - r c^{\top} \right) D_c^{-1/2}$    (18)

$\tilde{M} = \tilde{U} \Sigma \tilde{V}^{\top}$    (19)

as shown below:

$X = D_r^{-1/2} \tilde{U}, \qquad Y = D_c^{-1/2} \tilde{V}$    (20)
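A minimal numerical check of this reduction, assuming the diagonal metrics $D_r$ and $D_c$ of the constraints (2)-(3): the generalized singular vectors obtained from an ordinary SVD satisfy the GSVD constraints and reproduce the original matrix. The random test matrix below is arbitrary.

```python
import numpy as np

def gsvd_via_svd(A, dr, dc):
    """Generalized SVD of A under the diagonal metrics diag(dr), diag(dc),
    computed through an ordinary SVD as in Eqs. (18)-(20)."""
    Dr_sqrt, Dc_sqrt = np.diag(np.sqrt(dr)), np.diag(np.sqrt(dc))
    U, s, Vt = np.linalg.svd(Dr_sqrt @ A @ Dc_sqrt, full_matrices=False)
    X = np.diag(1.0 / np.sqrt(dr)) @ U
    Y = np.diag(1.0 / np.sqrt(dc)) @ Vt.T
    return X, s, Y

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
dr = rng.uniform(0.5, 2.0, size=4)
dc = rng.uniform(0.5, 2.0, size=5)
X, s, Y = gsvd_via_svd(A, dr, dc)

# A = X diag(s) Y^T, with X and Y orthonormal under the metrics.
print(np.allclose(X @ np.diag(s) @ Y.T, A),
      np.allclose(X.T @ np.diag(dr) @ X, np.eye(X.shape[1])),
      np.allclose(Y.T @ np.diag(dc) @ Y, np.eye(Y.shape[1])))
```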

Applying the same correction for inhomogeneity as CA, that is, replacing $\frac{1}{n} N - r c^{\top}$ in the problem (14) with the matrix $\tilde{M}$ of Eq. (18), produces the following:

$\max_{R} \; \operatorname{tr} \left( R^{\top} \tilde{M} \right)$ subject to $R R^{\top} = I$    (21)

The following relations show that the solution of this optimization problem is $R = \tilde{U} \tilde{V}^{\top}$, and we can say that CA is equivalent to the problem (21).

$\tilde{M} R^{\top} = \tilde{U} \Sigma \tilde{U}^{\top}$ (symmetric)    (22)

$R R^{\top} = \tilde{U} \tilde{V}^{\top} \tilde{V} \tilde{U}^{\top} = I$    (23)

$\operatorname{tr}(R^{\top} \tilde{M}) = \operatorname{tr}(\Sigma)$    (24)

$\tilde{U} = D_r^{1/2} X$    (25)

$\tilde{V} = D_c^{1/2} Y$    (26)

Based on these relations, we introduce the scaled one-hot encoding as follows:

$\tilde{z}_x(x) = D_r^{-1/2} z_x(x), \qquad \tilde{z}_y(y) = D_c^{-1/2} z_y(y)$    (27)

It is noteworthy that

$\frac{1}{2 n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \tilde{z}_x(x^{(i)}) - \tilde{z}_x(x^{(j)}) \right) \left( \tilde{z}_y(y^{(i)}) - \tilde{z}_y(y^{(j)}) \right)^{\top} = D_r^{-1/2} \left( \tfrac{1}{n} N - r c^{\top} \right) D_c^{-1/2} = \tilde{M}$    (28)

Substituting this relation into the problem (21) yields the following:

$\max_{R} \; \frac{1}{2 n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \tilde{z}_x(x^{(i)}) - \tilde{z}_x(x^{(j)}) \right)^{\top} R \left( \tilde{z}_y(y^{(i)}) - \tilde{z}_y(y^{(j)}) \right)$ subject to $R R^{\top} = I$    (29)

It is readily apparent that this problem defines the rotated Gini-index using the scaled one-hot encoding. This generalized Gini-index is equivalent to CA.

4 Non-linear extension

This section presents an attempt to generalize CA as far as possible while keeping the relation between the Gini-index and CA. Let us introduce a nonlinear mapping and operators on the nonlinearly mapped spaces. Rewriting the problem (29) using this nonlinear extension yields the following.

subject to (30)

Consider expanding this expression using the rules below:

(31)
(32)
(33)
(34)
(35)
(36)
(37)
(38)

We must consider how to evaluate an expression that includes operators which are not defined on the nonlinearly mapped space. To do so, we move these operators outside the parentheses. The rules (31)–(36) serve this purpose: rewriting the left-hand side as the right-hand side moves the operators outside the parentheses. Applying these rewriting rules to the problem (30) yields the following.

subject to (39)

where

(40)

Introducing kernel matrices defined to be

(41)

gives

(42)
subject to (43)

Specifying the operators and kernel matrices provides various well-known analyses for CD and natural language processing. Table 2 presents the relation between these specializations and known methods. Linear CA is simple CA without the nonlinear extension. Levy et al. [7] showed that GloVe [12] is a matrix factorization of shifted PMI; based on that discussion, one can also regard GloVe [12] as a special case of KCA, as shown in Table 2. When some element of a contingency table is zero, the corresponding PMI entry diverges because the logarithm of zero is undefined. The G-test for contingency tables can avoid this problem. In this research, we call the SVD of the resulting G-test matrix simply the G-test.

Name
Linear CA
Gini-index [11, 10]
G-test
SGNS[8, 6]
GloVe[12]
kernel PCA for CD[9]
Table 2: Relations between known methods and kernel specializations.
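To make the SGNS row of Table 2 concrete, the sketch below follows the matrix-factorization view of Levy and Goldberg [6]: it builds a shifted positive PMI matrix from a word-context count table and factorizes it with a truncated SVD. This is an illustration of that view, not of the KCA implementation; the shift value and the toy counts are arbitrary choices.

```python
import numpy as np

def sppmi_vectors(N, dim=2, shift=1.0):
    """Word vectors from the shifted positive PMI of a word-context count
    matrix N, obtained via truncated SVD (the factorization view of SGNS [6])."""
    n = N.sum()
    row = N.sum(axis=1, keepdims=True)
    col = N.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(N * n / (row * col))      # log-scale contingency table
    sppmi = np.maximum(pmi - np.log(shift), 0.0)   # shift by log(k) and clip at zero
    U, s, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(s[:dim])           # split the singular values symmetrically

# Arbitrary toy word-context counts, for illustration only.
N = np.array([[10, 2, 0, 1],
              [ 8, 3, 1, 0],
              [ 0, 1, 9, 7],
              [ 1, 0, 6, 11]], dtype=float)
print(np.round(sppmi_vectors(N, dim=2, shift=5.0), 3))
```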

5 Kernels for Natural language processing

In this section, we specifically examine vector representations of words. We consider the following contingency table

(44)

where the entry $N_{wc}$ is the number of times that word $w$ appears in context $c$. Using this contingency table, we can formulate SGNS as a specialized KCA, as shown in Table 2. Linear CA and the G-test also provide vector representations of words from this contingency table.
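As a concrete illustration of such a word-context table, the sketch below counts co-occurrences within a symmetric window over a toy sentence. The window size and the sentence are arbitrary; the actual experiments in Section 6 build the table with the hyperwords code of [7].

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each word appears with each context word
    inside a symmetric window of the given size."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "the king and the queen rule the land".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("the", "king")], counts[("queen", "rule")])
```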

KCA can introduce tunable weights for stop words (SW) using the following kernel:

(45)

where

(46)

is the vector whose $i$-th component carries the added weight when the $i$-th word is a stop word. We designate this mechanism as the stop word kernel.
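The exact form of the SW kernel is fixed by Eqs. (45)-(46), so the following is only an illustrative reading of the idea: a per-word weight vector that down-weights stop-word dimensions of the co-occurrence matrix before the factorization step. The weight value and the tiny random matrix are arbitrary assumptions.

```python
import numpy as np

def stopword_weights(vocab, stopwords, weight=0.2):
    """Per-word weights: 1.0 for ordinary words, a small tunable
    weight for stop words (illustrative reading of the SW kernel)."""
    return np.array([weight if w in stopwords else 1.0 for w in vocab])

vocab = ["the", "king", "queen", "of", "land"]
w = stopword_weights(vocab, stopwords={"the", "of"})

# Down-weight stop-word rows and columns of a toy word-context count
# matrix before the SVD / KCA step (the matrix here is random filler).
N = np.random.default_rng(0).integers(0, 5, size=(5, 5)).astype(float)
N_weighted = w[:, None] * N * w[None, :]
print(np.round(N_weighted, 2))
```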

When we have a dataset of similarity scores between pairs of words, for example WordSim353 [3], the scores can be encoded into the vector representations of words using our kernel mechanism. We can encode such scores using the operators defined in Eq. (30), as presented below.

(47)

where

(48)

are tunable parameters. With this specialization, the problem (30) is

subject to
(49)

where $\circ$ denotes the Hadamard product. We designate this mechanism as the word similarity kernel.
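Since the precise combination is fixed by Eqs. (47)-(49), the sketch below only illustrates one plausible reading of the WS kernel: human similarity scores are collected into a symmetric matrix and blended element-wise into a data-derived matrix with a tunable weight. The blending rule, the value of alpha, and the scores shown are assumptions made for illustration.

```python
import numpy as np

def similarity_matrix(vocab, scored_pairs):
    """Symmetric matrix of human word-similarity scores (e.g. MEN [2]);
    pairs that are not in the dataset stay at zero."""
    idx = {w: i for i, w in enumerate(vocab)}
    S = np.zeros((len(vocab), len(vocab)))
    for w1, w2, score in scored_pairs:
        if w1 in idx and w2 in idx:
            S[idx[w1], idx[w2]] = S[idx[w2], idx[w1]] = score
    return S

vocab = ["king", "queen", "man", "woman"]
pairs = [("king", "queen", 8.5), ("man", "woman", 8.3)]   # made-up scores
S = similarity_matrix(vocab, pairs)

# One plausible blend: a weighted element-wise combination of a
# data-derived matrix and the normalized score matrix.
K_data = np.eye(len(vocab))                # placeholder for the data kernel
alpha = 0.3                                # tunable mixing weight
K = (1.0 - alpha) * K_data + alpha * S / S.max()
print(np.round(K, 2))
```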

6 Experiments

This section explains experimental evaluations of vector representations of words based on our proposed system. To calculate the contingency table, we use the program code (https://bitbucket.org/omerlevy/hyperwords) used in [7]. For the SW kernel, we use the stop word set in NLTK [1]. For WS, we use the scores of the MEN dataset [2] of Bruni et al. We use 20% of the Text8 corpus (http://mattmahoney.net/dc/text8.zip) as training text data. The stop word set and the word similarity data are too small compared to the entire Text8 corpus; consequently, the stop word kernel and the word similarity kernel have only a small effect when the whole Text8 corpus is used.

For evaluation, we use the WordSim353 dataset [3], the MEN dataset [2] of Bruni et al., and the Mechanical Turk dataset [14] of Radinsky et al.

The 100-dimensional word vector representations are evaluated by comparing the ranking of their cosine similarities with the ranking in the test dataset. The rankings are compared using Spearman's ρ. We also show the SGNS result for comparison. Table 3 presents the evaluation results. Linear CA provides vector representations of two types; we present the results of both representations, and likewise for the G-test. The results for Linear CA and the G-test with the SW kernel and the WS kernel are shown as "with SW" or "with WS" in the table.

Name WordSim353[3] Bruni[2] Radinsky[14]
SGNS[7] 0.192 0.061 0.314
Linear CA 0.189 0.081 0.201
Linear CA 0.213 0.072 0.407
Linear CA with SW 0.198 0.088 0.212
Linear CA with SW 0.222 0.087 0.421
Linear CA with WS 0.190 0.084 0.201
Linear CA with WS 0.213 0.074 0.407
G-test 0.139 0.074 0.156
G-test 0.169 0.051 0.365
G-test with SW 0.178 0.078 0.179
G-test with SW 0.186 0.065 0.369
G-test with WS 0.139 0.074 0.171
G-test with WS 0.170 0.055 0.365
Table 3: Comparing word similarity.
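For reference, the evaluation protocol (cosine similarities of word vectors compared with human scores via Spearman's ρ) can be sketched as follows. The vectors and word pairs below are placeholders, not the experimental data of Table 3.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(word_vectors, scored_pairs):
    """Spearman's rho between cosine similarities and human similarity scores."""
    cosines, golds = [], []
    for w1, w2, gold in scored_pairs:
        if w1 in word_vectors and w2 in word_vectors:
            v1, v2 = word_vectors[w1], word_vectors[w2]
            cosines.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            golds.append(gold)
    rho, _ = spearmanr(cosines, golds)
    return rho

# Placeholder 100-dimensional vectors and made-up scores, only to show the call.
vecs = {w: np.random.default_rng(i).normal(size=100)
        for i, w in enumerate(["king", "queen", "man", "woman"])}
pairs = [("king", "queen", 8.5), ("man", "woman", 8.3), ("king", "man", 6.1)]
print(evaluate(vecs, pairs))
```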

Although Linear CA and the G-test have no tunable parameters, their results are comparable to those of SGNS. In most cases, the SW kernel enhances accuracy. The WS kernel enhances accuracy slightly in some cases.

7 Conclusion

We demonstrated that linear Correspondence Analysis (CA) is equivalent to defining the Gini-index with a rotated and scaled one-hot encoding. Moreover, we generalized CA as far as possible while maintaining the relation between the Gini-index and CA. Kernel Correspondence Analysis (KCA) was introduced based on this nonlinear generalization of CA. KCA yields various known analyses for categorical data and natural language processing by specializing its kernels; for example, KCA gives the G-test, skip-gram with negative sampling (SGNS), and GloVe as special cases. We introduced two kernels for natural language processing based on KCA and evaluated the proposed mechanism on the problem of vector representations of words. Although linear CA and the G-test have no tunable parameters, their results are comparable to those obtained with SGNS. Additionally, we showed that kernels with tunable parameters can enhance accuracy.

References

  • [1] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, 2009.
  • [2] E. Bruni, G. Boleda, M. Baroni, and N. K. Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
  • [3] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1):116–131, Jan. 2002.
  • [4] R. A. Fisher. The precision of discriminant functions. Annals of Eugenics (London), 10:422–429, 1940.
  • [5] C. Gini. Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Università di Cagliari (1912). Reviewed in: R. J. Light and B. H. Margolin. An analysis of variance for categorical data. J. American Statistical Association, 66:534–544, 1971.
  • [6] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2177–2185, 2014.
  • [7] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.
  • [8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  • [9] H. Niitsuma and T. Okada. Kernel PCA for categorical data. IEICE Technical Report, Artificial Intelligence and Knowledge-Based Processing, 103(305):13–17, Sep. 2003.
  • [10] H. Niitsuma and T. Okada. Covariance and PCA for categorical variables. In Advances in Knowledge Discovery and Data Mining, 9th Pacific-Asia Conference, PAKDD 2005, Hanoi, Vietnam, May 18-20, 2005, Proceedings, pages 523–528, 2005.
  • [11] T. Okada. A note on covariances for categorical data. In K. Leung, L. Chan, and H. Meng, editors, Intelligent Data Engineering and Automated Learning - IDEAL 2000, 2000.
  • [12] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [13] D. Picca, B. Curdy, and F. Bavaud. Non-linear correspondence analysis in text retrieval: A kernel view. In Proceedings of JADT, 2006.
  • [14] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 337–346, New York, NY, USA, 2011. ACM.