# An Optimality Proof for the PairDiff operator for Representing Relations between Words

Representing the semantic relations that exist between two given words (or entities) is an important first step in a wide range of NLP applications such as analogical reasoning, knowledge base completion and relational information retrieval. A simple, yet surprisingly accurate method for representing a relation between two words is to compute the vector offset (PairDiff) between the corresponding word embeddings. Despite its empirical success, it remains unclear whether PairDiff is the best operator for obtaining a relational representation from word embeddings. In this paper, we conduct a theoretical analysis of the PairDiff operator. In particular, we show that for word embeddings where cross-dimensional correlations are zero, PairDiff is the only bilinear operator that can minimise the ℓ_2 loss between analogous word-pairs. We experimentally show that for word embeddings created using a broad range of methods, the cross-dimensional correlations are approximately zero, demonstrating the general applicability of our theoretical result. Moreover, we empirically verify the implications of the proven theoretical result in a series of experiments where we repeatedly discover PairDiff as the best bilinear operator for representing semantic relations between words in several benchmark datasets.


## 1 Introduction

Different types of semantic relations exist between words such as HYPERNYMY between ostrich and bird, or ANTONYMY between hot and cold. If we consider entities (we use the terms word and entity interchangeably to refer to unigrams as well as multi-word expressions, including named entities), we can observe an even richer diversity of relations such as FOUNDER-OF between Bill Gates and Microsoft, or CAPITAL-OF between Tokyo and Japan. Identifying the relations between words and entities is important for various Natural Language Processing (NLP) tasks such as automatic knowledge base completion (Socher et al., 2013), analogical reasoning (Turney and Littman, 2005; Bollegala et al., 2009) and relational information retrieval (Duc et al., 2010). For example, to solve a word analogy problem of the form a is to b as c is to ?, the relationship between the two words in the pair (a, b) must be correctly identified in order to find candidates d that have similar relations with c. For example, given the query Bill Gates is to Microsoft as Steve Jobs is to ?, a relational search engine must retrieve Apple Inc. because the FOUNDER-OF relation exists between the first and the second entity pairs.

Two main approaches for creating relation embeddings can be identified in the literature. In the first approach, from given corpora or knowledge bases, word and relation embeddings are jointly learnt such that some objective is optimised (Guo et al., 2016; Yang et al., 2015; Nickel et al., 2016; Bordes et al., 2013; Rocktäschel et al., 2016; Minervini et al., 2017; Trouillon et al., 2016). In this approach, word and relation embeddings are considered to be independent parameters that must be learnt by the embedding method. For example, TransE (Bordes et al., 2013) learns the word and relation embeddings such that we can accurately predict relations (links) in a given knowledge base using the learnt word and relation embeddings. Because relations are learnt independently from the words, we refer to methods that are based on this approach as independent relational embedding methods.

A second approach for creating relational embeddings is to apply some operator on two word embeddings to compose the embedding for the relation that exists between those two words, if any. In contrast to the first approach, we do not have to learn relational embeddings and hence this can be considered as an unsupervised setting, where the compositional operator is predefined. A popular operator for composing a relational embedding from two word embeddings is PairDiff, which is the vector difference (offset) of the word embeddings (Mikolov et al., 2013b; Levy and Goldberg, 2014; Vylomova et al., 2016; Bollegala et al., 2015b; Blacoe and Lapata, 2012). Specifically, given two words $a$ and $b$ represented by their word embeddings $\vec{a}$ and $\vec{b}$ respectively, the relation between $a$ and $b$ is given by $\vec{a} - \vec{b}$ under the PairDiff operator. Mikolov et al. (2013b) showed that PairDiff can accurately solve analogy equations such as $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$, where we have used the top arrows to denote the embeddings of the corresponding words. Bollegala et al. (2015a) showed that PairDiff can be used as a proxy for learning better word embeddings and Vylomova et al. (2016) conducted an extensive empirical comparison of PairDiff using a dataset containing 16 different relation types. Besides PairDiff, concatenation (Hakami and Bollegala, 2017; Yin and Schütze, 2016), circular correlation and convolution (Nickel et al., 2016) have been used in prior work for representing the relations between words. Because the relation embedding is composed using word embeddings instead of being learnt as a separate parameter, we refer to methods that are based on this approach as compositional relational embedding methods. Note that in this approach it is implicitly assumed that there exists only a single relation between two words.
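The vector-offset idea can be illustrated with a minimal sketch. The toy embeddings below are hand-picked, hypothetical values (not taken from any trained model), and the helper names `pair_diff` and `cosine` are introduced here for illustration only.

```python
import numpy as np

def pair_diff(a, b):
    """Relation embedding of the word-pair (a, b) as the vector offset a - b."""
    return np.asarray(a, dtype=float) - np.asarray(b, dtype=float)

def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical 3-dimensional toy embeddings, chosen by hand for illustration.
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "king":  np.array([1.0, 1.0, 0.2]),
    "woman": np.array([0.0, 0.0, 0.9]),
    "queen": np.array([0.0, 1.0, 0.9]),
    "apple": np.array([0.5, -1.0, 0.1]),
}

r_king_man = pair_diff(emb["king"], emb["man"])
r_queen_woman = pair_diff(emb["queen"], emb["woman"])
r_apple_man = pair_diff(emb["apple"], emb["man"])

# Analogous pairs should have more similar offsets than unrelated pairs.
sim_analogous = cosine(r_king_man, r_queen_woman)
sim_other = cosine(r_king_man, r_apple_man)
print(sim_analogous, sim_other)
```

In this toy setting the two analogous pairs share the same offset, so their relational similarity is higher than that of the unrelated pair; real embeddings only approximate this behaviour.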

In this paper, we focus on the operators that are used in compositional relational embedding methods. If we assume that the words and relations are represented by vectors embedded in some common space, then the operator we are seeking must be able to produce a vector representing the relation between two words, given their word embeddings as the only input. Although there have been different proposals for computing relational embeddings from word embeddings, it remains unclear which operator is best for this task. The space of operators that can be used to compose relational embeddings is open and vast. A space of particular interest from a computational point of view is that of the bilinear operators, which can be parametrised using tensors and matrices. Specifically, we consider operators that account for pairwise interactions between two word embeddings (second-order terms) and contributions from individual word embeddings towards their relational embedding (first-order terms). The optimality of a relational compositional operator can be evaluated, for example, using the expected relational distance/similarity, such as the $\ell_2$ distance, between analogous (positive) vs. non-analogous (negative) word-pairs.

If we assume that word embeddings are standardised, uncorrelated and word-pairs are i.i.d., then we prove in §3 that bilinear relational compositional operators are independent of bilinear pairwise interactions between the two input word embeddings. Moreover, under regularised settings (§3.1), the bilinear operator further simplifies to a linear combination of the input embeddings, and the expected loss over positive and negative instances becomes zero. In §4.1, we empirically validate the uncorrelation assumption for different pre-trained word embeddings such as the Continuous Bag-of-Words Model (CBOW) (Mikolov et al., 2013a), Skip-Gram with negative sampling (SG) (Mikolov et al., 2013a), Global Vectors (GloVe) (Pennington et al., 2014), word embeddings created using Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Hierarchical Sparse Coding (HSC) (Faruqui et al., 2015; Yogatama et al., 2015), and Latent Dirichlet Allocation (LDA) (Blei et al., 2003a). This empirical evidence implies that our theoretical analysis is applicable to relational representations composed from a wide range of word embedding learning methods. Moreover, our experimental results show that a bilinear operator reaches its optimal performance in two different word-analogy benchmark datasets when it satisfies the requirements of the PairDiff operator. We hope that our theoretical analysis will expand the understanding of relational embedding methods, and inspire future research on accurate relational embedding methods using word embeddings as the input.

## 2 Related Work

As already mentioned in §1, methods for representing a relation between two words can be broadly categorised into two groups depending on whether the relational embeddings are learnt independently of the word embeddings, or they are composed from the word embeddings, in which case the relational embeddings fully depend on the input word embeddings. Next, we briefly overview the different methods that fall under each category. For a detailed survey of relation embedding methods see Nickel et al. (2015).

Given a knowledge base where an entity $h$ is linked to an entity $t$ by a relation $r$, the TransE model (Bordes et al., 2013) scores the tuple $(h, r, t)$ by the $\ell_1$ or $\ell_2$ norm of the vector $\vec{h} + \vec{r} - \vec{t}$. Nickel et al. (2011) proposed RESCAL, which uses $\vec{h}^\top \mathbf{M}_r \vec{t}$ as the scoring function, where $\mathbf{M}_r$ is a matrix embedding of the relation $r$. Similar to RESCAL, Neural Tensor Network (Socher et al., 2013) also models a relation by a matrix. However, compared to vector embeddings of relations, matrix embeddings increase the number of parameters to be estimated, resulting in an increase in computational time/space and a greater risk of overfitting. To overcome these limitations, DistMult (Yang et al., 2015) models relations by vectors and uses the elementwise multilinear dot product $\langle \vec{h}, \vec{r}, \vec{t} \rangle = \sum_i h_i r_i t_i$. Unfortunately, DistMult cannot capture the directionality of a relation. Complex Embeddings (Trouillon et al., 2016) overcome this limitation of DistMult by using complex embeddings and defining the score to be the real part of $\langle \vec{h}, \vec{r}, \bar{\vec{t}} \rangle$, where $\bar{\vec{t}}$ denotes the complex conjugate of $\vec{t}$.
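The scoring functions discussed above can be compared side by side in a small sketch. The function names and the toy vectors are assumptions made for illustration; they follow the standard published definitions of each model rather than any particular library's API.

```python
import numpy as np

def transe_score(h, r, t, order=2):
    """TransE: l1 or l2 norm of h + r - t (smaller means more plausible)."""
    return float(np.linalg.norm(h + r - t, ord=order))

def rescal_score(h, M_r, t):
    """RESCAL: bilinear form h^T M_r t with a matrix embedding of the relation."""
    return float(h @ M_r @ t)

def distmult_score(h, r, t):
    """DistMult: elementwise multilinear dot product sum_i h_i r_i t_i."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """Complex Embeddings: real part of sum_i h_i r_i conj(t_i)."""
    return float(np.real(np.sum(h * r * np.conj(t))))

# Hypothetical toy vectors.
h = np.array([1.0, 2.0])
t = np.array([3.0, -1.0])
r = np.array([0.5, 2.0])

# DistMult is RESCAL restricted to a diagonal relation matrix, so it is
# symmetric in h and t and cannot encode the direction of a relation.
dm_forward = distmult_score(h, r, t)
dm_backward = distmult_score(t, r, h)

# With complex-valued embeddings the score can change when h and t swap.
hc, tc, rc = np.array([1 + 0j]), np.array([0 + 1j]), np.array([0 + 1j])
cx_forward = complex_score(hc, rc, tc)
cx_backward = complex_score(tc, rc, hc)
print(dm_forward, dm_backward, cx_forward, cx_backward)
```

The symmetry of the DistMult score in the two entities is what prevents it from modelling directional relations, whereas the complex-valued score distinguishes the two orderings.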

The observation made by Mikolov et al. (2013b) that the relation between two words can be represented by the difference between their word embeddings sparked a renewed interest in methods that compose relational embeddings using word embeddings. Word analogy datasets such as the Google dataset (Mikolov et al., 2013b), the SemEval 2012 Task 2 dataset (Jurgens et al., 2012) and BATS (Drozd et al., 2016) have become established as benchmarks for evaluating word embedding learning methods.

Different methods have been proposed to measure the similarity between the relations that exist between two given word pairs such as CosMult, CosAdd and PairDiff (Levy and Goldberg, 2014; Bollegala et al., 2015a). Vylomova et al. (2016) studied to what extent the vectors generated using simple PairDiff encode different relation types. Under supervised classification settings, they conclude that PairDiff can cover a wide range of semantic relation types. Holographic embeddings proposed by Nickel et al. (2016) use circular correlation to mix the embeddings of two words to create an embedding for the relation that exists between those words. It can be shown that circular correlation is indeed an elementwise product in the Fourier space and is mathematically equivalent to complex embeddings (Hayashi and Shinbo, 2017).

Although the PairDiff operator has been widely used in prior work for computing relation embeddings from word embeddings, to the best of our knowledge, no theoretical analysis has been conducted so far explaining why and under what conditions PairDiff is optimal, which is the focus of this paper.

## 3 Bilinear Relation Representations

Let us consider the problem of representing the semantic relation between two given words $h$ and $t$. We assume that $h$ and $t$ are already represented in some $d$-dimensional space respectively by their word embeddings $\vec{h}, \vec{t} \in \mathbb{R}^d$. The relation between two words can be represented using different linear algebraic structures. Two popular alternatives are vectors (Nickel et al., 2016; Bordes et al., 2013; Minervini et al., 2017; Trouillon et al., 2016) and matrices (Socher et al., 2013; Bollegala et al., 2015b). Vector representations are preferred over matrix representations because of the smaller number of parameters to be learnt (Nickel et al., 2015).

Let us assume that the relation is represented by a vector $\vec{r}$ in some $\delta$-dimensional space. Therefore, we can write $\vec{r}$ as a function that takes two vectors (corresponding to the embeddings of the two words) as the input and returns a single vector (representing the relation between the two words) as given in (1).

$$
\vec{r} : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^{\delta} \tag{1}
$$

Having both words and relations represented in the same $d$-dimensional space is useful for performing linear algebraic operations using those representations in that space. For example, in TransE (Bordes et al., 2013), the strength of a relation $r$ that exists between two words $h$ and $t$ is computed as the $\ell_2$ norm of the vector $\vec{h} + \vec{r} - \vec{t}$ using the word and relation embeddings. Such direct comparisons between word and relation embeddings would not be possible if words and relations were not embedded in the same vector space. If $\delta < d$, we can first project word embeddings to a lower $\delta$-dimensional space using some dimensionality reduction method such as SVD, whereas if $\delta > d$ we can learn higher $\delta$-dimensional overcomplete word representations (Faruqui et al., 2015) from the original $d$-dimensional word embeddings. Therefore, we will limit our theoretical analysis to the case $\delta = d$ for ease of description.

Different functions can be used as $\vec{r}$ that satisfy the domain and range requirements specified by (1). If we limit ourselves to bilinear functions, the most general functional form is given by (2).

$$
\vec{r}(\vec{h}, \vec{t}) = \vec{h}^\top \underline{\mathbf{A}} \vec{t} + \mathbf{P}\vec{h} + \mathbf{Q}\vec{t} \tag{2}
$$

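Here, $\underline{\mathbf{A}} \in \mathbb{R}^{d \times d \times d}$ is a 3-way tensor in which each slice is a $d \times d$ real matrix. Let us denote the $k$-th slice of $\underline{\mathbf{A}}$ by $\mathbf{A}^{(k)}$ and its $(i, j)$ element by $A^{(k)}_{ij}$. The first term in (2) corresponds to the pairwise interactions between $\vec{h}$ and $\vec{t}$. $\mathbf{P}, \mathbf{Q} \in \mathbb{R}^{d \times d}$ are the nonsingular projection matrices involving first-order contributions respectively of $\vec{h}$ and $\vec{t}$ towards $\vec{r}$. (If a projection matrix is nonsingular, then the inverse projection exists, which preserves the dimensionality of the embedding space.)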
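A minimal NumPy sketch of a bilinear operator of this form is given below, assuming the tensor is stored with the slice index first, i.e. `A[k]` is the $k$-th $d \times d$ slice. The function name `bilinear_relation` and the random test values are illustrative assumptions.

```python
import numpy as np

def bilinear_relation(h, t, A, P, Q):
    """Compute r_k = h^T A[k] t + (P h)_k + (Q t)_k for every dimension k.

    A has shape (d, d, d) with A[k] the k-th slice; P and Q have shape (d, d).
    """
    pairwise = np.einsum("i,kij,j->k", h, A, t)  # second-order interactions
    return pairwise + P @ h + Q @ t              # plus first-order terms

d = 4
rng = np.random.default_rng(0)
h, t = rng.standard_normal(d), rng.standard_normal(d)
A = rng.standard_normal((d, d, d))
P, Q = rng.standard_normal((d, d)), rng.standard_normal((d, d))

r = bilinear_relation(h, t, A, P, Q)

# With a zero tensor, P = I and Q = -I, the operator is the vector offset h - t.
r_offset = bilinear_relation(h, t, np.zeros((d, d, d)), np.eye(d), -np.eye(d))
print(r, r_offset)
```

Using `einsum` avoids an explicit loop over the slices; the parameter count is dominated by the $d^3$ entries of the tensor, which is one reason the second-order term is costly to learn.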

Let us consider the problem of learning the simplest bilinear functional form according to (2) from a given dataset of analogous word-pairs $\mathcal{D}_+ = \{((h, t), (h', t'))\}$. Specifically, we would like to learn the parameters $\underline{\mathbf{A}}$, $\mathbf{P}$ and $\mathbf{Q}$ such that some distance (loss) between analogous word-pairs is minimised. As a concrete example of a distance function, let us consider the popularly used squared Euclidean distance ($\ell_2$ loss) for two word pairs given by (3). (For $\ell_2$ normalised vectors, the Euclidean distance is a monotonically decreasing function of the cosine similarity.)

$$
J((h, t), (h', t')) = \left\| \vec{r}(\vec{h}, \vec{t}) - \vec{r}(\vec{h}', \vec{t}') \right\|_2^2 \tag{3}
$$

If we were provided only analogous word-pairs (i.e. positive examples), then this task could be trivially achieved by setting all parameters to zero. However, such a trivial solution would not generalise to unseen test data. Therefore, in addition to $\mathcal{D}_+$ we would require a set of non-analogous word-pairs $\mathcal{D}_-$ as negative examples. Such negative examples are often generated in prior work by randomly corrupting positive relational tuples (Nickel et al., 2016; Bordes et al., 2013; Trouillon et al., 2016) or by training an adversarial generator (Minervini et al., 2017).

The total loss over both positive and negative training data can be written as follows:

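$$
J = \sum_{((h,t),(h',t')) \in \mathcal{D}_+} \left\| \vec{r}(\vec{h}, \vec{t}) - \vec{r}(\vec{h}', \vec{t}') \right\|_2^2 - \sum_{((h,t),(h',t')) \in \mathcal{D}_-} \left\| \vec{r}(\vec{h}, \vec{t}) - \vec{r}(\vec{h}', \vec{t}') \right\|_2^2 \tag{4}
$$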

Assuming that the training word-pairs are randomly sampled from $\mathcal{D}_+$ and $\mathcal{D}_-$ according to two distributions respectively $p_+$ and $p_-$, we can compute the total expected loss, $\mathbb{E}_p[J]$, as follows:

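$$
\mathbb{E}_p[J] = \mathbb{E}_{p_+}\left[ \left\| \vec{r}(\vec{h}, \vec{t}) - \vec{r}(\vec{h}', \vec{t}') \right\|_2^2 \right] - \mathbb{E}_{p_-}\left[ \left\| \vec{r}(\vec{h}, \vec{t}) - \vec{r}(\vec{h}', \vec{t}') \right\|_2^2 \right] \tag{5}
$$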
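The empirical loss over positive and negative instances can be sketched as follows. This is an illustrative implementation, not the authors' code: `total_loss` accepts any relation function `rel(h, t)`, and the toy instances built from unit vectors are hypothetical.

```python
import numpy as np

def squared_l2(r1, r2):
    """Squared Euclidean distance between two relation embeddings."""
    diff = r1 - r2
    return float(diff @ diff)

def total_loss(rel, positives, negatives):
    """Sum of squared distances over analogous instances minus the same sum
    over non-analogous instances; each instance is ((h, t), (h2, t2))."""
    pos = sum(squared_l2(rel(h, t), rel(h2, t2)) for (h, t), (h2, t2) in positives)
    neg = sum(squared_l2(rel(h, t), rel(h2, t2)) for (h, t), (h2, t2) in negatives)
    return pos - neg

def pair_diff(h, t):
    """Vector offset used here as the example relation function."""
    return h - t

e = np.eye(3)  # hypothetical toy embeddings built from unit vectors
positives = [((e[0] + e[1], e[0]), (e[2] + e[1], e[2]))]  # both offsets equal e[1]
negatives = [((e[0] + e[1], e[0]), (e[2], e[0]))]         # offsets e[1] vs e[2] - e[0]

loss = total_loss(pair_diff, positives, negatives)
print(loss)
```

A lower value means analogous pairs are close while non-analogous pairs are far apart, which is why the negative term enters with a minus sign.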

We make the following assumptions to further analyse the properties of relational embeddings.

Uncorrelation:

The correlation between any two distinct dimensions of a word embedding is zero. One might consider the uncorrelation of word embedding dimensions to be a strong assumption, but we later show its validity empirically in §4.1 for a wide range of word embeddings.

Standardisation:

Word embeddings are standardised to zero mean and unit variance. This is a linear transformation in the word embedding space and does not affect the topology of the embedding space. In particular, translating word embeddings such that they have a zero mean has been shown to improve performance in similarity tasks (Mu et al., 2017).

Relational Independence:

Word pairs in the training data are assumed to be i.i.d. For example, whether a particular semantic relation exists between $h$ and $t$ is assumed to be independent of any other relation that exists between $h'$ and $t'$ in a different pair.
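The standardisation assumption corresponds to a per-dimension shift and rescaling of the embedding matrix, which can be sketched as below. The function name `standardise`, the small `eps` guard against zero variance, and the random stand-in matrix are assumptions for illustration.

```python
import numpy as np

def standardise(W, eps=1e-12):
    """Shift and scale each column (dimension) of W to zero mean, unit variance.

    W has one row per word and one column per embedding dimension.
    """
    mean = W.mean(axis=0, keepdims=True)
    std = W.std(axis=0, keepdims=True)
    return (W - mean) / (std + eps)  # eps only guards against constant columns

rng = np.random.default_rng(0)
# Random stand-in for a real embedding matrix, with very different scales
# in each of its three dimensions.
W = rng.normal(loc=3.0, scale=[0.5, 2.0, 10.0], size=(1000, 3))
Z = standardise(W)
print(Z.mean(axis=0), Z.std(axis=0))
```

Because this is an affine map applied independently to each dimension, it leaves the Pearson correlations between dimensions unchanged.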

For relation representations given by (2), the following theorem holds.

###### Theorem 1.

Consider the bilinear relational embedding defined by (2) computed using uncorrelated word embeddings. If the word embeddings are standardised, then the expected loss given by (5) over a relationally independent set of word pairs is independent of $\underline{\mathbf{A}}$.

###### Proof.

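Let us consider the bilinear term in (2). Because the $i$-th and $j$-th dimensions of word embeddings are uncorrelated by the assumption (i.e. $\mathrm{corr}(u_i, u_j) = 0$ for $i \neq j$), from the definition of correlation we have,

$$
\mathrm{corr}(u_i, u_j) = \mathbb{E}_p[u_i u_j] - \mathbb{E}_p[u_i]\,\mathbb{E}_p[u_j] = 0 \tag{6}
$$

$$
\mathbb{E}_p[u_i u_j] = \mathbb{E}_p[u_i]\,\mathbb{E}_p[u_j]. \tag{7}
$$

Moreover, from the standardisation assumption we have $\mathbb{E}_p[u_i] = 0$. From (7) it follows that:

$$
\mathbb{E}_p[u_i u_j] = 0 \tag{8}
$$

for $i \neq j$ dimensions.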

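We will next show that (5) is independent of $\underline{\mathbf{A}}$. For this purpose, let us consider the $\vec{r}(\vec{h}, \vec{t})$ term first and write the $k$-th dimension of $\vec{r}(\vec{h}, \vec{t})$ using $\underline{\mathbf{A}}$, $\mathbf{P}$ and $\mathbf{Q}$ as follows:

$$
\sum_{i,j} A^{(k)}_{ij} h_i t_j + \sum_n P_{kn} h_n + \sum_n Q_{kn} t_n \tag{9}
$$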

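Plugging (9) in (3) and computing the loss over all positive training instances we get,

$$
\mathbb{E}_{p_+}\left[ \sum_k \left( \sum_{i,j} A^{(k)}_{ij} (h_i t_j - h'_i t'_j) + \sum_n P_{kn} (h_n - h'_n) + \sum_n Q_{kn} (t_n - t'_n) \right)^2 \right] \tag{10}
$$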

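Terms that involve only elements in $\underline{\mathbf{A}}$ take the form:

$$
\sum_{i,j} \sum_{l,m} \mathbb{E}_{p_+}\left[ A^{(k)}_{ij} A^{(k)}_{lm} (h_i t_j - h'_i t'_j)(h_l t_m - h'_l t'_m) \right] = \sum_{i,j} \sum_{l,m} A^{(k)}_{ij} A^{(k)}_{lm} \left( \mathbb{E}_{p_+}[h_i t_j h_l t_m] - \mathbb{E}_{p_+}[h_i t_j h'_l t'_m] - \mathbb{E}_{p_+}[h'_i t'_j h_l t_m] + \mathbb{E}_{p_+}[h'_i t'_j h'_l t'_m] \right) \tag{11}
$$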

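In cases where $i \neq l$ and $j \neq m$, each of the four expectations in (11) contains the product of different dimensionalities, which is zero from (8). For the $i = j = l = m$ case we have,

$$
{A^{(k)}_{ii}}^2 \left( \mathbb{E}_{p_+}[h_i^2 t_i^2] - 2\,\mathbb{E}_{p_+}[h_i t_i h'_i t'_i] + \mathbb{E}_{p_+}[{h'_i}^2 {t'_i}^2] \right) \tag{12}
$$

From the relational independence we have $\mathbb{E}_{p_+}[h_i t_i h'_i t'_i] = \mathbb{E}_{p_+}[h_i t_i]\,\mathbb{E}_{p_+}[h'_i t'_i]$. Moreover, because the word embeddings are assumed to be standardised to unit variance we have $\mathbb{E}_{p_+}[h_i^2] = 1$ and $\mathbb{E}_{p_+}[t_i^2] = 1$. Therefore, (12) evaluates to zero and none of the terms arising purely from $\underline{\mathbf{A}}$ will remain in the expected loss over positive examples.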

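Next, let us consider the terms involving both $\underline{\mathbf{A}}$ and $\mathbf{P}$ in the expansion of (3), given by,

$$
2 \sum_{i,j} \sum_n A^{(k)}_{ij} P_{kn} (h_i t_j - h'_i t'_j)(h_n - h'_n). \tag{13}
$$

Taking the expectation of (13) w.r.t. $p_+$ we get,

$$
2 \sum_{i,j} \sum_n A^{(k)}_{ij} P_{kn} \left( \mathbb{E}_{p_+}[h_i t_j h_n] - \mathbb{E}_{p_+}[h_i t_j h'_n] - \mathbb{E}_{p_+}[h'_i t'_j h_n] + \mathbb{E}_{p_+}[h'_i t'_j h'_n] \right). \tag{14}
$$

Likewise, from the uncorrelation assumption and relational independence it follows that all the expectations in (14) are zero. A similar argument can be used to show that the terms that involve both $\underline{\mathbf{A}}$ and $\mathbf{Q}$ disappear from the expected loss. Therefore, $\underline{\mathbf{A}}$ does not play any part in the expected loss over positive examples. Similarly, we can show that the expected loss over negative examples is independent of $\underline{\mathbf{A}}$. Therefore, from (5) we see that the expected loss over the entire training dataset is independent of $\underline{\mathbf{A}}$.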

### 3.1 Regularised ℓ2 loss

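As a special case, if we attempt to minimise the expected loss under some regularisation on $\underline{\mathbf{A}}$ such as the Frobenius norm regularisation, then this can be achieved by sending $\underline{\mathbf{A}}$ to the zero tensor because, according to Theorem 1, (5) is independent of $\underline{\mathbf{A}}$.

With $\underline{\mathbf{A}} = \mathbf{0}$, the relation between $h$ and $t$ can be simplified to:

$$
\vec{r}(\vec{h}, \vec{t}) = \mathbf{P}\vec{h} + \mathbf{Q}\vec{t} \tag{15}
$$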

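Then the expected loss over the positive instances is given by (16).

$$
\mathbb{E}_{p_+}\left[ \left\| \mathbf{P}(\vec{h} - \vec{h}') + \mathbf{Q}(\vec{t} - \vec{t}') \right\|_2^2 \right] = \mathbb{E}_{p_+}\left[ (\vec{h} - \vec{h}')^\top \mathbf{P}^\top \mathbf{P} (\vec{h} - \vec{h}') \right] + \mathbb{E}_{p_+}\left[ (\vec{h} - \vec{h}')^\top \mathbf{P}^\top \mathbf{Q} (\vec{t} - \vec{t}') \right] + \mathbb{E}_{p_+}\left[ (\vec{t} - \vec{t}')^\top \mathbf{Q}^\top \mathbf{P} (\vec{h} - \vec{h}') \right] + \mathbb{E}_{p_+}\left[ (\vec{t} - \vec{t}')^\top \mathbf{Q}^\top \mathbf{Q} (\vec{t} - \vec{t}') \right] \tag{16}
$$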

The second expectation term in the RHS of (16) can be computed as follows:

$$
\mathbb{E}_{p_+}\left[ (\vec{h} - \vec{h}')^\top \mathbf{P}^\top \mathbf{Q} (\vec{t} - \vec{t}') \right] = \sum_{i,j} (\mathbf{P}^\top \mathbf{Q})_{ij}\, \mathbb{E}_{p_+}\left[ (h_i - h'_i)(t_j - t'_j) \right] = \sum_{i,j} (\mathbf{P}^\top \mathbf{Q})_{ij} \left( \mathbb{E}_{p_+}[h_i t_j] - \mathbb{E}_{p_+}[h_i t'_j] - \mathbb{E}_{p_+}[h'_i t_j] + \mathbb{E}_{p_+}[h'_i t'_j] \right) \tag{17}
$$

When $i \neq j$, each of the four expectations in the RHS of (17) is zero from the uncorrelation assumption. When $i = j$, each term will be equal to one from the standardisation assumption (unit variance) and the terms cancel each other out. A similar argument can be used to show that the third expectation term in the RHS of (16) vanishes.

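Now let us consider the first expectation term in the RHS of (16), which can be computed as follows:

$$
\mathbb{E}_{p_+}\left[ (\vec{h} - \vec{h}')^\top \mathbf{P}^\top \mathbf{P} (\vec{h} - \vec{h}') \right] = \sum_{i,j} (\mathbf{P}^\top \mathbf{P})_{ij}\, \mathbb{E}_{p_+}\left[ (h_i - h'_i)(h_j - h'_j) \right] = \sum_{i,j} (\mathbf{P}^\top \mathbf{P})_{ij} \left( \mathbb{E}_{p_+}[h_i h_j] - \mathbb{E}_{p_+}[h_i h'_j] - \mathbb{E}_{p_+}[h'_i h_j] + \mathbb{E}_{p_+}[h'_i h'_j] \right) \tag{18}
$$

When $i \neq j$, it follows from the uncorrelation assumption that each of the four expectation terms in the RHS of (18) will be zero. For the $i = j$ case we have,

$$
\sum_{i} (\mathbf{P}^\top \mathbf{P})_{ii} \left( \mathbb{E}_{p_+}[h_i^2] - 2\,\mathbb{E}_{p_+}[h_i h'_i] + \mathbb{E}_{p_+}[{h'_i}^2] \right) = 2 \sum_{i} (\mathbf{P}^\top \mathbf{P})_{ii} \tag{19}
$$

Note that from the relational independence between $h$ and $h'$ we have $\mathbb{E}_{p_+}[h_i h'_i] = \mathbb{E}_{p_+}[h_i]\,\mathbb{E}_{p_+}[h'_i]$. From the standardisation (zero mean) assumption this term is zero. On the other hand, $\mathbb{E}_{p_+}[h_i^2] = \mathbb{E}_{p_+}[{h'_i}^2] = 1$ from the standardisation (unit variance) assumption, which gives the result in (19).

Similarly, the fourth expectation term in the RHS of (16) evaluates to $2 \sum_i (\mathbf{Q}^\top \mathbf{Q})_{ii}$, which shows that (16) evaluates to $2 \sum_i \left( (\mathbf{P}^\top \mathbf{P})_{ii} + (\mathbf{Q}^\top \mathbf{Q})_{ii} \right)$. Note that this is independent of the positive instances and will be equal to the expected loss over negative instances, which gives $\mathbb{E}_p[J] = 0$ for the relational embedding given by (15).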

It is interesting to note that PairDiff is a special case of (15), where $\mathbf{P} = \mathbf{I}$ and $\mathbf{Q} = -\mathbf{I}$. In the general case where word embeddings are not standardised to unit variance, we can set $\mathbf{P}$ to be the diagonal matrix where $P_{ii} = 1/\sigma_i$, where $\sigma_i^2$ is the variance of the $i$-th dimension of the word embedding space, to enforce standardisation. Considering that $\mathbf{P}, \mathbf{Q}$ are parameters of the relational embedding, this is analogous to batch normalisation (Ioffe and Szegedy, 2015), where the appropriate parameters for the normalisation are learnt during training.
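As a numerical sanity check of the trace expression above, the following Monte Carlo sketch estimates the expected squared distance for a linear operator $\mathbf{P}\vec{h} + \mathbf{Q}\vec{t}$. It is only an illustration under a stronger, explicitly assumed setting: all four embeddings are drawn as independent standard normal vectors (zero mean, unit variance, uncorrelated dimensions, and $\vec{h}$ independent of $\vec{t}$). The function name and sample sizes are hypothetical choices.

```python
import numpy as np

def expected_sq_distance(P, Q, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E||P(h - h2) + Q(t - t2)||^2.

    Assumption: h, t, h2, t2 are independent standard normal vectors.
    """
    rng = np.random.default_rng(seed)
    d = P.shape[1]
    h, t, h2, t2 = (rng.standard_normal((n_samples, d)) for _ in range(4))
    diff = (h - h2) @ P.T + (t - t2) @ Q.T  # one row per sampled instance
    return float(np.mean(np.sum(diff ** 2, axis=1)))

d = 5
rng = np.random.default_rng(1)
P, Q = rng.standard_normal((d, d)), rng.standard_normal((d, d))

estimate = expected_sq_distance(P, Q)
# 2 * (trace(P^T P) + trace(Q^T Q)), i.e. twice the squared Frobenius norms.
closed_form = 2.0 * (np.sum(P ** 2) + np.sum(Q ** 2))

# Vector offset as the special case P = I, Q = -I, where the value is 4d.
offset_estimate = expected_sq_distance(np.eye(d), -np.eye(d))
print(estimate, closed_form, offset_estimate)
```

Under this sampling assumption the estimate depends only on the operator and not on which instances are drawn, so positive and negative instances sampled in the same way would give the same expected distance.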

## 4 Experimental Results

### 4.1 Cross-dimensional Correlations

A key assumption in our theoretical analysis is the absence of correlation between different dimensions in word embeddings. Here, we empirically verify the uncorrelation assumption for different input word embeddings. For this purpose, we create SG, CBOW and GloVe embeddings from the ukWaC corpus. We use a context window of 5 tokens and select words that occur at least 6 times in the corpus. We use the publicly available implementations for those methods by the original authors and set the parameters to the recommended values in Levy et al. (2015) to create $d$-dimensional word embeddings. As a representative of counting-based word embeddings, we create a word co-occurrence matrix weighted by the positive pointwise mutual information (PPMI) and apply singular value decomposition (SVD) to obtain $d$-dimensional embeddings, which we refer to as the Latent Semantic Analysis (LSA) embeddings.

We use Latent Dirichlet Allocation (LDA) (Blei et al., 2003b) to create a topic model, and represent each word by its distribution over the set of topics. Ideally, each topic will capture some semantic category and the topic distribution provides a semantic representation for a word. We use gensim to extract topics from a January 2017 dump of English Wikipedia. In contrast to the above-mentioned word embeddings, which are dense and flat structured, we use Hierarchical Sparse Coding (HSC) (Yogatama et al., 2015) to produce sparse and hierarchical word embeddings.

Given a word embedding matrix $\mathbf{W} \in \mathbb{R}^{n \times d}$, where each row corresponds to the $d$-dimensional embedding of a word in a vocabulary containing $n$ words, we compute a correlation matrix $\mathbf{C} \in \mathbb{R}^{d \times d}$, where the $(i, j)$ element, $C_{ij}$, denotes the Pearson correlation coefficient between the $i$-th and $j$-th dimensions in the word embeddings over the $n$ words. By construction $C_{ii} = 1$, and the histograms of the cross-dimensional correlations ($i \neq j$) are shown in Figure 1 for the word embeddings obtained from the six methods described above. The mean of the absolute pairwise correlations for each embedding type and the standard deviation (sd) are indicated in the figure.

From Figure 1, irrespective of the word embedding learning method used, we see that cross-dimensional correlations are distributed in a narrow range with an almost zero mean. This result empirically validates the uncorrelation assumption we used in our theoretical analysis. Moreover, this result indicates that Theorem 1 can be applied to a wide range of existing word embeddings.
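The correlation analysis described in this section can be sketched with NumPy as follows. The random matrix is only a stand-in for a real embedding matrix (rows are words, columns are dimensions), and the function name `cross_dim_correlations` is an assumption for illustration.

```python
import numpy as np

def cross_dim_correlations(W):
    """Return the (d, d) Pearson correlation matrix between the columns of W
    and the flattened off-diagonal entries (the cross-dimensional correlations)."""
    C = np.corrcoef(W, rowvar=False)            # columns are the variables
    off_diag_mask = ~np.eye(C.shape[0], dtype=bool)
    return C, C[off_diag_mask]

rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 10))             # stand-in: 5000 words, 10 dimensions

C, off_diag = cross_dim_correlations(W)
mean_abs = float(np.abs(off_diag).mean())       # mean absolute pairwise correlation
sd = float(off_diag.std())                      # spread of the correlations
print(mean_abs, sd)
```

A histogram of `off_diag` (for example with `numpy.histogram`) would give the kind of distribution summarised in Figure 1; for real embeddings, `W` would be loaded from the trained model instead of sampled.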

### 4.2 Learning Relation Representations

Our theoretical analysis in §3 claims that the performance of the bilinear relational embedding is independent of the tensor operator $\underline{\mathbf{A}}$. To empirically verify this claim, we conduct the following experiment. We use the BATS dataset (Gladkova et al., 2016), which consists of 40 semantic and syntactic relation types, and generate positive examples by pairing word-pairs that have the same relation type. Each relation type has approximately 1,225 word-pairs, which enables us to generate a total of 48k positive training instances (analogous word-pairs) of the form $((h, t), (h', t'))$. For each pair $(h, t)$ related by a relation $r$, we randomly select pairs $(h', t')$ with a different relation type $r' \neq r$, according to the distance between the two pairs, to create negative (non-analogous) instances. (10 negative instances are generated from each word-pair in our experiments.) We collectively refer to both positive and negative training instances as the training dataset.

Using the word embeddings from the CBOW, SG, GloVe, LSA, LDA, and HSC methods created in §4.1, we learn relational embeddings according to (2) by minimising the $\ell_2$ loss, (3). To avoid overfitting, we perform $\ell_2$ regularisation on $\underline{\mathbf{A}}$, and $\mathbf{P}$ and $\mathbf{Q}$ are regularised to diagonal matrices $p\mathbf{I}$ and $q\mathbf{I}$, for $p, q \in \mathbb{R}$. We initialise all parameters by sampling from a uniform distribution and use AdaGrad (Duchi et al., 2011) with the initial learning rate set to 0.01.

Figure 2 shows the Frobenius norm of the tensor $\underline{\mathbf{A}}$ (on the left vertical axis) and the values of $p$ and $q$ (on the right vertical axis) for the six word embeddings. In all cases, we see that as the training progresses, $\underline{\mathbf{A}}$ goes to zero as predicted by Theorem 1 under regularisation. Moreover, we see that approximately $p = -q = \lambda$ is reached for some $\lambda \in \mathbb{R}$ in all cases, which implies that $\vec{r}(\vec{h}, \vec{t}) = \lambda(\vec{h} - \vec{t})$, which is the PairDiff operator. Among the six input word embeddings compared in Figure 1, HSC has the highest mean correlation, which implies that its dimensions are correlated more than in the other word embeddings. This is to be expected by design because a hierarchical structure is imposed on the dimensions of the word embedding during training. However, HSC embeddings also satisfy the $\underline{\mathbf{A}} = \mathbf{0}$ and $p = -q$ requirements, as expected by PairDiff. This result shows that the claim of Theorem 1 is empirically true even when the uncorrelation assumption is mildly violated.

### 4.3 Generalisation Performance on Analogy Detection

So far we have seen that the bilinear relational representation given by (2) does indeed converge to the form predicted by our theoretical analysis for different types of word embeddings. However, it remains unclear whether the parameters learnt from the training instances generated from the BATS dataset accurately generalise to other benchmark datasets for analogy detection. To emphasise, our focus here is not to outperform relational representation methods proposed in previous works, but rather to empirically show that the learnt operator converges to the popular PairDiff for the analogy detection task.

To measure the generalisation capability of the relational embeddings learnt from BATS, we measure their performance on two other benchmark datasets: SAT (Turney and Bigham, 2003) and SemEval 2012 Task 2. Note that we do not retrain $\underline{\mathbf{A}}$, $\mathbf{P}$ and $\mathbf{Q}$ in (2) on SAT or SemEval, but simply use the values learnt from BATS, because the purpose here is to evaluate the generalisation of the learnt operator.

In SAT analogical questions, given a stem word-pair $(a, b)$ with five candidate word-pairs $(c_i, d_i)$, the task is to select the word-pair that is relationally similar to the stem word-pair. The relational similarity between two word-pairs $(a, b)$ and $(c, d)$ is computed by the cosine similarity between the corresponding relational embeddings $\vec{r}(\vec{a}, \vec{b})$ and $\vec{r}(\vec{c}, \vec{d})$. The candidate word-pair that has the highest relational similarity with the stem word-pair is selected as the correct answer to a word analogy question. The reported accuracy is the ratio of the correctly answered questions to the total number of questions. On the other hand, the SemEval dataset has 79 semantic relations, with each relation having ca. 41 word-pairs and four prototypical examples. The task is to assign a score to each word-pair, which is the average of the relational similarity between the given word-pair and the prototypical word-pairs in a relation. Maximum difference scaling (MaxDiff) is used as the evaluation measure in this task.
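The SAT-style selection procedure can be sketched as below: the candidate whose relation embedding has the highest cosine similarity with that of the stem is chosen. The function names and the toy unit-vector embeddings are hypothetical, and the vector offset is used as the relation function purely for illustration (any learnt operator could be passed instead).

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def answer_question(stem, candidates, rel):
    """Return the index of the candidate pair most relationally similar to the
    stem pair, together with all similarity scores."""
    r_stem = rel(*stem)
    sims = [cosine(r_stem, rel(c, d)) for c, d in candidates]
    return int(np.argmax(sims)), sims

def accuracy(predictions, gold):
    """Ratio of correctly answered questions to the total number of questions."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def pair_diff(a, b):
    return a - b

e = np.eye(3)                          # hypothetical toy embeddings
stem = (e[0] + e[1], e[0])             # offset e[1]
candidates = [
    (e[2], e[0]),                      # unrelated offset
    (e[2] + e[1], e[2]),               # same offset as the stem
    (e[0], e[0] + e[1]),               # reversed offset
]

best, sims = answer_question(stem, candidates, pair_diff)
print(best, sims, accuracy([best], [1]))
```

For the SemEval-style scoring, the same `cosine` and relation function could instead be averaged over the prototypical pairs of a relation.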

Figure 3 shows the performance of the relational embeddings composed from 50-dimensional CBOW embeddings. (Similar trends were observed for all six word embedding types but are not shown here due to space limitations.) The levels of performance reported by PairDiff on the SAT and SemEval datasets are shown by horizontal dashed lines. From Figure 3, we see that the training loss gradually decreases with the number of training epochs and the performance of the relational embeddings on the SAT and SemEval datasets reaches that of the PairDiff operator. This result indicates that the learnt relational embeddings not only converge to the PairDiff operator on training data but also generalise to unseen relation types in the SAT and SemEval test datasets.

## 5 Conclusion

We showed that, if the word embeddings are standardised and uncorrelated, then the expected distance between analogous and non-analogous word-pairs is independent of bilinear terms, and the relation embedding further simplifies to the popular PairDiff operator under regularised settings. Moreover, we provided empirical evidence showing the uncorrelation in word embedding dimensions, where their cross-dimensional correlations are narrowly distributed around a mean close to zero. An interesting future research direction of this work is to extend the theoretical analysis to nonlinear relation composition operators, such as nonlinear neural networks.

## References

• Blacoe and Lapata (2012) William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proc. of EMNLP. pages 546–556.
• Blei et al. (2003a) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003a. Latent dirichlet allocation. Journal of Machine Learning Research 3:993–1022.
• Blei et al. (2003b) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003b. Latent dirichlet allocation. Journal of Machine Learning Research 3:993–1022.
• Bollegala et al. (2015a) Danushka Bollegala, Takanori Maehara, and Ken ichi Kawarabayashi. 2015a. Embedding semantic relations into word representations. In Proc. of IJCAI. pages 1222–1228.
• Bollegala et al. (2015b) Danushka Bollegala, Takanori Maehara, Yuichi Yoshida, and Ken ichi Kawarabayashi. 2015b. Learning word representations from relational graphs. In Proc. of AAAI. pages 2146 – 2152.
• Bollegala et al. (2009) Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. 2009. A relational model of semantic similarity between words using automatically extracted lexical pattern clusters from the web. In Proc. of EMNLP. pages 803–812.
• Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proc. of NIPS. pages 2787–2795.
• Deerwester et al. (1990) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391–407.
• Drozd et al. (2016) Aleksandr Drozd, Anna Gladkova, and Satoshi Matsuoka. 2016. Word embeddings, analogies, and machine learning: Beyond king - man + woman = queen. In Proc. of COLING. pages 3519–3530.
• Duc et al. (2010) Nguyen Tuan Duc, Danushka Bollegala, and Mitsuru Ishizuka. 2010. Using relational similarity between word pairs for latent relational search on the web. In Proc. of WI. pages 196–199.
• Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
• Faruqui et al. (2015) Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proc. of ACL. pages 1491–1500.
• Gladkova et al. (2016) Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proc. of SRW@HLT-NAACL. pages 8–15.
• Guo et al. (2016) Shu Guo, Quan Wang, Lihong Wang, Bin Wang, and Li Guo. 2016. Jointly embedding knowledge graphs and logical rules. In Proc. of EMNLP. pages 192–202.
• Hakami and Bollegala (2017) Huda Hakami and Danushka Bollegala. 2017. Compositional approaches for representing relations between words: A comparative study. Knowledge-Based Systems 136:172–182.
• Hayashi and Shinbo (2017) Katsuhiko Hayashi and Masashi Shinbo. 2017. On the equivalence of holographic and complex embeddings for link prediction. In Proc. of ACL. pages 554–559.
• Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of ICML. pages 448–456.
• Jurgens et al. (2012) David A. Jurgens, Saif Mohammad, Peter D. Turney, and Keith J. Holyoak. 2012. Semeval-2012 task 2: Measuring degrees of relational similarity. In Proc. of *SEM. pages 356–364.
• Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proc. of CoNLL. pages 171–180.
• Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of Association for Computational Linguistics 3:211–225.
• Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proc. of ICLR.
• Mikolov et al. (2013b) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proc. of HLT-NAACL. pages 746–751.
• Minervini et al. (2017) Pasquale Minervini, Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. 2017. Adversarial sets for regularising neural link predictors. In Proc. of UAI.
• Mu et al. (2017) J. Mu, S. Bhat, and P. Viswanath. 2017. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. arXiv e-prints.
• Nickel et al. (2015) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A review of relational machine learning for knowledge graphs. In Proc. of the IEEE. volume 104, pages 11–33.
• Nickel et al. (2016) Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. In Proc. of AAAI.
• Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proc. of ICML. pages 809–816.
• Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proc. of EMNLP. volume 14, pages 1532–1543.
• Rocktäschel et al. (2016) Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proc. of ICLR.
• Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proc. of NIPS. pages 926–934.
• Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proc. of ICML.
• Turney and Bigham (2003) Peter D Turney and Jeffrey Bigham. 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proc. of RANLP.
• Turney and Littman (2005) Peter D Turney and Michael L Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning 60(1):251–278.
• Vylomova et al. (2016) Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relational learning. In Proc. of ACL. pages 1671–1682.
• Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In Proc. of ICLR.
• Yin and Schütze (2016) Wenpeng Yin and Hinrich Schütze. 2016. Learning meta-embeddings by using ensembles of embedding sets. In Proc. of ACL. pages 1351–1360.
• Yogatama et al. (2015) Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A. Smith. 2015. Learning word representations with hierarchical sparse coding. In Proc. of ICML. pages 87–96.