Reproducing and learning new algebraic operations on word embeddings using genetic programming

02/18/2017, by Roberto Santana et al. (UPV/EHU)

Word-vector representations associate a high dimensional real-vector to every word from a corpus. Recently, neural-network based methods have been proposed for learning this representation from large corpora. This type of word-to-vector embedding is able to keep, in the learned vector space, some of the syntactic and semantic relationships present in the original word corpus. This, in turn, serves to address different types of language classification tasks by doing algebraic operations defined on the vectors. The general practice is to assume that the semantic relationships between the words can be inferred by the application of a-priori specified algebraic operations. Our general goal in this paper is to show that it is possible to learn methods for word composition in semantic spaces. Instead of expressing the compositional method as an algebraic operation, we will encode it as a program, which can be linear, nonlinear, or involve more intricate expressions. More remarkably, this program will be evolved from a set of initial random programs by means of genetic programming (GP). We show that our method is able to reproduce the same behavior as human-designed algebraic operators. Using a word analogy task as benchmark, we also show that GP-generated programs are able to obtain accuracy values above those produced by the commonly used human-designed rule for algebraic manipulation of word vectors. Finally, we show the robustness of our approach by executing the evolved programs on the word2vec GoogleNews vectors, learned over 3 billion running words, and assessing their accuracy in the same word analogy task.


1 Introduction

In semantic vector word spaces, each word of a given corpus is represented by a vector of real values. One reason that makes this type of representation relevant is that several natural language processing (NLP) tasks can be efficiently implemented on it. In particular, machine learning methods that use this representation have been proposed for named entity recognition [30], question answering [11], machine translation [17], etc.

Another convenient feature of vector word spaces is that the word vectors are able to capture attributional similarities [31] between words. This means that words that appear in similar contexts in the corpus will be close in their vector representation.

From a machine learning point of view, a crucial question is how meaning can be extracted from the relationships between the vectors. Recent works show that vector word representations obtained using neural networks can capture linguistic or relational regularities between pairs of words. For instance, these regularities can be manifested as constant vector offsets between pairs of words sharing a particular relationship [18, 16]. Let us use v_W to represent the vector representation of the word W; then this offset property can be illustrated as v_king - v_man ≈ v_queen - v_woman. In another example, in the vector space constructed by Mikolov et al., the algebraic operation v_king - v_man + v_woman produces a real-valued vector whose closest word in the vector word space is “queen”. More notably, other semantic relationships such as gender inflections, geographical relationships, etc. can be recovered using algebraic operations between vectors.
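As an illustration (not part of the original experiments), this kind of query can be reproduced with the gensim library on any embedding stored in word2vec format; the file name below is a placeholder.

    from gensim.models import KeyedVectors

    # Load a word2vec-format embedding (placeholder file name).
    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # v_king - v_man + v_woman: the closest word is expected to be "queen".
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))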

It has been suggested that the linguistic regularities exhibited by vector representations produced by neural networks are not a consequence of the embedding process itself, but rather are well preserved by it [15]. This seems confirmed by the fact that, for other types of vector representations, linear algebraic operations can also produce meaningful results [15, 24].

While it is evident that simple vector algebraic operations such as those mentioned above capture some of the semantics encoded in the vector space, it is not clear whether other types of operations could support more precise semantic relationships or unearth more complex or subtle relationships hidden in the semantic space. A possible answer to this question could come from exploring, in an efficient way, the space of possible transformations of the vector space so as to find new ways to construct meaning out of word vectors. In this context, genetic programming (GP) [13] arises as a natural candidate.

GP is a search method that explores the space of programs looking for the one that maximizes a given evaluation criterion. Programs can be represented in a variety of ways, but a common choice is a tree-based representation. Mathematical expressions can be easily represented using a tree in which the internal nodes have associated mathematical operators and every terminal node has an associated operand. Trees are evaluated recursively. The output of the GP tree is contrasted with the desired target value for the input variables, and from this comparison the quality of the tree program is assessed and a “fitness” value is assigned to it. A characteristic feature of GP as a search method is that it is evolutionary, i.e., a set of programs (the population) is progressively modified (evolved) by applying random modifications (mutation) and swapping partial trees between programs (crossover).

This paper proposes the use of GP to find a sequence of word vector operations that captures a semantic relationship implicitly encoded in a set of training examples. This constitutes an automatic way to unveil the algebraic operations that express or support a given semantic relationship. We frame the general question of finding a suitable transformation of word vectors on the more specific word analogy task [16, 24]. This task consists of answering a question such as: “a is to b as c is to ?”. A correct answer is the exact word that would fit the analogy. Given the vector representations of the three known words, the problem to be solved by GP is to produce a vector whose closest word in the corpus is the one that correctly answers the question.

Using this particular problem, we address the following research questions: For embedding representations, can meaningful vector algebraic operations be learned from training examples? If so, is GP a feasible approach to do it? How does GP score with regard to the linear algebraic relationship commonly exploited on vector representations? Are GP evolved programs transferable across linguistic tasks, vector representations and corpora?

The remainder of the paper is structured as follows: In the next section we introduce a general background to vector-based representation of words. Section 3 gives a brief introduction to GP. Section 4 reviews related work. In Section 5, the benchmark of the word analogy task dealt with in the paper is described. Section 6 introduces the approach for automatically learning compositional methods using GP. Experiments to evaluate the accuracy of the evolved programs and their transferability across corpora are presented in Section 7. Section 8 presents the conclusions of our paper and discusses future work.

2 Vector-based representations of words

In this section we briefly review some of the foundations on which our work is built. We discuss semantic spaces and the approach that creates word embeddings using shallow neural networks.

2.1 Semantic spaces

In semantic spaces, words are given an associated representation and a number of semantic properties can be inferred from the relationships between these word representations. In this paper we will assume that words are represented as vectors of real numbers, all the vectors with the same dimension. We will alternatively use the terms “word vectors” or “embeddings” to refer to the mapping between words and vectors.

To organize our analysis, we consider two key issues in semantic spaces: i) The possible compositional relationships between the word vectors. ii) The methods used to learn the representations.

Compositional models are conceived to capture the semantics of a multi-word construction from the semantics of its constituents. The underlying idea is that the vector representations of two or more words can be transformed to obtain the representation of the multi-word construction they form. Algebraic combinations of word vectors are of interest in the context of compositional semantics and are also relevant for a variety of machine learning tasks in NLP.

To formalize the analysis of methods for word composition, Mitchell and Lapata [20] define p = f(u, v, R, K) as the composition of vectors u and v, where R is the syntactic relation in which the pair of words represented by the vectors stand, and K is some background knowledge.

Compositional methods then propose multiple ways of defining the function f. For instance, it is usually assumed that p is a linear function of the Cartesian product of u and v, simply defined as

p = A u + B v        (1)

where A and B are matrices which determine the contributions made by u and v to p.

In [20], additive models such as the one represented by Eq. (1), as well as multiplicative models, are discussed. Other compositional models that consider contextual information (neighboring words) have also been proposed [12].
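A minimal sketch of the additive model of Eq. (1), using numpy and assuming the matrices A and B are given (here identity matrices, which reduces the model to plain vector addition):

    import numpy as np

    d = 200                 # embedding dimension
    u = np.random.randn(d)  # vector of the first constituent word
    v = np.random.randn(d)  # vector of the second constituent word

    A = np.eye(d)           # contribution of u; identity as a placeholder
    B = np.eye(d)           # contribution of v; identity as a placeholder

    p = A @ u + B @ v       # composed representation of the two-word construction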

Additive and multiplicative compositional models are limited because, among other reasons, they are based on commutative operators that do not attribute any role to the order of the constituents in multi-word constructions. The repertoire of available operations is also constrained if we compare it to the vast range of possible vector manipulations that could be encoded by a more general “program”.

The second issue relevant for semantic spaces is how they are created. Vector spaces can be learned by computing statistical measures of correlation between words, or by capturing word-context relationships while scanning the sentences of a given corpus. Neural networks have been applied to implement the latter approach [5, 32].

In this paper we have used vectors learned by the application of shallow neural networks as proposed in [16]. In the following section we briefly review this approach.

2.2 Learning word embeddings using neural networks

Figure 1: Continuous Bag-of-Words (CBOW) model as proposed in [16].

In [16], two neural-network based models were proposed to learn embeddings: the Skip-gram and Continuous Bag-of-Words (CBOW) models. Skip-gram learns to predict the surrounding words of a given word in a sentence. CBOW learns to predict, given the surrounding words, the word most likely to appear in the center. We focus on the CBOW model.

CBOW is a feed-forward neural net language model [1] with a number of changes. The most important difference is that the hidden layer has been removed. The rationale behind this modification was to explore simpler models: they cannot represent the non-linear interactions that neural networks with hidden layers can, but they are much more efficient for learning from millions of words. The CBOW network also uses a Huffman binary tree for a more efficient representation of the word vocabulary and a hierarchical softmax scheme.
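A simplified sketch of the resulting architecture is given below. It uses a plain softmax output over the vocabulary instead of the Huffman-tree hierarchical softmax of the actual model, so it only illustrates the absence of a hidden layer (the projection is a simple average of the context embeddings).

    import numpy as np

    V, d = 10000, 200                     # vocabulary size, embedding dimension
    W_in = 0.01 * np.random.randn(V, d)   # input (context) embeddings
    W_out = 0.01 * np.random.randn(d, V)  # output weights

    def cbow_forward(context_ids):
        # No hidden layer: project by averaging the context word embeddings.
        h = W_in[context_ids].mean(axis=0)
        scores = h @ W_out
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()            # probability of each word being the center word

    probs = cbow_forward([12, 48, 377, 1024])  # ids of the surrounding words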

Figure 1 shows a schematic representation of the CBOW architecture [16]. Learning is done by scanning the corpus and considering, for each target word w_t, a window comprising the words from w_{t-c} to w_{t+c}, where c is the window size. In the results reported in [16], the best results were obtained using a context of four history and four future words. The model was trained using stochastic gradient descent and backpropagation.

2.3 Generation of the embeddings

To generate the embeddings we work with in this paper, we have used the text8.zip corpus (available from http://mattmahoney.net/dc/text8.zip). This corpus has been extracted from the English Wikipedia (details on the procedure to extract the data are available from https://cs.fit.edu/%7Emmahoney/compression/textdata.html). It comprises around 17 million words.

We use the original word2vec implementation of Mikolov et al. [18, 16], available from http://code.google.com/p/word2vec, to train the CBOW network from the corpus and generate the embeddings. The parameters used by the word2vec program to generate the embedding are described in Table 1.

Parameter    Value
vector size  200
window       8
negative     25
hs           0
sample       1e-4
threads      6
binary       1
iter         15
Table 1: Parameters used by word2vec to train the CBOW model.
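The paper uses the original C implementation; for reference, a roughly equivalent training run can be sketched with gensim (parameter names follow recent gensim versions, e.g. vector_size/epochs instead of the older size/iter).

    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    sentences = Text8Corpus("text8")      # path to the unzipped text8 corpus
    model = Word2Vec(
        sentences,
        sg=0,                             # 0 selects the CBOW architecture
        vector_size=200,                  # "vector size" in Table 1
        window=8,
        negative=25,
        hs=0,
        sample=1e-4,
        workers=6,                        # "threads"
        epochs=15,                        # "iter"
    )
    model.wv.save_word2vec_format("vectors.bin", binary=True)  # "binary 1"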

The CBOW model is only generated once. With regard to the GP implementation, the most important parameter is the vector size. A larger vector size may allow a more accurate representation of the words. However, the vector size also influences the computational cost of the algebraic operations between word vectors, which are applied intensively while GP searches for an optimal way to compose the words.

To evaluate the scalability and robustness of the programs evolved by GP, we also used a much larger embedding. The word2vec word vector model available from https://github.com/mmihaltz/word2vec-GoogleNews-vectors comprises 3 million 300-dimensional English word vectors and was trained on the Google News corpus (3 billion running words).
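These vectors can be loaded with gensim in the same way as the text8 embedding, so the evolved programs can be re-executed without modification (the file name corresponds to the archive distributed in the linked repository).

    from gensim.models import KeyedVectors

    news = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)
    print(news.vector_size)  # 300-dimensional vectors, vocabulary of about 3 million words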

3 Genetic programming

Genetic programming [13, 25] is a domain-independent method for the automatic creation of programs that solve a given problem. Each GP program can be seen as a candidate solution to the problem. The process to find the optimal solution is posed as a search in the space of possible programs. The search is organized using a traditional evolutionary optimization approach in which sets (populations) of programs are evolved and transformed by the application of the so-called mutation and crossover operators.

Algorithm 3: GP algorithm

  1. Generate a population P of GP individuals randomly and evaluate them using the fitness function.
  2. Repeat until a stop criterion is met:
     (a) Select a population P_s from P according to a selection method.
     (b) Create a population P_c by applying genetic crossover to individuals in P_s with probability p_cx.
     (c) Apply mutation to individuals in P_c with probability p_m.
     (d) Evaluate the individuals in P_c.

Algorithm 3 shows the pseudocode of a very general GP algorithm. Issues in the application of GP are the choice of the program representation, the algebraic operators used by the program, and the objective or fitness function to evaluate the programs. We will discuss these issues in more detail in Section 6. However, in order to build some intuition on the particular way in which GP is used in this paper, we present a simple example of the representation.

Let us consider that the three words in the question “a is to b as c is to ?” are transformed to their vector representations, which will be the three arguments of a program. They are transformed as: a → v_a, b → v_b, c → v_c. Then, the linear algebraic rule to compute the answer to the question, i.e., v_d = v_b - v_a + v_c, could be represented as add(sub(v_b, v_a), v_c), where add indicates addition and sub subtraction. Figure 2 shows four GP programs that produce the same rule. The representation shown in Figure 2 is called a tree-based GP representation and is the one used in this paper. The tree representation is a convenient way to recursively organize the evaluation of a particular composition of the word vectors. Depending on the set of available operators (those defined in the non-terminal nodes of the trees), a richer space of possible word vector compositions could be represented. What the GP algorithm does is to bias the search toward those programs that maximize the given fitness function.

Figure 2: Four programs evolved by the GP algorithm. All implement the linear algebraic rule v_b - v_a + v_c.
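Written as plain Python functions over numpy vectors, the trees of Figure 2 reduce to nested calls such as the following (a sketch; the operator names mirror the GP operator set described in Section 6):

    import numpy as np

    def add(w, v): return w + v
    def sub(w, v): return w - v
    def neg(w):    return -w

    v_a, v_b, v_c = np.random.randn(3, 200)  # stand-ins for the three question words

    out1 = add(sub(v_b, v_a), v_c)           # one possible tree
    out2 = add(add(neg(v_a), v_b), v_c)      # an equivalent tree
    assert np.allclose(out1, out2)           # both implement v_b - v_a + v_c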

4 Related work

Levy and Goldberg [15] investigate the question of how to recover relational similarities in word embeddings. They show that the linear algebraic rule proposed by Mikolov et al. [16] to solve analogy recovery is equivalent to searching for a word that maximizes a linear combination of three word similarities. Furthermore, they propose an alternative way to compute the distance between the generated vector and the set of words from the corpus. While our research is related to the work presented in [15], our focus is on the operations involved in the generation of the candidate word vector, not on the way the match between the generated vector and the word vectors in the corpus is assessed.

Pennington et al. [24] introduce global log-bilinear regression models as an alternative to shallow neural networks for producing word embeddings. They show that their model is able to produce a word vector space with meaningful substructure. The algebraic rule they use to solve the word analogy task is the same as that originally introduced in [16]. Although they applied the distance measure previously presented by Levy and Goldberg [15], they report that this distance did not produce better results than the original one. The work presented in [24] is relevant for our research since it confirms that the usability of the word vector algebraic rule extends to vector representations obtained using a variety of model types and algorithms.

Socher et al. [27, 28] propose a method for compositional representation of words that learns a binary parse tree for each input phrase or sentence. The leaves of the tree contain the vector representations of words. The tree serves as the basis for the application of a deep recursive autoencoder. Although this representation uses a tree structure to combine the word vectors, it is completely different from a GP approach. Furthermore, trees are independently inferred for each single sentence or multi-word phrase.

Grefenstette and Sadrzadeh [9] propose associating different levels of meaning of words with different types of representations. For example, verbs and other relational words would be represented by matrices, while nouns would be represented as vectors. Algebraic operations involving matrices and vectors are used to produce sentence vectors. In principle, GP approaches could cater for the joint use of vector and matrix representations by means of strongly typed GP [21] or other GP variants that guarantee type-constraint enforcement. However, it makes more sense to exhaust the potential of homogeneous word representations before resorting to GP based on more complex word representations.

In [3], three word vector representations and three compositional methods (addition of word vectors, multiplication of vectors, and the aforementioned deep recursive autoencoder approach) are combined to evaluate their applicability to estimating phrase similarity and to paraphrase detection (i.e., determining whether two sequences have the same meaning). The reported results show that diverse combinations of representations and compositions produce the best results for different problems. The authors state that “the sizes of the involved training corpora and the generated vectors are not as important as the fit between the meaning representation and compositional method.” This fact highlights the importance of finding an appropriate compositional method.

As a summary of this brief review of related work on compositional methods, we point out that, although several papers emphasize the important role of these methods for solving an array of semantic tasks, we did not find any previous report of the automatic learning of word compositions.

It is important to notice that, from a machine learning point of view, the word analogy task is not a classification problem, even if the quality of a solution can be given in terms of accuracy, as the fraction of correctly answered questions. Nor is it a classical regression problem, since each input and output is represented by a high-dimensional vector. In this context, GP has been less investigated than for classical classification and regression problems. However, GP has been applied to a miscellany of tasks in information retrieval [6, 7, 22, 29]. In particular, Oren [22] combines GP with vector-based representations of documents for information retrieval. Other problems that involve text classification have also been addressed with GP. Two related areas where GP has been applied are document ranking [29] and term-weighting learning [6, 7]. We did not find any previous report on the combination of genetic programming and word embeddings.

5 Problem benchmark: word analogy task

The word analogy task consists of answering a question such as: “a is to b as c is to ?” A correct answer is the exact word that would fit the analogy. Table 2 shows several example questions. We used the benchmark proposed by Mikolov et al. [18], in which questions are separated into groups. In Table 2, Group refers to the group from which the example was taken.

Group word 1 word 2 word 3 Answer
4 boy girl sons daughters
5 amazing amazingly slow slowly
6 honest dishonest known unknown
7 bad worse old older
8 bad worst good best
9 code coding walk walking
11 dancing danced feeding fed
12 banana bananas car cars
13 decrease decreases say says
Table 2: Examples of questions in the word analogy task.

Table 3 shows the description of the word analogy task benchmark. In this table, n is the number of questions in the original benchmark and n' is the number of questions after removing those that contain words not appearing in the shortened corpus we used in our experiments. Since the corpus we use is relatively small, for four of the groups of questions (“capital-world”, “currency”, “city-in-state”, “nationality-adjective”) we did not find one or more of the four words for each of the questions. Therefore, these four groups of questions were excluded from our analysis.

Group  Name                          n     n'
4      family (gender inflections)   506   305
5      gram1-adjective-to-adverb     992   755
6      gram2-opposite                812   305
7      gram3-comparative             1332  1259
8      gram4-superlative             1122  505
9      gram5-present-participle      1056  991
11     gram7-past-tense              1560  1331
12     gram8-plural (nouns)          1332  991
13     gram9-plural-verbs            870   649
Table 3: Description of the word analogy task benchmark, where n is the number of questions in the original database and n' is the number of questions after removing those that contain words not appearing in the shortened corpus.
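The filtering that produces n' from n amounts to discarding any question containing a word that is missing from the embedding vocabulary; a minimal sketch, assuming questions are stored as (a, b, c, answer) tuples and the embedding is a gensim KeyedVectors object:

    def filter_questions(questions, kv):
        # Keep only questions whose four words all have a vector in the model.
        return [q for q in questions if all(word in kv for word in q)]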

6 Description of the GP approach

We study the automatic learning of word composition on the specific problem introduced above: given the vector representations of the three words that define a question, the GP task is to produce a vector whose closest word in the corpus is the one that correctly answers the question. We mainly use the CBOW model learned with word2vec to determine the word vector encoding a given word, and to find the word in the model whose encoding vector is closest to a target word vector. The pseudocode of the GP algorithm we used is shown in Algorithm 3. It is a straightforward implementation of tree-based genetic programming.

The selection method used is truncation selection: after sorting the individuals according to their fitness, the best solutions are kept for crossover and mutation. Uniform mutation randomly selects a point in the tree individual and replaces it with a random subtree. One-point crossover is used: it randomly selects a subtree in each of two individuals and then exchanges them. Mutation and crossover were each applied with a fixed probability.

The choice of genetic operators has been made as simple as possible to enhance the readability of the algorithm. While more sophisticated GP methods exist, our focus here is a proof of concept of the automatic generation of compositions, and for that purpose the choice of operators was appropriate. On the other hand, we conducted a set of preliminary experiments with other mutation and selection operators (those included in the DEAP library used to implement the algorithms) and did not observe significant changes in the results when all groups of analogy questions were considered. Some operators can produce more accurate programs for some particular group, but they are then outperformed by other methods on other groups.

In the experiments, a fixed population size was used and the stop criterion is a maximum number of generations. Our GP implementation was written in Python. It is based on the EA software DEAP [8] (http://deap.readthedocs.io/en/master/api/tools.html) and the gensim package (https://radimrehurek.com/gensim/), a Python-based implementation of NLP algorithms [26]. gensim includes methods for interrogating the model generated by word2vec. Our code is openly available at https://github.com/rsantana-isg/GP_word2vec.
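A minimal sketch of how such a tree-based GP can be set up with DEAP is shown below (a reduced operator set is used for brevity; the population size, depth limits and the remaining operators of Table 4 are omitted).

    import numpy as np
    from deap import base, creator, gp, tools

    # Programs take the three question-word vectors as arguments.
    pset = gp.PrimitiveSet("MAIN", 3)
    pset.renameArguments(ARG0="va", ARG1="vb", ARG2="vc")
    pset.addPrimitive(np.add, 2)
    pset.addPrimitive(np.subtract, 2)
    pset.addPrimitive(np.multiply, 2)
    pset.addPrimitive(np.negative, 1)

    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMax)

    toolbox = base.Toolbox()
    toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=4)
    toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("compile", gp.compile, pset=pset)
    toolbox.register("mate", gp.cxOnePoint)            # one-point crossover
    toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
    toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)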

6.1 GP operators

The set of operators used by the programs is shown in Table 4. All operators are defined on vectors of the same dimension. There are two classes of operators: binary and unary. The add, sub, and mul operators have the following meaning: vector addition, vector subtraction, and vector component-wise multiplication, respectively; saveDiv corresponds to protected division (usual division, except that a division by zero for any of the vector components returns zero). We discarded the possibility of including fixed (vector) random constants as terminals of the programs, since they may depend on the size of the vector and our aim was to produce programs scalable to any vector dimension. We set a constraint on the depth of the trees to reduce the complexity of the programs.

Binary (B.) operators on (w, v): add, sub, mul, saveDiv
Unary (U.) operators on w: neg, diff, abs, cos, sin, Roll, Rint, Half, Norm, Log1p
Table 4: Set of operators and terminals used by the tree-based GP algorithm. Binary and unary operators are denoted B. and U., respectively.
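Possible implementations of these operators on equal-length numpy vectors are sketched below. The semantics of saveDiv is the protected division described above; the exact definitions of some unary operators (Roll, Half, Norm, diff, Log1p) are not spelled out in the text, so the versions here are assumptions.

    import numpy as np

    def saveDiv(w, v):
        # Protected division: component-wise division, returning 0 where v is 0.
        return np.divide(w, v, out=np.zeros_like(w), where=(v != 0))

    def roll(w):   return np.roll(w, 1)             # assumed: cyclic shift by one position
    def rint(w):   return np.rint(w)                # round every component to the nearest integer
    def half(w):   return 0.5 * w                   # assumed: halve every component
    def norm(w):                                    # assumed: scale to unit length
        n = np.linalg.norm(w)
        return w / n if n > 0 else w
    def log1p(w):  return np.log1p(np.abs(w))       # assumed: applied to |w| to stay defined
    def diff(w):   return np.diff(w, append=w[-1])  # assumed: component-wise differences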

6.2 Fitness function

A critical component of a GP implementation is the definition of the fitness function. We implement the fitness evaluation as follows. At the time of evaluating a candidate program, it is applied to a training set of questions. For each question, the word vectors of the first three words are obtained from the CBOW model. The program is then evaluated using these three word vectors as arguments, and the program's output word vector is used to compute the quality of the program for the question.

Let us consider a candidate program and the first question in Table 2 as an example. First, we obtain the word vectors v_boy, v_girl, and v_sons. Then, from the execution of the GP program on these three vectors, we obtain an output word vector. This vector is presented to the CBOW model, which outputs the closest word in the model. If this word coincides with the answer to the question (“daughters”), a counter of correctly answered questions is increased. The final fitness value is the proportion of questions in the training set that were correctly answered.

The fitness function serves as a direct assessment of the program quality because we can directly test whether the program produces vectors whose semantics is the one encoded by the question. However, it has an important drawback. The computational cost of repeatedly interrogating the model to determine the closest word to a given vector is very high, and it would increase with the size of the vocabulary if larger corpora were used. To diminish this cost we introduced three changes to the GP scheme.

  1. Restricted vocabulary size for interrogation: The word2vec implementation allows the search for the most likely word given a vector to be restricted to the most frequent words in the vocabulary. We restricted the search to a subset of the most frequent words, well below the total size of the vocabulary, which reduces the computational time of the fitness function.

  2. Partial evaluation: Each program is evaluated on a fraction of the questions from the training set. When evaluating a program, a random subset of the questions from the training set is first selected, and the accuracy of the program is measured on this subset. This means that different programs are evaluated on distinct subsets of questions.

  3. Early halt: While sequentially evaluating the questions in the (random subset of the) training set, the program does not complete the evaluation of all questions and halts if: 1) an invalid output vector is generated for any of the questions; or 2) after at least ten questions have been “answered”, the proportion of correctly answered questions falls at some point below a fixed threshold, in which case it is clearly a poorly performing program.

All the previous enhancements considerably increase the efficiency of the algorithm. While partial evaluation adds some variability to the fitness values of the programs, good programs are in general good across subsets of questions, and poor programs cannot specialize in niches of questions, since the subset selection in the training set is made randomly.
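A sketch of the resulting fitness computation is given below, assuming the compiled program is a Python callable, questions are (a, b, c, answer) tuples, and kv is a gensim KeyedVectors object; the vocabulary restriction, the sampled fraction, and the early-halt threshold are illustrative values, not the ones used in the paper.

    import random
    import numpy as np

    def fitness(program, questions, kv, restrict_vocab=30000, sample_frac=0.2):
        subset = random.sample(questions, max(1, int(sample_frac * len(questions))))
        correct = 0
        for i, (a, b, c, answer) in enumerate(subset, start=1):
            out = program(kv[a], kv[b], kv[c])           # run the evolved tree
            if not np.all(np.isfinite(out)):             # early halt: invalid output
                return 0.0
            best = kv.similar_by_vector(out, topn=1,
                                        restrict_vocab=restrict_vocab)[0][0]
            correct += int(best == answer)
            if i >= 10 and correct / i < 0.05:           # early halt: clearly poor program
                return correct / i
        return correct / len(subset)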

7 Experiments

The main objective of the experiments is to determine the quality of the programs generated by GP. We compare their results on the word analogy task with those obtained by the application of the linear algebraic rule v_b - v_a + v_c, which is the one commonly used for the composition of words for this problem. In addition, we evaluate the transferability of the best programs by applying them to a vector space comprising around 3 million vectors, far larger than the vector space we used to learn the programs.

For each fitness function and each group of questions described in Table 3, a number of independent runs of the algorithm were executed. Each group of questions was split into a training and a test set with the same number of questions. The questions in the test set were not used at any time during the evolution.

7.1 Numerical results

We evaluate the performance of the GP algorithm by looking at the accuracy of the best GP programs found. The accuracy, for each group of questions, is the proportion of questions correctly answered by a GP program. For each run we keep all the solutions in the last selected population. Among these programs, the one with the highest accuracy on the training set is selected, and we then also compute the accuracy of this program on the test set. Over all runs, the maximum and mean accuracy are calculated on the training and test sets. Table 5 shows these values for the nine groups of questions. The table also shows the accuracy produced by the algebraic rule. It can be seen that the best GP-evolved programs outperform the algebraic rule on all the groups of questions, although the difference in the results is more noticeable for some groups of questions (e.g., group 6). The mean accuracy of the programs on the test set is also higher than that achieved using the algebraic rule for six of the nine groups of questions. Notice that, since our selection of the best programs was based on the accuracy for the training set, there might be programs with a higher accuracy on the test set; we did identify some of these programs. Interestingly, for some groups of questions (e.g., group 11) the maximum and mean accuracy on the training set is smaller than on the test set.

Group  Train max.  Train mean  Test max.  Test mean  Rule
4 84.21 82.74 83.66 82.77 77.70
5 24.67 21.62 26.46 21.31 16.16
6 49.34 41.86 50.98 41.11 24.92
7 67.09 66.00 61.27 58.62 60.44
8 46.83 44.96 41.11 38.72 40.40
9 48.08 45.09 44.56 39.89 36.83
11 45.71 41.97 46.70 44.47 37.94
12 76.16 73.13 72.38 69.41 66.50
13 49.69 40.42 38.77 32.38 34.21
Table 5: Results of the GP algorithm in terms of the maximum and mean accuracy of the best program. Training and test sets of questions are used to evaluate the accuracy of the best program of each run. The last column corresponds to the algebraic rule v_b - v_a + v_c.

In a second phase of the experiments, the best program generated in the last generation of each execution of the GP algorithm was selected based on the sum of the training and test set accuracy values for the corresponding group of questions. We evaluated this set of programs using the word2vec GoogleNews vectors. These vectors have a larger dimension (300 versus 200 in the text8 vector space) and comprise around 3 million words. As a consequence, these vectors contain all the words of the original groups of questions introduced in [16]. We must remember that the reason why we did not use four of the original groups of questions was that the text8 vector space did not include the vector representations of all constituent words of each question in these groups.

Using the word2vec GoogleNews vectors, we can test the evolved programs on all the data sets. The same set of operations encoded in the programs is applied, but this time using the new vector representation. The output vector is then submitted to the model, which determines whether the closest word in the space of word2vec GoogleNews vectors is the right answer to the question.

The results of this evaluation are shown in Table 6, where each row corresponds to one of the original groups of questions. Each column shows the accuracy produced by the best program among those evolved for the group of questions indicated in the column header. The last column shows the accuracy results of the algebraic rule. In each row, all the programs that produce results better than the one in the last column are shown in bold.

Notice that we can evaluate a GP program on all groups of questions, independently of the group used to learn it. Since all the questions have the same structure, we can apply the programs to any of them.

Group 4 5 6 7 8 9 11 12 13 rule
1 77.80 63.41 51.43 81.72 81.49 76.23 81.41 78.60 81.49 81.49
2 14.68 10.87 2.08 16.18 15.95 14.68 16.07 16.07 15.95 15.95
3 65.94 41.16 33.62 72.75 72.75 69.06 68.98 70.48 74.17 72.75
4 73.27 62.18 48.71 77.62 77.62 73.66 77.23 77.62 77.62 77.62
5 27.75 18.77 17.66 28.86 28.86 21.39 29.87 28.86 28.86 28.86
6 34.16 27.87 28.61 32.68 32.43 34.28 33.54 32.31 32.43 32.43
7 88.66 63.19 71.83 89.56 89.56 86.33 89.93 89.26 89.56 89.56
8 65.21 48.26 33.27 67.53 67.53 46.65 69.05 68.06 67.53 67.53
9 70.90 60.00 61.90 73.18 73.18 61.99 71.09 72.70 74.41 73.18
10 85.17 83.79 71.78 85.42 85.42 82.23 85.36 85.36 85.42 85.42
11 62.92 43.30 48.69 68.44 66.20 64.27 64.46 66.13 66.20 66.20
12 76.93 74.76 74.68 78.36 78.36 75.13 78.21 78.14 78.36 78.36
13 55.12 21.75 29.69 65.71 63.87 56.85 62.83 62.83 63.87 63.87
Table 6: Results of the best programs produced by the GP algorithm on all groups of questions when the set of GoogleNews vectors is used to execute the GP programs. The last column corresponds to the algebraic rule v_b - v_a + v_c.
Group 4 5 6 7 8 9 11 12 13
1 23 52 76 104 129 153 185 235 261
2 9 52 76 104 129 153 185 235 261
3 15 58 76 120 129 153 198 235 247
4 23 58 76 120 129 178 185 235 261
5 23 52 76 120 129 178 208 235 261
6 23 58 76 111 129 178 196 224 261
7 23 58 76 120 129 178 208 235 261
8 23 52 76 120 129 178 208 235 261
9 23 58 76 104 129 153 195 235 247
10 23 58 76 120 129 178 208 224 261
11 23 58 76 104 129 178 208 235 261
12 23 37 76 120 129 154 208 235 261
13 23 58 76 104 129 153 195 235 261
Table 7: Indices of the best programs produced by the GP algorithm on all groups of questions.

There are a number of remarkable facts in the results shown in Table 6:

  1. Some of the programs improved the accuracy for groups of questions that were not in the original reduced benchmark of groups (e.g., groups 1, 2, and 3 in Table 6).

  2. The best program for a group of questions is not, in general, a program evolved to answer that group of questions.

  3. There are programs evolved for some groups of questions that are good at answering questions from all groups. For example, this happens with programs learned using the group of questions 7.

7.2 Evaluating answers and evolved programs

One important issue is the interpretability of the evolved programs and how they are related to the algebraic rule. Among the programs tested, a number were exactly equivalent to the algebraic rule. Four of these programs are shown in Figure 2. It can be seen how the same rule is implemented in distinct ways using only the add, sub, and neg operators. These results show that, as an algorithm to create word compositions, GP can automatically learn compositional methods designed by humans.

We also analyzed those GP programs that outperformed the algebraic rule. An exemplar of this type of program is shown in Figure 3. It was the best program found for one of the groups of questions, and its accuracy using the word2vec GoogleNews vectors was above that of the algebraic rule for the same group of questions.

The tree shown in Figure 3 is a slight modification of the algebraic rule: one of the terms added by the rule is replaced by a slightly different expression, and this change allows the program to increase the accuracy for the group of questions. A trend observed in other evolved programs was that they contained building blocks of the algebraic rule. As in the case of the programs shown in Figure 2, these structural features were not specifically induced; they were acquired as part of the evolutionary process. Other programs that produced high accuracy values are shown in Figures 4 to 8. When analyzing the behavior of these programs, Tables 6 and 7 should be consulted.

Figure 3: One of the best evolved programs, indexed in Table 7. It produced, among the programs selected, the best accuracy for one of the groups of questions, and its accuracy using the word2vec GoogleNews vectors was above that of the algebraic rule for the same group of questions. See Tables 6 and 7 for details of the program behavior.
Figure 4: Another of the evolved programs indexed in Table 7. See Tables 6 and 7 for details of the program behavior.
Figure 5: Another of the evolved programs indexed in Table 7.
Figure 6: Another of the evolved programs indexed in Table 7.
Figure 7: Another of the evolved programs indexed in Table 7.
Figure 8: Another of the evolved programs indexed in Table 7.

7.3 Discussion

We go back to the research questions posed at the beginning of this work and try to answer them based on the results of our experiments.

  • For embedding representations, can meaningful vector algebraic operations be learned from training examples? Yes, they can be learned within a relatively small computational time.

  • If so, is GP a feasible approach to do it? Yes, GP is a natural solution for this type of problem and even straightforward implementations can deal with the problem.

  • How do GP programs score with regard to the algebraic rule commonly applied on vector representations? GP programs can learn the same rule designed by humans and, therefore, can reach the same results. They can also outperform these results, but, at least for the class of word vector representation and the basic tree-based GP approach implemented, the improvements are moderate.

  • Are GP-evolved programs transferable across linguistic tasks, vector representations and corpora? Definitely. The high transferability of the programs across groups of questions may be supported by the general underlying commonality between the analogies that these groups of questions represent. However, it is remarkable how the programs can be transferred to a vector space where both the dimension of the vectors and the number of vectors increase dramatically. In this respect, transferability opens an additional opportunity for efficiency gains: programs can be learned using small vector spaces, and then validated or refined on more computationally costly large vector spaces.

8 Conclusions and future work

While semantic spaces and word vector representations are able to capture some of the semantic relationships between words, compositional methods are necessary to extend their use to multi-word constructions. In this paper we have proposed representing compositional vector operations as simple programs that can be automatically learned from data. We have shown that, using GP, it is possible to encode a set of vector operations as a program, that the programs can be evolved to achieve higher accuracy than the human-designed rules conceived for manipulating word vectors, and that the programs are valid for datasets other than those from which they have been learned, i.e., they are transferable programs. Furthermore, our results indicate that it is possible to learn programs using vector vocabularies of small to moderate size and then test them in bigger domains where the evaluation of a program is more costly.

8.1 Future work

As lines for future work we consider the following:

8.1.1 Use alternative methods for the word vector generation

While GP approaches can explore a vast range of possible word compositions, the usefulness of more intricate programs is, to some extent, constrained by the nature of the relationships that the vectors can encode. For example, if the methods used to construct the embeddings do not allow non-linear relationships between the vectors, then the improvements of the GP programs over plain linear algebraic compositional operators will be marginal. Therefore, it would be important to test the automatic generation of word compositions with GP on word vectors generated using diverse methods.

8.1.2 Evolve functions for the similarity metric

Since it has been shown that the type of similarity metric can critically influence the accuracy results [15], it makes sense to learn this function as well. One difficulty is that the output of this function will be a numerical value and not a vector like the other operators used in the current GP representation. In addition, evaluating an alternative similarity metric implies using the candidate metric to compute distances to all vectors in the vector space, a process that can be very costly computationally.

8.1.3 Combining different word representations

Turian et al. [30] have shown that combining different word representations can improve accuracy for supervised classification tasks in NLP. We envision the evolution of programs that are able to combine different word vector representations.

8.1.4 Using more sophisticated GP approaches

From the point of view of research in genetic programming, word embeddings open an interesting research line. More research is needed to identify which, among more sophisticated GP approaches, are the most appropriate for their application to semantic spaces. Among possible lines of research are the following:

  • Alternative GP representations: In addition to trees, other GP representations such as grammars [23, 4] and Cartesian GP [19] could be considered.

  • More complex descriptions of the compositional operators: One open question is to what extent more complex functions can better exploit the underlying semantic relationships between the word vectors. This could be investigated by adding other algebraic operators to the set of GP functions, including ternary operators. Another possibility is representing the composition of vectors with ensembles of GP programs [2].

  • Reusing problem information: Approaches able to identify and transfer building blocks [10] between word vectors or corpora of varying dimensions arise as potential candidates.

  • Behavioral program synthesis: One direction in which the evolution of the programs could be improved is by analyzing and assessing the quality of the intermediate vectors produced in the evaluation of the programs. In general, algorithms that advocate a more efficient use of the information displayed by the behavior of the GP programs [14] could lead to better solutions and reveal additional insights in learning compositional methods.

Acknowledgments

This work has received support through the IT-609-13 program (Basque Government), TIN2016-78365-R (Spanish Ministry of Economy, Industry and Competitiveness) and the Brazilian CNPq Program Science Without Borders No. 400125/2014-5.

References

  • [1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • [2] U. Bhowan, M. Johnston, M. Zhang, and X. Yao. Evolving diverse ensembles using genetic programming for classification with unbalanced data. IEEE Transactions on Evolutionary Computation, 17(3):368–386, 2013.
  • [3] W. Blacoe and M. Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics, 2012.
  • [4] P. A. Bosman and E. D. De Jong. Learning probabilistic tree grammars for genetic programming. In International Conference on Parallel Problem Solving from Nature, pages 192–201. Springer, 2004.
  • [5] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
  • [6] R. Cummins and C. O’Riordan. An analysis of the solution space for genetically programmed term-weighting schemes in information retrieval. In P. S. P. M. D. Bell, editor, 17th Artificial Intelligence and Cognitive Science Conference (AICS 2006), Queen’s University, Belfast, 2006.
  • [7] H. J. Escalante, M. A. García-Limón, A. Morales-Reyes, M. Graff, M. Montes-y Gómez, E. F. Morales, and J. Martínez-Carranza. Term-weighting learning via genetic programming for text classification. Knowledge-Based Systems, 83:176–189, 2015.
  • [8] F.-A. Fortin, F.-M. De Rainville, M.-A. Gardner, M. Parizeau, and C. Gagné. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1):2171–2175, 2012.
  • [9] E. Grefenstette and M. Sadrzadeh. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1394–1404. Association for Computational Linguistics, 2011.
  • [10] M. Iqbal, W. Browne, and M. Zhang. Reusing building blocks of extracted knowledge to solve complex, large-scale Boolean problems. Evolutionary Computation, IEEE Transactions on, 18(4):465–480, Aug 2014.
  • [11] M. Iyyer, J. L. Boyd-Graber, L. M. B. Claudino, R. Socher, and H. Daumé III. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing (EMNLP), pages 633–644, 2014.
  • [12] W. Kintsch. Predication. Cognitive science, 25(2):173–202, 2001.
  • [13] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge, MA, 1992.
  • [14] K. Krawiec, J. Swan, and U.-M. O’Reilly. Behavioral program synthesis: Insights and prospects. In Genetic Programming Theory and Practice XIII, pages 169–183. Springer, 2016.
  • [15] O. Levy and Y. Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Language Learning, pages 171–180, 2014.
  • [16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
  • [17] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.
  • [18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [19] J. F. Miller and P. Thomson. Cartesian genetic programming. In European Conference on Genetic Programming, pages 121–132. Springer, 2000.
  • [20] J. Mitchell and M. Lapata. Composition in distributional models of semantics. Cognitive science, 34(8):1388–1429, 2010.
  • [21] D. J. Montana. Strongly typed genetic programming. Evolutionary computation, 3(2):199–230, 1995.
  • [22] N. Oren. Improving the effectiveness of information retrieval with genetic programming. Master’s thesis, Faculty of Science of the University of Witwatersrand, Johannesburg, 2002.
  • [23] M. O’Neil and C. Ryan. Grammatical evolution. In Grammatical Evolution, pages 33–47. Springer, 2003.
  • [24] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), volume 14, pages 1532–1543, 2014.
  • [25] R. Poli, W. B. Langdon, N. F. McPhee, and J. R. Koza. A field guide to genetic programming. Lulu.com, 2008.
  • [26] R. Řehůřek and P. Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
  • [27] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 2011 Conference Advances in Neural Information Processing Systems 24, NIPS, volume 24, pages 801–809, 2011.
  • [28] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the conference on empirical methods in natural language processing, pages 151–161. Association for Computational Linguistics, 2011.
  • [29] A. Trotman. Learning to rank. Information Retrieval, 8(3):359–381, 2005.
  • [30] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
  • [31] P. D. Turney. Similarity of semantic relations. Computational Linguistics, 32(3):379–416, 2006.
  • [32] A. Zhila, W.-t. Yih, C. Meek, G. Zweig, and T. Mikolov. Combining heterogeneous models for measuring relational similarity. In Proceedings of the 2013 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 1000–1009, 2013.