1 Introduction
In semantic vector word spaces, each word of a given corpus is represented by a vector of real values. One reason this type of representation is relevant is that several natural language processing (NLP) tasks can be efficiently implemented on it. In particular, machine learning methods that use this representation have been proposed for named entity recognition [30], question answering [11], machine translation [17], etc. Another convenient feature of vector word spaces is that the word vectors are able to capture attributional similarities [31] between words. This means that words that appear in similar contexts in the corpus will be close in their vector representation.
From a machine learning point of view, a crucial question is how meaning can be extracted from the relationships between the vectors. Recent works show that vector word representations obtained using neural networks can capture linguistic or relational regularities between pairs of words. For instance, these regularities can be manifested as constant vector offsets between pairs of words sharing a particular relationship [18, 16]. Let us use $\vec{W}$ to represent the vector representation of the word W; then, for two pairs of words (a, b) and (c, d) that share the same relationship, this offset property can be illustrated as $\vec{b} - \vec{a} \approx \vec{d} - \vec{c}$. For example, in the vector space constructed by Mikolov et al., the algebraic operation $\vec{king} - \vec{man} + \vec{woman}$ will produce a real-valued vector whose closest word in the vector word space is “queen”. More notably, other semantic relationships such as gender inflections, geographical relationships, etc. can be recovered using algebraic operations between vectors.
It has been suggested that the linguistic regularities exhibited by vector representations produced by neural networks are not a consequence of the embedding process itself, but are nevertheless well preserved by it [15]. This seems to be confirmed by the fact that, for other types of vector representations, linear algebraic operations can also produce meaningful results [15, 24].
While it is evident that simple vector algebraic operations such as those mentioned above capture some of the semantics encoded in the vector space, it is not clear whether other types of operations could support more precise semantic relationships or unearth more complex or subtle relationships hidden in the semantic spaces. A possible answer to this question could come from efficiently exploring the space of possible transformations in the vector space so as to find new ways to construct meaning out of word vectors. In this context, genetic programming (GP) [13] arises as a natural candidate.
GP is a search method that explores the space of programs looking for the one that maximizes a given evaluation criterion. Programs can be represented in a variety of ways, but a common choice is a tree-based representation. Mathematical expressions can be easily represented using a tree in which the internal nodes have associated mathematical operators and every terminal node has an associated operand. Trees are evaluated recursively. The output of the GP tree is compared with the desired target value for the input variables, and from this comparison the quality of the tree program is assessed and a “fitness” value is assigned to it. A characteristic feature of GP as a search method is that it is evolutionary, i.e., a set of programs (population) is progressively modified (evolved) by the application of random modifications (mutations) and swapping (crossover) of partial trees in the population.
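The recursive evaluation described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the nested-tuple encoding, the names `OPERATORS`, `evaluate` and `squared_error`, and the toy scalar target are all assumptions made for illustration. Note that the toy quality measure is an error to be minimized, whereas the fitness used later in the paper is an accuracy to be maximized.

```python
import operator

# Hypothetical operator table: each internal node stores one of these names.
OPERATORS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(tree, inputs):
    """Recursively evaluate a program tree.

    A tree is either a string naming an input variable (terminal node)
    or a tuple (operator_name, left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return inputs[tree]
    name, left, right = tree
    return OPERATORS[name](evaluate(left, inputs), evaluate(right, inputs))


def squared_error(tree, cases):
    """Toy quality measure: sum of squared errors over (inputs, target) pairs."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in cases)


# The tree below encodes the expression (x + y) * x.
tree = ("mul", ("add", "x", "y"), "x")
cases = [({"x": 2, "y": 3}, 10), ({"x": 1, "y": 1}, 2)]

print(evaluate(tree, {"x": 2, "y": 3}))
print(squared_error(tree, cases))
```

Encoding trees as nested tuples keeps the sketch short; GP libraries typically use richer node objects that also record the arity of each operator.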
This paper proposes the use of GP to find a sequence of word vector operations that captures a semantic relationship implicitly encoded in a set of training examples. This constitutes an automatic way to unveil the algebraic operations that express or support a given semantic relationship. We frame the general question of finding a suitable transformation of word vectors in terms of the more specific word analogy task [16, 24]. This task consists of answering a question such as: “a is to b as c is to ?”. A correct answer is the exact word that would fit the analogy. Given the vector representations of the three known words, the problem to be solved by GP is to produce a vector whose closest word in the corpus is the one that correctly answers the question.
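As a rough illustration of the word analogy task, the sketch below answers one question on a toy embedding using the linear rule (vector of b minus vector of a plus vector of c). The two-dimensional vectors, the use of cosine similarity, and the exclusion of the three question words from the candidates are assumptions made for this example; they are not taken from the experimental setup of this paper.

```python
import math

# Toy 2-dimensional embedding; real word vectors have hundreds of dimensions.
EMBEDDING = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def closest_word(vector, embedding, exclude=()):
    """Return the word whose vector has the highest cosine similarity."""
    candidates = [w for w in embedding if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, embedding[w]))


def answer(a, b, c, embedding):
    """Answer 'a is to b as c is to ?' with the linear rule b - a + c."""
    va, vb, vc = embedding[a], embedding[b], embedding[c]
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    # Assumption: the three question words are not allowed as answers.
    return closest_word(target, embedding, exclude={a, b, c})


print(answer("man", "woman", "king", EMBEDDING))
```

In the GP setting, the fixed expression inside `answer` is what gets replaced by an evolved program.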
Using this particular problem, we address the following research questions: For embedding representations, can meaningful vector algebraic operations be learned from training examples? If so, is GP a feasible approach to do it? How does GP score with regard to the linear algebraic relationship commonly exploited on vector representations? Are GP evolved programs transferable across linguistic tasks, vector representations and corpora?
The remainder of the paper is structured as follows: In the next section we introduce a general background to vector-based representation of words. Section 3 gives a brief introduction to GP. Section 4 reviews related work. In Section 5, the benchmark of the word analogy task dealt with in the paper is described. Section 6 introduces the approach for automatically learning compositional methods using GP. Experiments to evaluate the accuracy of the evolved programs and their transferability across corpora are presented in Section 7. Section 8 presents the conclusions of our paper and discusses future work.
2 Vector-based representations of words
In this section we briefly review some of the foundations on which our work is built. We discuss semantic spaces and the approach that creates word embeddings using shallow neural networks.
2.1 Semantic spaces
In semantic spaces, words are given an associated representation, and a number of semantic properties can be inferred from the relationships between these word representations. In this paper we will assume that words are represented as vectors of real numbers, all with the same dimension. We will use the terms “word vectors” and “embeddings” interchangeably to refer to the mapping between words and vectors.
To organize our analysis, we consider two key issues in semantic spaces: i) The possible compositional relationships between the word vectors. ii) The methods used to learn the representations.
Compositional models are conceived to capture the semantics of a multi-word construction from the semantics of its constituents. The underlying idea is that the vector representations of two or more words could be transformed to obtain the representation of the multi-word construction they form. Algebraic combinations of word vectors are of interest in the context of compositional semantics and are also relevant for a variety of machine learning tasks in NLP.
To formalize the analysis of methods for word composition, Mitchell and Lapata [20] define $p = f(u, v, R, K)$ as the composition of vectors $u$ and $v$. The function $f$ represents how the pair of words represented by the vectors stand in some syntactic relation $R$, given some background knowledge $K$.
Compositional methods then propose multiple ways of defining the function $f$. For instance, it is usually assumed that $p$ is a linear function of the Cartesian product of $u$ and $v$, simply defined as
$$p = Au + Bv \tag{1}$$
where $A$ and $B$ are matrices which determine the contributions made by $u$ and $v$ to $p$.
In [20], additive models such as the one represented by Eq. (1) and also multiplicative models are discussed. Other compositional models that consider contextual information (neighboring words) have been also proposed [12].
Additive and multiplicative compositional models are limited because, among other reasons, they are based on commutative operators that do not attribute any role to the order of the constituents in multiword constructions. The repertoire of available operations is also constrained if we compare it to the vast range of possible vector manipulations that could be encoded by a more general “program”.
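The commutativity limitation mentioned above can be checked with a small sketch. It assumes the simplest forms of the two models (plain vector addition, i.e., identity matrices in the additive model, and component-wise multiplication); the function names and the example vectors are hypothetical. A vector difference is included as one example of an operation for which the order of the constituents matters.

```python
def additive(u, v):
    """Simple additive composition: p = u + v."""
    return [x + y for x, y in zip(u, v)]


def multiplicative(u, v):
    """Component-wise multiplicative composition: p = u * v."""
    return [x * y for x, y in zip(u, v)]


def offset(u, v):
    """A non-commutative alternative: p = u - v."""
    return [x - y for x, y in zip(u, v)]


u, v = [1.0, 2.0, 3.0], [0.5, -1.0, 4.0]

# For each composition, report whether swapping the arguments changes the result.
for compose in (additive, multiplicative, offset):
    print(compose.__name__, compose(u, v) == compose(v, u))
```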
The second issue relevant for semantic spaces is how they are created. Vector spaces can be learned by computing statistical measures of correlation between words, or by capturing word-context relationships while scanning the sentences of a given corpus. Neural networks have been applied to implement the latter approach [5, 32].
In this paper we have used vectors learned by the application of shallow neural networks as proposed in [16]. In the following section we briefly review this approach.
2.2 Learning word embeddings using neural networks
In [16], two neural-network-based models have been proposed to learn embeddings: the Skip-gram and Continuous Bag-of-Words (CBOW) models. Skip-gram learns to predict the surrounding words of a given word in a sentence. CBOW learns to predict, given the surrounding words, the word most likely to be in the center. We focus on the CBOW model.
CBOW is a feedforward neural net language model [1] with a number of changes. The most important difference is that the hidden layer has been removed. The rationale behind this modification was to explore simpler models. Such models cannot represent the nonlinear interactions that neural networks with hidden layers can, but they are much more efficient for learning from millions of words. The CBOW network also uses a Huffman binary tree for a more efficient representation of the word vocabulary and a hierarchical softmax scheme.
Figure 1 shows a schematic representation of the CBOW architecture [16]. Learning is done by scanning the corpus and considering, for each target word $w_t$, a window comprising words from $w_{t-k}$ to $w_{t+k}$, where $k$ is the window size. In the results reported in [16], the best results were obtained using . The model was trained using stochastic gradient descent and backpropagation.
2.3 Generation of the embeddings
To generate the embeddings we work with in this paper, we have used the text8.zip corpus (available from http://mattmahoney.net/dc/text8.zip). This corpus has been extracted from the English Wikipedia (details on the procedure to extract the data are available from https://cs.fit.edu/%7Emmahoney/compression/textdata.html). It comprises words.
We use the original word2vec implementation (http://code.google.com/p/word2vec) of Mikolov et al. [18, 16] to train the CBOW network from the corpus and generate embeddings. The parameters used by the word2vec program to generate the embedding are described in Table 1.
Parameter  Value

vector size  200
window  8
negative  25
hs  0
sample  1e-4
threads  6
binary  1
iter  15
The CBOW model is generated only once. With regard to the GP implementation, the most important parameter is the vector size. A larger vector size may allow a more accurate representation of the words. However, the vector size also influences the computational cost of the algebraic operations between the words, which are applied intensively while GP searches for an optimal way to compose the words.
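For reference, the parameters of Table 1 could be passed to the original word2vec command-line tool roughly as sketched below. The flag names (`-train`, `-output`, `-cbow`, `-size`, `-window`, `-negative`, `-hs`, `-sample`, `-threads`, `-binary`, `-iter`) are those of that tool, but the executable path `./word2vec`, the file names `text8` and `vectors.bin`, and the helper `word2vec_command` are hypothetical. The sample value is assumed to be 1e-4. The sketch only builds and prints the command; it does not run it.

```python
# Parameter values taken from Table 1 (sample assumed to be 1e-4).
PARAMS = {
    "size": 200,
    "window": 8,
    "negative": 25,
    "hs": 0,
    "sample": "1e-4",
    "threads": 6,
    "binary": 1,
    "iter": 15,
}


def word2vec_command(corpus, output, params):
    """Build the argument list for the word2vec command-line tool (CBOW mode)."""
    args = ["./word2vec", "-train", corpus, "-output", output, "-cbow", "1"]
    for flag, value in params.items():
        args += ["-" + flag, str(value)]
    return args


# Hypothetical input and output file names.
cmd = word2vec_command("text8", "vectors.bin", PARAMS)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` if the tool has been compiled locally.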
To evaluate the scalability and robustness of the programs evolved by GP, we also used a much larger embedding. The word2vec word vector model (available from https://github.com/mmihaltz/word2vec-GoogleNews-vectors) comprises 3 million 300-dimensional English word vectors and was trained with the Google News corpus (3 billion running words).
3 Genetic programming
Genetic programming [13, 25] is a domain-independent method for the automatic creation of programs that solve a given problem. Each GP program can be seen as a candidate solution to the problem. The process to find the optimal solution is posed as a search in the space of possible programs. The search is organized using a traditional evolutionary optimization approach in which sets (populations) of programs are evolved and transformed by the application of the so-called mutation and crossover operators.
Algorithm 3: GP algorithm
1. Generate a population of GP individuals randomly and evaluate them using the fitness function.
2. do
3.   Select a population from the current one according to a selection method.
4.   Apply mutation to the selected individuals with a given probability.
5.   Evaluate the new individuals.
6. until a stop criterion is met
Algorithm 3 shows the pseudocode of a very general GP algorithm. Issues in the application of GP are the choice of the program representation, the algebraic operators used by the program, and the objective or fitness function to evaluate the programs. We will discuss these issues in more detail in Section 6. However, in order to build some intuition on the particular way in which GP is used in this paper, we present a simple example of the representation.
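The loop of Algorithm 3 can be sketched in a few lines of Python. This is a generic evolutionary loop under stated assumptions, not the implementation used in the paper: individuals are plain integers rather than programs, selection keeps the best `n_selected` individuals, only mutation is applied, and all names and parameter values (`evolve`, `pop_size`, `p_mut`, the target 42) are hypothetical.

```python
import random


def evolve(init, fitness, mutate, pop_size, n_selected, p_mut, generations, rng):
    """Generic loop: initialise, then repeat selection, mutation and evaluation."""
    population = [init(rng) for _ in range(pop_size)]
    history = []  # best fitness after each generation
    for _generation in range(generations):  # stop criterion: generation budget
        ranked = sorted(population, key=fitness, reverse=True)
        selected = ranked[:n_selected]  # selection step
        offspring = []
        for _child in range(pop_size - n_selected):
            child = rng.choice(selected)
            if rng.random() < p_mut:  # mutation applied with probability p_mut
                child = mutate(child, rng)
            offspring.append(child)
        population = selected + offspring
        history.append(max(fitness(ind) for ind in population))
    return max(population, key=fitness), history


def target_fitness(x):
    """Toy fitness: the closer to 42, the better (maximum is 0)."""
    return -abs(x - 42)


best, history = evolve(
    init=lambda rng: rng.randint(-100, 100),
    fitness=target_fitness,
    mutate=lambda x, rng: x + rng.choice((-1, 1)),
    pop_size=20,
    n_selected=5,
    p_mut=0.9,
    generations=200,
    rng=random.Random(0),
)
print(best, history[-1])
```

Because the selected individuals are copied unchanged into the next population, the best fitness never decreases from one generation to the next in this sketch.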
Let us consider that the three words in the question “a is to b as c is to ?” are transformed to their vector representations, which will be the three arguments of a program. They are transformed as: $a \to \vec{a}$, $b \to \vec{b}$, $c \to \vec{c}$. Then, the linear algebraic rule to compute the answer to the question, i.e., $\vec{b} - \vec{a} + \vec{c}$, could be represented as add(sub($\vec{b}$, $\vec{a}$), $\vec{c}$), where add indicates addition and sub subtraction. Figure 2 shows four GP programs that produce the same rule. The representation shown in Figure 2 is called a tree-based GP representation and is the one used in this paper. The tree representation is a convenient way to recursively organize the evaluation of a particular composition of the word vectors. Depending on the set of available operators (those defined in the nonterminal nodes of the trees), a richer space of possible word vector compositions could be represented. What the GP algorithm does is to bias the search toward those programs that maximize the given fitness function.
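To make the idea of several trees encoding the same rule concrete, the sketch below evaluates four syntactically different compositions of addition and subtraction on the same three vectors. The four expressions are hypothetical examples and are not necessarily the programs shown in Figure 2; the function names follow the operator names add and sub of Table 4, and the vectors are arbitrary toy values.

```python
def add(w, v):
    """Vector addition."""
    return [x + y for x, y in zip(w, v)]


def sub(w, v):
    """Vector subtraction."""
    return [x - y for x, y in zip(w, v)]


# Toy vectors standing in for the representations of the words a, b and c.
a, b, c = [1.0, 4.0], [2.0, -3.0], [5.0, 0.5]

# Four different trees that all compute b - a + c.
programs = {
    "add(sub(b, a), c)": add(sub(b, a), c),
    "sub(add(b, c), a)": sub(add(b, c), a),
    "add(c, sub(b, a))": add(c, sub(b, a)),
    "sub(c, sub(a, b))": sub(c, sub(a, b)),
}

for expression, result in programs.items():
    print(expression, result)
```

With real-valued embeddings the four results may differ by rounding error, so an equivalence check on actual word vectors would compare them up to a small tolerance.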
4 Related work
Levy and Goldberg [15] investigate the question of how to recover the relational similarities in word embeddings. They show that the linear algebraic rule proposed by Mikolov et al. [16] to solve analogy recovery is equivalent to searching for a word that maximizes a linear combination of three word similarities. Furthermore, they propose an alternative way to compute the distance between the generated vector and the set of words from the corpus. While our research is related to the work presented in [15], our focus is on the operations involved in the generation of the candidate word vector, not on the way the match between the generated vector and the word vectors in the corpus is assessed.
Pennington et al. [24] introduce global log-bilinear regression models as an alternative to shallow neural networks to produce word embeddings. They show their model is able to produce a word vector space with meaningful substructure. The algebraic rule they use to solve the word analogy task is the same as that originally introduced in [16]. Although they applied the distance measure previously presented by Levy and Goldberg [15], they report that this distance did not produce better results than the original one. The work presented in [24] is relevant for our research since it confirms that the usability of the word vector algebraic rule extends over vector representations obtained using a variety of model types and algorithms.
A method for the compositional representation of words that learns a binary parse tree for each input phrase or sentence has also been proposed. The leaves of the tree contain vector representations of words. The tree serves as the basis for the application of a deep recursive autoencoder. Although this representation uses a tree structure to combine the word vectors, it is completely different from a GP approach. Furthermore, trees are independently inferred for each single sentence or multi-word phrase.
Grefenstette et al. [9] propose associating different levels of meaning for words with different types of representations. For example, verbs or other relational words would be represented by matrices, while nouns would be represented by vectors. Algebraic operations involving matrices and vectors are used to produce sentence vectors. In principle, GP approaches could cater for the joint use of vector and matrix representations by means of strongly typed GP [21] or other GP variants that guarantee type constraint enforcement. However, it makes more sense to exhaust the potential of homogeneous word representations before resorting to GP based on more complex word representations.
In [3], three word vector representations and three compositional methods (addition of word vectors, multiplication of vectors, and the aforementioned deep recursive autoencoder approach) are combined to evaluate their applicability to phrase similarity estimation and paraphrase detection (i.e., determining whether two sequences have the same meaning). The reported results show that diverse combinations of representations and compositions produce the best results for different problems. The authors state that “the sizes of the involved training corpora and the generated vectors are not as important as the fit between the meaning representation and compositional method.” This fact highlights the importance of finding an appropriate compositional method.
As a summary of this brief review of related work on compositional methods, we point out that although several papers emphasize the important role of these methods for solving an array of semantic tasks, we did not find any previous report of the automatic learning of the word compositions.
It is important to notice that, from the point of view of machine learning problems, the word analogy task is not a classification problem. This is so even if the quality of a solution can be given in terms of accuracy, as the fraction of correctly answered questions. Nor is it a classical regression problem, since each single input and output feature is represented using a high-dimensional vector. In this context GP has been less investigated than for classical classification and regression problems. However, GP has been applied to a miscellany of tasks in information retrieval [6, 7, 22, 29]. In particular, Oren [22] combines GP with a vector-based representation of documents for information retrieval. Other problems that involve text classification have also been addressed with GP. Two related areas where GP has been applied are document ranking [29] and term-weighting learning [6, 7]. We did not find any previous report on the combination of genetic programming and word embeddings.
5 Problem benchmark: word analogy task
The word analogy task consists of answering a question such as: “a is to b as c is to ?” A correct answer is the exact word that would fit the analogy. Table 2 shows several exemplar questions. We used the benchmark proposed by Mikolov et al. [18] in which questions are separated into groups. In Table 2, Group refers to the group from which the example was taken.
Group  word 1  word 2  word 3  Answer 

4  boy  girl  sons  daughters 
5  amazing  amazingly  slow  slowly 
6  honest  dishonest  known  unknown 
7  bad  worse  old  older 
8  bad  worst  good  best 
9  code  coding  walk  walking 
11  dancing  danced  feeding  fed 
12  banana  bananas  car  cars 
13  decrease  decreases  say  says 
Table 3 shows the description of the word analogy task benchmark. In this table, the third column gives the number of questions in the original benchmark and the fourth column gives the number of questions after removing those containing words that do not appear in the shortened corpus we used in our experiments. Since the corpus we use is relatively small, for four of the groups of questions (“capital-world”, “currency”, “city-in-state”, “nationality-adjective”) we did not find one or more of the four words for each of the questions. Therefore, these four groups of questions were excluded from our analysis.
Group  Name  Original  Reduced

4  family (gender inflections)  506  305
5  gram1-adjective-to-adverb  992  755
6  gram2-opposite  812  305
7  gram3-comparative  1332  1259
8  gram4-superlative  1122  505
9  gram5-present-participle  1056  991
11  gram7-past-tense  1560  1331
12  gram8-plural (nouns)  1332  991
13  gram9-plural-verbs  870  649
6 Description of the GP approach
The specific problem we use makes the automatic learning of word compositions possible: given the vector representations of the three words that define a question, the GP task is to produce a vector whose closest word in the corpus is one that correctly answers the question. We will mainly use the CBOW model learned using word2vec to determine which word vector of the model encodes a given word, or to find the word in the model whose encoding vector is the closest to a target word vector. The pseudocode of the GP algorithm we used is shown in Algorithm 3. It is a straightforward implementation of tree-based genetic programming.
The selection method used is truncation selection. After sorting the individuals according to their fitness, the best solutions are kept for crossover and mutation. Uniform mutation randomly selects a point in the tree individual and replaces it by a random subtree. One-point crossover is used; it randomly selects a subtree in each of two individuals and then exchanges them. The probability of mutation and crossover was .
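The three operators described above can be sketched on trees encoded as nested tuples. This is an illustrative approximation rather than the DEAP-based implementation used in the experiments: the function and terminal sets, the growth probability 0.3, the default depths and every function name are assumptions, and the depth constraint on the trees is not enforced here.

```python
import random

FUNCTIONS = ("add", "sub", "mul")  # binary operators (illustrative subset)
TERMINALS = ("a", "b", "c")        # the three input word vectors


def random_tree(rng, depth):
    """Grow a random tree; terminals are strings, internal nodes are tuples."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS),
            random_tree(rng, depth - 1),
            random_tree(rng, depth - 1))


def paths(tree, prefix=()):
    """Enumerate the position of every node as a tuple of child indices."""
    yield prefix
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from paths(tree[i], prefix + (i,))


def get(tree, path):
    """Return the subtree found by following the child indices in path."""
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], subtree)
    return tuple(children)


def mutate(tree, rng, depth=2):
    """Uniform mutation: replace a randomly chosen node by a random subtree."""
    point = rng.choice(list(paths(tree)))
    return replace(tree, point, random_tree(rng, depth))


def crossover(t1, t2, rng):
    """One-point crossover: exchange one random subtree between two trees."""
    p1 = rng.choice(list(paths(t1)))
    p2 = rng.choice(list(paths(t2)))
    return replace(t1, p1, get(t2, p2)), replace(t2, p2, get(t1, p1))


def truncation(population, fitness, k):
    """Truncation selection: keep the k individuals with the highest fitness."""
    return sorted(population, key=fitness, reverse=True)[:k]


def size(tree):
    """Number of nodes in a tree."""
    return len(list(paths(tree)))


rng = random.Random(1)
t1, t2 = random_tree(rng, 3), random_tree(rng, 3)
c1, c2 = crossover(t1, t2, rng)
print(t1, t2, c1, c2, mutate(t1, rng), sep="\n")
```

Since crossover only swaps subtrees, the total number of nodes in the two children equals the total number in the two parents.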
The choice of genetic operators has been made as simple as possible to enhance the readability of the algorithm. While more sophisticated GP methods exist, our focus here is the proof of concept of automatic generation of the compositions, and, for that purpose, the choice of the operators was appropriate. On the other hand, we conducted a set of preliminary experiments with other mutation and selection operators (those included in the DEAP library used to implement the algorithms) and did not observe significant changes in the results when the set of all groups of analogy questions was considered. Some operators can produce more accurate programs for some particular group, but they are then outperformed by other methods in other groups.
In the experiments, the population size used was and the stop criterion is a maximum number () of generations. Our GP implementation was written in Python. It is based on the EA software DEAP (http://deap.readthedocs.io/en/master/api/tools.html) [8] and the gensim package (https://radimrehurek.com/gensim/), a Python-based implementation of NLP algorithms [26]. gensim includes methods for interrogating the model generated by word2vec. Our code is openly available (https://github.com/rsantana-isg/GP_word2vec).
6.1 GP operators
The set of operators used by the programs is shown in Table 4. All operators are defined on vectors of the same dimension. There are two classes of operators: binary and unary. The add, sub, and mul operators have the following meaning: vector addition, vector subtraction, and vector component-wise multiplication, respectively, while saveDiv corresponds to protected division (usual division except that a division by zero for any of the vector components returns zero). We discarded the possibility of including fixed (vector) random constants as terminals of the programs since they may depend on the size of the vector and our aim was to produce programs scalable to any vector dimension. We set a constraint on the depth of the trees to reduce the complexity of the programs.
Binary (w, v)  Unary (w)  Unary (w)

add  neg  Roll
sub  diff  Rint
mul  abs  Half
saveDiv  cos  Norm
-  sin  Log1p
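As a minimal sketch of two of the binary operators described in the text, the code below implements component-wise multiplication and protected division on plain Python lists. The function names `mul` and `protected_div` and the example vectors are illustrative assumptions; the unary operators of Table 4 are omitted because the text does not define them.

```python
def mul(w, v):
    """Component-wise multiplication of two vectors of the same dimension."""
    return [x * y for x, y in zip(w, v)]


def protected_div(w, v):
    """Component-wise division that returns 0 wherever the divisor is 0."""
    return [x / y if y != 0 else 0.0 for x, y in zip(w, v)]


w, v = [1.0, -4.0, 3.0], [2.0, 0.0, -1.5]

print(mul(w, v))
print(protected_div(w, v))
```

Neither operator refers to the vector length, which is in line with the stated aim of producing programs that scale to any vector dimension.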
6.2 Fitness function
A critical component of a GP implementation is the definition of the fitness function. We implement the fitness evaluation as follows: when a candidate program is evaluated, it is applied to a training set of questions. For each question, the word vectors of the first three words are first obtained from the CBOW model. The program is then evaluated using these three word vectors as arguments, and the program’s output word vector is used to compute the quality of the program for the question.
Let us consider a candidate program and the first question in Table 2 as an example. First, we obtain the word vectors $\vec{boy}$, $\vec{girl}$, and $\vec{sons}$. Then, from the execution of the GP program, we obtain an output word vector. This vector is then presented to the CBOW model, which outputs the closest word in the model. If this word coincides with the answer to the question (“daughters”), a counter of correctly answered questions is increased. The final fitness value is the proportion of questions in the training set that were correctly answered.
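The fitness computation described above can be sketched as follows. The toy embedding, the cosine-based `closest_word` lookup (which here searches the whole vocabulary, including the question words) and the two example questions are assumptions made for illustration; in the paper the lookup is delegated to the CBOW model.

```python
import math

# Toy 2-dimensional embedding standing in for the CBOW model.
EMBEDDING = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
}


def closest_word(vector, embedding):
    """Return the word with the highest cosine similarity to vector."""
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norms
    return max(embedding, key=lambda word: cosine(vector, embedding[word]))


def fitness(program, questions, embedding):
    """Proportion of questions whose closest word matches the expected answer."""
    correct = 0
    for a, b, c, expected in questions:
        output = program(embedding[a], embedding[b], embedding[c])
        if closest_word(output, embedding) == expected:
            correct += 1
    return correct / len(questions)


def linear_rule(a, b, c):
    """Candidate program implementing b - a + c."""
    return [y - x + z for x, y, z in zip(a, b, c)]


questions = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "queen", "king"),  # not a valid analogy in this toy space
]
print(fitness(linear_rule, questions, EMBEDDING))
```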
The fitness function serves as a direct assessment of the program quality because we can directly test whether the program produces vectors whose semantics is the one encoded by the question. However, it has an important drawback. The computational cost of repeatedly interrogating the model to determine the closest word to a given vector is very high, and it would increase with the size of the vocabulary if larger corpora were used. To diminish this cost we introduced three changes to the GP scheme.

Restricted vocabulary size for interrogation: The word2vec implementation allows the restriction of the search for the most likely word given a vector to the most frequent words in the vocabulary. Out of the total number of words () in the vocabulary, we set . This reduces the computational time of the fitness function.

Partial evaluation: Each program is trained on a fraction of the questions from the training set. We set this fraction to be of the size of the training set. When evaluating a program, first a subset of the questions from the training set is randomly selected, and the accuracy of the program is measured in this subset. This means that different programs are evaluated on distinct subsets of questions.

Early halt: While sequentially evaluating the questions in the (random subset of the) training set, the program does not complete the evaluation of all questions and halts if: 1) A output is generated for any of the questions. 2) If after at least ten questions have been “answered” the proportion of correctly answered questions is at some point below . In this case it is clearly a poorly performing program.
All the previous enhancements considerably increase the efficiency of the algorithm. While partial evaluation adds some variability to the fitness output of the programs, good programs are in general good across subsets of questions, and poor programs cannot specialize in niches of questions since the subset selection in the training set is made randomly.
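Partial evaluation and early halting can be combined in a single fitness routine, as in the sketch below. Several details are assumptions rather than facts from the paper: the values of `fraction` and `threshold`, treating a non-finite number as the invalid output that triggers a halt, and returning a fitness of 0 when the evaluation is halted. Questions are simplified to one-dimensional vectors with integer "words" so that the example is self-contained.

```python
import math
import random


def partial_fitness(program, questions, answer_fn, fraction, threshold, rng,
                    min_answered=10):
    """Accuracy on a random subset of the questions, with early halting."""
    subset = rng.sample(questions, max(1, int(fraction * len(questions))))
    correct = 0
    for answered, (a, b, c, expected) in enumerate(subset, start=1):
        output = program(a, b, c)
        if any(not math.isfinite(x) for x in output):
            return 0.0  # assumed handling of an invalid output
        correct += int(answer_fn(output) == expected)
        if answered >= min_answered and correct / answered < threshold:
            return 0.0  # clearly poor program: stop early (assumed fitness 0)
    return correct / len(subset)


# Toy data: question i has vectors [i], [i + 1], [2i] and answer 2i + 1.
questions = [([float(i)], [float(i + 1)], [float(2 * i)], 2 * i + 1)
             for i in range(40)]
calls = []  # records how many times the poor program is executed


def good(a, b, c):
    """Program that always produces the expected answer (b - a + c)."""
    return [y - x + z for x, y, z in zip(a, b, c)]


def bad(a, b, c):
    """Program that never produces the expected answer."""
    calls.append(1)
    return a


def nearest(output):
    """Stand-in for the closest-word lookup: round to the nearest integer."""
    return round(output[0])


good_score = partial_fitness(good, questions, nearest, 0.5, 0.2, random.Random(0))
bad_score = partial_fitness(bad, questions, nearest, 0.5, 0.2, random.Random(0))
print(good_score, bad_score, len(calls))
```

The restricted-vocabulary change would live inside the closest-word lookup itself; in gensim, for instance, `KeyedVectors.similar_by_vector` accepts a `restrict_vocab` argument that limits the search to the most frequent words.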
7 Experiments
The main objective of the experiments is to determine the quality of the programs generated by GP. We will compare their results for the word analogy task with those obtained by the application of the linear algebraic rule $\vec{b} - \vec{a} + \vec{c}$, which is the one commonly used for the composition of words for this problem. In addition, we will evaluate the transferability of the best programs by applying them to a vector space comprising 3 million vectors, roughly times the size of the vector space we used to learn the programs.
For each fitness function and each group of questions of those described in Table 3, independent runs of the algorithms were executed. In total, executions were conducted. Each group of questions was split into a training and a test set with the same number of questions. The questions in the test set were not used at any point during the evolution.
7.1 Numerical results
We evaluate the performance of the GP algorithms by looking at the accuracy of the best GP programs found. The accuracy, for each group of questions, is the proportion of questions correctly answered by a GP program. For each of the runs we keep all the solutions in the last selected population ( solutions per run). Among the programs, the one that has the highest accuracy in the training set is selected. Then we compute the accuracy of this program also in the test set. Using the programs, the maximum and mean accuracy are calculated in the training and test sets. Table 5 shows these values for the groups of questions. The table also shows the accuracy produced by the algebraic rule. It can be seen that the best GP evolved programs outperform the algebraic rule on all the groups of questions, although the difference in the results is more noticeable for some groups of questions (e.g., group ). The mean accuracy of the programs on the test set is also higher than that achieved using the algebraic rule for 6 of the 9 groups of questions. Notice that since our selection of the best programs was based on the accuracy for the training set, there might be programs with a higher accuracy on the test set. We did identify some of these programs. Interestingly, for some groups of questions (e.g., group 11) the maximum and mean accuracy in the training set are smaller than in the test set.
Group  Training max.  Training mean  Test max.  Test mean  Rule

4  84.21  82.74  83.66  82.77  77.70 
5  24.67  21.62  26.46  21.31  16.16 
6  49.34  41.86  50.98  41.11  24.92 
7  67.09  66.00  61.27  58.62  60.44 
8  46.83  44.96  41.11  38.72  40.40 
9  48.08  45.09  44.56  39.89  36.83 
11  45.71  41.97  46.70  44.47  37.94 
12  76.16  73.13  72.38  69.41  66.50 
13  49.69  40.42  38.77  32.38  34.21 
In a second phase of the experiments, the best program generated in the last generation of each execution of the GP algorithm was selected based on the sum of the training and test set accuracy values for the corresponding group of questions. We evaluated this set of programs using the word2vec GoogleNews vectors. These vectors have a larger dimension (300 versus 200 in the text8 vector space) and comprise around 3 million words. As a consequence, these vectors contain all words for the original groups of questions introduced in [16]. We must remember that the reason why we did not use four of the original groups of questions was that the text8 vector space did not include the vector representations for all constituent words of each question in these groups.
Using the word2vec GoogleNews vectors, we can test the evolved programs in all the data sets. The same set of operations encoded in the programs are applied, but this time using the new vector representation. The output vector is then submitted to the model that determines whether the closest word in the space of word2vec GoogleNews vectors is the right answer to the question.
The results of this evaluation are shown in Table 6, where each row corresponds to one of the original groups of questions. Each column shows the best accuracy produced by the best program among the generated with function for the group of questions represented in column . The last column shows the accuracy results of the algebraic rule. In each row, all the programs that produce results better than the one in the last column are shown in bold.
Notice that we can evaluate the GP program in all groups of questions independently of the group used to learn them. Since all the questions have the same structure, we can apply the programs to them.
Group  4  5  6  7  8  9  11  12  13  rule 

1  77.80  63.41  51.43  81.72  81.49  76.23  81.41  78.60  81.49  81.49 
2  14.68  10.87  2.08  16.18  15.95  14.68  16.07  16.07  15.95  15.95 
3  65.94  41.16  33.62  72.75  72.75  69.06  68.98  70.48  74.17  72.75 
4  73.27  62.18  48.71  77.62  77.62  73.66  77.23  77.62  77.62  77.62 
5  27.75  18.77  17.66  28.86  28.86  21.39  29.87  28.86  28.86  28.86 
6  34.16  27.87  28.61  32.68  32.43  34.28  33.54  32.31  32.43  32.43 
7  88.66  63.19  71.83  89.56  89.56  86.33  89.93  89.26  89.56  89.56 
8  65.21  48.26  33.27  67.53  67.53  46.65  69.05  68.06  67.53  67.53 
9  70.90  60.00  61.90  73.18  73.18  61.99  71.09  72.70  74.41  73.18 
10  85.17  83.79  71.78  85.42  85.42  82.23  85.36  85.36  85.42  85.42 
11  62.92  43.30  48.69  68.44  66.20  64.27  64.46  66.13  66.20  66.20 
12  76.93  74.76  74.68  78.36  78.36  75.13  78.21  78.14  78.36  78.36 
13  55.12  21.75  29.69  65.71  63.87  56.85  62.83  62.83  63.87  63.87 
Group  4  5  6  7  8  9  11  12  13 

1  23  52  76  104  129  153  185  235  261 
2  9  52  76  104  129  153  185  235  261 
3  15  58  76  120  129  153  198  235  247 
4  23  58  76  120  129  178  185  235  261 
5  23  52  76  120  129  178  208  235  261 
6  23  58  76  111  129  178  196  224  261 
7  23  58  76  120  129  178  208  235  261 
8  23  52  76  120  129  178  208  235  261 
9  23  58  76  104  129  153  195  235  247 
10  23  58  76  120  129  178  208  224  261 
11  23  58  76  104  129  178  208  235  261 
12  23  37  76  120  129  154  208  235  261 
13  23  58  76  104  129  153  195  235  261 
There are a number of remarkable facts in the results shown in Table 6:

Some of the programs improved the accuracy for groups of questions that were not in the original reduced benchmark of groups. This is the case for the group of questions .

The best program for a given group of questions is not, in general, a program evolved to answer that group.

There are programs evolved for some groups of questions that are good at answering questions for all groups. For example, this happens with programs learned using the group of questions .
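The cross-group evaluation discussed in this section can be sketched in a few lines. The example below is only an illustration, not the code used in the experiments: the toy four-dimensional vectors, the two toy groups of questions, the helper names (`answer`, `accuracy`), the hypothetical program `add_b_c`, the use of cosine similarity to select the closest word, and the exclusion of the question words from the candidates are all assumptions made here. It shows that, because every question has the form (a, b, c, d), any program that maps three vectors to one vector can be scored on any group.

```python
import math

def add(u, v):
    return [x + y for x, y in zip(u, v)]

def sub(u, v):
    return [x - y for x, y in zip(u, v)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def algebraic_rule(a, b, c):
    # The usual offset rule: b - a + c.
    return add(sub(b, a), c)

def add_b_c(a, b, c):
    # Hypothetical alternative program that ignores the first word.
    return add(b, c)

def answer(program, space, a, b, c):
    # Assumption: the answer is the word closest (by cosine) to the program
    # output, excluding the three words of the question.
    target = program(space[a], space[b], space[c])
    candidates = [w for w in space if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(space[w], target))

def accuracy(program, space, questions):
    hits = sum(answer(program, space, a, b, c) == d for a, b, c, d in questions)
    return 100.0 * hits / len(questions)

# Toy vector space (assumed coordinates: male, female, royal, plural).
SPACE = {
    "man":    [1.0, 0.0, 0.0, 0.0],
    "woman":  [0.0, 1.0, 0.0, 0.0],
    "king":   [1.0, 0.0, 1.0, 0.0],
    "queen":  [0.0, 1.0, 1.0, 0.0],
    "kings":  [1.0, 0.0, 1.0, 1.0],
    "queens": [0.0, 1.0, 1.0, 1.0],
}

# Two toy groups of questions sharing the (a, b, c, d) structure.
GROUPS = {
    "gender": [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")],
    "plural": [("king", "kings", "queen", "queens"), ("queen", "queens", "king", "kings")],
}

PROGRAMS = {"rule": algebraic_rule, "add_b_c": add_b_c}

# Accuracy of every program on every group, whatever group it was designed for.
table = {
    (name, group): accuracy(program, SPACE, questions)
    for name, program in PROGRAMS.items()
    for group, questions in GROUPS.items()
}

for key in sorted(table):
    print(key, table[key])
```

In this toy setting the algebraic rule answers every question in both groups, whereas `add_b_c` answers the "plural" questions but fails on one of the two "gender" questions. The same loop structure yields one accuracy per (program, group) pair, which is the layout of Table 6.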
7.2 Evaluating answers and evolved programs
One important issue is the interpretability of the evolved programs and how they relate to the algebraic rule. Out of the programs tested, were equivalent to the algebraic rule. Four of these programs are shown in Figure 2. It can be seen how the same rule is implemented in distinct ways using only the operators , , and . These results show that, as an algorithm to create word compositions, GP can automatically learn compositional methods designed by humans.
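To make the notion of equivalent programs concrete, the following sketch represents programs as nested tuples and checks that structurally different trees produce the same vector as the algebraic rule. It is an illustration under stated assumptions: the operator set (`add`, `sub`), the terminal names (`a`, `b`, `c`), the specific trees, and the test vectors are hypothetical and are not the programs of Figure 2. Agreement on a sample input is a spot check, not a proof of equivalence.

```python
import math

# Assumed operator set: element-wise vector addition and subtraction.
OPS = {
    "add": lambda u, v: [x + y for x, y in zip(u, v)],
    "sub": lambda u, v: [x - y for x, y in zip(u, v)],
}

def evaluate(tree, env):
    """Evaluate a program tree: a terminal name or a tuple (op, left, right)."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))

# The algebraic rule b - a + c written as a tree.
RULE = ("add", ("sub", "b", "a"), "c")

# Structurally different trees that implement the same rule.
VARIANTS = [
    ("add", "b", ("sub", "c", "a")),   # b + (c - a)
    ("sub", "c", ("sub", "a", "b")),   # c - (a - b)
    ("sub", ("add", "b", "c"), "a"),   # (b + c) - a
]

def same_output(t1, t2, env, tol=1e-9):
    """Compare the outputs of two trees on one input, up to a tolerance."""
    out1, out2 = evaluate(t1, env), evaluate(t2, env)
    return all(math.isclose(x, y, abs_tol=tol) for x, y in zip(out1, out2))

# Arbitrary test vectors (hypothetical values).
env = {
    "a": [0.5, -1.0, 2.0],
    "b": [1.5, 0.25, -3.0],
    "c": [-2.0, 4.0, 0.75],
}

equivalent = [same_output(RULE, tree, env) for tree in VARIANTS]
print(equivalent)
```

A tree such as `("add", "b", "c")` fails the same check, which is how a program that merely resembles the rule can be told apart from one that implements it.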
We also analyzed those GP programs that outperformed the algebraic rule. An example of this type of program is shown in Figure 3. It was the best program found for the group of questions . Its accuracy using the word2vec GoogleNews vectors was , above the accuracy of the algebraic rule for the same group of questions.
The tree shown in Figure 3 is a slight modification of the algebraic rule. Instead of adding to the rule, this program adds and this change allows it to increase the accuracy for the group of questions. A trend observed in other evolved programs was that they contained building blocks from the algebraic rule. As in the case of the programs shown in Figure 2, these structural features were not specifically induced; they were acquired as part of the evolutionary process. Other programs that produced high accuracy values are shown in Figures 4 to 8. When analyzing the behavior of these programs, Tables 6 and 7 should be consulted.
7.3 Discussion
We go back to the research questions posed at the beginning of this work and try to answer them based on the results of our experiments.

For embedding representations, can meaningful vector algebraic operations be learned from training examples? Yes, they can be learned within a relatively small computational time.

If so, is GP a feasible approach to do it? Yes, GP is a natural solution for this type of problem and even straightforward implementations can deal with the problem.

How do GP programs score with regard to the algebraic rule commonly applied on vector representations? GP programs can learn the same rule designed by humans and, therefore, can reach the same results. They can also outperform these results but, at least for the class of word vector representation and the basic tree-based GP approach implemented, the improvements are moderate.

Are GP evolved programs transferable across linguistic tasks, vector representations and corpora? Definitely. The high transferability of the programs across groups of questions may be explained by the underlying commonality between the analogies that these groups of questions represent. However, it is remarkable that the programs can be transferred to a vector space where both the dimension of the vectors and the number of vectors increase dramatically. In this respect, transferability opens an additional opportunity for efficiency gains: programs can be learned using small vector spaces, and then validated or refined on larger vector spaces that are more costly to evaluate.
8 Conclusions and future work
While semantic spaces and word vector representations are able to capture some of the semantic relationships between the words, compositional methods are necessary to extend their use to multiword constructions. In this paper we have proposed representing compositional vector operations as simple programs that can be automatically learned from data. We have shown that, using GP, it is possible to encode a set of vector operations as a program, that the programs can be evolved to achieve higher accuracy than the human rules conceived to manipulate the words, and that the programs are valid for datasets other than those from which they have been learned, i.e., they are transferable programs. Furthermore, our results indicate that it is possible to learn programs using vector vocabularies of small to moderate sizes and then test them in bigger domains where the evaluation of a program is more costly.
8.1 Future work
As lines for future work we consider the following:
8.1.1 Use alternative methods for the word vector generation
While GP approaches can explore a vast range of possible word compositions, the usefulness of more intricate programs is, to some extent, constrained by the nature of the relationships that the vectors can encode. For example, if the methods used to construct the embeddings do not allow nonlinear relationships between the vectors, then the improvements of the GP programs over plain linear algebraic compositional operators will be marginal. Therefore, it would be important to test the automatic generation of word compositions with GP on word vectors generated using diverse methods.
8.1.2 Evolve functions for the similarity metric
Since it has been shown that the type of similarity metric can critically influence the accuracy results [15], it makes sense to learn this function as well. One difficulty is that the output of this function will be a numerical value and not a vector like the other operators used in the current GP representation. In addition, evaluating an alternative similarity metric implies using the candidate metric to compute distances to all vectors in the vector space, a process that can be very costly computationally.
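The two difficulties mentioned above can be illustrated with a small sketch. It is not part of the proposed method: the function names (`nearest`, `neg_euclidean`), the toy vectors, and the choice of negative Euclidean distance as the candidate metric are assumptions made for illustration. The sketch shows that a similarity metric returns a scalar rather than a vector, that each evaluation of a candidate metric requires one similarity computation per word in the vocabulary, and that two metrics can select different answers for the same target vector.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def neg_euclidean(u, v):
    # Hypothetical candidate metric: larger values mean "more similar".
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def nearest(target, space, similarity, exclude=()):
    """Return the most similar word and the number of metric evaluations."""
    best_word, best_score, calls = None, -math.inf, 0
    for word, vector in space.items():
        if word in exclude:
            continue
        score = similarity(vector, target)  # a scalar, unlike the vector operators
        calls += 1
        if score > best_score:
            best_word, best_score = word, score
    return best_word, calls

# Toy two-dimensional space (hypothetical values).
space = {
    "near":    [1.0, 0.5],   # close to the target in Euclidean distance
    "aligned": [5.0, 5.0],   # same direction as the target, but far away
    "far":     [-3.0, 2.0],
}
target = [1.0, 1.0]

by_cosine = nearest(target, space, cosine)
by_euclid = nearest(target, space, neg_euclidean)
print(by_cosine, by_euclid)
```

Here cosine similarity selects "aligned" and the Euclidean candidate selects "near", each after one metric evaluation per word. With a realistic vocabulary this pass is repeated for every question and every candidate metric, which is the source of the computational cost.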
8.1.3 Combining different word representations
8.1.4 Using more sophisticated GP approaches
From the point of view of research in genetic programming, word embeddings open an interesting research line. More research is needed to identify which, among more sophisticated GP approaches, are the most appropriate for their application to semantic spaces. Among possible lines of research are the following:

More complex descriptions of the compositional operators: One open question is to what extent more complex functions can better exploit the underlying semantic relationships between the word vectors. This could be investigated by adding other algebraic operators to the set of GP functions, including ternary operators. Another possibility is representing the composition of vectors with ensembles of GP programs [2].

Reusing problem information: Approaches able to identify and transfer building blocks [10] between word vectors or corpora of varying dimensions arise as potential candidates.

Behavioral program synthesis: One direction in which the evolution of the programs could be improved is by analyzing and assessing the quality of the intermediate vectors produced in the evaluation of the programs. In general, algorithms that advocate a more efficient use of the information displayed by the behavior of the GP programs [14] could lead to better solutions and reveal additional insights in learning compositional methods.
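As a rough illustration of the last direction, the sketch below evaluates a program tree while recording the vector produced at every node, and then scores each intermediate vector. It is a minimal sketch under stated assumptions: the tuple representation of programs, the operator set, the toy vectors, and in particular the use of cosine similarity to the expected answer as a measure of intermediate quality are hypothetical choices, not a method taken from [14] or from our experiments.

```python
import math

# Assumed operator set: element-wise vector addition and subtraction.
OPS = {
    "add": lambda u, v: [x + y for x, y in zip(u, v)],
    "sub": lambda u, v: [x - y for x, y in zip(u, v)],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def trace(tree, env, log):
    """Evaluate `tree` and append (subtree, vector) pairs to `log` in post-order."""
    if isinstance(tree, str):
        value = env[tree]
    else:
        op, left, right = tree
        value = OPS[op](trace(left, env, log), trace(right, env, log))
    log.append((tree, value))
    return value

# Toy question (hypothetical vectors): a=man, b=woman, c=king, expected=queen.
env = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 0.0, 1.0]}
expected = [0.0, 1.0, 1.0]

program = ("add", ("sub", "b", "a"), "c")  # the algebraic rule b - a + c
log = []
output = trace(program, env, log)

# Assumed behavioral measure: similarity of each intermediate vector to the answer.
scores = [(subtree, cosine(value, expected)) for subtree, value in log]
for subtree, score in scores:
    print(subtree, round(score, 3))
```

Such a per-node record could, for example, be used to reward programs whose subtrees already move toward the expected answer, although whether this helps the search remains an open question.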
Acknowledgments
This work has received support through the IT-609-13 program (Basque Government), TIN2016-78365-R (Spanish Ministry of Economy, Industry and Competitiveness) and Brazilian CNPq Program Science Without Borders No.: 400125/2014-5.
References
 [1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.

 [2] U. Bhowan, M. Johnston, M. Zhang, and X. Yao. Evolving diverse ensembles using genetic programming for classification with unbalanced data. IEEE Transactions on Evolutionary Computation, 17(3):368–386, 2013.
 [3] W. Blacoe and M. Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics, 2012.
 [4] P. A. Bosman and E. D. De Jong. Learning probabilistic tree grammars for genetic programming. In International Conference on Parallel Problem Solving from Nature, pages 192–201. Springer, 2004.
 [5] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
 [6] R. Cummins and C. O’Riordan. An analysis of the solution space for genetically programmed term-weighting schemes in information retrieval. In P. S. P. M. D. Bell, editor, 17th Artificial Intelligence and Cognitive Science Conference (AICS 2006), Queen’s University, Belfast, 2006.
 [7] H. J. Escalante, M. A. García-Limón, A. Morales-Reyes, M. Graff, M. Montes-y-Gómez, E. F. Morales, and J. Martínez-Carranza. Term-weighting learning via genetic programming for text classification. Knowledge-Based Systems, 83:176–189, 2015.

 [8] F.-A. Fortin, D. Rainville, M.-A. G. Gardner, M. Parizeau, C. Gagné, et al. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1):2171–2175, 2012.
 [9] E. Grefenstette and M. Sadrzadeh. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1394–1404. Association for Computational Linguistics, 2011.
 [10] M. Iqbal, W. Browne, and M. Zhang. Reusing building blocks of extracted knowledge to solve complex, large-scale Boolean problems. IEEE Transactions on Evolutionary Computation, 18(4):465–480, Aug 2014.
 [11] M. Iyyer, J. L. Boyd-Graber, L. M. B. Claudino, R. Socher, and H. Daumé III. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing (EMNLP), pages 633–644, 2014.
 [12] W. Kintsch. Predication. Cognitive science, 25(2):173–202, 2001.
 [13] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge, MA, 1992.
 [14] K. Krawiec, J. Swan, and U.-M. O’Reilly. Behavioral program synthesis: Insights and prospects. In Genetic Programming Theory and Practice XIII, pages 169–183. Springer, 2016.
 [15] O. Levy and Y. Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Language Learning, pages 171–180, 2014.
 [16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
 [17] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.
 [18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [19] J. F. Miller and P. Thomson. Cartesian genetic programming. In European Conference on Genetic Programming, pages 121–132. Springer, 2000.
 [20] J. Mitchell and M. Lapata. Composition in distributional models of semantics. Cognitive science, 34(8):1388–1429, 2010.
 [21] D. J. Montana. Strongly typed genetic programming. Evolutionary computation, 3(2):199–230, 1995.
 [22] N. Oren. Improving the effectiveness of information retrieval with genetic programming. Master’s thesis, Faculty of Science of the University of Witwatersrand, Johannesburg, 2002.
 [23] M. O’Neill and C. Ryan. Grammatical evolution. In Grammatical Evolution, pages 33–47. Springer, 2003.
 [24] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), volume 14, pages 1532–1543, 2014.
 [25] R. Poli, W. B. Langdon, N. F. McPhee, and J. R. Koza. A field guide to genetic programming. Lulu.com, 2008.
 [26] R. Řehůřek and P. Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
 [27] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 2011 Conference Advances in Neural Information Processing Systems 24, NIPS, volume 24, pages 801–809, 2011.
 [28] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics, 2011.
 [29] A. Trotman. Learning to rank. Information Retrieval, 8(3):359–381, 2005.

 [30] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
 [31] P. D. Turney. Similarity of semantic relations. Computational Linguistics, 32(3):379–416, 2006.
 [32] A. Zhila, W.-t. Yih, C. Meek, G. Zweig, and T. Mikolov. Combining heterogeneous models for measuring relational similarity. In Proceedings of the 2013 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 1000–1009, 2013.