Paraphrasing verbal metonymy through computational methods

09/18/2017 ∙ by Alberto Morón Hernández, et al.


Acknowledgements

“Prefiero caminar con una duda que con un mal axioma.”
(“I would rather walk with a doubt than with a bad axiom.”)
—Javier Krahe

Thousands of words and hundreds of lines of code do not write themselves, so there are naturally many people who have my gratitude:

My parents, first and foremost, for the unconditional support they have given me during my time away from home. My family, both in Spain and the UK, for always being there for me.

Lecturers at Manchester who have guided me during the past three years, including but not limited to: Dr. Andrew Koontz-Garboden, Dr. Laurel MacKenzie, Dr. Eva Schultze-Berndt and Dr. Wendell Kimper.

This dissertation would never have been possible without my time at UMass Amherst. Prof. Brian Dillon, Alan Zaffeti, everyone in LING492B and Jamie, Grusha, Amy & Ben have my gratitude for welcoming me to UMass and making it such a memorable experience.

Finally – Raúl, DJ, Serguei & Pedro. Gracias. La próxima, en Cazorla. (Thank you. Next time, in Cazorla.)

Disclaimers

This dissertation is supported in part by an Amazon Web Services Educate grant (#PC1R88EPEV238VD). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the view of Amazon.com, Inc.

Data cited herein has been extracted from the British National Corpus, managed by Oxford University Computing Services on behalf of the BNC Consortium. All rights in the texts cited are reserved.

List of acronyms

AWS Amazon Web Services
BNC British National Corpus
CBOW Continuous bag-of-words
LSTM Long Short-Term Memory
NLTK Natural Language Toolkit (Python module)
VSM Vector Space Model

Abstract

Verbal metonymy has received relatively scarce attention in the field of computational linguistics despite the fact that a model to accurately paraphrase metonymy has applications both in academia and the technology sector. The method described in this paper makes use of data from the British National Corpus in order to create word vectors, find instances of verbal metonymy and generate potential paraphrases. Two different ways of creating word vectors are evaluated in this study: Continuous bag of words and Skip-grams. Skip-grams are found to outperform the Continuous bag of words approach. Furthermore, the Skip-gram model is found to operate with better-than-chance accuracy and there is a strong positive relationship (phi coefficient = 0.61) between the model’s classification and human judgement of the ranked paraphrases. This study lends credence to the viability of modelling verbal metonymy through computational methods based on distributional semantics.

2.1 Previous accounts of verbal metonymy

The aim of the study is to develop a model which paraphrases verbal metonymy. Consider the following sentences:

  (1) The cook finished eating the meal.

  (2) The cook finished the meal.

In sentence (1) the aspectual verb ‘finish’ combines with a verb phrase denoting the event of eating a meal; thus (1) refers to the termination of an event. Contrast this with (2), where ‘finish’ instead combines with a noun phrase referring to a specific meal. The resulting sentence concerns the termination of an unspecified event involving ‘the meal’. Interestingly, the structure of (2), [NP [V [NP]]], does not include an event whose termination the sentence could be referring to. Arguments that can pair with the aspectual verb ‘finish’ are restricted to those with temporal or eventive meanings. This restriction is not directly satisfied by ‘the meal’, yet human judges are able to make sense of (2). Katsika et al. suggest that the fact that sentences like (2) make sense despite this conflict means that “a temporal/eventive argument is supplied [to aspectual verbs] at some point during the interpretation of the sentence” (2012: 59). Jackendoff describes logical metonymy as an instance of “enriched composition” (1997: 49), and Utt et al. (2013) succinctly define it as consisting of an event-selecting verb combining with an entity-denoting noun. Making sense of sentences like (2) entails the recovery of a covert event (e.g. eating, making, cooking).

My interest in focusing on verbs stems partly from the fact that other aspects of language have received more attention in past computational studies of semantics. Existing computational accounts of metonymy in the literature explore other instances of metonymy, such as those which use toponyms or proper names in general (Markert and Nissim 2006). Psycholinguistic studies conducted on the interpretation of metonymic language include McElree et al. (2001) and Traxler et al. (2002). The latter tested combinations of metonymic and non-metonymic verbs with both entity- and event-denoting nouns (e.g. The cook [finished / saw]V [the meal / the fight]NP). The study found that sentences featuring a metonymic verb and an entity-denoting object (‘The cook finished the meal’ – the coercion combination) involved higher processing costs. The abundance of psycholinguistic studies of verbal metonymy compared to the relative scarcity of papers from a computational or distributional perspective encouraged me to pursue my research question. The frequency with which metonymy occurs in natural language, and the ease with which humans can interpret it through context and knowledge of the world, also make metonymy an interesting phenomenon to model computationally.

Although general metonymy does not rely on type clashes as much as verbal metonymy does, there naturally exists a relation between the two. Some of the earliest attempts at generating a computational understanding of general metonymy include Lakoff & Johnson’s 1980 work and Verspoor’s 1997 study, which searched for possible metonymies computationally yet carried out the paraphrasing task manually. Verspoor’s work is also relevant here since she used a previous version of the British National Corpus. One of the earliest attempts at fully automating the process is Utiyama, Murata & Isahara’s 2000 paper on Japanese metonymy. Shutova et al.’s 2012 paper on using techniques from distributional semantics to compute likely candidates for the meanings of metaphors has been a major influence in getting me to think about possible obstacles and improvements to the ranking algorithm. A 2003 paper by Lapata & Lascarides has been a guiding influence in the creation of my model. For instance, their finding that it is possible to “discover interpretations for metonymic constructions without presupposing the existence of qualia-structures” has led to my model consisting of a statistical learner and a shallow syntactic parser rather than a more contrived solution (2003: 41).

When considering which verbs to target in order to search for instances of verbal metonymy in the BNC, Utt et al. (2013) have provided an invaluable starting point. Utt et al. ask: “What is a metonymic verb?” and “Are all metonymic verbs alike?” (2013: 31). They develop empirical answers to these questions by introducing a measure of ‘eventhood’ which captures the extent to which “verbs expect objects that are events rather than entities” (2013: 31). Utt et al. provide both a useful list of metonymic verbs and one of non-metonymic verbs. The list builds upon the datasets provided by two previous psycholinguistic studies: Traxler et al. (2002) and Katsika et al. (2012). The existence of this empirical list is useful since it allows me to bypass the ongoing debate regarding whether individual verbs lend themselves to metonymy.
This debate has been approached by theorists (Pustejovsky 1991), psycholinguists (McElree et al. 2001) and computational linguists (Lapata, Keller & Scheepers 2003). I return to these studies and their relevance in helping me pick relevant verbs in Chapter 3.

2.2 Foundations of computational linguistics

Though it has been echoed many times when introducing the subject of distributional semantics, J.R. Firth’s pithy quip that “You shall know a word by the company it keeps” (1957: 11) remains the best way to describe the field in the fewest words. The core idea that meaning must be analysed with context and collocations in mind was put forward by Firth as early as 1935, when he stated that “no study of meaning apart from context can be taken seriously” (1935: 37). The distributional hypothesis implies that it is possible to identify words with similar meanings by looking at items which have similar row vectors when a word-context matrix is constructed.

Before proceeding, allow me to illustrate what word vectors are and clarify exactly how they are created. A word vector (a term used interchangeably with ‘word embedding’) is an array of numbers which encodes the context in which a word is typically found in a corpus. For instance, consider the proverb ‘What is good for the goose is good for the gander’. This sentence can be represented as a word-context matrix as shown in Table 2.1. The columns represent each word present in the corpus (Table 2.1 assumes there are no other words in the English language besides those in the proverb) and are ordered alphabetically from left to right. The rows represent the words we want to generate vectors for – this usually means each word in the corpus gets its own row, but for illustrative purposes Table 2.1 only generates vectors for ‘good’ and ‘goose’. The number at the intersection of two words is generated by calculating count(c | w), where w is the word we want to generate a vector for and c is the word immediately after it. The word ‘for’ occurs twice after ‘good’, which means that the vector for ‘good’ is [2, 0, 0, 0, 0, 0, 0]. Similarly, the vector for ‘goose’ is [0, 0, 0, 0, 1, 0, 0]. Vectors such as the latter, where the only values are ones and zeros, are known as ‘one-hot’ arrays, and I return to them in section 3.1.

        for   gander   good   goose   is   the   what
good     2      0        0      0      0    0      0
goose    0      0        0      0      1    0      0
Table 2.1: An example of a word-context matrix. This matrix uses the proverb ‘what is good for the goose is good for the gander’ as a corpus.
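
As a minimal sketch of how a matrix like Table 2.1 can be computed, the snippet below counts, for each focus word, the word that immediately follows it in the toy corpus. It illustrates the definition above rather than reproducing the code used in the study.

```python
from collections import Counter

# Count, for each focus word, the word that immediately follows it.
corpus = "what is good for the goose is good for the gander".split()
vocab = sorted(set(corpus))                     # columns, ordered alphabetically

counts = {word: Counter() for word in vocab}
for focus, following in zip(corpus, corpus[1:]):
    counts[focus][following] += 1

# Row vectors for 'good' and 'goose', matching Table 2.1.
for focus in ("good", "goose"):
    print(focus, [counts[focus][column] for column in vocab])
# good  [2, 0, 0, 0, 0, 0, 0]
# goose [0, 0, 0, 0, 1, 0, 0]
```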

If we were to use the entirety of the BNC as the corpus instead of a single proverb, the vector for ‘good’ could look something like this: [2, 1, 4, 2, 6, (…)]. When iterating over large datasets the number of columns whose value is zero is substantial (as can be seen even in the toy example in Table 2.1). To overcome the inefficiency of having arrays full of zeroes and only occasional pieces of actual data, word embeddings are usually stored as what are known as sparse vectors. This means that only columns with non-zero values are stored. Such a vector can have hundreds of dimensions, as is the case of the Google News dataset, which contains vectors for three million words, each with three hundred dimensions (Mikolov et al. 2013b: 6).

Mikolov et al. reduced the computational complexity of vector generation and released a set of remarkable algorithms when they open-sourced their approach to Continuous bag-of-words and Skip-gram under the word2vec tool. They were able to do so by standing on the shoulders of giants, albeit ones who have since received less recognition. Bengio et al. published a paper on probabilistic language models which provided one of the earliest algorithms for generating and interpreting “distributed representations of words” (2003: 7). One of the pioneering outcomes of this paper was defeating the so-called ‘curse of dimensionality’, which Roweis & Saul had previously attempted to solve (2000). The curse refers to the fact that the sequences of words evaluated when implementing an algorithm are likely to differ from the sequences seen during training. Bengio et al.’s use of vector representations trumped prior solutions based on n-gram concatenation both in efficiency and in overcoming the hurdle of the ‘curse’. Another milestone on the path towards word2vec was Franks, Myers & Podowski’s patent “System and method for generating a relationship network”, filed during their time at the Lawrence Berkeley National Lab in 2005 (U.S. Patent 7,987,191). This method is exhaustive and more intricate than word2vec, but ultimately this complexity does not translate into gains in accuracy.
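
As a small illustration of sparse storage (not the exact representation used by word2vec or gensim), a row vector can be kept as a mapping from column index to non-zero value:

```python
# Dense row vector for 'goose' from Table 2.1 ...
dense = [0, 0, 0, 0, 1, 0, 0]

# ... stored sparsely as a mapping from column index to non-zero value.
sparse = {index: value for index, value in enumerate(dense) if value != 0}
print(sparse)                                   # {4: 1}

# The dense form can be recovered whenever it is needed.
recovered = [sparse.get(index, 0) for index in range(len(dense))]
assert recovered == dense
```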

2.3 The Vector Space Model

As mentioned earlier, the techniques introduced by word2vec are not ‘deep learning’ as such. Both the CBOW algorithm and the Skip-gram approach are shallow models which favour efficiency over intricacy. The choice between deep and shallow learning was made early in the planning stages of this study. Reading Jason Brownlee’s 2014 article on deep learning, in which he speaks of “the seductive trap of black-box machine learning”, was an early indication that a shallow model might be more suitable for my study (Brownlee 2014: 1). Brownlee highlights a central issue with neural networks, namely that they are by definition opaque processes. Jeff Clune succinctly summarises the issue: “even though we make these networks, we are no closer to understanding them than we are a human brain” (Castelvecchi 2016: 22). Despite Le & Zuidema’s (2015) recent success in modelling distributional semantics using Long Short-Term Memory (a type of recursive neural network), my mind was set against using deep learning in this study for two reasons. First, it would be excessively complicated for the scope of the research being undertaken, and second, the ‘black-box’ nature of neural networks would complicate writing about the inner workings of my algorithm.

Having decided between deep and shallow learning and opted for the latter, I faced another choice before creating my model: deciding between the two most widely adopted vector space representations, word2vec (Mikolov et al. 2013b) and GloVe (Global Vectors for Word Representation; Pennington, Socher & Manning 2014). Python implementations of both are available as open source, through Kula’s (2014) glove-python module for GloVe and Rehurek & Sojka’s (2010) gensim module for word2vec. The documentation for the gensim module characterises the difference between the two technologies by saying that GloVe requires more memory whereas word2vec takes longer to train (Rehurek 2014). Since memory is expensive and time was not a pressing concern, I chose to use the word2vec algorithms (CBOW and Skip-gram) as implemented by the gensim Python module. This decision was supported by Yoav Goldberg’s (2014) case study of the GloVe model. Goldberg disputes Pennington, Socher and Manning’s (2014) claim that GloVe outperforms word2vec by a wide margin, and does so by testing both on the same corpus, which the authors of the GloVe paper had neglected to do.

As mentioned earlier, Mikolov et al. (2013b) recommend using Skip-gram as opposed to CBOW in the paper that introduced word2vec. Their preference for Skip-gram is justified by the impressive accuracy improvements they report over earlier work such as Turian et al.’s 2010 paper on word embeddings. This preference is further substantiated by Goldberg & Levy’s 2014 study of the negative-sampling word embedding algorithm used by the Skip-gram approach. However, I must draw attention to the fact that Mikolov et al.’s original claims are based mainly on a word and phrase analogy task (extending the work first reported in Mikolov et al. 2013a). Since their original findings, impressive as they are, seem to be limited to this linguistic context, I cannot presuppose that the Skip-gram approach is necessarily best for all other cases. As such, part of my study is also devoted to testing whether the CBOW or the Skip-gram approach is more suitable for the task of generating word embeddings which successfully paraphrase verbal metonymy. Success in this task is defined as returning the highest proportion of accurate paraphrases. Since this project is being undertaken over a timespan of months, speed is a secondary concern.

Vectors for each word are created by observing the patterns in which a particular word tends to appear. For instance, when generating the vector for the name of a country it is quite likely that the structure ‘citizens of X marched on the streets…’ is present many times in the corpus, where X can be any of a number of countries. The important factor here is that it does not matter whether ‘France’, ‘Italy’ or ‘Nicaragua’ stands in the place of ‘X’. Rather, what matters is that the algorithm into which the resulting vectors are fed learns the relationship between each of these words and eventually recognises that they are all instances of the same kind of entity (despite not necessarily knowing that the label that speakers of English assign to these words is ‘country’).

An intuitive way of visualising how the algorithm sees these representations of meaning is shown in Figure 2.1, which is based on Mikolov et al.’s graphical representation of the Vector Space Model (2013b: 4) and uses data from my model trained on the BNC. Four vectors are shown, each connecting the semantic representation of a country to that of its capital city in the vector space. What is of interest is the proximity of the labels to one another and the angle at which the connecting vectors (grey dashed lines) are drawn. By observing the proximity of labels, we can intuitively tell that Spain, Italy and France are closer neighbours to each other than Nicaragua is to any of the three European countries. However, if Figure 2.1 were to show the entirety of the BNC, we would see that Nicaragua is indeed closer to Italy, for instance, than it is to ‘herring’. The grey lines, the vectors connecting word embeddings, are of relevance when seeking to evaluate the similarity between two entities in the vector space. By computing the cosine similarity between two such vectors, normalised similarity scores between the semantics of each word can be obtained (the method and consequences of doing so are explored in more depth in section 3.3). More intuitively, by looking at Figure 2.1 it is evident that the lines have similar angles and bearings, and as such must bear some similarity in the semantic relations they encode.

Figure 2.1: A vector space model of country-capital city relations. This model was generated using the Skip-gram approach trained on data from the British National Corpus.

Besides calculating similarity scores it is also possible to carry out vector algebra with these representations of meaning. Equations such as the aforementioned “Paris - France + Italy = Rome” (Mikolov et al. 2013b: 9) are interesting, but this paper makes recurrent use of cosine similarity instead. Mikolov et al.’s stringent accuracy metric (they accept only exact matches) means that the overall accuracy of this semantic algebra stands at 60% in their original implementation (2013b: 10). However, more recent studies have refined such algorithms and accept ‘closest-neighbour’ answers rather than limiting themselves to exact matches (Levy & Goldberg 2014; Gagliano et al. 2016). The present study rejects the hindering stringency of Mikolov et al. and instead uses a ‘closest-neighbour’ evaluation.
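
As an illustration of the closest-neighbour evaluation, the sketch below queries a gensim word2vec model for the nearest neighbours of the ‘Paris - France + Italy’ combination and for the cosine similarity of two single words. It assumes a Skip-gram model already trained on the BNC (training is described in Chapter 3); the file name is hypothetical.

```python
from gensim.models import Word2Vec

# Hypothetical: a Skip-gram word2vec model already trained on the BNC
# (training is described in Chapter 3); the file name is illustrative.
model = Word2Vec.load("bnc_skipgram.model")

# Closest-neighbour evaluation of the analogy Paris - France + Italy:
# the answer counts as correct if 'Rome' appears among the top neighbours,
# rather than demanding an exact first-place match.
neighbours = model.wv.most_similar(positive=["Paris", "Italy"],
                                   negative=["France"], topn=5)
print(neighbours)                 # e.g. [('Rome', 0.71), ...] (scores vary)

# Cosine similarity between two individual word vectors:
print(model.wv.similarity("Spain", "Madrid"))
```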

3.1 The BNC and word2vec

The British National Corpus is a dataset of British English from the second half of the twentieth century. The corpus consists of one hundred million words split across a number of genres, with 90% of the data making up the written part of the BNC, while the rest is composed of transcriptions of spoken English (Burnard 2007). The corpus also includes data from an automatic part-of-speech tagger. Text samples do not exceed forty-five thousand words each and were collected from a variety of sources writing in a number of genres. The first version of the BNC was released in 1994, with subsequent revisions appearing in 2001 and 2007. This study makes use of the 2007 BNC in XML format. The BNC has been used in studies covering a variety of disciplines, including syntax (Rayson et al. 2001), sociolinguistics (Xiao & Tao 2007) and computational linguistics (Verspoor 1997; Lapata & Lascarides 2003).

Besides being a reputable corpus which has been used in a number of studies, a major reason for choosing the BNC is of a more practical nature. The Natural Language Toolkit (a module extending the Python programming language) includes an interface for efficiently iterating over the BNC. Given my intention to use Python as the main language in this project and my previous experience with the NLTK module, the choice of the BNC as a source of data was an obvious one. I still had to write code of my own with which to parse the corpus, find co-occurrences and create vectors, but the use of NLTK sped up the process considerably.

A four-million-word sample (known as the BNC Baby) is available alongside the full BNC. This sample contains the same proportion of spoken and written texts, and the distribution of texts by genre and domain remains the same as in the full corpus. This parallel composition makes the BNC Baby well suited for use as a test set alongside a training set drawn from the full corpus. In this study, a ten-million-word fragment of the full BNC (disjoint from the BNC Baby) is used to generate the word embeddings on which the model is subsequently trained. The purpose of the training data is to discover relationships between words in the corpus, and it would be bad practice to test an algorithm on the same data used to train it. Neglecting to separate training and test data usually leads to overfitting, which means that the model learns too much about the random variation it should not be interested in rather than focusing on the actual relationships (Wei & Dunbrack 2013).
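
As an illustration of the NLTK interface mentioned above, the sketch below iterates over the BNC XML edition with NLTK's BNCCorpusReader. The corpus root and fileid pattern are assumptions that depend on where the corpus is unpacked locally.

```python
from nltk.corpus.reader.bnc import BNCCorpusReader

# The corpus root and fileid pattern are assumptions: they should point to
# wherever the 2007 BNC XML edition is unpacked locally.
bnc = BNCCorpusReader(root="corpora/BNC/Texts", fileids=r"[A-K]/\w*/\w*\.xml")

# Tokenised sentences (used later to build word vectors) ...
first_sentence = next(iter(bnc.sents()))
print(first_sentence)

# ... and the part-of-speech tags supplied by the corpus itself.
print(bnc.tagged_words()[:5])    # e.g. [('Factsheet', 'SUBST'), ...]
```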

The first step in this study is the generation of two vocabularies of word embeddings for the data found in the BNC: one generated using word2vec’s Continuous bag-of-words, the other using Skip-grams. First, I analyse the CBOW approach. Consider the following sentence: ‘Those who cannot remember the past are condemned to compute it’ (Pinker 1999: 164). The first step CBOW takes in order to generate word vectors from this sentence is to ‘read’ through it one word at a time, as through a sliding window which includes a focus word together with the four previous words and the four following words. This means that for the focus word ‘past’, its context window is formed by ‘who cannot remember the’ and ‘are condemned to compute’. A window size of four context words was chosen on account of Shutova et al.’s (2012) experimental success with a smaller window size than the one used by Erk and Padó (2008). Additionally, Mikolov et al.’s original paper also uses four words and warns that window size is one of the most important factors affecting the performance of a model (2013b: 8).

The context words are encoded as ‘one-hot’ arrays, as seen previously in the example for ‘goose’ in Table 2.1. The number of dimensions was set to one hundred columns. Additionally, a weight matrix is constructed, which is a representation of the frequency of each word in the corpus. This weight matrix has V rows, where V is the size of the vocabulary, and D columns, where D is the size of the context window, also referred to as the number of dimensions (in my implementation, eight dimensions). The weight matrix does not represent a one-to-one relation between the values in a row and the word the row represents. Rather, the representation of the word is scattered amongst all the columns in the array. In the example of the quote by Pinker, each one-hot array would have eleven dimensions, with only one of its columns set to one and the rest to zero.

Once the arrays and weight matrix have been created, the algorithm trains the model with the aim of maximising P(w | c1, …, c8), that is, maximising the probability of observing the focus word w given the eight context words c1, …, c8 surrounding it. In our example the objective of training is to maximise the probability of ‘past’ given the eight words in the context window as an input. Table 3.1 shows a one-hot array with eleven dimensions (the vocabulary size) being multiplied by the weight matrix for the corpus (this matrix is truncated and only shows the first three of eleven rows, but does show the full number of dimensions: eight, corresponding to the number of context words). CBOW computes the final vector for each word it encounters by performing this operation many times. Finally, a normalised exponential function (also known as the softmax function) is used to produce a categorical distribution: a probability distribution over D dimensions (Mikolov et al. 2013a; Morin & Bengio 2005).

The Skip-gram model, on the other hand, takes the Continuous bag-of-words approach and effectively turns it on its head. Where CBOW uses eight one-hot context word arrays as inputs, Skip-gram uses a single array: a one-hot array of size V constructed from the focus word instead (‘past’ in the example above). The same process involving the weight matrix is used, but this time the aim is to output the probability of observing each of the context words. Where CBOW outputs a single probability distribution, Skip-gram outputs eight different ones. This last step is quite resource intensive, particularly in regards to memory. An efficient and effective solution discussed by Rong is to “limit the number of output vectors that must be updated per training instance” (2014: 10). This is achieved using hierarchical softmax, which represents all the words in the vocabulary as leaves of a binary tree and computes the probability of a random walk from the root to any word in the vocabulary. The further intricacies of this approach are beyond the scope of the present paper. Instead, I direct the reader’s attention to the work of Morin & Bengio (2005), Mnih & Hinton (2009) and the aforementioned paper by Rong (2014), which explains in great detail the many parameters of word2vec and the potential optimisations that may be applied to CBOW and Skip-gram. The main advantage of implementing Skip-grams with the improvements suggested by Rong is that there is a boost to speed without a loss of accuracy. Instead of having the Skip-gram algorithm evaluate V output vectors, it only has to process about log2(V) of them (Rong 2014: 13). For the Pinker example, this means going from 11 vectors to roughly 3.46 – a considerable difference, since there are 68% fewer arrays to evaluate.

Despite the obvious improvements offered by Skip-grams, two separate training sets were created by running CBOW and Skip-grams on a ten-million-word fragment of the British National Corpus. These sets have a vocabulary of the ten thousand most frequent words in the BNC, and the arrays have one hundred dimensions (columns). As a benchmark, this may be compared to the size of the Google News vector dataset, which has become one of the standards for collections of word embeddings in academia and the open source community. Released by Mikolov et al. (2013b), it has a vocabulary composed of the three million most frequent words and phrases in Google News articles, with each array comprising three hundred dimensions.
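
A minimal sketch of how the two training sets could be produced with the gensim module referred to in section 2.3 follows. The sentence iterable, the file names and the use of max_final_vocab to approximate the ten-thousand-word vocabulary are assumptions, and parameter names follow gensim 4 (earlier releases used size instead of vector_size).

```python
from gensim.models import Word2Vec

# 'training_sentences' stands in for an iterable of tokenised sentences drawn
# from the ten-million-word BNC fragment (e.g. produced with NLTK's BNC
# reader, as sketched in section 3.1). It is a placeholder, not real data.
training_sentences = [["the", "cook", "finished", "the", "meal"],
                      ["she", "began", "the", "book"]]

# Parameter names follow gensim >= 4 (older releases used `size` instead of
# `vector_size`); `max_final_vocab` caps the vocabulary at the most frequent
# words, approximating the ten-thousand-word vocabulary described above.
common = dict(vector_size=100, window=4, max_final_vocab=10000,
              min_count=1, workers=4)

cbow_model = Word2Vec(sentences=training_sentences, sg=0, **common)      # CBOW
skipgram_model = Word2Vec(sentences=training_sentences, sg=1, **common)  # Skip-gram

cbow_model.save("bnc_cbow.model")
skipgram_model.save("bnc_skipgram.model")
```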

input                          weight matrix              hidden layer
1 × V                          V × D                      1 × D
[ 0 0 1 0 0 0 0 0 0 0 0 ]  ×  [ a b (…) w x ]          =  [ i, j, (…) o, p ]
Table 3.1: One-hot vectors in the CBOW algorithm. V is the vocabulary size and D is the size of the context window. The values ‘a’ through ‘x’ represent the distribution of weights assigned as a function of each word’s frequency in the vocabulary. (Adapted from Colyer 2016.)
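
To make the operation in Table 3.1 concrete, the numpy sketch below (with random stand-in weights rather than learned values) shows that multiplying a one-hot 1 × V array by the V × D weight matrix simply retrieves one row of that matrix, which becomes the hidden layer.

```python
import numpy as np

# Illustrative dimensions from the Pinker example: V = 11 vocabulary items and
# D = 8 (the value the text treats as the number of dimensions). The weights
# are random stand-ins for the learned values 'a' ... 'x' in Table 3.1.
V, D = 11, 8
rng = np.random.default_rng(0)
weight_matrix = rng.random((V, D))       # V x D

one_hot = np.zeros(V)                    # 1 x V array for a single word
one_hot[2] = 1.0                         # set the word's index to one

hidden_layer = one_hot @ weight_matrix   # 1 x D hidden layer
# Multiplying by a one-hot array simply copies out row 2 of the weight matrix:
assert np.allclose(hidden_layer, weight_matrix[2])
```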

3.2 Searching for metonymy

Once the two training datasets have been generated, they are not needed again until section 3.3, where they are used to evaluate the paraphrases generated by the model. The next step is to decide the kind of metonymy that the model should aim to paraphrase and to look for examples in the test data. The test data is composed of the entirety of the four-million-word BNC Baby. In order to keep the set of sentences to paraphrase and the number of candidates returned by the model manageable, this experiment only considers instances of metonymy which employ one of three verbs: ‘begin’, ‘enjoy’ and ‘finish’. These three verbs have been selected for two reasons. The first is so that verbs from both categories defined by Katsika et al.’s 2012 psycholinguistic study on complement coercion are present: ‘begin’ and ‘finish’ are metonymic aspectual verbs, while ‘enjoy’ is a metonymic psychological verb (Katsika et al. 2012: 61). Secondly, these three verbs are spread across the range of values presented by Utt et al. (2013). Their paper assigns numerical values to a measure of ‘eventhood’ which captures the extent to which these verbs “expect objects that are events rather than entities” (2013: 31). ‘Begin’ receives an eventhood score of 0.91, ‘finish’ 0.66 and ‘enjoy’ 0.57 (the upper bound for eventhood was 0.91, the lower 0.54), and all three are confirmed to take part in metonymic constructions (Utt et al. 2013: 7).

First, the BNC Baby is scraped for sentences which contain one of the three target verbs. This cuts down on the processing costs of subsequent tasks, since it is then not necessary to iterate over irrelevant sections of the corpus. Next, the algorithm looks through these files for noun phrases immediately after, or in close proximity to, one of the three verbs – these are potential instances of metonymy. This is facilitated by the fact that the BNC features an extensive amount of metadata in the tags for each word. Once the list of all sentences which potentially contain verbal metonymy has been created, the sentences are inspected manually to discard false positives where there is no target to paraphrase (naturally, it would be ideal to automate this task, and this is considered in the future directions evaluated in Chapter 6). However, it is crucial that these target sentences are actual instances of metonymy, and as such this step required human judgement. The task described in section 3.3, the actual generation and ranking of paraphrases, is completely automated. The list of target sentences is then scraped for the noun phrases that are typically used in conjunction with the three verbs when they are used metonymically. Once this is done, the entirety of the BNC Baby is searched for sentences containing the noun phrases observed in the previous step. This approach of concentrating on collocations is supported empirically by the results reported by Verspoor (1997), who found that “95.0% of the logical metonymies for begin and 95.6% of the logical metonymies for finish can be resolved on the basis of information provided by the noun the verb selects for” (Lapata & Lascarides 2003: 41).
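
A minimal sketch of this initial scan is given below, assuming NLTK's BNCCorpusReader and the simplified part-of-speech tags included in the BNC. The corpus path, the surface forms of the target verbs and the three-token 'close proximity' window are illustrative assumptions rather than the exact criteria used in the study.

```python
from nltk.corpus.reader.bnc import BNCCorpusReader

# Surface forms of the target verbs; an assumption for illustration.
TARGET_VERBS = {"begin", "began", "begun", "enjoy", "enjoyed",
                "finish", "finished"}

bnc_baby = BNCCorpusReader(root="corpora/BNC-Baby/Texts",   # assumed path
                           fileids=r"\w*/\w*\.xml")

candidates = []
for sent in bnc_baby.tagged_sents():     # simplified tags, e.g. 'VERB', 'SUBST'
    words = [word.lower() for word, _ in sent]
    for i, (word, tag) in enumerate(sent):
        # Potential metonymy: a target verb with a noun shortly after it.
        if tag == "VERB" and word.lower() in TARGET_VERBS:
            window = sent[i + 1:i + 4]   # 'close proximity' = three tokens here
            if any(t == "SUBST" for _, t in window):
                candidates.append(" ".join(words))
                break
```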

3.3 Generating paraphrases

Once the model has observed all the sentences in the BNC Baby that contain noun phrases commonly associated with instances of verbal metonymy involving one of the three verbs, it is time for the last two steps of the algorithm. This section covers the generation of paraphrases and the subsequent assignment of a confidence score to each of these candidates so that they may be ranked. The first task, that of generating the paraphrases, iterates through the sentences collected at the end of the last section and validates them with the Stanford parser. For example, suppose that the algorithm aims to paraphrase “He seems to enjoy the job, doesn’t he?” (Appendix 1: enjoy-3). It searches the BNC Baby for collocates of the noun phrase “the job” and returns candidate sentences that include constructions such as “get / do / see the job”. Once this is done, each candidate is considered separately and submitted to the updated Stanford dependency parser (Manning et al. 2016). Originally released by Marneffe et al. (2006), the parser provides “both a syntactic and a shallow semantic representation” (Manning & Schuster 2016: 1). The parser outputs typed dependencies (grammatical relations) between the elements of any string provided as input. In this model, it performs the task of checking that the verb and noun in the candidate paraphrase are in a direct object relationship. The final step in the algorithm is to compute the confidence score for any approved candidates. This is done by computing the cosine similarity between the joint word vector of the target phrase and that of the candidate. The similarity is obtained by dividing the dot product of the two vectors by the product of the two vectors’ magnitudes, or expressed as a formula:

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

A simple explanation for this formula is that the numerator measures the degree to which the two vectors are related, while the denominator serves as a normalization factor that keeps the result under a maximum value of one. Figure 3.1 shows this graphically – the angle between two vectors is measured and the cosine of the angle gives a value which determines how related the two vectors are. The function returns a minimum value of zero corresponding to vectors perpendicular to each other, meaning they are unrelated. Values range up to one, only obtained when comparing identical vectors.

Figure 3.1: Cosine similarity between word vectors. Each sentence is represented by a vector and the cosine of the angle between them yields a similarity score.
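
As a minimal numerical sketch of the formula above, the snippet below computes the cosine similarity of two vectors with numpy. The two vectors are illustrative stand-ins for the joint phrase vectors described in the text, not values taken from the trained models; gensim's KeyedVectors also exposes an n_similarity method that computes a comparable score directly from two lists of words.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for the joint vectors of a target phrase and a candidate;
# the real vectors come from the CBOW and Skip-gram models trained earlier.
target = np.array([0.2, 0.7, 0.1])
candidate = np.array([0.25, 0.6, 0.05])
print(cosine_similarity(target, candidate))      # close to 1: strongly related

perpendicular = np.array([0.7, -0.2, 0.0])       # orthogonal to `target`
print(cosine_similarity(target, perpendicular))  # 0.0: unrelated vectors
```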

The model generates similarity scores for each paraphrase, rejecting those that score below 0.2 as completely irrelevant. Sentences with scores above 0.5 are considered viable paraphrases. Two rankings are created: one that measures similarity against the CBOW word embeddings generated earlier and another that does the same but uses vectors produced using the Skip-gram approach. The algorithm has reached the end of its cycle. The next chapter presents the results of this study, highlighting the successes and pitfalls of the model.
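
To make the thresholding concrete, the sketch below shows one way the filtering and ranking step could be organised. The `score` function and `candidates` list are hypothetical placeholders for the cosine-similarity computation and the parser-approved paraphrases described above.

```python
# A minimal sketch of the filtering and ranking step, assuming `candidates`
# holds the parser-approved paraphrases for one target sentence and
# `score(candidate)` returns the cosine similarity described above.
REJECT_BELOW = 0.2   # scores under this are discarded as completely irrelevant
VIABLE_ABOVE = 0.5   # scores over this are considered viable paraphrases

def rank_candidates(candidates, score):
    scored = [(candidate, score(candidate)) for candidate in candidates]
    kept = sorted((pair for pair in scored if pair[1] >= REJECT_BELOW),
                  key=lambda pair: pair[1], reverse=True)
    viable = [pair for pair in kept if pair[1] > VIABLE_ABOVE]
    return kept, viable

# One ranking is produced per set of embeddings, e.g. (hypothetical scorers):
# cbow_ranking = rank_candidates(candidates, cbow_score)
# skipgram_ranking = rank_candidates(candidates, skipgram_score)
```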

References

  • [1] Bengio, Yoshua et al. 2003. A neural probabilistic language model. Journal of machine learning research 3. 1137-1155
  • [2] The British National Corpus, version 3 (BNC XML Edition). (2007). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk
  • [3] Brownlee, Jason. 2014. The seductive trap of black-box machine learning. Online article http://machinelearningmastery.com/the-seductive-trap-of-black-box-machinelearning/ (Accessed: 24th October 2016).
  • [4] Burnard, Lou. 2007. Reference guide for the British National Corpus (XML edition). http://www.natcorp.ox.ac.uk/docs/URG (Accessed: 8th March 2017).
  • [5] Castelvecchi, Davide. 2016. Can we open the black box of AI? Nature News 538(7623). 20-23.
  • [6] Colyer, Adrian. 2016. The amazing power of word vectors. https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors (Accessed 20th February 2017).
  • [7] Erk, Katrin & Sebastian Padó. 2008. A structured vector space model for word meaning in context. Conference on Empirical Methods in Natural Language Processing. 897-906.
  • [8] Firth, J. R. (1935). The technique of semantics. Transactions of the Philological Society, 34(1). 36-73.
  • [9] Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in linguistic analysis. 1–32.
  • [10] Franks, Kasian, Cornelia A. Myers, & Raf M. Podowski. 2011. System and method for generating a relationship network. U.S. Patent 7,987,191.
  • [11] Gagliano, Andrea et al. 2016. Intersecting Word Vectors to Take Figurative Language to New Heights. Computational Linguistics for Literature. 20-31.
  • [12] Goldberg, Yoav. 2014. On the importance of comparing apples to apples: a case study using the GloVe model. Unpublished manuscript. https://docs.google.com/document/d/1ydIujJ7ETSZ688RGfU5IMJJsbxAi-kRl8czSwpti15s/ (Accessed 15th October 2016).
  • [13] Goldberg, Yoav & Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 https://arxiv.org/pdf/1402.3722v1.pdf (Accessed: 18th November 2016).
  • [14] Harris, Zellig S. 1954. Distributional structure. Word 10(2). 146-162.
  • [15] Jackendoff, R. (1997). The Architecture of the Language Faculty. MIT Press, Cambridge.
  • [16] Katsika, A., D. Braze, A. Deo & M. M. Piñango. 2012. Complement coercion: Distinguishing between type-shifting and pragmatic inferencing. The Mental Lexicon 7(1). 58-76.
  • [17] Kottur, Satwik et al. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. IEEE Conference on Computer Vision and Pattern Recognition. 4985-4994.
  • [18] Lapata, Maria & Alex Lascarides. 2003. A probabilistic account of logical metonymy. Computational Linguistics 29(2). 261-315.
  • [19] Lapata, Maria, Frank Keller & Christoph Scheepers (2003). Intra-sentential context effects on the interpretation of logical metonymy. Cognitive Science 27(4), 649–668
  • [20] Le, Phong & Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. arXiv preprint arXiv:1503.02510. https://arxiv.org/pdf/1503.02510.pdf (Accessed: 21st November 2016).
  • [21] Levy, Omer & Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. Computational Natural Language Learning (CoNLL). 171-180.
  • [22] Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge: Cambridge University Press
  • [23] Manning, Christopher D. et al. 2016. Universal dependencies v1: A multilingual treebank collection. 10th edition of the Language Resources and Evaluation Conference (LREC 2016). 1659-1666.
  • [24] Manning, Christopher D. & Sebastian Schuster. 2016. Enhanced English universal dependencies: An improved representation for natural language understanding tasks. 10th edition of the Language Resources and Evaluation Conference (LREC 2016). 2371-2378.
  • [25] Markert, Katja & Malvina Nissim. 2006. Metonymic proper names: A corpus-based account. Trends in linguistic studies and monographs 171. 152-169.
  • [26] Marneffe, Marie-Catherine, Bill MacCartney, & Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. International Conference on Language Resources and Evaluation 6. 449-454.
  • [27] McElree, Brian, Matthew Traxler, Martin Pickering, Rachel Seely & Ray Jackendoff. 2001. Reading time evidence for enriched composition. Cognition 78(1), 17-25.
  • [28] Mikolov, Tomas et al. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 https://arxiv.org/pdf/1301.3781.pdf (Accessed: 6th March 2017).
  • [29] Mikolov, Tomas et al. 2013b. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 3111-3119.
  • [30] Mnih, Andriy & Geoffrey E. Hinton. 2009. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems. 1081-1088.
  • [31] Moore, Gordon. 1965. Cramming more components onto integrated circuits. Electronics 38(8). 114-116.
  • [32] Morin, Frederic & Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. 10th International Workshop on Artificial Intelligence and Statistics (5). 246-252.
  • [33] “compère, v.” OED Online. 2017. Oxford University Press. http://www.oed.com (Accessed: 25th March 2017).
  • [34] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. Conference on Empirical Methods in Natural Language Processing (EMNLP) 14. 1532-1543.
  • [35] Pershina, Maria et al. 2015. Idiom Paraphrases: Seventh Heaven vs Cloud Nine. Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem). 76-82.
  • [36] Pinker, Steven. 1999. Words and Rules: The Ingredients of Language. Weidenfeld & Nicolson, London.
  • [37] Rapp, Reinhard. 2003. Word sense discovery based on sense descriptor dissimilarity. Ninth Machine Translation Summit, 315-322.
  • [38] Rayson, Paul et al. 2001. Grammatical word class variation within the British National Corpus sampler. Language and Computers 36(1). 295-306.
  • [39] Rehurek, Radim, and Petr Sojka. 2010. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks. 1-5.
  • [40] Rehurek, Radim. 2014. Making sense of word2vec. https://raretechnologies.com/making-sense-of-word2vec (Accessed: 9th October 2016).
  • [41] Rong, Xin. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738 https://arxiv.org/pdf/1411.2738.pdf (Accessed 4th February 2017).
  • [42] Roweis, Sam, and Lawrence Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500). 2323-2326
  • [43] Salton, Gerard et al. 1975. A vector space model for automatic indexing. Communications of the Association for Computing Machinery (ACM) 18(11). 613-620.
  • [44] Shutova, Ekaterina et al. 2012. Unsupervised Metaphor Paraphrasing using a Vector Space Model. 24th International Conference on Computational Linguistics (COLING). 1121-1130
  • [45] Shutova, Ekaterina et al. 2013. A computational model of logical metonymy. ACM Transactions on Speech and Language Processing (TSLP) 10(3). 11-39
  • [46] Traxler, Matthew, Robin Morris & Rachel Seely. 2002. Processing subject and object relative clauses: Evidence from eye movements. Journal of Memory and Language 47(1). 69-90.
  • [47] Turian, Joseph, Lev Ratinov, & Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. 48th Annual Meeting of the Association for Computational Linguistics (ACL). 384-394.
  • [48] Turney, Peter. 2006. Similarity of semantic relations. Computational Linguistics 32(3), 379-416.
  • [49] Utiyama, Masao, Masaki Murata, & Hitoshi Isahara. 2000. A statistical approach to the processing of metonymy. 18th conference on Computational Linguistics 2. 885-891.
  • [50] Utt, Jason et al. 2013. The curious case of metonymic verbs: A distributional characterization. 10th International Conference on Computational Semantics (W13-0604). 30-39.
  • [51] Verspoor, Cornelia Maria. 1997. Conventionality-governed logical metonymy. 2nd international workshop on Computational Semantics. 300-312.
  • [52] Wei, Qiong & Roland Dunbrack. 2013. The role of balanced training and testing data sets for binary classifiers in bioinformatics. PloS one 8(7). e67863.
  • [53] Wolf, Lior, Yair Hanani, Kfir Bar & Nachum Dershowitz. 2014. Joint word2vec networks for bilingual semantic representations. International Journal of Computational Linguistics and Applications 5(1). 27-44.
  • [54] Xiao, Richard & Hongyin Tao. 2007. A corpus-based sociolinguistic study of amplifiers in British English. Sociolinguistic studies 1(2). 241-273

I Full set of ranking tables

This appendix provides the full set of results for the experiment described in this paper. It gives a full account of the paraphrase rankings for the 49 target sentences considered by the model. The ‘Not in vocab.’ label indicates a phrasal verb or idiom (e.g. “Keep the process going” or “Keep an eye on the scene”) that was scraped by the algorithm but could not be evaluated. Confidence scores with green backgrounds are those above the 0.5 threshold, with grey representing those below 0.5. Additionally, scores preceded by an asterisk and set in bold type are those which human judgement has deemed to be false positives or false negatives. The dataset can be summarised as follows:
‘Begin’: 10 instances of verbal metonymy evaluated. 48 paraphrases generated: 12 true positives and 29 true negatives; 3 false positives and 4 false negatives.
‘Enjoy’: 20 instances of verbal metonymy evaluated. 84 paraphrases generated in total: 31 true positives and 39 true negatives; 9 false positives and 5 false negatives.
‘Finish’: 11 instances of verbal metonymy evaluated. 47 paraphrases generated in total: 9 true positives and 27 true negatives; 2 false positives and 9 false negatives.

Data for verbal metonymy containing ‘begin’:

begin-1 begin-2
“Before I began the formal research…”
(aca/CRS, sentence 1012)
“…he liked to begin the unwinding process.”
(fic/CDB, sentence 201)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Undertake the research. 0.60157 Undergo the process. 0.57620
Conduct the research. 0.55156 Do the process. 0.53042
Assist the research. *0.53126 Carry out the process. *0.49178
Inform the research. 0.42233 Build the process. 0.49178
Cite the research. 0.32108 Keep the process going. Not in vocab.
Set out the research. Not in vocab.
begin-3 begin-4
“…any attempt to begin the painful separation…”
(aca/CTY, sentence 412)
“I think you should begin the next chapter now.”
(aca/F9V, sentence 899)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Reflect the separation. 0.39124 Read the chapter. 0.51943
Abolish the separation. 0.38755 Write the chapter. 0.35864
Overcome the separation. 0.38322 Explain the chapter. 0.35352
Discuss the chapter. 0.27910
begin-5 begin-6
“…could’ve used material from the question
to begin the essay…”
(aca/HXH, sentence 1356)
“…persuaded Louis to begin the task not
completed…”
(aca/EA7, sentence 439)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Start the essay. *0.94397 Face the task. 0.59582
Write the essay. 0.56490 Give the task. 0.43031
Build the essay. 0.47654 Allocate the task. 0.42848
Organise the essay. 0.44926 Achieve the task. 0.39055
Develop the essay. *0.45284 Ignore the task. 0.34324
Form the essay. 0.36391 Delegate the task. 0.31528
Follow the essay. 0.35311 Have the task. 0.27727
Structure the essay. 0.34428 Synthesise the task. 0.27803
Shape the essay. 0.33418 Turn his hand to the task. *Not in vocab.
begin-7 begin-8
“…to begin the usual psalms…”
(fic/H9C, sentence 2201)
“…went out to the kitchen to begin
the dinner.”
(fic/H9C, sentence 3078)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Sing the psalm. 0.54752 Cook the dinner. 0.51848
Chant the psalm. *0.47784 Leave the dinner. 0.33579
Share the dinner. 0.34324
begin-9 begin-10
“…Gaveston began the dark satanic ritual…”
(fic/H9C, sentence 3078)
“The All Blacks begin the Irish leg of
their tour…”
(news/A80, sentence 276)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Perform the ritual. 0.67040 Go on the leg. 0.60371
Complete the ritual. *0.54511 Prepare for the leg. 0.59806
Witness the ritual. 0.35892 Face the leg. *0.49084
Win the leg. 0.33879

Data for verbal metonymy containing ‘enjoy’:

enjoy-1 enjoy-2
“…the union enjoys the same defences as
an individual.”
(aca/FSS, sentence 1456)
“…a coalition would enjoy the support…”
(aca/J57, sentence 1719)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Have the defence. 0.53353 Receive the support. 0.59601
Apply the defence. 0.49698 Have the support. 0.52088
Pursue the defence. 0.47940 Secure the support. *0.48592
Assess the defence. 0.37605 Rally the support. 0.33238
Contest the defence. 0.35429 Command the support. 0.28217
enjoy-3 enjoy-4
“He seems to enjoy the job doesn’t he?”
(dem/KPB, sentence 2187)
“…well erm I enjoyed the Mozart…”
(dem/KPU, sentence 955)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Do the job. 0.68356 Listen to the Mozart. 0.67809
Get the job. *0.65040 See the Mozart. *0.59159
See the job. 0.32827 Hear the Mozart. 0.53761
enjoy-5 enjoy-6
“You’ll enjoy the story.”
(dem/KBW, sentence 9489)
“Though I enjoyed the book immensely…”
(news/K37, sentence 174)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Hear the story. *0.61463 Read the book. *0.53622
Read the story. 0.55733 Delve (into) the book. 0.48435
Know the story. 0.53276 Sell the book. 0.43931
Tell the story. 0.48357 Publish the book. 0.42264
Like the story. 0.48325 Review the book. 0.34899
Finish the story. 0.43046 Research the book. 0.26940
enjoy-7 enjoy-8
“I’ve enjoyed the concert but…”
(dem/KPU, sentence 975)
“Charlotte did not enjoy the journey…”
(fic/CB5, sentence 1908)
Paraphrase candidate Confidence Paraphrase candidate Confidence
See concert. 0.68158 Spend the journey. *0.64580
Listen to the concert. 0.58792 Observe the journey. 0.47043
Go to the concert. 0.55673 Complete the journey. 0.49725
enjoy-9 enjoy-10
“Owen himself rather enjoyed the view…”
(fic/J10, sentence 946)
“…doing the job by myself and enjoying
the work.”
(fic/CCW, sentence 2133)
Paraphrase candidate Confidence Paraphrase candidate Confidence
See the view. 0.67598 Do the work. 0.62705
Admire the view. 0.67346 Continue the work. 0.54503
Look at the view. 0.52113 Handle the work. 0.42257
Screen the view. 0.36740 Grudge the work. 0.34002
Delay the work. 0.30198
enjoy-11 enjoy-12
“She enjoyed the opera…”
(fic/G0Y, sentence 2445)
“…anyone who enjoys the criticism.”
(news/AHC, sentence 342)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Go to the opera. 0.56013 Take the criticism. 0.60199
Visit the opera. 0.51182 Come under criticism. 0.58639
Accept the criticism. 0.51384
Face the criticism. *0.38531
Rebut the criticism. 0.36010
enjoy-13 enjoy-14
“You will not enjoy the meeting.”
(fic/H85, sentence 1150)
“If only Kit could enjoy the scene…”
(fic/G0S, sentence 1968)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Attend the meeting. 0.54394 See the scene. 0.68378
Ensure the meeting. *0.50981 Watch the scene. 0.67125
Arrange the meeting. 0.48937 Take in the scene. 0.59520
Open the meeting. 0.43172 Visualise the scene. 0.48953
Witness the meeting. 0.39758 Stare at the scene. 0.42828
End the meeting. 0.38716 Picture the scene. 0.34204
Chair the meeting. 0.33628 Survey the scene. *0.29907
Keep an eye on the scene. Not in vocab.
enjoy-15 enjoy-16
“…she enjoyed the smell and the
sound of them.”
(fic/J54, sentence 1218)
“…the professionals enjoying
the advantages…”
(news/A8P, sentence 87)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Inhale the smell. 0.51965 Have the advantage. *0.48953
Catch the smell. 0.50337 Is the advantage. 0.47691
Notice the smell. 0.38023 Explain the advantage. 0.38402
Detect the smell. 0.31561
enjoy-17 enjoy-18
“…Austria traditionally enjoys
the distinction…”
(news/A3P, sentence 109)
“…appears to enjoy the attentions of
his doting…”
(news/K37, sentence 188)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Have the distinction. 0.61792 Have the attention. 0.62154
Make the distinction. *0.52451 Attract the attention. 0.55481
Give the distinction. 0.49339 Catch the attention. 0.50076
enjoy-19 enjoy-20
“…we set forth to enjoy the countryside.”
(news/AJF, sentence 255)
“…say I actually enjoyed the experience…”
(news/AHC, sentence 741)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Love the countryside. *0.68629 Appreciate the experience. *0.83557
Visit the countryside. *0.41460 Have the experience. 0.67259
Scour the countryside. 0.34624 Forget the experience. 0.49317

Data for verbal metonymy containing ‘finish’:

finish-1 finish-2
“…we haven’t finished the garden.”
(dem/KP5, sentence 851)
“You haven’t finished the work
over there…”
(dem/KBW, sentence 16021)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Do the garden. 0.57878 Do the work. 0.57858
Go (in) the garden. 0.48242 Get the work. 0.47965
Dig the garden. *0.35942 Carry (on with) the work. 0.43916
finish-3 finish-4
“I won’t finish the whole book.”
(dem/KBW, sentence 17355)
“And then we finished the game…”
(dem/KB7, sentence 501)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Read the book. 0.59457 Win the game. 0.8159
Put the book. 0.47582 End the game. *0.59684
See the book. 0.46215 Play the game. 0.56021
Bring the book. 0.37519 Buy the game. 0.39317
Have the book. 0.37220 Like the game. 0.33214
Wade through the book. Not in vocab.
finish-5 finish-6
“We finished the story…”
(dem/KBW, sentence 17074)
“…and finish the mortgage earlier.”
(dem/KB7, sentence 3736)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Enjoy the story. 0.42942 Pay the mortgage. 0.60147
Hear the story. 0.39092 Clear the mortgage. *0.47388
Read the story. *0.38208 Wait for the mortgage. 0.41958
Tell the story. 0.37544 Afford the mortgage. 0.41605
finish-7 finish-8
“Adam had finished the list
of instructions…”
(fic/G0L, sentence 1487)
“…finished the game with only…”
(news/CH3, sentence 6700)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Reach (for) the list. 0.4608 Play the game. 0.56537
Look (at) the list. 0.42192 Miss the game. *0.50261
Read the list. *0.38076 Save the game. 0.48052
Write the list. 0.42559 Watch the game. 0.40168
Develop the list. 0.35096 Control the game. 0.28200
finish-9 finish-10
“She finished the whisky.”
(fic/K8V, sentence 3338)
“…I want to stay on here to finish the job.”
(news/CH3, sentence 307)
Paraphrase candidate Confidence Paraphrase candidate Confidence
Drink the whisky. 0.58234 Do the job. 0.58553
Gulp the whisky. *0.41849 Get the job. *0.48158
Look (at) the whisky. 0.41755 Find the job. 0.42892
Toss back the whisky. *0.38207 Have the job. 0.37238
finish-11
“Finish the last packet of cigarettes…”
(news/BM4, sentence 1431)
Paraphrase candidate Confidence
Carry the packet. 0.44237
Crumple the packet. 0.40580
Smoke the packet. *0.35518
Open the packet. 0.36162

II Colophon

Servers hosted on Amazon Web Services were used to process data. The specifics of how these servers were configured can be found in Appendix III. These servers ran Ubuntu, and scripts were written in Python 3.6 (https://www.python.org). The experiment made extensive use of the NLTK (to parse the BNC) and gensim (Rehurek & Sojka 2010) modules. The updated Stanford dependency parser was also used (Manning et al. 2016; Manning & Schuster 2016). Locally, I used Sublime Text 3 as a Python IDE and text editor. PuTTY was my SSH client of choice and FileZilla was used to upload large files (such as the BNC) to the servers’ Elastic Block Store. The code for this paper is available as doi:10.5281/zenodo.569505 and on GitHub: https://github.com/albertomh/ug-dissertation.

III AWS server architecture

The experiment was carried out on AWS servers running Ubuntu 16.04 (Long Term Support version). The creation of word vectors and the evaluation of cosine similarity between targets and candidates required a t2.large instance. word2vec stores its parameters as arrays of size vocabulary size × number of dimensions × 4 bytes (the size of a float). Three matrices of these characteristics are kept in RAM at any one point. In this experiment a 10,000 × 100 matrix was kept unfragmented in memory. This required a server with at least 4GB of RAM. For scraping verb-noun phrase (VNP) collocates from the BNC a t2.micro instance was used. Both the large and micro servers use Intel Xeon processors, which provide a balance of computational, memory and network resources and are burstable beyond baseline performance on demand. The BNC was kept on the Elastic Block Store, chunked in XML files. Scripts ran concurrently with the EBS instance to access the BNC. Scraped data was stored as Python data types (lists, dictionaries) in text files and then interpreted using the ast module when needed (so as to not keep it permanently loaded in memory).

Figure 6.1: The infrastructure for the experiments was built using Amazon Web Services. Elastic Compute Cloud servers processed data held in Elastic Block Stores.