
Metaphoric Paraphrase Generation

by   Kevin Stowe, et al.

This work describes the task of metaphoric paraphrase generation, in which we are given a literal sentence and are charged with generating a metaphoric paraphrase. We propose two different models for this task: a lexical replacement baseline and a novel sequence to sequence model, 'metaphor masking', that generates free metaphoric paraphrases. We use crowdsourcing to evaluate our results, and also develop an automatic metric for evaluating metaphoric paraphrases. We show that while the lexical replacement baseline is capable of producing accurate paraphrases, these often lack metaphoricity; our metaphor masking model excels at generating metaphoric sentences while performing nearly as well with regard to fluency and paraphrase quality.


1 Introduction

Metaphors have long posed significant problems to researchers across a wide variety of fields. While humans seem capable of easily understanding even complex metaphors, it remains difficult to devise a formal analysis that captures the depth and breadth of meanings produced by novel metaphors. We typically think of metaphors within the Conceptual Metaphor framework Lakoff and Johnson (1980); Lakoff (1993), in which metaphors are based in conceptual mappings between different domains: we have cognitive concepts that can be used to represent and understand other concepts, and these mappings can be expressed linguistically to form concrete metaphoric expressions.

While there are many different computational approaches to metaphoric language, the field remains challenging and, in some areas, relatively unexplored. The variety of meanings captured by creative metaphors poses numerous problems to natural language processing researchers, as they rely on lexical diversity and conceptual knowledge. The bulk of work in metaphor has gone to identifying metaphoric expressions or generating interpretations for them Shutova (2015); Veale et al. (2016). Whereas previous approaches focus on classification, we instead focus on generation: how can we create novel, interesting, and valid metaphoric expressions?

This task has many possible applications, including creative writing assistance, where users can employ metaphor generation to develop more interesting, persuasive writing. Lakoff and Johnson (1980) suggest that not only can metaphors capture similarity between domains, they can actually generate the similarity, allowing us to view concepts in new ways; optimistically, metaphor generation may allow us to discover new metaphoric ideas to foster understanding and growth in scientific areas. This is particularly true in the domain of education, where new metaphors can be instructive both for teachers and students Marshall (1990). Metaphors are also critical for proper interaction between humans and computational agents: humans produce metaphors easily, and natural communication with computational models will require them to be able to do the same Zhang (2008); Wallington et al. (2011).

In contrast to previous work generating novel metaphors (§2.2), we are the first to tackle metaphor paraphrase generation, and we hope our work can function as a jumping-off point for this challenging and interesting task. The task is particularly difficult for a variety of reasons. First, metaphors have the potential to be enormously creative, deviating greatly from "standard" language, which means normal language models may have difficulty in producing good metaphors. Traditional paraphrasing systems attempt to keep the sentences relatively similar, whereas we need sentences that vary substantially in order to enforce metaphor production.

This leads also to significant problems: there are countless possible metaphor paraphrases for any given utterance and there are numerous possible metaphoric mappings that can be evoked, yielding slightly different semantic connotations. Consider the following example:

  1. The company was losing money rapidly.

This sentence has numerous possible metaphoric paraphrases, evoking many different metaphors:

  2. The company’s finances were circling the drain.

  3. The company was hemorrhaging money.

  4. The business fell off of a cliff.

  5. Profits collapsed.

In 2 and 3, "money" is conceptualized as blood and water respectively, and from conceptual metaphor theory we see that this evokes the money is a liquid mapping. In 4, "finances" is conceptualized as a physical entity, and further, one that can experience harm, perhaps evoking the economic harm is physical injury mapping. In 5, the company’s profits are conceptualized as a building, evoking the frequent metaphor of social and economic constructs being conceptualized as physical constructions, in this case specifically finances are buildings.

Note that there is a seemingly endless variety of metaphoric expressions that can fairly consistently capture the same general meaning, with a wide variety of lexical variation. This makes metaphoric paraphrases extremely difficult to evaluate automatically: traditional metrics for generation (such as BLEU Papineni et al. (2002) and ROUGE Lin (2004)) rely heavily on word overlap, which is actually counterproductive for metaphoric paraphrasing: we would like our generated phrases to have less word overlap, as interesting metaphors are likely to share little lexical overlap with the original inputs. For this reason we rely on crowdsourcing, evaluating metaphoricity, fluency, and paraphrase quality.
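To make the word-overlap point concrete, the following minimal sketch scores the example paraphrases from this section against the literal input. The function `unigram_overlap` is a hypothetical, simplified stand-in for overlap-based metrics (it is neither BLEU nor ROUGE), and the tokenization is an assumption made only for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (a deliberately simple assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_overlap(reference: str, candidate: str) -> float:
    """Fraction of candidate tokens that also occur in the reference."""
    ref_tokens = set(tokenize(reference))
    cand_tokens = tokenize(candidate)
    if not cand_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)


literal = "The company was losing money rapidly."
paraphrases = [
    "The company was hemorrhaging money.",
    "The company's finances were circling the drain.",
    "The business fell off of a cliff.",
    "Profits collapsed.",
]

for sentence in paraphrases:
    # More creative paraphrases tend to share fewer tokens with the input.
    print(f"{unigram_overlap(literal, sentence):.2f}  {sentence}")
```

Under this toy measure, the paraphrase that changes only the verb scores highest, while "Profits collapsed." shares no tokens with the input at all, even though it is a valid metaphoric paraphrase.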

We approach the problem of metaphoric paraphrase generation from a variety of backgrounds, each with their own positives and negatives. First, we will consider the problem one of lexical replacement, in which we identify the important words in the literal utterance and replace them with metaphoric counterparts. This yields coherent utterances, but limits the flexibility of the output. Second, we will consider this a sequence to sequence (seq2seq) problem, and employ a novel generation technique dubbed "metaphor masking" to hide important words in the input during training and evaluation, forcing the seq2seq model to learn the appropriate contexts for metaphoric and literal words. This also requires knowledge of the key words before paraphrasing, but allows for substantially more flexibility in generation.

Our contribution is thus threefold:

  • We formalize the task of metaphor generation, elucidating the datasets and experimental setup necessary.

  • We implement a lexical replacement-based baseline, as well as a novel seq2seq architecture based on "metaphor masking".

  • We perform analysis of generated metaphors, identifying strengths and weaknesses for each method.

2 Related Work

While our task is new, it bears similarity to a variety of better known NLP benchmarks. In the metaphor community, most of the efforts are focused on identification and interpretation of metaphors. We will instead focus on our two key components, paraphrasing and generation, as they relate to metaphors.

2.1 Literal Paraphrasing

Previous work investigates paraphrasing from metaphoric utterances to literal ones with the goal of providing interpretations Mao et al. (2018); Shutova (2010). Shutova (2010) treats identification and interpretation jointly, and generates literal paraphrases for metaphoric adjective-noun phrases. Vector space models have also been employed successfully for generating literal paraphrases. Shutova et al. (2012) identify a set of candidate paraphrases based on context and word vectors, and then use a model of selectional preferences to pick the most literal paraphrase. They require no training data, and achieve promising results for unsupervised literal paraphrasing.

Similarly, Mao et al. (2018) build a metaphor identification system using word vectors, and also use it to generate paraphrases for metaphoric sentences. This is done by replacing the verbs that are identified as metaphoric with the most likely literal candidates. They use Word2Vec embeddings Mikolov et al. (2013) combined with WordNet to identify relations between literal and metaphoric lexemes. This allows for replacement of rarer, more metaphoric senses with concrete literal ones, but doesn’t provide a solution for transitioning from a literal sense to an appropriate metaphoric one. Thus their work is effective at metaphoric to literal paraphrasing, but functions only in this direction; we will restructure their algorithm for the metaphoric direction as a lexical baseline in §4.1.

2.2 Metaphor Generation

With regard to metaphor generation, most efforts have been to generate metaphors at the lexical or phrase level, using template- and heuristic-based methods. Early work in computational metaphor generation involves generating simple "A is like B" expressions, based on probabilistic relationships between words Abe et al. (2006); Terai and Nakagawa (2010). These methods are effective to a degree, but lack the flexibility necessary to instantiate natural language metaphors.

Other early approaches to metaphor generation are rooted in knowledge bases. Hervas et al. Pereira (2007) build a metaphor generation system by identifying metaphoric domains, building mappings between the source and target, and replacing appropriate references with the built metaphors. They show the difficulty of determining appropriate target domains for metaphors in context. Others use WordNet, building knowledge representations through semantic information from definitions Veale and Hao (2008).

Other works seek to generate conceptual metaphors, rather than open linguistic expressions. These approaches, designed to generate conceptual metaphor mappings such as money is a liquid, range from WordNet- and selectional preference-based methods Mason (2004), to clustering over WordNet senses Gandy et al. (2013), to using proposition databases built from syntactic relations Ovchinnikova et al. (2014). While this task is interesting and useful, particularly for doing proper reasoning from metaphoric mappings, our goal is instead to generate natural linguistic metaphors, rather than metaphoric mappings.

Word embedding approaches have been popular and effective for lexical metaphor tasks. In addition to Mao et al. and Shutova et al.’s paraphrasing work, Gagliano et al. (2016) build off of Word2Vec, using the generated vectors to identify poetic relationships between words, developing a vector-based interpretation of conceptual blends Fauconnier and Turner (1996). They identify "connector words" between concepts, allowing for the creation of linguistic metaphors that accurately capture these conceptual metaphoric mappings.

More recently there have been efforts using deep learning methods to generate metaphoric expressions more freely, using sequence-to-sequence models. Most notable is Yu and Wan (2019), who use neural models to generate metaphoric expressions in an unsupervised manner. They identify source and target verbs automatically from corpora, and use these to train a neural language model. Our work is similar: they encode both literal and metaphoric pairs and produce metaphoric outputs based on verbs, but their generation task is free. We are instead working on the more constrained task of generating specific paraphrases from literal utterances.

This is the experimental paradigm we will be following: given a literal phrase, we generate a metaphoric paraphrase that should capture the same meaning. Unlike previous work, our methods are broadly applicable to free text: we are not limited to paraphrasing individual words or phrases, but rather use deep learning models for full natural language generation, which can then freely create metaphoric paraphrases. To our knowledge, our work is the first to attempt to explicitly generate metaphoric paraphrases.

3 Data

Our goal is to generate metaphoric paraphrases for given literal phrases. Data for this task is extremely sparse: there aren’t any large scale parallel corpora containing literal and metaphoric paraphrases. Most useful is the dataset of Mohammad et al. (2016). Their dataset includes multiple parts; importantly, it contains 171 metaphoric sentences extracted from WordNet, with manually generated literal paraphrases. These are high quality annotations, and we will use this dataset for evaluation. While it was originally built for generating literal paraphrases of metaphoric utterances, it is easy enough to reverse the direction, using the literal paraphrases as input and attempting to generate metaphoric outputs.

Note that there are some discrepancies between the original usage and our intended paraphrase usage. Notably, the dataset was originally built around verbs: the authors replaced the key verbs in each metaphoric sentence to yield a more literal output. This ignores cases where the metaphoric meaning of the sentence is captured by components other than the verb:

  1. The painting seems to capture the essence of Spring.

  2. These events could fracture the balance of power.

  3. The new moon reflected back at itself from the lake’s surface.

In these examples, the verb that was replaced to make a paraphrase is in bold, while the italic phrases could also be construed as metaphoric. In particular, 3 is likely to be considered metaphoric regardless of the bolded verb, due to the poetic reflexive construction "back at itself". This means that the resulting "literal" paraphrases contain literal verbs, but the sentences themselves may still contain metaphors. This isn’t prevalent in the data and doesn’t impact the experiments, as we are only trying to generate more metaphoric output sentences from more literal inputs, but it is important to be aware that our paraphrasing task differs somewhat from the design of the original dataset.

The size of this dataset is small: 171 instances is not enough to train viable deep learning models, and large scale parallel corpora for this task don’t exist. For this reason, we will use methods that are either unsupervised, or don’t rely on parallel data, and can be developed using non-parallel corpora. The lexical replacement model is the former, requiring no training data. The metaphor masking seq2seq model uses external training data, but does not require the data to be parallel. We use a masking procedure to generate artificial sentence pairs for seq2seq training, allowing the model to function using non-parallel datasets.

4 Methods

We propose two different models for metaphoric paraphrase generation. First, we implement a lexical replacement baseline, based on that of Mao et al. (2018). Second, we develop a novel seq2seq framework that masks metaphoric words to better learn how to generate metaphoric outputs.

4.1 Lexical Replacement Baseline

Figure 1: Lexical Replacement Baseline

Metaphors often hinge on verbs. This intuition has fueled many identification and interpretation projects, including the verb-specific identification track of the metaphor detection shared task Leong et al. (2018). We implement a lexical replacement baseline that takes the literal verb and replaces it with a more metaphoric counterpart. This is based on the work of Mao et al. (2018), who employ this strategy in the other direction: they take metaphoric sentences and replace the metaphoric verbs with literal ones.

We implement this algorithm for metaphor generation by reversing their candidate selection. For an overview of the process, see Figure 1. (a) We begin with a literal sentence with a marked verb. (b) We use the WordNet sense hierarchy to find words related to the input word, which will then be "candidates" to replace it, but rather than searching "up" the hierarchy for hypernyms, we search "down" the hierarchy for troponyms: more specific verbs (in bold). We believe that in the lexical replacement task, replacement with more specific verbs is likely to yield more metaphoric expressions, as these specific verbs require specific contexts to be understood literally. When placed in an unfamiliar context, they adopt metaphoric meanings via a coercion-like process Steedman and Moens (1988). (c) We follow their algorithm for picking the best candidate: we take the mean output embedding of the context (based on the Google News Word2Vec vectors Mikolov et al. (2013)), and select the candidate word that best matches that mean by way of cosine similarity. (d) This yields the (more specific) word that best fits the context, generating a more metaphoric expression.

This method, then, takes as input a sentence with a known literal verb, generates possible metaphoric candidates to replace that verb, and chooses the best fitting option. It requires no external training data, but relies on WordNet, and is restricted to only generating metaphoric verbs.
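The candidate selection step (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the three-dimensional vectors in `TOY_EMBEDDINGS` are invented stand-ins for Word2Vec vectors, the candidate list is a hand-written stand-in for troponyms retrieved from WordNet, and the function names are hypothetical.

```python
import math

# Invented toy vectors standing in for pretrained word embeddings (assumption).
TOY_EMBEDDINGS = {
    "life": [0.7, 0.3, 0.1],
    "camp": [0.9, 0.1, 0.0],
    "him": [0.5, 0.5, 0.0],
    "drain": [0.8, 0.2, 0.1],
    "fade": [0.1, 0.9, 0.2],
    "melt": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pick_replacement(context_words, candidates, embeddings):
    """Return the candidate closest to the mean context vector, or None."""
    context_vectors = [embeddings[w] for w in context_words if w in embeddings]
    known = [c for c in candidates if c in embeddings]
    if not context_vectors or not known:
        return None  # no usable context or no candidate with a vector
    center = mean_vector(context_vectors)
    return max(known, key=lambda c: cosine(embeddings[c], center))


# Context of "Life in the camp weakened him"; the candidates are hypothetical
# more-specific verbs that a WordNet lookup might return.
best = pick_replacement(["life", "camp", "him"], ["drain", "fade", "melt"], TOY_EMBEDDINGS)
print(best)
```

Returning `None` when no candidate has a vector mirrors the failure case noted later in the analysis, where no candidate word is found and the verb slot is left empty.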

4.2 Metaphor masking model

Figure 2: Metaphor masking for the seq2seq model.

Sequence to sequence (seq2seq) learning paradigms are vital for a variety of NLP applications: machine translation, style transfer, natural language generation, and more Chen et al. (2018); Mueller et al. (2017); Dušek et al. (2020). These methods rely on encoding input sentences into vectors, and then applying decoders to generate some output from that input vector. They are often trained on parallel corpora (as in the case of machine translation), with the model learning to output some text based on the vector encoded from the input.

Seq2seq models have been used to generate metaphoric text Yu and Wan (2019), but here we are focused on paraphrase generation. In order to apply seq2seq models to this task, we develop a new framework dubbed "metaphor masking". In this framework, we replace metaphoric words in the input texts with metaphor masks (unique "metaphor" tokens), hiding the lexical item. This creates artificial parallel training data: the input is the masked text, with the hidden metaphorical word, and the output is the original text. Through this learning paradigm, the model learns that it needs to generate metaphoric words when it encounters the metaphor mask token. At test time, we provide the model with the literal input, mask the verb, and the model produces an output conditioned on the metaphor masking training. An overview of the process is shown in Figure 2.
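The masking step can be sketched as below. This is a minimal illustration, assuming whitespace-tokenized input and a known verb position; the token spelling `<metaphor>` and the function names are hypothetical choices made for the example.

```python
MASK = "<metaphor>"  # hypothetical spelling of the metaphor mask token


def make_training_pair(tokens, verb_index):
    """Build an artificial parallel pair from one metaphoric sentence.

    The source hides the metaphoric verb behind the mask token;
    the target is the original, unmasked sentence.
    """
    source = list(tokens)
    source[verb_index] = MASK
    return source, list(tokens)


def make_test_input(tokens, verb_index):
    """At test time, mask the verb of the literal input in the same way."""
    masked = list(tokens)
    masked[verb_index] = MASK
    return masked


metaphoric = "The company was hemorrhaging money".split()
source, target = make_training_pair(metaphoric, verb_index=3)
print(" ".join(source), "->", " ".join(target))

literal = "The company was losing money".split()
print(" ".join(make_test_input(literal, verb_index=3)))
```

Because the masked literal input looks exactly like a masked metaphoric training source, the model has no access to the literal verb at test time and must fill the slot from context alone.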

This procedure requires additional annotated data to generate the parallel inputs for training. For this, we employ a number of available metaphor corpora: the VUAMC dataset Steen et al. (2010), another partition of the Mohammad et al. (2016) dataset that contains individual sentences labelled as literal or metaphoric (note that our test data also comes from this source: we have removed all test examples from this dataset for all training), the TroFi dataset Birke and Sarkar (2006), and the additional data collected by Stowe and Palmer (2018). Each of these datasets contains annotations of metaphoric verbs, although the annotation schemas differ, so we expect some variety and noise in the model. Combining these datasets yields 35,415 verbs, of which 11,593 are metaphoric.

Our final goal is to generate short metaphoric utterances based on the Mohammad et al. (2016) dataset. In order to match this, we trim our training data around the verbs: each verb is treated as a separate training instance, along with 7 words of context on each side. We use all 35,415 sentences as input to the model: non-metaphoric sentences are left as-is, with the input mirroring the output. Metaphoric data is masked during training, replacing the input verb with a metaphor mask and using the original as output. This yields 35,415 pairs for training, 11,593 of which contain metaphor masks. We hypothesize that using both literal and metaphoric datasets will allow the model to better distinguish between sentences with a metaphor mask and those without, generating stronger metaphoric outputs. We use a transformer architecture Vaswani et al. (2017) with 6 layers in the encoder and decoder. The model uses 8 attention heads to learn different attention distributions, whose outputs are concatenated. The hidden size for encoder and decoder is 512. We use per-token normalization, with a vocabulary size of 30K. The model was trained using the Adam optimizer, with an initial learning rate of 0.5.
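The trimming and pairing described above can be sketched as follows. This is a minimal illustration, assuming pre-tokenized sentences with an annotated verb index and a literal/metaphoric label; the function names and the mask spelling are hypothetical.

```python
def trim_context(tokens, verb_index, window=7):
    """Keep the verb plus up to `window` tokens on each side.

    Returns the trimmed tokens and the verb's index within them.
    """
    start = max(0, verb_index - window)
    end = min(len(tokens), verb_index + window + 1)
    return tokens[start:end], verb_index - start


def build_instance(tokens, verb_index, is_metaphoric, window=7, mask="<metaphor>"):
    """Create one (source, target) training pair for a single annotated verb.

    Literal instances mirror the input; metaphoric instances hide the verb.
    """
    target, index = trim_context(tokens, verb_index, window)
    source = list(target)
    if is_metaphoric:
        source[index] = mask
    return source, list(target)


# A 20-token dummy sentence with the annotated verb at position 10.
sentence = [f"w{i}" for i in range(20)]
source, target = build_instance(sentence, verb_index=10, is_metaphoric=True)
print(len(source), source[7], target[7])
```

Treating every annotated verb as its own instance means a sentence with several annotated verbs contributes several overlapping training pairs.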

5 Crowdsourced Evaluation

Approaches to evaluating metaphoric and literal sentences using crowdsourcing include evaluating hand-generated sentences for metaphoricity Mohammad et al. (2016); Bizzoni and Lappin (2018), evaluating the output of automatic metaphor generation systems Yu and Wan (2019); Veale (2016), and evaluating novelty in verbal metaphors Do Dinh et al. (2018). Focusing specifically on metaphor evaluation, Miyazawa and Miyao (2017) highlight the importance of effective evaluation. They use four key metrics: metaphoricity, novelty, comprehensibility, and overall evaluation, to measure the success of metaphor generation in Japanese.

We will rely on two components that are typical of metaphor generation. First, we evaluate metaphoricity, with the goal of producing coherent and interesting metaphors, rather than conventional, common language. Second, we evaluate fluency, attempting to capture the syntactic viability of the generated output. Additionally, as we are attempting to generate paraphrases, we also include crowdsourced evaluation of paraphrase quality.

Annotators were thus asked to rate sentences with regard to three different factors: metaphoricity, fluency, and paraphrase quality. Each sentence was rated by five separate workers on a Likert scale from 1 to 4. (We chose this scale rather than the 1 to 5 used by Yu and Wan (2019) to encourage workers to avoid "neutral" responses of ’3’.) We filtered out results of users who failed test sentences and those who only completed 1 task, aiming to keep results from consistent and knowledgeable workers.
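The filtering and averaging of ratings can be sketched as below. This is a minimal illustration under assumptions: ratings are represented as hypothetical `(worker_id, sentence_id, score)` tuples, "completed 1 task" is interpreted as a worker contributing a single rating, and the set of workers who failed test sentences is supplied directly.

```python
from collections import Counter
from statistics import mean


def filter_workers(ratings, failed_workers):
    """Drop ratings from workers who failed test sentences or rated only once."""
    task_counts = Counter(worker for worker, _, _ in ratings)
    return [
        rating
        for rating in ratings
        if rating[0] not in failed_workers and task_counts[rating[0]] > 1
    ]


def mean_scores(ratings):
    """Average the 1-to-4 Likert scores for each sentence."""
    by_sentence = {}
    for _, sentence_id, score in ratings:
        if not 1 <= score <= 4:
            raise ValueError(f"score outside the 1-4 scale: {score}")
        by_sentence.setdefault(sentence_id, []).append(score)
    return {sid: mean(scores) for sid, scores in by_sentence.items()}


# Invented toy ratings, for illustration only.
ratings = [
    ("a", "s1", 4), ("a", "s2", 3),
    ("b", "s1", 2), ("b", "s2", 2),  # worker b failed a test sentence
    ("c", "s1", 1),                  # worker c completed only one task
    ("d", "s1", 3), ("d", "s2", 2),
]
kept = filter_workers(ratings, failed_workers={"b"})
print(mean_scores(kept))
```

Filtering can leave a sentence with fewer than five ratings, which would explain fractional means such as 2.75 or 2.33 in a table of per-sentence scores.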

Fluency judgments were relatively simple. For this, we asked annotators to rate the sentences based on how fluent (from incomprehensible to fluent English) a sentence is.

For paraphrase judgments, we used two different setups. We have access to three components: the original literal input; the original (gold) metaphoric paraphrase of that input; and the generated metaphoric paraphrase of that input. We first evaluate gold-to-generated paraphrasing, comparing the generated metaphoric outputs with the gold metaphors from the test data, which allows us to compare the system output to the gold data. We also experimented with comparing generated paraphrases to the literal inputs, as these should also be valid paraphrases. This is our literal-to-generated evaluation, comparing the resulting paraphrases with the original literal inputs. For each, we presented the worker with a gold input (either the literal input or the gold metaphor) and the generated output, and asked them how good a paraphrase the output was, from "completely unrelated" to "strong paraphrase".

Metaphor evaluation is more difficult, and we attempt to follow previous crowdsourcing approaches for metaphor rating. Based on the schema from Do Dinh et al. (2018) and Yu and Wan (2019), we provided basic definitions of metaphoricity for crowdworkers, allowing them to use their intuitions about what to consider metaphoric. We found in a pilot study that providing longer, more complex descriptions of metaphoricity increased the difficulty of the task, so we chose to keep the definition simple. (To facilitate future work, full descriptions of the task, parameters, payments, and guidelines, along with the crowdsourced results and codebase, will be released upon publication.)

Our crowdsourcing setup was repeated for three outputs: the gold metaphors of Mohammad et al. (2016), which also contain hand-crafted literal paraphrases; the output of the lexical replacement baseline; and the output of our experimental system, sentences generated via seq2seq with metaphor masking.

6 Analysis

Figure 3: Evaluation of each model via crowdsourcing.

| # | Source | Text | Met | Flu | PP |
|---|--------|------|-----|-----|----|
| 1 | Input | He was lavished with praise | | | |
| | Gold | He was showered with praise | 3 | 4 | 4 |
| | LexRep | He was lavished with praise | 2.6 | 3.8 | 3.8 |
| | MM | He was pleading with impishly | 2.8 | 2.2 | 1 |
| 2 | Input | The moon reflected back at itself from the lake’s surface | | | |
| | Gold | The moon glared back at itself from the lake’s surface | 3.3 | 4 | 3 |
| | LexRep | The moon sparkled back at itself from the lake’s surface | 3.2 | 4 | 3.6 |
| | MM | The DMZ falls back at itself from the glittering surface | 3.25 | 2.6 | 1.4 |
| 3 | Input | She appears among royalty | | | |
| | Gold | She circulates among royalty | 2.4 | 4 | 3.6 |
| | LexRep | She manifested among royalty | 2.75 | 3.8 | 3.4 |
| | MM | She clawed among doorknob | 2.8 | 2.6 | 1.6 |
| 4 | Input | Life in the camp weakened him | | | |
| | Gold | Life in the camp drained him | 2.75 | 3.75 | 3.6 |
| | LexRep | Life in the camp emasculated him | 3.3 | 3.8 | 2.8 |
| | MM | Life in the camp miss him | 1.8 | 2.2 | 3 |

Table 1: Samples with the highest scores for LexRep

The mean scores for the crowdsourced evaluations for each system are shown in Figure 3. (Note that the gold-to-generated comparison is unavailable for the gold data: there is no generated output to compare to.) The sentences generated by lexical replacement bear closer resemblance to the literal inputs: they have lower metaphoricity scores and higher paraphrase rankings. This is expected, as the change to the input is only a single word. The metaphor masking model shows more similarity to the metaphoric outputs: its sentences have low paraphrase similarity when compared with the literal inputs, but better paraphrase scores when compared with the gold metaphors, and high metaphoricity. The fact that the metaphor masking model produces metaphoricity scores on par with the gold standard metaphors, and is consistent with the lexical model in terms of fluency and paraphrase quality, shows that this method is very effective at generating metaphoric sentences. The quality of the paraphrases is still relatively low, averaging 1.3 points below the gold paraphrases, but this is an important first step for this task.

In order to understand how each of these models performed, we do qualitative analysis over the results. We examined the results of each model: what does the lexical replacement baseline do well, and what benefits can we gain from employing our metaphor masking model?

6.1 Lexical Replacement

| # | Source | Text | Met | Flu | PP |
|---|--------|------|-----|-----|----|
| 5 | Input | She was saddened by his refusal of her invitation | | | |
| | Gold | She was crushed by his refusal of her invitation | 2.4 | 4 | 4 |
| | LexRep | She was saddens by his refusal of her invitation | 1.25 | 2.8 | 3.25 |
| | MM | She besieged by his refusal of her invitation | 3.2 | 3.8 | 3.6 |
| 6 | Input | The critics overpraised this broadway production | | | |
| | Gold | The critics puffed up this broadway production | 3.2 | 3.75 | 3.75 |
| | LexRep | The critics this broadway production | 2.2 | 2.25 | 2.2 |
| | MM | The critics hailed this airborne production | 3.2 | 3.75 | 3 |
| 7 | Input | The company dismissed him after many years of service | | | |
| | Gold | The company dumped him after many years of service | 2.2 | 4 | 3.75 |
| | LexRep | The company scoff him after many years of service | 1.5 | 3 | 2.4 |
| | MM | The company downsized him after many years of service | 2.6 | 4 | 3.5 |
| 8 | Input | This story will intrigue you | | | |
| | Gold | This story will grab you | 3.2 | 3.75 | 4 |
| | LexRep | This story will schemed you | 2.8 | 2.33 | 1.8 |
| | MM | This story will help you | 3.6 | 4 | 2.4 |

Table 2: Samples with the highest scores for Metaphor Masking

Table 1 shows the best lexical replacement outputs, based on their improvement over the metaphor masking model. The replacement model performs well in fluency and paraphrase quality, particularly because it copies most of the input, only replacing a single word. In some cases, the "best fit" candidate is the original input word. These perform exceedingly well in fluency and paraphrase quality, as they match the input sentence, but understandably lack metaphoricity (1). However, in many cases the model makes novel and metaphoric word choices, indicating the validity of this approach for metaphor generation (2-4).

This baseline has numerous theoretical advantages and disadvantages. It yields output sentences that are very similar to the inputs, as we are only replacing a single word. This can be beneficial, as the outputs will be necessarily syntactically and semantically coherent except for the replaced word, but also severely restricts the creativity and novelty of the output.

This method also has practical downsides. First, it requires knowledge of the target verb. Our data has the target verb in the literal and metaphoric paraphrases annotated, but sometimes these verbs contain particles (such as "start on" and "use up"), which make lexical replacement difficult. Just replacing the verb and maintaining the particle sometimes yields good results ("I [started] on the problem" → "I [fell] on the problem"), while replacing both verb and particle can also be correct ("they [used up] their food" → "they [demolished] their food"). Second, if we apply this method to unseen data, we will first need to identify the target verbs, making it more reliant on external knowledge and prone to error. Finally, it is dependent on WordNet, which restricts the power and flexibility with regard to creativity.

While the above examples highlight the strength of the lexical replacement baseline, they also show the weaknesses of the metaphor masking approach. Due to the free nature of generation, we often see words in the generated output that bear little to no relation to the original input ("impishly" in (1) and "DMZ" in (2)). These kinds of errors illustrate why the more constrained lexical replacement model tends to yield better paraphrases.

6.2 Metaphor Masking

The metaphor masking model tends to generate more metaphoric sentences with similar fluency, although these are often not valid paraphrases of the original input. Table 2 shows examples for which the metaphor masking model performs best in comparison with the lexical replacement model.

| # | Source | Text | Met | Flu | PP |
|---|--------|------|-----|-----|----|
| 9 | Input | I can’t cope with it anymore | | | |
| | Gold | I can’t hack it anymore | 1.5 | 3 | 3.75 |
| | LexRep | I can’t improvising with it anymore | 1.8 | 2.3 | 2 |
| | MM | I chemotherapy stuck with it gathers | 2.25 | 2 | 1.5 |
| 10 | Input | Actions communicate louder than words | | | |
| | Gold | Actions talk louder than words | 3.2 | 3.5 | 3.75 |
| | LexRep | Actions conveys louder than words | 1.75 | 3.4 | 3 |
| | MM | Seurat culminated pointillism than words | 2.5 | 2 | 1.2 |
| 11 | Input | Which horse are you betting on | | | |
| | Gold | Which horse are you backing | 2 | 4 | 4 |
| | LexRep | Which horse are you bet on | 1.4 | 2.6 | 3.8 |
| | MM | Gillette horse are you going on | 3 | 1.4 | 1.6 |
| 12 | Input | A weather vane tops the building | | | |
| | Gold | A weather vane crowns the building | 1.75 | 3.8 | 3.8 |
| | LexRep | A weather vane clears the building | 2.25 | 4 | 2.4 |
| | MM | A weather verbs raided the building | 3.4 | 1.8 | 1.2 |
| 13 | Input | It occurred to him that she had betrayed him | | | |
| | Gold | It dawned on him that she had betrayed him | 2.2 | 4 | 3.7 |
| | LexRep | It intervened to him that she had betrayed him | 2 | 3 | 2.8 |
| | MM | It comes to him that she had greeted him | 2.2 | 2.8 | 1.5 |

Table 3: Sentences for which both LexRep and MM models performed poorly.

Metaphor masking tends to produce fairly consistent outputs, which are syntactically regular. Hiding the metaphoric word causes the model to make a prediction, yielding varied outputs, and these are more metaphoric than their inputs.

This model is complementary to the lexical replacement model: as it is based on a sequence-to-sequence transformer model, it is relatively free in its generation. It frequently generates words not in the original input, which leads to more creative, metaphoric outputs. Examples like 5 and 6 show the power of the metaphor masking model: it is capable of generating a wider variety of words that yield better metaphors. As the model isn’t constrained to a particular resource, it has more power with regard to lexical choice. Example 6 shows another benefit with regard to metaphoricity: the model can generate multiple words not present in the input ("hailed", "airborne"), yielding more creative utterances, although these are often worse paraphrases. (Footnote 5: Note that the lexical replacement model fails for this sentence, as no candidate word was found for "puffed up".)

While the model often generates strong metaphors, there are also cases where the model predicts a word for the masked metaphoric word that is extremely literal (7 and 8), which yields sentences that are fluent and good paraphrases but lacking in metaphoricity. This deficit is due to the lack of information for the model about the metaphoric class. As our dataset is limited, the model doesn’t have enough signal to fully distinguish what goes into a metaphoric gap. More data (both metaphoric and overtly literal) should help the model generate more surprising and metaphoric outputs.

We can also see from Table 2 the weaknesses of the lexical replacement baseline. As the candidates are generated with diverse syntactic endings, they often exhibit disagreement with their arguments (5, 7, 8). Additionally, it doesn’t always make metaphoric predictions: in 5, the output matches the input verb, yielding an extremely literal paraphrase.

6.3 Consistent Errors

The sentences that confound both of our models tend to be idiomatic (Table 3). These are cases where the "metaphoric" meaning of the sentence isn’t captured explicitly by the verb, but rather spans the entire utterance. For example, in 10, the communication metaphor is present regardless of the verb used: the literal verb "communicate" may be less metaphoric as a verb than the gold "talk", but the metaphor of the sentence persists. This causes difficulties for our systems, which require metaphoricity to be focused on the verb.

The lexical replacement model often makes lexical choices that either don’t match the original meaning (9), or don’t maintain any metaphoricity (13). As WordNet is a finite resource, the number of candidate replacement verbs is often small, and this restricts the system from finding truly novel metaphoric expressions. It also may be the case that finding the "best fit" word from output vectors is actually counterproductive: Mao et al. (2018) use this procedure for finding the best literal paraphrases, and although we alter their approach to identify more metaphoric candidates, the model might still prefer the most literal option.

A possible solution left for future work is to select the "worst fit" from the candidates: the word whose vector is least likely to match the context. This would ensure contrast between domains, but in preliminary studies it led to the model invariably picking syntactically incomprehensible or semantically incoherent choices. We believe that better limitations on candidate selection, enforcing syntactic constraints while allowing a wider variety of domains, will allow us to implement the "worst fit" approach more effectively, with the potential to generate much more interesting metaphoric replacements.
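The contrast between "best fit" and "worst fit" selection can be sketched as a ranking of candidates by cosine similarity to an averaged context vector. This is a simplified illustration rather than our implementation: the two-dimensional vectors in `TOY_VECTORS` are invented for the example, a single embedding table stands in for the input and output vectors of a trained model, and the candidate list is assumed to have been retrieved beforehand (for example from WordNet).

```python
import math

# Hypothetical toy embeddings; real vectors would come from a trained model.
TOY_VECTORS = {
    "idea": [1.0, 0.0],
    "other": [0.8, 0.2],
    "matches": [0.9, 0.1],
    "harmonizes": [0.5, 0.5],
    "evaporates": [-0.2, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_candidates(candidates, context_words, vectors):
    """Return (similarity, candidate) pairs, most context-like first."""
    context = mean_vector([vectors[w] for w in context_words])
    scored = [(cosine(vectors[c], context), c) for c in candidates]
    return sorted(scored, reverse=True)


ranked = rank_candidates(["matches", "harmonizes", "evaporates"],
                         ["idea", "other"], TOY_VECTORS)
best_fit = ranked[0][1]    # closest to the context, likely the most literal
worst_fit = ranked[-1][1]  # furthest from the context, maximal contrast
print(best_fit, worst_fit)
```

In this toy setting the worst fit is the candidate most distant from the context, which mirrors the problem described above: without syntactic or semantic constraints, maximal contrast tends to favor incoherent choices over usable metaphors.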

The metaphor masking model struggles with short sentences: it often generates words that don’t fit the context, yielding unparseable expressions (see 9-12). The relatively idiomatic nature of these expressions also hinders the model’s performance: as the metaphoricity isn’t focused solely on the verb, the model is unable to make accurate predictions about the masked token.

A possible solution here is to expand the masking to other parts of speech, or even to phrases. This would allow the model to generate more complex metaphoric expressions. Additionally, if our seq2seq model can accurately pick up on masked metaphor tasks, this gives us both flexibility and control over metaphor generation: we will be able to choose which parts of utterances we would like to be metaphoric, allowing for much more powerful generation systems.
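Extending masking from single verbs to phrases could look like the following minimal sketch. The mask string `<mask>`, the function `mask_spans`, and the choice to collapse each span into a single mask token are assumptions made for illustration; they are not a description of the preprocessing used for our model.

```python
MASK = "<mask>"  # hypothetical placeholder; the actual mask symbol may differ


def mask_spans(tokens, spans, mask_token=MASK):
    """Replace each (start, end) span, end exclusive, with one mask token.

    A span that begins inside an earlier span is skipped, because the scan
    jumps past the end of every span it masks.
    """
    ends = {}
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span: ({start}, {end})")
        ends[start] = end
    out, i = [], 0
    while i < len(tokens):
        if i in ends:
            out.append(mask_token)
            i = ends[i]  # jump past the masked span
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = "It dawned on him that she had betrayed him".split()
verb_only = mask_spans(tokens, [(1, 2)])    # mask just the verb
verb_phrase = mask_spans(tokens, [(1, 3)])  # mask the verb and its particle
print(" ".join(verb_only))    # It <mask> on him that she had betrayed him
print(" ".join(verb_phrase))  # It <mask> him that she had betrayed him
```

Passing several spans at once would let a user mark more than one region of an utterance for metaphoric generation.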

One consistent problem in this process is the difficulty of keeping annotation categories independent. We find that generated sentences that are incoherent syntactically also tend to be considered bad paraphrases (Spearman correlation of .559, ). This is likely because, if a sentence is difficult to parse syntactically, it is more difficult to assess its meaning, which makes judgments of semantic similarity difficult. Additionally, metaphoricity ratings correlate negatively, and to a lesser degree, with paraphrase quality (-.112, ). Strong metaphoric paraphrases likely add additional meaning or de-emphasize some of the original literal meaning, making their paraphrase quality lower. Interestingly, fluency and metaphor ratings did not significantly correlate, indicating that disfluent sentences were neither more nor less metaphoric than their fluent counterparts.
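For readers unfamiliar with the statistic, the following sketch shows how a Spearman correlation between two rating categories can be computed: ratings are converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is taken. The helper names are hypothetical. The input lists are the fluency and paraphrase scores of the ten system outputs in Table 3 only, so the result is an illustration on a small subset and is not expected to reproduce the coefficient reported above.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# LexRep and MM rows of Table 3, examples 9 to 13, in table order.
fluency = [2.3, 2, 3.4, 2, 2.6, 1.4, 4, 1.8, 3, 2.8]
paraphrase = [2, 1.5, 3, 1.2, 3.8, 1.6, 2.4, 1.2, 2.8, 1.5]
rho = spearman(fluency, paraphrase)
print(round(rho, 3))
```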

It is important to note the variety of possible generated expressions that are considered good. Different generated metaphors can even maintain some of the original literal meaning while highlighting different aspects, as good novel metaphors are known to do. Consider the generated example "This idea harmonizes up with the other one", intended to paraphrase "This idea matches up with the other one". This captures in many senses the original input of "matches up", but also provides something more: not only do the ideas go together, but perhaps they also improve upon one another. Because of the variety of acceptable outputs, automatic generation of metaphoric paraphrases is exceedingly difficult. For this reason, we present an automatic metric for evaluating metaphoric paraphrases.

7 Conclusions and Future Work

We have established a new task for natural language generation: the creation of metaphoric paraphrases for literal sentences. We explore two possible models for accomplishing this task: an adapted lexical replacement baseline model that relies on WordNet to find candidate verbs and the output vectors of word embeddings to match their contexts, and a seq2seq transformer-based model that masks metaphoric verbs to encourage generation of metaphoric outputs. Crowdsourced evaluations show that both models are successful at different aspects of the task: the lexical replacement baseline yields consistent paraphrases that lack metaphoricity, while the metaphor masking model yields extremely metaphoric outputs that often don’t accurately paraphrase the input.

Future work in this area is hindered by the lack of available data. In order to improve these methods, we need better datasets. This couples with the problem of evaluation: standard evaluation metrics for language generation are often misleading with regard to metaphors. Better datasets would allow for the development of better metrics for evaluation, and in turn better evaluation metrics may allow us to build better systems for automatically identifying metaphoric paraphrases, allowing us to build better corpora.

Another possible direction to explore is the incorporation of knowledge representations. Our lexical replacement method relies heavily on WordNet, and can only make local changes based on a small number of candidate verbs. Our metaphor masking model is relatively free, but neither model contains any knowledge of the metaphors in use.

To truly be able to generate metaphors based on actual metaphoric mappings, we need to incorporate some knowledge of the source and target domains involved. This could involve leveraging FrameNet Baker et al. (1998) or MetaNet Dodge et al. (2015), developing a novel metaphor knowledge base, or learning domain knowledge in an unsupervised fashion. Developing metaphor knowledge bases that capture relations between domains in a usable way will not only allow for better metaphor generation, but also better reasoning and understanding of texts that make use of more complicated metaphoric expressions. However the ordeal is undertaken, generation of coherent metaphors will inevitably require better representation of the interaction between the domains evoked.


  • K. Abe, S. Kayo, and M. Nakagawa (2006) A computational model of the metaphor generation process. In Proceedings of the 28th Annual Meeting of the Cognitive Science Society, pp. 937–942. Cited by: §2.2.
  • C. F. Baker, C.J. Fillmore, and J.B. Lowe (1998) The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 86–90. External Links: Link Cited by: §7.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72. External Links: Link Cited by: §1.
  • J. Birke and A. Sarkar (2006) A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, pp. 329–336. External Links: Link Cited by: §4.2.
  • Y. Bizzoni and S. Lappin (2018) Predicting human metaphor paraphrase judgments with deep neural networks. In Proceedings of the Workshop on Figurative Language Processing, New Orleans, Louisiana, pp. 45–55. External Links: Link, Document Cited by: §5.
  • M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, Y. Wu, and M. Hughes (2018) The best of both worlds: combining recent advances in neural machine translation. cs.CL/1804.09849v2. External Links: 1804.09849, Link Cited by: §4.2.
  • E. Do Dinh, H. Wieland, and I. Gurevych (2018) Weeding out conventionalized metaphors: a corpus of novel metaphor annotations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1412–1424. External Links: Link, Document Cited by: §5, §5.
  • E. Dodge, J. Hong, and E. Stickles (2015) MetaNet: deep semantic automatic metaphor analysis. In Proceedings of the Third Workshop on Metaphor in NLP, Denver, Colorado, pp. 40–49. External Links: Link Cited by: §7.
  • O. Dušek, J. Novikova, and V. Rieser (2020) Evaluating the state-of-the-art of end-to-end natural language generation: the E2E NLG challenge. Computer Speech & Language 59, pp. 123–156. External Links: ISSN 0885-2308, Document, Link Cited by: §4.2.
  • G. Fauconnier and M. Turner (1996) Blending as a central process of grammar. In Conceptual Structure, Discourse, and Language, A. Goldberg (Ed.), Cited by: §2.2.
  • A. Gagliano, E. Paul, K. Booten, and M. A. Hearst (2016) Intersecting word vectors to take figurative language to new heights. In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, San Diego, California, USA, pp. 20–31. External Links: Link, Document Cited by: §2.2.
  • L. Gandy, N. Allan, M. Atallah, O. Frieder, N. Howard, S. Kanareykin, M. Koppel, M. Last, Y. Neuman, and S. Argamon (2013) Automatic identification of conceptual metaphors with limited knowledge. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, Bellevue, Washington, pp. 328–334. External Links: Link Cited by: §2.2.
  • G. Lakoff and M. Johnson (1980) Metaphors we live by. University of Chicago Press, Chicago. Cited by: §1, §1.
  • G. Lakoff (1993) The contemporary theory of metaphor. In Metaphor and Thought, A. Ortony (Ed.), pp. 202–251. Cited by: §1.
  • C. W. (Ben) Leong, B. Beigman Klebanov, and E. Shutova (2018) A report on the 2018 VUA metaphor detection shared task. In Proceedings of the Workshop on Figurative Language Processing, New Orleans, Louisiana, pp. 56–66. External Links: Link, Document Cited by: §4.1.
  • R. Mao, C. Lin, and F. Guerin (2018) Word embedding and WordNet based metaphor identification and interpretation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 1222–1231. External Links: Link, Document Cited by: §2.1, §2.1, §4.1, §4, §6.3.
  • H. H. Marshall (1990) This issue: metaphors we learn by. Theory Into Practice 29 (2), pp. 70–70. External Links: Document, Link Cited by: §1.
  • Z. J. Mason (2004) CorMet: a computational, corpus-based conventional metaphor extraction system. Computational Linguistics 30 (1), pp. 23–44. External Links: Link, Document Cited by: §2.2.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781. External Links: Link Cited by: §2.1, §4.1.
  • A. Miyazawa and Y. Miyao (2017) Evaluation metrics for automatically generated metaphorical expressions. In The 12th International Conference on Computational Semantics, Montpellier, France. External Links: Link Cited by: §5.
  • S. Mohammad, E. Shutova, and P. Turney (2016) Metaphor as a medium for emotion: an empirical study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany, pp. 23–33. External Links: Link, Document Cited by: §3, §4.2, §4.2, §5, §5.
  • J. Mueller, D. Gifford, and T. Jaakkola (2017) Sequence to better sequence: continuous revision of combinatorial structures. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Sydney, Australia, pp. 2536–2544. External Links: Link Cited by: §4.2.
  • E. Ovchinnikova, V. Zaytsev, S. Wertheim, and R. Israel (2014) Generating conceptual metaphors from proposition stores. cs.CL/1409.7619. External Links: Link Cited by: §2.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp. 311–318. External Links: Link, Document Cited by: §1.
  • F. C. Pereira (2007) Enrichment of automatically generated texts using metaphor. In Proceedings of the Sixth Mexican International Conference on Artificial Intelligence, Aguascalientes, Mexico, pp. 944–954. Cited by: §2.2.
  • E. Shutova, T. Van de Cruys, and A. Korhonen (2012) Unsupervised metaphor paraphrasing using a vector space model. In Proceedings of the 24th International Conference on Computational Linguistics, Mumbai, India, pp. 1121–1130. External Links: Link Cited by: §2.1.
  • E. Shutova (2010) Automatic Metaphor Interpretation as a Paraphrasing Task. In The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, pp. 1029–1037. External Links: Link Cited by: §2.1.
  • E. Shutova (2015) Design and Evaluation of Metaphor Processing Systems. Computational Linguistics 41, pp. 579–623. Cited by: §1.
  • M. Steedman and M. Moens (1988) Temporal ontology and temporal reference. Computational Linguistics 2 (14), pp. 15–28. External Links: Link Cited by: §4.1.
  • G.J. Steen, A.G. Dorst, J.B. Herrmann, A.A. Kaal, T. Krennmayr, and T. Pasma (2010) A method for linguistic metaphor identification: from MIP to MIPVU. Converging Evidence in Language and Communication Research, John Benjamins (English). External Links: ISBN 9789027239037 Cited by: §4.2.
  • K. Stowe and M. Palmer (2018) Leveraging syntactic constructions for metaphor processing. In Workshop on Figurative Language Processing, New Orleans, Louisiana, pp. 17–26. External Links: Link Cited by: §4.2.
  • A. Terai and M. Nakagawa (2010) A computational system of metaphor generation with evaluation mechanism. In International Conference on Artificial Neural Networks, Thessaloniki, Greece, pp. 142–147. External Links: Link Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In 31st Conference on Neural Information Processing Systems, Long Beach, California, pp. 5998–6008. External Links: Link Cited by: §4.2.
  • T. Veale and Y. Hao (2008) A fluid knowledge representation for understanding and generating creative metaphors. In Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK, pp. 945–952. External Links: Link Cited by: §2.2.
  • T. Veale, E. Shutova, and B. B. Klebanov (2016) Metaphor: a computational perspective. Synthesis Lectures on Human Language Technologies 9 (1), pp. 1–160. External Links: Document, Link Cited by: §1.
  • T. Veale (2016) Round up the usual suspects: knowledge-based metaphor generation. In Proceedings of the Fourth Workshop on Metaphor in NLP, San Diego, California, pp. 34–41. External Links: Link, Document Cited by: §5.
  • A. Wallington, R. Agerri, J. Barnden, M. Lee, and T. Rumbell (2011) Affect transfer by metaphor for an intelligent conversational agent. Affective computing and sentiment analysis. Emotion, metaphor and terminology, Vol. 45, pp. 53–66. External Links: Document Cited by: §1.
  • Z. Yu and X. Wan (2019) How to avoid sentences spelling boring? Towards a neural approach to unsupervised metaphor generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 861–871. External Links: Link, Document Cited by: §2.2, §4.2, §5, §5, footnote 2.
  • L. Zhang (2008) Metaphorical affect sensing in an intelligent conversational agent. In Proceedings of the Fifth International Conference on Advances in Computer Entertainment Technology, Yokohama, Japan, pp. 100–106. External Links: Document Cited by: §1.