We present end-to-end neural models for detecting metaphorical word use in context. We show that relatively standard BiLSTM models which operate on complete sentences work well in this setting, in comparison to previous work that used more restricted forms of linguistic context. These models establish a new state-of-the-art on existing verb metaphor detection benchmarks, and show strong performance on jointly predicting the metaphoricity of all words in a running text.
Metaphors are pervasive in natural language, and detecting them requires challenging contextual reasoning about whether specific situations can actually happen Lakoff and Johnson (1980). For example, in Table 1, “examining” is metaphorical because it is impossible to literally use a “microscope” to examine an entire country. In this paper, we present end-to-end neural models for metaphor detection, which can learn rich contextual word representations that are crucial for accurate interpretation of figurative language.
In contrast, most previous approaches focused on limited forms of linguistic context, for example by only providing SVO triples such as (car, drink, gasoline) to the model Shutova et al. (2016); Tsvetkov et al. (2013); Rei et al. (2017); Bulat et al. (2017). While the verbal arguments provide strong cues, providing the full sentential context supports more accurate prediction, as seen in Table 1. Even in the few cases where the full sentence is used Köper and im Walde (2017); Turney et al. (2011); Jang et al. (2016), existing models have used unigram-based features with limited expressivity.
We investigate two common task formulations: (1) given a target verb in a sentence, classifying whether it is metaphorical or not, and (2) given a sentence, detecting all of the metaphorical words (independent of their POS tags). We find that relatively standard architectures based on bi-directional LSTMs Hochreiter and Schmidhuber (1997) augmented with contextualized word embeddings Peters et al. (2018) perform surprisingly well on both tasks, even with a modest amount of training data. We improve the previous state-of-the-art by 7.5 F1 on the VU Amsterdam Metaphor Corpus (VUA) for the sequence labeling task Steen et al. (2010), by 2.5 F1 on the VUA verb classification dataset, and by 4.9 F1 on the MOH-X dataset Mohammad et al. (2016). Our code is publicly available at https://github.com/gao-g/metaphor-in-context.
|The experts started examining the Soviet Union with a microscope to study perceived changes.|
|Rockford teachers are honored for saving a drowning student.|
|You’re drowning in student loan debt.|
We study two task formulations.
Sequence labeling: Given a sentence $x_1, \dots, x_n$, predict a sequence of binary labels $l_1, \dots, l_n$ to indicate the metaphoricity of each word.
Classification: Given a sentence $x_1, \dots, x_n$ and a target verb index $i$, predict a binary label $l$ to indicate the metaphoricity of the target $x_i$.
While both formulations have been studied in previous work, it is worth noting that the sequence labeling task generalizes the classification task in that the prediction for the target verb can be extracted from the full sentence predictions. In addition, as will be shown in Section 5, we find that given accurate annotations for all words in a sentence, the sequence labeling model outperforms the classification model even when the evaluation is set up as a classification task.
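The claim that sequence labeling generalizes classification can be illustrated with a minimal sketch. The function name and the example tagger output below are hypothetical and are not taken from the released code; the sketch only shows that a per-token prediction already contains the answer to the classification task.

```python
from typing import List


def classify_target_from_sequence(labels: List[int], target_index: int) -> int:
    """Read the classification answer off a full-sentence labeling.

    labels holds one binary prediction per token (1 = metaphorical,
    0 = literal); target_index is the position of the target verb.
    """
    if not 0 <= target_index < len(labels):
        raise IndexError("target_index is outside the sentence")
    return labels[target_index]


tokens = ["You", "'re", "drowning", "in", "student", "loan", "debt", "."]
# Hypothetical sequence-labeling output for the sentence above.
predicted = [0, 0, 1, 0, 0, 0, 0, 0]
target_label = classify_target_from_sequence(predicted, tokens.index("drowning"))
```

The reverse direction does not hold: a classifier that only scores the target verb says nothing about the remaining tokens.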
Our models use a bidirectional LSTM to encode a sentence, and a feedforward neural network for classification, optimized for the log-likelihood of gold labels.
For both sequence labeling and classification, we represent each token in the input sentence with a pre-trained word embedding. To further encode contextual information, we also concatenate ELMo (Embeddings from Language Models) vectors from Peters et al. (2018). These vectors have been shown to be useful for word sense disambiguation, a task closely related to metaphor detection Birke and Sarkar (2006).
Figure 1 shows the architecture of the sequence labeling model. We input the word representation of each token to a bidirectional LSTM, producing a contextualized representation for each token. A feedforward neural network then takes this contextualized representation and predicts a label for each word.
When the dataset does not contain annotations for every word, we make the simplifying assumption that every unannotated word is used literally.
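The simplifying assumption above amounts to a default-literal label fill. The following is a minimal sketch under assumed conventions (0 for literal, 1 for metaphorical, annotations given as a position-to-label mapping); none of these names come from the original code.

```python
from typing import Dict, List

LITERAL, METAPHOR = 0, 1  # assumed label encoding


def fill_labels(num_tokens: int, annotations: Dict[int, int]) -> List[int]:
    """Build a full label sequence from partial annotations.

    Every position starts as literal; annotated positions are then
    overwritten with their gold label.
    """
    labels = [LITERAL] * num_tokens
    for index, label in annotations.items():
        if not 0 <= index < num_tokens:
            raise IndexError(f"annotation index {index} is outside the sentence")
        labels[index] = label
    return labels


# Only the verb at position 2 is annotated; all other tokens default to literal.
full_labels = fill_labels(5, {2: METAPHOR})
```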
Figure 2 shows the architecture of the classification model. To the word representation of each token, we concatenate an index embedding, which indicates whether that token is the target verb. We use the result as input to a bidirectional LSTM, producing a contextualized representation for each token.
We add an attention layer that computes an attention weight for each token from its LSTM output state, using learned parameters, and we compute the sentence representation as the attention-weighted sum of the LSTM output states.
Finally, we feed this sentence representation to a feedforward network to compute the label scores for the target verb.
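As an illustration of the attention step, the sketch below pools a list of LSTM output states into one vector. It assumes that each token's score is a linear function of its state (a weight vector and a scalar bias as the two learned parameters) normalized with a softmax; the exact parameterization is an assumption here, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(
    states: Sequence[Sequence[float]],
    weight: Sequence[float],
    bias: float,
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted sum of states).

    Assumed scoring: score_i = weight . state_i + bias, followed by a
    softmax over all tokens in the sentence.
    """
    scores = [sum(w * h for w, h in zip(weight, state)) + bias for state in states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(states[0])
    pooled = [sum(a * state[d] for a, state in zip(attn, states)) for d in range(dim)]
    return attn, pooled


# With zero weights every token gets the same attention,
# so the pooled vector is the plain average of the states.
attn, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], weight=[0.0, 0.0], bias=0.0)
```

Because the weights sum to one, the pooled vector always lies in the convex hull of the token states; the feedforward classifier then operates on this single vector.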
We evaluate performance on a number of benchmark datasets, including two for classification (TroFi and MOH/MOH-X) and one for tagging (VUA). (For detailed information about each dataset, please refer to the original papers: TroFi Birke and Sarkar (2006), MOH Mohammad et al. (2016), VUA Steen et al. (2010). MOH-X refers to a subset of the MOH dataset used in previous work Shutova et al. (2016), where the verb and its argument are extracted from each sentence.) Table 2 shows statistics for the verb classification datasets. Despite being two times larger than the MOH dataset, the TroFi dataset contains only 50 unique verbs, and the larger VUA dataset contains over 2K unique verbs. The MOH dataset contains shorter and simpler sentences (example sentences in WordNet), compared to sentences in other datasets, which come from resources such as news articles. The TroFi and MOH-X datasets are constructed to have higher percentages of metaphor, compared to the natural likelihood of metaphor in a running text, as seen in the VUA dataset.
We perform 10-fold cross-validation on the MOH-X and TroFi datasets, following prior work. For the VUA dataset, we use the original training and test split Klebanov et al. (2016), and set aside 10% of the training set as a development set.
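The evaluation protocol can be sketched as follows. This is a minimal illustration only: the text does not say whether folds are shuffled or stratified, so the shuffling, the seed, and both function names are assumptions.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split example indices 0..n-1 into k (train, test) pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def hold_out_dev(n: int, dev_fraction: float = 0.1, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Set aside a fraction of the training indices as a development set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_dev = round(n * dev_fraction)
    return indices[n_dev:], indices[:n_dev]


splits = k_fold_indices(25, k=10)
train_idx, dev_idx = hold_out_dev(100)
```

Each example appears in exactly one test fold, so scores averaged over the folds cover the whole dataset once.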
The VUA dataset contains annotations for all words in each sentence. We divide the data into training, development, and test sets, following the same split as for the VUA verb classification task. While the label classes are less balanced (only 11% metaphors at the token level), this dataset is much bigger. Table 3 shows the data statistics.
|#||%||# Uniq.||Avg #|
|# Unique tokens||13,843||7,458||7,200|
|# Unique sent.||6,323||1,550||2,694|
|Wu et al. (2018) ensemble||60.8||70.0||65.1||-|
|Model||MOH-X (10 fold)||TroFi (10 fold)||VUA - Test|
|Wu et al. (2018) ensemble||-||-||-||-||-||-||-||-||60.0||76.3||67.2||-||-|
We report precision, recall and F1 measure for the metaphor class as well as the overall accuracy. For the VUA dataset, we also report macro-averaged F1 score across four genres (conversation, academic writing, fiction and news).
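For concreteness, the metrics can be computed as in the sketch below. The label encoding (1 for the metaphor class) and the per-genre input format are assumptions made for illustration, and the toy labels are not data from the paper.

```python
from typing import Dict, Sequence, Tuple


def prf1(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (metaphor) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1_by_genre(by_genre: Dict[str, Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Unweighted mean of the per-genre metaphor-class F1 scores."""
    scores = [prf1(gold, pred)[2] for gold, pred in by_genre.values()]
    return sum(scores) / len(scores)


# Toy labels for two genres.
p, r, f = prf1([1, 1, 0, 0], [1, 0, 1, 0])
macro = macro_f1_by_genre({
    "news": ([1, 0], [1, 0]),
    "fiction": ([1, 1, 0, 0], [1, 0, 1, 0]),
})
```

Macro-averaging weights each genre equally, so a small genre such as conversation influences the score as much as a large one.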
We propose a simple yet effective lexical baseline. It assigns the metaphor label if the word is annotated as metaphorical more frequently than as literal in the training set, and the literal label otherwise. We also compare our models to previously published work, including: (1) a logistic regression classifier with features that indicate verb lemmas and the verbs’ semantic class from WordNet Klebanov et al. (2016), (2) a neural similarity network with skip-gram word embeddings Rei et al. (2017), (3) a balanced logistic regression classifier on target verb lemma that uses a set of features based on multi-sense abstractness rating Köper and im Walde (2017), and (4) a CNN-LSTM ensemble model with weighted-softmax classifier which incorporates pre-trained word2vec, POS tags, and word cluster features Wu et al. (2018), the best-performing model on the VUA Metaphor Detection Shared Task at the NAACL 2018 Workshop on Figurative Language Processing.
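The lexical baseline can be sketched in a few lines. The sketch assumes training data given as (word, label) pairs with 1 for metaphorical and 0 for literal; whether the key is a surface form or a lemma is not specified above, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def train_lexical_baseline(examples: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each training word to its majority label.

    A word is labeled metaphorical (1) only if it is annotated as
    metaphorical strictly more often than as literal; ties go to literal.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, label in examples:
        counts[word][label] += 1
    return {word: 1 if c[1] > c[0] else 0 for word, c in counts.items()}


def predict(model: Dict[str, int], word: str) -> int:
    """Words never seen in training fall back to the literal label."""
    return model.get(word, 0)


# Toy training pairs for illustration only.
model = train_lexical_baseline([
    ("drown", 1), ("drown", 1), ("drown", 0),
    ("save", 0),
    ("examine", 1), ("examine", 0),
])
```

Because the baseline ignores context entirely, it labels every occurrence of a word the same way.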
We experiment with both the sequence labeling model (SEQ) and the classification model (CLS) for the verb classification task, and with the sequence labeling model (SEQ) for the sequence labeling task.
We used 300d GloVe vectors Pennington et al. (2014) and 1024d ELMo vectors. We used an additional 50d index embedding for the classification task. The LSTM module has a 300d hidden state. We applied dropout on the input to the LSTM and on the input to the feedforward layer. We tuned the learning rate and dropout rate for each model on each dataset. We used SGD to optimize the CLS model and Adam Kingma and Ba (2013) for the SEQ model. We used spaCy Honnibal and Montani (2017) for lemmatization, tokenization, and part-of-speech tagging.
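The dimensions above determine the input size of each model, which the sketch below works out. The class and method names are hypothetical, and the BiLSTM output size assumes that the 300d hidden state is per direction, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputConfig:
    glove_dim: int = 300    # GloVe word vectors
    elmo_dim: int = 1024    # ELMo contextualized vectors
    index_dim: int = 50     # target-verb index embedding (CLS only)
    lstm_hidden: int = 300  # LSTM hidden state size

    def seq_input_dim(self) -> int:
        """SEQ model input: GloVe and ELMo vectors concatenated."""
        return self.glove_dim + self.elmo_dim

    def cls_input_dim(self) -> int:
        """CLS model input: the SEQ input plus the index embedding."""
        return self.seq_input_dim() + self.index_dim

    def bilstm_output_dim(self) -> int:
        """Forward and backward states concatenated (assumes 300d per direction)."""
        return 2 * self.lstm_hidden


config = InputConfig()
dims = (config.seq_input_dim(), config.cls_input_dim(), config.bilstm_output_dim())
```

Under these assumptions the per-token input is 1324d for SEQ and 1374d for CLS.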
Performance on the sequence labeling task is reported in Table 4. While prior work Klebanov et al. (2014); Özbal et al. (2016) reported on the same dataset, the experimental setting is not comparable (they performed cross-validation on a smaller training set); as a point of reference, their macro-averaged F1 scores were 33.25 and 50.6, respectively. Our lexical baseline performs strongly in terms of precision, as some words and POS tags are almost exclusively annotated as literal. Our sequence labeling model mainly improves recall.
Table 5 reports the breakdown of performance by POS tags. Not surprisingly, tags with more data are easier to classify. Adposition is the easiest to identify as metaphorical and is also the most frequently metaphorical class (28%). On the other hand, particles are challenging to identify, since they are often associated with multi-word expressions, such as “put down the disturbances”.
Table 6 shows performance on the verb classification task for three datasets (MOH-X, TroFi, and VUA). We did not compare to Shutova et al. (2016), as their experimental setting is not comparable.
Our models achieve strong performance on all datasets, outperforming existing models on the MOH-X and VUA datasets. On the MOH-X dataset, the CLS model outperforms the SEQ model, likely due to the simpler overall sentence structure and the fact that the target verbs are the only words annotated for metaphoricity. For the VUA dataset, where we have annotations for all words in a sentence, the SEQ model significantly outperforms the CLS model. This result shows that predicting metaphor labels of context words helps to predict the target verb. We hypothesize that Köper and im Walde (2017) outperform our models on the TroFi dataset for a similar reason: their work uses concreteness labels, which correlate highly with metaphor labels of neighboring words in the sentence. Also, their best model uses the verb lemma as a feature, which itself provides a strong clue in a dataset of only 50 verbs (see the lexical baseline).
Table 7 shows an ablation study on input representations (with or without ELMo vectors). Contextualized word vectors improve the performance of both models by a large margin.
|✗||✗||To throw up an impenetrable Berlin Wall between you and them could be tactless.||-|
|✗||✗||In reality you just invent a tale, as if you were sitting round a fire in a cave.||direct metaphor|
|✗||✗||So they bought immunity.||indirect metaphor|
|✗||✗||During the early states of the phased evacuation the logistical problem facing the police was the street-by-street warning of the population to make ready for evacuation.||indirect metaphor|
|✗||✔||There are few things worse than being bludgeoned into reading a book you hate.||indirect metaphor|
|✗||✔||He thought of thick, fat, hot motorways carving up that land.||personification|
|✗||✔||One might ask whether motorists are ever justified in knowingly taking risks with other people’s lives.||-|
|✗||✔||The abstract talk of commuting by rail or road being replaced by information technology finds a concrete expression in the idea of telecottages.||-|
|✗||✔||A fly landed on the empty, staring vizor, and crawled across it.||-|
We sampled 100 errors of our best model from the VUA verb classification development set for analysis. Table 8 shows examples. Following the original annotation guideline (http://www.vismet.org/metcor/documentation/home.html), we classify metaphors into five categories: direct metaphor, indirect metaphor, implicit metaphor, personification, and borderline case. Indirect metaphor, the most common type for verbs, means that the basic meaning of a word is different from its contextual meaning. Implicit metaphor occurs due to an underlying link which points to a recoverable metaphorical concept.
About half of the errors were false positives, and the other half were false negatives. Among the false negatives, 33% are indirect metaphors, 18% are personifications, and 2% are direct metaphors. Among 55 false positives, 31% of verbs have implicit arguments that are not explicitly mentioned in the context, 15% have long range dependencies (at least five words away) from core arguments, 10% have arguments with rare word senses, and 5% have anthropomorphic arguments. Finally, we found about half of false negatives and 20% of false positives to be borderline cases, showing the subjective nature of the task.
We sampled 257 dev examples that the CLS model gets wrong but the SEQ model gets correct. We found that the SEQ model outperforms the CLS model on detecting personifications, indirect metaphors, and direct metaphors involving uncommon verbs.
There has been significant work on studying different features for metaphor detection, including concreteness and abstractness Turney et al. (2011); Tsvetkov et al. (2014); Köper and im Walde (2017), imaginability Broadwell et al. (2013); Strzalkowski et al. (2013), feature norms Bulat et al. (2017), sensory features Tekiroglu et al. (2015); Shutova et al. (2016), bag-of-words features Köper and im Walde (2016), and semantic class using WordNet Hovy et al. (2013); Tsvetkov et al. (2014). More recently, embedding-based approaches Köper and im Walde (2017); Rei et al. (2017) showed gains on various benchmarks.
Many neural models with various features and architectures were introduced in the 2018 VUA Metaphor Detection Shared Task. They include LSTM-based models and CRFs augmented by linguistic features, such as WordNet, POS tags, concreteness score, unigrams, lemmas, verb clusters, and sentence-length manipulation Swarnkar and Singh (2018); Pramanick et al. (2018); Mosolova et al. (2018); Bizzoni and Ghanimifard (2018); Wu et al. (2018). Researchers also studied different word embeddings, such as embeddings trained from corpora representing different levels of language mastery Stemle and Onysko (2018) and binarized vectors that reflect the General Inquirer dictionary category of a word Mykowiecka et al. (2018). We show that contextualized word embeddings significantly improve metaphor detection. We also study both sequence labeling and classification approaches, suggesting that the sequence labeling approach enhances performance when used to jointly predict the metaphoricity of all words in a sentence.
In this paper, we present simple BiLSTM models augmented with contextualized word representations for metaphor detection. Our models establish a new state of the art across multiple existing benchmarks, and our error analysis shows remaining challenges for metaphor detection.
We thank the anonymous reviewers for their insightful comments. This work was supported in part by the NSF (IIS-1714566 and IIS-1252835), the ARO (W911NF-16-1-0121), the DARPA CwC program through ARO (W911NF-15-1-0543), and gifts from Google and Facebook.
Honnibal and Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
Mykowiecka et al. (2018). Detecting figurative word occurrences using recurrent neural networks. In Proceedings of the Workshop on Figurative Language Processing, NAACL, pages 124–127.