
Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms

We present a fairly large Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges that NLP systems face in tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes, such as the one presented in this work. To the best of the authors' knowledge, this is the first idioms corpus with classes of idioms beyond the literal and the general idioms classification. In particular, the following classes are labelled in the dataset: metaphor, simile, euphemism, parallelism, personification, oxymoron, paradox, hyperbole, irony and literal. Many past efforts have been limited in corpus size and classes of samples, but this dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses). The corpus may also be extended by researchers to meet specific needs. The corpus has part of speech (PoS) tagging from the NLTK library. Classification experiments performed on the corpus to obtain a baseline and a comparison among three common models, including the BERT model, give good results. We also make the corpus and the relevant code for working with it on NLP tasks publicly available.



1 Introduction

Idioms pose strong challenges to NLP systems, whether with regard to tasks such as MT, WSD, information retrieval or metonymy resolution Korkontzelos et al. (2013). For example, in conversational systems, generating adequate responses depending on the idiom's class (for a user input such as "My wife kicked the bucket") will benefit users of such systems. This is because recognizing that example as a euphemism (a polite form of a hard expression), instead of just a general idiom, may elicit a sympathetic response from the conversational system, instead of a bland one. Also, classifying idioms into various classes has the potential benefit of enabling automatic substitution of their literal meaning in MT.

Idioms are part of figures of speech, which are multi-word expressions (MWEs) that have different meanings from the constituent meaning of the words Quinn and Quinn (1993); Drew and Holt (1998), though some draw a distinction between the two Grant and Bauer (2004). Not all MWEs are idioms. An MWE may be compositional, i.e. its meaning is predictable from the component words Diab and Bhutada (2009). Research in this area is, therefore, important, especially since the use of idiomatic expressions is very common in spoken and written text Lakoff and Johnson (2008); Diab and Bhutada (2009).

Figures of speech are so diverse that a detailed evaluation is out of the scope of this work. Indeed, figures of addition and subtraction create a complex but interesting collection Quinn and Quinn (1993). Sometimes, idioms are not well-defined and the classification of cases is not clear Grant and Bauer (2004); Alm-Arvius (2003). Even single words can be expressed as metaphors Lakoff and Johnson (2008); Birke and Sarkar (2006). This fact makes distinguishing between figures of speech or idioms and literals quite a difficult challenge in some instances Quinn and Quinn (1993). Previous work has focused on datasets without the actual classification of the senses of expressions beyond the literal and general idioms Li and Sporleder (2009); Cook et al. (2007). Also, many of them have fewer than 10,000 samples Sporleder et al. (2010); Li and Sporleder (2009); Cook et al. (2007). It is therefore imperative to have a fairly large dataset for neural network training, given that more data increases the performance of neural network models Adewumi et al. (2019, 2020).

The objectives of this work are to create a corpus of potential idiomatic expressions in the English language and make it publicly available for the NLP research community. There are two usual approaches to idiom detection: type-based and tokens-in-context (or token-based) Peng et al. (2015); Cook et al. (2007); Li and Sporleder (2009); Sporleder et al. (2010). This work focuses on the latter approach by presenting an annotated corpus. This will contribute to advancing research in token-based idiom detection, which has received less attention in the past than type-based detection. Identifying fixed-syntax (or static) idioms is much easier than identifying those with inflections, since an exact phrasal match can be used. The idioms corpus has almost 1,200 cases of idioms (with their meanings) (e.g. cold feet, kick the bucket, etc.), 10 classes (or senses, including literal) and over 20,100 samples, of which 96.9% come from the British National Corpus (BNC) and about 3.1% from UK-based web pages (UKWAC). This is, possibly, the first idioms corpus with classes of idioms beyond the literal and general idioms classification. The authors further carried out classification experiments on the corpus to obtain a baseline and a comparison among three common models, including the BERT model. The following sections cover related work, the methodology for creating the corpus, corpus details, experiments and the conclusion.

2 Literature Review

There have been variations in the methods used in past efforts at creating idioms corpora. Some corpora have fewer than 100 cases of idioms and fewer than 10,000 samples, with few classes and without classification of the idioms Sporleder et al. (2010). Furthermore, labelled datasets for idioms in English are scarce. Table 1 summarizes some of the related work, in comparison to the authors'.

The IDIX corpus, based on expressions from the BNC, does not classify idioms, though annotation was more than the literal and non-literal alternatives Sporleder et al. (2010). They used Google search to ascertain how frequent each idiom is for the purpose of selection. Their automatic extraction from the BNC returned some erroneous results which were manually filtered out. It contains 5,836 samples and 78 cases. Li and Sporleder (2009) extracted 3,964 literal and non-literal expressions from the Gigaword corpus. The expressions covered only 17 idiom cases. Meanwhile, Cook et al. (2007) selected 60 verb-noun construct (VNC) token expressions and extracted 100 sentences for each from the BNC. These were annotated by two native English speakers Cook et al. (2007).

Diab and Bhutada (2009) used a support vector machine (SVM) to perform binary classification into literal and idiomatic expressions on a subset of the VNC-Token. The English SemEval-2013 dataset had over 4,350 samples Korkontzelos et al. (2013). The annotation did not include idiom classification but differentiated literal use, figurative use or both, by using three crowd-workers per example. It only contained idioms (from a manually-filtered list) that have both figurative and literal uses, excluding those with only figurative use.

Saxena and Paul (2020) introduced the English Possible Idiomatic Expressions (EPIE) corpus, containing 25,206 samples of 717 idiom cases. The dataset does not specify the number of literal samples and does not include idioms classification. Haagsma et al. (2020) generated potential idiomatic expressions in a recent work (MAGPIE) and annotated the dataset using only two main classes (idiomatic or literal), through crowdsourcing. The idiomatic samples are 2.5 times more frequent than the literals, with 1,756 idiom cases and an average of 32 samples per case. There are 126 cases with only one instance and 372 cases with fewer than 6 instances in the corpus, making it potentially difficult for neural networks to learn from the samples of such cases because there are so few of them.

There are two usual approaches to idiom detection in the literature: type-based and token-in-context (token-based) Cook et al. (2007); Li and Sporleder (2009); Sporleder et al. (2010). The former attempts to determine whether an expression can be used as an idiom, while the latter relies on context to disambiguate between an idiom and its literal usage, as demonstrated in the SemEval semantic compositionality in context subtask Korkontzelos et al. (2013); Sporleder et al. (2010). Token-based detection is a more difficult task than semantic similarity of words and compositional phrases, as demonstrated by Korkontzelos et al. (2013); hence, detecting any of the multiple classes in an idioms dataset may be even more challenging.

There are various classes (or senses) of idioms, including metaphor, simile and paradox, among others Alm-Arvius (2003). Tropes and schemes, according to Alm-Arvius (2003), are sub-categories of figures of speech. Tropes have to do with variations in the use of lexemes and MWEs. Schemes involve rhythmic repetitions of phoneme sequences, syntactic constructions, or words with similar senses. A figure of speech becomes part of a language as an idiom when members of the community repeatedly use it. The principles of idioms are similar across languages but actual examples are not comparable or identical across languages Alm-Arvius (2003).

| Dataset | Cases | Classes | Samples |
|---|---|---|---|
| PIE-English (ours) | 1,197 | 10 | 20,174 |
| IDIX | 78 | | 5,836 |
| Li & Sporleder | 17 | 2 | 3,964 |
| MAGPIE | 1,756 | 2 | 56,192 |
| EPIE | 717 | | 25,206 |

Table 1: Some datasets compared

3 Methodology

Each of the 4 contributors (who are English speakers) collected sample sentences of idioms and literals (where applicable) from the BNC, based on idioms identified in the dictionary by Easy Pace Learning. As a form of quality control, the entire corpus was reviewed by a near-native speaker. This approach avoided common problems noticeable with crowd-sourcing methods, such as cheating the system or fatigue Haagsma et al. (2020). Although our approach is time-intensive, it also eliminates the problems noticeable with automatic extraction, such as duplicate sentences Saxena and Paul (2020) or false negatives/positives Sporleder et al. (2010), for which manual effort may later be required. This strategy gives high precision and recall to our total collection Sporleder et al. (2010).

Classification of the cases of idioms was done by the near-native speaker (annotation 1 in table 3), based on their characteristics as discussed in the next section, while the classification by the authors of the dictionary is annotation 2. A common approach for annotation is to have two or more annotators and determine their inter-agreement scores Peng et al. (2015). Google search was used for cases in the dictionary that did not include classification and most of such came from The Free

The contributors were given ample time for their task to mitigate fatigue, which can be a common hindrance to quality in dataset creation. We used the resources dedicated to the BNC and other corpora to extract the sentences. The BNC has 100M words while the UKWAC has 2B words. One of the benefits of these tools is the functionality for lemma-based search when searching for usage variants. In a few cases, where fewer than 6 literal samples were available from both corpora, we used inflection to generate additional examples. For example, "You need one to hold the ferret securely while the other ties the knot" was inflected as "She needs to hold the ferret securely while he ties the knot".
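The difference between exact phrasal matching and lemma-based search can be illustrated with a small sketch. This is not the corpus tools' implementation: the `LEMMAS` table, the tokenizer and the `find_idiom` helper are hypothetical and exist only to show why inflected variants need lemma-level matching.

```python
import re

# Hypothetical, hand-written lemma table for illustration only; a real
# system would rely on a proper lemmatizer or the corpus tool's lemma index.
LEMMAS = {
    "kicked": "kick", "kicks": "kick", "kicking": "kick",
    "ties": "tie", "tied": "tie", "tying": "tie",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lemmatize(tokens):
    """Map each token to its lemma when the table knows it."""
    return [LEMMAS.get(token, token) for token in tokens]


def find_idiom(sentence, idiom, use_lemmas=True):
    """Return the token index where the idiom starts, or -1 if absent."""
    sent, pattern = tokenize(sentence), tokenize(idiom)
    if use_lemmas:
        sent, pattern = lemmatize(sent), lemmatize(pattern)
    for start in range(len(sent) - len(pattern) + 1):
        if sent[start:start + len(pattern)] == pattern:
            return start
    return -1


# Exact matching misses the inflected form; lemma matching finds it.
exact = find_idiom("My wife kicked the bucket", "kick the bucket", use_lemmas=False)
by_lemma = find_idiom("My wife kicked the bucket", "kick the bucket")
```

Here `exact` is -1 while `by_lemma` is 2, the position of "kicked" in the token list.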

4 The Corpus

Idioms were selected from the dictionary in alphabetical order and samples were selected from the BNC & UKWAC based on the first to appear in both corpora. Each sample contains 1 or 2 sentences, with the majority containing just 1. The BNC is a popular choice for extracting realistic samples across domains. The BNC is, however, relatively small, hence we also relied on the second corpus, UK-based web pages (UKWAC), for further extraction when search results were fewer than required (15 idiom samples, or 21 for cases including both idioms and literals). Therefore, in each case, the number of samples was 22 for cases with literals and 16 for cases without literals (because of the included MWE). The number of literal samples was fixed at six for each case that had both potential idiomatic expressions and literals, because the BNC and UKWAC sometimes had fewer or more literal samples, depending on the case. A limitation of the PIE-English dataset, which seems inevitable, is the dominance of metaphors, since metaphors are the most common figures of speech Bizzoni et al. (2017); Grant and Bauer (2004). Table 2 gives the distribution of the classes of samples.

It should be reiterated that idiom classifications can sometimes overlap, as shown in Figure 1, and there is no general consensus on all the cases Grant and Bauer (2004); Alm-Arvius (2003). Indeed, there have been different attempts at classifying idioms, including semantic, syntactic and functional classifications Grant and Bauer (2004); Cowie and Mackin (1983). The classification employed by the authors of this work is based, largely, on the standpoint of Alm-Arvius (2003). It can be observed that a case or sample classified as personification also qualifies as a metaphor, and the same holds for euphemism. Hence, when two annotators assign such different labels, it does not imply that either is wrong but that one is more specific.

A metaphor uses a phenomenon or type of experience to outline something more general and abstract Alm-Arvius (2003); Lakoff and Johnson (2008). It describes something by comparing it with another dissimilar thing in an implicit manner. This is unlike simile, which compares in an explicit manner. Some other figures of speech sometimes overlap with metaphor and other idioms overlap with others.

Personification describes something not human as if it could feel, think or act in the same way humans could. Examples of personification are metaphors also. Hence, they form a subset (hyponym) of metaphors. Apostrophe denotes direct, vocative addresses to entities that may not be factually present (and is a subset of personification) Alm-Arvius (2003). Oxymoron is a contradictory combination of words or phrases. They are meaningful in a paradoxical way and some examples can appear hyperbolic Alm-Arvius (2003). Hyperbole is an exaggeration or overstatement. This has the effect of startling or amusing the hearer. Figure 1 is a diagram of the relationship among some classes of idioms, based on the authors’ perception of the description by Alm-Arvius (2003).

Figure 1: Classes of idioms & their relationships

The idioms are common in many English-speaking countries. There is no restriction on the syntactic pattern of the idioms in the instances. Our manual extraction approach from the base corpora increases the quality of the samples in the dataset, given that manual approaches appear to give more accurate results though demanding on time Roh et al. (2019).

Risks with data privacy are limited to what is provided in the base corpora (BNC & UKWAC). Part of speech (PoS) tagging was performed using the natural language toolkit (NLTK) to process the original dataset Loper and Bird (2002). The corpus may also be extended by researchers to meet specific needs, for example by adding IOB tags for chunking, as another approach for training. The corpus and the relevant Python code for NLP tasks are publicly available.
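As an illustration of the IOB extension mentioned above, the following minimal sketch tags the tokens of one sample. It assumes the idiom's token span is already known; the `iob_tags` helper and the `METAPHOR` label name are hypothetical and not part of the released corpus.

```python
def iob_tags(tokens, span_start, span_len, label="IDIOM"):
    """Return one IOB tag per token for a single idiom span.

    Tokens outside the span get "O", the first token of the span gets
    "B-<label>" and the remaining span tokens get "I-<label>".
    """
    if span_start < 0 or span_len < 0 or span_start + span_len > len(tokens):
        raise ValueError("span does not fit inside the token list")
    tags = ["O"] * len(tokens)
    if span_len > 0:
        tags[span_start] = "B-" + label
        for i in range(span_start + 1, span_start + span_len):
            tags[i] = "I-" + label
    return tags


# Sample taken from the metaphor example "Those names ring a bell".
tokens = ["Those", "names", "ring", "a", "bell"]
tags = iob_tags(tokens, span_start=2, span_len=3, label="METAPHOR")
```

For this sample the tags are `O O B-METAPHOR I-METAPHOR I-METAPHOR`, which could be stored alongside the existing PoS tags.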

| Classes | % of Samples | Samples |
|---|---|---|
| Metaphor | 72.7 | 14,666 |
| Simile | 6.11 | 1,232 |
| Euphemism | 11.82 | 2,384 |
| Parallelism | 0.32 | 64 |
| Personification | 2.22 | 448 |
| Oxymoron | 0.24 | 48 |
| Paradox | 0.56 | 112 |
| Hyperbole | 0.24 | 48 |
| Irony | 0.16 | 32 |
| Literal | 5.65 | 1,140 |
| Overall | 100 | 20,174 |

Table 2: Distribution of samples of idioms/literals in the corpus

| Classes | Annotation 1 | % | Annotation 2 | % |
|---|---|---|---|---|
| Metaphor | 921 | 76.94 | 877 | 73.27 |
| Simile | 82 | 6.85 | 66 | 5.51 |
| Euphemism | 148 | 12.36 | 75 | 6.27 |
| Parallelism | 3 | 0.25 | 9 | 0.75 |
| Personification | 28 | 2.34 | 66 | 5.51 |
| Oxymoron | 4 | 0.33 | 9 | 0.75 |
| Paradox | 6 | 0.5 | 19 | 1.59 |
| Hyperbole | 3 | 0.25 | 57 | 4.76 |
| Irony | 2 | 0.17 | 19 | 1.59 |
| Overall | 1197 | 100 | 1197 | 100 |

Table 3: Annotation of classes of cases of idioms in the corpus

| ID | Token | PoS | class | meaning | idiom+literal |
|---|---|---|---|---|---|

Table 4: Fields in the corpus
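The percentages in Table 2 follow directly from the sample counts. The short sketch below recomputes them as a sanity check; the dictionary simply restates the counts from Table 2, and the variable names are illustrative.

```python
# Sample counts per class, copied from Table 2.
CLASS_COUNTS = {
    "Metaphor": 14666, "Simile": 1232, "Euphemism": 2384,
    "Parallelism": 64, "Personification": 448, "Oxymoron": 48,
    "Paradox": 112, "Hyperbole": 48, "Irony": 32, "Literal": 1140,
}

total = sum(CLASS_COUNTS.values())

# Share of each class as a percentage of all samples, rounded to 2 decimals.
shares = {name: round(100 * count / total, 2) for name, count in CLASS_COUNTS.items()}

for name, share in shares.items():
    print(f"{name:16s} {CLASS_COUNTS[name]:6d} {share:6.2f}%")
```

The total comes to 20,174 samples, and the computed shares agree with the table, which makes the dominance of the metaphor class easy to see.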

One example sample per class in the corpus is given below. Each potential idiomatic expression in brackets represents a case.

  1. Metaphor (ring a bell): Those names ring a bell

  2. Simile (as clear as a bell): it sounds as clear as a bell

  3. Euphemism (go belly up): that Blogger could go belly up in the near future

  4. Parallelism (day in, day out): that board was used day in day out

  5. Personification (take time by the forelock): What I propose is to take time by the forelock.

  6. Oxymoron (a small fortune): a chest like this costs a small fortune if you can find one.

  7. Paradox (here today, gone tomorrow): he’s a here today, gone tomorrow politician.

  8. Hyperbole (the back of beyond): Mhm. a voice came, from the back of beyond.

  9. Irony (pigs might fly): Pigs might fly, the paramedic muttered.

  10. Literal (ring a bell): They used to ring a bell up at the hotel.

5 Experiments

The data split was done in a stratified way before the data was fed to the network, to address the class imbalance in the corpus. This method ensures all classes are split in the same ratio among the training and dev (or validation) sets. The split was 85:15 for the training and validation sets, respectively. All experiments were performed on a shared cluster with Tesla V100 GPUs, though only one GPU was used in training the BERT model and the CPUs were used for the other classifiers. The cluster runs Ubuntu 18.
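A stratified 85:15 split can be sketched in a few lines of standard-library Python. This is a minimal illustration of the idea and not the authors' code; the function name, the rounding rule and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, dev_fraction=0.15, seed=0):
    """Split sample indices so each class keeps the same train/dev ratio.

    Returns two sorted lists of indices: (train_indices, dev_indices).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, dev = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Number of dev samples for this class (an assumed rounding rule).
        n_dev = round(len(indices) * dev_fraction)
        dev.extend(indices[:n_dev])
        train.extend(indices[n_dev:])
    return sorted(train), sorted(dev)


# Toy labels: a frequent class and a rare one.
labels = ["metaphor"] * 100 + ["irony"] * 20
train_idx, dev_idx = stratified_split(labels)
```

With these toy labels, the dev set receives 15 "metaphor" and 3 "irony" samples, so both classes keep the 85:15 ratio.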

5.1 Methodology

The pre-processing involved lowercasing all text and removing all HTML tags, if any, though none were found as the data was extracted manually and verified. Furthermore, bad symbols and numbers were removed. The training set was shuffled before training.
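The pre-processing steps can be sketched as follows. The text does not define which symbols count as "bad", so this sketch assumes that anything other than the letters a-z and whitespace is removed; the regular expressions and the `preprocess` name are illustrative.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML-like tags
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # assumed "bad symbols" and digits
SPACE_RE = re.compile(r"\s+")            # runs of whitespace


def preprocess(text):
    """Lowercase, strip tags, drop symbols and numbers, tidy whitespace."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


cleaned = preprocess("<b>Those</b> names RING a bell, 2 times!")
```

Note that under this assumption apostrophes are also removed, so a contraction such as "he's" becomes two tokens.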

The following classifiers/models were experimented with to serve as baselines and for comparison: a multinomial naive Bayes (MNB) classifier, a linear SVM and BERT Devlin et al. (2018). The authors used CountVectorizer to build the matrix of token counts before transforming it into a normalized TF-IDF representation and feeding it to the MNB and SVM classifiers. BERT, however, uses WordPiece embeddings Devlin et al. (2018). The SVM uses stochastic gradient descent (SGD) and hinge loss. Its default regularization is L2. The total number of training epochs is 5 for MNB and SVM, while it is 3 epochs for BERT.
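To make the step from token counts to a normalized TF-IDF representation concrete, here is a dependency-free sketch. It assumes a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which to my knowledge matches scikit-learn's default TF-IDF settings; the authors' exact settings are not stated, and the whitespace tokenizer here is a simplification.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per doc."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # the "matrix of token counts" for this doc
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf(["ring a bell", "ring the phone"])
```

In this toy example "ring" occurs in both documents, so it receives a lower weight than "bell" in the first row. The resulting rows are the kind of feature vectors that an MNB or a linear SVM classifier would be trained on.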

6 Results and Discussion

Tables 5 and 6 show weighted average results obtained from the experiments, over three runs per model. Figure 2 is a bar chart of Table 5. All three classifiers give results above what may be considered chance. BERT, being a pre-trained, deep neural network model, performed best out of the three classifiers.

Table 6 shows that, despite the good results, the corpus can benefit from further improvement by adding to the classes of idioms that have a low number of samples. This is because the classes recording accuracy of 0 are the ones with the least number of samples in the corpus. Adding more samples to them should improve the results. Regardless, there is strong performance in six, out of the ten, classes in the corpus.

| Model | Accuracy | F1 |
|---|---|---|
| MNB | 0.747 | 0.66 |
| SVM | 0.766 | 0.67 |
| BERT | 0.928 | 0.969 |

Table 5: Weighted average results for the three models over 3 runs/classifier (over 3 epochs for BERT)

| Class | Accuracy | F1 |
|---|---|---|
| Metaphor | 0.976 | 0.981 |
| Simile | 0.996 | 0.988 |
| Euphemism | 0.884 | 0.956 |
| Parallelism | 0.967 | 0.97 |
| Personification | 0.637 | 0.963 |
| Oxymoron | 0 | 0.797 |
| Paradox | 0.196 | 0.957 |
| Hyperbole | 0 | 0.789 |
| Irony | 0 | 0.963 |
| Literal | 0.624 | 0.832 |

Table 6: BERT average results for 3 runs over the classes of idioms

Figure 2: Weighted average results for the three models over 3 runs/classifier
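A weighted average combines per-class scores using the number of samples in each class as weights. The sketch below applies this to the per-class BERT figures in Table 6. It assumes the weights are proportional to the class counts in Table 2, which a stratified validation split would approximately preserve; the text does not state the exact weights used.

```python
# Class sizes from Table 2 (assumed to be proportional to the dev-set sizes).
SUPPORT = {
    "Metaphor": 14666, "Simile": 1232, "Euphemism": 2384,
    "Parallelism": 64, "Personification": 448, "Oxymoron": 48,
    "Paradox": 112, "Hyperbole": 48, "Irony": 32, "Literal": 1140,
}

# Per-class (accuracy, F1) for BERT, copied from Table 6.
BERT_PER_CLASS = {
    "Metaphor": (0.976, 0.981), "Simile": (0.996, 0.988),
    "Euphemism": (0.884, 0.956), "Parallelism": (0.967, 0.97),
    "Personification": (0.637, 0.963), "Oxymoron": (0.0, 0.797),
    "Paradox": (0.196, 0.957), "Hyperbole": (0.0, 0.789),
    "Irony": (0.0, 0.963), "Literal": (0.624, 0.832),
}


def weighted_average(scores, weights):
    """Average of scores[k] weighted by weights[k] over all keys."""
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight


accuracy = weighted_average({k: v[0] for k, v in BERT_PER_CLASS.items()}, SUPPORT)
f1 = weighted_average({k: v[1] for k, v in BERT_PER_CLASS.items()}, SUPPORT)
```

Under this assumption the two values round to 0.928 and 0.969, in line with the BERT row of Table 5. It also shows why the three classes with an accuracy of 0 barely move the weighted average: together they account for 128 of the 20,174 samples.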

7 Conclusion

In this work, we address the lack of a labelled idioms corpus with classes by creating one from the BNC and the UKWAC corpora. It is possibly the first idioms corpus with classes of idioms beyond the literal and general idioms classification. The dataset contains over 20,100 samples with almost 1,200 cases of idioms from 10 classes (or senses). The dataset may also be extended by researchers to meet specific NLP needs. The authors performed classification on the corpus to obtain a baseline and a comparison among three common models, including the BERT model Devlin et al. (2018). Good results are obtained. We also make the corpus and the relevant code for working with it on NLP tasks publicly available.


The work on this project is partially funded by Vinnova under the project number 2019-02996 "Språkmodeller för svenska myndigheter".


  • Adewumi et al. (2019) Tosin P Adewumi, Foteini Liwicki, and Marcus Liwicki. 2019. Conversational systems in machine learning from the point of view of the philosophy of science—using alime chat and related studies. Philosophies, 4(3):41.
  • Adewumi et al. (2020) Tosin P Adewumi, Foteini Liwicki, and Marcus Liwicki. 2020. Word2vec: Optimal hyper-parameters and their impact on nlp downstream tasks. arXiv preprint arXiv:2003.11645.
  • Alm-Arvius (2003) Christina Alm-Arvius. 2003. Figures of speech. Studentlitteratur.
  • Birke and Sarkar (2006) Julia Birke and Anoop Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
  • Bizzoni et al. (2017) Yuri Bizzoni, Stergios Chatzikyriakidis, and Mehdi Ghanimifard. 2017. “Deep” learning: Detecting metaphoricity in adjective-noun pairs. In Proceedings of the Workshop on Stylistic Variation, pages 43–52.
  • Cook et al. (2007) Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the workshop on a broader perspective on multiword expressions, pages 41–48.
  • Cowie and Mackin (1983) Anthony Paul Cowie and Ronald Mackin. 1983. Oxford dictionary of current idiomatic english v. 2:phrase, clause & sentence idioms.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Diab and Bhutada (2009) Mona Diab and Pravin Bhutada. 2009. Verb noun construction mwe token classification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009), pages 17–22.
  • Drew and Holt (1998) Paul Drew and Elizabeth Holt. 1998. Figures of speech: Figurative expressions and the management of topic transition in conversation. Language in society, pages 495–522.
  • Grant and Bauer (2004) Lynn Grant and Laurie Bauer. 2004. Criteria for re-defining idioms: Are we barking up the wrong tree? Applied linguistics, 25(1):38–61.
  • Haagsma et al. (2020) Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. Magpie: A large corpus of potentially idiomatic expressions. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 279–287.
  • Korkontzelos et al. (2013) Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann. 2013. Semeval-2013 task 5: Evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 39–47.
  • Lakoff and Johnson (2008) George Lakoff and Mark Johnson. 2008. Metaphors we live by. University of Chicago press.
  • Li and Sporleder (2009) Linlin Li and Caroline Sporleder. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 315–323.
  • Loper and Bird (2002) Edward Loper and Steven Bird. 2002. Nltk: the natural language toolkit. arXiv preprint cs/0205028.
  • Peng et al. (2015) Jing Peng, Anna Feldman, and Hamza Jazmati. 2015. Classifying idiomatic and literal expressions using vector space representations. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 507–511.
  • Quinn and Quinn (1993) Arthur Quinn and Barney R Quinn. 1993. Figures of speech: 60 ways to turn a phrase. Psychology Press.
  • Roh et al. (2019) Yuji Roh, Geon Heo, and Steven Euijong Whang. 2019. A survey on data collection for machine learning: a big data-ai integration perspective. IEEE Transactions on Knowledge and Data Engineering.
  • Saxena and Paul (2020) Prateek Saxena and Soma Paul. 2020. Epie dataset: A corpus for possible idiomatic expressions. In International Conference on Text, Speech, and Dialogue, pages 87–94. Springer.
  • Sporleder et al. (2010) Caroline Sporleder, Linlin Li, Philip Gorinski, and Xaver Koch. 2010. Idioms in context: The idix corpus. In LREC. Citeseer.