Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings
Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. The advantage compared to static word embeddings has been shown for a number of tasks, such as text classification, sequence tagging, or machine translation. Since vectors of the same word can vary due to different contexts, they implicitly provide a model for word sense disambiguation (WSD). We introduce a simple but effective approach to WSD using a nearest neighbor classification on CWEs. We compare the performance of different CWE models for the task and can report improvements above the current state of the art for one standard WSD benchmark dataset. We further show that the pre-trained BERT model is able to place polysemic words into distinct 'sense' regions of the embedding space, while ELMo and flair NLP do not indicate this ability.
READ FULL TEXT