Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex Words with Derivational Morphology

01/02/2021
by Valentin Hofmann, et al.

How does the input segmentation of pretrained language models (PLMs) affect their generalization capabilities? We present the first study investigating this question, taking BERT as the example PLM and focusing on the semantic representations of derivationally complex words. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which derivational segmentation consistently outperforms BERT's WordPiece segmentation by a large margin. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.
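
To make the segmentation contrast concrete, the following is a minimal sketch (not the authors' code) that compares BERT's WordPiece segmentation of derivationally complex words with hand-specified derivational splits. The model name, example words, and splits are illustrative assumptions; the paper derives its derivational segmentation with a morphological segmenter rather than a hand-coded table.

    # Minimal sketch, assuming the `transformers` package is installed.
    # Contrasts BERT's own WordPiece segmentation with hypothetical
    # derivational (prefix/suffix + stem) splits for a few example words.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    words = ["superbizarre", "unlockable", "overpayment"]

    # Hand-coded derivational splits, given here purely for illustration.
    derivational = {
        "superbizarre": ["super", "bizarre"],
        "unlockable": ["un", "lock", "able"],
        "overpayment": ["over", "pay", "ment"],
    }

    for w in words:
        wordpiece = tokenizer.tokenize(w)  # BERT's subword segmentation
        print(f"{w:>14}  WordPiece: {wordpiece}  Derivational: {derivational[w]}")

On a typical BERT vocabulary, the WordPiece output will often cut across morpheme boundaries (e.g., splitting the stem of "superbizarre"), which is exactly the kind of mismatch the paper's semantic probing tasks are designed to expose.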

Related research

Generating Derivational Morphology with BERT (05/02/2020)
Can BERT generate derivationally complex words? We present the first stu...

Quantifying the Contextualization of Word Representations with Semantic Class Probing (04/25/2020)
Pretrained language models have achieved a new state of the art on many ...

Position Masking for Language Models (06/02/2020)
Masked language modeling (MLM) pre-training models such as BERT corrupt ...

Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings (08/18/2019)
Traditionally, many text-mining tasks treat individual word-tokens as th...

Table Search Using a Deep Contextualized Language Model (05/19/2020)
Pretrained contextualized language models such as BERT have achieved imp...

IERL: Interpretable Ensemble Representation Learning – Combining CrowdSourced Knowledge and Distributed Semantic Representations (06/24/2023)
Large Language Models (LLMs) encode meanings of words in the form of dis...

Improving Tokenisation by Alternative Treatment of Spaces (04/08/2022)
Tokenisation is the first step in almost all NLP tasks, and state-of-the...
