On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies

04/12/2021
by Tianyi Zhang, et al.

We study how masking and predicting tokens in an unsupervised fashion can give rise to linguistic structures and downstream performance gains. Recent theories have suggested that pretrained language models acquire useful inductive biases through masks that implicitly act as cloze reductions for downstream tasks. While this account is appealing, we show that the success of the random masking strategy used in practice cannot be explained by such cloze-like masks alone. We construct cloze-like masks using task-specific lexicons for three different classification datasets and show that the majority of pretrained performance gains come from generic masks that are not associated with the lexicon. To explain the empirical success of these generic masks, we demonstrate a correspondence between the Masked Language Model (MLM) objective and existing methods for learning statistical dependencies in graphical models. Using this correspondence, we derive a method for extracting the learned statistical dependencies in MLMs and show that these dependencies encode useful inductive biases in the form of syntactic structures. In an unsupervised parsing evaluation, simply forming a minimum spanning tree on the implied statistical dependence structure outperforms a classic method for unsupervised parsing (58.74 vs. 55.91 UUAS).
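To make the final step concrete, here is a minimal sketch of the spanning-tree idea, assuming a BERT-style MLM from Hugging Face transformers. The KL-based pairwise score below is an illustrative stand-in for the paper's derived dependence estimator, not the authors' exact method: it measures how much the model's prediction at a masked position i shifts when a second position j is also masked, which is large when i and j are statistically dependent. The model name and the KL proxy are assumptions made for this example.

```python
# A minimal, illustrative sketch -- NOT the paper's exact estimator.
# Pairwise dependence between positions i and j is approximated by the
# KL shift in the MLM's prediction at masked position i when position j
# is additionally masked; a spanning tree over these scores yields an
# unsupervised dependency structure.
import numpy as np
import torch
from scipy.sparse.csgraph import minimum_spanning_tree
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_logprobs(input_ids, positions):
    """Log-probs at positions[0] after replacing all `positions` with [MASK]."""
    ids = input_ids.clone()
    for p in positions:
        ids[0, p] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(ids).logits
    return torch.log_softmax(logits[0, positions[0]], dim=-1)

def dependence_matrix(input_ids):
    """Pairwise dependence scores between interior word-piece positions."""
    n = input_ids.shape[1]
    interior = list(range(1, n - 1))  # skip [CLS] and [SEP]
    D = np.zeros((len(interior), len(interior)))
    for a, i in enumerate(interior):
        base = masked_logprobs(input_ids, [i])         # i masked, j visible
        for b, j in enumerate(interior):
            if i == j:
                continue
            both = masked_logprobs(input_ids, [i, j])  # i and j both masked
            # KL(base || both): large when seeing token j changes the
            # model's belief about token i, i.e. i and j are dependent.
            D[a, b] = torch.sum(base.exp() * (base - both)).item()
    return 0.5 * (D + D.T), interior                   # symmetrize

def dependence_tree(sentence):
    """Spanning tree over the strongest pairwise dependencies."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    D, interior = dependence_matrix(input_ids)
    # Maximum spanning tree = minimum spanning tree on negated scores.
    mst = minimum_spanning_tree(-D).toarray()
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    return [(tokens[interior[a]], tokens[interior[b]])
            for a, b in zip(*mst.nonzero())]

print(dependence_tree("the quick brown fox jumps over the lazy dog"))
```

Note that this proxy requires O(n^2) forward passes per sentence; in the paper's evaluation, the resulting trees are scored against gold dependency parses via undirected unlabeled attachment score (UUAS).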

Related research

Linguistic dependencies and statistical dependence (04/18/2021)
What is the relationship between linguistic dependencies and statistical...

Pretrain on just structure: Understanding linguistic inductive biases using transfer learning (04/25/2023)
Both humans and transformer language models are able to learn language w...

Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing (09/22/2021)
In recent years, pretrained language models (PLMs) have achieved success on several...

Retraining DistilBERT for a Voice Shopping Assistant by Using Universal Dependencies (03/29/2021)
In this work, we retrained the distilled BERT language model for Walmart...

How much pretraining data do language models need to learn syntax? (09/07/2021)
Transformer-based pretrained language models achieve outstanding result...

Unsupervised Chunking with Hierarchical RNN (09/10/2023)
In Natural Language Processing (NLP), predicting linguistic structures, ...

Debiasing Masks: A New Framework for Shortcut Mitigation in NLU (10/28/2022)
Debiasing language models from unwanted behaviors in Natural Language Un...
