Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

04/14/2021
by Koustuv Sinha, et al.

A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks – including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
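The central manipulation described in the abstract is pre-training on sentences whose word order has been randomly permuted. As a rough illustration of that corpus-construction step only, the sketch below independently shuffles the words of each sentence; the function name `shuffle_sentence_words` and the toy corpus are illustrative assumptions, not the authors' code, and the paper's full pipeline is not reproduced here.

```python
import random

def shuffle_sentence_words(sentence, seed=None):
    """Return the sentence with its whitespace-delimited words randomly permuted."""
    rng = random.Random(seed)
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

# Toy example: build a word-order-ablated copy of a tiny corpus.
corpus = [
    "the cat sat on the mat",
    "masked language models capture word co-occurrence statistics",
]
shuffled_corpus = [shuffle_sentence_words(s, seed=i) for i, s in enumerate(corpus)]
for original, shuffled in zip(corpus, shuffled_corpus):
    print(original, "->", shuffled)
```

A shuffled corpus of this kind would then be used in place of the natural-order text for MLM pre-training, so that any downstream gains can only come from co-occurrence statistics rather than word order.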

