How much pretraining data do language models need to learn syntax?

09/07/2021
by Laura Perez-Mayos, et al.

Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge encoded by the models. We explore this impact on the syntactic capabilities of RoBERTa, using models pretrained on incremental amounts of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode more syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing, and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not consistently perform better across the different syntactic phenomena, and they come at a higher financial and environmental cost.
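Targeted syntactic evaluation, as used in the abstract above, typically compares the score a masked language model assigns to the grammatical and ungrammatical sentence of a minimal pair. The sketch below illustrates that idea only; it is not the authors' code. It uses the Hugging Face transformers library to compute a pseudo-log-likelihood under a RoBERTa masked LM; the checkpoint name ("roberta-base") and the subject-verb agreement pair are illustrative assumptions.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

# Illustrative checkpoint; the paper trains its own RoBERTa models on
# incremental amounts of data, which are not reproduced here.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

def pseudo_log_likelihood(sentence):
    # Sum the log-probability of each token when it is masked in turn.
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# The model captures the agreement pattern if it scores the grammatical variant higher.
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(pseudo_log_likelihood(grammatical) > pseudo_log_likelihood(ungrammatical))

Averaging this comparison over many such pairs, grouped by syntactic phenomenon, yields the kind of per-phenomenon accuracy that the paper tracks as a function of pretraining data size.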


Related research

11/10/2020
When Do You Need Billions of Words of Pretraining Data?
NLP is currently dominated by general-purpose pretrained language models...

09/22/2021
Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing
In recent years, pretrained language models (PLMs) have achieved success on several ...

02/05/2020
Parsing as Pretraining
Recent analyses suggest that encoders pretrained for language modeling c...

03/03/2023
Data-Efficient Training of CNNs and Transformers with Coresets: A Stability Perspective
Coreset selection is among the most effective ways to reduce the trainin...

04/21/2021
Improving BERT Pretraining with Syntactic Supervision
Bidirectional masked Transformers have become the core theme in the curr...

01/18/2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
Current vision language pretraining models are dominated by methods usin...

04/12/2021
On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies
We study how masking and predicting tokens in an unsupervised fashion ca...
