Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models

06/05/2023
by Manisha Mukherjee, et al.

Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are expensive to train and deploy, and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens), and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively. We compare the performance of our models with both the previous SOTA model trained exclusively on SO data as well as general-purpose BERT models and OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
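To make concrete how a domain-specific encoder like this would be applied, the sketch below wires a BERT-style checkpoint into a standard sequence-classification pipeline, as one might for the question quality prediction task. It is a minimal illustration, not the authors' released code: the checkpoint identifier, the three-class label set, and the assumption of a Hugging Face-compatible export are all hypothetical.

```python
# Minimal sketch: using a SOBert-style encoder for question quality
# prediction as ordinary sequence classification.
# Assumptions (not from the paper): the checkpoint path, a Hugging Face-
# compatible export of the model, and a three-class quality label set.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "path/or/hub-id-of-SOBertBase"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,  # e.g. {good, needs-edit, low-quality}; assumed label set
)

# SO posts mix prose and code; a long context window (2,048 tokens in the
# paper) lets most posts fit without truncation.
post = "How do I reverse a list in Python?\n\nxs = [1, 2, 3]\nxs.reverse()"

inputs = tokenizer(post, truncation=True, max_length=2048, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted quality class
```

Fine-tuning would then proceed with a standard classification loss over labeled SO posts; the abstract's claim is that the in-domain pre-trained encoder underneath, rather than the task head, is what drives the gains over general-purpose baselines.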


research · 08/16/2023
BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition
Using language models (LMs) pre-trained in a self-supervised setting on ...

research · 05/21/2022
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Multilingual language models such as mBERT have seen impressive cross-li...

research · 06/10/2021
Linguistically Informed Masking for Representation Learning in the Patent Domain
Domain-specific contextualized language models have demonstrated substan...

research · 09/12/2019
UER: An Open-Source Toolkit for Pre-training Models
Existing works, including ELMO and BERT, have revealed the importance of...

research · 01/31/2023
FLAME: A small language model for spreadsheet formulas
The widespread use of spreadsheet environments by billions of users pres...

research · 05/11/2023
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Large Language Models
A salient characteristic of large pre-trained language models (PTLMs) is...

research · 06/11/2021
Dynamic Language Models for Continuously Evolving Content
The content on the web is in a constant state of flux. New entities, iss...
