FPDM: Domain-Specific Fast Pre-training Technique using Document-Level Metadata

06/09/2023
by Abhilash Nandy, et al.

Pre-training Transformers has shown promising results on both open-domain and domain-specific downstream tasks. However, state-of-the-art Transformers require an unreasonably large amount of pre-training data and compute. In this paper, we propose FPDM (Fast Pre-training Technique using Document-Level Metadata), a novel, compute-efficient framework that uses document-level metadata and a domain-specific taxonomy as supervision signals to pre-train a transformer encoder on a domain-specific corpus. The main innovation is that during domain-specific pre-training, the open-domain encoder is continually pre-trained with sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning is done with token-level embeddings as inputs to the same encoder. We show that FPDM outperforms several transformer-based baselines in terms of character-level F1 and other automated metrics in the Customer Support, Scientific, and Legal domains, with a negligible drop in performance on open-domain benchmarks. Importantly, the novel use of document-level supervision together with sentence-level embedding inputs for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal domains, respectively. Code and datasets are available at https://bit.ly/FPDMCode.
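To make the two-phase input scheme concrete, the sketch below shows, in PyTorch, a single shared encoder that receives sentence-level embeddings during continual pre-training and ordinary token-level embeddings during fine-tuning. This is a minimal illustration of the idea as described in the abstract, not the authors' implementation: the class name SharedEncoder, the mean-pooling choice, and all hyperparameters are assumptions made for this sketch.

    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        # One encoder reused across both phases; only the *inputs* change.
        def __init__(self, hidden_dim=768, num_layers=6, num_heads=12, vocab_size=30522):
            super().__init__()
            self.token_embedding = nn.Embedding(vocab_size, hidden_dim)
            layer = nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=num_heads, batch_first=True
            )
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward_pretraining(self, sentence_embeddings):
            # sentence_embeddings: (batch, num_sentences, hidden_dim), e.g. pooled
            # outputs of a sentence encoder, so a long document becomes a short sequence.
            doc_states = self.encoder(sentence_embeddings)
            # Mean-pool into a document vector that a metadata/taxonomy head could supervise.
            return doc_states.mean(dim=1)

        def forward_finetuning(self, token_ids):
            # token_ids: (batch, seq_len) subword ids; the same encoder weights now
            # operate on token-level embeddings for the downstream task.
            return self.encoder(self.token_embedding(token_ids))

    if __name__ == "__main__":
        model = SharedEncoder()
        doc_vec = model.forward_pretraining(torch.randn(2, 40, 768))          # 2 docs, 40 sentences each
        tok_states = model.forward_finetuning(torch.randint(0, 30522, (2, 128)))
        print(doc_vec.shape, tok_states.shape)  # torch.Size([2, 768]) torch.Size([2, 128, 768])

In the paper's setup, the pooled document representation produced during pre-training would be supervised with the document metadata and taxonomy signals; that prediction head is omitted here so the sketch only shows how one set of encoder weights serves both input granularities.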

Related research:

12/15/2022 · The Effects of In-domain Corpus Size on pre-training BERT
05/16/2019 · HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
05/16/2023 · Adapting Sentence Transformers for the Aviation Domain
04/16/2021 · A Million Tweets Are Worth a Few Points: Tuning Transformers for Customer Service Tasks
05/05/2022 · Understanding Transfer Learning for Chest Radiograph Clinical Report Generation with Modified Transformer Architectures
06/03/2023 · TransDocAnalyser: A Framework for Offline Semi-structured Handwritten Document Analysis in the Legal Domain
05/20/2022 · Pre-training Transformer Models with Sentence-Level Objectives for Answer Sentence Selection
