AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain

10/26/2021
by Jimin Hong, et al.

During the fine-tuning phase of transfer learning, the pretrained vocabulary remains unchanged while the model parameters are updated. A vocabulary generated from the pretraining data is suboptimal for downstream data when there is a domain discrepancy. We propose treating the vocabulary as an optimizable parameter, which allows us to update it by adding domain-specific vocabulary selected according to a tokenization statistic. Furthermore, we prevent the embeddings of the added words from overfitting to the downstream data by using a regularization term that draws on knowledge learned by the pretrained language model. Our method achieves consistent performance improvements across diverse domains (i.e., biomedical, computer science, news, and reviews).
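The idea of expanding a vocabulary according to a tokenization statistic can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the statistic (average number of subword pieces per word, here called `fragment_score`), the greedy longest-match tokenizer, the stopping `threshold`, the `budget`, and the toy corpus are all hypothetical. The regularization of the new embeddings is not shown.

```python
from collections import Counter


def tokenize(word, vocab):
    """Greedy longest-match split of one word into pieces from vocab.

    Characters not covered by any vocab entry become single-character pieces.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in vocab or is one character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def fragment_score(words, vocab):
    """Hypothetical statistic: mean pieces per word (1.0 = nothing is split)."""
    return sum(len(tokenize(w, vocab)) for w in words) / len(words)


def expand_vocab(words, vocab, threshold=1.5, budget=100):
    """Add frequent split words until the score is at or below threshold."""
    vocab = set(vocab)
    # Candidates are downstream words the current vocabulary fragments.
    counts = Counter(w for w in words if len(tokenize(w, vocab)) > 1)
    added = []
    for word, _ in counts.most_common(budget):
        if fragment_score(words, vocab) <= threshold:
            break
        vocab.add(word)
        added.append(word)
    return vocab, added


# Toy "pretrained" vocabulary and toy downstream corpus (both invented).
base_vocab = {"the", "protein", "bind", "s", "to", "a", "recep", "tor", "kin", "ase"}
corpus = "the kinase binds to the receptor the kinase binds a protein".split()

new_vocab, added = expand_vocab(corpus, base_vocab, threshold=1.2)
print(fragment_score(corpus, base_vocab), fragment_score(corpus, new_vocab), added)
```

In this toy setting the frequent domain words are added first and expansion stops once the corpus is fragmented less than the chosen threshold, so the vocabulary grows only as far as the downstream data calls for.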
