Using Similarity Measures to Select Pretraining Data for NER

04/01/2019
by Xiang Dai, et al.

Word vectors and Language Models (LMs) pretrained on large amounts of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, how to measure the similarity between pretraining data and target task data, and how that similarity affects downstream performance, have largely been left to intuition. We propose three cost-effective measures that quantify different aspects of similarity between source pretraining data and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) across 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, although pretrained word vectors are the better choice when the pretraining data is dissimilar to the target data.
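The abstract does not spell out the three measures, but one cost-effective way to quantify lexical similarity between a pretraining corpus and target task data is vocabulary coverage: the fraction of the target task's word types that also appear in the pretraining corpus. Below is a minimal sketch of that idea; the function names, whitespace tokenisation, lowercasing, and toy corpora are my own assumptions for illustration, not the authors' implementation.

    def vocabulary(texts):
        """Collect the set of word types in a corpus.

        Whitespace tokenisation and lowercasing are assumptions here;
        the paper's preprocessing may differ.
        """
        vocab = set()
        for text in texts:
            vocab.update(text.lower().split())
        return vocab


    def target_vocab_coverage(pretraining_texts, target_texts):
        """Fraction of target-task word types that also occur in the
        pretraining corpus: 1.0 means the pretraining vocabulary fully
        covers the target vocabulary."""
        source_vocab = vocabulary(pretraining_texts)
        target_vocab = vocabulary(target_texts)
        if not target_vocab:
            return 0.0
        return len(target_vocab & source_vocab) / len(target_vocab)


    # Hypothetical toy corpora, for illustration only.
    pretraining = ["the patient was administered aspirin", "the dosage was increased"]
    target = ["aspirin dosage for the patient", "warfarin interactions"]
    print(f"coverage = {target_vocab_coverage(pretraining, target):.2f}")

Scoring several candidate pretraining corpora this way and preferring the one with the highest coverage would be one way to operationalise the data selection the title describes; the paper's actual measures may weight or normalise differently.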

