Investigating the Effectiveness of Representations Based on Pretrained Transformer-based Language Models in Active Learning for Labelling Text Datasets

04/21/2020
by   Jinghui Lu, et al.
Active learning has been shown to be an effective way to alleviate some of the effort required in utilising large collections of unlabelled data for machine learning tasks without needing to fully label them. The representation mechanism used to represent text documents when performing active learning, however, has a significant influence on how effective the process will be. While simple vector representations such as bag-of-words and embedding-based representations based on techniques such as word2vec have been shown to be effective ways to represent documents during active learning, the emergence of representation mechanisms based on the pre-trained transformer-based neural network models popular in natural language processing research (e.g. BERT) offers a promising, and as yet not fully explored, alternative. This paper describes a comprehensive evaluation of the effectiveness of representations based on pre-trained transformer-based language models for active learning. This evaluation shows that transformer-based models, especially BERT-like models that have not yet been widely used in active learning, achieve a significant improvement over more commonly used vector representations such as bag-of-words and classical word embeddings like word2vec. This paper also investigates the effectiveness of representations based on variants of BERT, such as RoBERTa and ALBERT, and compares the effectiveness of the [CLS] token representation with the aggregated representation that can be generated using BERT-like models. Finally, we propose an approach called Adaptive Tuning Active Learning. Our experiments show that the limited label information acquired in active learning can be used not only to train a classifier but also to adaptively improve the embeddings generated by BERT-like language models.
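As context for the [CLS] versus aggregated representation comparison mentioned in the abstract, the sketch below shows one common way to extract both kinds of document vectors from a BERT-like model using the Hugging Face transformers library. This is an illustrative sketch only, not the paper's exact pipeline; the model checkpoint, the `embed` helper, and the choice of mean pooling as the "aggregated" representation are assumptions.

```python
# Minimal sketch: [CLS] token embeddings vs. mean-pooled ("aggregated")
# embeddings from a BERT-like model. Checkpoint and pooling are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"  # could equally be a RoBERTa or ALBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts, pooling="cls"):
    """Return fixed-size document vectors from either the [CLS] token
    or the mean over token embeddings (with padding masked out)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    if pooling == "cls":
        return hidden[:, 0, :]                               # [CLS] token vector
    mask = batch["attention_mask"].unsqueeze(-1).float()     # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean-pooled vector

docs = ["An unlabelled document.", "Another document awaiting a label."]
cls_vectors = embed(docs, pooling="cls")
mean_vectors = embed(docs, pooling="mean")
```

Vectors produced this way can then feed a standard pool-based active learning loop (e.g. uncertainty sampling with a lightweight classifier); in an adaptive-tuning setting such as the one proposed in the paper, the labels acquired at each round would additionally be used to fine-tune the language model that produces the embeddings.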

research
10/04/2019

Investigating the Effectiveness of Representations Based on Word-Embeddings in Active Learning for Labelling Text Datasets

Manually labelling large collections of text data is a time-consuming, e...
research
10/04/2019

Investigating the Effectiveness of Word-Embedding Based Active Learning for Labelling Text Datasets

Manually labelling large collections of text data is a time-consuming, e...
research
05/10/2020

Transformer-Based Language Models for Similar Text Retrieval and Ranking

Most approaches for similar text retrieval and ranking with long natural...
research
07/11/2016

The Benefits of Word Embeddings Features for Active Learning in Clinical Information Extraction

This study investigates the use of unsupervised word embeddings and sequ...
research
04/03/2023

Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks

Event data, or structured records of “who did what to whom” that are aut...
research
04/08/2021

Deep Indexed Active Learning for Matching Heterogeneous Entity Representations

Given two large lists of records, the task in entity resolution (ER) is ...
research
06/16/2023

ActiveGLAE: A Benchmark for Deep Active Learning with Transformers

Deep active learning (DAL) seeks to reduce annotation costs by enabling ...
