A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data

06/12/2021
by   Jinghui Lu, et al.
0

Training deep learning models with limited labelled data is an attractive scenario for many NLP tasks, including document classification. While with the recent emergence of BERT, deep learning language models can achieve reasonably good performance in document classification with few labelled instances, there is a lack of evidence in the utility of applying BERT-like models on long document classification. This work introduces a long-text-specific model – the Hierarchical BERT Model (HBM) – that learns sentence-level features of the text and works well in scenarios with limited labelled data. Various evaluation experiments have demonstrated that HBM can achieve higher performance in document classification than the previous state-of-the-art methods with only 50 to 200 labelled instances, especially when documents are long. Also, as an extra benefit of HBM, the salient sentences identified by learned HBM are useful as explanations for labelling documents based on a user study.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/23/2021

CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding

Scientific document understanding is challenging as the data is highly d...
research
10/27/2020

Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports

The task of text and sentence classification is associated with the need...
research
12/21/2014

Extraction of Salient Sentences from Labelled Documents

We present a hierarchical convolutional document model with an architect...
research
01/18/2022

Hierarchical Neural Network Approaches for Long Document Classification

Text classification algorithms investigate the intricate relationships b...
research
09/09/2019

Pretrained Language Models for Sequential Sentence Classification

As a step toward better document-level understanding, we explore classif...
research
05/05/2023

HiPool: Modeling Long Documents Using Graph Neural Networks

Encoding long sequences in Natural Language Processing (NLP) is a challe...
research
04/12/2016

Efficient Classification of Multi-Labelled Text Streams by Clashing

We present a method for the classification of multi-labelled text docume...

Please sign up or login with your details

Forgot password? Click here to reset