Improving Probabilistic Models in Text Classification via Active Learning

02/05/2022
by Mitchell Bosley, et al.

When using text data, social scientists often classify documents in order to use the resulting document labels as an outcome or predictor. Since it is prohibitively costly to label a large number of documents manually, automated text classification has become a standard tool. However, current approaches to text classification do not take advantage of all the data at one's disposal. We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component, in which a human iteratively labels the documents that the algorithm is least certain about. Using text data from Wikipedia discussion pages, BBC News articles, historical US Supreme Court opinions, and human rights abuse allegations, we show that introducing information about the structure of unlabeled data and iteratively labeling uncertain documents improves performance relative to classifiers that (a) use only information from labeled data and (b) randomly decide which documents to label. This improvement comes at the cost of manually labeling a small number of documents.
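
The selection step described above, where a human labels the documents the algorithm is least certain about, can be illustrated with a small sketch. This is not the authors' model. It assumes that some classifier has already produced class probabilities for each unlabeled document, and it uses Shannon entropy as the uncertainty measure, which is one common choice among several. The function names and the probability values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one document's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(class_probs, n_to_label):
    """Return indices of the n_to_label documents with the highest entropy."""
    ranked = sorted(
        range(len(class_probs)),
        key=lambda i: entropy(class_probs[i]),
        reverse=True,
    )
    return ranked[:n_to_label]


# Hypothetical class probabilities for four unlabeled documents
# (illustrative values only, not taken from the paper).
unlabeled_probs = [
    [0.98, 0.02],  # classifier is confident
    [0.55, 0.45],  # close to the decision boundary
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain for two classes
]

# Documents to hand to the human annotator in this round.
to_label = select_most_uncertain(unlabeled_probs, n_to_label=2)
print(to_label)  # [3, 1]
```

In a full active learning loop, the newly labeled documents would be added to the labeled set, the classifier would be refit, and the selection would be repeated. The random-labeling baseline mentioned in the abstract would instead pick the indices at random.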


research
01/26/2019

The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification

Annotation of training data is the major bottleneck in the creation of t...
research
10/14/2020

Text Classification Using Label Names Only: A Language Model Self-Training Approach

Current text classification methods typically require a good number of h...
research
10/28/2022

Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

A key bottleneck in building automatic extraction models for visually ri...
research
09/25/2019

The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection

The text classification is one of the most critical areas in machine lea...
research
05/11/2018

Textual Membership Queries

Human labeling of textual data can be very time-consuming and expensive,...
research
10/08/2020

DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool

We present a lightweight annotation tool, the Data AnnotatoR Tool (DART)...
research
01/27/2016

Font Identification in Historical Documents Using Active Learning

Identifying the type of font (e.g., Roman, Blackletter) used in historic...
