Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop

04/05/2018
by Katherine Bailey, et al.

Most of the literature around text classification treats it as a supervised learning problem: given a corpus of labeled documents, train a classifier such that it can accurately predict the classes of unseen documents. In industry, however, it is not uncommon for a business to have entire corpora of documents where few or none have been classified, or where existing classifications have become meaningless. With web content, for example, poor taxonomy management can result in labels being applied indiscriminately, making filtering by these labels unhelpful. Our work aims to make it possible to classify an entire corpus of unlabeled documents using a human-in-the-loop approach: the content owner manually classifies just one or two documents per category, and the rest are classified automatically. This "few-shot" learning approach requires document representations rich enough that the manually labeled documents can serve as prototypes, so that automatically classifying the rest is simply a matter of measuring each document's distance to those prototypes. Our approach uses pre-trained word embeddings, representing each document as a weighted average of the embeddings of its constituent words. We have tested the accuracy of the approach on existing labeled datasets and report the results here. We have also made code available for reproducing our results on the 20 Newsgroups dataset.
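The core idea (prototype documents as weighted averages of word embeddings, with nearest-prototype assignment) can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the tiny random "vocabulary" stands in for a real pre-trained embedding table, and the per-word weights are a hypothetical inverse-frequency weighting.

```python
import numpy as np

# Stand-in for a pre-trained embedding lookup (e.g. GloVe/word2vec).
# Random vectors are used here purely for illustration.
rng = np.random.default_rng(0)
DIM = 50
VOCAB = {w: rng.normal(size=DIM) for w in
         ["goal", "match", "team", "stock", "market", "price"]}
# Hypothetical per-word weights (e.g. inverse document frequency).
WEIGHTS = {"goal": 1.0, "match": 0.8, "team": 0.9,
           "stock": 1.0, "market": 0.7, "price": 0.9}

def embed(doc):
    """Represent a document as a weighted average of word embeddings."""
    words = [w for w in doc.lower().split() if w in VOCAB]
    if not words:
        return np.zeros(DIM)
    return np.mean([VOCAB[w] * WEIGHTS[w] for w in words], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# One manually labeled document per category becomes that category's prototype.
prototypes = {
    "sports": embed("team match goal"),
    "finance": embed("stock market price"),
}

def classify(doc):
    """Assign the label of the closest prototype (by cosine similarity)."""
    return max(prototypes, key=lambda c: cosine(embed(doc), prototypes[c]))

print(classify("the team scored a late goal"))
```

With richer embeddings and a whole labeled example (or two) per class as the prototype source, classifying the remaining corpus reduces to the same distance computation, which is why so little labeled data is needed.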

