Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

11/05/2018
by Edward Collins, et al.

Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation, but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. The measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then using a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code (https://github.com/Wluper/edm) and datasets (http://data.wluper.com) are publicly available.
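To make the search procedure in the abstract more concrete, here is a minimal sketch of how a genetic algorithm could select which dataset statistics to sum into a difficulty measure. It is not the authors' implementation (their code is in the linked repository). Everything in it is an assumption made for illustration: the function names `pearson`, `fitness` and `evolve`, the four unnamed synthetic statistics, the synthetic model scores, and the choice of fitness (the negated Pearson correlation between the candidate measure and the model scores, so that a good measure rises as scores fall).

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def fitness(mask, stats, scores):
    """Score a candidate measure: the sum of the statistics selected by `mask`.

    Higher is better. A useful difficulty measure should go up when model
    scores go down, so the correlation is negated.
    """
    if not any(mask):
        return -1.0  # an empty measure carries no information
    measure = [sum(s for s, keep in zip(row, mask) if keep) for row in stats]
    return -pearson(measure, scores)


def evolve(stats, scores, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Toy genetic algorithm over boolean masks (hypothetical settings)."""
    rng = random.Random(seed)
    n = len(stats[0])
    population = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, stats, scores), reverse=True)
        parents = ranked[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:]
            # flip each gene with probability `mutation_rate`
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(m, stats, scores))


# Synthetic stand-in data: 40 "datasets", each described by 4 made-up statistics.
# By construction only statistics 0 and 2 influence the made-up model score.
data_rng = random.Random(42)
stats = [[data_rng.random() for _ in range(4)] for _ in range(40)]
scores = [1.0 - 0.5 * (row[0] + row[2]) for row in stats]

best = evolve(stats, scores)
print("selected statistics:", [i for i, keep in enumerate(best) if keep])
print("fitness:", round(fitness(best, stats, scores), 3))
```

In this sketch the surviving parents are carried over unchanged each generation, so the best candidate never gets worse as the search proceeds. A real search would replace the synthetic rows with statistics computed from actual datasets and the synthetic scores with measured model performance.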

