Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

03/05/2021
by Ethan Perez, et al.

We introduce a method to determine whether a given capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving us a theoretically grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability in a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, to uncovering dataset gender bias.
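The comparison described above can be illustrated with a small sketch. The abstract does not say how MDL is estimated, so the code below assumes prequential (online) coding, one standard estimator: labels are transmitted one at a time, each encoded with a model fit only to the labels sent so far. The count-based model, the function name `prequential_codelength`, and the toy data are all hypothetical and chosen for brevity; they are not the authors' setup.

```python
import math
from collections import defaultdict


def prequential_codelength(contexts, labels, num_classes, alpha=1.0):
    """Bits needed to send `labels` one at a time, predicting each label with
    an add-alpha smoothed count model conditioned on that example's context.

    Assumption: this toy count model stands in for whatever learner one would
    really use; only the online "predict, pay, then update" loop matters.
    """
    counts = defaultdict(lambda: [0] * num_classes)
    total_bits = 0.0
    for ctx, y in zip(contexts, labels):
        seen = counts[ctx]
        # Probability assigned to the true label before observing it.
        p = (seen[y] + alpha) / (sum(seen) + alpha * num_classes)
        total_bits += -math.log2(p)
        seen[y] += 1  # update the model only after paying for the label
    return total_bits


# Toy data (hypothetical): the label equals a binary feature that stands in
# for the output of the capability under study.
features = [i % 2 for i in range(200)]
labels = list(features)

# Without the capability: every example shares one uninformative context.
bits_without = prequential_codelength([None] * len(labels), labels, num_classes=2)
# With the capability: the feature is available as context.
bits_with = prequential_codelength(features, labels, num_classes=2)

print(f"without feature: {bits_without:.1f} bits")
print(f"with feature:    {bits_with:.1f} bits")
```

Under these assumptions, the feature counts as useful when `bits_with` is smaller than `bits_without`, mirroring the abstract's criterion that a subroutine is useful when the program invoking it is shorter.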

