Comparing Text Representations: A Theory-Driven Approach

09/15/2021
by Gregory Yauney, et al.

Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.
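To make the real-versus-random labeling comparison concrete, here is a minimal sketch under stated assumptions. It is not the measure used in the paper, which the abstract does not specify. As a stand-in it uses kernel-target alignment on a linear bag-of-words kernel and compares the score under the true labels with the average score under shuffled labels. The toy documents, the labels, and every function name are hypothetical.

```python
import random
from collections import Counter


def bow_vectors(docs):
    """Map each document to a bag-of-words count vector over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc.lower().split()).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vectors


def gram_matrix(vectors):
    """Linear-kernel Gram matrix: K[i][j] is the dot product of documents i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def alignment(gram, labels):
    """Kernel-target alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F).

    Assumes labels in {-1, +1}, so ||yy^T||_F equals the number of examples.
    This is a stand-in score, not the paper's measure.
    """
    n = len(labels)
    numerator = sum(
        gram[i][j] * labels[i] * labels[j] for i in range(n) for j in range(n)
    )
    frobenius = sum(g * g for row in gram for g in row) ** 0.5
    return numerator / (frobenius * n)


def real_vs_random(vectors, labels, trials=200, seed=0):
    """Return (score with real labels, mean score over shuffled labelings)."""
    rng = random.Random(seed)
    gram = gram_matrix(vectors)
    real = alignment(gram, labels)
    shuffled = list(labels)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(alignment(gram, shuffled))
    return real, sum(scores) / trials


# Hypothetical toy sentiment data, for illustration only.
docs = [
    "good great film",
    "great good acting",
    "bad awful film",
    "awful bad acting",
]
labels = [1, 1, -1, -1]

real_score, random_score = real_vs_random(bow_vectors(docs), labels)
print(f"real labels: {real_score:.3f}  shuffled labels (mean): {random_score:.3f}")
```

On this toy data the bag-of-words kernel separates the real labeling from shuffled ones because the label-bearing words co-occur within each class. A representation that scores real and shuffled labelings about equally would, in the abstract's terms, fail to distinguish them.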


