On Degrees of Freedom in Defining and Testing Natural Language Understanding

05/24/2023
by Saku Sugawara, et al.

Natural language understanding (NLU) studies often exaggerate or underestimate the capabilities of systems, thereby limiting the reproducibility of their findings. These erroneous evaluations can be attributed to the difficulty of defining and testing NLU adequately. In this position paper, we reconsider this challenge by identifying two types of researcher degrees of freedom. We revisit Turing's original interpretation of the Turing test and indicate that an NLU test does not provide an operational definition; it merely provides inductive evidence that the test subject understands the language sufficiently well to meet stakeholder objectives. In other words, stakeholders are free to arbitrarily define NLU through their objectives. To use the test results as inductive evidence, stakeholders must carefully assess whether the interpretation of test scores is valid. However, designing and using NLU tests involve other degrees of freedom, such as specifying target skills and defining evaluation metrics. As a result, achieving consensus among stakeholders becomes difficult. To resolve this issue, we propose a validity argument, which is a framework comprising a series of validation criteria across test components. By demonstrating that current practices in NLU studies can be associated with those criteria and organizing them into a comprehensive checklist, we prove that the validity argument can serve as a coherent guideline for designing credible test sets and facilitating scientific communication.

