Utility is in the Eye of the User: A Critique of NLP Leaderboards

by Kawin Ethayarajh, et al.

Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards – in their current form – can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).
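The inefficiency example in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the formalization used in the paper: the two models, their accuracy and latency figures, the linear form of the practitioner's utility, and the `latency_weight` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    accuracy: float    # fraction of correct predictions, in [0, 1]
    latency_ms: float  # inference latency per example (made-up numbers)


def leaderboard_utility(m: Model) -> float:
    # Assumption: the leaderboard only rewards accuracy; it bears no cost.
    return m.accuracy


def practitioner_utility(m: Model, latency_weight: float = 0.001) -> float:
    # Assumption: a practitioner also pays for latency. The linear penalty
    # and its weight are illustrative choices, not taken from the paper.
    return m.accuracy - latency_weight * m.latency_ms


def rank(models, utility):
    # Order model names from highest to lowest utility.
    return [m.name for m in sorted(models, key=utility, reverse=True)]


# Hypothetical models: "large" is slightly more accurate but much slower.
models = [
    Model("large", accuracy=0.91, latency_ms=400.0),
    Model("small", accuracy=0.89, latency_ms=20.0),
]

leaderboard_order = rank(models, leaderboard_utility)
practitioner_order = rank(models, practitioner_utility)
```

With these made-up numbers the two rankings disagree: the leaderboard prefers the slower model, while a latency-sensitive practitioner prefers the faster one. A practitioner can only compute their own ranking if the leaderboard reports latency in the first place, which is the kind of transparency the abstract argues for.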



Utility-Energy Efficiency Oriented User Association with Power Control in Heterogeneous Networks

This letter investigates optimizing utility-energy efficiency (UEE), def...

Preregistering NLP Research

Preregistration refers to the practice of specifying what you are going ...

How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language

More than 43% of the languages spoken in the world are endangered, and language loss currently occurs at an accelerated rate becau...

Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?

With the growing popularity of deep-learning based NLP models, comes a n...

The Cost of Training NLP Models: A Concise Overview

We review the cost of training large-scale language models, and the driv...

A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation

Recent advances in AI and ML applications have benefited from rapid prog...

The Efficiency Misnomer

Model efficiency is a critical aspect of developing and deploying machin...
