
EXPLAINABOARD: An Explainable Leaderboard for NLP

by Pengfei Liu, et al.

With the rapid development of NLP research, leaderboards have emerged as one tool for tracking the performance of various systems on various NLP tasks. They are effective to some extent, but they generally present a simplistic, one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: ExplainaBoard, which, in addition to inheriting the functionality of the standard leaderboard, allows researchers to (i) diagnose the strengths and weaknesses of a single system (e.g., what is the best-performing system bad at?), (ii) interpret relationships between multiple systems (e.g., where does system A outperform system B? What if we combine systems A, B, and C?), and (iii) examine prediction results closely (e.g., what are common errors made by multiple systems, and in what contexts do particular errors occur?). ExplainaBoard has been deployed at <>, and we have additionally released our interpretable evaluation code at <> along with output files from more than 300 systems, 40 datasets, and 9 tasks to motivate "output-driven" research in the future.
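The three kinds of analysis listed in the abstract can be illustrated with a small sketch. This is not ExplainaBoard's code or API: the function names, the length-based bucketing rule, and the toy labels below are all hypothetical, chosen only to show what single-system diagnosis, pairwise comparison, and common-error inspection might look like when computed directly from system output files.

```python
from collections import defaultdict

def bucket_accuracy(golds, preds, bucket_of):
    """Accuracy per bucket; bucket_of maps an example index to a bucket label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for i, (gold, pred) in enumerate(zip(golds, preds)):
        bucket = bucket_of(i)
        total[bucket] += 1
        correct[bucket] += int(gold == pred)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}

def pairwise_gap(acc_a, acc_b):
    """Per-bucket accuracy difference (A minus B) on buckets both systems share."""
    return {b: acc_a[b] - acc_b[b] for b in acc_a if b in acc_b}

def common_errors(golds, system_preds):
    """Indices of examples that every system gets wrong."""
    return [
        i for i, gold in enumerate(golds)
        if all(preds[i] != gold for preds in system_preds.values())
    ]

# Toy data (hypothetical): sentence lengths, gold labels, and two systems' outputs.
lengths = [3, 12, 4, 15, 5, 11]
golds = ["pos", "neg", "pos", "neg", "neg", "pos"]
system_preds = {
    "A": ["pos", "neg", "pos", "pos", "neg", "neg"],
    "B": ["pos", "pos", "neg", "pos", "neg", "pos"],
}

# Assumed bucketing attribute: sentence length, split at 10 tokens.
def by_length(i):
    return "short" if lengths[i] < 10 else "long"

# (i) Single-system diagnosis: where is each system weak?
acc_a = bucket_accuracy(golds, system_preds["A"], by_length)
acc_b = bucket_accuracy(golds, system_preds["B"], by_length)

# (ii) Pairwise comparison: where does A outperform B?
gap = pairwise_gap(acc_a, acc_b)

# (iii) Error inspection: which examples do all systems miss?
shared = common_errors(golds, system_preds)

print(acc_a, acc_b, gap, shared)
```

In this toy setup, holistic accuracy alone would hide the fact that both systems do worse on long inputs than on short ones, which is the kind of pattern a per-bucket view is meant to expose.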


