CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

by Subhabrata Mukherjee, et al.

Much of the recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, and SQuAD. Many NLU models have now matched or exceeded "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled training data, far more than humans need to achieve strong performance. This has motivated a line of work focused on improving the few-shot learning performance of NLU models. However, there is no standardized evaluation benchmark for few-shot NLU, so different papers use different experimental settings. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a large performance gap in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few-shot setting. Finally, we discuss several principles and choices in designing experimental settings for evaluating true few-shot learning performance, and we suggest a unified, standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks from a small number of examples. Code and data for CLUES are available at
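The abstract argues for a unified protocol for measuring true few-shot performance. A minimal sketch, assuming a protocol in which several K-shot training sets are drawn at random (with a fixed seed) from a labeled pool, each trained model is scored on a fixed test set, and the result is reported as mean and standard deviation across splits. The function names, the `train_and_score` callback, and the majority-class baseline are hypothetical illustrations, not the CLUES code or its exact settings.

```python
import random
import statistics
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical example type: (input text, gold label). Not the CLUES data format.
Example = Tuple[str, str]


def sample_few_shot_splits(pool: Sequence[Example], k: int,
                           num_splits: int, seed: int = 0) -> List[List[Example]]:
    """Draw `num_splits` independent K-shot training sets from `pool`.

    A fixed seed makes the splits reproducible, so every model is
    compared on the same few-shot training data (assumed protocol).
    """
    if k > len(pool):
        raise ValueError("k cannot exceed the size of the labeled pool")
    rng = random.Random(seed)
    return [rng.sample(list(pool), k) for _ in range(num_splits)]


def evaluate_few_shot(pool: Sequence[Example], test_set: Sequence[Example],
                      train_and_score: Callable[[List[Example], Sequence[Example]], float],
                      k: int, num_splits: int, seed: int = 0) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the test score across K-shot splits."""
    splits = sample_few_shot_splits(pool, k, num_splits, seed)
    scores = [train_and_score(split, test_set) for split in splits]
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), std


def majority_baseline(train: List[Example], test: Sequence[Example]) -> float:
    """Toy stand-in for a real model: predict the most frequent training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)


if __name__ == "__main__":
    pool = [(f"text {i}", "pos" if i % 2 == 0 else "neg") for i in range(100)]
    test_set = [(f"test {i}", "pos") for i in range(10)]
    mean, std = evaluate_few_shot(pool, test_set, majority_baseline, k=10, num_splits=5)
    print(f"10-shot accuracy: {mean:.2f} +/- {std:.2f}")
```

Reporting the spread across several sampled splits, rather than a single run, makes visible how sensitive a few-shot result is to which handful of examples happened to be chosen.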


FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark

Pretrained Language Models (PLMs) have achieved tremendous success in na...

Low-Shot Learning for Fictional Claim Verification

In this paper, we study the problem of claim verification in the context...

FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding

Few-shot learning (FSL) is one of the key future steps in machine learn...

Ranking Creative Language Characteristics in Small Data Scenarios

The ability to rank creative natural language provides an important gene...

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding

The few-shot natural language understanding (NLU) task has attracted muc...

True Few-Shot Learning with Prompts – A Real-World Perspective

Prompt-based approaches are strong at few-shot learning. However, Perez ...

Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?

Numerous benchmarks for Few-Shot Learning have been proposed in the last...
