Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

06/23/2023
by Neel Jain, et al.

With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative. For example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. Current evaluations approach this problem using small, domain-specific datasets with human-curated labels. These evaluation sets are often sampled from a narrow and simplified distribution, and data sources can unknowingly be leaked into the training set, which can lead to misleading evaluations. To bypass these drawbacks, we propose a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on the input text. Self-supervised evaluation can directly monitor LLM behavior on datasets collected in the wild or streamed during live model deployment. We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. When comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. The self-supervised paradigm complements current evaluation strategies that rely on labeled data.
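
The idea of measuring sensitivity or invariance to an input transformation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify a metric, so the mean absolute score change, the helper name `sensitivity`, the two toy scorers standing in for a language model, and the uppercase transformation are all hypothetical, not the authors' method. In practice `score_fn` would be some quantity computed from a real model, such as perplexity on the text.

```python
from statistics import mean
from typing import Callable, Iterable

# Toy vocabulary used by the stand-in scorers below (hypothetical).
VOCAB = {"the", "cat", "sat", "a", "dog", "ran", "home"}


def sensitivity(
    texts: Iterable[str],
    transform: Callable[[str], str],
    score_fn: Callable[[str], float],
) -> float:
    """Mean absolute change in score_fn when each text is transformed.

    Values near 0 suggest invariance to the transformation; larger
    values suggest sensitivity. No labels are needed, only raw text.
    This aggregate is an illustrative assumption, not the paper's metric.
    """
    diffs = [abs(score_fn(transform(t)) - score_fn(t)) for t in texts]
    if not diffs:
        raise ValueError("texts must be non-empty")
    return mean(diffs)


def case_sensitive_score(text: str) -> float:
    """Stand-in 'model': fraction of words found verbatim in VOCAB."""
    words = text.split()
    return sum(w in VOCAB for w in words) / len(words)


def case_insensitive_score(text: str) -> float:
    """Stand-in 'model' that lowercases words before the lookup."""
    words = text.split()
    return sum(w.lower() in VOCAB for w in words) / len(words)


# Unlabeled texts, as might be collected in the wild.
texts = ["the cat sat", "a dog ran home"]

# The transformation here is uppercasing; any text-to-text function works.
sensitive = sensitivity(texts, str.upper, case_sensitive_score)
invariant = sensitivity(texts, str.upper, case_insensitive_score)
print(sensitive, invariant)
```

The first scorer changes its output on every uppercased text while the second does not, so the same unlabeled data separates the two behaviors without any human annotation.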
