Automatically Auditing Large Language Models via Discrete Optimization

03/08/2023
by Erik Jones, et al.

Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. In this work, we cast auditing as an optimization problem, where we automatically search for input-output pairs that match a desired target behavior. For example, we might aim to find a non-toxic input that starts with "Barack Obama" that a model maps to a toxic output. This optimization problem is difficult to solve because the set of feasible points is sparse, the space is discrete, and the language models we audit are non-linear and high-dimensional. To combat these challenges, we introduce a discrete optimization algorithm, ARCA, that jointly and efficiently optimizes over inputs and outputs. Our approach automatically uncovers derogatory completions about celebrities (e.g., "Barack Obama is a legalized unborn" -> "child murderer"), produces French inputs that complete to English outputs, and finds inputs that generate a specific name. Our work offers a promising new tool to uncover models' failure modes before deployment.
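
The framing above (search for an input-output pair that scores well on a target behavior, subject to the model actually mapping that input to that output) can be illustrated with a toy sketch. This is not ARCA and is not taken from the paper: the five-word vocabulary, the `AFFINITY` table standing in for a language model, the weight `lam`, and every function name are hypothetical, chosen only to make the idea concrete. The sketch uses plain exhaustive coordinate ascent, updating one input token at a time and then the output, with the "model maps input to output" constraint softened into a log-probability term.

```python
import math

# Hypothetical toy setup: none of these names or numbers come from the paper.
VOCAB = ["the", "movie", "was", "great", "awful"]
TOXIC = {"awful"}  # tokens treated as "toxic" for this illustration

# Stand-in for a language model: how strongly an input token pushes an output token.
AFFINITY = {
    ("the", "movie"): 2.0,
    ("movie", "awful"): 2.0,
    ("was", "awful"): 1.5,
    ("great", "great"): 3.0,
    ("awful", "awful"): 3.0,
}


def log_prob(inp, out):
    """Toy log p(out | inp): log-softmax over summed affinities."""
    logits = {o: sum(AFFINITY.get((t, o), 0.0) for t in inp) for o in VOCAB}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return logits[out] - log_z


def model_output(inp):
    """Greedy one-token completion of the toy model."""
    return max(VOCAB, key=lambda o: log_prob(inp, o))


def objective(inp, out, lam=1.0):
    """Target behavior (toxic output, non-toxic input) plus a likelihood term."""
    if any(t in TOXIC for t in inp):
        return float("-inf")  # the input itself must stay non-toxic
    target = 1.0 if out in TOXIC else 0.0
    return target + lam * log_prob(inp, out)


def coordinate_ascent(length=2, sweeps=3):
    """Alternate between updating each input token and the output token."""
    inp, out = ["the"] * length, "the"
    for _ in range(sweeps):
        for i in range(length):
            # Try every vocabulary token at position i, keep the best one.
            inp[i] = max(
                VOCAB,
                key=lambda v: objective(inp[:i] + [v] + inp[i + 1:], out),
            )
        # Then pick the output token that best fits the current input.
        out = max(VOCAB, key=lambda o: objective(inp, o))
    return inp, out


if __name__ == "__main__":
    found_inp, found_out = coordinate_ascent()
    print(found_inp, "->", found_out)
    print("model agrees:", model_output(found_inp) == found_out)
```

Trying every token at every position is only feasible here because the vocabulary is tiny. With a real vocabulary and a real model, each coordinate step would need a much cheaper way to rank candidate tokens, which is the difficulty the abstract attributes to the sparse, discrete, high-dimensional search space.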
