Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content

08/26/2023
by Charles O'Neill, et al.

In this paper, we tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs) with a novel dual-stage optimisation technique using adversarial fine-tuning. Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts. In this adversarial cycle, the two models seek to outperform each other in the prompting phase, generating a dataset of rich examples which are then used for fine-tuning. This iterative application of prompting and fine-tuning allows continuous refinement and improved performance. The performance of our approach is evaluated through classification accuracy on a dataset consisting of problematic prompts not detected by GPT-4, as well as a selection of contentious but unproblematic prompts. We show a considerable increase in the classification accuracy of the judge model on this challenging dataset as it undergoes the optimisation process. Furthermore, we show that a rudimentary model can achieve 13% higher accuracy on the hold-out test set than GPT-4 after only a few rounds of this process, and that this fine-tuning improves performance on parallel tasks such as toxic comment identification.
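The abstract describes the adversarial cycle only at a high level. The sketch below is a minimal, hypothetical illustration of how one round of such a loop could be structured: the adversary proposes prompts, the judge classifies them, and each model is then fine-tuned on the examples where it lost the round. All names here (Model, generate_candidate_prompts, judge_labels, ground_truth) are placeholders of our own, not the authors' implementation or any specific fine-tuning API.

```python
# Hypothetical sketch of the dual-stage adversarial fine-tuning loop described
# in the abstract. Every helper below is a stand-in, not the authors' code.

from dataclasses import dataclass, field


@dataclass
class Model:
    """Stand-in for a fine-tunable LLM (either the adversary or the judge)."""
    name: str
    examples: list = field(default_factory=list)

    def fine_tune(self, new_examples):
        # Placeholder: in practice this would call an LLM fine-tuning routine.
        self.examples.extend(new_examples)


def generate_candidate_prompts(adversary: Model, n: int) -> list[str]:
    # Placeholder: the adversary is prompted / fine-tuned to emit potentially
    # harmful prompts that try to slip past the current judge.
    return [f"candidate prompt {i} from {adversary.name}" for i in range(n)]


def judge_labels(judge: Model, prompts: list[str]) -> list[bool]:
    # Placeholder: the judge classifies each prompt as problematic or not.
    return [False for _ in prompts]  # a weak judge initially misses everything


def ground_truth(prompts: list[str]) -> list[bool]:
    # Placeholder for a labelling stage that decides which prompts really are problematic.
    return [True for _ in prompts]


def adversarial_round(adversary: Model, judge: Model, n_prompts: int = 100) -> int:
    """One cycle of prompting followed by fine-tuning."""
    prompts = generate_candidate_prompts(adversary, n_prompts)
    predictions = judge_labels(judge, prompts)
    truth = ground_truth(prompts)

    # Prompts the judge misclassified become training data for the judge;
    # prompts that slipped past the judge reinforce the adversary.
    judge_data = [(p, t) for p, pred, t in zip(prompts, predictions, truth) if pred != t]
    adversary_data = [p for p, pred, t in zip(prompts, predictions, truth) if t and not pred]

    judge.fine_tune(judge_data)
    adversary.fine_tune(adversary_data)
    return len(judge_data)


if __name__ == "__main__":
    adversary = Model("adversary")
    judge = Model("judge")
    for round_idx in range(5):  # a few rounds, as in the abstract
        n_hard = adversarial_round(adversary, judge)
        print(f"round {round_idx}: judge fine-tuned on {n_hard} hard examples")
```

In the paper's setting, fine_tune would correspond to actual model fine-tuning on the newly collected prompts and ground_truth to a labelling step; both are stubbed out here so that only the loop structure is shown.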

