The Capacity for Moral Self-Correction in Large Language Models

02/15/2023
by   Deep Ganguli, et al.
0

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" – to avoid producing harmful outputs – if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

READ FULL TEXT
research
08/03/2023

Does Correction Remain A Problem For Large Language Models?

As large language models, such as GPT, continue to advance the capabilit...
research
03/04/2022

Training language models to follow instructions with human feedback

Making language models bigger does not inherently make them better at fo...
research
05/23/2023

Aligning Large Language Models through Synthetic Feedback

Aligning large language models (LLMs) to human values has become increas...
research
08/06/2023

Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

Large language models (LLMs) have demonstrated remarkable performance ac...
research
02/28/2021

Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

When trained on large, unfiltered crawls from the internet, language mod...
research
09/19/2023

Evaluating large language models' ability to understand metaphor and sarcasm using a screening test for Asperger syndrome

Metaphors and sarcasm are precious fruits of our highly-evolved social c...
research
02/02/2023

Conditioning Predictive Models: Risks and Strategies

Our intention is to provide a definitive reference on what it would take...

Please sign up or login with your details

Forgot password? Click here to reset