Prompting GPT-3 To Be Reliable

10/17/2022
by Chenglei Si, et al.

Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI's GPT-3 further increase their use in real-world language applications. However, existing research focuses on models' accuracy on standard benchmarks and largely ignores their reliability, which is crucial for avoiding catastrophic real-world harms. While reliability is a broad and vaguely defined term, this work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We establish simple and effective prompts that demonstrate GPT-3's reliability in these four aspects: it can 1) generalize out-of-domain, 2) balance demographic distributions to reduce social biases, 3) calibrate language model probabilities, and 4) update its knowledge. We find that, with appropriate prompts, GPT-3 outperforms smaller-scale supervised models by large margins on all these facets. We release all processed datasets, evaluation scripts, and model predictions to facilitate future analysis. Our findings not only shed new light on the reliability of prompting LLMs, but, more importantly, our prompting strategies can help practitioners use large language models like GPT-3 more reliably.
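To make the calibration facet concrete: a common way to measure whether a model's probabilities are well calibrated is Expected Calibration Error (ECE), which compares average confidence to actual accuracy within confidence bins. The sketch below is illustrative only (it is not the paper's released evaluation code); function and variable names are our own.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE computation.

    confidences: model's predicted probability for its chosen answer (0..1).
    correct: 1 if that answer was right, else 0.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Collect predictions whose confidence falls in this bin (half-open,
        # with 0.0 included in the first bin).
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        # Weight the |confidence - accuracy| gap by the bin's share of data.
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model (e.g. 100% confidence, 100% accuracy) scores 0.0; a model that says 0.9 but is right only half the time scores 0.4. Lower is better.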


