Large language models can accurately predict searcher preferences

09/19/2023
by Paul Thomas, et al.

Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels systematically disagree with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternative approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops a large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found that large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
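The evaluation described above hinges on measuring how well a labeller (human or LLM) agrees with first-party gold labels. As a minimal sketch, one standard chance-corrected agreement measure is Cohen's kappa; the `cohens_kappa` helper and the toy graded-relevance labels below are illustrative, not taken from the paper:

```python
from collections import Counter

def cohens_kappa(gold, predicted):
    """Chance-corrected agreement between two label sequences."""
    assert len(gold) == len(predicted) and gold
    n = len(gold)
    # Raw fraction of items where the two labellers agree.
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    # Agreement expected by chance, from each labeller's marginal label frequencies.
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    expected = sum(gold_freq[c] * pred_freq.get(c, 0) for c in gold_freq) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical graded relevance labels (0 = not relevant, 2 = highly relevant).
gold = [2, 0, 1, 2, 0, 1, 2, 0]           # careful first-party feedback
llm  = [2, 0, 1, 2, 0, 2, 2, 0]           # labels a prompted model might return
print(round(cohens_kappa(gold, llm), 3))  # → 0.805
```

The same statistic can be computed for third-party labellers, letting the two sources of labels be compared against the same gold data, as the paper's methodology requires.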

