UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

03/01/2023
by Jon Saad-Falcon, et al.

Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using large language models (LLMs) to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. A much less expensive LLM is then used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains, even where only 2K synthetic queries are used for fine-tuning, and that it achieves substantially lower latency than standard reranking methods. We make our end-to-end approach, including our synthetic datasets and replication code, publicly available on GitHub.
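The two-stage query generation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the names `seed_queries`, `build_prompt`, `bulk_queries`, and `SyntheticPair` are hypothetical, the prompt wording is invented, and the choice to reuse the seed queries as few-shot demonstrations for the cheaper model is an assumption that the abstract does not spell out. Both LLMs are represented as plain callables from prompt text to generated text, so no particular API is implied. Reranker fine-tuning and distillation are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical interface: any function mapping a prompt to generated text.
Generate = Callable[[str], str]


@dataclass(frozen=True)
class SyntheticPair:
    """A synthetic query paired with the passage it was generated from."""
    query: str
    passage: str


def seed_queries(passages: Sequence[str], expensive_llm: Generate,
                 n_seed: int) -> List[SyntheticPair]:
    """Stage 1: generate a small number of queries with the expensive model."""
    pairs = []
    for passage in passages[:n_seed]:
        # Illustrative prompt wording, not taken from the paper.
        prompt = f"Write a search query answered by this passage:\n{passage}"
        pairs.append(SyntheticPair(expensive_llm(prompt).strip(), passage))
    return pairs


def build_prompt(examples: Sequence[SyntheticPair], passage: str) -> str:
    """Assumption: seed pairs serve as few-shot demonstrations."""
    shots = "\n\n".join(
        f"Passage: {e.passage}\nQuery: {e.query}" for e in examples
    )
    return f"{shots}\n\nPassage: {passage}\nQuery:"


def bulk_queries(passages: Sequence[str], seeds: Sequence[SyntheticPair],
                 cheap_llm: Generate) -> List[SyntheticPair]:
    """Stage 2: generate one query per passage with the cheaper model."""
    return [
        SyntheticPair(cheap_llm(build_prompt(seeds, p)).strip(), p)
        for p in passages
    ]


if __name__ == "__main__":
    # Toy stand-ins for real models, used only to show the data flow.
    corpus = ["Passage about solar panels.", "Passage about wind turbines."]
    seeds = seed_queries(corpus, lambda prompt: "example seed query", n_seed=1)
    pairs = bulk_queries(corpus, seeds, lambda prompt: "example bulk query")
    for pair in pairs:
        print(pair)
```

In this sketch, the resulting query-passage pairs would be the training data for the rerankers that are later distilled into a single retriever.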


