Improving Retrieval-Augmented Large Language Models via Data Importance Learning

07/06/2023
by Xiaozhong Lyu, et al.

Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. Although the multilinear extension contains exponentially many terms, a key contribution of this paper is a polynomial-time algorithm that, given a retrieval-augmented model with an additive utility function and a validation set, exactly computes the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further propose an even more efficient (ϵ, δ)-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by merely pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
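To make the core idea concrete, here is a minimal sketch (not the paper's exact polynomial-time algorithm) of the Monte Carlo view behind the (ϵ, δ)-approximation: each data point's importance is the partial derivative of the multilinear extension, i.e., the expected marginal utility of including that point in a subset sampled according to the current weights. The `utility` function and all names below are illustrative assumptions, not APIs from the paper.

```python
import random

def monte_carlo_importance(corpus, utility, weights, num_samples=200, seed=0):
    """Estimate each data point's importance under the multilinear extension.

    The multilinear extension of a utility function U is
    E_S[U(S)], where subset S includes point j independently with
    probability weights[j]. The importance of point i is the partial
    derivative with respect to weights[i], which equals
    E_S[U(S + {i}) - U(S - {i})]; we estimate it by sampling.
    """
    rng = random.Random(seed)
    n = len(corpus)
    scores = [0.0] * n
    for _ in range(num_samples):
        # Sample a random subset according to the inclusion weights.
        mask = [rng.random() < weights[j] for j in range(n)]
        for i in range(n):
            # Marginal contribution of point i to this sampled subset.
            with_i = [corpus[j] for j in range(n) if mask[j] or j == i]
            without_i = [corpus[j] for j in range(n) if mask[j] and j != i]
            scores[i] += utility(with_i) - utility(without_i)
    return [s / num_samples for s in scores]

# Toy additive utility (an assumption for illustration): how many
# validation answers are covered by at least one retrieved document.
validation_answers = {"paris", "berlin"}
corpus = ["paris is the capital", "irrelevant noise", "facts about berlin"]

def utility(subset):
    return sum(any(a in doc for doc in subset) for a in validation_answers)

scores = monte_carlo_importance(corpus, utility, weights=[0.5] * 3)
# The two relevant documents get high importance; the noise document gets
# zero, so pruning or down-weighting it cannot hurt this utility.
```

Points whose estimated importance is non-positive are candidates for pruning from the retrieval corpus; the paper's contribution is doing this exactly and efficiently for additive utilities rather than by brute-force sampling as above.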

