Koala: An Index for Quantifying Overlaps with Pre-training Corpora

03/26/2023
by Thuy-Trang Vu, et al.

In recent years, increasing attention has been paid to probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora built on compressed suffix arrays, which offer a highly efficient compression rate and fast search support. In its first release, Koala indexes the publicly available portion of the OPT 175B pre-training data. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in LLM outputs. Koala is available for public use at https://koala-index.erc.monash.edu/.
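
The core operation such an index supports is exact substring lookup: given a query (for example, a benchmark example or a model output), find how often it occurs verbatim in the indexed corpus. The sketch below illustrates that idea with a plain, uncompressed suffix array and binary search. It is a minimal illustration of the underlying technique only, not the Koala implementation, which relies on compressed suffix arrays to scale to pre-training-sized corpora; the function names and the toy corpus are hypothetical.

```python
# Minimal sketch of suffix-array-based exact-match lookup (illustrative only;
# Koala itself uses *compressed* suffix arrays to handle pre-training-scale data).
# Requires Python 3.10+ for the `key=` argument of bisect.
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Starting offsets of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], query: str) -> int:
    """Count exact occurrences of `query` in `text` via binary search on the suffix array."""
    # Compare each suffix only on its first len(query) characters, so every suffix
    # that starts with `query` maps to the same key and forms one contiguous range.
    key = lambda i: text[i:i + len(query)]
    lo = bisect_left(sa, query, key=key)
    hi = bisect_right(sa, query, key=key)
    return hi - lo


# Hypothetical toy corpus standing in for pre-training data.
corpus = "the cat sat on the mat. the cat slept on the sofa."
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, "the cat"))  # -> 2
print(count_occurrences(corpus, sa, "the dog"))  # -> 0
```

For quantifying overlap, the two binary searches alone give the occurrence count; locating the matching contexts would additionally read the corpus around each offset in `sa[lo:hi]`.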
