A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation

09/15/2020
by   Moin Nadeem, et al.

This work studies the widely adopted ancestral sampling algorithms for auto-regressive language models, which have not been systematically studied in the literature. We use the quality-diversity (Q-D) trade-off to investigate three popular sampling algorithms (top-k, nucleus, and tempered sampling), focusing on the task of open-ended language generation. We first show that the existing sampling algorithms have similar performance. After carefully inspecting the transformations defined by the different sampling algorithms, we identify three key properties that they share: entropy reduction, order preservation, and slope preservation. To validate the importance of these properties, we design two sets of new sampling algorithms: one in which each algorithm satisfies all three properties, and one in which each algorithm violates at least one of them. We compare their performance with that of the existing sampling algorithms, and find that violating the identified properties can lead to drastic performance degradation, as measured by the Q-D trade-off. On the other hand, the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms. Our data and code are available at https://github.com/moinnadeem/characterizing-sampling-algorithms
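The three transformations compared in the abstract each reshape the model's next-token distribution before ancestral sampling. A minimal sketch of them, assuming the distribution is given as a plain list of probabilities (function names and the helper structure are illustrative, not the paper's implementation):

```python
import math

def top_k(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    kept = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    trunc = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(trunc)
    return [p / total for p in trunc]

def nucleus(probs, top_p):
    """Keep the smallest high-probability prefix whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    trunc = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(trunc)
    return [p / total for p in trunc]

def tempered(probs, t):
    """Rescale log-probabilities by 1/t; t < 1 sharpens the distribution.

    p^(1/t) is equivalent to softmax(logits / t) up to normalization.
    """
    scaled = [p ** (1.0 / t) for p in probs]
    total = sum(scaled)
    return [p / total for p in scaled]
```

All three illustrate the shared properties the paper identifies: the output distribution has lower entropy than the input (entropy reduction), tokens keep their relative ranking (order preservation), and the shape among surviving tokens is not distorted beyond a rescaling (slope preservation).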



Code Repositories

characterizing-sampling-algorithms

The official codebase for "A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation"