
-
When Do You Need Billions of Words of Pretraining Data?
NLP is currently dominated by general-purpose pretrained language models...
-
Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options
Large-scale natural language inference (NLI) datasets such as SNLI or MN...
-
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
One reason pretraining on self-supervised linguistic tasks is effective ...
-
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
A growing body of work shows that models exploit annotation artifacts to...
-
Precise Task Formalization Matters in Winograd Schema Evaluations
Performance on the Winograd Schema Challenge (WSC), a respected English ...
-
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
Pretrained language models, especially masked language models (MLMs) hav...
-
Can neural networks acquire a structural bias from raw linguistic data?
We evaluate whether BERT, a widely used neural network for sentence proc...
-
Self-Training for Unsupervised Parsing with PRPN
Neural unsupervised parsing (UP) models learn to parse without access to...
-
English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too
Intermediate-task training has been shown to substantially improve pretr...
-
Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?
While pretrained models such as BERT have shown large gains across natur...
-
Learning to Learn Morphological Inflection for Resource-Poor Languages
We propose to cast the task of morphological inflection - mapping a lemm...
-
Collecting Entailment Data for Pretraining: New Protocols and Negative Results
Textual entailment (or NLI) data has proven useful as pretraining data f...
-
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
We introduce jiant, an open source toolkit for conducting multitask and ...
-
BLiMP: A Benchmark of Linguistic Minimal Pairs for English
We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLi...
-
Do Attention Heads in BERT Track Syntactic Dependencies?
We investigate the extent to which individual attention heads in pretrai...
-
Inducing Constituency Trees through Neural Machine Translation
Latent tree learning (LTL) methods learn to parse sentences using only in...
-
Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs
Though state-of-the-art sentence representation models can perform tasks...
-
Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set
Development sets are impractical to obtain for real low-resource languag...
-
Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark
The GLUE benchmark (Wang et al., 2019b) is a suite of language understan...
-
What do you learn from context? Probing for sentence structure in contextualized word representations
Contextualized representation models such as ELMo (Peters et al., 2018a)...
-
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
In the last year, new models and methods for pretraining and transfer le...
-
Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
We introduce a set of nine challenge tasks that test for the understandi...
-
Identifying and Reducing Gender Bias in Word-Level Language Models
Many text corpora exhibit socially problematic biases, which can be prop...
-
On Measuring Social Biases in Sentence Encoders
The Word Embedding Association Test shows that GloVe and word2vec word e...
-
Grammatical Analysis of Pretrained Sentence Encoders with Acceptability Judgments
Recent pretrained sentence encoders achieve state-of-the-art results on ...
-
Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling
Work on the problem of contextualized word representation -- the develop...
-
Verb Argument Structure Alternations in Word and Sentence Embeddings
Verbs occur in different syntactic environments, or frames. We investiga...
-
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
Pretraining with language modeling and related unsupervised tasks has re...
-
Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis
Recent work using auxiliary prediction task classifiers to investigate t...
-
XNLI: Evaluating Cross-lingual Sentence Representations
State-of-the-art natural language processing systems rely on supervision...
-
Grammar Induction with Neural Language Models: An Unusual Replication
A substantial thread of recent work on latent tree learning has attempte...
-
Neural Network Acceptability Judgments
In this work, we explore the ability of artificial neural networks to ju...
-
A Stable and Effective Learning Strategy for Trainable Greedy Decoding
As a widely used approximate search strategy for neural network decoders...
-
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
For natural language understanding (NLU) technology to be maximally usef...
-
ListOps: A Diagnostic Dataset for Latent Tree Learning
Latent tree learning models learn to parse a sentence without syntactic ...
-
Training a Ranking Function for Open-Domain Question Answering
In recent years, there have been amazing advances in deep learning metho...
-
Annotation Artifacts in Natural Language Inference Data
Large-scale datasets for natural language inference are created by prese...
-
The Lifted Matrix-Space Model for Semantic Composition
Recent advances in tree structured sentence encoding models have shown t...
-
Learning to parse from a semantic objective: It works. Is it syntax?
Recent work on reinforcement learning and other gradient estimators for ...
-
The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations
This paper presents the results of the RepEval 2017 Shared Task, which e...
-
Sequential Attention: A Context-Aware Alignment Function for Machine Reading
In this paper we propose a neural network model with a novel Sequential ...
-
Ruminating Reader: Reasoning with Gated Multi-Hop Attention
To answer questions in the machine comprehension (MC) task, the models ne...
-
Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning
This work presents a novel objective function for the unsupervised train...
-
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
This paper introduces the Multi-Genre Natural Language Inference (MultiN...
-
A Fast Unified Model for Parsing and Sentence Understanding
Tree-structured neural networks exploit valuable syntactic parse informa...
-
Generating Sentences from a Continuous Space
The standard recurrent neural network language model (RNNLM) generates s...
-
A large annotated corpus for learning natural language inference
Understanding entailment and contradiction is fundamental to understandi...
-
Recursive Neural Networks Can Learn Logical Semantics
Tree-structured recursive neural networks (TreeRNNs) for sentence meanin...