Modern language models can imitate complex patterns through few-shot learning...
Large language models trained for safety and harmlessness remain suscept...
As the scale of machine learning models increases, trends such as scalin...
Deployed multimodal systems can fail in ways that evaluators did not anticipate...
For content recommender systems such as TikTok and YouTube, the platform...
We analyze transformers from the perspective of iterative inference, see...
Auditing large language models for unexpected behaviors is critical to p...
Mining large corpora can generate useful discoveries but is time-consumi...
Specifying reward functions for complex tasks like object manipulation o...
Neural networks often exhibit emergent behavior, where qualitatively new...
Existing techniques for training language models can be misaligned with ...
Research in mechanistic interpretability seeks to explain behaviors of m...
In recent years, deep neural networks have demonstrated increasingly str...
Forecasting future world events is a challenging but valuable task. Fore...
Transparency methods such as model visualizations provide information th...
Digital recommender systems such as Spotify and Netflix affect not only...
Large language models generate complex, open-ended outputs: instead of o...
We propose a metric – Projection Norm – to predict a model's performance...
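As a rough illustration only: one way a projection-norm-style metric can be computed is to pseudo-label the unlabeled shifted test set with the trained model, fine-tune a fresh copy of the model on those pseudo-labels, and report how far the parameters move, on the assumption that larger movement signals a larger performance drop. The sketch below uses placeholder choices throughout (the model constructor, optimizer, step count, and plain L2 parameter distance are illustrative assumptions, not necessarily the paper's exact procedure).

```python
# Illustrative ProjNorm-style sketch; names and hyperparameters are placeholders.
import torch
import torch.nn as nn

def projection_norm_sketch(model_ctor, trained_state, ood_inputs, lr=1e-2, steps=50):
    # 1) Pseudo-label the unlabeled shifted inputs with the trained model.
    base = model_ctor()
    base.load_state_dict(trained_state)
    base.eval()
    with torch.no_grad():
        pseudo_labels = base(ood_inputs).argmax(dim=1)

    # 2) Fine-tune a fresh copy (same starting weights) on the pseudo-labels.
    ref = model_ctor()
    ref.load_state_dict(trained_state)
    opt = torch.optim.SGD(ref.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(ref(ood_inputs), pseudo_labels).backward()
        opt.step()

    # 3) Report the L2 distance between original and pseudo-label-tuned parameters.
    sq = 0.0
    for p0, p1 in zip(base.parameters(), ref.parameters()):
        sq += (p0.detach() - p1.detach()).pow(2).sum().item()
    return sq ** 0.5

# Toy usage on a linear classifier with random stand-in "shifted" inputs.
ctor = lambda: nn.Linear(10, 3)
trained_state = ctor().state_dict()
x_shifted = torch.randn(64, 10)
print(projection_norm_sketch(ctor, trained_state, x_shifted))
```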
How do two distributions of texts differ? Humans are slow at answering t...
Reward hacking – where RL agents exploit gaps in misspecified reward functions...
In real-world applications of machine learning, reliable and safe system...
Overparameterization is shown to result in poor test accuracy on rare su...
When making everyday decisions, people are guided by their conscience, a...
Machine learning (ML) systems are rapidly increasing in size, are acquir...
Large-scale, two-sided matching platforms must find market outcomes that...
To understand neural network behavior, recent works quantitatively compa...
While programming is one of the most broadly applicable skills in modern...
Larger language models have higher accuracy on average, but are they bet...
Traditional learning approaches for classification implicitly assume tha...
Adversarially trained models exhibit a large generalization gap: they ca...
Why do models often attend to salient words, and how does this evolve th...
Feature alignment is an approach to improving robustness to distribution...
Many intellectual endeavors require mathematical problem solving, but th...
Convex relaxations have emerged as a promising approach for verifying de...
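To make the idea concrete, the loosest member of this family simply propagates axis-aligned interval bounds through the network layer by layer. The sketch below is illustrative only and is not the specific relaxation studied in the abstract above: it bounds the logits of a small two-layer network over an input box and lower-bounds the margin of one class over another across the whole box.

```python
# Illustrative interval-bound sketch (the weakest convex relaxation, shown for intuition).
import numpy as np

def interval_linear(W, b, lower, upper):
    # Exact per-coordinate box bounds for the affine map y = W x + b over x in [lower, upper].
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    y_lo = W_pos @ lower + W_neg @ upper + b
    y_hi = W_pos @ upper + W_neg @ lower + b
    return y_lo, y_hi

def interval_relu(lower, upper):
    # ReLU is monotone, so the box [lower, upper] maps to [relu(lower), relu(upper)].
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Toy usage: bound the logits of a two-layer network over a small input box and
# certify a lower bound on the margin of class 0 over class 1 across the box.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)
x = rng.normal(size=3)
h_lo, h_hi = interval_relu(*interval_linear(W1, b1, x - 0.05, x + 0.05))
z_lo, z_hi = interval_linear(W2, b2, h_lo, h_hi)
print("certified lower bound on (logit 0 - logit 1):", z_lo[0] - z_hi[1])
```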
We propose a new test to measure a text model's multitask accuracy. The ...
We show how to assess a language model's knowledge of basic concepts of...
We introduce three new robustness benchmarks consisting of naturally occurring...
We explore why many recently proposed robust estimation problems are eff...
Dataset replication is a useful tool for assessing whether improvements ...
The classical bias-variance trade-off predicts that bias decreases and variance...
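For reference, the textbook identity behind this classical prediction, stated for a fixed input x with y = f(x) + \varepsilon, \mathbb{E}[\varepsilon] = 0, \mathrm{Var}(\varepsilon) = \sigma^2, and the expectation taken over the training set used to fit \hat f and the noise \varepsilon, is
\[
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big[\hat f(x)\big]}_{\text{variance}}
  + \sigma^2 .
\]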
We analyze the performance of the Tukey median estimator under total variation...
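For context, the Tukey (halfspace) median is a standard robust location estimator: the Tukey depth of a point \eta under a distribution p on \mathbb{R}^d is the smallest mass p assigns to any closed halfspace containing \eta,
\[
\mathrm{depth}(\eta;\, p) \;=\; \inf_{\|v\|_2 = 1} \Pr_{x \sim p}\big[\langle v,\, x - \eta \rangle \ge 0\big],
\]
and a Tukey median is any point that maximizes this depth.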
Detecting out-of-distribution examples is important for safety-critical...
Robust statistics traditionally focuses on outliers, or perturbations in...
Considerable work on adversarial defense has studied robustness to a fix...
We introduce natural adversarial examples -- real-world, unmodified, and...
We study the transfer of adversarial robustness of deep neural networks...
In component-based program synthesis, the synthesizer generates a progra...
Despite their impressive performance on diverse tasks, neural networks f...
Machine learning models trained on data from the outside world can be co...
Collectively, machine learning (ML) researchers are engaged in the creat...