
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
We empirically demonstrate that full-batch gradient descent on neural ne...

When Is Generalizable Reinforcement Learning Tractable?
Agents trained by reinforcement learning (RL) often fail to generalize b...

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
We formally study how Ensemble of deep learning models can improve test ...

A law of robustness for two-layers neural networks
We initiate the study of the inherent tradeoffs between the size of a ne...

Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK
We consider the dynamic of gradient descent for learning a two-layer neu...

Feature Purification: How Adversarial Training Performs Robust Deep Learning
Despite the great empirical success of adversarial training to defend de...

When can Wasserstein GANs minimize Wasserstein Distance?
Generative Adversarial Networks (GANs) are widely used models to learn c...

Backward Feature Correction: How Deep Learning Performs Deep Learning
How does a 110-layer ResNet learn a high-complexity classifier using rel...

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
Stochastic gradient descent with a large initial learning rate is a wide...

Complexity of Highly Parallel Non-Smooth Convex Optimization
A landmark result of non-smooth convex optimization is that gradient des...

What Can ResNet Learn Efficiently, Going Beyond Kernels?
How can neural networks such as ResNet efficiently learn CIFAR-10 with t...

Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without
We consider the non-stochastic version of the (cooperative) multi-player...

Can SGD Learn Recurrent Neural Networks with Provable Generalization?
Recurrent Neural Networks (RNNs) are among the most popular models in se...

Improved Path-length Regret Bounds for Bandits
We study adaptive regret bounds in terms of the variation of the losses ...

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Neural networks have great success in many machine learning applications...

A Convergence Theory for Deep Learning via Over-Parameterization
Deep neural networks (DNNs) have demonstrated dominating performance in ...

Chasing Nested Convex Bodies Nearly Optimally
The convex body chasing problem, introduced by Friedman and Linial, is a...

Competitively Chasing Convex Bodies
Let F be a family of sets in some metric space. In the F-chasing problem...

On the Convergence Rate of Training Recurrent Neural Networks
Despite the huge success of deep learning, our understanding of how the ...

Statistical Convergence of the EM Algorithm on Gaussian Mixture Models
We study the convergence behavior of the Expectation Maximization (EM) a...

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Neural networks have many successful applications, while much less theor...

Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees
While model-based reinforcement learning has empirically been shown to s...

The Well Tempered Lasso
We study the complexity of the entire regularization path for least squa...

Online Improper Learning with an Approximation Oracle
We revisit the question of reducing online learning to approximate optim...

Operator Scaling via Geodesically Convex Optimization, Invariant Theory and Polynomial Identity Testing
We propose a new second-order method for geodesically convex optimizatio...

Learning Mixtures of Linear Regressions with Nearly Optimal Complexity
Mixtures of Linear Regressions (MLR) is an important mixture model with ...

An Alternative View: When Does SGD Escape Local Minima?
Stochastic gradient descent (SGD) is widely used in machine learning. Al...

Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits
Regret bounds in online learning compare the player's performance to L^*...

Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
We show that the (stochastic) gradient descent algorithm provides an imp...

Algorithmic Regularization in Over-parameterized Matrix Recovery
We study the problem of recovering a low-rank matrix X^* from linear meas...

Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a s...

Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach
The experimental design problem concerns the selection of k points from ...

An homotopy method for ℓ_p regression provably beyond self-concordance and in input-sparsity time
We consider the problem of linear regression where the ℓ_2^n norm loss (...

Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls
We propose a rank-k variant of the classical Frank-Wolfe algorithm to so...

A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model
We study the active learning problem of top-k ranking from multi-wise co...

Provable Alternating Gradient Descent for Non-negative Matrix Factorization with Strong Correlations
Non-negative matrix factorization is a basic tool for decomposing data i...

Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU
The online problem of computing the top eigenvector is fundamental to ma...

Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates
Non-negative matrix factorization is a popular tool for decomposing data...

Faster Principal Component Regression and Stable Matrix Chebyshev Approximation
We solve principal component regression (PCR), up to a multiplicative ac...

First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate
We study streaming principal component analysis (PCA), that is to find, ...

Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition
We study k-GenEV, the problem of finding the top k generalized eigenvect...

LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain
We study k-SVD that is to obtain the first k singular vectors of a matri...

Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods
The well known maximum-entropy principle due to Jaynes, which states tha...

Recovery guarantee of weighted low-rank approximation via alternating minimization
Many applications require recovering a ground truth low-rank matrix from...

Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Word embeddings are ubiquitous in NLP and information retrieval, but it'...

RANDWALK: A Latent Variable Model Approach to Word Embeddings
Semantic word embeddings represent the meaning of a word via a vector, a...

A Theoretical Analysis of NDCG Type Ranking Measures
A central problem in ranking is to design a ranking measure for evaluati...