Why Does MAML Outperform ERM? An Optimization Perspective

by   Liam Collins, et al.

Model-Agnostic Meta-Learning (MAML) has demonstrated widespread success in training models that can quickly adapt to new tasks via one or few stochastic gradient descent steps. However, the MAML objective is significantly more difficult to optimize compared to standard Empirical Risk Minimization (ERM), and little is understood about how much MAML improves over ERM in terms of the fast adaptability of their solutions in various scenarios. We analytically address this issue in a linear regression setting consisting of a mixture of easy and hard tasks, where hardness is determined by the number of gradient steps required to solve the task. Specifically, we prove that for Ω(d_eff) labelled test samples (for gradient-based fine-tuning) where d_eff is the effective dimension of the problem, in order for MAML to achieve substantial gain over ERM, the optimal solutions of the hard tasks must be closely packed together with the center far from the center of the easy task optimal solutions. We show that these insights also apply in a low-dimensional feature space when both MAML and ERM learn a representation of the tasks, which reduces the effective problem dimension. Further, our few-shot image classification experiments suggest that our results generalize beyond linear regression.


page 1

page 2

page 3

page 4


Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

We propose an algorithm for meta-learning that is model-agnostic, in the...

Sample Efficient Linear Meta-Learning by Alternating Minimization

Meta-learning synthesizes and leverages the knowledge from a given set o...

On the Subspace Structure of Gradient-Based Meta-Learning

In this work we provide an analysis of the distribution of the post-adap...

Continuous-Time Meta-Learning with Forward Mode Differentiation

Drawing inspiration from gradient-based meta-learning methods with infin...

Towards Understanding Generalization in Gradient-Based Meta-Learning

In this work we study generalization of neural networks in gradient-base...

How to distribute data across tasks for meta-learning?

Meta-learning models transfer the knowledge acquired from previous tasks...

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

We consider stochastic gradient descent (SGD) for least-squares regressi...