Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts

09/24/2021
by Wouter Kool, et al.

Training large-scale mixture of experts models efficiently on modern hardware requires assigning datapoints in a batch to different experts, each with a limited capacity. Recently proposed assignment procedures lack a probabilistic interpretation and use biased estimators for training. As an alternative, we propose two unbiased estimators based on principled stochastic assignment procedures: one that skips datapoints which exceed expert capacity, and one that samples perfectly balanced assignments using an extension of the Gumbel-Matching distribution [29]. Both estimators are unbiased, as they correct for the sampling procedure used. In a toy experiment, we find that the 'skip' estimator is more effective than the balanced sampling one, and that both are more robust in solving the task than biased alternatives.
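
As a rough illustration of the two sampling procedures named in the abstract, the Python sketch below draws a capacity-limited assignment by skipping overflow and a perfectly balanced assignment by perturbing scores with Gumbel noise. It is not the authors' implementation: the function names, the random processing order, the example router probabilities, and the brute-force search over balanced assignments are all assumptions made here, and the correction terms that make the gradient estimators unbiased are left out.

```python
import itertools
import math
import random


def sample_with_skipping(probs, capacity, rng):
    """Sample an expert per datapoint independently, then skip overflow.

    probs[i][j] is a hypothetical router probability of sending
    datapoint i to expert j. Returns one expert index per datapoint,
    or None where the datapoint was skipped.
    """
    n, k = len(probs), len(probs[0])
    load = [0] * k
    assignment = [None] * n
    order = list(range(n))
    rng.shuffle(order)  # random processing order: an assumption of this sketch
    for i in order:
        j = rng.choices(range(k), weights=probs[i])[0]
        if load[j] < capacity:  # expert j still has room
            assignment[i] = j
            load[j] += 1
    return assignment


def sample_balanced(log_probs, capacity, rng):
    """Add Gumbel noise to the scores and return the best balanced assignment.

    Uses brute force over all balanced assignments, so it is only
    practical for very small batches.
    """
    n, k = len(log_probs), len(log_probs[0])
    assert n == k * capacity, "batch must fill every expert exactly"
    # Gumbel(0, 1) noise is -log(-log(U)) with U uniform on (0, 1).
    perturbed = [
        [lp - math.log(-math.log(max(rng.random(), 1e-12))) for lp in row]
        for row in log_probs
    ]
    slots = [j for j in range(k) for _ in range(capacity)]
    candidates = set(itertools.permutations(slots))
    best = max(candidates, key=lambda a: sum(perturbed[i][a[i]] for i in range(n)))
    return list(best)


rng = random.Random(0)
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
skip_assignment = sample_with_skipping(probs, capacity=2, rng=rng)
log_probs = [[math.log(p) for p in row] for row in probs]
balanced_assignment = sample_balanced(log_probs, capacity=2, rng=rng)
```

In the first function some entries may be None, which corresponds to datapoints that no expert processes; in the second, every expert receives exactly `capacity` datapoints. A practical implementation would replace the brute-force search with a proper assignment solver.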

research
10/21/2020

Efficient Balanced Treatment Assignments for Experimentation

In this work, we reframe the problem of balanced treatment assignment as...
research
07/31/2018

Stochastic Gradient Descent with Biased but Consistent Gradient Estimators

Stochastic gradient descent (SGD), which dates back to the 1950s, is one...
research
03/30/2021

BASE Layers: Simplifying Training of Large, Sparse Models

We introduce a new balanced assignment of experts (BASE) layer for large...
research
01/31/2018

On the construction of unbiased estimators for the group testing problem

Debiased estimation has long been an area of research in the group testi...
research
05/25/2023

A Guide Through the Zoo of Biased SGD

Stochastic Gradient Descent (SGD) is arguably the most important single ...
research
04/20/2017

Hard Mixtures of Experts for Large Scale Weakly Supervised Vision

Training convolutional networks (CNN's) that fit on a single GPU with mi...
research
09/06/2019

Optimal unbiased estimators via convex hulls

Necessary and sufficient conditions for the square-integrability of rece...
