Hard Mixtures of Experts for Large Scale Weakly Supervised Vision

04/20/2017
by Sam Gross et al.

Training convolutional networks (CNNs) that fit within a single GPU with minibatch stochastic gradient descent has become effective in practice. However, there is still no effective method for training large CNNs that do not fit in the memory of a few GPU cards, or for parallelizing CNN training. In this work we show that a simple hard mixture of experts model can be efficiently trained to good effect on large scale hashtag (multilabel) prediction tasks. Mixture of experts models are not new (Jacobs et al. 1991, Collobert et al. 2003), but in the past, researchers have had to devise sophisticated methods to deal with data fragmentation. We show empirically that modern weakly supervised datasets are large enough to support naive partitioning schemes, where each data point is assigned to a single expert. Because the experts are independent, training them in parallel is easy, and evaluation is cheap for the size of the model. Furthermore, we show that a single decoding layer can be shared across all the experts, giving a unified feature embedding space. We demonstrate that it is feasible (and in fact relatively painless) to train far larger models than could be practically trained with standard CNN architectures, and that the extra capacity can be well used on current datasets.
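The core idea is simple enough to sketch: every example carries a fixed hard assignment to exactly one expert (in the paper this comes from partitioning the data, e.g. by clustering features of a smaller trunk network), only that expert processes it, and a single shared decoding layer maps every expert's output into one common label space. The PyTorch sketch below is not the authors' implementation; the class names, network sizes, and the random expert assignment are placeholders for illustration.

```python
# Minimal sketch of a hard mixture of experts (illustrative, not the paper's code):
# each example is routed to exactly one expert by a fixed assignment, experts are
# independent networks, and one shared decoder produces multilabel logits.
import torch
import torch.nn as nn


class SmallExpert(nn.Module):
    """Stand-in expert CNN; the paper's experts are much larger networks."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.proj(self.conv(x).flatten(1))


class HardMoE(nn.Module):
    def __init__(self, num_experts=4, feat_dim=128, num_labels=1000):
        super().__init__()
        self.experts = nn.ModuleList(SmallExpert(feat_dim) for _ in range(num_experts))
        # Single decoding layer shared by all experts -> unified embedding space.
        self.decoder = nn.Linear(feat_dim, num_labels)

    def forward(self, x, expert_ids):
        feats = x.new_zeros(x.size(0), self.decoder.in_features)
        for k, expert in enumerate(self.experts):
            mask = expert_ids == k          # hard assignment: one expert per example
            if mask.any():
                feats[mask] = expert(x[mask])
        return self.decoder(feats)          # multilabel logits (use with BCEWithLogitsLoss)


if __name__ == "__main__":
    model = HardMoE(num_experts=4, feat_dim=128, num_labels=10)
    images = torch.randn(8, 3, 64, 64)
    # Hypothetical fixed partition; the paper derives assignments from the data,
    # not at random.
    expert_ids = torch.randint(0, 4, (8,))
    logits = model(images, expert_ids)
    print(logits.shape)  # torch.Size([8, 10])
```

Because the routing is a fixed hard partition rather than a learned soft gate, each expert sees a disjoint slice of the data and can be trained in its own process or on its own GPU; only the shared decoder ties the experts together into a single embedding space.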

Related research

- 12/16/2013  Learning Factored Representations in a Deep Mixture of Experts
  Mixtures of Experts combine the outputs of several "expert" networks, ea...
- 01/31/2023  Patch Gradient Descent: Training Neural Networks on Very Large Images
  Traditional CNN models are trained and tested on relatively low resoluti...
- 04/11/2023  Revisiting Single-gated Mixtures of Experts
  Mixture of Experts (MoE) are rising in popularity as a means to train ex...
- 11/19/2015  Mediated Experts for Deep Convolutional Networks
  We present a new supervised architecture termed Mediated Mixture-of-Expe...
- 12/21/2013  GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training
  The ability to train large-scale neural networks has resulted in state-o...
- 05/20/2022  SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
  With the increasing diversity of ML infrastructures nowadays, distribute...
- 09/24/2021  Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts
  Training large-scale mixture of experts models efficiently on modern har...
