In many of these models, the softmax function is commonly used to produce categorical distributions over the output space. Because its cost grows linearly with the output dimension, the softmax layer can become a computational bottleneck for tasks with large output spaces, such as language modeling (Bengio et al., 2003), neural machine translation (Bahdanau et al., 2014),
and face recognition (Sun et al., 2014). In language modeling, even with a small RNN model (Zaremba et al., 2014), softmax accounts for more than 95% of the computation (Merity et al., 2016). This becomes a significant bottleneck when computational resources are limited, for example when deploying the model to mobile devices (Howard et al., 2017).
Many methods have been proposed to reduce softmax complexity. The bottleneck is present in both the training and inference phases, but the objectives differ. For training, the goal is to estimate the categorical distribution and approximate the normalization term as quickly as possible; sampling-based (Gutmann & Hyvärinen, 2012) and hierarchy-based methods (Goodman, 2001; Morin & Bengio, 2005; Chen et al., 2015; Grave et al., 2016) were introduced for this purpose. Hierarchy-based methods speed up training by calculating the normalization term over a subset of classes. Recent works in this area, such as D-softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016), construct two-level hierarchies for the output classes based on the unbalanced word distribution.
In this work, we focus on reducing softmax computation during the inference phase. Unlike training, in inference our goal is not to compute the exact categorical distribution over the whole vocabulary, but rather to search for the top-k classes accurately and efficiently. Most existing methods formulate this as an approximate maximum inner product search problem given an already learned and fixed softmax, and focus on designing efficient approximation techniques to find the top-k classes after the standard softmax training procedure has been performed (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). Since the softmax training objective is misaligned with the inference objective, the learned and fixed softmax may not be structured in a (hierarchical) way that makes locating the top-k easy, which can lead to sub-optimal trade-offs between efficiency and accuracy. Motivated by this observation, we make the training procedure aware of the approximation used at inference time and adapt the softmax to be hierarchically structured.
To achieve this, we propose a novel Doubly Sparse Softmax (DS-Softmax) layer. The model learns a two-level overlapping hierarchy during training, using a sparse mixture of sparse experts. Each expert is sparse and contains only a small subset of the output classes,
while each class is permitted to belong to more than one expert. Given an input vector and a set of experts, the DS-Softmax first selects the single expert most related to the input (in contrast to a dense mixture of experts). The chosen sparse expert then returns a categorical distribution over its subset of classes; the reduction is achieved because the whole vocabulary never needs to be considered. Compared to existing methods (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a), our model can adapt the softmax embedding during training, which results in a better trade-off. In addition, our method can be combined with existing post-approximation methods by treating each expert as a single softmax.
We conduct experiments on one synthetic dataset and three real tasks: language modeling, neural machine translation, and image classification. We demonstrate that our method can reduce softmax computation dramatically without loss of prediction performance; for example, we achieve more than 23x speedup in language modeling and 15x speedup in translation with similar performance. Qualitatively, we also show that the learned two-level overlapping hierarchy is semantically meaningful on natural language modeling tasks.
The contributions of our work are summarized as follows:
We propose a novel, learning-based method to speed up softmax inference for large output spaces, which enables fast retrieval of the top-k classes.
The proposed method learns a two-level overlapping hierarchy during training that facilitates fast inference and bridges the misaligned objectives of the softmax training and inference phases.
We empirically demonstrate that the proposed method can achieve significant speedup without loss of prediction performance.
2 The Doubly Sparse Softmax
In this section, we introduce the softmax inference problem, as well as the proposed method.
2.1 Softmax Inference Problem
Given a context vector h ∈ R^d, a softmax layer is used to compute a categorical distribution over a set of N classes. In particular, it is defined as p(i | h) = exp(h^T w_i) / Z, where Z = Σ_j exp(h^T w_j) is the normalization term and W = [w_1, …, w_N] is the softmax embedding parameter. For inference, our goal is not to compute the full exact distribution, but rather to find the top-k classes, i.e. the classes i whose logits h^T w_i are among the k largest. The most conventional method is to compute the whole logit vector and find the top-k, which has O(Nd) complexity (top-k selection requires an extra O(N) on average via Quickselect). Facing a large output space (i.e. large N), the softmax layer becomes a bottleneck, and our goal is to find the top-k both accurately and efficiently.
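As a point of reference, the conventional approach above can be sketched in a few lines of NumPy; the sizes and variable names are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical sizes: N = 10,000 output classes, context dimension d = 200.
N, d = 10_000, 200
rng = np.random.default_rng(0)
W = rng.standard_normal((N, d)) * 0.01   # softmax embedding matrix
h = rng.standard_normal(d)               # context vector from the network

# Full softmax: O(N*d) for the logits, plus normalization over all N classes.
logits = W @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top-k selection via argpartition (average O(N), Quickselect-style).
k = 5
topk = np.argpartition(-logits, k)[:k]
topk = topk[np.argsort(-logits[topk])]   # order the k winners by logit
```

Every query pays the full O(Nd) cost here, which is exactly the cost the DS-Softmax aims to avoid.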
Many natural discrete objects/classes, such as words in natural language, exhibit hierarchical structure, with objects organized in a tree-like fashion. A hierarchical structure enables much faster retrieval, since we do not need to consider the whole set. Goodman (2001); Chen et al. (2015); Grave et al. (2016)
studied two-level hierarchies for language modeling, where each word belongs to a unique cluster (a "cluster" here refers to a cluster of words) and the hierarchy is constructed with different approaches. However, constructing such a hierarchy is very challenging and usually based on heuristics. Moreover, requiring mutually exclusive clusters can be very limiting: in language modeling, for instance, it is often difficult to assign a word to exactly one cluster. For example, if we want to predict the next word of "I want to eat", one possible correct answer is "cookie", and we quickly notice that the plausible answers belong to the category of edible things. If we only search for the answer among words with this edible property, we can dramatically increase efficiency. However, although "cookie" is a correct answer here, the word also appears in non-edible contexts, such as "a piece of data" in computer science literature. A two-level overlapping hierarchy naturally accommodates homonyms like this by allowing each word to belong to more than one cluster. We believe this observation is likely to hold in applications beyond language modeling.
2.3 The Proposed Method
Inspired by such hierarchical structures, we propose our method, the Doubly Sparse Softmax (DS-Softmax), to automatically capture and leverage them for softmax inference speedup. The method learns an overlapping two-level hierarchy among the output classes: the first level is a sparse mixture, and the second level contains several sparse experts. A sparse expert is a cluster of classes covering only a subset of the whole output space, and we allow each class to belong to more than one expert (non-exclusive). To generate the top-k classes, the sparse mixture enables a fast, dynamic selection of the right expert according to the context vector h; the selected sparse expert then allows a fast softmax computation over its small subset of classes.
The framework is illustrated in Figure 1 and contains two major components: (1) a sparse mixture/gating network that enables the selection of a top-1 expert, and (2) sparse experts that are pruned from full softmaxes with group lasso. We also use a load balancing term to balance the utilization of different experts, and mitosis training to scale to a larger number of experts. The final objective is a combination of the task-specific loss, the group lasso losses, and the load balancing regularization losses. The overall training algorithm is summarized in Algorithm 1.
The first level of sparsification is a sparse gating network, designed to find the right expert given the context vector h. To facilitate faster inference, only the single most suitable expert is chosen. This sparse gating network is similar to the one in Shazeer et al. (2017), but the technique we propose here supports a single top-expert output while maintaining meaningful gradients.
To be more specific, suppose we have K experts. Given the context vector h and the gating network weight W_g, the gating values g_i(h), i = 1, …, K, are calculated and normalized prior to the selection, as shown in Eq. 1:

g_i(h) = exp(h^T W_g^{(i)}) / Σ_j exp(h^T W_g^{(j)}),   (1)

and then we keep only the largest gating value while setting all other gates to zero:

g'_i(h) = g_i(h) if i = argmax_j g_j(h), and g'_i(h) = 0 otherwise,

where W_g is the weighting matrix for expert selection and only the top-1 expert is selected. Because the gating values are normalized before the selection, Eq. 1 still allows the gradient to be back-propagated to the whole W_g.
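A minimal sketch of this gating step (normalize over all experts first, then zero out all but the top-1 gate) might look as follows; the variable names are ours, not the paper's:

```python
import numpy as np

def sparse_gate(h, W_g):
    """Normalize gating values over all experts (Eq. 1), then keep only the
    top-1 gate and zero out the rest. Because normalization happens before
    selection, training gradients w.r.t. the whole W_g remain meaningful."""
    scores = W_g @ h                      # one score per expert
    g = np.exp(scores - scores.max())
    g /= g.sum()                          # normalized gating values g_i(h)
    top = int(np.argmax(g))
    g_sparse = np.zeros_like(g)
    g_sparse[top] = g[top]                # only the chosen expert survives
    return g_sparse, top

rng = np.random.default_rng(1)
n_experts, d = 8, 16                      # hypothetical sizes
g_sparse, top = sparse_gate(rng.standard_normal(d),
                            rng.standard_normal((n_experts, d)))
```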
Given the sparse gate, we can further compute the probability of class c under context h as

p(c | h) = exp(g'_e(h) · h^T w_c^{(e)}) / Σ_{c'} exp(g'_e(h) · h^T w_{c'}^{(e)}),   (2)

where e is the chosen expert and W^{(e)} is the softmax embedding weight matrix for the e-th expert. The gating value can be interpreted as an inverse temperature for the final categorical distribution produced by the chosen expert (Hinton et al., 2015): a smaller g'_e(h) gives a more uniform distribution, a larger one makes it sharper, and this is adjusted automatically according to the context. It is worth noting that during inference, since all other gates are zero, we only need to compute the single chosen expert and select the top-k classes from it.
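Putting the two levels together, inference could be sketched as below. The expert class subsets here are random placeholders; in the trained model they come from the group lasso pruning described next.

```python
import numpy as np

# Hypothetical setup: 4 experts, each holding a 300-class subset of N = 1,000.
rng = np.random.default_rng(2)
N, d, n_experts = 1_000, 32, 4
W_g = rng.standard_normal((n_experts, d))
expert_classes = [rng.choice(N, size=300, replace=False) for _ in range(n_experts)]
expert_W = [rng.standard_normal((300, d)) for _ in range(n_experts)]

def ds_softmax_topk(h, k=5):
    # First level: pick the single most relevant expert.
    g = np.exp(W_g @ h)
    g /= g.sum()
    e = int(np.argmax(g))
    # Second level: softmax only over that expert's class subset,
    # with the gate value acting as an inverse temperature.
    logits = g[e] * (expert_W[e] @ h)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    top = np.argpartition(-p, k)[:k]
    top = top[np.argsort(-p[top])]        # descending order within the top-k
    return expert_classes[e][top], p[top]

classes, probs = ds_softmax_topk(rng.standard_normal(d))
```

Only the gating scores and one small expert are ever evaluated, which is where the speedup comes from.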
During training, we use the resulting distribution p(c | h) as the softmax output and train the model end-to-end with respect to the task-specific loss function. (In practice, we can also pre-train all layers with a conventional softmax and then re-train only the DS-Softmax layer, keeping the previous layers fixed.) This makes training aware of the approximation used at inference, since the sparse gating network behaves identically in both phases.
Sparse experts with group lasso
The second level of sparsification is the sparse experts. We would like each expert to contain only a small subset of the whole class space, i.e. to output a categorical distribution in which most entries are zero. To obtain a sparse expert, we initialize each expert as a full softmax covering all classes and apply group lasso to iteratively prune out irrelevant classes. More specifically, we add the regularization loss

L_lasso = Σ_k Σ_c ||w_c^{(k)}||_2,   (3)

which treats each class embedding vector within each expert as a group. This term actively prunes the embedding vector of class c in expert k once its norm falls below a pre-defined threshold ε. Under heavy regularization, many classes are pruned out of each expert, yielding a set of sparse experts.
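The group lasso penalty and the norm-based pruning can be sketched as follows; the function names and threshold variable are illustrative:

```python
import numpy as np

def group_lasso_loss(W_expert, lam=1.0):
    """Group lasso over per-class embedding vectors inside one expert:
    the penalty is the sum of row-wise L2 norms, which drives whole
    class embeddings toward zero rather than individual entries."""
    # W_expert: (classes_in_expert, d); one "group" per class embedding row.
    return lam * np.linalg.norm(W_expert, axis=1).sum()

def prune_expert(W_expert, class_ids, eps=0.01):
    """Drop classes whose embedding norm has fallen below the threshold eps."""
    keep = np.linalg.norm(W_expert, axis=1) >= eps
    return W_expert[keep], class_ids[keep]

rng = np.random.default_rng(3)
W = rng.standard_normal((100, 16))
W[:40] *= 1e-4                     # pretend regularization shrank these rows
W_kept, ids_kept = prune_expert(W, np.arange(100))
```

After pruning, the expert's softmax runs only over `ids_kept`, which is what makes the second-level computation small.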
A balanced utilization of experts is preferred. Let K_k denote the number of final classes in the k-th expert and N the total number of classes. The utilization ratio u_k is the probability of expert k being selected on a given dataset; for example, if the model is run 10,000 times and the k-th expert is selected 100 times, then u_k = 0.01. The overall speedup is then roughly N / Σ_k u_k K_k. An unbalanced load is therefore undesirable, since the model can degenerate into a single big softmax and yield less speedup. To address this, we add a loss that encourages a more balanced utilization of experts. We adopt the loading function of Shazeer et al. (2017), shown in Eq. 5, which balances the utilization percentage of each expert by minimizing the coefficient of variation (CV) of the gating outputs. In addition, we include a further group lasso loss term so that each class is encouraged to exist in only one or a few experts.
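The CV-based balancing idea can be sketched as follows, using the total gating mass per expert over a batch as the importance measure (in the spirit of Shazeer et al. (2017); the exact loading function used in our implementation may differ):

```python
import numpy as np

def cv_squared(x, eps=1e-10):
    # Squared coefficient of variation: zero when all entries are equal.
    return x.var() / (x.mean() ** 2 + eps)

def load_balance_loss(gates):
    """gates: (batch, n_experts) array of gating values. Penalizes uneven
    total gate mass across experts, pushing toward balanced utilization."""
    importance = gates.sum(axis=0)       # total gate mass per expert
    return cv_squared(importance)

balanced = np.full((64, 8), 1 / 8)                 # perfectly even load
skewed = np.zeros((64, 8))
skewed[:, 0] = 1.0                                 # one expert does all work
```

Minimizing this loss pushes the skewed case toward the balanced one, preventing the degenerate single-big-softmax solution.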
The proposed method is efficient for softmax inference, as it consists of only (1) a sparse gating step to choose an expert, with O(Kd) complexity given K experts; and (2) a small-scale softmax over the selected sparse expert, with an average complexity of O((mN/K)d) given a balanced set of experts in which each class belongs to m experts on average. With a reasonable K, a significant speedup can be expected.
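As a back-of-the-envelope check of this complexity argument (the concrete numbers below are illustrative, not measured):

```python
# N output classes, K balanced experts, each class in m experts on average,
# context dimension d.
N, K, m, d = 10_000, 64, 2, 200

full_softmax_flops = N * d                 # O(N d) for the full softmax
gating_flops = K * d                       # O(K d) to pick an expert
expert_flops = (m * N / K) * d             # O((m N / K) d) on average
ds_softmax_flops = gating_flops + expert_flops

speedup = full_softmax_flops / ds_softmax_flops
# With these numbers the ratio comes out to roughly 26x.
```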
As for training, since each expert is initialized as a full softmax and only gradually pruned into a sparse one, the DS-Softmax layer requires K times the memory of a regular softmax during training with K experts, even though the final sparse experts need far less. To mitigate this issue, we introduce a memory-efficient training technique, named mitosis training.
Mitosis training is a strategy to progressively increase the number of experts during training. We start with a small number of experts; once training converges, we split each expert into two identical copies and repeat the training procedure with the resulting model. Because each expert is already relatively sparse and smaller than the full softmax at the time it is cloned, this requires much less memory than training all experts from scratch without mitosis. An illustration of mitosis training can be found in Fig. 2.
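One mitosis step (splitting every expert, together with its gating row, into two near-identical copies) might look like this; the noise scale and function name are our own choices for illustration:

```python
import numpy as np

def mitosis_split(expert_weights, W_g, noise=1e-2, rng=None):
    """Clone each expert (and its gating row) into two children, adding
    small noise so the copies can diverge during subsequent training."""
    if rng is None:
        rng = np.random.default_rng()
    new_experts = []
    for W in expert_weights:
        new_experts.append(W.copy())
        new_experts.append(W + noise * rng.standard_normal(W.shape))
    W_g_new = np.repeat(W_g, 2, axis=0)           # duplicate gating rows
    W_g_new += noise * rng.standard_normal(W_g_new.shape)
    return new_experts, W_g_new

rng = np.random.default_rng(5)
experts = [rng.standard_normal((50, 16)) for _ in range(2)]
W_g = rng.standard_normal((2, 16))
experts, W_g = mitosis_split(experts, W_g, rng=rng)   # 2 -> 4 experts
```

Repeating this step (2 → 4 → 8 → …) with pruning in between keeps peak memory well below the cost of training all experts at full size.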
3 Experiments
We present empirical evaluations on both synthetic and real tasks in this section. First, we create a synthetic task with a two-level hierarchy and test our model's ability to learn the hierarchical structure. Second, we consider three real tasks: natural language modeling, neural machine translation, and Chinese handwritten character recognition. Both theoretical speedup (reduction in FLOPs) and real-device latency (on CPU) are reported. Finally, ablations and a case study are presented to better understand what the model has learned.
Regarding experiment setup, we leave the task-specific details for later and present our model setup here. The proposed DS-Softmax layer can be trained jointly with other layers in an end-to-end fashion. For the real tasks, we find it easier to first pre-train the whole model with a conventional softmax, then replace the softmax layer with DS-Softmax and retrain only the new layer, keeping the others fixed, using Adam (Kingma & Ba, 2014). For hyper-parameters, the load balancing weight and the pruning threshold ε are fixed for all tasks at 10 and 0.01 respectively. The group lasso weights share the same value and are tuned with the following strategy: start from zero and increase exponentially until validation performance decreases. The reported performance is on an independent test set. For baselines, we mainly compare to the conventional full softmax, the recently proposed SVD-Softmax (Shim et al., 2017), and D-Softmax (Chen et al., 2015).
3.1 Synthetic task
A synthetic dataset with a two-level hierarchy is constructed to test our model. As illustrated in Figure 3(a), data points are organized around hierarchical centers: multiple sub-clusters belong to one super cluster. To generate such data, we first sample the center of each super cluster, then sample the centers of its sub-clusters around the super cluster center, and finally draw each data point near its sub-cluster center, with the scale ratio between levels set to 10 and the data dimension set to 100. For sanity check and visualization purposes, we make sure the ground-truth hierarchy in the synthetic data has no overlapping.
We treat the coordinates of a data point as features and the sub-cluster membership of the data point as the target. We construct a two-layer Multi-Layer Perceptron (MLP) with DS-Softmax as the final layer; ideally, it should capture the hierarchical structure well. Two super/sub cluster sizes are evaluated, 10x10 and 100x100, where 10x10 means there are 10 super clusters, each containing 10 sub-clusters.
We investigate the captured hierarchy by examining how sub-clusters are distributed across experts. As mentioned, each expert contains only a subset of the output classes, because class-level pruning is conducted during training. We illustrate the remaining classes in each expert in Fig. 3(b) and Fig. 3(c) for the 10x10 and 100x100 sizes respectively, and find that DS-Softmax can perfectly capture the hierarchy. We further perform an ablation analysis on the 10x10 synthetic results, shown in Fig. 4, to study the effect of each additional loss; all the loss terms discussed above prove important to our model.
3.2 Language Modeling
Language modeling is the task of predicting the next word given the context. For a language such as English, the vocabulary is large, and softmax can be a bottleneck for inference efficiency. We use two standard datasets for word-level language modeling: Penn TreeBank (PTB) (Marcus et al., 1994) and WikiText-2 (Merity et al., 2016), with output dimensions of 10,000 and 33,278 respectively. A standard two-layer LSTM model (Gers et al., 1999) with hidden size 200 is used (https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb). We use accuracy as our metric, as it is common (Chen et al., 1998) in natural language modeling, especially in real applications where an extrinsic reward is given, such as voice recognition. Top-1, Top-5, and Top-10 accuracies on the test sets are reported. As shown in Table 1, we achieve 15.99x and 23.86x speedups (in terms of FLOPs) with 64 experts without loss of accuracy. (One copy of each word must be retained among all experts during training; otherwise, 80x speedup is easily achieved without loss of accuracy, at the cost of missing low-frequency words.) Moreover, a slight improvement in performance is observed, which suggests that the mixture of softmaxes brings additional benefit, consistent with the improvement obtained by breaking the low-rank bottleneck (Yang et al., 2017).
[Table 1: Top-1, Top-5, and Top-10 accuracies on PTB and WikiText-2.]
3.3 Neural Machine Translation
Neural machine translation is also commonly used for softmax speedup evaluation. We use the IWSLT English-to-Vietnamese dataset (Luong & Manning, 2015) and evaluate performance by BLEU score (Papineni et al., 2002) with greedy search, assessed on the test set. The vanilla softmax baseline is a seq2seq model (Sutskever et al., 2014)
implemented in TensorFlow (Abadi et al., 2016) (https://github.com/tensorflow/nmt). The output vocabulary contains 7,709 words, and the results are shown in Table 2, where our method achieves 15.08x speedup (in terms of FLOPs) with a similar BLEU score.
[Table 2: BLEU scores and speedups on IWSLT En-Vi (7,709 words); the full softmax baseline achieves BLEU 25.2.]
3.4 Chinese Character Recognition
We also test on a Chinese handwritten character recognition task, using the offline CASIA dataset (Liu et al., 2011) with special characters filtered out. CASIA is a popular Chinese character recognition dataset with around four thousand characters. Unlike language-related tasks, the class distribution here is uniform rather than skewed. Two-thirds of the data is used for training and the rest for testing. Table 3 shows the results; we achieve a significant 6.91x speedup (in terms of FLOPs) on this task.
3.5 Real device comparison
Real-device experiments were conducted on a machine with two Intel Xeon CPUs @ 2.20 GHz and 16 GB memory. All tested models are re-implemented in NumPy to ensure fairness. Two configurations of SVD-Softmax (Shim et al., 2017) are evaluated, SVD-5 and SVD-10, which fully evaluate the top 5% and 10% of candidates after a preview pass with window width 16; indexing and sorting are computationally heavy for SVD-Softmax in a NumPy implementation. One configuration of Differentiated Softmax (D-Softmax) is also compared, although the comparison is somewhat unfair since its main focus is training speedup (Chen et al., 2015). (D-Softmax is selected instead of adaptive-softmax (Grave et al., 2016) because we measure time on CPU rather than GPU.) For D-Softmax, words are sorted by frequency; the first and second quarters use the full and half embedding sizes, and the tail uses a quarter of the embedding size. For example, in PTB, we split the words into buckets of (2500, 2500, 5000) with embedding sizes (200, 100, 50). For a fair comparison, we report latency without sorting and indexing for SVD-Softmax, whereas for the full softmax, D-Softmax, and DS-Softmax the full latency is reported. The latency results are shown in Table 4: DS-Softmax achieves significantly better theoretical speedup as well as lower latency on Wiki-2. Moreover, compared to D-Softmax, our learned hierarchy achieves a much better speedup without loss of performance.
3.6 Mitosis training
Here we evaluate the efficiency of mitosis training on the PTB language modeling task. The model is initialized with 2 experts and cloned to 4, 8, 16, 32, and 64 experts sequentially. As shown in Figure 5(a), cloning happens every 15 epochs and pruning starts 10 epochs after each cloning. In the end, training the DS-64 model requires at most 3.25x the memory of a single softmax while achieving similar performance, significantly less than the original 64-fold memory.
3.7 Qualitative analysis of sparsity
We demonstrate the relationship between redundancy and word frequency in Figure 5(b), where redundancy denotes the number of experts that contain a given word. We find that words with higher frequency appear in more experts. This mirrors a phenomenon observed in topic models (Blei et al., 2003; Wallach, 2006), and the related observation that more frequent words require higher-capacity models (Chen et al., 2015). We manually inspect the smallest expert in such a model, in which 64 words remain (words appearing in more than a certain number of experts are filtered out). The words left in this expert are semantically related: three major groups can be identified, namely money, time, and comparison, shown below:
Money: million, billion, trillion, earnings, share, rate, stake, bond, cents, bid, cash, fine, payable.
Time: years, while, since, before, early, late, yesterday, annual, currently, monthly, annually, Monday, Tuesday, Wednesday, Thursday, Friday.
Comparison: up, down, under, above, below, next, though, against, during, within, including, range, higher, lower, drop, rise, growth, increase, less, compared, unchanged.
3.8 Post-approximation of Learned Experts
To speed up softmax inference, most existing methods are based on post-approximation of a learned and fixed softmax (Shim et al., 2017; Chen et al., 2018a; Mussmann et al., 2017). In DS-Softmax, each expert can be viewed as an individual softmax over a subset of the classes, so we can take a learned DS-Softmax and apply a post-approximation technique such as SVD-Softmax (Shim et al., 2017) to each learned expert. To demonstrate this, two experiments are conducted: applying SVD-10 to DS-2, and applying SVD-50 (top 50% in the preview window) to DS-64, where SVD is applied only to experts with more than one thousand classes. A higher percentage is used for DS-64 because fewer classes remain in each expert. The results in Table 5 show that our technique can be further improved by combining it with SVD-Softmax.
[Table 5: combining DS-Softmax with SVD-Softmax. DS-2 & SVD-10: 0.255 accuracy, 9.64x speedup; DS-64 & SVD-50: 0.255 accuracy, 32.77x speedup.]
4 Related Work
The problem of reducing softmax complexity has been widely studied before (Gutmann & Hyvärinen, 2012; Chen et al., 2015; Grave et al., 2016; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). There are mainly two goals: training speedup and inference speedup. In our work, we focus on inference, where we would like to find the top-k classes efficiently and accurately.
Most existing works for reducing softmax inference complexity are based on post-approximation of a fixed softmax trained in the standard way. Locality Sensitive Hashing (LSH) has been demonstrated as a powerful technique in this category (Shrivastava & Li, 2014; Maddison et al., 2014; Mussmann et al., 2017; Spring & Shrivastava, 2017). Small-world graphs are another powerful technique for this problem (Zhang et al., 2018). Recent work proposes learning-based clustering of the trained embedding, which overcomes the non-differentiability problem (Chen et al., 2018a). In addition, a decomposition-based method, SVD-Softmax (Shim et al., 2017), speeds up the search through a smaller preview matrix. However, as approximations to a fixed softmax, these methods suffer high cost when high precision is required (Chen et al., 2018a), implying a worse trade-off between efficiency and accuracy. In contrast, the proposed DS-Softmax adapts the softmax itself and learns a hierarchical structure to find the top-k classes. Furthermore, these methods can also be applied on top of ours by viewing each expert as a single softmax, which makes them orthogonal to our approach.
Hierarchical softmax is another family of methods similar to ours. The most related ones in this category are D-softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016). These two methods can speed up both training and inference, while other methods (Morin & Bengio, 2005; Mnih & Hinton, 2009)
usually cannot speed up inference. Their hierarchies are constructed from the unbalanced word/class distribution implied by Zipf's law, which raises two major issues. First, the hierarchy is pre-defined by heuristics and can be sub-optimal. Second, the skewness of the class distribution in some tasks, e.g. image classification, may not be as pronounced as in language modeling. The proposed DS-Softmax overcomes these limitations by automatically learning a two-level overlapping hierarchy among classes.
Our method is inspired by the sparsely-gated mixture-of-experts (MoE) (Shazeer et al., 2017),
which achieves significantly better performance in language modeling and translation with large but sparsely activated experts. However, MoE cannot speed up softmax inference by construction, because each expert covers the whole output space. Our work on softmax inference speedup can also be seen as part of recent efforts to make neural networks more compact (Han et al., 2015; Chen et al., 2018c) and efficient (Howard et al., 2017; Chen et al., 2018b), making modern neural networks faster and more widely applicable.
5 Conclusion
In this paper, we presented the Doubly Sparse Softmax, a sparse mixture of sparse experts for efficient softmax inference. Our method is learning-based and adapts the softmax for fast inference: it learns a two-level overlapping class hierarchy in which each expert is responsible for only a small subset of the output space. During inference, it first identifies the responsible expert and then performs a small-scale softmax computation within that expert. Our experiments on several real-world tasks demonstrate the efficacy of the proposed method.
- Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
- Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
- Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
- Chen et al. (2018a) Chen, P. H., Si, S., Kumar, S., Li, Y., and Hsieh, C.-J. Learning to screen for fast softmax inference on large vocabulary neural networks. arXiv preprint arXiv:1810.12406, 2018a.
- Chen et al. (1998) Chen, S. F., Beeferman, D., and Rosenfeld, R. Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop, pp. 275–280. Citeseer, 1998.
- Chen et al. (2018b) Chen, T., Lin, J., Lin, T., Han, S., Wang, C., and Zhou, D. Adaptive mixture of low-rank factorizations for compact neural modeling. In Advances in neural information processing systems (CDNNRIA workshop), 2018b.
- Chen et al. (2018c) Chen, T., Min, M. R., and Sun, Y. Learning k-way d-dimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464, 2018c.
- Chen et al. (2015) Chen, W., Grangier, D., and Auli, M. Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906, 2015.
- Gers et al. (1999) Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with lstm. 1999.
- Goodman (2001) Goodman, J. Classes for fast maximum entropy training. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, volume 1, pp. 561–564. IEEE, 2001.
- Grave et al. (2016) Grave, E., Joulin, A., Cissé, M., Grangier, D., and Jégou, H. Efficient softmax approximation for gpus. arXiv preprint arXiv:1609.04309, 2016.
- Gutmann & Hyvärinen (2012) Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.
- Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436, 2015.
- Liu et al. (2011) Liu, C.-L., Yin, F., Wang, D.-H., and Wang, Q.-F. Casia online and offline chinese handwriting databases. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pp. 37–41. IEEE, 2011.
- Luong & Manning (2015) Luong, M.-T. and Manning, C. D. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.
- Maddison et al. (2014) Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
- Marcus et al. (1994) Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The penn treebank: annotating predicate argument structure. In Proceedings of the workshop on Human Language Technology, pp. 114–119. Association for Computational Linguistics, 1994.
- Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Mnih & Hinton (2009) Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pp. 1081–1088, 2009.
- Morin & Bengio (2005) Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp. 246–252. Citeseer, 2005.
- Mussmann et al. (2017) Mussmann, S., Levy, D., and Ermon, S. Fast amortized inference and learning in log-linear models with randomly perturbed nearest neighbor search. arXiv preprint arXiv:1707.03372, 2017.
- Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
- Shazeer et al. (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Shim et al. (2017) Shim, K., Lee, M., Choi, I., Boo, Y., and Sung, W. Svd-softmax: Fast softmax approximation on large vocabulary neural networks. In Advances in Neural Information Processing Systems, pp. 5463–5473, 2017.
- Shrivastava & Li (2014) Shrivastava, A. and Li, P. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
- Spring & Shrivastava (2017) Spring, R. and Shrivastava, A. A new unbiased and efficient class of lsh-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160, 2017.
- Sun et al. (2014) Sun, Y., Wang, X., and Tang, X. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
- Wallach (2006) Wallach, H. M. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning, pp. 977–984. ACM, 2006.
- Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.
- Zaremba et al. (2014) Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- Zhang et al. (2018) Zhang, M., Liu, X., Wang, W., Gao, J., and He, Y. Navigating with graph representations for fast and scalable decoding of neural language models. arXiv preprint arXiv:1806.04189, 2018.