An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family

by Alexandre de Brébisson, et al.

In a multi-class classification problem, it is standard to model the output of a neural network as a categorical distribution conditioned on the inputs. The output must therefore be positive and sum to one, which is traditionally enforced by a softmax. This probabilistic mapping makes it possible to apply the maximum likelihood principle, which leads to the well-known log-softmax loss. However, the choice of the softmax function seems somewhat arbitrary, as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than alternative losses. In particular, Vincent et al. (2015) recently introduced a class of loss functions, called the spherical family, for which there exists an efficient algorithm to compute the updates of the output weights irrespective of the output size. In this paper, we explore several loss functions from this family as possible alternatives to the traditional log-softmax. We focus our investigation on spherical bounds of the log-softmax loss and on two spherical log-likelihood losses, namely the log-Spherical Softmax suggested by Vincent et al. (2015) and the log-Taylor Softmax that we introduce. Although these alternatives do not yield results as good as those of the log-softmax loss on two language modeling tasks, they surprisingly outperform it in our experiments on MNIST and CIFAR-10, suggesting that they might be relevant in a broad range of applications.
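The abstract names the three normalizing functions without giving their formulas. As a rough illustration, the sketch below assumes the forms usually associated with these names: the spherical softmax normalizes squared scores (plus a small constant `eps`), and the Taylor softmax normalizes the second-order Taylor expansion of the exponential, 1 + o + o²/2. The function names, the value of `eps`, and the example scores are assumptions made for this sketch and are not taken from the text above.

```python
import math

def softmax(scores):
    # Standard softmax; subtracting the max keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def spherical_softmax(scores, eps=1e-6):
    # Assumed form: squared scores plus a small eps, normalized to sum to one.
    # eps (a hypothetical default) avoids a zero probability when a score is 0.
    sq = [s * s + eps for s in scores]
    z = sum(sq)
    return [v / z for v in sq]

def taylor_softmax(scores):
    # Assumed form: second-order Taylor expansion of exp around 0.
    # 1 + s + s^2/2 = ((s + 1)^2 + 1) / 2, so every term is strictly positive.
    terms = [1.0 + s + 0.5 * s * s for s in scores]
    z = sum(terms)
    return [t / z for t in terms]

def negative_log_likelihood(probs, target):
    # Log-likelihood loss for the target class index.
    return -math.log(probs[target])

if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5]  # illustrative pre-activation outputs
    for normalize in (softmax, spherical_softmax, taylor_softmax):
        probs = normalize(scores)
        loss = negative_log_likelihood(probs, target=0)
        print(normalize.__name__, [round(p, 4) for p in probs], round(loss, 4))
```

All three mappings return positive values that sum to one, so each can be plugged into the same log-likelihood loss. Note that, under these assumed forms, the spherical softmax depends only on the magnitude of each score and not on its sign, unlike the softmax.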




The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family

Despite being the standard loss function to train multi-class neural net...

Exploring Alternatives to Softmax Function

Softmax function is widely used in artificial neural networks for multic...

Relaxed Softmax for learning from Positive and Unlabeled data

In recent years, the softmax model and its fast approximations have beco...

Exact gradient updates in time independent of output size for the spherical loss family

An important class of problems involves training deep neural networks wi...

Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models

Many applications of generative models rely on the marginalization of th...

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

An important class of problems involves training deep neural networks wi...

On Controllable Sparse Alternatives to Softmax

Converting an n-dimensional vector to a probability distribution over n ...
