MLP-Mixer as a Wide and Sparse MLP

06/02/2023
by Tomohiro Hayase, et al.

The multi-layer perceptron (MLP) is a fundamental component of deep learning and has been employed extensively across a wide range of problems. However, recent empirical successes of MLP-based architectures, particularly the progress of the MLP-Mixer, have revealed that there is still untapped potential for improving MLPs to achieve better performance. In this study, we show that the MLP-Mixer effectively operates as a wide MLP with certain sparse weights. First, we clarify that the mixing layer of the Mixer admits an effective expression as a wider MLP whose weights are sparse and represented by a Kronecker product. This expression naturally defines a permuted-Kronecker (PK) family, which can be regarded both as a general class of mixing layers and as an approximation of Monarch matrices. Second, because the PK family effectively constitutes a wide MLP with sparse weights, we can apply the hypothesis proposed by Golubeva, Neyshabur, and Gur-Ari (2021) that prediction performance improves as the width (and hence sparsity) increases while the number of weights is held fixed. We verify this hypothesis empirically by maximizing the effective width of the MLP-Mixer, which enables us to determine the appropriate size of the mixing layers quantitatively.
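To illustrate the Kronecker-product view of a mixing layer, the minimal NumPy sketch below checks that token-mixing and channel-mixing, usually written as matrix products along one axis of the token-by-channel input, act on the flattened input through Kronecker-structured sparse weight matrices. The shapes, the variable names (`W_tok`, `W_ch`), and the omission of nonlinearities and biases are our simplifying assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

S, C = 4, 3                           # number of tokens (patches) and channels
rng = np.random.default_rng(0)
X = rng.standard_normal((S, C))       # one input as a token-by-channel matrix

W_tok = rng.standard_normal((S, S))   # token-mixing weight (mixes across tokens)
W_ch  = rng.standard_normal((C, C))   # channel-mixing weight (mixes across channels)

# Usual Mixer view: mix along one axis of X at a time.
token_mixed   = W_tok @ X             # shape (S, C)
channel_mixed = X @ W_ch.T            # shape (S, C)

# Wide-MLP view: the same linear maps act on the flattened S*C-dimensional
# vector through sparse, Kronecker-structured (S*C, S*C) weight matrices.
x = X.reshape(-1)                               # row-major flatten, length S*C
K_tok = np.kron(W_tok, np.eye(C))               # S*S*C nonzeros out of (S*C)**2
K_ch  = np.kron(np.eye(S), W_ch)                # S*C*C nonzeros out of (S*C)**2

assert np.allclose(K_tok @ x, token_mixed.reshape(-1))
assert np.allclose(K_ch  @ x, channel_mixed.reshape(-1))
```

Composing these two sparse factors (with the usual nonlinearity in between) reproduces, roughly, the linear structure of one Mixer block acting on an S*C-dimensional hidden vector, which is the sense in which the Mixer behaves as a wide MLP whose weights are sparse: the two factors together use S*C*(S+C) parameters instead of the (S*C)^2 of a dense layer of the same width.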
