Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

by   Tri Dao, et al.

Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps. However, choosing which of the myriad structured transformations to use (and its associated parameterization) is a laborious task that requires trading off speed, space, and accuracy. We consider a different approach: we introduce a family of matrices called kaleidoscope matrices (K-matrices) that provably capture any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity. We empirically validate that K-matrices can be automatically learned within end-to-end pipelines to replace hand-crafted procedures, in order to improve model quality. For example, replacing channel shuffles in ShuffleNet improves classification accuracy on ImageNet by up to 5 hand-engineered pipelines – we replace filter bank feature computation in speech data preprocessing with a learnable kaleidoscope layer, resulting in only 0.4 K-matrices can capture latent structure in models: for a challenging permuted image classification task, a K-matrix based representation of permutations is able to learn the right latent structure and improves accuracy of a downstream convolutional model by over 9 implementation of our approach, and use K-matrices in a Transformer network to attain 36


Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Fast linear transforms are ubiquitous in machine learning, including the...

Learning Compressed Transforms with Low Displacement Rank

The low displacement rank (LDR) framework for structured matrices repres...

Sparse Linear Networks with a Fixed Butterfly Structure: Theory and Practice

Fast Fourier transform, Wavelets, and other well-known transforms in sig...

Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer

In this paper, we seek to reduce the computation complexity of transform...

Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices

Structured matrices, such as those derived from Kronecker products (KP),...

LambdaNetworks: Modeling Long-Range Interactions Without Attention

We present lambda layers – an alternative framework to self-attention – ...

Structured Transforms for Small-Footprint Deep Learning

We consider the task of building compact deep learning pipelines suitabl...

Please sign up or login with your details

Forgot password? Click here to reset