Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective

02/02/2023
by Michael E. Sander, et al.

The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate into neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or on perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than existing approaches and allows us, for example, to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violators (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm for solving isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture-of-experts models.
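For intuition, here is a minimal NumPy sketch of the ideas above in the special case p = 2 (quadratic regularization), where the regularized top-k mask is a Euclidean projection onto the permutahedron, and that projection reduces to sorting plus isotonic regression solved by PAV. This is a hedged illustration, not the paper's implementation: the function names (`hard_topk_mask`, `soft_topk_mask`, `pav_decreasing`) and the smoothing parameter `eps` are assumptions of this sketch, and the general p-norm operators, magnitude-selecting variants, and GPU/TPU-friendly Dykstra solver are not reproduced.

```python
# A minimal sketch, assuming p = 2 (quadratic regularization).
# Function names and the `eps` parameter are hypothetical, not the paper's API.
import numpy as np

def hard_topk_mask(x, k):
    """0/1 indicator of the k largest entries of x (discontinuous in x)."""
    mask = np.zeros_like(x)
    mask[np.argpartition(x, -k)[-k:]] = 1.0
    return mask

def pav_decreasing(y):
    """Pool adjacent violators: argmin_v ||v - y||^2 s.t. v_1 >= ... >= v_n."""
    means, sizes = [], []
    for val in y:
        means.append(float(val))
        sizes.append(1)
        # Merge adjacent blocks while they violate the decreasing constraint.
        while len(means) > 1 and means[-2] < means[-1]:
            total = sizes[-2] + sizes[-1]
            merged = (means[-2] * sizes[-2] + means[-1] * sizes[-1]) / total
            means.pop(); sizes.pop()
            means[-1], sizes[-1] = merged, total
    return np.concatenate([np.full(s, m) for m, s in zip(means, sizes)])

def soft_topk_mask(x, k, eps=1.0):
    """Quadratically regularized top-k mask:
        argmax_{y in P(w)} <x, y> - (eps / 2) * ||y||^2,  w = (1,..,1,0,..,0),
    i.e. the Euclidean projection of x / eps onto the permutahedron P(w),
    computed by sorting followed by isotonic regression (PAV).
    """
    n = len(x)
    w = np.zeros(n)
    w[:k] = 1.0                 # sorted vertex encoding "select k entries"
    sigma = np.argsort(-x)      # permutation sorting x in decreasing order
    s = x[sigma] / eps
    v = pav_decreasing(s - w)   # isotonic subproblem
    out = np.zeros(n)
    out[sigma] = s - v          # undo the sort
    return out

x = np.array([3.0, 1.0, 2.0, 0.0])
print(hard_topk_mask(x, k=2))            # [1. 0. 1. 0.]
print(soft_topk_mask(x, k=2, eps=0.5))   # [1. 0. 1. 0.]     hard regime
print(soft_topk_mask(x, k=2, eps=2.0))   # [1. 0.25 0.75 0.] smoothed, still sparse
```

In this p = 2 sketch, sparsity arises because the projection can land on a face of the permutahedron, producing exact zeros, while the regularization makes the mapping continuous and differentiable almost everywhere; larger `eps` gives a smoother, denser mask.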

Related research

02/11/2018 · Differentiable Dynamic Programming for Structured Prediction and Attention
Dynamic programming (DP) solves a variety of structured combinatorial pr...

07/15/2021 · Lockout: Sparse Regularization of Neural Networks
Many regression and classification procedures fit a parameterized functi...

07/15/2020 · Fast Differentiable Clipping-Aware Normalization and Rescaling
Rescaling a vector δ⃗∈ℝ^n to a desired length is a common operation in m...

12/04/2017 · Learning Sparse Neural Networks through L_0 Regularization
We propose a practical method for L_0 norm regularization for neural net...

02/20/2014 · Group-sparse Matrix Recovery
We apply the OSCAR (octagonal selection and clustering algorithms for re...

03/28/2016 · Sparse Activity and Sparse Connectivity in Supervised Learning
Sparseness is a useful regularizer for learning in a wide range of appli...

07/11/2019 · A General Decoupled Learning Framework for Parameterized Image Operators
Many different deep networks have been used to approximate, accelerate o...
