Deep Learning Meets Projective Clustering

10/08/2020
by Alaa Maalouf et al.

A common approach to compressing NLP networks is to encode the embedding layer as a matrix A ∈ ℝ^{n×d}, compute its rank-j approximation A_j via SVD, and then factor A_j into a pair of matrices that correspond to two smaller fully-connected layers replacing the original embedding layer. Geometrically, the rows of A represent points in ℝ^d, and the rows of A_j represent their projections onto the j-dimensional subspace that minimizes the sum of squared distances ("errors") to those points. In practice, the rows of A may be spread across k > 1 subspaces, so factoring A with respect to a single subspace can introduce large errors that translate into large drops in accuracy. Inspired by projective clustering from computational geometry, we suggest replacing this single subspace with a set of k subspaces, each of dimension j, that minimizes the sum of squared distances from every point (row of A) to its closest subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer with a set of k small layers that operate in parallel and are then recombined with a single fully-connected layer. Extensive experiments on the GLUE benchmark yield networks that are both smaller and more accurate than those obtained with the standard matrix-factorization (SVD) approach. For example, we further compress DistilBERT by reducing the size of its embedding layer by 40% while incurring only a 0.5% average drop in accuracy over all nine GLUE tasks, compared to a 2.8% drop with the existing SVD approach. On RoBERTa we achieve 43% compression of the embedding layer with less than a 0.8% average drop in accuracy, compared to a 3% drop previously. Open code for reproducing and extending our results is provided.
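For intuition, here is a minimal NumPy sketch of the geometric idea: the SVD baseline factors A into two small matrices, while the (k, j)-projective-clustering variant assigns each row of A to one of k j-dimensional subspaces and stores only its j coordinates in that subspace. The Lloyd-style alternation, function names, and toy sizes below are illustrative assumptions; they are not the paper's exact algorithm, its parallel-layer architecture, or its released code.

```python
import numpy as np

def rank_j_factors(A, j):
    """Standard SVD baseline: A ~= B @ C with B (n x j) and C (j x d),
    i.e. the two small fully-connected layers that replace the embedding."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :j] * S[:j], Vt[:j, :]

def k_subspace_clustering(A, k, j, n_iters=20, seed=0):
    """(k, j)-projective clustering via a Lloyd-style heuristic (an assumption,
    not necessarily the paper's algorithm): alternate between assigning each
    row to its closest j-dimensional subspace and refitting each subspace by
    SVD on the rows assigned to it."""
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    labels = rng.integers(0, k, size=n)            # random initial assignment
    for _ in range(n_iters):
        bases = []
        for c in range(k):
            rows = A[labels == c]
            if len(rows) < j:                      # re-seed tiny/empty clusters
                rows = A[rng.integers(0, n, size=j)]
            _, _, Vt = np.linalg.svd(rows, full_matrices=False)
            bases.append(Vt[:j, :])                # orthonormal basis, j x d
        # Residual norm of every row w.r.t. each subspace; assign to the closest
        # (argmin is the same whether or not the distances are squared).
        dists = np.stack([np.linalg.norm(A - (A @ V.T) @ V, axis=1)
                          for V in bases])         # k x n
        labels = dists.argmin(axis=0)
    return labels, bases

# Toy stand-in for an embedding table (real tables are e.g. 30522 x 768).
A = np.random.randn(2000, 64).astype(np.float32)
k, j = 4, 16

labels, bases = k_subspace_clustering(A, k, j, n_iters=10)
coords = np.zeros((A.shape[0], j), dtype=A.dtype)
for c in range(k):
    mask = labels == c
    coords[mask] = A[mask] @ bases[c].T            # j coordinates per row

# Storage: n*j coordinates + k*j*d basis entries (+ n cluster labels),
# versus n*d for the original table.
def embed(token_ids):
    """Reconstruct approximate d-dimensional embeddings for a batch of ids."""
    return np.stack([coords[t] @ bases[labels[t]] for t in token_ids])

approx = embed([0, 1, 2])
print("reconstruction error:", np.linalg.norm(A[:3] - approx))
```

With these factors, looking up a token costs a j-dimensional fetch plus one j×d product against its cluster's basis, which mirrors the role of the k small parallel layers recombined by a single fully-connected layer in the proposed architecture.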

