Towards Memory-Efficient Training for Extremely Large Output Spaces – Learning with 500k Labels on a Single Commodity GPU

06/06/2023
by Erik Schultheis, et al.

In classification problems with large output spaces (up to millions of labels), the last layer can require an enormous amount of memory. Using sparse connectivity would drastically reduce the memory requirements, but as we show below, it can severely degrade the predictive performance of the model. Fortunately, we found that this can be mitigated by introducing a penultimate layer of intermediate size. We further demonstrate that the connectivity of the sparse layer can be constrained to be uniform, in the sense that each output neuron has exactly the same number of incoming connections. This allows for efficient implementations of sparse matrix multiplication and connection redistribution on GPU hardware. Via a custom CUDA implementation, we show that the proposed approach can scale to datasets with 670,000 labels on a single commodity GPU with only 4 GB of memory.
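To make the uniform-connectivity idea concrete, the following NumPy sketch shows one way a sparse output layer with a fixed fan-in per label could look. It is an illustration only, not the authors' custom CUDA kernel; the class and parameter names (UniformSparseLayer, fan_in) are assumptions introduced here for exposition.

    # Illustrative sketch (assumed names): a sparse output layer in which
    # every label receives exactly `fan_in` incoming connections, so the
    # nonzero weights and their column indices pack into dense arrays.
    import numpy as np

    class UniformSparseLayer:
        def __init__(self, in_features, num_labels, fan_in, seed=None):
            rng = np.random.default_rng(seed)
            # For each label, pick which `fan_in` input units it connects to.
            self.indices = np.stack(
                [rng.choice(in_features, size=fan_in, replace=False)
                 for _ in range(num_labels)]
            )                                               # (num_labels, fan_in)
            self.weights = rng.standard_normal((num_labels, fan_in)) * fan_in ** -0.5
            self.bias = np.zeros(num_labels)

        def forward(self, x):
            # x: (batch, in_features). Gather only the connected inputs for
            # every label, then contract along the fan-in dimension.
            gathered = x[:, self.indices]                   # (batch, num_labels, fan_in)
            return np.einsum("blf,lf->bl", gathered, self.weights) + self.bias

    # Example: 32 intermediate features, 1000 labels, 8 connections per label.
    layer = UniformSparseLayer(in_features=32, num_labels=1000, fan_in=8, seed=0)
    scores = layer.forward(np.random.default_rng(1).standard_normal((4, 32)))
    print(scores.shape)  # (4, 1000)

Because every output row has the same number of nonzeros, both the weight values and the column indices fit into dense (num_labels, fan_in) arrays, so memory grows with num_labels * fan_in rather than num_labels * in_features, and the gather-and-contract pattern maps naturally onto GPU kernels of the kind described in the paper.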
