Memory-Efficient Implementation of DenseNets

07/21/2017
by Geoff Pleiss, et al.

The DenseNet architecture is highly computationally efficient as a result of feature reuse. However, a naive DenseNet implementation can require a significant amount of GPU memory: If not properly managed, pre-activation batch normalization and contiguous convolution operations can produce feature maps that grow quadratically with network depth. In this technical report, we introduce strategies to reduce the memory consumption of DenseNets during training. By strategically using shared memory allocations, we reduce the memory cost for storing feature maps from quadratic to linear. Without the GPU memory bottleneck, it is now possible to train extremely deep DenseNets. Networks with 14M parameters can be trained on a single GPU, up from 4M. A 264-layer DenseNet (73M parameters), which previously would have been infeasible to train, can now be trained on a single workstation with 8 NVIDIA Tesla M40 GPUs. On the ImageNet ILSVRC classification dataset, this large DenseNet obtains a state-of-the-art single-crop top-1 error of 20.26%.
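A minimal PyTorch sketch of the recomputation idea described above. This is not the paper's implementation (which uses pre-allocated shared memory buffers); it substitutes PyTorch's generic torch.utils.checkpoint, and the class and parameter names (_DenseLayer, growth_rate, bn_size) are illustrative. The quadratically growing concatenation/BN/ReLU intermediates are recomputed during the backward pass instead of being stored, so only the linearly growing convolution outputs are kept.

    import torch
    import torch.nn as nn
    import torch.utils.checkpoint as cp


    class _DenseLayer(nn.Module):
        """One dense layer: pre-activation BN -> ReLU -> 1x1 conv (bottleneck),
        then BN -> ReLU -> 3x3 conv producing `growth_rate` new feature maps."""

        def __init__(self, num_input_features, growth_rate, bn_size=4):
            super().__init__()
            self.norm1 = nn.BatchNorm2d(num_input_features)
            self.relu1 = nn.ReLU(inplace=True)
            self.conv1 = nn.Conv2d(num_input_features, bn_size * growth_rate,
                                   kernel_size=1, bias=False)
            self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
            self.relu2 = nn.ReLU(inplace=True)
            self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate,
                                   kernel_size=3, padding=1, bias=False)

        def _bottleneck(self, *prev_features):
            # The concatenation and BN/ReLU outputs are the intermediates that
            # grow quadratically with depth; they are cheap to recompute, so we
            # avoid storing them for the backward pass.
            concat = torch.cat(prev_features, dim=1)
            return self.conv1(self.relu1(self.norm1(concat)))

        def forward(self, prev_features):
            if self.training and any(f.requires_grad for f in prev_features):
                # Recompute the bottleneck during backprop: trades some extra
                # forward computation for linear (not quadratic) memory growth.
                bottleneck = cp.checkpoint(self._bottleneck, *prev_features)
            else:
                bottleneck = self._bottleneck(*prev_features)
            return self.conv2(self.relu2(self.norm2(bottleneck)))

In a dense block, each layer would receive the list of all preceding layers' outputs as prev_features and append its own output to that list; only those per-layer outputs are retained, while the concatenated inputs are rebuilt on demand.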
