L3 Fusion: Fast Transformed Convolutions on CPUs

12/04/2019
by Rati Gelashvili, et al.

Fast convolutions via transforms, either Winograd or FFT, have emerged as a preferred way of computing convolutional layers, as they greatly reduce the number of required operations. Recent work shows that, for many layer structures, a well-designed implementation of fast convolutions can make effective use of modern CPUs, significantly reducing the compute time. However, the generous amount of shared L3 cache present on modern CPUs is often neglected, and the algorithms are optimized solely for the private L2 cache. In this paper we propose an efficient `L3 Fusion` algorithm that is specifically designed for CPUs with a significant amount of shared L3 cache. Using the hierarchical roofline model, we show that in many cases, especially for layers with fewer channels, the `L3 fused` approach can greatly outperform the standard three-stage approach provided by big vendors such as Intel. We validate our theoretical findings by benchmarking our `L3 fused` implementation against the publicly available state of the art.
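To make the claim about reduced operation counts concrete, the following is a minimal sketch of the textbook one-dimensional Winograd algorithm F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs. It is an illustration only and is not the paper's `L3 Fusion` implementation; the function names `direct_f23` and `winograd_f23` are hypothetical.

```python
def direct_f23(d, g):
    """Direct computation: 2 outputs of a 3-tap filter, 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [y0, y1]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs with 4 multiplications.

    The filter transform (the two halved sums) depends only on g, so in a
    convolutional layer it can be computed once and reused for every tile.
    """
    # Filter transform (reusable across input tiles).
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform followed by 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]   # input tile of 4 values
g = [1.0, 2.0, 3.0]        # 3-tap filter
print(direct_f23(d, g), winograd_f23(d, g))
```

Both functions return the same pair of outputs; the saving comes from trading multiplications for cheaper additions in the input, filter, and output transforms, which correspond to the separate stages that a conventional implementation runs one after another.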
