A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets
Most investigations into near-memory hardware accelerators for deep neural networks have focused on inference, while the potential of accelerating training has received relatively little attention. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) identifying the requirements for efficient data address generation and developing an efficient accelerator offloading scheme that reduces overhead by 7x over previously published results; and (ii) supporting a rich set of operations that allows for efficient calculation of the back-propagation phase. The low control overhead allows up to 8 NTX engines to be controlled by a simple processor. Evaluations in a near-memory computing scenario, where the accelerator is placed on the logic base die of a Hybrid Memory Cube, demonstrate a 2.6x energy efficiency improvement over contemporary GPUs at 4.4x less silicon area, and an average compute performance of 1.01 Tflop/s for training large state-of-the-art networks with full floating-point precision. The architecture is scalable and paves the way towards efficient deep learning in a distributed near-memory setting.