Machine Learning Training on a Real Processing-in-Memory System

06/13/2022
by Juan Gómez-Luna, et al.

Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., computing systems with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training. To do so, we (1) implement several representative classic machine learning algorithms (namely, linear regression, logistic regression, decision tree, K-means clustering) on a real-world general-purpose PIM architecture, (2) characterize them in terms of accuracy, performance, and scaling, and (3) compare them to their counterpart implementations on CPU and GPU. Our experimental evaluation on a memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound machine learning workloads, when the necessary operations and datatypes are natively supported by PIM hardware. To our knowledge, our work is the first to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
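The abstract's caveat about operations and datatypes matters because general-purpose PIM cores (such as UPMEM's DPUs) natively support integer arithmetic but not hardware floating point, so training kernels are typically expressed in integer or fixed-point form. The sketch below is an illustration of that idea, not the authors' code: a mini-batch gradient-descent step for linear regression written entirely in 32-bit fixed-point arithmetic. The Q16.16 format, feature count, batch size, and learning rate are arbitrary assumptions chosen for the example.

```c
/* Illustrative sketch (not the paper's implementation): mini-batch gradient
 * descent for linear regression in 32-bit fixed-point arithmetic, the kind
 * of integer-only kernel that maps onto PIM cores without native FP units.
 * Q16.16 format, FEATURES, BATCH, and the learning rate are assumptions. */
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 16   /* Q16.16 fixed-point format */
#define FEATURES  4
#define BATCH     8

/* Multiply two Q16.16 numbers, keeping the result in Q16.16. */
static int32_t fxmul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* One mini-batch update of w using the squared-error gradient. */
static void sgd_step(int32_t w[FEATURES],
                     const int32_t x[BATCH][FEATURES],
                     const int32_t y[BATCH],
                     int32_t lr /* learning rate, Q16.16 */) {
    int32_t grad[FEATURES] = {0};

    for (int i = 0; i < BATCH; i++) {
        /* Prediction: dot product of weights and features. */
        int32_t pred = 0;
        for (int j = 0; j < FEATURES; j++)
            pred += fxmul(w[j], x[i][j]);

        /* Accumulate gradient of 0.5*(pred - y)^2 w.r.t. each weight. */
        int32_t err = pred - y[i];
        for (int j = 0; j < FEATURES; j++)
            grad[j] += fxmul(err, x[i][j]);
    }

    /* Average over the batch and step against the gradient. */
    for (int j = 0; j < FEATURES; j++)
        w[j] -= fxmul(lr, grad[j] / BATCH);
}

int main(void) {
    /* Toy data: y = 2*x0, all other features zero, encoded in Q16.16. */
    int32_t x[BATCH][FEATURES] = {{0}}, y[BATCH];
    for (int i = 0; i < BATCH; i++) {
        x[i][0] = (i + 1) << FRAC_BITS;
        y[i]    = (2 * (i + 1)) << FRAC_BITS;
    }

    int32_t w[FEATURES] = {0};
    int32_t lr = (int32_t)(0.01 * (1 << FRAC_BITS));

    for (int epoch = 0; epoch < 200; epoch++)
        sgd_step(w, x, y, lr);

    printf("w[0] ~= %f (expected 2.0)\n", (double)w[0] / (1 << FRAC_BITS));
    return 0;
}
```

In a PIM setting, a loop like `sgd_step` would run in parallel on many PIM cores, each iterating over the slice of the training set resident in its local memory bank, which is precisely how such systems avoid moving the dataset to a central processor.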
