Enabling High-Capacity, Latency-Tolerant, and Highly-Concurrent GPU Register Files via Software/Hardware Cooperation

by   Mohammad Sadrosadati, et al.

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts while prefetching the registers could greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 34


page 22

page 24


A GPU Register File using Static Data Compression

GPUs rely on large register files to unlock thread-level parallelism for...

Information Freshness in Cache Updating Systems with Limited Cache Storage Capacity

We consider a cache updating system with a source, a cache with limited ...

RRCD: Redirección de Registros Basada en Compresión de Datos para Tolerar FallosPermanentes en una GPU

The ever-increasing parallelism demand of General-Purpose Graphics Proce...

Effectiveness and predictability of in-network storage cache for scientific workflows

Large scientific collaborations often have multiple scientists accessing...

GREENER: A Tool for Improving Energy Efficiency of Register Files

Graphics Processing Units (GPUs) maintain a large register file to incre...

ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis

We propose ZnG, a new GPU-SSD integrated architecture, which can maximiz...

Efficient Deep Learning Using Non-Volatile Memory Technology

Embedded machine learning (ML) systems have now become the dominant plat...

Please sign up or login with your details

Forgot password? Click here to reset