MeLoPPR: Software/Hardware Co-design for Memory-efficient Low-latency Personalized PageRank

04/13/2021
by Lixiang Li, et al.

Personalized PageRank (PPR) is a graph algorithm that evaluates the importance of the nodes surrounding a given source node. Widely used in social-network applications such as recommender systems, PPR requires real-time (low-latency) responses for a good user experience. Existing works either focus on algorithmic optimizations that improve precision while neglecting hardware implementation, or focus on distributed global graph processing on large-scale systems that improves throughput rather than response time. Optimizing a low-latency local PPR algorithm under a tight memory budget on edge devices remains unexplored. In this work, we propose MeLoPPR, a memory-efficient, low-latency PPR solution with a greatly reduced memory requirement and a flexible trade-off between latency and precision. MeLoPPR combines stage decomposition with linear decomposition and exploits node-score sparsity: through stage and linear decomposition, MeLoPPR breaks the computation on a large graph into a set of smaller sub-graphs, which significantly reduces the computation memory; through sparsity exploitation, MeLoPPR selectively processes the sub-graphs that contribute the most to precision, reducing the required computation. In addition, through software/hardware co-design, we propose a hardware implementation on a hybrid CPU and FPGA accelerating platform that further speeds up the sub-graph computation. We evaluate MeLoPPR on memory-constrained devices, including a personal laptop and a Xilinx Kintex-7 KC705 FPGA, using six real-world graphs. First, MeLoPPR demonstrates significant memory savings of 1.5x to 13.4x on CPU and 73x to 8699x on FPGA. Second, MeLoPPR allows flexible trade-offs between precision and execution time: when the precision is 80%, the speedup on CPU is up to 15x and up to 707x on FPGA; when the precision is around 90%, the speedup is up to 70x on FPGA.
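The stage and linear decomposition can be sketched informally: because PPR scores are linear in the restart distribution, a PPR computation started at the source can be split into a first stage over a small local sub-graph plus second-stage computations started at the most promising boundary nodes, whose results are folded back with weights taken from the first stage. The following minimal Python sketch illustrates that idea under simplifying assumptions (undirected NetworkX graphs, plain power iteration, hop-limited ego sub-graphs, and an illustrative combination rule); the function names, parameters, and weighting are placeholders, not MeLoPPR's actual kernels or its FPGA implementation.

import numpy as np
import networkx as nx

def local_subgraph(G, source, hops=2):
    # k-hop ego sub-graph around `source`: the small working set each stage touches.
    nodes = nx.single_source_shortest_path_length(G, source, cutoff=hops).keys()
    return G.subgraph(nodes)

def ppr_power_iteration(G, source, alpha=0.15, iters=50):
    # Plain power-iteration PPR restricted to sub-graph G, returned as {node: score}.
    # alpha is the restart (teleport) probability; dangling nodes simply leak mass.
    nodes = list(G.nodes())
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    P = np.zeros((n, n))                      # column-stochastic transition matrix
    for u in nodes:
        nbrs = list(G.neighbors(u))
        for v in nbrs:
            P[idx[v], idx[u]] = 1.0 / len(nbrs)
    e = np.zeros(n)
    e[idx[source]] = 1.0
    x = e.copy()
    for _ in range(iters):
        x = alpha * e + (1 - alpha) * (P @ x)
    return {nodes[i]: x[i] for i in range(n)}

def two_stage_ppr(G, source, alpha=0.15, hops=2, top_k=4):
    # Stage 1: PPR on the source's local sub-graph only.
    sub1 = local_subgraph(G, source, hops)
    scores = ppr_power_iteration(sub1, source, alpha)
    # Boundary nodes have edges leaving the stage-1 sub-graph; keep the top-k by score.
    boundary = [u for u in sub1 if any(v not in sub1 for v in G.neighbors(u))]
    frontier = sorted(boundary, key=scores.get, reverse=True)[:top_k]
    # Stage 2: run PPR from each selected frontier node on its own local sub-graph
    # and fold the results back, weighted by the stage-1 scores (linearity of PPR).
    combined = dict(scores)
    for f in frontier:
        w = scores[f]
        sub2 = local_subgraph(G, f, hops)
        for n2, s2 in ppr_power_iteration(sub2, f, alpha).items():
            combined[n2] = combined.get(n2, 0.0) + (1 - alpha) * w * s2
    return combined

For example, two_stage_ppr(nx.karate_club_graph(), 0) produces an approximate score dictionary while only ever materializing small ego sub-graphs, which is the memory behavior the paper targets; the selective expansion of only the top-k frontier nodes is where the sparsity exploitation trades precision for latency.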

