NumS: Scalable Array Programming for the Cloud

06/28/2022
by Melih Elibol, et al.

Scientists increasingly rely on Python tools to perform scalable distributed-memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often hit memory and network bandwidth bottlenecks because of sub-optimal data and operator placement decisions. Tools built on the Message Passing Interface (MPI), such as ScaLAPACK and SLATE, scale better, but they require specialized knowledge to use. In this work, we present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems. NumS does this through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method that optimizes operator placement by minimizing the maximum memory and network load on any node in a distributed system. Coupled with a heuristic for load-balanced data layouts, our approach can attain communication lower bounds on some common numerical operations. Our empirical study shows that LSHS improves performance on Ray by reducing network load by a factor of 2, memory use by a factor of 4, and execution time by a factor of 10 on the logistic regression problem. On terabyte-scale data, NumS achieves performance competitive with SLATE on DGEMM, up to a 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.
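To make the load-simulated placement idea concrete, here is a minimal Python sketch of a single greedy placement step. It is not NumS's implementation. The `Block` and `LoadSimulator` names, the byte-level load model (a transferred input adds to the sender's outgoing load, the receiver's incoming load, and the receiver's memory), and the objective (the sum of the maximum memory, incoming network, and outgoing network load across nodes) are all assumptions chosen for illustration. For each operator, the sketch simulates placing the output on every node and keeps the node whose placement yields the lowest objective. The hierarchical node-then-worker structure of LSHS is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A hypothetical array block: the node holding it and its size in bytes."""
    node: int
    size: int


class LoadSimulator:
    """Hypothetical per-node load model; not the NumS API.

    Tracks simulated memory, incoming network, and outgoing network load
    (in bytes) for each node in the cluster.
    """

    def __init__(self, num_nodes: int) -> None:
        self.mem = [0] * num_nodes
        self.net_in = [0] * num_nodes
        self.net_out = [0] * num_nodes

    def add_block(self, node: int, size: int) -> Block:
        # Register an existing block so its memory is counted on its node.
        self.mem[node] += size
        return Block(node, size)

    @staticmethod
    def _apply(mem, net_in, net_out, inputs: List[Block], out_size: int, node: int) -> None:
        # Any input not already on `node` must be sent there over the network.
        for b in inputs:
            if b.node != node:
                net_out[b.node] += b.size
                net_in[node] += b.size
                mem[node] += b.size  # the received copy occupies memory on `node`
        mem[node] += out_size  # the operator's output is stored on `node`

    def _cost(self, inputs: List[Block], out_size: int, node: int) -> int:
        # Simulate the placement on copies and score it by the largest loads (assumed objective).
        mem, net_in, net_out = list(self.mem), list(self.net_in), list(self.net_out)
        self._apply(mem, net_in, net_out, inputs, out_size, node)
        return max(mem) + max(net_in) + max(net_out)

    def place(self, inputs: List[Block], out_size: int) -> Block:
        # Try every node, keep the cheapest (ties go to the lowest node index),
        # then commit that choice to the simulated state.
        costs = [self._cost(inputs, out_size, k) for k in range(len(self.mem))]
        best = min(range(len(costs)), key=costs.__getitem__)
        self._apply(self.mem, self.net_in, self.net_out, inputs, out_size, best)
        return Block(best, out_size)


# Usage: two nodes, one block on each.
sim = LoadSimulator(num_nodes=2)
a = sim.add_block(node=0, size=100)
b = sim.add_block(node=1, size=50)
c = sim.place([a, b], out_size=100)  # binary op: moving the smaller input is cheaper
print(c.node)  # → 0
d = sim.place([b], out_size=50)      # unary op on b
print(d.node)  # → 1
```

Committing each choice to the simulated state is what lets later placements account for the load created by earlier ones. In the example, the unary operation on `b` stays on node 1 because placing it on node 0 would raise node 0's memory and both network maxima.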

