The Noir Dataflow Platform: Efficient Data Processing without Complexity

06/07/2023
by Luca De Martini, et al.

Today, data analysis drives the decision-making process in virtually every human activity. This demands software platforms that offer simple programming abstractions to express data analysis tasks and that can execute them in an efficient and scalable way. State-of-the-art solutions range from low-level programming primitives, which give the developer control over communication and resource usage but require significant effort to develop and optimize new algorithms, to high-level platforms that hide most of the complexities of parallel and distributed processing, but often at the cost of reduced efficiency. To reconcile these requirements, we developed Noir, a novel distributed data processing platform written in Rust. Noir provides a high-level dataflow programming model similar to that of mainstream data processing systems. It supports static and streaming data, and it enables data transformations, grouping, aggregation, iterative computations, and time-based analytics while incurring low overhead. In this paper, we present the programming model and the implementation details of Noir. We evaluate it under heterogeneous workloads. We compare it with state-of-the-art solutions for data analysis and high-performance computing, as well as alternative research products, which offer different programming abstractions and implementation strategies. Noir programs are compact and easy to write: developers need not care about low-level concerns such as resource usage, data serialization, concurrency control, and communication. Noir consistently delivers performance comparable to or better than that of competing solutions, and better by a large margin in several scenarios. We conclude that Noir offers a good tradeoff between simplicity and performance, allowing developers to easily express complex data analysis tasks and achieve high performance and scalability.


