Fries: Fast and Consistent Runtime Reconfiguration in Dataflow Systems with Transactional Guarantees (Extended Version)

10/19/2022
by Zuozhi Wang, et al.

A computing job in a big data system can take a long time to run, especially for pipelined executions on data streams. Developers often need to change the computing logic of a running job, for example to fix a loophole in an operator, or to replace the machine learning model in an operator with a cheaper one in order to handle a sudden increase in the data-ingestion rate. Recently, many systems have started supporting runtime reconfigurations, which allow this type of change on the fly without killing and restarting the execution. Although the delay in reconfiguration is critical to performance, existing systems use epochs to perform runtime reconfigurations, which can cause a long delay. In this paper we develop a new technique called Fries that leverages the emerging availability of fast control messages in many systems; these messages can be sent without being blocked by data messages. We formally define consistency in runtime reconfigurations and develop a Fries scheduler with consistency guarantees. The technique works for different classes of dataflows, works for parallel executions, and supports fault tolerance. Our extensive experimental evaluation on clusters shows the advantages of this technique compared to epoch-based schedulers.
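
To make the delay argument concrete, the following is a minimal toy sketch in Python. It is not the Fries scheduler and does not address the consistency guarantees the paper defines. It only contrasts a reconfiguration that travels in-band behind queued data (as an epoch-style marker would) with one that arrives on a separate control queue that the operator checks before each data item. The `Operator` class, its two queues, and the `old_logic`/`new_logic` functions are all hypothetical names invented for this illustration.

```python
from collections import deque


class Operator:
    """Toy single-threaded operator with separate data and control queues.

    Hypothetical model for illustration only; real dataflow engines differ.
    """

    def __init__(self, fn):
        self.fn = fn              # current processing logic
        self.data = deque()       # in-band queue: tuples and epoch-style markers
        self.control = deque()    # out-of-band queue: reconfiguration requests
        self.output = []

    def step(self):
        # Control messages are drained first, so they never wait behind data.
        while self.control:
            self.fn = self.control.popleft()
        if not self.data:
            return False
        item = self.data.popleft()
        if callable(item):
            # An in-band marker carrying new logic: applied only once all
            # data queued ahead of it has been processed.
            self.fn = item
        else:
            self.output.append(self.fn(item))
        return True


def run(op):
    while op.step():
        pass
    return op.output


def old_logic(x):
    return ("old", x)


def new_logic(x):
    return ("new", x)


# Epoch-style: the update is queued behind tuples 1, 2, 3.
epoch_op = Operator(old_logic)
epoch_op.data.extend([1, 2, 3])
epoch_op.data.append(new_logic)
epoch_op.data.append(4)
epoch_out = run(epoch_op)

# Control-message style: the same update bypasses the queued tuples.
fast_op = Operator(old_logic)
fast_op.data.extend([1, 2, 3])
fast_op.control.append(new_logic)
fast_op.data.append(4)
fast_out = run(fast_op)

print(epoch_out)
print(fast_out)
```

In this toy model the in-band update takes effect only from tuple 4 onward, while the control-queue update applies to all four tuples. Applying an update immediately at every operator can mix old and new logic across a multi-operator dataflow, which is the consistency problem the abstract says the Fries scheduler is designed to handle.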

