Fries: Fast and Consistent Runtime Reconfiguration in Dataflow Systems with Transactional Guarantees (Extended Version)

10/19/2022
by   Zuozhi Wang, et al.

A computing job in a big data system can take a long time to run, especially for pipelined executions on data streams. Developers often need to change the computing logic of the job, such as fixing a loophole in an operator or replacing the machine learning model in an operator with a cheaper one to handle a sudden increase in the data-ingestion rate. Recently, many systems have started supporting runtime reconfigurations to allow this type of change on the fly without killing and restarting the execution. Although the delay in reconfiguration is critical to performance, existing systems use epochs to perform runtime reconfigurations, which can cause a long delay. In this paper, we develop a new technique called Fries that leverages the emerging availability of fast control messages in many systems, since these messages can be sent without being blocked by data messages. We formally define consistency in runtime reconfigurations and develop a Fries scheduler with consistency guarantees. The technique works not only for different classes of dataflows but also for parallel executions, and it supports fault tolerance. Our extensive experimental evaluation on clusters shows the advantages of this technique compared to epoch-based schedulers.
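The difference between an update that travels with the data and one that travels on a separate control channel can be illustrated with a toy, single-worker sketch. This is not the Fries scheduler and does not address consistency across operators; the names `RECONFIGURE`, `run_in_band`, and `run_with_control_channel` are hypothetical and chosen only for illustration. It assumes a worker with a backlog of queued tuples at the moment the reconfiguration is requested.

```python
from collections import deque

# Hypothetical in-band marker used by the epoch-style baseline below.
RECONFIGURE = object()


def run_in_band(queued, old_fn, new_fn):
    """Baseline: the update marker is enqueued behind the backlog of data."""
    fn, out = old_fn, []
    inbox = deque(queued)
    inbox.append(RECONFIGURE)  # the marker waits behind every queued tuple
    while inbox:
        msg = inbox.popleft()
        if msg is RECONFIGURE:
            fn = new_fn  # applied only after the backlog has drained
        else:
            out.append(fn(msg))
    return out


def run_with_control_channel(queued, old_fn, new_fn):
    """Sketch: the update arrives on a separate queue that is checked first."""
    fn, out = old_fn, []
    data = deque(queued)
    control = deque([new_fn])  # request arrives while the backlog is pending
    while data:
        while control:  # control messages are never blocked by data
            fn = control.popleft()
        out.append(fn(data.popleft()))
    return out


backlog = [1, 2, 3]
old = lambda x: ("v1", x)  # operator logic before the change
new = lambda x: ("v2", x)  # operator logic after the change

in_band = run_in_band(backlog, old, new)
fast = run_with_control_channel(backlog, old, new)
print(in_band)
print(fast)
```

In this sketch the in-band version processes the whole backlog with the old logic, while the control-channel version switches before the next tuple. With several operators, applying updates this eagerly can mix old and new logic on the same tuple, which is the consistency problem the abstract says the Fries scheduler is designed to handle.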


