Stability and Optimization of Speculative Queueing Networks

04/21/2021
by Jonatha Anselmi, et al.

We provide a queueing-theoretic framework for job replication schemes based on the principle "replicate a job as soon as the system detects it as a straggler". This is called job speculation. Recent works have analyzed replication on arrival, which we refer to as replication. Replication is motivated by its implementation in Google's BigTable. However, systems such as Apache Spark and Hadoop MapReduce implement speculative job execution, whose performance and optimization are not well understood. To this end, we propose a queueing network model for load balancing where each server can speculate on the execution time of a job. Specifically, each job is initially assigned to a single server by a frontend dispatcher. Then, when its execution begins, the server sets a timeout. If the job completes before the timeout, it leaves the network; otherwise, the job is terminated and relaunched or resumed at another server, where it will complete. We provide a necessary and sufficient condition for the stability of speculative queueing networks with heterogeneous servers, general job sizes and scheduling disciplines. We find that speculation can increase the stability region of the network when compared with standard load balancing models and replication schemes. We provide general conditions under which timeouts increase the size of the stability region and derive a formula for the optimal speculation time, i.e., the timeout that minimizes the load induced through speculation. We compare speculation with redundant-d and redundant-to-idle-queue-d rules under an S&X model. For lightly loaded systems, redundancy schemes provide better response times. However, for moderate to heavy loads, redundancy schemes can lose capacity and have markedly worse response times when compared with a speculative scheme.
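To make the timeout mechanism concrete, the following is a minimal sketch of how the work spent on a single job depends on the timeout. It is not the formula derived in the paper. It assumes a hypothetical discrete execution-time distribution, and it assumes that a timed-out job is relaunched from scratch at a second server with a fresh, independent execution time and then runs to completion there. The names `expected_work` and `best_timeout` are illustrative only.

```python
import math

def expected_work(dist, tau):
    """Expected total server time spent on one job under timeout tau.

    dist: list of (execution_time, probability) pairs (hypothetical input).
    Assumption: a job that times out is relaunched from scratch elsewhere
    with an independent execution time drawn from the same distribution.
    """
    mean = sum(x * p for x, p in dist)
    # Time consumed at the first server: the job runs until it finishes
    # or until the timeout fires, whichever comes first.
    first = sum(min(x, tau) * p for x, p in dist)
    # Probability that the timeout fires before the job completes.
    p_timeout = sum(p for x, p in dist if x > tau)
    # A timed-out job costs one more full execution at the second server.
    return first + p_timeout * mean

def best_timeout(dist):
    """Pick the candidate timeout with the least expected work.

    Candidates are the support points of dist plus 'no timeout' (infinity).
    """
    candidates = sorted({x for x, _ in dist}) + [math.inf]
    return min(((tau, expected_work(dist, tau)) for tau in candidates),
               key=lambda pair: pair[1])

if __name__ == "__main__":
    # Hypothetical workload: 90% of jobs take 1 time unit, 10% are stragglers.
    dist = [(1.0, 0.9), (100.0, 0.1)]
    print(expected_work(dist, math.inf))  # no speculation
    print(best_timeout(dist))
```

In this toy workload, a timeout placed right at the typical execution time cuts off stragglers early, so the expected work per job is far below the no-timeout case. A timeout shorter than the typical execution time would instead abort every job and add load, which illustrates why the choice of speculation time matters.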

research
03/03/2020

Improving the performance of heterogeneous data centers through redundancy

We analyze the performance of redundancy in a multi-type job and multi-t...
research
08/08/2020

Achievable Stability in Redundancy Systems

We consider a system with N parallel servers where incoming jobs are imm...
research
01/21/2017

Light traffic behavior under the power-of-two load balancing strategy: The case of heterogeneous servers

We consider a multi-server queueing system under the power-of-two policy...
research
05/27/2020

Threshold-based rerouting and replication for resolving job-server affinity relations

We consider a system with several job types and two parallel server pool...
research
11/14/2019

Optimal Server Selection for Straggler Mitigation

The performance of large-scale distributed compute systems is adversely ...
research
05/28/2021

Fork-join and redundancy systems with heavy-tailed job sizes

We investigate the tail asymptotics of the response time distribution fo...
research
06/27/2020

Queues with Small Advice

Motivated by recent work on scheduling with predicted job sizes, we cons...