Mitigating the Performance Impact of Network Failures in Public Clouds

05/23/2023
by Pooria Namyar, et al.

Some faults in data center networks take hours to days to repair because they may need reboots, re-imaging, or manual work by technicians. To reduce the impact on traffic, cloud providers mitigate the effect of faults, for example by steering traffic to alternate paths. The state of the art in automatic network mitigation uses simple safety checks and proxy metrics to choose mitigations. SWARM, the approach described in this paper, can pick mitigations that are orders of magnitude better by estimating end-to-end connection-level performance (CLP) metrics. At its core is a scalable CLP estimator that quickly ranks mitigations with high fidelity and, on failures observed at a large cloud provider, outperforms the state of the art by over 700× in some cases.
