Optimizing Waiting Thresholds Within A State Machine

10/08/2018
by Rohit Pandey, et al.

Azure (the cloud service provided by Microsoft) is composed of physical computing units called nodes. These nodes are controlled by a software component called the Fabric Controller (FC), which can consider a node to be in one of many different states, such as Ready, Unhealthy, Booting, etc. Some of these states correspond to a node being unresponsive to FC's requests. When a node stays unresponsive for longer than a set threshold, FC intervenes and reboots the node. We minimize the downtime associated with this intervention threshold for nodes that switch to the Unhealthy state by fitting various heavy-tailed probability distributions to model their organic recovery. We then consider using features of the node to customize the organic recovery model to the individual nodes that go unhealthy. This regression approach allows us to use information about the node, such as hardware, software versions, and historical performance indicators, to inform the organic recovery model and hence the optimal threshold. In another direction, we consider generalizing this to an arbitrary number of thresholds within the node state machine (or Markov chain). When the states become intertwined in such a way that different thresholds start affecting each other, we cannot simply optimize each of them in isolation. For best results, we must treat this as an optimization problem in many variables (one per threshold). This more complex problem no longer has the closed-form solution that the single-threshold case does, but it can still be solved with numerical techniques such as gradient descent.
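The single-threshold idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: organic recovery time is modeled with a Lomax (Pareto type II) distribution with shape `alpha` and scale `sigma`, a reboot adds a fixed downtime `reboot_cost`, and the expected downtime for threshold tau is E[min(T, tau)] + P(T > tau) * reboot_cost. Under those assumptions the minimizer is the point where the hazard rate falls to 1 / reboot_cost, which for the Lomax model gives tau = alpha * reboot_cost - sigma. The function names are hypothetical.

```python
import math


def lomax_survival(t, alpha, sigma):
    """P(T > t) for the assumed Lomax organic-recovery model."""
    return (1.0 + t / sigma) ** (-alpha)


def expected_downtime(tau, alpha, sigma, reboot_cost):
    """Assumed cost model: time spent waiting plus reboot penalty if the wait fails.

    E[min(T, tau)] is the integral of the survival function from 0 to tau;
    the closed form below is valid for alpha != 1.
    """
    waited = sigma / (alpha - 1.0) * (1.0 - (1.0 + tau / sigma) ** (1.0 - alpha))
    return waited + lomax_survival(tau, alpha, sigma) * reboot_cost


def closed_form_threshold(alpha, sigma, reboot_cost):
    """Threshold where the Lomax hazard alpha / (sigma + tau) equals 1 / reboot_cost."""
    return max(0.0, alpha * reboot_cost - sigma)


def golden_section_min(f, lo, hi, tol=1e-4):
    """Numerically minimize a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) < f(d):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Illustrative parameter values only (seconds); not fitted to any real data.
    alpha, sigma, reboot_cost = 2.0, 60.0, 600.0
    tau_closed = closed_form_threshold(alpha, sigma, reboot_cost)
    tau_numeric = golden_section_min(
        lambda t: expected_downtime(t, alpha, sigma, reboot_cost), 0.0, 10000.0
    )
    print("closed-form threshold:", tau_closed)
    print("numerical threshold:  ", tau_numeric)
```

With the illustrative values above, the closed-form threshold is 2 * 600 - 60 = 1140 seconds, and the numerical search should agree with it closely. The numerical routine is included because, once several interacting thresholds are optimized jointly, a closed form is generally not available and a numerical method has to take its place.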


research
11/06/2019

Asymptotic Analysis Based Greedy Method for Threshold-Based Distributed Optimization of Persistent Monitoring on Graphs

We consider the optimal multi-agent persistent monitoring problem define...
research
07/31/2021

Application of hypercomplex number system in the dynamic network model

In recent years, the direction of the study of networks in which connect...
research
11/06/2019

Asymptotic Analysis for Greedy Initialization of Threshold-Based Distributed Optimization of Persistent Monitoring on Graphs

We consider the optimal multi-agent persistent monitoring problem define...
research
04/30/2019

Some results on multithreshold graphs

Jamison and Sprague defined a graph G to be a k-threshold graph with thr...
research
07/30/2018

Distributed Stochastic Optimization in Networks with Low Informational Exchange

We consider a distributed stochastic optimization problem in networks wi...
research
02/28/2021

A Central Limit Theorem for Diffusion in Sparse Random Graphs

We consider bootstrap percolation and diffusion in sparse random graphs ...
research
07/19/2019

Learning sparsity in reservoir computing through a novel bio-inspired algorithm

The mushroom body is the key network for the representation of learned o...
