Annual Interruption Rate as a KPI, its measurement and comparison
This article is divided into two chapters. The first chapter describes the failure rate as a KPI and studies its properties. The second covers ways to compare this KPI across two groups using statistical hypothesis testing. In section 1, we motivate the failure rate as a KPI (in Azure, it is dubbed 'Annual Interruption Rate', or AIR). In section 3, we discuss measuring the failure rate from the logs that machines typically generate. In section 1.2, we discuss the problem of measuring it from real-world data. In section 2.1, we discuss the general concepts of hypothesis testing. In section 2.2, we go over some general count distributions for modeling Azure reboots. In section 2.3, we describe experiments that apply various hypothesis tests to simulated data. In section 2.4, we discuss applications of this work, such as using these statistical methods to catch regressions in failure rate and estimating how long changes to the system need to 'bake' before we are reasonably sure they did not regress the failure rate.
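To make the two ideas above concrete, here is a minimal sketch of a failure rate expressed per machine-year and a one-sided test for whether a changed group fails more often than a baseline group. It is an illustration under stated assumptions, not the article's method: the function names, the use of machine-hours as the exposure unit, the 365-day year, and the Poisson model for interruption counts are all assumptions introduced here.

```python
from math import comb

# Assumption: a year is 365 days; the article may annualize differently.
HOURS_PER_YEAR = 24 * 365


def annualized_rate(interruptions, machine_hours):
    """Interruptions per machine-year of observed time (hypothetical helper)."""
    return interruptions / machine_hours * HOURS_PER_YEAR


def regression_p_value(n_control, t_control, n_treat, t_treat):
    """One-sided exact conditional test for two event rates.

    Assumption: counts in each group are Poisson with rate * exposure.
    Under the null of equal rates, and given the total count n, the
    treatment count is Binomial(n, t_treat / (t_control + t_treat)).
    Returns P(X >= n_treat) under that null; small values suggest the
    treatment group has the higher rate.
    """
    n = n_control + n_treat
    p = t_treat / (t_control + t_treat)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n_treat, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(annualized_rate(3, 8760))
    print(regression_p_value(5, 1000.0, 20, 1000.0))
```

Conditioning on the total count removes the unknown common rate from the null distribution, which is why only the exposure ratio appears in the test. If interruption counts are overdispersed relative to a Poisson model, this test will tend to overstate significance.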