Breaking hypothesis testing for failure rates

01/13/2020 ∙ by Rohit Pandey, et al. ∙ Microsoft

We describe the utility of point processes and failure rates and the most common point process for modeling failure rates, the Poisson point process. Next, we describe the uniformly most powerful test for comparing the rates of two Poisson point processes for a one-sided test (henceforth referred to as the "rate test"). A common argument against using this test is that real world data rarely follows the Poisson point process. We thus investigate what happens when the distributional assumptions of tests like these are violated and the test still applied. We find a non-pathological example (using the rate test on a Compound Poisson distribution with Binomial compounding) where violating the distributional assumptions of the rate test makes it perform better (lower error rates). We also find that if we replace the distribution of the test statistic under the null hypothesis with any other arbitrary distribution, the performance of the test (described in terms of the false negative rate to false positive rate trade-off) remains exactly the same. Next, we compare the performance of the rate test to a version of the Wald test customized to the Negative Binomial point process and find it to perform very similarly while being much more general and versatile. Finally, we discuss the applications to Microsoft Azure. The code for all experiments performed is open source and linked in the introduction.




Stochastic point processes are useful tools for modeling point-in-time events (like earthquakes, supernova explosions, machine or organ failure, etc.). Hence, they are ubiquitous across industries as varied as cloud computing, health care, climatology, etc. Two of the core properties of point processes are the rate of event arrival (how many events per unit time) and the inter-arrival time between successive events (for example, how long a machine is expected to run before it fails).

At Microsoft Azure, we have realized that machine failures are most conveniently described by point processes and have framed our KPIs (Key Performance Indicators, numbers that serve as a common language across the organization to gauge performance) around failure rates for them. Hence, we dedicate section-I to event rates for point processes. The simplest point process for modeling these random variables, and the only one that has a constant failure rate is the Poisson point process. Hence, that process will act as our base.

Now, it is very important for us at Azure to be able to perform statistical inference on these rates (given our core KPI is devised around them) using, for example, hypothesis testing. When a new feature is deployed, we want to be able to say if the failure rate is significantly worse in the treatment group that received it vis-a-vis a control group that didn’t. Another field where hypothesis testing on failure rates is an active area of research is medicine (see for example, [3]). Hence, we describe the “uniformly most powerful test” for comparing failure rates in section II and study its properties. In doing so, we reach some very interesting conclusions.

In hypothesis testing, we always assume some distributions for the two groups we want to compare. A common theme across the body of research on hypothesis testing appears to be a resistance to violating this expectation too much (for example, the authors in [3] refer to the false positive rate getting inflated when the distributional assumptions are invalidated and recommend not using the test in those scenarios). However, as we know, all models are wrong. This applies to any distributional assumption we pick to model our data - we can bet on the real data diverging from these assumptions to varying degrees.

We therefore put our hypothesis test to the test by conducting some experiments where we willfully violate the distributional assumptions of our test (for example, using a negative binomial point process instead of Poisson, even though the test is devised with a Poisson assumption in mind) and study the consequences. We find some non-pathological scenarios where it turns out that violating the underlying distributional assumptions of the test to a larger extent actually makes it better (where “better” is defined as having a better false negative to false positive rate trade-off). This is covered in section III-B. Hence, we challenge the notion that violating the distributional assumptions of the test is necessarily a bad thing to be avoided.

We also reach an interesting conclusion that if we swap out the distribution of the null hypothesis with any other distribution under the sun, the trade off between the false negative rate and false positive rate remains unchanged. This conclusion holds not just for the rate test, but any one sided test. For example, if we take the famous two sample t-test for comparing means and replace the t-distribution with (for example) some weird multi-modal distribution, the false negative to false positive rate trade off will remain unchanged. These experiments are covered in section III.

Next, we measure the performance of our test, designed for the Poisson point process, on a negative binomial point process and compare it to the state of the art hypothesis test designed for negative binomial point processes, finding that it fares quite well. These comparisons are covered in section IV. Finally, in section V we cover the applications to Microsoft Azure and the business impact of this work. All the code is open sourced and available on Github. For example, see here for all plots you’ll find in this paper and here for relevant tests on the library.

1 Failure rates and the Poisson process

Over the years, the core KPI used to track availability within Azure has shifted and evolved. For a long time, it was the total duration of customer VM (Virtual machine - the unit leased to Azure customers) downtime across the fleet. However, there were three issues with using this as a KPI:

  • It wasn’t normalized, meaning that if we compare it across two groups with the first one having more activity, we can obviously expect more downtime duration as well.

  • It wasn’t always aligned with customer experience. For example, a process causing many reboots each with a short duration wouldn’t move the overall downtime duration by much and hence not get prioritized for fixing. However, it would still degrade customer experience especially when their workloads were sensitive to any interruptions. Customers running gaming workloads for example tend to fall into this category.

  • It was much harder for our telemetry to accurately capture how long a VM was down for, as opposed to simply stating that there was an interruption in service around some time frame.

The logical thing to do would be to define the KPI in terms of interruptions and that would at least address the second and third problems. However, the issue remained that it wasn’t normalized. For example, as the size of the Azure fleet grows over time, we expect the number of interruptions across the fleet to increase as well. But then, if we see the number of fleet-wide interruptions increasing over time, how do we tell how much of it is due to the size increasing and how much can be attributed to the platform potentially regressing?

To address these problems, a new KPI called the ‘Annual Interruption Rate’ or AIR was devised, which is basically a normalized form of interruptions. Before describing it, let’s take a highly relevant detour into the concept of “hazard rate”. It can be interpreted as the instantaneous rate at which events from a point process are occurring, much like velocity is the instantaneous rate at which something is covering distance.

This rate can be expressed in terms of properties of the distribution representing the times elapsing between the events of interest, which in this case might be VM reboots. Since this time between two successive reboots is a random variable, we will express it as an upper-case, T. Since this notion of rates applies to any events, that is how we will refer to these ‘reboots’. If we denote the probability density function (PDF) of this random variable, T, by f(t) and the survival function (the probability that the random variable, T, will exceed some value, t) by S(t), then the hazard rate is given by:

h(t) = f(t) / S(t)    (1.1)
The way to interpret this quantity is that at any time t, the expected number of events the process will generate in the next small interval, dt, will be given by h(t) dt. You can find a derivation of this expression in appendix A. Note again that this is an instantaneous rate, meaning it is a function of time. When we talk about the Azure KPI, we’re not looking to estimate a function of time. Instead, given some interval of time (like the last week) and some collection of VMs, we want to get a single number encapsulating the overall experience. In reality, the rate will indeed probably vary from instant to instant within our time interval of interest. So, we want one estimate to represent this entire profile.

It is helpful again to draw from our analogy with velocity. If a car were moving on a straight road with a velocity that is a function of time and we wanted to find a single number to represent its average speed, what would we do? We would take the total distance traveled and divide by the total time taken for the trip. Similarly, the average rate (let’s denote it by λ) over a period of time will become the number of events we are modeling (say N) divided by the total observation time interval (say T):

λ = N / T    (1.2)
Just as it is possible to drive a car with a steady, constant velocity, making the average and instantaneous rates the same, it is also possible to have a process where the instantaneous rate is always a constant, and this is what the average rate will become as well. This special point process is called the Poisson point process (the only process with this constant-rate property; henceforth denoted by PP(λ)). Chapter 5 of [1] covers this extensively. As soon as we say “give me a single rate defining the interruptions per unit time for this data”, we’re essentially asking to fit the data as closely as possible to a Poisson point process and get the parameter λ for that process. In section 5.3.2 of [1], Ross mentions the reason for the name of the Poisson point process: namely, that the number of events, N(t), in any interval of length t is distributed according to a Poisson distribution with mean λt. The probability mass function (PMF) is defined there:

P(N(t) = n) = e^{-λt} (λt)^n / n!    (1.3)
Also, the inter-arrival times of events, T, follow an exponential distribution (density function f(t) = λ e^{-λt}). This makes sense since it is the only distribution that has a constant hazard rate with time (which is its parameter, λ). This is called the ‘memory-less’ property (the process maintains no memory; the rate remains the same regardless of what data we observed from the distribution). We now show that equation 1.2 is consistent with the Poisson process.
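The memoryless construction above is easy to check numerically. The sketch below (our own illustration, not from the paper) builds a Poisson process by accumulating exponential inter-arrival gaps and verifies that the expected count in a window of length T is λT:

```python
import random

def simulate_poisson_process(rate, horizon, rng):
    """Generate event times of a Poisson process on [0, horizon]
    by accumulating exponential inter-arrival gaps."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)  # memoryless gap with mean 1/rate
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(0)
rate, horizon, trials = 2.0, 50.0, 2000
counts = [len(simulate_poisson_process(rate, horizon, rng)) for _ in range(trials)]
mean_count = sum(counts) / trials
# For a Poisson process, E[N(T)] = rate * T = 100 here.
print(mean_count)
```

Averaged over many runs, the count concentrates around rate * horizon, consistent with the constant-rate property.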

Proposition 1.1.

If we see n_i point events in observation periods t_i (i = 1, …, k) from some data, the value of the rate parameter, λ, of the Poisson process that maximizes the likelihood of seeing this data is given by:

λ̂ = N / T

where N = Σ_i n_i is the total number of interruptions observed and T = Σ_i t_i is the total time period of observation.


Per equation 1.3, the likelihood of seeing the i-th observation becomes:

L_i(λ) = e^{-λ t_i} (λ t_i)^{n_i} / n_i!

Which makes the likelihood across all the data points:

L(λ) = Π_{i=1}^{k} e^{-λ t_i} (λ t_i)^{n_i} / n_i!

Taking logarithm on both sides we get the log-likelihood function:

ll(λ) = Σ_{i=1}^{k} ( -λ t_i + n_i log(λ t_i) - log n_i! )

To find the λ that maximizes this likelihood, we take the derivative with respect to it and set it to 0:

d ll / dλ = -Σ_i t_i + (Σ_i n_i) / λ = 0

Solving for λ we get equation 1.2 as expected, with N = Σ_i n_i defined as the total events and T = Σ_i t_i defined as the total observation period. ∎

We can also use the fact that the inter-arrival times, T_i, are exponential to reach the same conclusion; this alternate derivation is covered in appendix B. Note that the estimator for the average rate obtained here will hold for any point process, not just PP(λ).
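As a sanity check on the derivation, a coarse grid search over λ should never beat the closed form λ̂ = N/T. A small sketch with hypothetical counts and observation windows of our own choosing:

```python
import math

def poisson_log_likelihood(lam, counts, periods):
    """Log-likelihood of rate lam for event counts n_i in windows t_i,
    dropping the lam-independent terms (n_i log t_i - log n_i!)."""
    return sum(-lam * t + n * math.log(lam) for n, t in zip(counts, periods))

counts = [3, 0, 5, 2]            # hypothetical n_i
periods = [2.0, 1.0, 4.0, 3.0]   # hypothetical t_i
mle = sum(counts) / sum(periods)  # lambda_hat = N / T, equation 1.2

# A coarse grid search should land on (or right next to) the closed form.
grid = [0.01 * k for k in range(1, 500)]
best = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts, periods))
print(mle, best)
```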

Proposition 1.2.

Our estimator for the rate, λ̂, described in equation 1.2 for a Poisson point process is unbiased and asymptotically consistent.


Let’s say we observe the process for a certain amount of time, T. The estimator of λ will become:

λ̂ = N(T) / T

The expected value of this estimator is:

E[λ̂] = E[N(T)] / T = λT / T = λ

meaning it is unbiased.

And the variance of this estimator will be:

Var(λ̂) = Var(N(T)) / T² = λT / T² = λ / T

For a large time frame of observation, the variance in this estimator will go to 0, making it asymptotically consistent. ∎
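A quick Monte Carlo illustration of both properties (ours, with arbitrary parameters): the estimator centers on λ, and its variance shrinks roughly like λ/T as the observation window grows:

```python
import random

def poisson_count(rate, horizon, rng):
    """Number of events of a rate-`rate` Poisson process in [0, horizon],
    generated from exponential inter-arrival times."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return n
        n += 1

rng = random.Random(7)
rate = 1.5
results = {}
for horizon in (10.0, 1000.0):
    estimates = [poisson_count(rate, horizon, rng) / horizon for _ in range(500)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    results[horizon] = (mean, var)  # var should be roughly rate / horizon

print(results)
```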

The ‘average rate’ defined here is what the ‘AIR’ (Annual Interruption Rate) KPI used within Azure is based on. It is the projected number of reboots/ other events (like blips and pauses, etc.) a customer will experience if they rent 100 VMs and run them for a year (or rent one VM and run it for 100 years; what matters is the VM-years). So, in equation 1.2, if we measure the number of interruptions and VM-years for any scope (ex: entire Azure, a customer within Azure, a certain hardware, etc.) we get the corresponding average rate.

This definition in equation 1.2 is almost there, but is missing one subtlety related to VMs in Azure (or any cloud environment) going down for certain intervals of time as opposed to being point-events. This means that the VM might be up and running for an interval of time, then go down and stay down for some other interval before switching back to up, and so on. The way to address this is to use, in the denominator, the total intervals of machine up-time only (discounting the time the machines stay down). This way, we get a failure rate per unit time the machines are actually running, which is far more useful as a KPI. In practice, this doesn’t matter too much since the total time the machines spend being down is negligible compared to the time they spend being up (else we wouldn’t have a business).
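As a concrete (entirely hypothetical) example of the AIR computation with an up-time denominator:

```python
def annual_interruption_rate(interruptions, vm_uptime_years):
    """AIR: projected interruptions per 100 VM-years of running time.
    The denominator counts up-time only, per the discussion above."""
    return 100.0 * interruptions / vm_uptime_years

# Hypothetical scope: 1,250 interruptions over 50,000 VM-years of up-time.
air = annual_interruption_rate(1250, 50_000)
print(air)  # 2.5 interruptions per 100 VM-years
```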

2 Hypothesis testing: closed form expressions for false positive-negative trade off

There are many questions that can be answered within the framework of hypothesis testing (see chapter 1 of [2]). For example, we could answer the question: are the rates from two processes “different” in a meaningful way? This is called a two-sided test. Here, we will stay focused on answering if a treatment group (group-1) has a higher failure rate than a control group (group-0). This is called a one-sided test. This question is particularly relevant in cloud environments like Azure, where new software features are constantly getting deployed and we’re interested in answering if a particular deployment caused the failure rate to regress. We will reference these two groups throughout this document.

A detailed description of hypothesis testing is beyond the scope of what we’re discussing here. For a comprehensive treatment, refer to [2] and the blog linked here for an intuitive, visual introduction. Instead, let’s simply define some terms that will be used throughout this document (they will be re-introduced with context as the need arises in the text that follows; this is just meant as a sort of index of terms). Some of them pertain to hypothesis testing and can be looked up in the references above or in a multitude of other sources online that cover the topic.

n_0: The number of failure events observed in data collected from group-0, the control group. We will consider multiple distributions for this variable in the discussion that follows.

t_0: The total observation time for group-0, the control group.

n_1: The number of failure events observed in data collected from group-1, the treatment group. Again, multiple distributional assumptions will be considered.

t_1: The total observation time for group-1, the treatment group.

λ: The underlying failure rate of the control group. Per equation 1.2, the unbiased estimator for this rate is λ̂_0 = n_0 / t_0.

δ: The effect size. If we imagine that the treatment group has a worse failure rate, this is the amount by which we assume it to be worse. It is closely related to the alternate hypothesis, defined below.

s: The test statistic. We take the data from the two groups and convert it to a single number. We can then observe this number, ŝ, from our collected data and if it’s high (or low) enough, conclude a regression was caused. For example, it could be the difference in estimated rates.

H_0: The null hypothesis of the test. We always start with the assumption of innocence, and this represents the hypothesis that the treatment group does not have a worse failure rate than the control group. Further, the distributional assumptions on n_0 and n_1 made by the test are satisfied. For this paper, this will mostly mean that n_0 and n_1 both come from Poisson processes, PP(λ).

H_a: The alternate hypothesis. In this hypothesis, we assume that the treatment group indeed has a worse failure rate than the control group. To make it concrete, we assume it’s worse by the effect size, δ. Like H_0, the distributional assumptions made by the test are assumed satisfied. This will mean for the most part that the control group follows PP(λ) and the treatment group follows PP(λ + δ).

H_0': This is a new hypothesis we’re defining. It is like H_0, apart from allowing the distributional assumptions on n_0 and n_1 to be different from the test. The failure rates are still assumed to be the same for the two groups. It allows us to address the question of what happens when we use a test designed on one set of assumptions on real data that diverges from those assumptions.

H_a': Like H_a, apart from allowing the distributional assumptions on n_0 and n_1 to be different from the test. The failure rates for the two groups are still assumed to differ by δ, just as with H_a.

f_0: The distribution of our test statistic, s, under H_0.

f_a: The distribution of our test statistic, s, under H_a.

f_0': The distribution of our test statistic, s, under H_0'.

f_a': The distribution of our test statistic, s, under H_a'.

p: The p-value of the hypothesis test, representing the likelihood that something as or more extreme (with “extreme” defined in the direction of H_a, which here means towards greater treatment failure rates) as the observed test statistic could be seen under the assumptions of H_0.

α: The type-1 error rate of the test. It is the only parameter defined arbitrarily by us. Under the assumptions of H_0, what is the probability the test will reject it? It is the theoretical false positive rate from the test. The binary decision saying whether or not there is a regression in the treatment group is made using the indicator variable: I(p < α).

FPR(α): The false positive rate for real world data when there is no difference in rates between the groups (so, under H_0') and we still use α to reject the null hypothesis. We will see in proposition 2.1 that if H_0' = H_0 then FPR(α) = α.

β(α): The false negative rate of our test (as a function of the type-1 error rate we arbitrarily set), defined as the probability that we will fail to reject the null hypothesis under H_a or H_a'.

β^(FPR): The false negative rate at the value of α where we get a false positive rate of FPR.

Let’s also define henceforth, for a distribution f (typically of the test statistic in our context), F(x), the cumulative distribution function (CDF) of f, and S(x) = 1 - F(x), the survival function of f.

Armed with the notation defined above, we can now describe how our hypothesis test (one sided with alternate hypothesis being that the treatment group has a higher rate) proceeds (refer to figure 1):

Figure 1: The false positive-false negative rate trade-off. As we increase the threshold, s_α, that the test statistic needs to cross for rejecting the null, the blue area to its right (the rejection probability under the null, α) reduces, while the green area to its left (the false negative rate) increases.
Step 1:

Obtain the distribution, f_0, of the test statistic, s, under the null hypothesis, H_0. This distribution is represented by the blue distribution in figure 1.

Step 2:

Observe the estimated value of the test statistic, ŝ, in the data we collect. This value is represented by the red line in figure 1. We assume that this test statistic is higher when the difference in rates between the treatment and control groups is higher.

Step 3:

Find the probability of seeing something as or more extreme than ŝ under the assumptions of H_0. This is called the p-value, p, and is represented by the blue area to the right of the red line in figure 1.

Step 4:

For some arbitrarily defined type-1 error rate, α (a common value is 5%), reject the null and conclude there is a regression if p < α.
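The four steps reduce to a few lines of code. A minimal sketch, assuming purely for illustration that the null distribution of the statistic is Exponential(1), so its survival function is exp(-s):

```python
import math

def one_sided_test(survival_fn_null, s_hat, alpha=0.05):
    """Steps 1-4 above: the p-value is the null survival function
    evaluated at the observed statistic; reject when p < alpha."""
    p_value = survival_fn_null(s_hat)
    return p_value, p_value < alpha

# Hypothetical null for illustration: statistic ~ Exponential(1).
p, reject = one_sided_test(lambda s: math.exp(-s), s_hat=4.0, alpha=0.05)
print(p, reject)
```

Any one-sided test fits this skeleton once we supply the survival function of the test statistic under the null.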

Proposition 2.1.

Under the assumptions of the null hypothesis (the hypothesis whose distributional assumption for the test statistic is used to calculate the p-value, p), the type-1 error rate of our test (α) is the same as the false positive rate (FPR).


The p-value will be given by:

p = S_0(ŝ)    (2.1)

where S_0 is the survival function of the distribution, f_0.

The false positive rate, FPR(α), then becomes the probability that the p-value will be lower than the type-1 error rate, α:

FPR(α) = P(p < α)    (2.2)

Under the assumptions of the null hypothesis, however, the test statistic is distributed as f_0:

ŝ ~ f_0

Substituting into equation 2.2 we get:

FPR(α) = P(S_0(ŝ) < α) = P(ŝ > S_0^{-1}(α)) = S_0(S_0^{-1}(α)) = α    (2.3)

where in the third step, we used the fact that S_0 is a monotonically decreasing function. ∎

Corollary 2.1.1.

Under the null hypothesis, the p-value (p) is uniformly distributed over [0, 1].

From equation 2.2 and the result of proposition 2.1 we have:

P(p < α) = α for all α ∈ [0, 1]

The only distribution that satisfies this property is the uniform distribution, U(0, 1). ∎
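This uniformity is easy to verify by simulation. The sketch below (our own illustration, with an assumed Exponential(1) null) draws statistics from the null, converts them to p-values via the survival function, and checks that a fraction α of them falls below any threshold α:

```python
import random, math

rng = random.Random(42)
# Assumed null model for illustration: test statistic ~ Exponential(1),
# so the p-value is its survival function, p = exp(-s).
p_values = [math.exp(-rng.expovariate(1.0)) for _ in range(20000)]

# If p ~ U(0, 1), roughly a fraction alpha of p-values falls below alpha.
fracs = {alpha: sum(p < alpha for p in p_values) / len(p_values)
         for alpha in (0.05, 0.5)}
print(fracs)
```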

In making a binary decision on whether or not there is a regression in failure rates for the treatment group, there will be a trade-off between the false negative (failing to reject the null when it’s false) and false positive (rejecting the null when it’s true) error rates.

To define the false negative rate, we assume there is actually a difference in the failure rates for the treatment and control groups. We assumed that the test statistic follows the distribution f_a' under this hypothesis. If it so happens that the distributional assumptions on n_0 and n_1 are of the same form as those used to derive f_0 under the hypothesis H_0 (apart from the failure rate corresponding to n_1 being higher than that for n_0 by δ), we get f_a' = f_a, but that’s incidental. Referencing figure 1 again, we get the following proposition:

Proposition 2.2.

The false negative rate of our hypothesis test described earlier, as a function of the type-1 error rate we arbitrarily set, is given by:

β(α) = F_a'(S_0^{-1}(α))    (2.5)

Refer again to figure 1, where the green distribution represents f_a', the hypothesis where the failure rate of the treatment group is higher than that of the control group. Our test translates to some threshold, s_α, on the observed test statistic, where we reject the null if ŝ > s_α. Since we have, per equation 2.1, p = S_0(ŝ), we get:

s_α = S_0^{-1}(α)    (2.4)

The false negative rate then becomes the probability of the observed test statistic being below this threshold:

β(α) = P(ŝ ≤ s_α) = F_a'(s_α)

where F_a' is the cumulative distribution function of f_a'.

Substituting equation 2.4 and noting that under the current assumptions ŝ ~ f_a', we get:

β(α) = F_a'(S_0^{-1}(α))    (2.5)

As a special case, if in the alternate hypothesis the distributional assumptions of H_0 are maintained, we have f_a' = f_a, and equation 2.5 becomes:

β(α) = F_a(S_0^{-1}(α))    (2.6)

Alternately, we can also proceed as follows to prove proposition 2.2:


The false negative rate, β(α), is defined as the probability of failing to reject the null hypothesis conditional on the alternate hypothesis being true. The probability of failing to reject the null is P(p > α). Using equation 2.1, this becomes:

β(α) = P(S_0(ŝ) > α)

But, under the alternate hypothesis we have:

ŝ ~ f_a'

This implies:

β(α) = P(ŝ < S_0^{-1}(α)) = F_a'(S_0^{-1}(α))

where in the second equation we used the fact that the survival function, S_0, is a decreasing function. ∎

What if we assumed some distributions for n_0 and n_1, leading to the null hypothesis H_0, while in real life n_0 and n_1 follow some other distribution, still with the same rates for the two processes? This leads to another null hypothesis, H_0'. The test statistic under the two hypotheses satisfies:

ŝ ~ f_0 under H_0 and ŝ ~ f_0' under H_0'

In this case, the violation of the distributional assumption causes the false positive and type-1 error rates to diverge, unlike equation 2.3.

Proposition 2.3.

Under H_0', the false positive rate as a function of the type-1 error rate is given by:

FPR(α) = S_0'(S_0^{-1}(α))    (2.7)

Left to the reader; proceed similarly to equation 2.3. ∎

Corollary 2.3.1.

If we’re applying a hypothesis test that is designed under H_0, which involves the test statistic following a distribution given by f_0, whereas we expect to encounter data where we know the null hypothesis is actually going to follow the distribution f_0', and we’re targeting a false positive rate of FPR, we should set the type-1 error rate, α, to:

α = S_0(S_0'^{-1}(FPR))    (2.8)

and the probability of observing something as or more extreme than the test statistic under the distributional assumptions of H_0' becomes:

p' = S_0'(S_0^{-1}(p))    (2.9)

where p' is the p-value under H_0'.

Corollary 2.3.2.

If the effect size, δ = 0, the false negative rate, β^, as a function of the false positive rate, FPR, is given by:

β^(FPR) = 1 - FPR

The result follows from equations 2.7 and 2.5, noting that if δ = 0 then f_a' = f_0' (assuming the only difference between H_0' and H_a' is the effect size):

β^(FPR) = F_0'(S_0'^{-1}(FPR)) = 1 - S_0'(S_0'^{-1}(FPR)) = 1 - FPR

Note that this profile is equivalent to tossing a coin with FPR being the probability of heads and rejecting the null if we get heads. ∎
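A quick simulation of this coin-toss profile (with an arbitrary Gaussian statistic of our own choosing): when the "alternate" generates data exactly like the null, any threshold gives FNR = 1 - FPR:

```python
import random

rng = random.Random(1)
# delta = 0: the 'alternate' generates data exactly like the null.
samples_null = [rng.gauss(0, 1) for _ in range(50000)]
samples_alt = [rng.gauss(0, 1) for _ in range(50000)]  # same distribution

threshold = 1.3  # arbitrary rejection threshold on the statistic
fpr = sum(s > threshold for s in samples_null) / len(samples_null)
fnr = sum(s <= threshold for s in samples_alt) / len(samples_alt)
print(round(fpr + fnr, 3))  # should be close to 1 for any threshold
```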

Proposition 2.4.

Under H_a', the false negative rate as a function of the false positive rate is given by:

β^(FPR) = F_a'(S_0'^{-1}(FPR))    (2.10)

From equation 2.7 we have:

α = S_0(S_0'^{-1}(FPR))

Substituting into equation 2.5 we get:

β^(FPR) = F_a'(S_0^{-1}(S_0(S_0'^{-1}(FPR)))) = F_a'(S_0'^{-1}(FPR))    (2.10) ∎
Equations 2.5 and 2.10 define the false negative rate as a function of the type-1 error rate (which we set arbitrarily) and false positive rate (which we get from the real data) respectively. We should expect:

  • The higher we set the type-1 error rate, α, the more prone to fire our hypothesis test becomes, rejecting the null hypothesis more easily. So, the false positive rate should become higher when we do this. Hence, the false positive rate should be an increasing function of the type-1 error rate we set.

  • By a similar argument, the higher we set the type-1 error rate, the lower our false negative rate should become, since the test is only more likely to reject the null.

  • From the above two arguments, it follows that the false negative rate should always be a decreasing function of the false positive rate.

Here, we prove the second and third conclusions above.

Proposition 2.5.

The false negative rate can only be a decreasing function of the type-1 error rate we set and false positive rate our test consequently provides.


We have from equation 2.10:

β^(FPR) = F_a'(S_0'^{-1}(FPR))

Differentiating with respect to FPR we get:

dβ^/dFPR = f_a'(S_0'^{-1}(FPR)) · d(S_0'^{-1}(FPR))/dFPR

where f_a' is the probability density function of the test statistic under H_a', which will always be positive, while the second term will be negative since the survival function of any distribution is always decreasing and so is its inverse. Hence we always have β^(FPR) being a monotonically decreasing function, and we can similarly prove the same for β(α). ∎

Since our hypothesis test is basically an oracle that is supposed to alert us when there is a difference and not fire when there isn’t, there is a good argument for the assertion that the β^(FPR) profile is all that matters when comparing various hypothesis tests. If a test produces a better false negative rate for any given actual false positive rate (FPR) than another (everything else being equal), it should be preferred. Such a test is called “more powerful”, since the power is defined as 1 - β.

2.1 The most powerful test for failure rates

As mentioned in section I, talking of failure rates is synonymous with fitting a Poisson process to whatever point process we’re modeling and finding the rate, λ, of that Poisson process. This gives us a good starting point for comparing failure rates, since we now have not just a statistic, but an entire point process to work with.

For comparing the rate parameters of two Poisson point processes, there exists a uniformly most powerful (UMP) test (see section 4.5 of [2]). This test is mathematically proven to produce the best false negative rate (power) given any false positive rate, effect size and amount of data (in this context, observation period). We will describe the test here, but refer to [2] for a detailed treatment and why this is the “Uniformly most powerful (UMP)” test for comparing Poisson rates.

To review, we have two Poisson processes. We observe n_0 events in time t_0 from the first and n_1 events in time t_1 from the second one. Hence, the estimates for the two failure rates we want to compare are λ̂_0 = n_0/t_0 and λ̂_1 = n_1/t_1. The theorem that follows will help convert this hypothesis testing problem into a simpler one, but we need a few lemmas before we get to it.

Lemma 2.6.

If we sum two independent Poisson random variables with means μ_0 and μ_1, we get another Poisson random variable with mean μ_0 + μ_1.


Let X and Y denote the two Poisson random variables, with means μ_0 and μ_1. Conditioning on the value of Y:

P(X + Y = n) = Σ_{j=0}^{n} P(X + Y = n | Y = j) P(Y = j)

Since X and Y are independent by definition:

P(X + Y = n) = Σ_{j=0}^{n} P(X = n - j) P(Y = j) = Σ_{j=0}^{n} [e^{-μ_0} μ_0^{n-j} / (n-j)!] [e^{-μ_1} μ_1^{j} / j!] = e^{-(μ_0 + μ_1)} (μ_0 + μ_1)^n / n!

where the last step uses the binomial theorem. ∎
Since the Binomial distribution with parameters n and p is defined as the number of heads we get when we toss a coin, with p being its probability of heads, n times (represented henceforth by B(n, p)), we have the following lemma:

Lemma 2.7.

Given two Poisson processes with rates λ_0 and λ_1 which we observe for periods t_0 and t_1, conditional on a total of n events observed, the number of events, n_1, from the second process is a Binomial distribution with parameters n and p = λ_1 t_1 / (λ_0 t_0 + λ_1 t_1).



Let N_0 and N_1 represent the random numbers describing the number of events from the two processes. We have, by Bayes theorem:

P(N_1 = n_1 | N_0 + N_1 = n) = P(N_0 + N_1 = n | N_1 = n_1) P(N_1 = n_1) / P(N_0 + N_1 = n)

Since the two processes are independent, P(N_0 + N_1 = n | N_1 = n_1) = P(N_0 = n - n_1). The number of events in a time interval of length t from a Poisson process with rate λ is Poisson distributed with mean λt. Also, using the result of Lemma 2.6, N_0 + N_1 is Poisson with mean λ_0 t_0 + λ_1 t_1. So:

P(N_1 = n_1 | N_0 + N_1 = n) = [e^{-λ_0 t_0} (λ_0 t_0)^{n-n_1} / (n-n_1)!] [e^{-λ_1 t_1} (λ_1 t_1)^{n_1} / n_1!] / [e^{-(λ_0 t_0 + λ_1 t_1)} (λ_0 t_0 + λ_1 t_1)^n / n!]

= C(n, n_1) p^{n_1} (1 - p)^{n - n_1}, where p = λ_1 t_1 / (λ_0 t_0 + λ_1 t_1)    (2.11)

which is the Binomial probability mass function (PMF) as required. ∎

Corollary 2.7.1.

If two Poisson processes have the same rate, λ, and are observed for periods t_0 and t_1, then conditional on observing n events from both processes, the number of events from the first process is a Binomial distribution with parameters n and t_0 / (t_0 + t_1).


Substitute λ_0 = λ_1 = λ into equation 2.11 above. ∎
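The conditioning trick can be verified by simulation. The sketch below (our own, with arbitrary rate and windows) draws pairs of Poisson counts with a common rate, conditions on a fixed total, and checks the conditional mean against the Binomial prediction:

```python
import random

def poisson_count(rate, horizon, rng):
    """Count of a rate-`rate` Poisson process in [0, horizon],
    built from exponential inter-arrival times."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return n
        n += 1

rng = random.Random(3)
lam, t0, t1 = 2.0, 3.0, 1.0
pairs = [(poisson_count(lam, t0, rng), poisson_count(lam, t1, rng))
         for _ in range(40000)]

# Condition on total = 8 and look at the mean count from the first process.
conditioned = [n0 for n0, n1 in pairs if n0 + n1 == 8]
mean_n0 = sum(conditioned) / len(conditioned)
print(round(mean_n0, 2))  # B(8, t0/(t0+t1)) predicts a mean of 8 * 0.75 = 6
```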

Per Corollary 2.7.1, we’ve managed to get rid of the rate, λ (a nuisance parameter), when the two processes are identical (which is a requirement for the null hypothesis). This ensures our hypothesis test for failure rates will work the same regardless of the base failure rate, λ, of the two processes. So, conditional on the total events from the two processes being n (which is something we observe), asking if the second process has a higher failure rate becomes equivalent to asking if the conditional Binomial distribution has a higher value of the success probability parameter than t_1 / (t_0 + t_1), as the null hypothesis would suggest. We have thus reduced the two-sample rate test to a one-sample Binomial test on the probability of success, p.

2.1.1 The one-sample Binomial test

To get the p-value (the probability used to decide whether to reject the null hypothesis), we ask: “what is the probability of seeing something as or more extreme than the observed data per the null hypothesis?” Here, “extreme” is defined in the direction of the alternate hypothesis. So, if we observe k heads out of n tosses in our data and our null hypothesis is that the probability of heads is p_0, then the p-value, p, becomes the probability of seeing k or more heads if the probability of seeing heads in a single toss was p_0. So we get (where q_0 = 1 - p_0):

p = Σ_{j=k}^{n} C(n, j) p_0^j q_0^{n-j}    (2.12)
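Equation 2.12 is a one-liner in code. A minimal sketch with hypothetical data (8 heads in 10 tosses under a fair-coin null):

```python
from math import comb

def binom_tail_p_value(k, n, p0):
    """P(K >= k) when K ~ Binomial(n, p0): the one-sided p-value
    of equation 2.12."""
    return sum(comb(n, j) * p0**j * (1 - p0) ** (n - j)
               for j in range(k, n + 1))

# Hypothetical data: 8 'heads' in 10 tosses under a fair-coin null.
p = binom_tail_p_value(8, 10, 0.5)
print(round(p, 4))  # (45 + 10 + 1) / 1024 = 0.0547
```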
2.1.2 Back to comparing Poisson rates

Theorem 2.8.

Given two Poisson processes with rates λ_0 and λ_1, under the null hypothesis, H_0: λ_1 = λ_0, and alternate hypothesis, H_a: λ_1 = λ_0 + δ (both conditional on observing a total of n events), if we observe n_0 events from the first process in time t_0 and n_1 events from the second in time t_1, the p-value, p, is given by (with p_0 = t_1 / (t_0 + t_1) and n = n_0 + n_1):

p = Σ_{j=n_1}^{n} C(n, j) p_0^j (1 - p_0)^{n-j}    (2.13)

We can then pick a type-1 error rate, α, and reject the null if p < α.

Per corollary 2.7.1 (applied, by symmetry, to the second process), conditional on observing a total of n events from both processes and the failure rates being the same, the distribution of the number of events from the second process, n_1, is B(n, t_1 / (t_0 + t_1)). Substituting this into equation 2.12, the result follows. ∎
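Putting theorem 2.8 together, the whole rate test is just the Binomial tail evaluated at the observed treatment count. A sketch (with hypothetical counts and VM-years of our own choosing; not the production Azure implementation):

```python
from math import comb

def poisson_rate_test(n0, t0, n1, t1):
    """One-sided rate test sketch per theorem 2.8: conditional on
    n = n0 + n1, n1 is Binomial(n, t1/(t0+t1)) under the null, so the
    p-value is the upper Binomial tail at the observed n1."""
    n = n0 + n1
    p0 = t1 / (t0 + t1)
    return sum(comb(n, j) * p0**j * (1 - p0) ** (n - j)
               for j in range(n1, n + 1))

# Hypothetical data: treatment saw 15 failures in 100 VM-years;
# control saw 5 failures in 100 VM-years.
p = poisson_rate_test(n0=5, t0=100.0, n1=15, t1=100.0)
print(round(p, 4))  # a small p suggests a regression in the treatment group
```

Note that the base rate λ never appears: the conditioning in corollary 2.7.1 removed the nuisance parameter.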

Note that for our simple, one-sided test, the Poisson rate test can be readily swapped with the Binomial test. However, there is some subtlety when dealing with two-sided tests and confidence intervals. This is covered in


2.2 False positive negative trade off

Now that we have described our test for comparing failure rates, we will evaluate the false positive to false negative rate trade-off function (β^(FPR)) under the assumptions of the null hypothesis (the hypothesis under whose test statistic distribution the p-value is calculated). This will give us a framework to later obtain the same trade-off when the distributional assumptions are violated.

In equation 2.6, we described this trade off, $\beta(\alpha)$. Let's see what it looks like for the rate test. We have the following corollary to theorem 2.8:

Corollary 2.8.1.

Given the null hypothesis for the rate test in theorem 2.8, with $N_0$ describing the number of events in the control group (in observation time $t_0$) and $N_1$ describing the same for the treatment group in observation time $t_1$, we get the false negative rate corresponding to a type-1 error rate of $\alpha$, conditional on $N_0 + N_1 = n$:

$$\beta_n(\alpha) = F_{H_a}^{n}\!\left((F_{H_0}^{n})^{-1}(1-\alpha)\right)$$

and the false negative rate corresponding to false positive rate $\alpha$ overall:

$$\beta(\alpha) = \sum_{n=0}^{\infty} P(N_0 + N_1 = n)\,\beta_n(\alpha)$$

where $F_{H_0}^{n}$ and $F_{H_a}^{n}$ are the distribution functions of the conditional (on $n$) Binomial distributions of $N_1$ under the null and alternate hypotheses respectively.
Since our test conditions on the total number of events observed, $n$, we start by describing our $\beta(\alpha)$ under that condition as well. Denoting by $N_0$ and $N_1$ the number of events observed in groups 0 and 1 in observation times $t_0$ and $t_1$ respectively, and noting that $N_1$, the number of events from group 1 conditional on $N_0 + N_1 = n$, is a discrete random variable, equation 2.5 becomes:

$$\beta_n(\alpha) = F_{T|n,H_a}\!\left(F_{T|n,H_0}^{-1}(1-\alpha)\right)$$

Since our test statistic for this particular test is simply the number of events from the treatment group, we have $T = N_1$, making the equation above:

$$\beta_n(\alpha) = F_{H_a}^{n}\!\left((F_{H_0}^{n})^{-1}(1-\alpha)\right)$$

To get the overall $\beta(\alpha)$, we simply marginalize over all possible values of $n$ to get:

$$\beta(\alpha) = \sum_{n=0}^{\infty} P(N_0 + N_1 = n)\,\beta_n(\alpha)$$

It is sometimes convenient to use equation 2.14 (especially when the conditional distribution in that equation has a nice closed form) and at other times, equation 2.15. Under $H_0$ (the assumptions of the rate test under the null hypothesis), $N_0$ and $N_1$ follow Poisson distributions with means $\lambda t_0$ and $\lambda t_1$ respectively. The second part of the proposition follows as a result of equation 2.10. ∎

Proposition 2.9.

If we apply the uniformly most powerful rate test as described in theorem 2.8, with $H_0$ defined as both treatment and control groups following $\text{Poisson}(\lambda)$ and $H_a$ defined as control following $\text{Poisson}(\lambda)$ and treatment following $\text{Poisson}(\lambda + \delta\lambda)$ where $\delta\lambda > 0$, the false negative rate for any false positive rate goes to zero if we collect data from both processes for a very large period of time ($t \to \infty$).


We will prove this for the special case $t_0 = t_1 = t$; that is, both groups (control and treatment) are observed for the same time period, $t$.

Substituting the results of lemmas 2.6 and 2.7 into equation 2.14 we get:



We will proceed from here for the special case $\delta\lambda = \lambda$. This makes the relevant cutoff in the summation $\lfloor 2n/3 \rfloor$ (where $\lfloor x \rfloor$ is the greatest integer less than or equal to $x$). So we get:


Now if we show that, for some $x$,


we would have shown the result, since the Taylor expansion of $e^x$ implies:

And so (in conjunction with the fact that ),

Comparing equations 2.17 and 2.18, the inequality would certainly hold if it were possible to find an $x$ such that:

Let and this requirement becomes:


This is obviously true for any finite value of $n$, since the summation converges and the summation in equation 2.19 is missing some positive terms compared with it. Those terms will sum to something finite and allow us to choose a suitable $x$. This holds for all finite $n$.

The only concern remaining is that we might not be able to find an $x$ satisfying equation 2.19 when $n \to \infty$. And indeed, this turns out to be a concern only in that limit.

Let’s find the limit:

Noting the inequality

and the limit , we deduce that

Applying the AM-GM inequality to the two terms guarantees the required bound, with equality holding if and only if the two terms are equal. Hence we see that an $x$ satisfying equation 2.19 will exist in some cases but not others; it does exist for the case we're interested in ($\delta\lambda = \lambda$), and this concludes the proof. ∎
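Proposition 2.9 can also be checked numerically. The Monte Carlo sketch below is our own construction (not the paper's open-source code): it estimates the false negative rate of the rate test at a fixed $\alpha$ and shows it shrinking as the observation window $t$ grows:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

def simulated_fnr(lam, effect, t, alpha, trials=20000):
    """Estimate the false negative rate of the rate test at level alpha.

    Control events ~ Poisson(lam * t), treatment ~ Poisson(lam * (1 + effect) * t);
    a false negative is a failure to reject H0: equal rates.
    """
    n1 = rng.poisson(lam * t, size=trials)
    n2 = rng.poisson(lam * (1 + effect) * t, size=trials)
    n = n1 + n2
    # equal observation times, so p = t / (t + t) = 1/2 under the null
    pvals = binom.sf(n2 - 1, n, 0.5)
    return np.mean(pvals > alpha)
```

With, say, `lam=10` and `effect=0.2`, the estimated false negative rate at $\alpha = 0.05$ drops sharply as the observation window grows from `t=2` to `t=50`, consistent with the proposition.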

3 Breaking the test

We now have a pretty straightforward test for comparing the rates of two point processes, one which is provably the best possible when these are Poisson point processes. All we need are four numbers: the number of events and the time period of observation in which those events were collected, for each of the two groups. But is this too simple? The Poisson point process is quite restrictive in the assumptions it makes and is almost never a good model for real-world data. Is applying a test built on its assumptions, then, naive? Let's explore this question in this section by breaking every possible underlying assumption and investigating how the test behaves.

3.1 Swapping out the distribution of the null hypothesis

In the construction of our hypothesis test, we used equation 2.11, which allowed us to condition on the total number of events, $n$, and use the fact that the distribution of the number of events from the second process, $N_1$, is Binomial (let's call it $F_{H_0}$). Similarly, the distribution of our test statistic under the alternate hypothesis (given some effect size) is $F_{H_a}$, which happens to also be Binomial with the same number-of-tosses parameter, $n$, but a different probability-of-heads parameter.

In the spirit of finding ways to break our test, let's say we won't be using the distribution of the null hypothesis, $F_{H_0}$, anymore and will instead swap it out with another arbitrary distribution, $G$, with the same support as $F_{H_0}$ (the non-negative integers up to $n$). The following result is somewhat surprising:

Theorem 3.1.

For any one-sided hypothesis test, if we swap out the distribution of the null hypothesis, $F_{H_0}$, with another arbitrary distribution, $G$, that has the same support, we get the same false negative rate corresponding to any false positive rate.


From equations 2.3 and 2.6, we get the false positive rate to false negative rate trade-off function, $\beta(\alpha)$.

Now, consider the test where we replace $F_{H_0}$ with the arbitrary distribution $G$ (known henceforth as the "contorted test"). First, let's obtain a relationship for the false positive rate of this test, $\alpha'$. Using similar reasoning to that used to obtain equation 2.3,


Note that this time, the two functions don't cancel out. So, the type-1 error rate for our contorted test (with $F_{H_0}$ replaced by $G$) is different from its false positive rate, $\alpha'$.

Now, let’s explore the false negative rate of this contorted test. Using a similar approach as for proposition 2.2 we get:


In equation 3.1, applying followed by to both sides we get,

And substituting this into equation 3.2, we get the $\alpha$-$\beta$ trade off for this test:

This means that for the contorted test, given a false positive rate $\alpha$, the false negative rate is

But the above is the same $\beta(\alpha)$ we got from equation 2.6. This shows that given a false positive rate $\alpha$, the false negative rates for the two tests are equal, and proves the theorem. It is also easy to see that we could have replaced the distributions generating the data under the null and alternate hypotheses with arbitrary ones and reached the same conclusion, meaning the theorem continues to hold even when the original and contorted tests are applied to data that doesn't follow the assumptions of the test. ∎

Suppose we're trying to apply the hypothesis test for failure rates described in theorem 2.8 to point processes that aren't Poisson processes. One consequence of this violation of the distributional assumption would be that the conditional (on the total number of events, $n$) distribution of the test statistic will no longer be Binomial. We might consider trying to find what this distribution is and replacing the Binomial distribution with it, so as to devise a test more tailored to the point processes in our data. Per theorem 3.1, this would be a waste of time as far as the $\alpha$-$\beta$ trade off goes, as swapping out the Binomial for any other distribution under the sun would not improve the false negative rate we get corresponding to a false positive rate.

Also note that nothing in the derivation was specific to rate tests. The conclusion of theorem 3.1 holds for any one-sided hypothesis test. In the famous two-sample t-test for comparing means, for instance, if we swap out the t-distribution with a normal or even some strange multi-modal distribution, the false negative to false positive trade off will remain unchanged.
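Theorem 3.1 is easy to demonstrate numerically. In the sketch below (our own illustration), the correct Binomial null of a one-sample test is swapped for an arbitrary Poisson-shaped "null" with the same support; sweeping the rejection threshold shows the achievable (false positive rate, false negative rate) pairs are identical:

```python
import numpy as np
from scipy.stats import binom, poisson

rng = np.random.default_rng(1)
n, p0, p1 = 50, 0.5, 0.65              # tosses; null and alternate head probabilities
null_draws = rng.binomial(n, p0, size=100000)
alt_draws = rng.binomial(n, p1, size=100000)

def trade_off_points(pval_fn):
    """All achievable (FPR, FNR) pairs as the p-value threshold sweeps."""
    p_null, p_alt = pval_fn(null_draws), pval_fn(alt_draws)
    return {(np.mean(p_null <= c), np.mean(p_alt > c))
            for c in np.unique(p_null)}

correct = lambda x: binom.sf(x - 1, n, p0)     # the true Binomial null
contorted = lambda x: poisson.sf(x - 1, 30.0)  # arbitrary swapped-in "null"
```

Both functions yield exactly the same set of trade-off points: swapping the null changes the nominal type-1 error attached to each threshold, but not the underlying trade-off.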

3.2 Violating assumptions

Now we get to scenarios where we apply the rate test as described in theorem 2.8, as-is, to point processes that are not Poisson processes. For example, a core property of the Poisson point process is that the mean and variance of the count of events within any interval are the same. Many real-world point processes don't exhibit this behavior, with the variance typically being higher than the mean.

What then is the price we pay in still applying the rate test, derived on the assumptions of the Poisson process, to rates from these non-Poisson processes? This depends, of course, on the particular point process we're dealing with. In this section, we'll consider different ways to generalize the Poisson point process with its constant failure rate and then see what happens when the rate test is applied to them. Three of these generalizations are covered in section 5.4 of [1], viz. the non-homogeneous, compound and mixed Poisson processes. For a non-homogeneous Poisson process, the rate is allowed to vary with time ($\lambda(t)$), but in a way that isn't affected by the arrivals of events. The number of events within any interval is still Poisson distributed in this process (with mean $\int_0^t \lambda(u)\,du$ for the interval $[0, t]$), and so it doesn't depart from the Poisson process in a significant way. The other two generalizations do fundamentally alter the distribution of the number of events within intervals, and we'll deal with them in turn.
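The claim that the non-homogeneous process still has Poisson counts is easy to check by simulation. A sketch (our own, using the standard thinning construction) with $\lambda(u) = 2u$ on $[0, T]$, for which the mean count is $\int_0^T 2u\,du = T^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 3.0
lam_max = 2 * T                       # upper bound on lambda(u) = 2u over [0, T]
trials = 20000
counts = np.empty(trials, dtype=np.int64)
for i in range(trials):
    k = rng.poisson(lam_max * T)      # candidate arrivals from a homogeneous process
    u = rng.uniform(0.0, T, size=k)   # candidate arrival times
    # keep each candidate with probability lambda(u) / lam_max = u / T
    counts[i] = np.sum(rng.uniform(0.0, 1.0, size=k) < u / T)
# counts should be Poisson with mean (and hence variance) T**2 = 9
```

The sample mean and variance of `counts` both come out near $T^2 = 9$, matching the Poisson property.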

3.2.1 The Compound Poisson Process

The Compound Poisson process, covered in section 5.4.2 of [1], involves a compounding distribution superposed on the Poisson process. We still have a Poisson process dictating event arrivals. However, each time we get an arrival from the Poisson process, we get a random number of events (the compounding random variable, $X$) instead of a single event.

This is especially relevant to failures within a cloud platform like Microsoft Azure, wherein there are multiple single points of failure that have the effect of clustering machine reboots together, leading to a higher variance of event counts within a time interval than the mean. The most obvious one is multiple virtual machines (the units rented to customers; VMs) being co-hosted on a single physical machine (or node). If the node goes down, all the VMs go down together. And the number of VMs on a node when it goes down will itself be a random variable (the compounding random variable).

Per equation (5.23) of [1] (or simply from the definition), the number of events in any interval $[0, t]$ will be given by (where the $X_i$ are independent and identically distributed with the same distribution as $X$, and $M(t)$ is the number of Poisson arrivals in the interval):

$$N(t) = \sum_{i=1}^{M(t)} X_i$$

Per equations (5.24) and (5.25) from [1], the mean and variance of such a point process will be (assuming the underlying Poisson process has rate $\lambda$):

$$E[N(t)] = \lambda t\, E[X], \qquad Var(N(t)) = \lambda t\, E[X^2]$$
This allows for our variance to be much higher than the mean and makes clear the fact that the number of events in any interval is no longer Poisson distributed.
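A quick simulation illustrates the over-dispersion. In this sketch (our own; the paper's experiments use Binomial compounding, but any compounding distribution shows the effect), $X$ is taken to be Poisson with mean $\mu$, which allows drawing the total count in one vectorized step because a sum of $k$ iid Poisson($\mu$) variables is Poisson($k\mu$):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t, mu = 2.0, 10.0, 3.0                    # arrival rate, window, compounding mean
trials = 200000
arrivals = rng.poisson(lam * t, size=trials)   # Poisson arrivals per window
counts = rng.poisson(mu * arrivals)            # total events per window
# theory: E[N(t)] = lam * t * E[X] = 60,
#         Var(N(t)) = lam * t * E[X^2] = 20 * (3 + 9) = 240
```

The sample variance comes out around four times the sample mean, in line with the formulas above and far from the equal mean and variance of a plain Poisson process.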

Deterministically compounded Poisson process

What is the simplest kind of compounding we can do (apart from none at all)? We can have a constant number of events for each Poisson arrival. In other words, $X$ becomes a deterministic number instead of a random variable; let's say the value it takes each time is $m$. For such a process, the number of events generated by either group must be an integer multiple of $m$. Also, per equations (5.24) and (5.25) of [1], the mean and variance of the number of events become:

$$E[N(t)] = \lambda t m, \qquad Var(N(t)) = \lambda t m^2$$
Since the variance is now higher than the mean (for $m > 1$), this is a fundamentally different point process from the Poisson point process.

Lemma 3.2.

For the deterministically compounded Poisson process, the probability mass function of the point process becomes:

$$P(N(t) = j) = \begin{cases} \dfrac{e^{-\lambda t}(\lambda t)^{j/m}}{(j/m)!} & \text{if } j \text{ is a multiple of } m \\ 0 & \text{otherwise} \end{cases}$$


Since every Poisson arrival results in exactly $m$ events, the number of events in any interval must be a multiple of $m$. And if we observe $nm$ events in any interval, then the number of Poisson arrivals must have been $n$. ∎
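In code, lemma 3.2 amounts to evaluating the Poisson pmf at the implied number of arrivals and returning zero off the multiples of $m$ (a sketch; the function name is our own):

```python
from scipy.stats import poisson

def det_compound_pmf(j, m, lam, t):
    """P(N(t) = j) when Poisson(lam) arrivals each produce exactly m events."""
    if j % m != 0:
        return 0.0                     # event counts must be multiples of m
    return poisson.pmf(j // m, lam * t)
```

For instance, with $m = 3$ only counts $0, 3, 6, \ldots$ carry mass, and $P(N(t) = 6)$ equals the Poisson probability of exactly two arrivals.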

Lemma 3.3.

Let $G$ be the distribution of the number of events in the treatment group conditional on the total events across both groups being $o$. We must have:

  • $o$ is a multiple of $m$, say $o = nm$;

  • conditional on $o = nm$, the number of treatment events is $m$ times a random variable following the Binomial distribution of corollary 2.7.1 (conditional on $n$ total Poisson arrivals);

  • the point after which the probability mass of $G$ sums to $\alpha$ is $m$ times the corresponding point of that Binomial distribution.
For the first part: since the number of events from each deterministically compounded process must be a multiple of $m$, so too must the sum of the two.

For the second part: if the total is $o = nm$, we can surmise that the number of Poisson arrivals across both groups was $n$. So, the number of conditional Poisson arrivals from the treatment group is still governed by the conclusions of lemma 2.7 and corollary 2.7.1. And once we know the Poisson arrivals from the treatment group, the total failures will just be $m$ times that.

For the third part: the probability mass at any point $j$ under the conditional Binomial distribution is simply moved to $jm$ under the compounded conditional distribution. Hence, if $x$ is the point after which the probabilities sum to $\alpha$ under the Binomial, this point gets scaled by $m$ as well. ∎
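Lemma 3.3 can be checked by simulation. The sketch below (our own) draws two deterministically compounded processes with $m = 3$ and equal observation windows, conditions on the total event count being $4m$, and verifies that the treatment count divided by $m$ behaves like the Binomial(4, 1/2) of corollary 2.7.1:

```python
import numpy as np

rng = np.random.default_rng(4)
m, mean_arrivals, trials = 3, 2.0, 400000
a0 = rng.poisson(mean_arrivals, size=trials)   # control Poisson arrivals
a1 = rng.poisson(mean_arrivals, size=trials)   # treatment Poisson arrivals
n0, n1 = m * a0, m * a1                        # observed event counts
mask = (n0 + n1) == m * 4                      # condition on o = 4 * m total events
cond = n1[mask] // m                           # treatment arrivals given the total
# corollary 2.7.1: cond ~ Binomial(4, 1/2), so mean 2 and variance 1
```

The conditional sample mean and variance land on the Binomial(4, 1/2) values, confirming the second part of the lemma.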

Now, let's assume that the number of failures in our two groups follows the deterministically compounded Poisson process.

Under , we will have both treatment and control groups following