# Kolmogorov-Smirnov Test

## What is the Kolmogorov–Smirnov test?

The Kolmogorov–Smirnov (K-S) test is a nonparametric test that compares probability distributions. In its one-sample form, it measures the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution. The test can also compare the distributions of two samples, rather than one sample and a reference distribution. In the two-sample setting, the K-S test is one of the most useful general-purpose tests because it is sensitive to differences in both the location and the shape of the distributions.
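As a rough illustration of the two-sample comparison, the sketch below computes the largest vertical gap between the empirical distribution functions of two samples. The function name `ks_two_sample_statistic` and the sample values are hypothetical, chosen only for this example, and the sketch stops at the statistic without deriving a p-value.

```python
from bisect import bisect_right


def ks_two_sample_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at those values.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / n  # fraction of a that is <= x
        ecdf_b = bisect_right(b_sorted, x) / m  # fraction of b that is <= x
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Made-up samples: the second is shifted to the right of the first.
first = [1.0, 2.0, 3.0, 4.0]
second = [3.5, 4.5, 5.5, 6.5]
d = ks_two_sample_statistic(first, second)
print(d)  # 0.75
```

A shift in location, as in this example, and a change in shape both open up a gap between the two step functions, which is why the statistic responds to either kind of difference.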

## How does the Kolmogorov–Smirnov test work?

For a sample of $n$ ordered observations $Y_1 \le Y_2 \le \dots \le Y_n$, the test statistic is defined as:

$$D = \max_{1 \le i \le n} \left( F(Y_i) - \frac{i-1}{n},\ \frac{i}{n} - F(Y_i) \right)$$

In this function, $F$ is the theoretical cumulative distribution function of the distribution being tested, so $D$ is the largest vertical distance between the empirical distribution function of the sample and $F$. It is important to note that the reference distribution must not only be continuous for the test to work correctly, but must also be fully specified. This means that parameters like shape and location cannot be estimated from the data. If the calculated value of $D$ is greater than the critical value obtained from a table, the hypothesis that the data follow the specified distribution is rejected.
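A minimal sketch of how the statistic can be computed, in pure Python. It assumes the common computational form in which each sorted observation is checked on both sides of the corresponding jump of the empirical distribution function. `ks_statistic` and `uniform_cdf` are illustrative names, and the reference distribution (uniform on [0, 1]) and the sample values are made up for the example.

```python
def ks_statistic(sample, cdf):
    """One-sample K-S statistic D of `sample` against a fully specified `cdf`."""
    ys = sorted(sample)
    n = len(ys)
    d = 0.0
    for i, y in enumerate(ys, start=1):
        f = cdf(y)
        # The empirical CDF jumps from (i - 1)/n to i/n at y, so the gap to
        # the reference CDF is checked just below and just above the jump.
        d = max(d, f - (i - 1) / n, i / n - f)
    return d


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (assumed reference)."""
    return min(max(x, 0.0), 1.0)


sample = [0.1, 0.4, 0.7]
d = ks_statistic(sample, uniform_cdf)
print(round(d, 4))  # 0.3
```

Here the largest gap occurs at the last observation, where the empirical distribution function reaches 1 while the reference CDF is only 0.7.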

Figure: the left panel shows a K-S test between a sample distribution and a reference distribution; the right panel shows a K-S test between two sample distributions. In both panels, the black line indicates the value of the K-S statistic.

### Significance of the Kolmogorov–Smirnov test

The Kolmogorov–Smirnov test is often used to answer specific questions about one's data. For example, it is particularly useful for checking which distribution the data come from (e.g., Are the data from an exponential distribution? Are the data from a logistic distribution?).
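The question "Are the data from an exponential distribution?" could be approached along the following lines. This is a sketch under stated assumptions, not a definitive implementation: the rate of the exponential distribution is fixed in advance rather than fitted, the data are simulated, and the p-value uses the large-sample Kolmogorov series, which is only an approximation for finite samples. All function names are hypothetical.

```python
import math
import random


def ks_statistic(sample, cdf):
    """One-sample K-S statistic D against a fully specified CDF."""
    ys = sorted(sample)
    n = len(ys)
    return max(
        max(cdf(y) - (i - 1) / n, i / n - cdf(y))
        for i, y in enumerate(ys, start=1)
    )


def asymptotic_p_value(d, n, terms=100):
    """Large-sample approximation of P(D_n > d) via the Kolmogorov series."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The alternating series converges poorly for tiny arguments,
        # where the tail probability is essentially 1 anyway.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def exponential_cdf(x, rate):
    """CDF of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


random.seed(0)
rate = 2.0  # assumed: hypothesized rate, chosen before looking at the data
data = [random.expovariate(rate) for _ in range(200)]  # simulated sample

d = ks_statistic(data, lambda x: exponential_cdf(x, rate))
p = asymptotic_p_value(d, len(data))
print(f"D = {d:.4f}, approximate p-value = {p:.4f}")
```

A small p-value would suggest that the data do not follow the hypothesized exponential distribution; a large one means the test found no evidence against it, which is not the same as confirming it.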

The Kolmogorov–Smirnov test has some limitations, as noted above. It only applies to continuous distributions. It also tends to be more sensitive near the center of the distribution than in the tails. Lastly, and arguably most importantly, the reference distribution must be fully specified: the critical region of the K-S test is no longer valid if parameters such as location or scale are estimated from the data.