**Stratified Sampling**

In classification studies it is typically a tacit assumption that sampling is random; indeed, it is commonplace for this assumption to be made throughout a text on classification. For instance, Devroye *et al.* declare on page 2 of their text that all sampling is random (Devroye, 1996). The assumption is so pervasive that it can be applied without mention. With regard to the problem at hand, Duda *et al.* (2001) state, ‘In typical supervised pattern classification problems, the estimation of the prior probabilities presents no serious difficulties.’ But, in fact, there are often serious difficulties.

Under the assumption of random sampling, the data set, $S_n = \{(\mathbf{X}_1, Y_1), \ldots, (\mathbf{X}_n, Y_n)\}$, is drawn independently from a fixed distribution of feature-label pairs, $(\mathbf{X}, Y)$; in particular, this means that if a sample of size $n$ is drawn for a binary classification problem, then the numbers of sample points, $n_0$ and $n_1$, in classes 0 and 1, respectively, are random variables such that $n_0 + n_1 = n$. Thus, if the sample is large, we can expect the sampling ratio to be close to the prior probability.
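As a rough illustration of this point, the sketch below simulates random sampling of class labels only. The function name, the prior value 0.3, and the sample sizes are assumptions chosen for the example, not values from the text.

```python
import random


def class_counts_random_sampling(n, prior_class0, seed=0):
    """Draw n labels independently; each is class 0 with probability prior_class0.

    Returns (n0, n1). Both counts are random, but they always sum to n.
    The prior is a hypothetical value supplied by the caller.
    """
    rng = random.Random(seed)
    n0 = sum(1 for _ in range(n) if rng.random() < prior_class0)
    return n0, n - n0


# The observed class-0 fraction should settle near the prior as n grows.
for n in (100, 10_000, 100_000):
    n0, n1 = class_counts_random_sampling(n, prior_class0=0.3)
    print(f"n={n:>7}  n0/n={n0 / n:.4f}")
```

Under this kind of sampling the observed fraction of class-0 points is itself a reasonable estimate of the class-0 prior.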

Suppose instead that the sampling **is not random**, *in the sense that the class ratios are chosen prior to sampling*. In this ‘separate (stratified) sampling’ case, the sample points are selected randomly from within each of the two classes but, given the total sample size, the individual class counts $n_0$ and $n_1$ __are not random__: they are fixed by design. The sampling ratio then reflects the experimenter's choice rather than the population, so, in effect, we have no sensible estimate of the class prior probability.
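A minimal sketch of separate sampling, for contrast, is given below. The helper name and the Gaussian class-conditional distributions are assumptions made for the example. Note that the true prior is not an input at all, which is exactly why the resulting class fraction says nothing about it.

```python
import random


def separate_sample(n0, n1, draw_class0, draw_class1, seed=0):
    """Draw exactly n0 points from class 0 and n1 points from class 1.

    The class counts are fixed by the caller (the experimenter), not by
    the population; only the feature values are random.
    """
    rng = random.Random(seed)
    sample = [(draw_class0(rng), 0) for _ in range(n0)]
    sample += [(draw_class1(rng), 1) for _ in range(n1)]
    return sample


# Hypothetical class-conditional distributions, assumed for illustration.
draw_class0 = lambda rng: rng.gauss(0.0, 1.0)
draw_class1 = lambda rng: rng.gauss(2.0, 1.0)

sample = separate_sample(50, 50, draw_class0, draw_class1)
fraction_class0 = sum(1 for _, y in sample if y == 0) / len(sample)
# This fraction equals the design choice 50/100 whatever the true prior is.
print(fraction_class0)
```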

**Why Is This Important?**

Since our aim is to use the data to train a classifier, does the inability to consistently estimate the class prior probability, c, matter? Why is this a major issue for data science? It has been shown that separate sampling, when the true class prior probabilities are not incorporated, significantly degrades classification performance (here).

Unfortunately, separate sampling is ubiquitous in many fields, including bioinformatics. In genomic classification in particular, a standard approach is to take samples from two classes, say, different types or different stages of cancer, for which the number of specimens in each class is not chosen randomly, and then to design a classifier. But when the true class ratios are unknown, as is often the case in practice, what can we do to mitigate the problem?

**One Robust Solution**

Let us define a *sampling ratio*, r, controlling the ratio between the number of samples in each class. Different classification rules will differ in their sensitivity to r. Now, going beyond the case where the true class prevalences (i.e., class prior probabilities) are known or approximately known, consider the case where one has no good idea of the class ratios. Then a minimax choice of r is an option.

One can see that there exists a sampling ratio (at least for the classification rules tested here) at which the class-conditional errors cross each other, and this is what is shown to be the "minimax" ratio. Of course, its suitability depends upon the classification rule and the feature-label distribution.
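The crossing idea can be sketched numerically on a toy problem. Everything in the block below is an assumption made for illustration and is not taken from the study referred to above: r is interpreted as the fraction of the sample drawn from class 0, the two classes are one-dimensional Gaussians with hypothetical parameters, and the classifier is an idealized LDA-style threshold rule that plugs in r as the class-0 prior and treats the class means and variances as known.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def class_conditional_errors(r, m0=0.0, s0=1.0, m1=2.0, s1=2.0):
    """Class-conditional errors of an idealized 1-D LDA-style rule.

    Assumptions (illustrative only): r is the class-0 fraction of the
    sample and is used as the class-0 prior; means m0 < m1 and standard
    deviations s0, s1 are known exactly; the pooled variance is weighted
    by r, as it would be under separate sampling.
    """
    pooled_var = r * s0**2 + (1.0 - r) * s1**2
    # Points below the threshold are assigned to class 0.
    threshold = 0.5 * (m0 + m1) + pooled_var / (m1 - m0) * math.log(r / (1.0 - r))
    err0 = 1.0 - normal_cdf((threshold - m0) / s0)  # class 0 labeled as class 1
    err1 = normal_cdf((threshold - m1) / s1)        # class 1 labeled as class 0
    return err0, err1


def crossing_ratio(lo=1e-6, hi=1.0 - 1e-6, tol=1e-10):
    """Bisection for the r at which the two class-conditional errors are equal.

    In this toy setup err0 - err1 is positive for small r and negative
    for large r, so the sign change brackets the crossing.
    """
    def gap(r):
        e0, e1 = class_conditional_errors(r)
        return e0 - e1

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r_star = crossing_ratio()
print(r_star, class_conditional_errors(r_star))
```

In this toy setup the class-0 error falls and the class-1 error rises as r grows, so the larger of the two errors is smallest at the crossing point. That is the sense in which the crossing ratio guards against the worst case when the true prior is unknown. With other classification rules or distributions the curves may behave differently.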

**Conclusion**
