 # A Numerical Example on the Principles of Stochastic Discrimination

Studies on ensemble methods for classification suffer from the difficulty of modeling the complementary strengths of the components. Kleinberg's theory of stochastic discrimination (SD) addresses this rigorously via mathematical notions of enrichment, uniformity, and projectability of an ensemble. We explain these concepts via a very simple numerical example that captures the basic principles of the SD theory and method. We focus on a fundamental symmetry in point set covering that is the key observation leading to the foundation of the theory. We believe a better understanding of the SD method will lead to developments of better tools for analyzing other ensemble methods.


## 1 Introduction

Methods for classifier combination, or ensemble learning, can be divided into two categories: 1) decision optimization methods that try to obtain consensus among a given set of classifiers to make the best decision; 2) coverage optimization methods that try to create a set of classifiers that work best with a fixed decision combination function to cover all possible cases.

Decision optimization methods rely on the assumption that the given set of classifiers, typically of a small size, contain sufficient expert knowledge about the application domain, and each of them excels in a subset of all possible input. A decision combination function is chosen or trained to exploit the individual strengths while avoiding their weaknesses. Popular combination functions include majority/plurality votes, sum/product rules, rank/confidence score combination, and probabilistic methods. While numerous successful applications of these methods have been reported, the joint capability of the classifiers sets an intrinsic limitation on decision optimization that the combination functions cannot overcome. A challenge in this approach is to find out the “blind spots” of the ensemble and to obtain a classifier that covers them.

Coverage optimization methods use an automatic and systematic mechanism to generate new classifiers with the hope of covering all possible cases. A fixed, typically simple, function is used for decision combination. This can take the form of training set subsampling, such as stacking, bagging, and boosting, feature subspace projection, superclass/subclass decomposition, or other forms of random perturbation of the classifier training procedures. Open questions in these methods are 1) how many classifiers are enough? 2) what kind of differences among the component classifiers yields the best combined accuracy? 3) how much limitation is set by the form of the component classifiers?

Apparently both categories of ensemble methods run into some dilemma. Should the component classifiers be weakened in order to achieve a stronger whole? Should some accuracy be sacrificed for the known samples to obtain better generalization for the unseen cases? Do we seek agreement, or differences among the component classifiers?

A central difficulty in studying the performance of these ensembles is how to model the complementary strengths among the classifiers. Many proofs rely on an assumption of statistical independence of component classifiers’ decisions. But rarely is there any attempt to match this assumption with observations of the decisions. Often, global estimates of the component classifiers’ accuracies are used in their selection, while in an ensemble what matter more are the local estimates, plus the relationship between the local accuracy estimates on samples that are close neighbors in the feature space.

There is more discussion of these difficulties in a recent review.

Deeper investigation of these issues leads back to three major concerns in choosing classifiers: discriminative power, use of complementary information, and generalization power. A complete theory on ensembles must address these three issues simultaneously. Many current theories rely, either explicitly or implicitly, on ideal assumptions on one or two of these issues, or have them omitted entirely, and are therefore incomplete.

Kleinberg’s theory and method of stochastic discrimination (SD) is the first attempt to explicitly address these issues simultaneously from a mathematical point of view. In this theory, rigorous notions are made for discriminative power, complementary information, and generalization power of an ensemble. A fundamental symmetry is observed between the probability of a fixed model covering a point in a given set and the probability of a fixed point being covered by a model in a given ensemble. The theory establishes that these three conditions are sufficient for an ensemble to converge, with increases in its size, to the most accurate classifier for the application.

Kleinberg’s analysis uses a set-theoretic abstraction to remove all the algorithmic details of classifiers, features, and training procedures. It considers only the classifiers’ decision regions in the form of point sets, called weak models, in the feature space. A collection of classifiers is thus just a sample from the power set of the feature space. If the sample satisfies a uniformity condition, i.e., if its coverage is unbiased for any local region of the feature space, then a symmetry is observed between two probabilities (w.r.t. the feature space and w.r.t. the power set, respectively) of the same event that a point of a particular class is covered by a component of the sample. Discrimination between classes is achieved by requiring some minimum difference in each component’s inclusion of points of different classes, which is trivial to satisfy. By way of this symmetry, it is shown that if the sample of weak models is large, the discriminant function, defined on the coverage of the models on a single point and the class-specific differences within each model, converges to poles distinct by class with diminishing variance.

We believe that this symmetry is the key to the discussions on classifier combination. However, since the theory was developed from a fresh, original, and independent perspective on the problem of learning, there have not been many direct links made to the existing theories. As the concepts are new, the claims are high, the published algorithms appear simple, and the details of more sophisticated implementations are not known, the method has been poorly understood and is sometimes referred to as mysterious.

It is the goal of this lecture to illustrate the basic concepts in this theory and remove the apparent mystery. We present the principles of stochastic discrimination with a very simple numerical example. The example is so chosen that all computations can be easily traced step-by-step by hand or with very simple programs. We use Kleinberg’s notation wherever possible to make it easier for the interested readers to follow up on the full theory in the original papers. The emphasis in this note is on explaining the concepts of uniformity and enrichment, and the behavior of the discriminant when both conditions are fulfilled. For the details of the mathematical theory and outlines of practical algorithms, please refer to the original publications.

## 2 Symmetry of Probabilities Induced by Uniform Space Covering

The SD method is based on a fundamental symmetry in point set covering. To illustrate this symmetry, we begin with a simple observation. Consider a set $S = \{a, b, c\}$ and all the subsets with two elements, $s_1 = \{a, b\}$, $s_2 = \{a, c\}$, and $s_3 = \{b, c\}$. By our choice, each of these subsets has captured $2/3$ of the elements of $S$. We call this ratio $r$. Let us now look at each member of $S$, and check how many of these three subsets have included that member. For example, $a$ is in two of them, so we say $a$ is captured by $2/3$ of these subsets. We will obtain the same value $2/3$ for all elements of $S$. This value is the same as $r$. This is a consequence of the fact that we have used all such 2-member subsets and we have not biased this collection towards any element of $S$. With this observation, we begin a larger example.

Consider a set of 10 points in a one-dimensional feature space $F$. Let this set be called $A$. Assume that $F$ contains only points in $A$ and nothing else. Let each point in $A$ be identified as $q_0$, $q_1$, …, $q_9$ as follows.

. . . . . . . . . .

Now consider the subsets of $F$. Let the collection of all such subsets be $\mathcal{M}$, which is the power set of $F$. We call each member $m$ of $\mathcal{M}$ a model, and we restrict our consideration to only those models that contain 5 points in $A$, therefore each model has a size that is 0.5 of the size of $A$. Let this set of models be called $M_{0.5,A}$. Some members of $M_{0.5,A}$ are as follows.

$$\{q_0, q_1, q_2, q_3, q_4\},\quad \{q_0, q_1, q_2, q_3, q_5\},\quad \{q_0, q_1, q_2, q_3, q_6\},\quad \ldots$$

There are $\binom{10}{5} = 252$ members in $M_{0.5,A}$. Let $M$ be a pseudo-random permutation of members in $M_{0.5,A}$ as listed in Table 1 in the Appendix. We identify models in this sequence by a single subscript, so that $M = m_1, m_2, \ldots, m_{252}$. We expand a collection $M_t$ by including more and more members of $M_{0.5,A}$ in the order of the sequence $M$ as follows: $M_1 = \{m_1\}$, $M_2 = \{m_1, m_2\}$, …, $M_t = \{m_1, m_2, \ldots, m_t\}$.

Since each model covers some points in $A$, for each member $q$ in $A$, we can count the number of models in $M_t$ that include $q$, call this count $N(q, M_t)$, and calculate the ratio of this count over the size of $M_t$, call it $Y(q, M_t)$. That is, $Y(q, M_t) = N(q, M_t)/t$. As $M_t$ expands, this ratio changes and we show these changes for each $q$ in Table 2 in the Appendix. The values of $Y(q, M_t)$ are plotted in Figure 1. As is clearly visible in the Figure, the values of $Y(q, M_t)$ converge to 0.5 for each $q$. Also notice that because of the randomization, we have expanded $M_t$ in a way that is not biased towards any particular $q$, therefore the values of $Y(q, M_t)$ are similar after $M_t$ has acquired a certain size. When $M_t = M_{0.5,A}$, every point $q$ is covered by the same number of models in $M_t$, and their values of $Y(q, M_t)$ are identical and equal to 0.5, which is the ratio of the size of each $m$ relative to $A$ (recall that we always include 5 points from $A$ in each $m$).

Formally, when $t = 252$, $M_t = M_{0.5,A}$, and from the perspective of a fixed $q$, the probability of it being contained in a model $m$ from $M_t$ is

$$\mathrm{Prob}_{\mathcal{M}}(q \in m \mid m \in M_{0.5,A}) = 0.5.$$

We emphasize that this probability is a measure in the space $\mathcal{M}$ by writing the probability as $\mathrm{Prob}_{\mathcal{M}}$. On the other hand, by the way each $m$ is constructed, we know that from the perspective of a fixed $m$,

$$\mathrm{Prob}_{F}(q \in m \mid q \in A) = 0.5.$$

Note that this probability is a measure in the space $F$. We have shown that these two probabilities, w.r.t. two different spaces, have identical values. In other words, let the membership function of $m$ be $C_m(q)$, i.e., $C_m(q) = 1$ iff $q \in m$ and $C_m(q) = 0$ otherwise. Then the two random variables obtained from $C_m(q)$, one by fixing $m$ and varying $q$, the other by fixing $q$ and varying $m$, have the same probability distribution when $q$ is restricted to $A$ and $m$ is restricted to $M_{0.5,A}$. This is because both variables can have values that are either 1 or 0, and they have the value 1 with the same probability (0.5 in this case). This symmetry arises from the fact that the collection of models $M_{0.5,A}$ covers the set $A$ uniformly, i.e., since we have used all members of $M_{0.5,A}$, each point $q$ has the same chance to be included in one of these models. If any two points in a set $S$ have the same chance to be included in a collection of models, we say that this collection is $S$-uniform. It can be shown, by a simple counting argument, that uniformity leads to the symmetry of $\mathrm{Prob}_{\mathcal{M}}$ and $\mathrm{Prob}_{F}$, and hence to identical distributions of the two random variables.

The observation and utilization of this duality are central to the theory of stochastic discrimination. A critical point of the SD method is to enforce such a uniform cover on a set of points. That is, to construct a collection of models in a balanced way so that the uniformity (hence the duality) is achieved without exhausting all possible models from the space.

Figure 1: Plot of $Y(q, M_t)$ versus $t$. Each line represents the trace of $Y(q, M_t)$ for a particular $q$ as $M_t$ expands.
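The counting experiment of this section is easy to reproduce. The following is a minimal Python sketch, not the program used for the tables in the Appendix: the points are represented by the integers 0 to 9, the function name `coverage_ratios` is hypothetical, and the shuffle uses an arbitrary seed, so the intermediate ratios will differ from those that depend on the particular permutation in Table 1.

```python
import itertools
import random


def coverage_ratios(points, models):
    """For each point q, return the fraction of the given models that contain q."""
    t = len(models)
    return {q: sum(q in m for m in models) / t for q in points}


A = list(range(10))  # the ten points, identified by their indices 0..9

# All models containing exactly 5 of the 10 points (252 of them).
all_models = [frozenset(c) for c in itertools.combinations(A, 5)]

# A pseudo-random ordering of the models; the seed is an arbitrary assumption.
rng = random.Random(0)
sequence = list(all_models)
rng.shuffle(sequence)

for t in (10, 50, 252):
    ratios = coverage_ratios(A, sequence[:t])
    print(t, [round(ratios[q], 3) for q in A])
# With all 252 models, each point lies in 126 of them, so every ratio is 0.5.
```

For small `t` the ratios scatter around 0.5, and the scatter depends on the ordering; only the final row is independent of the seed.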

## 3 Two-Class Discrimination

Let us now label each point in $A$ by one of two classes $c_1$ (marked by “x”) and $c_2$ (marked by “o”) as follows.

x x x o o o o x x o

This gives a training set $TR_i$ for each class $c_i$. In particular,

$$TR_1 = \{q_0, q_1, q_2, q_7, q_8\},$$

and

$$TR_2 = \{q_3, q_4, q_5, q_6, q_9\}.$$

How can we build a classifier for $c_1$ and $c_2$ using models from $M_{0.5,A}$? First, we evaluate each model $m$ by how well it has captured the members of each class. Define ratings $r_i$ ($i = 1, 2$) for each $m$ as

$$r_i(m) = \mathrm{Prob}_{F}(q \in m \mid q \in TR_i).$$

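For example, consider model $m_1$, where one member is in $TR_1$ and the rest are in $TR_2$. $TR_1$ has 5 members and 1 is in $m_1$, therefore $r_1(m_1) = 1/5 = 0.2$. $TR_2$ has (incidentally, also) 5 members and 4 of them are in $m_1$, therefore $r_2(m_1) = 4/5 = 0.8$. Thus these ratings represent the quality of the models as a description of each class. A model with a rating of 1.0 for a class is a perfect model for that class. We call the difference between $r_1$ and $r_2$ the degree of enrichment of $m$ with respect to classes $c_1$ and $c_2$, i.e., $d_{12}(m) = r_1(m) - r_2(m)$. A model $m$ is enriched if $d_{12}(m) \neq 0$. Now we define, for all enriched models $m$,

$$X_{12}(q, m) = \frac{C_m(q) - r_2(m)}{r_1(m) - r_2(m)},$$

and let $X_{12}(q, m)$ be 0 if $d_{12}(m) = 0$. For a given $m$, $r_1$ and $r_2$ are fixed, and the value of $X_{12}$ for each $q$ in $A$ can have one of two values depending on whether $q$ is in $m$. For example, for $m_1$, $r_1 = 0.2$ and $r_2 = 0.8$, so $X_{12} = (1 - 0.8)/(0.2 - 0.8) = -1/3$ for points $q \in m_1$, and $X_{12} = (0 - 0.8)/(0.2 - 0.8) = 4/3$ for points $q \notin m_1$. Next, for each set $M_t$, we define a discriminant

$$Y_{12}(q, M_t) = \frac{1}{t} \sum_{k=1}^{t} X_{12}(q, m_k).$$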

As the set $M_t$ expands, the value of $Y_{12}$ changes for each $q$. We show, in Table 3 in the Appendix, the values of $Y_{12}$ for each $M_t$ and each $q$, and for each new member $m_t$ of $M_t$, its ratings $r_1$ and $r_2$, and the two values of $X_{12}$. The values of $Y_{12}$ for each $q$ are plotted in Figure 2.

Figure 2: Plot of $Y_{12}(q, M_t)$ versus $t$. Each line represents the trace of $Y_{12}(q, M_t)$ for a particular $q$ as $M_t$ expands.

In Figure 2 we see two separate trends. All those points that belong to class $c_1$ have their $Y_{12}$ values converging to 1.0, and all those in $c_2$ converging to 0.0. Thus $Y_{12}$ can be used with a threshold to classify an arbitrary point $q$. We can assign $q$ to class $c_1$ if $Y_{12}(q, M_t) > 0.5$, and to class $c_2$ if $Y_{12}(q, M_t) < 0.5$, and remain undecided when $Y_{12}(q, M_t) = 0.5$. Observe that this classifier is fairly accurate far before $M_t$ has expanded to the full set $M_{0.5,A}$. We can also change the two poles of $Y_{12}$ to 1.0 and -1.0 respectively by simply rescaling and shifting $X_{12}$:

$$X_{12}(q, m) = 2\left(\frac{C_m(q) - r_2(m)}{r_1(m) - r_2(m)}\right) - 1.$$
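The ratings, the per-model term $X_{12}$, and the discriminant $Y_{12}$ can be traced with a short program. The sketch below is an illustration under stated assumptions rather than the code behind Table 3: points are the integers 0 to 9, the helper names `rating`, `x12`, and `y12` are hypothetical, the model `example` is simply one model with one point from $TR_1$ and four from $TR_2$, and the ordering of models comes from an arbitrary seed.

```python
import itertools
import random

TR1 = {0, 1, 2, 7, 8}  # indices of the points labeled "x"
TR2 = {3, 4, 5, 6, 9}  # indices of the points labeled "o"


def rating(m, tr):
    """Fraction of the training set tr that is captured by model m."""
    return len(m & tr) / len(tr)


def x12(q, m):
    """Per-model term X12(q, m); defined as 0 for models that are not enriched."""
    r1, r2 = rating(m, TR1), rating(m, TR2)
    if r1 == r2:
        return 0.0
    return ((1.0 if q in m else 0.0) - r2) / (r1 - r2)


def y12(q, models):
    """Discriminant Y12: the average of X12(q, m) over the given models."""
    return sum(x12(q, m) for m in models) / len(models)


# One model with r1 = 0.2 and r2 = 0.8 (chosen here only for illustration).
example = frozenset({3, 5, 6, 8, 9})
print(x12(3, example), x12(0, example))  # about -1/3 inside, 4/3 outside

sequence = [frozenset(c) for c in itertools.combinations(range(10), 5)]
random.Random(0).shuffle(sequence)  # arbitrary seed, not the order of Table 1

for t in (10, 50, 252):
    print(t, [round(y12(q, sequence[:t]), 2) for q in range(10)])
# With all 252 models, Y12 is 1 for points in TR1 and 0 for points in TR2
# (up to floating-point rounding).
```

Thresholding the printed values at 0.5 gives the classifier described above; how early it becomes correct for every point depends on the ordering of the models.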

How did this separation of trends happen? Let us now take a closer look at the models in each $M_t$ and see how many of them cover each point $q$. For a given $M_t$, among its members, there can be different values of $r_1$ and $r_2$. But because of our choices of the sizes of $TR_1$, $TR_2$, and $m$, we have only a small set of distinct values that $r_1$ and $r_2$ can have. Namely, since each model has 5 points, there are only six possibilities as follows.

| no. of points from $TR_1$ | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| no. of points from $TR_2$ | 5 | 4 | 3 | 2 | 1 | 0 |
| $r_1$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
| $r_2$ | 1.0 | 0.8 | 0.6 | 0.4 | 0.2 | 0.0 |

Note that in a general setting $r_1$ and $r_2$ do not have to sum up to 1. If we included models of a larger size, say, one with 10 points, we can have both $r_1$ and $r_2$ equal to 1.0. We have simplified matters by using models of a fixed size and training sets of the same size. According to the values of $r_1$ and $r_2$, in this case we have only 6 different kinds of models.

Now we take a detailed look at the coverage of each point $q$ by each kind of models, i.e., models of a particular rating (quality) for each class. Let us count how many of the models of each value of $r_1$ and $r_2$ cover each point $q$, and call this $N_{M_t,r_1,TR_1}(q)$ and $N_{M_t,r_2,TR_2}(q)$ respectively. We can normalize this count by the number of models having each value of $r_1$ or $r_2$, and obtain a ratio $f_{M_t,r_1,TR_1}(q)$ and $f_{M_t,r_2,TR_2}(q)$ respectively. Thus, for each point $q$, we have “a profile of coverage” by models of each value of ratings $r_1$ and $r_2$ that is described by these ratios. For example, point $q_0$ at $t = 10$ is covered by only 5 models in $M_{10}$, and from Table 3 we know that $M_{10}$ has various numbers of models in each rating as summarized in the following tables.

| $r_1$ | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
|---|---|---|---|---|---|---|
| no. of models in $M_{10}$ with $r_1$ | 0 | 2 | 2 | 4 | 2 | 0 |
| $N_{M_{10},r_1,TR_1}(q_0)$ | 0 | 0 | 0 | 3 | 2 | 0 |
| $f_{M_{10},r_1,TR_1}(q_0)$ | 0 | 0 | 0 | 0.75 | 1 | 0 |

| $r_2$ | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
|---|---|---|---|---|---|---|
| no. of models in $M_{10}$ with $r_2$ | 0 | 2 | 4 | 2 | 2 | 0 |
| $N_{M_{10},r_2,TR_2}(q_0)$ | 0 | 2 | 3 | 0 | 0 | 0 |
| $f_{M_{10},r_2,TR_2}(q_0)$ | 0 | 1 | 0.75 | 0 | 0 | 0 |

We show such profiles for each point $q$ and each set $M_t$ in Figure 3 (as a function of $r_1$) and Figure 4 (as a function of $r_2$) respectively.

Observe that as $t$ increases, the profiles of coverage for each point $q$ converge to two distinct patterns. In Figure 3, the profiles for points in $TR_1$ converge to a diagonal $f_{M_t,r_1,TR_1} = r_1$, and in Figure 4, those for points in $TR_2$ also converge to a diagonal $f_{M_t,r_2,TR_2} = r_2$. That is, when $t = 252$, we have for all $q$ in $TR_1$ and for all $r_1$, $f_{M_t,r_1,TR_1}(q) = r_1$, and for all $q$ in $TR_2$ and for all $r_2$, $f_{M_t,r_2,TR_2}(q) = r_2$. Thus we have the symmetry in place for both $TR_1$ and $TR_2$. This is a consequence of $M_t$ being both $TR_1$-uniform and $TR_2$-uniform.
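The coverage profiles can be computed directly. The following is a small Python sketch under the same representation as before (points as integers 0 to 9); the function name `profile` is hypothetical, and the check is run only on the full collection of 252 models, where the result does not depend on any ordering.

```python
import itertools
from collections import defaultdict

TR1 = {0, 1, 2, 7, 8}
TR2 = {3, 4, 5, 6, 9}


def profile(q, models, tr):
    """Map each rating r (w.r.t. training set tr) to the fraction of the
    models with that rating that cover the point q."""
    total = defaultdict(int)     # number of models per rating
    covering = defaultdict(int)  # number of those models that contain q
    for m in models:
        k = len(m & tr)          # the rating is k / len(tr); k avoids float keys
        total[k] += 1
        covering[k] += q in m
    return {k / len(tr): covering[k] / total[k] for k in sorted(total)}


all_models = [frozenset(c) for c in itertools.combinations(range(10), 5)]

# A point of TR1 profiled by r1, and a point of TR2 profiled by r2.
print(profile(0, all_models, TR1))
print(profile(3, all_models, TR2))
# In both cases the fraction of covering models equals the rating itself,
# which is the diagonal pattern described in the text.
```

Applying `profile` to a prefix of a shuffled list of models gives the intermediate, non-diagonal profiles such as the one tabulated above for $t = 10$, although the exact numbers depend on the ordering used.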

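The discriminant $Y_{12}(q, M_t)$ is a summation over all models $m$ in $M_t$, which can be decomposed into the sums of terms corresponding to different ratings $r_i$ for either $i = 1$ or $i = 2$. To understand what happens with the points in $TR_1$, we can decompose their $Y_{12}$ by values of $r_1$. Assume that there are $t_x$ models in $M_t$ that have $r_1 = x$. Since we have only 6 distinct values for $x$, $M_t$ is a union of 6 disjoint sets, and $Y_{12}$ can be decomposed as

$$
\begin{aligned}
Y_{12}(q, M_t) ={}& \frac{t_{0.0}}{t}\left[\frac{1}{t_{0.0}} \sum_{k_{0.0}=1}^{t_{0.0}} X_{12}(q, m_{k_{0.0}})\right]
+ \frac{t_{0.2}}{t}\left[\frac{1}{t_{0.2}} \sum_{k_{0.2}=1}^{t_{0.2}} X_{12}(q, m_{k_{0.2}})\right] \\
&+ \frac{t_{0.4}}{t}\left[\frac{1}{t_{0.4}} \sum_{k_{0.4}=1}^{t_{0.4}} X_{12}(q, m_{k_{0.4}})\right]
+ \frac{t_{0.6}}{t}\left[\frac{1}{t_{0.6}} \sum_{k_{0.6}=1}^{t_{0.6}} X_{12}(q, m_{k_{0.6}})\right] \\
&+ \frac{t_{0.8}}{t}\left[\frac{1}{t_{0.8}} \sum_{k_{0.8}=1}^{t_{0.8}} X_{12}(q, m_{k_{0.8}})\right]
+ \frac{t_{1.0}}{t}\left[\frac{1}{t_{1.0}} \sum_{k_{1.0}=1}^{t_{1.0}} X_{12}(q, m_{k_{1.0}})\right].
\end{aligned}
$$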

The factor in the square bracket of each term is the expectation of values of $X_{12}$ corresponding to that particular rating $r_1 = x$. Since $r_1$ is the same for all $m$ contributing to that term, by our choice of sizes of $TR_1$, $TR_2$, and the models, $r_2$ is also the same for all those $m$ relevant to that term. Let that value of $r_2$ be $y$. We have, for each (fixed) $q$, each value of $x$ and the associated value $y$,

$$E(X_{12}(q, m_x)) = E\left(\frac{C_{m_x}(q) - y}{x - y}\right) = \frac{E(C_{m_x}(q)) - y}{x - y} = \frac{x - y}{x - y} = 1.$$

The second to the last equality is a consequence of the uniformity of $M_t$: because the collection $M_t$ (when $t = 252$) covers $TR_1$ uniformly, we have for each value $x$, $\mathrm{Prob}_{\mathcal{M}}(q \in m_x) = x$, and since $C_{m_x}(q)$ has only two values (0 or 1), and $C_{m_x}(q) = 1$ iff $q \in m_x$, we have the expected value of $C_{m_x}(q)$ equal to $x$. Therefore

$$Y_{12}(q, M_t) = \frac{t_{0.0} + t_{0.2} + t_{0.4} + t_{0.6} + t_{0.8} + t_{1.0}}{t} = 1.$$

In a more general case, the values of $r_2$ are not necessarily equal for all models with the same value for $r_1$, so we cannot take $y$ and $x - y$ out as constants. But then we can further split the term by the values of $r_2$, and proceed with the same argument.

A similar decomposition of $Y_{12}$ into terms corresponding to different values of $r_2$ will show that $Y_{12} = 0$ for those points in $TR_2$.
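For completeness, the case of a point in $TR_2$ can be written out in the same way. This is a sketch that keeps the simplification used above, namely that all models with $r_2 = y$ share a single value $r_1 = x$ with $x \neq y$; the symbols $t_y$ (the number of models in $M_t$ with $r_2 = y$) and $m_y$ (a model with that rating) are introduced here only for this illustration. Because the full collection covers $TR_2$ uniformly, $E(C_{m_y}(q)) = y$ for $q$ in $TR_2$, so

$$E(X_{12}(q, m_y)) = \frac{E(C_{m_y}(q)) - y}{x - y} = \frac{y - y}{x - y} = 0,$$

and therefore

$$Y_{12}(q, M_t) = \sum_{y} \frac{t_y}{t} \cdot 0 = 0 \quad \text{for } q \in TR_2 .$$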

## 4 Projectability of Models

We have built a classifier and shown that it works for $TR_1$ and $TR_2$. How can this classifier work for an arbitrary point that is not in $TR_1$ or $TR_2$? Suppose that the feature space $F$ contains other points $p_i$ (marked by “,”), and that each $p_i$ is close to some training point $q_i$ (marked by “.”) as follows.

., ., ., ., ., ., ., ., ., .,

We can take the models as regions in the space that cover the points $q_i$ in the same manner as before. Say each point $q_i$ has a particular value $v(q_i)$ of the feature $v$ (in our one-dimensional feature space). We can define a model by ranges of values for this feature, with boundaries placed midway between neighboring training points; e.g., for the points covered by $m_1$ in our example, we take

$$m_1 = \left\{\, q \;\middle|\; \frac{v(q_2) + v(q_3)}{2} < \dots \,\right\}$$

Thus we can tell if an arbitrary point $p$ with value $v(p)$ for this feature is inside or outside this model.

We can calculate the model’s ratings in exactly the same way as before, using only the points $q_i$. But now the same classifier works for the new points $p_i$, since we can use the new definitions of models to determine if $p_i$ is inside or outside each model. Given the proximity relationship as above, those points will be assigned to the same class as their closest neighboring $q_i$. If these are indeed the true classes for the points $p_i$, the classifier is perfect for this new set. In the SD terminology, if we call the two subsets of points $p_i$ that should be labeled as two different classes $TE_1$ and $TE_2$, i.e., $TE_1 = \{p_0, p_1, p_2, p_7, p_8\}$, $TE_2 = \{p_3, p_4, p_5, p_6, p_9\}$, we say that $TR_1$ and $TE_1$ are $M_{0.5,A}$-indiscernible, and similarly $TR_2$ and $TE_2$ are also $M_{0.5,A}$-indiscernible. This is to say, from the perspective of $M_{0.5,A}$, there is no difference between $TR_1$ and $TE_1$, or $TR_2$ and $TE_2$, therefore all the properties of $M_{0.5,A}$ that are observed using $TR_1$ and $TR_2$ can be projected to $TE_1$ and $TE_2$. The central challenge of an SD method is to maintain projectability, uniformity, and enrichment of the collection of models at the same time.
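The projection of models from point sets to regions can be made concrete as follows. This Python sketch rests on assumptions that are not in the text: the training points are given the feature values 0, 1, …, 9, each unseen point is placed 0.2 to the right of a training point, and each model becomes a union of intervals whose boundaries lie midway between neighboring training points. The names `region`, `covers`, and `y12_at` are hypothetical.

```python
import itertools
import math

V = [float(i) for i in range(10)]  # assumed feature values of the training points
TR1 = {0, 1, 2, 7, 8}
TR2 = {3, 4, 5, 6, 9}


def region(member_idx):
    """Convert a model given by training-point indices into (lo, hi] intervals
    bounded midway between neighboring training points."""
    intervals = []
    for i in sorted(member_idx):
        lo = -math.inf if i == 0 else (V[i - 1] + V[i]) / 2
        hi = math.inf if i == len(V) - 1 else (V[i] + V[i + 1]) / 2
        if intervals and intervals[-1][1] == lo:
            intervals[-1] = (intervals[-1][0], hi)  # merge touching intervals
        else:
            intervals.append((lo, hi))
    return intervals


def covers(intervals, v):
    """True if the feature value v falls inside any interval of the model."""
    return any(lo < v <= hi for lo, hi in intervals)


def y12_at(v, models):
    """Discriminant for an arbitrary feature value, with ratings computed
    from the training points only."""
    total = 0.0
    for members, intervals in models:
        r1 = len(members & TR1) / len(TR1)
        r2 = len(members & TR2) / len(TR2)
        if r1 != r2:
            total += ((1.0 if covers(intervals, v) else 0.0) - r2) / (r1 - r2)
    return total / len(models)


models = [(frozenset(c), region(c)) for c in itertools.combinations(range(10), 5)]
new_points = [v + 0.2 for v in V]  # hypothetical unseen points near each q_i
labels = [1 if y12_at(p, models) > 0.5 else 2 for p in new_points]
print(labels)  # each new point receives the class of its nearest training point
```

Because every unseen point here falls in the same intervals as its nearest training point, the discriminant takes the same values as on the training set; with less representative training data this would no longer hold.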

## 5 Developments of SD Theory and Algorithms

### 5.1 Algorithmic Implementations

The method of stochastic discrimination constructs a classifier by combining a large number of simple discriminators that are called weak models. A weak model is simply a subset of the feature space. Conceptually, the classifier is constructed by a three-step process: (1) weak model generation, (2) weak model evaluation, and (3) weak model combination. The generator enumerates weak models in an arbitrary order and passes them on to the evaluator. The evaluator has access to the training set. It rates and filters the weak models according to their capability in capturing points of each class, and their contribution to satisfying the uniformity condition. The combinator then produces a discriminant function that depends on a point’s membership status with respect to each model, and the models’ ratings. At classification, a point is assigned to the class for which this discriminant has the highest value. Informally, the method captures the intuition of gaining wisdom from graded random guesses.

#### Weak model generation.

Two guidelines should be observed in generating the weak models:

(1) projectability: A weak model should be able to capture more than one point so that the solution can be projectable to points not included in the training set. Geometrically, this means that a useful model must be of a certain minimum size, and it should be able to capture points that are considered neighbors of one another. To guarantee similar accuracies of the classifier (based on similar ratings of the weak models) on both training and testing data, one also needs an assumption that the training data are representative. Data representativeness and model projectability are two sides of the same question. More discussion of this can be found in the SD literature. A weak model defines a neighborhood in the space, and we need a training sample in a neighborhood of every unseen sample. Otherwise, since our only knowledge of the class boundaries is from the given training set, there can be no basis for any inference concerning regions of the feature space where no training samples are given.

(2) simplicity of representation: A weak model should have a simple representation. That means, the membership of an arbitrary point with respect to a model must be cheaply computable. To illustrate this, consider representing a model as a listing of all the points it contains. This is practically useless since the resultant solution could be as expensive as an exhaustive template matching using all the points in the feature space. An example of a model with a simple representation is a half-plane in a two-dimensional feature space.

Conditions (1) and (2) restrict the type of weak models yet by no means reduce the number of candidates to any tangible limit. To obtain an unbiased collection of the candidates with minimum effort, random sampling with replacement is useful. The training of the method thus relies on a stochastic process which, at each iteration, generates a weak model that satisfies the above conditions.

A convenient way to generate weak models randomly is to use a type of model that can be described by a small number of parameters. The values of the parameters can be chosen pseudo-randomly. Some example types of models that can be generated this way include (1) half-spaces bounded by a threshold on a randomly selected feature dimension; (2) half-spaces bounded by a hyperplane of equi-distance to two randomly selected points; (3) regions bounded by two parallel hyperplanes perpendicular to a randomly selected axis; (4) hypercubes centered at randomly selected points with edges of varying lengths; (5) balls (based on the city-block metric) centered at randomly selected points with randomly selected radii; and (6) balls (based on the Euclidean metric) centered at randomly selected points with randomly selected radii. A model can also be a union or intersection of several regions of these types. An implementation of SD using hyper-rectangular boxes as weak models has also been described in the literature.

A number of heuristics may be used in creating these models. These heuristics specify the way random points are chosen from the space, or set limits on the maximum and minimum sizes of the models. By this we mean restricting the choice of random points to, for instance, points in the space whose coordinates fall inside the range of those of the training samples, or restricting the radii of the balls to, for instance, a fraction of the range of values in a particular feature dimension. The purpose of these heuristics is to speed up the search for acceptable models by confining the search within the most interesting regions, or to guarantee a minimum model size.
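Two of the model types listed above, together with the heuristics on where the random parameters are drawn from, can be sketched as follows. This is only an illustration: the function names, the representation of a model as a membership function, and the `max_fraction` parameter are assumptions made for this example, not part of any published SD implementation.

```python
import math
import random


def bounding_box(samples):
    """Per-dimension (min, max) range of the training samples."""
    dims = range(len(samples[0]))
    return [(min(s[d] for s in samples), max(s[d] for s in samples)) for d in dims]


def random_half_space(box, rng):
    """Half-space bounded by a threshold on a randomly selected dimension.
    The threshold is confined to the range of the training samples."""
    d = rng.randrange(len(box))
    lo, hi = box[d]
    threshold = rng.uniform(lo, hi)
    keep_above = rng.random() < 0.5

    def contains(x):
        return (x[d] > threshold) == keep_above

    return contains


def random_ball(box, rng, max_fraction=0.5):
    """Euclidean ball with a random center inside the bounding box and a radius
    limited to a fraction of the widest feature range."""
    center = [rng.uniform(lo, hi) for lo, hi in box]
    widest = max(hi - lo for lo, hi in box)
    radius = rng.uniform(0.0, max_fraction * widest)

    def contains(x):
        return math.dist(x, center) <= radius

    return contains


samples = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
box = bounding_box(samples)
rng = random.Random(0)  # arbitrary seed
stream = [random_half_space(box, rng) for _ in range(3)]
stream += [random_ball(box, rng) for _ in range(3)]
for model in stream:
    print([model(s) for s in samples])
```

Each model is cheap to evaluate on an arbitrary point, which is the simplicity-of-representation requirement; whether a generated model is kept is decided afterwards by the evaluator.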

#### Enrichment enforcement.

The enrichment condition is relatively easy to enforce, as models biased towards one class are most common. But since the strength of the biases (the magnitude of the enrichment degree) determines the rate at which accuracy increases, we tend to prefer to use models with an enrichment degree further away from zero.

One way to implement this is to use a threshold on the enrichment degree to select weak models from the random stream so that they are of some minimum quality. In this way, one will be able to use a smaller collection of models to yield a classifier of the same level of accuracy. However, there are tradeoffs involved in doing this. For one thing, models of higher rating are less likely to appear in the stream, and so more random models have to be explored in order to find sufficient numbers of higher quality weak models. And once the type of model is fixed and the value of the threshold is set, there is a risk that such models may never be found.

Alternatively, one can use the most enriched model found in a pre-determined number of trials. This also makes the time needed for training more predictable, and it permits a tradeoff between training time and quality of the weak models.

In enriching the model stream, it is important to remember that if the quality of weak models selected is allowed to get too high, there is a risk that they will become training set specific, that is, less likely to be projectable to unseen samples. This could present a problem since the projectability of the final classifier is directly based on the projectability of its component weak models.
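The second selection strategy, keeping the most enriched model found in a fixed number of trials, can be sketched as below. The one-dimensional interval models, the function names, and the default number of trials are assumptions made for this illustration; the class data reuse the feature values 0 to 9 assumed for the ten-point example.

```python
import random


def ratings(model, class1, class2):
    """Fractions of each class captured by the model."""
    r1 = sum(model(x) for x in class1) / len(class1)
    r2 = sum(model(x) for x in class2) / len(class2)
    return r1, r2


def random_interval(lo, hi, rng):
    """A weak model: a random closed interval inside [lo, hi]."""
    a, b = sorted((rng.uniform(lo, hi), rng.uniform(lo, hi)))
    return lambda x: a <= x <= b


def most_enriched(class1, class2, rng, trials=20):
    """Return the model with the largest |r1 - r2| among a fixed number of
    random trials, together with that enrichment magnitude."""
    lo, hi = min(class1 + class2), max(class1 + class2)
    best, best_d = None, -1.0
    for _ in range(trials):
        m = random_interval(lo, hi, rng)
        r1, r2 = ratings(m, class1, class2)
        d = abs(r1 - r2)
        if d > best_d:
            best, best_d = m, d
    return best, best_d


class1 = [0.0, 1.0, 2.0, 7.0, 8.0]
class2 = [3.0, 4.0, 5.0, 6.0, 9.0]
for trials in (1, 5, 50):
    _, d = most_enriched(class1, class2, random.Random(1), trials=trials)
    print(trials, round(d, 2))
```

With the same seed, a larger number of trials can only keep or increase the enrichment of the selected model, at the cost of more training time; as noted above, pushing this too far risks selecting models that are specific to the training set.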

#### Uniformity promotion.

The uniformity condition is much more difficult to satisfy. Strict uniformity requires that every point be covered by the same number of weak models of every combination of per-class ratings. This is rather infeasible for continuous and unconstrained ratings.

One useful strategy is to use only weak models of a particular rating. In such cases, the ratings $r_i$ and $r_j$ are the same for all models enriched for the discrimination between classes $c_i$ and $c_j$, so we need only to make sure that each point is included in the same number of models. To enforce this, models can be created in groups such that each group partitions the entire space into a set of non-overlapping regions. An example is to use leaves of a fully-split decision tree, where each leaf is perfectly enriched for one class, and each point is covered by exactly one leaf of each tree. For any pairwise discrimination between classes $c_i$ and $c_j$, we can use only those leaves of the trees that contain only points of class $c_i$. In other words, $r_i$ is always 1 and $r_j$ is always 0. Constraints are put in the tree-construction process to guarantee some minimum projectability.

With other types of models, a first step to promote uniformity is to use models that are unions of small regions with simple boundaries. The component regions may be scattered throughout the space. These models have simple representations but can describe complicated class boundaries. They can have some minimum size and hence good projectability. At the same time, the scattered locations of component regions do not tend to cover large areas repeatedly.

A more sophisticated way to promote uniformity involves defining a measure of the lack of uniformity and an algorithm to minimize such a measure. The goal is to create or retain more models located in areas where the coverage is thinner. An example of such a measure is the count of those points that are covered by a less-than-average number of previously retained models. For each point $q$ in the class $c_i$ to be positively enriched, we calculate, out of all previous models used for that class, how many of them have covered $q$. If the coverage is less than the average for class $c_i$, we call $q$ a weak point. When a new model is created, we check how many such weak points are covered by the new model. The ratio of the number of covered weak points to the number of all the weak points is used as a merit score of how well this model improves uniformity. We can accept only those models with a score over a pre-set threshold, or take the model with the best score found in a pre-set number of trials. One can go further to introduce a bias to the model generator so that models covering the weak points are more likely to be created. The latter turns out to be a very effective strategy that led to good results in our experiments.
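The weak-point measure and the merit score described above can be sketched in a few lines. Models are represented here as sets of point indices, and the function names and the convention of returning 0 when there are no weak points are assumptions made for this illustration.

```python
def weak_points(points, retained):
    """Points of the class that are covered by fewer than the average number
    of previously retained models."""
    counts = {q: sum(q in m for m in retained) for q in points}
    average = sum(counts.values()) / len(points)
    return {q for q, c in counts.items() if c < average}


def merit(candidate, points, retained):
    """Fraction of the current weak points that the candidate model covers."""
    weak = weak_points(points, retained)
    if not weak:
        return 0.0  # assumed convention: coverage is already even
    return len(weak & candidate) / len(weak)


def best_of(candidates, points, retained):
    """Take the candidate with the best merit score among a fixed set of trials."""
    return max(candidates, key=lambda m: merit(m, points, retained))


class_points = [0, 1, 2, 7, 8]                  # points of the class being enriched
retained = [{0, 1, 3, 4, 5}, {0, 2, 3, 5, 6}]   # models kept so far
print(sorted(weak_points(class_points, retained)))      # [7, 8]
print(merit({4, 6, 7, 8, 9}, class_points, retained))   # 1.0
print(merit({0, 1, 2, 3, 7}, class_points, retained))   # 0.5
```

In this small case points 7 and 8 have not been covered at all, so a candidate covering both scores highest; a threshold on `merit` or a call to `best_of` over a batch of random candidates corresponds to the two acceptance rules mentioned in the text.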

### 5.2 Alternative Discriminants and Approximate Uniformity

The method outlined above allows for rich possibilities of variation at the algorithmic level. The variations may be in the design of the weak model generator, or in ways to enforce the enrichment and uniformity conditions. It is also possible to change the definition of the discriminant, or to use different kinds of ratings.

A variant of the discriminating function has been studied in detail in the literature. In this variant, we define the ratings by

$$r'_i(m) = \frac{|m \cap TR_i|}{|m \cap TR|},$$

for all $i$, where $TR$ is the full training set. It is an estimate of the posterior probability that a point belongs to class $c_i$ given the condition that it is included in model $m$. The discriminant for class $c_i$ is defined to be:

$$W_i(q) = \frac{\sum_{k=1,\ldots,p_i} C_{m_k}(q)\, r'_i(m_k)}{\sum_{k=1,\ldots,p_i} C_{m_k}(q)},$$

where $p_i$ is the number of models accumulated for class $c_i$.

It turns out that, with this discriminant, the classifier also approaches perfection asymptotically provided that an additional symmetry condition is satisfied. The symmetry condition requires that the ensemble includes the same number of models for all permutations of the ratings $(r'_1, r'_2, \ldots, r'_n)$. It prevents biases created by using more $c_i$-enriched models than $c_j$-enriched models for all pairs $(i, j)$. Again, this condition may be enforced by using only certain particular permutations of the ratings. This alternative discriminant is convenient for multi-class discrimination problems.
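The alternative discriminant can be tried on the ten-point example. The sketch below simplifies the definition in one respect that should be noted: it uses a single shared collection of models for both classes instead of a separate collection of $p_i$ models per class. The function names and the convention of returning 0 for a point that no model covers are also assumptions.

```python
import itertools

TR1 = {0, 1, 2, 7, 8}
TR2 = {3, 4, 5, 6, 9}
TR = TR1 | TR2


def posterior_rating(m, tr_i):
    """Estimated posterior of class i given membership in m: |m & TR_i| / |m & TR|."""
    covered = len(m & TR)
    return len(m & tr_i) / covered if covered else 0.0


def w(q, models, tr_i):
    """Average posterior rating over the models that cover q."""
    covering = [m for m in models if q in m]
    if not covering:
        return 0.0  # assumed convention for an uncovered point
    return sum(posterior_rating(m, tr_i) for m in covering) / len(covering)


# The full collection of 5-point models treats both classes symmetrically.
models = [frozenset(c) for c in itertools.combinations(range(10), 5)]

for q in range(10):
    w1, w2 = w(q, models, TR1), w(q, models, TR2)
    print(q, round(w1, 3), round(w2, 3), 1 if w1 > w2 else 2)
```

With this collection each point gets a value of 5/9 for its own class and 4/9 for the other, so the class with the highest discriminant is the correct one. The margin is small because the collection contains models of every rating; restricting it to more strongly enriched models would widen the gap.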

The SD theory established the mathematical concepts of enrichment, uniformity, and projectability of a weak model ensemble. Bounds on classification accuracy are developed based on strict requirements on these conditions, which is a mathematical idealization. In practice, there are often difficult tradeoffs among the three conditions. Thus it is important to understand how much the classification performance is affected when these conditions are weakened. This is the subject of a separate study, where notions of near uniformity and weak indiscernibility are introduced and their implications are studied.

### 5.3 Structured Collections of Weak Models

As a constructive procedure, the method of stochastic discrimination depends on a detailed control of the uniformity of model coverage, which is outlined but not fully published in the literature. The method of random subspaces followed these ideas but attempted a different approach. Instead of obtaining weak discrimination and projectability through simplicity of the model form, and forcing uniformity by sophisticated algorithms, the method uses complete, locally pure partitions as given in fully split decision trees or nearest neighbor classifiers to achieve strong discrimination and uniformity, and then explicitly forces different generalization patterns on the component classifiers. This is done by training large capacity component classifiers such as nearest neighbors and decision trees to fully fit the data, but restricting the training of each classifier to a coordinate subspace of the feature space where all the data points are projected, so that classifications remain invariant in the complement subspace. If there is no ambiguity in the subspaces, the individual classifiers maintain maximum accuracy on the training data, with no cases deliberately chosen to be sacrificed, and thus the method does not run into the paradox of sacrificing some training points in the hope of better generalization accuracy. This is to create a collection of weak models in a structured way.

However, the tension among the three factors persists. There is another difficult tradeoff in how much discriminating power to retain for the component classifiers. Can each of them use only a single feature dimension so as to maximize invariance in the complement dimensions? Also, projection to coordinate subspaces sets parts of the decision boundaries parallel to the coordinate axes. Augmenting the raw features by simple transformations introduces more flexibility, but it may still be insufficient for an arbitrary problem. Optimization of generalization performance will continue to depend on a detailed control of the projections to suit a particular problem.

## 6 Conclusions

The theory of stochastic discrimination identifies three and only three sufficient conditions for a classifier to achieve maximum accuracy for a problem. These are just the three elements long believed to be important in pattern recognition: discrimination power, complementary information, and generalization ability. It sets a foundation for theories of ensemble learning. Many current questions on classifier combination can have an answer in the arguments of the SD theory: What is good about building the classifier on weak models instead of strong models? Because weak models are easier to obtain, and their smaller capacity renders them less sensitive to sampling errors in small training sets, thus they are more likely to have similar coverage on the unseen points from the same problem. Why are many models needed? Because the method relies on the law of large numbers to reduce the variance of the discriminant on each single point. How should these models complement each other? The uniformity condition specifies exactly what kind of correlation is needed among the individual models.
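The role of the law of large numbers can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: each weak model is treated as covering a fixed point independently with the same probability `p_cover` (an illustrative value of 0.6 is used), and the quantity averaged is the raw coverage indicator rather than the discriminant defined in the SD theory. The function names are hypothetical.

```python
import random
import statistics

def averaged_coverage(p_cover, n_models, rng):
    """Average of n_models independent 0/1 coverage indicators for one point."""
    hits = sum(1 for _ in range(n_models) if rng.random() < p_cover)
    return hits / n_models

def variance_of_average(p_cover, n_models, n_trials=300, seed=0):
    """Empirical variance of the averaged coverage over repeated ensembles."""
    rng = random.Random(seed)
    samples = [averaged_coverage(p_cover, n_models, rng) for _ in range(n_trials)]
    return statistics.pvariance(samples)

# Under the independence assumption the variance is p(1 - p) / n_models,
# so it shrinks in proportion to the ensemble size.
var_small = variance_of_average(0.6, n_models=10)
var_large = variance_of_average(0.6, n_models=1000)
```

With 10 models the averaged value still fluctuates noticeably from one ensemble to another, while with 1000 models it concentrates tightly around `p_cover`. This concentration is what allows a fixed threshold on the average to separate the classes reliably.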

Finally, we emphasize that the accuracy of SD methods is not achieved by intentionally limiting the VC dimension of the complete system; the combination of many weak models can have a very large VC dimension. Rather, the accuracy is a consequence of the symmetry relating probabilities in the two spaces, and of the law of large numbers. It is a structural property of the topology. The observation of this symmetry and its relationship to ensemble learning is a deep insight of Kleinberg’s that we believe can lead to a better understanding of other ensemble methods.

## Acknowledgements

The author thanks Eugene Kleinberg for many discussions over the past decade on the theory of stochastic discrimination, its comparison to other approaches, and perspectives on the fundamental issues in pattern recognition.

## References

•  R. Berlind, An Alternative Method of Stochastic Discrimination with Applications to Pattern Recognition, Doctoral Dissertation, Department of Mathematics, State University of New York at Buffalo, 1994.
•  L. Breiman, “Bagging predictors,” Machine Learning, 24, 1996, 123-140.
•  D. Chen, “Estimates of Classification Accuracies for Kleinberg’s Method of Stochastic Discrimination in Pattern Recognition,” Ph.D. Thesis, SUNY at Buffalo, 1998.
•  T.G. Dietterich, G. Bakiri, “Solving multiclass learning problems via error-correcting output codes,” Journal of Artificial Intelligence Research, 2, 1995, 263-286.
•  Y. Freund, R.E. Schapire, “Experiments with a New Boosting Algorithm,” Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, July 3-6, 1996, 148-156.
•  L.K. Hansen, P. Salamon, “Neural network ensembles,” IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-12, 10, October 1990, 993-1001.
•  T.K. Ho, Random Decision Forests, Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, August 14-18, 1995, 278-282.
•  T.K. Ho, Multiple classifier combination: Lessons and next steps, in A. Kandel, H. Bunke, (eds.), Hybrid Methods in Pattern Recognition, World Scientific, 2002.
•  T.K. Ho, E.M. Kleinberg, Building Projectable Classifiers of Arbitrary Complexity, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, August 25-30, 1996, 880-885.
•  T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 8, August 1998, 832-844.
•  T.K. Ho, Nearest Neighbors in Random Subspaces, Proceedings of the Second International Workshop on Statistical Techniques in Pattern Recognition, Sydney, Australia, August 11-13, 1998, 640-648.
•  T.K. Ho, J. J. Hull, S.N. Srihari, Decision Combination in Multiple Classifier Systems, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-16, 1, January 1994, 66-75.
•  Y.S. Huang, C.Y. Suen, A method of combining multiple experts for the recognition of unconstrained handwritten numerals, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-17, 1, January 1995, 90-94.
•  J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20, 3, March 1998, 226-239.
•  E.M. Kleinberg, Stochastic Discrimination, Annals of Mathematics and Artificial Intelligence, 1, 1990, 207-239.
•  E.M. Kleinberg, An overtraining-resistant stochastic modeling method for pattern recognition, Annals of Statistics, 24, 6, December 1996, 2319-2349.
•  E.M. Kleinberg, On the algorithmic implementation of stochastic discrimination, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-22, 5, May 2000, 473-490.
•  E.M. Kleinberg, A mathematically rigorous foundation for supervised learning, in J. Kittler, F. Roli, (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 67-76.
•  L. Lam, C.Y. Suen, Application of majority voting to pattern recognition, IEEE Transactions on Systems, Man, and Cybernetics, SMC-27, 5, September/October 1997, 553-568.
•  V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, 1982.
•  V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
•  D.H. Wolpert, Stacked generalization, Neural Networks, 5, 1992, 241-259.