1 Introduction
Many statistical classification methods distinguish between only two classes by drawing a hypersurface in the feature space. In a Support Vector Machine (SVM), the hypersurface is drawn by minimizing the classification error. By implicitly transforming the feature space through operations on the dot product, the shape of this hypersurface can be made quite complex
(Müller et al., 2001). In Mills (2011) the hypersurface is discretely sampled by finding the root of the difference in conditional probabilities along a series of lines drawn between the two classes. The conditional probabilities are found using a kernel density estimation technique (Terrell and Scott, 1992) called Adaptive Gaussian Filtering (AGF). There are many methods of generalizing binary classification schemes to more than two classes. The LIBSVM library (Chang and Lin, 2011), for instance, uses a "one-against-one" approach wherein each class is compared against every other class. For large numbers of classes this approach is quite inefficient, since there will be n(n-1)/2 binary classifications, where n is the number of classes. Many other methods exist, and the number of possible configurations increases exponentially with the number of classes.
In many problems a different method of dividing or partitioning the classes would be appropriate. Consider four land surface types: coniferous forest, deciduous forest, corn field and wheat field. Here a hierarchical scheme (also called a decision tree) seems most appropriate since the related surface types will cluster together: first discriminate between forest and field. If forest is returned, then discriminate between evergreens and hardwoods. If field, then between corn and wheat. As another example, in a classification problem involving discretized continuum values, it makes sense to place the partitions between classes that define adjacent ranges in the continuum data.
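To make the scaling concrete: one-against-one needs one binary classifier per pair of classes, whereas a hierarchical scheme with one class per leaf needs only one per internal node of its tree. A brief sketch of the counts (our own illustration):

```python
def one_against_one_count(n):
    """One binary classifier for each unordered pair of classes."""
    return n * (n - 1) // 2

def hierarchical_count(n):
    """A binary decision tree with n leaves (one per class) has
    n - 1 internal nodes, hence n - 1 binary classifiers."""
    return n - 1

for n in (4, 8, 16):
    print(n, one_against_one_count(n), hierarchical_count(n))
```

For the eight humidity classes used in Section 4, this works out to 28 pairwise classifiers versus 7 in a balanced tree.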
New extensions to the libAGF library (Mills, 2011) generalize the binary classification problem so that the most appropriate method can be used to partition a multi-class problem without having to write a new program in each case. The AGF borders-training method has been paired with this algorithm, the combination of which we refer to as "multi-borders". In what follows, we describe the rationale behind the software, how it works, and test it on an example problem comprised of discretized continuous data.
2 Theory
Suppose we have several partitions as in Figure 1(a), each uniquely grouping all the classes into two sets. The following equations relate the conditional probabilities of the classes to those returned by the binary partitions:

  \sum_{i \in C_{j1}} P_i = p_{j1}, \qquad \sum_{i \in C_{j2}} P_i = p_{j2}, \qquad \sum_i P_i = 1

where P_i is the conditional probability of class i at test point x, and p_{j1} and p_{j2} are the conditional probabilities of the first and second classes, respectively, on either side of partition j. The classes contained in either side of the jth partition are given by C_{j1} and C_{j2}, respectively.
We call this non-hierarchical multi-borders classification. The popular "one-against-the-rest" approach, in which each class is singled out and classified against the remaining classes, is one example of non-hierarchical classification; it is overdetermined in every case, since it supplies n binary classifications plus the normalization constraint for only n unknowns. Note that the one-against-one approach is covered neither by this method nor by the hierarchical approach described below, since it requires excluding data from certain classes in the absence of any prior knowledge of the class of the test point.
In a hierarchical classification scheme (or decision tree), the classes are first partitioned, then each of those partitions is partitioned again, and so on, until a class number is returned instead of another partition. The scenario for the first example is illustrated in Figure 1(b). A big difference between this and the non-hierarchical approach is that all data from classes in the losing partition are excluded from subsequent classifications, whereas in the non-hierarchical approach all the data are used in all the binary classifications. As a corollary, hierarchical multi-borders classification returns only the conditional probability of the winning class, whereas the non-hierarchical method solves for all of them. The two types can of course be combined.
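The traversal just described can be sketched recursively. The node layout and the chain-rule combination of probabilities below are our own illustrative assumptions, not the libAGF data structures:

```python
# Sketch of hierarchical multi-borders classification.  Each node
# holds a binary classifier; the winning side is followed until a
# leaf (class label) is reached.

def classify_hierarchical(node, x):
    """Return (class, conditional probability of the winning class).

    A node is either an integer class label (a leaf) or a tuple
    (binary_classifier, left, right), where the classifier returns
    p = P(right-hand side | x)."""
    if isinstance(node, int):          # leaf: the class is decided
        return node, 1.0
    classifier, left, right = node
    p = classifier(x)                  # probability of the right side
    if p > 0.5:                        # losing side is excluded entirely
        cls, p_sub = classify_hierarchical(right, x)
        return cls, p * p_sub          # chain rule down the tree
    cls, p_sub = classify_hierarchical(left, x)
    return cls, (1.0 - p) * p_sub

# Toy usage: three classes split on a scalar feature by two crude
# hard classifiers.
tree = (lambda x: 1.0 if x > 0 else 0.0,
        0,
        (lambda x: 1.0 if x > 1 else 0.0, 1, 2))
print(classify_hierarchical(tree, 1.5))   # -> (2, 1.0)
```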
3 Control language
A recursive control language is used to describe any possible configuration in this hierarchical approach. In Backus-Naur form, the control language looks like this:

  branch         ::= model "{" branch-list "}" | CLASS
  model          ::= TWOCLASS | partition-list
  branch-list    ::= branch | branch-list branch
  partition-list ::= partition | partition-list partition
  partition      ::= TWOCLASS class-list " / " class-list ";"
  class-list     ::= CLASS | class-list " " CLASS
CLASS is a class value between 0 and n-1, where n is the number of classes. It is used in two senses: it may be one of the class values in a partition in a non-hierarchical model, in which case its value is relative, that is, local to the non-hierarchical model; or it may be the class value returned from a top-level partition in the hierarchy, in which case its value is absolute.
TWOCLASS is a binary classification model.
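As an illustration of the grammar, here is a sketch of a recursive-descent parser for it. The token pattern, dictionary output format, and helper names are our own, not part of libAGF:

```python
import re

# Tokens: double-quoted TWOCLASS strings, braces, "/", ";", integers.
TOKEN = re.compile(r'"[^"]*"|\{|\}|/|;|\d+')

def tokenize(text):
    return TOKEN.findall(text)

def parse_branch(toks, pos=0):
    """branch ::= model "{" branch-list "}" | CLASS"""
    if toks[pos].isdigit():                 # a bare CLASS label is a leaf
        return int(toks[pos]), pos + 1
    model, pos = parse_model(toks, pos)
    assert toks[pos] == "{"
    pos += 1
    branches = []
    while toks[pos] != "}":
        b, pos = parse_branch(toks, pos)
        branches.append(b)
    return {"model": model, "branches": branches}, pos + 1

def parse_model(toks, pos):
    """model ::= TWOCLASS | partition-list"""
    if toks[pos + 1] == "{":                # a lone two-class model
        return toks[pos].strip('"'), pos + 1
    partitions = []
    while toks[pos] != "{":                 # one or more partitions
        p, pos = parse_partition(toks, pos)
        partitions.append(p)
    return partitions, pos

def parse_partition(toks, pos):
    """partition ::= TWOCLASS class-list "/" class-list ";" """
    twoclass = toks[pos].strip('"')
    left, pos = parse_class_list(toks, pos + 1, stop={"/"})
    right, pos = parse_class_list(toks, pos + 1, stop={";"})
    return (twoclass, left, right), pos + 1

def parse_class_list(toks, pos, stop):
    classes = []
    while toks[pos] not in stop:
        classes.append(int(toks[pos]))
        pos += 1
    return classes, pos
```

Running `parse_branch(tokenize(...))` on either control file shown in Section 4 yields a nested structure mirroring the hierarchy.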
There are two versions of the control file: one for training and one for classification using the trained model. The command multi_borders reads a training control file and outputs a classification control file. For training, TWOCLASS contains a double-quoted set of parameters or options for training a two-class model. For classification, it is the name of a trained binary classification model.

The multi_borders command returns a series of statements for training each of the binary classifiers required for the overall model, in addition to the final control file, which contains the names of each. The classify_m command takes the output from multi_borders and uses it to perform classifications on a set of test data. If the model has only one level, all the conditional probabilities are returned; otherwise only the winning probability is returned.

The command-line programs use AGF with borders sampling (class_borders) as the binary classification model; however, the source-level C++ interface allows the user to specify any binary classification model desired.
4 Test scenarios
Table 1: Comparison of the two multi-borders models with an AGF model with no pre-training.

  Algorithm          train     class.    unc.   acc.   corr.   correlation
                     time (s)  time (s)                coeff.  cond. prob.
  AGF                N/A       235       0.43   0.56   0.92    1.
  Non-hierarchical   189       2.0       0.41   0.53   0.91    0.94
  Hierarchical       111       0.84      0.42   0.54   0.91    0.89
To test the algorithm we use some of the same satellite humidity data as described in Mills (2009). The specific humidity values are discretized into eight classes by dividing them at seven geometrically spaced values between 0.00007 and 0.001. Classes are labelled from 0 to 7, from the lowest to the highest humidity range. Two experiments were done. The first used non-hierarchical classification, placing a partition between each pair of adjacent classes, as shown in the following control file:
  "" 0 / 1 2 3 4 5 6 7;
  "" 0 1 / 2 3 4 5 6 7;
  "" 0 1 2 / 3 4 5 6 7;
  "" 0 1 2 3 / 4 5 6 7;
  "" 0 1 2 3 4 / 5 6 7;
  "" 0 1 2 3 4 5 / 6 7;
  "" 0 1 2 3 4 5 6 / 7;
  {0 1 2 3 4 5 6 7}
The blank options strings mean that options can be passed from the command line. The second experiment was hierarchical and partitioned the classes recursively in half:
  "-s 150 -W 40 -k 300" {
      "-s 100 -W 30 -k 250" {
          "-s 75 -W 25 -k 200" {0 1}
          "-s 75 -W 25 -k 200" {2 3}
      }
      "-s 100 -W 30 -k 250" {
          "-s 75 -W 25 -k 200" {4 5}
          "-s 75 -W 25 -k 200" {6 7}
      }
  }
The results from this experiment are shown in Table 1, where they are compared with an AGF model with no pre-training. While accuracy suffers somewhat using the multi-borders models, there is an enormous improvement in classification speed, while training times are less than the classification time of the untrained model.
For the non-hierarchical model, the conditional probabilities were solved using simple linear least squares. The accuracy of the estimates could likely be improved by applying constraints or regularization (Press et al., 1992).
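As a sketch of this step: each partition contributes one linear equation (the probabilities of the classes on one side must sum to the probability returned by that binary classifier), and normalization adds one more, so a one-against-the-rest model over n classes gives n + 1 equations in n unknowns. A minimal pure-Python illustration; the classifier outputs below are invented:

```python
def solve_least_squares(A, b):
    """Solve min ||Ax - b|| via the normal equations A^T A x = A^T b,
    using Gaussian elimination with partial pivoting."""
    m, n = len(A), len(A[0])
    M = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    v = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        v[i], v[piv] = v[piv], v[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n):
                M[r][c] -= f * M[i][c]
            v[r] -= f * v[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (v[i] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# One-against-the-rest over three classes: each row singles out one
# class; the last row is the normalization constraint sum(P) = 1.
A = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 1, 1]]
b = [0.6, 0.25, 0.25, 1.0]   # invented binary-classifier outputs
P = solve_least_squares(A, b)
print(P)
```

Because the invented outputs sum to slightly more than one, least squares shaves the excess evenly off each estimate.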
5 Conclusions
Software was described that allows one to specify, in a recursive and general way, a multi-class classification model comprised of one or more binary classifiers. The system was tested on discretized satellite humidity data using both a strictly hierarchical and a strictly non-hierarchical model, and compared with a direct kernel estimator without any pre-training. While the accuracy of both pre-trained models suffered somewhat compared to the classifier without pre-training, time performance was greatly improved. Software is available at: http://libagf.sourceforge.net.
References
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.

Mills, P. (2009). Isoline retrieval: An optimal method for validation of advected contours. Computers & Geosciences, 35(11):2020–2031.

Mills, P. (2011). Efficient statistical classification of satellite measurements. International Journal of Remote Sensing, 32(21):6109–6132.

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, 2nd edition.

Terrell, G. R. and Scott, D. W. (1992). Variable kernel density estimation. Annals of Statistics, 20:1236–1265.