1 Introduction
Figure 1: (a) the 'xor' point pattern; (b) classification result by AdaStump (a failure case); (c) classification result by AdaAda; (d) classification result by AdaOr.
Classification algorithms such as decision trees [16, 2], neural networks [1], and support vector machines (SVM) have been widely used in many areas. The classification procedure in many of these algorithms can be understood as performing reasoning with logic operators ('and', 'or', 'not'), under deterministic or probabilistic formulations. The recent development of the AdaBoost algorithm [12] has particularly advanced the performance of many applications in the field. We focus on the AdaBoost algorithm in this paper (together with its variations, also called boosting [5, 11]). Boosting algorithms have several advantages over traditional classification algorithms. Their asymptotic behavior when combining a large number of weak classifiers is less prone to overfitting. Once trained, a boosting algorithm performs a weighted sum over the selected weak classifiers. This linear summation weakly performs the 'and' and 'or' operations. In the discrete case, as long as the overall score is above the threshold, a pattern is considered positive. This may cover many combinations of the conditions: some weak classifiers may need to be satisfied together ('and'), while for others it suffices that a subset answers yes ('or').
In the literature, the decision stump has been widely used as the weak classifier due to its speed and small complexity. However, a decision stump does not have strong discrimination power. A comprehensive empirical study of a wide variety of classifiers, including SVM, boosting (using decision trees and decision stumps), neural networks, and nearest neighbors, was reported in [6]. Each decision stump corresponds to a thresholded feature f(x) (the direction > is interchangeable with <):

h(x) = +1 if f(x) > θ, and h(x) = −1 otherwise.   (1)
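As a concrete illustration, a stump of the form in Eqn. (1) can be trained by exhaustive search over features and thresholds. The following sketch is ours, not the paper's code; the function names and interface are assumptions:

```python
import numpy as np

def train_stump(X, y, w):
    """Search for the thresholded feature h(x) = s * sign(f(x) - theta)
    minimizing the weighted error. X: (n, d) features, y: labels in {-1, +1},
    w: sample weights. Returns (feature index, threshold, sign, error)."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)
    for j in range(d):                      # each feature f
        for theta in np.unique(X[:, j]):    # each candidate threshold
            for s in (+1, -1):              # '>' is interchangeable with '<'
                pred = s * np.where(X[:, j] > theta, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, theta, s, err)
    return best

def stump_predict(X, j, theta, s):
    """Apply the stump of Eqn. (1) to a feature matrix."""
    return s * np.where(X[:, j] > theta, 1, -1)
```

The exhaustive search is quadratic in practice but trivially parallelizable, which is one reason stumps remain popular in vision pipelines.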
We call the stump-based AdaBoost algorithm AdaStump for the remainder of this paper. Fig. (1.b) displays a failure example of AdaStump: it cannot deal with the 'xor' pattern, even with 100 stumps.
One solution to this problem is to adopt more powerful weak classifiers, such as decision trees, in the boosting algorithm. This was proposed by several authors [11, 18], and we call it AdaTree here for notational convenience (it is different from the AdaTree method of [13]). However, using decision trees greatly increases the time and computational complexity of the boosting algorithm. Many vision applications are trained on very large datasets in which each sample has thousands or even millions of features [22]. This limits the use of decision trees or CART, and AdaStump remains the most used in vision [22]. In this paper, we show that AdaStump intrinsically cannot deal with the 'xor' problem. We propose layered logic models for classification, namely AdaOr, AdaAnd, and AdaAndOr. The algorithm has several interesting properties: (1) it naturally incorporates the 'and', 'or', and 'not' relations; (2) it has much more discrimination power than AdaStump; (3) it has much smaller computational complexity than tree-based AdaBoost, with only slightly degraded classification performance.
A recent effort to combine 'and' and 'or' in AdaBoost was proposed in [8]. However, the 'and' and 'or' relations are not naturally embedded in that algorithm, and it requires a very complex optimization procedure in training. How that algorithm could be applied to general tasks in machine learning and computer vision is at best unclear.
We apply the proposed models, AdaOr, AdaAnd, and AdaAndOr, to several typical datasets from the UCI repository and to two challenging vision applications, object segmentation and pedestrian detection. Among the models, AdaAndOr performs the best in nearly all cases. We observe significant improvements over AdaStump on all the datasets. For pedestrian detection, the performance of AdaAndOr is very close to that of HOG [7] while using simple Haar features, though the main objective of this paper is not to develop a pedestrian detector.
2 AdaBoost algorithm
In this section, we briefly review the AdaBoost algorithm and explain why AdaStump fails on the ‘xor’ problem.
2.1 Algorithms and theory
Let {(x₁, y₁), …, (x_m, y_m)} be a set of training samples with y_i ∈ {−1, +1}, and let D(i) be a distribution over the samples. The AdaBoost algorithm [12], proposed by Freund and Schapire, learns a strong classifier H(x), based on the training set, by sequentially combining a number of weak classifiers. We briefly give the general AdaBoost algorithm [12] below:
Given: (x₁, y₁), …, (x_m, y_m); initialize D₁(i) = 1/m.
For t = 1, …, T:
  Train a weak classifier using distribution D_t.
  Get weak hypothesis h_t : X → {−1, +1}.
  Calculate the error of h_t: ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i].
  Compute α_t = (1/2) ln((1 − ε_t)/ε_t).
  Update: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, with Z_t a normalization factor.
Output the strong classifier: H(x) = sign(Σ_t α_t h_t(x)).
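The algorithm box above can be sketched in a few lines of Python. This is a minimal illustration of discrete AdaBoost; the `train_weak` callback stands in for any weighted weak-learner routine, and its interface is our assumption:

```python
import numpy as np

def adaboost(X, y, train_weak, T):
    """Discrete AdaBoost sketch. `train_weak(X, y, D)` must return a predict
    function h fitted under the weight distribution D."""
    n = len(y)
    D = np.full(n, 1.0 / n)          # initial distribution D_1(i) = 1/m
    ensemble = []                    # list of (alpha_t, h_t)
    for t in range(T):
        h = train_weak(X, y, D)      # weak hypothesis h_t
        pred = h(X)
        eps = D[pred != y].sum()     # weighted error epsilon_t
        if eps >= 0.5:               # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D *= np.exp(-alpha * y * pred)   # up-weight the mistakes
        D /= D.sum()                     # normalize (the Z_t factor)
        ensemble.append((alpha, h))
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))
```

Any weighted weak learner can be plugged in; a stump search over features and thresholds is the usual choice in this paper's setting.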
The AdaBoost algorithm minimizes the total error by sequentially selecting h_t and computing α_t in a greedy manner. At each step t, it minimizes

Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i))

by coordinate descent:
(1) Select the best weak classifier h_t from the candidate pool, i.e., the one that minimizes the weighted error ε_t.
(2) Compute α_t by setting ∂Z_t/∂α_t = 0, which yields α_t = (1/2) ln((1 − ε_t)/ε_t).
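The closed form in step (2) can be checked numerically: for a fixed weighted error ε, the per-round bound Z(α) = (1 − ε)e^{−α} + εe^{α} is minimized at α = ½ ln((1 − ε)/ε). A small sketch (ε = 0.2 is an arbitrary example value):

```python
import numpy as np

# Numerical check: the per-round bound
#   Z(alpha) = (1 - eps) * exp(-alpha) + eps * exp(alpha)
# is minimized at alpha = (1/2) * ln((1 - eps) / eps).
eps = 0.2
alphas = np.linspace(0.01, 3.0, 3000)
Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)
numeric = alphas[np.argmin(Z)]                 # grid minimizer
closed_form = 0.5 * np.log((1 - eps) / eps)    # analytic minimizer
assert abs(numeric - closed_form) < 1e-2
```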
A very important property of AdaBoost is that after a certain number of rounds, the test error still goes down even when the training error is no longer improving [4]. This makes AdaBoost less prone to overfitting than many other classifiers. Schapire et al. [19] explained this behavior of AdaBoost from the margin theory perspective. For any data (x, y), the margin

margin(x, y) = y Σ_t α_t h_t(x) / Σ_t α_t

essentially gives the confidence of assigning the estimate H(x) to y. For any given θ > 0, the overall test error is bounded by

Pr[y H(x) ≤ 0] ≤ Pr[margin(x, y) ≤ θ] + Õ(√(d / (m θ²))),   (2)

where d is the VC dimension of the weak classifier and m is the number of training samples. Eqn. (2) shows three directions to reduce the test error: (1) increase the margin (related to the training error but not exactly the same); (2) reduce the complexity of the weak classifier; (3) increase the size of the training data.
Moreover, it is shown in [11] that AdaBoost and its variations asymptotically approach the posterior distribution (there is still some debate about this probabilistic formulation of the AdaBoost algorithm):

q(y | x) = 1 / (1 + exp(−2 y H(x))), where H(x) = Σ_t α_t h_t(x).   (3)

The margin is directly tied to this discriminative probability.
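Eqn. (3) implies a simple way to read a boosted score as a probability. A minimal sketch (assuming `score` denotes the unnormalized sum Σ_t α_t h_t(x); the function name is ours):

```python
import numpy as np

def boost_posterior(score):
    """Posterior implied by Eqn. (3): q(y=+1 | x) = 1 / (1 + exp(-2 H(x))),
    where `score` is the unnormalized boosted score H(x)."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(score, dtype=float)))
```

A score of 0 maps to probability 0.5, and the map is symmetric: q(+1 | x) + q(−1 | x) = 1.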
2.2 The xor problem
It is well known that the points shown in Fig. (1.a) as 'xor' are not linearly separable. The red and blue points are the positive and negative samples respectively. Each weak classifier makes a decision on whether a point lies above or below a line passing through the origin. Using this type of weak classifier, the AdaBoost algorithm is not able to separate the red points from the blue ones. This is easy to verify. For any positive sample x = (x₁, x₂), −x = (−x₁, −x₂) is also a positive sample. However, every weak classifier of this type satisfies h(−x) = −h(x), and therefore H(−x) = −H(x): the points x and −x can never both receive a positive score.
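The antisymmetry argument can be verified numerically for any weighted vote of lines through the origin (the line directions and weights below are arbitrary random choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 2))    # 100 random lines through the origin
alpha = rng.uniform(size=100)    # arbitrary nonnegative voting weights

def H(x):
    """Weighted vote of weak classifiers h_w(x) = sign(w . x)."""
    return float(alpha @ np.sign(W @ np.asarray(x)))

# Antisymmetry: every h_w flips sign under x -> -x, hence so does H,
# so x and -x can never both score positive.
x = np.array([1.3, 0.7])
assert abs(H(-x) + H(x)) < 1e-9
```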
Figure 3: (a) initial weights; (b) reweighted samples after the first round of AdaBoost.
Let A denote the condition x₁ > 0 and B the condition x₂ > 0. We denote ∧, ∨, and ¬ as the 'and', 'or', and 'not' operations respectively. Thus, the positive samples in Fig. (1.a) can be denoted by (A ∧ B) ∨ (¬A ∧ ¬B).
One of the key properties of AdaBoost is that it reweights the training samples after each round, giving higher weights to those not correctly classified by the previous weak classifiers. We take a close look at this reweighting scheme for the points in Fig. (1.a). Initially, all the samples receive equal weights, shown in Fig. (3.a). For any weak classifier (a line passing through the origin), the error is 1/2, which means that they are all equally bad. In a computer simulation the value is usually slightly smaller than 1/2, since the training points are discretized samples. Once a weak classifier is selected, e.g., h₁(x) = +1 iff x₂ > 0, then the positive samples in the first quadrant and the negative samples in the fourth quadrant are correctly classified, and they will receive lower weights. Fig. (3.b) shows the weights of the samples after this first step of AdaBoost. Clearly, the weak classifier minimizing the error in the next round would be h₂(x) = +1 iff x₂ < 0, a decision contradictory to the previous weak classifier h₁. The reweighted points after this round essentially lead the situation back to Fig. (3.a). The combination of the two weak classifiers is {x : x₂ > 0} ∩ {x : x₂ < 0} = ∅, where ∅ denotes the empty set. The algorithm then keeps repeating the same procedure, which is a deadlock. For the same reason, AdaBoost is sensitive to outliers, since it keeps giving high weights to misclassified samples.
2.3 Possible solutions
The previous section shows that AdaStump cannot solve the 'xor' problem (with line features passing through the origin). The AdaBoost algorithm makes an overall decision based on a weighted sum Σ_t α_t h_t(x). It weakly performs the 'and' and 'or' operations on the weak classifiers. The 'not' operation is often embedded in the stump classifier by switching the +1 and −1 outputs (flipping the direction of the threshold). We assume that all types of weak classifiers have this 'not' aspect, and we focus on the 'and' and 'or' operations for the rest of this paper.
There are several possible ways to improve the algorithm:

Designing hyper features to make the patterns linearly separable. For example, in the 'xor' case, the feature could be f(x) = x₁ · x₂. However, (1) it is often very hard to find meaningful features which will nicely separate the positive and negative samples; (2) complex features often lead to overfitting.
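A quick check that the product feature makes the 'xor' pattern separable with a single stump (the synthetic points and labeling convention below are ours, following the quadrant pattern of Fig. (1.a)):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # 'xor': quadrants I and III positive

f = X[:, 0] * X[:, 1]                        # hyper feature f(x) = x1 * x2
pred = np.where(f > 0, 1, -1)                # one stump on f separates the pattern
print((pred == y).mean())                    # prints 1.0
```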

Introducing explicit 'and' and 'or' relations into AdaBoost.
We can put 'and's on top of 'or's, or vice versa, or completely mix the two together. The probabilistic boosting tree (PBT) algorithm [20] is one way of recursively combining 'and's with 'or's. The disadvantages of PBT, however, are: (1) it requires longer training time than a cascade, and (2) it produces a complex classifier and may lead to overfitting (like the decision tree). Another solution is to build weak classifiers with embedded 'and' and 'or' operations. Using decision trees [16] as weak classifiers has been described in several papers [11, 18]. However, each tree is a complex classifier, and it requires much longer training time than the stump classifier. It also has higher algorithmic complexity than the decision stump.
3 Layered logic classifiers
Eqn. (3) shows that the AdaBoost algorithm essentially approaches a logistic probability

q(y | x) ∝ ∏_t exp(y α_t h_t(x)).

The overall discriminative probability is a product of the contribution of each h_t. Depending upon its weight α_t, each h_t makes a direct impact on q(y | x). Using decision trees requires much longer training time than stump classifiers. This is particularly a problem in vision, as we often face millions of image samples, each with thousands of features.
Instead of using one-layer AdaBoost, we can think of using two-layer AdaBoost with weak classifiers that are stronger than a decision stump but simpler than a decision tree. One idea might be to use AdaStump as the weak classifier of an outer AdaBoost, which we call AdaAdaStump, or AdaAda for short. However, AdaAda still essentially performs a linear summation and has difficulty with 'xor' as well. Fig. (1.c) shows the positives classified by AdaAda with 50 weak classifiers of AdaStump, each of which has 5 stump weak classifiers. It is a failure example. It is worth mentioning that one can indeed make AdaAda work on these points by using rather tricky strategies of randomly selecting subsets of points in training. However, this greatly increases the training complexity, and the procedure is not general.
Our solution is to propose the AndBoost and OrBoost algorithms, in which the 'and' and 'or' operations are explicitly engaged. We give detailed descriptions below.
3.1 OrBoost
For a combined classifier, we can use the 'or' operation directly by

H(x) = h₁(x) ∨ h₂(x) ∨ … ∨ h_n(x),   (4)

where

a ∨ b = +1 if a = +1 or b = +1, and −1 otherwise.   (5)
Figure 4 (OrBoost): Given (x₁, y₁), …, (x_m, y_m) with weights D(i).
For t = 1, …, T:
  Train a weak classifier h_t using the weights D(i) to minimize the error of the combined classifier, ε_t = Σ_i D(i)·1[(h₁ ∨ … ∨ h_t)(x_i) ≠ y_i].
  The algorithm also stops if the error is not decreasing.
Output the overall classifier: H(x) = h₁(x) ∨ … ∨ h_T(x).
Fig. (4) gives the detailed procedure of the OrBoost algorithm, which is straightforward to implement. The overall classifier is a set of 'or' operations on weak classifiers, e.g. decision stumps, and it favors positive answers. If any weak classifier gives a positive answer, then the final decision is positive, regardless of what the other weak classifiers say. Unlike the AdaBoost algorithm, where misclassified samples are given higher weights in the next round, OrBoost quickly gives up on some samples and focuses on those which can still be classified correctly. This helps to resolve the deadlock situation in AdaBoost shown in Fig. (3).
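A greedy OrBoost can be sketched as follows. This is our reading of the procedure in Fig. (4); the pool-based weak-learner interface is an assumption made for illustration:

```python
import numpy as np

def or_combine(preds):
    """'or' over {-1,+1} votes: positive if any weak classifier says positive."""
    return np.max(preds, axis=0)

def orboost(X, y, T, weak_pool):
    """Greedy OrBoost sketch: at each round, pick the classifier from
    `weak_pool` (a list of predict functions) that minimizes the error of
    the combined 'or' classifier; stop when the error stops improving."""
    chosen, preds = [], []
    best_err = np.inf
    for t in range(T):
        round_best = None
        for h in weak_pool:
            trial = preds + [h(X)]
            err = np.mean(or_combine(np.array(trial)) != y)
            if round_best is None or err < round_best[0]:
                round_best = (err, h)
        err, h = round_best
        if err >= best_err:          # error not decreasing: stop
            break
        best_err = err
        chosen.append(h)
        preds.append(h(X))
    return lambda Xq: or_combine(np.array([h(Xq) for h in chosen]))
```

On the four 'xor' quadrant points with the pool of origin-crossing stumps, this reaches the 1/4 training error described in the next subsection.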
Figure 5: (a) initial weights; (b) sample weights after the first OrBoost step.
Fig. (5) shows the feature selection and reweighting steps of the OrBoost algorithm on the 'xor' problem. The first weak classifier is selected the same as before (h₁(x) = +1 iff x₂ > 0). In AdaBoost, the positives in the first quadrant and the negatives in the fourth quadrant would then receive low weights since they have been classified correctly, which creates the deadlock. In OrBoost, the situation is different. Note that although the weights of all the samples are fixed, the error evaluation function affects how each sample plays a role, similar in spirit to the reweighting scheme in AdaBoost. For example, the positives in the first quadrant and the negatives in the second quadrant have been classified as positive by the first weak classifier h₁. The errors on them are therefore decided already, regardless of what the later weak classifiers will be. Therefore, the second weak classifier would be h₂(x) = +1 iff x₁ < 0, which captures the positives in the third quadrant. The total error of the two combined weak classifiers is 1/4.

3.2 AndBoost
If we swap the labels of the positives and negatives in training, the 'or' operations in OrBoost can be directly turned into 'and' operations, since by De Morgan's law ¬(a ∨ b) = ¬a ∧ ¬b. However, for a given set of training samples, the 'and' operations may provide decisions complementary to the 'or' operations. Similarly to eqn. (4), we can use the 'and' operation directly by

H(x) = h₁(x) ∧ h₂(x) ∧ … ∧ h_n(x),   (6)

where

a ∧ b = +1 if a = +1 and b = +1, and −1 otherwise.   (7)
Therefore, we can design an AndBoost algorithm, shown in Fig. (6), which is very similar to the OrBoost algorithm in Fig. (4).
Figure 6 (AndBoost): Given (x₁, y₁), …, (x_m, y_m) with weights D(i).
For t = 1, …, T:
  Train a weak classifier h_t using the weights D(i) to minimize the error of the combined classifier, ε_t = Σ_i D(i)·1[(h₁ ∧ … ∧ h_t)(x_i) ≠ y_i].
  The algorithm also stops if the error is not decreasing.
Output the overall classifier: H(x) = h₁(x) ∧ … ∧ h_T(x).
The performance of AndBoost on the 'xor' problem is the same as that of the OrBoost algorithm.
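AndBoost can be sketched as the exact mirror of OrBoost, greedily growing a conjunction instead of a disjunction (again our reading of Fig. (6), with the same assumed pool-based interface):

```python
import numpy as np

def and_combine(preds):
    """'and' over {-1,+1} votes: positive only if every classifier agrees."""
    return np.min(preds, axis=0)

def andboost(X, y, T, weak_pool):
    """Greedy AndBoost sketch: at each round, pick the classifier from
    `weak_pool` that minimizes the error of the combined 'and' classifier;
    stop when the error stops improving."""
    chosen, preds = [], []
    best_err = np.inf
    for t in range(T):
        round_best = None
        for h in weak_pool:
            trial = preds + [h(X)]
            err = np.mean(and_combine(np.array(trial)) != y)
            if round_best is None or err < round_best[0]:
                round_best = (err, h)
        err, h = round_best
        if err >= best_err:          # error not decreasing: stop
            break
        best_err = err
        chosen.append(h)
        preds.append(h(X))
    return lambda Xq: and_combine(np.array([h(Xq) for h in chosen]))
```

On a conjunctive pattern (e.g., positives exactly in the first quadrant), two rounds suffice to reach zero training error.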
3.3 AdaOrBoost
After introducing the OrBoost and AndBoost algorithms, we are ready to discuss the proposed layered models. We simply use a two-layer AdaBoost algorithm with the weak classifiers in the second layer being OrBoost, AndBoost, or both. We call these models AdaOr, AdaAnd, and AdaAndOr respectively.
There are now two levels of weak classifiers. For AdaOr, OrBoost is its weak classifier. For OrBoost, any type of classifier can serve as its weak classifier; to keep the complexity of OrBoost and AndBoost in check, we simply use the decision stump. As mentioned before, the 'not' operation is naturally embedded in the decision stump. Therefore, AdaAndOr has all the aspects of the logic operations 'and', 'or', and 'not'. To avoid confusion, we call the weak classifiers inside OrBoost and AndBoost 'operations'.
Fig. (1.b) shows the points classified by AdaStump using 100 stump weak classifiers. This failure example verifies our earlier claim about the 'xor' pattern. Fig. (1.d) shows the result of AdaOr using 10 OrBoost weak classifiers, each of which contains 2 'or' operations. As we can see, the positive samples have been classified correctly. Fig. (1.c) gives the result of AdaAda.
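A two-layer model of this kind can be sketched as an outer AdaBoost whose weak classifiers are small 'and'/'or' combinations of a stump pool. This is a simplified reading (2-term combinations only, for brevity), not the authors' exact procedure:

```python
import numpy as np

def weighted_err(pred, y, D):
    return D[pred != y].sum()

def ada_andor(X, y, pool, T):
    """Two-layer AdaAndOr sketch: the outer loop is standard AdaBoost; each
    weak classifier is the best 2-term 'or' or 'and' of predictors from
    `pool` (a list of predict functions) under the current weights."""
    n = len(y)
    D = np.full(n, 1.0 / n)
    P = [h(X) for h in pool]             # cached pool predictions
    ensemble = []
    for t in range(T):
        best = None
        for i in range(len(pool)):
            for j in range(len(pool)):
                for comb in (np.maximum, np.minimum):   # 'or' and 'and'
                    e = weighted_err(comb(P[i], P[j]), y, D)
                    if best is None or e < best[0]:
                        best = (e, i, j, comb)
        e, i, j, comb = best
        if e >= 0.5:                     # no combination beats chance: stop
            break
        alpha = 0.5 * np.log((1 - e) / max(e, 1e-12))
        pred = comb(P[i], P[j])
        D *= np.exp(-alpha * y * pred)   # standard AdaBoost reweighting
        D /= D.sum()
        ensemble.append((alpha, pool[i], pool[j], comb))
    def predict(Xq):
        s = sum(a * comb(h1(Xq), h2(Xq)) for a, h1, h2, comb in ensemble)
        return np.sign(s)
    return predict
```

On the four 'xor' quadrant points with the pool of origin-crossing stumps, the outer AdaBoost combines conjunctions and disjunctions of quadrants and reaches zero training error within a few rounds.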
The margin theory for AdaOr, AdaAnd, and AdaAndOr still follows eqn. (2), as pointed out by Schapire et al. [19]. The complexity of the weak classifier is decided by the OrBoost and AndBoost algorithms, which are just sequences of 'or' operations or 'and' operations. It is slightly more complex than a decision stump, but much simpler than a decision tree or CART. It is worth mentioning that both OrBoost and AndBoost include the special case where only one operation is present; this happens when the training error does not improve when adding a second operation. Therefore, the stump classifier is also included in OrBoost and AndBoost, if the stump is the choice of operation.
3.4 Experiments
There are several major issues we are concerned with in the choice of classifiers for applications in machine learning and computer vision.

Classification power: This is often referred to as the training error or margin in eqn. (2). A desirable classifier should produce low error and large margin on the training data.

Low complexity: This is often measured by the VC dimension [21]; a classifier with small VC dimension often has good generalization power, i.e., a small difference between the training error and the test error.

Size of training data: In the VC dimension and margin theory, the overall test error is also greatly decided by the availability of training data. The more training data we have (and the classifier can handle), the smaller the difference between training and test errors. In reality, we often do not have enough training data, since collecting it is not an easy task. Also, some nonparametric classifiers can only deal with a limited amount of training data, since they work in a kernel space, which explodes on large datasets.

Efficient training time: For many applications in computer vision and data mining, the training set can be immense and each data sample may have a large number of features. This demands a classifier that is also efficient in training. Fast training is even more important for online learning algorithms [15], which have recently received much attention in tracking.

Efficient test time: The performance of a classifier is ultimately judged at the test stage, where a classifier is expected to give an answer quickly. For many modern classifiers, this is not a particular problem.
The first three criteria collectively decide the test error of a classifier. Another major factor affecting the performance of a classifier is feature design. If the intrinsic features can be found, different types of classifiers will probably achieve similar performance. However, the discussion of feature design is outside the scope of this paper. Next, we focus on the performance of AdaOrBoost in comparison with other classifiers.
3.5 Results on UCI repository datasets
One of the reasons the AdaBoost algorithm is widely used is its nice generalization power. Schapire et al. gave an explanation based on the margin theory after Breiman [4] observed an interesting behavior of AdaBoost: the test error asymptotically keeps going down even when the training error is no longer decreasing. The margin theory explains this as the margin increasing as more weak classifiers are combined. Breiman [5] then designed an algorithm called 'arc-gv' which tries to directly maximize the minimum margin when computing the α_t in AdaBoost. The experimental results were, however, contradictory to the theory, since arc-gv produces bigger test errors than AdaBoost. Reyzin and Schapire [18] tried to explain this finding and showed that the bigger test error of arc-gv was in fact due to the use of a complex weak classifier, CART. Next we compare AdaOr, AdaAnd, and AdaAndOr with arc-gv and AdaBoost using CART and decision stumps.
We use the same datasets as Reyzin and Schapire [18], all from the UCI repository: breast cancer, ionosphere, ocr49, and splice. The datasets have been slightly modified in the same way as in [18]: the two splice categories were merged into one in the splice dataset to create two-class data, and only digits 4 and 9 from the NIST database were used in the ocr49 dataset. The cancer, ion, ocr49, and splice datasets have 699, 351, 6000, and 3175 data points respectively. The number of features per sample depends on the dataset. The data samples are randomly split into training and test sets for 10 trials; Table 1 shows the corresponding numbers.
Table 1:
            cancer   ion   ocr49   splice
training      630    315    1000     1000
test           69     36    5000     2175
To illustrate the effectiveness of the layered models, we first compare their results to those of AdaStump. Though there are alternatives such as RealBoost and GentleBoost [11], the decision stump remains widely adopted in AdaBoost implementations. Fig. (8.a) shows the training and test errors on the splice dataset for AdaStump, AdaOr, AdaAnd, and AdaAndOr using different numbers of weak classifiers. In the implementation of OrBoost and AndBoost, we use 5 operations. Each curve is averaged over 10 trials of randomly selecting 1000 samples for training and 2175 samples for testing. AdaAndOr gives the best performance of all. We also observe that the differences between training and test errors for AdaStump and the other models are very similar. The results on real-world vision applications show similar behavior of AdaOr, AdaAnd, and AdaAndOr. This suggests that the OrBoost and AndBoost algorithms have generalization power similar to that of the decision stump.
Figure 8: (a) training and test errors on the splice dataset for different numbers of weak classifiers; (b) training and test errors for varying numbers of operations.
To show how the number of operations affects performance, we conduct another experiment on the splice dataset. We plot the training and test errors of 50 weak classifiers with varying numbers of operations. As shown in Fig. (8.b), the overall performance of the models, both in training and testing, does not improve much beyond 3 operations. Similar observations apply to the other datasets. This suggests that a significant improvement can be achieved without introducing much overhead.
Table 2: error ratios on the breast cancer, ionosphere, ocr49, and splice datasets.
It has been suggested [11, 5, 18] that the best performance of boosting algorithms is achieved by AdaBoost using decision trees [16] or CART [2]. Some of the confusion about the generalization (test) error in the margin theory has recently been clarified by Reyzin and Schapire [18]. In Table (2), we compare our algorithms with AdaBoost and arc-gv using decision trees. For a fair comparison, we show the improvement of AdaOrBoost, arc-gv using CART, and AdaBoost using CART over those using decision stumps; Table (2) shows the error ratios. As we can see, the improvement of AdaOrBoost is comparable to that of arc-gv using CART, but worse than AdaCART. However, each CART, after tree pruning, has around 16 leaf nodes with a tree depth of around 7. Therefore, the complexity of CART is much bigger than that of OrBoost and AndBoost. This is particularly an issue for applications in vision, where the training data is massive and each data sample has thousands or even millions of features. The good performance of AdaCART is achieved with trees of depth around 7. This greatly limits its usage in many vision applications and leaves decision stump classifiers still the most widely used [22].
Figure 9: (a) precision-recall curves on the Weizmann horse dataset; (b) pedestrian detection results.
To illustrate the effectiveness of the proposed algorithms, we further demonstrate them on two challenging vision problems: object segmentation and pedestrian detection.
First, we demonstrate the approach on the Weizmann horse dataset [3]. We use 328 images, 126 for training and 214 for testing. Each input image comes with a label map in which the pixels on the horse body and the background are labeled +1 and −1 respectively. Given a test image, the task is to classify every pixel as horse or background. In training, we take the image patch centered on each pixel as a training sample; the background and horse-body patches are the negatives and positives respectively. For each image patch, we compute a large set of features such as the mean, variance, and Haar responses of the original and Gabor-filtered images. We implement a cascade approach [22] in several versions: one uses AdaStump and the others use AdaOr and AdaAndOr. Each cascade node selects and fuses 100 weak classifiers. All versions use an identical set of features and the same bootstrapping procedure. Fig. (9.a) shows the precision-recall curves of the algorithms on the training and test images. We observe results similar to those on the UCI repository datasets: AdaAndOr improves over AdaStump by a considerable amount, and the differences between the training and test errors are nearly the same in this cascade setting as well. The F-value of the AdaAndOr result is around 0.8, better than the 0.66 reported in [17], which uses low- and middle-level information.

Next, we apply the AdaAndOr algorithm to pedestrian detection on the dataset of [7]. We use 8 levels of cascade with different choices of weak classifiers for AdaBoost. Fig. (9.b) shows the results of AdaStump, AdaAda, AdaOr, AdaAnd, and AdaAndOr. The conclusion is nearly the same as before: AdaAndOr achieves the best result, with AdaAnd in second place. Though we are not specifically addressing the pedestrian detection problem here, the result is nevertheless close to that of the well-known HOG pedestrian detector [7], while we only use a set of generic Haar features without tuning the system for the pedestrian detection task.
3.6 Conclusions
Many classification problems in machine learning and computer vision can be understood as performing logic operations combining 'and', 'or', and 'not'. In this paper, we have introduced layered logic classifiers. We show that AdaBoost cannot solve the 'xor' problem using decision-stump weak classifiers. We propose the OrBoost and AndBoost algorithms to study the 'or' and 'and' operations respectively. We demonstrate that the combined two-layer algorithm, AdaAndOr, greatly outperforms AdaStump, which is widely used in the literature; the improvement is significant in most cases. We demonstrate the effectiveness of AdaAndOr on traditional machine learning datasets as well as on challenging vision applications. Though tree-based AdaBoost is shown to produce smaller test errors, its training complexity often limits its usage. The OrBoost and AndBoost algorithms increase the time complexity only slightly over the decision stump, yet significantly reduce the test error. The AdaAndOr algorithm is useful for a wide variety of applications in machine learning and computer vision.
Acknowledgment This work is supported by NSF IIS1216528 (IIS1360566) and NSF CAREER award IIS0844566 (IIS1360568).
References

[1] C. M. Bishop, "Neural Networks for Pattern Recognition", Oxford University Press, 1995.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", Chapman and Hall (Wadsworth, Inc.): New York, 1984.
[3] E. Borenstein, E. Sharon, and S. Ullman, "Combining top-down and bottom-up segmentation", Proc. IEEE Workshop on Perceptual Organization in Computer Vision, June 2004.
[4] L. Breiman, "Arcing classifiers", The Annals of Statistics, 26, pp. 801-849, 1998.
[5] L. Breiman, "Prediction games and arcing classifiers", Neural Computation, 11, pp. 1493-1517, 1999.
[6] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms", Proc. of ICML, 2006.
[7] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", Proc. of CVPR, 2005.
[8] M. Dundar and J. Bi, "Joint optimization of cascaded classifiers for computer aided detection", Proc. of CVPR, 2007.
[9] P. Dollár, Z. Tu, and S. Belongie, "Supervised learning of edges and object boundaries", Proc. of CVPR, 2006.
[10] R. O. Duda and P. E. Hart, "Pattern Classification", Wiley Interscience, 2000.
[11] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting", Dept. of Statistics, Stanford University, Technical Report, 1998.
[12] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", J. of Computer and System Sciences, 55(1), 1997.
[13] E. Grossmann, "AdaTree: boosting a weak classifier into a decision tree", Proc. CVPR Workshop on Learning in Computer Vision and Pattern Recognition, 2004.
[14] D. Martin, C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color and texture cues", IEEE Trans. PAMI, 26(5), pp. 530-549, May 2004.
[15] N. Oza and S. Russell, "Online bagging and boosting", Proc. of the 8th International Workshop on Artificial Intelligence and Statistics, 2001.
[16] J. R. Quinlan, "Improved use of continuous attributes in C4.5", J. of Artificial Intelligence Research, 4, pp. 77-90, 1996.
[17] X. Ren, C. Fowlkes, and J. Malik, "Cue integration in figure/ground labeling", Proc. of NIPS, 2005.
[18] L. Reyzin and R. E. Schapire, "How boosting the margin can also boost classifier complexity", Proc. of the 23rd International Conference on Machine Learning, 2006.
[19] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods", The Annals of Statistics, 26, pp. 1651-1686, 1998.
[20] Z. Tu, "Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering", Proc. of ICCV, 2005.
[21] V. Vapnik, "Statistical Learning Theory", Wiley-Interscience, 1998.
[22] P. Viola and M. Jones, "Robust real-time face detection", Int'l J. of Computer Vision, 57(2), pp. 137-154, 2004.