I Introduction
Error-correcting output codes (ECOC) is an ensemble classification technique in machine learning motivated by coding theory, where transmitted or stored information is encoded as binary strings (codewords) whose high pairwise Hamming distance allows bit errors to be uniquely decoded [7]. There are many variants and extensions of the ECOC technique, such as the use of ternary [8] and N-ary codes [24], optimizing individual classifier performance concurrently by exploiting the relationships between classifiers [15], and learning the base classifiers jointly as a multitask learning problem [23]. Some theoretical error bounds for ECOC can be found in [7] and [24]. Moreover, Passerini et al. [17] provided a leave-one-out error bound for ECOC classifiers that use kernel machines as base classifiers. More recently, the ECOC technique has been extended to handle the zero-shot learning problem [19], the lifelong learning problem [11], and adversarial examples in neural networks by integrating ECOC with increased ensemble diversity [22].
In a conventional ECOC classifier, each class of a given dataset is assigned a codeword, and a model is trained through an ensemble of binary classifiers constructed from the columns of the corresponding ECOC matrix, whose rows consist of the class codewords [7]. Each column defines a bipartition of the dataset by merging classes with the same bit value. Decoding (classification) is performed by matching the predicted codeword with the class codeword nearest in Hamming distance. In essence, ECOC is a generalization of the one-vs-one and one-vs-all classification techniques, and as an ensemble technique, it is most effective when the binary classifiers make independent mistakes on a random sample.
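The encode/train/decode pipeline just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, and scikit-learn's LogisticRegression is used as a stand-in base learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, code_matrix):
    """Train one binary learner per column of the ECOC matrix.
    Row r of code_matrix is the 0/1 codeword assigned to class r."""
    learners = []
    for col in code_matrix.T:
        # Each column bipartitions the classes; relabel every sample
        # by the bit its class receives in this column.
        y_bit = col[y]
        learners.append(LogisticRegression(max_iter=1000).fit(X, y_bit))
    return learners

def predict_ecoc(X, learners, code_matrix):
    """Decode by matching the predicted bit string to the nearest
    class codeword in Hamming distance."""
    bits = np.column_stack([clf.predict(X) for clf in learners])
    dists = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
```

With the identity matrix (plus its complement) as the code, this reduces to one-vs-all classification, illustrating the generalization mentioned above.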
In this paper we derive new bounds on ECOC classification error rates that improve on the bound obtained in [10], first by applying the Feller and Chernoff bounds, well known in statistics, to the case where all binary classifiers are mutually independent, and then by applying a more recent bound due to [12] to the case where they are correlated. These new bounds theoretically establish the effectiveness of the ECOC approach in machine learning; in particular, we show under certain assumptions that the ECOC classification error decays exponentially to zero with respect to codeword length. We also present experimental results that demonstrate the validity of these bounds by applying them to various datasets and showing the effect of correlation on classification accuracy.
Let denote the aforementioned ensemble of binary classifiers (or learners) for a dataset with classes. Let denote the error rate of . Since is a binary classifier that only outputs 0 or 1, with 1 indicating an error, we shall also call the bit error rate, since the outputs represent a binary string. The following result, due to [10], gives a crude bound on the accuracy of :
Theorem 1 (GS Bound, [10]).
Let denote the average bit error rate. Then the ECOC classification error rate of is bounded by four times the average bit error rate, i.e.,
(1) 
We note that the GS bound makes no assumptions regarding whether or not the classifiers are independent or how much correlation exists between them. However, the GS bound is far from being sharp: assuming that , then . Thus, the GS bound fails to answer whether it is theoretically possible for , which would validate its effectiveness as an ensemble technique. Moreover, the GS bound gives no explicit dependence of on .
To the best of our knowledge, and prior to this work, no error bound exists that rigorously demonstrates that is theoretically possible in the ECOC setting. Progress so far has been limited to extending the GS bound to loss-based decoding schemes [1] and special distance measures [13]. In addition, theorems have been proven that bound the excess error rate of the ECOC classifier in terms of the excess error rates of the constituent binary classifiers [14, 3]. Here, “excess error rate” refers to the difference between the error rate and the Bayes optimal error rate.
Our main result establishes new bounds on by calling on results from statistical theory.
Theorem 2 (Main Result).
Let be the ECOC matrix corresponding to with row dimension and minimum row Hamming distance . Set and with .

Chernoff Bound: If all binary classifiers are mutually independent, then
(2) where .

KZ Bound: If with , and all binary classifiers are mutually correlated up to second-order only, with the correlation specified by a uniform nonnegative correlation coefficient that satisfies the Bahadur bound (36), then
(3) where is defined in part 1 and .
Assuming is fixed, these bounds imply that decays exponentially to zero with respect to (codeword length).
II Independent Base Classifiers
In this section we assume that all classifiers are mutually independent, but not necessarily identically distributed. This allows us to use the Poisson binomial distribution to describe the probability of error for our ensemble of classifiers and to show that the corresponding ECOC error is bounded by the classical binomial distribution based on the maximum error rate over all the classifiers.
Although the assumption of independence rarely holds in practice for real-world datasets, it is still useful as a starting point for our theoretical analysis and for establishing baseline results. An important application where this assumption is considered is multi-view learning within the context of co-training [4], where, say, two classifiers are trained separately on data representing two different views (or sets of attributes). In this setting one of the assumptions requires the classifiers to be conditionally independent given the class label. This assumption can be relaxed [6, 2]. We aim to do the same in Section III, where we take into account correlation between classifiers.
Denote by the collection of all element subsets of . Given a subset of , we define the outcome to be such that if and if , where denotes the complement of in .
Definition 3.
Let be a set of error rates of , respectively. We define to be the probability of the event where exactly out of the classifiers suffered bit errors, i.e., those outcomes where . Then is given by (Poisson binomial distribution)
(4) 
If the classifiers are identically distributed so that for all , then we define this probability by (binomial distribution)
(5) 
Recall that the minimum Hamming distance between any two rows or any two columns of an dimensional Hadamard matrix is (see [10]). In that case, when at least of the classifiers (corresponding to the columns of ) each makes an error, i.e., misclassifies a sample, then ECOC misclassification may occur. This is because the rows of such a matrix describe an error-correcting code that only guarantees correct decoding up to (but strictly fewer than) bit errors. Therefore, in order to bound , we shall assume under a worst-case scenario that misclassification always occurs when , where is the number of classifiers that suffered bit errors.
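The minimum-distance property of Hadamard codes can be checked directly. The sketch below uses the Sylvester construction (one standard way to build Hadamard matrices of power-of-two dimension; the paper does not specify which construction it uses) and maps entries ±1 to bits:

```python
import numpy as np
from itertools import combinations

def sylvester_hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def min_hamming_distance(code):
    """Minimum pairwise Hamming distance between rows of a 0/1 matrix."""
    return min(int((a != b).sum()) for a, b in combinations(code, 2))
```

For a 16-dimensional Hadamard matrix, both the rows and the columns of the resulting 0/1 code have minimum pairwise distance 16/2 = 8, as stated above.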
The following theorem shows that can be bounded by the maximum error rate of all the classifiers, assuming all are no larger than .
Theorem 4.
Let with . Let be a set of error rates with for all . Set . Then
(6) 
The proof of this theorem requires the following lemmas, whose proofs are given in the appendix of this paper [16]. Before stating them, we first introduce notation: given , we define and .
Lemma 5.
We have
(7) 
for all and .
Lemma 6.
is strictly increasing with respect to over the interval .
Proof.
(of Theorem 4) Since is monotone increasing in each variable , it is maximal when each is replaced by . Thus,
as desired. ∎
Definition 7.
We define the maximum ECOC error rate as the probability of the event where at least out of independent binary classifiers produces an error and is given by the cumulative sum
(8) 
If the classifiers are identically distributed, then we define the probability of this event by
(9) 
It is clear that . Moreover, we note that gives the maximum ECOC error rate for a Hadamard matrix of dimension with minimum row Hamming distance . The following theorem, which follows immediately from Theorem 4, shows that is bounded by the binomial distribution based on the largest bit error rate.
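The worst-case error of Definition 7 is straightforward to compute numerically. The sketch below (function names are ours) builds the Poisson binomial pmf by convolution and sums its tail from the decoding threshold onward:

```python
import numpy as np

def poisson_binomial_pmf(error_rates):
    """pmf of the number of errors among independent classifiers with
    the given bit error rates, computed by repeated convolution."""
    pmf = np.array([1.0])
    for p in error_rates:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

def max_ecoc_error(error_rates, d_min):
    """Worst-case ECOC error: probability that at least ceil(d_min / 2)
    of the independent binary classifiers err (Definition 7)."""
    k0 = -(-d_min // 2)  # ceil(d_min / 2)
    return float(poisson_binomial_pmf(error_rates)[k0:].sum())
```

Replacing every rate by the maximum rate can only increase this tail, which is the content of Theorem 8.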
Theorem 8.
Suppose for all . Set . Then
(10) 
We now apply Feller’s result on to obtain the following simple rational bound:
Lemma 9 ([9]).
For , we have
(11) 
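The vanishing tail behind this lemma is easy to verify numerically. In the sketch below, the bit error rate p = 0.1 and the decoding threshold n/4 (the Hadamard-code threshold) are illustrative assumptions:

```python
from math import comb

def binomial_tail(n, p, k0):
    """Exact upper-tail probability P(Bin(n, p) >= k0)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

# The tail at threshold n/4 shrinks rapidly as codeword length n grows.
tails = [binomial_tail(n, 0.1, n // 4) for n in (16, 32, 64, 128)]
```

Doubling the codeword length repeatedly drives the tail toward zero, in line with the corollary that follows.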
The following corollary shows that the ECOC error rate tends to zero as the codeword length tends to infinity, assuming the ratio stays fixed. This gives theoretical justification for the effectiveness of the ECOC approach for datasets with a large number of classes; of course, this assumes the existence of many relatively accurate independent binary classifiers.
Corollary 10.
Suppose and are fixed with for all . Then
(12) 
Proof.
To obtain a sharper and more useful bound, we call on the following result by Chernoff.
Theorem 11 ([5]).
Let . Then
(13) 
where is Euler’s number.
The following corollary, which restates the Chernoff bound in terms of the average bit error rate, shows that decays to zero exponentially with respect to codeword length.
Corollary 12.
Let , , and . Then
(14) 
where for all . Moreover, is increasing with respect to for and decreasing with respect to for . Thus, if and are fixed with , then , and thus , decays exponentially to zero as .
Proof.
It is straightforward to prove using analytical methods that and that is increasing and decreasing with respect to over the respective intervals. As for the bound (14), we have
(15) 
Since due to , it follows that (and thus ) decays exponentially as . ∎
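The exponential decay can be illustrated with a textbook Chernoff–Hoeffding tail bound. Note this is one standard form of the bound, not necessarily the exact constant appearing in (13)–(14); the rates p = 0.1 and threshold fraction a = 1/4 are illustrative assumptions:

```python
from math import comb, exp

def binomial_tail(n, p, k0):
    """Exact upper-tail probability P(Bin(n, p) >= k0)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

def hoeffding_bound(n, p, a):
    """Chernoff-Hoeffding form: P(S_n >= a*n) <= exp(-2n(a - p)^2) for a > p."""
    return exp(-2.0 * n * (a - p) ** 2)
```

The exact tail always sits below the exponential envelope, and the envelope itself decays geometrically in the codeword length.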
III Correlated Base Classifiers
In this section we assume dependence (correlation) between certain base classifiers to show how it affects ECOC accuracy. We first make the simple assumption that all binary classifiers are mutually independent except for a pair of dependent classifiers and , which are allowed to depend on each other as follows. Recall that each takes on two possible values, namely (correct prediction) and (incorrect prediction). As before, let denote the error rate of , i.e., . Since and are dependent on each other, we specify their correlation via the joint probability
(16) 
It follows that the remaining joint probabilities are given by
(17)  
(18)  
(19) 
We shall assume that , , and so that all probabilities are nonnegative. We then define the correlation between and as
(20) 
In particular, if and are independent so that , then .
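The joint distribution of a correlated pair can be written down explicitly. The sketch below uses the standard parameterization of a correlated Bernoulli pair (which we believe matches the definitions above, though the exact equations are not reproduced here); the feasibility check mirrors the nonnegativity conditions on the joint probabilities:

```python
import numpy as np

def correlated_bernoulli_joint(pi, pj, rho):
    """Joint pmf of two Bernoulli error indicators with marginal error
    rates pi, pj and Pearson correlation rho; joint[a, b] = P(Xi=a, Xj=b)."""
    p11 = pi * pj + rho * np.sqrt(pi * (1 - pi) * pj * (1 - pj))
    joint = np.array([[1 - pi - pj + p11, pj - p11],
                      [pi - p11, p11]])
    if (joint < 0).any():
        # rho outside the admissible range for these marginals
        raise ValueError("rho is incompatible with the given marginals")
    return joint
```

Setting rho = 0 recovers the independent case, where the joint error probability factors as the product of the marginals.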
Given a subset , we denote and define
(21) 
Definition 13.
Let . We define the probability of the event where out of classifiers produces an error (with dependence between classifiers and as defined above) by
(22) 
If for all , then we denote .
Define . The following lemma, whose proof is given in [16], shows the explicit dependence of on and .
Lemma 14.
We have
(23) 
where
Define . We apply Theorem 4 to the above lemma to obtain the following bound.
Corollary 15.
Suppose . Then
(24) 
where
In the special case where all binary classifiers are identically distributed, i.e., for all , then
The next lemma, whose proof is given in [16], assumes all classifiers are identically distributed.
Lemma 16.
is increasing with respect to for fixed , where
and .
Definition 17.
We define the maximum ECOC error rate (assuming correlation given by ) as the probability of the event where at least out of binary classifiers produces an error and therefore is given by the cumulative sum
(25) 
If the classifiers are identically distributed, i.e., for all , then we define
(26) 
The next two theorems describe the dependence of the maximum ECOC error rate on . Their proofs can be found in [16].
Theorem 18.
We have
(27) 
Theorem 19.
is increasing with respect to for and decreasing with respect to for .
The next theorem gives a simple bound for , which again implies that decays exponentially to zero but assumes that is fixed.
Theorem 20.
Let and . Then
(28) 
Proof.
Let us now use Theorem 19 to discuss the effect of a correlated pair of binary classifiers on the maximum ECOC error rate . Assuming , which implies is increasing with respect to , we conclude that over the range , the ECOC error rate is lower when there is negative correlation () compared to that for independence (), which in turn is lower than when there is positive correlation (). In other words, negative correlation actually helps to decrease the ECOC error rate, while positive correlation increases it, which agrees with common intuition. On the other hand, over the range , the reverse occurs since is decreasing with respect to . Thus, the moral is that positive correlation is detrimental only if is relatively small.
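This qualitative conclusion can be checked by exact computation for a single correlated pair. The sketch below (our own illustration, with n = 16 classifiers, bit error rate p = 0.1, and decoding threshold 4 chosen as an example where the threshold sits above the mean number of errors) convolves the pair's pmf with the remaining independent classifiers:

```python
import numpy as np

def ensemble_tail(n, p, rho, threshold):
    """P(at least `threshold` of n classifiers err) when two classifiers
    share Pearson correlation rho and the rest are independent, all with
    bit error rate p; computed exactly by convolving pmfs."""
    p11 = p * p + rho * p * (1 - p)        # joint error prob of the pair
    pair = np.array([1 - 2 * p + p11,      # 0 errors in the pair
                     2 * (p - p11),        # exactly 1 error
                     p11])                 # both err
    pmf = pair
    for _ in range(n - 2):
        pmf = np.convolve(pmf, [1 - p, p])
    return float(pmf[threshold:].sum())
```

With the threshold above the mean, the tail is smallest under negative correlation, larger under independence, and largest under positive correlation, matching the discussion above.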
We next investigate the effect of having all classifiers mutually dependent on ECOC accuracy.
III-A All Classifiers Mutually Correlated
Suppose all classifiers are mutually correlated up to second-order only (all higher-order correlations are zero). We define
(31)  
(32)  
(33) 
Let . Recall our definition of the outcome where if and if where . Denote by the probability of the outcome .
Definition 21.
Suppose and . We define the probability of the event where out of classifiers produces an error (with correlation given by (33)) by
(34) 
We also define the maximum ECOC error rate as the probability of the event where at least out of binary classifiers produces an error and therefore is given by the cumulative sum
(35) 
The following result by [12] gives an explicit formula for where are equal and all correlations are equal. Although their result is stated under the assumption because of its application to jury design where in their model jurors are assumed to be competent, their proof, which we partially replicate in [16] for completeness, in fact holds over the range , assuming that satisfies the Bahadur bound described in the same paper:
(36) 
where
(37) 
Theorem 22 ([12]).
Corollary 23.
Let . Suppose and is nonnegative and satisfies (36). Then
(40) 
where and . Moreover, if and are fixed, then (and thus ) decays exponentially to zero as .
Proof.
Since it was proven earlier that the first term on the right-hand side of (22), , is bounded by and decays exponentially to zero as , it suffices to prove that the second term, , is bounded similarly. We first manipulate it as follows:
Then using the bound
(41) 
we have
Since for , it follows that exponentially as . Thus, the same holds for as well. ∎
IV Experimental Results
In this section we present experimental results to demonstrate the validity of our work by performing ECOC classification on various data sets and comparing the resulting classification error rates with those predicted by the Chernoff and KZ bounds established in the previous section.
In particular, we selected six public datasets on which to perform ECOC classification: Pendigits, Usps, Vowel, Letter Recognition (Letters), CIFAR-10, and Street View House Numbers (SVHN). Information regarding these datasets is given in Table I.

TABLE I
Dataset    # Samples  # Features  # Classes ()
Pendigits  3498       16          10    2/11
Usps       7291       256         10    2/10
Vowel      990        10          11    2/11
Letters    20,000     16          26    6/26
CIFAR-10   60,000     Image       10    2/10
SVHN       99,289     Image       10    2/10
ECOC Matrix: We employed a square ECOC matrix for every dataset, constructed from a Hadamard matrix of dimension , where was chosen to be the smallest integer for which and denotes the number of classes. We then truncated an appropriate number of rows and columns from (starting from the top left) to obtain our square matrix of dimension . The parameter for each dataset is given in Table I.
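A construction along these lines can be sketched as follows. This is our own reading of the setup, not the authors' code: we use the Sylvester construction for the Hadamard matrix and drop leading rows and columns to truncate (the exact truncation used in the experiments may differ):

```python
import numpy as np

def ecoc_from_hadamard(k):
    """Square k x k ECOC matrix cut from a Sylvester Hadamard matrix of
    dimension 2^m >= k, by dropping leading rows and columns."""
    n = 1
    while n < k:
        n *= 2
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    code = (H > 0).astype(int)   # map entries +/-1 to bits 1/0
    return code[n - k:, n - k:]  # keep the trailing k rows and columns
```

Since any two rows of the full Hadamard code are at Hamming distance n/2, removing n - k columns can reduce pairwise row distances by at most n - k, so the truncated codewords remain distinct whenever k > n/2.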

Classification algorithms: For the datasets Pendigits, Usps, Vowel, and Letters, we employed two different models for our base classifiers: decision trees (DT) and support-vector machines (SVM), using the Python (version 3.7) modules sklearn.tree.DecisionTreeClassifier and sklearn.svm.SVC from the scikit-learn machine learning library with default settings. Computations were performed on a standard laptop. For the image datasets CIFAR-10 and SVHN, we employed a pretrained convolutional neural network, ResNet-18 (loaded from PyTorch), with an additional dense layer to produce binary output, trained with the Adam optimizer. Computations were performed for 10 epochs with a batch size of 128 and ran on the Open Science Grid [18, 21].

Given a dataset, we performed 10-fold cross-validation based on the experimental setup described above and recorded the ECOC error rate (experimental) for each fold, as well as the mean ECOC error rate and standard deviation over all ten folds. To compute the GS, Chernoff, and KZ bounds given by (1), (2), and (3), respectively, for each fold, we used the mean bit error rate, obtained by averaging the bit error rates of all the binary classifiers. In addition, for the KZ bound we used the mean correlation, obtained by averaging the coefficients of the correlation matrix of the binary classifiers. Full results, including the values used for and , are given in [16] (Tables III–VIII).

IV-A Results and Discussion
Experimental results show that for all datasets the ECOC error rates (experimental) are either below all three bounds (GS, Chernoff, and KZ) or clustered around the Chernoff and KZ bounds, where the latter occurs for Letters (DT and SVM) and Pendigits (SVM). This can be seen in the plots in Figures 2–7 for Pendigits, Letters, CIFAR-10, and SVHN (see [16] for plots of Usps and Vowel), where ECOC error rates are shown for each of the ten folds, and in Table II, where results are averaged over all folds (lowest value indicated in bold). These results demonstrate the validity of all three bounds. However, Figures 3–5 (Letters) and 7 (Pendigits) clearly show that the Chernoff and KZ bounds provide much more accurate estimates of the ECOC error compared to the GS bound. This is to be expected for Letters, where the number of binary classifiers is significantly larger than for all the other datasets. As discussed earlier, the Chernoff and KZ bounds decay exponentially to zero with respect to codeword length and thus are more effective for larger values of . Overall, we believe our experimental results demonstrate that the Chernoff and KZ bounds are quite useful in practice.

TABLE II: ECOC Error Rate

Dataset    Model  Experimental    GS              Chernoff Bound  KZ Bound
Pendigits  DT     0.034 ± 0.0034  0.134 ± 0.0070  0.148 ± 0.0130  0.192 ± 0.03450
Pendigits  SVM    0.022 ± 0.0024  0.047 ± 0.0059  0.023 ± 0.0054  0.030 ± 0.0071
Usps       DT     0.091 ± 0.0117  0.288 ± 0.0209  0.466 ± 0.0431  0.500 ± 0.0482
Usps       SVM    0.028 ± 0.0050  0.063 ± 0.0085  0.040 ± 0.0100  0.049 ± 0.0149
Vowel      DT     0.144 ± 0.0397  0.449 ± 0.0604  0.749 ± 0.0833  0.746 ± 0.0626
Vowel      SVM    0.166 ± 0.0368  0.422 ± 0.0553  0.710 ± 0.0891  0.712 ± 0.0876
Letters    DT     0.061 ± 0.0057  0.274 ± 0.0114  0.047 ± 0.0082  0.055 ± 0.0108
Letters    SVM    0.106 ± 0.0046  0.302 ± 0.0086  0.070 ± 0.0081  0.093 ± 0.0191
CIFAR-10   CNN    0.023 ± 0.0015  0.065 ± 0.0042  0.041 ± 0.0049  0.074 ± 0.0098
SVHN       CNN    0.011 ± 0.0010  0.034 ± 0.0018  0.013 ± 0.0013  0.021 ± 0.0025
V Conclusions and Future Work
In this paper, we presented two new classification error bounds for ECOC ensemble learning: the first under the assumption that all base classifiers are independent, and the second under the assumption that all base classifiers are mutually correlated up to second-order. These bounds decay exponentially with respect to codeword length and theoretically validate the effectiveness of the ECOC approach. Moreover, we performed ECOC classification on six datasets and compared the resulting error rates with our bounds to experimentally validate our work and show the effect of correlation on classification accuracy. Future work includes investigating the Chernoff bound for ECOC in settings with limited independence [20] and comparing the performance of binary vs. N-ary ECOC with respect to the error bounds presented in this paper.
References
[1] (2001) Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, pp. 113–141.
[2] (2004) Co-training and expansion: towards bridging theory and practice. In NIPS 2004, pp. 89–96.
[3] (2009) Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT) 2009, pp. 247–262.
[4] (1998) Combining labeled and unlabeled data with co-training. In COLT 1998, pp. 92–100.
[5] (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics 23, pp. 493–507.
[6] (2001) Sensitive error correcting output codes. In NIPS 2005, pp. 375–382.
[7] (1995) Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Research 2, pp. 263–286.
[8] (2008) On the decoding process in ternary error-correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 120–134.
[9] (1968) An Introduction to Probability Theory and Its Applications, Vol. 1. John Wiley and Sons. ISBN 9780471257080.
[10] (1999) Multiclass learning, boosting, and error-correcting codes. In COLT 1999, pp. 145–155.
[11] (2020) An error-correcting output code framework for lifelong learning without a teacher. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 249–254.
[12] (2011) Optimal jury design for homogeneous juries with correlated votes. Theory and Decision 71, pp. 439–459.
[13] (2003) On nearest-neighbor error-correcting output codes with application to all-pairs multiclass support vector machines. J. Mach. Learn. Research 4, pp. 1–15.
[14] (2005) Sensitive error correcting output codes. In COLT 2005, pp. 158–172.
[15] (2015) Joint binary classifier learning for ECOC-based multiclass classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (11), pp. 2335–2341.
[16] (2021) Ensemble learning using error correcting output codes: new classification error bounds: appendix. Note: https://drive.google.com/file/d/1SuqVu2q9GFazV8FPfBEFT99ItcQMGF3/view
[17] (2004) New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks 15 (1), pp. 45–54.
[18] (2007) The Open Science Grid. J. Phys. Conf. Ser. 78, 012057.
[19] (2017) Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2833–2842.
[20] (1995) Chernoff–Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics 8 (2), pp. 223–250.
[21] (2009) The pilot way to grid resources using glideinWMS. In 2009 WRI World Congress on Computer Science and Information Engineering, Vol. 2, pp. 428–432.
[22] (2021) Error-correcting output codes with ensemble diversity for robust learning in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 9722–9729.
[23] (2013) Adaptive error-correcting output codes. In Twenty-Third International Joint Conference on Artificial Intelligence.
[24] (2019) N-ary decomposition for multiclass classification. Mach. Learn. 108, pp. 809–830.