Error-correcting output codes (ECOC) is an ensemble classification technique in machine learning motivated by coding theory, where transmitted or stored information is encoded as binary strings (codewords) with high pairwise Hamming distance, which allows bit errors to be uniquely decoded. There are many variants and extensions of the ECOC technique, such as the use of ternary and -ary codes, optimizing individual classifier performance concurrently by exploiting inter-classifier relationships, and learning the base classifiers jointly as a multi-task learning problem. Some theoretical error bounds for ECOC can be found in and . Moreover, Passerini et al. provided a leave-one-out error bound for ECOC classifiers that use kernel machines as base classifiers. More recently, the ECOC technique has been extended to handle the zero-shot learning problem, the lifelong learning problem, and adversarial examples in neural networks by integrating ECOC with increased ensemble diversity.
In a conventional ECOC classifier, each class of a given dataset is assigned a codeword, and a model is trained as an ensemble of binary classifiers constructed from the columns of the corresponding ECOC matrix, whose rows consist of the class codewords. Each column defines a bipartition of the dataset by merging classes with the same bit value. Decoding (classification) is performed by matching the codeword predicted by the ensemble with the class codeword nearest in Hamming distance. In essence, ECOC is a generalization of the one-vs-one and one-vs-all classification techniques, and, as an ensemble technique, it is most effective when the binary classifiers make independent mistakes on a random sample.
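To make the encode/decode pipeline above concrete, here is a minimal sketch of nearest-codeword decoding; the 4-class, 7-bit matrix is hypothetical (not taken from the paper), and the trained binary classifiers are abstracted away as a vector of predicted bits.

```python
# Hypothetical 4-class ECOC matrix (rows = class codewords, columns = binary tasks).
ECOC = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode(predicted_bits):
    # Nearest-codeword decoding: pick the class whose row is closest in Hamming distance.
    return min(range(len(ECOC)), key=lambda c: hamming(ECOC[c], predicted_bits))

# One bit flipped from class 2's codeword still decodes to class 2, since this
# matrix has minimum row distance 4 and therefore corrects any single bit error.
print(decode([1, 0, 1, 1, 0, 1, 1]))  # → 2
```

In practice each column's bits would come from a trained binary classifier; the matrix here is only for illustrating the decoding step.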
In this paper we derive new bounds on ECOC classification error rates that improve on those obtained by  by first applying the Feller and Chernoff bounds, well known in statistics, to the case where all binary classifiers are mutually independent, and then applying a more recent bound due to  to the case where they are correlated. These new bounds theoretically establish the effectiveness of the ECOC approach in machine learning; in particular, we show under certain assumptions that the ECOC classification error decays exponentially to zero with respect to codeword length. We also present experimental results to demonstrate the validity of these bounds by applying them to various datasets and showing the effect of correlation on classification accuracy.
Let denote the aforementioned ensemble of binary classifiers (or learners) for a data set with classes. Let denote the error rate of . Since is a binary classifier that only outputs 0 or 1, where 1 indicates an error, we shall also call its error rate the bit error rate, since the outputs represent a binary string. The following result, due to , gives a crude bound on the accuracy of :
Theorem 1 (GS Bound, ).
Let denote the average bit error rate. Then the ECOC classification error rate of is bounded by four times the average bit error rate, i.e.,
We note that the GS bound makes no assumptions regarding whether the classifiers are independent or how much correlation exists between them. However, the GS bound is far from sharp: assuming that , then . Thus, the GS bound fails to answer whether it is theoretically possible for , which would validate its effectiveness as an ensemble technique. Moreover, the GS bound gives no explicit dependence of on .
To the best of our knowledge and prior to this work, no error bound exists that rigorously demonstrates that is theoretically possible in the ECOC setting. Progress so far has been limited to extending the GS bound to loss-based decoding schemes  and special distance measures . In addition, theorems have been proven that bound the excess error rate of the ECOC classifier in terms of the excess error rates of the constituent binary classifiers [14, 3]. Here, "excess error rate" refers to the difference between the error rate and the Bayes optimal error rate.
Our main result establishes new bounds on by calling on results from statistical theory.
Theorem 2 (Main Result).
Let be the ECOC matrix corresponding to with row dimension and minimum row Hamming distance . Set and with .
Chernoff Bound: If all binary classifiers are mutually independent, then
KZ Bound: If with , and all binary classifiers are mutually correlated up to second-order only and specified by a uniform non-negative correlation coefficient that satisfies the Bahadur bound (36), then
where is defined in part 1 and .
Assuming is fixed, these bounds imply that decays exponentially to zero with respect to (codeword length).
II Independent Base Classifiers
In this section we assume that all classifiers are mutually independent, but not necessarily identically distributed. This allows us to use the Poisson binomial distribution to describe the probability of error for our ensemble of classifiers and to show that the corresponding ECOC error is bounded by the classical binomial distribution based on the maximum error rate of all the classifiers.
Although the assumption of independence rarely holds in practice for real-world data sets, it is still useful as a starting point for our theoretical analysis and for establishing baseline results. An important application where this assumption is considered is multi-view learning within the context of co-training , where, say, two classifiers are trained separately on data representing two different views (or sets of attributes). In this setting one of the assumptions requires the classifiers to be conditionally independent given the class label. This assumption can be relaxed [6, 2]. We aim to do the same in a later section, where we take correlation between classifiers into account.
Denote by the collection of all -element subsets of . Given a subset of , we define the outcome to be such that if and if , where denotes the complement of in .
Let be a set of error rates of , respectively. We define to be the probability of the event where exactly out of the classifiers suffered bit errors, i.e., those outcomes where . Then is given by (Poisson binomial distribution)
If the classifiers are identically distributed so that for all , then we define this probability by (binomial distribution)
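As an illustration, the Poisson binomial probabilities described above can be computed exactly by a standard dynamic program over the classifiers; the function name and the list-of-error-rates interface below are our own.

```python
def poisson_binomial_pmf(error_rates):
    # pmf[k] = probability that exactly k of the independent (not necessarily
    # identically distributed) binary classifiers make a bit error.
    pmf = [1.0]
    for p in error_rates:
        new = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            new[k] += prob * (1 - p)   # this classifier is correct
            new[k + 1] += prob * p     # this classifier errs
        pmf = new
    return pmf

# With identical error rates the distribution reduces to the binomial case.
print(poisson_binomial_pmf([0.5, 0.5]))  # → [0.25, 0.5, 0.25]
```

The quadratic-time recursion avoids enumerating all subsets of classifiers, which the defining sum would require.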
Recall that the minimum Hamming distance between any two rows or any two columns of an -dimensional Hadamard matrix is (see ). In that case, when at least of the classifiers (corresponding to the columns of ) each makes an error, i.e., misclassifies a sample, then ECOC misclassification may occur. This is because the rows of such a matrix describe an error-correcting code that only guarantees correct decoding up to (but strictly fewer than) bit errors. Therefore, in order to bound , we shall assume under a worst-case scenario that misclassification always occurs when , where is the number of classifiers that suffered bit errors.
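The minimum-distance property quoted above is easy to check numerically. The sketch below uses Sylvester's recursive construction, which is an assumption on our part, since the text does not name a specific Hadamard construction.

```python
def sylvester_hadamard(k):
    # +1/-1 Hadamard matrix of dimension n = 2^k via Sylvester's recursion.
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def min_row_hamming(H):
    # Minimum Hamming distance between any two distinct rows.
    return min(sum(a != b for a, b in zip(H[i], H[j]))
               for i in range(len(H)) for j in range(i + 1, len(H)))

H = sylvester_hadamard(3)   # n = 8
print(min_row_hamming(H))   # → 4, i.e. n/2
```

A code with minimum distance n/2 guarantees correct decoding only for strictly fewer than n/4 bit errors, which is the worst-case threshold used in the bound.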
The following theorem shows that can be bounded by the maximum error rate of all the classifiers, assuming all are no larger than .
Let with . Let be a set of error rates with for all . Set . Then
The proof of this theorem requires the following lemmas, whose proofs are given in the appendix of this paper . Before stating them, we first introduce notation: given , we define and .
for all and .
is strictly increasing with respect to over the interval .
(of Theorem 4) Since is monotone increasing in each variable , it is maximal when each is replaced by . Thus,
as desired. ∎
We define the maximum ECOC error rate as the probability of the event where at least out of independent binary classifiers produces an error and is given by the cumulative sum
If the classifiers are identically distributed, then we define the probability of this event by
It is clear that . Moreover, we note that gives the maximum ECOC error rate for a Hadamard matrix of dimension with minimum row Hamming distance . The following theorem, which follows immediately from Theorem 4, shows that is bounded by the binomial distribution based on the largest bit error rate.
Suppose for all . Set . Then
We now apply Feller’s result on to obtain the following simple rational bound:
Lemma 9 ().
For , we have
The following corollary shows that ECOC error rate tends to zero as the codeword length tends to infinity assuming the ratio stays fixed. This gives theoretical justification for the effectiveness of the ECOC approach for datasets with a large number of classes; of course, this assumes the existence of many relatively accurate independent binary classifiers.
Suppose and are fixed with for all . Then
To obtain a sharper and more useful bound, we call on the following result by Chernoff.
Theorem 11 ().
Let . Then
where is the Euler number.
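For intuition, the sketch below compares the exact binomial tail with the relative-entropy (Chernoff) form of the bound; this is a common textbook statement and may differ in constants from the paper's bound (2).

```python
from math import comb, exp, log

def exact_tail(n, t, p):
    # P[X >= t] for X ~ Binomial(n, p): at least t of n classifiers err.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

def chernoff_tail(n, t, p):
    # Chernoff bound in relative-entropy form, valid for t/n > p:
    # P[X >= t] <= exp(-n * KL(t/n || p)).  The exponent is linear in n,
    # so at fixed ratio t/n the bound decays exponentially in codeword length.
    a = t / n
    kl = a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))
    return exp(-n * kl)

n, t, p = 32, 16, 0.1
print(exact_tail(n, t, p) <= chernoff_tail(n, t, p))  # → True
```

Doubling n while keeping t/n fixed roughly squares the bound, illustrating the exponential decay discussed in the corollary that follows.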
The following corollary, which restates the Chernoff bound in terms of the average bit error rate, shows that decays to zero exponentially with respect to codeword length.
Let , , and . Then
where for all . Moreover, is increasing with respect to for and decreasing with respect to for . Thus, if and are fixed with , then , and thus , decays exponentially to zero as .
It is straightforward to prove using analytical methods that and that is increasing and decreasing with respect to over the respective intervals. As for the bound (14), we have
Since due to , it follows that (and thus ) decays exponentially as . ∎
We emphasize that the improvement of the Chernoff bound (Corollary 12) over the GS bound (Theorem 1) is due to the assumption that all the binary classifiers are mutually independent. Figure 1 clearly demonstrates this for large and small .
III Correlated Base Classifiers
In this section we assume dependence (correlation) between certain base classifiers to show how it affects ECOC accuracy. We first make the simple assumption that all binary classifiers are mutually independent except for a pair of dependent classifiers and , which are allowed to depend on each other as follows. Recall that each takes on two possible values, namely (correct prediction) and (incorrect prediction). As before, let denote the error rate of , i.e., . Since and are dependent on each other, we specify their correlation via the joint probability
It follows that the remaining joint probabilities are given by
We shall assume that , , and so that all probabilities are non-negative. We then define the correlation between and as
In particular, if and are independent so that , then .
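Concretely, once the joint error probability of the dependent pair is specified, the remaining cell probabilities and the correlation coefficient follow from the standard Bernoulli formulas; the function name and argument order below are our own.

```python
from math import sqrt

def pair_correlation(p_i, p_j, p11):
    # p11 = P[both classifiers err]; p_i, p_j are the marginal error rates.
    # Remaining joint probabilities (all must be non-negative, as assumed in the text):
    p10 = p_i - p11
    p01 = p_j - p11
    p00 = 1 - p_i - p_j + p11
    assert min(p10, p01, p00) >= 0
    # Correlation of the two Bernoulli error indicators.
    return (p11 - p_i * p_j) / sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

print(pair_correlation(0.5, 0.5, 0.25))  # independent pair: → 0.0
print(pair_correlation(0.5, 0.5, 0.30))  # positively correlated pair (≈ 0.2)
```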
Given a subset , we denote and define
Let . We define the probability of the event where out of classifiers produces an error (with dependence between classifiers and as defined above) by
If for all , then we denote .
Define . The following lemma, whose proof is given in , shows the explicit dependence of on and .
Define . We apply Theorem 4 to the above lemma to obtain the following bound.
Suppose . Then
In the special case where all binary classifiers are identically distributed, i.e., for all , then
The next lemma, whose proof is given in , assumes all classifiers are identically distributed.
is increasing with respect to for fixed , where
We define the maximum ECOC error rate (assuming correlation given by ) as the probability of the event where at least out of binary classifiers produces an error and therefore is given by the cumulative sum
If the classifiers are identically distributed, i.e., for all , then we define
The next two theorems describe the dependence of the maximum ECOC error rate on . Their proofs can be found in .
is increasing with respect to for and decreasing with respect to for .
The next theorem gives a simple bound for , which again implies that decays exponentially to zero but assumes that is fixed.
Let and . Then
Let us now use Theorem 19 to discuss the effect of a correlated pair of binary classifiers on the maximum ECOC error rate . Assuming , which implies is increasing with respect to , we conclude that over the range , the ECOC error rate is lower when there is negative correlation () compared to that for independence (), which in turn is lower than when there is positive correlation (). In other words, negative correlation actually helps to decrease the ECOC error rate while positive correlation increases it, which agrees with common intuition. On the other hand, over the range , the reverse occurs since is decreasing with respect to . Thus, the moral is that positive correlation is detrimental only if is relatively small.
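The qualitative claim above can be checked exactly in a toy exchangeable model: each classifier copies a shared latent error bit with probability r and otherwise errs independently, which keeps the marginal error rate fixed and gives pairwise correlation r². This latent-bit model is our own illustration, not the paper's correlation structure.

```python
from math import comb

def tail_mixture(n, t, p, r):
    # P[at least t of n classifiers err] when each bit equals a shared latent
    # Bernoulli(p) bit with probability r and is an independent Bernoulli(p)
    # draw otherwise; marginals stay p, pairwise correlation is r**2.
    def binom_tail(q):
        return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(t, n + 1))
    return p * binom_tail(r + (1 - r) * p) + (1 - p) * binom_tail((1 - r) * p)

# With a decoding threshold t small relative to n, positive correlation
# inflates the probability of t or more simultaneous bit errors:
print(tail_mixture(16, 4, 0.05, 0.0) < tail_mixture(16, 4, 0.05, 0.5))  # → True
```

The computation is exact (a two-component binomial mixture), so no Monte Carlo sampling is needed.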
We next investigate the effect of having all classifiers mutually dependent on ECOC accuracy.
III-A All Classifiers Mutually Correlated
Suppose all classifiers are mutually correlated up to second-order only (all higher-order correlations are zero). We define
Let . Recall our definition of the outcome where if and if where . Denote by the probability of the outcome .
Suppose and . We define the probability of the event where out of classifiers produces an error (with correlation given by (33)) by
We also define the maximum ECOC error rate as the probability of the event where at least out of binary classifiers produces an error and therefore is given by the cumulative sum
The following result by  gives an explicit formula for when all error rates are equal and all correlations are equal. Although their result is stated under the assumption , because of its application to jury design where in their model jurors are assumed to be competent, their proof, which we partially replicate in  for completeness, in fact holds over the range , assuming that satisfies the Bahadur bound described in the same paper:
Theorem 22 ().
Let . Suppose and is non-negative and satisfies (36). Then
where and . Moreover, if and are fixed, then (and thus ) decays exponentially to zero as .
Since it was proven earlier that the first term on the right-hand side of (22), , is bounded by and decays exponentially to zero as , it suffices to prove that the second term, , is bounded similarly. We first manipulate it as follows:
Then using the bound
Since for , it follows that exponentially as . Thus, the same holds for as well. ∎
IV Experimental Results
In this section we present experimental results to demonstrate the validity of our work by performing ECOC classification on various data sets and comparing the resulting classification error rates with those predicted by the Chernoff and KZ bounds established in the previous section.
In particular, we selected six public datasets to perform ECOC classification: Pendigits, Usps, Vowel, Letter Recognition (Letters), CIFAR-10, and Street View House Numbers (SVHN). Information regarding these datasets is given in Table I.
Dataset | # Samples | # Features | # Classes ()
ECOC Matrix: We employed a square ECOC matrix for every dataset (), constructed from a -Hadamard matrix of dimension , where was chosen to be the smallest integer for which and denotes the number of classes. We then truncated an appropriate number of rows and columns from (starting from the top left) to obtain our square matrix of dimension . The parameter for each dataset is given in Table I.
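Our reading of this construction can be sketched as follows. Exactly which rows and columns are dropped is ambiguous in the description ("starting from the top left"), so keeping the bottom-right block, as the code below does, is an assumption, as is the Sylvester-type 0/1 construction.

```python
def ecoc_matrix(num_classes):
    # Build a 0/1 Sylvester-type Hadamard matrix of the smallest dimension
    # 2^k >= num_classes, then truncate to a square num_classes x num_classes
    # block (here: drop leading rows/columns, keeping the bottom-right block).
    H = [[0]]
    while len(H) < num_classes:
        H = [row + row for row in H] + [row + [1 - x for x in row] for row in H]
    return [row[-num_classes:] for row in H[-num_classes:]]

M = ecoc_matrix(10)             # e.g. CIFAR-10: 16-dim Hadamard truncated to 10 x 10
print(len(set(map(tuple, M))))  # → 10: every class keeps a distinct codeword
```

Truncation reduces the minimum row distance below n/2, but rows of the full Hadamard matrix differ in half their positions, so moderate truncation keeps the codewords well separated.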
Classification algorithms: For the datasets Pendigits, Usps, Vowel, and Letters, we employed two different models
for our base classifiers: decision trees (DT) and support vector machines (SVM), using the Python (version 3.7) modules sklearn.tree.DecisionTreeClassifier and sklearn.svm.SVC with default settings, respectively, from the scikit-learn machine learning library. Computations were performed on a standard laptop. For the image datasets CIFAR-10 and SVHN, we employed a pre-trained convolutional neural network, ResNet-18 (loaded from PyTorch), with an additional dense layer to produce binary output, trained using the Adam optimizer. Computations were performed for 10 epochs with a batch size of 128 and run on the Open Science Grid [18, 21].
Thus, given a dataset, we performed 10-fold cross-validation based on the experimental setup described above and recorded the ECOC error rate (experimental) for each fold, as well as the mean ECOC error rate and standard deviation over all ten folds. To compute the GS, Chernoff, and KZ bounds given by (1), (2), and (3), respectively, for each fold, we used the mean bit error rate , obtained by averaging the bit error rates of all the binary classifiers. In addition, for the KZ bound we used the mean correlation for , obtained by averaging the coefficients of the correlation matrix of the binary classifiers. Full results, including the values used for and , are given in  (Tables III-VIII).
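The averaging step described above can be sketched as follows, for a hypothetical 0/1 error matrix with one row per test sample and one column per binary classifier (the layout and function name are ours).

```python
from math import sqrt

def mean_error_and_correlation(errors):
    # errors[s][j] = 1 if binary classifier j misclassified sample s, else 0.
    n_samples, n_clf = len(errors), len(errors[0])
    means = [sum(row[j] for row in errors) / n_samples for j in range(n_clf)]
    p_bar = sum(means) / n_clf  # mean bit error rate over classifiers
    # Mean off-diagonal entry of the classifiers' correlation matrix.
    corrs = []
    for i in range(n_clf):
        for j in range(i + 1, n_clf):
            cov = sum((row[i] - means[i]) * (row[j] - means[j])
                      for row in errors) / n_samples
            denom = sqrt(means[i] * (1 - means[i]) * means[j] * (1 - means[j]))
            if denom > 0:
                corrs.append(cov / denom)
    rho_bar = sum(corrs) / len(corrs) if corrs else 0.0
    return p_bar, rho_bar

# Two classifiers that always err together: mean error 0.5, correlation 1.0.
print(mean_error_and_correlation([[0, 0], [1, 1], [0, 0], [1, 1]]))  # → (0.5, 1.0)
```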
IV-A Results and Discussion
Experimental results show that for all datasets the ECOC error rates () are either below all three bounds (GS, Chernoff, and KZ) or clustered around the Chernoff and KZ bounds, where the latter occurs for Letters (DT and SVM) and Pendigits (SVM). This can be seen in the plots in Figures 2-7 for Pendigits, Letters, CIFAR-10, and SVHN (see  for plots of USPS and Vowels) where ECOC error rates are shown for each of the ten folds and in Table II
where results are averaged over all folds (lowest value indicated in bold). These results demonstrate the validity of all three bounds. However, Figures 3-5 (Letters) and 7 (Pendigits) clearly show that the Chernoff and KZ bounds provide much more accurate estimates of the ECOC error compared to the GS bound. This is to be expected for Letters, where the number of binary classifiers () is significantly larger than for all the other datasets. As discussed earlier, the Chernoff and KZ bounds decay exponentially to zero with respect to and thus are more effective for larger values of . Overall, we believe our experimental results demonstrate that the Chernoff and KZ bounds are quite useful in practice.
ECOC Error Rate (mean ± standard deviation over ten folds)

Dataset   | Model | Experimental   | GS             | Chernoff Bound | KZ Bound
Pendigits | DT    | 0.034 ± 0.0034 | 0.134 ± 0.0070 | 0.148 ± 0.0130 | 0.192 ± 0.03450
Pendigits | SVM   | 0.022 ± 0.0024 | 0.047 ± 0.0059 | 0.023 ± 0.0054 | 0.030 ± 0.0071
Usps      | DT    | 0.091 ± 0.0117 | 0.288 ± 0.0209 | 0.466 ± 0.0431 | 0.500 ± 0.0482
Usps      | SVM   | 0.028 ± 0.0050 | 0.063 ± 0.0085 | 0.040 ± 0.0100 | 0.049 ± 0.0149
Vowel     | DT    | 0.144 ± 0.0397 | 0.449 ± 0.0604 | 0.749 ± 0.0833 | 0.746 ± 0.0626
Vowel     | SVM   | 0.166 ± 0.0368 | 0.422 ± 0.0553 | 0.710 ± 0.0891 | 0.712 ± 0.0876
Letters   | DT    | 0.061 ± 0.0057 | 0.274 ± 0.0114 | 0.047 ± 0.0082 | 0.055 ± 0.0108
Letters   | SVM   | 0.106 ± 0.0046 | 0.302 ± 0.0086 | 0.070 ± 0.0081 | 0.093 ± 0.0191
CIFAR-10  | CNN   | 0.023 ± 0.0015 | 0.065 ± 0.0042 | 0.041 ± 0.0049 | 0.074 ± 0.0098
SVHN      | CNN   | 0.011 ± 0.0010 | 0.034 ± 0.0018 | 0.013 ± 0.0013 | 0.021 ± 0.0025
V Conclusions and Future Work
In this paper, we presented two new classification error bounds for ECOC ensemble learning: the first under the assumption that all base classifiers are independent, and the second under the assumption that all base classifiers are mutually correlated up to second-order. These bounds decay exponentially with respect to codeword length and theoretically validate the effectiveness of the ECOC approach. Moreover, we performed ECOC classification on six datasets and compared the resulting error rates with our bounds to experimentally validate our work and show the effect of correlation on classification accuracy. Future work includes investigating the Chernoff bound for ECOC in settings with limited independence  and comparing the performance of binary versus -ary ECOC with respect to the error bounds presented in this paper.
- (2001) Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, pp. 113–141.
- (2004) Co-training and expansion: towards bridging theory and practice. In NIPS 2004, pp. 89–96.
- (2009) Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT) 2009, pp. 247–262.
- (1998) Combining labeled and unlabeled data with co-training. In COLT 1998, pp. 92–100.
- (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics 23, pp. 493–507.
- (2001) Sensitive error correcting output codes. In NIPS 2005, pp. 375–382.
- Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Research 2, pp. 263–286.
- (2008) On the decoding process in ternary error-correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 120–134.
- An introduction to probability theory and its applications. Vol. 1, John Wiley and Sons.
- (1999) Multiclass learning, boosting, and error-correcting codes. In COLT 1999, pp. 145–155.
- (2020) An error-correcting output code framework for lifelong learning without a teacher. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 249–254.
- (2011) Optimal jury design for homogeneous juries with correlated votes. Theory and Decision 71, pp. 439–459.
- On nearest-neighbor error-correcting output codes with application to all-pairs multiclass support vector machines. J. Mach. Learn. Res. 4, pp. 1–15.
- (2005) Sensitive error correcting output codes. In COLT 2005, pp. 158–172.
- (2015) Joint binary classifier learning for ECOC-based multi-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (11), pp. 2335–2341.
- (2021) Ensemble learning using error correcting output codes: new classification error bounds: appendix. Note: https://drive.google.com/file/d/1SuqVu2q9GFazV8FPfBEFT99ItcQM-GF3/view
- (2004) New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks 15 (1), pp. 45–54.
- (2007) The open science grid. J. Phys. Conf. Ser. 78, pp. 012057.
- (2017) Zero-shot action recognition with error-correcting output codes. In , pp. 2833–2842.
- (1995) Chernoff–Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics 8 (2), pp. 223–250.
- (2009) The pilot way to grid resources using glideinWMS. In 2009 WRI World Congress on Computer Science and Information Engineering, Vol. 2, pp. 428–432.
- (2021) Error-correcting output codes with ensemble diversity for robust learning in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 9722–9729.
- (2013) Adaptive error-correcting output codes. In Twenty-Third International Joint Conference on Artificial Intelligence.
- (2019) N-ary decomposition for multi-class classification. Mach. Learn. 108, pp. 809–830.
(of Lemma 14)
We first assume the case where and consider the following partition of :
This proves (23) for this case. The other two cases, and , can be proven by a similar argument by considering appropriate partitions of and , respectively. ∎