Ensemble Learning using Error Correcting Output Codes: New Classification Error Bounds

09/18/2021
by   Hieu D. Nguyen, et al.
Rowan University

New bounds on classification error rates for the error-correcting output code (ECOC) approach in machine learning are presented. These bounds decay exponentially with respect to codeword length and theoretically validate the effectiveness of the ECOC approach. Bounds are derived for two different models: the first under the assumption that all base classifiers are independent and the second under the assumption that all base classifiers are mutually correlated up to first-order. Moreover, we perform ECOC classification on six datasets and compare their error rates with our bounds to experimentally validate our work and show the effect of correlation on classification accuracy.


I Introduction

Error correcting output codes (ECOC) is an ensemble classification technique in machine learning that is motivated by coding theory, where transmitted or stored information is encoded by binary strings (codewords) with large pairwise Hamming distance, which allows bit errors to be uniquely decoded [7]. There are many variants and extensions of the ECOC technique, such as the use of ternary [8] and N-ary codes [24], optimizing individual classifier performance concurrently by exploiting the relationships between classifiers [15], and optimizing the learning of the base classifiers together as a multi-task learning problem [23]. Some theoretical error bounds for ECOC can be found in [7] and [24]. Moreover, Passerini et al. [17] provided a leave-one-out error bound for ECOC classifiers whose base classifiers are kernel machines. More recently, the ECOC technique has been extended to handle the zero-shot learning problem [19] and the life-long learning problem [11], and to handle adversarial examples in neural networks by combining ECOC with increased ensemble diversity [22].

In a conventional ECOC classifier, each class of a given dataset is assigned a codeword, and a model is trained as an ensemble of binary classifiers constructed from the columns of the corresponding ECOC matrix, whose rows consist of the class codewords [7]. Each column defines a bipartition of the dataset by merging classes that share the same bit value. Decoding (classification) is performed by matching the codeword predicted by the ensemble with the class codeword nearest in Hamming distance. In essence, ECOC is a generalization of the one-vs-one and one-vs-all classification techniques, and as an ensemble technique, it is most effective when the binary classifiers make independent mistakes on a random sample.
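To make this concrete, the following minimal Python sketch (our own illustration, not the authors' implementation) builds a random binary code matrix for a 10-class toy dataset, trains one scikit-learn decision tree per column, and classifies by nearest class codeword in Hamming distance; the dataset, code matrix, and base learner are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)                       # toy dataset with 10 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_classes, n_bits = 10, 15
M = rng.integers(0, 2, size=(n_classes, n_bits))           # rows = class codewords

# One binary learner per column: classes with bit 1 vs. classes with bit 0.
learners = []
for j in range(n_bits):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_tr, M[y_tr, j])
    learners.append(clf)

# Decode: predict a codeword, then pick the class codeword nearest in Hamming distance.
pred_bits = np.column_stack([clf.predict(X_te) for clf in learners])
hamming = (pred_bits[:, None, :] != M[None, :, :]).sum(axis=2)
y_hat = hamming.argmin(axis=1)
print("ECOC accuracy:", (y_hat == y_te).mean())
```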

In this paper we derive new bounds on ECOC classification error rates that improve on the bound obtained in [10], first by applying the Feller and Chernoff bounds, well known in statistics, to the case where all binary classifiers are mutually independent, and then by applying a more recent bound due to [12] to the case where they are correlated. These new bounds theoretically establish the effectiveness of the ECOC approach in machine learning; in particular, we show under certain assumptions that the ECOC classification error decays exponentially to zero with respect to codeword length. We also present experimental results to demonstrate the validity of these bounds by applying them to various datasets and to show the effect of correlation on classification accuracy.

Consider the aforementioned ensemble of binary classifiers (or learners) for a data set with a given number of classes, and consider the error rate of each learner. Since each learner is a binary classifier that outputs only 0 or 1, with 1 indicating an error, we shall also call its error rate the bit error rate, since the outputs form a binary string. The following result, due to [10], gives a crude bound on the accuracy of the ensemble:

Theorem 1 (GS Bound, [10]).

Consider the average bit error rate of the ensemble. Then the ECOC classification error rate is bounded by four times the average bit error rate, i.e.,

(1)

We note that the GS bound makes no assumptions about whether the classifiers are independent or about how much correlation exists between them. However, the GS bound is far from sharp: whenever the average bit error rate is at least 1/4, the bound exceeds 1 and is vacuous. Thus, the GS bound fails to answer whether the ensemble can, in principle, achieve a classification error well below the bit error rates of its base classifiers, which would validate its effectiveness as an ensemble technique. Moreover, the GS bound gives no explicit dependence of the classification error on the codeword length.

To the best of our knowledge and prior to this work, no error bound rigorously demonstrates that this is theoretically possible in the ECOC setting. Progress so far has been limited to extending the GS bound to loss-based decoding schemes [1] and special distance measures [13]. In addition, theorems have been proven that bound the excess error rate of the ECOC classifier in terms of the excess error rates of the constituent binary classifiers [14, 3]. Here, “excess error rate” refers to the difference between the error rate and the Bayes optimal error rate.

Our main result establishes new bounds on the ECOC classification error rate by drawing on results from statistical theory.

Theorem 2 (Main Result).

Let be the ECOC matrix corresponding to with row dimension and minimum row Hamming distance . Set and with .

  1. Chernoff Bound: If all binary classifiers are mutually independent, then

    (2)

    where .

  2. KZ Bound: If with , and all binary classifiers are mutually correlated up to second-order only and specified by a uniform non-negative correlation coefficient that satisfies the Bahadur bound (36), then

    (3)

    where is defined in part 1 and .

Assuming is fixed, these bounds imply that decays exponentially to zero with respect to (codeword length).

II Independent Base Classifiers

In this section we assume that all classifiers are mutually independent, but not necessarily identically distributed. This allows us to use the Poisson binomial distribution to describe the probability of error for our ensemble of classifiers and to show that the corresponding ECOC error is bounded by the classical binomial distribution based on the maximum error rate over all the classifiers.

Although the assumption of independence rarely holds in practice for real-world data sets, it is still a useful starting point for our theoretical analysis and for establishing baseline results. An important application where this assumption arises is multi-view learning in the context of co-training [4], where, say, two classifiers are trained separately on data representing two different views (or sets of attributes). In this setting one of the assumptions requires the classifiers to be conditionally independent given the class label. This assumption can be relaxed [6, 2]; we do the same in the next section, where we take correlation between classifiers into account.

Denote by the collection of all -element subsets of . Given a subset of , we define the outcome to be such that if and if , where denotes the complement of in .

Definition 3.

Let be a set of error rates of , respectively. We define to be the probability of the event where exactly out of the classifiers suffered bit errors, i.e., those outcomes where . Then is given by (Poisson binomial distribution)

(4)

If the classifiers are identically distributed so that for all , then we define this probability by (binomial distribution)

(5)
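As a concrete illustration of Definition 3, the short Python sketch below (our own, with hypothetical error rates) computes the Poisson binomial probabilities by dynamic programming, together with the binomial special case of identically distributed classifiers.

```python
import numpy as np
from math import comb

def poisson_binomial_pmf(p):
    """q[k] = P(exactly k errors) for independent classifiers with error rates p[0..n-1]."""
    q = np.zeros(len(p) + 1)
    q[0] = 1.0
    for pi in p:                                  # fold in one classifier at a time
        q[1:] = q[1:] * (1 - pi) + q[:-1] * pi
        q[0] *= (1 - pi)
    return q

def binomial_pmf(n, p):
    """Special case where all classifiers share the same error rate p."""
    return np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

print(poisson_binomial_pmf([0.05, 0.10, 0.08, 0.12, 0.07]))   # hypothetical bit error rates
print(binomial_pmf(5, 0.08))
```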

Recall that the minimum Hamming distance between any two rows or any two columns of a Hadamard matrix equals half its dimension (see [10]). In that case, ECOC misclassification may occur once the number of classifiers (corresponding to the columns of the ECOC matrix) that each make an error, i.e., misclassify a sample, reaches half the minimum distance. This is because the rows of such a matrix describe an error-correcting code that guarantees correct decoding only when the number of bit errors is strictly less than half the minimum distance. Therefore, in order to bound the ECOC error, we shall assume under a worst-case scenario that misclassification always occurs when the number of classifiers that suffer bit errors reaches this threshold.

The following theorem shows that can be bounded by the maximum error rate of all the classifiers, assuming all are no larger than .

Theorem 4.

Let with . Let be a set of error rates with for all . Set . Then

(6)

The proof of this theorem requires the following lemmas, whose proofs are given in the appendix of this paper [16]. Before stating them, we first introduce notation: given , we define and .

Lemma 5.

We have

(7)

for all and .

Lemma 6.

is strictly increasing with respect to over the interval .

Proof.

(of Theorem 4) Since is monotone increasing in each variable , it is maximal when each is replaced by . Thus,

as desired. ∎

Definition 7.

We define the maximum ECOC error rate as the probability that at least the threshold number of independent binary classifiers produce an error; it is given by the cumulative sum

(8)

If the classifiers are identically distributed, then we define the probability of this event by

(9)

It is clear that . Moreover, we note that gives the maximum ECOC error rate for a Hadamard matrix of dimension with minimum row Hamming distance . The following theorem, which follows immediately from Theorem 4, shows that is bounded by the binomial distribution based on the largest bit error rate.
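A small self-contained sketch (our own; error rates and threshold are illustrative, and the dynamic-programming helper from the previous sketch is repeated so the snippet stands alone) of Definition 7 and the bound of Theorem 8:

```python
import numpy as np
from scipy.stats import binom

def poisson_binomial_pmf(p):
    # same dynamic-programming helper as in the earlier sketch
    q = np.zeros(len(p) + 1)
    q[0] = 1.0
    for pi in p:
        q[1:] = q[1:] * (1 - pi) + q[:-1] * pi
        q[0] *= (1 - pi)
    return q

p = [0.05, 0.10, 0.08, 0.12, 0.07, 0.09, 0.11, 0.06]     # hypothetical bit error rates
n, k = len(p), 3                                          # assume decoding may fail once >= 3 bits err
max_ecoc_error = poisson_binomial_pmf(p)[k:].sum()        # tail sum of Definition 7
binomial_bound = binom.sf(k - 1, n, max(p))               # Theorem 8: binomial tail at the largest rate
print(max_ecoc_error, "<=", binomial_bound)
```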

Theorem 8.

Suppose for all . Set . Then

(10)

We now apply Feller's result on binomial tail probabilities to obtain the following simple rational bound:

Lemma 9 ([9]).

For , we have

(11)

The following corollary shows that the ECOC error rate tends to zero as the codeword length tends to infinity, assuming the ratio of the error-correcting threshold to the codeword length stays fixed. This gives theoretical justification for the effectiveness of the ECOC approach for datasets with a large number of classes; of course, this assumes the existence of many relatively accurate independent binary classifiers.

Corollary 10.

Suppose and are fixed with for all . Then

(12)
Proof.

Set . Then and since , we have . It follows from Theorem 8 and Lemma 9 that the following chain of inequalities holds:

It is now clear that as . ∎

To obtain a sharper and more useful bound, we call on the following result by Chernoff.

Theorem 11 ([5]).

Let . Then

(13)

where e is Euler's number.
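For reference, one standard form of the Chernoff bound on the upper tail of a binomial random variable X ~ Bin(n, p) is P(X >= a) <= e^(-np) (enp/a)^a for a > np; the precise expression in (13) may differ in its constants. The snippet below (illustrative parameters) checks this standard form numerically against the exact tail.

```python
import math
from scipy.stats import binom

def chernoff_upper_tail(n, p, a):
    """Standard Chernoff-type bound on P(Bin(n, p) >= a), valid for a > n*p."""
    return math.exp(-n * p) * (math.e * n * p / a) ** a

n, p, a = 31, 0.1, 8                         # e.g., 31 classifiers, tail of >= 8 bit errors
exact = binom.sf(a - 1, n, p)                # exact P(X >= a)
print(f"exact tail = {exact:.3e}, Chernoff bound = {chernoff_upper_tail(n, p, a):.3e}")
```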

The following corollary, which restates the Chernoff bound in terms of the average bit error rate, shows that the maximum ECOC error rate decays to zero exponentially with respect to codeword length.

Corollary 12.

Let , , and . Then

(14)

where for all . Moreover, is increasing with respect to for and decreasing with respect to for . Thus, if and are fixed with , then , and thus , decays exponentially to zero as .

Proof.

It is straightforward to prove using analytical methods that and that is increasing and decreasing with respect to over the respective intervals. As for the bound (14), we have

(15)

Since due to , it follows that (and thus ) decays exponentially as . ∎

We emphasize that the improvement of the Chernoff bound (Corollary 12) over the GS bound (Theorem 1) is due to the assumption that all the binary classifiers are mutually independent. Figure 1 clearly demonstrates this for large and small .

Fig. 1: ECOC error bounds: GS vs Chernoff ()

We end this section by commenting that Corollary 12 is also valid for the non-binary ECOC setting, where the entries of the ECOC matrix are chosen from a non-binary alphabet [24].
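The following short computation (our own illustration, with hypothetical parameters) shows the exponential decay promised by Corollaries 10 and 12: with the bit error rate held below the threshold ratio, the binomial tail that bounds the ECOC error shrinks rapidly as the codeword length grows.

```python
from scipy.stats import binom

p, r = 0.10, 0.25                       # hypothetical bit error rate and threshold ratio
for n in (16, 32, 64, 128, 256):        # codeword lengths
    k = int(r * n)                      # decoding assumed to fail once >= k classifiers err
    print(n, binom.sf(k - 1, n, p))     # binomial tail bound on the ECOC error
```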

III Correlated Base Classifiers

In this section we assume dependence (correlation) between certain base classifiers to show how it affects ECOC accuracy. We first make the simple assumption that all binary classifiers are mutually independent except for a pair of dependent classifiers and , which are allowed to depend on each other as follows. Recall that each takes on two possible values, namely (correct prediction) and (incorrect prediction). As before, let denote the error rate of , i.e., . Since and are dependent on each other, we specify their correlation via the joint probability

(16)

It follows that the remaining joint probabilities are given by

(17)
(18)
(19)

We shall assume that , , and so that all probabilities are non-negative. We then define the correlation between and as

(20)

In particular, if and are independent so that , then .
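The next sketch (our own, with hypothetical probabilities) makes the correlated-pair setup explicit: from the two bit error rates and the joint probability that both classifiers err, it recovers the remaining joint probabilities and the standard Bernoulli (Pearson) correlation coefficient, which vanishes exactly when the pair is independent.

```python
import math

p1, p2 = 0.10, 0.15                     # bit error rates of the dependent pair (hypothetical)
p_both = 0.03                           # joint probability that both classifiers err

p_only1 = p1 - p_both                   # first errs, second correct
p_only2 = p2 - p_both                   # second errs, first correct
p_neither = 1 - p1 - p2 + p_both        # neither errs
rho = (p_both - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))   # correlation
print(p_only1, p_only2, p_neither, rho)
```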

Given a subset , we denote and define

(21)
Definition 13.

Let . We define the probability of the event where out of classifiers produces an error (with dependence between classifiers and as defined above) by

(22)

If for all , then we denote .

Define . The following lemma, whose proof is given in [16], shows the explicit dependence of on and .

Lemma 14.

We have

(23)

where

Define . We apply Theorem 4 to the above lemma to obtain the following bound.

Corollary 15.

Suppose . Then

(24)

where

In the special case where all binary classifiers are identically distributed, i.e., for all , then

The next lemma, whose proof is given in [16], assumes all classifiers are identically distributed.

Lemma 16.

is increasing with respect to for fixed , where

and .

Definition 17.

We define the maximum ECOC error rate (assuming the correlation above) as the probability that at least the threshold number of binary classifiers produce an error; it is given by the cumulative sum

(25)

If the classifiers are identically distributed, i.e., for all , then we define

(26)

The next two theorems describe the dependence of the maximum ECOC error rate on . Their proofs can be found in [16].

Theorem 18.

We have

(27)
Theorem 19.

is increasing with respect to for and decreasing with respect to for .

The next theorem gives a simple bound for , which again implies that decays exponentially to zero but assumes that is fixed.

Theorem 20.

Let and . Then

(28)
Proof.

We apply Theorem (18) and Corollary 12:

(29)
(30)

since . Setting gives the desired result. ∎

Let us now use Theorem 19 to discuss the effect of a correlated pair of binary classifiers on the maximum ECOC error rate. Assuming the bit error rate lies below the relevant threshold, which implies the error rate is increasing with respect to the correlation, we conclude that over this range the ECOC error rate is lower under negative correlation than under independence, which in turn is lower than under positive correlation. In other words, negative correlation actually helps to decrease the ECOC error rate while positive correlation increases it, which agrees with our common intuition. On the other hand, over the complementary range the reverse occurs, since the error rate is then decreasing with respect to the correlation. Thus, the moral is that positive correlation is detrimental only if the bit error rate is relatively small.

We next investigate the effect of having all classifiers mutually dependent on ECOC accuracy.

III-A All Classifiers Mutually Correlated

Suppose all classifiers are mutually correlated up to second-order only (all higher-order correlations are zero). We define

(31)
(32)
(33)

Let . Recall our definition of the outcome where if and if where . Denote by the probability of the outcome .

Definition 21.

Suppose and . We define the probability of the event where out of classifiers produces an error (with correlation given by (33)) by

(34)

We also define the maximum ECOC error rate as the probability that at least the threshold number of binary classifiers produce an error; it is given by the cumulative sum

(35)

The following result of [12] gives an explicit formula for the case where all error rates are equal and all pairwise correlations are equal. Although their result is stated under an assumption motivated by jury design, where jurors are assumed to be competent, their proof, which we partially replicate in [16] for completeness, in fact holds over a wider range, provided that the correlation coefficient satisfies the Bahadur bound described in the same paper:

(36)

where

(37)
Theorem 22 ([12]).

Suppose for all and for all , and that satisfies (36). Then

(38)

where is defined by (9) and

(39)
Corollary 23.

Let . Suppose and is non-negative and satisfies (36). Then

(40)

where and . Moreover, if and are fixed, then (and thus ) decays exponentially to zero as .

Proof.

Since it was proven earlier that the first term on the right-hand side of the formula in Theorem 22 is bounded and decays exponentially to zero as the codeword length grows, it suffices to prove that the second term is bounded similarly. We first manipulate it as follows:

Then using the bound

(41)

we have

Since for , it follows that exponentially as . Thus, the same holds for as well. ∎
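To give a feel for the correlated model of Theorem 22, the sketch below computes the probability of exactly k errors under an exchangeable second-order model (common error rate, common pairwise correlation, higher-order correlations zero) via a truncated Bahadur representation, and then the resulting tail. This is our own rendering of the setting and is not guaranteed to match the exact expressions (38)-(39) of [12]; all parameter values are illustrative. In this regime the tail grows with the common correlation, consistent with the discussion following Theorem 19.

```python
import numpy as np
from scipy.stats import binom

def correlated_pmf(n, p, c):
    """P(exactly k errors) under a second-order exchangeable (Bahadur-type) model.
    Can dip below zero if c violates the admissible (Bahadur) range."""
    q = 1.0 - p
    k = np.arange(n + 1)
    base = binom.pmf(k, n, p)
    correction = ((k - n * p) ** 2 - (k * q * q + (n - k) * p * p)) / (2.0 * p * q)
    return base * (1.0 + c * correction)

def max_ecoc_error(n, p, c, k_min):
    """Probability that at least k_min of the n classifiers err."""
    return float(correlated_pmf(n, p, c)[k_min:].sum())

n, p, k_min = 31, 0.1, 8                 # illustrative codeword length, error rate, threshold
for c in (0.0, 0.02, 0.05):              # increasing common correlation
    print(c, max_ecoc_error(n, p, c, k_min))
```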

IV Experimental Results

In this section we present experimental results that demonstrate the validity of our work by performing ECOC classification on various data sets and comparing the resulting classification error rates with those predicted by the Chernoff and KZ bounds established in the previous sections.

In particular, we selected six public datasets on which to perform ECOC classification: Pendigits, Usps, Vowel, Letter Recognition (Letters), CIFAR-10, and Street View House Numbers (SVHN). Information regarding these datasets is given in Table I.

Dataset # Samples # Features # Classes ()
Pendigits 3498 16 10 2/11
Usps 7291 256 10 2/10
Vowel 990 10 11 2/11
Letters 20,000 16 26 6/26
CIFAR-10 60,000 Image 10 2/10
SVHN 99,289 Image 10 2/10
TABLE I: Datasets
  1. ECOC Matrix: For every dataset we employed a square ECOC matrix constructed from a Hadamard matrix whose dimension was chosen to be the smallest admissible value at least as large as the number of classes. We then truncated an appropriate number of rows and columns (starting from the top left) to obtain a square matrix whose dimension equals the number of classes; a sketch of this construction is given after this list. The parameter values for each dataset are given in Table I.

  2. Classification algorithms: For the datasets Pendigits, Usps, Vowel, and Letters, we employed two different models for our base classifiers, decision trees (DT) and support-vector machines (SVM), using the Python (version 3.7) modules sklearn.tree.DecisionTreeClassifier and sklearn.svm.SVC from the scikit-learn machine learning library with default settings. These computations were performed on a standard laptop. For the image datasets CIFAR-10 and SVHN, we employed a pre-trained convolutional neural network, ResNet-18 (loaded from PyTorch), with an additional dense layer to produce binary output, trained using the Adam optimizer. These computations were performed for 10 epochs with a batch size of 128 and ran on the Open Science Grid [18, 21].
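The following sketch shows one way to carry out the ECOC matrix construction of item 1, under the assumption that the underlying Hadamard matrix has power-of-two dimension (as scipy's generator requires); the authors' exact choice of which rows and columns to truncate may differ.

```python
import numpy as np
from scipy.linalg import hadamard

def ecoc_matrix(n_classes):
    """Square binary ECOC matrix of size n_classes x n_classes built from a Hadamard matrix."""
    dim = 1
    while dim < n_classes:                # smallest power of two >= number of classes
        dim *= 2
    H = (hadamard(dim) + 1) // 2          # map entries {-1, +1} to bits {0, 1}
    # Drop leading rows/columns; this removes the trivial all-ones row and column
    # of the Sylvester construction whenever any truncation occurs.
    return H[dim - n_classes:, dim - n_classes:]

M = ecoc_matrix(26)                       # e.g., the 26-class Letters dataset
print(M.shape)
```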

Thus, given a dataset, we performed 10-fold cross-validation based on the experimental setup described above and recorded the experimental ECOC error rate for each fold, as well as the mean ECOC error rate and standard deviation over all ten folds. To compute the GS, Chernoff, and KZ bounds given by (1), (2), and (3), respectively, for each fold, we used the mean bit error rate, obtained by averaging the bit error rates of all the binary classifiers. In addition, for the KZ bound we used the mean correlation, obtained by averaging the coefficients of the correlation matrix of the binary classifiers. Full results, including the values used for the mean bit error rate and mean correlation, are given in [16] (Tables III-VIII).
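The per-fold statistics just described can be computed from a 0/1 error-indicator matrix; the sketch below (our own, on hypothetical indicators) shows one way to do so.

```python
import numpy as np

def fold_statistics(E):
    """E has shape (n_samples, n_classifiers), with E[i, j] = 1 if classifier j erred on sample i."""
    bit_error_rates = E.mean(axis=0)              # error rate of each binary classifier
    mean_bit_error = bit_error_rates.mean()       # used in the GS, Chernoff, and KZ bounds
    C = np.corrcoef(E, rowvar=False)              # pairwise correlations of the error indicators
    mean_corr = C[~np.eye(C.shape[0], dtype=bool)].mean()   # average off-diagonal entry (KZ bound)
    return mean_bit_error, mean_corr

E = (np.random.default_rng(0).random((200, 15)) < 0.1).astype(int)   # hypothetical indicators
print(fold_statistics(E))
```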

IV-A Results and Discussion

Experimental results show that for all datasets the experimental ECOC error rates are either below all three bounds (GS, Chernoff, and KZ) or clustered around the Chernoff and KZ bounds; the latter occurs for Letters (DT and SVM) and Pendigits (SVM). This can be seen in the plots in Figures 2-7 for Pendigits, Letters, CIFAR-10, and SVHN (see [16] for plots of Usps and Vowel), where ECOC error rates are shown for each of the ten folds, and in Table II, where results are averaged over all folds (lowest value indicated in bold). These results demonstrate the validity of all three bounds. However, Figures 3-5 (Letters) and 7 (Pendigits) clearly show that the Chernoff and KZ bounds provide much more accurate estimates of the ECOC error than the GS bound. This is to be expected for Letters, where the number of binary classifiers is significantly larger than for all the other datasets. As discussed earlier, the Chernoff and KZ bounds decay exponentially to zero with respect to codeword length and thus are more effective for larger codeword lengths. Overall, we believe our experimental results demonstrate that the Chernoff and KZ bounds are quite useful in practice.

Fig. 2: Pendigits: Mean bit error vs ECOC error (DT using 10-fold cross-validation)
Fig. 3: Pendigits: Mean bit error vs ECOC error (SVM using 10-fold cross-validation)
Fig. 4: Letters: Mean bit error vs ECOC error (DT using 10-fold cross-validation)
Fig. 5: Letters: Mean bit error vs ECOC error (SVM using 10-fold cross-validation)
Fig. 6: CIFAR-10: Mean bit error vs ECOC error (CNN using 10-fold cross-validation)
Fig. 7: SVHN: Mean bit error vs ECOC error (CNN using 10-fold cross-validation)

ECOC Error Rate
Dataset | Model | Experimental | GS Bound | Chernoff Bound | KZ Bound
Pendigits | DT | 0.034 ± 0.0034 | 0.134 ± 0.0070 | 0.148 ± 0.0130 | 0.192 ± 0.03450
Pendigits | SVM | 0.022 ± 0.0024 | 0.047 ± 0.0059 | 0.023 ± 0.0054 | 0.030 ± 0.0071
Usps | DT | 0.091 ± 0.0117 | 0.288 ± 0.0209 | 0.466 ± 0.0431 | 0.500 ± 0.0482
Usps | SVM | 0.028 ± 0.0050 | 0.063 ± 0.0085 | 0.040 ± 0.0100 | 0.049 ± 0.0149
Vowel | DT | 0.144 ± 0.0397 | 0.449 ± 0.0604 | 0.749 ± 0.0833 | 0.746 ± 0.0626
Vowel | SVM | 0.166 ± 0.0368 | 0.422 ± 0.0553 | 0.710 ± 0.0891 | 0.712 ± 0.0876
Letters | DT | 0.061 ± 0.0057 | 0.274 ± 0.0114 | 0.047 ± 0.0082 | 0.055 ± 0.0108
Letters | SVM | 0.106 ± 0.0046 | 0.302 ± 0.0086 | 0.070 ± 0.0081 | 0.093 ± 0.0191
CIFAR-10 | CNN | 0.023 ± 0.0015 | 0.065 ± 0.0042 | 0.041 ± 0.0049 | 0.074 ± 0.0098
SVHN | CNN | 0.011 ± 0.0010 | 0.034 ± 0.0018 | 0.013 ± 0.0013 | 0.021 ± 0.0025
TABLE II: ECOC error rate (mean ± standard deviation over 10-fold cross-validation): Experimental vs GS, Chernoff, and KZ bounds.

V Conclusions and Future Work

In this paper, we presented two new classification error bounds for ECOC ensemble learning: the first under the assumption that all base classifiers are independent and the second under the assumption that all base classifiers are mutually correlated up to first-order. These bounds decay exponentially with respect to codeword length and theoretically validate the effectiveness of the ECOC approach. Moreover, we performed ECOC classification on six datasets and compared the resulting error rates with our bounds to experimentally validate our work and to show the effect of correlation on classification accuracy. Future work includes investigating the Chernoff bound for ECOC in settings with limited independence [20] and comparing the performance of binary vs. N-ary ECOC with respect to the error bounds presented in this paper.

References

  • [1] E. L. Allwein, R. E. Schapire, and Y. Singer (2001) Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. 1, pp. 113–141. Cited by: §I.
  • [2] M. Balcan, A. Blum, and K. Yang (2004) Co-training and expansion: towards bridging theory and practice. In NIPS 2004, pp. 89–96. Cited by: §II.
  • [3] A. Beygelzimer, J. Langford, and P. Ravikumar (2009) Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT), 2009, pp. 247–262. Cited by: §I.
  • [4] A. Blum and T. Mitchell (1998) Combining labeled and unlabeled data with co-training. In COLT 1998, pp. 92–100. Cited by: §II.
  • [5] H. Chernoff (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics 23, pp. 493–507. Cited by: Theorem 11.
  • [6] S. Dasgupta, M. L. Littman, and D. A. McAllester (2001) PAC generalization bounds for co-training. In NIPS 2001, pp. 375–382. Cited by: §II.
  • [7] T. Dietterich and G. Bakiri (1995-01) Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Research 2, pp. 263–286. Cited by: §I, §I.
  • [8] S. Escalera, O. Pujol, and P. Radeva (2008) On the decoding process in ternary error-correcting output codes. IEEE transactions on pattern analysis and machine intelligence 32 (1), pp. 120–134. Cited by: §I.
  • [9] W. Feller (1968) An introduction to probability theory and its applications. Vol. 1, John Wiley and Sons. External Links: ISBN 978-0-471-25708-0 Cited by: Lemma 9.
  • [10] V. Guruswami and A. Sahai (1999) Multiclass learning, boosting, and error-correcting codes. In COLT 1999, pp. 145–155. External Links: Document Cited by: §I, §I, §II, Theorem 1.
  • [11] S. Ho, M. Marchiano, S. Zockoll, and H. Nguyen (2020) An error-correcting output code framework for lifelong learning without a teacher. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 249–254. Cited by: §I.
  • [12] S. Kaniovski and A. Zaigraev (2011) Optimal jury design for homogeneous juries with correlated votes. Theory and Decision 71, pp. 439–459. Cited by: Appendix C, §I, §III-A, Theorem 22.
  • [13] A. Klautau, N. Jevtić, and A. Orlitsky (2003) On nearest-neighbor error-correcting output codes with application to all-pairs multiclass support vector machines. J. Mach. Learn. Research 4, pp. 1–15. Cited by: §I.
  • [14] J. Langford and A. Beygelzimer (2005) Sensitive error correcting output codes. In COLT 2005, pp. 158–172. Cited by: §I.
  • [15] M. Liu, D. Zhang, S. Chen, and H. Xue (2015) Joint binary classifier learning for ecoc-based multi-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (11), pp. 2335–2341. Cited by: §I.
  • [16] H. D. Nguyen, M. S. Khan, N. Kaegi, S. Ho, J. Moore, L. Borys, and L. Lavalva (2021) Ensemble learning using error correcting output codes: new classification error bounds: appendix. Note: https://drive.google.com/file/d/1SuqVu2q9GFazV8FPfBEFT99ItcQM-GF3/view Cited by: §II, §III-A, §III, §III, §III, §IV-A, §IV.
  • [17] A. Passerini, M. Pontil, and P. Frasconi (2004) New results on error correcting output codes of kernel machines. IEEE transactions on neural networks 15 (1), pp. 45–54. Cited by: §I.
  • [18] R. Pordes, D. Petravick, B. Kramer, D. Olson, M. Livny, A. Roy, P. Avery, K. Blackburn, T. Wenaus, F. Würthwein, I. Foster, R. Gardner, M. Wilde, A. Blatecky, J. McGee, and R. Quick (2007) The open science grid. In J. Phys. Conf. Ser., Vol. 78, pp. 012057. External Links: Document Cited by: item 2.
  • [19] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, and Y. Wang (2017) Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2833–2842. Cited by: §I.
  • [20] J. P. Schmidt, A. Siegel, and A. Srinivasan (1995) Chernoff-Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics 8 (2), pp. 223–250. Cited by: §V.
  • [21] I. Sfiligoi, D. C. Bradley, B. Holzman, P. Mhashilkar, S. Padhi, and F. Wurthwein (2009) The pilot way to grid resources using glideinWMS. In 2009 WRI World Congress on Computer Science and Information Engineering, Vol. 2, pp. 428–432. External Links: Document Cited by: item 2.
  • [22] Y. Song, Q. Kang, and W. P. Tay (2021) Error-correcting output codes with ensemble diversity for robust learning in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 9722–9729. Cited by: §I.
  • [23] G. Zhong and M. Cheriet (2013) Adaptive error-correcting output codes. In Twenty-Third International Joint Conference on Artificial Intelligence, Cited by: §I.
  • [24] J. T. Zhou, I. W. Tsang, S. Ho, and K. Müller (2019-02) N-ary decomposition for multi-class classification. Mach. Learn. 108, pp. 809–830. External Links: Document Cited by: §I, §II.

Appendix A Proofs of Lemmas 14 and 16

Proof of Lemma 14.

We first assume the case where and consider the following partition of :

where

(44)
(45)

Then

This proves (23) for this case. The other two cases, and , can be proven by a similar argument by considering appropriate partitions of and , respectively. ∎

Proof of Lemma 16.

We first assume the case where . It suffices to prove that the derivative of with respect to is non-negative. Since Lemma 14 shows that is linear in , we have

(46)