1 Introduction
In the last decade, datasets have grown at an exponential rate, generating vast collections of data that need to be analyzed automatically. In particular, multimedia datasets have seen an explosion in data availability, thanks to the almost negligible cost of gathering multimedia data from the Internet. There is therefore a pressing need for efficient algorithms that can automate knowledge extraction on those datasets. One of the classic problems in Pattern Recognition and Machine Intelligence is automatic classification, i.e., automatically attributing a label to each sample of the dataset. In this sense, classification is often considered a first step towards higher-order representations or knowledge extraction. In multi-class classification problems the goal is to find a function $f$ that maps samples to a finite discrete set of more than two labels. While there exists a large set of approaches to estimate $f$, all of them can be grouped into two categories: Single-Machine/Single-Loss approaches and Divide and Conquer approaches. The former attempt to approximate a single $f$ for the complete multi-class problem, while the latter decouple $f$ into a set of binary sub-functions (binary classifiers) that are potentially easier to estimate, and then aggregate the results.
In this sense, Error-Correcting Output Codes (ECOC) is a divide and conquer approach that has proven to be very effective in many different multi-class contexts. The core property of an ECOC is its capability to correct errors in binary classifiers by using redundancy. However, existing literature represents the error-correcting capability of an ECOC as a scalar, hindering a deeper analysis of error-correction and redundancy on class pairs. Furthermore, classical divide and conquer approaches that have been included in the ECOC framework, like One vs. All [48] or Random [2] approaches, ignore the data distribution, and thus do not take advantage of allocating the error-correcting capabilities of ECOCs in a problem-dependent fashion. In addition, recent problem-dependent ECOC designs have focused on designing the binary sub-functions rather than analyzing the core error-correcting property. In order to overcome these limitations, our proposal builds an ECOC matrix by factorizing a design matrix that encodes the desired 'correction properties' between classes (i.e., a design matrix which can be obtained directly from data or be set by experts on the problem domain). The proposed method finds the ECOC coding that yields the closest configuration to the design matrix. We cast the task of designing an ECOC as a matrix factorization problem with binary constraints. A visual example is shown in Figure 1.
2 Related Work
2.1 Single-machine/Single-loss Approaches
The multi-class problem can be treated directly by some methods that exhibit multi-class behaviour off the shelf (e.g., Nearest Neighbours [22][30][6]). However, some of the most powerful methods for binary classification, like Support Vector Machines (SVM) or Adaptive Boosting (AdaBoost), cannot be directly extended to the multi-class case and further development is required. In this sense, the literature is prolific on single-loss strategies to estimate $f$. Among the best-known approaches are the extensions of SVMs [7] to the multi-class case. For instance, the work of Weston and Watkins [55] presents a single-machine extension of the SVM method to cope with the multi-class case, in which predictor functions are trained, constrained with slack variables per sample. However, a more recent adaptation [14] reduces the number of constraints per sample to one, paying only for the second largest classification score among the predictors. To solve the optimization problem a dual decomposition algorithm is derived, which iteratively solves the quadratic programming problem associated with each training sample. Despite these efforts, single-machine approaches to estimate $f$ scale poorly with the number of classes and are often outperformed by simple decompositions [48, 52]. In recent years, various works have extended the classical Adaptive Boosting method [20] to the multi-class setting [51, 43]. In [62] the authors directly extend the AdaBoost algorithm to the multi-class case without reducing it to multiple binary problems, that is, estimating a single $f$ for the whole multi-class problem. This algorithm is based on an exponential loss function for multi-class classification which is optimized with a forward stagewise additive model. Furthermore, the work of Saberian and Vasconcelos [50] presents a derivation of a new margin loss function for multi-class classification together with the set of real class codewords that maximize the presented multi-class margin, yielding boundaries with maximum margin. However, although these methods are consistently derived and supported by strong theoretical results, methodologies that jointly optimize a multi-class loss function present some limitations:
They scale linearly with the number of classes, rendering them unsuitable for problems with a large number of classes.

Due to their single-loss architecture, exploiting parallelization on modern multi-core processors is difficult.

They cannot recover from classification errors on the class predictors.
2.2 Divide and Conquer Approaches
On the other hand, the divide and conquer approach has drawn a lot of attention due to its excellent results and easily parallelizable architecture [48, 52, 2, 18, 46, 4, 40, 28]. In this sense, instead of developing a method to cope with the multi-class case, divide and conquer approaches decouple $f$ into a set of binary problems which are treated separately. Once the responses of the binary classifiers are obtained, a committee strategy is used to find the final output. In this trend one can find three main lines of research: flat strategies, hierarchical classification, and ECOC. Flat strategies like One vs. One [52] and One vs. All [48] use a predefined problem partition scheme followed by a committee strategy to aggregate the binary classifier outputs. Hierarchical classification, in turn, relies on a similarity measure among classes to build a binary tree in which nodes correspond to different problem partitions [23, 40, 28]. Finally, the ECOC framework consists of two steps. In the coding step, a set of binary partitions of the original problem is encoded in a matrix of discrete codewords [16] (uniquely defined, one code per class) (see Figure 2). In the decoding step, a final decision is obtained by comparing the test codeword, formed by joining the binary classifier responses, with every class codeword and choosing the class codeword at minimum distance [17, 61]. The coding step has been widely studied in the literature, yielding three different types of codings: predefined codings [48, 52], random codings [2] and problem-dependent codings for ECOC [18, 46, 4, 57, 24, 58]. Predefined codings like One vs. All or One vs. One are directly embeddable in the ECOC framework. In [2], the authors propose the Dense and Sparse Random coding designs with fixed code lengths of $10\log_2(k)$ and $15\log_2(k)$, respectively, where $k$ is the number of classes. They suggest generating a set of random matrices and selecting the one that maximizes the minimum distance between rows, thus showing the highest correction capability. However, the selection of a suitable code length still remains an open problem.
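As an illustration of the random coding procedure described above, the following minimal Python sketch draws several random matrices with entries in {-1, +1} and keeps the one whose minimum Hamming distance between rows is largest. The function names, the default number of candidates and the seed are assumptions made for this example; the sketch does not check for repeated or constant columns, which a practical design would also need to handle.

```python
import itertools
import random


def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def min_row_distance(matrix):
    """Smallest Hamming distance over all pairs of rows."""
    return min(hamming(r, s) for r, s in itertools.combinations(matrix, 2))


def dense_random_coding(num_classes, code_length, num_candidates=1000, seed=0):
    """Draw random {-1,+1} matrices; keep the one with the largest minimum row distance."""
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    best, best_score = None, -1
    for _ in range(num_candidates):
        candidate = [[rng.choice((-1, 1)) for _ in range(code_length)]
                     for _ in range(num_classes)]
        score = min_row_distance(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    coding, distance = dense_random_coding(num_classes=4, code_length=8, num_candidates=200)
    for row in coding:
        print(row)
    print("minimum row distance:", distance)
```

Note that the selection criterion looks only at the single smallest row distance, which is why this kind of design spreads correction evenly instead of allocating it to specific class pairs.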
2.3 Problem-dependent Strategies
Alternatively, problem-dependent strategies for ECOC have proven to be successful in multi-class classification tasks [57, 23, 24, 58, 18, 60, 59, 46]. A common trend in these works is to exploit information about the multi-class data distribution obtained a priori in order to design a decomposition into binary problems that are easily separable. In that sense, [57] computes a spectral decomposition of the graph Laplacian associated with the multi-class problem. The expected most separable partitions correspond to the thresholded eigenvectors of the Laplacian. However, this approach does not provide any guarantees on defining unequivocal codewords (which is a core property of the ECOC coding framework) or on obtaining a suitable code length. In [24], Gao and Koller propose a method which adaptively learns an ECOC coding by sequentially optimizing a novel multi-class hinge loss function. In an update of their earlier work, Gao and Koller propose in [23] a joint optimization process to learn a hierarchy of classifiers in which each node corresponds to a binary sub-problem that is optimized to be easily separable. Nonetheless, although the hierarchical configuration speeds up the testing step, it is highly prone to error propagation, since node misclassifications cannot be recovered. Finally, the work of Zhao et al. [58] proposes a dual projected gradient method embedded in a constrained concave-convex procedure to optimize an objective composed of a measure of expected problem separability, codeword correlation and regularization terms. In the light of these results, a general trend of recent works is to optimize a measure of binary problem separability in order to induce easily separable sub-problems. This assumption leads to ECOC coding matrices that boost the boundaries of easily separable classes while modeling with low redundancy the ones with most confusion.
2.4 Our approach
In this paper we present the Error-Correcting Factorization (ECF) method for factorizing a design matrix of desired 'error-correcting properties' between classes into a discrete ECOC matrix. The proposed ECF method is a general framework for the ECOC coding step, since the design matrix is a flexible tool for error-correction analysis. In this sense, the problem of designing the ECOC matrix is reduced to defining the design matrix, where higher-level reasoning may be used. For example, following recent state-of-the-art works, one could build a design matrix in a "hard classes are left behind" spirit, boosting the boundaries of easily separable classes and disregarding the classes that are not easily separable. An alternative for building the design matrix is the "no class is left behind" criterion, where we boost those classes that are prone to be confused in the hope of recovering more errors. Note that the design matrix could also directly encode the knowledge of domain experts on the problem, providing great flexibility in the design of the ECOC coding matrix. Figure 2 shows different coding schemes and the real boundaries learned by binary classifiers (SVM with RBF kernel) for a multi-class Toy problem (see Section 5 for further details on the dataset). We can see how the binary problems induced by ECF in Fig. 2(a) boost the boundaries of classes that are prone to be confused, while other approaches that use an equal or higher number of classifiers, like Dense Random [2] in Fig. 2(b) or classic One vs. All designs in Fig. 2(c), fail in this task. The paper is organized as follows: Section 3 introduces the ECOC properties and derives ECF, where we cast the problem of finding an ECOC matrix that follows a certain distribution of correction as a discrete optimization problem. Section 4 presents a discussion of the method, addressing important issues from the point of view of the ECOC framework. Concretely, we derive the optimal problem-dependent code length for ECOCs obtained by means of ECF, which to the best of our knowledge is the first time this question is tackled in the extensive ECOC literature. In addition, we show how ECF converges to a solution with negligible objective value when the design matrix follows certain constraints. Section 5 shows how ECF yields ECOC coding matrices that obtain higher classification performance than state-of-the-art methods with comparable or lower computational complexity. Finally, Section 6 concludes the paper.
3 Methodology
In this section, we review existing properties of the ECOC framework and propose to cast the ECOC coding matrix optimization as a Matrix Factorization problem that can be solved efficiently using a constrained coordinate descent approach.
3.1 Error-Correcting Output Codes
ECOC is a multi-class framework inspired by the error-correcting principles of communication theory [16], which is composed of two different steps: coding [16, 2] and decoding [17, 61]. At the coding step an ECOC coding matrix $\mathbf{X} \in \{-1,+1\}^{k \times l}$ is constructed (see the notation in footnote 1), where $k$ denotes the number of classes in the problem and $l$ the number of bi-partitions (also known as dichotomies) to be learnt. In the coding matrix, the rows ($\mathbf{x}_i$'s, also known as codewords) are unequivocally defined, since these are the identifiers of each category in the multi-class problem. On the other hand, the columns of $\mathbf{X}$ ($\mathbf{x}^j$'s) denote the bi-partitions to be learnt by base classifiers (also known as dichotomizers). Therefore, for a certain column a dichotomizer learns the boundary between classes valued $+1$ and classes valued $-1$. However, [2] introduced a third value, defining ternary-valued coding matrices $\mathbf{X} \in \{-1,+1,0\}^{k \times l}$. In this case, for any given dichotomy, categories can be valued as $+1$ or $-1$ depending on the meta-class they belong to, or $0$ if they are ignored by the dichotomizer. This new value allows the inclusion of well-known decomposition techniques into the ECOC framework, such as One vs. One [52].
Footnote 1: Bold capital letters denote matrices (e.g. $\mathbf{X}$), bold lowercase letters represent vectors (e.g. $\mathbf{x}$). All non-bold letters denote scalar variables. $\mathbf{x}_i$ is the $i$th row of the matrix $\mathbf{X}$. $\mathbf{x}^j$ is the $j$th column of the matrix $\mathbf{X}$. $\mathbf{1}$ is a matrix or vector of all ones of the appropriate size. $x_{ij}$ denotes the scalar in the $i$th row and $j$th column of $\mathbf{X}$. $\|\mathbf{X}\|_F$ denotes the Frobenius norm. $\|\cdot\|_p$ is used to denote the $L_p$-norm. A concatenation operator joins two vectors $\mathbf{x}$ and $\mathbf{y}$. $\mathrm{rank}(\mathbf{X})$ denotes the rank of $\mathbf{X}$. $\leq$ denotes the pointwise inequality.
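To make the coding step concrete, the sketch below builds the two predefined codings mentioned in the text: One vs. All as a binary matrix and One vs. One as a ternary matrix in which ignored classes are valued 0. The function names are hypothetical and the column ordering is an arbitrary choice of this example.

```python
import itertools


def one_vs_all(num_classes):
    """Binary coding: column j separates class j (+1) from all other classes (-1)."""
    return [[1 if i == j else -1 for j in range(num_classes)]
            for i in range(num_classes)]


def one_vs_one(num_classes):
    """Ternary coding: one column per pair (a, b); a is +1, b is -1, the rest are 0."""
    pairs = list(itertools.combinations(range(num_classes), 2))
    return [[1 if i == a else -1 if i == b else 0 for (a, b) in pairs]
            for i in range(num_classes)]


if __name__ == "__main__":
    for name, matrix in (("One vs. All", one_vs_all(3)), ("One vs. One", one_vs_one(3))):
        print(name)
        for codeword in matrix:
            print("  ", codeword)
```

Each row is the codeword of one class and each column is one dichotomy handed to a base classifier.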
At the decoding step a data sample is classified among the $k$ possible categories. In order to perform the classification task, each dichotomizer predicts a binary value indicating which of the two bi-partitions defined by the corresponding dichotomy the sample belongs to. Once the set of predictions is obtained, it is compared to the rows of $\mathbf{X}$ using a distance function known as the decoding function. Usual decoding techniques are based on well-known distance measures such as the Hamming or Euclidean distance. These measures have proved to be effective for binary coding matrices $\mathbf{X} \in \{-1,+1\}^{k \times l}$. Nevertheless, it was not until the work of [17] that decoding functions took into account the meaning of the $0$ value at the decoding step. Generally, the final prediction for the sample is given by the class whose codeword is at minimum distance from the vector of predictions.
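The decoding step for binary codes can be sketched as follows, assuming Hamming decoding and a hand-written 4-class, 7-column coding matrix chosen for this example (it is not taken from the paper). The prediction vector is the codeword of class 2 with its first bit flipped, simulating one dichotomizer error.

```python
def hamming_decode(coding_matrix, predictions):
    """Return (index of the closest codeword, list of Hamming distances)."""
    distances = [sum(1 for c, p in zip(row, predictions) if c != p)
                 for row in coding_matrix]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances


# Illustrative coding matrix: every pair of rows differs in exactly 4 positions.
CODES = [
    [1, 1, 1, 1, 1, 1, 1],
    [-1, -1, -1, -1, 1, 1, 1],
    [-1, -1, 1, 1, -1, -1, 1],
    [-1, 1, -1, 1, -1, 1, -1],
]

if __name__ == "__main__":
    noisy = [1, -1, 1, 1, -1, -1, 1]  # codeword of class 2 with the first bit flipped
    label, distances = hamming_decode(CODES, noisy)
    print("distances:", distances)
    print("predicted class:", label)
```

Because all codewords are at distance 4 from each other, a single wrong dichotomizer still leaves the true class as the unique nearest codeword.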
3.2 Good practices in ECOC
Several works have studied the characteristics of a good ECOC coding matrix [16, 36, 3, 57, 4], which are summed up in the following three properties:

Correction capability: let $\mathbf{H}$ denote a symmetric matrix of Hamming distances among all pairs of rows in $\mathbf{X}$; the correction capability is expressed as $\lfloor (\min(\mathbf{H}) - 1)/2 \rfloor$, considering only off-diagonal values of $\mathbf{H}$ (in the case of ternary codes this correction capability can be easily adapted). In this sense, if $\min(\mathbf{H}) = 3$, ECOC will be able to recover the correct multi-class prediction even if one binary classifier misses its prediction. Note that for $\mathbf{X}$ to be valid all off-diagonal elements of $\mathbf{H}$ should be greater than or equal to one.

Uncorrelated binary sub-problems: the induced binary problems should be as uncorrelated as possible for $\mathbf{X}$ to recover binary classifier errors.

Use of powerful binary classifiers: since the final class prediction consists of the aggregation of bit predictors, accurate binary classifiers are also required to obtain accurate multiclass predictions.
3.3 From global to pairwise correction capability
In the literature, correction capability has been a core objective of problem-dependent designs of $\mathbf{X}$. In this sense, different authors have always agreed on defining the correction capability of an ECOC coding matrix as a global value [16, 2, 36, 57, 23, 25]. Hence, $\min(\mathbf{H})$ is expected to be large in order for $\mathbf{X}$ to recover from as many binary classifier errors as possible. However, since $\mathbf{H}$ expresses the Hamming distance between rows of $\mathbf{X}$, one can alternatively express the correction capability in a pairwise fashion [5], allowing for a deeper understanding of how correction is distributed among codewords. Figure 3 shows an example of the calculation of global and pairwise correction capabilities. Recall that the concatenation operator joins two vectors. Thus, the pairwise correction capability is defined as follows:

The pairwise correction capability of codewords $\mathbf{x}_i$ and $\mathbf{x}_j$ is expressed as $\lfloor (h_{ij} - 1)/2 \rfloor$, that is, the global expression evaluated on the matrix formed by concatenating $\mathbf{x}_i$ and $\mathbf{x}_j$, where we only consider off-diagonal values of $\mathbf{H}$. This means that a sample of class $c_i$ is correctly discriminated from class $c_j$ even if $\lfloor (h_{ij} - 1)/2 \rfloor$ binary classifiers miss their predictions.
Note that although Figure 3 reports a single global correction capability for the coding matrix, there are pairs of codewords with a higher correction capability. In this case the global correction capability as defined in the literature overlooks ECOC coding characteristics that can potentially be exploited. This novel way of expressing the correction capability of an ECOC matrix enables a better understanding of how ECOC coding matrices distribute their correction capability, and gives an insight into how to design coding matrices. In this sense, it is straightforward to demand that the correction capabilities of the ECOC matrix be allocated to those classes that are more prone to error, in order for them to have better recovery behavior (i.e. following a "no class is left behind" criterion). However, recent works [57, 23, 58] have focused on designing a matrix where binary problems are easily separable. This assumption leads to a matrix where classes that are not easily separable show a small Hamming distance between their respective codewords (i.e. following a "hard classes are left behind" scheme).
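The difference between the global and the pairwise view can be checked numerically. The sketch below assumes binary codewords and the usual floor((d - 1) / 2) rule; the coding matrix is a made-up example (it is not the matrix of Figure 3), and the function names are hypothetical.

```python
import itertools


def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def pairwise_capability(a, b):
    """Binary errors that can be absorbed when discriminating these two codewords.

    Returns -1 for identical codewords, which signals an invalid coding matrix.
    """
    return (hamming(a, b) - 1) // 2


def global_capability(matrix):
    """Classical scalar capability: the worst case over all pairs of codewords."""
    return min(pairwise_capability(r, s)
               for r, s in itertools.combinations(matrix, 2))


# Illustrative coding matrix with unevenly distributed row distances.
X = [
    [1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1],
    [-1, -1, -1, -1, -1, 1],
    [1, 1, -1, 1, 1, 1],
]

if __name__ == "__main__":
    print("global capability:", global_capability(X))
    for i, j in itertools.combinations(range(len(X)), 2):
        print(f"classes {i} and {j}: distance {hamming(X[i], X[j])}, "
              f"pairwise capability {pairwise_capability(X[i], X[j])}")
```

In this example the global value is 0 because classes 0 and 3 differ in a single bit, while classes 0 and 2 can still absorb two binary errors; the scalar summary hides that asymmetry.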
In addition to proposing a general method for ECOC coding by means of a design matrix, we explore the effect of focusing the learning effort of our method on those classes that have complex boundaries (i.e., those which show a small inter-class margin). It is important to note that although it is natural to estimate the design matrix from training data, this is not a limitation of ECF: the design matrix can also encode expert knowledge or any other distance measure directly set by the user. Formally, let $\mathbf{X}$ be a coding matrix, let $\mathbf{H}$ be a symmetric matrix of pairwise distances between rows of $\mathbf{X}$ and let $\mathbf{D}$ be a design matrix (e.g. a pairwise distance measure between class codewords). It is natural to require that the ordinal properties of the distance hold in both $\mathbf{H}$ and $\mathbf{D}$. Thus, if the distance between codewords $\mathbf{x}_i$ and $\mathbf{x}_j$ ($d_{ij}$) is required to be larger than the distance between codewords $\mathbf{x}_i$ and $\mathbf{x}_m$ ($d_{im}$), this order should be maintained in $\mathbf{H}$. We then want to find a configuration of $\mathbf{X}$ such that $\mathbf{H}$ matches $\mathbf{D}$ as closely as possible.
Note that the distances in $\mathbf{H}$ can be seen as a function of the dot product of the codewords: for codewords in $\{-1,+1\}^l$, $\mathbf{H} = (l\,\mathbf{1} - \mathbf{X}\mathbf{X}^\top)/2$. Therefore, instead of directly requiring $\mathbf{H}$ to match $\mathbf{D}$, we can equivalently require the product $\mathbf{X}\mathbf{X}^\top$ to match $\mathbf{D}$ [54]. This implies that we can cast the problem of finding $\mathbf{X}$ as a Matrix Factorization problem, where we find an $\mathbf{X}$ such that the matrix of inner products $\mathbf{X}\mathbf{X}^\top$ is closest to $\mathbf{D}$ under a given norm.
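The link between Hamming distances and dot products that motivates the factorization view can be verified directly. For two codewords of length l with entries in {-1, +1}, each agreeing position contributes +1 and each disagreeing position contributes -1 to the dot product, so the distance equals (l - dot) / 2. The following sketch checks this on random codewords; the helper names and the seed are assumptions of the example.

```python
import random


def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def dot(a, b):
    """Inner product of two codewords."""
    return sum(u * v for u, v in zip(a, b))


def hamming_from_dot(a, b):
    """Hamming distance of {-1,+1} codewords recovered from their dot product."""
    return (len(a) - dot(a, b)) // 2


if __name__ == "__main__":
    rng = random.Random(0)
    length = 10
    for _ in range(5):
        a = [rng.choice((-1, 1)) for _ in range(length)]
        b = [rng.choice((-1, 1)) for _ in range(length)]
        print(hamming(a, b), hamming_from_dot(a, b))
```

Since the map from dot product to distance is affine and decreasing, matching the Gramian of the codewords to a target is equivalent to matching their pairwise distances.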
3.4 Error-Correcting Factorization
This section describes the objective function and the optimization strategy for the ECF algorithm.
3.4.1 Objective
Our goal is to find an ECOC coding matrix $\mathbf{X}$ that encodes the properties denoted by the design matrix $\mathbf{D}$. In this sense, ECF seeks a factorization of the design matrix $\mathbf{D}$ into a discrete ECOC matrix $\mathbf{X}$. This factorization is formulated as the quadratic form $\mathbf{X}\mathbf{X}^\top$ that reconstructs $\mathbf{D}$ with minimal Frobenius distance under several constraints, as shown in Equation (1) (recall that the distance is a function of the dot product $\mathbf{X}\mathbf{X}^\top$).
$$\underset{\mathbf{X}}{\text{minimize}} \;\; \|\mathbf{D} - \mathbf{X}\mathbf{X}^\top\|_F^2 \qquad (1)$$
subject to the discrete constraint on the entries of $\mathbf{X}$, the row-correlation constraint bounded by $\mathbf{P}$, and the two column-correlation constraints (2)-(5).
The matrix $\mathbf{X}$ that solves this optimization problem generates the inner product of discrete vectors $\mathbf{X}\mathbf{X}^\top$ that is closest to $\mathbf{D}$ under the Frobenius norm. In order for $\mathbf{X}$ to be a valid matrix under the ECOC framework we constrain it as in Equations (2)-(5). The discrete constraint ensures that in each binary problem every class belongs to one of the two possible meta-classes. In addition, to avoid the case of having two or more equivalent rows in $\mathbf{X}$, the row-correlation constraint ensures that the correlation between rows of $\mathbf{X}$ is less than or equal to a certain user-defined matrix $\mathbf{P}$ (recall that $\mathbf{1}$ denotes a matrix or vector of all ones of the appropriate size when used), where $\mathbf{P}$ encodes the minimum distance between any pair of codewords. $\mathbf{P}$ is a symmetric matrix. Thus, by setting the off-diagonal values of $\mathbf{P}$ we can control the minimum inter-class correction capability. Hence, if we want a given correction capability between rows $\mathbf{x}_i$ and $\mathbf{x}_j$, we set the entry $p_{ij}$ accordingly.
Finally, the column-correlation constraints ensure that the induced binary problems are not equivalent. Similar constraints have been studied thoroughly in the literature [16, 36, 25], defining methods that rely on diversity measures for binary problems to obtain a coding matrix $\mathbf{X}$. These column constraints can be considered soft constraints, since violating them does not imply violating the ECOC properties in terms of row distance. This is easy to show, since a coding matrix $\mathbf{X}$ that induces some equivalent binary problems but satisfies the row-correlation constraint will still have unequivocally defined rows. In this sense, a coding matrix $\mathbf{X}$ can be easily projected onto the set defined by the column constraints by eliminating repeated columns. Thus, the column constraints ensure that uncorrelated binary sub-problems will be defined in our coding matrix $\mathbf{X}$. The discrete constraint on $\mathbf{X}$ makes the optimization problem NP-hard. To overcome this issue, and following [13, 58, 8], we relax the discrete constraint and replace it by $\mathbf{X} \in [-1,+1]^{k \times l}$ in Equation (8).
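Two ingredients of this formulation are easy to sketch: evaluating the squared Frobenius distance between a design matrix and the Gramian of a candidate coding matrix, and projecting a coding matrix by eliminating repeated columns. The code below is a minimal illustration with hypothetical function names; it treats only exact duplicates as repeated columns, which is an assumption of this example, and it does not implement the constrained optimization itself.

```python
def gram(X):
    """Matrix of inner products between all pairs of rows of X."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]


def ecf_objective(D, X):
    """Squared Frobenius distance between the design matrix D and the Gramian of X."""
    G = gram(X)
    return sum((D[i][j] - G[i][j]) ** 2
               for i in range(len(D)) for j in range(len(D)))


def drop_repeated_columns(X):
    """Keep the first occurrence of every column; exact duplicates are removed."""
    seen, kept = set(), []
    for column in zip(*X):
        if column not in seen:
            seen.add(column)
            kept.append(column)
    return [list(row) for row in zip(*kept)]


if __name__ == "__main__":
    X = [[1, 1, -1], [1, 1, 1], [-1, -1, 1]]  # columns 0 and 1 are identical
    D = gram(X)                               # a design matrix X reproduces exactly
    print("objective:", ecf_objective(D, X))
    print("after projection:", drop_repeated_columns(X))
```

Rows that differ before the projection still differ afterwards, because any removed column has an identical copy that is kept; this is the sense in which the column constraints are soft.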
3.4.2 Optimization
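In this section, we detail the process for optimizing $\mathbf{X}$. The minimization problem posed in Equation (1), with the relaxation of the boolean constraint, is non-convex; thus, a solution $\mathbf{X}^*$ is not guaranteed to be a global minimum. In this sense, although gradient descent techniques have been successfully applied in the literature to obtain local minima [49, 35, 1], these techniques do not enjoy the efficiency and scalability properties of other optimization methods applied to Matrix Factorization problems, such as Coordinate Descent [37, 15]. Coordinate Descent techniques have been widely applied in Non-negative Matrix Factorization, obtaining satisfying results in terms of efficiency [34, 31]. In addition, it has been proved that if each of the coordinate sub-problems can be solved exactly, Coordinate Descent converges to a stationary point [29, 53]. Using this result, we decouple the problem in Equation (1) into a set of linear least-squares problems (one for each coordinate). Therefore, if the problem in Equation (1) is to be minimized along the $i$th coordinate of $\mathbf{X}$, we fix all rows of $\mathbf{X}$ except $\mathbf{x}_i$ and write $\mathbf{X}$ as the row $\mathbf{x}_i$ stacked on $\mathbf{X}'$, where $\mathbf{X}'$ denotes the matrix $\mathbf{X}$ after removing the $i$th row. In addition, we partition $\mathbf{D}$ into $d_{ii}$, $\mathbf{d}'_i$ and $\mathbf{D}'$, where $\mathbf{D}'$ denotes the matrix $\mathbf{D}$ after removing the $i$th row and column and $\mathbf{d}'_i$ is the $i$th row of $\mathbf{D}$ without its $i$th entry. Equivalently, we partition $\mathbf{P}$, obtaining the following block decomposition:
$$\underset{\mathbf{x}_i}{\text{minimize}} \;\; \left\| \begin{bmatrix} d_{ii} & \mathbf{d}'_i \\ \mathbf{d}'^{\top}_i & \mathbf{D}' \end{bmatrix} - \begin{bmatrix} \mathbf{x}_i\mathbf{x}_i^\top & \mathbf{x}_i\mathbf{X}'^\top \\ \mathbf{X}'\mathbf{x}_i^\top & \mathbf{X}'\mathbf{X}'^\top \end{bmatrix} \right\|_F^2 \qquad (6)$$
subject to the row-correlation constraint and the relaxed box constraint (7)-(8).
Analyzing the block decomposition in Equation (6) we can see that the only terms involving free variables are $\mathbf{x}_i\mathbf{x}_i^\top$, $\mathbf{x}_i\mathbf{X}'^\top$ and $\mathbf{X}'\mathbf{x}_i^\top$. Thus, since $\mathbf{D}$ and $\mathbf{X}\mathbf{X}^\top$ are symmetric by definition, the minimizer of Equation (6) is the solution to the linear least-squares problem shown in Equation (9):
$$\underset{\mathbf{x}_i}{\text{minimize}} \;\; \|\mathbf{X}'\mathbf{x}_i^\top - \mathbf{d}'^{\top}_i\|_2^2 \qquad (9)$$
subject to $-\mathbf{1} \leq \mathbf{x}_i \leq +\mathbf{1}$ and $\mathbf{X}'\mathbf{x}_i^\top \leq \mathbf{p}'^{\top}_i$ (10)-(11),
where the box constraint is the relaxation of the discrete constraint of Equation (1), and the remaining constraint ensures that the correlation of $\mathbf{x}_i$ with the rest of the rows of $\mathbf{X}$ is below a certain value $\mathbf{p}'_i$. Algorithm 1 shows the complete optimization process.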
To solve the minimization problem in Algorithm 1 we use the Active Set method described in [26], which finds an initial feasible solution by first solving a linear programming problem. Once ECF converges to a solution $\mathbf{X}^*$, we obtain a discretized suboptimal solution by sampling 1000 points that split the interval $[-1,+1]$ and choosing the threshold that minimizes the objective value of the discretized matrix. Finally, we discard repeated columns if any appear (in all our runs of ECF this situation happened only with a small chance).
3.5 Connections to Singular Value Decomposition, Nearest Correlation Matrix and Discrete Basis problems
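Objective functions similar to the one defined in the ECF problem in Equation (1) are found in other contexts, for example, in the Singular Value Decomposition (SVD) problem. The SVD uses the same objective function as ECF, subject to an orthogonality constraint on the factor. However, the solution of the SVD yields an orthogonal basis, which disagrees with the objective defined in Equation (1), where different correlations between the $\mathbf{x}_i$'s are allowed. In addition, we can also find common ground with the Nearest Correlation Matrix (NCM) problem [32, 9, 39]. However, the NCM solution does not yield a discrete factor $\mathbf{X}$; instead it seeks directly the Gramian $\mathbf{X}\mathbf{X}^\top$, where $\mathbf{X}$ is not discrete, as in Equation (12).
$$\underset{\mathbf{G}}{\text{minimize}} \;\; \|\mathbf{D} - \mathbf{G}\|_F^2, \quad \mathbf{G} = \mathbf{X}\mathbf{X}^\top \qquad (12)$$
subject to $\mathbf{G}$ being a valid correlation matrix (13)-(14).
In addition, ECF has similarities with the Discrete Basis Problem (DBP) [42], since the factors are discrete valued. Nevertheless, DBP factorizes $\mathbf{D}$ into the product of two different discrete factors instead of $\mathbf{X}\mathbf{X}^\top$, as shown in Equation (15).
$$\underset{\mathbf{X},\mathbf{Y}}{\text{minimize}} \;\; \|\mathbf{D} - \mathbf{X}\mathbf{Y}\| \qquad (15)$$
subject to $\mathbf{X}$ and $\mathbf{Y}$ being discrete valued (16).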
4 Discussion
In this section we discuss how to ensure that the design matrix $\mathbf{D}$ is valid, as well as how to automatically estimate the code length $l$ for each problem given $\mathbf{D}$. Furthermore, we analyze the convergence of ECF in relation to the order in which the coordinates are updated. Finally, we show that under certain conditions on $\mathbf{D}$, ECF converges to a solution with almost negligible objective value.
4.1 Ensuring a representable design matrix
An alternative interpretation of ECF is that it seeks a discrete matrix $\mathbf{X}$ whose Gramian is closest to $\mathbf{D}$ under the Frobenius norm. However, since $\mathbf{D}$ can be directly set by the user, we need to guarantee that $\mathbf{D}$ is a realizable correlation matrix, that is, $\mathbf{D}$ has to be symmetric and positive semi-definite. In particular, we would like to find the correlation matrix that is closest to $\mathbf{D}$ under the Frobenius norm. This problem has been treated in several works [32, 9, 11, 27], resulting in various algorithms that often use an alternating projections approach. However, for this particular case, in addition to being symmetric and in the Positive Semi-definite (PSD) cone, we also require $\mathbf{D}$ to be scaled to a fixed range. In this sense, to find this matrix we follow an alternating projections algorithm, similar to [32], which is shown in Algorithm 2. We first project $\mathbf{D}$ onto the PSD cone by computing its eigendecomposition and reconstructing the matrix from the non-negative eigenvalues and their eigenvectors. Then, we scale the result to the required range (see Algorithm 2).
4.2 Defining a code length with representation guarantees
The definition of a problem-dependent ECOC code length $l$, that is, choosing the number of binary partitions for a given multi-class task, is a problem that has been overlooked in the literature. For example, predefined coding designs like One vs. All or One vs. One have a fixed code length. On the other hand, coding designs like Dense or Sparse Random codings (which are very often used in experimental comparisons [57, 58, 4, 18]) are suggested [2] to have a code length of $10\log_2(k)$ and $15\log_2(k)$, respectively. These values are arbitrary and unjustified. Additionally, to build a Dense or Sparse Random ECOC matrix one has to generate a set of 1000 matrices and choose the one that maximizes the minimum distance between rows. For the Dense Random coding design, the best-case correction capability of the ECOC matrix is determined by its code length alone, independently of the distribution of the multi-class data. In addition, maximizing the minimum distance leads to an equidistribution of the correction capability over the classes. Other approaches, like Spectral ECOC [57], search for the code length by looking at the best performance on a validation set. Nevertheless, recent works have shown that the code length can be reduced with very small loss in performance if the ECOC coding design is carefully chosen [38] and the classifiers are strong. In this paper, instead of fixing the code length or optimizing it on a validation subset, we derive the optimal length according to matrix rank properties. Considering the rank of a factorization of $\mathbf{D}$ into $\mathbf{X}\mathbf{X}^\top$, there are three different possibilities:

If $\mathrm{rank}(\mathbf{X}\mathbf{X}^\top) = \mathrm{rank}(\mathbf{D})$, we obtain a rank factorization algorithm that should be able to factorize $\mathbf{D}$ with minimal error.

In the case when $\mathrm{rank}(\mathbf{X}\mathbf{X}^\top) < \mathrm{rank}(\mathbf{D})$, we obtain a low-rank factorization method that cannot guarantee to represent $\mathbf{D}$ with zero error, but reconstructs the components of $\mathbf{D}$ that carry the most information.

If $\mathrm{rank}(\mathbf{X}\mathbf{X}^\top) > \mathrm{rank}(\mathbf{D})$, the system is overdetermined and many possible solutions exist.
In general we would like to reconstruct $\mathbf{D}$ with minimal error, and since $\mathrm{rank}(\mathbf{X}\mathbf{X}^\top) = \mathrm{rank}(\mathbf{X}) \leq \min(k, l)$ and $k$ (the number of classes) is fixed, we only have to set the number of columns of $\mathbf{X}$ to control the rank. Hence, by setting $l = \mathrm{rank}(\mathbf{D})$, ECF will be able to factorize $\mathbf{D}$ with minimal error. Figure 4 shows visual results for the ECF method applied to the Traffic and ARFace datasets. Note how the code length required for a full-rank factorization of the Traffic (36 classes) and ARFace (50 classes) datasets is shown in Figures 4(e) and 4(f), respectively.
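The rank-based choice of code length can be illustrated with a small exact-arithmetic sketch. The rank routine below uses Gaussian elimination over fractions from the Python standard library, which is an assumption made to keep the example dependency-free; a design matrix estimated from real data would normally call for a numerical rank with a tolerance instead. All function names are hypothetical.

```python
from fractions import Fraction


def matrix_rank(M):
    """Rank of a matrix computed by exact Gaussian elimination."""
    A = [[Fraction(v) for v in row] for row in M]
    rows, cols = len(A), len(A[0])
    rank = 0
    for c in range(cols):
        # Find a row at or below the current rank with a non-zero entry in column c.
        pivot = next((r for r in range(rank, rows) if A[r][c] != 0), None)
        if pivot is None:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, rows):
            factor = A[r][c] / A[rank][c]
            if factor:
                A[r] = [x - factor * y for x, y in zip(A[r], A[rank])]
        rank += 1
        if rank == rows:
            break
    return rank


def gram(X):
    """Matrix of inner products between all pairs of rows of X."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]


if __name__ == "__main__":
    X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]  # 4 classes encoded with 2 dichotomies
    D = gram(X)                               # design matrix realizable with l = 2
    print("suggested code length:", matrix_rank(D))
```

Here the design matrix is the Gramian of a 4 by 2 coding matrix, so its rank, and therefore the suggested code length, is 2 even though there are 4 classes.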
4.3 Order of Coordinate Updates
Coordinate Descent has been applied to a wide range of problems with satisfying results. However, the problem of choosing the coordinate to minimize at each iteration remains an active research topic [47, 21, 53, 33]. In particular, [44] derives a convergence rate which is faster when coordinates are chosen uniformly at random rather than in a cyclic fashion. Hence, choosing coordinates at random is a suitable choice when the problem shows some of the following characteristics [47]:

Not all data is available at all times.

A randomized strategy is able to avoid a worst-case order of coordinates, and hence might be preferable.

Recent efforts suggest that randomization can improve the convergence rate [44].
However, the structure of ECF is different and calls for a different analysis. In particular, we remark the following points. (i) At each coordinate update of ECF, information about the rest of the coordinates is available. (ii) Since our coordinate updates are solved uniquely, repeating a coordinate update does not change the objective function. (iii) The descent in the objective value when updating a coordinate is maximal when all other coordinates have been updated. These reasons lead us to choose a cyclic update scheme for ECF. In addition, Figure 5 shows two examples in which the cyclic order of coordinates converges faster than the random order, for the Vowel and ARFace problems (refer to Section 5 for further information on the datasets). This behavior is common to all datasets. In particular, note how the cyclic order of coordinates reduces the standard deviation of the objective function, which is denoted by the narrower blue shaded area in Figure 5.
4.4 Approximation Errors and Convergence results when $\mathbf{D}$ is an inner product of binary data
The optimization problem posed by ECF in Equation (1) is non-convex due to the quadratic term $\mathbf{X}\mathbf{X}^\top$, even if the discrete constraint is relaxed. This implies that we cannot guarantee that the algorithm converges to the global optimum. Recall that ECF seeks the term $\mathbf{X}\mathbf{X}^\top$ that is closest to $\mathbf{D}$ under the Frobenius norm. Hence, the error in the approximation can be measured by $\|\mathbf{D} - \mathbf{X}^*\mathbf{X}^{*\top}\|_F$, where $\mathbf{X}^*$ is the local optimal point to which ECF converges. In this sense, we introduce $\mathbf{B}$, which is the matrix of inner products of discrete vectors that is closest to $\mathbf{D}$ under the Frobenius norm. Thus, we expand the norm as in the following equation:
$$\|\mathbf{D} - \mathbf{X}^*\mathbf{X}^{*\top}\|_F = \|\mathbf{D} - \mathbf{B} + \mathbf{B} - \mathbf{X}^*\mathbf{X}^{*\top}\|_F \qquad (17)$$
$$\leq \|\mathbf{B} - \mathbf{X}^*\mathbf{X}^{*\top}\|_F + \|\mathbf{D} - \mathbf{B}\|_F \qquad (18)$$
$$= \epsilon_o + \epsilon_d \qquad (19)$$
The bound in Equation (18) has two components:

The optimization error $\epsilon_o$: measured as the distance between the local optimum to which ECF converges and $\mathbf{B}$, denoted by $\|\mathbf{B} - \mathbf{X}^*\mathbf{X}^{*\top}\|_F$, which is the first term in Equation (18).

The discretization error $\epsilon_d$: computed as $\|\mathbf{D} - \mathbf{B}\|_F$, that is, the distance between $\mathbf{D}$ and the closest inner product of discrete vectors $\mathbf{B}$, which is the second term in Equation (18).
In order to better understand how ECF works we analyze both components separately. To analyze whether ECF converges to a good solution in terms of the Frobenius norm, we set $\mathbf{D} = \mathbf{B}$ by generating a matrix which is the inner product matrix of random discrete vectors, so that all the terms except $\epsilon_o$ are zero. By doing so, we can empirically observe the magnitude of the optimization error $\epsilon_o$. To that end, we run ECF repeatedly on different design matrices of different sizes and calculate the average $\epsilon_o$. Figure 6 shows examples for design matrices of different sizes. In Figure 6 we can see how ECF converges to a solution with almost negligible optimization error after 15 iterations. In fact, the average objective value over all runs of ECF on different $\mathbf{D}$'s after 15 update cycles (coordinate updates for all $\mathbf{x}_i$'s) is almost negligible. This implies that ECF converges on average to a point with almost negligible objective value, and that when it is applied to $\mathbf{D}$'s which are not computed from binary components the main source of the approximation error is the discretization error $\epsilon_d$. Since ECF seeks a discrete decomposition of $\mathbf{D}$, this discretization error is unavoidable, and, as we have seen empirically, the optimization error of ECF is almost negligible on average.
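The experimental setup above, together with the threshold-based discretization of a relaxed solution, can be sketched as follows. The design matrix is built as the Gramian of random discrete vectors, so its discretization error is zero by construction. The relaxed solution is simulated here by shrinking the true factor towards zero, which is only a stand-in for the output of the actual optimizer; the threshold grid over [-1, +1] and all function names are assumptions of this example.

```python
import random


def gram(X):
    """Matrix of inner products between all pairs of rows of X."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]


def frobenius_sq(A, B):
    """Squared Frobenius distance between two matrices of equal size."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_design_matrix(num_classes, code_length, seed=0):
    """Design matrix that is exactly an inner product of random {-1,+1} vectors."""
    rng = random.Random(seed)
    X = [[rng.choice((-1, 1)) for _ in range(code_length)]
         for _ in range(num_classes)]
    return gram(X), X


def discretize(X_relaxed, D, num_thresholds=1000):
    """Try evenly spaced thresholds in [-1, +1]; keep the best discretized matrix."""
    best_X, best_err = None, float("inf")
    for t in range(num_thresholds):
        threshold = -1 + 2 * (t + 0.5) / num_thresholds
        X_d = [[1 if v > threshold else -1 for v in row] for row in X_relaxed]
        err = frobenius_sq(D, gram(X_d))
        if err < best_err:
            best_X, best_err = X_d, err
    return best_X, best_err


if __name__ == "__main__":
    D, X_true = random_design_matrix(num_classes=6, code_length=5, seed=1)
    X_relaxed = [[0.3 * v for v in row] for row in X_true]  # simulated relaxed solution
    X_d, err = discretize(X_relaxed, D)
    print("objective after discretization:", err)
```

Because the simulated relaxed solution keeps the signs of the true factor, a threshold near zero recovers a discrete matrix whose Gramian equals the design matrix, giving a zero objective value.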
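[Figure 6: (a) to (c) ECF optimization error for design matrices of three different sizes; (d) synthetic Toy dataset.]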
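The realizable setting used above to isolate the optimization error can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codewords are taken in {-1, +1}, the sizes (10 classes, 15 bits) are arbitrary, and the helper names `random_code`, `gram` and `frobenius_error` are hypothetical. The sketch does not implement ECF itself; it only builds a design matrix that is exactly the inner product matrix of random discrete vectors and measures the Frobenius error of a candidate code against it.

```python
import math
import random


def random_code(n_classes, n_bits, rng):
    """Draw one random codeword in {-1, +1}^n_bits per class (assumed alphabet)."""
    return [[rng.choice((-1, 1)) for _ in range(n_bits)] for _ in range(n_classes)]


def gram(codes):
    """Matrix of inner products between every pair of codewords."""
    return [[sum(a * b for a, b in zip(u, v)) for v in codes] for u in codes]


def frobenius_error(design, codes):
    """Frobenius distance between a design matrix and the Gram matrix of `codes`."""
    g = gram(codes)
    return math.sqrt(sum((d - x) ** 2
                         for d_row, g_row in zip(design, g)
                         for d, x in zip(d_row, g_row)))


rng = random.Random(0)
true_codes = random_code(n_classes=10, n_bits=15, rng=rng)
design = gram(true_codes)  # realizable target: an exact discrete factorization exists
other_codes = random_code(n_classes=10, n_bits=15, rng=rng)

print(frobenius_error(design, true_codes))   # 0.0 by construction
print(frobenius_error(design, other_codes))  # error of an unrelated random code
```

Because the target is realizable, any residual error reported by a factorization method on `design` can be attributed to the optimization rather than to discretization.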
5 Experiments
In this section we present the experimental results of the proposed Error-Correcting Factorization method. To do so, we first present the data, methods and settings.
5.1 Data
The proposed Error-Correcting Factorization method was applied to a total of 8 datasets. In order to provide a deep analysis and understanding of the method, we synthetically generated a Toy problem consisting of 14 classes, where each class contained two-dimensional points sampled from a Gaussian distribution with the same standard deviation but different means. Figure 6(d) shows the synthetically generated multiclass data, where each color corresponds to a different category. We selected five well-known UCI datasets: Glass, Segmentation, Ecoli, Yeast and Vowel, which range in complexity and number of classes. Finally, we apply the classification methodology to two challenging computer vision categorization problems. First, we test the methods on a real traffic sign categorization problem consisting of 36 traffic sign classes. Second, 50 classes from the ARFaces [41] dataset are classified using the present methodology. These datasets are available upon request to the authors. Table I shows the characteristics of the different datasets.
Traffic sign categorization: We test ECF on a real traffic sign categorization problem of 36 classes [10]. The dataset contains a total of 3481 samples of size 32×32, filtered using the Weickert anisotropic filter, masked to exclude the background pixels, and equalized to prevent the effects of illumination changes. These feature vectors are then projected into a 100-dimensional feature vector by means of PCA. A visual sample is shown in Figure 7(a).
ARFaces classification: The ARFace database [41] is composed of 26 face images for each of 126 different subjects (of which 50 are selected), portraying different expressions and complements. An example is shown in Figure 7(b).
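Table I: Characteristics of the datasets (#s: number of samples, #f: number of features, #c: number of classes).
|    | Glass | Segment. | Ecoli | Yeast | Vowel | Toy | Traffic | ARFace |
|----|-------|----------|-------|-------|-------|-----|---------|--------|
| #s | 214   | 2310     | 336   | 1484  | 990   | 400 | 3481    | 1300   |
| #f | 9     | 19       | 8     | 8     | 10    | 2   | 100     | 120    |
| #c | 7     | 7        | 8     | 10    | 11    | 14  | 36      | 50     |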
[Figure 7: (a) traffic sign sample; (b) ARFaces sample.]
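As an illustration of the kind of synthetic Toy problem described in Section 5.1, the following sketch samples two-dimensional Gaussian classes that share a standard deviation but have different means. The number of classes (14) follows Table I; the number of points per class, the standard deviation, the way the means are drawn, and the function name `make_toy` are assumptions made only for this example and are not taken from the experiments.

```python
import random


def make_toy(n_classes, points_per_class, std, spread, seed=0):
    """Sample 2-D Gaussian blobs with a shared std and one mean per class."""
    rng = random.Random(seed)
    samples, labels = [], []
    for c in range(n_classes):
        # Hypothetical choice: class means drawn uniformly in a square of side 2*spread.
        mean = (rng.uniform(-spread, spread), rng.uniform(-spread, spread))
        for _ in range(points_per_class):
            samples.append((rng.gauss(mean[0], std), rng.gauss(mean[1], std)))
            labels.append(c)
    return samples, labels


# Assumed sizes for illustration only.
X, y = make_toy(n_classes=14, points_per_class=25, std=1.0, spread=10.0)
print(len(X), len(set(y)))
```

With randomly placed means, some class pairs overlap more than others, which is the situation in which allocating error correction per class pair becomes relevant.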
5.2 Methods and settings
We compared the proposed Error-Correcting Factorization method with the standard predefined One vs. All (OVA) and One vs. One (OVO) approaches [48, 52]. In addition, we introduce two random designs for ECOC matrices. In the first one, we generated random ECOC coding matrices fixing the general correction capability to a certain value (RAND). In the second, we generated a Dense Random coding matrix [3] (DENSE). These comparisons enable us to analyze the effect of reorganizing the inter-class correcting capabilities of an ECOC matrix. In order to compare our proposal with state-of-the-art methods, we also used the Spectral ECOC (SECOC) method [57] and the Relaxed Hierarchy (RH) [23]. Finally, we propose two different flavors of ECF: ECFH and ECFE. In ECFH we compute the design matrix in order to allocate the correction capabilities on those classes that are hard to discriminate. On the other hand, for ECFE we compute allocating correction to those classes that are easy to discriminate. The design matrix is computed as the Mahalanobis distance between each pair of classes. Although there exist a number of approaches to define from data [23, 58, 57], e.g., the margin between each pair of classes (after training a One vs. One SVM classifier), we experimentally observed that the Mahalanobis distance provides good generalization and avoids the computational cost of training a One vs. One SVM classifier.
All the reported classification accuracies are the mean of a stratified fold cross-validation on the aforementioned datasets. For all methods we used an SVM classifier with RBF kernel. The parameters and were tuned by cross-validation on a validation subset of the data using an inner fold cross-validation. The parameter was tuned by grid search on a log sampling in the range , and the parameter was equivalently tuned on an equidistant linear sampling in the range . We used the libsvm implementation available at [12].
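To make the inter-class distance concrete, the sketch below computes a pairwise Mahalanobis distance between class means for two-dimensional data. It is a minimal illustration, not the exact procedure used in the experiments: it assumes a single pooled covariance shared by all classes, restricts itself to 2-D points so that the covariance can be inverted in closed form, and uses hypothetical helper names. How these distances are then turned into the design matrices of ECFH and ECFE is not shown.

```python
import math


def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def pooled_cov2(groups):
    """Pooled 2x2 covariance over all classes, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    dof = 0
    for pts in groups.values():
        mx, my = mean2(pts)
        for x, y in pts:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
        dof += len(pts) - 1
    return sxx / dof, sxy / dof, syy / dof


def mahalanobis_matrix(groups):
    """Pairwise Mahalanobis distance between class means (assumed pooled covariance)."""
    sxx, sxy, syy = pooled_cov2(groups)
    det = sxx * syy - sxy * sxy
    labels = sorted(groups)
    means = {c: mean2(groups[c]) for c in labels}
    dist = {}
    for a in labels:
        for b in labels:
            dx = means[a][0] - means[b][0]
            dy = means[a][1] - means[b][1]
            # Quadratic form (dx, dy) S^{-1} (dx, dy)^T using the closed-form 2x2 inverse.
            q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
            dist[a, b] = math.sqrt(max(q, 0.0))
    return dist


# Hypothetical three-class example: "b" is closer to "a" than "c" is.
groups = {
    "a": [(0, 0), (1, 0), (0, 1), (1, 1)],
    "b": [(4, 0), (5, 0), (4, 1), (5, 1)],
    "c": [(0, 6), (1, 6), (0, 7), (1, 7)],
}
dist = mahalanobis_matrix(groups)
print(dist["a", "b"], dist["a", "c"])
```

Unlike pairwise margins, this quantity needs only class means and a covariance estimate, so no binary classifier has to be trained to obtain it.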
For both ECFH and ECFE we run the factorization forcing different minimum distances between classes by setting . For the Relaxed Hierarchy method [23] we used values for . In all the compared methods that use a decoding function (i.e., all tested methods except the one in [23]) we used both Hamming Decoding (HD) and Loss-Weighted Decoding (LWD) [46].
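For readers less familiar with ECOC decoding, Hamming decoding can be sketched as follows: the vector of binary classifier outputs is assigned to the class whose codeword is closest in Hamming distance. The 4-class, 5-bit coding matrix `M` and the function name `hamming_decode` are hypothetical and chosen only so that the minimum distance between codewords is 3, which lets the code correct a single classifier error. Loss-Weighted decoding, which additionally weights each dichotomy, is not shown.

```python
def hamming_decode(coding_matrix, predictions):
    """Return the index of the codeword closest to `predictions` in Hamming distance."""
    def distance(codeword):
        return sum(1 for c, p in zip(codeword, predictions) if c != p)
    return min(range(len(coding_matrix)), key=lambda i: distance(coding_matrix[i]))


# Hypothetical binary code: one row per class, one column per binary classifier.
M = [
    [+1, +1, +1, -1, -1],
    [+1, -1, -1, +1, -1],
    [-1, +1, -1, -1, +1],
    [-1, -1, +1, +1, +1],
]

# Codeword of class 1 with its last bit flipped by a classifier error.
noisy = [+1, -1, -1, +1, +1]
print(hamming_decode(M, noisy))  # 1
```

The redundancy between rows is what allows the single error in `noisy` to be corrected; the pairwise distances between rows determine which class pairs tolerate more errors.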
5.3 Experimental Results
In Figure 8 we show the multiclass classification accuracy as a function of the relative computational complexity for all datasets, using both Hamming Decoding (HD) and Loss-Weighted Decoding (LWD). We used non-linear SVM classifiers and we define the relative computational complexity as the number of unique Support Vectors (SVs) yielded by each method, as in [23]. For visualization purposes we use an exponential scale and normalize the number of SVs by the maximum number of SVs obtained by a method in that particular dataset. In addition, although the code length cannot be considered an accurate measure of complexity when using non-linear classifiers in the feature space, it is the only measure of complexity that is available prior to learning the binary problems and designing the coding matrix. In this sense, we show in Figure 9 the classification results for all datasets as a function of the code length , using both HD and LWD. Figures 8 and 9 show that the proposed ECFH obtains, in most cases, better performance than state-of-the-art approaches even with reduced computational complexity. In addition, in most datasets ECFH is able to boost the boundaries of those classes prone to error; as a result, it attains higher classification accuracies than the rest of the methods at the price of a small increase in the relative computational complexity. Specifically, on the Glass, Vowel, Yeast, Segmentation and Traffic datasets (Figures 8(e)-(n) and 9(e)-(n)), the proposed method outperforms the rest of the approaches while yielding a comparable or even lower computational complexity, independently of the decoding function used. We can also see that the RAND and ECFE methods present erratic behaviours.
This is expected for the random coding design, since increasing the number of SVs or dichotomies does not imply an increase in performance if the dichotomies are not carefully selected. On the other hand, the reason why ECFE is not stable is not completely straightforward. ECFE focuses its design on dichotomies that are very easy to learn, allocating correction to those classes that are separable. We hypothesize that when these dichotomies become harder to learn (there exists a limited number of easily separable partitions), the addition of a difficult dichotomy harms performance by adding confusion to previously learned dichotomies until proper error correction is allocated. In contrast, ECFH usually shows a more stable behaviour, since it focuses on categories that are prone to be confused. In this sense, we expect that the addition of dichotomies will increase the correction. Finally, it is worth noting that the Spectral ECOC method yields a code length of , corresponding to the full eigendecomposition. Our proposal defines coding matrices that are guaranteed to follow the design denoted by , fulfilling ECOC properties.
[Figure 8: Multiclass classification accuracy as a function of the relative computational complexity (normalized number of SVs). Panels: (a) Toy HD, (b) Toy LWD, (c) Ecoli HD, (d) Ecoli LWD, (e) Glass HD, (f) Glass LWD, (g) Vowel HD, (h) Vowel LWD, (i) Yeast HD, (j) Yeast LWD, (k) Segmentation HD, (l) Segmentation LWD, (m) Traffic HD, (n) Traffic LWD, (o) ARFace HD, (p) ARFace LWD.]
[Figure 9: Multiclass classification accuracy as a function of the code length, with the same panel layout as Figure 8.]
[Figure 10: Comparison of classification accuracy over all datasets against the best performing competing method. Panels (a), (d): ECFH; (b), (e): ECFE; (c), (f): OVA.]
As a summary, we show in Figure 10 a comparison in terms of classification accuracy for different methods over all datasets. We compare the classification accuracy of a selected method for both decodings (at different operating complexities if available) versus the best performing method in a range of of the operating complexity. For consistency we show the comparison using both the number of SVs and the number of dichotomies as the computational complexity. If the compared method dominates in most of the datasets, it will be found above the diagonal. In Figures 10(a) and 10(d) we compare ECFH with the best performing of the remaining methods and see that ECFH outperforms the rest of the methods of the times depending on the complexity measure. This implies that ECFH dominates most of the methods in terms of performance by focusing on those classes that are more prone to error, regardless of the complexity measure used (number of SVs or number of dichotomies). In addition, when repeating the comparison for ECFE in Figures 10(b) and 10(e) we see that the majority of the datasets are clearly below the diagonal (ECFE is the most suitable choice of times). Finally, Figures 10(c) and 10(f) show the comparison for OVA, which is a standard method often defended for its simplicity [48]. We clearly see that it never outperforms any method and that it is not the recommended choice for almost any dataset. In Table II we show the percentage of wins for all methods, in increasing order of complexity averaged over all datasets. (The RH method [23] is far less complex than the compared methods; however, we compare it to the closest operating complexity of each of the other methods.) Note how ECFH, denoted by H in the table, although being the third least complex method, outperforms by far the rest of the methods with an improvement of at least in the worst case.
In conclusion, the experimental results show that ECFH yields ECOC coding matrices which obtain comparable or even better results than state-of-the-art methods with similar relative complexity. Furthermore, by allowing a small increase in computational complexity with respect to state-of-the-art methods, ECF is able to obtain better classification results by boosting the boundaries of classes that are prone to be confused.
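Table II: Percentage of wins for all methods, in increasing order of complexity averaged over all datasets.
| Method      | RH   | S    | H    | E    | D    | R    | A    | O    |
|-------------|------|------|------|------|------|------|------|------|
| Win SVs     | 0.0  | 22.5 | 62.1 | 10.3 | 50.0 | 5.7  | 14.2 | 25.0 |
| Win nclass. | 0.0  | 48.5 | 70.0 | 17.5 | 25.0 | 6.9  | 12.5 | 16.6 |
| Avg. Comp.  | 0.58 | 0.87 | 0.88 | 0.89 | 0.91 | 0.92 | 0.99 | 0.99 |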
6 Conclusions
We presented the Error-Correcting Factorization method for multiclass learning, which is based on the Error-Correcting Output Codes framework. The proposed method factorizes a design matrix of desired correction properties into a discrete error-correcting component consistent with the design matrix. ECF is a general method for building an ECOC multiclass classifier with desired properties, which can be either directly set by the user or obtained from data using a priori inter-class distances. We note that the proposed approach is not a replacement for ECOC codings, but a generalized framework to build ECOC matrices that follow a certain error-correcting design criterion. The Error-Correcting Factorization is formulated as a minimization problem which is optimized using a constrained Coordinate Descent, where the minimizer of each coordinate is the solution to a least-squares problem with box and linear constraints that can be efficiently solved. By analyzing the approximation error, we empirically show that although ECF is a non-convex optimization problem, the optimization is very efficient. We performed experiments using ECF to build ECOC matrices following the common trend in state-of-the-art works, in which the design matrix prioritizes the most separable classes. In addition, we hypothesized and showed that a more beneficial strategy is to allocate the correction capability of the ECOC to those categories which are more prone to confusion. Experiments show that when ECF is used to allocate the correction capabilities to those classes which are prone to confusion, we obtain higher accuracies than state-of-the-art methods with efficient models in terms of the number of Support Vectors and dichotomies.
Finally, there still exist open questions that require a deeper analysis in future work. The results obtained raise a fair doubt regarding the right allocation of error-correcting power in several methods found in the literature, where ECOC designs are based on the premise of boosting the classes which are easily separable. In the light of these results, we may conjecture that a careful allocation of error correction must balance two aspects: on the one hand, simple-to-classify boundaries must be handled properly; on the other hand, error correction must be allocated to difficult classes so that the ensemble can correct possible mistakes. In addition, it would be interesting to study which parameters affect the suitability of the "no class is left behind" and the "hard classes are left behind" strategies. Finally, we could consider ternary matrices and further regularizations.
References
[1] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David J Kriegman, and Serge Belongie. Generalized nonmetric multidimensional scaling. In ICAIS, pages 11–18, 2007.
[2] E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Journal of Machine Learning Research, volume 1, pages 113–141, 2002.
[3] Erin L. Allwein, Robert E. Schapire, Yoram Singer, and Pack Kaelbling. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
[4] Miguel Ángel Bautista, Sergio Escalera, Xavier Baró, and Oriol Pujol. On the design of an ECOC-compliant genetic algorithm. Pattern Recognition, 47(2):865–884, 2013.
[5] Miguel Bautista, Oriol Pujol, Xavier Baró, and Sergio Escalera. Introducing the separability matrix for error correcting output codes coding. MCS, pages 227–236, 2011.
[6] Anna Bosch, Andrew Zisserman, and Xavier Muñoz. Image classification using random forests and ferns. In ICCV 2007, pages 1–8. IEEE, 2007.
[7] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152. ACM, 1992.
[8] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2009.
[9] Stephen Boyd and Lin Xiao. Least-squares covariance matrix adjustment. SIAM Journal on Matrix Analysis and Applications, 27(2):532–546, 2005.
[10] J. Casacuberta, J. Miranda, M. Pla, S. Sanchez, A. Serra, and J. Talaya. On the accuracy and performance of the GeoMobil system. In International Society for Photogrammetry and Remote Sensing, 2004.
[11] Lawrence Cayton and Sanjoy Dasgupta. Robust euclidean embedding. In ICML, pages 169–176. ACM, 2006.
[12] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In NIPS, volume 13, page 437. MIT Press, 2001.
[14] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.
[15] Fernando De la Torre. A least-squares framework for component analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(6):1041–1055, 2012.
[16] T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. In Journal of Artificial Intelligence Research, volume 2, pages 263–286, 1995.
[17] S. Escalera, O. Pujol, and P. Radeva. On the decoding process in ternary error-correcting output codes. Transactions in Pattern Analysis and Machine Intelligence, 99(1), 2009.
[18] Sergio Escalera, David MJ Tax, Oriol Pujol, Petia Radeva, and Robert PW Duin. Subclass problem-dependent design for error-correcting output codes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(6):1041–1054, 2008.
[19] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785. IEEE, 2009.
[20] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In COLT, pages 23–37. Springer, 1995.
[21] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[22] Keinosuke Fukunaga and Thomas E Flick. An optimal global nearest neighbor metric. Pattern Analysis and Machine Intelligence, Transactions on, (3):314–318, 1984.
[23] Tianshi Gao and Daphne Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2072–2079. IEEE, 2011.
[24] Tianshi Gao and Daphne Koller. Multiclass boosting with hinge loss based on output coding. In ICML, pages 569–576, 2011.
[25] N. Garcia-Pedrajas and C. Fyfe. Evolving output codes for multiclass problems. Evolutionary Computation, IEEE Transactions on, 12(1):93–106, 2008.
[26] Phil Gill. Numerical linear algebra and optimization. 2007.
[27] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265–2295, 2007.
[28] Gregory Griffin and Pietro Perona. Learning and using taxonomies for fast visual categorization. In CVPR, pages 1–8. IEEE, 2008.
[29] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.
[30] YX Gu, Qing Ren Wang, and Ching Y Suen. Application of a multilayer decision tree in computer recognition of Chinese characters. Pattern Analysis and Machine Intelligence, Transactions on, (1):83–89, 1983.
[31] Naiyang Guan, Dacheng Tao, Zhigang Luo, and Bo Yuan. NeNMF: an optimal gradient method for nonnegative matrix factorization. Signal Processing, IEEE Transactions on, 60(6):2882–2898, 2012.
[32] Nicholas J Higham. Computing the nearest correlation matrix—a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.
[33] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for nonnegative matrix factorization. In ACM SIGKDD, pages 1064–1072. ACM, 2011.
[34] Hyunsoo Kim and Haesun Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.
[35] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964.
[36] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.
[37] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.
[38] Ana C. Lorena and André C. P. L. F. Carvalho. Evolutionary design of multiclass support vector machines. Journal of Intelligent Fuzzy Systems, 18:445–454, October 2007.
[39] Jérôme Malick. A dual approach to semidefinite least-squares problems. SIAM Journal on Matrix Analysis and Applications, 26(1):272–284, 2004.
[40] Marcin Marszalek and Cordelia Schmid. Constructing category hierarchies for visual recognition. In ECCV, pages 479–491. Springer, 2008.
[41] A. Martinez and R. Benavente. The AR face database. In Computer Vision Center Technical Report #24, 1998.
[42] Pauli Miettinen, Taneli Mielikainen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. Knowledge and Data Engineering, IEEE Transactions on, 20(10):1348–1362, 2008.
[43] Indraneel Mukherjee and Robert E Schapire. A theory of multiclass boosting. The Journal of Machine Learning Research, 14(1):437–497, 2013.
[44] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[45] Devi Parikh and Kristen Grauman. Relative attributes. In ICCV, pages 503–510. IEEE, 2011.
[46] O. Pujol, P. Radeva, and J. Vitrià. Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. In Pattern Analysis and Machine Intelligence, IEEE Transactions on, volume 28, pages 1001–1007, 2006.
[47] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
[48] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
[49] Douglas LT Rohde. Methods for binary multidimensional scaling. Neural Computation, 14(5):1195–1232, 2002.
[50] Mohammad J Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. In NIPS, pages 2124–2132, 2011.
[51] Robert E Schapire. Using output codes to boost multiclass learning problems. In ICML, volume 97, pages 313–321, 1997.
[52] T. Hastie and R. Tibshirani. Classification by pairwise grouping. NIPS, 26:451–471, 1998.
[53] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[54] Yair Weiss, Rob Fergus, and Antonio Torralba. Multidimensional spectral hashing. In ECCV 2012, pages 340–353. Springer, 2012.
[55] Jason Weston, Chris Watkins, et al. Support vector machines for multiclass pattern recognition. In ESANN, volume 99, pages 219–224, 1999.
[56] Felix X Yu, Liangliang Cao, Rogerio S Feris, John R Smith, and Shih-Fu Chang. Designing category-level attributes for discriminative visual recognition. In CVPR, pages 771–778. IEEE, 2013.
[57] Xiao Zhang, Lin Liang, and Heung-Yeung Shum. Spectral error correcting output codes for efficient multiclass recognition. In ICCV, pages 1111–1118, Sept 2009.
[58] Bin Zhao and Eric P Xing. Sparse output coding for large-scale visual recognition. In CVPR, pages 3350–3357. IEEE, 2013.
[59] Guoqiang Zhong and Mohamed Cheriet. Adaptive error-correcting output codes. In IJCAI, pages 1932–1938. AAAI Press, 2013.
[60] Guoqiang Zhong, Kaizhu Huang, and Cheng-Lin Liu. Joint learning of error-correcting output codes and dichotomizers from data. Neural Computing and Applications, 21(4):715–724, 2012.
[61] Jin Deng Zhou, Xiao Dan Wang, Hong Jian Zhou, Jie Ming Zhang, and Ning Jia. Decoding design based on posterior probabilities in ternary error-correcting output codes. Pattern Recognition, 45(4):1802–1818, 2012.
[62] Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. Multiclass adaboost. Statistics and its Interface, 2(3):349–360, 2009.