1 Introduction
Convolutional Neural Networks lie at the core of the latest breakthroughs in large-scale image recognition [35, 30], at present even surpassing human performance [20], applied to the classification of objects [15], faces [31], or scenes [51]. Due to its effectiveness and simplicity, one-hot encoding is still the most prevalent procedure for addressing such multi-class classification tasks: in essence, a function $f : \mathcal{X} \rightarrow \mathcal{Y}$ is modeled that maps image samples to a probability distribution over the discrete set $\mathcal{Y}$ of labels of target categories. Unfortunately, when the output space grows, class labels do not properly span the full label space, mainly due to existing label cross-correlations. Consequently, one-hot encoding might be inadequate for fine-grained classification tasks, since the projection of the outputs into a higher-dimensional (orthogonal) space dramatically increases the parameter count of the computed models. In addition, for datasets with a large number of labels, the ratio of samples per label is typically reduced. This constitutes an additional challenge for training CNN models in large output spaces, and a reason for slow convergence rates [40].
In order to address the aforementioned limitations, output embeddings have been proposed as an alternative to one-hot encoding for training in large output spaces [7]: depending on the specific classification task at hand, different output embeddings capture different aspects of the structure of the output space. Indeed, since embeddings use weight sharing during training to find simpler (and more natural) partitions of classes, the latent relationships between categories are included in the modeling process.
According to Akata et al. [2], output embeddings can be categorized as: (i) embeddings based on a priori information, like attributes [1] or hierarchies [39]; unfortunately, learning from attributes requires expert knowledge or extra labeling effort, hierarchies require a prior understanding of a taxonomy of classes, and approaches that use textual data as a prior do not guarantee visual similarity [16]; and (ii) learned embeddings, which capture the semantic structure of word sequences (i.e., annotations) and images jointly [43]; their main drawbacks are the need for a large amount of data and slow training.
Thus, in cases where high-quality attributes exist, methods with prior information are preferred, while for a known equidistant label space, data-independent embeddings are a more suitable alternative. Unfortunately, the architectural design of a model is bound to the particular choice among the above-mentioned embeddings. Thus, once a model is chosen and trained using a specific output embedding, it is hard to reuse it for other tasks requiring a different type of embedding.
In this paper, Error-Correcting Output Codes (ECOC) are shown to be a better alternative to one-hot encoding for image recognition: ECOCs are a generalization of the three embedding categories [14], so a change in the ECOC matrix does not require a change in the chosen architecture. In addition, ECOCs naturally enable error correction, low-dimensional embedding spaces [6], and bias and variance error reduction [25]. Inspired by the latest advances on ECOCs, we circumvent one-hot encoding by integrating Error-Correcting Output Codes into CNNs as a generalization of output embeddings. As a result, a best-of-both-worlds approach is proposed: compact outputs, data-based hierarchies, and error correction. Using our approach, training models in low-dimensional spaces drastically improves convergence speed in comparison to one-hot encoding. Figure 1 shows an overview of the proposed model.
The rest of the paper is organized as follows. Section 2 reviews the existing work most closely related to this paper. Section 3 presents the contribution of the proposed embedding technique, which is twofold: (i) we show that random projections of the label space are suitable for finding useful lower-dimensional embeddings, while dramatically boosting convergence rates at zero computational cost; and (ii) in order to generate partitions of the label space that are more discriminative than the random encoding (which generates random partitions of the label space), we also propose a normalized eigen-representation of the class manifold to encode the targets with minimal information loss, thus improving the accuracy of random-projection encoding while enjoying the same convergence rates. Subsequently, the experimental results on CIFAR-100 [26], CUB-200-2011 [41], MIT Places [51], and ImageNet [35] presented in Section 4 show that our approach drastically improves convergence speed while maintaining competitive accuracy. Lastly, Section 5 concludes the paper, discussing how, when gradient sparsity on the output neurons is highly reduced, more robust gradient estimates and better representations can be found.
2 Related work
This section reviews the works on output embeddings most related to ours, in particular those using ECOC.
Output Embeddings
Most of the related literature addresses the challenge of zero-shot learning, i.e., training a classifier in the absence of labels. Often, the proposed approaches take into account the attributes of objects [49, 34, 24, 2], relating the different classes through well-known, shared object features. Due to their computational efficiency based on a divide-and-conquer strategy, output embeddings have also proven useful for those multi-class classification problems in which testing all possible class labels and hierarchical structures is not feasible [4, 41, 43, 7]. Given a large output space, most labels can be considered instances of a superior category, e.g., sunflower and violet are flowering plants. In this sense, the inherent hierarchical structure of the data makes divide-and-conquer hierarchical output spaces a suitable alternative to the traditionally flat 1-of-N classifiers. Likewise, in the context of language processing, Mikolov et al. combine Huffman binary codes and a hierarchical softmax in order to map the most frequent codes to shorter paths in a tree [32].
Because output embeddings enforce weight sharing, they have also been used when the number of classes is rather large, with no clear inter-class boundaries and a decaying ratio of examples per class. In this context, in order to reduce the output space, Weston et al. proposed WSABIE, an online learning-to-rank algorithm that finds an embedding for the labels based on images [44].
In the field of large-scale recognition, hierarchical approaches such as tree-based priors [38], label relation graphs [11], CNN hierarchies [46], and HD-CNNs [47] have been proposed. For example, in [29] binary hash codes are used for fast image retrieval. However, such hierarchical approaches need to be learned, and cannot easily be interchanged with other embeddings. In addition, for approaches learning codes as latent variables, finding codes that are optimal in terms of class separability or error correction is not guaranteed [29]. Due to all this, ECOC constitute a better alternative for seamless integration with CNNs, as detailed next.
Error-Correcting Output Codes. We use the standard notation in ECOCs: bold capital letters denote matrices (e.g., $\mathbf{M}$), bold lowercase letters represent vectors (e.g., $\mathbf{m}$), and all non-bold letters denote scalar variables. ECOC have been applied in multiple fields such as medical imaging [5], face and facial-feature recognition [45, 37], and segmentation of human limbs [sanchez2015hupba8k+]. ECOCs are a generic divide-and-conquer framework that combines binary partitions to achieve multi-class recognition [12]. Their core property is the capability to correct errors of binary classifiers using redundancy, while reducing the bias and variance of the ensemble [25]. Advanced approaches propose to use them as intermediate representations [23]. ECOC consist of two main steps: coding and decoding. The coding step consists in assigning a codeword of arbitrary length to each of the classes. Codewords are organized in a "code matrix" $\mathbf{M}$, where each column is a binary partition of the label space into meta-classes. Since there are many possible bipartitions, the design of the code is central for obtaining discriminative ones. Indeed, there are several approaches for generating ECOCs: exhaustive codes [12], BCH codes [8], random codes [3], and circular ECOC [17] are a few examples of methods that generate codes independently from the inherent structure of the data.
Although ECOCs can be data-independent and even randomly generated, they can also be learnt from data: Pujol et al. propose a discriminant ECOC approach based on hierarchical partitions of the output space [33]. Subsequently, Escalera et al. [13] proposed to split complex problems into easier subclasses, embedded as binary dichotomizers in the ECOC framework, which are easier to optimize. In [9], it is also shown that optimal continuous ECOCs can be found by gradient descent. Griffin & Perona [18] use trees to efficiently handle multi-class problems, which Zhang et al. later improved by finding optimal partitions with spectral ECOCs [50].
In the decoding step, a sample is decoded from the outputs of the $n$ binary classifiers. Given the predicted code, the class label corresponds to the closest row in $\mathbf{M}$. The most common decoding methods are the Hamming and Euclidean distances, but there are more sophisticated approaches such as probabilistic decoding, especially with ternary codes [14].
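As an illustration of the two steps, a minimal sketch with a random coding matrix and Euclidean decoding (the sizes and the number of flipped bits are arbitrary choices for this example, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_bits = 10, 63                              # hypothetical sizes for this sketch
M = rng.choice([-1.0, 1.0], size=(n_classes, n_bits))   # coding: one codeword (row) per class

def decode(pred_code, M):
    """Euclidean decoding: return the class whose codeword is closest."""
    return int(np.argmin(np.linalg.norm(M - pred_code, axis=1)))

# Simulate three binary classifiers making mistakes on class 3's codeword:
noisy = M[3].copy()
noisy[[0, 7, 19]] *= -1
assert decode(noisy, M) == 3   # redundancy across the 63 bits corrects the bit errors
```

With 63 redundant bits, three bit flips leave the noisy code far closer to the true codeword than to any other random row, which is the error-correction property exploited throughout the paper.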
Inspired by the latest ECOC advances, we propose to integrate output codes into large-scale deep learning problems. In this context, few approaches have been presented in the literature: in [10, 11], CNNs are also used to directly predict the code bits for Optical Character Recognition (OCR). We go a step further by (i) showing that the convergence speed in large-scale settings with millions of images can be dramatically improved; and (ii) instead of directly predicting the code bits, integrating the Euclidean decoding with the cross-entropy loss, so that the network does not only optimize individual bits independently but also inter-code distances, which results in error correction.
Our approach enhances the convergence of CNNs using random codes, i.e., even when the inter-class relationships are not considered. We achieve even lower error rates with data-dependent codes, due to more efficient data partitions. Similarly, Yang et al. also used CNNs to integrate data-independent Hadamard codes with the Euclidean loss [48]. Due to the efficiency of data-dependent codes, our encoding proposal is shown to be more efficient than [48], halving the required CNN output size and eliminating the need to train multiple CNNs to predict code chunks.
3 Low dimensional target embedding
Figure 1 depicts our proposed model, inspired by the ECOC framework [12] and applied to deep supervised learning. Given a set of $N_c$ classes, an ECOC consists of a set of $n$ binary partitions of the label space (groups of classes) representing each of the classes in the dataset. The codes are usually arranged in a design matrix $\mathbf{M} \in \mathbb{R}^{N_c \times n}$. Let us define the output of the last layer of a neural network as $\mathbf{h} = f(\mathbf{W}\mathbf{x}_{L-1} + \mathbf{b})$, with $L$ the depth of the network. For the sake of clarity, the identity non-linearity is used, so that $f(x) = x$. Thus, given the weights of the previous layer $\mathbf{W}$ and the corresponding bias $\mathbf{b}$, $\mathbf{h}$ can be computed as $\mathbf{h} = \mathbf{W}\mathbf{x}_{L-1} + \mathbf{b}$.
In our case, we reduce the output dimensionality of the CNN, i.e., the dimensionality of $\mathbf{h}$, from $N_c$ (the number of classes) to $n \leq N_c$, an arbitrary number of partitions. Then, given a design matrix $\mathbf{M} \in \mathbb{R}^{N_c \times n}$, where each row $\mathbf{m}_i$ encodes a class label, the prediction is obtained by computing the distance of the output to each row of the design matrix, $\delta_i = \lVert \mathbf{h} - \mathbf{m}_i \rVert_2$ (equivalently, the row-wise norms of $\mathbf{M} - \mathbf{1}\mathbf{h}^\top$, with $\mathbf{1}$ a column vector of ones), and taking the label $y^* = \arg\min_i \delta_i$. Then, we seamlessly integrate our proposal in the traditional log-likelihood and softmax loss layer.
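The prediction rule can be sketched as follows (the Gaussian design matrix and the sizes are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n = 100, 32                      # assumed: 100 classes embedded in n = 32 dimensions
M = rng.standard_normal((n_classes, n))     # design matrix M, one code row m_i per class

def predict(h, M):
    """Decode the n-dimensional network output h: the nearest row of M wins."""
    delta = np.linalg.norm(M - h, axis=1)   # distance to every codeword
    return int(np.argmin(delta))

h = M[42] + 0.01 * rng.standard_normal(n)   # an output lying very close to codeword 42
assert predict(h, M) == 42
```

Note that only the $n$-dimensional output is produced by the network; the $N_c$-way decision is recovered entirely through distances to the design matrix.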
3.1 Embedding output codes in CNNs
Given a training set $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ of image-label pairs, CNNs constitute the state of the art at finding good local minima by empirical risk minimization (ERM), using the cross-entropy as the loss function by means of backpropagation [28]:

$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i \log \hat{\mathbf{y}}_i,$

where $\hat{\mathbf{y}}_i$ is the predicted label for the $i$-th example and $\mathbf{y}_i$ the ground-truth label. Since the cross-entropy requires probability distributions, the output $\mathbf{h}$ of the network is fed to a softmax layer that assigns a probability score to each of the possible classes:

$\hat{y}_j = \frac{e^{h_j}}{\sum_{k=1}^{N_c} e^{h_k}}.$
The derivative of the loss function for gradient descent through backpropagation is known to be $\frac{\partial \mathcal{L}}{\partial h_j} = \hat{y}_j - y_j$.
The decoder $\delta$ is introduced between the output $\mathbf{h}$ of the network and the softmax function. Concretely, the negative normalized Euclidean distance between $\mathbf{h}$ and the rows of $\mathbf{M}$ is used, so that the output of the softmax represents the probability of the output of the CNN being decoded as each codeword.
We reformulate the softmax function as $\hat{y}_j = \frac{e^{\delta_j}}{\sum_k e^{\delta_k}}$, with the variable change of $h_j$ by $\delta_j = -\lVert \bar{\mathbf{h}} - \mathbf{m}_j \rVert_2$ (with $\bar{\mathbf{h}} = \mathbf{h}/\lVert \mathbf{h} \rVert_2$ the normalized output vector). The derivative of the loss can be computed using the chain rule: $\frac{\partial \mathcal{L}}{\partial \bar{\mathbf{h}}} = \sum_j \frac{\partial \mathcal{L}}{\partial \delta_j} \frac{\partial \delta_j}{\partial \bar{\mathbf{h}}}$.
We now calculate:

$\frac{\partial \mathcal{L}}{\partial \delta_j} = \hat{y}_j - y_j, \quad (1)$

$\frac{\partial \delta_j}{\partial \bar{\mathbf{h}}} = \frac{\mathbf{m}_j - \bar{\mathbf{h}}}{\lVert \bar{\mathbf{h}} - \mathbf{m}_j \rVert_2}. \quad (2)$
Given eqs. (1) and (2), it is possible to compute the derivative of the cross-entropy with the new decoding $\delta$:

$\frac{\partial \mathcal{L}}{\partial \bar{\mathbf{h}}} = \sum_j (\hat{y}_j - y_j)\, \frac{\mathbf{m}_j - \bar{\mathbf{h}}}{\lVert \bar{\mathbf{h}} - \mathbf{m}_j \rVert_2}. \quad (3)$
Provided the amount of computation that can be shared from the forward pass to the backward pass, this process does not slow down the training phase. In fact, the cost is compensated by (i) the shrinkage of the output layer, which also reduces the number of network parameters, and (ii) the increase in convergence speed.
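The derivative in eq. (3) can be verified numerically; the sketch below treats the normalized output $\bar{\mathbf{h}}$ as the free variable (i.e., it checks the gradient before the normalization Jacobian is applied), with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n = 5, 8                         # hypothetical sizes
M = rng.standard_normal((n_classes, n))     # design matrix (one code row m_j per class)
hbar = rng.standard_normal(n)               # stands in for the normalized network output
target = 2
y = np.eye(n_classes)[target]               # one-hot ground truth over codewords

def xent(hbar):
    """Cross-entropy of the softmax over negative Euclidean distances to the rows of M."""
    d = np.linalg.norm(M - hbar, axis=1)    # ||hbar - m_j||
    y_hat = np.exp(-d - (-d).max())
    y_hat /= y_hat.sum()
    return -np.log(y_hat[target]), y_hat, d

# Analytic gradient, eq. (3): sum_j (y_hat_j - y_j) (m_j - hbar) / ||hbar - m_j||
_, y_hat, d = xent(hbar)
grad = ((y_hat - y) / d) @ (M - hbar)

# Central-difference check
eps, num = 1e-5, np.zeros(n)
for i in range(n):
    e = np.zeros(n); e[i] = eps
    num[i] = (xent(hbar + e)[0] - xent(hbar - e)[0]) / (2 * eps)

assert np.allclose(grad, num, atol=1e-6)    # eq. (3) matches the numerical gradient
```

Because the distances $d$ and the softmax outputs are already available from the forward pass, the analytic gradient reuses them, which is the computation sharing mentioned above.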
The convergence speed increases because reducing the output layer results in parameter sharing, which produces more robust gradient estimates. The explanation is that the softmax function distributes the probabilities among a high number of neurons. Thus, the gradient $\hat{y}_j - y_j$ is close to zero for most outputs, because $y_j$ is non-zero only once in the ground-truth vector and $\hat{y}_j \approx 0$ for most $j$. When the network is certain about the predicted output, the expected value of the remaining outputs is even smaller.
In other words, output layers with a huge number of outputs and a smaller mini-batch size can only update the weights of a few output units per iteration, since the expected activation value is virtually zero. Thus, the gradients for these outputs are either zero or based on too few examples. This leads to noisy estimates of the real loss surface. As a result, reducing the output space with our method increases the ratio of activations per mini-batch, helping to obtain more robust gradient estimates and increasing convergence speed; it also allows reducing the mini-batch size, and thus the memory requirements.
3.2 Connections with Normalized Cuts
CNNs trained with our approach are robust and fast even when drawing codes from a normal distribution. The reason is that random Gaussian matrices tend to satisfy the coding properties described in the literature [12, 19], such as row and column orthogonality. However, for most large datasets the label space follows a hierarchical structure, and defining random partitions of the label space is rather unnatural. In order to find the simplest partitions, we use an eigen-representation of the class manifold based on the class similarities found in the dataset. Concretely, solving the normalized cut (Ncut) problem on the class similarity graph is a way of obtaining uncorrelated low-cost partitions of the $N_c$ classes [36]. The Ncut can be approximated by solving the eigendecomposition of the normalized Laplacian of the class similarity matrix $\mathbf{S}$:

$\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2},$

where $\mathbf{S}$ is the class similarity matrix, $\mathbf{D}$ is the degree matrix, $\lambda_0 \leq \lambda_1 \leq \dots$ are the eigenvalues in ascending order, and $\mathbf{u}_0, \mathbf{u}_1, \dots$ the corresponding eigenvectors. Given that the Ncut cost grows with the eigenvalues, the eigenvectors constitute the partitions ordered by Ncut cost. As explained in [50], this kind of codes has desirable properties such as balancing, orthogonality, lower error bounds due to the maximization of separability, and similarity preservation, i.e., similar classes have similar codes. We show that training CNNs to predict the embedded target with these data-based codes exhibits lower error rates than using random codes. Contrary to [50], we do not threshold the eigenvectors to obtain a binary code, but interpret the values as likelihoods. In the following section, we provide empirical evidence confirming that CNNs trained with our proposed methodology on CIFAR-100, CUB-200, MIT Places, and ImageNet have faster convergence rates (with comparable or better recognition rates), even with smaller mini-batch sizes, than their one-hot counterparts.
4 Experiments
To validate our approach, we perform a thorough analysis of the advantages of embedding output codes in CNN models on different state-of-the-art datasets. First, we describe the considered datasets, methods, and evaluation protocol.
4.1 Datasets
We first experiment on the ImageNet 2012 Large-Scale Visual Recognition Challenge (ILSVRC-2012) [35] and MIT Places-205 [51] datasets. ImageNet consists of 1.2M training images and 50K validation images spanning 1,000 object classes. MIT Places consists of 2.5M training images from 205 scene categories, and 100 images per category for testing.
Subsequently, we experiment on CIFAR-100 [26] and Caltech-UCSD Birds-200-2011 [42]. CIFAR-100 consists of 50K training images and 10K test images belonging to 20 coarse categories and 100 fine-grained categories. CUB-200 contains 11,788 images (5,994 for training and 5,794 for testing) of 200 bird species, each image annotated with 15 part locations, 312 binary attributes, and 1 bounding box.
4.2 Methods and evaluation
We use standard state-of-the-art models to evaluate the contribution of the proposed target-embedding procedure, rather than comparing with state-of-the-art results on the considered datasets. Note that any model, including more recent and more powerful architectures, can benefit from our target-embedding methodology.
As a proof of concept, we first validate data-independent codes on the ImageNet and MIT Places datasets. Concretely, we retrain with our approach the fc7 and fc8 layers of an AlexNet model [27] pre-trained on the respective datasets: we randomly re-initialize their weights and train them using SGD with a small global learning rate (lr), with the specific lr of the re-initialized layers multiplied by a larger constant.
Then, we demonstrate the advantages of data-dependent codes on the fine-grained CIFAR-100 and CUB-200-2011. For CIFAR-100, we use the cifar_quick model found in the Caffe framework [22]. The network is initialized with noise sampled from a Gaussian distribution, and the model is trained for 100 epochs. Fine-tuning on CUB-200 is performed with the same pre-trained model as in the ImageNet experiments, dividing the lr by a constant factor after a fixed number of epochs. Experiments with the standard AlexNet CNN [27] (Caffe version [22]) on ImageNet and MIT Places show that CNNs trained with random codes and our approach converge faster than with one-hot encoding, especially for small mini-batch sizes, while matching one-hot performance for bigger mini-batch sizes. Moreover, the proposed data-dependent encoding approach performs better than random codes on fine-grained datasets with fuzzy inter-class boundaries, essentially because random codes alone do not take into account the correlation of attributes.
4.3 Random codes for faster convergence
Output encodings allow embedding sparse output spaces into compact representations. For instance, codes generated with the dense random strategy only need $\lceil 10 \log_2 N_c \rceil$ bits [3] to encode $N_c$ classes. An inherent property of one-hot encoding is output-activation sparsity for huge output spaces. Given a randomly initialized CNN with one-hot encoding, provided that the output neurons follow a uniform distribution, the probability assigned to each class will be $1/N_c$, which tends to 0 as $N_c$ grows. In the final stages of training the situation persists, since just an extremely small ratio of the neurons activate, i.e., a small subset of the neurons shows a high probability for the predicted class while the residual probability mass is spread over a much larger number of neurons. Thus, it can be coarsely estimated that the update probability of the parameters associated with an output neuron during an SGD step is related to the ratio $m/N_c$, with $m$ the mini-batch size (well below 1 for AlexNet trained on ImageNet), provided that $m < N_c$. In other words, given a label, sampling more images increases the probability of that label being in the set of samples, and drawing fewer samples than the number of labels ensures that at least $N_c - m$ labels will not be seen during the update.
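The $m/N_c$ estimate can be checked with a quick simulation; the sizes below mirror an AlexNet-on-ImageNet-like setting and are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, m = 1000, 256                    # assumed N_c and mini-batch size

# Fraction of classes appearing at least once in a uniformly drawn mini-batch:
fracs = [np.unique(rng.integers(0, n_classes, m)).size / n_classes
         for _ in range(200)]
print(round(float(np.mean(fracs)), 2))      # about 0.23: most output units see no positive example
```

The simulated coverage matches the closed form $1 - (1 - 1/N_c)^m$, confirming that with $m < N_c$ the vast majority of output units receive no positive gradient in a given update.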
Figure 2 shows the resulting validation accuracy when training AlexNet on ILSVRC-2012 and MIT Places for different mini-batch sizes and a random code sampled from a normal distribution. As can be seen, models trained with our approach converge faster than those trained with one-hot encoding.
4.4 Using databased encodings
In order to adapt to fine-grained settings, i.e., with high inter-class correlations and few examples per class, we propose to generate the output codes using the eigenvectors of the normalized Laplacian of the class similarity matrix. Since this eigendecomposition generates the most discriminating hierarchical partitions based on the data, models trained with these data-dependent codes reach higher accuracy bounds than their random counterparts.
To confirm the aforementioned advantages of data-dependent codes, we experiment on the well-established CIFAR-100 and CUB-200-2011 fine-grained datasets, see Fig. 3. We use CIFAR-100 for fast experimentation, and then apply the best setting to CUB-200.
CIFAR-100. First, we evaluate different procedures for generating the codes:

- One-hot. A vector of $N_c$ zeros with a one at the target position ($N_c$ being the number of classes).
- Dense random [3]. Sampling from $\{-1, 1\}$ the matrix with the most uncorrelated rows and columns.
- Gaussian. Sampling matrices from a normal distribution.
- Data-based. Constructing the code matrix from the eigenvectors of the class-similarity Laplacian.
Note that Gaussian and data-based codes are composed of real numbers, and a thresholding function must be applied to obtain binary partitions. We test thresholding at zero and at the median of the rows of the code matrix. Additionally, we test the raw values, interpreting them as the likelihood of each meta-class being present in the class.
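The three variants can be sketched as follows (the matrix size is an arbitrary choice for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((100, 32))          # real-valued (Gaussian or data-based) code matrix

M_zero = np.where(M > 0, 1.0, -1.0)         # threshold at zero
med = np.median(M, axis=1, keepdims=True)
M_med = np.where(M > med, 1.0, -1.0)        # threshold at each row's median
M_raw = M                                   # raw values kept as meta-class likelihoods

# Median thresholding yields balanced codewords: half +1, half -1 per row.
assert all(float(row.sum()) == 0.0 for row in M_med)
```

Median thresholding guarantees balanced partitions per codeword, while the raw variant preserves the magnitude information that the thresholded codes discard.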
As can be seen in Table 1, output encodings are more robust, losing a smaller percentage of accuracy when the number of code bits is halved, while the accuracy of one-hot encoding scales linearly with the number of bits; see Fig. 2(a) for a detailed analysis. In addition, data-based codes find the most discriminative partitions, resulting in better accuracy than the rest of the encodings. Moreover, keeping the raw values of the eigenvectors provides additional information about the likelihood of a meta-class being present in a certain class, resulting in more robust predictions. Since output codes are based on binary partitions, they constrain the learning so that features are encoded to fall into hyperplanes.
In Figure 4 we show the 2D projection of those hyperplanes using t-SNE. Note the higher overlap of samples from different classes in the target embedding space of one-hot encoding in comparison with the dense and data-dependent alternatives. In particular, the proposed eigendecomposition of the output space shows a more discriminative splitting of the data samples according to their labels.
CUB-200. Figure 5 shows that using small mini-batch sizes with data-based encodings largely outperforms the one-hot baseline for different code lengths when training a CNN on CUB-200 with data-dependent codes based on the raw eigenvectors of the class similarity matrix (the best setting on CIFAR-100). Moreover, Figure 2(b) shows that the data-based code matches one-hot encoding with just a fraction of the bits. As expected, the first bits correspond to the most discriminative partitions, ordered by cut cost.
The class similarity matrix was built from the fc7 outputs of a pre-trained network, but any other source would also work as long as it reflects the inter-class relationships.
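A class-similarity matrix of this kind can be sketched from per-class mean features; here random vectors stand in for real fc7 activations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, feat_dim = 10, 64
# Hypothetical per-class mean features; in practice these would be fc7 activations
# averaged over the training images of each class.
F = rng.standard_normal((n_classes, feat_dim))
Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
S = (Fn @ Fn.T + 1.0) / 2.0   # cosine similarity mapped to [0, 1] as graph weights
```

The resulting symmetric, non-negative $\mathbf{S}$ can be fed directly to the normalized-Laplacian eigendecomposition of Section 3.2.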
Figure 6 contains the confusion matrices for ten of the CUB-200 classes. Note that data-dependent encodings find low-cost partitions, discriminating classes prone to confusion in the first stages of the hierarchy (the first encoding bits), and leaving the harder classification problems to the leaves. A comparison of one-hot, random, and data-dependent encodings for the classification of "Fish crow" and "Grackle" is shown in Figure 8.
We lastly verify the correspondence of the meta-classes found with data-dependent encodings by computing the Pearson correlation coefficient (PCC) between the columns of the code matrix and the attributes associated with each of the CUB-200 classes, see Table 2.
As expected, the data-dependent code finds a high-level partition that already discriminates both classes. One-hot encoding, instead, acts directly at the class level, without being explicitly based on shared attributes. Random codes, although also based on meta-classes (attributes), do not guarantee that those meta-classes are the most discriminative ones.


5 Conclusion
In this work, output codes are integrated into the training of deep CNNs on large-scale datasets. We found that CNNs trained on CIFAR-100, CUB-200, ImageNet, and MIT Places using our approach show less sparsity at the output neurons. As a result, models trained with our approach showed more robust gradient estimates and faster convergence rates than those trained with the prevalent one-hot encoding, at a small cost, especially for huge label spaces. As a side effect, CNNs trained with our approach can use smaller mini-batch sizes, lowering memory consumption. Moreover, we showed that training with data-dependent codes based on eigen-representations of the class space allows for more efficient hierarchical representations, achieving lower error rates than training with data-independent output codes.
Acknowledgements
The authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO/FEDER), the 2016FI_B 01163 grant (Secretaria d'Universitats i Recerca del Departament d'Economia i Coneixement de la Generalitat de Catalunya), and the COST Action IC1307 iV&L Net (European Network on Integrating Vision and Language), supported by COST (European Cooperation in Science and Technology). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU and a GTX TITAN GPU used for this research.
References

Akata et al. [2013] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826, 2013.
Akata et al. [2016] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2016.

Allwein et al. [2000] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1(Dec):113–141, 2000.
Amit et al. [2007] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pages 17–24. ACM, 2007.
Bai et al. [2016] X. Bai, S. I. Niwas, W. Lin, B.-F. Ju, C. K. Kwoh, L. Wang, C. C. Sng, M. C. Aquino, and P. T. Chew. Learning ECOC code matrix for multiclass classification with application to glaucoma diagnosis. Journal of Medical Systems, 40(4):1–10, 2016.
Bautista et al. [2012] M. Á. Bautista, S. Escalera, X. Baró, P. Radeva, J. Vitrià, and O. Pujol. Minimal design of error-correcting output codes. Pattern Recognition Letters, 33(6):693–702, 2012.
Bengio et al. [2010] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems, pages 163–171, 2010.
Bose and Ray-Chaudhuri [1960] R. C. Bose and D. K. Ray-Chaudhuri. On a class of error correcting binary group codes. Information and Control, 3(1):68–79, 1960.
Crammer and Singer [2002] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2-3):201–233, 2002.
Deng et al. [2010] H. Deng, G. Stathopoulos, and C. Y. Suen. Applying error-correcting output coding to enhance convolutional neural network for target detection and pattern recognition. In 20th International Conference on Pattern Recognition (ICPR), pages 4291–4294. IEEE, 2010.
Deng et al. [2014] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In European Conference on Computer Vision, pages 48–64. Springer, 2014.

Dietterich and Bakiri [1995] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
Escalera et al. [2008] S. Escalera, D. M. Tax, O. Pujol, P. Radeva, and R. P. Duin. Subclass problem-dependent design for error-correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):1041–1054, 2008.
 Escalera et al. [2010] S. Escalera, O. Pujol, and P. Radeva. On the decoding process in ternary errorcorrecting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):120–134, 2010.
 Everingham et al. [2015] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
 Frome et al. [2013] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visualsemantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
 Ghaderi and Windeau [2000] R. Ghaderi and T. Windeau. Circular ecoc. a theoretical and experimental analysis. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 2, pages 203–206. IEEE, 2000.
 Griffin and Perona [2008] G. Griffin and P. Perona. Learning and using taxonomies for fast visual categorization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
 Hastie et al. [1998] T. Hastie, R. Tibshirani, et al. Classification by pairwise coupling. The annals of statistics, 26(2):451–471, 1998.
 He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
 Hsu et al. [2009] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multilabel prediction via compressed sensing. In NIPS, volume 22, pages 772–780, 2009.
 Jia et al. [2014] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 Jiang et al. [2016] Z. Jiang, Y. Wang, L. Davis, W. Andrews, and V. Rozgic. Learning discriminative features via label consistent neural network. arXiv preprint arXiv:1602.01168, 2016.
 Kankuekul et al. [2012] P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. Online incremental attributebased zeroshot learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3657–3664. IEEE, 2012.
 Kong and Dietterich [1995] E. B. Kong and T. G. Dietterich. Errorcorrecting output coding corrects bias and variance. In ICML, pages 313–321, 1995.
 Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 LeCun and Bengio [1995] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
 Lin et al. [2015] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 27–35, 2015.
 Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
 Liu et al. [2015] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Mikolov et al. [2013] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 Pujol et al. [2006] O. Pujol, P. Radeva, and J. Vitria. Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1007–1012, 2006.
 Rohrbach et al. [2011] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1641–1648. IEEE, 2011.
 Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
 Shi and Malik [2000] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
 Smith and Windeatt [2015] R. S. Smith and T. Windeatt. Facial action unit recognition using multiclass classification. Neurocomputing, 150:440–448, 2015.
 Srivastava and Salakhutdinov [2013] N. Srivastava and R. R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In Advances in Neural Information Processing Systems, pages 2094–2102, 2013.
 Tsochantaridis et al. [2005] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep):1453–1484, 2005.
 Vijayanarasimhan et al. [2014] S. Vijayanarasimhan, J. Shlens, R. Monga, and J. Yagnik. Deep networks with large output spaces. arXiv preprint arXiv:1412.7479, 2014.
 Weinberger and Chapelle [2009] K. Q. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In Advances in Neural Information Processing Systems, pages 1737–1744, 2009.
 Welinder et al. [2010] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. 2010.
 Weston et al. [2010] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning, 81(1):21–35, 2010.
 Weston et al. [2011] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
 Windeatt and Ardeshir [2003] T. Windeatt and G. Ardeshir. Boosted ECOC ensembles for face recognition. In IEE Conference Publication, pages 165–168. Institution of Electrical Engineers, 2003.
 Xiao et al. [2014] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 177–186. ACM, 2014.
 Yan et al. [2015] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. HD-CNN: Hierarchical deep convolutional neural network for large scale visual recognition. In ICCV'15: Proc. IEEE 15th International Conf. on Computer Vision, 2015.
 Yang et al. [2015] S. Yang, P. Luo, C. C. Loy, K. W. Shum, and X. Tang. Deep representation learning with target coding. In AAAI, pages 3848–3854, 2015.
 Yu and Aloimonos [2010] X. Yu and Y. Aloimonos. Attributebased transfer learning for object categorization with zero/one training example. In European conference on computer vision, pages 127–140. Springer, 2010.
 Zhang et al. [2009] X. Zhang, L. Liang, and H.-Y. Shum. Spectral error correcting output codes for efficient multiclass recognition. In IEEE International Conference on Computer Vision (ICCV), 2009.
 Zhou et al. [2014] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.