1 Introduction
Multiclass classification is one of the most common problems in machine learning. It aims at predicting one label out of multiple, mutually exclusive labels based on the known assignments in the training data.
Such an approach does not take into account complex dependencies among output variables, which leads to two problems. First, it assumes mutually independent labels. This assumption holds for some computer vision tasks such as object recognition on ILSVRC (Russakovsky et al., 2015), in which classes are mutually exclusive leaf nodes of WordNet (Miller, 1995) (e.g., an object is not supposed to be both a dog and a cat), but it does not apply to many other tasks. Second, the quality of top-k predictions is not well assessed. A naive multiclass classification framework evaluates top-k accuracy, which only measures the model's ability to exactly match the true label and ignores the relevance of the other top predictions. This is especially critical in a dataset with highly correlated classes. For example, suppose an image labeled ‘husky’ is classified as ‘dog’ or as ‘mammal’. Though neither matches the ground truth exactly, ‘dog’ is clearly a better prediction than ‘mammal’, and the top-k predictions {‘dog’, ‘husky’} should be considered better than {‘mammal’, ‘husky’}.
A known label relation can be exploited as a guide for a model to produce a cluster of predictions that are close to the ground truth in a structured label space. As a result, both classification accuracy and the relevance of top predictions can be improved. Graphs have been shown to encode complex geometry and can be analyzed with strong mathematical tools such as spectral graph theory (Chung, 1997).
There has been work on incorporating label structure into multiclass classification, but it comes with two major shortcomings. First, classification with label relations is often confined to a certain type of graph (Deng et al., 2014), whereas the underlying label relations of a given task may take various forms. Second, most recent work approximates pairwise relations with graphical models such as conditional random fields (CRFs) and Markov random fields (MRFs) (Schwing and Urtasun, 2015), which may not be rich enough to capture complex dependencies.
In this paper, we explore a novel way to perform multiclass classification by combining deep neural networks (DNNs) with graph convolutional networks (GCNs) (Bruna et al., 2013) that encode the label structure and improve the relevance of top predictions. The proposed model stacks graph convolution layers on the concatenation of input and class latent variables to extract label-wise features, which are then decoded by a final classifier. The entire network is trained as a deterministic deep neural network, bypassing the need for sophisticated inference steps. We also propose several graph-theoretic metrics to evaluate the relevance of top predictions.
2 Problem Description
Given an input variable $x$ and output variables $y_1, \dots, y_C$, the classification task amounts to assigning to $x$ the output variable that maximizes the probability $p(y_i \mid x)$. When correlations exist among the output variables, the probability of a certain $y_i$ depends not only on $x$, but also on the output variables that $y_i$ is correlated with. One way to represent such known structure underlying the classes is a graph. Let $G = (V, E)$ be a graph such that $\{y_1, \dots, y_C\}$ is indexed by the vertices of $G$, and an edge $(i, j) \in E$ represents a known relation between the output random variables $y_i$ and $y_j$. The problem then amounts to modelling the probability $p(y_i \mid x, G)$.
3 Class Structure Aware Classification
3.1 General Setup
The goal of supervised learning is to map an input $x$ to one of the classes $\{1, \dots, C\}$. This process often consists of three submodules. The first module extracts an input representation $h_x$, and the second module extracts class representations. The final module, called a score function, compares the input representation against each class representation to compute the score $s_c(x; \theta)$ of each class given the input. Given the scores, the prediction is made by $\hat{y} = \arg\max_c s_c(x; \theta)$, where $\theta$ denotes the set of parameters of the classifier.
With the score function above, we define a conditional distribution over the classes given an input. This is often done by the so-called softmax:
$$p(y = c \mid x) = \frac{\exp(s_c(x; \theta))}{\sum_{c'=1}^{C} \exp(s_{c'}(x; \theta))}.$$
With this conditional distribution, we can maximize the log-likelihood of a set of training examples with respect to the parameters:
$$\max_\theta \sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}\big).$$
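As a concrete sketch, the softmax distribution and the (negative) log-likelihood objective above can be written in a few lines of NumPy; the scores below are arbitrary illustrative values, not outputs of any trained model:

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis of the score array.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll_loss(scores, labels):
    # Average negative log-likelihood of the true labels; minimizing this
    # is equivalent to maximizing the log-likelihood objective above.
    probs = softmax(scores)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Illustrative scores for 2 examples over 3 classes.
scores = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
loss = nll_loss(scores, labels)
```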
Example: Multilayer Perceptron (MLP)
When each class is assumed to be conditionally independent of the others and there is no known structure underlying the classes, we can use a plain multilayer perceptron (MLP). First, the input representation $h_x$ is extracted by a deep neural network. The representation of each class $c$ is simply a trainable vector $e_c$ and depends neither on the other classes nor on the input. The score function is a dot product between the input representation and the class vector, i.e., $s_c = h_x^\top e_c$, where $e_c \in \mathbb{R}^d$.
3.2 Structured Class Space
In this paper, we are interested in the case where there exists a graph structure underlying the classes in $\{1, \dots, C\}$. This graph indicates the similarity or relatedness between each pair of classes. The degree of similarity between classes $i$ and $j$ is given by the weight $a_{ij}$, and these weights collectively define an adjacency matrix $A$. In this paper we focus on undirected graphs, so $A$ is symmetric.
Example: MLP + CRF
Instead of defining the score function as a dot product between the input and class representations, we consider a conditional random field defined over the classes given an observation, with a unary potential function $u_c(x)$ modelled by an MLP and a pairwise potential function $\psi_{cc'}$. The score associated with the $c$-th class consists of both unary and pairwise potentials.
For a general graph, exact inference in CRFs is intractable. Instead, mean-field (MF) inference can be used to obtain an approximate solution. Initializing the score function to $s_c^{(0)} = u_c(x)$, the score of class $c$ at iteration $t+1$ is:
$$s_c^{(t+1)} = u_c(x) + \sum_{c' \neq c} m_{c' \to c}^{(t)}, \qquad (1)$$
where $m_{c' \to c}$ resembles a “message” sent from node $c'$ to node $c$. Assuming each node is binary, the marginal probability of each class can be obtained at convergence using softmax.
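A minimal sketch of these mean-field updates, assuming the unary potentials come from the MLP and the pairwise potentials are collected in a symmetric matrix `W` (both are placeholder inputs here, and initializing the marginals from the unary term alone is an assumption of this sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(unary, W, n_iters=10):
    # unary: (C,) unary potentials from the MLP; W: (C, C) symmetric pairwise
    # potentials. q[c] approximates the marginal that binary node c is "on".
    q = sigmoid(unary)              # initialize from the unary term only
    for _ in range(n_iters):
        msgs = W @ q                # aggregate "messages" from neighboring nodes
        q = sigmoid(unary + msgs)   # update each node given its neighbors, as in Eq. (1)
    return q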
Conditional Ising Model
Inspired by Ding et al. (2015), we add an Ising model on top of the MLP and use it as one of the baselines. An Ising model has a score function that takes into account local potentials as well as pairwise potentials.
When an MLP maps an input feature vector $x$ to a label bias vector $b(x) \in \mathbb{R}^C$, the conditional probability is defined as:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i b_i(x)\, y_i + \sum_{(i,j) \in E} w_{ij}\, y_i y_j \Big), \qquad (2)$$
where $Z(x)$ is the partition function and $w_{ij}$ is the edge-specific potential. The pairwise potential is defined through an interaction parameter: $w_{ij}$ is set to a constant $\beta$ if $(i, j) \in E$ and to $0$ otherwise. Setting $w_{ij} = \beta\, a_{ij}$, we can further rewrite Eq. 2 as:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\big( b(x)^\top y + \beta\, y^\top A\, y \big). \qquad (3)$$
3.3 Graph-based Output Structure
3.3.1 Graph Convolutional Network
A graph convolutional network (GCN) is defined according to a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. The edge weights $a_{ij}$ for $(i, j) \in E$ form an adjacency matrix $A$. The node representation $h_i$ for node $i$ is usually obtained by a neural network, and the node representations for all nodes collectively form a feature matrix $H$. The network takes $H$ and $A$ as input and generates a node-wise output feature matrix $H^{(L)}$. Each layer of propagation can be written in the nonlinear form $H^{(l+1)} = f(H^{(l)}, A)$, where $H^{(0)} = H$ and $l = 0, \dots, L-1$; $L$ is the number of layers.
We use the propagation rule of Kipf and Welling (2016):
$$H^{(l+1)} = \sigma\big( \hat{A}\, H^{(l)}\, W^{(l)} \big), \qquad (4)$$
where $W^{(l)}$ is a trainable parameter matrix, and $\sigma$ is a nonlinear function such as $\mathrm{ReLU}$. Here, $\hat{A} = \tilde{D}^{-1} \tilde{A}$, in which $\tilde{A} = A + I$ is the adjacency matrix with self-connections, and $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$. $\hat{A}$ is thus normalized such that all rows sum to one. As the entire network is designed to be differentiable end-to-end, all parameters are estimated with gradient-based optimization.
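The propagation rule of Eq. 4, with the row-normalized adjacency described above, is straightforward to sketch in NumPy; this is an illustrative re-implementation, not the authors' code:

```python
import numpy as np

def normalize_adjacency(A):
    # A_tilde = A + I adds self-connections; dividing each row by its degree
    # row-normalizes so that every row of A_hat sums to one.
    A_tilde = A + np.eye(A.shape[0])
    D_inv = 1.0 / A_tilde.sum(axis=1, keepdims=True)
    return D_inv * A_tilde

def gcn_layer(A_hat, H, W):
    # One propagation step of Eq. (4): H' = ReLU(A_hat @ H @ W).
    return np.maximum(A_hat @ H @ W, 0.0)
```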
3.3.2 Proposed Approach
The neural message passing procedure with GCNs in Eq. 4 is similar to the mean-field iterations in Eq. 1: each node is associated with a quantity that is computed from the quantities of its neighboring nodes. The major difference between the two methods is that in a GCN the value of each class node, or label, is a vector representation instead of a scalar. To compute the score of each label, this vector needs to be transformed into a scalar. We hereby introduce our label representation and decoder; the input representation and GCN propagation remain the same as in the previous sections.
Context-Dependent Label Representation
Following the MLP example, the input representation $h_x$ is extracted, and each label $c$ is embedded in a vector $e_c$ that is jointly learned during training. The context-dependent node vector for label $c$ is initialized by concatenating the latent input representation $h_x$ and the label vector $e_c$. A graph feature matrix $H^{(0)}$ is constructed so that the $c$-th row of $H^{(0)}$ is $[h_x; e_c]$.
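This construction can be sketched as follows, with the input representation and the label embedding matrix as placeholder inputs:

```python
import numpy as np

def build_graph_features(h_x, label_embeddings):
    # h_x: (d_x,) latent input representation; label_embeddings: (C, d_c)
    # jointly learned label vectors. Row c of the result is the
    # context-dependent node vector [h_x ; e_c] for label c.
    C = label_embeddings.shape[0]
    tiled = np.tile(h_x, (C, 1))  # repeat the input for every label node
    return np.concatenate([tiled, label_embeddings], axis=1)  # (C, d_x + d_c)
```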
Tied-Weight Decoder
After $L$ iterations of graph convolution, the graph feature matrix $H^{(L)}$ is extracted. The output feature vector of each label $c$ is decoded by tying the output weight between the input latent representation and the node representation (Inan et al., 2016). The label score is obtained as $s_c = h_x^\top H^{(L)}_c$, where $H^{(L)}_c$ is the $c$-th row of $H^{(L)}$. We constrain $h_x$ and $H^{(L)}_c$ to have the same dimension.
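The tied-weight scoring step then reduces to a matrix-vector product; again an illustrative sketch:

```python
import numpy as np

def tied_decoder_scores(h_x, H_out):
    # h_x: (d,) input latent representation; H_out: (C, d) label node features
    # after the final GCN layer. Tying the output weight means the same h_x
    # that initialized the node vectors is reused as the output projection,
    # so the score of label c is the dot product <h_x, H_out[c]>.
    return H_out @ h_x  # (C,) label scores
```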
Unlike the MLP+CRF, the proposed method with neural message passing encodes labels based on vector representations, and therefore potentially encoding richer information.
4 Related Work
Structured Prediction with Label Relations
Structured prediction has been used for classification with label relations (Taskar et al., 2004; Tsochantaridis et al., 2005; Lampert, 2011; Bi and Kwok, 2011, 2012; Zhang et al., 2017).
The goal of our work is clearly distinguished from the aforementioned works. Structured prediction can be viewed as a variant of multilabel classification: it takes input data with multiple assignments during training and jointly predicts a set of class labels for new observations at test time, whereas our proposed model is trained on single-labeled data.
Classification with Label Relations
Deng et al. (2014) incorporated WordNet (Miller, 1995) into object recognition and demonstrated that exploiting label relations improves not only multiclass classification accuracy but also multilabel classification performance, by setting hard constraints on the exclusive and inclusive relations between labels. This model was further extended to soft label relations using the Ising model (Ding et al., 2015).
There are three major differences between this approach and the proposed approach. First, we do not impose any constraints on the graph structure other than requiring the availability of pairwise relations among nodes; Deng et al. (2014), on the other hand, proposed using a special kind of representation (the HEX graph) to express and enforce exclusion, inclusion, and overlap relations. Second, the proposed model is trained strictly with single-labeled data, while Deng et al. (2014) and Ding et al. (2015) add multiple labels during training by using hard constraints. Third, we train and use the entire model as a deterministic network, while Deng et al. (2014) and Ding et al. (2015) require a separate inference procedure to model a conditional probability at test time, leading to a mismatch between training and testing.
Graph Convolutional Networks
Previous work focused on exploiting different GCN structures. For instance, Defferrard et al. (2016) approximated smooth filters in the spectral domain using Chebyshev polynomials with free parameters learned in a neural-network-like model. Kipf and Welling (2016) introduced simplifications that significantly improve both training time and predictive accuracy.
The main difference between the proposed method and the aforementioned works lies in the input data structure. The proposed method applies a GCN as a layer in a DNN to model data with structured output instead of structured input. In particular, the proposed model projects structured labels into a high-dimensional space and forwards label hidden states, conditioned on the input data, to GCN layers for feature extraction and classification.
Classification with External Knowledge
Recent works have begun to investigate new ways to integrate richer knowledge into classification tasks. For example, Grauman et al. (2011), Hwang et al. (2012), and Deng et al. (2014) used the WordNet category taxonomy to improve image object recognition. McAuley and Leskovec (2012) and Johnson et al. (2015) used metadata from a social network to improve image classification. Ordonez et al. (2013) leveraged associated image captions to estimate entry-level labels of visual objects. Hu et al. (2016) used label relation graphs and concept layers for layered predictions.
The proposed method is a novel approach to incorporating external knowledge about label relations and is not task-specific. The label structure can be extracted in an arbitrary way.
5 Experiment Settings
5.1 Datasets
The proposed model is assessed in two experiments: a visual object recognition task on a canine image dataset, and a document classification task on an in-house dataset. Dataset statistics are summarized in Table 1.
Table 1: Dataset statistics.

Dataset             Nodes (Labels)   Edges    Data Size
Canine Images       170              170      23,800
In-House Documents  251              15,498   28,916
5.1.1 Canine Image Dataset
The canine image dataset is composed of open-source images with labels from a subgraph of WordNet, which is a hierarchical structure of objects. With this dataset we evaluate the performance of the proposed model on a special case of graph structure: a tree, where edges are directed and each node has exactly one parent.
Image data collection
We collect a new dataset of images using an approach inspired by Evtimova et al. (2017). We crawl the nodes in the subtree of the ‘canine’ synset in WordNet and query the label of each node on Flickr to retrieve 140 images per node. The images in each node are partitioned into 100/20/20 for the training/validation/test sets, respectively.
Label graph construction
The adjacency matrix is extracted from the WordNet canine subgraph.
Input Representation
5.1.2 Document Classification Dataset
We use an in-house dataset composed of various types of web-page content for document classification. The underlying output structure is generated using the semantic similarity of labels. We use this dataset to evaluate the performance of the proposed model on a general undirected graph structure.
Data collection and preprocessing
In this dataset, each document has one human-annotated label that summarizes the primary information of its content. The labels are topics covering company names, business, finance, accounting, marketing, human resources, technology, lifestyle, and more. The dataset is split 60%/20%/20% into training/validation/test sets, respectively. The documents are lowercased and tokenized, and the vocabulary contains the most frequent 100,000 unigrams and bigrams.
Label graph construction
The label graph is built by measuring pairwise label similarities based on the label definitions, which are retrieved from Wikipedia. If a label name does not have an exact match, the top three topics suggested by Wikipedia are selected, and their Wikipedia definitions are concatenated as an alternative. The definitions are further tokenized into words and vectorized. We experiment with definition vectors created by a TF-IDF-weighted average of pretrained word vectors from Joulin et al. (2016).
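Assuming the definition vectors are already computed, the cosine-similarity graph with a percentile threshold can be sketched as follows (the 75th-percentile default mirrors the setting described in the experiments):

```python
import numpy as np

def build_label_graph(defs, percentile=75):
    # defs: (C, d) definition vectors (e.g., TF-IDF weighted averages of
    # pretrained word vectors). Returns a discrete, symmetric adjacency matrix.
    norms = np.linalg.norm(defs, axis=1, keepdims=True)
    unit = defs / np.clip(norms, 1e-12, None)
    S = unit @ unit.T                    # pairwise cosine similarities
    np.fill_diagonal(S, 0.0)             # ignore self-similarity
    tau = np.percentile(S, percentile)   # threshold at the chosen percentile
    A = (S > tau).astype(float)          # discretize into an adjacency matrix
    return np.maximum(A, A.T)            # keep the graph symmetric (undirected)
```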
The continuous adjacency matrix $S$ is created by computing the pairwise cosine similarities of the definition vectors: the $(i, j)$ entry is $s_{ij} = \cos(v_i, v_j)$, where $v_i$ and $v_j$ are the $i$-th and $j$-th labels' definition vectors, respectively. A discrete adjacency matrix $A$ is built by setting a threshold $\tau$ on the continuous adjacency matrix. In the experiments, we set $\tau$ to the 75th percentile of all entries in $S$. The discrete adjacency matrix has on average 29.74 edges per node.
Input Representation
After preprocessing, the document vectors are embedded in a CBoW manner (Mikolov et al., 2013). Let $E \in \mathbb{R}^{V \times d}$ be a trainable embedding matrix, where $V$ is the vocabulary size and $d$ the embedding dimension, and let $E_i$ be the $i$-th row of $E$. The embedding function for a document with token indices $w_1, \dots, w_T$ is defined as $h_x = \frac{1}{T} \sum_{t=1}^{T} E_{w_t}$.
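A minimal NumPy sketch of this CBoW embedding, with an illustrative embedding matrix:

```python
import numpy as np

def cbow_embed(token_ids, E):
    # E: (V, d) trainable embedding matrix; token_ids: indices of the
    # document's tokens. The document vector is the mean of its token
    # embeddings, as in the CBoW formulation above.
    return E[token_ids].mean(axis=0)
```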
5.2 Model and Learning Configuration
Proposed Model: GCNTD
We use a model architecture with two layers of GCN propagation and a tied-weight decoder. We also experimented with 4 and 8 layers, but did not observe a significant difference in model performance.
Baselines
We consider the following baselines.
- MLPn: This baseline uses n layers of multilayer perceptron (MLP) instead of the proposed GCN layers to map the contextual hidden state to the classes.
- MLPCRF: This baseline is described in Section 3.2. We apply mean-field (MF) inference in our experiments and fine-tune the MLP parameters on the validation set using top-1 accuracy.
- GCNTDfc/id: Another baseline sets the adjacency matrix in GCNTD to a fully-connected (fc) or identity (id) matrix.
All models are trained by minimizing the negative log-likelihood (NLL) with backpropagation using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001. The learning rate is annealed each time the validation error does not improve. Training is early-stopped based on top-1 accuracy on the validation set. We randomly search the embedding size and learning rate on the validation set (Bergstra and Bengio, 2012). Metrics are reported on the test set using the best model according to the validation set. We observed similar training times for all models.
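The random search over the embedding size and learning rate can be sketched as follows; the candidate sizes and the log-uniform range are illustrative assumptions, not the exact search space used in the experiments:

```python
import random

def sample_config(rng):
    # Randomly sample a hyperparameter configuration, in the spirit of
    # random search (Bergstra and Bengio, 2012). Ranges are illustrative.
    return {
        "embedding_size": rng.choice([64, 128, 256, 512]),
        "learning_rate": 10 ** rng.uniform(-4, -2),  # log-uniform in [1e-4, 1e-2]
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]
# Each config would be trained and scored by top-1 validation accuracy.
```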
5.3 Evaluation
Apart from the top-1 and top-10 accuracies, we propose several graph-theoretic metrics to understand the performance of the proposed GCNTD model. We refer to ‘predictions’ as the 10 labels predicted with the highest scores, unless otherwise specified. The following are our evaluation metrics:
- Top-1/top-10 accuracy: the percentage of test cases in which the true label appears among the top-1/top-10 predictions.
- One-hop precision@k: the fraction of the top-k predictions that overlap with the true label and its one-hop neighbors. By default k = 10.
- One-hop recall@k: the fraction of the true label and its one-hop neighbors that overlap with the top-k predictions. By default k = 10.
- Top-1/top-10 distance: distance refers to the length of the shortest path between a prediction and the true label on the graph, if they are connected at all. The top-1 distance is the distance between the top-1 prediction and the true label; the top-10 distance is the average distance between the top-10 predictions and the true label.
- Diameter: the diameter of a graph is the maximum eccentricity of any vertex, i.e., the greatest distance between any pair of vertices. Here it refers to the diameter of the subgraph formed by the top-k predictions.
The one-hop precision and recall are similar to those in the multilabel classification framework. For simplicity, we refer to them as precision and recall. In these metrics, it is assumed that the true label's one-hop neighbors on the graph are also potentially true labels. Let the set containing the true label and its one-hop neighbors be $T$ and the set of predictions be $P$; the precision and recall are:
$$\text{precision} = \frac{|T \cap P|}{|P|}, \qquad \text{recall} = \frac{|T \cap P|}{|T|}.$$
In addition to the model's ability to exactly match the ground truth, these two metrics measure its ability to find a small cluster around the ground truth. Higher values of top-1/top-10 accuracy, precision, and recall indicate stronger predictive power.
Top-1/top-10 distances and the diameter, on the other hand, measure the coherence of predictions from a graphical perspective. Since the graph captures label relations, labels that are closer to each other on the graph are more related. In the case of the definition-based label graph in the document classification task, the graph captures semantic similarities. These metrics hence measure how centralized the predictions are with respect to the true label, or to each other, from a semantic perspective. Lower values indicate semantically more related labels.
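The one-hop precision/recall and the distance-based metric can be computed directly from the label graph; below is a stdlib-only sketch using BFS, where the adjacency-dict representation of the graph is an assumption made for illustration:

```python
from collections import deque

def bfs_distances(adj, source):
    # Shortest-path distances (in hops) from `source` over an adjacency dict
    # {node: set of neighbors}; unreachable nodes are absent from the result.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def one_hop_precision_recall(adj, true_label, preds):
    # T: the true label plus its one-hop neighbors; P: the top-k predictions.
    T = {true_label} | set(adj[true_label])
    P = set(preds)
    overlap = len(T & P)
    return overlap / len(P), overlap / len(T)

def topk_distance(adj, true_label, preds):
    # Average shortest-path distance from each prediction to the true label,
    # counting only predictions that are connected to it at all.
    dist = bfs_distances(adj, true_label)
    ds = [dist[p] for p in preds if p in dist]
    return sum(ds) / len(ds) if ds else float("inf")
```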


Table 2: Results on the canine image dataset under the two evaluation settings (all nodes / leaf nodes only; one panel per setting). Columns: top-1 (top-10) accuracy, one-hop precision, one-hop recall, top-1 (top-10) distance, and diameter.

Model     Acc@1 (@10)  Precision  Recall  Dist@1 (@10)  Diameter
GCNTD     .42 (.74)    .15        .60     2.07 (3.46)   3.37
MLP1      .44 (.75)    .14        .57     2.02 (3.52)   3.86
MLPCRF    .42 (.75)    .16        .67     1.97 (3.29)   4.33
GCNTDid   .40 (.70)    .13        .52     2.17 (3.64)   5.00
GCNTDfc   .42 (.74)    .14        .56     2.11 (3.57)   2.73

GCNTD     .47 (.76)    .13        .64     2.02 (3.57)   3.26
MLP1      .49 (.77)    .11        .62     1.95 (3.64)   3.70
MLPCRF    .43 (.74)    .14        .70     1.95 (3.41)   4.25
GCNTDid   .45 (.71)    .11        .56     2.10 (3.77)   6.00
GCNTDfc   .48 (.75)    .12        .60     2.02 (3.69)   2.42
6 Results
6.1 Object Recognition: Canine Image Dataset
The results are shown in Table 2, where the models were trained on all the nodes and evaluated on either all the nodes (all) or the leaf nodes only (leaf). In both cases, the models that consider the underlying label structure (i.e., GCNTD and MLPCRF) achieve better performance on the graph-theoretic metrics, while the MLP outperforms GCNTD and MLPCRF on accuracy. MLPCRF achieves the highest precision and recall in both evaluation scenarios. This result indicates that the explicitly defined energy functions of a graphical model are often more beneficial than the MLP and GCNTD for predicting labels close to the ground truth on the WordNet hierarchy.

Table 3: Results on the document classification dataset. Columns: top-1 (top-10) accuracy, one-hop precision, one-hop recall, top-1 (top-10) distance, and diameter.

Model     Acc@1 (@10)  Precision  Recall  Dist@1 (@10)  Diameter
GCNTD     .83 (.95)    .61        .18     .24 (1.35)    2.40
GCNTDid   .81 (.94)    .50        .15     .28 (1.48)    2.81
GCNTDfc   .84 (.96)    .50        .15     .22 (1.48)    2.82
MLPCRF    .81 (.95)    .60        .16     .25 (1.34)    2.42
MLP1      .82 (.95)    .53        .16     .26 (1.44)    2.67
MLP2      .81 (.94)    .50        .14     .27 (1.49)    2.71
MLP4      .75 (.91)    .46        .13     .37 (1.54)    2.94
6.2 Document Classification
Results are shown in Table 3. In general, GCNTD outperforms the MLP on all metrics, and outperforms MLPCRF on accuracy, precision, and recall. GCNTDfc achieves the highest accuracies, indicating the benefit of message passing even in the extreme case where the labels are fully connected.
Figure 2 compares GCNTD and MLP in top-k precision and recall. It demonstrates that GCNTD tends to find a smaller cluster of predictions that are closer to the ground truth. Since the label graph in this case is constructed by measuring definition similarities, GCNTD can be thought of as making predictions that are semantically closer and more related to the ground truth: a smaller diameter indicates semantically closer predictions, and a smaller distance indicates that the predictions are semantically closer to the ground truth.
7 Discussion and Conclusion
We have proposed a graph convolutional network (GCN)-augmented neural network classifier that exploits an underlying graph structure of the labels. The proposed approach resembles an approximate inference procedure in probabilistic graphical models, but replaces iterative inference with graph convolution layers. In experiments on object recognition and document classification, the proposed model achieved better performance on graph-theoretic metrics than a baseline model that ignores label structure. The proposed approach can be applied to any classification task with a label graph to improve accuracy and the relevance of top predictions. It can also be used to incorporate external knowledge about the labels by encoding that knowledge in the label graph.
Acknowledgments
Meihao Chen and Zhuoru Lin thank Nicholaus Halecky, Jeffery Payne, Oleg Khavronin, Patrick Kelley, and Lindsay Reynolds for discussion of and support for this work. Kyunghyun Cho thanks Tencent, eBay, Facebook, Google, and NVIDIA for their support, and was partly supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI). This work was completed during Zhuoru Lin's internship with Bombora Inc.
References
 Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research 13(Feb):281–305.
 Bi and Kwok (2011) Wei Bi and James T Kwok. 2011. Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). pages 17–24.
 Bi and Kwok (2012) Wei Bi and James T. Kwok. 2012. Mandatory leaf node prediction in hierarchical multilabel classification. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, Curran Associates, Inc., pages 153–161.
 Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. CoRR abs/1312.6203. http://arxiv.org/abs/1312.6203.
 Chung (1997) Fan RK Chung. 1997. Spectral graph theory. 92. American Mathematical Soc.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR abs/1606.09375. http://arxiv.org/abs/1606.09375.
 Deng et al. (2014) Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. 2014. Large-scale object classification using label relation graphs. In European Conference on Computer Vision. Springer, pages 48–64.
 Ding et al. (2015) Nan Ding, Jia Deng, Kevin Murphy, and Hartmut Neven. 2015. Probabilistic label relation graphs with ising models. CoRR abs/1503.01428. http://arxiv.org/abs/1503.01428.
 Evtimova et al. (2017) Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. 2017. Emergent language in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369.
 Grauman et al. (2011) Kristen Grauman, Fei Sha, and Sung Ju Hwang. 2011. Learning a tree of metrics with disjoint visual features. In Advances in neural information processing systems. pages 621–629.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770–778.
 Hu et al. (2016) Hexiang Hu, GuangTong Zhou, Zhiwei Deng, Zicheng Liao, and Greg Mori. 2016. Learning structured inference neural networks with label relations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 Hwang et al. (2012) Sung Ju Hwang, Kristen Grauman, and Fei Sha. 2012. Semantic kernel forests from multiple taxonomies. In Advances in neural information processing systems. pages 1718–1726.
 Inan et al. (2016) Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR abs/1611.01462. http://arxiv.org/abs/1611.01462.
 Johnson et al. (2015) Justin Johnson, Lamberto Ballan, and Li Fei-Fei. 2015. Love thy neighbors: Image annotation by exploiting image metadata. In Proceedings of the IEEE International Conference on Computer Vision. pages 4624–4632.
 Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. CoRR abs/1607.01759. http://arxiv.org/abs/1607.01759.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 Lampert (2011) Christoph H Lampert. 2011. Maximum margin multilabel structured prediction. In Advances in Neural Information Processing Systems. pages 289–297.
 McAuley and Leskovec (2012) Julian McAuley and Jure Leskovec. 2012. Image labeling on a network: using social-network metadata for image classification. Computer Vision–ECCV 2012, pages 828–841.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
 Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
 Ordonez et al. (2013) Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. From large scale image categorization to entrylevel categories. In Proceedings of the IEEE International Conference on Computer Vision. pages 2768–2775.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.
 Schwing and Urtasun (2015) Alexander G. Schwing and Raquel Urtasun. 2015. Fully connected deep structured networks. CoRR abs/1503.02351. http://arxiv.org/abs/1503.02351.
 Taskar et al. (2004) Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Maxmargin markov networks. In Advances in neural information processing systems. pages 25–32.
 Tsochantaridis et al. (2005) Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of machine learning research 6(Sep):1453–1484.
 Zhang et al. (2017) L Zhang, SK Shah, and IA Kakadiaris. 2017. Hierarchical multilabel classification using fully associative ensemble learning. Pattern Recognition 70:89–103.