Multi-class classification is one of the most common problems in machine learning. It aims at predicting one label out of multiple, mutually exclusive labels based on the known assignments in the training data.
Such an approach does not take into account complex dependencies among output variables, potentially leading to two problems. First, it assumes mutually independent labels. The assumption holds on some computer vision tasks such as object recognition on ILSVRC (Russakovsky et al., 2015), in which classes are mutually exclusive leaf nodes of WordNet (Miller, 1995) (e.g., an object is not supposed to be both a dog and a cat), but does not apply to many other tasks. Second, the quality of top-k predictions is not well assessed. A naive multi-class classification framework evaluates top-k accuracy, which only measures the model’s ability to exactly match the true label and ignores the relevance of the other top predictions. This is especially critical in a dataset with highly correlated classes. For example, suppose an image labeled ‘husky’ is classified as either ‘dog’ or ‘mammal’. Though neither matches the ground truth exactly, ‘dog’ is clearly a better prediction than ‘mammal’. Likewise, the top-k predictions ‘dog’ and ‘husky’ should be considered better than ‘mammal’ and ‘husky’.
A known label relation can be exploited as a guide for a model to produce a cluster of predictions that are close to the ground truth in a structured label space. As a result, both classification accuracy and the relevance of top predictions can be improved. Graphs have been shown to encode a complex geometry and can be used with strong mathematical tools such as spectral graph theory (Chung, 1997).
There has been work on incorporating the label structure in multi-class classification. It comes, however, with two major shortcomings. First, classification with label relations is often confined to a certain type of graph (Deng et al., 2014), whereas the underlying label relations of a given task may take various forms. Second, most of the recent work approximates the pairwise relation with graphical models such as conditional random fields (CRFs) and Markov random fields (MRFs) (Schwing and Urtasun, 2015). These may not be rich enough to capture complex dependencies.
In this paper, we explore a novel way to perform multi-class classification by combining deep neural networks (DNNs) with graph convolutional networks (GCNs) (Bruna et al., 2013), which encodes the label structure and improves the relevance of top predictions. The proposed model stacks graph convolution layers on the concatenation of input and class latent variables to extract label-wise features, which are then decoded by a final classifier. The entire network is trained as a deterministic deep neural network, bypassing any need for sophisticated inference steps. We also propose several graph-theoretic metrics to evaluate the relevance of top predictions.
2 Problem Description
Given an input variable $x$ and output variables $y = \{y_1, \dots, y_K\}$, the classification task amounts to assigning $x$ an output variable $y_i$ that maximizes the probability $p(y_i \mid x)$. When correlations exist among output variables, the probability of a certain $y_i$ depends not only on $x$, but also on the output variables that $y_i$ is correlated with. One way to represent such a known structure underlying the classes is to use a graph. Let $G = (V, E)$ be a graph such that $y$ is indexed by the vertices of $G$, and $e_{ij} \in E$ is an edge between $v_i$ and $v_j$ that represents a known relation between the output random variables $y_i$ and $y_j$. The problem then amounts to modelling the probability $p(y_i \mid x, G)$.
3 Class Structure Aware Classification
3.1 General Setup
The goal of supervised learning is to map an input $x$ to one of the $K$ classes $k \in \{1, \dots, K\}$. This process often consists of three sub-modules. The first module extracts an input representation, and the second module extracts class representations. The final module, called a score function, compares the input representation against each of the class representations to compute the score $s_\theta(x, k)$ of each class given the input. Given the scores, the prediction is made by
$$\hat{y} = \arg\max_{k} s_\theta(x, k),$$
where $\theta$ denotes the set of parameters of the classifier.
With the score function above, we define a conditional distribution over the classes given an input. This is often done with the so-called softmax:
$$p(y = k \mid x) = \frac{\exp\left(s_\theta(x, k)\right)}{\sum_{k'=1}^{K} \exp\left(s_\theta(x, k')\right)}.$$
With this conditional distribution, we can now maximize the log-likelihood of a set of training examples $\{(x_n, y_n)\}_{n=1}^{N}$ with respect to the parameters:
$$\max_{\theta} \sum_{n=1}^{N} \log p(y = y_n \mid x_n).$$
Example: Multilayer Perceptron (MLP)
When it is assumed that the classes are conditionally independent of each other and that there is no known structure underlying the classes, we can use a plain multilayer perceptron (MLP). First, the input representation $h_x$ is extracted by a deep neural network. The representation of each class $k$ is simply a trainable vector $w_k$ and depends neither on the other classes nor on the input. The score function is a dot product between the input representation and the class vector, i.e., $s_\theta(x, k) = h_x^\top w_k$, where $h_x$ and $w_k$ have the same dimension.
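The MLP example above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random vectors stand in for the output of the deep network, and the names `scores`, `log_softmax` and `nll` are hypothetical helpers, not part of the original work.

```python
import numpy as np

def scores(h_x, W):
    """Dot-product score of every class: s[k] = h_x . W[k]."""
    return W @ h_x

def log_softmax(s):
    """Numerically stable log of the softmax over class scores."""
    shifted = s - np.max(s)
    return shifted - np.log(np.sum(np.exp(shifted)))

def nll(h_batch, y_batch, W):
    """Average negative log-likelihood of the true labels."""
    losses = [-log_softmax(scores(h, W))[y] for h, y in zip(h_batch, y_batch)]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
K, d = 5, 8                          # number of classes, representation size (assumed)
W = rng.normal(size=(K, d))          # one trainable vector per class
h_batch = rng.normal(size=(3, d))    # stand-in for input representations from a DNN
y_batch = np.array([0, 2, 4])        # true labels of the three examples

probs = np.exp(log_softmax(scores(h_batch[0], W)))   # conditional distribution
prediction = int(np.argmax(scores(h_batch[0], W)))   # arg max over class scores
loss = nll(h_batch, y_batch, W)                      # quantity minimized in training
```

Maximizing the log-likelihood is equivalent to minimizing `loss`; in practice the gradient with respect to `W` and the network parameters would be obtained by back-propagation.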
3.2 Structured Class Space
In this paper, we are interested in the case where there exists a graph structure underlying the $K$ classes. This graph indicates the similarity or relatedness between each pair of classes. The degree of similarity between the classes $i$ and $j$ is given by the weight $a_{ij}$, and these weights collectively define an adjacency matrix $A \in \mathbb{R}^{K \times K}$. In this paper we focus on undirected graphs, and therefore $A$ is symmetric.
Example: MLP + CRF
Instead of defining the score function as a dot product between the input and class representations, we consider a conditional random field defined over the classes given an observation $x$, with a unary potential function $\psi_u$ modelled by an MLP and a pairwise potential function $\psi_p$. The score associated with the $k$-th class consists of both unary and pairwise potentials:
$$s(x, k) = \psi_u(x, k) + \sum_{j \neq k} \psi_p(k, j).$$
For a general graph, the problem of exact inference in CRFs is intractable. Instead, mean field (MF) inference can be used to obtain an approximate solution. Initializing the score function to be $s^{(0)}(x, k) = \psi_u(x, k)$, the score function of class $k$ at iteration $t$ is:
$$s^{(t)}(x, k) = \psi_u(x, k) + \sum_{j \neq k} m^{(t-1)}_{j \to k}, \tag{1}$$
where $m^{(t-1)}_{j \to k}$ resembles a “message” sent from node $j$ to node $k$. Assuming each node is a binary variable, the marginal probability of $k$ can be obtained at convergence using softmax.
Conditional Ising Model
Inspired by Ding et al. (2015), we add an Ising model on top of the MLP and use this as one of the baselines. An Ising model has a score function that takes into account local potentials as well as pairwise potentials:
When an MLP maps an input feature vector to a label bias vector, the conditional probability is defined as:
where , and is the edge-specific potential function.
The pairwise energy function is define as: is the interaction parameter defined to be set to if and otherwise, . Setting , we can further rewrite Eq. 2 as:
3.3 Graph-based Output Structure
3.3.1 Graph Convolutional Network
A graph convolutional network (GCN) is defined according to a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. The edge weights $a_{ij}$ for $(v_i, v_j) \in E$ form an adjacency matrix $A$. The node representation for node $v_i$ is usually obtained by a neural network, and the node representations for all nodes collectively form a feature matrix $X$. The network takes $X$ and $A$ as input and generates a node-wise output feature matrix $Z$. Each layer of propagation can be written in a non-linear form: $H^{(l+1)} = f(H^{(l)}, A)$, where $H^{(0)} = X$ and $H^{(L)} = Z$. $L$ is the number of layers.
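We use the propagation rule of Kipf and Welling (2016):
$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right), \tag{4}$$
where $W^{(l)}$ is a trainable parameter matrix and $\sigma$ is a non-linear activation function. Here, $\hat{A}$ is a normalized version of $\tilde{A} = A + I$, in which $\tilde{A}$ is the adjacency matrix with self-connections and $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$ used for the normalization.
$\hat{A}$ is normalized such that all rows sum to one, i.e., $\sum_j \hat{A}_{ij} = 1$. As the entire network is designed to be differentiable end-to-end, all the parameters are estimated with gradient-based optimization.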
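A single propagation step of this kind can be sketched as follows. The sketch makes two assumptions that are not confirmed by the text: it uses the row-normalized form $\tilde{D}^{-1}\tilde{A}$ (chosen only because the rows are said to sum to one, whereas Kipf and Welling (2016) primarily use a symmetric normalization), and it uses ReLU as the non-linearity. The function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-connections, then divide each row by its degree.

    Assumption: row normalization, so that every row of the result sums to one.
    """
    A_tilde = A + np.eye(A.shape[0])
    degree = A_tilde.sum(axis=1, keepdims=True)
    return A_tilde / degree

def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W) (ReLU is an assumed choice)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# A three-node path graph 0 - 1 - 2 with unit edge weights.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalize_adjacency(A)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))   # one 4-dimensional feature vector per node
W0 = rng.normal(size=(4, 2))   # trainable weights of the layer
H1 = gcn_layer(A_hat, H0, W0)  # new 2-dimensional feature vector per node
```

Each row of `H1` mixes the features of a node with those of its neighbours, which is what makes the operation resemble message passing on the graph.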
3.3.2 Proposed Approach
The neural message passing procedure with GCNs in Eq. 4 is similar to the mean field iterations in Eq. 1, as each node is associated with a certain quantity that is computed based on the associated quantities of the neighboring nodes. The major difference between the two methods is that in a GCN the value of each class node or label is a vector representation instead of a scalar. To compute the score of each label, the vector needs to be transformed into a scalar. We therefore introduce our label representation and decoder. The input representation and GCN propagation remain the same as in the previous sections.
Context-Dependent Label Representation
Following the MLP example, the input representation $h_x$ is extracted, and each label $k$ is embedded in a vector $w_k$ which is jointly learned during training. The context-dependent node vector for label $k$ is initialized by concatenating the latent input representation and the label vector, $[h_x; w_k]$. A graph feature matrix $H^{(0)}$ is constructed so that the $k$-th row of $H^{(0)}$ is this node vector, i.e., $H^{(0)}_k = [h_x; w_k]$.
After $L$ iterations of graph convolution, the graph feature matrix $H^{(L)}$ is extracted. The output feature vector of label $k$ is decoded by tying the output weight between the input latent representation and the node representation (Inan et al., 2016). The label score is obtained as $s_\theta(x, k) = h_x^\top H^{(L)}_k$, where $H^{(L)}_k$ is the $k$-th row of $H^{(L)}$. We constrain $h_x$ and $H^{(L)}_k$ to have the same dimension.
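The forward pass described in this subsection might look roughly like the sketch below. It is a minimal illustration, not the authors' implementation: the layer sizes, the ReLU non-linearity, the row normalization of the adjacency matrix and all function names are assumptions, and the random vectors stand in for learned quantities.

```python
import numpy as np

def normalize_adjacency(A):
    """Self-connections plus row normalization (assumed form of the normalization)."""
    A_tilde = A + np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)

def gcntd_scores(h_x, label_vectors, A_hat, layer_weights):
    """Score every label for one input.

    Row k of the initial feature matrix is the concatenation [h_x ; w_k].
    After the graph convolution layers, label k is scored by the dot
    product between h_x and its output feature vector (tied-weight decoding).
    """
    num_labels = label_vectors.shape[0]
    H = np.concatenate([np.tile(h_x, (num_labels, 1)), label_vectors], axis=1)
    for W in layer_weights:
        H = np.maximum(A_hat @ H @ W, 0.0)   # ReLU is an assumed non-linearity
    return H @ h_x                           # requires H.shape[1] == len(h_x)

rng = np.random.default_rng(0)
K, d, d_label, hidden = 4, 6, 3, 5           # illustrative sizes only
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)    # a 4-cycle as a toy label graph
h_x = rng.normal(size=d)                     # stand-in for the input representation
label_vectors = rng.normal(size=(K, d_label))
layer_weights = [rng.normal(size=(d + d_label, hidden)),
                 rng.normal(size=(hidden, d))]  # last layer maps back to size d

label_scores = gcntd_scores(h_x, label_vectors, normalize_adjacency(A), layer_weights)
probs = np.exp(label_scores - label_scores.max())
probs /= probs.sum()                         # softmax over label scores
```

The output dimension of the last layer is set to `d` so that the dot product with `h_x` is defined, mirroring the constraint that the input representation and the decoded label feature have the same dimension.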
Unlike the MLP+CRF, the proposed method with neural message passing encodes labels as vector representations, and can therefore potentially encode richer information.
4 Related Work
Structured Prediction with Label Relations
Structured prediction has been used for classification with label relations (Taskar et al., 2004; Tsochantaridis et al., 2005; Lampert, 2011; Bi and Kwok, 2011, 2012; Zhang et al., 2017).
The goal of our work is clearly distinguished from the aforementioned works. Structured prediction can be viewed as a variant of multi-label classification: it takes input data with multiple assignments during training and jointly predicts a set of class labels for new observations during testing. In our work, by contrast, the proposed model is trained on single-labeled data.
Classification with Label Relations
Deng et al. (2014) incorporated WordNet (Miller, 1995) in object recognition and demonstrated that exploiting label relations improves not only multi-class classification accuracy but also multi-label classification performance, by setting hard constraints on the exclusive and inclusive relations between labels. This model was further extended to soft label relations using the Ising model by Ding et al. (2015).
There are three major differences between this approach and the proposed approach. First, we do not impose any constraints on the graph structure other than requiring the availability of pairwise relations among nodes. Deng et al. (2014), on the other hand, proposed to use a special kind of representation (the HEX graph) to express and enforce exclusion, inclusion, and overlap relations. Second, the proposed model is trained strictly with single-labeled data, while Deng et al. (2014) and Ding et al. (2015) add multiple labels during training by using hard constraints. Third, we train and use the entire model as a deterministic network, while Deng et al. (2014) and Ding et al. (2015) require a separate inference procedure to model a conditional probability at test time, leading to a mismatch between training and test.
Graph Convolutional Networks
Previous works focused on exploiting different GCN structures. For instance, Defferrard et al. (2016) approximated smooth filters in the spectral domain using Chebyshev polynomials with free parameters that are learned in a neural network-like model. Kipf and Welling (2016) introduced simplifications that significantly improve both training time and predictive accuracy.
The main difference between the proposed method and the previously mentioned works is in the input data structure. The proposed method applies a GCN as a layer in a DNN to model data with structured output instead of structured input. In particular, the proposed model projects structured labels into a high-dimensional space and forwards label hidden states conditioned on the input data to GCN layers for feature extraction and classification.
Classification with External Knowledge
Recent works have begun to investigate new ways to integrate richer knowledge in classification tasks. For example, Grauman et al. (2011), Hwang et al. (2012) and Deng et al. (2014) took the WordNet category taxonomy to improve image object recognition. McAuley and Leskovec (2012) and Johnson et al. (2015) used metadata from a social network to improve image classification. Ordonez et al. (2013) leveraged associated image captions to estimate entry-level labels of visual objects. Hu et al. (2016) used label relation graphs and concept layers for layered predictions.
The proposed method is a novel approach to incorporating external knowledge about label relations and is not task-specific. The label structure can be extracted in an arbitrary way.
5 Experimental Settings
The proposed model is assessed in two experiments: a visual object recognition task on a canine image dataset, and a document classification task on an in-house dataset. Dataset statistics are summarized in Table 1.
|Dataset||Nodes (Labels)||Edges||Data Size|
5.1.1 Canine Image Dataset
The canine image dataset is composed of open-source images with labels on a subgraph of WordNet, which is a hierarchical structure of objects. With this dataset we want to evaluate the performance of the proposed model on a special case of graph structure: a tree, where edges are directed and each node has exactly one parent (apart from the root node).
Image data collection
We collect a new dataset consisting of images using an approach inspired by Evtimova et al. (2017). We crawl the nodes in the subtree of the ‘canine’ synset in WordNet, and query the label of each node in Flickr to retrieve 140 images. Images in each node are partitioned into 100/20/20 images for training/validation/test sets respectively.
Label graph construction
The adjacency matrix is extracted from the WordNet canine subgraph.
5.1.2 Document Classification Dataset
We use an in-house dataset composed of various types of web page content for document classification. The underlying output structure is generated using the semantic similarity of labels. We use this dataset to evaluate the performance of the proposed model on a general undirected graph structure.
Data collection and preprocessing
In this dataset, each document has one human-annotated label that summarizes the primary information of the content. The labels are topics covering company names, business, finance, accounting, marketing, human resource, technology, lifestyle, and more. The dataset is split by 60%/20%/20% into training/validation/test sets respectively. The documents are lowercased and tokenized. The vocabulary contains the most frequent 100,000 unigrams and bigrams.
Label graph construction
The label graph is built by measuring pairwise label similarities based on the label definitions. The label definition is retrieved from Wikipedia. If the label name does not have an exact match, the top-three topics suggested by Wikipedia are selected, and their Wikipedia definitions are concatenated as an alternative. The definitions are further tokenized into words and vectorized. We experiment with definition vectors created by TF-IDF weighted average of pre-trained word vectors from Joulin et al. (2016).
The adjacency matrix is created by computing the pairwise cosine similarities of definition vectors. The $(i, j)$ entry of the adjacency matrix $A$ is $a_{ij} = \frac{d_i^\top d_j}{\lVert d_i \rVert \lVert d_j \rVert}$, where $d_i$ and $d_j$ are the $i$-th and $j$-th labels’ definition vectors, respectively. A discrete adjacency matrix is built by setting a threshold $\tau$ on the continuous adjacency matrix. In the experiments, we set $\tau$ to the 75th percentile of all entries in $A$. The discrete adjacency matrix has on average 29.74 edges per node.
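The construction of the label graph from definition vectors can be sketched as below. The random matrix stands in for the TF-IDF weighted definition vectors; treating entries at or above the threshold as edges and dropping self-loops from the discrete matrix are assumptions of this sketch, as are the function names.

```python
import numpy as np

def cosine_adjacency(D):
    """Pairwise cosine similarity between the rows of D (one row per label)."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    unit = D / norms
    return unit @ unit.T

def discretize(A, percentile=75.0):
    """Threshold the continuous adjacency matrix at a percentile of all its entries."""
    tau = np.percentile(A, percentile)
    A_bin = (A >= tau).astype(int)   # assumption: entries at or above tau become edges
    np.fill_diagonal(A_bin, 0)       # assumption: self-similarity is not counted as an edge
    return A_bin, tau

rng = np.random.default_rng(0)
D = rng.normal(size=(6, 10))         # stand-in for six label definition vectors
A = cosine_adjacency(D)              # continuous adjacency matrix
A_bin, tau = discretize(A)           # discrete adjacency matrix and its threshold
mean_edges_per_node = A_bin.sum(axis=1).mean()
```

The last line computes the same kind of statistic as the average number of edges per node reported for the discrete adjacency matrix.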
After preprocessing, the documents are embedded in a CBoW manner (Mikolov et al., 2013). Let $U \in \mathbb{R}^{N_v \times d}$ be a trainable embedding matrix, where $N_v$ is the vocabulary size and $d$ the embedding dimension, and let $u_i$ be the $i$-th row of $U$. The embedding function for a document with token indices $(t_1, \dots, t_T)$ is defined as $\frac{1}{T} \sum_{j=1}^{T} u_{t_j}$.
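A CBoW-style document embedding reduces to a lookup followed by pooling. The sketch below assumes the pooling is an average over token embeddings; the matrix sizes and the name `embed_document` are illustrative only.

```python
import numpy as np

def embed_document(U, token_ids):
    """Average the embedding rows selected by the document's token indices."""
    return U[token_ids].mean(axis=0)

rng = np.random.default_rng(0)
vocab_size, dim = 20, 4                  # toy sizes; the paper uses a 100,000-entry vocabulary
U = rng.normal(size=(vocab_size, dim))   # trainable embedding matrix
token_ids = np.array([3, 7, 7, 12])      # token indices of one document (repeats allowed)

doc_vector = embed_document(U, token_ids)
```

Because the word order is discarded, any permutation of `token_ids` yields the same document vector.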
5.2 Model and Learning Configuration
Proposed Model: GCNTD
We use a model architecture with two layers of GCN propagation and a tied-weight decoder. We also experimented with 4 and 8 layers, but did not observe a significant difference in model performance.
We consider the following baselines.
- MLP
The baseline model uses $n$ layers of multi-layer perceptron (MLP) instead of the proposed GCN layers to map the contextual hidden state to the classes. We denote this model as MLPn.
- MLP-CRF
This baseline is described in Section 3.2. We apply mean-field (MF) inference in our experiment. We tune the MLP and CRF parameters on the validation set using top-1 accuracy.
- GCNTD-fc and GCNTD-id
Another baseline sets the adjacency matrix in GCNTD to a fully-connected matrix (fc) or the identity matrix (id).
All models are trained by minimizing the negative log-likelihood (NLL) with back-propagation using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001. The learning rate is annealed each time the validation error does not improve. Each training run is early-stopped based on the top-1 accuracy on the validation set. We use random search (Bergstra and Bengio, 2012) to select the embedding size and learning rate on the validation set. Metrics are reported on the test set using the best model according to the validation set. We observed similar training times for all the models.
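The interplay between learning-rate annealing and early stopping can be illustrated by replaying a sequence of validation results. The text does not state the annealing factor or the stopping patience, so the values `decay=0.5` and `patience=3` below are purely assumed, as is the function name.

```python
def run_schedule(val_errors, val_accs, lr=0.001, decay=0.5, patience=3):
    """Replay validation results; return (final lr, best epoch, last epoch run).

    decay and patience are assumed values, not taken from the paper.
    """
    best_err = float("inf")
    best_acc, best_epoch, bad_epochs = -1.0, -1, 0
    for epoch, (err, acc) in enumerate(zip(val_errors, val_accs)):
        if err < best_err:
            best_err = err
        else:
            lr *= decay                  # anneal when validation error does not improve
        if acc > best_acc:
            best_acc, best_epoch, bad_epochs = acc, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # early stopping on top-1 validation accuracy
                return lr, best_epoch, epoch
    return lr, best_epoch, len(val_errors) - 1

val_errors = [0.90, 0.70, 0.72, 0.71, 0.73, 0.74]
val_accs = [0.30, 0.45, 0.44, 0.45, 0.43, 0.42]
final_lr, best_epoch, stop_epoch = run_schedule(val_errors, val_accs)
```

In this toy run the model from epoch 1 would be the one evaluated on the test set, since it has the best validation accuracy before training stops.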
Apart from the top-1 and top-10 accuracies, we propose several graph-theoretic metrics to understand the performance of the proposed GCNTD model. We refer to ‘predictions’ as the 10 labels that are predicted with the highest scores, if not otherwise specified. The following are our evaluation metrics:
- Top-1/top-10 accuracy
The percentage of test cases when the true label is predicted in top-1/top-10 predictions.
- One-hop precision@k
The fraction of top-k predictions that overlap with the true label and its one-hop neighbors. By default, k = 10.
- One-hop recall@k
The fraction of the true label and its one-hop neighbors that overlap with the top-k predictions. By default, k = 10.
- Top-1/top-10 distance
Distance refers to the shortest path between a certain prediction and the true label on the graph, if they are connected at all. Top-1 distance is the distance between top-1 prediction and true label, and top-10 distance is the average distance between top-10 predictions and true label.
- Diameter
The diameter of a graph is the maximum eccentricity of any vertex in the graph. In other words, it is the greatest distance between any pair of vertices. Here the diameter refers to the diameter of the subgraph that the top-k predictions form.
The one-hop precision and recall are similar to those in the multi-label classification framework. For simplicity, we refer to them as precision and recall. In these metrics, it is assumed that the true label’s one-hop neighbors on the graph are also potential true labels. Let the set containing the true label and its one-hop neighbors be $S$ and the set of top-$k$ predictions be $P_k$; the precision and recall are:
$$\text{precision@}k = \frac{|S \cap P_k|}{|P_k|}, \qquad \text{recall@}k = \frac{|S \cap P_k|}{|S|}.$$
In addition to the ability of the model to exactly match the ground truth, these two metrics also measure the ability of the model to find a small cluster around the ground truth. Higher values of top-1/top-10 accuracy, precision, and recall indicate stronger predictive power.
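The accuracy, precision and recall definitions above translate directly into set operations. The toy label graph and the helper names below are hypothetical and serve only to make the definitions concrete.

```python
def one_hop_set(adj, y):
    """The true label together with its one-hop neighbours on the label graph."""
    return {y} | set(adj[y])

def hit_at_k(ranked, y, k):
    """True if the true label appears among the k highest-scoring predictions."""
    return y in ranked[:k]

def precision_recall_at_k(ranked, adj, y, k=10):
    """One-hop precision@k and recall@k for a single test case."""
    top_k = set(ranked[:k])
    relevant = one_hop_set(adj, y)
    overlap = len(top_k & relevant)
    return overlap / len(top_k), overlap / len(relevant)

# Toy undirected label graph given as adjacency sets (illustrative only).
adj = {
    "canine": {"dog", "wolf"},
    "dog": {"canine", "husky", "poodle"},
    "husky": {"dog"},
    "poodle": {"dog"},
    "wolf": {"canine"},
}
ranked = ["dog", "husky", "wolf", "poodle", "canine"]  # labels sorted by score
y = "husky"

p2, r2 = precision_recall_at_k(ranked, adj, y, k=2)    # both top-2 labels are relevant
p4, r4 = precision_recall_at_k(ranked, adj, y, k=4)    # two of the four are relevant
```

Here the top-1 prediction misses the true label, yet it is a one-hop neighbour of it, which is exactly the kind of near miss that plain top-1 accuracy cannot distinguish from an unrelated error.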
Top-1/top-10 distances and diameter, on the other hand, measure the coherence of predictions from a graphical perspective. Since the graph captures label relations, the labels that are closer to each other on the graph are more related. In the case of the definition-based label graph in document classification task, the graph captures semantic similarities. These metrics hence measure how centralized the predictions are with respect to the true label or themselves from a semantic perspective. Lower values indicate semantically more related labels.
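The distance-based metrics can be computed with breadth-first search on the label graph. The sketch below makes two assumptions that the text leaves open: predictions that are not connected to the true label are skipped when averaging, and the diameter of the predictions is taken as the largest shortest-path distance between any two of them, measured on the full label graph. The graph and function names are illustrative.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def top_k_distance(adj, ranked, y, k):
    """Average distance between the top-k predictions and the true label."""
    dist = shortest_path_lengths(adj, y)
    reachable = [dist[p] for p in ranked[:k] if p in dist]  # skip disconnected labels
    return sum(reachable) / len(reachable) if reachable else None

def prediction_diameter(adj, ranked, k):
    """Greatest pairwise distance among the top-k predictions (assumed definition)."""
    preds = ranked[:k]
    diameter = 0
    for p in preds:
        dist = shortest_path_lengths(adj, p)
        for q in preds:
            if q in dist:
                diameter = max(diameter, dist[q])
    return diameter

adj = {
    "canine": {"dog", "wolf"},
    "dog": {"canine", "husky", "poodle"},
    "husky": {"dog"},
    "poodle": {"dog"},
    "wolf": {"canine"},
}
ranked = ["dog", "husky", "wolf", "poodle", "canine"]
y = "husky"

top1 = top_k_distance(adj, ranked, y, k=1)   # 'dog' is one hop from 'husky'
top3 = top_k_distance(adj, ranked, y, k=3)   # distances 1, 0 and 3
diam3 = prediction_diameter(adj, ranked, k=3)
```

A smaller `top3` or `diam3` means that the predicted labels sit closer to the true label, or to each other, on the label graph.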
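Table 2: Results on the canine image dataset. Accuracy and distance are reported as top-1 (top-10). The first five rows and the last five rows correspond to the two evaluation scenarios described in Section 6.1.
|Model||Accuracy||Precision||Recall||Distance||Diameter|
|GCNTD||.42 (.74)||.15||.60||2.07 (3.46)||3.37|
|MLP1||.44 (.75)||.14||.57||2.02 (3.52)||3.86|
|MLP-CRF||.42 (.75)||.16||.67||1.97 (3.29)||4.33|
|GCNTD-id||.40 (.70)||.13||.52||2.17 (3.64)||5.00|
|GCNTD-fc||.42 (.74)||.14||.56||2.11 (3.57)||2.73|
|GCNTD||.47 (.76)||.13||.64||2.02 (3.57)||3.26|
|MLP1||.49 (.77)||.11||.62||1.95 (3.64)||3.70|
|MLP-CRF||.43 (.74)||.14||.70||1.95 (3.41)||4.25|
|GCNTD-id||.45 (.71)||.11||.56||2.10 (3.77)||6.00|
|GCNTD-fc||.48 (.75)||.12||.60||2.02 (3.69)||2.42|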
6.1 Object Recognition: Canine Image Dataset
The results are shown in Table 2, where the models were trained on all the nodes and evaluated on either all the nodes (all) or the leaf nodes only (leaf). In both cases, the models that consider the underlying label structure (i.e., GCNTD and MLP-CRF) perform better on the graph-theoretic metrics, while the MLP outperforms the GCNTD and MLP-CRF on accuracy. The MLP-CRF achieves the highest precision and recall in both evaluation scenarios. This result indicates that, on the WordNet hierarchy, the explicitly defined energy functions of a graphical model are more beneficial for predicting labels close to the ground truth than the MLP and the GCNTD.
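Table 3: Results on the document classification dataset. Accuracy and distance are reported as top-1 (top-10).
|Model||Accuracy||Precision||Recall||Distance||Diameter|
|GCNTD||.83 (.95)||.61||.18||.24 (1.35)||2.40|
|GCNTD-id||.81 (.94)||.50||.15||.28 (1.48)||2.81|
|GCNTD-fc||.84 (.96)||.50||.15||.22 (1.48)||2.82|
|MLP-CRF||.81 (.95)||.60||.16||.25 (1.34)||2.42|
|MLP1||.82 (.95)||.53||.16||.26 (1.44)||2.67|
|MLP2||.81 (.94)||.50||.14||.27 (1.49)||2.71|
|MLP4||.75 (.91)||.46||.13||.37 (1.54)||2.94|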
6.2 Document Classification
Results are shown in Table 3. In general, the GCNTD outperforms the MLP on all the metrics, and outperforms MLP-CRF on accuracies, precision, and recall. GCNTD-fc achieves the highest accuracies, indicating the benefit of message passing under the extreme case where labels are fully-connected.
Figure 2 shows the comparison between GCNTD and MLP in top-k precision and recall. This demonstrates that the GCNTD tends to find a smaller cluster of predictions that are closer to the ground truth. Since the label graph, in this case, is constructed by measuring definition similarities, the GCNTD can be thought of as making predictions that are semantically closer and more related to the ground truth: a smaller diameter indicates that the predictions are semantically closer to each other, and a smaller distance indicates that they are semantically closer to the ground truth.
7 Discussion and Conclusion
We have proposed a graph convolutional network (GCN) augmented neural network classifier to exploit an underlying graph structure of labels. The proposed approach resembles an approximate inference procedure in probabilistic graphical models, but replaces iterative inference with graph convolution layers. In the experiments on object recognition and document classification, the proposed model achieved better performance on graph-theoretic metrics than a baseline model that ignores label structures. The proposed approach can be applied to any classification task with a label graph to improve accuracy and the relevance of top predictions. It can also be used to incorporate external knowledge about labels by encoding such knowledge in the label graph.
Acknowledgments
Meihao Chen and Zhuoru Lin thank Nicholaus Halecky, Jeffery Payne, Oleg Khavronin, Patrick Kelley, and Lindsay Reynolds for discussion and support on this work. Kyunghyun Cho thanks Tencent, eBay, Facebook, Google and NVIDIA for their support, and was partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI). This work was completed during Zhuoru Lin’s internship with Bombora Inc.
References
- Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb):281–305.
- Bi and Kwok (2011) Wei Bi and James T Kwok. 2011. Multi-label classification on tree-and dag-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). pages 17–24.
- Bi and Kwok (2012) Wei Bi and James T. Kwok. 2012. Mandatory leaf node prediction in hierarchical multilabel classification. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, Curran Associates, Inc., pages 153–161.
- Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. CoRR abs/1312.6203. http://arxiv.org/abs/1312.6203.
- Chung (1997) Fan RK Chung. 1997. Spectral graph theory. 92. American Mathematical Soc.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR abs/1606.09375. http://arxiv.org/abs/1606.09375.
- Deng et al. (2014) Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. 2014. Large-scale object classification using label relation graphs. In European Conference on Computer Vision. Springer, pages 48–64.
- Ding et al. (2015) Nan Ding, Jia Deng, Kevin Murphy, and Hartmut Neven. 2015. Probabilistic label relation graphs with ising models. CoRR abs/1503.01428. http://arxiv.org/abs/1503.01428.
- Evtimova et al. (2017) Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. 2017. Emergent language in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369.
- Grauman et al. (2011) Kristen Grauman, Fei Sha, and Sung Ju Hwang. 2011. Learning a tree of metrics with disjoint visual features. In Advances in neural information processing systems. pages 621–629.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770–778.
- Hu et al. (2016) Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, and Greg Mori. 2016. Learning structured inference neural networks with label relations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Hwang et al. (2012) Sung Ju Hwang, Kristen Grauman, and Fei Sha. 2012. Semantic kernel forests from multiple taxonomies. In Advances in neural information processing systems. pages 1718–1726.
- Inan et al. (2016) Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR abs/1611.01462. http://arxiv.org/abs/1611.01462.
- Johnson et al. (2015) Justin Johnson, Lamberto Ballan, and Li Fei-Fei. 2015. Love thy neighbors: Image annotation by exploiting image metadata. In Proceedings of the IEEE international conference on computer vision. pages 4624–4632.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. CoRR abs/1607.01759. http://arxiv.org/abs/1607.01759.
- Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Lampert (2011) Christoph H Lampert. 2011. Maximum margin multi-label structured prediction. In Advances in Neural Information Processing Systems. pages 289–297.
- McAuley and Leskovec (2012) Julian McAuley and Jure Leskovec. 2012. Image labeling on a network: using social-network metadata for image classification. Computer Vision–ECCV 2012 pages 828–841.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
- Ordonez et al. (2013) Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. From large scale image categorization to entry-level categories. In Proceedings of the IEEE International Conference on Computer Vision. pages 2768–2775.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.
- Schwing and Urtasun (2015) Alexander G. Schwing and Raquel Urtasun. 2015. Fully connected deep structured networks. CoRR abs/1503.02351. http://arxiv.org/abs/1503.02351.
- Taskar et al. (2004) Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin markov networks. In Advances in neural information processing systems. pages 25–32.
- Tsochantaridis et al. (2005) Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of machine learning research 6(Sep):1453–1484.
- Zhang et al. (2017) L Zhang, SK Shah, and IA Kakadiaris. 2017. Hierarchical multi-label classification using fully associative ensemble learning. Pattern Recognition 70:89–103.