Multi-Label Text Classification using Attention-based Graph Neural Network

03/22/2020 · by Ankit Pal, et al. · Saama Technologies, Inc.

In Multi-Label Text Classification (MLTC), one sample can belong to more than one class. In most MLTC tasks, there are dependencies or correlations among labels, yet existing methods tend to ignore the relationships among labels. In this paper, a graph attention network-based model is proposed to capture the attentive dependency structure among the labels. The graph attention network uses a feature matrix and a correlation matrix to capture and explore the crucial dependencies between the labels and to generate classifiers for the task. The generated classifiers are applied to sentence feature vectors obtained from the text feature extraction network (BiLSTM) to enable end-to-end training. Attention allows the system to assign different weights to neighboring nodes per label, thus allowing it to learn the dependencies among labels implicitly. The results of the proposed model are validated on five real-world MLTC datasets. The proposed model achieves similar or better performance compared to the previous state-of-the-art models.




1 Introduction

Multi-Label Text Classification (MLTC) is the task of assigning one or more labels to each input sample in the corpus, which makes it both a challenging and essential task in Natural Language Processing (NLP). We have a set of labelled training data $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{D}$ are the input features with dimension $D$ for each data instance and $y_i \in \{0, 1\}^{L}$ are the targets over $L$ labels. The vector $y_i$ has a one in its $j$-th coordinate if the $i$-th data point belongs to the $j$-th class. We need to learn a mapping (prediction rule) between the features and the labels, such that we can predict the class label vector of a new data point correctly.
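As a concrete illustration of the target representation (a hypothetical four-class label set, not from the paper's corpora), each target is a multi-hot indicator vector:

```python
import numpy as np

# Hypothetical label set with L = 4 classes
labels = ["sports", "politics", "tech", "finance"]

# A sample tagged with {politics, finance} gets a one in each
# coordinate whose class it belongs to, zeros elsewhere.
y = np.zeros(len(labels), dtype=int)
for tag in ("politics", "finance"):
    y[labels.index(tag)] = 1

print(y.tolist())  # [0, 1, 0, 1]
```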

MLTC has many real-world applications, such as text categorization [38], tag recommendation [20], information retrieval [15], and so on. Before deep learning, solutions to the MLTC task focused on traditional machine learning algorithms.

In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020)

DOI: 10.5220/0008940304940505
ISBN: 978-989-758-395-7
Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

Different techniques have been proposed in the literature for treating multi-label classification problems. In some of them, multiple single-label classifiers are combined to emulate MLTC problems. Other techniques involve modifying single-label classifiers by changing their algorithms to allow their use in multi-label problems.

The most popular traditional method for solving MLTC is Binary Relevance (BR) [51]. BR decomposes the MLTC task into multiple independent binary classification problems. However, it ignores the correlations or dependencies among labels [28]. Binary Relevance has stimulated research into approaches that capture and explore label correlations in various ways. Some methods, including Deep Neural Network (DNN)-based and probabilistic models, have been introduced to model dependencies among labels, such as Hierarchical Text Classification [41], [47], [14], [32]. Recently, graph-based neural networks [45], e.g. Graph Convolutional Networks [23], Graph Attention Networks [43] and graph embeddings [4], have received considerable research attention. This is because many real-world problems in complex systems, such as recommendation systems [50], social networks and biological networks [12], can be modelled as machine learning tasks over large networks. The Graph Convolutional Network (GCN) was proposed to deal with graph structures. The GCN benefits from the advantages of the Convolutional Neural Network (CNN) architecture: it performs predictions with high accuracy but relatively low computational cost, by utilizing fewer parameters than a fully connected multi-layer perceptron (MLP) model, and it can capture essential sentence features that determine node properties by analyzing relations between neighboring nodes. Despite the advantages mentioned above, we suspect that the GCN still misses an essential structural feature needed to better capture correlations or dependencies between nodes.

One possible approach to improve the GCN performance is to add adaptive attention weights depending on the feature matrix to graph convolutions.

To better capture the correlations between labels, we propose a novel deep learning architecture based on graph attention networks. The proposed model with graph attention allows us to capture the dependency structure among labels for MLTC tasks. As a result, the correlations between labels can be learned automatically based on the feature matrix. We propose to learn inter-dependent sentence classifiers from prior label representations (e.g. word embeddings) via an attention-based function. We name the proposed method Multi-label text classification using Attention-based Graph Neural NETwork (MAGNET). It uses a multi-head attention mechanism to extract the correlations between labels for the MLTC task. Specifically, our contributions are the following:

  • The drawbacks of current models for the MLTC task are analyzed.

  • A novel end-to-end trainable deep network is proposed for MLTC. The model employs a Graph Attention Network (GAT) to find the correlations between labels.

  • We show that the proposed method achieves similar or better performance compared to previous state-of-the-art (SoTA) models across two MLTC metrics and five MLTC datasets.

2 Related Work

The MLTC task can be modeled as finding an optimal label sequence $y^{*}$ that maximizes the conditional probability $p(y \mid x)$, which is calculated as follows:

$$p(y \mid x) = \prod_{i=1}^{L} p\left(y_i \mid y_1, \ldots, y_{i-1}, x\right)$$
There are mainly three types of methods to solve the MLTC task:

  • Problem transformation methods

  • Algorithm adaptation methods

  • Neural network models

2.1 Problem transformation methods

Problem transformation methods transform the multi-label classification problem into one or more single-label classification or regression problems. The most popular problem transformation method is Binary Relevance (BR) (Boutell et al., 2004): BR learns a separate classifier for each label and combines the results of all classifiers into a multi-label prediction, ignoring the correlations between labels. Label Powerset (LP) treats a multi-label problem as a multi-class problem by training a multi-class classifier on all unique combinations of labels in the dataset. Classifier Chains (CC) transform the multi-label text classification problem into a Bayesian-conditioned chain of binary text classification problems. However, problem transformation methods take a lot of time and space if the dataset and label set are too large.
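Binary Relevance and Classifier Chains can be sketched with scikit-learn's `OneVsRestClassifier` and `ClassifierChain` (toy data for illustration, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Toy feature matrix (4 samples, 3 features) and multi-hot targets (2 labels)
X = np.array([[1., 0., 0.], [0., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# Binary Relevance: one independent classifier per label (ignores correlations)
br = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Classifier Chain: each classifier in the chain also sees the
# previous labels, conditioning later predictions on earlier ones
cc = ClassifierChain(LogisticRegression(), order=[0, 1]).fit(X, Y)

print(br.predict(X).shape, cc.predict(X).shape)  # (4, 2) (4, 2)
```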

2.2 Algorithm adaptation methods

Algorithm adaptation, on the other hand, adapts algorithms to handle multi-label data directly, instead of transforming the data. Clare and King (2001) construct a decision tree by modifying the C4.5 algorithm and develop resampling strategies. Elisseeff and Weston (2002) propose Rank-SVM by amending the Support Vector Machine (SVM). Zhang and Zhou (2007) propose a multi-label lazy learning approach (ML-KNN); ML-KNN exploits correlations of different labels by adapting the traditional k-nearest neighbor (KNN) algorithm. However, algorithm adaptation methods are limited to utilizing only first- or second-order label correlations.

2.3 Neural network models

In recent years, various neural network-based models have been used for the MLTC task. For example, [48] propose hierarchical attention networks (HAN), which use a GRU gating mechanism with hierarchical attention for document classification. Zhang and Zhou (2006) propose a framework called Back-propagation for Multi-Label Learning (BP-MLL) that learns ranking errors in neural networks via back-propagation. However, these types of neural networks do not perform well on high-dimensional and large-scale data.

Many CNN-based models, such as RCNN [24], the ensemble method of CNN and RNN by Chen et al. (2017), XML-CNN [27], CNN [21] and TEXTCNN [21], have been proposed to solve the MLTC task. However, they neglect the correlations between labels.

To utilise the relations between labels, some hierarchical text classification models have been proposed. The transfer learning idea proposed by [46] uses a hierarchical Support Vector Machine (SVM); [14] and [16] use hierarchical and graphical dependencies between class labels; [33] utilize graph operations on a graph of words. However, these methods are limited, as they consider only pair-wise relations due to computational constraints.

Recently, the BERT language model [11] has achieved state-of-the-art performance in many NLP tasks.

3 Magnet Architecture

3.1 Graph representation of labels

A graph consists of a feature description $H \in \mathbb{R}^{n \times d}$ and the corresponding adjacency matrix $A \in \mathbb{R}^{n \times n}$, where $n$ denotes the number of labels and $d$ denotes the number of feature dimensions.

The GAT network takes as input the node features and the adjacency matrix that represents the graph. Normally, the adjacency matrix is constructed from the samples; in our case, we do not have a graph dataset. Instead, we learn the adjacency matrix, hoping that the model will determine the graph itself and thereby learn the correlations of the labels.

Our intuition is that by modeling the correlations among labels as a weighted graph, we force the GAT network to learn a representation in which the adjacency matrix and the attention weights together capture the correlations. We use three methods to initialize the weights of the adjacency matrix. Section 3.5 explains the initialization methods in detail.

In the context of our model, the embedding vectors of the labels act as the node features, and the adjacency matrix is a learn-able parameter.

Figure 1: Illustration of the overall structure of the MAGNET model with a single graph attention layer for multi-label text classification. The input sentence, encoded using BERT embeddings, is fed to a BiLSTM to generate the feature vectors. The inputs to the graph attention network are the adjacency matrix and the label vectors. The GAT output is the attended label features, which are applied to the feature vectors obtained from the BiLSTM.

3.2 Node updating Mechanism in Graph convolution

In a Graph Convolutional Network, nodes can be updated by different types of node updating mechanisms. The basic version of GCN updates each node of the $l$-th layer, $H^{(l)}$, as follows:

$$H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right) \tag{2}$$

where $\sigma(\cdot)$ denotes an activation function, $A$ is the adjacency matrix and $W^{(l)}$ are the convolution weights of the $l$-th layer. We represent each node of the graph structure as a label; at each layer, a label's features are aggregated from its neighbors to form the label features of the next layer. In this way, features become increasingly more abstract at each consecutive layer. For example, suppose label 2 has three adjacent labels 1, 3 and 4. In this case, another way to write equation (2) is

$$h_2^{(l+1)} = \sigma\left(W^{(l)}\left(h_1^{(l)} + h_2^{(l)} + h_3^{(l)} + h_4^{(l)}\right)\right) \tag{3}$$

So the graph convolutional network sums all the label features with the same convolution weights, and the result is passed through one activation function to produce the updated node feature output.
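The basic GCN update of equation (2) can be sketched in NumPy with illustrative shapes (4 labels, 8-dimensional features; random values stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_in, d_out = 4, 8, 6            # labels, input dim, output dim
H = rng.normal(size=(n, d_in))      # node (label) features H^(l)
W = rng.normal(size=(d_in, d_out))  # shared convolution weights W^(l)

# Adjacency for the example in the text: label 2 connected to
# labels 1, 3 and 4 (0-indexed here), plus self-loops on the diagonal.
A = np.eye(n)
for j in (0, 2, 3):
    A[1, j] = A[j, 1] = 1.0

def relu(x):
    return np.maximum(x, 0.0)

# Every node sums its neighbours' features with the same weights,
# then the result passes through an activation function.
H_next = relu(A @ H @ W)
print(H_next.shape)  # (4, 6)
```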

3.3 Graph Attention Networks for multi-label classification

In GCNs, the neighborhoods of nodes are combined with equal or pre-defined weights. However, the influence of neighbors can vary greatly, and an attention mechanism can identify label importance in the correlation graph by considering the importance of each label's neighbors.

The node updating mechanism, equation (3), can be written as a linear combination of neighboring labels with attention coefficients:

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)}\right)$$

where $\alpha_{ij}^{(l)}$ is an attention coefficient which measures the importance of the $j$-th node in updating the $i$-th node of the $l$-th hidden layer. The basic expression of the attention coefficient can be written as

$$\alpha_{ij} = f\left(h_i, h_j\right)$$

The attention coefficient can be obtained typically by i) a similarity base, ii) concatenating features, or iii) coupling all features. We evaluate the attention coefficient by concatenating features:

$$\alpha_{ij} = \operatorname{softmax}_j\left(\text{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_j\right]\right)\right)$$

In our experiments, we use multi-head attention [42], which utilizes $K$ different heads to describe the relationships between labels. The operations of the layer are independently replicated $K$ times (each replication with different parameters), and the outputs are aggregated feature-wise (typically by concatenating or adding):

$$h_i' = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j\right)$$

where $\alpha_{ij}^{k}$ is the attention coefficient of label $j$ to label $i$ in the $k$-th head and $\mathcal{N}(i)$ represents the neighborhood of label $i$ in the graph. We use a cascade of GAT layers. For the first layer, the input is the label embedding matrix.


The output from the previous GAT layer is fed into the successive GAT layer, similarly to an RNN, but the GAT layer weights are not shared:

$$H^{(l+1)} = \text{GAT}\left(H^{(l)}, A\right)$$

The output from the last layer is the attended label features $H_{att} \in \mathbb{R}^{n \times D}$, where $n$ denotes the number of labels and $D$ denotes the dimension of the attended label features; this output is applied to the textual features from the BiLSTM.
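A single attention head of the layer above can be sketched in NumPy (random values stand in for the learned transform `W`, attention vector `a`, and the learnable adjacency; a real implementation would train them):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 8, 6
H = rng.normal(size=(n, d_in))      # label embeddings (node features)
A = np.ones((n, n))                 # dense learnable adjacency: all neighbours
W = rng.normal(size=(d_in, d_out))  # shared linear transform
a = rng.normal(size=(2 * d_out,))   # attention vector over [Wh_i || Wh_j]

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

Wh = H @ W
# Unnormalised attention logits from concatenated neighbour features
e = leaky_relu(np.array([[a @ np.concatenate([Wh[i], Wh[j]])
                          for j in range(n)] for i in range(n)]))
e = np.where(A > 0, e, -np.inf)     # attend only to graph neighbours

# Row-wise softmax yields the attention coefficients alpha_ij
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

H_out = np.tanh(alpha @ Wh)         # attended label features
print(H_out.shape)  # (4, 6)
```

Multi-head attention would replicate this computation K times with independent `W` and `a` and concatenate the outputs feature-wise.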

3.4 Feature vector generation

We use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to obtain the feature vectors. We use BERT to embed the words and then feed the embeddings to the BiLSTM for fine-tuning on the domain-specific task. The BiLSTM reads the text sequence from both directions and computes the hidden states for each word:

$$\overrightarrow{h_i} = \overrightarrow{\text{LSTM}}\left(\overrightarrow{h_{i-1}}, e(x_i)\right), \qquad \overleftarrow{h_i} = \overleftarrow{\text{LSTM}}\left(\overleftarrow{h_{i+1}}, e(x_i)\right)$$

We obtain the final hidden representation of the $i$-th word by concatenating the hidden states from both directions:

$$h_i = \left[\overrightarrow{h_i} \,;\, \overleftarrow{h_i}\right]$$

where $x$ is the sentence, $e(\cdot)$ is the BERT embedding function, and the RNN parameters and BERT parameters are trained jointly; the dimension of $h_i$ is twice the hidden size of the BiLSTM. Later we multiply the feature vectors with the attended label features to get the final prediction scores:

$$\hat{y} = F \, H_{att}^{T}$$

where $H_{att} \in \mathbb{R}^{n \times D}$ is the attended label features and $F$ is the feature vectors obtained from the BiLSTM model. Figure 1 shows the overall structure of the proposed model.
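The final scoring step, in which the attended label features act as classifiers over the sentence features, reduces to a matrix product followed by a sigmoid (shapes are illustrative, random values stand in for learned features):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, hidden, n_labels = 2, 12, 4

F = rng.normal(size=(batch, hidden))         # sentence features from BiLSTM
H_att = rng.normal(size=(n_labels, hidden))  # attended label features from GAT

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each label's attended feature vector scores each sentence,
# giving one probability per (sentence, label) pair.
scores = sigmoid(F @ H_att.T)                # (batch, n_labels), in (0, 1)
pred = (scores > 0.5).astype(int)            # multi-hot prediction
print(scores.shape)  # (2, 4)
```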

3.5 Adjacency matrix generation

In this section, we explain how we initialize the adjacency matrix for the GAT network. We use three different methods to initialize its weights.

  • Identity Matrix

    We use the identity matrix as the adjacency matrix: ones on the main diagonal and zeros elsewhere, i.e., we start from zero correlation.

  • Xavier initialization

    We use Xavier initialization [13] to initialize the weights of the adjacency matrix:

$$W \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$$

    where $n$ is the number of incoming network connections.

  • Correlation matrix

    The correlation matrix is constructed by counting the pairwise co-occurrences of labels. The frequency vector $F \in \mathbb{R}^{n}$, where $n$ is the number of labels, holds the frequency of each label in the overall training set. The co-occurrence matrix is then normalized by the frequency vector:

    A = M / F (15)

    where $M$ is the co-occurrence matrix and $F$ is the frequency vector of the individual labels. This is similar to how the correlation matrix is built in [7], except that we do not employ binarization.
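The co-occurrence initialization of equation (15) can be sketched as follows (a toy multi-hot target matrix, not the paper's data; we normalize each row by its label's frequency, since the paper does not state the normalization axis explicitly):

```python
import numpy as np

# Toy training targets: 5 samples, 3 labels (multi-hot rows)
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

M = Y.T @ Y         # M[i, j]: number of samples where labels i and j co-occur
F = Y.sum(axis=0)   # F[i]: frequency of label i in the training set
A = M / F[:, None]  # normalise row i by the frequency of label i

print(np.round(A, 2))
```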

3.6 Loss function

We use binary cross-entropy as the loss function. If the ground-truth label vector of a data point is $y \in \{0, 1\}^{n}$, the loss is

$$\mathcal{L} = -\sum_{c=1}^{n} \left[\, y_c \log\left(\sigma(\hat{y}_c)\right) + \left(1 - y_c\right) \log\left(1 - \sigma(\hat{y}_c)\right) \right]$$

where $\sigma(\cdot)$ is the sigmoid activation function.

4 Experiment

In this section, we introduce the datasets, experimental details, and baselines. Subsequently, we compare the proposed method with the baselines.

4.1 Datasets

In this section, we describe the datasets used in the experiments and their sources. Table 2 shows the statistics of all datasets.
Reuters-21578 is a collection of documents collected from the Reuters newswire in 1987. The Reuters-21578 test collection, together with its earlier variants, has been a standard benchmark for text categorization (TC) [8]. It contains 10,788 documents, with 8,630 documents for training and 2,158 for testing, and a total of 90 categories.
RCV1-V2, provided by Lewis et al. (2004) [26], consists of categorized newswire stories made available by Reuters Ltd. Each newswire story can have multiple topics assigned to it, with 103 topics in total. RCV1-V2 contains 804,414 documents, divided into 643,531 documents for training and 160,883 for testing.
The Arxiv Academic Paper Dataset (AAPD), provided by Yang et al. (2018), consists of the abstracts and corresponding subjects of 55,840 academic papers; each paper can have multiple subjects. The target is to predict the subjects of an academic paper from the content of its abstract. The AAPD dataset is divided into 44,672 documents for training and 11,168 for testing, with a total of 54 classes.
The Slashdot dataset was collected from the Slashdot website and consists of article blurbs labeled with subject categories. Slashdot contains 19,258 samples for training and 4,814 samples for testing, with a total of 291 classes.
The Toxic Comment dataset comes from Kaggle. It contains a large number of comments from Wikipedia talk page edits, which human raters have labeled for toxic behavior.

4.2 Experiment details

We implement our experiments in TensorFlow on an NVIDIA 1080Ti GPU. Our model consists of two GAT layers with multi-head attention. Table 1 shows the hyper-parameters of the model on the five datasets. For label representations, we adopt 768-dimensional BERT embeddings trained on Wikipedia and BookCorpus. For categories whose names contain multiple words, we obtain the label representation as the average of the embeddings of all words. For all datasets, the batch size is set to 250, and out-of-vocabulary (OOV) words are replaced with unk. We use BERT embeddings to encode the sentences. We use the Adam optimizer to minimize the final objective function. The learning rate is initialized to 0.001; we use dropout of 0.5 (Srivastava et al., 2014) to avoid overfitting and clip the gradients (Pascanu, Mikolov, and Bengio, 2013) to a maximum norm of 10.

Dataset Vocab Size Embed size Hidden size Attention heads
Reuters-21578 20,000 768 250 4
RCV1-V2 50,000 768 250 8
AAPD 30,000 768 250 8
Slashdot 30,000 768 300 4
Toxic Comment 50,000 768 200 8
Table 1: Main experimental hyper-parameters

4.3 Performance Evaluation


Micro-averaged F1 (miF1): in the micro-average method, the individual true positives, false positives, and false negatives of the system are summed over all classes and used to compute the micro-averaged F-score:

$$\text{miF1} = \frac{2 \sum_{c} tp_c}{2 \sum_{c} tp_c + \sum_{c} fp_c + \sum_{c} fn_c}$$

Hamming loss (HL): Hamming loss is the fraction of labels that are incorrectly predicted [9]. It therefore accounts for both incorrectly predicted labels and missing labels, normalized over the total number of classes and the total number of examples.

Domain #Train #Test Labels
Reuters-21578 Text 8,630 2,158 90
RCV1-V2 Text 643,531 160,883 103
AAPD Text 44,672 11,168 54
Slashdot Text 19,258 4,814 291
Toxic Comment Text 126,856 31,714 7
Table 2: Statistics of the datasets

$$HL = \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbb{1}\left[\, y_{ij} \neq \hat{y}_{ij} \,\right]$$

where $y$ is the target and $\hat{y}$ is the prediction. Ideally, we would expect a Hamming loss of $0$, which would imply no error; in practice, the smaller the Hamming loss, the better the performance of the learning algorithm.
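Both metrics can be computed directly from the target and prediction matrices (a NumPy sketch with toy values):

```python
import numpy as np

# Toy targets and predictions: 3 samples, 4 labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])

# Micro-averaged F1: pool tp/fp/fn over all labels, then compute F1 once
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
micro_f1 = 2 * tp / (2 * tp + fp + fn)

# Hamming loss: fraction of label slots predicted incorrectly
hamming = np.mean(y_true != y_pred)

print(round(micro_f1, 3), round(hamming, 3))  # 0.833 0.167
```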

4.4 Comparison of methods

We compare the performance of 27 algorithms, including state-of-the-art models. Furthermore, we compare the latest state-of-the-art models on the RCV1-V2 dataset. The compared algorithms can be categorized into three groups, as described below:

  • Flat baselines

    Flat baseline models transform the documents into features using TF-IDF [36], and then use those features as input to Logistic Regression (LR) [1], Support Vector Machine (SVM) [18], Hierarchical Support Vector Machine (HSVM) [44], Binary Relevance (BR) [3], or Classifier Chains (CC) [37]. Flat baseline methods ignore the relations between words and the dependencies between labels.

  • Sequence, graph and N-gram based models

    These models first transform the text dataset into sequences of words, graphs of words, or N-gram features, and then apply different deep learning models to those features, including CNN [22], CNN-RNN [5], RCNN [24], DCNN [39], XML-CNN [27], HR-DGCNN [33], Hierarchical LSTM (HLSTM) [6], a multi-label classification approach based on a conditional cyclic directed graphical model (CDN-SVM) [17], the Hierarchical Attention Network (HAN) [48] and the Bi-directional Block Self-Attention Network (Bi-BloSAN) [40]. For example, HAN uses a GRU gating mechanism to encode the sequences and applies word- and sentence-level attention to those sequences for document classification. Bi-BloSAN uses intra-block and inter-block self-attention to capture both local and long-range context dependencies by splitting the sequences into several blocks.

  • Recent state-of-the-art models

    We compare our model with different state-of-the-art models for the multi-label classification task, including BP-MLL [31], Input Encoding with Feature Message Passing (FMP) [25], TEXT-CNN [21], the hierarchical taxonomy-aware and attentional graph capsule recurrent CNN framework (HE-AGCRCNN) [32], BOW-CNN [19], Capsule-B networks [52], Hierarchical Text Classification with Reinforced Label Assignment (HiLAP) [29], Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN (HR-DGCNN) [33], the Hierarchical Transfer Learning-based Strategy (HTrans) [2], BERT (Bidirectional Encoder Representations from Transformers) [11] and BERT-SGM [49]. For example, FMP + LaMP is a variant of the LaMP model that uses Input Encoding with Feature Message Passing (FMP); it achieves state-of-the-art accuracy across five metrics and seven datasets. HE-AGCRCNN, another recent state-of-the-art model that has shown outstanding performance in large-scale multi-label text classification, uses a hierarchical taxonomy embedding method to learn the hierarchical relations among the labels. BERT is a recent pre-trained language model that has shown groundbreaking results in many NLP tasks; it uses an attention mechanism (the Transformer) to learn contextual relations between words in a text.

5 Performance Analysis

In this section, we compare our proposed method with the baselines on the test sets. Table 4 shows detailed comparisons of the micro-F1 score for various state-of-the-art models.

5.1 Comparisons with State-of-the-art

First, we compare the results of the traditional machine learning algorithms. Among LR, SVM and HSVM, HSVM performs better than the other two; HSVM uses an SVM at each node of the decision tree. Next, we compare the results of hierarchical, CNN-based, and graph-based deep learning models. Among the hierarchical models HLSTM, HAN, HE-AGCRCNN, HR-DGCNN, HiLAP and HTrans, HE-AGCRCNN performs best. HAN and HLSTM are based on recurrent neural networks. Comparing the recurrent models with the flat baselines, the recurrent neural networks perform worse than HSVM even though the baseline models ignore label dependencies; RNNs face vanishing and exploding gradients when the sequences are too long. The graph model HR-DGCNN performs better than the recurrent and baseline models. Among the CNN-based models RCNN, XML-CNN, DCNN, TEXTCNN, CNN and CNN-RNN, TEXTCNN performs best, while RCNN performs worst.

Sequence generator models treat the multi-label classification task as sequence generation. Comparing the sequence generator models SGM-GE and seq2seq, SGM performs better than the seq2seq network; SGM exploits the correlations between labels using a sequence generator model with a novel decoder structure.

Comparing the proposed MAGNET against the state-of-the-art models, MAGNET significantly improves on previous state-of-the-art results: we see a ~20-point improvement in miF1 over the HSVM model. Compared with the best hierarchical text classification models, we observe miF1 improvements of ~11, ~19, ~5 and ~8 points over HE-AGCRCNN, HAN, HiLAP and HTrans, respectively. The proposed model produces a ~16-point miF1 improvement over the popular Bi-directional Block Self-Attention Network (Bi-BloSAN).

Compared with the CNN-group models, the proposed model improves performance by ~12 and ~6 points over the TEXTCNN and BOW-CNN methods, respectively. MAGNET achieves a ~2-point improvement over the state-of-the-art BERT model.

5.2 Evaluation on other datasets

We also evaluate our proposed model on four datasets other than RCV1 to observe its performance on datasets that vary in the number of samples and the number of labels. Table 3 shows the miF1 scores for the different datasets, and we also report the Hamming loss in Table 5. The evaluation results show that the proposed method achieves the best performance on the primary evaluation metrics. We observe ~3- and ~4-point miF1 improvements on the AAPD and Slashdot datasets, respectively, compared to the CNN-RNN method.

Methods Reuters-21578 AAPD Slashdot Toxic
BR 0.878 0.648 0.486 0.853

0.872 0.682 0.516 0.874
CC 0.879 0.654 0.480 0.893
CNN 0.863 0.664 0.512 0.775
CNN-RNN 0.855 0.669 0.530 0.904
MAGNET 0.899 0.696 0.568 0.930

Table 3: Comparisons of Micro F1-score for various models on four benchmark datasets.

5.3 Analysis and Discussion

Here we present further analysis of the model and experimental results. We report evaluation results in terms of Hamming loss and micro-F1 score. We use a moving average with a window size of 3 when drawing the plots, to make them easier to read.

5.3.1 Impact of initialization of the adjacency matrix

We initialized the adjacency matrix in three different ways: random (Xavier), identity, and the co-occurrence matrix. We hypothesized that the co-occurrence matrix would perform best, since it feeds the model richer prior information than the identity matrix (zero correlation) or a random matrix. To our surprise, random initialization performed best at 0.887, and the identity matrix performed worst at 0.865, with the co-occurrence matrix achieving a micro-F1 score of 0.878. The Xavier initializer performed best, and the other random initializers also outperformed the co-occurrence and identity matrices. This suggests that the textual information in the samples contains richer information than the label co-occurrence matrix used to initialize the adjacency, and that both the co-occurrence and identity matrices trap the model in a local minimum.

Method Accuracy

SVM 0.691
HSVM 0.693
HLSTM 0.673
RCNN 0.686
XML-CNN 0.695
HAN 0.696
Bi-BloSAN 0.72
DCNN 0.732
SGM+GE 0.719
CDN-SVM 0.738
HR-DGCNN 0.761
BP-MLL 0.780
HTrans 0.805
BOW-CNN 0.827
HiLAP 0.833
BERT 0.864
BERT + SGM 0.846
FMP + LaMP 0.877
MAGNET 0.885
Table 4: Comparisons of Micro F1-score for various state-of-the-art models on Rcv1-v2 dataset.
Methods Rcv1-v2 AAPD Reuters-21578 Slashdot Toxic
BR 0.0093 0.0316 0.0032 0.052 0.034
CC 0.0089 0.0306 0.0031 0.057 0.030
CNN 0.0084 0.0287 0.0033 0.049 0.039
CNN-RNN 0.0086 0.0282 0.0037 0.046 0.025
MAGNET 0.0079 0.0252 0.0029 0.039 0.022

Table 5: Comparisons of Hamming loss for various models on five benchmark datasets; the smaller the value, the better.

5.3.2 Results on different types of word embeddings

In this section, we investigate the impact of four different word embeddings on our proposed architecture: Word2Vec embeddings [30], GloVe embeddings [34], random embeddings, and BERT embeddings [10]. Figures 2 and 3 show the F1 scores of the four word embeddings on the (unseen) test set of Reuters-21578.

Figure 2: Performance of different types of word embeddings with MAGNET. The x-axis refers to the type of word embedding and the y-axis to the F1 score.

Accordingly, we make the following observations:

  • GloVe and Word2Vec embeddings produce similar results.

  • Random embeddings perform worse than the other embeddings; pre-trained word embeddings prove highly useful in our proposed architecture compared to random embeddings.

  • BERT embeddings outperform the other embeddings in this experiment; using BERT feature embeddings therefore increases the accuracy and performance of our architecture.

Our proposed model uses BERT embeddings for encoding the sentences.

Figure 3: Performance of the proposed model with different types of word embeddings. The x-axis is the number of epochs and the y-axis the micro-F1 score.

5.3.3 Comparison between Two different Graph neural networks

In this section, we compare the performance of the GAT and GCN networks. The critical difference between GAT and GCN is how the information is aggregated from the neighborhood: GAT computes the hidden states of each node by attending over its neighbors, following a self-attention strategy, whereas GCN produces a normalized sum of the node features of the neighbors.

GAT improves the average miF1 score by ~4 points over the GCN model, which shows that the GAT model captures label correlations better than GCN. The attention mechanism can identify label importance in the correlation graph by considering the significance of neighbor labels.

Figure 4: Performance of GAT vs GCN. x-axis is number of epochs and y-axis is micro-F1 score.

Figure 4 shows the accuracy of both networks on the Reuters-21578 dataset.

5.4 Conclusion

To model and capture the correlations between labels, we proposed a GAT-based model for multi-label text classification. The proposed approach can improve the accuracy and efficiency of models and can work across a wide range of data types and applications.

We evaluated the proposed model on various datasets and presented the results. The combination of GAT with bi-directional LSTM shows that it has achieved consistently higher accuracy than those obtained by conventional approaches.

Even though our proposed model performs very well, some limitations remain. When the dataset contains a large number of labels, the correlation matrix becomes very large and training the model can be difficult. Our work alleviates this problem to some extent, but we believe the exploration of more effective solutions remains vital future work.


  • [1] P. Allison (1999) Logistic regression using sas®: theory and application. First edition, SAS Publishing. External Links: ISBN 9781580253529 Cited by: 1st item.
  • [2] S. Banerjee, C. Akkaya, F. Perez-Sorrosal, and K. Tsioutsiouliklis (2019) Hierarchical transfer learning for multi-label text classification. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 6295–6300. External Links: Link Cited by: 3rd item.
  • [3] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown (2004) Learning multi-label scene classification. Pattern Recognition 37 (9), pp. 1757–1771. External Links: ISSN 0031-3203, Document, Link Cited by: 1st item.
  • [4] H. Cai, V. W. Zheng, and K. C. Chang (2017) A comprehensive survey of graph embedding: problems, techniques and applications. CoRR abs/1709.07604. External Links: Link, 1709.07604 Cited by: §1.
  • [5] G. Chen, D. Ye, Z. Xing, J. Chen, and E. Cambria (2017-05) Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In 2017 International Joint Conference on Neural Networks (IJCNN), Vol. , pp. 2377–2383. External Links: Document, ISSN Cited by: 2nd item.
  • [6] H. Chen, M. Sun, C. Tu, Y. Lin, and Z. Liu (2016) Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1650–1659. External Links: Link Cited by: 2nd item.
  • [7] Z. Chen, X. Wei, P. Wang, and Y. Guo (2019) Multi-label image recognition with graph convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: 3rd item.
  • [8] F. Debole and F. Sebastiani (2005-04) An analysis of the relative hardness of reuters-21578 subsets: research articles. J. Am. Soc. Inf. Sci. Technol. 56 (6), pp. 584–596. External Links: ISSN 1532-2882, Link, Document Cited by: §4.1.
  • [9] S. Destercke (2014) Multilabel prediction with probability sets: the hamming loss case. In Information Processing and Management of Uncertainty in Knowledge-Based Systems - 15th International Conference, IPMU 2014, Montpellier, France, July 15-19, 2014, Proceedings, Part II, pp. 496–505. External Links: Link, Document Cited by: §4.3.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. External Links: Link Cited by: §5.3.2.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §2.3, 3rd item.
  • [12] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, USA, pp. 6533–6542. External Links: ISBN 978-1-5108-6096-4, Link Cited by: §1.
  • [13] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pp. 249–256. External Links: Link Cited by: 2nd item.
  • [14] S. Gopal, Y. Yang, B. Bai, and A. Niculescu-Mizil (2012) Bayesian models for large-scale hierarchical classification. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pp. 2420–2428. External Links: Link Cited by: §1, §2.3.
  • [15] S. Gopal and Y. Yang (2010) Multilabel classification with meta-level features. pp. 315–322. External Links: Link, Document Cited by: §1.
  • [16] S. Gopal and Y. Yang (2015) Hierarchical bayesian inference and recursive regularization for large-scale classification. TKDD 9 (3), pp. 18:1–18:23. External Links: Link, Document Cited by: §2.3.
  • [17] Y. Guo and S. Gu (2011) Multi-label classification using conditional dependency networks. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, pp. 1300–1305. External Links: Link, Document Cited by: 2nd item.
  • [18] M. A. Hearst (1998-07) Support vector machines. IEEE Intelligent Systems 13 (4), pp. 18–28. External Links: ISSN 1541-1672, Link, Document Cited by: 1st item.
  • [19] R. Johnson and T. Zhang (2014) Effective use of word order for text categorization with convolutional neural networks. CoRR abs/1412.1058. External Links: Link, 1412.1058 Cited by: 3rd item.
  • [20] I. Katakis, G. Tsoumakas, and I. Vlahavas (2008) Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD 2008 Discovery Challenge, Cited by: §1.
  • [21] Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1746–1751. External Links: Link Cited by: §2.3, 3rd item.
  • [22] Y. Kim (2014) Convolutional neural networks for sentence classification. CoRR abs/1408.5882. External Links: Link, 1408.5882 Cited by: 2nd item.
  • [23] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907. External Links: Link, 1609.02907 Cited by: §1.
  • [24] S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pp. 2267–2273. External Links: Link Cited by: §2.3, 2nd item.
  • [25] J. Lanchantin, A. Sekhon, and Y. Qi (2019) Neural message passing for multi-label classification. CoRR abs/1904.08049. External Links: Link, 1904.08049 Cited by: 3rd item.
  • [26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li (2004-12) RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, pp. 361–397. External Links: ISSN 1532-4435, Link Cited by: §4.1.
  • [27] J. Liu, W. Chang, Y. Wu, and Y. Yang (2017) Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, pp. 115–124. External Links: Link, Document Cited by: §2.3, 2nd item.
  • [28] O. Luaces, J. Díez, J. Barranquero, J. J. del Coz, and A. Bahamonde (2012-12-01) Binary relevance efficacy for multilabel classification. Progress in Artificial Intelligence 1 (4), pp. 303–313. External Links: ISSN 2192-6360, Document, Link Cited by: §1.
  • [29] Y. Mao, J. Tian, J. Han, and X. Ren (2019) Hierarchical text classification with reinforced label assignment. CoRR abs/1908.10419. External Links: Link, 1908.10419 Cited by: 3rd item.
  • [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pp. 3111–3119. External Links: Link Cited by: §5.3.2.
  • [31] J. Nam, J. Kim, E. Loza Mencía, I. Gurevych, and J. Fürnkranz (2014) Large-scale multi-label text classification - revisiting neural networks. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part II, pp. 437–452. External Links: Link, Document Cited by: 3rd item.
  • [32] H. Peng, J. Li, Q. Gong, S. Wang, L. He, B. Li, L. Wang, and P. S. Yu (2019) Hierarchical taxonomy-aware and attentional graph capsule rcnns for large-scale multi-label text classification. CoRR abs/1906.04898. External Links: Link, 1906.04898 Cited by: §1, 3rd item.
  • [33] H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and Q. Yang (2018) Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 1063–1072. External Links: Link, Document Cited by: §2.3, 2nd item, 3rd item.
  • [34] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1532–1543. External Links: Link Cited by: §5.3.2.
  • [35] J. R. Quinlan (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 1-55860-238-0 Cited by: §2.2.
  • [36] J. Ramos () Using tf-idf to determine word relevance in document queries. Cited by: 1st item.
  • [37] J. Read, B. Pfahringer, G. Holmes, and E. Frank (2011) Classifier chains for multi-label classification. Machine Learning 85 (3), pp. 333–359. External Links: Link, Document Cited by: 1st item.
  • [38] R. E. Schapire and Y. Singer (2000) BoosTexter: A boosting-based system for text categorization. Machine Learning 39 (2/3), pp. 135–168. External Links: Link, Document Cited by: §1.
  • [39] H. Schwenk, L. Barrault, A. Conneau, and Y. LeCun (2017) Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pp. 1107–1116. External Links: Link Cited by: 2nd item.
  • [40] T. Shen, T. Zhou, G. Long, J. Jiang, and C. Zhang (2018) Bi-directional block self-attention for fast and memory-efficient sequence modeling. CoRR abs/1804.00857. External Links: Link, 1804.00857 Cited by: 2nd item.
  • [41] A. Sun and E. Lim (2001) Hierarchical text classification and evaluation. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November - 2 December 2001, San Jose, California, USA, pp. 521–528. External Links: Link, Document Cited by: §1.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §3.3.
  • [43] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1.
  • [44] V. Vural and J. G. Dy (2004) A hierarchical method for multi-class support vector machines. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, New York, NY, USA, pp. 105–. External Links: ISBN 1-58113-838-5, Link, Document Cited by: 1st item.
  • [45] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596. External Links: Link, 1901.00596 Cited by: §1.
  • [46] L. Xiao, D. Zhou, and M. Wu (2011) Hierarchical classification via orthogonal transfer. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 801–808. External Links: Link Cited by: §2.3.
  • [47] G. Xue, D. Xing, Q. Yang, and Y. Yu (2008) Deep classification in large-scale text hierarchies. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008, pp. 619–626. External Links: Link, Document Cited by: §1.
  • [48] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy (2016) Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 1480–1489. External Links: Link Cited by: §2.3, 2nd item.
  • [49] R. Yarullin and P. Serdyukov (2019) BERT for sequence-to-sequence multi-label text classification. External Links: Link Cited by: 3rd item.
  • [50] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. CoRR abs/1806.01973. External Links: Link, 1806.01973 Cited by: §1.
  • [51] M. Zhang, Y. Li, X. Liu, and X. Geng (2018-04-01) Binary relevance for multi-label learning: an overview. Frontiers of Computer Science 12 (2), pp. 191–202. External Links: ISSN 2095-2236, Document, Link Cited by: §1.
  • [52] W. Zhao, J. Ye, M. Yang, Z. Lei, S. Zhang, and Z. Zhao (2018) Investigating capsule networks with dynamic routing for text classification. CoRR abs/1804.00538. External Links: Link, 1804.00538 Cited by: 3rd item.