Sequence Generation Model for Multi-label Classification
Multi-label classification is an important yet challenging task in natural language processing. It is more complex than single-label classification in that the labels tend to be correlated. Existing methods tend to ignore the correlations between labels. Besides, different parts of the text can contribute differently for predicting different labels, which is not considered by existing models. In this paper, we propose to view the multi-label classification task as a sequence generation problem, and apply a sequence generation model with a novel decoder structure to solve it. Extensive experimental results show that our proposed methods outperform previous work by a substantial margin. Further analysis of experimental results demonstrates that the proposed methods not only capture the correlations between labels, but also select the most informative words automatically when predicting different labels.
This work is licenced under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
Multi-label classification (MLC) is an important task in the field of natural language processing (NLP), which can be applied in many real-world scenarios, such as text categorization [Schapire and Singer2000], tag recommendation [Katakis et al.2008], information retrieval [Gopal and Yang2010], and so on. The target of the MLC task is to assign multiple labels to each instance in the dataset.
Binary relevance (BR) [Boutell et al.2004] is one of the earliest attempts to solve the MLC task by transforming the MLC task into multiple single-label classification problems. However, it neglects the correlations between labels. Classifier chains (CC), proposed by Read et al. (2011), convert the MLC task into a chain of binary classification problems to model the correlations between labels. However, this approach is computationally expensive for large datasets. Other methods such as ML-DT [Clare and King2001], Rank-SVM [Elisseeff and Weston2002], and ML-KNN [Zhang and Zhou2007] can only be used to capture the first or second order label correlations or are computationally intractable when high-order label correlations are considered.
In recent years, neural networks have achieved great success in the field of NLP. Some neural network models have also been applied to the MLC task and achieved important progress. For instance, a fully connected neural network with a pairwise ranking loss function is utilized in r5. r4 propose to perform classification using a convolutional neural network (CNN). Chen et al. (2017) use a CNN and a recurrent neural network (RNN) to capture the semantic information of texts. However, these models either neglect the correlations between labels or do not consider differences in the contributions of textual content when predicting labels.
In this paper, inspired by the tremendous success of the sequence-to-sequence (Seq2Seq) model in machine translation [Bahdanau et al.2014, Luong et al.2015, Sun et al.2017], abstractive summarization [Rush et al.2015, Lin et al.2018], style transfer [Shen et al.2017, Xu et al.2018] and other domains, we propose a sequence generation model with a novel decoder structure to solve the MLC task. The proposed sequence generation model consists of an encoder and a decoder with the attention mechanism. The decoder uses an LSTM to generate labels sequentially, and predicts the next label based on its previously predicted labels. Therefore, the proposed model can consider the correlations between labels by processing label sequence dependencies through the LSTM structure. Furthermore, the attention mechanism considers the contributions of different parts of text when the model predicts different labels. In addition, a novel decoder structure with global embedding is proposed to further improve the performance of the model by incorporating overall informative signals.
The contributions of this paper are listed as follows:
We propose to view the MLC task as a sequence generation problem to take the correlations between labels into account.
We propose a sequence generation model with a novel decoder structure, which not only captures the correlations between labels, but also selects the most informative words automatically when predicting different labels.
Extensive experimental results show that our proposed methods outperform the baselines by a large margin. Further analysis demonstrates the effectiveness of the proposed methods on correlation representation.
We introduce our proposed methods in detail in this section. First, we give an overview of the model in Section 2.1. Second, we explain the details of the proposed sequence generation model in Section 2.2. Finally, Section 2.3 presents our novel decoder structure.
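First of all, we define some notations and describe the MLC task. Given the label space $\mathcal{L} = \{l_1, l_2, \dots, l_L\}$ with $L$ labels and a text sequence $\boldsymbol{x}$ containing $m$ words, the task is to assign a subset $\boldsymbol{y}$ containing $n$ labels in the label space $\mathcal{L}$ to $\boldsymbol{x}$. Unlike traditional single-label classification where only one label is assigned to each sample, each sample in the MLC task can have multiple labels. From the perspective of sequence generation, the MLC task can be modeled as finding an optimal label sequence $\boldsymbol{y}^*$ that maximizes the conditional probability $p(\boldsymbol{y} \mid \boldsymbol{x})$, which is calculated as follows:
$$p(\boldsymbol{y} \mid \boldsymbol{x}) = \prod_{i=1}^{n} p(y_i \mid y_1, y_2, \dots, y_{i-1}, \boldsymbol{x}) \tag{1}$$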
An overview of our proposed model is shown in Figure 1. First, we sort the label sequence of each sample according to the frequency of the labels in the training set. High-frequency labels are placed in the front. In addition, the bos and eos symbols are added to the head and tail of the label sequence, respectively.
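The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration: the marker strings `<bos>` and `<eos>`, the function name `build_label_sequences`, and the alphabetical tie-breaking rule are assumptions made for the example, not details given in the paper.

```python
from collections import Counter

# Hypothetical marker strings for the begin/end symbols.
BOS, EOS = "<bos>", "<eos>"


def build_label_sequences(label_sets):
    """Order each sample's labels by descending training-set frequency
    and wrap the result in begin/end markers."""
    freq = Counter(label for labels in label_sets for label in labels)
    sequences = []
    for labels in label_sets:
        # Assumption: ties are broken alphabetically to keep the order deterministic.
        ordered = sorted(labels, key=lambda label: (-freq[label], label))
        sequences.append([BOS] + ordered + [EOS])
    return sequences


train_labels = [{"C15", "CCAT"}, {"CCAT", "GCAT"}, {"CCAT", "C15", "C152"}]
sequences = build_label_sequences(train_labels)
print(sequences[2])
```

Because the frequencies are counted on the training set only, the same ordering can be reused unchanged for validation and test targets.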
The text sequence $\boldsymbol{x}$ is encoded to the hidden states, which are aggregated to a context vector $c_t$ by the attention mechanism at time-step $t$. The decoder takes the previous context vector $c_{t-1}$, the last hidden state $s_{t-1}$ of the decoder and the embedding vector $g(y_{t-1})$ as the inputs to produce the hidden state $s_t$ at time-step $t$. Here $y_{t-1}$ is the predicted probability distribution over the label space $\mathcal{L}$ at time-step $t-1$. The function $g$ takes $y_{t-1}$ as input and produces the embedding vector which is then passed to the decoder. Finally, the masked softmax layer is used to output the probability distribution $y_t$.
In this subsection, we introduce the details of our proposed model. The whole sequence generation model consists of an encoder and a decoder with the attention mechanism.
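Encoder: Let $(w_1, w_2, \dots, w_m)$ be a sentence with $m$ words, where $w_i$ is the one-hot representation of the $i$-th word. We first embed $w_i$ to a dense embedding vector $x_i$ by an embedding matrix $E \in \mathbb{R}^{k \times |\mathcal{V}|}$. Here $|\mathcal{V}|$ is the size of the vocabulary, and $k$ is the dimension of the embedding vector.
We use a bidirectional LSTM [Hochreiter and Schmidhuber1997] to read the text sequence $\boldsymbol{x}$ from both directions and compute the hidden states for each word:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{i-1}, x_i) \tag{2}$$
$$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{i+1}, x_i) \tag{3}$$
We obtain the final hidden representation of the $i$-th word by concatenating the hidden states from both directions, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, which embodies the information of the sequence centered around the $i$-th word.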
Attention: When the model predicts different labels, not all text words make the same contribution. The attention mechanism produces a context vector by focusing on different portions of the text sequence and aggregating the hidden representations of those informative words. Specifically, the attention mechanism assigns the weight $\alpha_{ti}$ to the $i$-th word at time-step $t$ as follows:
$$e_{ti} = v_a^{\top} \tanh(W_a s_t + U_a h_i) \tag{4}$$
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{m} \exp(e_{tj})} \tag{5}$$
where $W_a$, $U_a$, $v_a$ are weight parameters and $s_t$ is the current hidden state of the decoder at time-step $t$. For simplicity, all bias terms are omitted in this paper. The final context vector $c_t$ which is passed to the decoder at time-step $t$ is calculated as follows:
$$c_t = \sum_{i=1}^{m} \alpha_{ti} h_i \tag{6}$$
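The attention step can be illustrated with a small pure-Python sketch. It assumes the usual additive form, a score $v_a^{\top} \tanh(W_a s_t + U_a h_i)$ per word followed by a softmax over words; the helper names `matvec` and `attention` and the toy weights are hypothetical and are not taken from the authors' code.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention(s_t, hidden_states, W_a, U_a, v_a):
    """Return (alpha, context): attention weights over the words and the
    weighted sum of encoder hidden states for decoder state s_t."""
    projected_state = matvec(W_a, s_t)
    scores = []
    for h_i in hidden_states:
        projected_h = matvec(U_a, h_i)
        activated = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(v * a for v, a in zip(v_a, activated)))
    # Softmax over words, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]
    return alpha, context


# Toy example: three encoder states of dimension 2 and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = attention([0.5, -0.5], hidden_states, identity, identity, [1.0, 1.0])
print(alpha, context)
```

The weights always sum to one, so the context vector is a convex combination of the encoder hidden states.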
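Decoder: The hidden state $s_t$ of the decoder at time-step $t$ is computed as follows:
$$s_t = \mathrm{LSTM}(s_{t-1}, [g(y_{t-1}); c_{t-1}]) \tag{7}$$
where $[g(y_{t-1}); c_{t-1}]$ means the concatenation of the vectors $g(y_{t-1})$ and $c_{t-1}$. $g(y_{t-1})$ is the embedding of the label which has the highest probability under the distribution $y_{t-1}$. $y_t$ is the probability distribution over the label space $\mathcal{L}$ at time-step $t$ and is computed as follows:
$$o_t = W_o f(W_d s_t + V_d c_t) \tag{8}$$
$$y_t = \mathrm{softmax}(o_t + I_t) \tag{9}$$
where $W_o$, $W_d$, and $V_d$ are weight parameters, $I_t \in \mathbb{R}^{L}$ is the mask vector that is used to prevent the decoder from predicting repeated labels, and $f$ is a nonlinear activation function.
$$(I_t)_i = \begin{cases} -\infty & \text{if the } i\text{-th label has been predicted at the previous } t-1 \text{ time steps,} \\ 0 & \text{otherwise.} \end{cases} \tag{10}$$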
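A masked softmax of this kind can be sketched as follows. The sketch assumes the mask adds negative infinity to the score of every label that has already been generated and zero to all others, which is one natural way to rule out repeated labels; the function name `masked_softmax` and the toy scores are illustrative.

```python
import math


def masked_softmax(scores, already_predicted):
    """Softmax over label scores after adding a mask that is -inf for
    labels generated at earlier time steps and 0 for all other labels."""
    masked = [-math.inf if i in already_predicted else s for i, s in enumerate(scores)]
    # Assumption: at least one label (for example the end symbol) is still unmasked.
    top = max(masked)
    exps = [math.exp(m - top) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Label 0 was emitted at an earlier step, so it must receive zero probability.
probs = masked_softmax([2.0, 1.0, 0.5, -1.0], already_predicted={0})
print(probs)
```

Since `math.exp(-math.inf)` evaluates to `0.0`, masked labels drop out of the normalisation and the remaining probabilities still sum to one.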
At the training stage, the loss function is the cross-entropy loss function. We employ the beam search algorithm [Wiseman and Rush2016] to find the top-ranked prediction path at inference time. The prediction paths ending with the eos symbol are added to the candidate path set.
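The inference procedure can be illustrated with a compact beam search over label sequences. This is a generic sketch rather than the authors' implementation: `step_fn` stands in for the trained decoder and returns log-probabilities for the next label given the prefix, `toy_step` is a made-up lookup table, and no length normalisation is applied.

```python
import math


def beam_search(step_fn, num_labels, eos_id, beam_size, max_len):
    """Return the highest-scoring (path, log_prob) among paths that end with eos_id."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for label in range(num_labels):
                if label in prefix:  # never predict the same label twice
                    continue
                candidates.append((prefix + [label], score + log_probs[label]))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Paths ending with the end symbol join the candidate path set.
            if prefix[-1] == eos_id:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished or beams  # fall back to unfinished paths if none ended
    return max(pool, key=lambda item: item[1])


def toy_step(prefix):
    """Hypothetical decoder: next-label probabilities for a few prefixes."""
    table = {
        (): [0.6, 0.3, 0.05, 0.05],
        (0,): [0.0, 0.7, 0.1, 0.2],
        (0, 1): [0.0, 0.0, 0.1, 0.9],
    }
    probs = table.get(tuple(prefix), [0.25] * 4)
    return [math.log(p) if p > 0 else -math.inf for p in probs]


best_path, best_score = beam_search(toy_step, num_labels=4, eos_id=3, beam_size=2, max_len=5)
print(best_path, math.exp(best_score))
```

In this toy run the short path `[0, 3]` finishes first, but the longer path `[0, 1, 3]` ends with a higher total probability and is returned.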
In the sequence generation model mentioned above, the embedding vector $g(y_{t-1})$ in Equation (7) is the embedding of the label that has the highest probability under the distribution $y_{t-1}$. However, this calculation only takes advantage of the maximum value of $y_{t-1}$ greedily. The proposed sequence generation model generates labels sequentially and predicts the next label conditioned on its previously predicted labels. Therefore, it is likely that we would get a succession of wrong label predictions in the following time steps if the prediction is wrong at time-step $t$, which is also called exposure bias. To a certain extent, the beam search algorithm alleviates this problem. However, it cannot fundamentally solve the problem because the exposure bias phenomenon is likely to occur for all candidate paths. $y_{t-1}$ represents the predicted probability distribution at time-step $t-1$, so all information in $y_{t-1}$ is helpful when we predict the current label at time-step $t$. The exposure bias problem ought to be relieved by considering all informative signals contained in $y_{t-1}$.
Based on this motivation, we propose a new decoder structure, where the embedding vector $g(y_{t-1})$ at time-step $t$ is capable of representing the overall information at the $(t-1)$-th time step. Inspired by the idea of the adaptive gate in highway network [Srivastava et al.2015], here we introduce our global embedding. Let $e$ denote the embedding of the label which has the highest probability under the distribution $y_{t-1}$. $\bar{e}$ is the weighted average embedding at time $t$, which is calculated as follows:
$$\bar{e} = \sum_{i=1}^{L} y_{t-1}^{(i)} e_i \tag{11}$$
where $y_{t-1}^{(i)}$ is the $i$-th element of $y_{t-1}$ and $e_i$ is the embedding vector of the $i$-th label. Then the proposed global embedding $g(y_{t-1})$ passed to the decoder at time-step $t$ is as follows:
$$g(y_{t-1}) = (1 - H) \odot e + H \odot \bar{e} \tag{12}$$
where $H$ is the transform gate controlling the proportion of the weighted average embedding:
$$H = W_1 e + W_2 \bar{e} \tag{13}$$
where $W_1$ and $W_2$ are weight matrices. The global embedding $g(y_{t-1})$ is the optimized combination of the original embedding and the weighted average embedding by using transform gate $H$, which can automatically determine the combination factor in each dimension. $y_{t-1}$ contains the information of all possible labels. By considering the probability of every label, the model is capable of reducing damage caused by mispredictions made in the previous time steps. This enables the model to predict label sequences more accurately.
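The global embedding can be sketched in plain Python as below. The sketch assumes that the transform gate is a linear function of the two embeddings and is applied element-wise; gated layers of this kind often pass the gate through a sigmoid, which is not done here. All function names and toy values are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def weighted_average_embedding(y_prev, label_embeddings):
    """Average of all label embeddings, weighted by the previous distribution."""
    dim = len(label_embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(y_prev, label_embeddings)) for d in range(dim)]


def global_embedding(y_prev, label_embeddings, W1, W2):
    """Blend the arg-max label embedding with the weighted average embedding."""
    best = max(range(len(y_prev)), key=lambda i: y_prev[i])
    e = label_embeddings[best]
    e_bar = weighted_average_embedding(y_prev, label_embeddings)
    # Assumption: linear transform gate, one value per embedding dimension.
    gate = [a + b for a, b in zip(matvec(W1, e), matvec(W2, e_bar))]
    return [(1.0 - h) * x + h * xb for h, x, xb in zip(gate, e, e_bar)]


y_prev = [0.1, 0.7, 0.2]
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
half = [[0.0, 0.5], [0.0, 0.5]]  # maps the arg-max embedding [0, 1] to a gate of 0.5

greedy = global_embedding(y_prev, label_embeddings, zeros, zeros)
blended = global_embedding(y_prev, label_embeddings, half, zeros)
print(greedy, blended)
```

With a gate of zero the result reduces to the greedy arg-max embedding, and with a gate of 0.5 it lies halfway between that embedding and the weighted average.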
In this section, we evaluate our proposed methods on two datasets. We first introduce the datasets, evaluation metrics, experimental details, and all baselines. Then, we compare our methods with the baselines. Finally, we provide the analysis and discussions of experimental results.
Reuters Corpus Volume I (RCV1-V2, http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm): This dataset is provided by Lewis et al. (2004). It consists of over 800,000 manually categorized newswire stories made available by Reuters Ltd for research purposes. Multiple topics can be assigned to each newswire story and there are 103 topics in total.
Arxiv Academic Paper Dataset (AAPD, https://github.com/lancopku/SGM): We build a new large dataset for multi-label text classification. We collect the abstracts and the corresponding subjects of papers in the computer science field from the arXiv website (https://arxiv.org/). An academic paper may have multiple subjects and there are 54 subjects in total. The target is to predict the corresponding subjects of an academic paper according to the content of the abstract.
We divide each dataset into training, validation and test sets. The statistics of the two datasets are shown in Table 1.
|Dataset||Total Samples||Label Sets||Words/Sample||Labels/Sample|
Following the previous work [Zhang and Zhou2007, Chen et al.2017], we adopt hamming loss and micro-$F_1$ score as our main evaluation metrics. Micro-precision and micro-recall are also reported to assist the analysis.
Hamming-loss [Schapire and Singer1999] evaluates the fraction of misclassified instance-label pairs, where a relevant label is missed or an irrelevant label is predicted.
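Both main metrics are easy to compute from sets of labels. The sketch below uses the standard definitions: Hamming loss counts mismatched instance-label pairs over all such pairs, and micro-$F_1$ is the harmonic mean of precision and recall pooled over all labels. The function names are illustrative, and the single example reuses one reference/BR pair from the RCV1-V2 examples, with a label space of 103 topics.

```python
def hamming_loss(true_sets, predicted_sets, num_labels):
    """Fraction of instance-label pairs that are misclassified."""
    # The symmetric difference holds missed labels plus wrongly predicted ones.
    mismatches = sum(len(t ^ p) for t, p in zip(true_sets, predicted_sets))
    return mismatches / (len(true_sets) * num_labels)


def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1: counts are pooled over all samples and labels."""
    tp = sum(len(t & p) for t, p in zip(true_sets, predicted_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, predicted_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, predicted_sets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reference = [{"CCAT", "C15", "C152", "C41", "C411"}]
predicted = [{"CCAT", "C15", "C13"}]
loss = hamming_loss(reference, predicted, num_labels=103)
f1 = micro_f1(reference, predicted)
print(loss, f1)
```

Here two labels are correct, one is spurious and three are missed, so four of the 103 instance-label pairs are wrong and micro-$F_1$ equals 0.5.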
We extract the vocabularies from the training sets. For the RCV1-V2 dataset, the size of the vocabulary is and out-of-vocabulary (OOV) words are replaced with . Each document is truncated at the length of 500 and the beam size is 5 at the inference stage. Besides, we set the word embedding size to 512. The hidden sizes of the encoder and the decoder are 256 and 512, respectively. The number of LSTM layers of encoder and decoder is 2.
For the AAPD dataset, the size of word embedding is 256. There are two LSTM layers in the encoder and its size is 256. For the decoder, there is one LSTM layer of size 512. The size of the vocabulary is and OOV words are also replaced with . Each document is truncated at the length of 500. The beam size is 9 at the inference stage.
We use the Adam [Kingma and Ba2014] optimization method to minimize the cross-entropy loss over the training data. For the hyper-parameters of the Adam optimizer, we set the learning rate , two momentum parameters and respectively, and . Additionally, we make use of the dropout regularization [Srivastava et al.2014] to avoid overfitting and clip the gradients [Pascanu et al.2013]
to the maximum norm of 10.0. During training, we train the model for a fixed number of epochs and monitor its performance on the validation set. Once the training is finished, we select the model with the best micro-$F_1$ score on the validation set as our final model and evaluate its performance on the test set.
We compare our proposed methods with the following baselines:
Binary Relevance (BR) [Boutell et al.2004] transforms the MLC task into multiple single-label classification problems by ignoring the correlations between labels.
Classifier Chains (CC) [Read et al.2011] transforms the MLC task into a chain of binary classification problems and takes high-order label correlations into consideration.
Label Powerset (LP) [Tsoumakas and Katakis2006] transforms a multi-label problem to a multi-class problem with one multi-class classifier trained on all unique label combinations.
CNN-RNN [Chen et al.2017] utilizes CNN and RNN to capture both the global and local textual semantics and model the label correlations.
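The difference between the BR and LP problem transformations can be made concrete by looking at the training targets each one produces. The helper names below are hypothetical and the sketch only builds targets; it does not train the linear SVM base classifiers.

```python
def binary_relevance_targets(label_sets, label_space):
    """One independent 0/1 target vector per label (correlations are ignored)."""
    return {label: [int(label in labels) for labels in label_sets] for label in label_space}


def label_powerset_targets(label_sets):
    """One multi-class target per sample: every distinct label set is a class."""
    class_ids = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        # A label combination seen for the first time receives the next free class id.
        targets.append(class_ids.setdefault(key, len(class_ids)))
    return targets, class_ids


samples = [{"CCAT", "C15"}, {"CCAT"}, {"C15", "CCAT"}, {"GCAT"}]
br_targets = binary_relevance_targets(samples, ["CCAT", "C15", "GCAT"])
lp_targets, lp_classes = label_powerset_targets(samples)
print(br_targets)
print(lp_targets)
```

BR always yields one binary problem per label, whereas the number of LP classes grows with the number of distinct label combinations in the training data.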
Following the previous work [Chen et al.2017], we adopt the linear SVM as the base classifier in BR, CC and LP. We implement BR, CC and LP by means of Scikit-Multilearn [Szymański2017], an open-source library for the MLC task. We tune hyper-parameters of all baseline algorithms on the validation set based on the micro-$F_1$ score. In addition, training strategies mentioned in cnn_trick are used to tune hyper-parameters for the baselines CNN and CNN-RNN.
For the purpose of simplicity, we denote the proposed sequence generation model as SGM. We report the evaluation results of our methods and all baselines on the test sets.
The experimental results of our methods and the baselines on dataset RCV1-V2 are shown in Table 1(a). Results show that our proposed methods give the best performance in the main evaluation metrics. Our proposed SGM model using global embedding achieves a reduction of 12.79% hamming-loss and an improvement of 2.33% micro-$F_1$ score over the most commonly used baseline BR. Besides, our methods outperform other traditional deep-learning models by a large margin. For instance, the proposed SGM model with global embedding achieves a reduction of 15.73% hamming-loss and an improvement of 2.69% micro-$F_1$ score over the traditional CNN model. Even without the global embedding, our proposed SGM model is still able to outperform all baselines.
In addition, the SGM model is significantly improved by using global embedding. The SGM model with global embedding achieves a reduction of 7.41% hamming loss and an improvement of 1.04% micro-$F_1$ score on the test set compared with the model without global embedding.
Table 1(b) presents the results of the proposed methods and the baselines on the AAPD test set. Similar to the experimental results on the RCV1-V2 test set, our proposed methods still outperform all baselines by a large margin in main evaluation metrics. This further confirms that our methods have significant advantages over previous work on large datasets. Besides, the proposed SGM achieves a reduction of 2.39% hamming loss and an improvement of 1.57% micro-$F_1$ score on the test set by using global embedding. This further testifies that the global embedding is capable of helping the model to predict label sequences more accurately.
Here we perform further analysis on the model and experimental results. We report the evaluation results in terms of hamming loss and micro-$F_1$ score.
As is shown in Table 2, global embedding can significantly improve the performance of the model. The global embedding $g(y_{t-1})$ at time-step $t$ takes advantage of all information of possible labels contained in $y_{t-1}$, so it is able to enrich the source information when the model predicts the current label, which significantly improves the performance of the model. The global embedding is the combination of the original embedding and the weighted average embedding by using the transform gate $H$. Here we conduct experiments on the RCV1-V2 dataset to explore how the performance of our model is affected by the proportion between the two kinds of embeddings. In the exploratory experiment, the final embedding vector at time-step $t$ is calculated as follows:
$$g(y_{t-1}) = (1 - \lambda)\, e + \lambda\, \bar{e}$$
The proportion between the two kinds of embeddings is controlled by the coefficient $\lambda$. $\lambda = 0$ denotes the proposed SGM model without global embedding. The proportion of the weighted average embedding increases when we increase $\lambda$. The experimental results using different $\lambda$ values in the decoder are shown in Figure 3.
As is shown in Figure 3, the performance of the model varies when different $\lambda$ is used. Overall, the model using the adaptive gate performs the best, achieving the best results in both hamming loss and micro-$F_1$. The models with $\lambda \neq 0$ outperform the model with $\lambda = 0$, which shows that the weighted average embedding contains richer information, leading to the improvement in the performance of the model. Without using the adaptive gate, the performance of the model improves at first and then deteriorates as $\lambda$ increases. This reveals why the model with the adaptive gate performs the best: the adaptive gate can automatically determine the most appropriate $\lambda$ value according to the actual condition.
Our proposed methods are developed based on traditional Seq2Seq models. However, the mask module is added to the proposed methods, which is used to prevent the models from predicting repeated labels. In addition, we sort the label sequence of each sample according to the frequency of appearance of labels in the training set. In order to explore the impact of the mask module and sorting, we conduct ablation experiments on the RCV1-V2 dataset. The experimental results are shown in Table 3. “w/o mask” means that we do not perform mask operation and “w/o sorting” means that we randomly shuffle the label sequence in order to perturb its original order.
As is shown in Table 3, the performance decline of the SGM model with global embedding is more significant compared with that of the SGM model without global embedding. In addition, the decline in the performance of the two models is more significant when we randomly shuffle the label sequence of the sample compared with removing mask module. The label cardinality of the RCV1-V2 dataset is small, so our proposed methods are less prone to predicting repeated labels. This explains the reason why experimental results indicate that the mask module has little impact on the models’ performance. In addition, the proposed models are trained using the maximum likelihood estimation method and the cross-entropy loss function, which requires humans to predefine the order of the output labels. Therefore, the sorting of labels is very important for the models’ performance. Besides, the performance of both models declines when we do not use the mask module. This shows that the performance of the model can be improved by using the mask operation.
In the experiment, we find that the performance of all methods deteriorates when the length of the label sequence increases (for simplicity, we denote the length of the label sequence as LLS). In order to explore the influence of the value of the LLS, we divide the test set into different subsets based on different LLS. Figure 3 shows the performance of the SGM model and the most commonly used baseline BR on different subsets of the RCV1-V2 test set. As is shown in Figure 3, generally, the performance of both models deteriorates as the LLS increases. This shows that when the label sequence of the sample is particularly long, it is difficult to accurately predict all labels, because more information is needed when the model predicts more labels. It is easy to ignore some true labels whose feature information is insufficient.
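Splitting the test set by LLS and scoring each subset separately can be sketched as follows. The function `score_by_label_length` is a hypothetical helper that accepts any metric; the exact-match ratio used in the example is only a stand-in chosen to keep the sketch short, not one of the metrics reported in the paper.

```python
from collections import defaultdict


def score_by_label_length(true_sets, predicted_sets, metric):
    """Group samples by LLS (number of reference labels) and score each group."""
    groups = defaultdict(lambda: ([], []))
    for truth, prediction in zip(true_sets, predicted_sets):
        truths, predictions = groups[len(truth)]
        truths.append(truth)
        predictions.append(prediction)
    return {lls: metric(t, p) for lls, (t, p) in sorted(groups.items())}


def exact_match_ratio(true_sets, predicted_sets):
    """Illustrative metric: share of samples whose label set is fully correct."""
    hits = sum(t == p for t, p in zip(true_sets, predicted_sets))
    return hits / len(true_sets)


references = [{"CCAT"}, {"GCAT"}, {"CCAT", "C15"}, {"GCAT", "ECAT", "G15"}]
predictions = [{"CCAT"}, {"GCAT"}, {"CCAT"}, {"GCAT", "ECAT"}]
scores = score_by_label_length(references, predictions, exact_match_ratio)
print(scores)
```

Any set-based metric with the same two-argument signature can be passed in place of `exact_match_ratio`.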
However, as is shown in Figure 3, the proposed SGM model outperforms BR with any value of LLS, and the advantages of our model are more significant when LLS is large. The traditional BR method predicts all labels at once only based on the sample input. Therefore, it tends to ignore some true labels whose feature information contained in the sample is insufficient. The SGM model generates labels sequentially, and predicts the next label based on its previously predicted labels. Therefore, even if the sample contains less information of some true labels, the SGM model is capable of generating these true labels by considering relevant labels that have been predicted.
In Table 4, the labels “CV” and “CL” denote computer vision and computational language, respectively.
When the model predicts different labels, there exist differences in the contributions of different words. The SGM model is able to select the most informative words by utilizing the attention mechanism. The visualization of the attention layer is shown in Table 4. According to Table 4, when the SGM model predicts the label “CV”, it can automatically assign larger weights to more informative words, like image, visual, captioning, and so on. For the label “CL”, the selected informative words are sentence, memory, recurrent, etc. This shows that our proposed models are able to consider the differences in the contributions of textual content when predicting different labels and select the most informative words automatically.
We give several examples of the generated label sequences on the RCV1-V2 dataset in Table 5, where we compare the proposed methods with the most commonly used baseline BR. The red bold labels in each example indicate that they are highly correlated. For instance, the correlation coefficient between E51 and E512 is 0.7664. Therefore, these highly correlated labels are likely to appear together in the predicted label sequence. The BR algorithm fails to capture this label correlation, leaving many true labels unpredicted. However, our proposed methods accurately predict almost all highly correlated true labels. The proposed SGM captures the correlations between labels by utilizing LSTM to generate labels sequentially. Therefore, for some true labels whose feature information is insufficient, the proposed SGM is still able to generate them by considering relevant labels that have been predicted. In addition, more accurate label sequences are predicted by using global embedding. The SGM model with global embedding predicts more true labels compared with the SGM model without global embedding. The reason is that the source information is further enriched by incorporating overall informative signals in the probability distribution $y_{t-1}$ when the model predicts the label at time-step $t$. Enriched information makes the global embedding smoother, which enables the model to reduce damage caused by mispredictions made in the previous time steps.
| Reference | BR | SGM | SGM + GE |
|---|---|---|---|
| CCAT, C15, C152, C41, C411 | CCAT, C15, C13 | CCAT, C15, C152 | CCAT, C15, C152, C41, C411 |
| CCAT, GCAT, ECAT, C31, GDIP, C13, C21, E51, E512 | CCAT, GCAT, GDIP, E51 | CCAT, ECAT, GDIP, E51, E512 | CCAT, GCAT, ECAT, C31, GDIP, E51, E512, C312 |
| GCAT, ECAT, G15, G154, G151, G155 | GCAT, ECAT, GENV, G15 | GCAT, ECAT, E21, G15, G154, G156 | GCAT, ECAT, E21, G15, G154, G155 |
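A correlation coefficient between two labels, such as the one quoted for E51 and E512, can be computed from their co-occurrence in the data. The text does not say which coefficient is used, so the sketch below assumes the Pearson correlation between the two 0/1 indicator vectors; the function name and the four toy samples are made up for illustration.

```python
import math


def label_correlation(label_sets, label_a, label_b):
    """Pearson correlation between the 0/1 indicators of two labels."""
    a = [1.0 if label_a in labels else 0.0 for labels in label_sets]
    b = [1.0 if label_b in labels else 0.0 for labels in label_sets]
    n = len(label_sets)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a label that never varies")
    return cov / math.sqrt(var_a * var_b)


toy_samples = [{"E51", "E512"}, {"E51", "E512", "ECAT"}, {"GCAT"}, {"CCAT"}]
together = label_correlation(toy_samples, "E51", "E512")
apart = label_correlation(toy_samples, "E51", "GCAT")
print(together, apart)
```

Labels that always appear together score 1, while labels that never co-occur in the toy data receive a negative value.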
The MLC task studies the problem where multiple labels are assigned to each sample. There are four main types of methods for the MLC task: problem transformation methods, algorithm adaptation methods, ensemble methods, and neural network models.
Problem transformation methods map the MLC task into multiple single-label learning tasks. Binary relevance (BR) [Boutell et al.2004] decomposes the MLC task into independent binary classification problems by ignoring the correlations between labels. In order to model label correlations, label powerset (LP) [Tsoumakas and Katakis2006] transforms a multi-label problem to a multi-class problem with a classifier trained on all unique label combinations. Classifier chains (CC) [Read et al.2011] transforms the MLC task into a chain of binary classification problems, where subsequent binary classifiers in the chain are built upon the predictions of preceding ones. However, the computational efficiency and performance of these methods are challenged by applications with a large number of labels and samples.
Algorithm adaptation methods extend specific learning algorithms to handle multi-label data directly. Clare and King (2001) construct a decision tree based on multi-label entropy to perform classification. Elisseeff and Weston (2002) optimize the empirical ranking loss by using a maximum margin strategy and kernel tricks. Collective multi-label classifier (CML) [Ghamrawi and McCallum2005] adopts the maximum entropy principle to deal with multi-label data by encoding label correlations as constraint conditions. Zhang and Zhou (2007) adopt $k$-nearest neighbor techniques to deal with multi-label data. ml_5 rank labels by utilizing pairwise comparison. li2015multi propose a novel joint learning algorithm that allows the feedback to be propagated from the classifiers for later labels to the classifier for the current label. Most methods, however, can only be used to capture the first or second order label correlations or are computationally intractable in considering high-order label correlations.
Among ensemble methods, rakel break the initial set of labels into a number of small random subsets and employ the LP algorithm to train a corresponding classifier. lsps propose to construct a label co-occurrence graph and perform community detection to partition the label set.
In recent years, some neural network models have also been used for the MLC task. r5 propose the BP-MLL that utilizes a fully-connected neural network and a pairwise ranking loss function. r3 propose a neural network using cross-entropy loss instead of ranking loss. r8 increase classification speed by adding an extra ART layer for clustering. r4 utilize word embeddings based on CNN to capture label correlations. Chen et al. (2017) propose to represent semantic information of text and model high-order label correlations by combining CNN with RNN. r7 initialize the final hidden layer with rows that map to co-occurrence of labels based on the CNN architecture to improve the performance of the model. ma2018bag propose to use the multi-label classification algorithm for machine translation to handle the situation where a sentence can be translated into more than one correct sentence.
In this paper, we propose to view the multi-label classification task as a sequence generation problem to model the correlations between labels. A sequence generation model with a novel decoder structure is proposed to improve the performance of classification. Extensive experimental results show that the proposed methods outperform the baselines by a substantial margin. Further analysis of experimental results demonstrates that our proposed methods not only capture the correlations between labels, but also select the most informative words automatically when predicting different labels.
As analyzed in Section 3.6.3, when a large number of labels are assigned to a sample, how to predict all these true labels accurately is an intractable problem. Our proposed methods alleviate this problem to some extent, but more effective solutions need to be further explored in the future.
This work is supported in part by National Natural Science Foundation of China (No. 61673028, No. 61333018) and the National Thousand Young Talents Program. Xu Sun is the corresponding author of this paper.
Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771.
Bag-of-words as target for neural machine translation. In ACL 2018.
A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 379–389.
Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. In ACL 2018.