1 Introduction
Multilabel text classification is an important machine learning task in which one must predict the set of labels associated with a given document; for example, a news article might be tagged with the labels sport, football, 2018 world cup, and Russia. Formally, we are given a set of label candidates $\mathcal{L} = \{1, 2, \dots, L\}$, and we aim to build a classifier that maps a document $x$ to a set of labels $y \subseteq \mathcal{L}$. The label set $y$ is typically written as a binary vector $y \in \{0, 1\}^{L}$, with each bit indicating the presence or absence of a label.
Naively, one could predict each label independently without considering label dependencies. This approach is called Binary Relevance Boutell et al. (2004); Tsoumakas and Katakis (2007), and is widely used due to its simplicity, but it often does not deliver good performance. Intuitively, knowing some labels, such as sport and football, should make it easier to predict 2018 world cup and then Russia. Several methods try to capture label dependencies by building a joint probability estimate over all labels Ghamrawi and McCallum (2005); Read et al. (2009); Dembczynski et al. (2010); Li et al. (2016). The most popular approach, Probabilistic Classifier Chain (PCC) Dembczynski et al. (2010), learns labels one by one in a predefined fixed order: for each label, it uses one classifier to estimate the probability of that label given the predictions for all previous labels, $p(y_l \mid x, y_1, \dots, y_{l-1})$. PCC's well-known drawback is that errors in early probability estimates tend to affect subsequent predictions, and the accumulated error can become massive when the total number of label candidates $L$ is large.
Recurrent neural networks (RNNs) were originally designed to output sequential structures, such as sentences Cho et al. (2014). Recently, RNNs have also been applied to multilabel classification by mapping the label set to a sequence Wang et al. (2016); Zhang et al. (2016); Jin and Nakayama (2016); Wang et al. (2017b, a); Chen et al. (2018); Yang et al. (2018). In contrast to PCC, where a binary decision is made for each label sequentially, an RNN only predicts the positive labels explicitly, so its decision chain length equals the number of positive labels, not the number of all labels. This makes RNNs suffer less from early estimation errors than PCC.
Both PCC and RNN rely heavily on label orders in training and prediction. In multilabel data, the labels are given as sets, not necessarily with natural orders. RNN defines a sequence probability, while PCC defines a set probability. Various ways of arranging sets as sequences have been explored: ordering alphabetically, by frequency, based on a label hierarchy, or according to some label ranking algorithm Liu and Tsang (2015). Previous experimental results show that the choice of order can have a significant impact on learning and prediction Vinyals et al. (2016); Nam et al. (2017); Chen et al. (2018). In the above example, starting the label prediction sequence with Russia, while correct, would make the other predictions very difficult.
Previous work has shown that it is possible to train an RNN on multilabel data without specifying the label order in advance. With special training objectives, an RNN can explore different label orders and converge to some order automatically Vinyals et al. (2016). In this paper we follow the same line of study: we consider how to adapt the RNN sequence model to multilabel set prediction without specifying the label order. Specifically, we make the following contributions:

We analyze existing RNN models proposed for multilabel prediction, and show that existing training and prediction objectives are not well justified mathematically and have undesired consequences in practice.

We develop efficient approximate training and prediction methods. We propose new training and prediction objectives based on a principled notion of set probability. Our new formulation avoids the drawbacks of existing ones and gives the RNN model freedom to discover the best label order.

We crawl two new datasets for the multilabel prediction task, and apply our method to them. We also test our method on two existing multilabel datasets. The experimental results show that our method outperforms state-of-the-art methods on all datasets. We release the datasets at http://www.ccis.neu.edu/home/kechenqin.
2 Mapping Sequences to Sets
In this section, we describe how existing approaches map sequences to sets, writing down their objective functions in a consistent notation. To review RNNs designed for sequences, let $s = (s_1, s_2, \dots, s_T)$ be a sequence of outcomes in a particular order, where $s_t \in \{1, 2, \dots, L\}$; the order is often critical to the datapoint. An RNN model defines a probability distribution over all possible output sequences given the input $x$ in the form $p(s = (s_1, \dots, s_T) \mid x) = \prod_{t=1}^{T} p(s_t \mid x, s_1, \dots, s_{t-1})$. To train the RNN model, one maximizes the likelihood of the ground truth sequence.
At prediction time, one seeks the sequence with the highest probability, $s^{*} = \arg\max_{s} p(s \mid x)$, and this is usually implemented approximately with a beam search procedure Lowerre (1976) (which we modify into Algorithm 1). The sequence history is encoded in an internal memory vector $h_t$, which is updated over time. An RNN is also often equipped with an attention mechanism Bahdanau et al. (2014), which at each timestep $t$ puts different weights on different words (features) and thus effectively attends to a list of important words. The context vector $c_t$ is computed as the weighted average of the dense representations of the important words, in order to capture information from the document. The context $c_t$, the RNN memory $h_t$ at timestep $t$, and the encoding of the previous label $s_{t-1}$ are concatenated and used to model the label probability distribution at time $t$ as $p(s_t \mid x, s_1, \dots, s_{t-1}) \sim \mathrm{softmax}(\phi(c_t, h_t, s_{t-1}))$, where $\phi$ is a nonlinear function and softmax is the normalized exponential function.
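Table 1: Training and prediction objectives of the RNN methods. $\pi(y)$ denotes the set of all permutations of the label set $y$.
Methods  Training objectives  Prediction objectives
seq2seqRNN  maximize $\sum_{n=1}^{N} \log p(s^{(n)} \mid x^{(n)})$  $\hat{s} = \arg\max_{s} p(s \mid x)$, $\hat{y} = \mathrm{set}(\hat{s})$
VinyalsRNNmax  maximize $\sum_{n=1}^{N} \max_{s \in \pi(y^{(n)})} \log p(s \mid x^{(n)})$  $\hat{s} = \arg\max_{s} p(s \mid x)$, $\hat{y} = \mathrm{set}(\hat{s})$
VinyalsRNNuniform  maximize $\sum_{n=1}^{N} \sum_{s \in \pi(y^{(n)})} \log p(s \mid x^{(n)})$  $\hat{s} = \arg\max_{s} p(s \mid x)$, $\hat{y} = \mathrm{set}(\hat{s})$
VinyalsRNNsample  maximize $\sum_{n=1}^{N} \sum_{s \in \pi(y^{(n)})} p(s \mid x^{(n)}) \log p(s \mid x^{(n)})$  $\hat{s} = \arg\max_{s} p(s \mid x)$, $\hat{y} = \mathrm{set}(\hat{s})$
setRNN (ours)  maximize $\sum_{n=1}^{N} \log \sum_{s \in \pi(y^{(n)})} p(s \mid x^{(n)})$  $\hat{y} = \arg\max_{y} p(y \mid x)$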
To apply RNN to multilabel problems, one approach is to map the given label set $y^{(n)}$ of each training document to a sequence $s^{(n)}$. The sequence is usually obtained by writing the label set in a globally fixed order (e.g. by label frequency), as in PCC. Once the mapping is done, the RNN is trained with the standard maximum likelihood objective Nam et al. (2017):
$$\text{maximize} \quad \sum_{n=1}^{N} \log p(s^{(n)} \mid x^{(n)}) \qquad (1)$$
where $x^{(n)}$ is the $n$-th document and $N$ is the total number of documents in the corpus.
Vinyals et al. (2016) propose to dynamically choose, during training, the sequence order deemed most probable by the current RNN model:
$$\text{maximize} \quad \sum_{n=1}^{N} \max_{s \in \pi(y^{(n)})} \log p(s \mid x^{(n)}) \qquad (2)$$
where $\pi(y^{(n)})$ stands for all permutations of the label set $y^{(n)}$. This eliminates the need to manually specify the label order. However, as noticed by the authors, this objective cannot be used in the early training stages: the early order choice (often random) is reinforced by this objective, and the model can get stuck on it permanently. To address this issue, Vinyals et al. (2016) also propose two smoother alternative objectives to initialize the model training.
The authors suggest that one first consider many random orders for each label set in order to explore the space:
$$\text{maximize} \quad \sum_{n=1}^{N} \sum_{s \in \pi(y^{(n)})} \log p(s \mid x^{(n)}) \qquad (3)$$
After that, one can sample sequences following the model predictive distribution instead of the uniform distribution:
$$\text{maximize} \quad \sum_{n=1}^{N} \sum_{s \in \pi(y^{(n)})} p(s \mid x^{(n)}) \log p(s \mid x^{(n)}) \qquad (4)$$
In training, one needs to schedule the transition among these objectives, a rather tricky endeavor. At prediction time, one needs to find the most probable set. This is done by (approximately) finding the most probable sequence $\hat{s} = \arg\max_{s} p(s \mid x)$ and treating it as a set $\hat{y} = \mathrm{set}(\hat{s})$. With a large number of sequences, it is quite possible that the argmax itself has a low probability, so ignoring all sequences other than the top one can discard important information.
3 Adapting RNN Sequence Prediction Model to Multilabel Set Prediction
We propose a new way of adapting RNN to multilabel set prediction, which we call setRNN. We keep the RNN model structure Rumelhart et al. (1988), which directly defines a probability distribution over all possible sequences, and introduce training and prediction objectives tailored for sets that take advantage of it, while making a clear distinction between the sequence probability $p(s \mid x)$ and the set probability $p(y \mid x)$. We define the set probability as the sum of the sequence probabilities over all sequence permutations of the set, namely $p(y \mid x) = \sum_{s \in \pi(y)} p(s \mid x)$, where $\pi(y)$ denotes the set of all permutations of $y$. Based on this formulation, an RNN also indirectly defines a probability distribution over all possible sets, since $\sum_{y} p(y \mid x) = \sum_{y} \sum_{s \in \pi(y)} p(s \mid x) = 1$. (For this equation to hold, in theory, we should also consider permutations with repeated labels, that is, sequences in which the same label appears more than once. But in practice, we find it very rare for the RNN to actually generate sequences with repeated labels in our setup, and whether or not repetition is allowed does not make much difference.)
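In standard maximum likelihood training, one wishes to maximize the likelihood of the given label sets, namely $\prod_{n=1}^{N} p(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^{N} \sum_{s \in \pi(y^{(n)})} p(s \mid x^{(n)})$, where $\pi(y^{(n)})$ is the set of all permutations of the label set $y^{(n)}$ of the $n$-th document $x^{(n)}$, or equivalently,
$$\text{maximize} \quad \sum_{n=1}^{N} \log \sum_{s \in \pi(y^{(n)})} p(s \mid x^{(n)}) \qquad (5)$$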
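As a rough illustration of the set probability and of objective (5), the sketch below enumerates all permutations of a small label set and sums their sequence probabilities. The transition table `TRANS` is a hypothetical stand-in for a trained RNN (it conditions only on the previous label and ignores the document), and the names `sequence_prob`, `set_prob`, and `set_log_likelihood` are illustrative, not taken from the paper.

```python
import itertools
import math

START, STOP = "<start>", "<stop>"

# Hypothetical toy model (an assumption, not the paper's RNN):
# TRANS[prev][nxt] = p(nxt | prev); every row sums to 1.
TRANS = {
    START:      {"sport": 0.6, "football": 0.3, "russia": 0.1, STOP: 0.0},
    "sport":    {"sport": 0.0, "football": 0.7, "russia": 0.1, STOP: 0.2},
    "football": {"sport": 0.2, "football": 0.0, "russia": 0.5, STOP: 0.3},
    "russia":   {"sport": 0.1, "football": 0.1, "russia": 0.0, STOP: 0.8},
}


def sequence_prob(seq):
    """p(s | x) via the chain rule, including the final stop decision."""
    prob, prev = 1.0, START
    for label in list(seq) + [STOP]:
        prob *= TRANS[prev][label]
        prev = label
    return prob


def set_prob(label_set):
    """p(y | x): sum of p(s | x) over every permutation s of the set y."""
    return sum(sequence_prob(s) for s in itertools.permutations(label_set))


def set_log_likelihood(label_sets):
    """Objective (5) for a list of ground-truth label sets, one per document."""
    return sum(math.log(set_prob(y)) for y in label_sets)


if __name__ == "__main__":
    # (sport, football) contributes 0.126 and (football, sport) 0.012.
    print(round(set_prob({"sport", "football"}), 3))  # 0.138
```

Exact enumeration costs one model evaluation per permutation, so it is only practical for very small sets.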
3.1 How is our new formulation different?
This training objective (5) looks similar to the objective (3) considered in previous work Vinyals et al. (2016), but in fact they correspond to very different transformations. Under the maximum likelihood framework, our objective (5) corresponds to the transformation $p(y \mid x) = \sum_{s \in \pi(y)} p(s \mid x)$, while objective (3) corresponds to the transformation $p(y \mid x) = \prod_{s \in \pi(y)} p(s \mid x)$, where $\pi(y)$ denotes the set of all permutations of the label set $y$. The latter transformation does not define a valid probability distribution over $y$ (i.e., $\sum_{y} p(y \mid x) \neq 1$), and it has an undesired consequence in practical model training: because of the multiplication operation, the RNN model has to assign equally high probabilities to all sequence permutations of the given label set in order to maximize the set probability. If only some sequence permutations receive high probabilities while others receive low probabilities, the set probability computed as the product of sequence probabilities will still be low. In other words, if for each document the RNN finds one good way of ordering the relevant labels (such as hierarchically) and allocates most of the probability mass to the sequence in that order, the model still assigns low probabilities to the ground truth label sets and will be penalized heavily. As a consequence, the model has little freedom in discovering and concentrating on some natural label order.
In contrast, with our proposed training objective, in which the multiplication operation is replaced by the summation operation, it suffices to find only one reasonable permutation of the labels for each document. It is worth noting that different documents can have different label orders; thus our proposed training objective gives the RNN model far more freedom on label order. The other two objectives (2) and (4) proposed in Vinyals et al. (2016) are less restrictive than (3), but they have to work in conjunction with (3) because of the self-reinforcement issue. Our proposed training objective has a natural probabilistic interpretation, and does not suffer from the self-reinforcement issue. Thus it can serve as a stand-alone training objective. Also, using Jensen's inequality, $\frac{1}{|\pi(y)|} \sum_{s \in \pi(y)} \log p(s \mid x) \le \log \big( \frac{1}{|\pi(y)|} \sum_{s \in \pi(y)} p(s \mid x) \big)$, one can show that objective (3) is maximizing a lower bound on the log-likelihood (after normalizing each document's term by $|\pi(y)|$ and up to an additive constant), while objective (5) is maximizing it directly.
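The contrast between the two objectives can be checked numerically on a single document. The sketch below is only an illustration under made-up numbers: `concentrated` and `spread` are hypothetical probabilities that a model might assign to the six orderings of a three-label set, and the function names are not from the paper.

```python
import math


def objective_uniform(seq_probs):
    """Per-document term of objective (3): sum of log p(s | x) over permutations."""
    return sum(math.log(p) for p in seq_probs)


def objective_set(seq_probs):
    """Per-document term of objective (5): log of the summed permutation probabilities."""
    return math.log(sum(seq_probs))


# Hypothetical probabilities for the 3! = 6 orderings of one label set.
concentrated = [0.90, 0.004, 0.004, 0.004, 0.004, 0.004]  # one dominant order
spread = [0.15] * 6                                       # all orders equally likely

if __name__ == "__main__":
    for name, probs in [("concentrated", concentrated), ("spread", spread)]:
        print(name,
              "objective (3): %.2f" % objective_uniform(probs),
              "objective (5): %.3f" % objective_set(probs))
```

With these numbers, objective (5) slightly prefers the concentrated model (total mass 0.92 against 0.90), whereas objective (3) penalizes it heavily because five of its six terms are logarithms of small probabilities.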
3.2 Training by Maximizing Set Probability
Training an RNN model with the proposed objective (5) requires summing up $|y|!$ sequence (permutation) probabilities for a set $y$, where $|y|$ is the cardinality of the set. Thus evaluating this objective exactly can be intractable. We can approximate this sum by only considering the top $K$ highest probability sequences produced by the RNN model. We introduce a variant of beam search for sets, with width $K$ and with the search candidates in each step restricted to the labels in the set (see Algorithm 1, in its set-restricted mode). This approximate inference procedure is carried out repeatedly before each batch training step, in order to find the highest probability sequences for all training instances occurring in that batch. The overall training procedure is summarized in Algorithm 2.
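A minimal sketch of the set-restricted beam search described above, under simplifying assumptions: the toy transition table `TRANS` is a hypothetical stand-in for the RNN, each label may be used only once, and the label set is assumed non-empty. It is not the paper's Algorithm 1; the function names are illustrative.

```python
START, STOP = "<start>", "<stop>"

# Hypothetical toy model (an assumption): TRANS[prev][nxt] = p(nxt | prev).
TRANS = {
    START:      {"sport": 0.6, "football": 0.3, "russia": 0.1, STOP: 0.0},
    "sport":    {"sport": 0.0, "football": 0.7, "russia": 0.1, STOP: 0.2},
    "football": {"sport": 0.2, "football": 0.0, "russia": 0.5, STOP: 0.3},
    "russia":   {"sport": 0.1, "football": 0.1, "russia": 0.0, STOP: 0.8},
}


def top_k_permutations(label_set, k):
    """Beam search of width k that only explores orderings of label_set."""
    labels = sorted(label_set)          # fixed iteration order for determinism
    beams = [((), 1.0)]                 # (prefix, probability of the prefix)
    for _ in range(len(labels)):
        candidates = []
        for prefix, prob in beams:
            prev = prefix[-1] if prefix else START
            for label in labels:
                if label in prefix:     # each label is used exactly once
                    continue
                candidates.append((prefix + (label,), prob * TRANS[prev][label]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    # Close every surviving ordering with the stop decision.
    return [(seq, prob * TRANS[seq[-1]][STOP]) for seq, prob in beams]


def approx_set_prob(label_set, k):
    """Approximate p(y | x) by summing the k sequences kept by the beam."""
    return sum(prob for _, prob in top_k_permutations(label_set, k))


if __name__ == "__main__":
    y = {"sport", "football", "russia"}
    for k in (1, 2, 6):
        print(k, round(approx_set_prob(y, k), 4))  # 0.168, 0.171, 0.1801
```

With `k` equal to the number of permutations the sum is exact; smaller widths give a lower bound that is already close when the probability mass is concentrated on a few orderings.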
3.3 Predicting the Most Probable Set
The transformation $p(y \mid x) = \sum_{s \in \pi(y)} p(s \mid x)$, where $\pi(y)$ is the set of all permutations of $y$, also naturally leads to a prediction procedure that differs from the previous standard of directly using the most probable sequence as a set. We instead aim to find the most likely set $\hat{y} = \arg\max_{y} p(y \mid x)$, which involves summing up probabilities over all of its permutations. To make this tractable, we propose a two-level beam search procedure. First we run standard RNN beam search (Algorithm 1, in its unrestricted mode) to generate a list of highest probability sequences. We then consider the label set associated with each label sequence. For each set, we evaluate its probability using the same approximate summation procedure as the one used during model training (Algorithm 1, in its set-restricted mode): we run our modified beam search to find the top few highest probability sequences associated with the set and sum up their probabilities. Among the sets that we have evaluated, we choose the one with the highest probability as the prediction. The overall prediction procedure is summarized in Algorithm 3. As we shall show in the case study, the most probable set may not correspond to the most probable sequence; these are precisely the cases where our method has an advantage.
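The two-level procedure can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 3: the prefix-conditioned table `NEXT` is a hypothetical stand-in for the RNN, chosen so that the top sequence and the top set disagree, and `beam_search` and `predict_set` are illustrative names.

```python
STOP = "<stop>"

# Hypothetical toy model (an assumption): NEXT[prefix][token] = p(token | prefix).
NEXT = {
    (): {"sport": 0.35, "football": 0.35, "politics": 0.30},
    ("sport",): {"football": 0.8, STOP: 0.2},
    ("football",): {"sport": 0.8, STOP: 0.2},
    ("politics",): {STOP: 1.0},
    ("sport", "football"): {STOP: 1.0},
    ("football", "sport"): {STOP: 1.0},
}


def beam_search(k, allowed=None):
    """Return up to k finished sequences with their probabilities.

    If `allowed` is given, only sequences that are orderings of exactly
    that label set are kept (the set-restricted mode).
    """
    beams, finished = [((), 1.0)], []
    while beams:
        candidates = []
        for prefix, prob in beams:
            for token, p in NEXT[prefix].items():
                if token == STOP:
                    if allowed is None or set(prefix) == allowed:
                        finished.append((prefix, prob * p))
                elif allowed is None or token in allowed:
                    candidates.append((prefix + (token,), prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return sorted(finished, key=lambda c: c[1], reverse=True)[:k]


def predict_set(k):
    """Level 1: top sequences. Level 2: score each candidate set by summation."""
    top_sequences = beam_search(k)
    candidate_sets = {frozenset(seq) for seq, _ in top_sequences}
    scores = {y: sum(p for _, p in beam_search(k, allowed=y)) for y in candidate_sets}
    best_set = max(scores, key=scores.get)
    return best_set, scores[best_set], top_sequences[0][0]


if __name__ == "__main__":
    best_set, score, top_sequence = predict_set(k=3)
    print(sorted(best_set), round(score, 2), top_sequence)
```

In this toy model the single most probable sequence is (politics) with probability 0.30, but the set {sport, football} collects 0.28 from each of its two orderings and wins with 0.56.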
Both our method and the competing state-of-the-art methods (VinyalsRNNs) are at most $K$ times slower than a vanilla RNN, where $K$ is the beam width, due to the time spent on dealing with $K$ permutations per datapoint. Our proposed method is about as fast as the VinyalsRNN methods, except for VinyalsRNNuniform, which is a bit faster (by a factor of 1.5) because its epochs do not run the additional forward pass.
4 Results and Analysis
4.1 Experimental Setup
We test our proposed setRNN method on 4 real-world datasets: RCV1v2, Slashdot, TheGuardian, and the Arxiv Academic Paper Dataset (AAPD) Yang et al. (2018). We take the public RCV1v2 release (http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm) and randomly sample 50,000 documents. We crawl Slashdot and TheGuardian documents from their websites (Slashdot: https://slashdot.org/; TheGuardian: https://www.theguardian.com) and treat the official editor tags as ground truth. Note that there is another public Slashdot multilabel dataset Read et al. (2009), but we do not use it because it is quite small. We also gather a list of user tags (www.zubiaga.org/datasets/socialbm0311/) for each document and treat them as additional features. For the AAPD dataset, we follow the same train/test split as in Yang et al. (2018). Table 2 contains statistics of these four datasets. Links to documents, official editor tags, and user tags are available at http://www.ccis.neu.edu/home/kechenqin.
Table 2: Statistics of the datasets.
Data         #Train  #Test   Cardinality  #Labels  Doc length
Slashdot     19,258  4,814   4.15         291      64
RCV1v2       40,000  10,000  3.17         101      121
TheGuardian  37,638  9,409   7.41         1,527    505
AAPD         53,840  1,000   2.41         54       163
Table 3: Performance of different methods. A dash marks a result that is not available.
Methods  Slashdot labelF1  Slashdot instanceF1  RCV1v2 labelF1  RCV1v2 instanceF1  TheGuardian labelF1  TheGuardian instanceF1  AAPD labelF1  AAPD instanceF1  AAPD hammingloss  AAPD microF1
BR  .271  .484  .486  .802  .292  .572  .529  .654  .0230  .685
BRsupport  .247  .516  .486  .805  .296  .594  .545  .689  .0228  .696
PCC  .279  .480  .595  .818  -  -  .541  .688  .0255  .682
seq2seqRNN  .270  .528  .561  .824  .331  .603  .510  .708  .0254  .701
VinyalsRNNuniform  .279  .527  .578  .826  .313  .567  .532  .721  .0241  .711
VinyalsRNNsample  .300  .531  .590  .828  .339  .597  .527  .706  .0259  .697
VinyalsRNNmax  .293  .530  .588  .829  .343  .599  .535  .709  .0256  .700
VinyalsRNNmaxdirect  .226  .518  .539  .808  .313  .583  .490  .702  .0257  .694
SGM  -  -  -  -  -  -  -  -  .0245  .710
setRNN  .310  .538  .607  .838  .361  .607  .548  .731  .0241  .720

Table 4: Performance of the RNN methods under the two prediction strategies. Each cell reads top-sequence prediction / top-set prediction.
Methods  Slashdot labelF1  Slashdot instanceF1  RCV1v2 labelF1  RCV1v2 instanceF1  TheGuardian labelF1  TheGuardian instanceF1  AAPD labelF1  AAPD instanceF1
seq2seqRNN  .270 / .269  .528 / .528  .561 / .561  .824 / .824  .331 / .336  .603 / .603  .510 / .511  .708 / .709
VinyalsRNNuniform  .279 / .288  .527 / .537  .578 / .587  .826 / .833  .313 / .336  .567 / .585  .532 / .542  .721 / .724
VinyalsRNNsample  .300 / .303  .531 / .537  .590 / .597  .828 / .833  .339 / .351  .597 / .602  .527 / .530  .706 / .708
VinyalsRNNmax  .293 / .301  .530 / .535  .588 / .585  .829 / .830  .343 / .352  .599 / .604  .535 / .537  .709 / .712
VinyalsRNNmaxdirect  .226 / .228  .518 / .519  .539 / .538  .808 / .808  .313 / .316  .583 / .584  .490 / .490  .702 / .701
setRNN  .297 / .310  .528 / .538  .593 / .607  .831 / .838  .349 / .361  .595 / .607  .548 / .548  .728 / .731

To process documents, we filter out stopwords and punctuation. Each document is truncated to a maximum of 500 words for TheGuardian and AAPD, and 120 words for Slashdot and RCV1v2. Zero padding is used if the document contains fewer words than the maximum number. Numbers and out-of-vocabulary words are replaced with special tokens. Words, user tags and labels are all encoded as 300-dimensional vectors using word2vec Mikolov et al. (2013).
We implement RNNs with attention using tensorflow 1.4.0 Abadi et al. (2016). The dynamic function for RNNs is chosen to be Gated Recurrent Units (GRU) with 2 layers and at most 50 units in the decoder. The size of the GRU unit is 300. We set the dropout rate to 0.3, and train the model with the Adam optimizer Kingma and Ba (2014) with learning rate . Beam size is set to 12 at both training and inference stages. We adopt labelF1 (average F1 over labels) and instanceF1 (average F1 over instances) as our main evaluation metrics, as defined below for $N$ instances and $L$ labels:
$$\text{instanceF1} = \frac{1}{N} \sum_{n=1}^{N} \frac{2 \sum_{j=1}^{L} y^{(n)}_j \hat{y}^{(n)}_j}{\sum_{j=1}^{L} y^{(n)}_j + \sum_{j=1}^{L} \hat{y}^{(n)}_j}$$
$$\text{labelF1} = \frac{1}{L} \sum_{j=1}^{L} \frac{2 \sum_{n=1}^{N} y^{(n)}_j \hat{y}^{(n)}_j}{\sum_{n=1}^{N} y^{(n)}_j + \sum_{n=1}^{N} \hat{y}^{(n)}_j}$$
where for each instance $n$, $y^{(n)}_j = 1$ if label $j$ is a given label in the ground truth, and $\hat{y}^{(n)}_j = 1$ if label $j$ is a predicted label.
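The two metrics can be computed from binary indicator matrices as in the sketch below. It follows the usual definitions of instance-averaged and label-averaged F1; the convention of scoring 0 when both the ground truth and the prediction are empty is an assumption, and the function names are illustrative.

```python
def f1(true_positives, n_true, n_pred):
    """F1 from counts; returns 0.0 when both sets are empty (assumed convention)."""
    denom = n_true + n_pred
    return 2.0 * true_positives / denom if denom else 0.0


def instance_f1(Y_true, Y_pred):
    """Average over instances (rows) of the per-instance F1."""
    scores = []
    for y_true, y_pred in zip(Y_true, Y_pred):
        tp = sum(t * p for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, sum(y_true), sum(y_pred)))
    return sum(scores) / len(scores)


def label_f1(Y_true, Y_pred):
    """Average over labels (columns) of the per-label F1."""
    # Transposing turns each label column into a row, so the same code applies.
    return instance_f1(list(zip(*Y_true)), list(zip(*Y_pred)))


if __name__ == "__main__":
    # Two documents, three labels; rows are documents, columns are labels.
    Y_true = [[1, 1, 0], [0, 1, 0]]
    Y_pred = [[1, 0, 0], [0, 1, 1]]
    print(round(instance_f1(Y_true, Y_pred), 4))  # 0.6667
    print(round(label_f1(Y_true, Y_pred), 4))     # 0.5556
```

In this example the third label is never in the ground truth but is predicted once, so it contributes an F1 of 0 to labelF1, which illustrates why labelF1 is sensitive to rare labels.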
We compare our method with the following methods:

Binary Relevance (BR) Tsoumakas and Katakis (2007) with both independent training and prediction;

Binary Relevance with support inference (BRsupport) Wang et al. (2018) which trains binary classifiers independently but imposes label constraints at prediction time by only considering label sets observed during training, namely $\hat{y} = \arg\max_{y \in \{y^{(1)}, \dots, y^{(N)}\}} \prod_{l=1}^{L} p(y_l \mid x)$;

Probabilistic Classifier Chain (PCC) Dembczynski et al. (2010) which transforms the multilabel classification task into a chain of binary classification problems. Predictions are made with Beam Search.

VinyalsRNNuniform, VinyalsRNNsample, and VinyalsRNNmax are three variants of RNNs proposed by Vinyals et al. (2016). They are trained with different objectives that correspond to different transformations between sets and sequences. See Table 1 for a summary of their training objectives. Following the approach taken by Vinyals et al. (2016), VinyalsRNNsample and VinyalsRNNmax are initialized by VinyalsRNNuniform. We have also tested training VinyalsRNNmax directly without having VinyalsRNNuniform as an initialization, and we name it as VinyalsRNNmaxdirect.

Sequence Generation Model (SGM) Yang et al. (2018) which trains the RNN model similar to seq2seqRNN but uses a new decoder structure that computes a weighted global embedding based on all labels as opposed to just the top one at each timestep.
In BR and PCC, logistic regressions with L1 and L2 regularizations are used as the underlying binary classifiers. seq2seqRNN, PCC, and SGM rely on a particular label order. We adopt the decreasing label frequency order, which is the most popular choice.
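To make the BRsupport baseline concrete, the sketch below restricts prediction to label sets seen in training and scores each one with independent per-label probabilities. The scoring rule (a product of independent Bernoulli terms) is an assumption about how the constraint is applied, and the probabilities and label sets are made up for illustration.

```python
import math


def br_support_predict(marginals, observed_sets):
    """Pick the observed label set with the highest independent-label score.

    marginals: dict mapping label -> p(label present | x), each strictly in (0, 1).
    observed_sets: iterable of frozensets seen in the training data.
    """
    def log_score(label_set):
        return sum(
            math.log(p if label in label_set else 1.0 - p)
            for label, p in marginals.items()
        )
    return max(observed_sets, key=log_score)


if __name__ == "__main__":
    # Hypothetical outputs of three independently trained binary classifiers.
    marginals = {"sport": 0.9, "football": 0.6, "russia": 0.55}
    observed = [
        frozenset({"sport", "football"}),
        frozenset({"sport"}),
        frozenset({"russia"}),
    ]
    print(sorted(br_support_predict(marginals, observed)))  # ['football', 'sport']
```

Plain Binary Relevance with a 0.5 threshold would output all three labels here; the support constraint rules that combination out because it never occurred in training.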
4.2 Experimental Results
Table 3 shows the performance of different methods in terms of labelF1 and instanceF1. The SGM results are taken directly from Yang et al. (2018), and are originally reported only on AAPD dataset in terms of hammingloss and microF1. Definitions of these two metrics can be found in Koyejo et al. (2015).
Our method performs the best in all metrics on all datasets (except hamming loss on AAPD, see Table 3). In general, RNN-based methods perform better than the traditional methods BR, BRsupport and PCC. Among the VinyalsRNN variants, VinyalsRNNmax and VinyalsRNNsample work the best and have similar performance. However, they have to be initialized by VinyalsRNNuniform. Otherwise, the training gets stuck at an early stage and the performance degrades significantly. One can see the clear degradation by comparing the VinyalsRNNmax row (with initialization) with the VinyalsRNNmaxdirect row (without initialization). By contrast, our training objective in setRNN does not suffer from this issue and can serve as a stable stand-alone training objective.
On TheGuardian dataset, setRNN performs slightly better than seq2seqRNN in terms of instanceF1, but much better in terms of labelF1. It is known that instanceF1 is basically determined by the popular labels’ performance while labelF1 is also sensitive to the performance on rare labels. Figure 1 shows that setRNN predicts rare labels better than seq2seqRNN.
Next we analyze how much benefit our new set prediction strategy brings. For each RNN-based method, we test two prediction strategies: 1) finding the sequence with the highest probability and outputting the corresponding set (this is the default prediction strategy for all models except setRNN); 2) outputting the set with the highest probability (this is the default prediction strategy for setRNN). Table 4 shows how each method performs with these two prediction strategies. One can see that VinyalsRNNuniform and setRNN benefit most from predicting the top set; VinyalsRNNsample, VinyalsRNNmax and VinyalsRNNmaxdirect benefit less; and seq2seqRNN does not benefit at all. Intuitively, for the top-set prediction to be different from the top-sequence prediction, the model has to spread probability mass across different sequence permutations of the same set.
4.3 Analysis: Sequence Probability Distribution
The results in Table 4 motivate us to check how sharply (or uniformly) the probabilities are distributed over different sequence permutations of the predicted set. We first normalize the sequence probabilities related to the predicted set and then compute the entropy. To make predictions with different set sizes (and hence different numbers of sequence permutations) comparable, we further divide the entropy by the logarithm of the number of sequences. Smaller entropy values indicate a sharper distribution. The results are shown in Figure 2.
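The normalized entropy described above can be sketched as follows. The handling of sets with a single permutation (returning 0) and of zero-probability sequences (skipped in the sum) are assumptions made so that the function is defined everywhere; the function name is illustrative.

```python
import math


def normalized_entropy(seq_probs):
    """Entropy of the renormalized sequence probabilities, divided by log(#sequences).

    Returns a value in [0, 1]: 0 for a one-hot distribution, 1 for a uniform one.
    """
    n = len(seq_probs)
    if n < 2:
        return 0.0                      # a single ordering has no spread (assumed convention)
    total = sum(seq_probs)
    normalized = [p / total for p in seq_probs if p > 0]
    entropy = -sum(q * math.log(q) for q in normalized)
    return entropy / math.log(n)


if __name__ == "__main__":
    print(normalized_entropy([0.1] * 6))                 # close to 1: uniform over 6 orderings
    print(normalized_entropy([0.9, 0.02, 0.02, 0.02]))   # small: mass on one ordering
```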
seq2seqRNN trained with fixed label order and standard RNN objective (1) generates very sharp sequence distributions. It basically only assigns probability to one sequence in the given order. The entropy is close to 0. In this case, predicting the set is no different than predicting the top sequence (see Table 4). On the other extreme is VinyalsRNNuniform, trained with objective (3), which spreads probabilities across many sequences, and leads to the highest entropy among all models tested (the uniform distribution has the max entropy of 1). From Table 4, we see that by summing up sequence probabilities and predicting the most probable set, VinyalsRNNuniform’s performance improves. But as discussed earlier, training with the objective (3) makes it impossible for the model to discover and concentrate on a particular natural label order (represented by a sequence). Overall VinyalsRNNuniform is not competitive even with the setprediction enhancement. Between the above two extremes are VinyalsRNNmax and setRNN (we have omitted VinyalsRNNsample and VinyalsRNNmaxdirect here as they are similar to VinyalsRNNmax). Both models are allowed to assign probability mass to a subset of sequences. VinyalsRNNmax produces sharper sequence distributions than setRNN, because VinyalsRNNmax has the incentive to allocate most of the probability mass to the most probable sequence due to the max operator in its training objective (2). From Table 4, one can see that setRNN clearly benefits from summing up sequence probabilities and predicting the most probable set while VinyalsRNNmax does not benefit much. Therefore, the sequence probability summation is best used in both training and prediction, as in our proposed method.
Comparing the 4 datasets in Table 4, we also see that Slashdot and TheGuardian, which have larger label cardinalities (and therefore potentially more permutations per set), benefit more from predicting the most probable set than RCV1v2 and AAPD, which have smaller label cardinalities.
5 Case Analysis
We further demonstrate how setRNN works with two examples. In the first example from the RCV1v2 dataset, the most probable set predicted by setRNN (which is also the correct set in this example) does not come from the most probable sequence. Top sequences in decreasing probability order are listed in Table 5. The correct label set {forex, markets, equity, money markets, metals trading, commodity} has the maximum total probability of 0.161, but does not match the top sequence.
PROB  SEQUENCE 

0.0236  equity, markets, money markets, forex 
0.0196  forex, markets, equity, money markets, metals trading, commodity 
0.0194  equity, markets, forex, money markets, metals trading, commodity 
0.0159  markets, equity, forex, money markets, metals trading, commodity 
0.0157  forex, money markets, equity, metals trading, markets, commodity 
0.0153  forex, money markets, markets, equity, metals trading, commodity 
0.0148  markets, equity, money markets, forex 
0.0143  money markets, equity, metals trading, commodity, forex, markets 
0.0123  markets, money markets, equity, metals trading, commodity, forex 
0.0110  markets, equity, forex, money markets, commodity, metals trading 
0.0107  forex, markets, equity, money markets, commodity, metals trading 
0.0094  forex, money markets, equity, markets, metals trading, commodity 
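The effect can be reproduced from the rows of Table 5 alone by grouping the listed sequences by their label set and summing the probabilities. The sketch below uses only the twelve sequences shown in the table, so the total it obtains for the six-label set (0.1436) is lower than the 0.161 quoted in the text, which presumably also counts sequences that are not listed.

```python
from collections import defaultdict

# The twelve rows of Table 5: (probability, label sequence).
TABLE5 = [
    (0.0236, "equity, markets, money markets, forex"),
    (0.0196, "forex, markets, equity, money markets, metals trading, commodity"),
    (0.0194, "equity, markets, forex, money markets, metals trading, commodity"),
    (0.0159, "markets, equity, forex, money markets, metals trading, commodity"),
    (0.0157, "forex, money markets, equity, metals trading, markets, commodity"),
    (0.0153, "forex, money markets, markets, equity, metals trading, commodity"),
    (0.0148, "markets, equity, money markets, forex"),
    (0.0143, "money markets, equity, metals trading, commodity, forex, markets"),
    (0.0123, "markets, money markets, equity, metals trading, commodity, forex"),
    (0.0110, "markets, equity, forex, money markets, commodity, metals trading"),
    (0.0107, "forex, markets, equity, money markets, commodity, metals trading"),
    (0.0094, "forex, money markets, equity, markets, metals trading, commodity"),
]


def set_totals(rows):
    """Sum the sequence probabilities of all rows that share the same label set."""
    totals = defaultdict(float)
    for prob, sequence in rows:
        label_set = frozenset(label.strip() for label in sequence.split(","))
        totals[label_set] += prob
    return dict(totals)


if __name__ == "__main__":
    totals = set_totals(TABLE5)
    for label_set, total in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(round(total, 4), sorted(label_set))
```

The four-label set that owns the top sequence accumulates only 0.0384 from its two listed orderings, while the six-label set accumulates 0.1436 from ten.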
Next we demonstrate the issue with prescribing the sequence order in seq2seqRNN, using an example from TheGuardian (this document can be viewed at http://www.guardian.co.uk/artanddesign/jonathanjonesblog/2009/apr/08/altermodernismnicolasbourriaud). Figure 3 shows the predictions made by seq2seqRNN and our method. In this particular example the top sequence agrees with the top set in our method's prediction, so we can just analyze the top sequence. seq2seqRNN predicts Tate Modern (incorrect but more popular label) while we predict Tate Britain (correct but less popular label). The sequence predicted by seq2seqRNN is in decreasing label frequency order, while our predicted sequence is not. In the training data, Exhibition is more frequent than Tate Britain and Tate Modern. If we arrange labels by decreasing frequency, Exhibition is immediately followed by Tate Modern 19 times, and by Tate Britain only 3 times. So it is far more likely to have Tate Modern than Tate Britain after Exhibition. However, at the set level, Exhibition and Tate Modern co-occur 22 times while Exhibition and Tate Britain co-occur 12 times, so the difference is not so dramatic. In this case, imposing the sequence order biases the probability estimation and leads to incorrect predictions.
6 Conclusion
In this work, we present an adaptation of RNN sequence models to the problem of multilabel classification for text. RNN only directly defines probabilities for sequences, but not for sets. Different from previous approaches, which either transform a set to a sequence in some prespecified order, or relate the sequence probability to the set probability in some ad hoc way, our formulation is derived from a principled notion of set probability. We define the set probability as the sum of all corresponding sequence permutation probabilities. We derive a new training objective that maximizes the set probability and a new prediction objective that finds the most probable set. These new objectives are theoretically more appealing than existing ones, because they give the RNN model more freedom to automatically discover and utilize the best label orders.
Acknowledgements
We thank reviewers and Krzysztof Dembczyński for their helpful comments, Xiaofeng Yang for her help on writing, and Bingyu Wang for his help on proofreading. This work has been generously supported through a grant from the Massachusetts General Physicians Organization.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pages 265–283.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

 Boutell et al. (2004) Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. 2004. Learning multilabel scene classification. Pattern Recognition, 37(9):1757–1771.
 Chen et al. (2018) Shang-Fu Chen, Yi-Chen Chen, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. 2018. Order-free RNN with visual attention for multilabel classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734.
 Dembczynski et al. (2010) Krzysztof Dembczynski, Weiwei Cheng, and Eyke Hüllermeier. 2010. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 279–286.
 Ghamrawi and McCallum (2005) Nadia Ghamrawi and Andrew McCallum. 2005. Collective multilabel classification. In Proceedings of the 14th ACM international conference on Information and knowledge management, pages 195–200. ACM.
 Jin and Nakayama (2016) Jiren Jin and Hideki Nakayama. 2016. Annotation order matters: Recurrent image annotator for arbitrary length image tagging. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pages 2452–2457.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Koyejo et al. (2015) Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. 2015. Consistent multilabel classification. In Advances in Neural Information Processing Systems, pages 3321–3329.
 Li et al. (2016) Cheng Li, Bingyu Wang, Virgil Pavlu, and Javed Aslam. 2016. Conditional bernoulli mixtures for multilabel classification. In International Conference on Machine Learning, pages 2482–2491.
 Liu and Tsang (2015) Weiwei Liu and Ivor Tsang. 2015. On the optimality of classifier chain for multilabel classification. In Advances in Neural Information Processing Systems, pages 712–720.
 Lowerre (1976) Bruce T Lowerre. 1976. The Harpy speech recognition system. Technical report, Carnegie-Mellon University, Pittsburgh, PA, Department of Computer Science.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
 Nam et al. (2017) Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J. Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multilabel classification. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5419–5429.
 Read et al. (2009) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2009. Classifier chains for multilabel classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 254–269. Springer.
 Rumelhart et al. (1988) David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. 1988. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1.
 Tsoumakas and Katakis (2007) Grigorios Tsoumakas and Ioannis Katakis. 2007. Multilabel classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13.
 Vinyals et al. (2016) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. CoRR, abs/1511.06391.
 Wang et al. (2018) Bingyu Wang, Cheng Li, Virgil Pavlu, and Jay Aslam. 2018. A pipeline for optimizing F1-measure in multilabel text classification. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 913–918. IEEE.

 Wang et al. (2016) Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. 2016. CNN-RNN: A unified framework for multilabel image classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2285–2294.
 Wang et al. (2017a) Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. 2017a. Attribute recognition by joint recurrent learning of context and correlation. CoRR, abs/1709.08553.
 Wang et al. (2017b) Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin. 2017b. Multilabel image recognition by recurrently discovering attentional regions. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 464–472.
 Yang et al. (2018) Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: sequence generation model for multilabel classification. CoRR, abs/1806.04822.
 Zhang et al. (2016) Junjie Zhang, Qi Wu, Chunhua Shen, Jian Zhang, and Jianfeng Lu. 2016. Multilabel image classification with regional latent semantic dependencies. CoRR, abs/1612.01082.