Deep Learning with a Rethinking Structure for Multi-label Classification

02/05/2018 · Yao-Yuan Yang et al. · National Taiwan University

Multi-label classification (MLC) is an important learning problem that expects the learning algorithm to take the hidden correlation of the labels into account. Extracting the hidden correlation is generally a challenging task. In this work, we propose a novel deep learning framework to better extract the hidden correlation with the help of the memory structure within recurrent neural networks. The memory stores the temporary guesses on the labels and effectively allows the framework to rethink the goodness and correlation of the guesses before making the final prediction. Furthermore, the rethinking process makes it easy to adapt to different evaluation criteria to match real-world application needs. Experimental results across many real-world data sets justify that the rethinking process indeed improves MLC performance across different evaluation criteria and leads to superior performance over state-of-the-art MLC algorithms.


1 Introduction

Human beings master our skills for a given problem by working on and thinking through the same problem over and over again. When a difficult problem is given to us, multiple attempts would have gone through our mind to simulate different possibilities. During this period, our understanding of the problem deepens, which in turn allows us to propose a better solution in the end. The deeper understanding comes from a piece of consolidated knowledge within our memory, which records how we built up the problem context through processing and predicting during the “rethinking” attempts. The human-rethinking model above inspires us to design a novel deep learning model for machine-rethinking, which is equipped with a memory structure to better solve the multi-label classification (MLC) problem.

The MLC problem aims to attach multiple relevant labels to an input instance simultaneously, and matches various application scenarios, such as tagging songs with a subset of emotions (Trohidis et al., 2008) or labeling images with objects (Wang et al., 2016). Those MLC applications typically come with an important property called label correlation (Cheng et al., 2010; Huang and Zhou, 2012). For instance, when tagging songs with emotions, “angry” is negatively correlated with “happy”; when labeling images, the existence of a desktop computer probably indicates the co-existence of a keyboard and a mouse. Many existing MLC works implicitly or explicitly take label correlation into account to better solve MLC problems (Cheng et al., 2010).

Label correlation is also known to be important for humans when solving MLC problems (Bar, 2004). For instance, when solving an image labeling task upon entering a new room, we might notice some more obvious objects like the sofa, dining table and wooden floor at first glance. Such a combination of objects hints that we are in a living room, which helps us recognize that the “geese” on the sofa are stuffed animals rather than real ones. The recognition route from the sofa to the living room to the stuffed animals requires rethinking about the correlation of the predictions step by step. Our proposed machine-rethinking model mimics this human-rethinking process to digest label correlation and solve MLC problems more accurately.

Next, we introduce some representative MLC algorithms before connecting them to our proposed machine-rethinking model. Binary relevance (BR) (Tsoumakas et al., 2009) is a baseline MLC algorithm that does not consider label correlation. For each label, BR learns a binary classifier to predict the label’s relevance independently. Classifier chain (CC) (Read et al., 2009) extends BR by taking some label correlation into account. CC links the binary classifiers as a chain and feeds the predictions of the earlier classifiers as features to the later classifiers. The later classifiers can thus utilize (the correlation to) the earlier predictions to form better predictions.
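To make the chaining mechanism concrete, the following is a minimal sketch of a classifier chain over binary base learners; the logistic-regression base learner and the fixed label order are illustrative assumptions rather than choices made in the CC paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_classifier_chain(X, Y):
    """Train one binary classifier per label, feeding earlier
    predictions as extra features to later classifiers (CC)."""
    chain = []
    X_aug = X.copy()
    for k in range(Y.shape[1]):               # fixed label order 0..K-1 (assumption)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_aug, Y[:, k])
        pred = clf.predict(X_aug).reshape(-1, 1)
        X_aug = np.hstack([X_aug, pred])      # append the k-th prediction as a feature
        chain.append(clf)
    return chain

def predict_classifier_chain(chain, X):
    """Predict labels one by one, reusing earlier predictions as features."""
    X_aug = X.copy()
    preds = []
    for clf in chain:
        p = clf.predict(X_aug).reshape(-1, 1)
        X_aug = np.hstack([X_aug, p])
        preds.append(p)
    return np.hstack(preds)
```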

The design of CC can be viewed as a memory mechanism that stores the label predictions of the earlier classifiers. The CNN-RNN (Wang et al., 2016) and Order-Free RNN with Visual Attention (Att-RNN) (Chen et al., 2017) algorithms extend CC by replacing this mechanism with a more sophisticated memory-based model—the recurrent neural network (RNN). By adopting different variations of RNN (Hochreiter and Schmidhuber, 1997; Cho et al., 2014), the memory can store more sophisticated concepts beyond earlier predictions. In addition, adopting RNN allows the algorithms to solve tasks like image labeling more effectively via end-to-end training with other deep learning architectures (e.g., the convolutional neural network in CNN-RNN).

The CC-family algorithms above for utilizing label correlation are reported to achieve better performance than BR (Read et al., 2009; Wang et al., 2016). Nevertheless, given that the predictions happen sequentially within a chain, those algorithms generally suffer from the issue of label ordering. In particular, classifiers in different positions of the chain receive different levels of information: the last classifier predicts with all information from the other classifiers, while the first classifier predicts with no information from the others. Att-RNN addresses this issue with beam search to approximate the optimal ordering of the labels, and dynamic programming based classifier chain (CC-DP) (Liu and Tsang, 2015) searches for the optimal ordering with dynamic programming. Both Att-RNN and CC-DP can be time-consuming when searching for the optimal ordering, and even after identifying a good ordering, the label correlation information is still not shared equally during the prediction process.

Our proposed deep learning model, called RethinkNet, tackles the label ordering issue by viewing CC differently. By considering CC-family algorithms as a rethinking model based on the partial predictions from earlier classifiers, we propose to fully memorize the temporary predictions from all classifiers during the rethinking process. That is, instead of forming a chain of binary classifiers, we form a chain of multi-label classifiers as a sequence of rethinking. RethinkNet learns to form preliminary guesses in the earlier classifiers of the chain, stores those guesses in the memory, and then corrects those guesses in the later classifiers with label correlation. Similar to CNN-RNN and Att-RNN, RethinkNet adopts RNN for making memory-based sequential predictions. We design a global memory for RethinkNet to store the information about label correlation, and the global memory allows all classifiers to share the same information without suffering from the label ordering issue.

Another advantage of RethinkNet is to tackle an important real-world need of Cost-Sensitive Multi-Label Classification (CSMLC) (Li and Lin, 2014). In particular, different MLC applications often require different evaluation criteria. To be seamlessly useful for a broad spectrum of applications, it is thus important to design CSMLC algorithms, which take the evaluation criterion (cost) into account during learning. State-of-the-art CSMLC algorithms include the condensed filter tree (CFT) (Li and Lin, 2014) and the probabilistic classifier chain (PCC) (Cheng et al., 2010). PCC extends CC to CSMLC by making Bayes optimal predictions according to the criterion. CFT also extends from CC, but achieves cost-sensitivity by converting the criterion into importance weights when training each binary classifier within CC. The conversion step in CFT generally requires knowing the predictions of all classifiers, which have readily been stored within the memory of RethinkNet. Thus, RethinkNet can be seamlessly combined with the importance-weighting idea within CFT to achieve cost-sensitivity. Extensive experiments across real-world data sets validate that RethinkNet indeed improves MLC performance across different evaluation criteria and is superior to state-of-the-art MLC and CSMLC algorithms. Furthermore, for image labeling, experimental results demonstrate that RethinkNet outperforms both CNN-RNN and Att-RNN. The results justify the usefulness of RethinkNet.

2 Preliminary

In the MLC problem, the goal is to map the feature vector $\mathbf{x} \in \mathcal{X}$ to a label vector $\mathbf{y} \in \{0, 1\}^K$, where $\mathbf{y}[k] = 1$ if and only if the $k$-th label is relevant. During training, MLC algorithms use the training data set $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ to learn a classifier $f$. During testing, a test example $(\mathbf{x}, \mathbf{y})$ is drawn from the same distribution that generated $\mathcal{D}$. The prediction is produced as $\hat{\mathbf{y}} = f(\mathbf{x})$. The goal of an MLC algorithm is to make the prediction $\hat{\mathbf{y}}$ close to $\mathbf{y}$.

The existence of diverse criteria for evaluating the closeness of $\hat{\mathbf{y}}$ and $\mathbf{y}$ calls for a more general setup called cost-sensitive multi-label classification (CSMLC) (Li and Lin, 2014). In this paper, we consider instance-wise evaluation criteria. These criteria can be generalized by a cost function $C(\mathbf{y}, \hat{\mathbf{y}})$, which represents the penalty of predicting $\mathbf{y}$ as $\hat{\mathbf{y}}$. For the CSMLC problem, the criterion used for evaluation is assumed to be known before training. That is, CSMLC algorithms learn a classifier $f$ from both the training data set $\mathcal{D}$ and the cost function $C$, and should be able to adapt to different $C$ easily. CSMLC algorithms aim at minimizing the expected cost $\mathbb{E}_{(\mathbf{x}, \mathbf{y})}\left[C(\mathbf{y}, f(\mathbf{x}))\right]$.
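As a concrete illustration of instance-wise cost functions, the sketch below implements two criteria used later in the experiments, Hamming loss and F1 score (stated as a cost by negating the score); the exact normalization and edge-case conventions are our assumptions and may differ from the evaluation package used in the paper.

```python
import numpy as np

def hamming_loss(y, y_hat):
    """Fraction of labels predicted incorrectly (lower is better)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y != y_hat))

def f1_cost(y, y_hat):
    """Instance-wise F1 score turned into a cost: 1 - F1 (lower is better)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    intersection = np.sum(y * y_hat)
    denom = np.sum(y) + np.sum(y_hat)
    if denom == 0:          # both vectors empty: define F1 = 1 (assumption)
        return 0.0
    return 1.0 - 2.0 * intersection / denom
```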

2.1 Recurrent Neural Network (RNN)

RNN is a class of neural network models designed to solve sequence prediction problems. RNN uses memory to pass information from one element in the sequence to the next. RNN learns two transformations: the memory transformation $U(\cdot)$ takes in the output of the previous element and passes it on to the next element, and the feature transformation $W(\cdot)$ takes in the feature vector and projects it to the output space. For $1 \le t \le B$, where $B$ is the length of the sequence, we use $\mathbf{x}_t$ to represent the feature vector of the $t$-th element in the sequence, and use $\mathbf{o}_t$ to represent its output vector. The RNN model can be written as $\mathbf{o}_1 = \sigma(W(\mathbf{x}_1))$ and $\mathbf{o}_t = \sigma(W(\mathbf{x}_t) + U(\mathbf{o}_{t-1}))$ for $2 \le t \le B$, where $\sigma$ is the activation function.
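The following is a minimal numpy sketch of this recurrence for the simple RNN case, where both transformations are plain matrix multiplications and $\sigma$ is the sigmoid; the dimensions and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srn_forward(xs, W, U):
    """Run a simple RNN over a sequence of feature vectors xs.

    xs: list of feature vectors, each of shape (d,)
    W:  feature transformation, shape (k, d)
    U:  memory transformation, shape (k, k)
    Returns the list of outputs o_1, ..., o_B.
    """
    outputs, o_prev = [], None
    for x in xs:
        pre = W @ x if o_prev is None else W @ x + U @ o_prev
        o_prev = sigmoid(pre)
        outputs.append(o_prev)
    return outputs

# toy usage with illustrative dimensions
rng = np.random.default_rng(0)
d, k, B = 5, 3, 4
xs = [rng.normal(size=d) for _ in range(B)]
W, U = rng.normal(size=(k, d)), rng.normal(size=(k, k))
print([o.round(2) for o in srn_forward(xs, W, U)])
```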

RNN comes in different forms. The basic form of RNN is called the simple RNN (SRN) (Elman, 1990; Jordan, 1997). SRN assumes $W$ and $U$ to be linear transformations. SRN is able to link information from one element to the later elements, but it can be hard to train due to the decay of gradients (Hochreiter et al., 2001). Several other forms of RNN are designed to address this problem, including the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), the gated recurrent unit (GRU) (Cho et al., 2014) and the identity RNN (IRNN) (Le et al., 2015).

3 Proposed Model

The idea of improving the prediction result by iteratively polishing it is the “rethinking” process. This process can be cast as a sequence prediction problem, and RethinkNet adopts a recurrent neural network (RNN) to model it.

Figure 1 illustrates how RethinkNet is designed. RethinkNet is composed of an RNN layer and a dense (fully connected) layer. The dense layer learns a label embedding that transforms the output of the RNN layer into a label vector. The RNN layer is used to model the “rethinking” process. All steps in the RNN share the same feature vector $\mathbf{x}$ since they are solving the same MLC problem. The output $\mathbf{o}_t$ of the RNN layer represents the embedding of the label vector $\hat{\mathbf{y}}^{(t)}$, and each $\mathbf{o}_t$ is passed down to the $(t+1)$-th element in the RNN layer.

In the first step, RethinkNet makes a prediction based on the feature vector alone, which targets the labels that are easier to identify. This first prediction is similar to BR, which predicts each label independently without information about the other labels. From the second step onward, RethinkNet uses the result of the previous step to make a better prediction $\hat{\mathbf{y}}^{(t)}$. The last prediction $\hat{\mathbf{y}}^{(B)}$ is taken as the final prediction $\hat{\mathbf{y}}$. As RethinkNet polishes the prediction, difficult labels eventually get labeled more correctly.

Figure 1: The architecture of the proposed RethinkNet model (feature vector → RNN layer → dense layer).
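A minimal Keras sketch of this architecture is shown below; the layer sizes, the choice of SimpleRNN, and the number of rethink iterations are illustrative assumptions rather than the exact configuration used in the paper.

```python
from tensorflow.keras import layers, models

def build_rethinknet(num_features, num_labels, rethink_iters=3, rnn_dim=128):
    """Sketch of RethinkNet: the same feature vector is fed to every step of
    an RNN, and a shared dense layer maps each RNN output to a label vector
    (one temporary prediction per rethink iteration)."""
    x = layers.Input(shape=(num_features,))
    # repeat the feature vector so every rethink step sees the same input
    repeated = layers.RepeatVector(rethink_iters)(x)
    # the RNN memory carries the temporary guesses between steps
    rnn_out = layers.SimpleRNN(rnn_dim, return_sequences=True)(repeated)
    # shared dense layer = label embedding from RNN output to label space
    y_hats = layers.TimeDistributed(
        layers.Dense(num_labels, activation="sigmoid"))(rnn_out)
    return models.Model(inputs=x, outputs=y_hats)

# illustrative sizes only
model = build_rethinknet(num_features=294, num_labels=6)
model.compile(optimizer="nadam", loss="binary_crossentropy")
```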

3.1 Modeling Label Correlation

RethinkNet models label correlation in the memory of the RNN layer. To simplify the illustration, we assume that the activation function $\sigma$ is the sigmoid function and the dense layer is the identity transformation. Also, SRN is used in the RNN layer; other forms of RNN share a similar property since they originate from SRN. In SRN, the memory and feature transformations are represented as matrices $U$ and $W$ respectively. The RNN layer output will be a label vector of length $K$.

Under this setting, the predicted label vector at step $t$ is $\hat{\mathbf{y}}^{(t)} = \sigma(W\mathbf{x} + U\hat{\mathbf{y}}^{(t-1)})$. This equation can be separated into two parts: the feature term $W\mathbf{x}$, which makes the prediction like BR, and the memory term $U\hat{\mathbf{y}}^{(t-1)}$, which transforms the previous prediction into the current label vector space. This memory transformation serves as the model for label correlation. The entry $U[i][j]$ in the $i$-th row and $j$-th column of $U$ represents the correlation between the $i$-th and $j$-th labels. The prediction of the $i$-th label is the combination of $(W\mathbf{x})[i]$ and $\sum_{j} U[i][j]\,\hat{\mathbf{y}}^{(t-1)}[j]$. If we predict the $j$-th label as relevant at step $t-1$ and $U[i][j]$ is high, it indicates that the $i$-th label is more likely to be relevant. If $U[i][j]$ is negative, it indicates that the $i$-th and $j$-th labels may be negatively correlated.
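For intuition, here is a tiny numpy sketch of this decomposition under the simplifying assumptions above (sigmoid activation, identity dense layer); the matrices are toy values chosen only to show one positive and one negative correlation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy setting with K = 3 labels
Wx = np.array([0.5, -0.2, 0.1])        # feature term W x (BR-like scores)
U = np.array([[0.0,  2.0, 0.0],        # label 1 boosted when label 2 is relevant
              [0.0,  0.0, 0.0],
              [0.0, -2.0, 0.0]])       # label 3 suppressed when label 2 is relevant
y_prev = np.array([0.0, 1.0, 0.0])     # previous guess: only label 2 relevant

y_next = sigmoid(Wx + U @ y_prev)      # feature term + memory term
print(y_next.round(2))                 # label 1 pulled up, label 3 pushed down
```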

Figure 2 plots the learned memory transformation matrix $U$ and the correlation coefficients of the labels. We can clearly see that RethinkNet is able to capture the label correlation information, although such results on some data sets can be noisy. The finding suggests that $U$ may carry not only label correlation but also other data set factors. For example, the RNN model may learn that the prediction of a certain label is not very accurate; therefore, even if another label is highly correlated with this one, the model would not give it a high weight.

Figure 2: (a) the trained memory transformation matrix of the SRN and (b) the correlation coefficients on the yeast data set. Each cell represents the correlation between two labels. Each row of the memory weight matrix is normalized so that the diagonal element is 1, making it comparable with the correlation coefficients.

3.2 Cost-Sensitive Reweighted Loss Function

Cost information is another important piece of information that should be considered when solving an MLC problem. Different cost functions value each label differently, so we should set the importance of each label differently. One way to encode this property is to weight each label in the loss function according to its importance. The problem then becomes how to estimate the label importance.

The difference in cost between predicting a label correctly and incorrectly can be used to estimate the importance of that label. To evaluate the importance of a single label, most cost functions require the other labels to be filled in. We leverage the sequential nature of RethinkNet, where temporary predictions are made at each iteration: using the temporary prediction to fill in all other labels, we can estimate the importance of each label.

The weight of each label is designed as in equation (1). For $t = 1$, where no prior prediction exists, the labels are set with equal importance. For $t \ge 2$, we use $\hat{\mathbf{y}}^{(t-1)}_{[k \to 1]}$ and $\hat{\mathbf{y}}^{(t-1)}_{[k \to 0]}$ to represent the previous label vector with the $k$-th label set to $1$ and $0$ respectively. The weight of each label is therefore the cost difference between $\hat{\mathbf{y}}^{(t-1)}_{[k \to 1]}$ and $\hat{\mathbf{y}}^{(t-1)}_{[k \to 0]}$. This weighting approach estimates the effect of each label under the current prediction with the given cost function, and echoes the design of CFT (Li and Lin, 2014).

$$w^{(t)}[k] = \begin{cases} 1, & t = 1 \\ \left| C\big(\mathbf{y},\, \hat{\mathbf{y}}^{(t-1)}_{[k \to 1]}\big) - C\big(\mathbf{y},\, \hat{\mathbf{y}}^{(t-1)}_{[k \to 0]}\big) \right|, & t \ge 2 \end{cases} \qquad (1)$$
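As a minimal sketch, equation (1) can be computed as follows for any instance-wise cost function; binarizing the temporary prediction at a 0.5 threshold is our assumption about how the previous guess is filled in.

```python
import numpy as np

def label_weights(y_true, y_prev, cost_fn):
    """Equation (1) for t >= 2: the weight of label k is the cost difference
    between setting the k-th bit of the previous prediction to 1 and to 0."""
    y_prev = (np.asarray(y_prev, dtype=float) >= 0.5).astype(float)  # binarize the temporary guess
    weights = np.zeros(len(y_prev))
    for k in range(len(y_prev)):
        y_hi, y_lo = y_prev.copy(), y_prev.copy()
        y_hi[k], y_lo[k] = 1.0, 0.0
        weights[k] = abs(cost_fn(y_true, y_hi) - cost_fn(y_true, y_lo))
    return weights

# toy usage: under Hamming loss every label gets the same weight,
# so the reweighting reduces to a BR-style uniform weighting
hamming = lambda y, yh: float(np.mean(np.asarray(y) != np.asarray(yh)))
y_true = np.array([1, 0, 1, 0])
y_prev = np.array([0.9, 0.2, 0.4, 0.1])
print(label_weights(y_true, y_prev, hamming))   # -> [0.25 0.25 0.25 0.25]
```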

To incorporate the weights into the loss function, we formulate the weighted binary cross-entropy as in equation (2). For $t = 1$, the weights for all labels are set to 1 since there is no prediction to reference. For $t \ge 2$, the weights are updated using the previous prediction. Note that when the given cost function is Hamming loss, the labels in each iteration are weighted the same and the weighting reduces to the same as in BR.

$$L\big(\mathbf{y}, \hat{\mathbf{y}}^{(t)}\big) = -\sum_{k=1}^{K} w^{(t)}[k] \Big( \mathbf{y}[k] \log \hat{\mathbf{y}}^{(t)}[k] + \big(1 - \mathbf{y}[k]\big) \log\big(1 - \hat{\mathbf{y}}^{(t)}[k]\big) \Big) \qquad (2)$$
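The sketch below writes equation (2) directly in numpy; the small epsilon added inside the logarithms for numerical stability is our assumption rather than part of the formulation.

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Weighted binary cross-entropy of equation (2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_label = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return -np.sum(np.asarray(weights) * per_label)

# toy usage: uniform weights reduce to the ordinary BR-style loss
print(weighted_bce([1, 0, 1], [0.8, 0.3, 0.6], weights=[1, 1, 1]))
```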
algorithm   | memory content                | cost-sensitivity | feature extraction
BR          | -                             | -                | -
CC          | former predictions            | -                | -
CC-DP       | optimally ordered predictions | -                | -
PCC         | former predictions            | yes              | -
CFT         | former predictions            | yes              | -
CNN-RNN     | former predictions in RNN     | -                | CNN
Att-RNN     | former predictions in RNN     | -                | CNN + attention
RethinkNet  | full predictions in RNN       | yes              | general NN
Table 1: Comparison between MLC algorithms.

Table 1 shows a comparison between MLC algorithms. RethinkNet is able to consider both label correlation and cost information. Its structure also allows it to be easily extended with other neural networks for advanced feature extraction, so it can readily be adapted to image labeling problems. In Section 4, we demonstrate how these advantages turn into better results.

4 Experiments

The experiments are evaluated on 11 real-world data sets (Tsoumakas et al., 2011). Each data set is randomly split into 75% training and 25% testing. All experiments are repeated 10 times, and the mean and standard error (ste) of the testing loss/score are recorded. The results are evaluated with Hamming loss, Rank loss, F1 score and Accuracy score (Li and Lin, 2014). We use ($\downarrow$) to indicate that a lower value of the criterion is better and ($\uparrow$) to indicate that a higher value is better.

RethinkNet is implemented using keras (Chollet, 2015) with tensorflow (Abadi et al., 2015). The RNN layer can be interchanged with different variations of RNN, including SRN, LSTM, GRU and IRNN. A 25% dropout on the memory matrix of the RNN is added. A single fully-connected layer is used for the dense layer, and Nesterov Adam (Nadam) (Dozat, 2016) is used to optimize the model. The model is trained with a fixed batch size until it converges or reaches the maximum number of epochs. We add an L2 regularizer to the training parameters, and the regularization strength is searched with three-fold cross-validation.
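Building on the architecture sketch from Section 3, the following shows how the L2 regularization strength could be selected with three-fold cross-validation; the candidate grid, epoch limit, and batch size are placeholders since the exact values are not reproduced here, and `build_model_fn` is a hypothetical helper that returns a compiled Keras model.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_l2_strength(X, Y, build_model_fn, candidates=(1e-5, 1e-4, 1e-3)):
    """Pick the L2 strength with the lowest mean validation loss (3-fold CV).
    Y is assumed to be the label matrix tiled to (n, rethink_iters, K)."""
    best_l2, best_loss = None, np.inf
    for l2 in candidates:                                   # candidate grid is illustrative
        fold_losses = []
        for train_idx, val_idx in KFold(n_splits=3, shuffle=True).split(X):
            model = build_model_fn(l2)
            model.fit(X[train_idx], Y[train_idx],
                      epochs=10, batch_size=32, verbose=0)  # placeholder settings
            fold_losses.append(model.evaluate(X[val_idx], Y[val_idx], verbose=0))
        mean_loss = float(np.mean(fold_losses))
        if mean_loss < best_loss:
            best_l2, best_loss = l2, mean_loss
    return best_l2
```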

4.1 Rethinking

In Section 3, we claimed that RethinkNet is able to improve through iterations of rethinking; this experiment justifies the claim. We use the simplest form of RNN, SRN, in the RNN layer of RethinkNet, with the dimensionality of the RNN layer fixed. We vary the number of rethink iterations $B$ and plot the training and testing loss/score in Figure 3.

From the figure, we can observe that cost functions such as Rank loss, F1 score and Accuracy score, which rely more on label correlation, show significant improvement as the number of rethink iterations grows. Hamming loss is a criterion that evaluates each label independently, and algorithms that do not consider label correlation, such as BR, perform well on such a criterion (Read et al., 2009). The first step of RethinkNet is essentially BR, so more iterations may not bring much benefit under Hamming loss. The results also show that the performance generally converges at around the third iteration. For efficiency, the rest of the experiments fix $B = 3$.

Figure 3: The mean performance versus the number of rethink iterations on the (a) scene and (b) CAL500 data sets.
Table 2: The mean performance of RethinkNet without (none) and with (reweighted) cost-sensitive reweighting under Rank loss, F1 score and Accuracy score on the 11 benchmark data sets (emotions, scene, yeast, birds, tmc2007-500, Arts1, medical, enron, Corel5k, CAL500, bibtex); best results are in bold.

4.2 Effect of Reweighting

This experiment verifies that cost-sensitive reweighting can utilize the cost information to reach better performance. We compare RethinkNet with and without reweighting under Rank loss, F1 score and Accuracy score. Table 2 lists the experimental results: on almost all data sets, reweighting the loss function for RethinkNet yields a better result.

Table 3: Training and testing performance of RethinkNet with different RNN variations (SRN, GRU, LSTM, IRNN), evaluated in Rank loss and F1 score on the 11 benchmark data sets; best testing results are in bold.

4.3 Compare with Other MLC Algorithms

We compare RethinkNet with other state-of-the-art MLC and CSMLC algorithms in this experiment. The competing algorithms include binary relevance (BR), probabilistic classifier chain (PCC), classifier chain (CC), dynamic programming based classifier chain (CC-DP) and condensed filter tree (CFT). To compare with the RNN structure used in CNN-RNN, we implemented a classifier chain using RNN (CC-RNN) as a competitor; CC-RNN is essentially CNN-RNN without the CNN layer since we are dealing with general data sets. BR is implemented as a feed-forward neural network with one hidden layer. We couple both CC-RNN and RethinkNet with an LSTM layer. CC-RNN and BR are trained and tuned using the same approach as RethinkNet, and these models are optimized using Nadam with default parameters. Training one independent feed-forward neural network per label is too computationally heavy, so we couple CFT, PCC and CC with L2-regularized logistic regression. CC-DP is coupled with a linear support vector machine (SVM) since it is derived for such a model. The regularization strength for these models is searched with three-fold cross-validation. PCC does not have an inference rule derived for Accuracy score, so we use the F1 score inference rule as an alternative in view of the similarity between the formulas.

The experimental results are shown in Table 6 and the t-test results in Table 4. Note that CC-DP did not finish within two weeks on the data sets Corel5k, CAL500 and bibtex, so those results are not listed. In terms of average ranking and t-test results, RethinkNet yields superior performance. On Hamming loss, all algorithms are generally competitive. For Rank loss, F1 score and Accuracy score, the CSMLC algorithms (RethinkNet, PCC, CFT) take the lead. Even when the parameters of the cost-insensitive algorithms are tuned on the target evaluation criterion, they are not able to compete with the cost-sensitive algorithms. This demonstrates the importance of developing cost-sensitive algorithms.

All three CSMLC algorithms have similar performance on Rank loss, and RethinkNet performs slightly better on F1 score. PCC is not able to directly utilize the cost information of Accuracy score, which makes PCC perform slightly worse there.

When comparing the deep structures (RethinkNet, CC-RNN, BR), only BR is competitive with RethinkNet under Hamming loss. In all other settings, RethinkNet outperforms the other two competitors. CC-RNN learns an RNN whose sequence length equals the number of labels $K$; when $K$ gets large, CC-RNN becomes very deep, which makes it hard to train with a fixed learning rate in our setting and causes it to perform poorly on these data sets. This demonstrates that RethinkNet is a better-designed deep structure for solving CSMLC problems.

criterion      | PCC     | CFT      | CC-DP  | CC     | CC-RNN | BR
Hamming loss   | 6/1/4   | 3/4/4    | 5/2/1  | 6/1/4  | 8/3/0  | 3/6/2
Rank loss      | 5/1/5   | 5/2/4    | 7/1/0  | 10/1/0 | 10/1/0 | 10/1/0
F1 score       | 6/2/3   | 5/4/2    | 5/2/1  | 8/3/0  | 10/1/0 | 9/2/0
Accuracy score | 7/1/3   | 5/4/2    | 5/1/2  | 7/4/0  | 9/2/0  | 9/2/0
total          | 24/5/15 | 18/14/12 | 22/6/4 | 31/9/4 | 37/7/0 | 31/11/2
Table 4: RethinkNet versus the other algorithms based on t-test at 95% confidence level (#win/#tie/#loss)
Table 5: Experimental results on the MSCOCO data set for the logistic regression baseline, CNN-RNN, Att-RNN and RethinkNet under Hamming loss, Rank loss, F1 score and Accuracy score.
Table 6: Experimental results (mean ± ste) of RethinkNet, PCC, CFT, CC-DP, CC, CC-RNN and BR under Hamming loss, Rank loss, F1 score and Accuracy score on the 11 benchmark data sets, together with the average rank of each algorithm; best results are in bold.

4.4 Comparison on Image Data Set

The CNN-RNN and Att-RNN algorithms are designed for image labeling problems. The purpose of this experiment is to understand how RethinkNet performs on such a task compared with CNN-RNN and Att-RNN. We use the MSCOCO data set (Lin et al., 2014) with the training/testing split provided by its authors. A pre-trained ResNet-50 (He et al., 2015) is adopted for feature extraction. The competing models include logistic regression as a baseline, CNN-RNN, Att-RNN and RethinkNet. We use the authors' implementation of Att-RNN, and the other models are implemented using keras. The models are fine-tuned together with the pre-trained ResNet-50. The results on the testing data are shown in Table 5, and they justify that RethinkNet is able to outperform state-of-the-art deep learning models designed for image labeling.
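A minimal Keras sketch of this pipeline might look as follows; the input resolution, pooling choice, and head dimensions are assumptions, and the RethinkNet head mirrors the earlier architecture sketch rather than the exact configuration used for MSCOCO.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_image_rethinknet(num_labels=80, rethink_iters=3, rnn_dim=256):
    """Pre-trained ResNet-50 backbone followed by a RethinkNet-style head."""
    backbone = ResNet50(include_top=False, weights="imagenet", pooling="avg")
    img = layers.Input(shape=(224, 224, 3))
    feat = backbone(img)                                   # 2048-d image feature
    h = layers.RepeatVector(rethink_iters)(feat)           # same feature at every step
    h = layers.LSTM(rnn_dim, return_sequences=True)(h)
    y = layers.TimeDistributed(layers.Dense(num_labels, activation="sigmoid"))(h)
    model = models.Model(img, y)
    model.compile(optimizer="nadam", loss="binary_crossentropy")
    return model

model = build_image_rethinknet()   # MSCOCO has 80 object categories
```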

4.5 Effect of Using Different RNN

In this experiment, we compare the performance of RethinkNet using different forms of RNN in the RNN layer. The candidates include SRN, LSTM, GRU and IRNN. We tune the label embedding dimensionality so that the total number of trainable parameters is roughly the same for each form of RNN. The results are evaluated on two commonly used cost functions, Rank loss and F1 score, and are shown in Table 3.
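Swapping the RNN variant amounts to replacing one layer in the earlier architecture sketch; the helper below is a hypothetical convenience wrapper, and the per-variant hidden sizes that would equalize the parameter counts are not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers

RNN_VARIANTS = {
    "SRN": layers.SimpleRNN,   # Elman-style simple RNN
    "LSTM": layers.LSTM,
    "GRU": layers.GRU,
}

def make_rnn_layer(variant, units):
    """Return the chosen recurrent layer; IRNN is modeled here as a SimpleRNN
    with ReLU activation and identity recurrent initialization (Le et al., 2015)."""
    if variant == "IRNN":
        return layers.SimpleRNN(units, activation="relu",
                                recurrent_initializer=tf.keras.initializers.Identity(),
                                return_sequences=True)
    return RNN_VARIANTS[variant](units, return_sequences=True)
```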

Different variations of RNN differ in the way they manipulate the memory. In terms of testing results, SRN and LSTM are the two better choices. GRU and IRNN tend to overfit, causing their testing performance to drop. Between SRN and LSTM, SRN tends to have a slightly larger discrepancy between training and testing performance. We also observe that many data sets perform better with the same RNN variation across cost functions, which indicates that different data sets may require different forms of memory manipulation.

5 Conclusion

Classic multi-label classification (MLC) algorithms predict labels as a sequence to model the label correlation. However, these approaches face the problem of ordering the labels in the sequence. In this paper, we reformulate the sequence prediction problem to avoid this issue. By mimicking the human rethinking process, we propose a novel cost-sensitive multi-label classification (CSMLC) algorithm called RethinkNet. RethinkNet takes the process of gradually polishing its prediction as the sequence to predict. We adopt the recurrent neural network (RNN) to predict the sequence, and the memory in the RNN is used to store the label correlation information. In addition, we modify the loss function to take in the cost information, thereby making RethinkNet cost-sensitive. Extensive experiments demonstrate that RethinkNet is able to outperform other MLC and CSMLC algorithms on general data sets. On the image data set, RethinkNet also exceeds state-of-the-art image labeling algorithms in performance. The results suggest that RethinkNet is a promising algorithm for solving CSMLC with neural networks.

References

  • Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • Bar [2004] Moshe Bar. Visual objects in context. Nature reviews. Neuroscience, 5(8):617, 2004.
  • Chen et al. [2017] Shang-Fu Chen, Yi-Chen Chen, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. Order-free RNN with visual attention for multi-label classification. arXiv preprint arXiv:1707.05495, 2017.
  • Cheng et al. [2010] Weiwei Cheng, Eyke Hüllermeier, and Krzysztof J Dembczynski. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, 2010.
  • Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • Chollet [2015] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
  • Dozat [2016] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
  • Elman [1990] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hochreiter et al. [2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2001.
  • Huang and Zhou [2012] Sheng-Jun Huang and Zhi-Hua Zhou. Multi-label learning by exploiting label correlations locally. In AAAI, 2012.
  • Jordan [1997] Michael I Jordan. Serial order: A parallel distributed processing approach. Advances in psychology, 121:471–495, 1997.
  • Le et al. [2015] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
  • Li and Lin [2014] Chun-Liang Li and Hsuan-Tien Lin. Condensed filter tree for cost-sensitive multi-label classification. In ICML, 2014.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Liu and Tsang [2015] Weiwei Liu and Ivor Tsang. On the optimality of classifier chain for multi-label classification. In NIPS, 2015.
  • Read et al. [2009] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning and Knowledge Discovery in Databases, pages 254–269, 2009.
  • Trohidis et al. [2008] Konstantinos Trohidis, Grigorios Tsoumakas, George Kalliris, and Ioannis P. Vlahavas. Multi-label classification of music into emotions. In ISMIR, 2008.
  • Tsoumakas et al. [2009] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667–685. 2009.
  • Tsoumakas et al. [2011] Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, and Ioannis Vlahavas. Mulan: A java library for multi-label learning. Journal of Machine Learning Research, 12:2411–2414, 2011.
  • Wang et al. [2016] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. Cnn-rnn: A unified framework for multi-label image classification. In CVPR, 2016.