Single Model Ensemble using Pseudo-Tags and Distinct Vectors

05/02/2020 ∙ Ryosuke Kuwabara, et al. ∙ Tohoku University / The University of Tokyo

Model ensemble techniques often increase task performance in neural networks; however, they require increased time, memory, and management effort. In this study, we propose a novel method that replicates the effects of a model ensemble with a single model. Our approach creates K-virtual models within a single parameter space using K-distinct pseudo-tags and K-distinct vectors. Experiments on text classification and sequence labeling tasks on several datasets demonstrate that our method emulates or outperforms a traditional model ensemble with 1/K-times fewer parameters.


1 Introduction

A model ensemble is a promising technique for increasing the performance of neural network models Lars. and Peter. (1990); Anders and Jesper (1994). This method combines the outputs of multiple models that are individually trained on the same training data. Recent submissions to natural language processing (NLP) competitions are primarily composed of neural network ensembles Bojar et al. (2018); Barrault et al. (2019). Despite its effectiveness, a model ensemble is costly. Because it handles multiple models, it requires increased time for training and inference, increased memory, and greater management effort. Therefore, the model ensemble technique cannot always be applied to real systems, as many systems, such as edge devices, must work with limited computational resources.

Figure 1: Overview of our proposed method. A single model processes the same input with distinct pseudo-tags. Each pseudo-tag defines the k-th virtual model, and the corresponding vector is added to the embedding. Thus, the model function of a single model generates K different outputs.

In this study, we propose a novel method that replicates the effects of the ensemble technique with a single model. Following the principle that aggregating multiple models improves performance, we create multiple virtual models in a shared space. Our method virtually inflates the training data K times with K-distinct pseudo-tags appended to all input data. It also incorporates K-distinct vectors, each of which corresponds to a pseudo-tag. Each pseudo-tag is attached to the beginning of the input sentence, and the k-th vector is added to the embedding vectors of all tokens in the input sentence. Fig. 1 presents a brief overview of our proposed method. Intuitively, this operation allows the model to shift the embedding of the same data to the k-th designated subspace and can be interpreted as explicitly creating K virtual models in a shared space. We thus expect to obtain the same (or similar) effects as an ensemble of K models with our K virtual models generated from a single model.

Experiments in text classification and sequence labeling tasks reveal that our method outperforms single models in all settings with the same parameter size. Moreover, our technique emulates or surpasses the normal ensemble with 1/K-times fewer parameters on several datasets.

2 Related Work

The neural network ensemble is a widely studied method Lars. and Peter. (1990); Anders and Jesper (1994); Hashem (1994); Opitz and Shavlik (1996); however, these studies have focused mainly on improving performance while ignoring costs such as computation time, memory space, and management effort.

Several methods have been proposed to overcome the shortcomings of traditional ensemble techniques. For training, Snapshot Ensembles Huang et al. (2017) construct multiple models from a single training process by converging into multiple local minima along the optimization path. For inference, distillation Hinton et al. (2015) transfers the knowledge of an ensemble model into a single model. These methods still use multiple models during either training or inference, which only partially eliminates the negative effects of the traditional ensemble.

The incorporation of pseudo-tags is a standard technique widely used in the NLP community Rico et al. (2016); Melvin et al. (2017). However, to the best of our knowledge, our approach is the first attempt to incorporate pseudo-tags as identification markers of virtual models within a single model.

The approach most similar to ours is dropout Srivastava et al. (2014), which stochastically omits each hidden unit during each mini-batch and utilizes all units for inference. Huang et al. (2017) interpreted this technique as implicitly using an exponential number of virtual models within the same network. In contrast to dropout, our method explicitly utilizes virtual models with shared parameters and is, as discussed in Section 5, complementary to dropout.

3 Base Encoder Model

The target tasks of this study are text classification and sequence labeling. The input is a sequence of tokens (i.e., a sentence). Here, $x_i$ denotes the one-hot vector of the $i$-th token in the input. Let $E \in \mathbb{R}^{D \times |\mathcal{V}|}$ be the embedding matrix, where $D$ is the dimension of the embedding vectors and $\mathcal{V}$ is the vocabulary of the input.

We obtain the embedding vector at position $i$ by $e_i = E x_i$. Here, we introduce the notation $(e_i)_{i=1}^{I}$ to represent the list of vectors that corresponds to the input sentence, where $I$ is the number of tokens in the input. Given $(e_i)_{i=1}^{I}$, the feature (or hidden) vectors $h_i \in \mathbb{R}^{H}$ for all $i$ are computed by an encoder neural network $\mathrm{Enc}(\cdot)$, where $H$ denotes the dimension of the feature vectors. Namely,

$(h_i)_{i=1}^{I} = \mathrm{Enc}\big((e_i)_{i=1}^{I}\big).$  (1)

Finally, the output $\hat{y}$ given input $(e_i)_{i=1}^{I}$ is estimated as $\hat{y} = f\big((h_i)_{i=1}^{I}\big)$, where $f$ represents the task-dependent function (e.g., a softmax function for text classification and a conditional random field layer for sequence labeling). It should be noted that the form of the output $\hat{y}$ differs depending on the target task.
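To make this formulation concrete, the following PyTorch sketch instantiates the embedding lookup $E$, an encoder $\mathrm{Enc}(\cdot)$, and a task-dependent head $f$ for text classification. The paper releases no code, so all names are ours, and the mean pooling over feature vectors is an assumption, since the paper does not specify the classification head.

```python
import torch
import torch.nn as nn

class BaseEncoderModel(nn.Module):
    """Sketch of Section 3: e_i = E x_i, (h_i) = Enc((e_i)), y = f((h_i))."""

    def __init__(self, vocab_size: int, emb_dim: int, num_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)          # E
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)   # Enc(.)
        self.head = nn.Linear(emb_dim, num_classes)                 # f (softmax head)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer ids rather than one-hot vectors
        e = self.embedding(token_ids)          # (batch, seq_len, D)
        h = self.encoder(e.transpose(0, 1))    # encoder expects (seq_len, batch, D)
        h = h.mean(dim=0)                      # pool feature vectors (our assumption)
        return self.head(h)                    # logits; softmax is applied in the loss
```

The six layers and eight attention heads mirror the hyper-parameters of Table 5; everything else is a minimal stand-in.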

4 Single Model Ensemble using Pseudo-Tags and Distinct Vectors

In this section, we introduce the proposed method, which we refer to as SingleEns. Fig. 1 presents an overview of the method. The main principle of this approach is to create K different virtual models within a single model.

We incorporate pseudo-tags and predefined distinct vectors. For the pseudo-tags, we add $K$ special tokens $\langle \mathrm{tag}_1 \rangle, \dots, \langle \mathrm{tag}_K \rangle$ to the input vocabulary, where the hyper-parameter $K$ represents the number of virtual models. For the predefined distinct vectors, we leverage $K$ mutually orthogonal vectors $o_1, \dots, o_K$, where the orthogonality condition requires that $o_k \cdot o_{k'} = 0$ for all pairs $(k, k')$ with $k \neq k'$.

Finally, we assume that every input sentence starts with one of the pseudo-tags. We then add the orthogonal vector $o_k$ that corresponds to the attached pseudo-tag to the embedding vectors at all positions. The new embedding vector $e'_i$ is written in the following form:

$e'_i = e_i + o_k.$  (2)

In the proposed method, we substitute $(e_i)_{i=1}^{I}$ in Eq. 1 with $(e'_i)_{i=1}^{I}$.

An intuitive explanation of the role of the pseudo-tags is that they allow a single model to explicitly recognize differences in otherwise homogeneous input, while the purpose of the orthogonal vectors is to linearly shift the embeddings in the designated direction of each virtual model. By combining these elements, we believe that we can define $K$ virtual models within a single model and effectively use a local region of the space for each virtual model. Aggregating these virtual models can then imitate an ensemble of $K$ models.
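A minimal sketch of this construction, assuming the pseudo-tags occupy the last $K$ ids of an extended vocabulary and the distinct vectors are rows of a fixed orthogonal matrix (class and argument names are ours):

```python
import torch
import torch.nn as nn

class SingleEnsEmbedding(nn.Module):
    """Sketch of Eq. 2: e'_i = e_i + o_k for every position i of a sentence
    whose prepended pseudo-tag selects virtual model k."""

    def __init__(self, vocab_size: int, emb_dim: int, K: int):
        super().__init__()
        # Extend the vocabulary with K pseudo-tag tokens
        # (ids vocab_size .. vocab_size + K - 1).
        self.embedding = nn.Embedding(vocab_size + K, emb_dim)
        # K mutually orthogonal distinct vectors o_1..o_K, kept untrainable.
        vectors = torch.empty(K, emb_dim)
        nn.init.orthogonal_(vectors)
        self.register_buffer("distinct_vectors", vectors)

    def forward(self, token_ids: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) with the pseudo-tag already prepended;
        # k: (batch,) index of the virtual model assigned to each sentence.
        e = self.embedding(token_ids)                  # (batch, seq_len, D)
        o_k = self.distinct_vectors[k].unsqueeze(1)    # (batch, 1, D)
        return e + o_k                                 # shift every position by o_k
```

Since `nn.init.orthogonal_` yields orthonormal rows whenever $K \le D$, the buffer satisfies the orthogonality condition above, and registering it as a buffer keeps it fixed during training.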

Dataset  Model      Method     # params  Accuracy
IMDB     Tfm:GloVe  Single     12 M      87.03
IMDB     Tfm:GloVe  1/K Ens    14 M      81.93
IMDB     Tfm:GloVe  SingleEns  12 M      87.30
IMDB     Tfm:GloVe  NormalEns  108 M     87.67
IMDB     Tfm:BERT   Single     400 M     91.99
IMDB     Tfm:BERT   1/K Ens    1000 M    90.63
IMDB     Tfm:BERT   SingleEns  400 M     92.91
IMDB     Tfm:BERT   NormalEns  3600 M    92.75
Rotten   Tfm:BERT   Single     400 M     81.75
Rotten   Tfm:BERT   1/K Ens    1000 M    82.67
Rotten   Tfm:BERT   SingleEns  400 M     85.01
Rotten   Tfm:BERT   NormalEns  3600 M    82.57
RCV1     Tfm:BERT   Single     400 M     87.18
RCV1     Tfm:BERT   1/K Ens    1000 M    80.27
RCV1     Tfm:BERT   SingleEns  400 M     89.16
RCV1     Tfm:BERT   NormalEns  3600 M    90.01
Table 1: Test accuracy and parameter size for text classification tasks. Our method, SingleEns, outperformed Single and 1/K Ens on all datasets. Most notably, SingleEns surpassed NormalEns on IMDB and Rotten with 1/9 of the parameters.

5 Experiments

To evaluate the effectiveness of our method, we conducted experiments on two tasks: text classification and sequence labeling. We used the IMDB Andrew et al. (2011), Rotten Bo and Lillian (2005), and RCV1 Yiming et al. (2004) datasets for text classification, and the CoNLL-2003 Sang and Meulder (2003) and CoNLL-2000 datasets Sang and Sabine (2000) for sequence labeling.

We used the Transformer model Vaswani et al. (2017) as the base model for all experiments, and its token vector representations were augmented with the pretrained vectors of GloVe Jeffrey et al. (2014), BERT Devlin et al. (2018), or ELMo Matthew et al. (2018). These models are referred to as Tfm:GloVe, Tfm:BERT, and Tfm:ELMo, respectively (see Appendix A for detailed experimental settings). For Tfm:BERT, we incorporated the feature (or hidden) vectors of the final layer of the BERT model as the embedding vectors while adopting the drop-net technique Zhu et al. (2020). All models have dropout layers, which allows us to assess the complementarity of our method and dropout.

We compared our method (SingleEns) to a single model (Single), a normal ensemble (NormalEns), and a normal ensemble in which each component has approximately 1/K-times the parameters (1/K Ens); because BERT requires a fixed number of parameters, we did not reduce the parameters exactly for 1/K Tfm:BERT (see Appendix A for detailed experimental settings). Although other ensemble-like methods discussed in Section 2 could have been compared (e.g., snapshot ensemble, knowledge distillation, or dropout applied at test time to generate and aggregate predictions), they are imitations of a normal ensemble, and we assumed that the results of a normal ensemble were an upper bound. We used K = 9 for reporting the primary results of NormalEns, 1/K Ens, and SingleEns. We thus prepared nine pseudo-tags, which were trained and initialized in the same manner as the other embeddings. We created untrainable distinct vectors using the implementation of Saxe et al. (2013) provided by PyTorch's default function, torch.nn.init.orthogonal. We empirically set the scale of the distinct vectors to 1, selected from the candidates 1, 3, 5, 10, 30, 50, 100, and the scale closest to the norm of the model's embedding vectors. We obtained the final predictions of the ensemble models by averaging the outputs of the individual models for text classification and by voting for sequence labeling. All results were averaged over five distinct runs with different random seeds.
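As a rough sketch of this setup (tensor shapes, the `ensemble_predict` helper, and its calling convention are our assumptions), the distinct vectors can be generated once with PyTorch's orthogonal initializer (the in-place variant of the torch.nn.init.orthogonal function named above), and text-classification predictions averaged over the K virtual models:

```python
import torch

K, D = 9, 768        # nine virtual models; D matches the embedding dimension
scale = 1.0          # chosen empirically from {1, 3, 5, 10, 30, 50, 100, ...}

distinct_vectors = torch.empty(K, D)
torch.nn.init.orthogonal_(distinct_vectors)   # Saxe et al. (2013) initialization
distinct_vectors = distinct_vectors * scale   # rescaled, untrainable vectors

def ensemble_predict(model, tagged_inputs):
    """Average class probabilities over K copies of one sentence, each
    prepended with a different pseudo-tag (voting is used for labeling)."""
    probs = [model(x).softmax(dim=-1) for x in tagged_inputs]   # K predictions
    return torch.stack(probs).mean(dim=0)                       # averaged output
```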

Dataset     Model     Method     # params  F1 Score
CoNLL-2003  Tfm:ELMo  Single     100 M     91.93
CoNLL-2003  Tfm:ELMo  1/K Ens    150 M     91.65
CoNLL-2003  Tfm:ELMo  SingleEns  100 M     92.37
CoNLL-2003  Tfm:ELMo  NormalEns  900 M     92.86
CoNLL-2000  Tfm:ELMo  Single     100 M     96.42
CoNLL-2000  Tfm:ELMo  1/K Ens    150 M     95.67
CoNLL-2000  Tfm:ELMo  SingleEns  100 M     96.56
CoNLL-2000  Tfm:ELMo  NormalEns  900 M     96.67
Table 2: Test F1 score and parameter size for sequence labeling tasks. Similarly to NormalEns, SingleEns improved the score even at high performance levels.

5.1 Evaluation of text classification

Data

We followed the settings used in the implementation by Kiyono et al. (2018) for the data partition (see Appendix B for data statistics). Our method, SingleEns, inflates the training data K times. During the inflation, the k-th subset is sampled by bootstrapping Efron and Tibshirani (1993) and marked with the corresponding k-th pseudo-tag. For NormalEns and 1/K Ens, we attempted both bootstrapping and normal sampling and report the higher score.
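A minimal sketch of this inflation step, assuming sentences are token lists; the helper name and the tag string format are hypothetical:

```python
import random

def inflate_with_bootstrap(sentences, labels, K=9, seed=0):
    """Inflate the training data K times: the k-th subset is a bootstrap
    sample (drawn with replacement) tagged with the k-th pseudo-tag."""
    rng = random.Random(seed)
    n = len(sentences)
    inflated = []
    for k in range(K):
        tag = f"<tag_{k}>"             # pseudo-tag token of virtual model k
        for _ in range(n):
            j = rng.randrange(n)       # bootstrap: sample with replacement
            inflated.append(([tag] + sentences[j], labels[j]))
    return inflated
```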

Results

Table 1 presents the overall results evaluated in terms of accuracy. For both Tfm:GloVe and Tfm:BERT, SingleEns outperformed Single with the same parameter size. In our experiments, SingleEns achieved the best scores on IMDB and Rotten with Tfm:BERT; it recorded 92.91 and 85.01, exceeding NormalEns by 0.16 and 2.44, respectively, with 89% fewer parameters. The standard deviation of the results on the IMDB dataset was 0.69 and 0.14 for Single and SingleEns, respectively, for Tfm:GloVe, and 0.34 and 0.11, respectively, for Tfm:BERT. These results support the claim that explicit operations for defining virtual models have a significant effect for a single model and are complementary to normal dropout. Through the series of experiments, we observed that the number of iterations of SingleEns was 1.0 to 1.5 times greater than that of Single.

Figure 2: Accuracy depending on the number of models for each ensemble method on the Rotten dataset.
Figure 3: F1 score depending on the number of models for each ensemble method on CoNLL-2003.

5.2 Evaluation of sequence labeling

Data

We followed the task settings of CoNLL-2000 and CoNLL-2003 (the statistics of the datasets are presented in Appendix B). We inflated the training data nine times for SingleEns, and normal sampling was used for NormalEns and 1/K Ens; because bootstrapping was not effective for this task, those results are omitted.

Results

As displayed in Table 2, SingleEns surpassed Single by 0.44 and 0.14 points on CoNLL-2003 and CoNLL-2000, respectively, for Tfm:ELMo with the same parameter size. However, NormalEns produced the best results in this setting. The standard deviations of the single model and our method were 0.08 and 0.05, respectively, on CoNLL-2000. As in text classification, the number of iterations of SingleEns was 1.0 to 1.5 times greater than that of Single.

Setting                     IMDB Accuracy  CoNLL-2003 F1
Single                      91.99          91.93
1) Only pseudo-tags         89.84          92.20
2) Random distinct vectors  92.06          92.21
3) Random noise             92.38          92.32
SingleEns                   92.91          92.37
Table 3: Comparison of the proposed method (pseudo-tags + corresponding distinct vectors) with other settings. Pseudo-tags and distinct vectors appear to complement each other.
Setting             IMDB Accuracy  CoNLL-2003 F1
Single              91.99          91.93
1) Emb (SingleEns)  92.91          92.37
2) Hidden           90.68          92.45
1) + 2)             92.64          92.19
Table 4: Test metrics on IMDB and CoNLL-2003 for three patterns of vector addition. Adding distinct vectors only to the embeddings is the best or second-best approach.

6 Analysis

In this section, we investigate the properties of our proposed method. Unless otherwise specified, we use Tfm:BERT and Tfm:ELMo on IMDB and CoNLL-2003 for the analysis.

Significance of pseudo-tags and distinct vectors

To assess the significance of using both pseudo-tags and distinct vectors, we conducted an ablation study of our method, SingleEns. We compared our method with the following three settings: 1) Only pseudo-tags, 2) Random distinct vectors, and 3) Random noise. The first setting (Only pseudo-tags) attaches the pseudo-tags to the input without adding the corresponding distinct vectors. The second setting (Random distinct vectors) randomly shuffles the correspondence between the distinct vectors and pseudo-tags at every iteration during training. The third setting (Random noise) adds random vectors in place of the distinct vectors, to clarify whether the effect of incorporating distinct vectors is essentially identical to random noise injection or stems from the explicit definition of virtual models in a single model.

Table 3 shows the results of the ablation study. Using both pseudo-tags and distinct vectors, which matches the setting of SingleEns, leads to the best performance, while the effect is limited or negative when pseudo-tags are used alone or when the distinct vectors and pseudo-tags lack a fixed correspondence. This observation indicates that the increase in performance can be attributed to the combined use of pseudo-tags and distinct vectors, and not merely to data augmentation.

We can also observe from Table 3 that the performance of SingleEns was higher than that of 3) Random noise. Note that SingleEns adds a small, fixed set of vectors, whereas Random noise adds a large number of different vectors. Therefore, this observation supports our claim that the explicit definition of virtual models through distinct vectors has substantial positive effects that are largely independent of the effect of random noise. This observation also supports the assumption that SingleEns is complementary to dropout. Dropout randomly uses sub-networks by stochastically omitting each hidden unit, which can be interpreted as a variant of Random noise; moreover, it has no specific operation that defines an explicitly prepared number of virtual models, as SingleEns has. We conjecture that this difference yields the complementarity that allows our proposed method and dropout to co-exist.

Vector addition

We investigated the patterns with which the distinct vectors should be added: 1) Emb, 2) Hidden, and 3) Emb + Hidden. Emb adds the distinct vectors only to the embedding, while Hidden adds them only to the final feature vectors; Emb + Hidden adds them to both. As illustrated in Table 4, adding the vectors to the embedding is sufficient for improving performance, while adding them to the hidden vectors has an adverse effect. This observation can be explained by the architecture of the Transformer: because the Transformer employs residual connections He et al. (2015), the distinct vectors in the embedding are recursively propagated through the entire network without being absorbed as non-essential information.

Comparison with normal ensembles

To evaluate the behavior of our method, we examined the relationship between performance and the number of models used for training. Our experiments revealed that using more than nine models did not yield significant performance improvement; thus, we only assessed results up to nine models. Figs. 2 and 3 present the metrics on Rotten and CoNLL-2003, respectively. The performance of our method increased with the number of models, which is a general feature of normal ensembles. Notably, on Rotten, the accuracy of our method rose while that of the other methods did not. Investigation of this behavior is left for future work.

7 Conclusion

In this paper, we proposed a single-model ensemble technique called SingleEns. The principle of SingleEns is to explicitly create multiple virtual models within a single model. Our experiments demonstrated that the proposed method outperformed single models in both text classification and sequence labeling tasks. Moreover, our method with Tfm:BERT surpassed the normal ensemble on the IMDB and Rotten datasets while requiring only 1/9 of the parameters. The results thus indicate that explicitly creating virtual models within a single model improves performance. The proposed method is not limited to the two aforementioned tasks; it can be applied to other NLP tasks, such as machine translation, as well as to tasks in other fields, such as image recognition. Further theoretical analysis can also be performed to elucidate the mechanisms of the proposed method.

Acknowledgment

The research results were achieved by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation", the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan. This work was partly supported by JSPS KAKENHI Grant Number 19H04162. We would like to thank Motoki Sato of Preferred Networks and Shun Kiyono of RIKEN for cooperating in preparing the experimental data. We would also like to thank the three anonymous reviewers for their insightful comments.

References

  • K. Anders and V. Jesper (1994) Neural network ensembles, cross validation and active learning. In Proceedings of the 7th International Conference on Neural Information Processing Systems (NeurIPS), pp. 231–238.
  • M. L. Andrew, D. E. Raymond, P. T. Peter, H. Dan, N. Y. Andrew, and P. Christopher (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), pp. 142–150.
  • L. Barrault et al. (2019) Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation: Shared Task Papers (WMT), pp. 1–61.
  • P. Bo and L. Lillian (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 115–124.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, P. Koehn, and C. Monz (2018) Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers (WMT), pp. 272–303.
  • J. Devlin, M. W. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • B. Efron and R. J. Tibshirani (1993) An introduction to the bootstrap. Monographs on Statistics and Applied Probability, Springer.
  • S. Hashem (1994) Optimal linear combinations of neural networks. Neural Networks 10 (4), pp. 599–614.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger (2017) Snapshot ensembles: train 1, get M for free. CoRR abs/1704.00109.
  • P. Jeffrey, S. Richard, and M. Christopher (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • S. Kiyono, J. Suzuki, and K. Inui (2018) Mixture of expert/imitator networks: scalable semi-supervised learning framework. CoRR abs/1810.05788.
  • H. K. Lars. and S. Peter. (1990) Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., pp. 993–1001.
  • P. Matthew, N. Mark, I. Mohit, G. Matt, C. Christopher, L. Kenton, and Z. Luke (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).
  • J. Melvin, S. Mike, L. V. Quoc, K. Maxim, W. Yonghui, C. Zhifeng, T. Nikhil, V. Fernanda, W. Martin, C. Greg, H. Macduff, and D. Jeffrey (2017) Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
  • D. W. Opitz and J. W. Shavlik (1996) Actively searching for an effective neural network ensemble. Connection Science 8, pp. 337–354.
  • S. Rico, H. Barry, and B. Alexandra (2016) Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pp. 35–40.
  • E. TK. Sang and F. D. Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL), pp. 142–147.
  • E. TK. Sang and B. Sabine (2000) Introduction to the CoNLL-2000 shared task: chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop.
  • A. M. Saxe, J. L. McClelland, and S. Ganguli (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.
  • L. D. D. Y. Yiming, R. G. Tony, and L. Fan (2004) RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5, pp. 361–397.
  • J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu (2020) Incorporating BERT into neural machine translation. In International Conference on Learning Representations (ICLR).

Appendix A Hyper-parameters and Ensemble Strategy

                           Text Classification         Sequence Labeling
                           Tfm:GloVe     Tfm:BERT      Tfm:ELMo
Embedding dimension        200           768           256
Hidden dimension           200           768           256
Number of layers           6             6             6
Number of attention heads  8             8             8
Frozen vectors             GloVe 200     BERT-Large    ELMo 1024
Dropout (Emb)              -             -             0.5
Dropout (Residual)         0.2           0.5           0.2
Dropout (Attention)        0.1           -             0.1
Dropout (FF)               -             -             0.1
Label smoothing            0.1           0.1           -
Optimizer                  Adam          Adam          Adam
Initial learning rate      0.0001        0.0001        0.0001
Batch size                 64            128           32
Gradient clipping          1.0           1.0           5.0
Aggregation strategy       Averaging     Averaging     Voting
Sampling strategy          Normal / Bootstrapping   Normal / Bootstrapping   Normal
Table 5: Hyper-parameters and ensemble strategies for Single, NormalEns, and SingleEns. For Tfm:BERT, we followed the model architecture of Zhu et al. (2020). For Tfm:ELMo on sequence labeling, we referenced the architecture of Matthew et al. (2018), replacing the encoder with a Transformer. It should be noted that for Tfm:ELMo, we add a Linear-ReLU-LayerNorm block between the embedding and self-attention layers.

                           Text Classification         Sequence Labeling
                           Tfm:GloVe     Tfm:BERT      Tfm:ELMo
Embedding dimension        50            64            370
Hidden dimension           50            64            128
Frozen vectors             GloVe 50      BERT-Base     ELMo 256
Number of layers           3             3             4
Number of attention heads  10            8             8
Feed-forward dimension     128           128           128
Aggregation strategy       Averaging     Averaging     Voting
Sampling strategy          Normal        Normal        Normal
Table 6: Hyper-parameters and ensemble strategies for 1/K Ens. The other values are the same as in Table 5. It should be noted that we ensemble the K sub-models for the final prediction.

Appendix B Data Statistics

Task                 Dataset     Train   Valid  Test
Text Classification  IMDB        21,246  3,754  25,000
                     Rotten      8,636   960    1,066
                     RCV1        14,007  1,557  49,838
Sequence Labeling    CoNLL-2003  14,987  3,466  3,684
                     CoNLL-2000  8,926   2,012  2,012
Table 7: Summary of the datasets. The values are the number of sentences contained in each dataset.