This method combines the outputs of multiple models that are individually trained on the same training data. Recent submissions to natural language processing (NLP) competitions are primarily composed of neural network ensembles Bojar et al. (2018); Barrault et al. (2019). Despite its effectiveness, a model ensemble is costly. Because it handles multiple models, it requires increased time for training and inference, increased memory, and greater management effort. Therefore, the model ensemble technique cannot always be applied to real systems, as many systems, such as edge devices, must work with limited computational resources.
In this study, we propose a novel method that replicates the effects of the ensemble technique with a single model. Following the principle that aggregating multiple models improves performance, we create multiple virtual models in a shared space. Our method virtually inflates the training data $K$ times with $K$ distinct pseudo-tags appended to all input data. It also incorporates $K$ distinct vectors, which correspond one-to-one to the pseudo-tags. Each pseudo-tag is attached to the beginning of the input sentence, and the $k$-th vector is added to the embedding vectors of all tokens in the input sentence. Fig. 1 presents a brief overview of our proposed method. Intuitively, this operation allows the model to shift the embedding of the same data to the $k$-th designated subspace and can be interpreted as explicitly creating $K$ virtual models in a shared space. We thus expect to obtain the same (or similar) effects as an ensemble of $K$ models with our $K$ virtual models generated from a single model.
Experiments on text classification and sequence labeling tasks reveal that our method outperforms single models in all settings with the same parameter size. Moreover, our technique emulates or surpasses the normal ensemble with $K$-times fewer parameters on several datasets.
2 Related Work
The neural network ensemble is a widely studied method Lars and Peter (1990); Anders and Jesper (1994); Hashem (1994); Opitz and Shavlik (1996); however, studies have focused mainly on improving performance while ignoring costs such as computation time, memory space, and management effort.
Several methods have addressed the shortcomings of traditional ensemble techniques. For training, Huang et al. (2017) proposed Snapshot Ensembles, which use a single training run to construct multiple models by converging into multiple local minima along the optimization path. For inference, Hinton et al. (2015) proposed distillation, which transfers the knowledge of an ensemble into a single model. These methods still use multiple models during either training or inference, and thus only partially remove the drawbacks of the traditional ensemble.
The incorporation of pseudo-tags is a standard technique widely used in the NLP community Rico et al. (2016); Melvin et al. (2017). However, to the best of our knowledge, our approach is the first attempt to incorporate pseudo-tags as identification markers of virtual models within a single model.
The most similar approach to ours is dropout Srivastava et al. (2014), which stochastically omits each hidden unit during each mini-batch and uses all units for inference. Huang et al. (2017) interpreted this technique as implicitly using an exponential number of virtual models within the same network. As opposed to dropout, our method explicitly utilizes virtual models with a shared parameter, which is, as discussed in Section 5, complementary to dropout.
3 Base Encoder Model
The target tasks of this study are text classification and sequence labeling. The input is a sequence of tokens (i.e., a sentence). Here, $x_i$ denotes the one-hot vector of the $i$-th token in the input. Let $E \in \mathbb{R}^{D \times V}$ be the embedding matrix, where $D$ is the dimension of the embedding vectors and $V$ is the vocabulary size of the input.
We obtain the embedding vector at position $i$ by $e_i = E x_i$. Here, we introduce the notation $e_{1:I} = (e_1, \dots, e_I)$ to represent the list of vectors that correspond to the input sentence, where $I$ is the number of tokens in the input. Given $e_{1:I}$, the feature (or hidden) vectors $h_i \in \mathbb{R}^{H}$ for all $i$ are computed by an encoder neural network $f_{\mathrm{enc}}$, where $H$ denotes the dimension of the feature vectors. Namely,
$$h_{1:I} = f_{\mathrm{enc}}(e_{1:I}). \quad (1)$$
Finally, the output $y$ given input $e_{1:I}$ is estimated as
$$y = f_{\mathrm{out}}(h_{1:I}),$$
where $f_{\mathrm{out}}$ represents the task-dependent function (e.g., a softmax function for text classification and a conditional random field layer for sequence labeling). It should be noted that the form of the output $y$ differs depending on the target task.
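As a minimal illustration of the embedding lookup above, the following NumPy sketch shows a one-hot vector selecting a column of the embedding matrix (the sizes and values are illustrative, not from the paper):

```python
import numpy as np

# Toy embedding lookup e_i = E x_i: a one-hot token vector x selects
# one column of the embedding matrix E (shape D x V). Sizes are
# illustrative only.
D, V = 4, 6
E = np.arange(D * V, dtype=float).reshape(D, V)
x = np.zeros(V)
x[2] = 1.0            # one-hot vector for token id 2
e = E @ x             # the embedding of token 2, i.e., E[:, 2]
```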
4 Single Model Ensemble using Pseudo-Tags and Distinct Vectors
In this section, we introduce the proposed method, which we refer to as SingleEns. Fig. 1 presents an overview of the method. The main principle of this approach is to create different virtual models within a single model.
We incorporate pseudo-tags and predefined distinct vectors. For the pseudo-tags, we add $K$ special tokens to the input vocabulary, where the hyper-parameter $K$ represents the number of virtual models. For the predefined distinct vectors, we leverage $K$ mutually orthogonal vectors $m_1, \dots, m_K \in \mathbb{R}^{D}$, where the orthogonality condition requires satisfying $m_j^{\top} m_k = 0$ for all $j \neq k$.
Finally, we assume that every input sentence starts with one of the pseudo-tags. We then add the orthogonal vector that corresponds to the attached pseudo-tag to the embedding vectors at all positions. For the $k$-th pseudo-tag, the new embedding vector $e'_i$ is written in the following form:
$$e'_i = e_i + m_k.$$
We substitute $e_{1:I}$ in Eq. 1 by $e'_{1:I}$ in the proposed method.
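These two operations can be sketched in NumPy as follows; `make_distinct_vectors` and `shift_embeddings` are illustrative helper names (the paper's actual implementation is in PyTorch and uses `torch.nn.init.orthogonal`):

```python
import numpy as np

def make_distinct_vectors(K, D, scale=1.0, seed=0):
    """Return K mutually orthogonal D-dimensional row vectors via a QR
    decomposition (a stand-in for torch.nn.init.orthogonal)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((D, K)))  # requires D >= K
    return scale * q.T                                # shape (K, D)

def shift_embeddings(embeddings, k, distinct_vectors):
    """Add the k-th distinct vector to every token embedding,
    shifting the sentence toward the k-th virtual model's subspace."""
    return embeddings + distinct_vectors[k]
```

Prepending the $k$-th pseudo-tag token to the sentence and shifting its embeddings with the same $k$ is what ties each input to one specific virtual model.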
An intuitive explanation of the role of the pseudo-tags is that they allow a single model to explicitly recognize differences in homogeneous input, while the purpose of the orthogonal vectors is to linearly shift the embeddings in each virtual model's designated direction. By combining these elements, we can define virtual models within a single model and effectively use a local subspace for each virtual model. Aggregating these virtual models can then imitate an ensemble.
Table 1: Text classification results (accuracy, %).

Dataset | Model | Method | # params | Accuracy
IMDB | Tfm:GloVe | 1/K Ens | 14 M | 81.93
IMDB | Tfm:GloVe | SingleEns | 12 M | 87.30
IMDB | Tfm:GloVe | NormalEns | 108 M | 87.67
IMDB | Tfm:BERT | 1/K Ens | 1000 M | 90.63
IMDB | Tfm:BERT | SingleEns | 400 M | 92.91
IMDB | Tfm:BERT | NormalEns | 3600 M | 92.75
Rotten | Tfm:BERT | 1/K Ens | 1000 M | 82.67
Rotten | Tfm:BERT | SingleEns | 400 M | 85.01
Rotten | Tfm:BERT | NormalEns | 3600 M | 82.57
RCV1 | Tfm:BERT | 1/K Ens | 1000 M | 80.27
RCV1 | Tfm:BERT | SingleEns | 400 M | 89.16
RCV1 | Tfm:BERT | NormalEns | 3600 M | 90.01
5 Experiments

To evaluate the effectiveness of our method, we conducted experiments on two tasks: text classification and sequence labeling. We used the IMDB Andrew et al. (2011), Rotten Bo and Lillian (2005), and RCV1 Yiming et al. (2004) datasets for text classification, and the CoNLL-2003 Sang and Meulder (2003) and CoNLL-2000 Sang and Sabine (2000) datasets for sequence labeling.
We used the Transformer model Vaswani et al. (2017) as the base model for all experiments; its token representations were enhanced with pretrained vectors from GloVe Jeffrey et al. (2014), BERT Devlin et al. (2018), or ELMo Matthew et al. (2018). The resulting models are referred to as Tfm:GloVe, Tfm:BERT, and Tfm:ELMo, respectively (see Appendix A for detailed experimental settings). For Tfm:BERT, we incorporated the feature (or hidden) vectors of the final layer of the BERT model as the embedding vectors while adopting the drop-net technique Zhu et al. (2020). All models have dropout layers so that we can assess the complementarity of our method and dropout.
We compared our method (SingleEns) with a single model (Single), a normal ensemble (NormalEns), and a normal ensemble in which each component has approximately $1/K$ of the parameters (1/K Ens); see Appendix A for detailed experimental settings. (Because BERT requires a fixed number of parameters, we did not reduce the parameters exactly for 1/K Tfm:BERT.) Although other ensemble-like methods discussed in Section 2 could have been compared (e.g., snapshot ensemble, knowledge distillation, or dropout during testing to generate and aggregate predictions), they are imitations of a normal ensemble, and we regarded the results of a normal ensemble as an upper bound. We used $K = 9$ for reporting the primary results of NormalEns, 1/K Ens, and SingleEns. We thus prepared nine pseudo-tags, whose embeddings are trainable and initialized in the same manner as the other embeddings. We created untrainable distinct vectors using the implementation of Saxe et al. (2013) provided as PyTorch's default function torch.nn.init.orthogonal. We empirically selected the scaling factor for the distinct vectors from 1, 3, 5, 10, 30, 50, and 100, choosing the scale closest to the norm of the model's embedding vectors. We obtained the final predictions of the ensemble models by averaging the outputs of individual models for text classification and by voting for sequence labeling. The results were obtained by averaging five distinct runs with different random seeds.
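The two aggregation rules (averaging for classification, voting for labeling) can be sketched as follows; the function names are ours, not from the paper:

```python
import numpy as np
from collections import Counter

def average_outputs(prob_list):
    """Text classification: average the K models' class-probability
    vectors, then take the argmax as the final prediction."""
    return int(np.argmax(np.mean(prob_list, axis=0)))

def vote_outputs(tag_sequences):
    """Sequence labeling: per-position majority vote over the K
    models' predicted tag sequences."""
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*tag_sequences)]
```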
Table 2: Sequence labeling results (F1 score).

Dataset | Model | Method | # params | F1 Score
CoNLL-2003 | Tfm:ELMo | 1/K Ens | 150 M | 91.65
CoNLL-2003 | Tfm:ELMo | SingleEns | 100 M | 92.37
CoNLL-2003 | Tfm:ELMo | NormalEns | 900 M | 92.86
CoNLL-2000 | Tfm:ELMo | 1/K Ens | 150 M | 95.67
CoNLL-2000 | Tfm:ELMo | SingleEns | 100 M | 96.56
CoNLL-2000 | Tfm:ELMo | NormalEns | 900 M | 96.67
5.1 Evaluation of text classification
We followed the settings used in the implementation by Kiyono et al. (2018) for data partitioning (see Appendix B for data statistics). Our method, SingleEns, inflates the training data by $K$ times. During the inflation, the $k$-th subset is sampled by bootstrapping Efron and Tibshirani (1993), and the corresponding $k$-th pseudo-tag is attached. For NormalEns and 1/K Ens, we attempted both bootstrapping and normal sampling and report the higher score.
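A sketch of this inflation step, assuming sentences are represented as token lists and pseudo-tag tokens of the illustrative form `<ens-k>`:

```python
import random

def inflate_with_pseudo_tags(sentences, K, seed=0):
    """Inflate the training data K times: the k-th copy is a bootstrap
    sample (drawn with replacement) whose sentences are prefixed with
    the k-th pseudo-tag token."""
    rng = random.Random(seed)
    inflated = []
    for k in range(K):
        tag = f"<ens-{k}>"
        for _ in range(len(sentences)):
            inflated.append([tag] + rng.choice(sentences))
    return inflated
```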
Table 1 presents the overall results evaluated in terms of accuracy. For both Tfm:GloVe and Tfm:BERT, SingleEns outperformed Single with the same parameter size. In our experiments, SingleEns achieved the best scores on IMDB and Rotten with Tfm:BERT; it recorded 92.91 and 85.01, which were higher than NormalEns by 0.16 and 2.44, respectively, with 89% fewer parameters. The standard deviations of the results on the IMDB dataset were 0.69 and 0.14 for Single and SingleEns, respectively, for Tfm:GloVe, and 0.34 and 0.11, respectively, for Tfm:BERT. These results support the claim that explicit operations for defining virtual models have a significant effect on a single model and are complementary to normal dropout. Through the series of experiments, we observed that the number of training iterations of SingleEns was 1.0 to 1.5 times greater than that of Single.
5.2 Evaluation of sequence labeling
We followed the task settings of CoNLL-2000 and CoNLL-2003 (the statistics of the datasets are presented in Appendix B). We inflated the training data by nine times for SingleEns, and normal sampling was used for NormalEns and 1/K Ens; because bootstrapping was not effective for this task, those results are omitted.
As displayed in Table 2, SingleEns surpassed Single by 0.44 and 0.14 on CoNLL-2003 and CoNLL-2000, respectively, for Tfm:ELMo with the same parameter size. However, NormalEns produced the best results in this setting. The standard deviations of the single model and our method were 0.08 and 0.05, respectively, on CoNLL-2000. Through the series of experiments, we observed that the number of training iterations of SingleEns was 1.0 to 1.5 times greater than that of Single.
Table 3: Ablation results on IMDB (accuracy, Tfm:BERT) and CoNLL-2003 (F1, Tfm:ELMo).

Setting | IMDB | CoNLL-2003
1) Only pseudo-tags | 89.84 | 92.20
2) Random distinct vectors | 92.06 | 92.21
3) Random noise | 92.38 | 92.32

Table 4: Positions for adding distinct vectors (IMDB accuracy / CoNLL-2003 F1).

Position | IMDB | CoNLL-2003
1) Emb (SingleEns) | 92.91 | 92.37
1) + 2) Emb + Hidden | 92.64 | 92.19
6 Analysis

In this section, we investigate the properties of our proposed method. Unless otherwise specified, we use Tfm:BERT on IMDB and Tfm:ELMo on CoNLL-2003 for the analysis.
Significance of pseudo-tags and distinct vectors
To assess the significance of using both pseudo-tags and distinct vectors, we conducted an ablation study of our method, SingleEns. We compared our method with the following three settings: 1) Only pseudo-tags, 2) Random distinct vectors, and 3) Random noise. In detail, the first setting (Only pseudo-tags) attaches the pseudo-tags to the input without adding the corresponding distinct vectors. The second setting (Random distinct vectors) randomly shuffles the correspondence between the distinct vectors and pseudo-tags in every iteration during training. The third setting (Random noise) adds random vectors in place of the distinct vectors, to clarify whether the effect of incorporating distinct vectors is essentially identical to random noise injection or to the explicit definition of virtual models in a single model.
Table 3 shows the results of the ablation study. The table indicates that using both pseudo-tags and distinct vectors, which matches the setting of SingleEns, leads to the best performance, while the effect is limited or negative if we use pseudo-tags alone, or pseudo-tags and distinct vectors without a fixed correspondence. This observation indicates that the increase in performance can be attributed to the combinatorial use of pseudo-tags and distinct vectors, and not merely to data augmentation.
We can also observe from Table 3 that the performance of SingleEns was higher than that of 3) Random noise. Note that SingleEns adds a small, fixed set of vectors, whereas Random noise adds a large number of different vectors. This observation therefore supports our claim that the explicit definition of virtual models by distinct vectors has substantial positive effects that are mostly independent of the effect of random noise. It also supports the assumption that SingleEns is complementary to dropout. Dropout randomly uses sub-networks by stochastically omitting each hidden unit, which can be interpreted as a variant of Random noise; moreover, it has no specific operation for defining an explicitly prepared number of virtual models, as SingleEns has. We conjecture that this difference is why our proposed method and dropout can co-exist.
Position of distinct vectors

We investigated the positions at which distinct vectors should be added: 1) Emb, 2) Hidden, and 3) Emb + Hidden. Emb adds distinct vectors only to the embedding vectors, while Hidden adds them only to the final feature vectors; Emb + Hidden adds them to both. As illustrated in Table 4, adding the vectors to the embedding is sufficient for improving performance, while adding them to the hidden vectors has an adverse effect. This observation can be explained by the architecture of the Transformer: because the Transformer employs residual connections He et al. (2015), the distinct vectors added to the embedding are recursively propagated through the entire network without being absorbed as non-essential information.
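To illustrate the residual-connection argument, consider a toy block of the form x + f(x) (a simplification, not the actual Transformer layer): a constant shift added to the input largely survives in the block's output because the skip connection carries it through unchanged.

```python
import numpy as np

def residual_block(x, W):
    # Toy residual block x + f(x): the skip connection carries the
    # input, including any added distinct vector, straight through.
    return x + np.tanh(x @ W)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.01   # small weights for the demo
x = rng.standard_normal(4)
m = np.full(4, 0.5)                      # stand-in distinct vector
diff = residual_block(x + m, W) - residual_block(x, W)
# diff stays close to m: the shift propagates through the block,
# perturbed only by the small nonlinear term.
```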
Comparison with normal ensembles
To evaluate the behavior of our method, we examined the relationship between performance and the number of models used for training. Our experiments revealed that using more than nine models did not yield significant performance improvements; we therefore only report results for up to nine models. Figs. 2 and 3 present the metrics on Rotten and CoNLL-2003, respectively. The performance of our method increased with the number of models, which is a general property of normal ensembles. Notably, on Rotten, the accuracy of our method continued to rise while that of the other methods did not. Investigation of this behavior is left for future work.
7 Conclusion

In this paper, we proposed a single-model ensemble technique called SingleEns. The principle of SingleEns is to explicitly create multiple virtual models within a single model. Our experiments demonstrated that the proposed method outperformed single models in both text classification and sequence labeling tasks. Moreover, our method with Tfm:BERT surpassed the normal ensemble on the IMDB and Rotten datasets, while its parameter size was $K$-times smaller. The results thus indicate that explicitly creating virtual models within a single model improves performance. The proposed method is not limited to the two aforementioned tasks but can be applied to other tasks, such as machine translation and image recognition. Further theoretical analysis could also elucidate the mechanisms of the proposed method.
Acknowledgments

The research results were achieved by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation", the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan. The work was partly supported by JSPS KAKENHI Grant Number 19H04162. We would like to thank Motoki Sato of Preferred Networks and Shun Kiyono of RIKEN for cooperating in preparing the experimental data. We would also like to thank the three anonymous reviewers for their insightful comments.
References

- Neural network ensembles, cross validation and active learning. In Proceedings of the 7th International Conference on Neural Information Processing Systems (NeurIPS), pp. 231–238.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), pp. 142–150.
- Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation: Shared Task Papers (WMT), pp. 1–61.
- Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. pp. 115–124.
- Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers (WMT), pp. 272–303.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- An introduction to the bootstrap. Monographs on Statistics and Applied Probability, Springer.
- Optimal linear combinations of neural networks. Neural Networks 10 (4), pp. 599–614.
- Deep residual learning for image recognition. CoRR abs/1512.03385.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Snapshot ensembles: train 1, get M for free. CoRR abs/1704.00109.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Mixture of expert/imitator networks: scalable semi-supervised learning framework. CoRR abs/1810.05788.
- Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 993–1001.
- Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).
- Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
- Actively searching for an effective neural network ensemble. Connection Science 8, pp. 337–354.
- Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), San Diego, California, pp. 35–40.
- Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL), pp. 142–147.
- Introduction to the CoNLL-2000 shared task: chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop.
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
- Attention is all you need. CoRR abs/1706.03762.
- RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5, pp. 361–397.
- Incorporating BERT into neural machine translation. In International Conference on Learning Representations (ICLR).
Appendix A Hyper-parameters and Ensemble Strategy
Hyper-parameters of Single, SingleEns, and NormalEns:

Hyper-parameter | Text Classification (Tfm:GloVe) | Text Classification (Tfm:BERT) | Sequence Labeling (Tfm:ELMo)
Number of layers | 6 | 6 | 6
Number of attention heads | 8 | 8 | 8
Frozen vectors | GloVe 200 | BERT-Large | ELMo 1024
Dropout (residual) | 0.2 | 0.5 | 0.2
Dropout (attention) | 0.1 | - | 0.1
Initial learning rate | 0.0001 | 0.0001 | 0.0001
Sampling strategy | Normal / Bootstrapping | Normal / Bootstrapping | Normal

Hyper-parameters of the reduced models (1/K Ens):

Hyper-parameter | Text Classification (Tfm:GloVe) | Text Classification (Tfm:BERT) | Sequence Labeling (Tfm:ELMo)
Frozen vectors | GloVe 50 | BERT-Base | ELMo 256
Number of layers | 3 | 3 | 4
Number of attention heads | 10 | 8 | 8
Feed-forward dimension | 128 | 128 | 128
Appendix B Data Statistics