1 Introduction

Contextual word embeddings (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019) have been successfully applied to various NLP tasks, including named entity recognition, document classification, and textual entailment. The multilingual version of BERT, which is trained on Wikipedia articles from 100 languages and equipped with a 110,000-wordpiece shared vocabulary, has also demonstrated the ability to perform ‘zero-resource’ cross-lingual classification on the XNLI dataset (Conneau et al., 2018). Specifically, when multilingual BERT is finetuned for XNLI with English data alone, the model also gains the ability to handle the same task in other languages. We believe that this zero-resource transfer learning can be extended to other multilingual datasets.
In this work, we explore BERT’s zero-resource performance on the multilingual MLDoc classification and CoNLL 2002/2003 NER tasks. (‘BERT’ hereafter refers to multilingual BERT.) We find that the baseline zero-resource performance of BERT exceeds the results reported in other work, even though cross-lingual resources (e.g. parallel text, dictionaries, etc.) are not used during BERT pretraining or finetuning. We apply adversarial learning to further improve upon this baseline, achieving state-of-the-art zero-resource results.
There are many recent approaches to zero-resource cross-lingual classification and NER, including adversarial learning (Chen et al., 2019; Kim et al., 2017; Xie et al., 2018; Joty et al., 2017), the use of models pretrained on parallel text (Artetxe and Schwenk, 2018; Lu et al., 2018; Lample and Conneau, 2019), and self-training (Hajmohammadi et al., 2015). Due to the newness of the subject matter, the definition of ‘zero-resource’ varies somewhat from author to author. For the experiments that follow, ‘zero-resource’ means that, during model training, we do not use labels from non-English data, nor do we use human or machine-generated parallel text. Only labeled English text and unlabeled non-English text are used during training, and hyperparameters are selected using English evaluation sets.
Our contributions are the following:
We demonstrate that adding a language-adversarial task during the finetuning of multilingual BERT can significantly improve zero-resource cross-lingual transfer performance.
For both MLDoc classification and CoNLL NER, we find that, even without adversarial training, the baseline multilingual BERT performance can exceed previously published zero-resource results.
We show that adversarial techniques encourage BERT to align the representations of English documents and their translations. We speculate that this alignment causes the observed improvement in zero-resource performance.
2 Related Work
2.1 Adversarial Learning
Language-adversarial training (Zhang et al., 2017) was proposed for generating bilingual dictionaries without parallel data. This idea was extended to zero-resource cross-lingual tasks in NER (Kim et al., 2017; Xie et al., 2018) and text classification (Chen et al., 2019), where we would expect language-adversarial techniques to induce features that are language-independent.
2.2 Self-training Techniques
Self-training, where an initial model is used to generate labels on an unlabeled corpus for the purpose of domain or cross-lingual adaptation, was studied in the context of text classification (Hajmohammadi et al., 2015) and parsing (McClosky et al., 2006; Zeman and Resnik, 2008). A similar idea based on expectation-maximization, where the unobserved label is treated as a latent variable, has also been applied to cross-lingual text classification in Rigutini et al. (2005).
2.3 Translation as Pretraining
Artetxe and Schwenk (2018) and Lu et al. (2018) use the encoders from machine translation models as a starting point for task-specific finetuning, which permits various degrees of multilingual transfer. Lample and Conneau (2019) add an additional masked translation task to the BERT pretraining process, and the authors observed an improvement in the cross-lingual setting over using the monolingual masked text task alone.
3 Experiments

3.1 Model Training
We present an overview of the adversarial training process in Figure 1. We used the pretrained cased multilingual BERT model (https://github.com/google-research/bert/blob/master/multilingual.md) as the initialization for all of our experiments. Note that the BERT model has a hidden size of 768.
We always use the labeled English data of each corpus. We use the non-English text portion (without the labels) for the adversarial training.
We formulate the adversarial task as a binary classification problem (i.e. English versus non-English). We add a language discriminator module which uses the BERT embeddings to classify whether the input sentence was written in English or the non-English language. We also add a generator loss which encourages BERT to produce embeddings that are difficult for the discriminator to classify correctly. In this way, the BERT model learns to generate embeddings that do not contain language-specific information.
The pseudocode for our procedure can be found in Algorithm 1. In the description that follows, we use a batch size of 1 for clarity.
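The alternating procedure can be sketched as follows. A small linear encoder stands in for BERT, and all shapes, parameter names, and learning rates here are illustrative assumptions rather than the paper's actual settings; only the structure of the three sequential updates (task, generator, discriminator, at a 1:1:1 ratio with batch size 1) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a linear "encoder" replaces BERT, so the structure of
# the updates -- not the model -- is what is sketched here.
D_IN, D_EMB, K = 8, 4, 3                      # input dim, embedding dim, task classes
theta_B = rng.normal(0, 0.1, (D_EMB, D_IN))   # "BERT" (encoder) parameters
W_task = rng.normal(0, 0.1, (K, D_EMB))       # task-specific projection
w_disc = rng.normal(0, 0.1, D_EMB)            # discriminator projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x_en, y_task, x_other, lr=0.01):
    """One batch-size-1 step: task update, then generator, then discriminator."""
    global theta_B, W_task, w_disc

    # 1) Task loss on labeled English text: update encoder and task head.
    h = theta_B @ x_en
    p = softmax(W_task @ h)
    d_logits = p - np.eye(K)[y_task]          # grad of cross-entropy wrt logits
    W_task -= lr * np.outer(d_logits, h)
    theta_B -= lr * np.outer(W_task.T @ d_logits, x_en)

    # 2) Generator loss on unlabeled non-English text: update encoder only,
    #    pushing the discriminator toward the wrong answer (flipped label).
    h = theta_B @ x_other
    p_adv = sigmoid(w_disc @ h)
    d_z = p_adv - 1.0                         # pretend the true label is "English"
    theta_B -= lr * np.outer(d_z * w_disc, x_other)

    # 3) Discriminator loss: update discriminator only, with the true label
    #    (0 = non-English for this example).
    h = theta_B @ x_other
    p_adv = sigmoid(w_disc @ h)
    w_disc -= lr * (p_adv - 0.0) * h

train_step(rng.normal(size=D_IN), 1, rng.normal(size=D_IN))
```

Updating the encoder with the flipped-label generator loss, while the discriminator trains on the true labels, is what drives the embeddings toward language independence.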
For language-adversarial training on the classification task, we have 3 loss functions: the task-specific loss $L_T$, the generator loss $L_G$, and the discriminator loss $L_D$:

$$L_T(\theta_B, \theta_T) = -y^\top \log p, \quad p = \mathrm{softmax}(W h + b)$$

$$L_G(\theta_B) = -y_{adv} \log(1 - p_{adv}) - (1 - y_{adv}) \log p_{adv}$$

$$L_D(\theta_D) = -y_{adv} \log p_{adv} - (1 - y_{adv}) \log(1 - p_{adv}), \quad p_{adv} = \sigma(w_D^\top h + b_D)$$

where $K$ is the number of classes for the task, $p$ (dim: $K$) is the task-specific prediction, $p_{adv}$ (dim: scalar) is the probability that the input is in English, $h$ (dim: 768) is the mean-pooled BERT output embedding for the input word-pieces $x$, $\theta_B$ denotes the BERT parameters, $W$, $b$, $w_D$, $b_D$ (dim: $K \times 768$, $K$, $768$, scalar) are the output projections for the task-specific loss and discriminator respectively, $y$ (dim: $K$) is the one-hot vector representation for the task label, and $y_{adv}$ (dim: scalar) is the binary label for the adversarial task (i.e. 1 or 0 for English or non-English).
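As a concrete rendering of these losses, the sketch below computes all three for a single example. The parameter names, the 768-dimensional shapes, and the flipped-label form of the generator loss are our assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def losses(h, y_onehot, y_adv, W, b, w_d, b_d):
    """Task, generator, and discriminator losses for one example.

    h: mean-pooled BERT embedding (768,); y_onehot: task label (K,);
    y_adv: 1 if the input is English, else 0. Shapes are illustrative.
    """
    p = softmax(W @ h + b)              # task prediction (K,)
    p_adv = sigmoid(w_d @ h + b_d)      # P(input is English), scalar

    L_task = -np.sum(y_onehot * np.log(p))
    # Discriminator: binary cross-entropy with the true language label.
    L_disc = -y_adv * np.log(p_adv) - (1 - y_adv) * np.log(1 - p_adv)
    # Generator: the same form with the label flipped, so minimizing it
    # w.r.t. the BERT parameters makes the discriminator misclassify.
    L_gen = -y_adv * np.log(1 - p_adv) - (1 - y_adv) * np.log(p_adv)
    return L_task, L_gen, L_disc

rng = np.random.default_rng(0)
K, D = 4, 768
h = rng.normal(size=D)                  # stand-in for a real BERT embedding
L_t, L_g, L_d = losses(h, np.eye(K)[2], 0, rng.normal(0, 0.01, (K, D)),
                       np.zeros(K), rng.normal(0, 0.01, D), 0.0)
```

Note that for a non-English input ($y_{adv} = 0$), the generator and discriminator losses are $-\log p_{adv}$ and $-\log(1 - p_{adv})$ respectively, so the two objectives pull $p_{adv}$ in opposite directions.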
In the case of NER, the task-specific loss has an additional summation over the length of the sequence:

$$L_T(\theta_B, \theta_T) = -\frac{1}{n} \sum_{i=1}^{n} y_i^\top \log p_i, \quad p_i = \mathrm{softmax}(W h_i + b)$$

where $p_i$ (dim: $K$) is the prediction for the $i$-th word, $n$ is the number of words in the sentence, $y_i$ (dim: $K$) is the $i$-th row of the matrix of one-hot entity labels $Y$ (dim: $n \times K$), and $h_i$ (dim: 768) refers to the BERT embedding of the $i$-th word.
The generator and discriminator losses remain the same for NER, and we continue to use the mean-pooled BERT embedding during adversarial training.
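The per-token NER task loss can be sketched as follows, averaged over the $n$ words in the sentence; the toy embedding dimension and label set here are illustrative, not the paper's configuration:

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax for a (n, K) matrix of logits."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def ner_task_loss(H, Y, W, b):
    """Token-level cross-entropy, averaged over the sequence.

    H: (n, d) per-word BERT embeddings; Y: (n, K) one-hot entity labels;
    W: (K, d) and b: (K,) form the task projection. Shapes are illustrative.
    """
    P = softmax_rows(H @ W.T + b)                    # (n, K) per-token predictions
    return -np.mean(np.sum(Y * np.log(P), axis=1))   # average over the n words

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))          # 5 words, toy embedding dim 8
Y = np.eye(3)[[0, 1, 2, 1, 0]]       # toy entity labels, K = 3
loss = ner_task_loss(H, Y, np.zeros((3, 8)), np.zeros(3))  # zero weights -> uniform P
```

With zero projection weights every token prediction is uniform over the $K$ labels, so the loss reduces to $\log K$, which gives a quick sanity check on the implementation.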
We then take the gradients with respect to the 3 losses and the relevant parameter subsets: $\{\theta_B, \theta_T\}$ for the task-specific loss, $\{\theta_B\}$ for the generator loss, and $\{\theta_D\}$ for the discriminator loss. We apply the gradient updates sequentially at a 1:1:1 ratio.
During BERT finetuning, the learning rates for the task loss, generator loss and discriminator loss were kept constant; we do not apply a learning rate decay.
All hyperparameters were tuned on the English dev sets only, and we use the Adam optimizer in all experiments. We report results based on the average of 4 training runs.
Table 1: MLDoc classification accuracy (%) for zero-resource cross-lingual transfer.

| Model | en | de | es | fr | it | ja | ru | zh |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Schwenk and Li (2018) | 92.2 | 81.2 | 72.5 | 72.4 | 69.4 | 67.6 | 60.8 | 74.7 |
| Artetxe and Schwenk (2018) | 89.9 | 84.8 | 77.3 | 77.9 | 69.4 | 60.3 | 67.8 | 71.9 |
| BERT En-labels + Adv. | - | 88.1 | 80.8 | 85.7 | 72.3 | 76.8 | 77.4 | 84.7 |
3.2 MLDoc Classification Results
We finetuned BERT on the English portion of the MLDoc corpus (Schwenk and Li, 2018). The MLDoc task is a 4-class classification problem, where the data is a class-balanced subset of the Reuters News RCV1 and RCV2 datasets. We used the english.train.1000 dataset for the classification loss, which contains 1000 labeled documents. For language-adversarial training, we used the text portion of french.train.10000, etc. without the labels.
We used a learning rate of for the task loss, for the generator loss and for the discriminator loss.
In Table 1, we report the classification accuracy for all of the languages in MLDoc. Generally, adversarial training improves the accuracy across all languages, and the improvement is sometimes dramatic versus the BERT non-adversarial baseline.
In Figure 2, we plot the zero-resource German and Japanese test set accuracy as a function of the number of training steps taken, with and without adversarial training. The plot shows that the variation in the test accuracy is reduced with adversarial training, which suggests that the cross-lingual performance is more consistent when adversarial training is applied. (We note that the batch size and learning rates are the same for all the languages in MLDoc, so the variation seen in Figure 2 is not affected by those factors.)
3.3 CoNLL NER Results
Table 2: CoNLL NER zero-resource F1 scores.

| Model | en | de | es | nl |
| --- | --- | --- | --- | --- |
| Devlin et al. (2019) | 92.4 | - | - | - |
| Mayhew et al. (2017) | - | 57.5 | 66.0 | 64.5 |
| Ni et al. (2017) | - | 58.5 | 65.1 | 65.4 |
| Chen et al. (2019) | - | 56.0 | 73.5 | 72.4 |
| Xie et al. (2018) | - | 57.8 | 72.4 | 71.3 |
| BERT En-labels + Adv. | - | 78.3 | 80.8 | 82.1 |
We used a learning rate of for the task loss, for the generator loss and for the discriminator loss.
In Table 2, we report the F1 scores for all of the CoNLL NER languages. When combined with adversarial learning, the BERT cross-lingual F1 scores increased for German over the non-adversarial baseline, and the scores remained largely the same for Spanish and Dutch. Regardless, the BERT zero-resource performance far exceeds the results published in previous work.
3.4 Alignment of Embeddings for Parallel Documents
Table 3 (columns: Source, Target, Without Adv., With Adv.): Median cosine similarity between the mean-pooled BERT embeddings of MLDoc English documents and their translations, with and without language-adversarial training. The median cosine similarity increased with adversarial training for every language pair, which suggests that the adversarial loss encourages BERT to learn language-independent representations.
If language-adversarial training encourages language-independent features, then the English documents and their translations should be close in the embedding space. To examine this hypothesis, we take the English documents from the MLDoc training corpus and translate them into German, Spanish, French, etc. using Amazon Translate.
We construct the embeddings for each document using BERT models finetuned on MLDoc. We mean-pool each document embedding to create a single vector per document. We then calculate the cosine similarity between the embeddings for the English document and its translation. In Table 3, we observe that the median cosine similarity increases dramatically with adversarial training, which suggests that the embeddings became more language-independent.
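The similarity measurement can be sketched as follows; the arrays here are stand-ins for real per-word-piece BERT outputs, and the function names and shapes are illustrative assumptions:

```python
import numpy as np

def mean_pool(token_embeddings):
    """Collapse (num_wordpieces, 768) contextual embeddings to one document vector."""
    return token_embeddings.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def median_similarity(en_docs, translated_docs):
    """Median cosine similarity between English documents and their translations.

    Each element is a (num_wordpieces, 768) array of contextual embeddings
    (random arrays stand in for real BERT outputs in this sketch).
    """
    sims = [cosine(mean_pool(e), mean_pool(t))
            for e, t in zip(en_docs, translated_docs)]
    return float(np.median(sims))

rng = np.random.default_rng(0)
en = [rng.normal(size=(20, 768)) for _ in range(5)]
sim_self = median_similarity(en, en)   # identical documents: similarity is 1.0
```

Taking the median rather than the mean makes the summary statistic robust to a few poorly translated or unusually short documents.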
4 Discussion

For many of the languages examined, we were able to improve on BERT’s zero-resource cross-lingual performance on the MLDoc classification and CoNLL NER tasks. Language-adversarial training was generally effective, though the size of the effect appears to depend on the task. We observed that adversarial training moves the embeddings of English text and their non-English translations closer together, which may explain why it improves cross-lingual performance.
Future directions include adding the language-adversarial task during BERT pre-training on the multilingual Wikipedia corpus, which may further improve zero-resource performance, and finding better stopping criteria for zero-resource cross-lingual tasks besides using the English dev set.
Acknowledgments

We would like to thank Julian Salazar and Faisal Ladhak for their helpful comments and discussions.
References

- Artetxe and Schwenk (2018). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464.
- Chen et al. (2019). Multi-source cross-lingual model transfer: learning what to share. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
- Conneau et al. (2018). XNLI: evaluating cross-lingual sentence representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.
- Hajmohammadi et al. (2015). Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples. Information Sciences 317, pp. 67–77.
- Joty et al. (2017). Cross-language learning with adversarial neural networks: application to community question answering. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL).
- Kim et al. (2017). Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- Lample and Conneau (2019). Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- Lu et al. (2018). A neural interlingua for multilingual machine translation. In Proceedings of the Conference on Machine Translation (WMT).
- Mayhew et al. (2017). Cheap translation for cross-lingual named entity recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- McClosky et al. (2006). Effective self-training for parsing. In Proceedings of NAACL-HLT.
- Ni et al. (2017). Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT.
- Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1(8).
- Rigutini et al. (2005). An EM-based training algorithm for cross-language text categorization. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 529–535.
- Tjong Kim Sang and De Meulder (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL).
- Schwenk and Li (2018). A corpus for multilingual document classification in eight languages. In Proceedings of the Language Resources and Evaluation Conference (LREC).
- Xie et al. (2018). Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- Zeman and Resnik (2008). Cross-language parser adaptation between related languages. In Proceedings of the IJCNLP Workshop on NLP for Less Privileged Languages.
- Zhang et al. (2017). Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.