An Empirical Study of Factors Affecting Language-Independent Models

12/30/2019 ∙ by Xiaotong Liu, et al. ∙ IBM

Scaling existing applications and solutions to multiple human languages has traditionally proven to be difficult, mainly due to the language-dependent nature of preprocessing and feature engineering techniques employed in traditional approaches. In this work, we empirically investigate the factors affecting language-independent models built with multilingual representations, including task type, language set and data resource. On two of the most representative NLP tasks, sentence classification and sequence labeling, we show that language-independent models can be comparable to or even outperform models trained using monolingual data, and that they are generally more effective on sentence classification. We experiment with language-independent models on many different languages and show that they are more suitable for typologically similar languages. We also explore the effects of different data sizes when training and testing language-independent models, and demonstrate that they are not only suitable for high-resource languages, but also very effective in low-resource languages.


1 Introduction

In today’s globalized world, companies need to be able to understand and analyze what is being said about them, their products, their services, or their competitors, regardless of the human language used. Many organizations have spent tremendous resources to develop cognitive applications and services for dealing with customers in different countries. For example, cognitive systems may use machine learning techniques to process input messages or statements to determine their meaning and to provide associated confidence scores based on knowledge acquired by the cognitive system. Typically, the use of such cognitive systems requires training individual natural language understanding models in a specific human language. For example, a tone analyzer model can be built to predict tones from English conversations Liu et al. (2018), but such a model would not work effectively with other languages. While translation techniques can be applied to translate data from an existing language to another language, human translation is labor-intensive and time-consuming, and machine translation can be costly and unreliable. As a result, attempts to scale existing applications to multiple human languages have traditionally proven to be difficult, mainly due to the language-dependent nature of preprocessing and feature engineering techniques employed in traditional approaches Akkiraju et al. (2018).

In this work, we empirically investigate the feasibility of using multilingual representations to build language-independent models, which can be trained with data from multiple source languages and then serve multiple target languages (target languages can be different from source languages). We explore this question using a unified language model, Multilingual BERT Devlin et al. (2019), which is pre-trained on the combination of monolingual Wikipedia corpora from 104 languages. Through a series of experiments on multiple task types, language sets and data resources, we contribute empirical findings on how these factors affect language-independent models:

  • Task Type. We analyze and compare language-independent models on two of the most representative NLP tasks: sentence classification and sequence labeling. On both tasks, we show that language-independent models can be comparable to or even outperform the models trained using monolingual data. Language-independent models are generally more effective on sentence classification.

  • Language Set. Theoretically, language-independent models can be trained using any language set and used to make predictions in any language. By training and testing language-independent models with many different languages, we show that they are more suitable for typologically similar languages.

  • Data Resource. We explore the effects of different data sizes when training language-independent models. We demonstrate that language-independent models are not only suitable for high-resource languages, but also very effective in low-resource languages.

We derive insights from our experiments to facilitate the development and customization of natural language understanding models and solutions in new languages. First, a language-independent model can be used to solve the cold-start problem, where no initial model is available for a new target language and building one from scratch is costly. Second, it greatly reduces the cost and time of acquiring annotated data for a new target language by reusing data already annotated in previously supported languages. Third, it simplifies the deployment of a new model and saves the effort of simultaneously maintaining multiple monolingual models in a production setting. Our annotated data for low-resource languages will be made publicly available.

2 Related Works

Multilingual representation learning has been an active area of research, starting from word embedding alignment that uses small dictionaries to align word representations from different languages Mikolov et al. (2013). Research by Faruqui and Dyer (2014) has demonstrated that multilingual representations can be leveraged to improve the quality of monolingual representations. An unsupervised learning method has been proposed by Conneau et al. (2017) to align multilingual word embeddings without parallel data. In addition to word embedding alignment, aligning sentence representations from multiple languages has also been studied in machine translation, with both supervised learning Johnson et al. (2017); Artetxe and Schwenk (2018) and unsupervised learning Lample et al. (2017); Artetxe et al. (2017). However, most of these approaches focus on pairwise multilingual representation learning. In this work, we empirically investigate the impact of multilingual representations learned from a large number of languages on tasks that involve more languages than a single language pair.

Our work builds on recent advances in pre-trained language modeling. ELMo Peters et al. (2018) extracts context-sensitive features from a bidirectional LSTM language model and provides additional features for a task-specific architecture. ULMFiT Howard and Ruder (2018) advocates discriminative fine-tuning and slanted triangular learning rates to stabilize the fine-tuning process with respect to end tasks. OpenAI GPT Radford et al. (2018) builds on multi-layer transformer Vaswani et al. (2017) decoders instead of LSTMs to achieve effective transfer while requiring minimal changes to the model architecture. Recently, BERT Devlin et al. (2019) uses bidirectional transformer encoders pre-trained on a large corpus, and fine-tunes the pre-trained model with almost no task-specific architecture for each end task. In this work, we leverage the multilingual representations learned from multilingual BERT Devlin et al. (2019) to build models that can scale to many languages.

3 Language-Independent Model

In this section, we describe the motivation for language-independent models, and how to create such models via multilingual representation learning and fine-tuning.

3.1 One Model, Many Languages

To scale our efforts to support the diversity of people in the world, it is important to build and customize machine learning models for many different languages in various NLP tasks. For each target language, however, this often requires going through the whole lifecycle of data collection, data cleansing, data annotation, data storage, feature creation and selection, machine learning model training, model validation, benchmarking and deployment of these models as services in production Akkiraju et al. (2018). This easily becomes overwhelming as the number of target languages increases. To address this problem, we advocate building one model for all target languages together, which we call a Language-Independent Model (LIM), as the target languages to serve in production do not necessarily depend on which source languages were used in training. Figure 1 shows a conceptual example: an LIM can be trained using annotated data from source languages such as English (EN) and French (FR), and then serve target languages including Spanish (ES), Italian (IT) and Japanese (JA), which are different from the source languages. This not only accelerates the enablement of a new language by reusing data already annotated in previously supported languages, but also simplifies the deployment process and saves the effort of maintaining multiple monolingual models in production.

Figure 1: A conceptual example of a Language-Independent Model (LIM). The target languages to serve in production do not necessarily depend on which source languages were used in training. For instance, an LIM can be trained using annotated data from source languages such as English (EN) and French (FR), and then serve target languages including Spanish (ES), Italian (IT), Japanese (JA) and so on.

3.2 Multilingual Representation Learning with BERT

The basis for building LIMs lies in learning a representation that can cover multiple languages. Among the recent significant advances in deep contextualized representation learning for natural language understanding, BERT Devlin et al. (2019) stands out because its pre-training process naturally supports multilingual representation learning. Specifically, multilingual BERT was pre-trained on the Wikipedia pages (excluding user and talk pages) of 104 languages with a 110K shared WordPiece Wu et al. (2016) vocabulary. It is a 12-layer, 768-hidden, 12-head transformer model Vaswani et al. (2017) with 110M parameters. To alleviate the bias towards high-resource languages such as English, data from high-resource languages were under-sampled and data from low-resource languages were over-sampled. The pre-training of multilingual BERT does not use any marker denoting the input language, and does not rely on a parallel corpus to explicitly encourage translation-equivalent pairs to have similar representations.
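One common way to under-sample high-resource languages and over-sample low-resource ones is exponentially smoothed sampling: each language's share of the data is raised to a power below 1 and the result is renormalized. The sketch below only illustrates that idea. The exponent of 0.7, the function name `smoothed_sampling_probs`, and the corpus sizes are assumptions made for this example and are not stated in this paper.

```python
def smoothed_sampling_probs(corpus_sizes, exponent=0.7):
    """Return per-language sampling probabilities after exponential smoothing.

    corpus_sizes: dict mapping a language code to its number of examples.
    exponent: smoothing factor in (0, 1]; 1.0 keeps the raw proportions.
              The default of 0.7 is an assumed, illustrative value.
    """
    total = sum(corpus_sizes.values())
    raw = {lang: size / total for lang, size in corpus_sizes.items()}
    powered = {lang: share ** exponent for lang, share in raw.items()}
    norm = sum(powered.values())
    return {lang: value / norm for lang, value in powered.items()}


# Hypothetical corpus sizes for one large and one small language.
corpus_sizes = {"high": 1_000_000, "low": 1_000}
raw_share = {lang: n / sum(corpus_sizes.values()) for lang, n in corpus_sizes.items()}
probs = smoothed_sampling_probs(corpus_sizes)
```

With these made-up sizes, the small corpus receives a larger sampling probability than its raw share, and the large corpus a smaller one, which is the qualitative effect described above.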

3.3 Fine-Tuning Multilingual BERT for End Tasks

The multilingual representations learned with BERT can be generalized to many natural language understanding tasks such as Sentiment Analysis, Named Entity Recognition, Categorization, and so on (as illustrated in Figure 2). The input to multilingual BERT is a sequence of tokens in any language, which may be a single sentence or two sentences packed together. The input representation of each token is constructed as the sum of the corresponding token, segment, and position embeddings. For sentence classification tasks, the first token of each sequence is a special classification token ([CLS]), and its final hidden state is used as the aggregate representation of the whole sequence. For sequence labeling tasks, the final hidden state of each token encodes its contextualized representation with respect to the whole sequence. To fine-tune multilingual BERT, a classification layer is added on top of the final representation layer, and the probabilities of all label classes are computed with a standard softmax. The parameters of multilingual BERT and the classification layer are fine-tuned jointly to maximize the log-probability of the correct label. The labeled data of end tasks are shuffled across different languages when fine-tuning multilingual BERT.

Figure 2: An illustration of generalized multilingual representation learning for different NLP tasks.
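The classification layer described in Section 3.3 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the tiny hidden size of 2, and the weight values are all hypothetical, and the encoder that would produce the final hidden states is left out.

```python
import math


def input_embedding(token_vec, segment_vec, position_vec):
    """Input representation of one token: element-wise sum of three embeddings."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden, weights, bias):
    """Linear classification layer followed by softmax for one hidden state."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def sentence_label_probs(final_hidden_states, weights, bias):
    """Sentence classification: use the final hidden state of [CLS] (position 0)."""
    return classify(final_hidden_states[0], weights, bias)


def token_label_probs(final_hidden_states, weights, bias):
    """Sequence labeling: classify the final hidden state of every token."""
    return [classify(h, weights, bias) for h in final_hidden_states]


def negative_log_likelihood(probs, gold_label):
    """Minimizing this loss maximizes the log-probability of the correct label."""
    return -math.log(probs[gold_label])


# Toy values (hypothetical): 3 tokens, hidden size 2, 3 label classes.
final_hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

sentence_probs = sentence_label_probs(final_hidden_states, weights, bias)
per_token_probs = token_label_probs(final_hidden_states, weights, bias)
loss = negative_log_likelihood(sentence_probs, gold_label=0)
```

The only difference between the two task types in this sketch is which hidden states are passed to the classification layer: the [CLS] state alone for sentence classification, or every token's state for sequence labeling.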

4 Experiments

The effectiveness of LIMs can be affected by at least three factors: task type, language set and data resource. In this section, we empirically investigate the effects of these factors on the performance of LIMs.

4.1 Factor Characterization

Task Type

We explore whether LIMs are equally effective across different end tasks. For the scope of this paper, we consider sentence classification and sequence labeling as two of the most popular NLP tasks. In particular, we select and compare two representative tasks: Sentiment Analysis and Named Entity Recognition (NER). Sentiment Analysis represents a typical sentence classification task, while NER is a popular sequence labeling task.

Language Set

While theoretically an LIM can be trained using any language set, and be used to make predictions in any language, multilingual representations may not be equally effective across different languages Gerz et al. (2018). For instance, it has been shown that a multilingual word embedding alignment between English and Chinese is much more difficult to learn than that between English and Spanish Conneau et al. (2017). We explore many different languages when training and testing LIMs.

Data Resource

For high-resource languages, the annotated data can be of different sizes; for low-resource languages, large amounts of data often do not exist Kasai et al. (2019). We explore the effects of different data sizes when training and testing LIMs.

4.2 Case Study on Sentiment Analysis

We treat Sentiment Analysis as a 3-class classification problem: given a sentence in a target language, consisting of a series of words, predict its sentiment polarity.

For this case study, we consider 7 high-resource languages: English, Spanish, Italian, Brazilian Portuguese, Dutch, Japanese and Chinese, covering both western and eastern languages. The high-resource training set consists of 770K data points: 230K in English and 90K in each of the other 6 languages. The test set contains both publicly available test data and high-quality in-house test data: 630K English, 10K Spanish, 57K Japanese, 10K Chinese and 15K French. In addition, we collect 5K data points in each of 5 languages: Danish, Swedish, Norwegian, Russian, and Turkish, which are considered low-resource languages in our experiments. For each low-resource language, we use 4K data points as the training set and 1K as the test set.

We randomly split 1/10 from the training set as the development set for model selection and use the rest for model training (i.e., fine-tuning the parameters of Multilingual BERT and the sentence classification layer). Following the original BERT fine-tuning procedure Devlin et al. (2019), we fine-tune multilingual BERT with the following parameter choices: (1) batch size: 16, 32; (2) learning rate: 5e-5, 3e-5, 2e-5; (3) number of epochs: 3, 4. The model with batch size 32, learning rate 2e-5 and 4 epochs was selected as the best model based on its performance on the development set. We denote the LIM for Sentiment Analysis trained with high-resource languages as LIM-H, and the LIM trained with the mix of high-resource and low-resource languages as LIM-M.
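The development split and the hyperparameter search described above can be sketched as follows. The grid values are taken from the text, while the helper names (`split_train_dev`, `grid_configs`, `select_best`) and the `dev_score` callable are hypothetical placeholders. The actual fine-tuning and evaluation step is not shown.

```python
import itertools
import random

# Hyperparameter choices listed in the text.
GRID = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [3, 4],
}


def split_train_dev(examples, dev_parts=10, seed=0):
    """Randomly hold out 1/dev_parts of the examples as a development set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = len(shuffled) // dev_parts
    return shuffled[n_dev:], shuffled[:n_dev]


def grid_configs(grid):
    """Enumerate every combination of the hyperparameter choices."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def select_best(configs, dev_score):
    """Pick the configuration with the highest development-set score.

    dev_score is a hypothetical callable that fine-tunes a model with the
    given configuration and returns its score on the development set.
    """
    return max(configs, key=dev_score)


configs = grid_configs(GRID)  # 2 * 3 * 2 = 12 candidate configurations
```

The grid yields 12 configurations, one of which (batch size 32, learning rate 2e-5, 4 epochs) is the one reported as best for Sentiment Analysis.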

4.2.1 Results on High-Resource Languages

For high-resource languages, we compare LIM-H with the following methods:

  • CNN Kim (2014) is a convolutional neural network (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We use this method to train monolingual Sentiment Analysis models as a baseline because of its popularity and its simple implementation, which aids reproducibility.

  • ULMFiT Howard and Ruder (2018) is a recent generative pretrained language model with task-specific fine-tuning. We follow ULMFiT by adopting discriminative fine-tuning and slanted triangular learning rates to stabilize the fine-tuning process and create monolingual Sentiment Analysis models.

  • Monolingual-BERT. We trained monolingual Sentiment Analysis models by fine-tuning BERT with monolingual datasets for every language, respectively. For example, a Chinese-only BERT model refers to the BERT model fine-tuned using Chinese-only annotated data for Sentiment Analysis.

In Table 1, we report the accuracy results of Sentiment Analysis on English and Spanish across various models. In English, LIM-H improves accuracy by 7.4 percentage points over CNN and by 3.2 points over ULMFiT. In Spanish, it outperforms the two previous methods by 4.5 and 2.3 points respectively.

Language CNN ULMFiT LIM-H
English 72.1 76.3 79.5
Spanish 69.4 71.6 73.9
Table 1: Accuracy results of Sentiment Analysis on English and Spanish across various models.

Furthermore, Table 2 shows that our method is able to compete with the monolingual BERT models on Sentiment Analysis. By leveraging data from other languages, our LIM outperforms the English-only BERT model by 1.8 percentage points and the Japanese-only BERT model by 0.7 points, but falls behind the Chinese-only BERT model by 1.2 points. It should be noted that the Chinese-only BERT model was specifically pre-trained to account for the unique character tokenization of Chinese. Therefore, it is still very encouraging to see that our LIM is comparable to a specially customized monolingual BERT model.

Language Monolingual-BERT LIM-H
English 77.7 79.5
Japanese 78.0 78.7
Chinese 74.5 73.3
Table 2: Accuracy results of Sentiment Analysis on English, Japanese and Chinese between monolingual BERT and LIM-H.

In Table 3, we evaluate the impact of LIM on Sentiment Analysis via zero-shot transfer learning. Even when we do not include any French annotated data in training, we still obtain an improvement of 5.7 percentage points over the monolingual CNN model trained using French annotated data.

Language CNN LIM-H
French 54.0 59.7
Table 3: Accuracy results of Sentiment Analysis on French between CNN and LIM-H. This demonstrates a zero-shot transfer learning case for LIM-H as it does not involve any French annotated data when training the model.

4.2.2 Results on Low-Resource Languages

For low-resource languages, we compare LIM-H and LIM-M in Table 4. LIM-H demonstrates the effects of zero-shot transfer learning on low-resource languages, with an average accuracy of 60%. Since we do not use any low-resource training data in LIM-H, this shows that an LIM can be used to address the cold-start problem, where no initial model is available for a new target low-resource language and building one from scratch is costly. Furthermore, LIM-M demonstrates how much improvement an LIM can gain by adding only a small amount of data in low-resource languages. In particular, by adding 4K annotated data points in each low-resource language, we obtain an average improvement of about 11 percentage points. This greatly reduces the cost and time of acquiring annotated data for a new target low-resource language by transferring the knowledge learned from the larger amount of annotated data available in high-resource languages.

Language LIM-H LIM-M
Danish 62.5 69.2
Swedish 56.8 68.6
Norwegian 62.0 70.3
Russian 62.1 75.8
Turkish 56.8 69.1
Table 4: Accuracy results of Sentiment Analysis on low-resource languages. We compare the performance of zero-shot transfer learning in LIM-H (without any annotated data from the target languages) and low-resource transfer training in LIM-M (only 4K annotated data points from each target language were used in training).
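The averages quoted for the low-resource languages can be recomputed directly from the values in Table 4. The short sketch below does only that arithmetic. The variable names are illustrative.

```python
# Accuracy values copied from Table 4: language -> (LIM-H, LIM-M).
TABLE_4 = {
    "Danish": (62.5, 69.2),
    "Swedish": (56.8, 68.6),
    "Norwegian": (62.0, 70.3),
    "Russian": (62.1, 75.8),
    "Turkish": (56.8, 69.1),
}

n_languages = len(TABLE_4)
avg_lim_h = sum(h for h, _ in TABLE_4.values()) / n_languages
avg_lim_m = sum(m for _, m in TABLE_4.values()) / n_languages
avg_gain = avg_lim_m - avg_lim_h

# Per-language gain from adding 4K target-language training examples.
gains = {lang: m - h for lang, (h, m) in TABLE_4.items()}

print(f"LIM-H average: {avg_lim_h:.2f}")
print(f"LIM-M average: {avg_lim_m:.2f}")
print(f"Average gain:  {avg_gain:.2f} points")
```

The zero-shot average comes out at roughly 60, and the average gain at roughly 10.6 points, in line with the rounded figures quoted in the text. Russian shows the largest per-language gain and Danish the smallest.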

4.3 Case Study on Named Entity Recognition

Given a sentence in a target language, consisting of a series of words, NER outputs a sequence of labels, one per word, with respect to the named entity types {Person, Location, Organization, Date, Time, JobTitle, Duration, Facility, GeographicFeature, Measure, Ordinal, Money}. This is much more fine-grained and complex than the traditional CoNLL NER task, which only considers 4 entity types Tjong Kim Sang (2002); Tjong Kim Sang and De Meulder (2003). We follow the inside–outside–beginning (IOB2) tagging format Ramshaw and Marcus (1999): a B- prefix means that the tag is the beginning of a chunk, an I- prefix indicates that the tag is inside a chunk, and an O tag represents that a token belongs to no chunk.
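As a small illustration of the IOB2 format, the sketch below converts entity spans into per-token tags. The function name `spans_to_iob2`, the span representation (start index, exclusive end index, entity type), and the example sentence are assumptions made for this illustration and are not part of the paper's data format.

```python
def spans_to_iob2(tokens, spans):
    """Convert entity spans into IOB2 tags, one tag per token.

    spans: list of (start, end, entity_type) with `end` exclusive.
    Tokens outside every span receive the "O" tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining tokens of the chunk
    return tags


tokens = ["Marie", "Curie", "moved", "to", "Paris", "in", "1891"]
spans = [(0, 2, "Person"), (4, 5, "Location"), (6, 7, "Date")]
tags = spans_to_iob2(tokens, spans)
# tags == ['B-Person', 'I-Person', 'O', 'O', 'B-Location', 'O', 'B-Date']
```

In IOB2, every chunk starts with a B- tag, including single-token entities such as "Paris" here.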

We build an LIM for NER with annotated data in 3 languages: French, Italian and German. The training set consists of 679K data points (148K in French, 470K in Italian and 61K in German). We randomly split 1/10 from the training set as the development set for model selection and use the rest for model training (i.e., fine-tuning the parameters of Multilingual BERT and the sequence labeling layer). After fine-tuning with the different parameter choices described in Section 4.2, we selected the model with batch size 32, learning rate 2e-5 and 3 epochs as the best model based on its performance on the development set.

4.3.1 Compared Methods

We compare LIM with the following methods:

  • BiLSTM+CRF Lample et al. (2016) is a bidirectional LSTM with a sequential conditional random field above it. We use this method to train monolingual NER models as a baseline because it has been effective and widely used on sequence labeling tasks.

  • FLAIR Akbik et al. (2019) is one of the latest NLP frameworks that achieved state-of-the-art results on sequence labeling tasks. It models words as sequences of characters and leverages contextual string embeddings produced from a trained character language model Akbik et al. (2018). We adopt the pre-trained multilingual FLAIR embeddings to build multilingual NER models using the FLAIR framework.

4.3.2 Results

We evaluate the models on high-quality in-house benchmark datasets for NER in various languages including French (3870 entities), Italian (3776 entities), and German (5023 entities). (We refer to the number of entities instead of data points, as one data point can contain multiple entities.)

First, we report the F-measure results of NER on French, Italian and German in Table 5. On French, LIM achieves a substantial improvement of 9.9 points over BiLSTM+CRF and 7.1 points over FLAIR. Similarly, on German, it outperforms the two previous methods by 6.1 and 2.4 points respectively. On Italian, our LIM approach matches BiLSTM+CRF and outperforms FLAIR by 3.5 points.

Language BiLSTM+CRF FLAIR LIM
French 68.0 70.8 77.9
Italian 71.5 68.0 71.5
German 64.5 68.2 70.6
Table 5: F-measure results of NER on French, Italian and German. The BiLSTM+CRF models were trained using monolingual data in each language respectively. The FLAIR and LIM models were trained using the concatenation of French, Italian and German annotated data.
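For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it over entities, assuming exact-match scoring in which a predicted entity counts as correct only if both its span and its type match a gold entity. That scoring convention, the function name, and the toy entities are assumptions for illustration. The paper does not specify its exact evaluation script.

```python
def entity_f_measure(gold_entities, predicted_entities):
    """Entity-level F-measure under exact match of (start, end, type)."""
    gold = set(gold_entities)
    predicted = set(predicted_entities)
    true_positives = len(gold & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: one entity is predicted correctly, one has the wrong type.
gold = [(0, 2, "Person"), (4, 5, "Location")]
predicted = [(0, 2, "Person"), (4, 5, "Organization")]
score = entity_f_measure(gold, predicted)  # precision 0.5, recall 0.5
```

In this toy case precision and recall are both 0.5, so the F-measure is also 0.5.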

Second, we evaluate the effects of our LIM approach for zero-shot transfer learning on NER. We trained another FLAIR model and another LIM using only the concatenation of French and Italian annotated data, excluding German annotated data. Table 6 shows that our LIM method retains an F-measure of 58.6 on German while FLAIR drops to 20.3. This demonstrates the power of our LIM method in accelerating the development of models for a new language where no annotated data is available.

Language FLAIR LIM
German 20.3 58.6
Table 6: F-measure results of NER on German (zero-shot transfer learning). The FLAIR and LIM models were trained using the concatenation of French and Italian annotated data, while German annotated data was excluded.

4.4 Discussion

Task Type

While the results demonstrate the effectiveness of LIMs on two of the most representative NLP tasks, we found that LIMs are generally more effective on a sentence classification task than on a sequence labeling task, particularly for zero-shot transfer learning. For example, when no annotated data from the target language was used in model training, LIM outperforms the monolingual CNN baseline on Sentiment Analysis (Table 3), but falls behind the monolingual BiLSTM+CRF baseline on NER for German (58.6 in Table 6 versus 64.5 in Table 5).

Language Set

Powered by the multilingual representations learned in pre-trained BERT, LIMs seem more suitable for typologically similar languages. For instance, LIM-H falls slightly behind the Chinese-only BERT model on Sentiment Analysis, although the difference is relatively small (Table 2). This is consistent with the findings from multilingual representation learning using word embeddings Conneau et al. (2017).

Data Resource

Language-independent models are not only suitable for high-resource languages, but also very effective in low-resource languages. In particular, adding a relatively small amount of low-resource training data can result in a significant improvement of performance (Table 4).

Implications

These insights bring unique value to the development and customization of natural language understanding models and solutions in new languages. First, an LIM can be used to solve the cold-start problem, where no initial model is available for a new target language and building one from scratch is costly. Second, it greatly reduces the cost and time of acquiring annotated data for a new target language by reusing data already annotated in previously supported languages. Third, it simplifies the deployment of a new model and saves the effort of simultaneously maintaining multiple monolingual models in a production setting.

5 Conclusion and Future Work

As the use of machine learning becomes more pervasive all over the world, people speaking different languages will come to expect a seamless and customized experience in their own language. Building a language-independent model can accelerate the enablement of machine learning and cognitive solutions in new languages at a large scale. We demonstrate the power of this language-independent modeling approach through a series of experiments on multiple task types, language sets and data resources. Our annotated data for low-resource languages will be made publicly available. We hope that the insights gained from these experiments will help researchers and practitioners develop solutions and tools that enable better scalability, integration and operations in many other languages. In the future, we will continue to explore the effects of different combinations of languages with respect to various end tasks. In addition, we plan to extend the studies to more NLP tasks, and investigate the feasibility of multi-task learning for building a task- and language-independent framework.

References

  • A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf (2019) FLAIR: an easy-to-use framework for state-of-the-art nlp. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 54–59. Cited by: 2nd item.
  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Cited by: 2nd item.
  • R. Akkiraju, V. Sinha, A. Xu, J. Mahmud, P. Gundecha, Z. Liu, X. Liu, and J. Schumacher (2018) Characterizing machine learning process: a maturity framework. arXiv preprint arXiv:1811.04871. Cited by: §1, §3.1.
  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2017) Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041. Cited by: §2.
  • M. Artetxe and H. Schwenk (2018) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464. Cited by: §2.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §2, §4.1, §4.4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2, §3.2, §4.2.
  • M. Faruqui and C. Dyer (2014) Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 462–471. Cited by: §2.
  • D. Gerz, I. Vulić, E. M. Ponti, R. Reichart, and A. Korhonen (2018) On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 316–327. Cited by: §4.1.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §2, 2nd item.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §2.
  • J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa (2019) Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042. Cited by: §4.1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: 1st item.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pp. 260–270. Cited by: 1st item.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2017) Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043. Cited by: §2.
  • X. Liu, A. Xu, V. Sinha, and R. Akkiraju (2018) Voice of customer: a tone-based analysis system for online user engagement. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, pp. LBW001. Cited by: §1.
  • T. Mikolov, Q. V. Le, and I. Sutskever (2013) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168. Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §2.
  • L. A. Ramshaw and M. P. Marcus (1999) Text chunking using transformation-based learning. In Natural language processing using very large corpora, pp. 157–176. Cited by: §4.3.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, Stroudsburg, PA, USA, pp. 142–147. Cited by: §4.3.
  • E. F. Tjong Kim Sang (2002) Introduction to the conll-2002 shared task: language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, Stroudsburg, PA, USA, pp. 1–4. Cited by: §4.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.2.