
Enriching Biomedical Knowledge for Low-resource Language Through Translation

10/11/2022
by   Long Phan, et al.

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretrained and supervised data in the biomedical domain. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.


1 Introduction

In recent years, pretrained language models (LMs) have played an important and novel role in the development of many Natural Language Processing (NLP) systems. Utilizing large pretrained models like BERT Devlin et al. (2018), XLNET Yang et al. (2019), ALBERT Lan et al. (2019), RoBERTa Liu et al. (2019), GPT-3 Brown et al. (2020), BART Lewis et al. (2019), and T5 Raffel et al. (2019) has become an effective trend in natural language processing. All of these large models follow the Transformer architecture proposed by Vaswani et al. (2017) with the attention mechanism. The architecture has proven very suitable for finetuning on downstream tasks, leveraging transfer learning from large pretrained checkpoints. Before the emergence of large Transformer LMs, traditional word embeddings gave each word a fixed global representation. Large pretrained models instead derive contextual word representations from a large training corpus, giving the model better knowledge of the general representation of the trained language or domain and significantly improving performance on downstream finetuning tasks. The success of pretrained models in the general domain (BERT, RoBERTa, BART, T5, etc.) has paved the way for more domain-specific language models such as CodeBERT Feng et al. (2020) and CoTexT Phan et al. (2021b) for programming languages, TaBERT Yin et al. (2020) for tabular data, and BioBERT Lee et al. (2019) and PubmedBERT Tinn et al. (2021) for biomedical language.

Biomedical literature is becoming more popular and widely accessible to the scientific community through large databases such as PubMed (https://pubmed.ncbi.nlm.nih.gov), PMC (https://www.ncbi.nlm.nih.gov/pmc), and MIMIC-IV Johnson et al. (2021). This has led to many studies, corpora, and projects being released to further advance the biomedical NLP field Lee et al. (2019); Tinn et al. (2021); Phan et al. (2021a); Yuan et al. (2022). These biomedical-domain models leverage transfer learning from pretrained models Devlin et al. (2018); Clark et al. (2020); Raffel et al. (2019); Lewis et al. (2019) to achieve state-of-the-art results on multiple biomedical NLP tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and document classification.

However, there have been few studies on leveraging large pretrained models for biomedical NLP in low-resource languages. The main reason is the lack of large biomedical pretraining corpora and benchmark datasets. Collecting biomedical data in low-resource languages can be very expensive due to scientific limitations and inaccessibility.

We attempt to overcome the lack of biomedical text data in low-resource languages by using state-of-the-art translation models. We start with the Vietnamese language and keep everything reproducible for other low-resource languages in future work.

We introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on synthetic Vietnamese biomedical text translated with a state-of-the-art English-Vietnamese translation model. We also introduce ViMedNLI, a medical natural language inference (NLI) dataset translated from the English MedNLI Romanov and Shivade (2018) and refined by humans.

We thoroughly benchmark the performance of our ViPubmedT5 model, pretrained on synthetic translated biomedical data, on ViMedNLI and other public Vietnamese biomedical NLP tasks Minh et al. (2022b). The results show that our model outperforms both general-domain Nguyen and Nguyen (2020); Phan et al. (2022) and health-domain Minh et al. (2022b) Vietnamese pretrained models on biomedical tasks.

In this work, we offer the following contributions:

  • ViPubmedT5, the first pretrained encoder-decoder Transformer model pretrained on synthetic translated biomedical data.

  • ViMedNLI, a Vietnamese medical natural language inference dataset translated from MedNLI Romanov and Shivade (2018) and refined by humans with biomedical expertise.

  • We publicly release our model checkpoints, datasets, and source code for future studies on other low-resource languages.

2 Related Works

The development of parallel text corpora for translation and their use in training MT systems has been a rapidly growing field of research. In recent years, low-resource languages have gained more attention from industry and academia Chen et al. (2019); Shen et al. (2021); Gu et al. (2018); Nasir and Mchechesi (2022). Previous works include gathering more training data and training large multilingual models Thu et al. (2016); Fan et al. (2021). MT for low-resource languages improves the daily lives of billions of people in numerous fields. Nonetheless, in domains that are crucial yet data-limited, such as biomedicine and healthcare, MT systems have not been able to contribute adequately.

Previous works using MT systems for biomedical tasks include Neves et al. (2016); Névéol et al. (2018). Additionally, a number of biomedical parallel corpora Deléger et al. (2009) have been utilized only for terminology translation. Pioneering attempts trained MT systems on a corpus of MEDLINE titles Wu et al. (2011) and used publication titles and abstracts for both ES-EN and FR-EN language pairs Jimeno-Yepes et al. (2012). However, none of these works targets low-resource languages. A recent effort to build Vietnamese models for biomedicine and healthcare is Minh et al. (2022a); it, however, does not utilize the capability of MT systems, relying instead on manual crawling. This motivates us to employ MT systems to contribute high-quality Vietnamese datasets derived from English resources. To the best of our knowledge, this is the first work utilizing state-of-the-art machine translation to translate both self-supervised and supervised biomedical data for pretrained models in a low-resource language setting.

3 English-Vietnamese Translation

Due to the limited availability of high-quality parallel data, English-Vietnamese translation is classified as a low-resource translation task Liu et al. (2020). One of the first notable parallel datasets and En-Vi neural machine translation systems is IWSLT'15 Luong and Manning (2015) with 133K sentence pairs. A few years later, PhoMT Doan et al. (2021) and VLSP2020 Ha et al. (2020) released larger parallel datasets, extracted from publicly available resources, for English-Vietnamese translation.

Recently, VietAI (https://vietai.org) curated the largest set of 4.2M high-quality training pairs from various domains and achieved state-of-the-art results on English-Vietnamese translation (https://research.vietai.org/mtet). The work also focuses on En-Vi translation performance across multiple domains, including biomedicine. The project's NMT outperforms existing En-Vi translation models Doan et al. (2021); Fan et al. (2020) by more than 5% in BLEU score.

4 PubMed and English Biomedical NLP Research

PubMed (https://pubmed.ncbi.nlm.nih.gov) provides access to the MEDLINE database (https://www.nlm.nih.gov/bsd/pmresources.html), which contains titles, abstracts, and metadata from medical literature since the 1970s. The dataset consists of more than 34 million biomedical abstracts collected from sources such as life science publications, medical journals, and published online e-books. It is maintained and updated yearly to include more up-to-date biomedical documents.

PubMed abstracts have been the main pretraining corpus for almost every state-of-the-art biomedical domain-specific pretrained model Lee et al. (2019); Yuan et al. (2022); Tinn et al. (2021); Yasunaga et al. (2022); Alrowili and Shanker (2021); Phan et al. (2021a). Many well-known biomedical NLP/NLU benchmark datasets are also created from the unlabeled PubMed corpus Doğan et al. (2014); Nye et al. (2018); Herrero-Zazo et al. (2013); Jin et al. (2019). Recently, to help accelerate research in biomedical NLP, Gu et al. (2020) released BLURB (Biomedical Language Understanding & Reasoning Benchmark), which consists of multiple pretrained biomedical NLP models and benchmark tasks. It is worth noting that all of the top 10 models on the BLURB leaderboard (https://microsoft.github.io/BLURB/leaderboard.html) are pretrained on PubMed abstracts.

# | MedNLI | Translated by NMT | Our Refined
1 | Electrocardiograms revealed no QRS changes. | Điện tâm đồ cho thấy không có thay đổi về QRS. | Điện tâm đồ cho thấy không có thay đổi về phức độ QRS.
2 | Patient has no PMH | Bệnh nhân không có PMH | Bệnh nhân không có tiền sử bệnh
3 | Patient is post op | Bệnh nhân đã hồi phục | Bệnh nhân hậu phẫu thuật
  • #1: The English abbreviation can be used in both English and Vietnamese.

  • #2: The abbreviation is only used in English; in Vietnamese, it must be expanded or replaced.

  • #3: The English abbreviation is mistranslated into Vietnamese.

Table 1: Some examples of abbreviation refining

5 ViPubmed

To ensure that our translated ViPubmed dataset contains up-to-date biomedical research (for example, on Covid-19 and Covid-19 vaccines), we use the newest PubMed22 baseline (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline), which contains approximately 34 million published English biomedical abstracts. The raw data is compressed in XML format. We then parse these structured XML files to obtain the abstract text with Pubmed Parser (https://github.com/titipata/pubmed_parser) Achakulvisut et al. (2020).

Because the state-of-the-art English-Vietnamese NMT mentioned in Section 3 was limited to 512 tokens at the time of our experiments, we filter out abstracts with more than 512 tokens. For a fair size comparison with the unlabeled datasets of other health-related Vietnamese pretrained models (discussed in Section 8.2), we take a subset of 20M biomedical abstracts (~20GB of text) for translation and leave a larger subset for future releases. We then translate the 20M English biomedical abstracts with the state-of-the-art English-Vietnamese NMT released by VietAI (https://research.vietai.org/mtet) using 4 TPUv2-8 and 4 TPUv3-8.
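The pipeline above (parse the PubMed baseline XML, drop long abstracts, translate the rest) can be sketched as follows. This is a minimal illustration, not the TPU setup used in the paper; in particular, the HuggingFace model id "VietAI/envit5-translation" and its "en: " source prefix are our assumptions about the public VietAI NMT.

```python
# Minimal sketch of the ViPubmed construction pipeline described above.
# Assumptions: the model id "VietAI/envit5-translation" and the "en: " prefix
# are guesses for the public VietAI En-Vi NMT; the paper's actual setup differs.
import pubmed_parser as pp
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MAX_TOKENS = 512  # NMT length limit mentioned in Section 5

tokenizer = AutoTokenizer.from_pretrained("VietAI/envit5-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/envit5-translation")

def translate_abstracts(xml_path):
    """Parse one PubMed baseline file and translate short abstracts to Vietnamese."""
    for record in pp.parse_medline_xml(xml_path):
        abstract = record.get("abstract") or ""
        if not abstract:
            continue
        inputs = tokenizer("en: " + abstract, return_tensors="pt")
        # Filter out abstracts longer than the NMT's 512-token limit.
        if inputs.input_ids.shape[1] > MAX_TOKENS:
            continue
        outputs = model.generate(**inputs, max_length=MAX_TOKENS)
        yield record["pmid"], tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example: stream translations from one baseline shard.
# for pmid, vi_abstract in translate_abstracts("pubmed22n0001.xml.gz"):
#     print(pmid, vi_abstract[:80])
```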

6 ViMedNLI

Along with the unlabeled dataset for pretraining, we also introduce a benchmark dataset generated by translation and refined by human experts. We start with a natural language inference (NLI) task, as it is less affected by errors in biomedical entity translation than named-entity recognition (NER) or relation extraction (RE) tasks.

6.1 MedNLI

MedNLI Romanov and Shivade (2018) is an NLI dataset annotated by doctors and grounded in patients' medical history. Given a premise sentence and a hypothesis sentence, the relation between the two sentences (entailment, contradiction, neutral) is labeled by two board-certified radiologists. The premise sentences in MedNLI come from MIMIC-III Johnson et al. (2016), a large open-source clinical database. The dataset has been widely studied and benchmarked by the biomedical NLP research community (https://paperswithcode.com/dataset/mednli) Peng et al. (2019); Phan et al. (2021a); El Boukkouri et al. (2020); Alrowili and Shanker (2021); Kanakarajan et al. (2019).

6.2 Dataset Challenges

We follow the same procedure discussed in Section 5 to translate the training, development, and test sets released in Romanov and Shivade (2018). The time and resources needed to translate the dataset are negligible, as there are only 14,522 samples in total.

However, upon translating the dataset with NMT, we find that the English clinical note domain has a distinct sublanguage with unique challenges (abbreviations, inconsistent punctuation, misspellings, etc.). This observation has also been addressed in Friedman et al. (2002) and Meystre et al. (2008). Such differences in clinical language representation challenge the translation output and our goal of releasing a high-quality medical dataset.

Corpus | Train | Dev | Test | Task | Domain
acrDrAid (pairs) | 4000 | 523 | 1130 | Acronym disambiguation | Medical
FAQSum (documents) | 10621 | 1326 | 1330 | Abstractive summarization | Healthcare
ViMedNLI (pairs) | 11232 | 1395 | 1422 | Inference | Clinical

Table 2: Statistics of the finetuning datasets

6.3 Human Refining

The unique challenges of clinical data under translation settings (discussed in Section 6.2) require us to work with humans who not only have biomedical expertise but are also proficient in both English and Vietnamese. Therefore, we collaborate with Vietnamese pre-medical students at well-known U.S. universities to refine the ViMedNLI dataset.

The refining process starts with a comprehensive guideline document with thorough annotation instructions and examples. As clinical notes contain a significant number of technical abbreviations that the machine translation system cannot translate out of the box (Section 6.2), we work with the medical annotators to create a list of abbreviations and their expanded forms. To make sure the expanded forms of these abbreviations generalize well in real-world settings, we verify their usage through multiple Vietnamese medical websites, blogs, and online dictionaries. Depending on the case, we keep the original English abbreviation, replace it with a Vietnamese expanded form, or replace it with a Vietnamese abbreviation. Some examples of this process are shown in Table 1.
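As an illustration of this refining step, the sketch below automates only the simplest case, where the NMT leaves an English abbreviation untranslated in its output (row 2 of Table 1); in practice the annotators review every sentence manually, and the mapping shown is a hypothetical fragment rather than the curated list.

```python
# Simplified sketch of the abbreviation-refining pass described above.
# It only handles abbreviations left untranslated in the NMT output; mappings
# beyond the Table 1 example are hypothetical, not the annotators' full list.
import re

EXPANSIONS = {
    "PMH": "tiền sử bệnh",  # past medical history (Table 1, row 2)
    # "QRS" is intentionally absent: it is kept as-is in Vietnamese (row 1).
}

def refine(nmt_output: str) -> str:
    """Replace untranslated English abbreviations with curated Vietnamese expansions."""
    refined = nmt_output
    for abbr, expansion in EXPANSIONS.items():
        refined = re.sub(rf"\b{re.escape(abbr)}\b", expansion, refined)
    return refined

print(refine("Bệnh nhân không có PMH"))  # -> Bệnh nhân không có tiền sử bệnh
```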

Aside from English medical abbreviations, there are grammatical and spelling mistakes that the machine translation system cannot handle, either producing incorrect Vietnamese meanings or failing to translate at all. Human refining is therefore required. The phrase "The infant was born at herm", for example, was translated as "Đứa bé được sinh ra ở Herm". The word "herm", which should be spelled "term", is misspelled and has no medical meaning. The accurate translation should be "Đứa bé được sinh đủ tháng".

Additionally, the machine translation system occasionally produces incorrect Vietnamese meanings even when translating words with proper English spelling and grammar. Consider the sentence "The patient had post-term delivery" as an example. Despite meaning "Bệnh nhân sinh muộn", it was mistranslated as "Bệnh nhân sinh non" ("The patient had pre-term delivery"). Another example is "Narrowing of the vessels", which actually means "Thu hẹp các mạch" rather than "Thu hẹp các" (which has no meaning).

Domain | Datasets | Metrics | PhoBERT (+news) | ViT5 (+cc100) | ViHealthBERT (+health text mining) | ViPubmedT5 (+translated ViPubmed)
Healthcare | FAQSum | R1 | 47.15 | 66.32 | 50.45 | 65.81
Healthcare | FAQSum | R2 | 28.18 | 51.62 | 31.35 | 51.18
Healthcare | FAQSum | RL | 41.16 | 61.3 | 43.85 | 60.6
Medical | acrDrAid | Mac-F1 | 82.51 | 88 | 86.7 | 89.04
Clinical | ViMedNLI | Acc | 77.29 | 77.85 | 79.04 | 81.65

  • Notes: The best scores are in bold, and the second-best scores are underlined. PhoBERT/ViHealthBERT scores on FAQSum and acrDrAid are from Minh et al. (2022b).

Table 3: Test results on Vietnamese health and biomedical tasks

7 ViPubmedT5

With the unlabeled synthetic translated ViPubmed corpus (Section 5) and the ViMedNLI benchmark dataset (Section 6), we pretrain and finetune a Transformer-based language model Vaswani et al. (2017) to verify the effectiveness of our approach to enriching the Vietnamese biomedical domain with translated data. We describe our model and pretraining settings in this section.

7.1 Model Architecture

We adopt the Transformer encoder-decoder model proposed by Vaswani et al. (2017), the ViT5 Phan et al. (2022) checkpoints, and the T5 framework (https://github.com/google-research/text-to-text-transfer-transformer) implemented by Raffel et al. (2019). ViT5 is the first monolingual Vietnamese Transformer model; it achieves state-of-the-art results on multiple general-domain Vietnamese tasks, including generation and classification. The ViT5 publication releases two model sizes, base and large. We train ViPubmedT5 using the base setting (220 million parameters) and leave larger models for future work.

7.2 Pretraining

We pretrain ViPubmedT5 on 20GB of translated biomedical data from ViPubmed (Section 5). We initialize from the Vietnamese checkpoints of the original ViT5 work Phan et al. (2022) and continue pretraining the model on the synthetic biomedical-specific data for another 500k steps. Previous works Lee et al. (2019); Tinn et al. (2021) have shown that this approach allows pretrained language models to learn a better representation of the biomedical language context while maintaining the core Vietnamese language representation.

We train ViPubmedT5 using the same span-masking learning objective as Raffel et al. (2019). During self-supervised training, spans of biomedical text are randomly masked (replaced with sentinel tokens) in the input, and the target sequence is formed by concatenating the same sentinel tokens with the corresponding masked spans.
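As a toy illustration of this objective (with fixed rather than randomly sampled spans, and whitespace tokenization instead of the model's subword vocabulary), the following sketch builds an input/target pair with T5-style sentinel tokens:

```python
# Toy illustration of the T5-style span-masking objective: masked spans are
# replaced by sentinels in the input, and the target concatenates the same
# sentinels with the masked spans. Span selection is simplified here.
def span_corrupt(tokens, spans):
    """spans: list of (start, length) pairs to mask, in ascending order."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep unmasked tokens
        inputs.append(sentinel)               # mark the masked span
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "điện tâm đồ cho thấy không có thay đổi".split()
print(span_corrupt(tokens, [(1, 2), (5, 1)]))
# ('điện <extra_id_0> cho thấy <extra_id_1> có thay đổi',
#  '<extra_id_0> tâm đồ <extra_id_1> không <extra_id_2>')
```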

8 Experiments

8.1 Benchmark dataset

We finetune and benchmark our pretrained ViPubmedT5 model on two public Vietnamese biomedical-domain datasets, acrDrAid and FAQ Summarization Minh et al. (2022b), and on our released ViMedNLI (Section 6). Detailed statistics of the three datasets are shown in Table 2, and a sketch of how these tasks are cast into the text-to-text format follows the dataset list below.

  • acrDrAid Minh et al. (2022b) is a Vietnamese acronym disambiguation (AD) dataset containing radiology reports from Vinmec hospital (https://vinmec.com/), Vietnam. The task is to correctly identify the expansion of an acronym in a given radiology report context. The dataset is annotated by three expert radiologists and contains 135 acronyms and 424 expansion texts in total.

  • FAQ Summarization Minh et al. (2022b) is a Vietnamese summarization dataset collected from the FAQ sections of multiple trustworthy healthcare websites. For each FAQ section, the question text is the input sequence and the title is the target summary.

  • ViMedNLI is our released dataset discussed in Section 6.
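Since ViT5 and ViPubmedT5 are text-to-text models, each of these tasks has to be serialized into an input string and a target string. The sketch below shows one plausible formatting; the task prefixes and label strings are our assumptions, not the exact formats used in the paper or in the Minh et al. (2022b) datasets.

```python
# Hedged sketch of casting the three finetuning tasks into text-to-text form.
# Prefixes and label strings are illustrative assumptions only.
from typing import Tuple

def nli_example(premise: str, hypothesis: str, label: str) -> Tuple[str, str]:
    # ViMedNLI: sentence-pair classification as generation of the label word.
    source = f"vimednli: premise: {premise} hypothesis: {hypothesis}"
    return source, label  # label in {"entailment", "contradiction", "neutral"}

def summarization_example(question: str, title: str) -> Tuple[str, str]:
    # FAQSum: the FAQ question text is the input, the title is the target summary.
    return f"faqsum: {question}", title

def acronym_example(context: str, acronym: str, expansion: str) -> Tuple[str, str]:
    # acrDrAid: generate the correct expansion of the acronym in context.
    return f"acrdraid: acronym: {acronym} context: {context}", expansion
```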

8.2 Baseline

In order to verify the effectiveness of our proposed methods, we compare our ViPubmedT5 model with other state-of-the-art Vietnamese pretrained models:

  • PhoBERT Nguyen and Nguyen (2020) is the first public large-scale monolingual language model pretrained for the Vietnamese language. The model follows the original RoBERTa Liu et al. (2019) architecture and is trained on a 20GB word-level Vietnamese news corpus.

  • ViT5 Phan et al. (2022) is the most recent state-of-the-art Vietnamese pretrained model for both generation and classification tasks. The model is trained on the general-domain CC100-vi corpus.

  • ViHealthBERT Minh et al. (2022b) is the first domain-specific pretrained language model for Vietnamese healthcare. After initializing weights from PhoBERT, the model is trained on 25M health sentences mined from different sources.

9 Results

The main finetuning results are shown in Table 3. The main takeaway is that training on synthetic translated biomedical data allows ViPubmedT5 to learn a better biomedical context representation. ViPubmedT5 achieves state-of-the-art results on the medical and clinical tasks while performing slightly worse than ViT5 on the healthcare task.

The performance of ViPubmedT5 on the healthcare-domain dataset (FAQSum) is unsurprising, as the translated ViPubmed data is scientific biomedical text written in an academic style, whereas the FAQSum dataset mostly consists of healthcare communication between patients and doctors on less scientific health websites.

For both the medical and clinical datasets, ViPubmedT5 significantly outperforms other existing models. There are also strong improvements from the general-domain ViT5 to ViPubmedT5 (88 → 89.04 on acrDrAid and 77.85 → 81.65 on ViMedNLI). This indicates that the translated ViPubmed corpus contains biomedical knowledge that low-resource Vietnamese pretrained models can leverage.

Meanwhile, our new translated ViMedNLI can serve as a strong baseline dataset for Vietnamese BioNLP research. Both health and biomedical domain models (ViHealthBERT & ViPubmedT5) perform better than general domain models (PhoBERT & ViT5) on the ViMedNLI dataset. This shows that our translated and refined ViMedNLI dataset is high-quality and has robust biomedical contexts.

10 Scaling to Other Languages

Our novel approach of utilizing a state-of-the-art NMT system to generate synthetic translated medical data for pretrained models is not limited to the Vietnamese language and is scalable to many other low-resource languages. Various recent works focus on improving the quality of low-resource NMT systems NLLB Team et al. (2022); Fan et al. (2020); Bañón et al. (2020). These new state-of-the-art NMTs make the approach discussed in this paper more practical for producing synthetic translated biomedical data, enriching biomedical NLP research in multiple low-resource languages.

11 Conclusion

We utilize the state-of-the-art translation model MTet to scale up the very low-resource yet highly valuable biomedical data in Vietnamese. We introduce ViPubmedT5, a T5-style encoder-decoder Transformer pretrained on a large-scale translated biomedical corpus, which demonstrates state-of-the-art results on both inference and acronym disambiguation in the biomedical domain. We also introduce ViMedNLI, a machine-translated and human-expert-refined natural language inference benchmark, to further grow the suite of Vietnamese biomedical benchmarks and data.

12 Limitations

Although our pretrained model trained on synthetic translated biomedical data produces state-of-the-art results on downstream tasks for the Vietnamese language, the approach depends heavily on the quality of the NMT systems available for other low-resource languages. Thanks to recent studies and contributions from the Vietnamese research community (Section 3), the English-Vietnamese translation system has proven strong enough for us to conduct the experiments discussed in this work. However, the level of NMT performance required before translated biomedical data becomes useful for pretrained models is still a question that requires further study.

References

  • T. Achakulvisut, D. Acuna, and K. Kording (2020) Pubmed parser: a python parser for PubMed open-access XML subset and MEDLINE XML dataset. Journal of Open Source Software 5 (46), pp. 1979. External Links: Document, Link Cited by: §5.
  • S. Alrowili and V. Shanker (2021) BioM-transformers: building large biomedical language models with BERT, ALBERT and ELECTRA. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, pp. 221–227. External Links: Link, Document Cited by: §4, §6.1.
  • M. Bañón, P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, S. Ortiz Rojas, L. Pla Sempere, G. Ramírez-Sánchez, E. Sarrías, M. Strelec, B. Thompson, W. Waites, D. Wiggins, and J. Zaragoza (2020) ParaCrawl: web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4555–4567. External Links: Link, Document Cited by: §10.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. abs/2005.14165. External Links: Link, 2005.14165 Cited by: §1.
  • P. Chen, J. Shen, M. Le, V. Chaudhary, A. El-Kishky, G. Wenzek, M. Ott, and M. Ranzato (2019) Facebook AI’s WAT19 Myanmar-English translation task submission. In Proceedings of the 6th Workshop on Asian Translation, Hong Kong, China, pp. 112–122. External Links: Link, Document Cited by: §2.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. abs/2003.10555. External Links: Link, 2003.10555 Cited by: §1.
  • L. Deléger, M. Merkel, and P. Zweigenbaum (2009) Translating medical terminologies through word alignment in parallel text corpora. 42 4, pp. 692–701. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. abs/1810.04805. External Links: Link, 1810.04805 Cited by: §1, §1.
  • L. Doan, L. T. Nguyen, N. L. Tran, T. Hoang, and D. Q. Nguyen (2021) PhoMT: A high-quality and large-scale benchmark dataset for vietnamese-english machine translation. abs/2110.12199. External Links: Link, 2110.12199 Cited by: §3, §3.
  • R. I. Doğan, R. Leaman, and Z. Lu (2014) NCBI disease corpus: a resource for disease name recognition and concept normalization. 47, pp. 1–10. External Links: ISSN 1532-0464, Document, Link Cited by: §4.
  • H. El Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, and J. Tsujii (2020) CharacterBERT: reconciling ELMo and BERT for word-level open-vocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6903–6915. External Links: Link, Document Cited by: §6.1.
  • A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin (2020) Beyond english-centric multilingual machine translation. abs/2010.11125. External Links: Link, 2010.11125 Cited by: §10, §3.
  • A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Çelebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin (2021) Beyond english-centric multilingual machine translation. 22, pp. 107:1–107:48. Cited by: §2.
  • Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou (2020) CodeBERT: A pre-trained model for programming and natural languages. abs/2002.08155. External Links: Link, 2002.08155 Cited by: §1.
  • C. Friedman, P. Kra, and A. Rzhetsky (2002) Two biomedical sublanguages: a description based on the theories of zellig harris. Journal of Biomedical Informatics 35 (4), pp. 222–235. Note: Sublanguage - Zellig Harris Memorial External Links: ISSN 1532-0464, Document, Link Cited by: §6.2.
  • J. Gu, H. Hassan, J. Devlin, and V. O.K. Li (2018) Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 344–354. External Links: Link, Document Cited by: §2.
  • Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2020) Domain-specific language model pretraining for biomedical natural language processing. CoRR abs/2007.15779. External Links: Link, 2007.15779 Cited by: §4.
  • T. Ha, V. Tran, and K. Nguyen (2020) Goals, challenges and findings of the VLSP 2020 English-Vietnamese news translation shared task. In Proceedings of the 7th International Workshop on Vietnamese Language and Speech Processing, Hanoi, Vietnam, pp. 99–105. External Links: Link Cited by: §3.
  • M. Herrero-Zazo, I. Segura-Bedmar, P. Martínez, and T. Declerck (2013) The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. Journal of Biomedical Informatics 46 (5), pp. 914–920. External Links: ISSN 1532-0464, Document, Link Cited by: §4.
  • A. Jimeno-Yepes, É. Prieur, and A. Névéol (2012) Combining medline and publisher data to create parallel corpora for the automatic translation of biomedical text. 14, pp. 146 – 146. Cited by: §2.
  • Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu (2019) PubMedQA: A dataset for biomedical research question answering. CoRR abs/1909.06146. External Links: Link, 1909.06146 Cited by: §4.
  • A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, and R. Mark (2021) MIMIC-iv. PhysioNet. External Links: Document, Link Cited by: §1.
  • A. Johnson, T. Pollard, L. Shen, L. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, and R. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific Data 3, pp. 160035. External Links: Document Cited by: §6.1.
  • K. r. Kanakarajan, S. Ramamoorthy, V. Archana, S. Chatterjee, and M. Sankarasubbu (2019) Saama research at MEDIQA 2019: pre-trained BioBERT with attention visualisation for medical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy, pp. 510–516. External Links: Link, Document Cited by: §6.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: A lite BERT for self-supervised learning of language representations. abs/1909.11942. External Links: Link, 1909.11942 Cited by: §1.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. abs/1901.08746. External Links: Link, 1901.08746 Cited by: §1, §1, §4, §7.2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. abs/1910.13461. External Links: Link, 1910.13461 Cited by: §1, §1.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. abs/2001.08210. External Links: Link, 2001.08210 Cited by: §3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, 1st item.
  • M. Luong and C. Manning (2015) Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, Da Nang, Vietnam, pp. 76–79. External Links: Link Cited by: §3.
  • S. M. Meystre, G. K. Savova, K. C. Kipper-Schuler, and J. F. Hurdle (2008) Extracting information from textual documents in the electronic health record: a review of recent research.. Yearbook of medical informatics, pp. 128–44. External Links: ISSN 0943-4747, Link Cited by: §6.2.
  • N. Minh, V. H. Tran, V. Hoang, H. D. Ta, T. H. Bui, and S. Q. H. Truong (2022a) ViHealthBERT: pre-trained language models for vietnamese in health text mining. In Proceedings of the Language Resources and Evaluation Conference, Marseille, France, pp. 328–337. External Links: Link Cited by: §2.
  • N. Minh, V. H. Tran, V. Hoang, H. D. Ta, T. H. Bui, and S. Q. H. Truong (2022b) ViHealthBERT: pre-trained language models for vietnamese in health text mining. In Proceedings of the Language Resources and Evaluation Conference, Marseille, France, pp. 328–337. External Links: Link Cited by: §1, 1st item, 1st item, 2nd item, 3rd item, §8.1.
  • M. U. Nasir and I. Mchechesi (2022) Geographical distance is the new hyperparameter: a case study of finding the optimal pre-trained language for English-isiZulu machine translation. In Proceedings of the Workshop on Multilingual Information Access (MIA), Seattle, USA, pp. 1–8. External Links: Link, Document Cited by: §2.
  • A. Névéol, A. Jimeno Yepes, M. Neves, and K. Verspoor (2018) Parallel corpora for the biomedical domain. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. External Links: Link Cited by: §2.
  • M. Neves, A. J. Yepes, and A. Névéol (2016) The scielo corpus: a parallel corpus of scientific publications for biomedicine. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 2942–2948. External Links: Link Cited by: §2.
  • D. Q. Nguyen and A. T. Nguyen (2020) PhoBERT: pre-trained language models for vietnamese. abs/2003.00744. External Links: Link, 2003.00744 Cited by: §1, 1st item.
  • NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang (2022) No language left behind: scaling human-centered machine translation. arXiv. External Links: Document, Link Cited by: §10.
  • B. E. Nye, J. J. Li, R. Patel, Y. Yang, I. J. Marshall, A. Nenkova, and B. C. Wallace (2018) A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. CoRR abs/1806.04185. External Links: Link, 1806.04185 Cited by: §4.
  • Y. Peng, S. Yan, and Z. Lu (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and elmo on ten benchmarking datasets. CoRR abs/1906.05474. External Links: Link, 1906.05474 Cited by: §6.1.
  • L. N. Phan, J. T. Anibal, H. Tran, S. Chanana, E. Bahadroglu, A. Peltekian, and G. Altan-Bonnet (2021a) SciFive: a text-to-text transformer model for biomedical literature. abs/2106.03598. External Links: Link, 2106.03598 Cited by: §1, §4, §6.1.
  • L. N. Phan, H. Tran, D. Le, H. Nguyen, J. T. Anibal, A. Peltekian, and Y. Ye (2021b) CoTexT: multi-task learning with code-text transformer. abs/2105.08645. External Links: Link, 2105.08645 Cited by: §1.
  • L. Phan, H. Tran, H. Nguyen, and T. H. Trinh (2022) ViT5: pretrained text-to-text transformer for vietnamese language generation. arXiv. External Links: Document, Link Cited by: §1, §7.1, §7.2, 2nd item.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. abs/1910.10683. External Links: Link, 1910.10683 Cited by: §1, §1, §7.1, §7.2.
  • A. Romanov and C. Shivade (2018) Lessons from natural language inference in the clinical domain. External Links: Link, 1808.06752 Cited by: 2nd item, §1, §6.1, §6.2.
  • J. Shen, P. Chen, M. Le, J. He, J. Gu, M. Ott, M. Auli, and M. Ranzato (2021) The source-target domain mismatch problem in machine translation. abs/1909.13151. Cited by: §2.
  • Y. K. Thu, W. P. Pa, M. Utiyama, A. Finch, and E. Sumita (2016) Introducing the Asian language treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 1574–1578. External Links: Link Cited by: §2.
  • R. Tinn, H. Cheng, Y. Gu, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2021) Fine-tuning large neural language models for biomedical natural language processing. abs/2112.07869. External Links: Link, 2112.07869 Cited by: §1, §1, §4, §7.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. abs/1706.03762. External Links: Link, 1706.03762 Cited by: §1, §7.1, §7.
  • C. Wu, F. Xia, L. Deléger, and I. Solti (2011) Statistical machine translation for biomedical text: are we there yet?. 2011, pp. 1290–9. Cited by: §2.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. abs/1906.08237. External Links: Link, 1906.08237 Cited by: §1.
  • M. Yasunaga, J. Leskovec, and P. Liang (2022) LinkBERT: pretraining language models with document links. arXiv. External Links: Document, Link Cited by: §4.
  • P. Yin, G. Neubig, W. Yih, and S. Riedel (2020) TaBERT: pretraining for joint understanding of textual and tabular data. abs/2005.08314. External Links: Link, 2005.08314 Cited by: §1.
  • H. Yuan, Z. Yuan, R. Gan, J. Zhang, Y. Xie, and S. Yu (2022) BioBART: pretraining and evaluation of a biomedical generative language model. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland, pp. 97–109. External Links: Link, Document Cited by: §1, §4.