
Sentence Alignment with Parallel Documents Helps Biomedical Machine Translation

by Shengxuan Luo, et al.

Existing neural machine translation (NMT) systems have achieved near human-level performance in the general domain for some languages, but the lack of parallel corpora remains a key problem in specific domains. In the biomedical domain, parallel corpora are less accessible. This work presents a new unsupervised sentence alignment method and explores which features matter when training biomedical NMT systems. We use a simple but effective way to build bilingual word embeddings (BWEs) to evaluate bilingual word similarity, and we transform the sentence alignment problem into an extended earth mover's distance (EMD) problem. The proposed method achieves high accuracy in both 1-to-1 and many-to-many cases. Pre-training in the general domain, a larger in-domain dataset, and n-to-m sentence pairs all benefit the NMT model. Fine-tuning on an in-domain corpus helps the translation model learn more terminology and fit the in-domain style of text.
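To make the idea of scoring a sentence pair with word-level distances concrete, here is a minimal sketch. It is not the extended EMD formulation of this work, whose details the abstract does not give. It swaps in the simpler relaxed bound known from word mover's distance, where each word sends all of its mass to its nearest counterpart in the other sentence. The function names, the toy embedding table `TOY_BWE`, and its token keys are hypothetical and exist only for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def relaxed_emd(src_tokens, tgt_tokens, embeddings):
    """Relaxed lower bound on EMD with uniform word weights.

    Assumption: both sentences are non-empty and every token has a
    vector in the shared (bilingual) embedding table.
    """
    def one_way(from_tokens, to_tokens):
        # Each word moves all of its mass to its closest counterpart.
        total = 0.0
        for w in from_tokens:
            total += min(
                cosine_distance(embeddings[w], embeddings[t])
                for t in to_tokens
            )
        return total / len(from_tokens)

    # The tighter of the two one-directional relaxations.
    return max(one_way(src_tokens, tgt_tokens),
               one_way(tgt_tokens, src_tokens))

# Made-up 2-d vectors standing in for a shared bilingual space.
TOY_BWE = {
    "en:heart": [1.0, 0.0],
    "en:disease": [0.0, 1.0],
    "xx:w1": [0.9, 0.1],   # hypothetical translation of "heart"
    "xx:w2": [0.1, 0.9],   # hypothetical translation of "disease"
}

good = relaxed_emd(["en:heart", "en:disease"], ["xx:w1", "xx:w2"], TOY_BWE)
bad = relaxed_emd(["en:heart"], ["xx:w2"], TOY_BWE)
print(good < bad)
```

A lower score suggests a better candidate pair. A full aligner would also have to search over 1-to-1 and n-to-m groupings of sentences, which this sketch does not attempt.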




An Empirical Comparison of Simple Domain Adaptation Methods for Neural Machine Translation

In this paper, we propose a novel domain adaptation method named "mixed ...

NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain

Machine translation requires large amounts of parallel text. While such ...

When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation?

Word alignment has proven to benefit many-to-many neural machine transla...

Multi-Domain Adaptation in Neural Machine Translation Through Multidimensional Tagging

Many modern Neural Machine Translation (NMT) systems are trained on nonh...

Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

We propose a novel model architecture and training algorithm to learn bi...

Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction

Generating accurate terminology is a crucial component for the practical...

Identifying Semantic Divergences in Parallel Text without Annotations

Recognizing that even correct translations are not always semantically e...