Sentence Alignment with Parallel Documents Helps Biomedical Machine Translation

04/17/2021
by Shengxuan Luo, et al.

Existing neural machine translation (NMT) systems have achieved near human-level performance in the general domain for some languages, but the lack of parallel corpora remains a key problem in specific domains. In the biomedical domain, parallel corpora are less accessible. This work presents a new unsupervised sentence alignment method and explores features in training biomedical NMT systems. We use a simple but effective way to build bilingual word embeddings (BWEs) to evaluate bilingual word similarity, and we transform the sentence alignment problem into an extended earth mover's distance (EMD) problem. The proposed method achieves high accuracy in both 1-to-1 and many-to-many cases. Pre-training in the general domain, a larger in-domain dataset, and n-to-m sentence pairs all benefit the NMT model. Fine-tuning on the in-domain corpus helps the translation model learn more terminology and fit the in-domain style of text.
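To make the idea of scoring sentence pairs through bilingual word embeddings concrete, here is a minimal sketch. It is not the authors' extended EMD formulation: it swaps in the relaxed word mover's distance, a cheap lower bound on EMD in which every word sends all of its mass to its nearest counterpart, and it handles only the 1-to-1 case. The toy embedding table `BWE`, the English-German word pairs, and the helper names `relaxed_emd` and `align` are all assumptions made for illustration.

```python
import math

# Hypothetical toy bilingual word embeddings (BWEs): English and German
# words placed in one shared 2-D space. Real BWEs would be learned from data.
BWE = {
    "heart": (1.0, 0.0), "herz": (0.9, 0.1),
    "disease": (0.0, 1.0), "krankheit": (0.1, 0.9),
    "drug": (-1.0, 0.0), "medikament": (-0.9, 0.1),
    "dose": (0.0, -1.0), "dosis": (0.1, -0.9),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def one_way_cost(src, tgt):
    """Each source word sends its uniform mass to its nearest target word."""
    total = sum(
        min(cosine_distance(BWE[s], BWE[t]) for t in tgt) for s in src
    )
    return total / len(src)


def relaxed_emd(src, tgt):
    """Lower bound on EMD: the larger of the two one-way relaxations."""
    return max(one_way_cost(src, tgt), one_way_cost(tgt, src))


def align(src_sents, tgt_sents):
    """Greedy 1-to-1 alignment: index of the closest target per source."""
    return [
        min(range(len(tgt_sents)), key=lambda j: relaxed_emd(s, tgt_sents[j]))
        for s in src_sents
    ]


# Two tokenized sentences per side, with the target side in shuffled order.
english = [["heart", "disease"], ["drug", "dose"]]
german = [["medikament", "dosis"], ["herz", "krankheit"]]
alignment = align(english, german)
print(alignment)
```

The greedy choice per source sentence keeps the sketch short; a many-to-many aligner would additionally need to consider merging adjacent sentences, which this example does not attempt.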


research
01/12/2017

An Empirical Comparison of Simple Domain Adaptation Methods for Neural Machine Translation

In this paper, we propose a novel domain adaptation method named "mixed ...
research
05/18/2020

NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain

Machine translation requires large amounts of parallel text. While such ...
research
04/26/2022

When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation?

Word alignment has proven to benefit many-to-many neural machine transla...
research
02/19/2021

Multi-Domain Adaptation in Neural Machine Translation Through Multidimensional Tagging

Many modern Neural Machine Translation (NMT) systems are trained on nonh...
research
06/05/2019

Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

We propose a novel model architecture and training algorithm to learn bi...
research
05/12/2021

Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction

Generating accurate terminology is a crucial component for the practical...
research
03/29/2018

Identifying Semantic Divergences in Parallel Text without Annotations

Recognizing that even correct translations are not always semantically e...
