A Bilingual Parallel Corpus with Discourse Annotations

by Yuchen Eleanor Jiang, et al.

Machine translation (MT) has almost achieved human parity at sentence-level translation. In response, the MT community has, in part, shifted its focus to document-level translation. However, the development of document-level MT systems is hampered by the lack of parallel document corpora. This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set. The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena. Our resource is freely available, and we hope it will serve as a guide and inspiration for more work in document-level machine translation.

