DuTongChuan: Context-aware Translation Model for Simultaneous Interpreting

by   Hao Xiong, et al.
Baidu, Inc.

In this paper, we present DuTongChuan, a novel context-aware translation model for simultaneous interpreting. The model constantly reads streaming text from the Automatic Speech Recognition (ASR) model and simultaneously determines the boundaries of Information Units (IUs) one after another. Each detected IU is then translated into a fluent translation with two simple yet effective decoding strategies: partial decoding and context-aware decoding. In practice, by controlling the granularity of IUs and the size of the context, we can easily trade off latency against translation quality. Elaborate evaluation by human translators reveals that our system achieves promising translation quality (85.71% for Chinese-English and 86.36% for English-Chinese), especially in terms of discourse coherence. According to an End-to-End (speech-to-speech simultaneous interpreting) evaluation, the model reduces latency to less than 3 seconds most of the time. Furthermore, we have successfully deployed this model in a variety of Baidu's products, which have hundreds of millions of users, and we release it as a service on our AI platform.




1 Introduction

Recent progress in Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) has facilitated research on automatic speech translation, with applications to live and streaming scenarios such as Simultaneous Interpreting (SI). In contrast to non-real-time speech translation, simultaneous interpreting starts translating the source speech before the speaker finishes speaking (translating the ongoing speech while listening to it). Because of this distinguishing feature, simultaneous interpreting is widely used by multilateral organizations (UN/EU), international summits (APEC/G-20), legal proceedings, and press conferences. Despite recent advances Ma et al. (2019); Arivazhagan et al. (2019), simultaneous interpreting remains notoriously difficult Ma et al. (2019) because of two well-known and competing requirements: high-quality translation and low latency.

Figure 1: For this sentence, a full-sentence NMT model produces an appropriate translation with, however, a long latency in the context of simultaneous translation, as it needs to wait until the end of the full sentence to start translating. In contrast, a sub-sentence NMT model outputs a translation with less coherence and fluency, although it has a relatively short latency as it starts translating after reading the comma in the source text.
Figure 2: This example shows a special case using sub-sentences as our information units. The blue solid squares indicate the scope of the source context we use for translation. The text in red is the incrementally generated translation. We discard a preceding generated token to make the translation coherent.

Many studies present methods to improve translation quality by making the translation model more robust against ASR errors Tsvetkov et al. (2014); Chen et al. (2017); Sperber et al. (2017); Cheng et al. (2018); Liu et al. (2018); Li et al. (2018). On the other hand, to reduce latency, some researchers propose models that start translating after reading a few source tokens Fujita et al. (2013); Grissom II et al. (2014); Cho and Esipova (2016); Gu et al. (2017); Niehues et al. (2018); Arivazhagan et al. (2019). As one representative work on this topic, we recently presented a translation model using a prefix-to-prefix framework with a wait-k policy Ma et al. (2019). This model is simple yet effective in practice, achieving impressive performance in both translation quality and latency.

However, existing work pays less attention to the fluency of translation, which is extremely important in the context of simultaneous translation. For example, consider a sub-sentence NMT model that starts to translate after reading a sub-sentence rather than waiting until the end of a sentence as full-sentence models do. This reduces the time spent waiting for the source language speech. However, as shown in Figure 1, although the translation of each sub-sentence is barely adequate, the translation of the entire source sentence lacks coherence and fluency. Moreover, the model produces an inappropriate translation “your own” for the source token “自己” because the preceding sub-sentence is absent from its context.

To make the simultaneous machine translation more accessible and producible, we borrow SI strategies used by human interpreters to create our model. As shown in Figure 2, this model is able to constantly read streaming text from the ASR model, and simultaneously determine the boundaries of Information Units (IUs) one after another. Each detected IU is then translated into a fluent translation with two simple yet effective decoding strategies: partial decoding and context-aware decoding. Specifically, IUs at the beginning of each sentence are sent to the partial decoding module. Other information units, either appearing in the middle or at the end of a sentence, are translated into target language by the context-aware decoding module. Notice that this module is able to exploit additional context from the history so that the model can generate coherent translation. This method is derived from the “salami technique” Roderick (1998); Gile (2009), or “chunking”, one of the most commonly used strategies by human interpreters to cope with the linearity constraint in simultaneous interpreting. Having severely limited access to source speech structure in SI, interpreters tend to slice up the incoming speech into smaller meaningful pieces that can be directly rendered or locally reformulated without having to wait for the entire sentence to unfold.

In general, several advantages distinguish our model from previous work:

  • We propose a practical solution for simultaneous interpreting including an information unit detector and a tailored NMT model.

  • We can trade off latency and translation quality easily by controlling the granularity of IUs and the size of the context.

  • The context-aware decoding module enables our model to generate fluent translations.

Figure 3: In our context-aware translation model, the boundaries of information units in streaming ASR input are determined by a novel IU boundary detector. IUs at different positions are translated using different NMT models: if an IU stands at the beginning of a sentence, it is translated by the partial decoding module; otherwise, context-aware decoding is applied to translate the IU into a coherent translation. Notice that the dashed squares in the first line denote the anchor used to determine the IU boundary.

For a comprehensive evaluation of our system, we use two evaluation metrics: translation quality and latency. According to the automatic evaluation metrics, our system performs well in both translation quality and latency. In the speech-to-speech scenario, our model achieves an acceptability of 85.71% for Chinese-English translation and 86.36% for English-Chinese translation in human evaluation. Moreover, the output speech lags behind the source speech by less than 3 seconds on average, which offers a surprisingly good experience for machine translation users Lee (2002); Lamberger-Felber (2001); Timarová et al. (2015). We also ask three interpreters with SI experience to simultaneously interpret the test speech in a mock conference setting. However, the target texts transcribed from human SI obtain worse BLEU scores, because the references in the test set come from written translation rather than simultaneous interpreting. More importantly, when evaluated by human translators, the performance of the NMT model is comparable to that of professional human interpreters.

The contributions of this paper can be summarized as follows:

  • We propose a novel context-aware translation model for simultaneous interpreting.

  • We deliver a novel speech translation corpus for evaluating simultaneous machine translation.

  • We conduct elaborate experiments showing our context-aware translation model’s impressive performance in improving translation quality and shortening latency.

  • We propose a novel comparison between the text results from human simultaneous interpreting and machine translation.

2 Context-aware Translation Model

As shown in Figure 3, our model consists of two key modules: an information unit boundary detector and a tailored NMT model. During translation, the IU detector determines the boundary of each IU while constantly reading the streaming input from the ASR model. Then, different decoding strategies are applied to translate IUs at different positions.

In this section, we use “IU” to denote one sub-sentence for ease of description. In effect, however, our translation model is a general solution for simultaneous interpreting and is compatible with IUs at arbitrary granularity, e.g., clause level, phrase level, or word level.

For example, by treating a full-sentence as an IU, the model is reduced to the standard translation model. When the IU is one segment, it is reduced to the segment-to-segment translation model Oda et al. (2014); Niehues et al. (2018). Moreover, if we treat one token as an IU, it is reduced to our previous wait-k model Ma et al. (2019). The key point of our model is to train the IU detector to recognize the IU boundary at the corresponding granularity.

In the remainder of this section, we introduce the above two components in detail.

2.1 Dynamic Context Based Information Unit Boundary Detector

Figure 4: A running example of our dynamic context based IU boundary detector. In this example, the model learns to determine the classification of the current anchor, “姬” (we insert an additional symbol, SEP, to be consistent with the training format in the work of Devlin et al. (2019)). If the probability (0.4 in the left side case) of a boundary decision at the present anchor is smaller than a threshold, then it is necessary to consider more context (additional context: “这个”) to make a reliable decision (0.8 in the right side case).

Recent success in pre-training indicates that a pre-trained language representation is beneficial to downstream natural language processing tasks, including classification and sequence labeling problems Devlin et al. (2019); Sun et al. (2019); Yang et al. (2019). We thus formulate IU boundary detection as a classification problem, and fine-tune the pre-trained model on a small training corpus. After several iterations of fine-tuning, the model learns to recognize the boundaries of information units correctly.

As shown in Figure 4, the model tries to predict the potential class for the current position. Once the position is assigned to a definitely positive class, its preceding sequence is labeled as one information unit. One distinguishing feature of this model is that we allow it to wait for more context so that it can make a reliable prediction. We call this model a dynamic context based information unit boundary detector.

Definition 1.

Assuming the model has already read a sequence $x = \{x_1, \dots, x_i, \dots, x_j\}$ with $j$ tokens, we denote $x_i$ as the anchor, and the subsequence $\{x_{i+1}, \dots, x_j\}$ with $j - i$ tokens as the dynamic context.

For example, in Figure 4, the anchor in both cases is “姬”, and the dynamic context is “这” in the left side case and “这个” in the right side case.

Definition 2.

If the normalized probability $p_i$ for the prediction of the current anchor $x_i$ is larger than a threshold $\delta_1$, then the sequence $x_{\le i}$ is a complete sequence; if $p_i$ is smaller than a threshold $\delta_2$ ($\delta_2 < \delta_1$), it is an incomplete sequence; otherwise it is an undetermined sequence.

For a complete sequence $x_{\le i}$, we send it to the corresponding translation model (an additional model could be developed to predict the punctuation for the complete sequence; the detector could also be extended to predict the punctuation directly, e.g., 0: no punctuation; 1: comma; 2: period; 3: question mark). Afterwards, the detector continues to recognize boundaries in the rest of the sequence ($x_{i+1}, \dots, x_j$). For an incomplete sequence, we take $x_{i+1}$ as the new anchor for further detection. For an undetermined sequence, as shown in Figure 4, the model waits for a new token $x_{j+1}$, and takes ($x_{i+1}, \dots, x_{j+1}$) as the dynamic context for further prediction.
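The three-way decision in Definition 2 can be sketched as a small streaming loop. This is an illustrative sketch only: the threshold values, the `max_context` cap on waiting, and the `toy_prob` scorer are assumptions made for the example and stand in for the fine-tuned classifier described above.

```python
from typing import Callable, List, Sequence

# Hypothetical thresholds; the text only requires LOW < HIGH.
HIGH, LOW = 0.7, 0.3


def detect_units(
    tokens: Sequence[str],
    boundary_prob: Callable[[Sequence[str], int], float],
    max_context: int = 3,  # assumed cap on waiting, not taken from the paper
) -> List[List[str]]:
    """Split a token stream into information units.

    boundary_prob(seen, anchor) scores how likely seen[anchor] ends a unit;
    seen[anchor + 1:] is the dynamic context read so far.
    """
    units: List[List[str]] = []
    start = anchor = context = 0
    while anchor < len(tokens):
        seen = tokens[: anchor + 1 + context]
        p = boundary_prob(seen, anchor)
        can_wait = context < max_context and len(seen) < len(tokens)
        if p > HIGH:                         # complete: emit the unit
            units.append(list(tokens[start : anchor + 1]))
            start, anchor, context = anchor + 1, anchor + 1, 0
        elif LOW <= p <= HIGH and can_wait:  # undetermined: read one more token
            context += 1
        else:                                # incomplete (or cannot wait): move the anchor
            anchor, context = anchor + 1, 0
    if start < len(tokens):                  # flush whatever remains at end of stream
        units.append(list(tokens[start:]))
    return units


def toy_prob(seen: Sequence[str], anchor: int) -> float:
    """Stand-in for the fine-tuned classifier (purely illustrative)."""
    if seen[anchor] == ".":
        return 0.9
    if seen[anchor] == "wrong":
        # Unsure with no context, confident once one more token is visible.
        return 0.9 if len(seen) > anchor + 1 else 0.5
    return 0.1


stream = "she said i was wrong that one is called yaoji .".split()
units = detect_units(stream, toy_prob)
```

In this toy run the detector is undecided at “wrong”, reads one more token, and only then commits to the boundary, which mirrors the left and right cases of Figure 4.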

Figure 5: Source and target representation for training partial decoding model, where we mask the second sub-sentence by summing a negative infinite number when training the partial decoding model. For simplicity, we omit the embeddings for the target side.

In the training stage, consider a common sentence consisting of two sub-sequences, $s_1$ and $s_2$. We collect $s_1$ plus any token in $s_2$ as positive training samples, and the remaining sub-sequences as negative training samples. We refer readers to the Appendix for more details.

In the decoding stage, we begin by setting the size of the dynamic context to 0, and then determine whether to read more context according to the principle in Definition 2.

2.2 Partial Decoding

Traditional NMT models are usually trained on bilingual corpora containing only complete sentences. In our context-aware translation model, however, information units are usually sub-sentences. Intuitively, if we use a conventional NMT model to translate such information units, the discrepancy between training and decoding will lead to problematic translations. Moreover, conventional NMT models rarely anticipate, whereas in simultaneous interpreting human interpreters often have to anticipate the upcoming input and render a constituent at the same time as, or even before, it is uttered by the speaker.

In our previous work Ma et al. (2019), training a wait-k policy differs slightly from the traditional method. When predicting the first target token, we mask the source content behind the k-th token, in order to make the model learn to anticipate. The predictions of the other tokens are obtained by moving the mask window token by token from position k to the end of the line. According to our practical experiments, this training strategy does help the model anticipate correctly most of the time.
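The moving mask window can be made concrete with a small sketch. It assumes the wait-k schedule of Ma et al. (2019), in which the t-th target token (1-based) sees the first min(k + t - 1, |x|) source tokens; the function name and the boolean-matrix representation are choices made for this illustration.

```python
from typing import List


def wait_k_mask(src_len: int, tgt_len: int, k: int) -> List[List[bool]]:
    """Return mask[t][s], True when target step t (0-based) may see source token s."""
    mask = []
    for t in range(tgt_len):
        # With 0-based t this is min(k + t, |x|): k tokens for the first
        # target token, then one more source token per target step.
        visible = min(k + t, src_len)
        mask.append([s < visible for s in range(src_len)])
    return mask


mask = wait_k_mask(src_len=5, tgt_len=4, k=2)
visible_counts = [sum(row) for row in mask]
```

Here `visible_counts` grows by one per target step until the whole source is visible, which is the window movement described above.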

Following our previous work, we propose the partial decoding model, a tailored NMT model for translating the IUs that appear at the beginning of each sentence. As depicted in Figure 5, in the training stage, we mask the second sub-sentence both in the source and target side. While translating the first sub-sentence, the model learns to anticipate the content after the comma, and produces a temporary translation that can be further completed with more source context. Clearly, this method relies on the associated sub-sentence pairs in the training data (black text in Figure 5). In this paper, we propose an automatic method to acquire such sub-sentence pairs.

Figure 6: Source and target representation for training incremental decoding model. We do not mask the source input, but mask the target sequence aligned to the first sub-sentence.
Definition 3.

Given a source sentence $x$ with $n$ tokens, a target sentence $y$ with $m$ tokens, and a word alignment set $A$ in which each alignment $(i, j)$ is a tuple indicating that a word alignment exists between the source token $x_i$ and the target token $y_j$, a sub-sentence pair holds if it satisfies the following conditions:


To acquire the word alignment, we run the open source toolkit fast_align (https://github.com/clab/fast_align), and use a variety of standard symmetrization heuristics to generate the alignment matrix. We first train the model on a normal bilingual corpus, and then fine-tune it on a special training corpus containing sub-sentence pairs.
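As a rough illustration of how sub-sentence pairs could be extracted from word alignments, the sketch below assumes the usual alignment-consistency criterion from phrase extraction: a source comma and a target comma form a valid split if no alignment link crosses it. This criterion, the `split_points` helper, and the toy sentence pair are assumptions for illustration, not the exact conditions of Definition 3.

```python
from typing import List, Sequence, Set, Tuple

COMMAS = {",", "，"}


def split_points(
    src: Sequence[str], tgt: Sequence[str], align: Set[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Return (a, b) index pairs where src[:a+1] and tgt[:b+1] form a sub-sentence pair.

    Assumed criterion: src[a] and tgt[b] are commas, and every alignment
    link (i, j) lies entirely before or entirely after the split.
    """
    points = []
    for a, s_tok in enumerate(src):
        if s_tok not in COMMAS:
            continue
        for b, t_tok in enumerate(tgt):
            if t_tok not in COMMAS:
                continue
            if all((i <= a) == (j <= b) for i, j in align):
                points.append((a, b))
    return points


# Toy example with hand-written 0-based alignment links (hypothetical data).
src = ["她说", "我", "错了", "，", "那个", "叫", "妖姬"]
tgt = "she said i was wrong , that is called yaoji".split()
align = {(0, 0), (0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (4, 6), (5, 8), (6, 9)}

clean = split_points(src, tgt, align)
crossing = split_points(src, tgt, align | {(1, 7)})  # one link crosses the comma
```

In practice the links would come from the symmetrized fast_align output rather than being written by hand.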

2.3 Context-aware Decoding

For IUs that have one preceding sub-sentence, the context-aware decoding model is applied to translate them based on the pre-generated translations. The requirements of this model are obvious:

  • The model is required to exploit more context to continue the translation.

  • The model is required to generate the coherent translation given partial pre-generated translations.

Intuitively, the above requirements can be satisfied with a forced decoding strategy. For example, when translating the second sub-sentence in “这点也是以前让我非常地诧异,也是非常纠结的地方”, given the already-produced translation of the first sub-sentence “It also surprised me very much before .”, the model finishes the translation by adding “It’s also a very surprising , tangled place .”. Clearly, the translation is neither accurate nor fluent, because of the redundant constituent “surprising”. We ascribe this to the discrepancy between training and decoding. In the training stage, the model learns to predict the translation based on the full source sentence. In the decoding stage, the source contexts for translating the first sub-sentence and the second sub-sentence are different. Forcing the model to reproduce the existing translation of the first sub-sentence is therefore very likely to cause under-translation or over-translation.

To produce more adequate and coherent translation, we make the following refinements:

  • During training, we force the model to focus on learning how to continue the translation without over-translation and under-translation.

  • During decoding, we discard a few previously generated translations, in order to make more fluent translations.

As shown in Figure 6, during training, we do not mask the source input, instead we mask the target sequence aligned to the first sub-sentence. This strategy will force the model to learn to complete the half-way done translation, rather than to concentrate on generating a translation of the full sentence.

Moreover, in the decoding stage, as shown in Figure 7, we propose to discard the last $k$ tokens from the generated partial translation (in most cases, discarding only the last token already gives promising results). The context-aware decoding model then completes the rest of the translation. The motivation is that the translation of the tail of a sub-sentence is largely influenced by the content of the succeeding sub-sentence. By discarding a few tokens from the previously generated translation, the model is able to generate a more appropriate translation. In our practical experiments, this slight modification proves effective in generating fluent translations.
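The discard-and-continue step can be sketched as follows. The `continue_fn` callable stands in for the context-aware decoding model and `toy_continue` is a fixed stub invented for this example; only the truncation logic is the point here.

```python
from typing import Callable, List, Sequence


def context_aware_decode(
    source_so_far: Sequence[str],
    prev_translation: Sequence[str],
    continue_fn: Callable[[Sequence[str], Sequence[str]], List[str]],
    k: int = 1,
) -> List[str]:
    """Drop the last k generated tokens, then let the model finish the translation."""
    keep = max(0, len(prev_translation) - k)  # also safe when k is 0 or too large
    prefix = list(prev_translation[:keep])
    # The model is conditioned on the extended source and the kept target prefix.
    return prefix + continue_fn(source_so_far, prefix)


def toy_continue(source: Sequence[str], prefix: Sequence[str]) -> List[str]:
    """Hypothetical stub that returns a canned continuation."""
    return ["and", "confused", "me", "."]


prev = "it also surprised me very much before .".split()
out = context_aware_decode(["sub1", "sub2"], prev, toy_continue, k=1)
```

With `k=1` the final period of the partial translation is removed, so the continuation can join the two sub-sentences instead of starting a new sentence.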

Figure 7: In the decoding stage, the context-aware decoding model discards the last $k$ tokens (in this example, $k = 1$) from the generated partial translation to produce a fluent translation.

3 Latency Metric: Equilibrium Efficiency

Ma et al. (2019) and Arivazhagan et al. (2019) use average lagging as the metric for evaluating latency. However, this metric has two major flaws:

1) This metric is unsuitable for evaluating the sub-sentence model. Take the sentence in Figure 2 as an example. When the model has read four tokens “她说    错了  那个” and generated six target tokens “She said I was wrong ,”, the lag of the last target token is negative according to the original definition.

2) This metric is unsuitable for evaluating latency in the speech-to-speech translation scenario. Ma et al. (2019) assume that a target token generated after the cut-off point does not cause any lag. However, this assumption only holds in the speech-to-text scenario. In the speech-to-speech scenario, it is necessary to consider the time for playing the last synthesized speech.
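The negative lag in the first point can be reproduced with a small calculation. The sketch uses the per-token lag from the average lagging definition of Ma et al. (2019), g(t) - (t - 1) / r with r = |y| / |x|, where g(t) is the number of source tokens read when target token t is emitted. The sentence lengths below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def token_lag(g_t: int, t: int, src_len: int, tgt_len: int) -> float:
    """Lag of the t-th target token (1-based) under average lagging."""
    r = tgt_len / src_len  # target-to-source length ratio
    return g_t - (t - 1) / r


# Four source tokens read, sixth target token emitted; assumed lengths |x| = |y| = 10.
lag = token_lag(g_t=4, t=6, src_len=10, tgt_len=10)
```

Because a sub-sentence model emits a burst of target tokens after reading only a few source tokens, t - 1 can exceed r times g(t), which drives the lag below zero.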

Therefore, we instead propose a novel metric, Equilibrium Efficiency (EE), which measures the efficiency of equilibrium strategy.

Definition 4.

Consider a sentence with subsequences, and let be the length of source subsequence that emits a target subsequence with tokens. Then the equilibrium efficiency is: , where is defined as:


and , is an empirical factor.

In practice, we set the empirical factor to 0.3 for Chinese-English translation (reading about 200 English tokens in one minute). The motivation of EE is that a good model should equilibrate the time for playing the target speech with the time for listening to the speaker. Assuming that playing one word takes one second, EE measures the latency between the speaker finishing the speech and the audience hearing the final target word. For example, for the sentence in Figure 3, the time for playing the sequence “She said I was wrong” is equilibrated with the time the speaker takes to utter the second sub-sentence “那个  叫  什么  什么 呃 妖姬”.

4 Evaluation

We conduct multiple experiments to evaluate the effectiveness of our system in terms of translation quality and latency.

4.1 Data Description

4.1.1 NIST Chinese-English

We use a subset of the data available for the NIST OpenMT08 task (LDC2002E18, LDC2002L27, LDC2002T01, LDC2003E07, LDC2003E14, LDC2004T07, LDC2005E83, LDC2005T06, LDC2005T10, LDC2005T34, LDC2006E24, LDC2006E26, LDC2006E34, LDC2006E86, LDC2006E92, LDC2006E93, LDC2004T08 (HK News, HK Hansards)). The parallel training corpus contains approximately 2 million sentence pairs. We choose the NIST 2006 (NIST06) dataset as our development set, and the NIST 2002 (NIST02), 2003 (NIST03), 2004 (NIST04), 2005 (NIST05), and 2008 (NIST08) datasets as our test sets.

We will use this dataset to evaluate the performance of our partial decoding and context-aware decoding strategy from the perspective of translation quality and latency.

| Dataset | Talks | Utterances | Transcription (characters) | Translation (tokens) | Audio (hours) | CER (1-best) | CER (lattice) |
|---|---|---|---|---|---|---|---|
| Train | 174 | 26,553 | 796,679 | 2,292,025 | 50.57 | 17.32% | 15.68% |
| Dev | 16 | 956 | 26,059 | 75,074 | 1.58 | 15.21% | 13.20% |
| Test | 6 | 975 | 25,832 | 70,503 | 1.46 | 10.32% | 8.57% |

Table 1: The summary of our proposed speech translation data. The volume of transcription is counted by characters, the volume of translation is counted by tokens, and the audio duration is counted by hours.

4.1.2 BSTC Chinese-English

We recently released the Baidu Speech Translation Corpus (BSTC) for open research (http://ai.baidu.com/broad/subordinate?dataset=bstc). This dataset covers speeches in a wide range of domains, including IT, economy, culture, biology, and arts. We transcribe the talks carefully, and have professional translators produce the English translations. This procedure is extremely difficult because of the large number of domain-specific terms, speech redundancies, and speakers’ accents. We expect that this dataset will help researchers develop robust NMT models for speech translation. In summary, several features distinguish this dataset from previous related resources:

  • Speech irregularities are kept in the transcription but omitted in the translation (e.g., filler words like “嗯, 呃, 啊”, and unconscious repetitions like “这个这个呢”), so the data can be used to evaluate the robustness of NMT models in dealing with spoken language.

  • Each talk’s transcription is translated into English by a single translator, and then segmented into bilingual sentence pairs according to the sentence boundaries in the English translations. Therefore, every sentence is translated based on the understanding of the entire talk and is translated faithfully and coherently in global sense.

  • We use the streaming multi-layer truncated attention model (SMLTA) (http://research.baidu.com/Blog/index-view?id=109), trained on a large-scale speech corpus (more than 10,000 hours) and fine-tuned on a number of talk-related corpora (more than 1,000 hours), to generate the 5-best automatically recognized text for each speech.

  • The test dataset includes interpretations produced by simultaneous interpreters with professional experience. This dataset contributes an essential resource for the comparison between translation and interpretation.

We randomly extract several talks from the dataset, and divide them into the development and test set. In Table 1, we summarize the statistics of our dataset. The average number of utterances per talk is 152.6 in the training set, 59.75 in the dev set, and 162.5 in the test set.

We first run the standard Transformer model on the NIST dataset. Then we evaluate the quality of the pre-trained model on our proposed speech translation dataset, and propose effective methods to improve the performance of the baseline. Because the test data in this dataset contains ASR errors and speech irregularities, it can be used to evaluate the robustness of novel methods.

4.1.3 Large-scale Chinese-English and English-Chinese

In the final deployment, we train our model using a corpus containing approximately 200 million bilingual pairs both in Chinese-English and English-Chinese translation tasks.

4.2 Data Preprocess

To preprocess the Chinese and the English texts, we use an open source Chinese segmenter (https://github.com/fxsjy/jieba) and the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl). After tokenization, we convert all English letters to lower case, and we use the “multi-bleu.pl” script (https://github.com/moses-smt/) to calculate BLEU scores. Except in the large-scale experiments, we apply byte-pair encoding (BPE) Sennrich et al. (2016) to both Chinese and English, setting the vocabulary size to 20K for Chinese and 18K for English. In the large-scale experiments, we use a joint vocabulary for both the Chinese-English and English-Chinese translation tasks, with a vocabulary size of 40K.
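For readers unfamiliar with BPE, the following toy sketch shows how merge operations are learned, following the algorithm described by Sennrich et al. (2016). It is not the tooling used in the experiments; the tiny vocabulary is the standard textbook example, and the function names are chosen for this illustration.

```python
import collections
import re
from typing import Dict, List, Tuple


def pair_counts(vocab: Dict[str, int]) -> collections.Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: collections.Counter = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Dict[str, int]) -> Dict[str, int]:
    """Replace every standalone occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(vocab: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Greedily learn merge operations from a space-separated symbol vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Words are pre-split into characters with an end-of-word marker.
toy_vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = learn_bpe(toy_vocab, 3)
```

The number of merges controls the final vocabulary size, which is how limits such as 20K or 18K are reached on real corpora.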

4.3 Model Settings

We implement our models using PaddlePaddle (https://github.com/paddlepaddle/paddle), an end-to-end open source deep learning platform developed by Baidu. It provides a complete suite of deep learning libraries, tools and service platforms to make the research and development of deep learning simple and reliable. For training our dynamic context sequence boundary detector, we use ERNIE Sun et al. (2019) as our pre-trained model.

| Models | NIST02 | NIST03 | NIST04 | NIST05 | NIST08 | Average |
|---|---|---|---|---|---|---|
| baseline | 49.40 | 49.71 | 50.03 | 48.83 | 44.38 | 40.39 |
| sub-sentence | 45.41 | 45.62 | 46.06 | 43.63 | 43.11 | 37.31 |
| wait-1 | 38.37 | 36.87 | 38.17 | 36.09 | 35.31 | 30.80 |
| wait-3 | 40.75 | 39.30 | 40.57 | 38.18 | 38.29 | 32.85 |
| wait-5 | 42.76 | 41.43 | 43.29 | 40.43 | 39.62 | 34.59 |
| wait-7 | 44.05 | 42.94 | 44.17 | 42.25 | 40.61 | 35.67 |
| wait-9 | 45.71 | 44.49 | 45.74 | 43.14 | 41.63 | 36.78 |
| wait-12 | 46.67 | 45.63 | 46.86 | 44.59 | 42.83 | 37.76 |
| wait-15 | 46.41 | 46.43 | 47.38 | 45.63 | 43.60 | 38.24 |
| Treat the information unit as sub-sentence (IU=sub-sentence) | | | | | | |
| +context-aware | 47.79 | 48.11 | 48.29 | 46.55 | 44.57 | 39.22 |
| +partial decoding | 48.46 | 48.51 | 48.53 | 47.05 | 45.43 | 39.66 |
| +discard 2 tokens | 48.61 | 48.54 | 48.68 | 47.11 | 45.08 | 39.67 |
| +discard 3 tokens | 48.62 | 48.52 | 48.87 | 47.16 | 45.30 | 39.75 |
| +discard 4 tokens | 48.71 | 48.69 | 49.10 | 47.32 | 45.11 | 39.82 |
| +discard 5 tokens | 48.82 | 48.78 | 48.98 | 47.31 | 44.48 | 39.73 |
| +discard 6 tokens | 48.94 | 48.70 | 48.77 | 47.21 | 44.33 | 39.66 |
| Treat the information unit as segment (IU=segment) | | | | | | |
| +discard 1 tokens | 46.89 | 45.40 | 47.05 | 45.36 | 43.06 | 37.96 |
| +discard 2 tokens | 48.09 | 46.98 | 48.45 | 46.50 | 44.00 | 39.00 |
| +discard 3 tokens | 48.70 | 47.87 | 48.85 | 47.01 | 44.48 | 39.49 |
| +discard 4 tokens | 48.75 | 48.09 | 48.99 | 46.86 | 45.07 | 39.63 |
| +discard 5 tokens | 48.84 | 48.37 | 48.71 | 46.95 | 44.76 | 39.56 |
| +discard 6 tokens | 48.88 | 48.60 | 48.85 | 47.17 | 44.84 | 39.72 |

Table 2: The overall results on NIST Chinese-English translation task.

For fair comparison, we implement the following models:

  • baseline: A standard Transformer-based model using the big hyperparameter configuration.

  • sub-sentence: We split a full sentence into multiple sub-sentences by comma, and translate them using the baseline model. To evaluate the translation quality, we concatenate the translation of each sub-sentence into one sentence.

  • wait-k: This is our previous work Ma et al. (2019).

  • context-aware: This is our proposed model using context-aware decoding strategy, without fine-tuning on partial decoding model.

  • partial decoding: This is our proposed model using partial decoding.

  • discard tokens: The previously generated tokens are removed to complete the rest of the translation by the context-aware decoding model.

4.4 Experiments

4.4.1 NIST Chinese-English

We firstly conduct our experiments on the NIST Chinese-English translation task.

To validate the effectiveness of our translation model, we run two baseline models, baseline and sub-sentence. We also compare the translation quality as well as latency of our models with the wait-k model.

Effectiveness on Translation Quality. As shown in Table 2, there is a large gap between the sub-sentence model and the baseline model. On average, the sub-sentence model is weaker by 3.08 BLEU (40.39 → 37.31). Similarly, the wait-k model also brings an obvious decrease in translation quality: even with the best wait-15 policy, its performance is still worse than the baseline system, with an average drop of 2.15 BLEU (40.39 → 38.24). For a machine translation product, a large degradation in translation quality will seriously affect the user experience even if latency is low.

Unsurprisingly, when treating sub-sentences as IUs, our proposed model significantly improves the translation quality over the sub-sentence model, by an average of 2.35 BLEU (37.31 → 39.66), and its performance is only slightly lower than the baseline system, by 0.73 average BLEU (40.39 → 39.66). Moreover, when we allow the model to discard a few previously generated tokens, the performance can be further improved to 39.82 (+0.16), at a small cost of longer latency (see Figure 8). It is consistent with our intuition that our novel partial decoding strategy brings stable improvement on each test set. It achieves an average improvement of 0.44 BLEU (39.22 → 39.66) over the context-aware system, in which we do not fine-tune the trained model with the partial decoding strategy. An interesting finding is that our translation model performs better than the baseline system on the NIST08 test set. We analyze the translation results and find that the sentences in NIST08 are extremely long, which makes it harder for the standard Transformer to learn good representations Tang et al. (2018). By using the context-aware decoding strategy to generate consistent and coherent translations, our model performs better because it focuses on translating relatively shorter sub-sentences.
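The BLEU differences quoted above follow directly from the Average column of Table 2. The short check below recomputes them; the dictionary keys are labels chosen for this sketch, and the values are copied from the table.

```python
# Average BLEU scores copied from the Average column of Table 2.
average_bleu = {
    "baseline": 40.39,
    "sub-sentence": 37.31,
    "wait-15": 38.24,
    "context-aware": 39.22,
    "partial decoding": 39.66,
    "discard 4 tokens": 39.82,
}


def delta(src: str, dst: str) -> float:
    """BLEU change when moving from system src to system dst."""
    return round(average_bleu[dst] - average_bleu[src], 2)


deltas = {
    "sub-sentence vs baseline": delta("baseline", "sub-sentence"),
    "wait-15 vs baseline": delta("baseline", "wait-15"),
    "partial decoding vs sub-sentence": delta("sub-sentence", "partial decoding"),
    "partial decoding vs baseline": delta("baseline", "partial decoding"),
    "discard 4 vs partial decoding": delta("partial decoding", "discard 4 tokens"),
    "partial decoding vs context-aware": delta("context-aware", "partial decoding"),
}
```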

Figure 8: We show the latency for our proposed model (Left), and model (Right). For better understanding, we use the to represent the latency. The represents the baseline system, and the + represents the sub-sentence system.

Investigation on Decoding Based on Segment. Intuitively, treating one segment as an IU reduces the time spent waiting for more input. Therefore, we split the test data into segments according to the principle in Definition 3 (if the boundary token in Definition 3 is a comma, the data is a sub-sentence pair; otherwise it is a segment pair). Note that test data generated this way is unavailable in a real product, because extracting the segment pairs requires the target translation. In practice, the sequence detector has to make its decisions at the segment level.

As Table 2 shows, although discarding 1 token with segment-based decoding yields worse translation quality than sub-sentence-based decoding (37.96 vs. 39.66), performance improves significantly when the model is allowed to discard more previously generated tokens. In particular, discarding 6 tokens obtains an impressive result, with an average improvement of 1.76 BLEU (37.96 → 39.72).

Effects of Discarding Preceding Generated Tokens. As mentioned and depicted in Figure 7, our context-aware NMT model discards one token of the previously generated translation. One may ask whether discarding more of the generated translation leads to better translation quality. When decoding on sub-sentences, even the best setting, discarding 4 tokens, brings no significant improvement (39.66 → 39.82) and slightly increases latency (see Figure 8 for the visualized latency). When decoding on segments, however, discarding just two tokens already brings a significant improvement (37.96 → 39.00). This finding suggests that our partial decoding model is able to generate accurate translations by anticipating future content. It also indicates that anticipation based on a larger context is more robust than the aggressive anticipation in the wait-k model and in the segment-based decoding model.

Models      Precision (%)  Recall (%)  F-score (%)  Average Latency  Max Latency
5-LM        55.30          72.63       62.79        8.68             46
RNN         67.61          70.35       68.95        9.79             48
Our model   75.09          81.70       78.26        10.49            39
Table 3: Comparison between our sequence detector and previous work. Latency is the number of words required before an explicit decision can be made.

Effectiveness on latency. Because latency is essential in simultaneous machine translation and deserves intensive investigation, we compare the latency of our models with that of previous work using our Equilibrium Efficiency metric. As shown in Figure 8, we plot the translation quality and latency on the NIST06 dev set. Clearly, compared to the baseline system, our model significantly reduces the time delay while retaining competitive translation quality. When treating segments as IUs, the latency can be further reduced by approximately 20% (23.13 → 18.65), with a slight decrease in BLEU score (47.61 → 47.27). One interesting finding is that the granularity of information units largely affects both translation quality and latency: decoding based on sub-sentences and decoding based on segments clearly behave differently on the two metrics. For the former, increasing the number of discarded tokens results in an obvious decrease in translation quality but no definite improvement in latency. The latter benefits from discarding more tokens in both translation quality and latency.

The latency of the wait-k models is competitive; their translation quality, however, is still worse than that of the context-aware model. Improving the translation quality of wait-k clearly comes at a large cost in latency (36.53 → 46.14 vs. 10.94 → 22.63). Even with the best policy, k = 20, its performance is still worse than that of most context-aware models. More importantly, the intermediate target tokens generated under the wait-k policy are unsuitable for TTS, because a generated token is often a BPE unit, which is typically an incomplete word. One could certainly wait for more target tokens before synthesizing the target speech, but this method then degenerates to the baseline model. In general, experienced human interpreters lag approximately 5 seconds (15 to 25 words) behind the speaker Lee (2002); Lamberger-Felber (2001); Timarová et al. (2015), which indicates that the latency of our model is acceptable and practicable ( = 25 indicates lagging 25 words).

4.4.2 Dynamic Context based Information Unit Boundary Detector

In our context-sensitive model, the dynamic context based information unit boundary detector is essential for determining the IU boundaries in the streaming input. To measure its effectiveness, we compare its precision and latency against traditional language model based methods: a 5-gram language model trained with the KenLM toolkit (https://github.com/kpu/kenlm) and an in-house RNN based model. Both contrastive models are trained on approximately 2 million monolingual Chinese sentences. As shown in Table 3, our model beats the previous work with an absolute F-score improvement of more than 15 points over the 5-gram model (62.79 → 78.26) and more than 9 points over the RNN model (68.95 → 78.26), with no obvious burden on latency (an average latency of 10.49 words vs. 8.68 and 9.79). This observation indicates that with bidirectional context, the model can learn better representations that help the downstream tasks. In the following experiments, we evaluate models on testing data whose IU boundaries are detected by our detector.
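
As a quick check on Table 3, the F-score column follows from the precision and recall columns as their harmonic mean. The sketch below is illustrative only: the function names and the representation of boundaries as sets of token positions are assumptions, not part of the detector described in the paper.

```python
from typing import Set, Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boundary_prf(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Score predicted IU boundaries against gold boundaries.

    Boundaries are modeled as token positions (an assumption made for
    this sketch); a prediction counts as correct only on an exact match.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Reproduce the F-score column of Table 3 from its precision/recall columns.
for name, p, r in [("5-LM", 55.30, 72.63), ("RNN", 67.61, 70.35), ("Our model", 75.09, 81.70)]:
    print(name, round(f_score(p, r), 2))
```

The three printed values match the reported F-scores, which confirms that the table uses the balanced F-score.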

4.4.3 BSTC Chinese-English

To our knowledge, almost all previous work on simultaneous translation evaluates models on clean testing data, without ASR errors and with explicit sentence boundaries annotated by human translators. Testing data with real ASR errors and without explicit sentence boundaries is certainly useful for evaluating the robustness of translation models. To this end, we perform experiments on our proposed BSTC dataset.

Models               Clean Input            ASR Input              ASR + Auto IU
                     Pre-train  Fine-tune   Pre-train  Fine-tune   Pre-train  Fine-tune
baseline             15.85      21.98       14.60      19.91       14.41      17.35
sub-sentence         14.39      18.61       13.50      16.99       13.76      16.29
                     12.23      16.74       11.62      15.59       11.75      14.68
wait-5               12.84      17.70       11.96      16.23       12.25      15.45
wait-7               13.34      19.32       12.67      17.41       12.55      16.08
wait-9               13.92      19.77       13.05      18.29       13.12      16.49
wait-12              14.35      20.15       13.34      19.07       13.48      17.25
wait-15              14.70      21.11       13.56      19.53       13.70      17.21
context-aware        15.25      20.72       14.24      18.42       13.52      16.83
+discard 2 tokens    15.26      21.07       14.35      19.17       13.73      17.02
+discard 3 tokens    15.37      21.09       14.42      19.39       14.00      17.41
+discard 4 tokens    15.40      21.02       14.45      19.41       14.11      17.36
+discard 5 tokens    15.59      21.23       14.72      19.65       14.54      17.37
+discard 6 tokens    15.53      21.21       14.77      19.48       14.58      17.49
Table 4: Overall results on the BSTC Chinese-English translation task. Pre-train denotes training on the NIST dataset, and Fine-tune denotes fine-tuning on the BSTC dataset. Clean Input means the input comes from human-annotated transcription, while ASR Input means the input contains ASR errors. ASR + Auto IU means that sentence boundaries as well as sub-sentences are detected by our IU detector; this setting therefore reflects the real environment of a practical product.

The testing data in the BSTC corpus consists of six talks. We first use our ASR model to transcribe the audio into Chinese text, which is then segmented into sub-sentences by our IU detector. To evaluate the contribution of our proposed BSTC dataset, we first train all models on the NIST dataset and then check whether fine-tuning them on the BSTC dataset improves performance further.

From the results shown in Table 4, we make the following observations:

  • Owing to the relatively low CER of the ASR output (10.32%), the gap between clean and noisy input is about 2 BLEU points or less (15.85 vs. 14.60 for pre-train, and 21.98 vs. 19.91 for fine-tune).

  • Despite the small size of the BSTC training data, fine-tuning on it is essential for improving the performance of all models.

  • In all settings, the best context-aware system beats the wait-15 model.

  • Pre-trained models are not sensitive to errors from Auto IU, while fine-tuned models are.

Models      Translation Reference        Interpretation Reference (3 references)
            BLEU    Brevity Penalty      BLEU    Brevity Penalty
Our Model   20.93   1.000                28.08   1.000
S           16.02   0.845                -       -
A           16.38   0.887                -       -
B           12.08   0.893                -       -
Table 5: Comparison between machine translation and human interpretation. The interpretation reference consists of a collection of interpretations from S, A and B. Our model is trained on the large-scale corpus.

Models      Overall                                    Missing Translation
            BAD      OK       GOOD     Acceptability
Our Model   26.09%   29.13%   44.78%   73.91%          20%
S           36.96%   30.87%   32.17%   63.04%          53%
A           26.96%   35.65%   37.39%   73.04%          47%
B           52.17%   31.74%   16.09%   47.83%          53%
Table 6: Results of human evaluation for interpreting and machine translation. Missing Translation indicates the proportion of missing translation in all translation errors. Notice that inadequate translations are marked as BAD by the human translator.

4.4.4 Machine Translation vs. Human Interpretation

Another interesting question is how machine translation compares with human interpretation. We asked three simultaneous interpreters, each with years of interpreting experience, to interpret the talks in the BSTC testing dataset in a mock conference setting: S (9 years of experience), A (7 years), and B (5 years). (Footnote 13: We provided the conference videos of the talks to the interpreters, because in real conferences interpreters have a good view of the speakers from the booth.)

We concatenate the translations of each talk into one long sentence and evaluate it with BLEU. From Table 5, we find that machine translation beats the human interpreters significantly. Moreover, the interpretations are relatively short, which incurs a heavy brevity penalty from the evaluation script. The result is unsurprising, because human interpreters often deliberately skip non-primary information to keep a reasonable ear-voice span, trading some adequacy for a shorter lag, whereas the machine translation model translates the content adequately. We also use the human interpreting results as references. As Table 5 indicates, our model then achieves a higher BLEU score of 28.08.
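
To make the brevity penalty column in Table 5 concrete, the sketch below implements the standard BLEU brevity penalty and inverts it to recover the output-to-reference length ratio that a given penalty implies. It is a minimal illustration of the textbook formula and is not the evaluation script used in the paper; the function names are assumptions.

```python
import math


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the candidate is at least as
    long as the reference, otherwise exp(1 - r / c)."""
    if candidate_len <= 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def implied_length_ratio(bp: float) -> float:
    """Invert the penalty: candidate/reference length ratio for 0 < bp <= 1.

    A penalty of exactly 1.0 only tells us the ratio is at least 1.
    """
    return 1.0 / (1.0 - math.log(bp))


# Length ratios implied by the penalties reported for the interpreters.
for name, bp in [("S", 0.845), ("A", 0.887), ("B", 0.893)]:
    print(name, round(implied_length_ratio(bp), 3))
```

Under this formula, a penalty of 0.845 corresponds to output that is roughly 86% as long as the reference, which is consistent with interpreters omitting content.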

Figure 9: A representative case indicating that our model can generate coherent translation.

Task   Overall                                    Error Distribution
       BAD      OK       GOOD     Acceptability   Translation  ASR   IU Boundary
CE     14.29%   24.37%   61.34%   85.71%          13%          39%   48%
EC     13.64%   14.54%   71.82%   86.36%          15%          15%   70%
Table 7: Results of DuTongChuan. CE denotes the Chinese-English translation task, and EC denotes the English-Chinese translation task.

Furthermore, we ask human translators to compare the quality of interpreting and machine translation. To evaluate the performance of our final system, we select one Chinese talk and one English talk (https://www.youtube.com/watch?v=RXGNbTx2Wqk), consisting of about 110 sentences, and have human translators assess the translations from multiple aspects: adequacy, fluency and correctness. The detailed criteria are:

  • Bad: Typically, the mark Bad indicates that the translation is incorrect and unacceptable.

  • OK: If a translation is comprehensible and adequate, but with minor errors such as incorrect function words and less fluent phrases, then it will be marked as OK.

  • Good: A translation will be marked as Good if it contains no obvious errors.

As shown in Table 6, the performance of our model is comparable to that of the human interpreters. It is worth mentioning that both the automatic and the human evaluation criteria are designed for written translation and place special emphasis on adequacy and faithfulness. In simultaneous interpreting, however, human interpreters routinely omit less important information to cope with the limits of their working memory. As the last column in Table 6 shows, the interpreters' oral translations contain more omissions than the machine's and therefore receive lower acceptability. These results do not mean that machines have surpassed human interpreters in simultaneous interpreting. Instead, they mean that we need machine translation criteria suited to simultaneous interpreting. We also find that the BSTC dataset is extremely difficult, as even the best human interpreter obtains an acceptability of only 73.04%. Although the NMT model obtains impressive translation quality, we do not compare the latency of machine translation and human interpreting in this paper, and leave it to future work.

4.4.5 Ablation Study

To better understand how our model generates coherent translation, we select one representative running example for analysis. As the red text in Figure 9 shows, the machine translation model generates the coherent translation “its own grid” for the sub-sentence “这个网络”, and “corresponds actually to” for the subsequence “…对应的,就是每个…”. Compared to the human interpretation (S), our model presents comparable translation quality. In detail, our model treats segments as IUs and generates the translation for each IU consecutively, whereas the human interpreter splits the entire source text into two sub-sentences and translates each in turn.

4.4.6 Performance of DuTongChuan

In the final deployment, we train DuTongChuan on the large-scale training corpus. We also use several techniques to enhance the robustness of the translation model, such as normalizing speech irregularities, handling abnormal ASR errors, and content censorship (see Appendix). We successfully deployed DuTongChuan at Baidu Create 2019 (Baidu AI Developer Conference, https://create.baidu.com/).

As shown in Table 7, DuTongChuan achieves promising acceptability on both translation tasks (85.71% for Chinese-English, and 86.36% for English-Chinese). We also analyze the error types in the final translations in detail and find that, apart from errors occurring in translation and ASR, the largest share of errors comes from IU boundary detection, which accounts for nearly half of the errors in Chinese-English and 70% in English-Chinese. In the future, we should concentrate on improving translation quality by enhancing the robustness of our IU boundary detector. We also evaluate the latency of our model in an End-to-End manner (speech-to-speech), and find that the target speech lags behind the source speech by less than 3 seconds most of the time. The overall performance in both translation quality and latency shows that DuTongChuan is accessible and practicable in an industrial scenario.

5 Related Work

The existing research on speech translation can be divided into two types: the End-to-End model Duong et al. (2016); Bansal et al. (2017); Weiss et al. (2017); Bérard et al. (2018); Liu et al. (2019) and the cascaded model. The former approach directly translates the acoustic speech in one language into text in another language, without generating an intermediate transcription of the source language. Given the complexity of the translation task and the scarcity of training data, previous literature explores effective techniques to boost performance, for example pre-training Bansal et al. (2018), multi-task learning Duong et al. (2016); Bérard et al. (2018), attention-passing Sperber et al. (2019), and knowledge distillation Liu et al. (2019). However, the cascaded model remains the dominant approach and performs better in practice, since the ASR and NMT models can be optimized separately on large-scale corpora.

Many studies have proposed to synthesize realistic ASR errors and augment the translation training data with them, to enhance the robustness of the NMT model to ASR errors Tsvetkov et al. (2014); Chen et al. (2017); Sperber et al. (2017). However, most of these approaches depend on simple heuristic rules and are evaluated only on artificially noisy test sets, which do not always reflect the real noise distribution in training and inference Cheng et al. (2018); Liu et al. (2018); Li et al. (2018).

Beyond the research on translation models, there is much research on other relevant problems, such as sentence boundary detection for realtime speech translation Sridhar et al. (2013); Oda et al. (2014); Wang et al. (2016); Bourlon et al. (2016); Zhou et al. (2017), low-latency simultaneous interpreting Fujita et al. (2013); Grissom II et al. (2014); Cho and Esipova (2016); Gu et al. (2017); Niehues et al. (2018); Alinejad et al. (2018); Press and Smith (2018), automatic punctuation annotation for speech transcription Gravano et al. (2009); Cho et al. (2017), and discussion of humans and machines in simultaneous interpreting He et al. (2016).

Focusing on the simultaneous translation task, some work addresses the construction of simultaneous interpreting corpora Tohyama et al. (2004); Bendazzoli and Sandrelli (2005); Shimizu et al. (2014). In particular, Shimizu et al. (2014) deliver a simultaneous translation corpus for comparative analysis of Japanese-English and English-Japanese speech translation. This work analyzes the differences between translations and interpretations, using the interpretations of human simultaneous interpreters.

For better generation of coherent translations, Gong et al. (2011) propose a memory based approach that captures contextual information so that the statistical translation model generates discourse-coherent translations. Kuang et al. (2017), Tu et al. (2018) and Maruf and Haffari (2018) extend similar memory based approaches to the NMT framework. Wang et al. (2017) present a document RNN that learns a representation of the entire text and treats the external context as auxiliary context to be retrieved by the hidden state of the decoder. Tiedemann and Scherrer (2017) and Voita et al. (2018) propose to encode global context by extending the current sentence with one preceding adjacent sentence. Notably, the former is conducted on recurrent models while the latter is implemented on the Transformer model. Recently, we also proposed a reinforcement learning strategy to deliberate the translation so that the model can generate more coherent translations Xiong et al. (2019).

6 Conclusion and Future Work

In this paper, we propose DuTongChuan, a novel context-aware translation model for simultaneous interpreting. This model is able to constantly read streaming text from the ASR model, and simultaneously determine the boundaries of information units one after another. The detected IU is then translated into a fluent translation with two simple yet effective decoding strategies: partial decoding and context-aware decoding. We also release a novel speech translation corpus, BSTC, to boost the research on robust speech translation task.

In an elaborate comparison, our model not only obtains superior translation quality over the wait-k model, but also presents competitive latency. Assessment by human translators reveals that our system achieves promising translation quality (85.71% for Chinese-English, and 86.36% for English-Chinese), especially in terms of surprisingly good discourse coherence. Our system also performs well on latency in speech-to-speech simultaneous translation (a delay of less than 3 seconds most of the time). We also deploy our simultaneous machine translation model in our AI platform, and welcome other users to try it.

In the future, we will conduct research on novel methods for evaluating interpreting.

7 Acknowledgement

We thank Ying Chen for improving the writing of this paper. We thank Yutao Qu for developing some modules of DuTongChuan. We thank our colleagues at Baidu for their efforts in constructing the BSTC. They are Zhi Li, Ying Chen, Xuesi Song, Na Chen, Qingfei Li, Xin Hua, Can Jin, Lin Su, Lin Gao, Yang Luo, Xing Wan, Qiaoqiao She, Jingxuan Zhao, Can Jin, Wei Jin, Xiao Yang, Shuo Liu, Yang Zhang, Jing Ma, Junjin Zhao, Yan Xie, Minyang Zhang, Niandong Du, etc.

We also thank tndao.com (http://www.tndao.com/about-tndao) and zaojiu.com (https://www.zaojiu.com/) for contributing their speech corpora.


Appendix A Training Samples for Information Unit Detector

For example, for a sentence “她说我错了,那个叫什么什么呃妖姬。”, there are some representative training samples:

Appendix B Techniques for Robust Translation

To develop an industrial simultaneous machine translation system, it is necessary to deal with problems that affect translation quality in practice, such as a large number of speech irregularities, ASR errors, and topics that allude to violence, religion, sex and politics.

B.1 Speech Irregularities Normalization

In real talks, speakers tend to express themselves with irregularities rather than the regular written language used to train prevalent machine translation models. For example, as depicted in Figure 2, spoken language in real talks often contains unconscious repetitions (e.g., “什么 (shénmē) 什么 (shénmē)”) and filler words (“呃”, “啊”), which inevitably affect the downstream models, especially the NMT model. The discrepancy between training and decoding not only exists in the corpus, but also arises from error propagation from the ASR model (e.g., erroneously recognizing “饿 (è)” as the filler word “呃 (è)”), which relates to the field of robust speech NMT research.

In the study of robust speech translation, many methods can be applied to alleviate the discrepancy arising mostly from ASR errors, such as disfluency detection, fine-tuning on noisy training data Tsvetkov et al. (2014); Chen et al. (2017), and complex lattice input Sperber et al. (2017). Spoken language normalization is most closely related to work on sentence simplification. However, traditional methods for sentence simplification rely on large-scale training corpora and increase model complexity by incorporating an End-to-End model to transform the original input.

In our system, to address both speech irregularities and ASR errors, we propose a simple rule-based heuristic method that normalizes both, focusing mostly on removing noisy inputs, including filler words, unconscious repetitions, and ASR errors that are easy to detect. Although faithfulness and adequacy are essential in simultaneous interpreting, in a conference users can still understand the majority of the content when some unimportant words are discarded.

B.1.1 Unconscious Repetitions

To remove unconscious repetitions, the problem can be formulated as the Longest Continuous Substring (LCS) problem, which can be solved by an efficient suffix-array based algorithm in time complexity empirically. Unfortunately, this simple solution is problematic in some cases. For example, in “他 必须 分成 很多 个 小格 , 一个 小格 一个 小格 完成”, the repetition “一个 小格 一个 小格” should not be normalized to “一个 小格”. To address this drawback, we collect repetitions that appear more than 5 times in a large-scale corpus of written expressions, resulting in a white list of more than 7,000 repetitions. In practice, we first look up this white list and prevent any candidate found in it from being normalized.
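
The white-list idea can be sketched as follows. Note that this is a simplified greedy scan over adjacent repeated n-grams, swapped in for readability; it is not the suffix-array algorithm mentioned above. The function name, the `max_ngram` limit, and the tuple-based white-list format are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def collapse_repetitions(
    tokens: List[str],
    whitelist: Set[Tuple[str, ...]],
    max_ngram: int = 4,
) -> List[str]:
    """Collapse immediately repeated n-grams ("A B A B" -> "A B"),
    unless the full repeated span appears in the white list."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        handled = False
        # Prefer the longest repeated n-gram starting at position i.
        for n in range(max_ngram, 0, -1):
            first = tokens[i:i + n]
            second = tokens[i + n:i + 2 * n]
            if len(second) == n and first == second:
                if tuple(first + second) in whitelist:
                    # Legitimate written-language repetition: keep both copies.
                    out.extend(first + second)
                    i += 2 * n
                else:
                    # Skip one copy; the surviving copy is re-examined next,
                    # so triple repetitions also collapse to a single copy.
                    i += n
                handled = True
                break
        if not handled:
            out.append(tokens[i])
            i += 1
    return out


whitelist = {("一个", "小格", "一个", "小格")}
print(collapse_repetitions("那个 叫 什么 什么 呃 妖姬".split(), whitelist))
print(collapse_repetitions("一个 小格 一个 小格 完成".split(), whitelist))
```

In the first call the repeated “什么” is collapsed; in the second, the white-listed span is left intact.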

B.1.2 Removing ASR Errors

According to our previous study, many ASR errors are caused by failures to disambiguate homophones. In some cases, such errors lead to serious problems. For example, “食油 (cooking oil)” and “石油 (oil)” share the same Chinese phonetic transcription (shí yóu) but have distinct meanings. The simplest way to resolve this problem is to enhance the ASR model with a domain-specific language model so that it generates the correct sequence. However, this method imposes a demanding requirement: a customized ASR model. To reduce the cost of deploying a customized ASR model, as well as to limit the propagation of ASR errors, we propose a language model based identifier that removes abnormal content.

Definition 5.

For a given sequence , if the value of is lower than a threshold , then we denote the token as an abnormal content.

In the above definition, the value of and can be efficiently computed by a language model. In our final system, we firstly train a language model on the domain-specific monolingual corpus, and then identify the abnormal content before the context-aware translation model. For the detected abnormal content, we simply discard it rather than finding an alternative, which will lead to additional errors potentially. Actually, human interpreters often routinely omit source content due to the limited memory.
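
The identifier can be illustrated with a toy example. The sketch below assumes that the score of a token is its conditional probability given the preceding kept token, estimated by an add-one-smoothed bigram model; the class name, the smoothing scheme, the threshold value and the tiny corpus are all hypothetical stand-ins for the domain-specific language model used in the real system.

```python
from collections import Counter
from typing import List


class BigramLM:
    """Toy add-one-smoothed bigram language model (illustrative only)."""

    def __init__(self, corpus: List[List[str]]):
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        vocab = {"<s>"}
        for sentence in corpus:
            padded = ["<s>"] + sentence
            vocab.update(sentence)
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))
        self.vocab_size = len(vocab)

    def prob(self, prev: str, token: str) -> float:
        """P(token | prev) with add-one smoothing."""
        return (self.bigram_counts[(prev, token)] + 1) / (
            self.context_counts[prev] + self.vocab_size
        )


def drop_abnormal(tokens: List[str], lm: BigramLM, threshold: float) -> List[str]:
    """Discard tokens whose conditional probability falls below the threshold.

    A discarded token is not used as context for the following token.
    """
    kept: List[str] = []
    prev = "<s>"
    for token in tokens:
        if lm.prob(prev, token) >= threshold:
            kept.append(token)
            prev = token
    return kept


# Hypothetical in-domain corpus about petroleum ("石油").
domain_corpus = [["我们", "开采", "石油"], ["我们", "开采", "石油"], ["我们", "开采", "煤炭"]]
lm = BigramLM(domain_corpus)
print(drop_abnormal(["我们", "开采", "食油"], lm, threshold=0.2))
```

Here the homophone “食油” is unseen after “开采” in the domain corpus, so it falls below the threshold and is dropped rather than replaced.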

B.2 Constrained Decoding and Content Censorship

For an industrial product, it is extremely important to control the content that is presented to the audience. It is also important to translate domain-specific entities and terminology consistently. These two demands lead to two associated problems: content censorship and constrained decoding. The former aims to avoid producing certain translations, while the latter has the opposite goal of generating pre-specified translations.

Recently, Post and Vilar (2018) proposed Dynamic Beam Allocation (DBA), a beam search algorithm that forces the inclusion of pre-specified words and phrases in the output. In the DBA strategy, manually annotated constraints force the beam search to generate the pre-specified translation. To satisfy the requirement of content censorship, we extend this algorithm to prevent the model from generating pre-specified forbidden content, a collection of words and phrases alluding to violence, religion, sex and politics. Specifically, during beam search we penalize any candidate beam that matches a forbidden-content constraint, to prevent it from being selected as the final translation.
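
The penalization step can be sketched in isolation. The code below is not an implementation of DBA or of the extension used in DuTongChuan; it only illustrates, under assumed data structures (hypotheses as token lists paired with log-probability scores, a hypothetical `penalty` constant), how candidates matching a forbidden phrase can be pushed to the bottom of a beam.

```python
from typing import List, Sequence, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, log-probability score)


def contains_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """Return True if `phrase` occurs as a contiguous span in `tokens`."""
    n = len(phrase)
    if n == 0:
        return False
    return any(
        tuple(tokens[i:i + n]) == tuple(phrase)
        for i in range(len(tokens) - n + 1)
    )


def rerank_with_censorship(
    beam: List[Hypothesis],
    forbidden: List[Sequence[str]],
    penalty: float = 1e9,
) -> List[Hypothesis]:
    """Subtract a large penalty from every hypothesis that matches a
    forbidden phrase, then sort the beam by the adjusted score."""
    rescored: List[Hypothesis] = []
    for tokens, score in beam:
        if any(contains_phrase(tokens, phrase) for phrase in forbidden):
            score -= penalty
        rescored.append((tokens, score))
    return sorted(rescored, key=lambda hyp: hyp[1], reverse=True)


beam = [(["this", "is", "forbidden", "text"], -1.0), (["this", "is", "fine"], -1.5)]
best_tokens, _ = rerank_with_censorship(beam, forbidden=[("forbidden", "text")])[0]
print(best_tokens)
```

Applying the penalty at every decoding step, rather than once at the end, keeps flagged candidates from crowding acceptable ones out of the beam.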