
Robust Neural Machine Translation for Clean and Noisy Speech Transcripts

by   Mattia Antonino Di Gangi, et al.

Neural machine translation models have been shown to achieve high quality when trained and fed with well-structured and punctuated input texts. Unfortunately, the latter condition is not met in spoken language translation, where the input is generated by an automatic speech recognition (ASR) system. In this paper, we study how to adapt a strong NMT system to make it robust to typical ASR errors. As in our application scenarios transcripts might be post-edited by human experts, we propose adaptation strategies to train a single system that can translate either clean or noisy input with no supervision on the input type. Our experimental results on a public speech translation data set show that adapting a model on a significant amount of parallel data including ASR transcripts is beneficial with test data of the same type, but produces a small degradation when translating clean text. Adapting on both clean and noisy variants of the same data leads to the best results on both input types.





1 Introduction

The recent quality improvements [bojar2017findings, bojar2018findings] of neural machine translation (NMT) [sutskever2014sequence, bahdanau2014neural] opened the way to commercial applications that can provide high-quality translations. The assumption is that the sentences to translate will be similar to the training data, usually characterized by properly-formed text in the two translation languages. Poor-quality sentences are usually considered “noise” and removed from the training set to improve the final quality [junczys2018microsoft]. This practice is so common that a shared task has been devoted to it [koehn2018findings], which got attention from major industrial players in MT. Thus, the major weakness of NMT lies in coping with noisy input, which is an important feature of a real-world application such as speech translation.

The degradation of translation quality with noisy input has been widely reported in the literature. Belinkov and Bisk [belinkov2017synthetic] showed that the translation quality rapidly drops with both natural and synthetic noise. Ruiz et al. [ruiz2017assessing] have observed a correlation between recognition and translation quality in the context of speech translation. In both cases the degradation is mainly due to word errors, and subsequent works have shown that inserting synthetic noise in the training data increases the robustness to the same kind of noise [karpukhin2019training, sperber2017toward].

Most of the time, travellers worry about their luggage.
Most of the time travellers worry about their luggage.
Table 1: Example of sentence in which the meaning is changed by a punctuation mark.

In practice, ASR transcripts are not only noisy in the choice of words, but also come without punctuation (ASR services generally include punctuation as an option). Thus, in order to feed MT systems with ASR output there are two main options: i) use a separate model that inserts punctuation (pre-processing) [peitz2011modeling]; or ii) train the MT system on non-punctuated source data to resemble the test condition (implicit learning). The first option is exposed to error propagation, as punctuation inserted in the wrong position can completely alter the meaning of a sentence (see the example in Table 1). The second option has been shown to increase system robustness [sperber2017toward] when the input is provided without punctuation. The two approaches have been shown to be equivalent in handling text lacking punctuation [vandeghinstecomparison], probably because they both rely only on plain monolingual text to recover the missing punctuation.

We consider here application scenarios (e.g. subtitling) where the same NMT system has to operate under both clean and noisy input conditions, namely post-edited or raw ASR outputs. We start from the hypothesis that word and punctuation errors compound and should be addressed separately for an increased robustness to errors. To verify our hypothesis we use a strong NMT model and fine-tune it on a recently released speech corpus of TED Talks (available at [mustc19]).

Our findings are the following: implicit learning of punctuation helps to recover part of the errors in the source side, but training on ASR output is more beneficial. Training on clean and noisy data together leads to a system that can translate both clean and noisy data without supervision on the type of sentence and without degradation.

            Train   Validation   Test
Words       5.7M    31.5K        54.5K
Segments    250K    1.3K         2.5K
Audio       457h    2.5h         4.2h
Table 2: The English-Italian corpus of TED Talks used in our experiments.

        Clean         Noisy         Noisy-np
Gen     32.3 (30.7)   24.5 (24.0)   20.6 (22.4)
Ada     34.9 (32.9)   25.9 (25.6)   21.8 (24.6)
Table 3: BLEU scores of large-data generic (Gen) and adapted (Ada) NMT systems with clean and noisy input. Scores in parentheses ignore punctuation in both hypothesis and reference.

2 Robust NMT

In this paper we are interested in building a single system to translate speech transcripts that can be either raw ASR outputs, or human post-editing of them. We define our problem in terms of domain adaptation, and use the method known as fine-tuning or continued training [luong2015stanford, farajian2016fbk, chu2017empirical]. A parent model is trained on large data from several domains and used to initialize [thompson2018freezing] models for spoken language translation (TED talks) on two input conditions: clean, and noisy. The clean domain is characterized by correct transcriptions of talks with proper punctuation, while the noisy domain can contain machine-generated errors in words and punctuation. In-domain data can be given with or without punctuation (allowing implicit learning). In a multi-domain setting, i.e. translating both clean and noisy data with a single system, models can suffer from catastrophic forgetting [kirkpatrick2017overcoming] by degrading performance on less-recently observed domains. In this work we avoid catastrophic forgetting by fine-tuning the model on both domains simultaneously.
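The simultaneous fine-tuning on both domains can be pictured as building a single training set out of the clean and the noisy variants of the same parallel data. The snippet below is only a minimal sketch under stated assumptions: the function name `build_mixed_finetuning_set`, the shuffling step and the placeholder target strings are illustrative and are not taken from the paper.

```python
import random


def build_mixed_finetuning_set(clean_pairs, noisy_pairs, seed=0):
    """Merge clean and noisy (source, target) pairs into one shuffled list.

    No tag marking the input type is added, so a model trained on the
    result receives no supervision on whether a source is clean or noisy.
    Shuffling is an assumption of this sketch, not a detail from the paper.
    """
    mixed = list(clean_pairs) + list(noisy_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical toy data: the same target paired with a clean source
# and with an ASR-like source that has lost its punctuation.
clean = [("most of the time , travellers worry about their luggage .", "TARGET_1")]
noisy = [("most of the time travellers worry about their luggage", "TARGET_1")]

mixed = build_mixed_finetuning_set(clean, noisy)
print(len(mixed))
```

Because every target sentence appears once per source variant, the model sees each domain at every stage of fine-tuning, which is the property the text relies on to avoid catastrophic forgetting.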

3 Experimental Setting

As a parent model for fine-tuning we use a Transformer Big model [vaswani2017attention] trained with label smoothed cross entropy [szegedy2016rethinking] on public and proprietary data, amounting to about 16 million English-Italian sentence pairs. The model has layer size of , hidden size of on feed forward layers, 16 heads in the multi-head attention, and layers in both encoder and decoder. This model is then fine-tuned on the English-Italian portion of TED Talks in MuST-C. We keep the same training/validation/test set split as provided with the corpus (see Table 2). In all the experiments, we use Adam [kingma2014adam] with a fixed learning rate of , dropout of , label smoothing with a smoothing factor of . Training is performed on Nvidia V100 GPUs, with batches of tokens per GPU. Gradients are accumulated for batches in each GPU [ott2018scaling]. All texts are tokenized and true-cased with scripts from the Moses toolkit [koehn2007moses], and then words are segmented with BPE [sennrich2015neural] with 32K joint merge rules.

While our main goal is to verify our hypotheses on a large data condition, thus the need to include proprietary data, for the sake of reproducibility we also provide results with systems only trained on TED Talks (small data condition).

We automatically transcribed the entire TED Talks corpus with a general-purpose ASR service (Amazon Transcribe). The resulting word error rate on the test set is 11.0%. The ASR service provides transcribed text with predicted punctuation. In the experiments that assume noisy input without punctuation, we simply remove the predicted punctuation.
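Removing the predicted punctuation to obtain the "-np" variants can be sketched as a simple token filter over tokenized text. This is a hedged illustration: the `PUNCTUATION` set and the function name `strip_punctuation` are assumptions, since the exact set of symbols removed is not specified here.

```python
# Assumed set of punctuation tokens; the actual set used is not specified.
PUNCTUATION = {".", ",", "?", "!", ";", ":", "-"}


def strip_punctuation(tokenized_sentence):
    """Drop stand-alone punctuation tokens from a whitespace-tokenized sentence.

    Tokens that merely contain an apostrophe (e.g. "'m") are kept,
    because only exact matches with PUNCTUATION are removed.
    """
    tokens = tokenized_sentence.split()
    return " ".join(token for token in tokens if token not in PUNCTUATION)


noisy = "when I 'm not fighting poverty . I 'm fighting fires ."
print(strip_punctuation(noisy))
```

Filtering after tokenization keeps the operation trivial, since each punctuation mark is already a separate token.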

All the results are evaluated in terms of BLEU score [papineni2002bleu] using the multeval tool [clark2011better].

                          Clean   Noisy
Clean                     34.9    25.9
Clean-np                  34.2    26.6
Clean + Clean-np          34.9    26.9
Noisy                     34.0    28.3
Noisy-np                  34.2    28.4
Noisy + Clean             35.1    28.1
Noisy + Noisy-np          34.0    28.2
Noisy-np + Clean          35.0    28.2
Noisy-np + Clean-np       34.5    27.9
Noisy[-np] + Clean[-np]   34.9    27.7
Table 4: Results of fine-tuning on different training conditions with clean and noisy input (large data). Statistical significance (p < 0.01) is assessed with respect to Clean + Clean-np and with respect to the first system of each block, using randomization tests with 15K repetitions [riezler-maxwell-2005-pitfalls].
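The randomization tests mentioned in the caption of Table 4 follow a simple resampling idea: the outputs of the two systems are randomly swapped sentence by sentence, and the test counts how often the shuffled difference is at least as large as the observed one. The sketch below is a simplification under stated assumptions: it works on per-sentence scores and uses their mean as the test statistic, whereas a test on BLEU would recompute the corpus-level score after each shuffle. The function name and the toy scores are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, repetitions=15000, seed=0):
    """Two-sided paired approximate randomization test on the mean difference.

    scores_a[i] and scores_b[i] are scores of two systems on sentence i.
    Returns an estimated p-value with the usual +1 smoothing.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_extreme = 0
    for _ in range(repetitions):
        shuffled_diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Swap the two system outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_diff += a - b
        if abs(shuffled_diff) / n >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (repetitions + 1)


# Hypothetical per-sentence scores for two systems.
system_a = [0.31, 0.42, 0.28, 0.50, 0.39, 0.44]
system_b = [0.29, 0.35, 0.30, 0.41, 0.33, 0.40]
print(approximate_randomization(system_a, system_b, repetitions=2000))
```

A difference would be reported as significant at the level used in the table when the returned value is below 0.01.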

4 Experiments and Results

First, we evaluate the degradation due to ASR noise for systems trained on clean data. Table 3 shows the BLEU scores of our baseline systems, trained on large out-of-domain data (Base, reported as Gen) and then fine-tuned on clean TED Talks data (In-domain, reported as Ada), with three types of input: manual transcripts (Clean), ASR transcripts with predicted punctuation (Noisy), and ASR transcripts with no punctuation (Noisy-np). As these models will hardly generate punctuation when it does not appear in the source text, we also report (in parentheses) BLEU scores computed without punctuation in both hypothesis and reference.

Our results show that in-domain fine-tuning is beneficial in all scenarios, but more with clean input (+2.6 points) than with noisy input (+1.4 points). Translating noisy input results in a 26% relative drop in BLEU, which is apparently mostly not due to punctuation errors, since the drop is still 22% when evaluating without punctuation. Providing noisy input with no punctuation works even worse, probably due to the larger mismatch with the training/tuning conditions.
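The relative drops quoted above can be re-derived from the Ada row of Table 3. The short computation below is just a worked check; the helper name `relative_drop` is introduced here for illustration.

```python
def relative_drop(reference_score, degraded_score):
    """Relative drop, in percent, from reference_score to degraded_score."""
    return 100.0 * (reference_score - degraded_score) / reference_score


# Ada system, Table 3: Clean vs. Noisy input.
with_punctuation = relative_drop(34.9, 25.9)     # BLEU with punctuation
without_punctuation = relative_drop(32.9, 25.6)  # BLEU ignoring punctuation

print(f"{with_punctuation:.1f}% {without_punctuation:.1f}%")  # prints "25.8% 22.2%"
```

The two values round to the 26% and 22% reported in the text.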

                    Clean   np      Noisy   np
Clean               30.3    -       22.3    -
Clean-np            -       28.2    -       22.9
Clean + Clean-np    29.7    27.9    22.9    22.9
Noisy               25.8    -       23.9    -
Noisy-np            -       26.4    -       24.1
Noisy-np + Clean    30.1    27.9    24.0    24.2
Table 5: Results of fine-tuning on different training conditions with clean and noisy input (small data). Significance markers denote a statistically significant difference with respect to Clean.

Clean       when I ’m not fighting poverty , I ’m fighting fires as the assistant captain of a volunteer fire company .
Noisy       when I ’m not fighting poverty . I ’m fighting fires . is the assistant captain with volunteer fire company .
Base NMT    Quando non combatto la povertà . Combatto gli incendi . È l’assistente capitano di una compagnia di pompieri volontaria .
Robust NMT  Quando non combatto la povertà , combatto gli incendi come assistente capitano di una compagnia di pompieri volontaria .

Clean       that means we all share a common ancestor , an evolutionary grandmother , who lived around six million years ago.
Noisy       that means we all share a common ancestor - on evolutionary grandmother - who lived around six million years ago.
Base NMT    Ciò significa che tutti condividiamo un antenato comune - sulla nonna evolutiva che ha vissuto circa sei milioni di anni fa.
Robust NMT  Ciò significa che tutti condividiamo un antenato comune e una nonna evolutiva che è vissuta circa sei milioni di anni fa.

Clean       and we in the West couldn’t understand how anybody would do this , how much this would restrict freedom of speech.
Noisy       we in the West . I couldn ’t understand how anybody would do - how much this would restrict freedom of speech.
Base NMT    Noi occidentali . Non riuscivo a capire come chiunque avrebbe fatto - quanto questo avrebbe limitato la libertà di parola.
Robust NMT  In Occidente non riuscivo a capire come chiunque avrebbe fatto , quanto questo avrebbe limitato la libertà di parola.
Table 6: Examples of punctuation and substitution errors (“as → is”, “of a → with”, “an → on”) that are successfully recovered by the Robust NMT system. Notice that not all errors (underlined) are recovered. In the second example, Robust NMT introduces a spurious conjunction “e” (“and”) in place of the missing comma, while in the third example, Robust NMT is not able to recover the deleted words “and” and “this” at the beginning and in the middle of the sentence, respectively.

Next, we evaluate systems fine-tuned on data similar to the Noisy test condition. Table 4 lists the results of all fine-tuning experiments, when the test input is either clean with punctuation or noisy with punctuation as in training. The first part of Table 4 shows that fine-tuning the Gen model with clean data and no punctuation (Clean-np) improves over the Ada model (Clean) when testing on Noisy-np (+0.7) but degrades when testing on Clean (-0.7). Fine-tuning the same model on Clean data with both punctuation conditions (Clean + Clean-np) yields an improvement of 1 point on Noisy input without any loss on Clean input. This result shows that it is possible to make a model robust to noisy text, while preserving high quality on proper text.

The second part of Table 4 lists the results of fine-tuning on noisy data, with and/or without punctuation. In all cases, BLEU scores on the noisy input improve by 1 to 2 points over the best system tuned on Clean data only, reaching values above 28. However, scores on clean input degrade by 0.7 and 0.9 points (i.e. to 34.2 and 34.0). If we adapt on both clean and noisy data, the scores on the two input conditions reach a better balance. In particular, training on both Noisy-np and Clean data scores 35.0 and 28.2 on the two input conditions, which is the best overall working point on both conditions (together with Noisy + Clean). It is worth pointing out that this configuration obtains 33.2 points (not in the table) with the best possible noisy input, i.e. no errors and no punctuation, which is still 1.7 points below the score on Clean input.

Finally, if we expose the system to all types of data (Noisy-np + Noisy + Clean + Clean-np) we do not see any improvement over our top results, which means that Clean-np data do not provide additional information to Noisy-np.

For the sake of replicability, we also trained our systems from scratch on the TED Talks data only. The results, listed in Table 5, show the same trend as the results discussed so far. We did not evaluate *-np systems on input with punctuation, as all the punctuation would represent out-of-vocabulary words. The main difference resides in the result on Clean input with the Noisy system (25.8 points), which is much worse than the result with the Noisy-np + Clean system (30.1 points), i.e. by more than 4 points. This result suggests that training on noisy data can affect the model negatively if it is not balanced with clean data.

                  ASR w/ ties   ASR w/o ties
Clean             10.5          32.8
ASR-np + Clean    21.7          67.2
Table 7: Manual evaluation in the ASR input condition (large data). Percentage of wins with and without ties. The difference is statistically significant (p < 0.01).

5 Manual Evaluation

We carried out a manual evaluation, using crowd-sourcing, to assess the quality of Noisy-np + Clean against Clean, the reference baseline, under the Noisy input condition. We ran a head-to-head evaluation on the first 10 sentences of each test talk, for a total of 260 sentences, by asking annotators to blindly rank the two system outputs (ties were also permitted). We collected three judgments for each output, from 11 annotators, for a total of 780 scores. Inter-annotator agreement measured with Fleiss’ kappa was 0.39.
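For reference, Fleiss’ kappa compares the observed agreement among raters with the agreement expected by chance. The following is a minimal sketch of the standard formula, assuming each sentence receives the same number of judgments (three here) over three categories (first system better, second system better, tie). The function name and the toy counts are hypothetical and are not the judgments collected in this evaluation.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    that assigned item i to category j. Every item must have the same
    number of ratings. Undefined (division by zero) if all ratings fall
    into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments: columns are (system A better, system B better, tie).
toy_counts = [
    [3, 0, 0],
    [0, 2, 1],
    [0, 0, 3],
    [1, 1, 1],
    [0, 3, 0],
]
print(round(fleiss_kappa(toy_counts), 2))
```

Values of 1 indicate perfect agreement, while values near 0 indicate agreement no better than chance.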

Results reported in Table 7 confirm with high confidence the differences observed with BLEU: the output of system Noisy-np + Clean is preferred roughly 10 percentage points more often than the output of system Clean (21.7% vs. 10.5% of the cases), while almost 68% of the time the two outputs are considered comparable.
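The two columns of Table 7 differ only in the denominator: wins are divided either by all judgments or by the judgments that are not ties. The sketch below illustrates this with hypothetical counts; the function name `win_percentages` and the numbers are assumptions, not the raw counts behind the table.

```python
def win_percentages(wins_a, wins_b, ties):
    """Return ((a, b) with ties, (a, b) without ties) as percentages of wins."""
    total = wins_a + wins_b + ties
    decided = wins_a + wins_b
    with_ties = (100.0 * wins_a / total, 100.0 * wins_b / total)
    without_ties = (100.0 * wins_a / decided, 100.0 * wins_b / decided)
    return with_ties, without_ties


# Hypothetical counts of judgments for two systems.
with_ties, without_ties = win_percentages(wins_a=1, wins_b=3, ties=4)
print(with_ties, without_ties)

# Share of ties implied by the "with ties" column of Table 7.
tie_share = 100.0 - 10.5 - 21.7
print(round(tie_share, 1))  # prints 67.8
```

The last value matches the "almost 68%" of comparable outputs mentioned in the text.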

From manual inspection, we found that translations by the robust system that are unanimously ranked best show that error recovery most likely occurs on punctuation and non-content words like articles, prepositions and conjunctions (see Table 6). In general, errors on content words that affect the meaning are not recovered.

6 Related works

A recent study [chen2017mitigating] proposed to tackle ASR errors as a domain adaptation problem in the context of dialog systems. Domain adaptation for NMT has been widely studied in recent years [chu2018survey]. In [khayrallah-etal-2018-regularized], fine-tuning was used to adapt NMT to multiple domains simultaneously, while in [britz2017effective] adversarial adaptation is proposed for avoiding degradation in the original domain. Training on multiple domains simultaneously to prevent catastrophic forgetting is inspired by [stojanov2019incremental]. They proposed an incremental learning scheme that trains the network with the data from previous tasks when a new task is learned, which we adapt to our multi-domain scenario.

Punctuation insertion based on monolingual text has long been studied [huang2002maximum, matusov2006automatic, lu2010better, ueffing2013improved] and has recently been improved with deep networks [cho2012segmentation, cho2015punctuation, tilk2015lstm, tilk2016bidirectional, salloum2017deep]. However, this approach, which is meant to make the output of ASR more readable for humans, cannot solve ambiguity due to missing punctuation.

A more recent research line aims at using pauses and audio features to better predict punctuation [christensen2001punctuation, klejch2017sequence, zelasko2018punctuation, nanchen2019empirical, yi2019self], although it has been shown that the use of pauses is highly dependent on the speaker [igras2016structure]. In [peitz2011modeling], it is shown that implicit learning of punctuation in MT systems is at least as good as inserting punctuation either in the input or output text, but they only studied the effect on correct input that has been deprived of punctuation, not on noisy input. On the other hand, [sperber2017toward] studied how to improve the robustness to misrecognized words, but did not study the effect of MT systems that are only robust to punctuation errors. We close the gap by studying the combined effect of misrecognized words and missing punctuation, besides studying how robustness to noisy data affects the translation quality on clean input.

7 Conclusion

We have studied the robustness to input errors of NMT systems for speech translation with a fine-tuning approach. We have observed that a system trained to implicitly learn the target punctuation can recover part of the quality degradation due to ASR errors, up to 1 BLEU point. Fine-tuning on noisy input can instead improve by more than 2 BLEU points. A system tuned on ASR errors does not obtain a further improvement from more data for implicit punctuation learning. Finally, when fine-tuning on clean and noisy data, the system becomes robust to noisy input and keeps high performance on clean input.