1 Introduction
The recent quality improvements [bojar2017findings, bojar2018findings] of neural machine translation (NMT) [sutskever2014sequence, bahdanau2014neural] have opened the way to commercial applications that can provide high-quality translations. The underlying assumption is that the sentences to translate will be similar to the training data, which usually consists of properly-formed text in the two translation languages. Poor-quality sentences are usually considered "noise" and removed from the training set to improve the final quality [junczys2018microsoft]. This practice is so common that a shared task has been devoted to it [koehn2018findings], attracting attention from major industrial players in MT. As a consequence, a major weakness of NMT lies in coping with noisy input, which is a key requirement of real-world applications such as speech translation.
The degradation of translation quality with noisy input has been widely reported in the literature. Belinkov and Bisk [belinkov2017synthetic] showed that translation quality drops rapidly with both natural and synthetic noise. Ruiz et al. [ruiz2017assessing] observed a correlation between recognition quality and translation quality in the context of speech translation. In both cases the degradation is mainly due to word errors, and follow-up work has shown that inserting synthetic noise in the training data increases the robustness to the same kind of noise [karpukhin2019training, sperber2017toward].
Table 1: Example of how the position of punctuation can alter the meaning of a sentence.

| Most of the time, travellers worry about their luggage. |
| Most of the time travellers worry about their luggage.  |
In practice, ASR transcripts are not only noisy in the choice of words, but also come without punctuation (even though ASR services generally include punctuation prediction as an option). Thus, in order to feed MT systems with ASR output there are two main options: i) use a separate model that inserts punctuation as a pre-processing step [peitz2011modeling]; or ii) train the MT system on non-punctuated source data to resemble the test condition (implicit learning). The first option is exposed to error propagation, as punctuation inserted in the wrong position can completely alter the meaning of a sentence (see the example in Table 1). The second option has been shown to increase system robustness [sperber2017toward] when the input is provided without punctuation. The two approaches have proven equivalent in handling text lacking punctuation [vandeghinstecomparison], probably because they both rely only on plain monolingual text to recover the missing punctuation.
We consider here application scenarios (e.g. subtitling) where the same NMT system has to operate under both clean and noisy input conditions, namely post-edited or raw ASR outputs. We start from the hypothesis that word and punctuation errors compound and should be addressed separately for increased robustness to errors. To verify our hypothesis we use a strong NMT model and fine-tune it on a recently released speech corpus of TED Talks (MuST-C, available at https://ict.fbk.eu/must-c/) [mustc19].
Our findings are the following: implicit learning of punctuation helps recover part of the errors on the source side, but training directly on ASR output is more beneficial. Training on clean and noisy data together leads to a system that can translate both clean and noisy input without supervision on the type of input and without quality degradation.
Table 2: Statistics of the MuST-C English-Italian training, validation and test sets.

|          | Train | Validation | Test  |
| Words    | 5.7M  | 31.5K      | 54.5K |
| Segments | 250K  | 1.3K       | 2.5K  |
| Audio    | 457h  | 2.5h       | 4.2h  |
Table 3: BLEU scores of the generic model (Gen) and the model adapted to clean TED Talks data (Ada) on the three input conditions; in parentheses, BLEU computed without punctuation in hypothesis and reference.

|     | Clean       | Noisy       | Noisy-np    |
| Gen | 32.3 (30.7) | 24.5 (24.0) | 20.6 (22.4) |
| Ada | 34.9 (32.9) | 25.9 (25.6) | 21.8 (24.6) |
2 Robust NMT
In this paper we are interested in building a single system to translate speech transcripts that can be either raw ASR outputs or human post-edits of them. We cast our problem as domain adaptation and use the method known as fine-tuning or continued training [luong2015stanford, farajian2016fbk, chu2017empirical]. A parent model is trained on large data from several domains and used to initialize [thompson2018freezing] models for spoken language translation (TED Talks) under two input conditions: clean and noisy. The clean domain is characterized by correct transcriptions of talks with proper punctuation, while the noisy domain can contain machine-generated errors in words and punctuation. In-domain data can be given with or without punctuation (allowing implicit learning). In a multi-domain setting, i.e. translating both clean and noisy data with a single system, models can suffer from catastrophic forgetting [kirkpatrick2017overcoming], degrading performance on less-recently observed domains. In this work we avoid catastrophic forgetting by fine-tuning the model on both domains simultaneously.
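As an illustration, joint fine-tuning on both domains amounts to concatenating and shuffling the clean and noisy in-domain parallel data before training. The following is a minimal Python sketch under assumed file names (the actual data preparation pipeline is not detailed here):

```python
import random

def read_parallel(src_path, tgt_path):
    """Read a parallel corpus as a list of (source, target) sentence pairs."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return list(zip((l.rstrip("\n") for l in fs), (l.rstrip("\n") for l in ft)))

# Hypothetical file names: manual (clean) and ASR (noisy) transcripts
# share the same Italian reference translations.
clean = read_parallel("ted.clean.en", "ted.it")
noisy = read_parallel("ted.asr.en", "ted.it")

mixed = clean + noisy
random.seed(1)
random.shuffle(mixed)  # interleave the two domains so every batch mixes conditions

with open("ted.mixed.en", "w", encoding="utf-8") as fs, \
        open("ted.mixed.it", "w", encoding="utf-8") as ft:
    for src, tgt in mixed:
        fs.write(src + "\n")
        ft.write(tgt + "\n")
```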
3 Experimental Setting
As parent model for fine-tuning we use a Transformer Big model [vaswani2017attention] trained on public and proprietary data using label smoothed cross entropy [szegedy2016rethinking], on about 16 million English-Italian sentence pairs. The model has a layer size of 1024, a hidden size of 4096 in the feed-forward layers, 16 heads in the multi-head attention, and 6 layers in both encoder and decoder. This model is then fine-tuned on the En-It portion of TED Talks in MuST-C. We keep the same training/validation/test split as provided with the corpus (see Table 2). In all experiments, we use Adam [kingma2014adam] with a fixed learning rate, dropout, and label smoothing. Training is performed on Nvidia V100 GPUs with fixed-size token batches, and gradients are accumulated over several batches on each GPU [ott2018scaling]. All texts are tokenized and true-cased with scripts from the Moses toolkit [koehn2007moses], and words are then segmented with BPE [sennrich2015neural] using 32K joint merge rules.
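For concreteness, the tokenization, truecasing and BPE steps can be reproduced with the sacremoses and subword_nmt Python packages; this is a minimal sketch with hypothetical file paths, swapping the Moses perl scripts for their sacremoses port rather than the exact commands used in our experiments:

```python
import codecs
from sacremoses import MosesTokenizer, MosesTruecaser
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

tok = MosesTokenizer(lang="en")
# A truecasing model previously trained on the English side of the
# training data (hypothetical path).
truecaser = MosesTruecaser("en.truecasemodel")

def preprocess(line):
    """Tokenize and truecase one sentence, as done before BPE segmentation."""
    tokenized = tok.tokenize(line.strip(), return_str=True)
    return truecaser.truecase(tokenized, return_str=True)

# Learn 32K joint BPE merge rules on the concatenated, preprocessed
# source and target training text (hypothetical file name).
with codecs.open("train.en-it.tok", encoding="utf-8") as fin, \
        codecs.open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=32000)

bpe = BPE(codecs.open("bpe.codes", encoding="utf-8"))
print(bpe.process_line(preprocess("Most of the time, travellers worry about their luggage.")))
```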
While our main goal is to verify our hypotheses in a large data condition, which requires the inclusion of proprietary data, for the sake of reproducibility we also provide results for systems trained only on TED Talks (small data condition).
We automatically transcribed the entire TED Talks corpus with a general purpose ASR service (Amazon Transcribe: https://aws.amazon.com/transcribe). The resulting word error rate on the test set is 11.0%. The ASR service provides transcripts with predicted punctuation; in the experiments that assume noisy input without punctuation, we simply remove the predicted punctuation.
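Concretely, the Noisy-np condition can be obtained with a simple filter over the tokenized ASR output. A minimal sketch (the punctuation token list is our assumption):

```python
import string

# Tokens consisting only of punctuation, extended with a few multi-character
# marks that appear in the transcripts (assumed list, not exhaustive).
PUNCT = set(string.punctuation) | {"...", "--", "—"}

def strip_punctuation(line):
    """Remove stand-alone punctuation tokens from a tokenized transcript,
    keeping contractions like "'m" intact."""
    return " ".join(tok for tok in line.split() if tok not in PUNCT)

print(strip_punctuation("when I 'm not fighting poverty . I 'm fighting fires ."))
# -> when I 'm not fighting poverty I 'm fighting fires
```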
All the results are evaluated in terms of BLEU score [papineni2002bleu] using the multeval tool [clark2011better].
Table 4: BLEU scores on Clean and Noisy test input of systems fine-tuned on different combinations of in-domain data (large data condition).

|                         | Clean | Noisy |
| Clean                   | 34.9  | 25.9  |
| Clean-np                | 34.2  | 26.6  |
| Clean + Clean-np        | 34.9  | 26.9  |
| Noisy                   | 34.0  | 28.3  |
| Noisy-np                | 34.2  | 28.4  |
| Noisy + Clean           | 35.1  | 28.1  |
| Noisy + Noisy-np        | 34.0  | 28.2  |
| Noisy-np + Clean        | 35.0  | 28.2  |
| Noisy-np + Clean-np     | 34.5  | 27.9  |
| Noisy[-np] + Clean[-np] | 34.9  | 27.7  |
4 Experiments and Results
First, we evaluate the degradation due to ASR noise for systems trained on clean data. Table 3 shows the BLEU scores of our generic system trained on large out-of-domain data (Gen) and of the same system fine-tuned on clean TED Talks data (Ada), with three types of input: manual transcripts (Clean), ASR transcripts with predicted punctuation (Noisy), and ASR transcripts with no punctuation (Noisy-np). As these models hardly generate punctuation when it does not appear in the source text, we also report (in parentheses) BLEU scores computed without punctuation in both hypothesis and reference.
Our results show that in-domain fine-tuning is beneficial in all scenarios, but more with clean input (+2.6 points) than with noisy input (+1.4 points). Translating noisy input results in a 26% relative drop in BLEU (from 34.9 to 25.9), which is apparently not due to punctuation errors, since the relative drop when evaluating without punctuation is similar (22%). Providing noisy input with no punctuation works even worse, probably due to the larger mismatch with the training/tuning conditions.
Table 5: BLEU scores in the small data condition (systems trained only on TED Talks); the "np" columns refer to the same input with punctuation removed.

|                  | Clean | np   | Noisy | np   |
| Clean            | 30.3  | -    | 22.3  | -    |
| Clean-np         | -     | 28.2 | -     | 22.9 |
| Clean + Clean-np | 29.7  | 27.9 | 22.9  | 22.9 |
| Noisy            | 25.8  | -    | 23.9  | -    |
| Noisy-np         | -     | 26.4 | -     | 24.1 |
| Noisy-np + Clean | 30.1  | 27.9 | 24.0  | 24.2 |
Table 6: Examples of clean and noisy (ASR) input with the corresponding translations by the baseline (Base NMT) and the robust (Robust NMT) systems.

| Clean      | when I ’m not fighting poverty , I ’m fighting fires as the assistant captain of a volunteer fire company . |
| Noisy      | when I ’m not fighting poverty . I ’m fighting fires . is the assistant captain with volunteer fire company . |
| Base NMT   | Quando non combatto la povertà . Combatto gli incendi . È l’assistente capitano di una compagnia di pompieri volontaria . |
| Robust NMT | Quando non combatto la povertà , combatto gli incendi come assistente capitano di una compagnia di pompieri volontaria . |

| Clean      | that means we all share a common ancestor , an evolutionary grandmother , who lived around six million years ago. |
| Noisy      | that means we all share a common ancestor - on evolutionary grandmother - who lived around six million years ago. |
| Base NMT   | Ciò significa che tutti condividiamo un antenato comune - sulla nonna evolutiva che ha vissuto circa sei milioni di anni fa. |
| Robust NMT | Ciò significa che tutti condividiamo un antenato comune e una nonna evolutiva che è vissuta circa sei milioni di anni fa. |

| Clean      | and we in the West couldn’t understand how anybody would do this , how much this would restrict freedom of speech. |
| Noisy      | — we in the West . I couldn ’t understand how anybody would do — - how much this would restrict freedom of speech. |
| Base NMT   | — Noi occidentali . Non riuscivo a capire come chiunque avrebbe fatto — - quanto questo avrebbe limitato la libertà di parola. |
| Robust NMT | — In Occidente non riuscivo a capire come chiunque avrebbe fatto — , quanto questo avrebbe limitato la libertà di parola. |
Next, we evaluate systems fine-tuned on data resembling the Noisy test condition. Table 4 lists the results of all fine-tuning experiments; the test input is either clean or noisy, with or without punctuation according to the training condition. The first part of Table 4 shows that fine-tuning the Gen model on clean data without punctuation (Clean-np) improves over the Ada model (Clean) when testing on Noisy-np (+0.7) but degrades when testing on Clean (-0.7). Fine-tuning the same model on clean data under both punctuation conditions (Clean + Clean-np) improves by 1 point on Noisy input without any loss on Clean input. This result shows that it is possible to make a model robust to noisy text while preserving high quality on proper text.
The second part of Table 4 lists the results of fine-tuning on noisy data, with and/or without punctuation. In all cases, BLEU scores on noisy input improve by 1 to 2 points over the best systems tuned on clean data only, reaching values above 28. However, scores on clean input degrade by 0.7 and 0.9 points (i.e. 34.2 and 34.0). If we adapt on both clean and noisy data, the scores on the two input conditions reach a better balance. In particular, training on both Noisy-np and Clean data yields 35.0 and 28.2 on the two input conditions, which is the best overall working point (together with Noisy + Clean). It is worth pointing out that this configuration obtains 33.2 points (not in the table) with the best possible noisy input, i.e. no errors and no punctuation, which is still 1.7 points below the score on Clean input.
Finally, if we expose the system to all types of data (Noisy-np + Noisy + Clean + Clean-np) we do not see any improvement over our top results, which suggests that Clean-np data do not provide information beyond what Noisy-np already provides.
For the sake of replicability, we also trained our systems from scratch on the TED Talks data only. The results, listed in Table 5, show the same trend as the results discussed so far. We did not evaluate the *-np systems on input with punctuation, as all punctuation marks would be out-of-vocabulary words. The main difference resides in the result on Clean input with the Noisy system (25.8 points), which is much worse than the result with the Noisy-np + Clean system (30.1 points), a gap of more than 4 points. This result suggests that training on noisy data can affect the model negatively if it is not balanced with clean data.
Table 7: Manual evaluation under the Noisy (ASR) input condition: percentage of preferences for each system, with and without ties.

|                  | w/ ties | w/o ties |
| Clean            | 10.5    | 32.8     |
| Noisy-np + Clean | 21.7    | 67.2     |
5 Manual Evaluation
We carried out a manual evaluation (crowd-sourced via figure-eight.com) to assess the quality of Noisy-np + Clean against Clean, the reference baseline, under the Noisy input condition. We ran a head-to-head evaluation on the first 10 sentences of each test talk, for a total of 260 sentences, asking annotators to blindly rank the two system outputs (ties were also permitted). We collected three judgments for each output, from 11 annotators, for a total of 780 scores. Inter-annotator agreement measured with Fleiss' kappa was 0.39.
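For reference, Fleiss' kappa for this setting (three judgments per item, three categories) can be computed with the statsmodels package; the sketch below uses toy counts, since the per-item judgments are not reported here:

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Toy example: each row is one evaluated sentence pair, each column a category
# (Clean preferred, Noisy-np + Clean preferred, tie); rows sum to the 3
# judgments collected per item. The real data comprise 260 such rows.
counts = np.array([
    [0, 3, 0],  # all three annotators prefer the robust system
    [1, 1, 1],  # complete disagreement
    [0, 1, 2],  # one preference for the robust system, two ties
])
print(fleiss_kappa(counts))  # the paper reports kappa = 0.39 on the real data
```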
Results reported in Table 7 confirm with high confidence the differences observed with BLEU: the output of the Noisy-np + Clean system is preferred about 10% more often than the output of the Clean system, while almost 68% of the time the two outputs are judged comparable.
From manual inspection, we found that the translations by the robust system that are unanimously ranked best indicate that error recovery mostly occurs on punctuation and on non-content words such as articles, prepositions and conjunctions (see Table 6). In general, errors on content words that affect the meaning are not recovered.
6 Related Work
A recent study [chen2017mitigating] proposed to tackle ASR errors as a domain adaptation problem in the context of dialog systems. Domain adaptation for NMT has been widely studied in recent years [chu2018survey]. In [khayrallah-etal-2018-regularized], fine-tuning was used to adapt NMT to multiple domains simultaneously, while in [britz2017effective] adversarial adaptation was proposed to avoid degradation in the original domain. Our training on multiple domains simultaneously to prevent catastrophic forgetting is inspired by [stojanov2019incremental], which proposed an incremental learning scheme that trains the network with data from previous tasks when a new task is learned; we adapt this scheme to our multi-domain scenario.
Punctuation insertion based on monolingual text has attracted research for a long time [huang2002maximum, matusov2006automatic, lu2010better, ueffing2013improved] and has seen recent improvements with deep networks [cho2012segmentation, cho2015punctuation, tilk2015lstm, tilk2016bidirectional, salloum2017deep]. However, this approach, which is meant to make ASR output more readable for humans, cannot resolve ambiguities caused by missing punctuation.
A more recent line of research aims at using pauses and audio features to better predict punctuation [christensen2001punctuation, klejch2017sequence, zelasko2018punctuation, nanchen2019empirical, yi2019self], although it has been shown that the use of pauses is highly speaker-dependent [igras2016structure]. In [peitz2011modeling], it is shown that implicit learning of punctuation in MT systems is at least as good as inserting punctuation in either the input or the output text, but the effect was studied only on correct input deprived of punctuation, not on noisy input. On the other side, [sperber2017toward] studied how to improve robustness to misrecognized words, but did not study robustness to punctuation errors in isolation. We close the gap by studying the combined effect of misrecognized words and missing punctuation, as well as how robustness to noisy data affects translation quality on clean input.
7 Conclusion
We have studied the robustness to input errors of NMT systems for speech translation using a fine-tuning approach. We observed that a system trained to implicitly learn the target punctuation can recover part of the quality degradation due to ASR errors, by up to 1 BLEU point, while fine-tuning on noisy input can improve quality by more than 2 BLEU points. A system tuned on ASR errors does not obtain further improvement from additional data for implicit punctuation learning. Finally, when fine-tuned on both clean and noisy data, the system becomes robust to noisy input while keeping high performance on clean input.