1 Introduction
Although neural machine translation (NMT) has achieved state-of-the-art performance in recent years (cho2014learning; bahdanau2015neural; vaswani2017attention), most NMT models still decode slowly because they are autoregressive: the generation of each target token depends on all previously generated target tokens, which makes the decoding process inherently sequential and hard to parallelize.
Recently, non-autoregressive neural machine translation (NAT) models (gu2018non; li2019hint; wang2019non; guo2019non; wei2019imitation) have been investigated to mitigate slow decoding by generating all target tokens independently and in parallel, which speeds up the decoding process significantly. Unfortunately, these models suffer from the multi-modality problem (gu2018non), resulting in inferior translation quality compared with autoregressive NMT. Specifically, a source sentence may have multiple feasible translations, and because NAT models discard the dependencies among target tokens, each target token may be generated with respect to a different feasible translation. This generally manifests as repeated or missing tokens in the translations. Table LABEL:tab:multi-modality shows an example. The German phrase “viele Farmer” can be translated as either “lots of farmers” or “a lot of farmers”. In the first translation (Trans. 1), “lots of” is translated with respect to “lots of farmers” while “of farmers” is translated with respect to “a lot of farmers”, so that “of” is generated twice. Similarly, “of” is missing in the second translation (Trans. 2). Intuitively, the multi-modality problem has a significant negative effect on the translation quality of NAT.
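The example above can be reproduced with a small toy sketch. It is not a NAT model: it simply assumes that each of four target positions independently copies its token from one of the two feasible translations, with the shorter translation padded to a common length. The names `MODE_A`, `MODE_B`, and `decode_independently`, as well as the `<pad>` alignment, are hypothetical and chosen only for illustration.

```python
from itertools import product

# Two feasible translations of "viele Farmer". The shorter one is padded
# so that both occupy four positions (an assumption made for this toy example).
MODE_A = ["lots", "of", "farmers", "<pad>"]
MODE_B = ["a", "lot", "of", "farmers"]
MODES = {"A": MODE_A, "B": MODE_B}


def decode_independently(choices):
    """Position i copies its token from the translation named by choices[i]."""
    tokens = [MODES[c][i] for i, c in enumerate(choices)]
    return " ".join(t for t in tokens if t != "<pad>")


# Enumerate every way the four positions can pick a translation independently.
outputs = {
    "".join(choices): decode_independently(choices)
    for choices in product("AB", repeat=len(MODE_A))
}

print(outputs["AAAA"])  # lots of farmers      (consistent)
print(outputs["BBBB"])  # a lot of farmers     (consistent)
print(outputs["AABB"])  # lots of of farmers   (repeated "of")
print(outputs["BBAA"])  # a lot farmers        (missing "of")
```

Of the 16 possible assignments in this sketch, only `AAAA` and `BBBB` follow a single translation throughout; the others mix the two, which mirrors the repeated and missing tokens described above.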