Scaling Laws for Neural Machine Translation

by Behrooz Ghorbani et al.

We present an empirical study of the scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a predictable scaling law. Specifically: (i) we propose a formula that describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we also show that the total number of parameters alone is not sufficient for such predictions. (ii) We observe different power-law exponents when scaling the decoder versus scaling the encoder, and provide recommendations for the optimal allocation of encoder/decoder capacity based on this observation. (iii) We report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (whether machine-generated or human-translated). We observe that natural text on the target side enjoys scaling, which manifests as a successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets originally translated from the target language to the source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from the source language to the target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.
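The abstract's claims (i) and (ii) can be illustrated with a small sketch. The function below implements a plausible bivariate power-law form consistent with the description (separate power-law terms for encoder and decoder size plus an irreducible loss floor); all constants and exponents here are hypothetical placeholders, not the paper's fitted values.

```python
def predicted_loss(n_enc, n_dec, alpha=0.30, p_e=0.38, p_d=0.62,
                   n_ref=1e8, l_inf=1.5):
    """Hedged sketch of a bivariate scaling law for cross-entropy loss.

    n_enc, n_dec : encoder / decoder parameter counts.
    alpha, p_e, p_d, n_ref, l_inf : HYPOTHETICAL constants, chosen only
    to illustrate the functional form (p_d > p_e models the observation
    that encoder and decoder scale with different exponents).
    """
    return alpha * (n_ref / n_enc) ** p_e * (n_ref / n_dec) ** p_d + l_inf


# Doubling both encoder and decoder lowers the predicted loss.
small = predicted_loss(1e8, 1e8)
large = predicted_loss(2e8, 2e8)

# Two models with the SAME total parameter count but different
# encoder/decoder splits get different predicted losses, illustrating
# why total size alone cannot determine the loss under this form.
enc_heavy = predicted_loss(3e8, 1e8)
dec_heavy = predicted_loss(1e8, 3e8)
```

Under these placeholder exponents, spending the extra capacity on the decoder (`dec_heavy`) yields a lower predicted loss than the encoder-heavy split, which is the kind of capacity-allocation recommendation claim (ii) refers to.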
