Learning Accurate Integer Transformer Machine-Translation Models

01/03/2020 ∙ by Ephrem Wu, et al. ∙ Xilinx Inc.

We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers, as opposed to the more costly single-precision floating-point (FP32) hardware. Unlike previous work, which converted only 85 Transformer matrix multiplications to INT8, leaving 48 out of 133 of them in FP32 because of unacceptable accuracy loss, we convert them all to INT8 without compromising accuracy. Tested on the newstest2014 English-to-German translation task, our INT8 Transformer Base and Transformer Big models yield BLEU scores that are 99.3% to 100% relative to those of the corresponding FP32 models. Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training. To demonstrate the robustness of this approach, we also include results from INT6 Transformer models.




1 Introduction

We report a method for training accurate yet compact Transformer machine-translation models [Vaswani et al.2017]. Specifically, we aim these models at hardware with 8-bit integer (INT8) matrix multipliers. Compared to single-precision floating-point (FP32) matrix multiplications, INT8 matrix multiplications not only reduce both storage and bandwidth by a factor of four, but they also consume 15 times less energy [Horowitz2014]. We therefore have two goals: 1) to convert all matrix multiplications from FP32 to INT8, and 2) to maintain translation accuracy relative to the FP32 model.

The Transformer model has proven to be a powerful attention-based model for machine translation [Bahdanau et al.2015], and has inspired much follow-up research in this area. For example, the term “Transformer” appears 56 times in the findings of the 2018 Conference on Machine Translation (WMT18) [Bojar et al.2018b] and 105 times in those of WMT19 [Barrault et al.2019]. The Transformer model in [Vaswani et al.2017] comes in two forms: Transformer Base, which has 61 million parameters, and Transformer Big, which has 210 million parameters. These models are large compared to the convolutional neural network benchmark ResNet-50 [He et al.2016], a 25M-parameter model, but still small compared to the OpenAI GPT-2 models [Radford et al.2019], a family of Transformer models with 117M, 345M, 774M, and 1.6B parameters. Transformer model sizes show no signs of shrinking. For instance, [Huang et al.2019] reports multilingual machine-translation results for even larger Transformer models with up to 6 billion parameters. It is therefore useful to explore techniques for reducing parameter representation costs, for instance by going from FP32 to INT8, rather than holding out for smaller Transformer architectures.

2 Related Work

We draw inspiration from three papers on quantizing machine-translation models to INT8. The first paper, [Wu et al.2016], was published before the Transformer [Vaswani et al.2017]. The authors of this paper quantized parameter (weight) tensors but not non-parameter tensors in a range-preserving fashion. The second paper, [Junczys-Dowmunt et al.2018], reported INT8 Transformer Big English-to-German translation results but did not report any results for INT8 Transformer Base. The third paper, [Bhandare et al.2019], did the opposite: it reported INT8 Transformer Base English-to-German translation results but not for Transformer Big.

Before the Transformer paper was published, [Wu et al.2016] described a quantization-aware training method for an attention-based LSTM neural machine translator. They treated parameter tensors differently from non-parameter tensors. In particular, they quantized parameter tensors to INT8 in a range-preserving fashion. Among the non-parameter tensors, they treated logits differently from the rest: they clipped logits to one fixed range, and they clipped the remaining non-parameter tensors to a range that they annealed over the course of training. In the LSTM module, matrix multiplication operands were 8-bit integers, and accumulators were 16-bit integers. All other operations in the LSTM module were 16-bit operations. The softmax function and the attention mechanism remained as floating-point operations. These authors reported an English-to-German translation BLEU score of 24.61 (newstest2014). Similarly, we quantized parameters to 8-bit integers in a range-preserving manner. We attempted to clip floating-point non-parameter tensors before uniform quantization, but observed that clipping after rounding was simpler and yielded accurate Transformer models. Furthermore, we did not manually select clipping ranges for non-parameters. Our training method automatically adjusts clipping ranges to make range-precision trade-offs.

On the basis of a literature overview, we believe that Microsoft’s Marian team was the first to publish newstest2014 English-to-German translation BLEU scores using integer Transformer models [Junczys-Dowmunt et al.2018]. With a beam size of 1, the FP32 Transformer Big model achieved a BLEU score of 28.1, and the 8-bit model scored 27.5. The BLEU score of the 8-bit Transformer Base model was absent, but the 16-bit model yielded 27.4, which is the same as that of the FP32 model. These authors did not have to retrain the 16-bit Transformer Base model; FP32 parameters and non-parameter tensors were scaled down by 1024 before 16-bit uniform quantization to prevent overflow. The 8-bit Transformer Big model, however, required retraining. Matrix multiplication input tensors (both parameters and non-parameters) were clipped to a fixed range chosen to maximize BLEU scores before 8-bit uniform quantization. Similarly, we retrained Transformer Big to obtain an INT8 model. Whereas the INT8 model in [Junczys-Dowmunt et al.2018] exhibits a 2.13% drop in BLEU score relative to the original FP32 model, our INT8 model shows only 0.334% and 0.685% drops relative to our FP32 model for case-insensitive (uncased) and case-sensitive (cased) results, respectively.

[Bhandare et al.2019] used calibration to obtain an INT8 Transformer Base model from a pre-trained FP32 model. Before summarizing this approach, a distinction should be made between two types of layers that multiply matrices, because these authors quantized only a subset of one of these layer types. The two layer types are called dense and matmul. Whereas a dense layer multiplies a trainable parameter matrix by a non-parameter matrix (e.g., a matrix of activations), a matmul layer multiplies two non-parameter matrices (e.g., a query matrix and a key matrix). In both Transformer Base and Transformer Big, there are 97 dense layers and 36 matmul layers. The matmul layers compute attention weights, a key feature of the attention-based Transformer model, and we also convert these layers to INT8. To avoid unacceptable accuracy loss, these authors left 48 out of 133 matrix multiplications as FP32 operations, namely, 12 out of 97 dense layers and all 36 matmul layers. They calibrated the FP32 model for conversion to INT8 using 600 out of 3003 sentences of various lengths from the validation dataset. For each dense layer, they used KL divergence to limit the floating-point range before uniform quantization. Using symmetric 8-bit uniform quantization, this method achieved a BLEU score of 27.30 for the newstest2014 English-to-German translation task, a drop of 0.38 points or 1.4% relative to the FP32 model. There were no accuracy results for Transformer Big.

We obtained accurate 8-bit Transformer Base and Transformer Big models for English-to-German translation on newstest2014. In addition, we separately reported the BLEU scores for both cased and uncased scenarios. We converted all 133 matrix multiplications, that is, 97 dense layers and 36 additional matmul layers in the attention modules, into 8-bit integer operations. We did so without having to manually decide which layers should be converted to INT8 and which should not. We let our models learn optimal range-precision trade-offs for all non-parameter tensors. The resulting models yielded BLEU scores that were 99.3% to 100% relative to those from the FP32 reference models.

3 Integer Matrix Multiplication

To use an integer matrix multiplier to approximate floating-point matrix multiplication, we approximate a floating-point tensor $\mathbf{X}$ as the product of a floating-point threshold scalar $s_x$ and an integer tensor $\mathbf{X}_{\text{int}}$:

$$\mathbf{X} \approx s_x \mathbf{X}_{\text{int}} \tag{1}$$

The threshold scalar may be limited to an integer power of 2 so that multiplying an integer matrix by it becomes an arithmetic shift operation.

Sections 3.1 and 3.2 will discuss how a model learns its threshold scalars. In hardware, we surround an integer matrix multiplier with data-type converters whose cost is amortized over the matrix multiplier array (Fig. 1).

To explicitly indicate data types in tensor variables, we use a subscript “int,” “uint,” or “fp” for signed integers, unsigned integers, or floating-point numbers, respectively. This subscript is optionally followed by the data-type width in bits. Without a data-type subscript, tensor elements are by default considered to be IEEE single-precision floating-point (FP32) numbers. See Fig. 1 for an example.

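Figure 1: Applying floating-point operands to an integer matrix multiplier. The FP32 matrix multiplier on the left computes $\mathbf{X}\mathbf{W}$. The integer matrix multiplier on the right approximates $\mathbf{X}\mathbf{W}$ as $s_x s_w \mathbf{X}_{\text{int}} \mathbf{W}_{\text{int}}$, where $s_x$ and $s_w$ are the threshold scalars for $\mathbf{X}$ and $\mathbf{W}$, respectively, such that $\mathbf{X} \approx s_x \mathbf{X}_{\text{int}}$ and $\mathbf{W} \approx s_w \mathbf{W}_{\text{int}}$.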
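As a minimal NumPy sketch of the idea in Fig. 1 (not the hardware data path itself), the snippet below approximates a floating-point matrix product with one integer matrix product and a single floating-point rescale. The symmetric INT8 range [-127, 127], the random stand-in matrices, and all variable names are assumptions made for illustration.

```python
import numpy as np

BITS = 8
P = 2 ** (BITS - 1) - 1   # assumed symmetric signed range [-P, P]; 127 for INT8

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 16))   # stand-in activation matrix
w = rng.uniform(-1.0, 1.0, size=(16, 8))   # stand-in weight matrix

# Threshold scalars chosen so that nothing is clipped in this toy example.
s_x = np.abs(x).max() / P
s_w = np.abs(w).max() / P

# Round to nearest (ties to even), then clip: x ~ s_x * x_int, w ~ s_w * w_int.
x_int = np.clip(np.rint(x / s_x), -P, P).astype(np.int32)
w_int = np.clip(np.rint(w / s_w), -P, P).astype(np.int32)

y_fp = x @ w                             # floating-point reference
y_int = (s_x * s_w) * (x_int @ w_int)    # integer matmul plus one rescale

max_abs_error = float(np.abs(y_fp - y_int).max())
```

The data-type conversions touch each operand once, whereas the integer multiplier performs the bulk of the arithmetic, which is why their cost can be amortized over the multiplier array.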

Converting an integer tensor into a floating-point tensor given a threshold scalar is straightforward with Eq. 1. However, when we start with a floating-point tensor, which is the case after we have trained a floating-point model, both the threshold scalar and the integer tensor are unknown. The question is how the threshold scalar should be set. If the distribution of the tensor's elements is such that saturating more of the tails will reduce training loss, then we should pick a sufficiently small threshold scalar to clip more of the tails. The threshold scalar is therefore a knob for choosing between range and precision. Experimentally, we determined that in order to stabilize training, we should preserve the range for parameter (weight and bias) tensors and learn the threshold scalars for non-parameter tensors.

Specifically, given $b$ bits for encoding a signed integer, we divide the floating-point tensor by its threshold scalar, round the result to integers, and clip the integers to a signed integer range representable in $b$ bits (Eq. 2). If all elements of the tensor are non-negative, we clip to an unsigned integer range instead (Eq. 3).

Eqs. 2 and 3 use bankers’ rounding (round half to even). Note that, unlike [Wu et al.2016] and [Junczys-Dowmunt et al.2018], these equations do not clip the original FP32 tensor. Clipping occurs only after the scaled tensor has been rounded to integers. Experimentally, we observed that clipping FP32 tensors directly is detrimental to accuracy, so the goal is to find per-tensor threshold scalars that maintain accuracy. To this end, we separate threshold scalars into range-preserving and non-range-preserving ones. We call a threshold scalar that does not clip the distribution range-preserving. Range-preserving threshold scalars are derived from the tensor range, whereas non-range-preserving ones are learned.
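The quantizers described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact form of Eqs. 2 and 3: the signed range is taken to be the symmetric interval [-p, p] with p = 2^(b-1) - 1, the unsigned range is taken to be [0, 2^b - 1], and the function names are hypothetical. `np.rint` rounds halves to the nearest even integer, which is bankers' rounding.

```python
import numpy as np

def quantize_signed(x, s, bits=8):
    """Scale by 1/s, round half to even, then clip to [-p, p] (assumed range)."""
    p = 2 ** (bits - 1) - 1
    return np.clip(np.rint(np.asarray(x) / s), -p, p).astype(np.int32)

def quantize_unsigned(x, s, bits=8):
    """Scale by 1/s, round half to even, then clip to [0, 2**bits - 1] (assumed range)."""
    q = 2 ** bits - 1
    return np.clip(np.rint(np.asarray(x) / s), 0, q).astype(np.int32)

s = 0.5  # hypothetical threshold scalar (a power of 2, so x / s is exact here)
x = np.array([-100.0, -0.75, -0.25, 0.25, 0.75, 1.25, 100.0])

# x / s = [-200, -1.5, -0.5, 0.5, 1.5, 2.5, 200]; ties go to the even integer,
# and only the already-rounded integers are clipped.
x_int = quantize_signed(x, s)
x_hat = s * x_int                      # floating-point values the integers stand for

a_uint = quantize_unsigned([0.0, 0.25, 200.0], s)
```

Note that the floating-point input is never clipped directly: saturation happens on the integer side, after rounding.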

3.1 Range-Preserving Quantization

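The threshold scalars in Eqs. 2 and 3 represent range-precision trade-offs. They are range-preserving when they are derived from the tensor range so that no element is clipped: for signed elements, from the largest absolute value in the tensor (Eq. 4), and for unsigned elements, from the largest value in the tensor (Eq. 5).

In Transformer, we compute range-preserving threshold scalars for the weight and bias tensors in all 97 dense layers.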
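A range-preserving threshold scalar can be sketched as below, assuming that the largest magnitude in the tensor is mapped onto the largest representable integer (127 for signed INT8 and 255 for unsigned INT8 under the range assumptions used here). The helper name and the example tensors are hypothetical.

```python
import numpy as np

def range_preserving_scalar(x, bits=8, signed=True):
    """Threshold scalar that maps the tensor's extreme value to the integer limit."""
    x = np.asarray(x, dtype=np.float64)
    if signed:
        return float(np.abs(x).max()) / (2 ** (bits - 1) - 1)
    return float(x.max()) / (2 ** bits - 1)

w = np.array([[-0.8, 0.1], [0.4, 0.2]])          # stand-in weight tensor
s_w = range_preserving_scalar(w)                 # 0.8 / 127
w_int = np.rint(w / s_w).astype(np.int32)        # already within [-127, 127]

a = np.array([0.0, 0.5, 1.0])                    # stand-in non-negative tensor
s_a = range_preserving_scalar(a, signed=False)   # 1.0 / 255
a_uint = np.rint(a / s_a).astype(np.int32)       # already within [0, 255]
```

Because the scalar is computed from the tensor itself, clipping never changes the rounded integers; all of the approximation error comes from rounding.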

3.2 Learned Range-Precision Trade-offs

We train threshold scalars for non-parameter tensors to achieve optimal range-precision trade-offs [Jain et al.2019]. If precision is more important than range in minimizing loss, training reduces the threshold scalar. Conversely, if range is more important than precision, training increases the threshold scalar.

In Transformer, we train the threshold scalars for the remaining operands in all 97 dense layers, as well as for the queries and key-value pairs in the attention modules. In a dense layer, the weight operand uses the range-preserving scalar computed by Eq. 4, and the other operand has its threshold scalar learned. During inference, the integer weight matrix and the product of the two threshold scalars become constants.

Specifically, the integer matrix multiplier receives two floating-point constants during inference: one to convert the non-parameter input to integers and one to convert the integer matrix product back into the floating-point format. We compute the integer weight matrix offline using Eqs. 2 and 4. During inference, the accelerator quantizes the input, multiplies the integer matrices, and rescales the integer product.
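The offline and online halves of a dense layer can be sketched as follows. This is an illustrative NumPy model, assuming a bias-free layer, a symmetric INT8 range, and 32-bit accumulators; `prepare_dense`, `dense_int8`, and the value used for the learned input scalar are hypothetical.

```python
import numpy as np

BITS = 8
P = 2 ** (BITS - 1) - 1   # assumed symmetric signed range [-P, P]

def prepare_dense(w_fp, s_x):
    """Offline: quantize the weights and fold the scalars into two constants."""
    s_w = float(np.abs(w_fp).max()) / P                 # range-preserving scalar
    w_int = np.clip(np.rint(w_fp / s_w), -P, P).astype(np.int8)
    return w_int, 1.0 / s_x, s_x * s_w                  # weights, input scale, output scale

def dense_int8(x_fp, w_int, inv_s_x, out_scale):
    """Online: float -> int8, integer matmul with int32 accumulation, int32 -> float."""
    x_int = np.clip(np.rint(x_fp * inv_s_x), -P, P).astype(np.int8)
    acc = x_int.astype(np.int32) @ w_int.astype(np.int32)
    return out_scale * acc

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(3, 16))   # stand-in non-parameter input
w = rng.uniform(-1.0, 1.0, size=(16, 4))   # stand-in weight matrix
s_x = 1.0 / 127                            # stands in for a learned threshold scalar

w_int, inv_s_x, out_scale = prepare_dense(w, s_x)
y = dense_int8(x, w_int, inv_s_x, out_scale)
max_abs_error = float(np.abs(y - x @ w).max())
```

Only `dense_int8` runs at inference time; everything in `prepare_dense` depends on trained quantities and can be computed once.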

3.2.1 Custom Gradients

Following [Jain et al.2019], we compute custom gradients to train threshold scalars for range-precision trade-offs. On the basis of Eq. 2, an element-wise operation produces a floating-point tensor with quantization noise from the input floating-point tensor: it quantizes the input as in Eq. 2 and multiplies the resulting integers by the threshold scalar, which is constrained to be positive. Let the training loss depend on the output of this operation. The local gradient with respect to the floating-point input tensor is a straight-through estimator (STE) [Bengio et al.2013] (Eq. 6).

For stability reasons, the threshold scalar is trained through its logarithm (see Appendix B in [Jain et al.2019]). Again using the STE, we obtain the local gradient with respect to the log of the threshold scalar (Eq. 7).


In TensorFlow, we implemented these custom gradient functions using the tf.custom_gradient decorator in Python. Specifically, range-preserving tensors use only Eq. 6 and not Eq. 7, since their threshold scalars in each step are always calculated according to either Eq. 4 or Eq. 5.
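The local gradients can be illustrated with a framework-agnostic NumPy sketch; the experiments used tf.custom_gradient, whereas this code only mirrors the arithmetic. It assumes the quantize-dequantize operation q(x; s) = s · clip(round(x/s), -p, p) with a symmetric range, and it assumes a natural logarithm for the trained parameter; with a base-2 logarithm, as in [Jain et al.2019], the threshold gradient is multiplied by ln 2. Treating rounding as the identity in the backward pass gives a gradient of 1 with respect to x inside the clipping range and 0 outside, and a gradient of s · (round(x/s) - x/s) with respect to log s inside the range and s · (±p) outside. Function names are hypothetical.

```python
import numpy as np

def fake_quant(x, log_s, bits=8):
    """Forward pass: quantize with scalar s = exp(log_s), then dequantize."""
    s = np.exp(log_s)
    p = 2 ** (bits - 1) - 1
    return s * np.clip(np.rint(x / s), -p, p)

def fake_quant_grads(x, log_s, upstream, bits=8):
    """Backward pass with the straight-through estimator (round treated as identity)."""
    s = np.exp(log_s)
    p = 2 ** (bits - 1) - 1
    r = np.rint(x / s)
    inside = (r >= -p) & (r <= p)
    dq_dx = inside.astype(x.dtype)                          # 1 inside the range, 0 outside
    dq_dlog_s = s * np.where(inside, r - x / s, np.clip(r, -p, p))
    grad_x = upstream * dq_dx
    grad_log_s = float(np.sum(upstream * dq_dlog_s))        # scalar parameter: sum over elements
    return grad_x, grad_log_s

x = np.array([-10.0, 0.3, 0.5, 10.0])
log_s = np.log(0.25)                  # hypothetical current value of the trained parameter
y = fake_quant(x, log_s, bits=4)      # p = 7, so +/-10 saturate at +/-1.75
grad_x, grad_log_s = fake_quant_grads(x, log_s, np.ones_like(x), bits=4)
```

Saturated elements push the threshold scalar up (more range), while in-range elements contribute their rounding error, which is how training trades range against precision.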

3.2.2 Attention Mechanism

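The attention mechanism in the Transformer is worthy of note. Each soft look-up in an attention module has the form $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})\,\mathbf{V}$, where $d_k$ is the width of $\mathbf{K}$. None of the matrices $\mathbf{Q}$, $\mathbf{K}$, or $\mathbf{V}$ is constant during inference, so we use only non-range-preserving scalars for all matrix multiplication operands when computing the look-up from the three matrices. Given the floating-point tensors $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ during training, we encode each integer element with $b$ bits, and train four threshold scalars for inference: one each for $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$, and the attention-weight matrix. The attention-weight matrix, $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$, is the only matrix converted to unsigned integers because the softmax output is positive.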
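A single-head soft look-up with both matrix multiplications carried out on integers might look like the sketch below. It assumes that the softmax itself stays in floating point, that signed operands use a symmetric range and the attention weights use an unsigned range, and that the four threshold scalars are given; all names and scalar values are hypothetical.

```python
import numpy as np

def quantize(x, s, bits=8, signed=True):
    """Round x / s to nearest (ties to even), then clip to the integer range."""
    hi = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    lo = -hi if signed else 0
    return np.clip(np.rint(x / s), lo, hi).astype(np.int32)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_fp(q, k, v):
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def attention_int(q, k, v, s_q, s_k, s_v, s_a, bits=8):
    # First matmul: two signed integer operands, rescaled by s_q * s_k.
    logits = (s_q * s_k) * (quantize(q, s_q, bits) @ quantize(k, s_k, bits).T)
    a = softmax(logits / np.sqrt(k.shape[-1]))        # kept in floating point here
    # Second matmul: unsigned attention weights times signed values.
    a_uint = quantize(a, s_a, bits, signed=False)
    return (s_a * s_v) * (a_uint @ quantize(v, s_v, bits))

rng = np.random.default_rng(0)
q = rng.uniform(-1.0, 1.0, size=(3, 8))
k = rng.uniform(-1.0, 1.0, size=(5, 8))
v = rng.uniform(-1.0, 1.0, size=(5, 8))

out_int = attention_int(q, k, v, s_q=1 / 127, s_k=1 / 127, s_v=1 / 127, s_a=1 / 255)
out_fp = attention_fp(q, k, v)
max_abs_error = float(np.abs(out_int - out_fp).max())
```

Using the unsigned range for the attention weights spends all 2^b codes on the interval that softmax outputs can actually occupy.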

3.2.3 Number of Trained Threshold Scalars

The Transformer architecture consists of a stack of six encoder modules, a stack of six decoder modules, a linear-projection layer, and finally a softmax layer. Although Transformer Base has eight attention heads and Transformer Big has 16 heads, we use the same threshold scalar for each tensor operand across all heads within the same module. As a result, the number of threshold scalars is independent of the number of attention heads. Each encoder module consists of 6 dense layers and 2 matmul layers. Each decoder module consists of 10 dense layers and 4 matmul layers. Each dense layer needs just one threshold scalar for the sole non-parameter input. Each matmul layer requires two trained threshold scalars because neither of the inputs is a parameter matrix. We thus train 6 × (6 + 2 × 2) + 6 × (10 + 2 × 4) = 168 threshold scalars among the encoder and decoder modules. The non-parameter tensor operand of the final dense layer, whose weights are the embedding table, uses one additional threshold scalar. We therefore train 169 threshold scalars to make range-precision trade-offs in the Transformer.
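The bookkeeping above can be checked with a few lines of arithmetic. The per-module layer counts used here (six encoder and six decoder modules, 6 dense and 2 matmul layers per encoder module, 10 dense and 4 matmul layers per decoder module) are assumptions based on the standard Transformer layout; they reproduce the totals of 97 dense layers, 36 matmul layers, and 169 trained threshold scalars stated in the text.

```python
ENCODER_MODULES, DECODER_MODULES = 6, 6
ENC_DENSE, ENC_MATMUL = 6, 2      # Q, K, V, output projections + 2 feed-forward; QK^T and AV
DEC_DENSE, DEC_MATMUL = 10, 4     # two attention blocks + 2 feed-forward
FINAL_DENSE = 1                   # projection that shares the embedding table

dense_layers = ENCODER_MODULES * ENC_DENSE + DECODER_MODULES * DEC_DENSE + FINAL_DENSE
matmul_layers = ENCODER_MODULES * ENC_MATMUL + DECODER_MODULES * DEC_MATMUL

# One trained scalar per dense layer (its non-parameter input),
# two per matmul layer (both operands are non-parameter tensors).
trained_scalars = dense_layers * 1 + matmul_layers * 2
```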

4 Training Floating-Point Transformer Models

To establish FP32 baselines, we trained both Transformer Base and Transformer Big [Vaswani et al.2017] using the models provided in Tensor2Tensor v1.12 [Vaswani et al.2018]. The key difference between the Transformer models in [Vaswani et al.2018] and the original models in [Vaswani et al.2017] is that [Vaswani et al.2018] applied layer normalization to the sentence representation matrix before computing attention weights. This difference does not change the accuracy of Transformer Base and Transformer Big but is beneficial for deeper Transformer models [Wang et al.2019]. We also used the training recipes in [Vaswani et al.2018], which we outline in the next section.

4.1 Training Data

We used the default dataset in the Tensor2Tensor v1.12 English-to-German translation task (translate_ende_wmt32k_packed). This dataset has 4.6 million sentence pairs drawn from three WMT18 [Bojar et al.2018a] parallel corpora: News Commentary V13, Europarl V7, and Common Crawl. We used t2t-datagen to create a vocabulary table with 33288 subwords, corresponding to about 142 million subwords in the training dataset.

4.2 Hardware and Hyperparameters

We trained both the Transformer Base and Big models using eight-core Google Cloud TPUs. Specifically, we trained Transformer Base on TPUv2-8 and Transformer Big on TPUv3-8, both with a batch size of 2048 subwords per TPU core. An epoch is therefore about 9000 steps. We trained Transformer Base using the hyperparameter set transformer_tpu for 300,000 steps, and we trained Transformer Big using transformer_big_tpu for 600,000 steps. Unlike the hyperparameter sets for GPUs, those for TPUs use the Adafactor optimizer [Shazeer and Stern2018], with a warm-up phase followed by an inverse-square-root decay schedule for the learning-rate factor.
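The epoch length follows from the dataset size in Section 4.1 and the batch size above. The short calculation below is a back-of-the-envelope check that ignores any padding in the packed dataset; the epoch counts it derives for the two training runs are estimates, not figures reported in the text.

```python
SUBWORDS_IN_TRAINING_SET = 142_000_000   # "about 142 million" (Section 4.1)
BATCH_SUBWORDS_PER_CORE = 2048
TPU_CORES = 8

subwords_per_step = BATCH_SUBWORDS_PER_CORE * TPU_CORES           # 16384
steps_per_epoch = SUBWORDS_IN_TRAINING_SET / subwords_per_step    # roughly 8.7k, i.e. about 9000

base_epochs = 300_000 / steps_per_epoch   # Transformer Base FP32 training
big_epochs = 600_000 / steps_per_epoch    # Transformer Big FP32 training
```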

4.3 Post-FP32 Training

Our goal is to convert floating-point tensors to integer tensors for all dense and matmul layers. We first tried simultaneously quantizing the dense-layer weight tensors and learning the threshold scalars for non-parameter tensors. However, this did not result in convergence. Therefore, we tried a different approach, which, although simple, achieved high BLEU scores without convergence problems. In this approach, we fine-tuned parameter tensors and non-parameter tensors separately over three to six epochs, the first three of which were mandatory and the remaining three optional.

  1. In the first epoch, we converted weight and bias tensors to integers using Eqs. 2 to 5 while leaving non-parameter tensors in the floating-point format.

  2. In the second epoch, we froze the integer parameter tensors and measured the maximum absolute value of all FP32 non-parameter tensors that are inputs to dense layers.

  3. At the start of the third epoch, we initialized the threshold scalar of each dense-layer non-parameter input, while still using the same integer parameter tensors. The training loss increased abruptly (Fig. 2) but settled quickly. This pattern suggests that preserving the range of these tensors is suboptimal and that clipping them helps in minimizing the loss.

  4. (Optional) The fourth epoch continued refining threshold scalars.

  5. (Optional) In the fifth epoch, we froze threshold scalars and fine-tuned the integer parameters.

  6. (Optional) In the sixth epoch, we continued fine-tuning the integer parameters.
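The six steps above can be summarized as a small schedule table. This is one reading of the procedure expressed as data, assuming that parameters are trainable in the first epoch and that nothing is trained while ranges are measured in the second; the `Phase` class and its field names are hypothetical and not part of any training framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    epoch: int
    optional: bool
    train_parameters: bool    # weight and bias tensors receive gradient updates
    train_thresholds: bool    # threshold scalars of non-parameter tensors are updated
    note: str

SCHEDULE = [
    Phase(1, False, True, False, "integer weights and biases; non-parameter tensors stay FP32"),
    Phase(2, False, False, False, "integer parameters frozen; record max |x| of dense-layer inputs"),
    Phase(3, False, False, True, "initialize threshold scalars, then learn them"),
    Phase(4, True, False, True, "continue refining threshold scalars"),
    Phase(5, True, True, False, "threshold scalars frozen; fine-tune integer parameters"),
    Phase(6, True, True, False, "continue fine-tuning integer parameters"),
]

mandatory_epochs = [p.epoch for p in SCHEDULE if not p.optional]
```

The property that matters is that no phase updates parameters and threshold scalars at the same time, since attempting both simultaneously did not converge.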

Fig. 2 illustrates the training loss for six epochs after FP32-training.

Figure 2: Loss after FP32 training of the Transformer Base model. Dotted vertical lines delineate epochs. The loss spike at the beginning of the third epoch is due to unoptimized initial threshold scalars for non-parameter tensors. In the third and fourth epochs, gradients with respect to parameters stop propagating to allow threshold scalars to reduce loss, in effect learning to make range-precision trade-offs.

5 Experimental Results

Using the validation dataset (newstest2013), we obtained integer weights from the checkpoint with the highest BLEU score in the last two epochs of parameter fine-tuning. We used t2t-bleu [Popel and Bojar2018] to report the BLEU scores of FP32 models and integer models. Scores from t2t-bleu are the same as those from sacrebleu [Post2018]. We used a beam size of 4 in beam search and length penalty 0.6, and recorded both cased and uncased scores. Table 1 shows BLEU scores for both FP32 models and INT8 models. Unlike the BLEU scores in Table 2 in [Vaswani et al.2017], we did not apply checkpoint averaging to obtain Table 1. Still, we obtained the same BLEU score for Transformer Base (27.3) and a higher BLEU score for Transformer Big (29.2 as opposed to 28.4).

Because we used the same tensor2tensor code base, validation dataset, and test dataset as [Bhandare et al.2019] did, our INT8 Transformer Base BLEU scores can be directly compared to theirs, in which the FP32 baseline BLEU score was 27.68 and the symmetric INT8 BLEU score was 27.30 (-1.4%). By contrast, our INT8 Transformer Base BLEU score did not drop at all, either cased or uncased.

Although [Bhandare et al.2019] did not report results for Transformer Big, our Transformer-Big INT8 BLEU scores can be compared to [Junczys-Dowmunt et al.2018], in which the BLEU score dropped from 28.1 to 27.5 (-2.1%). By contrast, our Transformer Big INT8 BLEU score drop was only 0.7% for cased and 0.3% for uncased, despite the higher baseline scores for the FP32 model at 29.2 cased and 29.6 uncased.
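The relative drops quoted in this comparison can be reproduced from the BLEU scores given in the text. The helper name below is hypothetical, and the percentages are computed from the rounded scores, so they can differ in later digits from values computed on unrounded scores.

```python
def relative_drop_percent(fp32_bleu, int_bleu):
    """Relative BLEU loss of the integer model, as a percentage of the FP32 score."""
    return 100.0 * (fp32_bleu - int_bleu) / fp32_bleu

drops = {
    "Bhandare et al. 2019, Transformer Base": relative_drop_percent(27.68, 27.30),
    "Junczys-Dowmunt et al. 2018, Transformer Big": relative_drop_percent(28.1, 27.5),
    "this work, Transformer Big cased": relative_drop_percent(29.2, 29.0),
    "this work, Transformer Big uncased": relative_drop_percent(29.6, 29.5),
}
rounded = {name: round(value, 1) for name, value in drops.items()}
```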

To demonstrate the robustness of our approach, we converted the same FP32 Transformer models to even lower precision (INT6). The rightmost column in Table 1 shows a drop of 1.0 BLEU points for Transformer Base (cased and uncased), 1.3 BLEU points for Transformer Big uncased, and 1.4 BLEU points for Transformer Big cased.

To go beyond the newstest2014 dataset, we reported FP32 and INT8 BLEU scores in Table 2 using the same models from Table 1. We observed that 10 of the 24 BLEU scores from the INT8 models in Tables 1 and 2 are the same as or higher than those from the FP32 models.

Model                      FP32   INT8   INT6
Transformer Base Uncased   27.8   27.8   26.8
Transformer Base Cased     27.3   27.3   26.3
Transformer Big Uncased    29.6   29.5   28.3
Transformer Big Cased      29.2   29.0   27.8
Table 1: FP32, INT8, and INT6 Transformer newstest2014 English-to-German translation BLEU scores.

                           newstest2015  newstest2016  newstest2017  newstest2018  newstest2019
Model                      FP32   INT8   FP32   INT8   FP32   INT8   FP32   INT8   FP32   INT8
Transformer Base Uncased   30.2   30.4   35.1   34.9   28.4   28.4   42.3   42.1   38.5   38.1
Transformer Base Cased     29.7   29.9   34.5   34.3   27.8   27.8   41.8   41.7   38.1   37.8
Transformer Big Uncased    32.0   31.9   35.9   35.8   29.5   29.5   43.2   43.1   39.2   39.4
Transformer Big Cased      31.6   31.4   35.3   35.2   29.0   29.0   42.7   42.6   38.8   39.0
Table 2: FP32 and INT8 Transformer newstest2015 to newstest2019 English-to-German translation BLEU scores.
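The count of INT8 scores that match or exceed their FP32 counterparts can be checked directly from the two tables. The snippet assumes that each test set in Table 2 lists the FP32 score first and the INT8 score second; the variable names are hypothetical.

```python
# (FP32, INT8) pairs for newstest2014, from Table 1.
table1 = [(27.8, 27.8), (27.3, 27.3), (29.6, 29.5), (29.2, 29.0)]

# (FP32, INT8) pairs for newstest2015 to newstest2019, from Table 2, one row per model.
table2 = [
    [(30.2, 30.4), (35.1, 34.9), (28.4, 28.4), (42.3, 42.1), (38.5, 38.1)],  # Base uncased
    [(29.7, 29.9), (34.5, 34.3), (27.8, 27.8), (41.8, 41.7), (38.1, 37.8)],  # Base cased
    [(32.0, 31.9), (35.9, 35.8), (29.5, 29.5), (43.2, 43.1), (39.2, 39.4)],  # Big uncased
    [(31.6, 31.4), (35.3, 35.2), (29.0, 29.0), (42.7, 42.6), (38.8, 39.0)],  # Big cased
]

pairs = table1 + [pair for row in table2 for pair in row]
at_least_as_high = sum(int8 >= fp32 for fp32, int8 in pairs)
largest_int8_loss = max(fp32 - int8 for fp32, int8 in pairs)
```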

6 Conclusion

We presented a stable integer quantization approach for Transformer machine-translation models that converts an existing FP32 model to an integer model by alternately optimizing parameter tensors and non-parameter tensors in separate epochs. Unlike previous work, we applied this approach to all 133 matrix multiplications in both Transformer Base and Transformer Big. For the English-to-German translation task on the newstest2014 test set, our INT8 models achieved 99.3% to 100% of the BLEU scores of the FP32 models. To show the robustness of this approach, we extended it to INT6, although perhaps only FPGAs can take advantage of this non-standard format. In addition, we presented BLEU scores from newstest2015 to newstest2019 for the INT8 models to illustrate the usefulness of these models. It is encouraging that 10 out of the 24 BLEU scores from our INT8 models are at least as high as the scores from the FP32 models. Since our quantization approach starts with an FP32 model, we hypothesize that a principled method for selecting local minima (roughly, training checkpoints) in the FP32 optimization landscape may yield even better results [He et al.2019], which is a topic for future research.