has been widely used in natural language processing tasks. By stacking multiple identical encoder/decoder layers with attention modules, it provides a significant performance improvement over previous convolutional or recurrent neural network models kim2014convolutional.
Nevertheless, it is challenging to deploy Transformers on mobile devices due to the high computation cost. For instance, in order to translate a sentence with only 30 words, a Transformer-Big model needs to execute 13G FLOPs and takes 20 seconds on a Raspberry Pi. Such long latency will hurt the user experience on edge devices. Thus we need hardware-efficient Transformers (Figure 1).
There are two common pitfalls when evaluating the efficiency of a Transformer. (1) FLOPs does not reflect the measured latency. Although FLOPs is used as a metric for efficiency in prior work Howard:2017mobilenets; Anonymous:2020efficient, it is not a good latency proxy. As in Figure 2 (Right), models with the same FLOPs can result in very different measured latencies. (2) Different hardware prefers different Transformer architectures. As in Table 1, the Transformer model optimized for one hardware platform is sub-optimal for another because latency is influenced by different factors on different hardware platforms. For example, the embedding dimension has a significant impact on the Raspberry Pi latency but hardly influences the GPU latency (Figure 2).
|Model||BLEU||Latency on GPU||Latency on ARM CPU|
|HAT (GPU)||28.10||147 ms||6491 ms|
|HAT (ARM CPU)||28.15||184 ms||6042 ms|
Inspired by the success of Neural Architecture Search (NAS) pmlr-v80-bender18a; Guo:2019single; Pham:2018tl; cai2019once, we propose to search for Hardware-Aware Transformers (HAT) by directly incorporating latency feedback into the design loop. In this way, we do not need FLOPs as the latency proxy and can search for specialized models for various hardware.
We first construct a large search space with arbitrary encoder-decoder attention and heterogeneous Transformer layers. The traditional Transformer has an information bottleneck between the encoder and decoder. Arbitrary encoder-decoder attention breaks the bottleneck, allowing all decoder layers to attend to multiple and different encoder layers instead of only the last one. Thus low-level information from the encoder can also be used by the decoder. Motivated by Figure 2, we introduce heterogeneous Transformer layers to allow different layers to have different architectures, adapting to various hardware.
To perform a low-cost search in such a large design space, we first train a Transformer supernet – SuperTransformer, which contains many SubTransformers sharing the weights. We train all SubTransformers simultaneously by optimizing uniformly sampled SubTransformers from the SuperTransformer. The performance of a SubTransformer with weights inherited from the SuperTransformer provides a good relative performance approximation for different architectures trained from scratch. Unlike conventional NAS, we only need to pay the SuperTransformer training cost once and can evaluate all the models in the design space with it. Finally, we conduct an evolutionary search to find the best SubTransformer under the hardware latency constraint. Experiments show that HAT can be naturally combined with model compression techniques such as quantization and knowledge distillation.
We evaluate HAT with WMT’14 En-De, WMT’14 En-Fr, WMT’19 En-De, and IWSLT’14 De-En tasks on Raspberry Pi ARM CPU, Intel Xeon CPU, and Nvidia TITAN Xp GPU. Compared with previous work Vaswani:2017attention; So:2019et; Gu:2019levenshtein; Anonymous:2020efficient, HAT achieves up to 3× speedup and 3.7× smaller size over Transformer-Big without loss of accuracy. With 12,041× less search cost, HAT outperforms the Evolved Transformer with 2.7× speedup and 3.6× smaller size. It also achieves up to 1.9× speedup over Levenshtein and Lite Transformers with no BLEU score loss. With 4-bit quantization, HAT can further reach 25× model size reduction.
HAT has three contributions: (1) Hardware-Aware and Specialization. To the best of our knowledge, we are the first to directly involve the hardware feedback in the model design, to reduce NLP model latency for target hardware, instead of relying on proxy signals (FLOPs). For different hardware platforms, specialized models for low-latency inference are explored. (2) Low-cost Neural Architecture Search with a Large Design Space. We propose arbitrary encoder-decoder attention to break the information bottleneck, and heterogeneous layers to let different layers alter their capacity. A weight-shared SuperTransformer is trained to search for efficient models at a low cost. (3) Design Insights. Based on the search results, we reveal some design insights: attending to multiple encoder layers is beneficial for the decoder; GPU prefers shallow and wide models while ARM CPU prefers deep and thin ones.
2 Proposed Approaches
An overview of the HAT framework is shown in Figure 3. We first train a SuperTransformer with a large design space. Then, for a given hardware platform, we collect a dataset of (SubTransformer architecture, measured latency) pairs for different models, and train a latency predictor. Finally, we conduct an evolutionary search with a latency constraint to find an efficient model specialized for the target hardware.
2.1 Design Space
We construct a large design space by breaking two conventions in the Transformer design: (1) All decoder layers only attend to the last encoder layer; (2) All the layers are identical.
Arbitrary Encoder-Decoder Attention.
Different encoder layers extract features on different abstraction levels. Conventionally, all the decoder layers only attend to the last encoder layer. This forms an information bottleneck that forces all the decoder layers to learn solely from the high abstraction level and ignore the low-level information. To break the bottleneck, we propose Arbitrary Encoder-Decoder Attention to learn the most suitable connections between the encoder and the decoder. Each decoder layer can choose multiple encoder layers to attend to. The key and value vectors from the chosen encoder layers are concatenated in the sentence length dimension (Figure 4) and fed to the encoder-decoder cross attention module. The mechanism is efficient because it introduces no additional parameters. The latency overhead is also negligible. For example, with each decoder layer attending to two encoder layers, the latency of Transformer-Base on Nvidia TITAN Xp GPU increases by only 0.4%. The mechanism improves the model capacity by allowing attention to different abstraction levels.
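The concatenation described above can be illustrated with a minimal single-head sketch in plain Python. It is not the implementation used for HAT: the function names are hypothetical, the key/value projections are omitted (the encoder outputs serve directly as both keys and values), and we assume the attended layers are the last few encoder layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single decoder position."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def arbitrary_enc_dec_attention(query, encoder_outputs, num_attended):
    """Attend to the last `num_attended` encoder layers.

    encoder_outputs: list over layers; each layer is a list of token vectors.
    The chosen layers are concatenated along the sentence-length dimension,
    so the attention module itself needs no extra parameters.
    """
    if not 1 <= num_attended <= len(encoder_outputs):
        raise ValueError("num_attended must be between 1 and the number of layers")
    chosen = encoder_outputs[-num_attended:]
    keys = [vec for layer in chosen for vec in layer]
    values = keys  # assumption: no separate key/value projections in this sketch
    return cross_attention(query, keys, values), len(keys)

# Three encoder layers, four source tokens each, vector dim 2.
encoder_outputs = [[[float(l), float(t)] for t in range(4)] for l in range(3)]
context, attended_length = arbitrary_enc_dec_attention([1.0, 0.0], encoder_outputs, 2)
print(attended_length, context)
```

Attending to two layers doubles the attended length (here from 4 to 8 positions) while leaving the attention computation itself unchanged.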
Heterogeneous Transformer Layers.
Previous Transformers repeat one architecture for all layers. In HAT, instead, different layers are heterogeneous, with different numbers of heads, hidden dim, and embedding dim. In attention layers, different heads are used to capture various dependencies. However, Voita:2019analyzing shows that many heads are redundant. We thereby make attention head number elastic so that each attention module can decide its necessary number of heads.
In the FFN layer, the input features are cast to a higher dimension (hidden dim), followed by an activation layer. Traditionally, the hidden dim is set as 2× or 4× of the embedding dim, but this is sub-optimal since different layers need different capacities depending on the feature extraction difficulty. We hence make the hidden dim elastic.
Moreover, we also support elastic embedding dims for the encoder and decoder, but the embedding dim is consistent inside the encoder and inside the decoder. The numbers of encoder and decoder layers are also elastic to learn the proper level of feature encoding and decoding. Other design choices, such as the length of vectors in attention modules, can be naturally incorporated in our framework, which we leave for future work.
It is critical to have a large design space in order to find high-performance models. However, training all the models and comparing their BLEU scores is infeasible. We thus propose SuperTransformer, a supernet for performance approximation, which can judge the performance of a model without fully training it. The SuperTransformer is the largest model in the search space with weight sharing Pham:2018tl; liu2018darts; cai2019once. Every model in the search space (a SubTransformer) is a part of the SuperTransformer. All SubTransformers share the weights of their common parts. For elastic embedding dim, all SubTransformers share the front portion of the longest word embedding and corresponding FC layer weights. As in Figure 5, for elastic FFN hidden dim, the front part of the FC weights is shared. For elastic head number in attention modules, the whole vectors (the lengths are fixed in our design space) are shared by dividing into parts. Elastic layer numbers let all SubTransformers share the first several layers.
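The weight-sharing rule for elastic dimensions can be sketched as slicing the front portion of a SuperTransformer weight matrix. The snippet below is an illustrative toy in plain Python with hypothetical function names; it only shows the slicing idea for a single FC layer, not the handling of heads or layer numbers.

```python
def sub_linear_weight(super_weight, out_dim, in_dim):
    """Return the front (top-left) out_dim x in_dim block of a larger FC weight.

    super_weight is a list of rows (one row per output unit). Every
    SubTransformer that uses this layer reads the same leading entries,
    which is how the weights are shared.
    """
    if out_dim > len(super_weight) or in_dim > len(super_weight[0]):
        raise ValueError("requested dims exceed the SuperTransformer dims")
    return [row[:in_dim] for row in super_weight[:out_dim]]

def linear(weight, x):
    """Apply y = W x for a weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

# Toy SuperTransformer FC weight: 4 hidden units, 3 embedding dims.
super_weight = [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]]

# A smaller SubTransformer with hidden dim 2 and embedding dim 2.
sub_weight = sub_linear_weight(super_weight, out_dim=2, in_dim=2)
print(sub_weight)
print(linear(sub_weight, [1, 1]))
```

Because the slice is taken from the same underlying matrix, a gradient update for any sampled SubTransformer also updates the leading entries used by all smaller ones.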
In the SuperTransformer training, all possible SubTransformers are uniformly sampled, and the corresponding weights are updated. In practice, the SuperTransformer only needs to be trained for the same steps as a baseline Transformer model, which is fast and low-cost. After training, we can get the performance proxy of sampled models in the design space by evaluating the corresponding SubTransformers on the validation set without training.
2.3 Evolutionary Search for SubTransformer
Given a latency requirement, we perform an evolutionary search to find a satisfactory SubTransformer. There are two ways to evaluate the hardware latency of a SubTransformer: (1) Online measurement, in which we measure the models during the search process. (2) Offline, where we train a latency predictor to provide the latency. We apply the offline method here because it is fast and accurate. For the online method, a single sampled SubTransformer requires hundreds of inferences to get an accurate latency, which lasts for minutes and slows down the search. For the offline method, we encode the architecture of a SubTransformer into a feature vector, and predict its latency instantly with a multi-layer perceptron (MLP). Trained with thousands of real latency data points, the predictor yields high accuracy (Figure 6). Note that the predicted latency is only used in the search process, and we report real measured latency in the experiment section. Compared with deducing a closed-form latency model for each hardware, the latency predictor method is more general and faster.
We use an evolutionary algorithm to conduct the search process. As in Figure 3, the search engine queries the latency predictor for SubTransformer latency, and validates the loss on the validation set. The engine only adds SubTransformers with latency smaller than the hardware constraint to the population. We then train the searched models from scratch to obtain the final performance.
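The search loop can be sketched as follows. This is a simplified illustration, not the HAT code: the architecture is reduced to four global genes, and `toy_latency` and `toy_loss` are hypothetical stand-ins for the latency predictor and the validation loss of a SubTransformer with inherited weights. The population sizes, mutation probability, and iteration count are the ones reported in our implementation details.

```python
import random

SPACE = {
    "embed_dim": [512, 640],
    "hidden_dim": [1024, 2048, 3072],
    "heads": [4, 8],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
}

def random_arch(rng):
    return {gene: rng.choice(options) for gene, options in SPACE.items()}

def mutate(arch, rng, prob):
    """Resample each gene independently with probability `prob`."""
    return {g: (rng.choice(SPACE[g]) if rng.random() < prob else v) for g, v in arch.items()}

def crossover(a, b, rng):
    """Pick each gene from one of the two parents at random."""
    return {g: rng.choice([a[g], b[g]]) for g in SPACE}

def toy_latency(arch):
    # Hypothetical stand-in for the trained latency predictor.
    return arch["decoder_layers"] * (arch["embed_dim"] + arch["hidden_dim"]) / 1000.0

def toy_loss(arch):
    # Hypothetical stand-in for the validation loss: larger models score lower.
    return 10.0 / (arch["decoder_layers"] * arch["embed_dim"] * arch["hidden_dim"]) ** 0.1

def sample_feasible(make, latency_fn, limit, max_tries=1000):
    """Draw candidates until one satisfies the latency constraint."""
    for _ in range(max_tries):
        candidate = make()
        if latency_fn(candidate) <= limit:
            return candidate
    raise RuntimeError("no candidate met the latency constraint")

def evolutionary_search(latency_fn, loss_fn, limit, rng, iterations=30,
                        population=125, parents=25, mutations=50,
                        crossovers=50, mutation_prob=0.3):
    pop = [sample_feasible(lambda: random_arch(rng), latency_fn, limit)
           for _ in range(population)]
    for _ in range(iterations):
        top = sorted(pop, key=loss_fn)[:parents]
        children = [sample_feasible(lambda: mutate(rng.choice(top), rng, mutation_prob),
                                    latency_fn, limit) for _ in range(mutations)]
        children += [sample_feasible(lambda: crossover(*rng.sample(top, 2), rng),
                                     latency_fn, limit) for _ in range(crossovers)]
        pop = top + children
    return min(pop, key=loss_fn)

best = evolutionary_search(toy_latency, toy_loss, limit=8.0, rng=random.Random(0))
print(best, toy_latency(best))
```

Only candidates that pass the latency check ever enter the population, so the returned architecture always satisfies the constraint under the predictor being used.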
We conduct experiments on four machine translation tasks: WMT’14 En-De, WMT’14 En-Fr, WMT’19 En-De, and IWSLT’14 De-En, consisting of 4.5M, 36.3M, 43.0M, and 160K pairs of training sentences, respectively. For WMT’14 En-De, we apply a 32K source-target BPE vocabulary, train on WMT’16, validate on newstest2013 and test on newstest2014, replicating Wu:2019payless. For WMT’14 En-Fr, we use a 40K source-target BPE vocabulary, validate on newstest2012&2013, and test on newstest2014, replicating Gehring:2017conv. WMT’19 En-De adopts a 49.6K source-target BPE vocabulary, validates on newstest2017, and tests on newstest2018, the same as junczys2019microsoft. We use a 10K joint BPE vocabulary in lower case for IWSLT’14 De-En Grave:2017ef.
3.2 Experiment Setups
Our baseline models are Transformer Vaswani:2017attention, Levenshtein Transformer Gu:2019levenshtein, both with the Ott:2019fairseq implementation, Evolved Transformer So:2019et and Lite Transformer Anonymous:2020efficient.
For evaluation, we use beam four and length penalty 0.6 for WMT, and beam five for IWSLT Vaswani:2017attention. All BLEUs are calculated with case-sensitive tokenization (https://github.com/moses-smt/mosesdecoder), but we also apply the compound splitting BLEU (https://github.com/tensorflow/tensor2tensor) for WMT, the same as Vaswani:2017attention. We test the model with the lowest validation set loss for WMT and the last ten checkpoints averaged for IWSLT.
We test the latency of the models by measuring translation from a source sentence to a target sentence with the same length. The length is the average output length on the test set – 30 for WMT and 23 for IWSLT. For each model, we measure the latency 300 times, remove the fastest and slowest 10%, and then take the average of the remaining 80%. We conduct experiments on three representative hardware platforms: Raspberry Pi-4 with an ARM Cortex-A72 CPU, Intel Xeon E5-2640 CPU, and Nvidia TITAN Xp GPU.
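The trimmed averaging above can be written as a short sketch. The helper names are hypothetical, `run_once` stands for one translation call on the target device, and timing uses the standard `time.perf_counter`.

```python
import time

def trimmed_mean(samples, trim_fraction=0.1):
    """Drop the fastest and slowest `trim_fraction` of samples, average the rest."""
    ordered = sorted(samples)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def measure_latency(run_once, repeats=300, trim_fraction=0.1):
    """Time `run_once` repeatedly and return the trimmed mean in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return trimmed_mean(samples, trim_fraction)

# Placeholder workload standing in for one translation call.
print(measure_latency(lambda: sum(range(1000)), repeats=300))
```

With 300 repeats and a 10% trim on each side, the 30 fastest and 30 slowest runs are discarded, which reduces the influence of warm-up effects and background interference.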
|Task||Model||Hardware-Aware||Hetero. Layers||Latency||#Params||FLOPs (G)||BLEU||GPU Hours||CO2e (lbs)||Cloud Comp. Cost|
|IWSLT’14 De-En||Transformer||✗||✗||3.3s||32M||1.5||34.5||2||5||$12 - $40|
|IWSLT’14 De-En||HAT (Ours)||✓||✓||2.1s||23M||1.1||34.5||4||9||$24 - $80|
|WMT’14 En-Fr||Transformer||✗||✗||23.2s||176M||10.6||41.2||240||68||$178 - $595|
|WMT’14 En-Fr||Evolved Trans.||✗||✗||20.9s||175M||10.8||41.3||2,192,000||626,000||$1.6M - $5.5M|
|WMT’14 En-Fr||HAT (Ours)||✓||✓||7.8s||48M||3.4||41.4||216||61||$159 - $534|
|WMT’14 En-Fr||HAT (Ours)||✓||✓||9.1s||57M||3.9||41.8||224||64||$166 - $555|
|WMT’14 En-De||Transformer||✗||✗||20.5s||176M||10.6||28.4||184||52||$136 - $456|
|WMT’14 En-De||Evolved Trans.||✗||✗||7.6s||47M||2.9||28.2||2,192,000||626,000||$1.6M - $5.5M|
|WMT’14 En-De||HAT (Ours)||✓||✓||6.0s||44M||2.7||28.2||184||52||$136 - $456|
|WMT’14 En-De||HAT (Ours)||✓||✓||6.9s||48M||3.0||28.4||200||57||$147 - $495|
CO2 emissions (lbs) and cloud computing cost (USD) for Transformer, the Evolved Transformer and HAT. The training cost estimation is adapted from Strubell:2019uv. The training time is for one Nvidia V100 GPU, and the latency is measured on the Raspberry Pi ARM CPU. The cloud computing cost is based on AWS.
3.3 Implementation Details
The SuperTransformer for WMT has the following design space: [512, 640] for embedding dim, [1024, 2048, 3072] for hidden dim, [4, 8] for the head number in all attention modules, and [1, 2, 3, 4, 5, 6] for decoder layer number. Due to decoder auto-regression, the encoder accounts for less than 5% of the measured latency; we therefore fix the encoder layer number at 6. For arbitrary encoder-decoder attention, each decoder layer can choose to attend to the last one, two, or three encoder layers. The SuperTransformer design space for IWSLT is the same as WMT except for [2048, 1024, 512] for hidden dim and [4, 2] for head number. We fix the vector dim at 512. The design space contains around possible SubTransformers and covers a wide range of model size and latency (largest = 6× smallest). We train the SuperTransformers for 40K steps on WMT and 50K steps on IWSLT.
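As an illustration of how a SubTransformer could be drawn uniformly from the WMT design space listed above, consider the following sketch. The dictionary layout and the names `WMT_SPACE` and `sample_subtransformer` are assumptions made for this example and are not taken from a released codebase.

```python
import random

# WMT design space as listed in the text; the key names are our own.
WMT_SPACE = {
    "embed_dim": [512, 640],
    "hidden_dim": [1024, 2048, 3072],
    "heads": [4, 8],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "attended_encoder_layers": [1, 2, 3],
}
ENCODER_LAYERS = 6  # fixed, since the encoder contributes little to latency

def sample_subtransformer(rng):
    """Uniformly sample one heterogeneous SubTransformer configuration."""
    def pick(key):
        return rng.choice(WMT_SPACE[key])

    # Hidden dim and head number are chosen per layer (heterogeneous layers).
    encoder_layers = [
        {"hidden_dim": pick("hidden_dim"), "self_heads": pick("heads")}
        for _ in range(ENCODER_LAYERS)
    ]
    decoder_layers = [
        {
            "hidden_dim": pick("hidden_dim"),
            "self_heads": pick("heads"),
            "cross_heads": pick("heads"),
            "attended_encoder_layers": pick("attended_encoder_layers"),
        }
        for _ in range(pick("decoder_layers"))
    ]
    # Embedding dim is chosen once for the encoder and once for the decoder.
    return {
        "encoder": {"embed_dim": pick("embed_dim"), "layers": encoder_layers},
        "decoder": {"embed_dim": pick("embed_dim"), "layers": decoder_layers},
    }

config = sample_subtransformer(random.Random(0))
print(len(config["decoder"]["layers"]), config["decoder"]["embed_dim"])
```

During SuperTransformer training, one such configuration would be sampled per step and only the corresponding shared weights updated.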
Hardware-Aware Evolutionary Search Setups.
The input of the latency predictor is a feature vector of the SubTransformer architecture with ten elements: layer number, embed dim, average hidden dim, and average self-attention heads, of both encoder and decoder; plus average encoder-decoder attention heads, and the average number of encoder layers each decoder layer attends to. A dataset of 2000 (SubTransformer architecture, measured latency) samples for each hardware is collected, and split into train:valid:test=8:1:1. We normalize the features and latency, and train a three-layer MLP with 400 hidden dim and ReLU activation. We choose three layers because this is more accurate than the one-layer model, and using more than three layers does not improve accuracy further. With the predictor, we conduct an evolutionary search for 30 iterations in the SuperTransformer, with population 125, parents population 25, mutation population 50 with 0.3 probability and crossover population 50.
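The ten-element feature encoding and the shape of the predictor can be sketched as below. This is only a structural illustration: the configuration layout is hypothetical, the MLP weights are random and untrained, and the feature and latency normalization and the training loop are omitted.

```python
import random

def mean(values):
    return sum(values) / len(values)

def encode_features(cfg):
    """Map a SubTransformer config to the ten-element feature vector."""
    enc, dec = cfg["encoder"], cfg["decoder"]
    return [
        float(len(enc["layers"])),                      # encoder layer number
        float(enc["embed_dim"]),                        # encoder embed dim
        mean([l["hidden_dim"] for l in enc["layers"]]), # encoder avg hidden dim
        mean([l["self_heads"] for l in enc["layers"]]), # encoder avg self-attn heads
        float(len(dec["layers"])),                      # decoder layer number
        float(dec["embed_dim"]),                        # decoder embed dim
        mean([l["hidden_dim"] for l in dec["layers"]]), # decoder avg hidden dim
        mean([l["self_heads"] for l in dec["layers"]]), # decoder avg self-attn heads
        mean([l["cross_heads"] for l in dec["layers"]]),             # avg enc-dec heads
        mean([l["attended_encoder_layers"] for l in dec["layers"]]), # avg attended layers
    ]

def init_mlp(sizes, rng):
    """Random, untrained weights; a real predictor is fit to measured latencies."""
    return [
        ([[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def mlp_forward(params, x):
    """Linear layers with ReLU between them; the last layer is linear."""
    for i, (weight, bias) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]

example_cfg = {
    "encoder": {"embed_dim": 512,
                "layers": [{"hidden_dim": 2048, "self_heads": 8}] * 6},
    "decoder": {"embed_dim": 512, "layers": [
        {"hidden_dim": 3072, "self_heads": 8, "cross_heads": 8, "attended_encoder_layers": 1},
        {"hidden_dim": 1024, "self_heads": 4, "cross_heads": 4, "attended_encoder_layers": 3},
    ]},
}
features = encode_features(example_cfg)
predictor = init_mlp([10, 400, 400, 1], random.Random(0))  # three linear layers
print(features)
print(mlp_forward(predictor, features))
```

Because the features are averages and counts rather than a full per-layer description, the input stays at a fixed length of ten regardless of how many decoder layers the SubTransformer has.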
Our training settings are in line with Wu:2019payless and Anonymous:2020efficient. For WMT, we train for 40K steps with Adam optimizer and a cosine learning rate (LR) scheduler Kingma:2015adam; Loshchilov:2016sgdr, where the LR is linearly warmed up from to , and then cosine annealed. For IWSLT, we train for 50K steps with inverse square root LR scheduler. The baseline Transformers are trained with the same settings as the searched SubTransformers for fair comparisons.
4.1 HAT Performance Comparisons
In Figures 7 and 8 and Appendix Table 8, we compare HAT with Transformer baselines on four tasks. The embedding dims are 512 and 1024 for the Transformer-Base and Big, respectively. The hidden dims are and of the embedding dim for WMT and IWSLT. The IWSLT models are smaller to prevent overfitting Wu:2019payless. We obtain a series of baseline models with layer number scaling (yellow) and dimension scaling (blue). We set different latency constraints on three hardware platforms to get a series of HAT models. HAT consistently outperforms baselines with a large gap under different latency constraints. On ARM CPU, HAT is 3× faster and 3.7× smaller than Transformer-Big with the same BLEU. On Intel CPU, HAT achieves over 2× speedup. On Nvidia GPU, the dashed blue line is nearly vertical, indicating that dimension scaling can hardly reduce the latency. In this case, HAT can still find models with low latency and high performance.
We further compare various aspects of HAT with Transformer Vaswani:2017attention and Evolved Transformer So:2019et in Table 2. HAT achieves up to 1.6×, 3×, and 3.4× speedup with up to 1.4×, 3.7×, and 4× smaller size than baselines. We report FLOPs for translating a 23-token sentence for IWSLT and a 30-token sentence for WMT. We show the overall GPU hours for training the SuperTransformer and the searched SubTransformer. We also calculate the cloud computing costs with different modes: “preemptable” is cheaper ($0.74/h) than “on-demand” ($2.48/h) Strubell:2019uv. HAT is highly affordable since the total GPU-hour count is over 12,000× smaller than that of the Evolved Transformer, and is even smaller than that of Transformer-Big by virtue of the compact model size.
|Model||Latency||BLEU|
|Evolved Transformer So:2019et||3.7s||25.40|
|Lite Transformer Anonymous:2020efficient||3.4s||25.79|
In Table 3, we compare HAT with other recent models. We scale down all models to have similar BLEU scores to Levenshtein for fair comparisons. We adopt the average iteration time of 2.88 for decoding Gu:2019levenshtein, without limiting the length of the output sentence (12 tokens after decoding). HAT runs 1.3× faster than Transformer with higher BLEU, and 1.9× faster than Levenshtein with 0.7 higher BLEU. Under similar latency, HAT also outperforms Lite Transformer. These results demonstrate HAT’s effectiveness in lower latency scenarios. Our framework can also be adopted to speed up those models.
For all HAT WMT models in Figure 7, 10% of all decoder layers attend to three encoder layers, and 40% attend to two encoder layers. This demonstrates the necessity of arbitrary encoder-decoder attention.
In Appendix Figure 12, we visualize the models specialized for different hardware mentioned in Table 1. We find that the GPU model is wide but shallow, while the Raspberry Pi model is deep but thin. The phenomenon echoes our latency profiling (Figure 2), as GPU latency is insensitive to embedding and hidden dim, but Raspberry Pi latency is highly sensitive. It guides manual designs: on GPU, we can reduce the layer number and increase the dimension to reduce latency and keep high performance.
HAT achieves higher BLEU with 1.5× lower latency and 1.5× smaller size compared with the largest SubTransformer (Table 4). This suggests that larger models do not always provide better performance, and demonstrates the effectiveness of HAT. We also compare the evolutionary search with random search (Figure 9). Evolutionary search can find models with lower losses than random search.
SubTransformer Performance Proxy.
All SubTransformers inside the SuperTransformer are uniformly sampled and thus equally trained, so the performance order is well-preserved during training. We conduct experiments to show the effectiveness of the SubTransformer performance proxy as in Table 5 and Appendix Figure 11. The BLEUs of SubTransformers with inherited weights and with weights trained from scratch are very close. More importantly, they also have the same relative performance order. Therefore, we can rely on the proxy to search for high-performance model architectures, significantly reducing the search cost.
|WMT’14 En-De||WMT’14 En-Fr|
|Inherited Val Loss||Inherited BLEU||From- Scratch BLEU||Inherited Val Loss||Inherited BLEU||From- Scratch BLEU|
Low Search Cost.
As shown in Table 2 and Figure 10, the search cost of HAT is 12,041× lower than that of the Evolved Transformer. Although both use evolutionary search, the key difference is that the Evolved Transformer needs to train all individual models and sort their final performance to pick the top ones; in contrast, HAT trains all models together inside the SuperTransformer and sorts their performance proxy to pick the top ones. The superior performance of HAT indicates that the performance proxy is accurate enough to find good models.
Finetuning Inherited SubTransformers
In Section 4.1, we trained each searched SubTransformer from scratch in order to conduct fair comparisons with baselines. In practice, we can also directly finetune the SubTransformers with the inherited weights from the SuperTransformer to further reduce the training cost. With 10K finetuning steps (1/4 of from-scratch training), the inherited SubTransformers can achieve similar or better performance than ones trained from scratch (Table 6). In this way, the training cost for a model under a new hardware constraint can be further reduced by 4×, since the SuperTransformer training cost is amortizable among all searched models.
|Task||From-Scratch 40K||Inherit-Finetune 10K|
HAT is orthogonal to other model compression techniques such as quantization. We apply K-means quantization to HAT and further reduce the model size. We initialize centroids uniformly in the range of [min, max] of each weight matrix and run at most 300 iterations for each of them. Even without any finetuning, 4-bit quantization can reduce the model size by 25× with negligible BLEU loss compared to the Transformer-Big baseline (Table 7). Interestingly, the 8-bit model even has 0.1 higher BLEU than the full precision model, indicating the robustness of searched HAT. Compared with the Transformer-Base 4-bit quantization baseline, which has 24MB model size and 38.9 BLEU score, HAT has 2.2 higher BLEU with similar model size.
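The K-means quantization procedure can be sketched on a flattened weight matrix as follows. This is a minimal plain-Python illustration with hypothetical function names; it follows the uniform centroid initialization over [min, max] and the iteration cap of 300 described above, and assumes empty clusters simply keep their initial centroid.

```python
def kmeans_quantize(weights, bits, max_iters=300):
    """Quantize a flat list of weights to 2**bits shared centroid values.

    Returns (centroids, assignments); each weight is then stored as a
    `bits`-wide index into the centroid codebook.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Uniform initialization over the [min, max] range of the weight matrix.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(k), key=lambda c: abs(w - centroids[c])) for w in weights
        ]
        if new_assignments == assignments:
            break  # converged: no weight changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [w for w, a in zip(weights, assignments) if a == c]
            if members:  # empty clusters keep their previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

def dequantize(centroids, assignments):
    """Recover approximate weights from the codebook and indices."""
    return [centroids[a] for a in assignments]

weights = [0.0, 0.1, 0.9, 1.0]
centroids, assignments = kmeans_quantize(weights, bits=1)
print(centroids, assignments, dequantize(centroids, assignments))
```

Storing a 4-bit index instead of a 32-bit float shrinks the weight storage by 8×, ignoring the small codebook; the remaining factor in the reported reduction comes from the smaller searched architecture.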
Knowledge Distillation Friendly.
HAT is also orthogonal to knowledge distillation (KD) because HAT focuses on searching for an efficient architecture while KD focuses on better training a given architecture. We combine KD with HAT by distilling token-level knowledge (top-5 soft labels) from a high-performance SubTransformer to a low-performance SubTransformer on the WMT’14 En-De task. The teacher model has a BLEU of 28.5 and 49M parameters; the student model has 30M parameters. KD improves the BLEU of the student model from 25.8 to 26.1.
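Token-level distillation with top-5 soft labels can be sketched for a single target position as below. The function names are hypothetical, and renormalizing the teacher probabilities over the retained top-5 tokens is an assumption of this sketch rather than a detail stated above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_soft_labels(teacher_logits, k=5):
    """Keep the teacher's k most probable tokens and renormalize (assumption)."""
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def distillation_loss(student_logits, soft_labels):
    """Cross entropy between the teacher's soft labels and the student distribution."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(p * (student_logits[i] - log_z) for i, p in soft_labels.items())

# One target position over a toy vocabulary of seven tokens.
teacher_logits = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = top_k_soft_labels(teacher_logits, k=5)
agreeing_student = teacher_logits
disagreeing_student = list(reversed(teacher_logits))
print(distillation_loss(agreeing_student, labels))
print(distillation_loss(disagreeing_student, labels))
```

A student whose distribution agrees with the teacher on the top tokens receives a lower loss than one that ranks them in the opposite order; in training, this term would be averaged over all target positions.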
5 Related Work
|Model||BLEU||Model Size||Size Reduction|
|HAT 8 bits||41.9||57MB||12×|
|HAT 4 bits||41.1||28MB||25×|
Transformer Vaswani:2017attention has prevailed in sequence modeling ng-etal-2019-facebook; junczys2018microsoft. By stacking identical blocks, the model obtains a large capacity but incurs high latency. Recently, a research trend is to modify the Transformer to improve the performance Chen:2018vf; Wu:2019payless; Sukhbaatar:2019adaptive; wang-etal-2019-learning-deep. Among them, Wu:2019payless introduced a convolution-based module to replace the attention; wang-etal-2019-learning-deep proposed to train deep Transformers by propagating multiple layers together in the encoder. zhang2018accelerating and kim-etal-2019-research also proposed AAN and SSRU to replace the attention mechanism. HAT is orthogonal to them and can be combined to search for efficient architectures with those new modules. Another trend is to apply non- or partially-autoregressive models to cut down the iteration number for decoding Gu:2019levenshtein; akoury-etal-2019-syntactically; wei-etal-2019-imitation; gu2018nonautoregressive. Although they reduce latency, they sometimes suffer from low performance. bapna-etal-2018-training explored using learned linear combinations of encoder outputs as decoder inputs, while HAT concatenates the outputs without linear combinations, thus better preserving the low-level information. Anonymous:2020efficient investigated mobile settings for NLP tasks and proposed a multi-branch Lite Transformer. However, it relied on FLOPs for efficient model design, which is an inaccurate proxy for hardware latency (Figure 2). There are also works kim2016sequence; junczys-dowmunt-etal-2018-marian; kim-etal-2019-research; micronet using Knowledge Distillation (KD) to obtain small student models. Our method is orthogonal to KD and can be combined with it to improve the efficiency further. There are also hardware accelerators ham20203; zhang2020sparch for attention and fully-connected layers in the Transformer to achieve efficient processing.
Neural Architecture Search.
In the computer vision community, there has been an increasing interest in automating efficient model design with Neural Architecture Search (NAS) Zoph:2017uo; Zoph:2018ta; Pham:2018tl; He:2018am. Some applied black-box optimization such as evolutionary search apq; Cai:2019ui; He:2018am; learncircuits; learncircuits2; mao2019park; some leveraged backpropagation with differentiable architecture search liu2018darts. Some also incorporated hardware constraints into the optimization, such as MNasNet Tan:2019vw, ProxylessNAS Cai:2019ui, FBNet Wu:2019tk and APQ apq. To reduce the NAS cost, supernet-based methods Pham:2018tl; pmlr-v80-bender18a; Guo:2019single apply a proxy for sub-network performance and adopt search algorithms to find good sub-networks. For NLP tasks, the benefits of architecture search have not been fully investigated. Recently, So:2019et proposed the Evolved Transformer to search for architectures under model size constraints and surpassed the original Transformer baselines. However, it suffered from very high search costs (250 GPU years), making it unaffordable to search for specialized models for various hardware and tasks. In addition, hardware latency feedback was not taken into account for better case-by-case specializations. Since different hardware has distinct architecture and features cong2018understanding, feedback from hardware is critical for efficient NLP.
We propose the Hardware-Aware Transformers (HAT) framework to address the challenge of efficiently deploying Transformer models on various hardware platforms. We conduct hardware-aware neural architecture search in an ample design space with an efficient weight-shared SuperTransformer, consuming four orders of magnitude less cost than the prior Evolved Transformer, and discover high-performance low-latency models. We hope HAT can open up an avenue towards efficient Transformer deployments for real-world applications.
We thank NSF Career Award #1943349, MIT-IBM Watson AI Lab, Semiconductor Research Corporation (SRC), Intel, and Facebook for supporting this research.
Appendix A Appendix for “HAT: Hardware-Aware Transformers for Efficient Natural Language Processing”
A.1 SubTransformer Performance Proxy
In Figure 11, we show the relationship between the validation loss of SubTransformers directly inherited from the SuperTransformer, and the BLEU score of the SubTransformers trained from-scratch. We can observe that the larger the validation loss, the lower the BLEU score. Therefore the validation loss can be a good performance proxy.
A.2 Visualizations of Searched Models on WMT’14 En-De Task
We show the HAT models searched for Raspberry Pi ARM Cortex-A72 CPU and Nvidia TITAN Xp GPU in Figure 12. The searched model for Raspberry Pi is deep and thin, while that for GPU is shallow and wide. The BLEU scores of the two models are similar: 28.10 for Raspberry Pi CPU, and 28.15 for Nvidia GPU.
A.3 Latency, BLEU and SacreBLEU of Searched HAT Models
|Task||Hardware||Latency||BLEU||SacreBLEU|
|WMT’14 En-De||Raspberry Pi ARM Cortex-A72 CPU||3.5s||25.8||25.6|
|WMT’14 En-De||Intel Xeon E5-2640 CPU||137.9ms||25.8||25.6|
|WMT’14 En-De||Nvidia TITAN Xp GPU||57.1ms||25.8||25.6|
|WMT’14 En-Fr||Raspberry Pi ARM Cortex-A72 CPU||4.3s||38.8||36.0|
|WMT’14 En-Fr||Intel Xeon E5-2640 CPU||154.7ms||39.1||36.3|
|WMT’14 En-Fr||Nvidia TITAN Xp GPU||69.3ms||39.1||36.3|
|WMT’19 En-De||Nvidia TITAN Xp GPU||55.7ms||42.4||41.9|
|IWSLT’14 De-En||Nvidia TITAN Xp GPU||45.6ms||33.4||32.5|