The spectacular success of Deep Learning has come riding on the ready availability of data and a tremendous growth in compute capability of deep learning systems. In recent years, compute growth has been driven by specialized architectures for GEMM (General Matrix Multiply) acceleration, and the shift to low precision compute. Inference has witnessed a proliferation of mixed precision compute deepcompression ; rastegariECCV16 ; mellempudi2017ternary ; trn where different operations execute at different precision, all the way from binary/ternary operands to 16b floating point. Similarly, training has also witnessed its share of mixed precision methods, where a combination of half- and single-precision compute is used.
There are at least three half-precision formats in the domain of mixed precision training of large neural networks: FP16nvbaidu_mixed , 16-bit Integer based mixedPrec18 , and BFLOAT16 distbelief ; tf_whitepaper ; cloudTPU . All these methods have 16-bit input operands and 32-bit accumulators for all the computations. Of the three formats, only the first two have publicly available description of training methodology and experimental results on a wide variety of neural networks although BFLOAT16 was originally conceived for deep learning training. BFLOAT16 data format was first introduced as part of distributed training frameworks DistBelief distbelief and Tensorflow tf_whitepaper as a low precision storage format used to reduce communication volumes of weights and activations shared between compute nodes during distributed parallel training111special thanks to Jeff Dean for pointing this out. It has since become an alternative numeric format specifically targeted towards accelerating deep leaning training (mainly within the Google ecosystem cloudTPU ), because of its wider dynamic range and smaller footprint. In this work, we present a detailed mixed precision methodology using BFLOAT16 and demonstrate coverage by training a variety of workloads from image processing (including GANs), to speech/language processing, and recommendation systems. For our experiments we employ a method, where FP32 operations emulate the behavior of BFLOAT16 operations by appropriately zeroing out the lower 16 bits and appropriately rounding the input operands.
We have developed a library called Quantlib to implement the emulation in multiple deep learning frameworks such as IntelCaffe, Caffe2, Neon and Tensorflow. One of the functions Quantlib provides is appropriately modifying the elements of an input FP32 tensor to emulate the behavior of BFLOAT16. Specifically, it zeroes out the lower 16 bits of the FP32 elements and performs RNE (Round to Nearest Even) rounding based on those bits. This modification ensures that the tensor possesses FP32 precision (so that FP32-hardware and FP32-libraries can operate on it), and also provides the exact precision and rounding as would be afforded by BFLOAT16 hardware. Quantlib is called prior to GEMM operations (or other operations which are planned to be implemented in BFLOAT16) to emulate the behavior of BFLOAT16 input operands, while FP32 output of the GEMM naturally fits into the “BFLOAT16-input, FP32-accumulator” schema of this training methodology.
Using multiple deep learning frameworks modified to insert appropriate Quantlib calls, we provide SOTA results for: AlexNet alexnet , ResNet-50 resnet , DC-GAN dcgan , SR-GAN ledig2017photo , DeepSpeech2 amodei2016deep , GNMT wu2016google
, and two industrial workloads namely: a Deep and Cross Network, and a DNN Recommendation System. We also qualitatively compare and contrast the training methodologies using BFLOAT16, FP16 and INT16. We observe that FP16-based training requires tuning an additional hyper-parameter for loss scaling to achieve SOTA results; INT16-based training requires fine grained block-quantization and maintaining block-level scaling factors to achieve SOTA results. In comparison all BFLOAT16 experiments are performed without any hyperparameter changes and BFLOAT16 kernels are expected to be relatively straightforward.
The rest of the paper is organized as follows. Section 2 provides a survey of the literature and describes various attempts at half-precision based training. Section 3 discusses the BFLOAT16 format, operations and data flow in detail. Section 4 describes our experimental results in detail. Section 5 discusses our concluding thoughts.
2 Related Work
Application of low precision datatype in deep learning is a well explored topic in research. Literature shows that various different reduced precision data representations have been investigated which can be broadly classified into two types: the more standard floating-point based formatsnvbaidu_mixed ; borisgtc17fp16 ; dettmers20158 and custom fixed point based formats vanhoucke2011improving ; courbariaux2014training ; gupta2015lp ; hubara2016qnn ; flexpoint .
Custom fixed point representations may offer more flexibility than typical floating point based ones in terms of both increased precision and dynamic range by maintaining separate integer values for precision and range. Consequently, fixed point representations may provide more robust and accurate training of an underlying application. The work reported in vanhoucke2011improving leverages the dynamically scaled fixed point representation proposed in williamson1991dynamically
to speed up convolution neural networks byover an optimized floating point implementation on general purpose CPU hardware. In gupta2015lp , the authors present a comprehensive study on the effect of low precision fixed point computation for deep learning. They also train smaller networks using 16-bit fixed point on specialized hardware.
Researchers have ventured into less than 16-bit precision as well and almost all of them use custom fixed point schemes. Reference courbariaux2014training uses a dynamical fixed point format with low precision multiplications with up to 12-bit operations. This idea is further advanced in courbariaux2015binaryconnect where the authors showcase training with only binary weights while keeping all other tensors and operations in full precision. Another extension hubara2016binarized uses binary activations as well; however, the gradients and the weights are maintained in full precision. Another related work hubara2016qnn uses activations and weights quantized into 6-bits for neural network training with gradients in full precision. The method described in rastegariECCV16
uses binary representation for all components including gradients. However, all the aforementioned methods are shown to work for smaller benchmark model/data-sets only and invariably result in a non-trivial drop in accuracy with larger ImageNet data-setimagenet_data and classification task imagenet_contest . The authors of flexpoint advocate for a fixed point numerical format called Flexpoint that is specifically tailored for executing deep neural networks on a specialized hardware; this datatype is shown to outperform FP16 and achieve numerical parity with FP32 across a diverse set of workloads. A more general dynamic fixed point representation and associated compute primitives are presented in mixedPrec18 , which leverages general purpose hardware using the integer-compute pipeline to match FP32 baseline accuracy across state of the art convolution neural networks.
The disadvantage of these integer based representations in contrast to floating point is the additional overheads of handling shared exponents and managing accumulator overflow. Köster et al.flexpoint proposed an algorithm that predicts the shared exponent ahead of time to eliminate some of these overheads. However this solution requires collection of additional statistics at each layer, which cannot be efficiently computed on general purpose hardware.
A mixed precision training methodology using FP16 and FP32 is reported in nvbaidu_mixed . This work employs FP16 for storing activations, weights and gradients. The computations during forward pass and back propagation use FP16 datatype while results are accumulated into FP32. A master copy of the FP32 weights are preserved for the update operation. The authors successfully perform deep learning training on a wide range of applications encompassing deep networks and larger data-sets (ILSVRC-class problems) at the expense of minimal loss compared to baseline FP32 results. This work, however, underlines that FP16/FP32 mixed precision training entails loss scaling borisgtc17fp16 to attain near-SOTA results. Loss scaling, basically, ensures that back-propagated gradient values are shifted into a range which can be represented by FP16 and therefore the small magnitude (negative exponent) values which are critical for accuracy are preserved.
The need for loss scaling can be avoided by using BFLOAT16 datatype. The hardware numerics of BFLOAT16 on Intel architecture is available at bfloat16 . BFLOAT16 has been underlined to be a crucial ingredient for achieving peta-FLOPS scale on image classification task in yingImage . In cloudTPU
, Google notes the speed ups achieved on using this datatype over FP32 for TensorFlow models for various tasks, such as, image classification, image segmentation, object detection, machine translation. It may be noted that the benefits of BFLOAT16 are not only restricted to machine learning paradigm, the recently published workmcBfloat16 uses this representation for performing Monte Carlo simulations of Ising model which is an important model is statistical physics. In fact, the authors of mcBfloat16 mention BFLOAT16 “provides better training and model accuracy than the IEEE half-precision representation”. The Julia language julia , which is designed to provide high-performance without sacrificing ease of programming, has also been shown to be benefited from BFLOAT16 juliaBfloat16 . OpenAI has also mentioned that optimizing kernels targeting BFLOAT16 is “in active development” openaiBfloat16 .
3 Training with Brain Floating Point
Numerous studies have shown that 16-bits of precision is sufficient for training deep neural networksgupta2015lp ,nvbaidu_mixed ,flexpoint ,mixedPrec18 . Researchers have experimented with various numeric formats to optimize training platforms for power and performance. State-of-the-art training platforms today have chosen IEEE-754 half-precision floating point as the preferred numeric format for deep leaning training. However, the narrow dynamic range of half-precision floating point is not sufficient to represent error gradients during back propagation. To mitigate this, training methods use loss scaling techniquesnvbaidu_mixed to shift gradients into expressible range supported by half-precision floats. While this is easier to implement for certain feed-forward networks as a simple multiplication of loss with a constant scaling factor, others (e.g. recurrent) require a more sophisticated approach to determine the right scaling parameter. This process is often iterative and requires significant time and resource investment from data scientists to optimize it to their network. These software overheads become a hindrance for seamless migration of new deep learning applications to take advantage of the low-precision hardware. Recent developments in software tools such as “automatic mixed precision” nvidia_amp are aimed at easing some of this burden from data scientists. However, these tools in their current form also require changes in the original model code and are not guaranteed to result in sufficient performance gains.
The values are represented as truncated full precision floating point values with 8 bits of mantissa and the dynamic range comparable to FP32 (Table 1). The extended dynamic range can now represent smaller gradient values without applying complicated loss scaling methods, which enables easier migration of deep learning workloads to BFLOAT16 hardware. There are some additional benefits to adopting BFLOAT16 numeric format to build hardware for deep learning. Core compute primitives such as FMA can be built using 8-bit multipliers which lead to significant area and power savings while preserving the full dynamic range of FP32. Table 1 shows the comparison of BFLOAT16 with other standard IEEE floating point formats.
|Data Type||Bit Format||Max||Min||Min||Acc.|
|(s, e, m)||Normal||Normal||Subnormal||Size|
|FP32||1, 8, 23||float32|
|FP16||1, 5, 10||float32|
|BFLOAT16||1, 8, 7||N/A||float32|
Figure 1 shows the mixed precision data flow used to train deep neural networks using BFLOAT16 numeric format. The core compute kernels represented as GEMM operations accept inputs as BFLOAT16 tensors and accumulate the output to FP32 tensors. Quantlib (shown as Q in Figure 1
) modifies these output tensors to BFLOAT16 format before passing them to the next layer. Quantlib is also employed to modify a copy of the FP32 weights to BFLOAT16 for the forward pass. Error gradients with respect to the inputs also in BFLOAT16 format. Non-GEMM compute operations including batch-normalization, and activation functions such as ReLU, tanh and sigmoid also accept BFLOAT16 tensors as inputs. Bias tensors are always maintained in FP32. The weight update step (e.g., in SGD solver) uses the FP32 copy of the weights to maintain model accuracy.
Our evaluation of BFLOAT16 consists of the aforementioned deep learning models from different application domains and frameworks, using the tensor modification method via Quantlib discussed in section 1.
4.1 Convolution Neural Networks
Convolutional neural networks (CNN) have been primarily used for computer vision applications such as image classification, object detection and semantic segmentation. CNNs have been extensively studied both in academia and industry, primarily driven by public benchmarks such as the ImageNet Large Scale Visual Recognition Competition (ILSVRC). Over the past few years the CNNs which have won the ILSVRC competition, have become well established benchmarks. Here we choose AlexNet (ILSVRC 2012)alexnet and ResNet-50 (ILSVRC 2015) resnet as representative models for the BFLOAT16 evaluation. In addition to Convolution and InnerProduct layers (which contributes to majority of the computations), we use BFLOAT16 emulations for ReLU, BatchNorm, Pooling, Dropout and EltWise layers as well. This ensures that the full training pipeline uses BFLOAT16, not necessitating the use of higher precision for the intermediate tensor outputs.
For AlexNet, we used a global minibatch of 1024 running data parallel on 16 nodes for 88 epochs and achieved 57.4% top-1 and 80.7% top-5 accuracy. As shown in Figure2, our BFLOAT16 emulation follows very closely to the actual FP32 run and achieves 57.2% top-1 and 80.1% top-5 accuracy.
Our Resnet-50 experiments are with a global minibatch of 1024 running data parallel on 32 nodes using SGD with Nestrov momentum. We trained for 90 epochs with learning rate warm up for first 5 epochs. Baseline FP32 run achieved top-1 accuracy of 74.7% and top-5 accuracy of 92.0% and as shown in Figure 2, our BFLOAT16 emulation follows the baseline almost exactly and achieving the same top-1 and top-5 accuracy. During training we use local batch statistics for the batch normalization to compute validation accuracy after every epoch. The fully trained BFLOAT16 model, achieves 75.7% top-1 test accuracy with global sample statistics, matching the baseline FP32 results.
4.2 Recurrent Neural Networks
Recurrent neural networks (RNN) unlike the feedforward networks allows for capturing temporal information due to its feedback connections. These models have been popularly used for applications such as automatic speech recognition (ASR) and language processing, which primarily involve sequence-based learning. RNNs have been observed to have more demanding numerical range requirements nvbaidu_mixed and are more sensitive to the half precision datatype. For this class of networks we identify Baidu’s DeepSpeech2 amodei2016deep
and Google’s neural machine translation (GNMT) modelwu2016google as representative candidates for the BFLOAT16 evaluation.
The Deep speech 2 (DS2) topology, consists of two convolution layers followed by 3 bi-directional gated recurrent unit (GRU) layers with 2048 cells and a final inner-product layer as a classifier. We use Adam optimizer to compute connectionist temporal classification loss (CTC)ctc . We use a batch size of 64 and a learning rate of 0.0005. The aforementioned model is trained on the librispeech dataset librispeech , which consists of 460 hour corpus.
4.2.2 Neural Machine Translation
Google’s Neural Machine Translation (GNMT) is the SOTA neural machine translation model using a recurrent network. It uses stack of long short-term memory (LSTM) layers, along with an attention model for language modeling and translation. Table2 compares translation accuracy in terms of achieved BLEU scores for baseline FP32 and BFLOAT16 emulation. We use the small Vietnamese (VI) to English (EN) model and big German (DE) to English (EN) model. BFLOAT16 emulation achieves same or better accuracy than baseline. Figure 3 shows how closely BFLOAT16 emulation run follows the baseline FP32 run.
|ViEn, IWSLT’15 Attention||17.1||18.3|
4.3 Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) have become a very important class of networks, they can be used to learn and mimic any arbitrary distribution of data. GANs achieve this by using two separate generator and discriminators networks in a tightly coupled way. Because GANs combine regression and discrimination tasks during training they tend to have different requirements for numerical precision and range. For the BFLOAT16 evaluation we consider DC-GANdcgan and SR-GAN ledig2017photo models.
represents a critical step in designing GAN architectures which were earlier known to be notoriously difficult to train. This consists of fractionally-strided convolutions with ReLU activations in the generator, whereas convolutions with leaky ReLU activations are used in the discriminator; batch normalization layers are used in both the generator and the discriminator. We have implemented DC-GAN in Caffe and for our experiments all the input tensors (activations, weights) are converted to BFLOAT16 for convolution layers (in both the generator and the discriminator), while only the input activations are converted to BFLOAT16 for batch normalization layers; all other tensors are maintained in full precision.
A comparison between FP32 and BFLOAT16 is shown in Table 3 in terms of inception scores and MS-SSIM. As evident from the table, the outputs obtained for FP32 and BFLOAT16 are comparable.
SR-GAN generates photo-realistic high-resolution images by super-resolving from a single shot of the low-resolution image ledig2017photo . The low-resolution images are scaled
preserving the spatial features while minimizing the noise. The quality of the output is measured using SSIM (Structural Similarity) and MS-SSIM (multi-scale structural similarity) and PNSR (peak signal to noise ratio) metrics. The topology consists of a “generator” network based on Resnet architecture, and the “discriminator” consists of 8 convolution layers each followed by a batchnorm and LeakyReLu. The network uses a “VGG loss” function using a pre-trained VGG19 networkvgg .
For the BFLOAT16 experiments, we converted all the inputs tensors (weights, activations) at convolution layers to BFLOAT16, while keeping the rest of the layers (Batchnorm, ReLU, LeakyReLu and Eltwise) at full precision.
4.4 Industrial Scale Recommendation System
Recommendation system and personalization models are very important for many practical-scale applications. Here we evaluate the Deep & Cross Network deepcross17 on a small Kaggle Criteo Dataset222https://www.kaggle.com/c/criteo-display-ad-challenge/data and a typical DNN recommender system fp16embeddings ; dlrm on a large Terabyte Criteo Dataset333https://www.criteo.com/news/press-releases/2015/07/criteo-releases-industrys-largest-ever-dataset/, which target predicting the ads click-through rate CTR . The accuracy of the recommendation system models is measured by the log loss deepcross17 ; fp16embeddings ; dlrm , which predicts when the users will click on ads. Lower log loss translates into higher prediction accuracy for the recommendation model. Note that an accuracy loss of 0.001 in log loss is considered unacceptable in practice.
For our BFLOAT16 experiments, all input tensors (activations and weights) are converted to BFLOAT16 for fully connected layers in both forward and backward propagation passes. During the weight update stages, we use a FP32 master copy nvbaidu_mixed to reduce the additional accuracy loss. We use either the round-to-nearest or direct truncation scheme when we do the conversion from BFLOAT16 to FP32.
The accuracy evaluation results are shown in Table 5. As we can observe, BFLOAT16 with the round-to-nearest scheme is almost the same as FP32 baseline accuracy, while BFLOAT16 with the direct truncation scheme suffers from a tiny accuracy degradation ().
|Recommendation system||Baseline(FP32)||BFLOAT16 (rnd)||BFLOAT16 (trunc)|
|Deep & Cross Network||0.44372||0.44372||0.44393|
|DNN Recommender System||0.12520||0.12520||0.12537|
4.5 Beyond Emulation - Towards Bare Metal Execution
We close this section by highlighting that our presented emulation strategy is an excellent approximation of future Intel Xeon CPUs. We therefore took the slim CNN training framework GxM, which was presented in Georganas:2018:AHD:3291656.3291744 , and implemented all important operators (convolution, fully-connected, batch-normalization and pooling layers) utilizing the AVX512BF16 instruction set extensions ise
. Therefore, all activation and weight data was only present as 16bit data in memory and the special VNNI (vector neural network instructions) datalayout was employed to support the BFLOAT16 dot-product instruction with FP32 accumulation. The execution of these instructions was done by bit-accurate emulation on current AVX512 silicon with only a very slight performance tax. When training ResNet-50 on Imagenet using the AVX512BF16 instructions, we achieved a Top-1 accuracy of 75.62%, which matches the current state-of-the-art performance. Additionally, the code is already heavily optimized and ready for prime-time. Additionally we have implemented a full BFLOAT16 LSTM-cell which is currently being integrated into Tensorflow.
Our goal in this paper was to establish BFLOAT16 as an alternative half-precision format for Deep Learning training, given that its dynamic range is the same as that of FP32 and conversion to/from FP32 is straightforward. Our empirical study demonstrates that, unlike IEEE 754 half-precision format and 16-bit Integer, BFLOAT16 based training eliminates the need for hyperparameter tuning or complex software management for block quantization. Our study also demonstrates that BFLOAT16 is a robust datatype having the ability to cover the range of tensors across application domains including vision, speech, language, generative networks, and recommendation systems. We expect industry-wide adoption of BFLOAT16 across emerging domains.
-  BFLOAT16 - hardware numerics definition. https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf. Accessed: 2019-03-22.
-  Intel architecture instruction set extensions and future features programming reference. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf. Accessed: 2019-05-21.
-  Using bfloat16 with tensorflow models. https://cloud.google.com/tpu/docs/bfloat16. Accessed: 2019-16-04.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In ICML, pages 173–182, 2016.
-  Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
-  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
-  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
-  Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj D. Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesús Corbal, Nikita Shustrov, Roman Dubtsov, Evarist Fomenko, and Vadim O. Pirogov. Mixed precision training of convolutional neural networks using integer operations. In ICLR, 2018.
-  Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
-  Tim Dettmers. 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561, 2015.
-  Keno Fischer and Elliot Saba. Automatic full compilation of julia programs and ML models to cloud TPUs. CoRR, abs/1810.09868, 2018.
-  Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning convolutions on simd architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18, pages 66:1–66:12, Piscataway, NJ, USA, 2018. IEEE Press.
-  Boris Ginsburg, Sergei Nikolaev, and Paulius Micikevicius. Training of deep networks with half-precision float. NVidia GPU Technology Conference, 2017.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.
-  Scott Gray, Alec Radford, and Diederik P. Kingma. Gpu kernels for block-sparse weights. https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf. Accessed: 2019-16-04.
-  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737–1746, 2015.
-  Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
-  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
-  Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Constable, Oguz Elibol, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J. Pai, and Naveen Rao. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. CoRR, abs/1711.02213, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary residual networks. CoRR, abs/1707.04679, 2017.
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew
Cunningham, Alejandro Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz,
Zehan Wang, et al.
Photo-realistic single image super-resolution using a generative adversarial network.In CVPR, volume 2, page 4, 2017.
-  Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
-  Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
-  Maxim Naumov, Dheevatsa Mudigere, et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019.
-  NVIDIA. Automatic mixed precision for deep learning. https://developer.nvidia.com/automatic-mixed-precision. Accessed: 2019-22-05.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
-  Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
Matthew Richardson, Ewa Dominowska, and Robert Ragno.
Predicting clicks: Estimating the click-through rate for new ads.In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 521–530, New York, NY, USA, 2007. ACM.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, 2011.
-  Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, ADKDD’17, pages 12:1–12:7, New York, NY, USA, 2017. ACM.
-  Darrell Williamson. Dynamically scaled fixed point arithmetic. In PacRim, pages 315–318, 1991.
-  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
-  Kun Yang, Yi-Fan Chen, George Roumpos, Chris Colby, and John R. Anderson. High performance monte carlo simulation of ising model on TPU clusters. CoRR, abs/1903.11714, 2019.
-  Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. CoRR, abs/1811.06992, 2018.
-  Jian Zhang, Jiyan Yang, and Hector Yuen. Training with low-precision embedding tables. In Systems for Machine Learning Workshop at NeurIPS 2018. 2018.