1 Introduction
In the past few years, deep neural networks (DNNs) have demonstrated great success in various domains such as natural language processing Howard and Ruder (2018); Radford et al. (2018), speech recognition Deng et al. (2013); Anwar et al. (2015) and computer vision Simonyan and Zisserman (2014); Szegedy et al. (2015). The increasing size and complexity of DNN models have become an impediment to deploying these models on edge devices, where latency is often a hard requirement. Even on cloud platforms such as Azure, AWS and Google Cloud, where more computational resources are available, complex models such as BERT Devlin et al. (2018) may still cause long service times and high costs for both training and inference.
Quantization has emerged as an effective way to significantly improve the performance of DNN models by using low-bit computations Krishnamoorthi (2018); Jacob et al. (2018); Rastegari et al. (2016b); Han et al. (2015); Lin et al. (2016). By converting the weights and activations of a DNN model from high-precision floating-point values to low-precision representations, quantization reduces the model size and hence the memory required to run the model. A smaller memory footprint can improve cache behavior, because more intermediate results can be kept in the cache for reuse. Moreover, computations on lower-precision representations such as 8-bit integers almost always need fewer clock cycles on contemporary general-purpose CPUs and GPUs than their high-precision counterparts such as 32-bit floating point.
GEMM operations often dominate the computation for training or inference of DNN models Jia (2014). Consequently, high-performance integer GEMM has a great impact on the efficiency of quantized DNN models, where high-precision floating-point matrix elements are converted to low-bit integers, such as 8-bit integers. Recently, there have been increasing efforts to develop high-performance low-precision GEMM libraries, such as FBGEMM Deng. (2018), gemmlowp Jacob and others (2017) and MKL [22], where various techniques such as vectorization and memory layout optimization are applied to GEMM computation. However, these solutions are still not fast enough in certain scenarios.
We developed NGEMM, an optimized GEMM implementation for DNN models based on compiler techniques. NGEMM provides high-performance GEMM computation with different low-precision representations for various target machines. Our experimental results show that NGEMM outperforms MKL by an average of 1.4x. NGEMM has been used in a number of Microsoft production services.
In the rest of the paper, we focus on Intel x86 CPUs, which are ubiquitous in modern computing systems. However, our methodology is widely applicable and not specific to Intel CPUs.
2 Background
In this section, we briefly discuss the conventional approach to implementing low-precision GEMM.
2.1 Conventional Approach
Figure 1 shows the conventional approach Rodriguez et al. (2018) to low-precision GEMM. Matrix A is the weight matrix with size $M \times K$, and is typically determined by training. Matrix B is the input matrix, which contains the input data. Matrix C is the result. In low-precision GEMM, matrices A and B have low numerical precision, such as 8 or 16 bits, while matrix C has higher precision, such as 32 bits. Throughout the example below, we use unsigned 8-bit integers for matrix A and signed 8-bit integers for matrix B. Other integer types are handled similarly.
In Figure 1, ➀, ➁ and ➂ are the typical phases of GEMM vector reduction, which computes the dot product of one row of matrix A and one column of matrix B to produce one element of matrix C. Phase ➀ converts low-precision values to higher precision. VPMADDUBSW takes two 8-bit integer vectors, $v_A$ and $v_B$, which contain unsigned and signed integers loaded from matrices A and B, respectively. It multiplies each unsigned 8-bit integer in $v_A$ by the corresponding signed 8-bit integer in $v_B$, producing a signed 16-bit integer. Lastly, it adds adjacent pairs of the 16-bit integers, with signed saturation, to generate another vector $v_{16}$:
$$v_{16}[i] = \mathrm{sat}_{16}\left(v_A[2i]\, v_B[2i] + v_A[2i+1]\, v_B[2i+1]\right)$$
where $\mathrm{sat}_{16}$ clamps its argument to the signed 16-bit range.
After that, VPMADDWD takes $v_{16}$ and a unit vector (all elements equal to 1) to perform horizontal reduction, which produces a signed 32-bit integer vector $v_{32}$:
$$v_{32}[i] = v_{16}[2i] \cdot 1 + v_{16}[2i+1] \cdot 1$$
where each 32-bit value in $v_{32}$ comes from four 8-bit values, so the width of the horizontal reduction is 4. The result of phase ➀ is a 32-bit integer vector with one quarter as many elements as $v_A$.
Note that the vector width (i.e. the number of cubes in the vector) in Figure 1 is for illustration only. The actual width depends on the data type and the target machine. For example, on an AVX2 machine, where the vector width is 256 bits, $v_A$ holds 32 8-bit elements and $v_{32}$ holds 8 32-bit elements. On an AVX512 machine, these become 64 and 16.
Phase ➁ repeats the computation of phase ➀ for the remaining bytes of the same row of matrix A and the same column of matrix B. Each repetition produces a 32-bit vector, and these vectors are accumulated using the VPADDD instruction. This phase yields the element-wise sum of all the 32-bit vectors produced by phase ➀.
Phase ➂ sums all of the 32-bit integers within the accumulated vector using tree reduction to generate one final element of matrix C (shown as the black cube). This phase can be implemented in different ways, depending on the instructions used. One approach uses VPHADDD, which horizontally adds adjacent pairs of 32-bit integers. Depending on the vector width, multiple VPHADDD instructions are needed to accumulate the integer values and generate the final value. The other approach uses the VPSHUFD instruction, which shuffles the 32-bit integers within the vector, followed by VPADDD to accumulate the corresponding values into the final result.
Finally, phases ➀, ➁ and ➂ are applied to all rows of matrix A and all columns of matrix B to generate the whole matrix C. Although these phases change the usual reduction order of GEMM, this is safe for integer computation unless overflow occurs due to a very large reduction dimension $K$, which is rare for most DNN models, where $K$ ranges from hundreds to thousands.
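The three phases can be sketched in plain Python as a scalar emulation of the vector instructions. This is an illustrative sketch, not the production kernel: the function names and the `lanes` parameter (32 8-bit elements, as on AVX2) are assumptions made for this example, and 32-bit wraparound is not modeled.

```python
def to_i16_sat(x):
    """Clamp to the signed 16-bit range, as VPMADDUBSW does."""
    return max(-32768, min(32767, x))


def vpmaddubsw(va, vb):
    """Multiply unsigned 8-bit by signed 8-bit elements, then add adjacent pairs (saturating)."""
    return [to_i16_sat(va[i] * vb[i] + va[i + 1] * vb[i + 1])
            for i in range(0, len(va), 2)]


def vpmaddwd(v16, ones):
    """Multiply signed 16-bit elements, then add adjacent pairs into 32-bit lanes."""
    return [v16[i] * ones[i] + v16[i + 1] * ones[i + 1]
            for i in range(0, len(v16), 2)]


def vpaddd(x, y):
    """Element-wise 32-bit addition (wraparound not modeled)."""
    return [p + q for p, q in zip(x, y)]


def tree_reduce(v):
    """Phase 3: fold the vector in half repeatedly, log2(len(v)) steps in total."""
    while len(v) > 1:
        half = len(v) // 2
        v = [v[i] + v[i + half] for i in range(half)]
    return v[0]


def dot_conventional(row_a, col_b, lanes=32):
    """One element of C from one row of A (uint8 values) and one column of B (int8 values)."""
    assert len(row_a) == len(col_b) and len(row_a) % lanes == 0
    ones = [1] * (lanes // 2)
    acc = [0] * (lanes // 4)
    for k in range(0, len(row_a), lanes):
        v16 = vpmaddubsw(row_a[k:k + lanes], col_b[k:k + lanes])  # phase 1
        v32 = vpmaddwd(v16, ones)                                 # phase 1
        acc = vpaddd(acc, v32)                                    # phase 2
    return tree_reduce(acc)                                       # phase 3
```

Note that the tree reduction runs once per output element and works on progressively shorter vectors, which is the underused vector work discussed in the next section. The 16-bit saturation in the first step also means the result is exact only when no adjacent pair of products exceeds the signed 16-bit range.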
3 Methodology
3.1 Proposed Approach
The conventional approach described in the previous section is a straightforward way to perform low-precision GEMM computation. However, it is not optimal, because the vector units are not fully used during the tree-reduction phase ➂. Therefore, we propose NGEMM, which makes better use of the vector units by avoiding inefficient vector computation such as tree reduction.
Figure 2 shows the details of our proposed approach. We use the same configuration as for the conventional approach: matrix A is the weight matrix, matrix B is the input matrix, and matrix C is the result matrix. Matrices A and B have low numerical precision, and matrix C has full 32-bit precision.
Similar to phase ➀ in the conventional approach, phase ❶ computes partial results from two vectors, $v_A$ and $v_B$, loaded from matrices A and B, using VPMADDUBSW and VPMADDWD. However, the shape of the data loaded from matrix A differs from the conventional approach, even with the same vector length: instead of a segment of a single row, $v_A$ covers a small two-dimensional block of A, whose shape is determined in practice by the vector width and the width of the horizontal reduction. For example, with 8-bit integers on AVX2, a 256-bit vector holds 32 elements and the horizontal reduction width is 4, so the block shape is $8 \times 4$. Because both dimensions are shorter than the vector width, loads of $v_A$ access non-contiguous memory, which may compromise memory bandwidth. To resolve this issue, we apply a data layout transformation (see Sec 3.1.1) that makes $v_A$ fully contiguous in memory.
Furthermore, because $v_B$ is shorter than $v_A$ (it holds only as many elements as the horizontal reduction width), we broadcast $v_B$ to a vector $v_B'$ by repeating its elements until $v_B'$ has the same width as $v_A$. Figure 2 uses a horizontal reduction width of two (two elements are loaded from matrix B) for simplicity. Consequently, each element of the resulting 32-bit vector is the dot product of one row of the block of A with $v_B$.
The second phase ❷ accumulates all the partial results of phase ❶ to directly form the final result vector in matrix C.
Figure 2 also shows two more partial results, obtained by applying phases ❶ and ❷ to the same part of A but two different vectors of B. The same computation is applied to the entire A and B to obtain the final result matrix C.
Compared with phases ➀, ➁ and ➂ of the conventional approach, our approach has only phases ❶ and ❷. The tree-reduction phase is absent, because the reduction is completed implicitly through the vector additions in phase ❷.
Our proposed approach is not limited to GEMM operations; it also benefits similar computations such as GEMV. Moreover, it is applicable to GEMM operations with other types, such as signed 16-bit or unsigned 8-bit integers.
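The two phases can be sketched in plain Python as follows. This is a minimal scalar emulation under stated assumptions: `madd`, `ngemm_sketch`, `m_blk` and `k_blk` are names chosen for this example, the block shape defaults to 8 by 4 (8-bit integers on AVX2), matrix dimensions are assumed to be multiples of the block shape, and the intermediate 16-bit saturation of VPMADDUBSW is ignored.

```python
def madd(va, vb, width):
    """Stand-in for VPMADDUBSW followed by VPMADDWD with a vector of ones:
    sums each group of `width` products into one 32-bit lane."""
    return [sum(va[i + j] * vb[i + j] for j in range(width))
            for i in range(0, len(va), width)]


def ngemm_sketch(A, B, m_blk=8, k_blk=4):
    """C = A @ B for A (M x K, uint8 values) and B (K x N, int8 values), as lists of rows."""
    M, K, N = len(A), len(B), len(B[0])
    assert M % m_blk == 0 and K % k_blk == 0  # padding is not handled in this sketch
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, m_blk):
        for n in range(N):
            acc = [0] * m_blk  # one 32-bit lane per row of the block
            for k0 in range(0, K, k_blk):
                # Phase 1: an m_blk x k_blk block of A, flattened row by row.
                va = [A[m0 + i][k0 + j] for i in range(m_blk) for j in range(k_blk)]
                # k_blk elements of column n of B, broadcast m_blk times.
                vb = [B[k0 + j][n] for j in range(k_blk)] * m_blk
                partial = madd(va, vb, k_blk)
                # Phase 2: plain vector addition; no horizontal reduction is needed.
                acc = [x + y for x, y in zip(acc, partial)]
            # The accumulator already holds m_blk finished elements of column n of C.
            for i in range(m_blk):
                C[m0 + i][n] = acc[i]
    return C
```

Each lane of the accumulator belongs to a different row of A, so after the loop over K the vector is stored directly as a column segment of C, with every lane doing useful work in every step.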
3.1.1 Data Layout
We describe the data layout used by our proposed approach in detail. The layout is mainly for the weight matrix in a DNN model, e.g. matrix A in our examples. Therefore, data marshalling can often be performed offline. For those uncommon cases where both matrices are inputs, we can simply perform online packing with extra overhead.
Figure 3 and Figure 4 demonstrate the two-level hierarchy of the layout used in NGEMM. Figure 3 shows various inner layouts in NGEMM, where the top row displays the original vector access. Assuming the memory address grows horizontally, the original vector load reads contiguous elements from memory. The vector width depends on the hardware instructions and data types, as displayed in Figure 3.
The bottom row draws the inner layout, which determines the data access pattern for the matrix. Instead of accessing data from a single row or column, the inner layout first accesses $K_{blk}$ elements along the K dimension, then moves to the next row, and repeats this process $M_{blk}$ times along the M dimension. This kind of memory access forms a small zigzag pattern, which is known as Morton code Morton (1966). Usually, $M_{blk}$ and $K_{blk}$ are determined by the hardware instructions and data type.
The size of $K_{blk}$ is simply the number of elements that need to be accumulated into a single 32-bit value in phase ❶.
Figure 4 illustrates the outer layout pattern in NGEMM for the weight matrix. The outer layout can be combined with loop optimizations such as loop tiling to deliver better performance. Loop tiling is a classic loop transformation technique that maximizes parallelism and improves cache locality Hong et al. (2016); Bao et al. (2017b, a). It has been widely adopted in various BLAS libraries to optimize compute-intensive operations such as GEMM.
Figure 4 (a) shows the outer layout without loop tiling applied, which iterates the inner layout pattern first through the M dimension and then through the K dimension over the whole matrix. It is worth mentioning that if the number of rows or columns is not a multiple of the inner layout's row count $M_{blk}$ or column count $K_{blk}$, respectively, we simply pad with zero elements.
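The untiled layout can be sketched as an offline packing routine. This is an illustrative sketch: `pack_weights`, `m_blk` and `k_blk` are names chosen for this example, and the block order (all blocks along M for the first K block, then the next K block) is one reading of the iteration order described above, taken here as an assumption.

```python
def pack_weights(A, m_blk=8, k_blk=4):
    """Pack an M x K matrix (list of rows) into a flat list of m_blk x k_blk blocks.

    Inside a block, k_blk elements along K are stored first, then the next row,
    for m_blk rows. Blocks are ordered along M first, then along K (assumed order).
    Rows and columns are zero-padded up to multiples of m_blk and k_blk.
    """
    M, K = len(A), len(A[0])
    m_padded = -(-M // m_blk) * m_blk  # round up to a multiple of m_blk
    k_padded = -(-K // k_blk) * k_blk  # round up to a multiple of k_blk

    def at(i, j):
        # Elements outside the original matrix are padding zeros.
        return A[i][j] if i < M and j < K else 0

    packed = []
    for k0 in range(0, k_padded, k_blk):
        for m0 in range(0, m_padded, m_blk):
            for i in range(m_blk):
                for j in range(k_blk):
                    packed.append(at(m0 + i, k0 + j))
    return packed
```

With this packing, every consecutive run of `m_blk * k_blk` elements is exactly one block of A, so each vector load in phase ❶ reads contiguous memory. Zero padding is safe because padded products contribute nothing to the accumulated sums.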
Figure 4 (b) shows tiling with a tile size of $M_{tile}$ by $K_{tile}$ combined with the layout for matrix A, where $M_{tile}$ and $K_{tile}$ denote the tiling sizes along the M and K dimensions of the GEMM. In our tiling scheme, an $M_{tile}$ by $K_{tile}$ tile typically contains multiple $M_{blk}$ by $K_{blk}$ inner-layout blocks. Thus the tiling sizes along M and K are selected as multiples of the inner-layout block sizes, alongside the tiling size $N_{tile}$ along dimension N. The outer layout then iterates the tile pattern through the whole matrix.
Moreover, we also apply inter-tile and intra-tile optimizations. At the intra-tile level, we perform loop unrolling to increase locality and reduce loop control overhead within a tile. At the inter-tile level, different loop permutations can be used, which correspond to the outer layout iteration order described previously. Overall, choosing the loop tiling sizes and permutation is a typical search-space exploration problem, which can be solved by techniques such as auto-tuning.
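The search space mentioned above can be illustrated with a small tiled GEMM in plain Python, where both the tile sizes and the order of the three tile loops are parameters. This is a sketch for illustration only: `tiled_gemm`, `tile` and `order` are hypothetical names, and the inner loops are scalar rather than vectorized.

```python
from itertools import product


def tiled_gemm(A, B, tile=(16, 4, 8), order="mnk"):
    """C = A @ B with loop tiling.

    tile  = (tile_m, tile_n, tile_k)
    order = permutation of "mnk" giving the nesting of the tile loops,
            outermost first.
    """
    M, K, N = len(A), len(B), len(B[0])
    extent = {"m": M, "n": N, "k": K}
    step = dict(zip("mnk", tile))
    C = [[0] * N for _ in range(M)]
    # product() varies its last range fastest, so order[0] is the outermost loop.
    ranges = [range(0, extent[d], step[d]) for d in order]
    for starts in product(*ranges):
        s = dict(zip(order, starts))
        # Work inside one tile; min() handles the tail when sizes do not divide evenly.
        for i in range(s["m"], min(s["m"] + step["m"], M)):
            for j in range(s["n"], min(s["n"] + step["n"], N)):
                for k in range(s["k"], min(s["k"] + step["k"], K)):
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Every choice of `tile` and `order` produces the same integer result, so an auto-tuner is free to time each candidate on the target machine and keep the fastest one.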
3.2 Latency Analysis
In this part, we compare the latency of the conventional approach and NGEMM. Note that we only consider the computation cost and ignore the cost of loads and stores, since after the data layout transformation, memory accesses in both methods are contiguous.
Assume the matrix sizes are the same as previous section, i.e. with vector width . The latency of the conventional approach can be calculated as follows:
where and are the total cost for phase ➀ and ➁ in the conventional approach respectively. Here, is the latency cost of treereduction computation , and is constant parameter.
For our proposed NGEMM with layout size , the latency is as follows:
where latency(❶) and latency(❷) are the time cost of phase ❶ and ❷ in NGEMM. The speedup ratio between these two can be expressed as:
(1) 
where phases of ➀ ➁ and ❶ ❷ have the same latency cost, and their latency can be noted as . Then equation (1) can be expressed as:
(2) 
where represents the reduction dimension in the step of vector width of matrix multiplication, is the vector width and is constant parameter.
From equation (2), we can observe that the speedup depends on the reduction dimension and the vector width. A longer vector width and a shorter reduction dimension result in a better speedup. It is worth mentioning that the trend of increasing vector width continues in new generations of processors.
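A worked form of this argument can be sketched as follows. This is a simplified model under stated assumptions, not necessarily the exact expressions of equations (1) and (2): $VL$ is the number of 8-bit elements per vector, $l$ is the combined cost of one multiply step and one accumulate step (assumed equal in both approaches), $t_r$ is the cost of one tree reduction, and the inner-layout block is assumed to fill a whole vector, so that $M_{blk} K_{blk} = VL$.

```latex
\begin{align*}
T_{\text{conv}}  &= M N \left( \frac{K}{VL}\, l + t_r \right) \\
T_{\text{NGEMM}} &= \frac{M}{M_{blk}}\, N\, \frac{K}{K_{blk}}\, l = M N\, \frac{K}{VL}\, l \\
\text{speedup}   &= \frac{T_{\text{conv}}}{T_{\text{NGEMM}}} = 1 + \frac{t_r}{(K / VL)\, l}
\end{align*}
```

For instance, with the hypothetical value $t_r = 3l$ and $VL = 32$, this model gives $1 + 3/32 \approx 1.09$ for $K = 1024$ and $1 + 3/4 = 1.75$ for $K = 128$, consistent with the trend that a shorter reduction dimension yields a larger speedup.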
4 Implementation
We implemented single-threaded NGEMM by leveraging the TVM Chen et al. (2018) tensorize schedule primitive, which replaces partial computation code with the corresponding LLVM Lattner and Adve (2004) intrinsics. It can be easily extended to support multi-threading by parallelizing the outermost loop. We modified the TVM code to properly handle tail conditions during code generation, which is important for certain loop transformation techniques Bao et al. (2016b); Bao (2018). Furthermore, we added several LLVM intrinsics that were not supported by TVM. It is worth mentioning that NGEMM is not limited to TVM. The same techniques can be implemented in other compiler-based frameworks Rotem et al. (2018); [1].
Layout The layout of the weight matrix has been prepacked offline, and the marshalled matrix is fed as a constant initializer to the inference runtime. NGEMM generates different layouts for the weight matrix based on numerical precision requirements and supported instruction sets.
Loop Optimization The optimal loop tiling sizes and permutation orders are determined through compiler autotuning. NGEMM generates the best choice based on matrix sizes, target machines and numerical precision.
Moreover, it is straightforward to explore more extensions such as operation fusion with the help of compiler techniques. DNN models have pre- and post-GEMM operations, which can be fused together with GEMM to increase cache locality and thus improve performance.
5 Evaluation
We evaluated NGEMM’s performance against MKL-DNN (MKLML) v0.21 (this version of MKL fixed certain performance issues in previous versions). Our NGEMM was implemented in ONNX Runtime Microsoft (2018), using a custom version of TVM with LLVM 6.0.1. The experiments were conducted on a machine with an 8-core 2.3 GHz Intel Xeon E5-2673 v4 processor with AVX2 support. The machine runs Ubuntu 16.04 with GCC 6.5. No frequency scaling (DVFS) techniques were used in the experiments Bao et al. (2016a); Farkas et al. (2000). The performance numbers were obtained using the performance test tool of onnxruntime.
Figure 5 shows the speedups of NGEMM over MKL (single-threaded cblas_gemm_s8u8s32 routine) on various problem sizes with 8-bit integers. The last bar in the figure is the geometric mean of the NGEMM speedups across all the problem sizes. NGEMM demonstrated a geometric mean speedup of 1.4x compared to MKL.
As shown in the graph, NGEMM performs better than MKL for all the problem sizes we chose. The speedup varies, however: small batch sizes benefit more, where the batch size is the matrix M dimension. Even for large batch sizes, NGEMM still shows around 1.1X speedup compared to MKL.
It is worth mentioning that the actual performance benefits of NGEMM might vary for deep learning models. That is because real models contain more than GEMM-related operations. There are many other computations, such as element-wise computations, which also take considerable time. In our experiments with real production models, NGEMM delivered up to 1.3X speedup over MKL.
6 Related work
Many research studies Vanhoucke et al. (2011); Rastegari et al. (2016a); Mellempudi et al. (2017); Das et al. (2018) have shown that low numerical precision can be applied to DNN models with minimal to no accuracy loss. Quantization techniques have been adopted by many DNN frameworks to improve training and inference performance. XLA (Accelerated Linear Algebra) Leary and Wang (2017) is a compiler backend for TensorFlow Abadi et al. (2016), which supports various optimizations including quantization and generates machine instructions for different targets. Glow Rotem et al. (2018) is a machine learning compiler, which consumes neural network graphs and performs high-level graph optimizations and low-level machine-dependent optimizations for diverse hardware targets. Glow also supports quantization with various bit-widths. TVM Chen et al. (2018), another compiler stack for deep learning, compiles the model into low-level IR and performs loop optimizations. It supports multiple hardware backends and quantization with different bit-widths. Our work, NGEMM, is integrated with Microsoft ONNX Runtime Microsoft (2018).
Many BLAS libraries, meanwhile, have implemented low-precision GEMM for various architectures. Intel MKL (Math Kernel Library) Wang et al. (2014) provides well-tuned 8- and 16-bit GEMM kernels and is widely adopted across many DNN frameworks as a CPU vendor library. The NVIDIA cuBLAS library Nvidia (2008) provides counterparts for NVIDIA GPUs. FBGEMM Deng. (2018) is a reduced-precision linear algebra library for deep learning inference, and is integrated into Caffe2. Gemmlowp Jacob and others (2017) is a library that only supports low-precision GEMM. In addition to execution time, it is also optimized to minimize power consumption, which is crucial for edge devices.
Besides TVM, which we used to implement the techniques proposed in this paper, general-purpose code generation frameworks Chen et al. (2008); Püschel et al. (2004); Chang et al. (2016); Chang (2017); De Gonzalo et al. (2019) and domain-specific languages Ragan-Kelley et al. (2013); Sujeeth et al. (2014) are also capable of adopting our techniques for optimizing GEMM routines.
7 Conclusion
In this paper, we present NGEMM, an optimized low-precision GEMM implementation based on compiler techniques. Compared to the conventional approach adopted by contemporary BLAS libraries, our approach does not require tree reduction and hence has better performance. We implemented a hierarchical layout for the weight matrix to further improve the latency. Our evaluation on various problem sizes demonstrates an average 1.4X speedup over the state-of-the-art MKL library. A number of production models also show up to 1.3X speedup using NGEMM compared to MKL.
References
 [1] (2019) "Multi-level intermediate representation" compiler infrastructure. Note: https://github.com/tensorflow/mlir Cited by: §4.
 Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §6.
 Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1131–1135. Cited by: §1.
 Efficient cache simulation for affine computations. In International Workshop on Languages and Compilers for Parallel Computing (LCPC’17), Cited by: §3.1.1.
 Static and dynamic frequency scaling on multicore CPUs. ACM Transactions on Architecture and Code Optimization (TACO) 13 (4), pp. 51. Cited by: §5.
 Polycheck: dynamic verification of iteration space transformations on affine programs. In ACM SIGPLAN Notices, Vol. 51, pp. 539–554. Cited by: §4.
 Analytical modeling of cache behavior for affine programs. Proceedings of the ACM on Programming Languages 2 (POPL), pp. 32. Cited by: §3.1.1.
 Compiler techniques for transformation verification, energy efficiency and cache modeling. Ph.D. Thesis, The Ohio State University. Cited by: §4.
 Efficient kernel synthesis for performance portable programming. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 12:1–12:13. Cited by: §6.
 Toward performance portability for CPUs and GPUs through algorithmic compositions. Ph.D. Thesis, University of Illinois. Cited by: §6.
 CHiLL: a framework for composing high-level loop transformations. Technical report. Cited by: §6.
 TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594. Cited by: §4, §6.
 Mixed precision training of convolutional neural networks using integer operations. arXiv:1802.00930. Cited by: §6.
 Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 73–84. Cited by: §6.
 New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603. Cited by: §1.
 Open-sourcing FBGEMM for state-of-the-art server-side inference. Note: https://code.fb.com/ml-applications/fbgemm/ Cited by: §1, §6.
 Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
 Quantifying the energy consumption of a pocket computer and a java virtual machine. ACM SIGMETRICS Performance Evaluation Review 28 (1), pp. 252–263. Cited by: §5.
 Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1.
 Effective padding of multidimensional arrays to avoid cache conflict misses. In ACM SIGPLAN Notices, Vol. 51, pp. 129–144. Cited by: §3.1.1.
 Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. Cited by: §1.
 [22] (2019) Intel math kernel library. Note: https://software.intel.com/en-us/mkl/ Cited by: §1.
 Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1.
 Gemmlowp: a small self-contained low-precision GEMM library (2017). Cited by: §1, §6.
 Learning semantic image representations at a large scale. Ph.D. Thesis, UC Berkeley. Cited by: §1.
 Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR abs/1806.08342. Cited by: §1.
 LLVM: a compilation framework for lifelong program analysis & transformation. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, pp. 75. Cited by: §4.
 XLA: tensorflow, compiled. TensorFlow Dev Summit. Cited by: §6.
 Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §1.
 Mixed low-precision deep learning inference using dynamic fixed point. arXiv:1701.08978. Cited by: §6.
 Onnx runtime: cross-platform, high performance scoring engine for ml models. Note: https://github.com/microsoft/onnxruntime Cited by: §5, §6.
 A computer oriented geodetic data base and a new technique in file sequencing. Cited by: §3.1.1.
 Cublas library. NVIDIA Corporation, Santa Clara, California 15 (27), pp. 31. Cited by: §6.
 Spiral: a generator for platform-adapted libraries of signal processing algorithms. International Journal of High Performance Computing Applications 18 (1), pp. 21–45. Cited by: §6.
 Improving language understanding with unsupervised learning. Technical report, OpenAI. Cited by: §1.
 Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48 (6), pp. 519–530. Cited by: §6.
 XNOR-net: imagenet classification using binary convolutional neural networks. Lecture Notes in Computer Science, pp. 525–542. ISBN 9783319464930, ISSN 1611-3349. Cited by: §6.
 Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §1.
 Lower numerical precision deep learning inference and training. Intel White Paper. Cited by: §2.1.
 Glow: graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907. Cited by: §4, §6.
 Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
 Delite: a compiler architecture for performance-oriented embedded domain-specific languages. ACM Transactions on Embedded Computing Systems 13 (4s), pp. 134:1–134:25. Cited by: §6.
 Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
 Improving the speed of neural networks on CPUs. Cited by: §6.
 Intel math kernel library. In High-Performance Computing on the Intel® Xeon Phi™, pp. 167–188. Cited by: §6.