1. Introduction
Over the past decade, Deep Neural Networks (DNNs) have made substantial improvements in the fields of computer vision and natural language processing, with DNNs now surpassing humans in image recognition on the ImageNet dataset (He et al., 2015). As the accuracy of state of the art DNNs has improved, model sizes and compute complexity have also increased. For example, AlexNet (Krizhevsky et al., 2012) uses roughly 60 million parameters and Facebook’s DeepFace (Parkhi et al., 2015) uses 120 million, requiring billions of operations. For low-powered devices such as phones and edge devices, running inference on state of the art models poses a significant challenge due to their compute and memory requirements. Large models must be stored in off-chip DRAM, where fetching model parameters from memory can become the dominant energy and time consumer (Han et al., 2016), and computation is dominated by expensive multiply-accumulate operations.
To combat this, prior work has explored trading accuracy for performance by using low precision weights and activations of a few bits, or even a single bit as is the case with Binarized Neural Networks (BNNs) (Rastegari et al., 2016)(Courbariaux et al., 2016). Low precision models are drastically smaller, greatly reducing the cost of fetching model parameters from memory. For example, BNN models are up to 32x smaller than full precision models and can often fit in on-chip memory (Rastegari et al., 2016). During compute-intensive dense and convolution layers, multiplication can be replaced with cheap bitwise operations and popcount by computing in Hamming space, allowing for high performance inference.
While training low precision networks has received a lot of interest in the research community and made steady progress towards improving accuracy (Zhou et al., 2016)(Cai et al., 2017), most work assumes low precision inference will result in a linear speedup over full precision and references theoretical speedups based on the number of operations. However, achievable speedups depend on many factors, including architectural support for matrix operations and software optimizations that effectively use the underlying memory subsystem and hardware instructions. Current deep learning frameworks leverage decades of prior work in optimizing linear algebra operations through libraries such as NNPACK (Dukhan, [n. d.]) and Intel’s MKL (int, [n. d.]) that optimize for both hardware backends and matrix properties, resulting in highly efficient floating point and integer operations. Naive implementations of bitserial operators can be slower than 8-bit integer and even full precision floating point implementations, partly because they lack optimized operator implementations.
In this paper we address the challenges of writing optimized low precision bitserial operators for CPUs. We introduce a workflow to quickly generate high performance low precision deep learning operators for arbitrary precision that target multiple CPU architectures and include optimizations such as memory tiling and vectorization. Specifically, our contributions include:

A library of operators for quantization of floating point data, flexible bit packing, bitserial matrix multiply, and convolution.

An extensive case study of optimizing low precision operators for a low power Raspberry Pi 3B that can surpass state of the art hand-written 16-bit kernels for 1-bit weight, 2-bit activation convolutions.
2. Background and Related Work
2.1. Low Precision Neural Networks
Low precision neural networks operate on weights and activations quantized down to a few bits, or a single bit in the extreme case of BNNs. These types of neural networks improve performance over full precision models by significantly reducing memory movement costs and computing using bitserial methods and can be deployed on existing hardware.
BNNs, such as XNOR-Net (Rastegari et al., 2016) and BinaryNet (Courbariaux et al., 2016), have achieved near state of the art results on datasets such as MNIST and CIFAR-10, showing that binarization works extremely well for relatively simple datasets. However, they perform significantly worse than full precision models on complex datasets like ImageNet, with accuracy degradations of over 18% for XNOR-Net on ResNet-18 (He et al., 2016). To combat this, recent work (Zhou et al., 2016)(Hubara et al., 2016) has shifted to low precision networks that relax activation quantization to a few bits while keeping weights binarized. Current state of the art low precision networks have made significant improvements in accuracy. HWGQ (Cai et al., 2017) has shown that models with 1-bit weights and 2-bit activations can achieve top-1 accuracy drops of between 5% and 9% on different models for ImageNet, including ResNet-18 and GoogLeNet (Fromm et al., 2018)(Szegedy et al., 2015).
2.2. Bitserial Computation
While FPGAs and custom hardware can take advantage of arbitrary precision through custom datapaths (Umuroglu et al., 2017), most commodity hardware, such as CPUs, has no support for low precision data types. Efficient handling of low precision data requires bitpacking into a larger storage data type, such as an 8-bit or 32-bit integer, and bitserial operations that compute on an implicit vector of packed quantized data.
We describe bitserial dot products as seen in previous work (Umuroglu and Jahre, 2017)(Zhou et al., 2016), starting with the binary case and extending to higher precisions. A binary dot product between two vectors containing only elements of 0 and 1 can be computed by taking the bitwise and of the two vectors and counting the number of 1’s in the result using popcount, as seen in Equation 1. If binary data is encoded in the bipolar format (-1 and +1), then bitwise xnor replaces bitwise and.
\[ x \cdot y = \mathrm{popcount}(x \mathbin{\&} y) \tag{1} \]
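As a minimal sketch of the binary dot product in both encodings, the Python below stores one vector element per bit of an integer. The function names are illustrative, not part of any library. For the bipolar case it assumes that a set bit encodes +1 and a clear bit encodes -1, and it applies the standard correction 2p - K (p matching positions out of K elements) to turn the xnor popcount into a signed dot product; that correction is an assumption added here for completeness.

```python
def popcount(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def binary_dot_unipolar(x_word: int, y_word: int) -> int:
    """Dot product of {0, 1} vectors packed one element per bit."""
    return popcount(x_word & y_word)


def binary_dot_bipolar(x_word: int, y_word: int, length: int) -> int:
    """Dot product of {-1, +1} vectors (assumed encoding: bit 1 -> +1, bit 0 -> -1).

    xnor marks positions where the signs agree (product +1); every other
    position contributes -1, so the result is 2 * agree - length.
    """
    mask = (1 << length) - 1          # keep only the `length` valid bits
    agree = popcount(~(x_word ^ y_word) & mask)
    return 2 * agree - length


# Element i of each vector lives in bit i of the word.
x_word = 0b1011   # elements [1, 1, 0, 1]
y_word = 0b0110   # elements [0, 1, 1, 0]
unipolar = binary_dot_unipolar(x_word, y_word)
bipolar = binary_dot_bipolar(x_word, y_word, length=4)
```

Read as 0/1 data the two words share a single set position; read as bipolar data they agree in only one of four positions, which gives a negative dot product.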
Binary dot products can be easily extended to perform bitserial dot products between an M-bit and an N-bit vector by computing the weighted sum of MN binary dot products, as described in Equation 2, where n and m refer to the bit positions of x and y.
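\[ x \cdot y = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} 2^{\,n+m}\, \mathrm{popcount}(x_n \mathbin{\&} y_m) \tag{2} \]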
While bitserial dot products can be used for inputs of any precision, their compute complexity grows linearly with the product of the bitwidths of x and y, as O(MN), so this technique is only practical for heavily quantized data. Equation 2 can be modified to support signed data by weighting binary dot products by their sign.
On CPUs, low-bit datatypes are not supported, and data must be bitpacked to efficiently support bitserial computation and take full advantage of the hardware’s data path. This is achieved by splitting input vectors into individual bitplanes, essentially creating B binary vectors, one per bit of precision. The binary vectors are then packed into a larger storage type such as a uint32. Bitwise operations can be applied directly to the packed data, allowing low precision data to be efficiently stored and computed on.
In Figure 1 we show an example of how bitpacking and bitserial computation work on a 2-bit by 1-bit dot product. Input data is first bitpacked by separating the bits of each element into separate bitplanes. Each bitvector is then compressed into a larger integer type, for simplicity a four-bit unsigned integer. Since bitwise operations have no carry chains, they can be performed on packed data in an implicitly vectorized fashion and take full advantage of the hardware’s data path. For example, if data is packed into 32-bit integers, a single pair of popcount and and instructions computes a 32-element binary dot product.
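The bitpacking and bitserial steps described in this section can be sketched in a few lines of Python. This is an illustrative model for unsigned data only; the helper names (`pack_bitplane`, `bitserial_dot`) and the example vectors are hypothetical and are not taken from Figure 1. Each bitplane is packed into a single Python integer, standing in for a machine word.

```python
def popcount(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def pack_bitplane(values, bit):
    """Pack bit `bit` of every element into one integer (element i -> bit i)."""
    word = 0
    for i, value in enumerate(values):
        word |= ((value >> bit) & 1) << i
    return word


def bitserial_dot(x, y, x_bits, y_bits):
    """Unsigned bitserial dot product: a weighted sum of x_bits * y_bits
    binary dot products, one per pair of bitplanes."""
    x_planes = [pack_bitplane(x, n) for n in range(x_bits)]
    y_planes = [pack_bitplane(y, m) for m in range(y_bits)]
    total = 0
    for n, x_plane in enumerate(x_planes):
        for m, y_plane in enumerate(y_planes):
            # Binary dot product of two bitplanes, weighted by 2^(n + m).
            total += popcount(x_plane & y_plane) << (n + m)
    return total


activations = [3, 1, 0, 2]   # 2-bit values
weights = [1, 1, 0, 1]       # 1-bit values
result = bitserial_dot(activations, weights, x_bits=2, y_bits=1)
reference = sum(a * w for a, w in zip(activations, weights))
```

The bitserial result matches the ordinary integer dot product, while the inner loop touches only bitwise and, popcount, and shifts.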
3. Low Precision Operators
Low precision operators rely on efficient bitserial computation. We implement our operators using TVM, the deep learning compiler (Chen et al., 2018a). Our operators are designed to provide flexibility in precision and data layout, and performance portability across different CPU architectures. In Halide fashion, we provide (1) a declarative computation rule that describes the transformation of input to output tensors and (2) a separate schedule that specifies how that transformation is implemented. We leverage TVM features to provide high-level, architecture-agnostic optimizations for parallelism and memory tiling. For CPU code generation, TVM uses LLVM, which we find produces adequate code. However, for further performance we implement architecture-specific microkernels that take advantage of unique hardware intrinsics that LLVM misses.
3.1. Preprocessing: quantization and bitpacking
In order to apply bitserial operators, both input activation and weight tensors need to be in the correct format which involves quantizing data to the desired precision and bitpacking them to the proper data layout.
Since bitserial operators compute on each bit of an element individually, the separate bitplanes of each input tensor (the set of bits at each bit position) must be easily accessible. This is accomplished by a bitpacking step that transforms the input tensors into bitpacked tensors. Each element of a packed tensor holds a single bit from many elements, as opposed to all bits of a single element in the input tensor.
We provide a bitpacking operator that takes a tensor of arbitrary dimension and returns a bitpacked tensor with one additional dimension, a new bit axis addressing the bitplanes of the tensor. Though the bitpacked tensor has more dimensions, its total size is smaller, as multiple elements are packed along the reduction axis into a single element. The tensor’s reduction axis depends on the layout and is the axis along which elements are multiplied and accumulated.
Our bitpacking operator is flexible with respect to input data layout and size. The user specifies the input tensor’s reduction axis, and the bit axis location and datatype of the packed tensor. For example, in Figure 2, we show an example of bitpacking matrix A, a 2-bit matrix, along its reduction axis. The packed matrix can take one of three layouts, which differ in where the bit axis is placed.
The different layouts provide different types of data locality. When implementing bitserial operators, we found flexibility in specifying the bit axis position to be useful, as different bitserial algorithms rely on specific packed layouts. For example, the work in (Umuroglu and Jahre, 2017) relied on an interleaved layout with B as the innermost dimension, while the work in (Tulloch and Jia, 2017) required B to be the outermost dimension.
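To make the role of the bit axis concrete, the following Python sketch packs a small 2-bit matrix along its second (reduction) axis and returns it with the bit axis either outermost or innermost. It is a simplified model rather than the operator described above: the function name, the `bit_axis` argument, the 8-bit word size, and the example matrix are all assumptions chosen for illustration.

```python
def bitpack_matrix(matrix, bits, word_bits=8, bit_axis="outer"):
    """Pack an M x K matrix of `bits`-bit values along K, the reduction axis.

    bit_axis="outer" returns packed[b][m][kw]  (bit axis outermost).
    bit_axis="inner" returns packed[m][kw][b]  (bit axis innermost).
    Assumes K is a multiple of word_bits.
    """
    rows = len(matrix)
    words_per_row = len(matrix[0]) // word_bits

    def pack_word(m, kw, b):
        # Gather bit b of word_bits consecutive elements into one word.
        word = 0
        for i in range(word_bits):
            value = matrix[m][kw * word_bits + i]
            word |= ((value >> b) & 1) << i
        return word

    if bit_axis == "outer":
        return [[[pack_word(m, kw, b) for kw in range(words_per_row)]
                 for m in range(rows)] for b in range(bits)]
    if bit_axis == "inner":
        return [[[pack_word(m, kw, b) for b in range(bits)]
                 for kw in range(words_per_row)] for m in range(rows)]
    raise ValueError("bit_axis must be 'outer' or 'inner'")


A = [[3, 0, 1, 2, 3, 3, 0, 1],
     [1, 1, 2, 2, 0, 3, 3, 0]]
outer = bitpack_matrix(A, bits=2, bit_axis="outer")   # shape 2 x 2 x 1
inner = bitpack_matrix(A, bits=2, bit_axis="inner")   # shape 2 x 1 x 2
```

Both results hold the same packed words; only the order in which they are laid out differs, which is what determines whether the bitplanes of one element or the words of one bitplane are adjacent in memory.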
3.2. Low precision operators and high level optimizations:
We implement a library of low precision operators for common neural network operations, such as 2D convolutions and dense matrix multiply, for arbitrary CPU backends. These operators transform bitpacked tensors into standard-format tensors in a higher, machine-supported precision. Using TVM, we can leverage existing features of the compiler to cleanly describe our operators’ implementations and quickly add optimizations for memory access, parallelism, and more.
At the compute level we describe variations of bitserial convolutions that accept different high-level data layouts such as NCHW and NHWC, and various convolution implementations, such as lowering to matrix multiply and an efficient in-place convolution modified from (Jiang et al., 2018), all parameterized for different precisions. Depending on the input’s shape and the characteristics of the convolution kernel, different implementations perform better.
For each operator variation, we provide a generic schedule (described in section 4) that creates moderately efficient code for any CPU backend and takes advantage of vectorization, tiling, and other CPU-agnostic optimizations. While many of these techniques are well established and simple in theory, they are time consuming to implement, with some requiring rewrites of the entire implementation and tuning of parameters such as tile sizes. Furthermore, many of these parameters do not affect performance independently and must be tuned together, leading to a large search space of parameters in which to find an optimal configuration. We take advantage of TVM’s autotuning capabilities (Chen et al., 2018b) to search for optimal or near-optimal parameter configurations.
3.3. Architecture specific optimizations:
TVM relies on LLVM to perform code generation for the desired CPU backend. Since generating optimal code is a hard problem, LLVM unsurprisingly tends to produce suboptimal code. To approach or surpass the performance of state of the art operator implementations, schedules can be augmented with handcrafted custom microkernels that implement the core computation. This allows us to take advantage of new hardware intrinsics that many CPU vendors are releasing to accelerate deep learning, such as pairwise addition instructions, as well as to optimize low-level memory loads and stores.
We create a custom ARM schedule (described in section 5) that takes advantage of specific hardware intrinsics to implement code that outperforms state of the art hand-optimized libraries. The custom schedule relies on a highly optimized hand-generated microkernel.
4. CPU-Agnostic Schedule
In this section we describe the layers of optimizations that we explored when putting together our bitserial operator schedule template. We show a running example of how we add optimizations and report their effects on a Raspberry Pi running the 2nd layer of ResNet-18 (see Table 1 for details) for a 1-bit weight, 2-bit activation convolution. Figure 4 shows snippets of the unoptimized compute rule and the pseudocode it generates, along with the final optimized compute rule and schedule. In Figure 3 we show speedups against an optimized 16-bit integer convolution as we add each optimization, showing how we can take an unoptimized schedule that runs at only 0.36x the speed of the baseline and make it almost 2x faster than the baseline. Many of these optimizations come from high performance computing techniques that improve memory locality and take advantage of parallelism and vectorization, and they are necessary steps for achieving optimized low precision convolutions.
Tiling
Tiling is a common optimization to improve temporal memory locality and reduce memory loads during matrix computations. Tiling splits input tensors into multiple subtensors or tiles, such that each tile fits in the cache to avoid evicting elements that will be reused.
TVM exposes functions to easily split and reorder computational axes without the user needing to manually edit for loops. In Figure 4a, we show how we apply tiling in TVM through calls to split and reorder. The height, width, and channels of the output matrix are split into blocks of size VH by VW by VC elements, which can be tuned by TVM. Using reorder, the iteration order is updated so that computation is completed within each tile.
After tiling, data is no longer fetched sequentially from input tensors, as data is stored in row-major order. A common trick to further improve tiling’s performance is to repack input tensors in a hierarchical fashion, such that elements within a tile are stored sequentially and the tiles themselves are stored in sequential order. Repacking requires a change to the computation description and is shown in the optimized compute rule in Figure 4.
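The loop structure that results from splitting and reordering axes can be illustrated with a plain Python matrix multiply, used here as a simpler stand-in for the convolution. This is only a sketch of the iteration order: the function name and the `tile` parameter are hypothetical, it does not use TVM, and Python itself will not show the cache benefit.

```python
def matmul_tiled(a, b, tile=2):
    """C = A x B computed tile by tile.

    The i and j loops are each split into an outer loop over tiles and an
    inner loop within a tile, then reordered so that all work for one output
    tile finishes before the next tile starts.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i_outer in range(0, n, tile):              # walk over output tiles
        for j_outer in range(0, m, tile):
            for i in range(i_outer, min(i_outer + tile, n)):   # stay inside one tile
                for j in range(j_outer, min(j_outer + tile, m)):
                    acc = 0
                    for kk in range(k):            # reduction axis
                        acc += a[i][kk] * b[kk][j]
                    c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [2, 2]]
tiled = matmul_tiled(a, b, tile=2)
```

The tile size plays the role of the tunable block sizes in the schedule: it changes the order of memory accesses but not the result.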
Unrolling
Loop unrolling replaces loops with copies of the loop body, and improves performance by reducing the number of branches during program execution. This performance gain comes at the cost of increased binary size, as instructions inside the loop are duplicated. Loop unrolling is easily expressed in TVM through the unroll scheduling primitive, and we apply it to small dimensions such as the bit axis.
Vectorization
Vectorization takes advantage of hardware support for SIMD instructions, and applies each instruction to a vector of data elements. Calling TVM’s vectorize primitive on an axis provides a hint to the compiler to output vectorized instructions. In our schedule, we tell the compiler to vectorize the innermost computation axis, since it is stored sequentially in the output and kernel tensors, allowing for contiguous vector loads and stores. In the generated C code in Figure 4, single element indices are replaced with a ramp(n) expression representing a vector of n elements, or a vector made from n copies of a single element.
On the ARM Cortex-A53, this allows us to use ARM’s SIMD unit. The intrinsic popcount instruction, vcnt, is only available as a SIMD instruction, so on ARM devices it is crucial to vectorize the binary dot product’s computation; otherwise popcount is implemented inefficiently in software. On x86, the opposite scenario occurs and popcount is only available for scalar registers. However, we found that the vectorized software popcount outperforms the scalar hardware popcount, as confirmed by (Muła et al., 2017), and left vectorization in the generic CPU schedule.
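For readers unfamiliar with software popcounts, one common formulation counts bits four at a time through a 16-entry lookup table, an approach that maps naturally onto byte-shuffle SIMD instructions. The scalar Python below sketches that idea only; it is not claimed to be the code LLVM emits or the exact algorithm of (Muła et al., 2017), and the names are illustrative.

```python
# Number of set bits for every possible 4-bit value.
NIBBLE_POPCOUNT = [bin(i).count("1") for i in range(16)]


def popcount32_lookup(word: int) -> int:
    """Software popcount of a 32-bit word via eight 4-bit table lookups."""
    total = 0
    for shift in range(0, 32, 4):
        total += NIBBLE_POPCOUNT[(word >> shift) & 0xF]
    return total


example = popcount32_lookup(0x80000001)
```

In a SIMD setting the eight lookups for many words can be issued together, which is why a software popcount can compete with a scalar hardware instruction.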
Parallelization
We parallelize the outermost axis of computation by calling TVM’s parallel function, allowing us to take advantage of all four cores on the Raspberry Pi, so that the generic CPU schedule outperforms the baseline by almost 2x.
All of the optimizations described above (and a few more) are necessary for a high performance convolution implementation. Optimizing code is often time consuming, and can significantly reduce the readability of code through bloat from unrolling and further nesting of loops from tiling. While we carefully decided the layout of data and which axes to tile, split, and parallelize, we rely on TVM to handle the time consuming process of generating the code, allowing us to quickly write moderately efficient schedules.
5. ARM Cortex-A53 Specific Convolution Schedule
The techniques described in the previous section are CPU agnostic, with the same schedule able to generate code for different CPU backends such as X86 and ARM. However, to reach the performance of hand optimized microkernels, schedules require architecture aware optimizations through backend specific intrinsic instructions and human insight into efficient techniques.
In this section we describe two ways in which we optimize ARM-specific schedules. The first is a modification to the code generator that is invisible to the user and overrides LLVM’s default popcount lowering rule. The second is an ARM-specific tensorize primitive that implements a highly optimized bitserial matrix-vector operation.
5.1. Custom ARM Popcount Lowering Rule
Analyzing the generated assembly code revealed that LLVM (TVM’s code generator for ARM and other CPU backends, version 5.0) lowers popcount inefficiently, requiring 14 assembly instructions to implement a 32-bit vectorized popcount. The inefficiency arises because ARM’s hardware popcount instruction, vcnt, operates at a granularity of 8 bits. When LLVM lowers a 32-bit vectorized popcount, it must reinterpret the input as a vector of 8-bit elements and then sum each group of four neighboring elements to re-form the 32-bit vector. This default accumulation step is inefficient, requiring 13 additional assembly instructions.
We replaced the default lowering rule in TVM’s code generation for 32-bit vectorized popcounts with one that needs only 4 assembly instructions, by relying on a special intrinsic instruction, vpaddl, that performs pairwise adds into a larger datatype. Figure 5 shows the LLVM 32-bit popcount lowering rule as well as our custom rule.
This change requires no modifications to the schedule, and speeds up the generic CPU schedule by 2.7x on the Raspberry Pi, for a total speedup of 5.4x over the 16-bit integer baseline.
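The arithmetic behind the custom lowering rule can be modelled in Python with lists standing in for vector lanes. This sketch only mirrors the data flow described above, a per-byte popcount followed by widening pairwise adds; the function names are hypothetical, the little-endian lane order is an assumption, and the exact instruction sequence is the one given in Figure 5, not this code.

```python
def vcnt_u8(lanes):
    """Per-byte popcount, modelling vcnt on a vector of 8-bit lanes."""
    return [bin(byte).count("1") for byte in lanes]


def pairwise_add_long(lanes):
    """Add adjacent lane pairs into lanes of twice the width, modelling vpaddl."""
    return [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]


def popcount_u32_lanes(words):
    """Popcount of each 32-bit word in a vector, built from 8-bit popcounts."""
    # Reinterpret every 32-bit word as four 8-bit lanes (assumed little-endian).
    byte_lanes = [(word >> (8 * i)) & 0xFF for word in words for i in range(4)]
    counts8 = vcnt_u8(byte_lanes)            # 8-bit lanes, each at most 8
    counts16 = pairwise_add_long(counts8)    # 16-bit lanes, each at most 16
    counts32 = pairwise_add_long(counts16)   # 32-bit lanes, each at most 32
    return counts32


counts = popcount_u32_lanes([0xFFFFFFFF, 0x00000000, 0x01010101, 0x80000001])
```

Two widening pairwise adds are enough to turn four byte counts into one 32-bit count, which is why the reduction needs so few instructions.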
5.2. Optimized ARM bitserial matrix vector inner loop
In order to surpass the performance of hand-optimized code, we analyzed code from Caffe2’s ultra low precision library (Tulloch and Jia, 2017) and gemmbitserial (Umuroglu and Jahre, 2017). We identified patterns TVM currently cannot express, such as vectorization along a reduction axis, and low-level tricks the code generator misses, and wrote an efficient bitserial matrix-vector multiply microkernel. We packaged the microkernel into a tensorize primitive, which we parameterize for flexibility in activation and weight precision. Our tensorize primitive can be used in any ARM schedule, and we use it for all our ARM-specific convolution and matrix multiply schedules. However, it requires tiling input tensors to match vector lengths and re-laying out input tensors so that tiles are stored contiguously.
Figure 6 gives a simplified example of how our tensorize primitive behaves and the assembly code it generates. It performs 4 bitserial dot products and optimizes the accumulation and write-back steps in a tree reduction fashion, allowing for vectorized loads and stores on all tensors.
We summarize below the optimizations the code generator missed. We note that in high performance computing it is common to use inline assembly for performance-critical sections, such as the inner loops of a computation, and TVM provides a method to mimic this.
Accumulation in small datatypes:
In order to prevent saturation of values, the output of a low precision operation needs to be stored in a larger data type, such as an int16 or int32, depending on the size of the input tensors. TVM’s code generator promotes inputs to a larger datatype after a single operation; however, multiple operations can be safely accumulated before moving up a storage size. Our tensorize primitive takes advantage of this by accumulating in 8-bit integers until overflow is possible and then extending accumulation to 16 bits.
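A short calculation shows how much headroom a small accumulator offers. The sketch assumes unsigned accumulators and that each accumulated term is a single per-byte popcount result, which is at most 8; weighting of higher bitplanes, which would lower the bound, is not modelled. The function name is illustrative.

```python
def max_safe_accumulations(acc_bits: int, max_term: int) -> int:
    """Number of terms, each at most `max_term`, that are guaranteed to fit
    in an unsigned accumulator of `acc_bits` bits without overflow."""
    return ((1 << acc_bits) - 1) // max_term


# A per-byte popcount result is at most 8.
in_8_bits = max_safe_accumulations(acc_bits=8, max_term=8)
in_16_bits = max_safe_accumulations(acc_bits=16, max_term=8)
```

Under these assumptions dozens of popcount results fit in an 8-bit lane before widening is needed, whereas promoting after every operation halves the number of lanes per vector from the start.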
High level patterns:
Programmers are very good at writing code that follows high-level patterns, such as tree reductions. In general, compilers fail to perform high-level optimizations such as this, since most optimizations occur within small, localized regions of code.
After adding the tensorized reduction to our inner loop, our low precision convolution schedule performs almost 7x faster than the optimized 16-bit baseline, with the final 1.6x coming from ARM-specific optimizations to the convolution schedule (see Figure 3).
6. Evaluation
6.1. Methodology
We perform an analysis of our low precision bitserial operators on two CPU backends, a low-power Raspberry Pi with an ARM Cortex-A53 processor and a high-end x86 Intel i7-4790K processor. The ARM Cortex-A53 is a four-core 1.2 GHz processor and belongs to the ARMv8 architecture, while the x86 machine is a 4.0 GHz four-core processor with eight hyperthreads.
Note that throughout the evaluation section, when referring to our low precision operators, we adopt the naming convention from (Umuroglu and Jahre, 2017) and refer to an x-bit weight and y-bit activation operation as WxAy. For all our experiments we include the cost of bitpacking activations with the quantized operations, but not the cost of bitpacking weights, since we assume weights can be bitpacked ahead of time.
6.2. Raspberry Pi Results
6.2.1. Matrix Multiply
We benchmark the performance of low precision matrix multiplication, which makes up the computation of dense fully connected layers, against an optimized 8-bit integer implementation in TVM for three precisions: W1A1, W1A2, and W2A2. Note that the same schedule is used for all precisions, though tiling parameters have been tuned separately. To prevent saturation of results, the outputs of both the 8-bit baseline and the quantized operators are stored in a larger storage type, with the 8-bit baseline accumulating in 32-bit integers and the quantized operators accumulating in 16-bit integers.
In Figure 7, we plot speedup relative to the 8-bit baseline. As expected, W1A1 performs significantly better and is up to 11x faster than the 8-bit baseline. Performance drops steadily as bitwidths increase, with maximum speedups of 6.3x and 3.1x for W1A2 and W2A2, scaling roughly with the product of bitwidths.
Name  Operator  H, W    IC, OC    K, S
2     conv2d    56, 56  64, 64    3, 1
3     conv2d    56, 56  64, 64    1, 1
4     conv2d    56, 56  64, 128   3, 2
5     conv2d    56, 56  64, 128   1, 2
6     conv2d    28, 28  128, 128  3, 1
7     conv2d    28, 28  128, 256  3, 2
8     conv2d    28, 28  128, 256  1, 2
9     conv2d    14, 14  256, 256  3, 1
10    conv2d    14, 14  256, 512  3, 2
11    conv2d    14, 14  256, 512  1, 2
12    conv2d    7, 7    512, 512  3, 1
Table 1. Configurations of 2D convolution operators in ResNet-18. Layer 1 is omitted as its input channel depth is too small to allow efficient packing. H/W stand for height and width, IC for input channels, OC for output channels, K for kernel size, and S for stride size.
6.2.2. Convolutions
Convolutions make up the bulk of image recognition neural networks. In this section we benchmark our low precision operators on layers 2-12 of ResNet-18 (see Table 1 for information). We omit the first layer, as most low precision networks perform the first layer in full precision. In Figure 9, we plot the speedup of three low precision convolutions (W1A1, W1A2, and W2A2) against an optimized 16-bit integer baseline (Jiang et al., 2018) that experiences negligible accuracy degradation compared to full precision floating point. Our results show significant speedups over the baseline, reaching 25x, 13x, and 8x for W1A1, W1A2, and W2A2 respectively. It should be noted that layers 3, 5, and 8 show poor speedups. However, these layers perform an order of magnitude fewer operations than the other layers, indicating that bitserial computation does not scale well to small problems that cannot amortize the bitpacking costs.
Additionally, we compare our low precision convolutions against the current state of the art implementation, Caffe2’s ultra low precision library (Tulloch and Jia, 2017), which was written specifically for the ARM v7/v8 architectures and inlines calls to ARM NEON intrinsics. We verified their result that the innermost loop of computation reaches about 70% of peak theoretical performance for binary operations, assuming a vectorized instruction can be issued every cycle.
In Figure 8, we plot the speedup of our low precision implementations against theirs. The Caffe2 library currently only supports W1A2 operations and is single-threaded; therefore we show a single-threaded implementation of our code for a fair comparison, and a multithreaded implementation to highlight the maximum speedup we can achieve using parallelism.
Our single-threaded implementation performs roughly 2.3x better than the baseline. It should be noted that we perform significantly better on layers 5, 6, and 11, which are 1x1 convolutions with a stride of 2 that the baseline did not optimize well. Our implementation performs 1.6x better when we omit these layers. Since we modeled our tensorized inner loop after their code, both implementations emit roughly the same assembly code for the innermost loop. We attribute our single-threaded speedup to selecting better tiling parameters using TVM’s autotuning infrastructure, and to the use of an efficient in-place convolution that results in less memory duplication than the convolution lowering strategy the baseline implemented.
6.2.3. Bitserial Limit Study
We performed a low precision limit study to analyze how performance scales with increasing precision of weights and activations. In Figure 10 we report speedup for all combinations of one- to four-bit activation and weight convolutions against the 16-bit integer baseline from section 6.2.2. We perform this study on layer 9 of ResNet-18, as it responded best to quantization.
Our results confirm that computation scales roughly with the product of bitwidths, as the W1A1 convolution performs 14.4x faster than W4A4. We also see the value of memory reuse: W1A4 and W2A2 have the same computational complexity, but W2A2 performs 1.3x better, as it experiences better data reuse among the bitplanes.
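The idealized operation count behind these observations is simple enough to state as code. The function below counts binary dot products per output element under the O(MN) model of Section 2.2; it is a cost model only, with an illustrative name, and it ignores bitpacking, memory traffic, and data reuse.

```python
def binary_dot_products(weight_bits: int, activation_bits: int) -> int:
    """Binary dot products per output element under the O(MN) model."""
    return weight_bits * activation_bits


ideal_w1a1_over_w4a4 = binary_dot_products(4, 4) / binary_dot_products(1, 1)
same_cost = binary_dot_products(1, 4) == binary_dot_products(2, 2)
```

The model predicts a 16x gap between W1A1 and W4A4, close to the measured 14.4x, and assigns W1A4 and W2A2 the same cost, so the 1.3x difference between them has to come from effects outside the operation count, such as the data reuse noted above.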
6.3. X86
We benchmark our low precision operators on layers 2-12 of ResNet-18 (see Table 1 for information) on x86 as well. Since we have not implemented any x86-specific schedules, these results are for the generic CPU schedule. We compare our results against an optimized 32-bit floating point baseline implemented in TVM. These schedules are optimized to take advantage of x86’s AVX2 vectorized operations and were contributed by engineers at Amazon. We do not compare against any hand-optimized bitserial kernels, as we currently do not know of any, perhaps due to x86’s lack of a vectorized popcount instruction.
In Figure 11 we plot the speedup of three low precision convolutions (W1A1, W1A2, and W2A2) against the full precision baseline. Our generic CPU schedule performed moderately well, achieving average speedups of 5.38x, 2.73x, and 1.69x on W1A1, W1A2, and W2A2 respectively. Similar to the Raspberry Pi, we see a wide range of speedups across the layers, with peak speedups twice the average speedup, indicating that some layers respond better to low precision than others.
7. Conclusion
To conclude, we detailed the modifications we made to TVM to implement low precision DNN operators for CPUs, emphasizing support for arbitrary low precision quantized neural networks. We provide a library of flexible compute operators, such as bitpacking and quantized bitserial convolution and matrix multiply, that users can reuse or write custom schedules for.
We then used these operators to perform an extended case study on optimizing low precision bitserial convolutions for the Raspberry Pi, and showed that by using a custom schedule we could outperform a hand-optimized 1-bit weight, 2-bit activation convolution kernel by 2.3x on the layers of ResNet-18. Furthermore, our 1-bit weight, 2-bit activation end-to-end Raspberry Pi inference achieves a speedup of 3.3x over a full precision implementation.
8. Acknowledgements
We would like to thank Yaman Umuroglu for his help guiding us, and Andrew Tulloch for letting us use his microkernel and providing feedback. This work was supported in part by a Google PhD Fellowship for Tianqi Chen, ONR award #N000141612795, NSF under grants CCF1518703, CNS1614717, and CCF1723352, and gifts from Intel (under the CAPA program), Oracle, Huawei and anonymous sources.
References
 int ([n. d.]) [n. d.]. Intel Math Kernel Library. Reference Manual. Intel Corporation. Santa Clara, USA. ISBN 630813054US.
 Cai et al. (2017) Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. 2017. Deep learning with low precision by half-wave Gaussian quantization. arXiv preprint arXiv:1702.00953 (2017).
 Chen et al. (2018a) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018a. TVM: End-to-End Optimization Stack for Deep Learning. arXiv preprint arXiv:1802.04799 (2018).
 Chen et al. (2018b) Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018b. Learning to Optimize Tensor Programs. arXiv preprint arXiv:1805.08166 (2018).
 Courbariaux et al. (2016) Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016).
 Dukhan ([n. d.]) Marat Dukhan. [n. d.]. NNPACK. https://github.com/Maratyszcza/NNPACK. ([n. d.]).
 Fromm et al. (2018) Josh Fromm, Shwetak Patel, and Matthai Philipose. 2018. Heterogeneous Bitwidth Binarization in Convolutional Neural Networks. arXiv preprint arXiv:1805.10368 (2018).
 Han et al. (2016) Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243–254.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision. 1026–1034.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061 (2016).
 Jiang et al. (2018) Ziheng Jiang, Tianqi Chen, and Mu Li. 2018. Efficient Deep Learning Inference on Edge Devices. (2018).
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
 Muła et al. (2017) Wojciech Muła, Nathan Kurz, and Daniel Lemire. 2017. Faster population counts using AVX2 instructions. Comput. J. 61, 1 (2017), 111–120.
 Parkhi et al. (2015) Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. 2015. Deep Face Recognition. In BMVC, Vol. 1. 6.
 Ragan-Kelley et al. (2013) Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6 (2013), 519–530.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision. Springer, 525–542.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. 2015. Going deeper with convolutions. CVPR.
 Tulloch and Jia (2017) Andrew Tulloch and Yangqing Jia. 2017. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).
 Umuroglu et al. (2017) Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 65–74.
 Umuroglu and Jahre (2017) Yaman Umuroglu and Magnus Jahre. 2017. Towards efficient quantized neural network inference on mobile devices: work-in-progress. In Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion. ACM, 18.
 Zhou et al. (2016) Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).