A Highly Parallel FPGA Implementation of Sparse Neural Network Training

05/31/2018 · by Sourya Dey, et al. · University of Southern California

We demonstrate an FPGA implementation of a parallel and reconfigurable architecture for sparse neural networks, capable of on-chip training and inference. The network connectivity uses pre-determined, structured sparsity to significantly lower memory and computational requirements. The architecture uses a notion of edge-processing and is highly pipelined and parallelized, decreasing training times. Moreover, the device can be reconfigured to trade off resource utilization with training time to fit networks and datasets of varying sizes. The overall effect is to reduce network complexity by more than 8x while maintaining high fidelity of inference results. This complexity reduction enables significantly greater exploration of network hyperparameters and structure. As proof of concept, we show implementation results on an Artix-7 FPGA.




I Introduction

Neural networks (NNs) in machine learning systems are critical drivers of new technologies such as image processing and speech recognition. Modern NNs are built as graphs with millions of trainable parameters [1, 2, 3], which are tuned until the network converges. This parameter explosion demands large amounts of memory for storage and logic blocks for operation, which make the process of training difficult to perform on-chip. As a result, most hardware architectures for NNs perform training off-chip on power-hungry CPUs/GPUs or the cloud, and only support inference capabilities on the final FPGA or ASIC device [4, 5, 6, 7, 8, 9, 10]. Unfortunately, off-chip training results in a non-reconfigurable network being implemented on-chip which cannot support training time optimizations over model architecture and hyperparameters. This severely hinders the development of independent NN devices which a) dynamically adapt themselves to new models and data, and b) do not outsource their training to costly cloud computation resources or data centers which exacerbate problems of large energy consumption [11].

Training a network with too many parameters makes it likely to overfit [12], and memorize undesirable noise patterns [13]. Recent works [14, 15, 16, 17] have shown that the number of parameters in NNs can be significantly reduced without degradation in performance. This motivates our present work, which is to train NNs with reduced complexity and easy reconfigurability on FPGAs. This is achieved by using pre-defined sparsity [14, 18, 15]. Compared to other methods of parameter reduction such as [5, 19, 20, 21, 10], pre-defined sparsity does not require additional computations or processing to decide which parameters to remove. Instead, most of the weights are always absent, i.e. sparsity is enforced prior to training. This results in a sparse network of lesser complexity as compared to a conventional fully connected (FC) network. Therefore the memory and computational burdens posed on hardware resources are reduced, which enables us to accomplish training on-chip. Section II describes pre-defined sparsity in more detail, along with a hardware architecture introduced in [14] which exploits it.

A key factor in NN hardware implementation is the effect of finite bit width. A previous FPGA implementation [22] used fixed point adders, but more resource-intensive floating point multipliers and floating-to-fixed-point converters. Another previous implementation [23] used probabilistic fixed point rounding techniques, which consumed additional DSP resources. Keeping hardware simplicity in mind, our implementation uses only fixed point arithmetic with clipping of large values.

The major contributions of the present work are summarized here and described in detail in Section III:

  • The first implementation of NNs which can perform both training and inference on FPGAs by exploiting parallel edge processing. The design is parametrized and can be easily reconfigured to fit on FPGAs of varying capacity.

  • A low complexity design which uses pre-defined sparsity while maintaining good network performance. To the best of our knowledge, this is the first NN implementation on FPGA exploiting pre-defined sparsity.

  • Theoretical analysis and simulation results which show that sparsity leads to reduced dynamic range and is more tolerant to finite bit width effects in hardware.

II Sparse Hardware Architecture

II-A Pre-defined Sparsity

Our notation treats the input of a NN as layer 0 and the output as layer $L$, so that the NN has $L$ junctions in between the layers. The number of neurons in the layers are $N_0, N_1, \ldots, N_L$, with $N_{i-1}$ and $N_i$ respectively being the number of neurons in the earlier (left) and later (right) layers of junction $i$. Every left neuron has a fixed number of edges (or weights) going from it to the right, and every right neuron has a fixed number of edges coming into it from the left. These numbers are defined as out-degree ($d^{out}_i$) and in-degree ($d^{in}_i$), respectively. For FC layers, $d^{out}_i = N_i$ and $d^{in}_i = N_{i-1}$. In contrast, pre-defined sparsity leads to sparsely connected (SC) layers, where $d^{out}_i < N_i$ and $d^{in}_i < N_{i-1}$, such that $N_{i-1} d^{out}_i = N_i d^{in}_i = W_i$, which is the total number of weights in junction $i$. Having a fixed $d^{out}_i$ and $d^{in}_i$ ensures that all neurons in a junction contribute equally and none of them get disconnected, since that would lead to a loss of information. The connection density in junction $i$ is given as $\rho_i = W_i / (N_{i-1} N_i)$, and the overall connection density of the network is defined as $\rho = \sum_{i=1}^{L} W_i / \sum_{i=1}^{L} N_{i-1} N_i$. Previous works [14, 15] have shown that overall density levels on the order of 10% incur negligible performance degradation, which motivates us to implement such low density networks on hardware in the present work.

II-B Hardware Architecture

This subsection describes the mathematical algorithm and the subsequent hardware architecture for a NN using pre-defined sparsity. The input layer, i.e. the leftmost, is fed activations ($a^j_0$) from the input data. For an image classification problem, these are image pixel values. Then the feedforward (FF) operation proceeds as described in eq. (1a):

$$a^j_i = \sigma\Big(\sum_{k} w^{jk}_i a^k_{i-1} + b^j_i\Big) \qquad (1a)$$
$$\dot{a}^j_i = \sigma'\Big(\sum_{k} w^{jk}_i a^k_{i-1} + b^j_i\Big) \qquad (2a)$$

Both eqs. (1a) and (2a) are $\forall j \in \{1, \ldots, N_i\}$. Here, $a$ is activation, $\dot{a}$ is its derivative (a-dot), $b$ is bias, $w$ is weight, and $\sigma$ and $\sigma'$ are respectively the activation function and its derivative (with respect to its input), which are described further in Section III. For $a$, $\dot{a}$ and $b$, subscript denotes layer number and superscript denotes a particular neuron in a layer. For the weights, $w^{jk}_i$ denotes the weight in junction $i$ which connects neuron $k$ in layer $i-1$ to neuron $j$ in layer $i$. The summation for a particular right neuron is carried out over the $d^{in}_i$ weights and left neuron activations which connect to it, i.e. over the corresponding left indexes $k$. These left indexes are arbitrary because the weights in a junction are interleaved, or permuted. This is done to ensure good scatter, which has been shown to enhance performance [15].

The output layer activations $a^n_L$ are compared with the ground truth labels $y^n$, which are typically one-hot encoded, i.e. $y^n \in \{0, 1\}$ $\forall n \in \{1, \ldots, N_L\}$: $y^n$ is 1 if the class represented by output neuron $n$ is the true class of the input sample, otherwise 0. We use the cross-entropy cost function $C$ for optimization, the derivative of which with respect to the activations is $\partial C / \partial a^n_L = (a^n_L - y^n) / \big(a^n_L (1 - a^n_L)\big)$. We also experimented with quadratic cost, but its performance was inferior compared to cross-entropy. The backpropagation (BP) operation proceeds as described in eq. (3a):

$$\delta^n_L = \frac{\partial C}{\partial a^n_L}\, \dot{a}^n_L \qquad (3a)$$
$$\delta^k_{i-1} = \dot{a}^k_{i-1} \sum_{j} w^{jk}_i \delta^j_i \qquad (4a)$$

where $\delta$ denotes delta value. Eq. (3a) is $\forall n \in \{1, \ldots, N_L\}$, and eq. (4a) is $\forall k \in \{1, \ldots, N_{i-1}\}$. The summation for a particular left neuron is carried out over the $d^{out}_i$ weights and right neuron deltas which connect to it, i.e. over the corresponding right indexes $j$. The right indexes are arbitrary due to interleaving.

Based on the $\delta$ values, the trainable weights and biases have their values updated and the network learns. We used the gradient descent algorithm, so the update (UP) operation proceeds as described in eq. (5a):

$$b^j_i \leftarrow b^j_i - \eta\, \delta^j_i \qquad (5a)$$
$$w^{jk}_i \leftarrow w^{jk}_i - \eta\, \delta^j_i\, a^k_{i-1} \qquad (6a)$$

where $\eta$ is the learning rate hyperparameter. Both eqs. (5a) and (6a) are $\forall i \in \{1, \ldots, L\}$. While eq. (5a) is $\forall j \in \{1, \ldots, N_i\}$, eq. (6a) is only for those $j$ and $k$ which are connected by a weight $w^{jk}_i$.

The architecture uses a) operational parallelization to make FF, BP and UP occur simultaneously in each junction, and b) junction pipelining wherein all the junctions execute all 3 operations simultaneously on different inputs. Thus, there is a factor of $3L$ speedup as compared to doing 1 operation at a time, albeit at the cost of increased hardware resources. Fig. 1 shows the architecture in action. As an example, consider $L = 2$, i.e. the network has an input layer, a single hidden layer, and an output layer. When the second junction is doing FF and computing cost on one input, it is also doing BP on the previous input which just finished FF, as well as updating (UP) its parameters from the finished cost computation results of the input before that. Simultaneously, the first junction is doing FF on the latest input, and UP using the finished BP results of an earlier input. BP does not occur in the first junction because there are no $\delta_0$ values to be computed.
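As a sequential software reference model of eqs. (1a)-(6a) for a single sparse junction (illustrative names and tiny dimensions; the hardware performs these operations $z_i$ weights at a time across all junctions simultaneously):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(a_prev, w, b, conn, delta_next, eta):
    """One FF + BP + UP pass through a sparse junction.
    conn[j] lists the left-neuron indices feeding right neuron j;
    w[j][k] is the weight on the k-th such edge; delta_next[j] is the
    delta of right neuron j (already computed, e.g. from the cost)."""
    # FF: eqs. (1a)/(2a) -- summation only over connected left neurons
    s = [sum(w[j][k] * a_prev[l] for k, l in enumerate(conn[j])) + b[j]
         for j in range(len(conn))]
    a = [sigmoid(v) for v in s]
    a_dot = [ai * (1.0 - ai) for ai in a]     # sigmoid derivative
    # BP: eq. (4a)'s summation for each left neuron; the caller multiplies
    # by the previous layer's a-dot (skipped entirely for the input layer)
    delta_prev = [0.0] * len(a_prev)
    for j, lefts in enumerate(conn):
        for k, l in enumerate(lefts):
            delta_prev[l] += w[j][k] * delta_next[j]
    # UP: eqs. (5a)/(6a) -- gradient descent on connected weights only
    for j, lefts in enumerate(conn):
        b[j] -= eta * delta_next[j]
        for k, l in enumerate(lefts):
            w[j][k] -= eta * delta_next[j] * a_prev[l]
    return a, a_dot, delta_prev
```

In hardware, the three loops above run concurrently on different inputs rather than back-to-back on the same one.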

Fig. 1: Junction pipelining and operational parallelization in the architecture.

The architecture uses edge processing by giving every junction a degree of parallelism $z_i$, which is the number of weights processed in parallel in 1 clock cycle (or simply cycle) by all 3 operations. So the total number of cycles to process a junction is $W_i / z_i$, plus some additional cycles for memory accesses. This comprises a block cycle, the reciprocal of which is the ideal throughput (inputs processed per second).

All parameters and computed values in a junction are stored in banks of $z_i$ memories. The weights in the $k$th cells of all $z_i$ weight memories are read out in the $k$th cycle. Additionally, up to $z_i$ activations, a-dots, deltas and biases are accessed in a cycle. The order of accessing them can be natural (row-by-row like the weights), or permuted (due to interleaving). All accesses need to be clash-free, i.e. the different values to be accessed in a cycle must all be stored in different memories so as to avoid memory stalls, as shown in Fig. 2. Optimum clash-free interleaver designs are discussed in [18]. Fig. 3 shows simultaneous FF, BP and UP, along with memory accesses, in more detail inside a single junction.
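Clash-freedom is straightforward to check in software. A minimal checker, assuming a hypothetical banking where the activation at address `addr` lives in bank `addr % z` (the paper instead uses the optimized interleaver designs of [18]):

```python
def is_clash_free(access_order, z):
    """access_order: left-activation addresses in the order the weights
    need them, z per clock cycle. Clash-free means the z addresses of
    each cycle fall in z distinct banks (bank = addr % z), so no single
    memory is read more than once per cycle."""
    for c in range(0, len(access_order), z):
        banks = [addr % z for addr in access_order[c:c + z]]
        if len(set(banks)) != len(banks):
            return False  # two accesses hit the same bank: a stall
    return True
```

For example, with `z = 4`, the permuted order `[7, 4, 5, 6]` is fine (banks 3, 0, 1, 2), but `[0, 4, 1, 2]` clashes because addresses 0 and 4 share bank 0.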

Fig. 2: Example of clash-freedom in a junction. In each cycle, weights are read corresponding to 2 right neurons (shown in same color). When traced back through the interleaver, this requires accessing left activations in permuted order. Only 1 element from each activation memory is read in a cycle in order to preserve clash-freedom. This is shown by the checkerboards, where only 1 cell in each column is shaded. Picture taken from [18] with permission.
Fig. 3: Operational parallelization in a junction, showing natural and permuted order accesses as solid and dashed lines, respectively.

This architecture is ideal for implementation on reconfigurable hardware due to a) its parallel and pipelined nature, b) its low memory footprint due to sparsity, and particularly c) the degree of parallelism parameters, which can be tuned to efficiently utilize available hardware resources, as described in Sections III-D and III-E.

III FPGA Implementation

III-A Device and Dataset

We implemented the architecture described in Section II-B on an Artix-7 FPGA. This is a relatively small FPGA and therefore allowed us to explore efficient design styles and optimize our RTL to make it more robust and scalable. We experimented on the MNIST dataset, where each input is an image consisting of 784 pixels, each an 8-bit grayscale value. Each ground truth output is a one-hot encoding of the digits 0-9. Our implementation uses powers of 2 for network parameters to simplify the hardware realization. Accordingly we padded each input with 0s to make it have 1024 pixels, and padded the outputs with 0s to get a 32-element one-hot encoding. Prior to hardware implementation, software experiments showed that having extra always-0 I/O did not detract from network performance.

III-B Network Configuration and Training Setup

The network has 1 hidden layer of 64 neurons, i.e. 2 junctions overall. Other parameters were chosen on the basis of hardware constraints and experimental results, which are described in Sections III-C and III-D. The final network configuration is given in Table I.

Junction Number ($i$)              1        2
Left Neurons ($N_{i-1}$)           1024     64
Right Neurons ($N_i$)              64       32
Fan-out ($d^{out}_i$)              4        16
Weights ($W_i$)                    4096     1024
Fan-in ($d^{in}_i$)                64       32
Degree of Parallelism ($z_i$)      128      32
Block cycle ($W_i / z_i$)*         32       32
Density ($\rho_i$)                 6.25%    50%
Overall Density ($\rho$)           7.576%
*In terms of number of clock cycles, not considering the additional clock cycles needed for memory accesses.
TABLE I: Implemented Network Configuration

We selected a fixed subset of the MNIST inputs to comprise 1 epoch of training. The learning rate ($\eta$) is initially set to a power of 2, halved after the first 2 epochs, then after every 4 epochs until it reaches its final value. Dynamic adjustment of $\eta$ leads to better convergence, while keeping it to a power of 2 reduces the multiplications in eq. (5a) to bit shifts. Pre-defined sparsity leads to a total of 5216 trainable parameters (5120 weights and 96 biases), which is much less than the 67680 of the corresponding FC network, so we theorized that overfitting was not an issue. We verified this using software simulations, and hence did not apply weight regularization.
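Since $\eta$ is a power of 2 and all values are fixed point, the learning-rate multiplications in eqs. (5a)/(6a) reduce to arithmetic right shifts. A sketch ($\eta = 2^{-3}$ is only an example value):

```python
def lr_multiply(x_fixed, lr_shift):
    """Multiply a signed fixed-point integer by eta = 2**-lr_shift.
    Python's >> on negative ints is an arithmetic shift, matching
    two's-complement hardware behavior."""
    return x_fixed >> lr_shift

update = lr_multiply(96, 3)   # 96 * 2**-3 = 12 in raw fixed-point units
```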

III-C Bit Width Considerations

III-C1 Parameter Initialization

We initialized weights using the Glorot Normal technique, i.e. their values are drawn from Gaussian distributions with mean 0 and variance $2 / (N_{i-1} + N_i)$. This translates to a three standard deviation range of approximately $\pm 0.13$ for junction 1 and $\pm 0.43$ for junction 2 in our network configuration described in Table I.

The biases in our architecture are stored along with the weights as an augmentation to the weight memory banks. So we initialized biases in the same manner as weights. Software simulations showed that this led to no degradation in performance from the conventional method of initializing biases with 0s. This makes sense since the maximum absolute value from initialization is much closer to 0 than their final values when the network converges, as shown in Fig. 4.

To simplify the RTL, we used the same set of unique values to initialize all weights and biases within a junction. Again, software simulations showed that this led to no degradation in performance as compared to initializing all of them randomly. This is not surprising, since an appropriately high initial learning rate will drive each weight and bias towards its own optimum value, regardless of similar values at the start.
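The initialization ranges quoted above follow directly from the Glorot Normal variance $2 / (N_{i-1} + N_i)$ and the layer sizes of Table I:

```python
import math

def glorot_three_sigma(n_left, n_right):
    """Three standard deviations of the Glorot Normal distribution,
    which has mean 0 and variance 2 / (fan_in + fan_out)."""
    return 3.0 * math.sqrt(2.0 / (n_left + n_right))

r1 = glorot_three_sigma(1024, 64)  # junction 1: about +/-0.13
r2 = glorot_three_sigma(64, 32)    # junction 2: about +/-0.43
```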

Fig. 4: Maximum absolute values (left y-axis) of the network variables, and percentage classification accuracy (right y-axis), as the network is trained.

III-C2 Fixed Point Configuration

We recreated the aforementioned initial conditions in software and trained our configuration to study the range of values of the network variables until convergence; the results are in Fig. 4. The activation values are generated using the sigmoid function, which has range $(0, 1)$.

To keep the hardware optimal, we decided on the same fixed point bit configuration for all computed values and trainable parameters — activations, a-dots, deltas, weights and biases. Our configuration is characterized by the bit triplet $(b_w, b_i, b_f)$, which are respectively the total number of bits, integer bits, and fractional bits, with the constraint $b_w = 1 + b_i + b_f$, where the 1 is for the sign bit. This gives a numerical range of $[-2^{b_i}, 2^{b_i})$ and precision of $2^{-b_f}$. Fig. 4 shows that the maximum absolute values of the network variables during training stay within 8. Accordingly we set $b_i = 3$. We then experimented with different values for the bit triplet and obtained the results shown in Table II. Accuracy is measured on the last 1000 training samples. Noting the diminishing returns and impractical utilization of hardware resources for high bit widths, we chose the bit triplet $(12, 3, 8)$ as being the optimal case.
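A software model of this fixed point format; for the chosen $(12, 3, 8)$ triplet the representable range is $[-8, 8 - 2^{-8}]$ with precision $2^{-8}$:

```python
def to_fixed(x, b_i=3, b_f=8):
    """Quantize x to a signed fixed-point value with b_i integer and b_f
    fractional bits (b_w = 1 + b_i + b_f total), saturating out-of-range
    values instead of wrapping around."""
    lo = -(1 << b_i)                # -8 for b_i = 3
    hi = (1 << b_i) - 2.0 ** -b_f   # 7.99609375 for (12, 3, 8)
    x = min(max(x, lo), hi)
    return round(x * (1 << b_f)) / (1 << b_f)
```

For example, `to_fixed(10)` saturates to 7.996, and values smaller than half the precision quantize to 0.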

$(b_w, b_i, b_f)$   FPGA LUT Utilization %   Accuracy after 1 epoch   Accuracy after 15 epochs
(8, 2, 5)           37.89                    78                       81
(10, 2, 7)          72.82                    90.1                     94.9
(10, 3, 6)          63.79                    88                       93.8
(12, 3, 8)          83.38                    90.3                     96.5
(16, 4, 11)         112                      91.9                     96.5
TABLE II: Effect of Bit Width on Performance

III-C3 Dynamic Range Reduction due to Sparsity

We found that sparsity leads to a reduction in the dynamic range of network variables, since the summations in eqs. (1a) and (4a) are over smaller ranges. This motivated us to use a special form of adder and multiplier which preserves the bit triplet between inputs and outputs by clipping large absolute values of the output to either the positive or negative maximum allowed by the range. For example, 10 would become 7.996 and $-10$ would become $-8$. Fig. 5 analyzes the worst clipping errors by comparing the absolute values of the argument of the sigmoid function in the hidden layer, i.e. the summation $\sum_k w^{jk}_1 a^k_0 + b^j_1$ from eq. (1a), for our sparse case vs. the corresponding FC case (100% density in both junctions). Notice that the sparse case only has 17% of its values clipped due to being outside the dynamic range afforded by $b_i = 3$, while the FC case has 57%. The sparse case also has a smaller variance. This implies that the hardware errors introduced due to finite bit-width effects are less pronounced for our pre-defined sparse configuration as compared to FC.

Fig. 5: Histograms of the absolute value of eq. (1a)'s sigmoid argument with respect to dynamic range for (a) sparse vs. (b) FC cases, as obtained from ideal floating point simulations in software. Values right of the pink line are clipped.

III-C4 Experiments with ReLU

As demonstrated in the literature [1, 2, 3], the native (ideal) ReLU activation function is more widely used than sigmoid due to the former's better performance, lack of a vanishing gradient problem, and tendency towards generating sparse outputs. However, ideal ReLU is not practical for hardware due to its unbounded range. We experimented with a modified form of the ReLU activation function where the outputs were clipped to a) 8, which is the maximum supported by $b_i = 3$, and b) 1, to preserve bit width consistency in the multipliers and adders and ensure compatibility with sigmoid activations. Fig. 6 shows software simulations comparing sigmoid with these cases. Note that ReLU clipped at 8 converges similarly to sigmoid, but sigmoid has better initial performance. Moreover, there is no need to promote extra sparsity by using ReLU because our configuration is already sparse, and sigmoid does not suffer from vanishing gradient problems because of the small range of our inputs. We therefore concluded that sigmoid activation for all layers is the best choice.

Fig. 6: Comparison of activation functions.

III-D Implementation Details

III-D1 Sigmoid Activation

The sigmoid function uses exponentials, which are computationally infeasible to obtain in hardware. So we pre-computed the values of $\sigma$ and $\sigma'$ and stored them in look-up tables (LUTs). Interpolation was not used; instead we computed sigmoid for all 4096 possible 12-bit arguments up to the full 8 fractional bits of accuracy. Its derivative values, which have a range of only $(0, 0.25]$, were likewise stored to a fixed number of fractional bits of accuracy. Note that clipped ReLU activation uses only comparators and needs no LUTs; however, the sigmoid LUTs incur negligible hardware cost. This reinforces our decision to use sigmoid instead of ReLU.
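A software model of the LUT approach: precompute $\sigma$ for all 4096 possible 12-bit two's-complement argument codes of the $(12, 3, 8)$ format, then index by the raw code (here the table holds floats; the hardware would store 8-fractional-bit values):

```python
import math

B_F = 8  # fractional bits of the (12, 3, 8) format

def _code_to_value(code):
    """Interpret a 12-bit code as a signed fixed-point number."""
    return (code - 4096 if code >= 2048 else code) / (1 << B_F)

SIGMOID_LUT = [1.0 / (1.0 + math.exp(-_code_to_value(c))) for c in range(4096)]

def sigmoid_lut(x_fixed):
    """Sigmoid of a signed 12-bit fixed-point argument, by table lookup."""
    return SIGMOID_LUT[x_fixed & 0xFFF]
```

For example, the raw code 256 represents $1.0$, so `sigmoid_lut(256)` returns $\sigma(1)$ without evaluating any exponential at run time.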

III-D2 Interleaver

We used clash-free interleavers of the SV+SS variation, as described in [18]. Starting vectors for all sweeps were pre-calculated and hard-coded into FPGA logic.

III-D3 Arithmetic Units

We numbered the weights sequentially on the right side of every junction, which leads to permuted numbering on the left side due to interleaving. We chose each $z_i$ to be an integral multiple of $d^{in}_i$. This means that the weights accessed in a cycle correspond to an integral number ($z_i / d^{in}_i$) of right neurons, so the FF summations in eq. (1a) can occur in a single cycle. This eliminates the need for storing FF partial sums. The total number of multipliers required for FF is $z_1 + z_2 = 160$. The summations also use a tree adder of depth $\log_2 d^{in}_i$ for every neuron processed in a cycle.

BP does not occur in the first junction since the input layer has no $\delta$ values. The BP summation in eq. (4a) needs several cycles to complete for a single left neuron since the weight numbering is permuted. This necessitates storing partial sums; however, tree adders are no longer required. Eq. (4a) for BP has 2 multiplications, so the total number of multipliers required is $2 z_2 = 64$.

The UP operation in each junction requires $z_i$ adders for the weights and $z_i / d^{in}_i$ adders for the biases, since that many right neurons are processed every cycle. Only the weight update requires multipliers, so their total number is $z_1 + z_2 = 160$.

Our FPGA device has 240 DSP blocks. Accordingly, we implemented the 224 FF and BP multipliers using 1 DSP for each, while the other 160 UP multipliers and all adders were implemented using logic.
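These multiplier counts follow from the degrees of parallelism in Table I; replaying the arithmetic:

```python
z1, z2 = 128, 32                # degrees of parallelism (Table I)
ff_mults = z1 + z2              # FF: 1 multiplier per weight processed
bp_mults = 2 * z2               # BP: junction 2 only, 2 mults per weight
up_mults = z1 + z2              # UP: weight updates only (eq. (6a))
dsp_used = ff_mults + bp_mults  # FF + BP multipliers mapped to DSPs
```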

III-D4 Memories and Data

All memories were implemented using block RAM (BRAM). The memories for activations and a-dots never need to be read from and written into in the same cycle, so they are single-port. The $\delta$ memories are true dual-port, i.e. both ports support reads and writes. This is required due to the read-modify-write nature of the $\delta$ memories, since they accumulate partial sums. The 'weight+bias' memories are simple dual-port, with 1 port used exclusively for reading the $k$th cell in cycle $k$, and the other for simultaneously writing updated values back to a different cell. The weight+bias memories were initialized using Glorot normal values, while all other memories were initialized with 0s.

The ground truth one-hot encodings for all inputs were stored in a single-port BRAM with word size sufficient to represent the 10 MNIST outputs. After reading, each word was padded with 0s to make it 32 bits long. On the other hand, the input data was too big to store on-chip: since the native MNIST images are 28x28 pixels, the total input data size far exceeds the total device BRAM capacity. So the input data was fed from a PC over a UART interface.

III-D5 Network Configuration

Here we explain the choice of network configuration in Table I. We initially picked $N_2 = 16$, which is the minimum power of 2 above 10 (the number of MNIST classes). Since later junctions need to be denser than earlier ones to optimize performance [15], we experimented with junction 2 density and show its effects on network performance in Fig. 7. We concluded that 50% density is optimum for junction 2. Note that the individual $z_i$ values should be adjusted to give the same block cycle length for all junctions. This ensures an always full pipeline and no stalls, which achieves the ideal throughput of 1 input per block cycle. This, along with the constraint that $z_i$ be a multiple of $d^{in}_i$, led to $z_1 = 256$, which was beyond the capacity of our FPGA. So we increased $N_2$ to 32 and set $z_2$ to the minimum value of 32, leading to $z_1 = 128$. We experimented with larger configurations, but the resulting accuracy was within 1 percentage point of our final choice.
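The constraint chain above can be replayed numerically. This is a sketch under two assumptions stated in the text: $z_2$ is taken at its minimum value $d^{in}_2$, and block cycles are forced equal across junctions:

```python
def derive_parallelism(n0, n1, n2, d_out1, density2):
    """Given layer sizes, junction 1 fan-out and junction 2 density,
    derive (z1, z2, block_cycle) under equal-block-cycle pipelining."""
    w1 = n0 * d_out1                # weights in junction 1
    w2 = round(n1 * n2 * density2)  # weights in junction 2
    z2 = w2 // n2                   # minimum z2 = d_in2
    block_cycle = w2 // z2          # cycles per junction
    z1 = w1 // block_cycle          # junction 1 must match the block cycle
    return z1, z2, block_cycle

# N2 = 16 would force z1 = 256; N2 = 32 gives the implemented z1 = 128
```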

Fig. 7: Performance for different junction 2 densities, keeping junction 1 density fixed at 6.25%.

III-D6 Timing and Results

A block cycle in our design is $3 \times 32 = 96$ clock cycles, since each set of $z_i$ weights in a junction needs a total of 3 clock cycles per operation. The first and third are used to compute memory addresses, while the second performs the arithmetic computations and determines our clock frequency, which is 15 MHz.
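The resulting ideal timing (ignoring the extra memory-access cycles) works out as follows:

```python
clk_hz = 15e6                   # clock frequency
sets_per_junction = 32          # W_i / z_i from Table I
clocks_per_set = 3              # address, compute, address
block_cycle_clocks = sets_per_junction * clocks_per_set
block_cycle_s = block_cycle_clocks / clk_hz     # time per input
inputs_per_sec = clk_hz / block_cycle_clocks    # ideal throughput
```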

We stored the results of several training inputs and fed them out to 10 LEDs on the board, each representing an output from 0-9. The FPGA implementation performed according to RTL simulations and within a few percentage points of the ideal floating point software simulations, giving 96.5% accuracy in 14 epochs of training.

III-E Effects of Degree of Parallelism ($z_i$)

Fig. 8: Dependency of various design and performance parameters on the total degree of parallelism $\sum_i z_i$, keeping the network architecture and sparsity level fixed.

A key highlight of our architecture is the total degree of parallelism $\sum_i z_i$, which can be reconfigured to trade off training time against hardware resources while keeping the network architecture the same. This is shown in Fig. 8. The present work uses a total $\sum_i z_i = 160$, which leads to a block cycle time of 6.4 μs at 15 MHz, but economically uses arithmetic resources and has a small number of deep memories, making it ideal for a fully BRAM implementation. Given more powerful FPGAs, the same architecture can be reconfigured to achieve a higher GOPS count and process inputs in a correspondingly shorter block cycle, albeit at the cost of more FPGA resources and a greater number of shallower memories. Moreover, this reconfigurability also allows a complete change in network structure and hyperparameters to process a new dataset on the same device if desired.

IV Conclusion

This paper demonstrates an FPGA implementation of both training and inference of a neural network pre-defined to be sparse. The architecture is optimized for FPGA implementation and uses parallel and pipelined processing to increase throughput. The major highlight is the set of degrees of parallelism $z_i$, which can be quickly reconfigured to re-allocate FPGA resources, thereby adapting any problem to any device. While the present work uses a modest FPGA board as proof-of-concept, this reconfigurability is allowing us to explore various types of networks on bigger boards as future work. Our RTL is fully parametrized and the code is available on request.


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
  • [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015, pp. 1–9.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
  • [4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS.   ACM, 2014, pp. 269–284.
  • [5] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proc. ICML.   JMLR.org, 2015, pp. 2285–2294.
  • [6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in Proc. ISCA, 2016, pp. 243–254.
  • [7] X. Zhou, S. Li, K. Qin, K. Li, F. Tang, S. Hu, S. Liu, and Z. Lin, “Deep adaptive network: An efficient deep neural network with sparse binary connections,” in arXiv:1604.06154, 2016.
  • [8] Y. Ma, N. Suda, Y. Cao, S. Vrudhula, and J. Seo, “ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler,” Integration, the VLSI Journal, 2018.
  • [9] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp. on FPGAs, 2017, pp. 75–84.
  • [10] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, “C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs,” in Proc. ACM/SIGDA Int. Symp. on FPGAs, 2018, pp. 11–20.
  • [11] A. Shehabi, S. Smith, N. Horner, I. Azevedo, R. Brown, J. Koomey, E. Masanet, D. Sartor, M. Herrlin, and W. Lintner, “United States data center energy usage report,” Lawrence Berkeley National Laboratory, Tech. Rep. LBNL-1005775, 2016.
  • [12] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, “Predicting parameters in deep learning,” in Proc. NIPS, 2013, pp. 2148–2156.
  • [13] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in arXiv:1611.03530, 2016.
  • [14] S. Dey, Y. Shao, K. M. Chugg, and P. A. Beerel, “Accelerating training of deep neural networks via sparse edge processing,” in Proc. ICANN.   Springer, 2017, pp. 273–280.
  • [15] S. Dey, K.-W. Huang, P. A. Beerel, and K. M. Chugg, “Characterizing sparse connectivity patterns in neural networks,” in Proc. ITA, 2018.
  • [16] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, “Net-trim: Convex pruning of deep neural networks with performance guarantee,” in Advances in Neural Information Processing Systems 30, 2017.
  • [17] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” in Proc. ICLR, 2017.
  • [18] S. Dey, P. A. Beerel, and K. M. Chugg, “Interleaver design for deep neural networks,” in Proceedings of the 51st Asilomar Conference on Signals, Systems, and Computers, Oct 2017, pp. 1979–1983.
  • [19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” JMLR, vol. 15, pp. 1929–1958, 2014.
  • [20] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in Proc. ICLR, 2016.
  • [21] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep convolutional networks using vector quantization,” in arXiv:1412.6115, 2014.
  • [22] K. Kara, D. Alistarh, G. Alonso, O. Mutlu, and C. Zhang, “FPGA-accelerated dense linear machine learning: A precision-convergence trade-off,” in Proc. FCCM, 2017.
  • [23] S. Gupta, A. Agarwal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in arXiv:1502.02551, 2015.