Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
To a large extent, the success of deep learning techniques is contingent upon the underlying hardware platform’s ability to perform fast, supervised training of complex networks using large quantities of labeled data. Such a capability enables rapid evaluation of different network architectures and a thorough search over the space of model hyperparameters. It should therefore come as no surprise that recent years have seen a resurgence of interest in deploying large-scale computing infrastructure designed specifically for training deep neural networks. Some notable efforts in this direction include distributed computing infrastructure using thousands of CPU cores (Dean et al., 2012; Chilimbi et al., 2014), or high-end graphics processors (GPUs) (Krizhevsky & Hinton, 2009), or a combination of CPUs and GPUs scaled up to multiple nodes (Coates et al., 2013; Wu et al., 2015).
At the same time, the natural error resiliency of neural network architectures and learning algorithms is well documented, setting them apart from more traditional workloads that typically require precise computations and number representations with high dynamic range. It is well appreciated that in the presence of statistical approximation and estimation errors, high-precision computation in the context of learning is rather unnecessary (Bottou & Bousquet, 2007). Moreover, the addition of noise during training has been shown to improve the neural network’s performance (Murray & Edwards, 1994; Bishop, 1995; Audhkhasi et al., 2013). With the exception of employing the asynchronous version of the stochastic gradient descent algorithm (Recht et al., 2011) to reduce network traffic, the state-of-the-art large-scale deep learning systems fail to adequately capitalize on the error resiliency of their workloads. These systems are built by assembling general-purpose computing hardware designed to cater to the needs of more traditional workloads, incurring high and often unnecessary overhead in the required computational resources.
The work presented in this paper owes its inception to the thinking that it may be possible to leverage algorithm-level noise tolerance to relax certain constraints on the underlying hardware, leading to a hardware-software co-optimized system that achieves significant improvement in computational performance and energy efficiency. Allowing the low-level hardware components to perform approximate, possibly non-deterministic computations and exposing these hardware-generated errors up to the algorithm level of the computing stack forms a key ingredient in developing such systems. Additionally, the low-level hardware changes need to be introduced in a manner that preserves the programming model so that the benefits can be readily absorbed at the application level without incurring significant software redevelopment costs.
As a first step towards achieving this cross-layer co-design, we explore the use of low-precision fixed-point arithmetic for deep neural network training with a special focus on the rounding mode adopted while performing operations on fixed-point numbers. The motivation to move to fixed-point arithmetic (from the conventional floating-point computations) is twofold. Firstly, fixed-point compute units are typically faster and consume far less hardware resources and power than floating-point engines. The smaller logic footprint of the fixed-point arithmetic circuits would allow for the instantiation of many more such units for a given area and power budget. Secondly, low-precision data representation reduces the memory footprint, enabling larger models to fit within the given memory capacity. Cumulatively, this could provide dramatically improved data-level parallelism.
The key finding of our exploration is that deep neural networks can be trained using low-precision fixed-point arithmetic, provided that the stochastic rounding scheme is applied while operating on fixed-point numbers. We test the validity of the proposed approach by training deep neural networks for the MNIST and CIFAR10 image classification tasks. Deep networks trained using 16-bit wide fixed-point and stochastic rounding achieve nearly the same performance as that obtained when trained using 32-bit floating-point computations. Furthermore, we present a hardware accelerator design, prototyped on an FPGA, that achieves high throughput and low power using a large number of fixed-point arithmetic units, a dataflow architecture, and compact stochastic rounding modules.
Determining the precision of the data representation and the compute units is a critical design choice in the hardware (analog or digital) implementation of artificial neural networks. Not surprisingly, a rich body of literature exists that aims to quantify the effect of this choice on the network’s performance. However, a disproportionately large majority of these studies are focused primarily on implementing just the feed-forward (inference) stage, assuming that the network is trained offline using high-precision computations. Some recent studies that embrace this approach have relied on the processor’s vector instructions to perform multiple bit operations in parallel (Vanhoucke et al., 2011), or employ reconfigurable hardware (FPGAs) for high-throughput, energy-efficient inference (Farabet et al., 2011; Gokhale et al., 2014), or take the route of custom hardware implementations (Kim et al., 2014; Merolla et al., 2014).
Previous studies have also investigated neural network training using different number representations. Iwata et al. (Iwata et al., 1989) implement the back-propagation algorithm using bit floating-point processing units. Hammerstrom (Hammerstrom, 1990) presents a framework for on-chip learning using to bit fixed-point arithmetic. In (Holt & Hwang, 1993), the authors perform theoretical analysis to understand a neural network’s ability to learn when trained in a limited precision setting. Results from empirical evaluation of simple networks indicate that in most cases,  bits of precision is sufficient for back-propagation learning. In (Höhfeld & Fahlman, 1992), probabilistic rounding of weight updates is used to further reduce ( 8 bits) the precision requirements in gradient-based learning techniques. While these studies provide valuable insights into the behavior of the limited precision training of neural networks, the networks considered are often limited to variants of the classical multi-layer perceptron containing a single hidden layer and only a few hidden units. Extrapolating these results to the state-of-the-art deep neural networks that can easily contain millions of trainable parameters is non-trivial. Consequently, there is a need to reassess the impact of limited precision computations within the context of more contemporary deep neural network architectures, datasets, and training procedures.
A recent work (Chen et al., 2014) presents a hardware accelerator for deep neural network training that employs fixed-point computation units, but finds it necessary to use bit fixed-point representation to achieve convergence while training a convolutional neural network on the MNIST dataset. In contrast, our results show that it is possible to train these networks using only 16-bit fixed-point numbers, so long as stochastic rounding is used during fixed-point computations. To our knowledge, this work represents the first study of application of stochastic rounding while training deep neural networks using low-precision fixed-point arithmetic.
Standard implementations of deep neural network training via the back-propagation algorithm typically use 32-bit floating-point (float) representation of real numbers for data storage and manipulation. Instead, consider the generalized fixed-point number representation $[\mathrm{QI}.\mathrm{QF}]$, where QI and QF correspond to the integer and the fractional part of the number, respectively. The number of integer bits ($IL$) plus the number of fractional bits ($FL$) yields the total number of bits used to represent the number. The sum $IL + FL$ is referred to as the word length $WL$. In this paper, we use the notation $\langle IL, FL \rangle$ to denote a fixed-point representation in which $IL$ ($FL$) corresponds to the length of the integer (fractional) part of the number. We also employ $\epsilon$ to denote the smallest positive number that may be represented in the given fixed-point format. Therefore, the $\langle IL, FL \rangle$ fixed-point format limits the precision to $FL$ bits, sets the range to $[-2^{IL-1},\ 2^{IL-1} - 2^{-FL}]$, and defines $\epsilon$ to be equal to $2^{-FL}$.
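The quantities attached to a fixed-point format can be tabulated with a few lines of Python. This is only a sketch: the class name `FixedPointFormat` and its attribute names are hypothetical, the integer length is assumed to include the sign bit, and the 2 + 14 bit split is merely one possible way of dividing a 16-bit word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedPointFormat:
    """Hypothetical helper describing a signed fixed-point format."""
    il: int  # integer bits, sign bit included (assumption)
    fl: int  # fractional bits

    @property
    def word_length(self) -> int:
        # Total number of bits used to store one number.
        return self.il + self.fl

    @property
    def eps(self) -> float:
        # Smallest positive representable number.
        return 2.0 ** -self.fl

    @property
    def min_value(self) -> float:
        # Lower saturation limit.
        return -(2.0 ** (self.il - 1))

    @property
    def max_value(self) -> float:
        # Upper saturation limit, one grid step below 2^(il-1).
        return 2.0 ** (self.il - 1) - self.eps


fmt = FixedPointFormat(il=2, fl=14)  # one possible 16-bit split
print(fmt.word_length, fmt.eps, fmt.min_value, fmt.max_value)
```

Moving bits from the fractional to the integer part widens the range at the cost of a coarser grid, which is the trade-off the experiments later explore.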
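As will be evident in the sections to follow, the rounding mode adopted while converting a number (presumably represented using the float or a higher-precision fixed-point format) into a lower-precision fixed-point representation turns out to be a matter of important consideration while performing computations on fixed-point numbers. (Footnote 1: we call $\langle IL_1, FL_1 \rangle$ a higher-precision representation than $\langle IL_2, FL_2 \rangle$ iff $FL_1 > FL_2$.) Given a number $x$ and the target fixed-point representation $\langle IL, FL \rangle$, with $IL$ integer bits and $FL$ fractional bits, we define $\lfloor x \rfloor$ as the largest integer multiple of $\epsilon\ (= 2^{-FL})$ less than or equal to $x$ and consider the following rounding schemes:
Round-to-nearest:
$$\mathrm{Round}(x, \langle IL, FL \rangle) = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \frac{\epsilon}{2} \\ \lfloor x \rfloor + \epsilon & \text{if } \lfloor x \rfloor + \frac{\epsilon}{2} < x \le \lfloor x \rfloor + \epsilon \end{cases}$$
Stochastic rounding: The probability of rounding $x$ to $\lfloor x \rfloor$ is proportional to the proximity of $x$ to $\lfloor x \rfloor$:
$$\mathrm{Round}(x, \langle IL, FL \rangle) = \begin{cases} \lfloor x \rfloor & \text{with probability } 1 - \frac{x - \lfloor x \rfloor}{\epsilon} \\ \lfloor x \rfloor + \epsilon & \text{with probability } \frac{x - \lfloor x \rfloor}{\epsilon} \end{cases}$$
Stochastic rounding is an unbiased rounding scheme and possesses the desirable property that the expected rounding error is zero, i.e. $\mathbb{E}\left[\mathrm{Round}(x, \langle IL, FL \rangle)\right] = x$.
Irrespective of the rounding mode used, if $x$ lies outside the range of $\langle IL, FL \rangle$, we saturate the result to either the lower or the upper limit of $\langle IL, FL \rangle$:
$$\mathrm{Convert}(x, \langle IL, FL \rangle) = \begin{cases} -2^{IL-1} & \text{if } x \le -2^{IL-1} \\ 2^{IL-1} - 2^{-FL} & \text{if } x \ge 2^{IL-1} - 2^{-FL} \\ \mathrm{Round}(x, \langle IL, FL \rangle) & \text{otherwise} \end{cases} \qquad (1)$$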
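The two rounding schemes and the saturating conversion can be sketched in NumPy as follows. The function names, the treatment of exact ties in round-to-nearest (rounded down here), and the use of floating-point arrays to stand in for fixed-point values are assumptions made for illustration, not details of the authors' implementation.

```python
import numpy as np


def floor_to_grid(x, eps):
    # Largest integer multiple of eps that is <= x.
    return np.floor(x / eps) * eps


def round_nearest(x, eps):
    lo = floor_to_grid(x, eps)
    # Exact ties go down (an assumed convention).
    return np.where(x - lo > eps / 2, lo + eps, lo)


def round_stochastic(x, eps, rng):
    lo = floor_to_grid(x, eps)
    p_up = (x - lo) / eps  # probability of rounding up
    return np.where(rng.random(np.shape(x)) < p_up, lo + eps, lo)


def convert(x, il, fl, mode="stochastic", rng=None):
    """Round x onto the grid of a format with il integer bits (sign
    included) and fl fractional bits, then saturate to its range."""
    x = np.asarray(x, dtype=np.float64)
    eps = 2.0 ** -fl
    if mode == "nearest":
        r = round_nearest(x, eps)
    else:
        if rng is None:
            rng = np.random.default_rng()
        r = round_stochastic(x, eps, rng)
    return np.clip(r, -(2.0 ** (il - 1)), 2.0 ** (il - 1) - eps)


rng = np.random.default_rng(0)
samples = convert(np.full(100_000, 0.3), il=2, fl=2, rng=rng)
print(convert([0.3, 5.0, -5.0], il=2, fl=2, mode="nearest"))
print(samples.mean())  # close to 0.3, since stochastic rounding is unbiased
```

Clipping after rounding gives the same result as testing the range first, because both saturation limits lie on the grid.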
Consider two $d$-dimensional vectors a and b such that each component is represented in the fixed-point format $\langle IL, FL \rangle$, and define $c_0 = \mathbf{a} \cdot \mathbf{b}$ as the inner product of a and b. $c_0$ is also represented in some fixed-point format $\langle \tilde{IL}, \tilde{FL} \rangle$. We split the computation of $c_0$ into the following two steps:
Compute $z = \sum_{i=1}^{d} a_i b_i$
The product of $a_i$ and $b_i$ produces a fixed-point number in the $\langle 2 \cdot IL, 2 \cdot FL \rangle$ format. $z$ can be thought of as a temporary fixed-point register with enough width (number of bits) to prevent saturation/overflow and avoid any loss of precision while accumulating the sum over all products $a_i b_i$. The requirement on the width of $z$ is $\log_2 d + 2\,WL$ in the worst case, where $WL = IL + FL$ is the word length. Note that the worst case is extremely rare and occurs when all $a_i$ and $b_i$ are saturated to either the lower or the upper limit of $\langle IL, FL \rangle$.
Convert: $c_0 = \mathrm{Convert}(z, \langle \tilde{IL}, \tilde{FL} \rangle)$
This step invokes the Convert function defined previously in eq. 1, resulting in either clipping the value in $z$ to the limits set by $\langle \tilde{IL}, \tilde{FL} \rangle$ or rounding to $\tilde{FL}$ bits of fractional precision using the specified rounding mode.
Adopting this two-step approach has several advantages. Firstly, it closely mimics the behavior of the hardware implementation of vector inner product using the hardware DSP units in FPGAs (Digital Signal Processing units are hardware units in the FPGA fabric that implement fixed-point multiplication and addition). These DSP units accept bit inputs and accumulate the results of the MACC operation in a bit wide register. Secondly, by invoking the rounding mode only after the accumulation of all the sums, we significantly reduce the hardware overhead in implementing the stochastic rounding scheme. Lastly, the adoption of this approach allows us to efficiently simulate fixed-point computations using CPUs/GPUs and vendor-supplied BLAS (Basic Linear Algebra Subprograms) libraries. For instance, matrix multiplication of two fixed-point matrices A and B can be simulated by first converting them into float matrices, calling the hardware-optimized SGEMM routine and applying the Convert function to each element of the resulting float matrix.
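The simulation strategy just described can be sketched with NumPy, whose float32 `@` operator typically dispatches to the SGEMM routine of the installed BLAS. The helper names, the matrix shapes, and the 2 + 14 bit format below are illustrative assumptions rather than the authors' configuration.

```python
import numpy as np


def convert(x, il, fl, rng):
    """Stochastically round x to a grid with fl fractional bits, then
    saturate to the range of il integer bits (assumed helper)."""
    eps = 2.0 ** -fl
    lo = np.floor(x / eps) * eps
    round_up = rng.random(x.shape) < (x - lo) / eps
    return np.clip(lo + round_up * eps,
                   -(2.0 ** (il - 1)), 2.0 ** (il - 1) - eps)


def fixed_point_matmul(a, b, il, fl, rng):
    # Step 1: accumulate all products at float precision (SGEMM).
    z = a.astype(np.float32) @ b.astype(np.float32)
    # Step 2: round and saturate once per output element.
    return convert(z.astype(np.float64), il, fl, rng)


rng = np.random.default_rng(0)
il, fl = 2, 14  # hypothetical 16-bit format
a = convert(rng.normal(scale=0.2, size=(4, 8)), il, fl, rng)
b = convert(rng.normal(scale=0.2, size=(8, 3)), il, fl, rng)
c = fixed_point_matmul(a, b, il, fl, rng)
print(c)
```

Rounding is applied only once per output element, after the full accumulation, which mirrors the two-step split above.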
In this section, we present the results of our investigation into the effect of employing limited precision data representation during the training of deep neural networks. We consider both fully connected deep neural networks (DNN) as well as convolutional neural networks (CNN) and present results for the MNIST (Lecun & Cortes, ) and the CIFAR10 (Krizhevsky & Hinton, 2009) datasets. As a baseline for comparison, we first evaluate the network performance (in terms of the rate of reduction of both the training error and the error on the test set) using the conventional 32-bit floating-point arithmetic. Subsequently, we constrain the neural network parameters (weights, biases), as well as the other intermediate variables generated during the back-propagation algorithm (layer outputs, back-propagated error, weight updates, bias updates) to be represented in the fixed-point format and train the network again starting from random initialization of the parameters. While training using fixed-point, the different model hyperparameters such as weight initialization, regularization parameters, learning rates etc. are kept unchanged from the ones used during the baseline evaluation. The word length for the fixed-point format is set to 16 bits, i.e. the number of bits allocated to represent the integer and the fractional parts add up to 16.
This fairly restrictive choice of number representation has some important implications. From the perspective of neural network training, an aggressive reduction of the precision with which the parameter updates are computed and stored may result in the loss of the gradient information if the updates are significantly smaller than ε, the smallest positive number representable in the given fixed-point format. As a consequence, this may impede the progress of the gradient descent algorithm, or worse, introduce instabilities during the training procedure. Note that in the round-to-nearest scheme, any parameter update smaller in magnitude than ε/2 is always rounded to zero, as opposed to the stochastic rounding scheme, which maintains a non-zero probability of small parameter updates to round to ±ε. Secondly, since the fixed-point format offers only a limited range, outputs of the ReLU activation function may get clipped to the upper limit set by the format. From a hardware perspective, the use of 16 bits for data storage (instead of float) corresponds to a factor of 2 reduction in the amount of memory needed for training a given network. Moreover, the use of the same word length for all network variables carries with it the added advantage of simplifying the hardware implementation.
In the first set of experiments, we construct a fully connected neural network with hidden layers, each containing units with ReLU activation function and train this network to recognize the handwritten digits from the MNIST dataset. This dataset comprises 60,000 training images and 10,000 test images; each image is 28 x 28 pixels containing a digit from 0 to 9. The pixel values are normalized to lie in the range. No other form of data preprocessing or augmentation is performed. The weights in each layer are initialized by sampling random values from while the bias vectors are initialized to 0. The network is trained using mini-batch stochastic gradient descent (SGD) with a mini-batch size of to minimize the cross-entropy objective function. The baseline achieves a test error of .
Next, we retrain the network using fixed-point computations and set the word length to 16 bits. Figure 1 shows the results for the two rounding modes: round-to-nearest and stochastic rounding. In both cases, allocating bits to the fractional part produces no noticeable degradation in either the convergence rate or the classification accuracy. (Using up bits for the fractional part leaves only bits, including the sign bit, for representing the integer portion of the number. This does not seem to adversely affect the network performance.) A reduction in the precision below bits begins to negatively impact the network’s ability to learn when the round-to-nearest scheme is adopted. This is primarily because at reduced fractional precision, most of the parameter updates are rounded down to zero. In contrast, the stochastic rounding preserves the gradient information, at least statistically, and the network is able to learn with as few as bits of precision without any significant loss in performance. Note, however, at a precision lower than bits, even the stochastic rounding scheme is unable to fully prevent the loss of gradient information.
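The loss of small updates under round-to-nearest, and its statistical preservation under stochastic rounding, can be reproduced with a toy accumulation. The choice of 8 fractional bits, the constant update of one quarter of a grid step, and the number of steps are all hypothetical values picked for illustration.

```python
import math
import random

EPS = 2.0 ** -8       # grid step for a hypothetical 8 fractional bits
UPDATE = EPS / 4      # an update smaller than half a grid step
STEPS = 4000
rng = random.Random(0)


def round_nearest(x):
    lo = math.floor(x / EPS) * EPS
    return lo + EPS if x - lo > EPS / 2 else lo


def round_stochastic(x):
    lo = math.floor(x / EPS) * EPS
    # Round up with probability equal to the fractional position.
    return lo + EPS if rng.random() < (x - lo) / EPS else lo


w_nearest = 0.0
w_stochastic = 0.0
for _ in range(STEPS):
    w_nearest = round_nearest(w_nearest + UPDATE)
    w_stochastic = round_stochastic(w_stochastic + UPDATE)

print(w_nearest)                     # the update never registers
print(w_stochastic, STEPS * UPDATE)  # tracks the exact sum on average
```

With round-to-nearest the parameter never leaves its initial value, while the stochastically rounded parameter follows the exact running sum up to random fluctuation.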
Using the MNIST dataset, we also evaluate a CNN with an architecture similar to LeNet-5 (LeCun et al., 1998). It comprises convolutional layers with x filters and ReLU activation function. The first layer has feature maps while the second convolutional layer produces feature maps. Each convolutional layer is followed by a pooling/subsampling layer. The pooling layers implement the max pooling function over non-overlapping pooling windows of size x. The output of the second pooling layer feeds into a fully connected layer consisting of ReLU neurons, which is then connected into a 10-way softmax output layer.
For training this network, we adopt an exponentially decreasing learning rate, scaling it by a factor of 0.95 after every epoch of training. The learning rate for the first epoch is set to 0.1. Momentum is used to speed up SGD convergence. The weight decay parameter is set to for all layers. When trained using float, the network achieves a test error of . As was done previously for DNNs, we retrain the network using fixed-point computations with the word length set to 16 bits. However, in this case, saturating the output of the convolutional layers to a low integer value created some difficulty in jump-starting the training procedure. As a result, we increase the number of bits allocated for the integer part at the expense of reducing the precision and choose the format for representing the layer outputs. Figure 2 compiles the results obtained using the two different rounding modes. Unlike in the case of DNNs, when the round-to-nearest scheme is adopted during fixed-point computations, the training procedure fails to converge. When stochastic rounding is used, we achieve a test error of and for bit and bit precision, respectively, corresponding to only a slight degradation from the float baseline.
To further test the validity of the stochastic rounding approach, we consider another commonly used image classification benchmark: CIFAR10. The training set consists of 50,000 RGB images of size 32 x 32 pixels. The images are divided into 10 classes, each containing 5,000 images. The test set has 10,000 images. We scale the image RGB values to the [0,1] range and do not perform any other form of data preprocessing or augmentation. For this dataset, we construct a CNN with convolutional layers each followed by a subsampling/pooling layer. The convolutional layers consist of x filters and the subsampling layers implement the max pooling function over a window of size x using a stride of . The pooling layer connects to a 10-way softmax output layer. This architecture is similar to the one introduced in (Hinton et al., 2012) with the exception that it does not implement local response normalization or dropout layers.
The network training starts off with a learning rate of , which is reduced by a factor of after , , and epochs. Using 32-bit floating-point numbers for training, this network configuration misclassifies approximately of the images in the test set. This serves as the baseline for comparing the results obtained while training the network using fixed-point computations. Similar to earlier experiments, we set the word length for fixed-point numbers to 16 and test the different rounding modes and fractional precision. The layer outputs are represented in the format. As observed previously and as shown in Figure 3, training using fixed-point with the round-to-nearest scheme begins to collapse after only a few epochs. On the contrary, the stochastic rounding scheme appears to bestow upon the training procedure a significantly higher degree of stability. For bits of fractional precision and the stochastic rounding scheme, the network’s behavior is quite similar to that observed during the baseline evaluation and achieves a test error of .
If the precision is reduced further (to bits) the convergence rate degrades as the learning proceeds and after a point, SGD stops making progress. This is expected since at reduced precision, the parameter updates tend to become sparser (despite stochastic rounding) due to the perilous combination of smaller gradients and diminished learning rates. The network’s performance suffers as a result and the minimum achievable test error saturates at . Fortunately, this damage is reversible as shown in Figure 3. After training for epochs using the format, we relax the constraint on the word length slightly and increase it by bits to bits. This increases the fractional precision to bits ( format) and subsequent training results in a rapid improvement in the network’s performance. After an additional 15 to 20 epochs of training using the higher-precision representation, the test error approaches that obtained using float.
This result reveals a promising (and possibly more robust) strategy for deep neural network training in which the network is first trained using low-precision fixed-point arithmetic and stochastic rounding. At the point where learning shows stagnation, the network can be “fine-tuned” using only a few epochs of higher-precision fixed-point computations. Such a concept of employing mixed-precision computations has been explored previously in the context of floating-point arithmetic (Baboulin et al., 2009), motivated largely by the fact that most modern processors achieve a factor to higher computational throughput for single-precision (32-bit) floating-point as compared with double-precision (64-bit) floating-point. Similar concepts, in conjunction with stochastic rounding, can be extended to perform mixed-precision fixed-point arithmetic. (While preparing this paper, we became aware of a very recent work (Courbariaux et al., 2014) that shares our motivations but adopts an orthogonal approach. The authors propose the use of dynamic fixed-point, a hybrid of the fixed-point and the conventional floating-point arithmetic, for training deep neural networks. However, hardware implications of this approach are not immediately obvious.)
The execution time of the mini-batch stochastic gradient descent algorithm is dominated by a series of GEMM operations in the feed-forward, error back-propagation and weight update calculation steps (convolution may also be rewritten as a GEMM operation). As a result, an improvement in the computational throughput of the GEMM operation translates into an improvement in the training time. GPUs offering a large number of parallel vector processors and high memory bandwidth have therefore been very effective in accelerating these workloads.
In this section we describe an FPGA-based hardware accelerator for matrix-matrix multiplication. Our choice of using FPGAs as the hardware substrate is motivated by two factors. Firstly, FPGAs enable fast hardware development times and significantly lower costs when compared to ASICs (Application Specific Integrated Circuits). Secondly, modern FPGAs have a large number of hard-wired fixed-point DSP units that are well suited to implementing the fixed-point arithmetic described in the earlier sections, and can potentially yield gains in performance and power efficiency. However, limited memory bandwidth must still be carefully managed through various design choices.
Our prototype is implemented on an off-the-shelf FPGA card featuring a Xilinx KintexT FPGA and GB DDR memory, and communicating with the host PC over a PCIe bus. This FPGA has 840 DSP multiply-accumulate units and almost MB of on-chip block RAM. The data bandwidth between the off-chip DDR memory and the FPGA is GB/s. The typical dimensions of the input matrices preclude storing entire matrices in on-chip RAM. Thus, these matrices are stored in the DDR memory and parts of the matrices are brought into the FPGA for performing the computations. The off-chip communication bandwidth limitation necessitates that we reuse the on-chip data to the highest extent possible to make the achievable throughput, measured in giga-operations/second (Gops/s), compute-bound.
Figure 4 presents a block diagram of our fixed-point matrix multiplier. The DSP units within the FPGA are organized as a massively parallel two-dimensional systolic array (SA) (Kung, 1982) of size such that . This forms the core of the multiplier and will be described in greater detail in the next subsection. Most of the block RAM on the FPGA is designated as the L cache where a fraction of the input matrices are stored. The READ logic sends data requests to the DDR memory and organizes the incoming data into the L cache. The WRITE logic sends back computed results to the external memory. The LtoSA circuit moves relevant rows and columns from the L cache to the array. The TOP controller coordinates the entire process. The FPGA also contains Xilinx-supplied IP blocks that interface to the DDR memory.
The operation sequence of the multiplier is as follows. Assume the first input matrix A has dimensions x and the second input matrix B has dimensions x . Initially columns of matrix B and rows of matrix A, where is the largest integer we can choose based on on-chip memory capacity constraints, are brought into the FPGA to compute elements of the result matrix. The next columns of matrix B are then brought in and processed. This continues until all columns of matrix B have been multiplied with the first rows of matrix A. This entire sequence is repeated times to process all rows of matrix A. Double buffering is employed to hide the latency of bringing new subsets of the matrices into the chip. This sequence of operation ensures that elements of matrix A are reused times once brought into the FPGA while those of matrix B are reused times. This reuse allows efficient use of the bandwidth between the FPGA and the DDR memory.
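The loop order of this operation sequence can be sketched in NumPy as follows. The function name, the block sizes, and the load counters are hypothetical; the sketch models only which tiles are fetched and how often, not double buffering or the systolic array itself.

```python
import numpy as np


def blocked_matmul(a, b, row_block, col_block):
    """Multiply a (l x k) by b (k x m) one band of rows of `a` at a time,
    sweeping panels of columns of `b` past each band."""
    l, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must agree"
    c = np.zeros((l, m), dtype=a.dtype)
    loads_a = loads_b = 0  # elements fetched from off-chip memory
    for r in range(0, l, row_block):        # outer loop: bands of rows of a
        a_tile = a[r:r + row_block, :]
        loads_a += a_tile.size              # each band is fetched once
        for s in range(0, m, col_block):    # inner loop: column panels of b
            b_tile = b[:, s:s + col_block]
            loads_b += b_tile.size          # refetched for every band
            c[r:r + row_block, s:s + col_block] = a_tile @ b_tile
    return c, loads_a, loads_b


rng = np.random.default_rng(0)
a = rng.integers(-8, 8, size=(12, 5))
b = rng.integers(-8, 8, size=(5, 8))
c, loads_a, loads_b = blocked_matmul(a, b, row_block=6, col_block=4)
print(loads_a, loads_b)
```

In this sketch every element of the first matrix is fetched exactly once, while the second matrix is streamed in once per band of rows, so a taller band directly reduces off-chip traffic.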
Figure 5 shows the logical organization of the systolic array. Each node of the systolic array (DSP MACC) has a DSP unit that implements two operations (multiply and accumulate) in every clock cycle. Elements of input matrices A and B brought in from the L cache are staged in local block RAM units configured as FIFO (First In First Out) queues. Each FIFO contains elements from either a row of A or a column of B. In each clock cycle, one element is read out from the FIFO. Elements from earlier cycles are cascaded right (for A) or down (for B) and the corresponding partial products are accumulated at the DSP units. After accumulation of all partial products, output data is cascaded out to stochastic rounding units (DSP ROUND) that are also implemented with DSP units. Rounded results are stored in output FIFOs (one per column) before final readout to external memory. Throughput of the array depends on the number of DSPs available and the maximum operating frequency at which the system can be operated without timing errors. This is an example of a wavefront-type systolic array where all connections are local, i.e. only between neighboring DSPs and edge FIFOs, which limits interconnect delays and improves maximum operating frequency.
In a wavefront array, as depicted in Figure 6, at the end of k cycles, where k corresponds to the inner dimension of the matrix multiplication, MACC unit “11” has accumulated all of its partial products. At this point, the accumulated result is transferred to a local register and the DSP is reset. This frees it up to receive data from the next matrix multiplication operation, even before other elements have completed. This achieves high throughput for the systolic array so long as the pipeline is fed with new incoming data. At the end of cycles, the matrix multiplication is complete, and data from the last DSP unit can be read out. Output paths from local registers to the edge of the array are also cascaded.
The word length of the result elements after MACC operations is much larger (typically bits if using series DSPs) than the word length of the inputs (typically bits or less). Before transferring to output FIFOs, result elements must be trimmed through the stochastic rounding of least significant bits (LSBs) and truncation of excess most significant bits (MSBs) after detection of overflow/underflow. Both operations can be efficiently achieved using a single DSP unit per output. At each column, a linear feedback shift register (LFSR) is used to generate a random number whose width is equal to the number of LSBs being rounded off. The DSP unit adds the random number to the incoming result and drops the rounded-off LSBs. Pattern-detect capabilities built into the DSP are used to determine if the excess MSBs are identical (all “0s” or all “1s”). If not, an overflow/underflow condition is detected, and result values are saturated to the max/min 2’s complement values. (A more direct stochastic rounding approach is multi-bit magnitude comparison of the result LSBs vs. a random number, followed by a conditional addition and examining excess MSBs. The approach in this section achieves the same result but removes the first full multi-bit comparison, enabling compact implementation on a single DSP unit.) The result is then transferred to output column FIFOs awaiting write-back to external memory. The overhead of stochastic rounding is thus the logic occupied by DSP ROUND units, which in our case is DSP units, corresponding to less than overhead in hardware resources.
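The add-then-truncate trick described above can be modelled on plain Python integers. The sketch below is an assumption-laden illustration: the 16-bit Fibonacci LFSR with taps 16, 14, 13, 11 is a textbook generator chosen here for convenience, the bit widths are hypothetical, and saturation is done with a range check rather than the pattern detection used in hardware, which yields the same values.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def round_and_saturate(acc, drop_bits, out_bits, rnd):
    """Stochastically round away `drop_bits` LSBs of the signed integer
    `acc`, then saturate to `out_bits`-bit two's complement."""
    mask = (1 << drop_bits) - 1
    # Adding a uniform random value below 2**drop_bits and truncating
    # rounds up with probability (dropped LSBs) / 2**drop_bits.
    rounded = (acc + (rnd & mask)) >> drop_bits  # shift floors, sign-safe
    hi = (1 << (out_bits - 1)) - 1
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, rounded))


# Average over one full LFSR period: accumulator 5 with 3 dropped bits
# represents 5/8, so the mean rounded value should be near 0.625.
state = 0xACE1
total = 0
PERIOD = 65535
for _ in range(PERIOD):
    state = lfsr16_step(state)
    total += round_and_saturate(5, 3, 8, state)
mean_rounded = total / PERIOD
print(mean_rounded)
```

Because the random addend replaces an explicit comparison, the rounding decision falls out of the carry into the retained bits, which is what allows a single adder to do the work.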
For a x systolic array implemented on the KintexKT FPGA, Xilinx’s Vivado synthesis and place-and-route tool estimated a maximum circuit operation frequency of MHz and a power consumption of W. This translates to a throughput of Gops/s at a power efficiency of Gops/s/W. This compares very favorably against the Intel iQM CPU, the NVIDIA GTm and the GTX GPUs, all of which achieve power efficiency in the range of  Gops/s/W (Gokhale et al., 2014). Table 1 presents a summary of the utilization of various resources in the FPGA. Throughput numbers can benefit from migration to newer Xilinx FPGAs, such as the Ultrascale series, that have a much higher number of DSP units and can potentially operate at higher frequencies.
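The relationship between DSP count, array size, and peak throughput can be checked with a short calculation. It rests on two assumptions drawn from the description: a square array of n x n MACC nodes with one rounding unit per column (n² + n DSP units in total), and two operations per MACC node per clock cycle. The function names and the 100 MHz clock below are placeholders for illustration, not the reported operating point.

```python
import math


def array_side_from_dsp_count(dsp_used):
    """Solve n*n + n == dsp_used for a positive integer n, else None."""
    n = (math.isqrt(4 * dsp_used + 1) - 1) // 2
    return n if n * n + n == dsp_used else None


def peak_gops(n, freq_mhz):
    """Peak throughput: 2 operations per MACC node per clock cycle."""
    return 2 * n * n * freq_mhz * 1e6 / 1e9


n = array_side_from_dsp_count(812)  # DSP usage reported in Table 1
rounding_overhead = n / (n * n)     # rounding units relative to MACC units
print(n, round(100 * rounding_overhead, 1))
print(peak_gops(n, freq_mhz=100.0))  # placeholder clock frequency
```

Under these assumptions the 812 DSP units in Table 1 decompose exactly into a square array plus one rounding unit per column, and peak throughput scales linearly with clock frequency.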
Table 1: FPGA resource utilization.
Resource     Usage    Available  Utilization
LUTs         62922    203800     31%
Flip-flops   146510   407600     36%
DSP          812      840        97%
Block RAM    334      445        75%
In this paper, we embrace a top-down approach exploiting the noise tolerance of deep neural networks and their training algorithms to influence the design of low-level compute units. Specifically, the substitution of floating-point units with fixed-point arithmetic circuits comes with significant gains in the energy efficiency and computational throughput, while potentially risking the neural network’s performance. For low-precision fixed-point computations, where conventional rounding schemes fail, adopting stochastic rounding during deep neural network training delivers results nearly identical to those of 32-bit floating-point computations. Additionally, we implement a high-throughput, energy-efficient architecture for matrix multiplication that incorporates stochastic rounding with very little overhead. Extrapolating, we envision the emergence of hardware-software co-designed systems for large-scale machine learning based on relaxed, inexact models of computing running on non-deterministic components all across the stack, right down to low-level hardware circuitry.
Audhkhasi et al. Noise benefits in backpropagation and deep bidirectional pre-training. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pp. 1–8. IEEE, 2013.
Lecun & Cortes. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.