ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning

03/25/2020 ∙ by Vinay Joshi, et al. ∙ IBM ∙ King's College London

Deep neural networks (DNNs) have surpassed human-level accuracy in a variety of cognitive tasks, but at the cost of significant memory and time requirements for DNN training. This limits their deployment in energy- and memory-limited applications that require real-time learning. Matrix-vector multiplications (MVM) and vector-vector outer products (VVOP) are the two most expensive operations associated with the training of DNNs. Strategies to improve the efficiency of MVM computation in hardware have been demonstrated with minimal impact on training accuracy. However, the VVOP computation remains a relatively less explored bottleneck even with the aforementioned strategies. Stochastic computing (SC) has been proposed to improve the efficiency of VVOP computation, but only on relatively shallow networks with bounded activation functions and floating-point (FP) scaling of activation gradients. In this paper, we propose ESSOP, an efficient and scalable stochastic outer product architecture based on the SC paradigm. We introduce efficient techniques to generalize SC for weight update computation in DNNs with unbounded activation functions (e.g., ReLU), required by many state-of-the-art networks. Our architecture reduces the computational cost by re-using random numbers and replacing certain FP multiplication operations with bit shift scaling. We show that the ResNet-32 network, with 33 convolution layers and a fully-connected layer, can be trained with ESSOP on the CIFAR-10 dataset to achieve baseline-comparable accuracy. Hardware design of ESSOP at the 14 nm technology node shows that, compared to a highly pipelined FP16 multiplier design, ESSOP is 82.2% more energy efficient and also substantially more area efficient for outer product computation.




I Introduction

Research on developing accelerators for training deep neural networks (DNNs) has attracted significant interest. Several potential applications, such as autonomous navigation, health care, and mobile devices, require learning in the field while adhering to strict memory and energy budgets. DNN training demands significant time, compute, and memory. The two most expensive computations in DNN training are the matrix-vector multiplication (MVM) and the vector-vector outer product (VVOP); both require one multiplication per weight, i.e., N × N multiplications for a layer with an N × N weight matrix. Several strategies to improve the efficiency of MVM computation have been proposed with minimal impact on training accuracy. These strategies leverage either low-precision digital representations [1, 2] or crossbar architectures [3, 4, 5, 6]. Less precise implementations of MVM have been shown to perform sufficiently well for DNN training [1, 2, 7, 8, 9, 10].

For improving the efficiency of the VVOP computation that calculates the weight updates, the algorithmic ideas proposed so far require expensive multiplier circuits [11, 12, 13]. Stochastic computing (SC) has been suggested as an efficient alternative to floating point (FP) multiplications, provided that the operands are real numbers in [0, 1]. This poses challenges for DNN training, as operations such as ReLU and batch normalization have unbounded outputs. Moreover, small error gradients are often quantized to 0 due to the limited precision available in the range [0, 1].

The main contributions of this paper are the following: (1) We propose an SC-based efficient architecture, ESSOP, for computing weight updates in DNN training. (2) We introduce efficient schemes to generalize the SC-based multiplier to unbounded activation functions (e.g., ReLU) that are essential for DNN training [14, 15, 16]. (3) We show that these improvements have minimal effect on the training accuracy of a deep convolutional neural network (CNN). (4) Post place-and-route results at 14 nm CMOS show that the ESSOP design is 82.2% better in energy efficiency, and also better in area efficiency, compared to a highly pipelined FP16 multiplier design for outer product computation.

II Background and Motivation

II-A Neural Network Training

DNN training proceeds in three phases, namely (1) forward propagation, (2) backpropagation, and (3) weight update. As shown in Fig. 1, the MVM operation is essential in forward and backpropagation, while during the weight update phase the VVOP is computed between the error gradient of that layer and the output activations of the previous layer to calculate the weight update matrix (see Eq. (1)). Note that this calculation, in general, applies to both fully-connected and convolution layers [17].

Fig. 1: Illustration of the two bottleneck operations in DNN training. Each of the illustrated operations requires one multiplication per weight-matrix element. Unlike MVM, VVOP is a relatively less explored problem that eventually becomes the bottleneck even when MVM is efficiently implemented in hardware. The ESSOP architecture addresses this problem to enable efficient hardware implementation of DNN training.

II-B Stochastic Computing (SC)

SC is a method of computing arithmetic operations using the stochastic representation of real numbers constrained to the interval [0, 1], instead of using real-valued operands [18, 19, 20]. For notational convenience, we denote scalars by lowercase letters (e.g., x) and vectors by uppercase letters (e.g., X). To compute the stochastic representation s(r) of a real number r, where 0 ≤ r ≤ 1, a Bernoulli sequence with M binary bits is computed such that the probability of any one of these bits being 1 is equal to r, i.e.,

P(s_k(r) = 1) = r for k = 1, …, M.     (2)

Using this representation, the product of two real numbers x and δ, where x, δ ∈ [0, 1], can be computed using the bit-wise AND operation on the Bernoulli sequences as,

x · δ ≈ (1/M) Σ_{k=1..M} [ s_k(x) AND s_k(δ) ].     (3)

Equation (3) thus replaces the expensive floating point multiplications with bitwise AND operations and subsequent summation operations. However, the range of numbers being multiplied in DNNs is usually not confined to [0, 1]. Equations (2) and (3) can be generalized to numbers of arbitrary range; we illustrate this assuming the activation vector X and the error gradient vector Δ both have N elements and the weight update is determined using the VVOP computation given by equation (1). Denoting the absolute maxima of X and Δ by x_m = max_i |x_i| and δ_m = max_i |δ_i|, we first normalize both vectors to constrain their magnitudes to [0, 1] as,

x̂_i = x_i / x_m,   δ̂_i = δ_i / δ_m.     (4)
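As an illustration, the AND-based product of equation (3) can be sketched in a few lines of Python (a behavioral model with hypothetical helper names, not the hardware implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sc_multiply(x, d, M=16):
    """Approximate x * d for x, d in [0, 1] with M-bit Bernoulli sequences.

    Each sequence bit is 1 with probability equal to the operand's value,
    so the bitwise AND of the two sequences is 1 with probability x * d.
    """
    s_x = rng.random(M) < x           # Bernoulli sequence s(x)
    s_d = rng.random(M) < d           # Bernoulli sequence s(d)
    return np.sum(s_x & s_d) / M      # popcount of the AND, scaled by 1/M

# A single M=16 estimate is noisy; averaging many estimates converges
# to the true product 0.6 * 0.5 = 0.3.
est = float(np.mean([sc_multiply(0.6, 0.5, M=16) for _ in range(2000)]))
```

Longer sequences trade throughput for lower estimator variance, which is exactly the sequence-length trade-off explored in Section V.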


Next, we denote the stochastic representation of all elements in a vector R as s(R). For computing the Bernoulli sequences of X̂ and Δ̂ in hardware, we can implement a random number generator (RNG) to sample from the uniform distribution over [0, 1] and compare the normalized real number with the sampled random number:

s_k(r̂_i) = 1 if |r̂_i| > u_k^r, and 0 otherwise.     (5)

In (5), s_k(r̂_i) is the k-th Bernoulli event of the i-th element of R̂, obtained by comparing the i-th element of R̂ with the k-th sample u_k^r from the corresponding random number generator RNG_r, where r is x or δ. We can approximate the product in equation (1) using SC as,

Δw_ij ≈ α · sign(x_j) sign(δ_i) · (1/M) Σ_{k=1..M} [ s_k(x̂_j) AND s_k(δ̂_i) ],     (6)

where the parameter α is defined as α = x_m · δ_m.

From equations (4)-(6), it is clear that an SC-based multiplier implementation for the VVOP calculation presents the following challenges: (i) determination of the maximum elements x_m and δ_m of the vectors X and Δ in equation (4) requires 2(N − 1) floating point comparisons; (ii) the normalization in equation (4) requires 2N floating point division operations; (iii) the computation in equation (5) requires 2NM random number generations; and (iv) the scaling by α in equation (6) requires N² FP multiplication operations. We now discuss several techniques to address these challenges.

III Optimization of the SC-based multiplier for the design of the ESSOP architecture

III-A Eliminating the normalization operations

As discussed above, the normalization operation introduces floating point divisions. This can be addressed by the following improvement. Consider a number z that lies in the range [0, 1] and another real number y that is obtained by using a constant positive scaling factor c as y = c · z. A Bernoulli representation s_k(z) of z is obtained as s_k(z) = 1 if z > u_k, and the corresponding representation of y is obtained without dividing by c as,

s_k(y) = 1 if y > c · u_k, and 0 otherwise.     (7)

In equation (7), the RNG can be realized by using a linear feedback shift register (LFSR) circuit to generate q-bit pseudo-random numbers. Notably, the hardware realization of equation (7) does not require a multiplication with c when c is (or is rounded to) a power of two, and can be realized by sampling a few bits from the LFSR. For example, in a floating point representation, the power of two in c can be used as an exponent and u_k as the mantissa to compute c · u_k without any floating point multiplications. Alternatively, in a fixed point representation, only a fraction of the bits generated from the LFSR need to be used to eliminate the fixed point multiplications in equation (7). For instance, if c · u_k requires only p bits in the fixed point representation, p bits could be sampled from the q-bit LFSR to compute c · u_k. This eliminates the need for expensive divisions irrespective of the numerical representation of y.
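A minimal sketch of this trick, assuming the scaling factor is a power of two (function names are illustrative): instead of dividing y by c = 2**b before the comparison, the uniform sample itself is rescaled with an exponent adjustment.

```python
import math
import random

random.seed(1)

def bernoulli_bit_scaled(y, b):
    """One Bernoulli bit for y = c * z with c = 2**b, without dividing y.

    Rather than normalizing (y / 2**b > u), we scale the uniform sample:
    y > u * 2**b. In floating point, multiplying u by 2**b is only an
    exponent adjustment (ldexp), not a genuine multiplication.
    """
    u = random.random()              # stands in for a q-bit LFSR sample
    return y > math.ldexp(u, b)      # u * 2**b via exponent shift

# For y = 3.0 and c = 2**2 = 4, the bit is 1 with probability z = 3/4.
bits = [bernoulli_bit_scaled(3.0, 2) for _ in range(20000)]
p = sum(bits) / len(bits)
```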

III-B Reusing the generated random numbers

The next hurdle for an SC-based multiplier is the requirement to create 2NM uniformly distributed q-bit random numbers. In order to efficiently utilize the generated random numbers, we propose to reuse the random numbers for the computations in equation (5) N times, by generating only 2M random numbers from the LFSR instead of 2NM random numbers for the 2N elements in X and Δ combined. The first M random numbers are used to generate the Bernoulli sequences of all the elements in the vector X, and the remaining M random numbers are used to generate the Bernoulli sequences of all the elements in the vector Δ. For example, to generate 8-bit long Bernoulli sequences corresponding to 256-element long vectors X and Δ, only 16 random numbers are generated. The first 8 random numbers will be used to generate the Bernoulli sequences of all elements in the vector X and the remaining 8 for those of the vector Δ. With this modification, the random number generation complexity is reduced to O(M), making it independent of the dimensions of the weight matrix. As our detailed network simulations indicate, unintended correlations are not introduced by reusing random numbers, as the two Bernoulli sequences in equation (6) remain uncorrelated with each other.
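Under these assumptions, the reuse scheme can be sketched as follows (a behavioral NumPy model; variable and function names are illustrative). Only 2M uniform samples are drawn per outer product, one set shared by all elements of X and another by all elements of Δ:

```python
import numpy as np

rng = np.random.default_rng(42)

def sc_outer_product(x, d, M=16):
    """Approximate np.outer(d, x) for vectors in [0, 1] using 2*M randoms.

    The same M uniform samples are reused for every element of x, and a
    second independent set for every element of d, so the random-number
    cost is O(M), independent of the weight-matrix dimensions.
    """
    u_x = rng.random(M)                   # M randoms shared by all of x
    u_d = rng.random(M)                   # M randoms shared by all of d
    s_x = x[:, None] > u_x[None, :]       # (len(x), M) Bernoulli sequences
    s_d = d[:, None] > u_d[None, :]       # (len(d), M) Bernoulli sequences
    # Bitwise AND plus popcount for every (i, j) pair at once.
    return (s_d.astype(np.int32) @ s_x.astype(np.int32).T) / M

x = rng.random(4)
d = rng.random(3)
approx = np.mean([sc_outer_product(x, d, M=64) for _ in range(500)], axis=0)
exact = np.outer(d, x)   # the estimate converges to this despite the reuse
```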

III-C Approximating the scaling operations

Scaling the result of the AND operation in equation (6) with α is an operation involving full-precision multiplications. To efficiently realize this scaling operation in hardware, we propose to use the closest power of two of α, obtained as shown below,

α̂ = 2^⌊log₂ α⌋,     (8)

where ⌊·⌋ denotes the largest integer smaller than or equal to its argument. Using α̂ instead of α makes the computations in equation (6) straightforward, as only bit shift operations are required. Usually, α in DNNs will be smaller than 1 and hence such a bit-shift operation will mostly be a right shift operation. In the case of the stochastic gradient descent optimizer, note that the learning rate can also be accommodated in the scaling factor α̂.
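This scaling step can be sketched as follows (illustrative names; in hardware the ldexp call corresponds to a plain bit shift):

```python
import math

def pow2_scale(alpha):
    """Closest power of two not exceeding alpha: 2**floor(log2(alpha))."""
    return 2.0 ** math.floor(math.log2(alpha))

def scale_by_shift(count, alpha):
    """Scale a popcount by alpha using only an exponent adjustment.

    Replacing alpha by pow2_scale(alpha) turns the full-precision
    multiplication into a shift; for alpha < 1 (the common case for
    DNN gradients) the shift is to the right.
    """
    shift = math.floor(math.log2(alpha))   # negative for alpha < 1
    return math.ldexp(count, shift)        # count * 2**shift

# alpha = 0.3 is approximated by 2**-2 = 0.25, so a count of 12 scales
# to 3.0 (versus 3.6 with the exact full-precision multiplication).
approx = scale_by_shift(12, 0.3)
```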


IV The ESSOP architecture

Fig. 2: The high-level design of a generalized ESSOP multiplier (unit cell) that can operate on two floating point numbers. (a) Internal blocks and the bit-lengths are shown for a unit cell implementation with FP16 inputs (x and δ) and a 16-bit sequence length. Superscripts i, j denote the index of a real number in a vector and k denotes the index of a Bernoulli sample in a stochastic representation. (b) Example implementation of the shift logic for a floating point representation.
Fig. 3: The ESSOP architecture for computing N multiplications in parallel with the SC-based multipliers, assuming X and Δ have floating point 16-bit precision. R, C and U stand for random number generator, comparator and unit cell respectively. This architecture can be reused to compute one full outer product. This also represents the high-level architecture used for our silicon design.

IV-A Unit cell design

We leverage the innovations from Section III to develop the architecture of a single SC-based multiplier that we refer to as an ESSOP unit cell, shown in Fig. 2(a). At the periphery of the unit cell, M-bit stochastic sequences s(x) and s(δ) are computed for the two input real numbers x and δ respectively. The unit cell receives two inputs, each comprising the bits of the stochastic representation of a real number along with the sign bit of that real number. The sign of the final product is computed using a 1-bit XOR on the sign bits of the two real numbers. In our design, we assume a simple 2-input AND gate that is used M times to compute the M-bit representation of the SC-based multiplication. In each of the M cycles, the output of the AND gate is fed to a counter that counts the number of 1s in the resulting sequence. In the periphery of the unit cell, ⌊log₂ α⌋ is computed, which is used by the shift logic circuit to scale the output of the counter to the desired range. The shift logic depends on the digital representation used for the input real numbers. For example, for a floating point representation as shown in Fig. 2(b), the shift logic is as simple as copying the output of the XOR to the sign bit position, the counter output to the mantissa position, and the ⌊log₂ α⌋ value to the exponent position. Finally, the result of the shift logic circuit is stored in a latch (or in a desired memory location) for the weight update. Note that the SC-based multiplier requires only M clock cycles to compute one product.
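The cycle behavior of the unit cell can be modeled with a short sketch (a behavioral model under the assumptions above; in the silicon design the counter output feeds the mantissa position directly rather than being divided in floating point):

```python
def unit_cell(seq_x, seq_d, sign_x, sign_d, exp_alpha):
    """Behavioral model of one ESSOP unit cell.

    Per clock cycle, one stochastic bit of each operand is ANDed and a
    counter accumulates the 1s; after M cycles the shift logic combines
    the XOR-ed sign, the counter value, and the exponent of alpha.
    """
    sign = sign_x ^ sign_d                 # 1-bit XOR of the sign bits
    count = 0
    for bx, bd in zip(seq_x, seq_d):       # one 2-input AND per cycle
        count += bx & bd
    M = len(seq_x)
    magnitude = (count / M) * 2.0 ** exp_alpha   # shift-logic scaling
    return -magnitude if sign else magnitude

# 12 coincident 1s over a 16-cycle sequence, alpha = 2**-1, signs -/+:
p = unit_cell([1] * 12 + [0] * 4, [1] * 16, 1, 0, -1)   # -> -0.375
```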

IV-B The multi-cell architecture of ESSOP

To compute all the elements of an outer product matrix, either a unit cell can be multiplexed or multiple unit cells can be used simultaneously. The proposed ESSOP architecture has multiple unit cells stacked in a single row. The high-level architecture of ESSOP is shown in Fig. 3, and has N unit cells (U) arranged in a row. At the periphery of the unit cells, all inputs have their corresponding comparator (C). Each comparator receives two inputs: one from either an element of the activation vector (X) or of the error gradient vector (Δ), and the second from a corresponding random number generator (R). The comparator generates one stochastic bit by comparing the two inputs at a time. In the M-bit implementation of ESSOP, several such random numbers are fed to the comparator circuit to compute M comparisons in total, resulting in an M-bit long stochastic sequence. As discussed in Section III-B, there are exactly two RNGs, one for X and another for Δ, that generate M random numbers each for every outer product.

V Numerical validation

Fig. 4: The accuracy of the ResNet-32 network as a function of the ESSOP sequence length, with the VVOP inputs represented using 16-, 8-, and 2-bit long Bernoulli sequences. A 16-bit sequence is enough to achieve baseline-comparable accuracy, and the 2-bit representation suffers only a 2.6% drop on average compared to the baseline. Each box plot shows five independent runs with different seeds.

We train the ResNet-32 [14] network on the CIFAR-10 [21] dataset to validate the ESSOP architecture. We use ESSOP to compute the weight updates in all the 33 convolution layers and in the final fully-connected layer of the ResNet-32 network. The CIFAR-10 dataset has 50K images in the training set and 10K images in the test set. Each image in the dataset has a resolution of 32 × 32 pixels and three channels, and belongs to one of ten classes. We preprocess the CIFAR-10 images by implementing the commonly used image processing steps for the family of residual networks as reported in [22]. The simulation was performed with FP16 precision for data and computation. We used the initial learning rate (LR) with LR evolution (LRE) for the baseline as in [14]; the LRE is tuned for better accuracy in the ESSOP implementation. The categorical cross-entropy loss function is minimized using stochastic gradient descent with momentum. In our results, we denote ESSOP16(M) to indicate FP16 precision for the input operands (X and Δ) represented using M-bit Bernoulli sequences. Fig. 4 shows the accuracy of ResNet-32 as a function of the ESSOP16 sequence length. The test accuracy drop with ESSOP16(16) is as low as 0.25% compared to the baseline. Experiments on different sequence lengths indicate that 16 bits are sufficient to achieve close to baseline accuracy with the FP16 outer product: ESSOP16(16) shows on average a 0.73% drop in test accuracy compared to the baseline. ESSOP16(8) shows a modest average drop in test accuracy compared to the baseline. Remarkably, ESSOP16(2) has on average only a 2.6% drop in test accuracy compared to the baseline. It is important to note that with a sequence length of 2 bits, it is possible to compute the weight update in just two clock cycles.

VI Post-layout results

Fig. 5: Physical and performance characteristics of the ESSOP16(16) architecture vs. a floating point 16-bit (FP16) multiplier array.

In this section, we present the details of the hardware implementation of ESSOP16 and compare its post-layout performance with that of a highly pipelined FP16 multiplier design after place and route. The FP16 design is an array of FP16 multipliers (from Samsung's Low Power Plus (LPP) library) that compute FP16 multiplications in parallel. Similarly, the ESSOP16 design has unit cells (U) that compute the elements of an outer product matrix in parallel, as illustrated in Fig. 3. The inputs X and Δ, which are in FP16 precision, are fed to the corresponding comparator circuits (C). The two RNGs corresponding to X and Δ generate M random numbers each. Each RNG generates only the mantissa part of the random number; the exponent is derived from the absolute maximum number in the vector X or Δ, and the sign bit is derived from the other input of the corresponding comparator. Configuration parameters such as the Bernoulli sequence length are stored in the configuration register G. Post place-and-route results at 14 nm using Samsung LPP libraries are shown in Fig. 5. These results indicate that the ESSOP16(16) design, even though sequential and not pipelined, operates at a higher frequency and achieves 82.2% better energy efficiency, along with better area efficiency, compared to the FP16 multiplier array for outer product computation.

VII Conclusion

We proposed an efficient hardware architecture, ESSOP, that facilitates the training of deep neural networks. The central idea is to efficiently implement the vector-vector outer product calculation associated with the weight updates using stochastic computing. We proposed efficient schemes to implement the stochastic computing-based multipliers that generalize to operands with unbounded magnitude and significantly reduce the computational cost by re-using random numbers. This addresses a significant performance bottleneck for training DNNs in hardware, particularly for applications with stringent constraints on area and energy. ESSOP complements architectures that accelerate the matrix-vector multiply operations associated with forward and backpropagation, where weights are represented in low precision or stored in computational-memory-based crossbar array architectures. We evaluated ESSOP on a 32-layer deep CNN that achieves baseline-comparable accuracy for a sequence length of 16 bits. A 14 nm place-and-route of the ESSOP architecture compared with the FP16 design shows an 82.2% improvement in energy efficiency, along with improved area efficiency, for outer product computation.


This project was supported partially by the Semiconductor Research Corporation.