I Introduction
Research on developing accelerators for training deep neural networks (DNNs) has attracted significant interest. Several potential applications such as autonomous navigation, health care, and mobile devices require learning in the field while adhering to strict memory and energy budgets. DNN training demands significant time and compute/memory resources. The two most expensive computations in DNNs are the matrix-vector multiplications (MVM) and the vector-vector outer product (VVOP), both of which require $mn$ multiplications for a layer with a weight matrix of size $m \times n$. Several strategies to improve the efficiency of the MVM computation have been proposed with minimal impact on training accuracy. These strategies leverage either low precision digital representations [1, 2] or crossbar architectures [3, 4, 5, 6]. Less precise implementations of MVM have been shown to perform sufficiently well for DNN training [1, 2, 7, 8, 9, 10].
For improving the efficiency of the VVOP computation used to calculate the weight updates, the algorithmic ideas that have been proposed so far require expensive multiplier circuits [11, 12, 13]. Stochastic computing (SC) has been suggested as an efficient alternative to floating point (FP) multiplications, provided that the operands are real numbers in $[0, 1]$. This poses challenges for DNN training, as operations such as ReLU, batch normalization, etc. have unbounded outputs. Moreover, small error gradients are often quantized to 0 due to the limited precision in the range $[0, 1]$.
The main contributions of this paper are the following: (1) We propose an SC-based efficient architecture, ESSOP, for computing the weight updates in DNN training. (2) We introduce efficient schemes to generalize the SC-based multiplier to unbounded activation functions (e.g., ReLU) that are essential for DNN training [14, 15, 16]. (3) We show that these improvements have minimal effect on the training accuracy of a deep convolutional neural network (CNN). (4) Post place-and-route results show that the ESSOP design achieves better energy and area efficiency compared to a highly pipelined FP16 multiplier design for outer product computation.
II Background and Motivation
II-A Neural Network Training
DNN training proceeds in three phases, namely (1) forward propagation, (2) backpropagation, and (3) weight update. As shown in Fig. 1, the MVM operation is essential in forward and backpropagation, while during the weight update phase the VVOP is computed between the error gradient of that layer and the output activations of the previous layer to calculate the weight update matrix (see Eq. (1)). Note that this calculation applies, in general, to convolution layers as well as fully-connected layers [17]:
$\Delta W = \eta \, (E \otimes X) = \eta \, E X^{\top}, \quad \text{i.e., } \Delta W_{ij} = \eta \, e_i x_j,$  (1)
where $E \in \mathbb{R}^{m}$ is the error gradient of the layer, $X \in \mathbb{R}^{n}$ is the activation vector of the previous layer, and $\eta$ is the learning rate.
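For illustration, a minimal NumPy sketch of the full-precision weight update of Eq. (1); the function name, vector values, and learning rate are ours and purely illustrative.

```python
import numpy as np

def weight_update_fp(err, act, lr=0.1):
    """Reference VVOP weight update of Eq. (1): dW[i, j] = lr * err[i] * act[j]."""
    return lr * np.outer(err, act)

# Example: a layer with a 4x3 weight matrix.
err = np.array([0.02, -0.01, 0.03, 0.005])  # error gradient of the layer (length m = 4)
act = np.array([0.7, 1.3, 0.0])             # activations of the previous layer (length n = 3)
dW = weight_update_fp(err, act)
print(dW.shape)  # (4, 3): one multiplication per weight, i.e. m*n multiplications in total
```

The subsequent sections describe how ESSOP approximates exactly this computation without full-precision multipliers.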
II-B Stochastic Computing (SC)
SC is a method of computing arithmetic operations using the stochastic representation of real numbers constrained to the interval $[0, 1]$, instead of using real-valued operands [18, 19, 20]. For notational convenience, we denote scalars by lowercase letters (e.g., $x$) and vectors by uppercase letters (e.g., $X$). To compute the stochastic representation $S(r)$ of a real number $r$, where $0 \le r \le 1$, a Bernoulli sequence with $L$ binary bits is computed such that the probability of any one of these bits being 1 is equal to $r$, i.e., $P(S(r)_k = 1) = r$ for $k = 1, \dots, L$. Using this representation, the product of two real numbers $r_1 r_2$, where $r_1, r_2 \in [0, 1]$, can be computed using the bitwise AND operation on the Bernoulli sequences as
$P\big(S(r_1)_k \wedge S(r_2)_k = 1\big) = r_1 r_2$  (2)
and
$r_1 r_2 \approx \frac{1}{L} \sum_{k=1}^{L} S(r_1)_k \wedge S(r_2)_k.$  (3)
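As an illustration, a minimal sketch of the SC product of Eqs. (2)-(3); the sequence length and operand values are arbitrary, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_seq(r, L):
    """Stochastic representation S(r): L bits, each 1 with probability r (Eq. 2)."""
    return (rng.random(L) < r).astype(np.uint8)

def sc_multiply(r1, r2, L=1024):
    """Approximate r1 * r2 by ANDing the two Bernoulli sequences and averaging (Eq. 3)."""
    return np.mean(bernoulli_seq(r1, L) & bernoulli_seq(r2, L))

print(sc_multiply(0.6, 0.3))  # approaches 0.18 for sufficiently long sequences
```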
Equation (3) thus replaces the expensive floating point multiplication with bitwise AND operations and a subsequent summation. However, the range of numbers being multiplied in DNNs is usually not confined to $[0, 1]$. Equations (2) and (3) can be generalized to numbers of arbitrary range; we illustrate this assuming the error gradient vector $E$ and the activation vector $X$ have $m$ and $n$ elements respectively, and the weight update is determined using the VVOP computation given by equation (1). Assuming that $X$ lies in the range $[-x_{\max}, x_{\max}]$ and similarly the error gradient $E$ lies in the range $[-e_{\max}, e_{\max}]$, we first normalize the magnitudes of both $X$ and $E$ to constrain their values to $[0, 1]$ as
$\bar{X} = \frac{|X|}{x_{\max}}, \qquad \bar{E} = \frac{|E|}{e_{\max}},$  (4)
where $x_{\max} = \max_j |x_j|$ and $e_{\max} = \max_i |e_i|$; the signs of the elements are handled separately and reapplied at the output (cf. Section IV-A).
Next, we denote the stochastic representation of all the elements in a vector $\bar{V}$ as $S(\bar{V})$. For computing the Bernoulli sequences of $\bar{X}$ and $\bar{E}$ in hardware, we can implement a random number generator (RNG) to sample from the uniform distribution over $[0, 1]$ and compare the normalized real number with the sampled random number:
$S(\bar{v}_i)_k = \begin{cases} 1, & \text{if } \bar{v}_i > u^{v}_{k} \\ 0, & \text{otherwise,} \end{cases}$  (5)
In (5), $S(\bar{v}_i)_k$ is the $k$-th Bernoulli bit of the $i$-th element of $\bar{V}$, obtained by comparing the $i$-th element of $\bar{V}$ with the $k$-th sample $u^{v}_{k}$ from the corresponding random number generator, where $v$ is $x$ or $e$. We can approximate the product in equation (1) using SC as
$\Delta W_{ij} \approx \operatorname{sign}(e_i) \operatorname{sign}(x_j) \, \frac{\gamma}{L} \sum_{k=1}^{L} S(\bar{e}_i)_k \wedge S(\bar{x}_j)_k,$  (6)
where the parameter $\gamma$ is defined as $\gamma = \eta \, e_{\max} \, x_{\max}$.
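A sketch of the SC-based outer product of Eqs. (4)-(6), before the hardware optimizations of Section III; the function name, variable names, and test values are ours, and folding the learning rate into the scaling factor is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sc_outer_update(err, act, lr=0.1, L=64):
    """SC approximation of dW = lr * outer(err, act), following Eqs. (4)-(6)."""
    e_max, x_max = np.max(np.abs(err)), np.max(np.abs(act))
    e_bar, x_bar = np.abs(err) / e_max, np.abs(act) / x_max  # Eq. (4): normalize to [0, 1]
    # Eq. (5): naive version, one fresh uniform sample per element and per bit
    S_e = rng.random((err.size, L)) < e_bar[:, None]
    S_x = rng.random((act.size, L)) < x_bar[:, None]
    # Eq. (6): bitwise AND, popcount, scale by gamma = lr * e_max * x_max, restore signs
    counts = np.einsum('il,jl->ij', S_e.astype(np.int64), S_x.astype(np.int64))
    gamma = lr * e_max * x_max
    return np.outer(np.sign(err), np.sign(act)) * gamma * counts / L

err = np.array([0.02, -0.01, 0.03])
act = np.array([0.7, 1.3, -0.2])
print(sc_outer_update(err, act, L=4096))
print(0.1 * np.outer(err, act))  # full-precision reference for comparison
```

This naive form makes the costs listed next explicit: per-vector maxima, divisions for normalization, per-element random numbers, and a full-precision scaling by gamma.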
From equations (4)-(6), it is clear that an SC-based multiplier implementation for the VVOP calculation presents the following challenges: (i) determination of the maximum elements $x_{\max}$ and $e_{\max}$ of the vectors $X$ and $E$ respectively in equation (4), requiring a floating point comparison for every element of $X$ and $E$; (ii) the normalization in equation (4), which requires $m+n$ floating point division operations; (iii) the computation in equation (5), which requires $(m+n)L$ random number generations; and (iv) the scaling by $\gamma$ in equation (6), which requires $mn$ FP multiplication operations. We now discuss several techniques to address these challenges.
III Optimization of the SC-based multiplier for the design of the ESSOP architecture
III-A Eliminating the normalization operations
As discussed above, the normalization operation introduces floating point divisions. This can be avoided by the following improvement. Consider a number $z$ that lies in the range $[0, 1]$ and another real number $y$ that is obtained by using a constant positive scaling factor $\alpha$ as $y = \alpha z$. A Bernoulli representation $S(z)$ of $z$ is obtained as $S(z)_k = 1$ if $z > u_k$ (with $u_k \sim \mathcal{U}[0, 1]$), and that of $y$ is,
$S(y)_k = \begin{cases} 1, & \text{if } y > \alpha \, u_k \\ 0, & \text{otherwise.} \end{cases}$  (7)
In equation (7), the RNG can be realized by using a linear feedback shift register (LFSR) circuit to generate pseudo-random numbers. Notably, the hardware realization of equation (7) does not require a multiplication with $\alpha$ and can be realized by sampling a few bits from the LFSR. For example, in a floating point representation, the power of two contained in $\alpha$ can be used as the exponent and the LFSR output as the mantissa to compute $\alpha \, u_k$ without any floating point multiplications. Alternatively, in a fixed point representation, only a fraction of the bits generated from the LFSR need to be used to eliminate the fixed point multiplications in equation (7). For instance, if $\alpha \, u_k$ requires only $b$ bits in the fixed point representation, $b$ bits can be sampled from the wider LFSR to compute $\alpha \, u_k$. This eliminates the need for expensive divisions irrespective of the numerical representation of the operands.
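A minimal sketch of the idea behind Eq. (7), assuming the scaling factor is a power of two so that, in hardware, the threshold $\alpha u_k$ amounts to an exponent offset (or a different bit selection from the LFSR) rather than a multiplication; names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_scaled(y, alpha, L=4096):
    """Stochastic representation of y in [0, alpha] without computing y / alpha (Eq. 7):
    compare y against alpha * u with u ~ U[0, 1].  When alpha = 2**p, the threshold
    alpha * u is just u with its exponent offset by p, so no multiplier is needed."""
    u = rng.random(L)
    return (y > alpha * u).astype(np.uint8)  # in hardware: exponent offset / bit selection

alpha = 2.0 ** 3   # constant power-of-two scaling factor (alpha = 8)
y = 5.2            # lies in [0, alpha]; no explicit normalization y / alpha is performed
seq = bernoulli_scaled(y, alpha)
print(seq.mean(), y / alpha)  # the empirical bit probability approaches y / alpha
```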
III-B Reusing the generated random numbers
The next hurdle for an SC-based multiplier is the requirement to generate $(m+n)L$ uniformly distributed random numbers per outer product. In order to utilize the generated random numbers efficiently, we propose to reuse them across the computations in equation (5): only $2L$ random numbers are generated from the LFSR instead of $(m+n)L$ random numbers for the elements in $X$ and $E$ combined. The first $L$ random numbers are used to generate the Bernoulli sequences of all the elements in the vector $X$, and the remaining $L$ random numbers are used to generate the Bernoulli sequences of all the elements in the vector $E$. For example, to generate 8-bit long Bernoulli sequences corresponding to 256-element long vectors $X$ and $E$, only 16 random numbers are generated. The first 8 random numbers are used to generate the Bernoulli sequences of all elements in the vector $X$ and the remaining 8 for those of the vector $E$. With this modification, the random number generation complexity is reduced to $2L$ per outer product, making it independent of the dimensions of the weight matrix. As our detailed network simulations indicate, reusing random numbers does not introduce harmful correlations, since the two Bernoulli sequences being ANDed in equation (6) remain uncorrelated with each other.
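A sketch of this reuse scheme with the same numbers as the example above (8-bit sequences, 256-element vectors, 16 random numbers in total); the helper name is ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def shared_bernoulli(vec_bar, u):
    """Bernoulli sequences for all elements of a normalized vector using ONE shared
    set of L uniform samples u (shape (L,)), instead of len(vec_bar) * L samples."""
    return vec_bar[:, None] > u[None, :]  # shape (len(vec_bar), L)

L = 8
x_bar = rng.random(256)  # e.g. 256 normalized activations
e_bar = rng.random(256)  # e.g. 256 normalized error gradients
u_x, u_e = rng.random(L), rng.random(L)  # only 2 * L = 16 random numbers in total
S_x, S_e = shared_bernoulli(x_bar, u_x), shared_bernoulli(e_bar, u_e)
print(S_x.shape, S_e.shape)  # (256, 8) each, generated from just 16 random numbers
```

Because every product in Eq. (6) ANDs one sequence derived from u_x with one derived from u_e, the two operand sequences of any given multiplication remain statistically independent.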
III-C Approximating the scaling operations
Scaling the result of the AND operation in equation (6) with $\gamma$ is an operation involving full-precision multiplications. To realize this scaling operation efficiently in hardware, we propose to use the closest power of two not exceeding $\gamma$, obtained as shown below,
$\hat{\gamma} = 2^{\lfloor \log_2 \gamma \rfloor},$  (8)
where $\lfloor a \rfloor$ denotes the largest integer smaller than or equal to $a$. Using $\hat{\gamma}$ instead of $\gamma$ makes the computations in equation (6) straightforward, as only bit-shift operations are required. Usually, $\gamma$ in DNNs will be smaller than 1 and hence such a bit-shift operation will mostly be a right shift. In the case of the stochastic gradient descent optimizer, note that the learning rate can also be accommodated in the $\hat{\gamma}$ computation.
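A short sketch of the power-of-two approximation of Eq. (8); the specific learning rate and maxima are arbitrary, and folding the learning rate into $\gamma$ is shown as one possible realization.

```python
import math

def pow2_floor(gamma):
    """Eq. (8): replace gamma by 2**floor(log2(gamma)) so that scaling is a bit shift."""
    return 2.0 ** math.floor(math.log2(gamma))

lr, e_max, x_max = 0.1, 0.03, 1.3
gamma = lr * e_max * x_max   # learning rate folded into the scaling factor
gamma_hat = pow2_floor(gamma)
print(gamma, gamma_hat)      # gamma_hat = 2**-9 here, i.e. a 9-position right shift
```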
IV The ESSOP architecture
IV-A Unit cell design
We leverage the innovations from Section III to develop the architecture of a single SC-based multiplier that we refer to as an ESSOP unit cell, shown in Fig. 2(a). At the periphery of the unit cell, $L$-bit stochastic sequences $S(\bar{e}_i)$ and $S(\bar{x}_j)$ are computed for the two input real numbers $e_i$ and $x_j$ respectively. The unit cell receives two inputs, each with an $(L+1)$-bit representation: the first $L$ bits are the stochastic representation of the magnitude of the real number and the last bit is the sign of the real number. The sign of the final product is computed using a 1-bit XOR on the sign bits of the two real numbers. In our design, we assume a simple two-input AND gate that is used $L$ times to compute the $L$-bit representation of the SC-based multiplication. In each of the $L$ cycles, the output of the AND gate is fed to a counter that counts the number of 1s in the resulting sequence. In the periphery of the unit cell, $\hat{\gamma}$ is computed, which is used by the shift logic circuit to scale the output of the counter to the desired range. The shift logic depends on the digital representation used for the input real numbers. For example, for a floating point representation, as shown in Fig. 2(b), the shift logic is as simple as copying the output of the XOR to the sign bit position, the counter output to the mantissa position, and the shift value derived from $\hat{\gamma}$ to the exponent position. Finally, the result of the shift logic circuit is stored in a latch (or in a desired memory location) for the weight update. Note that the SC-based multiplier requires only $L$ clock cycles to compute one product.
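A behavioral sketch of a single unit cell as described above (per-cycle AND, popcount counter, 1-bit sign XOR, and a power-of-two shift at the end); bit widths, the shift value, and the test inputs are illustrative assumptions.

```python
import numpy as np

def unit_cell(bits_e, bits_x, sign_e, sign_x, shift):
    """One SC multiplication over L clock cycles: AND + count, then sign and shift scaling."""
    count = 0
    for b_e, b_x in zip(bits_e, bits_x):  # one AND operation per clock cycle
        count += b_e & b_x                # counter accumulates the 1s
    sign = -1 if (sign_e ^ sign_x) else 1 # 1-bit XOR of the sign bits
    return sign * count * (2.0 ** shift)  # shift logic applies the power-of-two scale

rng = np.random.default_rng(4)
L = 16
e_bar, x_bar = 0.6, 0.4  # normalized magnitudes of the two operands
bits_e = (rng.random(L) < e_bar).astype(int)
bits_x = (rng.random(L) < x_bar).astype(int)
# shift = -4 divides the count by L = 16, recovering an estimate of e_bar * x_bar
print(unit_cell(bits_e, bits_x, sign_e=1, sign_x=0, shift=-4))  # roughly -(0.6 * 0.4)
```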
IV-B The multi-cell architecture of ESSOP
To compute all the elements of an outer product matrix, either a single unit cell can be time-multiplexed or multiple unit cells can be used simultaneously. The proposed ESSOP architecture has multiple unit cells stacked in a single row. The high-level architecture of ESSOP is shown in Fig. 3 and has unit cells (U) arranged in a row. At the periphery of the unit cells, all inputs have their corresponding comparator (C). Each comparator receives two inputs, one from either an element of the activation vector ($X$) or the error gradient vector ($E$), and the second from the corresponding random number generator (R). The comparator generates one stochastic bit by comparing two inputs at a time. In an $L$-bit implementation of ESSOP, $L$ such random numbers are fed to each comparator circuit to compute $L$ comparisons in total, resulting in an $L$-bit long stochastic sequence. As discussed in Section III-B, there are exactly two RNGs, one for $X$ and another for $E$, that generate $L$ random numbers each for every outer product.
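A sketch of one row of this arrangement: a single error-gradient element broadcast to a row of unit cells whose comparators all draw from the same two shared RNGs; the function name, shift value, and inputs are ours.

```python
import numpy as np

rng = np.random.default_rng(5)

def essop_row(e_bar, sign_e, x_bar, signs_x, shift, L=16):
    """One row of the outer product computed by a row of unit cells, with all Bernoulli
    bits derived from just two shared L-sample random number generators."""
    u_e, u_x = rng.random(L), rng.random(L)          # the two shared RNGs (R)
    bits_e = e_bar > u_e                              # comparator (C) for the gradient element
    bits_x = x_bar[:, None] > u_x[None, :]            # comparators (C) for the activations
    counts = (bits_e[None, :] & bits_x).sum(axis=1)   # AND gate + counter per unit cell (U)
    signs = np.where(signs_x ^ sign_e, -1.0, 1.0)     # per-cell 1-bit sign XOR
    return signs * counts * (2.0 ** shift)            # shift logic applies the 2**s scale

x_bar = np.array([0.9, 0.5, 0.1, 0.7])                # normalized activations of one layer
signs_x = np.array([0, 1, 0, 0])                      # their sign bits (0: +, 1: -)
print(essop_row(e_bar=0.6, sign_e=0, x_bar=x_bar, signs_x=signs_x, shift=-4))
```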
V Numerical validation
We train the ResNet-32 [14] network on the CIFAR-10 [21] dataset to validate the ESSOP architecture. We use ESSOP to compute the weight updates in all the 3×3 convolution layers and in the final layer of the ResNet-32 network. The CIFAR-10 dataset has 50K images in the training set and 10K images in the test set. Each image in the dataset has a resolution of 32×32 pixels with three channels and belongs to one of ten classes. We preprocess the CIFAR-10 images by implementing the commonly used image processing steps for the family of residual networks, as reported in [22]. The simulation was performed with FP16 precision for data and computation, using mini-batch stochastic gradient descent. We used the initial learning rate (LR) and LR evolution (LRE) of [14] for the baseline; the LRE is tuned for better accuracy in the ESSOP implementation. The categorical cross-entropy loss function is minimized using stochastic gradient descent with momentum. In our results, we denote by ESSOP16(M) an implementation with FP16 precision for the input operands ($X$ and $E$) represented using $M$-bit Bernoulli sequences. Fig. 4 shows the accuracy of ResNet-32 as a function of the ESSOP16 sequence length. The test accuracy drop with ESSOP16(16) is only 0.25% compared to the baseline. Experiments with different sequence lengths indicate that 16 bits are sufficient to achieve accuracy close to that of the baseline with FP16 outer products; ESSOP16(16) shows an average drop of 0.73% in test accuracy compared to the baseline accuracy. ESSOP16(8) shows a somewhat larger average drop in test accuracy compared to the baseline, and, remarkably, ESSOP16(2) still trains with only a small average drop in test accuracy. It is important to note that with a sequence length of 2 bits, it is possible to compute the weight update in just 2 clock cycles.
VI Post layout results
In this section, we present the details of the hardware implementation of ESSOP16 and compare its post-layout performance with that of a highly pipelined FP16 multiplier design after place and route. The FP16 design is an array of FP16 multipliers (from Samsung’s Low Power Plus (LPP) library) that compute FP16 multiplications in parallel. Similarly, the ESSOP16 design has an array of unit cells (U) to compute elements of an outer product matrix in parallel, as illustrated in Fig. 3. The inputs $x_j$ and $e_i$, which are in FP16 precision, are fed to the corresponding comparator circuits (C). The two RNGs corresponding to $X$ and $E$ generate $L$ random numbers each. Each RNG generates only the mantissa part of the random number; the exponent ($x_{\max}$ or $e_{\max}$) is derived from the absolute maximum number in the vector $X$ or $E$, and the sign bit is derived from the other input of the corresponding comparator. Configuration parameters such as the Bernoulli sequence length are stored in the configuration register G. Post place-and-route results using Samsung LPP libraries are shown in Fig. 5. These results indicate that the ESSOP16(16) design, even though sequential and not pipelined, operates at a higher frequency and achieves better energy and area efficiency compared to the FP16 multiplier array for outer product computation.
VII Conclusion
We proposed an efficient hardware architecture, ESSOP, that facilitates the training of deep neural networks. The central idea is to efficiently implement the vector-vector outer product calculation associated with the weight updates using stochastic computing. We proposed efficient schemes to implement stochastic-computing-based multipliers that generalize to operands with unbounded magnitude and significantly reduce the computational cost by reusing random numbers. This addresses a significant performance bottleneck for training DNNs in hardware, particularly for applications with stringent constraints on area and energy. ESSOP complements architectures that accelerate the matrix-vector multiply operations associated with forward and backpropagation, where weights are represented in low precision or stored in computational-memory-based crossbar array architectures. We evaluated ESSOP on a 32-layer deep CNN that achieves baseline-comparable accuracy with a sequence length of 16 bits. Place-and-route results for the ESSOP architecture, compared with the FP16 design, show improvements in energy and area efficiency for outer product computation.
Acknowledgment
This project was supported partially by the Semiconductor Research Corporation.
References
 [1] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363
 [2] F. Li and B. Liu, “Ternary weight networks,” CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
 [3] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola et al., “Neuromorphic computing using non-volatile memory,” Advances in Physics: X, vol. 2, no. 1, pp. 89–124, 2017.
 [4] V. Joshi, M. L. Gallo, I. Boybat, S. Haefeli, C. Piveteau, M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou, “Accurate deep neural network inference using computational phase-change memory,” CoRR, vol. abs/1906.03138, 2019. [Online]. Available: http://arxiv.org/abs/1906.03138
 [5] A. Sebastian, I. Boybat, M. Dazzi, I. Giannopoulos, V. Jonnalagadda, V. Joshi, G. Karunaratne, B. Kersting, R. Khaddam-Aljameh, S. R. Nandakumar, A. Petropoulos, C. Piveteau, T. Antonakopoulos, B. Rajendran, M. L. Gallo, and E. Eleftheriou, “Computational memory-based inference and training of deep neural networks,” in 2019 Symposium on VLSI Technology, June 2019, pp. T168–T169.
 [6] A. Sebastian, M. Le Gallo, G. W. Burr, S. Kim, M. BrightSky, and E. Eleftheriou, “Tutorial: Brain-inspired computing using phase-change memory devices,” Journal of Applied Physics, vol. 124, no. 11, p. 111101, 2018.
 [7] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, “BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 983–994, April 2018.
 [8] S. R. Nandakumar, M. L. Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, “Mixed-precision training of deep neural networks using computational memory,” CoRR, vol. abs/1712.01192, 2017. [Online]. Available: http://arxiv.org/abs/1712.01192

 [9] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, “Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC,” in 2016 International Conference on Field-Programmable Technology (FPT), Dec 2016, pp. 77–84.
 [10] T. Gokmen and Y. Vlasov, “Acceleration of deep neural network training with resistive cross-point devices,” CoRR, vol. abs/1603.07341, 2016. [Online]. Available: http://arxiv.org/abs/1603.07341
 [11] M. Courbariaux and Y. Bengio, “BinaryNet: Training deep neural networks with weights and activations constrained to +1 or −1,” CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
 [12] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks with few multiplications,” CoRR, vol. abs/1510.03009, 2015. [Online]. Available: http://arxiv.org/abs/1510.03009
 [13] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
 [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
 [15] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
 [16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
 [17] T. Gokmen, M. Onen, and W. Haensch, “Training deep convolutional neural networks with resistive cross-point devices,” Frontiers in Neuroscience, vol. 11, p. 538, 2017. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2017.00538
 [18] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Trans. Embed. Comput. Syst., vol. 12, no. 2s, pp. 92:1–92:19, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465794
 [19] B. R. Gaines, “Stochastic computing,” in Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, ser. AFIPS ’67 (Spring). New York, NY, USA: ACM, 1967, pp. 149–156. [Online]. Available: http://doi.acm.org/10.1145/1465482.1465505
 [20] W. J. Poppelbaum, C. Afuso, and J. W. Esch, “Stochastic computing elements and systems,” in Proceedings of the November 14–16, 1967, Fall Joint Computer Conference, ser. AFIPS ’67 (Fall). New York, NY, USA: ACM, 1967, pp. 635–644. [Online]. Available: http://doi.acm.org/10.1145/1465611.1465696
 [21] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, 05 2012.
 [22] T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” CoRR, vol. abs/1708.04552, 2017. [Online]. Available: http://arxiv.org/abs/1708.04552