I Introduction
The processing-in-memory (PIM) paradigm has been considered a promising alternative to break the bottlenecks of the conventional von Neumann architecture. In the era of big data, data movement between the processor and the memory results in huge power consumption (the power wall) and performance degradation (the memory wall), known as the von Neumann bottleneck [1]. By placing the processing units inside or near the memory, PIM remarkably reduces the energy and performance overhead induced by data transport [2, 3]. Recent PIM designs also fully leverage the large internal memory bandwidth and embrace massive parallelism by simultaneously activating multiple rows, subarrays, and banks in the memory arrays for bitwise operations [4]. These performance gains are all achieved at the minimal cost of slightly modifying the memory array peripheral circuits [5, 6].
Multiplication (MUL) remains a complex task for efficient PIM designs, even though MUL instructions are frequently used in neural network (NN) algorithms and linear transforms (e.g., the Discrete Fourier Transform). As shown by the recently developed DRISA [6], it takes 143 cycles to calculate an 8-bit multiplication, which deviates from its original motivation of achieving high performance with in-memory bitwise operations. The situation may be even worse with operands composed of more bits, as the cycle count can increase exponentially with the operand's bit length. The challenge mainly lies in the fact that MUL cannot be effectively decomposed into a small series of bitwise Boolean logic operations that can be performed locally in memory.

To tackle this challenge, prior efforts propose to either approximate the MUL or utilize the analog computing features of hardware devices. On one hand, at the algorithm level, binary NNs with approximate binary weights and activations have been developed [7]. As such, the MUL is simplified into bitwise XNOR operations that are PIM friendly [8]. Unfortunately, such simplification comes at the cost of significant degradation in the classification accuracy of the NN. On the other hand, at the hardware level, ReRAM has been employed to ease MUL in novel PIM designs, taking advantage of ReRAM's analog storage. The analog resistance/conductance of a ReRAM cell encodes a weight in the NN. By activating an entire row/column simultaneously in a ReRAM crossbar, the dot product between a matrix and a vector in the NN can be easily achieved using Ohm's law
[5]. Nevertheless, ReRAM itself suffers from long write latency, high programming voltage, and limited endurance, which hinder its application in high-speed and energy-efficient architecture designs.

In this work, we propose a new stochastic computing (SC) design to effectively perform MUL with in-memory operations, in light of the simplicity of implementing MUL with SC. To tightly couple SC with PIM, we embrace the inherent stochasticity of the memory bit in spin-orbit-torque magnetic random access memory (SOT-MRAM). Specifically, the stochastic number generation and massive AND operations of conventional SC-based MUL are implemented with simple memory write operations in SOT-MRAM. Consequently, each bit serves as an SC engine, and the large supporting circuits for stochastic number generation and logic operations can be avoided. Finally, the MUL outcome is represented by the probability distribution of the binary storage states among the MRAM bits, and can be converted back to its binary form with a popcount. The contributions of this paper are summarized as follows:

We propose the idea of employing the inherent stochastic write of SOT-MRAM to promote SC in the PIM design.

We develop an efficient approach to implement MUL by means of memory writes, converting the binary multipliers into write voltage pulses with varied durations.

We propose two popcount strategies to convert the MUL result back to its binary format, offering flexibility to further trade off performance against area.

The proposed design provides up to 4x improvement in performance and a significant reduction in area compared with conventional SC approaches, and achieves an 18x speedup over implementing MUL with only in-memory bitwise Boolean logic operations.
II Preliminaries
This section introduces the motivation for combining SC with PIM and the preliminaries of the stochastic switching behavior of SOT-MRAM.
II-A SC and PIM
SC provides an alternative approach to implementing the MUL function. It is an approximate computing method that has been studied for decades and widely applied to image/signal processing, control systems, and general-purpose computing [9, 10]. SC essentially trades data representation density for simpler logic design and lower power. For instance, SC represents an n-bit binary number with a stochastic bitstream X (2^n bits). The value x of the binary number equals the probability of "1"s appearing in the bitstream X:
x = P(X = 1) = (number of "1"s in X) / 2^n (1)
Benefiting from such data representation, the MUL between two numbers can be converted to simple bitwise AND operations, which dramatically reduces the complexity of logic design.
z = x · y = P(X = 1) · P(Y = 1) = P(X ∧ Y = 1) (2)
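As an illustration of this reduction, the AND-based MUL can be emulated in software. The sketch below is our own; the bitstream length and helper names are illustrative, not part of the proposed hardware:

```python
import random

def to_bitstream(x, n_bits, rng):
    """Encode a probability x in [0, 1] as a random bitstream of n_bits bits."""
    return [1 if rng.random() < x else 0 for _ in range(n_bits)]

def sc_mul(x, y, n_bits=4096, seed=0):
    """Approximate x * y by AND-ing two independent stochastic bitstreams
    and counting the fraction of surviving "1"s."""
    rng = random.Random(seed)
    xs = to_bitstream(x, n_bits, rng)
    ys = to_bitstream(y, n_bits, rng)
    zs = [a & b for a, b in zip(xs, ys)]   # the entire MUL is a bitwise AND
    return sum(zs) / n_bits                # fraction of "1"s ≈ x * y
```

The estimate carries the binomial sampling error discussed later, so longer bitstreams trade latency for accuracy.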
However, SC is not friendly to the conventional von Neumann architecture. The data explosion of SC aggravates data movement between the processor and the memory, which offsets the simplicity brought by SC.
In contrast, SC couples tightly with PIM in multiple aspects, leading to significant performance gains. First, the many bits of a stochastic bitstream can be stored in off-chip memory with large capacity. Second, the logic operations with reduced complexity can be implemented by the processing units locally in memory. Finally, the stochastic nature of the bitstream allows parallel computing on individual bits, so that the internal memory bandwidth can be fully leveraged. Therefore, the MUL instruction can be significantly accelerated by combining SC with PIM.
Several challenges still exist toward combining SC with PIM. The random bitstreams still rely on stochastic number generators (SNGs), which incur large area overhead for the supporting circuits. In addition, those stochastic bits can hardly be generated in parallel and with eliminated correlations, resulting in degraded performance and accuracy when computing MUL. In our design, we overcome these drawbacks by utilizing the inherent stochasticity of MRAM bits.
II-B SOT-MRAM and its stochastic switching
SOT-MRAM utilizes spin-orbit torques (SOTs) to write the memory cell, overcoming the drawbacks of spin-transfer-torque MRAM (STT-MRAM) in terms of high write latency and large write energy dissipation [11, 12]. Fig. 1 compares the similarities and differences between SOT-MRAM and STT-MRAM cells. Both types of MRAM cells store the bit value in a magnetic tunneling junction (MTJ). The bit value "0" or "1" is read out electrically as high or low tunneling magnetoresistance, which is controlled by the antiparallel (AP) or parallel (P) alignment of the magnetization in the free layer (FL) and the reference layer (RL). Although writing an MRAM bit is always accomplished by controlling the magnetization direction of the FL, the mechanisms differ between STT-MRAM and SOT-MRAM. In STT-MRAM, the write current passes through the MTJ, and the spin-polarized current exerts a notable STT to switch the FL magnetization [11, 13]. In SOT-MRAM, by contrast, SOTs are generated by passing the write current through an additional heavy-metal layer (HML) to switch the magnetization in the adjacent FL. As a result, SOT-MRAM does not suffer from the write-latency asymmetry between "AP→P" and "P→AP" switching in STT-MRAM, speeding up the write procedure. Moreover, the energy efficiency of the write is fundamentally higher in SOT-MRAM, because each electron can be reused multiple times to exert SOTs after being scattered back from the HML/FL interface, while it can be used at most once in STT.
We harness the stochastic behavior within the memory write of SOT-MRAM to perform SC. The probability that an MRAM bit remains not switched under an applied electrical current I is described by [14]
P_ns = exp(-t/τ), with τ = τ0 exp[Δ(1 - I/I_c0)] (3)
Here, t denotes the pulse duration of the applied current in nanoseconds, Δ represents the thermal stability parameter of the MTJ, I_c0 is the critical current strength required to switch the FL magnetization, and τ0 is the characteristic attempt time. Fig. 2 plots P_ns as a function of t and I, with Δ and I_c0 estimated from previous micromagnetic simulations of SOT-driven magnetization dynamics [12]. By finely controlling these write parameters of SOT-MRAM, each memory bit can serve as a stochastic bit generator with the desired probability of holding either "0" or "1". Utilizing this feature of SOT-MRAM, the large number of stochastic bits required by SC can be generated in parallel and stored in situ in memory with a simple write operation.
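To make the role of the switching model concrete, the following sketch (ours; the parameter values delta, i_c0, and tau0 are illustrative defaults, not the paper's calibrated device parameters) treats each bit of a preset row as an independent Bernoulli trial that survives with the not-switched probability of Eq. 3:

```python
import math
import random

def p_not_switched(t_pulse, current, delta=40.0, i_c0=1.0, tau0=1.0):
    """Probability that the free layer is NOT switched by a write pulse of
    duration t_pulse at the given current (thermally activated model, Eq. 3)."""
    tau = tau0 * math.exp(delta * (1.0 - current / i_c0))
    return math.exp(-t_pulse / tau)

def stochastic_write(n_bits, t_pulse, current, seed=0):
    """One parallel stochastic write: every bit of a row preset to "1"
    independently survives with probability p_not_switched."""
    rng = random.Random(seed)
    p = p_not_switched(t_pulse, current)
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]
```

Note the steep dependence on current: slightly below I_c0 almost nothing switches, slightly above almost everything does, which is why the design operates the bits in the non-deterministic region.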
III Data conversion and hardware design
To implement the MUL operation with the stochastic switching of MRAM bits, the binary operands have to be translated into a parameter of the write voltage pulse. The flow of our proposed sequential data conversion can be summarized as:
binary number → logarithmic timing signal → stochastic bitstream → binary number (4)
In this section, we will introduce them and the related hardware design step by step.
III-A Binary numbers to logarithmic timing signals
We first perform a logarithmic operation on the digital numbers stored in memory, i.e., compute log(x) for each operand x. The multiple bits of the operand are read out by sense amplifiers (SAs) and decoded to find their logarithmic values using a lookup table (LUT) (Fig. 3). The LUT method is commonly used in logarithmic multiplication and has been demonstrated to be fast and accurate [15]. This conversion step is necessary, since an exponential operation is inherently included in the subsequent stochastic switching of the MRAM bit.
Afterwards, we convert the logarithmic values into timing signals with a digital-to-time converter (DTC). The DTC outputs a voltage square pulse whose duration t in Eq. 3 is proportional to the value of its input. The magnitude of the pulse is normalized and fixed to drive the SOT-MRAM bit in its non-deterministic switching region.
III-B Logarithmic timing signals to stochastic bitstream
The write voltage pulse is subsequently applied to the source lines (SLs) of multiple rows of SOT-MRAM bits, driving their stochastic switching behaviors. An entire row of the MRAM array can be written simultaneously with a cross-point design (Fig. 4) [12]. The MTJs in a row share a set of driving transistors and are directly linked to the BLs and SLs without additional transistors for individual bits. As a result, minimal area and energy overhead are introduced to enable such simultaneous writes.
Fig. 5 shows how the SC-based MUL is performed. Initialization: a preset operation initializes all the bits to "1" with a reversed current. Input of the first operand: the write voltage pulse converted from the first operand is applied to the MRAM array, resulting in partial switching of the bits; the probability of a bit remaining "1" is proportional to the operand's value. MUL with the second operand: the MUL is performed by applying a subsequent voltage pulse (similarly converted from the second operand) to the same MRAM array. As a result, the remaining "1"s are the bits switched by neither pulse, and they are distributed among the MRAM array with a probability proportional to the product of the two operands.
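The preset-then-two-pulses flow can be sketched end to end in software. This is an illustrative model of ours, assuming each operand is mapped to a pulse duration t such that the not-switched probability exp(-t/τ) equals the operand's value (the LUT + DTC step collapsed into one helper):

```python
import math
import random

def duration_for(x, tau=1.0):
    """Map an operand x in (0, 1] to a pulse duration t so that the
    not-switched probability exp(-t/tau) equals x."""
    return -tau * math.log(x)

def sc_pim_mul(x, y, n_bits=8192, tau=1.0, seed=1):
    """Sketch of the in-memory MUL: preset a row of bits to "1", then
    apply two stochastic write pulses converted from operands x and y."""
    rng = random.Random(seed)
    bits = [1] * n_bits                          # initialization: preset to "1"
    for t in (duration_for(x, tau), duration_for(y, tau)):
        p_survive = math.exp(-t / tau)           # Eq. 3 at fixed current
        bits = [b if rng.random() < p_survive else 0 for b in bits]
    return sum(bits) / n_bits                    # popcount / n_bits ≈ x * y
```

Because the survival probabilities of the two pulses multiply, the surviving "1"s directly encode the product without any explicit AND gate.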
III-C Stochastic bits to binary numbers
Finally, we perform bit counting to convert the outcome from its stochastic representation to binary format. Either an approximate popcount (APC) [16] or PIM-based ADD operations [6] can be employed for the bit counting. The APC method completes in one clock cycle but introduces considerable area overhead. Alternatively, PIM-based ADD is area-efficient but takes many clock cycles to perform the popcount.
Specifically, we can accelerate the PIM-based popcount for the vectored multiply-and-accumulate (MAC) in NNs. Fig. 6 shows the two-step strategy, where the sum is performed after several MULs have been done. In the first step, we perform a row-wise sum with carry-save addition (CSA). In the second step, the intermediate sum results undergo column-wise additions with a full adder (FA). Our motivation here is to lessen the usage of the FA for column-wise addition, since it takes more clock cycles than the lockstep bitwise operations of the CSA. As shown in Fig. 6, the delay from the FA can be amortized, and the popcount-related cycle count converges to that of the CSA after many MULs.
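The intent of the two-step scheme — defer the slow carry-propagating addition to a single final step — can be illustrated with carry-save addition on integers. This is our own software sketch, not the paper's circuit; the per-row popcount is a stand-in for the bits contributed by each MUL:

```python
def csa(a, b, c):
    """One carry-save step: compress three addends into a sum word and a
    carry word using only lockstep bitwise operations (no carry chain).
    Invariant: sum_word + carry_word == a + b + c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def popcount_mac(rows):
    """Two-step popcount for a MAC: CSA-accumulate per-row popcounts,
    deferring the single carry-propagating (full-adder) add to the end."""
    s, c = 0, 0
    for r in rows:
        s, c = csa(s, c, bin(r).count("1"))  # carry-free accumulation
    return s + c                             # one final carry-propagating add
```

Each CSA step costs only a fixed number of bitwise operations regardless of operand width, which is why the FA's carry-propagation delay is paid once rather than once per MUL.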
III-D Putting it all together
Having put all the pieces together, we now point out strategies to further improve performance and accuracy, and explain certain design considerations.
The sequential flow of data conversion can be separated and pipelined to improve throughput and performance. For example, the LUT operation on the second operand can be performed simultaneously with the stochastic memory write for the first operand. Moreover, the bit counting can work in parallel with the MUL operations for NN applications: there is no need for the relatively slow popcount to start until all the fast MULs between the weights and inputs have finished in the computation of a MAC. Furthermore, one could pre-convert certain frequently used data (e.g., the weights in an NN) into stochastic bits, which can be stored non-volatilely in the MRAM arrays. Once the other multiplicands (e.g., the inputs) arrive, their converted timing signals can be directly applied to the corresponding MRAM arrays to perform the MUL operations.
There are several normalization units in the circuits that can be used to fine-tune accuracy and performance. For example, the pulse durations can be scaled into a bounded range, so that the switching voltage pulse is never longer than the usual time required to deterministically switch an MRAM bit, avoiding unnecessary slowdown in computing. Moreover, the bitstream can be tuned to be neither too sparse nor too dense to guarantee the accuracy of MUL, so that more bits are effectively involved in SC. This is fundamentally similar to the improved classification accuracy of an NN when more neurons are involved.
Multiple rows can be simultaneously activated and written to generate more stochastic bits in parallel. This situation arises when performing MUL on operands with more bits. In the cross-point MRAM design, we limit the number of memory cells in each row due to IR-drop concerns: the MTJs farther from the driving transistors in a row would suffer from a lower switching voltage [17] and would likely undergo stochastic switching with undesired, incorrect probabilities.
Finally, we note that the pulse duration t is used here for computing, instead of the magnitude of the switching voltage pulse (equivalent to I in Eq. 3). Using the magnitude would require more complicated data-conversion circuits, owing to the complex dependence of P_ns on I. In addition, the two operands would have to be input simultaneously onto the MRAM arrays, which is not friendly to the pipeline strategies mentioned above and would also introduce large area overhead in the driving transistors to enable higher-current writes.
IV Monte Carlo simulations
To estimate the accuracy and its dependence on hardware variance, we performed Monte Carlo simulations of the stochastic switching of MRAM bits. In the following, N denotes the number of stochastic bits per MUL, and P represents the probability that a bit remains not switched under a given input voltage pulse. For one MUL operation, we test the proposed SC with 1000 iterations and gather statistics on the results across iterations.

IV-A Accuracy
Fig. 7(a) shows the distribution, among the 1000 iterations, of the error between the stochastically computed probability and the value theoretically calculated from the two operands. The error distribution is centered at zero, indicating that there is no intrinsic bias in the SC arithmetic. The distribution is well fitted by a Gaussian function (red line) with standard deviation σ, indicating that the MUL carries an uncertainty of about σ for the chosen N.

We further investigate the dependence of σ on the inputs and the number of stochastic bits N. As shown in Fig. 7(b), σ is almost independent of the inputs but decreases with larger N. Therefore, we can improve the accuracy of SC by using more MRAM bits, although the improvement becomes more gradual at larger N.
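This trend follows directly from binomial statistics: the standard deviation of estimating a probability p from N independent bits is sqrt(p(1-p)/N), so quadrupling N only halves σ. A minimal helper (ours, illustrating the scaling rather than reproducing the paper's fitted values) makes this explicit:

```python
import math

def mul_std(p, n_bits):
    """Standard deviation of estimating a probability p from n_bits
    independent stochastic bits (binomial statistics)."""
    return math.sqrt(p * (1.0 - p) / n_bits)
```

For example, estimating p = 0.25 with 4096 bits gives σ ≈ 0.007, and a 4x increase in N is needed to halve the uncertainty, matching the diminishing returns seen at large N.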
IV-B The impact of hardware variance
We also investigate the impact of hardware variance on the accuracy of the MUL operation, by introducing random fluctuations in the device parameters in the Monte Carlo simulations.
The critical currents of MRAM bits may vary slightly, since the many MRAM bits cannot be manufactured identically and may also experience different thermal fluctuations in use [18]. Therefore, we introduce 0% to 10% random fluctuations in I_c0. As shown in Fig. 8(a), the accuracy of SC remains almost unchanged under different fluctuation strengths.
We also compare the fault tolerance of our design with that of logarithm multiplication. To implement logarithm multiplication [15], we replace the DTC and SOT-MRAM with an antilogarithm amplifier. We then introduce 4% to 10% random fluctuations in the DTC and the antilogarithm amplifier, respectively, for the two cases. As shown in Fig. 8(b), the accuracy of our SC+PIM design remains almost unchanged, while logarithm multiplication suffers severe accuracy degradation under stronger fluctuations.
V Evaluation
In this section, we evaluate the performance, power, and area overhead of the proposed SC+PIM design, and compare them with those of other approaches using either SC or PIM.
V-A Experimental setup
We adopt a cross-point design of SOT-MRAM arrays similar to PRESCOTT [12] to enable parallel memory writes. The low-power DTC generates voltage pulses with 22 ps time resolution and occupies little area [19]. For the APC, we design a one-cycle fully parallel circuit synthesized with the 45 nm FreePDK [20], integrating parameters from [16]. Our evaluation is based on the multiplication between two 10-bit operands represented by stochastic bits.
In the following, different configurations are compared: SC+PIM (with APC) denotes our SC+PIM design with the popcount conducted by the APC. SC+PIM (with CSA) is our SC+PIM design with the popcount performed by CSA+FA; here, the evaluation is averaged over each MUL when performing 100 MULs in a MAC. SC represents a multiplier built with the state-of-the-art SNG [21] and an APC popcount. PIM is the case where only in-memory Boolean logic operations are used to implement MUL.
V-B Performance
Fig. 9(a) compares the cycle count used to perform each MUL operation with the different designs. Evidently, our SC+PIM approach outperforms prior approaches using either SC or PIM alone. The performance boost in our design comes from the parallel generation of stochastic bits. In contrast, prior SC approaches require additional cycles to generate stochastic bitstreams or to shuffle existing pseudo-stochastic or deterministic bitstreams [21].
In addition, we investigate the dependence of the MUL cycle count on the operands' bit length, as shown in Fig. 9(b). The cycle count remains unchanged in our SC+PIM design, since the larger number of stochastic bits required for an n-bit operand can be generated in parallel. By comparison, the cycle count required for MUL increases exponentially with the operands' bit length in the prior PIM design. Therefore, the speedup of SC+PIM over PIM becomes even more attractive for MUL between operands with more bits.
V-C Energy consumption
Our SC+PIM design consumes 58% less energy than the SC method (Fig. 10), thanks to the low write energy of SOT-MRAM [22]. In our design, most energy is spent on memory writes, such as in the generation/computation of stochastic bits and the popcount with bitwise addition (CSA). The situation is similar in prior SC approaches, where 88% of the energy is consumed by data-buffering-related operations.
As shown by the breakdown of the energy consumption in Fig. 10, the initialization step costs more energy than the following steps that perform SC for MUL. That is because a write voltage pulse with a higher magnitude and a longer duration must be applied to guarantee the initialization. Afterwards, the memory bits are mainly driven in the non-deterministic switching region, which consumes less energy.
V-D Area overhead
The area overhead of the different designs is compared in Fig. 11. The area overhead of our SC+PIM design is about one order of magnitude smaller than that of conventional SC. The improvement originates from removing the additional SNG circuits, which occupy 95% of the area in the conventional SC approach.
As shown by the breakdown of the area overhead in Fig. 11, the memory space required for the LUT is comparable to that of the DTC and APC in our design for the case of 10-bit multiplication. The LUT size shrinks for regular 8-bit multiplication, since it depends exponentially on the bit length of the operands.
VI Conclusion
In this paper, we propose a new SC design to perform MUL with in-memory operations. The stochastic number generation and AND operations of conventional SC are implemented by simple write operations on SOT-MRAM. This design is enabled by converting the binary multipliers into the varied pulse durations of the SOT-MRAM write voltage. Consequently, the stochastic bits of the MUL outcome are stored in situ. Two popcount strategies (APC or PIM-based ADD) have been proposed to convert the MUL result back to its binary format, offering flexibility to trade off performance against area. Our approach improves the performance of computing MUL with PIM, in synergy with mitigating the area overhead of the SC supporting circuits.
References
 [1] G. Koo, K. K. Matam, H. Narra, J. Li, H.-W. Tseng, S. Swanson, M. Annavaram et al., “Summarizer: Trading communication with computing near storage,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 219–231.
 [2] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “TOP-PIM: Throughput-oriented programmable processing in memory,” in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2014, pp. 85–98.
 [3] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 105–117, 2016.
 [4] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 173.
 [5] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 27–39.
 [6] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “DRISA: A DRAM-based reconfigurable in-situ accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 288–301.
 [7] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1,” arXiv preprint arXiv:1602.02830, 2016.
 [8] S. Angizi, Z. He, F. Parveen, and D. Fan, “IMCE: Energy-efficient bit-wise in-memory convolution engine for deep neural network,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 111–116.
 [9] J. P. Hayes, “Introduction to stochastic computing and its challenges,” in Proceedings of the 52nd Annual Design Automation Conference. ACM, 2015, p. 59.
 [10] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Transactions on Embedded computing systems (TECS), vol. 12, no. 2s, p. 92, 2013.
 [11] Z. Wang, L. Zhang, M. Wang, Z. Wang, D. Zhu, Y. Zhang, and W. Zhao, “High-density NAND-like spin transfer torque memory with spin orbit torque erase operation,” IEEE Electron Device Letters, vol. 39, no. 3, pp. 343–346, 2018.
 [12] L. Chang, Z. Wang, A. O. Glova, J. Zhao, Y. Zhang, Y. Xie, and W. Zhao, “PRESCOTT: Preset-based cross-point architecture for spin-orbit-torque magnetic random access memory,” in Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on. IEEE, 2017, pp. 245–252.
 [13] L. Chang, Z. Wang, Y. Gao, W. Kang, Y. Zhang, and W. Zhao, “Evaluation of spin-Hall-assisted STT-MRAM for cache replacement,” in Nanoscale Architectures (NANOARCH), 2016 IEEE/ACM International Symposium on. IEEE, 2016, pp. 73–78.
 [14] T. Seki, A. Fukushima, H. Kubota, K. Yakushiji, S. Yuasa, and K. Ando, “Switchingprobability distribution of spintorque switching in mgobased magnetic tunnel junctions,” Applied Physics Letters, vol. 99, no. 11, p. 112504, 2011.
 [15] D. Nandan, J. Kanungo, and A. Mahajan, “65 years journey of logarithm multiplier,” Int J Pure Appl Math, vol. 118, pp. 261–266, 2018.
 [16] K. Kim, J. Lee, and K. Choi, “Approximate derandomizer for stochastic circuits,” in SoC Design Conference (ISOCC), 2015 International. IEEE, 2015, pp. 123–124.
 [17] J. Liang and H.S. P. Wong, “Crosspoint memory array without cell selectors—device characteristics and data storage pattern dependencies,” IEEE Transactions on Electron Devices, vol. 57, no. 10, pp. 2531–2538, 2010.
 [18] K. An, X. Ma, C.F. Pai, J. Yang, K. S. Olsson, J. L. Erskine, D. C. Ralph, R. A. Buhrman, and X. Li, “Current control of magnetic anisotropy via stress in a ferromagnetic metal waveguide,” Physical Review B, vol. 93, no. 14, p. 140404, 2016.
 [19] B. Wang, Y.H. Liu, P. Harpe, J. van den Heuvel, B. Liu, H. Gao, and R. B. Staszewski, “A digital to time converter with fully digital calibration scheme for ultralow power adpll in 40 nm cmos,” in Circuits and Systems (ISCAS), 2015 IEEE International Symposium on. IEEE, 2015, pp. 2289–2292.
 [20] J. E. Stine, J. Chen, I. Castellanos, G. Sundararajan, M. Qayam, P. Kumar, J. Remington, and S. Sohoni, “FreePDK v2.0: Transitioning VLSI education towards nanometer variation-aware designs,” in Microelectronic Systems Education, 2009. MSE’09. IEEE International Conference on. Citeseer, 2009, pp. 100–103.
 [21] K. Kim, J. Lee, and K. Choi, “An energyefficient random number generator for stochastic circuits,” in Design Automation Conference (ASPDAC), 2016 21st Asia and South Pacific. IEEE, 2016, pp. 256–261.
 [22] K. Jabeur, G. Di Pendina, F. BernardGranger, and G. Prenat, “Spin orbit torque nonvolatile flipflop for high speed and low energy applications,” IEEE electron device letters, vol. 35, no. 3, pp. 408–410, 2014.