In the von Neumann architecture, transistor scaling has improved memory and processing units asymmetrically, leading to gaps in performance and energy consumption. Faster processing units need frequent, energy-intensive data transfers from slower memory, which degrades overall system performance. Processing data within the memory avoids the von Neumann bottleneck and improves system performance and power consumption.
In-Memory Computing (IMC) is a promising compute model that minimizes data transfer between processor and memory by confining data within compute-capable memory. Several works have proposed moving compute logic closer to the main memory or infusing processing ability into memory cells [9, 23, 15, 25, 17, 22]. IMC can perform specific tasks such as dot-products that are used for recognition and search, or support a wide range of logic and arithmetic operations (e.g., matrix multiplication). The compute capability of conventional memories such as Static RAM (SRAM) and Dynamic RAM (DRAM) has been heavily studied. IMC is also achievable using emerging Non-Volatile Memories (NVMs), e.g., Resistive RAM (RRAM), Spin Transfer Torque (STT) Magnetic RAM, and Phase Change Memory. RRAM-based IMC architectures, in particular, have exhibited significant promise due to low power consumption, fast operation, and high integration density ( footprint in crossbar architecture). RRAM provides a higher sense margin and more analog states than other NVMs, e.g., STTRAM. Although RRAM may not be preferred for storage applications due to poor write endurance, it is useful for IMC since the write operation is needed only once (to program the function). Furthermore, unlike the read operation of RRAM, IMC does not require constant DC current, which leads to higher endurance. In this paper, we propose SCARE (SCA on IMC for reverse engineering), which can reveal IMC Intellectual Property (IP), taking RRAM-based IMC architectures as test cases.
Existing IMC architectures such as Dynamic Computing In Memory (DCIM), Memristor Aided Logic (MAGIC), and material implication (IMPLY) can only implement functions with a limited number of inputs. Implementing large functions using DCIM reduces the Sense Margin (SM). For MAGIC and IMPLY, the delay/power consumption increases exponentially with function size. An adversary can leverage the power/current signature through a Side Channel Attack (SCA) to Reverse Engineer (RE) the implemented function, i.e., the number of minterms and the number of input literals per minterm. The final goal of the adversary is to find the function implemented in IMC (an example is shown in Fig. 1).
Example of SCARE attack: A simplified example of the attack is shown in Fig. 1. It assumes that the IMC operation is carried out in two cycles to compute a function in Sum-of-Product (SOP) form. The first cycle computes the AND and the second cycle computes the OR. In this example, SCARE involves the following steps: (1) extracting the current profile during the compute cycles; (2) matching the OR-cycle current profile with one of the pre-calculated current-profile models to determine the number of minterms implemented within the array (here, the minterms represent the fanin of the OR gate); (3) matching the AND-cycle current profile with one of the pre-calculated current-profile models to determine the number of literals in the array, i.e., the fanin of the AND gates. Knowing the number of minterms from the OR cycle allows the adversary to determine the number of input literals per SOP minterm; (4) after finding the function structure, extracting the implemented function by applying a limited number of patterns to the chip and validating against a golden chip.
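Steps (2) and (3) amount to looking up the nearest pre-calculated model. The following minimal Python sketch illustrates the idea under stated assumptions: the table `MODEL_MEAN_CURRENT`, its values, and the helper `match_fanin` are hypothetical and are not taken from the SCARE measurements.

```python
# Hypothetical pre-calculated model: mean current (arbitrary units) per fanin.
# The numbers are illustrative only and do not come from the paper.
MODEL_MEAN_CURRENT = {1: 10.0, 2: 19.0, 3: 27.0, 4: 34.0}


def match_fanin(measured_current, model=MODEL_MEAN_CURRENT):
    """Return the fanin whose modelled mean current is closest to the measurement."""
    return min(model, key=lambda fanin: abs(model[fanin] - measured_current))


# Step (2): the OR-cycle current gives the number of minterms.
n_minterms = match_fanin(20.5)
# Step (3): the AND-cycle current gives the fanin of an AND gate.
and_fanin = match_fanin(33.0)
```

A nearest-mean lookup ignores process variation; a distribution-based comparison is more appropriate when the signatures of adjacent fanins overlap.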
Baseline attack model: We assume that the adversary can obtain pre-computed current-profile models, developed through foundry-calibrated simulations or by fabricating small known functions in a test chip (more details in Section II-D), to aid the RE process. For brevity, this paper focuses on two RRAM-based IMC architectures, namely DCIM and MAGIC, to demonstrate SCARE. However, the attack can also be implemented on other emerging IMC architectures with minor changes. Note that process variation can lead to a large variation in the power/timing profile of implemented gates. This, in turn, can make RE challenging due to the overlap of signatures from two different gates (e.g., the power of a 2-input AND gate can overlap with that of a 3-input AND gate due to variations). However, the adversary can use statistical analysis of power/delay to filter the correct function (details in Sections III and IV). IMC can improve power efficiency by cutting down sensor data movement in IoT/mobile devices, whereas in servers it can improve performance and reduce cooling costs by lowering power dissipation. Our attack model is suited to IoT/mobile devices, where the users/adversaries have access to the devices and their power ports.
Distinctions from memory SCA: SCAs on traditional memories have been studied in the past. Side-channel leakage of ASIC components has been investigated previously; that work shows that SRAMs have geometrically uniform structures and that their leakage closely follows a generic Hamming distance. The observation is based on the SRAM write operation, which flips the data stored in the memory cell. The data-dependent write current can be leveraged to launch simple power analysis attacks. Emerging memories provide greater scope for SCA due to higher write and read currents compared to SRAM. Differential Power Analysis (DPA) on STTRAM has been investigated to decode the memory contents. Note that write-current-based SCA will not work for all IMC architectures because some (e.g., DCIM) perform no write operation during computing. Furthermore, IMC computing differs from the read operation due to the absence of static current during computing. Therefore, read-current-based SCA is not directly applicable to IMC. Finally, the objective of SCA-based RE is to extract the implemented function, in contrast to SCA-based key extraction, which only identifies the Hamming weight of the data.
Distinctions from conventional RE: RE is generally an invasive and destructive form of analyzing integrated circuits (ICs) in which an adversary grinds away each layer of an IC and captures optical images. The base layer provides the gate types and the upper layers provide their connectivity. By combining this information, the IP can be unlocked. In contrast, SCARE is a non-invasive RE approach that exploits SCA to extract IP implemented in emerging IMC. This eliminates the need for highly expensive and invasive forms of RE for emerging IMC-based computing. With SCARE, the extracted structure reveals the number of input literals and the number of SOP minterms. In essence, the structure reveals the number and input type of each of the AND and OR gates present in the function. This extracted structure can be further used to design a limited number of input patterns to determine the overall SOP function (not covered in this paper). Based on our literature survey, SCARE is the first work on RE of IMC-based IP.
In particular, we make the following contributions in this paper. We,
investigate SCA on IMC architectures for non-invasive RE of IP;
exploit side channel current profiles to identify the gate structures of the implemented functions;
propose two attack models for each of DCIM and MAGIC: one works for true inputs only and the other works for both true and complementary inputs;
conduct process variation (PV) analysis of the IMC architectures to develop an SCA comparison model;
propose countermeasures, such as redundant inputs and expansion of literals, to protect against SCARE.
The rest of the paper is organized as follows: Section II introduces the basics of RRAM, DCIM, and MAGIC IMC architectures, background on SCA and simulation setup; Sections III and IV describe the proposed attack models on DCIM and MAGIC, respectively and the results; Section V presents countermeasures; Section VI presents discussion; Finally, Section VII draws the conclusions.
II Background and Related Work
II-A Basics of RRAM
RRAM is a two-terminal device in which a resistive switching layer is sandwiched between two electrodes. Switching from the Low Resistance State (LRS) to the High Resistance State (HRS) is called the ‘reset’ process, and switching from HRS to LRS is called the ‘set’ process. The resistance of the insulator layer changes from LRS (HRS) to HRS (LRS) depending on the voltage polarity between the two terminals.
II-B Basics of IMC
We have implemented a previously proposed DCIM architecture. Fig. 1(a) shows an example implementation and Fig. 1(b) shows the corresponding timing waveforms. Each memory cell consists of an RRAM connected in series with a selector diode. Functions are implemented in SOP form using pre-programmed memory arrays. Separate arrays for the AND and OR operations are needed to implement the logical functions. Inputs are applied to the arrays through Wordlines (WLs). The final Bitline (BL) voltages are taken as outputs (AND array BLs implement minterms, and OR array BLs implement functions). Any LRS RRAM on an AND array BL is treated as a literal of that BL, and any LRS RRAM on an OR array BL serves as a minterm of the implemented function.
Initially, the AND array’s BLs are pre-charged to by asserting . Then, inputs are applied to the RRAMs by activating signal. If one of the literals on an AND array BL is ‘0’, the respective BL is discharged, its voltage drops below the Sense Amplifier (SA) reference voltage (), and the minterm is considered ‘0’. If all the inputs are logically ‘1’, the BL holds its pre-charged value, which is higher than the reference voltage, and the minterm is considered ‘1’.
The BLs of the OR array are initially pre-discharged to 0 V. After activating signal, if one of the input literals on a BL is logically ‘1’, it can charge the BL to a value higher than the OR array’s reference voltage (). Finally, the voltages of the OR array’s BLs are compared against at the edge of and the output is generated.
We have also implemented a previously proposed MAGIC architecture that employs memristors (RRAMs in this paper) to implement logic gates. A number of memristors serve as inputs with previously stored data, while an additional memristor acts as the output. Gates including MAGIC- , , , and are shown in Fig. 3 (a)-(d), respectively. MAGIC’s logical state is represented as a resistance, where HRS and LRS represent logical ‘0’ and ‘1’, respectively. Fig. 3 (e) shows the implementation of as an example. It consists of two 2-fanin AND gates and one 2-fanin OR gate. Here, the input RRAMs, and , are initialized to logical ‘1’ (LRS) and RRAM is initialized to logical ‘0’ (HRS). All output RRAMs are initialized to ‘0’ (HRS). In the first cycle, is computed by asserting its bitline driver () using the enable () signal. Since , ’s output RRAM switches from ‘0’ (HRS) to ‘1’ (LRS). Similarly, during the second cycle, when is asserted using the signal, ’s output RRAM remains at ‘0’ (HRS). During the final cycle, the bitline driver for the OR operation () is asserted using the signal. Fig. 3 (e) shows that the final output RRAM switches from ‘0’ to ‘1’ and reflects the correct output of .
II-C Background on Side Channel Attack
SCA is a powerful threat that targets weak implementations of systems on chips. SCA exploits unintentional signatures observed in physical channels such as timing, power consumption, and electromagnetic emanation, with the objective of recovering the sensitive data being processed, e.g., cryptographic keys. Since different data bits exhibit different physical signatures (power consumption, delay), SCA can unveil the data. SCA on memory components targets the Hamming distance, which is equal to the number of bit transitions. A statistical dependency is then tested between the hypothetical leakage (computed using simulations, test chips, and leakage models) and the measured leakage to guess the stored data.
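The statistical dependency test can be illustrated with a toy correlation analysis. The sketch below is not the procedure of any cited work; it assumes a hypothetical 4-bit word that is restored to the same secret value before each known write, and a noise-free leakage that is linear in the Hamming distance. All names and constants are illustrative.

```python
from statistics import mean


def hamming_distance(a, b):
    """Number of bit positions in which integers a and b differ."""
    return bin(a ^ b).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def guess_stored_word(written_words, measured_leakage, n_bits=4):
    """Return the candidate word whose hypothetical leakage correlates best."""
    def score(guess):
        hypothetical = [hamming_distance(guess, w) for w in written_words]
        return pearson(hypothetical, measured_leakage)
    return max(range(2 ** n_bits), key=score)


# Synthetic "measurement": leakage = 2.0 * Hamming distance + 0.5 (assumed model).
SECRET = 0b1010
WRITES = list(range(16))
MEASURED = [2.0 * hamming_distance(SECRET, w) + 0.5 for w in WRITES]
recovered = guess_stored_word(WRITES, MEASURED)
```

With real traces the leakage is noisy, so many more measurements are typically needed before the correct hypothesis stands out.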
II-D Adversarial Modeling of IMC Power/Timing
In order to correlate the IMC power/timing (extracted using SCA) with the appropriate gate type and fanin value, the adversary requires a pre-calculated power and timing model. This model can be developed easily if the adversary has access to foundry-calibrated device models. If not, the adversary can order a limited number of test chips that implement multiple small known functions using IMC. Such an opportunity is available through the shuttle programs of vendors like CMP and MOSIS. The adversary can then develop a model based on the power and timing distributions calculated for different gates and input sizes.
II-E Simulation Setup
Simulations are performed in HSPICE with the 65nm PTM technology, the ASU RRAM model, and a bi-directional selector diode model. Detailed parameters of the devices employed for simulations are shown in Table I. The test chips obtained by the adversary with the known functions/gates are subject to process variations. Therefore, the adversary will only have a distribution of power profiles and operation times. To represent this situation, we introduce process variations into the power-profile/operation-time modeling by performing Monte Carlo simulations with the parameters listed in Table II.
Table I: Device parameters employed for simulations
| Parameter | Value |
| MOSFET Gate Length | 65 nm |
| NMOS/PMOS Threshold Voltage | 423/-365 mV |
| BL Capacitance | 100 fF |
| RRAM Gap Min/Max/Oxide Thickness | 0.1/1.7/5 nm |
| Atomic Energy: Vacancy Generation/Recombination | 1.501/1.5 eV |
| RRAM Write Latency | 25 ns |
| RRAM HRS/LRS at 1.2 V | 6.7 MΩ/58.9 kΩ |
Table II: Process-variation parameters used for Monte Carlo simulations
| Parameter | Real Value | Variation (STD. Deviation) |
| RRAM LRS Gap | 0.1 nm | 7% |
| RRAM HRS Gap | 1.7 nm | 7% |
| MOS Oxide Thickness | 1.2 nm | 10% |
| MOS Gate Length | 65 nm | 10% |
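The Monte Carlo sampling of Table II can be sketched as follows. This is only an illustration of how the parameter sets could be drawn, assuming each percentage is the relative standard deviation of an independent Gaussian; the dictionary keys and function names are hypothetical, and the actual simulations are run in HSPICE.

```python
import random

# Nominal values and relative standard deviations taken from Table II.
NOMINAL = {
    "rram_lrs_gap_nm": 0.1,
    "rram_hrs_gap_nm": 1.7,
    "mos_oxide_thickness_nm": 1.2,
    "mos_gate_length_nm": 65.0,
}
RELATIVE_STD = {
    "rram_lrs_gap_nm": 0.07,
    "rram_hrs_gap_nm": 0.07,
    "mos_oxide_thickness_nm": 0.10,
    "mos_gate_length_nm": 0.10,
}


def sample_parameters(rng):
    """Draw one parameter set; each value is Gaussian around its nominal (assumption)."""
    return {
        name: rng.gauss(nominal, RELATIVE_STD[name] * nominal)
        for name, nominal in NOMINAL.items()
    }


def monte_carlo(n_points, seed=0):
    """Return n_points independent parameter sets from a seeded generator."""
    rng = random.Random(seed)
    return [sample_parameters(rng) for _ in range(n_points)]


samples = monte_carlo(100, seed=1)
```

Each sampled set would then be passed to the circuit simulator to obtain one point of the power-profile or operation-time distribution.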
III Attack on DCIM Architecture
The adversary can distinguish the power drawn by the OR and AND arrays of DCIM by examining their power signatures. The AND array is pre-charged by a PMOS transistor during the pre-charge phase; therefore, its power signature is negative (power is drawn from the voltage supply). The OR array is pre-discharged by an NMOS transistor during the pre-discharge phase; therefore, its power signature is positive (power is dissipated through the ground node).
III-A Attack Model 1
III-A1 Leveraging the power drawn by OR and AND arrays
In DCIM, computations are performed in two cycles between signal and signal ( and , and, and ) activation. SCARE especially considers the peaks in the power profile for three reasons: (a) signal activates large buffers (that are upsized to charge/discharge the BLs); (b) signal activates SAs (significant capacitive component); (c) short-circuit current flows from to ground through buffers/SAs (when both pull-up and pull-down networks are on).
Initially, the adversary chooses two consecutive time periods between and in each cycle to launch the attack. The adversary then asserts all the input signals to logical ‘1’ and records the power profile. The recorded power during the second cycle (i.e., the OR function) is matched with the power-profile readings from the models developed using multiple test chips/simulations. This step allows the adversary to determine the number of SOP minterms processed in the OR cycle (the OR function’s input literals). By setting all the input literals to logical ‘1’, the adversary ensures that each SOP minterm in the function also equals logical ‘1’. The BLs in the DCIM OR array are pre-discharged to ‘0’. Thus, applying logical ‘1’ to all SOP minterms ensures that the BLs are charged as fast as possible (leading to the highest possible power consumption).
Once the number of SOP minterms is determined, the adversary analyzes the power profile of the AND cycle to determine the number of input literals in each minterm. Each of the AND array input literals is set to logical ‘0’ to ensure that each SOP minterm equals ‘0’. By forcing each minterm to ‘0’, the adversary ensures that the BL, which is pre-charged to the supply voltage, discharges at the highest possible rate. Finally, by analyzing the power profile of the AND operation, the adversary determines the number of input literals for each AND gate (Fig. 1). Note that the leakage power must be subtracted to analyze the function-dependent power.
The above attack model works only with true inputs. If the function consists of complementary inputs, the attack will fail since adversary cannot confirm if all the minterms are ‘1’ by forcing all inputs to ‘1’.
III-A2 Simulation Results
DCIM OR Array: The current profiles of the OR gate with fanin ranging from 0 to 8 are shown in Fig. 3(a) (time offset is chosen when signal is activated). They indicate that OR gates with various fanins charge the BLs at different rates. Since the observed resistance of each BL decreases with the number of LRS inputs (more resistors in parallel), the current through the array increases with fanin. However, the separation between the current profiles of successive gates decreases with fanin. Over time, the current profiles of gates with different fanins become indistinguishable, and they merge once the BL has charged up. Therefore, selection of the measurement window is critical.
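The dependence on the measurement window can be illustrated with a first-order RC model. The sketch below is a simplification and not the HSPICE setup: it uses the LRS resistance and BL capacitance of Table I, ignores the selector diode and HRS leakage, and assumes a hypothetical 5 kΩ driver resistance (`R_DRIVER`) and a 1.2 V input level.

```python
import math

V_IN = 1.2        # volts; assumed input level (Table I quotes resistances at 1.2 V)
R_LRS = 58.9e3    # ohms; LRS resistance from Table I
C_BL = 100e-15    # farads; BL capacitance from Table I
R_DRIVER = 5e3    # ohms; hypothetical driver resistance, not from the paper


def bl_current(fanin, t):
    """Charging current of a pre-discharged BL driven by `fanin` parallel LRS cells."""
    if fanin == 0:
        return 0.0
    r_eq = R_DRIVER + R_LRS / fanin          # parallel cells in series with the driver
    return (V_IN / r_eq) * math.exp(-t / (r_eq * C_BL))


def min_separation(t, max_fanin=8):
    """Smallest current gap between adjacent fanins (0..max_fanin) at time t."""
    currents = [bl_current(n, t) for n in range(max_fanin + 1)]
    return min(abs(b - a) for a, b in zip(currents, currents[1:]))


early_gap = min_separation(0.0)    # right after the inputs are applied
late_gap = min_separation(5e-9)    # 5 ns later, when the BLs are largely charged
```

In this toy model the gap between adjacent fanins shrinks with fanin at the start of charging and nearly vanishes for some pairs a few nanoseconds later, which is consistent with the need for a short, early measurement window.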
Shorter measurement windows increase the resolution for distinguishing the fanins of OR gates (the best measurement window is shown in Fig. 3(b)). In the suggested measurement window, the currents of and gates differ by at least 5 A. This is evident from the Probability Density Function (PDF) of the current profiles of OR gates (Fig. 3(c)). The PDFs of and have a noticeable overlap. If the recorded current profile (from SCA) falls within this overlap, the adversary might need to consider both possibilities. In such cases, the Cumulative Distribution Function (CDF) may be useful to predict the number of inputs (Fig. 3(d)). Additional properties of the distribution, such as the mean and Standard Deviation (STD), may be used to accurately determine the gate fanin. The STD of OR gates with different fanins is presented in Fig. 3(e). Note that the STD does not follow a monotonically increasing or decreasing trend. In contrast, the mean values (Fig. 3(f)) exhibit a monotonically increasing trend with fanin. Thus, the adversary can leverage the CDF, PDF, and mean distributions of the OR gate current profiles collected from the test chips/simulation to analyze the current profile recorded by SCA.
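One way to handle a measurement that falls in the overlap of two PDFs is to weigh the candidate fanins by their likelihood. The sketch below assumes Gaussian current distributions and equal prior probability for each fanin; the means and standard deviations in `MODELS` are hypothetical placeholders, not values read from Fig. 3.

```python
from statistics import NormalDist

# Hypothetical per-fanin current models (arbitrary units): NormalDist(mean, std).
MODELS = {
    6: NormalDist(60.0, 3.0),
    7: NormalDist(66.0, 3.0),
    8: NormalDist(71.0, 3.0),
}


def fanin_probabilities(measured, models=MODELS):
    """Normalized likelihood of each fanin for one measurement (equal priors assumed)."""
    likelihoods = {fanin: dist.pdf(measured) for fanin, dist in models.items()}
    total = sum(likelihoods.values())
    return {fanin: value / total for fanin, value in likelihoods.items()}


def most_likely_fanin(measured, models=MODELS):
    """Fanin with the highest normalized likelihood."""
    probabilities = fanin_probabilities(measured, models)
    return max(probabilities, key=probabilities.get)


ambiguous = fanin_probabilities(68.4)   # lies between the fanin-7 and fanin-8 models
```

For an ambiguous measurement the probabilities of the two neighbouring fanins are comparable, which signals that both candidates should be kept for later validation.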
DCIM AND Array: The simulation setup for the DCIM AND array is similar to that of the OR array. However, the adversary applies SCA to extract the current profiles by monitoring the ground node. Additionally, the AND power is higher than the OR power. This can be attributed to the fact that similarly sized transistors act as input buffers: NMOS transistors have higher mobility than PMOS transistors and discharge the BL faster.
The current profiles of the AND gate with fanin ranging from 0 to 8 are shown in Fig. 4(a). Fig. 4(b) suggests the best measurement window to maximize the difference between the power profiles of various fanins. The PDF and CDF of the current distribution are shown in Fig. 4(c) and Fig. 4(d), respectively. Note that the current increases with fanin. Unlike the OR array, both the STD and mean value graphs of the AND array, shown in Fig. 4(e) and Fig. 4(f), respectively, increase monotonically with fanin. Therefore, the PDF, CDF, STD, and mean value distributions collected from the test chips/simulations can be leveraged by the adversary to identify the fanin of the AND gate from the recorded SCA current profile.
III-B Attack Model 2
III-B1 Leveraging the power drawn during pre-charge of AND Array
The adversary can identify the gate fanin by analyzing the current drawn during the pre-charge (pre-discharge) phase of the AND (OR) array (Fig. 5(a)). The adversary forces all minterms to ‘0’ by trying multiple input patterns and analyzing the OR array current. If the OR array current is in the range of the leakage current, the adversary concludes that all the minterms are ‘0’. Every active BL (a BL that participates in the implemented function) is discharged to if the adversary forces all minterms to ‘0’. This leads to the maximum current drawn during the AND array pre-charge phase. In the next pre-charge phase of the AND operation cycle, DCIM has to charge all active BLs up to and, based on the capacitor energy (), the adversary can find , which is the sum of the capacitances of all active BLs. The adversary knows the capacitance of each BL by modeling the BLs using accurate tools. Subsequently, the adversary can find the number of minterms by the analysis proposed in Section III-A.
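The capacitance-based estimate can be sketched numerically. The example below assumes the standard stored-energy relation E = ½CV², a full swing from 0 V to a 1.2 V supply, and the 100 fF BL capacitance of Table I; the function name and the assumed swing are illustrative and are not stated in this form in the text.

```python
def active_bitlines(precharge_energy_j, v_supply=1.2, c_bitline=100e-15):
    """Estimate the number of active BLs from the energy stored during pre-charge.

    Assumes E = 0.5 * C_total * V^2 with every active BL charged from 0 V to
    v_supply (an assumption), and identical capacitance on every BL.
    """
    c_total = 2.0 * precharge_energy_j / v_supply ** 2
    return round(c_total / c_bitline)


# Synthetic energy for five active BLs: 0.5 * (5 * 100 fF) * (1.2 V)^2 = 0.36 pJ.
energy = 0.5 * (5 * 100e-15) * 1.2 ** 2
n_active = active_bitlines(energy)
```

Rounding to the nearest integer absorbs small measurement errors, but process variation in the BL capacitance can still shift the estimate by one for large arrays.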
To find the pre-charge phase in the extracted power profile, the adversary looks for a large peak without a short-circuit component. Other large current peaks range from very large positive values to negative values due to the switching of CMOS gates, which draw short-circuit current (when both pull-up and pull-down networks are on). However, the pre-charge (pre-discharge) circuit includes only PMOS (NMOS) transistors, which leads to only negative current values for the AND array and only positive current values for the OR array. This attack model works for all functions (with true and complementary inputs) implemented by DCIM.
III-B2 Simulation Results
The simulation setup for DCIM attack model 2 is similar to that of DCIM attack model 1. However, the adversary examines the power signature during the pre-charge phase instead of the computation window. The AND array’s BLs are charged during the pre-charge phase, and the observed power profiles differ based on the number of BLs. The adversary uses the distinct power measured during the pre-charge phase to determine the number of BLs, which represents the number of minterms. The current profiles during the pre-charge phase of AND gates with fanins ranging from 0 to 8 are shown in Fig. 5(a). Fig. 5(b) suggests the best measurement window to maximize the difference between these current profiles. Additionally, the PDF and CDF distributions of the pre-charge currents are shown in Fig. 5(c) and Fig. 5(d), respectively.
Note that the observed pre-charge current increases with fanin. Both the STD and mean value graphs of the AND array, shown in Fig. 5(e) and Fig. 5(f), respectively, increase monotonically with fanin. Therefore, the PDF, CDF, STD, and mean value distributions collected from the test chips/simulations can be leveraged by the adversary to identify the fanin of the AND gate from the recorded SCA current profile.
III-C Analysis of the Impact of Supply Voltage Magnitude on SCARE Performance
We have also swept the magnitude of the supply voltage to analyze its impact on the performance of SCARE. For DCIM, we swept from V to V in 50 mV increments. The results are summarized in Fig. 6(a). The mean current of gates with different fanins increases at higher voltages. The differences between the mean currents also increase (higher slope) at higher voltages, which helps the adversary distinguish more accurately between different fanins. Sigma analysis of the current profiles shows that the current distributions are wider at higher voltages and sharper at lower voltages, i.e., sigma increases with voltage (Fig. 6(b)). As shown in Fig. 6(c), the CDF slope decreases at higher voltages, which confirms that sigma increases at higher voltages. Note that, under process variation, samples that fall into an adjacent fanin’s distribution are important and hard to distinguish. We have calculated the overlap between gates with fanins and (Fig. 6(d)) to gain insight into the overlap percentage at different supply voltages. Fig. 6(d) shows that the overlap is lowest for voltages near the nominal value and increases when the supply voltage magnitude increases or decreases. The worst case occurs at very low supply voltages, where the current magnitude is very small and a small variation in the RRAM resistance values can make the gates’ fanins ambiguous.
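The overlap percentage between two adjacent-fanin distributions can be computed directly if both are approximated as Gaussian. The sketch below uses Python's `statistics.NormalDist.overlap`; the Gaussian assumption and the numeric inputs are illustrative and do not reproduce Fig. 6(d).

```python
from statistics import NormalDist


def overlap_percent(mean_a, std_a, mean_b, std_b):
    """Percentage overlap of two Gaussian current distributions (Gaussian fit assumed)."""
    return 100.0 * NormalDist(mean_a, std_a).overlap(NormalDist(mean_b, std_b))


# Two unit-sigma distributions whose means are two sigma apart.
moderate = overlap_percent(0.0, 1.0, 2.0, 1.0)
# The same means with a wider spread overlap more.
wide = overlap_percent(0.0, 2.0, 2.0, 2.0)
```

A larger overlap means more measurements land in the ambiguous region, so the adversary needs additional statistics or repeated measurements to separate the two fanins.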
IV Attack on MAGIC Architecture
An adversary can distinguish between the power drawn by the OR and AND array of MAGIC by examining the operation time determined using the spike in the power signature that is created during operation. The order of magnitude of difference between OR and AND operation times ranges from 10X to 100X (as seen in Fig. 8).
IV-A Attack Model 1
IV-A1 Leveraging the power signature/operation time of OR and AND arrays
Unlike DCIM, MAGIC’s computation time depends on the type of gate and the fanin. An adversary can RE the MAGIC functions using the computation time extracted from the power profile.
MAGIC writes the result of a computation into a designated output RRAM by altering its resistance (HRS ‘0’ and LRS ‘1’). We observe a significant change in the power profile when the resistance of the output RRAM changes. This sharp change in the power profile (during writing the output to RRAM) signifies the end of one MAGIC operation (e.g. 3-input operation). Note that the adversary is capable of finding the computation times for different gates and inputs by implementing known functions in MAGIC test chips and/or simulations and recording their power profiles.
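Locating the end of an operation from a sampled trace can be sketched as a search for the steepest change in current. The following example is a minimal illustration on synthetic data; the function name and the sample values are hypothetical, and a real trace would need filtering before differentiation.

```python
def completion_time(times, currents):
    """Return the sample time at which the current changes most sharply.

    The slope between consecutive samples is used as a simple stand-in for the
    'sharp change' that marks the switching of the output RRAM.
    """
    steepest = max(
        range(1, len(currents)),
        key=lambda i: abs(currents[i] - currents[i - 1]) / (times[i] - times[i - 1]),
    )
    return times[steepest]


# Synthetic trace (arbitrary units): the output cell switches between t=2 and t=3.
trace_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
trace_currents = [1.00, 1.02, 1.05, 3.00, 3.01, 3.02]
t_done = completion_time(trace_times, trace_currents)
```

The extracted completion time can then be compared against the timing distributions collected from test chips or simulations.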
Alternatively, the adversary can observe the constant current passing through output RRAM. In this approach, each of the input literals is set to a logical ‘’ (MAGIC initializes all the input RRAMs to HRS, the output RRAMs of and gates to HRS, and the output RRAMs of gate to LRS). By measuring the current (minus the leakage current), the adversary can determine the current passing through the output RRAM. This allows the adversary to determine the gate implemented in a particular clock cycle (e.g. differentiate between the , and gates). The observed constant current (I) and the fanin value () can be attributed to each gate (, , and ) based on the following rules (considering a 8 input system):
Note that due to the process variations, might not be an integer and should be used to approximate to the nearest positive whole number (since fanin should be integer). The proposed attack model works only with true inputs. If the function consists of complementary inputs, the attack will fail since adversary cannot confirm if all the minterms are ‘1’ by forcing all inputs to ‘1’.
IV-A2 Simulation Results
To evaluate the above-mentioned attack model for MAGIC gates, IMC computations are performed for , , and gates. Note that MAGIC simulations for are not performed since it requires each of the input RRAMs to be initialized to HRS. Therefore, the output RRAM does not get enough voltage headroom to get written since the input RRAMs (in HRS) consume a high voltage across them. Further study is required to ensure the validity of the MAGIC design. For each of the remaining gates, an increasing ( gate) or decreasing ( and gates) trend of computation time is observed when the fanin of the logic array increases from 2 to 8. The computation completion is determined by a sharp change in the current profile during the switching of the output RRAM.
i) MAGIC AND Array: The distribution of AND operation times for fanin ranging from 2 to 8 is shown in Fig. 7(a). The computation times for each case are represented by a distinct distribution. Furthermore, the mean and STD of each distribution (Fig. 7(b) and 7(c), respectively) show a monotonically increasing trend as the fanin increases. Each of these graphs can be used to accurately determine the fanin of the AND gates.
ii) MAGIC OR Array: The distribution of OR operation times for fanin ranging from 2 to 8 is shown in Fig. 7(d). Unlike MAGIC AND, the computation time distributions for the different fanins overlap. Therefore, the PDF alone cannot be reliably used by an adversary to determine the gate fanin. The mean and STD of each distribution (Fig. 7(e) and 7(f), respectively) show a decreasing trend with fanin. We note that the MAGIC OR implementation is more resilient against SCA than AND and NOR. However, an adversary can still leverage the SCA data to predict the structure of the gate and the fanin value with reasonable accuracy.
iii) MAGIC NOR Array: The distribution of computation times for the NOR operation with fanin ranging from 2 to 8 is shown in Fig. 7(g). Similar to AND, the completion times for each case are represented by distinct CDFs. Furthermore, the mean and STD of each distribution (Fig. 7(h) and 7(i), respectively) show a distinctly decreasing trend as the fanin increases. Each of these graphs can be reliably used to determine the fanin of the NOR gate.
IV-B Attack Model 2
IV-B1 Leveraging pre-compute RRAM write operation times of AND or OR arrays
Since inputs are stored as resistance values in RRAMs and RRAM write operations are asymmetric, the adversary can find the values stored in the RRAMs by examining the power profile. The adversary can force each of the inputs to logical ‘1’ (LRS) and examine the RRAM write currents. Based on the observed write current, the adversary can determine the number of RRAM cells switched to ‘0’ (HRS) and the number of cells that remain at ‘1’ (LRS). IMC of any function through MAGIC occurs only after the RRAM cells corresponding to the inputs and output are initialized to HRS or LRS, depending on the function and the array operation (i.e., NOR, OR, etc.). Assuming a representative example function with 8 input literals, the MAGIC architecture employs 8 input RRAMs and 1 output RRAM, each of which is preset to the ‘0’ (HRS) state. To execute a particular function, some or all of these 9 RRAMs are switched to ‘1’ (LRS). Note that the power consumed for switching different numbers of RRAMs (0 to 9 in this case) is distinct and can be extracted through SCA.
In addition to the case-by-case method explained here, the adversary can determine the number of HRS RRAMs () by using the following equations (similar to equations mentioned in Section IV-A):
IV-B2 Simulation Results
The simulation setup for attack model 2 on MAGIC is similar to that of model 1, but with an added step: observing the power profile during RRAM initialization. This added step reveals the complementary inputs.
A 100-point Monte Carlo analysis is performed on the RRAM write current with the settings shown in Table II. The resulting current distribution, average current, and STD of the distribution are shown in Fig. 8(a), 8(b), and 8(c), respectively. It is evident that the number of RRAMs initialized to ‘1’ and ‘0’ can be found. To determine the inputs whose complementary values are used, each input is flipped from its original value (‘1’) one at a time. If a change in an input value (1 → 0) leads to an increase in the number of ‘0’s (HRS RRAMs), determined by re-examining the power signature, we can deduce that the original value of the input is used in the function. Alternatively, if the number of ‘0’s decreases, the input’s complementary value is used in the function. In this way, the adversary can extract the structure of a function with true and complementary inputs.
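The one-at-a-time flipping procedure can be sketched as follows. The function `count_hrs` is a hypothetical stand-in for the SCA measurement that returns the number of HRS cells for a given input pattern; the "unused" outcome for inputs that cause no change is an added assumption for illustration.

```python
def count_hrs(inputs, polarity):
    """Hypothetical oracle: number of input RRAMs initialized to '0' (HRS).

    polarity[i] is True if input i is used in true form, False if complemented,
    and None if it is not part of the function. A real attack would obtain this
    count from the initialization power signature instead.
    """
    count = 0
    for value, pol in zip(inputs, polarity):
        if pol is None:
            continue
        stored = value if pol else 1 - value
        count += stored == 0
    return count


def infer_polarity(n_inputs, measure):
    """Flip each input 1 -> 0 in turn and classify it from the change in HRS count."""
    baseline = measure([1] * n_inputs)
    result = []
    for i in range(n_inputs):
        pattern = [1] * n_inputs
        pattern[i] = 0
        delta = measure(pattern) - baseline
        result.append("true" if delta > 0 else "complement" if delta < 0 else "unused")
    return result


secret = [True, False, None, True]     # illustrative function structure
found = infer_polarity(4, lambda pattern: count_hrs(pattern, secret))
```

The procedure needs only one baseline measurement plus one measurement per input, so its cost grows linearly with the number of inputs.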
IV-C Analysis of the Impact of Supply Voltage Magnitude on SCARE Performance
In the case of MAGIC, the supply voltage is swept from 2.2 V to 3 V in 100 mV increments. As shown in Fig. 10(a) and 10(b), the mean operation times decrease as the supply voltage increases. Furthermore, the standard deviation decreases at all supply voltages as the fanin increases. Fig. 10(c) and 10(d) show that the standard deviation of the operation also decreases with increase in and shows a mostly negative trend with the change in fanin. Fig. 12 shows that the slope increases as the value increases and shows a decrease in sigma value. The slope of is extremely high and is therefore not shown in Fig. 12 since they overlap with the CDF at .
V Countermeasures
We propose the following countermeasures to protect IMC architectures against SCARE.
V-A Redundant Inputs
DCIM: A few redundant LRS RRAMs, biased with a fixed voltage, can be implemented on each BL.
For instance, function can be implemented as: . With this method the area increases, and the overhead depends on the maximum number of redundant inputs and the number of inputs (e.g., four redundant inputs for a function with eight inputs increase the area by 0.25%). As long as the number of redundant LRS RRAMs on each BL is less than 8, the SM stays relatively constant. Based on our simulations, eight redundant inputs are enough to mask an array with 64 inputs, which increases the area by 6%. The number of LRS RRAMs on each BL can be randomly distributed to further obfuscate the structure of the implemented function. The power overhead depends entirely on the number of redundant inputs on each BL. An example of masking with two redundant inputs is shown in Fig. 10. The power profile with two redundant inputs completely overlaps with the power profile with no redundant inputs; therefore, the power signature is obfuscated. In this example, the power overhead is 21%.
Since the selector diode turns off when the voltage across its two terminals is less than , the redundant inputs of the AND (OR) array should not be driven by (‘0’); instead, they should be driven by for the AND array and for the OR array for better obfuscation.
MAGIC: In the case of MAGIC, redundant inputs increase the fanin and thus the number of input RRAM bitcells. As shown in Section IV, increasing the number of inputs by even one literal produces a distinguishable change in the operation completion time. This change, depicted in Fig. 8, can be leveraged to mask the true structure of any MAGIC implementation for any of the operations.
V-B Minterms with expanded literals
DCIM: Each minterm in a function can be implemented with the maximum number of inputs. In this scenario, all minterms show the same power profile and SCA alone fails. However, an adversary can still try all possible input patterns and generate input-output pairs. These pairs can then be used to determine the function, for example by using a Karnaugh map to obtain the simplified Boolean expression.
For example, can be implemented in a 4 input system as, . Furthermore, it will become complicated for the adversary to find the function when and have the same number of minterms in the expanded version and when has more minterms than . This technique can protect the IP at the cost of increased area and power overhead. An example of masking the function with this technique is shown in Fig. 13, in which the two functions consume the same power. The power and area overheads depend on the implemented function. For the example in Fig. 13, power consumption increases by 36% while the AND array area stays the same (since the crossbar array is already present and has enough BLs to implement the minterms). However, the OR array’s area increases by 50% (the number of WLs changes).
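The exhaustive input-output attack that remains possible against DCIM with expanded minterms can be sketched as follows. This is an illustrative sketch only: `hidden_function` is a hypothetical stand-in for the protected IMC function, and the code stops at the canonical (unsimplified) SOP form; the Karnaugh-map simplification would be applied to the resulting minterm list.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def extract_minterms(black_box: Callable[..., int],
                     n_inputs: int) -> List[Tuple[int, ...]]:
    """Apply all 2**n input patterns and keep those producing output 1."""
    minterms = []
    for bits in product((0, 1), repeat=n_inputs):
        if black_box(*bits):
            minterms.append(bits)
    return minterms


def to_canonical_sop(minterms: Sequence[Tuple[int, ...]],
                     names: Sequence[str]) -> str:
    """Render the recovered minterms as a canonical SOP string (' marks a complement)."""
    terms = []
    for bits in minterms:
        literals = [name if bit else name + "'" for name, bit in zip(names, bits)]
        terms.append("".join(literals))
    return " + ".join(terms) if terms else "0"


def hidden_function(a: int, b: int, c: int) -> int:
    """Hypothetical protected function, used only to exercise the sketch."""
    return int((a and b) or c)


minterms = extract_minterms(hidden_function, n_inputs=3)
sop = to_canonical_sop(minterms, names=("a", "b", "c"))
print(len(minterms), "minterms:", sop)
```

The number of queries grows as 2^n in the number of inputs, which is why forcing the adversary back to this exhaustive search raises the RE effort.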
MAGIC: Similarly for MAGIC, we consider two representative example functions and . The first function requires a 2-fanin AND and a 2-fanin OR operation, while the second function requires three 3-fanin ANDs and one 3-fanin OR operation. Expanding these functions into their maximized SOP form requires six 6-fanin ANDs and one 6-fanin OR operation for both functions. Since these operations are identical, SCA delivers the same result for both, which masks the true structure of the operation. Fig. 13(a) shows that 2-fanin and 3-fanin ANDs have distinctly different operation completion times, as depicted by a sharp change in their current profiles. The 4-fanin AND for both operations in their maximized SOP form is shown to be identical. Similarly, Fig. 13(b) shows the distinctly different 2-fanin and 3-fanin current profiles of each function’s OR operation and depicts the identical 6-fanin OR current profiles for their maximized form. This technique does not incur any area overhead, since the maximized SOP form only uses RRAM cells already present in the crossbar array for the additional literals. However, it incurs some power overhead due to the increase in the number of SOP minterms to be computed.
The above-mentioned countermeasures increase the RE effort. For example, without any countermeasures would require 84 combinations of inputs to determine the function structure. Implementing the countermeasures increases the required number of combinations to 256 (RE effort increases by 3.04X). This increased effort enhances the resiliency of the IMC architectures against SCARE. The RE effort increases exponentially with the number of inputs.
VI-A Extracting the Exact Function
For MAGIC, in the absence of parallelism, the adversary can find the number of input literals for each minterm and test possible patterns to determine the exact minterm that correlates to a particular function. The adversary can repeat this approach to find all minterms one by one. For DCIM, the adversary can determine the number of input literals per minterm after calculating the number of minterms. Finally, by examining the output, the adversary can try multiple patterns to determine the exact correlating minterms.
VI-B Extracting Multiple Functions
For DCIM, the total number of functions implemented can be determined from the number of outputs generated. For each of these functions, the number of minterms and the number of input literals per minterm can be extracted by following the method described in this paper. The adversary then manipulates the input values to determine the correlation of each minterm with each function output. Since MAGIC does not support parallelism, the adversary can easily determine the number of input literals per minterm and try various patterns to correlate each minterm with each function.
VI-C Number of Test Chips Needed for an Attack
We performed 1000-point Monte Carlo simulations to develop SCARE’s power models for DCIM/MAGIC. In the absence of models, the adversary can fabricate a few chips to launch the attack with a minor loss in accuracy. Compared to the 1000-point Monte Carlo results, the mean value of the worst-case margin (e.g., the margin between AND7/AND8 for DCIM and OR7/OR8 for MAGIC) degrades by 3.5%, 3.2%, and 1.8% (for DCIM) and by 5.2%, 4.77%, and 2.8% (for MAGIC) for 25, 50, and 100 chips, respectively. Furthermore, the standard deviation increases by 18.5%, 15.9%, and 4.9% for DCIM and by 21.3%, 9%, and 7.9% for MAGIC. The adversary can minimize measurement noise by taking multiple samples and averaging them. This approach is possible for IoT/mobile applications where the adversary has physical possession of the device.
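The effect of averaging repeated measurements can be illustrated with a small simulation. It assumes independent, zero-mean Gaussian measurement noise, which is a modeling assumption and not a claim about the actual measurement setup; `TRUE_POWER` and `NOISE_STD` are hypothetical values in arbitrary units.

```python
import random
import statistics

TRUE_POWER = 1.0   # hypothetical noiseless power value (arbitrary units)
NOISE_STD = 0.2    # hypothetical measurement-noise standard deviation


def averaged_measurement(n_samples: int, rng: random.Random) -> float:
    """Average n_samples noisy readings of the same operation."""
    readings = [rng.gauss(TRUE_POWER, NOISE_STD) for _ in range(n_samples)]
    return statistics.mean(readings)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Repeat the whole experiment many times to see how much the estimate scatters.
single_shot = [averaged_measurement(1, rng) for _ in range(500)]
averaged_100 = [averaged_measurement(100, rng) for _ in range(500)]

spread_single = statistics.pstdev(single_shot)
spread_averaged = statistics.pstdev(averaged_100)
print(f"spread with 1 sample:    {spread_single:.4f}")
print(f"spread with 100 samples: {spread_averaged:.4f}")
```

Under the independence assumption the spread of the averaged estimate shrinks roughly with the square root of the number of samples, so the narrow worst-case margins become easier to resolve.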
VI-D Realistic Attack & Parallelism
IMC architectures include MAGIC, DCIM and matrix-vector multipliers (MVM). DCIM and MAGIC can implement arbitrary functions while MVM can only implement dot product.
Under parallel operations/functions, the adversary can find the number of functions by observing the number of outputs in DCIM/MAGIC. In DCIM, the number of outputs equals the number of NOR gates. The power in the second cycle yields the NOR gate fanins (e.g., two , one ). The power in the first cycle yields the number of AND gates and their fanins. After determining the total number of AND/OR gates, the adversary can run a limited number of input patterns to relate the input bits to the corresponding observed output bit. Compared to brute force, SCARE reduces the number of test patterns needed to RE the functionality, e.g., 62.5% fewer patterns to identify .
In MAGIC, designers need multiple arrays to implement functions in parallel, so power peaks might overlap. The number of functions can be found from the magnitude of the power (e.g., for writing 1- vs 2-output RRAMs). While operations on multiple arrays can cause overlapping power spikes due to more than one AND/OR, the adversary can determine the individual gates by decomposing the total power into a sum of the individual gate powers modeled before. The timing difference between the completion of gate operations can also be exploited.
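Decomposing an overlapping power peak into a sum of modeled gate powers can be sketched as a small search. This is only an illustration of the idea: the gate labels and power values in `GATE_POWER` are hypothetical, the additive model ignores any interaction between arrays, and the exhaustive search is one possible way to perform the decomposition.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple


def decompose_power(total: float,
                    gate_power: Dict[str, float],
                    max_gates: int,
                    tol: float) -> List[Tuple[str, ...]]:
    """Return every multiset of gates whose modeled powers sum to `total` within `tol`."""
    matches = []
    names = sorted(gate_power)
    for k in range(1, max_gates + 1):
        for combo in combinations_with_replacement(names, k):
            if abs(sum(gate_power[g] for g in combo) - total) <= tol:
                matches.append(combo)
    return matches


# Hypothetical per-gate peak powers (arbitrary units) taken from a power model.
GATE_POWER = {"AND2": 1.0, "AND3": 1.7, "OR2": 2.4}

# Hypothetical measured peak when two arrays operate at the same time.
candidates = decompose_power(4.1, GATE_POWER, max_gates=2, tol=0.05)
print(candidates)
```

When several combinations fall within the tolerance, the decomposition is ambiguous, and the timing difference between gate completions mentioned above would be needed to choose between them.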
VI-E SCARE on Conventional Memory
Although not covered in this paper, SCARE is also applicable to SRAM-based IMC such as X-SRAM, since the power timing profile of the read bitline during computation depends on the input patterns (e.g., the read bitline discharges faster for function if A = 1/B = 1 compared to A = 0/B = 1 or A = 1/B = 0). These studies are the subject of our future research.
VI-F Hybrid Architectures
SCARE will experience noise from CMOS logic when IMC is mixed with CMOS gates. MAGIC-based IMC (pipelined/non-pipelined) involves high-power, long-latency write operations that can be distinguished from CMOS logic power. Furthermore, for non-pipelined implementations of DCIM/MAGIC, the CMOS logic computes after the IMC and can be separated in time. Pipelined implementations of DCIM combine CMOS and IMC power, which could make them difficult to distinguish. This could be a subject of future studies.
VI-G Prior Knowledge on the Implementation
The adversary can identify the sequence of gates without any prior knowledge. Note that there are only two efficient methods for function implementation in IMC, namely SOP or Product-of-Sum (POS). For DCIM, the adversary can distinguish between SOP/POS by observing the polarity of the current drawn during the pre-charge phase of each cycle, e.g., negative current (drawn from the voltage supply) for the AND array and positive current (drawn by the ground node) for the OR array. For MAGIC, the adversary can identify the function (AND/OR/NOR) from the distinguishable difference in the latency of the gate operation extracted from the power profile. Therefore, functions do not need to be in AND-OR format.
If IMC-paradigm (i.e., MAGIC vs DCIM) is unknown, adversary can identify IP by screening the power profiles (). DCIM’s output is sensed by a sense-amplifier (low-power) while MAGIC’s output is written to an output RRAM (high-power). Non-parallel implementation of MAGIC uniquely exhibits multiple peaks in the power profile compared to DCIM.
VI-H Applicability of Existing SCA Obfuscation Techniques
SCA obfuscation techniques such as propose to inject random code execution to scramble the power profile and prevent SCA on cryptographic implementations. Such protection techniques, if extended to IMC architectures, will impose significant throughput overhead since random functions inserted between the actual ones will incur extra delay. This is in addition to area and power overheads. In , duplicating logic with complementary operations is proposed to eliminate the asymmetry between the power drawn to process 0 and 1. This technique will not protect IMC against SCA since the function and its complement may have different numbers of minterms and may therefore consume different amounts of power.
VI-I HSPICE Modeling & Fabrication
Conventional random logic can include various gate flavors (NAND/NOR/AOI/OAI/AND/OR/INV), which makes an HSPICE-model-based (or experimental-chip-based) attack challenging. However, an IMC circuit is systematic and, for simplicity, only includes NAND/NOR gates due to the SOP/POS implementation. Furthermore, IMC using emerging NVMs provides distinct, high-amplitude power signatures compared to CMOS gates. Therefore, RE of functionality will be challenging in CMOS even if the adversary has an accurate power model of individual gates. In SCARE, we assume (Section III-A2) that the adversary can fabricate a few test chips (costly) to characterize the power signature of individual gates in IMC if a model is not available. This is achievable by multiplexing the power of multiple gates and enabling them one at a time.
The adversary will extract the power/timing model from their own fabricated chips, which include test features to characterize individual gates, although this will require high-precision equipment and time (due to multiple measurements). Obtaining the PDK from the vendor is considered easy (although an NDA may be needed). Power/timing analysis of the victim chip will be more challenging.
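Once per-gate models are available, matching a probed value against them can be sketched as a nearest-distribution lookup. The matching metric here (smallest distance from the modeled mean, normalized by the modeled standard deviation) is an assumption chosen for illustration, and the entries of `MODELS` are hypothetical numbers, not values from the simulations reported in this paper.

```python
from typing import Dict, Tuple


def classify_gate(measured: float,
                  models: Dict[str, Tuple[float, float]]) -> str:
    """Pick the gate whose modeled (mean, std) best explains the measurement."""
    def normalized_distance(label: str) -> float:
        mean, std = models[label]
        return abs(measured - mean) / std
    return min(models, key=normalized_distance)


# Hypothetical (mean, std) of a power or timing feature per gate and fanin.
MODELS = {
    "AND2": (10.0, 0.5),
    "AND3": (12.0, 0.5),
    "AND4": (14.5, 0.6),
}

guess = classify_gate(12.3, MODELS)
print(guess)
```

The classification becomes unreliable when neighboring distributions overlap, which is why the worst-case margin between adjacent fanins matters for the attack.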
This paper proposes SCARE, a non-invasive RE attack on IMC using SCA, for the first time. SCARE is applied to two well-known emerging-technology-based IMC architectures (DCIM and MAGIC). The adversary extracts power/timing distributions from well-calibrated simulations or IMC test chips with known functions. Next, the functions are extracted by matching the probed power and timing profiles with the modeled profiles for various gates and fanins. We also present possible countermeasures to mitigate the attack.
Acknowledgement: This work is supported by SRC (2847.001) and NSF (CNS-1722557, CCF-1718474, CNS-1814710, DGE-1723687, and DGE-1821766).
- (2018-12) X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories. IEEE Transactions on Circuits and Systems I: Regular Papers 65 (12), pp. 4219–4232.
- (2018-06) Imaging-In-Memory Algorithms for Image Processing. IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS1).
- (Website)
- (2005) Cache-Timing Attacks on AES.
- (2001) Electromagnetic Analysis: Concrete Results. Cryptographic Hardware and Embedded Systems.
- (2012-05) Metal-Oxide RRAM. Proceedings of the IEEE 100 (6).
- (2015-06) A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. Annual International Symposium on Computer Architecture (ISCA).
- (2015-06) PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture. Annual International Symposium on Computer Architecture (ISCA).
- (2017-04) . IEEE Journal of Solid-State Circuits (JSC).
- (2007) A Smart Random Code Injection to Mask Power Analysis Based Side Channel Attacks. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS ’07, New York, NY, USA, pp. 51–56.
- (2011) One Selector-One Resistor (1S1R) Crossbar Array for High-Density Flexible Memory Applications. Electron Devices Meeting (IEDM).
- (2014-09) MAGIC-Memristor-Aided Logic. IEEE Transactions on Circuits and Systems II: Express Briefs 61 (11), pp. 895–899.
- (2016-06) Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication. Proceedings of the 53rd Annual Design Automation Conference (DAC).
- (2017-01) MPIM: Multi-Purpose In-Memory Processing Using Configurable Resistive Memory. Asia and South Pacific Design Automation Conference (ASP-DAC).
- An Energy-Efficient VLSI Architecture for Pattern Recognition via Deep Embedding of Computation in SRAM. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8326–8330.
- (2017-11) Side-Channel Attack on STTRAM Based Cache for Cryptographic Application. IEEE International Conference on Computer Design (ICCD).
- (2016-05) Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC). IEEE Transactions on Nanotechnology 15 (4), pp. 635–650.
- PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. Annual International Symposium on Computer Architecture (ISCA).
- (Website)
- (2006) Minimality of the Hamming Weight of the τ-NAF for Koblitz Curves and Improved Combination with Point Halving. In Selected Areas in Cryptography, B. Preneel and S. Tavares (Eds.), Berlin, Heidelberg, pp. 332–344.
- (2014-10) Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22 (10).
- (2016-06) Pinatubo: A Processing-In-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories. ACM/EDAC/IEEE Design Automation Conference (DAC).
- (2019-10) Dynamic Computing in Memory (DCIM) in Resistive Crossbar Arrays. ICCD.
- (2013) On Measurable Side-Channel Leaks Inside ASIC Design Primitives. Cryptographic Hardware and Embedded Systems.
- (2017-05) In-Memory Processing Paradigm for Bitwise Logic Operations in STT-MRAM. IEEE Transactions on Magnetics 53 (11).
- (2017-03) Design and Benchmarking of Ferroelectric FET Based TCAM. Design, Automation & Test in Europe Conference & Exhibition (DATE).