I Introduction
The advent of DNNs has drastically advanced the field of machine learning by enabling super-human accuracies for many cognitive tasks involved in image, video, and natural language processing [wireddnns]. However, DNNs present a high computation cost that severely limits their ubiquitous adoption in energy- and cost-constrained IoT devices [swagathaspdac16]. The use of lower precision to represent the weights and activations in DNNs is a promising technique for realizing DNN inference (evaluation of pre-trained DNN models) on energy-constrained platforms [XNORNet, BC, DoReFaNet, TC, TTQ, WRPN, TNN, HitNet, FGQ, QuantRNN, PACT]. Reduced bit-precision can lower all facets of energy consumption, including computation, memory, and interconnects. Current commercial hardware [nvidiavoltav100, edgetpu] includes widespread support for 8-bit and 4-bit fixed-point DNN inference, and recent research has continued the push towards even lower precision [XNORNet, BC, DoReFaNet, TC, TTQ, TNN, WRPN, HitNet, FGQ].
Studies to date [XNORNet, BC, DoReFaNet, TC, TTQ, WRPN, TNN, HitNet, FGQ, CompensatedDNN] suggest that among low-precision networks, ternary DNNs represent a promising sweet spot, as they enable low-power inference with high application-level accuracy. This is illustrated in Figure 1, which shows the reported accuracies of various state-of-the-art binary [XNORNet, BC, DoReFaNet], ternary [TC, TTQ, WRPN, TNN, HitNet, FGQ], and full-precision (FP32) networks on complex image classification (ImageNet) and language modeling (PTB [PTB]) tasks. We observe that the accuracy degradation of binary DNNs over the FP32 networks can be considerable (5-13% on image classification, 150-180 PPW [Perplexity Per Word] on language modeling). In contrast, ternary DNNs achieve accuracy significantly better than binary networks (and much closer to FP32 networks). In this work, we focus on the design of specialized hardware for realizing various state-of-the-art ternary DNNs.

Ternary networks greatly simplify the multiply-and-accumulate (MAC) operation that constitutes 95-99% of total DNN computations. Consequently, the amount of energy and time spent on DNN computations can be drastically improved by using lower-precision processing elements (the complexity of a MAC operation has a super-linear relationship with precision). However, when classical accelerator architectures (e.g., TPUs and GPUs) are adopted to realize ternary DNNs, on-chip memory, wherein the data elements within a memory array are read sequentially (row-by-row), becomes the energy and performance bottleneck. In-memory computing [XNORRRAM, BinaryRRAM, reno, prime, Inmemclassifier, ConvRAM, RLui:2018, XNORSRAM, XcelRAM] is an emerging computing paradigm that overcomes memory bottlenecks by integrating computations within the memory array itself, enabling much greater parallelism and eliminating the need to transfer data to/from memory. This work explores in-memory computing in the specific context of ternary DNNs and demonstrates that it leads to significant improvements in performance and energy efficiency.

While several efforts have explored in-memory accelerators in recent years, TiM-DNN differs in significant ways and is the first to apply in-memory computing (massively parallel vector-matrix multiplications within the memory array itself, in an analog fashion) to ternary DNNs using a new CMOS-based bitcell. Many in-memory accelerators use non-volatile memory (NVM) technologies such as PCM and ReRAM [XNORRRAM, BinaryRRAM, reno, prime] to realize in-memory dot-product operations. While NVMs promise much higher density and lower leakage than CMOS memories, they are still an emerging technology with open challenges such as large-scale manufacturing yield, limited endurance, high write energy, and errors due to device- and circuit-level non-idealities [mnsim, rxnn]. Other efforts have explored SRAM-based in-memory accelerators for binary networks [Inmemclassifier, ConvRAM, RLui:2018, XNORSRAM, XcelRAM] and SRAM-based near-memory accelerators for ternary networks [TNN, BRein]. However, the restriction to binary networks is a significant limitation, as binary networks known to date incur a large drop in accuracy, as highlighted in Figure 1; and the near-memory accelerators are bottlenecked by the on-chip memory, where only one memory row is enabled per access. Extended SRAMs that perform bitwise binary operations in-memory may be augmented with near-memory logic to perform higher-precision computations in a bit-serial manner [neuralCache].
However, such an approach requires multiple sequential execution steps (array accesses) to realize even one multiplication operation (and many more to realize dot-products), limiting efficiency. In contrast, we propose TiM-DNN, a programmable in-memory accelerator that can realize massively parallel signed ternary vector-matrix multiplications per array access. TiM-DNN supports various ternary representations, including unweighted (-1, 0, 1), symmetric weighted (-a, 0, a), and asymmetric weighted (-a, 0, b) systems, enabling it to execute a broad range of ternary DNNs.
The building block of TiM-DNN is a new memory cell, the Ternary Processing Cell (TPC), which functions as both a ternary storage unit and a scalar ternary multiplication unit. Using TPCs, we design TiM tiles, i.e., specialized memory arrays that execute signed ternary dot-product operations. TiM-DNN comprises a plurality of TiM tiles arranged into banks, wherein all tiles compute signed vector-matrix multiplications in parallel.
In summary, the key contributions of our work are:

We present TiM-DNN, a programmable in-memory accelerator supporting various ternary representations, including unweighted (-1, 0, 1), symmetric weighted (-a, 0, a), and asymmetric weighted (-a, 0, b) systems, for realizing a broad range of ternary DNNs.

We propose a Ternary Processing Cell (TPC) that functions as both a ternary storage unit and a ternary scalar multiplication unit, and a TiM tile, i.e., a specialized memory array that realizes signed vector-matrix multiplication operations with ternary values.

We develop an architectural simulator for evaluating TiM-DNN, with array-level timing and energy models obtained from circuit-level simulations. We evaluate an implementation of TiM-DNN in 32nm CMOS using a suite of 5 popular DNNs designed for image classification and language modeling tasks. A 32-tile instance of TiM-DNN achieves a peak performance of 114 TOPs/s, consumes 0.9W power, and occupies 1.96 mm2 chip area, representing a 300X improvement in TOPS/W compared to a state-of-the-art NVIDIA Tesla V100 GPU [nvidiavoltav100]. In comparison to low-precision accelerators [neuralCache, BRein], TiM-DNN achieves a 55.2X-240X improvement in TOPS/W. TiM-DNN also obtains a 3.9x-4.7x improvement in system energy and a 3.2x-4.2x improvement in performance over a well-optimized near-memory accelerator for ternary DNNs.
II Related Work
In recent years, several research efforts have focused on improving the energy efficiency and performance of DNNs at various levels of design abstraction. In this section, we limit our discussion to efforts on in-memory computing for DNNs [XNORRRAM, BinaryRRAM, reno, prime, Inmemclassifier, ConvRAM, RLui:2018, XNORSRAM, XcelRAM, neuralCache, dotSRAM, computemem, sttcim, neurocube].
Table I classifies prior in-memory computing efforts based on the memory technology (CMOS or non-volatile memory (NVM)) and the targeted precision. One group of efforts [XNORRRAM, BinaryRRAM, reno, prime] has focused on in-memory DNN accelerators using emerging NVM technologies such as PCM and ReRAM. Although NVMs promise density and low leakage relative to CMOS, they still face several open challenges such as large-scale manufacturing yield, limited endurance, high write energy, and errors due to device- and circuit-level non-idealities [mnsim, rxnn]. Efforts on SRAM-based in-memory accelerators can be classified into those that target binary [Inmemclassifier, ConvRAM, RLui:2018, XNORSRAM, XcelRAM] and high-precision [neuralCache, dotSRAM, computemem] DNNs. Accelerators targeting binary DNNs [Inmemclassifier, ConvRAM, RLui:2018, XNORSRAM, XcelRAM] can execute massively parallel vector-matrix multiplications per array access. However, the restriction to binary networks is a significant limitation, as binary networks known to date incur a large drop in accuracy, as highlighted in Figure 1. Efforts [neuralCache, dotSRAM, computemem] that target higher-precision (4-8 bit) DNNs require multiple execution steps (array accesses) to realize signed dot-product operations, wherein both weights and activations are signed numbers. For example, Neural Cache [neuralCache] computes bitwise Boolean operations in-memory but uses bit-serial near-memory arithmetic to realize multiplications, requiring several array accesses per multiplication operation (and many more to realize dot-products). Apart from in-memory computing efforts, Table I also details efforts targeting near-memory accelerators for ternary networks [TNN, BRein]. However, the efficiency of these near-memory accelerators is limited by the on-chip memory, as they can enable only one memory row per access. Further, none of these efforts support asymmetric weighted (-a, 0, b) ternary systems.
In contrast to previous proposals, TiM-DNN is the first specialized, programmable in-memory accelerator for ternary DNNs that supports various ternary representations, including unweighted (-1, 0, 1), symmetric weighted (-a, 0, a), and asymmetric weighted (-a, 0, b) systems. TiM-DNN utilizes a new CMOS-based bitcell (i.e., the TPC) and enables multiple memory rows simultaneously to realize massively parallel in-memory signed vector-matrix multiplications with ternary values per memory access, enabling the efficient realization of ternary DNNs. As shown in our experimental evaluation, TiM-DNN achieves a 3.9x-4.7x improvement in system-level energy and a 3.2x-4.2x speedup over a well-optimized near-memory accelerator. In comparison to the near-memory ternary accelerator of [BRein], it achieves a 55.2X improvement in TOPS/W.
III TiM-DNN architecture
In this section, we present the proposed TiM-DNN accelerator along with its building blocks, i.e., Ternary Processing Cells and TiM tiles.
III-A Ternary Processing Cell (TPC)
To enable in-memory signed multiplication with ternary values, we present a new Ternary Processing Cell (TPC) that operates as both a ternary storage unit and a ternary scalar multiplication unit. Figure 2 shows the proposed TPC circuit, which consists of two cross-coupled inverters for storing two bits ('A' and 'B'), a write wordline, two source lines, two read wordlines, and two bitlines (BL and BLB). A TPC supports two operations: write and scalar ternary multiplication. A write operation is performed by enabling the write wordline and driving the source lines and the bitlines to either VDD or 0 depending on the data. We can write both bits simultaneously, with 'A' written using BL and its source line, and 'B' written using BLB and its source line. Using bits 'A' and 'B', a ternary value (-1, 0, 1) is inferred based on the storage encoding shown in Figure 2 (table on the left). For example, when A=0 the TPC stores W=0. When A=1 and B=0 (B=1), the TPC stores W=1 (W=-1).
A scalar multiplication in a TPC is performed between a ternary input and the stored weight to obtain a ternary output. The bitlines are precharged to VDD, and subsequently the ternary inputs are applied to the read wordlines based on the input encoding scheme shown in Figure 2 (table on the right). The final bitline voltages depend on both the input (I) and the stored weight (W). The table in Figure 3 details the possible outcomes of the scalar ternary multiplication (W*I), with the final bitline voltages and the inferred ternary output (Out). For example, when W=0 or I=0, the bitlines remain at VDD and the output is inferred as 0 (W*I=0). When W=I=1, BL discharges by a certain voltage, denoted ΔV, and BLB remains at VDD; this is inferred as Out=1. In contrast, when W and I have opposite signs, BLB discharges by ΔV and BL remains at VDD, producing Out=-1. The final bitline voltages are converted to a ternary output using single-ended sensing at BL and BLB. Figure 3 depicts the output encoding scheme and the results of SPICE simulations of the scalar multiplication operation with the various possible final bitline voltages. Note that the TPC design uses separate read and write paths to avoid read-disturb failures during in-memory multiplications.
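The TPC's storage and multiplication behavior can be summarized by a small behavioral model. This is a functional sketch only: the (A, B) bit assignments follow our reading of Figure 2, and all function names are ours, not the paper's.

```python
def encode_weight(w):
    """Map a ternary weight to the (A, B) storage bits (per Figure 2,
    as we read it: A=0 -> W=0; A=1 with B=0 -> W=+1, B=1 -> W=-1)."""
    if w == 0:
        return (0, 0)   # B is a don't-care when A=0; we use 0
    if w == 1:
        return (1, 0)
    if w == -1:
        return (1, 1)
    raise ValueError("weight must be -1, 0, or +1")

def tpc_multiply(w, i):
    """Model the bitline outcome of one scalar ternary multiply.
    Returns (dBL, dBLB): 1 if that bitline discharges by delta-V, else 0.
    The inferred ternary output is dBL - dBLB = w * i."""
    a, _b = encode_weight(w)
    if i == 0 or a == 0:
        return (0, 0)               # both bitlines stay at VDD -> Out = 0
    # Product is +1 when the signs agree (BL discharges), else -1 (BLB).
    return (1, 0) if w * i == 1 else (0, 1)

# Exhaustive check: the inferred output always equals w * i.
for w in (-1, 0, 1):
    for i in (-1, 0, 1):
        dbl, dblb = tpc_multiply(w, i)
        assert dbl - dblb == w * i
```

The single-ended sensing step then simply reads which (if either) of the two bitlines dropped.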
III-B Dot-product computation using TPCs
Next, we extend the idea of realizing a scalar multiplication using the TPC to a dot-product computation. Figure 4(a) illustrates the mapping of a dot-product operation (Out = Σ Inp_i * W_i) to a column of TPCs with shared bitlines. To compute, first the bitlines are precharged to VDD, and then the inputs (Inp) are applied to all TPCs simultaneously. The bitlines (BL and BLB) function as analog accumulators, wherein the final bitline voltages represent the sum of the individual TPC outputs. For example, if 'n' and 'k' of the 'L' TPCs have outputs 1 and -1, respectively, the final bitline voltages are VDD - n*ΔV and VDD - k*ΔV. The bitline voltages are converted using Analog-to-Digital Converters (ADCs) to yield the digital values 'n' and 'k'. For the unweighted encoding, where the ternary weights are (-1, 0, 1), the final dot-product is 'n-k'. Figure 4(b) shows the sensing circuit required to realize dot-products with the unweighted (-1, 0, 1) ternary system.
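Functionally, the column-level dot product reduces to counting the +1 and -1 scalar products and subtracting the two ADC counts. A minimal sketch of this behavior (our model, ignoring analog non-idealities):

```python
def column_dot_product(weights, inputs):
    """Model one TiM column with unweighted (-1, 0, 1) encoding:
    'n' TPCs discharge BL, 'k' discharge BLB, and the digital
    result after the two ADC conversions is n - k."""
    assert len(weights) == len(inputs)
    n = sum(1 for w, i in zip(weights, inputs) if w * i == 1)   # +1 outputs (BL)
    k = sum(1 for w, i in zip(weights, inputs) if w * i == -1)  # -1 outputs (BLB)
    return n - k

# Sanity check against a plain arithmetic dot product.
w = [1, -1, 0, 1, -1, 1, 0, 0]
x = [1,  1, 1, 0, -1, 1, -1, 1]
assert column_dot_product(w, x) == sum(wi * xi for wi, xi in zip(w, x))
```

Zero products leave both bitlines untouched, which is why sparsity directly reduces the accumulated bitline discharge.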
We can also realize dot-products with a more general ternary encoding represented by asymmetric weighted (-a, 0, b) values. Figure 5(a) shows the sensing circuit that enables dot-products with asymmetric ternary weights and inputs. As shown, the ADC outputs are scaled by the corresponding weight magnitudes, and subsequently an input scaling factor is applied. In contrast to dot-products with unweighted values, we require two execution steps to realize dot-products with the asymmetric ternary system, wherein in each step a partial dot-product (pOut) is computed. Figure 5(b) details these two steps using an example. In step 1, the input scaling factor is set to the positive input magnitude and the read wordlines are driven as '1' and '0', respectively, yielding a partial output pOut1. In step 2, the input scaling factor is set to the negative input magnitude and the read wordlines are driven as '0' and '1', respectively, yielding pOut2. The final dot-product is given by 'pOut1 + pOut2'.
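The two-step scheme can be sketched behaviorally as follows. Here weights take values in {-a, 0, +b} and inputs in {-c, 0, +d}, both stored/applied as ternary signs; step 1 consumes only the positive inputs and step 2 only the negative ones, with each step scaling the two ADC counts by the weight magnitudes and then by that step's input magnitude. The step structure mirrors Figure 5, but the variable names and sign bookkeeping are our assumptions.

```python
def asymmetric_dot_product(w_signs, i_signs, a, b, c, d):
    """Two-step asymmetric-ternary dot product: weights in {-a, 0, +b},
    inputs in {-c, 0, +d}, both given here as sign vectors."""
    def one_step(active):
        # ADC counts for this step, restricted to the active input sign:
        # n counts +1 products (on BL), k counts -1 products (on BLB).
        n = sum(1 for w, i in zip(w_signs, i_signs) if i == active and w * i == 1)
        k = sum(1 for w, i in zip(w_signs, i_signs) if i == active and w * i == -1)
        return n, k

    n1, k1 = one_step(+1)            # step 1: positive inputs, scaled by d
    p_out1 = d * (b * n1 - a * k1)
    n2, k2 = one_step(-1)            # step 2: negative inputs, scaled by c
    p_out2 = c * (a * n2 - b * k2)
    return p_out1 + p_out2

# Check against the exact dot product over the decoded values.
w_signs = [1, -1, 0, 1, -1]
i_signs = [1, -1, 1, 0, 1]
a, b, c, d = 2.0, 3.0, 1.5, 1.0
expected = sum({1: b, -1: -a, 0: 0}[w] * {1: d, -1: -c, 0: 0}[i]
               for w, i in zip(w_signs, i_signs))
assert abs(asymmetric_dot_product(w_signs, i_signs, a, b, c, d) - expected) < 1e-9
```

Setting a == b and c == d collapses the two steps' scale factors to the symmetric-weighted case.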
To validate the dot-product operation, we perform detailed SPICE simulations to determine the possible final voltages at BL and BLB. Figure 6 shows the various BL states (S0 to S10) and the corresponding bitline voltage and count 'n'. Note that the possible values of 'k' and 'n' are identical, as BL and BLB are symmetric. The state Si refers to the scenario where 'i' out of the 'L' TPCs compute an output of '1'. We observe that for the initial states the average sensing margin is 96mV; the margin decreases to 60-80mV for the later states, and beyond S10 the bitline voltage saturates. Therefore, we can achieve a maximum of 11 BL states (S0 to S10) with the sufficiently large sensing margin required for reliable sensing under process variations [XNORSRAM]. The maximum value of 'n' and 'k' is thus 10, which in turn determines the number of TPCs ('L') that can be enabled simultaneously. Setting 'L' to this maximum count would be a conservative choice. However, exploiting the weight and input sparsity of ternary DNNs [WRPN, HitNet, FGQ], wherein 40% or more of the elements are zeros, and the fact that the non-zero outputs are distributed between '1' and '-1', we choose a design with a maximum sensed count of 8 and L = 16. Our experiments indicate that this choice has no effect on the final DNN accuracy compared to the conservative case. In this paper, we also evaluate the impact of process variations on the dot-product operations realized using TPCs, and provide the experimental results on variations in Section V-F.
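The intuition that sparsity makes L = 16 safe can be sanity-checked with a quick Monte-Carlo sketch. The 40% zero rate and the even ±1 split below are illustrative assumptions on our part, not the paper's measured distributions.

```python
import random

# With >= 40% zeros among the per-TPC outputs and the non-zeros split
# between +1 and -1, the per-bitline counts n and k rarely exceed the
# reliably sensed range even when L = 16 rows are enabled at once.
random.seed(0)
L, trials, overflows = 16, 10000, 0
for _ in range(trials):
    outputs = [random.choice([-1, 1]) if random.random() > 0.4 else 0
               for _ in range(L)]
    n, k = outputs.count(1), outputs.count(-1)
    if max(n, k) > 10:            # beyond the 11 sensable BL states
        overflows += 1
overflow_rate = overflows / trials
```

Under these assumptions, overflows beyond the sensed range are rare, consistent with the observation that the aggressive L = 16 design point does not degrade accuracy.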
III-C TiM tile
We now present the TiM tile, i.e., a specialized memory array designed using TPCs to realize massively parallel vector-matrix multiplications with ternary values. Figure 7 details the tile design, which consists of a 2D array of TPCs, a row decoder and write wordline driver, a block decoder, Read Wordline Drivers (RWDs), column drivers, a sample-and-hold (S/H) unit, a column mux, Peripheral Compute Units (PCUs), and scale factor registers. The TPC array contains L*K*N TPCs, arranged in 'K' blocks and 'N' columns, where each block contains 'L' rows. As shown in the figure, TPCs in the same row (column) share wordlines (bitlines and source lines). The tile supports two major functions: (i) programming, i.e., row-by-row write operations, and (ii) vector-matrix multiplication. A write operation is performed by activating a write wordline using the row decoder and driving the bitlines and source lines. During a write operation, 'N' ternary words (TWs) are written in parallel. In contrast to the row-wise write operation, a vector-matrix multiplication is realized at the block granularity, wherein 'N' dot-product operations, each of vector length 'L', are executed in parallel. The block decoder selects a block for the vector-matrix multiplication, and the RWDs apply the ternary inputs. During the vector-matrix multiplication, TPCs in the same row share the ternary input (Inp), and TPCs in the same column produce partial sums for the same output. As discussed in Section III-B, accumulation is performed in the analog domain using the bitlines (BL and BLB). In one access, a TiM tile can compute the vector-matrix product Out = Inp*W, where Inp is a vector of length L and W is a matrix of dimension LxN stored in TPCs. The accumulated outputs at each column are stored using the sample-and-hold (S/H) unit and are digitized using the PCUs.
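Putting the pieces together, one TiM-tile access can be modeled as selecting a block and evaluating all N column dot-products at once. A behavioral sketch, assuming the L x K x N organization described above (names are ours):

```python
def tim_tile_access(blocks, block_id, inp):
    """Model one TiM-tile access: multiply a length-L ternary input
    vector with the selected L x N ternary weight block. All N column
    dot-products happen in parallel in the hardware; the loop here is
    only a functional stand-in."""
    W = blocks[block_id]                # one block: L rows x N columns
    L, N = len(W), len(W[0])
    assert len(inp) == L
    out = []
    for col in range(N):
        n = sum(1 for r in range(L) if W[r][col] * inp[r] == 1)   # BL count
        k = sum(1 for r in range(L) if W[r][col] * inp[r] == -1)  # BLB count
        out.append(n - k)               # PCU subtracts the two ADC outputs
    return out

# Tiny example: K=1 block with L=2 rows and N=2 columns.
blocks = [[[1, -1],
           [0,  1]]]
assert tim_tile_access(blocks, 0, [1, -1]) == [1, -2]
```

Partial sums from different blocks of the same tile would then be reduced digitally in the PCUs.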
To attain higher area efficiency, we utilize 'M' PCUs per tile (M < N) by matching the bandwidth of the PCUs to the bandwidth of the TPC array and operating the PCUs and the TPC array as a two-stage pipeline. Next, we discuss the TiM tile peripherals in detail.
Read Wordline Driver (RWD). Figure 7 shows the RWD logic, which takes a ternary input vector (Inp) and a block enable (bEN) signal as inputs and drives all 'L' read wordlines of a block. The block decoder generates the bEN signal based on the block address that is an input to the TiM tile. The read wordlines are activated using the input encoding scheme shown in Figure 2 (table on the right).
Peripheral Compute Unit (PCU). Figure 7 shows the logic for a PCU, which consists of two ADCs and a few small arithmetic units (adders and multipliers). The primary function of the PCUs is to convert the bitline voltages to digital values using the ADCs. However, PCUs also enable other key functions such as partial-sum reduction and weight (input) scaling for the weighted ternary encodings (-a, 0, a) and (-a, 0, b). Although the PCU could be simplified when the scale factors are unity, in this work we target a programmable TiM tile that can support various state-of-the-art ternary DNNs. To further generalize, we use a shifter to support DNNs with ternary weights and higher-precision activations [WRPN, FGQ]. The activations are evaluated bit-serially using multiple TiM accesses. Each access uses one input bit, and the computed partial sum is shifted based on that bit's significance using the shifter. TiM tiles have scale factor registers (shown in Figure 7) to store the weight and activation scale factors, which vary across layers within a network.
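The bit-serial scheme can be sketched as follows for unsigned multi-bit activations: each TiM access consumes one activation bit-plane, and the shifter weights the resulting partial sum by that bit's significance. Signed-activation handling and scale factors are omitted here for brevity; this is our illustrative model, not the paper's RTL.

```python
def bit_serial_dot(weights, activations, bits):
    """Dot product of ternary weights with unsigned 'bits'-bit
    activations, evaluated one bit-plane per TiM access."""
    acc = 0
    for b in range(bits):
        # Bit-plane b of every activation, applied as a 0/1 input vector.
        plane = [(act >> b) & 1 for act in activations]
        n = sum(1 for w, i in zip(weights, plane) if w * i == 1)
        k = sum(1 for w, i in zip(weights, plane) if w * i == -1)
        acc += (n - k) << b            # shifter applies the significance
    return acc

w = [1, -1, 0, 1]
x = [3, 2, 7, 1]                       # 3-bit activations
assert bit_serial_dot(w, x, 3) == sum(wi * xi for wi, xi in zip(w, x))
```

With B activation bits, each dot product thus costs B TiM accesses instead of one.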
III-D TiM-DNN accelerator architecture
Figure 8 shows the proposed TiM-DNN accelerator, which has a hierarchical organization with multiple banks, wherein each bank comprises several TiM tiles, an activation buffer, a partial sum (Psum) buffer, a global Reduce Unit (RU), a Special Function Unit (SFU), an instruction memory (Inst Mem), and a Scheduler. The compute time and energy in ternary DNNs are heavily dominated by vector-matrix multiplications, which are realized using the TiM tiles. Other DNN functions, viz., ReLU, pooling, normalization, Tanh, and Sigmoid, are performed by the SFU. The partial sums produced by different TiM tiles are reduced using the RU, whereas the partial sums produced by separate blocks within a tile are reduced using the PCUs (as discussed in Section III-C). TiM-DNN has a small instruction memory and a Scheduler that reads instructions and orchestrates operations inside a bank. TiM-DNN also contains activation and Psum buffers to store activations and partial sums, respectively.

Mapping. DNNs can be mapped to TiM-DNN both temporally and spatially. Networks that fit on TiM-DNN entirely are mapped spatially, wherein the weight matrix of each convolution (Conv) and fully-connected (FC) layer is partitioned and mapped to dedicated (one or more) TiM tiles, and the network executes in a pipelined fashion. In contrast, networks that cannot fit on TiM-DNN at once are executed using the temporal mapping strategy, wherein we execute Conv and FC layers sequentially over time using all TiM tiles. The weight matrix (W) of each Conv/FC layer may be either smaller or larger than the total weight capacity (TWC) of TiM-DNN. Figure 9 illustrates the two scenarios using an example workload (a vector-matrix multiplication) executed on two separate TiM-DNN instances differing in the number of TiM tiles. As shown, when W fits within the TWC, the weight matrix partitions are replicated and loaded onto multiple tiles, and each TiM tile computes on different input vectors in parallel. In contrast, when W exceeds the TWC, the operations are executed sequentially using multiple steps.
IV Experimental Methodology
In this section, we present our experimental methodology for evaluating TiM-DNN.
TiM tile modeling. We perform detailed SPICE simulations to estimate the tile-level energy and latency of the write and vector-matrix multiplication operations. The simulations use a 32nm bulk CMOS technology with PTM models. We use 3-bit flash ADCs to convert the bitline voltages to digital values. To estimate the area and latency of the digital logic both within the tiles (PCUs and decoders) and outside the tiles (SFU and RU), we synthesized RTL implementations using Synopsys Design Compiler and estimated power consumption using Synopsys Power Compiler. We performed the TPC layout (Figure 10) to estimate its area in terms of F2 (where F is the minimum feature size). We also performed a variation analysis to estimate the error rates due to incorrect sensing, considering variations (σ/μ = 5%) in transistor threshold voltage [kuhn2011].

System-level simulation. We developed an architectural simulator to estimate the application-level energy and performance benefits of TiM-DNN. The simulator maps various DNN operations, viz., vector-matrix multiplications, pooling, ReLU, etc., to TiM-DNN components and produces execution traces consisting of off-chip accesses, write and in-memory operations in TiM tiles, buffer reads and writes, and RU and SFU operations. Using these traces and the timing and energy models from circuit simulation and synthesis, the simulator computes the application-level energy and performance.
TiM-DNN parameters. Table II details the microarchitectural parameters of the TiM-DNN instance used in our evaluation, which contains 32 TiM tiles, with each tile having 256x256 TPCs. The SFU consists of 64 ReLU units, 8 vector processing elements (vPEs) each with 4 lanes, 20 special-function processing elements (SPEs), and 32 quantization units (QUs). The SPEs compute special functions such as Tanh and Sigmoid. The output activations are quantized to ternary values using the QUs. The latency of the dot-product operation is 2.3 ns. TiM-DNN achieves a peak performance of 114 TOPs/sec, consumes 0.9 W power, and occupies 1.96 mm2 of chip area.
Baseline. The processing efficiency (TOPS/W) of TiM-DNN is 300X better than NVIDIA's state-of-the-art Volta V100 GPU [nvidiavoltav100]. This is to be expected, since the GPU is not specialized for ternary DNNs. In comparison to near-memory ternary accelerators [BRein], TiM-DNN achieves a 55.2X improvement in TOPS/W. To perform a fairer comparison, and to isolate the benefits exclusively due to the in-memory computations enabled by the proposed TPC, we design a well-optimized near-memory ternary DNN accelerator. This baseline accelerator differs from TiM-DNN in only one aspect: its tiles consist of regular SRAM arrays (256x512) with 6T bitcells and near-memory compute (NMC) units (shown in Figure 11), instead of TiM tiles. Note that storing a ternary word in the SRAM array requires two 6T bitcells. The baseline tiles are smaller than TiM tiles by 0.52x; therefore, we use two baseline designs: (i) an iso-area baseline with 60 baseline tiles, whose overall accelerator area is the same as TiM-DNN's, and (ii) an iso-capacity baseline with the same weight storage capacity (2 Mega ternary words) as TiM-DNN. We note that the baseline is well-optimized: our iso-area baseline achieves 21.9 TOPs/sec, reflecting a 17.6X improvement in TOPs/sec over the near-memory accelerator for ternary DNNs proposed in [BRein].
DNN Benchmarks. We evaluate the system-level energy and performance benefits of TiM-DNN using a suite of DNN benchmarks, detailed in Table III. We use state-of-the-art convolutional neural networks (CNNs), viz., AlexNet, ResNet-34, and Inception, to perform image classification on ImageNet. We also evaluate popular recurrent neural networks (RNNs) such as an LSTM and a GRU that perform the language modeling task on the Penn Tree Bank (PTB) dataset [PTB]. Table III also details the activation precision and accuracy of these ternary networks.

V Results
In this section, we present various results quantifying the improvements obtained by TiM-DNN over the near-memory accelerator baseline due to the in-memory computing enabled by TPCs. We also compare TiM-DNN with other state-of-the-art DNN accelerators.
V-A Performance benefits
We first quantify the performance benefits of TiM-DNN over our two baselines (the iso-capacity and iso-area near-memory accelerators). Figure 12 shows the two major components of the normalized inference time, MAC-Ops (vector-matrix multiplications) and non-MAC-Ops (other DNN operations), for TiM-DNN (TiM) and the baselines. Overall, we achieve a 5.1x-7.7x speedup over the iso-capacity baseline and a 3.2x-4.2x speedup over the iso-area baseline across our benchmark applications. The speedups depend on the fraction of application runtime spent on MAC-Ops, with DNNs having higher MAC-Ops times attaining superior speedups. This is expected, as the performance benefits of TiM-DNN over the baselines derive from accelerating MAC-Ops using in-memory computations. The iso-area baseline (Baseline2) is faster than the iso-capacity baseline (Baseline1) due to the higher level of parallelism available from the additional baseline tiles. The 32-tile instance of TiM-DNN achieves 4827, 952, and 1834 inferences/sec for AlexNet, ResNet-34, and Inception, respectively, and substantially higher rates for LSTM and GRU. Our RNN benchmarks (LSTM and GRU) fit on TiM-DNN entirely, leading to better inference performance than the CNNs.
V-B Energy benefits
Next, we show the application-level energy benefits of TiM-DNN over the better of the two baselines (Baseline2). Figure 13 shows the major energy components for TiM-DNN and Baseline2: programming (writes to TiM tiles), DRAM accesses, reads (writes) from (to) the activation and Psum buffers, operations in the reduce and special function units (RU+SFU Ops), and MAC-Ops. As shown, TiM reduces the MAC-Ops energy substantially and achieves 3.9x-4.7x energy improvements across our DNN benchmarks. The primary cause of this energy reduction is that TiM-DNN computes on 16 rows simultaneously per array access.
V-C Kernel-level benefits
To provide more insight into the application-level benefits, we compare the TiM tile and the baseline tile at the kernel level. We consider a primitive DNN kernel, i.e., a vector-matrix computation (Out = Inp*W, where Inp is a 1x16 vector and W is a 16x256 matrix), and map it to both the TiM and baseline tiles. We use two variants of the TiM tile, (i) TiM-8 and (ii) TiM-16, wherein we simultaneously activate 8 and 16 wordlines, respectively. Using the baseline tile, the vector-matrix multiplication requires row-by-row sequential reads, resulting in 16 SRAM accesses. In contrast, TiM-16 and TiM-8 require 1 and 2 accesses, respectively. Figure 14 shows that the TiM-8 and TiM-16 designs achieve speedups of 6x and 11.8x, respectively, over the baseline design. Note that the benefits are lower than 8x and 16x, respectively, because SRAM accesses are faster than TiM-8 and TiM-16 accesses.
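The access counts above follow directly from how many wordlines each design activates per access; as a tiny sanity check (our arithmetic, mirroring the numbers in the text):

```python
def accesses(rows, wordlines_per_access):
    """Array accesses needed to cover 'rows' weight rows."""
    return -(-rows // wordlines_per_access)   # ceiling division

# The 1x16 by 16x256 kernel from the text:
baseline = accesses(16, 1)    # row-by-row 6T SRAM reads
tim8     = accesses(16, 8)
tim16    = accesses(16, 16)
assert (baseline, tim8, tim16) == (16, 2, 1)
```

The measured speedups (6x and 11.8x) fall short of the 8x and 16x access-count ratios precisely because a single TiM access takes longer than a single SRAM read.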
Next, we compare the energy consumption of the TiM-8, TiM-16, and baseline designs for the above kernel computation. In TiM-8 and TiM-16, the bitlines are discharged twice and once, respectively, whereas in the baseline design the bitlines discharge multiple (16*2) times. Therefore, TiM tiles achieve substantial energy benefits over the baseline design. The additional factor of '2' in (16*2) arises because the SRAM array uses two 6T bitcells to store a ternary word. However, the energy benefits of TiM-8 and TiM-16 are not 16x and 32x, respectively, as TiM tiles discharge the bitlines by a larger amount (multiple ΔVs). Further, the amount by which the bitlines discharge in TiM tiles depends on the number of non-zero scalar outputs. For example, in TiM-8, if 50% of the TPC outputs in a column are zeros, the bitlines discharge by a total of 4ΔV, whereas if 75% are zeros they discharge by 2ΔV. Thus, the energy benefits over the baseline design are a function of the output sparsity (the fraction of outputs that are zero). Figure 14 shows the energy benefits of the TiM-8 and TiM-16 designs over the baseline design at various output sparsity levels.
V-D TiM-DNN area breakdown
We now discuss the area breakdown of the various components of TiM-DNN. Figure 15 shows the area breakdown of the TiM-DNN accelerator, a TiM tile, and a baseline tile. The major area consumer in TiM-DNN is the TiM tiles. In both the TiM and baseline tiles, the area mostly goes into the core array, which consists of TPCs and 6T bitcells, respectively. Further, as discussed in Section IV, TiM tiles are 1.89x larger than the baseline tiles at iso-capacity. Therefore, we use the iso-area baseline with 60 tiles and compare it with TiM-DNN having 32 TiM tiles.
V-E Comparison with prior DNN accelerators
Next, we quantify the advantages of TiM-DNN over prior DNN accelerators, using processing efficiency (TOPS/W and TOPS/mm2) as our metrics. Table IV details 3 prior DNN accelerators: (i) the Nvidia Tesla V100 [nvidiavoltav100], a state-of-the-art GPU, (ii) Neural Cache [neuralCache], a design using bitwise in-memory operations and bit-serial near-memory arithmetic to realize dot-products, and (iii) BRein [BRein], a near-memory accelerator for ternary DNNs. As shown, TiM-DNN achieves substantial improvements in both TOPS/W and TOPS/mm2. GPUs, near-memory accelerators [BRein], and Boolean in-memory accelerators are less efficient than TiM-DNN, as they cannot realize massively parallel in-memory vector-matrix multiplications with signed ternary values per array access. Their efficiency is limited by the on-chip memory bandwidth, wherein they can simultaneously access one [nvidiavoltav100, BRein] or at most two [neuralCache] memory rows. In contrast, TiM-DNN offers additional parallelism by simultaneously accessing 'L' (L=16) memory rows to compute in-memory vector-matrix multiplications.
V-F TiM-DNN under process variations
Finally, we study the impact of process variations on the computations (i.e., ternary vector-matrix multiplications) performed using TiM-DNN. To that end, we first perform Monte-Carlo circuit simulations of ternary dot-product operations executed in TiM tiles, with a maximum sensed count of 8 and L = 16, to determine the sensing errors under random variations. We consider variations (σ/μ = 5%) [kuhn2011] in the threshold voltage of all transistors in each and every TPC. We evaluate 1000 samples for every possible BL/BLB state (S0 to S10) and determine the spread in the final bitline voltages. Figure 16 shows the histogram of the obtained voltages of all possible states across these random samples. As mentioned in Section III-B, the state Si represents n = i, where 'n' is the ADC output. We observe in the figure that some of the neighboring histograms slightly overlap, while others do not. The overlapping areas represent the samples that will result in sensing errors (SEs). However, the overlapping areas are very small, indicating that the probability of a sensing error is extremely low. Further, the sensing errors depend on 'n', and we represent this with the conditional sensing error probability P(SE/n). It is also worth mentioning that the error magnitude is always 1, as only adjacent histograms overlap.

P_error = Σ_n P(SE/n) * P(n)    (1)
Equation 1 details the probability (P_error) of error in the ternary vector-matrix multiplications executed using TiM tiles, where P(SE/n) and P(n) are the conditional sensing error probability and the occurrence probability of the state (ADC-Out = n), respectively. Figure 17 shows the values of P(n), P(SE/n), and their product (P(SE/n)*P(n)) for each n. P(SE/n) is obtained using the Monte-Carlo simulation described above, and P(n) is computed using traces of the partial sums obtained during ternary vector-matrix multiplications [WRPN, HitNet] executed using TiM tiles. As shown in Figure 17, P(n) is maximum at n=1 and decreases drastically for higher values of n. In contrast, P(SE/n) shows the opposite trend: the probability of a sensing error is higher for larger n. Therefore, the product P(SE/n)*P(n) is quite small across all values of n. In our evaluation, P_error is found to be 1.5*10^-4, reflecting an extremely low probability of error. In other words, we see only about 2 errors of magnitude (±1) per 10K ternary vector-matrix multiplications executed using TiM-DNN. In our experiments, we found that this error rate has no impact on the application-level accuracy. We note that this is due to the low probability and magnitude of errors, as well as the ability of DNNs to tolerate errors in their computations [AXNN].
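Equation 1 itself is a short computation. The sketch below evaluates it with hypothetical distributions shaped like the trends in Figure 17 (P(n) peaking at small n, P(SE/n) growing with n); the numbers are placeholders, not the paper's measured data.

```python
def p_error(p_se_given_n, p_n):
    """Equation 1: total error probability of a ternary vector-matrix
    multiplication, summed over the ADC outputs n."""
    assert set(p_se_given_n) == set(p_n)
    return sum(p_se_given_n[n] * p_n[n] for n in p_n)

# Hypothetical trends: occurrence concentrates at small n, while the
# conditional sensing-error probability rises with n, so the product
# (and hence the total) stays small.
p_n  = {1: 0.5,  2: 0.3,  3: 0.15, 4: 0.05}
p_se = {1: 1e-5, 2: 1e-4, 3: 5e-4, 4: 1e-3}
total = p_error(p_se, p_n)   # = 1.6e-4 for these placeholder values
```

The opposing trends of the two factors are what keep the overall error rate in the 10^-4 range despite the later states being harder to sense.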
VI Conclusion
Ternary DNNs are extremely promising due to their ability to achieve accuracy similar to full-precision networks on complex machine learning tasks while enabling DNN inference at low energy. In this work, we presented TiM-DNN, an in-memory accelerator for executing state-of-the-art ternary DNNs. TiM-DNN is a programmable accelerator designed using TiM tiles, i.e., specialized memory arrays for realizing massively parallel signed vector-matrix multiplications with ternary values. TiM tiles are built from a new Ternary Processing Cell (TPC) that functions as both a ternary storage unit and a scalar multiplication unit. We evaluated an embodiment of TiM-DNN with 32 TiM tiles and demonstrated that it achieves significant energy and performance improvements over a well-optimized near-memory accelerator baseline.