The use of lower precision to represent the weights and activations in DNNs is a promising technique for realizing DNN inference (evaluation of pre-trained DNN models) on energy-constrained platforms [XNOR-Net, BC, DoReFa-Net, TC, TTQ, WRPN, TNN, HitNet, FGQ, QuantRNN, PACT]. Reduced bit-precision can lower all facets of energy consumption, including computation, memory, and interconnects. Current commercial hardware [nvidia-volta-v100, edge-tpu] includes widespread support for 8-bit and 4-bit fixed-point DNN inference, and recent research has continued the push towards even lower precision [XNOR-Net, BC, DoReFa-Net, TC, TTQ, TNN, WRPN, HitNet, FGQ].
Studies to date [XNOR-Net, BC, DoReFa-Net, TC, TTQ, WRPN, TNN, HitNet, FGQ, Compensated-DNN] suggest that among low-precision networks, ternary DNNs represent a promising sweet spot, as they enable low-power inference with high application-level accuracy. This is illustrated in Figure 1, which shows the reported accuracies of various state-of-the-art binary [XNOR-Net, BC, DoReFa-Net], ternary [TC, TTQ, WRPN, TNN, HitNet, FGQ], and full-precision (FP32) networks on complex image classification (ImageNet) and language modeling (PTB [PTB]) tasks. We observe that the accuracy degradation of binary DNNs over the FP32 networks can be considerable (5-13% on image classification, 150-180 higher perplexity per word (PPW) on language modeling). In contrast, ternary DNNs achieve accuracy significantly better than binary networks (and much closer to FP32 networks). In this work, we focus on the design of specialized hardware for realizing various state-of-the-art ternary DNNs.
Ternary networks greatly simplify the multiply-and-accumulate (MAC) operation that constitutes 95-99% of total DNN computations. Consequently, the energy and time spent on DNN computations can be drastically improved by using lower-precision processing elements (the complexity of a MAC operation has a super-linear relationship with precision). However, when classical accelerator architectures (e.g., TPU and GPU) are adopted to realize ternary DNNs, on-chip memory, wherein the data elements within a memory array are read sequentially (row-by-row), becomes the energy and performance bottleneck. In-memory computing [XNOR-RRAM, Binary-RRAM, reno, prime, In-mem-classifier, Conv-RAM, RLui:2018, XNOR-SRAM, Xcel-RAM] is an emerging computing paradigm that overcomes memory bottlenecks by integrating computations within the memory array itself, enabling much greater parallelism and eliminating the need to transfer data to/from memory. This work explores in-memory computing in the specific context of ternary DNNs and demonstrates that it leads to significant improvements in performance and energy efficiency.
While several efforts have explored in-memory accelerators in recent years, TiM-DNN differs in significant ways and is the first to apply in-memory computing (massively parallel vector-matrix multiplications within the memory array itself, in an analog fashion) to ternary DNNs using a new CMOS-based bit-cell. Many in-memory accelerators use non-volatile memory (NVM) technologies such as PCM and ReRAM [XNOR-RRAM, Binary-RRAM, reno, prime] to realize in-memory dot-product operations. While NVMs promise much higher density and lower leakage than CMOS memories, they are still an emerging technology with open challenges such as large-scale manufacturing yield, limited endurance, high write energy, and errors due to device and circuit-level non-idealities [mnsim, rxnn]. Other efforts have explored SRAM-based in-memory accelerators for binary networks [In-mem-classifier, Conv-RAM, RLui:2018, XNOR-SRAM, Xcel-RAM] and SRAM-based near-memory accelerators for ternary networks [TNN, BRein]. However, the restriction to binary networks is a significant limitation, as binary networks known to date incur a large drop in accuracy (as highlighted in Figure 1), and the near-memory accelerators are bottlenecked by the on-chip memory, where only one memory row is enabled per access. Extended SRAMs that perform bitwise binary operations in-memory may be augmented with near-memory logic to perform higher-precision computations in a bit-serial manner [neuralCache]. However, such an approach requires multiple sequential execution steps (array accesses) to realize even one multiplication operation (and many more to realize dot-products), limiting efficiency. In contrast, we propose TiM-DNN, a programmable in-memory accelerator that can realize massively parallel signed ternary vector-matrix multiplications per array access.
TiM-DNN supports various ternary representations including unweighted (-1,0,1), symmetric weighted (-a,0,a), and asymmetric weighted (-a,0,b) systems, enabling it to execute a broad range of ternary DNNs.
The building block of TiM-DNN is a new memory cell, the Ternary Processing Cell (TPC), which functions as both a ternary storage unit and a scalar ternary multiplication unit. Using TPCs, we design TiM tiles, which are specialized memory arrays that execute signed ternary dot-product operations. TiM-DNN comprises multiple TiM tiles arranged into banks, wherein all tiles compute signed vector-matrix multiplications in parallel.
In summary, the key contributions of our work are:
We present TiM-DNN, a programmable in-memory accelerator supporting various ternary representations including unweighted (-1,0,1), symmetric weighted (-a,0,a), and asymmetric weighted (-a,0,b) systems for realizing a broad range of ternary DNNs.
We propose a Ternary Processing Cell (TPC) that functions as both a ternary storage unit and a ternary scalar multiplication unit, and a TiM tile, a specialized memory array that realizes signed vector-matrix multiplication operations with ternary values.
We develop an architectural simulator for evaluating TiM-DNN, with array-level timing and energy models obtained from circuit-level simulations. We evaluate an implementation of TiM-DNN in 32nm CMOS using a suite of 5 popular DNNs designed for image classification and language modeling tasks. A 32-tile instance of TiM-DNN achieves a peak performance of 114 TOPS/s, consumes 0.9W power, and occupies 1.96 mm² chip area, representing a 300X improvement in TOPS/W compared to a state-of-the-art NVIDIA Tesla V100 GPU [nvidia-volta-v100]. In comparison to low-precision accelerators [neuralCache, BRein], TiM-DNN achieves a 55.2X-240X improvement in TOPS/W. TiM-DNN also obtains a 3.9x-4.7x improvement in system energy and a 3.2x-4.2x improvement in performance over a well-optimized near-memory accelerator for ternary DNNs.
II Related Work
In recent years, several research efforts have focused on improving the energy efficiency and performance of DNNs at various levels of design abstraction. In this section, we limit our discussion to efforts on in-memory computing for DNNs [XNOR-RRAM, Binary-RRAM, reno, prime, In-mem-classifier, Conv-RAM, RLui:2018, XNOR-SRAM, Xcel-RAM, neuralCache, dotSRAM, compute-mem, sttcim, neurocube].
Table I classifies prior in-memory computing efforts based on the memory technology (CMOS and non-volatile memory (NVM)) and the targeted precision. One group of efforts [XNOR-RRAM, Binary-RRAM, reno, prime] has focused on in-memory DNN accelerators using emerging NVM technologies such as PCM and ReRAM. Although NVMs promise higher density and lower leakage relative to CMOS, they still face several open challenges such as large-scale manufacturing yield, limited endurance, high write energy, and errors due to device and circuit-level non-idealities [mnsim, rxnn]. Efforts on SRAM-based in-memory accelerators can be classified into those that target binary [In-mem-classifier, Conv-RAM, RLui:2018, XNOR-SRAM, Xcel-RAM] and high-precision [neuralCache, dotSRAM, compute-mem] DNNs. Accelerators targeting binary DNNs [In-mem-classifier, Conv-RAM, RLui:2018, XNOR-SRAM, Xcel-RAM] can execute massively parallel vector-matrix multiplications per array access. However, the restriction to binary networks is a significant limitation, as binary networks known to date incur a large drop in accuracy, as highlighted in Figure 1. Efforts [neuralCache, dotSRAM, compute-mem] that target higher-precision (4-8 bit) DNNs require multiple execution steps (array accesses) to realize signed dot-product operations, wherein both weights and activations are signed numbers. For example, Neural Cache [neuralCache] computes bitwise Boolean operations in-memory but uses bit-serial near-memory arithmetic to realize multiplications, requiring several array accesses per multiplication operation (and many more to realize dot-products). Apart from in-memory computing efforts, Table I also details efforts targeting near-memory accelerators for ternary networks [TNN, BRein]. However, the efficiency of these near-memory accelerators is limited by the on-chip memory, as they can enable only one memory row per access. Further, none of these efforts support asymmetric weighted (-a,0,b) ternary systems.
In contrast to previous proposals, TiM-DNN is the first specialized and programmable in-memory accelerator for ternary DNNs that supports various ternary representations, including unweighted (-1,0,1), symmetric weighted (-a,0,a), and asymmetric weighted (-a,0,b) systems. TiM-DNN utilizes a new CMOS-based bit-cell (i.e., the TPC) and enables multiple memory rows simultaneously to realize massively parallel in-memory signed vector-matrix multiplications with ternary values per memory access, enabling efficient realization of ternary DNNs. As illustrated in our experimental evaluation, TiM-DNN achieves a 3.9x-4.7x improvement in system-level energy and a 3.2x-4.2x speedup over a well-optimized near-memory accelerator. In comparison to the near-memory ternary accelerator [BRein], it achieves a 55.2X improvement in TOPS/W.
III TiM-DNN Architecture
In this section, we present the proposed TiM-DNN accelerator along with its building blocks, i.e., Ternary Processing Cells and TiM tiles.
III-A Ternary Processing Cell (TPC)
To enable in-memory signed multiplication with ternary values, we present a new Ternary Processing Cell (TPC) that operates as both a ternary storage unit and a ternary scalar multiplication unit. Figure 2 shows the proposed TPC circuit, which consists of two cross-coupled inverters for storing two bits ('A' and 'B'), a write wordline (WWL), two source lines (SL and SLB), two read wordlines (RWL and RWLB), and two bitlines (BL and BLB). A TPC supports two operations - write and scalar ternary multiplication. A write operation is performed by enabling WWL and driving the source lines and the bitlines to either VDD or 0 depending on the data. We can write both bits simultaneously, with 'A' written using BL and SL, and 'B' written using BLB and SLB. Using bits 'A' and 'B', a ternary value (-1,0,1) is inferred based on the storage encoding shown in Figure 2 (table on the left). For example, when A=0 the TPC stores W=0. When A=1 and B=0 (B=1) the TPC stores W=1 (W=-1).
A scalar multiplication in a TPC is performed between a ternary input and the stored weight to obtain a ternary output. The bitlines are precharged to VDD, and subsequently, the ternary inputs are applied to the read wordlines (RWL and RWLB) based on the input encoding scheme shown in Figure 2 (table on the right). The final bitline voltages (VBL and VBLB) depend on both the input (I) and the stored weight (W). The table in Figure 3 details the possible outcomes of the scalar ternary multiplication (W*I) with the final bitline voltages and the inferred ternary output (Out). For example, when W=0 or I=0, the bitlines remain at VDD and the output is inferred as 0 (W*I=0). When W=I=1, BL discharges by a certain voltage, denoted by ΔV, and BLB remains at VDD. This is inferred as Out=1. In contrast, when W=-I=1, BLB discharges by ΔV and BL remains at VDD, producing Out=-1. The final bitline voltages are converted to a ternary output using single-ended sensing at BL and BLB. Figure 3 depicts the output encoding scheme and the results of SPICE simulation of the scalar multiplication operation with the various possible final bitline voltages. Note that the TPC design uses separate read and write paths to avoid read disturb failures during in-memory multiplications.
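The TPC's storage encoding and multiplication semantics described above can be captured in a short behavioral model. The sketch below (Python, with an illustrative ΔV of 1 unit) is our own abstraction of the tables in Figures 2 and 3, not a circuit model: it encodes weights as (A, B) bit pairs and reports which bitline discharges for each scalar product.

```python
# Behavioral model of a Ternary Processing Cell (TPC).
# Storage encoding (per Figure 2): A=0 stores W=0;
# A=1 with B=0 stores W=+1; A=1 with B=1 stores W=-1.
ENCODE = {0: (0, 0), 1: (1, 0), -1: (1, 1)}

def tpc_decode(a, b):
    """Infer the stored ternary weight from bits A and B."""
    if a == 0:
        return 0
    return 1 if b == 0 else -1

def tpc_multiply(w, inp, dv=1):
    """Return (delta_BL, delta_BLB): how much each precharged
    bitline discharges for the scalar ternary product w * inp.
    BL drops by dv when w*inp = +1, BLB drops when w*inp = -1,
    and neither moves when the product is 0."""
    prod = w * inp
    if prod == 1:
        return (dv, 0)
    if prod == -1:
        return (0, dv)
    return (0, 0)
```

Single-ended sensing then maps (ΔV, 0) to Out=1, (0, ΔV) to Out=-1, and (0, 0) to Out=0, mirroring the output encoding in Figure 3.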
III-B Dot-product computation using TPCs
Next, we extend the idea of realizing a scalar multiplication using the TPC to a dot-product computation. Figure 4(a) illustrates the mapping of a dot-product operation (Out = Σ Inp_i*W_i) to a column of TPCs with shared bitlines. To compute, first the bitlines are precharged to VDD, and then the inputs (Inp) are applied to all TPCs simultaneously. The bitlines (BL and BLB) function as an analog accumulator, wherein the final bitline voltages (VBL and VBLB) represent the sum of the individual TPC outputs. For example, if 'n' and 'k' of the 'L' TPCs produce outputs of 1 and -1, respectively, the final bitline voltages are VDD - n*ΔV and VDD - k*ΔV. The bitline voltages are converted using analog-to-digital converters (ADCs) to yield the digital values 'n' and 'k'. For the unweighted encoding, where the ternary weights are (-1,0,1), the final dot-product is 'n-k'. Figure 4(b) shows the sensing circuit required to realize dot-products with the unweighted (-1,0,1) ternary system.
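As a sanity check on the 'n-k' formulation, the following sketch models a single column of TPCs behaviorally (plain Python, no analog effects): it counts the +1 and -1 scalar products, which in hardware the ADCs would read off BL and BLB, and forms the dot-product as n - k.

```python
def column_dot(weights, inputs):
    """Model a column of TPCs with shared bitlines: 'n' cells pull
    BL down by one dV each and 'k' cells pull BLB down; the ADCs
    digitize n and k, and the unweighted (-1,0,1) dot-product
    is simply n - k."""
    prods = [w * x for w, x in zip(weights, inputs)]
    n = sum(p == 1 for p in prods)    # cells discharging BL
    k = sum(p == -1 for p in prods)   # cells discharging BLB
    return n, k, n - k
```

For weights [1, -1, 0, 1] and inputs [1, 1, 1, -1], the products are [1, -1, 0, -1], so n=1, k=2, and n-k = -1, matching the exact arithmetic dot-product.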
We can also realize dot-products with a more general ternary encoding represented by asymmetric weighted (-a,0,b) values. Figure 5(a) shows the sensing circuit that enables dot-products with asymmetric ternary weights (-w_a, 0, w_b) and inputs (-i_a, 0, i_b). As shown, the ADC outputs are scaled by the corresponding weight magnitudes (w_b and w_a), and subsequently, an input scaling factor (α) is applied to yield α*(w_b*n - w_a*k). In contrast to dot-products with unweighted values, we require two execution steps to realize dot-products with the asymmetric ternary system, wherein each step computes a partial dot-product (pOut). Figure 5(b) details these two steps using an example. In step 1, we choose α = i_b, and apply the positive and negative inputs as '1' and '0', respectively, resulting in a partial output pOut1 = i_b*(w_b*n - w_a*k). In step 2, we choose α = -i_a, and apply the positive and negative inputs as '0' and '1', respectively, to yield pOut2 = -i_a*(w_b*n - w_a*k), where 'n' and 'k' are recomputed in each step. The final dot-product is given by pOut1 + pOut2.
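The two-step asymmetric dot-product can likewise be sketched behaviorally. In the snippet below, the names wa, wb, ia, ib (the magnitudes of the negative/positive weight and input levels) are our own illustrative notation, and the per-step (n, k) counts stand in for the ADC outputs.

```python
def asym_dot(weights, inputs, wa, wb, ia, ib):
    """Two-step dot-product with asymmetric ternary values:
    weights in {-wa, 0, +wb}, inputs in {-ia, 0, +ib}.
    Each step applies one input polarity as a 0/1 mask, counts
    n (active cells storing +wb) and k (active cells storing -wa)
    as the ADCs would, and scales by the input magnitude alpha."""
    def step(mask_val, alpha):
        mask = [1 if x == mask_val else 0 for x in inputs]
        n = sum(1 for w, m in zip(weights, mask) if m and w == wb)
        k = sum(1 for w, m in zip(weights, mask) if m and w == -wa)
        return alpha * (wb * n - wa * k)
    # Step 1: positive inputs, alpha = +ib; step 2: negative inputs, alpha = -ia.
    return step(ib, ib) + step(-ia, -ia)
```

For example, with wa=2, wb=3, ia=1, ib=2, weights [3, -2, 0, 3] and inputs [2, -1, 2, 0], both steps together recover the exact dot-product 6 + 2 + 0 + 0 = 8.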
To validate the dot-product operation, we perform a detailed SPICE simulation to determine the possible final voltages at BL (VBL) and BLB (VBLB). Figure 6 shows the various BL states (Si) and the corresponding values of VBL and 'n'. Note that the possible values of VBLB ('k') and VBL ('n') are identical, as BL and BLB are symmetric. The state Si refers to the scenario where 'i' out of the 'L' TPCs compute an output of '1'. We observe that for the initial states the average sensing margin (ΔVSM) is 96mV. The sensing margin decreases to 60-80mV for the states up to S10, and beyond S10 the bitline voltage (VBL) saturates. Therefore, we can achieve a maximum of 11 BL states (S0 to S10) with the sufficiently large sensing margin required for reliable sensing under process variations [XNOR-SRAM]. The maximum value of 'n' and 'k' is thus 10, which in turn determines the number of TPCs ('L') that can be enabled simultaneously. Setting L = 10 would be a conservative choice. However, exploiting the weight and input sparsity of ternary DNNs [WRPN, HitNet, FGQ], wherein 40% or more of the elements are zeros, and the fact that non-zero outputs are distributed between '1' and '-1', we choose a more aggressive design with nmax = 8 and L = 16. Our experiments indicate that this choice had no effect on the final DNN accuracy compared to the conservative case. In this paper, we also evaluate the impact of process variations on the dot-product operations realized using TPCs, and provide the experimental results on variations in Section V-F.
III-C TiM tile
We now present the TiM tile, i.e., a specialized memory array designed using TPCs to realize massively parallel vector-matrix multiplications with ternary values. Figure 7 details the tile design, which consists of a 2D array of TPCs, a row decoder and write wordline driver, a block decoder, Read Wordline Drivers (RWDs), column drivers, a sample-and-hold (S/H) unit, a column mux, Peripheral Compute Units (PCUs), and scale factor registers. The TPC array contains L*K*N TPCs, arranged in 'K' blocks and 'N' columns, where each block contains 'L' rows. As shown in the figure, TPCs in the same row (column) share wordlines (bitlines and source lines). The tile supports two major functions: (i) programming, i.e., row-by-row write operations, and (ii) vector-matrix multiplication. A write operation is performed by activating a write wordline (WWL) using the row decoder and driving the bitlines and source lines. During a write operation, 'N' ternary words (TWs) are written in parallel. In contrast to the row-wise write operation, a vector-matrix multiplication is realized at the block granularity, wherein 'N' dot-product operations, each of vector length 'L', are executed in parallel. The block decoder selects a block for the vector-matrix multiplication, and RWDs apply the ternary inputs. During the vector-matrix multiplication, TPCs in the same row share the ternary input (Inp), and TPCs in the same column produce partial sums for the same output. As discussed in Section III-B, accumulation is performed in the analog domain using the bitlines (BL and BLB). In one access, a TiM tile can compute the vector-matrix product Out = Inp*W, where Inp is a vector of length L and W is a matrix of dimension LxN stored in TPCs. The accumulated outputs at each column are stored using the sample-and-hold (S/H) unit and are digitized using PCUs.
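To make the block-level operation concrete, here is a minimal behavioral sketch (Python, assuming the unweighted (-1,0,1) encoding) of one TiM access: every column of an LxN weight block multiplies the shared ternary input vector in parallel, and each column's (n, k) counts, which the ADCs would digitize, yield one output element.

```python
def tim_tile_access(W, inp):
    """One TiM tile vector-matrix access: W is an L x N block of
    ternary weights (rows share the input, columns share bitlines).
    Each column accumulates its scalar products on BL/BLB, giving
    per-column counts n and k, and the output element n - k."""
    L, N = len(W), len(W[0])
    out = []
    for c in range(N):
        prods = [W[r][c] * inp[r] for r in range(L)]
        n = sum(p == 1 for p in prods)    # +1 products on BL
        k = sum(p == -1 for p in prods)   # -1 products on BLB
        out.append(n - k)
    return out
```

All N column results are produced by a single access, which is the source of the tile's parallelism over row-by-row reads.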
To attain higher area efficiency, we utilize 'M' PCUs per tile (with M < N) by matching the bandwidth of the PCUs to the bandwidth of the TPC array and operating the PCUs and TPC array as a two-stage pipeline. Next, we discuss the TiM tile peripherals in detail.
Read Wordline Driver (RWD). Figure 7 shows the RWD logic, which takes a ternary vector (Inp) and a block enable (bEN) signal as inputs and drives all 'L' read wordlines (RWL and RWLB) of a block. The block decoder generates the bEN signal based on the block address that is an input to the TiM tile. RWL and RWLB are activated using the input encoding scheme shown in Figure 2 (table on the right).
Peripheral Compute Unit (PCU). Figure 7 shows the logic for a PCU, which consists of two ADCs and a few small arithmetic units (adders and multipliers). The primary function of PCUs is to convert the bitline voltages to digital values using ADCs. However, PCUs also enable other key functions such as partial sum reduction and weight (input) scaling for the weighted ternary encodings (-w_a,0,w_b) and (-i_a,0,i_b). Although the PCU can be simplified when the weight and/or input scale factors are 1, in this work we target a programmable TiM tile that can support various state-of-the-art ternary DNNs. To further generalize, we use a shifter to support DNNs with ternary weights and higher-precision activations [WRPN, FGQ]. The activations are evaluated bit-serially using multiple TiM accesses. Each access uses one input bit, and we shift the computed partial sum based on the input bit significance using the shifter. TiM tiles have scale factor registers (shown in Figure 7) to store the weight and activation scale factors, which vary across layers within a network.
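The bit-serial scheme can be sketched as follows (a Python abstraction of ours, assuming unsigned multi-bit activations and ternary weights, with the in-memory access modeled as a plain per-bit dot-product): each activation bit-plane drives one TiM access, and the shifter weights the partial sum by that bit's significance.

```python
def bit_serial_dot(weights, activations, bits=4):
    """Sketch of bit-serial evaluation: ternary weights with
    multi-bit unsigned activations, one modeled TiM access per
    activation bit. The partial sum of each access is shifted
    left by the bit position (the PCU shifter) and accumulated."""
    total = 0
    for b in range(bits):
        inp = [(a >> b) & 1 for a in activations]        # bit-plane b
        psum = sum(w * x for w, x in zip(weights, inp))  # one access
        total += psum << b                               # shifter
    return total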
III-D TiM-DNN accelerator architecture
Figure 8 shows the proposed TiM-DNN accelerator, which has a hierarchical organization with multiple banks, wherein each bank comprises several TiM tiles, an activation buffer, a partial sum (Psum) buffer, a global Reduce Unit (RU), a Special Function Unit (SFU), an instruction memory (Inst Mem), and a Scheduler. The compute time and energy in ternary DNNs are heavily dominated by vector-matrix multiplications, which are realized using TiM tiles. Other DNN functions, viz., ReLU, pooling, normalization, Tanh, and Sigmoid, are performed by the SFU. The partial sums produced by different TiM tiles are reduced using the RU, whereas the partial sums produced by separate blocks within a tile are reduced using PCUs (as discussed in Section III-C). TiM-DNN has a small instruction memory and a Scheduler that reads instructions and orchestrates operations inside a bank. TiM-DNN also contains activation and Psum buffers to store activations and partial sums, respectively.
Mapping. DNNs can be mapped to TiM-DNN both temporally and spatially. Networks that fit on TiM-DNN entirely are mapped spatially, wherein the weight matrix of each convolution (Conv) and fully-connected (FC) layer is partitioned and mapped to dedicated (one or more) TiM tiles, and the network executes in a pipelined fashion. In contrast, networks that cannot fit on TiM-DNN at once are executed using the temporal mapping strategy, wherein we execute Conv and FC layers sequentially over time using all TiM tiles. The weight matrix (W) of each Conv/FC layer could be either smaller or larger than the total weight capacity (TWC) of TiM-DNN. Figure 9 illustrates the two scenarios using an example workload (a vector-matrix multiplication) that is executed on two separate TiM-DNN instances differing in the number of TiM tiles. As shown, when W ≤ TWC, the weight matrix partitions (W1 and W2) are replicated and loaded to multiple tiles, and each TiM tile computes on different input vectors in parallel. In contrast, when W > TWC, the operations are executed sequentially using multiple steps.
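This replicate-or-serialize decision can be sketched with a toy scheduler (Python; the column-wise partitioning, the tile-count arithmetic, and the returned fields are simplifying assumptions of ours that ignore the row dimension and pipelining):

```python
import math

def schedule(weight_cols, tile_cols, num_tiles):
    """Toy mapping decision for one layer on TiM-DNN: if one copy
    of the weight matrix needs fewer tiles than are available,
    replicate it so tiles process different input vectors in
    parallel; otherwise sweep the partitions over multiple
    sequential steps."""
    parts = math.ceil(weight_cols / tile_cols)   # tiles for one copy
    if parts <= num_tiles:                        # W fits: replicate
        return {"mode": "replicated",
                "copies": num_tiles // parts,
                "steps": 1}
    return {"mode": "sequential",                 # W too large: serialize
            "copies": 1,
            "steps": math.ceil(parts / num_tiles)}
```

For instance, a 512-column layer on 8 tiles of 256 columns fits in 2 tiles, so 4 replicas run in parallel; a 4096-column layer needs 16 tile-partitions and hence 2 sequential steps on the same 8 tiles.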
IV Experimental Methodology
In this section, we present our experimental methodology for evaluating TiM-DNN.
TiM tile modeling.
We perform detailed SPICE simulations to estimate the tile-level energy and latency of the write and vector-matrix multiplication operations. The simulations are performed using 32nm bulk CMOS technology and PTM models. We use 3-bit flash ADCs to convert bitline voltages to digital values. To estimate the area and latency of digital logic both within the tiles (PCUs and decoders) and outside the tiles (SFU and RU), we synthesized RTL implementations using Synopsys Design Compiler and estimated power consumption using Synopsys Power Compiler. We performed the TPC layout (Figure 10) to estimate its area in terms of F² (where F is the minimum feature size). We also performed a variation analysis to estimate error rates due to incorrect sensing, considering variations in transistor threshold voltage (σ/μ = 5%) [kuhn2011].
System-level simulation. We developed an architectural simulator to estimate the application-level energy and performance benefits of TiM-DNN. The simulator maps various DNN operations, viz., vector-matrix multiplications, pooling, ReLU, etc., to TiM-DNN components and produces execution traces consisting of off-chip accesses, write and in-memory operations in TiM tiles, buffer reads and writes, and RU and SFU operations. Using these traces and the timing and energy models from circuit simulation and synthesis, the simulator computes the application-level energy and performance.
TiM-DNN parameters. Table II details the micro-architectural parameters of the instance of TiM-DNN used in our evaluation, which contains 32 TiM tiles, with each tile having 256x256 TPCs. The SFU consists of 64 ReLU units, 8 vector processing elements (vPEs) each with 4 lanes, 20 special function processing elements (SPEs), and 32 quantization units (QUs). The SPEs compute special functions such as Tanh and Sigmoid. The output activations are quantized to ternary values using the QUs. The latency of the dot-product operation is 2.3 ns. TiM-DNN achieves a peak performance of 114 TOPS/s, consumes 0.9 W power, and occupies 1.96 mm² chip area.
Baseline. The processing efficiency (TOPS/W) of TiM-DNN is 300X better than NVIDIA's state-of-the-art Volta V100 GPU [nvidia-volta-v100]. This is to be expected, since the GPU is not specialized for ternary DNNs. In comparison to near-memory ternary accelerators [BRein], TiM-DNN achieves a 55.2X improvement in TOPS/W. To perform a fairer comparison and to isolate the benefits exclusively due to the in-memory computations enabled by the proposed TPC, we design a well-optimized near-memory ternary DNN accelerator. This baseline accelerator differs from TiM-DNN in only one aspect - its tiles consist of regular SRAM arrays (256x512) with 6T bit-cells and near-memory compute (NMC) units (shown in Figure 11), instead of TiM tiles. Note that storing a ternary word in the SRAM array requires two 6T bit-cells. The baseline tiles are 0.52x the area of TiM tiles; therefore, we use two baseline designs: (i) an iso-area baseline with 60 baseline tiles, whose overall accelerator area is the same as TiM-DNN, and (ii) an iso-capacity baseline with the same weight storage capacity (2 mega ternary words) as TiM-DNN. We note that the baseline is well-optimized: our iso-area baseline can achieve 21.9 TOPS/s, reflecting a 17.6X improvement in TOPS/s over the near-memory accelerator for ternary DNNs proposed in [BRein].
DNN Benchmarks. We evaluate the system-level energy and performance benefits of TiM-DNN using a suite of DNN benchmarks, detailed in Table III. We use state-of-the-art convolutional neural networks (CNNs), viz., AlexNet, ResNet-34, and Inception, to perform image classification on ImageNet. We also evaluate popular recurrent neural networks (RNNs) such as LSTM and GRU that perform the language modeling task on the Penn Tree Bank (PTB) dataset [PTB]. Table III also details the activation precision and accuracy of these ternary networks.
V Results
In this section, we present various results that quantify the improvements obtained by TiM-DNN over the near-memory accelerator baseline due to the in-memory computing enabled by TPCs. We also compare TiM-DNN with other state-of-the-art DNN accelerators.
V-A Performance benefits
We first quantify the performance benefits of TiM-DNN over our two baselines (the iso-capacity and iso-area near-memory accelerators). Figure 12 shows the two major components of the normalized inference time, which are MAC-Ops (vector-matrix multiplications) and non-MAC-Ops (other DNN operations), for TiM-DNN (TiM) and the baselines. Overall, we achieve a 5.1x-7.7x speedup over the iso-capacity baseline and a 3.2x-4.2x speedup over the iso-area baseline across our benchmark applications. The speedups depend on the fraction of application runtime spent on MAC-Ops, with DNNs having higher MAC-Ops times attaining superior speedups. This is expected, as the performance benefits of TiM-DNN over the baselines derive from accelerating MAC-Ops using in-memory computations. The iso-area baseline (Baseline2) is faster than the iso-capacity baseline (Baseline1) due to the higher level of parallelism available from the additional baseline tiles. The 32-tile instance of TiM-DNN achieves 4827, 952, 1834, 2x10^5, and 1.9x10^5 inferences/sec for AlexNet, ResNet-34, Inception, LSTM, and GRU, respectively. Our RNN benchmarks (LSTM and GRU) fit on TiM-DNN entirely, leading to better inference performance than the CNNs.
V-B Energy benefits
Next, we show the application-level energy benefits of TiM-DNN over the superior of the two baselines (Baseline2). Figure 13 shows the major energy components for TiM-DNN and Baseline2: programming (writes to TiM tiles), DRAM accesses, reads (writes) from (to) the activation and Psum buffers, operations in the reduce and special function units (RU+SFU Ops), and MAC-Ops. As shown, TiM reduces the MAC-Ops energy substantially and achieves 3.9x-4.7x energy improvements across our DNN benchmarks. The primary cause of this energy reduction is that TiM-DNN computes on 16 rows simultaneously per array access.
V-C Kernel-level benefits
To provide more insight into the application-level benefits, we compare the TiM tile and the baseline tile at the kernel level. We consider a primitive DNN kernel, i.e., a vector-matrix computation (Out = Inp*W, where Inp is a 1x16 vector and W is a 16x256 matrix), and map it to both TiM and baseline tiles. We use two variants of the TiM tile, (i) TiM-8 and (ii) TiM-16, wherein we simultaneously activate 8 wordlines and 16 wordlines, respectively. Using the baseline tile, the vector-matrix multiplication requires row-by-row sequential reads, resulting in 16 SRAM accesses. In contrast, TiM-16 and TiM-8 require 1 and 2 accesses, respectively. Figure 14 shows that the TiM-8 and TiM-16 designs achieve speedups of 6x and 11.8x, respectively, over the baseline design. Note that the benefits are lower than 8x and 16x, respectively, as SRAM accesses are faster than TiM-8 and TiM-16 accesses.
Next, we compare the energy consumption of the TiM-8, TiM-16, and baseline designs for the above kernel computation. In TiM-8 and TiM-16, the bitlines are discharged twice and once, respectively, whereas in the baseline design the bitlines discharge multiple (16*2) times. Therefore, TiM tiles achieve substantial energy benefits over the baseline design. The additional factor of '2' in (16*2) arises because the SRAM array uses two 6T bit-cells to store a ternary word. However, the energy benefits of TiM-8 and TiM-16 are not 16x and 32x, respectively, as TiM tiles discharge the bitlines by a larger amount (multiple ΔVs). Further, the amount by which the bitlines discharge in TiM tiles depends on the number of non-zero scalar outputs. For example, in TiM-8, if 50% of the TPC outputs in a column are zeros the bitline discharges by 4ΔV, whereas if 75% are zeros the bitline discharges by 2ΔV. Thus, the energy benefits over the baseline design are a function of the output sparsity (the fraction of outputs that are zero). Figure 14 shows the energy benefits of the TiM-8 and TiM-16 designs over the baseline design at various output sparsity levels.
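The sparsity dependence above can be illustrated with a deliberately crude first-order model (Python; the unit ΔV, the assumption that every non-zero product costs one ΔV of bitline swing, and the per-read baseline swing are simplifying assumptions of ours, not extracted circuit numbers):

```python
def bitline_discharge(products, dv=0.1):
    """Total bitline swing for one modeled TiM access: each
    non-zero scalar output discharges BL or BLB by dv, so the
    swing scales with output density, not with the row count."""
    nonzero = sum(1 for p in products if p != 0)
    return nonzero * dv

def relative_energy(products, rows=16, accesses=1, dv=0.1):
    """Crude TiM-vs-baseline ratio for one dot-product: the
    baseline performs `rows` row-by-row reads of two 6T cells
    (each modeled as one dv of swing), while TiM's swing scales
    with the number of non-zero products per access."""
    tim = accesses * bitline_discharge(products, dv)
    baseline = rows * 2 * dv
    return tim / baseline
```

Under this model, a column with 50% zeros over 8 activated rows costs 0.125 of the baseline's swing, and the ratio shrinks further as sparsity grows, qualitatively matching the trend in Figure 14.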
V-D TiM-DNN area breakdown
We now discuss the area breakdown of the various components of TiM-DNN. Figure 15 shows the area breakdown of the TiM-DNN accelerator, a TiM tile, and a baseline tile. The major area consumer in TiM-DNN is the TiM tiles. In both the TiM and baseline tiles, the area is dominated by the core array, which consists of TPCs and 6T bit-cells, respectively. Further, as discussed in Section IV, TiM tiles are 1.89x larger than baseline tiles at iso-capacity. Therefore, we use the iso-area baseline with 60 tiles and compare it with TiM-DNN having 32 TiM tiles.
V-E Comparison with prior DNN accelerators
Next, we quantify the advantages of TiM-DNN over prior DNN accelerators using processing efficiency (TOPS/W and TOPS/mm²) as our metrics. Table IV details three prior DNN accelerators: (i) the NVIDIA Tesla V100 [nvidia-volta-v100], a state-of-the-art GPU, (ii) Neural Cache [neuralCache], a design using bitwise in-memory operations and bit-serial near-memory arithmetic to realize dot-products, and (iii) BRein [BRein], a near-memory accelerator for ternary DNNs. As shown, TiM-DNN achieves substantial improvements in both TOPS/W and TOPS/mm². GPUs, near-memory accelerators [BRein], and Boolean in-memory accelerators are less efficient than TiM-DNN, as they cannot realize massively parallel in-memory vector-matrix multiplications with signed ternary values per array access. Their efficiency is limited by the on-chip memory bandwidth, wherein they can simultaneously access one [nvidia-volta-v100, BRein] or at most two [neuralCache] memory rows. In contrast, TiM-DNN offers additional parallelism by simultaneously accessing 'L' (L=16) memory rows to compute in-memory vector-matrix multiplications.
V-F TiM-DNN under process variation
Finally, we study the impact of process variations on the computations (i.e., ternary vector-matrix multiplications) performed using TiM-DNN. To that end, we first perform Monte-Carlo circuit simulations of ternary dot-product operations executed in TiM tiles with nmax = 8 and L = 16 to determine the sensing errors under random variations. We consider variations (σ/μ = 5%) [kuhn2011] in the threshold voltage (VT) of all transistors in each and every TPC. We evaluate 1000 samples for every possible BL/BLB state and determine the spread in the final bitline voltages (VBL/VBLB). Figure 16 shows the histogram of the obtained voltages of all possible states across these random samples. As mentioned in Section III-B, the state Si represents n = i, where 'n' is the ADC output. We observe in the figure that some of the neighboring histograms slightly overlap, while others do not; the overlap occurs between adjacent higher states (larger 'n'), whereas the histograms of the lower states are well separated. The overlapping areas represent the samples that will result in sensing errors (SEs). However, the overlapping areas are very small, indicating that the probability of a sensing error is extremely low. Further, the sensing errors depend on 'n', and we represent this as the conditional sensing error probability [P(SE|n)]. It is also worth mentioning that the error magnitude is always 1, as only adjacent histograms overlap.
Equation 1 details the probability of error (Perror) in the ternary vector-matrix multiplications executed using TiM tiles:

Perror = Σn P(SE|n) * P(n)    (1)

where P(SE|n) and P(n) are the conditional sensing error probability and the occurrence probability of the state Sn (ADC-Out = n), respectively. Figure 17 shows the values of P(SE|n), P(n), and their product (P(SE|n)*P(n)) for each n. P(SE|n) is obtained using the Monte-Carlo simulation described above, and P(n) is computed using the traces of the partial sums obtained during ternary vector-matrix multiplications [WRPN, HitNet] executed using TiM tiles. As shown in Figure 17, P(n) is maximum at n=1 and drastically decreases for higher values of n. In contrast, P(SE|n) shows the opposite trend, wherein the probability of sensing error is higher for larger n. Therefore, the product P(SE|n)*P(n) is quite small across all values of n. In our evaluation, Perror is found to be 1.5x10^-4, reflecting an extremely low probability of error. In other words, we have only about 2 errors of magnitude (±1) every 10K ternary vector-matrix multiplications executed using TiM-DNN. In our experiments, we found that Perror = 1.5x10^-4 has no impact on the application-level accuracy. We note that this is due to the low probability and magnitude of the errors, as well as the ability of DNNs to tolerate errors in their computations [AXNN].
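Equation 1 is straightforward to evaluate once the two distributions are known; a minimal sketch follows (Python, with made-up example distributions purely for illustration, not the measured values of Figure 17):

```python
def error_probability(p_se_given_n, p_n):
    """Evaluate Perror = sum_n P(SE|n) * P(n): combine the
    conditional sensing-error probability for each ADC output
    value n with how often that value occurs."""
    return sum(pse * pn for pse, pn in zip(p_se_given_n, p_n))
```

The opposing trends noted above (P(n) concentrated at small n, P(SE|n) growing with n) are exactly why the summed product stays small: each term pairs a large factor with a tiny one.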
VI Conclusion
Ternary DNNs are extremely promising due to their ability to achieve accuracy similar to full-precision networks on complex machine learning tasks, while enabling DNN inference at low energy. In this work, we presented TiM-DNN, an in-memory accelerator for executing state-of-the-art ternary DNNs. TiM-DNN is a programmable accelerator designed using TiM tiles, i.e., specialized memory arrays for realizing massively parallel signed vector-matrix multiplications with ternary values. TiM tiles are built using a new Ternary Processing Cell (TPC) that functions as both a ternary storage unit and a scalar ternary multiplication unit. We evaluated an embodiment of TiM-DNN with 32 TiM tiles and demonstrated that TiM-DNN achieves significant energy and performance improvements over a well-optimized near-memory accelerator baseline.