A 75kb SRAM in 65nm CMOS for In-Memory Computing Based Neuromorphic Image Denoising

03/23/2020 ∙ by Sumon Kumar Bose, et al. ∙ Nanyang Technological University

This paper presents an in-memory computing (IMC) architecture for image denoising. The proposed SRAM based in-memory processing framework works in tandem with approximate computing on a binary image generated from neuromorphic vision sensors. Implemented in TSMC 65nm process, the proposed architecture enables approximately 2000X energy savings (approximately 222X from IMC) compared to a digital implementation when tested with the video recordings from a DAVIS sensor and achieves a peak throughput of 1.25-1.66 frames/us.

I Introduction

Bio-inspired neuromorphic vision sensors (NVS) [Posch2014][Berner2013] have gained traction among researchers due to their low bandwidth and energy requirements, as well as their recent commercial availability [Vision]. Unlike a traditional frame-based camera, an NVS detects changes of contrast at a pixel and outputs an event carrying the (x, y) coordinates of that pixel, a scheme known as Address Event Representation (AER) [Boahen2004]. Because of this asynchronous event-based encoding, AER carries less superfluous information, and it finds application in robotics [Delbruck2013], environment monitoring, traffic surveillance [Litzenberger2006] and object tracking [1542972]. However, the raw image data is corrupted by noise, and removing that noise is one of the most important pre-processing tasks for region proposal, object tracking and classification [Padala2018]. While earlier approaches use an event-based denoising method termed the nearest neighbour filter (NN-filt) [nnfilt], recent hybrid frame-event approaches using median filtering outperform NN-filt in terms of performance, memory requirement and number of computations [Jyotibdha_EBBIOT]. However, the traditional von Neumann architecture is still a bottleneck in terms of latency and energy dissipation for hardware implementations of neuromorphic processing [jetcas_review].

Fig. 1: (a) Conventional median filter using a $3 \times 3$ kernel and stride = 1. (b) Proposed NOMF using a $3 \times 3$ kernel and stride = 3 for image denoising. The approximation introduced by NOMF reduces the memory read and write energy, and its architecture enables in-memory computing.

To address this, the in-memory computing (IMC) paradigm has been proposed, in which processing is performed inside the memory, yielding large performance benefits over its von Neumann counterpart. IMC not only enables highly parallel processing through simultaneous access to multiple cells but also eliminates the energy spent on transferring data between memory and processor [Biswas2018_thesis]. Several IMC works have been shown to be effective: [7875410] proposed a -T SRAM based linear classifier using current summation and achieved x energy savings on the MNIST dataset compared to a digital implementation. Similarly, [Biswas2018] implemented a T-SRAM based binary-weighted Convolutional Neural Network (CNN) leveraging charge distribution and attained x energy benefit on the MNIST dataset. While most IMC efforts target the post-processing of the image, in this paper we use IMC for efficient denoising of the event-based binary image (EBBI), since this method has been shown to outperform purely event-based ones [Jyotibdha_EBBIOT].

Approximate computing is another avenue for energy reduction in applications such as pattern recognition or multimedia processing, where slight degradation in the calculation does not affect the final outcome or the output quality remains within its acceptable range. Approximation can be introduced at the circuit [Lu2004] [approx_float], software [Bose_2019] or system level [Raha2018xxx]. Since a slight change in the object boundary has little impact on region proposal, object tracking or classification performance, we propose to use approximate computing while filtering an image frame. The details of the algorithm and VLSI implementation are presented in the following sections.

II Overview: Median Filter Algorithm

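A median filter is a nonlinear filter that replaces the center pixel of a $p \times p$ kernel by the median value of the pixels covered by the kernel. The output of the median filter at location $(i, j)$ can be written as Eq. (1), where $x$ and $y$ denote the input and output images and $i$, $j$ are the row and column indices.

$$y(i,j) = \operatorname{median}\left\{\, x(i+k,\, j+l) \;:\; |k|, |l| \le \tfrac{p-1}{2} \,\right\} \qquad (1)$$

Implementation of the median filter for a grayscale image involves sorting the pixel values. In contrast, median filtering of a binary image is simple: it requires only a counter that adds up the number of occurrences of “1” in a $p \times p$ patch and assigns “1” to the middle pixel if the number of “1”s is higher than the number of “0”s, and “0” otherwise. The whole operation can be written as

$$y(i,j) = \begin{cases} 1, & \sum_{|k|,|l| \le \frac{p-1}{2}} x(i+k,\, j+l) > \left\lfloor p^2/2 \right\rfloor \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$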
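The counting rule for binary images can be sketched in a few lines of Python. This is only an illustrative software model, not the authors' implementation: the function name `binary_median_filter`, the kernel-size argument `p`, and the zero padding at the image border are assumptions made for the example.

```python
import numpy as np

def binary_median_filter(img, p=3):
    """Stride-1 median filter for a binary image, done by counting ones.

    Zero padding at the border is an assumption of this sketch.
    """
    assert p % 2 == 1, "kernel size is assumed to be odd"
    r = p // 2
    padded = np.pad(img.astype(np.uint8), r, mode="constant")
    rows, cols = img.shape
    out = np.zeros((rows, cols), dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            ones = int(padded[i:i + p, j:j + p].sum())
            # "1" wins only if it is the strict majority of the p*p pixels
            out[i, j] = 1 if ones > (p * p) // 2 else 0
    return out

frame = np.zeros((6, 6), dtype=np.uint8)
frame[1:5, 1:5] = 1   # a 4x4 object
frame[0, 5] = 1       # an isolated noise pixel
print(binary_median_filter(frame))
```

In this toy frame the isolated pixel is removed while the interior of the object is preserved; no sorting is needed because the median of a binary patch is simply its majority value.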

In a traditional median filter, a $p \times p$ kernel slides over the image in an overlapping fashion with stride = 1, as shown in Fig. 1(a). Hence, fetching and summing the binary image bit by bit, followed by a comparison in the processing unit and a write operation to the memory, demands $p^2+1$ clock cycles and the associated energy for each pixel. However, since adjacent pixels of an image have similar characteristics, we can apply the decision of a $p \times p$ kernel to all of its pixels instead of only the center one. This is equivalent to having stride = $p$ (Fig. 1(b)), resulting in the non-overlap median filter (NOMF) that we use in this work. While the proposed approach changes the object boundary slightly (with marginal effect on tracking, as shown later), it reduces the processing and memory read access time by a factor of $p^2$ and enables the same memory to be used to store the filtered image. It also enables IMC-based denoising, as shown next. However, the NOMF approach does not reduce the memory write cycles and energy. Table I captures the resource usage of both approaches.
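A minimal software sketch of the non-overlap idea is given below, assuming a kernel size `p` and leaving any leftover rows or columns (when the image size is not a multiple of `p`) untouched; both choices are assumptions of the example rather than details taken from the paper. The helper `read_counts` only illustrates the read-count argument, ignoring border effects.

```python
import numpy as np

def nomf(img, p=3):
    """Non-overlap median filter: one majority decision per p x p block."""
    rows, cols = img.shape
    out = img.astype(np.uint8).copy()
    for i in range(0, rows - rows % p, p):
        for j in range(0, cols - cols % p, p):
            ones = int(out[i:i + p, j:j + p].sum())
            # The block decision is applied to every pixel of the block,
            # so the result can overwrite the input in place.
            out[i:i + p, j:j + p] = 1 if ones > (p * p) // 2 else 0
    return out

def read_counts(rows, cols, p=3):
    """Bit reads per frame: (stride-1 median filter, NOMF)."""
    return rows * cols * p * p, rows * cols

f = np.zeros((6, 6), dtype=np.uint8)
f[0:2, 0:3] = 1   # six ones in the top-left block
f[4, 4] = 1       # one noise pixel in the bottom-right block
print(nomf(f))
print(read_counts(240, 320, 3))
```

Because the blocks do not overlap, each pixel is read once per frame instead of once per covering window, which is where the read reduction comes from.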

III In-Memory Denoise: Hardware Implementation

III-A Architecture

Figure 2 shows the architecture of an SRAM array for image denoising (QVGA or lower resolution), applicable to NVS such as [5648367] [Brandli:200837]. It operates in two modes: (a) normal read and write mode and (b) filter mode. Unlike a conventional SRAM, this memory does not write all the bits of a byte or a word simultaneously, since it is targeted at event-based cameras and events are not contiguous. Therefore, single-bit write circuitry is implemented for the normal write mode. In order to reduce the dynamic bit-line power consumption [Chandrakasan:1995:LPD:560639], the whole memory is divided into banks having cells in each bank except the last one. In filter mode, the kernel can be configured as either a $3 \times 3$ or a $5 \times 5$ patch by enabling successive WLs and connecting consecutive BLs and BLBs separately. To keep the WL signal delay almost the same for each cell of a kernel, columns are selected for each bank. In normal SRAM write mode, the global (GWL) and local word-line (LWL) blocks enable one of the word-lines (WL), and the column decoder writes the data and its complement on the bit-line (BL) and bit-line bar (BLB) respectively. The rest of the BLs and BLBs are charged to VDD by the half select (HS) driver to mitigate the read disturb issue of the half-selected cells in the selected bank (cells that are selected along the row but not along the column).
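The reason for the single-bit write path can be illustrated with a small behavioural model of how events populate the binary frame. The `(x, y)` event tuples, the function name `accumulate_events` and the 240 x 320 orientation are assumptions of this sketch. QVGA has 76,800 pixels, which corresponds to 75 kb at one bit per pixel if a kb is read as 1024 bits.

```python
import numpy as np

ROWS, COLS = 240, 320  # QVGA; the row/column orientation is assumed

def accumulate_events(events, rows=ROWS, cols=COLS):
    """Write each AER event into the frame as an independent single-bit write."""
    frame = np.zeros((rows, cols), dtype=np.uint8)
    writes = 0
    for x, y in events:
        frame[y, x] = 1   # one bit per event; neighbouring bits are untouched
        writes += 1
    return frame, writes

events = [(5, 7), (5, 7), (10, 0)]   # the same pixel may fire more than once
frame, writes = accumulate_events(events)
print(int(frame.sum()), writes, ROWS * COLS // 1024)
```

Events arrive at scattered addresses, so a word-wide write would mostly rewrite bits that did not change; the model makes this visible by counting one write per event.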

Fig. 2: Architecture of a bitcell array for noise removal from a binary image. In filter mode with a $3 \times 3$ kernel, three consecutive word-lines (WL) are enabled together to discharge the bit-line (BL) and bit-line bar (BLB) simultaneously. The BLs and BLBs of three successive columns are connected together separately using transmission gates to implement the kernel. The IMC architecture enables highly parallel noise filtering of cells in two clock cycles.

During a write to a memory cell, one of the lines (either BL or BLB) is driven to 0 V and the other line is connected to VDD. The line driven to 0 V initiates the bit-flip process in the SRAM cell. For instance, the 6T SRAM cell in the left inset of Fig. 2 stores “0”; in order to write “1” into the cell, BL and BLB are connected to VDD and 0 V respectively. Once WL is asserted, the strength of and decides the bit-flip in the cell. If has higher strength than , it will write “1” in the cell. However, the write operation can happen even when the BL is connected to a lower potential than VDD. In that case, the strength of transistor has to be increased further. In read mode, BL and BLB are charged to VDD, and when the WL signal is asserted, one of the lines starts discharging depending on the value stored in the cell.

III-B Implementation of NOMF

We follow the steps of an SRAM cell read and bit-flip to implement the NOMF for noise removal in the memory. The BLs and BLBs of the cells are connected separately using transmission gates, as shown in the right inset of Fig. 2. Throughout the filter operation, the signal S is kept high. The resistance of the transmission gate is chosen such that the following criterion is met:

(3)

where denotes the discharging current of each SRAM cell and is the combination of the metal routing capacitance of BL or BLB and the diffusion capacitance of the access transistors, or . From post-layout simulation after parasitic extraction, fF. The condition in Eq. (3) is maintained so that the discharge profiles of the three BLs of a kernel follow each other with minimal delay; the same applies to the discharge of the BLBs. The proposed IMC architecture takes two clock cycles to filter the noise from a $p \times p$ patch. In the first cycle, BLs and BLBs are charged to VDD. In the next cycle, $p$ successive WLs are asserted, which enables the $p \times p$ cells of a kernel to discharge the BLs and BLBs simultaneously. Since the BL and BLB discharge currents differ due to the different numbers of “0”s and “1”s in a kernel, one of the lines will discharge and reach 0 V faster. This configuration of BL and BLB is similar to write mode, and it flips the minority pixels in the kernel. If the number of “0”s is less than the number of “1”s in a kernel, we refer to “0” as the minority pixel in that patch, and vice versa. In filter mode, we keep all the bank select signals high to activate highly parallel processing in the memory, which filters cells in one pass. We repeat this procedure until all the rows are filtered.

Intuitively, the kernel can be thought of as a circuit where two latches of different strengths and stored values are connected to BL and BLB. Their strengths are determined by the number of “0”s and “1”s stored in the kernel. Whichever latch discharges its line (BL or BLB) faster imposes its stored value on the other.
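The two-cycle procedure can be mimicked by a purely functional model that treats each band of `p` rows as one pass and counts two clock cycles per pass. This is a hedged behavioural sketch, not a circuit simulation: the function name `imc_filter`, the ideal majority decision (no mismatch), and the handling of leftover rows and columns are assumptions.

```python
import numpy as np

def imc_filter(frame, p=3):
    """Behavioural model: filter p rows per pass, two clock cycles per pass."""
    rows, cols = frame.shape
    mem = frame.astype(np.uint8).copy()
    used = cols - cols % p            # columns that form complete kernels
    cycles = 0
    for r in range(0, rows - rows % p, p):
        cycles += 1                   # cycle 1: precharge BLs and BLBs to VDD
        cycles += 1                   # cycle 2: assert p WLs, kernels resolve in parallel
        band = mem[r:r + p, :used]
        blocks = band.reshape(p, -1, p)          # (row in kernel, kernel, column in kernel)
        ones = blocks.sum(axis=(0, 2))           # number of stored "1"s per kernel
        majority = (ones > (p * p) // 2).astype(np.uint8)
        # Minority cells are overwritten by the majority value of their kernel.
        mem[r:r + p, :used] = np.repeat(majority, p)
    return mem, cycles

f = np.zeros((6, 6), dtype=np.uint8)
f[0:2, 0:3] = 1
f[4, 4] = 1
out, cycles = imc_filter(f)
print(out)
print(cycles)
```

Under these assumptions the number of cycles depends only on the number of row bands, not on the number of columns, which reflects the column-parallel nature of the filter mode.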

Input # memory read # memory write # operations # Bits
NN-filt Events
Median Filter EBBI D
NOMF EBBI D D D D
NOMF+IMC EBBI D
  • , , , , .

TABLE I: Comparison of different filters for an image of size, D=

The voltage difference between BL and BLB at any instant of time is represented as

(4)

where and represent the discharging currents of BL and BLB due to the stored “0”s and “1”s in the kernel, respectively. In the best-case scenario, all the bits in the kernel are either “0” or “1” and no bit-flip happens. In contrast, the kernel takes the longest time to decide and to flip the minority pixels when the numbers of “0”s and “1”s differ by one. Moreover, due to discharging current and capacitor mismatch, the majority pixels in a kernel may flip in this worst-case scenario. Such unintended bit-flips shrink the object boundary when the majority pixel is “1” and insert a new object into the frame in the opposite scenario (five “0”s and four “1”s). However, the probability of four noise pixels appearing inside the faulty kernel is negligible. Nevertheless, to mitigate the mismatch effects, the width and length of , , , and are increased by a factor of from the minimum value supported by the process, and low-VT devices are used. We run trials of Monte-Carlo simulation, initializing the kernel with four “1”s and five “0”s, and do not observe any unintentional bit-flip in the kernel (see Fig. 3(c)) at VDD=V. Even though the use of low-VT devices increases the leakage power, the memory can be shut down once processing is done.
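The effect of current mismatch on the worst-case decision can be explored with a toy Monte-Carlo model. This is not the SPICE Monte-Carlo run reported in the paper: it assumes that each cell contributes a constant, Gaussian-distributed discharge current with relative spread `sigma_rel`, that BL and BLB have equal capacitance, and that the line with the larger total current always wins. The function name and all parameter values are hypothetical.

```python
import numpy as np

def wrong_decision_rate(n_zero, n_one, sigma_rel, trials=100_000, seed=0):
    """Fraction of trials in which the minority value wins the BL/BLB race.

    Cells storing "0" discharge BL, cells storing "1" discharge BLB.
    """
    rng = np.random.default_rng(seed)
    i_bl = rng.normal(1.0, sigma_rel, size=(trials, n_zero)).sum(axis=1)
    i_blb = rng.normal(1.0, sigma_rel, size=(trials, n_one)).sum(axis=1)
    majority_is_one = n_one > n_zero
    blb_faster = i_blb > i_bl          # BLB reaches 0 V first, so "1" is imposed
    return float(np.mean(blb_faster != majority_is_one))

# Worst case for a 3 x 3 kernel: four "1"s and five "0"s.
for sigma in (0.05, 0.3):
    print(sigma, wrong_decision_rate(5, 4, sigma))
```

In this simplified picture the error rate is governed by the ratio between the one-cell current difference and the accumulated spread of all nine cells, which is consistent with the motivation for upsizing the transistors to reduce mismatch.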

III-C Performance

The proposed approach has several major advantages: (a) It reduces the dynamic BL power consumption during the SRAM read operation. The BLs and BLBs need to be charged only once to read the 3 or 5 cells along a column, whereas the conventional approach charges them once for every cell read. (b) It does not require a sense amplifier to sense the BL and BLB voltage difference; the kernel inherently acts as a sense amplifier and takes the decision. (c) It does not consume any dynamic BL power during the write operation, since the discharge of BL and BLB is already accounted for by the read operation. (d) Minimal energy is required to flip the minority pixels (only noise and boundary pixels of an object).

Table I compares the proposed NOMF implemented using IMC with state-of-the-art denoising techniques. The nearest neighbour filter (NN-filt) [Gonzalez:2006:DIP:1076432] stores and updates the timestamp of each incoming event using 16 bits per timestamp [Jyotibdha_EBBIOT], whereas the other techniques process an event-based binary image (EBBI) frame. represents the average number of events () during the frame duration (a single pixel can fire multiple times). As discussed earlier, IMC reduces the number of memory reads by a factor of . in the third column represents the fraction of the pixels that need to be flipped for the filter implementation (only noise and boundary pixels of the objects). We observed that the average value of is for image frames. Also, the proposed IMC approach does not require any addition or comparison.
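The storage difference between a timestamp-based filter and a binary-frame filter can be made concrete with a short calculation. The sketch assumes a QVGA sensor and one stored timestamp per pixel for NN-filt; both are assumptions for illustration, since the text only states the 16-bit timestamp width.

```python
def frame_memory_bits(rows, cols, bits_per_pixel):
    """Total storage for one frame-sized buffer."""
    return rows * cols * bits_per_pixel

ROWS, COLS = 240, 320                                  # assumed QVGA sensor
nn_filt_bits = frame_memory_bits(ROWS, COLS, 16)       # 16-bit timestamp, one per pixel (assumed)
ebbi_bits = frame_memory_bits(ROWS, COLS, 1)           # one bit per pixel for the binary image
print(nn_filt_bits, ebbi_bits, nn_filt_bits // ebbi_bits)
```

Under these assumptions the binary-frame representation needs 16 times less storage than the timestamp map.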

Fig. 3: (a) Bit-flip of an SRAM memory cell in a kernel. Since the number of “1” is higher than that of “0”, BLB gets discharged faster and the stored value flips at node A. (b) points Monte-Carlo DC simulation of BL and BLB discharging current at VDD=V, and scenario (worst-case). (c) points Monte-Carlo simulation: unintended bit-flip due to the mismatches across VDD and different number of “1”s and “0” in the kernel. (d) Unit SRAM cell layout.

IV Results

The circuit has been designed in 65 nm CMOS; the unit SRAM cell layout is shown in Fig. 3(d). We initialize one of the kernels of the memory array with four “0”s and five “1”s to simulate the NOMF in SPICE. Fig. 3(a) captures the transient behavior of different nodes of the kernel. When WL goes low, BL and BLB are charged to VDD. Initially node A stores “0”, and when WL is made high, BL and BLB start discharging. Since the number of “1”s is higher than that of “0”s in the kernel, BLB gets discharged faster and the minority cells flip their stored values. Fig. 3(b) shows the points Monte-Carlo DC simulation of the BL and BLB discharging currents in the worst-case scenario at VDD = 1 V. The overlap region in the histogram is responsible for the unintended bit-flips at 1 V. However, MATLAB simulation using the dataset described below shows that the probability of four noisy pixels appearing in a kernel is . Fig. 3(c) shows the points Monte-Carlo simulation of unintended bit-flips across VDD. It can be seen that at V, no unintended bit-flip happens. However, due to lower overdrive and mismatches, unintended bit-flips occur at 1 V in the worst-case scenario (Overall probability=).

In order to validate the proposed NOMF and compare with prior work, we use the same dataset as used in [Jyotibdha_EBBIOT] for a fair comparison. The dataset comprises more than 1 hour of traffic scene recordings with different objects such as cars, buses, trucks, bikes, humans along with the background noise. More details are available in [Jyotibdha_EBBIOT].

Fig. 4: (a) raw binary frame (b) output frame of median filter and (c) proposed NOMF using a kernel. (d) Precision and (e) recall of an overlap-based tracker (OT) [Jyotibdha_EBBIOT] using both filtered images for different IoU values.

Fig. 4(b)-(c) show the MATLAB simulation of the median filter and the proposed NOMF using a kernel on the raw binary image. In terms of noise removal, both filters show similar performance. We also evaluate the recall and precision of an overlap-based tracker (OT) [Jyotibdha_EBBIOT] using both filtered images for different IoU values, as shown in Fig. 4(d)-(e). Here $\mathrm{IoU} = \frac{A_{GT} \cap A_{OT}}{A_{GT} \cup A_{OT}}$, where $A_{GT}$ and $A_{OT}$ denote the area of the manually annotated ground truth and of the region proposed by the OT encapsulating an object, respectively. If the IoU of a proposed region is greater than a threshold, the region is assumed to be a true positive region. Precision (true positive regions / total proposed regions) and recall (true positive regions / total ground truth regions) of the OT are calculated using all the output frames from both filters, and the performance is comparable, as shown in Fig. 4.
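The evaluation metrics can be sketched as follows, assuming axis-aligned bounding boxes given as `(x_min, y_min, x_max, y_max)` and counting a proposal as a true positive if it exceeds the IoU threshold against any ground-truth box. The box format, the matching rule and the function names are assumptions of this example; the tracker's actual matching procedure is described in [Jyotibdha_EBBIOT].

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(proposals, ground_truth, threshold=0.5):
    """Precision and recall as defined in the text, for one IoU threshold."""
    tp = sum(1 for p in proposals
             if any(iou(p, g) > threshold for g in ground_truth))
    precision = tp / len(proposals) if proposals else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

gt = [(0, 0, 2, 2)]
proposed = [(0, 0, 2, 2), (10, 10, 12, 12)]
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(precision_recall(proposed, gt, threshold=0.5))
```

Sweeping `threshold` over a range of IoU values yields curves of the kind plotted in Fig. 4(d)-(e).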

Table II compares the performance of the proposed NOMF implemented using IMC with a spatiotemporal filter [7168735] and a fully-digital median filter synthesized in the same process for a fair comparison. The spatiotemporal filter works on the continuous events from the NVS, whereas the proposed NOMF and the fully-digital implementation process an event-based binary image following [Jyotibdha_EBBIOT]. Latency and energy are estimated at MHz on the post-layout netlist. The synergy between approximate computing and IMC reduces the execution time to 0.6-0.8 µs/frame (the reciprocal of the 1.25-1.66 frames/µs peak throughput) and enables approximately 2000× energy savings compared to the digital counterpart, of which approximately 222× is contributed by IMC and the remaining factor of roughly 9× by approximation.
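The headline ratios can be cross-checked against the per-bit figures in Table II and the throughput quoted in the abstract. The short calculation below uses only those published numbers; the variable names are illustrative, and the split between IMC and approximation is derived here by simple division rather than taken from the paper.

```python
# Per-bit figures copied from Table II (latency in the table's own unit, energy in pJ).
median_digital = {"latency": 95.0, "energy_pj": 228.0}
nomf_imc = {"latency": 0.01, "energy_pj": 0.11}

energy_ratio = median_digital["energy_pj"] / nomf_imc["energy_pj"]
latency_ratio = median_digital["latency"] / nomf_imc["latency"]

# The abstract quotes a peak throughput of 1.25-1.66 frames/us;
# the frame time is its reciprocal.
frame_time_us = (1 / 1.66, 1 / 1.25)

# The abstract attributes roughly 222x of the saving to IMC.
approx_factor = energy_ratio / 222.0

print(f"energy ratio  : {energy_ratio:.0f}x")
print(f"latency ratio : {latency_ratio:.0f}x")
print(f"frame time    : {frame_time_us[0]:.2f}-{frame_time_us[1]:.2f} us")
print(f"approximation : {approx_factor:.1f}x")
```

The residual factor attributed to approximation comes out close to 9, the number of pixels in a 3 x 3 kernel, which agrees with the read-reduction argument for NOMF.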

Filter | Process | Area/Cell (µm²) | Latency/bit (ns) | Energy/bit (pJ)
Spatio-temporal Filter [7168735] | 180nm | 400 | 10 | 20
Median Filter | 65nm | 4.89 | 95 | 228
Proposed NOMF+IMC | 65nm | 3.65 | 0.01 | 0.11
TABLE II: Comparison with different filter implementations
Parameter | This Work | [Biswas2018] | [Kang2018] | [8662392]
Technology | 65nm | 65nm | 65nm | 55nm
Algorithm | Filter | CNN | k-NN | CNN
SRAM size | 75kb | 16kb | 128kb | 3.75kb
Throughput (GOPS) | 85.3-153 | 8 | 10.2 | -
Energy Efficiency (TOPS/W) | 11.3-20 | 14.7-40.3 | 1.94 | 18.37-72.1
TABLE III: Comparison of different published IMC works

Table III compares the proposed approach with recently published IMC works and demonstrates an order of magnitude improvement in throughput due to the highly parallel processing. Assuming operations (addition) for the calculation of a kernel, the energy efficiency is comparable with the state of the art.

V Conclusion

In this work, we present an approximate and in-memory computing framework for binary image denoising. The proposed approach is tested with binary image frames from a DAVIS sensor setup and achieves approximately 2000× energy savings compared to conventional von Neumann digital approaches. The massively parallel architecture reduces the processing time to 0.6-0.8 µs per frame (a peak throughput of 1.25-1.66 frames/µs) and leaves enough time for the subsequent processing stages.

References