I Introduction
Bio-inspired neuromorphic vision sensors (NVS) [Posch2014][Berner2013] have gained traction among researchers due to their low bandwidth and energy requirements, as well as their recent commercial availability [Vision]. Unlike a traditional frame-based camera, an NVS detects changes of contrast at a pixel and outputs an event corresponding to the (x,y) coordinates of that pixel, a scheme known as Address Event Representation (AER) [Boahen2004]. Hence, AER contains less superfluous information due to its asynchronous event-based encoding and finds application in robotics [Delbruck2013], environment monitoring, traffic surveillance [Litzenberger2006] and object tracking [1542972]. However, the raw image data is corrupted with noise, and its removal is one of the most important preprocessing tasks for region proposal, object tracking and classification [Padala2018]. While earlier approaches use an event-based denoising method termed the nearest neighbour filter (NNfilt) [nnfilt], recent hybrid frame-event approaches using median filtering outperform NNfilt in terms of performance, memory requirement and computation [Jyotibdha_EBBIOT]. However, the traditional Von Neumann architecture remains a bottleneck in terms of latency and energy dissipation for hardware implementation of neuromorphic processing [jetcas_review].
To address this, the in-memory computing (IMC) paradigm has been proposed, in which processing is performed inside the memory, and it shows unprecedented performance benefits compared to its Von Neumann counterpart. IMC not only enables highly parallel processing through simultaneous access to multiple cells but also eliminates the energy consumed in transferring data between memory and processor [Biswas2018_thesis]. Several works have shown IMC to be effective. For example, [7875410] proposed T SRAM based linear classifier using current summation and achieved x energy savings on the MNIST dataset compared to the digital implementation. Similarly, [Biswas2018] implemented T SRAM based binary-weighted Convolutional Neural Networks (CNN) leveraging charge distribution and attained x energy benefit on the MNIST dataset. While most IMC efforts target the post-processing of the image, in this paper we use IMC for efficient denoising of the event-based binary image (EBBI), since this method has been shown to outperform purely event-based ones [Jyotibdha_EBBIOT].
Approximate computing is another avenue for energy reduction in applications such as pattern recognition or multimedia processing, where slight degradation in the calculation does not affect the final outcome or the output quality remains within an acceptable range. Approximation can be introduced at the circuit [Lu2004] [approx_float], software [Bose_2019] or system level [Raha2018xxx]. Since a slight change of the object boundary has little impact on region proposal, object tracking or classification performance, we propose to use approximate computing while filtering an image frame. The details of the algorithm and VLSI implementation are presented in the following sections.
II Overview: Median Filter Algorithm
A median filter is a nonlinear filter that replaces the center pixel of an kernel by the median value of the pixels covered by the kernel. The output of the median filter at location (i, j) can be expressed as Eq. (1), where i, j .
(1)
Implementation of the median filter for a grayscale image involves sorting the pixel values. In contrast, median filtering of a binary image is simple: it requires only a counter that adds up the occurrences of “1” in an patch and assigns “1” to the middle pixel if the number of “1”s is higher than that of “0”s, and “0” otherwise. The whole operation can be written as
(2) 
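As a hedged illustration of the majority rule described above, the binary median can be written as a thresholded sum. The notation here is an assumption and not necessarily the paper's own: $x$ is the binary input image, $y$ the filtered output, and $n$ an odd kernel width.

```latex
\[
y(i,j) =
\begin{cases}
1, & \text{if } \displaystyle\sum_{k=-\frac{n-1}{2}}^{\frac{n-1}{2}}
      \sum_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}} x(i+k,\, j+l) > \frac{n^{2}}{2},\\[6pt]
0, & \text{otherwise.}
\end{cases}
\]
```

Because $n^{2}$ is odd, the sum can never equal $n^{2}/2$, so the vote has no ties.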
In a traditional median filter, an kernel slides over the image in an overlapping fashion with a stride of 1, as shown in Fig. 1(a). Hence, fetching and summing the binary image bit by bit, followed by a comparison in the processing unit and a write operation to the memory, demands +1 clock cycles and the associated energy for each pixel. However, since adjacent pixels of an image have similar characteristics, we can apply the decision of an kernel to all of its pixels instead of only the center one. This is equivalent to using a stride equal to the kernel width (Fig. 1(b)), resulting in the non-overlap median filter (NOMF) that we use in this work. While the proposed approach changes the object boundary slightly (with a marginal effect on tracking, as shown later), it reduces the processing and memory read access time by a factor of and enables the same memory to be used to store the filtered image. It also enables IMC-based denoising, as shown next. However, the NOMF approach does not reduce the memory write cycles and energy. Table I captures the resource usage of both approaches for an image of size .
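The difference between the overlapping filter and the NOMF can be sketched in software. The following is a minimal pure-Python illustration, not the authors' implementation; the function names, the default kernel width `k=3`, and the choice to leave border pixels and partial tiles unchanged are all assumptions made for the example.

```python
def median_filter_binary(img, k=3):
    """Overlapping (stride-1) binary median filter.

    img is a list of rows of 0/1 values. Border pixels that lack a full
    k x k neighbourhood are copied unchanged (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    r = k // 2
    out = [row[:] for row in img]
    for i in range(r, h - r):
        for j in range(r, w - r):
            ones = sum(img[y][x]
                       for y in range(i - r, i + r + 1)
                       for x in range(j - r, j + r + 1))
            out[i][j] = 1 if 2 * ones > k * k else 0
    return out


def nomf_binary(img, k=3):
    """Non-overlap median filter: one majority vote per k x k tile,
    written back to every pixel of that tile.

    Rows and columns that do not fill a complete tile are copied unchanged
    (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(0, h - h % k, k):
        for j in range(0, w - w % k, k):
            ones = sum(img[y][x]
                       for y in range(i, i + k)
                       for x in range(j, j + k))
            bit = 1 if 2 * ones > k * k else 0
            for y in range(i, i + k):
                for x in range(j, j + k):
                    out[y][x] = bit
    return out
```

In this sketch the overlapping filter performs one vote per pixel, whereas the NOMF performs one vote per tile, which is where the reduction in read accesses comes from.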
III In-memory Denoise: Hardware Implementation
III-A Architecture
Figure 2 shows the architecture of a SRAM array for image denoising (QVGA or lower resolution) applicable to NVS such as [5648367] [Brandli:200837]. It operates in two modes: (a) normal read and write mode and (b) filter mode. Unlike a conventional SRAM, this memory does not write all the bits of a byte or a word simultaneously, since it is targeted at event-based cameras and events are not contiguous. Therefore, single-bit writing circuitry is implemented for normal write mode. In order to reduce the dynamic bitline power consumption [Chandrakasan:1995:LPD:560639], the whole memory is divided into banks having cells in each bank except the last one. In filter mode, the kernel can be configured as either a or a (enabling successive WLs and connecting consecutive BLs and BLBs separately, ) patch. To have almost the same WL signal delay for each cell of a kernel, columns are selected for each bank. In normal SRAM write mode, the global wordline (GWL) and local wordline (LWL) blocks enable one of the wordlines (WL), and the column decoder writes the data and its complement on the bitline (BL) and bitline bar (BLB) respectively. The rest of the BLs and BLBs are charged to VDD by the half-select (HS) driver to mitigate the read disturb issue of the half-selected cells in the selected bank (cells that are selected along the row but not along the column).
When writing a memory cell, one of the lines (either BL or BLB) is driven to V and the other line is connected to VDD. The line connected to V initiates the bit-flip process in the SRAM cell. For instance, the 6T SRAM cell in the left inset of Fig. 2 stores “0”, and in order to write “1” in the cell, BL and BLB are connected to VDD and V respectively. Once WL is asserted, the relative strength of and decides the bit flip in the cell. If has higher strength than , it will write “1” in the cell. However, the write operation can succeed even when BL is connected to a lower potential than VDD; in that case, the strength of transistor has to be increased further. In read mode, BL and BLB are charged to VDD, and when the WL signal is asserted, one of the lines starts discharging depending on the value stored in the cell.
III-B Implementation of NOMF
We follow the steps of an SRAM cell read and bit flip to implement the NOMF for noise removal in the memory. The BLs and BLBs of the cells are connected separately using transmission gates, as shown in the right inset of Fig. 2. Throughout the filter operation, the signal S is kept high. The resistance of the transmission gate, , is chosen such that the following criterion is met:
(3) 
where denotes the discharging current of each SRAM cell and is a combination of the metal routing capacitance of BL or BLB and the diffusion capacitance of the access transistors, or . From post-layout simulation after parasitic extraction, fF. The condition in Eq. (3) is maintained so that the discharge profiles of the three BLs of a kernel follow each other with minimal delay; the same applies to the BLB discharge. The proposed IMC architecture takes two clock cycles to filter the noise from a patch. In the first cycle, BLs and BLBs are charged to VDD. successive WLs are asserted in the next cycle, which enables () cells to discharge BLs and BLBs simultaneously. Since the BL and BLB discharge currents differ due to the different numbers of “0”s and “1”s in a kernel, one of the lines will discharge and reach V faster. This configuration of BL and BLB is similar to write mode, and it flips the minority pixels in the kernel. If the number of “0”s is less than the number of “1”s in a kernel, we refer to “0” as the minority pixel in that patch, and vice versa. In filter mode, we keep all the bank select signals high to activate highly parallel processing in the memory, and it filters cells in one pass. We repeat this procedure until all the rows are filtered.
Intuitively, the kernel can be thought of as a circuit where two latches of different strengths and stored values are connected to BL and BLB. Their strengths are determined by the number of “0”s and “1”s stored in the kernel. Whichever latch discharges its line (BL or BLB) faster imposes its stored value on the other.
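The race between BL and BLB described above can be illustrated with a simple behavioural model. This is only a sketch under strong assumptions that are not taken from the paper: every cell contributes the same constant discharge current, the shared line capacitance is a single lumped value, and the first line to reach 0 V decides the kernel. The default values of `i_cell` and `c_line` are hypothetical placeholders, not extracted circuit parameters.

```python
def kernel_decision(bits, i_cell=20e-6, c_line=30e-15, vdd=1.0):
    """Idealised model of one filter-mode kernel.

    bits   : flat list of the 0/1 values stored in the kernel cells
    i_cell : assumed discharge current per cell in amperes (hypothetical)
    c_line : assumed lumped line capacitance in farads (hypothetical)
    Returns (new_bits, t_bl, t_blb) with the discharge times in seconds.
    """
    zeros = bits.count(0)
    ones = len(bits) - zeros
    i_bl = zeros * i_cell   # cells storing "0" discharge BL
    i_blb = ones * i_cell   # cells storing "1" discharge BLB

    def time_to_ground(current):
        # Constant-current discharge of a capacitor from vdd to 0 V.
        return float("inf") if current == 0 else c_line * vdd / current

    t_bl = time_to_ground(i_bl)
    t_blb = time_to_ground(i_blb)
    if t_blb < t_bl:
        new_bits = [1] * len(bits)   # BLB low, BL high: "1" is written
    elif t_bl < t_blb:
        new_bits = [0] * len(bits)   # BL low, BLB high: "0" is written
    else:
        new_bits = list(bits)        # exact tie: leave the kernel unchanged
    return new_bits, t_bl, t_blb


def line_voltage_difference(bits, t, i_cell=20e-6, c_line=30e-15):
    """V_BL - V_BLB at time t under the same constant-current assumption,
    valid only before either line has reached 0 V."""
    zeros = bits.count(0)
    ones = len(bits) - zeros
    return (ones - zeros) * i_cell * t / c_line
```

In this model the slowest decision occurs when the counts of “0”s and “1”s differ by one, since the voltage difference then grows at the rate of a single cell current, which matches the qualitative description of the worst case.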
Input  # memory read  # memory write  # operations  # Bits  
NNfilt  Events  
Median Filter  EBBI  D  
NOMF  EBBI  D  D  D  D 
NOMF+IMC  EBBI  D 

, , , , .
The voltage difference between BL and BLB at any instant of time, , is represented as
(4) 
where and represent the discharging currents of BL and BLB due to the stored “0”s and “1”s in the kernel, respectively. In the best-case scenario, all the bits in the kernel are either “0” or “1” and no bit flip happens. In contrast, the kernel takes the longest time to decide and flip the minority pixels when the difference between the number of “0”s and “1”s is one. However, due to discharging-current and capacitor mismatch, the majority pixels in a kernel may flip in the worst-case scenario. Such unintended bit flips shrink the object boundary when the majority pixel is “1” and insert a new object in the frame in the opposite scenario ( “0”s and “1”s). However, the probability of noise pixels appearing inside the faulty kernel is negligible. Nevertheless, to mitigate the mismatch effects, the width and length of , , , and are increased by a factor of from the minimum value supported by the process, and low-VT devices are used. We run trials of Monte Carlo simulation, initializing the kernel with four “1”s and five “0”s, and do not observe any unintentional bit flip in the kernel (see Fig. 3(c)) at VDD=V. Even though the use of low-VT devices increases the leakage power, we can shut down the memory once processing is done.
III-C Performance
The proposed approach has several major advantages: a) It reduces the dynamic BL power consumption during the SRAM read operation. BLs and BLBs need to be charged only once to read (3 or 5) cells along the column, compared to the conventional approach where the requirement is times. b) It does not require any sense amplifier to sense the BL and BLB voltage difference. The kernel inherently acts as a sense amplifier and takes the decision. c) It does not consume any dynamic BL power during the write operation, since the discharges of BL and BLB belong to the read operation. d) Minimal energy is required to flip the minority pixels (only noise and boundary pixels of an object).
Table I compares the proposed NOMF implemented using IMC with state-of-the-art denoising techniques. The nearest neighbour filter (NNfilt) [Gonzalez:2006:DIP:1076432] stores and updates the timestamp of an incoming event using (=16) bits per timestamp [Jyotibdha_EBBIOT], whereas the other techniques process an event-based binary image (EBBI) frame. represents the average number of events () during the frame duration (a single pixel can fire multiple times). As discussed earlier, IMC reduces the number of memory reads by a factor of . in the third column represents the fraction of the pixels that need to be flipped for the filter implementation (only noise and boundary pixels of the objects). We observed that the average value of is for image frames. Also, the proposed IMC approach does not require any addition or comparison.
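The fraction of pixels that must be flipped can be estimated in software by comparing a raw frame with its filtered version. The helper below is a hypothetical sketch of that bookkeeping, with frames represented as lists of 0/1 rows; it is not taken from the paper.

```python
def flip_fraction(raw, filtered):
    """Fraction of pixels whose value differs between the raw binary frame
    and the filtered frame (both given as equally sized lists of 0/1 rows)."""
    total = sum(len(row) for row in raw)
    flipped = sum(a != b
                  for raw_row, filt_row in zip(raw, filtered)
                  for a, b in zip(raw_row, filt_row))
    return flipped / total
```

Averaging this value over many frames would give an estimate of the write activity that the filter incurs.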
IV Results
The circuit has been designed in 65 nm CMOS; the layout of the unit SRAM cell is shown in Fig. 3(d). We initialize one of the kernels of the memory array with four “0”s and five “1”s to simulate the NOMF in SPICE. Fig. 3(a) captures the transient behavior of different nodes of the kernel. When WL goes low, BL and BLB are charged to VDD. Initially node A stores “0”, and when WL is made high, BL and BLB start discharging. Since the number of “1”s is higher than that of “0”s in the kernel, BLB discharges faster and the minority cells flip their stored values. points Monte Carlo DC simulation of the BL and BLB discharging currents in the worst-case scenario and at V is shown in Fig. 3(b). The overlap region in the histogram is responsible for the unintended bit flips at 1V. However, MATLAB simulation using the dataset described below shows that the probability of four noisy pixels appearing in a kernel is . Fig. 3(c) shows points Monte Carlo simulation of unintended bit flips across VDD. It can be seen that at V, unintended bit flips do not happen. However, due to lower overdrive and mismatches, unintended bit flips occur at V in the worst-case scenario (Overall probability=).
In order to validate the proposed NOMF and compare with prior work, we use the same dataset as used in [Jyotibdha_EBBIOT] for a fair comparison. The dataset comprises more than 1 hour of traffic scene recordings with different objects such as cars, buses, trucks, bikes, humans along with the background noise. More details are available in [Jyotibdha_EBBIOT].
Fig. 4(b)(c) show the MATLAB simulation of the median filter and the proposed NOMF using a kernel on the binary raw image. In terms of noise removal, both filters show similar performance. We also evaluate the performance (recall and precision) of an overlap-based tracker (OT) [Jyotibdha_EBBIOT] using both filtered images for different IoU values, as shown in Fig. 4(d)(e). , where and denote the area of the manually annotated ground truth and the region proposed by the OT encapsulating an object, respectively. If the IoU of a proposed region is greater than a threshold, the region is assumed to be a true positive region. Precision (true positive regions / total proposed regions) and recall (true positive regions / total ground truth regions) of the OT are calculated using all the output frames from both filters, and the performance is comparable, as shown in Fig. 4.
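The evaluation metrics described above can be sketched as follows. This is an illustrative example rather than the evaluation code used in the paper: it assumes axis-aligned boxes given as `(x1, y1, x2, y2)` tuples, uses the standard intersection-over-union definition, and uses a hypothetical default threshold of 0.5.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(ground_truth, proposals, threshold=0.5):
    """Precision and recall as defined in the text: a proposed region is a
    true positive if its IoU with some ground-truth box exceeds threshold.

    Note: this simple count does not enforce one-to-one matching, so several
    proposals may match the same ground-truth box (an assumption of this sketch).
    """
    true_pos = sum(1 for p in proposals
                   if any(iou(g, p) > threshold for g in ground_truth))
    precision = true_pos / len(proposals) if proposals else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Sweeping `threshold` over a range of IoU values would produce precision and recall curves of the kind compared for the two filters.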
Table II compares the performance of the proposed NOMF implemented using IMC with a spatiotemporal filter [7168735] and a fully digital median filter synthesized in the same process for a fair comparison. The spatiotemporal filter works on the continuous events from the NVS, whereas the proposed NOMF and the fully digital implementation process an event-based binary image following [Jyotibdha_EBBIOT]. Latency and energy are estimated at MHz on the post-layout netlist. The synergy between approximate computing and IMC reduces the execution time to /frame and enables energy saving compared to the digital counterpart, where the contributions of approximation and IMC are and respectively.
Process  Area/Cell ()  Latency() /bit  Energy(pJ) /bit  
Spatiotemporal Filter [7168735]  180nm  400  10  20 
Median Filter  65nm  4.89  95  228 
Proposed NOMF+IMC  65nm  3.65  0.01  0.11 
This Work  [Biswas2018]  [Kang2018]  [8662392]  
Technology  65nm  65nm  65nm  55nm 
Algorithm  Filter  CNN  kNN  CNN 
SRAM size  75kb  16kb  128kb  3.75kb 
Throughput (GOPS)  85.3153  8  10.2   
Energy Efficiency(TOPS/W)  11.320  14.740.3  1.94  18.3772.1 
Table III compares the proposed approach with the recently published IMC works and demonstrates an order of magnitude improvement in throughput due to the highly parallel processing. Assuming operations (addition) for the calculation of a kernel, the energy efficiency is comparable with other state of the art.
V Conclusion
In this work, we present an approximate, in-memory computing framework for binary image denoising. The proposed approach is tested with binary image frames from a DAVIS sensor setup and achieves energy saving compared to conventional Von Neumann digital approaches. The massively parallel architecture reduces the processing time to per frame and leaves enough time for the subsequent processing stages.