DFSynthesizer: Dataflow-based Synthesis of Spiking Neural Networks to Neuromorphic Hardware

08/04/2021
by   Shihao Song, et al.
Drexel University

Spiking Neural Networks (SNNs) are an emerging computation model that uses event-driven activation and bio-inspired learning algorithms. SNN-based machine-learning programs are typically executed on tile-based neuromorphic hardware platforms, where each tile consists of a computation unit called a crossbar, which maps neurons and synapses of the program. However, synthesizing such programs on an off-the-shelf neuromorphic hardware is challenging because of the inherent resource and latency limitations of the hardware, which impact both model performance, e.g., accuracy, and hardware performance, e.g., throughput. We propose DFSynthesizer, an end-to-end framework for synthesizing SNN-based machine learning programs to neuromorphic hardware. The proposed framework works in four steps. First, it analyzes a machine-learning program and generates an SNN workload using representative data. Second, it partitions the SNN workload and generates clusters that fit on the crossbars of the target neuromorphic hardware. Third, it exploits the rich semantics of Synchronous Dataflow Graphs (SDFGs) to represent a clustered SNN program, allowing for performance analysis in terms of key hardware constraints such as the number of crossbars, the dimension of each crossbar, the buffer space on tiles, and the tile communication bandwidth. Finally, it uses a novel scheduling algorithm to execute clusters on the crossbars of the hardware, guaranteeing hardware performance. We evaluate DFSynthesizer with 10 commonly used machine-learning programs. Our results demonstrate that DFSynthesizer provides a much tighter performance guarantee than current mapping approaches.


1. Introduction

Spiking Neural Network (SNN) is an emerging computing model that uses spike-based computations and bio-inspired learning algorithms (Maass, 1997). In an SNN, pre-synaptic neurons communicate information encoded in spike trains to post-synaptic neurons via synapses (see Fig. 1). Performance, e.g., the accuracy of an SNN model, is assessed in terms of the inter-spike interval (ISI), which is defined as the inverse of the mean firing rate of the neurons.

Figure 1. Integration of spike trains at the post-synaptic neuron from four pre-synaptic neurons in a Spiking Neural Network (SNN). Each spike is a voltage waveform with a duration on the order of milliseconds.

SNNs are typically executed on neuromorphic hardware platforms such as DYNAP-SE (Moradi et al., 2017), TrueNorth (DeBole et al., 2019), and Loihi (Davies et al., 2018). These hardware platforms are designed as a tile-based architecture with a shared, hierarchical interconnect to facilitate inter-tile communication (see Fig. 2) (Catthoor et al., 2018). Each tile consists of a crossbar for mapping neurons and synapses, and input and output buffer space for communicating spikes over the interconnect. A crossbar is a 2D organization of horizontal and vertical wires, where the horizontal wires are connected to pre-synaptic neurons and the vertical wires are connected to post-synaptic neurons. Non-Volatile Memory (NVM) cells are placed at the crosspoints of each crossbar to store the synaptic weights (Mallik et al., 2017; Burr et al., 2017). [Footnote 1: Beyond neuromorphic computing, NVMs are also used as main memory for conventional computing using shared-memory computers (Song et al., 2021a, 2019; Song and Das, 2020b; Song et al., 2020b, d).]

Figure 2. A tile-based neuromorphic architecture (Catthoor et al., 2018), which is representative of many neuromorphic platforms such as DYNAP-SE (Moradi et al., 2017), TrueNorth (DeBole et al., 2019), and Loihi (Davies et al., 2018).

Energy consumed by neuromorphic hardware can be several orders of magnitude lower than that of a conventional machine-learning accelerator such as Eyeriss (Chen et al., 2016). This is due to the low-power VLSI implementation of analog neurons (Indiveri, 2003), the low-power and high-density NVM-based synaptic storage (Burr et al., 2017), and the distributed computing and storage architecture built from crossbars. Given these advantages, a neuromorphic hardware can implement machine-learning tasks for power-constrained platforms such as embedded systems and edge nodes of the Internet-of-Things (IoT) (Atzori et al., 2010).

Unlike conventional von Neumann computing systems, where CPUs compute by exchanging data with a centralized main memory, a neuromorphic hardware distributes its computation units (i.e., the neurons) and storage units (i.e., the synapses) across crossbars. Synthesizing, i.e., compiling and mapping, a machine-learning program onto such hardware is therefore challenging. It is important to properly partition a large SNN model such that it can be mapped efficiently to the underlying resources. Additionally, each crossbar limits how many pre-synaptic connections are allowed per post-synaptic neuron and how much buffer space is available to send and receive spikes over the interconnect. These hardware limitations impact both model accuracy and hardware performance such as throughput, latency, and energy consumption.

We develop DFSynthesizer, a systematic and end-to-end framework to analyze and map machine-learning programs to state-of-the-art neuromorphic hardware while guaranteeing performance. Following are our key contributions. [Footnote 2: Contributions 2, 3, and 4 appeared in our prior work (Song et al., 2020a). This work introduces contributions 1, 5, and 6.]

  • Contribution 1. We present an approach to analyze machine-learning programs and generate SNN workload using representative data. Our framework allows workload generation with only a modest impact on model performance.

  • Contribution 2. We present an approach to decompose and partition complex SNN workloads and generate clusters of neurons and synapses such that each cluster can fit onto the resources of a crossbar in the hardware.

  • Contribution 3. We exploit the rich semantics of Synchronous Dataflow Graphs (SDFGs) (Lee and Messerschmitt, 1987) to represent clustered SNN programs. This allows the SNN's performance, e.g., throughput, to be estimated on the hardware as a function of key properties such as the number of crossbars, the dimension of each crossbar, the buffer space on tiles, and the tile communication bandwidth.

  • Contribution 4. We develop a novel scheduling algorithm based on Self-Timed Execution for executing clusters on crossbars of a neuromorphic hardware, providing performance guarantee in scenarios with dynamic resource availability.

  • Contribution 5. We propose a design-space exploration framework incorporating DFSynthesizer that allows the Pareto-space of different SNN mappings to hardware to be explored while considering other hardware metrics such as energy, latency, and reliability.

  • Contribution 6. We evaluate DFSynthesizer using 10 machine learning programs that are representative of the three most commonly used neural network classes: convolutional neural network (CNN), multi-layer perceptron (MLP), and recurrent neural network (RNN).

2. Scope and High-Level Overview of DFSynthesizer

DFSynthesizer is developed for supervised machine learning, where a machine-learning model is first trained using representative data from the field. Machine learning inference refers to generating output from the trained model by feeding live data. To improve energy efficiency, the inference is performed on a neuromorphic hardware. Once deployed on the hardware, the model is expected to perform inference in real-time on a continuous basis from data collected using sensors. [Footnote 3: Camera sensors are used for image classification models, e.g., LeNet, AlexNet, and VGG16, while electrocardiogram sensors are used for heart-rate classification and estimation models. See our evaluation setup in Section 7.] Therefore, a key performance metric for neuromorphic hardware performing real-time inference is throughput, defined as the number of frames processed per unit time, where a frame is an individual image (for image-based models) or a window of time-series data. [Footnote 4: By maximizing throughput, DFSynthesizer minimizes the time to process each frame on the neuromorphic inference hardware, which makes DFSynthesizer applicable to both real-time and non-real-time applications.]

Figure 3 illustrates the proposed end-to-end framework of DFSynthesizer, which synthesizes, i.e., compiles and maps, a machine-learning program to a neuromorphic hardware in four steps. First, it analyzes a machine-learning program written in a high-level language such as Python or C/C++ to generate the SNN workload (Section 3). Second, it compiles the SNN workload to an intermediate representation format (h5 and json), performing spatial decomposition and clustering so that the workload fits onto the resources of the crossbars (Section 4). Third, it uses a Synchronous Dataflow Graph (SDFG) to represent the clustered SNN (in an XML representation), allocating resources to the clusters while considering hardware resource constraints (Section 5). Finally, it schedules the SDFG representation of the clustered SNN onto the hardware crossbars, guaranteeing performance (Section 6).

Figure 3. High-level overview of DFSynthesizer. A machine learning program is analyzed and mapped to the hardware using the proposed 4-step methodology.

3. Program Analysis and Workload Generation

In this step, a machine-learning program is analyzed to generate its workload. In the following, we discuss the steps involved in the workload generation.

3.1. Workflow for Workload Generation

Figure 4 summarizes the workflow of the workload generation step of DFSynthesizer, where a machine-learning program is analyzed to generate its workload, which is then used to map the application to a neuromorphic hardware.

Figure 4. Workflow of the workload generation step of DFSynthesizer.

DFSynthesizer can incorporate both Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs) in its workflow. At a high level, the proposed workflow consists of a model training component followed by model analysis. In the following, we elaborate on these components.

3.2. Model Training

3.2.1. Training Artificial Neural Networks

DFSynthesizer's frontend is integrated with Keras (Gulli and Pal, 2017), which is used to define a model and train it on a database. Keras utilizes the Tensorflow backend (Abadi et al., 2016). DFSynthesizer also supports other frameworks such as PyTorch (Paszke et al., 2019). To demonstrate the capabilities of DFSynthesizer, we evaluate it with three Convolutional Neural Network (CNN) architectures: 1) LeNet (LeCun and others, 2015), trained on the MNIST handwritten digit dataset (Deng, 2012); 2) AlexNet (Krizhevsky et al., 2012), trained on the ImageNet dataset (Deng et al., 2009); and 3) VGGNet (Simonyan and Zisserman, 2014), also trained on the ImageNet dataset. These models are derived from the MLPerf benchmark suite (Reddi et al., 2020) and instantiated in Keras. We use a Lambda workstation with two GPUs (see our evaluation setup in Section 7) to train these models.

3.2.2. Training Spiking Neural Networks

DFSynthesizer's frontend supports training SNN models using PyCARL (Balaji et al., 2020a), a Python frontend to CARLsim (Chou et al., 2018). CARLsim facilitates SNN simulations using CPUs and multiple GPUs. PyCARL is designed to integrate with PyNN (Davison et al., 2009), which provides a common frontend to different SNN simulators with various degrees of neurobiological detail. We use CARLsim for model training. CARLsim's support for built-in, biologically realistic neuron, synapse, current, and emerging learning models, as well as continuous integration and testing, makes it an easy-to-use and powerful simulator of biologically plausible SNN models. DFSynthesizer can also use other SNN simulators such as Brian (Goodman and Brette, 2009), NEST (Eppler et al., 2009), and NEURON (Hines and Carnevale, 1997) for model training.

3.3. Model Analysis

3.3.1. Model Parsing and Conversion

Unfortunately, ANN models cannot be executed directly on event-driven neuromorphic hardware platforms such as DYNAP-SE (Moradi et al., 2017), TrueNorth (DeBole et al., 2019), and Loihi (Davies et al., 2018). Recently, many tools have been proposed to convert ANN operations to SNNs. Examples include Nengo (Bekolay et al., 2014), N2D2 (Bichler et al., 2017), and SNNToolBox (Rueckauer et al., 2016). A common limitation of these toolboxes is that they are open-loop converters, meaning that the conversion is performed considering performance degradation only. In our prior work (Balaji et al., 2018), we proposed a closed-loop conversion mechanism, where the conversion of analog operations to their spiking equivalents is performed considering the energy consumption on hardware. These conversion steps are briefly discussed below. [Footnote 5: The conversion framework was introduced in (Balaji et al., 2018) for converting the CNN-based HeartClass application to its equivalent SNN representation. We use this application to evaluate DFSynthesizer. Additionally, we have extended the conversion framework with other key functionalities such as Layer Flattening, Concatenation, Binary Weight Activation, and Non-Zero Biases. These new functionalities allow the conversion framework to convert state-of-the-art CNN architectures such as LeNet, AlexNet, and VGG16, which are used to evaluate DFSynthesizer.]

  1. ReLU Activation Functions: This is implemented as the approximate firing rate of a leaky integrate and fire (LIF) neuron.

  2. Bias: A bias is represented as a constant input current to a neuron, the value of which is proportional to the bias of the neuron in the corresponding analog model.

  3. Weight Normalization: This is achieved by setting a factor to control the firing rate of spiking neurons.

  4. Softmax: To implement softmax, an external Poisson spike generator is used to generate spikes proportional to the weighted sum accumulated at each neuron.

  5. Max and Average Pooling: To implement max pooling, the neuron that fires first is considered the winning neuron, and its responses are forwarded to the next layer, suppressing the responses of the other neurons in the pooling function. To implement average pooling, the average firing rate (obtained from the total spike count) of the pooling neurons is forwarded to the next layer.

We have extended our framework with the following new functionalities to allow for the conversion of CNN architectures such as LeNet, AlexNet, and VGGNet to their spiking counterparts.

  1. 1-D Convolution: The 1-D convolution is implemented to extract patterns from inputs in a single spatial dimension. A 1xn filter, called a kernel, slides over the input while computing the element-wise dot-product between the input and the kernel at each step.

  2. Residual Connections: Residual connections are implemented to convert the residual block used in CNN models such as ResNet. Typically, the residual connection connects the input of the residual block directly to the output neurons of the block, with a synaptic weight of ‘1’. This allows for the input to be directly propagated to the output of the residual block while skipping the operations performed within the block.

  3. Flattening: The flatten operation converts the 2-D output of the final pooling operation into a 1-D array. This allows for the output of the pooling operation to be fed as individual features into the decision-making fully connected layers of the CNN model.

  4. Concatenation: The concatenation operation, also known as a merging operation, is used as a channel-wise integration of the features extracted from two or more layers into a single output.

Table 1 reports the accuracy impact due to the SNN conversion of three state-of-the-art supervised CNN models. These accuracy numbers are obtained from CARLsim (Chou et al., 2018), which allows functional simulation and performance estimation of SNN-based applications. We use these three converted CNN models to evaluate DFSynthesizer (See Section 7).

Application | Original Top-1 Accuracy (%) | SNN Top-1 Accuracy (%)
LeNet | 94.98% | 94.08%
AlexNet | 74.1% | 71.7%
VGG16 | 93.56% | 91.62%
Table 1. Accuracy impact due to conversion of three state-of-the-art CNN models to their SNN equivalent. The original accuracy numbers are obtained by simulating these architectures in Keras (Gulli and Pal, 2017) with Tensorflow backend (Abadi et al., 2016). The converted accuracy numbers reported in the columns marked “SNN” are obtained from CARLsim (Chou et al., 2018). We use a multi-GPU machine to simulate these architectures using both Keras and CARLsim. See our evaluation framework in Section 7.

3.3.2. Workload Generation

The SNN model (or the converted ANN model) is analyzed in CARLsim to generate the following information.

  • Spike Data: the exact spike times of all neurons in the SNN model, i.e., a list of spike times for each neuron.

  • Weight Data: the synaptic strength of all synapses in the SNN model, i.e., the weight of the connection between every pair of connected neurons.

The spike and weight data of a trained SNN form the SNN workload. Formally, an SNN workload is defined as

Definition 1 (SNN Workload). An SNN Workload is a directed graph consisting of a finite set of neurons, a set of spikes, and a set of synapses between the neurons.
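To make Definition 1 concrete, the following Python sketch shows one possible in-memory representation of an SNN workload; the class and field names are illustrative assumptions and not part of DFSynthesizer's published interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SNNWorkload:
    """Directed graph of neurons, spikes, and synapses (Definition 1)."""
    # spike_times[i] is the list of spike times (in ms) of neuron i
    spike_times: Dict[int, List[float]] = field(default_factory=dict)
    # weights[(i, j)] is the synaptic strength of the connection i -> j
    weights: Dict[Tuple[int, int], float] = field(default_factory=dict)

    @property
    def neurons(self):
        return set(self.spike_times)

    @property
    def synapses(self):
        return set(self.weights)


# Example: a 3-neuron workload extracted from a trained model
workload = SNNWorkload(
    spike_times={0: [1.0, 4.5], 1: [2.0], 2: [3.2, 6.1, 9.0]},
    weights={(0, 2): 0.8, (1, 2): -0.3},
)
print(len(workload.neurons), "neurons,", len(workload.synapses), "synapses")
```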

4. Program Compilation and Performance Estimation

In this step, DFSynthesizer clusters a given machine-learning model to map onto the crossbars of a neuromorphic hardware. To do so, we first introduce the system architecture and then discuss the clustering step needed to map applications to this architecture.

4.1. System Architecture

Figure 5 illustrates our system architecture. DFSynthesizer is designed for crossbar-based neuromorphic hardware as shown in Figure 2, which is representative of many recent neuromorphic designs (Catthoor et al., 2018; Gopalakrishnan et al., 2020; Ankit et al., 2017; Hu et al., 2016). A machine learning model (ANN or SNN) is first analyzed to generate its workload (Section 3). This workload is then partitioned to generate clusters, where each cluster consists of a fraction of the neurons and synapses of the original machine learning model. The clustered workload is stored on disk along with other machine learning workloads. To execute a specific workload on the neuromorphic hardware, it is first loaded into the host memory and then the clusters are programmed onto the crossbars of the hardware via the PCIe interface. [Footnote 6: Although we illustrate the crossbars interconnected in a mesh-based architecture such as a Network-on-Chip (NoC) (Benini and De Micheli, 2002), DFSynthesizer can work with other interconnect types such as a Segmented Bus (Balaji et al., 2019c).]

Figure 5. Our system architecture, integrating a neuromorphic hardware. DFSynthesizer is designed for crossbar-based neuromorphic hardware (Catthoor et al., 2018; Gopalakrishnan et al., 2020; Ankit et al., 2017; Hu et al., 2016). This is representative of many recent neuromorphic designs. To evaluate DFSynthesizer, we have configured our evaluation setup to model the DYNAP-SE hardware (Moradi et al., 2017).

In the remainder of this section, we describe the workload compilation step of DFSynthesizer, which consists of two design components: Workload Decomposition and Workload Clustering. We conclude this section with a dataflow modeling approach for clustered workloads and performance estimation using such a model.

4.2. Workload Decomposition

We note that each crossbar in a neuromorphic hardware can accommodate only a limited number of pre-synaptic connections per post-synaptic neuron, with typical limits between 128 (in DYNAP-SE) and 256 (in TrueNorth). Figure 6 illustrates an example of mapping a) one 4-input, b) one 3-input, and c) two 2-input neurons on a 4x4 crossbar. Neurons with more than 4 pre-synaptic connections per post-synaptic neuron cannot be mapped to this crossbar. In fact, in many complex machine learning models such as AlexNet and VGG16, the number of pre-synaptic connections per post-synaptic neuron is much higher than 128. Therefore, these neurons cannot be mapped directly to a crossbar in DYNAP-SE.

Figure 6. Example mapping of a) one 4-input, b) one 3-input, and c) two 2-input neurons on a crossbar.

To address the above limitation, we have previously proposed a spatial decomposition technique which exploits the firing principle of LIF neurons, decomposing each neuron with many pre-synaptic connections into a sequence of homogeneous fanin-of-two (FIT) neural units (Balaji et al., 2020d).

Figure 7 illustrates the spatial decomposition using the small example of a 3-input neuron shown in Figure 7(a). We consider the mapping of this neuron to 2x2 crossbars. Since each crossbar can accommodate a maximum of two pre-synaptic connections per neuron, the example 3-input neuron cannot be mapped to the crossbar directly. The most common solution is to eliminate a synaptic connection, which may lead to accuracy loss. Figure 7(b) instead illustrates the decomposition mechanism, where the 3-input neuron is implemented using two FIT neural units connected in sequence. Each FIT unit is similar to a 2-input neuron and exploits the leaky-integrate behavior in hardware to maintain functional equivalence between Figures 7(a) and 7(b).

Figure 7. Illustrating the decomposition of a 3-input neuron (a) to a sequence of FIT neural units (b). The mapping of the FIT units to two 2x2 crossbars is shown in (c).

For the sake of completeness, Figure 7(c) illustrates the mapping of the decomposed neuron utilizing two 2x2 crossbars. The functionality of the FIT neural units is implemented using the Non-Volatile Memory (NVM) cells of the two crossbars.

To describe the decomposition algorithm, we introduce the following notation. Let x_1, x_2, ..., x_d be the pre-synaptic connections of a neuron, and let F_1, F_2, ..., F_{d-1} be the (d-1) FIT neural units generated by spatially decomposing this neuron. The input set of unit F_i, denoted in_i, can be represented as

(1)  in_i = {x_1, x_2} if i = 1, and in_i = {y_{i-1}, x_{i+1}} for 2 ≤ i ≤ d-1,

where y_{i-1} is the output of unit F_{i-1}. When decomposing a neuron, the first FIT unit uses two of the original inputs of the neuron. Every subsequent FIT unit uses one of the original inputs and the output of the preceding FIT unit, as shown in Figure 7(b).

Formally, a decomposed SNN graph is defined as follows.

Definition 2 (Decomposed SNN Graph). A decomposed SNN graph is a directed graph consisting of a finite set F of FIT neural units and a finite set L of links between these units.

Algorithm 1 shows the pseudo-code of the spatial decomposition technique, which transforms the SNN graph into its decomposed counterpart. For each neuron (line 1), the set of inputs to this neuron is obtained (line 2). The first FIT unit is formed using two of these inputs (line 3), in accordance with Equation 1 and Figure 7(b). This FIT unit is inserted into the decomposed graph (line 4). The algorithm then creates the remaining FIT units iteratively (lines 5-8) using Equation 1 and inserts them into the decomposed graph. Finally, the decomposed graph is returned (line 10).

The overall complexity of this algorithm is calculated as follows. The outer for loop (lines 1-9) is executed once per neuron of the original graph, i.e., |N| times. Within each iteration, the algorithm creates (|pre(n)| - 1) FIT units, where pre(n) is the set of inputs of neuron n. Therefore, the algorithmic complexity is

(2)  O( Σ_{n ∈ N} (|pre(n)| - 1) ) = O(|S|).

In deriving the final expression, we note that the input connections of all the neurons in the graph are exactly the synapses (edges) of the graph.

Input: SNN graph G(N, S)
Output: Decomposed SNN graph G'(F, L)
1  for each neuron n in N do                 /* for each node of G */
2      pre(n) = input links of n;            /* inputs of n */
3      Create FIT unit F_1 with inputs {x_1, x_2};   /* first FIT unit (Equation 1) */
4      Insert F_1 into G';                   /* insert the FIT neural unit in G' */
5      for i = 2 to |pre(n)| - 1 do          /* remaining FIT units */
6          Create FIT unit F_i with inputs {y_{i-1}, x_{i+1}};
7          Insert F_i into G';
8      end for
9  end for
10 Return G'
Algorithm 1 Spatial decomposition of the SNN graph G into the decomposed graph G'.
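The following Python sketch illustrates the decomposition of Algorithm 1 and Equation 1; the dictionary-based graph representation and function names are our own simplifying assumptions, not DFSynthesizer's actual data structures.

```python
def decompose_neuron(inputs):
    """Decompose one neuron with pre-synaptic inputs `inputs` into a chain
    of fanin-of-two (FIT) units. Each unit is returned as the pair of its
    two inputs, following Equation 1."""
    if len(inputs) <= 2:
        return [tuple(inputs)]            # already fits a 2-input unit
    units = [(inputs[0], inputs[1])]      # first FIT unit: two original inputs
    for k in range(2, len(inputs)):
        prev_out = ("fit", len(units) - 1)   # output of the preceding FIT unit
        units.append((prev_out, inputs[k]))
    return units


def decompose_graph(fanin):
    """fanin maps each neuron to the list of its pre-synaptic sources."""
    return {n: decompose_neuron(srcs) for n, srcs in fanin.items()}


# 3-input neuron as in Figure 7: decomposed into two FIT units
print(decompose_graph({"n": ["x1", "x2", "x3"]}))
# {'n': [('x1', 'x2'), (('fit', 0), 'x3')]}
```

For the 3-input neuron of Figure 7, the sketch produces two FIT units: the first combines two original inputs, and the second combines the output of the first unit with the remaining input.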

4.3. Workload Clustering

The decomposed SNN graph is clustered such that each cluster fits onto a crossbar. Figure 8 illustrates the concept using an example decomposed SNN graph. The nodes are the FIT neural units and the links are the synaptic connections. The number on a link represents the average number of spikes communicated between the source and destination FIT units for the representative training data. We consider the mapping of this decomposed SNN graph to a hardware with two crossbars, each accommodating a maximum of 2 pre-synaptic connections per neuron. We therefore partition the graph into two partitions (shown in two different colors in the figure). These partitions can then be mapped to the two crossbars, with an average of 8 spikes communicated between the crossbars due to the mapping of the link between neurons d and e on the shared interconnect of the hardware. Finally, the two clusters generated from the SNN graph are shown along with the inter-cluster communication.

Figure 8. Illustration of SNN graph clustering: the original decomposed SNN graph with FIT neural units as nodes and average spike counts on the links; the partitioning of this graph; the mapping of the partitions to the two crossbars; and the two clusters generated from the SNN graph considering the constraints of the crossbar.

Formally, a clustered SNN graph is defined as follows.

Definition 3 (Clustered SNN Graph). A clustered SNN graph is a directed graph consisting of a finite set A of clusters and a finite set C of connections between these clusters.

Recently, different approaches have been proposed for clustering SNNs. Examples include SpiNeMap (Balaji et al., 2020b) for energy minimization and NEUTRAMS (Ji et al., 2016) for performance. See Section 9 for a comprehensive overview of other state-of-the-art SNN clustering approaches.

We formulate SNN clustering as a graph transformation problem and introduce an efficient algorithm to improve resource utilization. This objective is essential to provide a tighter guarantee on the performance of SNNs in hardware, as we demonstrate in Section 8.

The graph transformation from the decomposed graph to the clustered graph is a classical graph partitioning problem (Kernighan and Lin, 1970), and it has been applied in many contexts, including task mapping on multiprocessor systems (Das et al., 2014a). We propose a greedy approach to pack the FIT neural units and synapses of the decomposed SNN graph into clusters, improving cluster resource utilization. Algorithm 2 provides the pseudo-code of the clustering algorithm. For each node of the decomposed graph, the algorithm checks whether the node can be merged into one of the existing clusters (line 3) before creating a new one (lines 4-7). Clusters are kept sorted in descending order of neuron and synapse utilization (line 11), so that the heavily utilized clusters are considered first for packing neurons and synapses, further improving their utilization.

Input: Decomposed SNN graph G'(F, L)
Output: Clustered SNN graph G''(A, C)
1  cluster_list = {};
2  foreach FIT unit f in F do
3      find a cluster A_i in cluster_list such that f can be packed into A_i while improving the neuron and synapse utilization of A_i;
4      if no such cluster exists then
5          Create a new cluster A_new;
6          Assign f and its synaptic connections to A_new;
7          cluster_list.push(A_new);
8      else
9          Assign f and its synaptic connections to A_i;
10     end if
11     sort cluster_list in descending order of neuron and synapse utilization;
12 end foreach
Return the clustered graph G''(A, C)
Algorithm 2 Utilization-aware SNN clustering.
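The following Python sketch illustrates the greedy packing idea of Algorithm 2 under simplified assumptions (each FIT unit occupies one neuron slot, its fan-in occupies synapse slots, and a crossbar of dimension k offers k neuron and k*k synapse slots); it is not the exact cost model used inside DFSynthesizer.

```python
def cluster_units(units, fanin, k):
    """Greedy utilization-aware clustering of FIT units (cf. Algorithm 2).

    units: iterable of unit ids; fanin[u]: number of inputs of unit u;
    k: crossbar dimension. Returns a list of clusters, each recording the
    packed units and its remaining neuron and synapse capacity."""
    clusters = []
    for u in units:
        placed = False
        for c in clusters:                      # try existing clusters first
            if c["neurons_left"] >= 1 and c["synapses_left"] >= fanin[u]:
                c["units"].append(u)
                c["neurons_left"] -= 1
                c["synapses_left"] -= fanin[u]
                placed = True
                break
        if not placed:                          # open a new cluster (crossbar)
            clusters.append({"units": [u],
                             "neurons_left": k - 1,
                             "synapses_left": k * k - fanin[u]})
        # keep heavily utilized clusters first, mirroring the sort step
        clusters.sort(key=lambda c: c["neurons_left"] + c["synapses_left"])
    return clusters


fanin = {"a": 2, "b": 2, "c": 2, "d": 2, "e": 2}
print(len(cluster_units(fanin.keys(), fanin, k=2)))   # 2x2 crossbars needed
```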

4.4. Dataflow Modeling of Clustered Workload

We model a clustered SNN as a Synchronous Dataflow Graph (SDFG) for predictable performance analysis (Lee and Messerschmitt, 1987). SDFGs are commonly used to model streaming applications implemented on multi-processor systems-on-chip (Sriram and Bhattacharyya, 2000). These graphs are used to analyze a system in terms of key performance properties such as throughput, execution time, communication bandwidth, and buffer requirements (Stuijk et al., 2006a). Nodes of an SDFG are called actors; each actor corresponds to a cluster of the clustered SNN graph. An actor computes by reading tokens, i.e., spikes, from its input ports and writing the results of the computation as tokens on its output ports. The number of tokens produced or consumed in one execution of an actor is called the port rate; port rates represent the number of spikes per unit time at the input and output of the different clusters in the SNN, and are visualized as annotations on edges. Actor execution is also called firing, and it requires a fixed amount of time to execute on a crossbar. Edges in the graph are called channels and represent dependencies among actors. An actor is said to be ready when it has sufficient input tokens on all its input channels and sufficient buffer space on all its output channels; an actor can only fire when it is ready. Each actor has a set of ports, and a finite rate is associated with each port. Formally, an actor is defined as follows.

Definition 4 (Actor). An actor is a tuple consisting of a set of input ports, a set of output ports, its execution time, and its state space, i.e., the buffer space needed for communicating spikes on all of its channels.

The source of a channel is an output port of some actor, and its destination is an input port of another actor. All ports of all actors are connected to precisely one channel, and all channels are connected to ports of some actors.

Before an actor starts firing, it requires the specified number of tokens on all of its input channels. When the actor completes execution, it produces tokens on every output channel. One important property of an SDFG is throughput, which is defined as the inverse of its long-term period. A period is the average time needed for one iteration of the SDFG. An iteration is defined as the minimum non-zero execution after which the SDFG returns to its original state. This is the performance parameter used in this paper. The following definitions are introduced to formulate throughput.

Definition 5 (Repetition Vector). The repetition vector RptV of an SDFG is defined as the vector specifying the number of times each actor in the SDFG is executed in one iteration.

For the SDFG representation of a clustered SNN, all spikes generated on a channel are consumed by the destination actor. This means that all actors are fired exactly once during one iteration of the application, i.e., RptV = [1, 1, ..., 1].

4.5. Cyclic Dependency and Deadlock Avoidance

The clustering approach may lead to cyclic dependencies among actors. Figure 9(a) illustrates a simple feedforward network of 3 neurons (A, B, and C). Figure 9(b) illustrates a scenario where neurons A and C are placed in cluster 1 (actor 1) and neuron B in cluster 2 (actor 2) during partitioning. Due to the connectivity of the neurons in Figure 9(a), there is a cyclic dependency between the two actors: actor_1 → actor_2 → actor_1. SDF graphs allow such cyclic dependencies among actors to be represented, justifying our choice of using them for modeling clustered SNNs.

Figure 9. An example cycle generated during clustering of SNNs.

However, the presence of cycles complicates the scheduling problem because cyclic dependencies can lead to deadlocks. To address this, a cyclic SDF graph is decomposed into hierarchies of acyclic subgraphs. To describe this, we introduce the following definition.

Definition 6 (Strongly Connected Subgraph). A subgraph of a directed (cyclic or acyclic) graph is called strongly connected iff, for every pair of vertices u and v in the subgraph, there is a path from u to v and a path from v to u.

Figure 10. Cycle breaking for deadlock avoidance of cyclic SDF graphs (Battacharyya et al., 1996).

Figure 10 shows the flowchart for cycle breaking, also known as sub-independence partitioning, which decomposes strongly connected SDF graphs into hierarchies of acyclic graphs. This is roughly based on the Loose Interdependence Algorithms Framework (LIAF) (Battacharyya et al., 1996). A cyclic SDF graph is first decomposed into a series of strongly connected subgraphs. For each strongly connected subgraph, the LIAF algorithm tries to break cycles by removing edges that have sufficient delays. An edge can be removed if it has enough initial tokens to satisfy the consumption requirements of its sink actor for a complete iteration of the subgraph, and if scheduling the subgraph without this edge does not lead to deadlock. Such an edge is called an inter-iteration edge. Inter-iteration edge removal is performed iteratively until the subgraph with these edges removed is no longer strongly connected (i.e., it becomes a loosely connected subgraph). The resulting subgraph is pushed into a ready list for scheduling purposes. The algorithm is repeated for all strongly connected subgraphs. At the end, all deadlock-free subgraphs are scheduled.

4.6. Performance Estimation

We present an approach to compute the application period of an SDFG by analyzing its maximum cycle mean (MCM), assuming infinite hardware resources. For this, we use Max-Plus Algebra (Heidergott et al., 2014; Zhang and Liu, 2013; Cong and Zhang, 2006). The Max-Plus semiring is the set ℝ ∪ {-∞} equipped with two basic operations ⊕ and ⊗, which are related to linear algebra as

(3)  a ⊕ b = max(a, b)  and  a ⊗ b = a + b.

The identity element for the addition ⊕ is -∞ in linear algebra, i.e., a ⊕ -∞ = a. The identity element for the multiplication ⊗ is 0 in linear algebra, i.e., a ⊗ 0 = a.

To use Max-Plus Algebra to analyze an SDFG, it is customary to express the time at which an actor fires in terms of preceding firings in linear algebra, and then use standard Max-Plus analysis techniques to estimate timing performance. We use the running example of the SDFG in Figure 11(a), which is obtained by clustering EdgeDet (Chou et al., 2018), an application used to evaluate DFSynthesizer (see Section 7). The clustering is performed considering 1024x1024 crossbars. [Footnote 7: We evaluate DFSynthesizer primarily for the DYNAP-SE neuromorphic hardware with 128x128 crossbars (Moradi et al., 2017). Here we configure larger crossbars to generate fewer clusters from EdgeDet for illustration purposes.] The firing end times of all 9 actors in the k-th iteration (in linear algebra) are

(4)

Figure 11. (a) An example of SDFG obtained from clustering of the EdgeDet application (Chou et al., 2018). (b) Mapping of the SDFG to a neuromorphic hardware with 4 tiles.

Observe that the firing end time of each actor in the k-th iteration is expressed in terms of firing end times in the (k-1)-th iteration. Furthermore, the production and consumption rates are the same for every channel in the SDFG. Using the previously introduced Max-Plus semantics, the firing end times of all actors in the SDFG can be expressed compactly as

(5)  t_k = T ⊗ t_{k-1},

where t_k is the vector of firing end times in the k-th iteration and T is a matrix with entries in ℝ ∪ {-∞} that captures the actor execution times and dependencies. The following definitions are introduced to estimate latency.
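As a sanity check of Equation 5, the sketch below performs the Max-Plus matrix-vector product with NumPy; the 3-actor matrix is a made-up toy example, not the EdgeDet SDFG of Figure 11.

```python
import numpy as np

NEG_INF = -np.inf   # identity element of the max-plus "addition"


def maxplus_matvec(T, t_prev):
    """One iteration of Equation 5: (T (x) t)[i] = max_j (T[i, j] + t[j])."""
    return np.max(T + t_prev[np.newaxis, :], axis=1)


# Toy 3-actor example: T[i, j] combines the execution time of actor i with
# its dependency on actor j in the previous iteration (-inf: no dependency).
T = np.array([[1.0, NEG_INF, 3.0],
              [2.0, 2.0, NEG_INF],
              [NEG_INF, 4.0, 4.0]])
t = np.zeros(3)                      # firing end times in iteration 0
for _ in range(3):
    t = maxplus_matvec(T, t)
print(t)                             # firing end times after three iterations
```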

Definition 7 (Digraph). The digraph of a matrix with entries in ℝ ∪ {-∞} is the tuple (V, A), where V is the set of vertices, one per row (and column) of the matrix, and A is the set of ordered arcs between vertices, containing an arc for every matrix entry that is not -∞.

As an example, Figure 12 shows the digraph of such a matrix.

Figure 12. An example digraph.
Definition 8 (Walk). A walk in a digraph is a sequence of arcs in which the head of each arc is either the start vertex of the walk or the tail vertex of the preceding arc, and the tail vertex of each arc is either the end vertex of the walk or the head vertex of the succeeding arc. The weight of a walk is the sum of the weights of its arcs, i.e.,

(6)  w(walk) = Σ over the arcs (i, j) of the walk of w(i, j).

Definition 9 (Cycle). A cycle in a digraph is a walk whose start vertex and end vertex coincide.

Definition 10 (Maximum Cycle Mean). The maximum cycle mean MCM is the maximum of the weight-to-length ratio over all cycles c in the digraph, i.e.,

(7)  MCM = max over all cycles c of w(c) / |c|.

In this paper, the performance of an SNN is defined in terms of the throughput of the equivalent SDFG, measured as the inverse of its maximum cycle mean (Equation 7), i.e.,

(8)  Throughput_max = 1 / MCM.
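To illustrate Equations 7 and 8, the sketch below computes the maximum cycle mean of a small weighted digraph by enumerating its simple cycles with networkx; this brute-force enumeration is only practical for tiny graphs and is not the analysis engine used inside DFSynthesizer.

```python
import networkx as nx


def maximum_cycle_mean(G):
    """Maximum over all cycles of (sum of arc weights) / (number of arcs)."""
    best = float("-inf")
    for cycle in nx.simple_cycles(G):
        arcs = list(zip(cycle, cycle[1:] + cycle[:1]))   # close the cycle
        weight = sum(G[u][v]["weight"] for u, v in arcs)
        best = max(best, weight / len(arcs))
    return best


G = nx.DiGraph()
G.add_weighted_edges_from([("a", "b", 2.0), ("b", "a", 4.0),
                           ("b", "c", 1.0), ("c", "b", 5.0)])
mcm = maximum_cycle_mean(G)
print("MCM =", mcm, " throughput upper bound =", 1.0 / mcm)   # Equation 8
```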

In Equation 8, the performance is computed using the worst-case execution time of an actor on a crossbar. This is obtained from the propagation delay of current through the synaptic elements in the crossbar. As shown in many recent works (Titirsha and Das, 2020a, b; Titirsha et al., 2021b), the current propagation delay within a crossbar depends on the specific synaptic elements that are being activated in the crossbar. This is due to the difference in the amount of parasitic components on the bitlines and wordlines of a crossbar along the different current paths. For performance guarantee purposes, we assume the worst-case propagation delay in the crossbar, and use the same to represent the execution time of actors on the crossbars of a neuromorphic hardware.

The performance metric defined in Equation 8 provides the maximum throughput, considering only the worst-case execution time of actors. However, a neuromorphic hardware introduces constraints such as limited buffer space on the crossbars and non-zero latency on the interconnect, which can lower the throughput significantly. Therefore,

(9)  Throughput_actual ≤ Throughput_max.

In this work, we show that performance is impacted by

  1. how hardware resources are allocated to actors of a clustered SNN (Section 5), and

  2. how actors mapped to the same crossbar are time-multiplexed and scheduled (Section 6).

We seek to find a lower bound Throughput_min on performance such that

(10)  Throughput_min ≤ Throughput_actual ≤ Throughput_max.

By making Throughput_min close to Throughput_max, we provide a tighter bound on performance.

5. Resource Allocation and Hardware Mapping

The performance obtained using Equation 7 defines the maximum throughput obtained when the clustered SNN is mapped to a hardware with infinite resources, i.e., a hardware with as many crossbars as the number of actors (clusters) in the clustered SNN graph. Additionally, each crossbar is assumed to have sufficient buffer space to send and receive spikes over the shared interconnect. However, state-of-the-art neuromorphic hardware platforms present the following three critical limitations. First, the number of crossbars in a neuromorphic hardware is limited. Therefore, the available crossbars need to be time-multiplexed amongst the clusters of an SNN. Second, the input and output buffer space on each crossbar are limited. Therefore, no more than one cluster can be executed on a crossbar concurrently. Third, the communication bandwidth of each tile is limited. Therefore, only a few spikes can be sent or received from the interconnect at once. Formally, a neuromorphic hardware is defined as follows.

Definition 11 (Neuromorphic Hardware Graph). A neuromorphic hardware graph is a directed graph consisting of a finite set T of tiles and a finite set I of interconnect links.

Each tile consists of a crossbar to map neurons and synapses, and input and output buffers to receive and send tokens (spikes) over the interconnect, respectively. A tile is characterized by the dimension of its crossbar, i.e., the number of pre-synaptic neurons, post-synaptic neurons, and synaptic connections it can accommodate, by its input buffer size, and by its output buffer size. Each interconnect link is bidirectional, representing two-way communication between the source and destination tiles with a fixed bandwidth.

The mapping of clusters to tiles is specified by a binary matrix M = [m_{i,j}], where m_{i,j} is defined as

(11)  m_{i,j} = 1 if cluster i is mapped to tile j, and m_{i,j} = 0 otherwise.

The mapping constraint is that a cluster can be mapped to only one tile, i.e.,

(12)  Σ_j m_{i,j} = 1 for every cluster i.

The throughput of the clustered SNN graph on the neuromorphic hardware for a mapping M is computed as

(13)  Throughput(M) = DFSynthesizer(M),

where DFSynthesizer(·) is the extended Max-Plus formulation of Equation 7 incorporating the platform constraints. The following three steps describe this formulation. Without loss of generality, we use Equation 14 as a running mapping example, where the 9 actors of Figure 11 are mapped to 4 tiles.

(14)

The remainder of this section uses the cluster-to-tile mapping specified by Equation 14.

5.1. Step 1: Modeling Limited Buffer Sizes of Crossbars

Limited input and output buffer sizes of a tile are modeled as back-edges with initial tokens indicating the buffer size available on the tile. This is illustrated in Figure 11(b) with the back-edge between two actors that are both mapped to tile 0. When an actor generates spikes on a channel, the available buffer space reduces; when the receiving actor consumes the spikes, the buffer space is released. In the example, before the producing actor can be executed, it has to check if enough buffer space is available. This is modeled by requiring tokens from the back-edge to be consumed. Since the actor produces 5068 spikes per firing, 5068 tokens from the back-edge are consumed, indicating reservation of the buffer space. On the consumption side, when the receiving actor is executed, it frees 5068 buffer spaces, indicated by a release of these tokens on the back-edge. We assume atomic execution of actors on a crossbar, i.e., a crossbar reads input tokens and produces output tokens in the output buffer for no more than one actor at any given instant of time. To prevent other actors mapped to the same tile from firing simultaneously, the output buffer space is claimed at the start of execution and released only at the end of firing.
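The back-edge construction can be sketched as below for a simple edge-list SDFG representation; the buffer size of 8192 tokens is a placeholder, while the rate of 5068 spikes per firing comes from the running example.

```python
def add_buffer_backedge(edges, src, dst, buffer_size, rate):
    """Model a finite buffer between actors `src` and `dst` on the same tile
    as a back-edge dst -> src whose initial tokens equal the free buffer space."""
    # forward channel: src produces `rate` spikes (tokens) per firing
    edges.append({"from": src, "to": dst, "rate": rate, "tokens": 0})
    # back-edge: src must claim `rate` tokens (buffer slots) before it can fire
    edges.append({"from": dst, "to": src, "rate": rate, "tokens": buffer_size})
    return edges


edges = add_buffer_backedge([], src="actor_a", dst="actor_b",
                            buffer_size=8192, rate=5068)
for e in edges:
    print(e)
```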

5.2. Step 2: Actor Ordering on Crossbars

The number of crossbars in a neuromorphic hardware is limited. Therefore, they may have to be shared between actors of an SNN. However, on a tile, only one instance of an actor can be executing at any moment in time. We use time-division multiple-access (TDMA) to allocate time slices to actors mapped to the same tile. During its allocated time slice, an actor is executed on the crossbar of the tile and generates spikes, which are stored in the output buffer for communication on the interconnect. Next, we generate the order in which the actors bound to a tile are fired to provide a performance guarantee, i.e., throughput. For this, we apply our Max-Plus Algebra formulation (Eq. 7) on the SDFG of Fig. 11(b). This is our static-order schedule, and it is constructed at design time.

5.3. Step 3: Actor Execution on Crossbars

Once the static-order schedule is constructed for all tiles of the hardware, we use a self-timed execution strategy (Moreira and Bekooij, 2007) to execute these actors at run time. Here, the exact firing times of actors are discarded, retaining only the assignment and ordering of actors on each tile as obtained from the design-time analysis (step 2). At run time, ready actors are inserted into a list and fired in the same order previously determined during design time.
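A minimal sketch of this run-time strategy, assuming a simple token-count model of readiness and per-tile local clocks, is shown below; the channel names and timings are illustrative and do not correspond to the EdgeDet clusters.

```python
def self_timed_execution(static_order, exec_time, deps, prods, tokens, steps=20):
    """Toy self-timed executor: on each tile, actors fire in the fixed
    design-time order; an actor fires as soon as all its input channels hold
    tokens. Tiles keep independent clocks, since they run in parallel."""
    clock = {tile: 0.0 for tile in static_order}     # per-tile local time
    pos = {tile: 0 for tile in static_order}         # next actor in the order
    trace = []
    for _ in range(steps):
        progressed = False
        for tile, order in static_order.items():
            actor = order[pos[tile] % len(order)]
            if all(tokens[ch] > 0 for ch in deps.get(actor, [])):
                for ch in deps.get(actor, []):
                    tokens[ch] -= 1                   # consume input tokens
                for ch in prods.get(actor, []):
                    tokens[ch] += 1                   # produce output tokens
                clock[tile] += exec_time[actor]
                trace.append((tile, actor, clock[tile]))
                pos[tile] += 1
                progressed = True
        if not progressed:
            break                                     # no actor is ready
    return trace


trace = self_timed_execution(
    static_order={"tile2": ["A", "B", "C"]},
    exec_time={"A": 1.0, "B": 2.0, "C": 1.5},
    deps={"B": ["A->B"], "C": ["B->C"]},
    prods={"A": ["A->B"], "B": ["B->C"]},
    tokens={"A->B": 0, "B->C": 0},
)
print(trace[:3])   # transient phase: A, then B, then C
```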

5.4. Mapping Exploration

Sections 5.1 through 5.3 extend the Max-Plus formulation to incorporate platform constraints. Using these constraints and the new formulation, one can estimate the throughput of a clustered SNN on a neuromorphic hardware for a specific actor-to-tile mapping. In the following, we explain the mapping scenario where the number of tiles in the hardware is less than the number of actors in the clustered SNN. Therefore, each tile needs to be time-multiplexed between multiple actors.

Figure 13 conceptually illustrates the mapping exploration using DFSynthesizer compared to state-of-the-art solutions, and the selection of the lower bound on throughput. The first design point represents the throughput obtained using SpiNeMap (Balaji et al., 2020b), which optimizes energy consumption for a hardware platform where the number of tiles is higher than the number of actors. When SpiNeMap is applied to the case where the tiles need to be time-multiplexed, it randomly distributes the actors to the tiles and schedules them arbitrarily, without considering throughput. Therefore, the throughput obtained with SpiNeMap is significantly lower than the maximum throughput, i.e., the upper bound, and the resulting throughput variation (the gap between the upper bound and this lower bound) is the largest.

Figure 13. Different mapping explorations and choices for the lower bound of throughput (see Equation 10).

In Figure 13, the second design point represents the throughput obtained using a solution such as PyCARL (Balaji et al., 2020a), which balances the load on each tile for a scenario where actors need to be time-multiplexed on the tiles. However, the actors mapped to a tile are scheduled in an arbitrary order without considering throughput. By balancing the tile load, PyCARL reduces the number of clusters mapped per tile, which improves throughput. Therefore, the throughput obtained with PyCARL is higher than that of SpiNeMap, but still lower than the maximum throughput, and the corresponding throughput variation is smaller.

In Figure 13, the third design point represents the throughput obtained using our previous work SDFSNN (Song et al., 2020a), which first balances the load of each tile by distributing the actors evenly, and then uses a dataflow approach to schedule the actors on each tile, improving throughput. The throughput of SDFSNN is therefore higher than that of both SpiNeMap and PyCARL, but lower than the maximum throughput, further reducing the throughput variation.

In Figure 13, the fourth design point represents the throughput obtained using a mapping exploration framework that explores a combination of actor-to-tile mapping and dataflow-based scheduling of actors on each tile to maximize the throughput. This throughput is higher than that of the previous three approaches and is closest to the maximum throughput. Finally, the fifth design point represents the throughput obtained using an actor-to-tile mapping that jointly optimizes energy and throughput, and uses dataflow-based scheduling of actors on each tile to further improve the throughput. Since this solution takes energy into consideration in the mapping step, its throughput can be somewhat lower than that of the throughput-only exploration, as illustrated in the figure. In Section 8, we evaluate all these approaches and show that the jointly optimized mapping still achieves higher throughput than SpiNeMap, PyCARL, and SDFSNN.

To conclude, the design-space exploration of DFSynthesizer can generate mappings representing two minimum-throughput solutions: the throughput-maximizing mapping and the jointly optimized (energy and throughput) mapping. Although the maximum throughput remains the same for DFSynthesizer and the other state-of-the-art approaches, the minimum throughput of DFSynthesizer is higher than the minimum throughput obtained using all state-of-the-art mapping solutions. Therefore, the difference between maximum and minimum throughput is the smallest for DFSynthesizer, meaning that DFSynthesizer provides a stricter performance guarantee, which is critical for real-time systems. We now describe DFSynthesizer's mapping exploration.

We integrate the extended Max-Plus formulation inside a design-space exploration framework to obtain cluster mappings that are Pareto optimal in terms of hardware metrics such as throughput, latency, energy, and reliability. In the following, we describe our mapping explorations considering energy and throughput. Such formulations can be trivially extended to consider other metrics.

The energy consumption of a mapping is estimated from the number of spikes generated inside each tile and the number of spikes routed on the interconnect (Titirsha et al., 2021a). The energy parameters are reported in Table 3. Using these parameters, the energy consumption is

(15)  E = E_comp + E_comm = Σ_i spk_i · E_spike + Σ_{i,j} spk_{i,j} · E_route,

where E_comp is the energy consumed in generating the spikes and propagating the spike current via the synapses, E_comm is the energy consumed in communicating spikes via the shared interconnect, spk_i is the number of spikes generated inside tile i, spk_{i,j} is the number of spikes communicated on the link between tiles i and j, and E_spike and E_route are the per-spike generation and routing energies of Table 3.
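Using the per-spike energies of Table 3 (50 pJ per generated spike and 147 pJ per routed spike), Equation 15 can be evaluated as in the sketch below; the spike counts are hypothetical placeholders for the statistics produced during clustering.

```python
E_SPIKE = 50e-12    # J per spike generated on a tile (Table 3)
E_ROUTE = 147e-12   # J per spike routed on the shared interconnect (Table 3)


def mapping_energy(spikes_per_tile, spikes_per_link):
    """Equation 15: computation energy plus communication energy."""
    e_comp = sum(spikes_per_tile.values()) * E_SPIKE
    e_comm = sum(spikes_per_link.values()) * E_ROUTE
    return e_comp + e_comm


# hypothetical spike statistics for a 2-tile mapping
print(mapping_energy({0: 120_000, 1: 95_000}, {(0, 1): 8_000}), "J")
```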

Our objective is to maximize the throughput of a given machine-learning model on hardware (Eq. 7) and minimize the hardware energy consumption (Eq. 15). We formulate a joint energy-throughput metric and minimize it during our mapping explorations. To this end, we propose an iterative approach, which explores different mapping alternatives satisfying the cluster mapping constraint (Eq. 12). For each mapping alternative, we evaluate throughput and energy consumption. Finally, the Pareto-optimal mappings are retained and returned.

Algorithm 3 provides the pseudo-code of our proposed mapping exploration. We start by randomly distributing clusters to the tiles (line 3). We evaluate the throughput and energy consumption of this mapping and compute the joint metric (lines 4-5). For each cluster, we do the following. We move the cluster from its current tile to every other tile and recalculate the joint metric (lines 6-10). If the metric reduces, the new mapping is retained (lines 11-13), and the algorithm proceeds to analyze the next cluster. In this way, a local minimum is reached, starting from the initial random allocation of clusters. We re-execute the algorithm a user-defined number of times, starting with a different random allocation of the clusters each time. In this way, many mappings are explored. Finally, mappings that are Pareto-optimal in terms of throughput and energy consumption are retained.

Input: Clustered SNN graph G''(A, C) and hardware graph G_H(T, I)
Output: Cluster-to-tile mapping M
1  mappings = {};                            /* This set holds all explored mappings */
2  for run = 1 to R do                       /* R is a user-defined number of runs */
3      Allocate clusters randomly to tiles; call this mapping M;
4      Calculate throughput using (7) and energy consumption using (15);
5      Calculate the joint metric J of mapping M;
6      for each cluster A_i in A do          /* for each cluster in the graph */
7          t = GetTileofCluster(M, A_i);     /* tile to which A_i is mapped in M */
8          for each tile T_j ≠ t do          /* move the cluster to every other tile */
9              M' = MoveClusterToTile(M, A_i, T_j);
10             Calculate the joint metric J' of mapping M';
11             if J' < J then                /* if the joint metric improves */
12                 M = M'; J = J';           /* retain the new mapping */
13             end if
14         end for
15     end for
16     mappings.push(M);
17 end for
18 mappings = ParetoFilter(mappings);        /* retain only the Pareto-optimal mappings */
Return the mapping in mappings with minimum execution time.
Algorithm 3 Mapping of the clustered SNN graph G'' to the hardware.
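The Pareto-filtering step at the end of Algorithm 3 can be realized as in the following sketch, where each candidate mapping is summarized by a (throughput, energy) pair and higher throughput with lower energy is preferred; the triples below are invented for illustration.

```python
def pareto_filter(candidates):
    """Keep mappings not dominated by any other: a mapping dominates another
    if its throughput is >= and its energy is <=, strictly better in one."""
    kept = []
    for i, (thr_i, en_i, m_i) in enumerate(candidates):
        dominated = any(
            thr_j >= thr_i and en_j <= en_i and (thr_j > thr_i or en_j < en_i)
            for j, (thr_j, en_j, _) in enumerate(candidates) if j != i
        )
        if not dominated:
            kept.append((thr_i, en_i, m_i))
    return kept


# (throughput, energy, mapping-id) triples, invented for illustration
cands = [(10.0, 5.0, "M1"), (8.0, 4.0, "M2"), (9.0, 6.0, "M3")]
print(pareto_filter(cands))   # M3 is dominated by M1 and dropped
```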

The complexity of this algorithm is as follows. The unit function GetTileofCluster is essentially an argmax over tiles, with a complexity linear in the number of tiles. The unit function MoveClusterToTile is an update of the mapping matrix M and can be performed in constant time. Therefore, the overall complexity is proportional to the number of runs, the number of clusters, and the number of tiles, multiplied by the cost of evaluating the joint metric for each candidate mapping. The number of runs is a user-defined parameter and controls the compilation time, with a trade-off on solution quality, i.e., the execution time and energy consumption of the application on hardware.

6. Scheduling and Performance Guarantee

Self-timed execution is widely used to schedule SDFGs (Ghamarian et al., 2006). Static schedules are constructed using worst-case actor execution times determined during design time. Actor ordering on each tile is retained while discarding the timing information. At run time, actors are fired while maintaining the same order as determined during design time. In this regard, the following lemmas are stated (Ghamarian et al., 2006; Das et al., 2012, 2014a).

Lemma 1 ().

For a consistent and strongly connected SDFG, the self-timed execution consists of a transient phase followed by a periodic phase.

Lemma 2 ().

For a consistent and strongly connected SDFG, the throughput of an actor is given by the average firing of the actor per unit time in the periodic phase of the self-timed execution.

Figure 14 shows an example self-timed execution of the three actors of Figure 11(b) that are mapped to tile 2.

Figure 14. Self-timed execution consisting of a transient phase followed by a periodic phase.

A modern neuromorphic hardware is expected to execute many SNN applications simultaneously. When a new application is to be admitted to a hardware that is currently running other applications, the incoming application needs to be compiled and mapped to the hardware within a short time window, based on the resources currently available on the hardware. Furthermore, when an existing application finishes execution, its hardware resources are freed, meaning that these resources can be allocated to other running applications to improve their performance. For such dynamic scenarios, SDFG schedules must be constructed for every allocation scenario. If the run-time schedule is different from that used for analysis at design time, the obtained throughput will differ significantly from what is guaranteed at design time. There are therefore two approaches to generating run-time schedules.

  • Store the actor mapping and scheduling for all resource-allocation scenarios and for all applications from design time (storage-based solution).

  • Construct the schedule at run time based on the mappings stored from design time (construction-based solution).

The former is associated with high storage overhead and the latter with longer execution time. Both storage and schedule construction time are crucial for machine-learning systems deployed in resource- and power-constrained environments. Therefore, we propose a modification of the self-timed execution scheduling as follows. First, we construct the static-order schedule for all actors of an SNN on a single tile at design time. This is achieved using the Max-Plus Algebra formulation of Equation 7. Next, we discard the exact timing information, retaining only the actor firing orders for run-time use. At run time, we first construct the cluster mapping to tiles (Section 5.4), considering the available tiles. Next, we use the single-tile static-order schedule to derive the actor schedules on each tile, without having to construct them from scratch.
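Deriving per-tile orders from the single-tile static-order schedule amounts to projecting that order onto the run-time actor-to-tile binding, as in the sketch below; the actor names and bindings are illustrative.

```python
def derive_tile_schedules(single_tile_order, binding):
    """Project the design-time single-tile static order onto each tile.

    single_tile_order: actors in their design-time firing order;
    binding[a]: tile to which actor a is bound at run time."""
    schedules = {}
    for actor in single_tile_order:
        schedules.setdefault(binding[actor], []).append(actor)
    return schedules


order = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]          # 9 actors
binding_two_tiles = {a: (0 if a in "abcde" else 1) for a in order}
print(derive_tile_schedules(order, binding_two_tiles))
# {0: ['a', 'b', 'c', 'd', 'e'], 1: ['f', 'g', 'h', 'i']}
```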

Figure 15 illustrates the construction of per-tile schedules for an SNN application with 9 actors, and with two different mappings of actors to tiles from the same single-tile static order schedule. We illustrate two scenarios in this example. In the first scenario (left), the application uses two tiles of the hardware. In the second scenario (right), the application uses three tiles of the hardware. In both scenarios, actor orders on each tile are the same as those on the single tile. Since tile schedules are not constructed from scratch, the schedule construction time is much lower.

Figure 15. Schedules constructed from the same single-tile static order schedule using 2 and 3 tiles, respectively.

However, performance obtained using this single-tile schedule can be lower than the maximum performance of a multi-tile schedule constructed independently. As long as this performance deviation is bounded, the actor schedule for any tile can be easily derived from the binding of actors to this tile and a given single-tile static-order schedule. See Section 8 for performance evaluation.

7. Evaluation Methodology

We conduct all simulations on a Lambda workstation with an AMD Threadripper 3960X CPU (24 cores), 128 MB cache, 128 GB RAM, and two RTX 3090 GPUs. Keras (Gulli and Pal, 2017) and CARLsim (Chou et al., 2018) use the two GPUs to accelerate model training and SNN functional simulation, respectively.

Figure 16 illustrates our evaluation setup using the cycle-accurate NeuroXplorer (Balaji et al., 2021) framework. This framework is validated extensively against the DYNAP-SE neuromorphic hardware (Balaji et al., 2018, 2020b; Das et al., 2018c, a; Balaji et al., 2020a), and can model the architecture of other neuromorphic hardware platforms such as Loihi (Davies et al., 2018) and TrueNorth (DeBole et al., 2019). NeuroXplorer can simulate multi-compartment neuron models and 9-parameter Izhikevich and leaky integrate-and-fire (LIF) spiking neuron models. Additionally, NeuroXplorer can model Non-Volatile Memory (NVM) synapses such as Phase Change Memory (PCM) and Oxide-based Resistive Random Access Memory (OxRRAM). NeuroXplorer also models the spike delay on the shared interconnect as well as the delay in propagating spikes through the synapses of a crossbar (Balaji et al., 2021). The mapping and scheduling results obtained using DFSynthesizer are used in NeuroXplorer to estimate energy, accuracy, and throughput.

Figure 16. Our evaluation setup based on NeuroXplorer (Balaji et al., 2021).

7.1. Evaluated Applications

We evaluate 10 machine learning programs that are representative of the three most commonly used neural network classes: convolutional neural network (CNN), multi-layer perceptron (MLP), and recurrent neural network (RNN). These applications are 1) LeNet-based handwritten digit recognition with images of handwritten digits from the MNIST dataset; 2) AlexNet for ImageNet classification; 3) VGG16, also for ImageNet classification; 4) ECG-based heart-beat classification (HeartClass) (Balaji et al., 2018; Das et al., 2018b) using electrocardiogram (ECG) data; 5) image smoothing (ImgSmooth) (Chou et al., 2018) on images; 6) edge detection (EdgeDet) (Chou et al., 2018) on images using difference-of-Gaussian; 7) multi-layer perceptron (MLP)-based handwritten digit recognition (DigitRecogMLP) (Diehl and Cook, 2015) using the MNIST database; 8) heart-rate estimation (HeartEstm) (Das et al., 2018a) using ECG data; 9) RNN-based predictive visual pursuit (VisualPursuit) (Kashyap et al., 2018); and 10) recurrent digit recognition (DigitRecogSTDP) (Diehl and Cook, 2015). To demonstrate the potential of DFSynthesizer, we consider a real-time neuromorphic system, where these machine learning programs are executed continuously in a streaming fashion. Therefore, by optimizing throughput, DFSynthesizer improves real-time performance.

Table 2 summarizes the topology, the number of neurons and synapses of these applications, and their baseline accuracy on the DYNAP-SE neuromorphic hardware using the SpiNeMap (Balaji et al., 2020b) mapping framework. As reported in many recent works (Das et al., 2018c; Balaji et al., 2020b, a), spike latency on the shared interconnect of a neuromorphic hardware can lead to inter-spike interval (ISI) distortion and spike disorder. Since the performance of an SNN is a function of ISI, such non-idealities can lead to accuracy loss. Therefore, the accuracy of the three CNN architectures (LeNet, AlexNet, and VGG16) in Table 2 is somewhat lower than that reported via functional simulation in Table 1.

Class | Applications | Dataset | Synapses | Neurons | Topology | Top-1 Accuracy (%)
CNN | LeNet | MNIST | 282,936 | 20,602 | CNN | 85.1
CNN | AlexNet | ImageNet | 38,730,222 | 230,443 | CNN | 69.8
CNN | VGG16 | ImageNet | 99,080,704 | 554,059 | CNN | 90.7
CNN | HeartClass (Balaji et al., 2018) | Physionet | 1,049,249 | 153,730 | CNN | 63.7
MLP | ImgSmooth (Chou et al., 2018) | CARLsim | 9,025 | 4,096 | FeedForward (4096, 1024) | 100
MLP | EdgeDet (Chou et al., 2018) | CARLsim | 114,057 | 6,120 | FeedForward (4096, 1024, 1024, 1024) | 100
MLP | DigitRecogMLP | MNIST | 79,400 | 884 | FeedForward (784, 100, 10) | 91.6
RNN | HeartEstm (Das et al., 2018a) | Physionet | 66,406 | 166 | Recurrent Reservoir | 100
RNN | VisualPursuit (Kashyap et al., 2018) | (Kashyap et al., 2018) | 163,880 | 205 | Recurrent Reservoir | 47.3
RNN | DigitRecogSTDP (Diehl and Cook, 2015) | MNIST | 11,442 | 567 | Recurrent Reservoir | 83.6
Table 2. Applications used to evaluate DFSynthesizer.

7.2. Hardware Parameters

We model the DYNAP-SE neuromorphic hardware (Moradi et al., 2017) with 1024 tiles organized in a mesh, where each tile has one crossbar. To test the scalability of DFSynthesizer, we also evaluate larger crossbar configurations (see Section 8.7). Table 3 reports the relevant hardware parameters.

Neuron technology | 28nm FD-SOI
Synapse technology | HfO2-based OxRAM
Supply voltage | 1.0V
Energy per spike | 50pJ at 30Hz spike frequency
Energy per routing | 147pJ
Switch bandwidth | 1.8 G-events/s
Table 3. Major simulation parameters extracted from (Moradi et al., 2017).

The additional overhead of time-multiplexing each tile's crossbar among multiple clusters is incorporated when computing throughput using NeuroXplorer. Specifically, once the cluster-to-tile mapping is generated using DFSynthesizer, the synaptic weights of all clusters mapped to a tile are pre-loaded into the tile's local memory (see our system architecture in Figure 5). In this way, DFSynthesizer reduces the overhead of transferring synaptic weights at run time from the shared main memory. Additionally, since the loading of clusters (context switching) into crossbars happens concurrently from their respective private memories, the time-multiplexing overhead is minimal.

7.3. Evaluated Metrics

We evaluate the following performance metrics.

  • Performance. This is the throughput of each application on the hardware.

  • Resource Utilization. This is the neuron, synapse, buffer, connection, and input and output bandwidth utilization on the hardware for each application.

  • Energy Consumption. This is the energy consumed on the hardware for each application, i.e., the total energy consumed to generate spikes on each tile and to communicate spikes between tiles via the shared interconnect (a simplified sketch of this computation is given after this list).

  • Cluster Connection. This is the average degree of the SDFG, expressed as a percentage of the total number of nodes, obtained using the clustering technique for each application.

  • Spike Communication. This is the total number of spikes communicated on the shared interconnect of the neuromorphic hardware.

  • Synthesis Time. This is the time to compile and map each application on the hardware.
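
To make the energy metric concrete, the sketch below combines the per-spike and per-routing energy values of Table 3 into a first-order workload-energy estimate. This is an illustrative sketch only: the function name, the per-tile spike counts, and the assumption of a single routing hop per inter-tile spike are ours, and NeuroXplorer's internal energy model is more detailed.

```python
# Illustrative workload-energy estimate (assumption: one routing hop per
# inter-tile spike; a real mesh interconnect may route spikes over several hops).
E_SPIKE_PJ = 50.0    # energy per spike (Table 3)
E_ROUTE_PJ = 147.0   # energy per routed spike (Table 3)

def workload_energy_pj(spikes_per_tile, inter_tile_spikes):
    """spikes_per_tile: list of spike counts generated on each tile.
    inter_tile_spikes: total spikes communicated on the shared interconnect."""
    generation = sum(spikes_per_tile) * E_SPIKE_PJ
    communication = inter_tile_spikes * E_ROUTE_PJ
    return generation + communication

# Example: 3 tiles generating 10k, 8k, and 12k spikes, 5k of which cross tiles.
print(workload_energy_pj([10_000, 8_000, 12_000], 5_000) / 1e6, "uJ")
```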

7.4. Evaluated Approaches

We evaluate the following approaches.

  • SpiNeMap (Balaji et al., 2020b). This approach first partitions an SNN into clusters of neurons and synapses by incorporating its workload. The objective is to minimize inter-cluster communication. Clusters are then mapped to tiles while minimizing spike communication on the shared interconnect and reducing energy consumption. When mapping SNNs to neuromorphic hardware with fewer tiles than the number of actors, 1) SpiNeMap allocates actors to tiles randomly and 2) SpiNeMap schedules the actors on each tile arbitrarily. Therefore, SpiNeMap does not consider throughput.

  • PyCARL (Balaji et al., 2020a). This approach maps neurons and synapses to tiles of a neuromorphic hardware, balancing the number of neurons and synapses on each tile. PyCARL does not incorporate SNN workload, i.e., spikes generated by neurons in the SNN. Therefore, some tiles may end up communicating more spikes than others, i.e., those tiles become the energy bottleneck.

  • SDFSNN (Song et al., 2020a). This approach uses the load-balancing mapping of PyCARL to allocate actors to tiles. It uses dataflow scheduling to improve the throughput.

  • DFSynthesizer. The proposed approach first clusters an SNN, considering its workload. The objective is to improve cluster utilization. This is done by first decomposing the SNN into homogeneous neural units with fanin-of-two. The clusters are then mapped to tiles, jointly optimizing throughput and energy consumption. DFSynthesizer uses dataflow-based scheduling of actors to tiles to further improve the throughput.

8. Results and Discussions

8.1. Throughput

Figure 17 reports the throughput on DYNAP-SE for the evaluated approaches, for each application normalized to SpiNeMap. For reference, we have reported the maximum throughput in frames-per-second obtained with unlimited hardware resources for each application. For image-based applications (LeNet, AlexNet, VGGNet, EdgeDet, ImgSmooth, and DigitSTDP), a frame corresponds to an individual image. For other time-series applications (HeartClass, HeartEstm, and VisualPursuit), a frame corresponds to a window of 500ms. We make the following four key observations.

Figure 17. Throughput on DYNAP-SE for each evaluated application normalized to SpiNeMap. The throughput in frames-per-second is reported for the maximum throughput approach for each application assuming unlimited hardware resources.

First, although the number of neurons and synapses of larger applications such as AlexNet and VGG16 is significantly higher than that of LeNet, the throughput of LeNet on hardware with unlimited resources (in the context of this work, unlimited resources refer to a neuromorphic hardware that has at least as many crossbars as there are clusters in the machine learning program), i.e., without time-multiplexing of crossbars, is only 1.5x higher than that of AlexNet and 2x higher than that of VGG16. This is because, with no time-multiplexing of crossbars, computations in a machine learning program take place concurrently on the crossbars, which is the basic philosophy of distributed computing enabled by neuromorphic platforms. Therefore, the overhead due to time-multiplexing of crossbars is no longer the throughput bottleneck; rather, the bottleneck shifts to the spike delay between clusters. Additionally, our framework clusters machine learning programs to minimize inter-cluster spikes. Therefore, even though AlexNet has a significantly higher number of neurons and synapses than LeNet, its number of inter-cluster spikes is not significantly higher, and its throughput is only 33% lower than LeNet's. Similarly, VGG16, which has more inter-cluster spikes than AlexNet, has 25% lower throughput than AlexNet.

Second, the throughput obtained using SpiNeMap is the lowest because SpiNeMap does not guarantee throughput during actor-to-tile mapping and actor scheduling on tiles. The throughput of PyCARL is on average 4% higher than SpiNeMap. This is because PyCARL balances the load on the tiles; therefore, the average number of actors mapped to each tile is lower than with SpiNeMap, which results in higher throughput. The throughput of SDFSNN is on average 9.7% higher than PyCARL. This improvement is due to the use of dataflow-based scheduling, which maximizes the throughput. DFSynthesizer improves throughput by an average of 17% compared to SDFSNN. This improvement is because, unlike SDFSNN, which maps actors to tiles by balancing the tile load without considering throughput, DFSynthesizer performs throughput- and energy-aware mapping of actors to tiles and then uses dataflow-based scheduling to further improve the throughput. We have analyzed such throughput differences in Section 5.4.

Third, the throughput using DFSynthesizer is only 16% lower on average than the maximum throughput obtained with unlimited hardware resources. Finally, DigitMLP is a very small application; all the evaluated techniques generate the same number of clusters for it, resulting in similar throughput.

8.2. Workload Energy

Figure 18 reports the workload energy estimated on DYNAP-SE of the evaluated approaches for each application normalized to SpiNeMap. For reference, we have reported the workload energy obtained using the maximum throughput approach, which assumes unlimited hardware resources. We make the following observations.

Figure 18. Workload energy on DYNAP-SE for each evaluated application normalized to SpiNeMap. The workload energy is reported for the maximum throughput approach for each application assuming unlimited hardware resources.

First, the energy consumption of SpiNeMap is the lowest because this approach partitions SNNs into clusters to explicitly minimize the number of inter-cluster spikes. Therefore, when the clusters are mapped to hardware, the energy consumption on the shared interconnect is reduced (the mapping exploration only impacts the communication energy on the shared interconnect; the spike generation energy remains the same for all approaches). Second, the energy consumption of PyCARL is on average 15% higher than SpiNeMap. This is because PyCARL balances the tile load without incorporating energy consumption. Therefore, clusters with a high volume of spike communication between them may get placed on different tiles, increasing the communication energy; SpiNeMap places such clusters on the same tile, lowering the communication energy. Third, the energy consumption of SDFSNN is the same as PyCARL because the cluster-to-tile mapping of these two approaches is the same. SDFSNN gains over PyCARL in terms of throughput due to its dataflow-based cluster scheduling on tiles, which we analyzed in Section 8.1. Finally, the energy consumption of DFSynthesizer is lower than SDFSNN by an average of 8%. This reduction is due to the cluster-to-tile mapping of DFSynthesizer, which incorporates energy consumption.

8.3. Scheduling

Figure 19 reports the throughput of each application for our proposed approach normalized to PyCARL. We compare the throughput obtained using DFSynthesizer, where schedules are independently constructed for each tile, against the throughput obtained using our proposed single-tile-based schedule (DFSynthesizer+STS). We make the following three observations.

Figure 19. Throughput normalized to PyCARL.

First, throughput obtained from a single-tile static-order schedule is on average 15% lower than when schedules are constructed independently for each tile using DFSynthesizer. This verifies our Lemma 2. Second, for some applications such as HeartEstm and HeartClass, throughput obtained using DFSynthesizer+STS is exactly the same as that obtained using DFSynthesizer. Third, throughput using DFSynthesizer+STS is still higher than PyCARL by an average of 41%.

8.4. Resource Utilization

Table 4 reports the utilization of hardware resources (tile resources, buffer size, connections, and input and output bandwidth) on the DYNAP-SE neuromorphic hardware for each application. The average utilization of hardware resources is 92.5% for the crossbar IOs on each tile, 9.0% for buffer space, 42.6% for connections, and 15% for input and output tile bandwidth. Since we perform hardware-aware analysis, resource utilization never exceeds 100%.

Application | Tile (%) | Buffer (%) | Connections (%) | Input Bandwidth (%) | Output Bandwidth (%)
LeNet | 100 | 87.8 | 37.5 | 20.34 | 20.34
AlexNet | 100 | 91.8 | 46.87 | 17.09 | 17.09
VGG16 | 100 | 94.2 | 15.62 | 6.51 | 6.51
HeartClass | 100 | 79.1 | 25 | 9.76 | 9.76
DigitMLP | 81.25 | 9.67 | 46.87 | 22.78 | 22.78
EdgeDet | 87.5 | 11.23 | 68.75 | 22.78 | 22.78
ImgSmooth | 87.5 | 8.39 | 37.5 | 17.08 | 17.08
HeartEstm | 96.87 | 9.61 | 62.5 | 4.7 | 4.7
VisualPursuit | 90.12 | 21.2 | 25.04 | 12.11 | 16.6
DigitSTDP | 89.33 | 20.13 | 22.19 | 11.94 | 11.7
Table 4. Resource utilization on DYNAP-SE.

These results illustrate that DFSynthesizer can be used to design neuromorphic hardware while considering not only key hardware parameters such as the number of tiles, but also other resources such as buffer space, connections, and input and output bandwidth.

To give more insight into the utilization within each tile, Figure 20 reports the average synapse utilization on tiles of the evaluated approaches for each application normalized to PyCARL. We make the following two key observations.

Figure 20. Average synapse utilization on tiles for each evaluated application normalized to PyCARL.

First, the synapse utilization on tiles using SpiNeMap is the lowest of the three evaluated approaches. This is because SpiNeMap produces the highest number of clusters (Sec. 8.5) and therefore, the average number of synapses per cluster is the lowest. Subsequently, when these clusters are mapped to tiles, the average synapse utilization on tiles reduces. Second, DFSynthesizer generates fewer clusters than both SpiNeMap and PyCARL due to its dense packing of synapses using Algorithm 2. Therefore, the average number of synapses per cluster is higher, which increases synapse utilization on tiles when the clusters are mapped. On average, the synapse utilization of DFSynthesizer is 2x higher than that of PyCARL and 2.2x higher than that of SpiNeMap.

8.5. Number of Clusters

Figure 21 reports the total number of clusters of the evaluated approaches for each application normalized to PyCARL. We make the following two key observations.

Figure 21. Number of clusters for each evaluated application normalized to PyCARL.

First, the number of clusters of SpiNeMap is the highest of the three evaluated approaches. This is because SpiNeMap minimizes inter-cluster communication during clustering of an SNN. Therefore, neurons that spike the most are placed within individual clusters along with their fanins. Since SpiNeMap does not consider cluster utilization, it creates more clusters than PyCARL. Second, DFSynthesizer clusters an SNN to maximize the resource utilization on each tile. Therefore, the number of clusters generated by DFSynthesizer is the lowest. Overall, the number of clusters of DFSynthesizer is 41% lower than SpiNeMap and 47% lower than PyCARL. The lower the number of clusters, the smaller the hardware needed to achieve the highest throughput (Sec. 8.1). Therefore, DFSynthesizer reduces the hardware requirement for machine learning applications.

8.6. Cluster Connections

Figure 22 reports the cluster connections of the evaluated approaches for each application normalized to PyCARL. We make the following two key observations.

Figure 22. Cluster connections for each evaluated application normalized to PyCARL.

First, the number of inter-cluster connections of SpiNeMap is the lowest of the three evaluated approaches. This is because SpiNeMap minimizes inter-cluster communication while clustering an SNN, which indirectly reduces cluster connectivity. Second, DFSynthesizer clusters an SNN to maximize the resource utilization on each tile. Therefore, the number of connections between clusters is higher in DFSynthesizer because of the higher number of post-synaptic neurons mapped to each cluster. Overall, the average cluster connectivity of DFSynthesizer is 3.1x higher than that of SpiNeMap and 3.9x higher than that of PyCARL.

8.7. Architecture Exploration

Figure 23 reports the number of clusters generated using DFSynthesizer for neuromorphic hardware with progressively larger crossbars, normalized to the baseline DYNAP-SE crossbar configuration. We observe that the number of clusters generated using DFSynthesizer reduces by 60% and 92%, respectively, for the two largest crossbar configurations evaluated.

Figure 23. Number of clusters generated using DFSynthesizer for different crossbar sizes, normalized to the baseline DYNAP-SE crossbar configuration.

A smaller number of clusters increases throughput. To illustrate this, Figure 24 reports the throughput using DFSynthesizer for different crossbar sizes, normalized to the throughput on DYNAP-SE with four crossbars. We make the following two observations.

Figure 24. Throughput achieved using DFSynthesizer for different crossbar sizes, normalized to the throughput on the baseline DYNAP-SE configuration.

First, throughput increases by 18% and 30% for the two larger crossbar configurations, respectively. This improvement is because, with larger crossbars, DFSynthesizer generates fewer clusters (Fig. 23). Therefore, the number of clusters per tile reduces, which reduces the bottleneck of time-multiplexing clusters on tiles and increases throughput. Second, for applications such as DigitMLP, EdgeDet, and HeartEstm, there is no throughput improvement when the crossbar size is increased further from the smaller to the larger of these configurations. This is because, for these applications, the smaller configuration is already sufficient to achieve the highest throughput. For all other applications, the throughput increases by 11% when moving from the smaller to the larger configuration.
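
This trend can be approximated with a coarse first-order model: if each layer-to-layer connection must be tiled into crossbar-sized blocks, the cluster count shrinks roughly quadratically with the crossbar dimension, and throughput is limited by the tile that time-multiplexes the most clusters. The sketch below illustrates this reasoning under stated assumptions; the crossbar dimensions, tile count, and helper functions are hypothetical and do not reproduce DFSynthesizer's Algorithm 2 or its Max-Plus analysis.

```python
import math

def estimate_clusters(layer_sizes, crossbar_dim):
    """Coarse estimate: each (pre, post) layer pair is tiled into
    crossbar_dim x crossbar_dim blocks; DFSynthesizer's actual clustering
    packs synapses far more densely."""
    total = 0
    for pre, post in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += math.ceil(pre / crossbar_dim) * math.ceil(post / crossbar_dim)
    return total

def relative_throughput(n_clusters, n_tiles, cluster_exec_time=1.0):
    """Throughput is limited by the most heavily time-multiplexed tile."""
    clusters_on_busiest_tile = math.ceil(n_clusters / n_tiles)
    return 1.0 / (clusters_on_busiest_tile * cluster_exec_time)

layers = [784, 100, 10]          # e.g., the DigitRecogMLP topology in Table 2
for dim in (128, 256, 512):      # hypothetical crossbar dimensions
    c = estimate_clusters(layers, dim)
    print(dim, c, relative_throughput(c, n_tiles=4))
```

In this toy model, once the crossbar is large enough that each tile holds a single cluster, further increases in crossbar size yield no additional throughput, mirroring the saturation observed for the smaller applications above.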

8.8. Synthesis Time

Figure 25 reports the synthesis time on DYNAP-SE for the evaluated approaches, for each application normalized to PyCARL. We make the following three key observations.

Figure 25. Synthesis time for each application normalized to PyCARL.

First, the synthesis time of SpiNeMap is on average 61.6% higher than PyCARL. The higher synthesis time of SpiNeMap is due to the workload analysis it performs to obtain the minimum-energy mapping. Second, the synthesis time of DFSynthesizer is the highest. On average, the synthesis time of DFSynthesizer is 35x higher than PyCARL and 25x higher than SpiNeMap. This higher synthesis time is due to 1) DFSynthesizer's mapping exploration using Algorithm 3, and 2) DFSynthesizer's SDFG analysis mechanism using the proposed Max-Plus formulation. Third, the synthesis time of DFSynthesizer increases with model complexity: it is higher than PyCARL by 3.1x for LeNet, 25.5x for AlexNet, and 272.3x for VGG16.

8.9. Model Quality

DFSynthesizer does not alter synaptic connections. Therefore, the model quality, e.g., accuracy, is not impacted by the analysis technique of DFSynthesizer. The only impact DFSynthesizer introduces is in converting CNNs; the accuracy impact of this conversion is reported in Table 1. For all other applications, DFSynthesizer's accuracy is the same as the baseline accuracy reported in Table 2.

9. Related Works

Recently, many approaches have been proposed to map machine learning workloads to neuromorphic hardware. Corelet (Amir et al., 2013) is used to map SNNs to TrueNorth (DeBole et al., 2019). PACMAN (Galluppi et al., 2015) is used to map SNNs to SpiNNaker (Furber et al., 2014). PyNN (Balaji et al., 2020a) is used to map SNNs on Loihi (Davies et al., 2018), BrainScaleS (Schemmel et al., 2012), and Neurogrid (Benjamin et al., 2014) by balancing the load on each tile. PyCARL (Balaji et al., 2020a) is used to map SNNs to DYNAP-SE (Moradi et al., 2017). The primary objective of these approaches is to balance the workload on each tile by distributing the neurons and synapses evenly.

Beyond load balancing, recent techniques have also explored other objectives. PSOPART (Das et al., 2018c) is used to map SNNs to neuromorphic hardware, reducing the energy consumption on the shared interconnect. SpiNeMap (Balaji et al., 2020b) performs energy-aware clustering of SNNs and then maps the clusters to tiles, reducing the communication energy. DecomposeSNN (Balaji et al., 2020d) decomposes an SNN to improve the cluster utilization. There are also performance-oriented SNN mapping approaches such as (Balaji et al., 2020c; Song et al., 2020a; Balaji et al., 2019b; Balaji and Das, 2020), energy-aware SNN mapping approaches such as (Titirsha et al., 2021a), circuit aging-aware SNN mapping approaches such as (Song et al., 2020c; Song and Das, 2020a; Balaji et al., 2019a; Kundu et al., 2021; Song et al., 2021b), endurance-aware SNN mapping approaches such as (Titirsha and Das, 2020a; Titirsha et al., 2021b; Song et al., 2021d), and thermal-aware SNN mapping approaches such as (Titirsha and Das, 2020b). These approaches are evaluated with emerging SNN based applications (Moyer et al., 2020; Balaji et al., 2018; Das et al., 2018b; Diehl and Cook, 2015; Das et al., 2018a; Kashyap et al., 2018), which we also use to evaluate DFSynthesizer.

There are also other mapping approaches such as (Ankit et al., 2018; Zhang et al., 2018; Xia and Yang, 2019; Lee et al., 2019; Wijesinghe et al., 2018; Wen et al., 2015; Ramasubramanian et al., 2014). We compare DFSynthesizer against PyCARL and SpiNeMap and find it to perform significantly better.

Similar Concept in Related Domain

SDFGs are widely used for predictable mapping of applications to multiprocessor systems. Numerous approaches to throughput analysis of SDFGs have been previously proposed (Stuijk et al., 2006b, 2007; Damavandpeyma et al., 2012; Zhu et al., 2012; Shafik et al., 2015; Das et al., 2015b; Shafik et al., 2015). Bonfietti et al. evaluated mappings of SDFGs to multiprocessor systems, maximizing the throughput (Bonfietti et al., 2013). Stemmer et al. propose to use probabilistic analysis to allocate and schedule SDFGs on multiprocessor systems (Stemmer et al., 2020). Das et al. evaluated the fault-tolerant mapping of SDFGs to multiprocessor systems (Das et al., 2013b, 2015a, 2014a; Das and Kumar, 2012; Das et al., 2013a, 2014b, 2012, c, 2016). Recently, SDFG-based analysis has also been proposed for analyzing machine learning applications (Das and Kumar, 2018; Balaji and Das, 2019; Hong et al., 2017; Chen, Yu-Hsin and Emer, Joel and Sze, Vivienne, 2017; Bacis et al., 2017; Song et al., 2021c). However, none of these approaches address application analysis with limited hardware resources, both at design time and at run time.

10. Conclusions

We introduce DFSynthesizer for predictable synthesis of SNN-based applications on state-of-the-art neuromorphic hardware. Prior works have only addressed design-time mapping, considering unlimited resources in the underlying hardware. These approaches present significant limitations when used to compile and map machine-learning applications to a resource-constrained hardware. DFSynthesizer makes five key contributions. First, we present an approach to analyze machine-learning programs and generate SNN workload using representative data. Second, we present an approach to decompose and partition complex SNN workloads to generate clusters of neurons and synapses such that each cluster can fit onto a crossbar of the hardware. Third, we exploit the rich semantics of Synchronous Dataflow Graphs (SDFGs) to represent clustered SNN programs. This allows for the SNN's performance, e.g., throughput, to be estimated on the hardware as a function of key properties such as number of crossbars, dimension of crossbars, buffer space on tiles, and tile communication bandwidth. Fourth, we develop a novel scheduling algorithm based on Self-Timed Execution for executing clusters on crossbars of a neuromorphic hardware, providing a performance guarantee in scenarios with dynamic resource availability. Fifth, we propose a design-space exploration framework incorporating DFSynthesizer that allows the Pareto space of different SNN mappings to hardware to be explored while considering other hardware metrics such as energy, latency, and reliability.

We evaluate DFSynthesizer using 10 machine learning programs that are representative of the three most commonly used neural network classes — convolutional neural network (CNN), multi-layer perceptron (MLP), and recurrent neural network (RNN). Our results demonstrate that DFSynthesizer provides much tighter performance guarantee compared to current practices.

Acknowledgements.

This work is supported by 1) the National Science Foundation Award CCF-1937419 (RTML: Small: Design of System Software to Facilitate Real-Time Neuromorphic Computing) and 2) the National Science Foundation Faculty Early Career Development Award CCF-1942697 (CAREER: Facilitating Dependable Neuromorphic Computing: Vision, Architecture, and Impact on Programmability).

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In OSDI, Cited by: §3.2.1, Table 1.
  • A. Amir, P. Datta, W. P. Risk, A. S. Cassidy, J. A. Kusnitz, S. K. Esser, A. Andreopoulos, T. M. Wong, M. Flickner, R. Alvarez-Icaza, et al. (2013) Cognitive computing programming paradigm: a corelet language for composing networks of neurosynaptic cores. In IJCNN, Cited by: §9.
  • A. Ankit, A. Sengupta, and K. Roy (2017) TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design. In ICCAD, Cited by: Figure 5, §4.1.
  • A. Ankit, A. Sengupta, and K. Roy (2018) Neuromorphic computing across the stack: devices, circuits and architectures. In SIPS, Cited by: §9.
  • L. Atzori, A. Iera, and G. Morabito (2010) The internet of things: a survey. Computer Networks. Cited by: §1.
  • M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio (2017) A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA. In IPDPSW, Cited by: §9.
  • A. Balaji, P. Adiraju, H. J. Kashyap, A. Das, J. L. Krichmar, N. D. Dutt, and F. Catthoor (2020a) PyCARL: A PyNN interface for hardware-software co-simulation of spiking neural network. In IJCNN, Cited by: §3.2.2, §5.4, 2nd item, §7.1, §7, §9.
  • A. Balaji, F. Corradi, A. Das, S. Pande, S. Schaafsma, and F. Catthoor (2018) Power-accuracy trade-offs for heartbeat classification on neural networks hardware. JOLPE. Cited by: §3.3.1, §7.1, Table 2, §7, §9, footnote 5.
  • A. Balaji, A. Das, Y. Wu, K. Huynh, F. G. Dell’anna, G. Indiveri, J. L. Krichmar, N. D. Dutt, S. Schaafsma, and F. Catthoor (2020b) Mapping spiking neural networks to neuromorphic hardware. TVLSI. External Links: Document, 2004.03717, ISSN 15579999 Cited by: §4.3, §5.4, 1st item, §7.1, §7, §9.
  • A. Balaji and A. Das (2019) A framework for the analysis of throughput-constraints of SNNs on neuromorphic hardware. In ISVLSI, Cited by: §9.
  • A. Balaji and A. Das (2020) Compiling spiking neural networks to mitigate neuromorphic hardware constraints. In IGSC Workshops, Cited by: §9.
  • A. Balaji, T. Marty, A. Das, and F. Catthoor (2020c) Run-time mapping of spiking neural networks to neuromorphic hardware. JSPS. Cited by: §9.
  • A. Balaji, S. Song, A. Das, N. Dutt, J. Krichmar, N. Kandasamy, and F. Catthoor (2019a) A framework to explore workload-specific performance and lifetime trade-offs in neuromorphic computing. CAL. Cited by: §9.
  • A. Balaji, S. Song, A. Das, J. Krichmar, N. Dutt, J. Shackleford, N. Kandasamy, and F. Catthoor (2020d) Enabling resource-aware mapping of spiking neural networks via spatial decomposition. ESL. Cited by: §4.2, §9.
  • A. Balaji, S. Song, T. Titirsha, A. Das, J. Krichmar, N. Dutt, J. Shackleford, N. Kandasamy, and F. Catthoor (2021) NeuroXplorer 1.0: An extensible framework for architectural exploration with spiking neural networks. In ICONS, Cited by: Figure 16, §7.
  • A. Balaji, S. Ullah, A. Das, and A. Kumar (2019b) Design methodology for embedded approximate artificial neural networks. In GLSVLSI, Cited by: §9.
  • A. Balaji, Y. Wu, A. Das, F. Catthoor, and S. Schaafsma (2019c) Exploration of segmented bus as scalable global interconnect for neuromorphic computing. In GLSVLSI, Cited by: footnote 6.
  • S. S. Battacharyya, P. K. Murthy, and E. A. Lee (1996) Loose interdependence algorithms. In Software Synthesis from Dataflow Graphs, Cited by: Figure 10, §4.5.
  • T. Bekolay, J. Bergstra, E. Hunsberger, T. DeWolf, T. C. Stewart, D. Rasmussen, X. Choo, A. Voelker, and C. Eliasmith (2014) Nengo: a python tool for building large-scale functional brain models. Frontiers in Neuroinformatics. Cited by: §3.3.1.
  • L. Benini and G. De Micheli (2002) Networks on chip: a new paradigm for systems on chip design. In DATE, Cited by: footnote 6.
  • B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen (2014) Neurogrid: a mixed-analog-digital multichip system for large-scale neural simulations. Proceedings of the IEEE. Cited by: §9.
  • O. Bichler, D. Briand, V. Gacoin, and B. Bertelone (2017) N2D2: Neural network design & deployment. https://github.com/CEA-LIST/N2D2. Cited by: §3.3.1.
  • A. Bonfietti, M. Lombardi, M. Milano, and L. Benini (2013) Maximum-throughput mapping of SDFGs on multi-core SoC platforms. JPDC. Cited by: §9.
  • G. W. Burr, R. M. Shelby, et al. (2017) Neuromorphic computing using non-volatile memory. Advances in Physics: X. External Links: ISSN 23746149 Cited by: §1, §1.
  • F. Catthoor, S. Mitra, A. Das, and S. Schaafsma (2018) Very large-scale neuromorphic systems for biological signal processing. In CMOS Circuits for Biological Sensing and Processing, Cited by: Figure 2, §1, Figure 5, §4.1.
  • Y. Chen, T. Krishna, J. S. Emer, and V. Sze (2016) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC. Cited by: §1.
  • Chen, Yu-Hsin and Emer, Joel and Sze, Vivienne (2017) Using dataflow to optimize energy efficiency of deep neural network accelerators. IEEE Micro. Cited by: §9.
  • T-S. Chou, H. J. Kashyap, J. Xing, S. Listopad, E. L. Rounds, M. Beyeler, N. Dutt, and J. L. Krichmar (2018) CARLsim 4: An open source library for large scale, biologically detailed spiking neural network simulation using heterogeneous clusters. In IJCNN, Cited by: §3.2.2, §3.3.1, Table 1, Figure 11, §4.6, §7.1, Table 2, §7.
  • J. Cong and Z. Zhang (2006) An efficient and versatile scheduling algorithm based on SDC formulation. In DAC, Cited by: §4.6.
  • M. Damavandpeyma, S. Stuijk, T. Basten, M. Geilen, and H. Corporaal (2012) Modeling static-order schedules in synchronous dataflow graphs. In DATE, Cited by: §9.
  • A. Das, P. Pradhapan, W. Groenendaal, P. Adiraju, R.T. Rajan, F. Catthoor, S. Schaafsma, J.L. Krichmar, N. Dutt, and C. Van Hoof (2018a) Unsupervised heart-rate estimation in wearables with Liquid states and a probabilistic readout. Neural Networks. Cited by: §7.1, Table 2, §7, §9.
  • A. Das, B. M. Al-Hashimi, and G. V. Merrett (2016) Adaptive and hierarchical runtime manager for energy-aware thermal management of embedded systems. TECS. Cited by: §9.
  • A. Das, F. Catthoor, and S. Schaafsma (2018b) Heartbeat classification in wearables using multi-layer perceptron and time-frequency joint distribution of ECG. In CHASE, Cited by: §7.1, §9.
  • A. Das, A. Kumar, and B. Veeravalli (2012) Energy-aware communication and remapping of tasks for reliable multimedia multiprocessor systems. In ICPADS, Cited by: §6, §9.
  • A. Das, A. Kumar, and B. Veeravalli (2013a) Aging-aware hardware-software task partitioning for reliable reconfigurable multiprocessor systems. In CASES, Cited by: §9.
  • A. Das, A. Kumar, and B. Veeravalli (2013b) Communication and migration energy aware design space exploration for multicore systems with intermittent faults. In DATE, Cited by: §9.
  • A. Das, A. Kumar, and B. Veeravalli (2014a) Communication and migration energy aware task mapping for reliable multiprocessor systems. FGCS. Cited by: §4.3, §6, §9.
  • A. Das, A. Kumar, and B. Veeravalli (2014b) Energy-aware task mapping and scheduling for reliable embedded computing systems. TECS. Cited by: §9.
  • A. Das, A. Kumar, and B. Veeravalli (2015a) Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems. TPDS. Cited by: §9.
  • A. Das and A. Kumar (2012) Fault-aware task re-mapping for throughput constrained multimedia applications on NoC-based MPSoCs. In RSP, Cited by: §9.
  • A. Das and A. Kumar (2018) Dataflow-based mapping of spiking neural networks on neuromorphic hardware. In GLSVLSI, Cited by: §9.
  • A. Das, A. K. Singh, and A. Kumar (2013c) Energy-aware dynamic reconfiguration of communication-centric applications for reliable MPSoCs. In ReCoSoC, Cited by: §9.
  • A. Das, M. J. Walker, A. Hansson, B. M. Al-Hashimi, and G. V. Merrett (2015b) Hardware-software interaction for run-time power optimization: a case study of embedded linux on multicore smartphones. In Proceedings of ISLPED, Cited by: §9.
  • A. Das, Y. Wu, K. Huynh, F. Dell’Anna, F. Catthoor, and S. Schaafsma (2018c) Mapping of local and global synapses on spiking neuromorphic hardware. In DATE, Cited by: §7.1, §7, §9.
  • M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al. (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro. Cited by: Figure 2, §1, §3.3.1, §7, §9.
  • A. P. Davison, D. Brüderle, J. M. Eppler, J. Kremkow, E. Muller, D. Pecevski, L. Perrinet, and P. Yger (2009) PyNN: a common interface for neuronal network simulators. Frontiers in Neuroinformatics. Cited by: §3.2.2.
  • M. V. DeBole, B. Taba, A. Amir, F. Akopyan, A. Andreopoulos, W. P. Risk, J. Kusnitz, C. O. Otero, T. K. Nayak, R. Appuswamy, et al. (2019) TrueNorth: Accelerating from zero to 64 million neurons in 10 years. Computer. Cited by: Figure 2, §1, §3.3.1, §7, §9.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §3.2.1.
  • L. Deng (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. Signal Processing Magazine. Cited by: §3.2.1.
  • P. U. Diehl and M. Cook (2015) Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience. Cited by: §7.1, Table 2, §9.
  • J. M. Eppler, M. Helias, E. Muller, M. Diesmann, and M. Gewaltig (2009) PyNEST: A convenient interface to the NEST simulator. Frontiers in Neuroinformatics. Cited by: §3.2.2.
  • S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana (2014) The SpiNNaker project. Proceedings of the IEEE. Cited by: §9.
  • F. Galluppi, X. Lagorce, E. Stromatias, M. Pfeiffer, L. A. Plana, S. B. Furber, and R. B. Benosman (2015) A framework for plasticity implementation on the spinnaker neural architecture. Frontiers in Neuroscience. Cited by: §9.
  • A. H. Ghamarian, M. C. Geilen, S. Stuijk, T. Basten, B. D. Theelen, M. R. Mousavi, A. J. Moonen, and M. J. Bekooij (2006) Throughput analysis of synchronous data flow graphs. In ACSD, Cited by: §6.
  • D. F. Goodman and R. Brette (2009) The brian simulator. Frontiers in Neuroscience. Cited by: §3.2.2.
  • R. Gopalakrishnan, Y. Chua, P. Sun, A. J. S. Kumar, and A. Basu (2020) HFNet: A CNN architecture co-designed for neuromorphic hardware with a crossbar array of synapses. Frontiers in Neuroscience. Cited by: Figure 5, §4.1.
  • A. Gulli and S. Pal (2017) Deep learning with keras. Cited by: §3.2.1, Table 1, §7.
  • B. Heidergott, G. J. Olsder, and J. Van Der Woude (2014) Max Plus at work: Modeling and analysis of synchronized systems: a course on Max-Plus algebra and its applications. Cited by: §4.6.
  • M. L. Hines and N. T. Carnevale (1997) The NEURON simulation environment. Neural Computation. Cited by: §3.2.2.
  • H. Hong, H. Oh, and S. Ha (2017) Hierarchical dataflow modeling of iterative applications. In DAC, Cited by: §9.
  • M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams (2016) Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. In DAC, Cited by: Figure 5, §4.1.
  • G. Indiveri (2003) A low-power adaptive integrate-and-fire neuron circuit. In ISCAS, Cited by: §1.
  • Y. Ji, Y. Zhang, S. Li, P. Chi, C. Jiang, P. Qu, Y. Xie, and W. Chen (2016) NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. In MICRO, Cited by: §4.3.
  • H. J. Kashyap, G. Detorakis, N. Dutt, J. L. Krichmar, and E. Neftci (2018) A recurrent neural network based model of predictive smooth pursuit eye movement in primates. In IJCNN, External Links: ISBN 9781509060146 Cited by: §7.1, Table 2, §9.
  • B. W. Kernighan and S. Lin (1970) An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal. Cited by: §4.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. NeurIPS. Cited by: §3.2.1.
  • S. Kundu, K. Basu, M. Sadi, T. Titirsha, S. Song, A. Das, and U. Guin (2021) Special Session: reliability analysis for ML/AI hardware. In VTS, Cited by: §9.
  • Y. LeCun et al. (2015) LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet. Cited by: §3.2.1.
  • E.A. Lee and D.G. Messerschmitt (1987) Synchronous data flow. Proceedings of the IEEE. Cited by: 3rd item, §4.4.
  • M. K. F. Lee, Y. Cui, T. Somu, T. Luo, J. Zhou, W. T. Tang, W. Wong, and R. S. M. Goh (2019) A system-level simulator for RRAM-based neuromorphic computing chips. TACO. Cited by: §9.
  • W. Maass (1997) Networks of spiking neurons: the third generation of neural network models. Neural Networks. Cited by: §1.
  • A. Mallik, D. Garbin, A. Fantini, D. Rodopoulos, R. Degraeve, J. Stuijt, A. Das, S. Schaafsma, P. Debacker, G. Donadio, et al. (2017) Design-technology co-optimization for oxrram-based synaptic processing unit. In VLSIT, Cited by: §1.
  • S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri (2017) A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). TBCAS. Cited by: Figure 2, §1, §3.3.1, Figure 5, §7.2, Table 3, §9, footnote 7.
  • O. M. Moreira and M. J. Bekooij (2007) Self-timed scheduling analysis for real-time applications. JASP. Cited by: §5.3.
  • E. J. Moyer, A. Das, et al. (2020) Machine learning applications to dna subsequence and restriction site analysis. In SPMB, Cited by: §9.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: An imperative style, high-performance deep learning library. arXiv. Cited by: §3.2.1.
  • S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan (2014) SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing. In ISLPED, Cited by: §9.
  • V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, et al. (2020) Mlperf inference benchmark. In ISCA, Cited by: §3.2.1.
  • B. Rueckauer, I. Lungu, Y. Hu, and M. Pfeiffer (2016) Theory and tools for the conversion of analog to spiking convolutional neural networks. arXiv. Cited by: §3.3.1.
  • J. Schemmel, A. Grübl, S. Hartmann, A. Kononov, C. Mayr, K. Meier, S. Millner, J. Partzsch, S. Schiefer, S. Scholze, et al. (2012) Live demonstration: a scaled-down version of the brainscales wafer-scale neuromorphic system. In ISCAS, Cited by: §9.
  • R. A. Shafik, A. Das, S. Yang, G. Merrett, and B. M. Al-Hashimi (2015) Adaptive energy minimization of openmp parallel applications on many-core systems. In PARMA-DITAM, Cited by: §9.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv. Cited by: §3.2.1.
  • S. Song, A. Balaji, A. Das, N. Kandasamy, and J. Shackleford (2020a) Compiling spiking neural networks to neuromorphic hardware. In LCTES, Cited by: §5.4, 3rd item, §9, footnote 2.
  • S. Song, A. Das, and N. Kandasamy (2020b) Exploiting inter- and intra-memory asymmetries for data mapping in hybrid tiered-memories. In ISMM, Cited by: footnote 1.
  • S. Song, A. Das, and N. Kandasamy (2020c) Improving dependability of neuromorphic computing with non-volatile memory. In EDCC, Cited by: §9.
  • S. Song, A. Das, O. Mutlu, and N. Kandasamy (2019) Enabling and exploiting partition-level parallelism (PALP) in phase change memories. TECS. Cited by: footnote 1.
  • S. Song, A. Das, O. Mutlu, and N. Kandasamy (2020d) Improving phase change memory performance with data content aware access. In ISMM, Cited by: footnote 1.
  • S. Song, A. Das, O. Mutlu, and N. Kandasamy (2021a) Aging-aware request scheduling for non-volatile main memory. In ASP-DAC, Cited by: footnote 1.
  • S. Song and A. Das (2020a) A case for lifetime reliability-aware neuromorphic computing. In MWSCAS, Cited by: §9.
  • S. Song and A. Das (2020b) Design methodologies for reliable and energy-efficient PCM systems. In IGSC Workshops, Cited by: footnote 1.
  • S. Song, J. Hanamshet, A. Balaji, A. Das, J. Krichmar, N. Dutt, N. Kandasamy, and F. Catthoor (2021b) Dynamic reliability management in neuromorphic computing. JETC. Cited by: §9.
  • S. Song, A. Paul, L. V. Mirtinti, A. Das, and N. Kandasamy (2021c) A design flow for mapping spiking neural networks to many-core neuromorphic hardware. arXiv. Cited by: §9.
  • S. Song, T. Titirsha, and A. Das (2021d) Improving inference lifetime of neuromorphic systems via intelligent synapse mapping. In ASAP, Cited by: §9.
  • S. Sriram and S.S. Bhattacharyya (2000) Embedded Multiprocessors; Scheduling and Synchronization. Cited by: §4.4.
  • R. Stemmer, H. Vu, K. Grüttner, S. Le Nours, W. Nebel, and S. Pillement (2020) Towards probabilistic timing analysis for SDFGs on tile based heterogeneous MPSoCs. In ECRTS, Cited by: §9.
  • S. Stuijk, M. Geilen, and T. Basten (2006a) Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs. In DAC, Cited by: §4.4.
  • S. Stuijk, T. Basten, M. Geilen, and H. Corporaal (2007) Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In DAC, Cited by: §9.
  • S. Stuijk, M. Geilen, and T. Basten (2006b) Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs. In DAC, Cited by: §9.
  • T. Titirsha and A. Das (2020a) Reliability-performance trade-offs in neuromorphic computing. In IGSC Workshops, Cited by: §4.6, §9.
  • T. Titirsha and A. Das (2020b) Thermal-aware compilation of spiking neural networks to neuromorphic hardware. In LCPC, Cited by: §4.6, §9.
  • T. Titirsha, S. Song, A. Balaji, and A. Das (2021a) On the role of system software in energy management of neuromorphic computing. In CF, Cited by: §5.4, §9.
  • T. Titirsha, S. Song, A. Das, J. Krichmar, N. Dutt, N. Kandasamy, and F. Catthoor (2021b) Endurance-aware mapping of spiking neural networks to neuromorphic hardware. TPDS. Cited by: §4.6, §9.
  • W. Wen, C. Wu, X. Hu, B. Liu, T. Ho, X. Li, and Y. Chen (2015) An EDA framework for large scale hybrid neuromorphic computing systems. In DAC, Cited by: §9.
  • P. Wijesinghe, A. Ankit, A. Sengupta, and K. Roy (2018) An all-memristor deep spiking neural computing system: a step toward realizing the low-power stochastic brain. TETCI. Cited by: §9.
  • Q. Xia and J. J. Yang (2019) Memristive crossbar arrays for brain-inspired computing. Nature Materials. Cited by: §9.
  • X. Zhang, A. Huang, Q. Hu, Z. Xiao, and P. K. Chu (2018) Neuromorphic computing with memristor crossbar. Physica Status Solidi (a). Cited by: §9.
  • Z. Zhang and B. Liu (2013) SDC-based modulo scheduling for pipeline synthesis. In ICCAD, Cited by: §4.6.
  • X. Zhu, M. Geilen, T. Basten, and S. Stuijk (2012) Static rate-optimal scheduling of multirate DSP algorithms via retiming and unfolding. In RTAS, Cited by: §9.

Appendix A Converting Analog Operations to Spiking Equivalent

In this section, we briefly elaborate how an analog operation such as the Rectified Linear Unit (ReLU) is implemented using a Spiking Neural Network (SNN). The output $y$ of a ReLU activation function is given by

$$y = \max\Big(0, \sum_{i} w_i \cdot x_i\Big) \qquad (16)$$

where $w_i$ is the weight and $x_i$ is the activation on the $i^{\text{th}}$ synapse of the neuron. To map the ReLU activation function, we consider a particular type of spiking neuron model known as the Integrate-and-Fire (IF) neuron model. The IF spiking neuron's transfer function can be represented as

$$V_m(t) = V_m(t-1) + \sum_{i} w_i \cdot x_i(t) \qquad (17)$$

where $V_m(t)$ is the membrane potential of the IF neuron at time $t$, $w_i$ is the weight, and $x_i(t)$ is the activation on the $i^{\text{th}}$ synapse of the neuron at time $t$. The IF spiking neuron integrates incoming spikes ($x_i$) and generates an output spike when the membrane potential ($V_m$) exceeds the threshold voltage ($V_{th}$) of the IF neuron. Therefore, by ensuring that the output spiking rate is proportional to the ReLU activation $y$, we accurately convert the ReLU activation to the spike-based model. To further illustrate this, we consider the multi-layer perceptron (MLP) of Figure 26a and its SNN conversion using rate-based encoding (Figure 26b) and inter-spike interval (ISI) encoding (Figure 26c).

Figure 26. Example of converting an analog MLP to its spiking equivalent.

In Figure 26a, neurons 1, 2, and 3 are the input neurons and neurons 4 and 5 are the output neurons. To keep the model simple, let us consider the case where the activations of the input neurons 1, 2, and 3 are all equal to 1. Using Equation 16, we know that the outputs of neurons 4 and 5 are 0.6 and 0.3, respectively. Figures 26b and 26c show the mapped SNN model, using the rate-based and inter-spike interval encoding schemes, respectively. In the rate-based model in Figure 26b, the rate of spikes generated is expected to be proportional to the outputs of neurons 4 and 5 in the MLP. In the ISI-based SNN model, the inter-spike interval of the spikes generated by neurons 4 and 5 is expected to be proportional to the output generated in the MLP, as shown in Figure 26c.
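
As a sanity check of this conversion, the sketch below simulates IF neurons 4 and 5 of Figure 26 with rate-coded inputs and verifies that their output spike rates settle at approximately 0.6 and 0.3. The weights, threshold, and simulation window are illustrative assumptions; only the ReLU outputs 0.6 and 0.3 are taken from the example above.

```python
import numpy as np

T = 10_000                       # simulation window (time steps), illustrative
v_th = 1.0                       # illustrative firing threshold

# Rate-coded inputs: neurons 1-3 have activation 1, i.e., they spike every step.
inputs = np.ones((T, 3))

# Illustrative weights chosen so the ReLU outputs of Figure 26a are 0.6 and 0.3.
w = np.array([[0.2, 0.3, 0.1],   # fan-in of output neuron 4 -> sum = 0.6
              [0.1, 0.1, 0.1]])  # fan-in of output neuron 5 -> sum = 0.3

v = np.zeros(2)                  # membrane potentials of IF neurons 4 and 5
spike_counts = np.zeros(2)
for t in range(T):
    v += w @ inputs[t]           # integrate weighted input spikes (Eq. 17)
    fired = v >= v_th
    spike_counts += fired
    v[fired] -= v_th             # reset by subtraction to preserve the rate code

print(spike_counts / T)          # ~[0.6, 0.3], proportional to the ReLU outputs (Eq. 16)
```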

We note that non-linear activation functions such as sigmoid and tanh cannot be accurately mapped to a spike-based model. This can be attributed to the transfer function of a biological spiking neuron (neuron response curve) closely resembling a ReLU rather than the sigmoid or tanh activation functions. While approximate implementations of the sigmoid and tanh operators using spiking neurons can be found in the literature, they induce significant inaccuracies into the conversion process and require more resources (neurons) to implement. The tanh activation function, for instance, generates output values ranging from -1.0 to 1.0. In order to represent the tanh function in a spike-based model, both excitatory and inhibitory spiking neurons would be required to represent the positive and negative output values, respectively. This would require doubling the number of spiking neurons needed to represent the tanh activation function.