1. Introduction
Spiking Neural Network (SNN) is an emerging computing model that uses spike-based computations and bio-inspired learning algorithms (Maass, 1997). In an SNN, pre-synaptic neurons communicate information encoded in spike trains to post-synaptic neurons via synapses (see Fig. 1). Performance, e.g., accuracy of an SNN model, is assessed in terms of the inter-spike interval (ISI), which is defined as the inverse of the mean firing rate of the neurons.
SNNs are typically executed on neuromorphic hardware platforms such as DYNAP-SE (Moradi et al., 2017), TrueNorth (DeBole et al., 2019), and Loihi (Davies et al., 2018). These hardware platforms are designed as a tile-based architecture with a shared, hierarchical interconnect to facilitate inter-tile communication (see Fig. 2) (Catthoor et al., 2018). Each tile consists of a crossbar for mapping neurons and synapses, and input and output buffer space for communicating spikes over the interconnect. A crossbar is a 2D organization of horizontal and vertical wires, where the horizontal wires are connected to pre-synaptic neurons and the vertical wires to post-synaptic neurons. Non-Volatile Memory (NVM) cells are placed at the crosspoints of each crossbar to implement the storage of synaptic weights (Mallik et al., 2017; Burr et al., 2017).^{1}

^{1} Beyond neuromorphic computing, NVMs are also used as main memory for conventional computing using shared-memory computers (Song et al., 2021a, 2019; Song and Das, 2020b; Song et al., 2020b, d).
The energy consumed by neuromorphic hardware can be several orders of magnitude lower than that of a conventional machine-learning accelerator such as Eyeriss (Chen et al., 2016). This is due to the low-power VLSI implementation of analog neurons (Indiveri, 2003), low-power and high-density NVM-based synaptic storage (Burr et al., 2017), as well as a distributed computing and storage architecture using crossbars. Given these advantages, a neuromorphic hardware can implement machine-learning tasks for power-constrained platforms such as embedded systems and edge nodes of the Internet-of-Things (IoT) (Atzori et al., 2010).
Unlike conventional von Neumann computing systems, where CPUs compute by exchanging data centrally with the main memory, synthesizing, i.e., compiling and mapping, a machine-learning program on a neuromorphic hardware is challenging. This is because in a neuromorphic hardware, the computation units (i.e., the neurons) and the storage units (i.e., the synapses) are distributed within the hardware as crossbars. It is therefore important to properly partition a large SNN model such that it can be mapped efficiently to the underlying resources. Additionally, each crossbar limits how many pre-synaptic connections are allowed per post-synaptic neuron and how much buffer space is available to send and receive spikes over the interconnect. These hardware limitations impact both model accuracy and hardware performance metrics such as throughput, latency, and energy consumption.
We develop DFSynthesizer, a systematic and end-to-end framework to analyze and map machine-learning programs to state-of-the-art neuromorphic hardware while guaranteeing performance. The following are our key contributions.^{2}

^{2} Contributions 2, 3, and 4 appeared in our prior work (Song et al., 2020a). This work introduces contributions 1, 5, and 6.

Contribution 1. We present an approach to analyze machine-learning programs and generate SNN workloads using representative data. Our framework allows workload generation with only a modest impact on model performance.

Contribution 2. We present an approach to decompose and partition complex SNN workloads and generate clusters of neurons and synapses such that each cluster can fit onto the resources of a crossbar in the hardware.

Contribution 3. We exploit the rich semantics of Synchronous Dataflow Graphs (SDFGs) (Lee and Messerschmitt, 1987) to represent clustered SNN programs. This allows the SNN's performance, e.g., throughput, to be estimated on the hardware as a function of key properties such as the number of crossbars, the dimension of crossbars, the buffer space on tiles, and the tile communication bandwidth.

Contribution 4. We develop a novel scheduling algorithm based on Self-Timed Execution for executing clusters on the crossbars of a neuromorphic hardware, providing performance guarantees in scenarios with dynamic resource availability.

Contribution 5. We propose a design-space exploration framework incorporating DFSynthesizer, which allows the Pareto-space of different SNN mappings to hardware to be explored while considering other hardware metrics such as energy, latency, and reliability.

Contribution 6.
We evaluate DFSynthesizer using 10 machine-learning programs that are representative of the three most commonly used neural network classes: convolutional neural network (CNN), multi-layer perceptron (MLP), and recurrent neural network (RNN).
2. Scope and High-Level Overview of DFSynthesizer
DFSynthesizer is developed for supervised machine-learning approaches, where a machine-learning model is first trained using representative data from the field. Machine-learning inference refers to generating output from the trained model by feeding live data. To improve energy efficiency, the inference is performed on a neuromorphic hardware. Once deployed on the hardware, the model is expected to perform inference in real-time on a continuous basis from data collected using sensors.^{3} Therefore, a key performance metric for neuromorphic hardware performing real-time inference is throughput, defined as the number of frames processed per unit time, where a frame is defined as an individual image (for image-based models) or a window of time-series data.^{4}

^{3} Camera sensors are used for image classification models, e.g., LeNet, AlexNet, and VGG16, while electrocardiogram sensors are used for heart-rate classification and estimation models. See our evaluation setup in Section 7.

^{4} By maximizing the throughput, DFSynthesizer minimizes the time to process an individual frame using the neuromorphic inference hardware, which makes DFSynthesizer applicable to both real-time and non-real-time applications.
Figure 3 illustrates the proposed end-to-end framework of DFSynthesizer, which synthesizes, i.e., compiles and maps, a machine-learning program to a neuromorphic hardware in four steps. First, it analyzes a machine-learning program written in a high-level language such as Python or C/C++ to generate an SNN workload (Section 3). Second, it compiles the SNN workload to an intermediate representation format (h5 and json), performing spatial decomposition and clustering so that each cluster fits onto the resources of a crossbar (Section 4). Third, it uses a Synchronous Dataflow Graph (SDFG) to represent the clustered SNN (in XML representation), allocating resources to the clusters considering hardware resource constraints (Section 5). Finally, it schedules the SDFG representation of the clustered SNN to the hardware crossbars, guaranteeing performance (Section 6).
3. Program Analysis and Workload Generation
In this step, a machine-learning program is analyzed to generate its workload. In the following, we discuss the steps involved in workload generation.
3.1. Workflow for Workload Generation
Figure 4 summarizes the workflow of the workload generation step of DFSynthesizer, where a machine-learning program is analyzed to generate its workload, which is then used to map the application to a neuromorphic hardware.
DFSynthesizer can incorporate both Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs) in its workflow. At a high level, the proposed workflow consists of a model training component followed by model analysis. In the following, we elaborate on these components.
3.2. Model Training
3.2.1. Training Artificial Neural Networks
DFSynthesizer's frontend is integrated with Keras (Gulli and Pal, 2017), which is used to define a model and train it on a database. Keras utilizes the Tensorflow backend (Abadi et al., 2016). DFSynthesizer also supports other frameworks such as PyTorch (Paszke et al., 2019). To demonstrate the capabilities of DFSynthesizer, we evaluate it with three Convolutional Neural Network (CNN) architectures: 1) LeNet (LeCun and others, 2015), trained on the MNIST handwritten digit dataset (Deng, 2012), 2) AlexNet (Krizhevsky et al., 2012), trained on the ImageNet dataset (Deng et al., 2009), and 3) VGGNet (Simonyan and Zisserman, 2014), also trained on the ImageNet dataset. These models are derived from the MLPerf benchmark suite (Reddi et al., 2020) and instantiated in Keras. We use a Lambda workstation with two GPUs (see our evaluation setup in Section 7) to train these models.

3.2.2. Training Spiking Neural Networks
DFSynthesizer's frontend supports training SNN models using PyCARL (Balaji et al., 2020a), a Python frontend to CARLsim (Chou et al., 2018). CARLsim facilitates SNN simulations using CPUs and multiple GPUs. PyCARL is designed to integrate with PyNN (Davison et al., 2009), which provides a common frontend to different SNN simulators with various degrees of neurobiological detail. We use CARLsim for model training. CARLsim's support for built-in biologically realistic neuron, synapse, current, and emerging learning models, together with continuous integration and testing, makes it an easy-to-use and powerful simulator of biologically-plausible SNN models. DFSynthesizer can also utilize other SNN simulators such as Brian (Goodman and Brette, 2009), NEST (Eppler et al., 2009), and NEURON (Hines and Carnevale, 1997) for model training.
3.3. Model Analysis
3.3.1. Model Parsing and Conversion
Unfortunately, ANN models cannot be executed directly on event-driven neuromorphic hardware platforms such as DYNAP-SE (Moradi et al., 2017), TrueNorth (DeBole et al., 2019), and Loihi (Davies et al., 2018). Recently, many tools have been proposed to convert ANN operations to SNNs. Examples include Nengo (Bekolay et al., 2014), N2D2 (Bichler et al., 2017), and SNNToolBox (Rueckauer et al., 2016). A common limitation of these toolboxes is that they are open-loop converters, meaning that the conversion is performed considering performance degradation only. In our prior work (Balaji et al., 2018), we proposed a closed-loop conversion mechanism, where the conversion of analog operations to their spiking equivalents is performed considering the energy consumption on hardware. These conversion steps are briefly discussed below.^{5}

^{5} The conversion framework was introduced in (Balaji et al., 2018) for converting the CNN-based HeartClass application to its equivalent SNN representation. We use this application to evaluate DFSynthesizer. Additionally, we have extended the conversion framework with other key functionalities such as Layer Flattening, Concatenation, Binary Weight Activation, and Non-Zero Biases. These new functionalities allow the conversion framework to convert state-of-the-art CNN architectures such as LeNet, AlexNet, and VGG16, which are used to evaluate DFSynthesizer.

ReLU Activation Functions: This is implemented as the approximate firing rate of a leaky integrate-and-fire (LIF) neuron.

Bias: A bias is represented as a constant input current to a neuron, the value of which is proportional to the bias of the neuron in the corresponding analog model.

Weight Normalization: This is achieved by setting a factor to control the firing rate of spiking neurons.

Softmax: To implement softmax, an external Poisson spike generator is used to generate spikes proportional to the weighted sum accumulated at each neuron.

Max and Average Pooling:
To implement max pooling, the neuron that fires first is considered to be the winning neuron; its responses are forwarded to the next layer, suppressing the responses from the other neurons in the pooling function. To implement average pooling, the average firing rate (obtained from the total spike count) of the pooling neurons is forwarded to the next layer.
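The ReLU-to-rate mapping in the first item above can be illustrated with a minimal, self-contained simulation: a LIF neuron driven by a constant input current fires at a rate that is zero for non-positive inputs and grows with positive inputs, approximating ReLU. This is an illustrative sketch with arbitrary parameter values, not the converter used by DFSynthesizer.

```python
def lif_firing_rate(input_current, threshold=1.0, leak=0.1, t_sim=100, dt=1.0):
    """Simulate a leaky integrate-and-fire (LIF) neuron driven by a constant
    input current and return its mean firing rate over the simulation."""
    v, spikes = 0.0, 0
    for _ in range(int(t_sim / dt)):
        v += (input_current - leak * v) * dt  # leaky integration
        if v >= threshold:
            spikes += 1
            v = 0.0  # reset membrane potential after a spike
    return spikes / t_sim
```

For negative or zero currents the membrane potential never reaches threshold, so the rate is zero; for positive currents the rate increases roughly linearly, mirroring max(0, x).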
We have extended our framework with the following new functionalities to allow for the conversion of CNN architectures such as LeNet, AlexNet, and VGGNet to their spiking counterparts.

1D Convolution: The 1D convolution is implemented to extract patterns from inputs in a single spatial dimension. A 1xn filter, called a kernel, slides over the input while computing the element-wise dot product between the input and the kernel at each step.

Residual Connections: Residual connections are implemented to convert the residual block used in CNN models such as ResNet. Typically, the residual connection connects the input of the residual block directly to the output neurons of the block, with a synaptic weight of ‘1’. This allows for the input to be directly propagated to the output of the residual block while skipping the operations performed within the block.

Flattening: The flatten operation converts the 2D output of the final pooling operation into a 1D array. This allows for the output of the pooling operation to be fed as individual features into the decisionmaking fully connected layers of the CNN model.

Concatenation: The concatenation operation, also known as a merging operation, is used as a channel-wise integration of the features extracted from two or more layers into a single output.
Table 1 reports the accuracy impact of the SNN conversion for three state-of-the-art supervised CNN models. These accuracy numbers are obtained from CARLsim (Chou et al., 2018), which allows functional simulation and performance estimation of SNN-based applications. We use these three converted CNN models to evaluate DFSynthesizer (see Section 7).
Application   Top-1 Accuracy, Original (%)   Top-1 Accuracy, SNN (%)
LeNet         94.98                          94.08
AlexNet       74.1                           71.7
VGG16         93.56                          91.62
3.3.2. Workload Generation
The SNN model (or the converted ANN model) is analyzed in CARLsim to generate the following information.

Spike Data: the exact spike times of all neurons in the SNN model. We let spk_i represent the list of spike times of the i-th neuron in the model.

Weight Data: the synaptic strength of all synapses in the SNN model. We let w_{ij} represent the synaptic weight of the connection between the i-th and j-th neurons in the SNN model.
The spike and weight data of a trained SNN form the SNN workload. Formally, an SNN workload is defined as follows.
Definition 1 (SNN Workload). An SNN Workload G_SNN = (N, S) is a directed graph consisting of a finite set N of neurons, a set of spikes, and a finite set S of synapses between the neurons.
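Definition 1 can be sketched as a small data structure; the field names and methods below are our own illustrative choices, not CARLsim's output format. Spike times are stored per neuron and a weight per synapse, from which the mean firing rate (and hence the ISI) can be derived.

```python
from dataclasses import dataclass, field

@dataclass
class SNNWorkload:
    """Directed graph G_SNN = (N, S): neurons, per-neuron spike times,
    and weighted synapses. Illustrative sketch only."""
    neurons: set = field(default_factory=set)
    spike_times: dict = field(default_factory=dict)   # neuron id -> list of spike times
    synapses: dict = field(default_factory=dict)      # (src, dst) -> synaptic weight

    def add_synapse(self, src, dst, weight):
        self.neurons.update((src, dst))
        self.synapses[(src, dst)] = weight

    def mean_firing_rate(self, neuron, t_sim):
        """Mean firing rate over the simulation window; ISI is its inverse."""
        return len(self.spike_times.get(neuron, [])) / t_sim

w = SNNWorkload()
w.add_synapse(0, 1, 0.8)
w.spike_times[0] = [2.0, 5.0, 9.0]
```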
4. Program Compilation and Performance Estimation
In this step, DFSynthesizer clusters a given machinelearning model to map onto the crossbars of a neuromorphic hardware. To do so, we first introduce the system architecture and then discuss the clustering step needed to map applications to this architecture.
4.1. System Architecture
Figure 5 illustrates our system architecture. DFSynthesizer is designed for crossbar-based neuromorphic hardware designs as shown in Figure 2. This is representative of many recent neuromorphic designs (Catthoor et al., 2018; Gopalakrishnan et al., 2020; Ankit et al., 2017; Hu et al., 2016). A machine-learning model (ANN or SNN) is first analyzed to generate its workload (Section 3). This workload is then partitioned to generate clusters, where each cluster consists of a fraction of the neurons and synapses of the original machine-learning model. The clustered workload is stored on disk along with other machine-learning workloads. To execute a specific workload on the neuromorphic hardware, it is first loaded into the host memory, and then the clusters are programmed onto the crossbars of the hardware via the PCIe interface.^{6}

^{6} Although we illustrate the crossbars interconnected in a mesh-based architecture such as Networks-on-Chip (NoC) (Benini and De Micheli, 2002), DFSynthesizer can work with other interconnect types such as Segmented Bus (Balaji et al., 2019c).
In the remainder of this section, we describe the workload compilation step of DFSynthesizer, which consists of two design components: Workload Decomposition and Workload Clustering. We conclude this section by providing a dataflow modeling approach for clustered workloads and performance estimation using such a model.
4.2. Workload Decomposition
We note that each crossbar in a neuromorphic hardware can accommodate up to K pre-synaptic connections per post-synaptic neuron, with the typical value of K set between 128 (in DYNAP-SE) and 256 (in TrueNorth). Figure 6 illustrates an example of mapping a) one 4-input, b) one 3-input, and c) two 2-input neurons on a crossbar. Unfortunately, neurons with more than 4 pre-synaptic connections per post-synaptic neuron cannot be mapped to this crossbar. In fact, in many complex machine-learning models such as AlexNet and VGG16, the number of pre-synaptic connections per post-synaptic neuron is much higher than 128. Therefore, these neurons cannot be mapped to a crossbar in DYNAP-SE.
To address the above limitation, we have previously proposed a spatial decomposition technique that exploits the firing principle of LIF neurons, decomposing each neuron with many pre-synaptic connections into a sequence of homogeneous fan-in-of-two (FIT) neural units (Balaji et al., 2020d).
Figure 7 illustrates the spatial decomposition using the small example of a 3-input neuron shown in Figure 7(a). We consider the mapping of this neuron to 2x2 crossbars. Since each crossbar can accommodate a maximum of two pre-synaptic connections per neuron, the example 3-input neuron cannot be mapped to the crossbar directly. The most common solution is to eliminate a synaptic connection, which may lead to accuracy loss. Figure 7(b) illustrates the decomposition mechanism, where the 3-input neuron is instead implemented using two FIT neural units connected in sequence. Each FIT unit is similar to a 2-input neuron, and it exploits the leaky-integrate behavior in hardware to maintain functional equivalence between Figures 7(a) and 7(b).
For the sake of completeness, Figure 7(c) illustrates the mapping of the decomposed neuron utilizing two 2x2 crossbars. The functionality of the FIT neural units is implemented using the Non-Volatile Memory (NVM) cells of the two crossbars.
To describe the decomposition algorithm, we introduce the following notations. Let {x_1, x_2, ..., x_n} be the n pre-synaptic connections of a neuron. Let {F_1, F_2, ..., F_{n-1}} be the (n-1) FIT neural units generated by spatially decomposing this neuron. The input of unit F_j, denoted as I(F_j), can be represented as

I(F_j) = {x_1, x_2} if j = 1, and {o_{j-1}, x_{j+1}} otherwise    (1)

where o_{j-1} is the output of the unit F_{j-1}. When decomposing a neuron, we note that the first FIT unit uses two of the original inputs of the neuron. Subsequently, every other FIT unit uses one of the original inputs and the output of the preceding FIT unit, as shown in Figure 7(b).
Formally, a decomposed SNN graph is defined as follows.
Definition 2 (Decomposed SNN Graph). A decomposed SNN graph G_D = (F, L) is a directed graph consisting of a finite set F of FIT neural units and a finite set L of links between these units.
Algorithm 1 shows the pseudocode of the spatial decomposition technique, which performs the graph transformation G_SNN -> G_D. For each neuron (line 1), the set of inputs to this neuron is obtained (line 2). The first FIT unit is formed using two of these inputs (line 3), in accordance with Equation 1 and Figure 7(b). The FIT unit is inserted into the decomposed graph (line 4). The algorithm then creates the other FIT units iteratively (lines 5-8) using Equation 1 and stores those units in G_D. Finally, the graph G_D is returned (line 10).

The overall complexity of this algorithm is calculated as follows. The outer for loop (lines 1-9) is executed once for each of the |N| neurons in the original graph G_SNN. Within each iteration, the algorithm creates a total of |I(n_i)| - 1 FIT units, where I(n_i) is the set of inputs of neuron n_i. Therefore, the algorithmic complexity is

O( Σ_{n_i ∈ N} (|I(n_i)| - 1) ) = O(|S|)    (2)

In deriving the final expression, we note that the input connections of all the neurons in the graph are the edges S of the graph.
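The decomposition of Algorithm 1 can be sketched in Python as follows, under the assumption that the input graph maps each neuron to the list of its pre-synaptic sources; the identifiers and output encoding are our own illustrative choices, not the paper's exact notation.

```python
def decompose(snn):
    """Spatial decomposition sketch: replace each neuron having n pre-synaptic
    inputs with a chain of (n-1) fan-in-of-two (FIT) units, per Equation 1.
    `snn` maps a neuron id to its list of pre-synaptic sources; returns the
    decomposed graph as a list of (unit_id, input_pair) tuples."""
    fit_units = []
    for neuron, inputs in snn.items():
        if len(inputs) <= 2:
            fit_units.append(((neuron, 0), tuple(inputs)))
            continue
        # First FIT unit consumes two of the neuron's original inputs.
        prev = (neuron, 0)
        fit_units.append((prev, (inputs[0], inputs[1])))
        # Every later unit takes one original input plus the preceding unit's output.
        for j, x in enumerate(inputs[2:], start=1):
            unit = (neuron, j)
            fit_units.append((unit, (prev, x)))
            prev = unit
    return fit_units

units = decompose({"n": ["a", "b", "c"]})   # a 3-input neuron yields 2 FIT units
```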
4.3. Workload Clustering
The decomposed SNN graph is clustered such that each cluster fits onto a crossbar. Figure 8 illustrates the concept using an example of a decomposed SNN graph shown in (❶). The nodes are the FIT neural units and the links are the synaptic connections. The number on a link represents the average number of spikes communicated between the source and destination FIT units for the representative training data. We consider the mapping of this decomposed SNN graph to a hardware with 2x2 crossbars. Since a crossbar in this hardware can only accommodate a maximum of 2 pre-synaptic connections, we partition the graph of (❶) into two partitions (shown in two different colors) in (❷). These partitions can then be mapped to the two crossbars as shown in (❸), with an average of 8 spikes communicated between the crossbars due to the mapping of the link between neurons d and e on the shared interconnect of the hardware. Finally, the two clusters generated from the SNN graph are shown in (❹) along with the inter-cluster communication.
Formally, a clustered SNN graph is defined as follows.
Definition 3 (Clustered SNN Graph). A clustered SNN graph G_C = (A, C) is a directed graph consisting of a finite set A of clusters and a finite set C of connections between these clusters.
Recently, different approaches have been proposed for clustering SNNs. Examples include SpiNeMap (Balaji et al., 2020b) for energy minimization and NEUTRAMS (Ji et al., 2016) for performance. See Section 9 for a comprehensive overview of other stateoftheart SNN clustering approaches.
We formulate SNN clustering as a graph transformation problem and introduce an efficient algorithm to improve resource utilization. This objective is essential to provide tighter guarantees on the performance of SNNs in hardware, as we demonstrate in Section 8.
The graph transformation G_D -> G_C is a classical graph partitioning problem (Kernighan and Lin, 1970) and has been applied in many contexts, including task mapping on multiprocessor systems (Das et al., 2014a). We propose a greedy approach to pack the FIT neural units and synapses of the decomposed SNN graph into clusters, improving cluster resource utilization. Algorithm 2 provides the pseudocode of the clustering algorithm. For each node of the unrolled graph, the algorithm first tries to merge the node into one of the existing clusters (line 3) before creating a new one (lines 4-8). In this algorithm, clusters are sorted in descending order of neuron and synapse utilization (line 12), so that heavily utilized clusters are considered first for packing neurons and synapses, further improving their utilization.
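A simplified sketch of the greedy idea behind Algorithm 2 is shown below, assuming per-cluster capacities on neurons and intra-cluster synapses. The data structures and the utilization metric are our own simplifications of the paper's algorithm.

```python
def greedy_cluster(nodes, edges, max_neurons, max_synapses):
    """Greedy clustering sketch: try to merge each node into the most-utilized
    existing cluster that still has capacity; otherwise open a new cluster."""
    clusters = []  # each cluster: {"nodes": set, "synapses": int}
    for n in nodes:
        # Consider heavily utilized clusters first (descending utilization).
        clusters.sort(key=lambda c: (len(c["nodes"]), c["synapses"]), reverse=True)
        for c in clusters:
            # Synapses the cluster would hold if node n joined it.
            new_syn = c["synapses"] + sum(
                1 for (u, v) in edges
                if (u == n and v in c["nodes"]) or (v == n and u in c["nodes"]))
            if len(c["nodes"]) < max_neurons and new_syn <= max_synapses:
                c["nodes"].add(n)
                c["synapses"] = new_syn
                break
        else:  # no existing cluster has room: open a new one
            clusters.append({"nodes": {n}, "synapses": 0})
    return clusters

result = greedy_cluster(["a", "b", "c", "d"],
                        [("a", "b"), ("b", "c"), ("c", "d")],
                        max_neurons=2, max_synapses=4)
```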
4.4. Dataflow Modeling of Clustered Workload
We model a clustered SNN as a Synchronous Dataflow Graph (SDFG) for predictable performance analysis (Lee and Messerschmitt, 1987). SDFGs are commonly used to model streaming applications that are implemented on a multiprocessor system-on-chip (Sriram and Bhattacharyya, 2000). These graphs are used to analyze a system in terms of key performance properties such as throughput, execution time, communication bandwidth, and buffer requirements (Stuijk et al., 2006a). Nodes of an SDFG are called actors; each actor corresponds to a cluster of the clustered SNN graph G_C. Actors compute by reading tokens, i.e., spikes, from their input ports and writing the results of the computation as tokens on their output ports. The number of tokens produced or consumed in one execution of an actor is called the port rate; it represents the number of spikes per unit time at the input and output of different clusters in the SNN. Port rates are visualized as annotations on edges. Actor execution is also called firing, and it requires a fixed amount of time to execute on a crossbar. Edges in the graph are called channels, and they represent dependencies among actors. An actor is said to be ready when it has sufficient input tokens on all its input channels and sufficient buffer space on all its output channels; an actor can only fire when it is ready. A set Ports of ports is assumed, and with each port p ∈ Ports, a finite rate Rate(p) ∈ ℕ is associated. Formally, an actor is defined as follows.
Definition 4 (Actor). An actor a is a tuple (I, O, τ, μ) consisting of a set I (⊆ Ports) of input ports and a set O (⊆ Ports) of output ports with I ∩ O = ∅, where τ is the execution time of a and μ is its state space, i.e., the buffer space needed for communicating spikes on all of its channels.
The source of a channel ch is an output port of some actor, and its destination is an input port of some actor. All ports of all actors are connected to precisely one channel, and all channels are connected to ports of some actors. The source and the destination port of channel ch are denoted by SrcP(ch) and DstP(ch), respectively. Channels connected to the input and output ports of an actor a are denoted by InC(a) and OutC(a), respectively.
Before an actor a starts its firing, it requires Rate(DstP(ch)) tokens from every channel ch ∈ InC(a). When the actor completes execution, it produces Rate(SrcP(ch)) tokens on every channel ch ∈ OutC(a). One important property of an SDFG is throughput, which is defined as the inverse of its long-term period. A period is the average time needed for one iteration of the SDFG. An iteration is defined as the minimum non-zero execution such that the original state of the SDFG is regained. This is the performance parameter used in this paper. The following definitions are introduced to formulate throughput.
Definition 5 (Repetition Vector). The Repetition Vector RptV of an SDFG is defined as the vector specifying the number of times each actor in the SDFG is executed in one iteration.

For the SDFG representation of a clustered SNN, all spikes generated on a channel are consumed by the destination actor. This means that all actors are fired exactly once during one iteration of the application, so RptV = [1, 1, ..., 1].
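The firing rule, i.e., an actor is ready only when every input channel holds enough tokens and every output channel has enough free buffer space, can be sketched as follows. The dictionary-based channel encoding is our own illustrative choice.

```python
def is_ready(actor, tokens, free_space):
    """An actor can fire only if every input channel holds at least its
    consumption rate in tokens, and every output channel has free buffer
    space for its production rate."""
    return (all(tokens[ch] >= rate for ch, rate in actor["consumes"].items())
            and all(free_space[ch] >= rate for ch, rate in actor["produces"].items()))

def fire(actor, tokens, free_space):
    """Consume input tokens and produce output tokens for one firing."""
    assert is_ready(actor, tokens, free_space)
    for ch, rate in actor["consumes"].items():
        tokens[ch] -= rate
    for ch, rate in actor["produces"].items():
        tokens[ch] = tokens.get(ch, 0) + rate
        free_space[ch] -= rate

tokens = {"c1": 2, "c2": 0}
free = {"c2": 4}
a = {"consumes": {"c1": 2}, "produces": {"c2": 1}}
fire(a, tokens, free)
```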
4.5. Cyclic Dependency and Deadlock Avoidance
The clustering approach may lead to cyclic dependencies among actors. Figure 9(a) illustrates a simple feedforward network of 3 neurons (A, B, and C). Figure 9(b) illustrates a scenario where neurons A and C are placed in cluster 1 (actor 1) and neuron B in cluster 2 (actor 2) during partitioning. Due to the connectivity of the neurons in Figure 9(a), there is a cyclic dependency between the two actors: actor_1 -> actor_2 -> actor_1. SDF graphs allow representing such cyclic dependencies among actors, justifying our choice of using them for modeling clustered SNNs.
However, the presence of cycles complicates the scheduling problem because cyclic dependencies can lead to deadlocks. To address this, a cyclic SDF graph is decomposed into hierarchies of acyclic subgraphs. To describe this, we introduce the following definition.
Definition 6 (Strongly Connected Subgraph). A subgraph G_s of a directed (cyclic or acyclic) graph G is called a strongly-connected subgraph iff for every pair of vertices u and v of G_s, there is a path from u to v and a path from v to u.
Figure 10 shows the flowchart for cycle breaking, also known as sub-independence partitioning, which is the process of decomposing strongly connected SDF graphs into hierarchies of acyclic graphs. This is roughly based on the Loose Interdependence Algorithms Framework (LIAF) (Bhattacharyya et al., 1996). A cyclic SDF graph is first decomposed into a series of strongly connected subgraphs G_1, G_2, ..., G_n. For each strongly connected subgraph G_i, the LIAF algorithm tries to break cycles by properly removing edges that have sufficient delays. An edge e can be removed from G_i if it has enough initial tokens to satisfy the consumption requirements of its sink actor for a complete iteration of G_i, and scheduling G_i without e does not lead to deadlock. The edge e is called an inter-iteration edge. The inter-iteration edge removal is performed iteratively until the new subgraph with the inter-iteration edges removed is no longer a strongly connected subgraph (i.e., it becomes a loosely connected subgraph). The subgraph is pushed into a ready list for scheduling purposes. The algorithm is repeated for all the strongly-connected subgraphs. At the end, all deadlock-free subgraphs are scheduled.
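The cycle-breaking flow above first needs the strongly connected subgraphs. A compact, illustrative way to find them is pairwise reachability, as sketched below; a real implementation would use Tarjan's or Kosaraju's linear-time algorithm. The adjacency-dict graph encoding is our own choice.

```python
def reach(graph, start):
    """Set of vertices reachable from `start` (including `start` itself)."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(graph.get(u, []))
    return seen

def sccs(graph):
    """Group vertices u, v into one SCC iff each is reachable from the other.
    O(V^2) reachability checks; fine for illustration only."""
    comp, out = {}, []
    for u in graph:
        if u in comp:
            continue
        scc = {v for v in reach(graph, u) if u in reach(graph, v)}
        for v in scc:
            comp[v] = len(out)
        out.append(scc)
    return out

components = sccs({1: [2], 2: [1, 3], 3: []})  # actor_1 <-> actor_2 cycle, 3 acyclic
```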
4.6. Performance Estimation
We present an approach to compute the application period of an SDFG by analyzing its maximum cycle mean (MCM), assuming infinite hardware resources. For this, we use Max-Plus Algebra (Heidergott et al., 2014; Zhang and Liu, 2013; Cong and Zhang, 2006). The Max-Plus semiring R_max is the set R ∪ {−∞}, defined with two basic operations ⊕ and ⊗, which are related to linear algebra as

a ⊕ b = max(a, b),    a ⊗ b = a + b    (3)

The identity element for the addition ⊕ is −∞ in linear algebra, i.e., a ⊕ −∞ = a. The identity element for the multiplication ⊗ is 0 in linear algebra, i.e., a ⊗ 0 = a.
To use Max-Plus Algebra to analyze an SDFG, it is customary to express the time at which an actor fires in terms of preceding firings in linear algebra, and then use standard analysis techniques for Max-Plus Algebra to estimate timing performance. We use the running example of the SDFG in Figure 11(a), which is obtained by clustering EdgeDet (Chou et al., 2018), an application used to evaluate DFSynthesizer (see Section 7). The clustering is performed considering 1024x1024 crossbars.^{7}

^{7} We evaluate DFSynthesizer primarily for the DYNAP-SE neuromorphic hardware (Moradi et al., 2017). Here we configure larger crossbars to generate fewer clusters from EdgeDet for illustration purposes.

The firing end times of all 9 actors in the k-th iteration (in linear algebra) are of the form

t_i(k) = max_{j ∈ pred(i)} ( t_j(k), t_i(k−1) ) + τ_i,    i = 1, ..., 9    (4)

where pred(i) is the set of actors feeding actor a_i and τ_i is the execution time of a_i. Observe that the firing end time of an actor in the k-th iteration depends on its firing end time in the (k−1)-th iteration. Furthermore, the production and consumption rates are the same for every channel in the SDFG. Using the previously introduced Max-Plus semantics, the firing end times for every actor in the SDFG can be expressed as

t(k) = T ⊗ t(k−1)    (5)

where T is a matrix in R_max^{9x9} that captures the actor execution times and t(k) = [t_1(k), ..., t_9(k)]. The following definitions are introduced to estimate latency.
Definition 7 (Digraph). The digraph Γ(T) of an n x n matrix T with entries t_{ij} defined in R_max is the tuple (V, E), where V = {1, ..., n} is the set of vertices and E = {(i, j) | t_{ij} ≠ −∞} is the set of connected ordered arcs between vertices.

To give an example, the matrix T of Equation 5 corresponds to the digraph shown in Figure 12.
Definition 8 (Walk). A walk in the digraph Γ(T) is a sequence of arcs (i_1, i_2)(i_2, i_3) ... (i_{k−1}, i_k); the head of an arc in the sequence is either the start vertex of the walk or the tail vertex of a preceding arc, and the tail vertex of an arc in the sequence is either the end vertex of the walk or the head vertex of a succeeding arc. The weight of the walk is given by

w = t_{i_1 i_2} ⊗ t_{i_2 i_3} ⊗ ... ⊗ t_{i_{k−1} i_k} = Σ_{j=1}^{k−1} t_{i_j i_{j+1}}    (6)
Definition 9 (Cycle). A cycle in the digraph Γ(T) is a walk (i_1, i_2)(i_2, i_3) ... (i_{k−1}, i_k) such that i_1 = i_k.
Definition 10 (Maximum Cycle Mean). The maximum cycle mean λ_max is the maximum of the weight-to-length ratio of all cycles c in Γ(T), i.e.,

λ_max = max_{cycles c in Γ(T)} w(c) / |c|    (7)

where w(c) is the weight of cycle c and |c| its number of arcs.
In this paper, the performance of an SNN is defined in terms of the throughput of the equivalent SDFG, measured as the inverse of its maximum cycle mean (Equation 7), i.e.,

Performance_max = 1 / λ_max    (8)
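The maximum cycle mean of Equation 7 can be computed without enumerating all cycles; a standard method is Karp's algorithm, sketched below for a dense Max-Plus matrix with float('-inf') marking absent arcs. This is an illustrative implementation, not DFSynthesizer's analysis engine; the throughput is then 1 / max_cycle_mean(T).

```python
def max_cycle_mean(T):
    """Karp's algorithm for the maximum cycle mean of the digraph of a
    Max-Plus matrix T. D[k][v] holds the maximum weight of any k-arc walk
    ending at vertex v (starting anywhere, hence D[0][v] = 0)."""
    n = len(T)
    NEG = float("-inf")
    D = [[0.0] * n] + [[NEG] * n for _ in range(n)]
    for k in range(1, n + 1):
        for v in range(n):
            D[k][v] = max(D[k - 1][u] + T[u][v] for u in range(n))
    best = NEG
    for v in range(n):
        if D[n][v] == NEG:
            continue  # no n-arc walk ends at v
        best = max(best, min((D[n][v] - D[k][v]) / (n - k)
                             for k in range(n) if D[k][v] > NEG))
    return best
```

For a two-vertex cycle with arc weights 2 and 3, the only cycle has mean (2 + 3) / 2 = 2.5, which the algorithm recovers.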
In Equation 8, the performance is computed using the worst-case execution time of an actor on a crossbar. This is obtained from the propagation delay of current through the synaptic elements of the crossbar. As shown in many recent works (Titirsha and Das, 2020a, b; Titirsha et al., 2021b), the current propagation delay within a crossbar depends on the specific synaptic elements that are activated in the crossbar. This is due to the difference in the amount of parasitic components on the bitlines and wordlines of a crossbar along the different current paths. For performance guarantee purposes, we assume the worst-case propagation delay in the crossbar and use it to represent the execution time of actors on the crossbars of a neuromorphic hardware.
The performance metric defined in Equation 8 provides the maximum throughput, considering only the worst-case execution time of actors. However, a neuromorphic hardware introduces constraints such as limited buffer space on the crossbars and non-zero latency on the interconnect, which can lower the throughput significantly. Therefore,

Performance_actual ≤ Performance_max    (9)

In this work, we show that performance is impacted by hardware constraints such as the limited number of crossbars, the limited buffer space per crossbar, and the limited communication bandwidth of the interconnect. We seek to find a lower bound Performance_min on performance such that

Performance_min ≤ Performance_actual ≤ Performance_max    (10)

By making Performance_min close to Performance_actual, we provide a tighter bound on performance.
5. Resource Allocation and Hardware Mapping
The performance obtained using Equation 7 defines the maximum throughput, obtained when the clustered SNN is mapped to a hardware with infinite resources, i.e., a hardware with as many crossbars as there are actors (clusters) in the clustered SNN graph. Additionally, each crossbar is assumed to have sufficient buffer space to send and receive spikes over the shared interconnect. However, state-of-the-art neuromorphic hardware platforms present the following three critical limitations. First, the number of crossbars in a neuromorphic hardware is limited. Therefore, the available crossbars need to be time-multiplexed amongst the clusters of an SNN. Second, the input and output buffer space on each crossbar is limited. Therefore, no more than one cluster can be executed on a crossbar concurrently. Third, the communication bandwidth of each tile is limited. Therefore, only a few spikes can be sent or received over the interconnect at once. Formally, a neuromorphic hardware is defined as follows.
Definition 11.
(Neuromorphic Hardware Graph) A neuromorphic hardware graph $\mathcal{H} = (T, I)$ is a directed graph consisting of a finite set $T$ of tiles and a finite set $I$ of interconnect links.
Each tile consists of a crossbar to map neurons and synapses, and input and output buffers to receive and send tokens (spikes) over the interconnect, respectively. A tile is a tuple $(n, \beta_{in}, \beta_{out})$, where $n$ is the dimension of the crossbar on the tile, i.e., the tile can accommodate $n$ pre-synaptic neurons, $n$ post-synaptic neurons, and $n^2$ synaptic connections, $\beta_{in}$ is the input buffer size on the tile, and $\beta_{out}$ is its output buffer size. Each interconnect link is bidirectional, representing two-way communication between the source and destination tiles with a fixed bandwidth.
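As an illustration, the tile tuple above can be captured in a small Python structure. The field names and the values below are illustrative assumptions, not hardware specifications.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tile:
    """A tile of the neuromorphic hardware graph (Definition 11): an
    n x n crossbar plus bounded input and output spike buffers."""
    n: int        # crossbar dimension: n pre- and n post-synaptic neurons
    in_buf: int   # input buffer size, in tokens (spikes)
    out_buf: int  # output buffer size, in tokens (spikes)

    @property
    def synapses(self):
        # an n x n crossbar provides n * n synaptic connections
        return self.n * self.n

# an illustrative 128 x 128 tile with 4096-token buffers
t = Tile(n=128, in_buf=4096, out_buf=4096)
```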
The mapping is specified by a binary matrix $\mathcal{M} = \{x_{i,j}\}$, where $x_{i,j}$ is defined as
(11) $x_{i,j} = \begin{cases}1 & \text{if cluster } i \text{ is mapped to tile } j\\ 0 & \text{otherwise}\end{cases}$
The mapping constraint is that a cluster can be mapped to only one tile, i.e.,
(12) $\sum_{j} x_{i,j} = 1 \quad \forall\, i$
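The mapping constraint of Equation 12 can be checked with a few lines of Python, assuming a plain list-of-lists binary matrix (names are illustrative):

```python
def is_valid_mapping(M):
    """Equation 12: every cluster (row of the binary mapping matrix)
    must be mapped to exactly one tile, i.e., each row sums to 1."""
    return all(
        all(x in (0, 1) for x in row) and sum(row) == 1
        for row in M
    )

# 3 clusters on 2 tiles: clusters 0 and 2 share tile 0, cluster 1 uses tile 1
M = [[1, 0],
     [0, 1],
     [1, 0]]
```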
The throughput of the clustered SNN graph on the neuromorphic hardware for mapping $\mathcal{M}$ is computed as
(13) $\text{Throughput}(\mathcal{M}) = \mathsf{DFSynthesizer}(\mathcal{M})$
where DFSynthesizer is the extended Max-Plus formulation of Equation 7, incorporating platform constraints. The following three steps describe DFSynthesizer. Without loss of generality, we use Equation 14 as a running mapping example, in which the 9 actors of Figure 11 are mapped to 4 tiles.
(14)  
The mapping corresponding to Equation 14 thus assigns each of the 9 actors to one of the 4 tiles.
5.1. Step 1: Modeling Limited Buffer Sizes of Crossbars
Limited input and output buffer sizes of a tile are modeled as back-edges, with initial tokens indicating the buffer size available on the tile. This is illustrated in Figure 11(b) with a back-edge between two actors, both of which are mapped to tile 0. When an actor generates spikes on a channel, the available buffer size reduces; when the receiving actor consumes the spikes, the buffer is released. In the example, before the producing actor can be executed, it has to check if enough buffer space is available. This is modeled by requiring tokens from the back-edge to be consumed. Since the actor produces 5068 spikes per firing, 5068 tokens from the back-edge are consumed, indicating reservation of the buffer space. On the consumption side, when the receiving actor is executed, it frees 5068 buffer spaces, indicated by a release of these tokens on the back-edge. We assume atomic execution of actors on a crossbar, i.e., a crossbar reads input tokens and produces output tokens in the output buffer for no more than one actor at any given instant of time. To prevent other actors mapped to the same tile from firing simultaneously, the output buffer space is claimed at the start of execution and released only at the end of firing.
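The token mechanics of such a back-edge can be sketched as follows. The buffer size of 6000 tokens is an illustrative assumption; the 5068 spikes per firing follow the running example.

```python
class BackEdge:
    """Models a tile's limited buffer as a back-edge carrying initial
    tokens (Section 5.1): firing claims tokens equal to the spikes
    produced; the consuming actor releases them after it fires."""
    def __init__(self, buffer_size):
        self.tokens = buffer_size      # initial tokens = free buffer space

    def claim(self, n):
        """Reserve n buffer slots; on failure the actor cannot fire."""
        if self.tokens < n:
            return False
        self.tokens -= n
        return True

    def release(self, n):
        """Consumer has fired: free the reserved buffer slots."""
        self.tokens += n

# Tile 0 with room for 6000 spikes; the producer emits 5068 per firing
edge = BackEdge(6000)
first = edge.claim(5068)    # succeeds: 5068 slots reserved
second = edge.claim(5068)   # fails: only 932 slots remain
edge.release(5068)          # consumer fires and frees the buffer
third = edge.claim(5068)    # succeeds again
```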
5.2. Step 2: Actor Ordering on Crossbars
The number of crossbars in a neuromorphic hardware is limited. Therefore, they may have to be shared between actors of an SNN. However, on a tile, only one instance of an actor can be executing at any moment in time. We use time-division multiple-access (TDMA) to allocate time slices to actors mapped to the same tile. During its allocated time slice, an actor is executed on the crossbar of the tile and generates spikes, which are stored in the output buffer for communication on the interconnect. Next, we generate the order in which the actors bound to a tile are fired to provide a performance guarantee, i.e., throughput. For this, we apply our Max-Plus Algebra formulation (Eq. 7) on the SDFG of Fig. 11(b). This is our static-order schedule, and is constructed at design time.
5.3. Step 3: Actor Execution on Crossbars
Once the static-order schedule is constructed for all tiles of the hardware, we use a self-timed execution strategy (Moreira and Bekooij, 2007) to execute these actors at run time. Here, the exact firing times of actors are discarded, retaining only the assignment and ordering of actors on each tile as obtained from the design-time analysis (step 2). At run time, ready actors are inserted into a list and fired in the same order previously determined at design time.
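The run-time policy described above can be sketched as a small ready-list scheduler. The actor names and the dependency encoding are illustrative, not part of the framework's API.

```python
def self_timed_fire(static_order, dependencies):
    """Self-timed execution sketch (Step 3): design-time firing times
    are discarded; ready actors fire in the design-time static order.
    `dependencies[a]` lists actors that must fire before actor `a`."""
    rank = {a: i for i, a in enumerate(static_order)}
    fired, remaining = [], set(static_order)
    while remaining:
        # actors whose input tokens (dependencies) are available
        ready = [a for a in remaining
                 if all(d in fired for d in dependencies.get(a, []))]
        nxt = min(ready, key=rank.__getitem__)  # respect the static order
        fired.append(nxt)
        remaining.remove(nxt)
    return fired

# 'b' precedes 'c' in the static order, but must wait for 'c' to fire
order = self_timed_fire(['a', 'b', 'c'], {'b': ['c'], 'c': ['a']})
```

Only the ordering among *ready* actors is taken from design time; the timing emerges at run time, which is the essence of self-timed execution.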
5.4. Mapping Exploration
Sections 5.1 through 5.3 extend the Max-Plus formulation to incorporate platform constraints. Using these constraints and the new formulation, one can estimate the throughput of a clustered SNN on a neuromorphic hardware for a specific actor-to-tile mapping. In the following, we explain the mapping scenario where the number of tiles in the hardware is less than the number of actors in the clustered SNN. Therefore, each tile needs to be time-multiplexed between multiple actors.
Figure 13 conceptually illustrates the mapping exploration using DFSynthesizer compared to state-of-the-art solutions, and the selection of the lower bound on throughput. ❶ represents the throughput obtained using SpiNeMap (Balaji et al., 2020b), which optimizes energy consumption for a hardware platform where the number of tiles is higher than the number of actors. When SpiNeMap is applied to the case where the tiles need to be time-multiplexed, it randomly distributes the actors to the tiles and schedules them arbitrarily, without considering throughput. Therefore, the throughput represented by ❶ (SpiNeMap) is significantly lower than the maximum throughput (i.e., the upper bound) represented by ❻. The throughput variation is therefore the difference between ❻ and ❶.
In Figure 13, ❷ represents the throughput obtained using a solution such as PyCARL (Balaji et al., 2020a), which balances the load on each tile for a scenario where actors need to be time-multiplexed on the tiles. However, the actors mapped to a tile are scheduled in an arbitrary order, without considering throughput. By balancing the tile load, PyCARL reduces the number of clusters mapped per tile, which improves throughput. Therefore, the throughput represented by ❷ is higher than ❶, but lower than the maximum throughput ❻. The throughput variation is therefore the difference between ❻ and ❷.
In Figure 13, ❸ represents the throughput obtained using our previous work SDFSNN (Song et al., 2020a), which first balances the load of each tile by distributing the actors evenly, and then uses a dataflow approach to schedule the actors on each tile, improving throughput. The throughput represented by ❸ is therefore higher than both ❶ and ❷, but lower than the maximum throughput ❻. The throughput variation is therefore the difference between ❻ and ❸.
In Figure 13, ❹ represents the throughput obtained using a mapping exploration framework, which explores a combination of actor-to-tile mapping and dataflow-based scheduling of actors on each tile to maximize the throughput. This throughput is higher than ❶-❸, and is closer to the maximum throughput ❻. Finally, ❺ represents the throughput obtained using an actor-to-tile mapping that jointly optimizes energy and throughput, and uses dataflow-based scheduling of actors on each tile to further improve the throughput. Since this solution takes energy into consideration in the mapping step, the throughput can be somewhat lower than ❹, as illustrated in the figure. In Section 8, we evaluate all these approaches and show that ❺ is still higher than ❶-❸.
To conclude, the design-space exploration of DFSynthesizer can generate mappings representing two minimum-throughput solutions: ❹ and ❺. Although the maximum throughput remains the same for DFSynthesizer and other state-of-the-art approaches, the minimum throughput of DFSynthesizer (i.e., ❺) is higher than the minimum throughput obtained using all state-of-the-art mapping solutions (i.e., ❶-❸). Therefore, the difference between maximum and minimum throughput is the smallest for DFSynthesizer, meaning that DFSynthesizer provides a stricter performance guarantee, which is critical for real-time systems. We now describe DFSynthesizer.
We integrate the extended Max-Plus formulation inside a design-space exploration framework to obtain cluster mappings that are Pareto-optimal in terms of hardware metrics such as throughput, latency, energy, and reliability. In the following, we describe our mapping explorations considering energy and throughput. Such formulations can be trivially extended to consider other metrics.
The energy consumption of the mapping is measured considering the number of spikes generated inside each tile and the number of spikes routed on the interconnect (Titirsha et al., 2021a). The energy parameters are reported in Table 3. Using these parameters, the energy consumption is
(15) $E = E_{\text{spk}} + E_{\text{comm}} = e_{\text{spk}} \sum_{i \in T} n_i + e_{\text{comm}} \sum_{(i,j) \in I} s_{i,j}$
where $E_{\text{spk}}$ is the energy consumed in generating the spikes and propagating the spike current via the synapses, $E_{\text{comm}}$ is the energy consumed in communicating spikes via the shared interconnect, $n_i$ is the number of spikes generated inside tile $i$, $s_{i,j}$ is the number of spikes communicated on the link between tiles $i$ and $j$, and $e_{\text{spk}}$ and $e_{\text{comm}}$ are the per-spike generation and routing energies of Table 3.
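The energy model of Equation 15 can be sketched directly from the per-spike energies of Table 3; the spike counts below are hypothetical.

```python
# Per-spike energy parameters from Table 3
E_SPIKE = 50e-12    # 50 pJ to generate a spike inside a tile
E_ROUTE = 147e-12   # 147 pJ to route a spike on the interconnect

def workload_energy(spikes_per_tile, spikes_per_link):
    """Equation 15: spike-generation energy on the tiles plus
    spike-communication energy on the shared interconnect."""
    e_spk = sum(n * E_SPIKE for n in spikes_per_tile.values())
    e_comm = sum(s * E_ROUTE for s in spikes_per_link.values())
    return e_spk + e_comm

# hypothetical workload: spikes generated per tile, spikes routed per link
energy = workload_energy({0: 5068, 1: 1200}, {(0, 1): 800})  # joules
```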
Our objective is to maximize the throughput of a given machine-learning model on hardware (Eq. 7) and minimize the hardware energy consumption (Eq. 15). We formulate a joint metric $\mathcal{J}$ combining the two, and minimize it during our mapping explorations. To this end, we propose an iterative approach, which explores different mapping alternatives satisfying the cluster mapping constraint (Eq. 12). For each mapping alternative, we evaluate throughput and energy consumption. Finally, Pareto-optimal mappings are retained and returned.
Algorithm 3 provides the pseudo-code of our proposed mapping exploration. We start by randomly distributing clusters to the tiles (line 3). We evaluate the throughput and energy consumption of this mapping and compute the joint metric $\mathcal{J}$ (lines 4-5). For each cluster, we do the following. We move the cluster from its current tile to every other tile and recalculate $\mathcal{J}$ (lines 6-10). If $\mathcal{J}$ reduces, the new mapping is retained (lines 11-13), and the algorithm proceeds to analyze the next cluster. In this way, a local minimum is reached, starting from the initial random allocation of clusters. We re-execute the algorithm $N$ times, starting with a different random allocation of the clusters each time. In this way, many mappings are explored. Finally, mappings that are Pareto-optimal in terms of throughput and energy consumption are retained.
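The structure of Algorithm 3 can be sketched as follows. For brevity, the sketch collapses throughput and energy into a single scalar `cost` callable (standing in for the joint metric) and returns the single best mapping rather than a Pareto front; all names and signatures are illustrative.

```python
import random

def explore_mappings(num_clusters, num_tiles, cost, runs=10, seed=0):
    """Repeated local search over cluster-to-tile mappings.
    `mapping[i]` is the tile of cluster i, so the constraint of
    Equation 12 holds by construction; `cost` is minimized."""
    rng = random.Random(seed)
    best_map, best_cost = None, float('inf')
    for _ in range(runs):                        # N random restarts
        mapping = [rng.randrange(num_tiles) for _ in range(num_clusters)]
        current = cost(mapping)
        for c in range(num_clusters):            # greedy per-cluster moves
            for t in range(num_tiles):
                candidate = mapping.copy()
                candidate[c] = t
                if cost(candidate) < current:    # keep improving moves
                    mapping, current = candidate, cost(candidate)
        if current < best_cost:
            best_map, best_cost = mapping, current
    return best_map, best_cost

# toy cost: imbalance between two tiles; a balanced mapping is optimal
imbalance = lambda m: abs(m.count(0) - m.count(1))
best, c = explore_mappings(4, 2, imbalance, runs=3, seed=1)
```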
The complexity of this algorithm is as follows. The unit function GetTileofCluster is essentially an argmax over a row of the mapping matrix, with a complexity of $\mathcal{O}(|T|)$. The unit function MoveClusterToTile is an update of the matrix $\mathcal{M}$ and can be performed in $\mathcal{O}(1)$. Therefore, the algorithm performs $\mathcal{O}(N \cdot |C| \cdot |T|)$ evaluations of the joint metric, where $C$ is the set of clusters. Here, $N$ is a user-defined parameter that controls the compilation time, with a trade-off on the solution quality, i.e., the execution time and energy consumption of the application on hardware.
6. Scheduling and Performance Guarantee
Self-timed execution is widely used to schedule SDFGs (Ghamarian et al., 2006). Static schedules are constructed using worst-case actor execution times determined at design time. The actor ordering on each tile is retained, while the timing information is discarded. At run time, actors are fired while maintaining the same order as determined at design time. In this regard, the following lemmas are stated (Ghamarian et al., 2006; Das et al., 2012, 2014a).
Lemma 1.
For a consistent and strongly connected SDFG, the self-timed execution consists of a transient phase followed by a periodic phase.
Lemma 2.
For a consistent and strongly connected SDFG, the throughput of an actor is given by the average number of firings of the actor per unit time in the periodic phase of the self-timed execution.
A modern neuromorphic hardware is expected to execute many SNN applications simultaneously. When a new application is to be admitted to a hardware that is currently running other applications, the incoming application needs to be compiled and mapped to the hardware within a short time window, based on the resources currently available on the hardware. Furthermore, when an existing application finishes execution, its hardware resources are freed, meaning that they can be allocated to other running applications to improve their performance. For such dynamic scenarios, SDFG schedules must be constructed for every allocation scenario. If the run-time schedule differs from the one used for analysis at design time, the obtained throughput will differ significantly from what is guaranteed at design time. There are therefore two approaches to generating run-time schedules.

Store the actor mapping and scheduling for all resource-allocation scenarios and for all applications from design time (storage-based solution).

Construct the schedule at run time based on the mappings stored from design time (construction-based solution).
The former is associated with a high storage overhead and the latter with a longer execution time. Both storage and schedule-construction time are crucial for machine-learning systems deployed in resource- and power-constrained environments. Therefore, we propose a modification of self-timed execution scheduling as follows. First, we construct the static-order schedule for all actors of an SNN on a single tile at design time. This is achieved using the Max-Plus Algebra formulation of Equation 7. Next, we discard the exact timing information, retaining only the actor firing orders for run-time use. At run time, we first construct the cluster mapping to tiles (Section 5.4), considering the available tiles. Next, we use the single-tile static-order schedule to derive the actor schedules on each tile, without having to construct them from scratch.
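Deriving the per-tile static orders from the single-tile schedule is then a simple projection; actor and tile names below are illustrative.

```python
def per_tile_schedules(single_tile_order, mapping):
    """Project the design-time single-tile static order onto a run-time
    cluster-to-tile mapping: each tile fires its actors in the same
    relative order they have in the single-tile schedule."""
    schedules = {}
    for actor in single_tile_order:     # preserves the global order
        schedules.setdefault(mapping[actor], []).append(actor)
    return schedules

# 9 actors scheduled on one tile at design time, split over 3 tiles at run time
order = list('abcdefghi')
mapping = {a: i % 3 for i, a in enumerate(order)}
tiles = per_tile_schedules(order, mapping)
# tile 0 fires a, d, g; tile 1 fires b, e, h; tile 2 fires c, f, i
```

Because the projection is linear in the number of actors, the run-time schedule-construction cost is negligible compared to building a multi-tile schedule from scratch.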
Figure 15 illustrates the construction of per-tile schedules for an SNN application with 9 actors, with two different mappings of actors to tiles derived from the same single-tile static-order schedule. We illustrate two scenarios in this example. In the first scenario (left), the application uses two tiles of the hardware. In the second scenario (right), the application uses three tiles. In both scenarios, the actor order on each tile is the same as on the single tile. Since the tile schedules are not constructed from scratch, the schedule-construction time is much lower.
However, the performance obtained using this single-tile schedule can be lower than the maximum performance of a multi-tile schedule constructed independently. As long as this performance deviation is bounded, the actor schedule for any tile can be easily derived from the binding of actors to that tile and a given single-tile static-order schedule. See Section 8 for the performance evaluation.
7. Evaluation Methodology
We conduct all simulations on a Lambda workstation with an AMD Threadripper 3960X (24 cores), 128 MB cache, 128 GB RAM, and two RTX 3090 GPUs. Keras (Gulli and Pal, 2017) and CARLsim (Chou et al., 2018) use the two GPUs to accelerate model training and SNN function simulation, respectively.
Figure 16 illustrates our evaluation setup using the cycle-accurate NeuroXplorer (Balaji et al., 2021) framework. This framework has been validated extensively against the DYNAPSE neuromorphic hardware (Balaji et al., 2018, 2020b; Das et al., 2018c, a; Balaji et al., 2020a), and can model the architecture of other neuromorphic hardware platforms such as Loihi (Davies et al., 2018) and TrueNorth (DeBole et al., 2019). NeuroXplorer can simulate multi-compartment neuron models as well as 9-parameter Izhikevich and leaky integrate-and-fire (LIF) spiking neuron models. Additionally, NeuroXplorer can model Non-Volatile Memory (NVM) synapses such as Phase-Change Memory (PCM) and Oxide-based Resistive Random Access Memory (OxRRAM). NeuroXplorer also models the spike delay on the shared interconnect as well as the delay in propagating spikes through the synapses of a crossbar (Balaji et al., 2021). The mapping and scheduling results obtained using DFSynthesizer are used in NeuroXplorer to estimate energy, accuracy, and throughput.
7.1. Evaluated Applications
We evaluate 10 machine-learning programs that are representative of the three most commonly used neural network classes: convolutional neural network (CNN), multilayer perceptron (MLP), and recurrent neural network (RNN). These applications are 1) LeNet-based handwritten digit recognition using images of handwritten digits from the MNIST dataset; 2) AlexNet for ImageNet classification; 3) VGG16, also for ImageNet classification; 4) ECG-based heartbeat classification (HeartClass) (Balaji et al., 2018; Das et al., 2018b) using electrocardiogram (ECG) data; 5) image smoothing (ImgSmooth) (Chou et al., 2018); 6) edge detection (EdgeDet) (Chou et al., 2018) using difference-of-Gaussian; 7) multilayer perceptron (MLP)-based handwritten digit recognition (DigitRecogMLP) (Diehl and Cook, 2015) using the MNIST database; 8) heart-rate estimation (HeartEstm) (Das et al., 2018a) using ECG data; 9) RNN-based predictive visual pursuit (VisualPursuit) (Kashyap et al., 2018); and 10) recurrent digit recognition (DigitRecogSTDP) (Diehl and Cook, 2015). To demonstrate the potential of DFSynthesizer, we consider a real-time neuromorphic system, where these machine-learning programs are executed continuously in a streaming fashion. Therefore, by optimizing throughput, DFSynthesizer improves real-time performance.
Table 2 summarizes the topology, the number of neurons and synapses of these applications, and their baseline accuracy on the DYNAPSE neuromorphic hardware using the SpiNeMap (Balaji et al., 2020b) mapping framework. As reported in many recent works (Das et al., 2018c; Balaji et al., 2020b, a), spike latency on the shared interconnect of a neuromorphic hardware can lead to inter-spike interval (ISI) distortion and spike disorder. Since the performance of an SNN is a function of the ISI, such non-idealities can lead to accuracy loss. Therefore, the accuracy of the three CNN architectures (LeNet, AlexNet, and VGG16) in Table 2 is somewhat lower than that reported via functional simulation in Table 1.
Class  Applications  Dataset  Synapses  Neurons  Topology  Top1 Accuracy (%) 
CNN  LeNet  MNIST  282,936  20,602  CNN  85.1% 
AlexNet  ImageNet  38,730,222  230,443  CNN  69.8%  
VGG16  ImageNet  99,080,704  554,059  CNN  90.7 %  
HeartClass (Balaji et al., 2018)  Physionet  1,049,249  153,730  CNN  63.7%  
MLP  ImgSmooth (Chou et al., 2018)  CARLsim  9,025  4,096  FeedForward (4096, 1024)  100% 
EdgeDet (Chou et al., 2018)  CARLsim  114,057  6,120  FeedForward (4096, 1024, 1024, 1024)  100%  
DigitRecogMLP  MNIST  79,400  884  FeedForward (784, 100, 10)  91.6%  
RNN  HeartEstm (Das et al., 2018a)  Physionet  66,406  166  Recurrent Reservoir  100% 
VisualPursuit (Kashyap et al., 2018)  (Kashyap et al., 2018)  163,880  205  Recurrent Reservoir  47.3%  
DigitRecogSTDP (Diehl and Cook, 2015)  MNIST  11,442  567  Recurrent Reservoir  83.6% 
7.2. Hardware Parameters
We model the DYNAPSE neuromorphic hardware (Moradi et al., 2017) with 1024 tiles organized in a mesh. Each tile has one crossbar. To test the scalability of DFSynthesizer, we also evaluate other, larger crossbar configurations. Table 3 reports the relevant hardware parameters.
Neuron technology  28nm FDSOI 

Synapse technology  HfO$_2$-based OxRAM 
Supply voltage  1.0V 
Energy per spike  50pJ at 30Hz spike frequency 
Energy per routing  147pJ 
Switch bandwidth  1.8 G-events/s 
The additional overhead of time-multiplexing the tiles among multiple clusters is incorporated when computing the throughput using NeuroXplorer. Specifically, once the cluster mapping to tiles is generated using DFSynthesizer, the synaptic weights of all clusters mapped to a tile are preloaded into the tile's local memory (see our system architecture in Figure 5). In this way, DFSynthesizer reduces the overhead of transferring synaptic weights at run time from the shared main memory. Additionally, since the loading of clusters (context switching) in crossbars happens concurrently from their respective private memories, the time-multiplexing overhead is minimal.
7.3. Evaluated Metrics
We evaluate the following performance metrics.

Performance. This is the throughput of each application on the hardware.

Resource Utilization. This is the neuron, synapse, buffer, connection, and input and output bandwidth utilization on the hardware for each application.

Energy Consumption. This is the energy consumed on the hardware for each application, i.e., the total energy consumed to generate spikes on each tile and to communicate spikes between tiles via the shared interconnect.

Cluster Connection. This is the average degree of the SDFG as a percentage of the total number of nodes, obtained using the clustering technique for each application.

Spike Communication. This is the total number of spikes communicated on the shared interconnect of the neuromorphic hardware.

Synthesis Time. This is the time to compile and map each application on the hardware.
7.4. Evaluated Approaches
We evaluate the following approaches.

SpiNeMap (Balaji et al., 2020b). This approach first partitions an SNN into clusters of neurons and synapses, incorporating its workload. The objective is to minimize inter-cluster communication. Clusters are then mapped to tiles while minimizing spike communication on the shared interconnect, thereby reducing energy consumption. When mapping SNNs to neuromorphic hardware with fewer tiles than the number of actors, 1) SpiNeMap allocates actors to tiles randomly, and 2) SpiNeMap schedules the actors on each tile arbitrarily. Therefore, SpiNeMap does not consider throughput.

PyCARL (Balaji et al., 2020a). This approach maps neurons and synapses to tiles of a neuromorphic hardware, balancing the number of neurons and synapses on each tile. PyCARL does not incorporate SNN workload, i.e., spikes generated by neurons in the SNN. Therefore, some tiles may end up communicating more spikes than others, i.e., those tiles become the energy bottleneck.

SDFSNN (Song et al., 2020a). This approach uses the load-balancing mapping of PyCARL to allocate actors to tiles. It uses dataflow scheduling to improve the throughput.

DFSynthesizer. The proposed approach first clusters an SNN, considering its workload. The objective is to improve cluster utilization. This is done by first decomposing the SNN into homogeneous neural units with a fan-in of two. The clusters are then mapped to tiles, jointly optimizing throughput and energy consumption. DFSynthesizer uses dataflow-based scheduling of actors on tiles to further improve the throughput.
8. Results and Discussions
8.1. Throughput
Figure 17 reports the throughput on DYNAPSE of the evaluated approaches for each application, normalized to SpiNeMap. For reference, we have reported the maximum throughput in frames per second obtained with unlimited hardware resources for each application. For image-based applications (LeNet, AlexNet, VGGNet, EdgeDet, ImgSmooth, and DigitSTDP), a frame corresponds to an individual image. For the time-series applications (HeartClass, HeartEstm, and VisualPursuit), a frame corresponds to a window of 500 ms. We make the following four key observations.
First, although the number of neurons and synapses of larger applications such as AlexNet and VGG16 is significantly higher than that of LeNet, the throughput of LeNet on a hardware with unlimited resources, i.e., without time-multiplexing of crossbars, is only 1.5x higher than AlexNet and 2x higher than VGG16. (In the context of this work, unlimited resources refer to a neuromorphic hardware that has at least as many crossbars as there are clusters in the machine-learning program.) This is because, with no time-multiplexing of crossbars, computations in a machine-learning program take place concurrently on the crossbars, which is the basic philosophy of the distributed computing enabled by neuromorphic platforms. Therefore, the overhead due to time-multiplexing of crossbars is no longer the throughput bottleneck. Rather, the bottleneck shifts to the spike delay between clusters. Additionally, our framework clusters machine-learning programs to minimize inter-cluster spikes. Therefore, even though AlexNet has a significantly higher number of neurons and synapses than LeNet, its number of inter-cluster spikes is not significantly higher; the throughput of AlexNet is only 33% lower than LeNet. Similarly, VGG16, which has more inter-cluster spikes than AlexNet, has 25% lower throughput than AlexNet.
Second, the throughput obtained using SpiNeMap is the lowest, because SpiNeMap does not guarantee throughput during actor-to-tile mapping and actor scheduling on tiles. The throughput of PyCARL is on average 4% higher than SpiNeMap. This is because PyCARL balances the load on the tiles; the average number of actors mapped to each tile is therefore lower than with SpiNeMap, which results in higher throughput. The throughput of SDFSNN is on average 9.7% higher than PyCARL. This improvement is due to the use of dataflow-based scheduling, which maximizes the throughput. DFSynthesizer improves throughput by an average of 17% compared to SDFSNN. This improvement is because, unlike SDFSNN, which maps actors to tiles by balancing the tile load without considering throughput, DFSynthesizer performs throughput- and energy-aware mapping of actors to tiles and then uses dataflow-based scheduling to further improve the throughput. We have analyzed such throughput differences in Section 5.4.
Third, the throughput using DFSynthesizer is on average only 16% lower than the maximum throughput obtained with unlimited hardware resources. Finally, DigitMLP is a very small application. All the techniques generate the same number of clusters for it, resulting in similar throughput.
8.2. Workload Energy
Figure 18 reports the workload energy on DYNAPSE estimated for the evaluated approaches for each application, normalized to SpiNeMap. For reference, we have reported the workload energy obtained using the maximum-throughput approach, which assumes unlimited hardware resources. We make the following observations.
First, the energy consumption of SpiNeMap is the lowest, because this approach partitions SNNs into clusters to explicitly minimize the number of inter-cluster spikes. Therefore, when the clusters are mapped to hardware, the energy consumption on the shared interconnect is reduced. (The mapping exploration only impacts the communication energy on the shared interconnect; the spike-generation energy remains the same for all approaches.) Second, the energy consumption of PyCARL is on average 15% higher than SpiNeMap. This is because PyCARL balances the tile load without incorporating energy consumption. Therefore, clusters with a high volume of spike communication between them may be placed on different tiles, increasing the communication energy; SpiNeMap places such clusters on the same tile, lowering the communication energy. The energy consumption of SDFSNN is the same as PyCARL, because the cluster-to-tile mapping of these two approaches is the same. SDFSNN gains over PyCARL in terms of throughput due to its dataflow-based cluster scheduling on tiles, as analyzed in Section 8.1. The energy consumption of DFSynthesizer is lower than SDFSNN by an average of 8%. This reduction is due to the cluster-to-tile mapping of DFSynthesizer, which incorporates energy consumption.
8.3. Scheduling
Figure 19 reports the throughput of each application for our proposed approach, normalized to PyCARL. We compare the throughput obtained using DFSynthesizer, where schedules are constructed independently for each tile, against the throughput obtained using our proposed single-tile-based schedule (DFSynthesizer+STS). We make the following three observations.
First, the throughput obtained from a single-tile static-order schedule is on average 15% lower than when schedules are constructed independently, that is, using DFSynthesizer. This verifies our Lemma 2. Second, for some applications such as HeartEstm and HeartClass, the throughput obtained using DFSynthesizer+STS is exactly the same as that obtained using DFSynthesizer. Third, the throughput using DFSynthesizer+STS is still higher than PyCARL by an average of 41%.
8.4. Resource Utilization
Table 4 reports the utilization of hardware resources (tile resources, buffer space, connections, and input and output bandwidth) on the DYNAPSE neuromorphic hardware for each application. The average utilization is 92.5% for the crossbar I/Os on each tile, 9.0% for buffer space, 42.6% for connections, and 15% for input and output tile bandwidth. Since we perform hardware-aware analysis, resource utilization never exceeds 100%.
Application  Utilization (%)  

Tile  Buffer  Connections  Bandwidth  
Input  Output  
LeNet  100  87.8  37.5  20.34  20.34 
AlexNet  100  91.8  46.87  17.09  17.09 
VGG16  100  94.2  15.62  6.51  6.51 
HeartClass  100  79.1  25  9.76  9.76 
DigitMLP  81.25  9.67  46.87  22.78  22.78 
EdgeDet  87.5  11.23  68.75  22.78  22.78 
ImgSmooth  87.5  8.39  37.5  17.08  17.08 
HeartEstm  96.87  9.61  62.5  4.7  4.7 
VisualPursuit  90.12  21.2  25.04  12.11  16.6 
DigitSTDP  89.33  20.13  22.19  11.94  11.7 
These results illustrate that DFSynthesizer can be used to design neuromorphic hardware while considering not only key hardware parameters such as the number of tiles, but also all other resources such as buffer space, connections, and input and output bandwidth.
To give more insight into the utilization within each tile, Figure 20 reports the average synapse utilization on the tiles for the evaluated approaches for each application, normalized to PyCARL. We make the following two key observations.
First, the synapse utilization on tiles using SpiNeMap is the lowest of the three evaluated approaches. This is because SpiNeMap produces the highest number of clusters (Sec. 8.5) and therefore the lowest average number of synapses per cluster. Subsequently, when these clusters are mapped to tiles, the average synapse utilization on the tiles reduces. Second, DFSynthesizer generates fewer clusters than both SpiNeMap and PyCARL due to its dense packing of synapses using Algorithm 2. Therefore, the average number of synapses per cluster is higher, which increases the synapse utilization on tiles when the clusters are mapped. On average, the synapse utilization of DFSynthesizer is 2x higher than PyCARL and 2.2x higher than SpiNeMap.
8.5. Number of Clusters
Figure 21 reports the total number of clusters of the evaluated approaches for each application normalized to PyCARL. We make the following two key observations.
First, the number of clusters of SpiNeMap is the highest of the three evaluated approaches. This is because SpiNeMap minimizes inter-cluster communication during the clustering of an SNN. Therefore, neurons that spike the most are placed within individual clusters along with their fanins. Since SpiNeMap does not consider cluster utilization, it creates more clusters than PyCARL. Second, DFSynthesizer clusters an SNN to maximize the resource utilization on each tile. Therefore, the number of clusters generated by DFSynthesizer is the lowest. Overall, the number of clusters of DFSynthesizer is 41% lower than SpiNeMap and 47% lower than PyCARL. The lower the number of clusters, the smaller the hardware needed to achieve the highest throughput (Sec. 8.1). Therefore, DFSynthesizer reduces the hardware requirement of machine-learning applications.
8.6. Cluster Connections
Figure 22 reports the cluster connections of the evaluated approaches for each application normalized to PyCARL. We make the following two key observations.
First, the number of inter-cluster connections of SpiNeMap is the lowest of the three evaluated approaches. This is because SpiNeMap minimizes inter-cluster communication while clustering an SNN, which indirectly reduces cluster connectivity. Second, DFSynthesizer clusters an SNN to maximize the resource utilization of each tile. Therefore, the number of connections between clusters is higher in DFSynthesizer because of the higher number of postsynaptic neurons mapped to each cluster. Overall, the average number of cluster connections of DFSynthesizer is 3.1x higher than SpiNeMap and 3.9x higher than PyCARL.
8.7. Architecture Exploration
Figure 23 reports the number of clusters generated using DFSynthesizer for neuromorphic hardware with , , and crossbars, normalized to a DYNAP-SE configuration with crossbars. We observe that the number of clusters generated using DFSynthesizer reduces by 60% and 92% when the size of a crossbar increases to and , respectively.
Fewer clusters increase throughput. To illustrate this, Figure 24 reports the throughput using DFSynthesizer for different crossbar sizes, normalized to the throughput on DYNAP-SE with four crossbars. We make the following two observations.
First, throughput increases by 18% and 30% when using and crossbars, respectively. This improvement arises because, with larger crossbars, DFSynthesizer generates fewer clusters (Fig. 23). Therefore, the number of clusters per tile reduces, which relieves the bottleneck of time-multiplexing clusters on tiles and increases throughput. Second, for applications such as DigitMLP, EdgeDet, and HeartEstm, there is no throughput improvement when the crossbar size is increased from to . This is because, for these applications, the crossbar configuration is sufficient to achieve the highest throughput. For all other applications, throughput increases by 11% when going from to crossbars.
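The inverse relation between crossbar dimension and cluster count can be illustrated with a first-order model. The assumption here, purely for illustration, is a fully connected layer tiled into k x k crossbars; the function name and sizes are hypothetical and this is not DFSynthesizer's exact partitioning:

```python
import math

def clusters_for_layer(n_pre, n_post, k):
    """First-order estimate: a fully connected n_pre x n_post layer split
    into k x k crossbar tiles needs ceil(n_pre/k) * ceil(n_post/k) clusters.
    Illustrative model only."""
    return math.ceil(n_pre / k) * math.ceil(n_post / k)

# Doubling the crossbar dimension cuts the cluster count roughly 4x,
# consistent with the large reductions observed in Fig. 23.
for k in (128, 256, 512):
    print(k, clusters_for_layer(1024, 1024, k))  # 64, 16, 4 clusters
```

Once the cluster count drops to at most one cluster per tile, further increases in crossbar size no longer help, which matches the saturation seen for the smaller applications.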
8.8. Synthesis Time
Figure 25 reports the synthesis time on DYNAP-SE of the evaluated approaches, for each application normalized to PyCARL. We make the following three key observations.
First, the synthesis time of SpiNeMap is on average 61.6% higher than PyCARL. The higher synthesis time of SpiNeMap is due to the analysis it performs on the workload to obtain the minimum-energy mapping. Second, the synthesis time of DFSynthesizer is the highest: on average, it is 35x higher than PyCARL and 25x higher than SpiNeMap. This higher synthesis time is due to 1) DFSynthesizer's mapping explorations using Algorithm 3, and 2) DFSynthesizer's SDFG analysis mechanism using the proposed Max-Plus formulation. Third, the synthesis time of DFSynthesizer increases with model complexity: it is higher than PyCARL by 3.1x for LeNet, 25.5x for AlexNet, and 272.3x for VGG16.
8.9. Model Quality
DFSynthesizer does not alter synaptic connections. Therefore, model quality, e.g., accuracy, is not impacted by the analysis technique of DFSynthesizer. The only accuracy impact DFSynthesizer introduces is in converting CNNs; this impact is reported in Table 1. For all other applications, DFSynthesizer's accuracy is the same as the baseline accuracy reported in Table 2.
9. Related Works
Recently, many approaches have been proposed to map machine-learning workloads to neuromorphic hardware. Corelet (Amir et al., 2013) is used to map SNNs to TrueNorth (DeBole et al., 2019). PACMAN (Galluppi et al., 2015) is used to map SNNs to SpiNNaker (Furber et al., 2014). PyNN (Balaji et al., 2020a) is used to map SNNs on Loihi (Davies et al., 2018), BrainScaleS (Schemmel et al., 2012), and Neurogrid (Benjamin et al., 2014) by balancing the load on each tile. PyCARL (Balaji et al., 2020a) is used to map SNNs to DYNAP-SE (Moradi et al., 2017). The primary objective of these approaches is to balance the workload on each tile by distributing neurons and synapses evenly.
Beyond load balancing, recent techniques have also explored other objectives. PSOPART (Das et al., 2018c) is used to map SNNs to neuromorphic hardware, reducing the energy consumption on the shared interconnect. SpiNeMap (Balaji et al., 2020b) performs energy-aware clustering of SNNs and then maps the clusters to tiles, reducing the communication energy. DecomposeSNN (Balaji et al., 2020d) decomposes an SNN to improve cluster utilization. There are also performance-oriented SNN mapping approaches such as (Balaji et al., 2020c; Song et al., 2020a; Balaji et al., 2019b; Balaji and Das, 2020), energy-aware SNN mapping approaches such as (Titirsha et al., 2021a), circuit aging-aware SNN mapping approaches such as (Song et al., 2020c; Song and Das, 2020a; Balaji et al., 2019a; Kundu et al., 2021; Song et al., 2021b), endurance-aware SNN mapping approaches such as (Titirsha and Das, 2020a; Titirsha et al., 2021b; Song et al., 2021d), and thermal-aware SNN mapping approaches such as (Titirsha and Das, 2020b). These approaches are evaluated with emerging SNN-based applications (Moyer et al., 2020; Balaji et al., 2018; Das et al., 2018b; Diehl and Cook, 2015; Das et al., 2018a; Kashyap et al., 2018), which we also use to evaluate DFSynthesizer.
There are also other mapping approaches such as (Ankit et al., 2018; Zhang et al., 2018; Xia and Yang, 2019; Lee et al., 2019; Wijesinghe et al., 2018; Wen et al., 2015; Ramasubramanian et al., 2014). We compare DFSynthesizer against PyCARL and SpiNeMap and find that it performs significantly better.
Similar Concepts in a Related Domain
SDFGs are widely used for predictable mapping of applications to multiprocessor systems. Numerous approaches to throughput analysis of SDFGs have been proposed (Stuijk et al., 2006b, 2007; Damavandpeyma et al., 2012; Zhu et al., 2012; Shafik et al., 2015; Das et al., 2015b). Bonfietti et al. evaluated mappings of SDFGs to multiprocessor systems, maximizing throughput (Bonfietti et al., 2013). Stemmer et al. propose using probabilistic analysis to allocate and schedule SDFGs on multiprocessor systems (Stemmer et al., 2020). Das et al. evaluated fault-tolerant mappings of SDFGs to multiprocessor systems (Das et al., 2013b, 2015a, 2014a; Das and Kumar, 2012; Das et al., 2013a, 2014b, 2012, c, 2016). Recently, SDFG-based analysis has also been proposed for analyzing machine-learning applications (Das and Kumar, 2018; Balaji and Das, 2019; Hong et al., 2017; Chen, Yu-Hsin and Emer, Joel and Sze, Vivienne, 2017; Bacis et al., 2017; Song et al., 2021c). However, none of these approaches address application analysis with limited hardware resources, both at design time and at run time.
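The flavor of throughput analysis used in this line of work can be illustrated on a toy graph. The sketch below simulates self-timed execution of a two-actor HSDF graph in the max-plus semiring and reads off the steady-state iteration period; the actors, execution times, and matrix are assumptions for illustration, not any graph from this paper:

```python
NEG_INF = float("-inf")

def maxplus_matvec(A, x):
    # Max-plus product: (A (x) x)_i = max_j (A[i][j] + x[j])
    return [max(a + v for a, v in zip(row, x)) for row in A]

# Toy HSDF graph: actor a (execution time 3) feeds actor b (execution
# time 2); b feeds a back through an edge carrying one initial token.
# With x(k) = completion times of the k-th firings of [a, b]:
#   a(k+1) = b(k) + 3,   b(k+1) = a(k+1) + 2 = b(k) + 5
A = [[NEG_INF, 3],
     [NEG_INF, 5]]

x = [3, 5]                   # first firings under self-timed execution
for _ in range(10):
    prev, x = x, maxplus_matvec(A, x)
period = x[1] - prev[1]      # steady-state growth rate = iteration period
print(period, 1 / period)    # period 5 -> throughput 0.2 iterations/time unit
```

In the steady state the completion times grow linearly with slope equal to the max-plus eigenvalue of A, and the throughput is its inverse; full analyses additionally search the mapping space under buffer and resource constraints.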
10. Conclusions
We introduce DFSynthesizer for predictable synthesis of SNN-based applications on state-of-the-art neuromorphic hardware. Prior works have only addressed design-time mapping, assuming unlimited resources in the underlying hardware. These approaches present significant limitations when used to compile and map machine-learning applications to resource-constrained hardware. DFSynthesizer makes five key contributions. First, we present an approach to analyze machine-learning programs and generate SNN workloads using representative data. Second, we present an approach to decompose and partition complex SNN workloads to generate clusters of neurons and synapses such that each cluster can fit onto a crossbar of the hardware. Third, we exploit the rich semantics of Synchronous Dataflow Graphs (SDFGs) to represent clustered SNN programs. This allows the SNN's performance, e.g., throughput, to be estimated on the hardware as a function of key properties such as the number of crossbars, the dimension of crossbars, the buffer space on tiles, and the tile communication bandwidth. Fourth, we develop a novel scheduling algorithm based on Self-Timed Execution for executing clusters on the crossbars of a neuromorphic hardware, providing performance guarantees in scenarios with dynamic resource availability. Fifth, we propose a design-space exploration framework incorporating DFSynthesizer that allows the Pareto space of different SNN-to-hardware mappings to be explored while considering other hardware metrics such as energy, latency, and reliability.
We evaluate DFSynthesizer using 10 machine-learning programs that are representative of the three most commonly used neural network classes: convolutional neural network (CNN), multi-layer perceptron (MLP), and recurrent neural network (RNN). Our results demonstrate that DFSynthesizer provides much tighter performance guarantees compared to current practices.
Acknowledgements.
This work is supported by 1) the National Science Foundation Award CCF-1937419 (RTML: Small: Design of System Software to Facilitate Real-Time Neuromorphic Computing) and 2) the National Science Foundation Faculty Early Career Development Award CCF-1942697 (CAREER: Facilitating Dependable Neuromorphic Computing: Vision, Architecture, and Impact on Programmability).
References
 Tensorflow: a system for largescale machine learning. In OSDI, Cited by: §3.2.1, Table 1.
 Cognitive computing programming paradigm: a corelet language for composing networks of neurosynaptic cores. In IJCNN, Cited by: §9.
 TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design. In ICCAD, Cited by: Figure 5, §4.1.
 Neuromorphic computing across the stack: devices, circuits and architectures. In SIPS, Cited by: §9.
 The internet of things: a survey. Computer Networks. Cited by: §1.
 A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA. In IPDPSW, Cited by: §9.
 PyCARL: A PyNN interface for hardwaresoftware cosimulation of spiking neural network. In IJCNN, Cited by: §3.2.2, §5.4, 2nd item, §7.1, §7, §9.
 Poweraccuracy tradeoffs for heartbeat classification on neural networks hardware. JOLPE. Cited by: §3.3.1, §7.1, Table 2, §7, §9, footnote 5.
 Mapping spiking neural networks to neuromorphic hardware. TVLSI. Cited by: §4.3, §5.4, 1st item, §7.1, §7, §9.
 A framework for the analysis of throughputconstraints of SNNs on neuromorphic hardware. In ISVLSI, Cited by: §9.
 Compiling spiking neural networks to mitigate neuromorphic hardware constraints. In IGSC Workshops, Cited by: §9.
 Runtime mapping of spiking neural networks to neuromorphic hardware. JSPS. Cited by: §9.
 A framework to explore workloadspecific performance and lifetime tradeoffs in neuromorphic computing. CAL. Cited by: §9.
 Enabling resourceaware mapping of spiking neural networks via spatial decomposition. ESL. Cited by: §4.2, §9.
 NeuroXplorer 1.0: An extensible framework for architectural exploration with spiking neural networks. In ICONS, Cited by: Figure 16, §7.
 Design methodology for embedded approximate artificial neural networks. In GLSVLSI, Cited by: §9.
 Exploration of segmented bus as scalable global interconnect for neuromorphic computing. In GLSVLSI, Cited by: footnote 6.
 Loose interdependence algorithms. In Software Synthesis from Dataflow Graphs, Cited by: Figure 10, §4.5.
 Nengo: a python tool for building largescale functional brain models. Frontiers in Neuroinformatics. Cited by: §3.3.1.
 Networks on chip: a new paradigm for systems on chip design. In DATE, Cited by: footnote 6.
 Neurogrid: a mixedanalogdigital multichip system for largescale neural simulations. Proceedings of the IEEE. Cited by: §9.
 N2D2: Neural network design & deployment. https://github.com/CEALIST/N2D2. Cited by: §3.3.1.
 Maximumthroughput mapping of SDFGs on multicore SoC platforms. JPDC. Cited by: §9.
 Neuromorphic computing using nonvolatile memory. Advances in Physics: X. Cited by: §1, §1.
 Very largescale neuromorphic systems for biological signal processing. In CMOS Circuits for Biological Sensing and Processing, Cited by: Figure 2, §1, Figure 5, §4.1.
 Eyeriss: an energyefficient reconfigurable accelerator for deep convolutional neural networks. JSSC. Cited by: §1.
 Using dataflow to optimize energy efficiency of deep neural network accelerators. IEEE Micro. Cited by: §9.

 CARLsim 4: An open source library for large scale, biologically detailed spiking neural network simulation using heterogeneous clusters. In IJCNN, Cited by: §3.2.2, §3.3.1, Table 1, Figure 11, §4.6, §7.1, Table 2, §7.
 An efficient and versatile scheduling algorithm based on SDC formulation. In DAC, Cited by: §4.6.
 Modeling staticorder schedules in synchronous dataflow graphs. In DATE, Cited by: §9.
 Unsupervised heartrate estimation in wearables with Liquid states and a probabilistic readout. Neural Networks. Cited by: §7.1, Table 2, §7, §9.
 Adaptive and hierarchical runtime manager for energyaware thermal management of embedded systems. TECS. Cited by: §9.

 Heartbeat classification in wearables using multilayer perceptron and time-frequency joint distribution of ECG. In CHASE, Cited by: §7.1, §9.
 Energy-aware communication and remapping of tasks for reliable multimedia multiprocessor systems. In ICPADS, Cited by: §6, §9.
 Agingaware hardwaresoftware task partitioning for reliable reconfigurable multiprocessor systems. In CASES, Cited by: §9.
 Communication and migration energy aware design space exploration for multicore systems with intermittent faults. In DATE, Cited by: §9.
 Communication and migration energy aware task mapping for reliable multiprocessor systems. FGCS. Cited by: §4.3, §6, §9.
 Energyaware task mapping and scheduling for reliable embedded computing systems. TECS. Cited by: §9.
 Reliability and energyaware mapping and scheduling of multimedia applications on multiprocessor systems. TPDS. Cited by: §9.
 Faultaware task remapping for throughput constrained multimedia applications on NoCbased MPSoCs. In RSP, Cited by: §9.
 Dataflowbased mapping of spiking neural networks on neuromorphic hardware. In GLSVLSI, Cited by: §9.
 Energyaware dynamic reconfiguration of communicationcentric applications for reliable MPSoCs. In ReCoSoC, Cited by: §9.
 Hardwaresoftware interaction for runtime power optimization: a case study of embedded linux on multicore smartphones. In Proceedings of ISLPED, Cited by: §9.
 Mapping of local and global synapses on spiking neuromorphic hardware. In DATE, Cited by: §7.1, §7, §9.
 Loihi: a neuromorphic manycore processor with onchip learning. IEEE Micro. Cited by: Figure 2, §1, §3.3.1, §7, §9.
 PyNN: a common interface for neuronal network simulators. Frontiers in Neuroinformatics. Cited by: §3.2.2.
 TrueNorth: Accelerating from zero to 64 million neurons in 10 years. Computer. Cited by: Figure 2, §1, §3.3.1, §7, §9.
 Imagenet: a largescale hierarchical image database. In CVPR, Cited by: §3.2.1.
 The MNIST database of handwritten digit images for machine learning research [best of the web]. Signal Processing Magazine. Cited by: §3.2.1.
 Unsupervised learning of digit recognition using spiketimingdependent plasticity. Frontiers in Computational Neuroscience. Cited by: §7.1, Table 2, §9.
 PyNEST: A convenient interface to the NEST simulator. Frontiers in Neuroinformatics. Cited by: §3.2.2.
 The SpiNNaker project. Proceedings of the IEEE. Cited by: §9.
 A framework for plasticity implementation on the spinnaker neural architecture. Frontiers in Neuroscience. Cited by: §9.
 Throughput analysis of synchronous data flow graphs. In ACSD, Cited by: §6.
 The brian simulator. Frontiers in Neuroscience. Cited by: §3.2.2.
 HFNet: A CNN architecture codesigned for neuromorphic hardware with a crossbar array of synapses. Frontiers in Neuroscience. Cited by: Figure 5, §4.1.
 Deep learning with keras. Cited by: §3.2.1, Table 1, §7.
 Max Plus at work: Modeling and analysis of synchronized systems: a course on MaxPlus algebra and its applications. Cited by: §4.6.
 The NEURON simulation environment. Neural Computation. Cited by: §3.2.2.
 Hierarchical dataflow modeling of iterative applications. In DAC, Cited by: §9.
 Dotproduct engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrixvector multiplication. In DAC, Cited by: Figure 5, §4.1.
 A lowpower adaptive integrateandfire neuron circuit. In ISCAS, Cited by: §1.
 NEUTRAMS: Neural network transformation and codesign under neuromorphic hardware constraints. In MICRO, Cited by: §4.3.
 A recurrent neural network based model of predictive smooth pursuit eye movement in primates. In IJCNN, Cited by: §7.1, Table 2, §9.

 An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal. Cited by: §4.3.
 Imagenet classification with deep convolutional neural networks. NeurIPS. Cited by: §3.2.1.
 Special Session: reliability analysis for ML/AI hardware. In VTS, Cited by: §9.
 LeNet5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet. Cited by: §3.2.1.
 Synchronous data flow. Proceedings of the IEEE. Cited by: 3rd item, §4.4.
 A systemlevel simulator for RRAMbased neuromorphic computing chips. TACO. Cited by: §9.
 Networks of spiking neurons: the third generation of neural network models. Neural Networks. Cited by: §1.
 Designtechnology cooptimization for oxrrambased synaptic processing unit. In VLSIT, Cited by: §1.
 A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). TBCAS. Cited by: Figure 2, §1, §3.3.1, Figure 5, §7.2, Table 3, §9, footnote 7.
 Selftimed scheduling analysis for realtime applications. JASP. Cited by: §5.3.
 Machine learning applications to dna subsequence and restriction site analysis. In SPMB, Cited by: §9.
 PyTorch: An imperative style, highperformance deep learning library. arXiv. Cited by: §3.2.1.
 SPINDLE: SPINtronic deep learning engine for largescale neuromorphic computing. In ISLPED, Cited by: §9.
 Mlperf inference benchmark. In ISCA, Cited by: §3.2.1.
 Theory and tools for the conversion of analog to spiking convolutional neural networks. arXiv. Cited by: §3.3.1.
 Live demonstration: a scaleddown version of the brainscales waferscale neuromorphic system. In ISCAS, Cited by: §9.
 Adaptive energy minimization of openmp parallel applications on manycore systems. In PARMADITAM, Cited by: §9.
 Very deep convolutional networks for largescale image recognition. arXiv. Cited by: §3.2.1.
 Compiling spiking neural networks to neuromorphic hardware. In LCTES, Cited by: §5.4, 3rd item, §9, footnote 2.
 Exploiting inter and intramemory asymmetries for data mapping in hybrid tieredmemories. In ISMM, Cited by: footnote 1.
 Improving dependability of neuromorphic computing with nonvolatile memory. In EDCC, Cited by: §9.
 Enabling and exploiting partitionlevel parallelism (PALP) in phase change memories. TECS. Cited by: footnote 1.
 Improving phase change memory performance with data content aware access. In ISMM, Cited by: footnote 1.
 Agingaware request scheduling for nonvolatile main memory. In ASPDAC, Cited by: footnote 1.
 A case for lifetime reliabilityaware neuromorphic computing. In MWSCAS, Cited by: §9.
 Design methodologies for reliable and energyefficient PCM systems. In IGSC Workshops, Cited by: footnote 1.
 Dynamic reliability management in neuromorphic computing. JETC. Cited by: §9.
 A design flow for mapping spiking neural networks to manycore neuromorphic hardware. arXiv. Cited by: §9.
 Improving inference lifetime of neuromorphic systems via intelligent synapse mapping. In ASAP, Cited by: §9.
 Embedded Multiprocessors; Scheduling and Synchronization. Cited by: §4.4.
 Towards probabilistic timing analysis for SDFGs on tile based heterogeneous MPSoCs. In ECRTS, Cited by: §9.
 Exploring tradeoffs in buffer requirements and throughput constraints for synchronous dataflow graphs. In DAC, Cited by: §4.4.
 Multiprocessor resource allocation for throughputconstrained synchronous dataflow graphs. In DAC, Cited by: §9.
 Exploring tradeoffs in buffer requirements and throughput constraints for synchronous dataflow graphs. In DAC, Cited by: §9.
 Reliabilityperformance tradeoffs in neuromorphic computing. In IGSC Workshops, Cited by: §4.6, §9.
 Thermalaware compilation of spiking neural networks to neuromorphic hardware. In LCPC, Cited by: §4.6, §9.
 On the role of system software in energy management of neuromorphic computing. In CF, Cited by: §5.4, §9.
 Enduranceaware mapping of spiking neural networks to neuromorphic hardware. TPDS. Cited by: §4.6, §9.
 An EDA framework for large scale hybrid neuromorphic computing systems. In DAC, Cited by: §9.
 An allmemristor deep spiking neural computing system: a step toward realizing the lowpower stochastic brain. TETCI. Cited by: §9.
 Memristive crossbar arrays for braininspired computing. Nature Materials. Cited by: §9.
 Neuromorphic computing with memristor crossbar. Physica Status Solidi (a). Cited by: §9.
 SDCbased modulo scheduling for pipeline synthesis. In ICCAD, Cited by: §4.6.
 Static rateoptimal scheduling of multirate DSP algorithms via retiming and unfolding. In RTAS, Cited by: §9.
Appendix A Converting Analog Operations to Spiking Equivalent
In this section, we briefly elaborate how an analog operation such as the Rectified Linear Unit (ReLU) is implemented using a Spiking Neural Network (SNN). The output y of a ReLU activation function is given by

(16)   y = \max\left(0, \sum_i w_i \cdot x_i\right)

where w_i is the weight and x_i is the activation on the i-th synapse of the neuron. To map the ReLU activation function, we consider a particular type of spiking neuron model known as the Integrate-and-Fire (IF) neuron model. The IF spiking neuron's transfer function can be represented as

(17)   V_{mem}(t) = V_{mem}(t-1) + \sum_i w_i \cdot x_i(t)

where V_{mem}(t) is the membrane potential of the IF neuron at time t, w_i is the weight, and x_i(t) is the activation on the i-th synapse of the neuron at time t. The IF spiking neuron integrates incoming spikes (x_i) and generates an output spike when the membrane potential (V_{mem}) exceeds the threshold voltage (V_{th}) of the IF neuron. Therefore, by ensuring that the output spiking rate f_{out} is proportional to the ReLU activation y, i.e., f_{out} \propto y, we accurately convert the ReLU activation to the spike-based model. To further illustrate this, we consider the multi-layer perceptron (MLP) of Figure 26a and its SNN conversion using rate-based encoding (Figure 26b) and inter-spike interval (ISI) encoding (Figure 26c).
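The rate equivalence between an IF neuron and a ReLU unit can be checked with a minimal simulation. The sketch below assumes rate-coded Bernoulli input spike trains, a threshold of 1.0, and reset-by-subtraction; these choices and all names are illustrative assumptions, not the paper's exact conversion procedure:

```python
import random

def if_neuron_rate(weights, input_rates, T=20000, v_th=1.0, seed=0):
    """Simulate an Integrate-and-Fire neuron driven by rate-coded
    Bernoulli spike trains and return its output spike rate.
    Sketch only: threshold, reset rule, and encoding are assumptions."""
    rng = random.Random(seed)
    v, spikes = 0.0, 0
    for _ in range(T):
        for w, r in zip(weights, input_rates):
            if rng.random() < r:      # presynaptic spike this timestep
                v += w                # integrate the weighted spike
        if v >= v_th:
            spikes += 1
            v -= v_th                 # reset by subtraction
    return spikes / T

# With all input activations equal to 1, ReLU(sum w_i) = 0.6, and the
# IF neuron's output spike rate approaches the same value.
weights = [0.2, 0.1, 0.3]
print(round(if_neuron_rate(weights, [1.0, 1.0, 1.0]), 2))  # ~0.6
```

The simulated spike rate converges to the ReLU output, which is exactly the proportionality f_{out} \propto y that the conversion relies on.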
In Figure 26a, neurons 1, 2, and 3 are the input neurons, and neurons 4 and 5 are the output neurons. To keep the model simple, let us consider the case where the activations of input neurons 1, 2, and 3 are all equal to 1. Using Equation 16, we know that the outputs of neurons 4 and 5 are 0.6 and 0.3, respectively. Figures 26b and 26c show the mapped SNN model using rate-based and inter-spike interval encoding schemes, respectively. In the rate-based model of Figure 26b, the rate of spikes generated is expected to be proportional to the outputs of neurons 4 and 5 in the MLP. In the ISI-based SNN model, the inter-spike interval of the spikes generated by neurons 4 and 5 is expected to be proportional to the output generated in the MLP, as shown in Figure 26c.
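To make the example concrete, the analog outputs 0.6 and 0.3 can be translated into spike counts and mean inter-spike intervals, using the definition of ISI as the inverse of the mean firing rate (Sec. 1). The window length and the helper name below are assumptions for illustration:

```python
def rate_and_isi(activation, window=100):
    """For a rate-coded SNN, an analog output 'a' in [0, 1] maps to about
    a * window spikes within a window of 'window' timesteps; the mean
    inter-spike interval (ISI) is the inverse of that rate. Illustrative."""
    n_spikes = round(activation * window)
    isi = window / n_spikes if n_spikes else float("inf")
    return n_spikes, isi

# Neurons 4 and 5 of the MLP example produce 0.6 and 0.3, respectively.
for neuron, a in ((4, 0.6), (5, 0.3)):
    spikes, isi = rate_and_isi(a)
    print(neuron, spikes, isi)
```

Neuron 4 thus spikes twice as often as neuron 5 and has half its mean ISI, so either quantity encodes the ratio between the two analog outputs.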
We note that nonlinear activation functions such as sigmoid and tanh cannot be accurately mapped to a spike-based model. This is because the transfer function of a biological spiking neuron (the neuron response curve) closely resembles a ReLU rather than the sigmoid or tanh activation functions. While approximate implementations of the sigmoid and tanh operators using spiking neurons can be found in the literature, they induce significant inaccuracies into the conversion process and require more resources (neurons) to implement. The tanh activation function, for instance, generates output values ranging between -1.0 and 1.0. To represent the tanh function in a spike-based model, both excitatory and inhibitory spiking neurons would be required to represent the positive and negative output values, respectively. This would double the number of spiking neurons needed to represent the tanh activation function.