Edge-Tailored Perception: Fast Inferencing in-the-Edge with Efficient Model Distribution

by   Ramyad Hadidi, et al.

The rise of deep neural networks (DNNs) is inspiring new studies in a myriad of edge use cases with robots, autonomous agents, and Internet-of-Things (IoT) devices. However, in-the-edge inferencing of DNNs remains a severe challenge, mainly because of the contradiction between the inherently intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, since communication is costly, taking advantage of other available edge devices is not an effective solution in edge domains. Therefore, to benefit from available compute resources with low communication overhead, we propose new edge-tailored perception (ETP) models that consist of several almost-independent and narrow branches. ETP models offer close-to-minimum communication overhead and better distribution opportunities while significantly reducing memory and computation footprints, all with a trivial accuracy loss for tasks that are not accuracy-critical. To show the benefits, we deploy ETP models on two real systems, Raspberry Pis and edge-level PYNQ FPGAs. Additionally, we share our insights about tailoring a systolic-based architecture for edge computing with FPGA implementations. ETP models created from LeNet, CifarNet, VGG-S/16, AlexNet, and ResNets, and trained on MNIST, CIFAR10/100, Flower102, and ImageNet, achieve maximum and average speedups of 56x and 7x compared to the originals. ETP complements existing single-device optimizations for embedded devices by enabling the exploitation of multiple devices. As an example, we show that applying pruning and quantization to ETP models improves the average speedup to 33x.




I Introduction & Motivation

In-The-Edge Inferencing: The advancements of deep neural networks (DNNs) have made revolutionary changes in domains such as robotics [1, 2, 3, 4, 5], unmanned aerial vehicles (UAVs) [6, 7, 8, 9], and Internet-of-Things (IoT) [10, 11, 12, 13, 14, 15]. In such domains, in-the-edge inferencing is rapidly gaining ground due to ubiquitous wireless networks and the availability of embedded processors. This paper targets in-the-edge inferencing in environments such as smart homes/cities/offices (e.g., connected cameras, game consoles, TVs, routers) or collaborative robots/drones (e.g., disaster relief [16, 17, 18], agriculture [19], farming [20], mining [21], construction [22], mapping [18, 23]), in which: (i) accuracy is not the ultimate goal (e.g., detecting human sound in a disaster area with either 87% or 90% accuracy necessitates more investigation); (ii) the network of devices is standalone (i.e., an Internet connection is not necessary and no data is offloaded, for the sake of security, e.g., in military applications); and (iii) the network has unified ownership, hence data communication between devices does not hazard privacy, security, or monetary cost.

The Key Challenge: Privacy concerns [24, 25, 26], unreliable connections to the cloud, tight real-time requirements, and personalization are pushing inferencing to the edge. Despite such numerous driving forces for in-the-edge inferencing, the key challenge is that fast inferencing requires high compute resources and memory [27], which contradicts the limited energy and computational resources of edge devices (i.e., resource-constrained devices [28, 29]). Such demands are not expected to slow down, as modern models [30, 31] encapsulate more parameters for better generalization.

Current Solutions & Limitations: (1) The first widespread approach is to offload all computations to the high-performance servers of cloud providers [32, 33]. However, cloud-based offloading is not always available (e.g., no Internet access) and often suffers from unreliable network latency. Furthermore, with the exponential increase in the number of edge devices [34] and the scale of raw collected data, centralized cloud-based approaches might not scale [35, 36]. (2) The second approach to dealing with the limited resources on edge devices is distributing computations by taking advantage of the existing surrounding devices such as cameras and other mostly idle devices. Distribution is based on common data- or model-parallelism methods [37, 38]. In data parallelism, the entire model is duplicated on each device to perform separate inferences. Hence, the system needs several live and concurrent inputs to be efficient without real-time jitter. Simply put, data parallelism only increases throughput. In model parallelism, the model is divided and distributed across several devices for the same inference. Neither data- nor model parallelism jointly reduces communication, memory usage, and computation, as Table I depicts in detail. Moreover, although model parallelism could decrease execution latency in theory, in reality it incurs high communication latency.

Our Solution: To address the high latency that results from model parallelism, we explore an efficient model distribution method, edge-tailored perception (ETP) models, that:

  • Reduce Communication: ETP models replace a single, wide, and deep model with several narrow ones that only communicate the input and pre-final activations. Thus, their communication load is low when distributed (see Table I).

  • Reduce Compute & Memory Footprints Per Node: ETP models have fewer connections than the original ones, so their number of parameters and computational demands are also lower than those of the peer model-parallelism versions, as shown in Table I.

  • Allow Inter-Layer Parallelism: Narrow branches in ETP models are independent of each other, which enables inter-layer parallelism. This is in contrast with model-parallelism methods that only allow intra-layer parallelism due to the single-chain dependency between consecutive layers.

With ETP, our goal is to illustrate the advantages of models designed with the characteristics of their underlying computational domain in mind.
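As a minimal illustration of the idea (with made-up layer widths and random weights, not the paper's actual models), the following NumPy sketch runs two independent narrow branches that share only the input and concatenate their pre-final activations at a final node:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def branch(x, w1, w2):
    # A narrow, self-contained branch: no cross-branch connections,
    # so each device can run it independently on a copy of the input.
    return relu(w2 @ relu(w1 @ x))

x = rng.standard_normal(64)                     # shared input (broadcast once)
# Two narrow branches (hidden width 32) replacing one wide model (width 64).
branches = [(rng.standard_normal((32, 64)), rng.standard_normal((16, 32)))
            for _ in range(2)]

# Each branch executes on its own device; only the 16-element pre-final
# activations travel back to the final node.
pre_final = np.concatenate([branch(x, w1, w2) for w1, w2 in branches])

w_out = rng.standard_normal((10, 32))           # final classifier over 2x16 activations
logits = w_out @ pre_final
assert logits.shape == (10,)
```

Because the branches never exchange intermediate activations, the only per-inference traffic is the input broadcast and the two 16-element vectors sent to the final node.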

TABLE I: Methods for distributing inference computations. (The original table compares data parallelism, model parallelism, a target method, and ETP on the intermediates communicated per inference and on the per-device memory and computation footprints; its layout is not recoverable from the extracted text. Footnote: DNN denotes metrics associated with the entire model; the divisor denotes the number of devices.)

To restore a possible accuracy loss caused by fewer connections and parameters, we create enhanced versions of ETP models with a similar accuracy (within 3%) to the original models and only a fraction (5%–24%) of their compute and memory demands. Such models enable locally faster execution with a slight accuracy loss for tolerant and common tasks (e.g., counting the cars passing an intersection). If a task requires high accuracy (e.g., finding a specific license plate), the system relies on existing (1) cloud-based, (2) collaborative [39], or (3) fog-based [40] offloading technologies.

ETP is orthogonal and complementary to current techniques for reducing the computational demand of models, such as weight pruning [27] and quantization [41]. ETP models offer distribution/parallelism opportunities for distributed computing, whereas current techniques apply accuracy/performance tradeoffs to single-node models with a single chain of inter-layer dependency. Such techniques can be applied to each branch of ETP models, as shown in Section IV-C. Hence, ETP models complement such techniques rather than compete with them.

Experiments Overview: (1) We design, train, and evaluate ETP models based on computer-vision DNNs on the MNIST [42], CIFAR10/100 [43], Flower102 [44], and ImageNet [45] datasets (a total of 50 training results). (2) To evaluate the execution performance of ETP models, we conduct real-world implementations on two systems: a system with up to ten Raspberry Pis (RPis), and another with two PYNQ boards. RPis are chosen because they represent the de facto choice for several robotic and edge use cases and are readily available [46, 47, 48, 49, 50]. (3) We tailor a TPU-like [51] architecture for edge computing and share our findings with an edge-based FPGA. We tailor the microarchitecture for edge computing by maximizing data reuse and enabling data streaming from the memory, and we enhance it with a data-driven execution model that eliminates the overheads of instructions in TPU. (4) Finally, we estimate the area and power of a 28 nm ASIC for the tailored architecture.

Contributions: Our contributions are as follows:

  • We propose the first DNN optimization technique to reduce the communication overhead in a distributed system for in-the-edge inferencing.

  • We propose ETP models that enable inter-layer parallelism for in-the-edge inferencing while reducing the total memory and computation footprints.

  • We conduct real-world experiments on Raspberry Pis and PYNQ boards, and share our findings on an edge-based FPGA for an edge-tailored TPU-like architecture.

II Challenges of Edge Computing

II-A Growing DNNs & Resource Limitation

DNN models consist of several layers, each of which performs specific computations. The computations are based on custom weights that are learned during the training phase with back-propagation. In the inference phase (i.e., prediction), feed-forward computations are performed on batched inputs and the learned parameters stay constant. For edge computation, the most compute- and data-intensive layers [52, 53] are fully-connected and convolution layers.¹ In fact, DNNs are inherently compute-intensive; Figure 1 shows the number of multiply-accumulate operations and the parameter size of several DNN models. The left bars illustrate basic models such as LeNet [54] and CifarNet [43]. On the right side, we illustrate the YOLO [55] and C3D [56] models that are used for videos. The newest translation model, BERT [30], significantly surpasses all the previous models in both parameter size and computation. As shown, newer models encapsulate more parameters and perform more computations for better and more generalized feature understanding than their predecessors. In short, this trend of modern models will inevitably surpass the capabilities of any resource-constrained device.

¹Since this paper focuses on visual models, we only introduce the layers in such models. For future work, we aim to include other types of DNNs.
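To make the trend concrete, the per-layer MAC and parameter counts can be estimated with a few lines of Python (bias terms omitted; the example layer shape below is illustrative, not taken from the paper):

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """MACs per inference and parameter count for a k x k convolution."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out        # each output pixel reuses the kernel
    return macs, params

def fc_cost(n_in, n_out):
    """MACs per inference and parameter count for a fully-connected layer."""
    params = n_in * n_out
    macs = params                        # one MAC per weight per inference
    return macs, params

# Example: a 3->96 channel, 11x11 conv producing a 55x55 output map
# needs ~105M MACs from only ~35K parameters.
macs, params = conv2d_cost(3, 96, 11, 55, 55)
print(macs, params)
```

Convolution layers dominate compute (weights are reused across output pixels), while fully-connected layers dominate parameter storage, which is why both layer types are singled out as the intensive ones above.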

Fig. 1: DNNs Trend – Number of MAC operations/inference and parameters.

II-B Single Device Pareto Frontier

The capabilities of resource-constrained platforms are limited. To gain a better understanding, Figure 2 depicts the latency per image of state-of-the-art image-recognition models from the ILSVRC 2012 challenge [45] on an RPi [57]. All implementations heavily utilize modern machine learning optimizations such as pruning [27], quantization, low-precision inference [41, 58, 59], and handcrafted models [60]. Additionally, the models are highly optimized for ARMv8 architectures using the ELL compilation tool [61]. However, achieving higher execution performance is impossible on a single device due to the Pareto frontier. As seen, the latency of high-accuracy models is longer than 400 ms, and latencies are generally longer than 100 ms. In addition, the data shown in the figure covers only image-recognition models, and DNNs in other domains are already surpassing these models in size and complexity. Fitting such exponentially increasing computation on a single device, especially an edge device, is a limiting factor for executing DNNs in the edge. In other words, even after applying all optimization techniques for DNNs in embedded systems, the single-device Pareto frontier still limits the widespread applicability of DNNs in several in-the-edge applications.

Fig. 2: Latency-Accuracy Pareto Frontier – Single device: Latency per image on RPi3 for state-of-the-art ILSVRC models with the optimized platform-specific compilation ELL [61] tool [57]. Multiple devices: Breaking the single device Pareto frontier, but with significant communication overhead.

II-C Current Limitations of Distribution Methods

(1) Data parallelism parallelizes the computations of independent inputs. Among distribution methods, data parallelism [37, 38] keeps the computation and memory footprints per device similar to those of the original DNN (see Table I). Data parallelism does not apply to the edge because it (i) serves several independent requests, which are limited in an edge environment; (ii) does not reduce latency, which is important in several real-time applications in the edge; and (iii) does not reduce the computation and memory footprints per device.

(2) Model-parallelism methods divide the computations of a DNN model for a single request. These methods first divide the computations based on the layers in a model and then internally divide the computations within each layer, keeping dependencies intact. Depending on the type of the layer, the division can take several forms. In Figure 3, we provide a simple example of distributing a fully-connected (fc) layer. The computation of an fc layer is y = Wx + b, in which W, x, and b are the weights, input, and biases, respectively. We can also write this computation as a matrix-matrix multiplication when inputs are batched. There are two extremes of model parallelism, input and output splitting [5]. In output splitting, producing each set of outputs is divided among the devices. In input splitting, the input is split and each device computes all parts of the output that depend on its received input. As shown in Figure 3, each technique has a specific communication overhead. Output splitting requires the transmission of the input to all nodes. Input splitting requires the transmission of partial sums to a final node for summation. New model-parallelism methods can also be crafted by mixing these two extremes, but they suffer from the same overheads discussed.
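The two splitting extremes for an fc layer can be sketched in a few lines of NumPy (toy dimensions; the row/column split points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))          # weights
x = rng.standard_normal(6)               # input
b = rng.standard_normal(8)               # biases
y_ref = W @ x + b                        # single-device fc layer: y = Wx + b

# Output splitting: each device holds a slice of W's ROWS and receives the
# FULL input x; outputs are simply concatenated (communication = broadcast x).
y_out_split = np.concatenate([W[:4] @ x + b[:4], W[4:] @ x + b[4:]])

# Input splitting: each device holds a slice of W's COLUMNS and a slice of x;
# partial sums must be shipped to one node and added (communication = partials).
y_in_split = (W[:, :3] @ x[:3]) + (W[:, 3:] @ x[3:]) + b

assert np.allclose(y_ref, y_out_split) and np.allclose(y_ref, y_in_split)
```

Both splits reproduce the original result exactly; they differ only in what must cross the network, which is precisely the overhead Figure 3 highlights.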

Several model-parallelism techniques also exist for convolution layers, converting their computation to matrix-matrix multiplication [62, 63]; they are similar to the example provided for the fc layer. In summary, since model-parallelism techniques do not change the internal network connections of a model, after distribution we need to keep the dependency chains intact. Hence, although model parallelism reduces the compute and memory footprint per device, the communication overhead resulting from the tightly interconnected layers and inter-layer dependencies stays the same as in the original model.

Fig. 3: Basics of Model Parallelism – Distributing a fully-connected layer with input- and output splitting. Note the communication overheads.

II-D Communication Challenges

The dependence of current distribution methods on a high amount of communication induces the straggler problem, in which a system is lagged by its slowest node. Specifically, since edge devices usually use a wireless or mobile network, the latency deviations are high. Figure 4 depicts the histogram of prediction latencies on a distributed IoT system consisting of six RPis executing AlexNet [64] with model parallelism. The computing time is bounded by 500 ms, but the average delay is longer (and longer still for tail latency). To gain perspective, Figures 5a and b depict the VGG-S model and its distributed version with model parallelism, respectively. The VGG-S model has a similar parameter size and compute density to AlexNet [37] (Figure 1) and is designed for the Flower102 [44] dataset. As seen, dependencies enforce a strongly interconnected network among the divided parts. Although several techniques such as compression could alleviate the cost of communication, they do not reduce the number of connections. As a result, a target distribution method for edge devices, besides yielding low memory and computation footprints per node, must reduce the number of connections.

Fig. 4: Straggler Problem – Histogram of prediction latencies on a six Raspberry Pi system executing AlexNet with model parallelism (Section IV-B).

Relying extensively on communication in model-parallelism techniques also imposes another challenge: finding an optimal distribution. This is because each distribution depends heavily on communication and network traffic, which change over time. Hence, a single distribution is not always the best answer, and actively profiling the distribution in real time is necessary. In particular, finding an optimal distribution is an NP-hard problem. Since our distribution reduces the dependence on communication, it reduces the complexity of the search space. In fact, our models only need to communicate the input and the activations before the final classification layer. Additionally, for the same reasons, ensuring reliability in our models is easier than in the model-parallelism versions.

In summary, with parallel execution on multiple devices, we could ideally pass the frontier. But, as shown in Figure 2, current distribution methods are limited by the communication overhead and the inherent inter-layer data dependency in models. The next section proposes ETP models, which significantly reduce communication and allow inter-layer parallelism.

Fig. 5: Model-Parallelism View of the Entire Model – (a) VGG-S, and (b) its distributed version.

III Edge-Tailored Perception

This section first provides details on ETP models, discussions about their key features, and their design procedure. The second part of this section focuses on insights about how to tailor a TPU-like systolic-based architecture for edge computing. Finally, the last subsection presents details on the microarchitecture design.

III-A ETP Models

Simple Example of ETP Models: We illustrate the design procedure of ETP models by following our simple example from Section II-D. To design ETP models, we first divide each layer in the original model into equal parts (e.g., split in two) and then remove their intermediate connections. Figure 6a shows a new model based on the VGG-S model. As shown, by removing the intermediate connections between the two branches, we create two independent and parallel branches. We only keep the input and pre-final-layer connections so that the model acts as a single model. Since, in our example, each branch is half the size of the original model and has fewer connections, the new model has less than half of the parameters of the original model. Moreover, communication is only needed for the input and the activations before the final prediction layer. Figure 6b illustrates an example distribution with the new model. The computations of the final node can run on a new node or even on the user's device. Later in this section, we provide our design procedure to generalize this simple example. As discussed in Table I, ETP models have low memory and computation footprints, while their communication per inference is significantly reduced with the distribution.

Fig. 6: ETP Model Simple Example – (a) VGG-S split into two branches, and (b) task assignments for execution.

Key Features of ETP Models: ETP models are designed by considering their underlying computation domain and have the following key features. (1) ETP models only communicate the input and pre-final activations. Therefore, they significantly reduce the communication overhead in a distributed system. Additionally, the low communication load per inference helps with the straggler problem. This is in contrast with model parallelism, which depends heavily on communication between all the intermediate layers. (2) ETP models split the size of each layer, so the total parameter size and computational complexity of the model are reduced. Therefore, ETP models require a smaller parameter size, less computational complexity, and no communication between the nodes for intermediate layers. These lower memory and computation footprints allow edge devices to operate efficiently within their limited resources (e.g., no swap-space activity due to limited memory). (3) ETP models replace the original wide model with several narrow and independent branches. Since the computations of the branches are no longer dependent, in contrast with the single chain of dependency in the original model, a distributed system can concurrently execute all branches. In other words, ETP models allow us to go beyond intra-layer parallelism in a model.

Design Procedure of ETP Models: Figure 7 describes the design procedure of ETP models. We start by inputting our DNN model architecture and its per-layer memory and computation footprints. Similarly, we input the specification of the hardware, such as its memory, its computation capability, and any overhead associated with executing a DNN on the hardware. For instance, several DNN frameworks have a memory overhead on the Ubuntu distributions that run on Raspberry Pis. A splitter procedure, written in Algorithm 1, in a while loop, splits the model, cuts the connections, and measures the approximate footprints of each branch. The division factor, a hyperparameter, defines the granularity of division/splitting. The loop exits when a single branch fits on a device (both memory- and computation-wise). Removing non-branch connections is a simple operation that keeps only one connection per layer. The model derived from the splitter is the split-only model. By training the split-only model and testing it, we measure its accuracy. Then, depending on the accuracy requirement of our task, we either fatten each branch by a hyperparameter amount or output the model. If we decide to fatten each branch, since we are introducing new weights, we retrain the new split-fattened model. We choose the best accuracy between the transfer-learning and from-scratch versions during such retraining. The process continues in a loop until we are within an acceptable error range for our task, itself a hyperparameter. Finally, we output the model and its weights. We showcase various ETP models with 5 datasets and 8 models, including ImageNet, in Section IV-A.

Input:  DNN: layer configurations
        DNNmem, DNNMAC: DNN memory and computational footprints
        Division: division factor for splitting
        Devmem, DevMAC: hardware specification
Output: DNN: layer configurations
1 Split(DNN, DNNmem, DNNMAC, Division, Devmem, DevMAC)
2     MemFits ← 0; MACFits ← 0
3     while not MemFits and not MACFits do
4         MemFits ← (DNNmem ≤ Devmem)
5         MACFits ← (DNNMAC ≤ DevMAC)
6         for layer in DNN do
7             layer.width ← layer.width / Division
8     RemoveNonBranchConnections(DNN)
9     return DNN
Algorithm 1: Split Algorithm
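A possible Python rendering of the splitter, under the assumption that branch footprints are estimated by user-supplied cost functions and that layer widths alone capture a branch's cost (the dictionary layout and the toy quadratic cost models below are illustrative, not the paper's actual implementation):

```python
def split(layers, mem_footprint, mac_footprint, division, dev_mem, dev_mac):
    """Shrink every layer's width by `division` until one branch fits both
    the device's memory AND compute budgets, then sever cross-branch edges."""
    while mem_footprint(layers) > dev_mem or mac_footprint(layers) > dev_mac:
        for layer in layers:
            layer["width"] = max(1, layer["width"] // division)
    remove_non_branch_connections(layers)
    return layers

def remove_non_branch_connections(layers):
    # Keeping only intra-branch connections makes each branch independent.
    for layer in layers:
        layer["cross_branch_inputs"] = []

# Toy cost models: memory and MACs both grow as width^2 per layer.
mem = lambda ls: sum(l["width"] ** 2 for l in ls)
mac = lambda ls: sum(l["width"] ** 2 for l in ls)

layers = [{"width": 256, "cross_branch_inputs": ["other"]} for _ in range(3)]
branch = split(layers, mem, mac, division=2, dev_mem=60_000, dev_mac=60_000)
print(branch[0]["width"])   # widths halved until the branch fits the budgets
```

With these toy budgets, one halving (256 to 128) suffices; a tighter budget would trigger further halvings before the connections are severed.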

III-B Tailoring Hardware for Edge

Current Deep Learning Accelerators:

Systolic-array [65] based designs are the key hardware design for executing DNNs with advantages such as a high degree of concurrent processing through a dataflow compute arrays [lik:hou16, 66, 67, 68, wang:zho17, 51, 69]. TPUs are one of such successful accelerators with widespread usage in the industry. However, datacenter-Level accelerators, such as TPU, Eyeriss [67] and EIE [han2016eie]), are tailored for data-parallelism, which increase throughput by compromising end-to-end latency. In datacenters such tradeoffs are valid, but our target is reducing single-batch inferencing latency, specific for edge computing with limited requests. Moreover, these accelerators are dependent on large on-chip SRAMs and complex instruction-based execution. Such resources are limited in the edge due to energy constraints and cost considerations.

ETP Models Promote Simpler Edge Hardware: Besides the benefits discussed in Section III-A, ETP models encourage less complex hardware. Since each branch of an ETP model is divided with the constraints of the hardware in mind, the branches require fewer parameters, less computational complexity, and no communication between the modules for intermediate layers (hence, fewer constraints on fast data rates). As long as the original model is divided into common-size branches, as Algorithm 1 targets, the microarchitecture does not need to be fine-tuned separately for each model to achieve the best execution performance. We explore the effects of such simplifications on specialized hardware for edge devices. We tailor a TPU-like systolic-array architecture to study the benefits. We provide our insights in the rest of this section, and Section IV-C shares our implementation experience with an edge-level FPGA.

Fig. 7: Design procedure of ETP models; the fattening amount and the task error are hyperparameters or inputs from the user.

Key Insights to Simplify Architecture: While we borrow the main idea of using systolic arrays for inferencing from TPU, our design differs from that in the following key insights to make the design suitable for edge implementation.

  • Small Multiplier Array: Since the computational complexity of the branches of ETP models is within a specified range, we use a smaller multiplier array compared to TPU's.

  • Data-Driven Execution Model: With a data-driven execution model, as opposed to sophisticated mechanisms for executing instructions, we eliminate the overheads of decoding and executing instructions.

  • SRAM-Free Design: We utilize burst accesses to memory (DRAM) for streaming data through the systolic array with a memory-mapped design. Therefore, we remove costly SRAM buffers, common in prior work (e.g., TPU). The SRAM-free design is low cost and tailored to edge.

  • Separation of Multipliers and Adder Trees: Unlike the typical MAC-based systolic arrays in TPU (and Eyeriss & EIE), such separation in the interconnection facilitates partitioning and pipelining the operands.

  • Simple Peripheral Logic: We use a simple shift-and-increment logic for indexing data and directing data to/from systolic arrays.

  • Low Clock Frequency & Memory Bandwidth: We tailor the architecture for the edge by using low-bandwidth, off-the-shelf LPDDR2 memory, common in edge devices. Additionally, we use a very low clock frequency (i.e., 100 MHz).

III-C Microarchitecture Details

Fig. 8: Details of Tailored Hardware for Edge – (a) Microarchitecture overview, and (b) memory layout of input operands and their metadata.

Overview: Figure 8a illustrates the main microarchitectural components of our design, which comprise a weight-stationary systolic array [65] for implementing matrix-matrix multiplication. The following introduces our six design choices and explains how they align with the philosophy of processing independent splits and the goal of reducing single-batch latency. For instance, the structure of the multiplier array connected to the adder trees (which enables the flow of data in one direction rather than two), together with the simple stand-alone indexing logic, not only facilitates partitioning (e.g., Figure 8b) but also reduces latency from O(n) to O(log n). Besides, since the width of the systolic array defines the degree of memory parallelism, we chose a width of 64. This choice makes efficient use of memory-level parallelism and DRAM burst reads while utilizing the maximum number of connections to the memory.

(1) Systolic Array Cells: The systolic array cells are organized in a 32×64 array ❶. Each cell ❷ includes a multiplier with two integer operands, one stationary and the other streaming (R1). During computation, all of the multipliers are active at each cycle, working synchronously on streamed data. The R1 registers holding streaming data are connected in a column within the array such that, at each cycle, their contents shift one row down. To reduce the connections between the array and the main memory, only the first row of the systolic array is connected to the memory ❶. Moreover, each cell of the first row is connected to only one data-stream line ❷. Based on the type of an operand, streaming data is used either for initialization or for multiplication. The buffers storing stationary operands are connected similarly to the streaming registers. During initialization, stationary operands are poured into the connected buffers in a column to fill them, utilizing the connections between them.

(2) Stationary Operands: The stationary operands are often larger than the dimensions of the array. In such cases, we have to partition a multiplication into several small ones (as shown in Figure 8b), more than one of which may share a streaming operand but have distinct stationary operands. To avoid multiple loads of the stationary registers, we integrate a buffer for stationary operands at each multiplier (a negligible overhead). As a result, the design serves requests with lower latency. Moreover, since each branch of the model has several layers, integrating these buffers allows fast context switching without the overhead of reloading the stationary operands.

(3) Adder Trees: Each row of the array is connected to an adder tree ❸. The number of adder trees, which is the same as the depth of the systolic array, defines the number of output elements generated at each cycle. The adder trees, pipelined in five stages, reduce the results of the multiplications to a single integer, which then contributes to creating an output element. During multiplication, the outputs of the multipliers in a row are routed to the adder trees to be summed. To maximize fine-grained parallelism, we want the width of the 2D weight matrix to match the number of adder trees, or the depth of the systolic array. To enable flexibility and fast in-the-edge execution, we employ a narrow array with a depth of 32.
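A quick sketch of why a tree reduction beats a chained accumulator (pure Python; reducing 32 inputs takes log2(32) = 5 stages, matching the five pipeline stages above):

```python
def adder_tree(values):
    """Pairwise tree reduction: sums n values in ceil(log2(n)) stages,
    versus n - 1 sequential add cycles for a chained accumulator."""
    stages = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:              # an odd leftover passes straight through
            pairs.append(values[-1])
        values = pairs
        stages += 1
    return values[0], stages

products = list(range(32))               # stand-in for one row of multiplier outputs
total, depth = adder_tree(products)
print(total, depth)                      # 496 in 5 stages
```

Pipelining those 5 stages lets a new row of products enter the tree every cycle while earlier sums are still propagating.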

(4) Memory Mapping: To perform DNN computation, similar to prior work, we convert the computations of DNN layers to general matrix-matrix multiplication (GEMM) [62, 63]. Then, we create a memory layout, as shown in Figure 8b. To assist the smooth streaming of data from memory to the systolic array, we map data to sequential addresses in the physical memory. Figure 8b shows an example of a GEMM, the operands of which are stored in the memory. Since the depth of the systolic array is 32, we divide each operand into 32-element chunks and map them to consecutive addresses with their block index and type (i.e., for discerning the operands that reside in R1 registers or in buffers). The length of the streaming operands is also saved along with the operands. Such partitioning is done by a simple and low-overhead heuristic algorithm (as shown in Figure 8b), partitioning both matrices across their common dimension. As illustrated, the width of the systolic array defines the level of memory parallelism. Unlike prior work, the array in our design is directly connected to the memory. Since our design stores and reuses the intermediate results within the multipliers, we remove the large SRAM buffers found in similar prior work such as TPU, Eyeriss, and EIE.

(5) Simple Indexing Logic: During execution, for each element, the indexing logic ❹ generates the appropriate row and column indices of the element using the block index and the length that accompany the result. The row and column indices are later used by the memory interface to write the result to the corresponding physical locations in memory. By comparing the length and the index, the end of the operations in the current layer is detected. The end of the current layer signals the start of the activation and pooling functions ❺ for that layer. Since the stationary operands might not fit in the R1 registers, as shown in our example in Figure 8b, we save partial sums into the memory and later perform the final summation by reading the partial sums back. The reads from memory are interleaved between the main read and write operations with no extra overhead.

(6) Data-Driven Execution Model: Our design uses a data-driven execution model, in which data is pushed by the memory to the systolic array, and the adder trees are triggered by the arrival of data. In this approach, no instructions are used. Instead, the sequence of operand arrivals dictates the sequence of operations. To keep the right sequence of GEMM operations, we use a table that records the sequence of physical locations of the GEMM operands for all the layers. The content of the table is programmed by the host by breaking down the matrix operands of large GEMMs into many small GEMMs (see Memory Mapping). The table is then traversed row by row to stream the right operands, at the right times, into the systolic array. After loading one of the operands into the stationary buffers, we pass the remaining operand through the other input of the multipliers, the R1 registers. Once a result is ready, the content of the table is used to write it back to the appropriate physical location. Note that the operations of the next layer are fired only once all the GEMMs of the current layer are completed. In this way, the content of the table also guarantees the dependencies between sequential operations. Since we are performing inference, we overwrite the results of a layer after reading them once.
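The sequence table described above can be sketched minimally as follows; the field names and addresses are our own invention, for intuition only:

```python
# Minimal sketch of a data-driven sequence table: each row records where the
# operands of one small GEMM live; rows are consumed strictly in order, and
# the next layer fires only after every GEMM of the current layer completes.
from collections import namedtuple

Row = namedtuple("Row", "layer stationary_addr streaming_addr result_addr")

table = [
    Row(0, 0x000, 0x100, 0x200),
    Row(0, 0x020, 0x120, 0x220),
    Row(1, 0x200, 0x300, 0x400),   # layer 1 reads layer 0's results
]

def run(table):
    done_layers, current = [], 0
    for row in table:
        if row.layer != current:   # all GEMMs of `current` are finished
            done_layers.append(current)
            current = row.layer
        # ...stream row.stationary_addr / row.streaming_addr into the array,
        # then write the result back to row.result_addr
    done_layers.append(current)
    return done_layers

print(run(table))  # layers complete strictly in order
```

Because row order alone encodes both the operand schedule and the layer boundaries, no instruction fetch or decode logic is needed.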

IV Experimental Studies

This section shares our experimental results for ETP models, real-world experiments with RPi and PYNQ, edge FPGA implementation, and ASIC estimations. At the start of each subsection, the setup of related experiments is provided.

IV-A ETP Models

Training Specifications: We train all the models, including the original models, from scratch to conduct a fair comparison. Normalization [70] layers are included to enhance learning. The training uses an exponentially decaying learning rate (with 2 or 10 epochs per decay step), dropout, and L2 regularization with weight decay. We use the ADAM optimizer [71]. All biases are initialized to zero, and all weights are initialized from a zero-mean normal distribution. All of our models are trained until the loss flattens, or for at least 12 epochs. Test and accuracy measurements are done on at least 10% of each dataset, held out from training, to provide an unbiased evaluation of the final model. For ETP, the corresponding hyperparameters are set to 2 and 10%, respectively.

Datasets & Models: We use the following datasets: (1) MNIST [42], which contains 70k greyscale handwritten 28x28 images in 10 classes; (2) CIFAR10 [43], which contains 60k colored 32x32 images in 10 classes; (3) CIFAR100 [43], which contains 60k colored 32x32 images in 100 classes; (4) Flower102 [44], which contains 16,378 colored 224x224 images of flowers in 102 classes; and (5) ImageNet [45], which contains 1.33M colored 224x224 images in 1000 classes, with a total size of 140 GB. For each dataset, we use a representative model: LeNet [54], LeNet-FC [54], VGG-S [72], CifarNet [43], VGG16 [72], AlexNetv2 [64], and ResNet-50 [73]. We use proof-of-concept models to explore various design options and then use ImageNet models. In total, for brevity, we only report 50 instances of training results to show ETP extensibility using 5 datasets and 8 models. Our additional results (not reported in the paper) on ResNet-34, DenseNet [74], and DarkNet19 [75] show a similar trend.

Split-Only Models: For each split-only model, we split the original DNN model into 2, 4, and 8 branches, each of which has the same depth as the original model, but each layer has 1/2, 1/4, and 1/8 the width of the original model, respectively. The width of a fully-connected layer is defined as the number of output elements, and of a convolution layer as the number of output channels (i.e., filters). The rest of the parameters are the same as in the original DNN model, except in cases where splitting renders a layer useless (e.g., a kernel size of 4x4 over an input size of 3x3). Table II lists our models' descriptions and training results. Figure 9a illustrates the accuracy difference of our models, shown in Table II. As shown, the maximum accuracy drop is around 5% for CifarNet. Note that this accuracy drop occurs when we reduce the parameter size of the model extensively. Figures 9b and c show the reduction in the number of parameters and in computation compared with the original DNN model; as seen, each doubling of the number of branches roughly halves both (Table II). This is because each convolution and fully-connected layer in the split version creates fewer outputs; therefore, the next layer requires fewer parameters. We restore the accuracy of ETP models with a slight increase in the size of each branch in the next section.
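The parameter reduction in Figure 9b follows directly from layer arithmetic; the following back-of-the-envelope sketch (our own, with made-up channel counts) shows why n branches together hold about 1/n of the original weights:

```python
# A conv layer with c_in input and c_out output channels and a k x k kernel
# has c_in * c_out * k * k weights. Splitting into n branches gives each
# branch c_in/n and c_out/n channels, so the quadratic term shrinks by n^2
# per branch, and the n branches together hold 1/n of the original weights.
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

def split_params(c_in, c_out, k, n_branches):
    per_branch = conv_params(c_in // n_branches, c_out // n_branches, k)
    return per_branch * n_branches

orig = conv_params(64, 128, 3)            # 73,728 weights
for n in (2, 4, 8):
    print(n, orig / split_params(64, 128, 3, n))  # -> 2.0, 4.0, 8.0
```

This matches the trend in Table II, where each doubling of the branch count roughly halves the parameter and MAC counts.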


TABLE II: Results of split-only ETP models.
Model Name Dataset Layers Top-1 Accuracy # Param # MAC Opr.
LeNet-FC* MNIST 3fc 97.95 266.6k 266.2k
LeNet MNIST 2fc-3c-2p 98.76 61.7k 61.5k
LeNet-split2 MNIST 3fc-6c-4p 98.86 31.5k 30.5k
LeNet-split4 MNIST 5fc-12c-8p 98.93 16.1k 16.0k
LeNet-split8 MNIST 9fc-24c-16p 98.81 8.8k 8.5k
CifarNet* Cifar10 2fc-2c-2p-2n-1d 80.72 797.97k 14.79M
CifarNet Cifar100 2fc-2c-2p-2n-1d 52.87 815.34k 14.81M
CifarNet-split2 Cifar100 5fc-4c-4p-4n-2d 51.22 410.48k 9.33M
CifarNet-split4 Cifar100 9fc-8c-8p-8n-4d 48.48 208.05k 6.59M
CifarNet-split8 Cifar100 17fc-16c-16p-16n-8d 47.98 106.85k 5.23M
VGG-S* Cifar100 3fc-5c-2p-1n-2d 50.33 76.15M 154.09M
VGG-S Flower102 3fc-5c-3p-1n-2d 88.14 60.79M 1.85G
VGG-S-split2 Flower102 5fc-10c-6p-2n-4d 89.31 30.50M 1.01G
VGG-S-split4 Flower102 9fc-20c-12p-4n-8d 87.55 15.26M 591.65M
VGG-S-split8 Flower102 17fc-40c-24p-8n-16d 85.66 7.64M 382.51M
ResNet-18 ImageNet 18c-2p-17n 70.68 11.69M 1.82G
ResNet-18-split2 ImageNet 35c-3p-34n 69.85 6.11M 0.98G
ResNet-18-split4 ImageNet 69c-5p-68n 68.07 3.32M 0.55G
ResNet-18-split8 ImageNet 137c-9p-136n 66.76 1.93M 0.34G
  • fc: fully-connected, c: convolution, p: pooling, n: normalization, and d: dropout.

  • Detailed results are removed for brevity; refer to Figure 9. The results follow the same trend.

Fig. 9: Split-Only Models – Common visual models: (a) Accuracy difference, (b) reduction in the number of parameters, and (c) reduction in the number of MAC operations in comparison with the original one (Table II).

Split-Fattened Models: The fewer connections in split-only models may cause accuracy loss, and accuracy is a defining factor in several applications. Thus, we provide a remedy to restore the accuracy of split-only models: by fattening (i.e., adding more parameters to) each branch, we create larger layers in the split-only models. To do so, for each layer (excluding the classification layer) in every branch, we increase its output size (or width) by a fraction; fattening by 20% means the output size of each layer is increased 1.2x. The output size of a fully-connected layer is the number of output elements, and of a convolution layer the number of filters. We fatten every branch by 10%, 20%, 30%, and 40%. Our experiments focus on split8 models, which have the highest accuracy drops in CifarNet and VGG-S. Figure 10 depicts a summary of these models. As seen, the 40% split-fattened models have higher accuracy than the original models while having fewer parameters and MAC operations. On average (for the 30% and 40% models), with 4.61x–3.81x fewer parameters and 2.95x–2.5x fewer MAC operations, split-fattened models achieve similar accuracy, while they jointly optimize the memory, computation, and communication loads for distributed edge computation.
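The fattening rule can be written down in a few lines; this sketch is ours, and the layer widths are made up for illustration:

```python
# Fattening by fraction f scales every non-classification layer's output
# width by (1 + f); the final classification layer keeps its width, since
# the number of classes is fixed.
import math

def fatten(widths, fraction):
    """Scale every width except the final (classification) layer."""
    return [math.ceil(w * (1 + fraction)) for w in widths[:-1]] + [widths[-1]]

branch = [16, 32, 64, 100]        # last entry: classification layer
print(fatten(branch, 0.40))       # -> [23, 45, 90, 100]
```

Since each branch is fattened independently, the branches stay communication-free; fattening only trades back some of the parameter reduction for accuracy.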

Fig. 10: Split-Fattened Models – Common visual models (a) Accuracy difference, (b) reduction in the number of parameters, and (c) reduction in the number of MAC operations in comparison with the original one (Table II).

Large-Scale Models: Table III illustrates the results of ImageNet-based models. For the sake of brevity, we only show split8 and one fattened model. As shown, f40 models restore the accuracy to within 3% of the original model. The tradeoff for the 3% accuracy loss is about 4x fewer parameters, 4x fewer computations, and 8x less communication load (vs. model parallelism). Figure 11 presents a comparative analysis of the communication load between distributed original models with model parallelism and distributed ETP models. Since ETP models avoid communication between their branches, the communication load is reduced significantly. In short, as seen from the layer descriptions in Table II, split models are more complex in terms of the number of layers and neuron connections than the original models. Nevertheless, such complexity enables us to jointly optimize ETP models for edge computing.
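For intuition on the communication gap, here is a rough counting model; it is ours, not necessarily the exact "pairs of connections" metric of Figure 11, and the constants are illustrative:

```python
# With model parallelism, every device holding a slice of layer i must
# exchange activations with every device holding a slice of layer i+1, at
# each of the L-1 layer boundaries. ETP branches exchange nothing until the
# final merge, so only one result per branch crosses the network.
def model_parallel_pairs(devices, layers):
    return (layers - 1) * devices * devices   # all-to-all per boundary

def etp_pairs(devices):
    return devices                            # one output per branch

d, L = 8, 10
print(model_parallel_pairs(d, L), etp_pairs(d))
```

The model-parallel count grows with both depth and device count, while the ETP count grows only linearly with the number of branches, which is why the gap in Figure 11 widens for deeper models and larger systems.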

Fig. 11: Communication Reduction – Communication load reduction with ETP models compared to model parallelism (required pairs of connections).


TABLE III: Results of ImageNet-based ETP models.
Model Name Dataset Top-1 Top-5 # # MAC
Acc. Acc. Param. MAC Opr.
AlexNet ImageNet 57.02 80.32 50.3M 678.97M
AlexNet-split8 ImageNet 49.03 73.10 6.32M 145.37M
AlexNet-split8-f40 ImageNet 54.68 77.06 12.11M 244M
VGG16 ImageNet 70.48 90.02 138.36M 15.47G
VGG16-split8 ImageNet 58.67 81.54 7.64M 2.01G
VGG16-split8-f40 ImageNet 67.24 89.23 33.78M 3.87G
ResNet-50 ImageNet 75.4 93.1 22.80M 4.87G
ResNet-split8 ImageNet 61.79 81.22 5.42M 0.88G
ResNet-split8-f40 ImageNet 72.12 92.19 8.60M 1.18G
  • For [model_name]-f[number], the number represents the percentage of fattening.

Fig. 12: RPi Latency – Latency per image of model-parallelism, SplitNets [76], and ETP models on RPi (number in parenthesis is #devices for the experiment).

IV-B Real-World Data

RPi Experiments Setup: To study the benefits of ETP models versus model-parallelism-only methods, we deploy several models on a system of interconnected Raspberry Pi 3s (RPis), the specifications of which are listed in Table IV. On each RPi, running the Ubuntu 16.04 operating system, we use Keras 2.1 with the TensorFlow [78] backend. We use Apache Avro [79], a remote procedure call (RPC) and data serialization framework, for communication between RPis. We measure power using a USB digital multimeter [80]. A local WiFi network with a measured bandwidth of 62.24 Mbps and a measured client-to-client latency of 8.83 ms for 64 B is used. All the real-world experiments are full-system measurements with all overheads included, without any simulations/estimations.

Raspberry Pi 3B+
CPU 1.2 GHz Quad Core ARM Cortex-A53
Memory 1 GB LPDDR2 SDRAM @ 933Mb/s/pin
GPU No GPGPU Capability
Die Size @ 28 nm
Edge FPGA (Zynq Artix 7 XC7Z020)
Utilization DSP48E FF LUT
#Unit 96 5427 2343
% 44 5 4
Static Power 0.121 W
Dynamic Power Signals: 0.009 W    Logic: 0.003 W
ASIC Estimations
Systolic Array 2016 Adder + 2048 Multiplier @ 100 MHz
Memory 1 GB LPDDR2 SDRAM @ 933Mb/s/pin
Die Size @ 28 nm
TABLE IV: Specification of RPi [81], Edge FPGA, and ASIC.

RPi Performance & Energy: Figure 12 presents the latency of inference per image on RPis. On a single device, AlexNet has a 2.8-second latency, while VGG16 has a 9.4-second latency. By deploying model-parallelism variants of the models on 4 and 8 RPis, we achieve a best-case latency of 0.42 s, a 6.6x improvement, for AlexNet. But for VGG16 on 4 RPis, we observe a slowdown, which is caused by the high communication latency. The ETP variants split4 and split8 reach latencies as low as 115 ms and 400 ms per image for AlexNet and VGG16, respectively. This is because ETP models are lightweight and parallelizable and have low communication. Figure 13 depicts the measured energy per inference for the RPi implementations. To compare with previous related work, SplitNets [76], Figure 12 also presents the performance of SplitNet models for AlexNet with different configurations. As seen, their performance is worse than that of ETP models. This is because SplitNets create more merging/synchronization points with their tree-structured model design: the resulting model introduces exponentially more merging/synchronization points as depth increases, and it does not split all the layers. Finally, SplitNets perform parallelization based on dataset semantics, which means every dataset and model needs to be manually split (refer to Section VI).

Fig. 13: RPi Energy – Energy per inference of model-parallelism, and ETP models (number in parenthesis is #devices for the experiment).

TVM Experiments on PYNQ: As a real-world example of an edge FPGA implementation, we use TVM [82] on the PYNQ [83] board. PYNQ is designed for embedded applications. We use the TVM VTA stack on the PYNQ as the architecture (RISC-style instructions) and only change the models (ResNet-18 vs. ETP ResNet-18 split2, with less than a 1% accuracy drop; see Table II). This way, we can measure the benefits of ETP models without relying on any specially tailored hardware. Our performance results cover the entire system pipeline, from a live camera feed to the prediction output, on two boards versus one board. Figure 14a depicts a 2.7x speedup in latency, including all communication and system overheads, network latency, and jitter. This is because ETP models are parallelized over two devices and, in total, have lower computation and memory footprints. The measured reduction in memory footprint is shown in Figure 14b.

Fig. 14: TVM Experiments on PYNQ – Statistics for the entire recognition pipeline, from live camera to prediction: (a) Latency per image, (b) weight memory footprint per device (number in parenthesis is #devices).

IV-C Edge-Tailored Hardware on Edge FPGA

FPGA Experiments Setup: The PYNQ FPGA is an SoC with a dual-core Cortex-A9 processor at 650 MHz and 512 MB of DDR3 memory. The FPGA on the PYNQ board is from the Artix-7 family (Zynq series) with 13,300 logic slices, 220 DSP slices, and 630 KB of BRAM. Communication for multiple devices is estimated with the network described in Section IV-B. We implement our tailored microarchitecture using Xilinx Vivado HLS and verify the functionality of our implementation using regression tests. We use the relevant #pragma directives as hints to describe our desired microarchitectures in C++. We synthesize the designs on a Zynq XC7Z020 FPGA and report post-implementation (i.e., place & route) performance numbers and resource utilizations. The inputs and outputs of our design are transferred through the AXI stream interface. The clock frequency is set to 100 MHz.

Fig. 15: Edge FPGA Latency & Speedup – (a) Latency per image, (b) speedup over one device (number in parenthesis is #devices).
Fig. 16: Edge-FPGA Speedup with Quantization & Pruning – Speedup gained by applying lossless (3.13 fixed-point) quantization and structured pruning.

FPGA Performance: Figure 15 shows the experimental results for our edge-tailored hardware. The latency per image is depicted in Figure 15a, with improvements in communication overhead versus model-parallelism methods (86% and 60% for split8 and split4, respectively). Depending on the model, the inference latency per image on a single device is between 4–29 ms, a 221x–325x speedup compared to the RPi results for AlexNet and VGG16. Our designed ETP models achieve acceptable performance for edge computing, i.e., tens of inferences per second. As observed, the accuracy loss of our split-only models can be easily restored by the fast split-fattened f40 models with a negligible performance overhead (a maximum of 20 ms). Figure 15b illustrates the speedup numbers over one device. The ideal-linear-speedup line shows ideal scaling with more available devices. As shown, we achieve superlinear speedups. An important parameter in scaling is how the overheads scale: the superlinear speedup stems from the dramatic reduction of communication overhead as parallelism increases. In traditional data and model parallelism, this overhead increases, which causes sublinear speedup. Figure 17 compares the latency per image for ETP and model parallelism. On average, ETP models are 3.76x, 8.89x, and 7.17x faster than their model-parallelism counterparts for AlexNet, VGG16, and ResNet-50 (4 and 8 devices), respectively. ETP achieves maximum and average speedups of 56x and 7x, compared to the originals (Figure 16, base bars).
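The superlinear-versus-sublinear behavior can be illustrated with a toy latency model; the constants below are made up and do not come from the paper:

```python
# Toy model: total latency = compute/n + communication(n). Under model
# parallelism the communication term grows with the device count n; under
# ETP it shrinks, because the branches are independent. A shrinking overhead
# added to a shrinking compute term yields superlinear speedup.
def latency(compute, n, comm):
    return compute / n + comm(n)

compute = 80.0                                        # ms on one device
mp  = [latency(compute, n, lambda n: 2.0 * n)     for n in (1, 2, 4, 8)]
etp = [latency(compute, n, lambda n: 2.0 / n**2)  for n in (1, 2, 4, 8)]
speedup = [etp[0] / t for t in etp]
print([round(s, 2) for s in speedup])  # exceeds 1, 2, 4, 8: superlinear
```

In the same model, the growing overhead term makes the model-parallel speedup fall well short of linear, matching the trends in Figure 15b.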

Fig. 17: Edge-FPGA ETP vs. Model Parallelism – Latency per image for model-parallelism and ETP models.

Quantization & Pruning: As mentioned in Sections V and I, techniques that reduce the footprint of DNNs can be applied to each individual ETP branch. Basically, the target output for each ETP branch during such optimizations is its pre-final activations. We study the benefits of lossless quantization and structured pruning on top of our ETP models. Based on our experiments, with 3.13 (integer.fraction) fixed-point quantization, our models do not lose accuracy. Similarly, applying structured pruning [84], from which systolic arrays benefit, reduces the size of the parameters by 40%–50% per convolution layer without an accuracy drop. Other pruning algorithms increase the sparsity of the data, which is not necessarily beneficial for systolic arrays. Figure 16 depicts the speedup gained from these techniques normalized to the baseline implementation of each model, whose execution performance is shown in Figure 15a. Quantization and pruning by themselves improve the performance of the original models by 1.96x and 2.2x, respectively, and by 4.31x when applied together. When quantization and pruning are combined with ETP, the overall speedups become 14.41x and 16.31x, respectively. Compared to the original models, ETP + quantization & pruning achieves up to a 244x speedup (VGG16-split8) and an average of 33x (across all models and variants).
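A 3.13 fixed-point format means 3 integer bits and 13 fraction bits in a signed 16-bit word; a minimal sketch of such a quantizer (our own helper names, not the paper's code):

```python
# Q3.13 fixed point: values are stored as signed 16-bit integers scaled by
# 2^13, giving a resolution of 1/8192 and a range of roughly [-4, 4).
FRAC_BITS = 13
SCALE = 1 << FRAC_BITS                     # 8192

def to_q3_13(x):
    q = round(x * SCALE)
    lo, hi = -(1 << 15), (1 << 15) - 1     # saturate to signed 16-bit range
    return max(lo, min(hi, q))

def from_q3_13(q):
    return q / SCALE

w = 0.3712
q = to_q3_13(w)
print(q, from_q3_13(q))  # quantization error is at most 1/(2*SCALE)
```

Because typical trained weights fall well inside the [-4, 4) range, the rounding error stays below the half-step bound, which is consistent with the "lossless" behavior reported above.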

IV-D ASIC Estimations

We configure and model a 32x64 systolic array connected to an LPDDR2 memory with a data rate of 933 Mb/s/pin @ 466 MHz [85], which gives a bandwidth of 3.7 GB/s. Other packaging options with higher memory bandwidths are also feasible; however, seeking a fair comparison with Raspberry Pi 3s, we choose this memory technology. The data-reuse rate of our design is 32 OPs/Byte, which leads to a peak throughput of 118.4 GOPs/s. In the 28 nm TSMC technology node, our estimated die size is 11.95 mm2, 16x smaller than the quad-core ARM of an RPi. To estimate power consumption, we use the Kitfox 1.1 library [86], which is based on several public-domain power models including McPAT [87], at the 28 nm TSMC technology node. Figure 18 shows the energy per inference, an important metric for energy-constrained devices, for our design. Adding the necessary peripherals, including a low-power ARM core for managing access to memory and the network, on average adds 20 mJ per inference. Therefore, compared with Figure 13b, we observe a considerable energy-saving potential. Our chip consumes 76 mW, including the necessary peripherals. As comparison points, Eyeriss [67] and EIE [han2016eie] consume 250 mW and 590 mW, respectively. In other words, with our edge-tailored design (Section III-B), we trade the large SRAM buffers of prior systolic arrays for energy efficiency.
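The quoted peak throughput is a simple roofline-style, bandwidth-limited bound; the arithmetic is:

```python
# Peak throughput = memory bandwidth x data-reuse rate:
# 3.7 GB/s * 32 OPs/Byte = 118.4 GOPs/s.
bandwidth_GBps = 3.7
reuse_ops_per_byte = 32
peak_GOPs = bandwidth_GBps * reuse_ops_per_byte
print(peak_GOPs)  # 118.4
```

Any higher compute capacity in the array would go unused, since the 3.7 GB/s LPDDR2 link cannot feed it; this is why the design is sized to the memory technology rather than the other way around.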

Fig. 18: ASIC Energy Estimations – Energy estimation per inference.

V Related Work

We review related techniques that reduce the high computational and memory demands of DNNs [28, 29], some of which are specific to resource-constrained devices. We then discuss studies on distributing the computation of DNNs and, finally, review efforts on DNN hardware accelerators.

Techniques without Changing the Model Architecture: Several techniques have been developed to reduce the computation and memory footprint of DNNs without changing the network architecture. In weight pruning [88, 27, 89, 90, 91], the close-to-zero weights are pruned and the remaining weights are retrained; it has been shown that moderate pruning does not affect accuracy [27]. In quantization and low-precision inference [92, 41, 58, 93, 59], the representations of numbers are changed, which results in simpler calculations. Several methods have also been proposed for resource partitioning [94, 95] and for binarizing the weights [96, 97, 98], although binarizing weights hurts accuracy. Several of the aforementioned techniques are orthogonal to ETP models; in fact, they can be applied to each branch to further reduce the computational and memory cost (Section IV-C).

Techniques that Change the Model Architecture: With the prevalence of IoT and edge devices, specific frameworks such as Microsoft's ELL library [61] (see Figure 2) and TensorFlow Lite [99] have been developed by industry. Several proposals have also developed mobile-specific models [100, 60, 101, 74, 102]. The common approach is to handcraft more efficient operations or models. The objective is to reduce the number of parameters [60], create efficient operations that minimize computation density [100], or use resource-efficient connections [102]. Unlike ETP models, all these models have a single chain of dependency [103] that prevents efficient parallelism. Moreover, several of the models trade off state-of-the-art accuracy for efficiency [102]. We survey SplitNets [76] and SqueezeNet [60] in Figure 12. Recently, with the growing interest in automating the design process [104, 105, 106, 103], learning new networks for mobiles by integrating the constraints of mobile platforms (i.e., latency) has also gained attention. These attempts are still limited to single-device execution, whereas our paper targets designing models for distributed edge systems. In summary, these related works (1) have a high design cost, i.e., they target only one specific model and dataset without extensibility; (2) target single mobile platforms; and (3) do not consider inter-layer parallelism and communication challenges.

Distributing DNN Inference Computations: With large DNN models, distributing a single model has gained the attention of researchers [38, 107, 39, 108, 5, 14]. Usually, the distribution is done in the high-performance computing domain with different goals in mind. In the edge and resource-constrained-devices domain, Neurosurgeon [39] dynamically partitions a DNN model between a single edge device and the cloud. DDNN [107] partitions the model between edge devices and the cloud but uses data parallelism. Hadidi et al. [5, 14] investigate the distribution on robots with model-parallelism methods. In fact, in their results, we observe the effect of the communication barrier on distribution through the diminishing returns in performance with a large number of devices. ETP models go beyond model-parallelism methods and enable an efficient distribution that is not examined in the above studies.

Edge-Targeted and Systolic-Based Hardware: Several studies have explored in/near-the-edge DNN computation without proposing new hardware [39, 109, 110, 27] by training new models, proposing collaboration techniques with the cloud, or applying several device-specific and model-specific techniques. A state-of-the-art systolic-based DNN accelerator is TPUv2 [69], which provides a peak of 180 TFLOPS by employing four dual-core chips, each connected to an 8 GB HBM package at 300 GB/s. Many other recent deep-learning accelerators utilize systolic-array concepts [66, 67, 68, 51, 69, 111, 112], increasing inference performance by utilizing sparsity, reducing memory accesses by exploiting access patterns, or employing weight-stationary architectures. Several studies also target FPGA/ASIC implementations for DNNs [113, 114, 115, 116, 117, 118]. These studies investigate the execution of the entire model on a single device with no resource constraints, whereas our focus is enabling the distribution of inference over several devices.

VI Discussions

Intuition Behind ETP: We conjecture that ETP models achieve good performance because (1) independent branches can learn complex, non-overlapping features independently within a small search space, whereas original models need to create the same complex features in a higher-dimensional feature search space; we observe that each branch eventually learns an almost disjoint feature representation. And (2), in split models, compared to the original models, gradient-descent updates reach the early layers more efficiently due to the lower number of parameters along their route.

SplitNets/SqueezeNet: SplitNets [76] shares the same philosophy of splitting models; however, SplitNet splits the model based on dataset semantics, which applies only to the last few layers. As we compared in Section IV by implementing their paper's setup for AlexNet, SplitNets achieves low performance. The main disadvantages are two-fold: first, the architecture cannot be parallelized for shallow layers, which causes longer latency, especially in deeper models like ResNet-50 and AlexNet; second, the parameter reduction is limited to shallow models, such as ResNet-18. SqueezeNet [60] achieves an accuracy similar to that of AlexNet with fewer parameters by using new compute-heavy Fire modules; in other words, SqueezeNet trades off parameters for computation. In fact, it requires 860M MAC operations, whereas our distributed AlexNet requires only 240M MAC operations (see Figure 12). Also, reducing the number of parameters does not necessarily correlate with reducing the memory footprint during inference: for SqueezeNet, we observe a 12x increase in the number of activations (12.58M in SqueezeNet vs. 1.39M in AlexNet).

Skip/Residual Connections: As shown in Section IV-A, our procedure also applies to more complex models with residual and skip connections: simply put, each branch has similar connections, but with a smaller depth. Other mobile-specific models of a similar fashion, such as ShuffleNet [101], trade off accuracy for faster execution, besides having a high design cost per model and dataset pair. As Figures 2 and 12 show, for mobile-specific models, the latency on edge devices, which have fewer compute resources than mobile platforms, is still high. In short, we face the single-device Pareto frontier (Section II-B).

Alleviating Large Memory Footprints: Sometimes large memory footprints are necessary and access to the next levels of the storage system is enforced. In our design (Section III-C), such accesses do not cause slowdown because data is stored in sequential addresses (i.e., streaming [119]), and we overlap data transfer and computations for independent elements. Thus, the execution model only needs a basic memory technology that simultaneously allows reading/writing from/to two non-overlapping memory locations.

Memory Layout Preprocessing: Our simple algorithm for changing the storage format is described in Section III-C(4). The host preprocessing for reordering the data can therefore be done while writing the data to memory, in a single pass.

System-Level Choices: ETP works in conjunction with the other technologies available today. ETP does not replace these technologies, but rather enables the exploitation of local edge devices. In several cases, relying on cloud-based offloading for accuracy-critical tasks is necessary, whereas in several others it is not; in such cases, ETP models provide an alternative solution. Moreover, if a node fails, various conventional redundancy techniques (such as coded computation [120]) can be applied for recovery.

VII Conclusions

We proposed edge-tailored perception (ETP) models, designed for efficient in-the-edge distribution. ETP models optimize communication while reducing memory and computation by utilizing several narrow, independent branches. We presented our results on the accuracy of ETP models. We tailored a TPU-like systolic architecture for edge computing, shared our insights, and provided implementation results on an edge-based FPGA. Additionally, we conducted experiments on a system of ten Raspberry Pis and two PYNQ boards.


  • [1] Alessandro Giusti, Jérôme Guzzi, Dan C Cireşan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.
  • [2] Mark Pfeiffer, Michael Schaeuble, Juan Nieto, Roland Siegwart, and Cesar Cadena. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In 2017 ieee international conference on robotics and automation (icra), pages 1527–1533. IEEE, 2017.
  • [3] Peter Corcoran and Soumya Kanti Datta. Mobile-edge computing and the internet of things for consumers: Extending cloud computing and services to the edge of the network. IEEE Consumer Electronics Magazine, 5(4):73–74, 2016.
  • [4] Manuela Veloso, Joydeep Biswas, Brian Coltin, Stephanie Rosenthal, Tom Kollar, Cetin Mericli, Mehdi Samadi, Susana Brandao, and Rodrigo Ventura. Cobots: Collaborative robots servicing multi-floor buildings. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5446–5447. IEEE, 2012.
  • [5] Ramyad Hadidi, Jiashen Cao, Matthew Woodward, Michael S Ryoo, and Hyesoon Kim. Distributed perception by collaborative robots. IEEE Robotics and Automation Letters, 3(4):3709–3716, 2018.
  • [6] Arti Singh, Baskar Ganapathysubramanian, Asheesh Kumar Singh, and Soumik Sarkar. Machine learning for high-throughput stress phenotyping in plants. Trends in plant science, 21(2):110–124, 2016.
  • [7] Huimin Lu, Yujie Li, Shenglin Mu, Dong Wang, Hyoungseop Kim, and Seiichi Serikawa. Motor anomaly detection for unmanned aerial vehicles using reinforcement learning. IEEE Internet of Things Journal, 5(4):2315–2322, 2018.
  • [8] Zhangjie Fu, Yuanhang Mao, Daojing He, Jingnan Yu, and Guowu Xie. Secure multi-uav collaborative task allocation. IEEE Access, 7:35579–35587, 2019.
  • [9] Nader Mohamed, Jameela Al-Jaroodi, Imad Jawhar, and Sanja Lazarova-Molnar. A service-oriented middleware for building collaborative uavs. Journal of Intelligent & Robotic Systems, 74(1-2):309–321, 2014.
  • [10] Shuochao Yao, Yiran Zhao, Aston Zhang, Shaohan Hu, Huajie Shao, Chao Zhang, Lu Su, and Tarek Abdelzaher. Deep learning for the internet of things. Computer, 51(5):32–41, 2018.
  • [11] Omer Berat Sezer, Erdogan Dogdu, and Ahmet Murat Ozbayoglu. Context-aware computing, learning, and big data in internet of things: a survey. IEEE Internet of Things Journal, 5(1):1–27, 2018.
  • [12] He Li, Kaoru Ota, and Mianxiong Dong. Learning iot in edge: Deep learning for the internet of things with edge computing. IEEE Network, 32(1):96–101, 2018.
  • [13] Tuyen X Tran, Abolfazl Hajisami, Parul Pandey, and Dario Pompili. Collaborative mobile edge computing in 5g networks: New paradigms, scenarios, and challenges. IEEE Communications Magazine, 55(4):54–61, 2017.
  • [14] Ramyad Hadidi, Jiashen Cao, Michael S Ryoo, and Hyesoon Kim. Towards collaborative inferencing of deep neural networks on internet of things devices. IEEE Internet of Things Journal, 2020.
  • [15] Luigi Alfredo Grieco, Alessandro Rizzo, Simona Colucci, Sabrina Sicari, Giuseppe Piro, Donato Di Paola, and Gennaro Boggia. Iot-aided robotics applications: Technological implications, target domains and open issues. Computer Communications, 54:32–47, 2014.
  • [16] Milan Erdelj, Michał Król, and Enrico Natalizio. Wireless sensor networks and multi-uav systems for natural disaster management. Computer Networks, 124:72–86, 2017.
  • [17] Markus Quaritsch, Emil Stojanovski, Christian Bettstetter, Gerhard Friedrich, Hermann Hellwagner, Bernhard Rinner, Michael Hofbaur, and Mubarak Shah. Collaborative microdrones: applications and research challenges. In Proceedings of the 2nd International Conference on Autonomic Computing and Communication Systems, pages 1–7, 2008.
  • [18] Nathan Michael, Shaojie Shen, Kartik Mohta, Vijay Kumar, Keiji Nagatani, Yoshito Okada, Seiga Kiribayashi, Kazuki Otake, Kazuya Yoshida, Kazunori Ohno, et al. Collaborative mapping of an earthquake damaged building via ground and aerial robots. In Field and service robotics, pages 33–47. Springer, 2014.
  • [19] Avital Bechar and Clément Vigneault. Agricultural robots for field operations: Concepts and components. Biosystems Engineering, 149:94–111, 2016.
  • [20] H Anil, KS Nikhil, V Chaitra, and BS Guru Sharan. Revolutionizing farming using swarm robotics. In 2015 6th International Conference on Intelligent Systems, Modelling and Simulation, pages 141–147. IEEE, 2015.
  • [21] Y Baudoin and Maki K Habib. Using robots in hazardous environments: Landmine detection, de-mining and other applications. Elsevier, 2010.
  • [22] Marcos Dias de Assuncao, Alexandre da Silva Veith, and Rajkumar Buyya. Distributed data stream processing and edge computing: A survey on resource elasticity and future directions. Journal of Network and Computer Applications, 103:1–17, 2018.
  • [23] Stuart Golodetz, Tommaso Cavallari, Nicholas A Lord, Victor A Prisacariu, David W Murray, and Philip HS Torr. Collaborative large-scale dense 3d reconstruction with online inter-agent pose optimisation. IEEE transactions on visualization and computer graphics, 24(11):2895–2905, 2018.
  • [24] Shancang Li, Li Da Xu, and Shanshan Zhao. The internet of things: a survey. Information Systems Frontiers, 17(2):243–259, 2015.
  • [25] F Biscotti, J Skorupa, R Contu, et al. The impact of the internet of things on data centers. Gartner Research, 18, 2014.
  • [26] In Lee and Kyoochun Lee. The internet of things (iot): Applications, investments, and challenges for enterprises. Business Horizons, 58(4):431–440, 2015.
  • [27] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations. ACM, 2016.
  • [28] Ramyad Hadidi, Jiashen Cao, Yilun Xie, Bahar Asgari, Tushar Krishna, and Hyesoon Kim. Characterizing the deployment of deep neural networks on commercial edge devices. In Proceedings of IEEE International Symposium on Workload Characterization, 2019.
  • [29] Matthew L Merck, Bingyao Wang, Lixing Liu, Chunjun Jia, Arthur Siqueira, Qiusen Huang, Abhijeet Saraha, Dongsuk Lim, Jiashen Cao, Ramyad Hadidi, et al. Characterizing the execution of deep neural networks on collaborative robots and edge devices. In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), pages 1–6. 2019.
  • [30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [31] Microsoft Research. Turing-nlg: A 17-billion-parameter language model by microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, 2020.
  • [32] Binita Gupta. Discovering cloud-based services for iot devices in an iot network associated with a user, June 4 2015. US Patent App. 14/550,595.
  • [33] Hui Li and Xiaojiang Xing. Internet of things service architecture and method for realizing internet of things service, March 17 2015. US Patent 8,984,113.
  • [34] Gartner, Inc. Gartner says 6.4 billion connected "things" will be in use in 2016, up 30 percent from 2015. https://www.gartner.com/newsroom/id/3165317, 2015. [Online; accessed 04/01/19].
  • [35] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of things (iot): A vision, architectural elements, and future directions. Future generation computer systems, 29(7):1645–1660, 2013.
  • [36] Ben Zhang, Nitesh Mor, John Kolb, Douglas S Chan, Ken Lutz, Eric Allman, John Wawrzynek, Edward Lee, and John Kubiatowicz. The cloud is not enough: Saving iot from the cloud. In 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15), 2015.
  • [37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In 26th Annual Conference on Neural Information Processing Systems (NIPS), pages 1097–1105. ACM, 2012.
  • [38] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In NIPS’12, pages 1223–1231. ACM, 2012.
  • [39] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 615–629. ACM, 2017.
  • [40] Mahadev Satyanarayanan. The emergence of edge computing. Computer, 50(1):30–39, 2017.
  • [41] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
  • [42] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [43] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [44] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
  • [45] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • [46] Vladimir Vujović and Mirjana Maksimović. Raspberry pi as a sensor web node for home automation. Computers & Electrical Engineering, 44:153–171, 2015.
  • [47] Richard Grimmett. Raspberry Pi robotics projects. Packt Publishing Ltd, 2015.
  • [48] Alan G Millard, Russell Joyce, James A Hilder, Cristian Fleşeriu, Leonard Newbrook, Wei Li, Liam J McDaid, and David M Halliday. The pi-puck extension board: a raspberry pi interface for the e-puck robot platform. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 741–748. IEEE, 2017.
  • [49] Isaiah Brand, Josh Roy, Aaron Ray, John Oberlin, and Stefanie Tellex. Pidrone: An autonomous educational drone using raspberry pi and python. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–7. IEEE, 2018.
  • [50] Sean Wilson, Ruben Gameros, Michael Sheely, Matthew Lin, Kathryn Dover, Robert Gevorkyan, Matt Haberland, Andrea Bertozzi, and Spring Berman. Pheeno, a versatile swarm robotic research and education platform. IEEE Robotics and Automation Letters, 1(2):884–891, 2016.
  • [51] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
  • [52] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, et al. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. In ISCA’17, pages 13–26. ACM, 2017.
  • [53] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 18. IEEE Press, 2016.
  • [54] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [55] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [56] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.
  • [57] Ofer Dekel (Microsoft Research). Compiling ai for the edge. SysML 2019 Keynote, 2019.
  • [58] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. In Proceeding Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4. ACM, 2011.
  • [59] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
  • [60] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
  • [61] Microsoft. Embedded learning library (ell). https://microsoft.github.io/ELL/, 2017. [Online; accessed 04/01/19].
  • [62] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. Cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
  • [63] Stefan Hadjis, Firas Abuzaid, Ce Zhang, and Christopher Ré. Caffe con troll: Shallow ideas to speed up deep learning. In Proceedings of the Fourth Workshop on Data analytics in the Cloud, page 2. ACM, 2015.
  • [64] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
  • [65] Hsiang-Tsung Kung. Why systolic architectures? IEEE computer, 15(1):37–46, 1982.
  • [66] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84. ACM, 2017.
  • [67] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
  • [68] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, volume 43, pages 92–104. ACM, 2015.
  • [69] Jeff Dean. Machine learning for systems and systems for machine learning, 2017.
  • [70] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML'15, pages 448–456. ACM, 2015.
  • [71] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [72] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations. ACM, 2015.
  • [73] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR’16, pages 770–778. IEEE, 2016.
  • [74] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–2761, 2018.
  • [75] Joseph Redmon. Darknet: Open source neural networks in c. pjreddie.com/darknet, 2013–2016.
  • [76] Juyong Kim, Yookoon Park, Gunhee Kim, and Sung Ju Hwang. Splitnet: Learning to semantically split deep networks for parameter reduction and model parallelization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1866–1874. JMLR. org, 2017.
  • [77] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
  • [78] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [79] The Apache Software Foundation. Apache avro. https://avro.apache.org, 2017. [Online; accessed 04/01/19].
  • [80] Makerhawk. Um25c usb power meter. makerhawk.com, 2019. [Online; accessed 09/27/19].
  • [81] Raspberry Pi Foundation. Raspberry pi 3b+. www.raspberrypi.org/products/raspberry-pi-3-model-b/, 2017. [Online; accessed 04/01/19].
  • [82] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Tvm: end-to-end optimization stack for deep learning. arXiv preprint arXiv:1802.04799, pages 1–15, 2018.
  • [83] Xilinx Inc. Pynq: Python productivity for zynq. pynq.io, 2019. [Online; accessed 09/27/19].
  • [84] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
  • [85] JEDEC. Jedec standard: Low power double data rate 2 (lpddr2). https://www.jedec.org/sites/default/files/docs/JESD209-2B.pdf, 2019. [Online; accessed 04/01/19].
  • [86] William J Song, Saibal Mukhopadhyay, and Sudhakar Yalamanchili. Kitfox: Multiphysics libraries for integrated power, thermal, and reliability simulations of multicore microarchitecture. IEEE Transactions on Components, Packaging and Manufacturing Technology, 5(11):1590–1601, 2015.
  • [87] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and Norman P Jouppi. Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 469–480. IEEE, 2009.
  • [88] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In 44th International Symposium on Computer Architecture (ISCA), pages 548–560. IEEE, 2017.
  • [89] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems (NIPS), pages 2181–2191, 2017.
  • [90] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
  • [91] Bahar Asgari, Ramyad Hadidi, Hyesoon Kim, and Sudhakar Yalamanchili. Lodestar: Creating locally-dense cnns for efficient inference on systolic arrays. In Proceedings of the 56th Annual Design Automation Conference 2019, pages 1–2, 2019.
  • [92] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplication. arXiv preprint arXiv:1412.7024, 2014.
  • [93] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K Bansal, William Constable, Oguz Elibol, Scott Gray, Stewart Hall, Luke Hornof, et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1742–1752, 2017.
  • [94] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing CNN accelerator efficiency through resource partitioning. In 44th International Symposium on Computer Architecture (ISCA). IEEE, 2017.
  • [95] Jianxin Guo, Shouyi Yin, Peng Ouyang, Leibo Liu, and Shaojun Wei. Bit-width based resource partitioning for cnn acceleration on fpga. In 25th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, 2017.
  • [96] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
  • [97] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  • [98] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV’16, pages 525–542. Springer, 2016.
  • [99] Google. Introduction to tensorflow lite. https://www.tensorflow.org/mobile/tflite/, 2017. [Online; accessed 11/10/17].
  • [100] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [101] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
  • [102] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [103] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.
  • [104] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.
  • [105] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • [106] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
  • [107] Surat Teerapittayanon, Bradley McDanel, and HT Kung. Distributed deep neural networks over the cloud, the edge and end devices. In 37th IEEE International Conference on Distributed Computing Systems (ICDCS), pages 328–339. IEEE, 2017.
  • [108] Jiashen Cao, Fei Wu, Ramyad Hadidi, Lixing Liu, Tushar Krishna, Michael S Ryoo, and Hyesoon Kim. An edge-centric scalable intelligent framework to collaboratively execute dnn. In Demo for SysML Conference, Palo Alto, CA, 2019.
  • [109] Manu Mathew, Kumar Desappan, Pramod Kumar Swami, and Soyeb Nagori. Sparse, quantized, full frame cnn for low power embedded devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 11–19, 2017.
  • [110] Nicholas D Lane, Sourav Bhattacharya, Akhil Mathur, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. Squeezing deep learning into mobile and embedded devices. IEEE Pervasive Computing, 16(3):82–88, 2017.
  • [111] Bahar Asgari, Ramyad Hadidi, Hyesoon Kim, and Sudhakar Yalamanchili. Eridanus: Efficiently running inference of dnns using systolic arrays. IEEE Micro, 39(5):46–54, 2019.
  • [112] Bahar Asgari, Saibal Mukhopadhyay, and Sudhakar Yalamanchili. Mahasim: Machine-learning hardware acceleration using a software-defined intelligent memory system. Journal of Signal Processing Systems, pages 1–17, 2020.
  • [113] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to fpgas. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 17. IEEE Press, 2016.
  • [114] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170. ACM, 2015.
  • [115] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 16–25. ACM, 2016.
  • [116] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016.
  • [117] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 267–278. IEEE, 2016.
  • [118] Younmin Bae, Ramyad Hadidi, Bahar Asgari, Jiashen Cao, and Hyesoon Kim. Capella: Customizing perception for edge devices by efficiently allocating fpgas to dnns. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pages 421–421. IEEE, 2019.
  • [119] Bahar Asgari, Ramyad Hadidi, and Hyesoon Kim. Ascella: Accelerating sparse computation by enabling stream accesses to memory. 2020.
  • [120] Ramyad Hadidi, Jiashen Cao, Michael S Ryoo, and Hyesoon Kim. Robustly executing dnns in iot systems using coded distributed computing. In Proceedings of the 56th Annual Design Automation Conference 2019, pages 1–2, 2019.