Survey and Benchmarking of Machine Learning Accelerators

08/29/2019
by Albert Reuther, et al.
MIT

Advances in multicore processors and accelerators have opened the flood gates to greater exploration and application of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore's Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then select and benchmark two commercially available low size, weight, and power (SWaP) accelerators, as these processors are the most interesting for embedded and mobile machine learning inference applications that are most applicable to the DoD and other SWaP-constrained users. We determine how they actually perform with real-world images and neural network models, compare those results to the reported performance and power consumption values, and evaluate them against an Intel CPU that is used in some embedded applications.


I Introduction

Artificial Intelligence (AI) and machine learning (ML) have the opportunity to revolutionize the way many industries, militaries, and other organizations address the challenges of evolving events, data deluge, and rapid courses of action. Innovations in computations, data sets, and algorithms have driven many advances for machine learning and its application to many different areas. AI solutions involve a number of different pieces that must work together in order to provide capabilities that can be used by decision makers, warfighters, and analysts; Figure 1 depicts these important pieces that are needed when developing an end-to-end AI solution. While certain components may not be as visible to end-users as others, our experience has shown that each of these interrelated components plays a major role in the success or failure of an AI system.

Fig. 1: Canonical AI architecture consists of sensors, data conditioning, algorithms, modern computing, robust AI, human-machine teaming, and users (missions). Each step is critical in developing end-to-end AI applications and systems.

On the left side of Figure 1, structured and unstructured data sources provide different views of entities and/or phenomenology. These raw data products are fed into a data conditioning step in which they are fused, aggregated, structured, accumulated, and converted to information. The information generated by the data conditioning step feeds into a host of supervised and unsupervised algorithms such as neural networks, which extract patterns, predict new events, fill in missing data, or look for similarities across datasets, thereby converting the input information to actionable knowledge. This actionable knowledge is then passed to human beings for decision-making processes in the human-machine teaming phase, which provides the users with useful and relevant insight, turning knowledge into actionable intelligence.

Underlying all of these phases is a bedrock of modern computing systems that is comprised of one or more heterogeneous computing elements. For example, sensor processing may occur on low power embedded computers, while algorithms may be computed in very large data centers. With regard to performance advances in these computing elements, Moore's law trends have ended [86], as have a number of related laws and trends including Dennard scaling (power density), clock frequency, core counts, instructions per clock cycle, and instructions per Joule (Koomey's law) [41]. Many of the technologies, tricks, and techniques of processor chip designers that extended these trends have been exhausted. However, all is not lost; advancements and innovations are still progressing. In fact, there has been a Cambrian explosion of computing technologies and architectures in recent years. Specialization of circuits for certain functionalities is being exploited, whereby certain often-used operational kernels, methods, or functions are accelerated with specialized circuit blocks and chips. These accelerators are designed with a different balance between performance and functional flexibility. One area in which we are seeing an explosion of accelerators is ML processors and accelerators [40]. Understanding the relative benefits of these technologies is of particular importance to applying AI to domains under significant constraints such as size, weight, and power, both in embedded applications and in data centers.

But before we get to the survey of ML processors and accelerators, we must cover several topics that are important for understanding several dimensions of evaluation in the survey. We must discuss the types of neural networks for which these ML accelerators are being designed; the distinction between neural network training and inference; and the numerical precision with which the neural networks are being used for training and inference:

  • Types of Neural Networks – AI and machine learning encompass a wide set of statistics-based technologies, as one can see in the taxonomy detailed in the algorithm section (Section 3) of this MIT Lincoln Laboratory technical report [30]. Even among neural networks, there are a growing number of neural network patterns [87]. This paper will focus on processors that are geared toward deep neural networks (DNNs) and convolutional neural networks (CNNs). Overall, most of the computational emphasis for machine learning is on DNNs and CNNs because they are quite computationally intensive [10], with the fully connected and convolutional layers being the most computationally intense. Conversely, pooling, dropout, softmax, and recurrent/skip connection layers are not computationally intensive since these types of layers stipulate datapaths for weight and data operands.

  • Neural Network Training versus Inference – Neural network training uses libraries of labeled input data to converge on a set of model weight parameters: the labeled input data are applied to the model (forward projection), the output predictions are measured against the labels, and the model weight parameters are then adjusted to better predict the correct outputs (back projection). Neural network inference applies a trained model of weight parameters to input data to produce output predictions. Processors designed for training can also perform well at inference, but the converse is not always true.

  • Numerical precision – The numerical precision with which the model weight parameters are stored and computed has an impact on the effectiveness and efficiency with which networks are trained and used for inference. Generally, higher numerical precision representations, particularly floating point representations, are used for training, while lower numerical precision representations, including integer representations, have been shown to be reasonably effective for inference [83, 69], as illustrated by the sketch below. However, it has also generally been established that very limited numerical precisions like int4, int2, and int1 do not adequately represent model weight parameters and significantly degrade model output predictions.
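To make the precision trade-off concrete, the following is a minimal sketch (using NumPy; the tensor values and the symmetric scaling scheme are illustrative assumptions, not taken from any particular accelerator) of quantizing a float32 weight tensor to int8 and measuring the representation error that results:

```python
import numpy as np

# Illustrative float32 weight tensor (values are made up for this example).
weights = np.random.randn(1000).astype(np.float32)

# Symmetric linear quantization to int8: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to estimate the representation error introduced by int8 storage.
dequantized = q_weights.astype(np.float32) * scale
print("mean absolute error:", np.abs(weights - dequantized).mean())

# Narrower types shrink the range further (a symmetric int4 scheme has only 15
# levels), which is one reason int4/int2/int1 weights can noticeably degrade
# model predictions.
```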

The survey in the next section of this paper focuses on the computational throughput of the processors and accelerators along with the power that is consumed to achieve that performance. Other factors include the memory bandwidth to load and update model parameters and data; the memory capacity for model weight parameters and input data, both close to the arithmetic units and in the global memory of the processor or accelerator; and the arithmetic intensity [88] of the neural network models being processed by the processor or accelerator. These factors are involved in managing model parameters and input data flows within the model; hence, they also influence the trade-offs between chip bandwidth capabilities, dataflow flexibility, and the configuration and amount of computational capability. These factors, however, are beyond the scope of this paper, and they will be addressed in future phases of this research.
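For reference, arithmetic intensity is simply the ratio of operations performed to bytes moved between memory and the arithmetic units; a minimal sketch for a single fully connected layer (the layer dimensions are illustrative assumptions) is:

```python
# Arithmetic intensity = operations performed / bytes moved between memory and compute.
# Illustrative fully connected layer: 1024 inputs, 4096 outputs, float32 operands, batch 1.
n_in, n_out, bytes_per_operand = 1024, 4096, 4

ops = 2 * n_in * n_out                                            # multiply + add per weight
bytes_moved = (n_in + n_out + n_in * n_out) * bytes_per_operand   # activations + weights
print(ops / bytes_moved)                                          # ~0.5 ops/byte: memory-bound
```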

II Survey of Processors

Many recent advances in AI can be at least partly credited to advances in computing hardware [51, 1]. In particular, modern computing advances have been able to realize many computationally heavy machine-learning algorithms such as neural networks. While machine-learning algorithms such as neural networks have had a rich theoretical history [64], recent advances in computing have made the application of such algorithms a reality by providing the computational power needed to train and process massive quantities of data. Although the computing landscape of the past decade has been rich with numerous innovations, embedded and mobile applications that require low size, weight, and power (SWaP) systems will need capabilities that are beyond those delivered by the traditional architectures of central processing units (CPUs) and graphics processing units (GPUs). For example, in commercial applications, it is common to off-load data conditioning and algorithms to non-SWaP-constrained platforms such as high-performance computing clusters or processing clouds. Defense applications, among others, on the other hand, may need AI applications to be performed inside low-SWaP platforms or local networks (edge computing) and without the use of the cloud due to insufficient security or communication infrastructure.

The survey in this section gathers performance and power information from publicly available materials including research papers, technical trade press, company benchmarks, etc. While there are ways to access information from companies and startups (including those in their silent period), this information is intentionally left out of this survey; such data will be included in this survey when it becomes publicly available. The key metrics of this public data are plotted in Figure 2, which graphs recent processor capabilities (as of May 2019), mapping peak performance vs. power consumption. The x-axis indicates peak power, and the y-axis indicates peak giga operations per second (GOps/s). Note the legend on the right, which indicates various parameters used to differentiate computing techniques and technologies. The computational precision of the processing capability is depicted by the geometric shape used; the computational precision spans from single-bit int1 to single-byte int8 and from four-byte float32 to eight-byte float64. The form factor is depicted by the color; this is important for showing how much power is consumed, but also how much computation can be packed onto a single chip, a single PCI card, or a full system. Blue shows the performance and power consumption of a single chip only. Orange shows the performance and power of a card (note that they all are in the 200-300 Watt zone). Green shows the performance and power of entire systems – in this case, single-node desktop and server systems. This survey is limited to single motherboard, single memory-space systems. Finally, the hollow geometric objects are performance for inference only, while the solid geometric figures are performance for training (and inference) processing. Mostly, low power solutions are only capable of inference, though there are some high-power accelerators (WaveDPU, Goya, Arria, and Turing) that are targeting high performance for inference only.

Fig. 2: Performance vs. power scatter plot of publicly announced AI accelerators and processors.

From Figure 2, we can make a number of general observations. First, much of the recent effort has focused on processors that are in the 10-300W range in terms of power utilization, since they are being designed and deployed as processing accelerators. (300W is the upper limit for a PCI-based accelerator card.) For this power envelope, the performance can vary depending on a variety of factors such as architecture, precision, and workload (training vs. inference). There are many solutions under the 1 TeraOps/W line; however, there are several inference solutions and a few training solutions that report greater than 1 TeraOps/W.
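For readers who want to reproduce this style of plot from their own data, the following is a minimal matplotlib sketch that applies the same encodings: log-log axes, marker shape for precision, color for form factor, hollow versus filled markers for inference-only versus training, and a 1 TeraOps/W reference line. The entries and their values are placeholders, not data from Figure 2.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder entries: (name, peak power in W, peak GOps/s, precision, form factor, trains?)
entries = [
    ("ChipA",    2.0,    4_000, "int8",    "chip",   False),
    ("CardB",  250.0,  120_000, "float16", "card",   True),
    ("SysC",  1500.0,  960_000, "float32", "system", True),
]
marker_for = {"int8": "s", "float16": "^", "float32": "o"}
color_for = {"chip": "tab:blue", "card": "tab:orange", "system": "tab:green"}

fig, ax = plt.subplots()
for name, watts, gops, precision, form, trains in entries:
    ax.scatter(watts, gops, marker=marker_for[precision],
               facecolors=color_for[form] if trains else "none",  # hollow = inference only
               edgecolors=color_for[form])
    ax.annotate(name, (watts, gops))

# Reference efficiency line: 1 TeraOps/W (1000 GOps/s per Watt).
power = np.logspace(-1, 4, 50)
ax.plot(power, 1000 * power, linestyle="--", label="1 TeraOps/W")

ax.set_xscale("log"); ax.set_yscale("log")
ax.set_xlabel("Peak Power (W)"); ax.set_ylabel("Peak Performance (GOps/s)")
ax.legend(); plt.show()
```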

With the current offerings, at least 100W must be employed to perform training; all of the points on the scatter plot below 100W are inference-only processors/accelerators. There are a number of possible explanations for this, but it is likely that there is currently little driving a requirement for low-power training, though there is much demand for low-power inference on devices ranging from smartphones to remotely piloted aircraft (RPA) and autonomous vehicles. From a technology standpoint, this may suggest that the trade-offs necessary to do neural network training under the 100W envelope affect the performance, numerical accuracy, and prediction accuracy too greatly.

Many hardware manufacturers, faced with limitations in fabrication processes, have been able to exploit the fact that machine-learning algorithms such as neural networks can perform well even when using limited or mixed precision [69, 33] representations of activation functions, weights, and biases. Such hardware platforms (often designed specifically for inference) may quantize weights and biases to half precision (16 bits) or even single-bit representations in order to improve the number of operations per second without significant impact to model prediction accuracy or power utilization. To that point, in these inference engines, the entire neural network model is usually loaded onto the chip before any inference is performed. Loading the model turns the model's parameters into constants that are stored with the operator rather than operands that must be loaded from volatile (DRAM or SRAM) memory, thereby reducing the number of operand/parameter loads that must occur separately from the instruction load.

There are a number of dimensions along which we can present the processors and accelerators in this survey. We have chosen to roughly categorize the scatter plot into six regions that roughly correspond to performance and power consumption: Very Low Power and Research Chips, Cell (Smartphone) GPUs, Mobile and Embedded Chips and Systems, FPGA Accelerators, Data Center Chips and Cards, and Data Center Systems. In the following listings, the angle-bracketed string is the label of the item on the scatter plot, and the square bracket that follows is the literature reference from which the performance and power values came. Some of the performance values are reported in frames per second (fps) with a given machine learning model. For those values, Samuel Albanie has Matlab code and a web site that list all of the major machine learning models with their operations per epoch/inference, parameter memory, feature memory, and input size [4]; the operations per epoch/inference are used to compute operations per second from frames per second. Finally, if a neural network model is not mentioned, the performance reported is peak performance.
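As a concrete example of that conversion (the frame rate and the per-inference operation count below are illustrative placeholders; the actual per-model counts come from [4]):

```python
# Converting a reported frames-per-second (fps) figure into operations per second:
# ops/s = (operations per inference for the benchmarked model) x (frames per second).
ops_per_inference = 2.0e9   # illustrative per-inference operation count for some model
frames_per_second = 900     # illustrative reported throughput

gops = ops_per_inference * frames_per_second / 1e9
print(gops, "GOps/s")       # 1800.0 GOps/s
```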

II-A Very Low Power and Research Chips

Chips in the very low power regime have been mainly university and industry research chips. However, a few vendors have announced or are offering products in this space.

  • MIT Eyeriss chip Eyeriss [12, 13, 83] is a research chip from Vivienne Sze’s group in MIT CSAIL. Their goal was to develop the most energy efficient inference chip possible. The result was acquired running AlexNet with no mention of batch size.

  • The TrueNorth TrueNorth [3, 23] is a digital neuromorphic research chip from the IBM Almaden research lab. It was developed under DARPA funding in the SyNAPSE program to demonstrate the efficacy of digital spiking neural network (neuromorphic) chips. Note that there are points on the graph for both the system, which draws 44 W of power, and the chip, which itself draws only up to 275 mW.

  • The Intel MovidiusX processor MovidiusX [44] is an embedded video processor that includes a Neural Engine for video processing and object detection.

  • In early 2019, Google released the TPU Edge processor TPUEdge [21] for embedded inference applications. The TPU Edge uses TensorFlow Lite, which encodes the neural network model with low precision parameters for inference.

  • The DianNao series of dataflow research chips came from a university research team in China. They published four different designs aimed at different types of ML processing [14]. The DianNao DianNao [14] is a neural network inference accelerator, and the DaDianNao DaDianNao [15] is a many-tile version of the DianNao for larger NN model inference. The ShiDianNao ShiDianNao [19] is designed specifically for convolutional neural network inference. Finally, the PuDianNao PuDianNao [55] is designed for seven representative machine learning techniques: k-means, k-NN, naïve Bayes, support vector machines, linear regression, classification tree, and deep neural networks.

  • San Jose startup AIStorm AIStorm [62] claims to perform some of the inference math at the sensor in the analog domain. They originally entered the embedded space with biometric sensors and processing. They call their chip an AI-on-Sensor capability.

  • The Rockchip RK3399Pro Rockchip [76] is an image and neural co-processor from the Chinese company Rockchip. They published raw performance numbers for int8 inference. This appears to be a GPU-based co-processor, but few details are available.

II-B Cell / Smartphone GPU-based Neural Engines

A number of smartphone vendors are embedding GPU-based neural engines in their smartphones to enable object detection, face recognition, and other inference-based tasks. The performance metrics for five inference neural engines, which were benchmarked with AImark, are included in this survey. AImark runs VGG-16, ResNet34, and InceptionV3 on smartphones, and it is available in the Apple App Store and the Google Play Store. It is reasonably safe to assume that these GPU-based vector processors are executing with int8 precision.

  • The Apple A12 processor A12 [29, 73] in the iPhone Xs tops out this set. This A12 neural engine bursts its power utilization to 5.5W for short periods (above its usual 5W maximum for battery life) for fast inference runs, and this performance point is on the VGG-16 model.

  • The Huawei Kirin 980 (with ARM Mali-G76 GPU IP) Mali-76 [27] and Kirin 970 (with ARM Mali-G72 GPU IP) Mali-75 [28] make their performance marks with the ResNet34 and VGG-16 models, respectively.

  • Finally, the Qualcomm Snapdragon 835 S835 and 845 S845 [27] are also on the chart with performance numbers using the ResNet34 and InceptionV3 models, respectively.

II-C Embedded Chips and Systems

The systems in this category are aimed at automotive AI/ML, autonomous vehicles, UAVs, robots, etc. They all have several ARM cores that are mated with NVIDIA CUDA GPU cores.

  • The NVIDIA Jetson-TX1 JetsonTX1 [26] incorporates 4 ARM cores and 256 CUDA Maxwell cores. It is aimed at low power applications for inference only. The performance was achieved with GoogLeNet with a batch size of 128.

  • The Jetson-TX2 JetsonTX2 [26] mates 6 ARM cores with 256 CUDA Pascal cores. It also is aimed at low power applications for inference only. The performance was achieved with GoogLeNet with a batch size of 128.

  • The NVIDIA Xavier Xavier [45] deploys 8 ARM cores with 512 CUDA Volta cores and 64 Tensor cores. It, too, is aimed at low power applications for inference only.

II-D FPGA Co-processors

In the public literature, the use of FPGAs for neural networks has been primarily in the technical research domain. Quite a number of research teams around the world have mapped one or more neural network models onto one or more FPGAs and collected a variety of performance and model prediction accuracy metrics. Several survey papers have been published, including [53] and [65], and the most comprehensive survey of mapping and running DNNs on FPGAs is [32]. This last paper lists 25 top results from the published research literature, of which we have chosen 12 that are the performance leaders for their numerical precision and/or FPGA model. They are labeled with an abbreviation of their chip type: Zynq-020 int1 [68], int2 [48], int8 [31]; Zynq-060 int16 accumulator/int12 result [34]; ZCU102 int16 [57]; Stratix-V int32 [74]; ArriaGX1150 int16 accumulator/int8 result [58], int16 [94], fp16 [9], fp32 [94]; and ArriaGX1155 1-bit [67] points with different numerical precisions. They are all used for inference. Finally, there is a 7-FPGA Xilinx Cluster XilinxCluster [93] in which the research team ganged together one control FPGA and six computational FPGAs to execute much larger neural network models. All of these results are from running one of the following models: AlexNet, VGG-16, VGG-19, DoReFa-Net, and an LSTM model. Details are in [32].

II-E Data Center Chips and Cards

There are a variety of technologies in this category including several CPUs, a number of GPUs, a CPU-controlled FPGA solution, and dataflow accelerators. They are addressed in their own subsections to group similar processing technologies.

II-E1 CPU-based Processors

  • The Intel SkyLake SP processors 2xSkyLakeSP [77, 46] are conventional Xeon server processors. Intel has been marketing these chips to data analytics companies as very versatile inference engines with reasonable power budgets. The performance numbers were measured using Caffe ResNet-50 with batch size of 64 on a 2-socket SkyLakeSP system.

  • The Intel Xeon Phi processor chips have 64, 68, or 72 cores, with each core having four hardware hyper-threads and two AVX-512 (512-bit wide) vector units [47]. Having these 128 AVX-512 vector units on a 64-core chip is equivalent to 2048 double precision or 4096 single precision floating point vector ALUs (see the sketch below). The Phi7210F Phi7210F [90] is the 64-core chip we have in the TX-Green Petaflop system, while the Phi7290F Phi7290F [90] is the top-bin, 72-core Xeon Phi (KNL).
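A minimal sketch of that arithmetic, assuming each AVX-512 lane's fused multiply-add is counted as two operations (an assumption on our part, but it is how the 2048 and 4096 figures are reached), is:

```python
# Peak vector operations per clock for a 64-core Xeon Phi (KNL), counting a
# fused multiply-add (FMA) as two floating point operations.
cores = 64
avx512_units_per_core = 2
dp_lanes = 512 // 64          # 8 double precision lanes per AVX-512 unit
sp_lanes = 512 // 32          # 16 single precision lanes per AVX-512 unit
ops_per_lane = 2              # FMA = multiply + add

print(cores * avx512_units_per_core * dp_lanes * ops_per_lane)  # 2048 DP ops/clock
print(cores * avx512_units_per_core * sp_lanes * ops_per_lane)  # 4096 SP ops/clock
```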

II-E2 CPU-Controlled FPGA

The Intel Arria solution pairs an Intel Xeon CPU with an Altera Arria FPGA Arria GX1150 [37, 2] (next to the Baidu point). The CPU is used to rapidly download FPGA hardware configurations to the Arria, and it then farms out the operations to the Arria for processing certain key kernels. Since inference models do not change, this technique is well geared toward the CPU-FPGA processing paradigm. However, it would be more challenging to farm ML model training out to the FPGAs. The performance benchmark was run on an Arria 10 1150 FPGA using GoogLeNet, reporting 900 fps.

II-E3 GPU-based Accelerators

There are four NVIDIA cards and two AMD/ATI cards on the chart (listed respectively): the Kepler architecture K80 K80 [79], the Pascal architecture P100 P100 [70, 80], the Volta architecture V100 V100 [71, 81], the TU106 Turing Turing [49], the MI6 MI6 [22], and the MI60 MI60 [82]. The K80, P100, V100, MI6, and MI60 GPUs are pure computation cards intended for both inference and training, while the TU106 Turing GPU is geared to the gaming/graphics market, incorporating inference processing within the graphics processing.

II-E4 Data Center Chips and Cards

This subsection lists a series of chips and cards intended for data center deployment.

  • Intel Corp. bought AI chip startup Nervana in August 2016 to enter the AI accelerator market. The first Nervana chip Nervana [75], called Lake Crest, is scheduled to ship in 2019. The follow-on is called Spring Crest Nervana2 [75], and it is scheduled to ship in late 2019.

  • Google has released three versions of their Tensor Processing Unit (TPU) [1]. The TPU1 TPU1 [84] is only for inference, but Google soon made improvements that enabled both training and inference on the TPU2 TPU2 [84] and TPU3 TPU3 [84].

  • GraphCore.ai released their C2 card GraphCoreC2 [52] in early 2019, which is being shipped in their GraphCore server node (see below). This company is a startup headquartered in Bristol, UK, with an office in Palo Alto. They have strong venture backing from Dell, Samsung, and others. The performance values were achieved with ResNet-50 training on a single C2 card with a training batch size of 8. The card power is an estimate based on a typical PCI card power draw.

  • The Goya chip Goya [8, 25] is an inference chip being developed by startup Habana Labs, which is based in San Jose and Tel Aviv. The performance was achieved on ResNet50 inference. Habana Labs is also working on a training chip called the Gaudi, which is expected to be released in mid-2019.

  • Wave Computing has released their Dataflow Processing Unit (DPU) WaveDPU [36]. Each card has four DPUs.

  • The Cambricon dataflow chip Cambricon [18] was designed by a Chinese university team along with the Cambricon company, which came out of the university team. They published both int8 inference and float16 training numbers that are both significant, so both are on the chart. This is the same team that is behind the ARM Mali GPU-based Huawei Kirin chip series (see above) that is integrated into Huawei smartphones.

  • Baidu has announced an AI accelerator chip called Kunlun Baidu [60, 20]. Presumably this chip is aimed at low power data center training and inference and is supposed to be deployed in early 2019. The two variants of the Kunlun are the 818-100 for inference and the 818-300 for training. The performance number in this chart is the Kunlun 818-300 for training.

II-F Data Center Systems

  • There are three NVIDIA server systems on the graph: the DGX-Station, the DGX-1, and the DGX-2. The DGX-Station is a tower workstation DGX-Station [5] for use as a desktop system that includes four V100 GPUs. The DGX-1 DGX-1 [5, 17] is a server that includes eight V100 GPUs and occupies three rack units, while the DGX-2 DGX-2 [17] is a server that includes sixteen V100 GPUs and occupies ten rack units. The DGX-2 networks those sixteen GPUs together using a proprietary NVLink switch.

  • GraphCore.ai released a Dell/EMC-based server GraphCoreNode [52] in early 2019, which contains eight C2 cards (see above). The performance values were achieved with ResNet-50 training on the full server with eight C2 cards. The training batch size for the full server was 64. The server power is an estimate based on the components of a typical Intel-based, dual-socket server with eight PCI cards.

  • Along with the aforementioned card, Wave Computing also released a server appliance WaveSystem [36, 24]. The Wave server appliance includes four cards for a total of sixteen DPUs in the server chassis.

II-G Announced Chips

A number of other accelerator chips have been announced but have not published any performance and power numbers. These include: Intel Loihi [39], Facebook [50], Groq [66], Mythic [38], Amazon Web Services Inferentia [7], Stanford’s Braindrop [63], Brainchip’s Akida [89, 61], Tesla [35], Adapteva [72], Horizon Robotics [42], Bitmain [11], Simple Machines [85], Eta Compute [91], and Alibaba [92], among others. As performance and power numbers become available for these and other chips, they will be added in future iterations of this work.

III Benchmarking

Most of the processors in the very low power space are either research chips that were developed as proofs of concept in university research labs or FPGA-based solutions, also usually from university research labs. However, there are a few processors that have been commercially released. These commercial low-power accelerators are of interest for many embedded machine learning inference applications in the DoD and beyond. Amazon Web Services has disclosed that "… inference actually accounts for the majority of the cost and complexity for running machine learning in production (for every dollar spent on training, nine are spent on inference)" [6]. In this section, we present the preliminary results of benchmarking the Google TPU Edge [21] and Intel Movidius X-based [45] Neural Compute Stick 2 (NCS2) systems and comparing them to an Intel Core i9-9900k processor system.

All of the benchmarks in this section were executed on an Intel-based tower desktop computer with an Intel Core i9-9900k, 32GB of 3200MHz RAM, and a Samsung 970 Pro NVMe storage disk. It was running Windows 10 Pro (10.0.17763 Build 17763) in a VirtualBox v6.0 virtual machine. The neural network model that the Google Edge TPU ran was Mobilenet v1 [43] with single shot multibox detectors (SSD) [56], trained with the Microsoft COCO image library [54]. The model that the Intel Neural Compute Stick 2 (NCS2) and the Intel i9-9900k system ran was Mobilenet v2 [78], also with SSD and trained with COCO. The Edge TPU and NCS2 both had throttles imposed by the software that allowed only one image to be submitted for classification at a time (batch size = 1). Further, for both systems the entire neural network model had to be loaded onto the device for each image that was processed. This seems to be in place to emphasize that these are development products rather than production products. In an actual embedded system this limitation would not be in place, since more performance would be gained by simultaneously submitting more than one image for classification (batch size > 1); however, that was not enabled or tested in this benchmarking effort. The NCS2 model was prepared for download to the device with the Intel Distribution of the OpenVINO (Open Visual Inference and Neural network Optimization) toolkit, version 2018 R5.0.1 (30 Jan 2019). For both the TPU Edge and NCS2 devices, power draw was measured with a USB multimeter. Finally, on the Intel Core i9-9900k, TensorFlow was compiled to separately use the SSE4 and AVX2 vector engine instruction sets; the measurements for these two trials are depicted as i9-SSE4 and i9-AVX2, respectively. The Intel Core i9-9900k performs somewhat better and draws more power than typical VPX-board-based embedded single board computers from companies including Curtiss-Wright and Mercury Systems [16, 59], which generally are based on Intel Core i7 processors that draw a maximum of 70W for the entire system.
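The measurement procedure described above can be summarized by the following sketch; load_model and run_inference are hypothetical placeholders standing in for the device-specific TensorFlow Lite, OpenVINO, or TensorFlow calls, and ops_per_inference is the per-model operation count discussed in Section II.

```python
import time

def benchmark(load_model, run_inference, images, ops_per_inference):
    """Mirror the batch-size-1 protocol described above: reload the model for
    every image and classify one image at a time."""
    load_times, infer_times = [], []
    for image in images:
        t0 = time.perf_counter()
        model = load_model()                 # placeholder for the device-specific model load
        load_times.append(time.perf_counter() - t0)

        t1 = time.perf_counter()
        run_inference(model, image)          # placeholder for single-image classification
        infer_times.append(time.perf_counter() - t1)

    avg_load_s = sum(load_times) / len(load_times)
    avg_infer_ms = 1000 * sum(infer_times) / len(infer_times)
    measured_gops = (ops_per_inference / 1e9) / (avg_infer_ms / 1000)
    return avg_load_s, avg_infer_ms, measured_gops
```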

                                         EdgeTPU          NCS2        i9-SSE4     i9-AVX2
NN Environment                           TensorFlow Lite  OpenVINO    TensorFlow  TensorFlow
Mobilenet Model                          v1               v2          v2          v2
Reported GOPS                            58.5             160         –           –
Measured GOPS                            47.4             8.29        38.4        40.9
Reported Power (W)                       2.0              2.0         205         205
Measured Power (W)                       0.85             1.35        –           –
Reported GOPS/W                          29.3             80.0        –           –
Measured GOPS/W                          55.8             6.14        –           –
Avg. Model Load Time (s)                 3.66             5.32        0.36        0.36
Avg. Single Image Inference Time (ms)    27.4             96.4        19.6        20.8

TABLE I: Embedded Device Descriptions
Fig. 3: Box and whisker plot of single image inference times.

Table I summarizes the reported and measured giga operations per second (GOPS), power (W), and GOPS/W, along with average model load time in seconds and average single image inference time in milliseconds. One can observe that the TPU Edge and NCS2 have much lower power consumption and much higher model load times than the Intel i9. However, single image inference times are generally the same, though the NCS2 is somewhat slower. Also, the Edge TPU's reported and measured GOPS/W numbers are reasonably similar, while the measured GOPS/W is much lower than the reported GOPS/W for the NCS2. Further, Figure 3 shows a box and whiskers plot of the average and standard deviation of single image inference times for each of the four technologies. From the box and whiskers plot, we see that the single image inference times are reasonably uniform across all four technologies.
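The measured GOPS/W column of Table I follows directly from the measured throughput and measured power, for example:

```python
# Measured GOPS divided by measured power (W) gives the measured GOPS/W in Table I.
print(47.4 / 0.85)   # ~55.8 GOPS/W for the Edge TPU
print(8.29 / 1.35)   # ~6.14 GOPS/W for the NCS2
```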

As more low power commercial systems become available, we intend to purchase and benchmark them to add to this body of work. We expect to have performance and power numbers for the NVIDIA Jetson Xavier [45] and perhaps the NVIDIA Jetson NANO in time for the conference.

IV Summary

In this paper, we have presented a survey of processors and accelerators for machine learning, specifically deep neural networks, along with some benchmarking results that we collected on commercial low power processing systems that are relevant to DoD and other embedded applications. We started by overviewing the trends in machine learning processor technologies – many processor trends including transistor density, power density, clock frequency, and core counts are no longer increasing. This is prompting a drive to application-specific accelerators that are designed specifically for deep neural networks. Several factors that determine accelerator designs were discussed, including the types of neural networks, training versus inference, and numerical precision for the computations. We then surveyed and analyzed machine learning processors categorized into six regions that roughly correspond to performance and power consumption. Finally, we presented benchmarking results for two low power machine learning accelerator systems, the Google Edge TPU and the Intel Movidius X Neural Compute Stick 2 (NCS2), and compared the results to an Intel i9-9900k processor system using the SSE4 and AVX2 vector engine instruction sets.

References

  • [1] Jouppi, N. P., C. Young, N. Patil, and D. Patterson, "A Domain-Specific Architecture for Deep Neural Networks," Communications of the ACM, vol. 61, no. 9, pp. 50–59, Aug. 2018.

  • Chen, Y., T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.

  • "Edge TPU," https://cloud.google.com/edge-tpu/, 2019.

  • Sze, V., Y. Chen, T. Yang, and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.

  • Chen, Y., J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," IEEE Micro, 2018.

  • Han, S., J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, "ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17), pp. 75–84, ACM, 2017.

  • Zhang, C., D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, "Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster," in Proceedings of the 2016 International Symposium on Low Power Electronics and Design (ISLPED '16), pp. 326–331, ACM, 2016.

  • Feldman, M., "IBM Finds Killer App for TrueNorth Neuromorphic Chip," Top500.org, Sep. 2016.

  • Du, Z., R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor," ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 92–104, 2015.

  • Zhang, J. and J. Li, "Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17), pp. 25–34, ACM, 2017.

  • Liu, D., T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator," ACM SIGARCH Computer Architecture News, vol. 43, no. 1, pp. 369–381, 2015.

  • Fleming Jr., K. E., et al., "Scalable Reconfigurable Computing Leveraging Latency-Insensitive Channels," Ph.D. thesis, Massachusetts Institute of Technology, 2013.

  • Merritt, R., "Startup Accelerates AI at the Sensor," EE Times, Feb. 2019.

  • Hemsoth, N., "Intel FPGA Architecture Focuses on Deep Learning Inference," The Next Platform, Jul. 2018.

  • "Rockchip Released Its First AI Processor RK3399Pro: NPU Performance up to 2.4 TOPs," Rockchip, Jan. 2018.

  • Nakahara, H., T. Fujii, and S. Sato, "A Fully Connected Layer Elimination for a Binarized Convolutional Neural Network on an FPGA," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4, 2017.

  • Frumusanu, A., "The Samsung Galaxy S9 and S9+ Review: Exynos and Snapdragon at 960fps," AnandTech, Mar. 2018.

  • Frumusanu, A., "HiSilicon Announces The Kirin 980: First A76, G76 on 7nm," AnandTech, Aug. 2018.

  • Franklin, D., "NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge," NVIDIA Developer Blog, Mar. 2017.

  • Hruska, J., "Nvidia's Jetson Xavier Stuffs Volta Performance Into Tiny Form Factor," ExtremeTech, Jun. 2018.

  • Rodriguez, A., E. Segal, E. Meiri, E. Fomenko, Y. J. Kim, H. Shen, and B. Ziv, "Lower Numerical Precision Deep Learning Inference and Training," Intel Corporation technical report, pp. 1–19, Jan. 2018.

  • Minsky, M. L., Computation: Finite and Infinite Machines. Upper Saddle River, NJ: Prentice-Hall, 1967.

  • Peng, T., "AI Chip Duel: Apple A12 Bionic vs Huawei Kirin 980," medium.com, Sep. 2018.

  • Frumusanu, A., "The iPhone XS & XS Max Review: Unveiling the Silicon Secrets," AnandTech, Oct. 2018.

  • Theis, T. N. and H.-S. P. Wong, "The End of Moore's Law: A New Beginning for Information Technology," Computing in Science & Engineering, vol. 19, no. 2, pp. 41–50, Mar. 2017.

  • Canziani, A., A. Paszke, and E. Culurciello, "An Analysis of Deep Neural Network Models for Practical Applications," arXiv preprint arXiv:1605.07678, 2016.

  • Van Veen, F. and S. Leijnen, "The Neural Network Zoo," Asimov Institute, http://www.asimovinstitute.org/neural-network-zoo/, 2019.

  • Akopyan, F., J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, "TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, Oct. 2015.

  • Gadepally, V., J. Goodwin, J. Kepner, A. Reuther, H. Reynolds, S. Samsi, J. Su, and D. Martinez, "AI Enabling Technologies," MIT Lincoln Laboratory technical report, Lexington, MA, pp. 1–54, 2019.

  • Horowitz, M., "Computing's Energy Problem (and What We Can Do About It)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, IEEE, Feb. 2014.

  • Hennessy, J. L. and D. A. Patterson, "A New Golden Age for Computer Architecture," Communications of the ACM, vol. 62, no. 2, pp. 48–60, Jan. 2019.

  • Mittal, S., "A Survey of FPGA-based Accelerators for Convolutional Neural Networks," Neural Computing and Applications, pp. 1–31, Springer, Oct. 2018.

  • Li, Z., Y. Wang, T. Zhi, and T. Chen, "A Survey of Neural Network Accelerators," Frontiers of Computer Science, vol. 11, no. 5, pp. 746–761, Oct. 2017.
- 2017 - A survey of neural network accelerators.pdf:pdf}, issn = {2095-2228}, journal = {Frontiers of Computer Science}, month = {oct}, number = {5}, pages = {746–761}, publisher = {Higher Education Press}, title = {{A survey of neural network accelerators}}, url = {http://link.springer.com/10.1007/s11704-016-6159-1}, volume = {11}, year = {2017}} ShinDongjooLeeJinmookLeeJinsuYooHoi-Jun2017 IEEE International Solid-State Circuits Conference (ISSCC)DocumentISBN 978-1-5090-3758-2feb240–241IEEE14.2 DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networksLink2017-02@inproceedings{Shin2017, author = {Shin, Dongjoo and Lee, Jinmook and Lee, Jinsu and Yoo, Hoi-Jun}, booktitle = {2017 IEEE International Solid-State Circuits Conference (ISSCC)}, doi = {10.1109/ISSCC.2017.7870350}, isbn = {978-1-5090-3758-2}, month = {feb}, pages = {240–241}, publisher = {IEEE}, title = {{14.2 DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks}}, url = {http://ieeexplore.ieee.org/document/7870350/}, year = {2017}} ChenYunjiChenTianshiXuZhiweiSunNinghuiTemamOlivierDocument:Users/al17856/Documents/Mendeley Desktop/Chen et al/Communications of the ACM/Chen et al. - 2016 - DianNao family.pdf:pdfISSN 00010782Communications of the ACMoct11105–112ACMDianNao Family: Energy-Efficient Accelerators For Machine LearningLink592016-10@article{chen2016diannao, author = {Chen, Yunji and Chen, Tianshi and Xu, Zhiwei and Sun, Ninghui and Temam, Olivier}, doi = {10.1145/2996864}, file = {:Users/al17856/Documents/Mendeley Desktop/Chen et al/Communications of the ACM/Chen et al. - 2016 - DianNao family.pdf:pdf}, issn = {00010782}, journal = {Communications of the ACM}, month = {oct}, number = {11}, pages = {105–112}, publisher = {ACM}, title = {{DianNao Family: Energy-Efficient Accelerators For Machine Learning}}, url = {http://dl.acm.org/citation.cfm?doid=3013530.2996864}, volume = {59}, year = {2016}} ChenYunjiLuoTaoLiuShaoliZhangShijinHeLiqiangWangJiaLiLingChenTianshiXuZhiweiSunNinghuiTemamOlivier2014 47th Annual IEEE/ACM International Symposium on MicroarchitectureDocument:Users/al17856/Documents/Mendeley Desktop/Chen et al/2014 47th Annual IEEEACM International Symposium on Microarchitecture/Chen et al. - 2014 - DaDianNao A Machine-Learning Supercomputer.pdf:pdfISBN 978-1-4799-6998-2dec609–622IEEEDaDianNao: A Machine-Learning SupercomputerLink2014-12@inproceedings{chen2014dadiannao, author = {Chen, Yunji and Luo, Tao and Liu, Shaoli and Zhang, Shijin and He, Liqiang and Wang, Jia and Li, Ling and Chen, Tianshi and Xu, Zhiwei and Sun, Ninghui and Temam, Olivier}, booktitle = {2014 47th Annual IEEE/ACM International Symposium on Microarchitecture}, doi = {10.1109/MICRO.2014.58}, file = {:Users/al17856/Documents/Mendeley Desktop/Chen et al/2014 47th Annual IEEEACM International Symposium on Microarchitecture/Chen et al. - 2014 - DaDianNao A Machine-Learning Supercomputer.pdf:pdf}, isbn = {978-1-4799-6998-2}, month = {dec}, pages = {609–622}, publisher = {IEEE}, title = {{DaDianNao: A Machine-Learning Supercomputer}}, url = {http://ieeexplore.ieee.org/document/7011421/

      }, year = {2014}} New York, New York, USAZhangChiPrasannaViktorProceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’17Document:Users/al17856/Documents/Mendeley Desktop/Zhang, Prasanna/Proceedings of the 2017 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17/Zhang, Prasanna - 2017 - Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System.pdf:pdfISBN 9781450343541CPU,FPGA,concurrent processing,convolutional neural networks,discrete fourier transform,double buffering,overlap-and-add,shared memory35–44ACM PressFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory SystemLink2017@inproceedings{Zhang2017a, address = {New York, New York, USA}, author = {Zhang, Chi and Prasanna, Viktor}, booktitle = {Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17}, doi = {10.1145/3020078.3021727}, file = {:Users/al17856/Documents/Mendeley Desktop/Zhang, Prasanna/Proceedings of the 2017 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17/Zhang, Prasanna - 2017 - Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System.pdf:pdf}, isbn = {9781450343541}, keywords = {CPU,FPGA,concurrent processing,convolutional neural networks,discrete fourier transform,double buffering,overlap-and-add,shared memory}, pages = {35–44}, publisher = {ACM Press}, title = {{Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System}}, url = {

      http://dl.acm.org/citation.cfm?doid=3020078.3021727

      }, year = {2017}} KrizhevskyAlexSutskeverIlyaE. HintonGeoffreyDocumentNeural Information Processing SystemsImageNet Classification with Deep Convolutional Neural Networks252012@article{krizhevsky2012imagenet, author = {Krizhevsky, Alex and Sutskever, Ilya and {E. Hinton}, Geoffrey}, doi = {10.1145/3065386}, journal = {Neural Information Processing Systems}, title = {{ImageNet Classification with Deep Convolutional Neural Networks}}, volume = {25}, year = {2012}} KrizhevskyAlexSutskeverIlyaHintonGeoffrey E.Document:Users/al17856/Documents/Mendeley Desktop/Krizhevsky, Sutskever, Hinton/Communications of the ACM/Krizhevsky, Sutskever, Hinton - 2017 - ImageNet classification with deep convolutional neural networks.pdf:pdfISSN 00010782Communications of the ACMmay684–90ACMImageNet classification with deep convolutional neural networksLink602017-05@article{Krizhevsky2017, author = {Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E.}, doi = {10.1145/3065386}, file = {:Users/al17856/Documents/Mendeley Desktop/Krizhevsky, Sutskever, Hinton/Communications of the ACM/Krizhevsky, Sutskever, Hinton - 2017 - ImageNet classification with deep convolutional neural networks.pdf:pdf}, issn = {00010782}, journal = {Communications of the ACM}, month = {may}, number = {6}, pages = {84–90}, publisher = {ACM}, title = {{ImageNet classification with deep convolutional neural networks}}, url = {

      http://dl.acm.org/citation.cfm?doid=3098997.3065386}, volume = {60}, year = {2017}} New York, New York, USAZhangChenLiPengSunGuangyuGuanYijinXiaoBingjunCongJasonProceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’15Document:Users/al17856/Documents/Mendeley Desktop/Zhang et al/Proceedings of the 2015 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15/Zhang et al. - 2015 - Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks.pdf:pdfISBN 9781450333153acceleration,convolutional neural network,fpga,roofline model161–170ACM PressOptimizing FPGA-based Accelerator Design for Deep Convolutional Neural NetworksLink2015@inproceedings{zhang2015optimizing, address = {New York, New York, USA}, author = {Zhang, Chen and Li, Peng and Sun, Guangyu and Guan, Yijin and Xiao, Bingjun and Cong, Jason}, booktitle = {Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15}, doi = {10.1145/2684746.2689060}, file = {:Users/al17856/Documents/Mendeley Desktop/Zhang et al/Proceedings of the 2015 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15/Zhang et al. - 2015 - Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks.pdf:pdf}, isbn = {9781450333153}, keywords = {acceleration,convolutional neural network,fpga,roofline model}, pages = {161–170}, publisher = {ACM Press}, title = {{Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks}}, url = {http://dl.acm.org/citation.cfm?doid=2684746.2689060

      }, year = {2015}} GuanYijinYuanZhihangSunGuangyuCongJason2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC)DocumentISBN 978-1-5090-1558-0jan629–634IEEEFPGA-based accelerator for long short-term memory recurrent neural networksLink2017-01@inproceedings{Guan2017a, author = {Guan, Yijin and Yuan, Zhihang and Sun, Guangyu and Cong, Jason}, booktitle = {2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC)}, doi = {10.1109/ASPDAC.2017.7858394}, isbn = {978-1-5090-1558-0}, month = {jan}, pages = {629–634}, publisher = {IEEE}, title = {{FPGA-based accelerator for long short-term memory recurrent neural networks}}, url = {

      http://ieeexplore.ieee.org/document/7858394/}, year = {2017}} PodiliAbhinavZhangChiPrasannaViktor2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)DocumentISBN 978-1-5090-4825-0jul11–18IEEEFast and efficient implementation of Convolutional Neural Networks on FPGALink2017-07@inproceedings{podili2017fast, author = {Podili, Abhinav and Zhang, Chi and Prasanna, Viktor}, booktitle = {2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)}, doi = {10.1109/ASAP.2017.7995253}, isbn = {978-1-5090-4825-0}, month = {jul}, pages = {11–18}, publisher = {IEEE}, title = {{Fast and efficient implementation of Convolutional Neural Networks on FPGA}}, url = {http://ieeexplore.ieee.org/document/7995253/}, year = {2017}} LuLiqiangLiangYunXiaoQingchengYanShengen2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)DocumentISBN 978-1-5386-4037-1apr101–108IEEEEvaluating Fast Algorithms for Convolutional Neural Networks on FPGAsLink2017-04@inproceedings{lu2017evaluating, author = {Lu, Liqiang and Liang, Yun and Xiao, Qingcheng and Yan, Shengen}, booktitle = {2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)}, doi = {10.1109/FCCM.2017.64}, isbn = {978-1-5386-4037-1}, month = {apr}, pages = {101–108}, publisher = {IEEE}, title = {{Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs}}, url = {http://ieeexplore.ieee.org/document/7966660/}, year = {2017}} New York, New York, USAZhangJialiangLiJingProceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’17Document:Users/al17856/Documents/Mendeley Desktop/Zhang, Li/Proceedings of the 2017 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17/Zhang, Li - 2017 - Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network.pdf:pdfISBN 9781450343541convolutional neural networks,fpga,hardware accelerator,opencl25–34ACM PressImproving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural NetworkLink2017@inproceedings{Zhang2017, address = {New York, New York, USA}, author = {Zhang, Jialiang and Li, Jing}, booktitle = {Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17}, doi = {10.1145/3020078.3021698}, file = {:Users/al17856/Documents/Mendeley Desktop/Zhang, Li/Proceedings of the 2017 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17/Zhang, Li - 2017 - Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network.pdf:pdf}, isbn = {9781450343541}, keywords = {convolutional neural networks,fpga,hardware accelerator,opencl}, pages = {25–34}, publisher = {ACM Press}, title = {{Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network}}, url = {http://dl.acm.org/citation.cfm?doid=3020078.3021698}, year = {2017}} New York, New York, USAZhangChenWuDiSunJiayuSunGuangyuLuoGuojieCongJasonProceedings of the 2016 International Symposium on Low Power Electronics and Design - ISLPED ’16Document:Users/al17856/Documents/Mendeley Desktop/Zhang et al/Proceedings of the 2016 International Symposium on Low Power Electronics and Design - ISLPED '16/Zhang et al. 
- 2016 - Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster.pdf:pdfISBN 9781450341851326–331ACM PressEnergy-Efficient CNN Implementation on a Deeply Pipelined FPGA ClusterLink2016@inproceedings{Zhang2016a, address = {New York, New York, USA}, author = {Zhang, Chen and Wu, Di and Sun, Jiayu and Sun, Guangyu and Luo, Guojie and Cong, Jason}, booktitle = {Proceedings of the 2016 International Symposium on Low Power Electronics and Design - ISLPED '16}, doi = {10.1145/2934583.2934644}, file = {:Users/al17856/Documents/Mendeley Desktop/Zhang et al/Proceedings of the 2016 International Symposium on Low Power Electronics and Design - ISLPED '16/Zhang et al. - 2016 - Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster.pdf:pdf}, isbn = {9781450341851}, pages = {326–331}, publisher = {ACM Press}, title = {{Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster}}, url = {http://dl.acm.org/citation.cfm?doid=2934583.2934644}, year = {2016}} New York, New York, USAShenJunzhongHuangYouWangZelongQiaoYuranWenMeiZhangChunyuanProceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’18Document:Users/al17856/Documents/Mendeley Desktop/Shen et al/Proceedings of the 2018 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18/Shen et al. - 2018 - Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA.pdf:pdfISBN 97814503561453d cnn,uniform templates,winograd algorithm97–106ACM PressTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGALink2018@inproceedings{Shen2018, address = {New York, New York, USA}, author = {Shen, Junzhong and Huang, You and Wang, Zelong and Qiao, Yuran and Wen, Mei and Zhang, Chunyuan}, booktitle = {Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18}, doi = {10.1145/3174243.3174257}, file = {:Users/al17856/Documents/Mendeley Desktop/Shen et al/Proceedings of the 2018 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18/Shen et al. 
- 2018 - Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA.pdf:pdf}, isbn = {9781450356145}, keywords = {3d cnn,uniform templates,winograd algorithm}, pages = {97–106}, publisher = {ACM Press}, title = {{Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA}}, url = {http://dl.acm.org/citation.cfm?doid=3174243.3174257}, year = {2018}} Huimin LiXitian FanLi JiaoWei CaoXuegong ZhouLingli Wang2016 26th International Conference on Field Programmable Logic and Applications (FPL)DocumentISBN 978-2-8399-1844-2aug1–9IEEEA high performance FPGA-based accelerator for large-scale convolutional neural networksLink2016-08@inproceedings{HuiminLi2016, author = {{Huimin Li} and {Xitian Fan} and {Li Jiao} and {Wei Cao} and {Xuegong Zhou} and {Lingli Wang}}, booktitle = {2016 26th International Conference on Field Programmable Logic and Applications (FPL)}, doi = {10.1109/FPL.2016.7577308}, isbn = {978-2-8399-1844-2}, month = {aug}, pages = {1–9}, publisher = {IEEE}, title = {{A high performance FPGA-based accelerator for large-scale convolutional neural networks}}, url = {http://ieeexplore.ieee.org/document/7577308/}, year = {2016}} GuanYijinLiangHaoXuNingyiWangWenqiangShiShaoshuaiChenXiSunGuangyuZhangWeiCongJason2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)DocumentISBN 978-1-5386-4037-1apr152–159IEEEFP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid TemplatesLink2017-04@inproceedings{Guan2017, author = {Guan, Yijin and Liang, Hao and Xu, Ningyi and Wang, Wenqiang and Shi, Shaoshuai and Chen, Xi and Sun, Guangyu and Zhang, Wei and Cong, Jason}, booktitle = {2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)}, doi = {10.1109/FCCM.2017.25}, isbn = {978-1-5386-4037-1}, month = {apr}, pages = {152–159}, publisher = {IEEE}, title = {{FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates}}, url = {http://ieeexplore.ieee.org/document/7966671/}, year = {2017}} New York, New York, USAZhangChenFangZhenmanZhouPeipeiPanPeichenCongJasonProceedings of the 35th International Conference on Computer-Aided Design - ICCAD ’16DocumentISBN 97814503446611–8ACM PressCaffeineLink2016@inproceedings{Zhang2016, address = {New York, New York, USA}, author = {Zhang, Chen and Fang, Zhenman and Zhou, Peipei and Pan, Peichen and Cong, Jason}, booktitle = {Proceedings of the 35th International Conference on Computer-Aided Design - ICCAD '16}, doi = {10.1145/2966986.2967011}, isbn = {9781450344661}, pages = {1–8}, publisher = {ACM Press}, title = {{Caffeine}}, url = {http://dl.acm.org/citation.cfm?doid=2966986.2967011}, year = {2016}} New York, New York, USAXiaoQingchengLiangYunLuLiqiangYanShengenTaiYu-WingProceedings of the 54th Annual Design Automation Conference 2017 on - DAC ’17Document:Users/al17856/Documents/Mendeley Desktop/Xiao et al/Proceedings of the 54th Annual Design Automation Conference 2017 on - DAC '17/Xiao et al. 
- 2017 - Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs.pdf:pdfISBN 97814503492771–6ACM PressExploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAsLink2017@inproceedings{Xiao2017, address = {New York, New York, USA}, author = {Xiao, Qingcheng and Liang, Yun and Lu, Liqiang and Yan, Shengen and Tai, Yu-Wing}, booktitle = {Proceedings of the 54th Annual Design Automation Conference 2017 on - DAC '17}, doi = {10.1145/3061639.3062244}, file = {:Users/al17856/Documents/Mendeley Desktop/Xiao et al/Proceedings of the 54th Annual Design Automation Conference 2017 on - DAC '17/Xiao et al. - 2017 - Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs.pdf:pdf}, isbn = {9781450349277}, pages = {1–6}, publisher = {ACM Press}, title = {{Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs}}, url = {http://dl.acm.org/citation.cfm?doid=3061639.3062244}, year = {2017}} New York, New York, USAQiuJiantaoSongSenWangYuYangHuazhongWangJieYaoSongGuoKaiyuanLiBoxunZhouErjinYuJinchengTangTianqiXuNingyiProceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’16Document:Users/al17856/Documents/Mendeley Desktop/Qiu et al/Proceedings of the 2016 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '16/Qiu et al. - 2016 - Going Deeper with Embedded FPGA Platform for Convolutional Neural Network.pdf:pdfISBN 9781450338561bandwidth utilization,convolutional neural network (cnn),dynamic-precision data quantization,embedded fpga26–35ACM PressGoing Deeper with Embedded FPGA Platform for Convolutional Neural NetworkLink2016@inproceedings{Qiu2016, address = {New York, New York, USA}, author = {Qiu, Jiantao and Song, Sen and Wang, Yu and Yang, Huazhong and Wang, Jie and Yao, Song and Guo, Kaiyuan and Li, Boxun and Zhou, Erjin and Yu, Jincheng and Tang, Tianqi and Xu, Ningyi}, booktitle = {Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '16}, doi = {10.1145/2847263.2847265}, file = {:Users/al17856/Documents/Mendeley Desktop/Qiu et al/Proceedings of the 2016 ACMSIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '16/Qiu et al. - 2016 - Going Deeper with Embedded FPGA Platform for Convolutional Neural Network.pdf:pdf}, isbn = {9781450338561}, keywords = {bandwidth utilization,convolutional neural network (cnn),dynamic-precision data quantization,embedded fpga}, pages = {26–35}, publisher = {ACM Press}, title = {{Going Deeper with Embedded FPGA Platform for Convolutional Neural Network}}, url = {http://dl.acm.org/citation.cfm?doid=2847263.2847265}, year = {2016}} New York, New York, USAVenierisStylianos I.BouganisChristos-SavvasProceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’17DocumentISBN 9781450343541FPGA,convolutional neural networks,design space exploration,synchronous dataflow291–292ACM PressfpgaConvNetLink2017@inproceedings{Venieris2017, address = {New York, New York, USA}, author = {Venieris, Stylianos I. 
and Bouganis, Christos-Savvas}, booktitle = {Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17}, doi = {10.1145/3020078.3021791}, isbn = {9781450343541}, keywords = {FPGA,convolutional neural networks,design space exploration,synchronous dataflow}, pages = {291–292}, publisher = {ACM Press}, title = {{fpgaConvNet}}, url = {http://dl.acm.org/citation.cfm?doid=3020078.3021791}, year = {2017}} Zhiqiang LiuYong DouJingfei JiangJinwei Xu2016 International Conference on Field-Programmable Technology (FPT)DocumentISBN 978-1-5090-5602-6dec61–68IEEEAutomatic code generation of convolutional neural networks in FPGA implementationLink2016-12@inproceedings{ZhiqiangLiu2016, author = {{Zhiqiang Liu} and {Yong Dou} and {Jingfei Jiang} and {Jinwei Xu}}, booktitle = {2016 International Conference on Field-Programmable Technology (FPT)}, doi = {10.1109/FPT.2016.7929190}, isbn = {978-1-5090-5602-6}, month = {dec}, pages = {61–68}, publisher = {IEEE}, title = {{Automatic code generation of convolutional neural networks in FPGA implementation}}, url = {http://ieeexplore.ieee.org/document/7929190/}, year = {2016}} New York, New York, USASudaNaveenChandraVikasDasikaGaneshMohantyAbinashMaYufeiVrudhulaSarmaSeoJae-sunCaoYuProceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA ’16DocumentISBN 9781450338561convolutional neural networks,fpga,opencl,optimization16–25ACM PressThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural NetworksLink2016@inproceedings{Suda2016, address = {New York, New York, USA}, author = {Suda, Naveen and Chandra, Vikas and Dasika, Ganesh and Mohanty, Abinash and Ma, Yufei and Vrudhula, Sarma and Seo, Jae-sun and Cao, Yu}, booktitle = {Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '16}, doi = {10.1145/2847263.2847276}, isbn = {9781450338561}, keywords = {convolutional neural networks,fpga,opencl,optimization}, pages = {16–25}, publisher = {ACM Press}, title = {{Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks}}, url = {http://dl.acm.org/citation.cfm?doid=2847263.2847276}, year = {2016}} We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1
      % top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.arXiv1606.06160ZhouShuchangWuYuxinNiZekunZhouXinyuWenHeZouYuheng1606.06160:Users/al17856/Documents/Mendeley Desktop/Zhou et al/Unknown/Zhou et al. - 2016 - DoReFa-Net Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.pdf:pdfjunDoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth GradientsLink2016-06@article{Zhou2016, abstract = {We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1$\backslash${\%} top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.}, archiveprefix = {arXiv}, arxivid = {1606.06160}, author = {Zhou, Shuchang and Wu, Yuxin and Ni, Zekun and Zhou, Xinyu and Wen, He and Zou, Yuheng}, eprint = {1606.06160}, file = {:Users/al17856/Documents/Mendeley Desktop/Zhou et al/Unknown/Zhou et al. 
- 2016 - DoReFa-Net Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.pdf:pdf}, month = {jun}, title = {{DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients}}, url = {http://arxiv.org/abs/1606.06160}, year = {2016}} GuoKaiyuanSuiLingzhiQiuJiantaoYuJinchengWangJunbinYaoSongHanSongWangYuYangHuazhongDocumentISSN 0278-0070IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systemsjan135–47Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGALink372018-01@article{guo2018angeleye, author = {Guo, Kaiyuan and Sui, Lingzhi and Qiu, Jiantao and Yu, Jincheng and Wang, Junbin and Yao, Song and Han, Song and Wang, Yu and Yang, Huazhong}, doi = {10.1109/TCAD.2017.2705069}, issn = {0278-0070}, journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems}, month = {jan}, number = {1}, pages = {35–47}, title = {{Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA}}, url = {http://ieeexplore.ieee.org/document/7930521/}, volume = {37}, year = {2018}} JiaoLiLuoChengCaoWeiZhouXuegongWangLingli2017 27th International Conference on Field Programmable Logic and Applications (FPL)DocumentISBN 978-9-0903-0428-1sep1–4IEEEAccelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGALink2017-09@inproceedings{jiao2017accelerating, author = {Jiao, Li and Luo, Cheng and Cao, Wei and Zhou, Xuegong and Wang, Lingli}, booktitle = {2017 27th International Conference on Field Programmable Logic and Applications (FPL)}, doi = {10.23919/FPL.2017.8056820}, isbn = {978-9-0903-0428-1}, month = {sep}, pages = {1–4}, publisher = {IEEE}, title = {{Accelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGA}}, url = {http://ieeexplore.ieee.org/document/8056820/

      }, year = {2017}} Loihi is a 60-mm2chip fabricated in Intels 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and, most importantly, programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve LASSO optimization problems with over three orders of magnitude superior energy-delay-product compared to conventional solvers running on a CPU iso-process/voltage/area. This provides an unambiguous example of spike-based computation, outperforming all known conventional solutions.DaviesMSrinivasaNLinTChinyaGCaoYChodayS HDimouGJoshiPImamNJainSLiaoYLinCLinesALiuRMathaikuttyDMcCoySPaulATseJVenkataramananGWengYWildAYangYWangHDocument:Users/al17856/Documents/Mendeley Desktop/Davies et al/IEEE Micro/08259423.pdf:pdfISSN 0272-1732IEEE MicroAlgorithm design and analysis,Biological neural networks,CPU iso-process-voltage-area,Computational modeling,Computer architecture,Intels process,LASSO optimization problems,Loihi,Neuromorphics,Neurons,artificial intelligence,circuit optimisation,dendritic compartments,hierarchical connectivity,integrated circuit modelling,learning (artificial intelligence),locally competitive algorithm,machine learning,magnitude superior energy-delay-product,microprocessor chips,multiprocessing systems,neural chips,neuromorphic computing,neuromorphic manycore processor,on-chip learning,programmable synaptic learning rules,size 14 nm,spike-based computation,spiking neural networks,synaptic delaysjan182–99Loihi: A Neuromorphic Manycore Processor with On-Chip Learning382018-01@article{8259423, abstract = {Loihi is a 60-mm2chip fabricated in Intels 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and, most importantly, programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve LASSO optimization problems with over three orders of magnitude superior energy-delay-product compared to conventional solvers running on a CPU iso-process/voltage/area. 
This provides an unambiguous example of spike-based computation, outperforming all known conventional solutions.}, author = {Davies, M and Srinivasa, N and Lin, T and Chinya, G and Cao, Y and Choday, S H and Dimou, G and Joshi, P and Imam, N and Jain, S and Liao, Y and Lin, C and Lines, A and Liu, R and Mathaikutty, D and McCoy, S and Paul, A and Tse, J and Venkataramanan, G and Weng, Y and Wild, A and Yang, Y and Wang, H}, doi = {10.1109/MM.2018.112130359}, file = {:Users/al17856/Documents/Mendeley Desktop/Davies et al/IEEE Micro/08259423.pdf:pdf}, issn = {0272-1732}, journal = {IEEE Micro}, keywords = {Algorithm design and analysis,Biological neural networks,CPU iso-process-voltage-area,Computational modeling,Computer architecture,Intels process,LASSO optimization problems,Loihi,Neuromorphics,Neurons,artificial intelligence,circuit optimisation,dendritic compartments,hierarchical connectivity,integrated circuit modelling,learning (artificial intelligence),locally competitive algorithm,machine learning,magnitude superior energy-delay-product,microprocessor chips,multiprocessing systems,neural chips,neuromorphic computing,neuromorphic manycore processor,on-chip learning,programmable synaptic learning rules,size 14 nm,spike-based computation,spiking neural networks,synaptic delays}, month = {jan}, number = {1}, pages = {82–99}, title = {{Loihi: A Neuromorphic Manycore Processor with On-Chip Learning}}, volume = {38}, year = {2018}} Loihi is Intel’s novel, manycore neuromorphic processor and is the first of its kind to feature a microcode-programmable learning engine that enables on-chip training of spiking neural networks (SNNs). The authors present the Loihi toolchain, which consists of an intuitive Python-based API for specifying SNNs, a compiler and runtime for building and executing SNNs on Loihi, and several target platforms (Loihi silicon, FPGA, and functional simulator). To showcase the toolchain, the authors describe how to build, train, and use a SNN to classify handwritten digits from the MNIST database.LinCWildAChinyaG NCaoYDaviesMLaveryD MWangHDocument:Users/al17856/Documents/Mendeley Desktop/Lin et al/Computer/08303802.pdf:pdfISSN 0018-9162ComputerComputational modeling,FPGA,Intel Loihi silicon,Loihi toolchain,MNIST database,Mathematical model,Neuromorphic engineering,Programming,SNN,Synapses,application program interfaces,compiler,field programmable gate arrays,functional simulator,handwritten character recognition,handwritten digits,intuitive Python-based API,learning (artificial intelligence),manycore neuromorphic processor,microcode-programmable learning engine,neural chips,neural networks,neuromorphic computing,neuromorphic processor,on-chip training,programming paradigms,spiking neural networksmar352–61Programming Spiking Neural Networks on Intel’s Loihi512018-03@article{8303802, abstract = {Loihi is Intel's novel, manycore neuromorphic processor and is the first of its kind to feature a microcode-programmable learning engine that enables on-chip training of spiking neural networks (SNNs). The authors present the Loihi toolchain, which consists of an intuitive Python-based API for specifying SNNs, a compiler and runtime for building and executing SNNs on Loihi, and several target platforms (Loihi silicon, FPGA, and functional simulator). 
To showcase the toolchain, the authors describe how to build, train, and use a SNN to classify handwritten digits from the MNIST database.}, author = {Lin, C and Wild, A and Chinya, G N and Cao, Y and Davies, M and Lavery, D M and Wang, H}, doi = {10.1109/MC.2018.157113521}, file = {:Users/al17856/Documents/Mendeley Desktop/Lin et al/Computer/08303802.pdf:pdf}, issn = {0018-9162}, journal = {Computer}, keywords = {Computational modeling,FPGA,Intel Loihi silicon,Loihi toolchain,MNIST database,Mathematical model,Neuromorphic engineering,Programming,SNN,Synapses,application program interfaces,compiler,field programmable gate arrays,functional simulator,handwritten character recognition,handwritten digits,intuitive Python-based API,learning (artificial intelligence),manycore neuromorphic processor,microcode-programmable learning engine,neural chips,neural networks,neuromorphic computing,neuromorphic processor,on-chip training,programming paradigms,spiking neural networks}, month = {mar}, number = {3}, pages = {52–61}, title = {{Programming Spiking Neural Networks on Intel's Loihi}}, volume = {51}, year = {2018}} MossDuncan J. M.NurvitadhiErikoSimJaewoongMishraAsitMarrDebbieSubhaschandraSuchitLeongPhilip H. W.2017 27th International Conference on Field Programmable Logic and Applications (FPL)DocumentISBN 978-9-0903-0428-1sep1–4IEEEHigh performance binary neural networks on the Xeon+FPGA™ platformLink2017-09@inproceedings{moss2017high, author = {Moss, Duncan J. M. and Nurvitadhi, Eriko and Sim, Jaewoong and Mishra, Asit and Marr, Debbie and Subhaschandra, Suchit and Leong, Philip H. W.}, booktitle = {2017 27th International Conference on Field Programmable Logic and Applications (FPL)}, doi = {10.23919/FPL.2017.8056823}, isbn = {978-9-0903-0428-1}, month = {sep}, pages = {1–4}, publisher = {IEEE}, title = {{High performance binary neural networks on the Xeon+FPGA™ platform}}, url = {

      http://ieeexplore.ieee.org/document/8056823/}, year = {2017}} Recent researches on neural network have shown significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural network is now widely adopted in regions like image, speech and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. CPU platforms are hard to offer enough computation capacity. GPU platforms are the first choice for neural network process because of its high computation capacity and easy to use development frameworks. On the other hand, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.arXiv1712.08934GuoKaiyuanZengShulinYuJinchengWangYuYangHuazhong1712.08934:Users/al17856/Documents/Mendeley Desktop/Guo et al/arXiv preprint arXiv1712.08934/Guo et al. - 2017 - A Survey of FPGA-Based Neural Network Accelerator.pdf:pdfarXiv preprint arXiv:1712.08934decA Survey of FPGA-Based Neural Network AcceleratorLink2017-12@article{guo2017survey, abstract = {Recent researches on neural network have shown significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural network is now widely adopted in regions like image, speech and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. CPU platforms are hard to offer enough computation capacity. GPU platforms are the first choice for neural network process because of its high computation capacity and easy to use development frameworks. On the other hand, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.}, archiveprefix = {arXiv}, arxivid = {1712.08934}, author = {Guo, Kaiyuan and Zeng, Shulin and Yu, Jincheng and Wang, Yu and Yang, Huazhong}, eprint = {1712.08934}, file = {:Users/al17856/Documents/Mendeley Desktop/Guo et al/arXiv preprint arXiv1712.08934/Guo et al. 
- 2017 - A Survey of FPGA-Based Neural Network Accelerator.pdf:pdf}, journal = {arXiv preprint arXiv:1712.08934}, month = {dec}, title = {{A Survey of FPGA-Based Neural Network Accelerator}}, url = {http://arxiv.org/abs/1712.08934}, year = {2017}} TraderTiffanyHPC WireBenchmark,DL,MLBenchmark,DL,MLNvidia Leads Alpha MLPerf Benchmarking Round2018@misc{Trader2018, author = {Trader, Tiffany}, booktitle = {HPC Wire}, keywords = {Benchmark,DL,ML}, mendeley-tags = {Benchmark,DL,ML}, title = {{Nvidia Leads Alpha MLPerf Benchmarking Round}}, year = {2018}} MerrittRickEE TimesjulBaidu Accelerator Rises in AILink2018-07@misc{merritt2018baidu, author = {Merritt, Rick}, booktitle = {EE Times}, month = {jul}, title = {{Baidu Accelerator Rises in AI}}, url = {https://www.eetimes.com/document.asp?doc

      {\_}id=1333449}, year = {2018}} This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better address the computational requirements in domains such as high-performance computing, data analytics, computer vision, and machine learning. Second was the desire to introduce an extension that can scale across multiple implementations, both now and into the future, allowing CPU designers to choose the vector length most suitable for their power, performance, and area targets. Finally, the architecture should avoid imposing a software development cost as the vector length changes and where possible reduce it by improving the reach of compiler auto-vectorization technologies. SVE achieves these goals. It allows implementations to choose a vector register length between 128 and 2,048 bits. It supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation. Finally, it introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.arXiv1803.06185StephensNigelBilesStuartBoettcherMatthiasEapenJacobEyoleMbouGabrielliGiacomoHorsnellMattMagklisGrigoriosMartinezAlejandroPremillieuNathanaelReidAlastairRicoAlejandroWalkerPaulDocument1803.06185:Users/al17856/Documents/Mendeley Desktop/Stephens et al/IEEE Micro/1803.06185.pdf:pdfISBN 0272-1732 VO - 37ISSN 02721732IEEE MicroARM,HPC,SIMD,SVE,Scalable Vector Extension,VLA,Vector Length Agnostic,autovectorization,data parallelism,high-performance computing,instruction set architecture,predication,scalable vector architecture,vector length agnostic226–39The ARM Scalable Vector Extension372017@article{Stephens2017, abstract = {This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better address the computational requirements in domains such as high-performance computing, data analytics, computer vision, and machine learning. Second was the desire to introduce an extension that can scale across multiple implementations, both now and into the future, allowing CPU designers to choose the vector length most suitable for their power, performance, and area targets. Finally, the architecture should avoid imposing a software development cost as the vector length changes and where possible reduce it by improving the reach of compiler auto-vectorization technologies. SVE achieves these goals. It allows implementations to choose a vector register length between 128 and 2,048 bits. It supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation. 
Finally, it introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.}, archiveprefix = {arXiv}, arxivid = {1803.06185}, author = {Stephens, Nigel and Biles, Stuart and Boettcher, Matthias and Eapen, Jacob and Eyole, Mbou and Gabrielli, Giacomo and Horsnell, Matt and Magklis, Grigorios and Martinez, Alejandro and Premillieu, Nathanael and Reid, Alastair and Rico, Alejandro and Walker, Paul}, doi = {10.1109/MM.2017.35}, eprint = {1803.06185}, file = {:Users/al17856/Documents/Mendeley Desktop/Stephens et al/IEEE Micro/1803.06185.pdf:pdf}, isbn = {0272-1732 VO - 37}, issn = {02721732}, journal = {IEEE Micro}, keywords = {ARM,HPC,SIMD,SVE,Scalable Vector Extension,VLA,Vector Length Agnostic,autovectorization,data parallelism,high-performance computing,instruction set architecture,predication,scalable vector architecture,vector length agnostic}, number = {2}, pages = {26–39}, title = {{The ARM Scalable Vector Extension}}, volume = {37}, year = {2017}}

  • [1] Cited by: 2nd item, §II.
  • [2] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O’Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, and G. R. Chiu (2018-08) DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 411–4117. External Links: Document, 1807.06434, ISSN 1946-1488 Cited by: §II-E2.
  • [3] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha (2015-10) TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34 (10), pp. 1537–1557. External Links: Document, ISSN 0278-0070 Cited by: 2nd item.
  • [4] S. Albanie (2019) Convnet Burden. External Links: Link Cited by: §II.
  • [5] P. Alcorn (2017-05) Nvidia Infuses DGX-1 with Volta, Eight V100s in a Single Chassis. External Links: Link Cited by: 1st item.
  • [6] (2018-11) Amazon Web Services Announces 13 New Machine Learning Services and Capabilities, Including a Custom Chip for Machine Learning Inference, and a 1/18 Scale Autonomous Race Car for Developers. External Links: Link Cited by: §III.
  • [7] (2018-11) Announcing AWS Inferentia: Machine Learning Inference Chip. Cited by: §II-G.
  • [8] L. Armasu (2018-09) Move Over GPUs: Startup’s Chip Claims to Do Deep Learning Inference Better. External Links: Link Cited by: 4th item.
  • [9] U. Aydonat, S. O’Connell, D. Capalija, A. C. Ling, and G. R. Chiu (2017) An OpenCL™ Deep Learning Accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, New York, NY, USA, pp. 55–64. External Links: Document, ISBN 978-1-4503-4354-1, Link Cited by: §II-D.
  • [10] A. Canziani, A. Paszke, and E. Culurciello (2016) An Analysis of Deep Neural Network Models for Practical Applications. arXiv preprint arXiv:1605.07678. Cited by: 1st item.
  • [11] M. Chafkin and D. Ramli (2018-06) China’s Crypto-Chips King Sets His Sights on AI - Bloomberg. External Links: Link Cited by: §II-G.
  • [12] Y. Chen, J. Emer, and V. Sze (2018) Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. IEEE Micro, pp. 1. External Links: Document, ISSN 0272-1732 Cited by: 1st item.
  • [13] Y. Chen, T. Krishna, J. S. Emer, and V. Sze (2017-01) Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52 (1), pp. 127–138. External Links: Document, ISSN 0018-9200 Cited by: 1st item.
  • [14] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam (2016-10) DianNao Family: Energy-Efficient Accelerators For Machine Learning. Communications of the ACM 59 (11), pp. 105–112. External Links: Document, ISSN 00010782, Link Cited by: 5th item.
  • [15] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam (2014-12) DaDianNao: A Machine-Learning Supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. External Links: Document, ISBN 978-1-4799-6998-2, Link Cited by: 5th item.
  • [16] (2019) Curtiss-Wright 3U Intel Single Board Computers. External Links: Link Cited by: §III.
  • [17] I. Cutress (2018-03) NVIDIA’s DGX-2: Sixteen Tesla V100s, 30TB of NVMe, Only $400K. External Links: Link Cited by: 1st item.
  • [18] I. Cutress (2018-05) Cambricon, Makers of Huawei’s Kirin NPU IP, Build a Big AI Chip and PCIe Card. External Links: Link Cited by: 6th item.
  • [19] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam (2015) ShiDianNao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, Vol. 43, pp. 92–104. Cited by: 5th item.
  • [20] C. Duckett (2018-07) Baidu Creates Kunlun Silicon for AI. External Links: Link Cited by: 7th item.
  • [21] (2019) Edge TPU. External Links: Link Cited by: 4th item, §III.
  • [22] ExxactCorp (2017-12) Taking a Deeper Look at AMD Radeon Instinct GPUs for Deep Learning. External Links: Link Cited by: §II-E3.
  • [23] M. Feldman (2016-09) IBM Finds Killer App for TrueNorth Neuromorphic Chip. External Links: Link Cited by: 2nd item.
  • [24] M. Feldman (2017-04) Wave Computing Launches Machine Learning Appliance. External Links: Link Cited by: 3rd item.
  • [25] M. Feldman (2019-01) AI Chip Startup Puts Inference Cards on the Table. External Links: Link Cited by: 4th item.
  • [26] D. Franklin (2017-03) NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge. Cited by: 1st item, 2nd item.
  • [27] A. Frumusanu (2018-03) The Samsung Galaxy S9 and S9+ Review: Exynos and Snapdragon at 960fps. Cited by: 2nd item, 3rd item.
  • [28] A. Frumusanu (2018-08) HiSilicon Announces The Kirin 980: First A76, G76 on 7nm. Cited by: 2nd item.
  • [29] A. Frumusanu (2018-10) The iPhone XS & XS Max Review: Unveiling the Silicon Secrets. External Links: Link Cited by: 1st item.
  • [30] V. Gadepally, J. Goodwin, J. Kepner, A. Reuther, H. Reynolds, S. Samsi, J. Su, and D. Martinez (2019) AI Enabling Technologies. Technical report MIT Lincoln Laboratory, Lexington, MA. Cited by: 1st item.
  • [31] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang (2018-01) Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (1), pp. 35–47. External Links: Document, ISSN 0278-0070, Link Cited by: §II-D.
  • [32] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang (2017-12) A Survey of FPGA-Based Neural Network Accelerator. arXiv preprint arXiv:1712.08934. External Links: 1712.08934, Link Cited by: §II-D.
  • [33] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep Learning with Limited Numerical Precision. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 1737–1746. External Links: Link Cited by: §II.
  • [34] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally (2017) ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, New York, NY, USA, pp. 75–84. External Links: Document, ISBN 978-1-4503-4354-1, Link Cited by: §II-D.
  • [35] K. Hao (2019-04) Tesla Says Its New Self-Driving Chip Will Help Make Its Cars Autonomous. MIT Technology Review. Cited by: §II-G.
  • [36] N. Hemsoth (2017-08) First In-Depth View of Wave Computing’s DPU Architecture, System. External Links: Link Cited by: 5th item, 3rd item.
  • [37] N. Hemsoth (2018-07) Intel FPGA Architecture Focuses on Deep Learning Inference. External Links: Link Cited by: §II-E2.
  • [38] N. Hemsoth (2018-08) A Mythic Approach to Deep Learning Inference. External Links: Link Cited by: §II-G.
  • [39] N. Hemsoth (2018-09) First Wave of Spiking Neural Network Hardware Hits. External Links: Link Cited by: §II-G.
  • [40] J. L. Hennessy and D. A. Patterson (2019-01) A New Golden Age for Computer Architecture. Communications of the ACM 62 (2), pp. 48–60. External Links: Document, ISSN 00010782, Link Cited by: §I.
  • [41] M. Horowitz (2014-02) Computing’s Energy Problem (and What We Can Do About It). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. External Links: Document, ISBN 978-1-4799-0920-9, Link Cited by: §I.
  • [42] J. Horwitz (2019-02) Chinese AI chip maker Horizon Robotics raises $600 million from SK Hynix, others - Reuters. External Links: Link Cited by: §II-G.
  • [43] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017-04) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861. External Links: 1704.04861, Link Cited by: §III.
  • [44] J. Hruska (2017-08) New Movidius Myriad X VPU Packs a Custom Neural Compute Engine. Cited by: 3rd item.
  • [45] J. Hruska (2018-06) Nvidia’s Jetson Xavier Stuffs Volta Performance Into Tiny Form Factor. Cited by: 3rd item, §III, §III.
  • [46] (2019) Intel Xeon Platinum 8180 Processor. External Links: Link Cited by: 1st item.
  • [47] J. Jeffers, J. Reinders, and A. Sodani (2016) Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. 2nd edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 0128091940, 9780128091944 Cited by: 2nd item.
  • [48] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang (2017-09) Accelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGA. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. External Links: Document, ISBN 978-9-0903-0428-1, Link Cited by: §II-D.
  • [49] E. Kilgariff, H. Moreton, N. Stam, and B. Bell (2018-09) NVIDIA Turing Architecture In-Depth. Cited by: §II-E3.
  • [50] W. Knight (2019-01) Cheaper AI for Everyone is the Promise with Intel and Facebook’s New Chip. MIT Technology Review. External Links: Link Cited by: §II-G.
  • [51] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems 25. External Links: Document Cited by: §II.
  • [52] D. Lacey (2017-10) Preliminary IPU Benchmarks. External Links: Link Cited by: 3rd item, 2nd item.
  • [53] Z. Li, Y. Wang, T. Zhi, and T. Chen (2017-10) A survey of neural network accelerators. Frontiers of Computer Science 11 (5), pp. 746–761. External Links: Document, ISSN 2095-2228, Link Cited by: §II-D.
  • [54] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pp. 740–755. External Links: Document, ISBN 978-3-319-10602-1, Link Cited by: §III.
  • [55] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen (2015) PuDianNao: A Polyvalent Machine Learning Accelerator. In ACM SIGARCH Computer Architecture News, Vol. 43, pp. 369–381. Cited by: 5th item.
  • [56] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: Single Shot MultiBox Detector. In Computer Vision – ECCV 2016, pp. 21–37. External Links: Document, Link Cited by: §III.
  • [57] L. Lu, Y. Liang, Q. Xiao, and S. Yan (2017-04) Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 101–108. External Links: Document, ISBN 978-1-5386-4037-1, Link Cited by: §II-D.
  • [58] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo (2017) Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, New York, NY, USA, pp. 45–54. External Links: Document, ISBN 978-1-4503-4354-1, Link Cited by: §II-D.
  • [59] (2019) Mercury Systems BuiltSAFE CIOV-2231. Mercury Systems. External Links: Link Cited by: §III.
  • [60] R. Merritt (2018-07) Baidu Accelerator Rises in AI. External Links: Link Cited by: 7th item.
  • [61] R. Merritt (2018-09) BrainChip Discloses SNN Chip. External Links: Link Cited by: §II-G.
  • [62] R. Merritt (2019-02) Startup Accelerates AI at the Sensor. External Links: Link Cited by: 6th item.
  • [63] R. Merritt (2019-05) AI Vet Pushes for Neuromorphic Chips — EE Times. External Links: Link Cited by: §II-G.
  • [64] M. L. Minsky (1967) Computation: Finite and Infinite Machines. Prentice-Hall, Inc., Upper Saddle River, NJ, USA. External Links: ISBN 0-13-165563-9 Cited by: §II.
  • [65] S. Mittal (2018-10) A survey of FPGA-based accelerators for convolutional neural networks. Neural Computing and Applications, pp. 1–31. External Links: Document, ISSN 0941-0643, Link Cited by: §II-D.
  • [66] J. Morra (2017-11) Groq Portrays Power of Its Artificial Intelligence Silicon. External Links: Link Cited by: §II-G.
  • [67] D. J. M. Moss, E. Nurvitadhi, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. W. Leong (2017-09) High performance binary neural networks on the Xeon+FPGA™ platform. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. External Links: Document, ISBN 978-9-0903-0428-1, Link Cited by: §II-D.
  • [68] H. Nakahara, T. Fujii, and S. Sato (2017) A Fully Connected Layer Elimination for a Binarizec Convolutional Neural Network on an FPGA. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4. External Links: Document, ISSN 1946-1488 Cited by: §II-D.
  • [69] S. Narang, G. Diamos, E. Elsen, P. Micikevicius, J. Alben, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. (2018) Mixed Precision Training. In Proceedings of ICLR, Vancouver, Canada. Cited by: 3rd item, §II.
  • [70] NVIDIA Tesla P100. External Links: Link Cited by: §II-E3.
  • [71] (2019) NVIDIA Tesla V100 Tensor Core GPU. External Links: Link Cited by: §II-E3.
  • [72] A. Olofsson (2016-10) Epiphany-V: A 1024-core 64-bit RISC processor — Parallella. External Links: Link Cited by: §II-G.
  • [73] T. Peng (2018-09) AI Chip Duel: Apple A12 Bionic vs Huawei Kirin 980. External Links: Link Cited by: 1st item.
  • [74] A. Podili, C. Zhang, and V. Prasanna (2017-07) Fast and efficient implementation of Convolutional Neural Networks on FPGA. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 11–18. External Links: Document, ISBN 978-1-5090-4825-0, Link Cited by: §II-D.
  • [75] N. Rao (2018-05) Beyond the CPU or GPU: Why Enterprise-Scale Artificial Intelligence Requires a More Holistic Approach. Cited by: 1st item.
  • [76] (2018-01) Rockchip Released Its First AI Processor RK3399Pro NPU Performance up to 2.4TOPs. Cited by: 7th item.
  • [77] A. Rodriguez (2017-11) Intel Processors for Deep Learning Training. External Links: Link Cited by: 1st item.
  • [78] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018-01) MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv preprint arXiv:1801.04381. External Links: 1801.04381, Link Cited by: §III.
  • [79] R. Smith (2014-11) NVIDIA Launches Tesla K80, GK210 GPU. External Links: Link Cited by: §II-E3.
  • [80] R. Smith (2016-04) NVIDIA Announces Tesla P100 Accelerator - Pascal GP100 Power for HPC. External Links: Link Cited by: §II-E3.
  • [81] R. Smith (2018-05) 16GB NVIDIA Tesla V100 Gets Reprieve; Remains in Production. External Links: Link Cited by: §II-E3.
  • [82] R. Smith (2018-11) AMD Announces Radeon Instinct MI60 & MI50 Accelerators Powered By 7nm Vega. External Links: Link Cited by: §II-E3.
  • [83] V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017-12) Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. External Links: Document, ISSN 0018-9219 Cited by: 3rd item, 1st item.
  • [84] P. Teich (2018-05) Tearing Apart Google’s TPU 3.0 AI Coprocessor. Cited by: 2nd item.
  • [85] D. Tenenbaum (2017-05) As computing moves to cloud, UW-Madison spinoff offers faster, cleaner chip for data centers. External Links: Link Cited by: §II-G.
  • [86] T. N. Theis and H. -. P. Wong (2017-03) The End of Moore’s Law: A New Beginning for Information Technology. Computing in Science Engineering 19 (2), pp. 41–50. External Links: Document, ISSN 1521-9615 Cited by: §I.
  • [87] F. Van Veen and S. Leijnen (Asimov Institute) (2019) The Neural Network Zoo. External Links: Link Cited by: 1st item.
  • [88] S. Williams, A. Waterman, and D. Patterson (2009-04) Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 52 (4), pp. 65–76. External Links: Document, ISSN 0001-0782, Link Cited by: §I.
  • [89] W. Wong (2018-09) BrainChip Unveils Akida Architecture. Cited by: §II-G.
  • [90] (2019) Xeon Phi. External Links: Link Cited by: 2nd item.
  • [91] J. Yoshida (2018-03) Startup Runs Spiking Neural Network on Arm — EE Times. External Links: Link Cited by: §II-G.
  • [92] E. Yu (2018-09) Alibaba to launch own AI chip next year — ZDNet. External Links: Link Cited by: §II-G.
  • [93] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong (2016) Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, ISLPED ’16, New York, NY, USA, pp. 326–331. External Links: Document, ISBN 978-1-4503-4185-1, Link Cited by: §II-D.
  • [94] J. Zhang and J. Li (2017) Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, New York, NY, USA, pp. 25–34. External Links: Document, ISBN 978-1-4503-4354-1, Link Cited by: §II-D.

    % top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.arXiv1606.06160ZhouShuchangWuYuxinNiZekunZhouXinyuWenHeZouYuheng1606.06160:Users/al17856/Documents/Mendeley Desktop/Zhou et al/Unknown/Zhou et al. - 2016 - DoReFa-Net Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.pdf:pdfjunDoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth GradientsLink2016-06@article{Zhou2016, abstract = {We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1$\backslash${\%} top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.}, archiveprefix = {arXiv}, arxivid = {1606.06160}, author = {Zhou, Shuchang and Wu, Yuxin and Ni, Zekun and Zhou, Xinyu and Wen, He and Zou, Yuheng}, eprint = {1606.06160}, file = {:Users/al17856/Documents/Mendeley Desktop/Zhou et al/Unknown/Zhou et al. 
- 2016 - DoReFa-Net Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.pdf:pdf}, month = {jun}, title = {{DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients}}, url = {http://arxiv.org/abs/1606.06160}, year = {2016}} GuoKaiyuanSuiLingzhiQiuJiantaoYuJinchengWangJunbinYaoSongHanSongWangYuYangHuazhongDocumentISSN 0278-0070IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systemsjan135–47Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGALink372018-01@article{guo2018angeleye, author = {Guo, Kaiyuan and Sui, Lingzhi and Qiu, Jiantao and Yu, Jincheng and Wang, Junbin and Yao, Song and Han, Song and Wang, Yu and Yang, Huazhong}, doi = {10.1109/TCAD.2017.2705069}, issn = {0278-0070}, journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems}, month = {jan}, number = {1}, pages = {35–47}, title = {{Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA}}, url = {http://ieeexplore.ieee.org/document/7930521/}, volume = {37}, year = {2018}} JiaoLiLuoChengCaoWeiZhouXuegongWangLingli2017 27th International Conference on Field Programmable Logic and Applications (FPL)DocumentISBN 978-9-0903-0428-1sep1–4IEEEAccelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGALink2017-09@inproceedings{jiao2017accelerating, author = {Jiao, Li and Luo, Cheng and Cao, Wei and Zhou, Xuegong and Wang, Lingli}, booktitle = {2017 27th International Conference on Field Programmable Logic and Applications (FPL)}, doi = {10.23919/FPL.2017.8056820}, isbn = {978-9-0903-0428-1}, month = {sep}, pages = {1–4}, publisher = {IEEE}, title = {{Accelerating Low Bit-Width Convolutional Neural Networks with Embedded FPGA}}, url = {http://ieeexplore.ieee.org/document/8056820/

    }, year = {2017}} Loihi is a 60-mm2chip fabricated in Intels 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and, most importantly, programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve LASSO optimization problems with over three orders of magnitude superior energy-delay-product compared to conventional solvers running on a CPU iso-process/voltage/area. This provides an unambiguous example of spike-based computation, outperforming all known conventional solutions.DaviesMSrinivasaNLinTChinyaGCaoYChodayS HDimouGJoshiPImamNJainSLiaoYLinCLinesALiuRMathaikuttyDMcCoySPaulATseJVenkataramananGWengYWildAYangYWangHDocument:Users/al17856/Documents/Mendeley Desktop/Davies et al/IEEE Micro/08259423.pdf:pdfISSN 0272-1732IEEE MicroAlgorithm design and analysis,Biological neural networks,CPU iso-process-voltage-area,Computational modeling,Computer architecture,Intels process,LASSO optimization problems,Loihi,Neuromorphics,Neurons,artificial intelligence,circuit optimisation,dendritic compartments,hierarchical connectivity,integrated circuit modelling,learning (artificial intelligence),locally competitive algorithm,machine learning,magnitude superior energy-delay-product,microprocessor chips,multiprocessing systems,neural chips,neuromorphic computing,neuromorphic manycore processor,on-chip learning,programmable synaptic learning rules,size 14 nm,spike-based computation,spiking neural networks,synaptic delaysjan182–99Loihi: A Neuromorphic Manycore Processor with On-Chip Learning382018-01@article{8259423, abstract = {Loihi is a 60-mm2chip fabricated in Intels 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and, most importantly, programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve LASSO optimization problems with over three orders of magnitude superior energy-delay-product compared to conventional solvers running on a CPU iso-process/voltage/area. 
This provides an unambiguous example of spike-based computation, outperforming all known conventional solutions.}, author = {Davies, M and Srinivasa, N and Lin, T and Chinya, G and Cao, Y and Choday, S H and Dimou, G and Joshi, P and Imam, N and Jain, S and Liao, Y and Lin, C and Lines, A and Liu, R and Mathaikutty, D and McCoy, S and Paul, A and Tse, J and Venkataramanan, G and Weng, Y and Wild, A and Yang, Y and Wang, H}, doi = {10.1109/MM.2018.112130359}, file = {:Users/al17856/Documents/Mendeley Desktop/Davies et al/IEEE Micro/08259423.pdf:pdf}, issn = {0272-1732}, journal = {IEEE Micro}, keywords = {Algorithm design and analysis,Biological neural networks,CPU iso-process-voltage-area,Computational modeling,Computer architecture,Intels process,LASSO optimization problems,Loihi,Neuromorphics,Neurons,artificial intelligence,circuit optimisation,dendritic compartments,hierarchical connectivity,integrated circuit modelling,learning (artificial intelligence),locally competitive algorithm,machine learning,magnitude superior energy-delay-product,microprocessor chips,multiprocessing systems,neural chips,neuromorphic computing,neuromorphic manycore processor,on-chip learning,programmable synaptic learning rules,size 14 nm,spike-based computation,spiking neural networks,synaptic delays}, month = {jan}, number = {1}, pages = {82–99}, title = {{Loihi: A Neuromorphic Manycore Processor with On-Chip Learning}}, volume = {38}, year = {2018}} Loihi is Intel’s novel, manycore neuromorphic processor and is the first of its kind to feature a microcode-programmable learning engine that enables on-chip training of spiking neural networks (SNNs). The authors present the Loihi toolchain, which consists of an intuitive Python-based API for specifying SNNs, a compiler and runtime for building and executing SNNs on Loihi, and several target platforms (Loihi silicon, FPGA, and functional simulator). To showcase the toolchain, the authors describe how to build, train, and use a SNN to classify handwritten digits from the MNIST database.LinCWildAChinyaG NCaoYDaviesMLaveryD MWangHDocument:Users/al17856/Documents/Mendeley Desktop/Lin et al/Computer/08303802.pdf:pdfISSN 0018-9162ComputerComputational modeling,FPGA,Intel Loihi silicon,Loihi toolchain,MNIST database,Mathematical model,Neuromorphic engineering,Programming,SNN,Synapses,application program interfaces,compiler,field programmable gate arrays,functional simulator,handwritten character recognition,handwritten digits,intuitive Python-based API,learning (artificial intelligence),manycore neuromorphic processor,microcode-programmable learning engine,neural chips,neural networks,neuromorphic computing,neuromorphic processor,on-chip training,programming paradigms,spiking neural networksmar352–61Programming Spiking Neural Networks on Intel’s Loihi512018-03@article{8303802, abstract = {Loihi is Intel's novel, manycore neuromorphic processor and is the first of its kind to feature a microcode-programmable learning engine that enables on-chip training of spiking neural networks (SNNs). The authors present the Loihi toolchain, which consists of an intuitive Python-based API for specifying SNNs, a compiler and runtime for building and executing SNNs on Loihi, and several target platforms (Loihi silicon, FPGA, and functional simulator). 
To showcase the toolchain, the authors describe how to build, train, and use a SNN to classify handwritten digits from the MNIST database.}, author = {Lin, C and Wild, A and Chinya, G N and Cao, Y and Davies, M and Lavery, D M and Wang, H}, doi = {10.1109/MC.2018.157113521}, file = {:Users/al17856/Documents/Mendeley Desktop/Lin et al/Computer/08303802.pdf:pdf}, issn = {0018-9162}, journal = {Computer}, keywords = {Computational modeling,FPGA,Intel Loihi silicon,Loihi toolchain,MNIST database,Mathematical model,Neuromorphic engineering,Programming,SNN,Synapses,application program interfaces,compiler,field programmable gate arrays,functional simulator,handwritten character recognition,handwritten digits,intuitive Python-based API,learning (artificial intelligence),manycore neuromorphic processor,microcode-programmable learning engine,neural chips,neural networks,neuromorphic computing,neuromorphic processor,on-chip training,programming paradigms,spiking neural networks}, month = {mar}, number = {3}, pages = {52–61}, title = {{Programming Spiking Neural Networks on Intel's Loihi}}, volume = {51}, year = {2018}} MossDuncan J. M.NurvitadhiErikoSimJaewoongMishraAsitMarrDebbieSubhaschandraSuchitLeongPhilip H. W.2017 27th International Conference on Field Programmable Logic and Applications (FPL)DocumentISBN 978-9-0903-0428-1sep1–4IEEEHigh performance binary neural networks on the Xeon+FPGA™ platformLink2017-09@inproceedings{moss2017high, author = {Moss, Duncan J. M. and Nurvitadhi, Eriko and Sim, Jaewoong and Mishra, Asit and Marr, Debbie and Subhaschandra, Suchit and Leong, Philip H. W.}, booktitle = {2017 27th International Conference on Field Programmable Logic and Applications (FPL)}, doi = {10.23919/FPL.2017.8056823}, isbn = {978-9-0903-0428-1}, month = {sep}, pages = {1–4}, publisher = {IEEE}, title = {{High performance binary neural networks on the Xeon+FPGA™ platform}}, url = {

    http://ieeexplore.ieee.org/document/8056823/}, year = {2017}} Recent researches on neural network have shown significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural network is now widely adopted in regions like image, speech and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. CPU platforms are hard to offer enough computation capacity. GPU platforms are the first choice for neural network process because of its high computation capacity and easy to use development frameworks. On the other hand, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.arXiv1712.08934GuoKaiyuanZengShulinYuJinchengWangYuYangHuazhong1712.08934:Users/al17856/Documents/Mendeley Desktop/Guo et al/arXiv preprint arXiv1712.08934/Guo et al. - 2017 - A Survey of FPGA-Based Neural Network Accelerator.pdf:pdfarXiv preprint arXiv:1712.08934decA Survey of FPGA-Based Neural Network AcceleratorLink2017-12@article{guo2017survey, abstract = {Recent researches on neural network have shown significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural network is now widely adopted in regions like image, speech and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. CPU platforms are hard to offer enough computation capacity. GPU platforms are the first choice for neural network process because of its high computation capacity and easy to use development frameworks. On the other hand, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.}, archiveprefix = {arXiv}, arxivid = {1712.08934}, author = {Guo, Kaiyuan and Zeng, Shulin and Yu, Jincheng and Wang, Yu and Yang, Huazhong}, eprint = {1712.08934}, file = {:Users/al17856/Documents/Mendeley Desktop/Guo et al/arXiv preprint arXiv1712.08934/Guo et al. 
- 2017 - A Survey of FPGA-Based Neural Network Accelerator.pdf:pdf}, journal = {arXiv preprint arXiv:1712.08934}, month = {dec}, title = {{A Survey of FPGA-Based Neural Network Accelerator}}, url = {http://arxiv.org/abs/1712.08934}, year = {2017}} TraderTiffanyHPC WireBenchmark,DL,MLBenchmark,DL,MLNvidia Leads Alpha MLPerf Benchmarking Round2018@misc{Trader2018, author = {Trader, Tiffany}, booktitle = {HPC Wire}, keywords = {Benchmark,DL,ML}, mendeley-tags = {Benchmark,DL,ML}, title = {{Nvidia Leads Alpha MLPerf Benchmarking Round}}, year = {2018}} MerrittRickEE TimesjulBaidu Accelerator Rises in AILink2018-07@misc{merritt2018baidu, author = {Merritt, Rick}, booktitle = {EE Times}, month = {jul}, title = {{Baidu Accelerator Rises in AI}}, url = {https://www.eetimes.com/document.asp?doc

    {\_}id=1333449}, year = {2018}} This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better address the computational requirements in domains such as high-performance computing, data analytics, computer vision, and machine learning. Second was the desire to introduce an extension that can scale across multiple implementations, both now and into the future, allowing CPU designers to choose the vector length most suitable for their power, performance, and area targets. Finally, the architecture should avoid imposing a software development cost as the vector length changes and where possible reduce it by improving the reach of compiler auto-vectorization technologies. SVE achieves these goals. It allows implementations to choose a vector register length between 128 and 2,048 bits. It supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation. Finally, it introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.arXiv1803.06185StephensNigelBilesStuartBoettcherMatthiasEapenJacobEyoleMbouGabrielliGiacomoHorsnellMattMagklisGrigoriosMartinezAlejandroPremillieuNathanaelReidAlastairRicoAlejandroWalkerPaulDocument1803.06185:Users/al17856/Documents/Mendeley Desktop/Stephens et al/IEEE Micro/1803.06185.pdf:pdfISBN 0272-1732 VO - 37ISSN 02721732IEEE MicroARM,HPC,SIMD,SVE,Scalable Vector Extension,VLA,Vector Length Agnostic,autovectorization,data parallelism,high-performance computing,instruction set architecture,predication,scalable vector architecture,vector length agnostic226–39The ARM Scalable Vector Extension372017@article{Stephens2017, abstract = {This article describes the ARM Scalable Vector Extension (SVE). Several goals guided the design of the architecture. First was the need to extend the vector processing capability associated with the ARM AArch64 execution state to better address the computational requirements in domains such as high-performance computing, data analytics, computer vision, and machine learning. Second was the desire to introduce an extension that can scale across multiple implementations, both now and into the future, allowing CPU designers to choose the vector length most suitable for their power, performance, and area targets. Finally, the architecture should avoid imposing a software development cost as the vector length changes and where possible reduce it by improving the reach of compiler auto-vectorization technologies. SVE achieves these goals. It allows implementations to choose a vector register length between 128 and 2,048 bits. It supports a vector-length agnostic programming model that lets code run and scale automatically across all vector lengths without recompilation. 
Finally, it introduces several innovative features that begin to overcome some of the traditional barriers to autovectorization.}, archiveprefix = {arXiv}, arxivid = {1803.06185}, author = {Stephens, Nigel and Biles, Stuart and Boettcher, Matthias and Eapen, Jacob and Eyole, Mbou and Gabrielli, Giacomo and Horsnell, Matt and Magklis, Grigorios and Martinez, Alejandro and Premillieu, Nathanael and Reid, Alastair and Rico, Alejandro and Walker, Paul}, doi = {10.1109/MM.2017.35}, eprint = {1803.06185}, file = {:Users/al17856/Documents/Mendeley Desktop/Stephens et al/IEEE Micro/1803.06185.pdf:pdf}, isbn = {0272-1732 VO - 37}, issn = {02721732}, journal = {IEEE Micro}, keywords = {ARM,HPC,SIMD,SVE,Scalable Vector Extension,VLA,Vector Length Agnostic,autovectorization,data parallelism,high-performance computing,instruction set architecture,predication,scalable vector architecture,vector length agnostic}, number = {2}, pages = {26–39}, title = {{The ARM Scalable Vector Extension}}, volume = {37}, year = {2017}}