Deep Neural Networks (DNNs) have been widely used to deliver unprecedented levels of accuracy in various applications. However, they rely on the availability of copious amounts of labeled training data, which can be costly to obtain as it requires human effort to label. To address this challenge, a new class of deep networks, called Generative Adversarial Networks (GANs), has been developed with the intention of automatically generating larger and richer datasets from a small initial labeled training dataset. GANs combine a generative model, which attempts to create synthetic data similar to the original training dataset, with a discriminative model, a conventional DNN that attempts to discern whether the data produced by the generative model is synthetic or belongs to the original training dataset. The generative and discriminative models compete with each other in a minimax situation, resulting in a stronger generator and discriminator. As such, GANs can create impressive new datasets that are hardly discernible from the original training datasets. With this power, GANs have gained popularity in numerous domains, such as medicine, where overly costly human-centric studies need to be conducted to collect relatively small labeled datasets [2, 3]. Furthermore, the ability to expand training datasets has gained considerable popularity in robotics, autonomous driving, and media synthesis [6, 7, 8, 9, 10, 11, 12] as well.
Currently, advances in acceleration for conventional DNNs are breaking the barriers to adoption [13, 14, 15, 16, 17, 18]. However, while GANs are set to push the frontiers in deep learning, there is a lack of hardware accelerators that address their computational needs. This paper sets out to explore this state-of-the-art dimension in deep learning from the hardware acceleration perspective. Given the abundance of accelerators for conventional DNNs [19, 20, 21, 22, 23, 24, 25, 26, 15, 27, 28, 16, 29, 17, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 18, 41, 42, 43], designing an accelerator for GANs is only attractive if GANs pose new challenges in architecture design. By studying the structure of emerging GAN models [6, 7, 8, 9, 10, 11, 12], we observe that they use a fundamentally different type of mathematical operator in their generative model, called transposed convolution, that operates on multidimensional input feature maps.
The transposed convolution operator aims to extrapolate information from input feature maps, in contrast to the conventional convolution operator, which aims to interpolate the most relevant information from input feature maps. As such, the transposed convolution operator first inserts zeros within the multidimensional input feature maps and then convolves a kernel over this expanded input, filling the inserted zeros with extrapolated information. The transposed convolution in GANs fundamentally differs from the operators in the backward pass of training conventional DNNs, as these do not insert zeros. Moreover, although there is a convolution stage in the transposed convolution operator, the inserted zeros lead to underutilization of the compute resources if a conventional convolution accelerator were to be used. The following highlights the sources of underutilization and outlines the contributions of this paper, which constitute the first accelerator design for GANs.
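The insert-zeros-then-convolve formulation can be sketched in a few lines of NumPy. This is an illustrative reference implementation, not the accelerator's dataflow; the function name and parameters are ours, and `stride` denotes the transposed-convolution stride, so `stride - 1` zeros are inserted between adjacent input elements:

```python
import numpy as np

def transposed_conv2d(x, w, stride=2, pad=1):
    """Transposed convolution expressed as zero insertion followed by an
    ordinary unit-stride convolution (illustrative sketch)."""
    h, wd = x.shape
    k = w.shape[0]
    # Insert (stride - 1) zero rows/columns between adjacent input elements.
    expanded = np.zeros((h + (h - 1) * (stride - 1),
                         wd + (wd - 1) * (stride - 1)))
    expanded[::stride, ::stride] = x
    # Pad the expanded input, then slide the kernel with unit stride.
    expanded = np.pad(expanded, pad)
    eh, ew = expanded.shape
    out = np.zeros((eh - k + 1, ew - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(expanded[i:i + k, j:j + k] * w)
    return out
```

With a 4×4 input, a 5×5 kernel, one inserted zero per dimension, and padding of two (the example dimensions used later in the paper), this produces a 7×7 output, i.e., the operator expands rather than reduces the input.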
Performing multiply-add on the inserted zeros is inconsequential. Unlike conventional convolution, the accelerator should skip over the zeros, as they constitute more than 60% of all the multiply-add operations, as Figure 1 illustrates. Skipping the zeros creates an irregular dataflow and diminishes data reuse if not handled adequately in the microarchitecture. To address this challenge, we propose a reorganization of the output computations that allocates output rows with similar patterns of zeros to adjacent processing engines. This forced adjacency reclaims data reuse across these neighboring compute units.
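The magnitude of this waste can be estimated by counting, for an illustrative layer, the multiply-adds whose input operand is an inserted zero or padding. This sketch (function and parameter names are ours) does that count for a layer with the example dimensions used later in the paper; the resulting fraction comfortably exceeds the 60% figure cited from Figure 1:

```python
import numpy as np

def zero_mac_fraction(h, w, k, stride, pad):
    """Fraction of multiply-adds that touch an inserted zero or padding
    when a transposed convolution is run as a plain convolution."""
    # 1 marks a real input element; 0 marks an inserted zero or padding.
    mask = np.zeros((h + (h - 1) * (stride - 1),
                     w + (w - 1) * (stride - 1)))
    mask[::stride, ::stride] = 1.0
    mask = np.pad(mask, pad)
    eh, ew = mask.shape
    total = useful = 0
    for i in range(eh - k + 1):
        for j in range(ew - k + 1):
            total += k * k
            useful += int(mask[i:i + k, j:j + k].sum())
    return 1.0 - useful / total
```

For a 4×4 input, 5×5 kernel, one inserted zero per dimension, and padding of two, roughly 79% of the multiply-adds are inconsequential.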
Reorganizing the output computations is inevitable but breaks the SIMD execution model. The inserted zeros, even with the output computation reorganization, create distinct patterns of computation when sliding the convolution window. As such, the same sequence of operations cannot be repeated across all the processing engines, breaking the full SIMD execution model. Therefore, we propose a unified MIMD-SIMD accelerator architecture that exploits repeated patterns in the computations to create different microprograms that can execute concurrently in SIMD mode. To maximize the benefits from both levels of parallelism, we propose an architecture, called GANAX, that supports interleaving MIMD and SIMD operations at the finest granularity of a single microprogrammed operation.
MIMD is inevitable but its overhead needs to be amortized. Changes in the dataflow and the computation order necessitate irregular accesses to multiple different memory structures while the operations are still the same. That is, the data processing part can be SIMD, but the irregular data access patterns prevent using this execution model. For GANAX, we propose decoupling data accesses from data processing. This decoupling breaks each processing engine into an access micro-engine and an execute micro-engine. The proposed architecture extends the concept of access-execute architectures [44, 45, 46, 47] to the finest granularity of computation for each individual operation. Although GANAX addresses these challenges to enable efficient execution of the transposed convolution operator, it imposes no extra overhead for conventional convolution, instead offering the same level of performance and efficiency. To establish the effectiveness of our architectural innovation, we evaluate GANAX using six recent GAN models on distinct applications. On average, GANAX delivers 3.6× speedup and 3.1× energy savings over a conventional convolution accelerator. These results indicate that GANAX is an effective step towards designing accelerators for the next generation of deep networks.
II. Background and Motivation
To go beyond image and text recognition, machines need a model of how the world functions, that is, the ability to predict. The most notable successes in deep learning have come from using supervised learning to map a high-dimensional input to a class label. Conventional supervised learning scales poorly, as it requires large amounts of labeled data. Hence, there are two main challenges: (1) labeling millions of data points requires extensive time and effort, and (2) in various applications, even generating the data itself is complex and requires significant time and effort. To overcome these challenges, various semi-supervised and unsupervised learning techniques have been introduced. The goal of unsupervised learning is to learn, purely from unlabeled data, representations that are (i) interpretable, (ii) easily transferable to novel tasks and novel object categories, and (iii) able to disentangle the informative representation of the data from the noise. This capability is key to enabling machines to predict and comprehend without significant human intervention for training. Adversarial networks have recently emerged as an alternative way to efficiently train machines. An adversarial network consists of a generator and a discriminator model, where the former tries to generate outputs that are as close as possible to their real counterparts while the latter tries to distinguish the generator's outputs from the real ones. Given the feedback loop between these two networks, they optimize themselves such that (1) the generator produces more realistic outputs and (2) the discriminator makes significantly more accurate predictions. Generative adversarial networks were recently introduced by Goodfellow et al. to bridge the gap between the successes of supervised and unsupervised learning.
In their proposed GAN, the generative model is set in competition with an adversary: a discriminative model that learns to predict whether a sample comes from the model distribution or the data distribution. In another work, Mirza et al. introduced conditional GANs to provide control over data generation for various modes. Radford et al. proposed a class of CNNs called deep convolutional generative adversarial networks (DCGANs) that shows competitive performance compared with other unsupervised learning algorithms while exhibiting stable training in most settings. Salimans et al. proposed new architectural features and training procedures applicable to GANs. Their primary objective is to improve the effectiveness of generative adversarial networks for semi-supervised learning by learning from additional unlabeled data. Finally, Liu et al. recently proposed the coupled generative adversarial network (CoGAN) in the context of image recognition, where CoGAN can learn a joint distribution without any tuple of corresponding images. As machine learning becomes ubiquitous in every imaginable application, unsupervised learning grows increasingly important, as it has the capability to unlock the true potential of artificial intelligence. As mentioned above, the existing literature primarily focuses on various system-level and algorithmic-level enhancements of GANs for unsupervised learning. For unsupervised learning methods and algorithms to become practical, we need proper hardware solutions that optimize the overall power/performance envelope, allowing us to sustain the exponentially increasing demand for computational power.
II-A. Generative Adversarial Neural Networks
Generative model. A transposed convolutional layer carries out a regular convolution but reverses its spatial transformation.
There are three main motivations for designing a new accelerator for generative adversarial neural networks:
Since there is an imbalance in the number of non-zero rows, some of the PEs in Eyeriss remain idle during the computation of the generative models.
A large fraction of the MAC operations within each deconvolution window are zero. Therefore, each PE wastes a large fraction of its cycles performing ineffectual operations.
For the deconvolution operation, each input feature map is padded with a large number of zeros. Sending these zeros to each PE wastes a significant amount of energy and cycles.
GANAX aims to resolve these problems while efficiently accelerating both generative and discriminative models.
III. Flow of Data in Generative Models
Generative Adversarial Networks (GANs) have revolutionized modern machine learning by significantly improving generative models while using only a limited amount of labeled training data. Figure 2 shows an overall visualization of a GAN, consisting of two deep neural network models, a generative model and a discriminative model. These two neural network models oppose each other in a minimax situation. Specifically, the generative model tries to generate data that will trick the discriminative model into believing the data is from the original training dataset. Meanwhile, the discriminative model is handed data from either the generative model or the training data and tries to discern between the two. As these networks compete with each other, they refine their abilities to generate and discriminate, respectively. This process creates a stronger generative model and discriminative model than could be obtained otherwise. This arrangement of neural networks has opened up many applications, some of which include music generation with accompaniment and the discovery of new drugs to cure diseases. GANs are enabling our future by pushing forward development in autonomous vehicles, allowing us to imitate human drivers and simulate driving scenarios to save testing and training costs. GANs enable imagination, a major advancement for machine learning and a key step towards true general artificial intelligence. Here, we overview the challenges and opportunities encountered while designing hardware accelerators for GANs.
Challenges and opportunities for GAN acceleration.
The generative models in GANs are fundamentally different from the discriminative models. As Figure 2 illustrates, while the discriminative model mostly consists of convolution operations, the generative model uses transposed convolution operations. Accelerating convolution operations has been the focus of a large body of studies [19, 20, 21, 22, 23, 24, 25, 26, 15, 27, 28, 16, 29, 17, 30, 31, 32, 33, 34, 35, 36]; however, accelerating transposed convolution operations has remained unexplored. Figure 3 depicts the fundamental difference between the conventional convolution and transposed convolution operations. The convolution operation performs data reduction and generally transforms the input data to a smaller representation. On the other hand, the transposed convolution implements a data expansion and transforms the input data to a larger representation. The transposed convolution operation expands the data by first transforming the input through inserting zeros between the input rows and columns and then performing the computations by sliding a convolution window over the transformed input. Due to this fundamental difference between convolution and transposed convolution, using the conventional convolution dataflow for the generative model leads to inefficiency. The main reason for this inefficiency is the variable number of operations per convolution window in the transposed convolution, which results from the zero-insertion step. Because of this zero-insertion step, distinct convolution windows may have different numbers of consequential multiplications between inputs and weights. (A consequential multiplication is one in which neither source operand is zero, and which thus contributes to the final value of the convolution operation.)
This discrepancy in the number of operations is the root cause for inefficiency in the computations of generative models, if the same convolution dataflow is used. As such, we aim to design an efficient flow of data for GANs by focusing on: (1) managing the discrepancy in the number of operations per each convolution window in order to mitigate the inefficiencies in the execution of generative models, (2) leveraging the similarities between convolution and transposed convolution operations in order to accelerate both discriminative and generative models on the same hardware platform, and (3) improving the data reuse in discriminative and generative models.
Why is using a conventional convolution dataflow inefficient for transposed convolution?
Going through a simple example of a 2D transposed convolution, we illustrate the main sources of inefficiency in performing transposed convolution when a conventional convolution dataflow is used. Figure 4(a) illustrates an example of performing a transposed convolution operation using a conventional convolution dataflow. In this transposed convolution operation, a 5×5 filter with a stride of one and padding of two is applied to a 4×4 2D input. In the initial step, the transposed convolution operation inserts one row and one column of zeros between successive rows and columns (white squares). After this zero-insertion step and padding, the input is expanded from a 4×4 matrix to an 11×11 one. The number of zeros to be inserted for each transposed convolution layer in the generative models may vary from one layer to another and is a parameter of the network. After the zero insertion, the next step is to slide a convolution window over the transformed input and perform the multiply-add operations. Figure 4(b) illustrates performing this convolution operation using a conventional convolution dataflow [20, 22, 16]. To avoid clutter in Figure 4(b), we only show the dataflow for generating output rows 2-5.
Each circle in Figure 4(b) represents a compute node that can perform a vector-vector multiplication between a row of the filter and a row of the zero-inserted input. The filter rows are spatially reused across the compute nodes in a vertical manner. Once the vector-vector multiplications finish, the partial sums are aggregated horizontally to yield the result of the transposed convolution operation for each output row. The black circles represent compute nodes performing consequential operations, whereas the white circles represent compute nodes performing inconsequential operations. As depicted in Figure 4(b), there will be inconsequential operations (white circles) if a conventional convolution dataflow is used for the execution of transposed convolution operations. Because of the inserted zeros, some of the filter rows are not used to compute the value of an output row. For example, since the 1st, 3rd, and 5th input rows covered by the window are zero, the 2nd output row only needs the operations for the non-zero elements; hence it uses only the 2nd and 4th filter rows, leaving three compute nodes idle. Overall, in this example, 50% of the compute nodes remain idle during the execution of this transposed convolution operation. Analyzing this transposed convolution operation reveals three main sources of inefficiency when a conventional convolution dataflow is used.
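The set of consequential filter rows for each output row can be enumerated directly. The following sketch is our own helper (0-indexed output rows, 1-indexed filter rows) and reproduces the example's patterns under the stated assumptions of a 5×5 filter, one inserted zero row, padding of two, and a 4×4 input:

```python
def consequential_filter_rows(out_row, k, stride, pad, h):
    """Filter rows (1-indexed) that touch a real (non-inserted) input row
    when computing the given 0-indexed output row. Real rows sit at
    positions pad, pad + stride, ..., pad + (h - 1) * stride."""
    rows = []
    for fr in range(k):
        off = out_row + fr - pad   # position in the expanded, padded input
        if off >= 0 and off % stride == 0 and off // stride < h:
            rows.append(fr + 1)
    return rows
```

For the 2nd output row (index 1), only the 2nd and 4th filter rows are consequential, matching the description above; for the 3rd output row (index 2), the 1st, 3rd, and 5th filter rows are consequential.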
Coarse-grain resource underutilization: Since the consequential filter rows vary from one output row to another, a significant number of compute nodes remain idle. In the aforementioned example, this underutilization applies to 50% of the compute nodes, which perform vector-vector multiplications.
Fine-grain resource underutilization: Even within a compute node a large fraction of the multiply-add operations are inconsequential due to the columnar zero insertion.
Reuse reduction: While the compute units pass along the filter rows for data reuse, the inserted zeros render this data transfer futile.
We address the first two sources of inefficiency with a series of optimizations on the flow of data in GANs. Then, to address the last source of inefficiency that arises because of the inconsequential multiply-add operations within each compute node, we introduce an architectural solution (Section IV).
Flow of data for generative models in GANAX. Figure 5 illustrates the proposed flow-of-data optimizations for generative models in GANAX. To mitigate the challenges of using a conventional convolution dataflow for the transposed convolution operations in generative models, we leverage the insight that even though the patterns of computation may vary from one output row to another, they are still structured. Taking a closer look at Figure 4, we observe that there are only two distinct patterns (the locations of the white and black circles, i.e., the compute nodes, define each pattern) in the output row computations. In this example, the even output rows (i.e., 2 and 4) use one pattern of computation, whereas the odd output rows (i.e., 3 and 5) use a different pattern. Building upon this observation, we introduce a series of flow-of-data optimizations to mitigate the aforementioned inefficiencies in the computation of the transposed convolution operation when a conventional convolution dataflow is used.
The first optimization maximizes data reuse by reorganizing the computation of the output rows such that rows with the same pattern in their computations become adjacent. Figure 5(a) illustrates the flow of data after applying this output row reorganization. In this example, the reorganization makes the even-indexed output rows (2 and 4) adjacent. Similar adjacency is established for the odd-indexed output rows (3 and 5). Although this optimization addresses the data reuse problem, it does not deal with the resource underutilization (i.e., idle compute nodes (white circles) still exist). To mitigate this resource underutilization, we introduce a second optimization that reorganizes the filter rows. As shown in Figure 5(b), applying the filter row reorganization establishes adjacency for the 1st, 3rd, and 5th filter rows. Similarly, the 2nd and 4th filter rows become adjacent. After applying both output and filter row reorganization, as shown in Figure 5(b), the idle compute nodes can simply be eliminated from the dataflow. Figure 5(c) illustrates the flow of data after performing both optimizations, which improves the resource utilization for the transposed convolution operation from 50% to 100%.
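The output-row reorganization amounts to grouping output rows by their pattern of consequential filter rows. A minimal sketch of this grouping (our own code, using 0-indexed rows and the example's 5×5 filter, stride-2 zero insertion, padding of two, and 4×4 input) shows the interior rows collapsing into the two patterns of Figure 4, while boundary rows get truncated patterns of their own:

```python
from collections import defaultdict

def group_output_rows(num_out_rows, k, stride, pad, h):
    """Group output rows by their pattern of consequential filter rows
    (0-indexed), so rows with identical patterns can be mapped to
    adjacent processing engines."""
    groups = defaultdict(list)
    for r in range(num_out_rows):
        pattern = tuple(
            fr for fr in range(k)
            if (r + fr - pad) >= 0
            and (r + fr - pad) % stride == 0
            and (r + fr - pad) // stride < h
        )
        groups[pattern].append(r)
    return dict(groups)
```

For the running example, output rows 1, 3, and 5 share one pattern and rows 2 and 4 share another, which is exactly the adjacency the reorganization exploits.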
The proposed flow of data also addresses the inefficiency in performing the horizontal accumulation of partial sums. As shown in Figure 4(b), the conventional convolution dataflow requires five cycles to perform the horizontal accumulation for each output row, regardless of its location. However, comparing Figure 4(b) and Figure 5(c), we observe that after applying the output and filter row reorganization optimizations, the number of cycles required for the horizontal accumulation drops from five to two for even-indexed output rows and from five to three for odd-indexed output rows. While the proposed flow-of-data optimizations effectively improve the resource utilization for transposed convolution, an interesting architectural challenge arises: how do we fully utilize the parallelism between the computations of output rows that require different numbers of cycles for horizontal accumulation (two cycles for even-indexed and three cycles for odd-indexed output rows)? If a SIMD execution model is used, some of the compute nodes have to remain idle until the accumulations finish for the output rows that require more cycles. The next section elaborates on the architecture that exploits the introduced flow of data for transposed convolution and fully utilizes the parallelism between distinct output rows by conjoining the MIMD and SIMD execution models.
IV. Architecture Design for GANAX
The execution flow of the generative model in GANs (i.e., zero insertion and a variable number of operations per convolution window) poses unique architectural challenges that traditional convolution accelerators [22, 20, 16, 17, 18] cannot adequately address. There are two fundamental architectural challenges for GAN acceleration, as follows:
Resource underutilization. The first challenge arises due to the variable number of operations per convolution window in the transposed convolution operation. In most recent accelerators [20, 22, 17, 18], which mainly target the conventional convolution operation, the processing engines generally work in a SIMD manner. The convolution windows in conventional convolution follow a regular pattern, and the number of operations for each of these windows remains constant. Given these algorithmic characteristics of conventional convolution, a SIMD execution model is an efficient and practical choice. However, since the convolution windows in transposed convolution exhibit a variable number of operations, a SIMD execution model is not an adequate design choice for these operations. While a SIMD model exploits the data parallelism between convolution windows with the same number of operations, it is limited in exploiting parallelism across windows with different numbers of operations. That is, if one uses a convolution accelerator with a SIMD execution model for transposed convolution, the processing engines performing the operations for a convolution window with fewer operations have to remain idle until the operations for the other convolution windows finish. To address this challenge, we introduce a unified MIMD-SIMD architecture to accelerate the transposed convolution operation without compromising the efficiency of conventional convolution accelerators for convolution operations. This unified MIMD-SIMD architecture maximizes the utilization of accelerator compute resources while effectively exploiting the parallelism between convolution windows with different numbers of operations.
Inconsequential computations. The second challenge emanates from the large number of zeros inserted in the multidimensional input feature map for transposed convolution operations. Performing MAC operations on these zeros is inconsequential and, if not skipped, wastes accelerator resources (see Figure 1). We address this challenge by leveraging the observation that even though the data access patterns in transposed convolution operations are irregular, they are still structured. Furthermore, these structured patterns are repetitive across the execution of transposed convolution operations. Building upon these observations, the GANAX architecture decouples operand access from execution. Each processing engine in this architecture contains a simple access engine that repetitively generates the addresses for operand accesses without interrupting the execute engine. In the next sections, we examine these architectural challenges in detail for GAN acceleration and expound the proposed microarchitectural solutions.
IV-A. Unified MIMD-SIMD Architecture
To mitigate the resource underutilization, we devise a unified SIMD-MIMD architecture that reaps the benefits of the SIMD and MIMD execution models at the same time. That is, while our architecture executes the operations for convolution windows with distinct computation patterns in a MIMD manner, it performs the operations of convolution windows with the same computation pattern in a SIMD manner. Figure 6 illustrates the high-level diagram of the architecture, which is comprised of a set of identical processing engines (PEs). The PEs are organized in a 2D array and connected through a dedicated network. Each PE consists of two µ-engines, namely the access µ-engine and the execute µ-engine. The access µ-engine generates the addresses for source and destination operands, whereas the execute µ-engine merely performs simple operations such as multiplication, addition, and multiply-add. The memory hierarchy is composed of an off-chip memory and two separate on-chip global buffers, one for data and one for ops. These global on-chip buffers are shared across all the PEs. Each PE operates on one row of the filter and one row of the input and generates one row of partial sum values. The partial sum values are further accumulated horizontally across the PEs to generate the final output value. Using a SIMD model for transposed convolution operations leads to resource underutilization: the PEs that perform the computation for convolution windows with fewer operations remain idle, wasting computational resources. A simple solution is to replace the SIMD model with a fully MIMD computing model and exploit the parallelism between convolution windows with different numbers of operations. However, a MIMD execution model requires augmenting each processing engine with a dedicated operation buffer. While this design resolves the underutilization of resources, it imposes a large area overhead, increasing area consumption by 3×.
Furthermore, fetching and decoding instructions from each of these dedicated operation buffers significantly increases the von Neumann overhead of instruction fetch and decode. To address these challenges, we design the GANAX architecture upon the observation that PEs in the same row perform the same operations for long periods of time. The proposed architecture leverages this observation and develops a middle ground between a fully SIMD and a fully MIMD execution model. The goal of the GANAX architecture design is two-fold: (1) reduce PE underutilization by combining MIMD and SIMD models of computation for transposed convolution operations, (2) without compromising the efficiency of the SIMD model for conventional convolution operations. Next, we explain the two novel microarchitectural components that enable an efficient MIMD-SIMD accelerator design for GAN acceleration.
Hierarchical op buffers. To enable a unified MIMD and SIMD model of execution, we introduce a two-level op buffer. Figure 6 illustrates the high-level structure of the two-level op buffer, which consists of a global and a local op buffer. The local and global op buffers work cooperatively to perform the computations for GANs. Each horizontal group of PEs, called a processing vector (PV), shares a local op buffer, whereas the global op buffer is shared across all the PVs. The accelerator can operate in two distinct modes: SIMD mode and MIMD-SIMD mode. Since all the convolution windows in the conventional convolution operation have the same number of multiply-adds, the SIMD execution model is the best fit; in this case, the global op buffer bypasses the local op buffers and broadcasts the fetched op to all the PEs. On the other hand, since the number of operations varies from one convolution window to another in the transposed convolution operation, the accelerator works in MIMD-SIMD mode. In this mode, the global op buffer sends distinct indices to each local op buffer. Upon receiving its index, each local op buffer broadcasts the op at the location pointed to by the received index to all its underlying PEs. The MIMD-SIMD mode enables the accelerator not only to exploit the parallelism between convolution windows with the same number of operations, but also to exploit the parallelism across windows with distinct numbers of operations.
Global op buffer. Before the computations of a layer start, a sequence of high-level instructions, which defines the structure of each GAN layer, is statically translated into a series of ops. These ops are pre-loaded into the global op buffer, and then the execution starts. Each op either performs an operation across all the PEs (SIMD) or initiates an op in each PV (MIMD-SIMD). The initiated operation in the MIMD-SIMD mode may vary from one PV to another. The SIMD and MIMD ops can be stored in the global op buffer in any order. A 1-bit field in the global op identifies the type of op: SIMD or MIMD-SIMD. In the SIMD mode, in which all the PEs share the same op globally but execute it on distinct data, the global op defines the intended operation to be performed by all the PEs. In this mode, the local op buffers are bypassed and the global op is broadcast to all the PEs at the same time. Upon receiving the op, all the PEs perform the same operation, but on distinct data. In the MIMD-SIMD mode, in which all the PEs within the same PV share the same op but different PVs may execute different ops, the global op is partitioned into multiple fields (one field per PV), each of which defines an index for accessing an entry in the local op buffer. Upon receiving its index, each local op buffer retrieves the op stored at the given index and broadcasts it to all the PEs it controls. The global op buffer is double-buffered so that the ops for the next GAN layer can be loaded into the buffer while the ops for the current layer are being executed.
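The two dispatch modes of the global op buffer can be summarized in a short behavioral sketch. The types and names below are ours, not the actual microarchitecture: a 1-bit field selects between broadcasting one op to every PE and sending each PV an index into its own local op buffer.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class GlobalOp:
    mimd: bool  # 1-bit mode field: False = SIMD, True = MIMD-SIMD
    # SIMD: the op itself; MIMD-SIMD: one local-buffer index per PV.
    payload: Union[str, List[int]]

def dispatch(op: GlobalOp, local_buffers: List[List[str]]) -> List[str]:
    """Return the op each processing vector (PV) executes this cycle."""
    if not op.mimd:
        # SIMD: bypass the local buffers, broadcast the same op to every PV.
        return [op.payload] * len(local_buffers)
    # MIMD-SIMD: each PV fetches its own op at the index named in the global op.
    return [buf[idx] for buf, idx in zip(local_buffers, op.payload)]
```

Within a PV, the returned op is still broadcast to all PEs, so SIMD execution is preserved at the PV level even in MIMD-SIMD mode.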
Local op buffer. In the GANAX architecture, each PV has a dedicated local op buffer. In the SIMD mode, the local op buffers are completely bypassed and all the PEs perform the same operation sent from the global op buffer. In the MIMD-SIMD mode, each local op buffer is accessed at the location specified by a dedicated field in the global op. This location may vary from one local op buffer to another. The fetched op is then broadcast to all the PEs within the PV, which perform the same operation but on distinct data. Each GAN layer may require a distinct sequence of ops both globally and locally. Furthermore, each PE may need to access millions of operands at different locations to perform the computations of a GAN layer. Therefore, we would need not only to add large op buffers to each PE, but also to drain and refill these op buffers multiple times. Adding large buffers to the PEs incurs a large area overhead, area that could otherwise be used to increase the computing power of the accelerator. Moreover, the process of draining and refilling the op buffers imposes a significant overhead in terms of both performance and energy. To mitigate these overheads, we introduce a decoupled access-execute microarchitecture that enables us to significantly reduce the size of the op buffers and eliminate the need to drain and refill the local op buffers for each GAN layer.
IV-B Decoupled Access-Execute Engines
Though the data access patterns in the transposed convolution operation are irregular, they are still structured. Furthermore, the data access patterns are repetitive across the convolution windows. Building upon this observation, we devise a microarchitecture that decouples the data accesses from the data processing. Figure 7 illustrates the organization of the proposed decoupled access-execute architecture, which consists of two major microarchitectural units: one for address generation (the access engine) and one for performing the operations (the execute engine).
The access engine generates the addresses for the input, weight, and output buffers, which consume the generated addresses for each data read/write. The execute engine, on the other hand, receives the data from the input and weight buffers, performs an operation, and stores the result in the output buffer. The ops of these two engines are entirely segregated; however, the access and execute engines work cooperatively to perform an operation. The ops for the access engine handle the configuration of the index generator units, while the ops for the execute engine only specify the type of operation to be performed on the data. As such, the execute ops do not need to include any fields for specifying the source/destination operands. Every cycle, the access engine sends out the addresses for the source and destination operands based on its preconfigured parameters. The execute engine then performs an operation on the source operands, and the result is stored in the location defined by the access engine. Having decoupled engines for accessing the data and executing the operations has the paramount benefit of reusing execute ops. Since there is no address field in the execute ops, we can reuse the same execute op on distinct data over and over again without changing any fields in the ops. Reusing the same op on distinct data helps to significantly reduce the size of the op buffers.
Access engine. Figure 7 illustrates the microarchitectural units of the access engine. The main function of the access engine is to generate the addresses for the source and destination operands based on a preloaded configuration. While designing a full-fledged access engine capable of generating arbitrary patterns of data addresses would give the accelerator maximal flexibility, it is overkill for our target application (i.e., GANs). As mentioned in the dataflow section (Section III), the data access patterns for transposed convolution operations are irregular, yet structured. Based on our analysis of the evaluated GANs, we observe that the data accesses in the dataflow are either strided or sequential. The stride value for a strided data access pattern depends on the number of inserted zeros in the multidimensional input activation. Furthermore, these data access patterns are repetitive across a large number of convolution windows and for a large number of cycles. We leverage these observations to simplify the design of the access engine. Figure 7(a) depicts the block diagram of the access engine, which mainly consists of one or more strided index generators. An index generator can generate one address every cycle, following a pattern governed by a preloaded configuration. Since the data access patterns may vary from one layer to another, we design a reconfigurable index generator.
Figure 7(b) depicts the block diagram of the proposed reconfigurable index generator. There are five configuration registers that govern the pattern for data address generation.
The Addr. configuration register specifies the initial address from which the data address generation starts, while the Offset configuration register can be used to offset the range of generated addresses as needed. The Step configuration register specifies the step size between two consecutive addresses, while the End configuration register specifies the final value up to which the addresses should be generated. Finally, the Repeat configuration register indicates the number of times that a configured data access pattern should be replayed. The modulo adder, which consists of an adder and a subtractor, enables data address generation in a rotating manner. It performs a modulo addition on the values stored in the Addr. and Step registers. If the result of this modulo addition is less than the value in the End register, the calculated result is sent to the output; that is, the next address to be generated is still within the range defined by the Addr. and End register values. Otherwise, the result of the modulo addition minus the value of the End register is sent to the output; that is, the next address is beyond the End register value and the address generation process must start over from the beginning. In this scenario, the Decrement signal is also asserted, which causes the value of the Repeat register to be decreased by one, indicating that one round of address generation has finished. Once the Repeat register reaches zero, the Stop signal is asserted and no more addresses are generated. After configuring the parameters, the strided index generator can yield one address per cycle without any further intervention from the controller. Using this configurable index generator, along with the observation that the data address patterns in GANs are structured, the GANAX architecture can bypass the inconsequential computations and save both cycles and energy.
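The register behavior described above can be sketched as a small Python model (ours, not the authors' implementation; the register names follow the text, and the one-address-per-cycle timing is abstracted into a generator):

```python
# Model of the reconfigurable strided index generator: the five
# configuration registers are Addr., Offset, Step, End, and Repeat.
def strided_index_generator(addr, offset, step, end, repeat):
    """Yield one address per 'cycle' until Repeat rounds have finished."""
    while repeat > 0:
        yield offset + addr
        nxt = addr + step        # modulo adder: add Step ...
        if nxt < end:
            addr = nxt
        else:
            addr = nxt - end     # ... and wrap around past End
            repeat -= 1          # Decrement: one round of generation done
    # Stop asserted: no further addresses are produced
```

With Addr=0, Step=2, End=6, Repeat=1 this yields the strided sequence 0, 2, 4 and then stops; a non-zero Offset shifts the whole range.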
Execute engine. Figure 7(b) depicts the microarchitectural units of the execute engine. The execute engine consists of an ALU, which can perform simple operations such as addition, multiplication, comparison, and multiply-add. The main job of the execute engine is simply to perform an operation on the received data. Each cycle, the execute engine consumes one op from the op FIFO, performs the operation on the source operands, and stores the result back into the destination operand. If the op FIFO becomes empty, the execute engine halts and no further operation is performed. In this case, all the input/weight/output buffers are notified to stop their reads/writes. The decoupling between the access and execute engines enables us to remove the address field from the execute ops, which in turn allows us to reuse the same ops over and over again on different data. Furthermore, we leverage this op reuse and the fact that the computation of a CNN requires only a small set of ops (around 16) to simplify the design of the op buffers. Instead of draining and refilling the op buffers, we preload all the necessary ops for the convolution and transposed convolution operations into the op buffers. For the local op buffers, we load all the ops before starting the computation of a GAN.
Synchronization between engines. In the GANAX architecture (Figure 7), there is one address FIFO for each strided index generator. The address FIFOs perform the synchronization between the access engine and the execute engine. Once an address is generated by a strided index generator, it is pushed into the corresponding address FIFO. The addresses in the address FIFOs are later consumed to read/write data from/into the data buffers (i.e., input/weight/output buffers). If any of the address FIFOs is full, the corresponding strided index generator stops generating new addresses. If any of the address FIFOs is empty, no data is read/written from/into its corresponding data buffer.
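The backpressure scheme above amounts to a bounded producer-consumer queue. The following sketch (ours; the FIFO depth and the simplified same-cycle timing are assumptions) shows one simulated cycle of the handshake:

```python
# Bounded address FIFO between the two engines: the access engine
# stalls when the FIFO is full; the execute engine stalls when empty.
from collections import deque

FIFO_DEPTH = 4  # assumed depth, not from the paper

def step(indexer, fifo, execute):
    """One cycle: produce into and consume from a bounded address FIFO."""
    if len(fifo) < FIFO_DEPTH:        # access-engine side: not full
        addr = next(indexer, None)    # draw the next generated address
        if addr is not None:
            fifo.append(addr)
    if fifo:                          # execute-engine side: not empty
        execute(fifo.popleft())       # consume one address
```

Driving `step` with an exhausted index generator simply idles both sides, mirroring the stall behavior described in the text.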
V Instruction Set Architecture Design (Ops)
The ISA should provide a set of ops to efficiently map the proposed flow of data for both generative and discriminative models onto the accelerator. Furthermore, these ops should be sufficiently flexible to serve the distinct computation patterns of both the convolution and transposed convolution operations. Finally, to keep the size of the op buffers modest, the set of ops should be succinct. To achieve these multifaceted goals, we first introduce a set of algorithmic observations associated with GAN models. Then, we introduce the major ops that enable the execution of GAN models on GANAX.
V-A Algorithmic Observations
The following elaborates a set of algorithmic observations that form the foundation of the ops.
(1) MIMD/SIMD execution model. Due to the regular and structured computation patterns across the convolution windows in conventional DNNs, they are best suited for SIMD processing. However, the computation patterns of GANs are inherently different between the generative and discriminative models. Due to the inserted zeros in the generative models, their computation patterns vary from one convolution window to another. We observe that exploiting a combination of the SIMD and MIMD execution models can be more efficient in accelerating GAN models than solely relying on SIMD. Therefore, the focus of the ops is to include the operations that fully utilize the SIMD and MIMD execution models.
(2) Repetitive computation patterns. We observe that even though GANs require a large number of computations, most of these computations are similar between the generative and discriminative models. In addition, these computations are repetitive over a long period of time. Building upon this observation, we introduce a customized repeat op that significantly reduces the op footprint. In addition, the commonality between the operations in the generative and discriminative models allows us to design a succinct, yet representative, set of ops. To further reduce the op footprint, we introduce a dedicated set of execute ops that only define the type of the operations. These ops are reused for distinct data during the execution of generative and discriminative models on the GANAX architecture.
(3) Structured and repetitive memory access patterns. We observe that despite the irregularity of memory access patterns in generative models, they are still structured and repetitive. Analyzing the data access patterns of various GANs reveals that their memory access patterns are either sequential or strided. Building upon this observation and our decoupled access-execute architecture, we introduce a set of access ops that are used merely to configure the access engines and initiate the address generation process. Once initiated, the access engines generate the configured access patterns over and over until they are interrupted.
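A 1-D toy example (ours, not the authors' code) makes the strided nature of these accesses concrete: transposed convolution with stride s first inserts s-1 zeros between input elements, so only every s-th tap of each convolution window lands on real data.

```python
# Why the generator's accesses are strided: zero-insertion places the
# non-zero data at every `stride`-th position of the upsampled input.
def zero_insert(x, stride):
    out = [0] * (stride * (len(x) - 1) + 1)
    out[::stride] = x          # real data sits at strided positions
    return out

def tconv1d(x, w, stride):
    """Transposed convolution as zero-insertion + plain convolution
    (no padding, for simplicity)."""
    up = zero_insert(x, stride)
    return [sum(up[i + j] * w[j] for j in range(len(w)))
            for i in range(len(up) - len(w) + 1)]
```

In the upsampled signal, any fixed tap of the sliding window touches a non-zero value only at a stride-dependent cadence, which is exactly the pattern the strided index generators are configured to produce.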
V-B Access Ops
Access ops are used to configure the access engines and initiate/stop the process of address generation. These ops are executed across all the PEs within a PV whose index is indicated by the %pv_idx field in the ops. Furthermore, in all of these ops, %addrgen_idx specifies the index of the targeted address generator in the access engine. The supported access ops are as follows:
access.cfg %pv_idx, %addrgen_idx, %dst, imm: This op loads a 16-bit imm value into the configuration register specified by %dst (i.e., one of the five registers shown in Figure 7(b): Addr., Offset, Step, End, and Repeat) of one of the address generators in the access engine.
access.start %pv_idx, %addrgen_idx: This op initiates the address generation in one of the address generators in the access engine. The process of address generation continues until an access.stop op is executed or the Repeat register reaches zero.
access.stop %pv_idx, %addrgen_idx: This op interrupts the address generation of one of the address generators in the access engine. The address generation can be re-initiated by executing an access.start op.
V-C Execute Ops
Execute ops are categorized into two groups: (1) SIMD ops, which are fetched from each PE's local op buffer and executed locally within each PE, and (2) MIMD ops, which are fetched from the global op buffer and executed across all the PEs. The SIMD ops can be executed in the MIMD manner as well; that is, the MIMD ops are a superset of the SIMD ops. We first introduce the SIMD ops, and then explain the extra ops that belong to the MIMD group.
SIMD ops. The SIMD group comprises only a succinct, yet representative, set of ops for performing the convolution and transposed convolution operations. The combination of the SIMD ops and the decoupled access-execute architecture in GANAX helps to reduce the size of the local op buffers. The SIMD ops do not have source or destination fields and only specify the type of the operation to be executed. Once an op is executed, depending on the type of the operation, a given PE consumes the addresses generated by the index generators and delivers the data to the execute engine. Since these ops do not have any source or destination register, they are pre-loaded into the local op buffers before the execution starts. Then, they are re-used over and over on distinct data whose addresses are generated by the access engines. The SIMD ops are as follows:
add, mul, mac, pool, and act: Depending on the type, these ops consume one or more addresses from the index generators for source and destination operands. For example, add consumes two addresses for the source operands and one address for the destination operand, but act uses one address for the source operand and one address for the destination operand.
repeat: This op causes the next fetched op to be repeated a specified number of times. This number is specified in a microarchitectural register in each PE, which is pre-loaded with a MIMD op before the execution starts.
MIMD ops. The MIMD ops are loaded into the global op buffers and executed globally across all the PEs. In addition to all the SIMD ops, the following ops execute in a MIMD manner:
mimd.ld %pv_idx, %dst, imm: This op loads the immediate value (imm) into one of the microarchitectural registers (%dst) of all the PEs within a PV. The %pv_idx field specifies the index of the target PV. This op is mainly used to load an immediate value into the repeat register.
mimd.exe %op_index1, …, %op_indexi: Upon receiving this op, the ith PV fetches the op located at location %op_indexi from its local op buffer and executes it across all of its PEs. Since the value of %op_index may vary from one PV to another, this op causes GANAX to operate in a MIMD manner.
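Putting the access and execute ops together, a minimal op sequence for one strided read pattern might look as follows (an illustrative sketch in the op syntax above; the concrete field values and register mnemonics are assumptions, not taken from the paper):

```
; Configure address generator 0 of PV 0 for a strided read,
; then reuse a single mac execute op across the whole window.
mimd.ld      %pv0, %repeat, 16        ; repeat register <- window size
access.cfg   %pv0, %gen0, %Addr,   0
access.cfg   %pv0, %gen0, %Step,   2  ; stride set by the zero-insertion
access.cfg   %pv0, %gen0, %End,    32
access.cfg   %pv0, %gen0, %Repeat, 16
access.start %pv0, %gen0              ; addresses start filling the FIFO
repeat                                ; next op repeats 16 times
mac                                   ; one execute op, reused on distinct data
```

Note that the single mac op carries no operand addresses; the access engine supplies a fresh source/destination address for each of the 16 repetitions.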
Workloads. We use several state-of-the-art GANs to evaluate the GANAX architecture. Table I shows the evaluated GANs, a brief description of their applications, and the number of convolution (Conv) and transposed convolution (TConv) layers in the generative and discriminative models.
Hardware design and synthesis. We implement the microarchitectural units, including the strided index generator, the arithmetic logic of the PEs, the controllers, the non-linear function unit, and other logic, in Verilog. We use the TSMC 45 nm standard-cell library and Synopsys Design Compiler (L-2016.03-SP5) to synthesize these units and obtain the area, delay, and energy numbers.
Energy measurements. Table II shows the energy numbers for the major microarchitectural units, memory operations, and buffer accesses in the TSMC 45 nm technology. To measure the area and read/write access energy of the register files, SRAMs, and local/global buffers, we use CACTI-P . For a fair comparison, we use the energy numbers reported for the baseline, which has a similar PE architecture. In Table II, the energy overhead of the strided index generators is included in the normalized energy cost of a PE. For DRAM accesses, we use Micron's DDR4 system power calculator . The same frequency (500 MHz) is used for both accelerators in all the experiments.
Architecture configurations. In this paper, we study a GANAX configuration with 16 Processing Vectors (PVs), each with 16 Processing Engines (PEs). We use the default configurations for the on-chip memories, such as the size of the input and partial sum registers, the weight SRAM, and the global data buffer. The same on-chip memory sizes are used for the baseline. Each local op buffer has 16 entries, which is sufficient to encompass all the execute ops. The global op buffer has 32 entries, each with 64 bits, four bits per PV. Each local op buffer uses these four bits to index its entries. An extra bit in the global ops determines the execution model of the accelerator for the current operation (i.e., SIMD or MIMD-SIMD).
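The stated field widths (16 PVs × 4-bit indices plus one mode bit) can be sanity-checked with a small packing sketch (ours; the bit ordering, with the mode bit in the LSB, is an assumption):

```python
# Pack/unpack a global op under the stated encoding:
# 16 PVs x 4-bit local-op-buffer indices (64 bits) plus one mode bit.
def pack_global_op(mimd, indices):
    word = 1 if mimd else 0                 # bit 0: MIMD-SIMD flag
    for i, idx in enumerate(indices):       # 4 bits per PV
        word |= (idx & 0xF) << (1 + 4 * i)
    return word

def unpack_global_op(word):
    mimd = bool(word & 1)
    indices = [(word >> (1 + 4 * i)) & 0xF for i in range(16)]
    return mimd, indices
```

A round trip through pack/unpack preserves the mode bit and all 16 per-PV indices, confirming the 65-bit budget is enough.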
Area analysis. Table III shows the major architectural components for the baseline architecture [20, 16] and GANAX in the 45 nm technology node. For the logic of the microarchitectural units, we use the area reported by synthesis. For the memory elements, we use CACTI-P  and previously reported numbers. To be consistent in the results, we scaled the reported area numbers down from 65 nm to 45 nm. For a fair comparison, the same number of PEs and the same on-chip memory are used for both accelerators. Under this setting, GANAX incurs an area overhead compared to the baseline.
Microarchitectural simulation. Table III shows the major microarchitectural parameters of GANAX. We implement a microarchitectural simulator on top of the baseline simulator . The energy numbers extracted from logic synthesis and CACTI-P are integrated into the simulator to measure the energy consumption of the evaluated network models. To evaluate our proposed accelerator, we extend the simulator with the proposed ISA extensions and flow of data. For all the baseline numbers, we use the plain version of the simulator.
Overall performance and energy consumption comparison. Figure 8(a) depicts the speedup of the generative models with GANAX over the baseline. On average, GANAX yields a 3.6× speedup over the baseline. The generative models with a larger fraction of inserted zeros in the input data and a larger number of inconsequential operations in the transposed convolution layers enjoy a higher speedup. Across all the evaluated models, 3D-GAN achieves the highest speedup (6.1×), mainly attributed to the larger number of inserted zeros in its transposed convolution layers; on average, the fraction of inserted zeros for 3D-GAN is around 80% (see Figure 1). On the other extreme, MAGAN enjoys a speedup of merely 1.3×, which is attributed to the lowest number of inserted zeros in its transposed convolution layers compared to the other GANs.
Figure 8(b) shows the energy reduction achieved by GANAX over the baseline. On average, GANAX effectively reduces the energy consumption by 3.1× over the baseline accelerator. The GANs with the highest fraction of zeros and inconsequential operations in the transposed convolution layers (3D-GAN, DCGAN, and GP-GAN) enjoy an energy reduction of more than 4.0×. These results reveal that our proposed architecture is effective in addressing the main sources of inefficiency in the generative models. Figure 9 shows the normalized runtime and energy breakdown between the discriminative and generative models; the first (second) bar shows the normalized runtime (energy). To be consistent across all the networks, for the discriminative model of MAGAN, we only consider the contribution of the convolution layers to the overall runtime and energy consumption. As the results show, while GANAX significantly reduces both the runtime and energy consumption of the generative models, it delivers the same level of efficiency as the baseline for the discriminative models.
Energy breakdown of the microarchitectural units. Figure 10 illustrates the overall normalized energy breakdown of the generative models across the distinct microarchitectural components of the architecture. The two bars show the normalized energy consumed by the two accelerators. As the results show, GANAX reduces the energy consumption of all the microarchitectural units. This reduction is mainly attributed to the efficient flow of data in GANAX and the decoupled access-execute architecture, which cooperatively diminish the sources of inefficiency in the execution of transposed convolution operations.
Processing element utilization. To show the effectiveness of the GANAX dataflow in improving resource utilization, we measure the percentage of the total runtime in which the PEs are actively performing a consequential operation. Figure 11 depicts the utilization of the PEs for both accelerators. GANAX exhibits a high PE utilization, around 90%, across all the evaluated GANs. This high resource utilization is mainly attributed to the proposed dataflow, which effectively places rows with similar computation patterns adjacent to each other. This forced adjacency of similar computation patterns eliminates inconsequential operations, which leads to a significant improvement in the utilization of the processing engines.
VIII Related Work
GANAX has a fundamentally different accelerator architecture than prior proposals for deep network acceleration. In contrast to prior work, which mostly focuses on the convolution operation, GANAX accelerates the transposed convolution operation, a fundamentally different operation than conventional convolution. Below, we overview the work most relevant to ours along two dimensions: neural network acceleration and MIMD-SIMD acceleration.
Neural network acceleration. Accelerator design for neural networks has become a major line of computer architecture research in recent years. A handful of prior works have explored the design space of neural network acceleration, which can be categorized into ASICs [19, 22, 20, 21, 16, 15, 26, 30, 27, 34, 38, 37, 18, 41, 42], FPGA implementations [28, 17, 36, 35, 43], unconventional devices for acceleration [33, 29, 40], and dataflow optimizations [23, 25, 24, 32, 16, 31, 39]. Most of these studies have focused on accelerating merely the conventional convolution operation, the most compute-intensive operation in deep convolutional neural networks.
Eyeriss  proposes a row-stationary dataflow that yields high energy efficiency for the convolution operation. It exploits data gating to skip zero inputs and further improve the energy efficiency of the accelerator; however, it still wastes cycles detecting the zero-valued inputs. Cnvlutin  can save compute cycles and energy for zero-valued inputs but still wastes resources on zero-valued weights. In contrast, Cambricon-X  can skip zero-valued weights but still wastes compute cycles and energy on zero-valued inputs. SCNN  proposes an accelerator that can skip both zero-valued inputs and weights and efficiently performs convolution on highly sparse data. However, not only can SCNN not handle the dynamic zero-insertion in the input feature maps, but it is also inefficient for non-sparse vector-vector multiplications, which are the dominant operation in the discriminative models of GANs. None of these works can perform zero-insertion into the input feature maps, which is fundamentally a requisite for the transposed convolution operation in the generative models. Compared to these successful prior works in neural network acceleration, GANAX proposes a unified architecture for the efficient acceleration of both conventional convolution and transposed convolution operations. As such, GANAX encompasses the acceleration of a wider range of neural network models.
MIMD-SIMD accelerators. While the idea of access-execute architectures is not new, GANAX extends the concept of access-execute architecture [44, 45, 46, 47] to the finest granularity of computation, for each individual operand, for deep network acceleration. A wealth of research has studied the benefits of MIMD-SIMD architectures in accelerating specific applications [53, 54, 55, 56, 57, 58, 59, 60, 61]. Most of these works have focused on accelerating computer vision applications. For example, PRECISION proposes a reconfigurable hybrid MIMD-SIMD architecture for embedded computer vision. In the same line of research, a recent work  proposes a multicore architecture for real-time processing of augmented reality applications, leveraging SIMD and MIMD for data- and task-level parallelism, respectively. While these works have studied the benefits of MIMD-SIMD acceleration mostly for computer vision applications, they did not study the potential gains of using MIMD and SIMD accelerators for modern machine learning applications. Prior to this work, the benefits, limits, and challenges of MIMD-SIMD architectures for modern deep model acceleration were unexplored. Conclusively, the GANAX architecture is the first to explore this uncharted territory of MIMD-SIMD acceleration for the next generation of deep networks.
Generative adversarial networks harness both generative and discriminative deep models in a game-theoretic framework to generate close-to-real synthetic data. The generative model uses a fundamentally different mathematical operator, called transposed convolution, as opposed to the conventional convolution operator. Transposed convolution extrapolates information by first inserting zeros and then applying convolution, which must cope with the irregular placement of non-zero data. To address the associated challenges of executing generative models without sacrificing accelerator performance for conventional DNNs, this paper devised the GANAX accelerator. In the proposed accelerator, we introduced a unified architecture that conjoins the SIMD and MIMD execution models to maximize the efficiency of the accelerator for both generative and discriminative models. On the one hand, to conform to the irregularities in the generative models, which arise from the zero-insertion step, GANAX supports selective execution of only the required computations by switching to a MIMD-SIMD mode. To support this mixed execution mode, GANAX offers a decoupled access-execute paradigm at the finest granularity of its processing engines. On the other hand, for the conventional discriminator DNNs, it sets the architecture in a purely SIMD mode. The evaluation results across a variety of generative adversarial networks reveal that the GANAX accelerator delivers, on average, a 3.6× speedup and 3.1× energy reduction for the generative models. These significant benefits are attained without sacrificing the execution efficiency of the conventional discriminator DNNs.
We thank Hardik Sharma, Ecclesia Morain, Michael Brzozowski, Hajar Falahati, and Philip J. Wolfe for insightful discussions and comments that greatly improved the manuscript. Amir Yazdanbakhsh is partly supported by a Microsoft Research PhD Fellowship. This work was in part supported by NSF awards CNS#1703812, ECCS#1609823, CCF#1553192, Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) award #FA9550-17-1-0274, NSF-1705047, Samsung Electronics, and gifts from Google, Microsoft, Xilinx, and Qualcomm.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in NIPS, 2014.
-  D. Nie, R. Trullo, J. Lian, C. Petitjean, S. Ruan, Q. Wang, and D. Shen, “Medical Image Synthesis with Context-aware Generative Adversarial Networks,” in MICCAI, 2017.
-  P. Costa, A. Galdran, M. I. Meyer, M. Niemeijer, M. Abràmoff, A. M. Mendonça, and A. Campilho, “End-to-end Adversarial Retinal Image Synthesis,” T-MI, 2017.
-  J. Ho and S. Ermon, “Generative Adversarial Imitation Learning,” in NIPS, 2016.
-  A. Ghosh, B. Bhattacharya, and S. B. R. Chowdhury, “SAD-GAN: Synthetic Autonomous Driving using Generative Adversarial Networks,” arXiv, 2016.
-  W. R. Tan, C. S. Chan, H. Aguirre, and K. Tanaka, “ArtGAN: Artwork Synthesis with Conditional Categorical GANs,” arXiv, 2017.
-  H. Wu, S. Zheng, J. Zhang, and K. Huang, “GP-GAN: Towards Realistic High-Resolution Image Blending,” arXiv, 2017.
-  T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to Discover Cross-Domain Relations with Generative Adversarial Networks,” ArXiv, 2017.
-  H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: Symbolic-domain Music Generation and Accompaniment with Multi-track Sequential Generative Adversarial Networks,” arXiv, 2017.
-  L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation using 1D and 2D Conditions,” arXiv, 2017.
-  J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, “Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling,” in NIPS, 2016.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv, 2015.
-  Microsoft, “Microsoft unveils Project Brainwave for real-time AI.” https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/, 2017.
-  N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter Performance Analysis of a Tensor Processing Unit,” in ISCA, 2017.
-  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in ISCA, 2016.
-  Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in ISCA, 2016.
-  H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From High-Level Deep Neural Models to FPGAs,” in MICRO, 2016.
-  T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning,” in ASPLOS, 2014.
-  H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, “Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks,” in ISCA, 2018.
-  Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” JSSC, 2017.
-  A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” in ISCA, 2017.
-  M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in ASPLOS, 2017.
-  Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks,” in FPGA, 2017.
-  W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks,” in HPCA, 2017.
-  L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-based Accelerator for Deep Learning,” in HPCA, 2017.
-  S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An Accelerator for Sparse Neural Networks,” in MICRO, 2016.
-  S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing,” ArXiv, 2016.
-  D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh, “Tabla: A Unified Template-based Framework for Accelerating Statistical Machine Learning,” in HPCA, 2016.
-  P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” in ISCA, 2016.
-  J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” in ISCA, 2016.
-  X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley, A. Pedram, and M. Horowitz, “A Systematic Approach to Blocking Convolutional Neural Networks,” ArXiv, 2016.
-  S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding,” in ICLR, 2016.
-  A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-situ Analog Arithmetic in Crossbars,” in ISCA, 2016.
-  Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in ISCA, 2015.
-  C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” in FPGA, 2015.
-  T. Moreau, M. Wyse, J. Nelson, A. Sampson, H. Esmaeilzadeh, L. Ceze, and M. Oskin, “SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration,” in HPCA, 2015.
-  S. Eldridge, A. Waterland, M. Seltzer, J. Appavoo, and A. Joshi, “Towards General-Purpose Neural Network Computing,” in PACT, 2015.
-  A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Esmaeilzadeh, “Neural Acceleration for GPU Throughput Processors,” in MICRO, 2015.
-  B. Grigorian and G. Reinman, “Accelerating Divergent Applications on SIMD Architectures Using Neural Networks,” TACO, 2015.
-  R. S. Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh, A. Hassibi, L. Ceze, and D. Burger, “General-Purpose Code Acceleration with Limited-Precision Analog Computation,” in ISCA, 2014.
-  B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, “Continuous Real-World Inputs Can Open Up Alternative Accelerator Designs,” in ISCA, 2013.
-  H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural Acceleration for General-Purpose Approximate Programs,” in MICRO, 2012.
-  C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision,” in CVPR Workshops, 2011.
-  T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, “Stream-Dataflow Acceleration,” in ISCA, 2017.
-  K. Wang and C. Lin, “Decoupled Affine Computation for SIMT GPUs,” in ISCA, 2017.
-  T. Chen and G. E. Suh, “Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling,” in MICRO, 2016.
-  J. E. Smith, “Decoupled Access/Execute Computer Architectures,” in ACM SIGARCH Computer Architecture News, 1982.
-  M. Benhenda, “ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity?,” arXiv, 2017.
-  Y. Li, J. Song, and S. Ermon, “Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs,” ArXiv, 2017.
-  H. Che, B. Hu, B. Ding, and H. Wang, “Enabling Imagination: Generative Adversarial Network-Based Object Finding in Robotic Tasks,” in NIPS, 2017.
-  S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques,” in ICCAD, 2011.
-  “DDR4 Spec - Micron Technology, Inc.” https://goo.gl/9Xo51F.
-  H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller Jr., H. E. Smalley Jr., and S. D. Smith, “PASM: A Partitionable SIMD/MIMD System for Image Processing and Pattern Recognition,” IEEE TC, 1981.
-  A. Nieto, D. L. Vilarino, and V. M. Brea, “PRECISION: A reconfigurable SIMD/MIMD coprocessor for Computer Vision Systems-on-Chip,” IEEE TC, 2016.
-  A. N. Choudhary, J. H. Patel, and N. Ahuja, “NETRA: A Hierarchical and Partitionable Architecture for Computer Vision Systems,” IEEE TPDS, 1993.
-  H. P. Zima, H.-J. Bast, and M. Gerndt, “SUPERB: A Tool for Semi-Automatic MIMD/SIMD Parallelization,” Parallel Computing, 1988.
-  P. P. Jonker, “An SIMD-MIMD architecture for Image Processing and Pattern Recognition,” in Computer Architectures for Machine Perception, 1993.
-  A. Nieto, D. L. Vilariño, and V. M. Brea, “SIMD/MIMD Dynamically-reconfigurable Architecture for High-performance Embedded Vision Systems,” in ASAP, 2012.
-  H. M. Waidyasooriya, Y. Takei, M. Hariyama, and M. Kameyama, “FPGA Implementation of Heterogeneous Multicore Platform with SIMD/MIMD Custom Accelerators,” in ISCAS, 2012.
-  X. Wang and S. G. Ziavras, “Performance-energy Tradeoffs for Matrix Multiplication on FPGA-based Mixed-mode Chip Multiprocessors,” in ISQED, 2007.
-  G. Kim, K. Lee, Y. Kim, S. Park, I. Hong, K. Bong, and H.-J. Yoo, “A 1.22 TOPS and 1.52 mW/MHz Augmented Reality Multicore Processor with Neural Network NoC for HMD Applications,” JSSC, vol. 50, no. 1, 2015.