Sidebar: Scratchpad-Based Communication Between CPUs and Accelerators

10/23/2019, by Ayoosh Bansal, et al.

Hardware accelerators for neural networks have shown great promise for both performance and power. These accelerators are at their most efficient when optimized for a fixed functionality. But this inflexibility limits the longevity of the hardware itself as the underlying neural network algorithms and structures undergo improvements and changes. We propose and evaluate a flexible design paradigm for accelerators with close coordination with host processors. The relatively static matrix operations are implemented in specialized accelerators, while fast-evolving functions, such as activations, are computed on the host processor. This architecture is enabled by a low-latency shared buffer we call Sidebar. Sidebar memory is shared between the accelerator and the host, exists outside of the program address space, and holds intermediate data only. We show that a generalized, DMA-dependent flexible accelerator design performs poorly in both performance and energy compared to an equivalent fixed-function accelerator. A Sidebar-based accelerator design achieves near-identical performance and energy to the equivalent fixed-function accelerator while still providing all the flexibility of computing activations on the host processor.


1 Introduction

The rise in usage of deep neural networks has led to unique computational demands on modern systems. Initially deployed on CPUs, neural networks have since moved to GPUs and FPGAs. Today, there are dedicated hardware blocks for neural networks in widely-available commodity hardware, including the latest SoCs developed by Apple, Qualcomm, and others [hwaccel]. Accelerators have also found their way into the data center, such as Google’s TPU [tensorflow, tpu], which dominates GPUs in performance per watt for inference tasks.

Accelerators for deep learning excel at matrix multiplication, which forms the most computationally expensive portion of many modern models. However, these models also frequently involve non-linear activation functions. Some are comparatively easy to compute, such as ReLU, but others require special functions like tanh that are more expensive or require space for lookup tables. Many of these activation functions could be implemented in logic within the accelerator, but this approach lacks flexibility. While many advancements in deep learning will continue to map onto matrix operations, these non-linearities are more liable to change in the future and break hardware compatibility. Another option is to perform these operations on the CPU while keeping the matrix math on the accelerator. However, this requires costly DMA operations.

Figure 1: Proposed System Model

We evaluate the use of a specialized buffer, called Sidebar, at the L1 level sitting between a CPU core and an accelerator block (see Figure 1). The accelerator continues to have data pushed into its private memory through DMA operations. Once the data is loaded, the accelerator can perform computations on it. If the accelerator is required to perform a computation that is expensive and not implemented in hardware, it can instead move this computation back to the host processor. The accelerator copies the intermediate results it has computed into the Sidebar and informs the host processor that it should perform some function on the Sidebar data. The host processor performs the computation using all available execution resources, including vector units and complex arithmetic units. The CPU then writes the data back into the Sidebar for the accelerator to use and continue execution.

Consider the example of accelerating a neural network. In our scheme, the CPU will initiate a neural network operation, such as a forward pass, on the accelerator. The accelerator will perform some matrix computations and at some point write intermediate values that require the application of the activation function into the Sidebar. The CPU will compute the activation functions, write the results to the Sidebar and then indicate to the accelerator that it may proceed. The accelerator can repeat this process until the neural network operation has completed.
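The control flow just described can be illustrated with a minimal software mock, shown below. The "accelerator", the Sidebar buffer, and the completion flag are all emulated in plain C purely to show the ordering of events; none of these names reflect the actual hardware or driver interface proposed in this paper.

#include <stdio.h>

#define SB_WORDS 64
static float sidebar[SB_WORDS];        /* stand-in for the shared Sidebar  */
static volatile int accel_done_flag;   /* stand-in for a hardware flag     */

/* Pretend accelerator: does one layer's matrix work and leaves the
 * pre-activation results in the Sidebar. */
static void accel_run_layer(int layer)
{
    for (int i = 0; i < SB_WORDS; i++)
        sidebar[i] = (float)(i - 32) * (float)(layer + 1);   /* dummy data */
    accel_done_flag = 1;
}

/* Host-side activation (ReLU here), applied in place on Sidebar contents. */
static void host_apply_activation(void)
{
    for (int i = 0; i < SB_WORDS; i++)
        if (sidebar[i] < 0.0f)
            sidebar[i] = 0.0f;
    accel_done_flag = 0;               /* signal the accelerator to resume */
}

int main(void)
{
    for (int layer = 0; layer < 3; layer++) {
        accel_run_layer(layer);        /* matrix math on the "accelerator" */
        while (!accel_done_flag)       /* host waits for intermediates     */
            ;
        host_apply_activation();       /* activation computed on the host  */
    }
    printf("sidebar[0] after last layer: %f\n", sidebar[0]);
    return 0;
}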

2 Background and Motivation

In this section we provide a brief introduction to neural networks and to the role and types of activation functions.

2.1 Neural Networks

Neural networks are a class of computational models. They have seen usage in a variety of domains, including image classification, translation, finance, autonomous vehicles, and more. In their most basic form, neural networks consist of compositions of linear predictors and activation functions. A linear predictor is a weighted average of its inputs plus a biasing term. The decision boundary is the sign of the output. Linear predictors themselves are used in some machine learning tasks, particularly regression tasks, but have limited representational power.

Compositions of linear predictors are themselves linear predictors. In order to increase the representational power of this composition, some non-linearities must be injected. The outputs of each linear predictor (each "fully connected layer") are passed into an activation function, whose result is then used as the input to another linear predictor ("layer"). In theory, a two layer network with a reasonable activation function can approximate any function arbitrarily well [Cybenko1989]. In practice, deeper networks are used for ease of training, as they require fewer total weights [telgarsky16].
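As a concrete illustration, a single fully connected layer followed by an activation can be written as below. This is a minimal sketch with arbitrary sizes and weights, not the accelerated implementation evaluated later.

#include <math.h>
#include <stdio.h>

#define IN  4
#define OUT 3

/* One "layer": a weighted sum of inputs plus a bias, then a non-linearity. */
static void fc_layer(const float x[IN], const float w[OUT][IN],
                     const float b[OUT], float y[OUT])
{
    for (int o = 0; o < OUT; o++) {
        y[o] = b[o];                      /* bias term                     */
        for (int i = 0; i < IN; i++)
            y[o] += w[o][i] * x[i];       /* weighted sum of inputs        */
        y[o] = tanhf(y[o]);               /* activation between layers     */
    }
}

int main(void)
{
    float x[IN] = {1.0f, -2.0f, 0.5f, 3.0f};
    float w[OUT][IN] = {{ 0.1f,  0.2f, -0.3f,  0.4f},
                        {-0.5f,  0.1f,  0.2f,  0.0f},
                        { 0.3f, -0.1f,  0.6f, -0.2f}};
    float b[OUT] = {0.0f, 0.1f, -0.1f};
    float y[OUT];

    fc_layer(x, w, b, y);
    for (int o = 0; o < OUT; o++)
        printf("y[%d] = %f\n", o, y[o]);
    return 0;
}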

Modern neural networks frequently make use of more than just linear predictors. Convolutional layers perform a convolution on the input using a small kernel of weights. Pooling layers reduce the size of their input by replacing a sliding window over the input with a single entry in the output according to some algorithm, frequently either the max or the average. Other types, such as recurrent layers and dropout layers, have seen usage in some domains.

2.2 Activations

A wide variety of activation functions have been used, with their relative popularity changing over time. Early research in neural networks focused on perceptron networks and used the Heaviside function. Later research focused on the sigmoid function, which endured for several decades. The hyperbolic tangent function was also used during this time. After rising to popularity with its use in the winning entry of ImageNet 2012 [alexnet], the ReLU function remains the most popular activation function today. Many variants of it have been proposed and adopted to varying degrees, and more are sure to be developed in the future. Activation functions are distinct from other parts of a neural network in that they cannot be expressed as a matrix operation and thus require special hardware. If new activation functions come into use, existing accelerators may not be able to implement them without hardware modification.

Name         Formula
Heaviside    f(x) = 0 if x < 0, 1 otherwise
tanh         f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Sigmoid      f(x) = 1 / (1 + e^(-x))
ReLU         f(x) = max(0, x)
Leaky ReLU   f(x) = x if x > 0, 0.01x otherwise
ELU          f(x) = x if x > 0, α(e^x - 1) otherwise
Softplus     f(x) = ln(1 + e^x)
Table 1: Common Activation Functions
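To make the cost differences concrete, the sketch below gives straightforward host-side C implementations of the activations in Table 1. The leak factor of 0.01 and the ELU α = 1 are common defaults assumed here, not values taken from this paper; note that ReLU needs only a comparison, while softplus, sigmoid, ELU, and tanh require transcendental math.

#include <math.h>

#define LEAK      0.01f   /* common Leaky ReLU default (assumption) */
#define ELU_ALPHA 1.0f    /* common ELU default (assumption)        */

float heaviside(float x)  { return x < 0.0f ? 0.0f : 1.0f; }
float sigmoid(float x)    { return 1.0f / (1.0f + expf(-x)); }
float relu(float x)       { return x > 0.0f ? x : 0.0f; }
float leaky_relu(float x) { return x > 0.0f ? x : LEAK * x; }
float elu(float x)        { return x > 0.0f ? x : ELU_ALPHA * (expf(x) - 1.0f); }
float softplus(float x)   { return logf(1.0f + expf(x)); }
/* The hyperbolic tangent is provided by <math.h> as tanhf(). */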

2.3 Motivation

Consider a large monolithic accelerator, like the one in Figure 4, which includes many layers and activation functions. As discussed above, activation functions tend to change over time, and with any small change in the algorithm the complete hardware IP becomes obsolete and would need expensive engineering effort to update. A better design is to have smaller primitives, S1-S5 in Figure 4, which are relatively static, and to do the activation computations on a processor. However, the interface for such a flexible system is a key problem. Several interfaces have been explored [interfaceexploration, spandex], but in most cases the data movement costs make this flexible design prohibitively expensive [shao2015toward]. Using the system described later in Section 5, we compare the performance and energy of the monolithic accelerator and the flexible DMA accelerators in Figures 2 and 3. Clearly the naive flexible design is prohibitively expensive and requires a better interface to move data to and from the host processor.

Figure 2: Monolithic vs Flexible DMA Inference Performance
Figure 3: Monolithic vs Flexible DMA Inference Energy

3 System Considerations

The goal of this work is to provide a mechanism which can reduce the system overheads associated with fine-grained cooperation between an accelerator and a host CPU. We accomplish this by leveraging a tightly-coupled buffer, referred to as Sidebar, as the point of contact between CPU and accelerator. Much like in a courtroom setting, the Sidebar allows the host processor and accelerators to have a quick exchange that is invisible to the rest of the memory system. This mechanism enables the development of flexible accelerator hardware with better longevity than fixed-function accelerators. The system model is portrayed in Figure 1.

In order to meet this goal, careful considerations must be made. How the Sidebar is accessed is described in Section 3.1, while its ability to enable fine-grained accelerator-CPU cooperation is detailed in Section 3.2. Software integration with a host operating system is discussed in Section 3.3, and interactions between our work and existing processor coherence mechanisms are discussed in Section 3.4.

3.1 Accessing the Sidebar

The usefulness of the Sidebar in our design relies upon the addition of at least two instructions to the host processor. These instructions, sbLD and sbST, allow the processor to load from and store to the Sidebar memory respectively. We use specialized instructions, instead of a memory mapping, to further isolate the Sidebar from the main memory space and avoid coherence issues, discussed in Section 3.4.

The sbLD instruction will primarily be used after the accelerator has completed an intermediate task and has signaled this completion to the host. The processor uses sbLD to move intermediate data from the Sidebar into its own register file, where it can then perform arbitrary computation on it. The sbST instruction will be used to return data that the CPU has performed additional computation on to the Sidebar.
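Since sbLD and sbST are proposed instructions rather than part of any existing ISA, the sketch below emulates them with C wrappers over a static array so it compiles and runs; on real hardware these wrappers would instead emit the new instructions, for example via compiler intrinsics or inline assembly. Offsets index the Sidebar directly, outside the program address space.

#include <stdint.h>
#include <stdio.h>

#define SB_WORDS 1024
static uint32_t sidebar_model[SB_WORDS];    /* software model of the Sidebar */

static inline uint32_t sbLD(uint32_t offset)             /* load one word    */
{
    return sidebar_model[offset];
}

static inline void sbST(uint32_t offset, uint32_t value) /* store one word   */
{
    sidebar_model[offset] = value;
}

int main(void)
{
    sbST(16, 0xDEADBEEFu);                   /* host writes a Sidebar slot   */
    printf("sidebar[16] = 0x%08X\n", sbLD(16));
    return 0;
}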

In both cases, data placement is explicitly managed. There must be agreement between the accelerator and host code at compile-time on where data will be located within the Sidebar, and how it will be arranged. This does place some additional demands on the programmer, but we believe this can be mitigated by simple compilation tools or frameworks, which we leave to future work.

The accelerator may access the Sidebar in a similar manner. We do not allow the accelerator and the host processor to access the Sidebar simultaneously, and we prevent this through hardware mechanisms. The host processor or accelerator must indicate that they have completed using the Sidebar by writing to a hardware register before the other may proceed.

3.2 Fine-Grained Cooperation

In the same way that DMA is used at the beginning and end of accelerator tasks, the Sidebar can be used to pass data between the accelerator and CPU during an accelerator task. Combining this data-passing mechanism with a polling mechanism (detailed further in Section 3.3) allows the accelerator to efficiently pass intermediate results to the host processor and invoke desired functions on these results.

Fine-grained accelerator-CPU cooperation allows for improved performance and flexibility. Performance is improved as the accelerator will invoke the CPU for functions which are either not easily implemented in hardware (saving on area and power in the accelerator) or which run more slowly in hardware than on the CPU (such as non-linear activation functions). Flexibility is improved because difficult or costly hardware implementations of functions can be avoided in favor of performing the same function on the highly programmable host CPU. This paradigm of computing is not currently possible given the high overheads associated with accelerator data movement.

3.3 Host System Integration

In order for the host CPU to be able to collaborate with the accelerator, there must be a mechanism for the accelerator to call on the CPU to perform a piece of work. In this work, we set aside the complexities of integrating such a system into a full operating system. Instead, we elect to have a simplified polling approach running on the host CPU as the sole application. The host keeps a table of functions the accelerator may call on the CPU to perform. These functions are part of the accelerator’s driver and are therefore written and compiled ahead of time and reside in the host’s memory.

When the accelerator wishes to invoke the CPU to perform a computation, the accelerator must first write the data needed for the computation into the Sidebar. Once the data has been written, the accelerator writes the arguments of the computation to a specific set of Sidebar locations. These arguments include variables such as function pointers to host functions, pointers to data in the Sidebar, and other information required for the invocation of the host. Once the data and arguments have been written into the Sidebar, the accelerator writes to a specific Sidebar location that the host is polling on. This signals the host to begin the computation. The return process is similar to the invocation, except that the host sets up the data and the accelerator waits for the flag location to be pulled low.
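A sketch of the host-side polling loop is given below. The Sidebar layout (flag at word 0, function index at word 1, argument offset and length at words 2 and 3), the use of a function index rather than a raw pointer, and the contents of the function table are all illustrative assumptions, not a layout defined by this work. The loop spins forever because, as described above, the polling application is the sole program on the host; in this sketch nothing raises the flag, so it only compiles as an illustration.

#include <math.h>

#define SB_WORDS   1024
#define SB_FLAG    0      /* accelerator sets to non-zero to invoke the host */
#define SB_FUNC_ID 1      /* index into the driver's function table          */
#define SB_ARG_OFF 2      /* offset of the data block within the Sidebar     */
#define SB_ARG_LEN 3      /* number of elements to process                   */

static volatile float sidebar[SB_WORDS];   /* emulated Sidebar storage       */

typedef void (*sb_fn)(volatile float *data, int n);

static void sb_relu(volatile float *d, int n)
{
    for (int i = 0; i < n; i++) d[i] = d[i] > 0.0f ? d[i] : 0.0f;
}

static void sb_softplus(volatile float *d, int n)
{
    for (int i = 0; i < n; i++) d[i] = logf(1.0f + expf(d[i]));
}

/* Driver-provided table of functions the accelerator may invoke. */
static const sb_fn func_table[] = { sb_relu, sb_softplus };

void host_poll_loop(void)
{
    for (;;) {
        if (sidebar[SB_FLAG] == 0.0f)
            continue;                            /* nothing requested yet    */

        int fn  = (int)sidebar[SB_FUNC_ID];
        int off = (int)sidebar[SB_ARG_OFF];
        int len = (int)sidebar[SB_ARG_LEN];

        func_table[fn](&sidebar[off], len);      /* run the requested work   */
        sidebar[SB_FLAG] = 0.0f;                 /* hand the Sidebar back    */
    }
}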

3.4 Coherence Interactions

When the accelerator is performing an acceleration task and invoking the host CPU, data must be placed into the Sidebar by the accelerator before notifying the CPU of its task. The mechanism for this data movement is discussed in Section 3.1. The CPU will then operate on this data, potentially bringing it into its local registers. This data should not enter the cache hierarchy, however, since this intermediate result of accelerator computation is not normally application visible. Because this data should not enter the cache hierarchy and instead remains resident only in registers or within the Sidebar, no coherence concerns are present.

Initial and final data movement to and from the accelerator’s private memory are handled by DMA. This is the current protocol on many existing implementations of heterogeneous systems.

3.5 Consistency Interactions

In out-of-order host CPUs, depending on the consistency model, it may be possible for the status flag to be written before the return data has been written to the Sidebar, even if the flag is written last in program order. To account for this, there are two possible solutions. The first is to have a separate load-store queue for Sidebar memory instructions. This would allow for the system architects to decide on a consistency model specifically for the Sidebar. However, additional fence instructions for the Sidebar memory operations would then be required.

The other solution is to utilize existing load-store queues in the host processor. This means that the Sidebar operations would obey the same consistency model as the host and would therefore be able to utilize existing fence instructions to maintain desired functionality. We see this as the optimal solution since it involves the least amount of modifications to the CPU microarchitecture.
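Under this second option, the return path looks like ordinary code with a release fence between the result stores and the flag store. The sketch below reuses the emulated sbST wrapper from the Section 3.1 sketch and GCC/Clang's __atomic_thread_fence builtin as stand-ins; on real Sidebar hardware the same ordering would be enforced by the host's existing fence instructions.

#include <stdint.h>

#define SB_WORDS 1024
static uint32_t sidebar_model[SB_WORDS];   /* emulated Sidebar, as before    */

static inline void sbST(uint32_t off, uint32_t val) { sidebar_model[off] = val; }

void host_return_results(const uint32_t *results, int n,
                         uint32_t data_off, uint32_t flag_off)
{
    for (int i = 0; i < n; i++)
        sbST(data_off + (uint32_t)i, results[i]);  /* 1. write result data   */

    __atomic_thread_fence(__ATOMIC_RELEASE);       /* 2. order data before flag */

    sbST(flag_off, 0);                             /* 3. pull the flag low    */
}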

4 Design Overview

With the base components and interactions of Sidebar described in Section 3, it is useful to step back and see how Sidebar fits into a real workflow in order to improve performance.

Sidebar is best suited to workloads which are both a strong candidate for hardware acceleration but also contain "CPU-friendly" functionality - that which is better-suited for execution on a powerful, general-purpose core. Sidebar is also applicable to workloads which desire fine-grained cooperation between an accelerator and host, or those applications which desire a high level of flexibility for future algorithmic changes. Our work shows that neural network operations are a prime candidate for use with Sidebar, but many other workloads would benefit from fine-grained cooperation.

Once a target algorithm is identified, an accelerator must be built for that task. This could be at the complexity level of a matrix multiplication kernel or could be a more abstract primitive like an entire convolution kernel. The accelerator is augmented with a finite state machine (FSM) and interface signals (data and control) capable of: (1) receiving commands from the host through a driver and (2) sending commands to the host in order to invoke CPU acceleration. In this work, gem5-Aladdin [aladdin] is used to model this interaction, allowing us to combine accelerators with a CPU simulation infrastructure.

Once the accelerator hardware is completely built, a driver is created which allows for communication of data and tasks to the accelerator. As discussed in Section 3.3, a Sidebar implementation requires this driver not only for starting and stopping the accelerator, but also for the accelerator to interface with the host CPU and invoke host operations through the Sidebar.

In order to control the communication of intermediate data, Sidebar dedicates a portion of the private, shared memory to accelerator-host communication. More specifically, the host CPU polls the driver-defined Sidebar memory locations checking for flags which indicate a CPU task being invoked by the accelerator. When these flags are set, the CPU finds a function pointer in a dedicated memory location (as described in Section 3.3) which tells the CPU which function it should perform on the contents of the shared memory region.

As the CPU is performing its computation, the accelerator FSM will be polling another region of the scratchpad waiting for the CPU to signal it has completed the work. While this communication is not ideal because it slightly reduces the usable scratchpad space and requires the host CPU to spin and wait, an interrupt-based mechanism might be used to solve both of these issues.
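For completeness, the accelerator's side of this hand-off can be sketched at the C level (gem5-Aladdin accelerators are themselves described in C). The word offsets below mirror the illustrative layout assumed in the host polling sketch of Section 3.3 and are not defined by this work.

/* Accelerator-side hand-off: deposit arguments, raise the flag, then spin
 * until the host clears it. In hardware the spin would be an idle FSM state. */
void accel_request_activation(volatile float *sidebar,
                              int func_id, int data_off, int len)
{
    sidebar[1] = (float)func_id;     /* which host function to run          */
    sidebar[2] = (float)data_off;    /* where the intermediates live        */
    sidebar[3] = (float)len;         /* how many elements to process        */
    sidebar[0] = 1.0f;               /* raise the flag: invoke the host     */

    while (sidebar[0] != 0.0f)       /* wait for the host to finish         */
        ;
}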

For more complex functions, a sea of accelerators can be built. The obvious option is to build a monolithic accelerator designed to perform multiple activation functions. This configuration offers some flexibility, but wastes area and power on potentially unneeded hardware resources. One could instead build a set of accelerators which do not contain activation hardware, but these accelerators will need to pass intermediate results to the host processor through DMA, incurring additional execution time and energy. Finally, one could build a set of accelerators which do not contain activation hardware but can communicate through a Sidebar. These accelerators can invoke the host CPU to compute the activation functions of the network, passing data through the Sidebar instead of DMA.

Through the use of these mechanisms and design flow, Sidebar allows the host CPU to quickly respond to task requests from an accelerator attached to the system. This enables fine-grained cooperation between the CPU and potentially many accelerators with low overhead for communication and reduced energy consumption.

5 Implementation

Figure 4: Lenet Accelerator Models

The complete implementation and evaluation was done on gem5-aladdin [aladdin]. The source code is available at [gitlab]. Major components of the system are described further in this section.

5.1 Gem5 System

We use the default gem5 parameters from gem5-aladdin. The core parameters are defined in Table 2.

Component   Gem5 Parameter
CPU         Single Core DerivO3CPU
Memory      4 GB DDR3_1600_8x8
Clock       1 GHz
Table 2: Gem5 Parameters

Limitations: gem5-aladdin only supports system call emulation mode for program execution. In this mode programs are executed without an OS layer; any system calls are functionally emulated by gem5. Due to this limitation we do not implement the OS-dependent interrupt interface. The program flow is completely controlled by the application running on the host processor.

5.2 Accelerators

The basis for our simulations was a neural network model in the LeNet style [lenet, alexnet]. The exact model was adapted from one in the PyTorch documentation [lenet_source] specifically developed to classify CIFAR-10 [cifar]. Some of the hyper-parameters were modified for simulation purposes. The network consists of two convolutional layers, each followed by an activation and a pooling layer. These are then followed by three fully connected layers, with activations in between. The complete network was implemented in two distinct forms, as shown in Figure 4.

Accelerator           Cycles    Energy (cycle·mW)   Area (µm²)
ReLU Monolithic       122151    724294354           4.82445e+08
SoftPlus Monolithic   147967    873817638           4.82448e+08
S1                     23124    138988189           4.61686e+08
S2                     22541     86039447           2.90202e+08
S3                     66060     51164791           6.10141e+07
S4                     17847      3560833           1.46956e+07
S5                      2546       110980           2.60089e+06
Table 3: Accelerator Parameters

5.2.1 Monolithic

The monolithic version implements the complete network in a single accelerator. Consider the blue box in Figure 4. All layers and activations are within this monolithic accelerator. The black arrows represent data motion between the accelerator and main memory. All data transfers are DMA. We implement this accelerator with different activation functions; a comparison between ReLU and SoftPlus is shown in Table 3. These two activation functions were chosen because ReLU is the most commonly used and SoftPlus is the most computationally complex.

5.2.2 Small Primitives

Consider Figure 4 again. For this configuration we define layers without intervening activations as small accelerator primitives. These are represented as green boxes in Figure 4. In this configuration, the activations are computed on the main processor. Hence each activation becomes a data transfer back to processor memory and a computation on the processor. The data transfer may occur via DMA or a low-latency Sidebar-based transfer. The realized parameters for the small accelerators are also shown in Table 3.

Limitation: Gem5 integrates with Aladdin via an ioctl interface. Given this limited interface, we could not fully implement low-latency cross communication between the two simulation domains. Hence we approximate the evaluations by synthetically controlling the latency of data transfers between the Gem5 and Aladdin simulation domains for the activation calls only. Specifically, the accesses on the accelerator side are from its local scratchpad. CPU-side accesses are large contiguous memory operations, which, with prefetching, reach cache-level latency. Hence these accesses indirectly emulate the latency of accessing the Sidebar. Note that the input, output, and parameters still use DMA-based data transfers. Sidebar-like latencies are used only to provide intermediate data to the host processor for computing activations and to let the next accelerator access the results of processor-based activation computations. The network itself is untrained and hence does not provide meaningful data outputs; the fact that Aladdin and Gem5 incur Sidebar latency costs without the simulations actually communicating data therefore does not impact the accuracy of our results.

5.3 Scenarios

Figure 5: Execution Timeline

Based on the implementations described above we find three particular scenarios of interest which form the basis of our evaluation, as shown in Figure 5.

5.3.1 Monolithic

The monolithic accelerator implements all layers of the neural network, including activation functions, in a single accelerator. Execution begins with the host CPU flushing its caches to DRAM and then invalidating the cache lines. The actual DMA can then begin: the CPU initiates the DMA load and the accelerator receives the data. The accelerator then performs its computation. Once complete, the accelerator performs a DMA store. The CPU can then read the accelerator result and use it accordingly.
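The sequence can be summarized in the following sketch. Every function here is a hypothetical placeholder for the corresponding driver or gem5-Aladdin operation; only the ordering reflects the flow described above.

#include <stdio.h>

/* Placeholder stubs standing in for driver / simulator operations. */
static void host_flush_caches(void)                 { puts("flush caches");     }
static void host_invalidate_caches(void)            { puts("invalidate lines"); }
static void dma_load_to_accelerator(const float *p) { (void)p; puts("DMA load");  }
static void accel_run_full_network(void)            { puts("accelerator compute"); }
static void dma_store_to_host(float *p)             { (void)p; puts("DMA store"); }

void run_monolithic_inference(const float *input, float *output)
{
    host_flush_caches();             /* write dirty lines back to DRAM        */
    host_invalidate_caches();        /* so post-DMA reads see fresh data      */
    dma_load_to_accelerator(input);  /* push inputs and weights to the accel  */
    accel_run_full_network();        /* all layers + activations in hardware  */
    dma_store_to_host(output);       /* pull the final results back to DRAM   */
}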

5.3.2 Flexible DMA

For flexible DMA, we wanted to understand how an SoC would need to leverage DMA to achieve the same level of flexibility as Sidebar. For this accelerator structure, the neural network accelerator is split into five different accelerators, corresponding to each of the network layers, excluding activation functions. The initial and final DMA transfer processes are the same as for the monolithic accelerator. However, since each accelerator is separate, this process must be replicated for each invocation of an accelerator. The benefit of such a system is that the activation functions are performed on the CPU between DMAs. This allows the network activation functions to be changed very easily and implemented in software. Existing machine learning accelerators implement a subset of activation functions in hardware and have no mechanism to introduce new functions. The downside of this programmability is that the communication overhead is rather high. Something that our results do not show, however, is that breaking up the accelerators would allow for pipelining of computations. It is important to note that all of the accelerators attempting to DMA to and from the CPU simultaneously would most likely create a communication bottleneck.

5.3.3 Sidebar

The goal of Sidebar is to accomplish the same level of programmability as the flexible DMA accelerators with reduced communication overhead. The reduction in communication costs comes from the use of Sidebars between the accelerators and the host CPU. Using Sidebars allows us to forgo the cache flushing and invalidation costs of using DMA. It also allows for faster data transfers, since the Sidebar sits at the L1 level in the memory hierarchy; the CPU and accelerators need not go to DRAM to exchange data. For this work, Sidebar is used to eliminate the intermediate DMA transactions between host and accelerator. The initial and final DMA processes must still take place.

6 Evaluation

The flexibility and performance of Sidebar are evaluated using a neural network inference pass as a workload. In this workload, a host CPU sets up the acceleration task(s) by allocating memory and mapping various arrays. The host then invokes the accelerator(s) and waits until the task is complete before checking the output for correctness. With multiple network layers and activation functions between them, this workload perfectly fits the model of fine-grained cooperation between accelerators and host processors.

Each time the workload is run, Gem5 collects statistics for the inference’s execution. These statistics include performance and power numbers for the accelerators in the system, as well as information about the system interconnect and its traffic. Using these statistics, we evaluate the performance, communication, and energy of Sidebar.

6.1 Latency

Figure 6 shows the performance results of the two baseline designs and our Sidebar implementation, measured as the latency of a single inference pass.

Figure 6: Inference latency of the Lenet convolutional neural network with hardware acceleration enabled. Monolithic refers to a single, inflexible accelerator. Flexible DMA refers to a flexible set of accelerators with DMA only for communication. Sidebar represents our implementation in Gem5 + Aladdin.

From the figure, we can see that increased flexibility comes at a cost. Both the flexible DMA baseline and Sidebar incur slight overheads during execution. The flexible DMA configuration has a run time which is 8 to 14 percent longer than the monolithic accelerator, while Sidebar manages to stay within 2 percent of the monolithic accelerator’s performance.

Furthermore, the testing of two different activation functions shows that offloading complex computation to the host CPU is a viable alternative to an expensive hardware implementation as in the monolithic accelerator. We can see that for the more complex activation, SoftPlus, Sidebar allowed for better cooperation with lower overhead than DMA. This is evidenced by the widening delta between the flexible DMA configurations, while the Sidebar design shows consistent performance relative to the monolithic design.

6.2 Data Communication

After assessing the performance of Sidebar, we turned to the evaluation of each system’s energy consumption. This evaluation was performed using data from CACTI [cacti:muralimanohar2009cacti] as well as statistics on data transferred within each system. There are two routes for data transfer. The first is the system or DRAM bus which is where all DMA transfers take place. The second route is the Sidebar implementation which we model as a tightly coupled storage array connected between the host CPU and accelerator pool.

Figure 7: Data communication energy in the tested accelerator configurations. DRAM energy refers to data moved on the system bus while Sidebar energy refers to data moved through the Sidebar.

The results shown in Figure 7 depict the merits of specialization. The use of simple yet generic DMA operations for data transfer leads to a huge amount of data being sent on the system bus in the flexible DMA configuration. This leads to the flexible DMA design using 32 percent more energy per inference than the monolithic accelerator which can keep inter-layer data transfers internal to its data path for improved efficiency. Sidebar incurs only 6 percent more energy consumption from data movement than the monolithic design. This is because Sidebar alleviates some of the energy consumption incurred by moving data between the accelerator and host CPU by transferring the data through the private memory store. This dramatically reduces dynamic energy and allows the Sidebar configuration to offer the flexibility of DMA transfers with nearly the energy consumption of a monolithic implementation.

6.3 Normalized Energy

After looking at system performance and system energy, a useful final metric to consider is energy-delay product (EDP) which allows one to compare the energy efficiency of designs with varying performance and power consumption statistics. EDP is the product of the computation run time with the energy consumed during the execution of the workload. Smaller values are better and signify that a design is very high performance, very power efficient, or a strong balance of the two.
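In equation form (the notation below is ours, not the paper's):

\[
  \mathrm{EDP} = E \times t_{\text{exec}}, \qquad
  \mathrm{EDP}_{\text{norm}} =
    \frac{E_{\text{design}} \, t_{\text{design}}}
         {E_{\text{monolithic}} \, t_{\text{monolithic}}}
\]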

Figure 8: Normalized energy consumption

Figure 8 presents the EDP of each design normalized to the monolithic accelerator. Due to the combination of sizable DMA transfer overheads and the increased overall data movement of the flexible DMA design, a nearly 50 percent increase in EDP can be seen compared with the monolithic design. Sidebar, on the other hand, has only a slight increase in EDP when compared with the monolithic accelerator. This stems from the fact that Sidebar has comparable performance and a greatly reduced amount of high-energy bus communication relative to the flexible DMA design. This means Sidebar sees only a 7 percent increase in EDP compared to the monolithic design, and is nearly 40 percent better than the flexible DMA configuration.

6.4 Discussion

The overall performance and energy evaluations performed with Sidebar are quite encouraging. Through the use of a tightly-coupled storage mechanism, Sidebar enables low-cost cooperation between a host CPU and fixed-function hardware accelerators. Our experiments have shown that Sidebar offers the flexibility of a DMA-based system with performance and energy consumption competitive with monolithic accelerator designs. Sidebar’s private scratchpad offers dramatically reduced energy when compared to the high-capacitance system memory bus. Because of the tight coupling between the host and accelerator, Sidebar also offers fast access to data for cooperative tasks.

Indeed, this level of performance and energy efficiency for cooperative workloads has not been shown before. Sidebar provides a glimpse at the future of cooperative workload execution where a host CPU dictates tasks to a whole pool of accelerators. Sidebar offers great specialization with its support for hardware accelerators, but provides programmers with the flexibility to update and modernize their code as new libraries and activation functions are developed while also enabling a new paradigm of low-cost accelerator cooperation.

7 Future Work

While we have shown that there are compelling gains in terms of system flexibility for our solution, this is only the beginning of the potential for Sidebar. The first area we consider for future work is using Sidebar to stream working data to and from accelerators. This could theoretically decrease the latencies of the initial and final DMAs. However, it would require potentially much larger Sidebars, or much smarter communication methods. Future work could develop a method for streaming data through the Sidebar to initialize accelerator storage.

Figure 9: Latency of image processing, GPU pre-processing and neural network inference. With intermediate CPU DMAs (top), with intermediate Sidebar usage (bottom)

The main idea of this work is that we are reducing the cost of communication between accelerator and host. This work could rather easily be expanded to incorporate accelerator-to-accelerator communication as well. This could be very useful in collaborative workloads between accelerators without requiring the intervention of the host processor. Consider a modern SoC with an image processing pipeline and some sort of machine learning acceleration. The image may first begin in a demosaicing accelerator (converting RAW sensor data to pixel data), but then may be merged with other images in an HDR processing accelerator. Finally, there may be some additional processing that requires the GPU before the image is sent to the neural network accelerator. With modern memory systems, these accelerators must communicate through the CPU using DMA to stream data back and forth. While systems can pipeline such data transfers to amortize transfer delays, such processing pipelines are common in self-driving car hardware, where latency is a priority. Thus, being able to communicate through an accelerator-to-accelerator Sidebar without the overhead of flushes, invalidations, and higher levels of the memory hierarchy would prove very beneficial in terms of latency.

Since the communication costs are theoretically very small, Sidebar could promote a sea of primitive accelerators. Rather than having a few accelerators that accelerate large tasks, an SoC could implement many accelerators that implement small, reusable tasks. For example, it may be possible to create a general convolution accelerator that can be reused between image processing and machine learning pipelines. Normally, the overhead of invoking many small accelerators and the associated passing of data would be too significant. However, with the mechanism shown here, this approach may be feasible since these overheads are significantly reduced. The only potential drawback of separating accelerators in this way is that each accelerator requires its own local memory. We noticed this storage duplication in our evaluation: when separating out each neural network layer into its own accelerator, the area grew significantly because each accelerator required its own private memory. This could be mitigated by allowing some form of private memory sharing between accelerators. It could potentially even be possible to reuse the Sidebar as an accelerator scratchpad, allowing for a rather area-efficient solution.

8 Related Work

A major problem with hardware accelerator development is the lack of standardized interfaces. Previous works have focused on identifying this problem and attempt to solve it in various ways [interfaceexploration, spandex, capi].

In this project we define a standard interface to hardware accelerators that is independent of the nature of the accelerator. This interface uses a scratchpad memory between the processor and hardware accelerator. Scratchpad memory has been well studied in the literature, with works proposing basic [scratchpadmem] to advanced [stash] usage of the scratchpad.

While a cache could also be placed between the processor and hardware accelerator, caches incur high overheads from two mechanisms. First, caches require additional hardware to store and compare tags for the data held within. Second, caches require address translation when the host processor and accelerator are operating in different domains. We believe these overheads are unnecessary since we intend to focus on fixed-function hardware accelerators. Since these accelerators rely upon a known data layout in memory, the host and accelerator can avoid checking tags by ensuring data is placed into and read from the correct locations. Further, the data we intend to pass between the host processor and accelerator is not normally application visible and therefore has no need to enter the application memory space. Using physical scratchpad addresses therefore offers a low-overhead approach to fine-grained data movement and cooperation.

Flexible accelerator design has also been addressed by many previous works. [npu1, npu2] provide a programmable layer structure to emulate a variety of networks. A more recent approach is to decompose networks into common primitives which can then be freely combined to accelerate different network models. The Sidebar technique fits well with this design principle, but we go further and include the host processor in the neural network acceleration hardware.

9 Conclusion

In this paper we explore flexible accelerator design for neural networks. The design uses static tensor computation units with a low-latency Sidebar that allows some computations to be offloaded to the host processor. Compared to large fixed-function accelerators, our design achieves similar performance while remaining flexible. We believe such designs will be required to develop accelerators for applications that undergo rapid development but still require the performance and energy improvements of hardware acceleration.

References