CNNergy
An Analytical CNN Energy Model
Data processing on convolutional neural networks (CNNs) places a heavy burden on energy-constrained mobile platforms. This work optimizes energy on a mobile client by partitioning CNN computations between in situ processing on the client and offloaded computations in the cloud. A new analytical CNN energy model is formulated, capturing all major components of the in situ computation, for ASIC-based deep learning accelerators. The model is benchmarked against measured silicon data. The analytical framework is used to determine the energy-optimal partition point between the client and the cloud at runtime. On standard CNN topologies, partitioned computation is demonstrated to provide significant energy savings on the client over fully cloud-based or fully in situ computation. For example, at 60 Mbps bit rate and 0.5 W transmission power, the optimal partition for AlexNet [SqueezeNet] saves up to 47.4 energy over fully cloud-based computation, and 31.3 in situ computation.
Machine learning using deep convolutional neural networks (CNNs) constitutes a powerful approach that is capable of handling a wide range of visual processing tasks with high accuracy. Due to the highly energy-intensive nature of CNN computations, today’s deep learning (DL) engines using CNNs are largely based in the cloud [1, 2], where energy is less of an issue than on battery-constrained mobile clients. Although a few simple emerging applications such as facial recognition are performed in situ on mobile processors, today’s dominant mode is to offload DL computations from the mobile device to the cloud. The deployment of specialized hardware accelerators for embedded DL to enable energy-efficient execution of CNN tasks is the next frontier.
Limited client battery life places stringent energy limitations on embedded DL systems. This work focuses on a large class of inference engine applications where battery life considerations are paramount over performance: e.g., a health worker in a remote area who uses a mobile client to capture images processed using DL to diagnose cancer [3], a farmer who takes crop photographs and uses DL to diagnose plant diseases [4], or an unmanned aerial vehicle utilizing DL for monitoring populations of wild birds [5]. In all these examples, client energy is paramount for the operator in the field and processing time is secondary, i.e., while arbitrarily long processing times are unacceptable, somewhat slower processing times are acceptable for these applications. Moreover, for client-focused design, it is reasonable to assume that the datacenter has a plentiful power supply, and the focus is on minimizing client energy rather than cloud energy. While this paradigm may not apply to all DL applications (e.g., our solution is not intended to be applied to applications such as autonomous vehicles, where high-speed data processing is important), the class of energy-critical client-side applications without stringent latency requirements encompasses a large corpus of embedded DL tasks that require energy optimization at the client end.
To optimize client energy, this work employs computation partitioning between the client and the cloud. Fig. 1 shows an inference engine computation on AlexNet [6], a representative CNN topology, for recognition of an image from the camera of a mobile client. If the CNN computation is fully offloaded to the cloud, the image from the camera is sent to the datacenter, incurring a communication overhead corresponding to the number of data bits in the compressed image. Fully in situ CNN computation on the mobile client involves no communication, but drains its battery during the energy-intensive computation.
Computation partitioning between the client and the cloud represents a middle ground: the computation is partially processed in situ, up to a specific CNN layer, on the client. The data is then transferred to the cloud to complete the computation, after which the inference results are sent back to the client, as necessary. We propose NeuPart, a partitioner for DL tasks that minimizes client energy. NeuPart is based on an analytical modeling framework, and analytically identifies, at runtime, the energy-optimal partition point at which the partial computation on the client is sent to the cloud.
Fig. 2 concretely illustrates the tradeoff between communication and computation for AlexNet. In (a), we show the cumulative computation energy (obtained from our analytical CNN energy model presented in Section IV) from the input to a specific layer of the network. In (b), we show the volume of compressed data that must be transmitted when the partial in situ computation is transmitted to the cloud. The data computed in internal layers of the CNN tend to have significant sparsity (over 80%, as documented later in Fig. 9): NeuPart leverages this sparsity to transmit only nonzero data at the point of partition, thus limiting the communication cost.
The net energy cost for the client for a partition at the $L^{\text{th}}$ layer can be computed as:
$$E_{\text{cost}} = E_L + E_{\text{trans}} \qquad (1)$$
where $E_L$ is the processing energy on the client, up to the $L^{\text{th}}$ CNN layer, and $E_{\text{trans}}$ is the energy required to transmit this partially computed data from the client to the cloud. The inference result corresponds to a trivial amount of data to return the identified class, and involves negligible energy.
NeuPart focuses on minimizing $E_{\text{cost}}$ for the energy-constrained client (as stated earlier, this work targets energy-constrained clients and relatively latency-insensitive applications such as [3, 4, 5]). For the data in Fig. 2, $E_L$ increases monotonically as we move deeper into the network, but $E_{\text{trans}}$ can reduce greatly. Thus, the optimal partitioning point for $E_{\text{cost}}$ here lies at an intermediate layer.
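The partition-point selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the NeuPart implementation: the function name `optimal_partition`, its argument names, and the simple transmission model (energy = transmission power × bits / bit rate) are assumptions made for this example. Index 0 stands for full offloading of the compressed input image.

```python
from typing import Sequence, Tuple


def optimal_partition(cum_energy_j: Sequence[float],
                      tx_bits: Sequence[float],
                      bit_rate_bps: float,
                      tx_power_w: float) -> Tuple[int, float]:
    """Return (partition index, client energy in joules) minimizing the
    sum of client processing energy and transmission energy.

    cum_energy_j[k]: client energy to compute up to partition point k.
    tx_bits[k]:      compressed bits sent to the cloud at partition point k.
    Assumed model:   transmission energy = tx_power_w * bits / bit_rate_bps.
    """
    if len(cum_energy_j) != len(tx_bits) or not cum_energy_j:
        raise ValueError("inputs must be non-empty and of equal length")
    best_layer, best_cost = -1, float("inf")
    for layer, (e_l, bits) in enumerate(zip(cum_energy_j, tx_bits)):
        e_trans = tx_power_w * bits / bit_rate_bps
        cost = e_l + e_trans
        if cost < best_cost:
            best_layer, best_cost = layer, cost
    return best_layer, best_cost


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    energy = [0.0, 0.01, 0.02, 0.05]   # joules
    bits = [6e6, 4e6, 0.5e6, 0.4e6]    # bits to transmit
    print(optimal_partition(energy, bits, bit_rate_bps=60e6, tx_power_w=0.5))
```

Because the scan is linear in the number of layers, re-running it when the bit rate or transmission power changes is inexpensive.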
The specific contributions of NeuPart are as follows. First, unlike prior computation partitioning works [7, 8, 9], NeuPart addresses the client/cloud partitioning problem for specialized DL accelerators that are more energy-efficient than CPUs/GPUs/FPGAs, and demonstrates the optimal tradeoff point under various communication environments. Second, the core of NeuPart is driven by a new analytical model for CNN energy (we name our model ‘CNNergy’) that accounts for the complexities of scheduling computations over multidimensional data. CNNergy captures key parameters of the hardware and maps the hardware to perform computation on various CNN topologies. CNNergy is benchmarked on several CNNs: AlexNet, SqueezeNet-v1.1 [10], VGG-16 [11], and GoogleNet-v1 [12]. CNNergy is far more detailed than prior energy models [13] and incorporates implementation-specific details of a DL accelerator, capturing all major components of the in situ computation, including the cost of arithmetic computations, memory and register access, control, and clocking.
It is important to note that CNNergy may have utility beyond this work, and has been open-sourced at https://github.com/manasiumn37/CNNergy. For example, it provides a breakdown of the total energy into specific components, such as data access energy from different memory levels of a DL accelerator, data access energy associated with each CNN data type from each level of memory, and MAC computation energy. CNNergy can also be used to explore design-phase tradeoffs such as analyzing the impact of changing on-chip memory size on the total execution energy of a CNN.
The paper is organized as follows. Section II discusses prior approaches to computational partitioning, and is followed by Section III, where the general framework of CNNergy for energy estimation on CNN accelerators is outlined. Next, Sections IV-VI present the details of CNNergy and validate the model in several ways, including against silicon data. A method for performing the NeuPart client/cloud partitioning at runtime is discussed in Section VII, after which Section VIII evaluates NeuPart on widely used CNN topologies. The paper concludes in Section IX.
Computational partitioning has previously been used in the general context of distributed processing [14]. A few prior works [7, 8, 9] have utilized computation partitioning in the context of mobile DL. In [7], tasks are offloaded to a server from nearby IoT devices for best server utilization, but no attempt is made to minimize edge device energy. The work in [9] uses limited application-specific profiling data to schedule computation partitioning. Another profiling-based scheme [8] uses client-specific profiling data to form a regression-based computation partitioning model for each device. A limitation of profiling-based approaches is that they require profiling data for each mobile device or each DL application, which implies that a large number of profiling measurements are required for real-life deployment. Moreover, profiling-based methods require the hardware to already be deployed and cannot support design-phase optimizations. Furthermore, all these prior approaches use a CPU/GPU-based platform for the execution of DL workloads.
In contrast with prior methods, NeuPart works with specialized DL accelerators, which are orders of magnitude more energy-efficient than general-purpose machines [1, 15, 16], for client/cloud partitioning. NeuPart specifically leverages the highly structured nature of computations on CNN accelerators, and shows that an analytical model predicts the client energy accurately (as demonstrated in Section V). The analytical framework used in NeuPart, CNNergy, incorporates implementation-specific details that are not modeled in prior works. For example, [8] uses (a) an 8-bit uncompressed raw image for transmission at the input, which is not typical: in a real system, images are compressed before transmission to reduce the communication overhead; and (b) a higher bit width (i.e., 32-bit data) for the intermediate layers, ignoring any data sparsity at the intermediate CNN layers, thereby failing to fully leverage the inherent computation-communication tradeoff of CNNs and the benefit of the computation partitioning scheme. As mentioned previously, the analytical model is likely to have applications in work on design-stage optimizations, beyond just the computational partitioning work in this paper.
The computational layers in CNNs can be categorized into three types: convolution (Conv), fully connected (FC), and pooling (Pool). The computation in a CNN is typically dominated by the Conv layers. In each layer, the computation involves three types of data:
ifmap, the input feature map
filter, the filter weights, and
psum, the intermediate partial sums.
Parameters  Description  

Height/width of a filter  
Padded height/width of an ifmap  
Height/width of an ofmap  
#of channels in an ifmap and filter  


Convolution stride 
Table I summarizes the parameters associated with a convolution layer. As shown in Fig. 3, for a Conv layer, filter and ifmap are both 3D data types consisting of multiple 2D planes (channels). Both the ifmap and filter have the same number of channels, while the ofmap height and width are determined by the padded ifmap dimensions, the filter dimensions, and the convolution stride.
During the convolution, an element-wise multiplication between the filter and the green 3D region of the ifmap in Fig. 3 is followed by the accumulation of all products (i.e., psums), and results in one element shown by the green box in the output feature map (ofmap). Each channel of the filter slides through its corresponding channel of the ifmap with the convolution stride, repeating similar multiply-accumulate (MAC) operations to produce a full 2D plane of the ofmap. A nonlinear activation function (e.g., a rectified linear unit, ReLU) is applied after each layer, introducing sparsity (i.e., zeros) at the intermediate layers, which can be leveraged to reduce computation.
The above operation is repeated for all filters to produce 2D planes for the ofmap, i.e., the number of channels in the ofmap equals the number of 3D filters in that layer. Due to the nature of the convolution operation, there is ample opportunity for data reuse in a Conv layer. FC layers are similar to Conv layers but are smaller in size, and produce a 1D ofmap. Computations in Pool layers serve to reduce dimensionality of the ofmaps produced from the Conv layers by storing the maximum/average value over a window of the ofmap.
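To make the size of a Conv layer concrete, the short Python sketch below computes the ofmap dimensions and the number of MAC operations from the layer shape. The function and argument names are assumptions chosen for this illustration; the relation used is the standard one for a padded input, ofmap size = (ifmap size - filter size) / stride + 1.

```python
def ofmap_size(ifmap_size: int, filter_size: int, stride: int) -> int:
    """Ofmap extent along one dimension for an already padded ifmap."""
    if filter_size > ifmap_size:
        raise ValueError("filter is larger than the padded ifmap")
    return (ifmap_size - filter_size) // stride + 1


def conv_macs(ifmap_h: int, ifmap_w: int, channels: int,
              filter_h: int, filter_w: int, num_filters: int,
              stride: int) -> int:
    """MAC count for one image: every ofmap element of every filter needs
    one MAC per filter weight (filter_h * filter_w * channels)."""
    out_h = ofmap_size(ifmap_h, filter_h, stride)
    out_w = ofmap_size(ifmap_w, filter_w, stride)
    return out_h * out_w * num_filters * channels * filter_h * filter_w


if __name__ == "__main__":
    # First Conv layer of AlexNet: 227x227x3 input, 96 filters of 11x11, stride 4.
    print(ofmap_size(227, 11, 4))                # 55
    print(conv_macs(227, 227, 3, 11, 11, 96, 4)) # 105415200
```

The count ignores sparsity, so it is an upper bound on the MACs an accelerator that skips zero-valued inputs would actually perform.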
Since the inference task of a CNN comprises a very structured and fixed set of computations (i.e., MAC, nonlinear activation, pooling, and data access for the MAC operands), specialized hardware accelerators are very suitable for their execution. Various dedicated accelerators have been proposed in the literature for efficient processing of CNNs [1, 17, 16, 18, 19]. The architecture in Google TPU [1] consists of a 2D array of parallel MAC computation units, a large on-chip buffer for the storage of ifmap and psum data, additional local storage inside the MAC computation core for the filter data, and finally, off-chip memory to store all the feature maps and filters together. Similarly, along with an array of parallel processing elements, the architectures in [19, 16, 18] use a separate on-chip SRAM to store a chunk of filter, ifmap, and psum data, and an external DRAM to completely store all the ifmap/ofmap and filter data. In [16, 17], local storage is used inside each processing element while allowing data communication between processing elements to facilitate better reuse of data. Additionally, [19, 18, 17] exploit the inherent data sparsity in the internal layers of CNNs to save computation energy.
It is evident that the key architectural features of these accelerators are fundamentally similar: an array of processing elements to perform neural computation in parallel, multiple levels of memory for fast data access, and greater data reuse.
Since the goal of NeuPart is to minimize client energy, we develop an analytical model (CNNergy) for energy dissipation in a CNN hardware architecture. The general framework of CNNergy is illustrated in Fig. 4. We use CNNergy to determine the in situ computation energy (the client processing term in (1)), accounting for scheduling and computation overheads.
One of the largest contributors to energy is the cost of data access from memory. Data reuse is therefore critical for energy-efficient execution of CNN computations, since it reduces unnecessary high-energy memory accesses, particularly for the ifmap and filter weights, and it is used in [1, 17, 16, 18, 19]. This may involve, for example, ifmap and filter weight reuse across convolution windows, ifmap reuse across filters, and reduction of psum terms across channels. Given the accelerator memory hierarchy, the number of parallel processing elements, and the CNN layer shape, the scheduling block of Fig. 4 is an automated scheme for scheduling MAC computations while maximizing data reuse. The detailed methodology for obtaining these scheduling parameters is presented in Section IV-C.
Depending on the scheduling parameters, the subvolume of the ifmap to be processed at a time is determined. The energy computation block of Fig. 4 then computes the corresponding energy for the MAC operations and associated data accesses. The computation in this block is repeated to process the entire data volume in a layer, as detailed in Section IV-D.
The framework of CNNergy is general and its principles apply to a large class of CNN accelerators. However, to validate the framework, we demonstrate it on a specific platform, Eyeriss [17], for which ample performance data is available, including silicon measurements, to validate CNNergy. Eyeriss has an array of processing elements (PEs), each with:
a multiply-accumulate (MAC) computation unit.
register files (RFs) for filter, ifmap, and psum data.
The sizes of these three RFs determine the maximum number of filter, ifmap, and psum elements, each stored at the data bit width, that can be held in a PE.
The accelerator consists of four levels of memory: DRAM, global SRAM buffer (GLB), inter-PE RF access, and local RF within a PE. During the computations of a layer, filters are loaded from DRAM to the RF. In the GLB, storage is allocated for psum and ifmap. After loading data from DRAM, ifmaps are stored into the GLB to be reused from the RF level. The irreducible psums navigate through GLB and RF as needed. After complete processing of a 3D ifmap, the ofmaps are written back to DRAM.
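We formulate an analytical model (CNNergy) for the CNN processing energy (used in (1)), $E_L$, up to the $L^{\text{th}}$ layer, as
$$E_L = \sum_{i=1}^{L} E_{\text{layer}}(i) \qquad (2)$$
where $E_{\text{layer}}(i)$ is the energy required to process layer $i$ of the CNN. To keep the notation compact, we drop the index “$i$” in the remainder of this section. We can write $E_{\text{layer}}$ as:
$$E_{\text{layer}} = E_{\text{comp}} + E_{\text{cntrl}} + E_{\text{data}} \qquad (3)$$
where $E_{\text{comp}}$ is the energy to compute MAC operations associated with the layer, $E_{\text{cntrl}}$ represents the energy associated with the control and clocking circuitry in the accelerator, and $E_{\text{data}}$ is the memory data access energy,
$$E_{\text{data}} = E_{\text{onchip}} + E_{\text{DRAM}} \qquad (4)$$
i.e., the sum of data access energy from on-chip memory (from GLB, inter-PE, and RF), and from the off-chip DRAM.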
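The additive structure of this model maps directly onto a small data structure. The Python sketch below is illustrative only: the class `LayerEnergy`, its field names, and the helper `cumulative_energy` are hypothetical and are not taken from the CNNergy code base. It stores the per-layer components, sums them per layer, and accumulates the per-layer totals into the client energy up to each layer.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Sequence


@dataclass(frozen=True)
class LayerEnergy:
    """Energy components of one CNN layer, in joules (hypothetical names)."""
    comp: float    # MAC computation energy
    cntrl: float   # control and clocking energy
    onchip: float  # data access energy from GLB, inter-PE, and RF
    dram: float    # data access energy from off-chip DRAM

    @property
    def data(self) -> float:
        """Memory data access energy: on-chip plus DRAM."""
        return self.onchip + self.dram

    @property
    def total(self) -> float:
        """Layer energy: computation plus control plus data access."""
        return self.comp + self.cntrl + self.data


def cumulative_energy(layers: Sequence[LayerEnergy]) -> List[float]:
    """Entry k is the energy to process layers 0..k on the client."""
    return list(accumulate(layer.total for layer in layers))


if __name__ == "__main__":
    layers = [LayerEnergy(1.0, 0.5, 2.0, 3.0), LayerEnergy(2.0, 0.5, 1.0, 1.5)]
    print(cumulative_energy(layers))  # [6.5, 11.5]
```

Keeping the components separate, rather than storing only totals, preserves the per-memory-level breakdown that the model is designed to expose.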
The computation of these energy components, particularly the data access energy, is complicated by their dependence on the data reuse pattern in the CNN. In the following subsections, we develop a heuristic for optimal data reuse and describe the methodology in our CNNergy for estimating these energy components.
Fig. 5 illustrates how the 3D ifmap is processed by convolving one 3D filter with ifmap to obtain one 2D channel of ofmap; this is repeated over all filters to obtain all channels of ofmap. Due to the large volume of data in a CNN layer and the limited availability of on-chip storage (register files and SRAM), the data is divided into smaller subvolumes, each of which is processed by the PE array in one pass to generate psums, where the capacity of the PE array and local storage determine the amount of data that can be processed in a pass.
All psums are accumulated to produce the final ofmap entry; if only a subset of psums is accumulated, then the generated psums are said to be irreducible. The pink region shows the ifmap volume that is covered in one pass, while the green region shows the volume that is covered in multiple passes before a writeback to DRAM.
As shown in the figure (for reasons provided in Section IV-C), consecutive passes first process the ifmap along its two spatial directions, and finally along the channel direction. After a pass, irreducible psums are written back to GLB, to be later consolidated with the remainder of the computation to build ofmap entries. After processing the full channel direction (i.e., all the channels of a filter and ifmap), the green ofmap region is formed and then written back to DRAM. The same process is then repeated until the full volume of ifmap/ofmap is covered.
Fig. 5 is a simplified illustration that shows the processing of one 3D ifmap using one 3D filter. Depending on the amount of available register file storage in the PE array, a convolution operation using multiple filters can be performed in a pass. Furthermore, subvolumes from multiple images (i.e., ifmaps) can be processed together, depending on the SRAM storage capacity.
Due to the high cost of data fetches, it is important to optimize the pattern of fetch operations from the DRAM, GLB, and register file by reusing the fetched data. The level of reuse is determined by the computation scheduling parameters summarized in Table II. Hence, the efficiency of the computation is based on the choice of these parameters. The mapping approach that determines these parameters, in a way that attempts to minimize data movement, is described in Section IV-C.
In this section we describe how the convolution operations are distributed in the 2D PE array. We use the row-stationary scheme to manage the dataflow for convolution operations in the PE array, as it is shown to offer higher energy efficiency than other alternatives [20, 21].
We explain the row-stationary dataflow with a simplified example shown in Fig. 6, where a single channel of the filter and ifmap is processed. Fig. 6(b) shows a basic computation where a filter (green region) is multiplied with the part of the ifmap that it is overlaid on, shown by the dark blue region. Based on the row-stationary scheme for the distributed computation, these four rows of the ifmap are processed in four successive PEs within a column of the PE array. Each PE performs an element-wise multiplication of the ifmap row and the filter row to create a psum. The four psums generated are transmitted to the uppermost PE and accumulated to generate the output psum (dark orange).
Extending this operation to a full convolution implies that the ifmap slides under the filter with the convolution stride, while keeping the filter stationary; in this example, two strides are required to cover the ifmap along the sliding direction. For each stride, each of the four PEs performs the element-wise multiplication between one filter row and one ifmap row, producing one 1D row of psum, which is then accumulated to produce the first row of psum, as illustrated in Fig. 6(c).
Thus, the four dark blue rows in Fig. 6(c) are processed by four PEs (one per row) in a column of the PE array. The reuse of the filter avoids memory overheads due to repeated fetches from other levels of memory. To compute psums associated with other rows of the ofmap, a subarray of 12 PEs (4 rows × 3 columns) processes the ifmap, with each successive PE column handling the ifmap region shifted by one stride in the other spatial direction. The ifmap regions thus processed are shown by the dark regions of Fig. 6(d),(e).
We define the amount of processing performed in as many PE rows as the filter height, across all columns of the PE array, as a set. For a 4 × 3 PE array, a set corresponds to the processing in Fig. 6(c)–(e).
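In the general context of the PE array, a set spans as many PE rows as the filter height, $R$. Therefore, for a PE array with $J$ rows, the number of sets which can fit in the full PE array (i.e., the number of sets in a pass) is given by:
$$S_{\text{pass}} = \left\lfloor \frac{J}{R} \right\rfloor \qquad (5)$$
i.e., $S_{\text{pass}}$ is the ratio of the PE array height to the filter height, rounded down to an integer.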
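As a small worked example of this ratio, the Python sketch below computes the number of sets per pass and the PE rows left idle. The function names and the 12-row array used in the usage example are assumptions for illustration, and rounding down to an integer is assumed because a set cannot be split across passes.

```python
def sets_per_pass(pe_rows: int, filter_height: int) -> int:
    """Number of complete sets that fit in the PE array height."""
    if filter_height < 1 or pe_rows < 1:
        raise ValueError("dimensions must be positive")
    if filter_height > pe_rows:
        raise ValueError("filter height exceeds the PE array height")
    return pe_rows // filter_height


def idle_rows(pe_rows: int, filter_height: int) -> int:
    """PE rows that no set can occupy in a pass."""
    return pe_rows - sets_per_pass(pe_rows, filter_height) * filter_height


if __name__ == "__main__":
    # Illustrative 12-row PE array with filter heights of 11, 5, and 3.
    for r in (11, 5, 3):
        print(r, sets_per_pass(12, r), idle_rows(12, r))
```

For the assumed 12-row array, filter heights of 11, 5, and 3 give 1, 2, and 4 sets per pass, with 1, 2, and 0 rows idle, respectively.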
We now generalize the previously simplified assumptions in the example to consider typical parameter ranges. We also move from the single-channel assumption to the more typical multi-channel case.
First, the filter height can often be less than the height of the PE array. When this is the case, the remaining PE array rows can process more filter channels simultaneously, in multiple sets.
Second, typical RF sizes in each PE are large enough to operate on more than one 1D row. Under this scenario, within each set, a group of 2D filter planes is processed. There are several degrees of freedom in mapping computations to the PE array. When several 1D rows are processed in a PE, the alternatives for choosing this group of 1D rows include:
(i) choosing filter/ifmap rows from different channels of the same 3D filter/ifmap;
(ii) choosing filter rows from different 3D filters;
(iii) combining (i) and (ii), where some rows come from the channels of the same 3D filter and some rows come from the channels under different 3D filters.
Across sets in the PE array, similar mapping choices are available. Different groups of filter planes (i.e., channels) are processed in different sets. These groups of planes can be chosen either from the same 3D filter, or from different 3D filters, or from a combination of both.
Thus, there is a wide space of mapping choices for performing the CNN computation in the PE array.
Due to the high cost of data accesses from the next memory level, it is critical to reuse data as often as possible to achieve energy efficiency. Specifically, after fetching data from a higher access-cost memory level, data reuse refers to the use of that data element in multiple MAC operations. For example, after fetching a data element from GLB to RF, if that data is used across several MAC operations, then the number of times it is reused with respect to the GLB level equals that number of MAC operations.
We now examine the data reuse pattern within the PE array. Within each PE column, an ifmap plane is processed along one spatial direction, and multiple PE columns process the ifmap plane along the other spatial direction. Two instances of data reuse in the PE array are:
(1) In each set, the same ifmap row is processed along the PEs in a diagonal of the set. This can be seen in the example set in Fig. 6(c)–(e), where the third row of the ifmap plane is common to one PE in each of (c), (d), and (e), and these PEs lie along a diagonal of the set.
(2) The same filter row is processed in the PEs in a row: in Fig. 6(c)–(e), the first row of the filter plane is common to all PEs in all three columns of row 1.
Thus, data reuse can be enabled by broadcasting the same ifmap data (for instance (1)) and the same filter data (for instance (2)) to multiple PEs for MAC operations after they are fetched from a higher memory level (e.g., DRAM or GLB).
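The two reuse patterns can be checked with a small functional simulation of one set. The Python sketch below is an illustration under stated assumptions, not the accelerator's implementation: it models each PE as a 1D convolution of one filter row with one ifmap row, assigns ifmap row `col * stride + row` to the PE at (`row`, `col`), and accumulates the psums down each column. All function names are hypothetical, and indices are zero-based.

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def conv1d(ifmap_row: List[int], filter_row: List[int], stride: int) -> List[int]:
    """Work of one PE: slide one filter row along one ifmap row."""
    s = len(filter_row)
    n_out = (len(ifmap_row) - s) // stride + 1
    return [sum(ifmap_row[j * stride + k] * filter_row[k] for k in range(s))
            for j in range(n_out)]


def row_stationary_set(ifmap: Matrix, filt: Matrix, stride: int = 1
                       ) -> Tuple[Matrix, Dict[int, List[Tuple[int, int]]]]:
    """Simulate one set: PE (row, col) holds filter row `row` and processes
    ifmap row col*stride + row; each column yields one ofmap row."""
    r = len(filt)
    n_cols = (len(ifmap) - r) // stride + 1
    ofmap: Matrix = []
    usage: Dict[int, List[Tuple[int, int]]] = {}  # ifmap row -> PEs using it
    for col in range(n_cols):
        acc = None
        for row in range(r):
            i_row = col * stride + row
            usage.setdefault(i_row, []).append((row, col))
            psum = conv1d(ifmap[i_row], filt[row], stride)
            acc = psum if acc is None else [a + b for a, b in zip(acc, psum)]
        ofmap.append(acc)
    return ofmap, usage


def direct_conv(ifmap: Matrix, filt: Matrix, stride: int = 1) -> Matrix:
    """Reference 2D convolution used to check the simulated mapping."""
    r, s = len(filt), len(filt[0])
    e = (len(ifmap) - r) // stride + 1
    g = (len(ifmap[0]) - s) // stride + 1
    return [[sum(ifmap[y * stride + i][x * stride + j] * filt[i][j]
                 for i in range(r) for j in range(s))
             for x in range(g)] for y in range(e)]


if __name__ == "__main__":
    ifmap = [[(y * 6 + x) % 5 for x in range(6)] for y in range(6)]
    filt = [[(i + 2 * j) % 3 for j in range(4)] for i in range(4)]
    ofmap, usage = row_stationary_set(ifmap, filt)
    print(ofmap == direct_conv(ifmap, filt))
    print(usage[2])  # PEs sharing the third ifmap row
```

With a 6 × 6 ifmap, a 4 × 4 filter, and unit stride, the set occupies 4 rows × 3 columns, and the PEs that share the third ifmap row lie on a diagonal, while every PE in a given row holds the same filter row.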
Notation  Description  

Computation Scheduling Parameters  
#of filters processed in a pass  
#of ifmap/filter channels processed in a pass  
()  Height of ifmap (ofmap) processed in a pass  
()  Width of ifmap (ofmap) processed in a pass  
() 


#of ifmap from different images processed together  
Accelerator Hardware Parameters  
Size of RF storage for filter in one PE  
Size of RF storage for ifmap in one PE  
Size of RF storage for psum in one PE  
Height of the PE array (#of rows)  
Width of the PE array (#of columns)  
GLB  Size of GLB storage  
bit width of each data element 
As seen in Section IV-B, depending on the specific CNN and its layer structure, there is a wide space of choices for computations to be mapped to the PE array. The mapping of filter and ifmap parameters to the PE array varies with the CNN and with each layer of a CNN. This mapping is a critical issue in ensuring low energy, and therefore, in this work, we develop an automated mapping scheme for any CNN topology. The scheme computes the parameters for scheduling computations. The parameters are described in Section IV-A and summarized in Table II. The table also includes the parameters for the accelerator hardware constraints.
For general CNNs, for each layer, we develop a mapping strategy that follows predefined rules to determine the computation scheduling. The goal of scheduling is to attempt to minimize the movement of three types of data (i.e., ifmap, psum, and filter), since data movement incurs large energy overheads. In each pass, the mapping strategy uses the following priority rules:
(i) We process the maximum possible channels of an ifmap to reduce the number of psum terms that must move back and forth with the next level of memory.
(ii) We prioritize filter reuse and psum reduction over ifmap reuse.
The rationale for Rules (i) and (ii) is that since a very large number of psums is generated in each layer, psum reduction is the most important factor for energy, particularly because transferring psums to the next pass involves expensive transactions with the next level of memory. This in turn implies that filter weights must remain stationary for maximal filter reuse. Rule (ii) lowers the number of irreducible psums: if the filter is changed and ifmap is kept fixed, the generated psums are not reducible.
In processing the ifmap, proceeding along the two spatial directions enables the possibility of filter reuse, as the filter is kept stationary in the RF while the ifmap is moved. In contrast, if passes were to proceed along the channel direction, filter reuse would not be possible since new filter channels must be loaded from the DRAM for the convolution with ifmap. Therefore, the channel direction is the last to be processed. In terms of filter reuse, the two spatial directions are equivalent, and we arbitrarily prioritize one over the other.
We use the notion of a set and a pass (Section IV-B) in the flow graph to devise the choice of scheduling parameters:
The value of and is limited by the number of columns, , in the PE array. The corresponding value of is found using the relation
(6) 
The number of channels of each ifmap in a pass, $z_i$, is computed as
$$z_i = C_{\text{set}} \times S_{\text{pass}} \qquad (7)$$
where $C_{\text{set}}$ is the number of channels per set, and $S_{\text{pass}}$ is the number of sets per pass (given by (5)). Recall that the first priority rule of CNNergy is to process the largest possible number of ifmap channels at a time. Therefore, to compute $C_{\text{set}}$, we find the number of filter rows that can fit into an ifmap RF.
To enable per-channel convolution, the filter RF of a PE must be loaded with the same number of channels from a single filter. The remainder of the dedicated filter RF storage can be used to load channels from different filters so that one ifmap can be convolved with multiple filters, resulting in ifmap reuse. Thus, after maximizing the number of channels of an ifmap/filter to be processed in a pass, the remaining storage can be used to enable ifmap reuse. Therefore, the number of filters processed in a pass is
(8) 
During a pass, the ifmap corresponds to the pink region in Fig. 5, and over multiple passes, the entire green volume of the ifmap in the figure is processed before a writeback to DRAM.
We first compute $|\text{ifmap}|$ and $|\text{psum}|$, the storage requirements of ifmap and psum, respectively, during the computation. Over several passes, the pink region creates, for each of the filters processed in a pass, a set of psums for the corresponding region of the ofmap that are not fully reduced (i.e., they await the results of more passes). Therefore,
(9)  
(10) 
where corresponds to the bit width for ifmap and psum.
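Next, we determine how many ifmap passes can be processed for a limited GLB size, $|\text{GLB}|$. This is the number, $N$, of pink regions that can fit within the GLB, i.e.,
$$N = \left\lfloor \frac{|\text{GLB}|}{|\text{ifmap}| + |\text{psum}|} \right\rfloor \qquad (11)$$
To compute the ifmap width processed in a pass, we first set it to the full ifmap width, and we set the ofmap height processed before a writeback to the full ofmap height, to obtain $N$. If $N < 1$, i.e., $|\text{ifmap}| + |\text{psum}| > |\text{GLB}|$, then both are reduced until the data fits into the GLB and $N \geq 1$.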
From the values of and computed above, we can determine and using the relations
(12) 
Fig. 7 shows the flow graph that summarizes how the parameters for scheduling the CNN computation are computed. The module takes the CNN layer shape (Table I) and the accelerator hardware parameters (Table II) as inputs. Based on our automated mapping strategy, the module outputs the computation scheduling parameters (Table II).
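The GLB-fitting step of this flow can be sketched as follows. This is a rough Python illustration under several assumptions that are not taken from the source: the ifmap footprint is modeled as width × pass height × channels per pass, the psum footprint as ofmap width × ofmap height before writeback × filters per pass, and when the data does not fit, the ofmap height is shrunk first and the ifmap width second. The function name and all parameter names are hypothetical.

```python
from typing import Tuple


def fit_to_glb(glb_bits: int, bit_width: int,
               full_ifmap_w: int, full_ofmap_h: int,
               pass_ifmap_h: int, channels_per_pass: int,
               filters_per_pass: int, filter_w: int,
               stride: int) -> Tuple[int, int, int]:
    """Return (ifmap width, ofmap height before writeback, images per GLB).

    Starts from the full ifmap width and full ofmap height, then shrinks
    them until at least one ifmap/psum working set fits in the GLB.
    The shrink order is an assumption of this sketch.
    """
    x_i, y_o = full_ifmap_w, full_ofmap_h
    while True:
        x_o = (x_i - filter_w) // stride + 1
        ifmap_bits = bit_width * x_i * pass_ifmap_h * channels_per_pass
        psum_bits = bit_width * x_o * y_o * filters_per_pass
        n_images = glb_bits // (ifmap_bits + psum_bits)
        if n_images >= 1:
            return x_i, y_o, n_images
        if y_o > 1:
            y_o -= 1              # first shrink the ofmap height
        elif x_i - stride >= filter_w:
            x_i -= stride         # then shrink the ifmap width
        else:
            raise ValueError("working set cannot fit in the GLB")


if __name__ == "__main__":
    # Made-up layer and buffer sizes, for illustration only.
    print(fit_to_glb(100_000, 16, 31, 27, 5, 2, 4, 5, 1))
    print(fit_to_glb(40_000, 16, 31, 27, 5, 2, 4, 5, 1))
```

A production scheduler would shrink in larger steps and respect the PE array dimensions; the loop above only shows the fit test and the role of the images-per-GLB count.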
The mapping method handles exceptions:
If , some PE columns will remain unused. This is avoided by setting . If the new ifmappsumGLB, is reduced so that the data fits into the GLB.
If , all channels are processed in a pass while increasing , as there is more room in the PE array to process more filters. The cases , proceed by reducing .
All Conv layers whose filter has the dimension $1 \times 1$ (e.g., inside the inception modules of GoogleNet, or the fire modules of SqueezeNet) are handled under a fixed exception rule that uses a reduced number of channels per pass and a suitably increased number of filters per pass.
The exceptions are triggered only for a few CNN layers (i.e., layers having relatively few channels or filters).
In the previous section, we have determined the subvolume of ifmap and filter data to be processed in a pass. From the scheduling parameters we can also compute the number of passes before a writeback of ofmap to DRAM. Therefore, we have determined the schedule of computations to generate all channels of ofmap. We now estimate each component of the layer energy in (3). The steps for this energy computation are summarized in Algorithm 1, which takes as input the computation scheduling parameters, CNN layer shape parameters, and technology-dependent parameters that specify the energy per operation (Table III), and outputs the layer energy.
We begin by computing the subvolume of data loaded in each pass (Lines 1–5). In Fig 5, is illustrated as the pink ifmap region which is processed in one pass for an image, and is the number of psum entries associated with the orange ofmap region for a single filter and single image. The filter data is reused across () passes, and we denote the number of filter elements loaded for these passes by . Thus, for filters and images,
(13)  
(14)  
(15) 
To compute energy, we first determine , the data access energy required to process volume of each ifmap over filters and images. In each pass, a volume of the ifmap is brought from the DRAM to the GLB for data access; psums move between GLB and RF; and RF-level data accesses () occur for the four operands associated with each MAC operation in a pass. Therefore, the corresponding energy can be computed as:
(16) 
Here, denotes the energy associated with operation , and each energy component can be computed by multiplying the energy per operation by the number of operations. Since filter data is reused across () passes, all components in (16), except the energy associated with filter access, are multiplied by this factor. Each psum is written once and read once, and accounts for both operations.
Next, all channels of ifmap (i.e., the entire green ifmap region in Fig. 5) are processed to form the green region of each ofmap channel, and this data is written back to DRAM. To this end, we compute , the data access energy to produce fraction of each ofmap channel over filters and images, by repeating the operations in (16) to cover all the channels:
(17) 
Finally, the computation in (17) is repeated to produce the entire volume of the ofmap over all filters. Therefore, the total energy for data access is
(18) 
Here, the multipliers , , and represent the number of iterations of this procedure to cover the entire ofmap. These steps are summarized in Lines 6–9 of Algorithm 1. Finally, the computation energy of the Conv layer is computed by:
(19) 
where is the energy per MAC operation, and it is multiplied by the number of MACs required for a CNN layer.
The analytical model exploits sparsity of the data (i.e., zeros in ifmap/ofmap) at internal layers of a CNN. Except for the input ifmap to the first Conv layer of a CNN, all data communication with the DRAM (i.e., ifmap read or ofmap write) is performed in run-length compressed (RLC) format [17]. In addition, for a zero-valued ifmap, the MAC computation as well as the associated filter and psum read (write) from (to) the RF level is skipped to reduce energy.
The control overhead includes the clock power, overheads for control circuitry for the PE array, the network-on-chip to manage data delivery, I/O pads, etc. Of these, the clock power is a major contributor (documented as 33%–45% in [17]), and other components are relatively modest. The total control energy, , is modeled as:
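As a minimal sketch of the computation-energy step in (19) combined with the zero-skipping described above, the Python snippet below counts the MACs of a Conv layer and multiplies by an energy per MAC. The MAC-count formula (images × filters × channels × filter height × filter width × ofmap height × ofmap width) is the standard one for a dense convolution and is assumed here rather than taken from the text. The function names, the 1 pJ/MAC figure, and the treatment of zero-skipping as a simple (1 − sparsity) factor are likewise illustrative assumptions.

```python
def conv_mac_count(n_images, n_filters, n_channels,
                   filt_h, filt_w, ofmap_h, ofmap_w):
    """Dense MAC count of one Conv layer (standard formula, assumed here)."""
    return (n_images * n_filters * n_channels
            * filt_h * filt_w * ofmap_h * ofmap_w)


def computation_energy(mac_count, energy_per_mac, ifmap_sparsity=0.0):
    """Energy per MAC times the number of MACs actually performed.

    Assumption: a MAC is skipped whenever its ifmap operand is zero, so the
    performed fraction is approximated by (1 - ifmap_sparsity).
    """
    if not 0.0 <= ifmap_sparsity <= 1.0:
        raise ValueError("ifmap_sparsity must lie in [0, 1]")
    return mac_count * (1.0 - ifmap_sparsity) * energy_per_mac


# AlexNet Conv1: 96 filters of size 11x11x3 producing a 55x55 ofmap.
macs = conv_mac_count(1, 96, 3, 11, 11, 55, 55)
# 1e-12 J per MAC is a placeholder, not a value from Table III.
e_comp = computation_energy(macs, 1e-12)
```

The first Conv layer sees the dense input image, so the sparsity argument is left at zero in this example; for internal layers a nonzero value would reduce the MAC energy proportionally.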
(20) 
where is the clock power, is the number of cycles required to process a single layer, and is the clock period; is the control energy from components other than the clock network. For a supply voltage of , the clock power is computed as:
(21)  
(22) 
where is leakage in the clock network. The switching capacitance, , includes capacitances of the global clock buffers, wires, clocked registers in the PEs, and clocked SRAM (GLB) components, i.e., the decoder, registers, and precharges for the bitline and sense amplifier:
(23) 
The clock is distributed as an H-tree, and we choose the size and number of the clock buffers to maintain a slew rate within 10% of . We model as 15% of excluding , similar to data from the literature.
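The control-energy model can be sketched in a few lines of Python. The snippet assumes the usual dynamic-power form (switching capacitance × supply voltage squared ÷ clock period, plus leakage) for the clock power, and reads the total control energy as clock power × number of cycles × clock period plus the remaining control energy, following the verbal definitions given for (20) and (21). All function and parameter names and all numerical values are illustrative assumptions, and the 15% rule is read here as a fraction of the layer energy that excludes the non-clock control term.

```python
def clock_power(c_clk, v_dd, t_clk, leakage_power):
    """Clock power in W: dynamic switching term plus clock-network leakage."""
    return c_clk * v_dd ** 2 / t_clk + leakage_power


def other_control_energy(energy_excluding_other, fraction=0.15):
    """Non-clock control energy as a fixed fraction of the remaining energy."""
    return fraction * energy_excluding_other


def control_energy(p_clk, n_cycles, t_clk, e_other):
    """Total control energy in J for one layer."""
    return p_clk * n_cycles * t_clk + e_other


# Placeholder values only: 1 nF switched capacitance, 1 V supply,
# 5 ns clock period, 10 mW clock-network leakage, one million cycles.
p_clk = clock_power(1e-9, 1.0, 5e-9, 0.01)
e_clk_only = control_energy(p_clk, 1_000_000, 5e-9, 0.0)
```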
We validate CNNergy against limited published data for AlexNet and GoogleNet-v1:
(i) EyMap, the Eyeriss energy model, utilizing the mapping parameters provided in [17]. This data only provides parameters for the five convolution layers of AlexNet.
(ii) EyTool, Eyeriss’s energy estimation tool [23], excludes the control energy and supports AlexNet and GoogleNet-v1 only.
(iii) EyChip, measured data from 65nm silicon [17] (AlexNet Conv layers only, excludes ).
Note that our CNNergy exceeds the capability of these:
CNNergy is suitable for customized energy access (i.e., any intermediate CNN energy component is obtainable).
CNNergy can find energy for various accelerator parameters.
CNNergy can analyze a vast range of CNN topologies and general CNN accelerators, not just Eyeriss.
To enable a direct comparison with Eyeriss, 16-bit fixed-point arithmetic precision is used to represent feature maps and filter weights. The technology parameters are listed in Table III. The available process data is from 45nm and 65nm nodes, and we use the factor to scale 45nm data for direct comparison with measured 65nm silicon.
For the control energy, capacitive components in (22) and (23) are extracted from the NCSU 45nm PDK [24]. To estimate , the per-unit wire capacitance of the top metal layer, the input gate capacitance of the clock buffer, the capacitive load at the clock input of a flip-flop (i.e., register), and the clock-load components from the SRAM memory are extracted. The results are scaled to the 65nm node by the scaling factor . The resultant clock power is computed by (21), and the for each layer in (20) is inferred as , where the numerator is a property of the CNN topology and the denominator is obtained from [17].
Fig. 8 compares the energy obtained from CNNergy, EyTool, and EyMap to process an input image for AlexNet. As stated earlier, EyTool excludes the control energy; accordingly, our comparison also omits it. The numbers match closely.
Fig. 8 shows the energy results for AlexNet including the control-energy component in (3) for both CNNergy and EyMap. The EyTool data, which neglects control energy, is significantly off from the more accurate data for CNNergy, EyMap, and EyChip, particularly in the Conv layers. Due to unavailability of reported data, the bars in Fig. 8 only show the Conv layer energy for EyMap and EyChip. Note that EyChip does not include the component of (4).
Fig. 8 compares the energy from CNNergy with the EyTool energy for GoogleNet-v1. Note that the only available hardware data for GoogleNet-v1 is from EyTool, which does not report control energy: this number matches the non-control component of CNNergy closely. As expected, the energy is higher when control energy is included.
The transmission energy, , is a function of the available data bandwidth, which may vary depending on the environment that the mobile client device is in. We use the following model [14] to estimate the energy required to transmit data bits from the mobile client to the cloud.
(24) 
where is the transmission power of the client, and is the effective transmission bit rate. For the highly sparse data at internal layers of a CNN, run-length compression (RLC) encoding is used to reduce the transmission overhead. The number of transmitted RLC-encoded data bits, , is:
(25) 
Here, is the number of output data bits at each layer including zero elements, Sparsity is the fraction of zero elements in the respective data volume, and is the average RLC encoding overhead for each bit associated with the nonzero elements in the raw data (i.e., to encode each bit of a nonzero data element, on average, bits are required). Using 4-bit RLC encoding (i.e., to encode information about the number of zeros between nonzero elements) for 8-bit data (for evaluations in Section VIII), and 5-bit RLC encoding for 16-bit data (during Eyeriss validation in Section V), is 3/5 and 1/3, respectively (note that this overhead only applies to the few nonzeros in very sparse data).
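The transmission model can be illustrated with a short Python sketch. It assumes that (24) has the form transmission power × transmitted bits ÷ bit rate, and that (25) keeps only the nonzero fraction of the raw bits and inflates each of them by the average RLC overhead; both forms are inferred from the verbal definitions above. The function and parameter names are hypothetical, and the example numbers are placeholders rather than measured values.

```python
def rlc_bits(raw_bits, sparsity, overhead):
    """Transmitted bits after RLC encoding.

    raw_bits: output bits of the layer, zeros included.
    sparsity: fraction of zero elements, in [0, 1].
    overhead: average extra bits per nonzero data bit (e.g. 3/5 for the
              4-bit RLC on 8-bit data mentioned in the text).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    return raw_bits * (1.0 - sparsity) * (1.0 + overhead)


def transmission_energy(tx_power_w, bits, bit_rate_bps):
    """Energy in J to send `bits` at `bit_rate_bps` with power `tx_power_w`."""
    if bit_rate_bps <= 0:
        raise ValueError("bit rate must be positive")
    return tx_power_w * bits / bit_rate_bps


# Placeholder example: 1000 raw bits, 75% zeros, overhead 3/5.
sent = rlc_bits(1000, 0.75, 3 / 5)
energy = transmission_energy(0.5, sent, 60e6)
```

With these assumptions, a layer that is 75% sparse transmits 40% of its raw bit count, which is why partitioning at a sparse internal layer can be cheaper than sending the input.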
Although our framework aims to optimize client energy, we also evaluate the total time required to complete an inference () in the client+cloud. For a computation partitioned at the layer, the inference delay is modeled as:
(26) 
where [] denote the layer latency at the client [cloud], is the number of layers in the CNN, and is the time required for data transmission at the layer. The latency for each layer is computed as in Section V where the Throughput comes from the client and cloud platforms.
In this section, we discuss how NeuPart is used during runtime for partitioning CNN workloads between a mobile client and the cloud. Fig. 9 shows the average (μ) and standard deviation (σ) of data sparsity at various CNN layers over 10,000 ImageNet validation images for AlexNet, SqueezeNet-v1.1, GoogleNet-v1, and VGG-16. For all four networks, the standard deviation of sparsity at all layers is an order of magnitude smaller than the average. However, at the input layer (i.e., the input image itself), when the image is transmitted in standard JPEG-compressed format, the sparsity of the JPEG-compressed image, SparsityIn, shows significant variation (documented in Fig. 11), implying that the transmit energy can vary significantly.
Therefore, a significant observation is that for all the intermediate layers, Sparsity is primarily a feature of the network and not the input data, and can be precomputed offline as a standard value, independent of the input image. This implies that , which depends on Sparsity, can be computed offline for all the intermediate layers without incurring any optimality loss on the partitioning decision. Only for the input layer is it necessary to compute during runtime. The runtime optimization algorithm is therefore very simple and is summarized in Algorithm 2 (for notational convenience we use superscript/subscript to indicate layer in this algorithm).
The cumulative CNN energy vector () up to each layer of a CNN (i.e., ) depends on the network topology and is therefore precomputed offline by CNNergy. Likewise, for layer 2 to is precharacterized using the average Sparsity value associated with each CNN layer. During runtime, for an input image with JPEG-compressed sparsity SparsityIn, for layer 1 (i.e., input layer) is computed (Line 2). Finally, at runtime, with a user-specified transmission bit rate , and transmission power , is obtained for all the layers, and the layer that minimizes is selected as the optimal partition point, (Lines 3–6).
Since is a user-specified parameter in the runtime optimization algorithm, depending on the communication environment (i.e., a bad connection, variable bandwidth), a user can provide the available bit rate and obtain the partitioning decision based on that value at runtime.
Overhead of Runtime Optimization: The computation of Algorithm 2 requires only () multiplications, divisions, () additions, and comparison operations (Lines 2–5), where is the number of layers in the CNN topology. For standard CNNs, is a very small number (e.g., for AlexNet, GoogleNet-v1, SqueezeNet-v1.1, and VGG-16, lies between 12 and 22). Finding the optimal partition layer at runtime with NeuPart is therefore computationally very cheap.
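The runtime procedure described above can be sketched as follows. This is not the paper's Algorithm 2 verbatim but a minimal Python illustration under stated assumptions: the per-layer client cost is taken to be the precomputed cumulative energy plus transmission power × transmitted bits ÷ bit rate, the input-layer bit count uses the same sparsity and overhead formula as the internal layers, and layers are indexed from 0 (the input layer). All names and numbers are hypothetical.

```python
def choose_partition(cum_energy, offline_bits, input_raw_bits, sparsity_in,
                     overhead, tx_power_w, bit_rate_bps):
    """Return (index of the cheapest partition layer, per-layer client costs).

    cum_energy[i]  : precomputed client energy in J up to layer i, where
                     index 0 is the input layer.
    offline_bits[j]: precomputed transmitted bits for layers 1..n-1, based
                     on the average sparsity of each layer.
    """
    # Only the input layer needs its transmitted bit count computed at runtime.
    input_bits = input_raw_bits * (1.0 - sparsity_in) * (1.0 + overhead)
    bits = [input_bits] + list(offline_bits)
    if len(bits) != len(cum_energy):
        raise ValueError("need one bit count per layer")
    # One multiply, one divide, and one add per layer, then an argmin.
    costs = [e + tx_power_w * d / bit_rate_bps
             for e, d in zip(cum_energy, bits)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs


# Placeholder four-layer example (input layer plus three layers).
cum = [0.0, 0.02, 0.05, 0.09]      # J, hypothetical
off = [4e6, 5e5, 8e3]              # bits, hypothetical
mid_rate, _ = choose_partition(cum, off, 10e6, 0.5, 0.6, 1.0, 100e6)
high_rate, _ = choose_partition(cum, off, 10e6, 0.5, 0.6, 1.0, 1e9)
low_rate, _ = choose_partition(cum, off, 10e6, 0.5, 0.6, 1.0, 1e6)
```

In this toy example the optimum moves from the last layer (FISC) at a low bit rate, through an intermediate layer, to the input layer (FCC) at a high bit rate, which mirrors the qualitative behavior discussed in the evaluation.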
Note that the inference result returned from the cloud computation corresponds to a trivial amount of data (i.e., only one number associated with the identified class) which is, for example, 5 orders of magnitude lower than the number of data bits to transmit at the P2 layer of AlexNet (already very low, see Fig. 2(b)). Therefore, the cost of receiving the result makes no perceptible difference in the partitioning decision.
We now evaluate the computational partitioning scheme, using the models in Sections IV and VI. Similar to the state-of-the-art [1, 25], we use 8-bit inference for our evaluation. The energy parameters from Table III are quadratically scaled for multiplication and linearly scaled for addition and memory access to obtain 8-bit energy parameters. We compare the results of partitioning with:
FCC: fully cloud-based computation
FISC: fully in situ computation on the client
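The bit-width scaling rule used to obtain the 8-bit energy parameters (quadratic for multiplication, linear for addition and memory access) can be written as a small Python helper. The function name, the operation labels, and the example energies are assumptions for illustration and are not the Table III values.

```python
def scale_energy(energy, kind, from_bits=16, to_bits=8):
    """Scale a per-operation energy from one bit width to another.

    kind: "mult" scales quadratically with the bit-width ratio;
          "add" and "memory" scale linearly.
    """
    ratio = to_bits / from_bits
    if kind == "mult":
        return energy * ratio ** 2
    if kind in ("add", "memory"):
        return energy * ratio
    raise ValueError(f"unknown operation kind: {kind!r}")


# Placeholder 16-bit energies (arbitrary units).
mult_8bit = scale_energy(4.0, "mult")     # quarter of the 16-bit value
add_8bit = scale_energy(4.0, "add")       # half of the 16-bit value
mem_8bit = scale_energy(10.0, "memory")   # half of the 16-bit value
```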
The energy cost in (1) for each layer of a CNN is analyzed under various communication environments for the mobile cloud-connected client. For the transmission power in (24), we use power numbers similar to those reported in the literature from smartphone measurements [26, 27, 28]. We present analysis using bit rate () as a variable parameter to evaluate the benefit from the computation partitioning scheme as the available bandwidth changes. For all plots: (i) “In” is the input layer (i.e., the input image data); (ii) layers starting with “C”, “P”, and “FC” denote Conv, Pool, and FC layer, respectively; (iii) layers starting with “Fs” and “Fe” denote squeeze and expand layer, respectively, inside a fire module of SqueezeNet-v1.1.
At the In layer, before transmission, the image is JPEGcompressed with a quality level of (lower provides better compression but the worsened image induces CNN accuracy degradation). The energy overhead associated with JPEG compression [29] is incorporated in for the In layer but is negligible.
For an input image, Fig. 10 shows the energy cost associated with each layer of AlexNet at 100 Mbps bit rate and 1 W transmission power. The minimum occurs at an intermediate layer, P2, of AlexNet, which is 34.7% more energy-efficient than the In layer (FCC) and 26.6% more energy-efficient than the last layer (FISC). This shows that offloading data at an intermediate layer is more energy-efficient for the client than FCC or FISC. Fig. 10 shows a similar result with an intermediate optimal partitioning layer for SqueezeNet-v1.1. Here, the Fs4 layer is optimal and is 50.2% and 34.4% more energy-efficient than FCC and FISC, respectively.
The cost of FCC is image-dependent, and varies with the sparsity, SparsityIn, of the compressed JPEG image, which alters the transmission cost to the cloud. Fig. 11 shows that the 5500 test images in the ImageNet database show large variations in SparsityIn. We divide this distribution into four quartiles, delimited at points , , and .
For representative images whose sparsity corresponds to , , and , Fig. 12 shows the energy savings on the client at the optimal partition of AlexNet, as compared to FCC (left axis) and FISC (right axis). For various bit rates, the plots correspond to two different of 0.5 W and 1.3 W, corresponding to typical smartphone specifications.
In Fig. 12, a 0% savings with respect to FCC [FISC] indicates the region where the In [output] layer is optimal, implying that FCC [FISC] is the most energy-efficient choice. Figs. 12 and 12 show that for a wide range of communication environments, the optimal layer is an intermediate layer and provides significant energy savings as compared to both FCC and FISC. However, this also depends on image sparsity: a higher value of SparsityIn makes FCC more competitive or even optimal, especially for images in quartile IV (Fig. 12). Still, for many images in the I–III quartiles, there is a large space where offloading neural computation at the intermediate optimal layer is energy-optimal. Similar trends are seen for SqueezeNet-v1.1, where the ranges of for which an intermediate layer is optimal are even larger than for AlexNet, with higher energy savings.
The optimum partition is often, but not always, at an intermediate point. For example, for GoogleNet-v1, a very deep CNN, in many cases either FCC or FISC is energy-optimal, due to the large amount of computation as well as the comparatively higher data dimension associated with its intermediate layers. However, for smaller SparsityIn values (i.e., images which do not compress well), the optimum can indeed occur at an intermediate layer, implying energy savings by the client/cloud partitioning. For the VGG-16 CNN topology, the optimal solution is FCC, rather than partial on-board computation or FISC. This is not surprising: VGG-16 incurs extremely high computation cost along with large data volume in the deeper layers, resulting in high energy for client-side processing.
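Table IV: Average percent energy savings at the optimal partition with respect to FCC (by SparsityIn quartile) and FISC
CNN  FCC, Quartile I  FCC, Quartile II  FCC, Quartile III  FCC, Quartile IV  FISC
AlexNet  47.4%  33.8%  17.9%  0.9%  31.3%
SqueezeNet  70.0%  62.3%  53.2%  31.2%  31.3%
GoogleNet  23.7%  5.3%  0.0%  0.0%  9.6%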
Under a fixed transmission power and bit rate, Table IV reports the average energy savings at the optimal layer as compared to FCC and FISC for all the images lying in Quartiles I–IV, specified in Fig. 11. Note that the savings with respect to FISC do not depend on SparsityIn. The shaded regions in Table IV indicate the regions where energy saving is obtained by the client/cloud partitioning. For AlexNet, the optimum occurs at an intermediate layer mostly for the images in Quartiles I–III while providing up to 47.4% average energy savings. For SqueezeNet-v1.1, in all four quartiles, most images show an optimum at an intermediate layer and provide up to 70% average energy savings on the client.
Evaluation of Inference Delay: To evaluate the inference delay (), we use the Google TPU [1], a widely deployed DNN accelerator in datacenters, as the cloud platform, with Throughput = 92 TeraOps/s used in (26). At the median SparsityIn value (), Fig. 13 compares the of energy-optimal partitioning of AlexNet with FCC and FISC for various bit rates. The delay of FISC does not depend on the communication environment and is constant, whereas the delay of FCC decreases with higher bit rate. The range of for which an intermediate layer becomes energy-optimal is extracted using (Fig. 12). The blue curve in Fig. 13 shows the inference delay when partitioned at those energy-optimal intermediate layers. It is evident that, in terms of inference delay, the energy-optimal intermediate layers are either better than FCC (lower bit rate) or closely follow FCC (higher bit rate), and in most cases are better than FISC.
Impact of Variations in : We have analyzed the impact of changes in the available bandwidth (e.g., due to network crowding) on the optimal partition point. For an image with SparsityIn of and 0.5 W , Fig. 13 shows the energy cost of AlexNet when partitioned at P1, P2, and P3 layers (the candidate layers for an intermediate optimal partitioning). It shows that the energy valley is very flat with respect to when the minimum shifts from P3 to P2 and from P2 to P1 layer (the green vertical lines). Therefore, changes in negligibly change energy gains from computational partitioning. For example, in Fig. 13, layer P3 is optimal for Mbps, P2 is optimal for Mbps, and P1 is optimal for Mbps. However, if changes from 85 to 100 Mbps, even though the optimal layer changes from P2 to P1, the energy for partitioning at P2 instead of P1 is virtually the same.
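The bit rate at which the optimum shifts between two candidate layers can be derived directly if the client cost of partitioning at a layer is assumed to be its cumulative energy plus transmission power × transmitted bits ÷ bit rate. Setting the costs of a shallower layer (less computation, more data) and a deeper layer (more computation, less data) equal gives a crossover bit rate of power × (bit difference) ÷ (energy difference). The Python sketch below implements this; the names and numbers are hypothetical and do not correspond to the P1, P2, or P3 values in Fig. 13.

```python
def partition_cost(cum_energy, bits, tx_power_w, bit_rate_bps):
    """Client energy in J for partitioning at a layer (assumed cost form)."""
    return cum_energy + tx_power_w * bits / bit_rate_bps


def crossover_bit_rate(e_shallow, d_shallow, e_deep, d_deep, tx_power_w):
    """Bit rate at which the shallower and deeper layers cost the same.

    Below this rate the deeper layer is cheaper; above it the shallower
    layer is cheaper.
    """
    if e_deep <= e_shallow or d_shallow <= d_deep:
        raise ValueError("expected more energy and fewer bits at the deeper layer")
    return tx_power_w * (d_shallow - d_deep) / (e_deep - e_shallow)


# Placeholder values: energies in J, data volumes in bits, power in W.
b_star = crossover_bit_rate(0.02, 4e6, 0.05, 1e6, 0.5)
cost_shallow = partition_cost(0.02, 4e6, 0.5, b_star)
cost_deep = partition_cost(0.05, 1e6, 0.5, b_star)
```

Near the crossover the two costs are nearly equal by construction, which is consistent with the flat energy valley observed when the optimal layer changes.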
We demonstrate how our analytical CNN energy model (CNNergy) in Section IV can be used to perform a design space exploration of the CNN hardware accelerator. For 8-bit inference on an AlexNet workload, Fig. 13 shows the total energy as a function of the global SRAM buffer (GLB) size. The trend in GLB energy vs. size was extracted using CACTI [30].
When the GLB size is low, data reuse becomes difficult since the GLB can only hold a small chunk of ifmap and psum at a time. This leads to much higher total energy. As the GLB size is increased, data reuse improves until it saturates. Beyond a point, the energy increases due to higher GLB access cost. The minimum energy occurs at a size of 88kB. However, a good engineering solution is 32kB because it saves 63.6% memory cost over the optimum, with only a 2% optimality loss. CNNergy supports similar design space exploration for other accelerator parameters as well.
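The selection rule implied here (take the smallest buffer whose energy is within a small tolerance of the minimum) can be sketched in Python. The energy curve below is entirely hypothetical and only mimics the qualitative shape described in the text; only the 88kB and 32kB sizes and the resulting 63.6% memory saving come from the source.

```python
def smallest_within_tolerance(sizes_kb, energies, tolerance):
    """Smallest size whose energy is within `tolerance` of the minimum."""
    e_min = min(energies)
    acceptable = [s for s, e in zip(sizes_kb, energies)
                  if e <= e_min * (1.0 + tolerance)]
    return min(acceptable)


def memory_saving(chosen_kb, optimal_kb):
    """Fraction of memory saved by choosing a smaller buffer."""
    return 1.0 - chosen_kb / optimal_kb


# Hypothetical normalized energy curve with its minimum at 88 kB.
sizes = [8, 16, 32, 64, 88, 128]
energy = [2.0, 1.3, 1.015, 1.005, 1.0, 1.04]
chosen = smallest_within_tolerance(sizes, energy, 0.02)
saving_pct = round(memory_saving(32, 88) * 100, 1)
```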
In order to best utilize the battery-limited resources of a cloud-connected mobile client, this paper presents an energy-optimal DL scheme that uses partial in situ execution on the mobile platform, followed by data transmission to the cloud. An accurate analytical model for CNN energy (CNNergy) has been developed by incorporating implementation-specific details of a DL accelerator architecture. To estimate the energy for any CNN topology on this accelerator, an automated computation scheduling scheme is developed, and it is shown to match the performance of the layer-wise ad hoc scheduling approach of prior work [17]. The analytical framework is used to predict the energy-optimal partition point for the mobile client at runtime, while executing CNN workloads, with an efficient algorithm. The in situ/cloud partitioning scheme is also evaluated under various communication scenarios. The evaluation results demonstrate that there exists a wide communication space for AlexNet and SqueezeNet where energy-optimal partitioning can provide remarkable energy savings on the client.
This work was supported in part by NSF Award CCF-1763761.
[1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in Proc. ISCA, June 2017, pp. 1–12.