I. Introduction
CNNs have achieved extensive success in miscellaneous artificial intelligence applications such as image classification and object detection. A plethora of models have emerged with different operators and architectures, gradually shifting attention from accuracy to efficiency in terms of speed and power. Classified as lightweight CNNs, MobileNets
[4][8] adopt depthwise separable convolution to reduce computation complexity and parameter amount, while other models, such as SqueezeNet [5], alter model topology to spare computation power. However, such model-level changes reduce runtime hardware efficiency. To be more specific, layers in modern lightweight CNNs are heterogeneous with respect to computation pattern, especially between depthwise convolution layers and regular convolution layers. Even for the same layer type, the CTC (computation-to-communication) ratio differs significantly with various layer characteristic parameters such as input feature map size and kernel size. Consequently, processors with uniform PEs (processing elements) can barely accommodate all types of layers efficiently, leading to low runtime PE efficiency. Separate engines for different layers, however, result in hardware resource redundancy due to the sequential execution of engines. In other words, the balance between generality and specificity has not been addressed well. This is particularly true for workloads with multiple CNNs.
There exist two paradigms for dealing with heterogeneous CNN workloads. One resorts to a uniform architecture, while the other applies custom architectures (often called accelerators) to different models. For the first paradigm, GPGPUs focus on the acceleration of matrix multiplication with hardware multithreading on parallel arithmetic cores and perform poorly on memory-bound depthwise convolution. Similarly, TPU [6] is optimized to perform fast, bulky matrix multiplication with homogeneous systolic arrays. Nevertheless, it suffers from poor performance on workloads with sparse computation intensity, such as element-wise algebra. Light-OPU [14] utilizes a uniform overlay architecture on FPGA for lightweight layers. Yet special-purpose modules, e.g., the line buffer and squeeze-and-excitation block, might be redundant for the majority of layers, indicating resource inefficiency and potential room for further speedup. Xilinx DPU [3][12] also incurs a non-negligible additional resource cost for a separate depthwise convolution engine and a channel augmentation module for small output channels.
On the other hand, accelerators like [15] adopt a layer-wise architecture tailored for bottom layers and a uniform architecture for top layers, based on the observation that CTC ratios fluctuate significantly between the two groups but only slightly within the top layers. [11][9][7] map one or several specific layers to heterogeneous convolution accelerators to handle various layer characteristics. However, lightweight models are missing from these accelerators, so the performance is unknown when applied to the latest lightweight models, which are far more heterogeneous than traditional VGG-like models. [10] utilizes separate PEs for depthwise and regular convolution layers to improve individual throughput, which, in contrast, increases the overall redundancy in hardware resources.
To sum up, existing work with homogeneous PEs suffers from low resource efficiency on lightweight models, while work with heterogeneous PEs has not shown support for lightweight models or has exhibited resource redundancy. To address these problems, we aim to find a balance between generality and specificity by proposing a dual-core processor (called dual-OPU), where one core is optimized for regular convolution and the other for depthwise convolution at extra resource cost on lightweight hardware modules, e.g., the line buffer. Still, each core can handle all types of layers, but with different efficiency. To reduce the overall resource redundancy, we run multiple heterogeneous layers in parallel such that layers that prefer lightweight modules can be accommodated by a single core. We interleave layers from different models to increase the chance of parallelizing heterogeneous layers. Then we tune the PE numbers and PE input sizes of each core to balance the parallel workload, and the balance is further fine-tuned at tile granularity by layer splitting. As a result, runtime efficiency is maximized. Our contributions are listed as follows:

• We propose a heterogeneous dual-core architecture with a fine-grained PE configuration space for high runtime PE efficiency.

• We develop a scheduling algorithm that interleaves layers from different models to exploit dual-core parallelism for high throughput.

• Given a set of target CNN models, we can find a PE configuration with which high average throughput can be achieved.
II. Motivation
Lightweight CNN models typically include low-computation-intensity operators, e.g., depthwise convolution and pointwise convolution, to reduce operation count and parameter amount. However, this model-level optimization leads to significant irregularity in computation complexity between layers, where memory-intensive lightweight layers are interleaved with compute-intensive regular convolution layers. Therefore, when deployed on a processor with uniform PEs, some layers have low runtime PE efficiency, as defined in Eq. 1, where $M$ is the multiply-and-accumulate (MAC) operation amount performed in the measurement with frequency $f$, $N$ is the number of allocated PEs, $m$ is the number of MAC operations performed by each PE per clock cycle, and $T$ is the measured latency in seconds.

$$\mathrm{Efficiency} = \frac{M}{N \cdot m \cdot f \cdot T} \qquad (1)$$
Runtime PE efficiency measures the ratio between computation time and the total latency of a layer. Memory-intensive layers can barely overlap communication time fully with computation time, and thus have lower runtime PE efficiency than compute-intensive layers. Furthermore, the gap between the runtime PE efficiencies of different layers is amplified by the variation of layer characteristic parameters such as input/output channels, input feature map height/width and kernel height/width.
Only when these layer-specific parameters match the MAC unit sizes can we make full use of the arithmetic resources.
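As a concrete illustration, the efficiency metric of Eq. 1 can be computed as below (a minimal Python sketch with hypothetical numbers, not measurements from the paper):

```python
def runtime_pe_efficiency(macs, num_pes, macs_per_pe, freq_hz, latency_s):
    """Runtime PE efficiency: ideal compute time over measured latency."""
    # Ideal time if every PE performed its full number of MACs each cycle.
    ideal_time = macs / (num_pes * macs_per_pe * freq_hz)
    return ideal_time / latency_s

# Hypothetical layer: 1000 MACs on 10 PEs (1 MAC/PE/cycle) at 100 Hz with a
# measured latency of 2 s -> only half the latency is useful computation.
eff = runtime_pe_efficiency(1000, 10, 1, 100, 2.0)
```

A memory-bound layer stretches the measured latency without adding MACs, which is exactly how the zigzag curves in Fig. 1 arise.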
To show the layer-wise differences in runtime PE efficiency, we mimic the architecture of Light-OPU proposed in [14] to measure the latency of each layer. Industrial processors like TPU are not good candidates for layer-wise measurement due to their high runtime variance for an individual layer. As shown in Fig. 1, the average runtime efficiency is 59%, 41% and 62% for MobileNet v1, MobileNet v2 and SqueezeNet v1, respectively. We observe zigzag curves for all three models, where high efficiencies are contributed by regular convolutions and low ones come from depthwise convolutions for MobileNet v1/v2 and pointwise convolutions for SqueezeNet v1. Both sources of low PE efficiency have more limited computing parallelism than regular convolution. Devoid of output channel parallelism, depthwise convolution layers achieve 42% and 37% PE efficiency in MobileNet v1 and v2, respectively. Pointwise convolution layers with small output channels lead to 41% PE efficiency in SqueezeNet v1. The significant performance gap between different layer types calls for heterogeneous PEs, which customize arithmetic structures for specific layer types to improve runtime PE efficiency.

Even with heterogeneous PEs, we cannot ignore that these lightweight models are almost purely sequential, which can hardly make full use of parallel heterogeneous PEs on hardware. With the emphasis on the throughput of multiple input images, prior work concurrently runs layers for different input images for better resource efficiency. [12] runs MobileNet v2 in a layer-pipeline schedule so that one pointwise convolution layer and one depthwise convolution layer for two input images run on different engines in parallel. Running two layers in parallel, however, still results in low runtime PE efficiency due to latency imbalance, even though its depthwise engine already uses a small number of PEs to compensate. Since the PE number of its depthwise engine is not tunable, this imbalance worsens when the CTC ratios differ more significantly between the two layer types, as in MobileNet v1. So we need to optimize PE allocation as well as layer scheduling.
III. Heterogeneous Dual-core Architecture

III-A. Proposed Architecture
To overcome the aforementioned problems, i.e., low runtime PE efficiency due to the heterogeneous workload and imbalanced schedules, we propose a novel dual-core architecture with fine-grained PE array configuration and an automatic design flow to find the best configuration for high throughput. We define a core as a computing unit with independent input/output buffers, a PE array and a post-processing unit. We introduce two types of cores: the channel-parallel core (c-core) and the pixel-parallel core (p-core). The c-core exploits input/output channel parallelism for regular convolution, which usually has large channel numbers. The p-core takes advantage of the pixel parallelism in the kernel window for depthwise convolution, where a line buffer is required as extra hardware support for data fetch due to the reuse of input feature map pixels from the sliding window. PE configurations of the two cores are optimized with respect to layer characteristics for high runtime PE efficiency.
As shown in Fig. 2, our heterogeneous architecture consists of one c-core and one p-core. Inside a core, PEs are homogeneous, and the PE configuration and buffer sizes are customized. Each core has ping-pong structured buffers for input feature maps, weights and biases. Once a group of memory blocks for input feature map, weight and bias is loaded from off-chip memory to the input buffers, the MAC operation pipeline is initiated. Partial sums are stored in the output buffer for further accumulation. Post-processing operations, such as pooling and activation, are also included in the computation pipeline and are initiated once a group of output feature maps is ready.
III-B. PE Array Configuration
We optimize the PE array in terms of $(N, m)$, where $N$ is the number of PEs and $m$ is the number of multipliers for each PE. Each PE implements an inner product over its $m$ multiplication products, which are reduced to one sum with a balanced adder tree. Additional adders after the PE outputs further reduce the $N$ PE results to a configurable number of accumulated results; this number is dynamically configured by instructions to obtain high runtime PE efficiency.
Regarding data fetch for the c-core, input feature maps are duplicated and broadcast to PEs to exploit channel parallelism. The p-core has an alternative data-fetch path that uses the line buffer. The kernel-height and kernel-width tiling sizes determine how much of the kernel window is computed for one single memory load from external memory to the on-chip buffer. When either tiling size is larger than one, the line buffer expands input feature maps by the corresponding sliding-window pixels before broadcasting.
Our PE may use multiple DSP macros on FPGA. We decompose each DSP into two 8-bit multipliers to make full use of computation resources. However, the two decomposed multipliers must share one input due to hardware constraints. The two multipliers in a c-core PE share one input feature map pixel between two output channel weights, while in the p-core two pixels share one input channel weight. Double input feature map buffers are integrated with the p-core PE array. As a result, two groups of sliding-window pixels along the input feature map height dimension are computed in parallel to make full use of DSP resources.
While the c-core is more computationally powerful for channel parallelism, the p-core is flexible for both channel parallelism and pixel parallelism at the cost of some computation power and extra hardware components. In addition to multipliers, the p-core requires more resources for auxiliary components, such as the line buffer and extra input feature map buffers, than the c-core. A processor using a single p-core not only leads to low runtime PE efficiency but also results in inefficient use of auxiliary components. For example, when deploying a regular convolution with 64 input channels and 16 output channels on p-core P(16,64), we prefer not to use the line buffer, since the PE array configuration perfectly matches the layer channel numbers. The line buffer is not useful in such a case: if it were used, it would feed the sliding-window pixels, which range from 1 to 9, as inputs of the inner product, leaving most of the 64 PE inputs idle. Therefore, we need to carefully allocate resources between the p-core and c-core to make full use of all resources.
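The utilization argument above can be made concrete with a small sketch (a hypothetical helper, not part of the processor design): on P(16,64), a channel-parallel mapping fills all 64 PE inputs with input channels, whereas a line-buffer mapping feeds at most the 9 pixels of a 3×3 window.

```python
def pe_input_utilization(pe_input_size, used_inputs):
    """Fraction of a PE's multiplier inputs carrying useful data."""
    return used_inputs / pe_input_size

# P(16,64): 64 input channels fill every PE input.
channel_parallel = pe_input_utilization(64, 64)
# Line-buffer mapping: at most 3*3 = 9 window pixels per PE input vector.
line_buffer = pe_input_utilization(64, 9)
```

The line-buffer mapping leaves 55 of 64 multiplier inputs idle per PE, which is why the flow disables the line buffer for this layer.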
III-C. Design Flow
To determine the resource allocation for the c-core and p-core, we propose an automatic design flow to achieve high throughput on the target workload. Our flow in Fig. 3 takes the CNN model description and FPGA resource budget as inputs.
The CNN model description is parsed for layer-wise characteristics, such as input feature map height/width, input/output channels and kernel height/width, as well as data dependencies between layers. We partition the layers into two groups, where layers in one group are assigned to the c-core and layers in the other are assigned to the p-core. For each group, we search for the PE configuration that leads to the highest runtime PE efficiency, under the constraint that the total allocated resources cannot exceed the resource budget. For each configuration, we decide tiling sizes on all dimensions with the objective of maximizing runtime PE efficiency. Based on the chosen tiling sizes, latency and resource models are developed for the c-core and p-core to estimate latency and resource cost. Moreover, we balance the workload executed in parallel on the c-core and p-core to improve the throughput.
As one can see from the runtime PE efficiency shown in Fig. 1, high efficiency and low efficiency are interleaved in all three cases. Adjacent layers with a large efficiency difference are good candidates for balanced parallel execution. To leverage this, we interleave layers for two input images of the same CNN model in topological order. Then the PE numbers of the c-core and p-core are tuned based on the interleaved schedule to minimize the latency gap between parallel workloads. We further reduce the latency gap with a heuristic that splits convolution layers along the input feature map height dimension for tile reassignment. Finally, we generate hardware code and instructions based on the PE allocation and scheduling.
IV. Modeling

IV-A. Tile Sizing
We aim to find the tiling sizes of each layer that minimize the total latency. In the following sections, we use $(N, m)$ as the abbreviation of the PE configuration. Given $(N, m)$, we aim to determine the buffer sizes and whether the p-core line buffer is required for each layer, so as to estimate the latency and the total resource cost. Since tiling sizes define the tensor size to load and compute at each step and are correlated with the memory and computation resources, we first determine the tiling sizes $(IC_t, OC_t, IH_t, IW_t, KH_t, KW_t)$ for each layer, corresponding to the input-channel, output-channel, input feature map height/width and kernel height/width dimensions. Since each PE implements an inner product with multiple multiplications, the product of the kernel and channel tiling sizes is bounded by the total multiplier count, as shown in Eq. 2, where the c-core has $KH_t = 1$ and $KW_t = 1$ without line buffer. The multiple factor of $m$ in this product, denoted $\alpha$ in Eq. 3, is the number of used PEs among the total $N$ PEs, and our objective is to maximize it: a larger $\alpha$ indicates that a higher portion of the PEs is utilized and thus higher runtime PE efficiency.

$$KH_t \cdot KW_t \cdot IC_t \cdot OC_t \le N \cdot m \qquad (2)$$

$$\alpha = \frac{KH_t \cdot KW_t \cdot IC_t \cdot OC_t}{m}, \quad \alpha \le N \qquad (3)$$

We iterate the kernel tiling sizes $(KH_t, KW_t)$ over their feasible values. For each choice, we iterate over the channel tiling sizes to obtain $IC_t$, $OC_t$ and $\alpha$. $(IH_t, IW_t)$ relate to the input buffer depth. To simplify, we assume $IH_t = IW_t$, since most convolutions have square input size. Eq. 4 decides $(IH_t, IW_t)$ by memory efficiency in terms of buffer depth:

$$\text{minimize } \left\lceil \frac{IH}{IH_t} \right\rceil \cdot \left\lceil \frac{IW}{IW_t} \right\rceil \qquad (4)$$

It aims to minimize the total number of input blocks of size $(IH_t \times IW_t)$. Some available tiling-size options might yield the same runtime PE efficiency, in which case we pick the one with the lower resource cost using the resource model discussed in Section IV-C.
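The kernel-tiling search can be sketched as follows. This is a simplified model (the function names and utilization formula are our own, assuming one PE input vector packs kernel-window pixels times input channels while PEs cover output channels); the actual flow also applies the memory-efficiency rule of Eq. 4 and resource-model tie-breaking.

```python
import itertools

def tiling_utilization(num_pes, pe_width, ic, oc, kh_t, kw_t):
    """Fraction of multipliers doing useful work for one kernel tiling."""
    ic_t = min(ic, pe_width // (kh_t * kw_t))  # channels packed per PE input
    oc_t = min(oc, num_pes)                    # PEs used for output channels
    used = ic_t * kh_t * kw_t * oc_t           # busy multipliers per cycle
    return used / (num_pes * pe_width)

def search_kernel_tiling(num_pes, pe_width, ic, oc, kh, kw, line_buffer=True):
    """Pick (kh_t, kw_t) maximizing utilization; no line buffer forces (1, 1)."""
    opts = list(itertools.product(range(1, kh + 1), range(1, kw + 1))) \
        if line_buffer else [(1, 1)]
    return max(opts, key=lambda t: tiling_utilization(num_pes, pe_width,
                                                      ic, oc, *t))
```

For the P(16,64) example of Section III-B (64 input channels, 16 output channels, 3×3 kernel), this search prefers (1, 1), i.e., pure channel parallelism without the line buffer.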
IV-B. Latency Modeling
We calculate the latency of each layer and then add them up. For each layer, $t_{load}$ and $t_{comp}$ are the latencies of memory load and computation, respectively. The three terms in the numerator of Eq. 5 are the data amounts to load for the input feature map of shape $(IC_t, IH_t, IW_t)$, the weights of shape $(OC_t, IC_t, KH, KW)$ and the bias of shape $(OC_t)$, which are loaded sequentially through the limited bandwidth $BW$ of the external DRAM. Memory load works in a pipeline with $t_{CAS}$ as the last part of the latency. $t_{CAS}$ is the column address strobe (CAS) latency of DRAM access, and it stands for the delay between a memory read request and the moment data is available for the on-chip buffers. Eq. 6 depicts the computation latency as the product of the tile numbers on each dimension and the per-tile latency $t_{tile}$. Computation followed by post-processing runs in a deep pipeline: whenever a group of output feature map data is ready, it is passed to the post-processing unit and then stored to DRAM, resulting in latency $t_{post}$. Both $t_{CAS}$ and $t_{post}$ are not constant in practice; for estimation, we use average values based on multiple execution traces on FPGA. We use ceiling operators for the accuracy of the model. When the compiler generates and schedules ISA instructions, it aims to overlap $t_{load}$ and $t_{comp}$ as much as possible. Therefore, we use the maximum of $t_{load}$ and $t_{comp}$ in Eq. 7 to estimate the latency of each layer.

$$t_{load} = \frac{IC_t \cdot IH_t \cdot IW_t + OC_t \cdot IC_t \cdot KH \cdot KW + OC_t}{BW} + t_{CAS} \qquad (5)$$

$$t_{comp} = \left\lceil \frac{IC}{IC_t} \right\rceil \cdot \left\lceil \frac{OC}{OC_t} \right\rceil \cdot \left\lceil \frac{IH}{IH_t} \right\rceil \cdot \left\lceil \frac{IW}{IW_t} \right\rceil \cdot t_{tile} \qquad (6)$$

$$t_{layer} = \max\left(t_{load},\; t_{comp}\right) \qquad (7)$$
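The latency model of Eqs. 5–7 can be sketched as below (argument names and units are our own; the CAS term would be the averaged value measured on FPGA as described above):

```python
import math

def load_latency(if_elems, w_elems, b_elems, bw_elems_per_s, t_cas_s):
    """Eq. 5 analogue: sequential DRAM loads plus the CAS delay."""
    return (if_elems + w_elems + b_elems) / bw_elems_per_s + t_cas_s

def compute_latency(dims, tile_dims, cycles_per_tile, freq_hz):
    """Eq. 6 analogue: product of ceil(dim / tile) over all dimensions."""
    n_tiles = math.prod(math.ceil(d / t) for d, t in zip(dims, tile_dims))
    return n_tiles * cycles_per_tile / freq_hz

def layer_latency(t_load, t_comp):
    """Eq. 7: load and compute overlap, so the slower side dominates."""
    return max(t_load, t_comp)
```

A layer is memory-bound whenever `load_latency` exceeds `compute_latency`; that is exactly the regime where runtime PE efficiency drops.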
IV-C. Area Modeling
We now discuss the resource model, which varies with the PE configuration $(N, m)$. The total resource is the sum of the variant cost from PEs and buffers and the invariant cost from the memory controller, instruction decoder, post-processing unit, etc. The variant costs on computation and memory resources are discussed as follows.
DSP
We only use DSPs to build multipliers for efficiency. $k$ is the number of MACs that one DSP can handle. In our design, each DSP48E1 slice is decomposed into two 8-bit × 8-bit multipliers that can work simultaneously with one common input, so $k = 2$. The required number of DSPs is:

$$N_{DSP} = \left\lceil \frac{N \cdot m}{k} \right\rceil \qquad (8)$$
Memory Resource
We assume the PE array has two types of on-chip input buffers built from BRAM: one for feature maps and one for weights. The bias amount is usually small, so biases are implemented with logic resources. The widths of the feature map and weight buffers match the input bandwidth of the PE array, and their depths are determined by the tiling sizes. For the p-core, if depthwise convolution is applied, the feature map buffer has 2 banks, as in the c-core model. Xilinx FPGAs provide the RAMB18K macro for 18 Kb block RAM. Within the total 18 Kb memory capacity, RAMB18K supports configurable width × depth combinations: 36×512, 18×1K, 9×2K, 4×4K, 2×8K and 1×16K. We count the number of required RAMB18K blocks given the buffer width and depth with priority on width, which means we use the minimum number of RAMB18K blocks in terms of width.
Logic Resource
LUT and FF costs come from three aspects. (1) The adders following multipliers in the PE array: for each PE with $m$ inputs, the products are accumulated to one sum with these adders, and the input data width of an adder increases with the depth of the adder tree. (2) The delayers in a PE: delayers are required when $m$ is not a power of two and are implemented by simple register insertion. (3) The line buffer: to match the memory bandwidth of the input buffer, a certain number of line buffer channels is required, and the line buffer must be long enough that the pixels in a sliding window are preloaded before computation. We select the parameter sizes to accommodate all layers that require line buffers. We collect resource costs from different sizes of adders, delayers and line buffers implemented by the Xilinx toolchain and build a model for each component.
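The DSP and BRAM counting rules can be sketched as follows (the RAMB18K aspect-ratio table follows the combinations listed above; the stacking fallback for very deep buffers is our own assumption):

```python
import math

def dsp_count(num_pes, pe_width, macs_per_dsp=2):
    """Eq. 8: one DSP48E1 yields two 8-bit multipliers sharing an input."""
    return math.ceil(num_pes * pe_width / macs_per_dsp)

def ramb18k_count(width_bits, depth):
    """Blocks for one buffer, trying the widest aspect ratio first so the
    block count along the width is minimized (width priority)."""
    aspects = [(36, 512), (18, 1024), (9, 2048), (4, 4096),
               (2, 8192), (1, 16384)]
    for w, d in aspects:
        if depth <= d:
            return math.ceil(width_bits / w)
    # Deeper than 16K entries: stack 36x512 blocks (assumed fallback).
    return math.ceil(width_bits / 36) * math.ceil(depth / 512)
```

For example, a 128-PE array with 9 multipliers per PE needs ceil(128·9/2) = 576 DSP slices for its multipliers alone.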
Validation of Resource Model

Table I: Resource cost and utilization

|                    | LUT             | FF              | DSP          | BRAM           |
| [14]               | 137816 (67.29%) | 251433 (57.41%) | 577 (68.69%) | 237.5 (53.26%) |
| Our resource model | 137149 (67.62%) | 234046 (61.67%) | 577 (68.69%) | 237 (53.37%)   |
As shown in Table I, [14] is a p-core with input/output buffers as well as other resource-invariant modules. To show the effectiveness of our resource model on PEs and buffers, we compare the resource estimation with the real implementation results. Our model obtains estimates that closely match the implementation.
V. Optimization

V-A. Scheduling
Given the CNN graph and separate $(N, m)$ configurations of the c-core and p-core, we aim to find a schedule that maximizes the throughput of the heterogeneous dual-OPU. As shown in Fig. 4(a), nodes are layers, and edges indicate the data dependencies between layers. We first partition the graph into layer groups, each including one or multiple layers. Groups are assigned to the c-core or p-core following topological order. Since the topology limits the chance for groups assigned to different cores to be executed in parallel, we interleave layers for two input images of the same CNN graph such that more groups can run simultaneously. In Fig. 4(b), the second region from the top indicates that a group for the first input image and a group for the second image are scheduled on the p-core and c-core in parallel. Because layer characteristics and topology vary across CNN graphs, two parallel groups can still show a large latency difference even with the best allocation and partitioning scheme. As a result, one core is idle during the gap, which lowers performance. We therefore split some layers into sub-layers along the input feature map height dimension to reduce the latency gap. For example, layer 4 in Fig. 4(b) is split into two sub-layers in Fig. 4(c); one forms a group with layer 3, while the other forms a group with layer 5. Consequently, the latency gap between the corresponding parallel groups is reduced. Although the latency gap between another pair of groups increases in Fig. 4(c), the split is applied as long as the total throughput improves. Among multiple partitioning choices with the load balancing strategy, we pick the one with the highest throughput estimation as the final schedule.
V-A1. Allocation-aware Partitioning
We perform layer allocation followed by partitioning to find a scheme suitable for the scheduling method in Fig. 4(b). We adopt three simple allocation schemes: greedy allocation, layer-type based allocation and round-robin allocation. Greedy allocation decides the core assignment based on the latency estimation of a layer on the two cores; each layer is allocated to the core with the lower latency. Regardless of hardware configuration, layer-type based allocation assigns regular convolution layers to the c-core and depthwise convolution layers to the p-core so that we can exploit channel parallelism and pixel parallelism, respectively. As a fallback when the two schemes above do not work, round-robin allocation assigns layers to the two cores one by one in circular order. We partition the graph into layer groups according to the allocation results. Interleaving layers for two input images, we aim to find an allocation that minimizes the variance of group latencies on each core. Each group includes one or more layers instead of a single layer for smaller variance. Once the variance is small, the workload can be balanced well with an appropriate PE allocation on the c-core and p-core. An example partitioning result is shown in Fig. 4(a), where layer 3 and layer 4 form one group.

V-A2. Load Balancing
Partitioning at layer granularity cannot completely eliminate the latency gap between parallel layer groups. For example, a regular convolution layer scheduled to the c-core may run in parallel with a depthwise convolution layer with the same parameter sizes. The former has far more MAC operations than the latter, leading to a large latency gap when the c-core and p-core have the same number of MAC units. To balance the workload, we propose a heuristic method to split layers. Since input feature map pixels run in a pipeline on either core, we pick the height dimension to split for simplicity, as it does not complicate the partial sum accumulation along input channels. As shown in Alg. 1, we first compute the latency gap for each parallel group pair and pick the pair with the largest gap. Once the layer to split is located, we find the height to remain in the layer that minimizes the average latency of two interleaved batch runs, $L_{2batch}$, as defined in Eq. 9, where $t_{c,i}$ and $t_{p,i}$ are the latencies of the $i$-th parallel groups on the c-core and p-core. The input feature map height of the split layer is reduced accordingly, and the rest of the layer is reallocated to the other core with the complementary height. We continue splitting layers until there is no further improvement of $L_{2batch}$.

$$L_{2batch} = \sum_{i} \max\left(t_{c,i},\; t_{p,i}\right) \qquad (9)$$
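Eq. 9 and the gap-driven split selection of Alg. 1 can be sketched as below (the group latencies are hypothetical inputs):

```python
def two_batch_latency(c_latencies, p_latencies):
    """Eq. 9: each parallel pair finishes when its slower side finishes."""
    return sum(max(c, p) for c, p in zip(c_latencies, p_latencies))

def pair_with_largest_gap(c_latencies, p_latencies):
    """Index of the parallel group pair with the largest latency gap,
    i.e., the first candidate for layer splitting."""
    gaps = [abs(c - p) for c, p in zip(c_latencies, p_latencies)]
    return gaps.index(max(gaps))
```

Splitting a layer out of the widest-gap pair shrinks the `max` term for that pair, lowering the two-batch latency as long as the receiving pair does not grow by more.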
V-B. Co-optimization of PE Allocation and Scheduling
The aforementioned scheduling methods aim to obtain good throughput given an arbitrary PE allocation between the c-core and p-core. Clearly, a PE allocation that better matches the heterogeneity of the target workload leads to higher throughput. On the other hand, given a PE allocation, we need to find a scheduling method that makes full use of that workload-matching allocation. PE allocation and scheduling thus depend on each other. We first define the design space for PE allocation and then discuss how to find the best PE allocation along with scheduling for the target workload.
V-B1. Design Space for PE Allocation
Our proposed heterogeneous dual-core design is driven by the PE array configurations of the c-core and p-core. We predesign memory buffers to meet the bandwidth requirements of the PE arrays and define the parameter vector $(S, N_c, m_c, N_p, m_p)$ in Table II as the design space for PE allocation, where $S$ specifies the scheduling with respect to layer allocation according to the input model and hardware, and $(N_c, m_c)$ and $(N_p, m_p)$ are the PE configurations $(N, m)$ of the c-core and p-core, respectively. As constraints from the target FPGA device, the upper bounds on DSP, BRAM, LUT and FF resources define the valid design space, together with the bandwidth between the cores and external memory.

Table II: Design space for PE allocation

| Parameters  | Scheduling $S$; PE allocation $(N_c, m_c, N_p, m_p)$       |
| Constraints | Resource budget (DSP, BRAM, LUT, FF); memory bandwidth |
V-B2. Search Algorithm
We use branch-and-bound to find the best PE allocation that minimizes the two-batch latency within the resource constraints. A naive approach is to decide, for every DSP, whether it is included in our design by branching on each DSP, which leads to a huge search space of size $2^{D_{total}}$ for $D_{total}$ DSPs. This search space is redundant in our dual-core design, since we only care about the DSP numbers allocated to the c-core and p-core. Instead, we branch on the c-core DSP ratio $\rho$, as defined in Eq. 10, where $k$ is the number of MACs one DSP macro can perform per clock cycle.

$$\rho = \frac{N_c \cdot m_c / k}{D_{total}} \qquad (10)$$
To compute the lower bound given $\rho$, we greedily allocate the remaining DSPs to the p-core until we run out of logic or DSP resources. We estimate the lower bound of the two-batch latency by Eq. 9, where each layer latency is estimated with its lower bound $\hat{t}_l$ defined in Eq. 11. $M_l$ is the MAC operation amount of layer $l$, $f$ is the clock frequency, and $D_l$ is the DSP number of the core allocated for layer $l$: $\rho \cdot D_{total}$ for the c-core and $(1 - \rho) \cdot D_{total}$ for the p-core.

$$\hat{t}_l = \frac{M_l}{D_l \cdot k \cdot f} \qquad (11)$$

This is a lower bound of the layer latency because it does not account for the potential mismatch between layer characteristic parameters and PE array configuration sizes, which results in higher latency. We try different allocations based on the current $\rho$ and choose the lowest estimate as the lower bound. We then branch to the two middle points of the unvisited subranges split by $\rho$. Starting from an initial $\rho$, we search for the $\rho$ with the minimal lower bound of the two-batch latency. Early termination happens whenever we reach the resource limit or cannot obtain a better lower bound.
For the best $\rho$ found in the branch-and-bound global search, we then locally search $(N_c, m_c, N_p, m_p)$ for the best throughput. To reduce the search space, we select limited available options for $m$. Although our PE array can provide a runtime-configurable number of accumulated outputs, with fixed $N \cdot m$, a small $m$ leads to a huge register cost for accessibility of intermediate results. On the other hand, as the unit input length, a large $m$ can easily result in low runtime PE efficiency. Therefore, we choose moderate values as candidates of $m$, excluding prime numbers since common channel numbers are not multiples of primes. We exhaustively search all valid pairs of $(N, m)$ for the c-core and p-core based on the best $\rho$. The $(N_c, m_c)$ and $(N_p, m_p)$ corresponding to the best throughput is our design choice.
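Two pieces of this search can be sketched: the lower-bound latency of Eq. 11 and the composite-only candidate filter for $m$ (the candidate range below is hypothetical, since the paper's concrete list is not reproduced here):

```python
def layer_latency_lower_bound(macs, dsps, macs_per_dsp, freq_hz):
    """Eq. 11: latency if every allocated DSP were busy every cycle."""
    return macs / (dsps * macs_per_dsp * freq_hz)

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def m_candidates(lo=4, hi=16):
    """Composite values only: common channel counts are not multiples
    of primes, so prime PE input sizes would waste multipliers."""
    return [m for m in range(lo, hi + 1) if not is_prime(m)]
```

Summing `layer_latency_lower_bound` over parallel group pairs as in Eq. 9 gives the bound used to prune branches of $\rho$.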
VI. Experiment

VI-A. Experiment Settings
Software

We leverage the parser of the TVM [2] framework, which handles input models from different CNN development frameworks (i.e., PyTorch, TensorFlow), and then transform the Relay IR to our customized IR, which is used to generate an ISA similar to that in [13] for the dual-OPU.

Workload

Our test cases include MobileNet v1, MobileNet v2 and SqueezeNet v1. We denote SqueezeNet v1 as SqueezeNet for simplicity. These test cases cover typical lightweight operators such as depthwise convolution in MobileNet v1/v2 and expand/squeeze layers with small channels in SqueezeNet. We set the batch size to 2 for throughput evaluation and report average values for each CNN model.
Hardware
We use the notation C(N, m) for a c-core whose PE array has N PEs with m multipliers each. The same notation applies to the p-core, where PEs are further coupled with line buffers. We run the workload on three types of designs, including the single-core design P(128,9), the homogeneous dual-core design P(64,9)+P(64,9) and the heterogeneous design with one c-core and one p-core, C(128,8)+P(64,9), to show the effectiveness of the heterogeneous dual-OPU. For fair comparison, the PE configurations in each experiment below have the same equivalent area. To be more specific, the three example designs have roughly the same area, as do C(128,8) and P(64,9) within the heterogeneous one. P(64,9) has half the multipliers, buffer depth and line buffer channels of P(128,9). C(128,8) has the same buffer depth as P(64,9). Without a line buffer, C(128,8) saves LUT resources for more multipliers than P(64,9). Since the line buffer only costs LUTs while multipliers primarily cost DSPs, it is difficult to compare the total resource cost directly. To quantify the resource cost of the PE arrays in the c-core and p-core, we use the equivalent LUT cost as the equivalent area cost. We count the equivalent LUT cost of each multiplier as the LUT cost to achieve the same functionality as one decomposed 8-bit multiplier implemented by DSP. The p-core PE array in P(64,9) and the c-core PE array C(128,8) have close equivalent LUT costs, as shown in Table III. P(64,9) has a 128-channel line buffer to make full use of the double input feature map buffers for extra pixel parallelism on the height dimension of depthwise convolution. We use the single p-core design P(128,9) as the baseline for comparison.
Table III: Equivalent Area (LUT Cost)

|          | Line Buffer | Multipliers | Adders | Total  |
| P(64,9)  | 39868       | 40896       | 17859  | 98623  |
| C(128,8) | 0           | 72704       | 31749  | 104453 |
Evaluation
We have built a cycle-accurate instruction-level latency simulator with configurable core type (c-core/p-core), PE size $(N, m)$ and memory bandwidth. We run the complete compilation flow to generate ISA instructions for the simulation. For each instruction, we adopt the latency model discussed in Section IV-B, which takes the CAS latency of DRAM access into account for accuracy. As shown in Table IV, the cycle-accurate instruction-level simulator shows within 1% error on cycle count compared with the board-level FPGA implementation. All results in the experiment section are measured with cycle-accurate simulation.
Table IV: Cycle Count

|              | Board-level Performance | Cycle-accurate Simulator |
| MobileNet v1 | 755857                  | 757149 (+0.2%)           |
| MobileNet v2 | 637551                  | 642940 (+0.8%)           |
| SqueezeNet   | 447457                  | 443129 (−0.9%)           |
VI-B. Impact of Scheduling
Table V shows the effectiveness of our scheduling method on different $(N, m)$ combinations. C(128,8)+P(64,9) and C(180,8)+P(32,9) differ in the resource ratio between the two core structures; C(112,9)+P(72,8) further changes $m$. We compare the performance of four scheduling methods. The first three apply only layer-type based allocation, greedy allocation and round-robin allocation for layer group partitioning, after which we measure the average throughput of two interleaved batches. The last one, load-balance-heuristic scheduling, further balances the parallel workload of layer groups on the two cores based on the three aforementioned schedules; we choose the best result as our final schedule. Load-balance-heuristic scheduling improves throughput by 10% on average over the three basic schemes. MobileNet v1 and v2 prefer the layer-type based schedule as the basic scheme, while SqueezeNet gets more room for load balancing starting with the round-robin scheme.
CNN Model    | PE Array Configuration | Throughput (fps)
             |                        | Layer-type | Greedy | Round-robin | Load-balance-heuristic
MobileNet v1 | C(128,8)+P(64,9)       | 267.4 | 267.4 | 269.8 | 304.3
MobileNet v1 | C(180,8)+P(32,9)       | 318.9 | 259.3 | 266.6 | 320.2
MobileNet v1 | C(112,9)+P(72,8)       | 234.7 | 238.5 | 235.0 | 269.9
MobileNet v2 | C(128,8)+P(64,9)       | 378.4 | 378.4 | 338.5 | 427.6
MobileNet v2 | C(180,8)+P(32,9)       | 392.0 | 304.9 | 214.4 | 384.9
MobileNet v2 | C(112,9)+P(72,8)       | 323.7 | 346.6 | 317.0 | 371.1
SqueezeNet   | C(128,8)+P(64,9)       | 413.9 | 413.9 | 391.1 | 529.9
SqueezeNet   | C(180,8)+P(32,9)       | 483.9 | 483.9 | 228.4 | 520.4
SqueezeNet   | C(112,9)+P(72,8)       | 328.3 | 375.2 | 372.5 | 451.3
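The three basic allocation schemes can be sketched as follows. The layer list, cycle costs, and function names are illustrative assumptions, not the paper's scheduler implementation:

```python
# Illustrative sketch of the three basic layer-group allocation schemes for a
# two-core (ccore/pcore) processor. Layer costs here are made-up numbers.
def layer_type_alloc(layers):
    """Regular convolutions to the ccore, depthwise layers to the pcore."""
    ccore = [l for l in layers if l["type"] == "conv"]
    pcore = [l for l in layers if l["type"] == "dwconv"]
    return ccore, pcore

def greedy_alloc(layers):
    """Assign each layer to the core with the smaller accumulated load."""
    ccore, pcore = [], []
    c_load = p_load = 0
    for l in layers:
        if c_load <= p_load:
            ccore.append(l); c_load += l["cycles"]
        else:
            pcore.append(l); p_load += l["cycles"]
    return ccore, pcore

def round_robin_alloc(layers):
    """Alternate consecutive layers between the two cores."""
    return layers[0::2], layers[1::2]

layers = [
    {"name": "conv1", "type": "conv",   "cycles": 900},
    {"name": "dw2",   "type": "dwconv", "cycles": 300},
    {"name": "pw2",   "type": "conv",   "cycles": 700},
    {"name": "dw3",   "type": "dwconv", "cycles": 250},
]
for alloc in (layer_type_alloc, greedy_alloc, round_robin_alloc):
    c, p = alloc(layers)
    # The slower core bounds the latency of a pair of interleaved batches.
    print(alloc.__name__, max(sum(l["cycles"] for l in c),
                              sum(l["cycles"] for l in p)))
```

The load-balance heuristic described above would then take the best of these three partitions as a starting point and move (or split) layers between cores to shrink the gap between the two loads.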
VI-C Impact of PE Array Configuration
Within the resource budget, Table VI shows the best PE array configuration for each single-CNN workload.
CNN Model    | PE Array Configuration | DSP / PE Eff (%) | Throughput (fps)
MobileNet v1 | P(128,9)               | 577 / 59         | 264.6
MobileNet v1 | C(128,12)+P(8,16)      | 832 / -          | 358.4
MobileNet v2 | P(128,9)               | 577 / 41         | 313.4
MobileNet v2 | C(160,8)+P(48,8)       | 832 / -          | 438.4
SqueezeNet   | P(128,9)               | 577 / 62         | 446.9
SqueezeNet   | C(130,8)+P(64,10)      | 840 / -          | 534.7
We use the single pcore design P(128,9) as the baseline. The generated configurations have a similar equivalent area in LUT cost on the PE structure, including line buffer, multipliers, and adders. The results show that our design flow is able to generate PE array configurations with 31% higher throughput and 11% higher runtime PE efficiency on average over the baseline. Among the generated configurations, we find that the best configuration for MobileNet v1 holds the largest ccore-to-pcore size ratio, indicating the highest heterogeneity between parallel layer groups. With the largest workload difference between regular convolution and lightweight convolution, MobileNet v1 thus needs the largest ratio among the three workloads to balance the load. The result indicates that a more heterogeneous workload gains more improvement from our heterogeneous dual-OPU.
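The design-space exploration idea, including the harmonic-mean objective used later for multi-CNN workloads, can be sketched as follows. The DSP budget, workloads, and throughput model are stand-in assumptions, not the paper's analytical model:

```python
# Hedged sketch of PE-array design space exploration: enumerate ccore/pcore
# PE splits under a DSP budget and keep the split with the best harmonic-mean
# throughput over the target CNNs. Workload numbers are hypothetical.
from statistics import harmonic_mean

DSP_BUDGET = 832  # assumed total DSP count, similar to the generated designs

# Hypothetical per-model workload: (regular-conv MACs, depthwise MACs), in millions.
WORKLOADS = {"net_a": (500, 120), "net_b": (350, 180)}

def throughput(c_pes, p_pes, conv_mac, dw_mac):
    """Toy model: each core runs its layer type; the slower core dominates."""
    return 1.0 / max(conv_mac / c_pes, dw_mac / p_pes)

best = None
for c_pes in range(64, DSP_BUDGET, 16):
    p_pes = DSP_BUDGET - c_pes
    fps = [throughput(c_pes, p_pes, *w) for w in WORKLOADS.values()]
    score = harmonic_mean(fps)  # average throughput over the CNN workload
    if best is None or score > best[0]:
        best = (score, c_pes, p_pes)

print(best[1], best[2])  # selected ccore/pcore PE split
```

Note how a workload dominated by regular convolution pulls the chosen split toward a larger ccore, mirroring the ccore-to-pcore ratio trend observed for MobileNet v1 above.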
             | Optimized for individual CNN                          | Optimized for average throughput of multiple CNNs
CNN Model    | C(128,12)+P(8,16) | C(160,8)+P(48,8) | C(130,8)+P(64,10) | C(128,10)+P(32,12)
MobileNet v1 | 358.4 (+9.9%)     | 249.3 (-23.6%)   | 314.6 (-3.6%)     | 326.2
MobileNet v2 | 329.3 (-24.8%)    | 438.4 (+0.2%)    | 428.1 (-2.2%)     | 437.8
SqueezeNet   | 527.9 (+0.2%)     | 436.9 (-17.0%)   | 534.7 (+1.5%)     | 526.6
Average      | 388.5 (-6.1%)     | 349.6 (-15.5%)   | 406.2 (-1.9%)     | 413.9
CNN Model                       | MobileNet v1                     | MobileNet v2                            | SqueezeNet
Design                          | [1]       | [14]     | Ours     | Xilinx DPU [3] | [14]     | Ours       | Xilinx DPU [3] | [14]     | Ours
Device                          | Stratix-V | XC7K325T | XC7K325T | ZCU102         | XC7K325T | XC7K325T   | ZCU102         | XC7K325T | XC7K325T
PE Precision                    | Int16     | Int8     | Int8     | Int8           | Int8     | Int8       | Int8           | Int8     | Int8
Allocated DSP                   | 1278      | 704      | 832      | 2070           | 704      | 832        | 1942           | 704      | 832
Frequency (MHz)                 | 133       | 200      | 200      | 287            | 200      | 200        | 333            | 200      | 200
Throughput (fps)                | 237.1     | 264.6    | 326.2    | 587.2          | 325.7    | 437.8      | 1048           | 420.9    | 526.6
Throughput/DSP (GOPs/DSP slice) | 0.11      | 0.21     | 0.23     | 0.08           | 0.14     | 0.16       | 0.20           | 0.19     | 0.22
Targeting workloads consisting of multiple CNN models, our flow is able to find a PE array configuration with higher average throughput than designs optimized specifically for a single CNN model. We use the harmonic mean of the throughput over the different models as the average throughput. Our design flow finds C(128,10)+P(32,12) as the configuration with the highest average throughput. As shown in Table VII, C(128,10)+P(32,12) achieves a 7.8% improvement in the average throughput of multiple CNNs with only a 3.8% performance loss relative to the individual-best configurations for each single CNN. The results show the effectiveness of our design space exploration approach in finding the configuration that leads to the best average throughput.
VI-D Comparison with State-of-the-Art Processors
We compare our work with Xilinx DPUv3 [3] and other processors from industry and academia, since they can handle both regular convolution and depthwise convolution. Xilinx DPUv3 is implemented on the Xilinx ZCU102 board with three B4096EU cores. In Table VIII, we include an extra 48 DSPs per core in the allocated DSP count for the DPU on MobileNet v2 to account for depthwise convolution, besides the basic cost for regular convolution. DPU performance on MobileNet v1 is not included since it has never been reported by Xilinx. Scaled to the same area, our heterogeneous dual-core processor improves throughput/DSP by up to 85% and 15% compared with the latest works from industry (Xilinx DPU) and academia, respectively.
VII Conclusions and Discussions
In this paper, we propose a heterogeneous dual-OPU to achieve high throughput for lightweight CNNs with high runtime PE efficiency. In the dual-OPU, one core is optimized for channel parallelism and regular convolution, while the other is optimized for pixel parallelism and depthwise convolution. Moreover, the PE number of each core and the input size of each PE can be tuned automatically by our design flow given the target CNNs and FPGA device. Meanwhile, we concurrently run layers for different input images of the same CNN and schedule with layer split to optimize the overall runtime PE efficiency. The experiments show that the heterogeneous dual-OPU improves the throughput and runtime PE efficiency of a homogeneous baseline with the same area by 31% and 11%, respectively, for a single CNN. For a workload of multiple CNNs, our design improves throughput by 11% on average over the latest works from industry and academia when scaled to the same area.
References
[1] (2018) A CNN accelerator on FPGA using depthwise separable convolution. IEEE Transactions on Circuits and Systems II: Express Briefs 65(10), pp. 1415–1419.
[2] (2018) TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594.
[3] (2020) DPU for Convolutional Neural Network. https://www.xilinx.com/support/documentation/ip_documentation/dpu
[4] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[5] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
[6] (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12.
[7] (2016) A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–9.
[8] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
[9] (2017) Maximizing CNN accelerator efficiency through resource partitioning. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 535–547.
[10] (2018) Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification. In International Symposium on Applied Reconfigurable Computing, pp. 16–28.
[11] (2018) TGPA: tile-grained pipeline architecture for low latency CNN inference. In Proceedings of the International Conference on Computer-Aided Design, pp. 1–8.
[12] (2019) A high-performance CNN processor based on FPGA for MobileNets. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 136–143.
[13] (2019) OPU: an FPGA-based overlay processor for convolutional neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28(1), pp. 35–47.
[14] (2020) Light-OPU: an FPGA-based overlay processor for lightweight convolutional neural networks. In The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 122–132.
[15] (2020) DNNExplorer: a framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. In Proceedings of the 39th International Conference on Computer-Aided Design, pp. 1–9.