I Introduction
In the ever more important field of machine learning, two trends are gaining momentum, both in the volume of academic research and in the number of industrial deployments. First, the design of deep neural networks (DNNs) is becoming automated through the general field of automated machine learning (AutoML) and, more specifically, neural architecture search (NAS). Instead of manually designing a DNN, NAS automatically searches through hundreds or thousands of models using reinforcement learning (RL), evolutionary algorithms or other approaches, to home in on the best DNN model. The second trend pertains to the hardware (HW) that runs DNN algorithms. Instead of using commodity HW, there is an increasing number of custom accelerators, either based on FPGAs [7, 2] or ASICs [4, 12]. In this work, we combine NAS for DNN model discovery with HW design space exploration (DSE) to automatically codesign both the DNN model and its HW accelerator. We expose parameters from both the DNN and the HW, allowing a bottom-up design of the two parts that takes into account the accuracy and efficiency of the overall design. This, in turn, allows us to tailor the DNN to the HW and vice versa.

Related Work. NAS has been successful in discovering DNN models that achieve state-of-the-art accuracy on image classification [17]
[5], speech recognition [6] and machine translation [13]. HW-aware NAS adds latency to the reward function so that discovered models optimize both accuracy and inference latency, for example, when running on mobile devices [15]. More recently, reinforcement-learning-based codesign has been introduced to automate both the discovery of a DNN model and its partitioning across multiple FPGAs, based on a theoretical utilization model for each device [11]. Additionally, the authors in [8] propose an automatic codesign methodology based on stochastic coordinate descent that refines a DNN search space under hardware constraints, then codesigns the DNN using the refined operations along with a hardware accelerator.

Contributions. Compared to recent automated codesign literature [11, 8], we use reinforcement learning to automatically codesign the CNN and HW architecture; we investigate different RL-based search strategies to navigate the codesign search space and demonstrate their efficacy under different scenarios; and finally, we use Codesign-NAS to simultaneously improve the accuracy and efficiency of image classification on a popular FPGA platform. Fig. 1 illustrates our system, Codesign-NAS. A controller selects a CNN architecture from a CNN search space and a HW architecture from an accelerator design space. Both are sent to the evaluator, which implements the CNN on the proposed accelerator to find accuracy and efficiency metrics, such as latency, area and power. All metrics are then used to create a multi-objective reward that influences the controller to find better CNN-HW pairs. Our contributions are:

Propose a general formulation for Codesign-NAS to automatically find CNN-HW pairs using reinforcement learning while optimizing for multiple objectives (area, latency, accuracy).

Enumerate ~4 billion model-accelerator pairs to study the Pareto front in a representative codesign search space, and propose three different search strategies to navigate the codesign search space.

Demonstrate the effectiveness of Codesign-NAS by using it for the task of CIFAR-100 image classification. We find CNN-HW pairs that outperform both ResNet [9] and GoogLeNet [14] even when each is paired with its best HW accelerator. (Note that ResNet and GoogLeNet are the best two manually-designed CNNs on our test FPGA platform – we exceed them both on accuracy and HW efficiency.)
II Approach
In this section we outline our Codesign-NAS system with a focus on the CNN-HW search spaces, the multi-objective optimization (MOO) problem, and the accelerator HW model.
II-A Codesign Multi-objective Neural Architecture Search
NAS focuses on searching for the best parametrization of a predefined model architecture by making a number of decisions and evaluating the performance of the model constructed according to the chosen options. We can define the codesign optimization problem addressed by NAS as:
(1)  $s^{*} = \operatorname*{argmax}_{s \in S} e(s), \qquad S = O^{nn}_{1} \times \dots \times O^{nn}_{n} \times O^{hw}_{1} \times \dots \times O^{hw}_{m}$

where $o^{nn}_{i} \in O^{nn}_{i}$ is a DNN option, such as the selection of an operation, and $o^{hw}_{j} \in O^{hw}_{j}$ is a hardware option, such as the size of a buffer. $e$ is the evaluation function that finds the performance of a search point $s = (o^{nn}_{1}, \dots, o^{nn}_{n}, o^{hw}_{1}, \dots, o^{hw}_{m})$; that is, it finds the accuracy and efficiency metrics of running the discovered DNN on the discovered HW.
Since enumerating all points from $S$ is often infeasible in practice, due to their large number and time-consuming evaluation, the main challenge of NAS is to produce as good an approximation of $s^{*}$ as possible while being allowed to evaluate only a limited number of architectures. Therefore, for a sequence of explored architectures (search steps) $s_{1}, s_{2}, \dots, s_{t}$, Eq. 1 takes the following form:
(2)  $s^{*}_{t} = \operatorname*{argmax}_{s \in \{s_{1}, \dots, s_{t}\}} e(s)$
and the main focus is to guarantee that the search is able to explore points which optimize $e(s^{*}_{t})$ as $t$ increases.
In this work, we use a probabilistic, trainable policy $\pi_{\theta}$ to guide the search, as proposed in prior work [17]. At each search step $t$ the policy (implemented as a single LSTM cell followed by a linear layer, as in [17]) is first sampled in order to get a structure sequence $s_{t} \sim \pi_{\theta}$, and later updated using REINFORCE and stochastic gradient descent: $\theta \leftarrow \theta + \lambda \, e(s_{t}) \, \nabla_{\theta} \log \pi_{\theta}(s_{t})$.

We consider the following three quality metrics when assessing a DNN-accelerator pair: accuracy of the DNN, area of the accelerator, and latency of the DNN when run on the accelerator. In order to link the three objectives to the original optimization problem from Eq. 1, we consider a mixture of two standard approaches: we first limit the set of points in $S$ by providing a set of thresholds for all/some of the metrics and filter out points with at least one metric above/below its threshold (the constraint method [3]); we then take the weighted sum of the remaining metrics (the weighted-sum method [3]). This is expressed by the following reward function $R$:
(3)  $R(s) = \sum_{i} w_{i} \cdot \mathrm{norm}(m_{i}(s))$

where $\mathrm{norm}$ is a linear element-wise normalization function which maps values from the range $[m^{\min}_{i}, m^{\max}_{i}]$ to $[0, 1]$, $m(s)$ is a vector of metrics, $T$ is a vector of thresholds and $w$ is a vector of weights. The evaluation function, which is the subject of optimization according to Eq. 1, is defined as:

(4)  $e(s) = \begin{cases} R(s) & \text{if } m(s) \text{ meets all thresholds in } T \\ P(s) & \text{otherwise} \end{cases}$

If a search point does not meet the specified constraints, a punishment function $P$ (with opposite sign to the reward) is used as feedback for the RL controller to deter it from searching for similarly bad points.
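As a concrete sketch of Eqs. 3–4, the reward/punishment computation can be written as below. The metric ranges, thresholds and weights here are hypothetical examples, not the paper's values:

```python
# Sketch of the multi-objective reward of Eqs. 3-4.
# Metric ranges, thresholds and weights are hypothetical examples.

def norm(value, lo, hi):
    """Linearly map value from [lo, hi] to [0, 1], clipping at the ends."""
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

def evaluate(metrics, ranges, thresholds, weights):
    """metrics: dict of raw metric values (accuracy, latency, area).
    thresholds: {name: (kind, limit)} where kind is 'min' or 'max'.
    Returns a positive reward if all constraints hold, otherwise a
    punishment of the same magnitude with opposite sign."""
    scores = {}
    for name, value in metrics.items():
        lo, hi = ranges[name]
        s = norm(value, lo, hi)
        # For metrics we minimize (latency, area), smaller is better.
        scores[name] = s if name == "accuracy" else 1.0 - s
    reward = sum(weights[name] * scores[name] for name in metrics)
    violated = any(
        (kind == "min" and metrics[name] < limit) or
        (kind == "max" and metrics[name] > limit)
        for name, (kind, limit) in thresholds.items()
    )
    return -reward if violated else reward
```

For example, a point that misses a latency threshold returns a negative value of the same magnitude, steering the controller away from that region of the space.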
II-B Codesign Search Space
II-B1 CNN Search Space
NASBench [16] provides a CNN search space for image classification, including evaluation statistics such as accuracy and training time on the CIFAR-10 dataset for ~423 thousand unique CNNs. Fig. 2 shows the structure of the CNNs within NASBench. The only varying part of each model is the innermost design of a single cell, which is then repeated three times to form a stack. At most 7 operations and 9 connections are allowed per cell, and addition/concatenation operations are inserted automatically according to a set of rules – for more information, please refer to NASBench [16]. In Section III, we use the NASBench database of precomputed accuracies to find exactly the Pareto-optimal points within the codesign space (Equation 1); we then compare them to the points found by our search (Equation 2) under different search strategies and optimization targets.
II-B2 Accelerator Design Space
We base our work on CHaiDNN – a library for acceleration of CNNs on System-on-chip FPGAs [10]. The FPGA accelerator supports convolution and pooling operations, while unsupported layers run on the CPU. The accelerator is configurable to maximize hardware efficiency on different CNNs. Fig. 3 shows the accelerator parameters and their valid values. The available parameters are fairly standard ones in custom hardware accelerators, configuring things like buffer depths, external memory interface width, and the amount of parallelism in the filter and pixel dimensions. We add the ratio_conv_engines parameter to CHaiDNN. In its default configuration, CHaiDNN sets ratio_conv_engines = 1, which means that a single general convolution engine runs any type of convolution. When ratio_conv_engines is set to any number below 1, there are two convolution engines – one specialized for 3x3 filters and the other for 1x1 filters – and the ratio determines the number of DSPs assigned to each convolution engine. These parameters form 8640 different combinations of valid CHaiDNN accelerators and are representative of a relatively simple hardware design space. While we use this search space to implement our system and prove its effectiveness, we believe that a much more parameter-rich hardware design space could be equally leveraged by our methodology, and would be expected to yield larger improvements.
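To make the structure of such a design space concrete, the sketch below enumerates accelerator configurations with a Cartesian product. The option values are hypothetical stand-ins (Fig. 3 of the paper lists the real ones); only the parameter names follow Table III, and the set sizes are chosen so that their product matches the 8640 reported combinations:

```python
import itertools

# Hypothetical option values standing in for Fig. 3 of the paper;
# only the parameter names follow Table III.
design_space = {
    "filter_par":           [4, 8, 16, 32],
    "pixel_par":            [8, 16, 32, 64],
    "input_buf_depth":      [2048, 4096, 8192],
    "weight_buf_depth":     [2048, 4096, 8192],
    "output_buf_depth":     [2048, 4096, 8192],
    "mem_interface_width":  [256, 512],
    "pool_en":              [False, True],
    "ratio_conv_engines":   [1.0, 0.5, 0.33, 0.25, 0.2],
}

names = list(design_space)
configs = [dict(zip(names, values))
           for values in itertools.product(*design_space.values())]
print(len(configs))  # 4*4*3*3*3*2*2*5 = 8640 combinations
```

The controller then simply picks one option per parameter, which is what makes the accelerator half of the search space easy to encode as a short decision sequence.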
II-C Accelerator Modelling
To avoid time-consuming compilations/runs on the FPGA, we create area/latency models to use with Codesign-NAS.
II-C1 Area model
Table I: Relative and absolute area of FPGA resources.

Resource        | Relative Area (CLBs) | Tile Area (mm²)
CLB             | 1                    | 0.0044
BRAM (36 Kbit)  | 6                    | 0.026
DSP             | 10                   | 0.044
Total           | 64,922               | 286
We divide the accelerator into its different components – convolution engine, buffers, pooling engine and memory interface – and we create area models based on the utilization of configurable logic blocks (CLBs), digital signal processors (DSPs) and block RAM (BRAM). Each component is broken down further into simpler subcomponents; for example, a sliding window buffer within the convolution engine that is parameterized with filter_par and pixel_par would be modeled with an equation that takes these two variables as input. We verified our area model against 10 full FPGA compilations with different parameters; our model had a 1.6% average error, which we deemed adequate for area estimation. Based on the FPGA resource utilization, we estimate the accelerator size in mm² so that area is quantified by a single number – silicon area¹ – according to Table I.

¹ Silicon area is not available for Zynq UltraScale+ devices, so we use the area of similar devices [1] and account for the process node (20 nm vs. 40 nm) and the different block properties (8 LUTs per CLB instead of 10, and 36 Kbit per BRAM instead of 9 Kbit).
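Given Table I, converting a resource-utilization estimate into a single silicon-area number is a dot product of utilization counts and tile areas; a minimal sketch (the utilization vector below is a hypothetical example, not a real configuration):

```python
# Tile areas in mm^2 per resource, taken from Table I.
TILE_AREA_MM2 = {"CLB": 0.0044, "BRAM": 0.026, "DSP": 0.044}

def silicon_area_mm2(utilization):
    """utilization: dict mapping resource type to estimated count,
    as produced by the per-component area model."""
    return sum(TILE_AREA_MM2[r] * n for r, n in utilization.items())

# Hypothetical utilization of one accelerator configuration:
print(round(silicon_area_mm2({"CLB": 20000, "BRAM": 400, "DSP": 1000}), 1))
```

This is what lets area be folded into the scalar reward of Eq. 3 alongside accuracy and latency.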
II-C2 Latency model
The latency model consists of two parts: 1) a latency lookup table of operations, and 2) a scheduler. In our CNN search space, there are 85 unique variations of convolution, pooling and element-wise operations (different input/filter sizes, etc.). We run each operation on the FPGA accelerator with different parameters and measure its latency, which we then store in a lookup table. The scheduler assigns operations to the parallel compute units greedily and calculates the total latency of the CNN model using the lookup table. To validate the latency model, we pick the CNN model from NASBench with the GoogLeNet cell and run it on 10 different accelerator variants with different parameterizations. Our latency model is 85% accurate on this validation set – there is room for improving our per-layer latency model, but we leave that to future work.
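The lookup-plus-greedy-scheduler idea can be sketched as follows. The latency table entries and the number of parallel compute units are hypothetical (the paper measures its 85 entries on the FPGA), and operation dependencies are ignored for simplicity:

```python
import heapq

# Hypothetical per-operation latencies (ms), standing in for the
# 85 measured lookup-table entries described above.
LATENCY_LUT = {"conv3x3_64": 0.40, "conv1x1_64": 0.12, "maxpool_64": 0.05}

def total_latency(ops, num_units=2):
    """Greedily assign each op to the earliest-free compute unit and
    return the makespan (total CNN latency). Dependencies between
    operations are ignored in this simplified sketch."""
    units = [0.0] * num_units          # finish time of each unit
    heapq.heapify(units)
    for op in ops:
        start = heapq.heappop(units)   # earliest-free unit
        heapq.heappush(units, start + LATENCY_LUT[op])
    return max(units)

print(total_latency(["conv3x3_64", "conv3x3_64", "conv1x1_64", "maxpool_64"]))
```

With two units, the two 3x3 convolutions run in parallel and the shorter operations fill in behind them, which is why the model needs both the table and a scheduler rather than a simple sum of layer latencies.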
III Search Investigation on NASBench-101
In this section, we analyze the codesign search space using the NASBench dataset and our accelerator latency/area model. The CNN accuracy in NASBench is precomputed and stored in a database, and our FPGA accelerator model runs quickly on a desktop computer. This allows us to enumerate the entire search space, consisting of 3.7 billion data points, and find the Pareto-optimal points within that space. Finally, we investigate how to automatically navigate the codesign search space using our RL-based methodology described in Section II. In that context, we evaluate three search strategies in terms of proximity of discovered points to the Pareto-optimal ones and search convergence speed.
III-A Pareto-Optimal Points
To understand the good points in the codesign search space, we look for Pareto-optimal [3] points within the 3.7 billion model-accelerator pairs. This is done iteratively by filtering dominated points from the search space. The remaining (non-dominated) points are better in at least one of our evaluation metrics (area, latency or accuracy) w.r.t. any other point. For our search space, there were only 3096 Pareto-optimal CNN-HW pairs – these are illustrated in Fig. 4.

As Fig. 4 shows, there is a three-way tradeoff between area, latency and accuracy – to improve one metric, one or both of the other two must degrade. The Pareto-optimal points form concentric accuracy-latency tradeoff curves, each at a different accelerator area – different points on the y-axis represent different CNNs, and different points on the x-axis are different accelerator designs. Fig. 4 highlights the diversity and number of good points, and motivates why automated codesign techniques are necessary. First, less than 0.0001% of the points in the search space are actually Pareto-optimal – it is near-impossible to manually pinpoint these model-accelerator pairs. Second, the Pareto-optimal points are very diverse, comprising 338 accelerator variants and 136 different CNN cells – it is very difficult to manually translate accuracy/efficiency requirements into an optimal point. In the next two subsections we present and evaluate NAS search strategies that find codesign points close to these Pareto-optimal solutions.
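The iterative dominance filtering described above can be sketched as a simple O(n²) check over three objectives; the sample points below are illustrative triples (the paper's enumeration applies the same idea at the scale of 3.7 billion pairs):

```python
def dominates(a, b):
    """a, b: (accuracy, latency, area) triples; accuracy is maximized,
    latency and area are minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better

def pareto_front(points):
    """Keep only the non-dominated points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Illustrative (accuracy %, latency, area) triples:
pts = [(72.9, 42.0, 186), (74.2, 41.8, 132), (71.5, 19.3, 132), (70.0, 45.0, 200)]
print(pareto_front(pts))
```

Here the first and last points are dominated (something else is at least as good on every metric and strictly better on one), so only the other two survive – exactly the filtering that leaves 3096 pairs out of 3.7 billion.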
III-B Search Strategies
III-B1 Combined Search
The first search strategy is to consider both sub-search-spaces together, as in Equation 1, and apply REINFORCE directly – we call this combined search. This strategy can update both the CNN and the accelerator in each step, and is therefore able to make faster changes to adapt to the reward function. However, the combined search space is much larger, which may make it more difficult to find the best points. In this approach we run each experiment for 10,000 steps.
III-B2 Phase Search
We explicitly specify specialized phases during the search by freezing one part of the search space (e.g. a specific accelerator) and focusing only on the other (e.g. the CNN design). We then select the best CNN found and switch to the accelerator phase to search for suitable hardware. The two phases are interleaved and repeated multiple times in order to find a globally optimal solution. This requires two different controllers – one which only learns to select the best combination of options for the FPGA design, and another which optimizes the CNN structure. This divide-and-conquer technique may make it easier to find better locally-optimal points (per search space). However, mutual impact between the phases is limited, which may make it harder to adapt the CNN and the accelerator to each other optimally. When running phase search, we set the number of steps for each CNN phase to 1000 and each HW phase to 200, repeating them until we hit a total of 10,000 steps.
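The interleaving above (1000 CNN steps, then 200 HW steps, repeated until 10,000 total) can be written as a simple schedule generator; the phase lengths come from the text, everything else is an illustrative sketch:

```python
def phase_schedule(total_steps=10_000, cnn_steps=1000, hw_steps=200):
    """Yield ('cnn' or 'hw', step) pairs, interleaving a CNN phase of
    cnn_steps with a HW phase of hw_steps until total_steps is reached."""
    step, phases = 0, [("cnn", cnn_steps), ("hw", hw_steps)]
    while step < total_steps:
        for name, length in phases:
            for _ in range(min(length, total_steps - step)):
                yield name, step
                step += 1
            if step >= total_steps:
                break

counts = {"cnn": 0, "hw": 0}
for name, _ in phase_schedule():
    counts[name] += 1
print(counts)
```

With these phase lengths, the CNN controller receives 8400 of the 10,000 steps and the HW controller 1600, reflecting the much larger CNN sub-search-space.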
III-B3 Separate Search (baseline)
We compare our proposed codesign search strategies against a baseline where we first search for a CNN on its own, followed by design-space exploration of the accelerator parameters. This methodology is similar to the conventional, sequential design of the two parts. We run the separate experiments for a total of 10,000 steps, splitting the two phases into 8,333 and 1,667 steps respectively.
III-C Search Results
We evaluate our search strategies using three experiments:

Unconstrained: Zero constraints; we arbitrarily² choose the MOO weights. This scenario may be useful to simply search for many good points to understand the codesign search space.

² The choice of weights is critical in determining the neighbourhood of good points explored, but we do not study that in this work. We refer the reader to prior literature for an in-depth analysis [3] and a relevant case study [15].

1 Constraint: One constraint, on latency. This scenario is similar to when an end-user knows the task and its real-time requirements, but is not sure which FPGA device to choose – the best accuracy at each device size may aid such a decision.

2 Constraints: Two constraints, on area and accuracy, while we optimize latency (single objective). This occurs when there is a maximum FPGA area budget and a minimum tolerated accuracy for an application – a common use-case.
Fig. 5 plots the top 100 Pareto-optimal points that maximize each experiment's reward; a good search algorithm should therefore produce results in the vicinity of these top Pareto-optimal points. We also plot the top result from each of our 3 search strategies, repeating each experiment 10 times, so there are at most 10 points per search strategy in each plot. Fig. 6 shows the reward (averaged over the 10 repetitions) for each experiment. Note that we only plot the reward function and do not display the punishment function. In summary, the results show the following trends.
III-C1 Separate Search
"Separate" search cannot consistently find good points within the constraints, because it searches for the most accurate CNN model without any context of the target HW platform. Fig. 5(b) shows two lucky "separate" points that are superior to the other searches (also reflected by a higher reward in Fig. 6). However, the plots do not show the 8 remaining points, whose latencies are all much higher than the constraint. This is true for all plots in Fig. 5: only a few "separate" points fit within the displayed axes, while the rest generally have high accuracy but very low efficiency. This shows the randomness of CNNs designed without HW context – they may or may not fall within efficiency constraints based on chance, even when the accelerator is heavily tuned for the separately-found CNN – further motivating the need for joint codesign.
III-C2 Combined and Phase Searches
These two search strategies improve upon separate search because they take the HW accelerator into account and, more importantly, consider all variants of the hardware accelerator and of the CNN simultaneously. Fig. 6 shows that "combined" is generally better in the unconstrained experiment, whereas "phase" achieves a higher reward in both constrained experiments. This is also highlighted in Fig. 5(c), which clearly shows that phase search gets closer to the ideal points. However, the same figure shows a shortcoming of phase search: it is more prone to missing the specified constraints, likely because there are only limited opportunities to switch from the CNN phase to the FPGA phase within the 10,000 steps of our experiment – if we increased the number of search steps, we would expect these two experiments to also find points within the constraints. More generally, phase search is slower to converge than combined search, as highlighted in Fig. 6, where phase search goes through a few exploration phases before finding its best result. In summary, both techniques have their merits: combined works better when the search is unconstrained and generally converges faster, while phase finds better points when there are constraints but typically requires more search steps to do so.
IV CIFAR-100 CNN-Accelerator Codesign
In this section we use Codesign-NAS to discover a CNN model-accelerator pair that optimizes the task of CIFAR-100 image classification. We show that Codesign-NAS can exceed both the efficiency and accuracy of well-known CNN architectures, even when those are paired with their optimal accelerators.
IV-A Experimental Setup
Unlike our use of NASBench in previous sections, we have no precomputed results for CIFAR-100 image classification, so we must train all discovered CNNs from scratch. However, we still use the same codesign search space as defined in Section II, so we can reuse our FPGA latency and area models, which were verified on our CNN search space. We use the same training parameters as in previous work [16]: 108 epochs of training, standard data augmentation (padding, random crop and flipping), an initial learning rate of 0.1 with cosine decay, and weight decay. Training every new CNN takes approximately 1 GPU-hour, so to be able to train many models, we parallelize Codesign-NAS over 6 machines, each with 8 Nvidia GTX 1080 GPUs, allowing us to train 48 models in parallel.

We run Codesign-NAS with two constraints combined into one. Specifically, we combine latency and area into performance-per-area (perf/area) and constrain perf/area to a threshold value. We then attempt to maximize accuracy under that constraint. For our RL controller, we gradually increase the threshold – we run the search for ~2300 valid points in total, starting with 300 points at the first threshold value and increasing to 1000 points at the last threshold value. We found that this gradual increase in threshold makes it easier for the RL controller to learn the structure of high-accuracy CNNs. We use the "combined" search strategy described in Section III-B1, as it has shown to be faster to converge.
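This gradually tightening constraint loop can be sketched as below. The threshold levels, per-level budgets beyond the first and last, and the search_step stand-in are all hypothetical; the text only fixes the first (300) and last (1000) budgets and the ~2300 total of valid points:

```python
import random
random.seed(0)

def search_step(threshold):
    """Placeholder for one controller sample + evaluation: returns a
    hypothetical accuracy and whether perf/area met the threshold."""
    acc = random.uniform(60.0, 75.0)
    return acc, random.random() < 0.8

def gradual_search(schedule):
    """Run each perf/area threshold level until its budget of *valid*
    points is met, keeping the best accuracy seen."""
    best = 0.0
    for threshold, budget in schedule:
        valid = 0
        while valid < budget:
            acc, meets = search_step(threshold)
            if meets:
                valid += 1
                best = max(best, acc)
    return best

# Hypothetical threshold levels; budgets sum to the ~2300 valid points.
schedule = [(10.0, 300), (14.0, 1000), (18.0, 1000)]
best = gradual_search(schedule)
print(round(best, 1))
```

The point of the schedule is that early, loose thresholds let the controller first learn what accurate cells look like before the efficiency constraint starts to bite.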
IV-B Results
Table II: Codesigned pairs (Cod1, Cod2) compared to the ResNet and GoogLeNet cells, each paired with its best accelerator.

CNN            | Accuracy (%)  | Perf/Area    | Latency      | Area
ResNet Cell    | 72.9          | 12.8         | 42.0         | 186
Cod1           | 74.2 (+1.3%)  | 18.1 (+41%)  | 41.8 (-0.5%) | 132 (-29%)
GoogLeNet Cell | 71.5          | 39.3         | 19.3         | 132
Cod2           | 72.0 (+0.5%)  | 40.6 (+3.3%) | 18.5 (-4.2%) | 133 (+0.8%)
Fig. 7 shows the top-1 accuracy and perf/area of various points searched by Codesign-NAS. We plot the top 10 points among the model-accelerator pairs visited at each threshold value. The plot also shows the ResNet and GoogLeNet cells within our CNN skeleton³ (Fig. 2), each paired with its best accelerator in terms of perf/area. This is a difficult baseline to beat, as we are comparing against two well-known high-accuracy CNN cells implemented on their best possible corresponding accelerators in our FPGA search space. However, as the plot shows, we find many points that exceed the accuracy and efficiency of both the ResNet and GoogLeNet baselines. We highlight the best two among those points, labeled "Cod1" and "Cod2" in Fig. 7. Cod1 improves upon ResNet by 1.3% in accuracy while simultaneously improving perf/area by 41% – considerable gains on both accuracy and efficiency. Cod2 shows more modest improvements over GoogLeNet, as shown in Table II, but still beats it on both efficiency and accuracy while running 4.2% faster in terms of absolute latency.

³ We believe that the fairest comparison to discovered cells from Codesign-NAS is against the GoogLeNet and ResNet cells within our same skeleton. Alternatively, we could also use our discovered cells within the ResNet-50 [9] or GoogLeNet v1 [14] skeletons, and we anticipate very similar findings.
Table III: Accelerator parameters of the Cod1 and Cod2 designs.

HW Parameter          | Cod1         | Cod2
filter_par, pixel_par | (16, 64)     | (16, 64)
buffer_depths         | (4K, 2K, 4K) | (8K, 2K, 2K)
mem_interface_width   | 256          | 512
pool_en               | false        | false
ratio_conv_engines    | 0.33         | 0.25
Fig. 8 shows the CNN cell structures of Cod1 and Cod2, and Table III lists their HW parameters. It is difficult to reason about automatically designed CNNs [17], but we highlight some observations about our codesigned model-accelerator pairs. For example, the Cod1 CNN beats ResNet accuracy while using an important ResNet feature: skip connections and element-wise addition, as shown by the rightmost branch of the cell in Fig. 8(a). On the hardware side, both Cod1 and Cod2 use the largest convolution engine and avoid a dedicated pooling engine. However, the other HW parameters are tailored to each CNN. For example, both the input buffer size and the memory interface width are smaller for Cod1, likely because the Cod1 CNN uses a larger number of smaller convolutions compared to Cod2.
Naturally, we anticipate that there may be even better points within our search space of ~3.7 billion points. We explored only ~1000 points before finding Cod1 and ~2000 points before finding Cod2 (compared to 8000 points in prior work to discover a CNN alone [15]). This highlights the convergence speed of our RL controller and its effectiveness in finding good designs, especially when properly tuned with representative reward functions and search strategies, as we have shown in this paper.
V Conclusion
We proposed the automatic codesign of a CNN and its HW accelerator, and provided a full methodology and case study using FPGAs to support our proposal. We presented three search strategies based on reinforcement learning and compared them against each other and against Pareto-optimal designs. Finally, we applied Codesign-NAS to the task of CIFAR-100 image classification on a popular FPGA platform and showed that we can improve upon ResNet (paired with its ideal accelerator) by 1.3% in accuracy and 41% in efficiency. These are large improvements, especially considering that our FPGA search space contains a limited set of configurable HW parameters. We believe that our findings provide a compelling case for the automated codesign of HW and DNNs. In the future, we hope to study richer HW search spaces that give Codesign-NAS more freedom to tailor a hardware platform to a codesigned DNN.
References
[1] (2012) Design Tradeoffs for Hard and Soft FPGA-based Networks-on-Chip. In International Conference on Field-Programmable Technology (FPT), pp. 95-103.
[2] (2018) DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration. In International Conference on Field Programmable Logic and Applications (FPL).
[3] (2008) Multiobjective Optimization: Interactive and Evolutionary Approaches. Springer, Heidelberg, Germany.
[4] (2019) Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (2).
[5] (2019) Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search. arXiv:1901.07261.
[6] (2019) ShrinkML: End-to-End ASR Model Compression Using Reinforcement Learning. In Proc. Interspeech 2019, pp. 2235-2239.
[7] (2018) A Configurable Cloud-scale DNN Processor for Real-time AI. In International Symposium on Computer Architecture (ISCA).
[8] (2019) FPGA/DNN Co-design: An Efficient Design Methodology for IoT Intelligence on the Edge. In Design Automation Conference (DAC).
[9] (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385.
[10] (2019) CHaiDNN-v2: HLS-based DNN Accelerator Library for Xilinx UltraScale+ MPSoCs. GitHub. https://github.com/Xilinx/CHaiDNN
[11] (2019) Hardware/Software Co-Exploration of Neural Architectures. arXiv:1907.04650.
[12] (2017) In-Datacenter Performance Analysis of a Tensor Processing Unit. In International Symposium on Computer Architecture (ISCA).
[13] (2019) The Evolved Transformer. arXiv:1901.11117.
[14] (2014) Going Deeper with Convolutions. arXiv:1409.4842.
[15] (2018) MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv:1807.11626.
[16] (2019) NAS-Bench-101: Towards Reproducible Neural Architecture Search. arXiv:1902.09635.
[17] (2016) Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578.