2L-3W: 2-Level 3-Way Hardware-Software Co-Verification for the Mapping of Deep Learning Architecture (DLA) onto FPGA Boards

11/14/2019 ∙ by Tolulope A. Odetola, et al. ∙ 20

FPGAs have become a popular choice for deploying deep learning architectures (DLA). Many researchers have explored the deployment and mapping of DLA on FPGA. However, there is a growing need for design-time hardware-software co-verification of these deployments. To the best of our knowledge, this is the first work that proposes a 2-Level 3-Way (2L-3W) hardware-software co-verification methodology and provides a step-by-step guide for the successful mapping, deployment and verification of DLA on FPGA boards. The 2-Level verification ensures that the implementation in each stage (software and hardware) follows the desired behavior. The 3-Way co-verification provides a cross-paradigm (software, design and hardware) layer-by-layer parameter check to assure the correct implementation and mapping of the DLA onto FPGA boards. The proposed 2L-3W co-verification methodology has been evaluated over several test cases. In each case, three sets of results are compared to obtain a layer-by-layer similarity score: the prediction and layer-by-layer output of the DLA deployed on a PYNQ FPGA board (hardware), the intermediate layer-by-layer output of the DLA implemented in Vivado HLS (design), and the prediction and layer-by-layer output at the software level (Caffe deep learning framework). The comparison is performed by a fully automated Python script. The resulting layer-by-layer similarity score indicates the degree of success of the DLA mapping to the FPGA or, in the case of an unsuccessful mapping, helps identify at design time the layer to be debugged. We demonstrated our technique on the LeNet DLA and a Caffe-inspired Cifar-10 DLA, and the co-verification results yielded layer-by-layer similarity scores of 99% accuracy.


1 Introduction

The convolutional neural network (CNN), a well-known Deep Learning Architecture (DLA) that evolved from the artificial neural network, has been applied extensively in areas such as video surveillance, mobile robot vision and image search engines in data centers [zhang2015optimizing]. In general, deep learning uses a multi-layer neural network model to extract high-level features, which are combinations of low-level abstractions, in order to classify mutually exclusive properties of image data [tolu2]. This helps in finding distributed data features in order to solve complex problems in machine learning [wang2017dlau].

Due to the specific computation pattern of CNN, cloud computing has been employed to perform classification with deep learning models, but this raises concerns about privacy [baza2019b, baza2018blockchain, parksmarnet, baza2019blockchain, parkccnc, pazos2019privacy, baza2019detecting, Lightride, Andrew, shafee2019mimic, baza2015efficient, blockchainKey, firmware2], security [tolu1] and latency. General-purpose processors are also inefficient for CNN implementation and can hardly meet the performance requirements [bacis2017pipelined]. Thus, various accelerators based on FPGA (Field Programmable Gate Array), GPU (Graphics Processing Unit), and even ASIC (Application Specific Integrated Circuit) designs have been proposed to improve the performance of CNN designs [hailesellasie2019mulnet]. Among these approaches, FPGA-based accelerators have attracted more attention because they offer good performance, high energy efficiency, fast prototyping, and reconfigurability [zhang2015optimizing].

To take advantage of what FPGAs have to offer, several approaches such as [guo2017angel], [park2016fpga] and [rastegari2016xnor] have been proposed to enable efficient optimizations for the deployment and successful mapping of DLAs onto FPGA boards. These optimizations help to reduce latency and to conserve area and memory on the hardware (FPGA) [zhang2017machine]. The majority of these mappings and optimizations are validated only at the point of final prediction, using accuracy as the measure. Hence, a layer-by-layer design-time verification mechanism for mapping a DLA from the software paradigm to the hardware paradigm has not been addressed. Verification is crucial in hardware design, as it accounts for about 80% of modern hardware design time [wang2009electronic].

Although many researchers understand the crucial nature of verification in the design and mapping of DLA, the mapping of DLA to FPGA boards has unique phases, from the software to the design and the eventual mapping onto FPGA boards (hardware). Several approaches have been adopted to verify the workings and correctness of the DLA. Xiang et al. [xiang2018output] propose a software simulation based approach to verify the correctness of multilayer neural networks by measuring the maximum sensitivity of the layers of the network. Similarly, Dwarakanath et al. [dwarakanath2018identifying] propose a software-based approach to verify the correctness of image classifiers by building relationships between subsequent layer-by-layer outputs corresponding to different inputs. These verification approaches are limited to the software-level layer-by-layer output and the accuracy of the final prediction. They do not provide a means of verifying the implementation correctness of the mapping of DLAs onto hardware.

In short, such verification covers only the layer-by-layer output and the accuracy of the final prediction in software, and it does not extend to the mapping of DLAs onto FPGA boards.

Other approaches that focus on the mapping of DLA onto FPGA boards involve hardware-software co-design. Guo et al. [guo2017angel] propose a design flow for mapping CNNs onto embedded FPGAs that uses data quantization to reduce the bit-width of CNN models without compromising much on accuracy. Similarly, Jiandong et al. [mu2018collaborative] propose a collaborative framework to optimize OpenCL-based CNN designs. These co-design techniques can validate the correctness of the implementation only through the accuracy of the final prediction.

Very recently, Cong et al. [hao2019fpga] proposed a time-saving co-design methodology that simultaneously searches the possible design options to auto-generate efficient DNNs optimized for FPGA deployment. This design approach is fully automatic from the software to the hardware deployment. However, it validates the design only through the prediction accuracy of the model.

One shortcoming of the existing literature on mapping DLAs to FPGAs is the lack of a complete hardware-software co-verification scheme that checks the hardware implementation against its software-based counterpart. Some of the above-mentioned approaches, [guo2017angel], [mu2018collaborative] and [hao2019fpga], can validate or debug the deployment only at the stage of final prediction, while others, [xiang2018output] and [dwarakanath2018identifying], verify the DLA only in a software environment. During the mapping of a DLA to an FPGA, if the prediction in the design is wrong or does not correspond to the software implementation, the traditional verification approaches cannot analyze the layer-by-layer feature values of the DLA at design time. In this paper, we work from the premise that, for a sustainable DLA design environment, co-verification at the three stages of design (software stage, hardware design stage and hardware deployment stage) is crucial. Hence, there is a need for a complete hardware-software co-verification methodology that shows the step-by-step and end-to-end process of deploying and verifying the inference phase (the forward propagation path) of DLAs on FPGA boards. To the best of the authors' knowledge, no such methodology exists so far in the literature.

In this paper, we propose a 2-Level 3-Way (2L-3W) coherent hardware-software co-verification approach. Our 2-level verification approach is divided into the software inference level and the hardware inference level of the DLA. Our 3-way co-verification technique provides a means of assuring that the software design, hardware design and hardware mapping of the DLA are coherent and correctly implemented.

The following are the contributions of this work:

  • A step-by-step and end-to-end methodology for the mapping of DLAs onto FPGA boards

  • A 2-level verification approach to ensure the implementation correctness of a designed DLA in both software and hardware

  • A 3-way layer-by-layer co-verification technique that ensures successful mapping of DLA to FPGA boards

The remainder of this paper is organized as follows: Section II provides some preliminary information. Section III discusses the proposed methodology of hardware-software co-verification of DLA. Section IV describes the experimental validation of the co-verification methodology setup on Xilinx PYNQ FPGA board. Section V shows the results and lessons learned. Section VI discusses the related work. Section VII compares the co-verification methodology with different approaches. Section VIII concludes the paper.

2 Preliminaries

This section introduces the concepts used in the rest of the paper.

2.1 High Level Synthesis (HLS)

Hardware accelerators like FPGAs provide a means to achieve moderate-level performance with low power consumption, massive memory parallelism and short time to market [park2017optimizing]. To ensure proper deployment of DLA on FPGA, hardware-software co-verification is essential. Hardware-software co-verification helps to ensure that the behavior of the embedded system software is consistent with the hardware design.

Fig. 1: General Deep Learning Approach.

Hardware design using Hardware Description Languages (HDL) can be time consuming and difficult to debug and verify [o2014xilinx]. High Level Synthesis (HLS) offers flexibility by using C/C++ code with a set of directives to automatically generate HDL for hardware implementation on FPGA. In other words, HLS converts code written in a high-level language (C/C++) to an HDL such as VHDL or Verilog.

2.2 Deep Learning Framework: Caffe

In this paper, the Caffe deep learning framework is adopted because of its popularity, support and easy-to-use interface. It is easy to experiment with popular pre-trained models [lacey2016deep]. Caffe provides toolkits for training, fine-tuning and deploying DLAs [jia2014caffe]. In Caffe, the DLA is designed and configured using prototxt files prior to training. After training, Caffe generates a caffemodel file containing the trained parameters (weights and biases) of the DLA. The parameters in the caffemodel file can be accessed using Python libraries.

2.3 Network Surgery

DLAs tend to have stacked layers. Each layer contains learnable parameters (weights and biases) [guo2016dynamic]. For proper replication and deployment of the DLA on hardware, these learnable parameters must be accessed and extracted. During the inference phase, network surgery gives access to the output of each layer when unseen data is passed through the DLA. The output of a layer is called a Blob. Network surgery allows access to and extraction of the DLA parameters and Blobs.
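To make the idea concrete, the following minimal sketch mimics network surgery on a toy two-layer network written in plain Python. It is not the Caffe API: the layer names, the parameter values and the `forward_with_blobs` helper are hypothetical and serve only to show that every intermediate layer output (Blob) is retained next to the final output.

```python
# Toy illustration of network surgery: keep every layer's output (Blob)
# during a forward pass. Plain Python stand-in, not the Caffe API.

def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

# Hypothetical learnable parameters (weights and biases) of a tiny network.
PARAMS = {
    "fc1": {"weights": [[1.0, -1.0], [0.5, 0.5]], "biases": [0.0, -1.0]},
    "fc2": {"weights": [[1.0, 2.0]], "biases": [0.5]},
}

def forward_with_blobs(data):
    """Run the forward path and return a dict of layer name -> Blob."""
    blobs = {"data": list(data)}
    blobs["fc1"] = dense(blobs["data"], **PARAMS["fc1"])
    blobs["relu1"] = relu(blobs["fc1"])
    blobs["fc2"] = dense(blobs["relu1"], **PARAMS["fc2"])
    return blobs

blobs = forward_with_blobs([2.0, 1.0])
for name, blob in blobs.items():
    print(name, blob)
```

Keeping the Blobs in a dictionary keyed by layer name makes it straightforward to write them to a file and to compare them layer by layer later on.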

2.4 Chosen FPGA Board: PYNQ

The hardware environment chosen is the PYNQ-Z1 FPGA [bbb2017]. PYNQ-Z1 is built upon Xilinx ZYNQ SoC technology and is used to develop applications for ZYNQ-7000 based devices [janssen2017dynamic]. The PYNQ platform lets designers exploit the programmable logic of the FPGA board from a Python environment [janssen2017dynamic]. Xilinx provides Python packages that facilitate the interaction with hardware modules using overlays.

2.5 Python Overlay

Overlays, or hardware libraries, are configurable FPGA designs capable of extending a user application from the ZYNQ processor of a PYNQ board into the Programmable Logic. Overlays can be loaded onto the FPGA dynamically, like a software library. PYNQ overlays are created by hardware designers and wrapped with PYNQ's Python Overlay API. This allows a Python interface to program and control specialized hardware overlays [pynq2019].

2.6 Data Transfer: AXI Direct Memory Access (AXI DMA)

AXI DMA transfers data between memory and AXI4-Stream-type target peripherals [jeff2014]. AXI DMA in Vivado provides high-bandwidth direct memory access between AXI4 memory-mapped and AXI4-Stream ports on IP (Intellectual Property) interfaces [Xil2019]. PYNQ supports the AXI central DMA IP with the PYNQ DMA class [pynq2019]. DMA can be used for high-performance burst transfers between the Processing System (PS) DRAM and the Programmable Logic (PL). It helps to offload data movement from the Central Processing Unit (CPU) in processor-based systems [Xil2019]. AXI DMA moves data between system memory and the stream target through the AXI4 Read Master to the AXI4 memory-mapped to stream (MM2S) Master, and through the AXI stream to memory-mapped (S2MM) Slave to the AXI4 Write Master.

Fig. 2: Verification Approach.

2.7 General Deep Learning Approach

Fig. 1 shows the general end-to-end approach from the training of a DLA to its deployment on an FPGA board. This includes the following steps:

  • Network Training: This stage takes place after the DLA has been designed. Training determines the set of parameters that maximizes the DLA's accuracy by leveraging gradient descent (back propagation). It involves a number of forward and backward propagations based on the number of iterations specified in the model design. Network training is done with CPUs or GPUs on software frameworks such as Caffe, TensorFlow and so on.

  • Network Testing: This stage is also referred to as the inference stage. The trained model is used to classify unseen data and predict a result with a degree of accuracy.

  • C++ Layer-by-Layer Abstraction: Model design, training and testing are usually done in a Python environment. For hardware design, the model design is converted from the prototxt syntax adopted in Caffe (which is used through Python libraries during model training) to the C++ syntax used in hardware design. In this stage, every layer is designed in C++ as stipulated in the model design in the prototxt. All conditions in terms of layer outputs, kernel sizes, stride sizes and so on for each respective layer are obeyed during this conversion.

  • Vivado HLS (Hardware Design): Vivado HLS provides an environment for the simulation and synthesis of the C++ code of the model design. After successful synthesis, Vivado HLS allows for IP generation of the model design.

  • Deployment over FPGA (PYNQ Board): In this stage, the IP generated from Vivado HLS is converted to bitstreams and deployed on the FPGA board.

The verification part shown on the right-hand side of Fig. 1 lies outside the general deep learning deployment methodology.

3 Hardware-Software Co-Verification of DLA Inference Phase

In this paper, we propose a novel 2L-3W hardware-software co-verification concept for DLA deployment on FPGA boards. To achieve this, the Caffe software framework is used for the software implementation (training and testing) and Vivado HLS for the hardware design synthesis. Finally, our approach uses the Xilinx PYNQ FPGA board for the hardware implementation.

Before explaining our proposed co-verification approach, it is worth noting that we obtain the trained model a priori using the Caffe deep learning framework. In Caffe, this trained model is stored in a Caffemodel file. Furthermore, we design the feed-forward path of the trained DLA using Vivado HLS.

Fig. 2 shows the different levels and sections of the co-verification methodology. These sections are discussed as follows:

3.1 Level 1: Inference Phase Software Verification

This is the first level of the proposed 2L-3W co-verification methodology. In this phase, the trained DLA is collected. As shown in Fig. 2, the image dataset used in training the DLA (restricted to images correctly predicted by the trained DLA) is passed through the trained model. Network surgery is used to obtain the Blobs (layer-by-layer output features) of each layer of the DLA. These Blobs are used to investigate the numerical distribution of each respective layer Blob. Statistical properties such as the range, minimum, maximum and standard deviation of each Blob are also collected over a given number of training images, generalized, and written to the Software Properties Verification File (SPVF), as shown in Fig. 2. The SPVF contains the boundary values of each element in the Blob of each layer. This forms a benchmark for comparing the numerical distribution and statistical properties of the Blobs of subsequent images (test images) that are passed through the model. This serves as the Inference Phase Software Verification shown in Fig. 2.
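As a rough illustration of what the SPVF contents for a single layer might look like, the sketch below averages per-image statistics and records per-element boundary values. The `build_spvf` helper, the dictionary layout and the sample values are assumptions made for this example; the text does not specify the actual file format.

```python
import statistics

def build_spvf(blobs_per_image):
    """Summarize one layer's Blobs over many images.

    `blobs_per_image` is a list of equally sized flat lists, one per
    correctly predicted training image (hypothetical input layout).
    """
    # Per-image statistics, later averaged to obtain generalized values.
    per_image = [
        {
            "min": min(blob),
            "max": max(blob),
            "range": max(blob) - min(blob),
            "mean": statistics.mean(blob),
            "std": statistics.pstdev(blob),
        }
        for blob in blobs_per_image
    ]
    generalized = {
        key: statistics.mean([stats[key] for stats in per_image])
        for key in per_image[0]
    }
    # Boundary values: min and max of each element across all images.
    columns = list(zip(*blobs_per_image))
    bounds = [(min(col), max(col)) for col in columns]
    return {"stats": generalized, "bounds": bounds}

# Two made-up Blobs of the same layer, one per image.
spvf = build_spvf([[0.0, 2.0, 4.0], [1.0, 1.0, 1.0]])
print(spvf["stats"])
print(spvf["bounds"])
```

The per-element bounds are the part used later for element-wise checks, while the generalized statistics describe the overall distribution of the layer.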

3.2 Level 2: Inference Phase DLA Mapping From Software to Hardware

This is the second level of the proposed 2L-3W co-verification methodology. This level is divided into 6 sections. After the software verification is done, the DLA is mapped to FPGA and test images are used to verify the implementation correctness of mapping the DLA from software to hardware (FPGA).

3.2.1 Section A: Parameter Extraction Using Network Surgery

This stage of the co-verification methodology is shown as Section A in Fig. 2. Here, the parameters of the trained model (i.e. the Caffemodel file) are obtained using network surgery in Caffe. Unseen data (data not used in the training phase), shown as input data in Fig. 2, is passed through the trained model. This stage is carried out in the software environment. The prediction and the layer-by-layer outputs (Blobs) are extracted and written to a specified file called File_SW, as shown in Fig. 2. The numerical distribution and statistical properties of the Blobs written to File_SW are then compared and validated against the properties written to the SPVF generated in Level 1. Lines 1 to 7 of Algorithm 1 show what actions need to be taken if the DLA requirement is not met.
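The comparison of File_SW against the SPVF can be pictured as an element-wise range check. The following sketch assumes that both files have already been loaded into dictionaries keyed by layer name; the helper names and the sample values are hypothetical.

```python
def check_against_bounds(blob, bounds):
    """Return indices of Blob elements outside their (low, high) bounds."""
    return [i for i, (value, (low, high)) in enumerate(zip(blob, bounds))
            if not low <= value <= high]

def verify_file_sw(file_sw, spvf_bounds):
    """Map each layer name to the list of out-of-bounds element indices."""
    return {layer: check_against_bounds(blob, spvf_bounds[layer])
            for layer, blob in file_sw.items()}

# Made-up boundary values (from Level 1) and Blobs of one unseen image.
spvf_bounds = {"conv1": [(0.0, 1.0), (-1.0, 1.0)], "fc1": [(0.0, 5.0)]}
file_sw = {"conv1": [0.5, 1.5], "fc1": [2.0]}

violations = verify_file_sw(file_sw, spvf_bounds)
print(violations)  # {'conv1': [1], 'fc1': []}
```

An empty list for a layer means that every element of its Blob lies within the recorded boundary values.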

3.2.2 Section B: Parsing of Weights from Caffe to Vivado HLS

The parameters of the Caffemodel are obtained in Python and passed through a data cleaning process that converts the layer-by-layer parameters into the C++ syntax required for Vivado HLS. These converted parameters are then incorporated into the Vivado HLS synthesis of the hardware design. This is further explained in the simulation of HLS design section (Section D). Lines 8 to 11 of Algorithm 1 summarize this section.
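One plausible form of this conversion is to render nested Python lists as C++ array definitions that can be pasted into a header file. The sketch below is an assumption about how such a step could look; the function names, the array name `conv1_weights` and the `float` element type are illustrative only.

```python
def to_cpp_initializer(values):
    """Recursively render nested lists of numbers as a C++ brace initializer."""
    if isinstance(values, (list, tuple)):
        return "{" + ", ".join(to_cpp_initializer(v) for v in values) + "}"
    return repr(float(values)) + "f"  # e.g. 0.25 -> "0.25f"

def shape_of(values):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(values, (list, tuple)):
        dims.append(len(values))
        values = values[0]
    return dims

def to_cpp_array(name, values, ctype="float"):
    """Build a C++ constant array definition for the given parameters."""
    dims = "".join(f"[{d}]" for d in shape_of(values))
    return f"const {ctype} {name}{dims} = {to_cpp_initializer(values)};"

line = to_cpp_array("conv1_weights", [[0.25, -1.5], [2.0, 0.0]])
print(line)
# const float conv1_weights[2][2] = {{0.25f, -1.5f}, {2.0f, 0.0f}};
```

Generating the definitions from the extracted parameters, rather than copying them by hand, keeps the hardware design in step with the trained model.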

3.2.3 Section C: Streaming of Input Data to the Hardware Design

This section belongs to the hardware design stage. Here, the input data (unseen data) used in Section A is read using the OpenCV C++ library and converted to a stream of data using the HLS stream library. The stream of data is passed to the designed IP in the HLS design (Section D). Line 12 of Algorithm 1 summarizes this section.

3.2.4 Section D: Simulation of HLS Design to Provide Layer-by-layer Output

This section is shown in Fig. 2 as Section D. It assumes that the C++ adaptation of each layer of the DLA has been completed to form the IP in Vivado HLS. The weights obtained by parsing the Caffemodel file (Section B) are imported and merged appropriately with the IP. Unseen image data is read using OpenCV (as shown in Section C) and converted to a stream of data. This stream of data is passed as an input to the designed IP. After simulation of the IP, the layer-by-layer output and the prediction are written to a specified file (denoted as File_Design), as shown in Fig. 2. Lines 13 to 16 of Algorithm 1 show what actions need to be taken if the verification does not meet the requirement.

3.2.5 Section E: Hardware Deployment and On-board Verification

The generated IP is synthesized to obtain a bitstream and .tcl files in the Xilinx Vivado environment, as in the conventional design flow. These files are imported to the PYNQ board as an Overlay to be called in the Python environment. Special provisions are made to ensure that the output of each stage of the DLA is compared against the output of the simulation of the HLS design (explained in Section D) and against the output of the software deep learning framework (the Caffe layer-by-layer output explained in Section A). Fig. 4 illustrates the comparison. It shows that each layer of the DLA is synthesized as a separate module. For example, the output of conv1_dma (shown on the right-hand side of Fig. 4) corresponds to the output of the conv1 layer in the software (shown on the left-hand side of Fig. 4). In order to store the values and automate the process, we store the output of each stage obtained in the Python environment in a separate file (denoted as File_HW), as shown in Fig. 2. Lines 18 to 24 of Algorithm 1 summarize this section.

3.2.6 Section F: Co-verification

To automate our methodology, the outputs of all three stages need to be compared seamlessly. To achieve this, a Python script is written that verifies the software-verified layer-by-layer outputs of each stage, which are stored in File_SW, File_Design and File_HW, for our three-way verification approach. Lines 25 to 28 of Algorithm 1 summarize this section and suggest possible actions if the verification does not meet the requirement.
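A comparison script of this kind could be organized as sketched below. The similarity metric is an assumption: here it is the fraction of elements that agree within an absolute tolerance, which is only one plausible choice and is not stated in the text. The helper names, the tolerance, the pass threshold and the sample values are likewise hypothetical.

```python
def similarity_score(reference, candidate, tolerance=1e-3):
    """Fraction of elements that agree within `tolerance` (assumed metric)."""
    if len(reference) != len(candidate):
        raise ValueError("layer outputs differ in length")
    matches = sum(abs(r - c) <= tolerance for r, c in zip(reference, candidate))
    return matches / len(reference)

def co_verify(file_sw, file_design, file_hw, threshold=0.99):
    """Compare the software Blobs with the design and hardware outputs."""
    report = {}
    for layer, sw_blob in file_sw.items():
        sw_vs_design = similarity_score(sw_blob, file_design[layer])
        sw_vs_hw = similarity_score(sw_blob, file_hw[layer])
        report[layer] = {
            "sw_vs_design": sw_vs_design,
            "sw_vs_hw": sw_vs_hw,
            "parameters_compared": len(sw_blob),
            "passed": min(sw_vs_design, sw_vs_hw) >= threshold,
        }
    return report

# Made-up layer outputs standing in for File_SW, File_Design and File_HW.
file_sw = {"conv1": [0.1, 0.2, 0.3, 0.4], "fc1": [1.0, 2.0]}
file_design = {"conv1": [0.1, 0.2, 0.3, 0.4], "fc1": [1.0, 2.5]}
file_hw = {"conv1": [0.1, 0.2, 0.3, 0.4], "fc1": [1.0, 2.5]}

report = co_verify(file_sw, file_design, file_hw)
for layer, result in report.items():
    print(layer, result)
```

In this made-up run, `fc1` falls below the threshold, so it would be the first layer to inspect in the C++ design.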

Fig. 3: LeNet DLA and hardware configuration for the output of each layer on FPGA
Fig. 4: Cifar-10 DLA and hardware configuration for the output of each layer on FPGA
Algorithm 1: 2L-3W Hardware-Software Co-Verification Methodology
0:  Design, configure and train the model
1:  Test the model on unseen data (D) in Caffe
2:  if Testing fails then
3:     Retrain and re-design or re-configure the model
4:  else
5:     Perform network surgery on the model
6:     Obtain the layer-by-layer Blobs, compute the numerical distribution and generalized statistical properties (range, maximum, minimum, mean and standard deviation) for correctly predicted images in the training set, and write them to the SPVF
7:  end if
8:  Extract the Blobs (layer-by-layer output) from testing on unseen data and write them to file (File_SW)
9:  Compare File_SW with the SPVF generated in Level 1
10:  Extract the parameters (weights and biases) of the model
11:  Convert the parameters from Python syntax to C++
12:  Implement the C++ representation of each layer of the model design in Vivado HLS
13:  Incorporate the model parameters into the model design in Vivado HLS
14:  Simulate the model design with the unseen data (D) used in model testing
15:  Write the layer-by-layer output of the Vivado HLS simulation to file (File_Design)
16:  Compare, value by value, the respective layer-by-layer outputs of Vivado HLS and Caffe
17:  if Vivado HLS Output != Caffe Output then
18:     Redesign the C++ algorithm and check for errors using the layer-by-layer output values
19:  else
20:     Generate the IP from the model design in Vivado HLS
21:  end if
22:  Configure the IP in the Vivado block design
23:  Generate the bitstream
24:  Deploy the bitstream on the board
25:  Import the bitstream in the Python Overlay
26:  Run the bitstream with unseen data (D) and write the layer-by-layer output to file (File_HW)
27:  Perform hardware-software verification with the results
28:  if FPGA Output != Vivado HLS Output or FPGA Output != Caffe Output then
29:     Redesign the C++ algorithm and re-generate the bitstream
30:  else
31:     End deployment
32:  end if

4 Experimental Validation of Hardware-Software Co-Verification

To validate our methodology, we implemented two DLAs. The first DLA is LeNet and the other is a Caffe-inspired Cifar-10 DLA, as shown in Figs. 3 and 4, respectively. Both DLAs are implemented on the Xilinx PYNQ FPGA board, and the implementation processes are for the most part the same. For clarity, the discussion in this section follows the process for the LeNet DLA.

The LeNet DLA is designed and trained in Caffe. After training, several steps are taken to test the trained model and to validate its implementation correctness. A total of 100 test images are passed through the model, which yields an accuracy of 97%. After verifying the accuracy, 1000 correctly predicted images are passed through the LeNet DLA to obtain the numerical distribution of each respective Blob (Blobs are explained in Section II) using network surgery. The mean, range, maximum, minimum and standard deviation are obtained and averaged over the 1000 images to get generalized statistical properties of each respective Blob, as shown in Fig. 2. The boundary values (minimum and maximum of each element across all the chosen training images) for each element in the Blob are also obtained. These properties and boundary values are written to a specified file called the SPVF. To illustrate this, Fig. 5 shows the numerical distribution of the outputs of the first fully connected layer of the LeNet DLA. The Blobs of the first fully connected layer for unseen data are compared with this distribution for verification. The SPVF for the first fully connected layer shows that the Blobs follow a Gaussian distribution. The same procedure is carried out for all the layers in the LeNet DLA. This concludes Level 1 of the 2L-3W hardware-software co-verification, which is shown as "Inference Phase Software Verification" in Fig. 2.

Fig. 5: Software Verification for Fully Connected Layer 1 for LeNet DLA
Fig. 6: Illustration of Conv1 for Caffe to HLS Design to Block Design to Python

For the second level of the 2L-3W hardware-software co-verification, labelled "Level 2: Inference Phase DLA Mapping" in Fig. 2, an unseen image is passed through the trained DLA and the Blobs of each layer are obtained using network surgery (explained in Section II) and written to a specified file denoted by File_SW in Fig. 2. The Blobs written to File_SW are then verified against the SPVF: each element in the Blob of each layer is compared with the boundary values in the SPVF. The code snippet that gives access to the layer-by-layer output using network surgery for one of the DLA layers is shown in Image 3 of Fig. 6. Following the proposed 2L-3W co-verification methodology in Fig. 2, in Section B the parameters (weights and biases) of the DLA are obtained and parsed into the HLS design. Each of the layers defined in the Caffe framework is also defined in Vivado HLS to maintain the same prediction accuracy from Caffe to the PYNQ hardware. As shown in Section C in Fig. 2, a stream of input data (the same data used in testing the Caffe model) is used to simulate the HLS design layers with the parsed parameters. The layer-by-layer output of the simulation is written to a specified file denoted by File_Design in Fig. 2. The layer-by-layer output of the DLA written to File_Design is then verified against the SPVF. After successful verification, the DLA is optimized to fit the PYNQ FPGA board and is then synthesized and packaged as an IP. Vivado HLS contains built-in directives known as pragmas (shown in the first two lines of Image 4 of Fig. 6) that specify how data is written to the IP (shown in Section C in Fig. 2) and how data is read from the IP. The pragma used to allow data flow is called "interface axis port". This axis port is important because it allows an actual physical AXI4-Stream port to be used later in the block design. The AXI4-Stream ports allow the Blobs from each layer to be viewed in the Python environment.

Fig. 7: The IP integration with the Zynq Processor

To generate an Overlay that will be exported to the PYNQ board for the LeNet DLA, the generated IP is imported into Vivado, where each axis port defined in Vivado HLS is declared as an AXI4-Stream port on the IP. An example of this can be seen in Fig. 3, where the LeNet_DLA IP has ports representing the Blobs of each layer. The AXI4-Streams are written to and read from via AXI DMA, as shown in Fig. 3. Each of these AXI DMAs needs to interact with the Python Overlay APIs in order to write data to and read data from the AXI DMA. These connections in Fig. 3 are collapsed into a hierarchy called "LeNet_DLA", shown in Fig. 7, which is called in the Python environment. As shown in the block design in Fig. 7, the LeNet_DLA transfers data to and from the ZYNQ processor via the axi_interconnect_0 and axi_interconnect_1 modules, respectively. When all the connections are routed, the connections in the block diagram are validated, synthesized, and implemented. After implementation, a bitstream file and a .tcl file are generated and exported to the PYNQ FPGA board to create an Overlay that is called from the Python environment.

As shown in the Hardware Deployment and On-board Verification phase (Section E) in Fig. 2, the Python Overlay API is imported into the Jupyter Notebook, which allows reading from and writing to the IP on the hardware of the PYNQ board via AXI DMA. In the Python environment, the Python Overlay library uses the AXI DMA APIs to call the AXI DMAs created in the block diagram directly and allows an image vector to be written as a stream to the IP for processing. After execution, the output of each layer is written to its respective AXI DMA, from which it is passed to the Python environment. These outputs are verified against the SPVF and written to a specified file (File_HW). The prediction is read from the output register specified in Vivado HLS.

To illustrate this co-verification process, Fig. 6 shows how the output of the conv1 layer defined in Caffe is written in Vivado HLS with its respective number of outputs. In Vivado HLS, IN_DATA and OUT_CONV1 are defined as AXI4-Streams, which provide the actual ports through which the input image is streamed in (IN_DATA) and the Blob is streamed out (OUT_CONV1), as shown in Fig. 6. Importing the IP into the Vivado block design, shown in Fig. 6, shows that IN_DATA and OUT_CONV1 each have their own port to be connected to an AXI DMA. OUT_CONV1 is written to conv1_dma (shown in Image 6 of the code snippet in Fig. 6) in the Python environment. Buffers are created and assigned to their AXI DMAs so that data can be passed to and from the AXI DMA. Once the IP is signaled to start from the Python environment, the AXI DMA returns its values to the buffer, which can then be viewed in the Python environment.

TABLE I: Snippet of results from layer-by-layer output of LeNet DLA implementation using default float (32-bit) data type. First column shows Caffe output (Software), second column shows Vivado HLS (Design) and third column shows PYNQ FPGA (Hardware) output results.
Layer | Caffe Output | Vivado HLS Output | PYNQ FPGA Output
Data
conv1
pool1
conv2
prediction

TABLE II: Snippet of results from layer-by-layer output of LeNet DLA implementation using Arbitrary Precision for bit-width reduction in hardware design and deployment. First column shows Caffe output (Software), second column shows Vivado HLS (Design) and third column shows PYNQ FPGA (Hardware) output results.
Layer | Caffe Output | Vivado HLS Output | PYNQ FPGA Output
Data
conv1
pool1
conv2
prediction

TABLE III: Snippet of results from layer-by-layer output of Cifar-10 DLA implementation using default float (32-bit) data type. First column shows Caffe output (Software), second column shows Vivado HLS (Design) and third column shows PYNQ FPGA (Hardware) output results.
Layer | Caffe Output | Vivado HLS Output | PYNQ FPGA Output
Data
conv1
pool1
conv2
prediction

TABLE IV: Snippet of results from layer-by-layer output of Cifar-10 DLA implementation using Arbitrary Precision for bit-width reduction in hardware design and deployment. First column shows Caffe output (Software), second column shows Vivado HLS (Design) and third column shows PYNQ FPGA (Hardware) output results.
Layer | Caffe Output | Vivado HLS Output | PYNQ FPGA Output
Data
conv1
pool1
conv2
prediction

The Caffe software framework generates File_SW at the end of Section A of Fig. 2. The Vivado design simulation generates the layer-by-layer output features of the DLA, which are stored in File_Design, as shown in Section D of Fig. 2. The layer-by-layer output of the AXI DMA of each respective layer is written to File_HW, as depicted in Section E of Fig. 2. Finally, as shown in Section E of Fig. 2, the 2L-3W co-verification compares the output of each layer at each stage of the hardware-software co-design.
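A minimal sketch of how the three files could be written and read back for comparison is given here. The text layout (one line per layer, with the layer name followed by whitespace-separated values) and the helper names write_layer_outputs and read_layer_outputs are assumptions made for illustration; the paper does not specify the file format.

```python
import os
import tempfile


def write_layer_outputs(path, layer_outputs):
    """Write one line per layer: '<layer> v0 v1 ...' (assumed layout)."""
    with open(path, "w") as fh:
        for layer, values in layer_outputs:
            fh.write(layer + " " + " ".join(repr(float(v)) for v in values) + "\n")


def read_layer_outputs(path):
    """Read a file produced by write_layer_outputs into {layer: [values]}."""
    outputs = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                outputs[fields[0]] = [float(v) for v in fields[1:]]
    return outputs


# Toy values only; real Blobs would come from Caffe, the Vivado HLS
# simulation and the AXI DMA buffers respectively.
demo_dir = tempfile.mkdtemp()
file_sw = os.path.join(demo_dir, "File_SW.txt")
write_layer_outputs(file_sw, [("conv1", [0.5, -1.25]), ("pool1", [0.5])])
loaded = read_layer_outputs(file_sw)
```

The same pair of helpers could be reused for File_Design and File_HW, so that the comparison step only needs to iterate over layer names common to all three files.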

5 Results and Lessons Learned

The LeNet DLA for the MNIST dataset and the Caffe-inspired Cifar-10 DLA for the Cifar-10 dataset are shown in Fig. 4. They are implemented on the PYNQ hardware using the methodology shown in Fig. 2.

The LeNet DLA consists of 8 layers, excluding the data and prob layers, as shown in Fig. 4. The data layer passes a 28x28 image of a hand-written digit through the layers designed in Caffe, and also through the layers designed in Vivado HLS and deployed on the PYNQ FPGA. The results are shown in Table I.

Layer    File_SW vs. File_Design    File_SW vs. File_HW    Parameters Compared
conv1    0.99999                    0.99999                3456
pool1    0.99999                    0.99999                864
conv2    0.98153                    0.98153                1024
pool2    0.96887                    0.96887                256
conv3    0.99057                    0.99057                120
fc1      0.99333                    0.99333                84
fc2      0.99088                    0.99088                10
(a) Similarity scores and parameters compared for the layer-by-layer output of the LeNet DLA when the hardware is designed with the float data type

Layer    File_SW vs. File_Design    File_SW vs. File_HW    Parameters Compared
conv1    0.82564                    0.82564                3456
pool1    0.84104                    0.84104                864
conv2    0.74304                    0.74304                1024
pool2    0.68196                    0.68196                256
conv3    0.64940                    0.64940                120
fc1      0.66700                    0.66700                84
fc2      0.77909                    0.77909                10
(b) Similarity scores and parameters compared for the layer-by-layer output of the LeNet DLA when the hardware is designed with the Arbitrary Precision data type
TABLE V: Similarity scores for the LeNet DLA
Layer          File_SW vs. File_Design    File_SW vs. File_HW    Parameters Compared
conv1          0.99889                    0.99889                5120
pool1_relu1    0.99885                    0.99885                1280
conv2_relu2    1.00510                    1.00510                2560
pool2          0.99896                    0.99896                640
conv3_relu3    0.99732                    0.99732                960
pool3          1.00856                    1.00856                240
fc1            0.98541                    0.98541                50
fc2            0.99964                    0.99964                10
(a) Similarity scores and parameters compared for the layer-by-layer output of the Cifar-10 DLA when the hardware is designed with the float data type

Layer          File_SW vs. File_Design    File_SW vs. File_HW    Parameters Compared
conv1          0.99889                    0.99889                5120
pool1_relu1    0.99885                    0.99885                1280
conv2_relu2    1.00510                    1.00510                2560
pool2          0.99896                    0.99896                640
conv3_relu3    0.99732                    0.99732                960
pool3          1.00856                    1.00856                240
fc1            0.98541                    0.98541                50
fc2            0.99964                    0.99964                10
(b) Similarity scores and parameters compared for the layer-by-layer output of the Cifar-10 DLA when the hardware is designed with the Arbitrary Precision data type
TABLE VI: Similarity scores for the Cifar-10 DLA

Tables II and IV show subsections of the arrays output by the conv1, pool1 and conv2 layers, as written to the respective files, for the LeNet DLA and the Cifar-10 DLA respectively. The 3-way verification is performed by the Python script, which compares the output values of each layer and the prediction written to the files, and returns a similarity score per layer. The similarity score is a metric of element-by-element similarity, in terms of magnitude, between the values stored in the arrays produced by each layer and written to the three files (File_SW, File_Design, File_HW).

The similarity score per layer for the design stage is given by Eq. (1), which is defined in terms of the following quantities:

Similarity score for a layer in the design stage
Element written to a particular file
Number of parameters to be compared in the layer
Absolute value of the element written to File_SW
Absolute value of the corresponding element written to File_Design

Similarly, the similarity score per layer for the deployment stage is given by Eq. (2), which is defined in terms of the following quantities:

Similarity score for a layer in the hardware deployment stage
Element written to a particular file
Number of parameters to be compared in the layer
Absolute value of the element written to File_SW
Absolute value of the element written to File_HW
Absolute value of the corresponding element written to File_Design
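
A minimal sketch of how such a per-layer score could be computed is given here. It assumes (this is an assumption, not the formula stated in the paper) that the score is the mean, over the N compared elements, of the ratio between the absolute value of the candidate element (from File_Design or File_HW) and the absolute value of the reference element (from File_SW). The function name similarity_score, the eps guard and the handling of near-zero reference values are likewise hypothetical.

```python
def similarity_score(reference, candidate, eps=1e-12):
    """Mean element-wise ratio |candidate_i| / |reference_i| (assumed form)."""
    if len(reference) != len(candidate):
        raise ValueError("layers must have the same number of elements")
    if not reference:
        raise ValueError("nothing to compare")
    ratios = []
    for ref, cand in zip(reference, candidate):
        ref_mag, cand_mag = abs(ref), abs(cand)
        if ref_mag <= eps:
            # Assumption: a (near-)zero reference counts as a perfect match
            # only when the candidate is (near-)zero as well.
            ratios.append(1.0 if cand_mag <= eps else 0.0)
        else:
            ratios.append(cand_mag / ref_mag)
    return sum(ratios) / len(ratios)


# Toy values standing in for one layer of File_SW and File_Design.
sw = [0.50, -1.00, 2.00, 0.00]
design = [0.50, -0.98, 2.04, 0.00]
score_design = similarity_score(sw, design)
```

Under this reading a score can exceed 1 when candidate magnitudes are on average slightly larger than the reference, which would be consistent with values such as 1.00510 in Table VI. The same function would be called with File_HW values as the candidate to obtain the deployment-stage score.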

Table II shows snippets of the layer-by-layer output values of the LeNet DLA written to File_SW in the software stage, to File_Design in the design stage, and to File_HW in the hardware deployment stage. These values are used to obtain the similarity scores for the design stage and the deployment stage respectively.

Prior to the design stage, the DLA is trained in the Caffe software environment using the float (32-bit precision) data type, so the parameters and Blobs of the DLA are floats. The layer-by-layer outputs (Blobs) are obtained and written to File_SW. In the design stage, the DLA is simulated with float parameters and Blobs, and the layer-by-layer outputs are written to File_Design. The values written to File_Design are compared with the layer-by-layer outputs written to File_SW to obtain the similarity scores for the design stage, as shown in Table V(a). Once the similarity scores are satisfactory, an IP is generated from the hardware design, exported, and configured in Vivado to generate a bitstream file that is deployed on the PYNQ FPGA board. The layer-by-layer values output by the PYNQ FPGA after deployment are written to File_HW and compared with File_SW to obtain the layer-by-layer similarity scores for the deployment stage, which are also shown in Table V(a) for the LeNet DLA.

From Table V(a), a similarity score of approximately 99% (ranging from 0.96887 to 0.99999) is obtained for each layer of the LeNet DLA in both the design stage and the deployment stage using the float data type.

FPGAs have limited area and hardware resources (DSPs, LUTs, flip-flops, BRAM). For scalability, one strategy to ensure that a large DLA fits on the FPGA board is to truncate the bit-width of its parameters and Blobs using the Arbitrary Precision (AP) libraries provided in Vivado HLS during the hardware design stage. This truncation reduces the memory and computation requirements of the DLA. For the LeNet DLA in this work, the parameters and Blobs are truncated from 32-bit precision to 8-bit and 24-bit precision respectively. The truncation reduces the area of the DLA on the board without compromising accuracy: the truncated design was tested with 100 images, and its predictions are consistent with those of the hardware design and deployment using the float data type. The truncation changes the values of the parameters and hence of the Blobs, as shown in Table II. The similarity scores for the design and deployment stages are shown in Table V(b).

From Table V(b), similarity scores ranging from 65% to 84% are obtained in the design and deployment stages when the layer-by-layer output values written to File_Design and File_HW are compared with those written to File_SW. This drop in similarity scores is due to the bit-width truncation of the parameters and Blobs of the LeNet DLA in the design stage, which carries over to the deployment stage.
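
The effect of bit-width truncation on a single value can be illustrated with a small software model of a signed fixed-point number. This is only a sketch and not Vivado HLS library code: the split between integer and fractional bits is not stated in the paper, so frac_bits is an assumption, as are the function name truncate_fixed and the choice to saturate on overflow.

```python
import math


def truncate_fixed(value, total_bits, frac_bits):
    """Model a signed fixed-point value with total_bits bits, of which
    frac_bits are fractional. Extra precision is dropped by rounding toward
    negative infinity; out-of-range values are saturated (an assumption)."""
    scale = 1 << frac_bits
    raw = math.floor(value * scale)
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = max(lowest, min(highest, raw))
    return raw / scale


# A hypothetical weight stored in 8 bits with 6 fractional bits.
weight_q = truncate_fixed(0.3, total_bits=8, frac_bits=6)
# A hypothetical Blob element stored in 24 bits with 12 fractional bits.
blob_q = truncate_fixed(3.14159, total_bits=24, frac_bits=12)
```

Because every parameter loses precision in this way, the error accumulates through the layers, which is one way to read the lower similarity scores reported for the Arbitrary Precision design.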

The prediction values written to the three files are equivalent and consistent, which indicates a successful implementation of the DLA on the PYNQ FPGA. Based on the similarity scores provided by the Python script, recommendations can be made on where to debug or redesign when the score of a layer falls below a certain threshold. This avoids blind debugging of results during the hardware implementation of a DLA. A total of 100 images are used to validate the 3-way verification methodology, and the results are consistent in all cases.
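
The threshold-based recommendation could be sketched as follows. The tolerance value and the function name layers_to_debug are assumptions for illustration; the paper does not state a specific threshold. The scores are the File_SW versus File_Design values from Table V.

```python
def layers_to_debug(scores, tolerance=0.05):
    """Return, in network order, the layers whose similarity score deviates
    from 1 by more than the given tolerance (hypothetical threshold)."""
    return [layer for layer, score in scores if abs(1.0 - score) > tolerance]


# Design-stage scores for the LeNet DLA, taken from Table V(a) and V(b).
lenet_float = [("conv1", 0.99999), ("pool1", 0.99999), ("conv2", 0.98153),
               ("pool2", 0.96887), ("conv3", 0.99057), ("fc1", 0.99333),
               ("fc2", 0.99088)]
lenet_ap = [("conv1", 0.82564), ("pool1", 0.84104), ("conv2", 0.74304),
            ("pool2", 0.68196), ("conv3", 0.64940), ("fc1", 0.66700),
            ("fc2", 0.77909)]

flagged_float = layers_to_debug(lenet_float)
flagged_ap = layers_to_debug(lenet_ap)
```

With this tolerance no layer of the float design is flagged, while every layer of the Arbitrary Precision design is. Since the truncated design still yields consistent predictions, a looser tolerance would be the natural choice when bit-width reduction is intentional; the first flagged layer is the earliest point at which the outputs diverge and therefore a reasonable place to start debugging.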

Tables III and IV show similar results for the implementation of the DLA for the Cifar-10 dataset shown in Fig. 4. This DLA also consists of 8 layers and accepts a 32x32 input image. Like Tables I and II, Tables III and IV show subsections of the 2D matrices written to File_SW, File_Design and File_HW, for the float data type and for the implementation using the arbitrary precision data type for bit-width reduction respectively. Using the float data type, the Python script returns a similarity score of approximately 99% for the values written in the design stage (File_Design) and the deployment stage (File_HW) when compared with File_SW, as seen in Table VI(a). Using the arbitrary precision data type, it returns similarity scores ranging from 65% to 84% for the same comparisons, as seen in Table VI(b). The arbitrary precision predictions are consistent with the predictions obtained using the float data type for 100 images.

6 Related Work

Several approaches in the existing literature have been adopted to achieve efficient mapping of DLAs to FPGA boards. Guo et al. [guo2017angel] propose a design flow for mapping CNNs onto embedded FPGAs. In [guo2017angel], data quantization is introduced to reduce the bit-width of CNN models, lowering memory and computation requirements with negligible accuracy loss. A compiler that maps the CNN to the FPGA is also proposed.

Kastner et al. [kastner2018hardware] propose a tool flow for the hardware/software co-design of CNNs on PYNQ FPGAs. FPGAs possess Dynamic Partial Reconfiguration (DPR) capabilities that enable the exchange of logic partitions within the FPGA fabric. This property is a major advantage for designing, with high-level synthesis, hardware architectures that can adapt and reconfigure themselves according to the characteristics of the DLA.

Mu et al. [mu2018collaborative] propose a collaborative framework to optimize OpenCL-based CNN designs. LoopTree is introduced to capture the main features of an OpenCL-based hardware design, using design specifications such as loop orders, loop tiling, Block RAM (BRAM) and Double Data Rate (DDR) configurations, and OpenCL attributes. A coarse-grained model is then employed to evaluate the performance of a LoopTree and to find candidate designs. Finally, a fine-grained model is employed to tune the candidate designs and obtain the best design to deploy on the hardware. In addition, [han2016eie] proposes weight compression and weight sharing for neural networks so that hardware resources are used efficiently and large neural network models fit in ASICs and FPGAs.

Xiang et al. [xiang2018output] propose a software simulation-based approach for the verification of multilayer neural networks, with an algorithm that measures the maximum sensitivity of the output over a finite number of simulations corresponding to different bounded inputs. The sensitivity of the network is the mathematical expectation of the output deviations caused by input and weight deviations, with respect to the overall input and weight values in a given continuous interval. The maximum sensitivity is used to measure the maximum deviation of the outputs caused by bounded disturbances around the input. It represents the output reachable sets of the network and is computed layer by layer. These measurements are used to verify the layer-by-layer output of the network.

Dwarakanath et al. [dwarakanath2018identifying] propose a software-based approach to the verification of machine learning-based image classifiers using metamorphic testing. This approach builds multiple relationships between the outputs of a classifier for different inputs to derive the degree of correctness of the classifier implementation, and it is designed to detect implementation bugs. The metamorphic tests include permutations of the training and testing input features, of the training instances and of the layers, as well as scaling of the test data samples, each of which generates a different output to be checked.

Choi et al. [choi2018stochastic] propose a stochastic functional verification method for designing DNN-based systems. In this approach, synthetic datasets are generated in a virtual environment and added to the training set of a DNN. The DNN is trained with both datasets and validated with validation subsets of both. A comparison metric, such as class-wise average precision, is used to compare the performance of the model on the two validation sets against a predefined threshold. For a DNN under verification, the DNN is trained with the synthetic datasets and the comparison metric is obtained. The similarity between the comparison metric and the predefined threshold is used to validate the verification.

Hao et al. [hao2019fpga] propose a time-saving co-design methodology that simultaneously searches the possible design options to auto-generate efficient DNNs optimized for FPGA deployment. [hao2019fpga] introduces a template for generating DNNs with efficient performance and hardware resource utilization. An automatic HLS generator is proposed to translate the auto-generated DNN into synthesizable C code for hardware deployment.

Reference [changwoolee] is a GitHub repository that contains the C++ design code for mapping the LeNet DLA onto hardware. The repository shows the weights and algorithms of each layer, and the code is an already finished DLA for an FPGA board. The repository gives no information about which framework was used to train the DLA, and it provides no means of debugging and validating the output of each layer to accomplish design-time verification at every stage.

References [stephendigikey] and [alveo] give an introduction to the deployment of machine learning on hardware. They show only stacks and block diagrams of how neural networks are utilized on hardware, along with the number of parameters and MACC (Multiply-Accumulate) units required by a DLA. They do not give a full picture from training through testing to the successful deployment of DLAs on FPGA boards and other hardware.

These approaches are either limited to the software environment or do not verify the implementation correctness of the DLA mapping onto hardware across all the design stages involved.

                                 Software Verification        Hardware Verification        Hardware-Software Co-Verification
Approach                         Layer-by-Layer  Final Layer  Layer-by-Layer  Final Layer  Layer-by-Layer  Final Layer
[guo2017angel]                   x               ✓            x               ✓            x               x
[kastner2018hardware]            x               ✓            x               ✓            x               x
[mu2018collaborative]            x               x            x               ✓            x               x
[xiang2018output]                ✓               ✓            x               x            x               x
[dwarakanath2018identifying]     ✓               ✓            x               x            x               x
[choi2018stochastic]             ✓               ✓            x               x            x               x
[hao2019fpga]                    x               ✓            x               ✓            x               x
2L-3W                            ✓               ✓            ✓               ✓            ✓               ✓
TABLE VII: Verification approach comparison with other works (✓: supported, x: not supported).

7 Comparison With State-of-the-Art

Several state-of-the-art approaches have been adopted to ascertain the implementation correctness of DLAs. Guo et al. [guo2017angel] propose an approach for the hardware-software co-design of DLAs on FPGAs. This approach validates the DLA only at the final layers of the software and hardware; it does not provide layer-by-layer verification of the layer outputs.

Kastner et al. [kastner2018hardware] propose a tool flow for the hardware-software co-design of DLAs on FPGAs. The co-design is validated at the final layers of the software and hardware. This tool flow does not verify the layer-by-layer outputs to ensure implementation correctness on the hardware, and it provides no means of debugging in case of errors.

Mu et al. [mu2018collaborative] propose a collaborative framework to optimize the deployment of DLAs on FPGAs. This approach validates the correctness of the deployment only at the final layers of the hardware deployment. It does not account for the software implementation, and at the hardware level it does not provide layer-by-layer verification of the DLA.

Xiang et al. [xiang2018output] propose a software simulation-based approach to verify the correctness of a DLA. The verification is limited to the layer-by-layer output and the accuracy of the final prediction in software. This approach does not provide a means of verifying the implementation correctness of the mapping of DLAs onto hardware.

Dwarakanath et al. [dwarakanath2018identifying] propose a software-based approach to verify the correctness of image classifiers. The approach verifies only the layer-by-layer output and the accuracy of the final prediction. It does not extend to the mapping of DLAs onto FPGA boards.

Choi et al. [choi2018stochastic] introduce a stochastic functional verification method using synthetic datasets. This method verifies the layer-by-layer output and the accuracy of the deep learning model. The approach is not scalable when trying to achieve successful mapping of DLAs onto FPGA boards.

Hao et al. [hao2019fpga] propose a co-design methodology that simultaneously generates a software design model and synthesizable C code for the hardware design. This approach validates the design only through the prediction accuracy of the model.

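Table VII shows that our proposed method covers all six types of verification listed: layer-by-layer and final-layer verification at the software level, at the hardware level, and across hardware and software.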

8 Conclusions

This work proposes a 2-Level 3-Way methodology for the hardware-software co-verification of DLAs, from the deep learning software framework to the HLS design of the DLA and finally to its deployment on the FPGA board. The methodology is used to test the hardware implementation correctness of two DLAs (LeNet and a Caffe-inspired Cifar-10 network) on the PYNQ FPGA board. To the best of the authors' knowledge, this is the first methodology that performs layer-by-layer co-verification of the mapping of DLAs across the three paradigms (software, design and hardware). The methodology can help to achieve successful implementation and mapping of DLAs onto FPGAs during the design phase and can help in the cross-paradigm debugging process. We proposed a new metric for cross-paradigm co-verification, called the similarity score, which measures the degree of correctness of the implementation of each layer and helps to identify layers that need debugging. Our implementation results, from the Caffe software to the Vivado HLS design and finally to Xilinx's PYNQ FPGA, show similarity scores of approximately 99% for LeNet and the Caffe-inspired Cifar-10 network with the float data type. Similarity scores ranging from 65% to 84% are obtained when the bit-width of the LeNet DLA is truncated so that it can fit on the PYNQ FPGA board, while the predictions remain consistent. This indicates the successful mapping of the DLA onto the PYNQ FPGA board.

References