1 Introduction
Field-programmable gate arrays (FPGAs) are an efficient and flexible processing solution for low-latency, high-bandwidth inference of deep neural networks (DNNs). Their architecture is well suited to parallelizing the mathematical operations typical of DNN inference, namely matrix multiplication and the application of activation functions. FPGAs can be reprogrammed, which offers clear advantages in flexibility with respect to application-specific integrated circuits (ASICs). At the same time, they share some of the advantages offered by ASICs, such as low power consumption and speed.
Typically, FPGAs are used to emulate generic digital circuits as a preliminary step toward the design of custom ASICs, or as an alternative to them. For instance, hundreds of FPGAs are used as custom electronic logic to process in real time the proton-proton collisions at the CERN Large Hadron Collider (LHC). With beams colliding every 25 ns and thanks to a built-in buffering system, a typical LHC experiment has a few microseconds to decide whether to keep or discard a given event. This real-time decision-taking system, referred to as the “L1 trigger”, consists of a set of digital circuits implementing physics-motivated rule-based selection algorithms. Currently, these algorithms are implemented on FPGAs, mounted on custom boards.
The severe L1 latency constraint prevents the LHC experimental collaborations from deploying complex rule-based algorithms on the L1 FPGA boards. Machine learning (ML) solutions, and in particular DNNs, are currently being investigated as fast-to-execute and parallelizable approximations of rule-based algorithms. For instance, the CMS collaboration has deployed boosted decision trees in the L1 trigger electronic logic [cms_l1t_bdt]. Following this approach, one could train a DNN to process a given input (e.g., energy deposits in a calorimeter) and return the output of an event reconstruction algorithm (e.g., to regress the energy of the incoming particle that caused these energy deposits, or to identify its nature). Since the complexity of LHC collision events is going to increase, we expect this approach to become popular in the near future. In order to facilitate the deployment of DNNs in the L1 triggers of high energy physics (HEP) experiments, we developed a software library, hls4ml, to convert a DNN model into FPGA firmware through an automatic workflow [Duarte:2018ite].

In HEP, the deployment of deep learning (DL) models on FPGAs has been discussed in the context of the online data-selection systems of the LHC experiments, and alternative solutions based on VHDL [Schlag:2670301] have been explored. Similar studies and comparable results have been shown in Ref. [s19132981]. The hls4ml design is characterized by two aspects: (i) it relies on high-level synthesis (HLS) backends, in order to allow a fully automated workflow from a trained model to FPGA firmware; (ii) it is designed so that the final outcome is fully on-chip logic, which keeps the latency within typical values of O(1) μs. Our ultimate goal is to support the most popular DNN model ingredients (layers, activation functions, etc.) and an interface to the most popular DL training libraries, either directly (e.g., for TensorFlow [TF], Keras [keras], and PyTorch [pytorch]) or through the ONNX [onnx] interface. The library is under development and many of these ingredients are already supported. While hls4ml was initially conceived for LHC applications, its potential use cases go well beyond HEP. In general, hls4ml provides a user-friendly interface to deploy custom DNN models on FPGAs, used as co-processing accelerators or as digital circuits in resource-constrained, low-latency computing environments.
The main challenge in deploying a DNN model on an FPGA is the limited computational resources. Typically, one would reuse resources for the inference operations across multiple clock cycles, at the price of a larger latency. The reuse factor quantifies how many times a resource is reused and is equal to the initiation interval (II) for that operation. A complementary approach consists of compressing the model, e.g., by reducing the number of operations needed in the inference step (pruning) or their cost (e.g., quantizing the network to a fixed-point numerical representation). In a previous publication [Duarte:2018ite], we showed that pruning and quantization allow one to execute simple fully-connected DNN models with state-of-the-art performance on a specific LHC problem within a latency of O(100) ns, while using only a fraction of the FPGA resources. In this paper, we investigate how a similar result can be obtained with binary and ternary networks [courbariaux, binary_first, ternary_first], following closely the studies presented in Refs. [courbariaux, xilinx_finn, bertmoons].
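The reuse-factor tradeoff described above can be illustrated with a simplified cost model (a sketch only, not the actual hls4ml scheduler): with reuse factor R, each physical multiplier is time-multiplexed across R clock cycles.

```python
def reuse_tradeoff(n_mults, reuse_factor):
    # With reuse factor R, roughly n_mults / R physical multipliers are
    # needed, and a new input is accepted only every R clock cycles,
    # i.e., the initiation interval (II) equals R.
    multipliers = -(-n_mults // reuse_factor)  # ceiling division
    initiation_interval = reuse_factor
    return multipliers, initiation_interval
```

For example, a dense layer with 784 x 128 weights would need 100352 parallel multipliers at reuse factor 1, while a reuse factor of 64 cuts this to 1568 at the price of an II of 64 cycles.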
This paper is structured as follows: Section 2 introduces the benchmark problems and data sets. The implementation of binary and ternary networks in hls4ml is described in Section 3. Section 4 describes the different model architectures considered in this study, while their application to the two benchmark classification problems is discussed in Section 5. A summary and outlook are given in Section 6.
2 Benchmark models and data sets
We consider two benchmark classification tasks: a digit recognition task with the MNIST data set [MNISTdata] and the LHC jet tagging task discussed in Ref. [Duarte:2018ite].
The MNIST data set consists of training and validation samples (with 60,000 examples) and a testing sample (with 10,000 examples). Each image is represented as a 28 × 28 pixel array, storing the grayscale content of each pixel in the original image. For our purpose, we flatten the 2D array to a 1D array, concatenating each row of the image to the end of the previous one. The derived 1D array is passed as input to a multilayer perceptron (MLP) [MLP] with an input (output) layer of 784 (10) nodes and three hidden layers with 128 nodes each. Rectified linear unit (ReLU) activation functions [Hahnloser:2000wt] are used for the hidden-layer nodes, while a softmax activation function is used for the output layer.

The other benchmark task consists of classifying jets from a set of 16 physics-motivated high-level features, as described in Ref. [Duarte:2018ite]. The network receives as input a vector of 16 quantities and processes it through an MLP with three hidden layers of 64, 32, and 32 nodes with ReLU activation functions. The output layer consists of five nodes with softmax activation. The five output values correspond to the probability of a given jet to belong to one of the five jet classes (light quark q, gluon g, W boson, Z boson, or top quark t). The utilized data set is available on the Zenodo repository [data].

The architectures of the baseline MNIST and LHC jet classifiers are illustrated in Fig. 1. Both are implemented and trained with Keras in floating-point precision (FPP). Their performance is shown in Fig. 2 in terms of receiver operating characteristic (ROC) curves and normalized confusion matrices. The area under the curve (AUC) of each ROC curve is quoted in the figure, as well as in Table 1, where the corresponding accuracy values are also given. Following convention, we define the accuracy as the fraction of correctly labeled examples. In practice, this is computed by applying an argmax function to the array of scores returned by the network and comparing the result to the corresponding target array. The total accuracies of the MNIST and LHC jet classifiers, computed across all categories, are found to be 98% and 75%, respectively.
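The accuracy definition above can be written compactly as follows; the score and target arrays below are illustrative toy data, not the actual benchmark samples.

```python
import numpy as np

def accuracy(scores, targets):
    # fraction of examples whose highest-scoring class matches the target label
    predictions = np.argmax(scores, axis=1)
    return float(np.mean(predictions == targets))
```
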
Class  MNIST AUC  Accuracy [%]      Class  Jet tagging AUC  Accuracy [%]
0      0.9997     99.7              q      0.939            89
1      0.9995     99.8
2      0.9991     99.6              g      0.904            85
3      0.9993     99.6
4      0.9996     99.6              W      0.946            91
5      0.9994     99.6
6      0.9992     99.6              Z      0.939            92
7      0.9996     99.6
8      0.9994     99.4              t      0.958            93
9      0.9991     99.5
3 Implementing binary and ternary networks in hls4ml
Binary and ternary networks are extreme examples of quantized neural networks. A network is quantized when the numerical representation of its parameters is a fixed precision. This precision can be constant across the full network or specific to each component (e.g., to different layers). Quantization reduces the computing resources needed by a given model for inference, and it can be tuned so that it comes with little or no loss in performance. In the case of binary (ternary) networks, each weight assumes a value of +1 or −1 (−1, 0, or +1). Two- and three-valued activation functions are used after each layer, acting as discrete versions of the tanh function. As alternatives, we also investigate a standard ReLU function as well as its clipped version [Cai2017DeepLW], defined as y = min(max(x, 0), α), with α being a positive hyperparameter whose value we fix in this study. The four functions are represented in Fig. 3.

In order to convert the models described in Section 2, we rely on the MLP-related functionalities offered by the hls4ml library, discussed at length in Ref. [Duarte:2018ite]. In addition to that, we exploit a set of custom implementations [xilinx_finn], specific to binary and ternary networks, that allow one to speed up the execution of the building-block architecture shown in Fig. 4. These solutions are integrated in recent versions of the hls4ml library, starting with the v0.1.6 tag of the GitHub repository [hls4ml_github]. With respect to the work presented in Ref. [Duarte:2018ite], this version provides special support for large dense layers, which allows one to deal with the large numbers of nodes in the models we consider in this study. This functionality will be described in more detail in a future publication.
Binary networks use 1-bit representations for both weights and activations. In this case, the product between two quantities can be optimized to an extremely lightweight operation. By encoding an arithmetical value of ‘−1’ as the bit ‘0’ (and ‘+1’ as ‘1’), the product can be expressed as an XNOR operation, as shown in Table 2. For models using ternary weights or more than 1 bit for the activations, FPGA logic is always used rather than DSPs.

a    b    a × b        a    b    XNOR(a, b)
−1   −1   +1           0    0    1
−1   +1   −1           0    1    0
+1   −1   −1           1    0    0
+1   +1   +1           1    1    1
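The equivalence in Table 2 can be checked directly; this is a self-contained sketch of the encoding, not the hls4ml internals.

```python
def encode(value):
    # map the arithmetic values -1/+1 to the bits 0/1
    return (value + 1) // 2

def decode(bit):
    # map the bits 0/1 back to -1/+1
    return 2 * bit - 1

def xnor(a_bit, b_bit):
    # XNOR is 1 when the two bits agree, 0 otherwise
    return 1 - (a_bit ^ b_bit)

# the product of two 1-bit weights/activations equals the decoded XNOR
for a in (-1, 1):
    for b in (-1, 1):
        assert a * b == decode(xnor(encode(a), encode(b)))
```
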
The binary and ternary activation functions are implemented by testing the sign (in the case of the binary activation) or sign and magnitude (for the ternary activation) of the input and yielding the corresponding value +1, −1, or 0, as seen in Fig. 3. A binary or ternary activation layer preceded by a BN layer can be further optimized. The usual BN transformation is

y = γ (x − μ) / √(σ² + ε) + β,

given the mean μ, variance σ², scale γ, and shift β computed during the network training (ε is a small constant added for numerical stability). For a BN followed by a binary activation, the sign of y is enough to determine the node output value. To avoid calculating the scaling of x using FPGA logic, the four BN parameters are used to compute the value of x at which y flips sign. This calculation is performed at compile time, when the model is converted to HLS firmware using hls4ml. Similarly, the two values of x around which the output of the ternary activation changes are also calculated at compile time. In the FPGA, each node output is then simply compared against these precomputed thresholds, outputting the corresponding +1, −1, or 0. An additional optimization step sets the type of x in the HLS implementation to an integer with a bit width corresponding to the largest integer expected for each binary/ternary layer, which is also computed at compile time. This procedure allows one to further save FPGA resources. The binary and ternary layers considered in this work are fully integrated and compatible with the hls4ml package. While not explored here, the package also supports models mixing binary/ternary layers with higher-precision layers for fully customized networks.
4 Binarization and ternarization strategies
Given a full-precision model, one can follow different strategies to turn it into a binary or ternary model. One option is to replace each full-precision component with the corresponding binary/ternary element, in order to minimize resource utilization; this might result in a loss of accuracy. As an alternative, one could train a binary/ternary model with an arbitrarily large architecture, in order to match the accuracy obtained at full precision, at the cost of a larger latency and resource consumption. The ultimate strategy to follow depends on the use case. In this work, we present a few options, covering these two extremes as well as intermediate solutions.
In this work, we focus on binary/ternary MLPs. The basic structure of the adopted architectures is shown in Fig. 4. Each model consists of a sequence of blocks, each composed of a dense layer, a batch normalization (BN) layer [Ioffe:2015:BNA:3045118.3045167], and an activation layer. The BN layer shifts the output of the dense layer to the range of values in which the activation function is nonlinear, enhancing the network’s capability of modeling nonlinear responses. For binary and ternary activations, a BN+activation layer sequence can be implemented at small resource cost (see Section 3), which makes this choice particularly convenient for fast inference on edge devices. The binarization/ternarization of a given model can be done in different ways, e.g., preserving the model architecture or its performance. As a consequence, for each benchmark problem we consider seven models:

Baseline: the three-layer MLP described in Section 2.

Binarized (BNN): a binary version of the baseline model, built preserving the model architecture (number of layers and nodes) while applying the following changes: use a binary representation (±1) for the weights; replace the inner-layer ReLU activation functions with a binary activation (see Fig. 3); introduce BN layers between the binary dense layers and the activation functions; remove the softmax activation function from the output layer.

Ternarized (TNN): a ternary version of the baseline model, built preserving the model architecture (number of layers and nodes) while applying the following changes: use a ternary representation (−1, 0, +1) for the weights; replace the inner-layer ReLU activation functions with a ternary activation (see Fig. 3); introduce BN layers between the ternary dense layers and the activation functions; remove the softmax activation function from the output layer.

Best BNN: same structure as the BNN model, but with more nodes in each layer to improve performance. We obtain this model through a Bayesian optimization aimed at minimizing the validation loss in the training process.

Best TNN: same structure as the TNN model, but with the number of nodes per layer chosen through a Bayesian optimization of the architecture, as for the best BNN model.

Hybrid BNN: same as the BNN model, but with ReLU or clipped ReLU activation functions rather than the binary activation of Fig. 3.

Hybrid TNN: same as the TNN model, but with ReLU or clipped ReLU activation functions rather than the ternary activation of Fig. 3.
The baseline model is taken as a benchmark of ideal performance, and the other models represent different strategies toward a more resource-friendly representation. The BNN and TNN models are simple translations of the baseline model. They are designed to reduce the resource consumption, at the potential cost of a performance drop. The best models are designed to match (as closely as possible) the performance of the baseline model, which might result in a larger resource consumption than the BNN and TNN models achieve. The hybrid models are a compromise between the two approaches. The fixed-precision conversion is applied only to the weights and biases of the nodes in the dense layers, while ReLU or clipped ReLU activation functions are used. Given the relatively small resources used by the ReLU/clipped ReLU activations, the hybrid models allow one to reach performance closer to the baseline model without inflating the number of nodes and, consequently, of numerical operations. The best BNN and TNN models are only presented for the LHC jet problem, since in that case the simple binarization or ternarization of the baseline model results in a substantial performance loss. The effect is much milder for the MNIST classification problem, so the binary and ternary architectures are not reoptimized in that case.
One should notice that not all the components of a binary (ternary) model come in binary (ternary) precision, e.g., the output of a ReLU activation function in a hybrid model. For this reason, in the following we discuss bit precision and network quantization even in the context of binary and ternary models.
All models are implemented in Keras [keras], with TensorFlow [TF] as a backend, using the implementation of Ref. [bertmoons] for the binary and ternary layers, which we also cross-checked with QKeras [qkeras], obtaining similar results. The network training was performed on an NVIDIA V100 GPU. During training, binary/ternary precision is employed during forward propagation, while full precision is used during backward propagation. The baseline models of Section 2 are trained by minimizing a categorical cross-entropy loss. The binary and ternary models are trained by minimizing a hinge loss function [Lin:2002].
5 Experiments
The results presented below are synthesized with Vivado HLS version 2018.2 for a Xilinx Virtex UltraScale+ FPGA with part number xcvu9p-flga2104-2L-e. The clock frequency is fixed at 200 MHz, which is typical for the LHC L1 triggers. Unless otherwise specified, the quoted results are derived after the HLS compilation step. The network implementation is further refined by the logic synthesis. We verified that this final step does not affect the accuracy, while it reduces the resource consumption.
5.1 Handwritten digits classification
We first evaluate the performance of the HLS neural-network implementation for the models described in Section 4 with different fixed-point precisions, by scanning the number of both integer (I) and fractional (F) bits. In the following, a given choice of fixed-point precision is specified as ⟨T, I⟩, where T = I + F is the total number of allocated bits. For each case, the minimum number of bits yielding an accuracy above 90% after quantization is considered. We then study the latency and resource utilization in these configurations. Table 3 shows a comparison of the performance obtained for the baseline, binary, and ternary models, in terms of accuracy and AUCs, before and after quantization.
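The effect of a ⟨T, I⟩ fixed-point representation can be emulated in software; this is a simplified model of saturating fixed-point rounding, not the exact Vivado ap_fixed semantics, assuming the I integer bits include the sign bit.

```python
def quantize_fixed(x, total_bits, int_bits):
    # round x to a signed fixed-point value with total_bits - int_bits
    # fractional bits, saturating on overflow
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    q = round(x * scale)
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, q))  # saturate instead of wrapping
    return q / scale
```

For instance, with T = 16 and I = 6, values are rounded to multiples of 2⁻¹⁰ and saturate near ±32.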
Model  Floating point precision  Fixed point precision  
AUC  Accuracy [%]  Number of bits  AUC  Accuracy [%]  
Baseline  0.9991–0.9997  98  0.9919–0.9959  95  
BNN  0.9869–0.9979  93  0.9860–0.9976  93  
TNN  0.9921–0.9992  95  0.9918–0.9992  95  
Hybrid BNN (ReLU)  0.9953–0.9990  95  0.9956–0.9989  95  
Hybrid TNN (ReLU)  0.9970–0.9993  96  0.9971–0.9993  96  
Hybrid BNN (clipped ReLU)  0.9827–0.9983  95  0.9828–0.9983  95  
Hybrid TNN (clipped ReLU)  0.9857–0.9989  96  0.9859–0.9988  96 
Figure 5: Profile of the range of output values of each layer, sampled during inference on the test data set, for the baseline (left) and BNN (right) MNIST models. For each layer, the box represents the quartiles of the distribution, while the line shows the median. The lines extending beyond the box show the minimum and maximum values. The gray shaded areas represent the range covered by the allocated fixed-point precision for each layer. In the left plot, these ranges correspond to the precision specified at compilation. In the right plot, an optimization procedure implemented in hls4ml for binary and ternary networks automatically adapts the precision of each layer to match the range covered by the output distribution; as the BN layer is merged with the binary activation in the HLS implementation, its output precision is 1 bit. Dense, batch normalization (BN), and activation layers are presented in order from the input (top) to the output (bottom).

For binary and ternary models, the hls4ml library applies a further level of per-layer customization of the fixed-point representation, to match the numerical precision of each layer separately, as discussed in Section 3. The outcome of this optimization is shown in the right plot of Fig. 5 for the BNN model, where the gray areas cover different numerical ranges for different layers, despite the common precision specified at compilation. During the optimization, the inputs and the outputs are still represented with the fixed-point precision specified by the user, while the precision of the other network components is optimized.
When quantizing a model, one should allocate I and F bits so that the range of representable values overlaps with the range of values returned by the network layers, in order to reduce the impact on accuracy. This is shown in the left plot of Fig. 5, where the profile of output values returned by each layer of the baseline model is compared to the range covered by the allocated fixed-point precision. For each layer, we consider the distribution of the output values obtained by running the network on a test sample. In the figure, the box represents the quartiles of the distribution, while the line inside the box shows the median. The lines extending beyond the box show the minimum and maximum values. The gray area represents the numerical range covered by the allocated precision. Overall, the allocated precision matches the bulk of the output values at each layer. The only exception is observed for the output layer. In this case, the allocated precision (gray area in the last row of the left plot in Fig. 5) does not cover the bulk of values returned by the layer (red box in the figure). This happens whenever a given example is associated to a specific class with a score close to 1, so that the other values are pushed close to 0 and out of the supported range. In practice, this would not alter the classification outcome in inference.¹

¹ For instance, this would not be problematic when operating this algorithm through an argmax function, associating a given example to the class with the largest output.
For the baseline model, the quantization from floating-point precision to the chosen fixed-point precision results in an accuracy drop from 98% to 95%. This is almost entirely induced by the softmax activation function applied to the last layer, and it results from the limited precision of the LUT implementing the exponential functions in the softmax. This parameter is hard-coded in the version of hls4ml used for this study. One could avoid this accuracy loss by removing the softmax function at the end, as long as one is interested only in which class has the largest score and not in the individual scores. An alternative option is to further optimize the precision of the LUT implementing the softmax activation function. In this case, we verified that a suitably increased precision for the softmax LUT recovers an accuracy of 97% without affecting the resources. The ability to externally configure the precision of the softmax LUT will be implemented in future versions of hls4ml.
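Dropping the final softmax is safe for classification because softmax is a monotonic mapping of the raw scores, so the argmax is preserved; a minimal illustration:

```python
import math

def softmax(scores):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    # index of the largest entry
    return max(range(len(xs)), key=lambda i: xs[i])

# softmax is monotonic, so the predicted class is unchanged
raw = [2.0, -1.0, 0.5, 3.5, 0.0]
assert argmax(raw) == argmax(softmax(raw))
```
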
For the hybrid BNN/TNN models, the same total number of bits used for the BNN/TNN cases allows one to achieve the FPP accuracy, on the condition of allocating more integer (10 instead of 6) and fewer fractional (6 instead of 10) bits. This behaviour can be understood from Fig. 6, which shows the range of outputs returned by each hybrid BNN layer. While for I = 10 the allocated precision spans the full range of outputs returned by each layer, frequent overflows are observed for the Dense 1, Dense 3, and Dense 4 layers when we set I = 6.
Model  II  Latency [ns]  DSPs [%]  FFs [%]  LUTs [%]  BRAMs [%]
                         C    S    C   S    C    S    C    S
Baseline  28  315  130  100  18  8  69  54  126  61 
BNN  14  200  0  0  5  7  155  18  46  16 
TNN  14  190  0  0  6  7  174  22  52  16 
Hybrid BNN (ReLU)  14  200  1  0.16  7  9  215  31  52  16 
Hybrid TNN (ReLU)  14  200  1  1  7  10  217  35  52  16 
Hybrid BNN (clipped ReLU)  14  200  1  2  7  8  215  29  52  16 
Hybrid TNN (clipped ReLU)  14  200  1  1  7  9  215  31  52  16 
Table 4: Resource utilization for the MNIST classifiers of Table 3, together with timing information. Resources estimated by the HLS compiler (C) and obtained by the logic synthesis (S) are quoted for a chosen initiation interval (II).
Table 4 provides a comparison of the resource utilization and latency for the configurations presented in Table 3. For each configuration, we quote both the resource utilization estimated by the HLS compiler and that obtained by the logic synthesis. The logic synthesis transforms the register-transfer level (RTL) design created by the HLS compiler into a gate-level implementation, applying additional optimizations that result in a more accurate assessment of the resource utilization. In the table, the II represents the number of clock cycles needed before the algorithm may accept a new set of inputs. In our study, the II value is fixed by requiring that the resulting resource utilization is below the maximum allowed on the target FPGA. Lower II values would result in a network design that would not fit the device. Larger II values would result in a higher latency.
At the very low latency values we are targeting (a few hundred ns), BNN/TNN models allow one to reach competitive performance while saving most of the FPGA resources. About half of the observed accuracy loss can be recovered using hybrid BNN/TNN models, paying a small price in terms of DSP utilization, induced by the explicit allocation of a BN layer before the ReLU/clipped ReLU activation functions rather than the bit-shift implementation described in Section 3. A further optimization of the BN operations for hybrid models could in principle push the DSP utilization closer to zero.
The LUT usage is largely overestimated by the HLS compiler for all binary and ternary NN models, while it is found to be well within the available resources after the logic synthesis. Hybrid models require more LUTs than the standard BNN/TNN models, because of the wider data bit width at the input of each binary or ternary layer.
Figure 7 shows the dependence of the resource utilization on the maximum latency achieved by the design (controlled by the II) for the baseline and BNN models. Results for the TNN model are very close to the BNN ones. For all latency values, the resources used by the BNN/TNN models are typically reduced with respect to the baseline model. In particular, binary/ternary networks yield a large reduction in the number of DSPs up to a few μs of latency. For higher latency values, the II is large enough to allow a small usage of DSPs even for the baseline model; in that case, the advantage of using a binary or ternary quantization would be minor.
As a final test, we train a larger BNN model consisting of three dense layers with 256 nodes each, as in the study of Ref. [xilinx_finn], allowing for a direct comparison of our implementation of a binary architecture with the one presented there. The hls4ml implementation of this model yields a total accuracy of 95% for both floating-point and fixed-point precision. With an II of 28, we obtain a maximum latency of 0.31 μs with a resource utilization comparable to that in Ref. [xilinx_finn]. In particular, the deployed model obtained with hls4ml after the logic synthesis utilizes 0% DSPs, 7% FFs, 23% LUTs, and 16% BRAMs on a Xilinx Virtex UltraScale+ FPGA card.
5.2 LHC jet identification
As a second benchmark example, we consider the LHC jet-tagging problem introduced in Section 2 and study all the binarization/ternarization strategies described in Section 4. For all models, a compact fixed-point precision is sufficient to reproduce the FPP accuracy after quantization, except for the hybrid binary/ternary models with ReLU, for which two more integer bits are needed. The AUCs and accuracies before and after quantization are reported in Table 5 for all models, while a comparison of the resource utilization is given in Table 6.
Model                      Architecture       Floating-point precision    Fixed-point precision
                                              AUC          Accuracy [%]   AUC          Accuracy [%]
Baseline                   16x64x32x32x5      0.904–0.958  75             0.900–0.955  75
BNN                        16x64x32x32x5      0.794–0.891  58             0.794–0.891  58
TNN                        16x64x32x32x5      0.854–0.915  67             0.854–0.915  67
Best BNN                   16x448x224x224x5   0.886–0.937  72             0.884–0.938  72
Best TNN                   16x128x64x64x64x5  0.886–0.931  72             0.886–0.930  72
Hybrid BNN (ReLU)          16x128x64x64x5     0.878–0.926  69             0.878–0.926  69
Hybrid TNN (ReLU)          16x128x64x64x5     0.876–0.939  71             0.876–0.939  71
Hybrid BNN (clipped ReLU)  16x128x64x64x5     0.866–0.917  68             0.866–0.917  68
Hybrid TNN (clipped ReLU)  16x128x64x64x5     0.870–0.933  70             0.870–0.933  70
Unlike what is seen for the MNIST digit classification, the simple binarization/ternarization of the baseline model results in a substantial accuracy loss. This is partially mitigated by the use of ReLU and clipped ReLU activations. As an alternative approach, we also consider optimized binary and ternary architectures (the best models in Table 5), fixed through a Bayesian optimization of the network hyperparameters. The Bayesian hyperparameter optimization for the BNN and TNN converges to architectures with about 40 and 4 times more parameters than the baseline architecture, respectively. With these larger architectures, binary and ternary models almost match the baseline accuracy, with only a moderate residual loss. Optimizing the architecture of the binary and ternary models yields comparable precisions, but with a different resource balance (e.g., DSPs vs. LUTs), offering an alternative that might better fit certain use cases.
The results of Tables 5 and 6 confirm that ternary networks generally offer a better resource-versus-accuracy balance than binary networks, with a minimal (often negligible) additional resource cost and a comparable (sometimes smaller) latency. In terms of FPGA resources, even the large architecture of the best TNN model results in a limited resource usage, well below the baseline model. Instead, the larger best BNN model requires a higher II value to fit within the FPGA resource boundaries. The latency is kept within the microsecond boundary we target, but it is significantly larger than what is achieved by the best TNN and the baseline models. The best TNN model gives the same accuracy as the best BNN model, with the same latency as the baseline model but with a drastic reduction of the DSP utilization.
Model  II  Latency [ns]  DSPs [%]  FFs [%]  LUTs [%]  BRAMs [%]
                         C    S    C   S    C    S    C    S
Baseline  1  60  60  57  1  1  7  5  0  0 
BNN  1  40  0  0  0  0  3  1  0  0 
TNN  1  40  0  0  0  0  4  1  0  0 
Best BNN  16  205  0  0  1  3  128  8  12  0 
Best TNN  1  55  0  0  0  0  14  3  0  0 
Hybrid BNN (ReLU)  1  70  3  4  1  1  14  6  0  0 
Hybrid TNN (ReLU)  1  65  3  4  1  1  22  6  0  0 
Hybrid BNN (clipped ReLU)  1  60  3  4  0  1  14  6  0  0 
Hybrid TNN (clipped ReLU)  1  60  3  4  1  1  22  6  0  0 
6 Summary and Outlook
We presented the implementation of binary and ternary networks in the hls4ml library, designed to automatically convert a given neural network model into firmware for an FPGA card. Using two benchmark classification examples (handwritten digit recognition on the MNIST data set and jet identification at the LHC), we discussed different strategies to convert a given model into a binary or a ternary model. We showed that binary and ternary networks preserve competitive accuracy while drastically reducing the resource utilization on the card and, at the same time, keeping the inference latency below a few hundred nanoseconds. When compared to binary models, ternary models reach accuracy values much closer to those of the original baseline models, at a typically smaller resource cost and comparable latency. Model binarization and ternarization are competitive alternatives to other compression approaches (e.g., pruning) and represent the ultimate resource saving in terms of network quantization. They offer the qualitative advantage of keeping the DSP utilization at a minimum, and they provide an interesting opportunity to deploy complex architectures in resource-constrained, low-latency computing environments, such as the L1 trigger system of a typical collider physics experiment.
Acknowledgement
We acknowledge the Fast Machine Learning collective (https://fastmachinelearning.org) as an open community of multidomain experts and collaborators. This community was important for the development of this project.
M. P., S. S., V. L., and J. N. are supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 772369). S. J., M. L., K. P., and N. T. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. P. H. is supported by a Massachusetts Institute of Technology University grant. P. H. and D. R. acknowledge support from NSF awards #190444, #1934700, #1931469, and #1836650. Z. W. is supported by the National Science Foundation under Grants No. 1606321 and 115164.