I Introduction
The area of Deep Learning (DL) [DL], and in particular learning approaches based on Deep Neural Networks (DNNs), has seen remarkable advances in the past decade. Neural networks are computing systems that can be seen as collections of connected nodes, called neurons, that are typically organized into sequentially interconnected layers. Each layer performs an affine transformation defined by the layer’s parameters (weights and biases), followed by a nonlinear transformation called an activation. DNNs can “learn” to perform specific tasks by training on examples and then infer the results for new input data. For example, given enough training samples, a classification network can learn the values of the weights and biases of each neuron such that, given a new image, it can distinguish cats from dogs.
Many key algorithmic ideas underlying DNNs go back as far as the late 1960s [minsky69perceptrons], with renewed interest in the 1990s [tesauno]. However, the huge potential of DNNs to solve complex problems while accepting input data in a raw and even heterogeneous form faced practical difficulties: hardware was not powerful enough, allowing only small-sized examples. It was not until the 2000s that some realistic problems could be solved by DNNs. The best example is the ImageNet [imagenet] image classification problem. A huge wave of results followed, applying DL to real-world problems in the areas of computer vision, speech recognition, language understanding, and navigation in autonomous driving through reinforcement learning. Both increasingly large datasets and increasingly complex models were critical for this success. For example, a neural network recognizing handwritten digits from the MNIST dataset is defined using around 0.7 million parameters; the MobileNet network for the ImageNet problem requires around 27 million parameters, while the BERT network goes up to 345 million parameters to solve Natural Language Processing problems.
Deep Learning became an extremely computationally hungry application at the end of the Moore’s Law era [hennesy2019], when, once again, performance improvement in CPUs and even GPUs is not enough. At the computer arithmetic level, performance can be improved by reducing the bit widths of data and operators, which results in smaller hardware area, shorter memory-access times, and faster computations. However, a trade-off must be found in order to keep DNN inference accurate.
While DNN training is usually performed in Floating-Point (FP) arithmetic using uniform float32 or mixed float32/half [nvidia_mp] precision, inference can be performed in smaller formats, or even in Fixed-Point arithmetic. Several new low-precision FP formats have been suggested by major hardware manufacturers: bfloat16 (Intel [bfloat_Intel], ARM [bfloat16_arm]), DLfloat (IBM [dlfloat]), and MSFP8-11 (Microsoft [msfp]); dedicated hardware is on its way. FPGAs and ASICs offer an even richer design space.
Existing literature provides a large body of research on post-training quantization of DNN parameters, typically down to 8-bit (Google’s TPU) and 4-bit integers [4bitq], or even down to 1 bit, which results in Binary Neural Networks [HubaraCSEB162, HubaraCSEB16]. Models can even be trained directly to have a low-precision representation of weights and biases [Kravchik_2019_ICCV, choi2018, CourbariauxB16c].
However, to the best of our knowledge, the impact of rounding errors due to the precision of the underlying arithmetic is surprisingly missing from the existing literature. While perhaps negligible for close-to-float32 precisions, arithmetic rounding errors in low-precision implementations can potentially grow and impact the network’s classification choice. The experimental studies [bfloat_Intel, bfloat16_arm, bfloatStudy] suggest that DNNs can nevertheless maintain high inference accuracy even with low-precision FP arithmetic. The first contribution of this paper is to shed some theoretical light on why this is the case by taking a computer arithmetic look at DNN layers.
A typical study of the impact of rounding errors in DNNs is based on a comparison with a reference output on a (moderate) set of test data. More formal approaches do exist to analyze the robustness of DNNs with respect to perturbations of input parameters, e.g. the SafeAI project (http://safeai.ethz.ch) [AI2], based on abstract interpretation, or SMT-based approaches [ChangRG19]. However, these tools do not account for FP rounding errors in their analyses. Our second contribution is to provide an interpretation of the impact of a precision choice upon the accuracy of a DNN.
Finally, we present a semi-automatic software tool for precision and accuracy analysis. Our versatile tool has a frontend accepting DNN models from common design frameworks such as Tensorflow/Keras, and is based on a generic error analysis technique, parametrizable by the target FP precision. This feature makes it possible to analyze the behavior of DNNs on a variety of target FP formats. In order to provide both relative and absolute error bounds, we introduce a combination of Affine and Interval Arithmetic.
We start by recalling the basic notions of Deep Neural Networks in Section II. Then, in Sections III and IV, we define our arithmetic toolkit and take a theoretical look at the numerical computations within DNN layers, respectively. We follow with a description of the software tool and numerical experiments in Section V before concluding and discussing future research.
II Deep Neural Networks
The basic computational units of DNNs are neurons, which can be seen as nodes parameterized by a weight $w$ and a bias $b$. Given an input $x$, a neuron computes an output $y = f(wx + b)$, where $f$ is a nonlinear activation function. Typically, the inputs of a neural network are assembled into a vector (e.g. a 32x32 pixel grayscale image is flattened into a 1024x1 vector), hence the per-layer computations are dot products. In the general case, DNNs operate on $n$-dimensional tensors.
Conceptually, DNN layers are divided into the input layer, the output layer and the intermediate, so-called hidden, layers, as illustrated in Fig. 1. A trained DNN model is defined by its topology (number and type of layers) and the learned parameters (weights and biases). Even though a network layer classically comprised a linear computation (e.g. a dot product) and a nonlinear activation function, modern literature often speaks of an “activation layer” as an independent entity. Following this trend, we assume that computational layers are interleaved with activation layers.
Activation layers
There exist a variety of nonlinear activation functions, having different properties, e.g. bounded/unbounded, monotonic, continuously differentiable, etc. Some of the most common activations over vectors are the following ones:

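Sigmoid ($\sigma$): computes
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (1)$$

Hyperbolic Tangent ($\tanh$): simply applies
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \qquad (2)$$
If the input is an $n$-dimensional tensor, all of the above functions are applied elementwise.

Softmax activation ($\mathrm{softmax}$): normalizes the input vector $x$ into a probability distribution over the output classes. It is evaluated via
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}. \qquad (3)$$
If the input is an $n$-dimensional tensor, the activation is applied along each axis separately.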
It should be noted that all of the above functions are bounded, yielding output values in $[-1, 1]$, except for the rectified linear unit (ReLU), which results in $\max(0, x)$, i.e. it maintains the upper bound on the layer’s input while clipping negative values.
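As an illustration, the activations can be written in a few lines of Python. This is a minimal sketch, not the implementation analyzed in this paper: the function names are chosen here, the max-shift inside `softmax` is a common numerical-stability device that the text does not prescribe, and `relu` is assumed to be the unbounded activation that clips negative values.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x):
    """Clips negative values; positive inputs pass through unchanged (assumed ReLU)."""
    return max(0.0, x)

def softmax(xs):
    """Softmax of a list; subtracting max(xs) does not change the mathematical result."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])          # a probability distribution over 3 classes
bounded = [sigmoid(-2.0), math.tanh(-2.0)]  # both lie in [-1, 1]
```

The softmax outputs are non-negative and sum to one up to rounding, which is the boundedness property used later in the error analysis.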
Computational layers
Some of the most common layer types are:

Dense layer: typically accepts an input data vector $x$ and is parameterized by a weight matrix $W$ and a bias vector $b$. The output is computed via $y = Wx + b$. For multidimensional inputs, this operator extends into a tensor product.

Convolution layer (not to be confused with the mathematical definition of convolution): its purpose is to extract feature maps out of data represented as multidimensional arrays through a linear transformation. This layer is parameterized by a convolution kernel that is convolved with the layer input to produce a tensor of outputs. The guide [dumoulin2016guide] offers a comprehensive description of convolution arithmetic for deep learning. For example, an image is given as a 3D tensor of size $h \times w \times c$, where $c$ provides the number of color channels. The convolution kernel is also a 3D tensor. The convolution operation consists of the kernel sliding across the input data; at each location, the product between each element of the kernel and the corresponding input element is computed, and the results are then summed up to obtain a scalar output at the current location. The basic arithmetic operation in convolution layers is, again, the dot product.

Pooling layer: pooling operations reduce the size of feature maps by using some function to summarize subregions, such as taking the average or the maximum value. The basic arithmetic operation in this layer is summation and/or the $\max$ function.

Batch Normalization layer: introduced in [batchnorm], this technique is used to normalize the input of the layer and only then apply a dot product. The idea is to divide the input data into mini-batches and first perform per-batch normalization. Let $m$ be the size of a mini-batch and $x = (x^{(1)}, \dots, x^{(d)})$ be a $d$-dimensional input for the batch normalization layer. Then the normalization is applied to each dimension separately via
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mu^{(k)}}{\sqrt{\left(\sigma^{(k)}\right)^2 + \epsilon}}, \qquad (4)$$
where the mean $\mu^{(k)}$ and the variance $\left(\sigma^{(k)}\right)^2$ are computed per mini-batch and $\epsilon$ is a small parameter. After normalization, the input vector is transformed as follows: $y = \gamma \hat{x} + \beta$, where $\gamma$ and $\beta$ are $d$-dimensional vectors of parameters that are learned during the optimization.
A continuously updated list of computational layers can be found in the Tensorflow/Keras documentation.
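To make the arithmetic content of these layers concrete, the following plain-Python sketch spells out each of them as sums of products. It is illustrative only: the function names, the restriction to a single channel and to "valid" sliding windows, and the default `eps` are assumptions made here, not part of any framework API.

```python
def dense(W, x, b):
    """y = W x + b: one dot product per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def conv2d_valid(img, ker):
    """Single-channel sliding-window product-and-sum (no kernel flip, no padding)."""
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(ker[a][c] * img[i + a][j + c]
                 for a in range(kh) for c in range(kw))
             for j in range(ow)]
            for i in range(oh)]

def max_pool_1d(x, size):
    """Summarizes non-overlapping windows of `size` elements by their maximum."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalizes one feature over a mini-batch, then scales and shifts it."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((v - mean) ** 2 for v in batch) / m
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in batch]

y = dense([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [0.0, 1.0])
fmap = conv2d_valid([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
pooled = max_pool_1d([1, 3, 2, 5], 2)
normed = batch_norm([1.0, 3.0], gamma=1.0, beta=0.0)
```

Every output element is obtained from additions, multiplications, a maximum, or one division and square root, which is why the error analysis of the next section can concentrate on a handful of operators.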
III Arithmetic toolkit
FP operations, such as addition, subtraction, multiplication or more complex elementary functions, cannot be exact in general: floating-point representation uses finite memory. Rounding is hence necessary after almost every operation. These roundings introduce errors into the computation, affecting the final result [Handbook]. For a DNN, this means e.g. that the output class and the attached confidence probability are affected by rounding error. When confidence is low, that error might even have flipped the class output by the DNN. In order to make DNNs rigorous, the overall FP rounding error affecting the results must be analyzed.
The error due to a single FP operation mainly depends on the precision $k$ of the FP format used. For binary FP formats, which we shall focus on in this paper, precision expresses the number of bits held in the format’s mantissa. For example, $k = 24$ for IEEE 754-2019 binary32, and $k = 53$ for IEEE 754-2019 binary64. For an IEEE 754-2019 FP operation with precision $k$ that neither overflows nor underflows, i.e. does not exceed the format’s exponent range, the following holds for rounding-to-nearest [Handbook]: let $u = 2^{-k}$. Then, for every pair of FP inputs $x, y$, there exists $\varepsilon$ with $|\varepsilon| \le u$ such that
$$x \circledast y = (x \ast y)(1 + \varepsilon), \qquad (5)$$
where $\circledast$ is the FP realization of the operation $\ast \in \{+, -, \times, /\}$. A similar bound is available for unary operations such as the square root. This representation of the FP rounding error is also called the first FP error model [High02]. Remark that (5) holds independently of the value of $u$, i.e. independently of the chosen precision $k$. The model therefore allows code to be analyzed parametrically in $u$, so that the precision can then be tailored to an application.
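The first FP error model can be checked empirically. The sketch below assumes that Python floats are IEEE 754 binary64 values rounded to nearest (true on common platforms, but an assumption nonetheless) and uses exact rational arithmetic to measure the relative error of individual operations against the bound $2^{-53}$; the helper name and the sampled operand range are choices made here.

```python
from fractions import Fraction
import operator
import random

K = 53                     # precision of binary64
U = Fraction(1, 2 ** K)    # unit roundoff u = 2^-k

def relative_error(op, x, y):
    """Exact relative error of the FP result op(x, y) w.r.t. the rational result."""
    exact = op(Fraction(x), Fraction(y))   # Fraction(float) is an exact conversion
    return abs(Fraction(op(x, y)) - exact) / abs(exact)

random.seed(0)
worst = Fraction(0)
for _ in range(1000):
    # operands in [1, 2]: no overflow, no underflow, no zero results
    x, y = random.uniform(1.0, 2.0), random.uniform(1.0, 2.0)
    for op in (operator.add, operator.mul, operator.truediv):
        worst = max(worst, relative_error(op, x, y))
```

After the loop, `worst` holds the largest relative error observed over 3000 operations; under the stated assumption it stays below `U`, as eq. (5) predicts.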
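The difficulty in analyzing the accuracy of a given FP code, i.e. the total amount of error affecting a result value, lies in the intricate ways in which the different elementary errors, given by eq. (5), combine. For example, suppose two multiplication operations are executed one after the other. Writing $\otimes$ for FP multiplication and applying eq. (5) twice, with each $\varepsilon_i$ bounded as in eq. (5), we already obtain:
$$(x \otimes y) \otimes z = x y z\,(1 + \varepsilon_1)(1 + \varepsilon_2) = x y z\,(1 + \varepsilon_1 + \varepsilon_2 + \varepsilon_1 \varepsilon_2).$$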
A single scalar product in a DNN’s convolution layer with $n$ inputs uses $n$ multiplications and $n - 1$ additions, each of which results in an error term by application of eq. (5) and also combines with all other error terms in the most intricate way. Manual analysis with this approach is hence completely intractable. Wilkinson and Higham therefore developed more advanced ways of analyzing the combination of elementary errors in FP code [Wilkinson, High02]. However, their analysis requires human insight into the algorithms used, such as addition of FP values, scalar products, etc. When existing code is to be analyzed for accuracy automatically by software, Higham’s approach is hence not usable either.
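A workable approach is found with Affine Arithmetic (AA), as developed e.g. by Putot [putot2011]. Every FP quantity $\hat{x}$ is annotated with a bound $\bar{\varepsilon}_x$ that expresses the quantity’s error with respect to the mathematically ideal, but unknown, quantity $x$ in a similar manner as in eq. (5) above:
$$\hat{x} = x\,(1 + \varepsilon_x), \qquad |\varepsilon_x| \le \bar{\varepsilon}_x. \qquad (6)$$

If $\hat{x}$ stems from an exact quantity in a way that involves only one FP operation, $\bar{\varepsilon}_x$ is set to the unit roundoff $u$ according to (5). Otherwise, when $\hat{z}$ is the result of combining two quantities $\hat{x}$ and $\hat{y}$ with an FP operation, the error terms $\varepsilon_x$ and $\varepsilon_y$ as well as the error term due to the rounding in the operation are combined to yield one single new error term $\varepsilon_z$ for $\hat{z}$ with respect to the ideal result $z$. This combination is specific to each operation type. For example, for addition, with $\oplus$ denoting FP addition and $\varepsilon_\oplus$ its rounding error, we obtain:
$$\hat{z} = \hat{x} \oplus \hat{y} = (\hat{x} + \hat{y})(1 + \varepsilon_\oplus) = (x + y)(1 + \varepsilon_z) \qquad (7)$$
with
$$\varepsilon_z = \alpha\,\varepsilon_x + \beta\,\varepsilon_y + \varepsilon_\oplus\,(1 + \alpha\,\varepsilon_x + \beta\,\varepsilon_y), \qquad \alpha = \frac{x}{x + y}, \quad \beta = \frac{y}{x + y}. \qquad (8)$$

To annotate $\hat{z}$ with a bound $\bar{\varepsilon}_z$ on the $\varepsilon_z$ in (8), we need to

have access to the annotations $\bar{\varepsilon}_x$ and $\bar{\varepsilon}_y$ of the operands,

know a bound on $\varepsilon_\oplus$, which is provided by (5), and

be able to bound the error amplification (or attenuation) quantities $\alpha$ and $\beta$.

For the latter task, bounding $\alpha$ and $\beta$, we use Interval Arithmetic (IA), as we shall explain below. We have detailed only the case of FP addition. Similar error combination rules exist for the other operations that occur in DNNs, such as subtraction, multiplication, division, square root, and elementary functions such as the exponential and the logarithm.

However, the AA approach above is based on a relative error model: the term $\varepsilon_x$ describes the relative error of the FP quantity $\hat{x}$ with respect to the ideal, unknown quantity $x$. Such a relative error bound does not always exist, in which case relative AA breaks down. An easy-to-understand example is when the result of an FP addition cancels out completely, amplifying the incoming errors $\varepsilon_x$ and $\varepsilon_y$ by an infinite amount. In this case, the quantities $\alpha$ and $\beta$ do not stay bounded, as their denominator becomes zero.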
A solution to this issue is to use AA not with a relative error term but with an absolute error term. An FP quantity $\hat{x}$ is labeled with an absolute error bound $\bar{\delta}_x$ such that
$$\hat{x} = x + \delta_x, \qquad |\delta_x| \le \bar{\delta}_x. \qquad (9)$$

For FP addition, the absolute error terms simply add up, together with an additional absolute error term for the rounding of the addition itself. The latter is easily deduced from the relative term given by (5), by multiplying the relative error bound by an upper bound on the absolute value of the exact result. Such an upper bound is easily computed using IA.

As a matter of course, due to the “no free lunch” rule, the ease of use of absolute AA for addition comes with issues for other operators like division, where amplification terms similar to the amplification quantities of relative AA become unbounded and hurt absolute AA in the same way as addition hurts relative AA.
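The contrast between the two error models for addition can be sketched as follows. This is a first-order illustration under assumptions made here, not the tool's implementation: all bounds are expressed in units of the unit roundoff, terms of second order are dropped, operand ranges are plain `(lo, hi)` tuples, and the range computations themselves ignore outward rounding.

```python
import math

def add_relative(ex, ey, x_rng, y_rng):
    """First-order relative error bound of fl(x + y), in units of u.

    ex, ey: relative error bounds of the operands (units of u);
    x_rng, y_rng: (lo, hi) enclosures of the exact operands.
    """
    s_lo, s_hi = x_rng[0] + y_rng[0], x_rng[1] + y_rng[1]
    if s_lo <= 0.0 <= s_hi:                 # the sum may cancel completely
        return math.inf
    s_min = min(abs(s_lo), abs(s_hi))       # smallest possible |x + y|
    alpha = max(abs(x_rng[0]), abs(x_rng[1])) / s_min   # bounds |x / (x + y)|
    beta = max(abs(y_rng[0]), abs(y_rng[1])) / s_min    # bounds |y / (x + y)|
    return alpha * ex + beta * ey + 1.0     # + 1 u for the rounding of the addition

def add_absolute(dx, dy, x_rng, y_rng):
    """First-order absolute error bound of fl(x + y), in units of u."""
    s_max = max(abs(x_rng[0] + y_rng[0]), abs(x_rng[1] + y_rng[1]))
    return dx + dy + s_max                  # incoming errors add; rounding costs |x + y| u

same_sign = add_relative(1.0, 1.0, (1.0, 2.0), (1.0, 2.0))
cancelling = add_relative(1.0, 1.0, (1.0, 2.0), (-2.0, -1.0))
cancelling_abs = add_absolute(1.0, 1.0, (1.0, 2.0), (-2.0, -1.0))
```

For operands of the same sign the relative bound stays small, whereas for operands that may cancel it becomes infinite while the absolute bound remains finite, which is the situation described above.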
Our solution to this issue is to maintain both absolute and relative error “bounds” $\bar{\delta}$ and $\bar{\varepsilon}$ for each quantity and to let them become infinite whenever no such bound exists. Operators like addition, multiplication, division, square root, etc. try to propagate both the absolute and the relative error bounds whenever possible, using the information in both bounds when appropriate. That is, addition and subtraction, which may cancel, propagate the absolute error bound and may yield an infinite relative error bound. Multiplication, division and square root start from the relative error bounds and propagate those. The exponential propagates the incoming absolute error bound as a relative error bound, as in
$$e^{x + \delta} = e^{x}\,(1 + \varepsilon), \qquad \varepsilon = e^{\delta} - 1 \approx \delta.$$
Logarithm does the inverse, transforming a relative error bound into an absolute one. The function , used a lot in DNNs activation layers, can propagate the absolute error with no amplification factor and may propagate the relative error bound with a small amplification factor of whenever . The details of this combined absolute and relative AA (CAA) go beyond the scope of this paper but we may state:
Whenever possible, the proposed CAA improves one bound of a quantity, absolute or relative, using the other. For example, it is often possible to deduce tight relative error bounds from absolute ones when a quantity can be shown never to be zero. Likewise, an absolute error bound is readily deduced from the relative one and an upper bound on the quantity.
In order to do so and, as explained above, to be able to combine the operands’ error terms and to bound absolute elementary errors using eq. (5), bounds for the quantities occurring in a computation must be known. We compute these bounds using Interval Arithmetic (IA) [Moore]. In IA, each quantity is replaced by an interval the quantity can be shown to lie in. Each IA operator working on intervals produces an interval that is guaranteed to encompass all possible results of the operation for operands in the operand intervals. All roundings are performed in such a way, viz. outwards, that this enclosure property is satisfied even in the presence of roundings [Moore].
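A toy interval type illustrates the enclosure property. The sketch is an assumption-laden stand-in for a real IA library such as MPFI: it works on Python floats, requires Python 3.9 or later for `math.nextafter`, and emulates outward rounding by stepping each computed endpoint one float outwards, which is safe for round-to-nearest results but slightly wider than necessary.

```python
import math
from dataclasses import dataclass
from fractions import Fraction

def _down(v):
    """Next float below v: a safe lower endpoint for a round-to-nearest result."""
    return math.nextafter(v, -math.inf)

def _up(v):
    """Next float above v: a safe upper endpoint for a round-to-nearest result."""
    return math.nextafter(v, math.inf)

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __sub__(self, other):
        return Interval(_down(self.lo - other.hi), _up(self.hi - other.lo))

    def __mul__(self, other):
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(_down(min(products)), _up(max(products)))

a, b = Interval(0.1, 0.2), Interval(0.3, 0.4)
s, p = a + b, a * b
# exact rational endpoints of the true result sets, for comparison
exact_sum = (Fraction(0.1) + Fraction(0.3), Fraction(0.2) + Fraction(0.4))
exact_prod = (Fraction(0.1) * Fraction(0.3), Fraction(0.2) * Fraction(0.4))
```

The computed intervals `s` and `p` contain the exact rational endpoints `exact_sum` and `exact_prod`, even though each float operation inside the operators was rounded.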
Both IA and (absolute, relative or combined) AA are plagued by a phenomenon called the decorrelation effect [putot2011]. Consider the following code snippet:

    y = x;
    z = x - y;

While mathematically $z$ will always be zero, as $z = x - y = x - x = 0$, and while even IEEE 754-2019 FP code ensures that z will be zero due to full cancellation of all bits of x and its copy y, IA and AA have no global understanding that $x$ and $y$ are correlated and, actually, equal. So, assuming $x$ to be bounded by the interval $[a, b]$, $z$ will evaluate to the interval $[a - b, b - a]$ instead of the interval $[0, 0]$. For CAA, the issue is that the relative error bound on $z$, instead of becoming zero as the errors on $x$ and $y$ cancel out, becomes infinite due to the detected “catastrophic” cancellation. The absolute error bound for $z$ does not become zero either, but twice the bound on $x$. For all kinds of arithmetic techniques, such as IA or CAA, that only label quantities in code with interval bounds or with absolute and relative error bounds, but do not gain any global understanding of the code, the decorrelation effect has no simple solution. It depends on the application whether the decorrelation effect occurs and whether its consequences are bearable. As we shall see in more detail in Section IV, in code for DNNs, the decorrelation effect does occur, typically in precisely the way illustrated by the code sequence above. It does not occur in its even more intractable forms, for instance when one quantity is computed from another by a function whose Taylor series starts with the identity, so that both quantities are correlated for small arguments.
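The effect is easy to reproduce with naive interval subtraction. In this small sketch the interval $[-1, 1]$ and the helper name `isub` are arbitrary choices made for illustration, and outward rounding is omitted.

```python
def isub(x, y):
    """Interval subtraction [x_lo - y_hi, x_hi - y_lo] (outward rounding omitted)."""
    return (x[0] - y[1], x[1] - y[0])

x = (-1.0, 1.0)   # enclosure of the quantity x
y = x             # y is a copy of x, but plain IA cannot know that
z = isub(x, y)    # evaluates to (-2.0, 2.0) rather than (0.0, 0.0)
```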
For the easy case when two variables in code are correlated because they are copies of one another, as in the example code above, a simple solution to overcome the decorrelation effect in CAA (and IA) exists: all FP quantities analyzed by CAA are labeled with a unique identifier that relates to their moment of creation in the execution of the program. This identifier is never repeated for any other FP quantity, except on assignment, where the identifier does get copied. Subtraction (and division) operations in CAA can then start by checking whether the identifiers of both operands happen to be the same. If they are, both operands are correlated, as they are copies of one another. Exact interval bounds (e.g. $[0, 0]$ for subtraction) and CAA error bounds of zero can then be returned. This solution is crude but addresses all simple decorrelation cases found in DNN code.

Yet another issue with code analysis with CAA and IA comes in the form of if statements depending on FP variables, which are to be analyzed, and, generally, control flow depending on FP variables [putot2011]. FP code might take one branch or the other by evaluating a comparison like x < y to a boolean, which, of course, might come out wrong due to the errors on $x$ and $y$. In contrast, CAA and IA, which replace $x$ and $y$ by whole classes (intervals, resp. abstract approximate quantities with bounded error measured in units of $u$), cannot even evaluate the expression to a unique boolean in cases where the intervals for $x$ and $y$ intersect or where the errors make the boolean answer not unique. Some approaches to this control flow issue have been proposed [fluctuat].
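The identifier-based remedy for copied quantities described above can be sketched as follows. The class name `Tagged`, the global counter and the explicit `copy` method are hypothetical; in particular, Python assignment does not copy objects, so assignment in the analyzed code is modeled here by calling `copy`, which preserves the identifier.

```python
import itertools
from dataclasses import dataclass, field

_fresh_ids = itertools.count()   # source of identifiers, one per newly created quantity

@dataclass(frozen=True)
class Tagged:
    lo: float
    hi: float
    uid: int = field(default_factory=lambda: next(_fresh_ids))

    def copy(self):
        """Models assignment: bounds and identifier are both carried over."""
        return Tagged(self.lo, self.hi, self.uid)

    def __sub__(self, other):
        if self.uid == other.uid:        # operands are copies of one another
            return Tagged(0.0, 0.0)      # exact result, gets a fresh identifier
        return Tagged(self.lo - other.hi, self.hi - other.lo)

x = Tagged(-1.0, 1.0)
y = x.copy()             # same identifier as x
w = Tagged(-1.0, 1.0)    # same bounds, but an unrelated quantity
z_copy = x - y           # correlation detected
z_other = x - w          # no correlation known
```

Only the subtraction of a quantity from its own copy collapses to the point interval; two unrelated quantities with identical bounds still yield the wide result.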
Fortunately, in code for DNNs, this issue is virtually absent. Code for DNNs, in inference mode, does not contain control flow in the form of loops that depend on FP values. In other words, no iterative FP techniques are used. All control flow for loops comes from the DNN’s configuration and the respective dimensions of the manipulated tensors. As we shall see in more detail in Section IV, the only if statements in DNNs that depend on FP values are encountered in activation layers such as pooling or softmax layers. This is due to the very nature of DNNs: in order to make training possible, from a bird’s-eye perspective, DNNs need to represent (nonlinear) functions that are differentiable. Branches would necessarily introduce discontinuities of the derivative. In the code sequences concerned, the if statements serve the sole purpose of computing minima and maxima over vectors of FP values, and hence do not influence the output directly. We solve the control flow issue in a similar way as the decorrelation effect: the point is to provide the CAA and IA arithmetics, which, again, are concerned with local effects, with just enough global insight into the program’s logic. For instance, the quantities analyzed with CAA and IA can be labeled with bounds given in the form of other CAA+IA quantities that are minimum or maximum bounds for them. Subsequent CAA or IA operations, like subtraction, can then exploit the fact that if, for example, a quantity $x$ is upper-bounded by $m$, i.e. $x \le m$, the result of the subtraction $x - m$ will always be bounded above by $0$.

As a matter of course, DNNs that perform classification tasks do contain if statements that depend on FP values. These statements are the ones executed as the very last step, when the one-hot output of a softmax layer [DL] gets translated into the predicted integer class. This code boils down to computing the integer index $\arg\max$ for a vector of FP probability values; the DNN picks the class that is the most probable. However, it is the very aim of this paper to analyze the effects of roundings in the FP arithmetic for the DNN’s inference on the output class, which we discuss in Section IV.
To wrap up, our approach is to analyze DNNs for FP rounding errors using CAA and IA, where each FP quantity in the DNN to be analyzed gets replaced by an arithmetical object containing the following entries:

a unique ID of the quantity, in the form of an integer,

the FP value in the IEEE 754-2019 (or any other) FP format that would be used if the DNN were implemented without this enhanced CAA+IA arithmetic,

an interval holding the actual error of the latter FP value, for reference purposes,

an absolute error bound for this quantity, in units of $u$,

a relative error bound for this quantity, in units of $u$,

an interval safely enclosing all possible values for this quantity if no FP rounding error occurred,

an interval safely enclosing all possible values for this quantity, as it is evaluated with rounding FP arithmetic, and,

optionally, a lower and an upper bound for this quantity. These bounds are given in the form of arithmetical objects of the same nature.
All operators required for DNNs, from assignment, through the basic arithmetic operators, to elementary functions, are overloaded to work on such CAA+IA arithmetical objects, propagating all entries as described above. As a result, a DNN run on an example input, widened with interval bounds for the inputs’ ranges, provides an output in the form of these arithmetical objects, from which errors on probabilities etc. can be read off. As the absolute and relative error bounds are expressed in units of $u$, that same output can be used to tailor a DNN’s FP precision to just the right amount of tolerable final error. We shall describe the use of this arithmetic just below, in Section IV. As for the technical realization of this enhanced CAA+IA arithmetic in an actual software tool, we refer the reader to Section V.
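The entries listed above map naturally onto a record type. The following sketch only mirrors that list; the class and field names are hypothetical, intervals are simplified to `(lo, hi)` tuples, and the helper `abs_error` merely converts a bound in units of the unit roundoff into a number for a chosen precision.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Interval = Tuple[float, float]

@dataclass
class CAAValue:
    uid: int                      # unique identifier, copied on assignment only
    fp_value: float               # value computed in the target FP format
    actual_error: Interval        # enclosure of the actual error, for reference
    abs_err_bound: float          # absolute error bound, in units of u
    rel_err_bound: float          # relative error bound, in units of u (may be inf)
    exact_range: Interval         # enclosure of the value without rounding errors
    fp_range: Interval            # enclosure of the value with FP rounding
    lower: Optional["CAAValue"] = None   # optional lower bound, same kind of object
    upper: Optional["CAAValue"] = None   # optional upper bound, same kind of object

    def abs_error(self, k: int) -> float:
        """Absolute error bound for a k-bit precision, i.e. for u = 2**-k."""
        return self.abs_err_bound * 2.0 ** -k

v = CAAValue(uid=0, fp_value=0.5, actual_error=(0.0, 0.0),
             abs_err_bound=3.0, rel_err_bound=6.0,
             exact_range=(0.5, 0.5), fp_range=(0.5, 0.5))
```

Keeping the bounds in units of the unit roundoff means one analysis run can be reinterpreted for several candidate precisions simply by calling `abs_error` with different values of `k`.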
IV Computer arithmetic look at DNNs
As we shall see in Section V, the enhanced CAA arithmetic we just described is able to automatically analyze given FP code for DNNs and to come up with absolute or relative error bounds, expressed in units of $u$, that are fairly tight and suit their purpose. However, as useful as this automatic analysis might be for application programmers, we wanted to ensure its tightness and validity from a more theoretical standpoint. This section strives to provide this insight. For the sake of brevity, we shall focus on DNNs for classification problems. The analysis is similar for other types of problems. We will present a concrete example of a DNN for a non-classification problem in Section V.
DNNs for classification problems transform high-dimensional input data into an output vector that is a one-hot representation of the class detected for the input. That is, the output vector has as many entries as there are classes, and the $i$-th entry of that vector contains a probabilistic estimate of the DNN’s confidence that the input is in the $i$-th class. That estimate is expressed as a probability; all entries are hence between $0$ and $1$ and sum up to $1$. Post-processing after a DNN picks the class for which the confidence estimate is highest, computing the $\arg\max$ on the output vector. The index of this output class can then be translated into e.g. a textual representation of the class, such as “Cat” or “Dog”.

In the case when the maximum confidence estimate is at $0.5$ and the second-to-maximum confidence estimate is also at $0.5$, the slightest change to these output values will of course make the DNN commit a misclassification, outputting e.g. “Dog” when the input represents a “Cat”. Such a slight change may stem from FP rounding errors. For DNN input data where maximum confidence is at $0.5$, no FP arithmetic, other than exact arithmetic, exists that avoids misclassifications. However, when external knowledge on the DNN exists that guarantees that, on all possible inputs (for a reasonable definition of what a possible input is), the DNN will output a one-hot vector with a top-1 value of at least $p > 0.5$, it can be guaranteed that the second-to-maximum, top-2, value will be at most $1 - p$, leaving a margin of $(2p - 1)/2$ for each of the maximum and second-to-maximum entries to be affected by FP rounding error. Gaining this external knowledge is beyond the scope of this article, but approaches like SafeAI [AI2] seem to be able to provide it. This external minimum bound $p$ may also just be specified, accepting a certain percentage of misclassifications.
From a computer arithmetic perspective, we may hence assume that there is an absolute FP error margin of $(2p - 1)/2$ available for each element in the output vector, where $p$ is the minimum bound on the top-1 value established with external knowledge. Similarly, we may assume that a relative FP error margin is available. Our job is to rigorously ensure that no misclassification may occur given that error margin. Hence we need to choose the FP precision $k$, resp. $u$, in such a way that we can guarantee that the DNN’s inference is accurate enough for the FP rounding error not to exceed the margins. We may hence start climbing back up the DNN’s FP algorithm from its end, with that margin as a kind of FP error budget to be spent on FP roundings.
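The margin argument amounts to a few lines of arithmetic. The sketch below assumes a guaranteed top-1 confidence $p > 1/2$; the function names are hypothetical, and the example values are chosen here so that all intermediate results are exactly representable.

```python
def absolute_margin(p):
    """Per-entry absolute error margin when the top-1 confidence is at least p.

    All entries sum to 1, so the top-2 entry is at most 1 - p and the gap
    between top-1 and top-2 is at least 2p - 1; each may move by half of it.
    """
    if not 0.5 < p <= 1.0:
        raise ValueError("a guaranteed top-1 confidence above 1/2 is assumed")
    return (2.0 * p - 1.0) / 2.0

def ranking_preserved(outputs, err):
    """True if an absolute error of at most err per entry cannot swap top-1 and top-2."""
    first, second = sorted(outputs, reverse=True)[:2]
    return first - err > second + err

margin = absolute_margin(0.75)                             # 0.25
safe = ranking_preserved([0.75, 0.25, 0.0], 0.2)           # True: 0.55 > 0.45
unsafe = ranking_preserved([0.75, 0.25, 0.0], margin)      # False: errors may produce a tie
```

The last line shows that the margin itself is the limiting case: the error has to stay strictly below it for the classification to be guaranteed.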
As their last layer, most classification DNNs have a softmax layer, as defined in Section II. We must hence analyze the FP error in the output of a softmax layer. This analysis will also serve as an illustration of error analysis for the different layers; we analyzed all layers but, for the sake of brevity, shall only report on the softmax layer. The error in the output of a layer has two sources: (1) the FP rounding errors committed during the layer’s evaluation and (2) the errors present in the input to the layer, propagated in an amplified or attenuated manner by the layer. The analysis of the first kind of errors, the rounding errors, is trivial, as the softmax function requires just the evaluation of exponentials, a division (often implemented as a division of a logarithm) and the summation of positive values, obtained by exponentiating the input. We shall hence not address this point any further.
For the propagated error of a softmax layer, the following analysis can be performed; herein, the $\hat{x}_i$ are the elements of the computed input vector, affected by an absolute error $\delta_i$ and approximating the unknown, ideal $x_i$. The $\hat{y}_i$ are the output vector elements, approximating the unknown, ideal $y_i$.
(10)  
with
This quantity is easily bounded with
With some mild assumptions on bounds for the and , it can further be shown that the relative error affecting is bounded by
(11) 
essentially by taking the Taylor development of .
This analysis therefore shows the following: the softmax layer transforms the absolute error in its input into a relative error of approximately the same amount, up to a small constant factor, in its output. Our relative margin at the output hence becomes an absolute error margin at the input of the softmax layer. This input is in general the output of a convolutional layer, where the arithmetical difficulty lies in a summation, which behaves very well when it only needs to satisfy an absolute error bound. Remarkably, the bound given above does not depend at all on the number of elements of the input and output vectors.
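The size-independence of the propagated error can be observed numerically. The sketch below does not reproduce bound (11); it checks a bound derived here from the definition of softmax: if every input is perturbed by at most $\bar{\delta}$ in absolute value, both $e^{\delta_i}$ and the ratio of the two normalizing sums lie in $[e^{-\bar{\delta}}, e^{\bar{\delta}}]$, so each output changes by a relative amount of at most $e^{2\bar{\delta}} - 1 \approx 2\bar{\delta}$. The vector sizes, input range and $\bar{\delta} = 2^{-6}$ are arbitrary choices, and the experiment runs in binary64, whose own rounding errors are negligible at this scale.

```python
import math
import random

def softmax(xs):
    m = max(xs)                              # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_relative_error(xs, deltas):
    """Largest relative change of any softmax output under input perturbations."""
    exact = softmax(xs)
    perturbed = softmax([x + d for x, d in zip(xs, deltas)])
    return max(abs(p - e) / e for p, e in zip(perturbed, exact))

random.seed(1)
delta_bar = 2.0 ** -6                        # bound on the absolute input error
bound = math.expm1(2.0 * delta_bar)          # e^(2 delta_bar) - 1, independent of n
observed = []
for n in (3, 10, 1000):
    xs = [random.uniform(-4.0, 4.0) for _ in range(n)]
    ds = [random.uniform(-delta_bar, delta_bar) for _ in range(n)]
    observed.append(max_relative_error(xs, ds))
```

The values collected in `observed` stay below `bound` for all three vector lengths, in line with the observation that the number of elements does not enter the bound.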
In order to illustrate this stability argument more intuitively, let us give a numerical example: let
, i.e. the classifying DNN shows at least
confidence for the best output class. Then , meaning that FP results with about valid bits are sufficient. Then a maximum elementwise absolute error of is still tolerated on the input of the softmax layer. This means that for a convolution or dot product, i.e. a summation, fixed-point arithmetic with a quantization unit of , i.e. about , is enough. FP arithmetic can only do better: its precision is at least these bits, provided the inputs to the summation are bounded around .

The boundedness of the values manipulated by DNNs is something we have already stated in Section II. Most activation layers bound their outputs to , equivalent to a bound . Convolutional or fully-connected layers also exhibit fairly small bounds , which are easily established and, by the way, perfectly bounded with IA [Shary], considering the boundedness of the DNN’s coefficients, the relatively small dimensions of the manipulated vectors, matrices and tensors, and the boundedness of the preceding input.
It is hence not surprising at all to observe that

DNN inference behaves very well for FP arithmetic with low precision,

DNN inference behaves very well for FP arithmetic with low exponent range, as fixed-point arithmetic already provides enough accuracy for the subsequent layers, such as softmax,

and analysis with CAA, which exhibits small FP error bounds in output, does provide tight and sensible results.
Table I
Network | Max. absolute error (in units of u) | Max. relative error (in units of u) | Analysis time | Required precision to prevent misclassification
Digits | | | 12 s per class |
MobileNet | | | 4.2 h per class |
Pendulum | | – | 100 ms | –
V CAA-based FP Error Analysis and Experimental Results
We implemented our semi-automatic FP accuracy analysis tool building upon a combination of existing software packages for DNNs, such as frugally-deep (https://github.com/Dobiasd/frugally-deep), which we patched heavily. We coupled these packages with a C++ implementation of the enhanced CAA that we have described in Section III. This C++ implementation of CAA was written from scratch. The implementation is currently based on IA provided by MPFI 1.5.3, which is itself based on MPFR 4.0.2 on top of GMP 6.1.2 [MPFI, MPFR, GMP]. However, we wrapped MPFI in a C++ façade class in order to facilitate a later transition to other IA libraries. We use g++ version 8.3.0 to compile our code. The frugally-deep library we are using requires the use of C++ in its C++20 version [CPP20]. Our contribution in terms of code consists of C++ classes that implement CAA as we have described it and of the patches required to allow for binding and use of that CAA arithmetic instead of plain IEEE 754-2019 arithmetic in frugally-deep. Our workflow runs only with our version of frugally-deep, not with a stock version. Thanks to frugally-deep, our semi-automatic FP accuracy analysis tool is compatible with almost all DNNs as they are designed and trained with Tensorflow/Keras [Tensorflow, Keras]. The frugally-deep package first converts DNN models to JSON files, and then provides C++ header classes that allow loading of JSON files as object graphs that can be evaluated on the input data. The frugally-deep library leverages several other C++ libraries for this task, the most prominent of which is Eigen [Eigen], which permits binding of custom arithmetic.
Our CAA class structure consists of three classes: a façade class for the frontend binding with frugally-deep; a class that implements the CAA arithmetic and overloads all necessary arithmetic operations; and a backend wrapper for IA. We did not use existing C++ wrappers for MPFI, so that we could exploit the performance advantage of modern C++ features such as move constructors and move assignment operators, which have been available since C++11.
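The layering described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the C++ implementation: the names `Interval`, `_as_interval` and `affine` are hypothetical, the sketch covers only the IA backend and the operator overloading (not the CAA error terms), and outward rounding is approximated by widening each bound by one ulp with `math.nextafter` (Python 3.9 or later), whereas an IA library such as MPFI uses directed rounding.

```python
import math


class Interval:
    """Closed interval [lo, hi]; hypothetical stand-in for an IA backend."""

    def __init__(self, lo, hi=None):
        self.lo = float(lo)
        self.hi = float(lo if hi is None else hi)
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    @staticmethod
    def _outward(lo, hi):
        # Python rounds each float operation to nearest, so widening both
        # bounds by one ulp keeps the exact result inside the interval.
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        other = _as_interval(other)
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    __radd__ = __add__

    def __mul__(self, other):
        other = _as_interval(other)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval._outward(min(products), max(products))

    __rmul__ = __mul__

    def contains(self, x):
        return self.lo <= x <= self.hi

    def __repr__(self):
        return f"Interval({self.lo!r}, {self.hi!r})"


def _as_interval(x):
    # Promote plain numbers to point intervals.
    return x if isinstance(x, Interval) else Interval(x)


def affine(weights, inputs, bias):
    # Generic layer code: it only relies on + and *, so it runs unchanged
    # on floats or on Interval objects.
    acc = _as_interval(bias)
    for w, x in zip(weights, inputs):
        acc = acc + w * x
    return acc


# One neuron with two inputs, each known to lie in [0, 1].
y = affine([0.5, -2.0], [Interval(0.0, 1.0), Interval(0.0, 1.0)], 0.25)
print(y)
```

Because the layer code uses only overloaded operators, replacing the backend means replacing a single class, which is the role the façade plays in the class structure described above.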
Our workflow runs as follows. Using frugally-deep, we construct a C++ program that loads a DNN model designed in TensorFlow/Keras, as well as the input data, expressed as CAA objects. The bounds on these data are trivial in most cases, e.g. image data is annotated as 8-bit unsigned values, i.e. integers between 0 and 255. We run the resulting program for every possible class in order to cover all possible control flows. One representative input per class suffices; no additional tests are required. The program outputs the inference result together with the absolute and relative error bounds on it. The error bounds are all given in units of , an upper bound on which is user-configurable. The output error bounds can then be used to tailor the DNN's actual FP arithmetic by applying the theory described in Section IV, determining the value of such that the required accuracy bounds are still met.
We demonstrate our tool on several examples of DNNs and give some results in Table I.
Digits
We built a simple DNN for the recognition of handwritten digits and trained it on the MNIST dataset [lMNIST]. This model requires around 0.7 million parameters and consists of three , two and a layer. As input, it takes 28×28 grayscale images (i.e. a flattened vector of length 784) and has a 10-dimensional output vector whose i-th element indicates the probability that the input image is the digit i. Table I reports the results of the analysis, where the maximum absolute and relative errors denote the maximum errors over all possible classes. We also observed that on the top-1 choice the relative error bounds are quite tight, while on the other elements they are looser. However, the bound (11) still holds and, in any case, the absolute error stays low. Our analysis shows that the network can safely run with 7-bit precision FP arithmetic.
MobileNet
We used a Keras pretrained model for this considerably bigger network for Imagenet classification. MobileNet requires around 27 million parameters. The complete architecture can be found in the Keras documentation; we only state here that it is a Convolutional Neural Network with 28 layers, 27 layers followed by activations and a layer that classifies RGB images over 1000 classes. This challenging example revealed a performance bottleneck in our tool: analyzing the model for one class took around 4 hours on a conventional laptop. Our performance analysis determined that most of the analysis time was spent on memory allocation deep inside MPFI. Regardless of the analysis time, the tool successfully showed that our analysis techniques compute tight error bounds even for large-scale models.
Pendulum
This small neural network model comes from the context of reinforcement learning applied to autonomous control [ChangRG19] and aims at approximating a Lyapunov function for a nonlinear controller. The article [ChangRG19] proposes a new methodology for certified approximation of neural controllers based on SAT theory. This example is interesting from the formal verification point of view: our bound on the absolute error can be incorporated into the existing verification procedure with little effort. This network has two layers and two activations. It takes a coordinate vector as input and, as in [ChangRG19], we tested it for the interval . Our tool provided an absolute error bound in a fraction of a second. A relative error bound does not exist since the output interval contains zero.
VI Conclusion and Perspectives
In this work we proposed a semi-automated way to bound and interpret the impact of rounding errors due to the precision choice for inference in generic DNNs. We presented a software tool that, thanks to the frugally-deep library, can receive any TensorFlow/Keras model in the frontend. We support the most common activation and computational layers.
The backend that we developed automatically computes and propagates rounding errors through the computations. For this, we have introduced a combination of Affine and Interval Arithmetics called CAA. This new construction allowed us to compute both relative and absolute error bounds. Our implementation of CAA is based on rigorous error analysis for arithmetic operations, as well as for all the necessary elementary functions for activation layers (e.g. , ). We enhance CAA with just enough global insight into the program's logic to fight the decorrelation effect and to handle control flow that depends on FP variables. The software implementation of the arithmetic is done in C++ in a generic way.
We offer a computer arithmetic look at the computations in DNNs. When analyzing the computations within activation layers, we establish that activation functions actually transform absolute errors on their inputs into relative errors on their outputs. This means that even if the computational layers yield relatively imprecise results, activation layers will recover decent relative errors, as long as inputs are bounded (which is basically always the case thanks to bounded activation functions and normalizations).
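As a concrete illustration of this effect, consider the logistic sigmoid, chosen here only as an example activation. Since the derivative of ln σ(x) equals 1 − σ(x), which lies in (0, 1), an absolute perturbation δ of the input changes ln σ by at most |δ|, so the relative error of the output is at most exp(|δ|) − 1, independently of x. The Python sketch below checks this numerically; the function names are hypothetical and the argument is a simplified stand-in for the rigorous analysis, not a reproduction of it.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relative_error_bound(delta):
    # |ln(sigmoid(x + delta)) - ln(sigmoid(x))| <= |delta| for every x,
    # hence the relative output error is at most exp(|delta|) - 1.
    return math.expm1(abs(delta))


def observed_relative_error(x, delta):
    exact = sigmoid(x)
    return abs(sigmoid(x + delta) - exact) / exact


# The input error is absolute; the resulting output error is relative
# and stays below the same bound for small and large outputs alike.
delta = 1e-3
for x in (-20.0, -5.0, 0.0, 5.0, 20.0):
    print(x, observed_relative_error(x, delta), relative_error_bound(delta))
```

For large positive x the observed relative error is far below the bound, because the sigmoid saturates; for large negative x it approaches the bound, because the sigmoid then behaves like exp(x).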
Finally, we offered the first, to the best of our knowledge, interpretation of the impact of the precision choice upon the top-1 accuracy of classification networks. If we reason that the goal is to preserve the top-1 choice w.r.t. the reference model, we can establish the minimum required precision as a function of the deduced error bound and the distance between the top-1 and top-2 choices.
This reasoning reflects the fact that, as long as the model was well-trained for some classification problem and can clearly distinguish between classes, the network is extremely robust to low-precision evaluation.
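The reasoning above can be sketched in a few lines of Python. The criterion used here is a simplification assumed for illustration and is not the bound derived in Section IV: every output is assumed to carry an absolute error of at most E units, one unit is assumed to equal 2^(-k) for a precision of k bits, and the top-1 choice is considered preserved when twice the error is smaller than the gap between the top-1 and top-2 outputs. The function name and the numbers in the usage line are hypothetical.

```python
import math


def min_precision_bits(error_bound_in_units, top1, top2):
    """Smallest k with 2 * error_bound_in_units * 2**-k < top1 - top2.

    Assumes (for this sketch only) that one error unit equals 2**-k and
    that both outputs may move by the full error in opposite directions.
    """
    gap = top1 - top2
    if gap <= 0.0:
        raise ValueError("top-1 output must be strictly larger than top-2")
    if error_bound_in_units < 0.0:
        raise ValueError("error bound must be non-negative")
    ratio = 2.0 * error_bound_in_units / gap
    k = 1
    # Increase k until 2**k exceeds the ratio, i.e. until the perturbed
    # top-1 output is guaranteed to stay above the perturbed top-2 output.
    while math.ldexp(1.0, k) <= ratio:
        k += 1
    return k


# Hypothetical values, not results from Table I.
print(min_precision_bits(10.0, 0.9, 0.1))  # prints 5
```

The sketch makes the dependence explicit: halving the gap between the two leading outputs, or doubling the error bound, costs one additional bit of precision.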
We identify several axes of future improvement for this work. The first is to improve the tool's performance, which so far does not scale up to models with tens of millions of parameters. We identified the performance bottleneck to be the memory management in MPFI. The solution would be to replace the underlying IA implementation with a faster one. In addition, the manually coded error analysis is of course error-prone from the formal verification point of view. Another limitation of the proposed tool is that it analyzes one implementation, the one produced by the frugally-deep library. In order to support other implementations, e.g. using Kahan summation instead of straightforward summation, a corresponding code generation phase needs to be added.
The second improvement concerns mixed-precision implementations, as proposed by NVIDIA [nvidia_mp], which can be achieved by removing the global and parameterizing the error analysis with the input/output precision. To go further, we would like to combine our results with a static analysis and mixed-precision tuning tool like Daisy [Daisy] to accelerate specific parts of DNN models. Extension towards the training of DNNs is nontrivial and requires an analysis of gradient descent algorithms. Finally, [Cohen644658] proposes to relate the classification capacity of DNNs to the geometry of the object manifolds produced after each layer. By combining our error analysis with the quantitative "separability" measure from [Cohen644658], we hope to back up our interpretation of the relation between precision and accuracy for DNNs with a solid theoretic-geometrical basis.