
Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ODEs

Although double-precision floating-point arithmetic currently dominates high-performance computing, there is increasing interest in smaller and simpler arithmetic types. The main reasons are potential improvements in energy efficiency and memory footprint and bandwidth. However, simply switching to lower-precision types typically results in increased numerical errors. We investigate approaches to improving the accuracy of lower-precision arithmetic types, using examples in an important domain for numerical computation in neuroscience: the solution of Ordinary Differential Equations (ODEs). The Izhikevich neuron model is used to demonstrate that rounding has an important role in producing accurate spike timings from explicit ODE solution algorithms. In particular, stochastic rounding consistently results in smaller errors compared to single-precision floating-point and fixed-point arithmetic with round-to-nearest across a range of neuron behaviours and ODE solvers. A computationally much cheaper alternative is also investigated, inspired by the concept of dither that is a widely understood mechanism for providing resolution below the LSB in digital signal processing. These results will have implications for the solution of ODEs in other subject areas, and should also be directly relevant to the huge range of practical problems that are represented by Partial Differential Equations (PDEs).



1 Introduction and motivation

64-bit double-precision floating-point numbers have been the accepted standard for numerical computing for a long time because they largely shield the user from any concerns over numerical range, accuracy and precision. However, they come at a significant cost. Not only does the logic required to perform an arithmetic operation such as multiply-accumulate on this type consume an order of magnitude more silicon area and energy than the equivalent operation on a 32-bit integer or fixed-point type, but each number also requires double the memory storage space and double the memory bandwidth (and hence energy) each time it is accessed or stored. Today, energy efficiency has become a prime consideration in all areas of computing, from embedded systems through smartphones and data centres up to next-generation exascale supercomputers. The pressure to pursue all avenues to improved energy efficiency is intense, causing all assumptions, including those regarding optimum arithmetic representations, to be questioned.

Nowhere is this pressure felt more strongly than in machine learning, where the recent explosion of interest in deep and convolutional nets has led to the development of highly effective systems that incur vast numbers of computational operations, but where the requirements for high precision are limited. As a result, reduced-precision types, such as fixed-point real types of various word lengths, are becoming increasingly popular in architectures developed specifically for machine learning. Where high throughput of arithmetic operations is required, accuracy is sacrificed by reducing the working numerical type from floating- to fixed-point and the word length from 64- or 32-bit to 16- or even 8-bit precision [Jouppi:2017:IPA:3079856.3080246]. By reducing the word length, precision is sacrificed to gain the advantage of a smaller memory footprint and smaller hardware components and buses through which the data travels from memory to the arithmetic units. On the other hand, changing the numerical type from floating- to fixed-point significantly reduces the range of representable values, increasing the potential for under/overflow of the data types. In some tasks these range issues can be dealt with safely by analysing the algorithm in various novel or problem-specific ways, whereas in other tasks it can be very difficult to generalise effectively across all use cases [KabiB2017AnOverflowFree].

A 32-bit integer adder uses 9x less energy [6757323] and 30x less area [hw-machine-learning-reduced-precision-slides] than a single-precision floating-point adder. As well as reducing the precision and choosing a numerical type with a smaller area and energy footprint, mixed-precision arithmetic [micikevicius-mixed-precision], stochastic arithmetic [Gupta:2015:DLL:3045118.3045303] and approximate arithmetic [6569370] have also been explored by many researchers in the machine learning community with the goal of further reducing the energy and memory requirements of accelerators. Mixed-precision and stochastic techniques help to address the precision loss that occurs when short word length numerical types are chosen for the inputs and outputs of a hardware accelerator, and we explore both in the present work. Mixed-precision arithmetic keeps some internal calculations in different formats from the input and output data, whereas stochastic arithmetic utilises probabilistic rounding to balance out the errors in the conversions from longer to shorter numerical types. These approaches can also be combined. Approximate arithmetic (or more generally approximate computing) is a recent trend which adds some level of (application-tolerant) error at the software level or inside the arithmetic circuits to reduce energy and area.

Interestingly, similar ideas have been explored in other areas, with one notable example being at the birth of the digital audio revolution where the concept of dither became an important contribution to understanding the ultimate low-level resolution of digital systems [vanderkooy1987dither, vanderkooy1989digital]. Practical and theoretical results show that resolution is available well below the least significant bit if such a system is conceived and executed appropriately. Although the application here is different, the ideas are close enough to be worth mentioning and we show results with dither added at the input to an ODE. Similarly, an approximate dithering adder, which alternates between directions of error bounds to compensate for individual errors in the accumulation operation, has been reported [6386754].

Our main experiments are run on the Spiking Neural Network (SNN) simulation architecture SpiNNaker. SpiNNaker is based on a digital 18-core chip designed primarily to simulate sparsely connected large-scale neural networks with weighted connections and spike-based signalling - a building block of large-scale digital neuromorphic systems that focuses on flexibility, scaling and low energy use [spinnproject, 10.3389/fnins.2018.00816]. At the heart of SpiNNaker lie standard ARM968 CPUs. Because of the low power requirements of such a large-scale system, the ARM968 was specifically chosen as it provides an energy-efficient but integer-only processor core. Therefore, fixed-point arithmetic is used to handle real numbers unless the significant speed and energy penalties of software floating-point libraries can be tolerated.

While many of the neural models that appear in the neuroscience literature can be simulated with adequate accuracy using fixed-point arithmetic and straightforward algorithms, other models pose challenges and require a more careful approach. In this paper we address some previously published work with the Izhikevich neuron model [izhikevic2003simple] solved using fixed-point arithmetic and demonstrate a number of improvements.

The main contributions of this paper are:

  • The accuracy and efficiency of ODE solvers for the Izhikevich neuron equations are assessed using different rounding routines applied to algorithms that use fixed-point types. This builds on earlier work where the accuracy of various fixed-point neural ODE solvers was investigated for this model [Hopkins2015], but where the role of rounding was not addressed.

  • Fixed-point ODE solvers with stochastic rounding are shown to be more robust than those with other rounding algorithms, and are shown experimentally to be more accurate than single-precision floating-point ODE solvers (Section 5).

  • The application of dither at the input to the ODE is shown to be a simple and cheap but, in many cases, effective alternative to a full stochastic rounding mechanism.

  • Eight bits in the residual and the random number are shown to be sufficient for a high-performance stochastic rounding algorithm when used in an ODE solver (Section 5).

  • GCC implements rounding down in fixed-point arithmetic by default. This includes the rounding of constants in decimal to fixed-point conversion, as well as rounding in multiplication. This had an impact on the constants and calculations used within the ODE solutions in our previous work [Hopkins2015] which detracted from the accuracy of that work. We show how this can be improved upon.

2 Background

2.1 Arithmetic types used in this study

There are 8 arithmetic types under investigation in this study. Two are floating-point types standardised by the IEEE and widely available on almost every platform; the other six are fixed-point types defined in the relatively recent ISO standard 18037 [iso18037] and supported by the GCC compiler:

  • IEEE 64-bit double precision floating-point: double

  • IEEE 32-bit single precision floating-point: float

  • ISO 18037 s16.15 fixed-point: accum

  • ISO 18037 u0.32 fixed-point: unsigned long fract or ulf

  • ISO 18037 s0.31 fixed-point: signed long fract or lf

  • ISO 18037 s8.7 fixed-point: short accum

  • ISO 18037 u0.16 fixed-point: unsigned fract or uf

  • ISO 18037 s0.15 fixed-point: signed fract or f

These types all come with default arithmetic operations. The combination of these types and operations (rounded in various ways where appropriate) are the space that we explore in this paper.

2.2 Fixed-Point Arithmetic

Figure 1: We define a fixed-point number made out of three parts: the most significant bit is a sign bit 's', followed by an integer part with i bits and a fractional part with f bits. The word length n = s + i + f.

A generalized fixed-point number can be represented as shown in Figure 1. We define a signed fixed-point number si.f and an unsigned fixed-point number ui.f of word length n (which usually is 16 or 32), where i denotes the number of integer bits, f denotes the number of fractional bits and s is a binary value depending on whether a specific number is signed or unsigned (s = 0 means that there is no sign bit at all and s = 1 means there is a sign bit and we then have to use 2's complement in interpreting the numbers). f also tells us the precision of the representation. Given a signed fixed-point number si.f, the range of representable values is [-2^i, 2^i - 2^-f]. Whereas, given an unsigned fixed-point number ui.f, the range of representable values is [0, 2^i - 2^-f]. To measure the accuracy of a given fixed-point format (or more specifically of some function that works in this format, when we want to know how accurately it performs in a given precision), we define machine epsilon ε = 2^-f. ε gives the smallest positive representable value in a given fixed-point format and therefore represents a gap or a step between any two neighbouring values in the representable range. Note that this gap or step is absolute across the representable range of values and is not relative to the exponent as in floating-point or similar representations. This requires us to consider only absolute error when measuring the accuracy of functions that are implemented in fixed-point representation. Accuracy is sometimes also measured as the ULP - the value represented by the least significant bit in a fixed-point word, which is the same as machine epsilon. Note that the maximum error of a correctly rounded fixed-point number is ε/2 = 2^-(f+1).

Lastly, it is worth noting how to convert a general fixed-point number into a decimal number. Given a 2's complement fixed-point number of radix 2 (binary vector) x = (x_{n-1} x_{n-2} ... x_1 x_0), if the number is not signed (the sign bit s therefore becomes an additional integer bit), its decimal value is:

    x = 2^-f × Σ_{k=0}^{n-1} x_k 2^k.

If the number is signed, the top bit carries negative weight and the decimal value is calculated as:

    x = 2^-f × (-x_{n-1} 2^{n-1} + Σ_{k=0}^{n-2} x_k 2^k)

(equivalently, for a negative number one can first invert all bits and add 1, apply the unsigned formula and negate the result).

SpiNNaker software mostly uses two fixed-point formats available in the GCC implementation: accum, which is s16.15, and long fract, which is s0.31. The values close to zero for each format are shown in Figure 2 and Figure 3. The accum type has a range of representable values of [-65536, 65536 - 2^-15] with a gap between them of 2^-15 ≈ 3.05 × 10^-5. The long fract type has a range of [-1, 1 - 2^-31] with a gap of 2^-31 ≈ 4.66 × 10^-10 between neighbouring values. As can be seen from the graphs and the values of the machine epsilon, long fract is a more precise fixed-point representation. However, long fract has a very small range of representable values compared to accum. Which format to use depends on the application requirements, such as the precision and bounds of all variables.
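As a concrete illustration of the conversion above, the raw two's-complement bit pattern of an accum (s16.15) or unsigned long fract (u0.32) word maps to its decimal value by dividing the integer payload by 2^f. A minimal C sketch (the function names are ours, not part of the ISO 18037 API):

```c
#include <stdint.h>

/* Value of a raw 32-bit word interpreted as s16.15 (accum): the signed
   integer payload divided by 2^15; 2's complement supplies the sign. */
double s1615_to_double(int32_t raw) {
    return (double)raw / 32768.0; /* 32768 = 2^15 */
}

/* Value of a raw 32-bit word interpreted as u0.32 (unsigned long fract):
   the unsigned payload divided by 2^32. */
double u032_to_double(uint32_t raw) {
    return (double)raw / 4294967296.0; /* 4294967296 = 2^32 */
}
```

For example, the s16.15 bit pattern 0x00008000 reads as 1.0, and 0xFFFF8000 (two's complement) reads as -1.0.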

Figure 2: Values close to zero in the accum representation.
Figure 3: Values close to zero in the long fract representation.

Three main arithmetic operations on fixed-point numbers can be defined as:

  • Addition: c = a + b.

  • Subtraction: c = a - b.

  • Multiplication: c = (a × b) >> f.

Note that +, - and × denote integer operations available in most processors' Arithmetic-Logic-Units, including the ARM968. Therefore, for addition and subtraction, no special treatment is required to implement these in fixed-point. However, for multiplication, if the operands a and b have the word lengths n_a and n_b, then the result will have the word length n_a + n_b. Therefore, after the integer multiplication we have to shift the result right by f bits to obtain the result in the same fixed-point representation as the operands (the example in Figure 4 shows this for the accum type). This shifting results in a loss of precision and therefore a rounding routine has to be chosen to minimize the error.

Figure 4: Example of multiplication of two accum type variables. The shaded part is the 32-bit result that has to be extracted from the 64-bit product, discarding the bottom 15 and top 17 bits.
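The accum multiplication of Figure 4 can be sketched in portable C by widening to a 64-bit intermediate and shifting right by the number of fractional bits. This truncating version corresponds to the round-down behaviour discussed below; we assume an arithmetic right shift of negative values, which GCC provides on the relevant targets:

```c
#include <stdint.h>

/* Multiply two s16.15 (accum) values held as raw 32-bit integers.
   The 64-bit product has 30 fractional bits; shifting right by 15
   returns to the s16.15 format, discarding the bottom 15 bits. */
int32_t s1615_mul_rd(int32_t a, int32_t b) {
    int64_t full = (int64_t)a * (int64_t)b; /* exact s33.30 product */
    return (int32_t)(full >> 15);           /* truncate (no overflow check) */
}
```

A production version would also saturate when the product falls outside the s16.15 range; that is omitted here for clarity.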

2.3 Rounding

We define two rounding approaches that we have used to increase the accuracy of the fixed-point multiplication: round-to-nearest (henceforth called RTN) and stochastic rounding (henceforth called SR) [Gupta:2015:DLL:3045118.3045303, HOHFELD1992291], the latter named stochastic due to its requirement for random numbers. Given a high-precision value x, an output fixed-point format with f fractional bits and machine epsilon ε = 2^-f to round the value to, and defining ⌊x⌋ as the truncation operation (cancelling a number of bottom bits, leaving f fractional bits) which returns a number in the output format less than or equal to x, round-to-nearest is defined as:

    RTN(x) = ⌊x⌋       if 0 ≤ x - ⌊x⌋ < ε/2,
    RTN(x) = ⌊x⌋ + ε   if x - ⌊x⌋ ≥ ε/2.
Note that we chose to implement rounding-up on the tie because this results in a simple rounding routine that requires checking only the Most Significant Bit (MSB) of the truncated part to make a decision of rounding up or down. Other tie breaking rules such as round-to-even can sometimes yield better results, but we preferred the cheaper tie-breaking rule in this work.
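In raw-integer terms the rule above amounts to adding half an LSB of the result format before the truncating shift, which has the same effect as inspecting the MSB of the discarded bits. A sketch for the s16.15 multiply (our naming):

```c
#include <stdint.h>

/* s16.15 multiply with round-to-nearest, ties rounding up: adding
   2^14 (half of the 2^15 result LSB within the s33.30 product) before
   the shift is equivalent to checking the MSB of the truncated part. */
int32_t s1615_mul_rtn(int32_t a, int32_t b) {
    int64_t full = (int64_t)a * (int64_t)b;
    return (int32_t)((full + (1 << 14)) >> 15);
}
```

For instance, multiplying the smallest positive accum (raw 1, i.e. 2^-15) by 0.5 leaves exactly half an LSB in the remainder, so the tie rounds up where plain truncation would return zero.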

Similarly to RTN, we define SR. Given all the values as before and additionally a random value P, drawn uniformly from [0, 1), SR is defined as (adapted from [Gupta:2015:DLL:3045118.3045303]):

    SR(x) = ⌊x⌋ + ε   if P < (x - ⌊x⌋)/ε,
    SR(x) = ⌊x⌋       otherwise,

so that x is rounded up with probability proportional to the truncated remainder.
Lastly, we will denote the cases where rounding is not carried out as rounding-down (henceforth called RD) - this is achieved by truncating/shifting (right) a 2's complement binary number.
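The SR decision can be sketched as a pure function over the full-precision product, taking the uniform draw as an explicit argument (the function name and the 15-bit draw width are our choices):

```c
#include <stdint.h>

/* Stochastically round an s33.30 product back to s16.15: round up with
   probability residual/2^15, where residual is the 15 bits that
   truncation would discard and r is a uniform draw in [0, 2^15). */
int32_t s1615_sr(int64_t full, uint32_t r) {
    uint32_t residual = (uint32_t)(full & 0x7FFF); /* bottom 15 bits */
    int64_t down = full >> 15;                     /* RD result */
    return (int32_t)(down + ((r & 0x7FFF) < residual ? 1 : 0));
}
```

Averaged over many draws, the expected result equals the exact product, which is what makes the rounding errors unbiased.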

We have described above the two most common approaches, but it is worth noting that we can describe an infinite family of rounding mechanisms by defining a function which takes as input the fraction left in the remainder and gives as output the probability of rounding up. Both input and output are defined over [0, 1], and if this function passes through the points (0, 0), (0.5, 0.5) and (1, 1) and is anti-symmetric either side of 0.5, it will round in a way that produces a distribution of rounding errors that is both unbiased and symmetric.

An elegant description of such a function is the CDF of a Beta distribution with both parameters equal to the same value, say θ, where 0 < θ ≤ ∞. In this parameterisation, θ = 1 gives SR as defined above, θ → ∞ gives RTN and other values of θ provide something in between; perhaps these could be described as varying amounts of 'soft RTN'. Other advantages of this formulation are that analytical expressions are available for the distribution of errors from a single operation, which allows moments and other useful summaries to be produced easily - perhaps to shape the error distribution or minimise some function of it - and also that accurate polynomial approximations can be found for smaller values of θ, which will make calculation much faster than via the Beta CDF itself.

Figure 5: Nonlinear SR functions based upon the Beta CDF and the single operation error distributions that result

Figure 5 shows this function for θ = 1, 2, 5, 15 and 100, and the resulting error distributions. It can be seen that as θ gets larger the result becomes closer to RTN, and at θ = ∞ it is RTN. The 2nd and 3rd plots in Figure 6 show histograms of actual distributions against these analytical results.

2.4 Floating-point arithmetic: a short review

The IEEE 754-2008 standard radix-2 normalized single-precision floating-point number with a sign bit S, a biased exponent E and an integer significand T is defined as (after [MullerEtAl2018]):

    x = (-1)^S × 2^(E-127) × (1 + T × 2^-23),

whereas a double-precision number is defined as:

    x = (-1)^S × 2^(E-1023) × (1 + T × 2^-52).
Table 1 shows some minimum and maximum values of various floating- and fixed-point numerical types. It can be seen that fixed-point types have absolute accuracy, denoted by the corresponding machine epsilon, which means that any two neighbouring numbers in the representable range have a fixed gap between them. On the other hand, floating-point types have accuracy relative to the exponent, which means that the gap between any two neighbouring numbers (the real value of the least significant bit) depends on the exponent that those numbers have, i.e. it is the corresponding machine epsilon multiplied by 2 to the power of the unbiased exponent. For example, the next representable single-precision number after 1 is 1 + 2^-23, while the next number after 2^16 is 2^16 + 2^-7. Due to this, the accuracy of floating-point numbers is measured relative to the exponent.

Property       DP float              SP float             s16.15          u0.32
Accuracy       2^-52 (Rel.)          2^-23 (Rel.)         2^-15 (Abs.)    2^-32 (Abs.)
Min (Exact)    2^-1022               2^-126               2^-15           2^-32
Min (Approx.)  2.23 × 10^-308        1.18 × 10^-38        3.05 × 10^-5    2.33 × 10^-10
Max (Exact)    (2 - 2^-52) × 2^1023  (2 - 2^-23) × 2^127  2^16 - 2^-15    1 - 2^-32
Max (Approx.)  1.80 × 10^308         3.40 × 10^38         6.55 × 10^4     1.00
Table 1: Some significant positive numbers of floating-point (normalized) and fixed-point numerical types.

Due to these features of floating-point arithmetic, which types will give more accuracy overall depends on the application. For example, if the application works with numbers around 1 only, u0.32 is a more accurate type as it gives smaller steps between adjacent numbers (2^-32) than single-precision floating-point (2^-23 around 1). On the other hand, if the application works with numbers that are as small as 2^-9, single-precision floating-point representation becomes as accurate as u0.32 and increasingly more accurate as the numbers decrease beyond that point, eventually reaching a step size of 2^-149 near zero (the smallest normalised number being 2^-126 and the smallest denormalised number 2^-149).
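The widening gap between adjacent single-precision values can be checked directly by stepping a float's bit pattern to the next representable value; a small sketch (function name ours, valid for positive finite inputs):

```c
#include <stdint.h>
#include <string.h>

/* Gap between a positive, finite float x and the next representable
   float above it, found by incrementing the raw bit pattern (for
   positive IEEE 754 floats the integer ordering matches value order). */
float float_gap(float x) {
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits); /* type-pun safely via memcpy */
    bits += 1;
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

float_gap(1.0f) returns 2^-23 and float_gap(65536.0f) returns 2^-7, whereas a u0.32 step is 2^-32 everywhere in [0, 1).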

Therefore, it is very hard to show analytically what errors a complex algorithm will introduce at various steps of the computation for different numerical types. More generally, it is important to bear in mind that most of the literature on numerical analysis for computation implicitly assumes constant relative error (as in floating-point) as opposed to constant absolute error (as in fixed-point), and so many results on, for example, convergence and sensitivity of algorithms need to be considered in this light. In this paper, we chose to address these numerical issues experimentally, using a set of numerical types with a chosen set of test cases. One approach which may be useful in a limited number of cases, such as working with log-likelihoods in the calculation of Bayesian probabilities, is to use fixed-point types for log(x) of the values x needed in the calculations; this then provides constant relative error in the underlying x as well as a huge increase in range (for positive x).

3 Related work

There are already a number of projects analysing reduced precision types and ways to alleviate the numerical errors using stochastic rounding, primarily in the machine learning community who are important users of large scale computation. In this section we review some of the papers that explore fixed-point ODE solvers and stochastic rounding applications in machine learning and neuromorphic hardware.

3.1 Fixed-point ODE solvers

The most recent work that explores fixed-point ODE solvers on SpiNNaker [10.3389/fninf.2018.00081] was published in the middle of our current investigation and demonstrates a few important issues with the default GCC s16.15 fixed-point arithmetic when used in the Izhikevich neuron model. Therefore, we address some of the conclusions of this study. The authors tested the current sPyNNaker software framework [10.3389/fnins.2018.00816] for simulating the Izhikevich neuron and then demonstrated a method for comparing the network statistics of two forward Euler solvers at smaller timesteps to one another using a custom fixed-point type of s8.23 for one of the constants. Matching network behaviour is a valuable development in this area but we do have some comments on their methodology and note a few missing details in their study:

  1. Although they mention the value of an absolute reference, they apparently do not use one in their quantitative comparison, and so they compare and match two algorithms neither of whose accuracy is established.

  2. The iterative use of forward Euler solvers with small time steps is far from optimal in terms of the three key measures of accuracy, stability and computational efficiency. Regarding the latter, we estimate that 10x as many operations are required compared to the RK2 solver that they compare against. Given this computational budget, a much more sophisticated solver could be used which would provide higher accuracy and stability than their chosen approach. It also militates against the use of fixed-point types without re-scaling because of quantisation issues near the bottom of variable ranges that are inevitable with such small time steps.

  3. The choice of the s8.23 shifted type to improve the accuracy of the constant 0.04 seems odd when the u0.32 unsigned long fract type and matching mixed-format arithmetic operations can be used for any constants smaller than 1 and are fully supported by GCC. This could only result in a solver of equal or higher accuracy.

  4. Rounding is not explored as a possible improvement in the conversion of constants and arithmetic operations and neither is the provision of an explicit constant for the nearest value that can be represented in the underlying arithmetic type.

In this study we will address all of these issues.

3.2 Stochastic rounding

In one of the earliest investigations of SR [HOHFELD1992291], the authors investigate the effects of probabilistic rounding in backpropagation algorithms. Three different applications are shown with varying degrees of precision in the internal calculations of the backpropagation algorithms. It is demonstrated that when too few bits are used, training of the neural network starts to fail due to weight updates being too small to be captured by limited-precision arithmetic, resulting in underflow in most of the updates. To alleviate this, the authors apply probabilistic rounding in some of the arithmetic operations inside the backpropagation algorithm and show that the neural network can then perform well for word widths as small as 4 bits. The authors conclude that with probabilistic rounding, the 12-bit version of their system performs as well as a single-precision floating-point version.

In a recent paper about SR [Gupta:2015:DLL:3045118.3045303] (and similarly in [Muller2015]), the authors demonstrate how the error resiliency of neural networks used in deep learning can be exploited to allow reduced-precision arithmetic to be employed instead of the standard 32/64-bit types and operations. The authors describe a simple matrix multiplier array built out of many multiply-accumulate units that multiply two 16-bit operands and accumulate the sum of multiple such multiplications in full precision (implementing a dot product operation). Furthermore, the authors show that applying SR when converting the result of the dot product into the lower 16-bit precision improves the neural network's training and testing accuracy. The accuracy with a standard round-to-nearest routine is demonstrated to be very poor, while stochastic rounding results are shown to be almost equal to those of the same neural network using single-precision floating-point. A very important result from this study is that 32-bit fixed-point types can be reduced down to 16 bits while maintaining neural network performance, provided that SR is applied. This reduction in precision allows for smaller and more energy-efficient arithmetic circuits and reduced memory requirements in neural network accelerators. However, the authors did not report the variation in the neural network results due to stochastic rounding - as applications using SR become stochastic themselves, it is important to report both mean benchmark results and the variation across many trial runs, as we have done in this study.

A recent paper from IBM [Wang2018] also explores the use of an 8-bit floating-point type with mixed-precision in various parts of the architecture and stochastic rounding. The authors demonstrate similar performance to the standard 32-bit float type in training neural networks.

Closely related to the work above, the effects of training recurrent neural networks, used in deep learning applications, on analogue Resistive Processing Units (RPUs) have been studied [10.3389/fnins.2018.00745]. The authors aim to minimize the analogue hardware requirements by looking for the minimum number of bits that the input arguments to the analogue parts of the circuit can have. A baseline model with 5-bit input resolution is first demonstrated, and it becomes significantly unstable (in the reported training error) as network size is increased, compared to a simulation model with single-precision floating-point. The authors then increase the input resolution to 7 bits (with round-to-nearest reported on this version, though not mentioned for the baseline version) and observe a much more regular development of the training error, with lower magnitude at the last training cycle. Finally, the authors show that stochastically rounding these inputs to 5 bits again makes the training error almost as stable as in the 7-bit version, without the large error observed when using the baseline version to train large networks.

Further important work involving SR was done by Intel in the recently introduced Loihi neuromorphic chip [8259423]. In this neuromorphic chip, SR is not applied to an ODE of the neuron model as it is in our study, but to the biological learning rule that defines how synapses change with spiking, called Spike-Timing-Dependent Plasticity (STDP). The implementation of STDP usually requires spike history traces to be maintained for some number of timesteps. In Loihi, history traces (for each synapse) are held as 7-bit values to minimize memory cost and calculated as x[t] = αx[t-1] + βs[t], where α is a decay factor which defines how the trace x decays back to 0 between spikes, s[t] is the binary spike indicator and β is a value contributed by each spike. The authors in [8259423] state that these traces are calculated using stochastic rounding, but do not mention at which point SR is applied in this calculation. However, given that s[t] is a binary value, we can easily see that the main place where stochastic rounding can be applied is on the multiplication αx[t-1], which produces a result longer than 7 bits (depending on how many bits α is held in). Therefore, after this multiplication, the result has to be shifted right to put it back into the 7-bit format that x is in, and that is when SR should be applied to minimize the loss of precision from the multiplier. A further complication could arise if the β value is held to higher precision than the resultant x, in which case it would also have to be rounded, either after the multiplication or after the final adder.
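As a purely hypothetical sketch of where SR would sit in such a trace update (the widths of α, β and the random draw, and all names, are our assumptions, not Loihi's documented internals), with the decay factor held as a u0.8 fraction:

```c
#include <stdint.h>

/* Hypothetical 7-bit trace update x <- alpha*x + beta*s with SR applied
   after the decay multiply. alpha_u08 is a u0.8 decay factor (i.e.
   alpha = alpha_u08/256); rand8 is a uniform 8-bit draw; widths assumed. */
uint8_t trace_update_sr(uint8_t x, uint8_t alpha_u08, uint8_t beta,
                        int spike, uint8_t rand8) {
    uint16_t prod = (uint16_t)x * alpha_u08;    /* up to 15 significant bits */
    uint8_t residual = (uint8_t)(prod & 0xFF);  /* bits lost by >> 8 */
    uint16_t decayed = (uint16_t)(prod >> 8) + (rand8 < residual ? 1 : 0);
    uint16_t next = decayed + (spike ? beta : 0);
    return next > 127 ? 127 : (uint8_t)next;    /* saturate to 7 bits */
}
```

The SR step sits exactly where the text argues it must: on the right shift that returns the decay product to the trace's 7-bit format.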

4 Implementing SR and testing atomic multiplies

In order to establish the correctness of our rounding mechanisms for fixed-point types, we have carried out a set of tests to assess the distribution of errors from a simple atomic multiply with a wide range of randomly generated operands. There are five cases of interest, combining the s16.15, u0.32 and s0.31 operand types (with five equivalent cases for the 16-bit types). The first three of these return a saturated answer in s16.15 and the last two in s0.31, with a sign used because some of the multiplication results in ODE solvers are preceded by a minus to invert the sign. Each multiplier has an option to perform one of three types of rounding - RD, RTN or SR - when converting the result to the destination numerical type. In all cases we take 50,000 random numbers distributed uniformly between a minimum and maximum value. For example, in the cases where both operands are s16.15 the range is [-256.0, 256.0] for the 32-bit types or [-16.0, 16.0] for the 16-bit types, to ensure that the result can be represented. In the other cases the full ranges of the types can safely be used.

Each pair of operands is generated as numbers that can be exactly represented in the relevant fixed-point type and then copied to the reference type, IEEE double, where the compiler translates the numerical value appropriately. We then multiply them using the default double operation and whichever other multiply operation we are considering. The fixed-point result is copied back to double and the difference between the results is stored and normalised to the LSB of the fixed-point result type, so that e.g. if the result is an s16.15 the absolute error is multiplied by 2^15. We then calculate summary statistics and plot each set of errors. We will call these Bit Error Distributions or BEDs; they allow the atomic multiplication errors of the different types to be compared directly, as well as confirming expected behaviour.
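For a single operand pair the normalised bit error can be computed as sketched below for the truncating s16.15 multiply (function name ours). The conversion to double is exact, since every s16.15 value is exactly representable in a double:

```c
#include <stdint.h>

/* Bit error of a truncating (RD) s16.15 multiply against the double
   reference, expressed in units of the s16.15 LSB (scaled by 2^15). */
double bed_error_rd(int32_t a, int32_t b) {
    double exact = (a / 32768.0) * (b / 32768.0); /* double reference */
    int64_t full = (int64_t)a * (int64_t)b;       /* exact s33.30 product */
    double fixed = (double)(full >> 15) / 32768.0;
    return (fixed - exact) * 32768.0;             /* normalise to the LSB */
}
```

With RD, for positive operands the error always lies in (-1, 0], matching the left-hand histogram of Figure 6.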



Figure 6: Histograms showing the bit error distribution of random s16.15 * s16.15 operations with different rounding schemes.

Figure 6 shows the BEDs of a multiplier with the three different rounding routines. These plots, and summary statistics calculated on the residuals, confirm that in all cases the BEDs are as expected, so we can have confidence in our implementation.

4.1 Pseudo-random number generators

We have primarily investigated three sources of uniformly distributed noise in this study, implemented as deterministic pseudo-random number generators (PRNGs) with varying levels of quality and execution speed.

The reference PRNG is a version of Marsaglia's KISS64 algorithm [marsKISS64]. This has had several incarnations - we are using David Jones' implementation of what is called KISS99 in [2007TestU01]. It is now the default PRNG on the SpiNNaker system, with a long cycle length, producing a high-quality random stream of 32-bit integers that satisfies the stringent TestU01 BigCrush and Dieharder test suites [2007TestU01, dieharder-weblink]. Results have also been checked using faster PRNGs that fail these very challenging tests but which are considered to be of good basic quality for non-critical applications. In decreasing order of sophistication, these are a 33-bit maximum-cycle LFSR algorithm with taps at bits 33 and 20, implemented within the SpiNNaker base runtime library sark (see e.g. [Golomb:1981:SRS:578271]), and a very simple linear congruential algorithm with a setup recommended by Knuth and Lewis, defined on p.284 of [press1992numerical] where it is called ranqd1. No significant differences in either the mean or SD of results were found in any case tested so far, which is encouraging in terms of insensitivity to the choice of PRNG as long as it meets a certain minimum quality standard.
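For reference, the Knuth and Lewis generator is a one-line 32-bit LCG (constants as given in [press1992numerical]); the modulo-2^32 arithmetic comes free from unsigned overflow:

```c
#include <stdint.h>

/* 'Quick and dirty' LCG (ranqd1): x <- 1664525*x + 1013904223 mod 2^32.
   Fast and adequate for non-critical use, but it fails the stringent
   test suites that KISS passes. */
uint32_t ranqd1_next(uint32_t x) {
    return 1664525u * x + 1013904223u;
}
```

One step costs a multiply and an add, which is why a generator of this class is attractive when a fresh draw is needed on every multiplication.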

It is worth mentioning that, for SR to be competitive in terms of computation time, a very efficient source of high-quality pseudo-random numbers is critically important, as in many cases it will be drawn on for every multiplication operation.

5 Ordinary Differential Equation solvers

The solution of Ordinary Differential Equations (ODEs) is an important part of mathematical modelling in computational neuroscience, as well as in a wide variety of other scientific areas such as chemistry (e.g. process design), biology (e.g. drug delivery), physics, astronomy, meteorology, ecology and population modelling.

In computational neuroscience we are primarily interested in solving ODEs for neuron behaviour, though they can also be used for synapses and other more complex elements of the wider connected network such as gap junctions or neurotransmitter concentrations. Many of the neuron models that we are interested in are formally hybrid systems, in that they exhibit both continuous and discrete behaviour. The continuous behaviour is defined by the time evolution of the internal variables describing the state of the neuron; the discrete behaviour consists of a threshold crossing that triggers a spike, followed by a reset of the internal state and sometimes a refractory period during which the neuron is temporarily unavailable for further state variable evolution.

A number of good references exist regarding the numerical methods required to solve ODEs when analytical solutions are not available which is usually the case [hall1976modern, lambert1991numerical, dormand1996numerical, butcher2003numerical].

5.1 Izhikevich neuron

A sophisticated neuron model which provides a wide range of potential neuron behaviours is the one introduced by Eugene Izhikevich. In [Hopkins2015] we take a detailed look at this neuron model, popular because of its relative simplicity and its ability to capture a number of different realistic neuron behaviours. Although we provide a new result in that paper which moves some way towards a closed-form solution, for realistic use cases we still need to solve the ODE using numerical methods. We aimed to balance fidelity of the ODE solution with respect to a reference result against efficiency of computation using fixed-point arithmetic, our primary aim being real-time operation on a fixed time grid (1ms or 0.1ms). The absolute reference used is the ODE solver in the Mathematica environment [research2014mathematica], which has many advantages, including a wide range of methods that can be chosen adaptively during the solution, a sophisticated threshold-crossing algorithm, the ability to use arbitrary-precision arithmetic and meaningful warning messages when issues in the solution contravene algorithmic assumptions. For the Izhikevich model a well-implemented, explicit, variable-timestep solver such as the Cash-Karp variant of the Runge-Kutta 4th/5th order will also provide a good reference, as long as it implements an effective threshold-crossing algorithm. For other models, such as variants of the Hodgkin-Huxley model, more complex (e.g. implicit) methods may be required to get close to reference results. However, none of these will be efficient enough for real-time operation without a significant reduction in the number of neurons that can be simulated, and so for the SpiNNaker toolchain we chose a variant of the fixed-timestep Runge-Kutta 2nd order solver, optimised for fixed-point arithmetic, to achieve an engineering balance between accuracy and execution speed.
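To make the hybrid ODE-plus-reset structure concrete, the following sketch steps the standard Izhikevich equations (dv/dt = 0.04v^2 + 5v + 140 - u + I, du/dt = a(bv - u), with a spike and reset when v crosses 30 mV) using a fixed-timestep RK2 midpoint method in double precision. The RS parameter set (a = 0.02, b = 0.2, c = -65, d = 8) is Izhikevich's published regular-spiking configuration, but the DC input value and the exact solver arrangement here are illustrative, not the paper's ESR-optimised implementation.

```python
def izh_deriv(v, u, I, a=0.02, b=0.2):
    """Izhikevich model derivatives (standard RS parameters a, b)."""
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + I
    du = a * (b * v - u)
    return dv, du

def step_rk2(v, u, I, h=0.1, c=-65.0, d=8.0):
    """One fixed-timestep RK2 (midpoint) step, with the discrete
    threshold/reset behaviour applied after the continuous update."""
    dv1, du1 = izh_deriv(v, u, I)
    dv2, du2 = izh_deriv(v + 0.5 * h * dv1, u + 0.5 * h * du1, I)
    v, u = v + h * dv2, u + h * du2
    fired = v >= 30.0
    if fired:
        v, u = c, u + d        # reset: the hybrid-system part of the model
    return v, u, fired

# Simulate 1000 ms of a regular-spiking neuron at 0.1 ms timestep
# with an illustrative DC input I = 10.
v, u, spikes = -65.0, -13.0, []
for k in range(10000):
    v, u, fired = step_rk2(v, u, 10.0)
    if fired:
        spikes.append(k * 0.1)   # spike time in ms
print(len(spikes))
```

Running the same loop in a reduced-precision arithmetic and comparing the recorded spike times against this double-precision trace is exactly the arithmetic-error experiment described below.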

5.1.1 Algorithmic error and arithmetic error

In the above paper we were most focused on algorithmic error, i.e. how closely our chosen algorithm can match output from the reference implementation. This algorithmic error is created by the inability of a less accurate ODE solver method to match the reference result (assuming equivalent arithmetic). As the output in this kind of ODE model is best described as the time evolution of the state variable(s), it is by no means easy to formulate a measure that allows direct comparisons between these time-referenced vector quantities. We thought long and hard about single-number measures that were feasible to calculate whilst at the same time being useful to the computational neuroscience community, including some novel ones, e.g. scaled distances between a small number of parameters in an autoregressive linear predictor calculated using MEM that can exactly reproduce the state variable traces. What we decided on in the end was much simpler: spike lead or lag, i.e. the timing of spikes relative to the reference. This is not a perfect measure by any means and relies on a relatively simple drift in one direction or the other for spike timings, but in all cases tested so far where the arithmetic error is relatively small it is reliable and easy to understand, measure and communicate. So we work with the lead or lag of spike times relative to the reference for a set of different inputs (either DC or exponentially falling, to approximate a single synapse) and neuron types (three of the most commonly used Izhikevich specifications: RS, CH and FS). We looked at different arithmetics and timesteps and their effects, but these were side issues compared to choosing and implementing an algorithm that balanced execution speed against fidelity to the reference using the ISO fixed-point types at 1ms timestep.
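The spike lead/lag measure itself is straightforward: pair the n-th spike of the test run with the n-th spike of the reference run and difference the times. A minimal sketch (with hypothetical spike-time lists purely for illustration):

```python
def spike_lags(test_times, ref_times):
    """Per-spike lead (positive) or lag (negative) of a test run relative
    to a reference run, pairing spikes by index. Assumes no spikes are
    dropped or added, which holds when arithmetic error is small."""
    n = min(len(test_times), len(ref_times))
    return [t - r for t, r in zip(test_times[:n], ref_times[:n])]

# Hypothetical spike times (ms): the test run drifts steadily later,
# the typical pattern when the measure is well behaved.
ref = [10.0, 30.0, 50.0, 70.0]
test = [10.1, 30.3, 50.4, 70.6]
print(spike_lags(test, ref))
```

A steadily growing sequence of lags like this is what makes a single summary number (e.g. the lag of the final spike) meaningful.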

It should be carefully noted that in this section we are investigating a different matter: arithmetic error. We are now in a different position, having chosen an algorithm for the above reasons, and our new reference is this algorithm calculated in IEEE double-precision arithmetic. So the purpose of this study is different: how closely other arithmetics can come to this arithmetic reference. We don’t claim that the chosen algorithm is perfect, just that it is a realistic use case on which to test the different arithmetics of interest.

We provide results below on four different fixed-timestep algorithms in order to demonstrate some generality in the results. In increasing order of complexity and execution time, these are two 2nd-order and one 3rd-order fixed-timestep Runge-Kutta algorithms (see e.g. [butcher2003numerical, dormand1996numerical]) and a variant of the Chan-Tsai algorithm described in [Chan:2010aa]. All are implemented using the ESR approach described in [Hopkins2015], where the ODE definition and solver mechanism are combined and ‘unrolled’ into an equivalent algebraic solution, which can then be manipulated algebraically to optimise speed and accuracy of implementation using the available fixed-point types. See that paper for other possibilities; it is worth mentioning that we focused primarily there on the 1ms timestep, whereas for the current study we are using 0.1ms. This is for a number of reasons: (1) it is becoming the new informal standard for accurate neural simulation; (2) accuracy is critically dependent upon the timestep chosen, so our solutions will now be significantly closer to the absolute algorithmic reference; and (3) relative differences in arithmetic should therefore be made more apparent.

5.2 About constants

The constants that are used in the ODE solvers of the previously explored Izhikevich neuron model usually cannot be represented exactly in a digital computer system. Originally it was left to the compiler to round such constants. However, as part of this work we have discovered that the GCC implementation of fixed-point arithmetic does not support rounding in decimal-to-fixed-point conversion; the specification of constants therefore suffers from truncation error, which can approach one LSB of the target fixed-point type, and of course for ODE solvers this error will in most cases accumulate. Additionally, we have found that there is no rounding on most of the common arithmetic operations involving fixed-point numbers. The pragma FX_FULL_PRECISION defined in the fixed-point standard [iso18037] is not implemented in the GCC compiler version 6.3.1 that we have used for this work, so there is no way to turn on correct rounding. Experiments on GCC compiler version 7.3.0 confirm that this is still the case.

Our first step was to specify constants exactly in decimal, correctly rounded to the nearest s16.15, which reduced the maximum error in the decimal-to-s16.15 conversion to half an LSB (2^-16). As a result, the spike lag previously reported in [Hopkins2015] was reduced significantly, leading to an understanding that ODE solvers can be extremely sensitive to how the constants are represented. Following on from that, we have taken another step in this direction and represented all of the constants smaller than 1 as u0.32 instead of s16.15, which reduced the maximum representation error to 2^-33. The authors in [10.3389/fninf.2018.00081] used the s8.23 format for these constants, but we think that there is no downside to going all the way to u0.32 when the constants are below 1, since any arithmetic operation involving these constants can output in a different format if more dynamic range or signed values are required. In order to support this, we have developed libraries for mixed-format multiplication operations, where s16.15 variables can be multiplied by u0.32 variables to return an s16.15 result (as described in detail in Section 4). The combination of more accurate representation of the constants, mixed-format multiplication and appropriate rounding has significantly reduced the error in the fixed-point solution of the Izhikevich neuron model from what was previously reported [Hopkins2015], even before the 0.1ms timestep is taken into account. Therefore, we note that all the results below with round-down on the multipliers do not reproduce the results of the previous study [Hopkins2015]: these new results are generated with more precise constants, as well as some reordering of arithmetic operations in the originally reported ODE solvers to keep intermediate values in fract types for as long as possible.
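The arithmetic of such a mixed-format multiply can be sketched on raw integer representations. This is an illustrative model of the operation, not the library code: an s16.15 value carries 15 fractional bits and a u0.32 value carries 32, so their exact product carries 47; producing an s16.15 result with round-to-nearest means adding half an output LSB of the product (1 << 31) and shifting right by 32. A production implementation must additionally handle saturation and the sign conventions of the target, which this sketch glosses over.

```python
def mul_s1615_u032_rn(a_raw, b_raw):
    """Multiply an s16.15 value (a_raw, signed int) by a u0.32 value
    (b_raw, unsigned int), returning an s16.15 result rounded to nearest.
    The exact product has 15 + 32 = 47 fractional bits, so add half an
    output LSB and shift right by 32 (Python's >> floors)."""
    return (a_raw * b_raw + (1 << 31)) >> 32

# 1.5 (s16.15) times 0.25 (u0.32) should give exactly 0.375 (s16.15).
a = 3 << 14      # 1.5  * 2^15
b = 1 << 30      # 0.25 * 2^32
print(mul_s1615_u032_rn(a, b) / 2.0 ** 15)
```

The point of the format choice is visible here: a constant like 0.25 costs nothing to hold at 32 fractional bits, and the rounding step discards precision only once, at the final output format.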

5.3 Results

In this section we present the results of four different ODE solvers with stochastic rounding applied on each multiplier, at the stage where the multiplier holds the answer in full precision and must truncate it into the destination format, as described in Section 2.
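That truncation stage is where stochastic rounding operates: the discarded low-order bits of the full-precision product set the probability of rounding up. A minimal sketch for an s16.15 multiply, again modelling the types as raw integers and using Python's random module in place of the hardware PRNG stream:

```python
import random

FRAC = 15  # s16.15 result: 15 fractional bits

def mul_s1615_sr(a_raw, b_raw, rng=random):
    """s16.15 multiply with stochastic rounding. The exact product has
    30 fractional bits; the 15 bits about to be discarded give the
    probability of incrementing the truncated result."""
    prod = a_raw * b_raw
    residue = prod & 0x7FFF          # the 15 bits to be truncated
    draw = rng.getrandbits(15)       # uniform in [0, 2^15)
    return (prod >> FRAC) + (1 if draw < residue else 0)

# SR is unbiased in expectation: averaged over many draws the result
# approaches the exact product, unlike round-down, which always
# underestimates. Operand values here are arbitrary.
random.seed(0)
a, b = 12345, 6789
exact = (a * b) / 2.0 ** FRAC        # exact product, in result LSBs
mean = sum(mul_s1615_sr(a, b) for _ in range(20000)) / 20000.0
print(exact, mean)
```

This per-multiply random draw is why, as noted in Section 4.1, the PRNG's speed dominates the cost of SR.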

5.3.1 Neuron spike timings

We show results here on the most challenging target problems: a constant DC input for the RS and FS neurons. The results for exponentially falling input and the CH neuron are essentially perfect in terms of arithmetic error for all the possibilities, so showing these would not be a good use of limited space. The results for the different solvers are shown in Figure LABEL:fig:spike-lags0, with the exact spike lags of the 650th spike shown in Table LABEL:table:ode-stats0.

Because the SR results by definition follow a distribution, we need to show this in the plots. This was done by running the ODE solver 100 times with a different random number stream each time and recording the spike times on each run. We thereby gather a distribution of spike times for each of the first 650 spikes, and in the plots below show the mean and SD of this distribution. A small number of checks with 1000 repeats shows no significant differences in mean or SD from this case.

Mean spike time lag/lead relative to double reference