Stochastic computing [1, 2, 3, 4] has a long history and is an alternative framework for performing computer arithmetic using stochastic pulses. It can approximate arbitrary real numbers and perform arithmetic on them to the correct value in expectation, but the stochastic nature means that the result is not exact in each run. Recently, Ref. [5]
suggests that deterministic variants of stochastic computing can be just as efficient, and do not have the random errors introduced by the random nature of the pulses. Nevertheless, in such deterministic variants the finiteness of the scheme implies that it cannot approximate general real numbers with arbitrary precision. This paper proposes a framework that combines these two approaches and inherits some of the best properties of both schemes. In the process, we also provide a more complete probabilistic analysis of these schemes. In addition to considering both the first moment of the approximation error (i.e. the bias) and the variance of the representation, we also consider the set of real numbers that are represented and processed to be drawn from an independent distribution. This allows us to provide a more complete picture of the tradeoffs between the bias, the variance of the approximation and the number of pulses, along with the prior distribution of the data.
II Representation of real numbers via sequences
We consider two independent random variables $X$ and $Y$ with support in the unit interval $[0,1]$. A common assumption is that $X$ and $Y$
are uniformly distributed. The interpretation is that $X$ and $Y$ generate the real numbers that we want to perform arithmetic on. In order to represent a sample $x$ from $X$, the main idea of stochastic computing (and other representations such as unary coding [6]) is to use a sequence of $N$ binary pulses. In particular, $x$ is represented by a sequence of independent Bernoulli trials $X_1, \ldots, X_N$. We estimate $x$ via $\hat{x} = \frac{1}{N}\sum_{i=1}^{N} X_i$. Our standing assumption is that $X$, $Y$, and $\{X_i\}$ are all independent. We are interested in how well $\hat{x}$ approximates a sample $x$ of $X$. In particular, we define $e = \hat{x} - x$ and are interested in the expected mean squared error (EMSE) defined as $\mathcal{E} = E(e^2)$. Note that $\mathcal{E}$ consists of two components, bias and variance, and the bias-variance decomposition [7] is given by $\mathcal{E} = E(b^2) + E(v)$, where $b = E(\hat{x} \mid x) - x$ and $v = \mathrm{Var}(\hat{x} \mid x)$. The following result gives a lower bound on the EMSE:

Proposition 1: $\mathcal{E} \geq E(d_N(x)^2)$, where $d_N(x) = \min_{0 \leq k \leq N} |x - \frac{k}{N}|$ is the distance from $x$ to the nearest rational with denominator $N$.

Proof: Follows from the fact that $\hat{x}$ is a rational number with denominator $N$ and thus $|\hat{x} - x| \geq d_N(x)$.

Given the standard assumption that $X$ is uniformly distributed, this implies that $\mathcal{E} \geq \frac{1}{12N^2}$, i.e. the EMSE can decrease at a rate of at most $N^{-2}$.

In the next sections, we analyze how well $\hat{x}$ approximates samples of $X$ asymptotically as $N \to \infty$ by analyzing the error for various variants of stochastic computing.
II-A Stochastic computing
A detailed survey of stochastic computing can be found in [2]. We give here a short description of the unipolar format. Using the notation above, the $X_i$ are chosen to be iid Bernoulli trials with $P(X_i = 1) = x$. Then $E(\hat{x}) = x$ and $\hat{x}$ is an unbiased estimator of $x$. Since $b = 0$ and $v = \mathrm{Var}(\hat{x}) = \frac{x(1-x)}{N}$, i.e. $v = O(N^{-1})$, we have $\mathcal{E} = E\left(\frac{x(1-x)}{N}\right) = O(N^{-1})$ for $x \in (0,1)$. More specifically, if $X$ has a uniform distribution on $[0,1]$, then $\mathcal{E} = \frac{1}{6N}$.
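As a concrete illustration, the unipolar format and its $O(N^{-1})$ EMSE can be checked with a short Monte Carlo sketch; the function names below are ours, not from the literature.

```python
import random

def sc_encode(x, N, rng):
    """Unipolar stochastic computing: N iid Bernoulli(x) pulses."""
    return [1 if rng.random() < x else 0 for _ in range(N)]

def sc_estimate(bits):
    """Recover an estimate of x as the fraction of 1-pulses."""
    return sum(bits) / len(bits)

def emse(N, trials, rng):
    """Monte Carlo estimate of the EMSE for x uniform on [0, 1];
    the analysis above predicts E[x(1-x)]/N = 1/(6N)."""
    err2 = 0.0
    for _ in range(trials):
        x = rng.random()
        err2 += (sc_estimate(sc_encode(x, N, rng)) - x) ** 2
    return err2 / trials
```

For example, with $N = 32$ the predicted EMSE is $1/(6 \cdot 32) \approx 0.0052$, which a seeded Monte Carlo run reproduces to within sampling error.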
II-B A deterministic variant of stochastic computing
In [5], deterministic variants of stochastic computing (in the sequel, for brevity, we will sometimes refer to these schemes simply as "deterministic variants") are proposed. Several approaches such as clock dividing and relatively prime encoding are introduced and studied. One of the benefits of a deterministic algorithm is the lack of randomness, i.e. the representation of $x$ via $X_i$ does not change between runs and $v = 0$. However, the bias term $b$ can be nonzero. Because $x$ is represented by counting the number of 1's in $X_1, \ldots, X_N$, the scheme can only represent exactly fractions with denominator $N$. For $x = \frac{2k+1}{2N}$, where the numerator $2k+1$
is odd, the error is $\frac{1}{2N}$. This means that for such values of $x$, $|b| = \frac{1}{2N}$. If $X$
is a discrete random variable with support only on the rational points $\frac{k}{N}$ for integers $0 \leq k \leq N$, then $\mathcal{E} = 0$. However, in practice, we want to represent arbitrary real numbers in $[0,1]$. Assume that $X$ is uniformly distributed on $[0,1]$. By symmetry, we only need to analyze the error on an interval of length $\frac{1}{N}$ between consecutive representable points. Then $b$ is uniformly distributed on $[-\frac{1}{2N}, \frac{1}{2N}]$ and $E(b^2) = \frac{1}{12N^2}$. It follows that $\mathcal{E} = \frac{1}{12N^2} = O(N^{-2})$, matching the lower bound of Proposition 1.
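A minimal sketch of the unary deterministic encoding (our own illustration, assuming round-to-nearest) makes the bias-only nature of the error visible:

```python
def unary_encode(x, N):
    """Deterministic variant: emit round(x*N) leading 1s, then 0s.
    The representation is fixed, so the variance is zero."""
    m = round(x * N)
    return [1] * m + [0] * (N - m)

def estimate(bits):
    """Recover the represented value as the fraction of 1s."""
    return sum(bits) / len(bits)
```

The error is pure bias, bounded by $\frac{1}{2N}$, and the same input always yields the same output.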
II-C Stochastic rounding
As the deterministic variant (Sect. II-B) has a better asymptotic EMSE than stochastic computing (Sect. II-A), one might wonder why stochastic computing is useful at all. It is instructive to consider a special case: 1-bit stochastic rounding [8, 9], in which rounding a number $x \in [0,1]$ is given as a Bernoulli trial $\hat{x}$ with $P(\hat{x} = 1) = x$. This type of rounding is equivalent to the special case $N = 1$ of the stochastic computing mechanism in Sect. II-A. In deterministic rounding, $\hat{x} = 1$ if and only if $x \geq \frac{1}{2}$, and for uniformly distributed $X$ the corresponding EMSE is $\frac{1}{12}$. For stochastic rounding, $\hat{x}$
has a Bernoulli distribution. If $P(\hat{x} = 1) = p$, then $E((\hat{x} - x)^2) = p(1-x)^2 + (1-p)x^2$. Since this expression is linear in $p$ with slope $1 - 2x$, it follows that for $x < \frac{1}{2}$ it is minimized when $p = 0$, and for $x > \frac{1}{2}$ it is minimized when $p = 1$, i.e. it is minimized when $\hat{x}$ is the deterministic rounding of $x$. Thus the MSE of deterministic rounding lower-bounds that of any Bernoulli rounding scheme, with equality for $p = x$ exactly when $x \in \{0, \frac{1}{2}, 1\}$. This shows that the EMSE for deterministic rounding is minimal among all stochastic rounding schemes. Thus at first glance, deterministic rounding is preferred over stochastic rounding. While deterministic rounding has a lower EMSE than stochastic rounding, it is a biased estimator. This is problematic for applications such as reduced precision deep learning [13], where an unbiased estimator such as stochastic rounding has been shown to provide improved performance over a biased estimator such as deterministic rounding. As indicated in [8], part of the reason is that subsequent values that are rounded are correlated, and in this case stochastic rounding prevents stagnation.
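The $N = 1$ comparison can be checked numerically; the helpers below are our own illustrative names:

```python
import random

def det_round(x):
    """Deterministic rounding to {0, 1}: minimal MSE, but biased."""
    return 1 if x >= 0.5 else 0

def stoch_round(x, rng):
    """Stochastic rounding: P(1) = x, so the result is unbiased."""
    return 1 if rng.random() < x else 0

def mse(round_fn, x, trials):
    """Empirical mean squared rounding error for a fixed x."""
    return sum((round_fn(x) - x) ** 2 for _ in range(trials)) / trials
```

For $x = 0.3$ the deterministic MSE is $0.09$ while the stochastic MSE is $x(1-x) = 0.21$, yet the stochastic mean converges to $0.3$ whereas deterministic rounding always returns $0$.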
II-D Dither computing: a hybrid deterministic-stochastic computing framework
The main goal of this paper is to introduce dither computing, a hybrid deterministic-stochastic computing framework that combines the benefits of stochastic computing (Sect. II-A) and its deterministic variants (Sect. II-B): it eliminates the bias component while preserving the optimal asymptotic rate $O(N^{-2})$ for the EMSE. The encoding is constructed as follows. Let $\sigma$ be a permutation of $\{1, \ldots, N\}$.

For $x \geq \frac{1}{2}$, let $m = \lfloor xN \rfloor$ and $r = xN - m$. Then we pick the Bernoulli trials with $P(X_{\sigma(i)} = 1) = 1$ for $i \leq m$ and $P(X_{\sigma(i)} = 1) = p$ for $i > m$ with $p = \frac{r}{N - m}$. Then $E(\hat{x}) = \frac{1}{N}(m + (N-m)p) = \frac{1}{N}(m + r) = x$. In addition, since $b = 0$ and $v = \frac{(N-m)p(1-p)}{N^2} = \frac{r(1-p)}{N^2} \leq \frac{1}{N^2}$, this implies that $\mathcal{E} \leq \frac{1}{N^2}$. Thus the bias is 0 and the EMSE is of the order $O(N^{-2})$. It is clear that this remains true whether $\sigma$ is a deterministic or a random permutation, as $\hat{x}$ does not depend on $\sigma$.

For $x < \frac{1}{2}$, let $m = \lfloor (1-x)N \rfloor$ and $q = \frac{xN}{N - m}$. We pick the Bernoulli trials with $P(X_{\sigma(i)} = 0) = 1$ for $i \leq m$ and $P(X_{\sigma(i)} = 1) = q$ for $i > m$. Then $E(\hat{x}) = \frac{(N-m)q}{N} = x$. In addition, since $b = 0$ and $v = \frac{(N-m)q(1-q)}{N^2} \leq \frac{1}{N^2}$, this implies that $\mathcal{E} \leq \frac{1}{N^2}$. Thus again the bias is 0 and the EMSE is of the order $O(N^{-2})$.
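The construction can be sketched in Python; this is our reading of the scheme (identity permutation, both branches), not reference code from the authors:

```python
import math
import random

def dither_encode(x, N, rng):
    """Dither computing encoding: a deterministic prefix fixes most of
    the mass, and a Bernoulli tail supplies the fractional remainder,
    giving zero bias with variance at most 1/N^2."""
    bits = [0] * N
    if x >= 0.5:
        m = math.floor(x * N)              # deterministic 1s
        p = (x * N - m) / (N - m) if N > m else 0.0
        for i in range(N):
            bits[i] = 1 if i < m else (1 if rng.random() < p else 0)
    else:
        m = math.floor((1 - x) * N)        # deterministic 0s
        q = x * N / (N - m) if N > m else 0.0
        for i in range(m, N):
            bits[i] = 1 if rng.random() < q else 0
    return bits

def estimate(bits):
    return sum(bits) / len(bits)
```

Averaged over many trials the estimate matches $x$ (zero bias), while the spread of individual estimates shrinks like $1/N$ rather than $1/\sqrt{N}$.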
The above analysis shows that dither computing offers a better EMSE than stochastic computing while preserving the zero bias property. In order for such representations to be useful in building computing machinery, we need to show that this advantage persists under arithmetic operations such as multiplication and (scaled) addition.
III Multiplication of values
In this section, we consider whether this advantage is maintained for these schemes under multiplication of sequences via bitwise AND. The sequence corresponding to the product of $x$ and $y$ is given by $Z_i = X_i \wedge Y_i$ and the product is estimated via $\hat{z} = \frac{1}{N}\sum_{i=1}^{N} Z_i$.
III-A Stochastic computing
In this case we want to compute the product $xy$. Let $X_i$ and $Y_i$ be independent with $P(X_i = 1) = x$ and $P(Y_i = 1) = y$. Then for $Z_i = X_i \wedge Y_i$, the $Z_i$ are Bernoulli with $P(Z_i = 1) = xy$ and $E(\hat{z}) = xy$. Thus $\hat{z}$ is an unbiased estimator of $xy$ with $\mathrm{Var}(\hat{z}) = \frac{xy(1-xy)}{N}$, and the variance and the MSE of the product maintain the suboptimal asymptotic rate $O(N^{-1})$.
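Multiplication by bitwise AND of two independent stochastic streams can be sketched as follows (helper names are ours):

```python
import random

def sc_encode(x, N, rng):
    """Unipolar stochastic stream for x."""
    return [1 if rng.random() < x else 0 for _ in range(N)]

def and_multiply(bx, by):
    """Bitwise AND: the output 1-density is x*y in expectation,
    since the two streams are independent."""
    return [a & b for a, b in zip(bx, by)]

def estimate(bits):
    return sum(bits) / len(bits)
```

For example, with $x = 0.8$ and $y = 0.6$ the average product estimate converges to $0.48$.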
III-B Deterministic variant of stochastic computing
For numbers $x, y \in [0,1]$, we consider a unary encoding for $x$, i.e. $X_i = 1$ for $i \leq m_x$ and $X_i = 0$ otherwise, where $m_x = \lfloor xN \rfloor$. For $y$ we have $Y_i = 1$ if $\lfloor iy \rfloor > \lfloor (i-1)y \rfloor$, i.e. the 1's of $y$'s encoding are spread out as evenly as possible. Let $k$ be the number of indices $i \leq m_x$ such that $Y_i = 1$; then $|k - m_x y| \leq 1$ and $\hat{z} = \frac{k}{N}$. This means that $|\hat{z} - xy| \leq \frac{|k - m_x y| + y(xN - m_x)}{N} \leq \frac{2}{N}$. Since $v = 0$, this implies that the bias is on the order of $O(N^{-1})$ and the EMSE is on the order of $O(N^{-2})$.
III-C Dither computing
For numbers $x, y \in [0,1]$, we consider the encoding in Section II-D with the permutation for $x$ defined as the identity and the permutation for $y$ defined as spreading the 1 bits in a sample of $y$ as much as possible. In particular, the bits of $y$ are placed via $Y_i = \lfloor iy + u \rfloor - \lfloor (i-1)y + u \rfloor$, where $u$ is a uniformly distributed random variable on $[0,1]$ independent from the $X_i$. We will only consider the case $x, y \geq \frac{1}{2}$ as the other cases are similar. Let $m = \lfloor xN \rfloor$, $r = xN - m$ and $p = \frac{r}{N-m}$. Then $Z_i = Y_i$ for the $m$ indices where $X_i$ is deterministically 1, $Z_i = Y_i$ for a fraction $p$ of the remaining indices on average, and $Z_i = 0$ otherwise. This implies that $E(\hat{z}) = (m + r)\frac{y}{N} = xy$ and the bias is 0. Similar to the deterministic variant, it can be shown that $|\hat{z} - xy| \leq \frac{c}{N}$ for a constant $c$, and thus $\mathcal{E}$ is $O(N^{-2})$.
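One plausible realization of the two encodings (our interpretation, with a uniform offset $u$ dithering the spread 1s of $y$, and the Format 1 sketch restricted to $x \geq \frac{1}{2}$ for brevity):

```python
import math
import random

def format1(x, N, rng):
    """Left operand: deterministic prefix of 1s plus a Bernoulli tail
    carrying the fractional remainder (zero bias). Assumes x >= 1/2."""
    m = math.floor(x * N)
    p = (x * N - m) / (N - m) if N > m else 0.0
    return [1] * m + [1 if rng.random() < p else 0 for _ in range(N - m)]

def format2(y, N, rng):
    """Right operand: the 1s of y are spread as evenly as possible,
    dithered by a uniform random offset u, so that any window of k
    consecutive positions holds k*y ones in expectation."""
    u = rng.random()
    return [math.floor((i + 1) * y + u) - math.floor(i * y + u)
            for i in range(N)]

def and_multiply(bx, by):
    return [a & b for a, b in zip(bx, by)]

def estimate(bits):
    return sum(bits) / len(bits)
```

In this sketch the product estimate has mean $xy$ and an empirical spread far below the $O(N^{-1})$ variance of the purely stochastic scheme.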
IV Scaled addition (or averaging) of values
For $x, y \in [0,1]$, the output of the scaled addition (or averaging) operation is $\frac{x+y}{2}$. An auxiliary control sequence of bits $S_i$ is defined that is used to toggle between the two sequences by defining $Z_i$ as alternating between $X_i$ and $Y_i$: $Z_i = S_i X_i + (1 - S_i) Y_i$, and $\frac{x+y}{2}$ is estimated via $\hat{z} = \frac{1}{N}\sum_{i=1}^{N} Z_i$.
IV-A Stochastic computing
The control sequence is defined as independent Bernoulli trials $S_i$ with $P(S_i = 1) = \frac{1}{2}$. It is assumed that $S_i$, $X_i$ and $Y_i$ are independent. Then $E(Z_i) = \frac{x+y}{2}$, i.e. $E(\hat{z}) = \frac{x+y}{2}$ and the bias is 0. $\mathrm{Var}(\hat{z}) = O(N^{-1})$. Thus again $\mathcal{E} = O(N^{-1})$.
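The randomized multiplexer can be sketched as follows (helper names are ours):

```python
import random

def sc_encode(x, N, rng):
    return [1 if rng.random() < x else 0 for _ in range(N)]

def sc_average(bx, by, rng):
    """Scaled addition: a Bernoulli(1/2) control bit selects between
    the two input streams, so the output density is (x + y) / 2."""
    return [a if rng.random() < 0.5 else b for a, b in zip(bx, by)]

def estimate(bits):
    return sum(bits) / len(bits)
```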
IV-B Deterministic variant of stochastic computing
For this case the $X_i$ and $Y_i$ are deterministic unary encodings and we define $S_i = 1$ if $i$ is even and $S_i = 0$ otherwise. Let $N_e$ and $N_o$ be the number of even and odd numbers in $\{1, \ldots, N\}$ respectively. Then $N_e = \lfloor N/2 \rfloor$ and $N_o = \lceil N/2 \rceil$. If $N$ is even, $N_e = N_o = \frac{N}{2}$. If $N$ is odd, $N_o = N_e + 1$. In either case, $v = 0$ and $|b| = O(N^{-1})$, so $\mathcal{E} = O(N^{-2})$.
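With unary inputs and the even/odd control sequence, the averaging error stays deterministic and $O(N^{-1})$; a small sketch (0-based indices, our own illustration):

```python
def unary_encode(x, N):
    """Deterministic unary encoding: round(x*N) leading 1s."""
    m = round(x * N)
    return [1] * m + [0] * (N - m)

def det_average(bx, by):
    """Select X on even indices and Y on odd indices; with unary
    inputs the result counts about half of each stream's 1s."""
    return [bx[i] if i % 2 == 0 else by[i] for i in range(len(bx))]

def estimate(bits):
    return sum(bits) / len(bits)
```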
IV-C Dither computing
We set the permutations for $x$ and $y$ both equal to the identity permutation and define the control sequence $S_i$ with $S_i = 1$ for $i$ odd and $S_i = 0$
otherwise. With probability $\frac{1}{2}$ we set $Z_i = S_i X_i + (1 - S_i) Y_i$ for all $i$, and $Z_i = (1 - S_i) X_i + S_i Y_i$ for all $i$ otherwise. Thus the 2 interleaved sequences are each chosen with probability $\frac{1}{2}$. Note that the $Z_i$ are correlated, and $E(\hat{z}) = \frac{x+y}{2}$. This means that the bias is 0. Each of the 2 sequences for $Z_i$ selects a disjoint set of the random variables, the sum of which has variance $O(1)$. This implies that $\mathcal{E}$ is $O(N^{-2})$.
V Numerical results
In Figures 1-6 we show the EMSE and the bias for the computing schemes above by generating independent pairs $(x, y)$ from a uniform distribution on $[0,1]^2$ for a range of values of $N$. For each pair $(x, y)$, 1000 trials of dither computing and stochastic computing are used to represent them and compute the product $xy$ and average $\frac{x+y}{2}$. (The set of pairs is the same for the 3 schemes. For the deterministic variant, only 1 trial is performed, as its representations are deterministic.)
We see that the sample estimates of the bias for $x$, $y$, $xy$ and $\frac{x+y}{2}$ are lower for both the stochastic computing scheme and the dither computing scheme as compared with the deterministic variant. On the other hand, the dither computing scheme has a similar EMSE on the order of $O(N^{-2})$ as the deterministic scheme, whereas the stochastic computing scheme has a higher EMSE on the order of $O(N^{-1})$.

Even though both stochastic computing and dither computing have zero bias, the sample estimate of this bias is lower for dither computing than for stochastic computing. This is because the standard error of the mean is proportional to the standard deviation, and dither computing has standard deviation $O(N^{-1})$ vs $O(N^{-1/2})$ for stochastic computing; this is observed in Figs. 2, 4, 6.

Furthermore, even though the dither computing representations of $x$ and $y$ have a worse EMSE than the deterministic variant, the dither computing representation of both the product and the scaled addition has a better EMSE. The asymptotic behavior of the bias and EMSE for these different schemes is listed in Table I.
Table I: Asymptotic behavior of the bias and EMSE of the three schemes.

|      | Stoch. Comp. | Determ. Variant | Dither Comp. |
|------|--------------|-----------------|--------------|
| Bias | $0$          | $O(N^{-1})$     | $0$          |
| EMSE | $O(N^{-1})$  | $O(N^{-2})$     | $O(N^{-2})$  |
VI Asymmetry in operands
In the dither computing scheme (and in the deterministic variant of stochastic computing as well), the encodings of the two operands $x$ and $y$ are different. For instance, $x$ is encoded as a unary number (denoted as Format 1) and $y$ has its 1-bits spread out as much as possible (denoted as Format 2) for multiplication, while both $x$ and $y$ are encoded as unary numbers for scaled addition. For multilevel arithmetic operations, this asymmetry requires additional logic to convert the output of multiplication and scaled addition into these 2 formats, depending on which operand of which arithmetical operation consumes the result next. On the other hand, there are several applications where the need for this additional step is reduced. For instance:
- In memristive crossbar arrays [10], the sequence of pulses in the product is integrated and converted to digital via an A/D converter, and thus the product sequence of pulses is not used in subsequent computations.
- In stochastic computing neural networks [11], one of the operands is always a weight or a bias and thus fixed throughout the inference operation. Thus the weight can be precoded in Format 2 for multiplication and the bias value precoded in Format 1 for addition, whereas the data to be operated on is always in Format 1 and the result recoded to Format 1 for the next operation.
VII Dither rounding: stochastic rounding revisited
Recently, stochastic rounding has emerged as an alternative mechanism to deterministic rounding for using reduced precision hardware in applications such as solving differential equations [12] and deep learning [13]. As mentioned in Sec. II-C, 1-bit stochastic rounding can be considered the special case of stochastic computing with $N = 1$. For $b$-bit stochastic rounding, the situation is similar, as only the least significant bit is stochastic. Another alternative interpretation is that stochastic computing is stochastic rounding in time, i.e. $\hat{x} = \frac{1}{N}\sum_{i=1}^{N} X_i$ can be considered as applying stochastic rounding $N$ times. Since the standard error of the mean of dither computing is asymptotically superior to that of stochastic computing, we expect this advantage to persist for rounding as well when applied over time.
Thus we introduce dither rounding as follows. We assume $x \geq 0$, as the case $x < 0$ can be handled similarly. We define dither rounding of a real number $x$ as $d(x, i) = \lfloor x \rfloor + X_{\sigma(i)}$, where $X_{\sigma(i)}$ is taken from the dither computing representation of $\mathrm{frac}(x)$ as defined in Sect. II-D and $\mathrm{frac}(x) = x - \lfloor x \rfloor$ is the fractional part of $x$. Note that there is an index $i$ in the definition of $d(x, i)$, which is an integer $1 \leq i \leq N$. In practice we will compute $i$ as $k \bmod N$, where $k$ counts how many times the dither rounding operation has been applied so far, and $\sigma$ is a fixed permutation, one for the left operand and one for the right operand of the scalar multiplier.
To illustrate the performance of these different rounding schemes, consider the problem of matrix-matrix multiplication, a workhorse of computational science and deep learning algorithms. Let $A$ and $B$ be $n_1 \times n_2$ and $n_2 \times n_3$ matrices with elements in $[0,1]$. The goal is to compute the matrix product $C = AB$. A straightforward algorithm for computing $C$ requires $n_1 n_2 n_3$ (scalar) multiplications. Let us assume that we have at our disposal only $b$-bit fixed point digital multipliers, and thus floating point real numbers are rounded to $b$ bits before using the multiplier. We want to compare the performance of computing $C$ between traditional rounding, stochastic rounding and dither rounding. In particular, since each element of $A$ is used $n_3$ times and each element of $B$ is used $n_1$ times, for dither rounding we set $N$ to the reuse count of the corresponding operand. For dither rounding the computation of each of the partial results is illustrated in Fig. 7, and the other schemes can be obtained by simply replacing the rounding scheme. We measure the error by computing the Frobenius matrix norm $\epsilon = \|C - \tilde{C}\|_F$, where $\tilde{C}$ is the product matrix computed using the specified rounding method and the $b$-bit fixed point multiplier. In our case this is implemented by rescaling the interval $[0,1]$ to $[0, 2^b - 1]$ and rounding to fixed point $b$-bit integers. Note that the Frobenius matrix norm is equivalent to the $\ell_2$ vector norm when the matrix is flattened as a vector.
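The experiment can be reproduced in outline. The sketch below compares traditional and stochastic rounding only (dither rounding would slot in as a third rounding function); the sizes, range and $b$ used in the test are our choices, not necessarily those behind Fig. 8:

```python
import math
import random

def trad_round(x, b):
    """Round to the nearest b-bit fixed point value in [0, 1]."""
    s = (1 << b) - 1
    return round(x * s) / s

def stoch_round(x, b, rng):
    """b-bit stochastic rounding: round up with probability equal to
    the remaining fraction, so the result is unbiased."""
    s = (1 << b) - 1
    v = x * s
    lo = math.floor(v)
    return (lo + (1 if rng.random() < v - lo else 0)) / s

def matmul_rounded(A, B, round_fn):
    """C[i][j] = sum_k round(A[i][k]) * round(B[k][j]), rounding each
    operand independently at every use."""
    n1, n2, n3 = len(A), len(B), len(B[0])
    return [[sum(round_fn(A[i][k]) * round_fn(B[k][j]) for k in range(n2))
             for j in range(n3)] for i in range(n1)]

def frob_error(C, C2):
    """Frobenius norm of the difference between two product matrices."""
    return math.sqrt(sum((C[i][j] - C2[i][j]) ** 2
                         for i in range(len(C)) for j in range(len(C[0]))))
```

With entries well below half a quantization step, traditional rounding sends both factors to the zero matrix, while stochastic rounding keeps the product correct in expectation.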
We expect dither rounding (and stochastic rounding) to outperform traditional rounding (which is equivalent to deterministic $b$-bit quantization) when the range of the matrix elements is narrow compared to the quantization interval. For example, take the special case $A = aE$ and $B = aE$, where $E$ is the square matrix of all 1's and $a$ is smaller than half a quantization step. When we use traditional rounding to round the elements of $A$ and $B$, the corresponding $\tilde{C}$ is the zero matrix and $\epsilon = \|C\|_F$. The analysis in Section III shows that for both dither rounding and stochastic rounding the resulting $\tilde{C}$ satisfies $E(\tilde{C}) = C$, with a per-entry MSE of order $O(N^{-2})$ for dither rounding and $O(N^{-1})$ for stochastic rounding.
We generate 100 pairs of 100 by 100 matrices $A$ and $B$, where the elements of $A$ and $B$ are randomly chosen from a narrow range near 0. The average $\epsilon$ for traditional rounding, stochastic rounding and dither rounding is shown in Fig. 8. (Note that for traditional rounding and small $b$, $A$ and $B$
are both rounded to the zero matrix, and $\tilde{C} = 0$ in this case.) We see that dither rounding has smaller $\epsilon$ than stochastic rounding, and that for small $b$ both dither rounding and stochastic rounding have significantly lower error in computing $C$ than traditional rounding. There is a threshold $b^*$ such that traditional rounding outperforms dither or stochastic rounding for $b > b^*$, and we expect this threshold to increase as the matrix dimensions increase.
For the next numerical experiment, we compare stochastic rounding with dither rounding on larger random matrices with elements in $[0,1]$, averaged over multiple trials. The results are shown in Fig. 9, where we plot the error $\epsilon$ for various $b$. Again we see that dither rounding has a smaller average error in computing $C$ than stochastic rounding. Based on our analysis above, and similar to the previous numerical results, we expect these gaps to widen as $N$ increases.
VIII Conclusions

We present a hybrid stochastic-deterministic scheme that encompasses the best features of stochastic computing and of its deterministic variants, achieving the optimal asymptotic rate for the EMSE of the deterministic variant while inheriting the zero bias property of stochastic computing schemes. We also show how it can be beneficial in stochastic rounding applications.
[1] B. R. Gaines, "Stochastic computing," in Proceedings of the AFIPS Spring Joint Computer Conference, pp. 149–156, 1967.
[2] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Transactions on Embedded Computing Systems, vol. 12, no. 2s, pp. 1–19, 2013.
[3] T.-H. Chen and J. P. Hayes, "Analyzing and controlling accuracy in stochastic circuits," in IEEE 32nd International Conference on Computer Design (ICCD), 2014.
[4] R. P. Duarte, M. Vestias, and H. Neto, "Enhancing stochastic computations via process variation," in 25th International Conference on Field Programmable Logic and Applications (FPL), 2015.
[5] D. Jenson and M. Riedel, "A deterministic approach to stochastic computation," in ICCAD, 2016.
[6] M. D. Davis, R. Sigal, and E. J. Weyuker, Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science. Academic Press, 1994.
[7] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning. Springer, 2013.
[8] M. Höhfeld and S. E. Fahlman, "Probabilistic rounding in neural network learning with limited precision," Neurocomputing, vol. 4, no. 6, pp. 291–299, 1992.
[9] M. P. Connolly, N. J. Higham, and T. Mary, "Stochastic rounding and its probabilistic backward error analysis," Tech. Rep. MIMS EPrint 2020.12, The University of Manchester, 2020.
[10] T. Gokmen, M. Onen, and W. Haensch, "Training deep convolutional neural networks with resistive cross-point devices," Frontiers in Neuroscience, vol. 11, Oct. 2017.
[11] Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, "A survey of stochastic computing neural networks for machine learning applications," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–16, 2020.
[12] M. Hopkins, M. Mikaitis, D. R. Lester, and S. Furber, "Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations," Philosophical Transactions of the Royal Society A, vol. 378, p. 20190052, 2020.
[13] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning, pp. 1737–1746, 2015.