1. Introduction
Probabilistic inference is popular for robust classification, diagnosis and decision-making problems because of its ability to assign a confidence level, in terms of a probability, to every result. Probabilistic Graphical Models (PGMs) (Pearl1988PRIS), an established tool for probabilistic inference, are widely used for such problems. PGMs have several interesting properties that make them suitable for embedded applications. Specifically, PGMs: 1) are capable of dealing with missing data; 2) allow incorporating information from different domains, as well as expert knowledge; 3) can be trained with less data; and 4) can explicitly model uncertainty and causal relationships in the system. In addition, the performance of PGMs is competitive with other state-of-the-art machine learning implementations on embedded sensing applications (galindez2018dynamic; jonas; george2017generative; Liang2019AAAI).

Inference in PGMs is prominently performed using a versatile representation known as an Arithmetic Circuit (AC) (or sum-product network) (CHAVIRA2008772).
An AC is a model of computation, often represented as a graph of additions and multiplications. ACs allow for an integration of both statistical and symbolic methods in artificial intelligence, a promising combination pursued in state-of-the-art machine learning methods (Thompson2018Wired; Manhaeve2018NIPS; Liang2019AAAI). They are also central to performing inference in the field of probabilistic (logic) programming (Fierens2015TPLP; Manhaeve2018NIPS). Furthermore, recent approaches learn ACs directly from data, with state-of-the-art performance in certain applications (Liang2019AAAI). In this work, we focus on ACs representing Bayesian networks (BNs), a type of PGM.

Inference with ACs is generally restricted to exact computation on general-purpose computing devices. A significant improvement in energy efficiency would be possible by tolerating some error, i.e., by approximating the probability computed by an AC. Take, for instance, smartphone-based activity identification for the elderly, wherein a probability is evaluated for different activities (e.g., the user walking up the stairs). The application identifies an activity only if its probability exceeds a certain threshold, say 0.60. Here, allowing an output error of 0.01 would only affect decisions within the probability range of 0.59 to 0.61, while enabling improved energy efficiency.
A promising hardware optimization that can exploit the available error tolerance is to realize the additions and multiplications in a reduced-precision representation. Yet, the state of the art lacks an analysis of the impact of such precision reduction on the output probability error. Previous works (chan2002numbers; chan2004sensitivity; tschiatschek2015bayesian) have studied the impact of low precision in the leaf nodes of an AC, but do not account for noisy or low-precision computations in its internal nodes. However, the error in the imprecise internal nodes can accumulate and become the dominant source of imprecision in the inference output.
In this paper, we propose ProbLP (code available at https://github.com/nimish15shah/ProbLP), a holistic framework to automate the design of low-precision, energy-efficient hardware for probabilistic inference with Arithmetic Circuits. Our contributions are as follows:

We derive bounds on the error in probabilistic queries due to lowprecision representation, taking into account the error introduced in all the nodes in an AC.

We develop energy models to help choose the most energy-efficient representation.

We develop a tool to automatically generate low-precision inference hardware, and validate its performance on several embedded sensing benchmarks.
This paper is organized as follows. Section 2 gives an introduction to Arithmetic Circuits compiled from Bayesian networks and an overview of related work. In Section 3, we derive analytical error bounds for ACs, introduce the ProbLP framework, and elaborate on how it selects the optimal precision and chooses between fixed- and floating-point representations. Section 4 demonstrates the validity of the framework on a suite of embedded sensing benchmarks, and Section 5 concludes this work.
2. Background and previous work
In this paper, we denote a random variable with an uppercase letter $X$ and its instantiation with a lowercase letter $x$. A set of random variables is denoted with a bold uppercase letter $\mathbf{X}$ and its joint assignment with a bold lowercase letter $\mathbf{x}$.

Bayesian networks (BNs) are directed acyclic graphs that compactly encode a joint probability distribution over a set of random variables (Pearl1988PRIS):

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)) \qquad (1)$$

where $Pa(X_i)$ denotes the parents of $X_i$ and $P(X_i \mid Pa(X_i))$ are the conditional dependencies between variables and their parents, which can be represented as Conditional Probability Tables (CPTs). In the graphical component of BNs, the variables are represented as nodes and their probabilistic or causal relationships are indicated by the direction of the edges among them, as depicted in Figure 1(a). The joint probability distribution in (1) allows answering a number of probabilistic queries such as the marginal probability, the conditional probability, or the Most Probable Explanation (MPE) (Pearl1988PRIS).
Probabilistic inference on a BN can be made efficient by compiling it to an Arithmetic Circuit, which consists only of multiplications and additions. Figure 1(b) shows an example of an AC generated by compiling the BN in Figure 1(a). The first type of inputs to this AC are the BN's parameters, represented by $\theta_{x|u}$, where $x$ and $u$ are instantiations of a random variable $X$ and its parents $U$. The second type of inputs to the AC are indicators $\lambda_x$, which are binary variables that indicate the evidence on the observed nodes. The probability of an evidence $e$ can be computed with an upward pass on the AC by setting the indicators that contradict the evidence to 0, and all others to 1.

Previous works have studied the impact of finite-precision CPT parameters on marginal and conditional probabilities (chan2002numbers; chan2004sensitivity; tschiatschek2015bayesian). This helps to reduce the memory footprint of the inference model. However, these works did not study the effect of low-precision arithmetic operations. The authors of (zermani2015fpga; khan2016hardware) studied the effect of fixed-point arithmetic on marginal probabilities for a few BNs, but not on conditional probabilities, and did not provide error bounds. Moreover, the impact of floating-point arithmetic operations also remains unclear.
This work provides analytical bounds on the absolute and relative errors in marginal and conditional probabilities for fixed- and floating-point operations in the entire AC. We introduce a holistic framework, ProbLP, which also takes energy consumption into account to choose the optimal representation among fixed point and floating point. Subsequently, it automatically generates custom hardware for AC evaluation.
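As a concrete illustration of the upward pass described in this section, the following Python sketch evaluates a tiny hand-built AC. The node encoding and the `evaluate_ac` helper are hypothetical, illustrative choices, not part of ProbLP:

```python
# Hypothetical minimal AC evaluator. Nodes are encoded as ('leaf', value)
# for CPT parameters, ('ind', var, val) for indicators, and
# ('add', children) / ('mul', children) for operators.
def evaluate_ac(nodes, evidence):
    """Upward pass: indicators contradicting the evidence are 0, others 1."""
    values = {}
    for nid, node in nodes.items():   # node ids assumed in topological order
        kind = node[0]
        if kind == 'leaf':            # CPT parameter theta
            values[nid] = node[1]
        elif kind == 'ind':           # indicator lambda_{X=x}
            var, val = node[1], node[2]
            values[nid] = 1.0 if evidence.get(var, val) == val else 0.0
        elif kind == 'add':
            values[nid] = sum(values[c] for c in node[1])
        else:                         # 'mul'
            p = 1.0
            for c in node[1]:
                p *= values[c]
            values[nid] = p
    return values

# Tiny AC for P(A): theta_a * lam_a + theta_notA * lam_notA
nodes = {
    0: ('leaf', 0.3), 1: ('ind', 'A', True),
    2: ('leaf', 0.7), 3: ('ind', 'A', False),
    4: ('mul', [0, 1]), 5: ('mul', [2, 3]),
    6: ('add', [4, 5]),
}
p_a = evaluate_ac(nodes, {'A': True})[6]   # evidence A = true
p_all = evaluate_ac(nodes, {})[6]          # all indicators 1
```

With evidence `A = true`, the contradicting indicator is zeroed and the root evaluates to 0.3; with no evidence, all indicators are 1 and the root sums to 1.0.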
3. Methodology
The different components of the ProbLP framework are shown in Figure 2. ProbLP takes in an AC together with some user requirements, based on which it calculates the least number of fixed- and floating-point bits needed to meet these requirements. To do so, ProbLP evaluates error bounds for the AC based on the respective error models. To choose between the two representations, it subsequently estimates the energy of the complete AC based on energy models. Finally, it generates fully-parallel, pipelined hardware in the selected low-precision representation.
The three inputs of ProbLP are as follows:
Arithmetic circuit: The Arithmetic Circuit to be implemented using low-precision hardware. In this paper, we use ACs compiled from Bayesian networks, but they can also be compiled from probabilistic (logic) programs or be trained directly from data.
Type of query: The type of probabilistic query to be performed using the AC, to be chosen from marginal probability, conditional probability or the probability of most probable explanation (MPE).
Error tolerance: The amount of error on the output that can be tolerated in the probabilistic queries by the application, for all possible combinations of inputs, in terms of absolute or relative error. An absolute error requirement is given as $|\tilde{p} - p| \le \epsilon_{abs}$ and a relative error requirement as $|\tilde{p} - p|/p \le \epsilon_{rel}$, where $p$ is the output probability of interest and $\tilde{p}$ its low-precision estimate.
3.1. Error analysis
The aim of the error analysis is to estimate the minimum number of bits required to achieve the user-specified error tolerance. For this, it has to take into account the impact of reducing the number of bits on the error in the AC output probability. There are two sources of error in an AC: the error in the leaf nodes when CPT values are quantized to finite precision, and the error injected in the intermediate nodes of the AC due to finite-precision arithmetic operations. Unlike previous works, we formally treat the error in the intermediate nodes as well to derive the error bounds.
We consider two representations: fixed point and floating point. Operators of both types round bits during computation. For example, a multiplication of two $n$-bit inputs produces an exact result with $2n$ bits, which is subsequently rounded to fit an $n$-bit output. The error introduced can be modeled as an additive noise source. The error models used for the leaf nodes and the intermediate nodes are described next. Some of the models are inspired by (higham2002accuracy), but that work did not perform the error analysis for ACs, and some of the models would yield unbounded errors if AC-specific constraints were not exploited. The arithmetic operators are assumed to round the extra bits to the nearest value.
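Round-to-nearest quantization can be illustrated with a short Python sketch; the helper name `quantize_fx` is made up for illustration. With $n_f$ fraction bits, the rounding error is at most half an LSB, i.e. $2^{-(n_f+1)}$:

```python
# Round-to-nearest fixed-point quantization with n_f fraction bits.
# The introduced error is at most half an LSB: 2^-(n_f + 1).
def quantize_fx(x, n_f):
    scale = 1 << n_f                 # 2^n_f
    return round(x * scale) / scale

n_f = 8
for x in [0.1, 0.333, 0.99]:
    err = abs(quantize_fx(x, n_f) - x)
    assert err <= 2.0 ** -(n_f + 1)  # half-LSB bound holds
```

For example, `quantize_fx(0.1, 8)` returns 26/256 = 0.1015625, well within the half-LSB bound of 2⁻⁹.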
3.1.1. Fixed-pt error estimation
Let $n_i$ and $n_f$ be the number of integer and fraction bits. All numbers are assumed to be in the range of the fixed-pt format, implying an absence of overflow during computation; this can be ensured by using an appropriate number of integer bits $n_i$, discussed in detail in Section 3.1.4.
Fixed-pt leaf node: Let $x$ be the real value of a leaf node in an AC, and $\tilde{x}$ its fixed-pt representation. The error introduced by the fixed-pt conversion can be bounded as

$$|\tilde{x} - x| \le 2^{-(n_f+1)} =: \epsilon_{fx} \qquad (2)$$
Fixed-pt adder node: If $\tilde{x} = x + \delta_x$ and $\tilde{y} = y + \delta_y$ are the fixed-pt representations of the adder inputs $x$ and $y$, the error in the output is given as

$$|(\tilde{x} + \tilde{y}) - (x + y)| \le |\delta_x| + |\delta_y| \qquad (3)$$

Note that the fixed-pt adder does not add any error of its own, as it does not round bits; it simply accumulates the error of its inputs. Note also that the adder output cannot overflow, as all numbers are ensured to be in range.
Fixed-pt multiplier node: With $\tilde{x} = x + \delta_x$ and $\tilde{y} = y + \delta_y$ as the fixed-pt representations of the multiplier inputs $x$ and $y$, the error in the fixed-pt multiplier output can be bounded as

$$|\widetilde{x \cdot y} - xy| \le |x\delta_y + y\delta_x + \delta_x\delta_y| + \epsilon_{fx} \qquad (4)$$
$$\le x_{max}|\delta_y| + y_{max}|\delta_x| + |\delta_x||\delta_y| + \epsilon_{fx} \qquad (5)$$

In (4), the error term $\epsilon_{fx}$ models the error introduced when the LSBs of the intermediate multiplication result are rounded to fit back into $n_f$ fraction bits. Equation (5) produces an unbounded error unless $x_{max}$ and $y_{max}$ can be bounded.
The $x_{max}$ and $y_{max}$ can be efficiently bounded by taking the AC-specific properties into account. An AC consists of adders and multipliers and only operates on non-negative numbers. As a result, each internal node in the AC is a monotonically increasing function of its inputs. Hence, all nodes are at their maximum value when all inputs are at their maximum. Since the CPT parameters stay constant across AC evaluations, this is achieved when all indicator variables are set to 1. This allows assessing the $x_{max}$ and $y_{max}$ of every operator in the AC with just a single AC evaluation, thereby allowing ProbLP to bound the error of the fixed-pt multipliers.
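The single-evaluation max-value analysis can be sketched as follows, reusing the illustrative node encoding from Section 2 (node ids in topological order; the encoding itself is an assumption, not ProbLP's actual data structure):

```python
# Max-value analysis sketch: because every AC node is a monotonically
# increasing function of non-negative inputs, a single pass with all
# indicators set to 1 yields each node's maximum possible value.
def node_maxima(nodes):
    mx = {}
    for nid, node in nodes.items():   # topological order assumed
        kind = node[0]
        if kind == 'leaf':
            mx[nid] = node[1]         # CPT parameter (constant)
        elif kind == 'ind':
            mx[nid] = 1.0             # indicator at its maximum
        elif kind == 'add':
            mx[nid] = sum(mx[c] for c in node[1])
        else:                         # 'mul'
            p = 1.0
            for c in node[1]:
                p *= mx[c]
            mx[nid] = p
    return mx
```

Each entry of the returned dictionary is the x_max / y_max used in bound (5) for the operator consuming that node.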
3.1.2. Floating-pt error estimation
Let $n_e$ and $m$ be the number of exponent and mantissa bits. We only consider normalized floating-pt numbers here. All numbers are assumed to be within the range of the given format, ensured by a method explained in detail in Section 3.1.4.
Float-pt leaf node: Let $x$ be the real value of a leaf in the AC and $\tilde{x}$ its floating-pt representation. The absolute error introduced by the floating-pt conversion can be bounded as described in (higham2002accuracy),

$$|\tilde{x} - x| \le 2^{-(m+1)}\,|x| \qquad (6)$$

which can be expressed alternatively as

$$\tilde{x} = x(1 + \delta), \quad |\delta| \le \epsilon_{fl} \qquad (7)$$
where $\epsilon_{fl} := 2^{-(m+1)}$.
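The relative-error bound of (6)–(7) can be checked numerically; `quantize_fl` is an illustrative helper that keeps $m$ mantissa bits (plus the hidden leading bit) of a positive double, it is not ProbLP code:

```python
import math

# Quantizing a positive double to m mantissa bits (round to nearest,
# normalized with a hidden leading bit) keeps the relative error
# below eps_fl = 2^-(m+1).
def quantize_fl(x, m):
    frac, exp = math.frexp(x)          # x = frac * 2**exp, frac in [0.5, 1)
    scale = 1 << (m + 1)               # m stored bits + hidden bit
    return math.ldexp(round(frac * scale) / scale, exp)

m = 10
for x in [0.1, 0.37, 123.456]:
    rel_err = abs(quantize_fl(x, m) - x) / x
    assert rel_err <= 2.0 ** -(m + 1)  # bound (6) holds
```

Values that are exactly representable, e.g. 0.5, pass through unchanged.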
Float-pt adder node: Let $\tilde{x}$ and $\tilde{y}$ be the float-pt versions of the adder inputs $x$ and $y$, let $z = x + y$ be the ideal output, and $\tilde{z}$ the output of the floating-pt adder. $\tilde{x}$ and $\tilde{y}$ can be represented as

$$\tilde{x} = x(1 + \delta_x), \quad \tilde{y} = y(1 + \delta_y) \qquad (8)$$

Here, $\delta_x$ and $\delta_y$ depend on the amount of error accumulated in $\tilde{x}$ and $\tilde{y}$, respectively. The bound on $\tilde{z}$ can be given as follows:

$$\tilde{z} = (\tilde{x} + \tilde{y})(1 + \delta_r), \quad |\delta_r| \le \epsilon_{fl} \qquad (9)$$
$$= \big(x(1+\delta_x) + y(1+\delta_y)\big)(1 + \delta_r) \qquad (10)$$

The error term $\delta_r$ in (9) is due to the rounding of the LSBs of the mantissa of the smaller input before the addition.
Float-pt multiplier node: Just as in the case of the adder, the inputs $\tilde{x}$ and $\tilde{y}$ can be represented as in (8). With that, the output $\tilde{z}$ of the floating-pt multiplier can be given as

$$\tilde{z} = \tilde{x}\,\tilde{y}\,(1 + \delta_r), \quad |\delta_r| \le \epsilon_{fl} \qquad (11)$$
$$= xy\,(1+\delta_x)(1+\delta_y)(1 + \delta_r) \qquad (12)$$

The error term $\delta_r$ in (11) is due to the rounding of the LSBs of the mantissa to fit the result in $m$ mantissa bits.
3.1.3. Error bound at the AC output
Equations (2), (3), (5), (6), (10) and (12) correspond to the error models shown in Figure 2. As these models generate the output in the same format as the inputs, they provide a way to recursively propagate the error from the leaves of an AC all the way up to its output node, by accumulating the error introduced in every adder and multiplier. Figure 3 shows an example of error propagation using the fixed-pt error models. This is performed as part of the fixed-pt and float-pt error analysis blocks of ProbLP shown in Figure 2.
The error propagation in fixed-pt arithmetic produces a bound of the form $|\tilde{p} - p| \le K_{fx}$, where $\tilde{p} - p$ is the absolute error in the output node, and $K_{fx}$ is a constant that depends on the size and structure of the AC, its parameters, and the number of fixed-pt bits. The constant $K_{fx}$ can be estimated recursively with our error models for any given AC. Similarly, the error propagation in floating-pt arithmetic produces a bound of the form $p\,(1-\epsilon_{fl})^{n} \le \tilde{p} \le p\,(1+\epsilon_{fl})^{n}$, where $\tilde{p}$ is the output of an AC with floating-pt operators, $p$ is the ideal output, $\epsilon_{fl}$ is a constant related to the number of floating-pt bits, and $n$ is a constant related to the size and structure of the AC. Again, the constant $n$ can be estimated recursively using the proposed models for any AC. Alternatively, the floating-pt bound can be expressed as $|\tilde{p} - p| \le K_{fl}\,p$ for some constant $K_{fl}$, i.e., a bounded relative error at the output.
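The recursive fixed-pt propagation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions (illustrative node encoding, 2-input multipliers, the largest node id being the output), not ProbLP's implementation: leaves contribute a half-LSB error, adders sum the input bounds, and multipliers scale them by the max-values of Section 3.1.1:

```python
# Recursive propagation of the fixed-pt error bound K_fx through an AC.
# Leaf error: eps = 2^-(n_f+1); adder: sum of input bounds; multiplier
# (2-input assumed): x_max*e_y + y_max*e_x + e_x*e_y + eps.
def fx_error_bound(nodes, n_f):
    eps = 2.0 ** -(n_f + 1)
    mx, err = {}, {}
    for nid, node in nodes.items():   # topological order assumed
        kind = node[0]
        if kind in ('leaf', 'ind'):
            mx[nid] = node[1] if kind == 'leaf' else 1.0
            err[nid] = eps if kind == 'leaf' else 0.0   # indicators exact
        elif kind == 'add':
            mx[nid] = sum(mx[c] for c in node[1])
            err[nid] = sum(err[c] for c in node[1])
        else:                         # 2-input 'mul'
            a, b = node[1]
            mx[nid] = mx[a] * mx[b]
            err[nid] = (mx[a] * err[b] + mx[b] * err[a]
                        + err[a] * err[b] + eps)
    return err[max(nodes)]            # simplification: last id = output
```

On the two-product, one-sum example AC from Section 2 with n_f = 10, this yields a bound of exactly 4·2⁻¹¹: eps per leaf, doubled in each multiplier, then summed in the adder.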
3.1.4. Number of integer or exponent bits
For the error models proposed in Sections 3.1.1 and 3.1.2 to be valid, the numbers encountered during the computation must stay within the range of the representation. This can be ensured by using an appropriate number of integer bits $n_i$ and exponent bits $n_e$ for fixed and floating point, respectively. Otherwise, the error in some of the probability evaluations would exceed the predicted bounds. It is hence important to automatically derive the required range of numbers for any given AC.
Max-value analysis: The largest number encountered in an AC can be derived by setting all indicator variables to 1, as explained in Section 3.1.1. Analyzing the internal AC data values of this query allows deriving the required $n_i$, resp. $n_e$, to avoid overflow.
Min-value analysis: The floating-pt models are also invalid in case of underflow. Hence, it is necessary to also estimate the smallest positive nonzero value in an AC. It can be proven that all nodes in an AC are at their respective minimum nonzero values when all indicator variables are set to 1 and the adders are replaced with minimum operators $\min(\cdot,\cdot)$. The resulting efficient AC evaluation allows ProbLP to derive a lower bound on the AC values, and to find the appropriate $n_e$ required to prevent underflow. The fixed-pt models remain valid even if the number of fraction bits is not enough to represent small values in the AC, so no special precautions regarding underflow are needed there.
In this way, ProbLP performs the max-value and min-value analyses to select an $n_i$, resp. $n_e$, that satisfies both requirements.
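The min-value analysis differs from the max-value pass only in the adders. A sketch, again with the illustrative node encoding (not ProbLP's actual code):

```python
# Min-value analysis sketch: with all indicators at 1 and every adder
# replaced by min, a single pass lower-bounds each node's smallest
# positive nonzero value, used to size the exponent field n_e.
def node_minima(nodes):
    mn = {}
    for nid, node in nodes.items():   # topological order assumed
        kind = node[0]
        if kind == 'leaf':
            mn[nid] = node[1]
        elif kind == 'ind':
            mn[nid] = 1.0
        elif kind == 'add':
            mn[nid] = min(mn[c] for c in node[1])   # adder -> min
        else:                         # 'mul'
            p = 1.0
            for c in node[1]:
                p *= mn[c]
            mn[nid] = p
    return mn
```

The exponent range must then reach down to roughly log2 of the smallest value returned, so that no intermediate result underflows.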
3.2. Bounds for probabilistic queries
As shown in Figure 2, ProbLP aims to estimate the optimal fixed-pt and float-pt bit widths for a given type of probabilistic query and error tolerance. However, the bounds derived so far apply only to a single AC evaluation. Some types of probabilistic queries require a combination of multiple AC evaluations. In this section, we derive bounds for two types of probabilistic queries: 1) marginal probability and MPE, and 2) conditional probability.
3.2.1. Marginal probability and MPE
Marginal probabilities and the most probable explanation (MPE) need only one AC evaluation. Hence, the bounds derived in Section 3.1.3 apply to these queries.
3.2.2. Conditional probability
The conditional probability $P(q|e)$ is evaluated by performing two AC evaluations, one for $P(q, e)$ and one for $P(e)$, followed by taking the ratio of the two results. ($P(q|e)$ can also be estimated with an upward and a downward pass on an AC followed by a division. We do not consider this explicitly, but similar error bounds are expected.)
Fixed-pt bounds: In the case of fixed-pt arithmetic, the absolute error in each of the two AC evaluations remains bounded by $K_{fx}$. The impact on the conditional probability can hence be given as

$$\tilde{P}(q|e) = \frac{P(q,e) + \delta_1}{P(e) + \delta_2}, \quad |\delta_1|, |\delta_2| \le K_{fx} \qquad (13)$$

Here, the maximum error is reached when $\delta_1 = K_{fx}$ and $\delta_2 = -K_{fx}$. In that case, the following equations show the impact on the absolute and relative error of the conditional probability:

$$\big|\tilde{P}(q|e) - P(q|e)\big| \le \frac{K_{fx}\,\big(P(e) + P(q,e)\big)}{P(e)\,\big(P(e) - K_{fx}\big)} \qquad (14)$$
$$\frac{\big|\tilde{P}(q|e) - P(q|e)\big|}{P(q|e)} \le \frac{K_{fx}\,\big(P(e) + P(q,e)\big)}{P(q,e)\,\big(P(e) - K_{fx}\big)} \qquad (15)$$

Equations (14) and (15) bound the absolute and relative error in the conditional query $P(q|e)$. The error in the numerator is scaled by $1/P(e)$ and $1/P(q,e)$, and these probabilities can become very small. Hence, a large number of fixed-pt bits is generally required to achieve a reasonable error bound, especially for the relative-error bound of (15). The absolute-error bound in (14) can be quantified by estimating the minimum possible value of $P(e)$ as described in Section 3.1.4, wherein adders are replaced with min operators.
As the denominator of (15) can become very small, fixed point is ill-suited when a relative error bound on conditional probabilities is required. Moreover, quantifying a bound for (15) is also not straightforward. Hence, ProbLP always chooses floating point for relative-error requirements on conditional probabilities.
Float-pt bounds: The impact of using float-pt arithmetic on the conditional probability can be given as

$$\tilde{P}(q|e) = \frac{P(q,e)\,(1+\delta_1)}{P(e)\,(1+\delta_2)}\,(1+\delta_{div}), \quad |\delta_{div}| \le \epsilon_{fl} \qquad (16)$$

In (16), the accumulated error factors $(1+\delta_1)$ and $(1+\delta_2)$ are bounded between $(1-\epsilon_{fl})^{n}$ and $(1+\epsilon_{fl})^{n}$. Even in the worst case, where one of them is at its lower bound while the other is at its upper bound, the floating-pt version of the conditional probability remains bounded as follows:

$$P(q|e)\,\frac{(1-\epsilon_{fl})^{n+1}}{(1+\epsilon_{fl})^{n}} \;\le\; \tilde{P}(q|e) \;\le\; P(q|e)\,\frac{(1+\epsilon_{fl})^{n+1}}{(1-\epsilon_{fl})^{n}} \qquad (17)$$

This ensures a bound on the relative error of $\tilde{P}(q|e)$.
3.3. Selecting the optimal representation
Sections 3.1 and 3.2 establish a method to evaluate error bounds for a given AC as a function of the number of bits. Next, ProbLP finds the least number of fixed-pt and float-pt bits needed for the given requirements. For this, it evaluates the bounds starting with 2 fraction bits and 2 mantissa bits, and increments them until the error requirement is satisfied. Then, it estimates the least number of integer and exponent bits required using the min- and max-value analyses explained in Section 3.1.4. In this way, ProbLP arrives at the optimal fixed-pt and float-pt representations shown in Figure 2.
Subsequently, the framework has to select between fixed pt and float pt. ProbLP selects the one with the lowest energy consumption, estimated using operator-level energy models. Energy models for the adders and multipliers were developed by synthesizing them with varying fraction/mantissa bits and integer/exponent bits in TSMC 65nm technology and extracting the post-synthesis energy consumption. The models were fitted to the simulation results using the least-squares method and are summarized in Table 1.
Table 1. Energy models (N: fixed-pt bits; M: mantissa bits)

Operator      | Energy (fJ)
------------- | --------------------
Fixed-pt add  | 7.8 N
Fixed-pt mult | 1.9 N log N
Float-pt add  | 44.74 (M+1)
Float-pt mult | 2.9 (M+1) log (M+1)
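Applying the Table 1 models to a whole AC amounts to multiplying the per-operator energies by the operator counts. In the sketch below the log base (assumed 2), the operator counts, and the bit widths are all illustrative assumptions:

```python
import math

# Total AC energy from the Table 1 operator models (fJ per operation).
# N = fixed-pt bits, M = mantissa bits; log base 2 is an assumption.
def fx_energy(n_add, n_mul, N):
    return n_add * 7.8 * N + n_mul * 1.9 * N * math.log2(N)

def fl_energy(n_add, n_mul, M):
    return (n_add * 44.74 * (M + 1)
            + n_mul * 2.9 * (M + 1) * math.log2(M + 1))

# Made-up AC with 500 adders and 800 multipliers:
n_add, n_mul = 500, 800
e_fx = fx_energy(n_add, n_mul, 16)   # e.g. 1 integer + 15 fraction bits
e_fl = fl_energy(n_add, n_mul, 14)   # e.g. 14 mantissa bits
chosen = 'fixed-pt' if e_fx <= e_fl else 'float-pt'
```

Because the float-pt adder model carries a much larger constant (44.74 vs 7.8), adder-heavy ACs tend to favor fixed point at comparable bit widths, as in this example.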
3.4. Automatic hardware generation
ProbLP suggests the most appropriate low-precision representation for the AC, but this may not translate into energy savings unless the hardware has custom arithmetic operators. To address this, ProbLP has an integrated hardware generator that produces custom, fully-pipelined parallel hardware consisting of arithmetic operators of the exact precision required to meet the user requirements. There are two major stages in the hardware-generation process. In the first stage, all AC operators with more than two inputs are decomposed into a tree of 2-input operators. An example of such a decomposition is shown in Figure 4, wherein the F operator is decomposed into a tree of F1, F2 and F3. In the second stage, the generator inserts pipeline registers after every operator. In some cases, it may have to insert multiple registers due to a mismatch in path timings, as shown in the path between A and G in Figure 4. The final output of ProbLP is Verilog code of the custom hardware.
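The first generation stage can be sketched as a balanced-tree decomposition. Gate names and the tuple netlist format below are made up for illustration; the real generator emits Verilog:

```python
# Decompose an n-input AC operator into a balanced tree of 2-input
# gates. Returns the gate netlist and the name of the tree's root.
def decompose(inputs, op='mul'):
    """gates are (name, op, in_a, in_b) tuples; root is the tree output."""
    level, gates, k = list(inputs), [], 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            k += 1
            name = f'{op}{k}'
            gates.append((name, op, level[i], level[i + 1]))
            nxt.append(name)
        if len(level) % 2:            # odd operand forwarded to next level
            nxt.append(level[-1])
        level = nxt
    return gates, level[0]

gates, root = decompose(['a', 'b', 'c', 'd', 'e'])
# 5 inputs -> 4 two-input gates, tree depth ceil(log2(5)) = 3
```

Operands forwarded past a level (like `e` above) are exactly the paths that later need extra pipeline registers to balance timing, as in the A-to-G path of Figure 4.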
Table 2. Optimal fixed-pt (n_i, n_f) and float-pt (n_e, n_m) representations with predicted energy in nJ (in parentheses), maximum observed error, post-synthesis energy, and 32b-float energy for the benchmark ACs

AC     | Query       | Error tol.    | Fixed-pt (energy) | Float-pt (energy) | Max. obs. err. | Energy (nJ) | 32b float (nJ)
------ | ----------- | ------------- | ----------------- | ----------------- | -------------- | ----------- | --------------
HAR    | Marg. prob. | abs. err 0.01 | 1, 15 (4.3)       | 9, 14 (6.7)       | 5.9x10         | 5.3         | 10.8
HAR    | Marg. prob. | rel. err 0.01 | 1, >64 ( - )      | 9, 14 (6.7)       | 1.0x10         | 7.2         |
HAR    | Cond. prob. | abs. err 0.01 | 1, >64 ( - )      | 9, 14 (6.7)       | 2.6x10         | 7.2         |
HAR    | Cond. prob. | rel. err 0.01 | -                 | 9, 14 (6.7)       | 1.0x10         | 7.2         |
UNIMIB | Marg. prob. | abs. err 0.01 | 1, 13 (0.4)       | 7, 12 (0.6)       | 4.9x10         | 0.34        | 0.89
UNIMIB | Cond. prob. | rel. err 0.01 | -                 | 7, 12 (0.6)       | 1.1x10         | 0.44        |
UIWADS | Marg. prob. | abs. err 0.01 | 1, 11 (0.06)      | 6, 10 (0.09)      | 1.3x10         | 0.06        | 0.18
UIWADS | Marg. prob. | rel. err 0.01 | 1, 47 (1.3)       | 6, 10 (0.09)      | 1.2x10         | 0.08        |
Alarm  | Marg. prob. | abs. err 0.01 | 1, 14 (2.2)       | 8, 13 (3.2)       | 2.2x10         | 2.43        | 5.37
Alarm  | Cond. prob. | rel. err 0.01 | -                 | 8, 13 (3.2)       | 2.8x10         | 3.18        |
4. Experimental results
We validate the functionality of ProbLP on Arithmetic Circuits targeting embedded sensing applications by performing two types of experiments on four datasets. Three of these datasets (HAR (anguita2013public), UNIMIB (app7101101), and UIWADS (casale2012personalization) in Table 2) correspond to activity- and user-identification applications on smartphones and therefore rely on the accurate estimation of a conditional probability of the form $P(q|e)$ to make threshold-based decisions. The fourth dataset (Alarm in Table 2 (beinlich1989alarm)) is from a patient-monitoring application and is often used as a standard Bayesian network benchmark.
The ACs used in this section are compiled using the ACE tool (darwicheace), with the cd06 and forceC2d options enabled. For the experiments on HAR, UNIMIB, and UIWADS, we trained a naive Bayes classifier on 60% of the data and used the rest for testing. The testing dataset for Alarm is generated by sampling 1000 instances from the trained network. In all the experiments, the leaf nodes of the BN were used as evidence nodes $e$ and one of the root nodes in the BN (the class node in the case of the classifiers) as the query node $q$.
4.1. Validation of bounds
This experiment confirms the validity of the derived error bounds for the AC compiled from the Alarm network. The experimental setting is as follows:
Fixed-pt: The number of integer bits is set to 1 based on the max-value analysis, and the number of fraction bits is varied from 8 to 40.
Float-pt: The number of exponent bits is set to 8 based on the max- and min-value analyses, and the number of mantissa bits is varied from 8 to 40.
Figure 5 shows the maximum and mean error on the test set, which confirms the validity of the bounds.
4.2. Overall performance
In this experiment, the complete ProbLP framework is deployed to choose an appropriate arithmetic representation and generate hardware for different ACs and given user requirements. The results of the experiment are summarized in Table 2. Experiments are performed for all combinations of queries and types of error tolerances for the HAR AC, and for two combinations for the rest of the ACs. The table shows the optimal fixed-pt and float-pt representations that meet the target error tolerance. Among these, ProbLP selects the one with the lower predicted energy, highlighted in bold. The resulting maximum error observed on the test sets remains within the required error tolerance. The post-synthesis energy consumption matches the energy predicted by the framework well. The energy consumption of hardware with a 32b float (E=8, M=23, 1 sign bit) is also shown for comparison. Note that the choice of 0.01 error tolerance is arbitrary; higher energy efficiency can be achieved for relaxed error tolerances.
5. Conclusion
Probabilistic inference with Arithmetic Circuits can be made energy-efficient by tolerating a small amount of error in the output probabilities and by designing custom hardware to exploit this error tolerance. This paper therefore proposes ProbLP, a holistic framework to automate the design of low-precision custom hardware for ACs. The framework estimates worst-case error bounds for ACs, taking into account the error incurred in reduced-precision fixed- and floating-point operators. It estimates the impact of these errors on different types of probabilistic queries and finds the least number of fixed-pt and float-pt bits required to meet the error tolerance. Subsequently, it chooses between the fixed-pt and float-pt representations based on energy models developed for this purpose. Next, ProbLP automatically converts an AC to pipelined logic with custom arithmetic operators. The analytically derived error bounds are validated for varying fixed- and float-pt bit widths. Finally, the ProbLP framework is applied to several embedded sensing benchmarks, confirming that the error requirements are met and that the energy consumption of the automatically generated hardware matches the prediction.