Probabilistic Analysis Based On Symbolic Game Semantics and Model Counting

Probabilistic program analysis aims to quantify the probability that a given program satisfies a required property. It has many potential applications, from program understanding and debugging to computing program reliability, compiler optimizations and quantitative information flow analysis for security. In these situations, it is usually more relevant to quantify the probability of satisfying/violating a given property than to just assess the possibility of such events occurring. In this work, we introduce an approach for probabilistic analysis of open programs (i.e. programs with undefined identifiers) based on game semantics and model counting. We use a symbolic representation of algorithmic game semantics to collect the symbolic constraints on the input data (context) that lead to the occurrence of the target events (e.g. satisfaction/violation of a given property). The constraints are then analyzed to quantify how likely an input is to satisfy them. We use model counting techniques to count the number of solutions (from a bounded integer domain) that satisfy given constraints. These counts are then used to assign probabilities to program executions and to assess the probability for the target event to occur at the desired level of confidence. Finally, we present the results of applying our approach to several interesting examples and illustrate the benefits it may offer.



1 Introduction

In order to understand program behaviour better, apart from finding out whether a behaviour (execution) can successfully terminate or not, we often need to know how likely a behaviour is to occur. In particular, we want to distinguish between what is possible behaviour (even with extremely low probability) and what is likely behaviour (possible with higher probability). In this work, we show how to calculate the probability of behaviours and estimate the reliability of programs by using a combination of (symbolic) game semantics and model counting.

Game semantics [2, 18] is a technique for building models of programs that are fully abstract, i.e. sound and complete with respect to observational equivalence. The notion of observational equivalence relies on comparing the outcomes of placing programs in all possible syntactic contexts (environments). Its algorithmic subarea [16, 9, 8, 19] aims to apply game semantics models to software verification by providing concrete automata-based representations for them. The key characteristics of game semantics models are the following. They provide precise and compact summaries of observable (input and output) program behaviour, without explicit reference to a state (state manipulations are hidden). There is a model for any open program with free (undefined) identifiers, such as calls to library functions. Finally, the models are generated inductively (compositionally) on the structure of programs, which is often essential for the modular analysis of larger programs. Symbolic representation of game semantics models [11] extends the (standard) regular-language representation [16] by using symbolic data values instead of concrete ones for the inputs. This allows us to obtain compact models of programs by using finite-state symbolic automata. Each complete symbolic play (accepting word) in the model corresponds to a program execution (path), and it is guarded by a conjunction of constraints on the symbols, known as the play condition, which indicates under what conditions this play (word, execution) is feasible. If the play condition is satisfied by some concrete values for symbols, then they represent input values that will allow the execution to follow the specific path through the code. For the generation of symbolic game models where each play is associated with a play condition, we use the Symbolic GameChecker tool (https://aleksdimovski.github.io/symbolicgc.html) [11].
Model counting is the problem of determining the number of solutions of a given constraint (formula). The LattE tool (http://www.math.ucdavis.edu/~latte, UC Davis Mathematics) [21] implements state-of-the-art algorithms for computing volumes, both real and integral, of convex polytopes, as well as for integrating functions over those polytopes. In particular, we use model counting techniques and the LattE tool to estimate algorithmically the exact number of points of a bounded (possibly very large) discrete domain that satisfy given linear constraints.
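For intuition, counting the integer points of a bounded domain that satisfy a constraint can be sketched by brute-force enumeration. This is only an illustration: LattE uses far more efficient polytope algorithms, and the symbol names, domains, and constraint below are purely illustrative assumptions.

```python
from itertools import product

def count_models(constraint, domains):
    """Count assignments over a bounded integer domain that satisfy
    a constraint (a predicate over one value per symbolic name)."""
    names = sorted(domains)
    return sum(
        1
        for values in product(*(domains[n] for n in names))
        if constraint(dict(zip(names, values)))
    )

# Illustrative: symbols x, y over {0..9}, constraint x + y <= 5.
domains = {"x": range(10), "y": range(10)}
count = count_models(lambda v: v["x"] + v["y"] <= 5, domains)
print(count)  # 21 of the 100 points satisfy the constraint
```

Such a brute-force counter is only practical for small domains, which is exactly why exact polytope-based counters like LattE are used in the actual analysis.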

In this paper, we describe a method based on symbolic game models and model counting for performing a specific type of quantitative analysis – the calculation of play probabilities and the program reliability. Calculating the probability of a symbolic play (path) involves counting the number of solutions to the play condition (by using model counting), and dividing it by the total space of values of the inputs (context). We assume that the input values are uniformly distributed within their finite discrete domain. We label each (complete) symbolic play with either success or failure depending on whether a designated command is executed or not. Since the set of play conditions produced by the symbolic game model is a complete partition of the given finite input domain, we can compute the reliability of the program as the probability of satisfying any of the successful play conditions. To account for cycles (infinite behaviours) in the model, we use bounded analysis. For a “” command, a bound is set for the exploration depth (i.e. the number of re-visited states). For an undefined (first-order) function, we restrict the number of times the function can call its arguments when placed in the given bounded context, thus obtaining a finite input domain.

The main contributions of this work are: (1) A demonstration of how to compute path probabilities using (symbolic) algorithmic game semantics and model counting; (2) An application of our approach to calculate the reliability of open programs; (3) A prototype implementation as part of Symbolic GameChecker.

2 Programming Language

The use of meta-languages is very common in the semantics community. The semantic model is defined for a meta-language, and a real programming language (C, ML, etc.) can be studied by translating it into this meta-language and using the induced model. Here we consider Idealized Algol (IA), a well-studied meta-language introduced by Reynolds [23]. IA combines functional (typed call-by-name λ-calculus) and imperative programming. For the purpose of obtaining an automata-based representation of game semantics, we shall consider its second-order recursion-free fragment (IA for short). Its types are:

where , , and stand for data types, base types, and first-order function types, respectively. The syntax of the language is:

where ranges over a countable set of identifiers, and ranges over constants of type , which includes integers () and booleans (). The standard arithmetic-logic operations are employed, as well as the usual imperative constructs: sequential composition (), conditional (), iteration (), assignment (), de-referencing operator () which is used for reading the value stored in a variable, a “do-nothing” command (), and a divergence command (). Block-allocated local variables are introduced by a construct, which initializes a variable and makes it local to a given block. They are also called “good” (storage) variables since what is read from a variable is the last value written into it. The construct is used for creating so-called “bad” variables, which do not behave like genuine storage variables [17]. There are also standard functional constructs for function definition and application. Well-typed terms are given by typing judgements of the form , where is a type context consisting of a finite number of typed free identifiers. Typing rules are given in [2, 23].

The operational semantics is defined by a big-step reduction relation:

where is a term in which all free identifiers from are variables, i.e. , and , represent the state before and after reduction. The state is a function assigning data values to the variables in . Canonical forms (values) are defined by . Reduction rules are standard (see [2, 23] for details). Given a closed term , which has no free identifiers, we say that terminates if . We define a program context to be a term with zero or more holes in it, such that if is a term of the same type as the hole then is a well-typed closed term of type , i.e. . We say that a term approximates a term , written , if and only if for all contexts , such that and , if terminates then terminates. If two terms approximate each other they are considered observationally equivalent, denoted by . In general, observational equivalence is very difficult to reason about due to the universal quantification over all syntactic contexts in which the terms can be placed.

3 Symbolic Game Models

We now give a brief overview of symbolic representation of the algorithmic game semantics for IA [11]. Let be a countable set of symbolic names, ranged over by , , . For any finite , the function returns a minimal symbolic name which does not occur in , and sets . A minimal symbolic name not in is the one which occurs earliest in a fixed enumeration of all possible symbolic names. Let be a set of expressions, ranged over by , generated by data values (), symbols (), and arithmetic-logic operations (). We use to range over arithmetic expressions () and over boolean expressions ().

Let be an alphabet of letters. We define a symbolic alphabet induced by as follows:

The letters of the form are called input symbols. They represent a mechanism for dynamically generating new symbolic names. More specifically, creates a stream of fresh symbolic names, binding to the next symbol from its stream, , whenever is evaluated (met). We use to range over . Next we define a guarded alphabet induced by as the set of pairs of boolean conditions and symbolic letters:

A guarded letter is taken only if its condition evaluates to true; otherwise it is the constant (the language of is ), i.e. . We use to range over . We will often write only for the guarded letter . A word over can be represented as a pair , where is a boolean condition and is a word of symbolic letters.
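As an illustration, a guarded word can be represented as a pair of a play condition and a word of symbolic letters, and its feasibility checked by searching the finite domain for a satisfying assignment. The symbol names, domains, and move names below are assumptions made for the sketch, not the paper's notation.

```python
from itertools import product

def feasible(play_condition, domains):
    """A play is feasible iff some assignment of concrete values to its
    symbolic names satisfies the play condition."""
    names = sorted(domains)
    return any(
        play_condition(dict(zip(names, vals)))
        for vals in product(*(domains[n] for n in names))
    )

# A guarded word as (condition, word of symbolic letters); illustrative names.
play = (lambda v: v["z1"] > v["z2"], ["q", "q_n", "z1", "q_n", "z2", "z1"])
cond, word = play
print(feasible(cond, {"z1": range(4), "z2": range(4)}))  # True
```

An infeasible play would be one whose condition (e.g. `z1 < 0` over a non-negative domain) has no satisfying assignment, so `feasible` returns `False`.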

We now describe how IA terms can be translated into symbolic regular languages and symbolic automata. Each type is interpreted by an alphabet of moves defined as follows:

where , , and denotes a disjoint union of alphabets. Function types are tagged by a superscript to keep record from which type, i.e. which component of the disjoint union, each move comes from. The letters in the alphabet represent the moves, i.e. observable actions that a term of type can perform. Each move is either a question (a demand for information) or an answer (a supply of information). For expressions in , there is a question move q to ask for the value of the expression, and values from to answer the question. For commands, there is a question move run to initiate a command, and an answer move done to signal successful termination of a command. For variables, there are question moves for writing to the variable, , which are acknowledged by the answer move ok; and there is a question move read for reading from the variable, which is answered by a value from .

For any term, we define a (symbolic) regular-language which represents its game semantics, i.e. its set of complete symbolic plays. A play is a sequence of moves played by two players in turns: P (Player), which represents the term being modeled, and O (Opponent), which represents its context. Every (complete) symbolic play represents the observable effects of a completed execution (path) of the given term. It is given as a guarded word , where is also called the play condition. The assumptions under which a symbolic play is feasible are recorded in its play condition. For infeasible plays, the play condition is unsatisfiable, thus no assignment of concrete values to symbolic names exists that makes the play condition true. The regular expression for , denoted as , is defined over the guarded alphabet:

where moves corresponding to types of free identifiers are tagged with their names to indicate the origin of moves. Hence, contains only observable moves associated with types of free identifiers from (suitably tagged) as well as moves of the top-level type .

The representation of constants is standard:

For example, an integer or boolean constant is modeled by a play where the initial question q (“what is the value of this expression?”) is answered by the value of that constant .

Free identifiers are represented by the so-called copy-cat regular expressions, which contain all possible behaviours of terms of that type, thus providing the most general context for an open term. Thus,

(1)

When a call-by-name non-local function with arguments is called, it may evaluate any of its arguments, zero or more times, in an arbitrary order (hence, the Kleene closure *) and then it returns any allowable answer from its result type. Recall that the input symbol creates a stream of fresh symbolic names for each instantiation of . Thus, whenever is met in a play, the mechanism for fresh symbol generation is used to dynamically instantiate it with a new fresh symbolic name from its stream, which binds all occurrences of that follow in the play until a new is met which overrides the previous symbolic name with the next symbolic name taken from its stream. For example, consider the term , where is an undefined function with two arguments. Its symbolic model is:

(2)

The play corresponding to function “” which evaluates its first argument two times, after instantiating its input symbols and is given as: , where and are two different symbolic names used to denote values of the first argument when it is evaluated the first and the second time, respectively. Therefore, we are using the streaming symbol to create different symbolic names so that we can produce distinct values (independent from one another) if is evaluated multiple times during the execution. Note that letters tagged with represent the actions of calling and returning from the function , while letters tagged with (resp. ) are the actions caused by evaluating the first (resp. second) argument of .

Table 1: Symbolic representations of some language constructs

The representations of some language constructs “” are given in Table 1. Observe that letter conditions other than occur only in plays corresponding to the “” and “” constructs. In the case of the “” construct, when the value of the first argument, given by the symbol , is true then its second argument is run; otherwise, if is true then its third argument is run. A composite term built out of a language construct “” and subterms is interpreted by composing the regular expressions for and the regular expression for “”. For example, we have:

where is defined in Table 1. Composition of regular expressions () is defined as “parallel composition followed by hiding” in CSP style [2]. Parallel composition matches (synchronizes) the moves in the shared types, whereas hiding deletes all moves from the shared types. Conditions of the shared (interacting) moves (guarded letters) in the composition are conjoined, along with the condition that their symbolic letters are equal [11]. The regular expression in Table 1 is used to impose the good variable behaviour on a local variable introduced using . Note that is the initial value of , and is a symbol used to track the current value of . The behaves as a storage cell and plays the most recently written value in in response to read; if no value has been written yet, it answers read with the initial value . The model is obtained by constraining the model of , , only to those plays where exhibits the good variable behaviour described by , and then by deleting (hiding) all moves associated with , since is local and so not visible outside of the term [11].

The following formal results were proved in [11]. We define the effective alphabet of a regular expression to be the set of all letters that appear in the language denoted by that regular expression. The effective alphabet of a regular expression representing any term contains only a finite subset of letters from , which includes all constants, symbols, and expressions used for interpreting free identifiers, constructs, and local variables in .

Proposition 1

For any IA term, the set is a (symbolic) regular-language without infinite summations defined over its effective finite alphabet. Moreover, a finite-state symbolic automaton which recognizes it is effectively constructible.

Suppose that there is a special free identifier of type . We say that a term is safe iff (where denotes the capture-free substitution of for in ); otherwise we say that a term is unsafe. We say that a play is safe if it does not contain moves from ; otherwise we say that the play is unsafe.

Proposition 2

A term is safe iff all plays in are safe.

For example, , so this term is unsafe since its model contains an unsafe play.

Example 3

Consider the term :

The model for this term is given in Fig. 1 (for simplicity, in examples we omit angle brackets in superscript tags of moves). The dashed edges indicate moves of the environment (O) and solid edges moves of the term (P); they serve only as a visual aid to the reader. Accepting states are designated by an interior circle. Observe that the term communicates with its environment using the non-local identifiers and , so the model represents only the actions associated with and as well as with the top-level type . The input symbol is used to keep track of the current value of the local variable (note that occurs only in the conditional parts of plays). Each time the term (P) asks for a value of with the move , the environment (O) provides a new fresh symbol for it. Note that we consider all possible environments (contexts) in which a term can be placed; therefore, the undefined expression may obtain a different value at each call in the above term [16]. At this point, the term (P) has three possible options depending on the current values of the symbols and : it can terminate successfully with done; it can execute and terminate; or it can run the assignment and ask for a new value of .

Figure 1: The symbolic game model for .

4 Calculating Success and Failure Probabilities

In this section, we define the success and failure probability of terms, and show how they can be automatically calculated using symbolic game models and model counting. We also show how to cope with cases that introduce infinite behaviours.

4.1 Definition

We define the success probability as the probability that a term terminates successfully without hitting any failure, such as running the command. On the other hand, the failure probability is the probability that a term hits a failure during its execution. The resulting symbolic game model is a set of symbolic plays (words), each with a play condition. Some of these plays are unsafe (i.e. lead to a failure, abortion), whereas some of them are safe (i.e. lead to a successful termination without abortion). The plays are therefore classified in two sets: , which contains safe plays, and , which contains unsafe plays.

Our discussion focuses on the case of computing probabilities for terms that have finite input domains for all their plays (executions). This is achieved by constraining all identifiers from to be of types in which only finite sets of basic data values are used. For example, we may consider only the basic types over and for any . We also need to bound the input domain when undefined (first-order) functions are used; this case is handled separately in Section 4.2. Finally, we restrict our attention to play conditions expressed as linear integer arithmetic (LIA) constraints over symbols whose values are uniformly distributed over their finite input domain.

Given a symbolic play , let be the total space of possible values in its finite input domain and let be its play condition (constraint). We now show how to calculate the probability of occurring, denoted . We use the LattE tool to compute the number of elements of that satisfy , denoted . The size of , denoted , is the product of the domain sizes of all symbols instantiated in , which correspond to all calls of free identifiers of types in which data values are used. Thus, we have: and , where if is a symbol that represents a value from the finite domain . Note that the size of the input domain (context) for each play can be different, and depends on how many symbols have been instantiated in that correspond to the data type . The play conditions associated with plays from and define disjoint input sets and cover the whole finite input domain, thus forming a complete partition of it. Finally, we define the success probability (resp., failure probability) as the probability of evaluating the term within a context (input) that enables all safe (resp., unsafe) plays:

(3)

Note that .
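The probability calculation above can be sketched as follows, with a brute-force model counter standing in for LattE. The single symbol, its domain, and the safe/unsafe conditions are illustrative assumptions, not taken from the paper's examples.

```python
from itertools import product
from fractions import Fraction

def play_probability(condition, domains):
    """p(w) = #(solutions of the play condition) / |input domain of w|,
    assuming inputs are uniformly distributed over their finite domain."""
    names = sorted(domains)
    total = 1
    for n in names:
        total *= len(domains[n])
    sat = sum(
        1
        for vals in product(*(domains[n] for n in names))
        if condition(dict(zip(names, vals)))
    )
    return Fraction(sat, total)

# Illustrative: one symbol z over {0..9}; safe iff z > 4, unsafe otherwise.
dom = {"z": range(10)}
p_succ = play_probability(lambda v: v["z"] > 4, dom)
p_fail = play_probability(lambda v: not (v["z"] > 4), dom)
print(p_succ, p_fail)  # 1/2 1/2
assert p_succ + p_fail == 1  # the play conditions partition the domain
```

Because the play conditions of safe and unsafe plays partition the input domain, the success and failure probabilities obtained this way always sum to one (in the absence of grey plays, discussed in Section 4.2).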

Example 4

Consider the term :

Its symbolic game model is:

Suppose that and that the possible values for are independently and uniformly distributed across this range. Thus, after instantiation of the input symbol , there is one safe play () and one unsafe play (). The safe (resp., unsafe) play condition is: (resp., ). Thus, we obtain and .

We use model counting and the LattE tool [21] to determine the number of solutions of a given constraint. LattE accepts LIA constraints expressed as a system of linear inequalities, each of which defines a hyperplane, encoded as the matrix inequality , where is an matrix of coefficients and is an column vector of constants. Most LIA constraints can easily be converted into the form . For example, and can be flipped by multiplying both sides by , and strict inequalities can be converted by decrementing the constant . In LattE, equalities can be expressed directly. If we have disequalities , they can be handled by counting a set of constraints that encode all possible solutions. For example, the constraint is handled by finding the sum of solutions for and . For a system , where is an matrix and is an column vector, the input LattE file is:

For example, the constraint “” from Example 4 results in the following (hyperplane) H-representation for LattE:

where the first line indicates the matrix size: the number of inequalities by the number of variables plus one. The next two inequalities encode the max and min values for the symbol based on its data type. The last inequality expresses the constraint: (i.e. ). LattE reports that there are exactly 5 points that satisfy the above inequalities ().
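A minimal sketch of checking such an H-representation by brute force, useful for cross-checking a count on small domains. We assume, following the reading of the example above, that each row (b, c1, …, cn) encodes the inequality b + c1·x1 + … + cn·xn ≥ 0; the concrete domain {0..9} and the constraint z > 4 are illustrative.

```python
from itertools import product

def count_points(rows, ranges):
    """Count integer points of a box that satisfy every H-representation
    row (b, c1, .., cn), each encoding b + c1*x1 + .. + cn*xn >= 0."""
    return sum(
        1
        for pt in product(*ranges)
        if all(b + sum(c * x for c, x in zip(cs, pt)) >= 0
               for b, *cs in rows)
    )

# z in {0..9}: rows encode z <= 9, z >= 0, and z > 4 (i.e. z - 5 >= 0).
rows = [(9, -1), (0, 1), (-5, 1)]
print(count_points(rows, [range(10)]))  # 5, matching the count reported above
```

This only mirrors the semantics of the input file for sanity checks; LattE itself counts lattice points of the polytope without enumerating the domain.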

4.2 Bounded Analysis

The presence of the “” command and of free identifiers of function type (i.e. undefined functions) introduces infinite behaviours (cycles) in our model. Hence, suitable analysis strategies are required for handling them in order to compute the success and failure probabilities. In the case of the “” command, the source of infinite behaviour is the term being modeled, but the context is still finite. On the other hand, in the case of undefined (first-order) functions, the source of infinite behaviour is the context in which that function can be placed (e.g. the function may call its arguments infinitely many times), so the context is unbounded in this case. This is the reason why we have two different strategies to cope with “” and undefined functions.

The command.

The solution is based on bounded exploration: a (user-defined) bound is set for the search depth (i.e. the number of times a state can be re-visited). When the bound is reached the search backtracks. Intuitively, the bound represents the number of iterations of the -loop and so we have the following bounded definition for (instead of the one in Table 1):

In this setting the search is no longer complete, and besides safe and unsafe plays, a new set of plays is collected for traces interrupted before completing the search. We call this set of plays grey and label it as . We can define analogously to the other sets as shown in Eqn. (3). The three sets of play conditions associated with plays in , , and are disjoint and constitute a complete partition of the entire finite input domain. Hence, . The intuitive meaning of is to quantify the plays of for which neither safety nor unsafety has been revealed at the current exploration depth. This is a measure of the confidence we can put in our success (resp., failure) estimate obtained within the given exploration bound: . means that the search is complete, i.e. for each input we can state whether it leads to a safe or an unsafe execution. As the exploration depth increases, the confidence grows, revealing more accurate safe (resp., unsafe) predictions.
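The bookkeeping for the three play sets can be sketched as follows. The per-play probabilities fed in are illustrative (they echo the 1/10 and 18/100 values appearing in Example 5 below), and the function names are our own.

```python
from fractions import Fraction

def bounded_estimate(p_safe_plays, p_unsafe_plays):
    """Combine per-play probabilities from a depth-bounded search.
    Whatever probability mass is not classified safe or unsafe belongs
    to grey (interrupted) plays; confidence = 1 - p_grey measures how
    much the success/failure estimates could still move."""
    p_succ = sum(p_safe_plays, Fraction(0))
    p_fail = sum(p_unsafe_plays, Fraction(0))
    p_grey = 1 - p_succ - p_fail
    return p_succ, p_fail, p_grey, 1 - p_grey

# Illustrative numbers: two safe plays found so far, no unsafe ones.
succ, fail, grey, conf = bounded_estimate(
    [Fraction(1, 10), Fraction(18, 100)], [Fraction(0)])
print(succ, fail, grey, conf)  # 7/25 0 18/25 7/25
assert succ + fail + grey == 1  # the three sets partition the domain
```

With confidence 1 the search is complete; increasing the exploration depth moves mass out of the grey set into the safe or unsafe estimates.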

Example 5

Let us reconsider the term from Example 3. Suppose that is of type . We will now calculate the values of , , , and , for different exploration depths . Let . This means the state ⓓ from its symbolic model given in Fig. 1 can be visited only once (i.e. ⓓ cannot be re-visited). Let be the symbol name instantiated for . In this case, there is one unsafe play: , and one safe play: . The condition of the unsafe play is unsatisfiable (note ) and so ; whereas the condition of the safe play is satisfiable with only one solution for and so . For , the state ⓓ needs to be re-explored so and .

Let . This means the state ⓓ in Fig. 1 can be re-visited once. Let and be the symbol names instantiated when is evaluated the first and the second time, respectively. In this case, there are two unsatisfiable unsafe plays and two safe plays. The first safe play is from the previous iteration corresponding to with probability . The second safe play is: , with the condition: , which has 18 solutions: for and . Thus, ; ; ; and .

Let and let , , be the symbol names instantiated the first, the second, and the third time when is met, respectively. We obtain unsafe plays when , , and , and so we have , , , and . For , we have , , , and .

Undefined functions.

Recall the representation of undefined functions in Eqn. (1). The ‘generic behaviour’ of a call-by-name function is, when called by its context, to perform some sequence of calls to its arguments, and then to return a result. Since the number of times the function’s arguments are called can be arbitrary (even infinite, see the Kleene closure in Eqn. (1)), the corresponding input domain is not finite. One solution is to place numeric bounds on the number of times an undefined function can call its arguments. For any integer , we define as a term which can be placed into contexts where any of its first-order free identifiers from can call its arguments at most times.

For example, the interpretation of now becomes:

(4)

Thus, we now use the bound instead of the Kleene closure *, which is used in the general case given in Eqn. (2). Let us calculate the sizes of input domains corresponding to individual plays from the above model in Eqn. (4). Assume that we work with the finite integer domain . The play corresponds to a function “” which does not evaluate its arguments at all (a non-strict function), and so there are 10 different instantiations of since . Note that if the play condition is true, which means that all instantiations of are feasible, then . If “” evaluates its arguments once, then we have two plays: (“” evaluates its first argument) with different instantiations corresponding to , and (“” evaluates its second argument) with different instantiations corresponding to . For a function “” that calls its arguments times in any order, we have plays, each with different instantiations. The total number of symbolic plays is .

In general, for a play where contains calls to an undefined function with arguments, we have:

(5)

Note that if the undefined function has 1 argument, then the total number of symbolic plays is . When the play condition is true, then .
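The counting above can be sketched as follows, under assumptions spelled out in the comment: a function with k arguments that may call them at most N times over a base domain of size n, with one fresh symbol per argument call plus one for the result. The closed forms are our reading of the worked numbers above and should be treated as an assumption.

```python
def bounded_context_counts(k, N, n):
    """For an undefined call-by-name function with k arguments that may
    call them at most N times, over a base domain of size n:
    - there are k**i distinct call orders of length i, hence one play each;
    - a play with i argument calls fixes i argument values plus one
      result value, so its input domain has n**(i + 1) elements.
    Returns (total number of symbolic plays, total input-domain size)."""
    plays = sum(k**i for i in range(N + 1))
    domain = sum(k**i * n**(i + 1) for i in range(N + 1))
    return plays, domain

# Illustrative: 2 arguments, bound 1, domain {0..9}:
print(bounded_context_counts(2, 1, 10))  # (3, 210)
```

For a one-argument function this gives N + 1 symbolic plays, consistent with the remark above.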

Example 6

Consider the term:

where we bound the size of the context on definitions of “”, which can call its argument at most 4 times. Note that “” has 1 argument and is called once in the above term. The symbolic model of the above term is:

For the contexts corresponding to “” which does not call its argument at all (), the unsafe behaviour is exercised when the value returned from is , i.e. the failure probability is . When the function “” calls its argument once, the variable is incremented once () and so the failure probability is . For the contexts when “” calls its argument twice (), is run with the likelihood ; when “” calls its argument three times () the failure probability is ; whereas when “” calls its argument four times (), the failure probability is ( is unsatisfiable). Therefore, for , the failure probability is ; whereas the success probability is .

When , the failure probability is , and the success probability is . For , the failure probability is , and the success probability is .

5 Implementation

We have extended the Symbolic GameChecker tool [11] to implement our approach for performing probabilistic analysis of open terms. The basic tool [11] converts any IA term into a symbolic automaton representing its game semantics, and then explores the automaton for unsafe traces (plays). It calls an external SMT solver, Yices [13], to determine satisfiability of play conditions. The extended tool performs a bounded probabilistic analysis on the obtained symbolic automaton in order to determine the success and failure probabilities of the input term. Instead of an SMT solver, the extended tool calls a model counter, LattE [21], to determine the number of solutions to play conditions. We now illustrate our tool with an example. The tool, further examples and reports on how they execute are available from: https://aleksdimovski.github.io/symbolicgc.html (version for probabilistic analysis).

Consider the following version of the linear search algorithm: The meta variable represents the size of array , and represents the domain size of input expressions and