 # Hybrid Statistical Estimation of Mutual Information and its Application to Information Flow

Analysis of a probabilistic system often requires learning the joint probability distribution of its random variables. Computing the exact distribution usually requires an exhaustive, precise analysis of all executions of the system. To avoid the high computational cost of such an exhaustive search, statistical analysis has been studied to efficiently obtain approximate estimates by analyzing only a small but representative subset of the system's behavior. In this paper we propose a hybrid statistical estimation method that combines precise and statistical analyses to estimate mutual information, Shannon entropy, and conditional entropy, together with their confidence intervals. We show how to combine analyses of different components of a discrete system, each with different accuracy, to obtain an estimate for the whole system. The new method performs weighted statistical analysis with different sample sizes over different components and dynamically finds their optimal sample sizes. Moreover, it can reduce sample sizes by using prior knowledge about systems and a new abstraction-then-sampling technique based on qualitative analysis. To apply the method to the source code of a system, we show how to decompose the code into components and determine the analysis method for each component, by overviewing the implementation of these techniques in the HyLeak tool. We demonstrate with case studies that the new method outperforms the state of the art in quantifying information leakage.


## 1 Introduction

In modeling and analyzing software and hardware systems, the statistical approach is often useful to evaluate quantitative aspects of the behaviors of the systems. In particular, probabilistic systems with complicated internal structures can be approximately and efficiently modeled and analyzed. For instance, statistical model checking has widely been used to verify quantitative properties of many kinds of probabilistic systems [LDB10].

The statistical analysis of a probabilistic system is usually considered a black-box testing approach, in which the analyst requires no prior knowledge of the internal structure of the system. The analyst runs the system many times and records the execution traces to construct an approximate model of the system. Even when the formal specification or a precise model of the system is not provided, statistical analysis can be applied directly as long as the analyst can execute the black-box implementation. Due to this random sampling of the system, statistical analysis provides only approximate estimates. However, it can evaluate the precision and accuracy of the analysis, for instance by providing confidence intervals for the estimated values.

One of the important challenges in statistical analysis is to estimate entropy-based properties in probabilistic systems. For example, statistical methods [CCG10, CKN13, CKNP13, CKN14, BP14] have been studied for quantitative information flow analysis  [CHM01, KB07, Mal07, CPP08], which estimates an entropy-based property to quantify the leakage of confidential information in a system. More specifically, the analysis estimates mutual information or other properties between two random variables on the secrets and on the observable outputs in the system to measure the amount of information that is inferable about the secret by observing the output. The main technical difficulties in the estimation of entropy-based properties are:

1. to efficiently compute large matrices that represent probability distributions, and

2. to provide a statistical method for correcting the bias of the estimate and computing a confidence interval to evaluate the accuracy of the estimation.

To overcome these difficulties, we propose a method for statistically estimating mutual information, one of the most popular entropy-based properties, in discrete systems. The new method, called hybrid statistical estimation method, integrates black-box statistical analysis and white-box precise analysis, exploiting the advantages of both. More specifically, this method employs some prior knowledge on the system and performs precise analysis (e.g., static analysis of the source code or specification) on some components of the system. Since precise analysis computes the exact sub-probability distributions of the components, the hybrid method using precise analysis is more accurate than statistical analysis alone.

Moreover, the new method can combine multiple statistical analyses on different components of the system to improve the accuracy and efficiency of the estimation. This is based on our new theoretical results that extend and generalize previous work [Mod89, Bri04, CCG10] on purely statistical estimation. As far as we know, this is the first work on a hybrid method for estimating entropy-based properties and their confidence intervals. Note that our approach assumes that the system has discrete inputs and outputs and behaves deterministically or probabilistically; it can also be applied to non-deterministic systems if the non-determinism has been resolved by schedulers.

To illustrate the method we propose, Fig. 3 presents an example of a probabilistic program (Fig. 3(a)) with discrete input and output, built up from three overlapping components, together with the corresponding joint probability distribution (Fig. 3(b)). To estimate the full joint distribution, the analyst computes the joint sub-distribution of one component by precise analysis, estimates those of the other two components by statistical analysis, and then combines these sub-distributions. Since the statistical analysis is based on random sampling of execution traces, the empirical sub-distributions of the two sampled components differ from the true ones, while the sub-distribution of the precisely analyzed component is exact. From these approximate and precise sub-distributions, the proposed method can estimate the mutual information for the entire system and evaluate its accuracy by providing a confidence interval. Owing to the combination of different kinds of analyses (with possibly different parameters such as sample sizes), the computation of the bias and confidence interval of the estimate is more complicated than in previous work on statistical analysis.

### 1.1 Contributions

The contributions of this paper are as follows:

• We propose a new method, called hybrid statistical estimation, that combines statistical and precise analysis on the estimation of mutual information (which can also be applied to Shannon entropy and conditional Shannon entropy). Specifically, we show theoretical results on compositionally computing the bias and confidence interval of the estimate from multiple results obtained by statistical and precise analysis;

• We present a weighted statistical analysis method with different sample sizes over different components and a method for adaptively optimizing the sample sizes by evaluating the accuracy and cost of the analysis;

• We show how to reduce the sample sizes by using prior knowledge about systems, including an abstraction-then-sampling technique based on qualitative analysis. In particular, we point out that the state-of-the-art statistical analysis tool LeakWatch [CKN14] incorrectly computes the bias in estimation using the knowledge of the prior distribution, and explain how the proposed approach fixes this problem;

• We show that the proposed method can be applied not only to composed systems but also to the source code of a single system, by decomposing it into components and determining the analysis method for each component;

• We provide a practical implementation of the method in the HyLeak tool [BKLT], and show how the techniques in this paper can be applied to multiple benchmarks;

• We evaluate the quality of the estimation in this method, showing that the estimates are more accurate than statistical analysis alone for the same sample size, and that the new method outperforms the state-of-the-art statistical analysis tool LeakWatch;

• We demonstrate the effectiveness of the hybrid method in case studies on the quantification of information leakage.

A preliminary version of this paper, without proofs, appeared in [KBL16]. Also a preliminary version of the implementation description (Sections 7 and 8.2.1), without details, appeared in the tool paper describing HyLeak [BKLT17]. In this paper we add the estimation of Shannon entropy (Propositions 4.3, 4.4 and 6.2) and that of conditional entropy (Propositions 5.4 and 5.5). We also show the formulas for the adaptive analysis using knowledge of prior distributions (Proposition 6.3) and using the abstraction-then-sampling technique (Theorem 6.4). Furthermore, we provide a detailed explanation of the implementation in the HyLeak tool in Section 7, including how to decompose the source code of a system into components. We also present more experimental results with details in Section 8.2. Finally, we add Appendix A to present the detailed proofs.

The rest of the paper is structured as follows. Section 2 introduces background in quantification of information and compares precise analysis with statistical analysis for the estimation of mutual information. Section 3 overviews our new method for estimating mutual information. Section 4 describes the main results of this paper: the statistical estimation of mutual information for the hybrid method, including the method for optimizing sample sizes for different components. Section 5 presents how to reduce sample sizes by using prior knowledge about systems, including the abstraction-then-sampling technique with qualitative analysis. Section 6 shows the optimal assignment of samples to the components to be sampled statistically to improve the accuracy of the estimate. Section 7 overviews the implementation of the techniques in the HyLeak tool, including how to decompose the source code of a system into components and to determine the analysis method for each component. Section 8 evaluates the proposed method and illustrates its effectiveness against the state of the art, and Section 9 concludes the paper. Detailed proofs can be found in Appendix A.

### 1.2 Related Work

The information-theoretical approach to program security dates back to the work of Denning [Den76] and Gray [III91]. Clark et al. [CHM01, CHM07] presented techniques to automatically compute the mutual information of an imperative language with loops. For a deterministic program, leakage can be computed from the equivalence relations on the secret induced by the possible outputs, and such relations can be automatically quantified [BKR09]. Under- and over-approximations of leakage based on the observation of some traces have been studied for deterministic programs [ME08, NMS09]. As an approach that does not rely on information theory, McCamant et al. [KMPS11] developed tools implementing dynamic quantitative taint analysis techniques for security.

Fremont and Seshia [FS14] present a polynomial-time algorithm to approximate the weight of traces of deterministic programs, with possible applications to quantitative information leakage. Progress in randomized program analysis includes a scalable algorithm for the uniform generation of samples from a distribution defined by constraints [CFM15, CMV13], with applications to constrained-random program verification.

The statistical approach to quantifying information leakage has been studied since the seminal work by Chatzikokolakis et al. [CCG10]. Chothia et al. have developed this approach in the tools LeakiEst [CKN13, CKNa] and LeakWatch [CKN14, CKNb]. The hybrid statistical method in this paper can be considered an extension of their approach with the inclusion of component weighting and adaptive priors, inspired by importance sampling in statistical model checking [BHP12, CZ11]. To the best of our knowledge, no prior work has applied weighted statistical analysis to the estimation of mutual information or any other leakage measure.

The idea of combining static and randomized approaches to quantitative information flow was first proposed by Köpf and Rybalchenko [KR10], while our work takes a different approach, relying on statistical estimation to obtain better precision and accuracy, and is general enough to deal with probabilistic systems under various prior-information conditions. In related fields, the hybrid approach combining precise and statistical analysis has proven effective, for instance in concolic analysis [MS07, LCFS14], where it is shown that input generated by hybrid techniques leads to greater code coverage than input generated by either fully random or concolic generation alone. After the publication of the preliminary version [KBL16] of this paper, a few papers on quantitative information flow combining symbolic and statistical approaches have been published. Malacaria et al. [MKP18] present an approach that performs Monte Carlo sampling over symbolic paths, although the prior distributions are restricted to be uniform. Sweet et al. [STS18] combine abstract interpretation with sampling and concolic execution for reasoning about Bayes vulnerability. Unlike our work, these two studies aim at giving only bounds on information leakage and do not use statistical hypothesis testing.

Our tool HyLeak processes a simple imperative language that is an extension of the language used in the QUAIL tool version 2.0  [BLQ15]. The algorithms for precise computation of information leakage used in this paper are based on trace analysis [BLMW15], implemented in the QUAIL tool [BLTW, BLTW13, BLQ15]. As remarked above, the QUAIL tool implements only a precise calculation of leakage that examines all executions of programs. Hence the performance of QUAIL does not scale, especially when the program performs complicated computations that yield a large number of execution traces. The performance of QUAIL as compared to HyLeak is represented by the “precise” analysis approach in Section 8. Since QUAIL does not support the statistical approach or the hybrid approach, it cannot handle large problems that HyLeak can analyze.

As remarked above, the stochastic simulation techniques implemented in HyLeak have also been developed in the tools LeakiEst [CKN13] (with its extension [KCP14]) and LeakWatch [CKNP13, CKN14]. The performance of these tools as compared to HyLeak is represented by the “statistical” analysis approach in Section 8.

The tool Moped-QLeak  [CMS14] computes the precise information leakage of a program by transforming it into an algebraic decision diagram (ADD). As noted in [BLQ15], this technique is efficient when the program under analysis is simple enough to be converted into an ADD, and fails otherwise even when other tools including HyLeak can handle it. In particular, there are simple examples [BLQ15] where Moped-QLeak fails to produce any result but that can be examined by QUAIL and LeakWatch, hence by HyLeak.

Many information leakage analysis tools restricted to deterministic input programs have been released, including TEMU [NMS09], squifc [PM14], jpf-qif [PMTP12], QILURA [PMPd14], nsqflow [VEB16], and sharpPI [Wei16]. Some of these tools have been proven to scale to programs of thousands of lines written in common languages like C and Java. Such tools are not able to compute the Shannon leakage for the scenario of adaptive attacks; they only compute the min-capacity of a deterministic program for the scenario of one-try guessing attacks, which gives only a coarse upper bound on the Shannon leakage. More specifically, they compute the logarithm of the number of possible outputs of the deterministic program, usually by using model counting on an SMT-constraint-based representation of the possible outputs, obtained by analyzing the program. Contrary to these tools, HyLeak can analyze randomized programs (some of these tools, like jpf-qif and nsqflow, present case studies on randomized protocols, but the randomness of the programs is assumed to have the most leaking behavior; e.g., in the Dining Cryptographers this means assuming all coins produce heads with probability 1) and provides a quite precise estimation of the Shannon leakage of the program, not just a coarse upper bound. As far as we know, HyLeak is the most efficient tool that has this greater scope and higher accuracy.

## 2 Background

In this section we introduce the basic concepts used in the paper. We first introduce some notions in information theory to quantify the amount of some information in probabilistic systems. Then we compare two previous analysis approaches to quantifying information: precise analysis and statistical analysis.

### 2.1 Quantification of Information

In this section we introduce some background on information theory, which we use to quantify the amount of information in a probabilistic system. Hereafter we write $X$ and $Y$ to denote two random variables, and $\mathcal{X}$ and $\mathcal{Y}$ to denote the sets of all possible values of $X$ and $Y$, respectively. We denote the number of elements of a set $\mathcal{A}$ by $\#\mathcal{A}$. Given a random variable $A$ we denote by $\mathbb{E}[A]$ the expected value of $A$, and by $\mathbb{V}[A]$ the variance of $A$, i.e., $\mathbb{V}[A] = \mathbb{E}\big[(A - \mathbb{E}[A])^2\big]$. The logarithms in this paper are to the base $2$ and we often abbreviate $\log_2$ to $\log$.

#### 2.1.1 Channels

In information theory, a channel models the input-output relation of a system as a conditional probability distribution of outputs given inputs. This model has also been used to formalize information leakage in a system that processes confidential data: inputs and outputs of a channel are respectively regarded as secrets and observables in the system, and the channel represents the relationship between the secrets and observables.

A discrete channel is a triple $(\mathcal{X}, \mathcal{Y}, C)$ where $\mathcal{X}$ and $\mathcal{Y}$ are two finite sets of discrete input and output values respectively, and $C$ is a $\#\mathcal{X} \times \#\mathcal{Y}$ matrix where each element $C[x, y]$ represents the conditional probability of an output $y$ given an input $x$; i.e., for each $x \in \mathcal{X}$, $\sum_{y \in \mathcal{Y}} C[x, y] = 1$ and $C[x, y] \geq 0$ for all $y \in \mathcal{Y}$.

A prior is a probability distribution $P_X$ on input values $\mathcal{X}$. Given a prior $P_X$ over $\mathcal{X}$ and a channel $C$ from $\mathcal{X}$ to $\mathcal{Y}$, the joint probability distribution $P_{XY}$ of $X$ and $Y$ is defined by $P_{XY}[x, y] = P_X[x]\,C[x, y]$ for each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.
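As a small illustration of this definition, the following Python sketch (our own example, not part of the paper's tooling) builds the joint distribution from a prior and a channel matrix:

```python
# Joint distribution P_XY[x, y] = P_X[x] * C[x, y], as defined above.
# prior:   dict mapping each input x to P_X[x]
# channel: dict mapping each input x to a dict of conditional probabilities C[x, y]

def joint_distribution(prior, channel):
    joint = {}
    for x, px in prior.items():
        for y, c_xy in channel[x].items():
            joint[(x, y)] = px * c_xy
    return joint

# Example: a binary channel that flips its input with probability 0.1.
prior = {0: 0.5, 1: 0.5}
channel = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
P_XY = joint_distribution(prior, channel)
```

Since the channel rows sum to one, the resulting joint probabilities sum to one as well.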

#### 2.1.2 Shannon Entropy

We recall some information-theoretic measures as follows. Given a prior $P_X$ on input $X$, the prior uncertainty (before observing the system's output $Y$) is defined as

$$H(X) = -\sum_{x \in \mathcal{X}} P_X[x] \log_2 P_X[x]$$

while the posterior uncertainty (after observing the system's output $Y$) is defined as

$$H(X|Y) = -\sum_{y \in \mathcal{Y}^{+}} P_Y[y] \sum_{x \in \mathcal{X}} P_{X|Y}[x|y] \log_2 P_{X|Y}[x|y],$$

where $P_Y$ is the probability distribution on the output $Y$, $\mathcal{Y}^{+}$ is the set of outputs in $\mathcal{Y}$ with non-zero probabilities, and $P_{X|Y}$ is the conditional probability distribution of $X$ given $Y$:

$$P_Y[y] = \sum_{x' \in \mathcal{X}} P_{XY}[x', y] \qquad P_{X|Y}[x|y] = \frac{P_{XY}[x, y]}{P_Y[y]} \ \text{ if } P_Y[y] \neq 0.$$

$H(X|Y)$ is also called the conditional entropy of $X$ given $Y$.
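Both uncertainties can be computed directly from the distributions; the following dictionary-based Python sketch is our own illustration of the formulas, not the paper's implementation:

```python
import math

def entropy(dist):
    """Prior uncertainty H(X) of a distribution {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """Posterior uncertainty H(X|Y) from a joint distribution {(x, y): p}."""
    # Marginal distribution P_Y; outputs with zero probability are skipped below.
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    # H(X|Y) = - sum_{x,y} P_XY[x,y] * log2(P_XY[x,y] / P_Y[y])
    return -sum(p * math.log2(p / p_y[y])
                for (x, y), p in joint.items() if p > 0)

# Example: the output reveals the input exactly.
joint = {(0, 0): 0.5, (1, 1): 0.5}
```

For this joint distribution $H(X)$ is 1 bit and $H(X|Y)$ is 0, matching the intuition that observing the output fully identifies the input.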

#### 2.1.3 Mutual Information

The amount of information gained about a random variable $X$ by knowing a random variable $Y$ is defined as the difference between the uncertainty about $X$ before and after observing $Y$. The mutual information $I(X; Y)$ between $X$ and $Y$ is one of the most popular measures to quantify the amount of information on $X$ gained by $Y$:

$$I(X; Y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{XY}[x, y] \log_2\!\left(\frac{P_{XY}[x, y]}{P_X[x]\, P_Y[y]}\right)$$

where $P_Y$ is the marginal probability distribution defined as $P_Y[y] = \sum_{x \in \mathcal{X}} P_{XY}[x, y]$.
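This formula can be computed from the joint distribution alone, since both marginals are derived from it. The Python fragment below is a dictionary-based illustration of the definition (our sketch, not the paper's implementation):

```python
import math

def mutual_information(joint):
    """I(X;Y) from a joint distribution {(x, y): probability}."""
    # Both marginals P_X and P_Y are obtained by summing the joint distribution.
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)
```

Two sanity checks: for independent variables the ratio inside the logarithm is 1 everywhere, so the mutual information is 0; for perfectly correlated uniform binary variables it is 1 bit.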

In the security scenario, information-theoretical measures quantify the amount of secret information leaked against some particular attacker: the mutual information between two random variables on the secrets and on the observables in a system measures the information that is inferable about the secret by knowing the observable. In this scenario mutual information, or Shannon leakage, assumes an attacker that can ask binary questions on the secret’s value after observing the system while min-entropy leakage [Smi09] considers an attacker that has only one attempt to guess the secret’s value.

Mutual information has been employed in many other applications including Bayesian networks [Jen96], telecommunications [Gal68, ESB09, Mac02], quantum physics [Wil13], and biology [Ada04]. In this work we focus on mutual information and its application to the above security scenario.

### 2.2 Computing Mutual Information in Probabilistic Systems

In this section we present two previous approaches to computing mutual information in probabilistic systems in the context of quantitative information flow. Then we compare the two approaches to discuss their advantages and disadvantages.

In the rest of the paper a probabilistic system $S$ is defined as a finite set of execution traces such that each trace $tr$ records the values of all variables in $S$ and is associated with a probability $P_S[tr]$. Note that $S$ does not have non-deterministic transitions. For the sake of generality we do not assume any specific constraints at this moment.

The main computational difficulty in calculating the mutual information $I(X; Y)$ between input $X$ and output $Y$ lies in the computation of the joint probability distribution $P_{XY}$ of $X$ and $Y$, especially when the system consists of a large number of execution traces and when the distribution is represented as a large data structure. In previous work this computation has been performed either by the precise approach using program analysis techniques or by the statistical approach using random sampling and statistics.

#### 2.2.1 Precise Analysis

Precise analysis consists of analyzing all the execution traces of a system and determining, for each trace $tr$, its input $x$, output $y$, and probability $P_S[tr]$ by concretely or symbolically executing the system. The precise analysis approach in this paper follows the depth-first trace exploration technique presented by Biondi et al. [BLQ15].

To obtain the exact joint probability $P_{XY}[x, y]$ for each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ in a system $S$, we sum the probabilities of all execution traces of $S$ that have input $x$ and output $y$, i.e.,

$$P_{XY}[x, y] = \sum \big\{\, P_S[tr] \ \big|\ tr \in S \text{ has input } x \text{ and output } y \,\big\}$$

where $P_S$ is the probability distribution over the set of all traces in $S$. This means the computation time depends on the number of traces in the system. If the system has a very large number of traces, it is intractable for the analyst to precisely compute the joint distribution and consequently the mutual information.
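In code, this summation over traces is a single pass. The sketch below is our own illustration, with each trace represented as an `(input, output, probability)` triple:

```python
def joint_from_traces(traces):
    """Exact P_XY obtained by summing the probabilities of all traces
    that share the same input x and output y."""
    joint = {}
    for x, y, p in traces:
        joint[(x, y)] = joint.get((x, y), 0.0) + p
    return joint

# Toy system with four traces; two traces collapse onto the pair (1, 1).
traces = [(0, 0, 0.3), (0, 1, 0.2), (1, 1, 0.4), (1, 1, 0.1)]
P_XY = joint_from_traces(traces)
```

The cost is linear in the number of traces, which is exactly why this approach becomes intractable when the trace set is very large.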

In [YT14] the calculation of mutual information is shown to be computationally expensive. This computational difficulty comes from the fact that entropy-based properties are hyperproperties [CS10] that are defined using all execution traces of the system and therefore cannot be verified on each single trace. For example, when we investigate the information leakage in a system, it is insufficient to check the leakage separately for each component of the system, because the attacker may derive sensitive information by combining the outputs of different components. More generally, the computation of entropy-based properties (such as the amount of leaked information) is not compositional, in the sense that an entropy-based property of a system is not the (weighted) sum of those of the components.

For this reason, it is inherently difficult to naïvely combine analyses of different components of a system to compute entropy-based properties. In fact, previous studies on the compositional approach in quantitative information flow analysis have faced certain difficulties in obtaining useful bounds on information leakage [BK11, ES13, KG15, KCP17].

#### 2.2.2 Statistical Analysis

Due to the complexity of precise analysis, some previous studies have focused on computing approximate values of entropy-based measures. One of the common approaches is statistical analysis based on Monte Carlo methods, in which approximate values are computed from repeated random sampling and their accuracy is evaluated using statistics. Previous work on quantitative information flow has used statistical analysis to estimate mutual information [CCG10, Mod89, Bri04], channel capacity [CCG10, BP14] and min-entropy leakage [CKN14, CK14].

In the statistical estimation of mutual information between two random variables $X$ and $Y$ in a probabilistic system, the analyst executes the system many times and collects the execution traces, each of which yields a pair of values corresponding to the input and output of the trace. This set of execution traces is used to compute the empirical joint distribution $\hat{P}_{XY}$ of $X$ and $Y$ and then to estimate the mutual information $\hat{I}(X; Y)$.

Note that the empirical distribution $\hat{P}_{XY}$ is different from the true distribution $P_{XY}$ and thus the estimated mutual information $\hat{I}(X; Y)$ is different from the true value $I(X; Y)$. In fact, it is known that estimates of entropy-based measures such as mutual information and min-entropy leakage have a bias and variance that depend on the number of collected traces, the matrix size and other factors. However, results in statistics allow us to correct the bias of the estimate and to compute its variance (and the 95% confidence interval). This way we can guarantee the quality of the estimation, which differentiates the statistical approach from the testing approach.
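To make the bias concrete: the plug-in estimate computed from the empirical distribution systematically overestimates mutual information, and a classical first-order correction (due to Miller, underlying the purely statistical estimators this paper builds on) subtracts roughly $(\#\mathcal{X}-1)(\#\mathcal{Y}-1)/(2n \ln 2)$ bits for $n$ samples. The following Python sketch illustrates this idea; it is not the paper's own estimator:

```python
import math

def empirical_mi(samples):
    """Plug-in mutual information estimate from a list of (x, y) samples."""
    n = len(samples)
    joint, cx, cy = {}, {}, {}
    for x, y in samples:
        joint[(x, y)] = joint.get((x, y), 0) + 1
        cx[x] = cx.get(x, 0) + 1
        cy[y] = cy.get(y, 0) + 1
    # P_XY = k/n, P_X = cx/n, P_Y = cy/n, so the ratio simplifies to k*n/(cx*cy).
    return sum((k / n) * math.log2(k * n / (cx[x] * cy[y]))
               for (x, y), k in joint.items())

def corrected_mi(samples, num_x, num_y):
    """Miller-style first-order bias correction of the plug-in estimate."""
    bias = (num_x - 1) * (num_y - 1) / (2 * len(samples) * math.log(2))
    return empirical_mi(samples) - bias
```

The correction shrinks as the sample size grows, reflecting that the plug-in estimator is consistent but biased for finite samples.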

#### 2.2.3 Comparing the Two Analysis Methods

The cost of the statistical analysis is proportional to the size of the joint distribution matrix (strictly speaking, to the number of non-zero elements in the matrix). Therefore, this method is significantly more efficient than precise analysis if the matrix is relatively small and the number of all traces is very large (for instance because the system's internal variables have a large range).

On the other hand, if the matrix is very large, the number of executions needs to be very large to obtain a reliable and small confidence interval. In particular, for a small sample size, statistical analysis does not detect rare events, i.e., traces with a low probability that affect the result. Therefore the precise analysis is significantly more efficient than statistical analysis if the number of all traces is relatively small and the matrix is relatively large (for instance because the system’s internal variables have a small range).

The main differences between precise analysis and statistical analysis are summarized in Table 1.

## 3 Overview of the Hybrid Statistical Estimation Method

In this section we overview a new method for estimating the mutual information between two random variables $X$ (over the inputs $\mathcal{X}$) and $Y$ (over the outputs $\mathcal{Y}$) in a system. The method, which we call hybrid statistical estimation, integrates both precise and statistical analyses to overcome the limitations of those previous approaches (explained in Section 2.2).

In our hybrid analysis method, we first decompose a given probabilistic system into mutually disjoint components, which we will define below, and then apply different types of analysis (with possibly different parameters) to different components of the system. More specifically, for each component, our hybrid method chooses the faster of the precise and statistical analyses. Hence the hybrid analysis of the whole system is faster than either the precise or the statistical analysis alone, while it gives more accurate estimates than the statistical analysis alone, as shown experimentally in Section 8.

To introduce the notion of components we recall from Section 2.2 that a probabilistic system $S$ is defined as the set of all execution traces such that each trace $tr$ is associated with a probability $P_S[tr]$ (recall that this work considers only probabilistic systems without non-deterministic transitions). Formally, a decomposition of $S$ is defined as a collection $\{S_i\}_{i \in I}$ of mutually disjoint non-empty subsets of $S$: $S_i \neq \emptyset$, $\bigcup_{i \in I} S_i = S$, and for any $i \neq j$, $S_i \cap S_j = \emptyset$. Then each element $S_i$ of the collection is called a component. In this sense, components are a partition of the execution traces of $S$. When $S$ is executed, only one of $S$'s components is executed, since the components are mutually disjoint. Hence each component $S_i$ is chosen to be executed with the probability $\theta_i = \sum_{tr \in S_i} P_S[tr]$.

In decomposing a system we roughly investigate the characteristics of each component's behaviour to choose a faster analysis method for each component. Note that information about a component, like its number of traces and the size of its joint sub-distribution matrix, can be estimated heuristically before computing the matrix itself. This will be explained in Section 7; before that section this information is assumed to be available. The choice of the analysis method is as follows:

• If a component’s behaviour is deterministic, we perform a precise analysis on it.

• If a component’s behaviour is described as a joint sub-distribution matrix over small (relative to the number of all execution traces of the component) subsets of $\mathcal{X}$ and $\mathcal{Y}$, then we perform a statistical analysis on the component.

• If a component’s behaviour is described as a matrix over large (again, relative to the number of all execution traces of the component) subsets of $\mathcal{X}$ and $\mathcal{Y}$, then we perform a precise analysis on the component.

• By combining the analysis results on all components, we compute the estimated value of mutual information and its variance (and confidence interval). See Section 4 for details.

• By incorporating information from qualitative information flow analysis, the analyst may obtain partial knowledge on components and be able to reduce the sample sizes. See Section 5 for details.

See Section 7 for the details on how to decompose a system.
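The per-component choice sketched in the list above can be phrased as a simple rule. The following Python fragment is a purely illustrative heuristic; the field names, thresholds, and cost model are our assumptions, not HyLeak's actual decision procedure:

```python
def choose_analysis(component):
    """component: dict with a 'deterministic' flag and heuristic estimates
    'num_traces' and 'matrix_size' (number of non-zero joint entries),
    assumed available before the matrix itself is computed."""
    if component["deterministic"]:
        return "precise"
    # Statistical cost grows with the matrix size, precise cost with the
    # number of traces, so we sample when the matrix is comparatively small.
    if component["matrix_size"] < component["num_traces"]:
        return "statistical"
    return "precise"
```

In practice the comparison would be weighted by the actual per-trace and per-cell analysis costs rather than a bare count.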

One of the main advantages of hybrid statistical estimation is that we guarantee the quality of the outcome by removing its bias and providing its variance (and confidence interval) even though different kinds of analysis with different parameters (such as sample sizes) are combined together.

Another advantage is the compositionality in estimating bias and variance. Since the sampling of execution traces is performed independently for each component, we obtain that the bias and variance of mutual information can be computed in a compositional way, i.e., the bias/variance for the entire system is the sum of those for the components. This compositionality enables us to find optimal sample sizes for the different components that maximize the accuracy of the estimation (i.e., minimize the variance) given a fixed total sample size for the entire system. On the other hand, the computation of mutual information itself is not compositional [KCP17]: it requires calculating the full joint probability distribution of the system by summing the joint sub-distributions of all components of the system.

Finally, note that these results can be applied to the estimation of Shannon entropy (Section 4.3) and conditional Shannon entropy (Section 5.1.3) as special cases. The overview of all results is summarized in Table 2.

## 4 Hybrid Method for Statistical Estimation of Mutual Information

In this section we present a method for estimating the mutual information between two random variables $X$ (over the inputs $\mathcal{X}$) and $Y$ (over the outputs $\mathcal{Y}$) in a system, and for evaluating the precision and accuracy of the estimation.

We consider a probabilistic system $S$ that consists of components $S_i$ for $i \in I \cup J$, each executed with probability $\theta_i$; i.e., when $S$ is executed, the component $S_i$ is executed with probability $\theta_i$. Here $I$ and $J$ are the index sets of the components to be analyzed statistically and precisely, respectively, one of which can be empty. The probabilities of all components sum up to $1$, i.e., $\sum_{i \in I \cup J} \theta_i = 1$. We assume that the analyst is able to compute these probabilities by precise analysis. In the example in Fig. 3 the probabilities of the three components are given explicitly. However, in general they would be computed by analyzing the behavior of the system before the system executes the three components. More details about how to obtain this in practice are provided when discussing the implementation in Section 7.

Once the system is decomposed into components, each component is analyzed either by precise analysis or by statistical analysis. We assume that the analyst can run the component $S_i$ for each $i \in I$ to record a certain number of $S_i$'s execution traces, and can precisely analyze the component $S_j$ for each $j \in J$ to obtain a symbolic representation of all of $S_j$'s execution traces, e.g., by static analysis of the source code (or of a specification that the code is known to satisfy). In the example in Fig. 3, this means that two of the components will be analyzed statistically, producing an approximation of their joint distributions, while the remaining component will be analyzed precisely, obtaining its exact joint distribution. The two estimates and the one precise joint distribution will be composed to obtain a joint distribution estimate for the whole system, as illustrated in Fig. 3(b).

In the rest of this section we present a method for computing the joint probability distribution (Section 3), for estimating the mutual information (Section 4.1), and for evaluating the accuracy of the estimation (Section 4.2). Then we show the application of our hybrid method to Shannon entropy estimation (Section 4.3).

In the estimation of the mutual information between the two random variables $X$ and $Y$ in the system $S$, we need to estimate the joint probability distribution $P_{XY}$ of $X$ and $Y$.

In our approach this is obtained by combining the joint sub-probability distributions of $X$ and $Y$ for all the components. More specifically, let $R_i$ and $Q_j$ be the joint sub-distributions of $X$ and $Y$ for the components $S_i$ ($i \in I$) and $S_j$ ($j \in J$) respectively. Then the joint (full) distribution for the whole system is defined by:

$$P_{XY}[x,y] \;\overset{\text{def}}{=}\; \sum_{i\in I} R_i[x,y] + \sum_{j\in J} Q_j[x,y]$$

for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Note that for each $i \in I$ and $j \in J$, the sums of all probabilities in the sub-distributions $R_i$ and $Q_j$ respectively equal the probabilities $\theta_i$ (of executing $S_i$) and $\theta_j$ (of executing $S_j$).

To estimate the joint distribution $P_{XY}$, the analyst computes

• for each $j \in J$, the exact sub-distribution $Q_j$ for the component $S_j$, by precise analysis of $S_j$, and

• for each $i \in I$, the empirical sub-distribution $\hat{R}_i$ for the component $S_i$, from a set of traces obtained by executing $S_i$ a certain number $n_i$ of times.

More specifically, the empirical sub-distribution $\hat{R}_i$ is constructed as follows. When the component $S_i$ is executed $n_i$ times, let $K_{ixy}$ be the number of traces that have input $x$ and output $y$. Then $n_i = \sum_{(x,y)} K_{ixy}$. From these numbers of traces we compute the empirical joint (full) distribution $\hat{D}_i$ of $X$ and $Y$ by: $\hat{D}_i[x,y] \overset{\text{def}}{=} \frac{K_{ixy}}{n_i}$. Since $S_i$ is executed with probability $\theta_i$, the sub-distribution is given by $\hat{R}_i = \theta_i \hat{D}_i$.

Then the analyst sums up these sub-distributions to obtain the joint distribution estimate for the whole system $S$:

$$\hat{P}_{XY}[x,y] \;\overset{\text{def}}{=}\; \sum_{i\in I}\hat{R}_i[x,y] + \sum_{j\in J}Q_j[x,y] \;=\; \sum_{i\in I}\theta_i\frac{K_{ixy}}{n_i} + \sum_{j\in J}Q_j[x,y].$$

Note that $\hat{R}_i$ and $Q_j$ may have different matrix sizes and cover different parts of the joint distribution matrix, so they may have to be appropriately padded with zeroes for the summation.
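For concreteness, the composition above can be sketched as follows. This is a minimal sketch with hypothetical numbers: one precisely analyzed and one statistically analyzed component over a 2×2 input–output space (the values are illustrative, not those of Fig. 3).

```python
import numpy as np

# Precisely analyzed component: exact joint sub-distribution Q_j,
# already scaled by its execution probability theta_j = 0.5.
Q = np.array([[0.25, 0.25],
              [0.0,  0.0 ]])

# Statistically analyzed component: trace counts K_ixy from n_i runs.
theta_i = 0.5
K = np.array([[30, 10],
              [ 5, 55]])
n_i = K.sum()                 # 100 executions
R_hat = theta_i * K / n_i     # empirical sub-distribution theta_i * K_ixy / n_i

# Joint distribution estimate for the whole system.  The two matrices
# happen to share a shape here; in general they must first be
# zero-padded to a common one.
P_hat = R_hat + Q
assert abs(P_hat.sum() - 1.0) < 1e-9
```

The assertion checks that the sub-distribution masses ($\theta_i$ and $\theta_j$) add up to a full probability distribution.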

### 4.1 Estimation of Mutual Information and Correction of its Bias

In this section we present our new method for estimating mutual information and for correcting its bias. For each component $S_i$ let $D_i$ be the joint (full) distribution of $X$ and $Y$ obtained by normalizing the sub-distribution $R_i$: $D_i[x,y] = \frac{R_i[x,y]}{\theta_i}$. Let $D_{Xi}$ and $D_{Yi}$ be the marginal distributions of $D_i$, and let $\mathcal{D}$, $\mathcal{X}^+$ and $\mathcal{Y}^+$ be the supports of $P_{XY}$, $P_X$ and $P_Y$, respectively.

Using the estimated joint distribution $\hat{P}_{XY}$ we can compute the mutual information estimate $\hat{I}(X;Y)$. Note that the mutual information for the whole system is smaller than (or equal to) the weighted sum of those for the components, because of its convexity w.r.t. the channel matrix. Therefore it cannot be computed compositionally from those of the components, i.e., it is necessary to compute the joint distribution matrix for the whole system.

Since $\hat{I}(X;Y)$ is obtained from a limited number of traces, it has bias, i.e., its expected value $E[\hat{I}(X;Y)]$ differs from the true value $I(X;Y)$. The bias in the estimation is quantified as follows.

###### Theorem 4.1 (Mean of estimated mutual information)

The expected value $E[\hat{I}(X;Y)]$ of the estimated mutual information is given by:

$$E[\hat{I}(X;Y)] = I(X;Y) + \sum_{i\in I}\frac{\theta_i^2}{2n_i}\left(\sum_{(x,y)\in\mathcal{D}}\varphi_{ixy} - \sum_{x\in\mathcal{X}^+}\varphi_{ix} - \sum_{y\in\mathcal{Y}^+}\varphi_{iy}\right) + O(n_i^{-2})$$

where $\varphi_{ixy} = \frac{D_i[x,y]-D_i[x,y]^2}{P_{XY}[x,y]}$, $\varphi_{ix} = \frac{D_{Xi}[x]-D_{Xi}[x]^2}{P_X[x]}$ and $\varphi_{iy} = \frac{D_{Yi}[y]-D_{Yi}[y]^2}{P_Y[y]}$.

[Proof sketch.] Here we present only the basic idea. Appendices A.1 and A.2 present a proof of this theorem by showing a more general claim, i.e., Theorem 5.6 in Section 5.2.

By properties of mutual information and Shannon entropy, we have:

$$E[\hat{I}(X;Y)] - I(X;Y) = E[\hat{H}(X) + \hat{H}(Y) - \hat{H}(X,Y)] - \bigl(H(X) + H(Y) - H(X,Y)\bigr).$$

Hence it is sufficient to calculate the bias in $\hat{H}(X)$, $\hat{H}(Y)$ and $\hat{H}(X,Y)$, respectively.

We calculate the bias in $\hat{H}(X,Y)$ as follows. Let $f_{xy}$ be the $\#I$-ary function defined by:

$$f_{xy}(K_{1xy}, K_{2xy}, \ldots, K_{mxy}) = \left(\sum_{i\in I}\theta_i\frac{K_{ixy}}{n_i} + \sum_{j\in J}Q_j[x,y]\right)\log\left(\sum_{i\in I}\theta_i\frac{K_{ixy}}{n_i} + \sum_{j\in J}Q_j[x,y]\right),$$

which equals $\hat{P}_{XY}[x,y]\log\hat{P}_{XY}[x,y]$. Let $\mathbf{K}_{xy} \overset{\text{def}}{=} (K_{ixy})_{i\in I}$. Then the empirical joint entropy is:

$$\hat{H}(X,Y) = -\sum_{(x,y)\in\mathcal{D}}\hat{P}_{XY}[x,y]\log\hat{P}_{XY}[x,y] = -\sum_{(x,y)\in\mathcal{D}}f_{xy}(\mathbf{K}_{xy}).$$

Let $\overline{K_{ixy}} \overset{\text{def}}{=} E[K_{ixy}] = n_i D_i[x,y]$ for each $i \in I$ and $(x,y) \in \mathcal{D}$, and let $\overline{\mathbf{K}_{xy}} = (\overline{K_{ixy}})_{i\in I}$. By the Taylor expansion of $f_{xy}$ (w.r.t. the multiple dependent variables $K_{ixy}$) at $\overline{\mathbf{K}_{xy}}$, we have:

$$f_{xy}(\mathbf{K}_{xy}) = f_{xy}(\overline{\mathbf{K}_{xy}}) + \sum_{i\in I}\frac{\partial f_{xy}(\overline{\mathbf{K}_{xy}})}{\partial K_{ixy}}(K_{ixy}-\overline{K_{ixy}}) + \frac{1}{2}\sum_{i,j\in I}\frac{\partial^2 f_{xy}(\overline{\mathbf{K}_{xy}})}{\partial K_{ixy}\,\partial K_{jxy}}(K_{ixy}-\overline{K_{ixy}})(K_{jxy}-\overline{K_{jxy}}) + \sum_{i\in I}O\!\left(K_{ixy}^3\right).$$

We use the following properties:

• $E[K_{ixy} - \overline{K_{ixy}}] = 0$, which is immediate from $\overline{K_{ixy}} = E[K_{ixy}]$.

• $E[(K_{ixy} - \overline{K_{ixy}})(K_{jxy} - \overline{K_{jxy}})] = 0$ if $i \neq j$, because $K_{ixy}$ and $K_{jxy}$ are independent.

• $E[(K_{ixy} - \overline{K_{ixy}})^2] = V[K_{ixy}] = n_i D_i[x,y](1 - D_i[x,y])$.

Then

$$\begin{aligned} E[\hat{H}(X,Y)] &= -\sum_{(x,y)\in\mathcal{D}}E[f_{xy}(\mathbf{K}_{xy})] \\ &= -\sum_{(x,y)\in\mathcal{D}}\left(f_{xy}(\overline{\mathbf{K}_{xy}}) + \frac{1}{2}\sum_{i\in I}\frac{\partial^2 f_{xy}(\overline{\mathbf{K}_{xy}})}{\partial K_{ixy}^2}\,E\!\left[(K_{ixy}-\overline{K_{ixy}})^2\right] + O(K_{ixy}^3)\right) \\ &= H(X,Y) - \sum_{i\in I}\frac{\theta_i^2}{2n_i}\sum_{(x,y)\in\mathcal{D}}\varphi_{ixy} + O(n_i^{-2}), \end{aligned}$$

where the derivation of the equalities is detailed in Appendix A. Hence the bias in estimating $H(X,Y)$ is given by:

$$E[\hat{H}(X,Y)] - H(X,Y) = -\sum_{i\in I}\frac{\theta_i^2}{2n_i}\sum_{(x,y)\in\mathcal{D}}\varphi_{ixy} + O(n_i^{-2}).$$

Analogously, we can calculate the bias in estimating $H(X)$ and $H(Y)$ to derive the theorem. See Appendices A.1 and A.2 for the details.

Since the higher-order terms in the formula are negligible when the sample sizes are large enough, we use the following as the point estimate of the mutual information:

$$pe \;\overset{\text{def}}{=}\; \hat{I}(X;Y) - \sum_{i\in I}\frac{\theta_i^2}{2n_i}\left(\sum_{(x,y)\in\mathcal{D}}\hat{\varphi}_{ixy} - \sum_{x\in\mathcal{X}^+}\hat{\varphi}_{ix} - \sum_{y\in\mathcal{Y}^+}\hat{\varphi}_{iy}\right)$$

where $\hat{\varphi}_{ixy}$, $\hat{\varphi}_{ix}$ and $\hat{\varphi}_{iy}$ are respectively the empirical values of $\varphi_{ixy}$, $\varphi_{ix}$ and $\varphi_{iy}$, computed from the traces by replacing $D_i$, $P_{XY}$, $P_X$ and $P_Y$ with their empirical counterparts. The remaining bias is then closer to $0$ when the sample sizes are larger.
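The correction can be sketched in code as follows. This is a simplified setting with a single statistically analyzed component (so $\theta_i = 1$ and the precise part is empty); `mi` and `corrected_mi` are hypothetical helper names, not part of any tool described here.

```python
import numpy as np

def mi(P):
    """Plug-in mutual information (natural log) of a joint matrix P."""
    Px, Py = P.sum(axis=1), P.sum(axis=0)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / np.outer(Px, Py)[mask])).sum())

def corrected_mi(counts, theta=1.0):
    """Bias-corrected MI estimate for one statistically analyzed
    component executed with probability theta (illustrative sketch)."""
    n = counts.sum()
    D_hat = counts / n            # empirical component distribution
    P_hat = theta * D_hat         # joint estimate (no precise part here)

    def phi(d, p):                # empirical phi terms of Theorem 4.1
        m = p > 0
        return float(((d[m] - d[m] ** 2) / p[m]).sum())

    bias = (theta ** 2 / (2 * n)) * (
        phi(D_hat, P_hat)
        - phi(D_hat.sum(axis=1), P_hat.sum(axis=1))
        - phi(D_hat.sum(axis=0), P_hat.sum(axis=0)))
    return mi(P_hat) - bias
```

For a uniform 2×2 count matrix with $n = 100$ the plug-in estimate is exactly $0$ and the subtracted bias term is $\frac{1}{2n}(3 - 1 - 1) = 0.005$, illustrating that the plug-in estimator overestimates small mutual information.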

### 4.2 Evaluation of the Accuracy of Estimation

In this section we present how to evaluate the accuracy of the mutual information estimation. The quality of the estimation depends on the sample sizes and other factors, and can be evaluated using the variance $V[\hat{I}(X;Y)]$ of the estimate.

###### Theorem 4.2 (Variance of estimated mutual information)

The variance $V[\hat{I}(X;Y)]$ of the estimated mutual information is given by:

$$V[\hat{I}(X;Y)] = \sum_{i\in I}\frac{\theta_i^2}{n_i}\left(\sum_{(x,y)\in\mathcal{D}}D_i[x,y]\left(\log\frac{P_{XY}[x,y]}{P_X[x]\,P_Y[y]}\right)^2 - \left(\sum_{(x,y)\in\mathcal{D}}D_i[x,y]\log\frac{P_{XY}[x,y]}{P_X[x]\,P_Y[y]}\right)^2\right) + O(n_i^{-2}).$$

[Proof sketch.] The variance is calculated using the following:

$$\begin{aligned} V[\hat{I}(X;Y)] &= V[\hat{H}(X) + \hat{H}(Y) - \hat{H}(X,Y)] \\ &= V[\hat{H}(X)] + V[\hat{H}(Y)] + V[\hat{H}(X,Y)] + 2\,\mathrm{Cov}[\hat{H}(X),\hat{H}(Y)] - 2\,\mathrm{Cov}[\hat{H}(X),\hat{H}(X,Y)] - 2\,\mathrm{Cov}[\hat{H}(Y),\hat{H}(X,Y)]. \end{aligned}$$

The calculation of these variances and covariances and the whole proof are shown in Appendices A.3 and A.4. (We will present a proof of this theorem by showing a more general claim, i.e., Theorem 5.7 in Section 5.2).

The confidence interval of the estimate of mutual information is also useful to show how accurate the estimate is: a smaller confidence interval corresponds to a more reliable estimate. To compute the confidence interval approximately, we assume that the sampling distribution of the estimate $\hat{I}(X;Y)$ is a normal distribution.

In fact, Brillinger [Bri04] shows that the sampling distribution of mutual information values is approximately normal for large sample sizes, and Chatzikokolakis et al. [CCG10] employ this fact to approximately compute the confidence interval. We also empirically verified that the sampling distribution is closer to the normal distribution when the $n_i$'s are large enough; the evaluation of the obtained confidence interval is demonstrated by experiments in Section 8.1. The confidence interval is then calculated using the variance $v$ obtained by Theorem 4.2 as follows. Given a significance level $\alpha$, we denote by $z_{\alpha/2}$ the z-score for the $100(1-\frac{\alpha}{2})$-th percentile point. Then the confidence interval of the estimate is given by:

$$\left[\;\max\left(0,\; pe - z_{\alpha/2}\sqrt{v}\,\right),\;\; pe + z_{\alpha/2}\sqrt{v}\;\right].$$

For example, we use the z-score $z_{0.025} = 1.96$ to compute the 95% confidence interval. To be able to ignore the higher-order terms, the sample sizes $n_i$ need to be large enough.
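In code, the interval computation is a one-liner. This sketch assumes `pe` and `variance` come from the bias-corrected point estimate and Theorem 4.2; mutual information is non-negative, hence the clamp at zero.

```python
import math

def mi_confidence_interval(pe, variance, z=1.96):
    """Approximate confidence interval for a mutual information point
    estimate pe, assuming an approximately normal sampling distribution.
    The default z = 1.96 gives a 95% interval; the lower bound is
    clamped at 0 because mutual information cannot be negative."""
    half = z * math.sqrt(variance)
    return max(0.0, pe - half), pe + half
```

For instance, a point estimate of 0.5 with variance 0.01 yields the 95% interval (0.304, 0.696), while a small estimate of 0.05 with the same variance yields (0.0, 0.246) after clamping.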

By Theorems 4.1 and 4.2, the bias and variance for the whole system can be computed compositionally from those for the components, unlike the mutual information itself. This allows us to adaptively optimize the sample sizes for the components as we will see in Section 6.

### 4.3 Application to Estimation of Shannon Entropy

Hybrid statistical estimation can also be used to estimate the Shannon entropy $H(X)$ of a random variable $X$ in a probabilistic system. Although the results for Shannon entropy are straightforward from those for mutual information, we present the formulas here for completeness. For each $i \in I$ let $D_{Xi}$ be the (normalized) distribution of $X$ for the component $S_i$. Then the mean and variance of the estimate $\hat{H}(X)$ are obtained in the same way as in Sections 4.1 and 4.2.

###### Proposition 4.3 (Mean of estimated Shannon entropy)

The expected value $E[\hat{H}(X)]$ of the estimated Shannon entropy is given by:

$$E[\hat{H}(X)] = H(X) - \sum_{i\in I}\frac{\theta_i^2}{2n_i}\sum_{x\in\mathcal{X}^+}\frac{D_{Xi}[x]-D_{Xi}[x]^2}{P_X[x]} + O(n_i^{-2}).$$

See Appendix A.2 for the proof. From this we obtain the bias of the Shannon entropy estimates.

###### Proposition 4.4 (Variance of estimated Shannon entropy)

The variance of the estimated Shannon entropy is given by:

$$V[\hat{H}(X)] = \sum_{i\in I}\frac{\theta_i^2}{n_i}\left(\sum_{x\in\mathcal{X}^+}D_{Xi}[x]\bigl(1+\log P_X[x]\bigr)^2 - \left(\sum_{x\in\mathcal{X}^+}D_{Xi}[x]\bigl(1+\log P_X[x]\bigr)\right)^2\right) + O(n_i^{-2}).$$

See Appendix A.4 for the proof. From this we obtain the confidence interval of the Shannon entropy estimates.
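As an illustration, the quantities in Propositions 4.3 and 4.4 can be computed as follows. This is a sketch with hypothetical component data; the helper name is ours, and it assumes the statistically analyzed components cover the whole system, so that $P_X$ can be recomputed from them.

```python
import math

def entropy_bias_and_variance(components):
    """Bias and first-order variance of the hybrid Shannon entropy
    estimate (Propositions 4.3 and 4.4).  `components` is a list of
    (theta_i, n_i, D_Xi) triples, where D_Xi maps each value x to its
    probability within component S_i."""
    # Full distribution P_X[x] = sum_i theta_i * D_Xi[x].
    P_X = {}
    for theta, n, D in components:
        for x, p in D.items():
            P_X[x] = P_X.get(x, 0.0) + theta * p
    bias = variance = 0.0
    for theta, n, D in components:
        b = sum((D[x] - D[x] ** 2) / P_X[x] for x in D if P_X[x] > 0)
        s1 = sum(D[x] * (1 + math.log(P_X[x])) ** 2 for x in D)
        s2 = sum(D[x] * (1 + math.log(P_X[x])) for x in D)
        bias -= theta ** 2 / (2 * n) * b
        variance += theta ** 2 / n * (s1 - s2 ** 2)
    return bias, variance
```

For a single component with a uniform two-valued distribution and $n = 100$, the bias is $-\frac{1}{2n} = -0.005$ and the first-order variance vanishes, since $1+\log P_X[x]$ is constant over the support.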

## 5 Estimation Using Prior Knowledge about Systems

In this section we show how to use prior knowledge about systems to improve the accuracy of the estimation, i.e., to make the variance (and the confidence interval size) smaller and reduce the required sample sizes.

### 5.1 Approximate Estimation Using Knowledge of Prior Distributions

Our hybrid statistical estimation method integrates both precise and statistical analysis, and it can be seen as a generalization and extension of previous work [CCG10, Mod89, Bri04].

Due to an incorrect computation of the bias, the state-of-the-art statistical analysis tool LeakWatch [CKN14, CKNb] does not correctly estimate mutual information. We explain this problem in Section 5.1.1 and show how to fix it in Section 5.1.2. We extend this result to the estimation of conditional entropy in Section 5.1.3.

#### 5.1.1 State of the Art

For example, Chatzikokolakis et al. [CCG10] present a method for estimating the mutual information between two random variables $X$ (over secret input values) and $Y$ (over observable output values) when the analyst knows the (prior) distribution $P_X$ of $X$. In the estimation they collect execution traces by running the system for each secret value $x$. Thanks to the precise knowledge of $P_X$, they obtain more precise and accurate estimates than the other previous work [Mod89, Bri04] that also estimates $P_X$ from execution traces.

Estimation using the precise knowledge of $P_X$ is an instance of our result if the system is decomposed into one component for each secret $x$. If we assume all joint probabilities are non-zero, the following approximate result in [CCG10] follows from Theorem 4.1.

###### Corollary 5.1

The expected value $E[\hat{I}(X;Y)]$ of the estimated mutual information is given by:

$$E[\hat{I}(X;Y)] = I(X;Y) + \frac{(\#\mathcal{X}-1)(\#\mathcal{Y}-1)}{2n} + O(n^{-2}),$$

where $\#\mathcal{X}$ and $\#\mathcal{Y}$ denote the numbers of possible secrets and observables respectively.
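To see how Corollary 5.1 follows from Theorem 4.1, instantiate it with a single component ($\theta_i = 1$, $n_i = n$, $D_i = P_{XY}$) and assume that all joint probabilities are non-zero, so that $\mathcal{D} = \mathcal{X} \times \mathcal{Y}$:

```latex
\sum_{(x,y)\in\mathcal{D}} \varphi_{xy}
  = \sum_{(x,y)} \frac{P_{XY}[x,y]-P_{XY}[x,y]^2}{P_{XY}[x,y]}
  = \sum_{(x,y)} \bigl(1-P_{XY}[x,y]\bigr)
  = \#\mathcal{X}\,\#\mathcal{Y}-1,
```

and analogously $\sum_{x}\varphi_{x} = \#\mathcal{X}-1$ and $\sum_{y}\varphi_{y} = \#\mathcal{Y}-1$, so the bias term equals $\frac{1}{2n}\bigl((\#\mathcal{X}\,\#\mathcal{Y}-1)-(\#\mathcal{X}-1)-(\#\mathcal{Y}-1)\bigr) = \frac{(\#\mathcal{X}-1)(\#\mathcal{Y}-1)}{2n}$. This also makes explicit why zero or near-zero probabilities invalidate the approximation: they shrink the support $\mathcal{D}$ below $\#\mathcal{X}\,\#\mathcal{Y}$.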

Using this result the bias is calculated as in [CCG10], depending only on the size of the joint distribution matrix. However, the bias can be strongly influenced by probability values close or equal to zero in the distribution; therefore this approximate result is correct only when all joint probabilities are non-zero and large enough, which is a strong restriction in practice. We show in Section 8.2.3 that the tool LeakWatch [CKN14] uses Corollary 5.1, and consequently miscalculates the bias and gives an estimate far from the true value in the presence of very small probability values.

#### 5.1.2 Our Estimation Using Knowledge of Prior Distributions

To overcome these issues we present more general results for the case in which the analyst knows the prior distribution $P_X$. We assume that the system is decomposed into disjoint components $S_{ix}$, one for each index $i \in I$ and input $x$, and that each $S_{ix}$ is executed with probability $\theta_{ix}$ in the system $S$. Let $\mathcal{X}^+$ be the support of $P_X$.

##### Estimation of Mutual Information

In the estimation of mutual information we separately execute each component $S_{ix}$ multiple times to collect execution traces. Unlike the previous work, the analyst may change the number of executions of $S_{ix}$ to $n_i\lambda_i[x]$, where $\lambda_i[x]$ is an importance prior that the analyst chooses to determine how the sample size is allocated to each component $S_{ix}$. An adaptive way of choosing the importance priors will be presented in Section 6.

Given the number $K_{ixy}$ of $S_{ix}$'s traces with output $y$, we define the empirical conditional distribution of output given input by $\hat{D}_i[y|x] = \frac{K_{ixy}}{n_i\lambda_i[x]}$. Let $M_{ixy} = \frac{\theta_{ix}^2\,(D_i[y|x]-D_i[y|x]^2)}{\lambda_i[x]}$, and let $\mathcal{D}_y$ be the set of inputs $x$ with $P_{XY}[x,y] > 0$. Then we can calculate the mean and variance of the estimated mutual information $\hat{I}_{\Theta,\Lambda}(X;Y)$ as follows.

###### Proposition 5.2 (Mean of mutual information estimated using the knowledge of the prior)

The expected value of the estimated mutual information is given by:

$$E[\hat{I}_{\Theta,\Lambda}(X;Y)] = I(X;Y) + \sum_{i\in I}\frac{1}{2n_i}\sum_{y\in\mathcal{Y}^+}\left(\sum_{x\in\mathcal{D}_y}\frac{M_{ixy}}{P_{XY}[x,y]} - \frac{\sum_{x\in\mathcal{D}_y}M_{ixy}}{P_Y[y]}\right) + O(n_i^{-2}).$$
###### Proposition 5.3 (Variance of mutual information estimated using the knowledge of the prior)

The variance $V[\hat{I}_{\Theta,\Lambda}(X;Y)]$ of the estimated mutual information is given by:

$$V[\hat{I}_{\Theta,\Lambda}(X;Y)] = \sum_{i\in I}\sum_{x\in\mathcal{X}^+}\frac{\theta_{ix}^2}{n_i\lambda_i[x]}\left(\sum_{y\in\mathcal{D}_x}D_i[y|x]\left(\log\frac{P_Y[y]}{P_{XY}[x,y]}\right)^2 - \left(\sum_{y\in\mathcal{D}_x}D_i[y|x]\log\frac{P_Y[y]}{P_{XY}[x,y]}\right)^2\right) + O(n_i^{-2}),$$

where $\mathcal{D}_x$ is the set of outputs $y$ with $P_{XY}[x,y] > 0$.

See Appendix A.6 for the proofs.

#### 5.1.3 Estimation of Conditional Entropy

The new method can also estimate the conditional Shannon entropy $H(X|Y)$ of a random variable $X$ given a random variable $Y$ in a system. In the context of quantitative security, $H(X|Y)$ represents the uncertainty of a secret $X$ after observing an output $Y$ of the system. The mean and variance of the conditional entropy estimate are obtained from those of the mutual information in the case where the analyst knows the prior.

###### Proposition 5.4 (Mean of estimated conditional entropy)

The expected value $E[\hat{H}_{\Theta,\Lambda}(X|Y)]$ of the estimated conditional Shannon entropy is given by $E[\hat{H}_{\Theta,\Lambda}(X|Y)] = H(X) - E[\hat{I}_{\Theta,\Lambda}(X;Y)]$, where $E[\hat{I}_{\Theta,\Lambda}(X;Y)]$ is the expected value of the estimated mutual information in the case where the analyst knows the prior (shown in Proposition 5.2).

By $H(X|Y) = H(X) - I(X;Y)$ and the fact that $H(X)$ is known precisely to the analyst, we obtain $E[\hat{H}_{\Theta,\Lambda}(X|Y)] = H(X) - E[\hat{I}_{\Theta,\Lambda}(X;Y)]$.

###### Proposition 5.5 (Variance of estimated conditional entropy)

The variance $V[\hat{H}_{\Theta,\Lambda}(X|Y)]$ of the estimated conditional Shannon entropy coincides with the variance $V[\hat{I}_{\Theta,\Lambda}(X;Y)]$ of the estimated mutual information in the case where the analyst knows the prior (shown in Proposition 5.3).

By $\hat{H}_{\Theta,\Lambda}(X|Y) = H(X) - \hat{I}_{\Theta,\Lambda}(X;Y)$, where $H(X)$ is a known constant, we obtain $V[\hat{H}_{\Theta,\Lambda}(X|Y)] = V[\hat{I}_{\Theta,\Lambda}(X;Y)]$.

### 5.2 Abstraction-Then-Sampling Using Partial Knowledge of Components

In this section we extend the hybrid statistical estimation method to the case in which the analyst knows that the output of some of the components does not depend on the secret input (for instance, from static code analysis). Such prior knowledge can allow us to abstract components into simpler ones and thus reduce the sample size needed for the statistical analysis.

We illustrate the basic idea of this “abstraction-then-sampling” technique as follows. Consider an analyst who knows that two input–output pairs $(x,y)$ and $(x',y')$ have the same probability in a component $S_i$: $D_i[x,y] = D_i[x',y']$. Then, when constructing the empirical distribution $\hat{D}_i$ from a set of traces, we can count the number of traces having either $(x,y)$ or $(x',y')$ and divide it by two: $\hat{D}_i[x,y] = \hat{D}_i[x',y'] = \frac{K_{ixy}+K_{ix'y'}}{2n_i}$. Hence, to achieve a given accuracy, the sample size required for the estimation using the prior knowledge of this equality is smaller than that without using it.

In the following we generalize this idea to deal with similar information that the analyst may possess about the components. Let us consider a (probabilistic) system in which, for some components, observing the output provides no information on the input. Assume that the analyst is aware of this through qualitative information analysis (for verifying non-interference). Such a component has a sub-channel matrix in which all non-zero rows have an identical conditional distribution of outputs given inputs [CT06]. Consequently, when we estimate the matrix of such a component it suffices to estimate one of its rows, so the number of executions needed is proportional to the number of outputs rather than to the number of input–output pairs.

The abstraction-then-sampling approach can be simply explained by referring to the joint distribution matrix in Fig. 4. Each row of the sub-distribution matrix for one of the components is identical, even though the rows of the joint matrix are not, and the analyst knows this by analyzing the code of the program and finding that for this component the output is independent of the input. The analyst then knows that it is unnecessary to execute the component separately for each possible input value: it suffices to execute the component for a single input value, and to apply the results to each row of that component's sub-distribution matrix. This yields more precise results and a smaller variance (and confidence interval) on the estimation, given a fixed total sample size for the component.
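A sketch of this row-reuse construction follows; the numbers are hypothetical, and `xi` stands for the input distribution within the component, assumed known from the precise analysis.

```python
import numpy as np

# One component whose output is known (by qualitative analysis) to be
# independent of the input.
theta_i = 0.4                      # probability of executing the component
xi = np.array([0.5, 0.3, 0.2])     # known input distribution within it

# Run the component for a single input value only: output counts over Y.
K_y = np.array([70, 30])
n_i = K_y.sum()
D_Y_hat = K_y / n_i                # one estimated row

# Copy the row to every input, weighted by the input probabilities:
# R_hat[x, y] = theta_i * xi[x] * D_Y_hat[y].
R_hat = theta_i * np.outer(xi, D_Y_hat)
assert abs(R_hat.sum() - theta_i) < 1e-9
```

All $n_i$ executions contribute to the single estimated row, instead of being spread over every input value, which is why the variance shrinks for a fixed budget.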

Note that even when some components leak no information, computing the mutual information for the whole system requires constructing the matrix of the system, hence the matrices of all components.

Let $I^\star \subseteq I$ be the set of indexes of the components whose channel matrices have non-zero rows consisting of the same conditional distribution. For each $i \in I^\star$, we define $\xi_{ix}$ as the probability of having the input $x$ in the component $S_i$. To estimate the mutual information for the whole system, we apply the abstraction-then-sampling technique to the components $S_i$ with $i \in I^\star$ and the standard sampling technique (shown in Section 4) to the components $S_i$ with $i \in I \setminus I^\star$.

The mean and variance of the estimated mutual information are then as follows. The results show that the bias and the confidence interval are smaller than when this prior knowledge about the components is not used.

###### Theorem 5.6 (Mean of mutual information estimated using the abstraction-then-sampling)

The expected value $E[\hat{I}_{I^\star}(X;Y)]$ of the estimated mutual information is given by:

$$E[\hat{I}_{I^\star}(X;Y)] = I(X;Y) + \sum_{i\in I\setminus I^\star}\frac{\theta_i^2}{2n_i}\left(\sum_{(x,y)\in\mathcal{D}}\varphi_{ixy} - \sum_{x\in\mathcal{X}^+}\varphi_{ix} - \sum_{y\in\mathcal{Y}^+}\varphi_{iy}\right) + \sum_{i\in I^\star}\frac{\theta_i^2}{2n_i}\left(\sum_{(x,y)\in\mathcal{D}}\psi_{ixy} - \sum_{y\in\mathcal{Y}^+}\varphi_{iy}\right) + O(n_i^{-2})$$

where $\psi_{ixy} = \frac{\xi_{ix}^2\,(D_{Yi}[y]-D_{Yi}[y]^2)}{P_{XY}[x,y]}$.

See Appendix A.1 for the proof.

###### Theorem 5.7 (Variance of mutual information estimated using the abstraction-then-sampling)

The variance $V[\hat{I}_{I^\star}(X;Y)]$ of the estimated mutual information is given by:

$$V[\hat{I}_{I^\star}(X;Y)] = \sum_{i\in I\setminus I^\star}\frac{\theta_i^2}{n_i}\left(\sum_{(x,y)\in\mathcal{D}}D_i[x,y]\left(\log\frac{P_{XY}[x,y]}{P_X[x]P_Y[y]}\right)^2 - \left(\sum_{(x,y)\in\mathcal{D}}D_i[x,y]\log\frac{P_{XY}[x,y]}{P_X[x]P_Y[y]}\right)^2\right) + \sum_{i\in I^\star}\frac{\theta_i^2}{n_i}\left(\sum_{y\in\mathcal{Y}^+}D_{Yi}[y]\,\gamma_{iy}^2 - \left(\sum_{y\in\mathcal{Y}^+}D_{Yi}[y]\,\gamma_{iy}\right)^2\right) + O(n_i^{-2})$$

where $\gamma_{iy} = \sum_{x\in\mathcal{X}^+}\xi_{ix}\log\frac{P_{XY}[x,y]}{P_X[x]P_Y[y]}$.

See Appendix A.3 for the proof.

## 6 Adaptive Optimization of Sample Sizes

In this section we present a method for deciding the sample size of each component to estimate mutual information with an optimal accuracy when using the hybrid estimation technique in Section 4 and its variants using prior information on the system in Section 5. The proof for all results in this section can be found in Appendix A.5.

##### Mutual Information

To decide the sample sizes we take into account the trade-off between the accuracy and the cost of the statistical analysis: the computational cost increases proportionally to the sample size $n_i$ (i.e., the number of $S_i$'s execution traces), while a larger sample size provides a smaller variance and hence a more accurate estimate.

More specifically, given a budget $n$ on the total sample size for the whole system, we obtain an optimal accuracy of the estimate by adjusting each component's sample size $n_i$ (under the constraint $\sum_{i\in I}n_i = n$). This idea resembles importance sampling in statistical model checking, in that the sample size is adjusted to make the estimate more accurate. To compute the optimal sample sizes, we first run each component to collect a small number (compared to $n$, for instance dozens) of execution traces. From these we calculate certain intermediate values of the variance and determine the sample sizes for further executions. Formally, let $v_i$ be the following intermediate value of the variance for the component $S_i$, computed from the empirical distributions obtained so far:

$$v_i = \theta_i^2\left(\sum_{(x,y)\in\mathcal{D}}\hat{D}_i[x,y]\left(\log\frac{\hat{P}_{XY}[x,y]}{\hat{P}_X[x]\hat{P}_Y[y]}\right)^2 - \left(\sum_{(x,y)\in\mathcal{D}}\hat{D}_i[x,y]\log\frac{\hat{P}_{XY}[x,y]}{\hat{P}_X[x]\hat{P}_Y[y]}\right)^2\right),$$

so that, by Theorem 4.2, the variance of the estimate equals $\sum_{i\in I}\frac{v_i}{n_i}$ up to higher-order terms.

Then we find the $n_i$'s that minimize the variance of the estimate by using the following theorem.

###### Theorem 6.1 (Optimal sample sizes)

Given the total sample size $n$ and the above intermediate variance $v_i$ of the component $S_i$ for each $i \in I$, the variance of the mutual information estimate is minimized if, for all $i \in I$, the sample size for $S_i$ is given by: $n_i = \frac{n\sqrt{v_i}}{\sum_{j\in I}\sqrt{v_j}}$.
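This allocation can be sketched as follows: because the first-order variance decomposes as $\sum_{i\in I} v_i/n_i$, minimizing it under $\sum_{i\in I} n_i = n$ yields sample sizes proportional to $\sqrt{v_i}$. The rounding to integer execution counts is our addition.

```python
import math

def optimal_sample_sizes(n, v):
    """Split a total budget n across components in proportion to the
    square root of each intermediate variance v_i, which minimizes
    sum_i v_i / n_i subject to sum_i n_i = n.  Results are rounded to
    integers for use as execution counts."""
    roots = [math.sqrt(vi) for vi in v]
    total = sum(roots)
    return [round(n * r / total) for r in roots]
```

For example, with a budget of 1000 traces and intermediate variances 1, 4 and 9, the components receive 167, 333 and 500 executions respectively: the noisiest component gets the largest share of the budget.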

##### Shannon Entropy

Analogously to Theorem 6.1 we can adaptively optimize the sample sizes in the estimation of Shannon entropy in Section 4.3. To compute the optimal sample sizes we define the intermediate variance $v_i$ by:

$$v_i = \theta_i^2\left(\sum_{x\in\mathcal{X}^+}\hat{D}_{Xi}[x]\bigl(1+\log \hat{P}_X[x]\bigr)^2 - \left(\sum_{x\in\mathcal{X}^+}\hat{D}_{Xi}[x]\bigl(1+\log \hat{P}_X[x]\bigr)\right)^2\right).$$

Then we can compute the optimal sample sizes by using the following proposition.

###### Proposition 6.2 (Optimal sample sizes for Shannon entropy estimation)

Given the total sample size and the above intermediate variance of the component for each ,  the variance of the Shannon entropy estimate is minimized if, for all