F-BLEAU: Fast Black-box Leakage Estimation

02/04/2019 ∙ by Giovanni Cherubin, et al. ∙ EPFL

We consider the problem of measuring how much a system reveals about its secret inputs. We work under the black-box setting: we assume no prior knowledge of the system's internals, and we run the system for choices of secrets and measure its leakage from the respective outputs. Our goal is to estimate the Bayes risk, from which one can derive some of the most popular leakage measures (e.g., min-entropy, additive, and multiplicative leakage). The state-of-the-art method for estimating these leakage measures is the frequentist paradigm, which approximates the system's internals by looking at the frequencies of its inputs and outputs. Unfortunately, this does not scale for systems with large output spaces, where it would require too many input-output examples. Consequently, it also cannot be applied to systems with continuous outputs (e.g., time side channels, network traffic). In this paper, we exploit an analogy between Machine Learning (ML) and black-box leakage estimation to show that the Bayes risk of a system can be estimated by using a class of ML methods: the universally consistent learning rules; these rules can exploit patterns in the input-output examples to improve the estimates' convergence, while retaining formal optimality guarantees. We focus on a set of them, the nearest neighbor rules; we show that they significantly reduce the number of black-box queries required for a precise estimation whenever nearby outputs tend to be produced by the same secret; furthermore, some of them can tackle systems with continuous outputs. We illustrate the applicability of these techniques on both synthetic and real-world data, and we compare them with the state-of-the-art tool, leakiEst, which is based on the frequentist approach.




I Introduction

| System | Dataset | Frequentist | NN | $k_n$-NN |
|---|---|---|---|---|
| Random | 100 secrets, 100 obs. | 10 070 | 10 070 | 10 070 |
| Geometric ($\nu = 0.1$) | 100 secrets, 10K obs. | 35 016 | 333 | 458 |
| Geometric ($\nu = 0.02$) | 100 secrets, 10K obs. | 152 904 | 152 698 | 68 058 |
| Geometric ($\nu = 2$) | 10K secrets, 1K obs. | 95 500 | 94 204 | 107 707 |
| Multimodal Geometric () | 100 secrets, 10K obs. | 44 715 | 568 | 754 |
| Spiky (contrived example) | 2 secrets, 10K obs. | 22 908 | 29 863 | 62 325 |
| Planar Geometric | Gowalla check-ins in San Francisco area | X | X | 19 948 |
| Laplacian | " | N/A | X | 19 961 |
| Blahut-Arimoto | " | 1 285 | 1 170 | 1 343 |

The proposed tool, F-BLEAU, is the combination of frequentist, NN, and $k_n$-NN estimates, as an alternative to the frequentist paradigm.

TABLE I: Number of examples required for convergence of the estimates. “X” means an estimate did not converge.

Measuring the information leakage of a system is one of the founding pillars of security. From side channels to biases in random number generators, quantifying how much information a system leaks about its secret inputs is crucial for preventing adversaries from exploiting it; this has been the focus of intensive research efforts in the areas of privacy and of quantitative information flow (QIF). Most approaches in the literature are based on the white-box approach, which consists in calculating analytically the channel matrix of the system, constituted by the conditional probabilities of the outputs given the secrets, and then computing the desired leakage measures (for instance, mutual information [1], min-entropy leakage [2], or $g$-leakage [3]). However, while one typically has white-box access to the system one wants to secure, determining a system’s leakage analytically is often impractical, due to the size or complexity of its internals, or to the presence of unknown factors. These obstacles have led researchers to investigate methods for measuring a system’s leakage in a black-box manner.

Until a decade ago, the most popular measure of leakage was Shannon mutual information (MI). However, in his seminal paper [2] Smith showed that MI is not appropriate to represent a realistic attacker, and proposed a notion of leakage based on Rényi min-entropy (ME) instead. Consequently, in this paper we consider the general problem of estimating the Bayes risk of a system, which is the smallest error achievable by an adversary at predicting its secret inputs given the outputs. From the Bayes risk one can derive several leakage measures, including ME and the additive and multiplicative leakage [4]. These measures are considered by the QIF community among the most fundamental notions of leakage.

To the best of our knowledge, the only existing approach for the black-box estimation of the Bayes risk comes from a classical statistical technique, which we refer to as the frequentist paradigm. The idea is to run the system repeatedly on chosen secret inputs, and then count the relative frequencies of the secrets and respective outputs so as to estimate their joint probability distribution; from this distribution, it is then possible to compute estimates of the desired leakage measure. LeakWatch [5] and leakiEst [6], two well-known tools for black-box leakage estimation, are applications of this principle.

Unfortunately, the frequentist approach does not always scale for real-world problems: as the number of possible input and output values of the channel matrix increases, the number of examples required for this method to converge becomes too large to gather. For example, LeakWatch requires a number of examples that is much larger than the product of the sizes of the input and output spaces. For the same reason, this method cannot be used for systems with continuous outputs; indeed, it cannot even be formally constructed in such a case.

Our contribution

In this paper, we show that machine learning (ML) methods can provide the necessary scalability to black-box measurements, and yet maintain formal guarantees on their estimates. By observing a fundamental equivalence between ML and black-box leakage estimation, we show that any ML rule from a certain class (the universally consistent rules) can be used to estimate the leakage of a system with arbitrary precision. In particular, we study rules based on the nearest neighbor principle, namely Nearest Neighbor (NN) and $k_n$-NN, which exploit a metric on the output space to achieve a considerably faster convergence than frequentist approaches. In Tab. I we summarize the number of examples necessary for the estimators to converge, for the various systems considered in the paper. We focus on nearest neighbor methods, among the existing universally consistent rules, because: i) they are simple to reason about, and ii) we can identify the class of systems for which they will excel, which happens whenever the distribution is somehow regular with respect to a metric on the output (e.g., time side channels, traffic analysis, and most mechanisms used for privacy). Moreover, some of these methods can directly tackle systems with continuous output.

We evaluate these estimators on synthetic data, where we know the true distributions and we can determine exactly when the estimates converge. Furthermore, we use them for measuring the leakage in a real dataset of users’ locations, defended with three state-of-the-art mechanisms: two geo-indistinguishability mechanisms (planar geometric and planar Laplacian) [7], and the method by Oya et al. [8], which we refer to as the Blahut-Arimoto mechanism. Crucially, the planar Laplacian is real-valued, which $k_n$-NN methods can tackle out of the box, but the frequentist method cannot. Results on both synthetic and real-world data show that our methods give a strong advantage whenever there is a notion of metric in the output that can be exploited. Finally, we compare our methods with leakiEst on the problem of estimating the leakage of European passports [9, 6], and on the location privacy data.

As further evidence of their practicality, we use them in Appendix G to measure the leakage of a time side channel in a hardware implementation of finite field exponentiation.

No Free Lunch

A central takeaway of our work is that, while all the estimators we study (including the frequentist approach) are asymptotically optimal in the number of examples, none of them can give guarantees on its finite-sample performance; indeed, no estimator can. This is a consequence of the No Free Lunch theorem in ML [10], which informally states that all learning rules are equivalent when averaged over the possible distributions of data. This rules out the existence of an optimal estimator.

In practice, this means that we should always evaluate several estimators, and select the one that converges fastest. Fortunately, our main finding (i.e., any universally consistent ML rule is a leakage estimator) adds a whole new class of estimators, which one can use in practical applications.

We therefore propose a tool, F-BLEAU (Fast Black-box Leakage Estimation AUtomated), which computes nearest neighbor and frequentist estimates, and selects the one converging faster. We release it as open-source software (URL omitted for anonymous submission), and we hope in the future to extend it and support several more estimators based on UC ML rules.

Nearest Neighbor rules

Nearest neighbor rules excel whenever there is a notion of metric on the output space, and the output distribution is “regular” (in the sense that it does not change too abruptly between two neighboring points). We expect this to be the case for several real-world systems, such as: side channels whose output is time, an electromagnetic signal, or power consumption; traffic analysis on network packets; and geographic location data. Moreover, most mechanisms used in privacy and QIF use smooth noise distributions. Suitable applications may also come from recent attacks on ML models, such as model inversion [11] and membership inference [12].

Furthermore, we observe that even when there is no metric, or when the output distribution is irregular (e.g., a system whose internal distribution has been randomly sampled), these rules are equivalent to the frequentist approach. Indeed, the only case we observe in which they are misled is when the system is crafted so that the metric works against classification (e.g., see the “Spiky” example in Tab. I).

II Related Work

Chatzikokolakis et al. [13] introduced methods for measuring the leakage of a deterministic program in a black-box manner; these methods worked by collecting a large number of inputs and respective outputs, and by estimating the underlying probability distribution accordingly; this is what we refer to as the frequentist paradigm. A fundamental development of their work by Boreale and Paolini [14] showed that, in the absence of significant a priori information about the output distribution, no estimator does better than the exhaustive enumeration of the input domain. In line with this work, section IV will show that, as a consequence of the No Free Lunch theorem in ML, no leakage estimator can claim to converge faster than any other estimator for all distributions.

The best known tools for black-box estimation of leakage measures based on the Bayes risk (e.g., min-entropy) are leakiEst [15, 16] and LeakWatch [6, 5], both based on the frequentist paradigm. The former also allows a zero-leakage test for systems with continuous outputs. In section VIII we provide a comparison of leakiEst with our proposal.

Cherubin [17] used the guarantees of nearest neighbor learning rules for estimating the information leakage (in terms of the Bayes risk) of defenses against website fingerprinting attacks in a black-box manner.

Shannon mutual information (MI) is the main alternative to the Bayes risk-based notions of leakage in the QIF literature. Although there is a relation between MI and Bayes risk [18], the corresponding models of attackers are very different: the first corresponds to an attacker who can try infinitely many times to guess the secret, while the second has only one try at their disposal [2]. Consequently, MI and Bayes-risk measures, such as ME, can give very different results: Smith [2] shows two programs that have almost the same MI, but one has an ME several orders of magnitude larger than the other one; conversely, there are examples of two programs such that ME is $0$ for both, while the MI is $0$ in one case and strictly positive (several bits) in the other one.

In the black-box literature, MI is usually computed by using Kernel Density Estimation, which, however, only guarantees asymptotic optimality under smoothness assumptions on the distributions. On the other hand, the ML literature has offered developments in this area: Belghazi et al. proposed an MI lower bound estimator based on deep neural networks, and proved its consistency (i.e., it converges to the true MI value asymptotically). Similarly, other works constructed MI variational lower bounds [20, 21].

III Preliminaries

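| Symbol | Description |
|---|---|
| $s \in \mathbb{S}$ | A secret |
| $o \in \mathbb{O}$ | An object/black-box output |
| $(s, o) \in \mathbb{S} \times \mathbb{O}$ | An example |
| $(\pi, C_{s,o})$ | A system, given a set of priors $\pi$ and channel matrix $C$ |
| $\mu$ | Distribution induced by a system on $\mathbb{S} \times \mathbb{O}$ |
| $f: \mathbb{O} \to \mathbb{S}$ | A classifier |
| $\ell$ | Loss function w.r.t. which we evaluate a classifier |
| $R^f$ | The expected misclassification error of a classifier $f$ |
| $R^*$ | Bayes risk |

TABLE II: Symbols table.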

We define a system, and show that its leakage can be expressed in terms of the Bayes risk. We then introduce ML notions, which we will later use to estimate the Bayes risk.

III-A Notation

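We consider a system $(\pi, C_{s,o})$ that associates a secret input $s$ to an observation (or object) $o$ in a possibly randomized way. The system is defined by a set of prior probabilities $\pi(s) := P(s)$, $s \in \mathbb{S}$, and a channel matrix $C$ of size $|\mathbb{S}| \times |\mathbb{O}|$, for which $C_{s,o} := P(o \mid s)$ for $s \in \mathbb{S}$ and $o \in \mathbb{O}$. We call $\mathbb{S} \times \mathbb{O}$ the example space. We assume the system does not change over time; for us, $\mathbb{S}$ is finite, and $\mathbb{O}$ is finite unless otherwise stated.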

III-B Leakage Measures

The state of the art in QIF is represented by the leakage measures based on $g$-vulnerability, a family whose most representative member is min-vulnerability [2], the complement of the Bayes risk. This paper is concerned with finding tight estimates of the Bayes risk, which can then be used to estimate the appropriate leakage measure.

Bayes risk

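The Bayes risk, $R^*$, is the error of the optimal (idealized) classifier for the task of predicting a secret $s$ given an observation $o$ output by a system. It is defined with respect to a loss function $\ell: \mathbb{S} \times \mathbb{S} \to \mathbb{R}_{\geq 0}$, where $\ell(s, s')$ is the risk of an adversary predicting $s'$ for an observation $o$, when its actual secret is $s$. We focus on the 0-1 loss function, $\ell(s, s') := I(s \neq s')$, taking value $1$ if $s \neq s'$, $0$ otherwise. The Bayes risk of a system $(\pi, C_{s,o})$ is defined as:

$$R^* := 1 - \sum_{o \in \mathbb{O}} \max_{s \in \mathbb{S}} C_{s,o}\,\pi(s). \tag{1}$$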

Random guessing

A baseline for evaluating a system is the error committed by an idealized adversary who knows the priors but has no access to the channel, and whose best strategy is to always output the secret with the highest prior. We call the error of this adversary the random guessing error:

$$R^{\pi} := 1 - \max_{s \in \mathbb{S}} \pi(s). \tag{2}$$
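As a concrete illustration, the following Python sketch computes the Bayes risk and the random guessing error of a small, made-up system, and derives the usual Bayes-risk-based leakage measures from them. The function names, the toy priors, and the toy channel matrix are our own assumptions and do not come from F-BLEAU; the leakage formulas are the standard ones (additive leakage as the difference of the two errors, multiplicative leakage as the ratio of their complements, min-entropy leakage as its base-2 logarithm).

```python
import math


def bayes_risk(priors, channel):
    """R* = 1 - sum_o max_s C[s][o] * pi(s), for the 0-1 loss."""
    n_objects = len(channel[0])
    return 1.0 - sum(
        max(priors[s] * channel[s][o] for s in range(len(priors)))
        for o in range(n_objects)
    )


def random_guessing_error(priors):
    """R^pi = 1 - max_s pi(s): error of an adversary who only knows the priors."""
    return 1.0 - max(priors)


def leakage_measures(priors, channel):
    """Derive leakage measures from the Bayes risk (hypothetical helper)."""
    r_star = bayes_risk(priors, channel)
    r_pi = random_guessing_error(priors)
    multiplicative = (1.0 - r_star) / (1.0 - r_pi)
    return {
        "bayes_risk": r_star,
        "random_guessing": r_pi,
        "additive": r_pi - r_star,
        "multiplicative": multiplicative,
        "min_entropy_bits": math.log2(multiplicative),
    }


# Toy system (an assumption for illustration): 2 secrets, 3 objects.
PRIORS = [0.5, 0.5]
CHANNEL = [
    [0.8, 0.1, 0.1],  # P(o | s = 0)
    [0.1, 0.1, 0.8],  # P(o | s = 1)
]
MEASURES = leakage_measures(PRIORS, CHANNEL)
```

For this toy system the adversary's optimal error is 0.15 against a random guessing error of 0.5, so observing the output is worth a factor of 1.7 in the probability of guessing correctly.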


III-C Black-box estimation of $R^*$

This paper is concerned with estimating the Bayes risk given $n$ examples sampled from the joint distribution $\mu$ on $\mathbb{S} \times \mathbb{O}$ generated by $(\pi, C_{s,o})$. By running the system $n$ times on secrets $s_1, \dots, s_n \in \mathbb{S}$, chosen according to $\pi$, we generate a sequence of corresponding outputs $o_1, \dots, o_n \in \mathbb{O}$, thus forming a training set of examples $\{(s_1, o_1), \dots, (s_n, o_n)\}$. From these data, we aim to make an estimate close to the real Bayes risk. (In line with the ML literature, we call the training or test “set” what is technically a multiset; also, we loosely use the set notation “$\{\}$” for both sets and multisets when the nature of the object is clear from the context.)
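The sampling procedure just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: secrets and objects are encoded as integer indices, the system is simulated from a known channel matrix (a real black-box system would be queried instead), and `sample_examples` is a hypothetical helper name.

```python
import random


def sample_examples(priors, channel, n, rng):
    """Run the simulated system n times: draw s ~ pi, then o ~ C[s]."""
    secrets = list(range(len(priors)))
    objects = list(range(len(channel[0])))
    examples = []
    for _ in range(n):
        s = rng.choices(secrets, weights=priors, k=1)[0]
        o = rng.choices(objects, weights=channel[s], k=1)[0]
        examples.append((s, o))
    return examples


# Toy system (an assumption for illustration): 2 secrets, 3 objects.
PRIORS = [0.5, 0.5]
CHANNEL = [
    [0.8, 0.1, 0.1],  # P(o | s = 0)
    [0.1, 0.1, 0.8],  # P(o | s = 1)
]
TRAINING = sample_examples(PRIORS, CHANNEL, 1000, random.Random(0))
```

Passing an explicitly seeded `random.Random` instance keeps the training multiset reproducible across runs, which is convenient when comparing estimators on the same data.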

III-D Learning Rules

We introduce ML rules (or, simply, learning rules), which are algorithms for selecting a classifier given a set of training examples. In this paper, we will use the error of some ML rules as an estimator of the Bayes risk.

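Let $\mathcal{F} := \{f \mid f: \mathbb{O} \to \mathbb{S}\}$ be a set of classifiers. A learning rule is a possibly randomized algorithm that, given a training set $\{(s_1, o_1), \dots, (s_n, o_n)\}$, returns a classifier $f \in \mathcal{F}$, with the goal of minimizing the expected loss $\mathbb{E}\,\ell(f(o), s)$ for a new example $(s, o)$ sampled from $\mu$ [22]. In the case of the 0-1 loss function, the expected loss coincides with the expected probability of error (expected error for short), and if $\mu$ is generated by a system $(\pi, C_{s,o})$, then the expected error of a classifier $f$ is:

$$R^f = 1 - \sum_{o \in \mathbb{O}} C_{f(o),o}\,\pi(f(o)), \tag{3}$$

where $f(o)$ is the secret predicted for object $o$. If $\mathbb{O}$ is infinite (and $\mu$ is continuous) the summation is replaced by an integral.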

III-E Frequentist estimate of $R^*$

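The frequentist paradigm [13] for measuring the leakage of a channel consists in estimating the probabilities $C_{s,o}$ by counting their frequency in the training data $\{(s_1, o_1), \dots, (s_n, o_n)\}$:

$$\hat{C}_{s,o} := \frac{|\{i : s_i = s,\ o_i = o\}|}{|\{i : s_i = s\}|}. \tag{4}$$

We can obtain the frequentist error from Eq. 3:

$$R^{Freq}_n = 1 - \sum_{o \in \mathbb{O}} C_{f^{Freq}_n(o),o}\,\pi(f^{Freq}_n(o)), \tag{5}$$

where $f^{Freq}_n$ is the frequentist classifier, namely:

$$f^{Freq}_n(o) := \begin{cases} \arg\max_{s \in \mathbb{S}} \hat{C}_{s,o}\,\hat{\pi}(s) & \text{if } o \text{ appears in the training data,} \\ \arg\max_{s \in \mathbb{S}} \hat{\pi}(s) & \text{otherwise,} \end{cases} \tag{6}$$

where $\hat{\pi}$ is estimated from the examples: $\hat{\pi}(s) = |\{i : s_i = s\}| / n$.

Consider a finite example space $\mathbb{S} \times \mathbb{O}$. Provided with enough examples, the frequentist approach always converges: clearly, $\hat{C} \to C$ as $n \to \infty$, because events’ frequencies converge to their probabilities by the Law of Large Numbers.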

However, there is a fundamental issue with this approach. Given a training set , the frequentist classifier can tell something meaningful (i.e., better than random guessing) for an object , only as long as appeared in the training set; but, for very large systems (e.g., those with a large object space), the probability of observing an example for each object becomes small, and the frequentist classifier approaches random guessing. We study this matter further in subsection VI-D and Appendix E.
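A minimal Python sketch of the frequentist classifier and of its expected error (Eq. 3) may make the issue concrete. It is written under our own assumptions: secrets and objects are integer indices, ties are broken towards the smallest secret, and the helper names are hypothetical. Note that maximizing the estimated channel entry times the estimated prior is the same as maximizing the joint count of the pair (secret, object), which is what the code does.

```python
from collections import Counter


def frequentist_classifier(training):
    """Build the frequentist classifier from a non-empty list of (s, o) pairs."""
    joint_counts = Counter(training)
    secret_counts = Counter(s for s, _ in training)
    # Unseen objects: fall back to the most frequent secret (smallest on ties).
    fallback = max(secret_counts, key=lambda s: (secret_counts[s], -s))
    best = {}  # object -> (secret, joint count)
    for (s, o), count in joint_counts.items():
        current = best.get(o)
        if current is None or count > current[1] or (
            count == current[1] and s < current[0]
        ):
            best[o] = (s, count)

    def classify(o):
        return best[o][0] if o in best else fallback

    return classify


def expected_error(classify, priors, channel):
    """Eq. 3: R^f = 1 - sum_o C[f(o)][o] * pi(f(o)), over a finite object space."""
    return 1.0 - sum(
        priors[classify(o)] * channel[classify(o)][o]
        for o in range(len(channel[0]))
    )


# Toy system and a tiny training set in which object 1 is never observed.
PRIORS = [0.5, 0.5]
CHANNEL = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
TRAINING = [(0, 0), (1, 2), (1, 2)]
F_FREQ = frequentist_classifier(TRAINING)
R_FREQ = expected_error(F_FREQ, PRIORS, CHANNEL)
```

Here object 1 never appears in the training data, so the classifier can only answer with the most frequent secret; in a system with many objects, most queries end up in this fallback branch.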

IV No Free Lunch in Learning

The frequentist approach performs well only for objects it has seen in the training data; in the next section, we will introduce estimators that aim to provide good predictions even for unseen objects. However, we shall first answer an important question: is there an estimator that is “optimal” for all systems?

A negative answer to this question is given by the so-called “No Free Lunch” (NFL) theorem by Wolpert [10]. The theorem is formulated for the expected loss of a learning rule on unseen objects (i.e., that were not in the training data), which is referred to as the off-training-set (OTS) loss.

Theorem 1 (No Free Lunch).

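Let $A_1$ and $A_2$ be two learning rules, $\ell$ a cost function, and $\mu$ a distribution on $\mathbb{S} \times \mathbb{O}$. We indicate by $\mathbb{E}_i(\ell \mid \mu, n)$ the OTS loss of $A_i$ given $\mu$ and $n$, where the expectation is computed over all the possible training sets of size $n$ sampled from $\mu$. Then, if we take the uniform average among all possible distributions $\mu$, we have

$$\mathbb{E}_1(\ell \mid \mu, n) - \mathbb{E}_2(\ell \mid \mu, n) = 0.$$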

Intuitively, the NFL theorem says that, if all distributions (and, therefore, all channel matrices) are equally likely, then all learning algorithms are equivalent. Remarkably, this holds for any strategy, even if one of the rules is random guessing.

An important implication of this for our purposes is that for every two learning rules $A_1$ and $A_2$ there will always exist some system for which rule $A_1$ converges faster than $A_2$, and vice versa there will be a system for which $A_2$ outperforms $A_1$.

From the practical perspective of black-box security, this demonstrates that we should always test several estimators and select the one that converges faster. Fortunately, the connection between ML and black-box security we highlight in this paper results in the discovery of a whole class of new estimators.

V Machine Learning Estimates of the Bayes Risk

In this section, we define the notion of a universally consistent learning rule, and show that the error of a classifier selected according to such a rule can be used for estimating the Bayes risk. Then, we introduce various universally consistent rules based on the nearest neighbor principle.

Throughout the section, we use interchangeably a system $(\pi, C_{s,o})$ and its corresponding joint distribution $\mu$ on $\mathbb{S} \times \mathbb{O}$. Note that there is a one-to-one correspondence between them.

V-A Universally Consistent Rules

Consider a distribution $\mu$ and a learning rule $A$ selecting a classifier $f_n$ according to $n$ training examples sampled from $\mu$. Intuitively, as the available training data increases, we would like the expected error of $f_n$ for a new example $(s, o)$ sampled from $\mu$ to be minimized (i.e., to get close to the Bayes risk). The following definition captures this intuition.

Definition 1 (Consistent Learning Rule).

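Let $\mu$ be a distribution on $\mathbb{S} \times \mathbb{O}$ and let $A$ be a learning rule. Let $f_n$ be a classifier selected by $A$ using $n$ training examples sampled from $\mu$. Let $(\pi, C_{s,o})$ be the system corresponding to $\mu$, and let $R^{f_n}$ be the expected error of $f_n$, as defined by (3). We say that $A$ is consistent if $R^{f_n} \to R^*$ as $n \to \infty$.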

The next definition strengthens this property, by asking the rule to be consistent for all distributions:

Definition 2 (Universally Consistent (UC) Learning Rule).

A learning rule is universally consistent if it is consistent for any distribution $\mu$ on $\mathbb{S} \times \mathbb{O}$.

By this definition, the expected error of a classifier selected according to a universally consistent rule is also an estimator of the Bayes risk, since it converges to $R^*$ as $n \to \infty$.

In the rest of this section we introduce Bayes risk estimates based on universally consistent nearest neighbor rules; they are summarized in Tab. III together with their guarantees.

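| Method | Guarantee | Space $\mathbb{O}$ |
|---|---|---|
| frequentist | $\to R^*$ | finite |
| NN | $\to R^*$ | finite |
| $k_n$-NN | $\to R^*$ | infinite, separable |
| NN Bound | | infinite, separable |

TABLE III: Estimates’ guarantees as $n \to \infty$.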

V-B NN estimate

The Nearest Neighbor (NN) is one of the simplest ML classifiers: given a training set and a new object $o$, it predicts the secret of its closest training observation (nearest neighbor). It is defined both for finite and infinite object spaces, although it is UC only in the first case.

We introduce a formulation of NN, which can be seen as an extension of the frequentist approach, that takes into account ties (i.e., neighbors that are equally close to the new object $o$), and which guarantees consistency when $\mathbb{O}$ is finite.

Consider a training set $\{(s_1, o_1), \dots, (s_n, o_n)\}$, an object $o$, and a distance metric $d: \mathbb{O} \times \mathbb{O} \to \mathbb{R}_{\geq 0}$. The NN classifier predicts a secret for $o$ by taking a majority vote over the set of secrets whose objects have the smallest distance to $o$. Formally, let $d_{\min}(o) := \min_{i = 1, \dots, n} d(o, o_i)$ and define:

$$NN(o) := \arg\max_{s \in \mathbb{S}} |\{i : d(o, o_i) = d_{\min}(o),\ s_i = s\}|.$$

We show that NN is universally consistent for finite example spaces $\mathbb{S} \times \mathbb{O}$.

Theorem 2 (Universal consistency of NN).

Consider a distribution on $\mathbb{S} \times \mathbb{O}$, where $\mathbb{S}$ and $\mathbb{O}$ are finite. Let $R^{NN}_n$ be the expected error of the NN classifier for a new observation $o$. As the number of training examples $n \to \infty$:

$$R^{NN}_n \to R^*.$$

Sketch proof.

For an observation $o$ that appears in the training set, the NN classifier is equivalent to the frequentist approach. For a finite space $\mathbb{S} \times \mathbb{O}$, as $n \to \infty$, the probability that the training set contains all $o \in \mathbb{O}$ approaches $1$. Thus, the NN rule is asymptotically (in $n$) equivalent to the frequentist approach, which means its error also converges to $R^*$. ∎
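The tie-aware NN rule can be sketched as follows in Python. This is an illustration under our own assumptions rather than the F-BLEAU implementation: objects are numbers compared with the absolute difference, a remaining tie in the vote is broken towards the smallest secret, and `nn_classifier` is a hypothetical name.

```python
from collections import Counter


def nn_classifier(training, distance=lambda a, b: abs(a - b)):
    """NN with ties: majority vote among all training examples at minimum distance."""

    def classify(o):
        d_min = min(distance(o, o_i) for _, o_i in training)
        votes = Counter(s for s, o_i in training if distance(o, o_i) == d_min)
        # Most voted secret; the smallest secret wins if the vote itself is tied.
        return min(votes, key=lambda s: (-votes[s], s))

    return classify


# Tiny training set: object 1 was observed three times, object 5 once.
TRAINING = [(0, 1), (0, 1), (1, 1), (1, 5)]
NN = nn_classifier(TRAINING)
```

When the queried object appears in the training data (here, `NN(1)`), the minimum distance is zero and the vote is taken over the examples with exactly that object, which is the frequentist prediction; for an unseen object such as 4, the vote is taken over its closest observed neighbors instead of falling back to the priors.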

V-C $k_n$-NN estimate

Whilst NN guarantees universal consistency in finite example spaces, this does not hold for infinite $\mathbb{O}$. In this case, we can achieve universal consistency with the k-NN classifier, an extension of NN, for appropriate choices of the parameter $k$.

The k-NN classifier takes a majority vote among the secrets of its $k$ nearest neighbors. Breaking ties in the k-NN definition requires more care than with NN. In the literature, this is generally done via strategies that add randomness or arbitrariness to the choice (e.g., if two neighbors have the same distance, select the one with the smallest index in the training data) [23]. We use a novel tie-breaking strategy, which takes into account ties, but gives more importance to the closest neighbors. In early experiments, we observed that this strategy converged faster than standard approaches.

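Consider a training set $\{(s_1, o_1), \dots, (s_n, o_n)\}$, an object to predict $o$, and some metric $d(\cdot, \cdot)$. Let $o_{(i)}$ denote the $i$-th closest object to $o$, and $s_{(i)}$ its respective secret. If ties do not occur after the $k$-th neighbor (i.e., if $d(o, o_{(k)}) \neq d(o, o_{(k+1)})$), then k-NN outputs the most frequent among the secrets of the first $k$ neighbors:

$$k\text{-NN}(o) := \arg\max_{s \in \mathbb{S}} |\{j \in \{1, \dots, k\} : s_{(j)} = s\}|.$$

If ties exist after the $k$-th neighbor, that is, for $k' \leq k < k''$:

$$d(o, o_{(k')}) = \dots = d(o, o_{(k)}) = \dots = d(o, o_{(k'')}),$$

we proceed as follows. Let $\hat{s}$ be the most frequent secret in $\{s_{(k')}, \dots, s_{(k'')}\}$; k-NN predicts the most frequent secret in the following multiset, truncated at the tail to have size $k$:

$$s_{(1)}, s_{(2)}, \dots, s_{(k'-1)}, \hat{s}, \hat{s}, \dots, \hat{s}.$$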

We now define $k_n$-NN, a universally consistent learning rule that selects a k-NN classifier for a training set of $n$ examples by choosing $k$ as a function of $n$.

Definition 3 ($k_n$-NN rule).

Given a training set of $n$ examples, the $k_n$-NN rule selects a k-NN classifier, where $k = k_n$ is chosen such that $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$.

Stone proved that $k_n$-NN is universally consistent [24]:

Theorem 3 (Universal consistency of the $k_n$-NN rule).

Consider a probability distribution $\mu$ on the example space $\mathbb{S} \times \mathbb{O}$, where $\mu$ has a density. Select a distance metric $d$ such that $(\mathbb{O}, d)$ is separable (a separable space is a space containing a countable dense subset; e.g., finite spaces and the space of $q$-dimensional vectors $\mathbb{R}^q$ with Euclidean metric). Then the expected error of the $k_n$-NN rule converges to $R^*$ as $n \to \infty$.

This holds for any distance metric. In our experiments, we will use the Euclidean distance, and we will evaluate $k_n$-NN rules for $k_n = \log n$ (natural logarithm) and $k_n = \log_{10} n$.
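A short Python sketch of a $k_n$-NN rule follows. It is a simplification under stated assumptions: $k_n$ is obtained by rounding the logarithm up and never goes below 1 (the rounding used in the experiments is not specified here), objects are numbers compared with the absolute difference, and distance ties are broken by position in the training data, which is the standard strategy mentioned above and not the novel tie-breaking strategy of this section. All function names are hypothetical.

```python
import math
from collections import Counter


def kn_log(n):
    """k_n = ceil(ln n), at least 1 (the rounding is an assumption)."""
    return max(1, math.ceil(math.log(n)))


def kn_log10(n):
    """k_n = ceil(log10 n), at least 1 (the rounding is an assumption)."""
    return max(1, math.ceil(math.log10(n)))


def knn_classifier(training, k, distance=lambda a, b: abs(a - b)):
    """Plain k-NN: majority vote among the k closest training examples."""

    def classify(o):
        # sorted() is stable, so equally distant examples keep training order.
        ranked = sorted(training, key=lambda example: distance(o, example[1]))
        votes = Counter(s for s, _ in ranked[:k])
        return min(votes, key=lambda s: (-votes[s], s))

    return classify


def kn_nn_rule(training, kn=kn_log):
    """Select a k-NN classifier with k chosen as a function of n = len(training)."""
    return knn_classifier(training, kn(len(training)))


TRAINING = [(0, 0), (0, 1), (0, 2), (1, 3), (1, 10), (1, 11), (1, 12)]
KNN3 = knn_classifier(TRAINING, 3)
```

Both choices of $k_n$ grow without bound while $k_n / n$ vanishes, which is what Definition 3 requires; for instance, `kn_log(500)` is 7 and `kn_log10(500)` is 3.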

The ML literature is rich in UC rules and other useful tools for black-box security; we list some of them in Appendix A.

VI Evaluation on Synthetic Data

In this section we evaluate our estimates on several synthetic systems for which the channel matrix is known. For each system, we sample $n$ examples from its distribution, and then compute the estimate on the whole object space as in Eq. 3; this is possible because $\mathbb{O}$ is finite. Since for synthetic data we know the real Bayes risk, we can measure how many examples are required for the convergence of each estimate. We do this as follows: let $R^f_n$ be an estimate of $R^*$, trained on a dataset of $n$ examples. We say the estimate $\delta$-converged to $R^*$ after $n$ examples if its relative change from $R^*$ is smaller than $\delta$:

$$\frac{|R^f_n - R^*|}{R^*} < \delta.$$

While relative change has the advantage of taking into account the magnitude of the compared values, it is not defined when the denominator is $0$; therefore, when $R^* \approx 0$ (Tab. IV), we verify convergence with the absolute change:

$$|R^f_n - R^*| < \delta.$$
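The convergence check can be sketched in Python as below. Two details are our own assumptions: the threshold `eps` under which the Bayes risk is treated as zero, and the reading of "converged after $n$ examples" as "within $\delta$ at $n$ and at every later measurement" (a single early crossing would otherwise count as convergence). The helper names are hypothetical.

```python
def within_delta(estimate, r_star, delta, eps=1e-12):
    """Relative change when R* > 0, absolute change when R* is (close to) 0."""
    if r_star <= eps:
        return abs(estimate - r_star) < delta
    return abs(estimate - r_star) / r_star < delta


def examples_to_converge(estimates, r_star, delta):
    """Smallest n from which all later estimates stay within delta, else None.

    `estimates` is a list of (n, estimate) pairs sorted by increasing n.
    """
    converged_at = None
    for n, estimate in estimates:
        if within_delta(estimate, r_star, delta):
            if converged_at is None:
                converged_at = n
        else:
            converged_at = None  # left the band: convergence must restart
    return converged_at


# Made-up learning curve for a system with R* = 0.3.
CURVE = [(10, 0.5), (20, 0.32), (30, 0.4), (40, 0.31), (50, 0.3)]
```

On this made-up curve the estimate briefly enters the 0.1 band at 20 examples but leaves it at 30, so it is only reported as converged from 40 examples onwards.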

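| Name | Privacy parameter $\nu$ | $\lvert\mathbb{S}\rvert$ | $\lvert\mathbb{O}\rvert$ | $R^*$ |
|---|---|---|---|---|
| Geometric | 1.0 | 100 | 10K | |
| Geometric | 0.1 | 100 | 10K | 0.007 |
| Geometric | 0.01 | 100 | 10K | 0.600 |
| Geometric | 0.2 | 100 | 1K | 0.364 |
| Geometric | 0.02 | 100 | 10K | 0.364 |
| Geometric | 0.002 | 100 | 100K | 0.364 |
| Geometric | 2 | 100K | 100K | 0.238 |
| Geometric | 2 | 100K | 10K | 0.924 |
| Multimodal | 1.0 | 100 | 10K | 0.450 |
| Multimodal | 0.1 | 100 | 10K | 0.456 |
| Multimodal | 0.01 | 100 | 10K | 0.797 |
| Spiky | N/A | 2 | 10K | 0 |
| Random | N/A | 100 | 100 | 0.979 |

TABLE IV: Synthetic systems.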

The systems used in our experiments are briefly discussed in this section, and summarized in Tab. IV; we detail them in Appendix B. A uniform prior is assumed in all cases.

VI-A Geometric systems

We first consider systems generated by adding geometric noise to the secret, one of the typical mechanisms used to implement differential privacy [25]. Their channel matrix has the following form:

$$C_{s,o} = P(o \mid s) = \lambda \exp(-\nu\,|g(s) - o|),$$

where $\nu$ is a privacy parameter, $\lambda$ a normalization factor, and $g$ a function $\mathbb{S} \to \mathbb{O}$; a detailed description of these systems is given in Appendix B.
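To make the shape of such a channel concrete, here is a small Python sketch that builds a geometric-style channel matrix. The precise construction is in Appendix B and is not reproduced here, so two choices below are our own assumptions: the function mapping secrets to objects spreads the secrets evenly over the object space, and each row is normalized over the finite object space. `geometric_channel` is a hypothetical name.

```python
import math


def geometric_channel(n_secrets, n_objects, nu):
    """Rows proportional to exp(-nu * |g(s) - o|), each normalized to sum to 1."""
    if n_objects < n_secrets:
        raise ValueError("this sketch assumes at least as many objects as secrets")
    step = n_objects // n_secrets  # assumed g(s) = s * step
    channel = []
    for s in range(n_secrets):
        center = s * step
        weights = [math.exp(-nu * abs(center - o)) for o in range(n_objects)]
        total = sum(weights)  # plays the role of the normalization factor
        channel.append([w / total for w in weights])
    return channel


CHANNEL = geometric_channel(4, 8, 1.0)
```

Larger values of the privacy parameter concentrate each row around its center, so the outputs of different secrets overlap less and the system leaks more.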

We consider the following three parameters:

  • the privacy parameter $\nu$,

  • the ratio $|\mathbb{O}| / |\mathbb{S}|$, and

  • the size of the secret space $|\mathbb{S}|$.

We vary each of these parameters one at a time, to isolate their effect on the convergence rate.

VI-A1 Variation of the privacy parameter $\nu$

We fix $|\mathbb{S}| = 100$, $|\mathbb{O}| = 10$K, and we consider three cases: $\nu = 1.0$, $0.1$, and $0.01$. The results for the estimation of the Bayes risk and the convergence rate are illustrated in Fig. 1 and Tab. V respectively. In the table, results are reported for convergence level $\delta \in \{0.1, 0.05, 0.01, 0.005\}$; an “X” means a particular estimate did not converge within 500K examples; a missing row for a certain $\delta$ means no estimate converged.

The results indicate that the nearest neighbor methods have a much faster convergence than the standard frequentist approach, particularly when dealing with large systems. The reason is that geometric systems have a regular behavior with respect to the Euclidean metric, which can be exploited by NN and $k_n$-NN to make good predictions for unseen objects.

Fig. 1: Estimates’ convergence for geometric systems when varying their privacy parameter $\nu$. The respective distributions are shown in the top figure for two adjacent secrets.
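| System | $\delta$ | Freq. | NN | $k_n$-NN ($\log_{10}$) | $k_n$-NN ($\log$) |
|---|---|---|---|---|---|
| $\nu = 1.0$ | 0.1 | 1 994 | 267 | 396 | 679 |
| | 0.05 | 4 216 | 325 | 458 | 781 |
| | 0.01 | 19 828 | 425 | 633 | 899 |
| | 0.005 | 38 621 | 439 | 698 | 904 |
| $\nu = 0.1$ | 0.1 | 18 110 | 269 | 396 | 673 |
| | 0.05 | 35 016 | 333 | 458 | 768 |
| | 0.01 | 127 206 | 439 | 633 | 899 |
| | 0.005 | 211 742 | 4 844 | 698 | 904 |
| $\nu = 0.01$ | 0.1 | 105 453 | 103 357 | 99 852 | 34 243 |
| | 0.05 | 205 824 | 205 266 | 205 263 | 199 604 |

TABLE V: Convergence of the estimates when varying $\nu$, fixed $|\mathbb{S}| \times |\mathbb{O}| = 100 \times 10$K.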

VI-A2 Variation of the ratio $|\mathbb{O}| / |\mathbb{S}|$

Now we fix $|\mathbb{S}| = 100$, and we consider three cases: $|\mathbb{O}| = 1$K, $10$K, and $100$K. (Note that we want to keep the ratio fixed, see Appendix B; as a consequence, $\nu$ has to vary: we set $\nu$ to $0.2$, $0.02$, and $0.002$, respectively.) Results in Fig. 2 and Tab. VI show how the nearest neighbor methods become much better than the frequentist approach as $|\mathbb{O}| / |\mathbb{S}|$ increases. This is because the larger the object space, the larger the number of unseen objects at the moment of classification, and the more the frequentist approach has to rely on random guessing. The nearest neighbor methods are much less affected because they can rely on the proximity to outputs already classified.

Fig. 2: Estimates’ convergence for geometric systems when varying the ratio $|\mathbb{O}| / |\mathbb{S}|$. The respective distributions are shown in the top figure for two adjacent secrets.
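| System | $\delta$ | Freq. | NN | $k_n$-NN ($\log_{10}$) | $k_n$-NN ($\log$) |
|---|---|---|---|---|---|
| Geometric 100x1K, $\nu = 0.2$ | 0.1 | 8 679 | 8 707 | 7 108 | 2 505 |
| | 0.05 | 14 823 | 14 853 | 14 853 | 7 673 |
| | 0.01 | 51 694 | 60 796 | 60 796 | 60 796 |
| | 0.005 | 71 469 | 71 469 | 71 469 | 71 469 |
| Geometric 100x10K, $\nu = 0.02$ | 0.1 | 85 912 | 85 644 | 71 003 | 11 197 |
| | 0.05 | 152 904 | 152 698 | 151 153 | 68 058 |
| Geometric 100x100K, $\nu = 0.002$ | 0.1 | X | X | 413 974 | 2 967 |

TABLE VI: Convergence of the estimates when varying $|\mathbb{O}| / |\mathbb{S}|$.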

VI-A3 Case

We fix $\nu$, and we consider two cases: 10K×10K and 10K×1K (secrets × objects). It should be noted that the formulation of geometric systems prohibits the number of secrets from exceeding the number of outputs; for this reason, in the 10K×1K system some secrets are associated with the same distribution over the output space (Appendix B).

The results in Fig. 3 and Tab. VII indicate that NN and frequentist are mostly equivalent: this is because they both need to observe at least one example for each secret. $k_n$-NN rules, on the other hand, show poor performance, due to the fact that they would need at least $k_n$ examples for each secret. A natural extension of our work is to look at notions of metric also in the secret space for improving convergence.

Fig. 3: Estimates’ convergence for geometric systems when . The distributions are shown in the top figure for two adjacent secrets . In the case (right) there are identical distributions that coincide on , and identical distributions on .
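| System | $\delta$ | Freq. | NN | $k_n$-NN ($\log_{10}$) | $k_n$-NN ($\log$) |
|---|---|---|---|---|---|
| Geometric 10Kx10K | 0.1 | 74 501 | 73 085 | 88 296 | 140 618 |
| | 0.05 | 95 500 | 94 204 | 107 707 | 155 403 |
| | 0.01 | 137 099 | 137 348 | 144 846 | 192 014 |
| | 0.005 | 153 370 | 153 370 | 159 075 | 203 363 |
| Geometric 10Kx1K | 0.1 | 5 | 5 | 5 | 5 |
| | 0.05 | 721 | 514 | 2 309 | 5 977 |
| | 0.01 | 5 595 | 6 171 | 7 330 | 12 354 |
| | 0.005 | 10 770 | 10 797 | 11 037 | 14 575 |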
TABLE VII: Convergence of the estimates when , .

VI-B Multimodal geometric system

We now evaluate the estimators on systems with a multimodal distribution. In particular, we create multimodal geometric systems by summing two geometric probability distributions, appropriately normalized and shifted by some parameter. We provide the details of this distribution in Appendix B.

Fig. 4: Estimates’ convergence for multimodal geometric systems when varying the privacy parameter $\nu$. The distributions are shown in the top figure for two adjacent secrets.

VI-B1 Evaluation

Results are reported in Fig. 4. As expected, we observe that nearest neighbor rules improve on the frequentist approach; the reason is that, even for multimodal distributions, there exists a metric on the outputs which they can exploit. Detailed $\delta$-convergence results are in Appendix C.

VI-C Spiky system: When $k_n$-NN rules fail

Nearest neighbor rules take advantage of the metric on the object space to improve their convergence considerably. However, as a consequence of the NFL theorem, there exist systems for which the frequentist approach outperforms NN and $k_n$-NN. Investigating the form of such systems is important to understand when these methods fail.

We craft one such system, the Spiky system, where the metric misleads predictions. The Spiky system is such that neighboring points are associated with different secrets. This means that NN and $k_n$-NN rules will tend to predict the wrong secret, until enough examples are available. We detail its construction in Appendix B.

Fig. 5: Estimates’ convergence for a Spiky system (2x10K).

We conducted experiments for a Spiky system of size 2x10K. Results in Fig. 5 confirm the hypothesis: nearest neighbor rules are misled for this system.

Interestingly, while the NN estimate keeps decreasing as the number of examples $n$ increases, there is a certain range of $n$’s where the $k_n$-NN estimates become worse than random guessing. Intuitively, this is because when $n$ becomes larger than $|\mathcal{O}|$, all elements in $\mathcal{O}$ tend to be covered by the examples. For every $o$ there are two neighbors, $o-1$ and $o+1$, that belong to the class opposite to the one of $o$, so if $k_n$ is not too small with respect to $n$, it is likely that in the multiset of the $k_n$ closest neighbors of $o$, the number of $(o-1)$’s and $(o+1)$’s exceeds the number of $o$’s, which means that $o$ will be misclassified. As $n$ increases, however, the ratio between $k_n$ and the number of $o$’s in the examples tends to decrease (because $k_n/n \to 0$ as $n \to \infty$), hence at some point we will have enough $o$’s to win the majority vote in the $k_n$ neighbors ($o$’s are considered before $(o-1)$’s and $(o+1)$’s, by the nearest neighbor definition), so $o$ will not be misclassified anymore.

Concerning the comparison between the NN and frequentist estimates, we can do it analytically. We start by computing the expected error of the NN method on the Spiky system in terms of the number of training examples $n$. Let $T_n$ be a training set of examples of size $n$. Given a new object $o$, let us consider the NN estimate $R^{NN}_n(o)$ for $o$, i.e., the expected probability of error in the classification of $o$. This is the probability that the element closest to $o$ that appears in the training set has odd distance from $o$ (i.e., distance $2\ell+1$, for some natural number $\ell$). Namely, it is the probability that:

  • $o$ is not in the training data but either $o+1$ or $o-1$ are, or

  • $o$, $o \pm 1$, $o \pm 2$ are not in the training data but either $o+3$ or $o-3$ are, or

  • …etc.

Hence, treating the absences of distinct elements as independent, we have:

$$R^{NN}_n(o) = \sum_{\ell=0}^{\infty} a^{4\ell+1}\,(1-a^{2})$$

where $a$ is the probability that an element does not occur in any of the $n$ examples of the training set. (Thus $a^{4\ell+1}$ represents the probability that none of the elements $o \pm h$, with $0 \le h \le 2\ell$, appear in the training set, and $1-a^{2}$ represents the probability that at least one of the elements $o+2\ell+1$, $o-2\ell-1$ appears in the training set.) By using the result of the geometric series

$$\sum_{\ell=0}^{\infty} a^{4\ell} = \frac{1}{1-a^{4}}$$

we obtain:

$$R^{NN}_n(o) = \frac{a\,(1-a^{2})}{1-a^{4}} = \frac{a}{1+a^{2}}.$$

Since we assume that the distribution on $\mathcal{O}$ is uniform, we have $a = (1 - 1/|\mathcal{O}|)^{n}$.

We want to study how the error estimate depends on the relative size of the training set with respect to the size of $\mathcal{O}$. Hence, let $x = n/|\mathcal{O}|$. Then we have $a = (1 - 1/|\mathcal{O}|)^{x|\mathcal{O}|}$, which, for large $|\mathcal{O}|$, becomes $a \approx e^{-x}$. Therefore:

$$R^{NN}_n(o) \approx \frac{e^{-x}}{1+e^{-2x}}.$$

It is easy to see that $R^{NN}_n(o) \to 1/2$ for $x \to 0$, and $R^{NN}_n(o) \to 0$ for $x \to \infty$, as expected.

Consider now the frequentist estimate $R^{Freq}_n(o)$. In this case, given an element $o$, the classification is done correctly if $o$ appears in the training set. Otherwise, we do random guessing, which gives a correct or wrong classification with equal probability. Only the latter case contributes to the probability of error, hence the error estimate is half the probability that $o$ does not belong to the training set:

$$R^{Freq}_n(o) = \frac{a}{2} \approx \frac{e^{-x}}{2}.$$

Therefore, since $1 + e^{-2x} \le 2$, $R^{NN}_n(o)$ is always above $R^{Freq}_n(o)$.
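The analytical comparison can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' code: it assumes a Spiky-like system in which objects are uniform on a ring of even size `m` and the parity of an object determines its secret, and it takes the closed forms a/(1+a^2) for NN and a/2 for the frequentist rule, with a = exp(-n/m), as one reading of the derivation. All function names are illustrative.

```python
import numpy as np

def nn_error_closed_form(x):
    """Approximate expected NN error, with x = n / m (assumed closed form)."""
    a = np.exp(-x)
    return a / (1.0 + a * a)

def freq_error_closed_form(x):
    """Approximate frequentist error: half the chance that the object is unseen."""
    return 0.5 * np.exp(-x)

def simulate(m, n, trials, rng):
    """Monte Carlo estimate of both errors for m objects on a ring and n examples."""
    nn_err = 0.0
    fr_err = 0.0
    for _ in range(trials):
        train = rng.integers(0, m, size=n)   # uniform objects; the label is the parity
        seen = np.zeros(m, dtype=bool)
        seen[train] = True
        o = int(rng.integers(0, m))          # test object
        if not seen[o]:
            fr_err += 0.5                    # unseen object: coin flip between 2 secrets
        # NN: smallest ring distance d at which some training object exists.
        for d in range(m // 2 + 1):
            if seen[(o - d) % m] or seen[(o + d) % m]:
                nn_err += d % 2              # odd distance means opposite parity: error
                break
    return nn_err / trials, fr_err / trials

rng = np.random.default_rng(0)
m = n = 1000
nn_sim, fr_sim = simulate(m, n, trials=2000, rng=rng)
print(f"NN:   simulated {nn_sim:.3f}, closed form {nn_error_closed_form(n / m):.3f}")
print(f"Freq: simulated {fr_sim:.3f}, closed form {freq_error_closed_form(n / m):.3f}")
```

Because `m` is even, the two objects at the same ring distance from `o` always share a parity, so ties between equidistant neighbors do not affect the NN prediction in this sketch.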

VI-D Random System

In the previous sections, we have seen cases in which our methods greatly outperform the frequentist approach, and a contrived system for which they fail. We now consider a randomly generated system to evaluate their performance on an “average” system.

System description

We construct the channel matrix of a Random system by drawing its elements from the uniform distribution and normalizing its rows.
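As an illustration, the construction can be sketched as follows. This is a minimal sketch, assuming entries drawn from Uniform(0, 1); the function names, the sizes used in the example, and the uniform prior are assumptions made here. The Bayes risk is computed with the standard formula, one minus the sum over outputs of the largest joint probability.

```python
import numpy as np

def random_channel(n_secrets, n_outputs, rng):
    """Channel matrix with i.i.d. Uniform(0, 1) entries and rows normalized to 1."""
    c = rng.uniform(0.0, 1.0, size=(n_secrets, n_outputs))
    return c / c.sum(axis=1, keepdims=True)

def bayes_risk(prior, channel):
    """Bayes risk: 1 - sum over outputs of max over secrets of prior[s] * channel[s, o]."""
    joint = prior[:, None] * channel
    return 1.0 - joint.max(axis=0).sum()

rng = np.random.default_rng(0)
channel = random_channel(4, 50, rng)      # hypothetical sizes, for illustration only
prior = np.full(4, 0.25)                  # uniform prior (assumption)
print(bayes_risk(prior, channel))
```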


We consider a Random system with and count the number of examples required for $\delta$-convergence, for many $\delta$’s. Tab. VIII reports the results.

Freq. NN
0.05 5 5 5 5
0.01 82 139 202 500
0.005 10 070 10 070 10 070 10 070
TABLE VIII: Random: examples required for $\delta$-convergence.
Fig. 6: Estimates’ convergence for a Random system ().

The frequentist estimate is slightly better than NN and $k_n$-NN for $\delta = 0.01$. However, for stricter convergence requirements ($\delta = 0.005$), all the methods require the same (large) number of examples. Fig. 6 shows that indeed the methods begin to converge similarly already after 1K examples.


Results showed that nearest neighbor estimates require significantly fewer examples than the frequentist approach when dealing with medium or large systems; however, they are generally equivalent to the frequentist approach in the case of small systems.

To better understand why this is the case, we provide a crude approximation of the frequentist Bayes risk estimate.


This approximation, derived and studied in Appendix E, makes the very strong assumption that all objects are equally likely, i.e.: . However, this is enough to give us an insight into the performance of the frequentist approach: is the probability that some object does not appear within a training set of size . This weighs the value of the frequentist estimate between the optimal , used when the object appears in the training data, and random guessing : while the estimate converges asymptotically to the Bayes risk, the probability of observing an object, often related to the size , has a major influence on its convergence rate.
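One reading of this approximation is a weighted average of the Bayes risk and the random guessing error, where the weight of the latter is the probability that a given object is absent from the training set. The sketch below assumes that, with all objects equally likely, this probability is (1 - 1/|O|)^n; the function and parameter names are hypothetical, and the exact form used in Appendix E may differ.

```python
def approx_frequentist_estimate(n, n_objects, bayes_risk, random_guess_risk):
    """Crude approximation of the frequentist estimate after n examples.

    Assumes uniformly likely objects, so that an object is unseen with
    probability (1 - 1/n_objects) ** n.
    """
    p_unseen = (1.0 - 1.0 / n_objects) ** n
    return bayes_risk * (1.0 - p_unseen) + random_guess_risk * p_unseen

# The estimate starts at the random guessing error and decays towards the Bayes risk;
# the larger the object space, the slower the decay.
for n_objects in (100, 10_000):
    print(n_objects, [round(approx_frequentist_estimate(n, n_objects, 0.2, 0.5), 3)
                      for n in (0, 100, 1_000, 10_000, 100_000)])
```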

VII Application to Location Privacy

We show that F-BLEAU can be successfully applied to estimate the degree of protection provided by mechanisms such as those used in location privacy. Since the purpose of this paper is to evaluate the precision of F-BLEAU, we consider basic mechanisms for which the Bayes risk can also be computed directly. Of course, the intended applications of F-BLEAU are mechanisms or situations where the Bayes risk cannot be computed directly, either because this is too complicated, or because of the presence of unknown factors. Examples abound; for instance, the availability of additional information, like the presence of points of interest (e.g., shops, churches), or geographical characteristics of the area (e.g., roads, lakes) can affect the Bayes risk in ways that are impossible to evaluate formally.

We will consider the planar Laplacian and the planar geometric, which are the typical mechanisms used to obtain geo-indistinguishability [7], and one of the optimal mechanisms proposed by Oya et al. [8] as a refinement of the optimal mechanism by Shokri et al. [26]. The construction of the last one relies on an algorithm that was independently proposed by Blahut and by Arimoto to solve the information theory problem of achieving an optimal trade-off between the minimization of the distortion rate and the mutual information [27]. From now on, we shall refer to this as the Blahut-Arimoto mechanism. Note that the Laplacian is a continuous mechanism (i.e., its outputs are on a continuous plane); the other two are discrete.

In these experiments we also deploy the method that F-BLEAU uses in practice to compute the estimate of the Bayes risk: we first split the data into a training set and a hold-out set; then, for an increasing number of examples $n$, we train the classifier on the first $n$ examples of the training set, and then estimate its error on the hold-out set.
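The hold-out procedure can be sketched as follows. This is an illustrative sketch rather than F-BLEAU's implementation: it assumes scalar outputs, a plain 1-NN classifier with absolute-difference distance, and hypothetical function names; the toy data at the end is synthetic.

```python
import numpy as np

def nn_predict(train_x, train_y, test_x):
    """1-NN prediction for scalar outputs, using absolute difference as the metric."""
    dists = np.abs(test_x[:, None] - train_x[None, :])
    return train_y[dists.argmin(axis=1)]

def holdout_error_curve(x, y, train_fraction, sizes, rng):
    """Hold-out error after training on the first n training examples, for n in sizes.

    Every n in sizes must be at least 1 and at most the training set size.
    """
    perm = rng.permutation(len(x))
    n_train = int(train_fraction * len(x))
    tr, ho = perm[:n_train], perm[n_train:]
    errors = []
    for n in sizes:
        pred = nn_predict(x[tr[:n]], y[tr[:n]], x[ho])
        errors.append(float(np.mean(pred != y[ho])))
    return errors

# Toy data: two secrets whose outputs are well separated (synthetic, for illustration).
rng = np.random.default_rng(1)
secrets = rng.integers(0, 2, size=500)
outputs = 10.0 * secrets + rng.normal(size=500)
curve = holdout_error_curve(outputs, secrets, 0.8, [10, 100, 400], rng)
print(curve)
```

Keeping the hold-out set fixed while the training prefix grows means that successive points of the curve are measured on the same examples, so changes in the estimate reflect the additional training data only.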

Fig. 7: Area of San Francisco considered for the experiments. The input locations correspond to the inner square, the output locations to the outer one. The colored cells represent the distribution of the Gowalla checkins.

VII-A The Gowalla dataset

We consider real location data from the Gowalla dataset [28, 29], which contains users’ checkins and their geographical location. We use a square area in San Francisco, centered at the coordinates (37.755, -122.440), and extending for 1.5 km in each direction. This input area corresponds to the inner (purple) square in Fig. 7. We discretize the input using a grid of $20 \times 20$ cells of size $150 \times 150$ Sq m; the secret space of the system thus consists of 400 locations. The prior distribution on the secrets is derived from the Gowalla checkins, and it is represented in Fig. 7 by the different color intensities on the input grid. The output area is represented in Fig. 7 by the outer (blue) square. It extends 1050 m (7 cells) more than the input square on every side. We consider a larger area for the output because the planar Laplacian and Geometric naturally expand outside the input square. (In fact, these functions distribute the probability on the infinite plane, but on locations very distant from the origin the probability becomes negligible.) Since the planar Laplacian is continuous, its output domain is constituted by all the points of the outer square. As for the planar Geometric and the Blahut-Arimoto mechanisms, which are discrete, we divide the output square in a grid of cells of size Sq m; therefore, .

VII-B Defenses

The planar Geometric mechanism is characterized by a channel matrix , representing the conditional probability of reporting the location when the true location is :


where is a parameter controlling the level of noise, is a normalization factor, and is the Euclidean distance. The planar Laplacian is defined by the same equation, except that the reported location belongs to a continuous domain, and the equation defines a probability density function.
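For the discrete case, a channel matrix of this kind can be sketched as follows. The sketch assumes that the probability of reporting a location decays as exp(-nu * distance) with the Euclidean distance from the true location, and that normalization is applied per row over a finite output grid; both the per-row normalization and the helper names `grid_centers` and `planar_geometric_channel` are assumptions made here for illustration.

```python
import numpy as np

def grid_centers(n_side, cell_size, offset=0.0):
    """Centers of an n_side x n_side grid of square cells, as an (n_side**2, 2) array."""
    coords = offset + cell_size * (np.arange(n_side) + 0.5)
    xx, yy = np.meshgrid(coords, coords)
    return np.column_stack([xx.ravel(), yy.ravel()])

def planar_geometric_channel(secret_xy, output_xy, nu):
    """Rows are true locations, columns reported ones; entries decay as exp(-nu * d)."""
    diff = secret_xy[:, None, :] - output_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))               # Euclidean distance
    weights = np.exp(-nu * dist)
    return weights / weights.sum(axis=1, keepdims=True)   # assumed per-row normalization

# Small hypothetical example: a 3x3 input grid inside a 5x5 output grid of 100 m cells.
inputs = grid_centers(3, 100.0, offset=100.0)
outputs = grid_centers(5, 100.0)
channel = planar_geometric_channel(inputs, outputs, nu=0.01)
print(channel.shape)
```

A larger `nu` concentrates each row around the true location, which corresponds to less noise.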

As for the Blahut-Arimoto, it is obtained as the result of an iterative algorithm, whose definition can be found in [27].

VII-C Results

We evaluated the estimates’ convergence as a function of the number of training examples and for different values of the noise level: . We randomly split the dataset ( examples) into training () and hold-out () sets, and then evaluated the convergence of the estimators on an increasing number of training examples, .

Results for the geometric noise (Fig. 8) indicate faster convergence when $\nu$ is higher (which means less noise and lower Bayes risk), in line with the results for the synthetic systems of the previous section. In all cases, the nearest neighbor methods outperform the frequentist one, as we expected given the presence of a large number of outputs. Tab. IX shows the number of examples required to achieve $\delta$-convergence to the Bayes risk. The symbol “X” means we did not achieve a certain level of approximation with 75K examples.

Fig. 8: Estimates’ convergence speed for the planar Geometric defense applied to the Gowalla dataset, for , and , respectively. Above each graph is represented the distribution of the geometric noise for two adjacent input cells.

The corresponding results for the Laplacian noise are shown in Fig. 9 and in Tab. X. In this case, the frequentist approach is not applicable, but the $k_n$-NN rule can still approximate the Bayes risk for some approximation levels.

Fig. 9: Estimates’ convergence speed for the planar Laplacian defense applied to the Gowalla dataset, for , and , respectively. Above each graph is represented the distribution of the Laplacian noise for two adjacent input cells.

The case of the Blahut-Arimoto mechanism is quite different: surprisingly, the output probability concentrates on a small number of locations. For instance, in the case , with 100K sampled pairs we obtained only different output locations (which reduced to after we mapped them on the grid). Thanks to the small number of actual outputs, all the methods converge very fast. The results are shown in Fig. 10 and in Tab. XI.

Fig. 10: Estimates’ convergence speed for the Blahut-Arimoto defense applied to the Gowalla dataset, for , and , respectively. Above each graph is represented the distribution of the output probability as produced by the mechanism. All the outputs with non-null probability turn out to be inside the input square. The outputs are points on the grid, but here are mapped on the coarser grid for the sake of visual clarity.
frequentist NN
2 0.1 X X 25 795 1 102
0.05 X X X 55 480
4 0.1 X X 36 735 2 820
0.05 X X X 59 875
8 0.1 X X 15 253 5 244
0.05 X X X 19 948
TABLE IX: Convergence for the Planar Geometric for various .
frequentist NN
2 0.1 N/A X X 259
4 0.1 N/A X X 4 008
8 0.1 N/A X X 6 135
0.05 N/A X X 19 961
TABLE X: Convergence for the Planar Laplacian for various .
frequentist NN
2 0.1 37 37 37 37
0.05 135 135 135 135
0.01 1 671 1 664 1 408 1 408
0.005 6 179 6 179 1 671 1 671
4 0.1 220 220 220 257
0.05 503 502 509 703
0.01 2 029 1 986 2 055 2 404
0.005 2 197 2 055 2 280 2 658
8 0.1 345 401 553 1 285
0.05 1 285 1 170 1 343 1 679
0.01 2 104 2 017 2 495 4 190
0.005 2 231 2 231 3 881 6 121
TABLE XI: Convergence for the Blahut-Arimoto for various .

VIII Comparison with leakiEst

LeakWatch [5] and leakiEst [6] are the major existing black-box leakage measurement tools, both based on the frequentist approach. LeakWatch is an extension of leakiEst, which uses the latter as a subroutine, but leakiEst is more feature rich: both tools compute Shannon mutual information (MI) and min-entropy leakage (ME) in the finite-output case, but leakiEst can also perform tests in the continuous output case. We compare leakiEst with our methods, for a time side channel in the RFID chips of European passports and for the Gowalla examples of the previous section.

LeakiEst performs two functions: i) a statistical test, detecting whether there is evidence of leakage (here referred to as the leakage evidence test), and ii) the estimation of ME (discrete output) or MI (discrete and continuous output). The leakage evidence test generates a “no leakage” distribution via a bootstrapping variant, estimates the leakage measure on it, and compares this estimate with the measure computed on the original data: if the latter is far from the former (w.r.t. some defined confidence level), then the tool declares there is evidence of leakage. The second function estimates the distribution with an appropriate method (frequentist for finite outputs, Kernel Density Estimation for continuous outputs).

VIII-A Time side channel on e-Passports’ RFID chips

Passport   leakiEst: Leakage evidence   F-BLEAU: Bayes risk estimate
British    yes                          0.383
German     no                           0.490
Greek      no                           0.462
Irish      yes                          0.350

Random guessing baseline is 0.5.

TABLE XII: Leakage of European passports

Chothia et al. [9] discovered a side-channel attack in the way the protocols of various European countries’ passports exchanged messages some years ago. (The protocols have been corrected since then.) The problem was that, upon receiving a message, the e-Passport would first check the Message Authentication Code (MAC), and only afterwards verify the nonce (so as to assert that the message was not replayed). Therefore, an attacker who previously intercepted a valid message from a legitimate session could replay the message and detect a difference between the response time of the victim’s passport and any other passport; this could be used to track the victim. As an initial solution, Chothia et al. [6] proposed to add padding to the response time, and they used leakiEst to look for any evidence of leakage after such a defense.

We compared F-BLEAU and leakiEst on the padded timing data [30]. The secret space contains answers to the binary question: “is this the same passport?”; the dataset is balanced, hence the random guessing error is 0.5. We make this comparison on the basis that, if leakiEst detects no leakage, then the Bayes risk should be maximum: no leakage happens if and only if the Bayes risk equals the random guessing error. We compute ME from the Bayes risk as:
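A minimal sketch of this conversion follows. It assumes the standard definition of min-entropy leakage as the base-2 logarithm of the ratio between the adversary's success probability after and before observing the output, i.e. log2((1 - Bayes risk) / (1 - random guessing error)); the function name is hypothetical.

```python
import math

def min_entropy_leakage(bayes_risk, random_guess_risk):
    """Min-entropy leakage in bits, from the Bayes risk and the random guessing error."""
    return math.log2((1.0 - bayes_risk) / (1.0 - random_guess_risk))

# Bayes risk estimates reported for the padded e-passport data, with baseline 0.5.
for passport, risk in [("British", 0.383), ("German", 0.490),
                       ("Greek", 0.462), ("Irish", 0.350)]:
    print(passport, round(min_entropy_leakage(risk, 0.5), 3))
```

With this definition, a Bayes risk equal to the random guessing error gives zero leakage, and any estimate below the baseline gives a strictly positive value.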


For F-BLEAU, we randomly split the data into training () and hold-out sets, and then estimated the Bayes risk on the latter; we repeated this for different random initialization seeds, and averaged the estimates. Results in Tab. XII show two cases (the German and Greek passports) where leakiEst did not find enough evidence of leakage, while F-BLEAU indicates non-negligible leakage. Note that, because F-BLEAU’s results are based on an actual classifier, they implicitly demonstrate there exists an attack that succeeds with accuracy 51% and 54%, respectively. We attribute this discrepancy between the tools to the fact that the dataset is small (1K examples), and leakiEst may not find enough evidence to reject the hypothesis of “no leakage”; indeed, leakiEst sets a fairly high standard for this decision (95% confidence interval).

VIII-B Gowalla dataset

Mechanism    $\nu$   leakiEst: L.E.   leakiEst: ME   F-BLEAU: ME   True ME
B.-Arimoto   2       no*              1.481          1.479         1.501
             4       no*              2.305          2.310         2.304
             8       no*              2.738          2.746         2.738
Geometric    2       no               2.585          1.862         1.988
             4       no               2.859          2.591         2.638
             8       no               3.105          2.983         2.996

Mechanism    $\nu$   leakiEst: L.E.   F-BLEAU: ME   True ME
Laplacian    2       no               1.802         1.987
             4       no               2.550         2.631
             8       no               2.970         3.003

L.E.: leakage evidence test

TABLE XIII: Estimated leakage of privacy mechanisms on Gowalla data

We compare F-BLEAU and leakiEst on the location privacy mechanisms (section VII): Blahut-Arimoto, planar Geometric, and planar Laplacian. The main interest is to verify whether the advantage of F-BLEAU w.r.t. the frequentist approach, which we observed for large output spaces, translates into an advantage also w.r.t. leakiEst. For the first two mechanisms we also compare the ME estimates. For the Laplacian case (continuous), we only use leakiEst’s leakage evidence test.

We run F-BLEAU and leakiEst on the datasets as in section VII. Results in Tab. XIII show that, in the cases of planar Geometric and Laplacian distributions, leakiEst does not detect any leakage (the tool reports “Too small sample size”); furthermore, the ME estimates it provides for the planar Geometric distribution are far from their true values. F-BLEAU, however, is able to produce more reliable estimates.

The Blahut-Arimoto results are more interesting: because of the small number of actual outputs, the ME estimates of F-BLEAU and leakiEst perform equally well. However, even in this case leakiEst’s leakage evidence test reports “Too small sample size”. We think the reason is that leakiEst takes into account the declared size of the object space rather than the effective number of observed individual outputs; this problem should be easy to fix by inferring the output size from the examples (this is the meaning of the “*” in Tab. XIII).

IX Conclusion and Future Work

We showed that the black-box leakage of a system, measured until now with classical statistics paradigms (frequentist approach), can be effectively estimated via ML techniques. We considered a set of such techniques based on the nearest neighbor principle (i.e., close observations should be assigned the same secret), and evaluated them thoroughly on synthetic and real-world data. This makes it possible to tackle problems that were impractical until now; furthermore, it sets a new paradigm in black-box security: thanks to an equivalence between ML and black-box leakage estimation, many results from ML theory can now be imported into this practice (and vice versa).

Empirical evidence shows that the nearest neighbor techniques we introduce excel whenever there is a notion of metric they can exploit in the output space: whereas for unseen observations the frequentist approach needs to take a guess, nearest neighbor methods can use the information of neighboring observations. We also observe that whenever the output distribution is irregular, they are equivalent to the frequentist approach, but for maliciously crafted systems they can be misled. Even in those cases, however, we remark that asymptotically they are equivalent to the frequentist approach, thanks to their universal consistency property.

We also indicated that, as a consequence of the No Free Lunch (NFL) theorem in ML, no estimate can guarantee optimal convergence. We therefore proposed F-BLEAU, a combination of frequentist and nearest neighbor rules, which runs all these techniques on a system, and selects the estimate that converges fastest. We expect this work will inspire researchers to explore new leakage estimators from the ML literature; in particular, we showed that any “universally consistent” ML rule can be used to estimate the leakage of a system. Future work may focus on other rules from which one can obtain universal consistency (e.g., Support Vector Machine (SVM) and neural networks); we discuss this further in Appendix A.

A crucial advantage of the ML formulation, as opposed to the classical approach, is that it gives immediate guarantees for systems with a continuous output space. Future work may extend this to systems with continuous secret space, which in ML terms would be formalized as regression (as opposed to the classification setting we considered here).

A current limitation of our methods is that they do not provide confidence intervals. We leave this as an open problem. We remark, however, that for continuous systems it is not possible to provide confidence intervals (or to prove convergence rates) under our weak assumptions [23]; this constraint applies to any leakage estimation method.

We reiterate, however, the great advantage of ML methods: they allow tackling systems for which until now we could not measure security, with a strongly reduced number of examples.


Acknowledgment

Giovanni Cherubin has been partially supported by an EcoCloud grant. The work of Konstantinos Chatzikokolakis and Catuscia Palamidessi was partially supported by the ANR project REPAS.

We are very thankful to Marco Stronati, who was initially involved in this project, and without whom the authors would not have started working together. We are grateful to Tom Chothia and Yusuke Kawamoto, for their help in understanding leakiEst. We also thank Fabrizio Biondi for useful discussions.

This work began as a research visit whose (secondary) goal was for some of the authors and Marco to climb in Fontainebleau, France. The trip to the magic land of Fontainebleau never happened – although climbing has been a fundamental aspect of the collaboration; the name F-BLEAU is to honor this unfulfilled dream. We hope one day the group will reunite, and finally climb there together.


References

  • [1] D. Clark, S. Hunt, and P. Malacaria, “Quantitative information flow, relations and polymorphic types,” J. of Logic and Computation, vol. 18, no. 2, pp. 181–199, 2005.
  • [2] G. Smith, “On the foundations of quantitative information flow,” in Proceedings of the 12th International Conference on Foundations of Software Science and Computation Structures (FOSSACS 2009), ser. LNCS, L. de Alfaro, Ed., vol. 5504.   York, UK: Springer, 2009, pp. 288–302.
  • [3] M. S. Alvim, K. Chatzikokolakis, C. Palamidessi, and G. Smith, “Measuring information leakage using generalized gain functions,” in Proceedings of the 25th IEEE Computer Security Foundations Symposium (CSF), 2012, pp. 265–279. [Online]. Available: http://hal.inria.fr/hal-00734044/en
  • [4] C. Braun, K. Chatzikokolakis, and C. Palamidessi, “Quantitative notions of leakage for one-try attacks,” in Proceedings of the 25th Conf. on Mathematical Foundations of Programming Semantics, ser. Electronic Notes in Theoretical Computer Science, vol. 249.   Elsevier B.V., 2009, pp. 75–91. [Online]. Available: http://hal.archives-ouvertes.fr/inria-00424852/en/
  • [5] T. Chothia, Y. Kawamoto, and C. Novakovic, “LeakWatch: Estimating information leakage from java programs,” in Proc. of ESORICS 2014 Part II, 2014, pp. 219–236.
  • [6] ——, “A tool for estimating information leakage,” in International Conference on Computer Aided Verification (CAV).   Springer, 2013, pp. 690–695.
  • [7] M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi, “Geo-indistinguishability: Differential privacy for location-based systems,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security.   ACM, 2013, pp. 901–914.
  • [8] S. Oya, C. Troncoso, and F. Pérez-González, “Back to the drawing board: Revisiting the design of optimal location privacy-preserving mechanisms,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17.   ACM, 2017, pp. 1959–1972. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134004
  • [9] T. Chothia and V. Smirnov, “A traceability attack against e-passports,” in International Conference on Financial Cryptography and Data Security.   Springer, 2010, pp. 20–34.
  • [10] D. H. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural computation, vol. 8, no. 7, pp. 1341–1390, 1996.
  • [11] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2015, pp. 1322–1333.
  • [12] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in Security and Privacy (SP), 2017 IEEE Symposium on.   IEEE, 2017, pp. 3–18.
  • [13] K. Chatzikokolakis, T. Chothia, and A. Guha, “Statistical measurement of information leakage,” Tools and Algorithms for the Construction and Analysis of Systems, pp. 390–404, 2010.
  • [14] M. Boreale and M. Paolini, “On formally bounding information leakage by statistical estimation,” in International Conference on Information Security.   Springer, 2014, pp. 216–236.
  • [15] T. Chothia and A. Guha, “A statistical test for information leaks using continuous mutual information,” in Proceedings of the 24th IEEE Computer Security Foundations Symposium, CSF 2011, Cernay-la-Ville, France, 27-29 June, 2011.   IEEE Computer Society, 2011, pp. 177–190. [Online]. Available: https://doi.org/10.1109/CSF.2011.19
  • [16] T. Chothia, Y. Kawamoto, C. Novakovic, and D. Parker, “Probabilistic point-to-point information leakage,” in Computer Security Foundations Symposium (CSF), 2013 IEEE 26th.   IEEE, 2013, pp. 193–205.
  • [17] G. Cherubin, “Bayes, not naïve: Security bounds on website fingerprinting defenses,” Proceedings on Privacy Enhancing Technologies, vol. 2017, no. 4, pp. 215–231, 2017.
  • [18] N. Santhi and A. Vardy, “On an improvement over Rényi’s equivocation bound,” 2006, presented at the 44-th Annual Allerton Conference on Communication, Control, and Computing, September 2006. Available at http://arxiv.org/abs/cs/0608087.
  • [19] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville, “Mine: mutual information neural estimation,” arXiv preprint arXiv:1801.04062, 2018.
  • [20] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, 2016, pp. 2172–2180.
  • [21] J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan, “Learning to explain: An information-theoretic perspective on model interpretation,” arXiv preprint arXiv:1802.07814, 2018.
  • [22] V. Vapnik, The nature of statistical learning theory.   Springer Science & Business Media, 2013.
  • [23] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition.   Springer Science & Business Media, 2013, vol. 31.
  • [24] C. J. Stone, “Consistent nonparametric regression,” The annals of statistics, pp. 595–620, 1977.
  • [25] C. Dwork, “Differential privacy,” in 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006), ser. Lecture Notes in Computer Science, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, Eds., vol. 4052.   Springer, 2006, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1007/11787006_1
  • [26] R. Shokri, G. Theodorakopoulos, C. Troncoso, J.-P. Hubaux, and J.-Y. L. Boudec, “Protecting location privacy: optimal strategy against localization attacks,” in Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS 2012), T. Yu, G. Danezis, and V. D. Gligor, Eds.   ACM, 2012, pp. 617–627.
  • [27] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.   John Wiley & Sons, Inc., 2006.
  • [28] “The gowalla dataset.” [Online]. Available: https://snap.stanford.edu/data/loc-gowalla.html
  • [29] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: User movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’11.   New York, NY, USA: ACM, 2011, pp. 1082–1090. [Online]. Available: http://doi.acm.org/10.1145/2020408.2020579
  • [30] Example: e-passport traceability. School of Computer Science - leakiEst. [Online]. Available: www.cs.bham.ac.uk/research/projects/infotools/leakiest/examples/epassports.php
  • [31] I. Steinwart, “Support vector machines are universally consistent,” Journal of Complexity, vol. 18, no. 3, pp. 768–791, 2002.
  • [32] T. Glasmachers, “Universal consistency of multi-class support vector classification,” in Advances in Neural Information Processing Systems, 2010, pp. 739–747.
  • [33] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967.
  • [34] K. Fukunaga and D. M. Hummels, “Bayes error estimation using parzen and k-nn procedures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 5, pp. 634–643, 1987.
  • [35] M. Backes and B. Köpf, “Formally bounding the side-channel leakage in unknown-message attacks,” in European Symposium on Research in Computer Security.   Springer, 2008, pp. 517–532.
  • [36] P. C. Kocher, “Timing attacks on implementations of diffie-hellman, rsa, dss, and other systems,” in Annual International Cryptology Conference.   Springer, 1996, pp. 104–113.
  • [37] B. Köpf and D. Basin, “Timing-sensitive information flow analysis for synchronous systems,” in European Symposium on Research in Computer Security.   Springer, 2006, pp. 243–262.

Appendix A Additional tools from ML

The ML literature can offer several more tools to black-box security. We now enumerate additional UC rules, a lower bound of the Bayes risk, and a general way of obtaining estimates that converge from below.

The family of UC rules is fairly large. An overview is given by Devroye et al. [23], who, in addition to nearest neighbor methods, report histogram rules and certain kinds of neural networks, which are UC under requirements on their parameters. Steinwart proved that the Support Vector Machine (SVM) is also UC for some parameter choices, in the two-class case [31]; to the best of our knowledge, attempts to construct an SVM that is UC in the multi-class case have failed so far (e.g., [32]).

In applications with strict security requirements, a (pessimistic) lower bound of the Bayes risk may be desirable. From a result by Cover and Hart one can derive a lower bound on the Bayes risk based on the NN error,  [33]: as :


This was used as the basis for measuring the black-box leakage of website fingerprinting defenses [17].
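A sketch of such a bound is given below. It assumes the classical Cover and Hart inequality for L classes, which bounds the asymptotic NN error from above by R*(2 - L/(L-1) R*), solved for the Bayes risk R*; the function name and the clamping of the square-root argument are choices made here, and the bound is exact only asymptotically in the number of examples.

```python
import math

def bayes_risk_lower_bound(nn_error, n_secrets):
    """Lower bound on the Bayes risk derived from the (asymptotic) NN error.

    Uses R* >= (L-1)/L * (1 - sqrt(1 - L/(L-1) * R_NN)), with L = n_secrets >= 2.
    """
    L = float(n_secrets)
    inner = 1.0 - (L / (L - 1.0)) * nn_error
    inner = max(inner, 0.0)   # guard: finite-sample estimates may exceed (L-1)/L
    return ((L - 1.0) / L) * (1.0 - math.sqrt(inner))

print(bayes_risk_lower_bound(0.30, 2))
```

Since the NN error itself is an asymptotic upper bound on the Bayes risk, the pair (lower bound, NN error) brackets the quantity of interest.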

Finally, one may obtain estimators that converge to the Bayes risk in expectation from below, for example, by estimating the error of a -NN rule on its training set [34, 17].

Appendix B Description of the synthetic systems

B-A Geometric system

Geometric systems are typical in differential privacy and are obtained by adding negative exponential noise to the result of a query. The reason is that the property of DP is expressed in terms of a factor between the probability of a reported answer, and that of its immediate neighbor. A similar construction holds for the geometric mechanism implementing geo-indistinguishability. In that case the noise is added to the real location to report an obfuscated location. Here we give an abstract definition of a geometric system, in terms of secrets (e.g., result of a query / real location) and observables (e.g., reported answer / reported location).

Let and be sets of consecutive natural numbers, with the standard notion of distance. Two numbers are called adjacent if or .

Let be a real non-negative number and consider a function . After adding negative exponential noise to the output of , the resulting geometric system is described by the following channel matrix:


where is a normalizing factor. Note that the privacy level is defined by , where is the sensitivity of :


where means and are adjacent. Now let , , we select to be . We define


so as to truncate the distribution at its boundaries.

This definition of Geometric system prohibits the case . To consider such a case, we generate a repeated geometric channel matrix, such that


where is the geometric channel matrix described above.
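For illustration, a basic geometric channel can be sketched as follows. This is a simplified sketch under assumptions: entries proportional to exp(-nu * |g(s) - o|), the identity as default choice of `g`, and plain row renormalization in place of the boundary truncation described above; the repeated variant is not covered, and all names are hypothetical.

```python
import numpy as np

def geometric_channel(n_secrets, n_outputs, nu, g=None):
    """Channel with entries proportional to exp(-nu * |g(s) - o|), rows normalized."""
    if g is None:
        g = lambda s: s                      # assumed placement of the noise center
    centers = np.array([g(s) for s in range(n_secrets)], dtype=float)
    outputs = np.arange(n_outputs, dtype=float)
    weights = np.exp(-nu * np.abs(centers[:, None] - outputs[None, :]))
    return weights / weights.sum(axis=1, keepdims=True)

channel = geometric_channel(5, 9, nu=0.5)
# Within a row, moving one output further from the center scales the
# probability by exp(-nu).
print(channel[2, 3] / channel[2, 2])
```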

B-B Multimodal geometric system

We construct a multimodal distribution as the weighted sum of two geometric distributions, shifted by some shift parameter. Let be a geometric channel matrix. The respective multimodal geometric channel, for shift parameter , is:


In experiments, we used and weights .

B-C Spiky system

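Consider an observation space constituted of $q$ consecutive integer numbers $\mathbb{O} = \{0, \dots, q-1\}$, for some even positive integer $q$, and secrets space $\mathbb{S} = \{0, 1\}$. Assume that $\mathbb{O}$ is a ring with the operations $+$ and $-$ defined as the sum and the difference modulo $q$, respectively, and consider the distance on $\mathbb{O}$ defined as $d(i, j) = \min\{i - j,\, j - i\}$. (Note that $(\mathbb{O}, d)$ is a “circular” structure, that is, $d(q-1, 0) = 1$.) The Spiky system has uniform prior, and channel matrix constructed as follows:

$$C_{s,o} = \begin{cases} 2/q & \text{if } o \bmod 2 = s, \\ 0 & \text{otherwise.} \end{cases}$$
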
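The circular structure of the observation space can be sketched in a few lines. The channel used below is an assumed form, chosen only for illustration: two secrets, each emitting uniformly over the observations whose parity matches the secret. The helper names `circular_distance` and `spiky_channel` are hypothetical.

```python
def circular_distance(i, j, q):
    """Distance on the ring {0, ..., q-1}: the shorter way around the circle."""
    return min((i - j) % q, (j - i) % q)

def spiky_channel(q):
    """Assumed Spiky channel: secret s in {0, 1} emits uniformly over the
    observations with parity s (illustrative form, not taken from the text)."""
    if q <= 0 or q % 2 != 0:
        raise ValueError("q must be an even positive integer")
    return [[2.0 / q if o % 2 == s else 0.0 for o in range(q)] for s in (0, 1)]

q = 10
C = spiky_channel(q)

# Uniform prior over the two secrets: every observation is produced by exactly
# one secret, so the Bayes risk of this assumed channel is 0.
risk = 1.0 - sum(max(0.5 * C[s][o] for s in (0, 1)) for o in range(q))

# Observation 4 can only come from secret 0, but both of its nearest
# neighbours on the circle can only come from secret 1.
neighbours_of_4 = [o for o in range(q) if circular_distance(o, 4, q) == 1]
print(risk, neighbours_of_4)
```

Under this assumed channel, nearby observations never share a secret, which is the situation in which nearest neighbor rules gain nothing from proximity.
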
Appendix C Detailed convergence results

Convergence for multimodal geometric systems, when varying and for fixed .

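| System | $\delta$ | Freq. | NN | $k_n$-NN ($\log_{10}$) | $k_n$-NN ($\log$) |
|---|---|---|---|---|---|
| $\nu = 1.0$ | 0.1 | 3 008 | 369 | 478 | 897 |
| | 0.05 | 5 938 | 495 | 754 | 1 267 |
| | 0.01 | 26 634 | 765 | 1 166 | 1 487 |
| | 0.005 | 52 081 | 765 | 1 166 | 1 487 |
| $\nu = 0.1$ | 0.1 | 24 453 | 398 | 554 | 821 |
| | 0.05 | 44 715 | 568 | 754 | 1 175 |
| | 0.01 | 149 244 | 4 842 | 1 166 | 1 487 |
| | 0.005 | 226 947 | 79 712 | 1 166 | 1 487 |
| $\nu = 0.01$ | 0.1 | 27 489 | 753 | 900 | 381 |
| | 0.05 | 103 374 | 101 664 | 92 181 | 31 452 |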

Detailed convergence results for a Spiky system of size and .

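| $\delta$ | Freq. | NN | $k_n$-NN ($\log_{10}$) | $k_n$-NN ($\log$) |
|---|---|---|---|---|
| 0.1 | 15 953 | 22 801 | 52 515 | 99 325 |
| 0.05 | 22 908 | 29 863 | 62 325 | 112 467 |
| 0.01 | 38 119 | 44 841 | 81 925 | 137 969 |
| 0.005 | 44 853 | 51 683 | 91 661 | 147 593 |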

Appendix D Uniform system

We measured convergence of the methods for a uniform system; this system is constructed so that all secret-object examples are equally likely, that is, $P(s, o) = P(s', o')$ for all $(s, o), (s', o') \in \mathbb{S} \times \mathbb{O}$. Since every secret is then equally likely given any observation, the Bayes risk in this case is $R^* = 1 - 1/|\mathbb{S}|$.

Fig. 11 shows that even in this case all rules are equivalent. Indeed, because the system leaks nothing about its secrets, all the estimators can only guess at random; but because for this system the Bayes risk is identical to the random guessing error ($1 - 1/|\mathbb{S}|$), all the estimators converge immediately to its true value.

Fig. 11: Convergence for a Uniform system of size .

Appendix E Approximation of the frequentist estimate

Fig. 12: Approximation of the frequentist estimate as grows for , , and ; the approximation is compared with the real frequentist estimate .

To better understand the behavior of the frequentist approach for observations that were not in the training data, we derive a crude approximation of this estimate in terms of the size of the training data $n$. The approximation makes the following assumptions:

  1. each observation is equally likely to appear in the training data (i.e., $P(o) = 1/|\mathbb{O}|$);

  2. if an observation appears in the training data, the frequentist approach outputs the secret minimizing the Bayes risk;

  3. the frequentist estimate knows the real priors $\pi$;

  4. if an observation does not appear in the training data, then the frequentist approach outputs the secret with the maximum prior probability.

The first two assumptions are very strong, so this is only an approximation of the real trend of the estimate. In practice, however, it approximates the real trend well (Fig. 12).

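Let $A_n^o$ denote the event “observation $o$ appears in a training set of $n$ examples”; because of assumption 1), $P(A_n^o) = 1 - (1 - 1/|\mathbb{O}|)^n$. The conditional Bayes risk estimated with a frequentist approach given $n$ examples is:

$$\begin{aligned}
R^{\mathrm{Freq}}_n(o) &= P(A_n^o)\, P(\text{error} \mid o, A_n^o) + (1 - P(A_n^o))\, P(\text{error} \mid o, \neg A_n^o) \\
&= P(A_n^o)\,\bigl(1 - \max_{s} P(s \mid o)\bigr) + (1 - P(A_n^o))\,\bigl(1 - P(\hat{s} \mid o)\bigr),
\end{aligned}$$

where $\hat{s} = \arg\max_s \pi(s)$. Assumptions 2) to 4) were used in the last step. From this expression, we derive the frequentist estimate at step $n$:

$$\begin{aligned}
R^{\mathrm{Freq}}_n &= \sum_{o} P(o)\, R^{\mathrm{Freq}}_n(o) \\
&= P(A_n^o) \sum_{o} P(o)\,\bigl(1 - \max_{s} P(s \mid o)\bigr) + (1 - P(A_n^o)) \sum_{o} P(o)\,\bigl(1 - P(\hat{s} \mid o)\bigr) \\
&= P(A_n^o)\, R^* + (1 - P(A_n^o))\, R^{\pi},
\end{aligned}$$

where $R^*$ is the Bayes risk and $R^{\pi} = 1 - \max_s \pi(s)$ is the error of guessing according to the priors. Note that in the second step we used $P(A_n^o)$ as a constant, which is allowed by assumption 1).

The expression of $R^{\mathrm{Freq}}_n$ indicates that the estimate is a weighted combination of priors-based guessing and the Bayes risk; when $P(A_n^o) \approx 1$, which happens for $n \gg |\mathbb{O}|$, the frequentist approach starts approximating the actual Bayes risk (Fig. 12).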
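
The trend described by this approximation can be sketched numerically. The sketch below is a reconstruction from the four listed assumptions, not necessarily the exact expression plotted in Fig. 12: a given observation is seen among `n` examples with probability `1 - (1 - 1/n_obs)**n`, seen observations contribute the Bayes risk, and unseen ones contribute the error of guessing the most likely secret a priori. The function names are hypothetical.

```python
def prob_seen(n, n_obs):
    """Probability that a given observation appears among n examples,
    assuming observations are uniformly distributed (assumption 1)."""
    return 1.0 - (1.0 - 1.0 / n_obs) ** n

def approx_frequentist_estimate(n, n_obs, bayes_risk, prior):
    """Weighted mix of the Bayes risk (observation seen, assumption 2) and
    the prior-guessing error 1 - max(prior) (observation unseen,
    assumptions 3 and 4)."""
    p = prob_seen(n, n_obs)
    return p * bayes_risk + (1.0 - p) * (1.0 - max(prior))

# Usage with made-up values: 1000 observations, true Bayes risk 0.2,
# two equally likely secrets.
for n in (0, 100, 1_000, 10_000, 100_000):
    print(n, round(approx_frequentist_estimate(n, 1000, 0.2, [0.5, 0.5]), 4))
```

The estimate starts at the prior-guessing error for an empty training set and moves towards the Bayes risk as the number of examples grows well beyond the number of distinct observations.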

Appendix F Gowalla details

We report in Tab. XIV the true Bayes risk, computed analytically, for the Gowalla dataset defended using the various mechanisms, together with their respective utility.

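| Mechanism | $\nu$ | Bayes risk | Utility |
|---|---|---|---|
| Blahut-Arimoto | 2 | 0.760 | 334.611 |
| | 4 | 0.571 | 160.839 |
| | 8 | 0.428 | 96.2724 |
| Geometric | 2 | 0.657 | 288.372 |
| | 4 | 0.456 | 144.233 |
| | 8 | 0.308 | 96.0195 |
| Laplacian | 2 | 0.657 | 288.66 |
| | 4 | 0.456 | 144.232 |
| | 8 | 0.308 | 96.212 |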
TABLE XIV: True Bayes risk and utility for Gowalla dataset defended using various location privacy mechanisms.

Appendix G Application to time side channel

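| Operands’ size | Unique secrets | Unique observations |
|---|---|---|
| 4 bits | $2^{4}$ | 34 |
| 6 bits | $2^{6}$ | 123 |
| 8 bits | $2^{8}$ | 233 |
| 10 bits | $2^{10}$ | 371 |
| 12 bits | $2^{12}$ | 541 |

TABLE XV: Number of unique secrets and observations for the time side channel on finite field exponentiation.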

We use F-BLEAU to measure the leakage in the running time of the square-and-multiply exponentiation algorithm in the finite field $\mathbb{F}_{2^w}$; exponentiation in $\mathbb{F}_{2^w}$ is relevant, for example, for the implementation of the ElGamal cryptosystem.

We consider a hardware-equivalent implementation of the algorithm computing $x^k$ in $\mathbb{F}_{2^w}$. We focus our analysis on the simplified scenario of a “one-observation” adversary, who makes exactly one measurement of the algorithm’s execution time $o$, and aims to predict the corresponding secret key $k$.

A similar analysis was done by Backes and Köpf [35], using a leakage estimation method based on the frequentist approach. Their analysis also extended to a “many-observations” adversary, that is, an adversary who can make $m$ observations $o_1, \dots, o_m$, all generated from the same secret $k$, and has to predict $k$ accordingly.

G-A Side channel description

Square-and-multiply is a fast algorithm for computing $x^k$ in the finite field $\mathbb{F}_{2^w}$, where $w$ here represents the bit size of the operands $x$ and $k$. It works by performing a series of multiplications according to the binary representation of the exponent $k$, and its running time is proportional to the number of 1’s in $k$. This fact was noticed by Kocher [36], who suggested side channel attacks on the RSA cryptosystem based on time measurements.
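
To make the dependence on the exponent concrete, here is a minimal sketch of left-to-right square-and-multiply that also counts the multiplications it performs. Two simplifications are assumed for illustration: integer arithmetic modulo a small prime stands in for the finite field arithmetic, and the multiplication count stands in for the measured running time. The function name is hypothetical.

```python
def square_and_multiply(x, k, modulus):
    """Left-to-right binary exponentiation.

    Returns (x**k % modulus, number of modular multiplications performed).
    """
    result = 1
    mults = 0
    for bit in bin(k)[2:]:  # binary digits of k, most significant first
        result = (result * result) % modulus  # square on every bit
        mults += 1
        if bit == "1":
            result = (result * x) % modulus  # extra multiply only on 1-bits
            mults += 1
    return result, mults

# Two 4-bit exponents with different numbers of 1's.
print(square_and_multiply(5, 0b1000, 97))
print(square_and_multiply(5, 0b1111, 97))
```

For exponents of the same bit length, the count differs only by the number of 1's in the exponent, which is the quantity an execution time measurement can reveal.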

G-B Message blinding

We assume the system implements message blinding, a technique which hides from an adversary the value $x$ for which $x^k$ is computed. Blinding was suggested as a method for thwarting time side channels [36], and works as follows. Consider, for instance, decryption for the RSA cryptosystem: $m = c^d \bmod N$, for some decryption key $d$; the system first computes $c \cdot r^e$, where $e$ is the encryption key and $r$ is some random value; then it computes $(c \cdot r^e)^d$, and returns the decrypted message after dividing the result by $r$.

Message blinding has the advantage of hiding information from an adversary; however, it was shown that it is not enough for preventing time side channels (e.g., [35]).
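
The blinding steps described above can be sketched with textbook RSA. Everything specific in this example is an assumption made for illustration: the toy key is far too small to be secure, no padding is used, and `blinded_decrypt` is a hypothetical helper rather than the implementation analyzed here. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
import math
import secrets

def blinded_decrypt(c, d, e, n):
    """Textbook RSA decryption of ciphertext c with message blinding."""
    # Pick a random blinding factor r that is invertible modulo n.
    while True:
        r = secrets.randbelow(n - 2) + 2
        if math.gcd(r, n) == 1:
            break
    blinded = (c * pow(r, e, n)) % n        # c * r^e
    m_times_r = pow(blinded, d, n)          # (c * r^e)^d = m * r (mod n)
    return (m_times_r * pow(r, -1, n)) % n  # divide by r to recover m

# Toy key for illustration only: p = 61, q = 53.
n, e, d = 3233, 17, 2753
m = 65
c = pow(m, e, n)
print(blinded_decrypt(c, d, e, n) == m)
```

The exponentiation with the secret key is applied to a randomized value, so its input is hidden from the adversary, while the exponent itself, and hence the timing behavior tied to it, is unchanged.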

Fig. 13: Convergence of the estimates for the time side channel attack on the exponentiation algorithm as the bit size of the operands increases.