Side channels represent a broad class of security vulnerabilities that have received significant attention from the cybersecurity community, especially after multiple side channel-based attacks have been demonstrated in recent years [Kocher, wright_spot_2008, yan_study_2015]. Unfortunately, completely eliminating side channels can incur significant overhead, so practical protection techniques often aim to reduce information leakage as much as possible while maintaining acceptable performance. In this work, we aim to enable principled protection of side channels with passive adversaries.
The foremost challenge in developing principled protection schemes for side channels is quantifying the amount of leakage that occurs. While various reasonable metrics have been considered in the literature [smith_foundations_2009, wagner_technical_2018], these arguably do not capture, or even necessarily upper bound, the utility of a given side channel to an attacker. Recently, maximal leakage [issa] was introduced as a metric that quantifies, in an operationally interpretable way, how useful a given side channel is to an attacker. Armed with such a metric, the system designer can rigorously trade off security and performance when designing side-channel mitigation schemes. Previous work [issa_maximal_2016, issa_operational_2017, liao_tunable_2018] has shown that maximal leakage is computable in semi-closed form, that it provides a direct measure of how likely a side-channel attack will succeed, and that the metric is robust to changes in its underlying assumptions. So far though, work on applying maximal leakage to practical side channels or on designing optimal protection schemes is limited. Our contributions in this paper are as follows:
We conduct an empirical study and find that information-theoretic metrics such as mutual information and channel capacity typically underestimate the utility gained by the adversary, whereas maximal leakage at least provides an upper bound. Moreover, we find that mutual information and channel capacity result in sub-optimal protection when used to analyze side channel leakage.
We also find that, despite its pessimistic formulation, maximal leakage does not consistently overestimate leakage. In many practical scenarios, it actually measures the adversary’s utility without overestimating the threat.
Under assumptions that encompass most timing and power side channels, we prove that cost-leakage optimality is achievable by close-to-deterministic protection, a result that is contrary to normal intuition.
Finally, for cases where the linear program (LP) that yields the optimal protection scheme is too large to solve (owing to large dimensionality), we propose a heuristic algorithm that approximates this LP and bound the error incurred by using this algorithm.
In the first part of this paper, we study how maximal leakage can be used to provide strong side-channel protection guarantees in practical systems. We first discuss our assumed threat model and present a mathematical interpretation of side channels in Section 2. We then reiterate, from previous work, maximal leakage’s theoretical advantages over traditional information theoretic metrics in Section 3. In Section 4, we provide an empirical study to demonstrate that these advantages are relevant to practical side channels. Our empirical studies lead us to further conclude that conventional information-theoretic leakage metrics such as mutual information and channel capacity result in suboptimal protection strategies. Surprisingly, we also find that mutual information and channel capacity, while appropriate measures of covert channel leaks [millen_covert_1987], actually underestimate the true threat posed by side channels.
In the second part of the paper, we consider how to design optimal protection schemes that minimize the performance overheads over a side channel given a target bound on maximal leakage. We show that finding the optimal protection scheme can be formulated as a linear program (LP) and present a structural result on optimal protection schemes that allows complete computation of the entire optimal trade-off curve by solving a relatively small number of LPs in Section 5. This result applies under a set of broad conditions that include (but are not limited to) most timing, power, and compression based side channels. In addition, our structural result implies that cost-leakage optimality over maximal leakage is achieved by deterministic protection schemes with low implementation overheads. Here we also demonstrate that optimizing over mutual information and channel capacity result in suboptimal protection. In Section 6, we present a heuristic that leverages knowledge of this structural result to rapidly approximate the full trade-off curve, at the cost of a small, bounded amount of sub-optimality. Together, these results enable fast computation of cost-leakage trade-offs with principled guarantees on leakage. Finally, we discuss related work in Section 7.
2 Threat Model and Mathematical Interpretation
In this section, we lay the groundwork to build up a rigorous model of side channels with passive adversaries. We first define our threat model and provide some examples. Then, under that threat model we formulate a mathematical interpretation of side channels and protection schemes.
2.1 Threat Model
Our threat model is given as follows.
There exists a secret that the adversary aims to guess.
An intermediate is generated using a fixed function of the secret’s value that may be either stochastic or deterministic. The distribution of the intermediate is known, but the conditional distribution of the intermediate given the secret is only necessarily known by the adversary. The intermediate value is not directly visible to the adversary.
There exists a protection scheme, which is either a stochastic or deterministic mapping of the intermediate to a side channel output. This mapping from intermediate to outputs is presumed to be memoryless (it does not rely on past values of the intermediate) and is known by the adversary. The adversary can see the output value.
The adversary must guess the secret, given the side channel output. They guess the secret value using the strategy that maximizes the probability of a correct guess after seeing the output.
The adversary does not have any control over the victim’s behavior that would affect the value of the secret, intermediate or the side channel output. We refer to such an adversary as a passive adversary.
The system designer only chooses the protection scheme. We presume that the joint distribution of the secret and output is fixed once the protection scheme is chosen.
We disregard system noise or measurement noise, but note that any independent randomness added to the side channel output cannot increase leakage any further.
To put these concepts into perspective, we give two relevant example side channels. First, consider RSA, an asymmetric key cryptosystem often used for key exchanges or identity validation. In systems that provide RSA decryption as a service, the decryption process may form a timing side channel if the decryption time varies with the value of the private key. Suppose for now that no protection scheme has been implemented. In this case, the victim is the implementation of RSA, the secret is the victim’s private key, and the side channel output observed by the adversary is the runtime of the decryption algorithm (and the intermediate is the same as the output).
Second, consider speech coding. This is a type of data compression typically used in real-time services such as Voice-over-IP. Given a speech waveform, the encoder converts short snippets of the waveform into individual packets that can be later decoded to recover the snippet. Due to fidelity and rate constraints, it is common for the instantaneous data rate to vary as a function of the speech waveform. Again, suppose no protection scheme is implemented. In this case, the victim is the compression service and the side channel output is the packet size. Here, the identity of the secret is not necessarily clear; it may be the uncompressed waveform, the transcript, the speaker ID, or even the language spoken. We will return to these two examples later on in Sections 4, 5, and 6.
2.2 Mathematical Interpretation of Side Channels and Protection Schemes
We represent the components of a side channel using random variables. First, the victim’s secret is denoted as the random variable $U$, the intermediate is denoted as the random variable $X$, and the side channel output is denoted as the random variable $Y$. $X$ is a stochastic or deterministic function of $U$, $P_{X|U}$, and $Y$ is a stochastic or deterministic function of $X$, $P_{Y|X}$. We further assume that $X$ and $Y$ are discrete random variables with finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively, since truly continuous-valued variables are rare in computer systems. By construction, $U$, $X$, and $Y$ form a Markov chain (denoted as $U - X - Y$), so $U$ and $Y$ are conditionally independent given $X$.
We assume that the relationship between $U$ and $X$ is immutable but that the system designer can control how $Y$ is generated from $X$. Note that introducing the intermediate $X$ into this side channel model does not lose us any generality. $X$ merely represents the boundary of our design problem: how $U$ maps to $X$ is assumed to be outside of our control while how $X$ maps to $Y$ is assumed to comprise our design problem. This model therefore includes side channels in which $X = U$ or $Y = X$ (although the latter simply corresponds to no protection). Therefore, the space of all possible choices of $P_{Y|X}$ constitutes the space of all protection schemes.
Finally, we restrict our attention to $P_{Y|X}$ that do not depend on past values of $X$ (as stated previously, memoryless protection). Such protection schemes ensure that $X$-values are only leaked by their corresponding $Y$-values, if at all. Furthermore, memoryless protection schemes can be represented simply as a transition matrix. This matrix is formatted such that each row corresponds to one element of $\mathcal{X}$, each column corresponds to one element of $\mathcal{Y}$, and each entry is the probability that an $X$-value is mapped to a particular $Y$-value. Figure 1 summarizes the variables and assumptions we have made thus far. Next, we define some notation that makes use of the above and will be necessary as we discuss maximal leakage as a leakage metric.
(Basic Notation) For random variables $X$ and $Y$, with alphabet sizes $|\mathcal{X}|$ and $|\mathcal{Y}|$, we define the following:
$c(x, y)$ is the nonnegative (but not necessarily finite) cost of mapping each $x \in \mathcal{X}$ to each $y \in \mathcal{Y}$. Infinite cost entries correspond to illegal mappings. We refer to this function as the cost function and the corresponding matrix as the cost matrix.
$P_X$ is the distribution of $X$. $P_Y$ is the distribution of $Y$. $P_{XY}$ is the joint distribution of $X$ and $Y$. $P_{Y|X}$ is the conditional distribution of $Y$ given $X$.
$C(P) = \sum_{x}\sum_{y} P_X(x)\,P(y|x)\,c(x, y)$ is the total cost for any matrix $P$. We will use $C$ as a shorthand for this quantity, when the parameter matrix is implied.
$L(P)$ is the generic leakage value associated with any matrix $P$ using some pre-specified leakage metric. When necessary, we will distinguish different metrics using subscripts (e.g. $L_{\mathrm{ML}}(P)$).
$P$ is an $|\mathcal{X}| \times |\mathcal{Y}|$ transition matrix such that $P(y|x) = P_{Y|X}(y|x)$ and $C(P)$ is finite. We refer to any matrix of this form as a protection scheme. It is subject to typical transition matrix constraints (rows sum to 1, non-negative entries).
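As a concrete reading of the notation above, the following Python sketch stores a protection scheme as a row-stochastic matrix and evaluates its total cost. The expected-cost form of the total cost, the function names `is_protection_scheme` and `total_cost`, and the three-value timing alphabet are assumptions made for illustration only.

```python
import math

INF = math.inf

def is_protection_scheme(P, tol=1e-9):
    """Check the transition-matrix constraints: nonnegative entries, rows sum to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )

def total_cost(p_x, P, cost):
    """Expected cost: sum over x, y of p_x[x] * P[x][y] * cost[x][y].

    Terms with zero probability are skipped so that an unused illegal
    mapping (infinite cost) never produces inf * 0 = nan.
    """
    total = 0.0
    for x, row in enumerate(P):
        for y, p in enumerate(row):
            if p > 0 and p_x[x] > 0:
                total += p_x[x] * p * cost[x][y]
    return total

# Assumed toy timing channel: an output can only be delayed, never sped up,
# so every entry below the diagonal is an illegal (infinite-cost) mapping.
times = [1, 2, 3]
cost = [[(y - x) if y >= x else INF for y in times] for x in times]
p_x = [0.5, 0.3, 0.2]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # no protection
pad_to_max = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # always pad to the slowest time
illegal = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]     # maps time 2 down to time 1
```

Under these assumptions the unprotected scheme has zero cost, padding everything to the slowest time has the largest finite cost, and any scheme that places mass on an illegal mapping has infinite cost and is therefore not a protection scheme.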
3 Maximal Leakage
While many leakage metrics have been proposed, we believe maximal leakage [issa] is well-suited for analyzing side channel leaks. In this section we first define multiplicative gain leakage, or mult-leakage, a natural metric of side channel leakage. Then, we will summarize maximal leakage’s relationship with mult-leakage, why it is necessary to use maximal leakage instead of mult-leakage, and the advantages of maximal leakage. Finally, we briefly discuss mutual information and channel capacity, how they relate to mult-leakage, and how they can be interpreted in the context of side channels.
3.1 Multiplicative Gain Leakage
Recall that, in any side channel, the adversary is interested in making an informed guess of $U$ after observing $Y$. However, even without the side channel the adversary can make a blind guess of $U$, which represents the case where $Y$ is independent of $U$. It stands to reason that a good leakage metric should reflect how the success rate of an informed guess compares to that of a blind guess. Thus, the ratio of the two is a natural measurement of leakage caused by the side channel. So, we define mult-leakage as:
$$L_{\mathrm{mult}}(U \to Y) = \log \frac{\max_{\hat{U}(\cdot)} \Pr\big(U = \hat{U}(Y)\big)}{\max_{\hat{U}} \Pr\big(U = \hat{U}\big)}$$
where the notation $\hat{U}$ is a blind guess of $U$, and $\hat{U}(Y)$ is an informed guess of $U$ after observing $Y$. Then, mult-leakage tells us the multiplicative gain on the adversary’s probability of correctly guessing $U$ given the side channel (the logarithm is for scaling purposes). Alternatively, mult-leakage can be interpreted as the bit-difference of information between the informed and blind guesses. Thus, if the side channel leaks no information, then the best informed guess is no better than the best blind guess and the leakage is 0. On the other hand, if the side channel does leak information, mult-leakage tells us how much the adversary’s guesses have been improved by the side channel and, ultimately, how useful the side channel is to the attacker.
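The ratio of informed to blind success rates can be evaluated directly when the joint distribution of the secret and the output is available. The sketch below assumes a base-2 logarithm (consistent with the bit interpretation above), an adversary that always picks the most likely secret, and a hypothetical function name `mult_leakage`.

```python
import math

def mult_leakage(joint):
    """Mult-leakage in bits for joint[u][y] = Pr(U = u, Y = y).

    Blind guess: pick the most likely secret, succeeding with
    probability max_u Pr(U = u).
    Informed guess: for each observed y pick the most likely secret,
    succeeding with probability sum_y max_u Pr(U = u, Y = y).
    """
    blind = max(sum(row) for row in joint)
    n_outputs = len(joint[0])
    informed = sum(max(row[y] for row in joint) for y in range(n_outputs))
    return math.log2(informed / blind)

# Output independent of the secret: observing it does not help.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Output reveals the secret exactly: the success rate doubles (1 bit).
revealing = [[0.5, 0.0],
             [0.0, 0.5]]
```

For the independent channel the informed and blind success rates coincide, so the leakage is zero; for the fully revealing channel the success rate rises from 1/2 to 1, which is one bit.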
3.2 Maximal Leakage
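Maximal leakage is defined as follows [issa]:
$$\mathcal{L}(X \to Y) = \sup_{U:\; U - X - Y} L_{\mathrm{mult}}(U \to Y)$$
where the supremum is taken over all secrets $U$ that form a Markov chain with the intermediate $X$ and the output $Y$, and $L_{\mathrm{mult}}$ denotes mult-leakage.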
By definition, maximal leakage is the worst-case mult-leakage over all possible , which means it upper bounds mult-leakage. The reason we elect to use maximal leakage in favor of mult-leakage is that the latter is not always possible to compute, depending on whether the system designer knows . Mult-leakage requires knowledge of , which is often difficult to characterize, especially for complex systems. Certainly, an approximation of mult-leakage is possible through data collection of , but we are interested in upper-bounding leakage.
At first blush, it seems that maximal leakage requires the same knowledge, but it turns out this is not the case. First, we note that maximal leakage is agnostic to $U$ by definition, so it is well-defined even in side channels where it isn’t clear which secret the adversary is after. Furthermore, it can be shown that maximal leakage can be computed as follows [issa]:
$$\mathcal{L}(X \to Y) = \log \sum_{y \in \mathcal{Y}} \; \max_{x:\, P_X(x) > 0} P_{Y|X}(y|x)$$
In other words, maximal leakage requires knowledge of $P_X$ only in terms of its support, the set of $X$-values with non-zero $P_X$. And since we have restricted our attention to memoryless protection, we only need to find the maximum value in each column (ignoring $X$-values with probability of 0) of P, sum these values, and take the log of the sum to compute it. Moreover, the maximum mult-leakage is achieved by a particularly useful $U$, referred to as a shattering $U$ by the authors. This type of $U$ is characterized by the following two properties.
(Shattering U) For any random variable $X$, $U$ is shattering if:
$U$ is uniformly distributed over a finite alphabet $\mathcal{U}$.
For each $u \in \mathcal{U}$, $X$ given $U = u$ is a deterministic value.
A conceptual example of a shattered distribution is shown in Figure 2. Here, each blue square corresponds to an equally-probable potential value of $U$ (34 distinct values total, in this example). An example distribution of $X$ is shown, where each possible $X$-value corresponds to some number of $U$-values.
The significance of this shattering distribution is twofold. First, for any shattering $U$, maximal leakage equals mult-leakage. In such cases, maximal leakage measures the utility gained by the adversary from the side channel under any protection scheme. Second, the shattering $U$ is quite representative of side channels in which $U$ is known to be an encryption key, if keys are selected uniformly at random from some (not necessarily known) space of allowed keys and the baseline side channel process is not stochastic. For such channels, maximal leakage, though pessimistic by design, exactly captures the side channel leakage.
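The column-maximum computation and the claim that a shattering secret attains it can both be checked numerically on a small channel. The sketch below assumes a base-2 logarithm; the function names and the toy matrix are hypothetical and chosen only for illustration.

```python
import math

def maximal_leakage(P, p_x):
    """log2 of the sum, over columns y, of the largest P[x][y] among supported x."""
    support = [x for x, p in enumerate(p_x) if p > 0]
    n_outputs = len(P[0])
    return math.log2(sum(max(P[x][y] for x in support) for y in range(n_outputs)))

def mult_leakage_shattering(P, x_of_u):
    """Mult-leakage when U is uniform and X = x_of_u[u] deterministically."""
    n_secrets = len(x_of_u)
    n_outputs = len(P[0])
    blind = 1.0 / n_secrets  # every secret is equally likely
    informed = sum(
        max(P[x_of_u[u]][y] / n_secrets for u in range(n_secrets))
        for y in range(n_outputs)
    )
    return math.log2(informed / blind)

# Toy protection scheme: rows are intermediate values, columns are outputs.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]

# Four equally likely secrets; three map to x = 0 and one maps to x = 1.
x_of_u = [0, 0, 0, 1]
p_x = [0.75, 0.25]
```

Here the column maxima sum to 1.5, and the shattering secret's informed success rate is 1.5 times its blind success rate, so both functions return the same value.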
3.3 Other metrics
Mutual Information: As one of the most basic information-theoretic metrics, mutual information has been used to evaluate side channel protection in some previous work [gong_quantifying_2016, zhou_camouflage:_2017].
In plain terms, mutual information measures the amount of information shared between random variables $X$ and $Y$. However, mutual information requires knowledge of $P_X$ and is also upper bounded by maximal leakage [issa]. Since we know that mult-leakage equals maximal leakage for the shattering $U$, we note that mutual information may underestimate leakage in such cases (whether it does in practice is a question we will explore in Section 4).
Channel Capacity: Another basic information-theoretic metric, channel capacity, is used to measure the rate of reliable communication over a noisy channel.
As such, it is a useful metric for leakage in covert channels where a sender deliberately encodes messages to a receiver. At first blush, it may seem that channel capacity should bound the rate of information leakage in a side channel since the adversary doesn’t have the luxury of a cooperative sender. Interestingly, channel capacity is also upper bounded by maximal leakage [issa]. This is because channel capacity assumes the sender and receiver are interested in complete, reliable decoding of messages passed over the channel [issa]. In a side channel scenario, such an assumption is unnecessary because the adversary does not need to completely recover the secret to pose a threat. Indeed, many side channels do not risk complete recovery even without protection. (For a more nuanced discussion of this topic, refer to Section III of [issa_operational_2017].)
Finally, to demonstrate that mutual information, channel capacity, and maximal leakage are numerically comparable (a fact that may not be immediately obvious from their definitions), we analytically compute all three metrics for a binary symmetric channel (BSC) with various switching probabilities $p$. This comparison can be seen in Figure 3. We can see that all three metrics agree on the worst and best case leakage values, when $p$ is 0 (all metrics agree leakage is 1) and when $p$ is 0.5 (all metrics agree leakage is 0). However, both mutual information and channel capacity can be less than maximal leakage by an arbitrarily large ratio. Thus, an open question at this point is whether, in practice, maximal leakage is too conservative or whether mutual information and channel capacity are truly underestimating the leakage. We address this question next.
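The binary symmetric channel comparison can be reproduced with closed-form expressions. The sketch below is a minimal illustration: the function names are hypothetical, logarithms are base 2, and the input prior used for mutual information (`prior`, defaulting to uniform) is an assumption, since the prior behind the figure is not stated here.

```python
import math

def h2(p):
    """Binary entropy in bits, with 0 * log(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_metrics(p, prior=0.5):
    """Return (mutual information, capacity, maximal leakage) in bits for a BSC.

    p is the switching probability; prior is Pr(X = 1), used only for
    mutual information.
    """
    q = prior * (1 - p) + (1 - prior) * p    # Pr(Y = 1)
    mutual_info = h2(q) - h2(p)              # H(Y) - H(Y | X)
    capacity = 1.0 - h2(p)                   # attained by the uniform prior
    max_leak = math.log2(2 * max(p, 1 - p))  # both column maxima equal max(p, 1 - p)
    return mutual_info, capacity, max_leak
```

Evaluating `bsc_metrics` at switching probabilities approaching 0.5 shows the ratio of maximal leakage to capacity growing without bound, because capacity vanishes quadratically in the distance from 0.5 while maximal leakage vanishes only linearly.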
4 Empirical Study of Leakage Metrics
In the previous section, we showed that mutual information/channel capacity could underestimate mult-leakage, at least in theory. However, to truly motivate the use of maximal leakage over these alternatives requires that such gaps between mutual information/channel capacity and mult-leakage do exist in practice, which we will demonstrate in this section.
The rest of the section will proceed as follows. First, we present an example side channel with which we will compare these metrics: an RSA decryption timing channel. Second, we demonstrate the existence of a significant gap between mutual information and mult-leakage under square-and-multiply implementations of RSA and under GNU’s multiple precision (GMP) library implementation. We argue that the existence of such gaps indicates that mutual information and channel capacity are problematic when cost-leakage optimality is desired or when a real bound on leakage is needed.
4.1 RSA Decryption Timing Channel
For the rest of this section, we consider a timing channel involving RSA as seen in Figure 4. In this side channel, Alice serves many clients who need to use Alice’s public key to encrypt messages to her over a network. For each such encrypted message, Alice uses her private key to decrypt the message and then responds. The adversary, Eve, is not one of Alice’s clients but is capable of observing the network traffic coming to and from Alice, perhaps by employing a packet sniffer (hence, Eve is a passive adversary). Eve sees when each decryption request arrives and when Alice responds from the sequence and timing of packets, which she uses to guess Alice’s private key. Here, $U$ is the private key, $X$ is the decryption time, and $Y$ is the length of time between when Alice receives each message and when she sends a response to the client. To implement a protection scheme, we must choose $P_{Y|X}$, which determines how long to delay Alice’s response on top of the true decryption time.
Finally, we assume that Alice’s private key was chosen uniformly at random from all binary strings of a fixed length (as opposed to only legal keys based on the RSA cryptosystem). This is a choice of convenience to facilitate our experiments, but doing so does not affect our conclusions, as we will explain later in this section.
4.2 Square-and-Multiply Implementation of RSA
We first consider the square-and-multiply implementation of modular exponentiation, the main sensitive operation of RSA decryptions. Pseudocode for the square-and-multiply implementation of RSA decryption is as follows:
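A minimal Python sketch of left-to-right square-and-multiply is given below. The function name, argument names, and the extra multiplication counter are assumptions added for illustration, and the listing does not carry the line numbering used in the discussion that follows; the comments mark the key-dependent branch and the conditional multiplication instead.

```python
def square_and_multiply(ciphertext, key_bits, modulus):
    """Compute ciphertext ** d mod modulus, where key_bits holds d's bits, MSB first.

    Also returns the number of conditional multiplications performed, which
    equals the weight of the key and is what the timing channel exposes.
    """
    result = 1
    multiplications = 0
    for bit in key_bits:
        result = (result * result) % modulus          # always square
        if bit == 1:                                  # key-dependent branch
            result = (result * ciphertext) % modulus  # multiply only on 1 bits
            multiplications += 1
    return result, multiplications
```

For example, with `key_bits = [1, 0, 1, 1]` (the exponent 11) the loop performs four squarings and three conditional multiplications, so the running time grows with the number of 1 bits in the key.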
The timing channel arises from the if statement in line 5. The modular multiplication of the result and ciphertext only occurs if the next bit of the private key is 1. From this fact, the adversary can deduce the weight (the number of 1s) of the private key. We make several simplifying assumptions and define the parameters of this experiment as follows:
Ignore confounding factors, such as system noise or network delay. $Y$ is simply equal to $X$ plus any delay we choose to add.
Assume all 1024-bit sequences are valid keys.
Assume the bits of the key are independently and identically distributed Bernoulli random variables. This results in a uniform-randomly selected key out of all 1024-bit binary strings.
Let $U$ be a random vector representing the value of the private key.
Let be a random variable representing the weight of the private key.
Assume that the multiplication in line 6 of the above pseudocode takes a fixed milliseconds to execute each time it is called.
Let be a binomial random variable with fixed probability and size parameter (which we vary).
Let . In other words, our protection scheme is independently added binomial noise.
Note that this cost matrix enforces that any protection with finite total cost must be upper triangular. Moreover, note that with this cost matrix and independent binomial delays, the total cost is .
Here, we can analytically compute mutual information, channel capacity, and maximal leakage for many different values of the size parameter of the binomial-distributed noise and plot them on the same axes, as seen in Figure 5. We note that there exists a large gap between mutual information and maximal leakage that sharply shrinks as we approach no protection. A sizeable gap exists between maximal leakage and channel capacity as well. Recall that, since $U$ is shattering, maximal leakage equals mult-leakage. Consequently, both mutual information and channel capacity underestimate mult-leakage in this example, and thus also underestimate the adversary’s utility. Note that, had we chosen a key uniformly at random from the set of all feasible keys (according to the RSA cryptosystem), $U$ would still have been shattering.
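A scaled-down version of this computation can be sketched in Python. The sketch measures time in units of one conditional multiplication, so the intermediate is the key weight and the output is the weight plus independent binomial padding. The function names, the short key length, and the noise probability passed in by the caller are assumptions for illustration; channel capacity is omitted because it requires an additional optimization over input distributions.

```python
import math

def binom_pmf(n, q):
    """Probability mass function of Binomial(n, q) as a list indexed by k."""
    return [math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def additive_noise_leakage(key_bits, noise_size, noise_prob):
    """Return (maximal leakage, mutual information) in bits for Y = X + Z.

    X ~ Binomial(key_bits, 1/2) is the weight of a uniformly random key and
    Z ~ Binomial(noise_size, noise_prob) is independent padding.
    """
    p_x = binom_pmf(key_bits, 0.5)
    p_z = binom_pmf(noise_size, noise_prob)
    n_y = key_bits + noise_size + 1

    # Column y of the transition matrix holds p_z[y - x] for each feasible x.
    max_leak = math.log2(sum(
        max(p_z[y - x] for x in range(key_bits + 1) if 0 <= y - x <= noise_size)
        for y in range(n_y)
    ))

    # I(X; Y) = H(Y) - H(Y | X) = H(Y) - H(Z) for independent additive noise.
    p_y = [0.0] * n_y
    for x, px in enumerate(p_x):
        for z, pz in enumerate(p_z):
            p_y[x + z] += px * pz
    mutual_info = entropy(p_y) - entropy(p_z)
    return max_leak, mutual_info
```

Sweeping `noise_size` for a fixed `key_bits` and plotting both returned values against the expected padding gives a small-scale analogue of the trade-off curves discussed above.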
4.3 GMP Implementation of RSA
Here, we consider an implementation of modular exponentiation where the weight of the key isn’t directly leaked but instead the decryption time varies with the key in some other way. We show that maximal leakage’s gaps with mutual information and channel capacity are still significant even in this case. We use GNU’s multiple-precision (GMP) library’s implementation of modular exponentiation. Here, we perform essentially the same experiment as before. The assumptions and parameters of the experiment are as follows (only ones that are different from the square-and-multiply implementation will be listed):
Let $U$ be a random vector representing the value of the private key. Note that we are using a 16-bit key here so that it is possible to exhaustively collect decryption timing data for all private keys and ciphertexts, to compute the distribution of decryption times.
Let $X$ be the random variable representing the execution time (in cycles) of GMP’s modular exponentiation on an Intel i7 core. The decryption time of each private key varies with the ciphertext, so we selected a fixed ciphertext uniformly at random for the purposes of this experiment. The distribution of $X$ we used can be seen in Figure 5(a).
Choose the alphabet of , , as follows. Choose a noise width . Extend by elements, each spaced by the most common difference between consecutive elements in (in case of a tie, choose the smallest common difference). So for example, suppose and . Then since the most common interval between consecutive elements in is 2.
Let be a binomial random variable with fixed probability and size parameter (which we vary).
Let be defined as follows. Given , generate a -value . Note that is an integer; choose the th larger element than in .
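The alphabet-extension step in the list above can be sketched as follows. The function name `extend_alphabet` and the sample values in the usage note are hypothetical; the tie-breaking rule follows the description (smallest of the most common differences).

```python
from collections import Counter

def extend_alphabet(values, width):
    """Append `width` elements above the largest value, spaced by the most
    common difference between consecutive sorted elements.

    Assumes at least two distinct values so that a difference exists.
    """
    xs = sorted(set(values))
    gaps = Counter(b - a for a, b in zip(xs, xs[1:]))
    top_count = max(gaps.values())
    # On a tie between equally common differences, take the smallest one.
    step = min(gap for gap, count in gaps.items() if count == top_count)
    return xs + [xs[-1] + step * k for k in range(1, width + 1)]
```

As an illustration with made-up values, `extend_alphabet([1, 3, 5, 8], 2)` finds the differences 2, 2, and 3, takes 2 as the most common, and returns `[1, 3, 5, 8, 10, 12]`.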
As before, we can directly compute mutual information, channel capacity, and maximal leakage for many different values of the noise size parameter. Plotting mutual information, channel capacity, and maximal leakage against total cost (Figure 5(b)), we find that a gap exists between maximal leakage and mutual information/channel capacity. These results reaffirm our earlier observation that mutual information and channel capacity underestimate the advantage given to the adversary in practice.
Here, we remark that both in the case of the GMP implementation and in the earlier square-and-multiply implementation, we obtained trade-off curves for mutual information, channel capacity, and maximal leakage with very similar shapes. So, it may be tempting to suggest that there is little difference in usage between the three metrics, since their trade-off curves are so similar in shape. However, there are two factors to keep in mind. First, in the above experiments we have only used random padding that is independent of $X$. Second, we compared how different metrics behave for the same protection scheme. Essentially, we have not shown that mutual information, channel capacity, and maximal leakage agree on the relative security ordering of an arbitrary pair of protection schemes (possibly not independent of $X$). Indeed, in the next section, we will prove that mutual information/channel capacity and maximal leakage disagree qualitatively on what kinds of protection are optimal. We will even show an example of two protection schemes for which maximal leakage disagrees with mutual information/channel capacity on the relative ordering of leakage.
5 Optimal ML-Cost Trade-offs
Using maximal leakage as our leakage metric, we will show two key facts about the minimization of total cost subject to an upper bound on maximal leakage, or the optimization over maximal leakage. First, we demonstrate that the optimization over maximal leakage can be written as a linear program. While this fact is relatively simple to verify, its significance lies in that it greatly simplifies the process of solving the optimization itself (which, in general, is not a trivial feat). Second, as the main theorem of this section, we prove that under certain constraints on the cost matrix that are quite common among side channels, optimality under maximal leakage can be achieved with easy-to-implement deterministic protection schemes. The remainder of the section is dedicated to stating these results rigorously, explaining their implications, and finally comparing maximal leakage optimal schemes with mutual information and channel capacity ones.
5.1 Formulating the Optimization
The following definitions are needed to formulate the optimization. For random variables and , with alphabet sizes and , we define the following:
We retain the variables and functions defined in Definition 1, but at this point we retire the previous notation for distinguishing between various leakage metrics in favor of the next item.
is the exponentiated maximal leakage (or exp-leak, for short) of any matrix . We will use as a shorthand for this quantity, when the parameter matrix is implied. Note that minimizing over exp-leak is equivalent to minimizing over maximal leakage.
a given protection scheme P is deterministic if all of its entries equal 0 or 1. It is stochastic otherwise.
an pair is achieved by P if and .
an pair is achievable if there exists such a P that is achieved by it.
the set is the set of all achievable pairs.
. We refer to evaluated for all values of as the tradeoff curve and the set of points as the boundary of .
P is optimizing in if (i.e. if P achieves a point on the boundary of ).
the set is the set of all points in that can be achieved by a deterministic protection scheme.
an pair is achievable in if there exists a deterministic protection scheme P that achieves .
P is optimizing in if . Note that a P that is optimizing in is not necessarily a deterministic protection scheme.
(Set Indexing) We will choose to let and .
Here, we consider the optimization problem over maximal leakage:
This can be rewritten as an LP as follows:
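One standard way to write the two problems is sketched below. The notation is assumed rather than taken from the text: $c(x,y)$ is the cost matrix, $P_X$ the distribution of the intermediate, $\ell$ the bound on exp-leak, and $t_y$ an auxiliary variable that upper-bounds the maximum of column $y$. Entries with infinite cost are taken to be fixed at zero.

```latex
% Cost minimization subject to an exp-leak bound (assumed notation)
\begin{aligned}
\min_{P}\quad & \sum_{x}\sum_{y} P_X(x)\,P(y|x)\,c(x,y) \\
\text{s.t.}\quad & \sum_{y}\;\max_{x:\,P_X(x)>0} P(y|x) \;\le\; \ell, \\
& \sum_{y} P(y|x) = 1 \quad \forall x, \qquad P(y|x) \ge 0 \quad \forall x, y.
\end{aligned}

% Equivalent linear program with one auxiliary variable per column
\begin{aligned}
\min_{P,\,t}\quad & \sum_{x}\sum_{y} P_X(x)\,P(y|x)\,c(x,y) \\
\text{s.t.}\quad & P(y|x) \le t_y \quad \forall y,\ \forall x \text{ with } P_X(x) > 0, \\
& \sum_{y} t_y \le \ell, \\
& \sum_{y} P(y|x) = 1 \quad \forall x, \qquad P(y|x) \ge 0 \quad \forall x, y.
\end{aligned}
```

The two forms agree because any feasible $t_y$ is at least the column maximum, so the constraint on the sum of the $t_y$ holds for some $t$ exactly when the sum of column maxima is at most $\ell$; the objective and all remaining constraints are linear in $P$.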
5.2 Structural Result and Proof
(Convexity) The optimal trade-off curve is a convex function of the exp-leak bound. The proof follows from standard arguments. The convexity of the optimal trade-off curve is significant in that it allows for a useful qualitative assessment of the optimization problem. That is, the marginal cost of protection grows as leakage shrinks: adding a little protection on top of an unprotected side channel is comparatively cheap, whereas relaxing a zero-leakage scheme even slightly buys a comparatively large cost reduction.
Note that the set of points achievable by deterministic protection schemes is not convex, since a deterministic protection scheme necessarily has an integer exp-leak value. This set is a subset of the set of all achievable pairs, given by all pairs dominated by the finite points on its boundary at integer exp-leak values. Its boundary is shaped like a descending staircase, and the set consists of all points above and to the right of this stair-like boundary.
We refer to a cost function/matrix that satisfies the following constraints as staircase nondecreasing:
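For all $y$ and all $x < x'$, if $c(x, y) = \infty$, then $c(x', y) = \infty$ (i.e. if one matrix element is infinite, then that column is infinite all the way down).
For all $x$ and all $y < y'$, if $c(x, y)$ and $c(x, y')$ are both finite, then $c(x, y) \le c(x, y')$ (i.e. excluding infinities, each row of the matrix is nondecreasing from left to right).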
Note that staircase nondecreasing cost matrices are exemplified by upper triangular cost matrices (where all entries below the diagonal are infinite cost) with ordered cost entries for each row. This special case of staircase nondecreasing cost matrices is typical of most power and timing side channels, since one cannot map power consumption or latency to a value less than itself.
If is staircase nondecreasing, then
For all , can be achieved by for some and some deterministic protection schemes and , such that and .
The proof proceeds by taking an optimizing stochastic protection scheme and showing that it can be made more deterministic without losing optimality. The full proof is given in Appendix A.
5.3 Implications of Theorem 1
The main implication of Theorem 1 is that deterministic protection schemes are sufficient to achieve optimality over maximal leakage. The first part of the theorem essentially states that the supporting hyperplanes of the trade-off space are achieved by deterministic schemes. The second part follows from the first and states that the constrained optimization is solved by at least the next best thing: a mixture of at most two deterministic schemes.
First, recall the implementation benefits of deterministic protection schemes. Regardless of the specific cost function (as long as it is staircase nondecreasing) or application, any deterministic protection scheme can be compressed to a single-column matrix with one entry per $X$-value (or smaller) recording which $Y$-value each $X$-value maps to. In addition, deterministic schemes are resistant to averaging attacks, where the adversary attempts to learn additional information by gathering statistics of $Y$, since the same $X$-value always maps to the same $Y$-value. In the event that a mixture of two deterministic schemes is needed, one may implement a pre-determined schedule alternating between the two deterministic schemes for each mapping. Here, while the leakage of individual observations of $Y$ will change over time, we can enforce the desired long-run bound.
Second, the proof of the main theorem induces an algorithm by which one may take any optimizing protection scheme and convert it to a deterministic form, so that the discussed benefits can be leveraged. This algorithm simply performs the procedures specified in Definitions 11 and 13 in Appendix A recursively.
Third, if it is necessary to solve the entire optimal trade-off curve (for example, if on-the-fly tuning of leakage is expected), Theorem 1 states that it is only necessary to solve for integer exp-leak points and then connect the dots so that the overall curve is convex. Also, for small deviations in the leakage bound, tuning can be done simply by changing the mixture proportion.
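The role of integer exp-leak points and mixtures can be illustrated with a small sketch. It assumes that exp-leak is the sum of the column maxima of the protection matrix (consistent with how column maxima are used in Appendix A) and that every input value has nonzero probability. The function names and the two toy schemes are illustrative and do not come from the paper.

```python
import math

def exp_leak(P):
    """Sum of column maxima of a row-stochastic matrix P (rows: inputs, columns: outputs)."""
    n_cols = len(P[0])
    return sum(max(row[j] for row in P) for j in range(n_cols))

def maximal_leakage_bits(P):
    """Maximal leakage in bits, assuming exp-leak is the sum of column maxima."""
    return math.log2(exp_leak(P))

def mix(P1, P2, weight_on_first):
    """Entry-wise convex combination of two protection schemes."""
    w = weight_on_first
    return [[w * a + (1 - w) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(P1, P2)]

# Hypothetical deterministic schemes on three inputs and three outputs.
one_output = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]   # exp-leak 1
two_outputs = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # exp-leak 2
half_half = mix(one_output, two_outputs, 0.5)                      # exp-leak 1.5
```

In this toy case the mixture lands on the straight line between the two integer exp-leak points, which is the behaviour the "connect the dots" description relies on. In general the sum of column maxima of a mixture is at most the corresponding interpolation.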
5.4 Comparing Optimal Protection Under Different Metrics
Finally, we will compare maximal leakage optimal schemes (ML-optimal schemes) to mutual information and channel capacity optimal schemes (MI-optimal and CC-optimal schemes). First, we will simply show sample ML-optimal schemes for the RSA decryption side channel (the GMP implementation) and for a packet size side channel based on VoIP applications. Then, using these sample schemes, we will discuss the qualitative nature of MI-optimal and CC-optimal schemes.
ML-optimal protection for RSA decryption: Using the GMP-based decryption timing data from Section 4, we use Gurobi (a convex optimization solver) to solve the inverted optimization (i.e. minimize leakage subject to a cost bound) from Equation for a total cost bound of 5% delay overhead. Doing so achieves maximal leakage of 1.6862 bits. Note that in this type of side channel, one relevant class of deterministic protection schemes is thresholding where, for an ascending sequence of thresholds, all values less than or equal to the smallest threshold are mapped to the first threshold, all values greater than the first and less than or equal to the second threshold are mapped to the second, and so on. As it turns out, the resulting protection scheme from this experiment, shown in Figure 6(a), can be described as the combination of the thresholding schemes with thresholds and with thresholds . This particular scheme can be implemented by a schedule that uses 21.8% of the time and 78.2% of the time.
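As a minimal sketch of thresholding and of a pre-determined schedule that mixes two schemes, consider the following. The threshold values used in the tests, the function names, and the particular way the schedule is spread over time are all hypothetical. The paper does not specify how its schedule is constructed, only its long-run proportions.

```python
from bisect import bisect_left

def threshold_map(value, thresholds):
    """Map value to the smallest threshold that is >= value.

    thresholds must be sorted in ascending order.
    """
    i = bisect_left(thresholds, value)
    if i == len(thresholds):
        raise ValueError("value exceeds the largest threshold")
    return thresholds[i]

def use_first_scheme(n, numerator, denominator):
    """Deterministic schedule: True for exactly `numerator` out of every
    `denominator` consecutive observations, spread as evenly as possible."""
    return ((n + 1) * numerator) // denominator - (n * numerator) // denominator == 1

def protect(value, n, thresholds_a, thresholds_b, numerator, denominator):
    """Apply scheme A or scheme B to the n-th observation according to the schedule."""
    chosen = thresholds_a if use_first_scheme(n, numerator, denominator) else thresholds_b
    return threshold_map(value, chosen)
```

For a 21.8% / 78.2% split one could call `use_first_scheme(n, 218, 1000)`. Integer arithmetic is used so that the long-run proportion is exact rather than subject to floating-point rounding.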
ML-optimal protection for VoIP: Here we present the packet size channel for speech coding in the context of Voice-over-IP (VoIP) applications and then perform an analogous experiment to assess the ML-optimal scheme. Previous work demonstrated that there exists a side channel leak through the sizes of packets sent over networks [wright_spot_2008]. It has even been shown that such side channels allow packet sniffers to partially recover or reconstruct spoken phrases [doychev_yes_2009, wright_uncovering_2010].
The victim system is a typical VoIP application, which operates by encoding fixed-length time intervals (called a “frame”) of sound waveforms into one packet per time interval. In particular, VoIP system designers favor a form of variable bitrate (VBR) compression, which reduces bandwidth usage and improves recovered speech quality.
We assume that the adversary is interested in reconstructing the transcript, or the text of what was spoken. The adversary observes the final payload size of each packet, . is the un-padded packet size produced by the speech codec (a coder-decoder used to compress and decompress human speech). Our protection scheme maps to by padding each packet independently.
For our experiments, we used Mozilla’s CommonVoice English dataset (https://voice.mozilla.org/en). For the speech codec, we used Opus, an efficient open-source codec endorsed by the IETF, set to 24 kbps VBR with a frame size of 20 ms. Under these settings, Opus encodes each frame to one of 151 different packet sizes. We encoded approximately 572 hours worth of human speech to obtain the distribution.
Then, we set the cost matrix to be the number of bytes of padding for each packet ( if and otherwise), and again computed the inverted optimization using Gurobi. Solving the inverted optimization for a 20% padding overhead gives the solution seen in Figure 6(b), and it can be decomposed into two deterministic protection schemes in a similar fashion to the RSA optimization.
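A padding cost matrix of this kind can be sketched as follows. The sketch assumes the cost of mapping an un-padded size to a padded size is their difference in bytes when the padded size is at least as large, and infinite otherwise (packets cannot shrink). The packet sizes in the tests are made up for illustration and are not the Opus sizes from the experiment.

```python
import math

def padding_cost_matrix(sizes):
    """cost[i][j] = sizes[j] - sizes[i] if sizes[j] >= sizes[i], else infinity.

    sizes is assumed to be sorted ascending, which makes the matrix
    upper triangular with nondecreasing finite entries in each row.
    """
    return [[(y - x) if y >= x else math.inf for y in sizes] for x in sizes]

def expected_cost(p_x, P, cost):
    """Expected padding in bytes under input distribution p_x and scheme P.

    Zero-probability entries are skipped so that 0 * inf never produces NaN.
    """
    total = 0.0
    for px, p_row, c_row in zip(p_x, P, cost):
        for pyx, c in zip(p_row, c_row):
            if pyx > 0:
                total += px * pyx * c
    return total
```

Dividing the expected cost by the mean un-padded size gives a relative overhead, which is one plausible way to express a bound such as 20% padding overhead.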
MI and CC optimal protections: Now, we will show that deterministic schemes typically will not be either MI-optimal or CC-optimal, except in some edge cases such as the zero-leakage case. This observation will further lead us to conclude that, especially for side channels with shattering , MI and CC optimal schemes will commonly result in suboptimal (in terms of the optimization over mult-leakage) protection.
First, recall the definition of mutual information as given in Equation 4. Since is fixed for the optimization over mutual information, the only active variable in mutual information is . Since mutual information is nonlinear and convex over , we can expect that it will be optimized by protection schemes in the interior of the feasible set. In other words, we expect that MI-optimal schemes will tend to be stochastic. Indeed, a simple experiment confirms this. Using the alphabets and , the marginal distribution of , , and cost function
the ML-optimal and MI-optimal solutions for 0.5 units of cost are given by:
Finally, since channel capacity is itself defined as a maximization over mutual information, one should expect the same kind of behavior when optimizing over channel capacity.
6 A Heuristic Algorithm
In this section, we will address the dimensionality of the LP. For alphabet and , the constrained optimization in Equation 7 is over an variable matrix. So, the alphabet sizes of and are intimately linked to the dimensionality of the LP and can greatly affect computational complexity. It may be possible to reduce the problem size by grouping symbols in or in together, thereby reducing and . However, doing so incurs additional cost by some hard-to-measure quantity and is not always practical. For such cases, we present a heuristic algorithm that can be used to approximate the full trade-off curve.
6.1 Greedy Algorithm
For any nonempty set and cost matrix , we define a deterministic protection scheme such that:
We refer to as the deterministic protection scheme induced by the subset .
For any non-empty set , let:
For a given staircase nondecreasing cost matrix , we identify one (not necessarily unique) such that:
Define the subset .
For any set , we define the set function:
Here, we define a greedy algorithm to construct a sequence of deterministic protection schemes as follows:
Start with .
Choose such that is maximized over all such choices of . If is empty or if there does not exist such that , terminate this algorithm.
Go to step 2.
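The greedy construction can be sketched roughly as follows. Because the formal definitions are not reproduced here, the sketch makes several assumptions that should be checked against them: the scheme induced by a subset sends each input to its cheapest output in that subset, the starting subset is the single output with the lowest expected cost, and each step adds the output that lowers the expected cost the most. All names are illustrative.

```python
def induced_cost(subset, p_x, cost):
    """Expected cost when every input is mapped to its cheapest output in subset.

    Assumes every entry of p_x is strictly positive.
    """
    return sum(px * min(cost[i][j] for j in subset) for i, px in enumerate(p_x))

def greedy_sets(p_x, cost):
    """Return a list of (subset, expected_cost) pairs, one per greedy step."""
    n_out = len(cost[0])
    first = min(range(n_out), key=lambda j: induced_cost({j}, p_x, cost))
    chosen = {first}
    steps = [(set(chosen), induced_cost(chosen, p_x, cost))]
    while True:
        candidates = [j for j in range(n_out) if j not in chosen]
        if not candidates:
            break  # every output is already in use
        best = min(candidates, key=lambda j: induced_cost(chosen | {j}, p_x, cost))
        best_cost = induced_cost(chosen | {best}, p_x, cost)
        if not best_cost < steps[-1][1]:
            break  # no remaining output strictly lowers the cost
        chosen.add(best)
        steps.append((set(chosen), best_cost))
    return steps
```

Under these assumptions the k-th entry of the result uses k outputs, so it corresponds to a deterministic scheme with exp-leak at most k.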
6.2 Bounded Sub-optimality of the Greedy Algorithm
Using standard results in combinatorial optimization [Nemhauser1978], we can obtain bounds on how suboptimal the solutions obtained from the greedy algorithm are. We will first prove some basic facts about the set function given in Definition 8.
For such that ,
Then, for and ,
since is equal to or (or both), and is no smaller than either or . Hence,
so is submodular ([schrijver-book], Thm 44.1).
For integer exp-leak bound , let be the set obtained by running the greedy algorithm until (for simplicity, assume the greedy algorithm does not terminate prior to this point).
For integer exp-leak bound , let be the true optimal set such that is maximized subject to .
Now, since is submodular, we can bound the greedy algorithm for all as follows ([Nemhauser1978], Theorem 4.1):
The greedy algorithm is capable of approximating a full cost-leakage trade-off curve more quickly, compared to running as many as individual LP optimizations. The difference in computation time increases with the size of ; the greedy algorithm runs on the order of 30 times faster than the LP on the integer exp-leakage points for our larger experiments, but only on the order of 5 times faster for our smaller experiments.
Moreover, we have shown that the cost of deterministic protection schemes computed by the greedy algorithm is bounded relative to the true optimal protection schemes at the same leakage levels. Finally, a useful side-effect of this bound is that the true optimal scheme does not perform any better than the greedy algorithm after a single iteration (when ), which follows from Equation 13. As there exist many applications that require close to no leakage and since a single step of the greedy algorithm (computing for ) merely consists of a search over the space of , these protection schemes with exp-leak between 1 and 2 can be easily computed since we know from Theorem 1 that the optimal protection scheme is simply a convex combination of the two deterministic protection schemes.
6.3 Sub-optimality of the Greedy Algorithm in Practice
Here, we reuse the RSA decryption timing data and VoIP packet size data to demonstrate that the gap between the true optimal curve and the greedy algorithm is in fact very small. Indeed, we find that for these case studies, the error of the greedy rate is far below the projected error given by Equation 13. Thanks to Theorem 1, we obtain the true optimal curve by using Gurobi to optimize over maximal leakage at integer exp-leak points. We use the greedy algorithm to obtain an approximately optimal curve. These results can be seen in Figures 7(a) and 7(b).
7 Related Work
There exist many previous studies that demonstrate side channel attacks in various systems and applications. There also exists work on legitimizing various metrics other than maximal leakage for quantifying leakage in side channels [braun_quantitative_2009, alvim_measuring_2012, smith_foundations_2009, alvim_additive_2014]. Previous work has also proposed many protection schemes against side channel attacks, often based on heuristics and without quantitative security guarantees. Here, we discuss previous work on quantitative metrics for side channels, and studies on optimal trade-offs between security and protection overhead. This paper is the first to experimentally demonstrate the limitations of traditional information theoretic metrics, the practical advantages of maximal leakage, and how the optimal protection trade-off can be efficiently obtained.
Previous work has attempted to quantify RSA timing channels using conditional entropy (or equivalently, mutual information) with the justification that it quantifies the amount of uncertainty of the adversary’s guesses [kopf_information-theoretic_2007]. In the same vein, useful results on optimal trade-offs using conditional entropy have been developed for strictly deterministic side channels [kopf_provably_2009]. However, in our work, we have made the case that such metrics underestimate the threat of the side channel. Moreover, past results based on conditional entropy apply only to deterministic side channels, a special case that is subsumed under our models given in Section 2.
Work on developing trade-offs using maximum entropy has been applied to stochastic side channels [askarov_predictive_2010, zhang_predictive_2011]. Unfortunately, due to the simplicity of maximum entropy (using our notation, maximum entropy is simply ), protection schemes from this work are restricted to deterministic discretization of the output. While discretization leads to easily implementable protections, these protections are typically suboptimal with respect to mult-leakage. Trade-offs have also been developed for stochastic side channels using mutual information [mao_quantitative_2017], but likely due to the nature of mutual information, useful theoretical results are difficult to prove. This previous work proposes discretization and randomization as two possible protection mechanisms. Again, discretization and randomization based on independent noise are suboptimal. Finally, some previous work has used mutual information, channel capacity, and maximal leakage all together to provide bounds and trade-offs of each [wang_secure_2017], but did not try to optimize for them.
Finally, there exists some work in the domain of privacy-preserving publication that analyzes optimal trade-offs using maximal leakage [liao_hypothesis_2017, liao_privacy_2018]. This work is similar in spirit to ours, but the underlying problems are fundamentally different. These results are not for side channels, but for privacy-preservation in publishing database entries. The types of total cost considered are probabilities of Type-II errors and hard distortion functions (which aim to provide strong performance guarantees). In databases, where secrets are typically a small number of bits at most, these results are extremely relevant, but they are hard to justify in most side channels. Moreover, the results attained in this previous work pertain to the use of a tunable form of maximal leakage [liao_tunable_2018] that is largely irrelevant to side channels.
Appendix A Proofs
A.1 Proof of Theorem 1.1
First, we establish a structural result for all optimal protection schemes that will be a useful assumption for later steps of our proof of Theorem 1.1.
Consider any protection scheme P. Define a vector such that . In plain terms, consists of the column maxima of P. Using alone, we construct a new protection scheme P’ as follows:
Start with a zero matrix . We will assume that the members of and are in some enumerated order, as previously stipulated.
For each row , iterate over each .
If , let
Else, set .
In plain terms, we are constructing P’ by maintaining the column maxima of P and “filling” in probability mass in each row from left to right.
We define this procedure to generate P’ from P as the method to convert P into “water-filled” form. Also, if P and P’ are identical, we say that P is a “water-filled” protection scheme.
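The water-filling conversion can be sketched as below, following the plain-terms description: keep the column maxima of P as per-column caps and fill each row from left to right until it sums to 1. The `feasible` mask, which marks entries with finite cost so that infinite-cost entries stay at zero, is an assumption added for this illustration and is not taken from the formal definition.

```python
def water_fill(P, feasible=None):
    """Return a water-filled version of the row-stochastic matrix P.

    feasible[i][j] is True where row i may place mass in column j
    (hypothetical mask; defaults to all entries allowed).
    """
    n_cols = len(P[0])
    col_max = [max(row[j] for row in P) for j in range(n_cols)]
    if feasible is None:
        feasible = [[True] * n_cols for _ in P]
    filled = []
    for i in range(len(P)):
        remaining = 1.0
        new_row = []
        for j in range(n_cols):
            # Fill up to the column cap, limited by the mass still left in this row.
            mass = min(col_max[j], remaining) if feasible[i][j] else 0.0
            new_row.append(mass)
            remaining -= mass
        filled.append(new_row)
    return filled
```

Each row of P already fits under the caps on its feasible columns, so every row of the result still sums to 1, no column maximum grows, and mass only moves toward columns further left.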
If the cost function satisfies definition 4, then all optimizing P can be converted into water-filled form P’ such that and .
Suppose we are given optimizing P and its water-filled form P’. By its construction, since we did not increase the total sum of column maxima. In addition, since we independently fill up each row’s entries in P’ from least cost to greatest cost, for any cost function that is staircase nondecreasing.
From these two statements, we also obtain the reverse inequalities:
implies that since P is optimizing and P’ cannot perform any better (have lower ) than P.
Similarly, implies that , since P is optimizing.
Therefore, and . ∎
(Proof Approach for Theorem 1.1) For any optimizing P, we start by assuming it is already in water-filled form, since we have already shown that doing so does not unnecessarily restrict our space of optimizing solutions. Then, we would like to show that there exists a special choice of Q such that :
is linear over some well-defined interval of values around 0.
does not change with for any fixed
results in protection scheme being strictly ”more deterministic” (to be defined shortly) than P for a particular choice of .
(Types of Matrix Entries)
For the sake of discourse, we will define the following classifications of matrix entries in any protection scheme:
An entry is fractional if it is not equal to 0 or 1, and integral otherwise. Similarly, a column is fractional if its maximum entry is fractional and integral otherwise.
An entry is maxed out if it is equal to the maximum value in its column, and hanging otherwise.
It is true by construction that a water-filled protection scheme will have at most one hanging mass entry and at least one maxed out entry in each row. Moreover, if a row has a hanging mass entry, there do not exist other non-zero entries further to the right of that entry.
(Measure of Randomness)
In order to compare which protection scheme, between two options, is ”more deterministic”, we rely on the following metric for randomness of a protection scheme:
(# fractional columns in P) + (# hanging entries in P)
Note that if and only if P is a deterministic protection scheme.
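The randomness measure can be computed with a short sketch like the one below. It assumes that only non-zero entries are counted as hanging, since otherwise the zero entries of a deterministic scheme would be counted and the measure could not be 0 for deterministic schemes. The tolerance parameter is an implementation detail added here for floating-point input.

```python
def randomness(P, tol=1e-12):
    """(# fractional columns in P) + (# hanging entries in P)."""
    n_cols = len(P[0])
    col_max = [max(row[j] for row in P) for j in range(n_cols)]
    # A column is fractional when its maximum entry is strictly between 0 and 1.
    fractional_columns = sum(1 for m in col_max if tol < m < 1 - tol)
    # A non-zero entry is hanging when it lies strictly below its column maximum.
    hanging_entries = sum(
        1
        for row in P
        for j, value in enumerate(row)
        if value > tol and value < col_max[j] - tol
    )
    return fractional_columns + hanging_entries
```

For example, the matrix `[[0.25, 0.75, 0.0], [0.0, 0.75, 0.25], [0.0, 0.0, 1.0]]` has two fractional columns and one hanging entry (the 0.25 in the last column), giving a measure of 3.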
We will now propose a particular choice of Q and , and prove the desired properties about these choices after.
Given the water-filled protection scheme P with at least one fractional entry, we now define a procedure to generate a Q matrix. Note that any such protection scheme must also have at least one fractional column or else it would contradict the water-filled property.
Start with an zero matrix Q that we will populate with values.
Denote the leftmost fractional column index in P as . Further denote the current ”sign” to ”+”.
In the th column of Q, if the sign is ”+”, assign the value to all entries in that column that are maxed out in P. If the sign is ”-”, assign the value instead.
If the current sign is ”+”, change it to ”-”, and vice versa.
Consider the set of rows that are maxed out in the th column of P. Do all of these rows either have hanging mass in P or already have 2 non-zero entries in Q? Depending on the answer:
If yes, go to step 9.
If no, then proceed to step 6.
Again consider the set of rows that are maxed out in the th column of P. Choose the topmost row from this set that does not have hanging mass in P and has only 1 non-zero entry in Q. Denote the row index of that entry as .
Set to be the column index of the rightmost, maxed out entry of the th row in the P matrix. Note that must correspond to a fractional column here.
Go to step 3.
If any rows in Q have hanging mass and an odd number of non-zero entries, assign either or to the hanging mass entries of these rows, so that each of these rows sums to 0.
(Q-Generation Properties) The procedure specified by definition 13 satisfies the following:
The procedure terminates.
All of the rows in the generated Q matrix sum to 0 (so that is a protection scheme).
is a water-filled protection scheme
Since we never choose columns that aren’t fractional, any row selected in step 6 must have a maxed out entry (because we also ignore rows with fractional entries) somewhere to the right of the current column. Certainly, this procedure must terminate if the value ever reaches the right-most column (and the process may terminate earlier than that due to step 5). ∎
Since we only assign and to entries of Q in alternation, this is the same as saying that each row must contain an even number of non-zero entries. We see that this is true by noting that there are three types of rows, differentiated by how their non-zero entries in Q (if any) are assigned during the Q-generating procedure.
If a row has hanging mass in P, then step 9 will necessarily adjust that row to have an even number of non-zero entries by construction. In addition, we never assign mass to hanging mass entries until step 9, when the procedure terminates, which means that all hanging mass entries are free for us to use at that point. So, rows that have hanging mass in P will be valid rows in Q.
If a row has no hanging mass in P, then there are two cases, depending on whether that row was ever used in step 6 to determine the next value (we’ll refer to such a row as a ”critical” one). Note that, due to steps 5 and 6 filtering out rows that already have 2 non-zero entries, no row will ever be used in step 6 twice (i.e. a row will be a critical row at most once).
If the row is critical, it must be the topmost one that had only one non-zero entry in Q at that point of the procedure in the previous th column. Step 7 guarantees that the only other non-zero entry in this row will correspond to its rightmost non-zero entry in P. So this row will have exactly 2 non-zero entries in Q, making it valid.
If the row is not critical, it must either be located below one that is or not have any non-zero entries in Q at all. The latter case results in a trivially valid row. In the former case, the row must have at least two non-zero entries in columns shared with the previous critical row, or else it would violate our assumptions that P is water-filled and the cost function is staircase nondecreasing. In addition, since P is water-filled, each row is majorized by all rows above it (i.e. the cumulative left-to-right sum of the upper row is no less than that of the lower row for every column). This implies that a non-critical row cannot have more than 2 non-zero entries either.
We observe that due to step 3, we only ever change all of the maxed out entries in a column together. So, for small , will remain water-filled. ∎
Recall from remark A.1 that we require a particular choice of with various properties, as already described. We now define two choices of and justify properties about them in later lemmas.
Let is stochastic and P and are maxed out for the same entries and fractional for the same entries]
and is stochastic and P and are maxed out for the same entries and fractional for the same entries]
Note that, by definition and .
If P is water-filled for fixed and Q is generated according to definition 13, then evaluated with is linear with respect to .
For and fixed ,
Since is linear over and continuous over , it is linear over . ∎
(No Improvement Lemma)
If P minimizes over for fixed and is water-filled and Q is generated according to definition 13, then at .
If , then performs strictly better for some close to zero, which contradicts the optimality of P. ∎
(More Deterministic Lemma)
As increases from 0 to , some fractional entries of change, and none of the integral entries change. In addition, if one maxed out entry changes, all of the maxed out entries in that column change together. It thus follows that the set of fractional columns can only decrease with and that the set of hanging entries likewise can only decrease. So is nonincreasing in for . From the definition of in Definition 14, .
Similarly, we can show that . ∎
Any P that minimizes for some can be chosen to be optimizing and water-filled as per Lemma 3. If P is not a deterministic protection scheme, we can select Q as in Definition 13 with the properties shown in Lemma 4.
If is not deterministic, then we can repeat the above process since it is still water-filled and minimizes for the same .
Eventually, after repeating this process some finite number of times, will be 0 (since the function we defined is always nonnegative), and therefore deterministic. ∎
A.2 Proof of Theorem 1.2
Using standard convex analysis (e.g. [Rockafellar], chapter 12), Theorem 1.1 implies that and have the same lower semi-continuous hull (or the closure, as defined by [Rockafellar] chapter 7), which is equivalent to our definition of the boundary of . We can see this fact as follows:
First, we note that the left and right hand sides of the equality in Theorem 1.1 are the conjugate functions of and , respectively. We have shown that the conjugates are equal for any .
Second, since is a convex function of , the conjugate of the conjugate of is equal to the closure of ([Rockafellar], Corollary 13.1.1).
Third, while is not a convex function, its conjugate is the same as the conjugate of the closure of its convex hull. Therefore, the conjugate of its conjugate must be equal to the closure of its convex hull.
Thus, we have shown that the convex hulls of and are the same, since the two sets are the epigraphs of (all points in on or above the curves defined by) the functions and , respectively.
From this fact, it trivially follows that any pair on the boundary of must also lie on the convex hull of , and therefore on the convex hull of .
Finally, as previously noted, is a descending staircase-like function for . So, the convex hull of is given by the largest convex linear interpolation of the outer corner points of (for example, see figure).
Therefore, any pair on the boundary of is achievable by a convex combination of no more than two deterministic protection schemes. ∎