Reducing Metadata Leakage from Encrypted Files and Communication with PURBs

06/08/2018 ∙ by Kirill Nikitin, et al.

Most encrypted data formats, such as PGP, leak substantial metadata in their plaintext headers, such as format version, encryption schemes used, the number of recipients who can decrypt the data, and even the identities of those recipients. This leakage can pose security and privacy risks, e.g., by revealing the full membership of a group of collaborators from a single encrypted E-mail between two of them, or enabling an eavesdropper to fingerprint the precise encryption software version and configuration the sender used and to facilitate targeted attacks against specific endpoint software weaknesses. We propose to improve security and privacy hygiene by designing future encrypted data formats such that no one without a relevant decryption key learns anything at all from a ciphertext apart from its length - and learns as little as possible even from that. To achieve this goal we present Padded Uniform Random Blobs or PURBs, an encrypted format functionally similar to PGP but strongly minimizing a ciphertext's leakage via metadata or length. A PURB is indistinguishable from a uniform random bit-string to an observer without a decryption key. Legitimate recipients can efficiently decrypt the PURB even when it is encrypted for any number of recipients' public keys and/or passwords, and when those public keys are of different cryptographic schemes. PURBs use a novel padding scheme to reduce potential information leakage via the ciphertext's length L to the asymptotic minimum of O(log_2(log_2(L))) bits, comparable to padding to a power of two, but with much lower padding overhead of at most 12% which decreases further with large payloads.


1 Introduction

Traditional data-encryption schemes and protocols aim only at protecting their payload, leaving related metadata exposed. Formats such as PGP [64] often reveal, in cleartext message headers, the public-key fingerprints of the intended recipients, the algorithm used for encryption, and the actual length of the payload. Protocols for secure communication often leak information during the key-and-algorithm agreement phase: for example, in the handshake phase of TLS [24], the protocol version, the chosen cipher suite, and the public keys of the parties are exchanged in cleartext. This metadata is exposed on the assumption that it is not security-sensitive and that revealing it improves efficiency.

However, researchers consistently show that such metadata can be exploited by an attacker to retrieve information about communication content or patterns. In particular, the attacker might be able to fingerprint users [40, 51] and the applications used [63]. Using traffic analysis [21], an attacker might be able to infer user-visited websites [21, 39, 25, 56, 57] or to identify the videos that a user watches [43, 49, 44]. On VoIP, it can be used to infer the geo-location [35], the spoken language [61], or the voice activity of users [19]. The side-channel leaks of data compression [32] make several attacks on SSL possible [47, 26, 7]. The lack of proper padding might enable an active attacker to learn the length of the user’s password from TLS [53] or QUIC [1] traffic. In social networks, metadata can be used to draw conclusions on users’ actions [28], whereas telephone metadata has been shown to be sufficient for user re-identification and for determining home locations [36]. Furthermore, by observing the format of the packets, oppressive regimes can infer which technology is used and use this information for the purposes of incrimination or censorship, e.g., most TCP packets of Tor traffic are 586 bytes due to the standard Tor cell size [29].

To tackle these security and privacy threats, we developed Padded Uniform Random Blobs (PURBs), a novel approach to designing encrypted-data formats. A PURB incorporates application content and metadata into a single encrypted blob that is indistinguishable from a random string, and it is padded to one of a restricted set of sizes by a novel padding scheme to further reduce information leakage. In particular, unlike other approaches (PGP, TLS, etc.), PURBs do not leak the encryption schemes used, who or how many recipients can decrypt them, or what application or software version created them. In this paper, we make the following contributions:

First, we present a content-and-metadata encryption scheme that supports any number of recipients, who can use either shared passwords or public-private key pairs, and that supports the simultaneous use of multiple cryptographic suites. The main challenge that we address is providing efficient decryption to recipients without any cleartext markers. If efficiency were of no importance, the problem would be trivial: the sender could discard all the metadata, and the recipient would parse the encrypted data by trying every possible structure and/or cipher suite. However, real-world adoption heavily depends on how efficient a given scheme is and what cryptographic agility it provides. A PURB is a combination of a variable-sized, specially structured encrypted header containing metadata and a symmetrically encrypted payload. The header's structure enables efficient decoding by legitimate recipients via a minimal number of trial-decryption steps. It also facilitates the seamless addition and removal of supported cipher suites, yet it leaks no information to any third party without a decryption key. We construct our scheme starting from the standard Integrated Encryption Scheme [2] and use the ideas of multi-recipient public-key encryption [34, 9] as part of our multi-recipient design.

Second, to reduce information leakage from data lengths, we developed Padmé, a padding scheme that obfuscates the true length of data objects by grouping them into sets whose sizes increase logarithmically. The scheme reduces information leakage to O(log log L) bits, where L denotes the data length, and retains the minimal practical space overhead for this asymptotic bound. Padmé enlarges files by at most 12%, and by less as file sizes increase.

Our evaluation demonstrates that creating a PURB ciphertext is fast on consumer-grade hardware, even for a hundred recipients using ten different cipher suites, and takes on the order of milliseconds in the most common single-recipient, single-suite scenario; moreover, our implementation is in pure Go, without the assembly optimizations that could significantly speed up public-key operations. The decoding performance is comparable to that of a typical PGP implementation and is almost independent of the number of recipients (up to 10,000), because our design limits the number of costly operations. We then analyze real-world data sets and show that, without padding, many objects are trivially identifiable by their unique sizes, and that this remains an issue even after padding to a fixed block size (e.g., with a block cipher in CBC mode, or Tor cells). We show that Padmé significantly reduces the number of objects uniquely identifiable by their sizes: from 83% to 3% for 56k Ubuntu packages, from 87% to 3% for 191k YouTube videos, from 45% to 8% for 848k hard-drive user files, and from 68% to 6% for 2.8k websites from the Alexa top 1M list, with a mean overhead of 3%.

In summary, our main contributions are as follows:

  • We introduce a novel encoding format for encrypted data. It reveals no metadata information to observers who do not hold valid (symmetric or asymmetric) decryption keys and supports multi-recipient and multi-ciphersuite use cases.

  • We introduce Padmé, a padding scheme that addresses the challenge of avoiding information leakage from data lengths and keeps size overheads low.

  • We implement the PURBs encoding and padding schemes, evaluate the performance of the former against PGP and the efficiency of the latter on real-world data sets.

2 Motivation and Background

We begin this section by giving examples of where PURBs can be useful, and describe the Integrated Encryption Scheme that we later use as a starting point in our design.

2.1 Motivation and Applications

PURBs are a paradigm for designing encrypted-data formats that efficiently protect sensitive metadata. Our goal is to define a general approach applicable to most common data-encryption scenarios, with techniques that are agnostic to the application or communication type, to the cryptographic algorithms used, and to the number of participants involved. We also seek to enhance plausible deniability, such that a user can deny that a PURB was created by a given application or that she owns the key to decrypt it. We envision several immediate applications that would benefit from using PURBs.

E-mail Protection. E-mail systems traditionally use PGP or S/MIME for encryption. However, their packet formats expose a significant amount of metadata: an OpenPGP-encrypted packet states in clear its format version, the encryption method, the number and public-key identities of the recipients, and the public-key algorithm used [18]. In addition, the payload is padded only to the block size of the symmetric-key algorithm used, which, as we show in §5.2, does not provide any “size privacy”. E-mail encryption is thus a suitable candidate for the PURB format, as its current metadata leakage would be minimized. Furthermore, e-mail traffic is normally sparse, hence the PURB overhead can easily be accommodated.

Initiation of Cryptographic Protocols. In most cryptographic protocols, initial cipher-suite negotiation, handshaking, and key exchange are performed unencrypted. In TLS, an eavesdropper who monitors a connection “from the start” can learn the details of the cryptographic schemes used and the Server Name Indication (SNI), which lets her determine which specific web site, out of all the web sites hosted behind the same server, a client is connecting to [24]. Some countermeasures are already taken in TLS 1.3 [46]; however, they are only partial and do not fully address the issue. PURBs would enable a fully encrypted handshake from the very start. PURBs assume that a client knows at least one public key and one supported cipher suite of the server. That is already the case in the 0-RTT handshake of TLS 1.3 [45], when the client resumes a previous connection; alternatively, the public key and the cipher suites can be obtained from services enabling DNS-based authentication, such as DANE [30].

Encrypted Disk Volumes. VeraCrypt (https://www.veracrypt.fr/en/Documentation.html) is disk-encryption software that uses a block cipher to turn a disk partition into an encrypted volume, where the partition's free space is filled with random bits. It supports a so-called hidden-volume feature: an encrypted volume can be placed as a partial payload of another, primary volume, and the hidden volume cannot be distinguished from free space filled with random bits. VeraCrypt already hides metadata well, thanks to the use of a special block-cipher operation mode. But the hiding is limited, because only a single hidden volume can be created inside a primary one. This creates the risk that a coercer would assume by default that a hidden volume is present and treat a claim of non-possession of the decryption keys as a refusal to provide them. We envision PURBs as an alternative approach to disk encryption. They could provide the same level of metadata protection while enhancing plausible deniability, as a PURB can contain as many hidden volumes as needed. This enables an N+1 defense: a coercee can reveal up to N “dummy” volumes, and the coercer cannot confirm whether more exist.

2.2 Integrated Encryption Scheme

The Integrated Encryption Scheme (IES) [2] is a hybrid encryption scheme that enables the encryption of arbitrary message strings (unlike ElGamal, which requires the message to be a group element) and offers flexibility in the underlying primitives. To send an encrypted message, a sender first generates an ephemeral Diffie-Hellman key pair and uses the public key of the recipient to derive a shared secret. The choice of the Diffie-Hellman group is flexible, e.g., an integer group or an elliptic curve. The sender then relies on a cryptographic hash function to derive the shared keys used to encrypt the message with a symmetric-key cipher and to compute a MAC, following the encrypt-then-MAC approach. The resulting ciphertext is shown in Figure 1.

Figure 1: Ciphertext output of an Integrated Encryption Scheme: the sender's ephemeral public key, followed by the symmetric ciphertext and the MAC, both generated using the DH-derived keys.

The IES is proven secure against adaptive chosen-ciphertext attacks (ind$-cca) when the underlying symmetric encryption is ind$-cpa, and the MAC and hash algorithms are universally unforgeable [2]. We use the idea of IES in our design, but we substitute the generic composition of encryption and MAC algorithms with dedicated authenticated encryption schemes that provide the same security guarantees [16].
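
To make this composition concrete, the following is a minimal Go sketch of an IES-style hybrid encryption. X25519 and AES-GCM are illustrative choices (IES itself fixes neither), and an AEAD stands in for the generic encrypt-then-MAC pair, as our design also does; the function and variable names are ours, not part of any PURB API.

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/ecdh"
	"crypto/rand"
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// iesEncrypt returns ephemeralPublicKey || AEAD(msg), roughly the layout of
// Figure 1, with the MAC provided by the AEAD tag.
func iesEncrypt(recipientPub *ecdh.PublicKey, msg []byte) ([]byte, error) {
	curve := ecdh.X25519()

	// Fresh ephemeral key pair, generated for this message only.
	ephPriv, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}

	// Diffie-Hellman shared secret with the recipient's long-term key.
	shared, err := ephPriv.ECDH(recipientPub)
	if err != nil {
		return nil, err
	}

	// Hash-based derivation of the symmetric key from the shared secret.
	key := make([]byte, 16)
	if _, err := io.ReadFull(hkdf.New(sha256.New, shared, nil, nil), key); err != nil {
		return nil, err
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize()) // all-zero nonce: the key is single-use
	ct := aead.Seal(nil, nonce, msg, nil)

	return append(ephPriv.PublicKey().Bytes(), ct...), nil
}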

3 Hiding Metadata

In this section, we introduce Padded Uniform Random Blobs (PURBs): an encoding scheme for encrypted data and the accompanying metadata. We begin by defining the notation, system and threat models, followed by a sequence of strawman approaches that tackle different challenges on the path towards the final system. More specifically, we start with a scheme where ciphertexts are encrypted with a shared secret and are addressed to a single recipient. We then improve it to support public-key operations with a single cipher suite, and finally present the multi-recipient, multi-cipher-suite encoding scheme. For each step, we also show that security is preserved in our threat model.

3.1 Preliminaries

We summarize here the cryptographic preliminaries; the full version is in Appendix A.1.

Let λ be a standard security parameter. Let (KeyGen, Enc, Dec) be an ind$-cpa secure, deterministic, nonce-based authenticated-encryption (AE) scheme [10, 48], where KeyGen, Enc and Dec are the key-generation, encryption and decryption algorithms, respectively, operating on a message m, a ciphertext c, and a nonce τ.

Let G be a cyclic finite group of prime order, generated by the element g, in which the DDH problem is hard to solve (e.g., an elliptic curve or an integer group). Let KeyGen_G be an algorithm that generates a private-public key pair (x, X = g^x).

Let Hide be a mapping that encodes group elements of G into binary strings that are indistinguishable from random bit strings of the same length (e.g., Elligator [13], Elligator Squared [50, 4]). Let Unhide be the counterpart to Hide that decodes binary strings into group elements of G. We require that Unhide(Hide(X)) = X for all X in G.

Let KDF be a key-derivation function [33] that converts a group element into a bit string that can be used as a symmetric key. Let PBKDF be a secure password-based key-derivation function [41, 15], a variant of a KDF that converts a salt and a password into a bit string that can be used as a key for symmetric encryption.

3.1.1 System Model

Let data be an application-level unit of data (e.g., a file or network message). A sender wants to send an encrypted version of data to one or more recipients. We consider two main approaches for secure data exchanges:

(1) Via pre-shared secrets, where the sender shares a long-term one-to-one passphrase with each recipient; the participants use a password-hashing scheme to derive the ephemeral secrets Z_i from these passphrases.

(2) Via public-key cryptography, where the sender and the recipients derive ephemeral secrets Z_i = KDF(Y_i^x), with x and X = g^x denoting the fresh private and public keys of the sender, and y_i and Y_i = g^(y_i) being the private and public keys of recipient i.

In both scenarios, the sender uses the ephemeral secrets Z_i to encrypt (parts of) the PURB header using an AE scheme.

We refer to the tuple of primitives (group, point encoding, AE scheme, key-derivation functions) used in the PURB generation as a cipher suite; this is similar to the notion of a cipher suite in TLS [24]. Replacing any component of a suite, such as the public-key algorithm used, results in a different cipher suite.
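
Purely as an illustration (this is not the actual PURB API), a cipher suite can be thought of as a bundle of constants and function pointers along these lines:

// Suite sketches what a PURB cipher suite bundles together: the group and its
// key generation, the Hide/Unhide point encoding, the AE scheme, and the
// key-derivation functions. All field names here are ours.
type Suite struct {
	Name         string // e.g., "PURB_X25519_AES128GCM_SHA256" (the example suite of §3.5)
	KeyLen       int    // length of an encoded (Hidden) public key, in bytes
	GenerateKeys func() (priv, pub []byte)                  // key pair in the group
	Hide         func(pub []byte) []byte                    // uniform encoding of a group element
	Unhide       func(encoded []byte) []byte                // inverse of Hide
	SharedSecret func(priv, peerPub []byte) []byte          // DH followed by the KDF
	PBKDF        func(salt, pass []byte) []byte             // for passphrase-based entry points
	Seal         func(key, nonce, pt []byte) []byte         // AE encryption
	Open         func(key, nonce, ct []byte) ([]byte, bool) // AE decryption
}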

3.1.2 Threat Model

We consider two types of adversaries. The first is an adversary who holds neither a private key nor a password valid for deriving an ephemeral secret. The second is a legitimate but malicious recipient who might be one of several recipients. Naturally, a legitimate recipient has more capabilities, because she (a) can recover the plaintext payload and (b) can distinguish a PURB addressed to her from random bits by checking whether she can decrypt it. In both cases, the adversary is computationally bounded.

3.1.3 Security Goals

  1. The content and all metadata must be protected by the PURB encoding, and the encoding output must be ind$-cpa secure against an adversary without a valid decryption key (i.e., indistinguishable from random bits under an adaptive chosen-plaintext-and-IV attack).

  2. A legitimate recipient of a PURB must not be able to learn the identities of other recipients, although she might learn the total number of them.

Limitations. (1) The scheme is secure against quantum computers if and only if all the encryption, key agreement and hashing algorithms used are secure against quantum computers. (2) Being non-interactive, the scheme does not provide forward secrecy by default.

3.1.4 System Goals

  1. PURBs must provide cryptographic agility: (1) they should accommodate a single recipient as well as many recipients, support encryption for a recipient using either a shared password or a public key, and support different cipher suites; and (2) adding and removing cipher suites must be seamless and must not affect other cipher suites.

  2. PURBs’ encoding and decoding must be “reasonably” efficient, in particular, the number of expensive public-key operations should be minimized; and there must not be an excessive space overhead.

3.2 Single Passphrase

We begin with a simple case where a sender encrypts data using a single long-term passphrase shared with a single recipient (e.g., out-of-band via a secure channel). The sender and the recipient use a standardized cipher suite that defines the scheme's components. A typical use case is encrypting a file on a hard drive.

First, the sender generates a fresh symmetric session key K and a nonce, and she computes the PURB payload by encrypting the data under K with the AE scheme. Then, the sender generates a random salt and derives the ephemeral secret Z from the passphrase and the salt using the PBKDF. The sender creates an entry point (EP) containing the session key K, the start position of the payload, and other potential metadata. Then, the sender encrypts the EP using Z and the same nonce. Finally, the sender creates the PURB by concatenating the four segments, as shown in Figure 2.

Figure 2: A PURB addressed to a single recipient and encrypted with a passphrase-derived ephemeral secret Z; its four segments are the nonce, the salt, the encrypted entry point, and the encrypted payload.

Security Argument. All four segments of the PURB are individually indistinguishable from random bits: the nonce and salt by the way they are generated, and the AE ciphertexts of the entry point and the payload because the AE scheme is required to be ind$-cpa secure. Reuse of the same nonce for encryption under different keys has been proven secure [9, Section 9]. Furthermore, discovering a correlation between any two segments would imply breaking the security properties of the underlying primitives. For the detailed arguments, see §A.2.
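
The following Go sketch shows this single-passphrase layout end to end: nonce || salt || AE_Z(entry point) || AE_K(data). AES-GCM and scrypt stand in for the suite's AE scheme and PBKDF, and the entry point is reduced to the bare session key; none of the names below belong to the actual PURB implementation.

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"

	"golang.org/x/crypto/scrypt"
)

func encodeSinglePassphrase(pass, data []byte) ([]byte, error) {
	nonce := make([]byte, 12)
	salt := make([]byte, 16)
	sessionKey := make([]byte, 16)
	for _, b := range [][]byte{nonce, salt, sessionKey} {
		if _, err := rand.Read(b); err != nil {
			return nil, err
		}
	}

	// Ephemeral secret Z derived from the shared passphrase and the salt.
	z, err := scrypt.Key(pass, salt, 1<<15, 8, 1, 16)
	if err != nil {
		return nil, err
	}

	seal := func(key, plaintext []byte) ([]byte, error) {
		block, err := aes.NewCipher(key)
		if err != nil {
			return nil, err
		}
		aead, err := cipher.NewGCM(block)
		if err != nil {
			return nil, err
		}
		// The same nonce is reused under two different keys, which §3.2 notes
		// has been proven secure.
		return aead.Seal(nil, nonce, plaintext, nil), nil
	}

	// Entry point: here only the session key K; a real PURB also stores the
	// payload start position and other metadata.
	ep, err := seal(z, sessionKey)
	if err != nil {
		return nil, err
	}
	payload, err := seal(sessionKey, data)
	if err != nil {
		return nil, err
	}

	purb := append(append(append(nonce, salt...), ep...), payload...)
	return purb, nil
}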

3.3 Single Public Key, Single Suite

In many scenarios, people and services do not use pre-shared secrets to establish secure communication channels or to encrypt data at rest; they rely on public-key cryptography instead to derive ephemeral secrets. Typically, either the sender indicates in cleartext metadata which public key the file has been encrypted for (e.g., in PGP), or the parties exchange public-key certificates in clear during the communication setup (e.g., in TLS). Both approaches typically leak the identity of the receiver. To cover this use case, we improve the previous strawman to enable the decryption of an entry point using a private key.

To expand our scheme to the public-key scenario, we adopt the idea of a hybrid asymmetric-symmetric scheme from IES (discussed in §2.2). Let (y, Y) denote the recipient's key pair. The sender now generates an ephemeral key pair (x, X), computes the ephemeral secret Z = KDF(Y^x), and then proceeds as before, except that she encrypts the entry point and metadata with Z instead of a passphrase-derived secret. The sender replaces the salt in the PURB with her encoded ephemeral public key Hide(X), where Hide maps a group element to a bit string indistinguishable from random. The resulting PURB is shown in Figure 3.

Figure 3: A PURB addressed to a single recipient who uses a public key Y; its segments are the nonce, the sender's encoded ephemeral public key Hide(X), the entry point encrypted under the ephemeral secret Z = KDF(Y^x), and the encrypted payload.

Security Argument. The sequence of encoded public key, entry point and payload is almost a canonical example of an IES ciphertext, proven to be ind$-cca secure [2]. The only difference from the IES specification is that we use an AE scheme instead of the generic encrypt-then-MAC composition, but both approaches, if secure, provide the same security guarantees [16]. As the output of Hide is also indistinguishable from random bits, the whole PURB is ind$-cpa secure. See §A.3 for the detailed discussion.

3.4 Multiple Public Keys, Single Suite

In certain cases, a message needs to be encrypted under several public keys, e.g., in multicast communication or in mobile group-chats. We improve the previous strawman by adding support for multiple public keys that can belong to a single recipient or multiple recipients. As before, we assume that all the keys are of the same suite.

As a first step, we adopt the idea of multi-recipient public-key encryption [34, 9], where the sender generates a single ephemeral key pair and uses it to derive an ephemeral secret with each of the intended recipients. The sender creates one entry point per recipient; these entry points contain the same session key and metadata but are encrypted with different ephemeral secrets.

Layout Challenges. As PURBs’ goal is to prevent metadata leakage, including the number of recipients, a PURB cannot reveal how many entry points exist in the header. Yet a legitimate recipient needs to have a way to enumerate possible candidates for her entry point. Hence, the primary challenge is to find a space-efficient layout of entry points – with no cleartext markers – such that the recipients are able to find their segments efficiently.

Linear Approach. The most space-efficient approach is to place them sequentially. In fact, OpenPGP suggests a similar approach for achieving better privacy [18, Section 5.1]. However, in this case, decryption is inefficient: the recipients have to sequentially attempt to decrypt each potential entry point, before finding their own or reaching the end of the PURB.

Fixed Hash-Table Approach. A more computationally efficient approach is to use a hash table of a fixed size. The sender creates a hash table and places each encrypted entry point there, identifying the corresponding position by hashing an ephemeral secret. Once all the entry points are placed, the remaining slots are filled with random bit strings, hence a third-party is unable to deduce the number of recipients (yet the upper bound, corresponding to the size of the hash table, would be public information). This approach, however, causes a significant space overhead: in the most common scenario of a single recipient, all the unpopulated slots are filled with random bits and still transmitted. Additionally, there is now a limit on the number of recipients.

Growing Hash-Tables Approach. We propose to include not one but a sequence of hash tables whose sizes are consecutive powers of two. Thus, immediately after placing the encoded public key, the sender lays out a hash table of length one, followed (if needed) by a hash table of length two, then of length four, etc., until all the entry points are placed. The unpopulated slots are filled with random bits. To decrypt a PURB, a recipient decodes the public key, derives the ephemeral secret, computes her hash index in the first table (which has a single slot), and tries to decrypt the corresponding entry point. In case of failure, the recipient moves on to the second hash table, seeks the corresponding position and tries again, and so on. In the following, we formalize the scheme.

Definition. Let r be the number of recipients, each owning a key pair (y_i, Y_i). The sender generates a fresh key pair (x, X) and computes one ephemeral secret Z_i per recipient. Then, the sender encrypts the data and creates r entry points EP_1, ..., EP_r. Each entry point is placed in a hash table HT_j at the position H(Z_i) mod |HT_j|, where H is a standard hash function; the sender iteratively tries to place the entry point in HT0, HT1, ..., until a placement succeeds (i.e., there is no collision). If the placement fails in the last existing hash table, the sender creates another hash table of twice the size. An example of a PURB encrypted for five recipients is shown in Figure 4.
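
A minimal sketch of this sender-side placement, with illustrative helper types (the real encoder additionally tracks byte offsets, suite public-key positions, and the payload; see Appendix B):

type entry struct {
	secret []byte // ephemeral secret Z_i shared with recipient i
	ep     []byte // AE-encrypted entry point for that recipient
}

// placeEntryPoints lays the entry points out in growing hash tables
// (HT0 of size 1, HT1 of size 2, HT2 of size 4, ...). Slots left nil are
// later filled with random bits.
func placeEntryPoints(entries []entry, hash func([]byte) uint64) [][][]byte {
	var tables [][][]byte
	for _, e := range entries {
		for j := 0; ; j++ {
			if j == len(tables) {
				// Collisions in every existing table: append a table of double size.
				tables = append(tables, make([][]byte, 1<<uint(j)))
			}
			pos := hash(e.secret) % uint64(len(tables[j]))
			if tables[j][pos] == nil {
				tables[j][pos] = e.ep // free slot: place the entry point here
				break
			}
			// Collision: move on to the next, larger hash table.
		}
	}
	return tables
}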

Figure 4: A PURB with hash tables of increasing sizes (HT0, HT1, HT2). Five slots of the hash tables are filled with entry points, and two slots are filled with random bit strings. The “meta” inside the entry points is omitted from the figure due to space constraints. In the byte representation, the hash tables are laid out one after another (Appendix, Fig. 14).

Correspondingly, the recipient (a) reads the encoded public key, (b) derives the shared secret Z, and (c) iteratively tries the matching position in each hash table until the decryption of an entry point succeeds. Although neither the recipient nor anyone else initially knows the number of hash tables in a PURB, the recipient needs to perform only a single expensive public-key operation; the rest are cheap symmetric-key decryption trials.
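
The recipient-side lookup then reduces to a short trial-decryption loop. In this sketch, the candidate slots (one per hash table, at position H(Z) mod 2^j) are assumed to have been extracted from the blob already; a real decoder works directly on byte offsets and stops once they run past the end of the PURB.

// findEntryPoint tries, for each hash table, the single candidate slot that
// the secret z hashes to, and returns the first entry point whose
// authenticated decryption succeeds.
func findEntryPoint(z []byte, candidates [][]byte,
	open func(key, candidate []byte) ([]byte, bool)) ([]byte, bool) {

	for _, c := range candidates {
		if ep, ok := open(z, c); ok {
			return ep, true // decryption succeeded: this is our entry point
		}
		// Failure: random filler or another recipient's entry point; try the next table.
	}
	return nil, false
}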

Security Argument. The combination of the public key with multiple entry points is essentially the multi-recipient IES [34, 9], which has been proven ind$-cca secure [9] under the same assumptions as the single-recipient IES. The security of the multi-recipient IES does not rely on any particular ordering of the symmetrically encrypted chunks; hence the hash-table layout does not affect the security guarantees. The adversary is unable to distinguish an entry point from an “empty” hash-table position filled with random bits, as the AE scheme is ind$-cpa secure. Finally, the adversary is unable to determine the total number of hash tables in a PURB (beyond the bound implied by the PURB's size), as both the encrypted payload and the IES ciphertexts are indistinguishable from random bits and, hence, from each other. See details in Appendix A.4.

Efficiency Argument. In the common case of a single recipient, only a single hash table with a single slot exists, and the header is compact. With r recipients, the worst case is one hash table per recipient (if every insertion leads to a collision), which happens with exponentially decreasing probability; the expected number of hash tables is logarithmic in r. In all cases, the recipient tries at most one entry point per hash table, so she finds her entry point after a logarithmic number of trial decryptions in the common case.

3.5 Multiple Public Keys and Suites

In the real world, not all data recipients might use the same cipher suites. For example, users might prefer different key lengths or use public-key algorithms in different groups. We extend our encoding scheme to support the encryption of data for different cipher suites.

When a PURB is encrypted for multiple suites, the recipients need a way to learn whether a given suite has been used and where the encoded public key of that suite is located in the PURB. Two straightforward ways to let recipients distinguish the suites are to place the public keys linearly at the beginning of the PURB, or to assign a fixed byte position to each defined suite. Both approaches incur undesirable overhead. In the former case, the recipients have to check all possible byte ranges, performing an expensive public-key operation for each; the latter approach results in significant space overhead and a lack of agility, as unused fixed positions have to be filled with random bits, and removing or adding cipher suites requires redefining the positions assigned to the other suites.

Set of Standard Positions. To address this challenge, we introduce a set of standard positions per suite. These sets are public and standardized for all PURBs. A set lists the positions at which the suite's encoded public key may appear in the PURB, counted from just after the nonce, with the first allowed position being the very beginning for every suite. For instance, consider a suite PURB_X25519_AES128GCM_SHA256 that defines the group, the AE scheme, and the hash used for key derivation, respectively. As the length of the encoded public key is fully defined by the suite (32 bytes here, since Curve25519 is used), the recipients iteratively try to decode a public key at each allowed position in turn. The actual position values are not essential to the idea presented here, so we detail them in Appendix C.

Thus, if the sender wants to encode a PURB for two suites A and B, she needs to pick one position from each set such that the two public keys do not overlap. For instance, if the first allowed positions of A and B coincide, one possible choice is to place the public key for suite A at A's first allowed position and the public key for suite B at B's second allowed position. We note that both suites A and B have the position just after the nonce in their sets: in the common case where the PURB is encoded for only one suite, the encoded public key sits at the beginning of the blob, just after the nonce, as in the previous design where the suite was fixed. With well-designed sets (whose positions can always be chosen without overlap for any pair of suites), the sender can encode a PURB for any number of suites. We address efficiency below, and we provide a concrete example with real suites in Appendix C.

Overlapping Layers. We now explain the relationship between the position of the entry point hash tables and the public keys. In short, they overlap; an allowed position for a suite can refer to some bits that can be used by something else, be it another public key, an entry point, or even the payload. Conceptually, a PURB is made of overlapping layers: One layer is composed of the hash tables for the entry points, one layer is for the payload, and each suite has its own layer of its public key’s positions. Recall that the sender first builds the public keys’ layers. Then, the sender creates and places entry points. Some entries of the hash tables can be already (partially) occupied by a public key. In this case, the sender proceeds as described before: She either moves to the next hash table or creates a new one of twice the size if it does not yet exist, and places the entry point later. The payload is placed right after the last encoded public key or entry point, so that it never collides with a value in the header. As the recipient performs trial decryption, she eventually finds the correct public key and entry point. The position of the payload is included in the entry point.

Figure 5: Example of a PURB encoded for three public keys in two suites (suites A and B). The sender generates one ephemeral key pair per suite. In this example, the public key for suite A is placed at the first allowed position, and the public key for suite B moves to its second allowed position (since the first position is taken by suite A). These positions are public and fixed for each suite. HT0 cannot be used for storing an entry point, as the public key for suite A partially occupies it; HT0 is considered “full”, and the entry point is placed in a subsequent hash table, here HT1.

Decoding Efficiency. Decoding efficiency, however, still does not meet our system goal: for a given cipher suite, a recipient has to perform several expensive public-key operations, one for each allowed public-key position, until the correct position is found. We reduce this overhead to a single public-key operation per suite by removing the need to know which of the suite's positions actually holds the public key. We achieve this by requiring the sender to XOR the bytes at all the suite's positions and to place the result into one of them. The sender first constructs the whole PURB as before; then she substitutes the bytes of the already written encoded public key with the XOR of the bytes at all the defined suite positions (skipping those that exceed the PURB length). To decode a PURB, a recipient starts by reading and XORing the values at all the positions defined for a suite; this yields an encoded public key if the suite has been used in this PURB. For an example, see Figure 5.
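
A sketch of both sides of this XOR trick follows; the position sets, key length, and function names are illustrative, not the library's API.

// encodeKeyXOR is the sender-side step: the encoded public key has already
// been written at purb[chosen:chosen+keyLen]; it is now replaced by itself
// XORed with the bytes at every other allowed position, so that a decoder
// does not need to know which position was chosen.
func encodeKeyXOR(purb []byte, positions []int, chosen, keyLen int) {
	acc := make([]byte, keyLen)
	for _, p := range positions {
		if p == chosen || p+keyLen > len(purb) {
			continue // skip the chosen slot and positions beyond the PURB's end
		}
		for i := 0; i < keyLen; i++ {
			acc[i] ^= purb[p+i]
		}
	}
	for i := 0; i < keyLen; i++ {
		purb[chosen+i] ^= acc[i]
	}
}

// decodeKeyXOR is the recipient-side step: XORing the bytes at all allowed
// positions cancels everything except the encoded public key (if this suite
// was used), without any public-key operation.
func decodeKeyXOR(purb []byte, positions []int, keyLen int) []byte {
	out := make([]byte, keyLen)
	for _, p := range positions {
		if p+keyLen > len(purb) {
			continue
		}
		for i := 0; i < keyLen; i++ {
			out[i] ^= purb[p+i]
		}
	}
	return out
}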

Security Argument. Let us first consider the content of a PURB without taking its structure into account. The content can be separated into four types of components: the per-suite groups of public key and entry points, random fillings, the nonce, and the payload. Each group consisting of a suite's public key and its related entry points is a multi-recipient IES ciphertext, which is ind$-cca secure. If the components were laid out sequentially, they would constitute a concatenation of ind$-cca/ind$-cpa secure ciphertexts and random bits. We argue that, as in §3.4, the sender's choice of a particular ordering of the ciphertext chunks does not give a potential adversary any additional advantage. In fact, it makes the guessing game harder for the adversary, as the game with PURBs of non-deterministic structure reduces to the game with sequentially laid-out components by revealing the structural information to the adversary. See the detailed argument in Appendix A.5.

3.6 Overall Encoding Algorithm

The algorithm is as follows: the sender (a) reserves space for a chosen public-key position per suite, (b) lays out the entry-point hash tables for all the suites, (c) lays out and encrypts the file's contents (or its first substantial-size chunk in the streaming case), (d) encrypts all the entry points containing the session key and related metadata into their positions in the respective hash tables, (e) fills all remaining unreserved space in the variable-length header with random bits, and finally (f) XOR-encodes the public-key values. The detailed algorithm is in Appendix B.

Key Points. Despite the complexity of the encoding, we emphasize that in the common case where a PURB is encoded for one suite and one recipient, the algorithm falls back to the simple, compact construction; yet it seamlessly supports multiple recipients and suites. Thanks to the trial-decryption design, a third party without a decryption key is unable to differentiate between these variants (e.g., a short header and a long payload, or a longer header and a shorter payload), thus achieving zero leakage apart from the total length (addressed in §4).

3.7 Sender Authentication

In PURBs, a sender generates an ephemeral key pair for every blob. The recipients cannot know the key pair in advance, hence cannot authenticate the origin. In some scenarios, it is important to be able to obtain proof that a PURB comes from a certain sender and that an adversary is unable to impersonate this sender. The standard approach for providing this is to enable sender authentication using a cryptographic signature scheme.

Assume now that the sender has a long-term identity key pair, and that its public key is known to (or can be retrieved by) potential recipients and can be used to authenticate the sender. The design requirement is that only legitimate recipients, and no observers, can learn and verify the public-key identity of the sender. Because of this, and because a signature is not guaranteed to be indistinguishable from random bits, the sender cannot simply sign the final PURB and append the resulting signature. Instead, the sender produces a signature over the cleartext using the long-term identity key and forms a unified payload, consisting of the data and the signature, that is encrypted under the session key. The sign-then-encrypt paradigm is secure when the identity of the PURB creator is covered by the signature [23] and the signature scheme is secure against existential forgery attacks. Any legitimate recipient can verify that the data indeed originated from the sender by checking the signature after decryption.
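
A hedged sketch of this sign-then-encrypt step, using Ed25519 as an example identity-signature scheme; the exact layout of the unified payload and the names below are ours, not the paper's.

package main

import "crypto/ed25519"

// signedPayload binds the creator's identity into the signature (as required
// for sign-then-encrypt to be safe) and returns data || senderID || signature.
// This unified payload is what gets encrypted under the session key; the
// recipients verify the signature only after decryption.
func signedPayload(identityPriv ed25519.PrivateKey, senderID, data []byte) []byte {
	msg := append(append([]byte{}, senderID...), data...)
	sig := ed25519.Sign(identityPriv, msg) // 64-byte signature

	out := make([]byte, 0, len(data)+len(senderID)+len(sig))
	out = append(out, data...)
	out = append(out, senderID...)
	out = append(out, sig...)
	return out
}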

3.8 Non-malleability

By default, our encoding scheme does not ensure non-malleability. A typical decoder can only authenticate the origin and verify that the encrypted payload and her own entry point have not been modified en route. However, if the entry points of other recipients or the random-byte fillings are malformed, the decoder cannot detect this. If an attacker obtains access to a decoding oracle, he can flip bits in an intercepted PURB and observe whether the oracle still returns a valid decoding. The attacker might thereby learn the length of the padded payload, the header length, and where the designated entry points are located, which may be undesirable in certain scenarios. An example of exploiting malleability is the Efail attacks [42], which tamper with PGP- or S/MIME-encrypted e-mails to achieve exfiltration of the plaintext.

The natural approach to ensuring full integrity would be to pass the PURB header as associated data (AD) into the authenticated encryption of the payload. However, there are two issues: (1) the header might need to XOR bytes of the encrypted payload if some of the potential public-key positions fall within the payload range; and (2) the authentication tag must cover all the XOR operations, yet it might itself fall into a potential public-key position, invalidating the decoder's later computation.

We introduce two modifications that enable us to provide AD-based integrity. First, the AEAD scheme used for payload protection must follow the encrypt-then-MAC generic composition, in which the associated data is authenticated after the encryption is completed. Suitable classes are two-pass AEAD schemes, e.g., EAX [11], sponge-based constructions [14], e.g., NORX [5], and Even-Mansour-based tweakable ciphers [27]. Second, we reserve the last bytes of a PURB, as many as the size of the integrity tag defined by the cipher suite's AEAD scheme, as unusable for the public-key XOR during decoding. The updated scheme is as follows. To encode a PURB, the sender begins by preparing the entry points, laying out the header, and encrypting the payload. Then, she computes the XOR values, potentially using bytes of the encrypted payload, computes the authentication tag using the header as AD, and appends this tag at the end of the PURB. Upon receiving the PURB, a decoder computes the potential public keys of interest, ignoring the last bytes of the PURB, decrypts the payload, and verifies the integrity using the unprocessed header, the payload, and the tag.

4 Hiding the Ciphertext Length

The encoding scheme presented in Section 3 produces blobs of data that are indistinguishable from random bit-strings of the same length, thus leaking no information to the adversary. However, the length in itself might reveal information about the content, and this leakage has already been used extensively in many traffic-analysis attacks, e.g., on website fingerprinting [39, 25, 56, 57], video identification [43, 49, 44], and VoIP traffic fingerprinting [61, 19]. Although solutions involving application- or network-level padding are numerous, they are typically designed for their specific problem, and the more fundamental problem of length-leaking ciphertexts remains. Some leakage is certainly unavoidable, but we show how the current padding of block ciphers is fundamentally insufficient for efficiently hiding the plaintext length, especially when considering plaintexts that can differ in size by several orders of magnitude.

We introduce Padmé, a novel padding scheme designed for PURBs but not restricted to them: it reduces length leakage for a wide range of encrypted data types. Padmé asymptotically leaks significantly less than stream- and block-cipher-encrypted data (which were not designed to hide the length but are still the standard in many settings). Padmé's overhead decreases with the file size and is at most 12%, which is acceptable in most cases. The intuition behind Padmé is that instead of grouping file lengths into fixed-size sets (i.e., multiples of a block size, as block ciphers do) or into exponentially growing sets, Padmé groups files into logarithmically increasing sets of file lengths.

We emphasize that many defenses already exist in specific scenarios, e.g., on the topic of website fingerprinting [25, 58], and Padmé is not claiming to perform better than those tailored solutions in their application domain, but rather tries to be a generic solution for still-unprotected ciphertexts and protocols.

4.1 Definitions

Let |c| denote the length of a bit string c.

Leakage. Let c be a ciphertext (padded or unpadded). We assume the same adversary as in §3, and we assume that c provides no information to the adversary except its length |c|. We define the leakage as the number of bits of information revealed about the length of the underlying data.

Overhead. Let c' be the padded version of an unpadded ciphertext c. We define the overhead as |c'| − |c|, the number of bits used to pad c into c'.

4.2 Design Criterion

To evaluate the strawman approaches and our proposal Padmé, we design a game in which the adversary guesses the plaintext behind a padded encrypted blob. This game is inspired by related work (i.e., defending against a perfect attacker) [58].

Padding Game. Let P denote a collection of plaintext objects (e.g., files, documents, or application data units). An honest user chooses a plaintext p from P, then pads and encodes it into a PURB c. We consider that the adversary knows almost everything: all the possible plaintexts in P, the PURB c, and the parameters used to generate it (e.g., the number of recipients and the schemes). The adversary does not know the private inputs or any decryption key corresponding to c. The goal of the adversary is to guess the plaintext p given the observed PURB c of length |c|.

4.3 Strawman Padding Approaches

In this section, we present two strawman designs. For each, we define a padding function that yields the padded size given the plaintext length. For simplicity, we reason only in terms of lengths, writing L for the original length and L' for the padded length, with L' ≥ L.

Given the aforementioned adversary, it is clear that the padding function cannot be one-to-one; otherwise, the adversary could trivially invert it and recover L. We have the choice of making it either one-to-many (i.e., L' is picked at random) or many-to-one (i.e., several values of L are grouped into a bucket with a single padded size L').

In this work, we opt for the many-to-one approach, as the one-to-many approach (1) is not deterministic, (2) leaks more information when the same plaintext is encoded several times, (3) requires a source of randomness, and (4) is arguably harder to analyze.

Strawman 1: Fixed-Size Blocks. We consider padding every length L up to the next multiple of a block size b. This is how block ciphers operate, and hence how many objects get “padded” in real life (e.g., Tor cells). In this case, the PURB's size is a multiple of b, the maximum overhead is b − 1 bits, and the leakage is about log_2(L/b) bits, i.e., still essentially logarithmic in L.

Unfortunately, this approach exhibits the problem mentioned in the introduction of this section: when plaintexts differ in size by several orders of magnitude, there is no good value of b that accommodates them all. For instance, consider a block size on the order of 1 MB: padding small files and network messages would incur a huge relative overhead (e.g., Tor cells are 512 B long, and padding each to 1 MB would inflate it by several orders of magnitude). In contrast, padding a movie of several hundred megabytes with at most 1 MB of chaff would add only a little confusion for the adversary, as the movie would likely remain distinguishable from others. Hence, to reduce the information leakage about the length, the padding must depend on the file size.

Strawman 2: Padding to the Nearest Power of 2. As fixed-size blocks fail to accommodate a wide range of plaintexts, the next logical step is to use varying-size blocks. We introduce this strawman as a basis of comparison for our actual scheme. The intuition is that for small plaintexts the buckets should be small too (yielding a modest overhead), whereas for larger files the buckets should be larger, again yielding an appropriate leakage-to-overhead trade-off.

If we try to use buckets of varying sizes, a naïve way of doing so is to follow the powers of 2, i.e., to pad L up to L' = 2^⌈log_2(L)⌉ (we discuss other bases below). Henceforth, we refer to this strawman as Next power of 2.

The leakage of this scheme is O(log log L) bits (see Appendix, Lemma 1). The maximum overhead is almost +100% (e.g., a 17 GB Blu-ray movie would be padded to 32 GB). Using powers of a larger base, we obtain less leakage at the cost of more overhead; e.g., padding to the nearest power of 3 has an overhead of up to +200%, with less leakage than before but still in O(log log L). Although this second strawman has the desirable property that the overhead depends on L, we conclude that this class of padding has too large an overhead to be used in practice, independently of the base.
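
For reference, this strawman fits in a few lines (an illustrative helper, not part of the paper's code):

// nextPowerOfTwo returns the padded length under the "Next power of 2" strawman.
func nextPowerOfTwo(L uint64) uint64 {
	p := uint64(1)
	for p < L {
		p <<= 1
	}
	return p
}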

4.4 Padmé

We now describe our padding scheme Padmé, which limits the information leaked about the length of the plaintext for a wide range of encrypted data sizes. Like the previous strawman, Padmé asymptotically leaks O(log log L) bits of information, but its overhead is much lower: at most 12%, and decreasing with L.

Intuition. Consider the previous strawman, where L is padded to the next power of two 2^E. The only permissible padded lengths are of the form 2^E, and the information that leaks is essentially the exponent E, i.e., in which bucket the plaintext lies. This value can be represented as a binary floating-point number with log_2(E) bits of exponent and 0 bits of mantissa.

In Padmé, we follow the same idea: the permissible padded lengths are those representable as binary floating-point numbers, but we additionally allow the mantissa to be at most as long as the exponent (see Figure 6). This doubles the number of bits used to represent an allowed padded length – and doubles the absolute leakage – but allows for more fine-grained buckets, reducing the overhead. Asymptotically, Padmé leaks the same number of bits (up to a factor of 2), but reduces the overhead by almost an order of magnitude (from +100% to at most +12%). More importantly, the bucket sizes now grow logarithmically with respect to L, instead of exponentially as in Next power of 2. Thus, the overhead in percentage decreases with L.

In the strawman Next power of 2, an allowed length can be represented as a binary floating-point number with no mantissa and only an exponent; Padmé additionally allows a mantissa no longer than the exponent.

Figure 6: Padmé represents lengths as binary floating-point numbers, allowing the mantissa to be at most as long as the exponent.
Figure 7: Maximum overhead with respect to the plaintext size L. The naïve approach of padding to the next power of two has a constant maximum overhead of +100%, whereas Padmé's overhead decreases with L, following Equation (1).

Algorithm. To compute the padded size and to ensure that it fits in a floating-point representation of at most 2·S bits (S bits of exponent and at most S bits of mantissa), we enforce the last E − S bits of the padded length to be 0, where E = ⌊log_2(L)⌋ is the value of the exponent and S = ⌊log_2(E)⌋ + 1 is the length of its binary representation. The reason for the subtraction E − S will become clear later. For now, we demonstrate how E and S are computed in Table 1.

  L   L (binary)   E   S   IEEE representation
  8   0b1000       3   2   0b1.0   * 2^0b11
  9   0b1001       3   2   0b1.001 * 2^0b11
 10   0b1010       3   2   0b1.01  * 2^0b11

Table 1: The IEEE floating-point representations of 8, 9 and 10. The value 8 has 0 bits of mantissa (the initial 1 is omitted) and 2 bits of exponent; 9 has a 3-bit mantissa and a 2-bit exponent, while 10 has a 2-bit mantissa and a 2-bit exponent. Padmé enforces the mantissa to be no longer than the exponent, hence 9 gets rounded up to the next permitted length, 10.

Recall that Padmé enforces the bit length of the mantissa to be no greater than the bit length of the exponent. In Table 1, the value 9 has a mantissa (3 bits) longer than its exponent (2 bits) – intuitively, it is “too precise” – and is therefore not a permitted padded length. The value 10, in contrast, is permitted; thus, a 9-bit-long ciphertext would be padded to 10 bits.

To understand why Padmé zeroes E − S bits rather than E bits, it suffices to realize that enforcing the last E bits to be 0 is equivalent to padding to the next power of two. In comparison, Padmé allows S extra low-order bits to represent the padded size, with S defined as the bit length of the exponent.

We give a precise definition of the procedure in Algorithm 1. Once the padded size is computed, a PURB plaintext of length L is simply padded with zeros (to be precise, we suggest following the compact padding scheme of ISO/IEC 7816-4:2005, https://www.iso.org/standard/36134.html).

Data: length L of the content
Result: length L' of the padded content
E ← ⌊log_2(L)⌋                 // exponent of L
S ← ⌊log_2(E)⌋ + 1             // # of bits needed to represent E
z ← E − S                      // # of bits to set to 0
m ← (1 << z) − 1               // mask of z 1's in the LSB
L' ← (L + m) & ~m              // round up, then set the last z bits to 0
Algorithm 1: Padmé
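
Algorithm 1 translates directly into a few lines of Go. This transcription uses floating-point logarithms for readability (a production version would use integer bit tricks); as a sanity check, padme(9) == 10, matching Table 1.

package main

import "math"

// padme returns the Padmé padded length L' for a content length L >= 2,
// expressed in whatever unit L uses (e.g., bytes).
func padme(L uint64) uint64 {
	E := uint64(math.Floor(math.Log2(float64(L))))     // exponent of L
	S := uint64(math.Floor(math.Log2(float64(E)))) + 1 // # of bits needed to represent E
	z := E - S                                         // # of low-order bits to set to 0
	m := uint64(1)<<z - 1                              // mask of z ones in the LSB
	return (L + m) &^ m                                // round up, then zero the last z bits
}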

Leakage and Overhead. By design, the leakage is O(log log L) bits, which is apparent from the length of the binary representation of the leaked information (an exponent and a mantissa of at most the same length). As we fix the last E − S bits to 0 and round up, the maximum overhead is 2^(E−S) − 1 bits. As a percentage, the maximum overhead can be estimated as follows:

    overhead ≤ (2^(E−S) − 1) / L < 2^(E−S) / 2^E = 2^(−S) < 1 / ⌊log_2(L)⌋    (1)

Thus, Padmé's overhead in percentage decreases with respect to the file size L. The maximum overhead, about +12%, is reached only for small values of L; for bigger files, the overhead is smaller.

On Optimality. We note that there is no sweet spot on the leakage-to-overhead curve: we could just as easily zero one bit fewer (or one more) than the last E − S bits and thereby trade a little overhead against a little leakage. Still, the relation that matters in practice is between the plaintext length L and the overhead. We show in §5.2 how this choice performs on various datasets.

5 Evaluation

Our evaluation is two-fold: First, we show the performance and overhead of the PURB encoding and decoding; second, using several datasets, we show how Padmé facilitates hiding information about data length.

5.1 Performance of the PURB Encoding

The main question we answer in the evaluation of the encoding scheme is whether it has a reasonable cost, in terms of both time and space overhead, and whether it scales gracefully with an increasing number of recipients and/or cipher suites. First, we measure the average CPU time required to encode and decode a PURB. Then, we compare the decoding performance with the performance of plain and anonymized OpenPGP (described below) schemes. Finally, we show how the compactness of the header changes with multiple recipients and suites, as a percentage of useful bits in the header.

Anonymized PGP. In standard PGP, the identity – more precisely, the public-key ID – of each recipient is embedded in the header of the encrypted blob. This plaintext marker speeds up decryption but enables a third party to enumerate all the data recipients. In the so-called anonymized or “hidden” version of PGP [18, Section 5.1], this key ID is substituted with zeros; the recipient then sequentially tries the encrypted entries of the header with her keys. We use the hidden PGP variant as a comparison for PURBs, which likewise do not indicate key IDs in the header (but use a more efficient structure). We note that the hidden PGP variant still leaks the cipher suites used, the total length, and other plaintext markers (version number, etc.).

5.1.1 Implementation

We implemented a prototype of the PURB encoding and padding schemes in Go. The implementation covers the base encoding scheme, without sender authentication and PURB non-malleability. It relies on the open-source Kyber library (https://github.com/dedis/kyber) for cryptographic operations because, to the best of our knowledge, it is the only crypto library in Go that implements the Elligator encoding for points on Curve25519. We provide a tool that is easily integrable with existing solutions. However, our code is only a proof-of-concept and has not yet gone through the necessary hardening (e.g., against timing attacks).

Reproducibility. All datasets, the source code for PURBs and Padmé, as well as scripts for reproducing all experiments, are available in the main repository: https://github.com/dedis/purbs.

5.1.2 Methodology

We ran the encoding experiments on a consumer-grade laptop with a quad-core 2.5 GHz Intel Core i7 processor and 8 GB of RAM, using Go 1.10. To compare against an OpenPGP implementation, we use Keybase's fork (https://github.com/keybase/go-crypto) of the default Golang crypto library (https://github.com/golang/crypto), as the fork adds support for the ECDH scheme on Curve25519.

We further modify Keybase's implementation to add support for the anonymized OpenPGP scheme. The PURB suite used in all the encoding experiments is based on Curve25519. If more than one suite is needed for an experiment, we use copies of Curve25519 to ensure homogeneity across timing experiments. For each data point, we generate a new set of keys, one per recipient, and we measure each data point 20 times with fresh randomness.

5.1.3 Results

Figure 8: Performance of PURB encoding. (a) CPU time required to encode a PURB, depending on the number of recipients and included cipher suites. (b) Worst-case CPU cost of decoding for PGP, PGP with hidden recipients, a flat PURB, and a standard PURB. (c) Compactness of the PURB header (percentage of non-random bits).

Encoding Performance. In this experiment, we evaluate, first, how the time required to encode a PURB changes with a growing number of recipients and cipher suites, and second, how the main computational components contribute to this duration. We logically divide the total time into three components. The first is the generation and Elligator encoding of the sender's public keys, one per suite. A public key is derived by multiplying a base point by a freshly generated private key (scalar); if the resulting public key is not encodable, which happens in half of the cases, a new key is generated. Point multiplication dominates this component. The second component is the derivation of a shared secret with each recipient, essentially a single point multiplication per recipient. We label the third component Other: it includes all the operations not in the first two, namely the encryption of the entry points and payload, hash-table placement, etc. We consider three cases: one, three, or ten cipher suites. When more than one cipher suite is used, the recipients are divided equally among them.

Figure 8(a) shows that in the case of a single recipient, the generation of a public key and the computation of a shared secret dominate the total time. Generating a public key takes longer on average, due to the need to regenerate keys in half of the cases because of encoding failures. As expected, computing shared secrets starts to dominate as the number of recipients grows, whereas the duration of the public-key generation depends only on the number of cipher suites used. The encoding is arguably efficient for most communication scenarios: even with a hundred recipients and ten suites, the time for creating a PURB stays within a few seconds.

Decoding Performance. We measure the worst-case CPU time required to decipher a standard PGP message, a PGP message with hidden recipients, a flat PURB (with a flat layout of entry points and no hash tables), and a standard PURB. A suite based on Curve25519 is used in all the PGP and PURB schemes.

Figure 8(b) shows the results. The OpenPGP library uses the assembly-optimized Go elliptic-curve library for point multiplication, which makes a single multiplication several times faster there than in Kyber; this explains the significant difference in absolute values for small numbers of recipients. But our primary interest is how the total duration scales. The total time for anonymized PGP grows linearly because, in the worst case, a decoder has to derive as many shared secrets as there are recipients. PURBs, in contrast, exhibit almost constant time, as only a single point multiplication is needed regardless of the number of recipients. A decoder still has to perform multiple entry-point trial decryptions, but each such operation accounts for only a tiny fraction of the total time (in the single-recipient, single-suite scenario). The advantage of using hash tables, hence logarithmically fewer symmetric-key operations, is illustrated by the difference between the standard and flat PURBs, which becomes noticeable as the number of recipients grows.

Header Compactness. We analyze the compactness of the header. Recall that, in comparison with placing the different elements linearly in the header, our design with growing hash tables is less compact but enables more efficient decoding (an example of this trade-off is visible in Figure 8(b), PGP hidden vs. standard PURBs).

Figure 8(c) shows compactness, i.e., the percentage of the PURB header that is filled with actual data, with respect to the number of recipients and cipher suites. Not surprisingly, an increasing number of recipients and/or suites increases collisions and reduces compactness; typical header sizes for these configurations are reported in the Appendix (Figure 15). More importantly, in the most common case of a single recipient and a single suite, the header is perfectly compact. Finally, there is a trade-off between compactness and efficient decryption: compactness can easily be increased by resolving entry-point collisions linearly (instead of moving directly to the next hash table); the downside is that the recipient then has more entry points to try.
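For intuition about where these collisions come from, the following simplified sketch (names, hash choice, and details are ours, not the paper’s code) shows the growing-hash-table placement: tables of sizes 1, 2, 4, … are filled in order, and an entry point that collides in one table moves on to the next, larger one. The real scheme derives positions from per-recipient secrets; here we simply hash an identifier.

```go
package main

import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
)

// place assigns each entry-point identifier to a (table, slot) pair using
// growing hash tables of sizes 1, 2, 4, ...: an entry that collides in one
// table moves on to the next, larger table.
func place(ids []string) map[string][2]int {
    var tables [][]bool // occupancy bitmap per table, created on demand
    out := map[string][2]int{}
    for _, id := range ids {
        h := sha256.Sum256([]byte(id))
        v := binary.BigEndian.Uint64(h[:8])
        for t := 0; ; t++ {
            if t == len(tables) {
                tables = append(tables, make([]bool, 1<<t))
            }
            slot := int(v % uint64(len(tables[t])))
            if !tables[t][slot] {
                tables[t][slot] = true
                out[id] = [2]int{t, slot}
                break
            }
        }
    }
    return out
}

func main() {
    fmt.Println(place([]string{"alice", "bob", "carol", "dave"}))
}
```

Unused slots in each allocated table must still be filled with random bits, which is exactly the lost compactness measured above.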

5.2 Performance of Padmé

When reviewing a padding scheme, one important metric is the incurred overhead, in terms of bits added to the plaintext. By design, Padmé’s overhead is bounded by 12% and decreases with the plaintext length L. As discussed in §4.4, Padmé does not escape the typical overhead-to-leakage trade-off, so its novelty does not lie in the mere existence of such a trade-off. Rather, it lies in the relation between the two: the leakage grows only as O(log_2(log_2(L))), while the overhead stays at most 12% and shrinks further for larger payloads.
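As a concrete illustration, here is a short Go sketch of a Padmé-style padding function, following our reading of the definition in §4 (the function name and the handling of very small lengths are ours): with E = floor(log2 L) and S = floor(log2 E) + 1, the low-order E − S bits of the padded length are forced to zero, and L is rounded up to the next such value.

```go
package main

import (
    "fmt"
    "math/bits"
)

// padme returns the padded length for a plaintext of l bytes, keeping only
// the high-order bits of the length significant, per our reading of §4.
func padme(l uint64) uint64 {
    if l < 2 {
        return l
    }
    e := uint64(bits.Len64(l) - 1) // e = floor(log2 l)
    s := uint64(bits.Len64(e))     // s = floor(log2 e) + 1
    mask := uint64(1)<<(e-s) - 1   // low-order bits that must be zeroed
    return (l + mask) &^ mask      // round l up to the next allowed length
}

func main() {
    for _, l := range []uint64{100, 1000, 1_000_000, 100_000_000} {
        p := padme(l)
        fmt.Printf("len=%d padded=%d overhead=%.2f%%\n", l, p, 100*float64(p-l)/float64(l))
    }
}
```

For example, this rounds 1,000,000 bytes up to 1,015,808, about 1.6% overhead, illustrating the small and shrinking relative cost.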

A more interesting question is whether, given an arbitrary collection of plaintexts, Padmé hides well which plaintext was padded. Padmé was designed to work with an arbitrary collection of plaintexts; it remains to be seen how it performs when applied to a specific set of plaintexts, i.e., with a size distribution coming from the real world, and how well it groups files into sets of identical length. In the next section, we experiment with four datasets made of various objects: a collection of Ubuntu packages, a set of YouTube videos, a set of user files, and a set of Alexa Top 1M websites.

5.2.1 Datasets and Methodology

Dataset                  # of objects
Ubuntu packages              56,517
YouTube videos              191,250
File collections          3,027,460
Alexa top 1M websites         2,627
Table 2: Datasets used in the evaluation of the anonymity provided by Padmé.

The Ubuntu dataset contains 56,517 unique packages, parsed from the official repository of a live Ubuntu instance; as packages can be referenced in multiple repositories, we filtered the list by name and architecture. The reason for padding Ubuntu software updates is that knowledge of the updates enables a local eavesdropper to build a list of the packages, and their versions, installed on a machine. If some of the packages are outdated and have known vulnerabilities, an adversary might use them as an attack vector. A percentage of software updates still occurs over unencrypted connections, which is an issue in itself; but even encrypted connections to software-update repositories expose which distribution is used and the kind of update being performed (security / restricted, which contains proprietary software and drivers / multiverse, which contains software restricted by copyright / etc.). We hope that this unnecessary leakage will disappear in the near future.

The YouTube dataset contains 191,250 unique videos, obtained by iteratively querying the YouTube API. One semantic video is generally represented by several .webm files, corresponding to various video qualities; hence, each object in the dataset is a unique (video, quality) pair. We use this dataset as if the videos were downloaded in bulk rather than streamed; that is, we pad each video as a single file. The argument for padding YouTube videos as whole files is that, as shown by related work [43, 49, 44], variable-bitrate encoding combined with streaming leaks which video is being watched. If YouTube wanted to protect the privacy of its users, it could re-encode everything to constant-bitrate encoding and still stream it, but then the total length of the stream would still leak information. Alternatively, it could adopt a model like the iTunes store, where videos have variable bitrate but are bulk-downloaded; but again, the total downloaded length would leak information, requiring some padding. Hence, we explore how unique the YouTube videos are with and without padding.

The files dataset was constituted by collecting the sizes of the files under the ‘/home/user/’ directories of co-workers and contains 3,027,460 personal and configuration files. These files were collected from machines running Fedora, Arch, and MacOSX. The argument for analyzing the uniqueness of these files is not that each file should be encrypted individually (often, there is no point in hiding the metadata of a file if its location exposes everything about it, e.g., ‘/home/user/.ssh’), but rather to quantify the privacy gain of padding such objects.

Finally, the Alexa dataset is made of 2,627 websites from the Alexa Top 1M list. The size of each website is the sum of all the resources loaded by the webpage, recorded by piloting a ‘chrome-headless’ instance with a script that mimics real browsing. One reason for padding whole websites – as opposed to padding individual resources – is that related work in website fingerprinting showed the importance of the total downloaded size [25]. The effectiveness of Padmé when padding individual resources, or for instance bursts [58], is left as interesting future work.

5.2.2 Evaluation of Padmé

The distribution of object sizes for all the datasets is shown in Figure 9. Intuitively, it is harder for an efficient padding scheme to build groups of same-sized files when a dataset contains large objects. Therefore, we expect the largest objects of the four datasets to remain somewhat unique, even after padding.

Figure 9: Distribution of the sizes of the objects in each dataset.

For each dataset, we analyze the “anonymity” (more precisely, the size of the anonymity set) of each object. To compute this metric, we group objects by their padded size and report the distribution of the sizes of these groups. A large number of small groups indicates that many objects in the dataset are easily identifiable solely by their size. For each dataset, we compare three approaches: the Next power of 2 strawman, Padmé, and padding to a fixed block size (of 512 B, like a Tor cell). The anonymity metrics are shown in Figure 10, and the respective overheads in Table 3. We first notice that for all these datasets, despite their containing very different objects, a non-negligible fraction of the objects can be identified uniquely by their size alone; this holds for the YouTube videos (Figure 10(a)) as well as for the Ubuntu packages (Figure 10(c)). These characteristics persist with traditional block-cipher padding (blue dashed curves), where objects are only padded to a block size: even after being padded to 512 bytes, the size of a Tor cell, most object sizes remain as unique as in the unpadded case. We observe similar results when padding to 128 bits, the typical block size of AES (not plotted).
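The anonymity-set metric itself is simple to compute. The sketch below is our own code, not the evaluation scripts, and uses Next power of 2 as the example padding function: it groups objects by padded size and reports, for each object, how many objects share its padded size.

```go
package main

import (
    "fmt"
    "math/bits"
)

// nextPow2 pads a length up to the next power of two (the strawman scheme).
func nextPow2(l uint64) uint64 {
    if l <= 1 {
        return l
    }
    return 1 << uint(bits.Len64(l-1))
}

// anonSetSizes returns, for each object, how many objects end up with the
// same padded size: the anonymity-set size used as the metric in §5.2.2.
func anonSetSizes(sizes []uint64, pad func(uint64) uint64) []int {
    counts := make(map[uint64]int)
    for _, s := range sizes {
        counts[pad(s)]++
    }
    out := make([]int, len(sizes))
    for i, s := range sizes {
        out[i] = counts[pad(s)]
    }
    return out
}

func main() {
    sizes := []uint64{900, 1000, 1100, 5000, 5100} // toy object sizes in bytes
    fmt.Println("no padding:     ", anonSetSizes(sizes, func(l uint64) uint64 { return l }))
    fmt.Println("next power of 2:", anonSetSizes(sizes, nextPow2))
}
```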

Next power of 2 (red dotted curves) provides the best anonymity: in the ‘YouTube’ and ‘Ubuntu’ datasets (Figures 10(a) and 10(c)), no object remains unique with respect to its size; all belong to groups of at least two objects. Of course, we cannot generalize this statement, as shown by the other two datasets (Figures 10(b) and 10(d)). In general, we see a massive improvement with respect to the unpadded case. Recall, however, that this padding scheme is rather impractical, adding up to 100% to the size in the worst case; in Table 3, we see that its mean overhead on our datasets lies between 43% and 47%.

Finally, we consider the anonymity provided by Padmé (green solid curves). By design, Padmé has an acceptable maximum overhead (at most 12%, and decreasing with the object size). In three of the four datasets, there is only a constant difference between our expensive reference point, Next power of 2, and Padmé, even though Padmé’s overhead decreases with the object size, unlike that of Next power of 2. This means that although larger files receive proportionally less protection (i.e., less padding in percentage) with Padmé, this is not critical: such files are rarer and are harder to protect efficiently even with a naïve and costly approach. We observe a significant drop in the percentage of uniquely identifiable objects (objects that trivially reveal their plaintext to our perfect adversary) when using Padmé, for each of the ‘Ubuntu’, ‘YouTube’, ‘files’, and ‘Alexa’ datasets. In Table 3, we see that the mean overhead of Padmé is around 3%, more than an order of magnitude smaller than that of Next power of 2. We also see that padding to a fixed block size can yield a high relative overhead while still providing insufficient protection.

(a) Dataset ‘YouTube’. (b) Dataset ‘files’. (c) Dataset ‘Ubuntu’. (d) Dataset ‘Alexa’.
Figure 10: Analysis of the anonymity provided by various padding approaches: Next power of 2, Padmé, padding with a constant block size, and no padding. We measure, for each object, with how many other objects it becomes indistinguishable after being padded, and plot the distribution. Next power of 2 provides better anonymity, at the cost of a drastically higher overhead (at most +100% instead of at most 12%). Overheads are shown in Table 3.
Dataset    Fixed block size    Next power of 2    Padmé
YouTube          0.01               44.12           2.23
files           40.15               44.18           3.64
Ubuntu          14.09               43.21           3.12
Alexa           36.71               47.12           3.07
Table 3: Analysis of the overhead, in percent, of various padding approaches. In the first column, we use 512 B as the block size.

6 Related Work

Traffic morphing [62] is a method for hiding the traffic of a specific application by masking it as traffic produced by another application and imitating the corresponding packet distribution. The tools built upon this method can be standalone [55] or use the concept of Tor pluggable transports [37, 59, 60], which are applied to prevent Tor traffic from being identified and censored [17]. There are two fundamental differences with PURBs, however. First, PURBs focus on a single unit of data and do not yet explore the question of the time distribution of multiple PURBs. Second, traffic-morphing systems, in most cases, try to mimic a specific transport and are sometimes designed to hide the traffic of only one given tool, whereas PURBs are universal and arguably adaptable to any underlying application. Moreover, it has been argued that most traffic-morphing tools do not achieve unobservability in real-world settings, due to discrepancies between their implementations and the systems they try to imitate, and because of uncovered behavior in side protocols, error handling, responses to probing, etc. [31, 54]. We believe that, for a wide class of applications, using pseudo-random uniform blobs, either alone or in combination with other lower-level tools, is a potential solution in a different direction.

Traffic analysis aims at inferring the contents of encrypted communication from metadata. Its most well-studied application is website fingerprinting [39, 25, 56, 57], but it has also been applied to video identification [43, 49, 44] and VoIP traffic [61, 19]. In website fingerprinting over Tor, research has repeatedly shown that the total website size is the feature that helps an adversary the most [20, 38, 25]. In particular, Dyer et al. [25] show the necessity of padding the whole website, as opposed to individual packets, to prevent an adversary from identifying a website by its observed total size; they also systematized the existing padding approaches. Wang et al. [58] proposed deterministic and randomized padding strategies tailored to padding Tor traffic against a perfect attacker, which inspired our §4.

Finally, Sphinx [22] is an encrypted packet format for mix networks that aims at minimizing the information revealed to the adversary. Sphinx supports only one cipher suite and one direct recipient (but several nested ones, due to the nature of mix networks). We detail the differences in Appendix D. To the best of our knowledge, PURBs are the first solution that hides all metadata while providing cryptographic agility.

7 Conclusion

We have presented a universal solution to the problem of leaky ciphertexts, which traditionally leave sensitive metadata unprotected. We have discussed how PGP, TLS, and a typical disk-encryption scheme leak information that an attacker could use to perform traffic analysis, website fingerprinting, and inference attacks.

We have argued that this metadata leakage is not a necessity and have presented PURBs, an approach for designing encrypted data formats that leak nothing at all, except the padded length of the ciphertext, to anyone without the decryption keys. We have shown that, despite having no cleartext header, PURBs can be encoded and decoded efficiently and can simultaneously support multiple public keys and cipher suites. We have introduced Padmé, a padding scheme that reduces the length leakage of ciphertexts and has a modest overhead that decreases with the file size. It performs significantly better than classic fixed-block-size padding schemes in terms of anonymity, and its overhead is asymptotically lower than that of schemes with exponentially increasing padding.

References

Appendix A Security Arguments

In the following, we define the requirements for the building blocks of the PURBs encoding scheme and argue for the security properties of each strawman from §3.

A.1 Preliminaries

Let λ be a security parameter for the cryptographic primitives used in our design. We describe the properties required from the primitives in terms of λ.

Authenticated Encryption. Let AE = (KeyGen, Enc, Dec) be a deterministic nonce-based authenticated encryption (AE) scheme [10, 48], where KeyGen, Enc and Dec are the key-generation, encryption and decryption algorithms, respectively; given a key k, a message m, a nonce n and a ciphertext c = Enc(k, n, m), we have Dec(k, n, c) = m, or ⊥ if c is a forged ciphertext. We require all AE schemes used in PURBs to be ind$-cpa secure, i.e., the encryption output, including the authentication tag, must be indistinguishable from random bits under an adaptive chosen-plaintext-and-IV attack, as formalized by Rogaway [48]. In addition, the AE scheme is required to be int-ctxt secure [10], i.e., it must be computationally infeasible to produce a ciphertext not previously produced by the sender.

Nonce. A nonce must be drawn uniformly at random for every PURB. Using a counter or the last block of a previous ciphertext would not suffice, as it would enable an adversary to link related ciphertexts.
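As an illustration of these two requirements (an AE scheme whose output looks uniformly random, used with a fresh random nonce per PURB), the following Go snippet uses ChaCha20-Poly1305; the key and message are toy values, and nothing here is taken from the PURBs implementation.

```go
package main

import (
    "crypto/rand"
    "fmt"

    "golang.org/x/crypto/chacha20poly1305"
)

func main() {
    // Toy key; a real sender would derive it via KDF or PBKDF.
    key := make([]byte, chacha20poly1305.KeySize)
    rand.Read(key)

    aead, err := chacha20poly1305.New(key)
    if err != nil {
        panic(err)
    }

    nonce := make([]byte, aead.NonceSize())
    rand.Read(nonce) // fresh random nonce for every PURB, never a counter

    // Ciphertext plus tag: the whole output should look like random bits.
    ct := aead.Seal(nil, nonce, []byte("entry point or payload"), nil)
    fmt.Printf("nonce=%x ct=%x\n", nonce, ct)
}
```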

Public-Key Algorithms. Let G be a cyclic finite group of prime order generated by the element g, in which the Decisional Diffie-Hellman (DDH) problem is hard for a polynomial-time adversary. G can, for example, be an elliptic-curve group or an integer group. We require that G provides a security level of λ bits. Adopting multiplicative notation, we define the key-generation algorithm as picking a private key x at random and computing the public key X = g^x.

Public-Key Hiding. We require G to have a mapping Hide from a group element to a binary string that is indistinguishable from a random bit string of the same length. Examples of such mapping techniques are Elligator [13] for Curve1174 [13] and Curve25519 [12], the less compact but more broadly applicable Elligator Squared [50, 4] (e.g., for BN curves [6]), and public-key steganography [52] for integer groups. We write Hide(X) for the encoding of a public key X and Hide⁻¹ for its inverse.

Key Derivation Function. Let KDF be a key-derivation function [33] that converts a group element into a bit string that can be used as a symmetric key. A secure KDF must have at least all the properties of a secure cryptographic hash function. SHA-256 is an example of a suitable KDF.

Password-Based Key Derivation Function. Let PBKDF be a secure password-based key-derivation function [41], a variant of a KDF that takes a password and a salt as input and outputs a bit string suitable for use as a key in symmetric encryption. A PBKDF must have all the properties of a secure KDF. Argon2 [15] is an example of a secure PBKDF.
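For concreteness, a minimal sketch of both derivation roles in Go (the parameter choices and placeholder inputs are ours, not prescribed by the paper): SHA-256 as the KDF applied to a Diffie-Hellman shared secret, and Argon2id as the PBKDF applied to a passphrase and salt.

```go
package main

import (
    "crypto/sha256"
    "fmt"

    "golang.org/x/crypto/argon2"
)

func main() {
    // KDF role: derive an entry-point key from a (placeholder) shared secret.
    sharedSecret := []byte("DH group element encoded as bytes") // placeholder
    k1 := sha256.Sum256(sharedSecret)

    // PBKDF role: derive an entry-point key from a passphrase and salt.
    // Argon2id parameters here are illustrative, not the paper's choices.
    password, salt := []byte("correct horse battery staple"), []byte("per-PURB salt")
    k2 := argon2.IDKey(password, salt, 1, 64*1024, 4, 32)

    fmt.Printf("entry-point key (KDF):   %x...\n", k1[:8])
    fmt.Printf("entry-point key (PBKDF): %x...\n", k2[:8])
}
```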

A.1.1 Adversary and Security Goals

An adversary A is a program with access to a PURB oracle O that either returns a requested PURB or a random bit string of the same length. To analyze the security of PURBs, we mimic the experiment-based find-then-guess approach [8, 2]: A submits a chosen message, and possibly the public keys of the recipients, to O and attempts to guess whether the oracle’s response is an actual PURB or a random string. Our goal is to achieve the ind$-cpa notion of indistinguishability. We define A’s advantage in a given experiment as the distance between A’s probability of guessing correctly and 1/2, and to achieve indistinguishability we require this advantage to be negligible in λ.

A.2 Single Passphrase

Consider the following experiment Exp1:

Exp1: A chooses a message m and submits it to O. O flips a coin b; if b = 1, it responds with the PURB encoding of m created with the password p (see below), otherwise with a random bit string of the same length. A outputs a guess b' and wins if b' = b.

Oracle O (input: a password p, unknown to A): generate a fresh nonce and key, derive the entry-point key from p, encode m into a PURB as described in §3.2, and return it.

In Exp1, A chooses a message m and queries O with it. The message m and a password p are the inputs to O, with p unknown to A. O generates a fresh nonce and key, derives the ephemeral secret, and creates a PURB as described in §3.2. Then, O flips a coin to decide whether to respond to A with the created PURB or with a random bit string of the same length. Upon receiving O’s response, A tries to guess the outcome of the coin flip.

The PURB is composed of four segments: two values generated at random by O, and the two AE ciphertexts. For A to have a non-negligible advantage in the experiment, it suffices to distinguish one of the segments from a random string or to find a correlation between any two of the segments. The first two segments are randomly drawn or generated by O, and the two AE segments are required to be ind$-cpa secure by design. The same nonce is used in both encryptions, but such reuse has been proven secure [9, Section 9] as long as the nonce is used with different keys. Thus, all the segments of the PURB are individually indistinguishable from random bits. A can only succeed by finding a correlation between the random segments and the entry point, or between the entry point and the payload, which would require recovering the underlying keys or the plaintext bits of the entry point; either would imply breaking the security properties of the primitives used. We conclude that the adversary’s advantage in the experiment is negligible.

A.3 Single Public Key, Single Suite

In Exp2, the adversary chooses a message m and also a public key (without knowledge of the corresponding private key) for which the PURB is to be encrypted, and queries the oracle with these two values. O is modified such that it creates a PURB according to §3.3. A attempts to distinguish the PURB from a random string, as in Exp1.

 

Exp2: A chooses a message m and a public key X and submits both to O. O flips a coin b; if b = 1, it responds with the PURB encoding of m for X (created as in §3.3), otherwise with a random bit string of the same length. A outputs a guess b' and wins if b' = b.

The encoded public key and the entry point essentially form a ciphertext of the IES (§2.2), which is proven ind$-cca secure [2], i.e., it has even stronger security properties than we require. The only difference between the PURB encoding and the original IES is that we use an AE scheme instead of the generic encrypt-then-MAC composition, but both approaches, if secure, provide the same security guarantees [16]. Moreover, the public-key hiding scheme ensures that the output of Hide is indistinguishable from random bits, and any random string maps to some group element. Hence, the advantage of A is negligible.

A.4 Multiple Public Keys, Single Suite

Figure 11: Exp3. A submits a message m and several public keys to O; O returns either the flat-layout multi-recipient PURB for those keys or a random bit string of the same length, and A guesses which.

Let O now be the oracle that constructs multi-recipient PURBs with the flat layout, i.e., without the hash-table structure. In Exp3 (see Figure 11), the adversary chooses several public keys, instead of one, and submits them to O along with a message m. O constructs a PURB in which the entry points for the recipients are placed sequentially after the encoded public key and followed by the encrypted payload. As before, O flips a coin to choose whether to respond with the PURB or with a random string.

Semantics. The only difference between the constructed PURB and a single-recipient PURB is that the public key is now combined with multiple entry points. This combination is essentially the multi-recipient IES, which has been proven ind$-cca secure [9] under the same assumptions as the single-recipient IES; hence, as the payload encryption is ind$-cpa secure, the encoding remains ind$-cpa secure.

Layout. Consider now that O constructs the PURB with the hash-table layout and fills the empty entries with random bits. We argue that this makes no advantageous difference for A: as the multi-recipient IES scheme [34, 9] does not impose any particular order on the symmetrically encrypted segments for its security, O is allowed to rearrange them without affecting indistinguishability. In particular, O can choose the order defined by the hash tables. Additionally, O needs to add random strings at the PURB positions that correspond to empty hash-table entries. But such random strings are indistinguishable from an entry point to A, as entry points are encrypted with an ind$-cpa secure scheme; hence, this gives A no additional information. Finally, we note that A is not able to learn where the hash tables end and the payload starts, as the encrypted header and the payload are indistinguishable from each other (both are indistinguishable from a random string).

Subset of Keys. In reality, the adversary might not know precisely for which keys a PURB is encrypted. To model this situation, the oracle could decide on a subset of the keys provided by the adversary for which to encrypt the PURB. For instance, O could flip a coin to decide whether a given public key is used for encryption, or otherwise fill the corresponding space with random bits. We argue that this game is harder for the adversary: the experiment where O picks a random subset of the public keys reduces to Exp3 if we give the adversary knowledge of the subset. Since A cannot win Exp3 with significant advantage, it does not have a significant advantage in this variant either.

Legitimate Recipients. So far, we have considered only an adversary who does not have a valid decryption key. In the multi-key scenario, the privacy of a recipient from the other recipients of the same PURB is also a security goal. We argue that an adversary A who is one of the PURB’s recipients learns only the total number of recipients, or an upper bound on this number, but not their identities. Indeed, if the AE scheme is ind$-cpa secure, A is unable to determine which other hash-table entries are actual encryptions and which are random-bit fillings, or to retrieve others’ ephemeral keys, even though A knows the plaintext and can try all byte ranges as potential ciphertexts. What A does know is the position of the payload, and hence the beginning and end positions of the hash tables. By the design of the hash-table structure, A is able to derive a lower and an upper bound on the number of recipients. When this is a concern, it can be addressed simply by adding dummy recipients.

We further argue that A cannot learn the identity of another recipient: even if A were able to retrieve some ephemeral secret from an encrypted entry point, A would not be able to obtain the pre-KDF value as long as the KDF is preimage-resistant. And even obtaining the pre-KDF value would not enable A to learn the identity of the recipient, due to the DDH assumption.

A.5 Multiple Public Keys and Suites

Figure 12: Exp4. A submits a message m and public keys from several cipher suites to O; O returns either the corresponding multi-suite PURB (in the simplified linear layout described below) or a random bit string of the same length, and A guesses which.

Consider an experiment where an adversary selects several different cipher suites (we do not restrict the exact number) and queries the oracle with public keys of these suites, along with a message m. We consider that O constructs a simplified version of the PURB structure presented in §3.5: the PURB is encoded without hash tables and without XORing of the encoded public keys. Instead, O places linearly an encoded public key, followed by all the related entry points (i.e., those of the same suite), followed by another suite’s public key and its entry points, and so on (see Figure 12). This variant helps A: it is equivalent to the real scheme presented in §3.5 when A is additionally provided with the structural information (all values are the same; only their layout is now better known to A). This simplified structure is a concatenation of multi-recipient IES ciphertexts. As each of these ciphertexts is ind$-cca secure, their concatenation is ind$-cca secure too (recall that reusing the same nonce with different keys is a secure practice [9, Section 9]).

Now, by the same argument as in Appendix A.4, requiring O to construct the PURB with the hash tables (instead of laying out the entry points linearly) does not affect the indistinguishability guarantee: (1) the entry points and the public key form a ciphertext of the IES scheme, which is ind$-cca secure [34, 9] and does not impose a particular order on the segments; hence, arranging the entry points in the hash-table layout does not affect security, and (2) the adversary is unable to distinguish the random fillings from the actual encrypted values (as the AE scheme is ind$-cpa secure).

The remaining difference is the XOR of publicly known positions yielding an encoded public key (instead of the public key being placed linearly in the header). Like a recipient, A can XOR the bytes located at the standardized positions and obtain “candidates” for the encoded public key of each suite. However, given our assumption on Hide, any bit string is a plausible candidate, and A cannot distinguish between a random-looking value and an actual encoded public key. Regardless, even with the certainty of having found an encoded public key, the adversary is unable to derive the shared secrets without the corresponding private key.

Legitimate Recipients. We argue that, in the case of a multi-recipient multi-suite PURB, the game is at least as hard for a legitimate-recipient adversary A as it is in Appendix A.4. If A were given the information about which hash-table entries are occupied by entry points of the suites A uses, which by random fillings, and which by other suites’ data, the experiment would fall back to the scenario of Appendix A.4, in which we have shown the scheme to be secure. Moreover, A is not even able to obtain this additional information, as, without valid decryption keys, A is unable to distinguish the values of other suites from random strings.

Appendix B Algorithms

The creation of a PURB is presented in two successive steps: Encode and Layout. In Encode, the sender derives all the values necessary for the PURB header (i.e., the shared secrets, ephemeral public keys, etc.). In Layout, the sender lays out these elements in a byte array.

Encode is decomposed into two algorithms: Encode with Passphrases (Algorithm 2) is used for encrypting data with long-term passphrases, and Encode for Public Keys (Algorithm 3) is used for public keys.

This division is made purely for simplicity, as intuitively Encode handles all cryptography, while Layout (Algorithm 4) consists purely of data-structure operations. Similarly, the separation between Encode with Passphrases and Encode for Public Keys is made for simplicity; our source code consists of a single algorithm that handles a combination of passphrases and public keys at the same time (this is visible in the Layout algorithm, which accepts entry points and cornerstones simultaneously).

B.1 Preliminaries

Let || denote string concatenation, and let x ←$ S denote that a value x is drawn uniformly at random from the set S. We denote by |value| the bit length of “value”.

Array notation. We denote by a = [] the fact that a is an empty array. We denote by a[start : end] = b the operation of copying the bits of b into a at the positions start through end. When written like this, b always has the correct length of (end − start) bits, and we assume start < end. If, before such an operation, a is shorter than end bits, we first grow a to length end. We sometimes write a[start] instead of a[start : start + 1].

In the Layout algorithm, we use a “reservation array”, an array with a method array.isFree(start, end) that returns True iff none of the bits in positions start through end were previously assigned a value, and False otherwise.
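A minimal sketch of such a reservation array in Go (the isFree name follows the text; grow and reserve are our own helpers, and we use half-open ranges [start, end) for simplicity):

```go
package main

import "fmt"

// reservation is a growable bitmap recording which positions of the future
// PURB byte array have already been assigned by the Layout algorithm.
type reservation struct{ used []bool }

func (r *reservation) grow(n int) {
    for len(r.used) < n {
        r.used = append(r.used, false)
    }
}

// isFree reports whether none of the positions in [start, end) were
// previously assigned a value.
func (r *reservation) isFree(start, end int) bool {
    r.grow(end)
    for i := start; i < end; i++ {
        if r.used[i] {
            return false
        }
    }
    return true
}

// reserve marks the positions in [start, end) as assigned.
func (r *reservation) reserve(start, end int) {
    r.grow(end)
    for i := start; i < end; i++ {
        r.used[i] = true
    }
}

func main() {
    var r reservation
    r.reserve(0, 32)              // e.g., an encoded public key
    fmt.Println(r.isFree(16, 48)) // false: overlaps the reservation
    fmt.Println(r.isFree(32, 64)) // true: still free
}
```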

b.2 Performance

Let s be the number of recipients using symmetric keys (passphrases) and r be the number of recipients using public keys.

The Encode with Passphrases algorithm runs in O(s), with PBKDF being the most expensive operation, while Encode for Public Keys runs in O(r), with point multiplication, performed twice (once in KeyGen and once to compute the argument of the KDF), being the most expensive operation.

The Layout algorithm runs in time bounded by a public constant (a typical value is given in Appendix C) times the worst-case cost of inserting the entry points into the hash tables; the amortized cost per entry point is lower.

Without showing an actual algorithm or proof here, creating a PURB with s passphrases and r public keys, and encoding it to bytes, is efficient; the precise cost can be derived from the source code of the full algorithm.

Input: Data, long-term secrets
Output: Nonce, EncodedKeys[], EntryPoints[], Payload
Nonce ←$ random bits;
Key ←$ random bits;
Payload ← AE.Enc(Key, Nonce, Data);
EntryPoints ← [];
for each long-term secret do