Obladi: Oblivious Serializable Transactions in the Cloud

09/27/2018
by   Natacha Crooks, et al.

This paper presents the design and implementation of Obladi, the first system to provide ACID transactions while also hiding access patterns. Obladi uses as its building block oblivious RAM, but turns the demands of supporting transactions into a performance opportunity. By executing transactions within epochs and delaying commit decisions until an epoch ends, Obladi reduces the amortized bandwidth costs of oblivious storage and increases overall system throughput. These performance gains, combined with new oblivious mechanisms for concurrency control and recovery, allow Obladi to execute OLTP workloads with reasonable throughput: it comes within 5x to 12x of a non-oblivious baseline on the TPC-C, SmallBank, and FreeHealth applications. Latency overheads, however, are higher (70x on TPC-C).


1 Introduction

This paper presents Obladi, the first cloud-based key value store that supports transactions while hiding access patterns from cloud providers. Obladi aims to mitigate the fundamental tension between the convenience of offloading data to the cloud and the significant privacy concerns that doing so creates. On the one hand, cloud services [3, 4, 47, 48, 61] offer clients scalable, reliable IT solutions and present application developers with feature-rich environments (transactional support, stronger consistency guarantees [22, 51], etc.). Medical practices, for instance, increasingly prefer to use cloud-based software to manage electronic health records (EHR) [17, 38]. On the other hand, many applications that could benefit from cloud services store personal data that can reveal sensitive information even when encrypted or anonymized [52, 73, 53, 82]. For example, charts accessed by oncologists can reveal not only whether a patient has cancer, but also, depending on the frequency of accesses (e.g., the frequency of chemotherapy appointments), indicate the cancer's type and severity. Similarly, travel agency websites have been suspected of increasing the price of frequently searched flights [82]. Hiding access patterns—that is, hiding not only the content of an object, but also when and how frequently it is accessed—is thus often desirable.

Responding to this challenge, the systems community has taken a fresh look at private data access. Recent solutions, whether based on private information retrieval [30, 2], Oblivious RAM [69, 43, 15], function sharing [82], or trusted hardware [24, 5, 7, 43, 80], show that it is possible to support complex SQL queries without revealing access patterns.

Obladi addresses a complementary issue: supporting ACID transactions while guaranteeing data access privacy. This combination raises unique challenges [5], as concurrency control mechanisms used to enforce isolation, and techniques used to enforce atomicity and durability, all make hiding access patterns more problematic (§3).

Obladi takes as its starting point Oblivious RAM, which provably hides all access patterns. Existing ORAM implementations, however, cannot support transactions. First, they are not fault-tolerant. For security and performance, they often store data in a client-side stash; durability requires the stash content to be recoverable after a failure, and preserving privacy demands hiding the stash's size and contents, even during failure recovery. Second, ORAM provides limited or no support for concurrency [12, 85, 74, 69], while transactional systems are expected to sustain highly concurrent loads.

Obladi demonstrates that the demands of supporting transactions can not only be met, but also turned into a performance opportunity. Its key insight is that transactions actually afford more flexibility than the single-value operations supported by previous ORAMs. For example, serializability [60] requires that the effects of transactions be reflected consistently in the state of the database only after they commit. Obladi leverages this flexibility to delay committing transactions until the end of fixed-size epochs, buffering their execution at a trusted proxy and enforcing consistency and durability only at epoch boundaries. This delay improves ORAM throughput without weakening privacy.

The ethos of delayed visibility is the core that drives Obladi's design. First, it allows Obladi to implement a multiversioned database atop a single-versioned ORAM, so that read operations proceed without blocking, as with other multiversioned databases [10], and intermediate writes are buffered locally: only the last value of any key modified during an epoch is written back to the ORAM. Delaying writes reduces the number of ORAM operations needed to commit a transaction, lowering amortized CPU and bandwidth costs without increasing contention: Obladi's concurrency control ensures that delaying commits does not affect the set of values that transactions executing within the same epoch can observe.

Second, it allows Obladi to securely parallelize Ring ORAM [68], the ORAM construction on which it builds. Obladi pipelines conflicting ORAM operations rather than processing them sequentially, as existing ORAM implementations do. This parallelization, however, is only secure if the write-back phase of the ORAM algorithm is delayed until pre-determined times, namely, epoch boundaries.

Finally, delaying visibility gives Obladi the ability to abort entire epochs in case of failure. Obladi leverages this flexibility, along with the near-deterministic write-back algorithm used by Ring ORAM, to drastically reduce the information that must be logged to guarantee durability and privacy-preserving crash recovery.

The results of a prototype implementation of Obladi are promising. On three applications (TPC-C [79], SmallBank [21], and FreeHealth [41], a real medical application), Obladi comes within 5×-12× of the throughput of non-private baselines. Latency is higher (70×), but remains reasonable (in the hundreds of milliseconds).

To summarize, this paper makes three contributions:

  1. It presents the design, implementation, and evaluation of the first ACID transactional system that also hides access patterns.

  2. It introduces an epoch-based design that leverages the flexibility of transactional workloads to increase overall system throughput and efficiently recover from failures.

  3. It provides the first formal security definition of a transactional, crash-prone, and private database. Obladi uses the UC-security framework [14], ensuring that security guarantees hold under concurrency and composition.

Obladi also has several limitations. First, like most ORAMs that regulate the interactions of multiple clients, it relies on a local centralized proxy, which introduces issues of fault-tolerance and scalability. Second, Obladi does not currently support range or complex SQL queries. Addressing the consistency challenge of maintaining oblivious indices [5, 24, 88] in the presence of transactions is a promising avenue for future work.

2 Threat and Failure Model

Obladi's threat and failure assumptions aim to model deployments similar to those of medical practices, where doctors and nurses access medical records through an on-site server, but choose to outsource the integrity and availability of those records to a cloud storage service [17, 38].

Threat Model. Obladi adopts a trusted proxy threat model [74, 69, 85]: it assumes multiple mutually-trusting client applications interacting with a single trusted proxy in a single shared administrative domain. The applications issue transactions and the proxy manages their execution, sending read and write requests on their behalf over an asynchronous and unreliable network to an untrusted storage server. This server is controlled by an honest-but-curious adversary that can observe and control the timing of communication to and from the proxy, but not the on-site communication between application clients and the proxy. We extend our threat model to a fully malicious adversary in Appendix A. We consider attacks that leak information by exploiting timing channel vulnerabilities in modern processors [42, 35, 13] to be out of scope. Obladi guarantees that the adversary will learn no information about: (i) the decision (commit/abort) of any ongoing transaction; (ii) the number of operations in an ongoing transaction; (iii) the type of requests issued to the server; and (iv) the actual data they access. Obladi does not seek to hide the type of application that is currently executing (e.g., OLTP vs. OLAP).

Failure Model. Obladi assumes cloud storage is reliable, but, unlike previous ORAMs, explicitly considers that both application clients and the proxy may fail. These failures should invalidate neither Obladi's privacy guarantees nor the Durability and Atomicity of transactions.

3 Towards Private Transactions

Many distributed, disk-based commercial database systems [57, 19, 8] separate concurrency control logic from storage management: SQL queries and transactional requests are regulated in a concurrency control unit and are subsequently converted to simple read-write accesses to key-value/file system storage. As ORAMs expose a read-write address space to users, a logical first attempt at implementing oblivious transactions would simply replace the database storage with an arbitrary ORAM. This black-box approach, however, raises both security concerns (§3.1) and performance/functionality issues (§3.2).

Security guarantees can be compromised by simply enforcing the ACID properties. Ensuring Atomicity, Isolation, and Durability imposes additional structure on the order of individual reads and writes, introducing sources of information leakage [71, 5] that do not exist in non-transactional ORAMs (§3.1). Performance and functionality, on the other hand, are hampered by the inability of current ORAMs to efficiently support highly concurrent loads and guarantee Durability.

3.1 Security for Isolation and Durability

The mechanisms used to guarantee Isolation, Atomicity, and Durability introduce timing correlations that directly leak information about the data accessed by ongoing transactions.

Concurrency Control. Pessimistic concurrency controls like two-phase locking [25] delay operations that would violate serializability: a write operation from one transaction cannot execute concurrently with any operation to the same object from another transaction. Such blocking can potentially reveal sensitive information about the data, even when executing on top of a construction that hides access patterns: a sudden drop in throughput could reveal the presence of a deadlock, of a write-heavy transaction blocking the progress of read transactions, or of highly contended items accessed by many concurrent transactions. More aggressive concurrency control schemes like timestamp ordering or multiversioned concurrency control [10, 66, 40, 1, 33, 65, 86] allow transactions to observe the result of the writes of other ongoing transactions. These schemes improve performance in contended workloads, but introduce the potential for cascading aborts: if a transaction aborts, all transactions that observed its writes must also abort. If a write-heavy transaction aborts, it may cause a large number of transactions to roll back, again revealing information about that transaction and, perhaps more problematically, about the set of objects that it accessed.

Failure Recovery. When recovering from failure, Durability requires preserving the effects of committed transactions, while Atomicity demands removing any changes caused by partially-executed transactions. Most commercial systems [58, 57, 49] preserve these properties through variants of undo and redo logging. To guarantee Durability, write and commit operations are written to a redo log that is replayed after a failure. To guarantee Atomicity, writes performed by partially-executed transactions are undone via an undo log, restoring objects to their last committed state. Unfortunately, this undo process can leak information: the number of undo operations reveals the existence of ongoing transactions, their length, and the number of operations that they performed.

3.2 Performance/functionality limitations

Current ORAMs align poorly with the needs of modern OLTP workloads, which must support large numbers of concurrent requests; in contrast, most ORAMs admit little to no concurrency [12, 85, 74, 69] (we benchmark the performance of sequential Ring ORAM in §11).

More problematically, ORAMs provide no support for fault-tolerance. Adding support for Durability presents two main challenges. First, most ORAMs require the use of a stash that temporarily buffers objects at the client and requires that these objects be written out to server storage in very specific ways (as we describe further in §4). This process aligns poorly with guaranteeing Durability for transactions. Consider for example a transaction that reads the version of an object x written by another transaction and then writes an object y. To recover the database to a consistent state, the update to x should be flushed to cloud storage before the update to y. It may however not be possible to securely flush x from the stash before y. Second, ORAMs store metadata at the client to ensure that cloud storage observes a request pattern that is independent of past and currently executing operations. As we show in §8, recovering this metadata after a failure can lead to duplicate accesses that leak information.

3.3 Introducing Obladi

These challenges motivate the need to co-design the transactional and recovery logic with the underlying ORAM data structure. The design should satisfy three goals: (i) security—the system should not leak access patterns; (ii) correctness—Obladi should guarantee that transactions are serializable; and (iii) performance—Obladi should scale with the number of clients. The principle of workload independence underpins Obladi's security: the sequence of requests sent to cloud storage should remain independent of the type, number, and access set of the transactions being executed. Intuitively, we want Obladi's sequence of accesses to cloud storage to be statistically indistinguishable from a sequence that could be generated by a simulator with no knowledge of the actual transactions being run by Obladi. If this condition holds, then observing Obladi's accesses cannot reveal to the adversary any information about Obladi's workload. We formalize this intuition in our security definition in §9.

Much of Obladi's novelty lies not in developing new concurrency control or recovery mechanisms, but in identifying which standard database techniques can be leveraged to lower the costs of ORAM while retaining security, and which techniques instead subtly break obliviousness.

To preserve workload independence while guaranteeing good performance in the presence of concurrent requests, Obladi centers its design around the notion of delayed visibility. Delayed visibility leverages the observation that, on the one hand, ACID consistency and Durability apply only when transactions commit, and, on the other, commit operations can be delayed. Obladi leverages this flexibility to delay commit operations until the end of fixed-size epochs. This approach allows Obladi to (i) amortize the cost of accessing an ORAM over many concurrently executing requests; (ii) recover efficiently from failures; and (iii) preserve workload independence: the epochs' deterministic structure allows Obladi to decouple its externally observable behavior from the specifics of the transactions being executed.

4 Background

Oblivious RAM (ORAM) is a cryptographic protocol that allows clients to access data outsourced to an untrusted server without revealing what is being accessed [28]; it generates a sequence of accesses to the server that is completely independent of the operations issued by the client. We focus specifically on tree-based ORAMs, whose constructions are more efficiently implementable in real systems: to date, they have been implemented in hardware [26, 45] and as the basis for blockchain ledgers [15] with reasonable overheads. Most tree-based ORAMs follow a similar structure: objects (usually key-value pairs) are mapped to a random leaf (or path) in a binary tree and physically reside (encrypted) in some tree node (or bucket) along that path. Objects are logically removed from the tree and remapped to a new random path when accessed. These objects are eventually flushed back to storage (according to their new path) as part of an eviction phase. Through careful scheduling, this write-back phase does not reveal the new location of the objects; objects that cannot be flushed are kept in a small client-side stash.

Ring ORAM. Obladi builds upon Ring ORAM [68], a tree-based ORAM with two appealing properties: a constant stash size and a fully deterministic eviction phase. Obladi leverages these features for efficient failure recovery.

As shown in Figure 1, server storage in Ring ORAM consists of a binary tree of buckets, each with a fixed number of slots. Of these, Z are reserved for storing actual encrypted data (real objects); the remaining S exclusively store dummy objects. Dummy objects are blocks of encrypted but meaningless data that appear indistinguishable from real objects; their presence in each bucket prevents the server from learning how many real objects the bucket contains and which slots contain them. A random permutation (stored at the client) determines the location of dummy slots. In Figure 1, the root bucket contains a real slot followed by two dummy slots; the real slot contains a data object. Its left child bucket instead contains dummy slots in positions one and three, and an empty real slot in the second position.

Client storage, on the other hand, is limited to (i) a constant-sized stash, which temporarily buffers objects that have yet to be replaced into the tree and, unlike a simple cache, is essential to Ring ORAM's security guarantees; (ii) the set of current permutations, which identify the role of each slot in each bucket and record which slots have already been accessed (and marked invalid); and (iii) a position map, which records the random leaf (or path) associated with every data object. In Ring ORAM, objects are mapped to individual leaves of the tree but can be placed in any one of the buckets along the path from the root to that leaf. For instance, one object in Figure 1 is mapped to path 4 but stored in the root bucket, while another is mapped to path 2 and stored in the leaf bucket of that path.
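To make this bookkeeping concrete, the Java sketch below models the client-side state just listed: a position map, per-bucket permutation metadata with a valid/invalid bitmap, and a stash. All class and field names are illustrative placeholders, not part of Ring ORAM's or Obladi's actual interface.

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Illustrative client-side metadata for a tree-based ORAM such as Ring ORAM.
final class OramClientState {
    // Position map: each logical key is assigned to a uniformly random leaf (path).
    final Map<String, Integer> positionMap = new HashMap<>();

    // Per-bucket layout: which physical slot plays which role, and which slots
    // have already been read (and are therefore invalid) since the last reshuffle.
    static final class BucketMeta {
        int[] slotPermutation;   // random permutation over the Z + S slots
        boolean[] valid;         // false once a slot has been read
        int realSlots;           // Z: slots that may hold real (encrypted) objects
    }
    final Map<Integer, BucketMeta> bucketMeta = new HashMap<>();

    // Stash: objects fetched (and remapped) but not yet evicted back to the server.
    final Map<String, byte[]> stash = new HashMap<>();

    private final SecureRandom rng = new SecureRandom();
    private final int numLeaves;

    OramClientState(int numLeaves) { this.numLeaves = numLeaves; }

    // After every access, the object is remapped to a fresh uniformly random leaf.
    int remap(String key) {
        int newLeaf = rng.nextInt(numLeaves);
        positionMap.put(key, newLeaf);
        return newLeaf;
    }
}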


Figure 1: Ring ORAM - Read (Z=1, S=2)

Ring ORAM maintains two core invariants. First, each data object is mapped to a new leaf chosen uniformly at random after every access, and is stored either in the stash, or in a bucket on the path from the tree’s root to that leaf (path invariant). Second, the physical positions of the dummy and real objects in each bucket are randomly permuted with respect to all past and future writes to that bucket (i.e., no slot can be accessed more than once between permutations) (bucket invariant). The server never learns whether the client accesses a real or a dummy object in the bucket, so the exact position of the object along that path is never revealed.

Intuitively, the path invariant removes any correlation between two accesses to the same object (each access will access independent random paths), while the bucket invariant prevents the server from learning when an object was last accessed (the server cannot distinguish an access to a real slot from a dummy slot). Together, these invariants ensure that, regardless of the data or type of operation, all access patterns will look indistinguishable from a random set of leaves and slots in buckets.

Access Phase. The procedure for read and write requests is identical. To access an object o, the client first looks up o's path in the position map, and then reads one object from each bucket along that path: it reads o from the bucket in which it resides and a valid dummy object from every other bucket, identified using its local permutation map. Finally, o is remapped to a new path, updated to a new value (if the request was a write), and added to the stash; importantly, o is not immediately written back out to cloud storage.

Figure 1 illustrates the steps involved in reading an object o, initially mapped to path 2. The client reads a dummy object from the first two buckets in the path (at slots two and three respectively), and reads o from the first slot of the bottom bucket. The three slots accessed by the client are then marked as invalid in their respective buckets, and o is remapped to path 1. To write a new object, the client would have to read three valid dummy objects from a random path, place the object in the stash, and remap it to a new path.
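A minimal Java sketch of this access procedure follows; the Server and ClientMeta interfaces are assumed stand-ins for cloud storage and the client metadata described above, and encryption is omitted.

import java.util.HashMap;
import java.util.Map;

// Sketch of a single Ring ORAM access: one slot is read from every bucket on the path.
final class RingOramAccessSketch {
    interface Server { byte[] readSlot(int bucket, int slot); }
    interface ClientMeta {
        int pathOf(String key);                  // position map lookup
        int[] bucketsOnPath(int leaf);           // bucket ids from root to leaf
        int realSlotOf(int bucket, String key);  // -1 if the key is not in this bucket
        int freshDummySlot(int bucket);          // a still-valid dummy slot
        void markInvalid(int bucket, int slot);  // a slot is never read twice
        void remapToRandomLeaf(String key);      // path invariant
    }

    private final Server server;
    private final ClientMeta meta;
    private final Map<String, byte[]> stash = new HashMap<>();

    RingOramAccessSketch(Server server, ClientMeta meta) {
        this.server = server;
        this.meta = meta;
    }

    /** Reads (or overwrites) a key obliviously. */
    byte[] access(String key, byte[] newValueOrNull) {
        byte[] value = stash.get(key);            // the object may already be buffered
        int leaf = meta.pathOf(key);
        for (int bucket : meta.bucketsOnPath(leaf)) {
            int realSlot = meta.realSlotOf(bucket, key);
            int slot = (realSlot >= 0) ? realSlot : meta.freshDummySlot(bucket);
            byte[] block = server.readSlot(bucket, slot);   // decryption omitted
            meta.markInvalid(bucket, slot);
            if (realSlot >= 0) value = block;               // found the real object
        }
        meta.remapToRandomLeaf(key);                        // fresh uniformly random path
        stash.put(key, newValueOrNull != null ? newValueOrNull : value);
        return value;
    }
}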

Access Security. Remapping objects to independent random paths prevents the server from detecting repeated accesses to data, while placing objects in the stash prevents the server from learning the new path. Marking read slots as invalid forces every bucket access to read from a distinct slot (each selected according to the random permutation). The server consequently observes uniformly distributed accesses (without repetition), independently of the contents of the bucket. This lack of correlation, combined with the inability to distinguish real slots from dummy slots, ensures that the server does not learn if or when a real object is accessed. Accessing dummy slots from buckets not containing the target object (rather than real slots), on the other hand, is necessary for efficiency: in combination with Ring ORAM's eviction phase (discussed next), it lets the stash size remain constant by preventing multiple real objects from being added to the stash on a single access.


Figure 2: Eviction - Read Phase

Eviction Phase and Reshuffling. The aforementioned protocol falls short in two ways. First, if objects are placed in the stash after each access, the stash will grow unbounded. Second, all slots will eventually be marked as invalid. Ring ORAM sidesteps these issues through two complementary processes: eviction and bucket reshuffling. Every A accesses, the evict path operation evicts objects from the client stash to cloud storage. It deterministically selects a target path, flushes as much data as possible, and permutes each bucket in the path, revalidating any invalid slots. Evict path consists of a read and a write phase. In the read phase, it retrieves objects from each bucket in the path: all remaining valid real objects, plus enough valid dummies to reach a total of Z objects read. In the write phase, it places each stashed object—including those read by the read phase—in the deepest bucket on the target path that intersects with the object's assigned path. Evict path then permutes the real and dummy values in each bucket along the target path, marking their slots as valid, and writes their contents to server storage. Figures 2 and 3 show the evict path procedure applied to path 4. In the read phase, evict path reads the unread object from the root node and dummies from the other buckets on the path. In the write phase (Figure 3), the object is flushed to leaf 4, as its path intersects completely with the target path. Finally, we note that randomness may cause a bucket to contain only invalid slots before its path is evicted, rendering it effectively inaccessible. When this happens, Ring ORAM restores access to the bucket by performing an early reshuffle operation that executes the read phase and write phase of evict path only for the target bucket.
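The sketch below illustrates the two phases of evict path under the same assumed interfaces; the deterministic target-path schedule, re-encryption, and dummy padding are abstracted away, so it illustrates the structure rather than a faithful implementation.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of evict path, split into its read and write phases.
final class EvictPathSketch {
    interface Server {
        Map<String, byte[]> readRemainingValid(int bucket);     // read phase (padded with dummies)
        void writeBucket(int bucket, Map<String, byte[]> real); // write phase (re-encrypted, permuted)
    }
    interface ClientMeta {
        int nextTargetPath();                     // deterministic schedule: one eviction every A accesses
        int[] bucketsOnPath(int leaf);            // bucket ids from root to leaf
        boolean canPlace(String key, int bucket); // does the key's assigned path cross this bucket?
        void revalidateAndPermute(int bucket);    // fresh permutation, all slots valid again
    }

    private final Server server;
    private final ClientMeta meta;
    private final Map<String, byte[]> stash;
    private final int realSlotsPerBucket;         // Z

    EvictPathSketch(Server server, ClientMeta meta, Map<String, byte[]> stash, int z) {
        this.server = server;
        this.meta = meta;
        this.stash = stash;
        this.realSlotsPerBucket = z;
    }

    void evictPath() {
        int target = meta.nextTargetPath();
        int[] path = meta.bucketsOnPath(target);

        // Read phase: pull the remaining valid real objects from every bucket
        // on the target path into the stash.
        for (int bucket : path) {
            stash.putAll(server.readRemainingValid(bucket));
        }

        // Write phase: place each stashed object into the deepest bucket on the
        // target path that also lies on the object's own path, then re-permute
        // and write back every bucket (dummy slots are refilled server-side).
        for (int i = path.length - 1; i >= 0; i--) {
            int bucket = path[i];
            Map<String, byte[]> content = new HashMap<>();
            Iterator<Map.Entry<String, byte[]>> it = stash.entrySet().iterator();
            while (it.hasNext() && content.size() < realSlotsPerBucket) {
                Map.Entry<String, byte[]> e = it.next();
                if (meta.canPlace(e.getKey(), bucket)) {
                    content.put(e.getKey(), e.getValue());
                    it.remove();
                }
            }
            meta.revalidateAndPermute(bucket);
            server.writeBucket(bucket, content);
        }
    }
}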

Eviction Security. The read phase leaks no information about the contents of a given bucket. It systematically reads exactly Z valid objects from the bucket, selecting the valid real objects in the bucket and padding the remaining required reads with a random subset of the dummy blocks. The random permutation and randomized encryption ensure that the server learns no information about how many real objects exist, and how many have been accessed. Similarly, the write phase hides the values and locations of objects written. At every bucket, the storage server observes only a newly encrypted and permuted set of objects, eliminating any correlation between past and future accesses to that bucket. Together, the read and write phases ensure that no slot is accessed more than once between reshuffles, guaranteeing the bucket invariant.


Figure 3: Eviction - Write Phase

Similarly, the eviction process leaks no information about the paths of the newly evicted objects: since all paths intersect at the root and the server cannot infer the contents of any individual bucket, any object in the stash may be flushed during any evict path.

5 System Architecture

Obladi, like most privacy-preserving systems [75, 85, 69], consists of a centralized trusted component, the proxy, that communicates with a fault-tolerant but untrusted entity, cloud storage (Figure 4). The proxy handles concurrency control, while the untrusted cloud storage stores the private data. Obladi ensures that requests made by the proxy to the cloud storage over the untrusted network do not leak information. We assume that the proxy can crash and that, when it does so, its state is lost. This two-tier design allows applications to run a lightweight proxy locally and delegate the complexity of fault-tolerance to cloud storage.

The proxy has two components: (i) a concurrency control unit and (ii) a data manager comprising a batch manager and an ORAM executor. The batch manager periodically schedules fixed-size batches of client operations that the ORAM executor then executes on a parallel version of Ring ORAM's algorithm. The executor accesses one of two units located on server storage: the ORAM tree, which stores the actual data blocks of the ORAM, and the recovery unit, which logs all non-deterministic accesses to the ORAM to a write-ahead log [50] to enable secure failure recovery (§8).

6 Proxy Design

The proxy in Obladi has three goals: guarantee good performance, preserve correctness, and guarantee security. To meet these goals, Obladi designs the proxy around the concept of epochs. The proxy partitions time into a set of fixed-length, non-overlapping epochs. Epochs are the granularity at which Obladi guarantees durability and consistency. Each transaction, upon arriving at the proxy, is assigned to an epoch, and clients are notified of whether a transaction has committed only when the epoch ends. Until then, Obladi buffers all updates at the proxy.

This flexibility boosts performance in two ways. First, it allows Obladi to implement a multiversioned concurrency control (MVCC) algorithm on top of a single-versioned Ring ORAM. MVCC algorithms can significantly improve throughput by allowing read operations to proceed with limited blocking. These performance gains are especially significant in the presence of long-running transactions or high storage access latency, as is often the case for cloud storage systems. Second, it reduces traffic to the ORAM, as only the database state at the end of the epoch needs to be written out to cloud storage.


Figure 4: System Architecture

Importantly, Obladi's choice to enforce consistency and durability only at epoch boundaries does not affect correctness; transactions continue to observe a serializable and recoverable schedule (i.e., committed transactions do not see writes from aborted transactions).

For transactions executing concurrently within the same epoch, serializability is guaranteed by concurrency control; transactions from different epochs are naturally serialized by the order in which the proxy executes their epochs. No transaction can span multiple epochs; unfinished transactions at epoch boundaries are aborted, so that no transaction is ongoing during epoch changes.

Durability is instead achieved by enforcing epoch fate-sharing [81] during proxy or client crashes: Obladi guarantees that either all completed transactions (i.e., transactions for which a commit request has been received) in the epoch are made durable or all transactions abort. This way, no committed transaction can ever observe non-durable writes.

Finally, the deterministic pattern of execution that epochs impose drastically simplifies the task of guaranteeing workload independence: as we describe further below, the frequency and timing at which requests are sent to untrusted storage are fixed and consequently independent of the workload.

The proxy processes epochs with two modules: the concurrency control unit (CCU) ensures that execution remains serializable, while the data handler (DH) accesses the actual data objects. We describe each in turn.

6.1 Concurrency Control

Obladi, like many existing commercial databases [56, 64], uses multiversioned concurrency control [10]. It specifically chooses multiversioned timestamp ordering (MVTSO) [10, 67] because it allows uncommitted writes to be immediately visible to concurrently executing transactions. To ensure serializability, transactions log the set of transactions whose uncommitted values they have observed (their write-read dependencies) and abort if any of their dependencies fail to commit. This optimistic approach is critical to Obladi's performance: it allows transactions within the same epoch to see each other's effects even as Obladi delays commits until the epoch ends. In contrast, a pessimistic protocol like two-phase locking [25], which precludes transactions from observing uncommitted writes, would artificially increase contention by holding exclusive write-locks for the duration of an epoch. When a transaction starts, MVTSO assigns it a unique timestamp that determines its serialization order. A write operation creates a new object version marked with its transaction's timestamp and inserts it in the version chain associated with that object. A read operation returns the object's latest version with a timestamp smaller than its transaction's timestamp. Read operations further update a read marker on the object's version chain with their transaction's timestamp. Any write operation with a smaller timestamp that subsequently tries to write to this object is aborted, ensuring that no read operation ever fails to observe a write from a transaction that should have preceded it in the serialization order.
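The following Java sketch illustrates the MVTSO bookkeeping just described (version chains ordered by timestamp, read markers, and write-read dependency tracking); it is a simplified illustration, not Obladi's concurrency control unit, and its names are assumed.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative MVTSO core: per-key version chains, read markers, dependency tracking.
final class MvtsoSketch {
    static final class Version {
        final long writerTs; final byte[] value;
        Version(long ts, byte[] v) { writerTs = ts; value = v; }
    }
    static final class Chain {
        final TreeMap<Long, Version> versions = new TreeMap<>(); // timestamp -> version
        long readMarker = Long.MIN_VALUE;                        // highest reader timestamp
    }

    private final Map<String, Chain> store = new ConcurrentHashMap<>();
    // writer timestamp -> transactions that observed its (possibly uncommitted) value
    private final Map<Long, Set<Long>> dependents = new ConcurrentHashMap<>();

    /** Returns the latest version with a timestamp below the reader's. */
    byte[] read(long readerTs, String key) {
        Chain c = store.computeIfAbsent(key, k -> new Chain());
        synchronized (c) {
            c.readMarker = Math.max(c.readMarker, readerTs);
            Map.Entry<Long, Version> e = c.versions.lowerEntry(readerTs);
            if (e == null) return null;
            // record a write-read dependency on the writer
            dependents.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(readerTs);
            return e.getValue().value;
        }
    }

    /** Installs a new version; returns false (abort) if a later reader already exists. */
    boolean write(long writerTs, String key, byte[] value) {
        Chain c = store.computeIfAbsent(key, k -> new Chain());
        synchronized (c) {
            if (writerTs < c.readMarker) return false;   // would invalidate a past read
            c.versions.put(writerTs, new Version(writerTs, value));
            return true;
        }
    }

    /** If a writer aborts, every transaction that observed its versions must abort too. */
    Set<Long> cascadingAborts(long abortedWriterTs) {
        return dependents.getOrDefault(abortedWriterTs, Set.of());
    }
}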

Consider for example the set of transactions executing in Figure 5. One transaction's update to an object is immediately observed by a second transaction that reads it; the reader becomes dependent on the writer and can only commit once the writer also commits. In contrast, another transaction's write to an object causes it to abort: a transaction with a higher timestamp had already read an earlier version of that object, setting the version's read marker to that higher timestamp.


Figure 5: Batching Logic. Read annotations denote that a transaction reads the version of an object written by another transaction.

6.2 Data Handler

Once a version is selected for reading or writing, the DH becomes responsible for accessing or modifying the actual object. Whereas it suffices to guarantee durability and consistency only at epoch boundaries, security must hold at all times, posing two key challenges. First, the number of requests executed in parallel can leak information, e.g., data dependencies within the same transaction [11, 69]. Second, transactions may abort (§6.1), requiring their effects to be rolled back without revealing the existence of contended objects [5, 71]. To decouple the demands of these workloads from the timing and set of requests that it forwards to cloud storage, Obladi leverages the following observation: transactions can always be re-organized so that all reads from cloud storage execute before all writes [37, 87, 46, 19]. Indeed, while operations within a transaction may depend on the data returned by a read from cloud storage, no operation depends on the execution of a write. Accordingly, Obladi organizes the DH into a read phase and a write phase: it first reads all necessary objects from cloud storage before applying all writes.

Read Phase. Obladi splits each epoch's read phase into a fixed number of fixed-size read batches that are forwarded to the ORAM executor at fixed intervals. This deterministic structure allows Obladi to execute dependent read operations without revealing the internal control flow of the epoch's transactions. Read operations are assigned to the epoch's next unfilled read batch. If no such batch exists, the transaction is aborted. Conversely, before a batch is forwarded to the ORAM executor, all remaining empty slots are padded with dummy requests. Obladi further deduplicates read operations that access the same key. As we describe in §7, this step is necessary for security since parallelized batches may leak information unless requests all access distinct keys [12, 85]. Deduplicating requests also benefits performance by increasing the number of operations that can be served within a fixed-size batch.
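A small Java sketch of this batching logic, with assumed names: reads are deduplicated, assigned to the next unfilled fixed-size batch (aborting the transaction if none remains), and each batch is padded with dummies before being sealed.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative fixed-size read-batch scheduler.
final class ReadBatcherSketch {
    static final String DUMMY = "__dummy__";          // assumed dummy-request marker

    private final int batchesPerEpoch;
    private final int batchSize;
    private final List<Set<String>> batches = new ArrayList<>();

    ReadBatcherSketch(int batchesPerEpoch, int batchSize) {
        this.batchesPerEpoch = batchesPerEpoch;
        this.batchSize = batchSize;
        for (int i = 0; i < batchesPerEpoch; i++) batches.add(new LinkedHashSet<>());
    }

    /** Assigns a read to the next unfilled batch; false means "abort the transaction". */
    boolean schedule(String key, int earliestBatch) {
        for (int i = earliestBatch; i < batchesPerEpoch; i++) {
            Set<String> b = batches.get(i);
            if (b.contains(key)) return true;          // deduplicated: already scheduled
            if (b.size() < batchSize) { b.add(key); return true; }
        }
        return false;                                  // no room left in this epoch
    }

    /** Pads a batch to its fixed size so its length is workload independent. */
    List<String> seal(int batchIndex) {
        List<String> out = new ArrayList<>(batches.get(batchIndex));
        while (out.size() < batchSize) out.add(DUMMY);
        return out;
    }
}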

Write Phase. While transactions execute, Obladi buffers their write operations in a version cache that maintains all object versions created by transactions in the epoch. At the end of an epoch, transactions that have yet to finish executing (recall that epochs terminate at fixed intervals) are aborted and their operations are removed. The latest version of each object in the version cache, according to its version chain, is then aggregated into a fixed-size write batch that is forwarded to the ORAM executor, with additional padding if necessary.

This entire process, including write buffering and deduplication, must not violate serializability. The DH guarantees that write buffering respects serializability by directly serving reads from the version cache for objects modified in the current epoch. It guarantees serializability in the presence of duplicate requests by only including the last write of the version chain in a write batch. Since Obladi's epoch-based design guarantees that transactions from a later epoch are serialized after all transactions from an earlier epoch, intermediate object versions can be safely discarded. In this context, MVTSO's requirement that transactions observe the latest committed write in the serialization order reduces to transactions reading the tail of the previous epoch's version chain.
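The following sketch illustrates the end-of-epoch aggregation: only the newest surviving version of each key leaves the version cache, aborted writes are dropped, and the batch is padded to its fixed size. Class and field names are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative write-batch aggregation from an epoch's version cache.
final class WriteBatcherSketch {
    static final class BufferedWrite {
        final String key; final long writerTs; final byte[] value; final boolean committed;
        BufferedWrite(String key, long ts, byte[] v, boolean committed) {
            this.key = key; this.writerTs = ts; this.value = v; this.committed = committed;
        }
    }

    /** Aggregates an epoch's buffered writes into one fixed-size write batch. */
    static List<BufferedWrite> seal(List<BufferedWrite> versionCache, int writeBatchSize) {
        Map<String, BufferedWrite> latest = new HashMap<>();
        for (BufferedWrite w : versionCache) {
            if (!w.committed) continue;                       // aborted: never flushed
            BufferedWrite prev = latest.get(w.key);
            if (prev == null || w.writerTs > prev.writerTs) latest.put(w.key, w);
        }
        List<BufferedWrite> batch = new ArrayList<>(latest.values());
        while (batch.size() < writeBatchSize)                  // pad with dummy writes
            batch.add(new BufferedWrite("__dummy__", Long.MIN_VALUE, new byte[0], true));
        return batch;
    }
}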

In the presence of failures, Obladi guarantees serializability and recoverability by enforcing epoch fate sharing: either all transactions in an epoch are made durable or none are. If a failure arises during epoch e, the system simply recovers to epoch e-1, aborting all transactions in epoch e. Once again, this flexibility arises from delaying commit notifications until epoch boundaries.

Example Execution. We illustrate the batching logic once again with the help of Figure 5. The transactions first execute read operations, which are aggregated into the first read batch of the epoch. The values returned by these reads are cached in the version cache. One transaction then executes a write operation, which is also buffered in the version cache. A subsequent read of that object is served directly from the version cache (we discuss the security of this step in the next section), as is a read of another buffered, uncommitted version. In contrast, a read whose object is not present in the version cache is scheduled as part of the next read batch. That read batch is then padded to its fixed size and executed. One transaction contains no read operations: its write operations are simply executed and buffered at the version cache. Obladi then finalizes the epoch by aborting all transactions (and their dependencies) that have not yet finished executing. Finally, Obladi aggregates the last version of every updated object into the write batch (skipping intermediate versions), before notifying clients of the commit decision.

6.3 Reducing Work

Obladi reduces work in two additional ways: it caches reads within an epoch and allows Ring ORAM to execute write operations without also executing dummy queries. While these optimizations may appear straightforward, ensuring that they maintain workload independence requires care.

Caching Reads. Ring ORAM maintains a client-side stash (§4) that stores ORAM blocks until their eviction to cloud storage. Importantly, a request for a block present in the stash still triggers a dummy request: a dummy object is still retrieved from each bucket along its path. While this access may appear redundant at first, it is in fact necessary to preserve workload independence: removing it removes the guarantee that the set of paths that Obladi requests from cloud storage is uniformly distributed. In particular, blocks present in the stash are more likely to be mapped to paths farther away from the one visited by the last evict path, as they correspond to paths that could not be flushed: buckets have limited space for real blocks, and blocks mapped to paths that only intersect near the top of the tree are less likely to find a free slot to which they can be flushed. The degree to which this effect skews the distribution leaks information about the stash size and, consequently, about the workload. To illustrate, consider the execution in Figure 6. Objects mapped to paths 1 and 2 were not flushed from the stash in the previous eviction of path 4. When these objects are subsequently accessed, naively reading them from the stash without performing dummy reads skews the set of paths accessed toward the right subtree (paths 3 and 4).

Obladi securely avoids some of this redundant work by drawing a novel distinction between objects that are in the stash as a result of a logical access and those present because they could not be evicted. The former can be safely accessed without performing a dummy read, while the latter cannot. Objects present in the stash following a logical access are mapped to independently and uniformly distributed paths. Ring ORAM's path invariant ensures that, without caching, the set of accessed paths is uniformly distributed; removing an independent uniform subset of those paths (namely, the dummy requests) will consequently not change the distribution. Thus, caching these objects, and filling out a read batch with other real or dummy requests, preserves the uniform distribution of paths and leaks no information. Obladi consequently allows all read objects to be placed in the version cache for the duration of the epoch. The objects read in Figure 5 are, for instance, placed in the version cache, allowing later reads to be served directly from the cache. In contrast, objects present in the stash because they could not be evicted are mapped to paths that skew away from the latest evict path. Caching these objects would consequently skew the distribution of requests sent to storage away from a uniform distribution, as illustrated in Figure 6.
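A minimal sketch of this distinction, with hypothetical names: only objects whose presence in the stash stems from a logical access may be served from the cache without a physical read.

import java.util.HashMap;
import java.util.Map;

// Illustrative tracking of why an object sits in the stash.
final class StashOriginSketch {
    enum Origin { LOGICAL_ACCESS, EVICTION_LEFTOVER }

    private final Map<String, Origin> origin = new HashMap<>();

    void recordAccess(String key)   { origin.put(key, Origin.LOGICAL_ACCESS); }
    void recordLeftover(String key) { origin.put(key, Origin.EVICTION_LEFTOVER); }

    /** True if a read of this key can skip the ORAM without skewing the path distribution. */
    boolean canServeFromCache(String key) {
        return origin.get(key) == Origin.LOGICAL_ACCESS;
    }
}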


Figure 6: Skew introduced by caching arbitrary objects

Dummiless Writes. Ring ORAM must hide whether requests correspond to read or write operations, as the specific pattern in which these operations are interleaved can leak information [88]; that is why Ring ORAM executes a read operation on the ORAM for every access. In contrast, since transactions can always perform all reads before all writes, no information is leaked by informing the storage server that each epoch consists of a fixed-size sequence of potentially dummy reads followed by a fixed-size sequence of potentially dummy writes. Obladi thus modifies Ring ORAM's algorithm to directly place the new version of an object in the stash, without executing the corresponding read. Note, though, that Obladi continues to increment the evict path count on write operations, a necessary step to preserve the bounds on the stash size, which is important for durability (§8).
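The sketch below illustrates a dummiless write under assumed interfaces: the new value goes straight into the stash, no path is read, and the access counter that schedules evict path still advances.

import java.util.HashMap;
import java.util.Map;

// Illustrative dummiless write: no physical read, but eviction frequency is preserved.
final class DummilessWriteSketch {
    interface ClientMeta { void remapToRandomLeaf(String key); }
    interface Evictor { void evictPath(); }

    private final Map<String, byte[]> stash = new HashMap<>();
    private final ClientMeta meta;
    private final Evictor evictor;
    private final int evictEvery;      // Ring ORAM's A parameter
    private int accessesSinceEviction = 0;

    DummilessWriteSketch(ClientMeta meta, Evictor evictor, int evictEvery) {
        this.meta = meta; this.evictor = evictor; this.evictEvery = evictEvery;
    }

    void write(String key, byte[] value) {
        meta.remapToRandomLeaf(key);     // same remapping as a normal access
        stash.put(key, value);           // no physical read issued for this write
        if (++accessesSinceEviction == evictEvery) {
            accessesSinceEviction = 0;
            evictor.evictPath();         // counter still advances, so the stash stays bounded
        }
    }
}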

6.4 Configuring Obladi

Obladi's good performance hinges on appropriately configuring the size and frequency of batches and the ORAM tree for a target application. Table 1 summarizes the parameter space.

Ring ORAM. Configuring Ring ORAM first requires choosing an appropriate value of Z. Larger values of Z reduce the total size of the ORAM on cloud storage by decreasing the required height of the ORAM tree, and decrease eviction frequency (reducing network/CPU overhead). In contrast, they increase the maximum stash size. Traditional ORAMs thus choose the largest value of Z for which the stash fits on the proxy. Obladi adds an additional consideration: for durability (as we describe in §8), the stash must be synchronously written out every epoch, so one must also account for the throughput loss associated with the stash write-back time. Given an appropriate value of Z, Obladi then chooses L, S, and A according to the analytical model proposed in [68].

N: Number of real objects
Z: Number of real slots per bucket
S: Number of dummy slots per bucket
A: Frequency of evict path (one eviction every A accesses)
L: Number of levels in the ORAM tree
Number of read batches per epoch
Size of a read batch
Size of a write batch
Batch frequency
Table 1: Obladi's configuration parameters

Epochs and batching. Identifying the appropriate size and number of batches hinges on several considerations. First, Obladi must provision sufficiently many read batches to handle control-flow dependencies within transactions. A transaction that executes five dependent read operations in sequence will, for instance, require five read batches to execute (it will otherwise repeatedly abort). Second, the ratio of read slots to write slots must closely approximate the application's read/write ratio. An overly large write batch will waste resources, as it will be padded with many dummy requests; a write batch that is too small will lead to frequent aborts caused by the batch filling up. Third, the size of a read or write batch defines the degree of parallelism that can be extracted. The desired batch size is thus a function of the concurrent load of the system, but also of hardware considerations, as increasing parallelism beyond an I/O or CPU bottleneck serves no purpose. Finally, the number and frequency of read batches within an epoch increase overall latency, but reduce amortized resource costs through caching and operation pipelining (introduced in §7). Latency-sensitive applications may favor smaller batch sizes, while others may prefer longer epochs with lower overheads.
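For concreteness, the parameters of Table 1 can be collected into a single configuration object, as in the illustrative Java sketch below; the Ring ORAM symbols follow the text, the batching field names are assumed, and the numbers in the example are arbitrary placeholders rather than recommended settings.

// Illustrative configuration holder for the parameters in Table 1.
record ObliviousStoreConfig(
        long numObjects,        // N: number of real objects
        int realSlots,          // Z: real slots per bucket
        int dummySlots,         // S: dummy slots per bucket
        int evictFrequency,     // A: one evict path every A accesses
        int treeLevels,         // L: levels in the ORAM tree
        int readBatchesPerEpoch,
        int readBatchSize,
        int writeBatchSize,
        long batchIntervalMillis) {

    // Example of an OLTP-style configuration (larger batches, few read rounds);
    // the values are placeholders, not tuned settings.
    static ObliviousStoreConfig exampleOltp() {
        return new ObliviousStoreConfig(1 << 20, 4, 6, 3, 20, 2, 500, 500, 50);
    }
}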

Security Considerations. Obladi does not attempt to hide the size and frequency of batches from the storage server (we formalize this leakage in §9). Carefully tuning the size and frequency of batches to best match a given application may thus leak information about the application itself. An OLTP application, for instance, will likely have larger batch sizes but fewer read batches, as OLTP applications sustain a high concurrent load of fairly short transactions. OLAP applications will prefer small or non-existent write batches, as they are predominantly read-only, but require many read batches to support the complex joins and aggregates that they implement. Obladi does not attempt to hide the type of application that is being run. It does, however, continue to hide what data is being accessed and what transactions are currently being run at any given point in time. While Obladi's configuration parameters may, for instance, suggest that a medical application like FreeHealth is being run, they do not in any way leak information about how, when, or which patient records are being accessed.

7 Parallelizing the ORAM

Existing ORAM constructions make limited use of parallelism. Some allow requests to execute concurrently between eviction or shuffle phases [12, 85, 69], while others target intra-request parallelism to speed up execution of a single request [43]. Obladi explicitly targets both forms of parallelism. Parallelizing Ring ORAM presents three challenges: preserving the correct abstraction of a sequential datastore, enforcing security by concealing the position of real blocks in the ORAM (thereby maintaining workload independence), and preserving existing bounds on the stash size. While these issues also arise in prior work [69], the idiosyncrasies of Ring ORAM add new dimensions to these challenges.

Correctness. Obladi makes two observations. First, while all operations conflict at the Ring ORAM tree's root, they can be split into suboperations that access mostly disjoint buckets (§4). Second, conflicting bucket operations can be further parallelized by distinguishing accesses to the bucket's metadata from those to its physical data blocks.

Obladi draws from the theory of multilevel serializability [83], which guarantees that an execution is serializable if the system enforces level-by-level serializability: if one operation is ordered before another at a given level, all of its suboperations must precede the other operation's conflicting suboperations at the level below. Thus, if Obladi orders conflicting operations at one level, it enforces the same order at the next level for all their conflicting suboperations; conversely, if two operations do not conflict at a level, Obladi executes their suboperations in parallel. To this end, Obladi simply tracks dependencies across operations and orders conflicting suboperations accordingly. Obladi extracts further parallelism in two ways. First, since in Ring ORAM reads to the same bucket between consecutive eviction or reshuffling operations always target different physical data blocks (even when bucket operations conflict on metadata access), Obladi executes them in parallel. Second, Obladi's own batching logic ensures that requests within a batch touch different objects, preventing read and write methods from ever conflicting. Together, these techniques allow Obladi to execute most requests and evictions in parallel.

We illustrate the dependency tracking logic in Figure 7. The read operation to path 1 conflicts with the evict path for path 2, but only at the root (bucket 1). Thus, reads to buckets 2 and 3 can proceed concurrently, even though accesses to the root’s metadata must be serialized, as both operations update the bucket access counter and valid/invalid map (§4).
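The dependency tracking just described can be sketched as a per-bucket pipeline: a suboperation waits for the previously scheduled conflicting suboperation on the same bucket, while suboperations on different buckets run in parallel. The Java sketch below is illustrative and omits the metadata/data-block distinction.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-bucket ordering of suboperations.
final class BucketPipelineSketch {
    // Last scheduled suboperation per bucket; a new conflicting one chains after it.
    private final ConcurrentHashMap<Integer, CompletableFuture<Void>> lastOp =
            new ConcurrentHashMap<>();

    /** Schedules work on a bucket after any previously scheduled conflicting work. */
    CompletableFuture<Void> schedule(int bucketId, Runnable work) {
        return lastOp.compute(bucketId, (id, prev) ->
                (prev == null ? CompletableFuture.<Void>completedFuture(null) : prev)
                        .thenRunAsync(work));
    }

    /** A logical ORAM operation fans out into per-bucket suboperations. */
    CompletableFuture<Void> scheduleOperation(List<Integer> pathBuckets, Runnable perBucketWork) {
        List<CompletableFuture<Void>> subs = new ArrayList<>();
        for (int bucket : pathBuckets) subs.add(schedule(bucket, perBucketWork));
        return CompletableFuture.allOf(subs.toArray(new CompletableFuture[0]));
    }
}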


Figure 7: Multilevel Pipelining for a read of path 1 and an evict path of path 2 executing in parallel. Solid green lines represent physical dependencies and dashed red lines represent data dependencies. Inner blocks represent nested operations

Security. For security, Obladi's parallel evict path operation must flush the same blocks that a sequential implementation would flush. Reproducing this behavior without sacrificing parallelism is challenging: it requires that all real objects brought in during the last A accesses be present in the stash when data is flushed, which may introduce data dependencies. Unlike dependencies that arise between operations that access the same physical location in cloud storage, these dependencies are not a deterministic function of an epoch's operations already known to the adversary.

Consider, for instance, the block read from path 1 in Figure 7. In a sequential implementation, it would enter the stash as a result of reading path 1 and be flushed to bucket 3 by the following evict path; the evict path would thus have to wait until the block is placed in the stash. Honoring these dependencies opens a timing channel: delays in flushing certain blocks can reveal object placement. As blocks holding real objects can exist anywhere in the tree and be remapped to any path, it follows that it is never secure to execute an eviction operation until all previous access operations have terminated.

Obladi mitigates this restriction by again leveraging delayed visibility and the idea of separating read and write operations within an epoch—but with an important difference. In §6.2 the proxy created separate batches for logical read and write operations; to improve parallelism, Obladi, expanding on an idea used by Shroud [43], assigns the physical read and write operations that underlie each of those logical operations to separate phases within an epoch. The read phase computes all necessary metadata and executes the set of physical read operations for all logical read path, early reshuffle, and evict path operations. This set is workload independent, so its operations need not be delayed. Physical writes, however, are only flushed at the end of an epoch. The proxy can again apply write deduplication: if a bucket is repeatedly modified during an epoch, only the last version must be written back. Reads that should have observed an intermediate write are served locally from the buffered buckets.

The adversary thus always observes a set of reads to random paths followed by a deterministic set of writes independent of the contents of the ORAM and, consequently, of the workload. Data dependencies between read and evict operations no longer create a timing channel. Meanwhile parallelism remains high, as the physical blocks accessed in each phase are guaranteed to be distinct—Ring ORAM directly guarantees this for reads, while bucket deduplication does it for writes.
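The sketch below illustrates this epoch-level split under an assumed storage interface: physical reads are issued immediately, physical bucket writes are buffered and deduplicated, and only the final version of each bucket is flushed at the epoch boundary.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative split between eager physical reads and deferred, deduplicated bucket writes.
final class EpochPhasesSketch {
    interface Storage {
        byte[] readBucket(int bucketId);
        void writeBucket(int bucketId, byte[] content);
    }

    private final Storage storage;
    // bucketId -> last buffered content for this epoch (insertion order kept
    // so the flush pattern stays deterministic).
    private final Map<Integer, byte[]> bufferedBuckets = new LinkedHashMap<>();

    EpochPhasesSketch(Storage storage) { this.storage = storage; }

    /** Reads see buffered intermediate writes; otherwise they hit cloud storage. */
    byte[] readBucket(int bucketId) {
        byte[] buffered = bufferedBuckets.get(bucketId);
        return buffered != null ? buffered : storage.readBucket(bucketId);
    }

    /** Physical writes are deferred: repeated writes to a bucket are deduplicated. */
    void writeBucket(int bucketId, byte[] content) {
        bufferedBuckets.put(bucketId, content);
    }

    /** At the epoch boundary, flush each modified bucket exactly once. */
    void endEpoch() {
        for (Map.Entry<Integer, byte[]> e : bufferedBuckets.entrySet())
            storage.writeBucket(e.getKey(), e.getValue());
        bufferedBuckets.clear();
    }
}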

8 Durability

Obladi guarantees durability at the granularity of epochs: after a crash, it recovers to the state of the last failure-free epoch. Obladi adds two demands to the need to recover to a consistent state: recovery should leak no information about past or future transactions, and it should be efficient, accessing minimal data from cloud storage. Obladi guarantees the former by ensuring that the recovery logic and the data logged for recovery maintain workload independence (§3). It strives toward the latter by leveraging the determinism of Ring ORAM.

Consistency. Obladi's recovery logic relies on two well-known techniques: write-ahead logging [50] and shadow paging [29]. Obladi mandates that transactions be durable only at the end of an epoch; thus, on a proxy failure, all ongoing transactions can be aborted and the system reverted to the previous epoch. To make this possible, Obladi must (i) recover the proxy metadata lost during the proxy crash, and (ii) ensure that the ORAM does not contain any of the aborted transactions' updates. To recover the metadata, Obladi logs three data structures before declaring the epoch committed: the position map, the permutation map, and the stash. The position map and the permutation map identify the position of real objects in the ORAM tree (respectively, in a path and in a bucket); logging them prevents the recovery logic from having to scan the full ORAM to recover those positions. Logging the stash is necessary for correctness: as eviction may be unable to flush the entire stash, some newly written objects may be present only in the stash, even at epoch boundaries, and failing to log the stash could thus lead to data loss.

To undo partially executed transactions, Obladi adapts the traditional copy-on-write technique of shadow paging [29]: rather than updating buckets in place, it creates new versions of each bucket on every write. Obladi then leverages the inherent determinism of Ring ORAM to reconstruct a consistent snapshot of the ORAM at a given epoch. In Ring ORAM, the current version of a bucket (i.e., the number of times the bucket has been written) is a deterministic function of the number of prior evict paths. The number of evict paths per epoch is similarly fixed (evict paths happen every A accesses, and epochs are of fixed size). Obladi can then trivially revert the ORAM on failures by setting the evict path counter to its value at the end of the last committed epoch. This counter determines the number of evict paths that have occurred, and consequently the object versions of the corresponding epoch.
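The following sketch collects the recovery state described above into an epoch checkpoint and shows the counter reset that reverts the shadow-paged ORAM; serialization, encryption, and padding are omitted, and all names are illustrative.

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative metadata checkpointed at each epoch boundary.
final class EpochCheckpointSketch implements Serializable {
    final Map<String, Integer> positionMap = new HashMap<>();   // key -> leaf
    final Map<Integer, int[]> permutationMap = new HashMap<>(); // bucket -> slot permutation
    final Map<Integer, boolean[]> validMap = new HashMap<>();   // bucket -> unread slots
    final Map<String, byte[]> stash = new HashMap<>();          // padded to max size before logging
    final long evictPathCounter;                                // determines bucket versions
    final long epochNumber;

    EpochCheckpointSketch(long epochNumber, long evictPathCounter) {
        this.epochNumber = epochNumber;
        this.evictPathCounter = evictPathCounter;
    }
}

// On recovery, the proxy reloads the last durable checkpoint and resumes from it;
// resetting the evict path counter is what reverts the shadow-paged ORAM.
final class RecoverySketch {
    interface CheckpointLog { EpochCheckpointSketch lastDurable(); }

    static EpochCheckpointSketch recover(CheckpointLog log) {
        EpochCheckpointSketch cp = log.lastDurable();
        // All transactions from the failed epoch are aborted; the proxy simply
        // continues from epoch cp.epochNumber with cp.evictPathCounter.
        return cp;
    }
}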

Security. Obladi ensures (i) that the information logged for durability remains independent of data accesses, and (ii) that the interactions between the failed epoch, the recovery logic, and the next epoch preserve workload independence.

Obladi addresses the first issue by encrypting the position map and the contents of the permutation table. It similarly encrypts the stash, but also pads it to its maximum size, as determined in canonical Ring ORAM [68], to prevent its size from indicating skew (if a small number of objects are accessed frequently, the stash will tend to be smaller).

The second concern requires more care: workload independence must hold before, during, and after failures. Ring ORAM guarantees workload independence through two invariants: the bucket invariant and the path invariant (§4). Preventing bucket slots from being read twice between evictions is straightforward: Obladi simply logs the valid/invalid map to track which slots have already been read and restores it during recovery; there is no need for encryption, as the set of slots read is public information. Ensuring that the ORAM continues to observe a uniformly distributed set of paths is instead more challenging. Specifically, read requests from partially executed transactions can potentially leak information, even when recovering to the previous epoch. Traditionally, databases simply undo partially executed transactions, mark them as aborted, and proceed as if they had never existed. From a security standpoint, however, these transactions were still observed by the adversary, and thus may leak information. Consider a transaction accessing an object mapped to some path p that aborts because of a proxy failure. Upon recovery, it is likely that a client will attempt to access the same object again. As the recovery logic restores the position map of the previous epoch, that new operation will result in another access to path p, revealing that the initial access to path p was likely real (rather than padded), as the probability of collisions between two uniformly chosen paths is low. To mitigate this concern while allowing clients to request the same objects after a failure, Obladi durably logs the list of paths and slot indices that it accesses, before executing the actual requests, and replays those paths during recovery (remapping any real blocks). While this process is similar to traditional database redo logging [50], the goal is different: Obladi does not try to reapply transactions (they have all aborted), but instead forces the recovery logic to be deterministic, so that the adversary always sees the paths from the aborted epoch repeated after a failure.

Optimizations. To minimize the overhead of checkpointing, Obladi checkpoints deltas of the position, permutation, and valid/invalid maps, and only periodically checkpoints the full data structures. While the number of changes to the permutation and valid/invalid maps directly follows from the set of physical requests made to cloud storage, the size of the delta for the position map reveals how many real requests were included in an epoch—padded requests do not lead to position map updates. Obladi thus pads the map delta to the maximum number of entries that could have changed in an epoch (i.e., the read batch size times the number of read batches, plus the size of the single write batch).
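A small sketch of this padding step, assuming illustrative types: the position-map delta is always extended to the maximum number of entries that could have changed in an epoch.

import java.util.ArrayList;
import java.util.List;

// Illustrative padding of the position-map delta to a workload-independent size.
final class PositionMapDeltaSketch {
    static final class Entry {
        final String key; final int newLeaf; final boolean dummy;
        Entry(String key, int newLeaf, boolean dummy) {
            this.key = key; this.newLeaf = newLeaf; this.dummy = dummy;
        }
    }

    static List<Entry> padDelta(List<Entry> realChanges,
                                int readBatchSize, int readBatchesPerEpoch,
                                int writeBatchSize) {
        // Maximum entries that could change: all read slots plus the write batch.
        int maxChanges = readBatchSize * readBatchesPerEpoch + writeBatchSize;
        List<Entry> padded = new ArrayList<>(realChanges);
        while (padded.size() < maxChanges)
            padded.add(new Entry("__dummy__", -1, true));   // encrypted, so indistinguishable
        return padded;
    }
}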

9 System Security

We now outline Obladi's security guarantees, deferring a formal treatment to Appendix B. To the best of our knowledge, we are the first to formalize the notion of crashes in the context of oblivious RAM.

Model. We express our security proof within the Universal Composability (UC) framework [14], as it aligns well with the needs of modern distributed systems: a UC-secure system remains UC-secure under concurrency or when composed with other UC-secure systems. Intuitively, proving security in the UC model proceeds as follows. First, we specify an ideal functionality F that defines the expected behavior of the protocol for both correctness and security. For instance, F requires that the execution be serializable, and that only the frequency of read and write batches be learned. We must ensure that the real protocol provides the same functionality to honest parties while leaking no more information than F would. To establish this, we consider two different worlds: one where the real protocol interacts with an adversary A, and one where F interacts with S[A], our best attempt at simulating A. A's transcript—including its inputs, outputs, and randomness—and S[A]'s output are given to an environment E, which can also observe all communications within each world. E's goal is to determine which world contains the real protocol. To prompt the worlds to diverge, E can delay and reorder messages, and even control external inputs (potentially causing failures). Intuitively, E represents anything external to the protocol, such as concurrently executing systems. We say that the real protocol is secure if, for any adversary A, we can construct S[A] such that E can never distinguish between the two worlds.

Assumptions The security of Obladi relies on four assumptions: (1) canonical Ring ORAM is linearizable; (2) MVTSO generates serializable executions; (3) the network will retransmit dropped packets; and (4) the adversary learns of the retransmissions, but nothing more.

Ideal Functionality To define the ideal functionality F, recall that the proxy is considered trusted while interactions with the cloud storage are not. This allows F to replace the proxy and intermediate between clients and the storage server, performing the same functions as the proxy (we do not try to hide the concurrency/batching logic). We must, however, define F so that it obliviously hides data values and access patterns. To this end, when the proxy logic finalizes a batch, F simply informs the storage server that it is executing a read or write batch. Since F is a theoretical ideal, we allow it to manage all storage internally: it updates its local storage and furnishes the appropriate response to each client.

In this setup, modeling proxy crashes is straightforward. Crashes can occur at any time and cause the proxy to lose all state, so, on an external input to crash, F simply clears its state. Since we accept that the adversary may learn of proxy crashes, F also sends a message to the storage server indicating that it has crashed.

Proof Sketch The correctness of the system is straightforward, as F behaves much the same as the proxy.

To prove security, we must demonstrate that, for any algorithm A defining the behavior of the storage server, we can accurately simulate A's behavior using only the information provided by F. Note that the simulator S[A] can run A internally, as A is simply an algorithm. Thus we can define S[A] to operate as follows: when S[A] receives notification of a batch, it constructs a parallel ORAM batch from uniformly random accesses of the correct type, provides these accesses to A, and produces A's response.
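To illustrate, the sketch below generates such a batch of uniformly random accesses; the class names and the heap-style bucket numbering are assumptions made for the example rather than details of Obladi's simulator.

```java
// Hypothetical sketch of the simulator's batch construction: one uniformly random
// root-to-leaf path per (real or padded) request, independent of any actual data.
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

final class SimulatedBatch {
    private final SplittableRandom rng = new SplittableRandom();
    private final int treeHeight;  // number of levels below the root

    SimulatedBatch(int treeHeight) { this.treeHeight = treeHeight; }

    /** Bucket ids along the path to a uniformly random leaf (heap-style numbering). */
    List<Long> randomPath() {
        long leaf = (1L << treeHeight) + rng.nextLong(1L << treeHeight);
        List<Long> buckets = new ArrayList<>(treeHeight + 1);
        for (long node = leaf; node >= 1; node >>= 1) {
            buckets.add(node);
        }
        return buckets;
    }

    /** A full dummy read batch: one random path per request in the batch. */
    List<List<Long>> readBatch(int batchSize) {
        List<List<Long>> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            batch.add(randomPath());
        }
        return batch;
    }
}
```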

The security of this simulation hinges on two key properties: the caching and deduplication logic does not affect the distribution of physical accesses, and the physical access pattern of a parallelized batch is entirely determined by the physical accesses prescribed by sequential Ring ORAM for the same batch. The first follows from Ring ORAM's guarantee that each access will be to an independent, uniformly random path: removing an independently sampled element does not change the distribution of the remaining set. The second follows from the parallelization procedure simply aggregating all accesses and performing all reads followed by all writes.

These properties ensure that the random access pattern produced by S[A] is distributed identically to the access pattern produced by the proxy when operating on real data. Thus the simulated A must behave exactly as it would when provided with real data, and produce indistinguishable output.

10 Implementation

Figure 8: FreeHealth Database Architecture

Our prototype consists of 41,000 lines of Java code. We use the Netty library for network communication (v4.1.20), Google protocol buffers for serialization (v3.5.1), the Bouncy Castle library (v1.59) for encryption, and the Java MapDB library (v3) for persistence. We additionally implement a non-private baseline (NoPriv). NoPriv shares the same concurrency control logic (TSO), but replaces the proxy data handler with non-private remote storage. NoPriv neither batches nor delays operations; it buffers writes at the local proxy until commit and serves reads locally when possible.

11 Evaluation

Obladi leverages the flexibility of transactional commits to mitigate the overheads of ORAM. To quantify the benefits and limitations of this approach, we ask:

  1. How much does Obladi pay for privacy? (§11.1)

  2. How do epochs affect these overheads? (§11.2)

  3. Can Obladi recover efficiently from failures? (§11.3)

Experimental Setup The proxy runs on a c5.4xlarge Amazon EC2 instance (16 vCPUs, 32GB RAM), and the storage on an m5.4xlarge instance (16 vCPUs, 64GB RAM). The ORAM tree is configured with the optimal values of the Ring ORAM parameters S and A (respectively, 196 and 168) [68]. We report the average of three 90-second runs (30-second ramp-up/down).

Benchmarks We evaluate the performance of our system using three applications: TPC-C [21, 79], SmallBank [21], and FreeHealth [41, 27]. Our microbenchmarks use the YCSB [18] workload generator. TPC-C, the de facto standard for OLTP workloads, simulates the business logic of e-commerce suppliers. We configure TPC-C to run with 10 warehouses [86]. In line with prior transactional key-value stores [78], we use a separate table as a secondary index on the order table to locate a customer's latest order in the order status transaction, and on the customer table to look up customers by their last names (order status and payment). SmallBank [21] models a simple banking application supporting money transfers, withdrawals, and deposits. We configure it to run with one million accounts. Finally, we port FreeHealth [41, 27], an actively used cloud EHR system (Figure 8). FreeHealth supports the business logic of medical practices and hospitals. It consists of 21 transaction types that doctors use to create patients and look up medical history, prescriptions, and drug interactions.

11.1 End-to-end Performance

Figure 9: Application Performance. (a) Throughput; (b) Latency.

Figure 9 summarizes the results from running the three end-to-end applications in two setups: a local setup in which the latency between proxy and server is low (0.3ms) (Obladi, NoPriv), and a more realistic WAN setup with 10ms latency (ObladiW, NoPrivW). We additionally compare those results with a local MySQL setup. MySQL, unlike NoPriv, cannot buffer writes. We consequently do not evaluate MySQL in the WAN setting.

TPC-C Obladi comes within 8× of NoPriv's throughput, as NoPriv is contention-bottlenecked by the high rate of conflicts between the new-order and payment transactions on the district table. NoPriv's throughput is itself slightly higher than MySQL's, as the use of MVTSO allows the new-order and payment transactions to be pipelined, whereas MySQL acquires exclusive locks for the duration of each transaction. Latency, however, spikes to 70× over NoPriv because of the inflexible execution pattern needed for security. Transactions in TPC-C vary heavily in size; epochs must be large enough to accommodate all transactions, and hence artificially increase the latency of short ones. Moreover, write operations must be applied atomically during epoch changes. For a write batch size of 2,000, this process takes 340ms on average, further increasing latency for individual transactions. The write-back process also limits throughput, even preventing non-conflicting operations from making progress (in contrast, NoPriv benefits from writes never blocking reads in MVTSO). Epoch changes also introduce additional aborts for transactions that straddle epochs. The additional 10ms latency of the WAN setting has comparatively little effect, as the large write batch size of TPC-C is the primary bottleneck: throughput remains within 9× of NoPrivW. NoPrivW's performance likewise does not degrade: since MVTSO exposes uncommitted writes immediately, increasing commit latency does not increase contention.

SmallBank Transactions in SmallBank are more homogeneous (between three and six operations); thus, the length of an epoch can be set to more closely approximate most transactions, reducing latency overheads (17× NoPriv). NoPriv is CPU-bottlenecked for SmallBank; the relative throughput drop for Obladi is higher (12×) because of the overhead of changing epochs and the blocking it introduces. Transaction dependency tracking becomes a bottleneck in NoPriv, resulting in a 15% throughput loss relative to MySQL. Increasing latency between proxy and storage causes both systems' throughput to drop: ObladiW's 35% drop is due to the increased duration of epoch changes (during which no other transactions can execute), while NoPrivW's 30% drop stems from the longer dependency chains that arise from the relatively long commit phase.

FreeHealth Like SmallBank, FreeHealth consists of fairly short transactions, so Obladi can choose a fairly small epoch (five read batches), reducing the impact on latency (20× NoPriv). Unlike SmallBank, however, FreeHealth consists primarily of read operations, so Obladi can also choose a much smaller write batch (200), minimizing the cost of epoch changes and maximizing throughput (only a 4× drop over NoPriv, and a 5.5× drop over NoPrivW for ObladiW). Both NoPriv and Obladi are contention-bottlenecked on the creation of episodes, the core units of EHR systems that encapsulate prescriptions, medical history, and patient interactions.

11.2 Impact of Epochs

Though epochs create blocking and cause aborts, they are key to reducing the cost of accessing ORAM, as they allow Obladi to (i) securely parallelize the ORAM and (ii) delay and buffer bucket writes. We quantify epochs' impact on performance as a function of their size and the properties of the underlying storage (Figure 10).

Figure 10: Performance impact of various features. (a) Parallelism (Batch Size 500); (b) Batch Size Throughput; (c) Batch Size Latency; (d) Delayed Visibility; (e) Epoch Size Impact - ORAM; (f) Epoch Size Impact - Proxy.

We instantiate an ORAM with 100K objects and choose four storage backends: a local dummy (storing no real data) that responds to all reads with a static value and ignores writes (dummy); a remote server backend with an in-memory hashmap (server, ping time 0.3ms); a remote WAN server backend with an in-memory hashmap (server WAN, ping time 10ms); and DynamoDB (dynamo, provisioned for 80K req/s, read ping 1ms, write ping 3ms).

Parallelization We first focus on the performance impact of parallelizing Ring ORAM (ignoring other optimizations). Graph (a) shows that, unsurprisingly, the benefits of parallelism increase with the latency of individual requests. Parallelizing the ORAM for dummy, for instance, yields no performance gain; in fact, it results in a 3× slowdown (from 72K req/s to 24K req/s). Sequential Ring ORAM on dummy is CPU-bound on metadata computation (remapping paths, shuffling buckets, etc.), so adding coordination mechanisms to guarantee multi-level serializability only increases the cost of accessing a bucket. As storage access latency increases and the ORAM becomes I/O-bound, the benefits of parallelism become more salient. For a batch size of 500, throughput increases by 12× for server, by as much as 51× for dynamo, and by 510× for the WAN server. The available parallelism is a function of both the size/fan-out of the tree and the underlying resource bottlenecks of the proxy. Graph (b) captures the parallelization speedup for both intra- and inter-request parallelism, while Graph (c) quantifies the latency impact of batching. The parallelization speedup achieved for a batch size of one captures intra-request parallelism: the eleven levels of the ORAM can be accessed concurrently, yielding an 11× speedup. As batch sizes increase, Obladi can leverage inter-request parallelism to process non-conflicting physical operations in parallel, with little to no impact on latency. Dynamo peaks early (at 1,750 req/s) because its client API uses blocking HTTP calls, and dummy eventually bottlenecks on encryption, but server and WAN server are more interesting: their throughput is limited by the physical and data dependencies on the upper levels of the tree (recall that paths always conflict at the root (§7)).
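As a rough illustration of the intra-request parallelism described above, the sketch below fetches all buckets of a deduplicated batch concurrently; the interfaces and the thread-pool size are hypothetical choices for the example, not details of Obladi's ORAM executor.

```java
// Hypothetical sketch: the buckets of an ORAM path live on different tree levels and
// never conflict, so they (and deduplicated buckets across a batch) can be fetched in parallel.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Function;

final class ParallelPathReader {
    private final ExecutorService pool = Executors.newFixedThreadPool(32);

    /** Fetch every bucket of a deduplicated set concurrently from remote storage. */
    Map<Long, byte[]> readBuckets(List<Long> bucketIds, Function<Long, byte[]> storageGet)
            throws InterruptedException, ExecutionException {
        List<Callable<byte[]>> tasks = new ArrayList<>(bucketIds.size());
        for (long id : bucketIds) {
            tasks.add(() -> storageGet.apply(id));  // one fetch per distinct bucket
        }
        List<Future<byte[]>> futures = pool.invokeAll(tasks);  // runs all fetches, waits for completion
        Map<Long, byte[]> result = new HashMap<>();
        for (int i = 0; i < bucketIds.size(); i++) {
            result.put(bucketIds.get(i), futures.get(i).get());
        }
        return result;
    }
}
```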

Work Reduction To amortize ORAM overheads across a large number of operations, Obladi relies on delayed visibility to buffer bucket writes until the end of an epoch, when they can be executed in parallel, discarding intermediate writes. Reads to those buckets are served directly from the proxy, reducing network communication and CPU work (as encryption is not needed). Graph (d) shows that enabling this optimization for an epoch of eight batches (a setup suitable for FreeHealth and TPC-C) yields a 1.5× speedup on both dynamo and the server, a 1.6× speedup on the WAN server, but only minimal gains for dummy (1.1×). When using a small number of batches, throughput gains come primarily from combining duplicate operations on buckets near the top of the tree. For example, the root bucket is written 27 times in an epoch of size eight (once per eviction, every 168 requests). As these operations conflict, they must be executed sequentially and quickly become the bottleneck (other buckets have fewer operations to execute). Our optimization lets Obladi write the root bucket only once, significantly reducing latency and thus increasing throughput. As epochs grow in size, increasingly many buckets are buffered locally until the end of the epoch (§7), allowing reads to be served locally and further reducing I/O with the storage. Consider Graph (e): throughput increases almost logarithmically; metadata computation eventually becomes a bottleneck for dummy, while server and server WAN eventually run out of memory from storing most of the tree (our AWS account did not allow us to provision dynamo adequately for larger batches). Larger epochs reduce the raw amount of work per operation: with one batch, Obladi requires 41 physical requests per logical operation, but only 24 with eight batches. For real transactional workloads, however, epochs are not a silver bullet. Graph (f) suggests that applications are very sensitive to choosing the right epoch duration: too short, and transactions cannot make progress, repeatedly aborting; too long, and the system remains unnecessarily idle.
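The write-buffering side of this optimization can be sketched as follows (hypothetical names; Obladi's actual data handler is more involved): buffered bucket writes within an epoch overwrite one another, reads of buffered buckets are served locally, and the buffer is flushed exactly once at the epoch boundary.

```java
// Hypothetical sketch of delayed-visibility write buffering: intermediate bucket versions
// are discarded, and each dirty bucket is written to storage exactly once per epoch.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiConsumer;

final class EpochWriteBuffer {
    private final Map<Long, byte[]> pending = new HashMap<>();  // bucketId -> latest contents

    void bufferWrite(long bucketId, byte[] encryptedBucket) {
        pending.put(bucketId, encryptedBucket);  // later writes overwrite earlier ones
    }

    Optional<byte[]> tryServeLocally(long bucketId) {
        return Optional.ofNullable(pending.get(bucketId));  // no storage I/O, no re-encryption
    }

    /** At the epoch boundary, write each buffered bucket exactly once, then clear. */
    void flush(BiConsumer<Long, byte[]> storagePut) {
        pending.forEach(storagePut);
        pending.clear();
    }
}
```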

11.3 Durability

Figure 11: Durability. (a) Checkpoint Frequency (100K); (b) Server WAN Recovery Time (ms):

                 10K   100K     1M
    Levels         7     11     14
    Slowdown    0.83   0.88   0.89
    RecTime     1452   2604   6080
    Network      182    681    848
    Pos            8     74   1610
    Perm          15    218   1424
    Paths        864   1104   1341

Table (b) quantifies the efficiency of failure recovery and the cost it imposes on normal execution for ORAMs of different sizes (for space reasons, we show results only for the WAN server, as Dynamo follows a similar trend). During normal execution, durability imposes a moderate throughput drop (0.83× of baseline for 10K, rising to 0.89× for 1M). This slowdown is due to the need to checkpoint client metadata and to synchronously log read paths to durable storage before reading. As seen in Graph (a), computing diffs mitigates the impact of checkpointing. Recovery time increases as the ORAM grows, from 1.5s to 6.1s (Table (b), RecTime). The costs of decrypting the position and permutation maps (Pos and Perm) are low for small datasets, but grow linearly with the number of keys. Read path logging (Paths) instead starts much larger, but grows only with the depth of the tree.

12 Related Work

Batching Obladi amortizes ORAM costs by grouping operations into epochs and committing at epoch boundaries. Batching can mitigate expensive security primitives: it reduces server-side computation in private information retrieval (PIR) schemes [30, 9, 44, 32], amortizes the cost of shuffling networks in Atom [39], and amortizes the cost of verifying integrity in Concerto [6]. Changing when operations output commit is a popular performance-boosting technique: it yields significant gains for state-machine replication [63, 36, 34], file systems [54], and transactional databases [46, 20, 81].

ORAM parallelism Obladi extends recent work on parallel ORAM constructions [43, 85, 11] to extract parallelism both within and across requests. Shroud [43] targets intra-request parallelism by concurrently accessing different levels of tree-based ORAMs. Chung et al. [12] and PrivateFS [85] instead target inter-request parallelism, respectively in tree-based [72] and hierarchical [84] ORAMs. Both works execute requests to distinct logical keys concurrently between reshuffles or evictions and deduplicate concurrent requests for the same key to increase parallelism. Obladi leverages delayed visibility to separate batches into read and write phases, extracting concurrency both within requests and across evictions. Furthermore, Obladi parallelizes across requests by deduplicating requests at the trusted proxy.

ObliviStore [76] and Taostore [69] instead approach parallelization by focusing on asynchrony. ObliviStore [76] formalizes the security challenges of scheduling requests asynchronously; the oblivious scheduling mechanism that it presents for that model, however, is computationally expensive and requires a large stash, making ObliviStore unsuitable for implementing ACID transactions. Like ObliviStore, Taostore leverages asynchrony to parallelize Path ORAM [77], a tree-based construction from which Ring ORAM descends. Taostore, however, targets a different threat model: it assumes both that requests must be processed immediately and that the timing of responses is visible to the adversary. Request latencies thus necessarily increase linearly with the number of clients [85].

Hiding access patterns for non-transactional systems Many systems seek to provide access pattern protections for analytical queries: Opaque [88] and Cipherbase [5] support oblivious operators for queries that scan or shuffle full tables. Both rely on hardware enclaves for efficiency: Opaque runs a query optimizer in SGX [31], while Cipherbase leverages secure co-processors to evaluate predicates more efficiently. Others seek to hide the parameters of the query rather than the query itself: Olumofin et al. [55] do so via multiple rounds of keyword-based PIR operations [16]; Splinter [82] reduces the number of round-trips necessary by mapping these database queries to function secret sharing primitives. Finally, ObliDB [24] adds support for point queries and efficient updates by designing an oblivious B-tree for indexing. The concurrency control and recovery mechanisms of all these approaches introduce timing channels and structure writes in ways that leak access patterns [5].

Encryption Many commercial systems offer the option of storing encrypted data [70, 23]. Efficiently executing data-dependent queries like joins, filters, or aggregations without knowledge of the plaintext is challenging: systems like CryptDB [62], Monomi [80], and Seabed [59] tailor encryption schemes to allow certain queries to execute directly on encrypted data, while others leverage trusted hardware [7]. In contrast, executing transactions on encrypted data is straightforward: neither concurrency control nor recovery requires knowledge of the plaintext data.

13 Conclusion

This paper presents Obladi, a system that, for the first time, considers the security challenges of providing ACID transactions without revealing access patterns. Obladi guarantees security and durability at moderate cost through a simple observation: transactional guarantees are only required to hold for committed transactions. By delaying commits until the end of epochs, Obladi inches closer to providing practical oblivious ACID transactions.

Acknowledgements We thank our shepherd, Jay Lorch, for his commitment to excellence, and the anonymous reviewers for their helpful comments. We are grateful to Sebastian Angel, Soumya Basu, Vijay Chidambaram, Trinabh Gupta, Paul Grubbs, Malte Schwarzkopf, Yunhao Zhang, and the MIT PDOS reading group for their feedback. This work was supported by NSF grants CSR-1409555 and CNS-1704742, and an AWS EC2 Education Research grant.

References

  • [1] Agrawal, D., and El Abbadi, A. Locks with Constrained Sharing (Extended Abstract).
  • [2] Aguilar-Melchor, C., Barrier, J., Fousse, L., and Killijian, M.-O. XPIR: Private Information Retrieval for Everyone. Cryptology ePrint Archive, Report 2014/1025, 2014. http://eprint.iacr.org/2014/1025.
  • [3] Amazon. S3: Simple storage service. https://aws.amazon.com/s3/.
  • [4] Amazon. Simple db. https://aws.amazon.com/simpledb/.
  • [5] Arasu, A., Blanas, S., Eguro, K., Kaushik, R., Kossmann, D., Ramamurthy, R., and Venkatesan, R. Orthogonal Security With Cipherbase. In Conference on Innovative Data Systems Research (CIDR) (2013).
  • [6] Arasu, A., Eguro, K., Kaushik, R., Kossmann, D., Meng, P., Pandey, V., and Ramamurthy, R. Concerto: A High Concurrency Key-Value Store with Integrity. In ACM SIGMOD International Conference on Management of Data (SIGMOD) (2017).
  • [7] Bajaj, S., and Sion, R. TrustedDB: A Trusted Hardware Based Database with Privacy and Data Confidentiality. In ACM SIGMOD International Conference on Management of Data (SIGMOD) (2011).
  • [8] Baker, J., Bond, C., Corbett, J. C., Furman, J., Khorlin, A., Larson, J., Leon, J.-M., Li, Y., Lloyd, A., and Yushprakh, V. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In Conference on Innovative Data Systems Research (CIDR) (2011).
  • [9] Beimel, A., Ishai, Y., and Malkin, T. Reducing the servers’ computation in private information retrieval: PIR with preprocessing. Journal of Cryptology (JOFC) 17, 2 (2004), 125–151.
  • [10] Bernstein, P. A., and Goodman, N. Multiversion Concurrency Control — Theory and Algorithms. ACM Trans. Database Syst. 8, 4 (1983), 465–483.
  • [11] Bindschaedler, V., Naveed, M., Pan, X., Wang, X., and Huang, Y. Practicing Oblivious Access on Cloud Storage: The Gap, the Fallacy, and the New Way Forward. In ACM Conference on Computer and Communications Security (CCS) (2015).
  • [12] Boyle, E., Chung, K.-M., and Pass, R. Oblivious Parallel RAM and Applications. In Theory of Cryptography Conference (TCC) (2016).
  • [13] Bulck, J. V., Minkin, M., Weisse, O., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Wenisch, T. F., Yarom, Y., and Strackx, R. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security Symposium (USENIX) (2018).
  • [14] Canetti, R. Universally composable security: A new paradigm for cryptographic protocols. In IEEE Symposium on Foundations of Computer Science (FOCS) (2001).
  • [15] Cecchetti, E., Zhang, F., Ji, Y., Kosba, A., Juels, A., and Shi, E. Solidus: Confidential Distributed Ledger Transactions via PVORM. In ACM Conference on Computer and Communications Security (CCS) (2017).
  • [16] Chor, B., Gilboa, N., and Naor, M. Private information retrieval by keywords, 1997.
  • [17] Cloud, C. 5 advantages of a cloud-based EHR.
  • [18] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking Cloud Serving Systems with YCSB. In ACM Symposium on Cloud Computing (SoCC) (2010).
  • [19] Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. Spanner: Google’s Globally Distributed Database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 8:1–8:22.
  • [20] Crooks, N., Pu, Y., Alvisi, L., and Clement, A. Seeing is Believing: A Client-Centric Specification of Database Isolation. In ACM Symposium on Principles of Distributed Computing (PODC) (2017).
  • [21] Difallah, D. E., Pavlo, A., Curino, C., and Cudre-Mauroux, P. OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases.
  • [22] DynamoDB. DynamoDB. https://aws.amazon.com/dynamodb/.
  • [23] DynamoDB. Encryption at rest. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EncryptionAtRest.html.
  • [24] Eskandarian, S., and Zaharia, M. An Oblivious General-Purpose SQL Database for the Cloud. CoRR abs/1710.00458 (2017).
  • [25] Eswaran, K. P., Gray, J. N., Lorie, R. A., and Traiger, I. L. The Notions of Consistency and Predicate Locks in a Database System. Commun. ACM 19, 11 (1976), 624–633.
  • [26] Fletcher, C. W., Ren, L., Kwon, A., and van Dijk, M. A Low-Latency, Low-Area Hardware Oblivious RAM Controller. In Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM) (2015).
  • [27] FreeHealth.io. FreeHealth EHR. https://freehealth.io/. Accessed 2018-05-01.
  • [28] Goldreich, O., and Ostrovsky, R. Software protection and simulation on oblivious RAMs. Journal of the ACM (JACM) 43, 3 (1996), 431–473.
  • [29] Gray, J., McJones, P., Blasgen, M., Lindsay, B., Lorie, R., Price, T., Putzolu, F., and Traiger, I. The Recovery Manager of the System R Database Manager. ACM Computing Surveys (CSUR) 13, 2 (1981), 223–242.
  • [30] Gupta, T., Crooks, N., Mulhern, W., Setty, S., Alvisi, L., and Walfish, M. Scalable and Private Media Consumption with Popcorn. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2016).
  • [31] Intel. Intel Software Guard Extension - SGX. https://software.intel.com/en-us/sgx.
  • [32] Ishai, Y., Kushilevitz, E., Ostrovsky, R., and Sahai, A. Batch Codes and Their Applications. In ACM Symposium on Theory of Computing (STOC) (2004).
  • [33] Jones, E. P., Abadi, D. J., and Madden, S. Low Overhead Concurrency Control for Partitioned Main Memory Databases. In ACM SIGMOD International Conference on Management of Data (SIGMOD) (2010).
  • [34] Kapritsos, M., Wang, Y., Quema, V., Clement, A., Alvisi, L., and Dahlin, M. All about Eve: Execute-Verify Replication for Multi-Core Servers. In USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2012).
  • [35] Kocher, P., Horn, J., Fogh, A., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. Spectre Attacks: Exploiting Speculative Execution. In IEEE Symposium on Security and Privacy (SP) (2019).
  • [36] Kotla, R., Alvisi, L., Dahlin, M., Clement, A., and Wong, E. Zyzzyva: Speculative Byzantine Fault Tolerance. ACM Transactions on Computer Systems (TOCS) 27, 4 (2010), 7:1–7:39.
  • [37] Kung, H. T., and Robinson, J. T. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (1981), 213–226.
  • [38] Kuo, A. M.-H. Opportunities and challenges of cloud computing to improve health care services. Journal of Medical Internet Research (JMIR) 13, 3 (2011).
  • [39] Kwon, A., Corrigan-Gibbs, H., Devadas, S., and Ford, B. Atom: Horizontally Scaling Strong Anonymity. In ACM Symposium on Operating System Principles (SOSP) (2017).
  • [40] Larson, P.-A., Blanas, S., Diaconu, C., Freedman, C., Patel, J. M., and Zwilling, M. High-performance Concurrency Control Mechanisms for Main-memory Databases. In Proceedings of the VLDB Endowment (PVLDB) (2011).
  • [41] Libre, M. FreeHealth EHR. https://freemedsoft.com/fr/. Accessed 2018-05-01.
  • [42] Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Fogh, A., Horn, J., Mangard, S., Kocher, P., Genkin, D., Yarom, Y., and Hamburg, M. Meltdown: Reading Kernel Memory from User Space. In USENIX Security Symposium (USENIX) (2018).
  • [43] Lorch, J., Parno, B., Mickens, J., Raykova, M., and Schiffman, J. Shroud: Ensuring Private Access to Large-Scale Data in the Data Center. In Conference on File and Storage Technologies (FAST) (2013).
  • [44] Lueks, W., and Goldberg, I. Sublinear Scaling for Multi-Client Private Information Retrieval. In Financial Cryptography and Data Security (FC) (2015).
  • [45] Maas, M., Love, E., Stefanov, E., Tiwari, M., Shi, E., Asanovic, K., Kubiatowicz, J., and Song, D. PHANTOM: Practical Oblivious Computation in a Secure Processor. In ACM Conference on Computer and Communications Security (CCS) (2013).
  • [46] Mehdi, S. A., Littley, C., Crooks, N., Alvisi, L., Bronson, N., and Lloyd, W. I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2017).
  • [47] Microsoft. Azure tables. https://azure.microsoft.com/en-us/services/storage/tables/.
  • [48] Microsoft. Documentdb - nosql service for json. https://azure.microsoft.com/en-us/services/documentdb/.
  • [49] Microsoft. SQL Server. https://www.microsoft.com/en-cy/sql-server/sql-server-2016.
  • [50] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. ARIES: A Transaction Recovery Method Supporting Fine-granularity Locking and Partial Rollbacks Using Write-ahead Logging. ACM Trans. Database Syst. 17, 1 (1992), 94–162.
  • [51] MongoDB. Agility, Performance, Scalibility. Pick three. https://www.mongodb.org/.
  • [52] Narayanan, A., and Shmatikov, V. Robust De-anonymization of Large Sparse Datasets. In IEEE Symposium on Security and Privacy (SP) (2008).
  • [53] Narayanan, A., and Shmatikov, V. Myths and fallacies of “personally identifiable information”. Commun. ACM 53, 6 (June 2010), 24–26.
  • [54] Nightingale, E. B., Veeraraghavan, K., Chen, P. M., and Flinn, J. Rethink the Sync. ACM Transactions on Computer Systems (TOCS) 26, 3 (2008), 6:1–6:26.
  • [55] Olumofin, F., and Goldberg, I. Privacy-preserving Queries over Relational Databases. In Privacy Enhancing Technologies Symposium (PETS) (2010).
  • [56] Oracle. InnoDB. https://dev.mysql.com/doc/refman/8.0/en/innodb-storage-engine.html/.
  • [57] Oracle. MySQL. https://www.mysql.com/.
  • [58] Oracle. MySQL Cluster. https://www.mysql.com/products/cluster/.
  • [59] Papadimitriou, A., Bhagwan, R., Chandran, N., Ramjee, R., Haeberlen, A., Singh, H., Modi, A., and Badrinarayanan, S. Big Data Analytics over Encrypted Datasets with Seabed. In USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2016).
  • [60] Papadimitriou, C. H. The Serializability of Concurrent Database Updates. Journal of the ACM (JACM) 26, 4 (1979), 631–653.
  • [61] Platform, G. C. Cloud spanner. http://cloud.google.com/spanner/.
  • [62] Popa, R. A., Redfield, C. M. S., Zeldovich, N., and Balakrishnan, H. CryptDB: Protecting Confidentiality with Encrypted Query Processing. In ACM Symposium on Operating System Principles (SOSP) (2011).
  • [63] Ports, D. R., Li, J., Liu, V., Sharma, N. K., and Krishnamurthy, A. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2015).
  • [64] PostgreSQL. http://www.postgresql.org/.
  • [65] Reddy, P. K., and Kitsuregawa, M. Speculative Locking Protocols to Improve Performance for Distributed Database Systems. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16, 2 (2004), 154–169.
  • [66] Reed, D. P. Implementing Atomic Actions on Decentralized Data (Extended Abstract). In ACM Symposium on Operating System Principles (SOSP) (1979).
  • [67] Reed, D. P. Implementing Atomic Actions on Decentralized Data. ACM Transactions on Computer Systems (TOCS) 1, 1 (1983), 3–23.
  • [68] Ren, L., Fletcher, C., Kwon, A., Stefanov, E., Shi, E., van Dijk, M., and Devadas, S. Constants Count: Practical Improvements to Oblivious RAM. In USENIX Security Symposium (USENIX) (2015).
  • [69] Sahin, C., Zakhary, V., El Abbadi, A., Lin, H., and Tessaro, S. TaoStore: Overcoming Asynchronicity in Oblivious Data Storage. In IEEE Symposium on Security and Privacy (SP) (2016).
  • [70] Server, M. S. Always Encrypted. https://www.microsoft.com/en-us/research/project/always-encrypted/.
  • [71] Sheff, I., Magrino, T., Liu, J., Myers, A. C., and van Renesse, R. Safe Serializable Secure Scheduling: Transactions and the Trade-Off Between Security and Consistency. In ACM Conference on Computer and Communications Security (CCS) (2016).
  • [72] Shi, E., Chan, T.-H. H., Stefanov, E., and Li, M. Oblivious RAM with Worst-Case Cost. In International Conference on The Theory and Application of Cryptology and Information Security (2011).
  • [73] Singel, R. Netflix spilled your Brokeback Mountain secret, lawsuit claims. Wired (Dec. 2009). http://www.wired.com/images_blogs/threatlevel/2009/12/doe-v-netflix.pdf.
  • [74] Stefanov, E., and Shi, E. ObliviStore: High Performance Oblivious Cloud Storage. In IEEE Symposium on Security and Privacy (SP) (2013).
  • [75] Stefanov, E., and Shi, E. ObliviStore: High Performance Oblivious Distributed Cloud Data Store. In Network and Distributed System Security Symposium (NDSS) (2013).
  • [76] Stefanov, E., Shi, E., and Song, D. Towards Practical Oblivious RAM.
  • [77] Stefanov, E., van Dijk, M., Shi, E., Fletcher, C., Ren, L., Yu, X., and Devadas, S. Path ORAM: An Extremely Simple Oblivious RAM Protocol. In ACM Conference on Computer and Communications Security (CCS) (2013).
  • [78] Su, C., Crooks, N., Ding, C., Alvisi, L., and Xie, C. Bringing Modular Concurrency Control to the Next Level. In ACM SIGMOD International Conference on Management of Data (SIGMOD) (2017).
  • [79] Transaction Processing Performance Council. The TPC-C home page. http://www.tpc.org/tpcc.
  • [80] Tu, S., Kaashoek, M. F., Madden, S., and Zeldovich, N. Processing Analytical Queries over Encrypted Data. In Proceedings of the VLDB Endowment (PVLDB) (2013).
  • [81] Tu, S., Zheng, W., Kohler, E., Liskov, B., and Madden, S. Speedy Transactions in Multicore In-memory Databases. In ACM Symposium on Operating System Principles (SOSP) (2013).
  • [82] Wang, F., Yun, C., Goldwasser, S., Vaikuntanathan, V., and Zaharia, M. Splinter: Practical Private Queries on Public Data. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2017).
  • [83] Weikum, G. Principles and Realization Strategies of Multilevel Transaction Management. ACM Trans. Database Syst. 16, 1 (1991), 132–180.
  • [84] Williams, P., Sion, R., and Carbunar, B. Building Castles out of Mud: Practical Access Pattern Privacy and Correctness on Untrusted Storage. In ACM Conference on Computer and Communications Security (CCS) (2008).
  • [85] Williams, P., Sion, R., and Tomescu, A. PrivateFS: A Parallel Oblivious File System. In ACM Conference on Computer and Communications Security (CCS) (2012).
  • [86] Xie, C., Su, C., Littley, C., Alvisi, L., Kapritsos, M., and Wang, Y. High-performance ACID via Modular Concurrency Control. In ACM Symposium on Operating System Principles (SOSP) (2015).
  • [87] Zhang, I., Sharma, N. K., Szekeres, A., Krishnamurthy, A., and Ports, D. R. K. Building Consistent Transactions with Inconsistent Replication. In ACM Symposium on Operating System Principles (SOSP) (2015).
  • [88] Zheng, W., Dave, A., Beekman, J. G., Popa, R. A., Gonzalez, J. E., and Stoica, I. Opaque: An Oblivious and Encrypted Distributed Analytics Platform. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2017).

Appendix A Ensuring Data Integrity in Obladi

As we described in §3, we assume the untrusted storage server is honest-but-curious. In many cases this is a very strong assumption that system operators may not be happy to make. We can remove this requirement with the use of Message Authentication Codes (MACs) and a trusted counter—used to ensure freshness—that persists across crashes. We describe this technique here.

When we assumed the server was honest-but-curious, we assumed it could deny service but would otherwise correctly respond to all queries. To remove this assumption while maintaining security, we must create a means of detecting when the storage server returns incorrect data, thus reducing such misbehavior to denial-of-service attacks. To do this, the proxy must verify that the returned value is the value most recently written by the proxy to the specified location.

We can guarantee authenticity using MACs. At initialization, the proxy generates a secret MAC key (in addition to its secret encryption key) and attaches a MAC to every piece of data it stores on the cloud server. This allows the proxy to verify that the cloud server did not modify the data or manufacture its own.

By themselves, MACs do not guarantee freshness or location correctness: the cloud server can provide an old copy of the data, or valid data from a different location, both of which carry valid MACs. We additionally need to include a unique identifier that the proxy can easily recompute. For data that is written at most once per epoch, this unique identifier can be the pair of epoch and ORAM location. Due to Ring ORAM's deterministic eviction algorithm, the proxy can compute the epoch during which any given block was most recently written knowing only the current epoch counter and the early reshuffle table.

Exactly one value is written multiple times per epoch: each read batch, of which there may be many per epoch, logs the accessed locations. This means the counter associated with those log writes must uniquely identify the read batch, not just the epoch. In fact, since every epoch has the same number of read batches, a read batch counter is sufficient for all values.
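A sketch of such an integrity tag is shown below; the HMAC construction and field layout are assumptions for illustration, not Obladi's exact on-disk format. The tag binds the ciphertext to its ORAM location and to the read-batch counter at which it was last written, so stale data or data returned from the wrong location fails verification.

```java
// Hypothetical sketch: an HMAC over (location, read-batch counter, ciphertext) that the
// proxy can recompute from Ring ORAM's deterministic eviction schedule and verify on read.
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class IntegrityTag {
    private final SecretKeySpec macKey;

    IntegrityTag(byte[] key) { this.macKey = new SecretKeySpec(key, "HmacSHA256"); }

    byte[] tag(long location, long readBatchCounter, byte[] ciphertext) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(macKey);
        mac.update(ByteBuffer.allocate(2 * Long.BYTES)
                .putLong(location).putLong(readBatchCounter).array());
        return mac.doFinal(ciphertext);
    }

    /** The proxy recomputes the expected counter for this location and verifies the tag. */
    boolean verify(long location, long expectedCounter, byte[] ciphertext, byte[] tag)
            throws Exception {
        return MessageDigest.isEqual(tag(location, expectedCounter, ciphertext), tag);
    }
}
```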

Handling Crashes The above modifications are sufficient to guarantee integrity if the proxy never crashes. When the proxy crashes, however, it needs information from the cloud storage to recover. To guarantee integrity—in particular freshness—of the recovery data, the epoch/read batch counter we describe above must persist in a trustworthy fashion across failures. Perhaps the easiest way to implement this requirement is to store the counter on a small amount of nonvolatile storage locally on the proxy, but any trustworthy and persistent storage mechanism is sufficient.

This, of course, raises the question of when to update this trustworthy persistent counter. Once the update occurs, a recovering proxy will expect the cloud storage to provide data associated with that counter value. This means that the counter must be updated after writing to cloud storage. Because a recovering proxy will be unaware of the newly-written data until the counter is updated, we do not consider the write complete until the counter is properly updated. As usual, if the proxy crashes while a write is in-progress, the write is simply rolled back.

As long as the storage server cannot learn anything from incomplete writes, our new strategy is entirely secure. Because the timing of Obladi's writes is completely deterministic and their locations are determined entirely by the locations of prior reads, the fact that a write has aborted does not inherently leak any information. The contents of the write can, however, leak information if we are not careful. Most data in the system is already encrypted, but one value is not: the read logs written during read batches. Previously we had no need to encrypt these, as the write operation was atomic and the cloud server would immediately learn all data contained in the write anyway. Now, however, the write is no longer atomic; the proxy can crash after sending data to the cloud server but before updating its trusted counter. In this case the storage server may withhold that data on recovery without detection and learn whether the proxy accesses the same locations after recovery completes. To fix this leak, we encrypt the read batch logs in the cloud and update the counter after writing the log but before reading any values. That way the cloud storage gains no information about what data will be read until after the write is complete, at which point the proxy will always replay the reads if a crash occurs. This removes the leakage.
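The ordering constraint can be summarized by the following sketch (the interfaces are hypothetical): the encrypted read log is written first, the trusted counter is durably advanced second, and only then are the logged paths actually read, so an incomplete write never reveals which paths a future batch will touch.

```java
// Hypothetical sketch of the log-then-counter-then-read ordering described above.
import java.util.List;

final class ReadBatchCommit {
    interface CloudStore { void putEncryptedLog(long batchId, byte[] encryptedLog); }
    interface TrustedCounter { long current(); void incrementDurably(); }  // survives proxy crashes
    interface PathReader { void readPaths(List<Long> paths); }

    static void executeReadBatch(CloudStore store, TrustedCounter counter,
                                 PathReader reader, byte[] encryptedLog, List<Long> paths) {
        long nextBatch = counter.current() + 1;
        store.putEncryptedLog(nextBatch, encryptedLog);  // 1. durably log the encrypted paths
        counter.incrementDurably();                      // 2. only now is the write "complete"
        reader.readPaths(paths);                         // 3. finally issue the physical reads
    }
}
```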

As we will see in Appendix B, these modifications are sufficient to guarantee both confidentiality and integrity (though obviously not availability) even against an arbitrarily malicious cloud storage server.

Appendix B Formal Security

We now provide formal security definitions and proofs for Obladi. As we discuss in §9, we use the Universal Composability (UC) framework [14]. The UC framework requires us to specify an ideal functionality F that defines what it means for Obladi to be secure. We must then prove that, for every possible adversarial algorithm A specifying the behavior of the storage server, we can simulate A's behavior when interacting only with F.

We prove security of the scheme including the modification in Appendix A and do not assume the cloud storage provider is trusted for integrity. As the MACs and counters are only used to verify integrity and freshness of data, they are unnecessary if the cloud server is being honest. As we will see below, removing them—as we do in our implementation—does not impact security in this case.

We also noted in Appendix A that the proxy requires a trusted epoch counter that persists across crashes. This could be implemented as an integer in local non-volatile storage that the proxy updates with each epoch, by trusting the cloud storage for integrity and saving the counter there, or by other means. We abstract away this detail by giving the protocol access to an ideal functionality that maintains this counter.

B.1 Ideal Functionality

We begin by noting that Obladi's proxy acts as a trusted central coordinator that performs publicly known logic on private data. As this is essentially the role played by any ideal functionality, we simply subsume the proxy into F. Moreover, some of the proxy's behavior, such as the fact that it deduplicates and caches accesses and pads under-full batches, is public information, meaning F can explicitly perform exactly the same operations.

In §5 we describe the proxy as consisting of a concurrency control unit and a data manager, which itself contains a batch manager and an ORAM executor. As the concurrency control and batch management functionalities do not inherently leak any information, we define F in terms of those operations. In particular, we let the proxy logic represent this functionality: it is defined as providing the exact functionality of the concurrency control unit and batch manager as described in §5 and §6. The proxy logic has the following ways to interface with F:

  • F can supply the proxy logic with an input from a client (start, read, write, or commit).

  • The proxy logic can produce a read batch of logical data blocks. The batch need not be full, meaning it may contain fewer than the maximum number of reads for a batch. F can then respond with the requested blocks.

  • The proxy logic can produce a write batch of logical data blocks. The batch need not be full. F can then respond confirming that the writes have completed.

  • The proxy logic can specify that an epoch has ended and transactions should commit. F can then respond with confirmation.

  • F can clear the proxy logic's internal state, representing a crash.

The proxy logic can additionally send messages directly to clients.

Modeling Crashes In the real system the proxy can crash at any time. As all state except the cryptographic keys (and possibly trusted counters) is considered volatile, it does not matter when during a local operation the proxy crashes, as every piece of that operation is lost regardless. We can therefore simplify the ideal functionality by allowing for crashes both between requests and immediately prior to any operation within a request that either leaves the proxy (e.g., writing to cloud storage) or persists across crashes (e.g., updating the trusted epoch counter).

To model any possible crash, we control the timing of crashes through a Crash Client. F queries the Crash Client immediately prior to any relevant action and waits for a reply. The Crash Client then waits for a prompt from the environment, which it forwards to F, telling it to proceed or crash. Additionally, the Crash Client, again at the prompting of the environment, can issue a "crash" command independently between requests.

We provide the full specification of F in Algorithm 1, which references the proxy logic. For notational clarity, we do not explicitly specify every call to the Crash Client. Instead, any operation marked with † notifies the Crash Client before executing and crashes if instructed. Note that it is possible to crash while recovering from a crash.

Data: local storage D
Data: counters e (epoch), b (read batch)
Initialize
       Initialize the proxy logic;
       Begin epoch: e := 0, b := 0;
end
(input from client c): Forward the input to the proxy logic;
("read-batch" R from the proxy logic):
       † Send "read-batch-init" to the storage server, wait for acknowledgment;
       † b := b + 1;
       † Send "read-batch-read" to the storage server, wait for acknowledgment;
       Read the blocks of R from D;
       Respond to the proxy logic with the results;
("write-epoch" W from the proxy logic):
       † Send "write-epoch" to the storage server, wait for acknowledgment;
       † e := e + 1, b := 0;
       Write W to D;
       Confirm to the proxy logic that the write/epoch completed;
("crash" from the Crash Client): Execute crashRecover;
function crashRecover
       Send "crash" to the storage server;
       Clear the internal state of the proxy logic;
       Roll back writes to D since the beginning of epoch e;
       b := 0;
end
† Before executing the operation, notify the Crash Client. On a response of "crash," abort the operation and invoke crashRecover; otherwise proceed.
Algorithm 1: Ideal functionality F using the proxy logic.

B.2 Security Lemmas

In order to prove the security of Obladi, we rely on two lemmas, which we alluded to in §9.

Lemma (Caching and Deduplication). Let B be any set of logical reads or writes selected independently of the current ORAM position map, and let B′ be the set of accesses that results from applying the proxy batch manager's caching and deduplication logic to B. The set of physical accesses needed to realize B′ is identically distributed to the set of physical accesses needed to realize a uniformly random set of logical accesses of the same size.

Proof.

Since B is selected independently of the current position map in the ORAM, Ring ORAM guarantees that the set of physical accesses needed to realize B is identically distributed to that of a uniformly random set of logical reads or writes. B′ is simply B with some elements removed, so we claim that the removed elements form an unbiased sample. Since removing an unbiased sample from a distribution does not change the distribution, this is sufficient.

We first note that Ring ORAM guarantees that any independently-selected logical access results in physical accesses sampled independently from the following distribution. First sample a uniformly random path in the tree. Then, for each bucket in that path, sample a uniformly random block from among those not read since the bucket was last written. Finally, read all selected blocks.

In Ring ORAM, whenever a block is read or written, it is immediately remapped to an independent, uniformly random path in the tree that determines what will be read the next time it is accessed. The proxy batch manager's caching and deduplication logic removes access requests for any block previously accessed in this epoch; each of those blocks was mapped to a new, independently uniform random path when it was accessed. Moreover, when an epoch ends, the cache is completely flushed, so no (potentially biased) caching or deduplication carries over across epochs.

Thus the sample of physical accesses removed by paring B down to B′ must be unbiased, so B′ must result in a uniformly random set of physical access paths. ∎
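For intuition, the per-epoch deduplication the lemma reasons about might look like the following sketch (hypothetical names): a block accessed earlier in the epoch has already been remapped to a fresh uniform path, so a repeated request is answered from the proxy cache rather than by touching storage again, and the cache is flushed at every epoch boundary.

```java
// Hypothetical sketch of per-epoch deduplication: repeated accesses to a block within an
// epoch are served from the proxy cache, and the cache is cleared when the epoch ends.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class EpochDeduplicator {
    private final Set<Long> accessedThisEpoch = new HashSet<>();
    private final Map<Long, byte[]> cache = new HashMap<>();

    /** Returns null if a physical ORAM access is needed; otherwise the cached value. */
    byte[] lookup(long blockId) {
        return accessedThisEpoch.contains(blockId) ? cache.get(blockId) : null;
    }

    void recordAccess(long blockId, byte[] value) {
        accessedThisEpoch.add(blockId);
        cache.put(blockId, value);
    }

    /** Flushing at the epoch boundary removes any bias across epochs. */
    void endEpoch() {
        accessedThisEpoch.clear();
        cache.clear();
    }
}
```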

Lemma (Parallel ORAM). The set of parallel physical data operations performed by the proxy ORAM executor over one epoch (as described in §7) is completely determined by the set of sequential physical accesses required to perform the same logical actions in Ring ORAM (plus a single write to the durability store).

Proof.

We note that, as described in §7, the proxy performs all reads within an epoch before any writes (aside from the durability store). By construction, it ensures that each physical block that would be read at least once within an epoch in a fully sequential access is read exactly once in that epoch, and no other physical blocks are ever read (excluding crash recovery).

This is enforced by holding a record of every block that has been read this epoch and then performing the reads of the sequential access, but skipping blocks that have already been read. Additionally, whenever an evict path operation would happen, the proxy reads every unread block from each bucket along that path, thus marking them as read. As the timing of evict paths is determined by how many data accesses have happened and their locations are deterministic, this enforcement mechanism is dependent only on the physical blocks accessed, not in any way on the data held in those blocks.

Similarly, each block that would be written at least once in a sequentially-processed epoch is written exactly once at the end of the epoch. This is done by buffering writes in the proxy, allowing one buffered write of a physical block to overwrite any previous unflushed writes of that block. Then when the epoch ends, the proxy flushes all buffered writes. Again, the set of blocks being written is determined entirely by the physical access pattern of the sequential operation.

Finally, a fixed amount of data is written to the durability store before each read batch, and the entire durability store is written with each write batch. This means that in normal operation, the location and timing of all reads and writes are determined by only the physical operations needed to perform the epoch operations sequentially and some extra completely deterministic operations.

On crash recovery, the proxy reads the durability store and rereads all paths in the aborted epoch. This, again, is based entirely on physical access patterns.

Hence all physical read and write operations within a parallelized epoch are determined entirely by the physical data operations needed to perform that epoch sequentially. ∎

B.3 Proof of Security

We now prove that the Obladi protocol (with access to the ideal counter functionality) is secure with respect to the ideal functionality F described in Algorithm 1. Let Real_A denote the full transcript of A (including its inputs and randomness) when interacting with the real protocol, and let Ideal_S[A] denote the transcript produced by S[A] when run in the ideal world, interacting with F.

Theorem. Assume the encryption scheme used in Obladi is semantically secure and the MACs are existentially unforgeable. Then for all probabilistic polynomial-time (PPT) adversaries A and environments Z, there is a simulator S[A] such that, for all PPT distinguishers D, there is some negligible function ε satisfying

    | Pr[D(Real_A) = 1] − Pr[D(Ideal_S[A]) = 1] | ≤ ε(λ),

where λ is the security parameter.

Proof.

This proof follows from a series of hybrid simulators, each of which is indistinguishable from the previous.

We define hybrids H0 through H4. H0 operates in the real world, with the simulator being a "dummy" that passes all messages through to A unmodified. H1 maintains two ORAMs that are identical except for the MACs, one maintained by A and the other by the simulator S1. H2 replaces all data in A's ORAM with random dummy data, independent of the actual data. H3 replaces the access pattern in A's ORAM with random data accesses. Finally, H4 uses S[A] in the ideal world and no longer maintains its own ORAM.

Hybrid H0 contains a dummy simulator that passes messages between A and the proxy unchanged. This produces a transcript identical to the real world.

In hybrid H1, the simulator S1 passes all messages through to A, but also maintains its own copy of the ORAM and simultaneously processes requests internally. On initialization, S1 generates its own MAC key according to the same distribution as Obladi's MAC key. It then replaces the MACs of all data sent to A with valid MACs on the same data under this new key. When A responds to a request, S1 checks the MACs on the data. If they are correct, it forwards the (correct) response from its own ORAM with the original MACs. If they are incorrect, it responds with a failure message. If A's response is correct, so too will be S1's. If A's MACs do not verify, the real protocol fails as well, so a failure message produces the same result. If A's response is wrong but the MACs verify, A must have forged a MAC, since the MACs cover the data, position, and epoch counter, and no two pieces of data are ever given the same position and epoch counter. Moreover, because S1 has access to a trusted epoch counter via the ideal counter functionality, it can properly verify that the data has the correct epoch counter, even after crashes. Thus, if S1 accepts an incorrect response with non-negligible probability, we can use A to forge a MAC with non-negligible probability. Hence H1 is computationally indistinguishable from H0.

Note that the MACs are only used to check that A provided correct data. If the storage server is assumed to be honest, this will always be the case and we can eliminate the MACs entirely (in which case H0 and H1 become identical).

Hybrid H2 replaces all data blocks provided to A with valid encryptions of random data, along with MACs on those encryptions. It otherwise passes on requests unchanged, including the location and timing of reads and writes. S2 continues to furnish responses to the proxy's queries using its internal ORAM with the original data, checking MACs according to the same scheme as in H1. S2 then outputs A's transcript. As all data is encrypted, the only difference between H1 and H2 is the contents of the ciphertexts, and by assumption the encryption scheme is semantically secure. Hence H1 and H2 must be computationally indistinguishable.

Hybrid H3 replaces all data requests to A with properly formatted requests for randomly chosen data.

When S3 receives a location log for a read batch, it logs an encryption of random (unrelated) data with A. When S3 receives the read instruction for a read batch, it first selects a random set of dummy paths of the batch size. It then requests that A perform the proper parallel read operation for that dummy data. If A replies with the data and the MACs verify, S3 performs the actual reads on its separate ORAM with real data and returns the real data to the proxy.

When S3 is notified of the end of an epoch and given the associated write batch, it determines which physical blocks to write using Ring ORAM's deterministic write sequence, based on the total number of operations (both reads and writes) in an epoch. It then performs proper parallel writes of new encryptions of dummy data to each of those locations. If A replies with confirmed writes, S3 performs the originally specified operations on its separate ORAM and confirms success to the proxy.

Finally, if S3 receives a request to handle a proxy crash at epoch e and batch b, it queries A as per the crash recovery protocol for that epoch and batch. When A provides valid (MAC-verifying) read path logs for any batches in this epoch, S3 provides the associated logs to the proxy. When the proxy issues redo read requests, S3 issues the same requests it issued the first time to A for the associated batches. Because S3 did not crash, it can retain which paths were read without having to store them explicitly. It is possible that the last read batch requested during recovery corresponds to a read that was never executed, in which case S3 generates a new random read batch and executes that instead. If A responds correctly, S3 responds to the proxy's requests.

By the Caching and Deduplication lemma, the physical operations needed to process all real requests in a given epoch sequentially have a distribution identical to that of the sequential accesses needed to process the random requests chosen by S3. By the Parallel ORAM lemma, the parallelization process depends only on the sequential physical access pattern, meaning it can be applied in the same way to S3's random operations as to the real operations provided by the proxy. Thus the operations S3 requests of A are identically distributed to those the proxy requests of S3 when there are no crashes.

When a crash occurs, the recovery procedure is guaranteed to reread all previously read data, and any future reads must have independently random paths. This is because S3 does not even generate the random paths to read until the read request is issued, by which point the persistent batch counter has been updated. So if a crash does occur, it will redo any previous reads, and future operations are treated as regular read/write batches with the same (independent) distribution. Since these are the only differences between H2 and H3, the two must produce identical distributions.

Hybrid H4 now interacts with the ideal functionality F and no longer maintains its own internal ORAM copy, keeping only the data necessary to perform actions on A's, including the new MAC and encryption keys. The only data S3 used to compute requests for A was the timing of batches and crash recoveries, and the epoch and batch counters during recovery. As F explicitly provides all of that information, S4 is able to provide A with an identical view. Note that on crash recovery, this identical view requires completing a crash-recovery epoch, which S4 can do by creating an appropriate number of read and write operations, as it would in H3. This means that H3 and H4 are identically distributed.

Thus we see that H0 corresponds to the real world, H4 corresponds to the ideal world, and each sequential pair of hybrids produces computationally indistinguishable transcripts. It must therefore be the case that H0 and H4 produce computationally indistinguishable transcripts, so the Obladi protocol realizes F. ∎