Dynamic proofs of retrievability with low server storage

07/24/2020 · by Gaspard Anthoine, et al. · Université Grenoble Alpes and United States Naval Academy

Proofs of Retrievability (PoRs) are protocols which allow a client to store data remotely and to efficiently ensure, via audits, that the entirety of that data is still intact. A dynamic PoR system also supports efficient retrieval and update of any small portion of the data. We propose new, simple protocols for dynamic PoR that are designed for practical efficiency, trading decreased persistent storage for increased server computation, and show that this tradeoff is in fact inherent via a time-space lower bound for any PoR scheme. Notably, ours is the first dynamic PoR which does not require any special encoding of the data stored on the server, meaning it can be trivially composed with any database service or with existing techniques for encryption or redundancy. Our implementation and deployment on Google Cloud Platform demonstrates our solution is scalable: for example, auditing a 1TB file takes 16 minutes at a monetary cost of just 0.23 USD. We also present several further enhancements, reducing the amount of client storage, or the communication bandwidth, or allowing public verifiability, wherein any untrusted third party may conduct an audit.


1 Introduction

1.1 The need for integrity checks

While various computing metrics have accelerated and slowed over the last half-century, one which undeniably continues to grow quickly is data storage. One recent study estimated the world's storage capacity at 4.4ZB, growing at a rate of 40% per year [9]. Another study group estimates that by 2025, half of the world's data will be stored remotely, and half of that will be in public cloud storage [31].

As storage becomes more vast and more outsourced, users and organizations need ways to ensure the integrity of their data – that the service provider continues to store it, in its entirety, unmodified. Customers may currently rely on the reputations of large cloud companies like IBM Cloud or Amazon AWS, but even those can suffer data loss events [2, 20], and as the market continues to grow, new storage providers without such long-standing reputations need cost-effective ways to convince customers their data is intact.

This need is especially acute for the growing set of decentralized storage networks (DSNs), such as Filecoin, Storj, SAFE Network, Sia, and PPIO, that act to connect users who need their data stored with providers (“miners”) who will be paid to store users’ data. In DSNs, integrity checks are useful at two levels: from the customer who may be wary of trusting blockchain-based networks, and within the network to ensure that storage nodes are actually providing their promised service. Furthermore, storage nodes whose sole aim is to earn cryptocurrency payment have a strong incentive to cheat, perhaps by deleting user data or thwarting audit mechanisms.

1.2 Existing solutions

The research community has developed a wide array of solutions to the remote data integrity problem over the last 15 years. Here we merely summarize the main lines of work and highlight some shortcomings that this paper seeks to address; see Section 7 for a more complete discussion and comparison.

Provable Data Possession (PDP).

PDP audits [24, 16, 35, 37] are practically efficient methods to ensure that a large fraction of data has not been modified. They generally work by computing a small tag for each block of stored data, then randomly sampling a subset of data blocks and corresponding tags, and computing a check over that subset.

Because a server that has lost or deleted a constant fraction of the file will likely be unable to pass an audit, PDPs are useful in detecting catastrophic or unintentional data loss. They are also quite efficient in practice. However, a server who deletes only a few blocks is still likely to pass an audit, so the security guarantees are not complete, and may be inadequate for critical data storage or possibly-malicious providers.
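To make that sampling guarantee concrete, here is a small back-of-the-envelope sketch in Python, with illustrative numbers not taken from any particular PDP paper, of the detection probability when an audit spot-checks a random subset of blocks:

    # Illustrative calculation: probability that an audit which spot-checks
    # `checked` random blocks detects a server that corrupted or deleted
    # `corrupted` of the `n_blocks` stored blocks.

    def detection_probability(n_blocks: int, corrupted: int, checked: int) -> float:
        """Chance that at least one of `checked` uniformly random distinct
        blocks falls among the `corrupted` ones (sampling without replacement)."""
        p_miss = 1.0
        for i in range(checked):
            p_miss *= (n_blocks - corrupted - i) / (n_blocks - i)
        return 1.0 - p_miss

    n = 1_000_000                      # one million blocks stored
    for corrupted in (1, 100, 10_000): # a single block, 0.01%, or 1% corrupted
        p = detection_probability(n, corrupted, checked=500)
        print(f"corrupted={corrupted:>6}: detection probability {p:.4f}")
    # A 1% corruption is caught almost surely, but a single corrupted block is
    # missed with probability about 99.95% -- exactly the gap that PoR schemes close.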

Proof of Retrievability (PoR).

PoR audits, starting with [5], have typically used techniques such as error-correcting codes, and more recently Oblivious RAM (ORAM), in order to obscure from the server where pieces of the file are stored [27, 13]. Early PoR schemes did not provide an efficient update mechanism to alter individual data blocks, but more recent dynamic schemes have overcome this shortcoming [34, 10].

A successful PoR audit provides a strong guarantee of retrievability: if the server altered many blocks, this will be detected with high probability, whereas if only few blocks were altered or deleted, then the error correction means the file can still likely be recovered. Therefore, a single successful audit ensures with high probability that the

entire file is still stored by the server.

The downside of this stronger guarantee is that PoRs have typically used more sophisticated cryptographic tools than PDPs, and in all cases we know of require multiple times the original data size for persistent remote storage. This is problematic from a cost standpoint: if a PoR based on ORAM requires perhaps 10x storage on the cloud, this cost may easily overwhelm the savings cloud storage promises to provide.

Proof of Replication (PoRep) and others.

While our work mainly falls into the PoR/PDP setting, it also has applications to more recent and related notions of remote storage proofs.

Proofs of space were originally proposed as an alternative to the computation-based puzzles in blockchains and anti-abuse mechanisms [4, 14], and require verifiable storage of a large amount of essentially-random data. These are not applicable to cloud storage, where the data must obviously not be random.

A PoRep scheme (sometimes called Proof of Data Reliability) aims to combine the ideas of proof of space and PoR/PDP in order to prove that multiple copies of a data file are stored remotely. This is important as, for example, a client may pay for 3x redundant storage to prevent data loss, and wants to make sure that three actual copies are stored in distinct locations. Some PoRep schemes employ slow encodings and time-based audit checks; the idea is that a server does not have enough time to re-compute the encoding on demand when an audit is requested, or even to retrieve it from another server, and so must actually store the (redundantly) encoded file [3, 18, 38, 11]. The Filecoin network employs this type of verification. A different and promising approach, not based on timing assumptions, has recently been proposed by [12]. An important property of many recent PoRep schemes is public verifiability, that is, the ability for a third party (without secrets) to conduct an audit. This is crucial especially for distributed storage networks (DSNs).

Most relevant for the current paper is that most of these schemes directly rely on an underlying PDP or PoR in order to verify encoded replica storage. For example, [12] states that their protocol directly inherits any security and efficiency properties of the underlying PDP or PoR.

We also point out that, in contrast to our security model, many of these works are based on a rational actor model, where it is not in a participant’s financial interest to cheat, but a malicious user may break this guarantee, and furthermore that most existing PoRep schemes do not support dynamic updates to individual data blocks.

1.3 Our Contributions

We present a new proof of retrievability which has the following advantages compared to existing PDPs and PoRs:

Near-optimal persistent storage. The best existing PoR protocols that we could find require between 1.5n and 10n bytes of cloud storage to support audits of an n-byte data file, making these schemes impractical in many settings. Our new PoR requires only n + o(n) bytes of persistent storage.

Simple cryptographic building blocks. Our basic protocol relies only on small-integer arithmetic and a collision-resistant hash function, making it very efficient in practice. Indeed, we demonstrate in practice that 1TB of data can be audited in 16 minutes at a monetary cost of just 0.23 USD.

Efficient partial retrievals and updates. That is, our scheme is a dynamic PoR, suitable to large applications where the user does not always wish to re-download the entire file.

Provable retrievability from malicious servers. Similar to the best PoR protocols, our scheme supports data recovery (extraction) via rewinding audits. This means, in particular, that there is only a negligible chance that a server can pass a single audit and yet not recover the entirety of stored data.

(Nearly) stateless clients. With the addition of a symmetric cipher, the client(s) in our protocol need only store a single decryption key and hash digest, which means multiple clients may easily share access (audit responsibility) on the same remote data store.

Public verifiability. We show an extension to our protocol, based on the difficulty of discrete logarithms in some group, that allows any third party to conduct audits with no shared secret.

Importantly, because our protocols store the data unencoded on the server, they can trivially be used within or around any existing encryption or replication scheme, including most PoRep constructions. We can also efficiently support arbitrary server-side applications, such as databases or file systems with their own encoding needs.

The main drawback of our schemes is that, compared to existing PoRs, they have a higher asymptotic complexity for server-side computation during audits, and (in some cases) higher communication bandwidth during audits as well. However, we also provide a time-space lower bound that proves any PoR scheme must make a tradeoff between persistent space and audit computation time.

Furthermore, we demonstrate with a complete implementation and deployment on Google Cloud Platform that the tradeoff we make is highly beneficial in cloud settings. Intuitively, a user must pay for the computational cost of audits only when they are actually happening, maybe a few times a day, whereas the extra cost of (say) 5x persistent storage must be paid all the time, whether the client is performing audits or not.

1.4 Organization

The rest of the paper is structured as follows:


  • Section 2 defines our security model, along the lines of most recent PoR works.

  • Section 3 contains our proof of an inherent time-space tradeoff in any PoR scheme.

  • Section 4 gives an overview and description of our basic protocol, with detailed algorithms and security proofs delayed until Section 6.

  • Section 5 discusses the results of our open-source implementation and deployment on Google Cloud Platform.

  • Section 7 gives a detailed comparison with prior work.

2 Security model

We define a dynamic PoR scheme as consisting of the following five algorithms, run between a client holding a small state and a server holding a (large) state. Our definition is the same as given by [34], except that we follow [24] and include the Extract algorithm in the protocol explicitly.

A subtle but important point to note is that, unlike the first four algorithms, Extract is not really intended to be used in practice. In typical usage, a cooperating and honest server will pass all audits, and the normal Read algorithm would be used to retrieve any or all of the data file reliably. The purpose of Extract is mostly to prove that the data is recoverable from a series of random, successful audits, and hence that a server which has deleted even one block of data has only a negligible chance of passing a single audit.

The client may use random coins for any algorithm; at a minimum, the Audit algorithm must be randomized in order to satisfy retrievability non-trivially.

  • Init: On input of the security parameter and the database M, consisting of N bits arranged in blocks, outputs the initial client state and the initial server state.

  • Read: On input of an index i, the client state, and the server state, outputs the data stored in block i of the database, or reject.

  • Write: On input of an index i, new data d, the client state, and the server state, outputs a new client state and a new server state such that block i of the database now equals d, or reject.

  • Audit: On input of the client state and the server state, outputs a successful transcript or reject.

  • Extract: On input of a list of independent Audit transcripts, outputs the database M. The number of required transcripts must be a polynomially-bounded function of the security parameter and the database size. (A minimal interface sketch follows this list.)
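For concreteness, here is a minimal sketch of that five-algorithm interface in Python; the type aliases and method signatures are illustrative, not part of the formal definition:

    from abc import ABC, abstractmethod
    from typing import Optional

    ClientState = bytes   # small client state (e.g., keys and a Merkle root)
    ServerState = bytes   # large server state (the stored file plus metadata)
    Transcript = bytes    # serialized audit transcript

    class DynamicPoR(ABC):
        """Hypothetical interface mirroring the five algorithms of the model."""

        @abstractmethod
        def init(self, security_param: int, database: bytes) -> tuple[ClientState, ServerState]:
            """Set up client and server states for the given database."""

        @abstractmethod
        def read(self, index: int, cst: ClientState, sst: ServerState) -> Optional[bytes]:
            """Return block `index`, or None to signal reject."""

        @abstractmethod
        def write(self, index: int, data: bytes, cst: ClientState,
                  sst: ServerState) -> Optional[tuple[ClientState, ServerState]]:
            """Update block `index`; return the new states, or None to signal reject."""

        @abstractmethod
        def audit(self, cst: ClientState, sst: ServerState) -> Optional[Transcript]:
            """Run a randomized audit; return the transcript if it succeeds."""

        @abstractmethod
        def extract(self, security_param: int, transcripts: list[Transcript]) -> bytes:
            """Rebuild the database from sufficiently many successful audit transcripts."""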

2.1 Correctness

A correct execution of the algorithms by an honest client and server results in every audit being accepted and in reads recovering the last updated value of the database. More formally:

Definition 1 (Correctness).

For any parameters, there exists a predicate IsValid such that, for any database M of N bits, the client and server states output by Init satisfy IsValid. Furthermore, for any pair of client and server states satisfying IsValid and any index i within the database, we have:

  • Read on index i returns the current value of block i of the database;

  • Write on index i with data d outputs new states which again satisfy IsValid, where block i of the database now equals d and every other block is unchanged;

  • Audit outputs a successful transcript;

  • For sufficiently many audits run with independent randomness, Extract applied to the resulting transcripts returns the current database M with overwhelming probability.

Note that, even though the client may use random coins in the algorithms, a correct PoR by this definition should have no chance of returning reject in any Read, Write or Audit with an honest client and server.

2.2 Authenticity and attacker model

The authenticity requirement stipulates that the client can always detect (except with negligible probability) if any message sent by the server deviates from honest behavior. For the adaptive version of authenticity, we use the following game, the same as in [34], played between an observer, a potentially malicious server, and an honest server:

  1. The adversary chooses an initial memory M. The observer runs Init and sends the initial memory layout to both the malicious server and the honest server.

  2. For a polynomial number of steps, the adversary adaptively picks an operation, where each operation is either a Read, Write or Audit. The observer executes the operation with both the malicious server and the honest server.

  3. The adversary is said to win the game if any message sent by the malicious server differs from that of the honest server and the observer did not output reject.

Definition 2 (Authenticity).

A PoR scheme satisfies adaptive authenticity if no polynomial-time adversary has more than negligible probability of winning the above security game.

2.3 Retrievability

Intuitively, the retrievability requirement stipulates that whenever a malicious server can pass the audit test with high probability, the server must know the entire memory contents. To model this, [10] use black-box rewinding access: from the state of the server before any passed audit, there must exist an extractor algorithm that can reconstruct the complete correct database. As in [34], we insist furthermore that the extractor does not use the complete server state, but only the transcripts from successful audits. In the following game, note that the observer running the honest client algorithms may only update its state during Write operations, and hence the Audit algorithms are independently randomized from the client side; we make no assumptions about the state of the adversary.

  1. The adversary chooses an initial database M. The observer runs Init and sends the initial memory layout to the adversary.

  2. For a polynomial number of steps, the adversary adaptively chooses an operation, where each operation is either a Read, Write or Audit. The observer executes the respective algorithms with the adversary, updating its state (and the reference copy of the database) according to the Write operations specified.

  3. The observer runs a number of Audit algorithms with the adversary and records the transcripts of those which did not return reject.

  4. The adversary is said to win the game if a non-negligible fraction of these audits succeed, and yet Extract applied to the recorded successful transcripts fails to return the current database M.

Definition 3 (Retrievability).

A PoR scheme satisfies retrievability if no polynomial-time adversary has more than negligible probability in winning the above security game.

3 Time-space tradeoff lower bound

As we have seen, the state of the art in Proofs of Retrievability schemes consists of some approaches with a low audit cost but a high storage overhead (e.g., [24, 34, 10]) and some schemes with a low storage overhead but high computational cost for the server during audits (e.g., [5, 32, 33]).

Before presenting our own constructions (which fall into the latter category) we prove that there is indeed an inherent tradeoff in any PoR scheme between the amount of extra storage and the cost of performing audits. By extra storage here we mean exactly the number of extra bits of persistent memory, on the client or server, beyond the bit-length of the original database being represented.

Theorem 4 below shows that, for any PoR scheme with sub-linear audit cost, the extra persistent space and the audit time cannot both be small: roughly, their product must grow nearly linearly with the database size.

None of the previous schemes, nor those which we present, make this lower bound tight. Nonetheless, it demonstrates that a "best of all possible worlds" scheme with, say, constant-size extra storage and polylogarithmic audit cost for an arbitrary n-bit database, is impossible.

The proof is by contradiction, presenting an attack on an arbitrary PoR scheme which does not satisfy the claimed time-space lower bound. Our attack consists of flipping k randomly-chosen bits of the storage. First we show that k is small enough that the audit probably does not examine any of the flipped bits, and still passes. Next we see that k is large enough so that, for some choice of the bits being represented, flipping k bits will, with high probability, make it impossible for any algorithm to correctly recover the original data. This is a contradiction, since the audit will pass even though the data is lost.

Readers familiar with coding theory will notice that the second part of the proof is similar to Hamming's bound for the minimal distance of a block code. Indeed, we can view the original n-bit data as a message, and the storage using s extra bits of memory as an (n+s)-bit codeword. A valid PoR scheme must be able to extract (decode) the original message from the (n+s)-bit string, or else should fail any audit.
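The following rough sketch in Python, with purely illustrative parameters, shows the two quantities the proof balances: the chance that an audit probing t bits misses all k flipped bits, and a Hamming-style count of flip patterns, which must not exceed the extra state available for recovery (constants and the audit-pass fraction are ignored here):

    from math import comb, log2

    n = 2**24          # database size in bits (illustrative)
    s = n + 2**14      # total persistent server bits: data plus a little extra (illustrative)
    c = 2**8           # persistent client bits (illustrative)
    t = 2**9           # bits probed by one audit (illustrative)

    k = s // (4 * t)   # number of random bit flips used by the attack

    # Probability the audit reads none of the k flipped positions
    # (simple independent approximation over the t probed bits).
    p_miss = (1 - k / s) ** t
    print(f"k = {k}, Pr[audit misses every flip] ~ {p_miss:.3f}")

    # Rough counting in the spirit of the Hamming-bound analogy: if the number
    # of distinct k-flip patterns exceeds 2^((s + c) - n), some database cannot
    # be recovered from the corrupted storage plus the client state.
    patterns_bits = log2(comb(s, k))
    slack_bits = (s + c) - n
    print(f"log2(#patterns) = {patterns_bits:.0f} bits, extra state = {slack_bits} bits")
    print("attack applies" if patterns_bits > slack_bits else "parameters too generous")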

Theorem 4 (Appendix A).

Consider any Proof of Retrievability scheme which stores an arbitrary database of n bits, uses at most s bits of persistent memory on the server and c bits of persistent memory on the client, and requires at most t steps to perform an audit. Then these quantities cannot all be favorable simultaneously: if the audit cost t is sub-linear in n, the extra persistent space s + c − n must grow nearly in proportion to n/t. The precise bound and constants are given in Appendix A.

4 Retrievability via verifiable computing

We first present a simple version of our PoR protocol. This version contains the main ideas of our approach, namely, using matrix-vector products during audits to prove retrievability. It also makes use of Merkle hash trees during reads and updates to ensure authenticity.

This protocol requires essentially no persistent server storage beyond the unmodified data file itself (plus a small Merkle hash tree), which is a significant improvement over the multiple-fold extra persistent storage of existing PoR schemes, and is the main contribution of this work. The costs of our Read and Write algorithms are similar to existing work, but we incur an asymptotically higher cost for the Audit algorithm, namely O(√N) communication bandwidth and O(N) server computation time. We demonstrate in the next section that this tradeoff between persistent storage and Audit cost is favorable in cloud computing settings for realistic-size databases.

Later, in Section 6, we give a more general protocol and prove it secure according to the PoR definition in Section 2. That generalized version shows how to achieve constant-size persistent client storage (a single key and hash digest) with the same costs, or alternatively to trade arbitrarily small communication bandwidth during Audits for increased client persistent storage and computation time.

4.1 Overview

A summary of our four algorithms is shown in Table 1, where dashed boxes are the classical, Merkle hash tree authenticated, remote read/write operations.

Our idea is to use verifiable computing schemes as, e.g., proposed in [17]. Our choice is to treat the data as a square matrix of dimension roughly √N × √N. This allows the matrix multiplication verification described in [19] to be used as the computational method for the audit algorithm.

Crucially, this does not require any additional metadata; the database is stored as-is on disk, and our algorithm merely treats the machine words of this unmodified data as a matrix stored in row-major order. Although the computational complexity of the Audit algorithm is asymptotically O(N) for the server, this entails only a single matrix-vector multiplication, in contrast to some prior work which requires expensive RSA computations [5].

To ensure authenticity also during Read and Write operations, we combine this linear algebra idea above with a standard Merkle hash tree.

Server Communications Client
Init
.
Stores and Stores , and
Read
Returns
Write
Stores updated Stores updated
Audit
Table 1: Client/server PoR protocol with low storage server

4.2 Matrix based approach for audits

The basic premise of our particular PoR is to treat the data, consisting of N bits organized in machine words, as a square matrix M over a suitable finite ring of word size. Crucially, the choice of ring detailed below does not require any modification to the raw data itself; that is, any element of the matrix can be retrieved in constant time. At a high level, our audit algorithm follows the matrix multiplication verification technique of [19].

In the Init algorithm, the Client chooses a secret random control vector u and computes a second secret control vector v according to

    v^T = u^T · M.    (1)

Note that u (and hence v) is held constant for the duration of the storage. This does not compromise security because no message which depends on u is ever sent to the Server. In particular, this means that multiple clients could use different, independent control vectors, as long as they have a way to synchronize Write operations (modifications of their shared database) over a secure channel.

To perform an audit, the client chooses a random challenge vector x, and asks the server to compute a response vector y according to

    y = M · x.    (2)

Upon receiving the response y, the client checks two dot products for equality, namely

    u^T · y = v^T · x.    (3)

The proof of retrievability will rely on the fact that observing several successful audits allows, with high probability, recovery of the matrix M, and therefore of the entire database.

The audit algorithm's cost is mostly in the server's matrix-vector product; the client's dot products are much cheaper in comparison. For instance, if the matrix dimensions are close to √N, the communication cost is bounded by O(√N) words, as each vector has about √N entries. We trade this infrequent heavy computation for no additional persistent storage, justified by the significantly lower cost of computation versus storage space.

A sketch of the security proofs is as follows; full proofs are provided along with our formal and general protocol in Section 6. The Client knows that the Server sent the correct value of y with high probability, because otherwise the Server must know something about the secret control vector u chosen randomly at initialization time; this is impossible, since no data depending on u was ever sent to the Server. The retrievability property (Definition 3) is ensured by the fact that, after sufficiently many random successful audits, with high probability, the original data M is the unique solution to the matrix equation M·X = Y, where X is the matrix whose columns are the random challenge vectors of the audits and Y is the matrix of the corresponding response vectors from the Server.
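A minimal sketch of this audit in Python follows, with toy dimensions and a word-sized prime field; the symbols u, v, x, y match equations (1)–(3), and everything else (sizes, field choice) is illustrative:

    import secrets

    p = 2**61 - 1      # word-sized prime field (Section 5.2 uses this Mersenne prime)
    m = 4              # toy dimension; in practice m is on the order of sqrt(N/8)

    # The server's data, viewed in place as an m x m matrix over Z_p (row-major words).
    M = [[secrets.randbelow(p) for _ in range(m)] for _ in range(m)]

    def mat_vec(A, x):  # server work: one pass over the data, O(N) word operations
        return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) % p for row in A]

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b)) % p

    # --- Init (client): secret control vectors u and v^T = u^T * M -------------
    u = [secrets.randbelow(p) for _ in range(m)]
    v = [sum(u[i] * M[i][j] for i in range(m)) % p for j in range(m)]  # never sent out

    # --- Audit ------------------------------------------------------------------
    x = [secrets.randbelow(p) for _ in range(m)]   # client: random challenge
    y = mat_vec(M, x)                              # server: response y = M x
    assert dot(u, y) == dot(v, x)                  # client: check u.y == v.x, as in (3)

    # A server answering from a corrupted copy of M passes this check only with
    # probability about 1/p per audit, since it has no information about u.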

Some similar ideas were used by [32] for checking integrity. However, their security relies on the difficulty of integer factorization. Implementation would therefore require many modular exponentiations at thousands of bits of precision. Our approach for audits is much simpler and independent of computational hardness assumptions.

4.3 Merkle hash tree for updates

While the audit operates on the data in word-size chunks as members of a finite ring, retrieving data is done at the byte level, with support for retrieving any byte range that is legal with respect to the size of the data. A Merkle hash tree with block size b is used here to ensure authenticity of individual Read operations. This is a binary tree, stored on the server, consisting of O(N/b) hashes, each of size 2λ bits for collision resistance.

The Client stores only the root hash, and can perform, with high integrity assurance, any read or write operation on a range of bytes in communication and computation time proportional to the range length plus O(λ log N) for the authentication paths. When the block size is large enough, the extra server storage is o(N); for example, a block size proportional to λ·log N means the hash tree can be stored using O(N/log N) bits.

Merkle hash trees are a classical result, commonly used in practice, and we do not claim any novelty in our use here [28, 26]. To that end, we provide three algorithms to abstract the details of the Merkle hash tree, and give more details on a possible implementation in Appendix C.

These are all two-party protocols between a Server and a Client, but without any requirement for secrecy. A vertical bar in the inputs and/or outputs of an algorithm indicates Server input/output on the left, and Client input/output on the right. When only the Client has input/output, the bar is omitted for brevity.

The MTVerifiedRead and MTVerifiedWrite algorithms may both fail to verify a hash, and if so, the Client outputs reject and aborts immediately. Our three Merkle tree algorithms are as follows.

Initialization. The Client initializes the database for storage in size-b blocks. The entire database is sent to the Server, who computes the hashes and stores the resulting Merkle hash tree. The Client also computes this tree, but discards all hashes other than the root hash. The cost in communication and computation for both parties is linear in the database size.

MTVerifiedRead. The Client sends a contiguous byte range to the Server, i.e., a pair of indices within the size of the database. This range determines which containing range of blocks is required, and the Server sends back these block contents, along with left and right boundary paths in the hash tree. Specifically, the boundary paths include all left sibling hashes along the path from the first block to the root node, and all right sibling hashes along the path from the last block to the root; these are the "uncles" in the hash tree. Using the returned blocks and hash tree values, the Client reconstructs the Merkle tree root and compares it with the stored root hash. If these do not match, the Client outputs reject and aborts. Otherwise, the requested range of bytes is extracted from the (now-verified) blocks and returned. The cost in communication and computation time for both parties is at most proportional to the size of the covering blocks plus O(λ log(N/b)) for the boundary paths.


MTVerifiedWrite. The Client wishes to update the data in a specified range, and to receive the previous value of that range as well as an updated root hash. The algorithm begins as MTVerifiedRead, with the Server sending all blocks that cover the range and the corresponding left and right boundary hashes. After the Client retrieves and verifies the old value against the old root hash, she updates the blocks with the new value and uses the same boundary hashes to compute the new root hash. Separately, the Server updates the underlying database in the specified range, then recomputes all affected hashes in the tree. The asymptotic cost is identical to that of the MTVerifiedRead algorithm.
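A minimal sketch of the underlying Merkle-tree mechanics follows (single-block verification only; the range reads, boundary "uncle" paths for ranges, and write updates described above are elided, and the exact node layout is illustrative rather than the paper's own encoding):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(blocks: list[bytes]) -> list[list[bytes]]:
        """Return all levels of the Merkle tree, leaves first, root last."""
        level = [h(b) for b in blocks]
        levels = [level]
        while len(level) > 1:
            if len(level) % 2:                 # duplicate last node on odd levels
                level = level + [level[-1]]
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    def auth_path(levels, index: int) -> list[bytes]:
        """Sibling hashes ('uncles') from leaf `index` up to the root."""
        path = []
        for level in levels[:-1]:
            if len(level) % 2:
                level = level + [level[-1]]
            path.append(level[index ^ 1])
            index //= 2
        return path

    def verify(root: bytes, block: bytes, index: int, path: list[bytes]) -> bool:
        """Client-side check: recompute the root from one block and its path."""
        node = h(block)
        for sib in path:
            node = h(node + sib) if index % 2 == 0 else h(sib + node)
            index //= 2
        return node == root

    blocks = [bytes([i]) * 8192 for i in range(6)]     # six 8 KiB blocks
    levels = build_tree(blocks)
    root = levels[-1][0]                               # the only value the Client keeps
    assert verify(root, blocks[3], 3, auth_path(levels, 3))
    assert not verify(root, b"tampered", 3, auth_path(levels, 3))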

5 Experiments with Google cloud services

As we have seen, compared to other dynamic PoR schemes, our protocol aims at achieving the high security guarantees of PoR, while trading near-minimal persistent server storage for increased audit computation time.

In order to address the practicality of this tradeoff, we implemented and tested our PoR protocol using virtual machines and disks on the Google Cloud Platform service, a commercial competitor to Amazon Web Services, Microsoft Azure, and the like.

Specifically, we address two primary questions:


  • What is the monetary cost and time required to perform our linear-time audit on a large database?

  • How does the decreased cost of persistent storage trade off against the increased cost of computation during audits?

Our experimental results are summarized in Tables 3, 4 and 5. For a 1TB data file, the communication cost of our audit entails less than 12MB of data transfer, and our implementation executes the audit for this 1TB data file in less than 16 minutes and at a monetary cost of less than $0.25 USD.

By contrast, just the extra persistent storage required by other existing PoR schemes would cost at least $40 USD or as much as $200 USD per month, not including any computation costs for audits. These results lead to two tentative conclusions:


  • The communication and computation costs of our Audit algorithm are not prohibitive in practice despite their unfavorable asymptotics; and

  • Our solution is the most cost-efficient PoR scheme available when few audits are performed per day.

We also emphasize again that a key benefit of our PoR scheme is its composability with existing software, as the data file is left intact as a normal file on the Server's filesystem.

The remainder of this section gives the full details of our implementation and experimental setup. The source code is available via the following github repository: https://github.com/dsroche/la-por

5.1 Parameter selection

To balance the bandwidth (protocol communication) and the client computation costs, we chose to represent the data as a square matrix with dimensions roughly √(N/8) × √(N/8), where the 8 comes from our choice of ring elements corresponding to 64-bit (8-byte) machine words (see Section 5.2). The resulting asymptotic costs for these parameter choices are summarized in Table 2.

Server Comm. Client
Storage

Comput.

Init
Audit
Read/Write
Table 2: Proof of retrievability via square matrix verifiable computing

5.2 Two Prime Calculations

In order to leave the data file unmodified in persistent storage, while allowing constant-time random access to individual matrix elements, we break the data into word-sized (8-byte) blocks and choose a finite ring whose size exceeds 2^64.

One possibility would be to use a single prime modulus larger than 2^64, but this would entail costly multiple-precision computations for the modular arithmetic. Instead, we chose the ring as the direct product of two finite fields, each of large prime order. When the product of the two primes exceeds 2^64, this ensures unique recovery of each 64-bit word from its pair of images via Chinese remaindering, and also allows efficient computation without extended precision.

In our implementation, the first prime is the Mersenne prime 2^61 − 1; being a Mersenne prime makes computations with it particularly efficient, but a second Mersenne prime of similar size does not exist, so the second prime is an ordinary word-sized prime of similar magnitude. For the actual arithmetic we used the low-level routines provided by the open-source high-performance number theory library FLINT [21].

This two-prime setup is equivalent to storing two copies of the database reduced into the two prime fields, and so the formal security proof of Theorem 6 applies as long as the smaller prime is larger than the column dimension of the database matrix (see Section 6 for more details). This means our implementation parameters satisfy the security proof requirements for database sizes up to 144PB.
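A small sketch of this residue arithmetic follows; the Mersenne prime 2^61 − 1 matches the discussion above, while the second prime is generated here purely for illustration and is not necessarily the implementation's actual choice:

    from sympy import nextprime

    P1 = 2**61 - 1            # Mersenne prime: reduction is a shift and an add
    P2 = int(nextprime(2**61))  # illustrative second prime (not the paper's stated choice)
    assert P1 * P2 > 2**64    # guarantees unique recovery of every 64-bit word by CRT

    def split(word: int) -> tuple[int, int]:
        """Map one 64-bit machine word to its images modulo the two primes."""
        return word % P1, word % P2

    def crt(r1: int, r2: int) -> int:
        """Recover the unique value below P1*P2 (hence the original 64-bit word)."""
        inv_p1 = pow(P1, -1, P2)                     # P1^{-1} mod P2
        return r1 + P1 * (((r2 - r1) * inv_p1) % P2)

    word = 0xDEADBEEFCAFEBABE
    r1, r2 = split(word)
    assert crt(r1, r2) == word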

5.3 Experimental Design

Our implementation provides the Init, Read, Write, and Audit algorithms as described in the previous section, including the Merkle hash tree implementation for read/write integrity. As the cost of the first three of these are comparable to prior work, we focused our experiments on the Audit algorithm.

We ran two sets of experiments, using virtual machines and disks on Google Cloud's Compute Engine (https://cloud.google.com/compute/docs/machine-types). The client machine was a basic f1-micro instance with 1 vCPU and 0.6GB memory. The server machine was an n1-standard-2 with 2 vCPUs and 7.5GB memory. In the second set of experiments, the 4 or 16 parallel VMs running MPI were all n1-standard-1 instances with 1 vCPU and 3.75GB memory. The client and server processes communicated over a TCP connection. The data itself was stored on an attached 1.2TB standard persistent disk. Test files of size 1GB, 10GB, 100GB, and 1TB were generated with random bytes from /dev/urandom. The server time in Table 3 measures CPU time only; all other times are "wall time" in actual seconds for operation completion.

5.4 Audit compared to checksums

For the first set of experiments, we wanted to address the question of how "heavy" the hidden constant in the O(N) audit cost is. For this, we compared the cost of performing a single audit, on databases of various sizes, to the cost of computing a cryptographic checksum of the entire database using MD5 or SHA256.

Operation 1GB 10GB 100GB 1TB
Init Server 0.81 47.38 483.16 5236.65
Wall 11.17 102.49 1055.46 11437.00
Audit Server 5.53 54.26 463.25 5510.80
Wall 12.75 117.36 1080.66 11495.00
MD5 Server 2.46 23.92 251.31 2848.21
Wall 8.91 88.47 914.91 9234.00
SHA256 Server 6.30 62.67 639.08 6428.22
Wall 11.49 112.72 1553.74 11969.00
Table 3: Run Time Test Results (seconds)

In a sense, a cryptographic checksum is another means of integrity check that requires no extra storage, albeit without the malicious server protection that our PoR protocol provides. Therefore, having an audit cost which is comparable to that of a cryptographic checksum indicates the theoretical cost is not too heavy in practice.

The experiment took place in 4 stages. First, each file was run through the initialization algorithm. Then, each file was run through the Audit algorithm. Third, an MD5 digest was calculated for each file. Finally, a SHA256 digest was computed for each file. A Merkle tree was also created over each file. Instead of scaling the block size to each file, a block size of 8KiB was chosen for practical performance. The results are organized into Table 3.

Per operation, the timings report the CPU time from the server side, and the total wall time from the client side. The difference is due mostly to I/O overhead; even for the audit, the client-side work to compute the two dot products is minimal.

There are two main conclusions to draw from the experiments. The first deals with our Audit algorithm following the theoretical bounds that were expected, and the second deals with how the run time compares to that of the hash functions.

Because the server computation time for an audit is O(N), we expect the times to scale linearly, and our results support this. We also see that the running time is consistently between that of the MD5 and SHA256 checksums, both in wall time and CPU time. This justifies that our linear-time Audit algorithm, while asymptotically more costly than other PoR and PDP schemes, is in practice comparable to computing a cryptographic checksum.

We were surprised by the large disparity between the Server Time and the Wall Time in these experiments, both for our own Audit algorithm and for the checksum comparisons. We determined that this disparity is mostly due to I/O within the cloud datacenter, caused by the CPU waiting for the reads to the external drive.

5.5 Parallel audits using MPI

Our first round of experiments indicated that our Audit algorithm on the server was I/O bound, despite the favorable linear access pattern of the matrix-vector product computation. It seems that Google Cloud Platform throttles disk-to-VM I/O on a per-VM basis, so that even with many cores, the situation did not improve.

However, we were able to achieve good parallel speedup when running the Audit algorithm over multiple VMs in parallel using MPI. In this setup, the server VM waits for a connection from a client, who requests an audit, which is in turn performed by some number of VMs running in parallel, after which the results are collected and returned to the client. The simplicity of our Audit algorithm makes it trivially parallelizable: each parallel VM performs the matrix-vector product on a contiguous subset of rows of the matrix, corresponding to a contiguous segment of the underlying file.

Because the built-in MD5 and SHA256 checksum programs do not achieve any parallel speedup, we focused only on our Audit algorithm for this set of experiments using MPI. The results are reported in Table 4. Our parallel speedup is not quite linear, but was sufficient to gain a significant improvement in the audit time, to just under 16 minutes in the case of a 1TB file using 16 VMs.

We also used these times to measure the total cost of running each audit on Google Cloud Platform, which features per-second billing of VMs and persistent disks, as reported in Table 4 as well. Note that, for the larger file sizes, the monetary cost actually decreases slightly as the number of parallel VMs increases, indicating that even higher levels of parallelization may decrease the running time further at no extra monetary cost.

VMs Metric 1GB 10GB 100GB 1TB
1 Audit (s) 12.75 117.36 1080.66 11495.00
Speedup 1x 1x 1x 1x
Cost $0.0004 $0.0031 $0.0285 $0.3033
4 Audit (s) 3.99 46.28 430.29 3512.33
Speedup 3.19x 2.54x 2.51x 3.27x
Cost $0.0003 $0.0037 $0.0341 $0.2781
16 Audit (s) 3.63 14.34 117.92 946.74
Speedup 3.51x 8.18x 9.16x 12.14x
Cost $0.0009 $0.0034 $0.0280 $0.2249
Table 4: Multiple Machine Parallelization Results

5.6 Communication and client computation time

Besides the complexity for server computation during an audit, the cost of client computation and communication bandwidth in our scheme is also asymptotically worse than existing PoR schemes. However, our experiments suggest that in practice these are not significant factors.

The client computation to finish the audit, namely the two dot products, never took more than 0.12 seconds in any case we tested. This indicates that even low-powered client machines should be able to run this Audit algorithm without issue.

The time spent communicating the challenge and response vectors becomes insignificant in comparison to the server computation as the size of the database increases. In our experiments, summarized in Table 5, this communication time remains under five seconds. The amount of data communicated is also given to confirm the square-root scaling (a quick calculation reproducing these numbers follows the table).

Metric 1GB 10GB 100GB 1TB
Comm. (kB) 358 1131 3578 11314
Time (s) 2.05 1.68 3.87 4.04
Table 5: Amount of Communication Per Audit with 16 VMs
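The square-root scaling of Table 5 can be reproduced by a quick calculation, assuming (as an illustration) that the challenge and response are each transmitted with entries in both prime fields, i.e., four vectors of 8-byte words in total:

    from math import sqrt

    WORD_BYTES = 8          # 64-bit machine words
    VECTORS = 4             # challenge and response, each in two prime fields (assumed breakdown)

    for label, size in [("1GB", 1e9), ("10GB", 1e10), ("100GB", 1e11), ("1TB", 1e12)]:
        words = size / WORD_BYTES
        m = int(sqrt(words))                         # square matrix dimension
        comm_kb = VECTORS * m * WORD_BYTES / 1e3
        print(f"{label:>5}: m = {m:>7}, audit communication ~ {comm_kb:.0f} kB")

    # Output is roughly 358, 1131, 3578, 11314 kB -- matching the measured values
    # in Table 5 and confirming the O(sqrt(N)) growth of the communication.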

Our experiments had both the client and server (with the associated added VMs) co-located in the us-central1-a zone (listed geographically as Iowa), meaning that the communication is likely within the same physical datacenter. We hope to address this in future work with geographically diverse clients; however, the small amount of data being transferred indicates that this should still not have a significant effect on the overall audit time.

Again, we emphasize that the main benefit of our approach is the vastly decreased persistent storage compared to existing PoR schemes. Including the Merkle tree with a block size of 8KiB, the total storage overhead is only 1.0684x the file size. Using Google Cloud with a 1TB database, the cost of a 1.069TB Standard Persistent Disk is $42.76 per month. Running our Audit algorithm with 16 VMs every five hours would still be financially favorable compared to using a PoR scheme with just 2x storage overhead.
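A quick sanity check of that comparison, using the measured audit cost from Table 4 and the roughly $40/month price of an extra 1TB of standard persistent disk cited earlier (exact prices vary by region and over time):

    MONTH_HOURS = 30 * 24

    audit_cost = 0.2249                    # measured cost of one 1TB audit with 16 VMs (Table 4)
    audits_per_month = MONTH_HOURS / 5     # one audit every five hours
    audit_bill = audits_per_month * audit_cost

    extra_storage_bill = 40.00             # extra 1TB of standard persistent disk per month
                                           # (the text cites at least $40 USD for such storage)

    print(f"audits:        ${audit_bill:6.2f}/month ({audits_per_month:.0f} audits)")
    print(f"extra storage: ${extra_storage_bill:6.2f}/month")
    # Roughly $32/month of audit computation versus $40+/month just to hold the
    # extra space of a 2x-overhead PoR scheme, consistent with the claim above.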

6 Formalization and Security analysis

In this section we present our PoR protocol in its most general form, prove that it satisfies the definitions of PoR correctness, authenticity, and retrievability, analyze its asymptotic performance, and present a variant that also satisfies public verifiability.

6.1 Additional improvements on the control vectors

The control vectors u and v stored by the Client in the simplified protocol of Section 4 can be modified to increase security and to decrease persistent storage or communication.

6.1.1 Ensuring security assumptions via multiple checks

In order to reach a target bound on the probability of failure for authenticity, it might be necessary to choose multiple independent control vectors during initialization and repeat the audit checks with each one. First, to ease independence considerations, we forget the two-prime ring and consider instead that the tests are performed in a finite field of size q. Second, we model multiple vectors by inflating the vectors u and v into blocks of ℓ non-zero vectors; that is, matrices U and V with ℓ rows each. To see how large ℓ needs to be, consider the probability of the Client accepting an incorrect response during an audit. An incorrect answer y' ≠ y to the audit fails to be detected only if

    U · (y' − y) = 0,    (4)

where y is the correct response which would be returned by an honest Server.

If U is sampled uniformly at random among matrices with ℓ non-zero rows, then since the Server never learns any information about U, an incorrect answer goes undetected only if the Server can guess a non-zero vector in the right nullspace of U. This happens with probability at most q^(−ℓ).

Achieving a failure probability bounded by 2^(−λ) therefore requires setting ℓ ≥ λ / log2(q). In practice, reasonable values of λ and q mean that ℓ is a very small constant.
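For example (with illustrative values: a 2^−128 failure target and word-sized prime fields of roughly 61 bits, as in Section 5.2):

    from math import ceil, log2

    def repetitions(target_bits: int, field_size: int) -> int:
        """Smallest l with field_size**(-l) <= 2**(-target_bits)."""
        return ceil(target_bits / log2(field_size))

    for q in (2**61 - 1, (2**61 - 1) ** 2):   # one 61-bit prime field, or (informally) both at once
        print(f"log2(q) ~ {log2(q):5.1f}  ->  l = {repetitions(128, q)}")
    # A 2^-128 failure bound needs only l = 3 repetitions over a single 61-bit field,
    # and l = 2 if both prime fields are counted together.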

6.1.2 Random geometric progression

Instead of using uniformly random vectors and matrices, one can impose structure on them in order to reduce the amount of randomness needed, as well as the cost of communicating or storing them. We propose to apply Kimbrel and Sinha's modification of Freivalds' check [25]: select a single random field element r and form the challenge x = (1, r, r^2, …), thus reducing the challenge communication for an audit from a full vector to a single field element.

Similarly, we can reduce the storage of U by sampling ℓ distinct non-zero field elements s_1, …, s_ℓ uniformly at random and forming the Vandermonde-structured block

    U = ( s_i^j )  for  i = 1, …, ℓ  and  j = 0, 1, 2, ….    (5)

This reduces the storage of U on the client side from a full ℓ-row matrix to only ℓ field elements.

Then, with a rectangular database whose row dimension is much smaller than its column dimension, communications can be lowered to any small target amount, at the cost of increased client storage and greater client computation during audits.

This impacts the probability of failure of authenticity for the audits. Consider an incorrect answer y' ≠ y to an audit as in (4). Then each element s_i is a root of the univariate polynomial whose coefficients are the entries of y' − y, whose degree is less than the row dimension m. Because this polynomial has fewer than m distinct roots, the probability of the Client accepting an incorrect answer is at most about (m/q)^ℓ, which leads to setting ℓ ≥ λ / log2(q/m) in order to bound this probability by 2^(−λ). Even for petabyte-scale storage with word-sized prime fields and a standard security parameter, ℓ remains a small constant.
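A sketch combining both structured choices follows (geometric-progression challenge and Vandermonde control block; toy dimensions, with the symbols r, s_i, U, V as introduced above):

    import secrets

    p = 2**61 - 1
    rows, cols = 3, 5                       # toy rectangular database M (rows x cols)
    ell = 2                                 # number of control rows
    M = [[secrets.randbelow(p) for _ in range(cols)] for _ in range(rows)]

    def powers(base: int, count: int) -> list[int]:
        """(1, base, base^2, ...) mod p -- a geometric progression."""
        out, acc = [], 1
        for _ in range(count):
            out.append(acc)
            acc = acc * base % p
        return out

    # Client storage: only the ell seeds s_i (U is the Vandermonde matrix of their
    # powers), plus the ell x cols block V = U * M.  (Distinctness of the seeds is
    # not enforced in this toy sketch.)
    seeds = [secrets.randbelow(p - 1) + 1 for _ in range(ell)]
    U = [powers(s, rows) for s in seeds]
    V = [[sum(U[i][k] * M[k][j] for k in range(rows)) % p for j in range(cols)]
         for i in range(ell)]

    # Audit: the challenge is a single field element r; x is its geometric progression.
    r = secrets.randbelow(p - 1) + 1
    x = powers(r, cols)
    y = [sum(M[k][j] * x[j] for j in range(cols)) % p for k in range(rows)]  # server: y = M x

    for i in range(ell):                    # client: check U y == V x, row by row
        lhs = sum(U[i][k] * y[k] for k in range(rows)) % p
        rhs = sum(V[i][j] * x[j] for j in range(cols)) % p
        assert lhs == rhs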

6.1.3 Externalized storage

Lastly, the client storage can be reduced to a constant number of keys and digests by externalizing the storage of the block-vector V, at the expense of increasing the volume of communication. Clearly V must be stored encrypted, as otherwise the server could answer any challenge without having to store the database. Any IND-CPA symmetric cipher works here, with care taken so that a separate IV is used for each column; this allows updates to a column of V during a Write operation without revealing anything about the updated values.

We will simply assume that the client has access to an encryption function and a decryption function under a secret key. In order to check the authenticity of each communication of V from the Server to the Client, we use another Merkle hash tree certificate for it: the client need only keep the root of a Merkle tree built on the encryption of V.

Since this modification reduces the client storage but increases the overall communication, both options (with or without it; extern=T or extern=F) should be considered, and we state the algorithms for our protocol with a Strategy parameter deciding whether or not to externalize the storage of V.
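One possible realization of this externalization, as a toy sketch only: an HMAC-SHA256 counter-mode stream stands in for "any IND-CPA symmetric cipher", each column of V gets its own fresh IV, and the Merkle-tree helpers of Section 4.3 would be used to authenticate the stored ciphertexts:

    import hmac, hashlib, secrets, struct

    KEY = secrets.token_bytes(32)

    def keystream(key: bytes, iv: bytes, length: int) -> bytes:
        """CTR-style stream from HMAC-SHA256 (illustrative stand-in for a real cipher)."""
        out, counter = b"", 0
        while len(out) < length:
            out += hmac.new(key, iv + struct.pack(">Q", counter), hashlib.sha256).digest()
            counter += 1
        return out[:length]

    def encrypt_column(values: list[int]) -> tuple[bytes, bytes]:
        """Encrypt one column of V under a fresh IV; returns (iv, ciphertext)."""
        iv = secrets.token_bytes(16)
        plain = b"".join(struct.pack(">Q", v) for v in values)
        cipher = bytes(a ^ b for a, b in zip(plain, keystream(KEY, iv, len(plain))))
        return iv, cipher

    def decrypt_column(iv: bytes, cipher: bytes) -> list[int]:
        plain = bytes(a ^ b for a, b in zip(cipher, keystream(KEY, iv, len(cipher))))
        return [struct.unpack(">Q", plain[i:i + 8])[0] for i in range(0, len(plain), 8)]

    column = [123456789, 987654321]              # one column of the control block V
    iv, ct = encrypt_column(column)              # stored on the server, Merkle-authenticated
    assert decrypt_column(iv, ct) == column      # fetched and decrypted during an audit
    # Re-encrypting a single column under a fresh IV after a Write reveals nothing
    # about how (or whether) its values changed.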

Server Communication Client
Strategy extern=T extern=F extern=T extern=F
Storage

Comput.

Setup
Audit
Read/Write
Table 6: Proof of retrievability via rectangular verifiable computing with structured vectors
(N is the size of the database, λ is the security parameter, and b is the Merkle tree block size; the number of control rows is assumed to be a constant.)

6.2 Formal protocol descriptions

Full definitions of the five algorithms Init, Read, Write, Audit and Extract are given below as Algorithms 1 through 5, incorporating the improvements on control vector storage from the previous subsection. They include subcalls to the classical Merkle hash tree operations defined in Section 4.3.

Then, a summary of the asymptotic costs can be found in Table 6.

0:  
0:  ,
1:  ;
2:  Client: with non-zero distinct elements {Secrets}
3:  Client: Let
4:  Client: {Secretly stored or externalized}
5:  Both:
6:  if  then
7:     Client: ;
8:     Client: ;
9:     Client: sends to the Server;
10:     Both:
11:     Server: ;
12:     Client: ;
13:  else
14:     Client: sends to the Server;
15:     Server: ;
16:     Client: ;
17:  end if
Algorithm 1
0:  
0:   or reject
1:  Both:
2:  Client: return
Algorithm 2
0:  
0:   or reject
1:  Both:
2:  if  then
3:     Both:
4:     Client: ;
5:  end if
6:  Client: Let
7:  Client: ;
8:  if  then
9:     Client:
10:     Both:
11:     Server: Update using , and
12:     Client: Update using and
13:  else
14:     Server: Update using and
15:     Client: Update using and
16:  end if
Algorithm 3
0:  ,
0:  accept or reject
1:  Client: ;
2:  Client: sends to the Server;
3:  Let
4:  Server: ;{ from }
5:  Server: sends to Client;
6:  if  then
7:     Both: ;
8:     Client:
9:  end if
10:  Client: Let
11:  if  then
12:     Client: return accept
13:  else
14:     Client: return reject
15:  end if
Algorithm 4
0:   and successful audit transcripts
0:   or fail
1:   indices of distinct challenge vectors
2:  if  then
3:     return  fail
4:  end if{Now is Vandermonde with distinct points}
5:  Form matrix
6:  Form matrix
7:  Compute
8:  return  
Algorithm 5

6.3 Security

Before we begin the full security proof, we need the following technical lemma to prove that the Extract algorithm succeeds with high probability. The proof of this lemma is a straightforward application of Chernoff bounds.

Lemma 5.

Let and suppose balls are thrown independently and uniformly into bins at random. If and , then with probability at least , the number of non-empty bins is at least .

Proof.

Let X_1, X_2, … be random variables for the indices of the bins that the balls go into. Each X_i is uniform and independent over the bins.

Let Y_{jk} be random variables for each pair of indices with j < k, such that Y_{jk} equals 1 iff balls j and k land in the same bin. Each Y_{jk} is therefore a Bernoulli trial, and the sum Y of all the Y_{jk} is the number of pairs of balls which go into the same bin.

We will use a Chernoff bound on the probability that Y is large. Note that the random variables Y_{jk} are not independent, but they are negatively correlated: when any one of them equals 1, it only decreases the conditional expectation of any other. Therefore, by convexity, we can treat the Y_{jk}'s as independent in order to obtain an upper bound on the probability that Y is large.

Observe that . A standard consequence of the Chernoff bound on sums of independent indicator variables tells us that ; see for example [30, Theorem 4.1] or [22, Theorem 1].

Substituting the bound on then tells us that . That is, with high probability, fewer than pair of balls share the same bin. If denotes the number of bins with balls, the number of non-empty bins is

which completes the proof. ∎

We now proceed to the main result of the paper.

Theorem 6 (Appendix B).

Let the matrix dimensions and a finite field whose size exceeds the column dimension be parameters for our PoR scheme. Then the protocol composed of:


  • the Init operations in Algorithm 1;

  • the Read operations in Algorithm 2;

  • the Write operations in Algorithm 3;

  • the Audit operations in Algorithm 4; and

  • the Extract operation in Algorithm 5, run on sufficiently many independent successful audit transcripts,

satisfies correctness, adaptive authenticity and retrievability as defined in Definitions 1, 2 and 3.

6.4 Public verifiability

These algorithms can also be adapted to support public verifiability. Here a first client (now called the Writer) is authorized to run the Init, Write, Read and Audit algorithms, while a second client (now called the Verifier) can only run the last two. The idea is to provide equality testing without deciphering. In a group where the discrete logarithm is hard, this can be achieved while preserving security, thanks to the additive homomorphic property of exponentiation. For instance, with the externalized strategy, the modifications are as follows:

  1. A group G of prime order p with a generator g is built.

  2. Init, in Algorithm 1, is run identically, except for two modifications. First, the control block V is ciphered in the exponent: the Server stores the componentwise exponentials of V rather than an ordinary encryption. Second, the Writer also publishes the componentwise exponentials of the secret control block U over an authenticated channel.

  3. All the verifications of the Merkle tree root in Algorithms 2, 3 and 4 remain unchanged, but the Writer must publish the new roots of the trees after each Write, also over an authenticated and timestamped channel to the Verifier.

  4. Updates to the control vector, in Algorithm 3, are performed homomorphically, without deciphering: the Writer computes in the clear the correction induced by the update, then multiplies the affected exponentiated entries of V by the exponential of that correction.

  5. The dot-product verification in Algorithm 4 is also performed homomorphically in the exponent: the Verifier checks that the product of the published exponentials of U, raised to the response values, equals the product of the exponentials of V, raised to the challenge values (a toy sketch follows this list).
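A toy sketch of that homomorphic dot-product check follows; a small Schnorr-style subgroup generated on the fly stands in for the group G (a real deployment would use a standardized group of cryptographic size), and publishing exponentials of the control vectors follows the list above:

    import secrets
    from sympy import isprime, nextprime

    # Toy prime-order group: a subgroup of order q inside Z_p^*, with p = 2q + 1.
    q = int(nextprime(2**64))
    while not isprime(2 * q + 1):
        q = int(nextprime(q))
    p = 2 * q + 1
    g = 1
    while g == 1:
        g = pow(secrets.randbelow(p - 3) + 2, 2, p)   # squaring lands in the order-q subgroup

    m = 4
    M = [[secrets.randbelow(q) for _ in range(m)] for _ in range(m)]   # data, as exponents mod q

    # Writer's secrets, as in Section 4: u and v^T = u^T M (all arithmetic mod q).
    u = [secrets.randbelow(q) for _ in range(m)]
    v = [sum(u[i] * M[i][j] for i in range(m)) % q for j in range(m)]

    # Published over an authenticated channel: g^u and g^v, never u or v themselves.
    g_u = [pow(g, ui, p) for ui in u]
    g_v = [pow(g, vj, p) for vj in v]

    # Audit, run by any third-party Verifier.
    x = [secrets.randbelow(q) for _ in range(m)]                       # challenge
    y = [sum(M[i][j] * x[j] for j in range(m)) % q for i in range(m)]  # server: y = M x

    lhs = 1
    for gui, yi in zip(g_u, y):
        lhs = lhs * pow(gui, yi, p) % p          # prod (g^{u_i})^{y_i} = g^{u . y}
    rhs = 1
    for gvj, xj in zip(g_v, x):
        rhs = rhs * pow(gvj, xj, p) % p          # prod (g^{v_j})^{x_j} = g^{v . x}
    assert lhs == rhs                            # equality checked in the exponent,
                                                 # without ever revealing u or v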

These modifications give rise to the following Theorem 7. This theorem and the associated security assumptions are formalized and proven in Appendix D.

Theorem 7.

Under LIP security in a group of prime order where discrete logarithms are hard to compute, our protocol can be modified so as to satisfy not only correctness, adaptive authenticity and retrievability, but also public verifiability.

7 Detailed state of the art

PDP schemes, first introduced by [5] in 2007, originally only considered static data storage. The original scheme was later adapted to allow dynamic updates by [16] and has since seen numerous performance improvements. However, PDPs only guarantee (probabilistically) that a large fraction of the data was not altered; a single block deletion or alteration is likely to go undetected in an audit.

PoR schemes, first introduced at the same CCS conference in 2007 by [24], provide a stronger guarantee of integrity: namely, that any small alteration to the data is likely to be detected. In this paper, we use the term PoR to refer to any scheme which provides this stronger level of recoverability guarantee.

PoR and PDP schemes are usually constructed as a collection of phases to initialize the data storage, to access it afterwards, and to audit the server's storage. Dynamic schemes additionally allow modification of subsets of the data, via write or update operations.

Since 2007, different schemes have been proposed to serve different purposes, such as data confidentiality, data integrity, data availability, freshness, or fairness. Storage efficiency, communication efficiency and reduction of disk I/O have improved over time. Some schemes are designed for static data (no update algorithm), others extend their audit algorithm to public verification, and still others support only a bounded number of Audits and Updates. For a complete taxonomy of recent PoR schemes, see [37] and references therein.

For our purpose, we have identified two main types of storage outsourcing approaches: those which minimize the storage overhead and those which minimize the client and server computation. For each approach, Table 7 specifies which schemes meet various requirements, such as whether or not they are dynamic, whether they can answer an unbounded number of queries, and how much extra storage they require.

PoR Number of Extra
Protocol capable audits updates Storage
Sebé [32] X
Ateniese et al. [5] X
Ateniese et al. [6] X
Storj [36]
Juels et al. [24]
Lavauzelle et al. [27]
Stefanov et al. [35]
Cash et al. [10]
Shi et al. [34]
Here
Table 7: Attributes of some selected schemes

7.1 Low storage overhead

The schemes of Ateniese et al. [5] and Sebé et al. [32] are in the PDP model. Both of them have a low storage overhead. They use RSA-based homomorphic authenticators, so that a successful audit guarantees data possession of some selected blocks. When all the blocks are selected, the audit is deterministic but the computation cost is high; so in practice, [5] minimizes the file block accesses, the computation on the server, and the client-server communication. For one audit on a bounded number of blocks, the S-PDP protocol of [5] gives the costs seen in Table 8. A robust auditing scheme integrates S-PDP with forward error-correcting codes to mitigate small file corruptions. Nevertheless, if the server passes one audit, it guarantees only that a portion of the data is correct.

Server Communication Client
Storage
Comput. Setup
Audit
Table 8: S-PDP audit on a subset of blocks: the file is composed of blocks of a fixed bit-size, and the computation is made modulo Q, where Q is the product of two large primes.

Later, Ateniese et al. [6] proposed a scheme secure under the random oracle model based on hash functions and symmetric keys. It has an efficient update algorithm but uses tokens which impose a limited number of audits or updates.

Alternatively, verifiable computing can be used to go through the whole database with Merkle hash trees, as in [7, §6]. The latter proposition, however, comes with a large overhead in homomorphic computations and does not provide an Audit mechanism. Verifiable computing can provide an audit mechanism, as sketched in [17], but then it is no longer dynamic.

Storj [36] (version 2) is a very different approach, also based on Merkle hash trees. It is a dynamic PoR protocol with a bounded number of Audits and updates. The storage is encrypted and cut into blocks; for each block and for a selection of salts, a Merkle hash tree is constructed. The efficiency of Storj is presented in Table 9.

Server Communication Client
Storage
Comput. Setup
Audit
Update
Table 9: Storj-V2: the file is composed of fixed-size blocks; the number of salts is a protocol parameter.

7.2 Fast audits but large extra storage

PoR methods based on block erasure encoding are a class of methods which guarantee with high probability that the client's entire data can be retrieved. The idea is to check the authenticity of a number of erasure-encoded blocks during the data recovery step, but also during the audit algorithm. Such approaches will not detect a small amount of corrupted data; but the idea is that if there are very few corrupted blocks, they can easily be recovered via the error-correcting code.

Lavauzelle et al. [27] proposed a static PoR. The Init algorithm consists in encoding the file using a lifted q-ary Reed–Solomon code and encrypting it with a block cipher. The Audit algorithm checks whether one word of blocks belongs to the set of Reed–Solomon codewords. This check has to succeed a sufficient number of times to ensure with high probability that the file can be recovered. Its main drawback is that it requires an initialization quadratic in the database size; for a large data file of several terabytes, this becomes intractable.

In addition to a block erasure code, PoRSYS of Juels et al. [24] uses block encryptions and sentinels in order to store static data with a cloud server. Shacham and Waters [33] use authenticators to improve the audit algorithm; they also propose a publicly verifiable scheme based on the Diffie–Hellman problem in bilinear groups.

Stefanov et al. [35] were the first to consider a dynamic PoR scheme. Later improvements by Cash et al. and Shi et al. [10, 34] allow for dynamic updates and reduce the asymptotic complexity (see Table 10). However, these techniques rely on computationally-intensive tools, such as locally decodable codes and Oblivious RAM (ORAM), and incur at least a 1.5x, and as much as 10x, overhead on the size of the remote storage.

Recent variants include Proofs of Data Replication and Proofs of Data Reliability, where the error correction is performed by the server instead of the client [3, 38]. Some use a weaker, rational attacker model [29, 11], and in all of them the client thus also has to be able to verify the redundancy; we do not know of dynamic versions of these.

Server Communication Client
Storage
Comput. Setup
Audit
Update
Table 10: Shi et al. [34]: The file is composed of blocks of bit-size .
Shi Here Here
et al. [34] extern=T extern=F
Server extra-storage
Server audit cost
Communication
Client audit cost
Client storage
Table 11: Comparison of our low server storage protocol with that of Shi et al. [34].

Table 11 compares the additional server storage and audit costs between [34] and the two variants of our protocol: the first saving on communication, and the second externalizing the storage of the secret audit matrix. In the former case, an arbitrary parameter can be used in the choice of the matrix dimensions, which balances the communication cost against the Client computation and storage.

Note that efficient solutions to PoR for dynamic data do not consider the confidentiality of the file, but assume that the user can encrypt its data in a prior step if needed.

References

  • [1] Michel Abdalla, Fabrice Benhamouda, and Alain Passelègue. An algebraic framework for pseudorandom functions and applications to related-key security. In Rosario Gennaro and Matthew Robshaw, editors, Advances in Cryptology – CRYPTO 2015, pages 388–409, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg. doi:10.1007/978-3-662-47989-6_19.
  • [2] Lawrence Abrams. Amazon AWS Outage Shows Data in the Cloud is Not Always Safe. Bleeping Computer, September 2019.
  • [3] Frederik Armknecht, Ludovic Barman, Jens-Matthias Bohli, and Ghassan O. Karame. Mirror: Enabling proofs of data replication and retrievability in the cloud. In 25th USENIX Security Symposium (USENIX Security 16), pages 1051–1068, Austin, TX, August 2016. USENIX Association. URL: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/armknecht.
  • [4] Giuseppe Ateniese, Ilario Bonacina, Antonio Faonio, and Nicola Galesi. Proofs of Space: When Space Is of the Essence. In Security and Cryptography for Networks, pages 538–557. Springer, 2014.
  • [5] Giuseppe Ateniese, Randal Burns, Reza Curtmola, Joseph Herring, Lea Kissner, Zachary Peterson, and Dawn Song. Provable data possession at untrusted stores. In Proceedings of the 14th ACM conference on Computer and communications security, pages 598–609. ACM, 2007.
  • [6] Giuseppe Ateniese, Roberto Di Pietro, Luigi V Mancini, and Gene Tsudik. Scalable and efficient provable data possession. In Proceedings of the 4th international conference on Security and privacy in communication networks, page 9. ACM, 2008.
  • [7] Siavosh Benabbas, Rosario Gennaro, and Yevgeniy Vahlis. Verifiable delegation of computation over large datasets. In Phillip Rogaway, editor, Advances in Cryptology – CRYPTO 2011, pages 111–131, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
  • [8] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. Sakura: A flexible coding for tree hashing. In Ioana Boureanu, Philippe Owesarski, and Serge Vaudenay, editors, Applied Cryptography and Network Security, pages 217–234, Cham, 2014. Springer International Publishing.
  • [9] Erik Cambria, Anupam Chattopadhyay, Eike Linn, Bappaditya Mandal, and Bebo White. Storages are not forever. Cognitive Computation, 9:646–658, 2017. doi:10.1007/s12559-017-9482-4.
  • [10] David Cash, Alptekin Küpçü, and Daniel Wichs. Dynamic proofs of retrievability via oblivious RAM. J. Cryptol., 30(1):22–57, January 2017. doi:10.1007/s00145-015-9216-2.
  • [11] Ethan Cecchetti, Ben Fisch, Ian Miers, and Ari Juels. Pies: Public incompressible encodings for decentralized storage. In Lorenzo Cavallaro, Johannes Kinder, XiaoFeng Wang, and Jonathan Katz, editors, Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS 2019, London, UK, November 11-15, 2019, pages 1351–1367. ACM, 2019. doi:10.1145/3319535.3354231.
  • [12] Ivan Damgård, Chaya Ganesh, and Claudio Orlandi. Proofs of replicated storage without timing assumptions. In Advances in Cryptology – CRYPTO 2019, pages 355–380. Springer, 2019.