Repair rate lower bounds for distributed storage

02/18/2020 ∙ by Michael Luby, et al.

One of the primary objectives of a distributed storage system is to reliably store a large amount dsize of source data for a long duration using a large number N of unreliable storage nodes, each with capacity nsize. The storage overhead β is the fraction of system capacity available beyond dsize, i.e., β = 1 − dsize/(N · nsize). Storage nodes fail randomly over time and are replaced with initially empty nodes, and thus data is erased from the system at an average rate erate = λ · N · nsize, where 1/λ is the average lifetime of a node before failure. To maintain recoverability of the source data, a repairer continually reads data over a network from nodes at some average rate rrate, and generates and writes data to nodes based on the read data. The main result is that, for any repairer, if the source data is recoverable at each point in time then it must be the case that rrate > erate/(2 · β) asymptotically as N goes to infinity and β goes to zero. This inequality provides a fundamental lower bound on the average rate that any repairer needs to read data from the system in order to maintain recoverability of the source data.

I Overview

A distributed storage system generically consists of interconnected storage nodes, where each node can store a large quantity of data. We let N be the number of storage nodes in the system, where each node has nsize bits of storage capacity.

Commonly, distributed storage systems are built using relatively inexpensive and generally not completely reliable hardware. For example, nodes can go offline for periods of time (transient failure), in which case the data they store is temporarily unavailable, or permanently fail, in which case the data they store is permanently erased. Permanent failures are not uncommon, and transient failures are frequent.

Although it is often hard to accurately model failures, an independent failure model can provide insight into the strengths and weaknesses of a practical system, and can provide a first order approximation to how a practical system operates. In fact, one of the primary reasons practical storage systems are built using distributed infrastructure is so that failures of the infrastructure are as independent as possible.

In our model, each storage node permanently fails independently and randomly at rate λ at each point in time and is replaced with a new node initialized to zeroes when it fails, and thus bits are erased from the system at an average rate erate, as defined in Equation (2).

A primary goal of a distributed storage system is to reliably store as much source data as possible for a long time, i.e., at each point in time the source data should be recoverable from the data stored in the system at that point in time. We let dsize be the size of the source data to be stored. To maintain recoverability of the source data, a repairer continually reads data over a network from nodes at some average rate rrate, and generates and writes data to nodes based on the read data.

Distributed storage systems generally allocate a fraction of their capacity to storage overhead, which is used by the repairer to help maintain recoverability of source data as failures occur. The storage overhead β is the fraction of capacity available beyond the size of the source data, i.e., β is defined in Equation (1), and thus dsize = (1 − β) · N · nsize.

The main result is that, for any repairer, if the source data is recoverable at each point in time then it must be the case that Inequality (3) holds asymptotically as N goes to infinity and β goes to zero. Thus, Inequality (3) provides a fundamental lower bound on the average rate that any repairer needs to read data from the system in order to maintain recoverability of the source data.

The repairers described in [19] have a peak read rate that is at most the right-hand side of Inequality (3) asymptotically as N goes to infinity and β goes to zero, and thus Inequality (3) expresses a fundamental trade-off between the repairer read rate and storage overhead as a function of the erasure rate.
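To make these quantities concrete, here is a small Python sketch that computes the storage overhead, the erasure rate, and the lower bound of Inequality (3); all of the numeric values (node count, node capacity, source data size, node lifetime) are illustrative assumptions, not parameters taken from the paper.

```python
# Illustrative sketch; every numeric value below is an assumption for the example.
SECONDS_PER_YEAR = 365 * 24 * 3600

N = 3000                      # number of storage nodes (assumed)
nsize = 1e14                  # bits of capacity per node (assumed)
dsize = 2.7e17                # bits of source data (assumed)
node_lifetime_years = 3.0     # average node lifetime 1/lambda (assumed)

lam = 1.0 / (node_lifetime_years * SECONDS_PER_YEAR)  # node failure rate

beta = 1.0 - dsize / (N * nsize)        # storage overhead, Equation (1)
erate = lam * N * nsize                 # erasure rate in bits/second, Equation (2)
rrate_lower_bound = erate / (2 * beta)  # right-hand side of Inequality (3), erate/(2*beta)

print(f"beta = {beta:.3f}")
print(f"erate = {erate:.3e} bits/s")
print(f"repair read rate must exceed about {rrate_lower_bound:.3e} bits/s")
```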

I-A Practical system parameters

An example of a practical system is one with a large number N of nodes, with nsize bits of capacity at each node, and thus N · nsize bits of system capacity. The amount of storage needed by the repairer to store its programs and state is generously at most a small fraction of this capacity, and we assume this in our bounds with respect to growing N.

Practical values of β range from 2/3 (triplication) down to much smaller values. In the example, the overhead corresponds to β · N · nsize bits of capacity. In practice nodes fail after a few years, i.e., 1/λ is on the order of a few years.

For practical systems, source data is generally maintained at the granularity of objects, and erasure codes are used to generate redundant data for each object. When using an (n, k) erasure code, each object is segmented into k source fragments, an encoder generates n − k repair fragments from the source fragments, and each of these n fragments is stored at a different node. An (n, k) erasure code is MDS (maximum distance separable) if the object can be recovered from any k of the n fragments.
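The MDS property can be captured by a simple recoverability check; the following sketch uses an assumed (9, 6) layout purely for illustration and does not implement an actual erasure code.

```python
def recoverable(surviving_fragments, k):
    """Under an MDS (n, k) code, an object is recoverable iff at least k fragments survive."""
    return len(surviving_fragments) >= k

n, k = 9, 6                      # assumed example layout: tolerates n - k = 3 losses
fragments = set(range(n))
print(recoverable(fragments - {2, 5, 8}, k))      # True: 6 fragments remain
print(recoverable(fragments - {0, 2, 5, 8}, k))   # False: only 5 fragments remain
```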

I-B Small code systems

Replication is an example of a trivial MDS erasure code, i.e., each fragment is a copy of the original object. For example, triplication can be thought of as using the simple (3, 1) erasure code, wherein the object can be recovered from any one of the three copies. Some practical distributed storage systems use triplication.

Reed-Solomon codes [2], [3], [5] are MDS codes that are used in a variety of applications and are a popular choice for storage systems. For example, [11], [9], and [12] each use a Reed-Solomon code. These are examples of small code systems, i.e., systems that use small values of n, k, and n − k.

There are some issues that complicate the design of small code systems. For example, the data for each object is spread over a tiny fraction of the nodes, i.e., in a system of N nodes, triplication spreads the data for each object over only 3 nodes, and an (n, k) Reed-Solomon code spreads the data for each object over only n nodes. Thus, an issue for a small code system is how to distribute the data for all the objects smoothly over all the nodes.

A typical approach is to assign each object to a placement group, where each placement group maps to n of the N nodes, which determines where the fragments of the object are stored. An equal amount of object data should be assigned to each placement group, and an equal number of placement groups should map a fragment to each node. For small code systems, Ceph [14] recommends on the order of 100 · N/n placement groups, i.e., 100 placement groups map a fragment to each node. A placement group should avoid mapping fragments to nodes with correlated failures, e.g., to the same rack. Pairs of placement groups should avoid mapping fragments to the same pair of nodes. Placement groups are continually remapped as nodes fail and are added. These and other issues make the design of small code systems challenging.
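As a rough illustration of the bookkeeping involved, the sketch below hashes each object to a placement group and maps each group to n nodes; the function names and the round-robin mapping are hypothetical simplifications and do not reflect Ceph's actual CRUSH placement, which additionally accounts for failure domains such as racks.

```python
import hashlib

def placement_group(object_id: str, num_pgs: int) -> int:
    """Assign an object to a placement group by hashing its identifier."""
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_pgs

def pg_to_nodes(pg: int, num_nodes: int, n: int) -> list[int]:
    """Toy mapping of a placement group to n distinct nodes."""
    step = num_nodes // n
    return [(pg + i * step) % num_nodes for i in range(n)]

# Example: N = 1000 nodes, (n, k) = (9, 6), about 100 placement groups per node.
N, n = 1000, 9
num_pgs = 100 * N // n
pg = placement_group("object-42", num_pgs)
print(pg, pg_to_nodes(pg, N, n))
```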

Since a small number of failures can cause source data loss for small code systems, reactive repair is used, i.e., the repairer operates as quickly as practical to regenerate fragments lost from a node that permanently fails before another node fails, and typically reads k fragments to regenerate each lost fragment. Thus, the peak read rate is higher than the average read rate, and the average read rate is k times the erasure rate.

As highlighted in [12], the read rate needed to maintain source data recoverability for small code systems can be substantial. Modifications of standard erasure codes have been designed for storage systems to reduce this rate, e.g., local reconstruction codes [10], [12], and regenerating codes [7], [8]. Some versions of local reconstruction codes have been used in deployments, e.g., by Microsoft Azure.

I-C Liquid systems

Another approach, introduced in [18], is liquid systems, which use erasure codes with large values of n, k, and n − k. For example, n = N and a fragment is assigned to each node for each object, i.e., only one placement group is used for all objects. The RaptorQ code [4], [6] is an example of an erasure code that is suitable for a liquid system, since objects with large numbers of fragments can be encoded and decoded efficiently in linear time.

Typically n − k is large for a liquid system, and thus source data is unrecoverable only when a large number of nodes fail. A liquid repairer is lazy, i.e., repair operates slowly to regenerate fragments erased from nodes that have permanently failed. The repairer reads k fragments for each object to regenerate the fragments erased over time due to failures, and the peak read rate is close to the average read rate. The peak read rate for the liquid repairer described in [18] is within a factor of two of the lower bounds on the read rate, and the peak read rate for the advanced liquid repairer described in [19] asymptotically approaches the lower bounds.

II Related work

The groundbreaking research of Dimakis et al., described in [7] and [8], is closest to our work: an object-based distributed storage framework is introduced, and optimal tradeoffs between storage overhead and local-computation repairer read rate are proved with respect to repairing an individual object. [7] and [8] describe a framework, the types of repairers that fit into the framework, and lower bounds on these types of repairers within the framework, which are hereafter referred to as the Regenerating framework, Regenerating repairers, and Regenerating lower bounds, respectively.

The Regenerating framework was originally introduced to model repair of a single lost fragment, and is applicable to reactive repair of a single object. The Regenerating framework is parameterized by (n, k, d, α, γ): n is the number of fragments for the object (each stored at a different node); k is the number of fragments from which the object must be recoverable; d is the number of fragments used to generate a lost fragment at a new node when a node fails; α is the fragment size; and γ is the total amount of data generated and read across the network to generate a fragment at a new node, i.e., γ/d is the amount of data generated from each of the d fragments needed to generate a fragment at a new node. Regenerating lower bounds on the local-computation repairer read data rate prove necessary conditions on the Regenerating framework parameters used by any Regenerating repairer to ensure that an individual object remains recoverable when using reactive repair.
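For reference, the information-flow cut-set bound of [7], [8] that underlies the Regenerating lower bounds can be written as follows in the notation above (the exact form of Inequality (5) quoted later may differ in presentation):

```latex
% Cut-set bound for an (n, k, d) Regenerating repairer with fragment size \alpha
% and per-helper transfer \gamma/d: the stored object of size dsize must satisfy
\[
  \mathrm{dsize} \;\le\; \sum_{i=0}^{k-1} \min\!\Big(\alpha,\; (d-i)\,\frac{\gamma}{d}\Big).
\]
```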

The Regenerating lower bounds were not originally designed to provide general lower bounds for a large system of nodes. Nevertheless, it is interesting to interpret the Regenerating framework in the context of a system so that the Regenerating lower bounds can be as closely compared as possible to the system level lower bounds proved in this paper. Let dsize be the amount of source data to be stored in the system. Then, n is set to the number N of nodes in the system, and α is set to the storage capacity nsize of each node, and thus the storage overhead β is as shown in Equation (1).

Since we want the best tradeoff possible between the amount of data read by the Regenerating repairer to replace each failed node and the storage overhead β, we set d = N − 1. (At a general point in time a failed node is being replaced and there are only N − 1 available nodes.) Thus, at the system level we consider the Regenerating framework with parameters n = N, d = N − 1, and α = nsize.

The Regenerating framework uses labeled acyclic directed graphs, where each directed edge is labeled with the amount of data transferred from the node at the tail to the node at the head of the edge, to represent the actions of Regenerating repairers, and it is the properties of these graphs that are used to prove the Regenerating lower bounds. The labeled acyclic graphs restrict the possible actions taken by Regenerating repairers as follows. Suppose a node with identifier i fails at some time and the next failure occurs at a later time. A Regenerating repairer is restricted to the following actions between these two times:

  • For each node other than the failed node i, the Regenerating repairer computes a function of the bits stored at that node to generate γ/d bits, and transfers these bits to the replacement node for i.

  • From the bits received at the replacement node for i from the nodes other than node i, the Regenerating repairer computes a function of these bits to generate the α bits to be stored at the replacement node for i.

Thus, between the two failure times, a fixed amount of data is transferred to the replacement node for i and no data is transferred to any other node; once another node fails, no more data is transferred to the replacement node for i until it fails again and is replaced with another replacement node; and an equal amount of data is read and transferred from each of the non-failing nodes to the replacement node for i.

Dimakis Lemma II.1

The following holds as N goes to infinity. For any Regenerating repairer parameterized as above, if

(4)

then the source data cannot be reliably recovered at the end of any failure sequence with distinct failures.

  • Inequality (16) of [8] implies that if

    (5)

    for a Regenerating repairer, then the source data cannot be reliably recovered at the end of any failure sequence with distinct failures. With n = N, α = nsize, and using Equation (1), we can rewrite Inequality (5) accordingly.

    As N goes to infinity, we can approximate the sum by integration.

    Simplifying yields Inequality (4); a generic sketch of the approximation step follows.
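The approximation step above uses the standard replacement of a sum by an integral for large N; a generic sketch of this step (the exact summand of Inequality (5) is not reproduced here) is:

```latex
% Generic sum-to-integral approximation: for a well-behaved f and k/N held fixed,
\[
  \frac{1}{N}\sum_{i=0}^{k-1} f\!\Big(\frac{i}{N}\Big)
  \;\longrightarrow\;
  \int_{0}^{k/N} f(x)\,dx
  \qquad \text{as } N \to \infty .
\]
```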

Dimakis Lemma II.1 is tight, i.e., [7], [8] describe Regenerating repairers, with read rates matching the right-hand side of Inequality (4), that maintain source data recoverability for periodic failure sequences.

Dimakis Lemma II.1 holds for any Regenerating repairer and for any failure sequence with distinct failures, even if the Regenerating repairer is provided the failures in advance. However, the following repairer maintains recoverability of the source data, uses storage overhead β = 1/N, and reads only nsize bits per failure for any failure sequence that is provided in advance. The source data of size (N − 1) · nsize is stored on N − 1 nodes, and the remaining node is empty. Just before the next failure, the repairer copies all data from the node that is going to fail to the empty node, and the new node replacing the failed node becomes the empty node.
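A minimal Python simulation of this copy-ahead repairer, under the assumption that the failure sequence is known in advance (node contents are represented abstractly as chunk labels):

```python
import random

def copy_ahead_repairer(num_nodes: int, failure_sequence: list[int]):
    """Source data occupies num_nodes - 1 nodes; one node is kept empty.
    Just before each (known-in-advance) failure, copy the failing node's
    contents to the empty node. Returns (recoverable, nodes_read_per_failure)."""
    nodes = {i: f"chunk-{i}" for i in range(num_nodes - 1)}  # stored source chunks
    nodes[num_nodes - 1] = None                              # the empty node
    nodes_read = 0
    for failing in failure_sequence:
        empty = next(i for i, chunk in nodes.items() if chunk is None)
        if nodes[failing] is not None:       # copy ahead of the failure
            nodes[empty] = nodes[failing]
            nodes_read += 1                  # reads one node's worth of data
        nodes[failing] = None                # the failure erases the node
    recoverable = len({c for c in nodes.values() if c is not None}) == num_nodes - 1
    return recoverable, nodes_read / len(failure_sequence)

random.seed(0)
ok, reads = copy_ahead_repairer(10, [random.randrange(10) for _ in range(1000)])
print(ok, reads)  # True, and at most one node's worth of data read per failure
```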

This repairer: (a) is not a Regenerating repairer; (b) violates Inequality (4) of Dimakis Lemma II.1 and yet the source data can always be reliably recovered; (c) shows there is no non-trivial general lower bound if the failure sequence is provided to the repairer in advance. This shows that Dimakis Lemma II.1 is not a lower bound that applies to all repairers, and that it is impossible to prove non-trivial general lower bounds if the failure sequence is provided to the repairer in advance.

The right-hand side of Inequality (4) of Dimakis Lemma II.1 converges to the same value as the right-hand side of Inequality (30) of Poisson Failures Theorem VIII.3 as N goes to infinity and β goes to zero. The main differences between Dimakis Lemma II.1 and Poisson Failures Theorem VIII.3 are the generality of the repairers to which the lower bounds apply and the type of failure sequences used to prove the lower bounds.

Dimakis Lemma II.1 applies to Regenerating repairers within the restrictions of the Regenerating framework, i.e., a Regenerating repairer that predictably reads a given amount of data from each node and transfers a predictable amount of data to a replacement node between failures. On the other hand, Poisson Failures Theorem VIII.3 applies to any repairer, i.e., a repairer that can be completely unpredictable.

For a Regenerating repairer that doesn’t read enough data, the failure sequence that causes the source data to be unrecoverable in Dimakis Lemma II.1 is an atypical failure sequence with distinct failures. On the other hand, for a general repairer that doesn’t read enough data, the failure sequence that causes the source data to be unrecoverable in Poisson Failures Theorem VIII.3 is a typical random failure sequence that is chosen independently of the repairer.

The following examples show that repairers for practical systems do not belong to the class of Regenerating repairers for which Dimakis Lemma II.1 applies, and thus Dimakis Lemma II.1 is not a lower bound on repairers in general.

As can be inferred from Section I-B, a typical repairer for a small code system reads data from only a fraction of the nodes to replace the data on a failed node. Thus, repairers for small code systems are not Regenerating repairers for which the Regenerating lower bound of Dimakis Lemma II.1 applies.

Liquid systems repairers [18], [19] transfer data incrementally to a replacement node over a constant fraction of failures after the node it replaces fails. Thus, repairers for liquid systems are not Regenerating repairers for which the Regenerating lower bound of Dimakis Lemma II.1 applies.

III System model

We introduce a model of distributed storage which is inspired by properties inherent and common to systems described in Section I. This model captures some of the essential features of any distributed storage system. All lower bounds are proved with respect to this system model.

There are a number of possible strategies beyond those outlined in Section I that could be used to implement a distributed storage system. One of our primary contributions is to provide fundamental lower bounds on the read rate needed to maintain source data recoverability for any distributed storage system, current or future, for a given storage overhead and failure rate. Appendix B provides details on how the system model introduced in this section applies to real systems.

III-A Architecture

Figure 1 shows an architectural overview of the distributed storage model. A storer generates data from source data received from a source, and stores the generated data at the N nodes. In our model, the source data is a random variable that is randomly and uniformly chosen, and thus its entropy equals its length dsize.

Fig. 1: Distributed storage architecture

Figure 2 shows the nodes of the distributed storage system, together with the network that connects each node to a repairer. Each of the N nodes can store nsize bits, and the system capacity is N · nsize bits.

Fig. 2: Storage nodes and repairer model.

As nodes fail and are replaced, a repairer continually reads data from the nodes, computes a function of the read data, and writes the computed data back to the nodes. The repairer tries to ensure that the source data can be recovered at any time from the data stored at the nodes.

As shown in Figure 1, after some amount of time passes, a recoverer reads data from the nodes to generate an estimate of the source data, which is provided to a destination; the source data is reliably recovered if the estimate equals the source data. The goal is to maximize the amount of time over which the recoverer can reliably recover the source data.

III-B Failures

A failure sequence determines when and what nodes fail as time passes. A failure sequence is a combination of two sequences: a timing sequence, where each entry is the time at which a node fails, and an identifier sequence, where the corresponding entry is the identifier of the node that fails at that time.

All bits stored at a node are immediately erased when the node fails, where erasing a bit means setting its value to zero. This can be viewed as immediately replacing a failed node with a replacement node with storage initialized to zeroes. Thus, at each point in time there are N nodes.

A primary objective of practical distributed storage architectures is to distribute the components of the system so that failures are as independent as possible. Poisson failure distributions are an idealization of this primary objective, and are often used to model and evaluate distributed storage systems in practice. For a Poisson failure distribution with rate λ, the time between when a node is initialized and when it fails is an independent exponential random variable with rate λ, i.e., 1/λ is the average lifetime of a node between when it is initialized and when it fails. Our main lower bounds in Section VIII are with respect to Poisson failure distributions.
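A small sketch of how such a failure sequence can be simulated: each node's lifetime is drawn as an exponential random variable with rate λ, and a failed node is immediately replaced by a fresh node with a new exponential lifetime (the parameter values are assumptions for the example).

```python
import heapq
import random

def poisson_failure_sequence(num_nodes: int, lam: float, horizon: float, seed: int = 1):
    """Generate (time, node_id) failure events up to `horizon`, with exponential
    node lifetimes of rate `lam` and immediate replacement of failed nodes."""
    rng = random.Random(seed)
    heap = [(rng.expovariate(lam), node) for node in range(num_nodes)]
    heapq.heapify(heap)
    events = []
    while heap[0][0] <= horizon:
        t, node = heapq.heappop(heap)
        events.append((t, node))
        heapq.heappush(heap, (t + rng.expovariate(lam), node))  # replacement node's lifetime
    return events

events = poisson_failure_sequence(num_nodes=100, lam=1 / 3.0, horizon=10.0)
print(len(events), events[:3])  # roughly num_nodes * lam * horizon failures expected
```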

III-C Network

The model assumes there is a network interface between each node and the repairer over which all data from and to the node travels. One of the primary lower bound metrics is the amount of data that travels over interfaces from nodes to the repairer, which is counted as data read by the repairer. For the lower bounds, this is the only network traffic that is counted. All other data traffic within the repairer, i.e. data traffic internal to a distributed repairer, data traffic over an interface from the repairer to nodes, or any other data traffic that does not travel over an interface from a node to the repairer, is not counted for the lower bounds. It is assumed that the network is completely reliable, and that all data that travels over an interface from a node to the repairer is instantly available everywhere within the repairer.

III-D Storer

A storer takes the source data and generates and stores data at the nodes in a preprocessing step when the system is first initialized and before there are any failures. We assume that the recoverer can reliably recover the source data from the data stored at the nodes immediately after the preprocessing step finishes.

For simplicity, we view the storer preprocessing step as part of the repairer, and any data read during the storer preprocessing step is not counted in the lower bounds.

For the lower bounds, there are no assumptions about how the storer generates the stored data from the source data, i.e. no assumptions about any type of coding used, no assumptions about partitioning the source data into objects, etc. As an example, the source data can be encrypted, compressed, encoded using an error-correcting code or erasure code, replicated, or processed in any other way known or unknown to generate the stored data, and still the lower bounds hold. Analogous remarks hold for the repairers described next.

III-E Repairer

A repairer for a system is a process that operates as follows. When a node fails, its identifier is provided to the repairer at the failure time, which alerts the repairer that all bits stored on that node are lost at that time. As nodes fail and are replaced, the repairer reads data over interfaces from nodes, performs computations on the read data, and writes computed data over interfaces to nodes. A primary metric is the number of bits the repairer reads over interfaces from storage nodes.

Appendix A provides a detailed description of repairers, including local-computation repairers.

III-F Recoverer

For any repairer there is a recoverer that, given the global state of the system at a time, outputs an estimate of the source data: if the source data takes a given value and the repairer produces a given state at that time, then the recoverer applied to that state should output that value of the source data.

Source data is recoverable at a time with respect to a repairer and recoverer if the recoverer applied to the system state at that time outputs the source data, and is unrecoverable at that time otherwise.

III-G System State

At each time, the global state of the system consists of the bits stored in the global memory of the repairer together with the bits stored at each of the N nodes, where nsize is the capacity of each node. Thus, the size of the global system state at any time is the size of the global memory plus N · nsize.

IV Guide for lower bound proofs

The ultimate goal is to prove that Inequality (3) holds with respect to the Poisson failure distribution for any repairer.

In practice, the timing of failures, and the identity of which nodes fail, are not known in advance, and thus repairers must handle these uncertainties. A much simpler model for repairers to handle is a periodic failure sequence, i.e., the time between consecutive failures is a constant known to the repairer. Many of the lower bounds we prove hold for periodic failure sequences, and the only uncertainty is which nodes fail.

Lower bounds for β ≥ 1/2 are not of great interest, since for β ≥ 1/2 the repairer that maintains a duplicate copy of the source data succeeds in maintaining recoverability of the source data forever for periodic failure sequences. Furthermore, in practice, the interest is to decrease β as much as possible, and thus we hereafter restrict attention to β < 1/2.

Section V introduces the notion of a phase, where a phase is a failure sequence of a specified number of distinct failures. Consider the average read rate of a repairer over the failures of a phase for a periodic failure sequence. The overall idea of the lower bound proof is to show that, for any repairer, either there is a point in the phase at which the average read rate up to that point is above a lower bound rate, or at the end of the phase the source data is unrecoverable.

Section V shows that the system state at the end of the phase can be generated from a compressed state, which consists of the data read from nodes that fail in a phase before they fail and the data at nodes at the start of the phase that don't fail in the phase. The crucial but simple Compression Lemma V.1 and Compression Corollary V.2 show that if the length of the compressed state is sufficiently smaller than dsize then the source data is unrecoverable from the compressed state. Since the system state at the end of the phase can be generated from the compressed state, this implies that in this case the source data is unrecoverable from the system state at the end of the phase.

Section VI introduces a restricted class of repairers, Equal-read repairers, that predictably read an equal amount of data from each node between failures. Equal-read repairers are introduced for two reasons: (1) they are similar to (but more general than) the Regenerating repairers discussed in Section II; (2) based on the framework introduced in Section V, the lower bound proofs for Equal-read repairers are easy and straightforward. Equal-read Lemma VI.1 of Section VI shows that if the predictable read rate of the Equal-read repairer is below a lower bound then the source data is unrecoverable with very high probability at the end of the phase.

The Equal-read Lemma VI.1 lower bound for Equal-read repairers holds for any failure sequence with distinct failures, even if the failure sequence is known in advance to the Equal-read repairer. However, as outlined in Section II, there is no non-trivial lower bound for general repairers if the failure sequence is known in advance. Thus, using random failure sequences for which the repairer cannot predict which future nodes will fail is key to proving general lower bounds.

Repairer actions can be unpredictable: A repairer may read different amounts of data from individual nodes between failures, and may read different amounts of data in aggregate from all nodes between different failures. The repairer actions can depend on the source data, which nodes have failed in the past, and the timing of past failures.

Section VII provides the technical core of the lower bound proofs that use random failure sequences with distinct failures to prove lower bounds on general repairers that may act unpredictably. The proof of Core Lemma VII.1 in Section VII is the most technically challenging proof in the paper. It shows that, for any repairer, when the identifier sequence consists of randomly chosen distinct identifiers within a phase of failures, then there is only a tiny probability that the repairer reads enough data from the failing nodes before they fail while its average read rate remains below a lower bound throughout the phase. Supermartingale Theorem D.1 is used to prove Core Lemma VII.1, and may be of independent interest. Core Theorem VII.2 directly uses Core Lemma VII.1 to show that with very high probability there is either a point in the phase where the read rate of the repairer up to that point is above a lower bound or else the source data is unrecoverable at the end of the phase.

A phase terminates early if the repairer reads enough data from all nodes in a prefix of the phase, i.e., a phase is terminated at the first index where the average read rate up to that index is above a lower bound, and another phase is started at that point. The overall lower bounds are proved by stitching together consecutive phases. Thus, the lower bound holds over the failure sequence within each stitched-together phase. There are some technical issues with stitching together phases. The actions of the repairer have an influence on when one phase ends and the next phase begins. Since the distinct failures within a phase depend on when the phase starts, the repairer has an influence on the failure sequence.

What we would like instead are lower bounds where the failure sequence is chosen completely independently of the repairer, which is achieved in Section VIII. Distinct Failures Lemma VIII.1 together with Uniform Failures Theorem VIII.2 shows that the lower bounds for periodic failures shown in Core Theorem VII.2 still apply when the failures within a failure sequence are chosen uniformly at random and are no longer required to be distinct within a phase. Poisson Failures Theorem VIII.3, the main result of this research, shows that the lower bounds of Uniform Failures Theorem VIII.2 extend when the timing sequence of the failure sequence is Poisson distributed instead of being restricted to being periodic. Thus, Poisson Failures Theorem VIII.3 shows that for any repairer the lower bounds apply when nodes fail independently according to a Poisson process.

V Emulating repairers in phases

We prove lower bounds based on considering the actions of a repairer, or local-computation repairer, running in phases. Each phase considers a failure sequence with a specified number of failures, where the failures within a phase are distinct, as described in more detail below.

We call an identifier sequence a distinct identifier sequence when all of its identifiers are distinct. We also consider random distinct identifier sequences, in which the first identifier is chosen uniformly at random from all node identifiers and each subsequent identifier is chosen uniformly at random from the identifiers not already chosen. This defines a distribution on distinct identifier sequences.
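A sketch of sampling such a random distinct identifier sequence, i.e., sampling node identifiers uniformly at random without replacement:

```python
import random

def random_distinct_identifier_sequence(num_nodes: int, length: int, seed: int = 0):
    """Sample `length` distinct node identifiers uniformly at random
    (sampling without replacement from {0, ..., num_nodes - 1})."""
    rng = random.Random(seed)
    return rng.sample(range(num_nodes), length)

print(random_distinct_identifier_sequence(num_nodes=100, length=10))
```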

A phase consists of executing the repairer on a failure sequence consisting of a timing sequence and a distinct identifier sequence, where the identifier sequence is revealed to the repairer as the phase progresses. For the analysis we also consider prefixes of the timing sequence and of the identifier sequence.

Fix a repairer, a recoverer, a timing sequence, and an identifier sequence. The variables defined below depend on these parameters, but to simplify notation this dependence is not explicitly expressed in the variable names.

For each node and each pair of times within the phase, consider the data read by the repairer from that node between those times with respect to the fixed parameters, and its size (or length). (If the repairer is a local-computation repairer, then this is the locally-computed data read by the repairer over the interface from that node between those times.) Similarly, consider the data read from all nodes by the repairer between two times, and the total amount of such data.

Finally, consider the data read by the repairer from each node that fails in the phase between the start of the phase and the time of its failure, and the total length of data read by the repairer in the phase from nodes that fail, counted before their failures in the phase.

Before a phase begins, the storer has generated and stored data at the nodes based on the source data, and the repairer has been executed with respect to a failure sequence up till a time just before the time of the first failure of the phase. We assume that the recoverer can recover the source data from the system state at that time.

V-A Compressed state

For this subsection, we fix a repairer, a recoverer, a timing sequence, an identifier sequence, and the source data. The variables defined below depend on these parameters, but to simplify notation this dependence is not explicitly expressed in the variable names.

We conceptually define two executions of a phase with respect to these fixed parameters. The first execution runs normally from just before the first failure of the phase to just after the last failure of the phase, starting from the system state at the former time and ending in the system state at the latter time. Thus, all failures of the phase occur within this interval, but the repairer does not read any bits before the start or after the end of the phase.

Consider the concatenation of the bits read by the repairer from nodes that fail in the phase, read before they fail, concatenated in the order they are read. This concatenation contains exactly the bits counted in the total defined above, but the order of the bits is defined by the order in which they are read by the repairer.

We define the compressed state with respect to these parameters to be the global memory of the repairer at the start of the phase, together with the contents at the start of the phase of the nodes that do not fail in the phase, and the concatenation just described; its length is accounted for in Equation (6).

The second execution is an exact replay of the first execution, i.e., the repairer reads, computes, and writes exactly the same bits at the same times as in the first execution with respect to the same failure sequence, to arrive in the same final state as the first execution. However, the second execution uses the compressed state in place of the full starting system state as the starting point of the execution. The initial global memory state of the repairer is set to its value at the start of the phase, and the state of each node that does not fail in the phase is initialized to its contents at the start of the phase.

Initially, none of the concatenation has yet been provided to the repairer. Now consider a time within the phase.

Suppose at this time the repairer is to read bits over the interface from a node: if the node does not fail later in the phase then the requested bits are read from the node exactly as in the first execution; if the node does fail later in the phase then the requested bits are provided to the repairer from the next consecutive portion of the concatenation not yet provided, which, by the construction of the concatenation, is guaranteed to be exactly the bits read from that node at that time in the first execution.

Suppose at this time the repairer is to write bits to a node: if the node does not fail later in the phase then the bits are written to the node exactly as in the first execution; if the node does fail later in the phase then the write is skipped, since whatever bits are subsequently read from that node up till the time it fails are already part of the concatenation.

When a node fails during the phase, its state is reset and initialized to zeroes at its failure time, exactly as in the first execution.

If a local-computation repairer is used instead of a repairer, then when it is to produce and read locally-computed bits over the interface from a node within the phase, the requested bits are provided in the same manner, locally computed based also on the compressed state.

It can be verified that the state of the system at the end of the second execution equals the state at the end of the first execution, whether a repairer or a local-computation repairer is used. Thus, the system state at the end of the phase can be generated from the compressed state.

V-B Viewing the compressed state as a cut in an acyclic graph

Similar to [7], [8], the compressed state can be viewed as a cut in an acyclic graph. An example of the acyclic graph is shown in Figure 3. The beginning of the phase is at the bottom, and going vertically up corresponds to time flowing forward. The leftmost vertical column is for the global memory of the repairer, and there is a vertical column for each of the storage nodes, each of capacity nsize. The bottom row of vertices corresponds to the system state at the start of the phase; the second row from the bottom corresponds to the system state slightly later in the phase; the top row of vertices corresponds to the system state at the end of the phase. The vertices in a storage node column correspond to the state of that storage node over time, where edges flowing out of the column correspond to data transfer out of the node, and edges flowing into the column correspond to data transfer into the node. Similar remarks hold for the global memory column.

The edges pointing vertically up are labeled with the capacity of the corresponding entity, i.e., the size of the global memory for the global memory column and nsize for each storage node column. The non-vertical edges that connect a first vertex to a second vertex correspond to a data transfer, where the label of the edge corresponds to the amount of data transferred.

In the example shown in Figure 3, three of the nodes fail during the phase, at successively later times. Each node that fails is replaced with an empty node, and thus there is no edge from the vertex corresponding to a node just before it fails to the vertex above corresponding to the replacement node. Thus, the compressed state includes all the data transferred along the edges that emanate from the vertices in the columns corresponding to these three nodes before their failures, where these edges are shown in gray in Figure 3.

The remaining nodes in the example do not fail before the end of the phase. Thus, the compressed state includes the bits of data transferred along the vertical edges from the bottom row to the second row from the bottom for each column corresponding to these storage nodes, where these edges are shown in gray in Figure 3. In addition, the compressed state includes the bits of data transferred along the vertical edge from the bottom row to the second row from the bottom for the first column, corresponding to the global memory.

The cut corresponding to the compressed state is shown in Figure 3 as the curved gray line, where the size of the compressed state is the sum of the labels of the edges crossing the cut from the vertices below the cut.

The bit values of the system state determine the edges and the edge label values in the acyclic graph, i.e., the edges and edge label values in the acyclic graph depend on the bit values stored at the vertices in the graph. This is unlike the acyclic graph representation in [7], [8], where the edges and the edge label values are independent of the bit values stored at the vertices in the graph.

Although the acyclic graph visualization of the compressed state provides some good intuition, Section V-A provides the formal definition of the compressed state and its properties.

Fig. 3: Viewing the compressed state as a cut in an acyclic graph

V-C Compression lemma

For this subsection, we fix a repairer, a recoverer, a timing sequence, and an identifier sequence. The variables defined below depend on these parameters, but to simplify notation this dependence is not explicitly expressed in the variable names.

The value of the source data is a variable in this subsection: the source data is modeled as a random variable that is uniformly distributed. We consider the chain of mappings by which a value of the source data is mapped before the start of the phase to a starting system state, which in turn is mapped by the repairer to a compressed state by the first execution of the emulation, which is mapped to an ending system state by the second execution of the emulation, which in turn is mapped by the recoverer to a recovered value.

Compression Lemma V.1

Fix any repairer or local-computation repairer, recoverer, timing sequence, and distinct identifier sequence. Then the following holds.

  • Fix the timing sequence and the identifier sequence, and consider a bound on the length of the compressed state. The number of distinct ending system states that can arise is at most the number of bit-strings whose length is within that bound, since any fixed value of the compressed state maps to a unique ending state in the second execution. Thus, at most that many values of the source data can be correctly recovered when the compressed state is within the length bound (a counting sketch follows below).
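A sketch of the counting step, written with a generic length bound ℓ (the symbol ℓ is used here only for illustration):

```latex
% Number of bit-strings of length at most \ell:
\[
  \sum_{i=0}^{\ell} 2^{i} \;=\; 2^{\ell+1}-1 \;<\; 2^{\ell+1}.
\]
% Since the source data is uniform over 2^{dsize} values, the probability that it falls
% among the at most 2^{\ell+1} recoverable values is at most 2^{\ell+1-dsize}.
```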

V-D Compression corollary

Let

(7)

and let

(8)

be the minimal number of nodes so that . Let

(9)

Note that

Generally, , e.g., for the practical system described in Section I-A, .

Throughout the remainder of this section, Section VI and Section VII, we set

(10)

to be the number of failures in a phase, and thus from Equation (9),

(11)

Note that the restriction is mild for the parameter ranges of practical interest.

Compression Corollary V.2

Fix any repairer or local-computation repairer, recoverer, timing sequence, and distinct identifier sequence. Then the following holds,

where the compressed state is defined with respect to these parameters.

  • Follows from Compression Lemma V.1 and Equations (6), (7), (8), (10).       

Note that Compression Lemma V.1 and Compression Corollary V.2 rely upon the assumption that the source data is uniformly distributed, and all subsequent technical results rely on this assumption.

A natural relaxation of this assumption is that the source data has high min-entropy, where the min-entropy is the log base two of one over the probability of the most likely value for the source data. Thus the min-entropy of the source data is always at most dsize.

Since the source data for practical systems is composed of many independent source objects, typically the min-entropy of the source data for a practical system is close to dsize. It can be verified that all of the lower bounds hold if the min-entropy of the source data is universally substituted for dsize.
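In standard notation, the min-entropy referred to above can be written as follows (the symbol D for the source data is used here only for illustration):

```latex
% Min-entropy of the source data D: the log base two of one over the probability
% of its most likely value, which never exceeds its length dsize.
\[
  H_{\infty}(D) \;=\; -\log_{2}\,\max_{d}\,\Pr[D = d] \;\le\; \mathrm{dsize}.
\]
```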

VI Equal-read repairer lower bound

This section introduces and proves a lower bound on a constrained repairer, which we hereafter call an Equal-read repairer, within the model introduced in Section III. An Equal-read repairer is in some ways similar to the Regenerating repairer of [7], [8] described in Section II, in the sense that between consecutive failures an Equal-read repairer is constrained to read an equal amount of data from each of the N nodes, and thus the total amount of data read from all nodes between failures is N times the per-node amount. Unlike a Regenerating repairer, an Equal-read repairer is not constrained in any other way, e.g., which data and how much data is transferred to each node between failures is unconstrained, and there is no constraint on when data is transferred to nodes.

Equal-read Lemma VI.1

For any Equal-read repairer, any recoverer, any timing sequence, and any distinct identifier sequence, if

(12)

then

  • For any Equal-read repairer, the amount of data read from each node between failures is the same fixed amount, independent of the timing sequence and the identifier sequence. Thus, for any prefix of the phase,

    (13)
    (14)
    (15)

    where Inequality (15) follows from Equations (9) and (10). The proof then follows from applying Compression Corollary V.2 with these bounds.

VII Core lower bounds

From Compression Corollary V.2, a necessary condition for the source data to be reliably recoverable at the end of the phase is that the repairer or local-computation repairer must read a lot of data from nodes that fail during the phase, and must read this data before the nodes fail.

On the other hand, the repairer cannot predict which nodes are going to fail during a phase, and only a small fraction of the nodes fail during a phase. Thus, to ensure that enough data has been read from nodes that fail before the end of the phase, a larger amount of data must be read in aggregate from all the nodes.

Core Lemma VII.1, the primary technical contribution of this section, is used to prove Core Theorem VII.2 and all later results.

Core Lemma VII.1

Fix and let

(16)

For , let

(17)

For any repairer or local-computation repairer, the following probability bound holds,

where the probability is with respect to the random distinct identifier sequence.

  • The proof can be found in Appendix C.       

With the settings in Section I-A and , when , and when .

Core Theorem VII.2

Fix the parameters as above, where Equation (16) defines the relevant quantity. For any repairer and recoverer, and for any fixed timing sequence, with probability at most a quantity that is essentially zero in practical settings (with respect to the random distinct identifier sequence and the source data), the following two statements are both true:

(1) For every prefix of the phase, the average number of bits read by the repairer per failure over that prefix is less than

(18)

(2) Source data is recoverable at the end of the phase.

  • Consider the event that the repairer reads at least a threshold amount of data, in total, from the nodes that fail in the phase before they fail, where the threshold is defined with respect to the fixed parameters.

    The probability of (1) and (2) both being true with respect to the random identifier sequence and the source data is at most the sum of the following two probabilities:

    (a) The probability that (1) and (2) hold and the event does not occur. This is at most the probability that (2) holds and the event does not occur, which Compression Corollary V.2 shows is negligible.

    (b) The probability that (1) and (2) hold and the event occurs. This is at most the probability that (1) holds and the event occurs. Note that

    (19)

    holds for any prefix of the phase, where Equation (17) defines the quantities in Inequality (19). Thus, Core Lemma VII.1 shows that this probability is also negligible.

     

Note that this probability bound is essentially zero in any practical setting, for example for the settings in Section I-A.

VIII Main lower bounds

In all previous sections, all the failures within a phase are distinct, and when a phase ends and a new phase begins depends on the actions of the repairer, and thus the analysis does not apply to failure sequences where the failures are independent of the repairer. This section extends the results to random and independent failure sequences.

Uniform Failures Theorem VIII.2 in Section VIII-C proves lower bounds for any fixed timing sequence with respect to a uniform identifier sequence distribution. A uniform identifier sequence distribution with distinct failures can be generated as follows, using the quantities defined in Equations (8) and (10).

VIII-A Uniform identifier sequence distribution within a phase

Consider a sequence of node identifiers that are independently and uniformly distributed, and define geometric random variables with respect to this sequence as in Equation (20). Let a sequence of independent geometric random variables be given, each defined with respect to an independent uniform sequence, and let a random distinct identifier sequence be chosen as described in Section V. The uniform identifier sequence distribution for the phase can then be generated from the geometric random variables and the distinct identifier sequence: the geometric random variables determine the indices at which a new distinct identifier appears, the distinct identifier sequence supplies the identifiers at those indices, and the remaining positions repeat identifiers that have already appeared. The result is a uniform identifier sequence distribution.

Note that the indices of the identifiers that are distinct from all previous identifiers are random variables defined in terms of the geometric random variables. Thus, the geometric random variables determine the distinct failure indices in a phase.

For each index, let the time of the corresponding distinct failure beyond the initial failure be recorded. This defines a timing sequence for the distinct failures, which is determined by the original timing sequence and the geometric random variables, and is independent of the distinct identifier sequence.

The expected number of failures in a phase until there are distinct failures beyond the initial failure is

(21)
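Equation (21) is an instance of the partial coupon-collector calculation: the expected number of uniform failures needed to see a given number of distinct node identifiers. The sketch below computes the standard closed form and checks it by simulation; whether Equation (21) includes an offset for the initial failure is not reproduced here.

```python
import random

def expected_failures_until_distinct(N: int, m: int) -> float:
    """Partial coupon-collector sum: expected number of uniform failures
    needed to see m distinct node identifiers among N nodes."""
    return sum(N / (N - i) for i in range(m))

def simulate(N: int, m: int, trials: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, count = set(), 0
        while len(seen) < m:
            seen.add(rng.randrange(N))
            count += 1
        total += count
    return total / trials

N, m = 1000, 100
print(expected_failures_until_distinct(N, m))  # analytic value, about 105.4
print(simulate(N, m))                          # should be close to the analytic value
```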

For , define

(22)

For ,

(23)

Setting , and using Equations (8), (10), (21), (23),

Note that as ,

(24)

Thus, as .

A phase proceeds as follows with respect to the source data and the failure sequence, where Equation (17) defines the relevant threshold. At each successive failure, the repairer is executed up till the corresponding failure time; if the amount of data read so far meets the threshold for the current prefix, then the phase ends at that time. If the phase doesn't end in the above process, then the phase ends at the time of the last distinct failure of the phase.

VIII-B Distinct failures lemma

The condition for ending a phase ensures that the amount of data read by the repairer up till the end of the phase is at least the threshold. However, a lower bound on the total number of failures in the phase is also needed. The issue is that the total number of failures in a phase is a random variable that can be highly variable relative to the number of distinct failures and can depend on the repairer, and thus the per-failure read rate can be highly variable and can be influenced by the repairer. Thus, it is difficult to provide lower bounds when considering only a single phase.

To circumvent these issues, we stitch phases together into a sequence of phases, and argue that the repairer must read a lot of data per failure over a sequence of phases that covers a large enough number of distinct failures. Distinct Failures Lemma VIII.1 below proves that if we stitch together enough phases then we can ensure that, independent of the actions of the repairer, with high probability the total number of failures aggregated over the phases is close to the expected number of failures relative to the number of distinct failures.

The phases can be stitched together as follows. Let