Erasure Coding for Distributed Storage: An Overview

06/12/2018
by   S. B. Balaji, et al.

In a distributed storage system, code symbols are dispersed across space in nodes or storage units as opposed to time. In settings such as that of a large data center, an important consideration is the efficient repair of a failed node. Efficient repair calls for erasure codes that in the face of node failure, are efficient in terms of minimizing the amount of repair data transferred over the network, the amount of data accessed at a helper node as well as the number of helper nodes contacted. Coding theory has evolved to handle these challenges by introducing two new classes of erasure codes, namely regenerating codes and locally recoverable codes as well as by coming up with novel ways to repair the ubiquitous Reed-Solomon code. This survey provides an overview of the efforts in this direction that have taken place over the past decade.


I Introduction

This survey article deals with the use of erasure coding for the reliable and efficient storage of large amounts of data in settings such as that of a data center. The amount of data stored in a single data center can run into tens or hundreds of petabytes. Reliability of data storage is ensured in part by introducing redundancy in some form, ranging from simple replication to the use of more sophisticated erasure-coding schemes such as Reed-Solomon codes. Minimizing the storage overhead that comes with ensuring reliability is a key consideration in the choice of erasure-coding scheme. More recently a second problem has surfaced, namely, that of node repair.

In [1], [2] the authors study the Facebook warehouse cluster and analyze the frequency of node failures as well as the resultant network traffic relating to node repair. It was observed in [1] that a median of 50 nodes are unavailable per day and that a median of 180 TB of cross-rack traffic is generated as a result of node unavailability. It was also reported that 98.08% of the cases have exactly one block missing in a stripe. The erasure code that was deployed in this instance was an $[n = 14, k = 10]$ Reed-Solomon (RS) code. Here $n$ denotes the block length of the code and $k$ the dimension. The conventional repair of an $[n, k]$ RS code is inefficient in that the repair of a single node calls for contacting $k$ other (helper) nodes and downloading $k$ times the amount of data stored in the failed node. Thus there is significant practical interest in the design of erasure-coding techniques that offer low storage overhead and can also be repaired efficiently.

Coding theorists have responded to this need by coming up with two new classes of codes, namely ReGenerating (RG) and Locally Recoverable (LR) codes. The focus in an RG code is on minimizing the amount of data download needed to repair a failed node, termed the repair bandwidth, while LR codes seek to minimize the number of helper nodes contacted for node repair, termed the repair degree. In a different direction, coding theorists have also re-examined the problem of node repair in RS codes and have come up with new and more efficient repair techniques. This survey provides an overview of these recent developments. An outline of the survey itself appears in Fig. 1.

RG codes are discussed in Section II. The two principal classes of RG codes, namely Minimum Bandwidth Regenerating (MBR) and Minimum Storage Regenerating (MSR) codes, appear in the two sections that follow. These two classes of codes are at the two extreme ends of a tradeoff known as the storage-repair bandwidth (S-RB) tradeoff. A discussion on codes that correspond to the interior points of this tradeoff appears in Section V. The theory of regenerating codes has been extended in several directions and these are explored in Section VI. Section VII examines LR codes. There have been several approaches to extending the theory of LR codes to handle multiple erasures and these are dealt with in Section VIII. A class of codes known as Locally ReGenerating (LRG) codes that offer both low repair bandwidth and low repair degree within a single erasure code is discussed in Section IX. This is followed by Section X, which discusses recent advances in the repair of Reed-Solomon codes. A brief description of a different approach based on capacity considerations and leading to the development of a liquid cloud storage system appears in Section XI. The final section discusses practical evaluations and implementations.

Disclaimer: This survey is presented from the perspective of the authors and is biased in this respect. Given the explosion of research activity in this area, the survey also does not claim to be comprehensive and we offer our apologies to the authors whose work has inadvertently or for lack of space, not been appropriately cited. We direct the interested reader to some of the excellent surveys of codes on distributed storage contained in the literature including [3], [4], [5] and [6].

Fig. 1: An overview of the different classes of codes for distributed storage discussed in this survey article.

II Regenerating Codes


Fig. 2: An illustration of the data collection and node repair properties of a regenerating code.
Definition 1 ([7]).

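Let $\mathbb{F}_q$ denote a finite field of size $q$. Then a regenerating (RG) code $\mathcal{C}$ over $\mathbb{F}_q$ having integer parameter set $\{(n, k, d), (\alpha, \beta), B\}$, where $1 \le k \le n-1$, $k \le d \le n-1$, $\beta \le \alpha$, maps a file $\underline{u} \in \mathbb{F}_q^{B}$ on to a collection $\{\underline{c}_i\}_{i=1}^{n}$ of $\alpha$-tuples over $\mathbb{F}_q$ using an encoding map

$$E(\underline{u}) = [\underline{c}_1^T, \underline{c}_2^T, \ldots, \underline{c}_n^T]^T$$

with the $\alpha$ components of $\underline{c}_i$ stored on the $i$-th node in such a way that the following two properties (see Fig. 2) are satisfied:

Data Collection: The message $\underline{u}$ can be uniquely recovered from the contents $\{\underline{c}_{i_j}\}_{j=1}^{k}$ of any $k$ nodes.

Node Repair: If the $f$-th node storing $\underline{c}_f$ fails, then a replacement node can

  1. contact any subset $D \subseteq [n] \setminus \{f\}$ of the remaining $(n-1)$ nodes of size $|D| = d$,

  2. map the $\alpha$ contents $\underline{c}_h$ of each helper node $h \in D$ on to a collection of $\beta$ repair symbols $\underline{a}_{h,f}^{D} \in \mathbb{F}_q^{\beta}$,

  3. pool together the $d\beta$ repair symbols thus computed to use them to create a replacement vector $\hat{\underline{c}}_f \in \mathbb{F}_q^{\alpha}$ whose $\alpha$ components are stored in the replacement node,

in such a way that the contents of the resultant $n$ nodes, with the replacement node replacing the failed node, once again form a regenerating code.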

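A regenerating code is said to be an exact-repair (ER) regenerating code if the contents of the replacement node are exactly the same as those of the failed node, i.e., $\hat{\underline{c}}_f = \underline{c}_f$. Else the code is said to be a functional-repair (FR) regenerating code. A regenerating code is said to be linear if

  1. $E(\underline{u}_1 + \theta \underline{u}_2) = E(\underline{u}_1) + \theta E(\underline{u}_2)$ for all $\underline{u}_1, \underline{u}_2 \in \mathbb{F}_q^{B}$, $\theta \in \mathbb{F}_q$, and

  2. the map mapping the contents $\underline{c}_h$ of the $h$-th helper node on to the corresponding $\beta$ repair symbols $\underline{a}_{h,f}^{D}$ is linear over $\mathbb{F}_q$.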

Thus a regenerating code is a code over a vector alphabet $\mathbb{F}_q^{\alpha}$ and the quantity $\alpha$ is termed the sub-packetization level of the regenerating code. The total number $d\beta$ of $\mathbb{F}_q$ symbols to be transferred for the repair of a failed node is called the repair bandwidth of the regenerating code. The rate of the regenerating code is given by $R = \frac{B}{n\alpha}$. Its reciprocal $\frac{n\alpha}{B}$ is the storage overhead.

II-A Cut-Set Bound

Fig. 3: The graph behind the cut-set file size bound.

Let us assume that $\mathcal{C}$ is a functional-repair regenerating code having parameter set $\{(n, k, d), (\alpha, \beta), B\}$. Since an exact-repair regenerating code is also a functional-repair code, this subsumes the case when $\mathcal{C}$ is an exact-repair regenerating code. Over time, nodes will undergo failures and every failed node will be replaced by a replacement node. Let us assume to begin with, that we are only interested in the behavior of the regenerating code over a finite-but-large number of node repairs. For simplicity, we assume that repair is carried out instantaneously. Then at any given time instant $t$, there are $n$ functioning nodes whose contents taken together comprise a regenerating code. At this time instant, a data collector could connect to $k$ nodes, download all of their contents and decode to recover the underlying message vector $\underline{u}$. Thus in all, there are at most $\binom{n}{k}$ distinct data collectors at any given time instant, which are distinguished based on the particular set of $k$ nodes to which the data collector connects.

Next, we create a source node $S$ that possesses the $B$ message symbols $\{u_i\}_{i=1}^{B}$, and draw edges connecting the source to the initial set of $n$ nodes. We also draw edges between the $d$ helper nodes that assist a replacement node and the replacement node itself, as well as edges connecting each data collector with the corresponding set of $k$ nodes from which the data collector downloads data. All edges are directed in the direction of information flow. We associate a capacity $\beta$ with edges emanating from a helper node to a replacement node and an $\infty$ capacity with all other edges. Each node can only store $\alpha$ symbols over $\mathbb{F}_q$. We take this constraint into account using a standard graph-theory construct, in which a node is replaced by two nodes separated by a directed edge (leading towards a data collector) of capacity $\alpha$. We have in this way, arrived at a graph (see Fig. 3) in which there is one source $S$ and at most sinks $\{T_j\}$.

Each sink $T_j$ would like to be able to reconstruct all the $B$ source symbols from the symbols it receives. This is precisely the multicast setting of network coding. A principal result in network coding tells us that in a multicast setting, one can transmit messages along the edges of the graph in such a way that each sink $T_j$ is able to reconstruct the source data, provided that the minimum capacity of a cut separating $S$ from $T_j$ is $\ge B$. A cut separating $S$ from $T_j$ is simply a partition of the nodes of the network into two sets: $A_j$ containing $S$ and $A_j^c$ containing $T_j$. The capacity of the cut is the sum of capacities of the edges leading from a node in $A_j$ to a node in $A_j^c$. A careful examination of the graph will reveal that the minimum capacity of a cut separating a sink from the source is given by $\sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\}$ (Fig. 3 shows an example cut separating source from sink). This leads to the following upper bound on file size [7]:

$$B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\}. \qquad (1)$$

Network coding also tells us that when only a finite number of regenerations take place, this bound is achievable and furthermore achievable using linear network coding, i.e., using only linear operations at each node in the network when the size $q$ of the finite field $\mathbb{F}_q$ is sufficiently large. In a subsequent result [8], Wu established using the specific structure of the graph, that even in the case when the number of sinks is infinite, the upper bound in (1) continues to be achievable using linear network coding.

In summary, by drawing upon network coding, we have been able to characterize the maximum file size of a regenerating code given parameters $\{(n, k, d), (\alpha, \beta)\}$ for the case of functional repair when there is no constraint placed on the size $q$ of the finite field $\mathbb{F}_q$. Note, interestingly, that the upper bound on file size is independent of $n$. Quite possibly, the role played by $n$ is that of determining the smallest value of field size $q$ for which a linear network code can be found having file size $B$ satisfying (1). A functional-repair regenerating code having parameters $\{(n, k, d), (\alpha, \beta), B\}$ is said to be optimal provided (a) the file size $B$ achieves the bound in (1) with equality and (b) reducing either $\alpha$ or $\beta$ will cause the bound in (1) to be violated.
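As a small numerical illustration, the sketch below evaluates the right-hand side of the cut-set bound, assuming it has the standard form $B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\}$ from [7], and checks the two optimality conditions (a) and (b) stated above. The function names `cut_set_bound` and `is_optimal` are illustrative choices, not notation from the survey.

```python
def cut_set_bound(k: int, d: int, alpha: int, beta: int) -> int:
    """Right-hand side of the cut-set bound: sum_{i=0}^{k-1} min(alpha, (d-i)*beta)."""
    if not 1 <= k <= d:
        raise ValueError("need 1 <= k <= d")
    return sum(min(alpha, (d - i) * beta) for i in range(k))


def is_optimal(B: int, k: int, d: int, alpha: int, beta: int) -> bool:
    """Conditions (a) and (b): equality in the bound, and any reduction
    of alpha or beta by one symbol drops the bound below B."""
    return (cut_set_bound(k, d, alpha, beta) == B
            and cut_set_bound(k, d, alpha - 1, beta) < B
            and cut_set_bound(k, d, alpha, beta - 1) < B)


# Example with k = 3, d = 4, beta = 1.
print(cut_set_bound(3, 4, alpha=4, beta=1))   # 4 + 3 + 2 = 9
print(cut_set_bound(3, 4, alpha=2, beta=1))   # 2 + 2 + 2 = 6
print(is_optimal(9, 3, 4, alpha=4, beta=1))   # True
print(is_optimal(6, 3, 4, alpha=3, beta=1))   # False: the bound is 8, not 6
```

Note that $n$ does not appear among the arguments, in line with the observation that the bound is independent of $n$.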

II-B Storage-Repair Bandwidth Tradeoff

We have thus far specified code parameters $\{(n, k, d), (\alpha, \beta)\}$ and asked what is the largest possible value of file size $B$. If however, we fix parameters $(n, k, d, B)$ and ask instead what are the smallest values of $(\alpha, \beta)$ for which one can hope to achieve (1), it turns out, as might be evident from the form of the summands on the RHS of (1), that there are several pairs $(\alpha, \beta)$ for which equality holds in (1). In other words, there are different flavors of optimality.

Fig. 4: Storage-repair bandwidth tradeoff.

For a given file size $B$, the storage overhead and normalized repair bandwidth are given respectively by $\frac{n\alpha}{B}$ and $\frac{d\beta}{B}$. Thus $\alpha$ reflects the amount of storage overhead while $\beta$ determines the normalized repair bandwidth. The several pairs $(\alpha, \beta)$ for which equality holds in (1) represent a tradeoff between storage overhead on the one hand and normalized repair bandwidth on the other, as can be seen from the example plot in Fig. 4. Clearly, the smallest value of $\alpha$ for which equality can hold in (1) is given by $\alpha = \frac{B}{k}$. Given $\alpha = \frac{B}{k}$, the smallest permissible value of $\beta$ is given by $\beta = \frac{\alpha}{d-k+1}$. This represents the minimum storage regeneration point and codes achieving (1) with $\alpha = \frac{B}{k}$ and $\beta = \frac{\alpha}{d-k+1}$ are known as minimum storage regenerating (MSR) codes. At the other end of the tradeoff, we have the minimum bandwidth regenerating (MBR) code whose associated values are given by $\beta = \frac{B}{dk - \binom{k}{2}}$, $\alpha = d\beta$.
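The two extreme points can be computed directly from $(B, k, d)$. The following sketch assumes the standard expressions $\alpha = B/k$, $\beta = \alpha/(d-k+1)$ at the MSR point and $\beta = B/(dk - \binom{k}{2})$, $\alpha = d\beta$ at the MBR point, and verifies that both meet the cut-set bound with equality. Exact rational arithmetic is used because $\alpha$ and $\beta$ need not be integers for an arbitrary $B$; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb


def msr_point(B: int, k: int, d: int):
    """(alpha, beta) at the minimum-storage end of the tradeoff."""
    alpha = Fraction(B, k)
    return alpha, alpha / (d - k + 1)


def mbr_point(B: int, k: int, d: int):
    """(alpha, beta) at the minimum-bandwidth end of the tradeoff."""
    beta = Fraction(B, d * k - comb(k, 2))
    return d * beta, beta


def cut_set_rhs(k: int, d: int, alpha, beta):
    """sum_{i=0}^{k-1} min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


B, k, d = 9, 3, 4
for name, (alpha, beta) in [("MSR", msr_point(B, k, d)),
                            ("MBR", mbr_point(B, k, d))]:
    assert cut_set_rhs(k, d, alpha, beta) == B      # equality in the bound
    # per-node storage alpha/B and normalized repair bandwidth d*beta/B
    print(name, "alpha =", alpha, "beta =", beta,
          "alpha/B =", alpha / B, "d*beta/B =", d * beta / B)
```

For this parameter set the MSR point stores less per node ($\alpha = 3$ versus $\alpha = 4$) but downloads more during repair ($d\beta = 6$ versus $d\beta = 4$), which is the tradeoff described above.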

Remark 1.

Since a regenerating code can tolerate $(n-k)$ erasures by the data collection property, it follows that the minimum Hamming distance $d_{\min}$ of a regenerating code must satisfy $d_{\min} \ge n-k+1$. By the Singleton bound, the size $M$ of a code of block length $n$ and minimum distance $d_{\min}$ satisfies $M \le Q^{n-d_{\min}+1} \le Q^{k}$, where $Q$ is the size of the alphabet of the code. Since $Q = q^{\alpha}$ in the case of a regenerating code, it follows that the size $M = q^{B}$ of a regenerating code must satisfy $q^{B} \le q^{k\alpha}$, i.e., $B \le k\alpha$. But $B = k\alpha$ in the case of an MSR code and it follows that an MSR code is an MDS code over a vector alphabet. Such codes also go by the name MDS array code.

From a practical perspective, exact-repair regenerating codes are easier to implement as the contents of the nodes in operation do not change with time. Partly for this reason and partly for reasons of tractability, with few exceptions, most constructions of regenerating codes belong to the class of exact-repair regenerating codes. Examples of functional-repair regenerating codes include the construction in [9] as well as the construction in [10].

Early constructions of regenerating codes focused on the two extreme points of the storage-repair bandwidth (S-RB) tradeoff, namely the MSR and MBR points. The various constructions of MBR and MSR codes are described in Sections III, IV. Not surprisingly, given the vast amount of data stored, the storage industry places a premium on low storage overhead. In this connection, we note that the maximum rate of an MBR code is given by:

$$R_{\text{MBR}} = \frac{B}{n\alpha} = \frac{\left(dk - \binom{k}{2}\right)\beta}{n d \beta} = \frac{dk - \binom{k}{2}}{nd},$$

which can be shown to be upper bounded by $\frac{1}{2}$ and is achieved when $k = d = n-1$. In the case of MSR codes, there is no such limitation and MSR codes can have rates approaching $1$.

An RG code is said to be a Help-By-Transfer (HBT) RG code if repair of a failed node can be accomplished without incurring any computation at a helper node. If no computation is required at either helper node or at the replacement node, then the code is termed a Repair-by-Transfer (RBT) RG code. Clearly, an RBT RG code is also an HBT RG code.

III MBR Codes

Remark III.1.

If the $B$ message symbols are drawn randomly with uniform distribution from $\mathbb{F}_q$, it can be shown that in any regenerating code achieving the cut-set bound, the contents of each node correspond to a random variable that is uniform over $\mathbb{F}_q^{\alpha}$. In an MBR code, repair is accomplished by downloading a total of just $d\beta = \alpha$ symbols, which clearly is the minimum possible.

Remark III.2.

Let $\mathcal{C}$ be an MBR code. If $\mathcal{C}$ has the RBT property, it trivially follows that all scalar code-symbols of $\mathcal{C}$ are replicated at least twice. In [11], it is shown that for an MBR code it is not possible to have even a single scalar code-symbol replicated more than twice. Thus the RBT property implies that the collection of $n\alpha$ scalar code-symbols associated with a codeword represents a set of $\frac{n\alpha}{2}$ distinct code symbols, each repeated twice. The converse is not true in general. However when $d = n-1$, it can be shown that the two properties are equivalent.

Remark III.3.

In [12], it is shown that for $d < n-1$, it is not possible to construct an MBR code that has the HBT property.

III-A Polygonal MBR Codes

Fig. 5: An example RBT MBR code for the parameters $\{(n=5, k=3, d=4), (\alpha=4, \beta=1)\}$. Here the file size is $B = 9$.

In the following, we describe with the help of an example, one of the first explicit families of MBR codes [13]. We term these codes polygonal MBR codes. The construction holds for parameters $\{(n, k, d = n-1), (\alpha = n-1, \beta = 1)\}$ and the constructed MBR codes possess the RBT property.

Example 1.

Consider the parameters $\{(n=5, k=3, d=4), (\alpha=4, \beta=1)\}$ and $B = 9$. Thus $B = kd\beta - \binom{k}{2}\beta = 12 - 3 = 9$. First construct a complete graph with $n = 5$ vertices and $\binom{5}{2} = 10$ edges. The nine message symbols are then encoded using a $[10, 9]$ MDS code to produce ten code-symbols. Each code-symbol is then uniquely assigned an edge. Each node of the MBR code stores the code-symbols corresponding to the edges incident on that node (see Fig. 5). The data collection property follows as any collection of $k = 3$ nodes yields nine distinct (MDS) code-symbols. If a node fails, the replacement node can download from each of the remaining four nodes, the code-symbol corresponding to the edge it shares with the failed node. Hence repair is accomplished by merely transferring the data without any computation (RBT).
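The example can be traced in a few lines of code. The sketch below is only an illustration under stated assumptions: it takes the $[10, 9]$ MDS code to be a single-parity-check code over the prime field GF(257) (any $[10, 9]$ MDS code would do; this choice and the field are assumptions made here), and all function names are hypothetical.

```python
from itertools import combinations

P = 257                                   # prime field GF(257), illustrative
n, k = 5, 3
edges = list(combinations(range(n), 2))   # the 10 edges of the complete graph K5


def encode(message):
    """[10, 9] single-parity-check code: the 10 symbols sum to 0 mod P."""
    assert len(message) == 9
    return list(message) + [(-sum(message)) % P]


def node_contents(codeword):
    """Node v stores the code-symbols on the 4 edges incident on v."""
    sym = dict(zip(edges, codeword))
    return {v: {e: sym[e] for e in edges if v in e} for v in range(n)}


def repair(nodes, failed):
    """Repair-by-transfer: each surviving node forwards, unmodified,
    the symbol on the edge it shares with the failed node."""
    return {e: nodes[h][e]
            for h in range(n) if h != failed
            for e in [tuple(sorted((h, failed)))]}


def collect(nodes, chosen):
    """Data collection from k = 3 nodes: 9 of the 10 symbols are seen,
    and the parity check supplies the missing one."""
    known = {}
    for v in chosen:
        known.update(nodes[v])
    missing = [e for e in edges if e not in known]
    assert len(missing) == 1
    known[missing[0]] = (-sum(known.values())) % P
    return [known[e] for e in edges[:9]]


message = [11, 22, 33, 44, 55, 66, 77, 88, 99]
nodes = node_contents(encode(message))
assert repair(nodes, failed=2) == nodes[2]
assert collect(nodes, (0, 3, 4)) == message
```

Three nodes cover all edges except the single edge joining the two nodes that were not contacted, which is why exactly one code-symbol is missing during data collection.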

Remark III.4.

For the general construction, in order to construct an MBR code with $\beta = 1$, one first forms the complete graph on $n$ vertices. Each edge is then mapped to a code-symbol of an $[N, B]$ MDS code, where $N = \binom{n}{2}$ and $B = kd - \binom{k}{2}$ is the file size parameter. An $O(n^2)$ field-size requirement is thus imposed by the underlying scalar MDS code.

III-B Product-Matrix (PM) MBR codes

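A second, general construction for MBR codes is the PM construction [14] which derives its name from the fact that the contents of nodes can be expressed in the form of a product of two matrices. The two matrices are respectively an encoding matrix and a second, message matrix containing the message symbols. This construction yields MBR codes for all feasible parameters $(n, k, d)$, $\beta = 1$, with an $O(n)$ field-size requirement. The encoding matrix is of the form $\Psi = [\Phi \;\; \Delta]$, where $\Phi$, $\Delta$ are $(n \times k)$, $(n \times (d-k))$ matrices respectively. Let the $i$-th row of $\Psi$ be denoted by $\underline{\psi}_i^T$. The sub-matrices $\Phi$ and $\Delta$ are here chosen such that any $d$ rows of $\Psi$ and any $k$ rows of $\Phi$ are linearly independent. The symmetric $(d \times d)$ message matrix $M$ is derived from the $B = kd - \binom{k}{2}$ message symbols as follows:

$$M = \begin{bmatrix} S & V \\ V^T & 0 \end{bmatrix},$$

where $S$ is a symmetric $(k \times k)$ matrix and $V$ is a $(k \times (d-k))$ matrix.

The $i$-th node, under the PM-MBR construction, stores the matrix product $\underline{\psi}_i^T M$. The repair data passed on by helper node $h$ to replacement node $f$ is given by $\underline{\psi}_h^T M \underline{\psi}_f$.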
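A minimal sketch of PM-MBR encoding and repair follows, assuming the standard form of the construction in [14]: a symmetric $d \times d$ message matrix $M = [[S, V], [V^T, 0]]$, node $i$ storing $\underline{\psi}_i^T M$, and helper $h$ sending the single symbol $\underline{\psi}_h^T M \underline{\psi}_f$. The choice of a Vandermonde matrix for $\Psi$, the prime field GF(13), the parameters $(n, k, d) = (6, 3, 4)$ and all function names are assumptions made for illustration. The modular inverse via `pow(x, -1, p)` needs Python 3.8 or later.

```python
from itertools import combinations

P = 13                       # prime field GF(13), illustrative
n, k, d = 6, 3, 4            # beta = 1, alpha = d = 4, B = k*d - k*(k-1)/2 = 9


def solve_mod(A, b, p=P):
    """Solve the square system A x = b over GF(p) by Gauss-Jordan elimination."""
    m = len(A)
    aug = [[a % p for a in row] + [rhs % p] for row, rhs in zip(A, b)]
    for col in range(m):
        piv = next(i for i in range(col, m) if aug[i][col])   # nonzero pivot
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for i in range(m):
            if i != col and aug[i][col]:
                factor = aug[i][col]
                aug[i] = [(a - factor * c) % p for a, c in zip(aug[i], aug[col])]
    return [row[m] for row in aug]


# Vandermonde encoding matrix: any d rows are independent, and so are any
# k rows of its first k columns (the sub-matrix Phi).
PSI = [[pow(x, e, P) for e in range(d)] for x in range(1, n + 1)]


def message_matrix(u):
    """Symmetric d x d matrix [[S, V], [V^T, 0]] filled with the B = 9 symbols."""
    it = iter(u)
    M = [[0] * d for _ in range(d)]
    for i in range(k):
        for j in range(i, k):          # upper triangle of symmetric S
            M[i][j] = M[j][i] = next(it)
    for i in range(k):
        for j in range(k, d):          # V and its transpose
            M[i][j] = M[j][i] = next(it)
    return M


def dot(a, b):
    return sum(x * y for x, y in zip(a, b)) % P


def encode(u):
    """Node i stores the row vector psi_i^T M (alpha = d symbols)."""
    M = message_matrix(u)
    return [[dot(psi, [M[r][c] for r in range(d)]) for c in range(d)] for psi in PSI]


def repair(nodes, failed, helpers):
    """Each helper h sends (psi_h^T M) psi_f; solving Psi_D x = received gives
    x = M psi_f, which equals (psi_f^T M)^T because M is symmetric."""
    received = [dot(nodes[h], PSI[failed]) for h in helpers]
    return solve_mod([PSI[h] for h in helpers], received)


nodes = encode([3, 1, 4, 1, 5, 9, 2, 6, 5])
for helpers in combinations([0, 1, 3, 4, 5], d):
    assert repair(nodes, 2, list(helpers)) == nodes[2]
```

Only $d\beta = 4$ symbols are downloaded to rebuild the $\alpha = 4$ symbols of the failed node, and any $d$ of the surviving nodes can serve as helpers.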

III-C Other Work

In [15], the authors introduce a family of RBT MBR codes for $d = n-1$, that are constructed based on a congruent transformation applied to a skew-symmetric matrix of message symbols. In comparison with the $O(n^2)$ field requirement of polygonal MBR codes, in this construction, a field-size of $O(n)$ suffices. In [16], the authors stay within the PM framework, but provide a different set of encoding matrices for MSR and MBR codes that have least-possible update complexity within the PM framework. The authors of [16] also analyze the codes for their ability to correct errors and provide corresponding decoding algorithms. The paper [12] proves the non-existence of HBT MBR codes with $d < n-1$. The paper also provides PM-based constructions for two relaxations, namely (i) any failed node which is a part of a collection of $k$ systematic nodes can be recovered in HBT fashion from any $d$ other nodes and (ii) for every failed node, there exists a corresponding set of $d$ helper nodes which permit HBT repair. The paper [11] provides binary MBR constructions for the parameters , and studies the existence of MBR codes with inherent double replication, for all parameters. In [17], the authors provide regenerating-code constructions that asymptotically achieve the MSR or MBR point as $k$ increases and these codes can be constructed over any field, provided the file size is large enough. In [18], the authors introduce some extensions to the classical MBR framework by permitting the presence of a certain number of error-prone nodes during repair/reconstruction and by introducing flexibility in choosing the parameter $d$ during node repair.

Open Problems 1.

Determine the smallest possible field size $q$ of an MBR code for given $\{(n, k, d), (\alpha, \beta)\}$.

IV MSR Codes

Among the class of RG codes, MSR codes have received the greatest attention, and the reasons include: the fact that (a) the storage overhead of an MSR code can be made as small as desired, (b) MSR codes are MDS codes and (c) MSR codes have been challenging to construct.

IV-A Introduction

As noted previously, an MSR code with parameters $\{(n, k, d), (\alpha, \beta)\}$ has file size $B = k\alpha$ and $\beta = \frac{\alpha}{d-k+1}$. Although MSR codes are vector MDS codes that have optimum repair bandwidth of $d\beta$ for the repair of any node among the $n$ nodes, there are papers in the literature that refer to a code as an MSR code even if optimal repair holds only for the systematic nodes. In the current paper, we refer to such codes as systematic MSR codes. While only $\beta$ symbols are sent by each of the $d$ helper nodes, the number of symbols accessed by the helper node in order to generate these $\beta$ symbols could be larger than $\beta$. The class of MSR codes that access at each helper node only as many symbols as are transferred are termed optimal-access MSR codes. MSR codes that alter a minimum number of parity symbols while updating a single, systematic symbol are called update-optimal MSR codes.

There are several exact-repair (ER) MSR constructions available in the literature. In [9], Shah et al. show that interference alignment (IA) is necessarily present in every exact-repair MSR code, and use IA techniques to construct systematic MSR codes, known as MISER codes, for $d = n-1 \ge 2k-1$. The IA condition in the context of MSR codes (observed earlier in [19]) demands that the interference components in the data passed by helper nodes must be aligned so that they can be cancelled at the replacement node by data received from the systematic helper nodes. In [20], Suh et al. build on [9] to construct MSR codes for $d \ge 2k-1$ with optimal repair bandwidth for all nodes, under the condition that the helper-node set necessarily includes systematic nodes. In [21], the well-known Product Matrix (PM) framework is introduced to provide MSR constructions for $d \ge 2k-2$, thereby settling the problem of MSR code construction in the low-rate regime, $\frac{k}{n} \le \frac{1}{2}$. While the method adopted in [21] to provide a construction for $d > 2k-2$ is to suitably shorten a code for $d = 2k-2$, an extension of the PM framework that yields constructions for any $d \ge 2k-2$ in a single step is provided in [22]. Apart from a few notable constructions such as the Hadamard-design-based code [23] for $(n-k) = 2$ and its generalization for $(n-k) > 2$ for systematic node-repair, the problem of high-rate constructions (i.e., $\frac{k}{n} > \frac{1}{2}$) for all-node repair remained open. The first major result in this direction is due to Cadambe et al. [24], where the authors apply the notion of symbol extension in interference alignment, in which multiple symbols are grouped together to form a single vector symbol, to jointly achieve interference alignment. The symbol-extension viewpoint is then used to show that ER MSR codes exist for all $(n, k, d)$, as $\alpha$ goes to infinity. The second major development was the zigzag code construction [25, 26], the first non-asymptotic high-rate MSR code construction with $d = n-1$ permitting rates as close to $1$ as desired, with additional desirable properties such as optimal access and optimal update.

Zigzag codes however, require a sub-packetization level $\alpha$ that grows exponentially with $k$ and a very large finite field size, while the earlier PM codes for the low-rate regime have $\alpha = d-k+1$ and a field size that is linear in $n$. In a subsequent work [27], the authors present a systematic MSR construction having and rate . A second systematic MSR code with is presented in [28]. A lower bound on sub-packetization level of a general MSR code is derived in [29]. The same paper shows that $\alpha \ge r^{\frac{k-1}{r}}$, where $r = n-k$, in the case of an optimal-access MSR code. An improved lower bound for general MSR codes

(2)

appears in [30]. These developments made it clear that the ultimate goal in MSR code construction was to construct a high-rate MSR code that simultaneously had low sub-packetization level $\alpha$, low field-size $q$, arbitrary repair degree $d$ and the optimal-access property.

In [31], a parity-check viewpoint is adopted to construct a high-rate MSR code for with a sub-packetization level , requiring however, a large field-size. The construction was extended in [32], to satisfying . In [33], the authors provide a construction of MSR codes that holds for all , but which once again requires a large field size. In [34], the authors provide a construction for an optimal-access systematic MSR code that holds for any parameter set having sub-packetization matching the lower bound given in [29]. In [25, 26, 28, 27, 31, 34, 32, 33], the Combinatorial Nullstellensatz (see [35]) is used to prove the MDS property, due to which the codes are non-explicit and have large field sizes.

In [36], an explicit optimal-access, systematic MSR code is constructed with optimal $\alpha$, but for limited values of $n-k$. In [37], the authors present two different classes of explicit MSR constructions, one of which possesses the optimal-access property. Both constructions are for any $(n, k, d)$ with sub-packetization level growing exponentially in $n$.

In a major advance, in [38], Ye and Barg present an explicit construction of a high-rate, optimal-access MSR code with $\alpha = r^{\lceil n/r \rceil}$, field size no larger than $r\lceil n/r \rceil$, and $d = n-1$, where $r = n-k$. Essentially the same construction was independently rediscovered in [39] from a different coupled-layer perspective, where layers of an arbitrary MDS code are coupled by a simple pairwise coupling transform to yield an MSR code. Just prior to the appearance of these two papers, in an earlier version of [40], the authors show how a systematic MSR code can be converted into an MSR code by increasing the sub-packetization level by a factor of $r$ using a pairwise symbol transformation. This result is then extended in [40], to present a technique that takes an MDS code, increases sub-packetization level by a factor of $r$ and converts it into a code in which the optimal repair of $r$ nodes can be carried out. By applying this transform repeatedly $\lceil n/r \rceil$ times, it is shown that any scalar MDS code can be transformed into an MSR code. It turns out that the three papers [38, 39, 40], either explicitly or implicitly, employed as a key part of the construction, essentially the same pairwise-coupling transform.

Let $r = n-k$. More recently, the lower bound $\alpha \ge \min\{r^{\lceil (n-1)/r \rceil}, r^{k-1}\}$ was derived in [41] for optimal-access MSR codes. The same paper also shows that the sub-packetization level of an MDS code that can optimally repair any $w$ of the $n$ nodes must satisfy $\alpha \ge \min\{r^{\lceil w/r \rceil}, r^{k-1}\}$. These results established that the earlier constructions in [31, 32, 38, 39, 40, 42] were optimal in terms of sub-packetization level $\alpha$. It is also shown in [41], that a vector MDS code that can repair failed nodes belonging to a fixed set of nodes with minimum repair bandwidth and in optimal-access fashion, and having minimum sub-packetization level must necessarily have a coupled-layer structure, similar to that found in [38, 39, 40]. An explicit construction of MSR codes for with achieving the lower bound for was recently provided in [42].

Open Problems 2.

Derive a tight lower bound on the sub-packetization level of MSR codes and provide matching constructions.

Open Problems 3.

Provide constructions of explicit optimal-access MSR codes for any $(n, k, d)$ with optimal sub-packetization.

TABLE I: A list of MSR constructions and the parameters. In the table , and when All Node Repair is No, the constructions are systematic MSR. By ‘non-explicit’ field-size, we mean that the order of the size of the field from which coefficients are picked is not given explicitly.

IV-B Constructions of MSR Codes

Product Matrix Construction [21]:

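We provide a brief description of the PM construction for parameter set $\{(n, k, d = 2k-2), (\alpha = k-1, \beta = 1)\}$. The $B = k\alpha = \alpha(\alpha+1)$ message symbols are arranged in the form of a $(d \times \alpha)$ matrix $M$: $M = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix}$, where the $S_1, S_2$ are $(\alpha \times \alpha)$ symmetric matrices containing the message symbols. Encoding is carried out using an $(n \times d)$ matrix $\Psi = [\Phi \;\; \Lambda\Phi]$, where $\Phi$ is an $(n \times \alpha)$ matrix and $\Lambda$ is an $(n \times n)$ diagonal matrix. Let the $i$-th row of $\Psi$ be $\underline{\psi}_i^T$, the $i$-th row of $\Phi$ be $\underline{\phi}_i^T$ and the $i$-th diagonal element in $\Lambda$ be $\lambda_i$. The $\alpha$ symbols stored in node $i$ are given by:

$$\underline{\psi}_i^T M = \underline{\phi}_i^T S_1 + \lambda_i \underline{\phi}_i^T S_2.$$

The matrix $\Psi$ is required to satisfy the properties: 1) any $d$ rows of $\Psi$ are linearly independent, 2) any $\alpha$ rows of $\Phi$ are linearly independent and 3) the diagonal elements of $\Lambda$ are distinct.

Node Repair: Let $f$ be the index of the failed node, thus the aim is to reconstruct $\underline{\psi}_f^T M$. The $h$-th helper node, $h \in D$, $|D| = d$, passes on the information: $\underline{\psi}_h^T M \underline{\phi}_f$. Upon aggregating the repair information we obtain the vector,

$$\Psi_D M \underline{\phi}_f,$$

where $\Psi_D$ is the $(d \times d)$ sub-matrix of $\Psi$ formed by the rows indexed by $D$.

As any $d$ rows of $\Psi$ are linearly independent, the vector $M\underline{\phi}_f$ can be recovered. From $M\underline{\phi}_f$, we can obtain $S_1\underline{\phi}_f$ and $S_2\underline{\phi}_f$. Since $S_1$ and $S_2$ are symmetric, we can recover the contents $\underline{\phi}_f^T S_1 + \lambda_f \underline{\phi}_f^T S_2$ of the replacement node.

Data Collection: Let $\Psi_{DC} = [\Phi_{DC} \;\; \Lambda_{DC}\Phi_{DC}]$ be the $(k \times d)$ sub-matrix of $\Psi$ corresponding to the $k$ nodes contacted for data collection. We wish to retrieve $M$ from $\Psi_{DC}M$. This can be done in three steps:

  1. First compute $\Psi_{DC} M \Phi_{DC}^T = \Phi_{DC} S_1 \Phi_{DC}^T + \Lambda_{DC}\Phi_{DC} S_2 \Phi_{DC}^T$ and set $P = \Phi_{DC} S_1 \Phi_{DC}^T$, $Q = \Phi_{DC} S_2 \Phi_{DC}^T$.

  2. It is clear that $P, Q$ are symmetric. Thus we know both $P_{ij} + \lambda_i Q_{ij}$ and $P_{ij} + \lambda_j Q_{ij}$. Since $\lambda_i \ne \lambda_j$ for $i \ne j$, we can recover $P_{ij}$ and $Q_{ij}$ for all $i \ne j$.

  3. Since we know $P_{ij}$ for $j \ne i$, we can compute the vector $\underline{\phi}_i^T S_1 [\underline{\phi}_1, \ldots, \underline{\phi}_{i-1}, \underline{\phi}_{i+1}, \ldots, \underline{\phi}_{\alpha+1}]$. Since any $\alpha$ rows of $\Phi$ are linearly independent, we can recover $\underline{\phi}_i^T S_1$. For any set of $\alpha$ distinct elements $\{i_1, \ldots, i_\alpha\}$, we can compute $[\underline{\phi}_{i_1}, \ldots, \underline{\phi}_{i_\alpha}]^T S_1$, from which $S_1$ can be recovered. $S_2$ can be similarly recovered from $Q$. The present description assumes data collection from the first $k$ nodes, while a similar argument holds true for any arbitrary set of $k$ nodes.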

Coupled Layer Code:

We present here the constructions in [38, 39, 40] from a coupled-layer perspective. We explain the construction here only for parameter sets of the form:

$$\{(n = st,\ k = s(t-1),\ d = n-1),\ (\alpha = s^{t},\ \beta = s^{t-1})\},$$

where $t \ge 2$, so that $s = n-k$. (The construction can however, be extended to yield MSR codes for any $(n, k, d = n-1)$ using a technique called shortening). The coupled-layer code can be constructed in two steps: (a) in the first step, we layer $\alpha$ codewords of an $[n, k]$ MDS code to form an uncoupled data-cube, (b) in the second step, the symbols within the uncoupled data-cube are transformed using a pairwise-forward-transform (PFT) to obtain the coupled-layer code. While we discuss only the case when the MDS code employed in the layers is a scalar MDS code, there is a straightforward extension that permits the use of vector MDS codes (see [39]).

Let us first consider the symbols $\{U(x, y; \underline{z})\}$ of an uncoupled code where each code symbol is a vector of $\alpha$ symbols in $\mathbb{F}_q$. These $n\alpha$ symbols can be organized to form a three-dimensional (3D) data cube (see Fig. 6), where $(x, y) \in \mathbb{Z}_s \times \mathbb{Z}_t$ is the node index and where $\underline{z} = (z_0, \ldots, z_{t-1}) \in \mathbb{Z}_s^{t}$ serves to index the contents of a node. For fixed $\underline{z}$, we think of the symbols $\{U(x, y; \underline{z}) \mid (x, y) \in \mathbb{Z}_s \times \mathbb{Z}_t\}$ as forming a plane or a layer and thus the value of $\underline{z}$ may be regarded as identifying a plane or layer. The symbols in each layer of the uncoupled data cube form an $[n, k]$ MDS code.


Fig. 6: Uncoupled data cube for , . The red dots represent plane-index .
Fig. 7: Paired symbols are shown using yellow rectangles connected by dotted lines. Uncoupled symbols are transformed using PFT to get the coupled symbols in the coupled data cube.

Let, be the parity check (p-c) matrix of an arbitrarily chosen scalar MDS code defined over . Let denote the element of lying in the th row, and th column. Then the symbols of the uncoupled code satisfy the p-c equations:

(3)

Next, consider an identical data-cube (see Fig. 7) containing the symbols

corresponding to the coupled-layer code. This data-cube will be referred to as the coupled data cube. The symbols of the coupled data cube are derived from the symbols of the uncoupled data cube as follows. Let be an element in , . Let us define . Each symbol which is such that is paired with a symbol . The values of the symbols so paired, are derived from those of their counterparts in the uncoupled data cube as per the linear transformation given below, termed as the PFT:

(4)

In the case of the symbols when , the relation between symbols in the two data cubes is even simpler and given by: . The pairwise reverse transform (PRT) is simply the inverse of the PFT and is used to obtain the uncoupled symbols from the coupled symbols . The p-c equations satisfied by the coupled-layer code can be derived using the p-c equations (3) satisfied by the symbols in the uncoupled data cube and the PRT :

(5)
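The pairing of PFT and PRT described above can be illustrated with a small sketch. Since the matrix in (4) is not reproduced here, the sketch assumes the 2x2 form commonly used for coupled-layer codes: the PRT multiplies a coupled pair by [[1, g], [g, 1]] and the PFT applies the inverse of that matrix, with g nonzero and g^2 not equal to 1. The prime field GF(7), the value g = 3, and the function names `prt` and `pft` are hypothetical choices made only for illustration.

```python
# Illustrative sketch only: the matrix form, the field GF(7) and GAMMA = 3
# are assumptions, not values taken from the text.
P = 7        # small prime, so arithmetic mod P is a finite field
GAMMA = 3    # must satisfy GAMMA != 0 and GAMMA**2 != 1 (mod P)


def prt(c_a, c_b, gamma=GAMMA, p=P):
    """Pairwise reverse transform: coupled pair -> uncoupled pair.

    Applies the matrix [[1, gamma], [gamma, 1]] over GF(p).
    """
    return ((c_a + gamma * c_b) % p, (gamma * c_a + c_b) % p)


def pft(u_a, u_b, gamma=GAMMA, p=P):
    """Pairwise forward transform: uncoupled pair -> coupled pair.

    Applies the inverse of [[1, gamma], [gamma, 1]], which exists
    because the determinant 1 - gamma**2 is nonzero mod p.
    """
    det = (1 - gamma * gamma) % p
    det_inv = pow(det, p - 2, p)  # Fermat inverse in a prime field
    c_a = det_inv * (u_a - gamma * u_b) % p
    c_b = det_inv * (u_b - gamma * u_a) % p
    return (c_a, c_b)
```

Under these assumptions, applying `prt` after `pft` returns the original uncoupled pair for every pair of field elements, which is the only property of the transform that the derivation of (5) relies on.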

Node Repair: Let be the failed node. To recover the symbols , each of the remaining nodes sends helper information: . Focusing on (5) for such that and retaining on the left side the unknown symbols, leads to equations of the form:

(6)

where is a known value. These equations can be solved for the contents of the replacement node.

Data Collection: Please refer to [39] for the proof of data collection property.

Ye-Barg Codes [37]:

In [37] the authors present two constructions, for non optimal-access MSR and optimal-access MSR codes respectively. These are the only known MSR constructions that are explicit and yield MSR codes for any parameter set . The same codes are also optimal for the repair of multiple nodes. We describe here, for simplicity, the construction of MSR codes having parameters: where , defined over finite field  for . Let be the collection of symbols of a codeword, where is the node index and is the scalar symbol index. The code is defined via the p-c equations given below:

(7)

where the are all distinct, thereby requiring a field size .

Node Repair: Let be the failed node, be the set of helper nodes. The helper information sent by a node is given by: . Next, fixing and summing equations (7) over the values of , we get:

(8)

It can be shown that the collection of symbols form an MDS code. Therefore, all the can be computed from the known values supplied by the helper nodes and the symbols can thus be recovered from (8).

Data Collection: For every , the collection forms an MDS code. Therefore, any erased symbols can be recovered.
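The data-collection argument rests on the defining property of an MDS code: any erasure pattern that leaves at least as many symbols as the code dimension can be corrected. A minimal sketch of this property follows, using a Reed-Solomon code over a prime field as a stand-in MDS code. The field GF(13), the evaluation points 0, 1, ..., n-1, and the function names are assumptions made for illustration; this is not the Ye-Barg construction itself.

```python
# Illustrative MDS example: a Reed-Solomon code over GF(13).
# Field size, evaluation points and names are assumptions for this sketch.
P = 13


def rs_encode(msg, n, p=P):
    """Evaluate the polynomial with coefficient list msg at x = 0..n-1 (n <= p)."""
    return [sum(m * pow(x, i, p) for i, m in enumerate(msg)) % p
            for x in range(n)]


def lagrange_eval(points, x0, p=P):
    """Value at x0 of the unique polynomial of degree < len(points) through points."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x0 - xm) % p
                den = den * (xj - xm) % p
        total = (total + yj * num * pow(den, p - 2, p)) % p
    return total


def recover(received, k, p=P):
    """Fill in erased positions (marked None) from any k surviving symbols."""
    known = [(x, y) for x, y in enumerate(received) if y is not None][:k]
    if len(known) < k:
        raise ValueError("fewer than k symbols survive; recovery is impossible")
    return [y if y is not None else lagrange_eval(known, x, p)
            for x, y in enumerate(received)]
```

For example, with a length-6 code of dimension 4, erasing any two positions of `rs_encode([3, 1, 4, 1], 6)` and calling `recover(..., 4)` returns the full codeword.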

Multiple Node Repair: Let be the number of erasures to be recovered. It was shown in [24] that the minimum repair bandwidth required to repair erasures in an MDS code having sub-packetization level is lower bounded by . Given that is the number of helper nodes that need to be contacted during the repair of nodes, is lower bounded by: . The Ye-Barg code presented above achieves this bound [37]. The node repair discussed here assumes a centralized repair setting whereas an alternate, cooperative repair approach is discussed in Section VI-A.

Adaptive Repair: Adaptive-repair MSR codes are MSR codes that can repair a failed node by connecting to any nodes, for any and can reconstruct the failed node by downloading symbols each from the helper nodes. Constructions of MSR codes with adaptive repair can be found in [33, 37, 43].

V On the Storage-Repair Bandwidth Tradeoff under Exact Repair

We distinguish between the S-RB tradeoffs for exact-repair and functional-repair RG codes by referring to them as the ER and FR tradeoffs respectively. The file size under exact repair cannot exceed that in the FR case since ER may be regarded as a special instance of FR. However, unlike in the case of functional-repair codes, the data collection problem in the exact-repair setting cannot be identified with a multicast problem, simply because each replacement node for a failed node acts as a sink for a different set of data. Thus it is not clear that the cut-set bound for FR can be achieved under ER, leaving the door open for an S-RB tradeoff in the case of ER that lies strictly above and to the right of the FR tradeoff in the -plane. There do exist constructions of exact-repair MBR and MSR codes meeting the cut-set bound with equality, showing that the ER tradeoff coincides with the FR tradeoff at the extreme MSR and MBR points.
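To make the two extreme points concrete, the following sketch evaluates the cut-set bound in its standard form, namely that the file size is at most the sum over i = 0, ..., k-1 of min(alpha, (d - i) beta). That form is assumed here, since equation (1) lies outside this section, and the function names are hypothetical. The MSR and MBR points are obtained by setting alpha = (d - k + 1) beta and alpha = d beta respectively.

```python
from fractions import Fraction

# Sketch under the assumption that the cut-set bound (1) has its standard form
#   B <= sum_{i=0}^{k-1} min(alpha, (d - i) * beta).


def fr_file_size(k, d, alpha, beta):
    """Largest file size permitted by the cut-set bound."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


def msr_point(k, d, beta=1):
    """Minimum-storage point: alpha = (d - k + 1) * beta, file size k * alpha."""
    alpha = (d - k + 1) * beta
    return alpha, beta, fr_file_size(k, d, alpha, beta)


def mbr_point(k, d, beta=1):
    """Minimum-bandwidth point: alpha = d * beta."""
    alpha = d * beta
    return alpha, beta, fr_file_size(k, d, alpha, beta)


def normalized(alpha, beta, file_size):
    """Storage and repair bandwidth per helper, normalized by file size."""
    return Fraction(alpha, file_size), Fraction(beta, file_size)
```

For k = 3 and d = 3, `normalized(*msr_point(3, 3))` gives (1/3, 1/3) and `normalized(*mbr_point(3, 3))` gives (1/2, 1/6); these are the two endpoints at which the ER and FR tradeoffs coincide.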

V-A The Non-existence of ER Codes Achieving the FR Tradeoff

The first major result on the ER tradeoff was that of [44], showing that apart from the MBR point and a small region adjacent to the MSR point, there do not exist ER codes whose values correspond to an interior point of the FR tradeoff. We set to be the value of at the MSR point.

Theorem V.1.

For any given values of , ER codes having parameters corresponding to an interior point on the FR tradeoff do not exist, except possibly for in the range

(9)

corresponding to a small region in the neighborhood of the MSR point.

Fig. 8: The repair matrix
Proof.

(Sketch) By restricting attention to any symbols of an RG code having parameter set one obtains a second RG code with parameter set in which all the remaining nodes participate in the repair of a failed node. This simplifies the analysis of the repair setting and with this in mind, in the proof, we set . When the message vector is picked uniformly at random, we have associated nodal random variables and repair data variables , where denotes the data passed from node to replacement node . The repair matrix (see Fig. 8) is an matrix whose th entry , is . The diagonal elements of do not figure in the discussion and may be set equal to . Given subsets , we set , . We introduce the index sets , and for . The file size can be expressed in terms of the joint entropy of the node and repair-data variables (with logs computed to base ):

(10)
(11)
(12)

The cut-set bound in (1) corresponds to the inequalities: . For the bound to hold with equality, the joint random variables and must have maximum entropy. However it can be shown that the entropy of a row in the repair matrix is limited by if the cut-set bound holds with equality. This leads to a contradiction, concluding the proof. ∎

Theorem V.1 does not, however, rule out the possibility of an ER code whose tradeoff approaches the FR tradeoff asymptotically, i.e., as the file size .

V-B The S-RB Tradeoff for

It is possible that the entropies of the random variables involved satisfy Shannon inequalities other than the ones we have noted and which shed light on the ER tradeoff. For the particular case , Tian [45] was able to identify such an inequality with the help of a modified version of the Information Theory Inequality Prover (ITIP) [46, 47].

Fig. 9: The (4,3,3) normalized tradeoff.

Let , represent the normalization of and with respect to file size . A point is said to be achievable if for any , there exists an ER-RG code whose is -close to . The normalized tradeoff, i.e., the tradeoff expressed in terms of and allows comparison of codes across file sizes . In the limit as , the S-RB tradeoff becomes a smooth curve. Let , be RG codes over having respective parameter sets and . Consider a codeword array obtained by vertically stacking codeword arrays of and codeword arrays of . The code comprising all such arrays is said to be the space-shared code of and . Then is also an RG code with parameter set . The notion of space-sharing clearly extends to multiple codes.
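The effect of space-sharing on the normalized operating point can be sketched as follows. The sketch assumes, as is standard for vertical stacking, that the per-node storage, the per-helper repair bandwidth and the file size of the component codes simply add up; the tuple representation (alpha, beta, file size) and the function names are hypothetical.

```python
from fractions import Fraction

# Sketch: a code is summarized by the tuple (alpha, beta, file_size).
# Additivity of the three parameters under stacking is the assumption here.


def space_share(code1, code2, copies1, copies2):
    """Parameters of the code formed by stacking copies1 arrays of code1
    on top of copies2 arrays of code2."""
    a1, b1, f1 = code1
    a2, b2, f2 = code2
    return (copies1 * a1 + copies2 * a2,
            copies1 * b1 + copies2 * b2,
            copies1 * f1 + copies2 * f2)


def normalized(code):
    """Normalized operating point (alpha / file_size, beta / file_size)."""
    alpha, beta, file_size = code
    return Fraction(alpha, file_size), Fraction(beta, file_size)
```

As a usage example, stacking two copies of a code with parameters (1, 1, 3) on one copy of a code with parameters (3, 1, 6) gives (5, 3, 12), whose normalized point (5/12, 1/4) lies on the line segment joining the normalized points (1/3, 1/3) and (1/2, 1/6) of the two components. Varying the numbers of copies sweeps out that segment, which is why space-sharing achieves every point between two achievable points.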

Theorem V.2.

For , the achievable region is given by

(13)
Proof.

Of the four inequalities listed, the first three follow from the entropy constraints listed in (12) above. The last inequality does not follow from (12), and was found in [45] using ITIP. It remains to construct codes that operate at points on the -plane satisfying the inequalities with equality. A single parity-check code serves as an MSR code for . An MBR code can be constructed using the polygonal construction described in Sec. III. A hand-crafted code operating at the interior point of deflection (see Fig. 9) is given in [45]. Every point on the lines determined by equality in (13) is achieved by a code obtained by space-sharing among and . ∎

V-C Layered Codes for Interior Points

Fig. 10: An canonical layered code.

A simple code-construction technique based on the layering (see Fig. 10 for an example) of MDS codes turns out to provide codes that perform well with respect to file size in the interior region of the S-RB tradeoff. Let be an MDS code having parameters . Let be such that and . Let denote an ordering of the collection of all possible subsets of . Let , be message vectors, not necessarily distinct, and be the codeword in associated with . We create an array in which we place the symbols of codeword in the location specified by subset . It turns out that this array represents an array code which possesses the data collection property of an RG code, but not the repair property. By replicating the array a certain number of times, it turns out that one obtains a regenerating code with parameters , operating between the MSR and MBR points. Further details can be found in [48]. We will refer to this code as the canonical layered code . The canonical layered-code construction has been extended to construct codes with by making use of an outer code designed using linearized polynomials. An alternate generalization of the canonical code to the case of involved adding additional layers consisting of carefully designed parity symbols. Such an approach leads to the improved layered codes in [49], that turn out to be optimal for the set of parameters .

V-D ER Tradeoff Strictly Away from FR Tradeoff for all

In [50], it was shown that the ER tradeoff cannot approach the FR tradeoff even when for any value of . This was established by deriving a positive lower bound on the gap between the ER and FR tradeoffs.

Theorem V.3.

The ER tradeoff between and for any exact-repair regenerating code, with is strictly separated from the FR tradeoff, apart from the MSR and MBR endpoints as well as the region surrounding the MSR point appearing in (9).

The proof of the theorem involves identifying contradicting bounds on the entropy of various trapezoidal-shaped subsets within the repair matrix. Subsequent papers [51],[52] derive better bounds, thereby improving the gap to go beyond . In [53], the authors adopt a different approach by first providing three different expressions for the entropy of the data file involving mutual information between various repair-data variables, and taking a linear combination of these expressions that leads to a significantly tighter bound on :

(14)

The authors in [54] improve upon the result in (14) using repair-matrix techniques, in combination with the bound in Thm. V.3, leading to the best-known outer bound on the ER tradeoff. For the case of , the outer bound is achieved by the improved layered codes, thus characterizing the ER tradeoff. The bound also characterizes certain interior points when [50].

V-E Determinant Codes for Interior Points

The construction given in [55] has parameters , and file size , where