Erasure Codes for Distributed Storage: Tight Bounds and Matching Constructions

06/12/2018 ∙ by S. B. Balaji, et al.

This thesis makes several significant contributions to the theory of both Regenerating (RG) and Locally Recoverable (LR) codes. The two principal contributions are the characterization of the optimal rate of an LR code designed to recover from t erased symbols sequentially, for any t, and the development of a tight bound on the sub-packetization level (length of a vector code symbol) of a sub-class of RG codes called optimal-access RG codes. There are, however, several other notable contributions as well, such as the derivation of the tightest-known bounds on performance metrics such as the minimum distance and rate of a sub-class of LR codes known as availability codes. The thesis also presents some low-field-size constructions of Maximal Recoverable codes.

1 The Distributed Storage Setting

In a distributed storage system, data pertaining to a single file is spatially distributed across nodes or storage units (see Fig. 4). Each node stores a large amount of data, running into the terabytes or more. A node could be in need of repair for several reasons, including (i) failure of the node, (ii) the node undergoing maintenance, or (iii) the node being busy serving other demands on its data. For simplicity, we will refer to any one of these events causing non-availability of a node as node failure. It is assumed throughout that node failures take place independently.

Figure 4: In the figure on the left, the nodes in red are the nodes in a distributed storage system, that store data pertaining to a given data file. The figure on the right shows an instance of node failure (the node in yellow), with repair being accomplished by having the remaining red (helper) nodes pass on data to the replacement node to enable node repair.

In [2] and [5], the authors study the Facebook warehouse cluster and analyze the frequency of node failures as well as the resultant network traffic relating to node repair. It was observed in [2] that a median of nodes are unavailable per day and that a median of TB of cross-rack traffic is generated as a result of node unavailability (see Fig. 5).

Figure 5: The plot on the left shows the number of machines unavailable for more than minutes in a day, over a period of days. Thus, a median of in excess of machines become unavailable per day [2]. The plot on the right is of the cross rack traffic generated as well as the number of Hadoop Distributed File System (HDFS) blocks reconstructed as a result of unavailable nodes. The plot shows a median of TB of cross-rack traffic generated as a result of node unavailability [2].

Thus there is significant practical interest in the design of erasure-coding techniques that offer low storage overhead and can also be repaired efficiently. This is particularly the case given the large amounts of data, running into the tens or hundreds of petabytes, that are stored in modern-day data centers (see Fig. 6).

Figure 6: An image of a Google data center.

2 Different Approaches to Node Repair

The flowchart in Fig. 7 provides a detailed overview of the different approaches taken by coding theorists to efficiently handle the problem of node repair, and of the numerous subclasses of codes that they have given rise to. In the description below, we explain the organization presented in the flowchart. The boxes outlined in red in the flowchart are the topics to which this thesis has made contributions. These topics are revisited in detail in subsequent sections of the chapter.

Figure 7: Flowchart depicting the various techniques used to handle node repair. The current thesis has contributions in the topics corresponding to the boxes highlighted in red.

Drawbacks of Conventional Repair

The conventional repair of an $[n, k]$ Reed-Solomon (RS) code, where $n$ denotes the block length of the code and $k$ the dimension, is inefficient in that the repair of a single node calls for contacting $k$ other (helper) nodes and downloading $k$ times the amount of data stored in the failed node. This is inefficient in two respects. Firstly, the amount of data downloaded to repair a failed node, termed the repair bandwidth, is $k$ times the amount stored in the replacement node. Secondly, to repair a failed node, one needs to contact $k$ helper nodes. The number of helper nodes contacted is termed the repair degree. Thus in the case of the $[14, 10]$ RS code employed in Facebook, the repair degree is $10$ and the repair bandwidth is $10$ times the amount of data stored in the replacement node, which is clearly inefficient.
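The two costs named above can be tabulated with a few lines of Python. This is only a sketch of the accounting for conventional repair of an MDS code; the function name and the illustrative parameters are our own assumptions and are not taken from any cited system.

```python
from fractions import Fraction


def conventional_rs_repair(n: int, k: int):
    """Cost of repairing one node of an [n, k] RS code by conventional repair.

    Returns (repair_degree, bandwidth_multiple, storage_overhead), where
    bandwidth_multiple is the download measured in units of one node's contents.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    repair_degree = k        # k surviving nodes are contacted
    bandwidth_multiple = k   # each helper ships its entire contents
    storage_overhead = Fraction(n, k)
    return repair_degree, bandwidth_multiple, storage_overhead


# Hypothetical parameters, chosen only for illustration.
print(conventional_rs_repair(9, 6))  # (6, 6, Fraction(3, 2))
```

The storage overhead is returned alongside the repair costs because the codes discussed below trade one against the other.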

The Different Approaches to Efficient Node Repair

Coding theorists have responded to this need by coming up with two new classes of codes, namely ReGenerating (RG) codes [6, 7] and Locally Recoverable (LR) codes [8]. The focus in an RG code is on minimizing the repair bandwidth, while LR codes seek to minimize the repair degree. In a different direction, coding theorists have also re-examined the problem of node repair in RS codes and have come up with new and more efficient repair techniques [9]. An alternative information-theoretic approach which permits lazy repair, i.e., which does not require a failed node to be immediately restored, can be found in [10].

Different Classes of RG Codes

Regenerating codes are subject to a tradeoff, termed the storage-repair bandwidth (S-RB) tradeoff, between the storage overhead of the code and the normalized repair bandwidth (repair bandwidth normalized by the file size). This tradeoff is derived using principles of network coding. Any code operating on the tradeoff is optimal with respect to file size. At the two extreme ends of the tradeoff are codes termed minimum storage regenerating (MSR) codes and minimum bandwidth regenerating (MBR) codes. MSR codes are of particular interest as these codes are Maximum Distance Separable (MDS), meaning that they offer the least amount of storage overhead for a given level of reliability, and can therefore operate at low storage overhead. We will refer to codes corresponding to interior points of the S-RB tradeoff as interior-point RG codes. The precise tradeoff in the interior is unknown; it is thus an open problem to determine the true tradeoff as well as to provide constructions that are optimal with respect to this tradeoff. Details pertaining to the S-RB tradeoff can be found in [11, 12, 13, 14].
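To make the two extreme points concrete, the sketch below evaluates the closed-form MSR and MBR operating points that are commonly quoted in the regenerating-codes literature, and checks them against the cut-set bound on file size. The symbols are our own labels and do not appear in the text above: B is the file size, k the number of nodes needed to recover the file, d the number of helper nodes, alpha the amount stored per node, and beta the amount downloaded from each helper.

```python
from fractions import Fraction


def msr_point(B: int, k: int, d: int):
    """(alpha, beta) at the minimum-storage end of the tradeoff."""
    alpha = Fraction(B, k)
    beta = Fraction(B, k * (d - k + 1))
    return alpha, beta


def mbr_point(B: int, k: int, d: int):
    """(alpha, beta) at the minimum-bandwidth end of the tradeoff."""
    beta = Fraction(2 * B, k * (2 * d - k + 1))
    alpha = d * beta  # a node stores exactly what it downloads during repair
    return alpha, beta


def cut_set_file_size(alpha, beta, k: int, d: int):
    """Cut-set bound on file size: sum over i = 0..k-1 of min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


B, k, d = 12, 3, 4  # illustrative parameters only
for name, (alpha, beta) in (("MSR", msr_point(B, k, d)), ("MBR", mbr_point(B, k, d))):
    print(name, "alpha =", alpha, "repair bandwidth =", d * beta,
          "file size bound =", cut_set_file_size(alpha, beta, k, d))
```

Both points meet the bound with equality, which is the sense in which a code operating on the tradeoff is optimal with respect to file size.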

Variations on the Theme of RG Codes

The theory of regenerating codes has been extended in several other directions. Secure RG codes (see [15]) are RG codes which offer some degree of protection against a passive or active eavesdropper. Fractional Repair (FR) codes (see [16]) are codes which give up on some requirements of an RG code and in exchange provide the convenience of being able to repair a failed node simply by transferring data (without need for computation at either end) between helper and replacement node. Cooperative RG codes (see [17, 18, 19]) are RG codes which consider the simultaneous repair of several failed nodes and show that there is an advantage to be gained by repairing the failed nodes collectively as opposed to in a one-by-one fashion.

MDS codes with Efficient Repair

There has also been interest in designing other classes of Maximum Distance Separable (MDS) codes that can be repaired efficiently. Under the Piggyback Framework (see [20]), it is shown how one can take a collection of MDS codewords and couple the contents of the different layers so as to reduce the repair bandwidth per codeword. RG codes are codes over a vector alphabet, and the length $\alpha$ of a vector code symbol is referred to as the sub-packetization level of the code. It turns out that in an RG code, as the storage overhead gets closer to $1$, the sub-packetization level $\alpha$ rises very quickly. $\epsilon$-MSR codes (see [21]) are codes which, in exchange for a multiplicative factor $(1+\epsilon)$ increase in repair bandwidth over that required by an MSR code, are able to keep the sub-packetization to a very small level.

Locally Recoverable Codes

Locally recoverable codes (see [22, 23, 24, 8, 25]) are codes that seek to lower the repair degree. This is accomplished by constructing the erasure code in such a manner that each code symbol is protected by a single-parity-check (spc) code of smaller blocklength, embedded within the code. Each such spc code is termed a local code. Node repair is accomplished by calling upon the short-blocklength code, thereby reducing the repair degree. The coding scheme used in Windows Azure is an example of an LR code. The early focus on the topic of LR codes was on the single-erasure case. Within the class of single-erasure LR codes is the subclass of Maximal Recoverable (MR) codes. An MR code is capable of recovering from any erasure pattern that is not precluded by the locality constraints imposed on the code.
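The role of the spc local codes can be illustrated with XOR parities. The sketch below is a toy, not the Windows Azure scheme: it only builds the local codes (a practical LR code would also carry global parities), and the helper names and the group size `r` are our own assumptions.

```python
from functools import reduce
from operator import xor


def encode_local_parities(data, r):
    """Split data into groups of r symbols and append one XOR parity per group."""
    groups = [data[i:i + r] for i in range(0, len(data), r)]
    return [group + [reduce(xor, group)] for group in groups]


def repair_symbol(local_codeword, erased_pos):
    """Recover one erased symbol of an spc codeword from the other symbols."""
    helpers = [s for j, s in enumerate(local_codeword) if j != erased_pos]
    return reduce(xor, helpers)


data = [0x3A, 0x7F, 0x10, 0xC4, 0x55, 0x01]
local_codes = encode_local_parities(data, r=3)

# Repair symbol 1 of the first local code by contacting only r = 3 helpers.
print(repair_symbol(local_codes[0], erased_pos=1) == 0x7F)  # True
```

The repair touches only the symbols of one local code, which is why the repair degree is governed by the local blocklength and not by the dimension of the overall code.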

LR Codes for Multiple-Erasures

More recent work in the literature has been directed towards the repair of multiple erasures. Several approaches have been put forward for multiple-erasure recovery. The approach via $(r, \delta)$ codes (see [26, 27]) is simply to replace the spc local codes with codes that have larger minimum distance. Hierarchical codes are codes which offer different tiers of locality. The local codes of smallest blocklength offer protection against single erasures. Those with the next higher level of blocklength offer protection against a larger number of erasures, and so on.

Codes with Sequential and Parallel Recovery

Among codes that handle multiple erasures using local codes, the class that is most efficient in terms of storage overhead is the class of codes with sequential recovery (for details on sequential recovery, please see [28, 29, 30, 31, 32]). As the name suggests, in this class of codes, for any given pattern of $t$ erasures, there is an order under which recovery from these erasures is possible by contacting at most $r$ code symbols for the recovery of each erasure. Parallel recovery places a more stringent constraint, namely that one should be able to recover from any pattern of $t$ erasures in parallel.
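The existence of a recovery order can be tested with a simple peeling procedure: repeatedly look for a local parity check that contains exactly one erased symbol. The sketch below is a minimal illustration under our own assumptions. It searches only the parity checks supplied in `checks` (not every codeword of the dual code), so a result of `None` means that no order was found using those checks alone.

```python
def sequential_recovery_order(checks, erased):
    """Return an order in which erased symbols can be repaired one by one, or None.

    checks: list of sets, each the support of one local parity check.
    erased: set of erased coordinate indices.
    """
    remaining = set(erased)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        for support in checks:
            hit = remaining & set(support)
            if len(hit) == 1:       # exactly one unknown in this check
                (i,) = hit
                order.append(i)     # symbol i is repaired from the rest of the check
                remaining.remove(i)
                progress = True
                break
    return order if not remaining else None


checks = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]  # hypothetical local checks
print(sequential_recovery_order(checks, {0, 1}))        # [0, 1]
print(sequential_recovery_order([{0, 1, 2}], {0, 1}))   # None
```

In the first call, symbol 1 lies only in a check that also contains the erased symbol 0, so it cannot be repaired in parallel with symbol 0 from these checks; it becomes repairable once symbol 0 has been restored, which is exactly the relaxation that sequential recovery permits.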

Availability Codes

Availability codes (see [33, 34, 4, 35]) require the presence of $t$ disjoint repair groups for each code symbol, where each repair group contains at most $r$ code symbols and is capable of repairing the erased symbol on its own. This property allows the recreation of a single erased symbol in $t$ different ways, each calling upon a disjoint set of helper nodes. It thereby allows simultaneous demands for the content of a single node to be met, hence the name availability code. In the class of codes with cooperative recovery (see [36]), the focus is on the recovery of multiple erasures at the same time, while keeping the average number of helper nodes contacted per erased symbol to a small value.
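A small example with two disjoint repair groups per symbol is the two-dimensional product of single-parity-check codes, in which every symbol can be rebuilt either from the rest of its row or from the rest of its column. The sketch below uses this product code purely as an illustration of the availability property; it is our own choice of example and not a construction taken from the cited references.

```python
from functools import reduce
from operator import xor


def encode_product_spc(data_rows):
    """Append an XOR parity to every row, then a parity row over the columns."""
    rows = [row + [reduce(xor, row)] for row in data_rows]
    parity_row = [reduce(xor, col) for col in zip(*rows)]
    return rows + [parity_row]


def repair_groups(n_rows, n_cols, i, j):
    """Two disjoint repair groups for position (i, j): its row and its column."""
    row_group = [(i, c) for c in range(n_cols) if c != j]
    col_group = [(r, j) for r in range(n_rows) if r != i]
    return row_group, col_group


def repair_from(codeword, group):
    """XOR of the symbols in one repair group."""
    return reduce(xor, (codeword[r][c] for r, c in group))


cw = encode_product_spc([[1, 2, 3], [4, 5, 6]])
row_group, col_group = repair_groups(len(cw), len(cw[0]), 0, 1)
print(repair_from(cw, row_group), repair_from(cw, col_group))  # 2 2
```

Because the row group and the column group share no helper, two simultaneous requests for the same symbol can be served by different sets of nodes.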

Locally Regenerating (LRG) Codes

Locally regenerating codes (see [37]) are codes in which each local code is itself an RG code. Thus this class of codes incorporates into a single code, the desirable features of both RG and LR codes, namely both low repair bandwidth and low repair degree.

Efficient Repair of RS Codes

In a different direction, researchers have come up with alternative means of repairing RS codes ([38, 9]). These approaches view an RS code over an extension field as a vector code over a subfield, with sub-packetization level equal to the degree of the extension, and use this perspective to provide alternative, improved approaches to the repair of an RS code.

Liquid Storage Codes

These codes (see [10]) are constructed in line with an information-theoretic approach which permits lazy repair, i.e., which does not require a failed node to be immediately restored.

3 Literature Survey

3.1 Locally Recoverable (LR) codes for Single Erasure

In [22], the authors consider designing codes such that both the code itself and the codes of short block length derived from it through puncturing operations have good minimum distance. The need for such codes arises in coding for memory, where one sometimes wants to read or write only parts of memory. These punctured codes are what would today be regarded as local codes. The authors derive an upper bound on the minimum distance of such codes under the constraint that the code symbols of any two distinct local codes form disjoint sets, and provide a simple parity-splitting construction that achieves the upper bound. Note that this upper bound on minimum distance holds without any constraint on field size and is achieved for a restricted set of parameters by the parity-splitting construction, which has field size of . In [23], the authors note that when a single code symbol is erased in an MDS code, $k$ code symbols need to be contacted to recover the erased code symbol, where $k$ is the dimension of the MDS code. This led them to design codes called Pyramid Codes, which are derived very simply from the systematic generator matrix of an MDS code and which reduce the number of code symbols that need to be contacted to recover an erased code symbol. In [24], the authors recognize the need to recover a set of erased code symbols by contacting a small set of remaining code symbols and provide a code construction for this purpose based on linearized polynomials. In [8], the authors introduce the class of LR codes in full generality and present an upper bound on minimum distance without any constraint on field size. This paper, along with the paper [39] (sharing a common subset of authors) which presented the practical application of LR codes in Windows Azure storage, is to a large extent responsible for drawing the attention of coding theorists to this class of codes.
The extension to the non-linear case appears in [25], [40]. All of these papers were primarily concerned with local recoverability in the case of a single erasure, i.e., recovering an erased code symbol by contacting a small set of code symbols. More recent research has focused on the multiple-erasure case, and multiple erasures are treated in subsequent chapters of this thesis. For a detailed survey of alphabet-size dependent bounds for LR codes and constructions of LR codes with small alphabet size, please refer to Chapter 2. A tabular listing of some constructions of Maximal Recoverable or partial-MDS codes appears in Table 5 in Chapter 7.

3.2 Codes with Sequential Recovery

The sequential approach to recovery from erasures, introduced by Prakash et al. [28], is one of several approaches to local recovery from multiple erasures, as discussed later in this thesis (Section 10). As indicated in Fig. 15, Codes with Parallel Recovery and Availability Codes can be regarded as sub-classes of Codes with Sequential Recovery (S-LR codes). Among the class of codes which contact at most $r$ other code symbols for recovery from each of the $t$ erasures, codes employing this approach (see [28, 36, 3, 41, 29, 42, 30, 31, 32]) have improved rate simply because sequential recovery imposes the least stringent constraint on the LR code.

Two Erasures

Codes with sequential recovery (S-LR codes) from two erasures ($t = 2$) are considered in [28] (see also [3]), where a tight upper bound on the rate and a matching construction achieving the upper bound are provided. A lower bound on block length and a construction achieving this lower bound are provided in [3].

Three Erasures

Codes with sequential recovery from three erasures ($t = 3$) are discussed in [3, 30]. A lower bound on block length as well as a construction achieving this lower bound appear in [3].

More Than Three Erasures

A general construction of S-LR codes for any appears in [41, 30]. Based on the tight upper bound on code rate presented in Chapter 3, it can be seen that the constructions provided in [41, 30] do not achieve the maximum possible rate of an S-LR code. In [36], the authors provide a construction of S-LR codes for any with rate . Again, the upper bound on rate presented in Chapter 3 shows that is not the maximum possible rate of an S-LR code. In this thesis, we observe that the rate of the construction given in [36] is actually which equals the upper bound on rate derived here only for two cases: case (i) for and case (ii) for and exactly corresponding to those cases where a Moore graph of degree and girth exists. In all other cases, the construction given in [36] does not achieve the maximum possible rate of an S-LR code.

3.3 Codes with Availability

The problem of designing codes with availability in the context of LR codes was introduced in [33]. High-rate constructions of availability codes appeared in [34], [43], [44], [45]. Constructions of availability codes with large minimum distance appeared in [46], [4, 47, 48], [49]. For more details on constructions of availability codes, please see Chapter 5. Upper bounds on the minimum distance and rate of an availability code appeared in [33], [4], [43], [48], [50]. For exact expressions for the upper bounds on minimum distance and rate which have appeared in the literature, please refer to Chapter 5.

3.4 Regenerating codes

In the following, we focus only on the sub-packetization level of regenerating codes, as this is the only aspect of regenerating codes that this thesis addresses. An open problem in the literature on regenerating codes is that of determining the smallest value of the sub-packetization level of an optimal-access (equivalently, help-by-transfer) MSR code, given the code parameters. This question is addressed in [51], where a lower bound on the sub-packetization level is given for the case of a regenerating code that is MDS and where only the systematic nodes are repaired in a help-by-transfer fashion with minimum repair bandwidth. In the literature these codes are often referred to as optimal-access MSR codes with systematic node repair. The authors of [51] establish that:

in the case of an optimal-access MSR code with systematic node repair. In a slightly different direction, lower bounds are established in [52] on the sub-packetization level of a general MSR code that does not necessarily possess the help-by-transfer repair property. In [52] it is established that:

while more recently, in [53] the authors prove that:

A brief survey of regenerating codes, and in particular of MSR codes, appears in Chapter 6.

4 Codes in Practice

The explosion in the amount of storage required and the high cost of building and maintaining a data center have led the storage industry to replace the widely prevalent replication of data with erasure codes, primarily the RS code (see Fig. 8). For example, the new release Hadoop 3.0 of the Hadoop Distributed File System (HDFS) incorporates HDFS-EC (for HDFS Erasure Coding), which makes provision for employing RS codes in an HDFS system.

Figure 8: Some examples of the RS code employed in industry. (taken from Hoang Dau, Iwan Duursma, Mao Kiah and Olgica Milenkovic, “Optimal repair schemes for Reed-Solomon codes with single and multiple erasures,” 2017 Information Theory and Applications Workshop, San Diego, Feb 12-17.)

However, the use of traditional erasure codes results in a repair overhead, measured in terms of additional repair traffic, which leads to larger repair times and the tying up of nodes in non-productive, node-repair-related activities. This motivated the academic and industrial-research community to explore approaches to erasure code construction which were more efficient in terms of node repair, and many of these approaches were discussed in the preceding section. An excellent example of research in this direction is the development of the theory of LR codes and their immediate deployment in data storage in the form of the Windows Azure system.

LR Codes in Windows Azure:

In [39], the authors compare performance-evaluation results of an LR code with those of an RS code in an Azure production cluster and demonstrate the repair savings of the LR code. Subsequently the authors implemented an LR code in Windows Azure Storage and showed that this code has repair degree comparable to that of an RS code, but has storage overhead versus in the case of the RS code (see Fig. 9 and Fig. 10). This LR code is currently in use and has reportedly resulted in savings of millions of dollars for Microsoft [54].

Figure 9: An RS code having a repair degree of and a storage overhead of .

Figure 10: The LR code employed in Windows Azure. This code has repair degree , which is only slightly larger than the repair degree of the RS code in Fig. 9. However, the storage overhead of this code at , is much smaller than the comparable value in the case of the RS code.

A second popular distributed storage system is Ceph, which currently has an LR code plug-in [55]. Some other examples of work directed towards practical applications are described below. Most of this work was carried out by academic groups, was presented at major storage-industry conferences, and involves performance evaluation through emulation of the codes in a real-world setting.

  1. In [5], the authors implement HDFS-Xorbas. This system employs LR codes in place of RS codes in HDFS-RAID. The experimental evaluation of Xorbas was carried out on Amazon EC2 and on a cluster at Facebook, and the repair performance of the LR code was compared against that of an RS code.

  2. A method, termed piggybacking, of layering several RS codewords and then coupling code symbols across layers is shown in [20] to yield a code over a vector alphabet that has reduced repair bandwidth, without giving up on the MDS property of an RS code. A practical realization of this method is the Hitchhiker erasure-coded system [56]. Hitchhiker was implemented in HDFS and its performance was evaluated on a data-warehouse cluster at Facebook.

  3. The HDFS implementation of a class of codes known as HashTag codes is discussed in [57] (see also [58]). These are codes designed to efficiently repair systematic nodes and have a lower sub-packetization level in comparison to an RG code at the expense of a larger repair bandwidth.

  4. The NCCloud [59] is an early work that dealt with the practical performance evaluation of regenerating codes and employs a class of MSR code known as functional-MSR code having parities.

  5. In [60], the performance of an MBR code known as the pentagon code as well as an LRG code known as the heptagon local code are studied and compared against double and triple replication. These codes possess inherent double replication of symbols as part of the construction.

  6. The product matrix (PM) code construction technique yields a general construction of MSR and MBR codes. The PM MSR codes have storage overhead that is approximately lower bounded by a factor of . The performance evaluation of an optimal-access version of a rate PM code, built on top of Amazon EC2 instances, is presented in [61].

  7. A high-rate MSR code known as the Butterfly code is implemented and evaluated in both Ceph and HDFS in [62]. This code is a simplified version of the MSR codes with two parities introduced in [63].

  8. In [64], the authors evaluate the performance, in a Ceph environment, of an MSR code known as the Clay code, which corresponds to the Ye-Barg code in [65] (and which was independently rediscovered afterwards in [66]). The code is implemented in [64] from the coupled-layer perspective presented in [66]. This code is simultaneously optimal in terms of storage overhead and repair bandwidth (as it is an MSR code), and also has the optimal-access (OA) property and the smallest possible sub-packetization level of an OA MSR code. The experimental performance of the Clay code is shown to match its theoretical performance.

5 Contributions and Organization of the Thesis

The highlighted boxes appearing in the flow chart in Fig. 11 represent topics with respect to which this thesis has made a contribution.

Figure 11: The three flowcharts shown here are extracted from the flowchart in Fig. 7. The highlighted boxes indicate topics to which this thesis has contributed.

We now proceed to describe chapter wise, our contributions corresponding to topics in the highlighted boxes. An overview of the contributions appears in Fig. 12.

Figure 12: A chapter-wise overview of the contributions of the present thesis. The highlighted chapters indicate the chapters containing the principal results of the thesis. The specific principal results appear in boldface.

Chapter 2: Locally Recoverable Codes: Alphabet-Size Dependent Bounds for Single Erasures

This chapter begins with an overview of LR codes. Following this, new alphabet-size dependent bounds on both minimum distance and dimension of an LR code that are tighter than existing bounds in the literature, are presented.

Chapter 3: Tight Bounds on the Rate of LR Codes with Sequential Recovery

This chapter deals with codes for sequential recovery and contains the principal result of the thesis, namely, a tight upper bound on the rate of a code with sequential recovery for all possible values of the number $t$ of erasures guaranteed to be recovered with locality parameter $r$. Matching constructions are provided in the following chapter, Chapter 4. A characterization of codes achieving the upper bound on code rate for the case of erasures is also provided here. The bound on maximum possible code rate assumes that there is no constraint (i.e., upper bound) on the block length of the code or, equivalently, on the code dimension. A lower bound on the block length of codes with sequential recovery from three erasures is also given here. Also given are constructions of codes with sequential recovery for having least possible block length for a given dimension and locality parameter $r$. An upper bound on dimension for the case of for a given dual dimension and locality parameter $r$, and constructions achieving it, are also provided.

Chapter 4: Matching (Optimal) Constructions of Sequential LR Codes

In this chapter, we construct codes which achieve the upper bound on the rate of codes with sequential recovery derived in Chapter 3, for all possible values of the number $t$ of erasures guaranteed to be recovered with locality parameter $r$. We deduce the general structure of the parity-check matrix of a code achieving our upper bound on rate. Based on this, we show achievability of the upper bound on code rate via an explicit construction. We then present codes which achieve the upper bound on rate and have the least possible block length for a specific set of parameters.

Chapter 5: Bounds on the Parameters of Codes with Availability

This chapter deals with codes with availability. Upper bounds are presented on the minimum distance of a code with availability, both for the case when the alphabet size is constrained and when there is no constraint. These bounds are tighter than the existing bounds in the literature. We next introduce a class of codes, termed codes with strict availability, which are a subclass of the codes with availability. The best-known availability codes in terms of rate belong to this category. We present upper bounds on the rate of codes with strict availability that are tighter than existing upper bounds on the rate of codes with availability. We present an exact expression for the maximum possible fractional minimum distance for a given rate for a special class of availability codes as where each code in this special class is a subcode or subspace of the direct product of copies of an availability code with parameters for some . We also present a lower bound on the block length of codes with strict availability and characterize the codes with strict availability achieving the lower bound on block length.

Chapter 6: Tight Bounds on the Sub-Packetization Level of MSR and Vector-MDS Codes

This chapter contains our results on the topic of RG codes. Here, we derive lower bounds on the sub-packetization level of a subclass of MSR codes known as optimal-access MSR codes. We also bound the sub-packetization level of optimal-access MDS codes with optimal repair for (say) a fixed number of nodes. The bounds derived here are tight, as there are constructions in the literature that achieve them. Conversely, the bounds show that these previously known constructions are optimal with respect to sub-packetization level. We also show that the bounds derived here shed light on the structure of an optimal-access MSR or MDS code.

Chapter 7: Partial Maximal and Maximal Recoverable Codes

The final chapter deals with the subclass of LR codes known as Maximal Recoverable (MR) codes. In this chapter we provide constructions of MR codes having smaller field size than the constructions existing in the literature. In particular we modify an existing construction which will result in an MR code with field size of for some specific set of parameters. We also modify (puncture) an existing construction for to form an MR code which results in reduced field size in comparison with the field size of constructions appearing in the literature. We also introduce in the chapter, a class of codes termed as Partial Maximal Recoverable (PMR) codes. We provide constructions of PMR codes having small field size. Since a PMR code is in particular an LR code, this also yields a low-field-size construction of LR codes.

6 Locally Recoverable Codes for Single Erasures

In [22], the authors consider designing codes such that the code designed and codes of short block length derived from the code designed through puncturing operations all have good minimum distance. The requirement of such codes comes from the problem of coding for memory where sometimes you want to read or write only parts of memory. These punctured codes are what would today be regarded as local codes. The authors derive an upper bound on minimum distance of such codes under the constraint that the code symbols in a local code and code symbols in another local code form disjoint sets and provide a simple parity-splitting construction that achieves the upper bound. Note that this upper bound on minimum distance is without any constraint on field size and achieved for some restricted set of parameters by parity splitting construction which has field size of . In [23], the authors note that when a single code symbol is erased in an MDS code, code symbols need to be contacted to recover the erased code symbol where is the dimension of the MDS code. This led them to design codes called Pyramid Codes which are very simply derived from the systematic generator matrix of an MDS code and which reduce the number of code symbols that is needed to be contacted to recover an erased code symbol. In [24], the authors recognize the requirement of recovering a set of erased code symbols by contacting a small set of remaining code symbols and provide a code construction for the requirement based on the use of linearized polynomials. In [8], the authors introduce the class of LR codes in full generality, and present an upper bound on minimum distance without any constraint on field size. This paper along with the paper [39] (sharing a common subset of authors) which presented the practical application of LR codes in Windows Azure storage, are to a large extent, responsible for drawing the attention of coding theorists to this class of codes. 
Extensions to the non-linear case appear in [25] and [40]. All of these papers were primarily concerned with local recoverability in the case of a single erasure, i.e., recovering an erased code symbol by contacting a small set of code symbols. More recent research has focused on the multiple-erasure case, which is treated in subsequent chapters of this thesis. Throughout this chapter:

  1. a codeword in an $[n, k]$ linear code will be represented by $\mathbf{c} = [c_1, c_2, \ldots, c_n]$, where $c_i$ denotes the $i$th code symbol.

  2. all codes discussed are linear codes and we will use the term nonlinear explicitly when referring to a nonlinear code.

  3. we say a code achieves a bound (an inequality), iff it has parameters such that the bound is satisfied with equality.

  4. the notation $d$ or $d_{\min}$ refers to the minimum distance of the code under discussion.

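Let $\mathcal{C}$ be an $[n, k]$ code over a finite field $\mathbb{F}_q$. Let $G$ be a $(k \times n)$ generator matrix for $\mathcal{C}$ having columns $\{g_i\}_{i=1}^{n}$, i.e., $G = [g_1, g_2, \ldots, g_n]$. An information set $E = \{e_1, e_2, \ldots, e_k\}$ is any subset of $[n] = \{1, 2, \ldots, n\}$ of size $k$ satisfying: $\text{rank}(G|_E) = \text{rank}[g_{e_1}, \ldots, g_{e_k}] = k$.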

Definition 1.

An $[n, k]$ code $\mathcal{C}$ over a finite field $\mathbb{F}_q$ is said to be an LR code with information-symbol (IS) locality over $\mathbb{F}_q$ if there is an information set $E = \{e_1, e_2, \ldots, e_k\}$ such that for every $e_i \in E$, there exists a subset $S_i \subseteq [n]$, with $e_i \in S_i$, such that $|S_i| \le r + 1$ and there is a codeword in the dual code $\mathcal{C}^{\perp}$ with support exactly equal to $S_i$. $\mathcal{C}$ is said to be an LR code with all-symbol (AS) locality over $\mathbb{F}_q$ if for every $i \in [n]$, there exists a subset $S_i \subseteq [n]$ with $i \in S_i$, such that $|S_i| \le r + 1$ and there is a codeword in the dual code $\mathcal{C}^{\perp}$ with support exactly equal to $S_i$. Clearly, an LR code with AS locality is also an LR code with IS locality. The parameter $r$ appearing above is termed the locality parameter.

Throughout this thesis, the term LR code refers to an LR code with all-symbol (AS) locality; when we discuss an LR code with information-symbol (IS) locality, we will state it explicitly. Note that the presence of a codeword in the dual code with support set $S_i$ implies that if the code symbol $c_i$ is erased, then it can be recovered from the code symbols in the set $\{c_j \mid j \in S_i \setminus \{i\}\}$. Recovery from erasures is termed repair. The repair is local, since $|S_i| \le r + 1$ and typically, $r$ is significantly smaller than the block length $n$ of the code. It is easy to see that every linear code can trivially be regarded as an LR code with locality parameter $k$. The term local code of an LR code $\mathcal{C}$ refers to the code $\mathcal{C}|_{S_i} = \{\mathbf{c}|_{S_i} : \mathbf{c} \in \mathcal{C}\}$ for some $i \in [n]$, where $\mathbf{c}|_{A} = [c_{j_1}, \ldots, c_{j_\ell}]$ for a set $A = \{j_1, \ldots, j_\ell\}$ with $j_1 < j_2 < \cdots < j_\ell$.

6.1 The Bound

A major result in the theory of LR codes is the minimum distance bound given in (1) which was derived for linear codes in [8]. An analogous bound for nonlinear codes can be found in [25],[40].

Theorem 6.1.

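[8] Let $\mathcal{C}$ be an $[n, k]$ LR code with IS locality over $\mathbb{F}_q$ with locality parameter $r$ and minimum distance $d_{\min}$. Then

$$d_{\min} \le (n - k + 1) - \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right). \tag{1}$$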

On specializing to the case $r = k$, (1) yields the Singleton bound, and for this reason the bound in (1) is referred to as the Singleton bound for an LR code. Note that since (1) is derived for an LR code with IS locality, it is also applicable to an LR code with AS locality.
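The bound (1), $d_{\min} \le (n - k + 1) - (\lceil k/r \rceil - 1)$, is easy to evaluate numerically. The following is a minimal sketch; the function name `lr_singleton_bound` and the sample parameters are illustrative choices, not taken from the references.

```python
def lr_singleton_bound(n: int, k: int, r: int) -> int:
    """Upper bound (1) on d_min of an [n, k] LR code with locality parameter r."""
    if not (1 <= r <= k <= n):
        raise ValueError("expected 1 <= r <= k <= n")
    ceil_k_over_r = -(-k // r)          # integer ceiling of k / r
    return (n - k + 1) - (ceil_k_over_r - 1)


if __name__ == "__main__":
    # Hypothetical parameter sets, chosen only for illustration.
    for n, k, r in [(9, 4, 2), (16, 12, 6), (16, 12, 12)]:
        print(n, k, r, lr_singleton_bound(n, k, r))
```

With $r = k$ the correction term vanishes and the function returns the classical Singleton value $n - k + 1$.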

6.2 Constructions of LR Codes

In the following, we will describe two constructions of LR codes having field size of $O(n)$ and achieving the bound (1). The first construction, called the Pyramid Code construction [23], allows us to construct, for any given parameter set $\{n, k, r\}$, an LR code with IS locality achieving the bound in (1). The second construction, which appeared in [46], gives LR codes with AS locality achieving the bound (1) for any $n, k, r$ under the constraint that $(r + 1) \mid n$.

6.2.1 Pyramid Code Construction

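The pyramid code construction technique [23] allows us to construct, for any given parameter set $\{n, k, r\}$, an LR code with IS locality achieving the bound in (1). We sketch the construction for the case $k = 2r$. The general case $k = ar$, $a > 2$, or when $r \nmid k$, follows along similar lines. The construction begins with the systematic generator matrix $G_{\text{MDS}}$ of an $[n_1, k, d_1]$ scalar MDS code $\mathcal{C}_{\text{MDS}}$ having block length $n_1 = n - 1$. It then reorganizes the sub-matrices of $G_{\text{MDS}}$ to create the generator matrix $G_{\text{PYR}}$ of the pyramid code as shown in the following:

$$G_{\text{MDS}} = \begin{bmatrix} I_r & 0 & P_1 & Q_1 \\ 0 & I_r & P_2 & Q_2 \end{bmatrix} \;\Rightarrow\; G_{\text{PYR}} = \begin{bmatrix} I_r & 0 & P_1 & 0 & Q_1 \\ 0 & I_r & 0 & P_2 & Q_2 \end{bmatrix},$$

where $P_1, P_2$ are $(r \times 1)$ column vectors. It is not hard to show that the code $\mathcal{C}_{\text{PYR}}$ generated by $G_{\text{PYR}}$ is an LR code with IS locality and that $d_{\min}(\mathcal{C}_{\text{PYR}}) \ge d_1$. It follows that

$$d_{\min}(\mathcal{C}_{\text{PYR}}) \ge d_1 = n_1 - k + 1 = n - k = (n - k + 1) - \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right),$$

and the code thus achieves the Singleton bound in (1).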

6.2.2 The Tamo-Barg Construction

Figure 13: In the T-B construction, code symbols in a local code of length $(r + 1)$ correspond to evaluations of a polynomial of degree $\le (r - 1)$. Here, $r = 2$ implies that a local code corresponds to evaluation at $3$ points of a linear polynomial.

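The construction below by Tamo and Barg [46] provides LR codes with AS locality achieving the bound (1) for any $n, k, r$ with $(r + 1) \mid n$. We will refer to this construction as the Tamo-Barg (T-B) construction. Let $\mathbb{F}_q$ be a finite field of size $q$, let $r \ge 2$, $n = m(r + 1) \le q$, with $m \ge 2$ and $2 \le k \le rm$. Set $k = ar + b$, $0 \le b \le (r - 1)$. Let $A = \{\theta_1, \ldots, \theta_n\} \subseteq \mathbb{F}_q$ and let $A_i \subseteq A$, $1 \le i \le m$, $|A_i| = (r + 1)$, such that $A = \cup_{i=1}^{m} A_i$ represent a partitioning of $A$. Let $g(x)$ be a ‘good’ polynomial, by which is meant a polynomial over $\mathbb{F}_q$ that is constant on each $A_i$, i.e., $g(x) = y_i$ for all $x \in A_i$ for some $y_i \in \mathbb{F}_q$, and whose degree is $(r + 1)$. Let

$$f(x) = \sum_{j=0}^{a-1} \sum_{i=0}^{r-1} a_{ij} [g(x)]^{j} x^{i} + \sum_{i=0}^{b-1} a_{ia} [g(x)]^{a} x^{i},$$

where the $a_{ij} \in \mathbb{F}_q$ are the message symbols and where the second term is vacuous for $b = 0$, i.e., when $r \mid k$. Consider the code $\mathcal{C}$ of block length $n$ and dimension $k$ where the codeword $\mathbf{c}$ of length $n$ corresponding to given message symbols $\{a_{ij}\}$ is obtained by evaluating $f(x)$ at each of the $n$ elements in $A$ after substituting the given values of the message symbols in the expression for $f(x)$. It can be shown that $\mathcal{C}$ is an LR code with AS locality with locality parameter $r$ and achieves the bound in (1). The $i$-th local code corresponds to evaluations of $f(x)$ at elements of $A_i$ (also see Fig. 13). An example of how good polynomials may be constructed is given below, corresponding to the annihilator polynomial of a multiplicative subgroup $H$ of $\mathbb{F}_q^{*}$. Let $H < G \le \mathbb{F}_q^{*}$ be a chain of cyclic subgroups, where $|H| = (r + 1)$ and $|G| = n$, so that $(r + 1) \mid n \mid (q - 1)$. Let $n = (r + 1)m$. Let $\{A_i = \gamma_i H \mid i = 1, 2, \ldots, m\}$ be the $m$ multiplicative cosets of $H$ in $G$, with $\gamma_1$ being the multiplicative identity so that $A_1 = H$. It follows that

$$\prod_{\beta \in A_i} (x - \beta) = x^{r+1} - \gamma_i^{r+1},$$

so that $x^{r+1}$ is constant on all the cosets of $H$ in $G$ and may be selected as the good polynomial $g(x)$, i.e., $g(x) = x^{r+1}$ is one possible choice of good polynomial based on the multiplicative group $H$. Further examples may be found in [46, 67, 68]. For constructions meeting the Singleton bound with field size of $O(n)$ and greater flexibility in selecting the value of $n$, please see [69].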
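To make the T-B construction concrete, the sketch below instantiates it over the prime field $\mathbb{F}_{13}$ with $r = 2$, $n = 9$, $k = 4$, using the subgroup $H = \{1, 3, 9\}$ and the good polynomial $g(x) = x^{3}$. These parameters, and all function names, are assumptions chosen for illustration. On each coset, $f(x)$ restricts to a polynomial of degree at most $1$, so an erased symbol is repaired by interpolating a line through the other two symbols of its local code. The code relies on `pow(x, -1, p)` for modular inverses, which requires Python 3.8 or later.

```python
from itertools import product

P = 13                                          # prime field F_13 (illustrative)
R = 2                                           # locality parameter r
COSETS = [[1, 3, 9], [2, 6, 5], [4, 12, 10]]    # cosets of H = {1, 3, 9}
POINTS = [x for coset in COSETS for x in coset] # n = 9 evaluation points


def good_poly(x):
    """g(x) = x^(r+1), which is constant on every coset of H."""
    return pow(x, R + 1, P)


def encode(msg):
    """Evaluate f(x) = a00 + a10*x + (a01 + a11*x)*g(x) at the n points.

    msg = [a00, a10, a01, a11]; here k = 4 = 2r, so the second sum is vacuous.
    """
    a00, a10, a01, a11 = msg
    return [(a00 + a10 * x + (a01 + a11 * x) * good_poly(x)) % P for x in POINTS]


def repair(word, idx):
    """Recover symbol idx from the other r = 2 symbols of its local code."""
    coset = COSETS[idx // (R + 1)]
    x0 = POINTS[idx]
    x1, x2 = [x for x in coset if x != x0]
    y1, y2 = word[POINTS.index(x1)], word[POINTS.index(x2)]
    # Restricted to a coset, f has degree <= 1: interpolate the line.
    slope = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    return (y1 + slope * (x0 - x1)) % P


def min_distance():
    """Brute-force minimum Hamming weight over all nonzero messages."""
    return min(
        sum(s != 0 for s in encode(m))
        for m in product(range(P), repeat=4)
        if any(m)
    )


if __name__ == "__main__":
    word = encode([1, 2, 3, 4])
    print(all(repair(word, i) == word[i] for i in range(len(word))))
    print(min_distance())
```

For these parameters the bound (1) evaluates to $(9 - 4 + 1) - (2 - 1) = 5$, and since $f(x)$ has degree at most $4$ every nonzero codeword has weight at least $5$, so the brute-force search is expected to return exactly $5$.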

7 Alphabet-Size Dependent Bounds

This section contains the contributions of the thesis on the topic of LR codes for the case of single erasures. These include the best-known alphabet-size-dependent bounds on both minimum distance and dimension for LR codes for . For , our bound on dimension is the tightest known bound for . The bound in equation (1) as well as the bounds for non-linear and vector codes derived in [25, 70] hold regardless of the size of the underlying finite field. The theorem below which appeared in [71] takes the size of the code symbol alphabet into account and provides a tighter upper bound on the dimension of an LR code for a given that is valid even for nonlinear codes where is the minimum distance of the code. The ‘dimension’ of a nonlinear code  over an alphabet of size is defined to be the quantity .

Theorem 7.1.

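[71] Let $\mathcal{C}$ be an $(n, k, d)$ LR code with AS locality and locality parameter $r$ over an alphabet $\mathcal{Q}$ of size $q$. Then the dimension $k$ of the code must satisfy:

$$k \le \min_{t \in \mathbb{Z}_{+}} \left[ tr + k^{(q)}_{\text{opt}}(n - t(r + 1), d) \right], \tag{2}$$

where $k^{(q)}_{\text{opt}}(n - t(r + 1), d)$ denotes the largest possible dimension of a code (no locality necessary) over $\mathcal{Q}$ having block length $(n - t(r + 1))$ and minimum distance $d$.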

Proof.

(Sketch of proof) The bound holds for linear as well as nonlinear codes. In the linear case, with , the derivation proceeds as follows. Let be a generator matrix of the locally recoverable code . Then it can be shown that for any integer , there exists an index set such that and where refers to the set of columns of indexed by . This implies that  has a generator matrix of the form (after permutation of columns):

In turn, this implies that the rowspace of defines an code over , if . It follows that and the result follows. Note that the row space of corresponds to a shortening of  with respect to the coordinates . The proof in the general case is a (nontrivial) extension to the nonlinear setting. ∎
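Evaluating a bound of the form $k \le \min_{t} \left[ tr + k^{(q)}_{\text{opt}}(n - t(r+1), d) \right]$ requires the quantity $k^{(q)}_{\text{opt}}$, which is not known in closed form. The sketch below, for linear codes, substitutes a stand-in: the smaller of the Singleton and Hamming (sphere-packing) upper bounds on dimension. Since any upper bound on $k^{(q)}_{\text{opt}}$ keeps the inequality valid, the result is still an upper bound on $k$, though possibly looser than one computed from the true $k^{(q)}_{\text{opt}}$. The function names and the sample parameters are assumptions made for illustration.

```python
from math import comb


def ball_volume(n, e, q):
    """Number of words within Hamming distance e of a fixed word of length n."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(e + 1))


def k_upper(n, d, q):
    """Stand-in for k_opt: min of Singleton and Hamming bounds (linear codes)."""
    if n < d:
        return 0                       # only the zero codeword is possible
    vol = ball_volume(n, (d - 1) // 2, q)
    k = 0
    while q ** (k + 1) * vol <= q ** n:  # largest k with q^k * vol <= q^n
        k += 1
    return min(k, n - d + 1)


def dimension_bound(n, d, r, q):
    """min over t >= 1 with t(r+1) <= n of  t*r + k_upper(n - t(r+1), d, q)."""
    t_max = n // (r + 1)
    if t_max < 1:
        raise ValueError("need n >= r + 1")
    return min(t * r + k_upper(n - t * (r + 1), d, q) for t in range(1, t_max + 1))


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    print(dimension_bound(31, 5, 2, 2))
```

Restricting $t$ to the range $t(r+1) \le n$ only removes terms from the minimum, so the returned value remains a valid upper bound.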

We next look at a bound on dimension of binary LR codes for a given that appeared in [72]. We remark that there is an additional bound on dimension given in [72] for the case when the local codes are disjoint i.e., the case when the support sets , are pairwise disjoint. However, here we only provide the bound on dimension given in [72], which applies in full generality, and without the assumption of disjoint local codes.

Theorem 7.2.

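[72] For any $[n, k, d]$ linear code $\mathcal{C}$ that is an LR code with AS locality with locality parameter $r$ over $\mathbb{F}_2$ with $d \ge 5$ and $2 \le r \le \frac{n}{2} - 2$, we must have:

$$k \le \frac{rn}{r + 1} - \min\left\{ \log_2\left(1 + \frac{rn}{2}\right), \frac{rn}{(r + 1)(r + 2)} \right\}. \tag{3}$$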

The above bound is obtained by applying a Hamming-bound-type argument to an LR code with AS locality. In [43], the authors provide a bound on the minimum distance of LR codes with IS locality that depends on the size $q$ of the underlying finite field $\mathbb{F}_q$; the bound thus applies to LR codes with AS locality as well. (The bound also has an extension to codes with availability, a notion defined later in this thesis.) The bound of [43] is as follows:

Theorem 7.3.

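For any $[n, k, d]$ linear code $\mathcal{C}$ that is an LR code with IS locality with locality parameter $r$ over $\mathbb{F}_q$:

$$d \le \min_{1 \le t \le \lceil k/r \rceil - 1} d^{(q)}_{\text{opt}}(n - t(r + 1), k - tr), \tag{4}$$

where $d^{(q)}_{\text{opt}}(n, k)$ is the maximum possible minimum distance of a classical (i.e., no locality necessary) $[n, k]$ block code over $\mathbb{F}_q$.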

We next introduce the notion of Generalized Hamming Weights (also known as Minimum Support Weights) which will be used to derive a new bound on the minimum distance and dimension of an LR code with AS locality, that takes into account the size of the underlying finite field . The bound makes use of the technique of code shortening and the GHWs of a code provide valuable information about shortened codes.

7.0.1 GHW and the Minimum Support Weight Sequence

We will first define the Generalized Hamming Weights of a code, introduced in [73], and also known as Minimum Support Weights (MSW) (see [74]) of a code. In this thesis we will use the term Minimum Support Weight (MSW).

Definition 2.

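The $i$th Minimum Support Weight (MSW) $d_i$ (equivalently, the $i$th Generalized Hamming Weight) of an $[n, k]$ code $\mathcal{C}$ is the cardinality of the minimum support of an $i$-dimensional subcode of $\mathcal{C}$, i.e.,

$$d_i(\mathcal{C}) = d_i = \min_{\{\mathcal{D} \,:\, \mathcal{D} < \mathcal{C},\ \dim(\mathcal{D}) = i\}} |\text{supp}(\mathcal{D})|, \tag{5}$$

where the notation $\mathcal{D} < \mathcal{C}$ denotes a subcode $\mathcal{D}$ of $\mathcal{C}$ and where $\text{supp}(\mathcal{D}) = \cup_{\mathbf{c} \in \mathcal{D}} \text{supp}(\mathbf{c})$ (called the support of the code $\mathcal{D}$).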
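For a small binary code, the MSWs can be computed by exhaustive search directly from the definition. The sketch below uses the fact that the support of a subcode equals the union of the supports of any basis of it, so it suffices to range over sets of $i$ linearly independent codewords. Codewords are stored as integer bitmasks; the function names and the choice of a $[7, 4]$ Hamming code as the test case are assumptions made for illustration, and the search is only practical for very small codes.

```python
from itertools import combinations


def span_codewords(gen_rows):
    """All nonzero codewords of the binary code spanned by gen_rows (bitmasks)."""
    words = {0}
    for g in gen_rows:
        words |= {w ^ g for w in words}
    words.discard(0)
    return sorted(words)


def rank_gf2(vectors):
    """Rank over GF(2) of bitmask vectors, by elimination on leading bits."""
    basis = {}                          # leading bit position -> basis vector
    for v in vectors:
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                break
            v ^= basis[lead]
    return len(basis)


def msw_sequence(gen_rows):
    """Return [d_1, ..., d_k]: minimum support size of an i-dimensional subcode."""
    k = rank_gf2(gen_rows)
    words = span_codewords(gen_rows)
    seq = []
    for i in range(1, k + 1):
        best = None
        for subset in combinations(words, i):
            if rank_gf2(subset) < i:
                continue                # not a basis of an i-dimensional subcode
            support = 0
            for w in subset:
                support |= w            # union of supports of the basis vectors
            size = bin(support).count("1")
            if best is None or size < best:
                best = size
        seq.append(best)
    return seq


if __name__ == "__main__":
    # Systematic generator matrix of a [7, 4] binary Hamming code.
    hamming_7_4 = [0b1000110, 0b0100101, 0b0010011, 0b0001111]
    print(msw_sequence(hamming_7_4))
```

The first entry $d_1$ is the minimum distance of the code and the last entry $d_k$ is the size of the support of the whole code.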

Although the MSW definition applies to any code, the interest in this thesis is in its application to a restricted class of codes that we introduce here.

Definition 3 (Canonical Dual Code).

By a canonical dual code, we will mean an linear code satisfying the following: contains a set of linearly independent codewords of Hamming weight , such that the sets , cover , i.e.,

As it turns out, the dual code of an LR code with AS locality (and as we shall see in subsequent chapters, dual of codes with sequential recovery and dual of codes with availability) is an example of a canonical dual code and this is the reason for our interest in the MSWs of this class of codes.

Theorem 7.4.

[28] Let  be a canonical dual code with parameters and support sets as defined in Definition 3. Let denote the th MSW of . Let be the minimum possible value of cardinality of the union of any distinct support sets , , i.e.,

Let . Let the integers be recursively defined as follows:

(6)
(7)

Then

Note that in [28], the Theorem 7.4 is proved for the case , , but we observe from the proof of Theorem 7.4 (proof for ) given in [28] that the Theorem 7.4 is also true for any , , . We will refer to the sequence appearing in the theorem above as the Minimum Support Weight (MSW) Sequence associated to parameter set . In the subsection below, we derive new alphabet-size dependent bounds on minimum distance and dimension, that are expressed in terms of the MSW sequence.

7.1 New Alphabet-Size Dependent Bound Based on MSW

In this subsection, we present field-size dependent bounds on the minimum distance and dimension of an LR code  with AS locality with as locality parameter. The bounds are derived in terms of the MSW sequence associated with the dual of . The basic idea is to shorten the LR code to a code with (th term of MSW sequence) code symbols set to zero for some . Theorem 7.4 provides a lower bound on the dimension of this shortened code. Classical bounds on the parameters of this shortened code are shown to yield bounds on the parameters of the parent LR code.

Theorem 7.5.

Let be an LR code with AS locality with locality parameter over a field with minimum distance . Let be the maximum possible minimum distance of an LR code with AS locality with locality parameter over a field . Then:

(8)
(9)

where

  1. ,

  2. ,

  3. ,

  4. is the maximum possible minimum distance of a classical (i.e., no locality necessary) block code over and

  5. is the largest possible dimension of a code (i.e., no locality necessary) over having block length and minimum distance .

Proof.

Since is an LR code with AS locality with locality parameter , we have that is an canonical dual code with locality parameter and . We explain the reason for the inequality in the symbol . Since is an LR code with AS locality, for every code symbol , there is a codeword in of weight whose support contains . So take a codeword in of weight whose support set contains . Next, choose and take a codeword in of weight whose support set contains . Repeat this process. At the step, choose and take a codeword in of weight whose support set contains . Note that the set of codewords corresponding to support sets form a set of linearly independent codewords in and this process can be repeated until for some . Since , we have that which implies as . We now set . Hence from Theorem 7.4, , . For simplicity, let us write , . Next, fix with . Let be the support of an dimensional subspace or subcode of with the support having cardinality exactly in . Add arbitrary extra indices to and let the resulting set be . Hence and . Now shorten the code in the co-ordinates indexed by i.e., take where is the complement of and for a set with ,. The resulting code has block length , dimension and minimum distance (if ) and the resulting code is also an LR code with AS locality with locality parameter . Hence:

The proof of (9) follows from the fact that and Hence :

(10)

An example comparison of the upper bounds on dimension of linear LR codes given in (2), (3) and (9) (our bound) is presented in Table 1. The word bound in the following refers to an upper bound.

$n = 31$, $q = 2$, $d = 5$
$r$ (locality)    2    3    4    5    6
Bound (2)        17   19   20   20   20
Bound (3)        15   18   20   22   23
Bound (9)        16   18   19   20   20
Table 1: A comparison of upper bounds on the dimension of a binary LR code, for given .
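As a cross-check on the row of Table 1 for the bound (3), the sketch below evaluates $k \le \frac{rn}{r+1} - \min\{\log_2(1 + \frac{rn}{2}), \frac{rn}{(r+1)(r+2)}\}$ and takes the integer part, since the dimension of a linear code is an integer. The block length $n = 31$ is an assumption made here; with it, the formula reproduces the values listed in that row. The function name is hypothetical.

```python
from math import floor, log2


def binary_lr_dimension_bound(n: int, r: int) -> int:
    """Integer part of the Hamming-type bound (3) for binary LR codes (d >= 5)."""
    if not (2 <= r <= n / 2 - 2):
        raise ValueError("bound (3) is stated for 2 <= r <= n/2 - 2")
    rate_term = r * n / (r + 1)
    penalty = min(log2(1 + r * n / 2), r * n / ((r + 1) * (r + 2)))
    return floor(rate_term - penalty)


if __name__ == "__main__":
    n = 31                              # assumed block length
    print([binary_lr_dimension_bound(n, r) for r in range(2, 7)])
```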

Since , it can be seen that the bound (8) is tighter than the bound (4) when applied to an LR code with AS locality. For the same reason, the bound (9) is tighter than bound (2). For , it is mentioned in [72], that the bound (3) is looser than the bound in (2) for . Since our bound (9) presented here is tighter than the bound appearing in (2), we conclude that for , , our bound (9) is tighter than the bound in (3). Hence our bounds (8),(9) are the tightest known bounds on minimum distance and dimension for . For , our bound (9) is the tightest known bound on dimension for . We here note that the bounds (8), (9) apply even if we replace with any other upper bound on MSW. The bounds derived here are general in this sense. For the sake of completeness, in the following we give a survey of existing small alphabet size constructions which are optimal w.r.t bounds appearing in the literature.

Remark 1.

Let and let be a positive integer. Then if and , it is trivial to observe that:

Trivially, if is a monotonic function of , then assuming continuity of , we get . This is because if is a monotonic function of then the number of linearly independent codewords of weight in the dual code needed to satisfy the conditions necessary for an LR code is a negligible fraction of as increases. Hence the region of interest in locality is when is a constant or when is a small number.

7.2 Small-Alphabet Constructions

7.2.1 Construction of Binary Codes

Constructions for binary codes that achieve the bound on dimension given in (2) for binary codes, appear in [75, 45, 76]. While [76] and [75] provide constructions for and respectively, the constructions in [45] handle the case of larger minimum distance but have locality parameter restricted to . In [43], the authors give optimal binary constructions with information and all symbol locality with . The construction is optimal w.r.t the bound (4). Constructions achieving the bound on dimension appearing in [72] and the further tightened bound for disjoint repair groups given in [77] for binary codes, appear respectively, in [72, 77]. These constructions are for the case . In [76], the authors present a characterization of binary LR codes that achieve the Singleton bound (1). In [78], the authors present constructions of binary codes meeting the Singleton bound. These codes are a subclass of the codes characterized in [76] for the case .

7.2.2 Constructions with Small, Non-Binary Alphabet

In [79], the authors characterize ternary LR codes achieving the Singleton bound (1). In [76, 78, 80], the authors provide constructions for codes over a field of size that achieve the Singleton bound in (1) for . Some codes from algebraic geometry achieving the Singleton bound (1) for restricted parameter sets are presented in [81].

7.2.3 Construction of Cyclic LR Codes

Cyclic LR codes can be constructed by carefully selecting the generator polynomial of the cyclic code. We illustrate a key idea behind the construction of a cyclic LR code by means of an example.

Figure 14: Zeros of the generator polynomial of the cyclic code in Example 1 are identified by circles. The unshaded circles along with the shaded circle corresponding to indicate the zeros of selected to impart the code with . The shaded circles indicate the periodic train of zeros introduced to cause the code to be locally recoverable with parameter . The common element is helpful both to impart increased minimum distance as well as locality.
Example 1.

Let be a primitive element of satisfying . Let be a cyclic code having generator polynomial . Since the consecutive powers of are zeros of , it follows that by the BCH bound. Suppose we desire to ensure that a code  having generator polynomial has and in addition, is locally recoverable with parameter , then we do the following. Set . Let and . It follows that . Summing over we obtain:

It follows that the symbols of  form a local code as they satisfy the constraint of an overall parity-check. Since the code  is cyclic the same holds for the code symbols , for . Thus through this selection of generator polynomial , we have obtained a code that has both locality and . The zeros of are illustrated in Fig. 14. The code  has parameters and . Note that the price we pay for introduction of locality is a loss in code dimension, equal to the degree of the polynomial . Thus an efficient code will choose the zeros of for maximum overlap.

The above idea of constructing cyclic LR code was introduced in [82] and extended in [83, 84, 85, 86]. In [87], the use of locality for reducing the complexity of decoding a cyclic code is explored. The same paper also makes a connection with earlier work [88] that can be interpreted in terms of locality of a cyclic code. In [82] a construction of binary cyclic LR codes for an achieving a bound derived within the same paper for binary codes is provided. In [85], the authors give constructions of optimal binary, ternary codes meeting the Singleton bound (1) for and as well as a construction of a binary code meeting the bound given in [72] for based on concatenating cyclic codes. A discussion on the locality of classical binary cyclic codes as well as of codes derived from them through simple operations such as shortening, can be found in [89, 43]. The principal idea here is that any cyclic code has locality where is the minimum distance of the dual code . In [84], the authors construct optimal cyclic codes under the constraint that the local code is either a Simplex code or else, a Reed-Muller code. In [83], the authors provide a construction of cyclic codes with field size achieving the Singleton bound (1) and also study the locality of subfield subcodes as well as their duals, the trace codes. In [86], constructions of cyclic LR codes with for any and flexible are provided.

8 Summary

This chapter dealt with LR codes for the single-erasure case and presented the requisite background as well as the contributions of the thesis in this direction. The thesis contributions on LR codes for single erasure case correspond to new alphabet-size dependent upper bounds on the minimum distance and dimension of a linear LR code. Thus the upper bounds apply to the case of LR codes over a finite field of fixed size . A key ingredient in the upper bounds derived here are the bounds on the Generalized Hamming Weights (GHW) derived in [28]. Evidence was presented showing our upper bound on dimension to be tighter in comparison with existing upper bounds in the literature.

9 Introduction

The focus of the present chapter is on LR codes for multiple erasures. We begin by providing motivation for studying the multiple-erasure case (Section 9.1). As there are several approaches towards handling multiple erasures in the literature, we next provide a broad classification of LR codes for multiple erasures (Section 10). The principal contributions of this thesis relate to a particular approach towards the recovery from multiple erasures, termed as sequential recovery. Section 11 introduces LR codes with sequential recovery and surveys the known literature on the topic. This is followed by an overview of the contributions of this thesis on the topic of LR codes with sequential recovery in Section 12. Sections 13 and 14 respectively present in detail, the results obtained in this thesis, relating to the case of and erasures respectively. Section 15 presents a principal result of this thesis which involves establishing a tight upper bound on the rate of an LR code with sequential recovery for erasures along with a matching construction. The upper bound derived also proves a conjecture that had previously appeared in the literature. The final section, Section 16, summarizes the contents of the chapter. Throughout this chapter, we use the term weight to denote the Hamming weight.

9.1 Motivation for Studying Multiple-Erasure LR Codes

Given that the key problems on the topic of LR codes for the single erasure case have been settled, the academic community has turned its attention towards LR codes for multiple erasures.

Availability

A strong motivation for studying the multiple-erasure case comes from the notion of availability. A storage unit could end up storing data that is in extremely high demand at a certain time instant. In such situations, regarded by the storage industry as degraded reads, the storage industry will look to create, on the fly, replicas of the storage unit’s data. If a code symbol can be recreated in $t$ different ways by calling upon $t$ pairwise disjoint sets of helper nodes, then one could recreate $t$ copies of the data in demand in parallel. But a code which can, for any code symbol, recreate in this fashion $t$ simultaneous copies of the code symbol also has the ability to correct $t$ erasures simultaneously. This follows because, for any erased symbol, the remaining $(t - 1)$ erasures can affect at most $(t - 1)$ of its helper node sets, and thus there is still a helper node set remaining that can repair the erased symbol. The problem of designing codes with availability in the context of locality was introduced in [33] and a high-rate construction for availability codes appeared in [34]. For a survey on constructions of availability codes, please see the chapter of this thesis on codes with availability. Upper bounds on the minimum distance and rate of an availability code appeared in [4], [43], [35], [48].

Other reasons

Other reasons for being interested in the multiple erasure setting include (a) the increasing trend towards replacing expensive servers with low-cost commodity servers that can result in simultaneous node failures and (b) the temporary unavailability of a helper node to assist in the repair of a failed node.

10 Classification of LR Codes for Multiple Erasures

An overview of the different classes of LR codes proposed in the literature that are capable of recovering from multiple erasures is presented here. All approaches to recovery from multiple erasures place a constraint $r$ on the number of unerased symbols that are used to recover a particular erased symbol. The value of $r$ is usually small in comparison with the block length $n$ of the code, and for this reason one speaks of the recovery as being local. All the codes defined in this section are over a finite field $\mathbb{F}_q$. A codeword in an $[n, k]$ code will be represented by $[c_1, c_2, \ldots, c_n]$. In this chapter, we will restrict ourselves to linear codes.

10.0.1 Sequential-Recovery LR Codes

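An $(n, k, r, t)$ sequential-recovery LR code (abbreviated as S-LR code) is an $[n, k]$ linear code $\mathcal{C}$ having the following property: Given a collection of $s \le t$ erased code symbols, there is an ordering $(c_{i_1}, c_{i_2}, \ldots, c_{i_s})$ of these $s$ erased symbols such that for each index $i_j$, there exists a subset $S_j \subseteq [n]$ satisfying (i) $|S_j| \le r$, (ii) $S_j \cap \{i_j, i_{j+1}, \ldots, i_s\} = \emptyset$, and (iii)

$$c_{i_j} = \sum_{\ell \in S_j} u_{\ell} c_{\ell}, \quad u_{\ell} \in \mathbb{F}_q. \tag{11}$$

It follows from the definition that an $(n, k, r, t)$ S-LR code can recover from the erasure of $s$ code symbols $c_{i_1}, c_{i_2}, \ldots, c_{i_s}$, for $1 \le s \le t$, by using (11) to recover the symbols $c_{i_j}$, $j = 1, 2, \ldots, s$, in succession.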

Figure 15: The various code classes of LR codes corresponding to different approaches to recovery from multiple erasures.

10.0.2 Parallel-Recovery LR Codes

If in the definition of the S-LR code, we replace condition (ii) by the more stringent requirement:

$$S_j \cap \{i_1, i_2, \ldots, i_s\} = \emptyset, \tag{12}$$

then the LR code will be referred to as a parallel-recovery LR code, abbreviated as P-LR code. Clearly the class of P-LR codes is a subclass of S-LR codes. From a practical point of view, P-LR codes are preferred since, as the name suggests, the erased symbols can be recovered in parallel. However, this will in general come at the expense of storage overhead. We note that under parallel recovery, depending upon the specific code, the same helper (i.e., non-erased) code symbol may be required to participate in the recovery of more than one erased symbol $c_{i_j}$.
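The difference between sequential and parallel recovery can be seen on a toy example. The sketch below assumes a binary code of length $9$ whose symbols are arranged in a $3 \times 3$ array in which every row and every column has even parity, so each parity check involves $r + 1 = 3$ symbols. This code and all names are illustrative assumptions, and only the row and column checks are used as recovery sets, whereas the definitions allow any codeword of the dual code. A symbol can be repaired from a check when it is the only erased symbol in that check.

```python
from itertools import combinations

SIDE = 3
N = SIDE * SIDE
# Local parity checks: each row and each column of the 3 x 3 array sums to zero.
ROWS = [frozenset(range(i * SIDE, (i + 1) * SIDE)) for i in range(SIDE)]
COLS = [frozenset(range(j, N, SIDE)) for j in range(SIDE)]
CHECKS = ROWS + COLS                    # every check has weight 3, so r = 2


def recoverable_in_parallel(erased):
    """True if every erased symbol has a check containing no other erasure."""
    erased = set(erased)
    return all(any(chk & erased == {i} for chk in CHECKS) for i in erased)


def sequential_order(erased):
    """Return a valid sequential recovery order, or None if peeling gets stuck."""
    remaining = set(erased)
    order = []
    while remaining:
        step = next(
            (i for i in sorted(remaining)
             if any(chk & remaining == {i} for chk in CHECKS)),
            None,
        )
        if step is None:
            return None                 # every check sees 0 or >= 2 erasures
        order.append(step)
        remaining.remove(step)
    return order


if __name__ == "__main__":
    pattern = {0, 1, 3}                 # positions (0,0), (0,1), (1,0)
    print(recoverable_in_parallel(pattern))
    print(sequential_order(pattern))
    print(all(sequential_order(p) is not None
              for s in (1, 2, 3) for p in combinations(range(N), s)))
```

For the erasure pattern $\{0, 1, 3\}$, symbol $0$ shares its row with erased symbol $1$ and its column with erased symbol $3$, so it cannot be recovered in parallel with the others, but it can be recovered once symbol $1$ has been repaired from its column check.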

10.0.3 Availability Codes

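An $(n, k, r, t)$ availability LR code (see [33, 34, 4, 35]) is an $[n, k]$ linear code having the property that in the event of a single but arbitrary erased code symbol $c_i$, there exist $t$ recovery sets $\{R^{i}_{j}\}_{j=1}^{t}$ which are pairwise disjoint and of size $|R^{i}_{j}| \le r$ with $i \notin R^{i}_{j}$, such that for each $j$, $1 \le j \le t$, $c_i$ can be expressed in the form:

$$c_i = \sum_{\ell \in R^{i}_{j}} a_{\ell} c_{\ell}, \quad a_{\ell} \in \mathbb{F}_q.$$

An $(n, k, r, t)$ availability code is also an $(n, k, r, t)$ P-LR code. This follows because the presence of at most $t$ erasures implies that there will be at least one recovery set for each erased code symbol all of whose symbols remain unerased.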

10.0.4 $(r, \delta)$ Codes

An linear code  is said to have AS locality (see [26, 27]), if for each co-ordinate , there exists a subset , with , with