In recent years, there has been a widespread adoption of distributed storage systems (DSSs) as a viable storage technology for Big Data. Distributed storage provides an inexpensive storage solution for storing large amounts of data. Formally, a DSS is a network of numerous inexpensive disks (or nodes) where data is stored in a distributed fashion. Storage nodes are prone to failures, and thus to losing the stored data. Reliability against node failures (commonly referred to as fault tolerance) is achieved by means of erasure correcting codes (ECCs). ECCs are a way of introducing structured redundancy, and for a DSS, it means addition of redundant nodes. In case of a node failure, these redundant nodes allow complete recovery of the data stored. Since ECCs have a limited fault tolerance, to maintain the initial state of reliability, when a node fails a new node needs to be added to the DSS network and populated with data. The problem of repairing a failed node is known as the repair problem.
Current DSSs like Google File System II and Quick File System use a family of Reed-Solomon (RS) ECCs. Such codes come under a broader family of maximum distance separable (MDS) codes. MDS codes are optimal in terms of the fault tolerance/storage overhead tradeoff. However, the repair of a failed node requires the retrieval of large amounts of data from a large subset of nodes. Therefore, in recent years, the design of ECCs that reduce the cost of repair has attracted significant attention. Pyramid codes were one of the first code constructions that addressed this problem. In particular, Pyramid codes are a class of non-MDS codes that aim at reducing the number of nodes that need to be contacted to repair a single failed node, known as the repair locality. Other non-MDS codes that reduce the repair locality are local reconstruction codes (LRCs) and locally repairable codes [4, 5]. Such codes achieve a low repair locality by ensuring that the parity symbols are a function of a small number of data symbols, which also entails a low repair complexity, defined as the number of elementary additions required to repair a failed node. Furthermore, for a fixed locality, LRCs and Pyramid codes achieve the optimal fault tolerance.
Another important parameter related to the repair is the repair bandwidth, defined as the number of symbols downloaded to repair a single failed node. Dimakis et al.  derived an optimal repair bandwidth-storage per node tradeoff curve and defined two new classes of codes for DSSs known as minimum storage regenerating (MSR) codes and minimum bandwidth regenerating (MBR) codes that are at the two extremal points of the tradeoff curve. MSR codes are MDS codes with the best storage efficiency, i.e., they require a minimum storage of data per node (referred to as the sub-packetization level). On the other hand, MBR codes achieve the minimum repair bandwidth. Product-Matrix MBR (PM-MBR) codes and Fractional Repetition (FR) codes in  and , respectively, are examples of MBR codes. In particular, FR codes achieve low repair complexity at the cost of high storage overheads. Codes such as minimum disk input/output repairable (MDR) codes  and Zigzag codes  strictly fall under the class of MSR codes. These codes have a high sub-packetization level. Alternatively, the MSR codes presented in [11, 12, 13, 14, 15, 16, 17, 18] achieve the minimum possible sub-packetization level.
Piggyback codes presented in  are another class of codes that achieve a sub-optimal reduction in repair bandwidth with a much lower sub-packetization level in comparison to MSR codes, using the concept of piggybacking. Piggybacking consists of adding carefully chosen linear combinations of data symbols (called piggybacks) to the parity symbols of a given ECC. This results in a lower repair bandwidth at the expense of a higher complexity in encoding and repair operations. More recently, the authors in  presented a family of codes that reduce the encoding and repair complexity of PM-MBR codes while maintaining the same level of fault tolerance and repair bandwidth. However, this comes at the cost of large alphabet size. In , binary MDS array codes that achieve optimal repair bandwidth and low repair complexity were introduced, with the caveat that the file size is asymptotic and that the fault tolerance is limited to .
In this paper, we propose a family of non-MDS ECCs that achieve low repair bandwidth and low repair complexity while keeping the field size relatively small and having variable fault tolerance. In particular, we propose a systematic code construction based on two classes of parity symbols. Correspondingly, there are two classes of parity nodes. The first class of parity nodes, whose primary goal is to provide erasure correcting capability, is constructed using an MDS code modified by applying specially designed piggybacks to some of its code symbols. As a secondary goal, the first class of parity nodes enables the repair of a number of data symbols at a low repair cost by downloading piggybacked symbols. The second class of parity nodes is constructed using a block code whose parity symbols are obtained through simple additions. The purpose of this class of parity nodes is not to enhance the erasure correcting capability, but rather to facilitate node repair at low repair bandwidth and low repair complexity by repairing the remaining failed symbols in the node. Compared to , we provide two constructions for the second class of parity nodes. The first one is given by a simple equation that represents the algorithmic construction in . The second one is a heuristic construction that is more involved, but further reduces the repair bandwidth in some cases. Furthermore, we provide explicit formulas for the fault tolerance, repair bandwidth, and repair complexity of the proposed codes and compare them numerically with other codes in the literature. The proposed codes achieve better repair bandwidth compared to MDS codes, Piggyback codes, generalized Piggyback codes, and exact-repairable MDS codes. For certain code parameters, we also see that the proposed codes have better repair bandwidth compared to LRCs and Pyramid codes. Furthermore, they achieve better repair complexity than Zigzag codes, MDS codes, Piggyback codes, generalized Piggyback codes, exact-repairable MDS codes, and binary addition and shift implementable cyclic-convolutional (BASIC) PM-MBR codes. Also, for certain code parameters, the codes have better repair complexity than Pyramid codes. The improvements over MDS codes, MSR codes, and the classes of Piggyback codes come at the expense of a lower fault tolerance in general.
II System Model and Code Construction
We consider the DSS depicted in Fig. 1, consisting of storage nodes, of which are data nodes and are parity nodes. Consider a file that needs to be stored on the DSS. We represent a file as a matrix , called the data array, over , where denotes the Galois field of size , with being a prime number or a power of a prime number. In order to achieve reliability against node failures, the matrix is encoded using an vector code  to obtain a code matrix , referred to as the code array, of size , . The symbol in is then stored at the -th row of the -th node in the DSS. Thus, each node stores symbols. Each row in is referred to as a stripe so that each file in the DSS is stored over stripes in storage nodes. We consider the code to be systematic, which means that for . Correspondingly, we refer to the nodes storing systematic symbols as data nodes and the remaining nodes containing parity symbols only as parity nodes. The efficiency of the code is determined by the code rate, given by . Alternatively, the inverse of the code rate is referred to as the storage overhead.
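The relation between code rate and storage overhead described above can be illustrated with a minimal sketch. The names `n` (total number of storage nodes) and `k` (number of data nodes) are assumed here for illustration, since the symbols are not preserved in this text; the rate is taken as the fraction of nodes that hold data.

```python
from fractions import Fraction


def code_rate(n: int, k: int) -> Fraction:
    """Code rate of a systematic code with k data nodes out of n nodes.

    n and k are assumed names for the total and data node counts.
    """
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    return Fraction(k, n)


def storage_overhead(n: int, k: int) -> Fraction:
    """Storage overhead, i.e., the inverse of the code rate."""
    return 1 / code_rate(n, k)
```

For instance, `storage_overhead(14, 10)` evaluates to `Fraction(7, 5)`, i.e., 1.4 stored symbols per data symbol.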
For later use, we denote the set of message symbols in the data nodes as and by , , the set of parity symbols in the -th node. Subsequently, we define the set as
where is an arbitrary index set. We also define the operator for integers and .
Our main goal is to construct codes that yield low repair bandwidth and low repair complexity of a single failed data node. We focus on the repair of data nodes since the raw data is stored on these nodes and the users can readily access the data through these nodes. Thus, their survival is a crucial aspect of a DSS. To this purpose, we construct a family of systematic codes consisting of two different classes of parity symbols. Correspondingly, there are two classes of parity nodes, referred to as Class and Class parity nodes, as shown in Fig. 1. Class and Class parity nodes are built using an code and an code, respectively, such that . In other words, the parity nodes from the code correspond to the parity nodes of Class and Class codes. The primary goal of Class parity nodes is to achieve a good erasure correcting capability, while the purpose of Class nodes is to yield low repair bandwidth and low repair complexity. In particular, we focus on the repair of data nodes. The repair bandwidth (in bits) per node, denoted by , is proportional to the average number of symbols (data and parity) that need to be downloaded to repair a data symbol, denoted by . More precisely, let be the sub-packetization level of the DSS, which is the number of symbols per node. (Footnote 1: For our code construction, , but this is not the case in general.) Then,
where is the size (in bits) of a symbol in , where for some prime number and positive integer . can be interpreted as the repair bandwidth normalized by the size (in bits) of a node, and will be referred to as the normalized repair bandwidth.
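As a rough illustration of the two quantities just defined, the sketch below computes the normalized repair bandwidth as the average number of symbols downloaded per repaired symbol, and converts it to bits by multiplying with the node size in bits. The function and parameter names are hypothetical, and the product form is an assumption read off the verbal definition above rather than a restatement of the paper's equation.

```python
def normalized_repair_bandwidth(symbols_downloaded: int,
                                symbols_per_node: int) -> float:
    """Average number of symbols downloaded per repaired symbol.

    symbols_per_node plays the role of the sub-packetization level.
    """
    if symbols_per_node <= 0:
        raise ValueError("symbols_per_node must be positive")
    return symbols_downloaded / symbols_per_node


def repair_bandwidth_bits(normalized: float,
                          symbols_per_node: int,
                          bits_per_symbol: int) -> float:
    """Repair bandwidth in bits: normalized value times node size in bits."""
    return normalized * symbols_per_node * bits_per_symbol
```

With 10 symbols downloaded to repair a node of 5 symbols of 8 bits each, the normalized value is 2.0 and the bandwidth is 80 bits.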
The main principle behind our code construction is the following. The repair is performed one symbol at a time. After the repair of a data symbol is accomplished, the symbols read to repair that symbol are cached in the memory. Therefore, they can be used to repair the remaining data symbols at no additional read cost. The proposed codes are constructed in such a way that the repair of a new data symbol requires a low additional read cost (defined as the number of additional symbols that need to be read to repair the data symbol), so that (and hence ) is kept low.
The read cost of a symbol is the number of symbols that need to be read to repair the symbol. For a symbol that is repaired after some others, the additional read cost is defined as the number of additional symbols that need to be read to repair the symbol. (Note that symbols previously read to repair other data symbols are already cached in the memory and to repair a new symbol only some extra symbols may need to be read.)
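The notion of additional read cost can be made concrete with a short sketch. Here each repair step is represented by the set of symbol names it needs (a hypothetical representation), and a symbol that was read in an earlier step is assumed to stay cached and cost nothing when needed again.

```python
def additional_read_costs(repair_sets):
    """Return the additional read cost of each repair step, in order.

    repair_sets: iterable of sets; each set names the symbols needed to
    repair one data symbol. Symbols read earlier are cached and free.
    """
    cache = set()
    costs = []
    for needed in repair_sets:
        new_reads = set(needed) - cache  # only uncached symbols are read
        costs.append(len(new_reads))
        cache |= new_reads
    return costs
```

For example, if the first symbol needs `{"d01", "d02", "p0"}`, the second needs `{"d01", "d02", "p1"}`, and the third needs only `{"p2"}`, the costs are `[3, 1, 1]`.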
III Class Parity Nodes
Class parity nodes are constructed using a modified MDS code, with , over . In particular, we start from an MDS code and apply piggybacks  to some of the parity symbols. The construction of Class parity nodes is performed in two steps as follows.
Encode each row of the data array using an MDS code (same code for each row). The parity symbol is obtained as (Footnote 2: We use the superscript to indicate that the parity symbol is stored in a Class parity node.)
where denotes a coefficient in and . Store the parity symbol in the corresponding row of the code array. Overall, parity symbols are generated.
Modify some of the parity symbols by adding piggybacks. Let , , be the number of piggybacks introduced per row. The parity symbol is updated as
where , the second term in the summation is the piggyback, and the superscript in indicates that the parity symbol contains piggybacks.
The fault tolerance (i.e., the number of node failures that can be tolerated) of Class codes is given in the following theorem.
An Class code with piggybacks per row can tolerate
node failures, where .
See Appendix A.
We remark that for , Class codes are MDS codes.
When a data node fails, Class parity nodes are used to repair of the failed symbols. Class parity symbols are constructed in such a way that, when node is erased, data symbols in this node can be repaired by reading the (non-failed) data symbols in the -th row of the data array and parity symbols in the -th row of Class parity nodes (see also Section IV-C). For later use, we define the set as follows.
For , the index set is defined as
Then, the set is the set of data symbols that are read from row to recover data symbols of node using Class parity nodes.
An example of a Class code is shown in Fig. 2. One can verify that the code can correct any node failures. For each row , the set is indicated in red color. For instance, .
The main purpose of Class parity nodes is to provide good erasure correcting capability. However, the use of piggybacks helps also in reducing the number of symbols that need to be read to repair the symbols of a failed node that are repaired using the Class code, as compared to MDS codes. The remaining data symbols of the failed node can also be recovered from Class parity nodes, but at a high symbol read cost of . Hence, the idea is to add another class of parity nodes, namely Class parity nodes, in such a way that these symbols can be recovered with lower read cost.
IV Class Parity Nodes
Class parity nodes are obtained using an linear block code with over to encode the data symbols of the data array. This generates Class parity symbols, , , . In , we presented an algorithm to construct Class codes. In this section, we present a similar construction in a much more compact, mathematical manner.
IV-A Definitions and Preliminaries
For , the index set is defined as
Assume that data node fails. It is easy to see that the set is the set of data symbols that are not recovered using Class parity nodes.
For the example in Fig. 2, the set is indicated by hatched symbols for each column , . For instance, .
For later use, we also define the following set.
For , the index set is defined as
Note that .
For the example in Fig. 2, the set is indicated by hatched symbols for each row . For instance, , thus we have .
The purpose of Class parity nodes is to allow the recovery of the data symbols in , , at a low additional read cost. Note that after recovering symbols using Class parity nodes, the data symbols in the sets are already stored in the decoder memory. Therefore, they are accessible for the recovery of the remaining data symbols using Class parity nodes without the need of reading them again. The main idea is based on the following proposition.
If a Class parity symbol is the sum of one data symbol and a number of data symbols in , then the recovery of comes at the cost of one additional read (one should read parity symbol ).
This observation is used in the construction of Class parity nodes in Section IV-B below to reduce the normalized repair bandwidth . In particular, we add up to Class parity nodes, which reduce the additional read cost of all data symbols in all ’s to . (The addition of a single Class parity node allows one new data symbol in each , , to be recovered at the cost of one additional read.)
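The proposition can be illustrated with a small sketch. It assumes a field of characteristic 2, so that field addition is a bitwise XOR on integer-encoded symbols; this choice and the function names are assumptions for illustration, not part of the construction above. The parity symbol is the sum of one target data symbol and symbols that are already cached, so reading the parity alone suffices to recover the target.

```python
from functools import reduce
from operator import xor


def make_parity(target: int, cached: list) -> int:
    """Parity symbol: sum (XOR) of the target and the cached symbols."""
    return reduce(xor, cached, target)


def recover(parity: int, cached: list) -> int:
    """Recover the target from the parity and the cached symbols.

    Only the parity symbol has to be read; the cached symbols are
    already in memory, so the additional read cost is one.
    """
    return reduce(xor, cached, parity)
```

For instance, with `cached = [0x11, 0xC3]` and target `0x5A`, `recover(make_parity(0x5A, cached), cached)` returns `0x5A`.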
IV-B Construction of Class Nodes
For , each parity symbol in the -th Class parity node, , is sequentially constructed as
The construction above follows Proposition 1 wherein and . This ensures that the read cost of each of the symbols is . Thus, each added parity node gives data symbols a read cost of . Note that adding the second term in (4) ensures that data symbols are repaired by the Class parity nodes. The same principle was used in . It should be noted that the set of data symbols used in the construction of the parity symbols in (4) may differ from that of the construction in . However, the overall average repair bandwidth remains the same.
In the sequel, we will refer to the construction of Class parity nodes according to (4) as Construction .
With the aim to construct a code, consider the construction of an Class code where the Class code, with , is as shown in Fig. 2. For , the parity symbols in the first Class parity node (the -th node) are
The constructed parity symbols are as seen in Fig. 3, where the -th row in node contains the parity symbol . Notice that and . In a similar way, the parity symbols in nodes and are
Consider the repair of the first data node in Fig. 2. The symbol is reconstructed using . This requires reading the symbols , , , and . Since is a function of all data symbols in the first row, reading is sufficient for the recovery of . From Fig. 3, the symbols , , and can be recovered by reading just the parities , , and , respectively. Thus, reading symbols is sufficient to recover all the symbols in the node, and the normalized repair bandwidth is per failed symbol. A more formal repair procedure is presented in Section IV-C.
Adding Class parity nodes reduces the additional read cost of data symbols from each , , to . However, this comes at the cost of a reduction in the code rate, i.e., the storage overhead is increased. In the above example, adding Class parity nodes reduces the code rate from to . If a lower storage overhead is required, Class parity nodes can be punctured, starting from the last parity node (for the code in Example 4, nodes , , and can be punctured in this order), at the expense of an increased repair bandwidth. If all Class parity nodes are punctured, only Class parity nodes would remain, and the repair bandwidth is equal to that of the Class code. Thus, our code construction gives a family of rate-compatible codes which provides a tradeoff between repair bandwidth and storage overhead: adding more Class parity nodes reduces the repair bandwidth, but also increases the storage overhead.
IV-C Repair of a Single Data Node Failure: Decoding Schedule
The repair of a failed data node proceeds as follows. First, symbols are repaired using Class parity nodes. Then, the remaining symbols are repaired using Class parity nodes. With a slight abuse of language, we will refer to the repair of symbols using Class and Class parity nodes as the decoding of Class and Class codes, respectively.
We will need the following definition.
Consider a Class parity node and let denote the set of parity symbols in this node. Also, let for some and be the parity symbol , where , i.e., the parity symbol is the sum of and a subset of other data symbols. We define .
Suppose that node fails. Decoding is as follows.
Decoding the Class code. To reconstruct the failed data symbol in the -th row of the code array, symbols ( data symbols and ) in the -th row are read. These symbols are now cached in the memory. We then read the piggybacked symbols in the -th row. By construction (see (3)), this allows the repair of failed symbols, at the cost of one additional read each.
Decoding the Class code. Each remaining failed data symbol is obtained by reading a Class parity symbol whose corresponding set (see Definition 5) contains . In particular, if several Class parity symbols contain , we read the parity symbol with largest index . This yields the lowest additional read cost.
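The second decoding stage can be sketched as follows. The sketch starts from the symbols cached after the first stage and, for each remaining failed symbol, selects a usable parity symbol. Note that it picks the parity with the fewest uncached reads directly, instead of using the largest-index rule stated above; the data representation (`parity_support` as a mapping from a parity name to the set of data symbols it sums) and all names are hypothetical.

```python
def repair_schedule(failed, cached_after_first_stage, parity_support):
    """Plan the second decoding stage for one failed node.

    failed: data symbols of the failed node still to be repaired.
    cached_after_first_stage: symbols already read in the first stage.
    parity_support: maps a parity symbol name to the set of data
        symbols it sums (hypothetical representation).
    Returns (plan, total_reads); plan holds (symbol, parity, extra_reads).
    """
    cache = set(cached_after_first_stage)
    unrepaired = set(failed)
    total_reads = len(cache)
    plan = []
    for symbol in failed:
        best = None
        for parity, support in parity_support.items():
            # Usable only if it contains this symbol and no other lost one.
            if symbol not in support or (support & unrepaired) != {symbol}:
                continue
            extra = ({parity} | (support - {symbol})) - cache
            if best is None or len(extra) < len(best[1]):
                best = (parity, extra)
        if best is None:
            raise ValueError(f"no usable parity symbol for {symbol}")
        parity, extra = best
        plan.append((symbol, parity, len(extra)))
        cache |= extra | {symbol}  # repaired symbol is now known as well
        unrepaired.discard(symbol)
        total_reads += len(extra)
    return plan, total_reads
```

Dividing `total_reads` by the number of symbols in the failed node gives the average number of reads per repaired symbol for that schedule.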
V A Heuristic Construction of Class Nodes With Improved Repair Bandwidth
In this section, we provide a way to improve the repair bandwidth of the family of codes constructed so far. More specifically, we achieve this by providing a heuristic algorithm for the construction of the Class code, which improves Construction in Section IV for some values of and even values of .
The algorithm is based on a simple observation. Let and be two parity symbols constructed from data symbols in in two different ways as follows:
where (see Definition 3), , and (see Definition 4). Note that the only difference between the two parity symbols above is that does not involve (and that does not involve ). This has a major consequence in the repair of the data symbols and using and , respectively. Consider the repair using parity symbol . From Proposition 1, it is clear that the repair of symbol will have an additional read cost of , since the remaining data symbols are in . As the symbol and , from Proposition 1 and the fact that , we can repair with an additional read cost of . The remaining symbols each have an additional read cost of , whereas the symbols repaired using incur an additional read cost of for the symbol and for the remaining symbols. Clearly, we see that the combined additional read cost, i.e., the sum of the individual additional read costs for each data symbol using is lower (by ) than that using .
Owing to the way Class parity nodes are constructed and to the structure of the sets and , it can be seen that and when . From Construction of the Class code in Section IV, we observe that for odd and , the parity symbols in node are as in (5) for . Furthermore, for , the parity symbols in node have the structure in (6). On the other hand, for even and , the parity symbols in the node are as in (5) for . However, contrary to the case of odd, the parity symbols in the node follow (6). But since , we know that and . Thus, it is possible to construct some parity symbols in this node as in (5), and Construction of Class nodes in the previous section can be improved. However, the improvement comes at the expense of the loss of the mathematical structure of Class nodes given in (4).
Consider the code as shown in Fig. 4. Fig. 4(a) shows the node using Construction in Section IV, while Fig. 4(b) shows a different configuration of the node . Note that . Thus, each pair contains one symbol and . The node has parity symbols according to (6), while has two parity symbols as in (5) and two parity symbols according to (6). The configuration of the code arising from Construction has a normalized repair bandwidth of , while the code with node in Fig. 4(b) has a repair bandwidth of , i.e., an improvement is achieved.
In order to describe the modified code construction, we define the function as follows.
Consider the construction of the parity symbol as (see Definition 5). Then,
For a given data symbol , the function gives the additional number of symbols that need to be read to recover (considering the fact that some symbols are already cached in the memory). The set represents the set of data symbols that the parity symbol is a function of. We use the index set to represent the indices of such data symbols. We denote by , the index set corresponding to the -th parity symbol in the node (there are parity symbols in a parity node).
In the following, denote by a temporary matrix of read costs for the respective data symbols in . After Class decoding,
In Section V-A below, we will show that the construction of parities depends upon the entries of . To this extent, for some real matrix and index set , we define as the set of indices of matrix elements of from whose values are equal to the maximum of all entries in indexed by . More formally, is defined as
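The set described in words above (the indices, within a given index set, whose matrix entries equal the maximum over that index set) can be sketched as follows. The function name and the representation of the read-cost matrix as a list of rows are assumptions for illustration.

```python
def max_cost_indices(matrix, index_set):
    """Indices in index_set whose entries attain the maximum over index_set.

    matrix: list of rows (e.g., the temporary matrix of read costs).
    index_set: iterable of (row, column) index pairs.
    """
    indices = list(index_set)
    if not indices:
        return set()
    largest = max(matrix[i][j] for i, j in indices)
    return {(i, j) for i, j in indices if matrix[i][j] == largest}
```

For example, with `matrix = [[0, 3, 3], [1, 3, 2]]` and the index set `{(0, 0), (0, 1), (1, 1), (1, 2)}`, the result is `{(0, 1), (1, 1)}`; the entry at `(0, 2)` is also maximal but lies outside the index set.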
The heuristic algorithm to construct the Class code is given in Appendix B and we will refer to the construction of the Class code according to this algorithm as Construction . In the following subsection, we clarify the heuristic algorithm to construct the Class code with the help of a simple example.
V-A Construction Example
Let us consider the construction of a code using a Class code and a Class code. In total, there are three parity nodes; two Class parity nodes, denoted by and , respectively, and one Class parity node, denoted by , where the upper index is used to denote that the parity node is constructed using the heuristic algorithm. The parity symbols of the nodes are depicted in Fig. 4. Each parity symbol of the Class parity node is the sum of data symbols , constructed such that the read cost of each symbol is lower than as shown below.
Each parity symbol in this node is constructed using unique symbols as follows.
Select such that its read cost is maximum, i.e., . Choose , as . Note that we choose since .
Recursively construct the next parity symbol in the node as follows. Similar to Item 1.b, choose . Construct . Likewise, we have
Choose an element . In other words choose a symbol in which has not been used in and . We have . Construct . Thus, .
To construct the last parity symbol, we look for data symbols from the sets and . However, all symbols in have been used in the construction of previous parity symbols. Therefore, we cyclically shift to the next pair of sets . Following Items 1.e and 1.f, we have and .
Note that for all , thus this completes the construction of the code. The Class parity node constructed above is depicted in Fig. 4(b).
In general, the algorithm constructs parity nodes, , recursively. In the -th Class node, , each parity symbol is a sum of at most symbols . Each parity symbol , , in the -th Class parity node with is constructed recursively with recursion steps. In the first recursion step, each parity symbol is either equal to a single data symbol or a sum of data symbols. In the latter case, the first symbol is chosen as the symbol with the largest read cost . The second symbol is if such a symbol exists. Otherwise (i.e., if ), symbol is chosen. In the remaining recursion steps a subsequent data symbol (if it exists) is added to . Doing so ensures that symbols have a new read cost that is reduced to when parity symbols are used to recover them. Having obtained these parity symbols, the read costs of all data symbols in are updated and stored in . This process is repeated for successive parity nodes. If for the -th parity node, its parity symbols are equal to the data symbols whose read costs are the maximum possible.
In the above example, only a single recursion for the construction of is needed, where each parity symbol is a sum of two data symbols.
VI Code Characteristics and Comparison
| Fault Tolerance | Norm. Repair Bandwidth | Norm. Repair Complexity | Encoding Complexity |
In this section, we characterize different properties of the codes presented in Sections III-V. In particular, we focus on the fault tolerance, repair bandwidth, repair complexity, and encoding complexity. We consider the complexity of elementary arithmetic operations on field elements of size in , where for some prime number and positive integer . The complexity of addition is , while that of multiplication is , where the argument of denotes the number of elementary binary additions. (Footnote 3: It should be noted that the complexity of multiplication is quite pessimistic. However, for the sake of simplicity we assume it to be . When the field is there exist algorithms such as the Karatsuba-Ofman algorithm [26, 27] and the Fast Fourier Transform [28, 29, 30] that lower the complexity to and , respectively.)
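As a rough aid for this kind of operation counting, the sketch below converts numbers of field additions and multiplications into a count of elementary binary additions. The per-operation costs are assumptions, since the exact expressions are not preserved in this text: a field addition is charged `m` binary additions and a multiplication `m * m` (a schoolbook-style estimate), where `m` is a hypothetical name for the extension degree of the field.

```python
def operation_cost(num_additions: int, num_multiplications: int, m: int) -> int:
    """Estimated number of elementary binary additions.

    Assumed cost model (not taken from the text): one field addition
    costs m binary additions, one field multiplication costs m * m.
    """
    if m <= 0:
        raise ValueError("m must be a positive integer")
    return num_additions * m + num_multiplications * m * m
```

Under this assumed model, 3 additions and 2 multiplications with `m = 8` cost `3 * 8 + 2 * 64 = 152` binary additions.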
VI-A Code Rate
VI-B Fault Tolerance
The proposed codes have fault tolerance equal to that of the corresponding Class codes, which depends on the MDS code used in their construction and (see Theorem 1). Class nodes do not help in improving the fault tolerance. The reason is that improving the fault tolerance of the Class code requires the knowledge of the piggybacks that are strictly not in the set , while Class nodes can only be used to repair symbols in .
In the case where the Class code has parameters , the resulting code has fault tolerance for , i.e., one less than that of an MDS code.
VI-C Repair Bandwidth of Data Nodes
According to Section IV-C, to repair the first symbols in a failed node, data symbols and Class parity symbols are read. The remaining data symbols in the failed node are repaired by reading Class parity symbols.
The Class nodes are used to repair symbols with an additional read cost of ( per symbol). The remaining erased symbols are corrected using the -th Class node. The repair of one of the symbols entails an additional read cost of . On the other hand, since the parity symbols in the -th Class node are a function of symbols, the repair of the remaining symbols entails an additional read cost of at most each. In all, the erased symbols in the failed node have a total additional read cost of at most . The normalized repair bandwidth for the failed systematic node is therefore given as
Note that is a function of , and it follows from (8) and Section VI-B that when increases, the fault tolerance reduces while improves. Furthermore, as increases (and thereby as increases), decreases. This leads to a further reduction of the normalized repair bandwidth.
VI-D Repair Complexity of a Failed Data Node
Repairing the first symbol requires multiplications and additions. Repairing the following symbols requires an additional multiplications and additions. Thus, the complexity of repairing failed symbols is