I Introduction
Distributed Storage Systems (DSSs) have been deployed by various enterprises to reliably store massive amounts of data despite frequent storage node failures. A failed node is regenerated (repaired) by collecting information from other surviving nodes, with the regeneration process guided by a predefined network coding scheme. Under this setting, Dimakis et al. [dimakis2010network] obtained the expression for the maximum reliably storable file size, denoted as capacity , as a function of given system parameters: the node capacity and the bandwidth required for repairing a failed node. The capacity analysis in [dimakis2010network] underscores the following key messages. First, there exists a network coding scheme which utilizes the resources and enables reliable storage of a file of size . Second, it is not feasible to find a network coding scheme which can reliably store a file larger than , given the available resources of . In subsequent research efforts, the authors of [rashmi2009explicit, cadambe2013asymptotic, ernvall2014codes] proposed explicit network coding schemes which achieve the capacity of DSSs. These coding schemes are optimal in the sense of efficiently utilizing resources for maintaining reliable storage systems.
Several researchers have recently focused on the clustered nature of distributed storage [sohn2016capacity, sohn2017TIT, prakash2017storage, hu2017optimal]. According to these recent papers, storage nodes dispersed into multiple racks in real data centers are seen as forming clusters. In particular, the authors of the present paper proposed a system model for clustered DSSs in [sohn2016capacity] that reflects the difference between intra- and cross-cluster bandwidths. In the system model of [sohn2016capacity], the file to be stored is coded and distributed into storage nodes, which are evenly dispersed into clusters. Each node has storage capacity of , and the data collector contacts arbitrary out of existing nodes to retrieve the file. Since nodes are dispersed into multiple clusters, the regeneration process involves utilization of both intra- and cross-cluster repair bandwidths, denoted by and , respectively. In this proposed system model, the authors of [sohn2016capacity] obtained the closed-form expression for the maximum reliably storable file size, or capacity , of the clustered DSS. Furthermore, it has been shown that a network coding scheme exists that can achieve the capacity of clustered DSSs. However, explicit constructions of capacity-achieving network coding schemes for clustered DSSs have yet to be found.
This paper proposes a network coding scheme which achieves capacity of the clustered DSS, with a minimum required node storage overhead. In other words, the suggested code is shown to be a minimum-storage-regenerating (MSR) code of the clustered DSS. This paper focuses on two important cases of and , where represents the ratio of cross- to intra-cluster repair bandwidths. The former represents the system where cross-cluster communication is not possible. The latter corresponds to the minimum value that can achieve the minimum storage overhead of , where is the file size. When , it is shown that appropriate application of locally repairable codes suggested in [papailiopoulos2014locally, tamo2016optimal] achieves the MSR point for general settings with the application rule depending on the parameter setting. For the case, an explicit coding scheme is suggested which is proven to be an MSR code under the conditions of and . There have been some previous works [tebbi2014code, hu2017optimal, sahraei2017increasing, prakash2017storage] on code construction for DSS with clustered storage nodes, but to a limited extent. The works of [hu2017optimal, tebbi2014code] suggested coding schemes which can reduce the cross-cluster repair bandwidth, but these schemes are not proven to be MSR codes that achieve capacity of clustered DSSs with minimum storage overhead. The authors of [sahraei2017increasing] provided an explicit coding scheme which reduces the repair bandwidth of a clustered DSS under the condition that each failed node can be exactly regenerated by contacting any one of other clusters. However, the approach of [sahraei2017increasing] is different from that of the present paper in the sense that it does not consider the scenario with unequal intra- and cross-cluster repair bandwidths.
Moreover, the coding scheme proposed in [sahraei2017increasing] is shown to be a minimum-bandwidth-regenerating (MBR) code for some limited parameter settings, while the present paper deals with an MSR code. An MSR code for clustered DSSs has been suggested in [prakash2017storage], but its data retrieval condition differs from that of the present paper. The authors of [prakash2017storage] considered the scenario where data can be collected by contacting arbitrary out of clusters, while data can be retrieved by contacting arbitrary out of nodes in the present paper. Thus, the two models have the identical condition only when each cluster has one node. The difference in data retrieval conditions results in different capacity values and different MSR points. In short, the code in [prakash2017storage] and the code in this paper achieve different MSR points.
II Backgrounds and Notations
A given file of symbols is encoded and distributed into nodes, each of which has node capacity . The storage nodes are evenly distributed into clusters, so that each cluster contains nodes. A failed node is regenerated by obtaining information from other surviving nodes: nodes in the same cluster help by sending each, while nodes in other clusters help by sending each. Thus, repairing each node requires the overall repair bandwidth of
(1) 
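The bandwidth count described above can be sketched as follows. This is a minimal illustration, assuming that every surviving node participates in the repair, as in the model of [sohn2016capacity]; the function name `repair_bandwidth` and its parameter names are hypothetical and do not appear in the paper.

```python
def repair_bandwidth(n, num_clusters, beta_intra, beta_cross):
    """Total number of symbols downloaded to repair one failed node.

    Hypothetical helper: assumes all n - 1 surviving nodes help, with
    nodes in the failed node's cluster sending beta_intra symbols each
    and nodes in other clusters sending beta_cross symbols each.
    """
    if n % num_clusters != 0:
        raise ValueError("nodes must be evenly distributed into clusters")
    nodes_per_cluster = n // num_clusters
    intra_helpers = nodes_per_cluster - 1   # survivors in the same cluster
    cross_helpers = n - nodes_per_cluster   # nodes in all other clusters
    return intra_helpers * beta_intra + cross_helpers * beta_cross


# Six nodes in two clusters; cross-cluster links carry half the intra-cluster load.
print(repair_bandwidth(n=6, num_clusters=2, beta_intra=2, beta_cross=1))  # 2*2 + 3*1 = 7
```

Setting `beta_cross=0` in this sketch corresponds to the case with no cross-cluster communication, where only the nodes in the same cluster contribute to the repair.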
A data collector (DC) retrieves the original file by contacting arbitrary out of nodes; this property is called the maximum-distance-separable (MDS) property. The clustered distributed storage system with parameters is called an clustered DSS. In an clustered DSS with given parameters of , capacity is defined in [sohn2016capacity] as the maximum amount of data that can be reliably stored. The closed-form expression for is obtained in Theorem 1 of [sohn2016capacity]. Aiming at reliably storing file , the set of pair values is said to be feasible if holds. According to Corollaries 1 and 2 of [sohn2017TIT], the set of feasible points shows the optimal trade-off relationship between and , as illustrated in Fig. 1. In the optimal trade-off curve, the point with minimum node capacity is called the minimum-storage-regenerating (MSR) point. Explicit regenerating codes that achieve the MSR point are called MSR codes. According to Theorem 3 of [sohn2017TIT], node capacity of the MSR point satisfies
(2)  
(3) 
Note that is the minimum storage overhead to satisfy the MDS property, as stated in [dimakis2010network]. Thus, is the scenario with minimum cross-cluster communication when the minimum storage overhead constraint is imposed.
Here we introduce some useful notations used in the paper. For a positive integer , represents the set . For natural numbers and , we use the notation if divides . Similarly, write if does not divide . For given and , we define
(4)  
(5) 
For vectors we use boldfaced lower case letters. For a given vector , the transpose of is denoted as . For natural numbers and , the set is represented as . For a matrix , the entry of at the row and column is denoted as . We also express the nodes in a clustered DSS using a two-dimensional representation: in the structure illustrated in Fig. 2, represents the node at the row and the column. Finally, we recall definitions on the locally repairable codes (LRCs) in [papailiopoulos2014locally, tamo2016optimal]. As defined in [tamo2016optimal], an LRC represents a code of length , which is encoded from information symbols. Every coded symbol of the LRC can be regenerated by accessing at most other symbols. As defined in [papailiopoulos2014locally], an LRC takes a file of size and encodes it into coded symbols, where each symbol is composed of bits. Moreover, any coded symbol can be regenerated by contacting at most other symbols, and the code has the minimum distance of .
III MSR Code Design for
In this section, MSR codes for (i.e., ) are designed. Under this setting, no cross-cluster communication is allowed in the node repair process. First, the system parameters for the MSR point are examined. Second, two types of locally repairable codes (LRCs) suggested in [papailiopoulos2014locally, tamo2016optimal] are proven to achieve the MSR point, under the settings of and , respectively.
III-A Parameter Setting for the MSR Point
We consider the MSR point which can reliably store file . The following property specifies the system parameters for the case.
Proposition 1.
Consider an [n,k,L] clustered DSS to reliably store file . The MSR point for is
(6) 
where is defined in (4). This point satisfies .
Proof.
See Appendix D-A. ∎
III-B Code Construction for
We now examine how to construct an MSR code for the case. The following theorem shows that a locally repairable code constructed in [papailiopoulos2014locally] with locality is a valid MSR code for .
Theorem 1 (Exact-repair MSR Code Construction for )
Let be the LRC explicitly constructed in [papailiopoulos2014locally] for locality . Consider allocating coded symbols of in a clustered DSS, where nodes within the same repair group of are located in the same cluster. Then, the code is an MSR code for the clustered DSS under the conditions of and .
Proof.
See Appendix A. ∎
Fig. 3 illustrates an example of the MSR code for the and case, which is constructed using the LRC in [papailiopoulos2014locally]. In the clustered DSS scenario, the parameters are set to
Thus, each storage node contains symbols, while the clustered DSS aims to reliably store a file of size . This code has two properties, 1) exact regeneration and 2) data reconstruction:

Any failed node can be exactly regenerated by contacting nodes in the same cluster,

Contacting any nodes can recover the original file of size .
The first property is obtained from the fact that and form an MDS code for . The second property is obtained as follows. For contacting arbitrary nodes, three distinct coded symbols having superscript one and three distinct coded symbols having superscript two can be obtained for some and . From Fig. 3(a), the information suffices to recover . Similarly, the information suffices to recover . This completes the proof for the second property. Note that this coding scheme was already suggested by the authors of [papailiopoulos2014locally], while the present paper proves that this code also achieves the MSR point of the clustered DSS, in the case of and .
III-C Code Construction for
Here we construct an MSR code when the given system parameters satisfy . The theorem below shows that the optimal LRC designed in [tamo2016optimal] is a valid MSR code when holds.
Theorem 2 (Exact-repair MSR Code Construction for )
Let be the LRC constructed in [tamo2016optimal] for and . Consider allocating the coded symbols of in a clustered DSS, where nodes within the same repair group of are located in the same cluster. Then, is an MSR code for the clustered DSS under the conditions of and .
Proof.
See Appendix B. ∎
Fig. 4 illustrates an example of code construction for the case. Without loss of generality, we consider the case; parallel application of this code multiple times achieves the MSR point for general , where is the set of positive integers. In the clustered DSS with , the code and system parameters are:
from Proposition 1. The code in Fig. 4 satisfies the exact regeneration and data reconstruction properties:

Any failed node can be exactly regenerated by contacting nodes in the same cluster,

Contacting any nodes can recover the original file of size .
Note that in Fig. 4 is a set of coded symbols generated by an MDS code, and this statement also holds for . This proves the first property. The second property follows directly from the result of [tamo2016optimal], which states that the minimum distance of the is
(7) 
Note that the was already suggested by the authors of [tamo2016optimal], while the present paper proves that applying this code with achieves the MSR point of the clustered DSS, in the case of and .
IV MSR Code Design for
We propose an MSR code for in clustered DSSs. From (2) and (3), recall that is the minimum value which allows the minimum storage of . First, we obtain the system parameters for the MSR point. Second, we design a coding scheme which is shown to be an MSR code under the conditions of and .
IV-A Parameter Setting for the MSR Point
The following property specifies the system parameters for the case. Without loss of generality, we set the cross-cluster repair bandwidth as .
Proposition 2.
The MSR point for is
(8) 
This point satisfies and .
Proof.
See Appendix D-B. ∎
IV-B Code Construction for
Here, we construct an MSR code under the constraints of and . Since we consider the case, the system parameters in Proposition 2 are set to
(9)  
Construction 1.
Suppose that we are given source symbols . Moreover, let the encoding matrix
(10) 
be a matrix, where each encoding submatrix is a matrix. For , node stores and node stores , where
(11)  
(12) 
Remark 1.
The code generated in Construction 1 satisfies the following:


Every node in cluster contains message symbols.

Every node in cluster contains parity symbols.
Note that this remark is consistent with (9), which states . Under this construction, we have the following theorem, which specifies the MSR construction rule for the DSS with .
Theorem 3 (Exact-repair MSR Code Construction for )
If all square submatrices of are invertible, the code designed by Construction 1 is an MSR code for DSS with .
Proof.
See Appendix C. ∎
The following result suggests an explicit construction of an MSR code using the finite field.
Corollary 1.
Applying Construction 1 with encoding matrix set to the Cauchy matrix [10.2307/j.ctt7t833] achieves the MSR point for an DSS. A finite field of size suffices to design .
Proof.
The proof follows directly from Theorem 3 and the fact that all square submatrices of a Cauchy matrix have full rank, as stated in [shah2010explicit]. Moreover, the Cauchy matrix of size can be constructed using a finite field of size , according to [suh2011exact]. ∎
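The submatrix property used in this proof can be checked by brute force on a small instance. The sketch below is an illustration only: it works in the prime field GF(13) rather than the extension field used in the paper's example, and the helper names (`cauchy_matrix`, `det_mod_p`, `all_square_submatrices_invertible`) are hypothetical. It uses the standard Cauchy definition, with entry 1/(x_i - y_j) for pairwise distinct field elements.

```python
from itertools import combinations

P = 13  # small prime chosen for illustration; arithmetic is in GF(13)


def cauchy_matrix(xs, ys, p=P):
    """Return C with C[i][j] = 1 / (xs[i] - ys[j]) in GF(p)."""
    # All x's and y's must be pairwise distinct field elements.
    assert len(set(xs) | set(ys)) == len(xs) + len(ys)
    return [[pow((x - y) % p, -1, p) for y in ys] for x in xs]


def det_mod_p(m, p=P):
    """Determinant over GF(p), computed by Gaussian elimination."""
    m = [row[:] for row in m]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % p), None)
        if pivot is None:
            return 0  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            m[r] = [(a - factor * b) % p for a, b in zip(m[r], m[col])]
    return det % p


def all_square_submatrices_invertible(m, p=P):
    """Check every square submatrix of m for a nonzero determinant."""
    rows, cols = len(m), len(m[0])
    for size in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), size):
            for cs in combinations(range(cols), size):
                sub = [[m[r][c] for c in cs] for r in rs]
                if det_mod_p(sub, p) == 0:
                    return False
    return True


C = cauchy_matrix(xs=[0, 1, 2, 3], ys=[4, 5, 6, 7])
print(all_square_submatrices_invertible(C))  # True
```

A Cauchy matrix with r rows and c columns needs r + c pairwise distinct field elements, which is why the required field size grows with the matrix dimensions.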
An example of MSR code designed by Construction 1 is illustrated in Fig. 5, in the case of . This coding scheme utilizes a Cauchy matrix
(13) 
using the finite field with the primitive polynomial . The element in is denoted by the decimal number of , where is the primitive element. For example, is denoted by in the generator matrix . When , the system parameters are
from Proposition 2, which holds for the example in Fig. 5. Here we show that the proposed coding scheme satisfies two properties: 1) exact regeneration of any failed node and 2) recovery of message symbols by contacting any nodes.
1) Exact regeneration: Fig. 6 illustrates the regeneration process. Suppose that node containing the message fails. Then, node transmits symbols, and . Nodes and transmit symbol each, for example and , respectively. Then, from the received symbols of and matrix , we obtain
Thus, the contents of the failed node can be regenerated by
where the matrix inversion is over . Note that the exact regeneration property holds irrespective of the contents transmitted by and , since the encoding matrix is a Cauchy matrix, all submatrices of which are invertible.
2) Data recovery: First, if DC contacts two systematic nodes, the proof is trivial. Second, contacting two parity nodes can recover the original message since is invertible. Third, suppose that DC contacts one systematic node and one parity node, for example, and . Then, DC can retrieve message symbols and parity symbols . Using the retrieved symbols and the information on the encoding matrix , DC additionally obtains
Thus, DC obtains
which completes the data recovery property of the suggested code.
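The data recovery argument above can be mimicked on a scalar toy code. The sketch below is not Construction 1: each node stores a single symbol, the field is the prime field GF(13) instead of the extension field of the example, and all names (`cauchy_matrix`, `solve_mod_p`, `recover`) are hypothetical. It only illustrates that a systematic code whose parity part is a Cauchy matrix lets a data collector recover the message from any mix of systematic and parity symbols.

```python
from itertools import combinations

P = 13  # prime field GF(13), an assumption made for illustration
K = 3   # number of message symbols


def cauchy_matrix(xs, ys, p=P):
    """Return C with C[i][j] = 1 / (xs[i] - ys[j]) in GF(p)."""
    return [[pow((x - y) % p, -1, p) for y in ys] for x in xs]


def solve_mod_p(a, b, p=P):
    """Solve a * x = b over GF(p) by Gauss-Jordan elimination.

    The matrix a must be square and invertible; a singular matrix
    makes the pivot search raise StopIteration.
    """
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % p)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(v - f * w) % p for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Generator rows: K systematic rows (identity) followed by K Cauchy parity rows.
identity = [[int(i == j) for j in range(K)] for i in range(K)]
G = identity + cauchy_matrix(xs=[0, 1, 2], ys=[3, 4, 5])

message = [7, 2, 11]
codeword = [sum(g * m for g, m in zip(row, message)) % P for row in G]


def recover(indices):
    """Recover the message from the coded symbols at the given positions."""
    return solve_mod_p([G[i] for i in indices], [codeword[i] for i in indices])


# Any K of the 2K coded symbols suffice, whatever the systematic/parity mix.
print(all(recover(idx) == message for idx in combinations(range(2 * K), K)))  # True
```

The mixed case (some systematic, some parity symbols) works because subtracting the known message symbols leaves a square submatrix of the Cauchy matrix, which is invertible.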
V Conclusion
A class of MSR codes for clustered distributed storage modeled in [sohn2016capacity] has been constructed. The proposed coding schemes can be applied in practical data centers with multiple racks, where the available cross-rack bandwidth is limited compared to the intra-rack bandwidth. Two important cases of and are considered, where represents the ratio of available cross- to intra-cluster repair bandwidth. Under the constraint of zero cross-cluster repair bandwidth (), appropriate application of two locally repairable codes suggested in [papailiopoulos2014locally, tamo2016optimal] is shown to achieve the MSR point of clustered distributed storage. Moreover, an explicit MSR coding scheme is suggested for , when the system parameters satisfy and . The proposed coding scheme can be implemented in a finite field, by using a Cauchy generator matrix.
Appendix A Proof of Theorem 1
We focus on code , the explicit ()LRC constructed in Section V of [papailiopoulos2014locally]. This code has the parameters
(A.1) 
where is the repair locality and is the minimum distance, and other parameters () have physical meanings identical to those in the present paper. By setting , the code has node capacity of
(A.2) 
where the last equality holds from the condition and the definition of in (4).
We first prove that any failed node can be exactly regenerated by using the system parameters in (6). According to the description in Section V-B of [papailiopoulos2014locally], any node is contained in a unique corresponding repair group of size , so that a failed node can be exactly repaired by contacting other nodes in the same repair group. This implies that a failed node does not need to contact other repair groups in the exact regeneration process. By setting each repair group as a cluster (note that each cluster contains nodes), we can achieve
(A.3) 
Moreover, Section V-B of [papailiopoulos2014locally] illustrates that the exact regeneration of a failed node is possible by contacting all symbols contained in nodes in the same repair group, and applying the XOR operation. This implies , which results in
(A.4) 
combined with (1) and (A.2). From (A.2) and (A.4), we can conclude that code satisfies the exact regeneration of any failed node using the parameters in (6).
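The idea of repairing a node from its own repair group by XOR, with no traffic from other clusters, can be illustrated with a toy example. The sketch below is not the LRC of [papailiopoulos2014locally]; it is a single-parity group, assumed here purely for illustration, in which one cluster holds two data blocks and their XOR. The names `xor_bytes` and `repair` are hypothetical.

```python
from functools import reduce


def xor_bytes(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def repair(cluster, failed):
    """Rebuild the block of node `failed` from the other nodes in its cluster."""
    survivors = [block for i, block in enumerate(cluster) if i != failed]
    return xor_bytes(survivors)


# One cluster (repair group): two data blocks plus their XOR parity.
data = [b"abcd", b"efgh"]
cluster = data + [xor_bytes(data)]

# Every node, data or parity, is recovered from the two survivors alone.
print(all(repair(cluster, i) == cluster[i] for i in range(len(cluster))))  # True
```

In this toy group the repair reads every symbol held by the surviving nodes of the cluster, which mirrors the full-download, intra-cluster-only repair described in the proof.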
Appendix B Proof of Theorem 2
We first prove that the code has minimum distance of , which implies that the original file of size can be recovered by contacting arbitrary nodes. Second, we prove that any failed node can be exactly regenerated under the setting of (6). Recall that the LRC constructed in [tamo2016optimal] has the following property, as stated in Theorem 1 of [tamo2016optimal]:
Lemma 1 (Theorem 1 of [tamo2016optimal]).
The code constructed in [tamo2016optimal] has locality and optimal minimum distance , when .
Note that we consider code of optimal LRC. Since divides , Lemma 1 can be applied. The result of Lemma 1 implies that the minimum distance of is
(B.1) 
Since we consider the case, we have
(B.2) 
from (5). Inserting (B.2) into (B.1), we have
(B.3) 
where the second last equality holds since from (B.2). Thus, this proves that contacting arbitrary nodes suffices to recover the original source file.
Now, all we need to prove is that any failed node can be exactly regenerated under the setting of system parameters specified in Proposition 1. According to the rule illustrated in [tamo2016optimal], the construction of code can be shown as in Fig. 7. First, we have source symbols to store reliably. By applying a Reed-Solomon code to the source symbols, we obtain where . Then, we partition symbols into groups, where each group contains symbols. Next, each group of symbols is encoded by an MDS code, which results in a group of symbols of . Finally, we store symbol in node . By this allocation rule, symbols in the same group are located in the same cluster.
Assume that , the node at cluster, containing symbol fails for and . From Fig. 7, we know that symbols of stored in cluster can decode the MDS code for group . Thus, the contents of can be recovered by retrieving symbols from nodes in the cluster (i.e., the same cluster where the failed node is in). This proves the ability of exactly regenerating an arbitrary failed node. The regeneration process satisfies
(B.4) 
Moreover, note that the code in Fig. 7 has
(B.5) 
source symbols. Since parameters obtained in (B.4) and (B.5) are consistent with Proposition 1, we can confirm that code is a valid MSR point under the conditions and .
Appendix C Proof of Theorem 3
Recall that the code designed by Construction 1 allocates systematic nodes at cluster and parity nodes at cluster, as illustrated in Fig. 8. Moreover, recall that the system parameters for DSS with are
(C.1) 
from Proposition 2 and the definition of . First, we show that exact regeneration of systematic nodes (in the first cluster) is possible using in the DSS with Construction 1. We use the concept of the projection vector to illustrate the repair process. For , let be the projection vector assigned for , in repairing . Similarly, let be the projection vector assigned for , in repairing . Assume that the node containing fails. Then, node transmits symbols , while node transmits symbol . For simplicity, we set and , where is the dimensional standard basis vector with a in the coordinate and elsewhere. This means that node transmits symbols it contains, while transmits the last symbol it contains, i.e., the symbol . Thus, the newcomer node for regenerating systematic node obtains the following information
(C.2) 
We now show how the newcomer node regenerates using information . Recall that the parity symbols and message symbols are related as in the following equations:
(C.3) 
obtained from (10) and (12). Among these parity symbols, parity symbols received by the newcomer node can be expressed as
(C.4) 
where the matrix in (C.4) is generated by removing rows from . Since we are aware of message symbols of and the entries of matrix, subtracting the constant known values from (C.4) results in
(C.5) 
where
(C.6) 
for . Note that the matrix in (C.5) can be obtained by removing columns from the matrix in (C.4). Since every square submatrix of is invertible, we can obtain , which completes the proof for exactly regenerating the failed systematic node.
Second, we prove that exact regeneration of the parity nodes (in the second cluster) is possible. Let be the projection vector assigned for in repairing . Similarly, let be the projection vector assigned for in repairing . Assume that the parity node fails, which contains . Then, node transmits symbols , while node transmits symbol . For simplicity, we set and . This means that node transmits symbols it contains, while transmits the last symbol it contains, i.e., the symbol . Thus, the newcomer node for regenerating parity node obtains the following information
(C.7) 
We show how the newcomer node regenerates using the information . Among parity symbols in (C.3), parity symbols received by the newcomer node can be expressed as
(C.8) 
where is defined in Construction 1. Note that is a matrix, which is generated by removing rows from , for . Since we know the values of message symbols and the entries of matrix, subtracting constant known values from (C.8) results in
(C.9) 
where is generated by removing columns from for . Similarly, is generated by removing rows from for . Thus, is an invertible matrix, so that we can obtain , which contains
(C.10) 
Since contains every message symbol , we can regenerate using (C.3). This completes the proof for exactly regenerating the failed parity node.
Finally, we prove that message symbols can be obtained by contacting arbitrary nodes. In this proof, we use a slightly modified notation for representing message and parity symbols. For , the message symbol and the parity symbol are denoted as and , respectively. Then, (C.3) is expressed as
(C.11) 
Suppose that the data collector (DC) contacts nodes from the cluster, and nodes from the cluster, for . Then, DC obtains parity symbols and message symbols. Since there is a total of message symbols, the number of message symbols that DC cannot obtain is . Let the parity symbols obtained by DC be , and the message symbols not obtained by DC be . Then, the known parities can be expressed as
(C.12) 
where is a matrix obtained by taking rows from , for