Modern distributed storage systems have been transitioning to erasure-coding-based schemes with good storage efficiency in order to cope with the explosion in the amount of data stored online. Locally Repairable Codes (LRCs) have emerged as the codes of choice for many such scenarios and have been implemented in a number of large-scale systems, e.g., Microsoft Azure and Hadoop.
A block code is called a locally repairable code (LRC) with locality $r$ if every symbol in the encoding is a function of $r$ other symbols. This enables recovery of any single erased symbol in a local fashion by downloading at most $r$ other symbols. On the other hand, one would like the code to have a good minimum distance to enable recovery of many erasures in the worst case. LRCs have been the subject of extensive study in recent years [7, 6, 15, 17, 10, 12, 5, 14, 18, 19, 3]. LRCs offer a good balance: they allow very efficient erasure recovery in the typical case in distributed storage systems where a single node fails (or becomes temporarily unavailable due to maintenance or other causes), while still allowing recovery of the data from a larger number of erasures, thus safeguarding the data in more worst-case scenarios.
A Singleton-type bound for locally repairable codes relating the length $n$, dimension $k$, minimum distance $d$ and locality $r$ was first shown in a highly influential earlier work. It states that a linear locally repairable code must obey
$$d \le n - k - \left\lceil \frac{k}{r} \right\rceil + 2. \qquad (1)$$
(Footnote 1: The bound was shown even for a weaker requirement of locality only for the information symbols, but we focus on the more general all-symbol locality.)
Note that any linear code of dimension $k$ has locality at most $k$, so in the case when $r = k$ the above bound specializes to the classical Singleton bound $d \le n - k + 1$, and in general it quantifies how much one must back off from this bound to accommodate locality.
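As a quick numerical illustration of how locality costs distance, the following minimal sketch evaluates the Singleton-type bound in its standard form $d \le n - k - \lceil k/r \rceil + 2$ and compares it with the classical Singleton bound. The function names and the sample parameters are illustrative assumptions, not taken from the paper.

```python
def lrc_singleton_bound(n: int, k: int, r: int) -> int:
    """Largest distance allowed by d <= n - k - ceil(k/r) + 2."""
    if not (1 <= r <= k <= n):
        raise ValueError("need 1 <= r <= k <= n")
    ceil_k_over_r = -(-k // r)  # integer ceiling of k / r
    return n - k - ceil_k_over_r + 2


def classical_singleton_bound(n: int, k: int) -> int:
    """Largest distance allowed by the classical bound d <= n - k + 1."""
    return n - k + 1


if __name__ == "__main__":
    n, k = 12, 6  # hypothetical parameters chosen for illustration
    for r in (2, 3, 6):
        print(r, lrc_singleton_bound(n, k, r), classical_singleton_bound(n, k))
```

With $r = k$ the two bounds coincide, and smaller locality lowers the achievable distance.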
A linear LRC that meets the bound (1) with equality is said to be an optimal LRC. This work concerns the trade-off between alphabet size and code length for linear codes that are optimal LRCs. Initially, the existence of such optimal LRCs and constructions were only known over fields that were exponentially large in the block length [8, 17]. (Footnote 2: If locality is desired only for the information symbols, then it is easy to construct optimal LRCs over linear-sized fields using any MDS code via the “Pyramid” construction. As we said, our focus is on LRCs with all-symbol locality, which is more challenging to ensure.) In a celebrated paper, Tamo and Barg constructed clever subcodes of Reed-Solomon codes that yield a class of optimal locally repairable codes inheriting the field size of Reed-Solomon codes. This shows that one can have optimal LRCs with a field size similar to that of Maximum Distance Separable (MDS) codes, which attain the classical Singleton bound.
One is thus tempted to make an analogy between optimal LRCs and MDS codes. The famous MDS conjecture says that there are no non-trivial (meaning, distance $d > 2$) MDS codes with length exceeding $q + 1$, where $q$ is the alphabet size, except in two corner cases ($q$ even and $k = 3$, or $k = q - 1$) where the length is at most $q + 2$. This conjecture was famously resolved in the case when $q$ is prime by Ball.
For optimal LRCs, it was shown that an analogous strong conjecture does not hold  for almost every distance — using elliptic curves, they gave LRCs length (an earlier construction using rational function fields achieved length ). A construction of length was given for small distances in . Note that all these constructions have length that is at most .
The MDS conjecture makes a very precise statement about the maximum possible length of MDS codes. An asymptotic upper bound of $O(q)$ (in fact even ) is much easier to establish for MDS codes. Given this apparent parallel and the above-mentioned constructions, which do not achieve code lengths exceeding $O(q)$, one might wonder if the Tamo-Barg result is asymptotically optimal, in the sense that optimal LRCs must have length at most $O(q)$. Rather surprisingly, it was not even known if $n$ must be bounded as a function of $q$ at all; that is, it was conceivable that one could have arbitrarily long optimal LRCs over an alphabet of fixed size! Indeed, Barg et al. gave optimal LRCs using algebraic surfaces of length when the distance and . This then inspired the discovery of optimal LRCs with unbounded length for $d = 3, 4$ via cyclic codes. We also include a simple construction of arbitrarily long optimal LRCs for $d = 3, 4$ over any fixed field size that satisfies .
Our Results. Given this state of knowledge, the natural question that arises is whether there is any upper bound at all on the length of optimal locally repairable codes (as a function of the alphabet size). In this paper, we answer this question affirmatively. In fact, we show that as soon as the distance $d \ge 5$, one cannot have unbounded length optimal LRCs (unlike the cases of $d = 3, 4$). Below is a statement of our upper bound on the code length of optimal LRCs. To the best of our knowledge, this is the first upper bound on the length of optimal LRCs.
Theorem 1.1 (Upper bound on code length of LRCs).
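Let $d \ge 5$, and let $C$ be an optimal LRC with locality $r$ (that meets the bound (1) with equality) of length $n = \Omega(dr^2)$ over an alphabet of size $q$. Then $n = O(dq^3)$ when $d$ is not divisible by $4$, and $n = O(dq^{3 + \frac{4}{d-4}})$ when $4 \mid d$.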
Our actual upper bound is a bit better when $d \equiv 1 \pmod 4$ and in particular yields $n = O(q^2)$ when $d = 5$. The technical condition that $n$ is at least $\Omega(dr^2)$ arises in ensuring that the code consists of disjoint recovery groups of size $r+1$ each, which together ensure recoverability with locality $r$ for every codeword symbol. Meanwhile, we have to point out that our bound yields nothing when $d$ is proportional to $n$. For this setting, we show another bound on $d$, showing that $d$ cannot be too large for an LRC with small locality unless the alphabet size is large. This follows from a combination of a known puncturing argument and the Plotkin bound.
In our second result, we complement the above result on the limitation of LRCs with a construction of super-linear (in $q$) length for $d \le r + 2$.
Theorem 1.2 (Construction of long LRCs).
Again, to the best of our knowledge, this is the first code achieving super linear length in for . The previous best construction due to  achieved a length of for . We establish Theorem 1.2 via a greedy choice of the columns of the parity check matrix. Explicit constructions of such codes, as well as closing the gap between our upper and lower bounds on code length, are interesting questions for future work.
Organization of the paper. The paper is organized as follows. In Section 2, we provide some preliminaries on locally repairable codes. In Section 3, we prove our upper bounds on the length of optimal LRCs. In Section 4, we present the greedy construction of an optimal LRC with super-linear length in its alphabet size.
$[n]$ stands for $\{1, 2, \dots, n\}$. The floor function and ceiling function of $x$ are denoted by $\lfloor x \rfloor$ and $\lceil x \rceil$, respectively. An $[n, k, d]_q$ code is a linear code over the field $\mathbb{F}_q$ of size $q$ that has length $n$, dimension $k$, and distance $d$. We now define the local recoverability property of a code formally. We give this definition in general without assuming linearity, though we restrict our focus to linear codes in this paper.
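Let $C \subseteq \mathbb{F}_q^n$ be a $q$-ary block code of length $n$. For each $\alpha \in \mathbb{F}_q$ and $i \in [n]$, define $C(i, \alpha) = \{c \in C : c_i = \alpha\}$. For a subset $I \subseteq [n] \setminus \{i\}$, we denote by $C_I(i, \alpha)$ the projection of $C(i, \alpha)$ on $I$. For $i \in [n]$, a subset $R$ of $[n]$ that contains $i$ is called a recovery set for $i$ if $C_I(i, \alpha)$ and $C_I(i, \beta)$ are disjoint for any $\alpha \ne \beta$, where $I = R \setminus \{i\}$. Furthermore, $C$ is called a locally recoverable code with locality $r$ if, for every $i \in [n]$, there exists a recovery set $R_i$ for $i$ of size $r + 1$.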
The above definition of recovery sets is slightly different from the one given in the literature, where $i$ is excluded from the recovery set $R_i$. We include $i$ in the recovery set of $i$ for convenience in the proofs of this paper.
For linear codes, which are the focus of this paper, the following lemma establishes a connection between the locality and the dual code $C^\perp$. The proof is folklore, but we include it for the sake of completeness.
A subset $R$ of $[n]$ is a recovery set at $i$ of a linear code $C$ over $\mathbb{F}_q$ if and only if there exists a codeword in $C^\perp$ whose support contains $i$ and is a subset of $R$.
Let be a generator matrix of , where
are column vectors of length. Assume that is a recovery set at . We prove the claim by contradiction. Suppose that there exists such that contains no codeword whose support contains and is a subset of . This implies that is not a linear combination of . Thus, . If for all , then contains the zero vector for . This is a contradiction to the definition of recovery sets.
Now assume that not all are the zero vector. Partition into two disjoint sets and such that (i) the vectors are linearly independent; and (ii) all vectors in are linear combinations of . This implies that there exists a matrix of size such that, for every codeword , the projection of at is equal to . As is not a linear combination of , it follows that is not a linear combination of , i.e., and are linearly independent. Thus, the set is the entire space . This implies that, for any and , both the set and are equal to . Hence, . This is a contradiction to the definition of recovery sets.
The other direction is obvious by the definition. ∎
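The dual-codeword characterization of recovery sets can be tried out on a toy example. The sketch below is only an illustration under stated assumptions: it works over the binary field, uses a hypothetical parity-check matrix with two disjoint parity groups, indexes coordinates from 0, and uses brute-force enumeration that is only feasible for tiny codes.

```python
from itertools import product


def span_gf2(rows):
    """All GF(2) linear combinations of the given rows (the dual code)."""
    n = len(rows[0])
    words = set()
    for coeffs in product((0, 1), repeat=len(rows)):
        word = tuple(
            sum(c * row[j] for c, row in zip(coeffs, rows)) % 2 for j in range(n)
        )
        words.add(word)
    return words


def support(word):
    """Set of coordinates where the word is nonzero."""
    return frozenset(j for j, x in enumerate(word) if x)


def smallest_recovery_set(dual_words, i):
    """Smallest support of a dual codeword whose support contains i."""
    candidates = [support(h) for h in dual_words if h[i]]
    return min(candidates, key=len) if candidates else None


def recover(word, i, recovery_set):
    """Recover symbol i from the other symbols of its recovery set (binary case)."""
    return sum(word[j] for j in recovery_set if j != i) % 2


# Hypothetical toy parity-check matrix: two disjoint parity groups of size 3.
H = [(1, 1, 1, 0, 0, 0),
     (0, 0, 0, 1, 1, 1)]
dual = span_gf2(H)
code = [
    c for c in product((0, 1), repeat=6)
    if all(sum(h[j] * c[j] for j in range(6)) % 2 == 0 for h in H)
]
recovery = {i: smallest_recovery_set(dual, i) for i in range(6)}
```

Here every coordinate has a recovery set of size 3 (including the coordinate itself), so this toy code has locality 2 in the sense of the definition above.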
For a $q$-ary $[n, k, d]$-linear LRC with locality $r$, the Singleton-type bound says
$$d \le n - k - \left\lceil \frac{k}{r} \right\rceil + 2. \qquad (2)$$
Like the classical Singleton bound, the Singleton-type bound (2) does not take into account the cardinality $q$ of the code alphabet. Augmenting this result, a recent work established a bound on the distance of locally repairable codes that depends on $q$, sometimes yielding better results. However, in this paper, we specifically refer to a linear code achieving the bound (2) as an optimal LRC. We now rewrite this bound in a form that will be more convenient to us.
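Let $n, k, d, r$ be positive integers with $(r+1) \mid n$. If the Singleton-type bound (2) is achieved, then
$$n - k = \frac{n}{r+1} + d - 2 - \left\lfloor \frac{d-2}{r+1} \right\rfloor.$$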
Assume that the Singleton-type bound (2) is achieved, i.e., . Write for some integers and . Then . This gives . This also implies that is an integer. Therefore, we must have as . The desired results follows as
It turns out that the other direction of Lemma 2.2 is also true if .
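The rewritten form of the bound can be sanity-checked by brute force. The sketch below assumes the identity reads $n - k = \frac{n}{r+1} + d - 2 - \lfloor \frac{d-2}{r+1} \rfloor$, which is what substituting $d = n - k - \lceil k/r \rceil + 2$ gives when $(r+1) \mid n$. This specific form is a reconstruction and should be treated as an assumption. The helper names and parameter ranges are hypothetical.

```python
def optimal_distance(n: int, k: int, r: int) -> int:
    """Distance of a code meeting d = n - k - ceil(k/r) + 2 with equality."""
    return n - k - (-(-k // r)) + 2


def rewritten_redundancy(n: int, d: int, r: int) -> int:
    """Assumed rewritten form: n/(r+1) + d - 2 - floor((d-2)/(r+1))."""
    return n // (r + 1) + d - 2 - (d - 2) // (r + 1)


def check_identity(max_n: int = 60) -> int:
    """Check the identity for all small parameters with (r+1) | n."""
    checked = 0
    for r in range(1, 8):
        for n in range(r + 1, max_n + 1, r + 1):  # n divisible by r + 1
            for k in range(r, n):
                d = optimal_distance(n, k, r)
                if d < 1:
                    continue  # not a meaningful distance
                assert n - k == rewritten_redundancy(n, d, r), (n, k, r, d)
                checked += 1
    return checked
```

Calling `check_identity()` raises an `AssertionError` if any parameter set in the scanned range violates the assumed identity. It is a numerical check, not a proof.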
3 An Upper Bound on Code Lengths
In this section, we investigate upper bounds on the code length of optimal LRCs over a finite field $\mathbb{F}_q$. For simplicity, we assume that $n$ is divisible by $r+1$ throughout this section. However, in Remarks 3 and 4, we extend our results to cover the case when $n$ is not divisible by $r+1$. We end this section with another upper bound that handles the case where $d$ is proportional to $n$.
3.1 Justifying the assumption of disjoint recovery sets
We first argue that an $r$-local LRC with block length $n$ divisible by $r+1$ can be assumed, under modest conditions on the parameters, to contain disjoint recovery sets that each allow for recovery of $r+1$ codeword symbols. This structure will then be helpful to us in upper bounding the length of LRCs.
We remark that the structure theorem in  showed that the information symbols can be arranged into disjoint groups each with a local parity check, under the assumption that . However, we seek all-symbol’s locality, and their argument does not directly apply.
Let be an linear optimal LRC with locality . Then, there exist disjoint recovery sets, each of size provided that
Put . Then and . Lemma 2.2 implies that the parity-check matrix of code must has size . We now construct a parity-check matrix of as follows. First, we arbitrarily choose . By Lemma 2.1, there is a codeword of such that contains and has size at most . Put and choose . By Lemma 2.1 again, there is a codeword of such that contains and has size at most . Put and choose . Continue in this fashion to get codewords and for such that are pairwise distinct and . As are linearly independent, we have . It is clear that if and only if the sets are pairwise disjoint.
We claim that must be equal to . Then our desired result follows. Suppose that this claim is not true, i.e, . Assume that there are recovery sets with size less than . Since the union of is , we have , i.e., . Hence, the number of -sized recovery sets satisfies . Without loss of generality, we may assume that with are of size . We are going to show that there are at least pairwise disjoint sets, say , among such that for all and , . Since is , we have
If for some , we remove all the sets that contain this . Note that
This implies that we remove at most sets from . Let be the sets left after this operation. From our argument, we know that if for , then does not belong to any other set in . This implies our requirement that for all and ,. It remains to lower bound . Since we remove at most sets from , we have
Now let us construct a parity-check matrix as follows. We take the first rows to be (note that these vectors are linearly independent). Then extend arbitrarily to an parity-check matrix of size . We reconsider the submatrix of consisting of the first columns. Then it must have the following form
where , is a zero matrix and is a matrix. Note that and has minimum distance . This implies any columns of are linearly independent or equivialently the rank of is at least . However, the number of nonzero rows of is at most
The first inequality follows from the assumption . This contradiction concludes that and the desired result follows. ∎
A similar result still holds when $n$ is not divisible by $r+1$. In this case, the minimum number of recovery sets covering all $n$ indices becomes $\lceil n/(r+1) \rceil$. Let us start from a code meeting the Singleton-type bound,
It follows that
Since is an integer, we have
The fact that the number of recovery sets covering all indices is at least leads to . The rest of the proof is the same.
3.2 Proving the upper bound
In this subsection, we prove Theorem 1.1 (restated more formally below), which gives an upper bound on the length $n$ of an LRC in terms of its alphabet size $q$. The parity-check view of an LRC will be instrumental in our argument. We will make use of Lemma 3.1 and the classical Hamming upper bound on the size of codes as a function of minimum distance to derive our result.
Again we let with . By Theorem 3.1, we know that there exist codewords of such that the supports , each of size , are pairwise disjoint. Put . By considering an equivalent code, we may assume that for and the projection of at are equal to all-one vector of length .
The parity-check matrix has the following form
where is an matrix over . The submatrix consisting of the first rows of is a block diagonal matrix. Let be the -th column of , i.e.,
for some , where stands for transpose.
for and . We claim that any of are linearly independent. Indeed, for any vectors and scalars satisfying , i.e., , we have .
Note that together with are at most distinct columns of . It follows that they are linearly independent and thus the coefficient must be all zero.
Moreover, we note that the first components of are all zero for . We shorten the vector by puncturing its first coordinates. Denote by the shortened vectors. It is clear that any of are still linearly independent. Let be the matrix whose columns consists of for and and let be a linear code whose parity-check matrix is . Then is a linear code with length , dimension at least and distance at least . We now apply the Hamming bound to -linear code.
Let for some and .
Case 1. or . In this case, we have . Applying the Hamming bound to gives
i.e., . The last inequality follows from the fact that .
Case 2. or . In this case, we have . Deleting the first coordinate of gives a -ary -linear code. Applying the Hamming bound to gives
i.e., . In conclusion, we have
The desired result follows. ∎
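The role of the Hamming bound in the argument above can be illustrated numerically. The following sketch implements only the classical sphere-packing inequality $q^k \cdot V_q(n, t) \le q^n$ with $t = \lfloor (d-1)/2 \rfloor$, and searches for the largest length that the inequality permits for a fixed redundancy. The function names and the search routine are illustrative assumptions and not the paper's exact calculation.

```python
from math import comb


def hamming_ball_volume(n: int, t: int, q: int) -> int:
    """Number of words within Hamming distance t of a fixed word in F_q^n."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(t + 1))


def satisfies_hamming_bound(n: int, k: int, d: int, q: int) -> bool:
    """True if an [n, k, d]_q code is not ruled out by the Hamming bound."""
    t = (d - 1) // 2
    return q ** k * hamming_ball_volume(n, t, q) <= q ** n


def max_length(redundancy: int, d: int, q: int) -> int:
    """Largest n for which an [n, n - redundancy, d]_q code passes the bound."""
    t = (d - 1) // 2
    if t < 1:
        # For d <= 2 the ball has volume 1, so the bound never limits n.
        raise ValueError("the Hamming bound gives no length limit for d <= 2")
    n = redundancy
    while hamming_ball_volume(n + 1, t, q) <= q ** redundancy:
        n += 1
    return n
```

For redundancy 3 and distance 3 the permitted length is attained by the Hamming codes and grows quadratically in $q$. For distance at most 2 the inequality imposes no limit on the length at all.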
Let us extend this result to the case is not divisible by . From Remark 3, we obtain recovery sets covering all of the indices. There are at most indices that belong to more than of these recovery sets. We first build the parity-check matrix whose first rows are where corresponds to recovery set . Then, we remove the columns from whose indices belong to multiple recovery sets. After removing at most columns, we apply the same argument to the resulting matrix. It is thus clear that the same result also holds for the case is not divisible by , with a small adjustment of in the final upper bound on the code length.
From our proof of Theorem 3.2, one might see why our argument is not applicable to the optimal LRC with distance less than . In our argument, the optimal LRC of distance is reduced to a code of distance at least without locality. If , this reduced code might be the Hamming code whose code length is independent of the alphabet size. That explains the reason why our argument fails in this scenario. On the other hand, there indeed exists unbounded length of optimal LRCs of distance . Therefore, our argument reveals the inherent differences of optimal LRCs with distance less than and above.
The minimal distance of optimal LRCs is upper bounded by .
By Theorem in , the dimension , locality and minimal distance of optimal LRCs must obey
where is the largest dimension of a linear code in of distance , and is an arbitrary integer parameter, .
Pick to be so that . The Plotkin bound now gives . Set and (10) gives that . On the other hand, the Singleton-type bound says that
This implies that
Solving this inequality in gives us
The following Corollary is an immediate consequence.
Assume that and is a constant, then the length of optimal LRCs is upper bounded by .
4 Construction of LRCs of super-linear length
To the best of our knowledge, all known constructions of optimal LRCs have block length unless . Our upper bound in the preceding section implies that must be upper bounded by (roughly) . A natural question arises whether there exists optimal LRC with super linear length in , e.g, and some constant . In this section we answer this question affirmatively, showing such codes for all .
When and , the Singleton-type bound (1) can’t be met [6, Corollary 10]. In this case, by an optimal LRC we mean a code attaining the trade-off . When is not divisible by , by shortening the code, it is still possible to obtain the optimal LRCs. We leave this discussion to Corollary 4.3 and 4.5.
Before stating our main results, we notice a simple but useful fact, i.e., the locality and minimum distance of a linear code can be reflected by representing its generator matrix properly. Thus, it is sufficient to concentrate on the construction of generator matrix. As a warmup, let us begin with the generator matrix of optimal LRCs with minimum distance and .
Assume that , and , there exist optimal LRCs of arbitrarily lengths as long as .
For , and , the Singleton-type bound implies that . Since , we let be a Vandermonde matrix over such that
is a Vandermonde matrix. Define matrix
Where and are all- and all- vectors in , respectively. We partition the columns of into blocks such that if its -th component is non-zero. From the expression of matrix , it is clear that each column belongs to exactly one block and the columns in distinct blocks are linearly independent. Moreover, any columns in the same block are linearly independent due to the property of Vandermonde matrix . Next we show that any columns of are linearly independent. It suffices to verify this claim for the case . To see this, we pick any three columns from . Nothing needs to prove if these three columns belong to the same block. We assume that they belong to at least two blocks. Without loss of generality, is in a block that does not contain and . From above observation, we see that is linearly independent from and . It is clear and are linearly independent no matter whether they belong to the same block or different blocks. Thus, any columns of are linearly independent. Let be the linear code whose parity-check matrix is . It is clear that has length , dimension , distance and locality . The condition leads to and thus . The desired result follows since . ∎
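The block argument in this proof can be checked by brute force on a small instance. The sketch below rests on an assumed shape for the parity-check matrix, because the displayed matrix is not available here. The top rows mark the disjoint blocks of size $r+1$, and the bottom $d-2$ rows repeat, in every block, the powers $\alpha, \dots, \alpha^{d-2}$ of $r+1$ distinct field elements. It is restricted to prime fields so that plain modular arithmetic suffices, and all names are hypothetical.

```python
from itertools import combinations


def rank_mod_p(rows, p):
    """Rank of a matrix (given as a sequence of rows) over the prime field GF(p)."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(m)) if m[i][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)  # modular inverse (Python 3.8+)
        m[rank] = [(x * inv) % p for x in m[rank]]
        for i in range(len(m)):
            if i != rank and m[i][col]:
                f = m[i][col]
                m[i] = [(a - f * b) % p for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def build_parity_check(p, r, blocks, d):
    """Assumed warm-up matrix: block-indicator rows on top, Vandermonde rows below."""
    if d not in (3, 4):
        raise ValueError("this sketch only covers d = 3 and d = 4")
    if r + 1 > p:
        raise ValueError("need r + 1 distinct field elements, i.e. p >= r + 1")
    columns = []
    for b in range(blocks):
        for alpha in range(r + 1):  # r + 1 distinct elements of GF(p)
            top = [1 if i == b else 0 for i in range(blocks)]
            bottom = [pow(alpha, e, p) for e in range(1, d - 1)]
            columns.append(top + bottom)
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


def columns_independent(H, size, p):
    """True if every choice of `size` columns of H is linearly independent."""
    cols = list(zip(*H))
    return all(rank_mod_p(sub, p) == size for sub in combinations(cols, size))
```

For example, with $p = 5$, $r = 3$, three blocks and $d = 4$, the matrix has $n = 12$ columns and rank 5, so $k = 7$. Any three columns are independent, while some four columns inside one block are dependent, so the distance is exactly 4 and the Singleton-type bound is met.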
Next, we proceed to our main result of this section, the construction of optimal LRCs of super-linear length for . Like the case , it is sufficient to construct a generator matrix of these codes.
Assume and . There exist optimal LRCs of length . In particular, one obtains the best possible length for optimal LRC of minimum distance if and .
Let with some constant that only depends on and , i.e., . We will determine later. It suffices to construct a matrix and show that the code derived from this parity-check matrix is an optimal LRC. Label and order the coordinates with , i.e., precedes if or and . Let where . That means consists of the columns for . We start from and determine the value of column by column in the above order. In each step, we make sure that the new column together with any columns preceding the -th column are linearly independent. Meanwhile, the matrix holds the same form as the matrix in (8). (Footnote 4: Having the same form means that the distributions of non-zero entries in the upper half of the matrix (the part lying above ) are the same, i.e., an entry of value and represents a nonzero entry and zero, respectively.) If we can achieve both conditions, we are done. Define blocks such that . That means we partition the columns into disjoint blocks. The algorithm below gives the iterative method to compute the columns ’s.
Algorithm For , and , do the following operation. Find of form (9)555Only the -th component out of the first