The problem of private information retrieval (PIR), since its introduction , has attracted significant attention from researchers in the fields of theoretical computer science, cryptography, information theory, and coding theory. In the classical PIR model, a user wishes to retrieve one of the available messages, from non-communicating databases, each of which has a copy of these messages. User privacy needs to be preserved during message retrieval, which requires that the identity of the desired message not be revealed to any single database. To accomplish the task efficiently, good codes should be designed to download the least amount of data per-bit of desired message, the inverse of which is referred to as the capacity of the PIR system. This capacity problem in the classical setting was settled recently .
In practical systems, the databases may suffer from failures, and are also constrained on the storage space. Erasure codes can be used to improve both storage efficiency and failure resistance. This consideration motivated the investigation of PIR from MDS-coded databases [3, 4, 5, 6], with coding parameter , i.e., the messages can be recovered by accessing any databases. The capacity of PIR from MDS-coded databases (MDS-PIR) was characterized  as
In a given code, the smallest required number of symbols in each message is called the message size (sometimes also referred to as the sub-packetization factor), which is an important factor impacting the practicality and efficiency of the code. The capacity-achieving code given in  requires , which can be extremely large for a system with even a moderate number of messages. The problem of reducing the message size of capacity-achieving codes was recently considered by Xu and Zhang , and it was shown that under the assumption that all answers are of the same length, the message size must satisfy . These existing results may have left the impression that capacity-achieving codes would necessitate a message size exponential in the number of messages.
In this work, we show that the minimum message size for capacity-achieving PIR codes can in fact be significantly smaller than previously believed, by providing capacity-achieving linear codes with message size . Two linear code constructions, referred to as Construction-A and Construction-B, respectively, are given. The two constructions have the same download cost and message size, however Construction-B has a better upload cost (i.e., a lower communication cost for the user to send the queries), at the expense of being slightly more sophisticated than Construction-A. The key difference between the two proposed constructions and existing codes in the literature is that the proposed codes reduce the reliance on symmetry, and the answers may be of different lengths. We further show that this is in fact the minimum message size when , the proof of which requires a careful analysis of the converse proof of the information-theoretic MDS-PIR capacity. The code constructions and converse proof reflect a reverse engineering approach which further extends [7, 8], however the analysis technique here is considerably more involved due to the integer constraints. Finally, we show that, when is small, it is in fact possible to design codes with a message size even smaller than .
The rest of the paper is organized as follows. In Section 2, a formal problem definition is given. Construction-A and Construction-B are then given in Section 3 and Section 4, respectively, where the correctness and performance are also proved and analyzed. The optimality of message size is established by first identifying several critical properties of capacity-achieving codes in Section 5.1, then lower-bounding the minimum message size when in Section 5.2. A special code is given in Section 5.3 to show that when , the message size can be even lower than . Finally, Section 6 concludes the paper. Several technical proofs are relegated to the Appendix.
2 System Model
There are a total of mutually independent messages
in the system. Each message is uniformly distributed over, i.e., the set of length- sequences in the finite alphabet . All the messages can be collected and written as a single length-
row vector. Each message is MDS-coded and then distributed to databases, such that from any databases, the messages can be fully recovered. Since the messages are MDS-coded, it is without loss of generality to assume that for some integer .
When a user wishes to retrieve a particular message , queries are sent to the databases, where is the query for database-. The retrieval needs to be information theoretically private, i.e., any database is not able to infer any knowledge as to which message is being requested. For this purpose, a random key in the set is used together with the desired message index to generate the set of queries . Each query belongs to the set of allowed queries for database-, denoted as . After receiving query , database- responds with an answer . Each symbol in the answers also belongs to the finite field , and the answers may have multiple (and different numbers of) symbols. Using the answers from all databases, together with and , the user then reconstructs .
A more rigorous definition of the linear information retrieval process we consider in this work can be specified by a set of coding matrices and functions as follows. For notational simplicity, we denote the cardinality of a set as .
A linear private information retrieval code from linearly MDS-coded databases (a linear MDS-PIR code) consists of the following coding components:
A set of MDS encoding matrices:
where , is an matrix in the alphabet for encoding message , and each encodes the messages into the information to be stored at database-, denoted as ;
A set of MDS decoding recovery functions:
for each such that , whose outputs are denoted as ;
A query function
i.e., for retrieving message , the user sends the query to database-;
An answer length function
i.e., the length of the answer from each database, a non-negative integer, is a deterministic function of the query, but not the particular realization of the messages;
An answer generating matrix
i.e., the answer , when is the query received by database-;
A reconstruction function
i.e., after receiving the answers, the user reconstructs the message as .
These functions satisfy the following three requirements:
MDS recoverable: For any such that , we have .
Retrieval correctness: For any , we have
Privacy: For every , and ,
is in fact a random variable, sinceis the random key. It follows that even when the messages are viewed as deterministic, is still not deterministic. In contrast, for any specific query realization , the corresponding answer is deterministic when the messages are viewed as deterministic. The distinction between and is indicated by the bracket and the parenthesis .
In order to measure the performance of an MDS-PIR code, we consider the following two metrics, with the focus on minimizing the latter while keeping the former optimal:
The message size , which is the number of symbols to represent each individual message. This quantity should be minimized, because in practical applications, a smaller message size implies a more versatile code.
A third metric, the upload cost, is also of interest in practical systems (also particularly in computer science literature, e.g., ), although it is not our main focus in this work. The upload cost can be defined as
which is roughly the total number of bits that the user needs to send to the servers during the query phase.
We will need several more parameters before proceeding. Define , then
for some positive integers and , which are co-prime.
3 New MDS-PIR Code: Construction-A
In this section, we provide the first MDS-PIR code construction with message length , which we refer to as Construction-A.
3.1 The Coding Components of Construction-A
Each message can be divided into sub-messages, denoted as , and each sub-message contains symbols in the alphabet . The construction relies on two novel ingredients: a new indexing on the key (query) and the introduction of pseudo code symbols.
The first novel ingredient in the construction, which is different from previous ones in the literature, is the random key , which is a length- vector uniformly distributed in the set
where indicates modulo . In this code construction, we need to first choose a (in fact, any) linear -MDS code , in the alphabet as our base code. There are many known techniques to construct such codes, such as Reed-Solomon codes and Cauchy matrix based constructions; see . The coding functions can then be given as follows:
Each sub-message , and , is encoded by into coded symbols , with placed at database-, where is the -th column of the generator matrix of code operated on each sub-message, which produces the stored information at database-.
The decoding function is obvious which is naturally induced by that of .
For any , the query generating function produces a length- column vector as
Database- first produces a query matrix
where is the all-one column vector of length , and indicates matrix transpose; the element of on the -th row and -th column is denoted as . The query length function is then defined as:
where is the indicator function, i.e., is the number of columns in which has an element less than .
The second novel ingredient, which is different from previous ones in the literature, is the introduction of pseudo code symbols and pseudo message symbols in the sub-messages: for . For , an intermediate answer vector of length- is formed as
each component of which is the finite field addition of some components of the vector that are indicated by the corresponding column of . The eventual answer of length is formed by concatenating the components of which are not constantly zero, i.e., those corresponding to the positions indicated in (13).
For any , define the interference database set . The -th component of , , can be written as
where the length- row vector is defined as
Note that is not a function of , since unless . Thus as long as , the vector can be fully recovered by the MDS property of the code ; see Fig. 1 for an illustration. Further note that the -th component of for can be written as
from which, since is known, we can recover
Denote . As long as , we can recover the vector by again invoking the property of the MDS code .
3.2 An Example for Construction-A
Let us first consider an example , which induces in the code. The queries and the answers are shown in Table 1 with the answers indicated by the encoded symbols . We omit the index since here . Consider the case to retrieve message , and the key is . Then the queries are
The corresponding queries (and query matrices such induced) and answers are marked bold in Table 1. In these matrices, each column has at least one element being , and thus the total number of transmission symbols is . It is seen that from , the symbol can be recovered by the MDS property, and thus . Similarly, we can recover . Using both , we can then recover the original message by decoding the MDS code .
3.3 Correctness, Privacy, and Download Cost
According to the last coding component function (the reconstruction function) in Construction-A, the correctness of the proposed code hinges on two conditions: for all and for all . We establish these two conditions in the following lemma, whose proof can be found in Appendix A.
In Construction-A, for any request of message- and any random key ,
for any ;
for any .
We also have the following main theorem for Construction-A.
Codes obtained by Construction-A are both private and capacity-achieving.
The fact that the code is private is immediate, by observing that is uniformly distributed on the set
regardless of the value of .
The expected lengths of the answers is
assuming an arbitrary message
is being requested. The probabilities involved in the summanddepend on
By the definition of , it is clear that if any item in
is less than , then for all , which will induce transmitted symbols in the retrieval from all databases for ; this event occurs with probability . On the other hand, when the event does not occur, in the vector
the number of elements that are less than is , which induces symbols being transmitted. Therefore
from which it follows that the code is indeed capacity achieving, by taking into account (9). ∎
The following lemma is also immediate, and we state it as a lemma below.
The upload cost of Construction-A is .
4 New MDS-PIR Code: Construction-B
In this section, we provide an alternative code construction, namely Construction-B. This construction requires a lower upload cost than Construction-A, however, it relies on two different coding strategies for the two cases of high rate codes and low rate codes . The high rate code construction is essentially built on a product code, while the low rate codes bear more similarity to Construction-A.
4.1 Construction-B for
In this construction, the same random key as in Construction-A is used, and the MDS encoding matrices and decoding functions are also exactly the same as in Construction-A. We need a second generic - code in the alphabet in this construction. Construction-B essentially utilizes a product code with row code and column code . In this context, it is more convenient to view the message as being represented as an matrix, denoted as
Next we provide the coding components in Construction-B.
The query generating function at server- produces the following query vector
where , and it operates on a vector by operating on each component individually.
Define an query pattern matrix as
where the first row has the first elements being ’s and the rest being ’s, and the remaining rows are obtained by cyclically shifting the first elements in the first row but keeping the last in place. The query length function is then defined as
i.e., it is the number of columns in the matrix selected by the vector that have non-zero elements.
Recall that coded message at database- is a length- vector
In order to generate the answer, each vector is encoded by into a length- intermediate code vector
where is the generator matrix of the code . An intermediate answer vector is then produced
The eventual answer of length is formed by concatenating the components of which are not constantly zero, i.e., those indicated by (25).
For any , define the interference database set . For , the -th symbol in the intermediate answer is
where the matrix is defined as
Note that is not a function of , since unless . Thus as long as , the vector can be fully recovered by the MDS property of the code . Further note that for can be written as
from which, since is known, we can recover
Denote and , the latter of which is the set of databases that provide at least symbols of the requested messages in the form (30). For any , we can recover by invoking the property of code . Then as long as , can be fully recovered by invoking the property of code .
4.1.1 An Uncompressed Description of Construction-B
The description of the coding components above is in a compressed form, and offers little intuition. The following equivalent description, on the other hand, can provide better intuition at the expense of more redundant items. Let the extended pattern matrix of size be defined as
i.e., expanding the original pattern matrix by appending an all- matrix of size . The same query answer can now be equivalently produced at each server by using the following auxiliary query
i.e., without using the function mapping, and then following the same manner in answer generating, using the extended patter matrix . The stored contents of message across all the databases can be visualized as follows
where each column corresponds to a database, and the contents below the horizontal line are not stored physically, but can be generated as part of the answer computation.
It is straightforward to verify that with the auxiliary query