Private information retrieval (PIR) schemes allow a user to download records from a database without revealing any information about which records the user wants to retrieve. In the original PIR setting, the whole database is replicated among non-colluding nodes, which results in a high storage cost. This motivates the use of erasure codes, where each node stores only a fraction of the entire database; such schemes are called code-based PIR schemes.
In , Shah et al. present the first work on code-based PIR schemes, proving that only one extra bit of download is needed to retrieve the desired record, and they also provide another PIR scheme using the product-matrix minimum bandwidth regenerating (MBR) codes . Chan et al.  give retrieval schemes for a general class of linear storage codes and identify the relationship between storage and retrieval cost in the context of their proposed PIR schemes. Subsequently, Tajeddine and Rouayheb  design an explicit scheme using MDS codes that achieves the optimal curve of the trade-off in . Later, Kumar et al.  propose PIR schemes that use an arbitrary systematic linear storage code of rate . Interestingly, locally repairable codes (LRCs)  and Pyramid codes , which have more efficient repair properties, can be used to achieve the optimal scheme.
The classical PIR setting has been extended to many variations. One interesting scenario is when a user wants to retrieve more than one record. Clearly, the user can use a single-message scheme multiple times, but is there a more efficient way to do this? This is the multi-message PIR (MPIR) problem. In , Banawan and Ulukus consider the capacity problem, where capacity is defined as the maximum retrieval rate over all possible PIR schemes. They analyse the capacity of multi-message PIR schemes with replicated databases and give a capacity-achieving scheme when the number of desired records is at least half of the total number of records.
In this paper, we propose a general multi-message PIR model in which product-matrix regenerating codes are used for storage. The use of regenerating codes reduces the repair cost when a node failure occurs in the system, so our scheme offers more efficient repair than schemes using MDS codes. To the best of our knowledge,  and  are the only papers that use regenerating codes in their PIR schemes, and our work is the first to explore multi-message PIR with coded databases. Furthermore, we analyse the relationship between the costs of storage, retrieval and repair, and design explicit schemes that fit the optimal curve of the trade-off using the product-matrix MSR and MBR codes from .
The organisation of this paper is as follows. We recall the MSR and MBR codes, and the product-matrix constructions from  in Section 2. The system model of the multi-message PIR scheme using product-matrix regenerating codes is then given in Section 3. In Section 4, we obtain the decodability condition and analyse the trade-off between storage, retrieval, and repair costs in the system. Motivating examples and explicit constructions of our optimal MPIR schemes using MSR and MBR codes are presented in Section 5. We discuss our constructions in Section 6. Lastly, in Appendix A, we propose an alternative optimal MPIR scheme using product-matrix MSR codes with a different retrieval pattern. This scheme has slightly higher cPoP and lower storage overhead compared to the scheme using MSR codes in Section 5, and it turns out to be a generalisation of our single-message construction in .
2 Product-Matrix MSR and MBR Codes
2.1 MSR and MBR Codes
An regenerating code  is defined to be a distributed storage code storing the database of size among nodes where each node stores symbols satisfying two properties: (i)(recovery) The entire database can be recovered from the data stored in any nodes; (ii)(repair) If one of the storage nodes fails, then a newcomer node connects to some set of remaining nodes where , and downloads symbols from each of these nodes in order to regenerate symbols in such a way that we can perform (i) and (ii) again when another node failure occurs.
The total amount of symbols downloaded for regenerating is called the repair bandwidth, and typically the repair bandwidth is smaller than the size of the whole database. There are various repair models, but for PIR we focus on the exact repair model, where a newcomer node will regenerate the same data as was stored in the failed node in order to maintain the initial state of the storage nodes.
In , the parameters of a regenerating code are shown to necessarily satisfy
and the achievable trade-off between storage overhead and repair bandwidth is characterised by fixing the repair bandwidth, and then deriving the minimum which satisfies the above equation. Two interesting extremal points on the optimal trade-off curve are the minimum storage regeneration (MSR) point, which minimises storage overhead first and then minimises repair bandwidth, and the minimum bandwidth regeneration (MBR) point, which minimises in the reverse order. It can be shown that the MSR point is achieved by
and MSR codes are regenerating codes that satisfy the above equation. Also the MBR point is achieved by
and MBR codes are regenerating codes that satisfy the above equation.
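As a rough numerical illustration of the two extremal points, the sketch below evaluates the standard cut-set bound and the MSR and MBR operating points from the regenerating-codes literature. The symbol names `B` (database size), `k`, `d`, `alpha` (symbols per node) and `beta` (symbols downloaded per helper) are the conventional ones and are assumed here, as are the function names and the sample values.

```python
from fractions import Fraction

def cutset_capacity(k, d, alpha, beta):
    """Right-hand side of the cut-set bound: sum_{i=0}^{k-1} min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))

def msr_point(B, k, d):
    """(alpha, beta) at the MSR point: alpha = B/k, beta = B/(k(d-k+1))."""
    return Fraction(B, k), Fraction(B, k * (d - k + 1))

def mbr_point(B, k, d):
    """(alpha, beta) at the MBR point: alpha = 2Bd/(k(2d-k+1)), beta = 2B/(k(2d-k+1))."""
    denom = k * (2 * d - k + 1)
    return Fraction(2 * B * d, denom), Fraction(2 * B, denom)

# Illustrative parameters (assumptions, not taken from the paper's tables).
k, d = 3, 4
alpha_msr, beta_msr = msr_point(6, k, d)   # database of B = 6 symbols
alpha_mbr, beta_mbr = mbr_point(9, k, d)   # database of B = 9 symbols

print("MSR:", alpha_msr, beta_msr, cutset_capacity(k, d, alpha_msr, beta_msr))
print("MBR:", alpha_mbr, beta_mbr, cutset_capacity(k, d, alpha_mbr, beta_mbr))
```

In both cases the cut-set capacity equals the database size, so the bound is met with equality; the MSR point stores fewer symbols per node, while the MBR point has repair bandwidth `d * beta` equal to the node size `alpha`.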
2.2 The Product-Matrix MSR Codes ()
Under the product-matrix framework, each codeword is represented by an code matrix which is the product
of an encoding matrix and an message matrix . The message matrix contains the message symbols. In the code matrix , row consists of the encoded symbols stored by node for each .
In , Rashmi, Shah and Kumar gave an explicit construction for the MSR code with , so the parameters are where using the product-matrix framework. First, they let the encoding matrix be any matrix given by
where is an matrix and is an diagonal matrix such that (i) any rows of are linearly independent, (ii) any rows of are linearly independent, (iii) the diagonal elements of are all distinct. The rows of are denoted by . Next, the message matrix is defined as
where and are symmetric matrices constructed such that the entries in the upper-triangular part of each matrix are filled by distinct message symbols and the entries in the strictly lower-triangular part are chosen to make the matrices symmetric. This is the MSR code we will use in Section 5.1.
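The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the field GF(13), the choice of 6 nodes, the Vandermonde evaluation points, the message values, and all function and variable names are hypothetical. It encodes one record and then regenerates a failed node from `d` helpers, each sending a single symbol.

```python
P = 13                        # assumed prime field GF(13); a hypothetical choice
N, K = 6, 3                   # number of nodes and recovery threshold (illustrative)
D, ALPHA = 2 * K - 2, K - 1   # d = 2k - 2 and alpha = k - 1, with beta = 1

def solve(A, b):
    """Solve A x = b over GF(P) by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] % P)
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], -1, P)          # modular inverse (Python 3.8+)
        aug[c] = [v * inv % P for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(vr - f * vc) % P for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def symmetric(symbols):
    """Build an ALPHA x ALPHA symmetric matrix from its upper-triangular symbols."""
    S = [[0] * ALPHA for _ in range(ALPHA)]
    it = iter(symbols)
    for i in range(ALPHA):
        for j in range(i, ALPHA):
            S[i][j] = S[j][i] = next(it)
    return S

# Vandermonde encoding matrix Psi = [Phi | Lambda * Phi] with evaluation points 1..N.
xs = list(range(1, N + 1))
Psi = [[pow(x, e, P) for e in range(D)] for x in xs]
Phi = [row[:ALPHA] for row in Psi]
lam = [pow(x, ALPHA, P) for x in xs]
assert len(set(lam)) == N     # the diagonal entries of Lambda must be distinct

message = [3, 1, 4, 1, 5, 9]  # B = k(k-1) = 6 symbols (arbitrary illustrative values)
half = ALPHA * (ALPHA + 1) // 2
M = symmetric(message[:half]) + symmetric(message[half:])   # M = [S1; S2], d x alpha
C = [[sum(p * m for p, m in zip(psi, col)) % P for col in zip(*M)] for psi in Psi]

def repair(failed, helpers):
    """Regenerate row `failed` of C from one symbol sent by each of D helpers."""
    phi_f = Phi[failed]
    # Helper i sends the single symbol (psi_i^T M) phi_f.
    received = [sum(c * f for c, f in zip(C[i], phi_f)) % P for i in helpers]
    # Solve Psi_helpers * y = received for y = M phi_f = [S1 phi_f; S2 phi_f].
    y = solve([Psi[i] for i in helpers], received)
    # By symmetry of S1 and S2, psi_f^T M = (S1 phi_f)^T + lambda_f (S2 phi_f)^T.
    return [(y[j] + lam[failed] * y[ALPHA + j]) % P for j in range(ALPHA)]

regenerated = repair(failed=0, helpers=[1, 2, 3, 4])
print(regenerated == C[0])
```

The repair step relies only on the three properties listed above: any `d` rows of the encoding matrix are invertible, and the symmetry of the two message blocks lets the newcomer turn the recovered column vectors back into its own row.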
2.3 The Product-Matrix MBR Codes ()
Rashmi, Shah and Kumar also gave an explicit construction for the MBR code with parameters
where using the product-matrix framework in . First, the encoding matrix is an matrix given by
where is an matrix and is an matrix such that (i) any rows of are linearly independent, (ii) any rows of are linearly independent. The rows of are denoted by . Next, the message matrix is defined as
where is a matrix constructed such that the entries in its upper-triangular part are filled by distinct message symbols and the entries in its strictly lower-triangular part are chosen to make the matrix symmetric, and the matrix is filled by the remaining message symbols. This is the MBR code we will use in Section 5.2.
3 System Model
In this section, we formally present the storage model and its retrieval scheme. Suppose that there are non-communicating nodes in the system that store a database consisting of records, each of length , denoted by . Each record is encoded and distributed across nodes by the same product-matrix regenerating code with parameters which can be written as
where is the corresponding message matrix of . Write
and denote by the row of . Hence, we can see the entire system as
and each node stores symbols in total. We denote by the row of which is all symbols stored in node , and the row of which is all symbols of stored in node .
We assume that in the retrieval step the user wants to download records when , denoted by . The user submits a query matrix over to node . We can interpret rows of as subqueries, and for instance is set to be in our constructions. Finally, node computes and responds with an answer . The retrieval steps are as follows:
(Initialisation) The user generates an matrix whose elements are chosen independently and uniformly at random over . Let be row of .
(Query Generation) The query matrix is defined by binary matrices
In other words, is an matrix such that which is a coded data piece of a desired record stored in node . If the entry of is 1, then it implies that the entry is privately retrieved by the subquery of .
(Response Mappings) Each node returns .
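The three retrieval steps can be sketched for a single node as follows. This is a hedged illustration only: it assumes the query matrix is the sum of the random matrix and a binary selection matrix, and that a node answers with the product of its query matrix and its stored symbol vector. The field size, dimensions, and the names `U`, `V`, `make_query` and `respond` are hypothetical.

```python
import random

Q = 13                 # assumed prime field size (hypothetical)
M_RECORDS, ALPHA = 3, 2   # number of records and symbols per record per node (illustrative)
SUBQUERIES = 2            # number of rows of the query matrix (illustrative)
LENGTH = M_RECORDS * ALPHA

def make_query(U, V):
    """Query matrix for one node: random mask U plus binary selection matrix V, mod Q."""
    return [[(u + v) % Q for u, v in zip(u_row, v_row)] for u_row, v_row in zip(U, V)]

def respond(query, stored):
    """Node response: one inner product with the stored symbols per subquery."""
    return [sum(q * s for q, s in zip(row, stored)) % Q for row in query]

rng = random.Random(0)
# Initialisation: uniformly random matrix U shared by the queries to all nodes.
U = [[rng.randrange(Q) for _ in range(LENGTH)] for _ in range(SUBQUERIES)]
# Symbols held by this node, ordered record by record.
stored = [rng.randrange(Q) for _ in range(LENGTH)]

# Query generation: subquery 0 asks this node for symbol 1 of record 2;
# subquery 1 asks this node for nothing (all-zero row).
V = [[0] * LENGTH for _ in range(SUBQUERIES)]
target = 2 * ALPHA + 1
V[0][target] = 1

answer = respond(make_query(U, V), stored)
mask = respond(U, stored)   # random part; the user removes it using other nodes' answers
print((answer[0] - mask[0]) % Q == stored[target], (answer[1] - mask[1]) % Q == 0)
```

Because `U` is uniform, the matrix a node receives is uniformly distributed whatever `V` is, which is the intuition behind the privacy requirement. In the actual scheme the user never sees `mask` directly; it is eliminated by combining the answers from several nodes through the structure of the storage code.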
Let be the entropy function. An MPIR scheme is said to be a perfect information-theoretic PIR scheme if
(i)(privacy) for every ;
According to our definition, (i) implies that a node does not obtain any information about which records are being downloaded by the user, and (ii) ensures that the user can recover the desired records with no errors from all responses .
To measure the efficiency of the MPIR scheme, we use three metrics, namely Storage Overhead (SO), communication Price of Privacy (cPoP) and Repair Ratio (RR). First, SO is defined to be the ratio of the total storage used in the scheme to the total size of the whole database which is
in our model, and the cPoP is defined in  as the ratio of the total amount of downloaded data to the total size of all desired records which, in our model, is
Lastly, RR is defined in our paper  as the ratio of the total amount of symbols downloaded for repairing a failed node to the size of the failed node which is equal to
in our model.
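The three metrics are simple ratios, and a small sketch can make the definitions concrete. The function names and argument names below are hypothetical, and the sample values are partly read off and partly inferred from the example in Section 5.1 (10 nodes, recovery from 3 nodes, repair from 4 helpers sending one symbol each, 3 records, 2 desired records, 2 subqueries); the node size of 2 symbols and record size of 6 symbols are inferred from the MSR construction and should be treated as assumptions.

```python
from fractions import Fraction

def storage_overhead(n, m, alpha, B):
    """Total storage (n nodes, alpha symbols per record per node, m records) over database size m*B."""
    return Fraction(n * m * alpha, m * B)

def cpop(n, subqueries, p, B):
    """Total download (one symbol per subquery from each of n nodes) over the size p*B of desired records."""
    return Fraction(n * subqueries, p * B)

def repair_ratio(d, beta, alpha):
    """Symbols downloaded to repair one node (d helpers, beta symbols each) over the node size alpha."""
    return Fraction(d * beta, alpha)

# Assumed values for the illustrative MSR example.
so = storage_overhead(n=10, m=3, alpha=2, B=6)
cp = cpop(n=10, subqueries=2, p=2, B=6)
rr = repair_ratio(d=4, beta=1, alpha=2)
print(so, cp, rr)
```

Exact fractions are used so that the ratios can be compared without rounding when exploring the trade-off between the three costs.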
4 Decodability Condition and Trade-off Analysis
From the retrieval scheme, we can see that in fact, the response from node is
Then, the response in is where is the row of . Hence, records should be decoded by solving the system of linear equations
for all where the unknowns are
Consider first the unknowns . We can see that for each
where For the unknowns , we know that
Hence, the retrieval scheme is decodable if the following system of linear equations
has a unique solution, where the unknowns are
This condition is called the decodability condition.
Next, we will give the trade-off analysis between storage overhead and cPoP. First, we count the number of unknowns in the system of linear equations in the decodability condition which is equal to . Next, we count the number of linearly independent equations in the system. Consider
so we have, for each ,
Since is of rank , it has a parity check matrix of rank such that . So we have
This gives us linearly independent equations for each . Then, for
since any rows of would give us , the remaining rows must be expressible as linear combinations of those rows of . This gives us equations in . Hence, there are at most linearly independent equations in the system. If the retrieval scheme meets the decodability condition, then
which implies that
In terms of storage overhead and cPoP we have
This shows that there is a trade-off between cPoP and storage overhead, and in terms of repair ratio and cPoP we have
This shows that cPoP is bounded below by repair ratio.
5 Our constructions
5.1 An MPIR scheme using a product-matrix MSR code
In this construction, we use the product-matrix MSR code from  with
over the finite field , so the parameters of the MSR code are
We start with an example to motivate our scheme.
Suppose that we have 3 records over the finite field , each with size , which can be written as
We use a product-matrix MSR code over to encode each record by choosing the encoding matrix to be the Vandermonde matrix, and the message matrix for the record as described in Section 2.2:
Hence, each node stores
|node 1||node 2||node 3||node 4||node 5|
|node 6||node 7||node 8||node 9||node 10|
Recall that is the symbol of record , stored in node . Here the entire database can be recovered from the content of any 3 nodes, and if any one node fails, it can be repaired by downloading one symbol each from 4 of the remaining nodes.
In the retrieval step, suppose the user wants records and . The query is a matrix which we can interpret as subqueries submitted to node for each . To form the query matrices, the user generates a random matrix whose entries are chosen uniformly at random from . Recall that is a matrix which is part of the query submitted to node , attempting to retrieve information about record . Choose
The query matrices are . Then each node computes and returns the length-vector . Write . Recall that
Consider first subquery 1. We obtain
where , and is the first row of .
The user can solve for from as they form the equation
where the left matrix is the submatrix of which is invertible. Therefore, the user gets , , and for record 1 and , and for record 2. Similarly, from subquery 2, the user obtains , and for record 1 and , and for record 2. Hence, the user has all the symbols of which are stored in the node and all the symbols of which are stored in the node . From the property of regenerating codes, the user can reconstruct and as desired.
|node 1||node 2||node 3||node 4||node 5||node 6||node 7||node 8||node 9||node 10|
Now we give the general construction of our MPIR scheme and prove the decodability and privacy. Recall that we use the MSR code with parameters
over to store each record , which means that
where is the message matrix corresponding to as described in Section 2.2, so
Suppose that the user wants to retrieve records . In the retrieval step, the user sends a query matrix , which we can interpret as subqueries, to each node . To form the query matrices, the user generates a random matrix whose entries are chosen uniformly at random from . We choose, for ,
and for others which are not defined above, we choose As
For the rest, we have