Private Function Retrieval

11/13/2017 ∙ by Mahtab Mirmohseni, et al. ∙ Sharif Accelerator

The widespread use of cloud computing services raises the question of how one can delegate processing tasks to untrusted distributed parties without breaching the privacy of one's data and algorithms. Motivated by algorithm privacy concerns in distributed computing systems, in this paper we introduce the private function retrieval (PFR) problem, where a user wishes to efficiently retrieve a linear function of $K$ messages from $N$ non-communicating replicated servers while keeping the function hidden from each individual server. The goal is to find a scheme with minimum communication cost. To characterize the fundamental limits of the communication cost, we define the capacity of the PFR problem as the size of the message that can be privately retrieved (which is the size of one file) normalized by the number of downloaded information bits. We first show that for the PFR problem with $K$ messages, $N=2$ servers and a linear function with binary coefficients, the capacity is $C=\frac{1}{2}\left(1-\frac{1}{2^K}\right)^{-1}$. Interestingly, this is the capacity of retrieving one of $K$ messages from $N=2$ servers while keeping the index of the requested message hidden from each individual server, the problem known as private information retrieval (PIR). Then, we extend the proposed achievable scheme to the case of an arbitrary number of servers and coefficients in the field $GF(q)$ with arbitrary $q$, and obtain the rate $R=\left(1-\frac{1}{N}\right)\left(1+\frac{\frac{1}{N-1}}{\left(\frac{q^K-1}{q-1}\right)^{N-1}}\right)$.


I Introduction

Distributed systems are widely regarded as an inevitable solution for storing or processing large amounts of data. However, distributing the computation raises major concerns regarding the security and privacy of data and algorithms. This is particularly crucial if we have to offload computation and storage tasks to untrusted, but probably cheaper or more powerful, parties. There is a rich history of study in the literature on data privacy in distributed environments. However, these days, algorithm privacy could be even more important than data privacy. Not only could the algorithms be very valuable, but in some cases the parameters of the algorithm could also carry lifetime secrets such as the biological information of individuals. Compared to data privacy, our understanding of the fundamental limits of algorithm privacy is very limited.

Motivated by this, we introduce the private function retrieval (PFR) problem, where a set of servers with access to a database is connected to a user. The user wishes to compute a function of the files while keeping the function private from each individual server. The goal is to characterize the fundamental limits of the communication cost (between the user and the servers) needed in order to privately compute the function.

Recently, there has been intense interest in characterizing the fundamental performance limits of distributed computing systems from an information theoretic perspective. Among these, we can name the distributed storage systems [1], distributed cache networks [2], private information retrieval (PIR) [3, 4, 5], and distributed computing [6, 7]. In all of these cases, information theoretic ideas and tools have been found useful to provide a fundamental, and often very different, understanding on how to run the system efficiently. In this work, our goal is to characterize the fundamental limits of PFR from an information theoretic perspective.

To be precise, in this paper, we consider a system including one user connected to $N$ non-colluding servers, each storing a database of $K$ equal-size files. The user wishes to compute a linear combination of these files by downloading enough equations from the servers. While retrieving the linear combination, the user wishes to keep the coefficients hidden from each individual server. This means that each server must be equally uncertain about which combination is requested by the user. The goal is to minimize the download load required to retrieve the result of the computation privately.

The PFR problem can be considered as an extension of the PIR problem, where the user is interested in one of the $K$ files. The PIR problem was introduced in [3], and its capacity in the basic setup has been characterized recently in [5]. Several extensions of the PIR problem have been studied in the literature, including PIR with colluding servers [8, 9], PIR with coded servers [10], and symmetric PIR [11, 12].

To address the problem of PFR, we first focus on the case where the coefficients are from a binary field. For this case, we find the optimal scheme for two servers ($N=2$) and an arbitrary number of files, $K$. In particular, we show that the capacity of this case is $\frac{1}{2}\left(1-\frac{1}{2^K}\right)^{-1}$. Interestingly, this is equal to the capacity of PIR with two servers and an arbitrary number of files, $K$. We extend this scheme and propose an achievable solution for the general setup with $N$ servers, $K$ files and coefficients from a general valid field.

The capacity of the PFR problem has been studied in a parallel and independent work [13]. In [13], the capacity of PFR has been characterized for a system with two servers ($N=2$), two messages ($K=2$) and arbitrary linear combinations. In this paper, we characterize the capacity of PFR for a system with $N=2$ servers, an arbitrary number of files, and binary coefficients. The achievable schemes proposed by the two papers are very different.

The remainder of this paper is organized as follows. Section II formally introduces our information-theoretic formulation of the PFR problem. Section III presents main results. Sections IV and V contain proofs.

II Problem Setting

We consider a system including a user connected to $N$ non-colluding servers, each storing an identical copy of a database. The database includes $K$ files , where each file has equal-size segments (so-called layers) for , i.e., . Each segment , , is chosen independently and uniformly at random from the finite field , for some and prime number . The database is shown as , where .

The user is interested in a specific linear function of , represented as

(1)

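where the coefficient vector is a $K$-dimensional non-zero vector with entries from the finite field $GF(q)$, for some integer $q$. We assume that $GF(q)$ is a sub-field of the field from which the segments are drawn, thus $q$ is a power of the same prime, with an exponent that divides the extension degree of the larger field. Therefore, the operations in (1) are well-defined over the larger field. Excluding the parallel vectors, there are $\frac{q^K-1}{q-1}$ distinct options for the coefficient vector , denoted by . We use a short-hand notation .

Note that in the PFR problem with binary coefficients, we set $q=2$, which yields $2^K-1$ distinct coefficient vectors.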
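The count of non-parallel coefficient vectors can be checked by brute force. The following is a minimal sketch, assuming a prime $q$ so that arithmetic modulo $q$ realizes $GF(q)$ (prime-power fields would need polynomial arithmetic); the function name `count_nonparallel_vectors` is hypothetical and introduced only for illustration.

```python
from itertools import product


def count_nonparallel_vectors(q: int, K: int) -> int:
    """Count non-zero vectors of GF(q)^K up to scalar multiples.

    Assumption: q is prime, so integers modulo q form the field GF(q).
    """
    representatives = set()
    for v in product(range(q), repeat=K):
        if not any(v):
            continue  # skip the all-zero vector
        # Scale the vector so that its first non-zero entry becomes 1.
        lead = next(x for x in v if x != 0)
        inv = pow(lead, -1, q)  # modular inverse (Python 3.8+)
        representatives.add(tuple((inv * x) % q for x in v))
    return len(representatives)


if __name__ == "__main__":
    for q, K in [(2, 3), (3, 2), (5, 3)]:
        print(q, K, count_nonparallel_vectors(q, K), (q**K - 1) // (q - 1))
```

Each parallel class is represented by its unique member whose first non-zero entry is 1, so the size of the set is the number of distinct options.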

Assume that the user chooses for some , thus the user wishes to compute by downloading some equations from the servers. So, the user sends queries , to server 1 to respectively, where is the query sent by the user to the -th server in order to retrieve . Since the queries are independent of the messages, we have

(2)

for all .

In response to , the -th server computes an answer as a function of its database and the received query, thus

(3)

Let , for some integer , represent the download cost in -ary units.

While retrieving from and , the user must keep the index , (or equivalently the vector ) hidden from each individual server. To satisfy the privacy constraint, the query-answer function must be identically distributed in each server . That is

(4)

for each and .

An PFR scheme consists of query-answer function for and ; and decoding functions that map to

as the estimate of

for with probability of error

while the privacy constraint (4) is satisfied.

The rate of this code is defined as

(5)
Definition.

A rate is achievable if there exists a sequence of PFR schemes where as . The capacity of PFR problem is defined as

Thus, from Fano's inequality, the correctness condition, i.e., , implies that

(6)

where from the Landau notation, we have if .

III Main Results

The first theorem presents the capacity of the PFR problem with binary coefficients ($q=2$) when $N=2$ servers are available, for an arbitrary number of messages $K$.

Theorem 1.

For the PFR problem with $K$ messages, $N=2$ servers and binary coefficients, the capacity is

$$C=\frac{1}{2}\left(1-\frac{1}{2^K}\right)^{-1}. \tag{7}$$
Remark 1:

Recall that the user needs the result of a linear combination of the messages for one coefficient vector. With binary coefficients, the coefficient vector has $2^K-1$ options. Therefore, the goal is to design an achievable scheme which has two properties: (i) correctness, meaning the user can decode what is asked for, and (ii) privacy, meaning that for every single server, all $2^K-1$ options are equiprobable, independent of the real one. The above theorem states that the minimum communication load, normalized by the size of a file, that guarantees both privacy and correctness is $2\left(1-\frac{1}{2^K}\right)$.

Remark 2:

In the proposed achievable scheme, the set of requests to each server is symmetric with respect to all possible coefficient vectors, thus privacy is guaranteed. However, the requests to the two servers are coupled to exploit two opportunities. In the first opportunity, every request to a server, except a few, has a counterpart request to the other server, such that these two together reveal the desired combination for some layer. This justifies the factor of $\frac{1}{2}$ in (7). In the second opportunity, in some cases, a request to one server directly reveals the desired combination for some layer. This is reflected in the factor $\left(1-\frac{1}{2^K}\right)^{-1}$ in (7). These two opportunities are exploited together efficiently, such that not only are correctness and privacy guaranteed, but the scheme also achieves the optimal bound.

Remark 3:

We note that for this case, the user asks for the result of a linear combination whose coefficient vector has $2^K-1$ options. Therefore, the user wants to hide its requested combination among $2^K-1$ (virtual) files. Clearly, these virtual files are not linearly independent. One solution for this problem is to ignore this dependency and to consider a PIR problem with $2^K-1$ virtual files. That approach achieves the rate of $\frac{1}{2}\left(1-\frac{1}{2^{2^K-1}}\right)^{-1}$ (see [5] for the rate of PIR). However, here, the surprising fact is that the proposed scheme achieves the rate of $\frac{1}{2}\left(1-\frac{1}{2^K}\right)^{-1}$, as if there are only $K$ options for the coefficient vector. This is done by efficiently exploiting the linear dependency of the coefficient vectors.
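To make the gap in Remark 3 concrete, the two rates can be evaluated exactly for small $K$. This is a minimal sketch using the PIR capacity expression $(1-1/N)(1-1/N^K)^{-1}$ from [5]; the function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction


def pir_capacity(N: int, K: int) -> Fraction:
    """PIR capacity with N servers and K files: (1 - 1/N) / (1 - 1/N^K)."""
    return (1 - Fraction(1, N)) / (1 - Fraction(1, N**K))


def pfr_binary_capacity(K: int) -> Fraction:
    """Capacity in (7): two servers, binary coefficients, K messages."""
    return Fraction(1, 2) / (1 - Fraction(1, 2**K))


def virtual_file_pir_rate(K: int) -> Fraction:
    """Rate of the baseline that runs two-server PIR over 2^K - 1 virtual files."""
    return pir_capacity(2, 2**K - 1)


if __name__ == "__main__":
    for K in range(2, 6):
        print(K, pfr_binary_capacity(K), virtual_file_pir_rate(K))
```

For $K=3$ the two values are $4/7$ and $64/127$, so the baseline is already close to $1/2$ while the capacity in (7) stays noticeably above it.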

Remark 4:

The PFR problem with binary coefficients reduces to the PIR problem if we restrict the possible coefficient vectors to those with unit Hamming weight. Thus, the converse of the PIR problem in [5, Theorem 1] with $N=2$ is valid for the PFR problem with binary coefficients. The proposed achievable scheme detailed in Section IV meets this converse.

The next lemma extends the achievable scheme of Theorem 1 to the case of an arbitrary number of servers and an arbitrary field for the coefficient vectors.

Lemma 2.

For the PFR problem with servers, messages, and the coefficient vectors , if , the following rate is achievable.

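$$R=\left(1-\frac{1}{N}\right)\left(1+\frac{\frac{1}{N-1}}{\left(\frac{q^K-1}{q-1}\right)^{N-1}}\right). \tag{8}$$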
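As a sanity check, the achievable rate stated in the abstract for $N$ servers, $K$ messages and coefficients in $GF(q)$ can be evaluated exactly and compared with the two-server binary capacity. This is a minimal sketch; the function names are hypothetical, and the formula is transcribed from the abstract.

```python
from fractions import Fraction


def num_options(q: int, K: int) -> int:
    """Number of non-parallel coefficient vectors: (q^K - 1) / (q - 1)."""
    return (q**K - 1) // (q - 1)


def pfr_general_rate(N: int, K: int, q: int) -> Fraction:
    """Rate (1 - 1/N) * (1 + (1/(N-1)) / M^(N-1)), with M = num_options(q, K)."""
    M = num_options(q, K)
    return (1 - Fraction(1, N)) * (1 + Fraction(1, N - 1) / M ** (N - 1))


def pfr_binary_capacity(K: int) -> Fraction:
    """Two-server binary-coefficient capacity: (1/2) * (1 - 1/2^K)^(-1)."""
    return Fraction(1, 2) / (1 - Fraction(1, 2**K))


if __name__ == "__main__":
    for K in range(1, 6):
        print(K, pfr_general_rate(2, K, 2), pfr_binary_capacity(K))
```

With $N=2$ and $q=2$ the general expression becomes $\frac{1}{2}\left(1+\frac{1}{2^K-1}\right)$, which equals the capacity of Theorem 1.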
Remark 5:

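In this case, the user needs the result of a linear combination whose coefficient vector has entries from $GF(q)$. Eliminating parallel vectors, there are $\frac{q^K-1}{q-1}$ options for the coefficient vector. If we treat the combination associated with each option as a virtual file, and apply the PIR scheme of [5] to these virtual files, we achieve the rate of

$$\left(1-\frac{1}{N}\right)\left(1-\frac{1}{N^{\frac{q^K-1}{q-1}}}\right)^{-1}.$$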

One can verify that the proposed scheme strictly outperforms the PIR-based scheme.

Corollary 3.

For the PFR problem with messages and the coefficient vectors , with servers, the capacity is equal to

(9)

The above corollary derives directly from Lemma 2. This rate meets the PIR converse.

IV PFR Scheme with Binary Coefficients (Achievability Proof of Theorem 1)

In this section, we present the achievable scheme for the PFR problem with two servers ($N=2$) and an arbitrary number of messages $K$, where the coefficients are from the binary field.

The proposed scheme guarantees privacy by keeping the requests to one server symmetric with respect to all possible coefficient vectors. However, the requests to the two servers are coupled in a certain way. In most cases, each request to one server has a counterpart in the set of requests to the other server. These two together reveal the desired combination for some layer. Some other requests directly reveal the desired combination for some layer, without any combining with the other server's answers.
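The coupling idea can be illustrated with a toy example over $GF(2)$. The sketch below is not the paper's exact request pattern in (10)-(15); it only assumes that one layer consists of $K$ bits, that server 1 is asked for the combination with a random vector, and that server 2 is asked for the combination with that vector plus the desired one. All names are hypothetical.

```python
import random


def combine(v, layer):
    """GF(2) inner product of a coefficient vector with one layer of K bits."""
    return sum(a & b for a, b in zip(v, layer)) % 2


def paired_retrieval(v_desired, layer, rng):
    """Recover combine(v_desired, layer) from one answer per server.

    Assumption (illustrative only): both servers hold the same layer, and
    neither sees the other's request.
    """
    K = len(layer)
    mask = [rng.randint(0, 1) for _ in range(K)]      # uniformly random vector
    q1 = mask                                          # request to server 1
    q2 = [m ^ d for m, d in zip(mask, v_desired)]      # request to server 2
    a1 = combine(q1, layer)                            # answer of server 1
    a2 = combine(q2, layer)                            # answer of server 2
    return a1 ^ a2                                     # the mask cancels out


if __name__ == "__main__":
    rng = random.Random(1)
    layer = [1, 0, 1, 1]
    v = [1, 1, 0, 1]
    print(paired_retrieval(v, layer, rng), combine(v, layer))
```

Each request on its own is a uniformly random vector, so it carries no information about the desired one. When the random vector happens to be all-zero, the request to server 2 is the desired combination itself, which loosely mirrors the case where a single answer directly reveals a layer.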

Let and define . Also, consider as a random permutation of the set . The user generates this permutation, uniformly at random, among all permutations, and keeps it private from the servers. Apply this random permutation to reorder the messages. In particular, reorder the message vectors to get for . Without loss of generality assume that the user is interested in retrieving , for some .

Phase 1:

  1. User asks server 1 to send back

    (10)
  2. User asks server 2 to send back

    (11)

Phase 2:

  1. User asks server 1 to send back

    (12)

    and also

    (13)
  2. User asks server 2 to send back

    (14)

    and also

    (15)

It is important to note that the above requests will be sent to the servers in a random order.

The requests and answers from server 1 and server 2 for are shown in Table I and Table II, respectively.

TABLE I: Requests from server 1 for
TABLE II: Requests from server 2 for

IV-A Proof of correctness

To prove the correctness, we show that the user can recover from the combinations (10)-(15), received from both servers, while the rate of the scheme is equal to (7).

We remind that , and thus combinations must be derived from the available equations at the user. is given in (10). To obtain for all , the user combines (10) and (15). Similarly, is given in (11) and for all can be obtained by combining (11) and (13). Finally, and are given in (12) and (14), respectively.

The total number of downloads is

and so the rate of the code is

IV-B Proof of privacy

Our privacy proof is based on the fact that we preserve the equal number of requests for any possible coefficient vector in addition to using a random permutation over the message layers. Furthermore, we send the requests to each server in a random order.

First, consider server 1 with its requests (10), (12) and (13). As seen, server 1 only observes that the user requests a linear combination for layers of messages, while two layers are left out. The indices of these layers do not leak any information about the requested combinations vector , thanks to the random permutation of the message layers. Now, let’s check the requested coefficient vectors in (10), (12) and (13). We note that the set is equal to the set . This means that each possible coefficient vector is requested exactly twice, and in a random order, and thus no information can be obtained by server 1. In fact, it can be easily shown that for any coefficient vector , there is a permutation of the set that maps the requests of from one server to the requests of from the same server. The privacy condition at server 2 is guaranteed similarly.
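Symmetry arguments of this kind rest on a simple fact: over $GF(2)$, adding a fixed vector to every vector of the space only permutes the space. The sketch below checks this generic fact by enumeration; it is an illustration under that assumption and not a transcription of the sets used in requests (10)-(15). The function name is hypothetical.

```python
from itertools import product


def shifted_set(K: int, v_fixed):
    """Return {v + v_fixed : v in GF(2)^K}, with addition taken bitwise (XOR)."""
    return {
        tuple(a ^ b for a, b in zip(v, v_fixed))
        for v in product((0, 1), repeat=K)
    }


if __name__ == "__main__":
    K = 3
    all_vectors = set(product((0, 1), repeat=K))
    for v_fixed in all_vectors:
        # The shifted collection is the same set, whatever vector is added.
        print(v_fixed, shifted_set(K, v_fixed) == all_vectors)
```

Because the shifted collection does not depend on the added vector, a server that sees the whole collection in a random order cannot tell which vector was used for the shift.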

V General PFR Scheme (Proof of Lemma 2)

In this section, we present the general achievable scheme for the PFR problem with $N$ servers, $K$ messages and linear combinations over $GF(q)$. First, we define some notation. We define

(16)

as the set of all options for . In addition, for each , we define

(17)

as the set of all parallel and non-zero vectors to . We also define as a set of all -tuples of vectors with each element from (all possible vectors):

Note that .

Moreover, we define as a set of all -tuples of vectors with each element from :

Apparently, . Note that .

Now we are ready to detail the proposed scheme in three steps.

Step 1: Consider layers of messages. The user generates a random permutation of the set , and keeps it private from the servers. Apply this random permutation to reorder the message vectors and define for . In addition, choose distinct and consider a Vandermonde matrix as

Also, consider as a random permutation of the set and apply this random permutation to the columns of to get .
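The correctness proof relies on the Vandermonde matrix being full rank, a property that is unaffected by permuting its columns. The following is a minimal sketch that checks this numerically, assuming a square matrix over a prime field $GF(p)$ with distinct evaluation points; the function names and the choice $p=7$ are illustrative assumptions.

```python
def vandermonde(points, rows, p):
    """Matrix whose entry (i, j) is points[j]^i modulo p, for i = 0..rows-1."""
    return [[pow(a, i, p) for a in points] for i in range(rows)]


def rank_mod_p(matrix, p):
    """Rank over GF(p), p prime, computed by Gauss-Jordan elimination."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below 'rank' with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] % p), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)  # modular inverse (Python 3.8+)
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(n_rows):
            if r != rank and m[r][col] % p:
                factor = m[r][col]
                m[r] = [(x - factor * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


if __name__ == "__main__":
    p = 7
    print(rank_mod_p(vandermonde([1, 2, 3, 4], 4, p), p))  # distinct points
    print(rank_mod_p(vandermonde([3, 1, 4, 2], 4, p), p))  # permuted columns
    print(rank_mod_p(vandermonde([1, 1, 2, 3], 4, p), p))  # repeated point
```

Distinctness of the evaluation points is what matters: a repeated point duplicates a column and lowers the rank.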

Step 2: For each , , repeat the following:

  1. User asks server 1 to send back

    (18)
  2. User asks server , to send back

    (19)

    where is the element of permuted Vandermonde matrix in the -th row and the -th column.

We show in the proof of correctness that in each round of this step (i.e., for each ), the desired combination is retrieved over message layers.

Step 3: For each , repeat the following:

  1. User asks server 1 to send back

    (20)
  2. User asks server , to send back

    (21)

    where is the element of permuted Vandermonde matrix in the -th row and the -th column.

  3. User asks server  to send back

    (22)

Note that the set of requests to each server are sent in a random order.

We show in the proof of correctness that in each round of this step (i.e., for each ), the desired combination is retrieved over message layers.

V-A Proof of correctness

To prove correctness, we show that the user can recover the desired combination from the combinations (18)-(22), received from the $N$ servers, while the rate of the scheme is equal to (8).

In Step 2, we have rounds. In each round , , layers of the desired combination are recovered. The reason is as follows. Subtracting (18) from (19), the user has access to

(23)

for . Since the Vandermonde matrix is full rank, (23) provides independent linear combinations of (that is, layers of the desired combination). Thus, we obtain from (23). In total,

(24)

layers of the desired combination are recovered in Step 2. The user downloads one equation from each server in each round. Thus, the total number of the downloaded equations in this step is

(25)

In Step 3, we have rounds. In each round , , layers of the desired combination are recovered. The reason is as follows. Since the Vandermonde matrix is full rank, the user from (20) and (21) has access to the independent linear combinations of

These are layers of the desired combination, as in this step, the coefficient vectors satisfy , and thus all are parallel to . Eliminating the layers from (22), the user recovers the -th layer in this step which is . Therefore, layers of the desired combination are recovered for each .

In total,

(26)

layers of the desired combination are recovered in Step 3. The user downloads one equation from each server in each round. Thus, the total number of downloaded equations in this step is

(27)

From (25) and (27), the total number of downloads is

and totally

(28)

layers of the desired combination are recovered, and so the rate of the code is as given in (8).

V-B Proof of privacy

The privacy proof is based on the fact that we preserve an equal number of requests for every possible combination vector, in addition to using a random permutation over the message layers. In addition, the requests to each server are sent in a random order.

First, consider server 1 with its requests (18) and (20). As seen, server 1 only observes that the user requests a linear combination of message layers with all possible coefficient vectors. The indices of the message layers do not leak any information about the requested combinations vector , thanks to the random permutation of the message layers. In addition, due to the random order of the requests, asking for all possible coefficient vectors makes equiprobable for .

Now, consider server , , with its requests (19), (21) and (22). Again, server  only observes that the user requests a linear combination of message layers with all possible coefficient vectors (with a random order). Because:

  • The set of that is used in the scheme covers (i.e., all possible -tuples of vectors with each element from ).

  • The set is equal to .

  • When (in Step 3), two sets and are equal.

Thus, it can be shown that for any combination vector , there is a random permutation of the set that maps the request of to the request of .

References