# Private Inner Product Retrieval for Distributed Machine Learning

In this paper, we argue that many basic machine learning algorithms, including support vector machines (SVM) for classification, principal component analysis (PCA) for dimensionality reduction, and regression for dependency estimation, need the inner products of the data samples rather than the data samples themselves. Motivated by this observation, we introduce the problem of private inner product retrieval for distributed machine learning, where we have a system including a database of some files, duplicated across some non-colluding servers. A user intends to retrieve a subset, of a specific size, of the inner products of the data files with minimum communication load, without revealing any information about the identity of the requested subset. For achievability, we use the algorithms for multi-message private information retrieval. For the converse, we establish that as the length of the files becomes large, the set of all inner products converges to a set of independent random variables with uniform distribution, and we derive the rate of convergence. To prove this, we construct special dependencies among the sequences of the sets of all inner products with different lengths, forming a time-homogeneous irreducible Markov chain, without affecting the marginal distribution. We show that this Markov chain has the uniform distribution as its unique stationary distribution, with a rate of convergence dominated by the second largest eigenvalue of the transition probability matrix. This allows us to develop a converse, which converges to a tight bound in some cases, as the size of the files becomes large. While this converse is based on the one in multi-message private information retrieval, due to the nature of retrieving inner products instead of the data itself, some changes are made to reach the desired result.


## I Introduction

With the growth in data volume over recent years, the tasks of data storage and processing are often offloaded from in-house trusted systems to external entities. Such distributed environments raise challenges not experienced before. One of the most important is the privacy concern, which can have different interpretations. Depending on the application's use case, the private asset might be the training data, the test data, or even the model parameters (the learning algorithm). While the first two have been the subject of extensive research, from both computational cryptography and information-theoretic perspectives, the last one has been less understood.

When the privacy of a machine learning algorithm is the concern, the goal is to ensure the privacy of its parameters. Many different scenarios can be considered in which the parameters are at risk of being breached and need to be protected. Here, we focus on the case where the learner must download some data samples from the servers to train the model. In this case, the learner wants to keep the identity of this subset hidden from the servers. The reason is that in many cases, revealing the identity of the selected training samples would reveal considerable information about the intention of the learner, and could be used to guess the learning algorithm and calculate the parameters of the model. For example, assume that the learner downloads some training samples from a server to train a classification algorithm, say a support vector machine (SVM). The server can easily guess that, run the same algorithm, and gain full knowledge of the intention and the model.

In this paper, we investigate the above privacy concern in a distributed setting, where our goal is to achieve privacy at a fundamental, information-theoretic level, revealing no information about the algorithm to the data owners. We argue that some of the most basic machine learning algorithms in different areas, including but not limited to SVM for classification, regression for relationship estimation, and principal component analysis (PCA) for dimensionality reduction, share an important feature in how they use the sample data. To run these methods, the learner needs the inner products of the data files instead of the raw data. This is particularly important when the length of the input vectors is large compared to the number of samples used for learning.

On a separate line of research, privacy in distributed settings, referred to as private information retrieval (PIR), has been investigated. In [1], the basic setup of PIR is studied, where the goal is to retrieve a file from a dataset, replicated in some non-colluding servers, without revealing its index. In particular, the capacity, defined as the supremum of achievable download rates, is characterized. This is followed by [2, 3, 4, 5, 6] for different cases such as symmetric privacy, the possibility of collusion among the servers, and coded storage instead of uncoded replication of data files in servers. In particular, in [7], the multi-message PIR (MPIR) problem is studied, where the objective is to privately download a subset of files, instead of just one, and the capacity is approximately, and in some cases tightly, characterized. The problem of retrieving a linear function of the files from the servers, referred to as private computation (PC) or private function retrieval (PFR), is investigated in [8] and [9]. In [10], the capacity of private linear computation in MDS-coded databases is studied. Recently, the problem of retrieving a polynomial function of the files from some servers has been introduced and discussed in [11] and [12], using Lagrange encoding in coded databases.

In this paper, we study a system including a dataset of $K$ files, replicated across $N$ non-colluding servers. A user (learner) wishes to retrieve a subset of $P$ inner products out of all possible inner products of the data files, without revealing the identity of the subset to any server. We prove that as the length of the files, $L$, goes to infinity, the set of inner products of all data files (listed in the vector $X^{(L)}$) converges, in distribution, to a set of mutually independent uniform random variables. To show that, we introduce some dependencies in the sequence $X^{(L)}$, $L = 1, 2, \ldots$, while keeping the marginal distribution of each $X^{(L)}$ the same. Thanks to this dependency, we show that $X^{(L)}$ forms a time-homogeneous irreducible Markov chain, with the uniform distribution as its unique stationary distribution. Moreover, the rate of convergence is governed by the second largest eigenvalue $\lambda_2$ of the transition probability matrix, where $0 \le \lambda_2 < 1$. This property motivates us to suggest MPIR as an achievable scheme. In addition, we rely on the above property to develop a converse which becomes tight in some cases, as the length of the files goes to infinity. While this converse is based on [7], a few changes need to be made to reach our goal. This is because of the difference between retrieving inner products here and retrieving data files in [7]. For example, the number of possible inner products cannot be an arbitrary integer, which forces us to introduce an equivalent problem with an arbitrary number of inner products in the process of reaching the converse results.

The organization of the paper is as follows. In Section II, we discuss and motivate why retrieving the set of inner products is critical in machine learning. In Section III, we formally define the problem setting. In Section IV, we review the results on MPIR that we build on. We state our main results in Section V and their proofs in Section VI.

## II Background and Motivation

In what follows, we review some of the most basic machine learning algorithms in the three areas of classification, regression, and dimensionality reduction, and show that all three are based on the inner products of the samples, rather than the samples themselves.

1. Support vector machines (SVM): The SVM is one of the basic classification algorithms, where the goal is to correctly label the data files. This algorithm has many use cases, such as face detection, bioinformatics (gene classification), and text categorization. Here, we describe a simple case of SVM from [13, Page 63] and discuss why knowing the inner products is enough to run the algorithm (instead of knowing the entire database).

Consider an input alphabet $\mathcal{X} \subseteq \mathbb{R}^L$ consisting of length-$L$ vectors, a target output alphabet $\mathcal{Y} = \{-1, +1\}$, and a distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. The learner has $m$ training samples from $\mathcal{X} \times \mathcal{Y}$, denoted by $(x_1, y_1), \ldots, (x_m, y_m)$, drawn from $\mathcal{D}$. The goal is to find a function $h$ from a hypothesis set $\mathcal{H}$, such that the following generalization error is minimized over $\mathcal{H}$:

$$R_{\mathcal{D}}(h) = \Pr_{(x,y)\sim\mathcal{D}}\{h(x) \neq y\}. \tag{1}$$

Although many different hypothesis sets exist, $\mathcal{H}$ can be chosen, as described in [13], as the set of linear classifiers defined as follows:

$$\mathcal{H} = \{x \mapsto \operatorname{sign}(\langle w, x\rangle + b) \mid w \in \mathbb{R}^L,\ b \in \mathbb{R}\}. \tag{2}$$

The solution to this problem boils down to solving the following convex optimization problem:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \tag{3}$$
$$\text{subject to: } y_i(\langle w, x_i\rangle + b) \geq 1, \quad \forall i \in [1:m], \tag{4}$$

where for any integer $n$, $[1:n]$ denotes the set $\{1, 2, \ldots, n\}$. This notation is used throughout the paper. The above problem can be solved by introducing a Lagrange multiplier $\alpha_i$ for each constraint. Thus, the dual form of the constrained optimization problem is derived as follows:

$$\max_{\alpha_i,\ i\in[1:m]} \ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle \tag{5}$$
$$\text{subject to: } \alpha_i \geq 0 \ \text{ and } \ \sum_{i=1}^m \alpha_i y_i = 0, \quad \forall i \in [1:m]. \tag{6}$$

Solving the dual problem for $\alpha$, we have:

$$w = \sum_{i=1}^m \alpha_i y_i x_i, \qquad b = y_i - \sum_{j=1}^m \alpha_j y_j \langle x_j, x_i\rangle, \tag{7}$$

for any support vector $x_i$. As is clear from (5)-(7), in order to solve the main problem for $(w, b)$, we only need the inner products of the samples and their labels to solve the dual problem for $\alpha$, and a linear combination of the data samples to obtain $w$.¹ So, when the length of the vectors, $L$, is large, retrieving inner products instead of raw samples is more efficient in a distributed learning setting.

¹ To obtain a linear combination of the samples privately, we can use a scheme called private function retrieval.
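To make this concrete, the following sketch (a toy example of our own, not from the paper) trains an SVM through a dual solver that is given only the $m \times m$ Gram matrix of inner products $\langle x_i, x_j\rangle$, never the length-$L$ samples themselves; scikit-learn's precomputed-kernel mode is assumed to be available as the dual solver.

```python
import numpy as np
from sklearn.svm import SVC  # any dual SVM solver with precomputed kernels works

rng = np.random.default_rng(0)
# Two linearly separable clusters of length-L vectors; L is large
# compared to the number of samples m, the regime discussed above.
L, m = 1000, 40
X = rng.normal(size=(m, L))
y = np.where(X[:, 0] > 0, 1, -1)
X[:, 0] += 2.0 * y  # widen the margin along the first coordinate

G = X @ X.T  # m x m Gram matrix of inner products <x_i, x_j>

# The dual problem (5)-(6) depends on the data only through G, so a
# solver in precomputed-kernel mode never touches the raw samples.
clf = SVC(kernel="precomputed", C=1e3).fit(G, y)

# Predicting a new point x also needs only the inner products <x_i, x>.
x_new = rng.normal(size=L)
x_new[0] = 5.0  # well inside the +1 cluster
pred = clf.predict((X @ x_new).reshape(1, -1))[0]
```

Prediction likewise needs only the inner products $\langle x_i, x_{\mathrm{new}}\rangle$, which is exactly the kind of information the retrieval problem in this paper delivers.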

2. Regression: A regression algorithm predicts the real-valued label of a point by using a data set. Regression is a very common task in machine learning for closely approximating the relationship between variables. Regression can be seen as a continuous-label version of classification, as opposed to classification's discrete labels. Many use cases can be considered for regression, such as optimizing the price of products by learning the relation between price and sale volume in different markets, or analyzing product sale drivers such as distribution methods in markets. Here, we first describe a simple regression problem from [13, Page 245] and show that in order to solve this problem we only need the inner products, as opposed to retrieving all data files.

Similar to SVM, consider an input alphabet $\mathcal{X} \subseteq \mathbb{R}^L$ consisting of vectors of length $L$ and a distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. The learner has $m$ training samples from $\mathcal{X} \times \mathcal{Y}$, denoted by $(x_1, y_1), \ldots, (x_m, y_m)$, drawn from $\mathcal{D}$. The difference is that the target output alphabet $\mathcal{Y} \subseteq \mathbb{R}$ can be a continuous space. Since the labels are real numbers, the learner is not able to predict them precisely. So, a loss function is considered to measure the distance between the label and the predicted value.

Now, we discuss a simple linear regression problem. Similar to SVM, the hypothesis set is as follows:

$$\mathcal{H} = \{x \mapsto \langle w, x\rangle + b \mid w \in \mathbb{R}^L,\ b \in \mathbb{R}\}. \tag{8}$$

The loss here is the empirical mean squared error. So, the optimization problem is as follows:

$$\min_{w,b} \ \frac{1}{m}\sum_{i=1}^m \left(\langle w, x_i\rangle + b - y_i\right)^2, \tag{9}$$

which can be written in a simpler form as:

$$\min_{\tilde{w}} F(\tilde{w}) = \frac{1}{m}\|X^\top \tilde{w} - y\|^2, \tag{10}$$

where $X$ is the matrix whose $i$-th column is $x_i$ augmented with a constant $1$ (so that $\tilde{w} = (w^\top, b)^\top$), and $y = (y_1, \ldots, y_m)^\top$. It is clear that the objective function is convex and reaches its optimum where its gradient vanishes. So, we have:

$$\frac{2}{m} X(X^\top \tilde{w} - y) = 0 \ \Leftrightarrow \ XX^\top \tilde{w} = Xy. \tag{11}$$

Now, if $XX^\top$ is invertible, we can calculate $\tilde{w} = (XX^\top)^{-1}Xy$. Otherwise, we replace the inverse with the pseudo-inverse:

$$\tilde{w} = \begin{cases} (XX^\top)^{-1}Xy & \text{if } XX^\top \text{ is invertible}\\ (XX^\top)^{\dagger}Xy & \text{otherwise.} \end{cases} \tag{12}$$

It can easily be shown that the above result can be rewritten as:

$$\tilde{w} = \begin{cases} X(X^\top X)^{-1}y & \text{if } X^\top X \text{ is invertible}\\ X(X^\top X)^{\dagger}y & \text{otherwise.} \end{cases} \tag{13}$$

As seen, the solution only needs the inner products (the entries of $X^\top X$) and one linear combination of the data files (the multiplication by $X$ on the left), and not all data files. If the length of the data vectors, $L$, is large, downloading all data files requires far more resources.

3. Principal component analysis (PCA): The purpose of this algorithm is to reduce the dimensionality of data with large vector length, so that its most important features can be better analyzed. The reason is that sometimes the generalization ability of a method decreases as the dimension of the data increases. The following example is from [14, Page 324].

Consider $m$ vectors of length $L$, $x_1, \ldots, x_m$, as data files. The goal is to reduce the dimensionality of these vectors using a linear transformation. To do this, we define a matrix $W \in \mathbb{R}^{n \times L}$, where $n < L$. We also have a mapping $x \mapsto Wx$, whose output is the lower-dimensional representation of the data. Then a second matrix $U \in \mathbb{R}^{L \times n}$ is defined to recover the data. This means that if $v = Wx$ is the reduced representation, then $UWx$ is the recovered data. Minimizing the magnitude of the empirical distance between the original data and the recovered data is the goal of PCA:

$$\operatorname*{argmin}_{W,U} \ \sum_{i=1}^m \|x_i - UWx_i\|^2. \tag{14}$$

It is shown in [14] that $W = U^\top$, and this problem can be rewritten as follows:

$$\operatorname*{argmin}_{U} \ \sum_{i=1}^m \|x_i - UU^\top x_i\|^2, \tag{15}$$
$$\text{subject to: } U^\top U = I, \tag{16}$$

where $I$ is the $n \times n$ identity matrix. According to Theorem 23.2 in [14, Page 325], the solution to the above problem is to calculate $u_1, \ldots, u_n$, the eigenvectors of the matrix $A = XX^\top$ (where $X = [x_1, \ldots, x_m]$) corresponding to the $n$ largest eigenvalues of the matrix. The solution is $U = [u_1, \ldots, u_n]$.

If the dimension of the original vectors is too large ($L \gg m$), then we can rewrite the answer. We define $B = X^\top X$. Let $u$ be an eigenvector of the matrix $B$, so $X^\top X u = \lambda u$. Multiplying both sides by $X$, we have

$$XX^\top Xu = \lambda Xu \ \Rightarrow \ AXu = \lambda Xu. \tag{17}$$

Therefore, if $u$ is an eigenvector of $B$ corresponding to eigenvalue $\lambda$, then $Xu$ is an eigenvector of the matrix $A$ corresponding to the same eigenvalue. So in PCA, when the vector length is large, it is simpler to calculate the matrix $B = X^\top X$, which is the matrix of inner products of the original data. Then, the eigenvectors of this matrix corresponding to its $n$ largest eigenvalues are enough.
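The shortcut of (17) can also be checked numerically. In the sketch below (a toy instance of our own), the top-$n$ eigenvectors of the $L \times L$ matrix $A = XX^\top$ and the vectors $Xu$ built from eigenvectors $u$ of the $m \times m$ Gram matrix $B = X^\top X$ span the same subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
L, m, n = 800, 30, 3  # L >> m, keep n principal directions
X = rng.normal(size=(L, m))  # columns are the data files

# Direct route: eigenvectors of the large L x L matrix A = X X^T.
A = X @ X.T
eigvals_A, eigvecs_A = np.linalg.eigh(A)  # ascending order
U_direct = eigvecs_A[:, -n:]              # top-n eigenvectors

# Cheap route: eigenvectors of the m x m Gram matrix B = X^T X,
# then map u -> Xu as in (17); normalize to unit length.
B = X.T @ X
eigvals_B, eigvecs_B = np.linalg.eigh(B)
U_gram = X @ eigvecs_B[:, -n:]
U_gram /= np.linalg.norm(U_gram, axis=0)

# The spanned subspaces coincide: the projectors match up to round-off.
P_direct = U_direct @ U_direct.T
P_gram = U_gram @ U_gram.T
```

The nonzero eigenvalues of $A$ and $B$ agree as well, so the Gram matrix alone identifies the principal directions.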

These three algorithms make it clear that methods using the inner products of data files are important and common in machine learning. Thus, retrieving the inner products privately from the servers is an important step toward machine learning privacy.

## III Problem Statement

Consider a set of $K$ data files, $W_1, \ldots, W_K$, for some integer $K$, where the files are selected independently and uniformly at random from $\mathbb{F}^L(q)$, for some integer $L$ and field size $q$. Thus,

$$H(W_1, W_2, \ldots, W_K) = LK\log(q). \tag{18}$$

The files can be represented in vector form as

$$W_k = (w_{k1}, \ldots, w_{kL})^\top, \quad w_{k\ell} \in \mathbb{F}(q), \ \text{for } k \in [1:K],\ \ell \in [1:L]. \tag{19}$$

We assume that the files are replicated in $N$ non-colluding servers, for some integer $N$. We define $X^{(L)}$ as the set of the inner products of all pairs of data files,

$$X^{(L)} = \{\langle W_i, W_j\rangle, \ \forall i,j \in [1:K]\}. \tag{20}$$

Also, we define $\mathcal{T}$ as the index set of the inner products as follows:

$$\mathcal{T} = \{\{i,j\}, \ \forall i,j \in \{1, 2, \ldots, K\}\}. \tag{21}$$

Note that each member $\{i,j\}$ of $\mathcal{T}$ corresponds to an inner product $\langle W_i, W_j\rangle$ in the set $X^{(L)}$, i.e., $|\mathcal{T}| = |X^{(L)}| = K(K+1)/2$.

A user wishes to retrieve a subset of size $P$ of the inner products. More precisely, the user chooses a set $\mathcal{P} \subseteq \mathcal{T}$, where $|\mathcal{P}| = P$ and $1 \le P \le K(K+1)/2$, and asks to know $X^{(L)}_{\mathcal{P}}$, defined as

$$X^{(L)}_{\mathcal{P}} = \{\langle W_i, W_j\rangle, \ \forall \{i,j\} \in \mathcal{P}\}. \tag{22}$$

The cardinality $P$ of $\mathcal{P}$ is known to all servers. The user wishes to retrieve $X^{(L)}_{\mathcal{P}}$ while ensuring the privacy of $\mathcal{P}$ from each server.
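As a concrete illustration of this setup (toy numbers of our own, not from the paper), the snippet below enumerates the index set $\mathcal{T}$ of (21), the inner products $X^{(L)}$ of (20) over $\mathbb{F}(q)$, and one request $X^{(L)}_{\mathcal{P}}$ of (22):

```python
import numpy as np

# Toy instance: K = 3 files of length L = 4 over F(q) with q = 5 prime.
q, K, L = 5, 3, 4
rng = np.random.default_rng(3)
W = rng.integers(0, q, size=(K, L))  # rows are the files W_1, W_2, W_3

# Index set T of (21): unordered pairs {i, j}, so |T| = K(K+1)/2 = 6.
T = [(i, j) for i in range(1, K + 1) for j in range(i, K + 1)]

# Inner products X^(L) of (20), computed in F(q).
X = {(i, j): int(W[i - 1] @ W[j - 1] % q) for (i, j) in T}

# A request of size P = 2: the user wants X_P of (22) without revealing it.
P_set = [(1, 2), (2, 3)]
X_P = {e: X[e] for e in P_set}
```

The servers learn only the size $P = |\mathcal{P}|$, never which pairs were requested.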

In order to retrieve these inner products, the user creates $N$ queries $Q^{[\mathcal{P}]}_1, \ldots, Q^{[\mathcal{P}]}_N$ and sends $Q^{[\mathcal{P}]}_n$ to server $n$, through an error-free secure link. In response, server $n$ responds with $A^{[\mathcal{P}]}_n$. Since the user has no knowledge of the files,

$$I(W_1, \ldots, W_K;\ Q^{[\mathcal{P}]}_1, \ldots, Q^{[\mathcal{P}]}_N) = 0. \tag{23}$$

The answer of server $n$, $A^{[\mathcal{P}]}_n$, is a function of the query sent to that server and the set of data files available there; thus

$$H(A^{[\mathcal{P}]}_n \mid W_1, \ldots, W_K, Q^{[\mathcal{P}]}_n) = 0. \tag{24}$$

Also, $A^{[\mathcal{P}]}_{1:N}$ denotes the set $\{A^{[\mathcal{P}]}_1, \ldots, A^{[\mathcal{P}]}_N\}$, and similarly for $Q^{[\mathcal{P}]}_{1:N}$. The queries and answers must satisfy two conditions:

(i) Correctness Condition: This condition states that from all queries and answers of the servers, the user can calculate the inner products indexed by the set $\mathcal{P}$. Equivalently,

$$H(X^{(L)}_{\mathcal{P}} \mid A^{[\mathcal{P}]}_{1:N}, Q^{[\mathcal{P}]}_{1:N}) = 0. \tag{25}$$

(ii) Privacy Condition: In order to satisfy privacy, regardless of which set $\mathcal{P}$ is chosen, the query and answer of each server must be identically distributed, i.e., for all $\mathcal{P}_1, \mathcal{P}_2 \subseteq \mathcal{T}$ with $|\mathcal{P}_1| = |\mathcal{P}_2| = P$, and all $n \in [1:N]$, we must have

$$(Q^{[\mathcal{P}_1]}_n, A^{[\mathcal{P}_1]}_n, W_1, \ldots, W_K) \sim (Q^{[\mathcal{P}_2]}_n, A^{[\mathcal{P}_2]}_n, W_1, \ldots, W_K). \tag{26}$$

For an achievable scheme satisfying (25) and (26), we define the retrieval rate $R(P, L)$ as the ratio between the information of the inner products in $\mathcal{P}$ and the total downloading cost to retrieve them, minimized over all possible requests $\mathcal{P} \subseteq \mathcal{T}$, $|\mathcal{P}| = P$, i.e.,

$$R(P, L) = \min_{\mathcal{P}\subseteq\mathcal{T},\ |\mathcal{P}|=P} \ \frac{H(X^{(L)}_{\mathcal{P}})}{\sum_{n=1}^N H(A^{[\mathcal{P}]}_n)}. \tag{27}$$

The capacity $C$ is the supremum of all achievable rates $R(P, L)$.

## IV Preliminary

In order to proceed, we need to review the results of the MPIR problem in [7]. Consider a system including $K$ data files, replicated in $N$ non-colluding servers. Each data file is chosen independently and uniformly at random from the finite field $\mathbb{F}^L(q)$. A user wishes to retrieve a subset, indexed by $\mathcal{P}$, of the data files, ensuring the privacy of $\mathcal{P}$. Assume $|\mathcal{P}| = P$, where $P$ is known publicly. The rate is defined as the information of the subset of data files indexed by $\mathcal{P}$ over the download cost, and the capacity is defined as the supremum over all rates of privacy-preserving schemes. Then we have [7]

$$\underline{R}_{\mathrm{MPIR}}(K, P, N) \le C_{\mathrm{MPIR}} \le \overline{R}_{\mathrm{MPIR}}(K, P, N), \tag{28}$$

where for $P \ge \frac{K}{2}$, we have

$$\frac{1}{\overline{R}_{\mathrm{MPIR}}(K,P,N)} = \frac{1}{\underline{R}_{\mathrm{MPIR}}(K,P,N)} = 1 + \frac{K-P}{PN},$$

and for $P \le \frac{K}{2}$, we have

$$\frac{1}{\overline{R}_{\mathrm{MPIR}}(K,P,N)} = \sum_{i=0}^{\lfloor K/P\rfloor - 1}\frac{1}{N^i} + \left(\frac{K}{P} - \left\lfloor\frac{K}{P}\right\rfloor\right)\frac{1}{N^{\lfloor K/P\rfloor}},$$

and $\underline{R}_{\mathrm{MPIR}}(K,P,N)$ is equal to

$$\frac{\sum_{i=1}^{P}\beta_i r_i^{K-P}\left[\left(1+\frac{1}{r_i}\right)^{K} - \left(1+\frac{1}{r_i}\right)^{K-P}\right]}{\sum_{i=1}^{P}\beta_i r_i^{K-P}\left[\left(1+\frac{1}{r_i}\right)^{K} - 1\right]},$$

where the constants $r_i$ and $\beta_i$, $i \in [1:P]$, are specified in [7] as the solution of a set of linear equations.

## V Main Results

The main result is stated in the following theorem.

###### Theorem 1.

For a system with $K$ files in $\mathbb{F}^L(q)$ and $N$ servers, where the user is interested in a subset of size $P$ of the inner products, we have

$$\frac{1}{\underline{R}_{\mathrm{MPIR}}(K(K+1)/2,\ P,\ N)} - O(\lambda_2^{L-1}) < \frac{1}{C} \le \frac{1}{\overline{R}_{\mathrm{MPIR}}(K(K+1)/2,\ P,\ N)}, \tag{29}$$

where $\lambda_2 \in [0, 1)$ is a constant independent of $L$ and $P$.

###### Corollary 1.

If $P \ge \frac{K(K+1)}{4}$, then we have

$$\lim_{L\to\infty} \frac{1}{C} = 1 + \frac{K(K+1) - 2P}{2PN}. \tag{30}$$
###### Corollary 2.

If $\frac{K(K+1)}{2P}$ is an integer, then we have

$$\lim_{L\to\infty} \frac{1}{C} = 1 + \frac{1}{N} + \ldots + \frac{1}{N^{\frac{K(K+1)}{2P}-1}}. \tag{31}$$
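For intuition about these limits, the helper below (a hypothetical function of our own, not from the paper) evaluates $\lim_{L\to\infty} 1/C$ from Corollaries 1 and 2 with exact rational arithmetic:

```python
from fractions import Fraction

def inv_capacity_limit(K: int, P: int, N: int) -> Fraction:
    """Large-L limit of 1/C, per Corollaries 1 and 2 (hypothetical helper)."""
    Kp = Fraction(K * (K + 1), 2)  # number of distinct inner products
    if P >= Kp / 2:
        # Corollary 1: 1 + (K(K+1) - 2P) / (2PN)
        return 1 + (Kp - P) / (P * N)
    # Corollary 2: requires K(K+1)/(2P) to be an integer
    steps = Kp / P
    assert steps.denominator == 1, "K(K+1)/(2P) must be an integer here"
    return sum((Fraction(1, N) ** i for i in range(int(steps))), Fraction(0))

# K = 3 files give K(K+1)/2 = 6 inner products.
c1 = inv_capacity_limit(3, 4, 2)  # Corollary 1 regime
c2 = inv_capacity_limit(3, 2, 2)  # Corollary 2 regime
```

With $N = 2$ servers, retrieving $P = 4$ of the $6$ inner products costs $5/4$ downloaded symbols per useful symbol in the limit, while $P = 2$ costs $7/4$.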

The proofs can be found in the next section. Assuming $L$ is large enough, for achievability, we use the scheme of MPIR. For the converse, we prove that as $L$ goes to infinity, the entries of $X^{(L)}$ converge to a set of independent random variables with uniform distribution, with the rate of convergence dominated by a constant $\lambda_2$, $0 \le \lambda_2 < 1$. For large $L$, in some cases, the achievable rate and the converse match. In the other cases, the two are very close.

## VI Proof

In this section, we provide the proof of Theorem 1. We sort the elements of the set $X^{(L)}$ in a vector, also denoted by $X^{(L)}$, such that $\langle W_i, W_j\rangle$ comes before $\langle W_{i'}, W_{j'}\rangle$ if $i < i'$, or if $i = i'$ and $j < j'$. Likewise, we sort the elements of $X^{(L)}_{\mathcal{P}}$ in a vector.

First, we show that as $L \to \infty$, the distribution of $X^{(L)}$ converges to the uniform distribution over $\mathbb{F}^{K(K+1)/2}(q)$:

$$\forall y \in \mathbb{F}^{K(K+1)/2}(q): \ \lim_{L\to\infty} \Pr\{X^{(L)} = y\} = \frac{1}{q^{K(K+1)/2}}. \tag{32}$$

Indeed, we increase $L$ by one, and show that the distribution of $X^{(L)}$ over $\mathbb{F}^{K(K+1)/2}(q)$ becomes closer to the uniform distribution. In addition, we derive the rate of convergence.

Let us denote the members of the set $\mathbb{F}^{K(K+1)/2}(q)$ by $y_1, \ldots, y_{q^{K(K+1)/2}}$, i.e.,

$$\mathbb{F}^{K(K+1)/2}(q) = \{y_1, \ldots, y_{q^{K(K+1)/2}}\}. \tag{33}$$

We denote the probability mass function of $X^{(L)}$ over $\mathbb{F}^{K(K+1)/2}(q)$ by $p^{(L)}$, i.e.,

$$p^{(L)} = (p^{(L)}_1, \ldots, p^{(L)}_{q^{K(K+1)/2}})^\top \in [0,1]^{q^{K(K+1)/2}}, \tag{34}$$

where

$$p^{(L)}_i = \Pr\{X^{(L)} = y_i\}, \quad i \in [1:q^{K(K+1)/2}]. \tag{35}$$

Clearly,

$$\sum_{i=1}^{q^{K(K+1)/2}} p^{(L)}_i = 1. \tag{36}$$

Our goal is to investigate how $p^{(L)}$ changes as we increase $L$ to $L+1$. Let

$$W^{(L)}_i = (w_{i1}, \ldots, w_{iL})^\top, \quad i \in [1:K]. \tag{37}$$

Without loss of generality, we assume that

$$W^{(L+1)}_i \triangleq (w_{i1}, \ldots, w_{iL}, w_{i(L+1)})^\top, \quad i \in [1:K], \tag{38}$$

where $w_{i(L+1)}$ is selected uniformly at random from $\mathbb{F}(q)$. We note that by this construction, $X^{(L)}$ and $X^{(L+1)}$ become correlated. However, the marginal distribution of $X^{(L+1)}$ is still the same as discussed in the problem formulation, and this correlation allows us to derive the limiting distribution.

###### Lemma 1.

The sequence $X^{(L)}$, $L = 1, 2, \ldots$, forms a Markov chain with a time-homogeneous transition probability matrix $M$, i.e.,

$$p^{(L+1)} = M p^{(L)}, \tag{39}$$

where

$$[M]_{i,j} = \Pr\{\Delta^{(L,L+1)} = y_i - y_j\}, \quad \forall i,j \in [1:q^{K(K+1)/2}]. \tag{40}$$
###### Proof.

Defining the data files as above, we have

$$\langle W^{(L+1)}_i, W^{(L+1)}_j\rangle = \langle W^{(L)}_i, W^{(L)}_j\rangle + w_{i(L+1)}w_{j(L+1)}, \quad \forall i,j \in [1:K]. \tag{41}$$

Thus, for the vector of inner products $X^{(L+1)}$, we can also write

$$X^{(L+1)} = X^{(L)} + \Delta^{(L,L+1)}, \tag{42}$$

where

$$\Delta^{(L,L+1)} = (w_{1(L+1)}w_{1(L+1)},\ w_{1(L+1)}w_{2(L+1)},\ \ldots,\ w_{K(L+1)}w_{K(L+1)})^\top \in \mathbb{F}^{K(K+1)/2}(q). \tag{43}$$

Because of the way we constructed $W^{(L+1)}_i$ from $W^{(L)}_i$, for $i \in [1:K]$, it is apparent that $\Delta^{(L,L+1)}$ is independent of the data files $W^{(L)}_i$, $i \in [1:K]$, and its distribution does not depend on $L$. We have

$$\Pr\{X^{(L+1)} = y_i\} = \sum_{j=1}^{q^{K(K+1)/2}} \Pr\{X^{(L)} = y_j\} \cdot \Pr\{\Delta^{(L,L+1)} = y_i - y_j\}. \tag{44}$$

Thus, from (35), we can rewrite the above equation as

$$p^{(L+1)} = M p^{(L)}, \tag{45}$$

where $M$ is a constant matrix, with entry $(i,j)$ equal to

$$[M]_{i,j} = \Pr\{\Delta^{(L,L+1)} = y_i - y_j\}, \quad \forall i,j \in [1:q^{K(K+1)/2}]. \tag{46}$$

We note that $M$ is constant and independent of $L$. ∎
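The chain of Lemma 1 can be checked numerically on a tiny instance. The sketch below (instance size and variable names are our own choices, not from the paper) builds $M$ for $K = 2$ files over $\mathbb{F}(3)$, verifies that it is doubly stochastic, and iterates $p^{(L+1)} = Mp^{(L)}$, which drifts to the uniform distribution at a rate set by the second-largest eigenvalue magnitude, as Lemmas 2-4 below predict.

```python
import itertools
import numpy as np

# Tiny instance: K = 2 files over F(q) with q = 3, so the vector X^(L)
# has K(K+1)/2 = 3 entries: (<W1,W1>, <W1,W2>, <W2,W2>).
q, K = 3, 2
dim = K * (K + 1) // 2
states = list(itertools.product(range(q), repeat=dim))
index = {s: i for i, s in enumerate(states)}

# Distribution of the one-step update Delta of (43): one fresh symbol
# w_i(L+1) per file; Delta's entries are the pairwise products mod q.
delta_prob = np.zeros(len(states))
for w in itertools.product(range(q), repeat=K):
    d = tuple((w[i] * w[j]) % q for i in range(K) for j in range(i, K))
    delta_prob[index[d]] += 1.0 / q**K

# Transition matrix of (40): [M]_{i,j} = Pr{Delta = y_i - y_j} in F(q).
M = np.zeros((len(states), len(states)))
for j, yj in enumerate(states):
    for i, yi in enumerate(states):
        diff = tuple((a - b) % q for a, b in zip(yi, yj))
        M[i, j] = delta_prob[index[diff]]

# p^(1) is exactly the distribution of Delta; iterate p^(L+1) = M p^(L).
p = delta_prob.copy()
for _ in range(200):
    p = M @ p
uniform = np.full(len(states), 1.0 / len(states))

# Second-largest eigenvalue magnitude governs the convergence rate.
lam2 = sorted(abs(np.linalg.eigvals(M)))[-2]
```

Both the row sums and the column sums of $M$ equal one, which is exactly the property used later in (63) to show that the uniform distribution is stationary.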

To show that the limit in (32) exists, in the following lemma, we guarantee that this Markov chain has a steady-state distribution.

###### Lemma 2.

The Markov chain formed by the sequence $X^{(L)}$, $L = 1, 2, \ldots$, is irreducible.

###### Proof.

In order to prove the lemma, we show that there exists some integer $\Gamma$ such that $[M^\Gamma]_{i,j} > 0$, $\forall i, j$. This means it is possible to get to any state from any state in this chain, or equivalently, the chain is irreducible. We note that for any integer $\Gamma$,

$$X^{(L+\Gamma)} = X^{(L)} + \Delta^{(L,L+\Gamma)}, \tag{47}$$

where

$$\Delta^{(L,L+\Gamma)} = \left(\sum_{\gamma=1}^{\Gamma} w_{1(L+\gamma)}w_{1(L+\gamma)},\ \sum_{\gamma=1}^{\Gamma} w_{1(L+\gamma)}w_{2(L+\gamma)},\ \ldots,\ \sum_{\gamma=1}^{\Gamma} w_{K(L+\gamma)}w_{K(L+\gamma)}\right)^\top. \tag{48}$$

One can see that

$$\Pr\{\Delta^{(L,L+\Gamma)} = y_i - y_j\} = [M^\Gamma]_{i,j}, \quad \forall i,j \in [1:q^{K(K+1)/2}], \tag{49}$$

where $[M^\Gamma]_{i,j}$ denotes entry $(i,j)$ of the matrix $M^\Gamma$.

The lemma is thus equivalent to the claim that there exists some $\Gamma$ such that every realization of $\Delta^{(L,L+\Gamma)}$ in $\mathbb{F}^{K(K+1)/2}(q)$ is possible with some positive probability. Notice that the following relationship holds:

$$\Delta^{(L,L+\Gamma)} = \sum_{\gamma=1}^{\Gamma} \Delta^{(L+\gamma-1,\ L+\gamma)}. \tag{50}$$

It is obvious that $\Delta^{(L+\gamma-1, L+\gamma)}$, $\gamma \in [1:\Gamma]$, are mutually independent. The reason is that $\Delta^{(L+\gamma-1, L+\gamma)}$ depends only on $w_{1(L+\gamma)}, \ldots, w_{K(L+\gamma)}$.

We first show that every vector $y$ in $\mathbb{F}^{K(K+1)/2}(q)$ with only one non-zero element is a probable (has a positive probability) realization of $\Delta^{(L,L+5)}$. In other words, we show that for any $y$, where $y(e) \neq 0$ for some index $e$ and $y(e') = 0$, $\forall e' \neq e$, we have $\Pr\{\Delta^{(L,L+5)} = y\} > 0$. Let us assume

$$y(e) = a, \quad \text{for some } a \in \mathbb{F}(q)\setminus\{0\}. \tag{51}$$

We know that, by definition, the index $e$ corresponds to a pair $\{i_e, j_e\}$, for some $i_e, j_e \in [1:K]$. Here, we consider two cases for the values of $i_e$ and $j_e$.

• $i_e = j_e$: In this case, the entry $e$ of $\Delta^{(L,L+\Gamma)}$ is a sum of squares. In other words, $\Delta^{(L,L+\Gamma)}(e) = \sum_{\gamma=1}^{\Gamma} w_{i_e(L+\gamma)}^2$. From [15, Page 66], we have

$$\forall a \in \mathbb{F}(q),\ \exists s, t \in \mathbb{F}(q): \ a = s^2 + t^2. \tag{52}$$

Therefore, one possible assignment that creates such a $y$ is as follows:

$$w_{r(L+\gamma)} = \begin{cases} t & r = i_e = j_e,\ \gamma = 1\\ s & r = i_e = j_e,\ \gamma = 2\\ 0 & \text{o.w.} \end{cases} \tag{53}$$

Clearly, this assignment has positive probability, and therefore $\Pr\{\Delta^{(L,L+5)} = y\} > 0$.

• $i_e \neq j_e$: In this case, the entry $e$ is a sum of products of two distinct fresh symbols. In other words, $\Delta^{(L,L+\Gamma)}(e) = \sum_{\gamma=1}^{\Gamma} w_{i_e(L+\gamma)}w_{j_e(L+\gamma)}$, for some $i_e \neq j_e$. We have (see [15, Page 66])

$$\exists s_1, s_2, t_1, t_2 \in \mathbb{F}(q): \ -a^2 = s_1^2 + t_1^2 \ \text{ and } \ -1 = s_2^2 + t_2^2. \tag{54}$$

Therefore, one possible assignment that creates such a $y$ is as follows:

$$w_{r(L+\gamma)} = \begin{cases} a & r = i_e,\ \gamma = 1\\ 1 & r = j_e,\ \gamma = 1\\ s_1 & r = i_e,\ \gamma = 2\\ t_1 & r = i_e,\ \gamma = 3\\ s_2 & r = j_e,\ \gamma = 4\\ t_2 & r = j_e,\ \gamma = 5\\ 0 & \text{o.w.} \end{cases} \tag{55}$$

In particular, one can verify that

$$\Delta^{(L,L+5)}(e) = \sum_{\gamma=1}^{5} w_{i_e(L+\gamma)}w_{j_e(L+\gamma)} = a \times 1 + s_1 \times 0 + t_1 \times 0 + 0 \times s_2 + 0 \times t_2 = a, \tag{56}$$

$$\sum_{\gamma=1}^{5} w_{i_e(L+\gamma)}w_{i_e(L+\gamma)} = a^2 + s_1^2 + t_1^2 + 0 + 0 = a^2 - a^2 = 0, \tag{57}$$

$$\sum_{\gamma=1}^{5} w_{j_e(L+\gamma)}w_{j_e(L+\gamma)} = 1^2 + 0 + 0 + s_2^2 + t_2^2 = 1 - 1 = 0. \tag{58}$$

The other entries of $\Delta^{(L,L+5)}$ are trivially zero.

Since the probability of the assignment (55) is not zero, in this case we also have $\Pr\{\Delta^{(L,L+5)} = y\} > 0$.
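The finite-field facts (52) and (54) used in both cases are easy to verify exhaustively for small fields. The sketch below (our own check, restricted to prime $q$ for simplicity) confirms that every element of $\mathbb{F}(q)$, in particular $-a^2$ and $-1$, is a sum of two squares:

```python
def two_square_decompositions(q: int) -> dict:
    """Map each a in F(q) to one pair (s, t) with s^2 + t^2 = a (mod q)."""
    squares = {(x * x) % q for x in range(q)}
    table = {}
    for a in range(q):
        for s in range(q):
            rem = (a - s * s) % q
            if rem in squares:
                t = next(t for t in range(q) if (t * t) % q == rem)
                table[a] = (s, t)
                break
    return table

for q in (2, 3, 5, 7, 11, 13):
    table = two_square_decompositions(q)
    assert set(table) == set(range(q))  # every element decomposes, as in (52)
    for a, (s, t) in table.items():
        assert (s * s + t * t) % q == a
    # In particular, -1 = q - 1 decomposes, as used in (54).
    assert (q - 1) in table
```

Since $-a^2$ and $-1$ are just elements of $\mathbb{F}(q)$, (54) follows from the same exhaustive check.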

From the two cases above, every vector with one non-zero element is a probable (positive probability) realization of $\Delta^{(L,L+5)}$. We now show that every vector in $\mathbb{F}^{K(K+1)/2}(q)$ is a realization with positive probability of $\Delta^{(L,L+\Gamma)}$ when $\Gamma = 5K(K+1)/2$. First, we write $\Delta^{(L,L+5K(K+1)/2)}$ as

$$\Delta^{(L,\ L+5K(K+1)/2)} = \sum_{i=1}^{K(K+1)/2} \Delta^{(L+5(i-1),\ L+5i)}. \tag{59}$$

Let $y \in \mathbb{F}^{K(K+1)/2}(q)$ be an arbitrary vector. To show that $y$ is a possible realization of $\Delta^{(L,L+5K(K+1)/2)}$ with non-zero probability, we first define $y^{(i)}$, $i \in [1:K(K+1)/2]$, as follows:

$$y^{(i)}(i) = y(i), \tag{60}$$
$$y^{(i)}(j) = 0, \quad j \in [1:K(K+1)/2]\setminus\{i\}. \tag{61}$$

This means $y^{(i)}$ is zero in every index except index $i$, where its value is $y(i)$. We can see that

$$y = \sum_{i=1}^{K(K+1)/2} y^{(i)}. \tag{62}$$

By construction, $y^{(i)}$ is a vector that has at most one non-zero element; thus, it is a probable realization of $\Delta^{(L+5(i-1),\ L+5i)}$. Now if $\Delta^{(L+5(i-1),\ L+5i)} = y^{(i)}$, $\forall i \in [1:K(K+1)/2]$, which happens with positive probability, then because of (62) and (59) we have $\Delta^{(L,\ L+5K(K+1)/2)} = y$; therefore, $\Pr\{\Delta^{(L,\ L+5K(K+1)/2)} = y\} > 0$. Also, because of (49), every element of the matrix $M^{5K(K+1)/2}$ is positive. ∎

###### Corollary 3.

The Markov sequence $X^{(L)}$ has a unique steady state.

###### Proof.

A Markov chain has a unique steady state if there exists an integer $\Gamma$ such that $M^\Gamma$ has an all-positive row [16, Page 176], which is established in Lemma 2. ∎

###### Lemma 3.

As $L \to \infty$, the Markov chain $X^{(L)}$ converges to a random vector with uniform distribution over $\mathbb{F}^{K(K+1)/2}(q)$.

###### Proof.

It is known that if a Markov chain has a steady state, its stationary distribution is equal to its steady-state probabilities [16, Page 174]. We use this fact to find that steady state. As obtained, we know $[M]_{i,j} = \Pr\{\Delta^{(L,L+1)} = y_i - y_j\}$. It is easy to see that for any fixed $i$, the set $\{y_i - y_j,\ j \in [1:q^{K(K+1)/2}]\}$ is equal to $\mathbb{F}^{K(K+1)/2}(q)$. Thus,

$$\sum_{j=1}^{q^{K(K+1)/2}} [M]_{i,j} = 1, \quad \forall i \in [1:q^{K(K+1)/2}]. \tag{63}$$

Let $\pi = \frac{1}{q^{K(K+1)/2}}\mathbf{1}$. It is easy to see that, due to (63), $M\pi = \pi$. Thus, the uniform distribution is the stationary distribution of this Markov chain. ∎

###### Lemma 4.

Let $p^{(L)}$ denote the PMF of $X^{(L)}$ over $\mathbb{F}^{K(K+1)/2}(q)$. Then $p^{(L)}$ converges to $\pi$ entrywise with error $O(\lambda_2^{L-1})$, where $\lambda_2$ is the second largest eigenvalue (in absolute value) of $M$ and $0 \le \lambda_2 < 1$.

We show that

$$p^{(L)} + O(\lambda_2^{L-1})\mathbf{1} = \pi, \tag{64}$$

where