It is well recognized that broadcasting can offer significant bandwidth savings compared to point-to-point communication [1, 2], and could be leveraged in several wireless network applications. Use cases include Wi-Fi (cellular) networks where an access point (a base station) is connected to a set of Wi-Fi (cellular) devices through a wireless broadcast channel, and where devices request messages, such as YouTube videos. Another use case has recently emerged in the context of distributed computing [3, 4], where worker nodes exchange data among themselves to complete computational tasks.
A canonical setup that captures the essence of broadcast channels is the index coding framework. In an index coding instance, a server is connected to a set of clients through a noiseless broadcast channel. The server has a database that contains a set of messages. Each client: 1) possesses a subset of the messages that she already knows, which is referred to as the side information set, and 2) requests a message from the database which is not in her side information set. The server has full knowledge of the requests and side information sets of all clients. A linear index code (or index code for short; in this work, we focus solely on linear index codes) is a linear coding scheme that comprises a set of coded broadcast transmissions which allow each client to decode her requested message using her side information set. The goal is to find an index code which uses the smallest possible number of broadcast transmissions. The key ingredient in designing efficient (i.e., with a small number of transmissions) index codes is the use of coding across messages.
The starting observation of this work is that using coding over broadcast channels can create privacy risks. In particular, a curious client may infer information about the requests and side information sets of other clients, which their owners may deem sensitive. For example, consider a set of clients that use a server to download YouTube videos. Although YouTube videos are publicly available, a client requesting a video about a medical condition may not wish for others to learn her request, or to learn which other videos she has already downloaded.
To illustrate why coding can create privacy leakage, consider the index coding instance shown in Figure 1. A server possesses a set of messages, which we refer to as to . The server is connected to a set of clients: client wants message and has as side information message ; client wants and has ; client wants and has ; and client wants and has . In this case, an optimal (i.e., with the minimum number of transmissions) index code consists of sending transmissions, namely and : it is easy to see that each client can decode the requested message from one of these transmissions using the side information. However, this index code can allow curious clients to violate the privacy of other clients who share the broadcast channel, by learning information that pertains to their requests and/or side information sets. For example, assume that client is curious. Upon learning the two transmissions, client knows that nobody is requesting message . Moreover, she knows that if a client is requesting or (similarly, or ), then this client should have the other message as side information in order to decode the requested message.
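As a concrete illustration of the example above, the following minimal Python sketch checks that two XOR-coded transmissions suffice for four clients arranged in two pairs. The message values, the zero-based indices, and the choice of five messages (so that one message is never requested) are assumptions made for illustration only; they are not taken from Figure 1.

```python
# Illustrative instance (assumed): five messages, four clients in two pairs.
messages = [0b1010, 0b0110, 0b1111, 0b0001, 0b1000]
# Each client is (index of wanted message, index of side-information message).
clients = [(0, 1), (1, 0), (2, 3), (3, 2)]

# Two coded broadcasts: the XOR of each pair. Message 4 is never transmitted.
transmissions = [messages[0] ^ messages[1], messages[2] ^ messages[3]]

def decode(wants, has):
    """Recover the wanted message from the one transmission that involves it."""
    coded = transmissions[wants // 2]   # pair 0 covers messages 0,1; pair 1 covers 2,3
    return coded ^ messages[has]        # XOR out the side information

decoded = [decode(wants, has) for wants, has in clients]
print(decoded == messages[:4])
```

A curious client who sees the coefficients of both coded combinations learns which messages are paired and that the remaining message is never requested, which is the kind of leakage discussed above.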
The solution that we propose to limit this privacy leakage stems from the following observation: it may not be necessary to provide clients with the entire set of broadcast transmissions. Instead, each client can be given access, and learn the coding operations, for only a subset of the transmissions, i.e., the subset that would allow her to decode the message that she requested. Consider again the example in Figure 1. The optimal index code consists of two transmissions. However, each client is able to decode her request using exactly one of the two transmissions. Therefore, if each client only learns the coding coefficients for the transmission that she needs, then she will have no knowledge of the content of the other transmission, and thus would have less information about the requests of the other clients. Limiting the access of each client to just one out of the two transmissions was possible for this particular example; however, it is not the case that every index code has this property.
Our approach in this paper builds on the idea described above. In particular, given an index coding instance that uses $T$ transmissions, we ask: Can we limit the access of each client to at most $k$ transmissions, while still allowing each client to decode her requested message? In other words, for a given index coding instance, what is the best (in terms of number of transmissions) index code that we can design such that each client is able to decode her request using at most $k$ out of these transmissions? Our work attempts to understand the fundamental relation between limiting the access of clients to the coding matrix and the attained level of privacy. In particular, we propose the use of $k$-limited-access schemes, which transform the coding matrix so as to restrict each client to access at most $k$ rows of the transformed matrix, as opposed to the whole of it. Our contributions include:
We formalize the intuition that using $k$-limited-access schemes can indeed increase the attained level of privacy against curious clients. We demonstrate this using two privacy metrics, namely an entropy-based metric and the maximal information leakage. In both cases, we show that the attained level of privacy is linearly dependent on the value of , i.e., privacy increases linearly with the number of rows of the coding matrix that we hide.
We design polynomial time (in the number of clients) universal $k$-limited-access schemes (i.e., schemes that do not depend on the structure of the coding matrix) that require only a simple matrix multiplication. We prove that these schemes are order-optimal in some regimes, in particular when either or (the number of clients) is large. Interestingly, when is larger than a threshold, these schemes make it possible to restrict the amount of access to half of the coding matrix with an overhead of exactly one additional transmission. This result indicates that some privacy-bandwidth trade-off points can be achieved with minimal overhead.
We propose algorithms that depend on the structure of the coding matrix and show that, when and are both small, they provide improved performance with respect to the universal schemes mentioned above. These schemes use a graph-theory representation of the problem, and are optimal for some special instances.
We provide analytical and numerical performance evaluations of our schemes. We show how our proposed -limited-access schemes provide a bandwidth-privacy trade-off, namely how much bandwidth usage (i.e., number of transmissions) is needed to achieve a certain level of privacy (captured by the value of ). We show that our proposed schemes provide a trade-off curve that is close to the lower bound when either or is large. In the case where both and are small, we show through numerical evaluations that our proposed algorithms give an average performance that is close to the lower bound.
The paper is organized as follows. Section II introduces our notation, formulates the problem, and gives a geometric interpretation. Section III discusses how $k$-limited-access schemes limit the privacy leakage. Section IV shows the construction of $k$-limited-access schemes and proves their order-optimality when either or is large. Section V designs algorithms that are better suited to cases where both and are small. Section VI discusses related work and Section VII concludes the paper. Some of the proofs are deferred to the appendices.
II Notation, Problem Formulation and Geometric Interpretation
Notation. Calligraphic letters indicate sets; is the cardinality of ; is the set of integers ; boldface lower case letters denote vectors and boldface upper case letters indicate matrices; given a vector , indicates the -th element of ; given matrices and , indicates that is formed by a set of rows of ; is the all-zero row vector of dimension ; denotes a row vector of dimension of all ones and is the identity matrix of dimension ; is the all-zero row vector of length with a in position ; for all , the floor and ceiling functions are denoted with and , respectively; logarithms are in base 2; refers to the probability of event .
Index Coding. We consider an index coding instance, where a server has a database of messages , where is the set of message indices, and with being the message size, and where operations are done over the binary field. The server is connected through a broadcast channel to a set of clients , where is the set of client indices. We assume that . Each client has a subset of the messages , with , as side information and requests a new message with that she does not have. We assume that the server employs a linear code, i.e., it designs a set of broadcast transmissions that are linear combinations of the messages in . The linear index code can be represented as , where is the coding matrix, is the matrix of all the messages and is the resulting matrix of linear combinations. Upon receiving , client employs linear decoding to decode the requested message .
Problem Formulation. In , it was shown that the index coding problem is equivalent to the rank minimization of an matrix , whose -th row , has the following properties: (i) has a in the position (i.e., the index of the message requested by client ), (ii) has a in the -th position for all , (iii) can have either or in all the remaining positions. For instance, with reference to the example in Figure 1, we would have
where can be either or . It was shown in that finding an optimal linear coding scheme (i.e., with the minimum number of transmissions) is equivalent to completing (i.e., assigning values to the components of ) so that it has the minimum possible rank. Once we have completed , we can use a basis of the row space of (of size ) as a coding matrix . In this case, client can construct as a linear combination of the rows of , i.e., she performs the decoding operation , where is the decoding row vector of chosen such that . Finally, client can successfully decode by subtracting from the messages corresponding to the non-zero entries of (other than the requested message). We remark that any linear index code that satisfies all clients with transmissions (where is not necessarily optimal), and which can be obtained by any index code design algorithm [6, 7, 8], corresponds to a completion of (i.e., given , we can create a corresponding in polynomial time).
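The decoding step can be sketched in a few lines of Python. Rows over the binary field are stored as integer bitmasks (bit j stands for message j), and `gf2_solve` is a hypothetical helper, not part of the paper, that finds which rows of a coding matrix XOR to a client's target row. The two-row matrix in the usage lines is an assumed toy instance.

```python
def gf2_solve(rows, target):
    """Return indices of `rows` whose XOR equals `target`, or None if none exists.

    Rows and target are integers used as bit vectors over GF(2).
    """
    basis = []  # pairs (reduced vector, bitmask of original rows that produce it)
    for i, r in enumerate(rows):
        combo = 1 << i
        for b, c in basis:
            if r ^ b < r:          # true exactly when r contains b's leading bit
                r, combo = r ^ b, combo ^ c
        if r:
            basis.append((r, combo))
    combo = 0
    for b, c in basis:
        if target ^ b < target:
            target, combo = target ^ b, combo ^ c
    if target:
        return None                # target is outside the row space
    return [i for i in range(len(rows)) if (combo >> i) & 1]

# Assumed toy coding matrix: row 0 mixes messages 0 and 1, row 1 mixes messages 2 and 3.
A = [0b00011, 0b01100]
print(gf2_solve(A, 0b00011))   # a client whose row is "message 0 plus side info 1"
print(gf2_solve(A, 0b00001))   # an uncoded target that this code cannot produce
```

The returned index list plays the role of the support of the decoding vector: the client XORs those transmissions and then removes her side information.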
In our problem formulation we assume that we start with a given matrix of rank , i.e., we are given distinct vectors that belong to a -dimensional subspace. Using a basis of the row space of the given , we construct . Then, we ask: Given distinct vectors , , in a -dimensional space, can we find a minimum-size set with vectors, such that each can be expressed as a linear combination of at most vectors in (with )? The vectors in form the rows of the coding matrix that we will employ. Then by definition, client will be able to reconstruct using the matrix . We can equivalently restate the question as follows: Given a coding matrix , can we find , with as small as possible, such that and each row of can be reconstructed by combining at most rows of ? Note that corresponds to the conventional transmission scheme of an index coding problem for which . In the remainder of the paper we will refer to a scheme that chooses to be the coding matrix as -limited-access scheme.
Transmission Protocol. In order to realize the privacy benefits of using -limited-access schemes – which we will thoroughly illustrate in Section III – we propose a different transmission protocol for the index coding setup. Figure 2 shows both the conventional and the proposed transmission protocols. In the conventional protocol, the server designs a set of packets, each corresponding to an equation from the set of equations . As shown in Figure 2(a), packet consists of (i) a payload which contains the linear combination and (ii) a header which contains the coefficients used to create the equation. In the conventional protocol, the server sends these packets (both headers and payloads) on the broadcast channel to all clients. Our proposed protocol, however, operates differently. Specifically, the server generates packets which correspond to the set of equations in a way that is similar to the conventional protocol. The server then sends only the payloads of these packets on the broadcast channel. Differently, the server sends the coefficients corresponding to only to client using a private key or on a dedicated private channel (e.g., the same channel used by to convey her request to the server). Thus, using a -limited-access scheme incurs an extra transmission overhead to privately convey the coding vectors. In particular, the total number of transmitted bits can be upper bounded as while the total number of transmitted bits C using a conventional scheme is . The extra overhead incurred is negligible in comparison to the broadcast transmissions that convey the encoded messages when and are both , which is a reasonable assumption for large file sizes (for instance, when sharing YouTube videos).
Geometric Interpretation. The geometric interpretation of our problem is depicted in Figure 3. An index code corresponds to a particular completion of the matrix . Therefore, the set of row vectors in lies in the row span of (which is of dimension ). We denote this subspace of dimension by . The problem of finding a matrix can be interpreted as finding a set of subspaces, each of dimension at most , such that each row vector , , is covered by at least one of these smaller subspaces. Once these subspaces are selected, then the rows of are taken as the union of the basis vectors of all these subspaces. Client is then given the basis vectors of subspace , i.e., the one which covers , instead of the whole matrix . Therefore would have perfect knowledge of instead of . Having less information about naturally translates to less information about the requests of other clients, as we more formally discuss in the next section.
III Achieved Privacy Levels
In this section, we investigate and quantify the level of privacy that -limited-access schemes can achieve compared to a conventional index coding scheme (i.e., when each client has access to the entire coding matrix). In what follows, we consider the setup described in the previous section and suppose that client is curious, i.e., by leveraging the (at most) rows that she receives, she seeks to infer information about client . Specifically, we are interested in quantifying the amount of information that can obtain about (i.e., the identity of the request of ) as a function of .
We assume that the index coding instance is random, i.e., we consider the requests and side information sets of clients as random variables and denote them as and , respectively. The operation of the server is shown in Figure 4 and is described as follows:
Step-1: The server obtains the information about the requests and side information sets of all clients .
Step-3: The server then applies the -limited-access scheme to obtain , where is a deterministic mapping from to (see Section IV for the construction of ). This implies that is a deterministic function of and (i.e., the parameter of the scheme).
Step-4: The server sends to client . If multiple can be selected, then the server picks and transmits one such matrix uniformly at random, independently of the underlying which might have generated this .
We are now interested in quantifying the level of privacy that is achieved by the protocol described above. Towards this end, we use two privacy metrics, namely an entropy-based metric and the maximal information leakage.
III-A Entropy-Based Privacy Metric
The entropy-based privacy metric is inspired by the geometric interpretation of our problem in Figure 3. We let (respectively, ) be the random variable associated with the subspace spanned by the rows of the coding matrix (respectively, spanned by the row vectors of ). Client receives the matrix and as such she knows . Given this, we now define the entropy-based privacy metric and evaluate it for the proposed protocol.
The entropy-based privacy metric is defined as
and quantifies the amount of uncertainty that has about the subspace spanned by the rows of the index coding matrix .
Before characterizing , we state the following lemma, which is proved in Appendix A.
Given a subspace of dimension , let be the set of subspaces of dimension where . Then is equal to
Assume an index coding setting with observing a particular subspace and a number of transmissions for the -limited access scheme. Moreover, we consider a stronger adversary (i.e., curious client) and assume that she also knows the specific realization of . Given this, we can compute
where: (i) the equality in follows because is a deterministic function of and , which is the parameter of the scheme (see Step-3); (ii) the equality in follows by assuming that the underlying system maintains a uniform distribution across all feasible -dimensional subspaces of ; (iii) the equality in follows by virtue of Lemma III.1. We note that when , then the quantity in (1) decreases linearly with , i.e., as intuitively expected, the fewer rows of the coding matrix learns, the less she can infer about the subspace spanned by the rows of the coding matrix . This suggests that, by increasing , has less uncertainty about . Note also that is zero when ; this is because, under this condition, receives the entire index coding matrix, i.e., , and hence she is able to perfectly reconstruct the subspace spanned by its rows. However, although when , might still have uncertainty about . Quantifying this uncertainty is an interesting open problem; this uncertainty, in fact, depends on the underlying system, e.g., on the index code used by the server and on the distribution with which the index code matrix is selected.
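To get a feel for how the number of candidate subspaces shrinks as a client learns more rows, the sketch below uses the standard Gaussian-binomial count of subspaces of a given dimension that contain a fixed smaller subspace. The ambient space GF(2)^m and the parameter names `m`, `t` and `k` are assumptions for illustration; this is not claimed to be the exact expression in Lemma III.1.

```python
from math import log2

def count_superspaces(m, t, k):
    """Number of t-dimensional subspaces of GF(2)^m containing a fixed k-dimensional one."""
    num, den = 1, 1
    for l in range(t - k):
        num *= 2 ** (m - k) - 2 ** l
        den *= 2 ** (t - k) - 2 ** l
    return num // den  # the Gaussian binomial is always an integer

def uncertainty_bits(m, t, k):
    """Entropy (in bits) of a uniform choice among those subspaces."""
    return log2(count_superspaces(m, t, k))

# Assumed toy parameters: ambient dimension 8, code subspace of dimension 4.
for k in range(0, 5):
    print(k, round(uncertainty_bits(8, 4, k), 2))
```

Under the uniform assumption, the printed uncertainty drops as the known dimension grows and reaches zero once the client knows the whole subspace.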
III-B Maximal Information Leakage
The second metric that we consider as our privacy metric is the Maximal Information Leakage (MIL). Given two discrete random variables and with alphabets and , the MIL from to is denoted by and defined as
where the second equality is shown in . The MIL metric captures the amount of information leaked about through to an adversary, who is interested in estimating a (possibly probabilistic) function of . This is captured by the fact that forms a Markov chain as shown in the expression in (2). The metric considers a worst-case such adversary, that is, an adversary who is interested in computing a function for which the maximum information can be leaked out of . The result in shows that this quantity depends only on the joint distribution of and . The following properties of the MIL are useful:
(Property 1): If , then ,
(Property 2): ,
(Property 3): .
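For finite alphabets, the MIL has a closed form that is easy to evaluate numerically: the logarithm of the sum, over outputs y, of the largest conditional probability P(y | x) among inputs x that occur with positive probability. The sketch below assumes this standard closed form and a joint distribution given as a nested list; the function name and input layout are illustrative choices.

```python
from math import log2

def maximal_leakage(joint):
    """MIL in bits from X to Y, where joint[x][y] = P(X = x, Y = y)."""
    p_x = [sum(row) for row in joint]
    total = 0.0
    for y in range(len(joint[0])):
        # Largest conditional probability P(y | x) over inputs x that can occur.
        total += max(row[y] / p for row, p in zip(joint, p_x) if p > 0)
    return log2(total)

print(maximal_leakage([[0.5, 0.0], [0.0, 0.5]]))      # Y reveals X exactly
print(maximal_leakage([[0.25, 0.25], [0.25, 0.25]]))  # Y independent of X
```

The two usage lines cover the extreme cases: full leakage (bounded by the logarithm of the alphabet size) and no leakage.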
To describe how we use the MIL as a privacy metric in our setup, we first need to define what are the corresponding random variables and , and then argue that the estimation of client of the requests of other clients forms a Markov chain as required by the MIL definition. To do so, we first define the following sets:
1) Given , and an integer , let be the set of all possible sub-matrices of with exactly rows, that client can use to reconstruct the vector :
2) Given , and , let be the set of all possible sub-matrices of with the minimum possible number of rows, such that client with side information can decode :
Since the requests and the side information sets are considered as random variables, then all subsequently generated codes, namely , and can be treated as random variables as well. We denote the corresponding random variables of these quantities as , and respectively. In other words, for a given realization of and , the corresponding realizations of the aforementioned codes used by the server are , and .
When using conventional index codes (i.e., without -limited-access schemes), client (i.e., the curious client and hence the adversary) would try to infer information about from observing and given her information of . Therefore, one can think of client estimate of as being a particular estimation function, the input of which is . Differently, after using -limited-access schemes, client would only have observed instead of . Therefore, in the context of MIL, one choice of the variables and is and respectively. The function would therefore be client ’s estimate of out of . The following proposition shows that this choice of variables , and allows us to use the MIL as a metric.
The following Markov chain holds
conditioned on the knowledge of in every stage of the chain.
Proof: We have the following:
holds since is a deterministic function of (see also Step-3 of the proposed protocol);
holds since , independent of , as described in Step-4 of the proposed protocol.
We define as our MIL privacy metric (we use the notation to denote that the variables and are conditioned on ). The quantity gives the maximum amount of information that can extract about given the knowledge of . The following theorem, which is proved in Appendix B, provides a guarantee on .
Using the MIL, the attained level of privacy against a curious client when -limited-access schemes are used is
The quantity in (4) characterizes the maximum amount of information that can be leaked to a curious client when $k$-limited-access schemes are used. It is clear that decreasing $k$ would decrease this amount of information; this aligns with the intuition that the fewer rows a server gives to a client, the less information the client is able to infer about other clients sharing the broadcast domain. In order to shed more light on the benefits of using $k$-limited-access schemes, one could compare the quantity with the MIL obtained when $k$-limited-access schemes are not used, i.e., when a client observes the whole matrix . Let this quantity be denoted as . Then we have the following result, which is proved in Appendix C.
Using the MIL, the attained level of privacy against a curious client for a conventional index coding setup is
The results in Theorem III.3 and Theorem III.4 can be interpreted with the help of Figure 5. The -limited-access schemes achieve privacy gains as compared to conventional index codes, when the two bounds in (4) and (5) strictly mismatch. A sufficient (but not necessary) condition for this is to select .
IV Construction of $k$-limited-access Schemes
In this section, we focus on designing -limited-access schemes and assessing their theoretical performance in terms of number of additional transmissions required with respect to a conventional index coding scheme. Recall that we are given a coding matrix that requires transmissions. Then, we seek to construct a matrix , so that , and each client needs to access at most rows of to decode her requested message. In particular, we aim at constructing matrices with as small as possible. Trivially, . Towards this end, we first derive upper and lower bounds on . Our main result is stated in the theorem below.
Given an index coding matrix with , it is possible to transform it into with , such that each client can decode her requested message by combining at most rows of , if and only if
Moreover, we provide polynomial time (in ) constructions of such that:
When , then
When , then
Proof: The lower bound on in (6) is proved in Appendix D. In particular, the bound in (6) says that, if we are allowed to combine at most out of the vectors, then we should be able to create a sufficient number of vectors. The two upper bounds on in (7) and (8) are proved in Section IV-A, where we give explicit constructions for .
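The counting argument behind the lower bound can be explored numerically. The sketch below assumes one natural reading of that argument: combining at most k out of the available rows must yield at least as many distinct non-empty combinations as there are distinct client vectors. The function name, the parameter names `n`, `T`, `k`, and the exact inequality are assumptions for illustration and should not be read as the precise expression in (6).

```python
from math import comb

def min_rows_counting(n, T, k):
    """Smallest number of rows t >= T with sum_{i=1..k} C(t, i) >= n."""
    if k < 1:
        raise ValueError("k must be at least 1")
    t = T
    while sum(comb(t, i) for i in range(1, k + 1)) < n:
        t += 1
    return t

# Assumed toy setting: T = 4 original transmissions and 15 distinct client vectors.
for k in range(1, 5):
    print(k, min_rows_counting(15, 4, k))
```

Smaller values of k force more rows, which is the bandwidth-privacy trade-off in its simplest form.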
We note that, as expected, the smaller the value of that we require, the larger the value of that we need to use. Trivially, for we would need , i.e., the server would need to send uncoded transmissions. Thus, there is a trade-off between the bandwidth (measured as the number of broadcast transmissions) and privacy (captured by the value of that we require). Interestingly, when , with just one extra transmission, i.e., , we can restrict the access of each client to at most half of the coding matrix, independently of the coding matrix . In other words, for this regime, we can achieve a certain level of privacy with minimal overhead. However, as we further reduce the value of , the overhead becomes more significant. Moreover, the results in Theorem IV.1 also imply that our constructions are order-optimal in the case of large values of (when ). (Note that is always , i.e., the number of distinct vectors for a given is at most . The case of large values of corresponds to the case where this bound on the number of distinct vectors is not loose: there is a corresponding lower bound on , i.e., . Therefore, the case of large values of corresponds to .) In addition, when , our scheme is at most one transmission away from the optimal number of transmissions, and this holds for any value of . This is shown in the following lemma, which is proved in Appendix D.
Consider an index coding setup. We have
When and for any value of , then , i.e., the provided construction is order-optimal.
Figure 6 shows the trade-off exhibited by our proposed $k$-limited-access schemes between bandwidth usage () and the attained privacy (); we use as a proxy for the amount of attained privacy against a curious client (see Section III). The figure shows the performance of our constructions in Theorem IV.1 (labeled as Scheme-1), as well as the lower bound in (6) (labeled as LB) and an upper bound which corresponds to uncoded transmissions (labeled as UB). Figure 6(a) confirms the order-optimality of our constructions when . In addition, our schemes perform similarly well when is sufficiently large (and not necessarily equal to ) as shown in Figure 6(b) where . Finally, Figure 6(c) shows the performance for a small value of (). The figure shows that our proposed constructions do not perform as well when and are small, a case which we study in more detail in Section V.
We now conclude this section by giving explicit constructions of the matrix and proving the two upper bounds on in (7) and (8). Our design of allows us to reconstruct any of the vectors of size . As such, our constructions are universal, in the sense that the matrix that we construct does not depend on the specific index coding matrix .
Recall that is full rank and that the -th row of can be expressed as , where is the coefficients row vector associated with . We next analyze two different cases/regimes, which depend on the value of .
Case I: . When , let
which results in a matrix with , matching the bound in (7). We now show that each can be reconstructed by combining up to vectors of . Let be the Hamming weight of . If , then we can reconstruct as , which involves adding rows of . Differently, if , then we can reconstruct as , where is the bitwise complement of . In this case, reconstructing involves adding rows of .
When , then it is sufficient to send uncoded transmissions, where the -th transmission satisfies . In this case has access only to the -th transmission, i.e., . This completes the proof of the upper bound in (7).
Example: We show how the scheme works via a small example, where and . In this case, we have
If , then it can be reconstructed as with rows of used in the reconstruction. Differently, if , then it can be reconstructed as with again rows of used in the reconstruction.
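The Case I construction can be checked exhaustively for a small size. The sketch below assumes that the transformation appends a single all-ones combination to the original rows, and that the switch between the two reconstruction rules happens at Hamming weight ceil(T/2); the names `build_case1`, `select_case1` and `xor_rows`, and the choice T = 5, are hypothetical and used only for illustration.

```python
from itertools import product

def build_case1(T):
    """Rows of the selection matrix [identity ; all-ones], each a length-T tuple."""
    rows = [tuple(int(i == j) for j in range(T)) for i in range(T)]
    rows.append((1,) * T)
    return rows

def select_case1(d):
    """Indices of at most ceil(T/2) rows whose XOR equals the coefficient vector d."""
    T, w = len(d), sum(d)
    if w <= (T + 1) // 2:
        return [i for i in range(T) if d[i]]           # unit rows on the support of d
    return [i for i in range(T) if not d[i]] + [T]     # all-ones row plus the complement

def xor_rows(P, idx):
    """XOR (sum over GF(2)) of the selected rows of P."""
    out = [0] * len(P[0])
    for i in idx:
        out = [a ^ b for a, b in zip(out, P[i])]
    return tuple(out)

T = 5
P = build_case1(T)
ok = all(
    xor_rows(P, select_case1(d)) == d and len(select_case1(d)) <= (T + 1) // 2
    for d in product((0, 1), repeat=T)
)
print(len(P), ok)
```

Multiplying this selection matrix by the original coding matrix gives the transformed code; the check confirms that every coefficient vector is reachable with at most half of the rows (rounded up) at the cost of one extra row.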
Case II: . Let and . If divides , then , , otherwise and . Then, we can write
where, for , the matrix , of dimension , is constructed as follows
where , of dimension , has as rows all non-zero vectors of length . Therefore, . Similarly, the matrix , of dimension , is constructed as follows
where , of dimension , has as rows all non-zero vectors of length . Therefore, .
In other words, the matrix is constructed as a block-diagonal matrix, with the diagonal elements being for all . Therefore, equation (8) holds by computing
What remains is to show that any vector can be reconstructed by adding at most vectors of . To show this, we prove that any vector can indeed be constructed with the proposed design of . We note that we can express the vector as , where are parts of the vector each of length , while is the last part of of length . Then, we can write where for , and is the set of indices for which is not all-zero. According to the construction of , for all , the corresponding vector is one of the rows in . The proof concludes by noting that . This is true because, if does not divide , then ; otherwise, but (i.e., does not exist), therefore . This completes the proof of the upper bound in (8).
Example: We show how the scheme works via a small example, where and . For this particular example, we have and . Thus, the idea is that, to reconstruct a vector , we treat as disjoint parts; the first are of length and the remaining part is of length . We then construct as disjoint sections, where each section allows us to reconstruct one part of the vector. Specifically, we construct
Any vector can be reconstructed by picking at most vectors out of , one from each section. For example, let . This vector can be reconstructed by adding vectors number , and from .
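The block-diagonal idea of Case II can be sketched as follows. The code assumes a block length of ceil(T/k), with a shorter final block when that length does not divide T; each block stores every non-zero pattern of its length, so any coefficient vector is rebuilt from one row per non-zero block. The function names and the parameters T = 8, k = 3 are illustrative assumptions rather than the example used in the text.

```python
from itertools import product

def build_case2(T, k):
    """Rows of a block-diagonal selection matrix; returns (rows, block_length)."""
    L = -(-T // k)  # ceil(T / k): assumed block length
    rows = []
    for start in range(0, T, L):
        width = min(L, T - start)
        for bits in product((0, 1), repeat=width):
            if any(bits):  # keep every non-zero pattern of this block
                rows.append((0,) * start + bits + (0,) * (T - start - width))
    return rows, L

def select_case2(d, L):
    """One row per non-zero block of d; their XOR equals d."""
    T = len(d)
    picked = []
    for start in range(0, T, L):
        part = tuple(d[start:start + L])
        if any(part):
            picked.append((0,) * start + part + (0,) * (T - start - len(part)))
    return picked

T, k = 8, 3
rows, L = build_case2(T, k)
row_set = set(rows)
ok = True
for d in product((0, 1), repeat=T):
    picked = select_case2(d, L)
    total = tuple(sum(col) % 2 for col in zip(*picked)) if picked else (0,) * T
    ok &= total == d and len(picked) <= k and all(p in row_set for p in picked)
print(len(rows), ok)
```

The number of rows grows exponentially in the block length, which is why this universal design becomes expensive as k shrinks.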
V Constructions for Small Values of $k$ and $n$
In Section IV, we have proved that, independently of the value of , if , then it is sufficient to add one additional transmission to the transmissions of the conventional index coding scheme. Moreover, the analysis provided in Lemma IV.2 showed the order-optimality of our universal scheme in Theorem IV.1 (referred to as Scheme-1) for values of when is large (i.e., exponential in ). Figure 7 shows the performance of Scheme-1 in Theorem IV.1 as a function of the values of for , with in Figure 7(a) and in Figure 7(b). The performance of Scheme-1 was obtained by averaging over 1000 random index coding instances. In each instance, a code is constructed using the scheme described in Section IV-A, and only the rows actually used by the clients are retained. The performance of the scheme is finally computed by the average number of rows retained in those 1000 iterations. Figure 7 shows that our proposed scheme performs well not only for the case of large (i.e., ) but also for lower values of . However, Figure 7 also suggests that for small values of both and (note the left-half of the plot in Figure 7(a)), we need to devise schemes that better adapt to the specific values of the index coding matrix and vectors (recall that Scheme-1 is universal, and hence independent of the value of ). We next propose and analyze the performance of such algorithms.
V-A Special Instances
We first represent the problem through a bipartite graph as follows. We assume that the rank of the matrix is . Then, there exists a set of linearly independent vectors in ; without loss of generality, we denote them as to . Therefore, each vector can be expressed as a linear combination of some/all vectors from ; we denote these vectors as the component vectors of . We can then represent the problem as a bipartite graph with and , where represents the vector for , represents the vector for , and an edge exists from node to node if is one of the component vectors of . Figure 9 shows an example of such a graph, where and . For instance, (i.e., ) can be reconstructed by adding (i.e., ). Given a node in the graph, we refer to the sets and as the outbound and inbound sets of , respectively: the inbound set contains the nodes which have edges outgoing to node , and the outbound set contains the nodes to which node has outgoing edges (i.e., the nodes each of which has an incoming edge from ). Nodes on either side of the bipartite graph have either inbound or outbound sets. For instance, with reference to Figure 9, and . For this particular example, there exists a scheme with which can reconstruct any vector with at most additions. The matrix which corresponds to this solution consists of the following vectors: , , , , and . It is not hard to see that each vector in can be reconstructed by adding at most vectors in . The vectors in that are not in can be aptly represented as intermediate nodes on the previously described bipartite graph. These intermediate nodes are shown in Figure 9 as highlighted nodes. Each added node represents a new vector, which is the sum of the vectors associated with the nodes in its inbound set. We refer to the process of adding these intermediate nodes as creating a branch, which is defined next.
Given an ordered set of nodes, where precedes for , a branch on is a set of intermediate nodes added to the graph with the following connections: node has two incoming edges from and , and for , has two incoming edges from nodes and .
For the example in Figure 9, we created branches on two ordered sets, and . Once the branch is added, we can change the connections of the nodes in in accordance to the added vectors. For the example in Figure 9, we can replace in with only .
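Read as an operation on vectors, creating a branch amounts to taking running sums over GF(2): the first intermediate node is the sum of the first two vectors in the ordered set, and each later node adds the next vector to the previous intermediate node. The sketch below follows that reading; vectors are stored as integer bitmasks, and the function name and the three-vector input are hypothetical.

```python
def create_branch(vectors):
    """Intermediate vectors of a branch on an ordered list of GF(2) vectors (ints)."""
    branch = []
    acc = vectors[0]
    for v in vectors[1:]:
        acc ^= v              # previous intermediate node plus the next vector
        branch.append(acc)
    return branch

# Assumed ordered set of three unit vectors.
print(create_branch([0b001, 0b010, 0b100]))
```

A vector that previously needed all the members of a prefix of the ordered set can then be reconnected to the single intermediate node holding that running sum.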
Using this representation, we have the following lemma.
If for some permutation of , then this instance can be solved by exactly transmissions for any .
Proof: One solution of such instance would involve creating a branch on the set . The scheme used would have the matrix with its -th row for . Note that and for all . Moreover, for , if for some , then for all . If we let be the maximum index for which , then we have