1. The Work in Context
1.1. Data Sharing among Firms
In today’s data-oriented economy (OECD, 2015), countless applications rely on the ability to extract statistically significant models from acquired user data. Still, firms are hesitant to share information with other firms (Richter and Slowinski, 2019; Commission et al., 2018), as data is viewed as a resource that must be protected. This is in tension with the paradigm of the wisdom of the crowds (Surowiecki, 2005), which emphasizes the added predictive value of aggregating multiple data sources. As early as 2001, the authors in (Banko and Brill, 2001) (note also a similar approach in (Halevy et al., 2009)) concluded that
“… a logical next step for the research community would be to direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora.”
Two popular frameworks to address issues arising in settings where data is shared are multiparty computation (Cramer et al., 2015) and differential privacy (Dwork, 2008). However, these paradigms focus on privacy (whether of the individual user or of the firm’s data bank), and do not answer the basic conundrum of sharing data with competing firms: on the one hand, cooperation enables the firm to enrich its own models, but at the same time it enables other firms to do so as well. A firm is thus tempted to game the mechanism to allow itself better inference than other firms. We call this behavior exclusivity attacks. Even if supplying intentionally false information could be a legal risk, the nature of data processing (rich with outliers, spam accounts, natural biases) allows firms to have “reasonable justification” to alter the data they share with others.
In this work, we present a model of collaborative information sharing between firms. The goal of every firm is first to have the best available model given the aggregate data. As a secondary goal, every firm wishes the others to have a downgraded version of the model. An appropriate framework to address this objective is the non-cooperative computation (NCC) framework, introduced in (Shoham and Tennenholtz, 2005). The framework was considered with respect to one-shot data aggregation tasks in (Kantarcioglu and Jiang, 2013).
1.2. Open and Long-term Environments
In our work, we present a general communication protocol for collaborative data sharing among firms, which can be associated with any specific machine learning or data aggregation algorithm. The protocol has an online nature, in that any participating firm may send (additional) data points at any time. This is in contrast with previous NCC literature, which focuses on one-shot data-sharing procedures. The long-term setting yields two, somewhat contradicting, attributes:
A firm may send multiple subsequent inputs to the protocol, using it to learn how the model’s parameters change after each contribution. For an attacker, this allows better inference of the true model’s parameters, without revealing its true data points, as we demonstrate in Example 1 below.

A firm is not only interested in attaining the current correct parameters of the model, but also has a future interest in being able to attain correct answers, given that more data is later added by itself and its competitors. This has a chilling effect on attacks, as even a successful attack in the one-shot case could result in data corruption. For example, a possible short-term attack could be for a firm to send its true data, attain the correct parameters, and then send additional garbage data. Since we do not have built-in protection against such actions in the mechanism (for reasons further explained in Remark 1), this would result in data corruption for the other firms. Nevertheless, if the firm itself is interested in attaining meaningful information from the mechanism in the future, it would be disincentivized to do so.
We now give an example demonstrating the first point. In (Kantarcioglu and Jiang, 2013), the authors consider the problem of collaboratively calculating the average of data points. They show in their Theorem 4.6 and Theorem 4.7 that whether the number of different data points is known is essential to the truthfulness of the mechanism. When the number of data points is unknown, the denominator of the average term is unknown, and it is impossible for an attacker to know with certainty how to attain the true average from the average the mechanism reports given a false input of the attacker. We now show that in a model where it is possible to send multiple requests (in fact, two), it is possible to report false information and attain the correct average:
Example 1.
Consider a firm with some data points with a total sum and number of points . Other firms have data points with a total sum and number of points . Assume . (Footnote 1: These assumptions are not required for the attack scheme to succeed, but make for a simpler demonstration.) Instead of reporting , the firm first reports , receives an average , then reports and receives the updated average . The average that others, following the mechanism as given, attain is , the true average is , and they are different by our assumption on . The firm is thus successful in misleading others. Moreover, the firm can infer the true average. Given
the firm can calculate
and thus has all the information required to calculate the true average. (Footnote 2: The only case where is when . In this case, upon having , we can choose , and a similar argument shows that we can infer the true average.)
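The inference step of this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the construction used above: the names `infer_true_average`, `r1`, `r2`, `a1` and `a2` are hypothetical, and the two false reports are arbitrary choices. The only requirement is that the two averages returned by the mechanism differ, so that the two equations in the others' unknown sum and count can be solved.

```python
from fractions import Fraction


def infer_true_average(own_sum, own_count, r1, a1, r2, a2):
    """Recover the true overall average from two false reports.

    r1, r2 are the (sum, count) pairs the attacker reported (hypothetical
    names); a1, a2 are the averages the mechanism returned after each one.
    """
    s1, n1 = r1
    s2, n2 = r2
    if a1 == a2:
        raise ValueError("averages coincide; choose a different second report")
    # a1 = (S' + s1) / (N' + n1) and a2 = (S' + s1 + s2) / (N' + n1 + n2),
    # where S', N' are the others' unknown sum and count. Solve for both.
    others_count = (a2 * (n1 + n2) - a1 * n1 - s2) / (a1 - a2)
    others_sum = a1 * (others_count + n1) - s1
    return (own_sum + others_sum) / (own_count + others_count)


# Illustrative numbers (assumptions): the others hold sum 30 over 10 points,
# the attacker truly holds sum 50 over 5 points.
others_sum, others_count = 30, 10
own_sum, own_count = 50, 5
r1, r2 = (0, 1), (1, 1)  # two false reports
a1 = Fraction(others_sum + r1[0], others_count + r1[1])
a2 = Fraction(others_sum + r1[0] + r2[0], others_count + r1[1] + r2[1])

inferred = infer_true_average(own_sum, own_count, r1, a1, r2, a2)
true_average = Fraction(own_sum + others_sum, own_count + others_count)
print(inferred, true_average, a2)
```

Exact fractions are used so that the equality between the inferred and the true average is not blurred by floating-point rounding. The last average the other firms observe is `a2`, which differs from the true average in this run.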
Remark 1.
Why should we not simply forbid multiple subsequent updates by a firm? As noted in (Yokoo et al., 2004; Gafni et al., February 7-12; Afek et al., 2017), modern internet-based environments lack clear identities and allow for multiple inputs by the same agent using multiple identities. A common distinction in blockchain networks separates public (“permissionless”) and private (“permissioned”) networks (Liu et al., 2019), where public networks allow open access for everyone, while private networks require additional identification for participation. In both cases, however, it is impossible to totally prevent false-name manipulation, where a firm uses multiple identities to send its requests. Therefore, any “simple” solution of the problem demonstrated in Example 1 is impossible. The mechanism does not know whether multiple subsequent updates are really sent by different firms, or whether they are in fact “sock puppets” of a single firm. The mechanism therefore cannot adjust appropriately (e.g., drop any request after the first one). In this work, we assume a firm may control up to identities, and so in the formal model, we allow up to subsequent updates of a single firm. The false identities are not part of the formal model: instead, they are encapsulated by giving firms this ability to update times subsequently.
1.3. Our Results

We define two long-term data-sharing protocols (the continuous and periodic communication protocols) for data sharing among firms. The models differ in how communication is structured temporally (whether the agents can communicate at any time, or are asked for their inputs at given times). Each model can be coupled with any choice of algorithm to aggregate the data shared by the agents.

We give a condition for NCC-vulnerability of an algorithm (given the communication model) in Definition 1. A successful NCC attack is one that (i) can mislead the other agents, and (ii) maintains the attacker’s ability to infer the true algorithm output. We give a stronger condition of NCC-vulnerability* that can moreover (i*) mislead the other agents in every possible scenario. As a simple example of using these definitions, we show in Appendix B that finding the maximum over agent reports is NCC-vulnerable but not NCC-vulnerable*.

For the k-center problem, we show that it is vulnerable under continuous communication but not vulnerable under periodic communication. Moreover, we show that it is not vulnerable* even in continuous communication, using a notion of explicitly-lying attacks.

For Multiple Linear Regression, we show that it is vulnerable* under continuous communication but not vulnerable under periodic communication. The vulnerability* in continuous communication depends on the number of identities an attacker can control: we show a form of attack such that an attacker with identities (where is the dimension of the feature space) is guaranteed to have an attack, and an attacker with fewer than identities cannot attack.
The vulnerability(*) results for the continuous communication protocol are summarized in Table 1. Both algorithms are not vulnerable(*) under the periodic communication protocol.
Algorithm          Vulnerable     Vulnerable*
Linear Regression  Yes, for any   
k-Center           Yes, for any   No
We overview related work in Appendix A.
2. Model and Vulnerability Notions
We consider a system where agents receive factual updates containing data points or states of the world. The agents apply their reporting strategy, performing ledger updates. Upon any ledger update, the ledger distributes the latest aggregate parameter calculation using , the computation algorithm.
Formally, let be a set of agents. An update is of some type, depending on the computational problem. An update with metadata complements an update with an agent , and a type , where “Factual” updates represent a factual state of nature observed by an agent, and “Ledger” updates are what the agent shares with the ledger, which may differ from what she factually observes. We note that the ledger (which for simplicity we assume is a centralized third party) does not make the data public, but only shares the algorithm’s updated outputs according to the protocol’s rules. The computation algorithm is an algorithm that receives a series of updates of any length and outputs a result. In the continuous communication protocol, we have that algorithm outputs are shared with all agents upon every ledger update.
In this section and Sections 3 and 4 we focus on the continuous communication protocol. The continuous communication protocol simulates a system where agents may push updates at any time, initiated by them and not by the system manager. We model this by allowing them to respond to any change in the state of the system, including responding to their own ledger updates. The only limit to an agent endlessly sending updates to the ledger is that we restrict it to update at most times subsequently. The continuous communication protocol is a messaging protocol between nature, the agents, and the ledger. A particular protocol run is instantiated with nature-input , which is a series of some length with each element being of the form , which is a tuple comprised of agent and an update .
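To make the message flow concrete, the following Python sketch models a ledger in the continuous communication protocol. It is an illustration under assumptions rather than the formal protocol: the class name `Ledger`, the parameter `max_consecutive` (standing in for the bound on subsequent updates), and the choice to enforce that bound inside the ledger are all hypothetical. The ledger stores the updates privately and shares only the algorithm output after each ledger update.

```python
class Ledger:
    """Toy ledger: keeps updates private, shares only algorithm outputs."""

    def __init__(self, algorithm, max_consecutive):
        self.algorithm = algorithm              # maps a list of updates to an output
        self.max_consecutive = max_consecutive  # assumed bound on subsequent updates
        self.updates = []                       # ledger updates, never published
        self.outputs = []                       # outputs shared with all agents
        self._last_agent = None
        self._streak = 0

    def submit(self, agent, update):
        # Count how many updates in a row this agent has sent.
        streak = self._streak + 1 if agent == self._last_agent else 1
        if streak > self.max_consecutive:
            raise ValueError("too many subsequent updates by one agent")
        self._last_agent, self._streak = agent, streak
        self.updates.append(update)
        output = self.algorithm(self.updates)
        self.outputs.append(output)             # shared upon every ledger update
        return output


def maximum(updates):
    """Example aggregation algorithm: maximum over all reported values."""
    return max(value for update in updates for value in update)


ledger = Ledger(maximum, max_consecutive=2)
ledger.submit("a", {1, 5})  # a truthful agent forwards its factual update
ledger.submit("b", {7})
ledger.submit("b", {2})     # second subsequent update by the same agent
print(ledger.outputs)
```

A truthful agent simply forwards each factual update it receives to `submit`, while a deviating agent may send something else or respond to its own earlier updates, up to the assumed bound.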
For the analysis, we extract some useful variables from the run of the protocol that will be used in subsequent examples and proofs.
Let a run be all the messages sent in the system during the application of the continuous communication protocol with nature-input (where messages sent to “all” appear once, and the messages appear in their order of sending).
Let be the subsequences of all ledger, factual updates respectively in of agent (if the index is omitted, then simply all such updates, regardless of an agent). Let (“observed history” of ) be all the messages in received or sent by during the run of the nature protocol: These are all factual updates of , ledger updates by , and algorithm outputs shared by the ledger. Let be the elements of starting with index and until (and including) index .
An update strategy for is a mapping from an observed history to a ledger update by agent . The truthful update strategy is the following: If the last element in is of type , update with . Otherwise, do not update.
A full run of the protocol with nature-input and strategies is the run after completion of the nature protocol where nature uses input and each agent responds using strategy . Since we are interested in the effect of one agent deviating from truthfulness, we say that we run nature-input with strategy , where is the deviating agent, and it is assumed that all other agents play . We denote the resulting run .
We can now define an NCC-attack on the nature protocol given algorithm and updates restriction .
Definition 1.
An algorithm is NCC-vulnerable if there exists an agent and update strategy such that:
i) There is a full run of the protocol with some nature-input and the strategy such that its last algorithm output is different from the last algorithm output in .
ii) For any two nature-inputs such that the observed histories satisfy
In words, to consider strategy as a successful attack, the first condition requires that there is a case where the rest of the agents other than observe something different than the factual truth. Notice that we strictly require that the other agents (and not only the ledger) observe a different outcome: If updates with a ledger update that does not match its factual update, but this does not affect future algorithm outputs, we do not consider it an attack (It is a “Tree that falls in a forest unheard”). The second condition requires that the attacker is always able to infer (at least in theory) the last true algorithm output. Under NCC utilities (which we omit formally defining, and work instead directly with the logical formulation, similar to Definition 1 in (Shoham and Tennenholtz, 2005)), failure to infer the true algorithm output under strategy makes it worse than , no matter how much the agent manages to mislead others (which is only its secondary goal).
We remark without formal discussion that being NCC-vulnerable is enough to show that truthfulness is not an ex-post Nash equilibrium if the agents were to play a non-cooperative game using strategies with NCC utilities. However, it does not suffice to show that truthfulness is not a Bayesian-Nash equilibrium, as the cases where the deviation from truthfulness satisfies condition (i) may be of measure 0. We give a stronger definition, which we call NCC-vulnerable*, that guarantees the nonexistence of the truthful Bayesian-Nash equilibrium for any possible probability measure, by amending condition (i) to hold for all cases:
Definition 2.
An algorithm is NCC-vulnerable* if there exists an agent and update strategy satisfying condition (ii) of Definition 1, and:
i*) For every full run of the protocol with some nature-input , the last algorithm output is different from the last algorithm output in .
As long as there is at least one full run of the protocol, it is clear that being NCC-vulnerable* implies being NCC-vulnerable. Similarly, being NCC-vulnerable(*) implies being NCC-vulnerable(*) (i.e., the implication works for both the vulnerable and vulnerable* cases).
In Appendix B, we illustrate the difference between the two definitions, as well as simple proof techniques, using a simple algorithm.
3. k-Center and k-Median in the Continuous Communication Protocol
In this section, we analyze the performance of prominent clustering algorithms in terms of our vulnerability(*) definitions. Together with Section 4, this demonstrates the applicability of the approach for both unsupervised and supervised learning algorithms.
Definition 1.
k-center: Each agent’s update is a set of data points, where each data point is of the form . A possible output of the algorithm is some centers that are among the data points . Let for and some norm function with . In words, is the set of all agents that have as their closest point among . Let be the cost function. In words, the cost of a possible algorithm output is the maximum distance between a point and a center it is attributed to. We then have
(1) 
i.e., the centers are the points among the reported points that minimize the cost if chosen as centers. Ties (both when determining and the final centers) are broken in favor of the candidate with the smallest norm. (Footnote 4: If this is not enough to determine the outcome, complement it with some arbitrary rule, e.g., over the radian coordinates of the points. This does not matter for the argument.)
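The definition can be read as an exhaustive search over candidate center sets, which the following Python sketch spells out. It is a minimal illustration under assumptions: the function name `k_center` is hypothetical, the Euclidean norm is one possible choice of norm, and ties between equally good center sets are broken by comparing the centers' norms in ascending order, which is one reading of the tie-breaking rule. The search is exponential in the number of centers and is only meant for tiny inputs.

```python
from itertools import combinations
from math import dist, hypot


def k_center(points, k):
    """Brute-force k-center: pick k reported points minimizing the max distance."""
    # Deduplicate and order candidates by Euclidean norm (assumed norm choice).
    candidates = sorted(set(points), key=lambda p: (hypot(*p), p))

    def cost(centers):
        # Maximum distance between a point and its closest center.
        return max(min(dist(p, c) for c in centers) for p in candidates)

    # Ties on cost are broken toward center sets with the smallest norms.
    best = min(
        combinations(candidates, k),
        key=lambda centers: (cost(centers), [hypot(*c) for c in centers]),
    )
    return set(best)


reported = [(0, 0), (1, 0), (10, 0), (11, 0)]
centers = k_center(reported, 2)
print(centers)
```

Because the output is always drawn from the reported points, the sketch also shows why the output of k-center is a subset of its input. Costs are compared as floating-point numbers, so exact ties are only reliable on inputs like the one above where distances are exactly representable.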
3.1. Sneak Attacks and Vulnerability
In this subsection, we present a template for a class of attacks. We then show it is successful in showing the vulnerability of the protocol for k-center.
Notice that when we defined strategies, we required them to be memoryless, i.e., only observe and not their own past behavior (which by itself anyway only depends on the past observed histories, which are contained in ). However, the conditions in Strategy Template 2 require for example to check whether the attack was initiated before. The technical lemma below shows that this is possible to infer from .
Lemma 2.
If , the sneak attack is well defined, i.e., the conditions to start and end attack can be implemented using only .
We defer the proof details to Appendix C.
Strategy Template 2 presents the general sneak attack form, which requires four parameters: , , the factual update and last algorithm output that serve as a signal for the attacker to send  the deviation from truth performs, and , the update returning the ledger to a synced state.
Two properties are important for a successful sneak attack. First, the attacker must know with certainty the algorithm output given the counterfactual that it would have sent (as would have), rather than . Second, after sending both and , it should hold that all future algorithm outputs are the same as if sending only . For example, if updates are sets of data points and the algorithm outputs some calculation over their union (later formally defined in Definition 5 as a set algorithm), this holds if .
We formalize this intuition in the following lemma:
Lemma 3.
A sneak attack where , and that moreover can infer the last algorithm output in after starting the attack and sending , satisfies condition (ii).
The proof of the lemma is given in Appendix C.
We now give a sneak attack for center in . The example can be extended to a general dimension by setting the remaining coordinates in the attack parameters to .
Example 4.
center with is NCC-vulnerable using a sneak attack: Use Strategy Template 2 with , with say .
Condition is satisfied for nature-input . The run with yields algorithm outputs but the run with yields .
As for condition : Let be some nature-input, and let be the index of the element of after which the algorithm outputs (i.e., is , upon where agent starts the attack). Let , . Assume for simplicity that , otherwise a symmetric argument to the one we lay out follows. Given the algorithm output , we know that is the closest center to . Thus, . The last inequality holds because every point is either in or , and so its distance from the closest center is at most . We thus have that (as illustrated in Figure 3).
Therefore, under , after agent sends , we have . For any other choice of centers (that may partially intersect), we have (as illustrated in Figure 3). Choosing we have that the algorithm output must be . This shows that agent can infer with certainty the algorithm output under . We thus satisfy the conditions of Lemma 3, which guarantees condition is satisfied.
3.2. k-Center Vulnerability*
In the previous subsection, we have shown that k-center is vulnerable. However, in this subsection, we show it is not vulnerable*.
We note that a significant property of the k-center algorithm is that its output is a subset of its input.
Definition 5.
A set algorithm is an algorithm where each update is a set, and the algorithm is defined over the union of all updates .
A multiset algorithm is an algorithm where each update is a multiset of data points, and the algorithm is defined over the sum of all updates .
A set-choice algorithm is a set algorithm that satisfies , i.e., the algorithm output is a subset of the input.
Many common algorithms such as max, min, or median, are set-choice algorithms, as well as k-center and k-median that we discuss.
We notice a property of the sneak attack in Example 4: deducts points that exist in the factual update and does not include them in the ledger update. In fact, throughout the run of the union of ledger updates by agent is a subset of the union of its factual updates. This leads us to develop the following distinction. We partition the space of attack strategies (all attacks, not necessarily just sneak attacks) into two types, explicitly-lying attacks and omission attacks. This distinction has importance beyond the technical discussion, because of legal and regulatory issues. Strategic firms may be willing to omit data (which can be excused as operational issues, data cleaning, etc.), but not to fabricate data.
Formally, for set and multiset algorithms, we can partition all non-truthful strategies in the following way:
Definition 6.
An explicitly-lying strategy is a strategy that for some nature-input has a point , i.e., the strategy sends a ledger update with a point that does not exist in the union of all factual updates for that agent.
An omission strategy is a strategy that satisfies condition (i) (i.e., misleads others) and is not explicitly-lying.
For an omission strategy it must hold that for every run the agent’s past ledger updates are a subset of its factual updates, i.e., .
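The subset condition above is easy to check on a single run. The following Python sketch does so for set-valued updates; the function names `fabricated_points` and `consistent_with_omission` are hypothetical, and the check concerns one observed run only, so it cannot by itself establish that a strategy misleads others.

```python
def fabricated_points(factual_updates, ledger_updates):
    """Points the agent reported to the ledger but never factually observed."""
    observed = set().union(*factual_updates)
    reported = set().union(*ledger_updates)
    return reported - observed


def consistent_with_omission(factual_updates, ledger_updates):
    """True when every reported point appears in some factual update."""
    return not fabricated_points(factual_updates, ledger_updates)


factual = [{(0, 0), (1, 0)}, {(5, 5)}]
omitting_run = [{(0, 0)}, {(5, 5)}]       # drops (1, 0) but invents nothing
lying_run = [{(0, 0), (9, 9)}, {(5, 5)}]  # (9, 9) was never observed

print(consistent_with_omission(factual, omitting_run))
print(fabricated_points(factual, lying_run))
```

A run with a non-empty set of fabricated points witnesses an explicit lie, while a run that passes the check is only compatible with an omission strategy.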
We now use the notion of explicitly-lying strategy to prove that k-center and k-median are not vulnerable*. For this we need one more technical notion:
Definition 7.
A set-choice algorithm has forceable winners if for any set and a point , there is a set with so that .
In words, if the point is part of the algorithm input, it is always possible to send an update to force the point to be an output of the algorithm. It is interesting to compare this requirement with axioms of multiwinner social choice functions, as detailed e.g. in (Elkind et al., 2017).
Theorem 8.
A set-choice algorithm with forceable winners is not NCC-vulnerable* for any .
We prove the theorem using the two following claims.
Claim 1.
A strategy that satisfies condition (i*) for a set-choice algorithm is explicitly-lying.
Proof.
Consider a nature-input where agent receives no factual updates. To satisfy condition , it must send some ledger update for the algorithm output under to differ from that under . Since the union of all its factual updates is an empty set, it must hold that it sends a data point that does not exist there. ∎
Claim 2.
An explicitly-lying strategy for a set-choice algorithm with forceable winners violates condition (ii).
Proof.
Consider the shortest nature-input (in terms of number of elements) where sends a ledger update with an explicit lie , and let be the union of all ledger, factual updates respectively by . Let , and the nature-input element that generates a factual update of an agent that forces (such an element exists by the forceable winners condition). Let . Notice that (as required in Definition 7 of forceable winners), but , and so . Also note that (as it is an explicit lie). Let be with an additional last element respectively.
Now notice that are composed of the observed history , together with the observations following each of their different last elements. As the last element is a factual update of an agent , the agent sends a truthful ledger update. We then have Thus, the immediate algorithm output, and any further algorithm output following some ledger update by agent is taken over the same set, whether it is under or , and so identifies. We conclude that .
On the one hand, the last algorithm output in is , and thus has the element by Definition 7. On the other hand, the last algorithm output in is . Since is a set-choice algorithm, it does not output since it does not appear in the input set. ∎
Corollary 9.
k-center is not NCC-vulnerable* for any .
Proof.
k-center is a set-choice algorithm. We show that it has forceable winners. We show the construction for , but the general is similar. Let some with . Let . Let . It must hold that . ∎
Corollary 10.
k-median is not NCC-vulnerable* for any .
The proof is given in Appendix D.
4. Linear Regression under Continuous Communication
In this section, we study the vulnerability(*) of linear regression.
Definition 1.
Multiple linear regression in features : Given a set of data points with points, where the data points’ features are a matrix with all elements of the first column normalized to 1, and the targets are a vector , then
We slightly abuse notation by defining both as a function on a series of updates , as well as on a set of data points. The latter satisfies, as long as the columns are linearly independent, . We subsequently assume for simplicity that the columns are always linearly independent (e.g., by having a first ledger update with linearly independent features. The property is then automatically maintained with any future updates).
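As a concrete reading of this definition, the following Python sketch computes ordinary least squares through the normal equations. It is an illustration under assumptions: the helper names `solve` and `linear_regression` are hypothetical, the closed form used is the standard least-squares solution (taken here to be the expression the definition refers to), and exact fractions replace floating point to keep the example verifiable by hand. Updates are combined by concatenating their rows, and the feature columns are assumed to be linearly independent.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Assumes a non-zero pivot exists, i.e., the system is non-singular.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


def linear_regression(rows, targets):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y."""
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(xtx, xty)


# Two updates; the first feature column is the constant 1 (intercept).
update_a = ([[1, 0], [1, 1]], [1, 3])
update_b = ([[1, 2]], [5])
rows = update_a[0] + update_b[0]
targets = update_a[1] + update_b[1]
coefficients = linear_regression(rows, targets)
print(coefficients)
```

The output depends on the updates only through the sums of products accumulated in `xtx` and `xty`, so the order in which updates are concatenated does not matter.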
It is not difficult to find omission sneak attacks for linear regression, as we demonstrate in Figure 5.
In Example 1 in Appendix E, we show a more complicated explicitly-lying sneak attack for (also called “simple linear regression”). The attack can be generalized for . This yields
Theorem 2.
LR is NCC-vulnerable.
4.1. Triangulation Attacks and Vulnerability*
To study vulnerability*, we now define a stronger type of attacks and show they exist for , as long as . We name this type of attacks triangulation attacks, and present a template parameterized by functions in Strategy Template 3.
The idea of triangulation attacks is that for any state of the ledger, the attacker can find subsequent updates so that it can both infer the algorithm output if it applied strategy instead of (using the “triangulations”), and mislead others by the final update . Informally, this attack has the desirable property that regardless of the state of the ledger (and how corrupted it may be by previous updates of the attacker), the attacker can infer the true state.
As in the case of the sneak attack, we should show the strategy template can be implemented using only the information in .
Lemma 3.
The triangulation attack is well defined, i.e., the conditions in lines and can be implemented using only information available in . The assignment in line is valid, that is, given that line is executed there exists an algorithm output in .
We defer the proof details to Appendix C.
We now prove there is a triangulation attack for with .
Theorem 4.
is NCC-vulnerable* using a triangulation attack .
Proof.
We briefly outline the overall flow of the proof. First, we give an explicit construction of the functions. This suffices to show that condition is satisfied, which means there is an inference function that maps observed histories under to the last algorithm output under . Given that inference function, we construct and show that with it condition is satisfied. We give a formal treatment of the inference function in Definition 4 and Lemma 5 of Appendix B, but for our purpose in this proof it suffices that it is a map as specified.
Construction of and condition :
Let
be the last algorithm output before the application of . Define
,
where is the vector with , and
Let be a run with some nature-input and the triangulation attack with the specified (and any function ). Consider all the factual updates by agents induced by . They are each of the form of , where is of size and is , and where is the number of data points in the update. To consider all factual updates of the agents , we can vertically concatenate these matrices. Let this aggregate be denoted . Similarly, let be the concatenation of all factual updates by . Let the concatenation of all ledger updates by before submission of any of the updates be . Recall that we denote by the algorithm outputs (right before, and after each , e.g. is applied after and generates ). Let be the (concatenated) inputs to the algorithm that generate . In terms of the defined variables above, we can write:
(2) 
To show that condition holds, it suffices to show that we can infer the last algorithm output of the run . Let be the concatenation of all factual updates of all agents, then it is the input that generates , and it holds that:
(3) 
Since in Equation 3, besides , all RHS variables are observed history under , we conclude that it is enough to deduce in order to infer , and thus also the last algorithm output under which is .
Let .
For every , we have
(4) 
By the construction of , we can rewrite these equations in the following way. Let be the matrix with , and all other elements zero. Let be the vector with
and all other elements zero.
We have for :
(5) 
If we examine the differences between the equation and the equation, we get for ,
(6) 
Notice that for any , is not the zero vector. If it were, since is invertible, we would have that , which would contradict the following claim:
Claim 3.
For every algorithm output , and a single point update so that , the new algorithm output for the data with satisfies , and has a different value at than .
The proof of the claim is given in Appendix E.
Moreover, by definition is a vector that has all elements besides element and that are (since it is not a zero vector), and so the th element of the vector is nonzero. Therefore, for the vector , the th element is nonzero as well (since has all elements with index higher than as zero). For any with , all elements with index higher than are zero. Therefore, the set is linearly independent, and the matrix where each column is is invertible. If we let be the matrix where each column is , we can rewrite Equation 6 as , where is the identity matrix. We conclude that is invertible and . We can directly calculate the RHS of this expression from the observed history under , and by the first equation of Equation 5 we can infer , overall concluding the proof for condition .
Construction of and condition (i*). Let be the inference function (whose existence is guaranteed by the previous discussion) that matches observed histories running with the true algorithm outputs under . That is, we have . Let the last algorithm output in be . Let .
If , does not send an update, and so for the nature-input that has observed history the last algorithm output under is different from that under , as required by condition .
If , sends an update with a point that satisfies . By Claim 3, the resulting algorithm output is different from .
∎
We demonstrate the construction and inference of the triangulation attack in an open-source implementation: https://github.com/yotamgafni/triangulation_attack. Figure 7 shows a run of the attack for a random example for LR.
We show an asymptotically matching lower bound for triangulation attacks.
Theorem 5.
There is no triangulation attack for with or fewer functions (i.e., ).
Proof.
Consider all nature-input elements that are of the form , where is a matrix, and is the zero vector. of the same sizes but without any restriction over . We show that for any triangulation attack , we can find two nature-inputs among this family with different observed history under , but the same observed history under .
By the choice of , the first algorithm output satisfies . We know from the proof of Theorem 4, in particular Equation 5 (where it was done for a specific given triangulation attack), that the attack generates vector equations for (including the one over ). We also know that the first row of is all elements. We can make it a stricter constraint by demanding that the first row of is of the form . Then, the principal submatrix of (removing the first row and column) is a general PSD matrix (as a principal submatrix of the PSD matrix). To uniquely determine such a matrix of size , we need vector equations, but the triangulation equations only yield such equations. So there are some that are in the family of nature-inputs and have the same observed history under . Fix some invertible . Since , there must be some so that