
Long-term Data Sharing under Exclusivity Attacks

01/22/2022
by Yotam Gafni, et al.

The quality of learning generally improves with the scale and diversity of data. Companies and institutions can therefore benefit from building models over shared data. Many cloud and blockchain platforms, as well as government initiatives, are interested in providing this type of service. These cooperative efforts face a challenge, which we call “exclusivity attacks”. A firm can share distorted data, so that it learns the best model fit, but is also able to mislead others. We study protocols for long-term interactions and their vulnerability to these attacks, in particular for regression and clustering tasks. We conclude that the choice of protocol, as well as the number of Sybil identities an attacker may control, is material to vulnerability.



1. The Work in Context

1.1. Data Sharing among Firms

In today’s data-oriented economy (OECD, 2015), countless applications are based on the ability to extract statistically significant models out of acquired user data. Still, firms are hesitant to share information with other firms (Richter and Slowinski, 2019; Commission et al., 2018), as data is viewed as a resource that must be protected. This is in tension with the paradigm of the Wisdom of the crowds (Surowiecki, 2005), which emphasizes the added predictive value of aggregating multiple data sources. As early as 2001, the authors in (Banko and Brill, 2001) (note also a similar approach in (Halevy et al., 2009)) concluded that

“… a logical next step for the research community would be to direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora.”

Two popular frameworks to address issues arising in settings where data is shared are multi-party computation (Cramer et al., 2015) and differential privacy (Dwork, 2008). However, these paradigms focus on the issue of privacy (whether of the individual user or of the firm's data bank), and do not answer the basic conundrum of sharing data with competing firms: on one hand, cooperation enables the firm to enrich its own models, but at the same time it enables other firms to do so as well. A firm is thus tempted to game the mechanism to allow itself better inference than other firms. We call this behavior exclusivity attacks. Even if supplying intentionally false information could be a legal risk, the nature of data processing (rich with outliers, spam accounts, natural biases) allows firms to have a "reasonable justification" to alter the data they share with others.

In this work, we present a model of collaborative information sharing between firms. The goal of every firm is first to have the best available model given the aggregate data. As a secondary goal, every firm wishes the others to have a downgraded version of the model. An appropriate framework to address this objective is the Non-cooperative computation (NCC) framework, introduced in (Shoham and Tennenholtz, 2005). The framework was considered with respect to one-shot data aggregation tasks in (Kantarcioglu and Jiang, 2013).

1.2. Open and Long-term Environments

In our work, we present a general communication protocol for collaborative data sharing among firms that can be coupled with any specific machine learning or data aggregation algorithm. The protocol has an online nature, in which any participating firm may send (additional) data points at any time. This is in contrast with previous NCC literature, which focuses on one-shot data-sharing procedures. The long-term setting yields two, somewhat contradictory, attributes:

  • A firm may send multiple subsequent inputs to the protocol, using it to learn how the model’s parameters change after each contribution. For an attacker, this allows better inference of the true model’s parameters, without revealing its true data points, as we demonstrate in Example 1 below.

  • A firm is not only interested in attaining the current correct parameters of the model, but also has a future interest to be able to attain correct answers, given that more data is later added by itself and its competitors. This has a chilling effect on attacks, as even a successful attack in the one-shot case could result in data corruption. For example, a possible short-term attack could be for a firm to send its true data, attain the correct parameters, and then send additional garbage data. Since we do not have built-in protection against such actions in the mechanism (for reasons further explained in Remark 1), this would result in data corruption for the other firms. Nevertheless, if the firm itself is interested in attaining meaningful information from the mechanism in the future, it would be disincentivized to do so.

We now give an example demonstrating the first point. In (Kantarcioglu and Jiang, 2013), the authors consider the problem of collaboratively calculating the average of data points. They show in their Theorem 4.6 and Theorem 4.7 that whether the number of different data points is known is essential to the truthfulness of the mechanism. When the number of data points is unknown, the denominator of the average term is unknown, and it is impossible for an attacker to know with certainty how to attain the true average from the average the mechanism reports given a false input of the attacker. We now show that in a model where it is possible to send multiple requests (in fact, two), it is possible to report false information and attain the correct average:

Example 1.

Consider a firm with some data points, with a given total sum and number of points. Other firms have data points with their own total sum and number of points. (We also make a mild assumption on these quantities; it is not required for the attack scheme to succeed, but makes for a simpler demonstration.) Instead of reporting its data truthfully, the firm first reports one distorted input and receives an average, then reports a second input and receives the updated average. The average that the others, following the mechanism as given, attain is computed over the distorted reports and, by our assumption, differs from the true average. The firm is thus successful in misleading others. Moreover, the firm can infer the true average: from the two published averages and its own two reports, it can recover the other firms' total sum and number of points (in the single degenerate case, a different choice of the second report yields a similar argument), and thus it has all the information required to calculate the true average.
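
To make the inference step concrete, here is a minimal numerical sketch of the two-report attack (the variable names and the specific probe values are ours, not the paper's notation): the attacking firm sends two probe points, observes the two published averages, and solves the resulting two equations for the other firms' sum and count, from which the true average follows.

# Illustrative sketch of the two-report averaging attack (Example 1).
# All names (others_sum, probe1, ...) are our own, not the paper's notation.

# Hidden data of the other firms (unknown to the attacker).
others_sum, others_count = 100.0, 7

# The attacker's factual data (never reported truthfully here).
my_points = [3.0, 5.0, 10.0]
my_sum, my_count = sum(my_points), len(my_points)

def published_average(reported_sum, reported_count):
    """What the mechanism publishes, given the attacker's reports so far."""
    return (others_sum + reported_sum) / (others_count + reported_count)

# The attacker sends two arbitrary probe points, one at a time.
probe1, probe2 = 1.0, 2.0
a1 = published_average(probe1, 1)            # average after the first probe
a2 = published_average(probe1 + probe2, 2)   # average after the second probe

# Two equations in two unknowns (others' sum S and count N):
#   a1 * (N + 1) = S + probe1
#   a2 * (N + 2) = S + probe1 + probe2
# Solving (valid whenever a2 != a1; this fails only in the degenerate case noted above):
N = (probe2 + a1 - 2 * a2) / (a2 - a1)
S = a1 * (N + 1) - probe1

true_average = (S + my_sum) / (N + my_count)
reference = (others_sum + my_sum) / (others_count + my_count)
print(round(true_average, 6), round(reference, 6))   # the two values coincide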

Remark 1 ().

Why should we not simply forbid multiple subsequent updates by a firm? As noted in (Yokoo et al., 2004; Gafni et al., February 7-12; Afek et al., 2017), modern internet-based environments lack clear identities and allow for multiple inputs by the same agent using multiple identities. A common distinction in blockchain networks separates public ("permissionless") and private ("permissioned") networks (Liu et al., 2019), where public networks allow open access for everyone, while private networks require additional identification for participation. In both cases, however, it is impossible to totally prevent false-name manipulation, where a firm uses multiple identities to send her requests. Therefore, any "simple" solution to the problem demonstrated in Example 1 is impossible: the mechanism does not know whether multiple subsequent updates are really sent by different firms or whether they are in fact "sock puppets" of a single firm, and it therefore cannot adjust appropriately (e.g., drop any request after the first one). In this work, we assume a firm may control up to a given number of identities, and so in the formal model we allow the same number of subsequent updates by a single firm. The false identities are not part of the formal model: they are instead encapsulated by giving firms this ability to update that many times subsequently.

1.3. Our Results

  • We define two long-term data-sharing protocols (the continuous and periodic communication protocols) for data sharing among firms. The models differ in how communication is structured temporally (whether the agents can communicate at any time, or are asked for their inputs at given times). Each model can be coupled with any choice of algorithm to aggregate the data shared by the agents.

  • We give a condition for NCC-vulnerability of an algorithm (given the communication model) in Definition 1. A successful NCC attack is one that (i) Can mislead the other agents, and (ii) Maintains the attacker’s ability to infer the true algorithm output. We give a stronger condition of NCC-vulnerability* that can moreover (i*) Mislead the other agents in every possible scenario. As a simple example of using these definitions, we show in Appendix B that finding the maximum over agent reports is NCC-vulnerable but not NCC-vulnerable*.

  • For the k-center problem, we show that it is vulnerable under continuous communication but not vulnerable under periodic communication. Moreover, we show that it is not vulnerable* even in continuous communication, using a notion of explicitly-lying attacks.

  • For Multiple Linear Regression, we show that it is vulnerable* under continuous communication but not vulnerable under periodic communication. The vulnerability* under continuous communication depends on the number of identities an attacker can control: we show a form of attack such that an attacker with sufficiently many identities (a number that grows with the dimension of the feature space) is guaranteed to have an attack, while an attacker with fewer identities cannot attack.

The vulnerability(*) results for the continuous communication protocol are summarized in Table 1. Neither algorithm is vulnerable(*) under the periodic communication protocol.

                     Vulnerable                          Vulnerable*
Linear Regression    Yes, for any number of identities   Yes, given sufficiently many identities
k-Center             Yes, for any number of identities   No
Table 1. A summary of vulnerability(*) results in the continuous communication protocol.

We overview related work in Appendix A.

2. Model and Vulnerability Notions

We consider a system where agents receive factual updates containing data points or states of the world. The agents apply their reporting strategies, performing ledger updates. Upon any ledger update, the ledger distributes the latest aggregate parameter calculation, computed by the computation algorithm.

Formally, we have a set of agents. An update is of some type, depending on the computational problem. An update with metadata complements an update with an agent and a type, where "Factual" updates represent a factual state of nature observed by an agent, and "Ledger" updates are what the agent shares with the ledger, which may differ from what she factually observes. We note that the ledger (which for simplicity we assume is a centralized third party) does not make the data public, but only shares the algorithm's updated outputs according to the protocol's rules. The computation algorithm receives a series of updates of any length and outputs a result. In the continuous communication protocol, algorithm outputs are shared with all agents upon every ledger update.

In this section and Sections 3-4 we focus on the continuous communication protocol. The continuous communication protocol simulates a system where agents may push updates at any time, initiated by them and not by the system manager. We model this by allowing them to respond to any change in the state of the system, including responding to their own ledger updates. The only limit on an agent endlessly sending updates to the ledger is that we restrict it to a bounded number of subsequent updates. The continuous communication protocol is a messaging protocol between nature, the agents, and the ledger. A particular protocol run is instantiated with a nature-input, which is a series of some length whose elements are tuples comprised of an agent and an update.

Figure 1. A continuous protocol run with some nature-input and algorithm, as explained in the proof of Proposition 2. An agent's observed history consists of all the nodes in her line, or nodes that have an outgoing edge from a node in her line.
Input: Nature-input; parameter: the maximum number of subsequent updates by an agent
Output: Full messaging history
for each factual message in the nature-input do
      Nature sends a message to the designated agent with her factual update; activeMessage := True // there is an active message
      while activeMessage = True do // as long as some agent is responding
            activeMessage := False
            for each agent do
                  if the agent wishes to send a ledger update, and the last updates are not all ledger updates of that agent, then
                        the agent sends a message to the Ledger with her ledger update; the Ledger sends a message to all agents with the algorithm's output over all past ledger updates; activeMessage := True
Protocol 1: The continuous communication protocol
(Footnote: We can question whether an agent respects the condition that the last updates are not all her own ledger updates. If she does not, she may send a message regardless of this constraint. But since nature can choose not to accept or respond to it, we simplify the protocol by assuming the agents self-enforce the constraint.)
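
As a complement to the pseudocode, the following is a minimal simulation sketch of the continuous communication protocol. It assumes a centralized ledger that recomputes an arbitrary aggregation algorithm over all past ledger updates and broadcasts the result after every update; the interfaces (history format, strategy and algorithm signatures, and the cap on repeated updates) are our own simplifications of the formal model.

# Minimal simulation of the continuous communication protocol (a sketch; the
# interfaces and the cap on repeated updates are simplified assumptions).
from typing import Callable, List, Optional, Tuple

Update = List[float]                       # here, an update is a list of data points
Event = Tuple[int, str, object]            # (agent, "Factual"/"Ledger"/"Output", payload)
Strategy = Callable[[List[Event]], Optional[Update]]

def run_continuous_protocol(nature_input: List[Tuple[int, Update]],
                            strategies: List[Strategy],
                            algorithm: Callable[[List[Update]], object],
                            max_subsequent: int) -> List[List[Event]]:
    n = len(strategies)
    ledger: List[Update] = []              # all ledger updates, in order of arrival
    histories: List[List[Event]] = [[] for _ in range(n)]

    def broadcast():                       # the ledger shares the latest output with all
        out = algorithm(ledger)
        for h in histories:
            h.append((-1, "Output", out))

    for agent, update in nature_input:     # nature delivers one factual update at a time
        histories[agent].append((agent, "Factual", update))
        active, sent = True, {i: 0 for i in range(n)}
        while active:                      # as long as some agent keeps responding
            active = False
            for i in range(n):
                reply = strategies[i](histories[i])
                if reply is not None and sent[i] < max_subsequent:
                    ledger.append(reply)
                    histories[i].append((i, "Ledger", reply))
                    broadcast()
                    sent[i] += 1
                    active = True
    return histories                       # each agent's observed history

# Example: two truthful agents, aggregated by a running mean of all shared points.
def truthful(history: List[Event]) -> Optional[Update]:
    last = history[-1] if history else None
    return list(last[2]) if last and last[1] == "Factual" else None

mean = lambda updates: sum(sum(u) for u in updates) / sum(len(u) for u in updates)
histories = run_continuous_protocol([(0, [1.0, 2.0]), (1, [4.0])],
                                     [truthful, truthful], mean, max_subsequent=2)
print(histories[0][-1])    # the last broadcast output: the mean of 1.0, 2.0, 4.0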

For the analysis, we extract some useful variables from the run of the protocol that will be used in subsequent examples and proofs.

Let a run be all the messages sent in the system during the application of the continuous communication protocol with nature-input (where messages sent to ’all’ appear once, and the messages appear in their order of sending).

Let the sub-sequences of all ledger updates and factual updates, respectively, of an agent in a run be defined (if the agent index is omitted, then simply all such updates, regardless of agent). Let the "observed history" of an agent be all the messages in the run received or sent by her: these are all her factual updates, her ledger updates, and the algorithm outputs shared by the ledger. We also write sub-sequences of these by their starting index and until (and including) an ending index.

An update strategy for an agent is a mapping from an observed history to a ledger update by that agent. The truthful update strategy is the following: if the last element in the observed history is a factual update of the agent, send it as a ledger update; otherwise, do not update.

A full run of the protocol with a nature-input and strategies is the run after completion of the nature protocol where nature uses that input and each agent responds using her strategy. Since we are interested in the effect of one agent deviating from truthfulness, we say that we run a nature-input with a strategy of the deviating agent, and it is assumed that all other agents play truthfully. We denote the resulting run accordingly.

We can now define an NCC-attack on the nature protocol, given an algorithm and a restriction on the number of subsequent updates.

Definition 1.

An algorithm is NCC-vulnerable (given the communication protocol and the bound on subsequent updates) if there exists an agent and an update strategy such that:

i) There is a full run of the protocol with some nature-input and the deviating strategy, such that its last algorithm output is different from the last algorithm output in the corresponding truthful run.

ii) For any two nature-inputs such that the observed histories of the deviating agent under the strategy are identical, the last algorithm outputs of the corresponding truthful runs are identical as well.

In words, to consider a strategy as a successful attack, the first condition requires that there is a case where the agents other than the attacker observe something different from the factual truth. Notice that we strictly require that the other agents (and not only the ledger) observe a different outcome: if the attacker sends a ledger update that does not match its factual update, but this does not affect future algorithm outputs, we do not consider it an attack (it is a "tree that falls in a forest unheard"). The second condition requires that the attacker is always able to infer (at least in theory) the last true algorithm output. Under NCC utilities (which we omit formally defining, and work instead directly with the logical formulation, similar to Definition 1 in (Shoham and Tennenholtz, 2005)), failure to infer the true algorithm output under the deviating strategy makes it worse than the truthful strategy, no matter how much the agent manages to mislead others (which is only its secondary goal).

We remark without formal discussion that being NCC-vulnerable is enough to show that truthfulness is not an ex-post Nash equilibrium if the agents were to play a non-cooperative game using strategies with NCC utilities. However, it does not suffice to show that truthfulness is not a Bayesian-Nash equilibrium, as the cases where the deviation from truthfulness satisfies condition (i) may be of measure 0. We give a stronger definition, which we call NCC-vulnerable*, that guarantees the non-existence of the truthful Bayesian-Nash equilibrium for any possible probability measure, by amending condition (i) to hold in all cases:

Definition 2.

An algorithm is NCC-vulnerable* if there exists an agent and an update strategy with both condition (ii) of Definition 1, and:

i*) For every full run of the protocol with some nature-input, the last algorithm output is different from the last algorithm output in the corresponding truthful run.

As long as there is at least one full run of the protocol, it is clear that being NCC-vulnerable* implies being NCC-vulnerable. Similarly, being NCC-vulnerable(*) under a given bound on subsequent updates implies being NCC-vulnerable(*) under any larger bound (i.e., the implication works for both the vulnerable and vulnerable* cases).

In Appendix B, we illustrate the difference between the two definitions, as well as simple proof techniques, using a simple algorithm.

3. k-Center and k-Median in the Continuous Communication Protocol

In this section, we analyze the performance of prominent clustering algorithms in terms of our vulnerability(*) definitions. Together with Section 4, this demonstrates the applicability of the approach to both unsupervised and supervised learning algorithms.

Definition 3.

k-center: Each agent's update is a set of data points. A possible output of the algorithm is a set of k centers chosen among the reported data points. Under some norm, each point is attributed to its closest center, and the cost of a possible output is the maximum distance between a point and the center it is attributed to. The algorithm then outputs

(1)   centers(X) = argmin_{C ⊆ X, |C| = k}  max_{x ∈ X}  min_{c ∈ C} ‖x − c‖,

i.e., the centers are the points among the reported points that minimize the cost if chosen as centers. Ties (both when attributing points to centers and when determining the final centers) are broken in favor of the candidate with the smallest norm. (If this is not enough to determine the output, complement it with some arbitrary rule, e.g., over the radian coordinates of the points; this does not matter for the argument.)
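
To make the definition concrete, the following is a brute-force sketch of such a k-center rule over the union of reported points. The tie-breaking uses the sorted norms of the candidate centers as a secondary key, which is one reasonable reading of the smallest-norm rule; the paper's exact secondary tie-breaking is not reproduced here.

# Brute-force k-center over reported points (a sketch; the secondary
# tie-breaking key is an assumption, not the paper's exact rule).
from itertools import combinations
import math

def norm(p):
    return math.sqrt(sum(x * x for x in p))

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_center(points, k):
    """Return k centers among `points` minimizing the maximum distance
    of any point to its closest center; ties broken toward smaller norms."""
    pts = [tuple(p) for p in points]
    best, best_key = None, None
    for centers in combinations(sorted(set(pts)), k):
        cost = max(min(dist(p, c) for c in centers) for p in pts)
        # Primary key: cost. Secondary: prefer candidate centers with smaller norms.
        key = (cost, tuple(sorted(norm(c) for c in centers)))
        if best_key is None or key < best_key:
            best, best_key = centers, key
    return best    # None if there are fewer than k distinct points

# Example: the union of all agents' reported points.
reported = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(k_center(reported, 2))   # -> ((0.0, 0.0), (10.0, 0.0))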

3.1. Sneak Attacks and Vulnerability

In this subsection, we present a template for a class of attacks. We then show that it establishes the vulnerability of k-center under the continuous communication protocol.

Figure 2. A general template for the sneak attack. Until the special conditions are met, and after the re-sync is done, the strategy behaves as the truthful strategy.
Input: Observed history. Parameters: a signal factual update, a signal algorithm output, a deviating update, and a re-sync update
Output: A ledger update
/* Condition to start the attack */
if the last element in the observed history is the signal factual update, the last algorithm output is the signal output, and the condition to start the attack was not invoked before then
      Return the deviating update
/* Condition to end the attack */
else if the condition to start the attack was invoked, after that some agent (either the attacker or another) received a factual update, but the condition to end the attack was not yet invoked then
      Let u be the last update in the observed history if it is a factual update of the attacker, or the empty update otherwise. Return the re-sync update together with u
/* If the special conditions do not hold, act as the truthful strategy */
else if the last update in the observed history is a factual update of the attacker then
      Return it
Strategy Template 2: A template for a sneak attack

Notice that when we defined strategies, we required them to be memory-less, i.e., to observe only the current observed history and not their own past behavior (which anyway depends only on past observed histories, which are contained in the current one). However, the conditions in Strategy Template 2 require, for example, checking whether the attack was initiated before. The technical lemma below shows that this can be inferred from the observed history alone.

Lemma 2.

Provided the attack parameters are suitably chosen, the sneak attack is well defined, i.e., the conditions to start and end the attack can be implemented using only the observed history.

We defer the proof details to Appendix C.

Strategy Template 2 presents the general sneak attack form, which requires four parameters: the factual update and the last algorithm output that serve as a signal for the attacker, the deviating update the attack sends in place of the truth, and the update returning the ledger to a synced state.
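
As a rough illustration, the template can be written as an update strategy operating only on the observed history. The sketch below uses hypothetical parameter names (signal_factual, signal_output, deviation, resync) for the four parameters just described, reuses the simplified history format of the earlier protocol sketch, and approximates the end-of-attack detection; it is not the paper's formal construction.

# A sketch of the sneak-attack update strategy (Strategy Template 2). The
# parameter names, history format, and end-of-attack test are simplifications.
from typing import List, Optional, Tuple

Event = Tuple[int, str, object]            # (agent, "Factual"/"Ledger"/"Output", payload)

def sneak_attack(history: List[Event], me: int,
                 signal_factual, signal_output, deviation, resync) -> Optional[list]:
    """Return the ledger update the attacker sends now, or None."""
    outputs = [e[2] for e in history if e[1] == "Output"]
    last = history[-1] if history else None
    # Locate our deviating ledger update, if it was already sent.
    dev_idx = next((i for i, e in enumerate(history)
                    if e == (me, "Ledger", deviation)), None)
    attack_started = dev_idx is not None
    # We treat the attack as ended once we sent any ledger update after the deviation.
    attack_ended = attack_started and any(
        e[0] == me and e[1] == "Ledger" for e in history[dev_idx + 1:])

    # Condition to start the attack.
    if (not attack_started and last == (me, "Factual", signal_factual)
            and outputs and outputs[-1] == signal_output):
        return list(deviation)
    # Condition to end the attack: some activity (a factual update of ours, or an
    # algorithm output triggered by another agent) occurred after the deviation.
    if attack_started and not attack_ended and len(history) > dev_idx + 2:
        extra = list(last[2]) if last and last[:2] == (me, "Factual") else []
        return list(resync) + extra
    # Otherwise, behave as the truthful strategy.
    if last and last[:2] == (me, "Factual"):
        return list(last[2])
    return None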

Two properties are important for a successful sneak attack. First, the attacker must know with certainty the algorithm output under the counterfactual in which it had sent the truthful update (as the truthful strategy would have), rather than the deviating one. Second, after sending both the deviating update and the re-sync update, all future algorithm outputs should be the same as if only the truthful updates had been sent. For example, if updates are sets of data points and the algorithm outputs some calculation over their union (later formally defined in Definition 5 as a set algorithm), this holds if the union of the deviating and re-sync updates equals the union of the corresponding truthful updates.

We formalize this intuition in the following lemma:

Lemma 3.

A sneak attack in which the union of the deviating and re-sync updates equals the union of the corresponding truthful updates, and that moreover can infer the last algorithm output of the truthful run after starting the attack and sending the deviating update, satisfies condition (ii).

The proof of the lemma is given in Appendix C.

We now give a sneak attack for k-center in a low-dimensional setting. The example can be extended to a general dimension by setting the remaining coordinates in the attack parameters to zero.

Example 4.

k-center is NCC-vulnerable using a sneak attack: use Strategy Template 2 with the following choice of parameters.

Condition (i) is satisfied for a suitable nature-input: the run with the truthful strategy yields one sequence of algorithm outputs, but the run with the attack strategy yields a different one.

As for condition (ii): Let be some nature-input, and let be the index of the element of after which the algorithm outputs (i.e., the point at which the agent starts the attack). Let , . Assume for simplicity that , otherwise a symmetric argument to the one we lay out follows. Given the algorithm output , we know that is the closest center to . Thus, . The last inequality holds because every point is either in or , and so its distance from the closest center is at most . We thus have that (as illustrated in Figure 3).

Figure 3. An illustration of Example 4. In (a), the fact that the given point is the algorithm output is enough to show that all input elements are within the indicated ball; otherwise another point would be a better choice for a center. In (b), which is displayed on a logarithmic scale, we see that, given that all prior input elements are within that ball, and with the additional elements sent by the attacker, the algorithm must output them as centers for a small enough parameter.

Therefore, under the truthful counterfactual, after the agent sends its update, the prior input elements remain within the ball. For any other choice of centers (that may partially intersect), the cost is larger (as illustrated in Figure 3). Choosing the attack parameter small enough, the algorithm output must be the attack's points. This shows that the agent can infer with certainty the algorithm output under the truthful counterfactual. We thus satisfy the conditions of Lemma 3, which guarantees that condition (ii) is satisfied.

3.2. k-Center Vulnerability*

In the previous subsection, we have shown that k-center is vulnerable. However, in this subsection, we show it is not vulnerable*.

We note that a significant property of the k-center algorithm is that its output is a subset of its input.

Definition 5.

A set algorithm is an algorithm where each update is a set of data points, and the algorithm is defined over the union of all updates.

A multi-set algorithm is an algorithm where each update is a multi-set of data points, and the algorithm is defined over the (multi-set) sum of all updates.

A set-choice algorithm is a set algorithm whose output is always a subset of its input.

Many common algorithms, such as max, min, or median, are set-choice algorithms, as are the k-center and k-median algorithms that we discuss.
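
As a quick illustration of these notions, the check below wraps an aggregation function so that it runs over the union of updates, and verifies whether its output lies inside that union (max is set-choice, while the mean, a toy example of ours, is not).

# A set algorithm runs over the union of all updates; a set-choice algorithm's
# output is moreover a subset of that union (illustrative check).
def as_set_algorithm(f):
    def wrapped(updates):                 # updates: a list of sets of points
        pooled = set().union(*updates) if updates else set()
        return f(pooled)
    return wrapped

algo_max = as_set_algorithm(lambda s: {max(s)})            # max is a set-choice algorithm
algo_mean = as_set_algorithm(lambda s: {sum(s) / len(s)})  # the mean generally is not

updates = [{1.0, 4.0}, {2.0}, {4.0, 7.0}]
pooled = set().union(*updates)
assert algo_max(updates) <= pooled                    # output is a subset of the input
assert not (algo_mean(updates) <= pooled)             # 3.5 is not a reported point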

We notice a property of the sneak attack in Example 4: the attacker omits points that exist in its factual update and does not include them in the ledger update. In fact, throughout the run, the union of ledger updates by the agent is a subset of the union of its factual updates. This leads us to develop the following distinction. We partition the space of attack strategies (all attacks, not necessarily just sneak attacks) into two types, explicitly-lying attacks and omission attacks. This distinction has importance beyond the technical discussion, because of legal and regulatory issues: strategic firms may be willing to omit data (which can be excused as operational issues, data cleaning, etc.), but not to fabricate data.

Formally, for set and multi-set algorithms, we can partition all non-truthful strategies in the following way:

Definition 6.

An explicitly-lying strategy is a strategy that, for some nature-input, sends a ledger update containing a point that does not exist in the union of all factual updates of that agent.

An omission strategy is a strategy that satisfies condition (i) (i.e., misleads others) and is not explicitly-lying.

For an omission strategy it must hold that in every run the agent's past ledger updates are a subset of its factual updates.
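
The data-containment part of this distinction can be checked mechanically from a run: a strategy run is explicitly lying if some reported point never appears among the agent's factual points, and otherwise it is at most an omission (the misleading requirement of condition (i) is not checked here). The helper below assumes updates are plain sets of points.

# Classify an agent's behavior in a run as truthful-looking, omission-style,
# or explicitly lying (a sketch; updates are assumed to be sets of points).
def classify_run(factual_updates, ledger_updates):
    factual = set().union(*factual_updates) if factual_updates else set()
    reported = set().union(*ledger_updates) if ledger_updates else set()
    if reported - factual:
        return "explicitly lying"      # reported a point never factually observed
    if reported < factual or len(ledger_updates) != len(factual_updates):
        return "possible omission"     # only withheld or delayed data
    return "consistent with truthful reporting"

print(classify_run([{1, 2}, {3}], [{1, 2}, {3}]))   # consistent with truthful reporting
print(classify_run([{1, 2}, {3}], [{1}]))           # possible omission
print(classify_run([{1, 2}], [{1, 5}]))             # explicitly lying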

We now use the notion of explicitly-lying strategies to prove that k-center and k-median are not vulnerable*. For this we need one more technical notion:

Definition 7.

A set-choice algorithm has forceable winners if, for any input set and any point in it, there is an additional set one can add to the input so that the point is included in the algorithm's output.

In words, if the point is part of the algorithm input, it is always possible to send an update to force the point to be an output of the algorithm. It is interesting to compare this requirement with axioms of multi-winner social choice functions, as detailed e.g. in (Elkind et al., 2017).

Theorem 8.

A set-choice algorithm with forceable winners is not NCC-vulnerable*, for any bound on the number of subsequent updates.

We prove the theorem using the two following claims.

Claim 1.

A strategy that satisfies condition (i) for a set-choice algorithm is explicitly-lying.

Proof.

Consider a nature-input where the agent receives no factual updates. To satisfy condition (i), the agent must send some ledger update for the algorithm output under its strategy to differ from that under the truthful strategy. Since the union of all its factual updates is the empty set, it must send a data point that does not exist there. ∎

Claim 2.

An explicitly-lying strategy for a set-choice algorithm with forceable winners violates condition (ii).

Proof.

Consider the shortest nature-input (in terms of number of elements) where the strategy sends a ledger update with an explicit lie, and consider the unions of all ledger and factual updates, respectively, made by the attacker. By the forceable-winners condition there is a set that, added to the attacker's factual updates together with the lie point, forces the lie point to be an output. Let one nature-input element generate a factual update of another agent consisting of this forcing set, and a second element generate the forcing set together with the lie point. Notice that the lie point is forced as an output in the latter case (as required in Definition 7 of forceable winners), but it does not belong to the union of the attacker's factual updates and the forcing set, and so a set-choice algorithm cannot output it in the former case. Also note that the lie point does not appear in the attacker's factual updates (as it is an explicit lie). Let the two extended nature-inputs be the original one with each of these additional last elements, respectively.

Now notice that the observed histories under the strategy are composed of the observed history over the common prefix, together with the observations following each of the two different last elements. As the last element is a factual update of an agent other than the attacker, that agent sends a truthful ledger update. Since the lie point is already on the ledger under the strategy, the resulting ledger contents are identical in both cases. Thus, the immediate algorithm output, and any further algorithm output following some ledger update by the attacker, is taken over the same set under either extension, and so the attacker's observed history is the same. We conclude that the two extended nature-inputs have identical observed histories under the strategy.

On the other hand, the last algorithm output in the truthful run of the second extension contains the lie point, by Definition 7, while the last algorithm output in the truthful run of the first extension does not: since the algorithm is a set-choice algorithm, it cannot output the lie point, which does not appear in the input set. This violates condition (ii). ∎

Figure 4. Demonstration of the proof of Claim 2. The attacker sends an explicit lie; one column shows the state of the ledger under the attack strategy, and the other the state under the truthful strategy, together with the complementary forcing set from Definition 7 (forceable winners). Given that the next ledger update by a truthful agent is either the forcing set or the forcing set together with the lie point (represented by the rows), the behavior under the different strategies (represented by the columns) is such that under the attack strategy the two underlying states of the world become identical, but not so under the truthful strategy.
Corollary 9.

k-center is not NCC-vulnerable*, for any bound on the number of subsequent updates.

Proof.

k-center is a set-choice algorithm. We show that it has forceable winners by an explicit construction: given any input set containing the point, one can add suitably placed points so that the point must be chosen as one of the centers. We give the construction for the planar case; the general case is similar. ∎

Corollary 10.

k-median is not NCC-vulnerable*, for any bound on the number of subsequent updates.

The proof is given in Appendix D.

4. Linear Regression under Continuous Communication

In this section, we study the vulnerability(*) of linear regression.

Definition 1.

Multiple linear regression: Given a set of data points, where the features form a matrix X (with all elements of the first column normalized to 1) and the targets form a vector y, the algorithm outputs the ordinary least squares estimator

    \hat{\beta} = (X^\top X)^{-1} X^\top y.

We slightly abuse notation by defining the estimator both as a function on a series of updates and as a function on a set of data points. The latter satisfies the closed form above as long as the columns of X are linearly independent. We subsequently assume for simplicity that the columns are always linearly independent (e.g., by having a first ledger update with linearly independent features; the property is then automatically maintained with any future updates).
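
For concreteness, the estimator over a series of updates can be computed by concatenating all the updates and solving the normal equations; below is a minimal numpy sketch (helper names are ours).

# Multiple linear regression over a series of updates (a minimal sketch).
import numpy as np

def mlr_over_updates(updates):
    """Each update is a pair (X_u, y_u); the first column of X_u is all ones.
    Returns the OLS estimator over the concatenation of all updates."""
    X = np.vstack([Xu for Xu, _ in updates])
    y = np.concatenate([yu for _, yu in updates])
    # Normal equations: beta = (X^T X)^{-1} X^T y, assuming X has full column rank.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Two ledger updates with an intercept column and one feature.
u1 = (np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([0.0, 1.0]))
u2 = (np.array([[1.0, 2.0]]), np.array([2.0]))
print(mlr_over_updates([u1, u2]))   # close to [0., 1.]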

It is not difficult to find omission sneak attacks for linear regression, as we demonstrate in Figure 5.

Figure 5. A sneak attack for simple linear regression. Since the points of the other agents and the factual update of the agent yield the same LR estimator, the result of running the regression on all points is the same regardless of what the actual points of the others are.
In Example 1 in Appendix E, we show a more complicated explicitly-lying sneak attack for regression with a single feature (also called "simple linear regression"). The attack can be generalized to any number of features. This yields

Theorem 2.

Multiple linear regression is NCC-vulnerable.

4.1. Triangulation Attacks and Vulnerability*

To study vulnerability*, we now define a stronger type of attack and show that such attacks exist for multiple linear regression, as long as the attacker can control sufficiently many identities. We name this type of attack a triangulation attack, and present a template parameterized by a sequence of functions in Strategy Template 3.

Figure 6. A general template for the triangulation attack. Until the special conditions are met, and after the re-sync is done, the strategy behaves as the truthful strategy.
Input: Observed history. Parameters: a sequence of triangulation functions and a final misleading function
Output: A ledger update
Let the step counter be 1 if there is a factual update after the last ledger update by the agent; otherwise, if a triangulation attack is ongoing, let it be the attack's current step, or else exit. Let the reference output be the last algorithm output in the observed history.
if the step counter is at most the number of triangulation functions then
      Return the corresponding triangulation function applied to the reference output
else if the step counter is one past the number of triangulation functions then
      Return the final misleading function applied to the reference output
Strategy Template 3: A template for a triangulation attack

The idea of triangulation attacks is that for any state of the ledger, the attacker can find subsequent updates so that it can both infer the algorithm output it would have obtained had it applied the truthful strategy instead (using the "triangulations"), and mislead others with the final update. Informally, this attack has the desirable property that regardless of the state of the ledger (and however corrupted it may be by previous updates of the attacker), the attacker can infer the true state.

As in the case of the sneak attack, we should show that the strategy template can be implemented using only the information in the observed history.

Lemma 3.

The triangulation attack is well defined, i.e., its conditions can be implemented using only information available in the observed history. Moreover, the assignment of the reference output is valid, that is, whenever it is performed there exists an algorithm output in the observed history.

We defer the proof details to Appendix C.

We now prove there is a triangulation attack for multiple linear regression, provided the attacker controls sufficiently many identities.

Theorem 4.

Multiple linear regression is NCC-vulnerable*, given sufficiently many identities, using a triangulation attack.

Proof.

We briefly outline the overall flow of the proof. First, we give an explicit construction of the triangulation functions. This suffices to show that condition (ii) is satisfied, which means there is an inference function that maps observed histories under the attack to the last algorithm output under the truthful strategy. Given that inference function, we construct the final misleading function and show that with it condition (i*) is satisfied. We give a formal treatment of inference functions in Definition 4 and Lemma 5 of Appendix B, but for our purpose in this proof it suffices that it is a map as specified.

Construction of the triangulation functions and condition (ii):

Each triangulation function sends a single-point update determined by the last algorithm output observed before its application and by a standard basis vector of the feature space.

Let a run be given with some nature-input and the triangulation attack with the specified triangulation functions (and any final misleading function). Consider all the factual updates by the other agents induced by the nature-input. Each is of the form of a feature matrix and a target vector, whose number of rows is the number of data points in the update; to consider all factual updates of the other agents, we can vertically concatenate these matrices. Similarly, concatenate all factual updates of the attacker, and concatenate all ledger updates made before the submission of any of the triangulation updates. Recall the algorithm outputs observed right before the triangulation and after each triangulation update (the j-th triangulation update generates the j-th post-triangulation output), and consider the corresponding concatenated ledger inputs that generate these outputs. In terms of the defined variables above, each post-triangulation algorithm output is the OLS estimator over the corresponding concatenated ledger input; equivalently, each output satisfies the normal equations of that input (Equation 2).

To show that condition (ii) holds, it suffices to show that we can infer the last algorithm output of the truthful run. Let the concatenation of all factual updates of all agents be given; it is the input that generates that output, and its normal equations decompose into the (unknown) aggregate of the other agents' data and the (known) aggregate of the attacker's own factual data (Equation 3). Since, in Equation 3, besides the other agents' aggregate, all right-hand-side variables are part of the observed history under the attack, we conclude that it is enough to deduce that aggregate in order to infer the last algorithm output under the truthful strategy.

For every triangulation step, the normal equations of the current ledger relate the observed estimator to the (unknown) Gram matrix and moment vector of the ledger before the triangulation, plus the known contributions of the triangulation points sent so far (Equation 4). By the construction of the triangulation functions, each triangulation point is supported on the intercept coordinate and one additional coordinate, so its contribution can be written as a sparse rank-one matrix and a sparse vector with known entries; rewriting the normal equations accordingly gives one vector equation per step (Equation 5). Examining the differences between consecutive equations, each difference isolates the unknown Gram matrix applied to the difference of consecutive estimators, with a right-hand side that is fully known from the observed history under the attack (Equation 6).

Notice that for any step, the difference between consecutive estimators is not the zero vector: if it were, then since the relevant matrix is invertible, the estimator would be unchanged by the triangulation point, which would contradict the following claim:

Claim 3.

For every algorithm output, and every single-point update whose target value does not lie on the corresponding regression hyperplane, the new algorithm output for the data with the added point is different from the previous one; in particular, its prediction at the added point's features differs from the previous prediction.

The proof of the claim is given in Appendix E.

Moreover, by the definition of the triangulation points, the known right-hand-side vector of each difference equation is supported on the intercept coordinate and the coordinate of the corresponding step (and is non-zero, since the difference of estimators is non-zero), while its coordinates of higher index are zero. Therefore, the difference vectors obtained across the steps are linearly independent, and the matrix whose columns are these differences is invertible. Writing the difference equations of Eq 6 in matrix form, the unknown Gram matrix times this invertible matrix equals a matrix that we can directly calculate from the observed history under the attack. We conclude that the unknown Gram matrix can be recovered, and by the first equation of Eq 5 we can then infer the unknown moment vector, overall concluding the proof for condition (ii).

Construction of the final misleading function and condition (i*): Let the inference function (whose existence is guaranteed by the previous discussion) be the map that matches observed histories under the attack with the corresponding last true algorithm outputs under the truthful strategy. Given the observed history, let the last algorithm output in it be the current ledger estimator, and let the inferred truthful output be the value of the inference function.

If the current ledger estimator already differs from the inferred truthful output, the final function does not send an update, and so for the nature-input that produces this observed history the last algorithm output under the attack is different from that under the truthful strategy, as required by condition (i*).

Otherwise, the final function sends an update with a single point whose target value does not lie on the current regression hyperplane (which, in this case, coincides with the inferred truthful one). By Claim 3, the resulting algorithm output is different from the truthful output, as required. ∎

We demonstrate the construction and inference of the triangulation attack in an open-source implementation: https://github.com/yotam-gafni/triangulation_attack. Figure 7 shows a run of the attack on a random example of simple linear regression.

Figure 7. A script-run triangulation attack for simple linear regression. The round red points represent an existing state of the ledger. The yellow x points (in (1)) represent a new factual update for the strategic agent. The red line in (1) represents the resulting linear regression estimator if the agent reports truthfully. The four figures (2a)-(2d) show the flow of our triangulation attack construction. In (2a) is the last state of the ledger before the triangulation, with no triangulation point sent by the strategic agent. The rest of (2b)-(2d) consecutively add triangulation points (blue triangles). At the end of the triangulation attack (after (2d)), the linear regression estimator is different from the one in (1). It is possible to infer the estimator in (1) using knowledge of the triangulation points and the estimators of (2a)-(2d) (without knowledge of the red points).
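
Independently of the linked repository, the inference step of a triangulation attack can be sketched in a few lines: the attacker records the published estimator, sends a few single-point probes, records the estimator after each one, and solves the resulting difference equations for the hidden Gram matrix and moment vector of the pre-attack ledger, from which the counterfactual estimator follows. All names below are ours, and the probes are chosen at random rather than by the explicit construction in the proof of Theorem 4.

# A self-contained sketch of the inference step of a triangulation attack on
# linear regression. Names and probe choices are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
d = 3                                      # number of regression coefficients

# Hidden pre-attack ledger (others' data plus the attacker's past ledger updates).
X_hidden = np.column_stack([np.ones(20), rng.normal(size=(20, d - 1))])
y_hidden = X_hidden @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=20)
A_true = X_hidden.T @ X_hidden             # Gram matrix (unknown to the attacker)
b_true = X_hidden.T @ y_hidden             # moment vector (unknown to the attacker)

def published(A, b):                       # what the ledger broadcasts
    return np.linalg.solve(A, b)

# The attacker observes the current estimator, then sends d single-point probes.
betas = [published(A_true, b_true)]
probes = []
A, b = A_true.copy(), b_true.copy()
for _ in range(d):
    x = np.concatenate([[1.0], rng.normal(size=d - 1)])   # intercept coordinate fixed to 1
    y = rng.normal()
    probes.append((x, y))
    A, b = A + np.outer(x, x), b + x * y
    betas.append(published(A, b))

# Difference equations (cf. Eq. 6):
#   A_true (beta_j - beta_{j-1}) = x_j y_j - C_j beta_j + C_{j-1} beta_{j-1},
# where C_j is the sum of x_l x_l^T over the probes sent up to step j.
D, M, C = [], [], np.zeros((d, d))
for j, (x, y) in enumerate(probes, start=1):
    C_prev, C = C.copy(), C + np.outer(x, x)
    D.append(betas[j] - betas[j - 1])
    M.append(x * y - C @ betas[j] + C_prev @ betas[j - 1])
A_hat = np.column_stack(M) @ np.linalg.inv(np.column_stack(D))
b_hat = A_hat @ betas[0]                   # from the pre-probe normal equations

print(np.max(np.abs(A_hat - A_true)), np.max(np.abs(b_hat - b_true)))  # both ~ 0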

We show an asymptotically matching lower bound for triangulation attacks.

Theorem 5.

There is no triangulation attack for multiple linear regression that uses too few functions relative to the dimension of the feature space.

Proof.

Consider all nature-input elements that are of the form of a feature matrix together with the zero target vector, as well as elements of the same sizes but without any restriction over the target vector. We show that for any triangulation attack with too few functions, we can find two nature-inputs among this family with different observed histories under the truthful strategy, but the same observed history under the attack.

By the choice of the zero target vector, the first algorithm output is the zero estimator. As we know from the proof of Theorem 4, in particular Equation 5 (where it was done for a specific given triangulation attack), the attack generates one vector equation per triangulation step (including the one over the initial output). We also know the first row of the relevant matrix, and we can make it a stricter constraint by demanding that the first row has a fixed form. Then, the principal sub-matrix of the Gram matrix (removing the first row and column) is a general PSD matrix (as a principal submatrix of a PSD matrix). To uniquely determine such a matrix, we need more vector equations than the triangulation equations yield. So there are distinct inputs in the family of nature-inputs that have the same observed history under the attack. Fix some invertible such matrix. Since the equations underdetermine it, there must be some