## 1. Introduction

In today’s data-driven world, individuals share vast amounts of personal information with service providers (SPs) to receive personalized services. When sharing such data, data owners usually do not want SPs to pass their personal data on to other third parties. This issue is typically addressed via a consent (or data usage agreement) between a data owner and an SP that determines the extent to which the SP gains ownership of the user’s data. However, a user’s data may still end up in the hands of unauthorized third parties, since (i) SPs sometimes share (or sell) users’ personal information without their authorization, or (ii) SPs’ databases are sometimes breached (e.g., due to insufficient or non-existent security measures).

When such a leakage occurs, data owners would at least like to identify its source so that they can hold the corresponding SP(s) liable for the leakage. If techniques existed to help data owners identify the source of data leakages, SPs (knowing the potential consequences) would be hesitant to share users’ data without authorization, or they would take stronger security measures against data breaches. Thus, it is crucial to develop techniques that can be used when sharing personal data with untrusted SPs and that are robust against the various attacks that may be launched by malicious SPs. In this work, we propose such a technique, focusing on sequential personal information such as genomic data, location data, or financial data.

Similar techniques exist for multimedia to prevent copyrighted content from being copied or shared without the authorization of the data owner. Digital watermarking (Podilchuk and Delp, 2001) proves the ownership of digital content by embedding a mark into it; the embedded watermark can be the same for each copy of the digital object. To detect the source of an illegal distribution, a different mark must be used for each copy. Digital fingerprinting (Wu et al., 2004) identifies the recipient of a digital object by embedding a unique mark (called a fingerprint) into it. The aim of fingerprinting is to identify the guilty agent responsible for a data leakage. Watermarking and fingerprinting techniques have been developed for different types of digital content, such as audio, video, software, relational databases, graphs, and maps. However, such techniques are not directly applicable to our scenario (sharing personal correlated data) because (i) they (especially for multimedia) rely on the high redundancy in the data, (ii) the embedded marks need to be large, which reduces the utility of the shared data, and (iii) they do not consider the correlations between data points.

In this work, we propose a fingerprinting technique for sequential data with correlations between data points. We consider several malicious behaviors that can be launched by malicious SPs against the proposed fingerprinting scheme, including: (i) flipping data points, (ii) using a subset of the data points, (iii) utilizing the correlations in the data, and (iv) colluding with other SPs to identify and/or distort the fingerprint.

The proposed fingerprinting scheme essentially relies on adding controlled noise to particular data points in the original data. We build a correlation model (that captures the statistics of the data) and add a fingerprint that is consistent with the nature of the data. By doing so, we prevent malicious SPs from identifying and distorting the added fingerprint using auxiliary information about the data model. We also consider colluding SPs (that receive different fingerprinted copies of the same data) who aim to detect and distort the fingerprints. The proposed fingerprinting scheme utilizes Boneh-Shaw codes (Boneh and Shaw, 1998) to improve its collusion resistance and integrates these codes with a novel algorithm to also provide robustness against other types of malicious behavior, such as flipping or exploiting correlations in the data (which are not considered by Boneh-Shaw codes). Furthermore, we propose a detection algorithm that identifies suspects based on the similarity of their copies to the leaked data and selects one of them by checking the Boneh-Shaw codes embedded in their copies. Besides providing robustness against the aforementioned attacks, the proposed scheme also keeps data utility high by controlling the fraction of fingerprinted data points.

In most data sharing scenarios, individuals also want to share their personal information with the SPs under certain privacy guarantees. Several existing techniques rely on adding controlled noise to the shared data to provide privacy guarantees, including those deployed by large technology companies such as Google (Erlingsson et al., 2014) and Apple (Team and others, 2017). Noting the methodological similarity between the proposed scheme and privacy-preserving data sharing, we also propose a scheme that provides both privacy and liability (i.e., robust fingerprinting) when sharing personal data with SPs. Here, the main challenge is the conflict between data privacy and robust fingerprinting. For stronger privacy, the same (or a similar) noise pattern should be used at each new sharing of the data (e.g., by maximizing the overlap between noisy data points across different sharings). For robust fingerprinting, on the other hand, the shared noise patterns should be unique so that the source of a potential data leakage can be identified (e.g., by minimizing the overlap between noisy data points across different sharings). Thus, we propose a scheme that controls the size of this overlapping region to identify a sweet spot between privacy and robust fingerprinting.
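To make the trade-off concrete, the following sketch illustrates one simple way of controlling the overlap of noisy positions across sharings: each new copy reuses a fraction of noisy positions from a fixed shared pool (favoring privacy) and draws the rest fresh (favoring fingerprint uniqueness). The function name, the pool mechanism, and all parameters are illustrative assumptions, not the paper’s exact scheme.

```python
import random

def noisy_indices(n, num_noisy, shared_fraction, shared_pool, rng=None):
    """Pick num_noisy positions (out of n) to perturb in one sharing.

    shared_fraction of them are reused from shared_pool (privacy:
    sharings overlap), the rest are drawn from outside the pool
    (fingerprinting: sharings differ). Illustrative sketch only.
    """
    rng = rng or random.Random()
    n_shared = int(num_noisy * shared_fraction)
    indices = set(rng.sample(sorted(shared_pool), n_shared))
    fresh = [j for j in range(n) if j not in shared_pool]
    indices.update(rng.sample(fresh, num_noisy - n_shared))
    return indices
```

Increasing `shared_fraction` moves the scheme toward the privacy end of the spectrum; decreasing it moves it toward robust fingerprinting.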

We implement the proposed fingerprinting scheme for genomic data sharing using a real-life genomic dataset. Via simulations, we show the robustness of the proposed scheme against various types of attacks that can be launched against a fingerprinting scheme. We show that the proposed scheme is efficient and scalable in terms of its running time. We also study the balance between fingerprint robustness and data privacy via simulations on genomic data.

The rest of the paper is organized as follows. We summarize the related work in Section 2. We describe the problem settings in Section 3. We propose the probabilistic fingerprinting scheme in Section 4 and explain how to utilize Boneh-Shaw codes to resist collusion attacks in Section 5. We propose a hybrid approach in Section 6 to provide privacy and liability together. We present our experimental results in Section 7. We discuss more about our scheme based on our evaluation results in Section 8 and conclude the paper in Section 9.

## 2. Related Work

Digital watermarking is the act of embedding an owner-specific mark into a digital object (e.g., an image, song, or video) to prove the ownership of the object (Cox et al., 2002). Digital watermarks are typically used for copyright and copy protection of multimedia content (Memon and Wong, 2001; Chung et al., 1998). Techniques have been developed to watermark audio (Bassia et al., 2001; Ko et al., 2005), images (Chang et al., 2003; Wang et al., 2001), and video (Swanson et al., 1998; Hartung and Girod, 1998). Watermarking techniques have also been proposed for text documents (Brassil et al., 1999; Atallah et al., 2001; Topkara et al., 2006), graphs (Qu and Potkonjak, 1999; Gross-Amblard, 2011; Zhao et al., 2015), maps (Yan et al., 2011; Wang et al., 2012; Lee and Kwon, 2013), time-series data such as electrocardiograms (Kozat et al., 2009; Soltani Panah and Van Schyndel, 2014), and spatiotemporal (trajectory) datasets (Jin et al., 2005; Lucchese et al., 2010). In general, watermarking schemes (i) benefit from the high redundancy in the data, (ii) mostly aim to prove ownership of the digital object, and (iii) do not consider robustness against various types of attacks (discussed in Section 3.2).

Digital fingerprinting is similar to watermarking in the sense that it also embeds a mark into the object. However, fingerprinting can be seen as a personalized version of watermarking, since the embedded mark (i.e., the fingerprint) is different in each copy of the data, with the objective of identifying the recipient if the data is disclosed to a third party without the data owner’s authorization. Since each copy is different, malicious recipients can collude and detect fingerprinted points by comparing their copies. Boneh and Shaw proposed a general fingerprinting solution for binary data that is robust against collusion (Boneh and Shaw, 1998). Their scheme constructs fingerprinting codes in such a way that attackers cannot detect some of the fingerprints, because overlapping fingerprint segments are included in each copy. However, the fingerprint needs to be significantly long to guarantee robustness against collusion, which reduces the utility of the data. Furthermore, they do not consider more complex attacks against fingerprinting algorithms, such as using correlations in the data to detect fingerprinted points. Due to these limitations, their codes cannot be used directly for personal data sharing. As we explain in Section 5.2.2, we utilize Boneh-Shaw codes to improve the robustness of the proposed scheme against collusion attacks.

Some other fingerprinting schemes have been proposed for multimedia (Wu et al., 2004), relational databases (Li et al., 2005; Liu et al., 2004; Lafaye et al., 2008), and sequential data (Ayday et al., 2019). Similar to multimedia watermarking schemes, fingerprinting schemes for multimedia also rely on the high redundancy in the digital object, and hence are not applicable to personal data sharing (which typically has less redundancy). Fingerprinting schemes for relational databases (Li et al., 2005; Liu et al., 2004; Lafaye et al., 2008) insert the fingerprint by altering the least significant bits of some attributes in selected tuples. Since databases consist of numerous tuples, the redundancy is much higher than in personal data. Moreover, these techniques do not consider correlations between attributes. The scheme proposed for sequential data (Ayday et al., 2019) is inefficient because it solves an optimization problem in each step of the algorithm. In addition, the objective of that optimization problem (i.e., minimizing the probability of identifying the whole fingerprint in a collusion attack) is unrealistic, because attackers do not need to identify the whole fingerprint to perform a successful attack; they can achieve their goal by modifying some of the fingerprinted points. Therefore, in this work we develop an efficient fingerprinting scheme that is robust against various types of attacks. Furthermore, to the best of our knowledge, we are the first to study fingerprinting robustness and privacy together in a data sharing scheme.

## 3. System Model, Threat Model, and Robustness Measures

In this section, we present our system model, threat model, and robustness measures.

### 3.1. System and Data Model

We show our system model in Figure 1. We assume a data owner (Alice) with a sequence of data points , where is the length of the data and each can have a value from the set . Alice wants to share her data with multiple service providers (SPs) to receive a service from these SPs related to her data. We represent these SPs with a set and each SP with an index such that . We assume Alice wants to add a unique fingerprint in each sharing to detect the source of data leakage in the case of an unauthorized data sharing by any of these SPs. Hence, for each , Alice creates a unique fingerprinted copy of original data by changing the values of some data points. We list the frequently used symbols in Table 1.

Changing the states of more data points for fingerprinting increases Alice’s chance of detecting the malicious SP(s) who leak her data. However, fingerprinting naturally degrades the utility of the shared data, and one of our goals is to minimize this degradation while providing a robust fingerprinting scheme. We define the utility of a shared copy as , where is the utility of data point , if (i.e., correctly shared), and if (i.e., fingerprinted).

On one hand, the aim of Alice is to detect the source of a data leakage in case her data is shared without her consent by an SP (or a collusion of multiple SPs). On the other hand, the aim of the malicious SPs is to avoid being detected while sharing as many data points as possible. A malicious SP may avoid detection by Alice by changing the states of some data points in its copy or by excluding them from its unauthorized sharing. Multiple malicious SPs may also collude to detect fingerprinted data points. Furthermore, background knowledge about the data can be used to detect fingerprinted data points (we discuss the threat model in detail in Section 3.2). One such piece of auxiliary information that malicious SPs can use is the correlations in the data.

We assume that data is correlated and in this work, for clarity of presentation, we consider pairwise correlations between consecutive data points. The proposed mechanism can also be extended to consider more complex correlations, which may result in eliminating more possible values with low correlations. Thus, we let values be publicly available for any and . This model represents the inherent characteristics of different data types, such as location patterns or genomic data. For instance, consecutive data points that are collected with small differences in time are correlated in location patterns. Similarly, in genomic data, point mutations (e.g., single nucleotide polymorphisms) may have pairwise correlations between each other (e.g., linkage disequilibrium (Slatkin, 2008)). Therefore, a malicious SP may use such correlations to detect and exclude fingerprinted data points (e.g., the ones that are not compliant with the expected correlation model) in its unauthorized data sharing.

**Table 1.** Frequently used symbols.

| Symbol | Description |
|---|---|
| | Original data owned by Alice (data owner). |
| | Possible values (states) of a data point. |
| | Fingerprinted copy shared with . |
| | Data leaked by malicious SP(s). |
| | The set of all SPs receiving a fingerprinted copy from Alice. |
| | The set of colluding SPs in the collusion attack. |
| | The probability of sharing data point as in the proposed scheme. |
| | The number of colluding SPs in the collusion attack. |
| | The probability of fingerprinting a data point in the proposed scheme. |
| , | The probability of flipping a data point in the flipping attack and removing a data point in the subset attack. |
| | The estimation of by the colluding SPs. If is publicly known, . |

### 3.2. Threat Model

In this section, we present the attacks that may be performed by malicious SPs (attackers). We consider these attacks when developing the proposed scheme, and we evaluate the robustness of the proposed scheme against them in Section 7. In all of these attacks, the main goals of the attacker(s) are (i) to avoid being detected by the data owner and (ii) to share as many correct data points as possible (i.e., not to further distort the data). The attacker(s) need to modify the values of some data points in their copy to distort the fingerprint. These modifications mostly cause utility loss for the attacker(s). Note, however, that in some cases, such as in the collusion attack, the attackers’ utility may improve as a result of the attack. Let be the leaked copy of Alice’s data . The utility of the attacker(s) is defined as , where is the utility of , if and if .

#### 3.2.1. Flipping Attack

In this attack, a malicious SP randomly flips the values of some data points to distort the fingerprint (before the unauthorized sharing). The malicious SP flips each data point with probability . If it decides to flip a point, it selects one of the remaining values (states) of that data point with equal probability and shares that state. A higher decreases the probability of being detected by Alice; however, it also lowers the utility of the shared data. This attack does not change the size of the data.
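The flipping attack can be sketched in a few lines; this is a minimal illustration with generic names (the flipping probability is written as `p_flip` here, standing in for the paper’s symbol):

```python
import random

def flipping_attack(copy, states, p_flip, rng=None):
    """Flip each data point with probability p_flip to a uniformly
    chosen different state; the data length is unchanged."""
    rng = rng or random.Random()
    leaked = []
    for value in copy:
        if rng.random() < p_flip:
            # choose a different state uniformly at random
            leaked.append(rng.choice([s for s in states if s != value]))
        else:
            leaked.append(value)
    return leaked
```

With `p_flip = 0` the copy is leaked as-is; with `p_flip = 1` every point differs from the received copy.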

#### 3.2.2. Subset Attack

This attack is similar to the flipping attack, but here a malicious SP excludes (removes) some randomly chosen data points before leaking the data, instead of flipping them. We denote the probability of excluding a data point as . This attack is not as powerful as the flipping attack: flipping data points might create a fingerprint pattern that looks similar to some other SP’s fingerprint pattern, and hence Alice may falsely accuse an innocent SP, whereas to succeed in the subset attack, the malicious SP needs to exclude almost all of the fingerprinted data points.
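A minimal sketch of the subset attack, using `None` as a stand-in for the paper’s symbol for an excluded data point and `p_remove` for the exclusion probability:

```python
import random

def subset_attack(copy, p_remove, rng=None):
    """Mark each data point as excluded (None) with probability
    p_remove; positions are kept so indices still line up."""
    rng = rng or random.Random()
    return [None if rng.random() < p_remove else v for v in copy]
```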

#### 3.2.3. Correlation Attack

As discussed, the correlations between consecutive data points are assumed to be publicly known. A malicious SP can use these correlations to degrade the robustness of the fingerprinting mechanism (we define robustness in Section 3.3). Assume a malicious SP receives from Alice, and let two data points in the received data be and . If is low, the attacker infers that either or is fingerprinted with high probability. After detecting such a pair, the malicious SP may change their values or exclude them before the unauthorized sharing. In other words, the malicious SP combines the correlation attack with the flipping or subset attack.

Since the flipping attack is more powerful (as discussed in Section 3.2.2), we assume the malicious SP combines the correlation attack with the flipping attack as follows: for two consecutive data points and , the malicious SP checks the conditional probability of the value of given the value of . If this probability is less than a threshold , the malicious SP changes the value of to the value in the set that yields the highest conditional probability (correlation) between and . Otherwise (i.e., if the conditional probability is greater than or equal to ), the malicious SP flips the value of with probability (as in the flipping attack). By doing so, the malicious SP (i) distorts the fingerprint with high probability (by distorting the data points that are not compliant with the inherent correlations in the data) and (ii) adds random noise to the data to further reduce its chance of being detected by the data owner. Following the same strategy, the malicious SP checks all pairs up to and . If the malicious SP did not combine the correlation attack with the flipping attack, Alice could run the same correlation attack on and obtain the same result as the malicious SP. The randomness introduced by the flipping attack therefore makes it difficult for Alice to identify the malicious SP.
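The combined correlation-plus-flipping attack described above can be sketched as follows. Here `corr[(a, b)]` stands for the public conditional probability of observing value `b` after value `a`, and `tau` and `p_flip` stand for the paper’s threshold and flipping-probability symbols; all names are illustrative.

```python
import random

def correlation_attack(copy, states, corr, tau, p_flip, rng=None):
    """Scan consecutive pairs; replace suspiciously uncorrelated
    values with the most correlated state, otherwise flip randomly.
    corr must define a probability for every (prev, value) pair."""
    rng = rng or random.Random()
    leaked = list(copy)
    for j in range(1, len(leaked)):
        prev = leaked[j - 1]
        if corr[(prev, leaked[j])] < tau:
            # suspicious pair: likely fingerprinted, replace with the
            # state that best matches the public correlation model
            leaked[j] = max(states, key=lambda s: corr[(prev, s)])
        elif rng.random() < p_flip:
            # otherwise add random noise, as in the flipping attack
            leaked[j] = rng.choice([s for s in states if s != leaked[j]])
    return leaked
```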

#### 3.2.4. Collusion Attack

If multiple malicious SPs collude, they may detect and distort the fingerprinted data points by comparing their copies. The goal of the colluding SPs is to share a single copy of the data owner’s data without being detected. One well-known collusion attack against fingerprinting schemes, called the majority attack in the literature (Li et al., 2005), is for the colluding SPs to compare all their received data points and share, at each position, the value observed by the majority of them. However, such an attack cannot succeed on its own if it contains no randomness: as with the correlation attack, the data owner (Alice) can simulate the majority attack of the colluding SPs (and easily identify the malicious SPs). Therefore, the collusion attack should also be combined with the flipping attack. Furthermore, the colluding SPs can use correlations to perform a more successful attack. In Section 5, we explain in detail how such an attack (i.e., a collusion attack combined with the correlation and flipping attacks) can be performed against the proposed fingerprinting scheme.
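The baseline (deterministic) majority attack is simple to state in code; this minimal sketch takes the colluders’ copies as equal-length lists:

```python
from collections import Counter

def majority_attack(copies):
    """At every position, leak the value observed by the largest
    number of colluding SPs (ties broken by first observation)."""
    length = len(copies[0])
    return [Counter(c[j] for c in copies).most_common(1)[0][0]
            for j in range(length)]
```

Because this mapping from copies to the leaked sequence is deterministic, Alice can reproduce it for any suspected coalition, which is exactly why the text argues the attack must be randomized.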

### 3.3. Robustness Measures

One common requirement for fingerprinting schemes is robustness against malicious attacks that may destroy or distort the embedded fingerprint. This is important because such attacks may cause the data owner to accuse an innocent SP of the data leakage, i.e., cause false positives in the fingerprint detection algorithm. Therefore, a fingerprinting scheme is considered robust if it resists malicious attacks and allows detection of the guilty SP that leaked the data after performing an attack. In this work, we assume that the detection algorithm always returns a guilty SP when Alice observes an unauthorized copy of her data. The proposed detection methods compute a score for each SP that has a copy of Alice’s data and identify the guilty SP based on these scores. To quantify the robustness of the proposed scheme, we use the accuracy () of the detection algorithm, defined as the probability of detecting the guilty SP from the leaked data. In (Li et al., 2005), the misattribution false hit () is defined as a robustness measure (the probability of detecting an incorrect fingerprint from the leaked data); our robustness measure is equivalent to .

If a collusion attack occurs, the accuracy of the detection algorithm can be evaluated as the probability of correctly identifying one or all of the colluding SPs. In (Wu et al., 2004), the possible design goals for fingerprinting schemes are defined as “catch one”, “catch many”, and “catch all”. In the catch-one scenario, the goal is to maximize the probability of catching one of the colluding SPs while minimizing the probability of falsely accusing an innocent user. In the other scenarios, the probability of falsely accusing innocent SPs () increases. Hence, most collusion-resistant fingerprinting schemes, such as Boneh-Shaw codes (Boneh and Shaw, 1998), aim to catch one of the colluding SPs. Since we utilize Boneh-Shaw fingerprinting codes in our proposed scheme, we also aim to catch one of the colluding SPs in case of a collusion attack. Thus, we define as the probability of detecting one guilty SP from the leaked data.

## 4. Probabilistic Fingerprinting Scheme for Correlated Data

In this section, we propose our probabilistic fingerprinting scheme, which considers the correlations in the data (addressing the attack in Section 3.2.3). We also present two techniques for the data owner to detect the source of a data leakage. In Section 5, we extend this scheme to also handle colluding malicious SPs.

### 4.1. Proposed Fingerprinting Algorithm

Assume the data owner (Alice) has a sequence of data points and wants to share her data with an SP as after fingerprinting. Alice determines a fingerprinting probability (for each data point), meaning that approximately data points will be fingerprinted (i.e., their values will be changed) when sharing data points with . Lower values increase the utility of the shared data (by changing the values of fewer data points), but they also decrease the robustness of the fingerprint (i.e., the data owner’s chance of detecting the source of a data leakage).

Under these settings, a naive algorithm fingerprints each data point with the same probability, without considering correlations in the data. Hence, each data point is shared correctly () with probability and incorrectly, i.e., fingerprinted (), with probability . For each fingerprinted data point, the shared state is selected among the states in with equal probability. However, if Alice applies this naive probabilistic fingerprinting scheme, a malicious SP can detect some of the fingerprinted data points using the correlations and distort the fingerprint via flipping, as discussed in Section 3.2.3.

To prevent such an attack, the correlations must be considered in the fingerprinting scheme. In our proposed probabilistic fingerprinting scheme, for each data point , we assign, considering the correlations in the data, a different probability for sharing each possible state of this data point in . Let be the probability of sharing data point as (i.e., ). The proposed scheme assigns a value for all and . By doing so, unlike the naive approach, we account for the correlations in the data, and hence prevent a malicious SP from detecting and distorting fingerprints. We propose an iterative algorithm that starts from the first data point and assigns the probabilities . Since we consider correlations between consecutive data points, for the first data point , Alice shares the correct value with probability and each incorrect value with probability , as in the naive approach. Based on these probabilities, the algorithm selects a value for from the set . For subsequent data points, the algorithm computes the fingerprinting probabilities by checking the correlation with the preceding data point.

Let the shared value of (i.e., ) be . The algorithm checks the conditional probabilities for all . If the algorithm decided to share as and is low, a malicious SP could detect that either or is fingerprinted. To eliminate such a correlation attack, the proposed algorithm uses a threshold and sets if . This means the algorithm never selects as the value of if is not consistent with the inherent correlations in the data. Let be the actual value of . If , is set to . All remaining probabilities () are assigned proportionally to the value of . After assigning all probabilities, the algorithm chooses one of the values from based on the assigned probabilities and sets the value of accordingly.
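The per-point probability assignment can be sketched as below. Here `corr[(a, b)]` stands for the public conditional probability of value `b` following value `a`, `p` is the fingerprinting probability, and `tau` the correlation threshold; the proportional split of the residual mass follows the description above, but the exact formula in the paper may differ, so treat this as an assumed instantiation.

```python
import random

def sharing_probabilities(actual, prev_shared, states, corr, p, tau):
    """Assign a sharing probability to every candidate state of one
    data point, given the previously shared value prev_shared."""
    probs = {}
    # states whose correlation with the previous shared value is below
    # tau must never be shared (they would expose the fingerprint)
    allowed = [s for s in states if corr[(prev_shared, s)] >= tau]
    mass = 1.0
    if actual in allowed:
        probs[actual] = 1.0 - p          # correct value: probability 1 - p
        mass = p
        allowed = [s for s in allowed if s != actual]
    total = sum(corr[(prev_shared, s)] for s in allowed)
    for s in allowed:
        # remaining mass split proportionally to correlation
        probs[s] = mass * corr[(prev_shared, s)] / total
    for s in states:
        probs.setdefault(s, 0.0)         # below-threshold states get 0
    return probs

def choose_value(probs, rng=None):
    """Sample the shared value according to the assigned probabilities."""
    rng = rng or random.Random()
    states, weights = zip(*probs.items())
    return rng.choices(states, weights=weights)[0]
```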

Since the proposed algorithm considers correlations between consecutive data points, it may end up changing the values of a long run of consecutive data points, or sharing the correct values for a long run, in order to prevent malicious SPs from detecting and distorting the fingerprint. As a result, the total number of fingerprinted points may deviate significantly from the expected number (). For instance, if is set to , we expect to fingerprint approximately of the data points; considering correlations, however, may cause significantly more (or fewer) data points to be fingerprinted than anticipated. To prevent this, we dynamically decrease (or increase) the fingerprinting probability depending on the number of data points fingerprinted so far, as follows.

To keep the average number of fingerprinted data points at , the proposed algorithm divides the data points into blocks of data points, so that we expect (on average) one fingerprinted data point per block. The algorithm counts the number of fingerprinted data points at the end of each block. If the ratio of fingerprinted data points is less than , the algorithm sets the fingerprinting probability for the next block to . Here, is a design parameter in the range , and we evaluate the selection of in Section 7. If the ratio of fingerprinted data points is greater than , the algorithm sets the fingerprinting probability for the next block to .
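The block-level adjustment can be sketched as below. The exact update rule is elided in the text above, so the multiplicative update with factor `(1 ± beta)` used here is one plausible instantiation, not necessarily the paper’s formula; `p_base` and `beta` stand for the fingerprinting probability and the design parameter.

```python
def adjust_probability(p_base, fingerprinted, total, beta):
    """After a block, compare the realized fingerprinting ratio with
    the target p_base and nudge the next block's probability.

    Assumed update: scale by (1 + beta) when behind the target and by
    (1 - beta) when ahead of it, with beta in (0, 1)."""
    ratio = fingerprinted / total
    if ratio < p_base:
        return p_base * (1 + beta)   # too few fingerprints so far
    if ratio > p_base:
        return p_base * (1 - beta)   # too many fingerprints so far
    return p_base
```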

The steps of the proposed scheme are also shown in Algorithm 1. For each SP, Alice executes the same algorithm with a different seed value and stores the fingerprint pattern of each SP to use in detection, in case her data is shared without her consent. If the data size and the number of SPs are large, Alice can instead store only the seed value for each SP.

Figure 2 shows a toy example illustrating the execution of the proposed algorithm (Algorithm 1). Each step corresponds to one execution of the outer loop of Algorithm 1, deciding the value of one data point. In the first step, the probabilities are assigned using only . In subsequent steps, the correlation values are also used to determine the probabilities from which the shared values () are chosen. When a correlation is less than , the algorithm assigns to the probability of selecting the corresponding value. At the end of each block (a block contains data points in the example), the algorithm checks the total number of fingerprinted values. Since the expected number of fingerprinted points in a block is and the algorithm fingerprinted data points, the fingerprinting probability is adjusted to for the second block.

### 4.2. Detecting the Source of Data Leakage

Let the leaked copy of Alice’s data be . Alice’s goal is to detect the SP that leaked her data. Here, we propose two methods that Alice can use to detect the source of the data leakage. Both methods compute a score (a probability or a similarity) for each SP (with whom Alice previously shared her data) of being guilty of leaking . Alice chooses the SP with the highest score as the guilty SP. Note that the size of the leaked data may not be equal to the size of Alice’s data (i.e., ) if a subset attack was performed. In that case, we represent excluded data points with (which is not in the set ).

#### 4.2.1. Probabilistic Detection

In (Papadimitriou and Garcia-Molina, 2010), considering the data leakage problem, the authors compute the probability of each agent (SP) being guilty, under some independence assumptions (to compute the probabilities efficiently). We adapt their detection method to our setting as follows: for each leaked data point , Alice identifies each SP that received the corresponding data point as (i.e., ). Thus, for each data point , Alice constructs a set of the SPs that received the leaked value of that point. Alice assumes that each SP in is guilty of leaking with probability . For instance, if the value of was shared with SPs, each of them is considered guilty of leaking with probability . Eventually, an is considered guilty with probability . Finally, Alice identifies the guilty SP as the one with the highest probability of being guilty.
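A minimal sketch of this detection method, under one plausible reading of the (elided) aggregation formula: following Papadimitriou and Garcia-Molina’s guilt model, an SP’s overall guilt probability is taken as one minus the product of the per-point probabilities of *not* being the leaker. The function and variable names are illustrative.

```python
def probabilistic_detection(leaked, copies):
    """leaked: list of leaked values (None = excluded point);
    copies: dict mapping SP name -> the fingerprinted copy it received.
    Returns (most likely guilty SP, guilt probability per SP)."""
    guilt_not = {sp: 1.0 for sp in copies}   # product of (1 - 1/|S_j|)
    for j, value in enumerate(leaked):
        if value is None:                    # excluded by a subset attack
            continue
        # S_j: SPs that received the leaked value at position j
        suspects = [sp for sp, copy in copies.items() if copy[j] == value]
        for sp in suspects:
            guilt_not[sp] *= 1.0 - 1.0 / len(suspects)
    probs = {sp: 1.0 - g for sp, g in guilt_not.items()}
    return max(probs, key=probs.get), probs
```

A value shared with only one SP is maximally incriminating: the corresponding factor is zero, pushing that SP’s guilt probability to one.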

#### 4.2.2. Similarity-Based Detection

Another efficient way of detecting the guilty SP is to compute the similarity between the fingerprinted data shared with each SP and the leaked data. For an , Alice compares the leaked data with the copy and counts the matching data points in the fingerprint pattern; in other words, Alice checks the size of the following set: . Alice also counts the fingerprinted data points of . Eventually, the SP with the maximum value is identified as guilty.
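A minimal sketch of similarity-based detection. The score used here, matching fingerprinted positions divided by the SP’s total fingerprinted positions, is a plausible reading of the (elided) formula above, not necessarily the paper’s exact normalization.

```python
def similarity_detection(leaked, original, copies):
    """leaked: the leaked values; original: Alice's original data;
    copies: dict mapping SP name -> its fingerprinted copy.
    Returns (most similar SP, similarity score per SP)."""
    scores = {}
    for sp, copy in copies.items():
        # positions where this SP's copy was fingerprinted (changed)
        fingerprinted = [j for j in range(len(original))
                         if copy[j] != original[j]]
        # how many of those fingerprinted values appear in the leak
        matches = sum(1 for j in fingerprinted if leaked[j] == copy[j])
        scores[sp] = matches / len(fingerprinted) if fingerprinted else 0.0
    return max(scores, key=scores.get), scores
```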

Similarity-based detection is easy to implement, and it efficiently detects the guilty SP when there is no collusion between the SPs or when the number of colluding SPs is small. However, as the number of colluding SPs increases, the similarity scores of the SPs become very close to each other and the detection performance decreases. Probabilistic detection differs from similarity-based detection by considering the number of SPs that hold each leaked data point (i.e., ). In similarity-based detection, each data point contributes equally to the similarity score, whereas in probabilistic detection, the probability of being guilty is computed by multiplying the values, so each data point contributes differently. In Section 7, we compare the performance of these two detection techniques in terms of their accuracy in identifying the guilty SP.

## 5. Considering Colluding Service Providers

As discussed in Section 3.2.4, the colluding SPs may distort the fingerprint via the majority attack; however, the lack of randomness in that attack may allow Alice to detect the colluding SPs. Here, we first present a strong attack against the proposed fingerprinting scheme (and against existing fingerprinting schemes in general) that combines colluding SPs, the correlations in the data (Section 3.2.3), and the flipping attack (Section 3.2.1). Then, we propose utilizing Boneh-Shaw codes (Boneh and Shaw, 1998) to improve robustness against this strong attack.

### 5.1. Probabilistic Majority Attack

As discussed, the goal of the colluding SPs is to share (leak) a copy of the data without being detected by the data owner (Alice). In a standard collusion attack, the colluding SPs compare their received values for each data point and select the most frequently observed value to share. We propose an advanced collusion attack called the “probabilistic majority attack” (to distinguish it from the standard majority attack), in which the colluding SPs decide the value of each leaked data point by considering (i) all observed values for that data point, (ii) the correlation of that data point with the others, and (iii) the probability () that a data point is fingerprinted. If the colluding SPs do not know the fingerprinting probability , we assume they use an estimate in their attack ( if is publicly known).

The goal of the colluding SPs is to create a copy of the data to share while avoiding detection by Alice. In this attack, the colluding SPs first decide the value of each data point by computing a probability for each of its possible states. To compute these probabilities, the colluding SPs first check their received values for the data point; for each candidate value, they count the number of colluding SPs that observed it. In the standard majority attack, the colluding SPs choose the value with the maximum observation count as the value of the leaked data point (assuming it is the original value) to avoid detection by Alice. However, it is possible (with lower probability) that a value with a lower observation count is the original value. For example, assume there are three colluding SPs. For a given data point, observing one value twice and another value once is not enough to conclude that the former is the original value of the data point. It is also possible (with a small probability) that the original value is different and the colluding SPs received fingerprinted (noisy) values from Alice.

Thus, in the probabilistic majority attack, the attackers compute such probabilities for each candidate value using their estimated fingerprinting probability and the publicly known pairwise correlations between the data points. As discussed in Section 3.2.3, attackers can use correlations to detect and distort fingerprinted data points. Therefore, the attackers tend to select leaked values that have high correlation with the previously shared (leaked) values. In order to integrate the correlations with the collusion attack, the conditional probabilities (due to correlations) are used as weights when determining the sharing probabilities of the colluding SPs. The colluding SPs first compute weighted probability values and then obtain the sharing probabilities by normalizing them. The weighted probability for a candidate value of a data point is obtained by multiplying the probability of that value being the original one (given the colluders' observations) by this correlation-based weight.

Here, the probability of a candidate value being the original value of the data point is computed by assuming each data point is fingerprinted with the estimated probability, and the conditional probability of that value given the previously leaked value is used as a weight. The sharing probabilities are then computed by normalizing the weighted values.

The colluding SPs decide on the value of each shared point proportionally to these probabilities. Therefore, the malicious SPs do not necessarily select the value observed by the majority. By doing so, they can further distort the fingerprint compared to the standard majority attack.
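The per-point computation above can be sketched as follows. This is our own hedged reconstruction, not the paper's exact formula: we assume that if a candidate value is the original one, each colluder that observed it kept the original (probability 1 - p_hat) and each colluder that observed something else received a fingerprinted value (probability p_hat, spread uniformly over the other states); `corr_weight` stands in for the conditional-probability lookup given the previously leaked value.

```python
def sharing_probs(observations, p_hat, states, corr_weight):
    """Probabilistic majority attack (sketch): weight each candidate
    state's likelihood of being the original value by its correlation
    with the previously leaked value, then normalize."""
    k = len(states)
    weighted = {}
    for v in states:
        n_v = observations.count(v)          # colluders that received v
        others = len(observations) - n_v     # colluders that received another value
        likelihood = (1 - p_hat) ** n_v * (p_hat / (k - 1)) ** others
        weighted[v] = likelihood * corr_weight(v)
    total = sum(weighted.values())
    return {v: w / total for v, w in weighted.items()}
```

The colluders then draw the leaked value from this distribution (rather than always taking the majority value) and finally flip it with the flipping probability.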

Furthermore, existing techniques to prevent collusion attacks (e.g., Boneh-Shaw codes) assign common fingerprints to multiple SPs, which allows the data owner to detect colluding SPs with a high chance. However, to avoid such detection, the attackers may flip some random data points before they leak the data. Thus, in the probabilistic majority attack, we also let the colluding SPs flip each leaked data point with a flipping probability (as discussed in Section 3.2.1).

To illustrate the probabilistic majority attack with a toy example, assume there are three colluding SPs and three possible states for each data point. Assume two of the colluding malicious SPs have received one value for the first data point and the other malicious SP has received a different value. Using the estimated fingerprinting probability, the colluding SPs compute a weighted probability for each of the three candidate states. Since this is the first data point and we consider pairwise correlations between consecutive data points, the conditional probabilities are all considered equal here. Then, the colluding SPs choose a value (to share) from the three states according to the normalized probabilities. Finally, the chosen value is flipped with the flipping probability.

### 5.2. Integrating Boneh-Shaw Codes

In order to provide robustness against collusion attacks, Boneh and Shaw proposed fingerprinting codes for detecting one of the colluding SPs (Boneh and Shaw, 1998). The effectiveness of their codes depends on the “marking assumption”, which states that the colluding SPs cannot detect the fingerprint at positions where all of them received the same value. Hence, when the colluding SPs have the same value for a data point, it is assumed that they choose this value for the leaked data point. However, Boneh-Shaw codes do not consider the correlation and flipping attacks, and hence their detection method is not successful when the colluding SPs also utilize the correlations in the data and use the flipping attack. Although Boneh-Shaw codes are defined for binary data, we apply these codes to non-binary data. We briefly describe the Boneh-Shaw codes in Section 5.2.1 and show how to utilize them in the proposed scheme in Section 5.2.2.

#### 5.2.1. Boneh-Shaw Codes

Boneh-Shaw codes (Boneh and Shaw, 1998) are designed to detect one of the colluding SPs with high probability under the marking assumption, given an upper bound on the number of SPs that can collude (in the original paper (Boneh and Shaw, 1998), the error probability is denoted by the same symbol as the privacy parameter in differential privacy, so we use a different symbol here to prevent confusion). Note that this upper bound is the maximum number of colluding SPs for which the detection technique can provide robustness (i.e., detect one of the colluding SPs) with high probability; the actual number of colluding SPs in a collusion attack may be smaller. Boneh-Shaw codes are defined as binary codes, in which each codeword consists of zeros and ones. In order to fingerprint the data, some data points are selected from the original data and XOR'ed with a permuted codeword, which prevents the colluding SPs from detecting and distorting the fingerprint (the number of selected data points must be equal to the length of the codeword). The same permutation must be used for each SP and must be hidden from the SPs. The data points that are XOR'ed with ones in the codeword become fingerprinted. Therefore, the ones in the binary code represent fingerprints.

In Boneh-Shaw (n, d)-codes, each codeword has length (n-1)d: the first codeword consists of all ones, the i-th codeword consists of (i-1)d zeros followed by (n-i)d ones, and the last (n-th) codeword consists of all zeros. For instance, a (4, d)-code has four codewords: all ones; d zeros followed by 2d ones; 2d zeros followed by d ones; and all zeros. If the recipients of the first and third codewords collude, based on the marking assumption, they create a leaked copy whose first two d-bit blocks consist of randomly selected bits and whose third block is all ones. To detect the source of the leakage (i.e., the guilty SP), the data owner checks each d-bit block, observes that the third block consists of all ones while the second block contains a zero with high probability, and concludes that the recipient of the third codeword is involved in the collusion. However, it is also possible that the random bits in the second block all turn out to be ones, which leads the data owner to accuse the owner of the second codeword. Thus, increasing d decreases the error in detection, but it also increases the length of the fingerprint (and hence decreases the utility of the shared data).
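The codeword structure and the collusion under the marking assumption can be sketched as follows (function names are ours; the permutation step, which hides the codeword positions from the SPs, is omitted):

```python
import random

def boneh_shaw_codewords(n, d):
    """Generate the n codewords of a Boneh-Shaw (n, d)-code.
    Codeword i (1-indexed) is (i-1)*d zeros followed by (n-i)*d ones,
    so every codeword has length (n-1)*d."""
    return [[0] * ((i - 1) * d) + [1] * ((n - i) * d) for i in range(1, n + 1)]

def collude(cw_a, cw_b, rng=random):
    """Marking assumption: positions where the two codewords agree are
    kept; positions where they differ are filled with random bits."""
    return [a if a == b else rng.randint(0, 1) for a, b in zip(cw_a, cw_b)]
```

For a (4, 3)-code, recipients of the first and third codewords agree only on the last block, so that block survives the collusion as all ones while the first two blocks become random.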

#### 5.2.2. Using Boneh-Shaw Codes in the Proposed Scheme

The marking assumption of (Boneh and Shaw, 1998) does not consider the flipping and correlation attacks, which are included in the probabilistic majority attack. If the colluding SPs flip some of the bits randomly or based on correlations, the data owner may accuse an SP that is not involved in the collusion. Therefore, it is not possible to directly use Boneh-Shaw codes and their detection algorithm in our scenario. Instead, we utilize Boneh-Shaw codes to assign shared (i.e., overlapping) fingerprints between different SPs.

In Algorithm 1, Alice creates the fingerprinted copy for each SP independently. Hence, the fingerprints of two SPs may coincide at some random data points. Here, we explicitly assign overlapping fingerprints using Boneh-Shaw codes to improve robustness against collusion attacks. In order to adopt binary codes into our scheme, we treat ones as fingerprinted data points and zeros as original data points. Therefore, when a data point is fingerprinted using Boneh-Shaw codes, the same fingerprinted value is also used in the copies for the other SPs whose codewords also include a one for that data point.

We integrate the Boneh-Shaw codes into our scheme as follows: Alice creates the first fingerprinted copy of her data (to share with the first SP) as described in Algorithm 1. Approximately the expected number of data points (determined by the fingerprinting probability) are fingerprinted in this copy, and we want to use a portion of these fingerprinted data points for Boneh-Shaw codes. Alice then decides the values of the design parameters n and d, where n is the number of Boneh-Shaw codewords that Alice can create and d is the block size (as explained in Section 5.2.1), such that the codeword length (n-1)d does not exceed the number of fingerprinted points in the first copy. As mentioned, n and d determine the length of the codes and the error in detection. If Alice wants to share her data with more than n SPs, she assigns the same Boneh-Shaw codewords in the same order: the (n+1)-th SP receives the same codeword as the first SP, the (n+2)-th SP the same codeword as the second SP, and so on. In this way, although the same Boneh-Shaw codewords are assigned to some SPs, since the other parts of their fingerprints will be different, the proposed detection algorithm can still identify the guilty SP using the entire fingerprint. If the codeword length equals the number of fingerprinted points in the first copy, Alice uses the entire fingerprint of the first SP for Boneh-Shaw codes in later sharings of her data (with other SPs). This provides better robustness against collusion attacks, since the number of overlapping fingerprints is high. However, in this case, the robustness against attacks performed by a single SP (e.g., the flipping or correlation attack) becomes weak because the number of unique fingerprints is low. Therefore, we set the codeword length to approximately half the number of fingerprinted points, so that the guilty SP can be detected whether the attack is performed by a single SP or by multiple SPs.

Alice randomly selects (n-1)d of the fingerprinted points in the first copy, where n is the number of codewords and d is the block size. These fingerprinted data points are considered as the first codeword in the Boneh-Shaw codes. For her next sharing with the second SP, Alice randomly selects (n-2)d of these points and assigns the same fingerprinted values to them in the second copy. Furthermore, Alice assigns the original values to the remaining d points. In other words, the Boneh-Shaw codeword of the second SP consists of d zeros and (n-2)d ones.

In order to assign approximately the same number of fingerprinted points to the second SP, its fingerprinting probability is reduced proportionally, since some fingerprints are already assigned before running the probabilistic algorithm. Alice then runs the proposed algorithm to sequentially add the remaining fingerprints. Since some fingerprints and original values are already assigned (as the Boneh-Shaw codeword of the second SP) before the algorithm starts, the algorithm skips these points while adding fingerprints. Furthermore, when the algorithm determines the probabilities for each possible value of a data point (i.e., in the inner loop of the algorithm), it also considers the correlations of the data points with the already assigned Boneh-Shaw codeword. The updated algorithm, which includes these new conditions, is shown in Algorithm 2. For each subsequent SP, Alice repeats the same process by first inserting the fingerprints and original values of its Boneh-Shaw codeword and then running the probabilistic algorithm to determine the values of the remaining points, as in Algorithm 2.
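The pre-assignment step can be sketched as follows. These are hypothetical helpers (not Algorithm 2 itself): `assign_bs_positions` picks which of the first SP's fingerprinted positions carry the codeword, and `preassign` fixes, for the i-th SP, the codeword positions to either the original value (zeros of its codeword) or the first copy's fingerprinted value (ones), before the probabilistic algorithm fills in the rest.

```python
import random

def assign_bs_positions(fp_indices, n, d, rng=random):
    """Randomly pick (n-1)*d of SP1's fingerprinted positions and split
    them into n-1 blocks of size d (the mapping must stay secret)."""
    chosen = rng.sample(fp_indices, (n - 1) * d)
    return [chosen[b * d:(b + 1) * d] for b in range(n - 1)]

def preassign(blocks, sp_index, fp_copy, original):
    """For SP i (1-indexed): blocks 1..i-1 get the original values
    (codeword zeros), blocks i..n-1 keep SP1's fingerprinted values
    (codeword ones). Returns {position: fixed value}."""
    assigned = {}
    for b, block in enumerate(blocks, start=1):
        for idx in block:
            assigned[idx] = original[idx] if b < sp_index else fp_copy[idx]
    return assigned
```

The probabilistic fingerprinting algorithm then runs over the remaining positions only, with its fingerprinting probability reduced to account for the already-assigned fingerprints.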

Here, we describe the algorithm explained in Section 5.2 on a toy example, which is also illustrated in Figure 3. Alice shares a fingerprinted copy of her data with the first SP after running Algorithm 2, where the underlined points in Figure 3 represent fingerprinted data points. We use such a high fingerprinting probability only for the clarity of this toy example; in practice, its value is significantly smaller to provide higher data utility for Alice. Let n = 4 and d = 1, so the Boneh-Shaw codeword length is three. Alice selects three random fingerprints from the first copy as the Boneh-Shaw codeword. For the second SP, Alice keeps two of these fingerprints and sets the other one back to the original value of the data point. Hence, Alice inserts the Boneh-Shaw codeword of the second SP before running Algorithm 2. Since Alice has already fingerprinted two points, she reduces the fingerprinting probability for the second SP accordingly, as in Algorithm 2. For the third and fourth SPs, Alice similarly inserts their Boneh-Shaw codewords (keeping one and zero of the selected fingerprints, respectively) before running Algorithm 2. Alice assigns the values of the missing points (represented with dashes in Figure 3) sequentially using Algorithm 2, as shown in Figure 3.

### 5.3. Detection Algorithm

Here, we propose a detection algorithm for the proposed collusion-resilient fingerprinting scheme. In practice, when Alice realizes that a copy of her data has been leaked without her consent, she cannot know whether it was leaked by a single SP (performing a flipping or correlation attack) or by multiple SPs (performing a collusion attack). Therefore, Alice cannot directly use the detection techniques in Section 4.2 or the detection technique of Boneh-Shaw codes. We propose a detection algorithm that utilizes both. As we show in the experimental evaluations, similarity-based detection (Section 4.2) performs slightly better than probabilistic detection. Hence, we describe the detection algorithm using similarity-based detection; it can also be implemented using the probabilistic detection technique in the same way.

First, Alice needs to determine whether her data has been leaked by a single SP or by multiple SPs. To do so, she initially computes a similarity score for each SP (that received her data) as explained in Section 4.2.2. Each score is a number in the range [0, 1]. If there is a collusion of two SPs and the colluding SPs observe different values for a data point, they select either of these values with equal probability. Thus, we expect that they damage approximately half of the fingerprinted data points (more if the collusion includes more than two SPs). Hence, we assume that there is an attack by a single SP if the similarity score of an SP is greater than 0.5. In such a case, Alice identifies the SP with the highest similarity score as guilty. Otherwise, there is a collusion attack with high probability, and hence Alice identifies the suspects according to their similarity scores and returns one of them utilizing the detection technique of Boneh-Shaw codes.

Alice generates a suspect list by including the SPs having the highest similarity scores. Hence, if the maximum similarity score is greater than 0.5, there will be just one SP in the suspect list and the algorithm will return it as guilty; in this case, Alice concludes that there is no collusion attack with high probability. If the maximum similarity score is less than or equal to 0.5, there will be more than one suspect, which means that a collusion attack has been performed with high probability. In this case, the algorithm returns one of the suspects using the detection method of Boneh-Shaw codes (Boneh and Shaw, 1998) as follows: In Boneh-Shaw codes, it is expected that the colluding SPs create a copy consisting of several random-valued blocks followed by all-ones blocks, and the position of the first all-ones block reveals one of the colluding SPs (ones in the Boneh-Shaw codes represent the fingerprinted data points and zeros represent the points that are not fingerprinted). In the original detection method, the recipient of the i-th codeword is identified as guilty if the first all-ones block is the i-th block. However, as a result of the probabilistic majority attack and flipping attack described in Section 5.1, some ones may turn into zeros and some zeros may turn into ones, so that detection method may fail. To avoid this, we instead define a majority-ones block as a block (of d data points, where d is the block size) in which the majority of points are ones, and a majority-zeros block as one in which the majority of points are zeros. The algorithm then checks all suspects in the suspect list, starting from the one with the highest similarity score. For each suspect that received the i-th codeword, the algorithm returns it as guilty if the i-th block is a majority-ones block and the (i-1)-th block is a majority-zeros block. When such an SP is found, the algorithm stops and returns it as guilty. If there is no such SP, the algorithm returns the SP with the highest similarity score as guilty.
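The hybrid detection logic can be sketched as follows. This is a simplified illustration with names of our own choosing: `leaked_bits` holds 1 where the leaked value matches a fingerprinted Boneh-Shaw position and 0 otherwise, `scores` holds the per-SP similarity scores, and `codeword_index` maps each SP to its (1-indexed) Boneh-Shaw codeword; the first-codeword edge case is omitted.

```python
def detect_guilty(scores, leaked_bits, codeword_index, d):
    """Hybrid detection sketch: a score above 0.5 implies a single-SP
    attack; otherwise scan SPs by decreasing similarity and look for a
    majority-zeros block followed by a majority-ones block at the
    boundary of the SP's codeword."""
    best = max(range(len(scores)), key=lambda s: scores[s])
    if scores[best] > 0.5:          # likely a single-SP attack
        return best
    blocks = [leaked_bits[b * d:(b + 1) * d]
              for b in range(len(leaked_bits) // d)]
    def majority_one(block):
        return sum(block) * 2 > len(block)
    for sp in sorted(range(len(scores)), key=lambda s: -scores[s]):
        i = codeword_index[sp]      # 1-indexed codeword of this SP
        if i >= 2 and majority_one(blocks[i - 1]) and not majority_one(blocks[i - 2]):
            return sp
    return best                     # fall back to the top similarity score
```

The majority test (rather than an exact all-ones test) is what tolerates the bit flips introduced by the probabilistic majority attack.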

The steps of the proposed detection algorithm are also shown with an example in Figure 4. After checking the similarity of the leaked data with the fingerprinted data points of each SP, the algorithm adds SPs to the suspect list. When the algorithm checks the two relevant blocks of the leaked data for the first suspect, it does not return that SP as guilty since both blocks are of the same type. Then, it checks the two relevant blocks for the next suspect and returns it as guilty since the later block is a majority-ones block and the preceding block is a majority-zeros block.

## 6. Using Fingerprinting for Privacy-Preserving Data Sharing

Here, we explore how the proposed fingerprinting technique can also provide privacy-preserving data sharing guarantees for the data owner. We develop a mechanism in which the added fingerprint (to provide liability) can also be used to provide privacy. When sharing data with untrusted parties, local differential privacy (LDP) is a commonly used notion for providing (statistical) privacy guarantees (Erlingsson et al., 2014; Wang et al., 2017). LDP-based mechanisms typically add controlled noise to data points to guarantee that an untrusted data collector cannot determine the original value of a data point from the reported (perturbed) value. Thus, one trivial idea is to use the noise pattern added by the LDP-based mechanism as the fingerprint. However, one major challenge with this idea is the conflicting objectives of fingerprinting and the privacy mechanism.

To show these conflicting objectives via an example, assume that a data owner shares her data with multiple SPs using an LDP-based mechanism by adding fresh noise to different data points (based on the LDP parameter) at each sharing. In that case, if two or more SPs collude, they can recover (infer the actual values of) most of the noisy data points by simply aggregating their received data and selecting the values observed by the majority of them (i.e., similar to performing a standard majority attack). Thus, such a sharing strategy is not preferred for the privacy objective. To overcome this problem, the data owner can choose to add noise to her data once and share the same noisy data with all the SPs. On the other hand, a fingerprinting mechanism requires the data owner to share different fingerprint patterns with the SPs to uniquely identify the source of an unauthorized sharing. This implies that the robustness of the fingerprinting scheme and the privacy of the shared data are inversely proportional.

In order to provide both privacy and liability at the same time, we propose a hybrid approach as follows: When Alice (the data owner) shares her data with the first SP using the proposed algorithm, she selects a parameter in the range (0, 1) to determine her privacy level in the fingerprinting scheme. She randomly selects this fraction of the fingerprinted points as the overlapping fingerprints. These fingerprinted points are shared identically with all the SPs. Thus, for each new data sharing, Alice first inserts the overlapping fingerprints and then applies the proposed fingerprinting algorithm for the remaining fingerprints. While higher values of this parameter provide higher privacy, lower values provide better fingerprint robustness. In Section 7.5, we show this trade-off between fingerprint robustness and privacy for different values of the parameter.
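The split between overlapping and unique fingerprints can be sketched as follows (a minimal sketch; the function name and signature are ours, not the paper's):

```python
import random

def split_fingerprints(fp_indices, privacy_level, rng=random):
    """Split SP1's fingerprinted positions into an overlapping set
    (identical in every shared copy; higher privacy) and a unique set
    (re-drawn per SP; better robustness). privacy_level is in (0, 1)."""
    k = round(privacy_level * len(fp_indices))
    overlap = set(rng.sample(fp_indices, k))
    unique = [i for i in fp_indices if i not in overlap]
    return overlap, unique
```

A higher privacy level moves more positions into the overlapping set, so colluding SPs learn less from aggregating their copies, at the cost of fewer per-SP unique fingerprints for identification.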

## 7. Evaluation

To evaluate their robustness and utility, we implemented the proposed fingerprinting algorithm (Section 5.2) and the detection algorithm (Section 5.3). Here, we present our experimental evaluation.

### 7.1. Data Model and Settings

We evaluated the proposed scheme on a genomic data sharing scenario. Due to the fast decrease in the cost of sequencing, individuals can nowadays obtain their genome sequences easily. This trend has also led individuals to share their genomic data with medical institutions and direct-to-consumer service providers for various genetic tests or research purposes. Since genomic data contains sensitive personal information, such as the risk of developing particular diseases, sharing genomic data without the authorization of the data owner causes privacy violations. Therefore, fingerprinting genomic data can be a solution or a disincentive to prevent its unauthorized sharing. Furthermore, genomic data contains inherent pairwise correlations between point mutations (single nucleotide polymorphisms, SNPs), which makes it an ideal use case to evaluate the proposed scheme. A SNP is the variation of a single nucleotide (from the set {A, C, G, T}) in the population. For each SNP position, only two different nucleotides can be observed: (i) the major allele, which is observed in the majority of the population, and (ii) the minor allele, which is observed rarely. Moreover, each SNP consists of two nucleotides, one inherited from the mother and the other from the father. Thus, each SNP is represented by the number of its minor alleles: 0, 1, or 2.

We used a dataset that consists of 7,690 SNPs belonging to 99 people of Central European ethnicity (1). Using this dataset, we computed the pairwise correlations between SNPs to build our correlation model. Unless stated otherwise, we set the data size to 1,000, the fingerprinting probability to 0.1, and the correlation threshold of the algorithm to 0.05. This threshold is used in the algorithm to prevent adding fingerprints that cause low correlation. Note that the attacker uses its own correlation threshold in the correlation attack; we evaluate its effect on utility and robustness in Section 7.3. We chose the data size as 1,000 to show the robustness of the proposed scheme for relatively small data. As we show via experiments later, robustness increases with increasing data size because, for the same fingerprinting probability, a larger data size yields more fingerprinted data points. As expected, the utility of the data owner (as introduced in Section 3.1) decreases linearly with increasing fingerprinting probability; we observed an average utility of approximately 0.9 when the fingerprinting probability is 0.1.

We expect the fingerprint to change the value of approximately 100 data points for each SP, and (as discussed in Section 5.2.2) we used approximately half of the fingerprinted data points for Boneh-Shaw codes. We set the number of Boneh-Shaw codewords and the block size such that roughly half of the fingerprinted data points of the first SP were used for Boneh-Shaw codes. Another design parameter in the algorithm is the adjustment parameter, which is used to dynamically adjust the fingerprinting probability to keep the number of fingerprinted data points close to the expected value; it can take any value in the range [0, 1). Although its selection does not have a significant effect on the average number of fingerprinted data points in each shared copy, it has a significant effect on the standard deviation. As shown in Table 2, increasing this parameter decreases the standard deviation, so the numbers of fingerprinted data points for different SPs become close to each other for higher values. Therefore, values close to 1 yield almost exactly the expected number of fingerprinted data points for each SP. However, for values close to 1, the fingerprinting probability becomes close to 0 for some blocks within the algorithm. This may allow an attacker to infer the values of the data points in these blocks. For instance, if the algorithm adds two fingerprints in the first block, the fingerprinting probability for the second block reduces to a value close to 0 for high values of the adjustment parameter. Therefore, an attacker who detects the two fingerprints in the first block (e.g., via a correlation or collusion attack) may conclude that all data values in the second block are original with very high probability. Hence, we chose an intermediate value for this parameter in our experiments, which provides a reasonable standard deviation, as shown in Table 2. We repeated all experiments 10 times for each individual in the dataset (99 individuals in total), and hence all results are given as the average of 990 executions.

| Adjustment parameter | 0 | 0.25 | 0.5 | 0.75 |
---|---|---|---|---|
Standard deviation | 10.17 | 4.20 | 2.31 | 1.68 |

### 7.2. Flipping and Subset Attacks

In this experiment, we compared the flipping and subset attacks in terms of their effect on fingerprint robustness. Also, to compare the similarity-based and probabilistic detection techniques (the basic building blocks of the proposed detection algorithm in Section 5.3), we implemented both for this experiment. We set the total number of SPs (that received the data owner's fingerprinted data) to 1,000. Figure 5 shows the accuracy of the similarity-based and probabilistic detection techniques for different values of the flipping probability (the probability of flipping a data point in the flipping attack) and the removal probability (the probability of removing a data point in the subset attack). From these results, we conclude that (i) similarity-based detection provides slightly better accuracy than probabilistic detection, (ii) the flipping attack is more powerful than the subset attack, and (iii) the attacker needs to flip at least half of the data points to avoid being detected, which decreases the utility of the attacker (defined in Section 3.2) to negative values. For the rest of the experiments, we use the detection algorithm described in Section 5.3, which is based on the similarity-based detection technique (shown to perform better than probabilistic detection).

To observe the effect of the total number of SPs on robustness, we varied the number of SPs that receive a fingerprinted copy from Alice and show the results in Table 3. We observed results similar to Figure 5 even when the data is shared with 10,000 SPs: if any of these SPs leaks its copy after flipping a large fraction of the data points, the similarity-based technique still detects the guilty SP with high accuracy. As we also show in Figure 5, a malicious SP obtains almost zero utility by flipping half of the data points. We computed the utility of the attacker as described in Section 3.2 by assigning equal utility to each data point. Hence, an attacker would need to share data with negative utility after a subset or flipping attack to avoid being detected.

| Number of SPs | 10 | 100 | 1,000 | 10,000 |
---|---|---|---|---|
| | 0.52 | 0.50 | 0.46 | 0.44 |
| | 0.57 | 0.54 | 0.51 | 0.49 |

### 7.3. Correlation Attack

We implemented the correlation attack described in Section 3.2.3 to evaluate the robustness of the proposed scheme. We set the total number of SPs (that received the data owner's fingerprinted data) to 1,000. As before, we set the correlation threshold of the algorithm (decided by the data owner) to 0.05. Therefore, the fingerprinted copies did not include consecutive pairs of data points whose correlation is less than 0.05. Note, however, that the correlation threshold of the attack is determined by the attacker. We also implemented the naive fingerprinting scheme described in Section 4, in which each data point is fingerprinted with the same probability but without considering correlations. As described in Section 3.2.3, in the correlation attack, data points whose correlation with the previous data point is less than the attacker's correlation threshold are flipped, and the remaining data points are flipped with a fixed probability. Figure 6 shows the comparison of the proposed scheme with the naive approach for different values of the attacker's correlation threshold. The proposed scheme provides high detection accuracy for small values of the attacker's threshold; the accuracy decreases as the threshold grows, but, as also shown in Figure 6, the utility of the attacker drops sharply at the same time. The utility of the attacker is almost constant when the attacker's threshold is below the correlation threshold of the algorithm (0.05), because the proposed algorithm guarantees that there are no consecutive pairs of data points whose correlation is less than this threshold. We also observed that the naive approach is not robust against correlation attacks: the attacker can easily avoid detection by utilizing the correlations in the data. This clearly shows the importance of considering correlations in the data within the fingerprinting algorithm.
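For concreteness, the attacker's strategy described above can be sketched as follows; this is our own illustration, in which `corr` and `flip` are hypothetical helpers standing in for the attacker's pairwise correlation model and its value-flipping rule:

```python
import random

def correlation_attack(copy, corr, attack_threshold, flip_prob, flip, rng=random):
    """Correlation attack (sketch): flip every data point whose
    correlation with the previous point falls below the attacker's
    threshold; flip the remaining points with probability flip_prob."""
    leaked = [copy[0]]
    for i in range(1, len(copy)):
        if corr(copy[i - 1], copy[i]) < attack_threshold or rng.random() < flip_prob:
            leaked.append(flip(copy[i]))
        else:
            leaked.append(copy[i])
    return leaked
```

Raising the attacker's threshold flips more low-correlation points (more likely to be fingerprints), which hurts detection accuracy but also destroys more genuine data and hence the attacker's utility.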

Furthermore, we evaluated the robustness against the correlation attack for different values of the fingerprinting probability. As shown in Table 4, robustness increases as the fingerprinting probability increases, because the number of fingerprinted data points is directly proportional to it. We also obtained results similar to Table 3 for the effect of the total number of SPs on robustness.

| Fingerprinting probability | 0.02 | 0.04 | 0.06 | 0.08 | 0.1 | 0.12 | 0.14 | 0.16 | 0.18 |
---|---|---|---|---|---|---|---|---|---|
| Detection accuracy | 0.24 | 0.6 | 0.88 | 0.94 | 0.98 | 0.99 | 0.99 | 1 | 1 |

### 7.4. Collusion Attack

As mentioned in Section 3.2.4, in a standard collusion attack, the colluding SPs select (and share) the most commonly observed value for each data point. In this attack, although the utility of the data leaked by the colluding SPs is high, detection techniques can identify the colluding SPs with high probability. Hence, we defined the probabilistic majority attack in Section 5.1, which also includes the correlation and flipping attacks; with this attack, the colluding SPs decrease the accuracy of the detection algorithm at the cost of reducing the utility of the leaked data. In this section, we first compare the utility and robustness under these two attacks. We set the Boneh-Shaw parameters such that we can create 10 different Boneh-Shaw codewords. Note that Alice can share her data with more than 10 SPs in the proposed scheme by repeating codewords, as discussed before. We set the total number of SPs (that received the data owner's data) to 10 and fixed the fingerprinting and flipping probabilities. We quantified both utility and fingerprint robustness for different numbers of colluding SPs. As shown in Figure 7, the probabilistic majority attack decreases the colluding SPs' probability of being detected by reducing the data utility. Since the probabilistic majority attack is more powerful for the colluding SPs than the standard one, we use the probabilistic majority attack in the rest of the experiments.

As mentioned, Boneh-Shaw codes are not designed to be robust against majority attacks: they assume the colluding SPs decide the value of a data point randomly if they observe more than one value in their copies. Hence, Boneh-Shaw codes do not provide guarantees against the proposed probabilistic majority attack (which includes flipping). Furthermore, to obtain the robustness guarantees of Boneh-Shaw codes, the number of fingerprinted points needs to be high. For instance, to create Boneh-Shaw codes with a provable robustness guarantee in our setting, the block size needs to be greater than 1,000 and more than 10,000 fingerprinted data points are needed. Fingerprinting such a large number of data points also decreases the utility of the data owner significantly. Therefore, direct use of Boneh-Shaw codes does not provide guarantees, especially for small data sizes (we only use approximately 100 data points for fingerprinting in our experiments). To show these shortcomings, we implemented Boneh-Shaw codes as a standalone fingerprinting scheme and observed its robustness against the probabilistic majority attack using the same parameters as the previous experiment (Figure 7). We observed low detection accuracy even when the number of colluding SPs was only 2 (for the same scenario, the accuracy of the proposed scheme is substantially higher, as shown in Figure 7). Therefore, we conclude that using Boneh-Shaw codes as a standalone fingerprinting scheme does not provide robustness against the probabilistic majority attack (which includes the correlation and flipping attacks). While our proposed scheme utilizes Boneh-Shaw codes to increase its robustness against collusion attacks, Algorithm 2 generates unique fingerprints to provide robustness against correlation and flipping attacks.

We also evaluated the robustness of the proposed scheme by increasing the number of SPs receiving fingerprinted copies of Alice's (the data owner's) data. Due to the size of the data, we selected the Boneh-Shaw parameters accordingly in these experiments, which determine the number of distinct Boneh-Shaw codewords we can create. In this scenario, if Alice wants to share her data with more SPs than there are codewords, she assigns the same codewords to the additional SPs. Therefore, the Boneh-Shaw codewords of SPs whose indices are equivalent modulo the number of codewords are the same, while their remaining unique fingerprints are generated by Algorithm 2. We show the robustness of the proposed scheme for different numbers of SPs in Table 5. We observed that the detection accuracy of Alice decreases as she shares her data with more SPs. For instance, increasing the number of SPs from 50 to 100 decreased the accuracy from 0.824 to 0.759 in our experiments when the number of colluding SPs was 3. Note that a fixed flipping probability was used in these experiments.

| Number of SPs | 50 | 100 | 150 | 200 | 250 |
|---|---|---|---|---|---|
| | 0.995 | 0.992 | 0.99 | 0.989 | 0.981 |
| (3 colluding SPs) | 0.824 | 0.759 | 0.69 | 0.667 | 0.614 |
| | 0.606 | 0.504 | 0.411 | 0.344 | 0.343 |

Next, we evaluated the effect of the flipping probability () on fingerprint robustness during a collusion attack. Our results are shown in Figure 8. Increasing reduces both the robustness and the utility considerably. When there is no flipping () in the probabilistic majority attack, Alice detects one of the colluding SPs among the 100 SPs (with which she shared her data) with accuracy. The accuracy becomes less than when is set to , while the utility of the attackers also decreases from 0.87 to 0.75.

Another important parameter for the fingerprinting scheme is the data size. In our experiments we used data with 1,000 SNPs (). Therefore, we changed the state of only approximately 100 data points as the fingerprint when the fingerprinting probability was set to 0.1. Keeping the same fingerprinting probability, increasing the data size allows more data points to be changed as part of the fingerprint. With more fingerprinted data points, the data owner can detect the colluding SPs with higher accuracy. To show the effect of data size (i.e., ) on robustness, we conducted experiments with increasing data sizes (we kept and increased proportionally to the data size). As shown in Table 6, Alice detects one of 3 colluding SPs among 100 SPs with accuracy when and . Thus, we observed that the robustness of the proposed scheme improves significantly with increasing data size.

| Data size | 1,000 | 2,000 | 3,000 | 4,000 | 5,000 |
|---|---|---|---|---|---|
| Detection accuracy | 0.759 | 0.914 | 0.973 | 0.993 | 0.997 |

### 7.5. Privacy-Preserving Fingerprinting

As we explained in Section 6, both the proposed scheme and LDP-based mechanisms change the values of some data points before sharing. However, while the goal of fingerprinting schemes is to provide robustness, LDP-based mechanisms aim to guarantee the privacy of individuals. To show the trade-off between fingerprint robustness and privacy, we first implemented the randomized response (RR) mechanism (Warner, 1965), which satisfies -LDP when the state of each data point is correctly shared with probability for genomic data with three possible states, and each of the two incorrect values is shared with probability . To compare LDP with the proposed scheme, we set , for which RR satisfies LDP with . As the privacy metric, we used the average estimation error, which is commonly used to quantify genomic privacy (Wagner, 2017). This metric quantifies the average distance of the copy created by the colluding SPs () from the original data () as .
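A minimal sketch of k-ary randomized response and an average-estimation-error metric, assuming the three genomic states are encoded as 0, 1, 2 and taking absolute difference as the per-point distance (the paper's exact distance function is not reproduced here, and the function names are illustrative):

```python
import math
import random

def randomized_response(data, epsilon, values=(0, 1, 2), rng=None):
    """k-ary randomized response: keep the true state with probability
    e^eps / (e^eps + k - 1); otherwise report one of the other states
    uniformly. Satisfies epsilon-LDP for each data point."""
    rng = rng or random.Random(0)
    k = len(values)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    out = []
    for x in data:
        if rng.random() < p_true:
            out.append(x)
        else:
            out.append(rng.choice([v for v in values if v != x]))
    return out

def estimation_error(original, shared):
    """Average estimation error: mean per-point distance between the
    original data and the (possibly merged) shared copy."""
    return sum(abs(a - b) for a, b in zip(original, shared)) / len(original)
```

Note that to satisfy LDP across multiple recipients, every SP must receive the same randomized copy, which is exactly why RR cannot distinguish a guilty SP.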

We set the number of SPs (that received the data) to and performed the probabilistic majority attack described in Section 5.1 with colluding malicious SPs, setting both the flipping probability () and the correlation threshold of the colluding SPs () to . Although the RR mechanism is similar to the naive probabilistic scheme described in Section 4, all SPs receive the same copy in RR to guarantee LDP (as discussed in Section 6). Since fingerprint robustness is defined as detecting one colluding SP (Section 3.3), and all SPs have the same copy, we observed a detection accuracy of () for the RR mechanism, which is equivalent to randomly accusing any SP that received the data. We also observed as when the probabilistic majority attack was performed against the RR mechanism.

Next, we also implemented the hybrid scheme described in Section 6. In this scheme, the parameter determines the amount of overlapping fingerprints that are included in the copy of each SP. Hence, the proposed fingerprinting scheme (in Section 5.2.2) is equivalent to the hybrid approach when . Figure 9 shows both the fingerprint robustness () and the privacy () provided by the hybrid scheme for different values of . When is set to , we observed robustness similar to that of the RR mechanism. However, the hybrid scheme provides a slightly higher error (better privacy) than the RR mechanism, since the correlations in the data are considered in the hybrid scheme, as opposed to the RR mechanism. Privacy improves as fingerprint robustness decreases, and when increases, the loss in robustness is significantly higher than the gain in privacy. We observed that for values of below , the proposed hybrid scheme provides both high fingerprint robustness and reasonable privacy.
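One way such a hybrid scheme could be sketched: each fingerprinted position takes a value shared by all SPs with some probability (here called `gamma`, an assumed name for the overlap parameter), and an SP-specific value otherwise. The function name, parameters, and sampling details below are illustrative assumptions, not the exact construction of Section 6:

```python
import random

def hybrid_copy(data, gamma, sp_seed, shared_seed, fp_prob=0.1, values=(0, 1, 2)):
    """Illustrative hybrid sharing: each data point is fingerprinted with
    probability fp_prob; a fingerprinted point takes a value common to all
    SPs (drawn from shared_seed) with probability gamma, and an SP-specific
    value (drawn from sp_seed) otherwise. gamma = 0 recovers purely unique
    fingerprints; larger gamma trades robustness for privacy."""
    shared_rng = random.Random(shared_seed)
    sp_rng = random.Random(sp_seed)
    copy = []
    for x in data:
        # Shared randomness is consumed identically for every SP so that
        # overlapping positions and values agree across all copies.
        is_fp = shared_rng.random() < fp_prob
        use_shared = shared_rng.random() < gamma
        shared_val = shared_rng.choice([v for v in values if v != x])
        if not is_fp:
            copy.append(x)
        elif use_shared:
            copy.append(shared_val)
        else:
            copy.append(sp_rng.choice([v for v in values if v != x]))
    return copy
```

At `gamma = 1` every SP receives an identical copy (RR-like, no accountability); at `gamma = 0` all fingerprinted positions are SP-specific.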

## 8. Discussion

In this section, based on our experimental results, we further discuss considering privacy and liability together, the practicality of the proposed scheme, and the application of the proposed scheme to different domains.

### 8.1. Fingerprint Robustness and Privacy

When individuals share their personal data with SPs, they want to protect their privacy as well as identify the source of a potential data leakage in case of illegal distribution of their data. As shown in Section 7.5, techniques that protect the privacy of individuals (such as LDP-based mechanisms) do not let individuals detect the source of unauthorized sharing. Similarly, fingerprinting techniques that provide liability for unauthorized sharing of personal data do not consider the privacy of the individuals. In this work, for the first time, we studied liability and privacy together and proposed a hybrid approach that provides privacy and fingerprint robustness at the same time. Although it is not possible to achieve the highest privacy and the highest fingerprint robustness simultaneously (due to their conflicting objectives), we showed that it is possible to design a scheme that provides reasonable privacy and fingerprint robustness at the same time. We anticipate that this work will pave the way towards a new research direction on achieving privacy and liability using a single mechanism. We plan to study this problem extensively in the future and develop novel mechanisms that provide formal privacy and robustness guarantees.

### 8.2. Complexity and Practicality

In the proposed fingerprinting algorithm (Algorithm 2), for each data point the algorithm sequentially decides on a probability for each possible value in the set and inserts the fingerprints accordingly. Hence, the complexity of the fingerprinting algorithm is . To detect the guilty SP in case of a data leakage, the data owner needs to compare all fingerprint patterns (given to all SPs) with the leaked data. Since the expected number of fingerprinted data points in each fingerprinted copy is , the complexity of the detection algorithm is , where is the number of SPs that received a fingerprinted copy. Note that this is also the storage complexity for Alice if she stores all the fingerprint patterns. As mentioned before, if Alice does not want to store all fingerprint patterns, she can instead store the seed value for each SP and check the similarity of the fingerprint patterns by running the fingerprinting algorithm again in case of a data leakage. Note that this slightly increases the complexity of the detection algorithm, since it requires Alice to run the fingerprinting algorithm along with the detection algorithm. Thus, we conclude that the running times of both the fingerprinting and the detection algorithms grow linearly with the design parameters.
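The seed-based variant of detection can be sketched as follows. `fingerprint` below is a simplified stand-in for Algorithm 2 (it ignores correlations and Boneh-Shaw blocks), and all names are hypothetical; the point is that a per-SP seed makes every copy exactly regenerable, so Alice need not store the patterns themselves:

```python
import random

def fingerprint(data, seed, fp_prob=0.1, values=(0, 1, 2)):
    """Deterministically regenerable fingerprinting sketch: each point is
    replaced by a different value with probability fp_prob, with all
    randomness derived from the stored per-SP seed."""
    rng = random.Random(seed)
    return [rng.choice([v for v in values if v != x])
            if rng.random() < fp_prob else x
            for x in data]

def detect(leaked, data, sp_seeds, fp_prob=0.1):
    """Accuse the SP whose regenerated copy is most similar to the leak.
    Each similarity check re-runs the fingerprinting algorithm, which is
    the extra cost of storing seeds instead of full patterns."""
    def similarity(seed):
        copy = fingerprint(data, seed, fp_prob)
        return sum(a == b for a, b in zip(copy, leaked))
    return max(sp_seeds, key=similarity)
```

Both functions run in time linear in the data length per SP, matching the linear growth noted above.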

To show the practicality of the proposed scheme, we measured its running time using a computer with a 1.8 GHz Dual-Core Intel Core i5 processor and 8 GB of memory. Based on our experiments, the average running time of the fingerprinting algorithm to create one fingerprinted copy was ms and the average running time of the detection algorithm was ms when , , , and . These results also show the efficiency and practicality of the proposed scheme.

### 8.3. Application of the Proposed Scheme to Other Domains

In the evaluations, we implemented the proposed fingerprinting scheme for genomic data sharing. However, the proposed scheme can also be used for sharing other types of personal data, such as location data. Here, we briefly discuss the differences between applying the proposed algorithm to genomic data and to location data. For other data types, the main difference from genomic data is the set of possible values for data points (). For genomic data, contains three values for all data points. For location data, however, can be the set of points of interest (POIs) at which Alice can be located at a specific time; hence, the set may be different for each data point. The size of for location data will also be larger, and hence the running time of the fingerprinting algorithm will be higher. In terms of robustness, we expect an improvement for the location data application due to the larger number of possible values for each data point. In genomic data, when a data point is fingerprinted, its value must be one of the two remaining values, and hence providing the same fingerprint for a data point to multiple SPs is very likely. However, when Algorithm 2 adds a fingerprint to a location data point for two different SPs, the probability of adding the same fingerprint is lower (due to the size of ), which improves the uniqueness of the fingerprints. As discussed before, uniqueness of the fingerprints increases fingerprint robustness.
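The effect of the size of the value set on fingerprint uniqueness can be illustrated empirically. The helper names below are hypothetical; the sketch simply measures how often two independently fingerprinted SPs receive the same value at one data point, which is about 1/(k-1) for a set of k possible values:

```python
import random

def fingerprint_point(true_value, candidates, rng):
    """Pick a fingerprint value from the point's own candidate set.
    For genomic data the set has 3 states; for location data it is the
    (typically larger, per-point) set of plausible POIs."""
    return rng.choice([v for v in candidates if v != true_value])

def collision_rate(set_size, trials=10000, seed=0):
    """Empirical probability that two independently fingerprinted SPs
    receive the same value at a single data point."""
    rng = random.Random(seed)
    candidates = list(range(set_size))
    hits = sum(fingerprint_point(0, candidates, rng) ==
               fingerprint_point(0, candidates, rng)
               for _ in range(trials))
    return hits / trials
```

For three genomic states the collision rate is about 0.5, whereas a larger POI set drives it down, making fingerprints more unique and hence more robust.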

## 9. Conclusion

We have proposed a probabilistic fingerprinting scheme that considers the correlations in the data during fingerprint insertion. First, we have shown how to assign probabilities for the sharing decision of each data point that are consistent with the inherent correlations in the data. Then, we have described the integration of Boneh-Shaw codes into the proposed algorithm to improve fingerprint robustness against collusion attacks. We have also proposed a detection algorithm that initially selects suspects based on similarity scores and then decides on the guilty SP using the detection technique of Boneh-Shaw codes. Furthermore, to provide privacy along with fingerprint robustness, we have proposed a hybrid approach that controls the trade-off between privacy and fingerprint robustness. Our experimental results on genomic data show that the proposed fingerprinting scheme is robust against a wide range of attacks and that it can also provide reasonable privacy by slightly degrading fingerprint robustness. The proposed scheme is a first step toward sharing personal data with service providers while providing both liability and privacy guarantees at the same time.

## References

- [1] (2020) Note: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html [Online; accessed 10-January-2020]. Cited by: §7.1.
- Natural language watermarking: design, analysis, and a proof-of-concept implementation. Proceedings of 4th International Workshop on Information Hiding. Cited by: §2.
- Robust optimization-based watermarking scheme for sequential data. In 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), pp. 323–336. Cited by: §2.
- Robust audio watermarking in the time domain. IEEE Transactions on Multimedia 3 (2), pp. 232–241. Cited by: §2.
- Collusion-secure fingerprinting for digital data. IEEE Transactions on Information Theory 44 (5), pp. 1897–1905. Cited by: §1, §2, §3.3, §5.2.1, §5.2.2, §5.2, §5.3, §5.
- Copyright protection for the electronic distribution of text documents. Proceedings of the IEEE 87 (7), pp. 1181–1196. Cited by: §2.
- Finding optimal least-significant-bit substitution in image hiding by dynamic programming strategy. Pattern Recognition 36 (7), pp. 1583 – 1595. Cited by: §2.
- Digital watermarking for copyright protection of mpeg2 compressed video. IEEE Transactions on Consumer Electronics 44 (3), pp. 895–901. Cited by: §2.
- Digital watermarking. Springer. Cited by: §2.
- Rappor: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security.
