Record linkage has become an essential component of cross-organizational and cross-domain data analytics applications. Example applications range from healthcare (such as health research or personalized healthcare), business (such as targeted marketing and recommendation), and government services, to national security applications such as crime and fraud detection.
Due to the absence of unique entity identifiers in different databases held by different organizations (parties), linking data from different databases that correspond to the same individual needs to be conducted using the commonly available person-specific identifiers (e.g. name, address, age, and gender). However, such person-specific identifiers contain personally identifiable information (PII) about the entities, and can therefore be used to re-identify and infer information about the entities when shared across organizations. Linking data with privacy constraints has received much attention in the literature over the last two decades. A large body of work has been done to develop privacy-preserving record linkage (PPRL) techniques, using a variety of privacy-enhancing or privacy-preserving technologies, such as cryptographic techniques and/or probabilistic techniques including Bloom Filter (BF) encoding combined with differential privacy (DP) [Ala12, Sch16, Xue20]. Probabilistic encoding techniques, such as Bloom filter encoding, are computationally efficient for fuzzy linking of large-scale data, and are therefore highly suitable for Big Data applications [Vat17b].
Linkage is generally a classification problem that aims to classify pairs of records into the classes of ‘matches’ and ‘non-matches’. Since the number of record pairs that need to be compared for the classification task becomes quadratic in the size of the databases, the records are first binned into blocks such that highly similar records are grouped together, and then records are compared with only the records in the same bin, reducing the computational complexity from quadratic to sub-quadratic. The bins of encoded records are sent to the server (third-party/linkage unit) for conducting the linkage using a classifier.
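The binning step described above can be sketched as follows; the record layout and the blocking key are illustrative assumptions, not the paper's actual protocol. Records sharing a blocking key land in the same bin, so only same-bin pairs are compared instead of the full cross product:

```python
# Sketch of blocking for record linkage (hypothetical record layout):
# records are grouped by a blocking key, and only records in matching
# bins are compared, reducing the quadratic comparison space.
from collections import defaultdict

def block(records, key_fn):
    """Group records into bins by a blocking key."""
    bins = defaultdict(list)
    for rec in records:
        bins[key_fn(rec)].append(rec)
    return bins

def candidate_pairs(bins_a, bins_b):
    """Blocking strategy: bin k of A is compared only with bin k of B."""
    pairs = []
    for key, recs_a in bins_a.items():
        for ra in recs_a:
            for rb in bins_b.get(key, []):
                pairs.append((ra, rb))
    return pairs

a = [("alice", "smith"), ("bob", "jones")]
b = [("alicia", "smith"), ("rob", "jones"), ("eve", "brown")]
key = lambda r: r[1][0]  # toy blocking key: first letter of surname
pairs = candidate_pairs(block(a, key), block(b, key))
# 2 candidate pairs instead of the 6 pairs of the full cross product
```

With 2 x 3 records the full cross product has 6 pairs; blocking on the toy key leaves only 2 candidate pairs.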
The frequency distribution of encoded records in the bins could reveal information about the bins through a frequency inference attack [Vat14]. This has been addressed by several works in the literature, ranging from non-provable privacy guarantees, such as k-anonymous grouping [Vat13b, Vat13c, Kar12b, Ran15], to provable privacy guarantees, such as differential privacy [Cao15, Ina08, Ina10, Kuz13, He17]. The standard DP notion for PPRL incurs computational cost in terms of additional record pair comparisons, and it does not consider bias in the data, especially fairness-bias. In this work, we consider record linkage with not only privacy constraints, but also with cost and fairness constraints for practical PPRL applications.
Definition I.1 (Fairness constraint)
Fairness of linkage with regard to a protected sensitive feature that has r sensitive groups (for example, gender with the groups 'male' and 'female') determines how much the linkage classifier deviates from producing linkage decisions with equal probability for individuals across the different protected groups, for example equal true match rates (true positive rates/TPRs) for the female and male groups.
Definition I.2 (Cost constraint)
Assume encoded records from two databases D_A and D_B are grouped into bins using a blocking protocol B and blocking strategy s. The set of records of D_A blocked into bin i is represented as B_i(D_A). The blocking strategy s specifies that the records in B_i(D_A) are compared with the records in B_j(D_B) for each pair (i, j) in s. The computational cost of linkage is the total number of record pair comparisons, Σ_{(i,j) in s} |B_i(D_A)| · |B_j(D_B)|. With perturbed bins that include η dummy or noisy records (where η is calculated depending on the privacy budget ε for DP) in addition to the original encoded records from D_A or D_B, the computational cost increases to Σ_{(i,j) in s} (|B_i(D_A)| + η_i) · (|B_j(D_B)| + η_j).
Developing classifiers that are fair with respect to a protected/sensitive feature [Zaf15], such as gender or race, is an important problem for machine learning applications in general and for PPRL specifically. This is to avoid significant bias being introduced towards certain groups of individuals, for example against black people in fraud and crime detection systems [Flo16, Lar16] or online recommendation systems [Swe13], and against women in job recommendation systems [Dat15].
Fairness-bias in data imposes different levels of challenges on classifying record pairs into matches and non-matches for record pairs belonging to different groups. For example, let's assume that several identifiers (e.g. last name and address) exhibit more variance in female records in different databases than in their male counterparts due to marriage and/or separation, which causes record pairs belonging to the female cohort to be more difficult to classify as 'matches' than those of the male cohort. In addition, supervised machine learning classifiers can learn to ignore poor performance on a small (minority) group if they can exploit knowledge about the majority population, potentially leading to unfair outcomes. Without careful treatment, a classifier may inadvertently be biased towards the cohort that is easier to classify.
Achieving fair linkage across different groups is a difficult yet important challenge. Addressing privacy constraints in addition to fairness constraints for fair PPRL introduces additional challenges in terms of balancing all three key factors: privacy, fairness, and cost. Notably, privacy, fairness, and cost are not independent of each other. Existing works have discussed the trade-offs between privacy and communication and computational cost [He17], and between fairness and privacy [Vat13c]. With fairness constraints considered, the trade-off between privacy, fairness, and cost needs to be addressed and balanced.
In this paper, we study how to address fairness and cost constraints in PPRL using fairness and cost-constrained differential privacy (DP) algorithms. We first define two new notions of DP constrained on fairness only and constrained on cost and fairness for PPRL. We propose a PPRL framework that satisfies fairness and cost constrained DP. We provide formal proofs for the two new notions of DP for PPRL and empirically study and analyse the different constraints for the PPRL problem using the proposed framework. We conducted experiments on two person-specific datasets that validate the fairness-bias in the original PPRL algorithm with the standard DP notion and the improvement in the fairness and computational cost using our proposed framework with the two new notions of DP. To the best of our knowledge, this is the first work that addresses fairness and cost constraints in DP for PPRL.
We first demonstrate that the current (standard) DP notion for PPRL is unfair and biased towards minority groups of individuals.
We then propose two new notions of DP with constraints of privacy, fairness, and cost. Specifically, we formalize the two notions of fairness-constrained DP and cost-constrained fairness-aware DP for the PPRL problem.
We introduce a framework enabling PPRL with fairness and cost constrained DP. Specifically, we introduce two methods that add noise to the blocks of (Bloom filter) encoded records, 1) adhering to fairness-constrained DP and 2) adhering to fairness- and cost-constrained DP.
We provide formal proofs for the two new notions of DP guarantees for PPRL and show that the two PPRL mechanisms that follow these two notions provide differential privacy guarantees constrained to the cost and fairness of linkage.
We conduct experiments on two sources of datasets, real and synthetic North Carolina Voter Registration datasets and synthetic Australian Bureau of Statistics datasets, and evaluate the record linkage performance in terms of linkage accuracy, fairness metrics, computational cost, and privacy budget. We show that our proposed methods outperform the existing and baseline methods in terms of fairness and cost.
Outline: The rest of the paper is organized as follows: We review related work in PPRL and fairness in the following section. We then provide preliminaries of the PPRL problem and demonstrate the fairness issues in PPRL through experimental results to motivate the problem in Section III. Next, we formalize feature-level DP for PPRL in Section IV and new notions of fairness and cost constrained DP for PPRL in Section V. In Section VI, we present our framework for PPRL based on Bloom filter encoding combined with differential private blocking methods following the new DP definitions. We present and discuss the experimental results of our algorithms in Section VII. Finally, we provide the takeaway messages from this work and discuss some open questions and future research directions in Section VIII.
II Related work
A long line of research has been conducted in privacy-preserving record linkage (PPRL) [Vat13] and the sub-problem of PPRL, which is private blocking to reduce the computation complexity of PPRL [Vat17b]. Only limited work has provided formal differential privacy (DP) guarantees [Kuz13, Ina08, Ina10, He17].
For Bloom filter-based encoding used in PPRL, a few studies have provided DP guarantees for the Bloom filter encodings. Blip is a method that flips each bit in the Bloom filter with a probability that depends on the privacy budget ε and the maximum number k of tokens hash-mapped into the Bloom filters, to achieve ε-differential privacy [Ala12].
Another method uses a flipping probability f to flip the bits in the Bloom filter to meet ε-differential privacy guarantees [Sch16]. In contrast to Blip, this method uses the parameter f to control how many bits are flipped (i.e. the privacy-utility trade-off) depending on the privacy budget ε. The perturbed value of the i-th bit b_i of a Bloom filter is 1 with probability f/2, 0 with probability f/2, and b_i (unchanged) with probability 1 − f. This gives ε = 2k ln((2 − f)/f), where k is the number of hash functions used in the Bloom filter.
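The randomized-response perturbation described above can be sketched as follows; the function name, Bloom filter length, and flipping probability are illustrative assumptions:

```python
# Sketch of randomized-response perturbation of a Bloom filter:
# each bit is replaced by a fresh random bit with probability f
# (so it becomes 1 w.p. f/2 and 0 w.p. f/2) and kept w.p. 1 - f.
import random

def perturb_bloom_filter(bf, f, rng=random.Random(42)):
    """Return a perturbed copy of the Bloom filter bf (list of 0/1)."""
    return [rng.randint(0, 1) if rng.random() < f else b for b in bf]

bf = [1, 0, 1, 1, 0, 0, 1, 0] * 32  # toy 256-bit Bloom filter
noisy = perturb_bloom_filter(bf, f=0.1)
```

With f = 0 the Bloom filter is returned unchanged; larger f gives stronger privacy (smaller ε) at the cost of linkage utility.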
A recent work proposed to generate a thresholded Laplace distribution in order to calculate the number of 1s in a noise vector, where 1-bits in the noise vector denote that the corresponding bits in the Bloom filter need to be flipped [Xue20].
A common solution to address the computational complexity of PPRL (and record linkage in general) is blocking, where the records are pre-assigned into similarity groups/bins and the comparisons are then limited to only those records that are within the same bins. Bins of records can be susceptible to frequency inference attacks, where the frequency distribution of blocks is compared with a known frequency distribution of external values (for example, zipcodes, if blocking is performed using zipcodes). Most existing private blocking techniques have addressed the privacy leakages using non-provable techniques, such as k-anonymity [Vat13b, Vat13c, Kar12b, Ran15], pruning rare bins [Kuz13], or locality sensitive hashing [Kara14]. Only a few studies have addressed private blocking with differential privacy guarantees [Cao15, Ina08, Ina10, Kuz13, He17].
Differential privacy has been used to add noise into the blocks generated using hierarchical clustering [Kuz13]. However, a recent study [Cao15] shows that even with DP guarantees, these private blocking techniques can reveal some private information learned from the final output of PPRL. [He17] proposed end-to-end DP guarantees for PPRL by introducing the output-constrained DP notion. In this work, DP noise is added to bins of encoded records such that the disclosure of the true matching records is insensitive to the presence or absence of a single non-matching (noisy/dummy) record. However, no work has so far studied DP constrained to fairness and/or cost. Moreover, fairness in record linkage is also an immature research topic, with only one recent work on fairness-aware PPRL [Vat20a].
There have been several algorithms and techniques proposed in the machine learning literature to improve fairness or mitigate bias in classification problems [Meh21]. These are broadly categorized into: pre-processing, in-processing (i.e. at training time), and post-processing. The aim of pre-processing is to learn a new representation of the data that removes the information correlated with the sensitive attribute while preserving as much of the remaining information as possible [Zem2013, Fel15, Kra18]. The classifier can thus use the new data representation and produce results that preserve fairness. Any classifier can be supported and no re-training is required with this category of methods.
In-processing techniques add a constraint or a regularization term to the objective functions of classifiers [Aga18, Goe18, Hua19, Cel19]. Post-processing methods attempt to modify a learned classifier in a way that satisfies fairness constraints [Ple17, Woo17, Dwor18]. In this work, we use pre-processing techniques to add fairness and/or cost-constrained DP noise to input data (grouped into bins) to achieve fair and cost effective PPRL.
Fig. 1: Similarity score distributions of true matches and true non-matches for the men and women groups (right), and comparison of linkage quality (in terms of precision and recall) for the men and women groups and overall (left), on the Australian Bureau of Statistics (ABS) dataset (as used in our experiments in Section VII) using standard differentially private blocking and logistic regression-based PPRL.
III Problem Motivation
We first define the PPRL problem and the general differential privacy notion for the PPRL problem. We then discuss the limitations of the existing differentially private algorithms for PPRL in terms of fairness constraints through an experimental study on the North Carolina Voter Registration (NCVR) dataset.
Definition III.1 (PPRL)
Assume p database owners (or parties) P_1, P_2, …, P_p with their respective databases D_1, D_2, …, D_p (containing sensitive or confidential person-specific data). PPRL links these databases to identify whether a record r_i in dataset D_a matches a record r_j in dataset D_b, i.e. whether the two records refer to the same real-world entity, where 1 ≤ a < b ≤ p. PPRL applies a classification function C(·) on the encoded records from the parties that takes as input the similarity scores or distances between the encoded quasi-identifying (QID) attributes of records, i.e. C(sim(r_i, r_j)), where sim(·, ·) is a similarity function that returns the overall similarity between two records r_i and r_j.
Without loss of generality, we assume p = 2 in the rest of this paper, denote the two databases as D_A and D_B, and name the two parties 'Alice' and 'Bob'. We assume a semi-honest (honest-but-curious) Linkage Unit (LU) is available to conduct the linkage on the encoded records sent by the parties, which is a commonly used linkage model in many real applications [Ran13]. We also assume a set of QID attributes (e.g. name, address, and date of birth), which will be used for the linkage, is common to all these databases.
Fairness of the PPRL classifier measures the classification model’s behavior towards different individuals grouped by a particular protected or sensitive feature [Bin18]. The protected feature could either be part of the QIDs used to link records or not. For example, let’s assume “gender” is a protected feature dividing a dataset into two groups: male and female. Fairness of a PPRL classifier on this dataset would define whether the model treats both the male and female user groups equally in terms of correct predictions of record pairs belonging to the different groups as ‘matches’ without giving benefit to one group more than the other.
PPRL can result in biased predictions for different groups based on the protected feature. For example, with gender as the protected feature, female record pairs might have poorer linkage accuracy than male record pairs due to the different levels of challenges involved in the linkage. The female group of individuals is more likely to change their last name or address than the male group due to marriage and/or separation. Additionally, if the classifier is trained on a protected feature-imbalanced dataset, then the predictions could be biased towards the minority group. These challenges impose fairness-bias in PPRL classifiers.
Fig. 1 illustrates the fairness-bias on a synthetic Australian Bureau of Statistics (ABS) dataset used for linkage experiments (available from https://github.com/cleanzr/dblink-experiments/tree/master/data) using standard differential privacy-based private blocking and logistic regression PPRL classification [Kuz13, He17]. As can be seen, when the data is biased towards a certain group (women/female in our experiments) in terms of errors and variations in the features used for the linkage, as well as the small size of the group in the training data, the standard DP notion for PPRL exhibits fairness-bias towards the minority group (the women group in this example). The similarity scores of matches and non-matches for the women group overlap more strongly than for the men group (right plot in Fig. 1), making the linkage more challenging for the women group. Hence, the linkage quality measured in terms of precision and recall is considerably lower for the women group than for the men group (left plot in Fig. 1).
There are many different fairness definitions proposed in the literature. The three commonly used definitions are Demographic Parity, Equalized Opportunity, and Equalized Odds. As discussed in [Vat20a], Equalized Odds is the best-fit fairness definition for PPRL. Since Demographic Parity requires similar rates of classification of record pairs as 'matches' for different groups regardless of the ground truth, it can result in linkage accuracy loss. Moreover, unlike other classification tasks, PPRL is a class-imbalanced problem with a significantly lower number of true 'matches' than true 'non-matches', which can lead to many false positives under the Demographic Parity criterion. Equalized Opportunity only considers the true positive rate, whereas Equalized Odds considers both types of errors (false negatives and false positives). In PPRL we are particularly concerned about linkage errors, and therefore use Equalized Odds as the fairness criterion in our study.
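A minimal sketch of measuring an Equalized Odds gap across two protected groups; all data values and function names are illustrative, not from the paper:

```python
# Equalized Odds compares TPR and FPR across protected groups:
# a gap of 0 means the classifier satisfies Equalized Odds exactly.
def rates(y_true, y_pred):
    """True positive rate and false positive rate (matches = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, fp / neg

def equalized_odds_gap(groups):
    """Max disparity in TPR and in FPR across all groups."""
    tprs, fprs = zip(*(rates(t, p) for t, p in groups.values()))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

groups = {
    "men":   ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect linkage
    "women": ([1, 1, 0, 0], [1, 0, 0, 1]),  # one FN, one FP
}
gap = equalized_odds_gap(groups)  # -> 0.5
```

In this toy example the men group is linked perfectly (TPR 1.0, FPR 0.0) while the women group has TPR 0.5 and FPR 0.5, giving an Equalized Odds gap of 0.5.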
Moreover, the number of similarity comparisons required for the PPRL function increases quadratically with the size of the datasets, and therefore blocking has been used to reduce the comparison space. Blocking aims at reducing the comparison space for linkage by eliminating the comparisons between pairs of records that are highly unlikely to be matches [Chr11, Vat17b]. The main aim of these techniques is to group records into disjoint or overlapping bins such that only records within the same bin need to be compared with each other. Differential privacy algorithms for private blocking have been developed to prevent information leakage from bins by adding dummy or noisy (encoded) records into the bins, at the cost of more record pair comparisons [Ina08, Ina10, Kuz13, Cao15, He17]. However, these methods do not consider fairness-bias in the data, and thus dummy records could amplify the bias towards the minority group.
IV Feature-level Differential Privacy for PPRL
Frequency inference attacks on encoded bins (generated by a private blocking method) can reveal information about the records in the bins; for example, bins with fewer records (such as bins with rare/uncommon last names, if the blocking strategy is to bin records based on last name) can be re-identified. Differential privacy noise addition has been used in the literature to make the blocks resilient against frequency attacks [He17, Ina08, Ina10, Kuz13]. These existing methods add noise that satisfies the standard DP guarantees to the bins of records in the form of dummy records.
Definition IV.1 (Differentially private blocking for PPRL)
Alice and Bob agree on a blocking function B with b bins and a blocking strategy s. A specific number of dummy records is inserted into each bin of the blocking strategy such that the bin sizes are differentially private. Each dummy record does not match any record.
[He17] defines PPRL neighbours for record-level data. In that work, they propose a weaker, but end-to-end, (ε, δ)-DP definition for the two-party setting. Their protocol may reveal records that are classified as 'matches' and statistics about non-matching records, while not revealing the presence or absence of individual non-matching records in the dataset. However, similar to other works [Ina08, Ina10, Kuz13], their DP definition is only constrained on privacy guarantees, and does not take into account fairness and computational cost constraints.
In order to add DP noise that is constrained on fairness, we split the data into the groups of the protected feature value. We define protected feature-level DP for PPRL, where the bins of records within each protected feature group are guaranteed to be differentially private. We note that the privacy guarantees are provided for the entire record within each protected group, not only for the protected feature. For example, if the protected feature is 'gender' and it has only two groups, 'male' and 'female', then the data is split into two disjoint groups. The adjacent datasets then become two male (or female) groups that differ by one male (or female) record, respectively. Please note that our proposed new DP notions are applicable to multiple protected features as well. With multiple protected features, for example gender with 'female' and 'male' groups and age group with 'young' and 'old' groups, the number of protected feature groups becomes four (i.e. 'young female', 'old female', 'young male', and 'old male'). Without loss of generality, we assume a single protected feature in defining our new DP notions. The neighbours in each of the disjoint groups are defined based on feature-level PPRL neighbours as follows:
Theorem IV.1 (Feature-level-PPRL neighbors)
Given a function f and a database D_B, any pair of databases D_A and D'_A that differ in one pair of non-matching records (r, r') from protected feature group g, with f(D_A, D_B) = f(D'_A, D_B) and |D_A| = |D'_A|, are neighbors w.r.t. f for protected feature group g, denoted by D_A ≈_{f,g} D'_A.
If D_A and D'_A differ in a matching record, then their matching outputs with a given D_B are different. Hence, D_A and D'_A differ only in one or more non-matching records. Also, to ensure |D_A| = |D'_A|, the number of non-matching records added to D_A to obtain D'_A is the same as the number of non-matching records deleted from D_A. Then, a neighbouring pair D_A and D'_A with regard to one protected feature group differs by only one pair of non-matching records (r, r').
Definition IV.2 (Feature-level DP)
A 2-party PPRL protocol for computing function f satisfies feature-level (ε_g, δ_g)-differential privacy (DP) for any protected feature group g if, for any neighboring D_A ≈_{f,g} D'_A, the views of Alice during the execution satisfy
Pr[V_A(D_A, D_B)] ≤ e^{ε_g} · Pr[V_A(D'_A, D_B)] + δ_g,
where f is a matching rule. The same holds for the views of Bob.
The expected number of dummy records added for each group in each bin is derived as follows. The pdf of the Laplace DP noise is f(x | μ, b) = (1/(2b)) · e^{−|x−μ|/b}, with b = Δf/ε, where Δf = 1 is the sensitivity of the counting query for PPRL. Setting b = 1/ε_g for group g and choosing a shift μ_g that keeps the noise non-negative, the expected number of dummy records for one group is E[η_g] = μ_g. For feature-level (ε, δ)-differential privacy over r groups, the overall privacy budget is ε = Σ_{g=1}^{r} ε_g. Because the Laplace noise for each group is independent of the others, the expected number of dummy records for each bin is the sum over the groups, E[η] = Σ_{g=1}^{r} E[η_g] = Σ_{g=1}^{r} μ_g, which gives the expected number of dummy records per bin with regard to the overall privacy budget ε. This concludes the proof.
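The per-group, per-bin noise draw can be sketched as follows; the shift value and parameter names are assumptions for illustration, not the paper's actual calibration:

```python
# Sketch: number of dummy records per protected group per bin, drawn from
# a shifted, truncated Laplace distribution with scale 1/eps_g, so that
# the count is always non-negative. The shift of 10 is an assumption.
import math
import random

def laplace_dummy_count(eps_g, shift=10, rng=random.Random(0)):
    """Non-negative integer Laplace noise for one group in one bin."""
    u = rng.random() - 0.5                      # u in [-0.5, 0.5)
    noise = -math.copysign(1.0, u) * (1.0 / eps_g) * math.log(1 - 2 * abs(u))
    return max(0, round(shift + noise))

def bin_dummy_count(eps_per_group):
    """Noise is independent per group; the bin total is the sum."""
    return sum(laplace_dummy_count(e) for e in eps_per_group.values())

total = bin_dummy_count({"female": 0.5, "male": 0.5})
```

Smaller per-group budgets ε_g give wider Laplace noise and thus more dummy records on average, which is exactly the privacy-cost trade-off discussed above.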
V Fairness-aware feature-level DP for PPRL
Assume a protected feature divides the dataset into r disjoint groups. Then fairness-aware feature-level DP is defined as follows:
Definition V.1 (Fairness-aware feature-level DP)
A randomized mechanism M satisfies (ε, δ)-fairness-aware DP if M satisfies feature-level (ε, δ)-DP and its output is constrained on:
Fairness: the performance of the outputs across the different groups of the protected feature should be equalized; the Equalized Odds fairness criterion is used.
Cost: efficiency is analysed in terms of the additional communication and computational costs introduced by the DP noise for PPRL, which in this case is the number of fake candidate pairs.
In the following, we define the feature-level DP constrained on fairness and cost.
V-A Fairness constrained feature-level DP
A true match can only be an original record pair. A true non-match can be either an original record pair or a dummy record pair. The number of dummy records is proportional to the number of true non-matches. A false non-match is caused by the bias in the original records in the dataset. The number of these false non-matches can be estimated by sampling the dataset for an approximate value.
An original record pair can be classified as a false match if the similarity between their Bloom filters is high. The number of these false positive matches can be estimated by sampling the dataset. On the other hand, the added dummy records can also cause false positive matches. A dummy record is created by flipping bits in an original binary record with a flipping probability f_g for group g. If f_g is too small, then the similarity between the dummy record and the original record can be large enough for the pair to be classified as a false match. For brevity, a threshold-based classifier is used to determine matches and non-matches, and the Dice coefficient is used as the similarity function between two Bloom filters.
The Dice coefficient of two BFs (b_1, b_2) is calculated as sim_D(b_1, b_2) = 2c / (x_1 + x_2), where c is the number of common bit positions that are set to 1 in both BFs, and x_i is the number of bit positions set to 1 in b_i, i ∈ {1, 2}.
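The Dice coefficient can be computed directly from two bit vectors; a minimal sketch:

```python
# Dice coefficient of two Bloom filters represented as 0/1 lists:
# 2c / (x1 + x2), where c counts positions set to 1 in both filters.
def dice(b1, b2):
    c = sum(1 for u, v in zip(b1, b2) if u == 1 and v == 1)
    x1, x2 = sum(b1), sum(b2)
    return 2 * c / (x1 + x2)

assert dice([1, 1, 0, 0], [1, 1, 0, 0]) == 1.0  # identical filters
assert dice([1, 1, 0, 0], [0, 0, 1, 1]) == 0.0  # no common 1-bits
```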
So, for a threshold-based classifier with threshold t, a record pair for group g is classified as a false match if sim_D ≥ t, where the Dice coefficient between a dummy record and its progenitor record is calculated as sim_D = 2c / (x + x'), with f_g the flipping probability for group g, l the length of one Bloom filter, x the number of bit positions set to 1 in the original BF, and x' the number of bit positions set to 1 in the dummy record. The number c of 1-bits common to the dummy record and its progenitor follows a Binomial distribution with x trials and success probability 1 − f_g, i.e. c ∼ B(x, 1 − f_g). The probability of getting exactly c successes in x independent Bernoulli trials is given by the probability mass function Pr(C = c) = C(x, c) · (1 − f_g)^c · f_g^{x − c}, for c = 0, 1, …, x.
With the probability distribution of c, the probability that a dummy record and its progenitor record are classified as a false match (false positive, FP) is Pr(sim_D ≥ t).
This probability has no convenient closed form. Instead, as the length of the Bloom filter is large enough (300 in our case), we approximate the probability of getting c successes using the Central Limit Theorem.
Suppose X_1, …, X_n is a sequence of i.i.d. random variables with mean μ and variance σ². The sum of the variables is S_n = Σ_{i=1}^{n} X_i. Then, as n increases to infinity, S_n converges in distribution to a normal N(nμ, nσ²). Approximating the number of 1-bits in the dummy record by x, the probability of a false match between a dummy record and its progenitor can be derived as Pr(sim_D ≥ t) ≈ 1 − Φ((t·x − x(1 − f_g)) / sqrt(x·f_g·(1 − f_g))), where Φ is the standard normal cdf.
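The normal approximation can be sketched numerically as follows; the concrete parameter values and the assumption that the dummy record retains roughly x one-bits are ours, not the paper's:

```python
# Normal approximation of the probability that a dummy record is classified
# as a false match with its progenitor: the count of surviving 1-bits is
# Binomial(x, 1 - f), approximated by N(x(1-f), x f (1-f)); we assume the
# dummy keeps roughly x one-bits, so Dice ~= c / x.
import math

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def false_match_prob(x, f, t):
    """P(Dice(dummy, progenitor) >= t) under the normal approximation."""
    mu = x * (1 - f)
    sigma = math.sqrt(x * f * (1 - f))
    return 1 - norm_cdf((t * x - mu) / sigma)

p_low = false_match_prob(x=150, f=0.4, t=0.85)    # heavy flipping
p_high = false_match_prob(x=150, f=0.05, t=0.85)  # light flipping
```

As the text argues, a larger flipping probability pushes the dummy's similarity to its progenitor down, so the false-match probability drops sharply.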
When the flipping probability f_g for protected group g increases, the probability of a pair with a dummy record being classified as a false positive by the threshold classifier decreases. When f_g exceeds a certain value, this probability becomes negligible.
The number of false positive matches from dummy records is calculated as FP_d = Pr(sim_D ≥ t) · n_d, where n_d is the number of record pairs with at least one dummy record.
We use the following fairness loss function, which is the maximum of the distance between the false positive rates of two groups and the distance between the false negative rates of two groups: ℓ = max(|FPR_1 − FPR_2|, |FNR_1 − FNR_2|), where FPR_g = Pr(ŷ = 1 | y = 0, g) and FNR_g = Pr(ŷ = 0 | y = 1, g), and y and ŷ are the true and predicted class labels, respectively, with two values, 1 (for matches) and 0 (for non-matches).
Fairness is measured by this loss: the smaller the loss, the fairer the linkage. The false positive rate for group g combines the false positive matches from original record pairs and those from dummy record pairs of group g, relative to the total number of true non-matching pairs in group g.
The original records in the dataset contribute to both the false negative rate and the false positive rate, while dummy records affect the false positive rate but not the false negative rate. The false negative matches and the false positive matches from original record pairs are therefore regarded as constant values. By fixing the number of dummy records for the different groups, the fairness loss can be simplified as a function of the flipping probabilities f_1, …, f_r.
Definition V.2 ((ε, Fairness)-Constrained DP)
Given ε, a DPRL randomised mechanism M satisfies (ε, Fairness)-Constrained DP if there exists a solution (f_1, …, f_r) such that
M satisfies feature-level (ε, δ)-DP;
M minimises the fairness loss in the constrained optimization problem:
V-B Cost constrained feature-level DP
A blocking protocol B and blocking strategy s are used to block the dataset into bins. Given ε_g for the r protected groups of the sensitive feature, the expected number of dummy records that need to be added for each of these groups per bin for the DP guarantees is a constant, denoted by η_{g,i}, where g ∈ {1, …, r}, i is the bin index, and η_{g,i} corresponds to the dummy records of group g in bin i.
By fixing the flipping probability f_g for each group g, the fairness loss function can be simplified as a function of the privacy budgets ε_1, …, ε_r.
The number of candidate matches is a random variable, denoted by N_c, whose expected value is determined by the bin sizes and the expected numbers of dummy records added per bin.
The expected privacy cost for group g follows from the expected number of dummy records η_g, and the expected false positive rate follows from Equation 10.
Definition V.3 ((ε, Cost)-Constrained feature-level DP)
Given ε, a DPRL randomised mechanism M satisfies (ε, Cost)-constrained feature-level DP if there exists a solution (ε_1, …, ε_r) such that
M satisfies feature-level (ε, δ)-DP;
the fairness loss is minimised.
The privacy budget for each group and the cost conflict with each other. We therefore use a Lagrangian function to solve this optimization problem. Consider the following optimization problem, which minimises the overall cost subject to privacy budget constraints:
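The budget-versus-cost trade-off can be illustrated with a toy search over the split of the overall budget; the cost function, Lagrange multiplier, and grid search here are illustrative stand-ins, not the paper's actual objective or solver:

```python
# Toy Lagrangian trade-off: smaller per-group budgets eps_g mean more
# dummy records (higher cost), so we search for the split of the overall
# budget eps_total that minimises cost plus a Lagrange penalty term.
def cost(eps):
    """Stand-in cost: expected dummy pairs grow like 1/eps_g per group."""
    return sum(1.0 / e for e in eps)

def lagrangian(eps, lam, eps_total):
    return cost(eps) + lam * (sum(eps) - eps_total)

eps_total, lam = 1.0, 5.0
best = None
for i in range(1, 100):  # grid over the two-group budget split
    e1 = eps_total * i / 100
    e2 = eps_total - e1
    val = lagrangian([e1, e2], lam, eps_total)
    if best is None or val < best[0]:
        best = (val, e1, e2)
# with a symmetric cost, the optimum splits the budget evenly
```

For this symmetric stand-in cost the optimal split is an even one (ε_1 = ε_2); asymmetric fairness terms, as in the paper, would shift the optimum towards the harder-to-link group.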
VI Framework for PPRL with Fairness and Cost Constrained DP
In this section, we present a framework for PPRL based on Bloom filter encoding and the new notions of differential privacy (DP) for fairness and computational cost-aware PPRL. Records are encoded into Bloom filters at the data owners' sites. The encoded records are then sent to a linkage unit, which performs All-Pairwise Comparisons (APC) to determine the linked or matched pairs of records.
Blocking techniques are useful as a pre-processing step prior to APC to achieve efficiency and high recall in record linkage. In PPRL with DP blocking, DP hides the presence or absence of a single record, and hence the number of candidate matches stays roughly the same on D_A and D'_A that differ in a single record. It provides a strong end-to-end privacy guarantee: it leaks no information other than the sizes of the databases and the set of matching records. With standard DP, the dummy records added to each blocked bin are uniformly distributed across the different groups. Considering the fairness-bias in the data, we introduce a PPRL with feature-level DP blocking method, in which the dummy records added for the different groups are biased and fairness-aware:
The total number of dummy records added is calculated based on all groups.
The number of dummy records added under one group label is calculated by the feature-level DP noise.
Since the dummy records are generated by modifying original records, flipping each bit in the Bloom filter with a probability, the data bias of the dummy records is also fairness-aware with regard to the flipping probability.
Dummy records for the female and male groups can be manipulated separately, both in the number of dummy records and in the error rate/flipping probability.
In each bin, the number of dummy records for each group is independent of the others. Thus, the added feature-level DP noise is fairness-aware and hides records from the different groups.
An overview of our proposed framework for fairness-aware PPRL with Bloom filter encoding and new notions of DP is shown in Fig. 2. The proposed feature-level DP blocking algorithm for PPRL with new DP notions is outlined in Algorithm 1, and is described in the following sub-section.
VI-A Algorithm Description
In this algorithm, the two data owners (parties) Alice and Bob agree on a blocking function and a blocking strategy. Then, a chosen number of dummy records is added to each bin to make the bin sizes differentially private. The dummy records are carefully created such that they do not match any record. Linkage is done using any machine learning classifier (e.g. a logistic regression classifier) or a simple threshold-based classifier.
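As a concrete illustration of the threshold-based classifier, PPRL commonly compares Bloom filters using the Dice coefficient of the two bit vectors; a minimal sketch under that assumption (the threshold value and function names are illustrative):

```python
def dice(bf1, bf2):
    """Dice coefficient of two equal-length Bloom-filter bit vectors."""
    common = sum(a & b for a, b in zip(bf1, bf2))
    total = sum(bf1) + sum(bf2)
    return 2.0 * common / total if total else 0.0


def classify(bf1, bf2, threshold=0.8):
    """Label a candidate record pair as a match if its similarity
    reaches the threshold, and as a non-match otherwise."""
    return dice(bf1, bf2) >= threshold
```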
VII Experimental Evaluation
We conducted our experiments on three pairs of datasets sampled from two sources/domains that contain person-specific data: 1) Australian Bureau of Statistics (ABS) datasets and 2) North Carolina Voter Registration (NCVR) datasets.
ABS: This is a synthetic dataset used internally for linkage experiments at the Australian Bureau of Statistics (ABS), available from https://github.com/cleanzr/dblink-experiments/tree/master/data. It simulates an employment census and two supplementary surveys. We sampled 5000 records for two parties with area level (categorical), mesh block (categorical), sex, industry (multi-categorical), and part time / full time (binary-categorical) attributes.
NCVR-no-mod: We extracted a pair of datasets with 5000 records each and a pair of datasets with 10000 records each for two parties from the real North Carolina Voter Registration (NCVR) database, available from http://dl.ncsbe.gov/data/, with 50% of records matching between the two parties. Ground truth is available based on the voter registration identifiers. We used the given name (string), surname (string), suburb (string), postcode (string), and gender (categorical) attributes for the linkage.
NCVR-mod: We generated another series of synthetic NCVR datasets for each pair of NCVR datasets generated above, where we included 50% synthetically corrupted records using the GeCo tool [Tra13]. We applied various corruption functions from the GeCo tool on randomly selected attribute values, including character edit operations (insertions, deletions, substitutions, and transpositions), and optical character recognition and phonetic modifications based on look-up tables and corruption rules [Tra13]. This allows us to evaluate how real data errors impact the linkage quality and fairness of linkage decisions.
In our experiments, we use two binary protected features: gender, with 'male' and 'female' groups, and age group, with 'young' and 'old' groups.
The evaluation metrics used to measure the matching or linkage performance, fairness of linkage, computational efficiency, and privacy guarantees are:
Matching performance: Given the numbers of true and false positives/matches (TP and FP) and the numbers of true and false negatives/non-matches (TN and FN) predicted by the classifier:
False positive rate (FPR) is the proportion of true negatives/non-matches that are predicted by the linkage classifier as positives/matches, i.e. FPR = FP / (FP + TN).
Precision is the percentage of correctly classified matches against all pairs that are classified as matches: precision = TP / (TP + FP).
Recall is the percentage of correctly classified matches against all true matches: recall = TP / (TP + FN). It is also known as the true positive rate (TPR). Please note that while Bloom filter encoding does not admit false negatives (i.e. FN = 0 and hence recall is 1.0), the recall of the linkage algorithm may not be 1.0 due to differential privacy noise addition, the blocking/binning quality, or data errors and variations that could lead to recall loss.
F*-measure [Han21] has been used in the recent record linkage literature as an alternative to the F-measure, as the F-measure has a limitation in appropriately measuring linkage quality due to the relative importance given to precision and recall. The F*-measure is calculated as F* = TP / (TP + FP + FN).
Fairness loss: The Equalized Odds fairness criterion is used, and the fairness loss based on this criterion is calculated as the maximum of the absolute difference between the FPRs of the two groups (based on the protected feature, e.g. the male and female groups) and the absolute difference between the FNRs of the two groups, where Y and the predicted label are the true and predicted class labels, respectively, with two values, 1 (for matches) and 0 (for non-matches). Fairness is calculated as fairness = 1 − loss.
Number of record pair comparisons: Computational cost of linkage is measured as the required number of record pair similarity comparisons (including real and dummy pairs).
Differential privacy budget ε: Privacy is measured using the privacy budget ε for differential privacy guarantees. This is one of the widely used metrics under the indistinguishability category, as described in the taxonomy proposed in [Wag18].
Fairness and matching performance metrics are in the range [0, 1], and higher values indicate better performance in terms of accurate and fair linkage for all metrics except the FPR and fairness loss metrics, for which lower values are better. Lower values for the number of record comparisons and the privacy budget indicate better computational efficiency and stronger privacy guarantees, respectively.
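The matching and fairness metrics above can be computed directly from the per-group confusion counts; a minimal sketch (the function and field names are our own, not the paper's):

```python
def rates(tp, fp, tn, fn):
    """Confusion-matrix rates and measures used in the evaluation."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_star = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0  # F*-measure [Han21]
    return {"fpr": fpr, "fnr": fnr, "precision": precision,
            "recall": recall, "f_star": f_star}


def equalized_odds_loss(group_a, group_b):
    """Fairness loss: the maximum of the absolute FPR gap and the
    absolute FNR gap between the two protected groups."""
    return max(abs(group_a["fpr"] - group_b["fpr"]),
               abs(group_a["fnr"] - group_b["fnr"]))
```

Fairness is then obtained as `1 - equalized_odds_loss(...)`, so that both fairness and the matching metrics lie in [0, 1].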
In this work, we used a threshold classifier and a logistic regression machine learning model to perform the linkage (i.e. to classify matches and non-matches). We used the logistic regression classifier available in the sklearn library in Python for classifying record pairs.
First, we demonstrate and validate the false positive probability in Equation (7) and the false positive rate predictions in Equation (14). Consider the case where there are 5000 entities in each dataset with gender as the sensitive feature, and the feature-level DP blocking method is used in the record linkage process with equal privacy budgets and equal flipping probabilities across the gender groups. The length of the Bloom filter is 300, the number of hash functions is 30, the number of iterations of the Bloom filter is 5, and the length of the sub-strings (q-grams) is 2. The length of the label for each blocked bin is 30, and the number of iterations of encoding is 2. The flipping probability is varied over a range of values. A threshold classifier is used after the linkage unit to classify matches and non-matches.
The false positive probability for a record pair involving at least one dummy record, under the threshold classifier, is shown in Fig. 4. In this figure, the theoretical prediction (blue curve) matches the empirical false positive probability (red plot). When the flipping probability is zero, the dummy record is an exact copy of its ancestor, so it is classified as a false match. As the flipping probability increases, the false positive probability decreases. Notably, once the flipping probability exceeds a certain value (see Fig. 4), the false positive probability reduces to zero. This behavior of the false positive probability versus the flipping probability matches Theorem V.1.
Then, we evaluate the effect of the privacy budget on the false positive rate with the threshold classifier, while the flipping probability is fixed for now. Recall that the false positive rate is FPR = FP / (FP + TN). When the flipping probability is fixed, the false positive probability for a dummy record pair is fixed, so the numbers of FPs and TNs depend on the original dataset and the number of additional dummy records. As shown in Fig. 4, the false positive probability converges once the flipping probability exceeds a certain value. So, to reduce the effect of the flipping probability and its potential bias in record linkage, the flipping probabilities for the two gender groups in our experiments are chosen from this converged range.
As shown in Fig. 4, the FPR is inversely proportional to the number of dummy records. Remarkably, with the inclusion of dummy records, the FPR reduces as a result of the increase in the number of TNs from dummy record pairs. When the privacy budget increases, the FPR increases. Dummy record pairs can only be classified as either FPs or TNs, and there are always more TNs than FPs, because one dummy record is paired with many records other than the true match record from the other dataset. In Fig. 4, the blue curve shows the theoretical result from Equation (10), and the red plot shows the average empirical results with the feature-level DP blocking method. The theoretical result matches our empirical results for the feature-level DP method, and both validate the relationship between ε and the FPR given in Equation (10).
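The inverse relationship between the number of dummy records and the FPR follows directly from FPR = FP / (FP + TN): each dummy record forms mostly non-matching (TN) pairs. A toy calculation with illustrative counts (not the paper's numbers):

```python
def fpr(fp, tn):
    """False positive rate from confusion counts."""
    return fp / (fp + tn)


# Illustrative counts without dummy records.
base = fpr(fp=100, tn=900)
# Each dummy record pairs with many non-matching real records,
# so dummies contribute far more TNs than FPs.
with_dummies = fpr(fp=100 + 5, tn=900 + 500)
assert with_dummies < base  # adding dummies lowers the FPR
```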
With the knowledge of the effect of the flipping probability on the false positive probability, we evaluate four scenarios:
Baseline 1: no noise,
Baseline 2: Feature-level DP blocking method,
Method A: Fairness constrained feature-level DP blocking method,
Method B: Cost constrained fairness-aware feature-level DP blocking method.
In Baseline 1, no noise is added. In Baseline 2, the DP noise added for the different protected feature groups is the same; in other words, the privacy budget and the flipping probability are the same for all protected feature groups. In Method A, the privacy budget for each protected feature group is kept constant, while the flipping probability for each group satisfies Definition V.2. In Method B, the flipping probabilities for the protected groups are the same, while the privacy budget for each group satisfies Definition V.3.
With the differentially private blocking method introduced into the record linkage process, there is an improvement in fairness in terms of Equalized Odds from small to large privacy budgets, as shown in Fig. 6, Fig. 8, Fig. 10 and Fig. 12. When gender is considered as the sensitive feature, fairness improves significantly for both the threshold classifier and the logistic regression classifier on the NCVR-no-mod, NCVR-mod and ABS datasets. In Fig. 10, when age group is used as the sensitive feature, fairness with differentially private blocking improves only for logistic regression. This is because, for the threshold classifier, the fairness of all four scenarios (including the baselines without fairness constraints) is already high with small privacy budgets, and the differential privacy noise is stochastic with large privacy budgets. The fairness of our methods with respect to age group (in contrast to gender) does not always improve compared to Baseline 1 without noise. However, our methods improve fairness with respect to both age group and gender compared to Baseline 2. Our results indicate that our methods with either protected feature (age group or gender) outperform the baselines by achieving both high privacy and high fairness.
Both Method A and Method B consistently reduce the fairness loss compared to Baseline 2. From Fig. 6, Fig. 8, Fig. 10 and Fig. 12, Method B performs better than Method A in reducing the fairness-bias with respect to gender for large privacy budgets, while Method A performs better for small privacy budgets. The reason is that when the privacy budget is large, the bias is more sensitive to the cost, whereas when the privacy budget is small, the flipping probability dominates the gender fairness-bias. Hence, Method A is preferred over Method B when the privacy budget is small. Baseline 1, with no noise added to the blocked bins, consistently shows the lowest fairness performance. Method B performs better than Method A in reducing the fairness-bias introduced by the added differentially private noise when age group is the sensitive feature. As shown in Fig. 10, fairness in Baseline 1 is relatively high with the threshold classifier, and the introduced DP blocking method increases the fairness loss, as seen in Baseline 2. Both Method A and Method B help reduce the fairness loss in this case. It is remarkable that Method B has equivalent or better performance among the four scenarios for both the threshold classifier and the logistic regression classifier.
Intuitively, adding privacy-preserving noise to the bins of records would be expected to reduce the precision of the linkage, and this can be seen in Fig. 6, Fig. 10 and Fig. 12. In Fig. 6, when the privacy budget is small and the threshold classifier is used, the F*-measure for Baseline 2, Method A and Method B is worse than for the no-noise Baseline 1. When the privacy budget is small, adding privacy-preserving noise can reduce the F*-measure, as shown in Fig. 6 and Fig. 10, while in Fig. 12, with the threshold classifier, the F*-measure for Baseline 1 is the best among all four scenarios. However, adding privacy-preserving noise does not always decrease the F*-measure. Our results show that the F*-measure of the scenarios with feature-level DP blocking methods is better than Baseline 1 with no noise in Fig. 6 on the NCVR-no-mod dataset with the logistic regression classifier, in Fig. 8 on the NCVR-mod dataset with both the threshold and logistic regression classifiers, in Fig. 10 with the threshold classifier and a large privacy budget, and in Fig. 12 on the ABS dataset with the logistic regression classifier.
With the fairness-constrained feature-level DP blocking method, a small privacy budget and the threshold classifier, both fairness and the F*-measure are significantly improved compared to Baseline 1 and Baseline 2 in some high fairness-bias cases. For the logistic regression classifier on the NCVR-no-mod, NCVR-mod and ABS datasets, Baseline 1 shows the lowest fairness and the worst F*-measure performance among all scenarios in high fairness-bias cases. Our methods perform better in terms of the F*-measure than both baselines for the highly biased gender group.
Method A and Method B improve on Baseline 1 and Baseline 2 in fairness while requiring the same overall privacy budget. For Method A, the privacy budgets for the different protected feature groups are the same, so they remain the same as in Baseline 2. By applying (ε, Fairness)-Constrained Differential Privacy, the distance between the false positive rates for males and females is reduced by adjusting the flipping probabilities, while the pairing cost for each protected feature group remains the same as in Baseline 2. For Method B, the overall privacy budget remains the same, while the privacy budgets for the different protected feature groups differ from each other; here, (ε, Cost)-Constrained fairness-aware Differential Privacy is used. As shown in Fig. 13, the overall privacy budget remains the same as in Baseline 2, while the cost for the male group decreases and the cost for the female group increases.
Differentially private grouping or blocking has been used in several works in the Privacy-Preserving Record Linkage (PPRL) literature to efficiently link (encoded) records from different parties while providing resilience against frequency inference attacks on the bins/blocks of encoded records. However, these methods use the standard differential privacy (DP) notion, which does not consider other constraints, such as the fairness of the linkage and the computational cost of comparing record pairs, when adding differential privacy noise to the bins.
In this work, we propose new DP notions that are constrained not only by privacy guarantees, but also by fairness-bias in the data and the computational cost of linkage, and we apply these new DP notions to a PPRL framework based on Bloom filter encoding and DP. We theoretically validate our new notions. Our experimental results show that the new PPRL algorithm following the two new DP notions, constrained on fairness and cost, provides better results in terms of fairness and cost for the same privacy guarantees.
While our initial results are promising, the cost-constrained fairness-aware DP method does not perform as well as the fairness-constrained DP method in terms of fairness results. We would like to further analyze the cost-constrained method and the trade-off between cost and fairness. In the future, we would like to explore fairness- and cost-constrained DP for other learning tasks, including privacy-preserving active learning and privacy-preserving clustering. Another line of future work is experimenting with the proposed notions with multiple protected features (e.g. gender and race) and different machine learning linkage models.
This research was funded by Macquarie University CyberSecurity Hub and strategic research funds from Macquarie University. Author Dinusha Vatsalan was affiliated with CSIRO Data61 at initial stages of the writing of this manuscript.