In today’s digital age, the Internet has made it easy to collect and access many kinds of information. Individuals are willing to give away personal information for online convenience (Craig and Ludloff, 2011), and this vast amount of data about individuals is constantly being stored in various databases. When analysed efficiently, such data can translate into meaningful information about an individual’s behaviour. As a result, personal data has become the new currency of the digital age, sought after by every industry and government. Despite the substantial benefits that data analysis can bring, improper handling of sensitive data can reveal more information than intended. One attempt to preserve user privacy was to release anonymised or aggregated data. However, it has been shown that records can be de-anonymised when combined with other data sources: this happens when attackers can accurately match the anonymised database to another, non-anonymised database (Narayanan and Shmatikov, 2008).
The introduction of differential privacy has brought us closer to the goal of preserving personal privacy while still revealing meaningful information about datasets. In brief, differential privacy adds noise to the output of a query so that the distribution of the output does not change significantly whether or not any single record is included in the dataset.
$(\epsilon, \delta)$-Differential Privacy (Dwork et al., 2006) is defined as:
Definition 1 ($(\epsilon, \delta)$-Differential Privacy).
A randomized algorithm $Y$ satisfies $(\epsilon, \delta)$-Differential Privacy if, for any two neighboring datasets differing by one record ($D$ and $D'$) and all subsets $S$ of the output range, it generates a randomized output such that:
$$\Pr[Y(D) \in S] \le e^{\epsilon} \, \Pr[Y(D') \in S] + \delta,$$
where the probability space is over the coin flips of the randomized algorithm.
Traditionally, every query answered with differential privacy incurs a privacy cost, and this cost accumulates even when the same query is repeated. The accumulated privacy cost may therefore exceed the privacy budget, leading to greater privacy leakage (Cuppens et al., 2019; Jia et al., 2019).
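The accumulation of privacy cost can be illustrated with a minimal sketch. It assumes basic sequential composition, under which the $\epsilon$ values of successive queries simply add up; the class name `PrivacyBudget` and its methods are hypothetical and are not taken from the demo.

```python
class PrivacyBudget:
    """Hypothetical tracker for a total epsilon budget (illustration only)."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def charge(self, epsilon: float) -> float:
        """Charge one query and return the remaining budget.

        Assumption: basic sequential composition, so costs add up,
        even when the same query is asked again.
        """
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so float rounding does not reject an exact fit.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.remaining


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    for attempt in range(3):
        try:
            print("remaining after query:", budget.charge(0.4))
        except RuntimeError as err:
            print("query", attempt + 1, "rejected:", err)
```

Under this simple accounting, repeating an identical query is charged every time, which is the behaviour that reusing stored noisy answers is meant to avoid.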
The rise of blockchain technology offers a possible solution to this problem of privacy budget exhaustion. With its key advantages of decentralisation, tamper-proofing and traceability, a blockchain provides a distributed, trusted peer-to-peer platform for storing information about the queries processed with differential privacy (Sultan et al., 2018). This transaction data can be retrieved later so that previously generated noisy answers can be reused where possible. The reuse of old noisy answers is highlighted in the subsequent web DApp demonstration. With the proposed blockchain-based privacy management system, the total privacy cost incurred is significantly reduced, which suits datasets that receive frequent queries, such as medical record datasets.
2. Related Works
Zyskind et al. (Zyskind et al., 2015) tackle the issue of privacy on third-party mobile platforms by proposing a combination of blockchain and off-blockchain storage to create a personal data management platform with increased privacy. This allows users to own and control their data without having to trust any third party. Kosba et al. (Kosba et al., 2016) propose a framework for building privacy-preserving smart contracts. The proposed framework, Hawk, allows any programmer to easily write a program that implements a cryptographic protocol between the blockchain and the user. This cryptographic protocol uses authenticated data structures and zero-knowledge proofs for added security.
The above studies among others focused on identity privacy between the blockchain and the users, while trusting the anonymity of the blockchain. However, there is a lack of literature which focuses not only on the privacy of blockchain, but also the privacy protection for the database itself. Research on reusing noisy answers for privacy protection has also been done previously.
Xiao et al. (Xiao et al., 2011) proposed a differentially private algorithm that correlates the Laplace noise added to different query results for improved data utility. In this paper, however, Gaussian noise is used instead of Laplace noise to allow an easier privacy analysis of multiple query types. The sum of independent Laplace random variables does not follow a Laplace distribution, whereas the sum of independent Gaussian random variables still follows a Gaussian distribution. As such, using Gaussian noise allows the algorithm to handle queries of different types.
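The closure property behind this choice can be checked with characteristic functions. The short derivation below is a standard probability fact rather than something taken from the cited works; the symbols $X_1$, $X_2$, $L_1$, $L_2$, $\sigma_1$, $\sigma_2$ and $b$ are introduced here only for illustration.

```latex
% Independent zero-mean Gaussians: the characteristic functions multiply,
% and the product is again the characteristic function of a Gaussian.
\[
X_1 \sim \mathcal{N}(0, \sigma_1^2),\quad X_2 \sim \mathcal{N}(0, \sigma_2^2)
\;\Longrightarrow\;
\varphi_{X_1 + X_2}(t)
  = e^{-\sigma_1^2 t^2 / 2}\, e^{-\sigma_2^2 t^2 / 2}
  = e^{-(\sigma_1^2 + \sigma_2^2)\, t^2 / 2},
\]
\[
\text{so}\quad X_1 + X_2 \sim \mathcal{N}(0, \sigma_1^2 + \sigma_2^2).
\]

% Independent zero-mean Laplace variables with scale b: the product of the
% characteristic functions is not of Laplace form for any scale c.
\[
\varphi_{L_1}(t) = \varphi_{L_2}(t) = \frac{1}{1 + b^2 t^2}
\;\Longrightarrow\;
\varphi_{L_1 + L_2}(t) = \frac{1}{(1 + b^2 t^2)^2}
\neq \frac{1}{1 + c^2 t^2}
\quad \text{for every scale } c > 0.
\]
```

In the Gaussian case the variances simply add, so a noisy answer with a smaller standard deviation can be turned into one with a larger standard deviation by adding independent Gaussian noise.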
3. Demo Overview
To illustrate the proposed blockchain-based privacy management system, a decentralised web application was created. This demo simulates how the system implements a differential privacy algorithm while tracking and reducing the privacy cost incurred. Fig. 1 shows the overall user interface of the blockchain-based demo.
3.1. Ethereum account information
The user’s Ethereum account information is displayed on a bar fixed at the top of the page once the user has granted permission to the demo. The demo obtains the account information (the wallet address and its balance) from the MetaMask browser extension. When the demo connects to MetaMask, a pop-up managed by MetaMask appears, asking the user for permission to access the account. Account information is displayed only after permission is granted.
3.2. Defining differential privacy parameters
In the demo, users can specify the parameters used by the differential privacy algorithm ($\epsilon$ and $\delta$) using the input bar shown in Fig. 3. The $\epsilon$ value determines how strict the level of privacy is: the smaller the value, the better the privacy preservation. $\delta$ defines the level of relaxation of the $\epsilon$-differential privacy notion.
3.3. Queries selection
Users can select the type of query they wish to make by pressing the corresponding button. In the backend, each button is linked to a sensitivity level that is pre-calculated and assigned to the button according to its query type. The differential privacy algorithm scales the generated noise with the sensitivity of the query function. This sensitivity level is the maximum distance between the true query results for any two neighboring datasets that differ by one record. The sensitivity level is calculated as:
$$\Delta_Q = \max_{\text{neighboring } D,\, D'} \lVert Q(D) - Q(D') \rVert_2,$$
where $\Delta_Q$ is the sensitivity level of the query $Q$, and $D$ and $D'$ are neighboring datasets that differ by one record.
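To make the link between sensitivity and noise scale concrete, the sketch below calibrates Gaussian noise with the classic textbook bound $\sigma = \Delta_Q \sqrt{2 \ln(1.25/\delta)} / \epsilon$, which holds for $0 < \epsilon < 1$. This calibration is an assumption made for illustration; the text does not state which formula the demo uses, and the function names are hypothetical.

```python
import math
import random


def classic_gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise standard deviation from the classic Gaussian-mechanism bound.

    Assumption: sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    the textbook calibration, valid only for 0 < epsilon < 1. The demo may
    use a different (possibly tighter) calibration.
    """
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this bound requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def add_gaussian_noise(true_answer: float, sigma: float, rng: random.Random) -> float:
    """Return the true answer plus zero-mean Gaussian noise of std sigma."""
    return true_answer + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    # A counting query changes by at most 1 when one record changes,
    # so its sensitivity is 1.
    sigma = classic_gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
    rng = random.Random(42)  # seeded only to make the illustration repeatable
    print("sigma:", sigma)
    print("noisy count:", add_gaussian_noise(120.0, sigma, rng))
```

Doubling the sensitivity doubles the standard deviation, while a larger $\epsilon$ (weaker privacy) reduces it.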
3.4. Reuse of previous Gaussian noise
The algorithm works by comparing standard deviations and follows the general workflow in Fig. 5.
Based on the privacy parameters from the user, a value $\sigma$ is calculated. This is the standard deviation of the noise required for the query. To answer a query $Q$ with sensitivity $\Delta_Q$, zero-mean Gaussian noise with standard deviation $\sigma$ is added to the true query result. Since zero-mean Gaussian noise is characterised by its standard deviation, the proposed system stores the standard deviation and uses it as the basis of comparison for reusing noise.
The system retrieves all previous transactions from the blockchain and compares them with the new query. It first checks the blockchain for an existing record with the same query type and the same standard deviation. If such a record is found, the algorithm returns the stored result as the output of differential privacy. If no such record is found, it compares the standard deviation of the new query with those of previous queries. The algorithm reuses Gaussian noise by adding further noise to a previous noise so that the resulting noise fulfils the new privacy requirements. Full reuse is therefore possible if the new standard deviation is larger than the minimum standard deviation of all previous queries of the same type; the algorithm then calculates the new noise to be added to the previous result and derives the new result. If the new standard deviation is smaller than that minimum, no previous noise can be fully reused: the algorithm can reuse only a fraction of a previous noise and computes the additional noise to be added. The query is then forwarded to the server, which adds the computed noise to the partially reused noisy response.
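The decision logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the demo's implementation: `Record`, `plan_reuse` and `answer` are hypothetical names, the history list stands in for transactions read from the blockchain, and the weight used for partial reuse (the ratio of the new variance to the previous variance) is one possible choice that the text does not spell out.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for one stored query transaction."""
    query_type: str
    sigma: float         # std of the noise contained in noisy_answer
    noisy_answer: float


def plan_reuse(query_type: str, sigma_new: float,
               history: List[Record]) -> Tuple[str, Optional[Record], float, float]:
    """Return (case, base_record, weight_on_old_noise, extra_noise_std)."""
    same_type = [r for r in history if r.query_type == query_type]
    for r in same_type:
        if math.isclose(r.sigma, sigma_new):
            return "exact", r, 1.0, 0.0            # return the stored answer
    if not same_type:
        return "fresh", None, 0.0, sigma_new       # nothing to reuse
    base = min(same_type, key=lambda r: r.sigma)
    if sigma_new > base.sigma:
        # Full reuse: variances add, so top up to the requested variance.
        return "full", base, 1.0, math.sqrt(sigma_new ** 2 - base.sigma ** 2)
    # Partial reuse (assumed weighting): keep a fraction w of the old noise
    # and add fresh noise so the total variance equals sigma_new ** 2.
    w = sigma_new ** 2 / base.sigma ** 2
    return "partial", base, w, math.sqrt(sigma_new ** 2 * (1.0 - w))


def answer(query_type: str, sigma_new: float, history: List[Record],
           true_answer: float, rng: random.Random) -> Tuple[str, float]:
    """Answer a query, reusing stored noise where the plan allows it."""
    case, base, w, extra = plan_reuse(query_type, sigma_new, history)
    noise = rng.gauss(0.0, extra) if extra > 0 else 0.0
    if case == "exact":
        return case, base.noisy_answer
    if case == "full":
        result = base.noisy_answer + noise         # true data is not needed
    elif case == "partial":
        # Equals true_answer + w * old_noise + fresh noise; needs the server.
        result = w * base.noisy_answer + (1.0 - w) * true_answer + noise
    else:
        result = true_answer + noise
    history.append(Record(query_type, sigma_new, result))
    return case, result


if __name__ == "__main__":
    log = [Record("count", 4.0, 103.0)]
    rng = random.Random(7)  # seeded only to make the illustration repeatable
    for s in (4.0, 5.0, 2.0):
        print(s, answer("count", s, log, true_answer=100.0, rng=rng))
```

In the full-reuse branch the new answer is derived from the stored noisy answer alone, which is why it needs no access to the true data; the partial-reuse branch touches the true answer and therefore has to run on the server.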
3.5. Output display
This demo displays the output of each query in a card format, as shown in Fig. 6. A card pops up once the output is available. The header of the card contains a query ID generated for the submitted query. The body contains the query type that was requested, the noisy response generated by the application, the calculated standard deviation ($\sigma$), the blockchain price, the privacy cost, and the remaining privacy budget. The privacy cost displayed is the $\epsilon$ privacy cost.
The proposed system is built with Bootstrap, Ethereum, MetaMask, Web3.js, Truffle Suite, Provable, MongoDB and Heroku. Fig. 7 shows the architectural diagram of the proposed system.
The DApp consists of a frontend that runs in the client browser and a backend made up of a decentralised platform, Ethereum, and a MongoDB database on the hosted server. Truffle Suite and Provable are used for the development of the smart contract. For the frontend, the index.html and app.css files define what the webpage displays to the user, and Bootstrap is used for responsive interaction. The frontend also contains an app.js file that maps items from index.html, interacts with Web3.js, performs the calculations and passes the results to the blockchain.
In this paper, a simple DApp demo is developed to illustrate the use of blockchain to reuse noisy responses previously generated by a differential privacy algorithm. In the future, this demo can be improved by adding graphical elements that better show the effects of the system, such as the increase in the number of queries that a user can submit without exceeding the privacy budget. The demo can also be improved by analysing and quantifying the extent of privacy preservation achieved with the reuse of noise.
- Craig and Ludloff (2011). Privacy and big data: the players, regulators, and stakeholders. O’Reilly Media, Inc.
- Cuppens et al. (2019). Optimal distribution of privacy budget in differential privacy. In Risks and Security of Internet and Systems: 13th International Conference, CRiSIS 2018, Arcachon, France, October 16-18, 2018, Revised Selected Papers, Vol. 11391, pp. 222.
- Dwork et al. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
- Jia et al. (2019). Database query system with budget option for differential privacy against repeated attacks. In International Conference on Security and Privacy in New Computing Environments, pp. 46–57.
- Kosba et al. (2016). Hawk: the blockchain model of cryptography and privacy-preserving smart contracts. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 839–858.
- Narayanan and Shmatikov (2008). Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 111–125.
- Sultan et al. (2018). Conceptualizing blockchains: characteristics & applications. arXiv preprint arXiv:1806.03693.
- Xiao et al. (2011). iReduct: differential privacy with reduced relative errors. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 229–240.
- Zyskind et al. (2015). Decentralizing privacy: using blockchain to protect personal data. In 2015 IEEE Security and Privacy Workshops, pp. 180–184.