1 Introduction
According to the report “Data Never Sleeps 6.0” published recently by Domo Inc., an estimated 1.7 MB of data will be created every second for each person on earth by 2020. The owners of this staggering amount of data sometimes provide it readily to others, but often hold back despite the value that data trading could provide. Both privacy concerns and the desire to monetize data at a fair market value are barriers, as both could be compromised if data are revealed before terms have been negotiated. A method to assess the value of a data trade without first revealing the data would help make data trading a more efficient transaction, whether the aim is to trade at a fair market price, apply some type of differential privacy, or both.
Finding business value in ‘distributed’ data:
When data on different aspects of a system are captured by different stakeholders, trading the data can provide a more complete perspective of the system. For instance, in an Internet-of-Things (IoT) ecosystem, IoT devices owned by different parties (manufacturers, service providers, consumers, etc.) often collect data that reveal only a partial understanding of behaviors and events. Creating a marketplace for trading the data would enable a party to get a more complete understanding when required, without spending extra time and money deploying additional IoT devices to collect data that another party already has. As long as stakeholders can establish a fair price for the data, inefficient duplication of efforts can be avoided, benefiting both parties of a transaction. However, identifying trade partners and tagging a cash value to the data can be a tricky challenge, particularly because the value depends on the quality and content of the data held by both partners.
Maximizing data utility while protecting individual privacy:
When considering how to share sensitive datasets, potential collaborators may seek to analyze how different statistical privacy options affect the utility of data. The party applying statistical privacy to their data before sharing may like to work with a potential collaborator to experiment with different choices of statistical privacy methods and parameters, in order to deliver desensitized data of the highest possible utility. Applications include both business-to-business transactions and business-to-government transactions.
Data trading scenarios:
An owner of a dataset may want to release only subsets of their data to control proliferation, but they need a way to determine utility of subsets in order to choose the right one for each potential collaborator. An owner may also want to limit the number of times data are shared, either to mitigate security and privacy risks or to maintain a desired monetary price for access to the data. Choosing customers that have the highest utility for the data will help maximize monetary return, as those customers will in principle pay a higher price. An owner may want to sell access to data at a full valuebased price, but rational purchasers may insist on a discounted price to compensate for any risk associated with uncertain utility. Thus, answering the following question is important:
How can one securely measure utility of data and the impact of applying statistical privacy enhancement techniques, without access to the actual data?
1.1 This Work
In this work, we try to answer the above question for a specific potential acquirer's task, where the parties freely share data dictionaries. Specifically, we provide a protocol with which a potential provider and a potential acquirer can determine the value of the data with respect to the latter's task at hand, without the latter learning anything more about the data than its specification in the data dictionary. The specific subcase we consider is the provider having a binary feature vector and the acquirer having a binary class vector. The acquirer would like to learn whether the provider's feature vector can improve the correctness of the acquirer's classification. Thus, the utility we consider is whether the data shared by the provider is expected to improve the classification of the acquirer's existing dataset. To quantify utility, we use the $\chi^2$ statistic studied by Yang and Pedersen (1997) for the related problem of feature selection. We employ Paillier homomorphic encryption for the required privacy-preserving computations.
1.2 Roadmap
The protocols in this paper assume parties share primary keys for their data, in order for data elements to be aligned. In future work, we will integrate private set intersection protocols, such as the Practical Private Set Intersection Protocols published by De Cristofaro and Tsudik [de2010practical], in order to relax this assumption. We also plan to study extensions of the work to more sophisticated feature selection, based on combining multiple columns in the provider’s dataset to generate more complex feature candidates.
2 Background
In this work, we consider a structured dataset, and we are interested in classification based on all the features available. Specifically, we consider two parties, Carol and Felix. Carol has a dataset consisting of certain feature columns and a class vector generated from her available features. Felix possesses an additional feature column that might be useful for Carol in improving the classification of her dataset.
Notations. Let $\mathbf{c} = (c_1, \ldots, c_n)$ be the class label vector with Carol, and $\mathbf{f} = (f_1, \ldots, f_n)$ be the feature vector with Felix. We assume both the class labels and the features are binary attributes, leaving generalization to multinomial classifiers for a future paper. That is, $c_i \in \{0, 1\}$ and $f_i \in \{0, 1\}$ for all $i \in [n]$. Here $c_i$ denotes the class variable of the $i$th record in Carol's dataset, and $f_i$ denotes the feature value, in Felix's feature vector, corresponding to the $i$th record in Carol's dataset.
2.1 Feature Selection
Feature selection is the process of removing non-informative features and selecting a subset of features that are useful to build a good predictor [guyon2003introduction]. The criteria for feature selection vary among applications. For example, Pearson correlation coefficients are often used to detect dependencies in linear regressions, while mutual information and $\chi^2$ statistics are commonly used to rank discrete or nominal features [guyon2003introduction, yang1997comparative].
In this paper, we focus on determining the utility of binary features. We choose the $\chi^2$ statistic as our measure of utility, due to its wide applicability and its amenability to cryptographic tools. More specifically, unlike mutual information, which involves logarithmic computations, the calculation of the $\chi^2$ statistic involves only additions and multiplications.
For the class label vector $\mathbf{c}$ and the corresponding feature vector $\mathbf{f}$, $n_{11}$ is defined to be the number of rows with $f_i = 1$ and $c_i = 1$; $n_{10}$ is defined to be the number of rows with $f_i = 1$ and $c_i = 0$; $n_{01}$ is defined to be the number of rows with $f_i = 0$ and $c_i = 1$; and $n_{00}$ is defined to be the number of rows with $f_i = 0$ and $c_i = 0$. Table 1 shows the two-way contingency table for $\mathbf{f}$ and $\mathbf{c}$.

Table 1: Two-way contingency table of $\mathbf{f}$ and $\mathbf{c}$.
           | $c_i = 0$ | $c_i = 1$
$f_i = 0$  | $n_{00}$  | $n_{01}$
$f_i = 1$  | $n_{10}$  | $n_{11}$

The $\chi^2$ statistic of $\mathbf{f}$ and $\mathbf{c}$ is defined [yang1997comparative] to be:
$$\chi^2(\mathbf{f}, \mathbf{c}) = \frac{n \, (n_{11} n_{00} - n_{01} n_{10})^2}{(n_{11}+n_{01})(n_{10}+n_{00})(n_{11}+n_{10})(n_{01}+n_{00})}.$$
The $\chi^2$ statistic is used to test the independence of $\mathbf{f}$ and $\mathbf{c}$. Table 2 shows the confidence of rejecting the independence hypothesis under different $\chi^2$ values. For example, when $\chi^2(\mathbf{f}, \mathbf{c})$ is larger than $10.83$, the independence hypothesis can be rejected with more than 99.9% confidence, indicating that the feature vector $\mathbf{f}$ is very likely to be correlated with the class label vector $\mathbf{c}$.
Table 2: Confidence of rejecting the independence hypothesis at different values of $\chi^2(\mathbf{f}, \mathbf{c})$.
$\chi^2(\mathbf{f}, \mathbf{c})$ | Confidence
10.83 | 99.9%
7.88  | 99.5%
6.63  | 99%
3.84  | 95%
2.71  | 90%
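As a concrete check of the definitions above, the following sketch (in Python, with illustrative names of our choosing) computes the contingency counts and the $\chi^2$ statistic for two binary vectors; note that, apart from the final division, it uses only additions and multiplications.

```python
def chi2_binary(f, c):
    """Chi-square statistic for a binary feature vector f and class vector c."""
    assert len(f) == len(c)
    n = len(f)
    # Contingency counts: n_xy = number of rows with f_i = x and c_i = y.
    n11 = sum(1 for fi, ci in zip(f, c) if fi == 1 and ci == 1)
    n10 = sum(1 for fi, ci in zip(f, c) if fi == 1 and ci == 0)
    n01 = sum(1 for fi, ci in zip(f, c) if fi == 0 and ci == 1)
    n00 = n - n11 - n10 - n01
    num = n * (n11 * n00 - n01 * n10) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den

# A perfectly predictive feature: 4.0 > 3.84, so independence is
# rejected with more than 95% confidence (Table 2).
print(chi2_binary([1, 1, 0, 0], [1, 1, 0, 0]))  # 4.0
```

An uncorrelated feature, e.g. `chi2_binary([1, 0, 1, 0], [1, 1, 0, 0])`, yields 0.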
2.2 Cryptographic Tools
2.2.1 PKE scheme and CPA security.
We recall the standard definitions of public-key encryption (PKE) schemes and chosen-plaintext attack (CPA) security, which are used in this paper.
PKE schemes. A scheme with message space $\mathcal{M}$ consists of three probabilistic polynomial-time (PPT) algorithms $(\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$. The key generation algorithm $\mathsf{Gen}(1^\lambda)$ outputs a public key $pk$ and a secret key $sk$. The encryption algorithm $\mathsf{Enc}$ takes $pk$ and a message $m \in \mathcal{M}$, and outputs a ciphertext $c$. The decryption algorithm $\mathsf{Dec}$ takes $sk$ and a ciphertext $c$, and outputs a message $m$. For correctness, we require that $\mathsf{Dec}_{sk}(\mathsf{Enc}_{pk}(m)) = m$ for all $(pk, sk)$ output by $\mathsf{Gen}(1^\lambda)$, all $m \in \mathcal{M}$, and all random choices made by $\mathsf{Enc}$.
Negligible Function. A function $f$ is negligible if for every positive integer $c$ there exists an integer $N_c$ such that for all $n > N_c$, $f(n) < 1/n^c$. We denote negligible functions by $\mathsf{negl}$.
The CPA Experiment. We now describe the chosen-plaintext attack (CPA) experiment $\mathsf{PubK}^{\mathsf{cpa}}_{\mathcal{A},\Pi}(\lambda)$ with an adversary $\mathcal{A}$ against a PKE scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$: the challenger runs $\mathsf{Gen}(1^\lambda)$ and gives $pk$ to $\mathcal{A}$; $\mathcal{A}$ outputs two equal-length messages $m_0, m_1 \in \mathcal{M}$; the challenger picks a uniform bit $b$ and returns the challenge ciphertext $\mathsf{Enc}_{pk}(m_b)$; finally $\mathcal{A}$ outputs a guess $b'$, and the experiment evaluates to 1 if $b' = b$.
CPA Security [katz2014introduction]. A PKE scheme $\Pi$ has indistinguishable encryptions under a chosen-plaintext attack, or is CPA-secure, if for all probabilistic polynomial-time adversaries $\mathcal{A}$ there is a negligible function $\mathsf{negl}$ such that
$$\Pr\big[\mathsf{PubK}^{\mathsf{cpa}}_{\mathcal{A},\Pi}(\lambda) = 1\big] \le \frac{1}{2} + \mathsf{negl}(\lambda),$$
where the experiment is defined in Algorithm 1, and the probability is taken over the randomness of $\mathcal{A}$ and of the experiment.
2.2.2 Paillier Encryption.
We use Paillier encryption to maintain privacy in our two-party feature selection algorithm, and employ the additive homomorphic property of Paillier encryption to calculate the $\chi^2$ statistic that quantifies feature utility. We recall the Paillier encryption scheme in Figure 1 [katz2014introduction].
Note that while we use Paillier homomorphic encryption, the proposed protocols can accommodate any semantically secure additively homomorphic encryption scheme.
Paillier Encryption Scheme. Let $\mathsf{GenModulus}$ be a polynomial-time algorithm that, on input $1^\lambda$, outputs $(N, p, q)$ where $N = pq$ and $p$ and $q$ are $\lambda$-bit primes (except that $p$ or $q$ is not prime with probability negligible in $\lambda$). Define the following encryption scheme:

- $\mathsf{Gen}$: on input $1^\lambda$, run $\mathsf{GenModulus}(1^\lambda)$ to obtain $(N, p, q)$. The public key is $N$, and the private key is $(N, \phi(N))$, where $\phi(N) = (p-1)(q-1)$.

- $\mathsf{Enc}$: on input of a public key $N$ and a message $m \in \mathbb{Z}_N$, choose a uniformly random $r \in \mathbb{Z}_N^*$ and output the ciphertext
$$c := \big[(1+N)^m \cdot r^N \bmod N^2\big].$$

- $\mathsf{Dec}$: on input of a private key $(N, \phi(N))$ and a ciphertext $c$, compute
$$m := \left[\frac{\big[c^{\phi(N)} \bmod N^2\big] - 1}{N} \cdot \phi(N)^{-1} \bmod N\right].$$
Paillier encryption supports additive homomorphism and scalar multiplication homomorphism. We briefly recall the definitions of both [katz2014introduction].
Additive Homomorphism. A PKE scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$ is said to be additively homomorphic if there exists a binary operation $\oplus$ such that the following holds for all $(pk, sk)$ output by $\mathsf{Gen}(1^\lambda)$ and for all $m_1, m_2 \in \mathcal{M}$:
$$\mathsf{Dec}_{sk}\big(\mathsf{Enc}_{pk}(m_1) \oplus \mathsf{Enc}_{pk}(m_2)\big) = m_1 + m_2.$$
Scalar Multiplication Homomorphism. A PKE scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$ is said to be scalar-multiplication homomorphic if there exists an operation $\odot$ such that the following holds for all $(pk, sk)$ output by $\mathsf{Gen}(1^\lambda)$, for all $m \in \mathcal{M}$, and for all scalars $k$:
$$\mathsf{Dec}_{sk}\big(k \odot \mathsf{Enc}_{pk}(m)\big) = k \cdot m.$$
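To illustrate these two properties, here is a minimal toy implementation of the scheme above (function names and the tiny primes are ours; the parameters are far too small to be secure and serve only to demonstrate the homomorphisms):

```python
import random
from math import gcd

def keygen(p=17, q=19):
    # Toy primes for illustration only; real Paillier uses ~1024-bit primes.
    N = p * q
    phi = (p - 1) * (q - 1)
    return N, (N, phi)              # public key N, secret key (N, phi(N))

def encrypt(N, m):
    # Enc(m) = (1+N)^m * r^N mod N^2 for uniform r in Z_N^*.
    N2 = N * N
    r = random.randrange(1, N)
    while gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2

def decrypt(sk, ct):
    # Dec(c) = (([c^phi mod N^2] - 1) / N) * phi^{-1} mod N.
    N, phi = sk
    x = pow(ct, phi, N * N)
    return ((x - 1) // N) * pow(phi, -1, N) % N

pk, sk = keygen()
c1, c2 = encrypt(pk, 5), encrypt(pk, 7)
print(decrypt(sk, (c1 * c2) % (pk * pk)))  # 12: ciphertext product adds plaintexts
print(decrypt(sk, pow(c1, 3, pk * pk)))    # 15: exponentiation scales the plaintext
```

Here the operation $\oplus$ is multiplication of ciphertexts modulo $N^2$, and $k \odot c$ is $c^k \bmod N^2$. (The three-argument `pow` with exponent $-1$ requires Python 3.8+.)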
3 Proof of Privacy
We first present the high-level argument for how our protocols will protect each party's data. We have one of the parties (Carol) choose the encryption key, and encrypt her data using this key before sending it to the other party (Felix). Thus, Carol's privacy will be guaranteed by the semantic security assumption of the encryption scheme. Meanwhile, Felix will also encrypt his data using Carol's key, but he will blind all of the outputs he sends to Carol with randomness of his choosing, ensuring that Carol can learn nothing about his data. We now make these notions precise by first providing a formal definition of privacy protection in the honest-but-curious adversary model, and then a formal proof of privacy for the protocol that protects privacy in the manner described above.
Definition 1 (Honest-but-curious security of a two-party protocol)
We begin with the following notation:

Let $f_1$ and $f_2$ be probabilistic polynomial-time functionalities and let $\Pi$ be a two-party protocol for computing $f = (f_1, f_2)$. Let the parties be $P_1, P_2$, with inputs $x_1, x_2$, respectively.

The view of the party $P_i$ during an execution of $\Pi$ on $(x_1, x_2)$ and security parameter $\lambda$ is denoted by $\mathsf{view}^{\Pi}_{i}(x_1, x_2, \lambda)$ and equals $(x_i, r^i, m^i_1, \ldots, m^i_t)$, where $x_i$ is $P_i$'s input ($P_i$'s value depending on the value of $i$), $r^i$ equals the contents of the party $P_i$'s internal random tape, and $m^i_j$ represents the $j$th message that it received.

The output of the party $P_i$ during an execution of $\Pi$ on $(x_1, x_2)$ and security parameter $\lambda$ is denoted by $\mathsf{output}^{\Pi}_{i}(x_1, x_2, \lambda)$ and can be computed from its own view of the execution.
Let $f = (f_1, f_2)$ be a functionality. We say that $\Pi$ securely computes $f$ in the presence of semi-honest adversaries if there exist probabilistic polynomial-time algorithms $S_1$ and $S_2$ such that
$$\{S_1(1^\lambda, x_1, f_1(x_1, x_2))\} \stackrel{c}{\equiv} \{\mathsf{view}^{\Pi}_{1}(x_1, x_2, \lambda)\}, \qquad (1)$$
$$\{S_2(1^\lambda, x_2, f_2(x_1, x_2))\} \stackrel{c}{\equiv} \{\mathsf{view}^{\Pi}_{2}(x_1, x_2, \lambda)\}, \qquad (2)$$
where $x_1, x_2 \in \{0,1\}^*$ such that $|x_1| = |x_2|$, and $\stackrel{c}{\equiv}$ denotes computational indistinguishability.
4 Protocol
In this section, we describe a four-round protocol for computing the $\chi^2$ statistic in a two-party setting. For convenience, we continue to refer to the parties as Carol, who has the class vector $\mathbf{c}$, and Felix, who has the feature vector $\mathbf{f}$. Carol's objective is to learn $\chi^2(\mathbf{f}, \mathbf{c})$, and Felix's objective is to not reveal any further information about $\mathbf{f}$ while Carol computes the utility of Felix's data for her classifier. In this section, Felix uses multiplicative blinding to keep the detailed mathematics a little simpler, but an alternative protocol that uses additive blinding is provided in Section 6 for situations where the security of multiplicative blinding is a concern.
As before, $n_{11}$ is the number of rows with $f_i = 1$ and $c_i = 1$; $n_{10}$ is the number of rows with $f_i = 1$ and $c_i = 0$; $n_{01}$ is the number of rows with $f_i = 0$ and $c_i = 1$; and $n_{00}$ is the number of rows with $f_i = 0$ and $c_i = 0$.
Round 1.
Carol performs the following operations:

- Generate a Paillier key pair $(pk, sk)$.

- Encrypt all class labels with $pk$: $\mathsf{Enc}_{pk}(c_1), \ldots, \mathsf{Enc}_{pk}(c_n)$.

- Compute $s^2$, where $s = n_{11} + n_{01}$. Note that Carol can obtain this value by computing $s = \sum_{i=1}^{n} c_i$, since $n_{11} + n_{01}$ counts exactly the rows with $c_i = 1$, based on the contingency table.

- Encrypt $s^2$ with $pk$: $\mathsf{Enc}_{pk}(s^2)$.

- Send the following values to Felix: $\mathsf{Enc}_{pk}(c_1), \ldots, \mathsf{Enc}_{pk}(c_n)$ and $\mathsf{Enc}_{pk}(s^2)$.
Round 2.
Felix performs the following operations:

- Compute $\mathsf{Enc}_{pk}(n_{11})$. Note that Felix can obtain this value by computing $\mathsf{Enc}_{pk}(n_{11}) = \bigoplus_{i:\, f_i = 1} \mathsf{Enc}_{pk}(c_i)$, since $n_{11} = \sum_{i:\, f_i = 1} c_i$.

- Sample $r \leftarrow \mathbb{Z}_N^*$, and compute $\mathsf{Enc}_{pk}(r \cdot n_{11}) = r \odot \mathsf{Enc}_{pk}(n_{11})$.

- Send the following value to Carol: $\mathsf{Enc}_{pk}(r \cdot n_{11})$.
Round 3.
Carol performs the following operations:

- Decrypt $\mathsf{Enc}_{pk}(r \cdot n_{11})$ using $sk$ to obtain $t = r \cdot n_{11}$.

- Compute $t^2$ and $t \cdot s$, and encrypt them: $\mathsf{Enc}_{pk}(t^2)$ and $\mathsf{Enc}_{pk}(t \cdot s)$.

- Send the following values to Felix: $\mathsf{Enc}_{pk}(t^2)$ and $\mathsf{Enc}_{pk}(t \cdot s)$.
Round 4.
Felix performs the following operations:

- Cancel $r$ by computing
$$\mathsf{Enc}_{pk}(n_{11}^2) = r^{-2} \odot \mathsf{Enc}_{pk}(t^2)$$
and
$$\mathsf{Enc}_{pk}(n_{11} \cdot s) = r^{-1} \odot \mathsf{Enc}_{pk}(t \cdot s),$$
where the inverses of $r$ are taken modulo $N$.

- Compute an encryption of $D^2$, where $D = n_{11} n_{00} - n_{10} n_{01}$, by computing:
$$\mathsf{Enc}_{pk}(D^2) = \big(n^2 \odot \mathsf{Enc}_{pk}(n_{11}^2)\big) \oplus \big(u_1 \odot \mathsf{Enc}_{pk}(n_{11} \cdot s)\big) \oplus \big(u_2 \odot \mathsf{Enc}_{pk}(s^2)\big),$$
where $u_1$ and $u_2$ are computed as
$$u_1 = -2n(n_{11} + n_{10}) \quad \text{and} \quad u_2 = (n_{11} + n_{10})^2,$$
noting that $n_{11} + n_{10} = \sum_{i} f_i$ is known to Felix. We see below that the above computation gives $\mathsf{Enc}_{pk}(D^2)$. Since $n_{00} = n - n_{11} - n_{10} - n_{01}$, $D$ can be decomposed as follows:
$$D = n_{11} n_{00} - n_{10} n_{01} = n \cdot n_{11} - (n_{11} + n_{10})(n_{11} + n_{01}) = n \cdot n_{11} - (n_{11} + n_{10}) \cdot s,$$
so $D^2 = n^2 n_{11}^2 - 2n(n_{11} + n_{10}) \cdot n_{11} s + (n_{11} + n_{10})^2 s^2$.

- Send the following value to Carol: $\mathsf{Enc}_{pk}(D^2)$.
Local computation.
Carol decrypts $\mathsf{Enc}_{pk}(D^2)$ using $sk$ to obtain $D^2 = (n_{11} n_{00} - n_{10} n_{01})^2$.
Remark 1
We note that only Carol receives the decrypted result. Depending on the application, if Felix also needs to know the value of $\chi^2(\mathbf{f}, \mathbf{c})$, Carol can simply send it to him after running the protocol.
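The four rounds above can be exercised end-to-end. The following toy simulation is our own sketch of the round structure under the notation above: all function and variable names are ours, the Paillier parameters are far too small to be secure, and it checks that Carol recovers $(n_{11} n_{00} - n_{10} n_{01})^2$ while only ever seeing the blinded count $r \cdot n_{11}$.

```python
import random
from math import gcd

# Toy Paillier helpers (insecure parameters, illustration only).
def keygen(p=1009, q=1013):
    N = p * q
    return N, (N, (p - 1) * (q - 1))

def encrypt(N, m):
    N2 = N * N
    r = random.randrange(1, N)
    while gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2

def decrypt(sk, ct):
    N, phi = sk
    x = pow(ct, phi, N * N)
    return ((x - 1) // N) * pow(phi, -1, N) % N

# Toy inputs: Carol's class vector c, Felix's feature vector f.
c = [1, 1, 0, 1, 0, 0]
f = [1, 1, 1, 0, 0, 0]
n = len(c)

pk, sk = keygen()
N, N2 = pk, pk * pk

# Round 1 (Carol): encrypted labels plus Enc(s^2), s = n11 + n01 = sum(c).
enc_c = [encrypt(pk, ci) for ci in c]
s = sum(c)
enc_s2 = encrypt(pk, s * s)

# Round 2 (Felix): Enc(n11) as a homomorphic sum over rows with f_i = 1,
# blinded multiplicatively by a random invertible r.
enc_n11 = 1
for fi, enc_ci in zip(f, enc_c):
    if fi == 1:
        enc_n11 = enc_n11 * enc_ci % N2
r = random.randrange(2, N)
while gcd(r, N) != 1:
    r = random.randrange(2, N)
enc_rn11 = pow(enc_n11, r, N2)          # Enc(r * n11)

# Round 3 (Carol): decrypt the blinded count, square and cross-multiply.
t = decrypt(sk, enc_rn11)               # t = r * n11 mod N
enc_t2 = encrypt(pk, t * t % N)
enc_ts = encrypt(pk, t * s % N)

# Round 4 (Felix): cancel r, then assemble Enc(D^2) using
# D = n11*n00 - n10*n01 = n*n11 - (n11 + n10)*s.
r_inv = pow(r, -1, N)
enc_n11sq = pow(enc_t2, r_inv * r_inv % N, N2)
enc_n11s = pow(enc_ts, r_inv, N2)
k = sum(f)                              # k = n11 + n10, known to Felix
enc_D2 = (pow(enc_n11sq, n * n, N2)
          * pow(enc_n11s, (-2 * n * k) % N, N2)
          * pow(enc_s2, k * k % N, N2)) % N2

# Local computation: Carol decrypts D^2.
print(decrypt(sk, enc_D2))              # 9, since D = 6*2 - 3*3 = 3
```

Here $n_{11} = 2$, $n_{10} = 1$, $n_{01} = 1$, $n_{00} = 2$, so $D = 2 \cdot 2 - 1 \cdot 1 = 3$ and Carol recovers $D^2 = 9$; all plaintext arithmetic is modulo $N$, which is valid as long as $D^2 < N$.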