Secure Two-Party Feature Selection

by   Vanishree Rao, et al.

In this work, we study how to securely evaluate the value of trading data without requiring a trusted third party. We focus on the important machine learning task of classification. This leads us to propose a provably secure four-round protocol that computes the value of the data to be traded without revealing the data to the potential acquirer. The theoretical results demonstrate a number of important properties of the proposed protocol. In particular, we prove the security of the proposed protocol in the honest-but-curious adversary model.




1 Introduction

According to the report “Data Never Sleeps 6.0” published recently by Domo Inc., an estimated 1.7 MB of data will be created every second for each person on earth by 2020. The owners of this staggering amount of data sometimes provide it readily to others, but often hold back despite the value that data trading could provide. Both privacy concerns and the desire to monetize data at a fair market value are barriers, as both could be compromised if data are revealed before terms have been negotiated. A method to assess the value of a data trade without first revealing the data would help make data trading a more efficient transaction, whether the aim is to trade at a fair market price, apply some type of differential privacy, or both.

Finding business value in ‘distributed’ data:

When data on different aspects of a system are captured by different stakeholders, trading the data can provide a more complete perspective of the system. For instance, in an Internet-of-Things (IoT) ecosystem, IoT devices owned by different parties (manufacturers, service providers, consumers, etc.) often collect data that reveal only a partial understanding of behaviors and events. Creating a marketplace for trading the data would enable a party to get a more complete understanding when required, without spending extra time and money deploying additional IoT devices to collect data that another party already has. As long as stakeholders can establish a fair price for the data, inefficient duplication of efforts can be avoided, benefiting both parties of a transaction. However, identifying trade partners and tagging a cash value to the data can be a tricky challenge, particularly because the value depends on the quality and content of the data held by both partners.

Maximizing data utility while protecting individual privacy:

When considering how to share sensitive datasets, potential collaborators may seek to analyze how different statistical privacy options affect the utility of data. The party applying statistical privacy to their data before sharing may like to work with a potential collaborator to experiment with different choices of statistical privacy methods and parameters, in order to deliver desensitized data of the highest possible utility. Applications include both business-to-business transactions and business-to-government transactions.

Data trading scenarios:

An owner of a dataset may want to release only subsets of their data to control proliferation, but they need a way to determine utility of subsets in order to choose the right one for each potential collaborator. An owner may also want to limit the number of times data are shared, either to mitigate security and privacy risks or to maintain a desired monetary price for access to the data. Choosing customers that have the highest utility for the data will help maximize monetary return, as those customers will in principle pay a higher price. An owner may want to sell access to data at a full value-based price, but rational purchasers may insist on a discounted price to compensate for any risk associated with uncertain utility. Thus, answering the following question is important:

How can one securely measure utility of data and the impact of applying statistical privacy enhancement techniques, without access to the actual data?

1.1 This Work

In this work, we try to answer the above question for a specific potential acquirer’s task, where the parties freely share data dictionaries. Specifically, we provide a protocol with which a potential provider and a potential acquirer can determine the value of the data with respect to the latter’s task at hand, without the latter learning anything more about the data, other than its specification in the data dictionary. The specific sub-case we consider is the provider having a binary feature vector and the acquirer having a binary class vector. The acquirer would like to learn if the provider’s feature vector can improve the correctness of the acquirer’s classification. Thus, the utility we consider is whether the data shared by the provider is expected to improve the classification of the acquirer’s existing dataset. To quantify utility, we use the $\chi^2$-statistic studied by Yang and Pedersen (1997) for the related problem of feature selection. We employ Paillier homomorphic encryption for the required privacy-preserving computations.

1.2 Roadmap

The protocols in this paper assume parties share primary keys for their data, in order for data elements to be aligned. In future work, we will integrate private set intersection protocols, such as the Practical Private Set Intersection Protocols published by De Cristofaro and Tsudik [de2010practical], in order to relax this assumption. We also plan to study extensions of the work to more sophisticated feature selection, based on combining multiple columns in the provider’s dataset to generate more complex feature candidates.

2 Background

In this work, we consider a structured dataset, and we are interested in classification based on all the features available. Specifically, we consider two parties, Carol and Felix. Carol has a dataset consisting of certain feature columns and a class vector generated from her available features. Felix possesses an additional feature column that might be useful for Carol in improving the classification of her dataset.

Notations.  Let $c = (c_1, \dots, c_n)$ be the class label vector held by Carol, and $f = (f_1, \dots, f_n)$ be the feature vector held by Felix. We assume both the class labels and the features are binary attributes, leaving generalization to multinomial classifiers for a future paper. That is, for all $i \in \{1, \dots, n\}$, $c_i \in \{0, 1\}$ and $f_i \in \{0, 1\}$. Let $c_i$ denote the class variable of the $i$-th record in Carol’s dataset. Let $f_i$ be the feature value, in Felix’s feature vector, corresponding to the $i$-th record in Carol’s dataset.

2.1 Feature Selection

Feature selection is the process of removing non-informative features and selecting a subset of features that are useful to build a good predictor [guyon2003introduction]. The criteria for feature selection vary among applications. For example, Pearson correlation coefficients are often used to detect dependencies in linear regressions, and mutual information and $\chi^2$ statistics are commonly used to rank discrete or nominal features [guyon2003introduction, yang1997comparative].

In this paper, we focus on determining the utility of binary features. We choose the $\chi^2$ statistic as a measure of utility, due to its wide applicability and its amenability to cryptographic tools. More specifically, unlike mutual information, which involves logarithmic computations, the calculation of the $\chi^2$ statistic only involves additions and multiplications.

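For the class label vector $c$ and the corresponding feature vector $f$, $A$ is defined to be the number of rows with $f_i = 0$ and $c_i = 0$. $B$ is defined to be the number of rows with $f_i = 0$ and $c_i = 1$. $C$ is defined to be the number of rows with $f_i = 1$ and $c_i = 0$. $D$ is defined to be the number of rows with $f_i = 1$ and $c_i = 1$. Table 1 shows the two-way contingency table for $f$ and $c$. The $\chi^2$ statistic of $f$ and $c$ is defined [yang1997comparative] to be:

$$\chi^2(f, c) = \frac{n\,(AD - BC)^2}{(A + C)(A + B)(B + D)(C + D)}$$

             $c_i = 0$   $c_i = 1$
$f_i = 0$       $A$         $B$
$f_i = 1$       $C$         $D$
Table 1: Two-Way Contingency Table of $f$ and $c$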

$\chi^2(f, c)$ is used to test the independence of $f$ and $c$. Table 2 shows the confidence of rejecting the independence hypothesis under different $\chi^2$ values. For example, when $\chi^2(f, c)$ is larger than 10.83, the independence hypothesis can be rejected with more than 99.9% confidence, indicating that the feature vector $f$ is very likely to be correlated with the class label vector $c$.

$\chi^2(f, c)$   Confidence
10.83            99.9%
7.88             99.5%
6.63             99%
3.84             95%
2.71             90%
Table 2: Confidence of Rejecting the Hypothesis of Independence under Different $\chi^2$ Values
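As a plaintext reference point (no cryptography involved), the statistic and the Table 2 lookup can be sketched as below. The function names are hypothetical, and the cell labeling ($A$: $f_i = 0, c_i = 0$; $B$: $f_i = 0, c_i = 1$; $C$: $f_i = 1, c_i = 0$; $D$: $f_i = 1, c_i = 1$) is an assumption of this sketch.

```python
from typing import Optional, Sequence, Tuple

# Thresholds copied from Table 2: (chi-square value, confidence of rejection).
CRITICAL_VALUES = [(10.83, 0.999), (7.88, 0.995), (6.63, 0.99), (3.84, 0.95), (2.71, 0.90)]


def contingency_counts(f: Sequence[int], c: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return the cell counts (A, B, C, D) for binary vectors f and c.

    Assumed labeling: A: f=0,c=0; B: f=0,c=1; C: f=1,c=0; D: f=1,c=1.
    """
    if len(f) != len(c):
        raise ValueError("f and c must have the same length")
    pairs = list(zip(f, c))
    A = pairs.count((0, 0))
    B = pairs.count((0, 1))
    C = pairs.count((1, 0))
    D = pairs.count((1, 1))
    return A, B, C, D


def chi_square(f: Sequence[int], c: Sequence[int]) -> float:
    """Chi-square statistic of a 2x2 contingency table."""
    A, B, C, D = contingency_counts(f, c)
    n = A + B + C + D
    denom = (A + C) * (A + B) * (B + D) * (C + D)
    if denom == 0:
        raise ValueError("statistic is undefined when a row or column total is zero")
    return n * (A * D - B * C) ** 2 / denom


def rejection_confidence(stat: float) -> Optional[float]:
    """Highest Table 2 confidence whose threshold is exceeded, or None."""
    for threshold, confidence in CRITICAL_VALUES:
        if stat > threshold:
            return confidence
    return None
```

Only integer additions and multiplications are needed before the final division, which is the property that makes the statistic convenient to evaluate under additively homomorphic encryption.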

2.2 Cryptographic Tools

2.2.1 PKE scheme and CPA security.

We recall the standard definitions of public-key encryption (PKE) schemes and chosen plaintext attack (CPA) security, which are used in this paper.

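PKE schemes.  A scheme $\Pi$ with message space $\mathcal{M}$ consists of three probabilistic polynomial-time (PPT) algorithms $(\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$. Key generation algorithm $\mathsf{Gen}(1^\lambda)$ outputs a public key $pk$ and a secret key $sk$. Encryption algorithm $\mathsf{Enc}(pk, m)$ takes $pk$ and a message $m \in \mathcal{M}$, and outputs a ciphertext $ct$. Decryption algorithm $\mathsf{Dec}(sk, ct)$ takes $sk$ and a ciphertext $ct$, and outputs a message $m$. For correctness, we require that $\mathsf{Dec}(sk, ct) = m$ for all $(pk, sk) \leftarrow \mathsf{Gen}(1^\lambda)$, all $m \in \mathcal{M}$, and all $ct \leftarrow \mathsf{Enc}(pk, m)$.

Negligible Function.  A function $\mu : \mathbb{N} \to \mathbb{R}$ is negligible if for every positive integer $k$, there exists an integer $N_k$ such that for all $\lambda > N_k$, $|\mu(\lambda)| < \lambda^{-k}$. We denote negligible functions as $\mathsf{negl}(\cdot)$.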

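The CPA Experiment.  We now describe the chosen-plaintext attack (CPA) game with an adversary $\mathcal{A}$ against a PKE scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$.

Input:  Security parameter $\lambda$
1:  $\mathsf{Gen}(1^\lambda)$ is run to obtain keys $(pk, sk)$
2:  The adversary $\mathcal{A}$ is given $1^\lambda$, $pk$, and oracle access to $\mathsf{Enc}(pk, \cdot)$. $\mathcal{A}$ outputs a pair of messages $m_0, m_1 \in \mathcal{M}$ of the same length
3:  A uniform bit $b \in \{0, 1\}$ is chosen, and $ct^* \leftarrow \mathsf{Enc}(pk, m_b)$ is given to $\mathcal{A}$
4:  $\mathcal{A}$ continues to have access to $\mathsf{Enc}(pk, \cdot)$, and outputs a bit $b'$
5:  Output $1$ if $b' = b$, and $0$ otherwise
Algorithm 1 The CPA Experiment $\mathsf{PubK}^{\mathsf{cpa}}_{\mathcal{A}, \Pi}(\lambda)$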

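CPA Security [katz2014introduction].  A PKE scheme $\Pi = (\mathsf{Gen}, \mathsf{Enc}, \mathsf{Dec})$ has indistinguishable encryptions under a chosen-plaintext attack, or is CPA-secure, if for all probabilistic polynomial-time adversaries $\mathcal{A}$ there is a negligible function $\mathsf{negl}$ such that

$$\Pr\left[\mathsf{PubK}^{\mathsf{cpa}}_{\mathcal{A}, \Pi}(\lambda) = 1\right] \le \frac{1}{2} + \mathsf{negl}(\lambda),$$

where the experiment is defined in Algorithm 1, and the probability is taken over the randomness of $\mathcal{A}$ and of the experiment.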

2.2.2 Paillier Encryption.

We use Paillier encryption to maintain privacy in our two-party feature selection algorithm, and employ the additive homomorphic property of Paillier encryption to calculate the statistics that quantify feature utility. We recall the Paillier encryption scheme in Figure 1 [katz2014introduction].

Note that while we use Paillier homomorphic encryption, the proposed protocols can accommodate any semantically secure additively homomorphic encryption scheme.

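Paillier Encryption Scheme.  Let $\mathsf{GenModulus}$ be a polynomial-time algorithm that, on input $1^\lambda$, outputs $(N, p, q)$ where $N = pq$ and $p$ and $q$ are $\lambda$-bit primes (except $p$ or $q$ is not prime with probability negligible in $\lambda$). Define the following encryption scheme:

  • $\mathsf{Gen}$: on input $1^\lambda$ run $\mathsf{GenModulus}(1^\lambda)$ to obtain $(N, p, q)$. The public key is $N$, and the private key is $\langle N, \phi(N) \rangle$, where $\phi(N) = (p - 1)(q - 1)$.

  • $\mathsf{Enc}$: on input of a public key $N$ and a message $m \in \mathbb{Z}_N$, choose a uniformly random $r \leftarrow \mathbb{Z}_N^*$ and output the ciphertext
    $$ct := (1 + N)^m \cdot r^N \bmod N^2.$$

  • $\mathsf{Dec}$: on input of a private key $\langle N, \phi(N) \rangle$ and a ciphertext $ct$, compute
    $$m := \frac{\left(ct^{\phi(N)} \bmod N^2\right) - 1}{N} \cdot \phi(N)^{-1} \bmod N.$$

Figure 1: Paillier Encryption Scheme.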
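To make Figure 1 concrete, here is a minimal sketch of the scheme in the textbook form from [katz2014introduction]. The function names `keygen`, `encrypt`, and `decrypt` are hypothetical, and the primes are passed in directly rather than produced by a modulus generator. The tiny primes used in the usage note are an assumption for readability and give no security.

```python
import math
import secrets
from typing import Tuple


def keygen(p: int, q: int) -> Tuple[int, Tuple[int, int]]:
    """Build a key pair from two distinct primes (toy sizes, illustration only)."""
    N = p * q
    phi = (p - 1) * (q - 1)
    if math.gcd(N, phi) != 1:
        raise ValueError("need gcd(N, phi(N)) = 1, which holds for equal-length primes")
    return N, (N, phi)  # public key N, private key (N, phi(N))


def encrypt(N: int, m: int) -> int:
    """Return (1 + N)^m * r^N mod N^2 for a uniformly random r in Z_N^*."""
    if not 0 <= m < N:
        raise ValueError("message must lie in Z_N")
    while True:
        r = secrets.randbelow(N)
        if r > 0 and math.gcd(r, N) == 1:
            break
    N2 = N * N
    return pow(1 + N, m, N2) * pow(r, N, N2) % N2


def decrypt(sk: Tuple[int, int], ct: int) -> int:
    """Recover m from ct using the private key (N, phi(N))."""
    N, phi = sk
    u = pow(ct, phi, N * N)          # equals 1 + (m * phi mod N) * N
    return (u - 1) // N * pow(phi, -1, N) % N  # pow(x, -1, N) needs Python 3.8+
```

For example, `pk, sk = keygen(1009, 1013)` followed by `decrypt(sk, encrypt(pk, 42))` recovers `42`. A real deployment would use primes of cryptographic size and a vetted library.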

Paillier encryption supports additive and scalar multiplication homomorphism. We briefly recall the definitions of additive homomorphism and scalar multiplication homomorphism [katz2014introduction].

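Additive Homomorphism.  A PKE scheme $\Pi$ = ($\mathsf{Gen}$, $\mathsf{Enc}$, $\mathsf{Dec}$) is said to be additively homomorphic, if there exists a binary operation $\oplus$, such that the following holds for all $(pk, sk) \leftarrow \mathsf{Gen}(1^\lambda)$, and for all $m_1, m_2 \in \mathcal{M}$,

$$\mathsf{Dec}\big(sk, \mathsf{Enc}(pk, m_1) \oplus \mathsf{Enc}(pk, m_2)\big) = m_1 + m_2.$$

Scalar Multiplication Homomorphism.  A PKE scheme $\Pi$ = ($\mathsf{Gen}$, $\mathsf{Enc}$, $\mathsf{Dec}$) is said to be scalar multiplication homomorphic, if there exists a binary operation $\otimes$, such that the following holds for all $(pk, sk) \leftarrow \mathsf{Gen}(1^\lambda)$, and for all $m_1, m_2 \in \mathcal{M}$,

$$\mathsf{Dec}\big(sk, m_1 \otimes \mathsf{Enc}(pk, m_2)\big) = m_1 \cdot m_2.$$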
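For Paillier specifically, both properties follow directly from the form of a ciphertext. The short calculation below is a sketch using the textbook ciphertext $(1 + N)^m r^N \bmod N^2$; the randomness symbols $r_1$, $r_2$, $r$ and the integer scalar $k$ are notation introduced here for illustration.

```latex
\begin{align*}
\mathsf{Enc}(m_1) \cdot \mathsf{Enc}(m_2)
  &= (1+N)^{m_1} r_1^{N} \cdot (1+N)^{m_2} r_2^{N}
   = (1+N)^{m_1 + m_2} \, (r_1 r_2)^{N} \pmod{N^2}, \\
\mathsf{Enc}(m)^{k}
  &= \big( (1+N)^{m} r^{N} \big)^{k}
   = (1+N)^{k m} \, (r^{k})^{N} \pmod{N^2}.
\end{align*}
```

The first line is a valid encryption of $m_1 + m_2 \bmod N$, so multiplication of ciphertexts modulo $N^2$ plays the role of the additive operation. The second is a valid encryption of $k \cdot m \bmod N$, so exponentiation by a known scalar plays the role of scalar multiplication.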

3 Proof of Privacy

We first present the high-level argument for how our protocols will protect each party’s data. We have one of the parties (Carol) choose the encryption key, and encrypt her data using this key before sending it to the other party (Felix). Thus, Carol’s privacy will be guaranteed by the semantic security assumption of the encryption scheme. Meanwhile, Felix will also encrypt his data using Carol’s key, but he will blind all of the outputs he sends to Carol with randomness of his choosing, ensuring that Carol can learn nothing about his data. We now make these notions precise by first providing a formal definition of privacy protection in the honest-but-curious adversary model, and a formal proof of privacy for the protocol that attempts to protect privacy in the above described manner.
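The high-level argument can be illustrated with a toy exchange. This is only a sketch under stated assumptions, not the protocol of Section 4. It uses a tiny hard-coded Paillier modulus, hypothetical helper names, and one illustrative quantity: the count of rows with $f_i = 1$ and $c_i = 1$, which Felix derives from Carol's ciphertexts and blinds multiplicatively with a random $r$.

```python
import math
import secrets
from typing import List, Sequence, Tuple

# Toy Paillier parameters (tiny primes, illustration only; Carol owns PHI).
P, Q = 1009, 1013
N, N2, PHI = P * Q, (P * Q) ** 2, (P - 1) * (Q - 1)


def rand_unit() -> int:
    """Uniformly random element of Z_N^*."""
    while True:
        r = secrets.randbelow(N)
        if r > 0 and math.gcd(r, N) == 1:
            return r


def enc(m: int) -> int:
    """Paillier encryption under Carol's public key N."""
    return pow(1 + N, m % N, N2) * pow(rand_unit(), N, N2) % N2


def dec(ct: int) -> int:
    """Paillier decryption; only Carol can run this."""
    return (pow(ct, PHI, N2) - 1) // N * pow(PHI, -1, N) % N


def carol_encrypt_labels(c: Sequence[int]) -> List[int]:
    """Carol sends an encryption of every class label to Felix."""
    return [enc(ci) for ci in c]


def felix_blinded_count(enc_c: Sequence[int], f: Sequence[int]) -> Tuple[int, int]:
    """Felix computes on ciphertexts only, then blinds the result.

    Multiplying the ciphertexts at positions with f_i = 1 gives an encryption
    of sum_i f_i * c_i; raising it to r gives an encryption of r * sum mod N.
    """
    acc = enc(0)  # fresh encryption of 0 re-randomizes the product
    for ct, fi in zip(enc_c, f):
        if fi == 1:
            acc = acc * ct % N2
    r = rand_unit()
    return pow(acc, r, N2), r  # Felix sends the ciphertext and keeps r
```

Carol decrypts the returned ciphertext and sees only $r \cdot \sum_i f_i c_i \bmod N$, while Felix never sees a plaintext label. One caveat of multiplicative blinding is visible here: a zero count stays zero after blinding.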

Definition 1 (Honest-but-curious security of two-party protocol)

We begin with the following notation:

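  • Let $F_1$ and $F_2$ be probabilistic polynomial-time functionalities and let $\pi$ be a two-party protocol for computing $F = (F_1, F_2)$. Let the parties be $P_1, P_2$, with inputs $x, y$, respectively.

  • The view of the party $P_i$ during an execution of $\pi$ on $(x, y)$ and security parameter $\lambda$ is denoted by $\mathsf{view}^{\pi}_i(x, y, \lambda)$ and equals $(w, r^i, m^i_1, \dots, m^i_t)$, where $w \in \{x, y\}$ ($w$’s value depending on the value of $i$), $r^i$ equals the contents of the party $P_i$’s internal random tape, and $m^i_j$ represents the $j$-th message that it received.

  • The output of the party $P_i$ during an execution of $\pi$ on $(x, y)$ and security parameter $\lambda$ is denoted by $\mathsf{output}^{\pi}_i(x, y, \lambda)$ and can be computed from its own view of the execution. We write $\mathsf{output}^{\pi} = (\mathsf{output}^{\pi}_1, \mathsf{output}^{\pi}_2)$.

Let $F = (F_1, F_2)$ be a functionality. We say that $\pi$ securely computes $F$ in the presence of semi-honest adversaries if there exist probabilistic polynomial-time algorithms $S_1$ and $S_2$ such that

$$\{(S_1(1^\lambda, x, F_1(x, y)), F(x, y))\}_{x, y, \lambda} \stackrel{c}{\equiv} \{(\mathsf{view}^{\pi}_1(x, y, \lambda), \mathsf{output}^{\pi}(x, y, \lambda))\}_{x, y, \lambda},$$
$$\{(S_2(1^\lambda, y, F_2(x, y)), F(x, y))\}_{x, y, \lambda} \stackrel{c}{\equiv} \{(\mathsf{view}^{\pi}_2(x, y, \lambda), \mathsf{output}^{\pi}(x, y, \lambda))\}_{x, y, \lambda},$$

where $x, y \in \{0, 1\}^*$ such that $|x| = |y|$, and $\lambda \in \mathbb{N}$.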

4 Protocol

In this section, we describe a four-round protocol for $\chi^2$ statistic calculation under a two-party setting. For convenience, we continue to refer to the parties as Carol, who has the class vector $c$, and Felix, who has the feature vector $f$. Carol’s objective is to learn $\chi^2(f, c)$ and Felix’s objective is to not reveal any further information about $f$ while Carol computes the utility of Felix’s data for her classifier. In this section, Felix uses multiplicative blinding to keep the detailed mathematics a little simpler, but an alternative protocol that uses additive blinding is provided in Section 6 for situations where the security of multiplicative blinding is a concern.

As before, $A$ is the number of rows with $f_i = 0$ and $c_i = 0$. $B$ is the number of rows with $f_i = 0$ and $c_i = 1$. $C$ is the number of rows with $f_i = 1$ and $c_i = 0$. $D$ is the number of rows with $f_i = 1$ and $c_i = 1$.

Round 1.
Carol performs the following operations:

  1. Generate a Paillier key pair $(pk, sk)$.

  2. Encrypt all class labels with $pk$: $\mathsf{Enc}(pk, c_i)$ for $i = 1, \dots, n$.

  3. Compute . Note that Carol can obtain this value by computing , since and , based on the contingency table.

  4. Encrypt with : .

  5. Send the following values to Felix:

Round 2.
Felix performs the following operations:

  1. Compute . Note that Felix can obtain this value by computing = ,

    since .

  2. Sample , and compute .

  3. Send the following value to Carol:

Round 3.
Carol performs the following operations:

  1. Decrypt the ciphertext received from Felix using the secret key $sk$.

  2. Compute and , and encrypt them.

  3. Send the following values to Felix:

Round 4.
Felix performs the following operations:

  1. Cancel by computing


  2. Compute an encryption of by computing:

    where and are computed as


    We see below that the above computation gives . Since , can be decomposed as follows:

  3. Send the following value to Carol:

Local computation.
Carol decrypts the ciphertext received from Felix to obtain $\chi^2(f, c)$.
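The reason a short interactive protocol suffices can be seen from a brief calculation. The following is a sketch derived only from the contingency-table definitions, assuming the cell labeling $A$: $(f_i, c_i) = (0, 0)$, $B$: $(0, 1)$, $C$: $(1, 0)$, $D$: $(1, 1)$. The shorthand $s_c$ and $s_f$ is introduced here for illustration and is not the paper's notation.

```latex
\begin{align*}
s_c &= \sum_{i=1}^{n} c_i = B + D, \qquad
s_f = \sum_{i=1}^{n} f_i = C + D, \qquad
D = \sum_{i=1}^{n} f_i c_i, \\
A &= n - s_f - s_c + D, \qquad B = s_c - D, \qquad C = s_f - D, \\
AD - BC &= D\,(n - s_f - s_c + D) - (s_c - D)(s_f - D) = nD - s_c s_f, \\
\chi^2(f, c) &= \frac{n\,(nD - s_c s_f)^2}{s_c\,(n - s_c)\,s_f\,(n - s_f)}.
\end{align*}
```

Carol can compute $s_c$ from her labels alone and Felix can compute $s_f$ from his features alone, so $D$ is the only quantity that depends on both parties' data jointly.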

Remark 1

We note that only Carol receives the value $\chi^2(f, c)$. Depending on the application, if Felix also needs to know the value of $\chi^2(f, c)$, Carol can simply send it to Felix after running the protocol.