In many scenarios, a server holds a collection of sets and clients wish to determine whether, or how many of, the server's sets match their own set, while keeping both the clients' and the server's sets private. We call these Private Set Matching (PSM) problems. We identify the privacy and efficiency requirements of PSM problems by analyzing three real-world use cases: determining whether a pharmaceutical database contains compounds that are chemically similar to the client's (Stumpfe and Bajorath, 2011; Laufkötter et al., 2019), determining whether (or how many) relevant documents an investigative journalist holds (Edalatnejad et al., 2020), and matching a user's profile to items or other users in mobile apps (The Fork, online; Strava Route Explorer, online; Tinder, online).
We find that PSM problems have three common characteristics: (1) Clients want to compare their one set with all sets at the server. (2) Clients do not need per-server set results, only an aggregated output (e.g., whether any server set matches). (3) Clients and server want privacy: the server should learn nothing about the clients’ sets, and the clients should only learn the aggregated output. PSM problems differ, however, in their definition of when sets match.
Typically, set matching is defined as a function of the intersection of two sets. Hence, private set matching problems can be solved (Edalatnejad et al., 2020; Shimizu et al., 2015; Zhao and Chow, 2018) using private set intersection (PSI) protocols – which let clients privately compute the intersection between their set and one server set (Cristofaro et al., 2012; Cristofaro and Tsudik, 2009; Cristofaro et al., 2010; Pinkas et al., 2014, 2018a, 2018b, 2019b; Kissner and Song, 2005) – and then computing matches locally. This approach, however, is inefficient: it requires one PSI interaction per server set. Additionally, it reduces the privacy of the server's collection by leaking information about individual server sets beyond the aggregated result. Such leakage can have negative consequences, such as clients learning secret chemical properties of compounds or learning the content of journalists' sensitive document collections.
While PSM problems differ in their details, they all follow the same structure: first compute per-set matches, then aggregate them. Performing this computation in the clear, on either the client or the server, comes at the cost of privacy. We construct a framework for solving PSM problems that leverages computation in the encrypted domain, see Fig. 1. Our framework starts with the server applying a configurable matching criterion to the client's encrypted set to compute a per-set binary answer: "is this server set of interest to the client?" Matching criteria are application-dependent; our framework implements cardinality threshold, containment, and set-similarity measures. Next, the server aggregates these per-set results. Aggregation methods are application-dependent too; our framework implements variants such as determining whether at least one set matches or counting the number of matching sets. Finally, the client decrypts the aggregated result.
Our work makes the following contributions:
We introduce Private Set Matching (PSM) problems. We derive their requirements from three real-world problems. We show that existing solutions cannot efficiently satisfy privacy needs.
We propose a modular PSM framework that separates basic PSI functionality from flexible matching criteria and many-set aggregation. Our framework simplifies designing private solutions.
We design single-set protocols where the client learns a one-bit output – whether one server set is of interest to the client – and many-set protocols where the client learns a collection-wide output that aggregates individual PSM responses.
Our PSM protocols use oblivious polynomial evaluation. The communication cost scales linearly with the size of the client’s set and is independent of the number of server sets and their total size. Our protocols do not require preprocessing and outperform existing PSI-based protocols under the same privacy guarantees.
We demonstrate our framework's capability by solving the chemical compound and document matching scenarios. Empirical evaluation shows that the privacy gain of PSM over PSI comes at a very reasonable communication and computation cost.
2. Private Set Matching
In this section we define the Private Set Matching (PSM) problem. We derive its basic requirements from three real-world matching problems. We show how the PSM setting differs from the traditional PSI scenario and thus why existing PSI solutions cannot satisfy the privacy requirements of PSM problems.
2.1. Case studies
We study three cases that can benefit from PSM.
Chemical research and development is a multi-billion dollar industry. When studying a new chemical compound, knowing the properties of similar compounds can speed up research. In an effort to monetize research, companies sell datasets describing thousands to millions of compounds they researched.
Chemical R&D teams are willing to pay high prices for these datasets but only if they include compounds similar to their research target. Determining the suitability of datasets is tricky. To protect their businesses, buyers want to hide the compound they are currently investigating (Shimizu et al., 2015), and sellers do not want to reveal information about the compounds in their dataset.
Chemical similarity of compounds is determined by computing molecular fingerprints of compounds and then comparing these fingerprints (Stumpfe and Bajorath, 2011; Laufkötter et al., 2019; Xue et al., 2001; Willett et al., 1998; Cereto-Massagué et al., 2015; Muegge and Mukherjee, 2016). Fingerprints are based on the substructure of compounds and are represented as fixed-size bit vectors – these vectors are between a few hundred and a few thousand bits long. Measures such as Tversky (Tversky, 1977) and Jaccard (Jaccard, 1912) determine the similarity of these fingerprints and thus of compounds.
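For intuition, both measures can be computed on fingerprints viewed as sets of set-bit positions. The following plain-Python sketch (function and parameter names are ours, not from the paper's algorithms) shows the two similarity measures:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky index: |A ∩ B| / (|A ∩ B| + α·|A \\ B| + β·|B \\ A|).
    α = β = 1 recovers Jaccard; α = β = 0.5 gives the Dice coefficient."""
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))

# Fingerprints modeled as sets of positions of set bits.
fp_query = {1, 2, 3}
fp_seller = {2, 3, 4}
```

A compound would then be deemed similar when, e.g., `tversky(fp_query, fp_seller, alpha, beta)` exceeds an application-chosen threshold.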
Revealing pair-wise intersection cardinalities or even similarity scores between the fingerprints of a target compound and the seller's compounds results in unacceptable leakage. A buyer can reconstruct a molecular fingerprint bit by bit by learning sufficiently many similarity values between it and known compounds (Shimizu et al., 2015). To prevent leakage about the seller's dataset, the buyer should learn only the number of similar compounds in the seller's dataset, or, better, only whether at least one similar compound exists.
Peer-to-Peer Document Search
Privacy-preserving peer-to-peer search engines help users and organizations with strict privacy requirements to collaborate safely. We take the example of investigative journalists who, while unwilling to make their investigations or documents publicly available, want to find collaboration opportunities within their network.
To identify those opportunities, a journalist performs a search to learn whether another journalist has documents of interest (Edalatnejad et al., 2020). The querying journalist crafts queries made of keywords relevant for their investigation, and these query keywords are compared to each document in the document owners’ collections. A document is deemed relevant if its keywords either contain all query keywords, or a sufficient number of them. Each document owner has a collection of a thousand documents (on average), and each document is represented by around a hundred keywords.
The sensitivity of the documents’ content and of the journalists’ investigations demand that both the content of the documents and queries remain private (Edalatnejad et al., 2020). It suffices for the querying journalist to learn one bit of information – that at least one or a threshold number of documents in the owner’s collection is relevant – to determine whether they should contact the owner.
Matching in mobile apps
A common feature in mobile apps is enabling users to find records of interest in the app server's databases, e.g., restaurants (The Fork, online), new routes for running (Strava Route Explorer, online), or suitable dating partners (Tinder, online; OkCupid, online). Users are typically interested in records that have at least a certain number of characteristics in common with their search criteria, or that even match them perfectly. Also, for full functionality, users need to be able to retrieve these records.
The user-provided criteria – typically range choices entered via radio buttons or drop-down menus – are compared to the attributes of records. An app database can have millions of records and records can have dozens to hundreds of attributes.
Both search criteria and records are sensitive. Knowing search criteria enables profiling of user interests. These are particularly sensitive for dating applications. Thus, search queries should be kept private. The secrecy of the records in the database is not only at the core of the business value of these apps but also required by law in cases where records contain personal data (e.g., dating apps).
2.2. PSM requirements
In this section, we extract requirements that PSM protocols should fulfill – in addition to basic PSI properties such as client privacy – based on the commonalities between the use cases. We also identify properties that, while not required, make the deployment of PSM protocols easier in practice.
RQ.1: Flexible set matching
PSM protocols need to be able to determine matches between sets without revealing other information such as intersections or cardinalities to the client.
In the use cases, matches can be modeled as a function of the intersection between a client and a server set. A chemical compound is interesting when the Tversky or Jaccard similarity with the query compound exceeds a threshold. A document is interesting when some or all query keywords are present in the document’s keyword set. And a record is interesting when a threshold of query attributes are present in a record.
Clients do not need to know the intersection or its cardinality. Therefore, PSM protocols must not leak these intermediate values to the client and instead compute a one-bit matching status.
RQ.2: Aggregate many-set responses
PSM protocols need to have the capability to provide an aggregated response for a collection of sets without leaking information about individual sets.
Our three use cases highlight that in many applications, a client (buyer, journalist, user) may want to compare their input (compounds under investigation, keywords of interest, search criteria) with a collection of sets (compounds in a database, documents in a collection, records in a database). Thus, PSM protocols must consider collections instead of the two-set setting.
Moreover, in our use cases clients are satisfied with a response that summarizes the utility of the collection as a whole. For example, a buyer is interested in a chemical dataset if it contains at least one similar compound, and a querying journalist may contact a document owner if the owner has a number of relevant documents. Therefore, to protect the server's privacy, PSM protocols should only reveal aggregated per-collection results. This is particularly important to limit leakage when clients can make several queries.
RQ.3: Efficient clients
PSM protocols must demand little communication and computation from clients.
Clients may be constrained in their resources. This can be in terms of computation, e.g., in mobile apps where battery has to be preserved; or in terms of bandwidth, e.g., in mobile apps where users may not have unlimited data plans, or in the journalist scenario where journalists can be in locations with poor Internet access. Therefore, PSM protocols should not incur a large client-side cost.
Fulfilling the many-set requirement (RQ.2) notably helps towards efficiency. Clients do not need to repeat computation and send multiple queries when applying the same criteria to all server sets (Edalatnejad et al., 2020). Moreover, sending aggregated per-collection responses reduces the transfer cost of the response from linear in the number of server sets to constant.
2.2.2. Desirable properties
DP.1: Handling unbalance
It is desirable that PSM protocols are efficient in extremely unbalanced scenarios, with large total server set size and small client set size.
In the PSM setting, the server holds a collection of sets. As a result, the total number of server elements is large. On the other hand, the client only submits one set per query. Search queries commonly have a limited size. This leads to a large imbalance between the server’s and client’s input sizes. For example, in the document search scenario, search queries usually have less than 10 keywords, while the server sets together contain hundreds of millions of keywords. To gain efficiency, PSM protocols should not push the server load onto the client.
DP.2: Handling small domains
It is desirable that PSM protocols are efficient when the input domain of the sets is small.
In some PSM scenarios, the input is drawn from a small domain. For example, in a dating profile, the gender (male, female, other) and age (a number between 18 and 120) attributes can be represented using only a few hundred set elements. Taking advantage of a small input domain should allow PSM protocols to gain in efficiency and eventually scale to a larger number of sets.
2.3. Related Work
We introduce the private set intersection (PSI) protocols that are most relevant to our work. We split protocols into two categories: traditional protocols focusing on single-set intersection or cardinality, and protocols that go beyond this scenario. We assess these works with respect to two critical aspects of PSM problems: privacy and client efficiency. The many-set scenario impacts our assessment of client efficiency (RQ.3) as it leads to an extreme imbalance, regardless of whether desired property DP.1 or DP.2 is supported. Thus, we assess client efficiency assuming that the server input size may be 6 orders of magnitude larger than the client's. We summarize existing schemes and their suitability for the PSM scenario in Table 1.
2.3.1. Single-set PSI: intersection and cardinality
Most PSI protocols focus on scenarios where each party holds exactly one set. The client learns information about the intersection of these sets while (i) not learning anything about the server’s non-intersecting elements, and (ii) not leaking any information about their own set to the server. PSI protocols in the literature focus on providing two possible outputs: the intersection (e.g., finding common network intrusions (Nagaraja et al., 2010), or discovering contacts (Demmler et al., 2018)); and the cardinality of the intersection (e.g., privately counting the number of common friends between two social media users (Nagy et al., 2013), or performing genomic tests (Baldi et al., 2011)). Works in this area opt for a variety of trade-offs between the computational capability and the amount of bandwidth required to run the protocol (Chen et al., 2017; Kiss et al., 2017; Pinkas et al., 2020; Kales et al., 2019), showing that PSI can scale to large datasets (Pinkas et al., 2019b; Kamara et al., 2013), and support light clients (Kiss et al., 2017).
We classify PSI protocols in four classes by their underlying cryptographic primitive and discuss their suitability for PSM.
The fastest class of PSI protocols builds on oblivious transfers (OTs). Early OT-based protocols can compute the intersection of million-item sets in a few seconds (Pinkas et al., 2014). Later protocols use hashing to map the sets to small buckets and then apply efficient oblivious PRFs based on OTs to the items in these buckets to enable plaintext comparisons (Pinkas et al., 2015, 2018b; Kolesnikov et al., 2016; Pinkas et al., 2019a). Typical OT-based approaches reveal these comparison results (and thus intersections or cardinalities) to the client; therefore, they satisfy neither our single-set privacy requirement (RQ.1) nor our aggregation requirement (RQ.2). The communication cost is linear in the sizes of the client and server sets, so OT approaches do not satisfy our requirement for client efficiency (RQ.3). We discuss OT-based approaches that support SMC extensions in Section 2.3.2.
Circuit-based PSI protocols use secure two-party computation, e.g., Yao’s garbled circuits (Yao, 1986), to compute set intersections. A first class constructs a full circuit to evaluate the set intersection (Huang et al., 2012; Kales et al., 2019; Pinkas et al., 2018a, 2009; Kiss et al., 2017). Circuit-based approaches can be extended to support flexible set matching or many-set aggregation straightforwardly. However, circuits are at least linear in the size of the server input. This limit is fundamental and circuits cannot satisfy our client efficiency requirement.
Some protocols use circuits only as a building block within larger protocols (Kales et al., 2019; Kiss et al., 2017), for example, to obliviously evaluate a PRF such as AES or LowMC (Albrecht et al., 2015). These approaches do not support extending the circuit to do post-processing, so we do not categorize them as circuit-based protocols.
OPRFs using asymmetric cryptography
Some protocols construct Oblivious Pseudo-random Functions (OPRFs) from asymmetric primitives such as Diffie-Hellman (Rosulek and Trieu, 2021; Kiss et al., 2017), RSA (Cristofaro and Tsudik, 2009; Cristofaro et al., 2010; Kiss et al., 2017) or discrete logarithms (Cristofaro et al., 2012). The client obliviously evaluates the PRF on its elements with the server, and the server sends the PRF evaluation of its elements to the client. The client then locally compares.
OPRF-based approaches focus on comparison, like OT-based ones, and cannot compute flexible set matches without leaking intermediate information about sets. The communication cost is linear in the sizes of the client and server sets, but preprocessing can make the online transfer cost independent of the server size.
Oblivious Polynomial Evaluation
Another approach is to use (partial) homomorphic encryption (Rivest et al., 1978) to determine the set intersection using oblivious polynomial evaluation (OPE) (Freedman et al., 2004; Hazay, 2015; Dachman-Soled et al., 2009; Chen et al., 2017). The client encrypts their elements and sends them to the server. The server constructs a polynomial with its set elements as roots, evaluates the polynomial on the encrypted client elements, randomizes the results, and sends them back to the client. The client decrypts the results; a zero indicates a matching element. We use a similar approach in our HE schemes. Existing OPE-based approaches do not support flexible set matching or aggregation, but achieve client-side computation and communication costs independent of the server set size.
2.3.2. Custom PSI protocols
Several custom solutions provide more complex privacy-preserving functionalities than intersection or cardinality. These include computing the sum of associated data (Ion et al., 2017), evaluating a threshold on the intersection size (Zhao and Chow, 2018; Ghosh and Simkin, 2019; Zhao and Chow, 2017), or computing Tversky similarity (Shimizu et al., 2015). These approaches improve privacy by supporting flexible set matching, but they were not designed with a many-set scenario and privacy-preserving aggregation in mind. These schemes are optimized for a specific setting, and adapting them requires rethinking their design. None of these approaches achieve cost independent of the server input size.
There have been efforts to make hashing- and OT-based PSI protocols extendable with generic SMC (Huang et al., 2012; Ciampi and Orlandi, 2018; Pinkas et al., 2018a, 2019b; Chandran et al., 2022; Rindal and Schoppmann, 2021). These works support arbitrary privacy extensions, but focus on scenarios with equal client and server input sizes and have cost linear in the server size. Communication linear in the server input size seems to be a fundamental limit of SMC approaches.
2.3.3. Other relevant works
We study two groups of works outside the above PSI categories that use techniques or consider scenarios similar to ours. Two papers consider a scenario with many server sets (Edalatnejad et al., 2020; Shimizu et al., 2015); these works are closest to our final system, and we discuss and compare them to our system in Section 10. Other papers focus on small input domains and represent sets as bit-strings (Bay et al., 2021; Ruan et al., 2019; Shimizu et al., 2015; Huang et al., 2012). As our focus is solving PSM problems, we defer the comparison of our small-domain PSI variants to existing work to Appendix F.
Table 1. Summary of existing PSI approaches and their suitability for the PSM scenario: OT-based (Pinkas et al., 2014, 2015, 2018b; Kolesnikov et al., 2016; Pinkas et al., 2019a); circuit-based (Huang et al., 2012; Kales et al., 2019; Pinkas et al., 2018a; Rindal and Schoppmann, 2021; Pinkas et al., 2009); asymmetric OPRF (Cristofaro and Tsudik, 2009; Cristofaro et al., 2010; Kiss et al., 2017; Cristofaro et al., 2012; Rosulek and Trieu, 2021); OPE (Freedman et al., 2004; Hazay, 2015; Dachman-Soled et al., 2009; Chen et al., 2017); flexible functionality (Zhao and Chow, 2018; Ghosh and Simkin, 2019; Ion et al., 2017; Zhao and Chow, 2017; Shimizu et al., 2015); and extendable protocols (Huang et al., 2012; Ciampi and Orlandi, 2018; Pinkas et al., 2018a, 2019b; Chandran et al., 2022; Rindal and Schoppmann, 2021).
3. A Framework for PSM Schemes
Our analysis in Section 2.3 shows that, with the exception of a few special-purpose solutions, previous work cannot solve PSM problems without losing either privacy or efficiency. Instead of designing custom solutions for specific problems (as in Section 2.3.3), we design a modular framework that enables building PSM solutions with minimal effort while providing both strong privacy and performance close to that of ad-hoc solutions. We organize our protocols in the following layers, as shown in Fig. 2:
- Base Layer:
protocols operate directly on a single client set and a single server set. These protocols compute single-set PSI functionalities such as intersection or cardinality. This layer includes algorithms optimized for small and large input domains.
- PSM Layer:
protocols use base layer protocols to compute a binary answer determining whether the server set is of interest to the client. Computation in this layer is the same regardless of the domain size.
- Many-Set Layer:
protocols combine the output of PSM protocols (run on each of the server’s sets) and aggregate these responses into one collection-wide response.
We combine these layers to solve PSM protocols and satisfy our requirements, see also Fig. 1. The PSM layer protocols (using the base PSI layer) operate on the encrypted client’s set to compute binary per-set matches (RQ.1). The many-set layer tackles collections of sets and aggregates individual set responses to compute per-collection responses (RQ.2).
This construction ensures client efficiency (RQ.3). All layers are evaluated at the server to deal with the large imbalance between client and server input sizes (DP.1). The special small-domain base layer efficiently handles small input domains (DP.2). The many-set layer enables (1) sending only one single-set client query when searching a collection of sets and (2) aggregating server-set responses to achieve constant-size server responses (RQ.3).
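In plaintext, the layered composition amounts to the following minimal sketch, shown here with a cardinality-threshold matching criterion (function names are ours, not the framework's):

```python
def matches(client_set: set, server_set: set, threshold: int) -> int:
    """PSM layer: binary per-set match derived from the intersection cardinality."""
    return 1 if len(client_set & server_set) >= threshold else 0

def any_match(client_set: set, collection: list, threshold: int) -> int:
    """Many-set layer, 'exists' aggregation: does at least one server set match?"""
    return int(any(matches(client_set, s, threshold) for s in collection))

def count_matches(client_set: set, collection: list, threshold: int) -> int:
    """Many-set layer, 'count' aggregation: number of matching server sets."""
    return sum(matches(client_set, s, threshold) for s in collection)
```

The framework performs these same steps under encryption, so the server learns nothing about the client's set and the client learns only the aggregated output.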
Our framework reduces building new privacy-preserving PSM solutions to two tasks: requirement analysis to understand the problem, and configuring the framework to provide the required privacy. Our modular approach supports extensions. If an application has a new matching or aggregation criterion, designers only need to add one functionality to the appropriate layer while benefiting from the remaining layers. In Section 10, we show how to use our framework to build solutions for chemical similarity and peer-to-peer document search.
Additional design decisions
Our framework provides security and privacy in a nuanced threat model. As we show in Section 8, the framework ensures client-set privacy against malicious servers, and server-set privacy against malicious clients. However, our framework only provides correctness in the semi-honest server setting. This is a deliberate design decision. Even if we used cryptographic tools such as zero-knowledge proofs to ensure correct server computation, the computation would only be correct with respect to the server's freely chosen inputs (e.g., the server could simply input random sets). We therefore do not aim to provide system-wide correctness. We discuss the rationale for this decision and how malicious correctness impacts our requirements and privacy in Appendix E.
We chose not to use preprocessing in our framework. In the use cases we considered, clients perform only a small number of queries. Thus, the preprocessing cost cannot be amortized over many queries and is of limited benefit.
We choose to build protocols with just one round of communication. This way, our framework can support systems in which we cannot assume that both parties are online during the whole process. For example, journalists in a peer-to-peer search engine are often offline (Edalatnejad et al., 2020). Asynchrony leads to high round-trip times and makes multi-round protocols unsuitable.
Porting to Somewhat Homomorphic Encryption
Our framework is designed with FHE in mind and assumes unbounded multiplicative depth. For practical purposes, we port the majority, but not all, of our protocols to somewhat homomorphic encryption and optimize their operations in Section 9.
4. Technical Background
We introduce our notation and define the syntax of the fully homomorphic encryption scheme we use.
Let $\lambda$ be a security parameter. We write $x \leftarrow_{\$} S$ to denote that $x$ is drawn uniformly at random from the set $S$. Let $p$ be a positive integer; then $\mathbb{Z}_p$ denotes the set of integers $\{0, \ldots, p-1\}$, and $\mathbb{Z}_p^{*}$ represents the elements of $\mathbb{Z}_p$ that are co-prime with $p$. We write $[n]$ to denote the set $\{1, \ldots, n\}$, and use $(x_i)_{i \in [n]}$ to present the list $(x_1, \ldots, x_n)$. We drop the subscript when the list length is clear from the context. We write $[\![x]\!]$ to denote the encryption of $x$.
4.1. Homomorphic Encryption
Homomorphic encryption (HE) schemes enable arithmetic operations on encrypted values without decryption. We use a homomorphic encryption scheme that operates over the ring $\mathbb{Z}_p$ with prime $p$, such as BFV (Fan and Vercauteren, 2012).
HE is defined by the following procedures:
$\mathsf{params} \leftarrow \mathrm{HE.ParamGen}(\lambda, p)$. Generates HE parameters with the plaintext domain $\mathbb{Z}_p$.
$(pk, sk) \leftarrow \mathrm{HE.KeyGen}(\mathsf{params})$. Takes the parameters $\mathsf{params}$ and generates a fresh pair of keys $(pk, sk)$.
$c \leftarrow \mathrm{HE.Enc}(pk, m)$. Takes the public key $pk$ and a message $m \in \mathbb{Z}_p$ and returns the ciphertext $c = [\![m]\!]$.
$m \leftarrow \mathrm{HE.Dec}(sk, c)$. Takes the secret key $sk$ and a ciphertext $c$ and returns the decrypted message $m$.
The correctness property of homomorphic encryption ensures that $\mathrm{HE.Dec}(sk, \mathrm{HE.Enc}(pk, m)) = m$.
HE schemes support homomorphic addition (denoted by $\oplus$) and subtraction (denoted by $\ominus$) of ciphertexts: $\mathrm{HE.Dec}(sk, [\![m_1]\!] \oplus [\![m_2]\!]) = m_1 + m_2$ and $\mathrm{HE.Dec}(sk, [\![m_1]\!] \ominus [\![m_2]\!]) = m_1 - m_2$. HE schemes also support multiplication (denoted by $\otimes$) of ciphertexts: $\mathrm{HE.Dec}(sk, [\![m_1]\!] \otimes [\![m_2]\!]) = m_1 \cdot m_2$.
Besides operating on two ciphertexts, it is possible to perform addition and multiplication with plaintext scalars. In many schemes, such scalar-ciphertext operations are more efficient than first encrypting the scalar and then performing a standard ciphertext-ciphertext operation. We abuse notation and write $s \otimes [\![m]\!]$ to represent $[\![s]\!] \otimes [\![m]\!]$.
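As a mental model of this interface, the following insecure plaintext stand-in (our own sketch, not a real HE scheme: "ciphertexts" are plain residues mod p and nothing is hidden) makes the homomorphic identities explicit:

```python
P = 65537  # toy prime plaintext modulus; a real scheme fixes p during parameter generation

class MockHE:
    """Insecure stand-in for an HE scheme over Z_p. 'Ciphertexts' are plain
    residues mod P, so the arithmetic identities can be checked directly."""
    def enc(self, m: int) -> int:          # stands in for HE.Enc(pk, m)
        return m % P
    def dec(self, c: int) -> int:          # stands in for HE.Dec(sk, c)
        return c % P
    def add(self, c1: int, c2: int) -> int:
        return (c1 + c2) % P
    def sub(self, c1: int, c2: int) -> int:
        return (c1 - c2) % P
    def mul(self, c1: int, c2: int) -> int:
        return (c1 * c2) % P
    def mul_plain(self, c: int, s: int) -> int:  # scalar-ciphertext product
        return (c * s) % P
```

Under a real scheme, each of these methods maps to the corresponding homomorphic operation, and `mul_plain` is the cheaper scalar-ciphertext variant discussed above.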
4.2. Core functions
We observe that the complex functionality of PSM protocols can be reduced to a sequence of calls to zero-detection and inclusion-test procedures. These two functions allow us to describe our protocols at a higher abstraction level. Moreover, any improvement to these basic functions automatically enhances our framework.
The function $[\![b]\!] \leftarrow \mathrm{HE.IsZero}([\![x]\!])$ computes whether the ciphertext $[\![x]\!]$ is an encryption of zero. The binary output $b$ is defined as $b = 1$ if $x = 0$, and $b = 0$ otherwise.
We rely on the ring structure of $\mathbb{Z}_p$ to build HE.IsZero. Based on Fermat's little theorem, we know that for any prime $p$, any non-zero value raised to the power $p-1$ is congruent to one modulo $p$. Hence $b = 1 - x^{p-1} \bmod p$, and we can perform this exponentiation with $O(\log p)$ multiplications. See Algorithm 1 for the implementation.
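In the plaintext domain, this computation can be sketched as follows (our own illustration; homomorphically, each plaintext multiplication below becomes a ciphertext multiplication):

```python
def is_zero(x: int, p: int) -> int:
    """Fermat-based zero test over Z_p (p prime): returns 1 if x ≡ 0 (mod p),
    and 0 otherwise. Uses square-and-multiply exponentiation to the power
    p - 1, i.e., O(log p) multiplications."""
    result = 1
    base = x % p
    e = p - 1
    while e:                         # square-and-multiply
        if e & 1:
            result = (result * base) % p
        base = (base * base) % p
        e >>= 1
    return (1 - result) % p
```

The chain of $O(\log p)$ sequential multiplications is exactly what gives HE.IsZero its high multiplicative depth.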
The high multiplicative depth of HE.IsZero currently makes it impractical for use with most somewhat homomorphic encryption schemes. We hope that with the advances in HE schemes and research on HE comparison and equality test, efficient instantiations of this function become available to unlock our framework’s full capabilities. When evaluating our framework in Section 10 we use ad-hoc techniques to prevent the need for this function.
The function $[\![z]\!] \leftarrow \mathrm{HE.IsIn}([\![y]\!], S)$ checks whether $y$ is included in the set $S$ of cardinality $n$. We consider two variants: in the first, $S$ is a set of ciphertexts $([\![s_1]\!], \ldots, [\![s_n]\!])$; in the second, $S$ is a set of plaintexts $(s_1, \ldots, s_n)$. In both cases, the output $z$ equals 0 if and only if an $s_i$ exists such that $y = s_i$; otherwise $z$ will be a uniformly random element in $\mathbb{Z}_p^{*}$.
The function HE.IsIn relies on oblivious polynomial evaluation (OPE) (Freedman et al., 2004; Hazay, 2015), see Algorithm 2. In both variants, we create an (implicit) polynomial $P(x) = \prod_{i \in [n]} (x - s_i)$ with roots $s_1, \ldots, s_n$, and evaluate $z = r \cdot P(y)$ for a fresh random value $r \leftarrow_{\$} \mathbb{Z}_p^{*}$. If $y$ is in the set, there exists an $s_i$ such that $y - s_i = 0$, thus $P(y)$ is zero, and so is $z$. Otherwise, $P(y)$ is a product of non-zero factors modulo $p$. Since $p$ is prime, the product of non-zero values is non-zero. The random value $r$ ensures that in this case the output is a uniformly random element in $\mathbb{Z}_p^{*}$. The multiplicative depth of HE.IsIn scales with the size of the set $S$. We use the second form, where $S$ is a set of plaintexts, to compute HE.IsIn at a lower multiplicative depth when the $s_i$ are known, see Section 9.3.
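A plaintext sketch of the blinded polynomial evaluation (our illustration; under HE, the subtractions and products below are ciphertext operations):

```python
import random

def is_in(y: int, server_set: list, p: int) -> int:
    """OPE inclusion test: evaluates r · Π_i (y - s_i) mod p for a fresh
    random r. Returns 0 iff y is in server_set; otherwise the blinding
    factor r makes the output a uniformly random element of Z_p^*."""
    acc = 1
    for s in server_set:
        acc = (acc * (y - s)) % p
    r = random.randrange(1, p)  # fresh blinding factor in Z_p^*
    return (r * acc) % p
```

Note that the loop multiplies one factor per server element, which is why the multiplicative depth grows with the set size.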
5. Base Layer
In this section, we present the base layer of our framework. It provides three basic single-set private set intersection operations. The first two compute intersection and its cardinality. The third is a generalization of the cardinality where the server assigns weights to its elements, and the output is the sum of weights corresponding to the intersection (PSI-SUM). These protocols are set up in a way that the server can continue computation on the output of the base layer (see Fig. 2), but they can be used in isolation too.
5.1. Protocols with Large Input Domain
The three large-domain protocols follow the same structure, see Fig. 3. The client holds the set $Y$ and the server holds the set $S$. The client generates a HE key pair and sends the public key to the server ahead of the protocol. Clients encrypt each of the elements in their set and send them to the server as a query. The server runs a protocol-specific processing function process to obtain the result. The protocol either acts as the first layer and passes this result into the second layer, or is used as a stand-alone protocol and returns the result to the client. The client then runs the protocol-specific function reveal to compute the final output. Algorithm 3 instantiates process and reveal for the three protocols.
The PSI protocol computes the intersection $Y \cap S$. The server uses the inclusion test HE.IsIn (see Section 4.2) to compute an inclusion status for each client element (see PSI-process). An element is in the intersection if and only if the corresponding inclusion status is zero (recall that a zero indicates that the corresponding client element is in the server set). When used as a stand-alone protocol, the server returns the list of encrypted inclusion values, which the client then decrypts (see PSI-reveal).
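A plaintext sketch of the two functions (our illustration; in the protocol the statuses are ciphertexts, and only the client can perform the reveal step after decryption):

```python
import random

P = 7919  # toy prime plaintext modulus

def is_in(y: int, server_set: list) -> int:
    """Blinded polynomial evaluation: 0 iff y is in server_set."""
    acc = 1
    for s in server_set:
        acc = (acc * (y - s)) % P
    return (random.randrange(1, P) * acc) % P

def psi_process(client_query: list, server_set: list) -> list:
    """PSI-process: one inclusion status per client element."""
    return [is_in(y, server_set) for y in client_query]

def psi_reveal(client_set: list, statuses: list) -> set:
    """PSI-reveal: keep the elements whose status decrypts to zero."""
    return {y for y, d in zip(client_set, statuses) if d == 0}
```

The non-zero statuses are blinded, so a client learns nothing about non-matching server elements beyond their absence.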
The PSI cardinality protocol computes the cardinality $|Y \cap S|$. There exist two variants: the standard PSI-CA variant in which the client learns the cardinality (Cristofaro et al., 2012; Debnath and Dutta, 2015), and the ePSI-CA variant in which the server learns an encrypted cardinality (Zhao and Chow, 2018). We focus on the latter to enable further computation on the intersection cardinality in subsequent layers.
Our ePSI-CA protocol (see ePSI-CA-process) first computes the inclusion statuses using PSI-process and then uses HE.IsZero to compute – in the ciphertext domain – the cardinality, i.e., the number of statuses that are zero. When used as a stand-alone protocol, the server returns the encrypted cardinality to the client, which decrypts it to obtain the answer (see ePSI-CA-reveal).
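In plaintext, the cardinality step is a sum of zero tests over the PSI inclusion statuses (our sketch; homomorphically, both the zero tests and the sum run on ciphertexts):

```python
def is_zero(x: int, p: int) -> int:
    """1 if x ≡ 0 (mod p), else 0 (Fermat's little theorem, p prime)."""
    return (1 - pow(x, p - 1, p)) % p

def epsi_ca_process(statuses: list, p: int) -> int:
    """ePSI-CA-process: intersection cardinality from the PSI inclusion
    statuses, i.e., the number of statuses that are zero."""
    return sum(is_zero(d, p) for d in statuses) % p
```

Keeping the result encrypted is what lets later layers threshold or aggregate it without revealing the cardinality itself.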
When the cardinality protocol is used as a stand-alone protocol without further layers, it is possible to mimic earlier work (Cristofaro et al., 2012) and construct a cardinality protocol from the above-mentioned naive PSI protocol by shuffling server responses before returning them.
In the PSI-SUM protocol, the server assigns a weight w_j to each of its elements y_j and computes the sum of the weights of common elements, i.e., Σ_{y_j ∈ X ∩ Y} w_j. To this end, PSI-SUM-process takes the weights as extra input. The server computes a binary inclusion status for each server element y_j indicating whether y_j ∈ X. The server then proceeds similarly to ePSI-CA to compute the encrypted weighted sum. When used as a stand-alone protocol, the server returns the encrypted sum to the client, which decrypts it to obtain the answer (see PSI-SUM-reveal).
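A plaintext-domain sketch of this step follows; note that, unlike PSI-process, the membership test here runs per server element. The modulus T and names are again our own illustration.

```python
T = 65537  # illustrative prime plaintext modulus

def psi_sum_process(client_query, server_set, weights):
    # For each *server* element y_j, compute a binary status s_j that
    # is 1 iff y_j appears in the client set, then sum w_j * s_j.
    def member(y):
        prod = 1
        for x in client_query:
            prod = prod * (y - x) % T
        return (1 - pow(prod, T - 1, T)) % T  # 1 iff prod == 0 (Fermat)
    return sum(w * member(y) for y, w in zip(server_set, weights)) % T
```

For example, with client set {1, 2, 3}, server set {2, 3, 4}, and weights (10, 20, 30), the weighted sum over the intersection is 10 + 20 = 30.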
OPE-based PSI schemes are overlooked in the literature due to their quadratic computation cost. We show that the extreme imbalance in PSM problems enables our scheme to provide a competitive computation cost while achieving communication cost independent of the server input size.
5.2. Small Domain Extension
When the domain of the client and server sets is small, sets can be efficiently represented and manipulated as bit-vectors (Bay et al., 2021; Ruan et al., 2019; Shimizu et al., 2015). To this end, the client and server first agree on a fixed ordering of the elements in the domain D. Figure 4 shows the structure of the small-domain protocols. This time, the client's query consists of a vector of encrypted inclusion statuses: for each element d in the domain D, the client sends an encryption c_d with c_d = 1 iff d ∈ X and c_d = 0 otherwise.
Assume for now that clients honestly compute the values c_d. Algorithm 4 shows how to instantiate the small-domain process functions. The function PSI-SD-process creates a bit vector for the intersection. To do so, the server multiplies each indicator c_d with its own binary indicator of whether the element d is present in the server set Y. The final status is an encryption of 1 if d is present in both sets and 0 otherwise.
The remaining two functions ePSI-CA-SD-process and PSI-SUM-SD-process compute the sum and weighted sum, respectively, of the inclusion statuses over all domain values d ∈ D.
Malicious clients can deviate from the protocol and submit non-binary values c_d. In doing so, they can learn more than the permitted cardinality/sum. Because the query is encrypted, the server cannot detect this misbehavior. The standard approach of using zero-knowledge proofs to ensure honest behavior is expensive. Therefore, we use an HE approach to protect against malicious clients. The server builds a randomizer term R (blue instructions in Fig. 4) that is random in Z_t if the client misbehaves. By adding R to the result, a misbehaving client learns nothing about the real result. (With abuse of notation, the server adds a vector of fresh randomizers for PSI-SD-process.) In a many-set scenario, the server can amortize the cost of computing the randomizers for all sets Y_i: the server computes R once as before, then picks fresh randomness ρ_i for each set and uses R_i = ρ_i · R.
The term R is zero when all c_d are binary, and we argue it is close to uniformly random otherwise. The randomizer is a sum of terms ρ_d · c_d · (c_d − 1) with fresh randomness ρ_d. Each term evaluates to 0 if c_d ∈ {0, 1} and to a uniformly random element in Z_t otherwise. Therefore, the distribution of R will be close to uniformly random in Z_t as long as at least one non-binary c_d exists in the client's query. See Appendix C for the exact distribution.
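The following sketch evaluates this randomizer in the plaintext domain; the c_d(c_d − 1) form of the per-element check and the modulus T are our reconstruction for illustration (the server would compute the same expression homomorphically).

```python
import random

T = 65537  # illustrative prime plaintext modulus

def randomizer(query):
    # R = sum_d rho_d * c_d * (c_d - 1) mod T. Each term vanishes iff
    # c_d is binary; a non-binary c_d makes its term, and hence R,
    # close to uniform over Z_T.
    return sum(random.randrange(T) * c * (c - 1) for c in query) % T
```

For an honest binary query such as [0, 1, 1, 0] the randomizer is exactly zero, so adding it leaves the result untouched.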
While the idea of representing sets as bit-vectors is not new (Bay et al., 2021; Ruan et al., 2019; Shimizu et al., 2015), existing work dismisses FHE protocols as too costly and focuses on additively homomorphic solutions. Using the inherent parallelism of encryption schemes such as BFV (Fan and Vercauteren, 2012), which we discuss in Section 9, enables us to achieve lower computation and communication costs, especially in the many-set scenario. We evaluate and compare the cost of PSI-CA-SD in Appendix F. We show that it has comparable performance to the existing optimized small-domain scheme of Ruan et al. (2019), while providing better privacy by protecting server sets against malicious clients.
6. Private Set Matching Layer
In this section, we present the PSM layer of our framework. It uses the output of the base-layer protocols to determine whether the server set Y is of interest given the client query X. The output of the matching layer is the matching status m. Similar to the inclusion test HE.IsIn in Section 4.1, the matching status is zero for sets of interest and a random value in Z_t otherwise.
The PSM layer provides three interest metrics: full PSM (F-PSM), which determines whether the client's query set is fully contained in the server set; threshold PSM (Th-PSM), which determines whether the size of the intersection exceeds a threshold; and Tversky PSM (Tv-PSM), which determines whether the Tversky similarity between the client set and the server set exceeds a threshold. Each of these PSM operations relies on the large- or small-domain base-layer protocols, see Algorithm 5.
To instantiate a PSM protocol, the client and the server proceed as in Fig. 3 for large domains and Fig. 4 for small domains, but plug in the appropriate PSM process and reveal methods from Algorithm 5. Since the outputs of the PSM process functions are defined identically, all three PSM operations share the same PSM-reveal method.
The F-PSM variant determines whether the client elements are all in the server's set, i.e., X ⊆ Y. The server first computes the inclusion statuses b_i by calling PSI-process (see F-PSM-process). Recall that b_i is zero when x_i ∈ Y. Therefore, when X ⊆ Y, the sum of all b_i is zero. When an element is not in the server set, its inclusion status is uniformly random among the nonzero elements of Z_t, and therefore the sum will be random as well.
The F-PSM protocol has a small false-positive probability when more than one x_i exists such that x_i ∉ Y. Adding multiple random PSI responses can, incorrectly, lead to a zero sum. In Appendix C, we bound the probability of a false positive, and we bound the distance between the distribution of the sum and the uniform distribution over Z_t when more than one x_i is missing. The false-positive probability is zero when only one x_i is missing.
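A plaintext-domain sketch of F-PSM-process follows (modulus T and names are illustrative assumptions). Note that with exactly one missing element the output is a single nonzero random value, matching the zero false-positive claim for that case.

```python
import random

T = 65537  # illustrative prime plaintext modulus

def f_psm_process(client_query, server_set):
    # Sum of the per-element inclusion statuses: zero iff every client
    # element is in the server set (up to the small false-positive
    # probability discussed above), random in Z_T otherwise.
    total = 0
    for x in client_query:
        rho = random.randrange(1, T)  # fresh nonzero randomness
        prod = 1
        for y in server_set:
            prod = prod * (x - y) % T
        total = (total + rho * prod) % T
    return total
```

For example, querying {2, 3} against server set {1, 2, 3, 4} yields 0 (a match), while querying {9} against {1, 2} always yields a nonzero status.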
Note that the F-PSM protocol computes containment, not equality. Thus, hashing the client set and comparing it with a hash of the server's set does not work, as the server would need to hash every combination of its set elements, incurring an exponential cost.
The Th-PSM variant determines whether the sets X and Y have at least τ elements in common, i.e., |X ∩ Y| ≥ τ. The server's computation (see Th-PSM-process) first uses ePSI-CA-process to compute the encrypted cardinality and then evaluates the inequality.
Directly computing this one-sided inequality over encrypted values is costly. However, |X ∩ Y| ≥ τ holds if and only if |X ∩ Y| ∈ {τ, τ + 1, …, n}, where we use that the size of the client's set is bounded by the query size n (an exact bound on |X| is not needed). The server evaluates the inequality by performing the inclusion test of the encrypted cardinality against this set using HE.IsIn.
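In the plaintext domain, this membership trick reduces to a randomized product over the candidate set, sketched below (T and names are our illustration; the server runs the same polynomial homomorphically on the encrypted cardinality).

```python
import random

T = 65537  # illustrative prime plaintext modulus

def th_psm_process(card, threshold, n):
    # |X ∩ Y| >= threshold iff card lies in {threshold, ..., n}; run
    # the inclusion test of the cardinality against that public set,
    # as HE.IsIn does: zero iff the threshold is met.
    rho = random.randrange(1, T)
    prod = 1
    for v in range(threshold, n + 1):
        prod = prod * (card - v) % T
    return rho * prod % T
```

For example, with query size n = 5 and threshold 2, a cardinality of 3 yields a zero matching status, while a cardinality of 1 yields a random nonzero one.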
The Tv-PSM variant determines whether the Tversky similarity of the sets X and Y exceeds a threshold τ. More formally, the protocol computes whether Tv(X, Y) ≥ τ, where
Tv(X, Y) = |X ∩ Y| / (|X ∩ Y| + α·|X \ Y| + β·|Y \ X|)
is the Tversky similarity with parameters α and β. Computing the Tversky similarity in this form is difficult as it requires floating-point operations. We assume α, β, and τ are rational and known to both the client and the server. We follow the approach of Shimizu et al. (Shimizu et al., 2015) and transform the Tversky threshold test into the linear inequality
a·|X ∩ Y| − b·|X| − g·|Y| ≥ 0 (1)
for appropriate integer values of a, b, and g. The server either knows |X| (large domain, where it equals the query size) or can compute it (small domain, where |X| = Σ_d c_d). The server also knows |Y| and can compute |X ∩ Y| using ePSI-CA-process. Evaluating the inequality requires two steps (see Tv-PSI-process):
Step 1. Transform the rational coefficients derived from α, β, and τ to equivalent integer coefficients a, b, and g. We describe this transformation in detail in Appendix B.
Step 2. Evaluate the Tversky similarity inequality (1). We convert this one-sided inequality into a two-sided membership test. We know 0 ≤ |X ∩ Y| ≤ |X| and that the coefficients b and g are non-negative, thus a·|X ∩ Y| − b·|X| − g·|Y| ≤ a·|X|. Therefore, two sets X and Y satisfy Tv(X, Y) ≥ τ iff
a·|X ∩ Y| − b·|X| − g·|Y| ∈ {0, 1, …, a·|X|}.
The server evaluates this membership by performing an inclusion test.
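The two steps can be sketched as follows, writing the integer coefficients as a, b, g (our notation); the explicit coefficient formulas are our reconstruction of the transformation from the Tversky definition, offered for illustration rather than as the paper's exact Appendix B procedure.

```python
import math
from fractions import Fraction

def integer_coefficients(alpha, beta, tau):
    # Step 1: Tv(X, Y) >= tau  <=>  a*|X∩Y| - b*|X| - g*|Y| >= 0 with
    # a = 1 - tau*(1 - alpha - beta), b = tau*alpha, g = tau*beta,
    # scaled by a common denominator to obtain integers.
    a = 1 - tau * (1 - alpha - beta)
    b = tau * alpha
    g = tau * beta
    m = math.lcm(a.denominator, b.denominator, g.denominator)
    return int(a * m), int(b * m), int(g * m)

def tversky_over_threshold(X, Y, alpha, beta, tau):
    # Step 2: evaluate the linear form; the server instead tests
    # membership of this value in {0, ..., a*|X|} with HE.IsIn.
    a, b, g = integer_coefficients(alpha, beta, tau)
    return a * len(X & Y) - b * len(X) - g * len(Y) >= 0
```

For example, with α = β = τ = 1/2 the coefficients become (4, 1, 1), and X = {1, 2}, Y = {1, 2, 3} passes the threshold since 4·2 − 1·2 − 1·3 = 3 ≥ 0.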
7. Many-Set Layer
The many-set layer aggregates the outputs of the PSM layer, thereby answering questions about the collection of server sets. We provide four aggregation methods: naive aggregation, which simply returns the PSM layer outputs; existential search (X-MS), which returns whether at least one server set matched; cardinality search (CA-MS), which returns the number of matching server sets; and retrieval (Ret-MS), which returns the index of the k-th matching server set.
Upon receiving the query, the server runs the desired PSM protocol psm-process (e.g., one of the methods in Algorithm 5) on each of its sets Y_i to compute the matching output m_i. The response m_i is zero if the set is interesting for the client and random otherwise. The server next runs an aggregation function ms-process, taking as input the PSM responses, to compute the final result r. Algorithm 6 shows how to instantiate ms-process and ms-reveal for each of the four variants. When using naive aggregation (NA-MS), the client runs NA-ms-reveal. In all other cases, the client runs Agg-ms-reveal to compute the result.
Naive Aggregation (NA-MS)
The naive aggregation variant runs the PSM protocol for each of the server sets and returns the results to the client (see NA-ms-process). Using the many-set protocol reduces the client’s computation and bandwidth costs as they compute and send the query only once.
Existential Search (X-MS)
The existential search variant determines whether at least one of the server sets is of interest to the client. Recall that interesting sets produce zero PSM responses, so a collection contains an interesting set if and only if the product of the PSM responses is zero (see X-ms-process). This process does not introduce false positives: responses are elements of a prime field Z_t, so their product cannot be zero unless one of the responses is zero.
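A plaintext-domain sketch of this aggregation (modulus T illustrative): the zero-product argument holds exactly because Z_T is a field.

```python
T = 65537  # illustrative prime plaintext modulus

def x_ms_process(psm_responses):
    # Product of the per-set matching statuses m_i: zero iff at least
    # one m_i is zero. In the prime field Z_T, a product of nonzero
    # responses can never be zero, so this step adds no false positives.
    prod = 1
    for m in psm_responses:
        prod = prod * m % T
    return prod
```

For example, the responses [5, 0, 9] aggregate to 0 (a match exists), while [5, 9] aggregate to the nonzero value 45.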
Cardinality Search (CA-MS)
The cardinality search variant counts the number of interesting sets in the server's collection. The CA-MS protocol (see CA-ms-process) follows the same process as the ePSI-CA protocol: it uses HE.IsZero to turn the PSM responses into binary values and computes their sum.
Similar to the single-set PSI-CA, we can use shuffling to convert naive aggregation into cardinality search with minimal computational overhead. This gain comes at the cost of increased communication, as the protocol sends shuffled per-set responses to the client instead of a single encrypted cardinality.
Retrieval (Ret-MS)
The many-set retrieval variant returns the associated data of the k-th matching server set. Clients use this variant when they are not concerned about whether a matching set exists, but rather about information related to this matching set, such as an index for retrieving records. A good example is the matching scenario where apps want to retrieve substantial data about the matching records. Apps would first run the Ret-MS protocol to retrieve the index of the matching record and then follow with a PIR request to retrieve the matching set's associated data.
The Ret-MS protocol takes an input parameter k denoting that the client wants to retrieve the associated data of the k-th matching set. The server builds an encrypted index of interesting sets in three steps (see Ret-ms-process): (1) The server uses HE.IsZero to compute a binary indicator z_i of whether the set Y_i is interesting. (2) The server computes a counter c_i = Σ_{j ≤ i} z_j tracking how many interesting sets exist among the first i sets. (3) The server combines z_i and c_i to compute an indicator e_i that is 1 if Y_i is the k-th interesting set, and zero otherwise. Summing the e_i weighted by the associated data produces the result.
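The three steps can be sketched in the plaintext domain as follows (modulus T, names, and the e_i construction via two zero tests are our illustration of the description above).

```python
T = 65537  # illustrative prime plaintext modulus

def is_zero(v):
    return (1 - pow(v, T - 1, T)) % T  # 1 iff v == 0 mod T (Fermat)

def ret_ms_process(psm_responses, k, associated_data):
    # (1) z_i = 1 iff set i matched; (2) running counter c_i of matches
    # among the first i sets; (3) e_i = z_i * [c_i == k] selects the
    # k-th match, whose associated data is returned.
    result, counter = 0, 0
    for m, data in zip(psm_responses, associated_data):
        z = is_zero(m)
        counter += z
        e = z * is_zero(counter - k)
        result = (result + e * data) % T
    return result
```

For example, with matching statuses [7, 0, 3, 0] (sets 2 and 4 match) and associated data [10, 20, 30, 40], asking for the 2nd match returns 40.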
8. Security and Privacy
We define correctness and client and server privacy, and prove that our protocols achieve privacy against malicious adversaries. The same definitions apply to both PSM and PSI protocols.
Definition 1 (Correctness).
A PSM protocol is correct if the computed result matches the definition in Table 5.
Theorem 2.
Our PSM protocols are correct against semi-honest servers.
Correctness follows from our mathematical definitions in a straightforward manner, as we assume a semi-honest threat model for correctness. We explain the rationale behind our threat model and how semi-honest correctness may impact malicious privacy in Appendix E.
Definition 3 (Client Privacy).
A PSM protocol is client private if the server cannot learn any information about the client’s set beyond the maximum size of the client’s set.
Definition 4 (Server Privacy).
A PSM protocol is server private if the client cannot learn any information about the server elements beyond the number of server sets, the maximum server set size, and the explicit output of the protocol.
Theorem 5.
Our PSM protocols provide client and server privacy against malicious adversaries, as long as the HE scheme is IND-CPA secure and circuit private.
We provide below a sketch of the proof, and refer the reader to Appendix D for the full proof.
Our protocols achieve client privacy because the only client action is to send their encrypted set elements to the server. A server that could extract information from these ciphertexts would break IND-CPA security.
To show server privacy, we split our protocols into two types: stand-alone protocols (PSI, PSI-SUM, PSI-SD), which operate directly on the client's and server's inputs; and composite protocols, which post-process the output of stand-alone protocols but do not depend on further client or server inputs.
The stand-alone protocols produce the correct output regardless of any malicious client input; in the case of PSI-SD, the malicious check randomizes the output of misbehaving clients. Circuit privacy ensures that encrypted responses reveal nothing beyond the decrypted outputs. Hence, the stand-alone protocols are server private.
The composite protocols only process the output of the stand-alone protocols. They do not rely on further client or server input. Thus, as the composite protocols compute the desired functionality (by correctness) and the underlying stand-alone protocol is server private, so is their composition. ∎
9. From Theory To Practice
Naively implementing schemes using homomorphic encryption leads to slow protocols. We analyze the asymptotic cost of our schemes, and explain the optimizations we implement to make our schemes feasible in practice.
9.1. Asymptotic Cost
Table 2 summarizes the asymptotic cost of the modules in our framework. The communication cost of our protocols is linear in the client set size, but independent of the number of server sets and their size, except for the many-set NA-MS protocol, which must return a matching status for each server set and is thus linear in the number of server sets. This is the minimum achievable cost.
Let n be the size of the client's set and N be the total number of server elements. The computation cost of our protocols is O(nN), a higher asymptotic cost than alternative approaches (Cristofaro et al., 2012; Pinkas et al., 2020; Kiss et al., 2017). The PSI literature often dismisses such cost as “quadratic” and too expensive, while approaches (Cristofaro and Tsudik, 2009; Cristofaro et al., 2010; Ateniese et al., 2011; Pinkas et al., 2019b; Ciampi and Orlandi, 2018) with “linear” cost O(εN), where ε determines the protocol's false-positive rate, are considered acceptable. However, we focus on unbalanced (n < N) and extremely unbalanced (n ≪ N) scenarios. In these scenarios, “quadratic” protocols may, in practice, outperform “linear” protocols: whenever n is smaller than ε, the O(nN) cost is the smaller of the two.
Fully homomorphic encryption schemes rely on bootstrapping, which is prohibitively expensive. Thus, we use the somewhat homomorphic BFV cryptosystem (Fan and Vercauteren, 2012) with a fixed multiplicative depth.
We implement our protocols using the Go language. Our code is 1,620 lines long and relies on the Lattigo library (Lattigo v2.1.1, online, 2020) to handle BFV operations. Our protocols require the HE scheme to provide circuit privacy, which is supported by Lattigo (Mouchet et al., 2020; de Castro et al., 2021). We will open source our code upon publication of the paper.
We run all experiments on a machine equipped with an Intel i7-9700 @ 3.00 GHz processor and 16 GiB of RAM. All reported numbers are single-core computation costs. As the costly operations are inherently parallel, we believe our scheme scales linearly with the number of cores.
Let d be the degree of the RLWE polynomial, t the plaintext modulus, and q the ciphertext modulus. The polynomial degree and ciphertext modulus determine the multiplicative depth of the scheme. We follow the Homomorphic Encryption Security Standard guidelines (Albrecht et al., 2018) to choose (d, q) pairs that provide 128 bits of security. The plaintext modulus determines the input domain. Depending on the required multiplicative depth, we use one of the parameter sets described in Table 3.
Table 4 shows a microbenchmark of basic operations and the size of keys and ciphertexts. We use relinearization keys to support multiplication, and rotation keys to support some of our optimizations (see next section). Generating and communicating keys is expensive. Therefore, we assume that clients generate these keys once at setup and use them for all subsequent protocols.
| Operation / key | Set I | Set II | Set III |
| --- | --- | --- | --- |
| Plaintext mult. (ms) | 0.97 | 4.13 | 18.3 |
| Rotation by 1 (ms) | 2.16 | 10.8 | 57 |
| Public key (KB) | 512 | 2048 | 7680 |
| Relinearization key (MB) | 3 | 12 | 60 |
| Rotation key (MB) | 22 | 96 | 510 |
We explain how we optimize our implementation and limit the multiplicative depth of (some of) our algorithms to improve efficiency.
We use BFV in combination with the number-theoretic transform (NTT) (Smart and Vercauteren, 2014) so that a BFV ciphertext encodes a vector of d plaintext elements; BFV additions and multiplications act as element-wise vector operations. This batching enables single-instruction-multiple-data (SIMD) operations on encrypted scalars.
Performing operations between scalars in the same ciphertext (such as computing the sum or product of elements) requires modifying their position through rotations. Applying rotation requires rotation keys. It is possible to reduce the size of the rotation keys in exchange for more costly rotation operations. The sizes reported in Table 4 provide a balanced computation/communication trade-off.
The use of batching renders computing HE.IsZero infeasible. The exponentiation with t − 1 in HE.IsZero consumes a multiplicative depth of ⌈log₂(t − 1)⌉, so t must be small. To use batching, however, the plaintext modulus must be a prime with t ≡ 1 (mod 2d), and hence t > 2d. As batching does not support the small moduli required for HE.IsZero to work, we prioritize efficiency and focus our evaluation on variants that do not require zero detection. The parameters in Table 3 support batching.
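The congruence t ≡ 1 (mod 2d) is the standard BFV batching condition (it guarantees d plaintext slots). The sketch below searches for the smallest such prime for a given degree; it is our own helper for illustration, not Lattigo's parameter-selection code.

```python
def is_prime(n):
    # Deterministic Miller-Rabin for 64-bit integers.
    if n < 2:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if a % n == 0:
            continue
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def batching_modulus(degree):
    # Smallest prime t with t ≡ 1 (mod 2*degree), the congruence BFV
    # batching requires so that a ciphertext packs `degree` slots.
    t = 2 * degree + 1
    while not is_prime(t):
        t += 2 * degree
    return t
```

For instance, `batching_modulus(16)` returns the prime 97 = 3·32 + 1; for realistic degrees the returned modulus is necessarily larger than 2d, which is why batching rules out the small moduli HE.IsZero needs.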
The client’s query is small with respect to the capacity of batched ciphertexts, i.e., . We use two forms of replications to improve parallelization: powers and duplicates. When the domain is large, the client encodes powers of each element (e.g., ) into the query ciphertext to reduce the multiplicative depth of HE.IsIn, see the second variant in Algorithm 2. This optimization is not needed for small domains, since they do not use HE.IsIn in the base layer. Additionally, the client can encode duplicates of the full query (including powers, when they are in use).
Replication is straightforward when the client is semi-honest, but impacts security when the client is malicious. We enforce correct replication as follows: the server computes a malicious-check ciphertext that decrypts to zero when (1) all duplicates are equal and (2) every encoded power is consistent with the previous one (multiplying the previous power by the element yields the next). Otherwise, it decrypts to a uniformly random plaintext. The server destroys the result of misbehaving clients by adding the check to the final output of the protocol. The server needs to compute this check only once in a many-set scenario. We implemented this check and included its cost in all figures.
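The check can be sketched in the plaintext domain as randomized differences that all vanish for honest replication; the concrete form below (modulus T, consecutive-power convention, names) is our illustrative reconstruction, not the implemented circuit.

```python
import random

T = 65537  # illustrative prime plaintext modulus

def replication_check(query_copies):
    # Each copy is the client's vector of powers (x, x^2, ..., x^k).
    # The check accumulates randomized differences that all vanish iff
    # (1) every duplicate equals the first copy and (2) multiplying a
    # power by the element yields the next power. Any inconsistency
    # makes the check close to uniform over Z_T.
    check, ref = 0, query_copies[0]
    for copy in query_copies[1:]:           # (1) duplicates agree
        for a, b in zip(ref, copy):
            check = (check + random.randrange(T) * (a - b)) % T
    x = ref[0]
    for p, p_next in zip(ref, ref[1:]):     # (2) powers consistent
        check = (check + random.randrange(T) * (p * x - p_next)) % T
    return check
```

For an honestly replicated query such as two copies of (3, 9, 27), the check is exactly zero, so adding it to the protocol output changes nothing.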
When the server returns more than one value, e.g., in the case of naive aggregation, the batched answers may end up in different ciphertexts. The server repacks the answers to ensure that the smallest possible number of ciphertexts is transferred back to the client.
10. Many-Set PSM in Practice
To demonstrate our framework’s capability, we solve the chemical similarity and document search problems. We discuss matching in mobile apps in Appendix G.
10.1. Chemical Similarity
Recall from Section 2.1 that chemical similarity of compounds is determined by first computing a molecular fingerprint and then comparing these fingerprints (Stumpfe and Bajorath, 2011; Laufkötter et al., 2019; Xue et al., 2001; Willett et al., 1998; Cereto-Massagué et al., 2015; Muegge and Mukherjee, 2016). As fingerprints are short, we represent the molecular fingerprint vectors as small-domain sets. We use the Tversky similarity PSM algorithm Tv-PSM to compute whether the query compound is similar enough to each compound in the seller's database. We instantiate Tv-PSM with the optimized small-domain version of ePSI-CA.
We follow a popular configuration (Shimizu et al., 2015) in which compounds are represented by 166-bit MACCS fingerprints (Durant et al., 2002) and the Tversky parameters follow Shimizu et al. Processing these raw parameters as part of the Tv-PSM protocol (see Algorithm 5) leads to evaluating a concrete instance of the integer Tversky inequality of Section 6.
We evaluate two aggregation policies. To determine whether at least one compound in the database matches, we apply X-MS aggregation. To count the number of matching compounds, we apply CA-MS aggregation. In their naive implementation, these variants have high multiplicative depth, so we modify them to enable efficient deployment without relying on bootstrapping.
Existential Search. The X-MS protocol applied to a collection of k compounds requires a multiplicative depth of ⌈log₂ k⌉ to compute the product of the k matching statuses. This depth is too high for our parameters. Instead, we relax the output requirements and reduce the output as much as possible: for a fixed depth ℓ, we aggregate groups of 2^ℓ responses and return ⌈k / 2^ℓ⌉ encrypted scalars to the client. This relaxation reduces the privacy of X-MS – it is less than full X-MS, but better than CA-MS – at the gain of efficiency. If this reduced privacy is unacceptable, the client just needs to choose larger HE parameters.
Cardinality Search. We use the shuffling variant of the CA-MS protocol due to its lower multiplicative depth. When the server shuffles the PSM responses, as the indexes are ephemeral, the client only learns the number of interesting compounds. Shuffling encrypted SIMD values is hard, so the server instead shuffles the compounds (server sets) in plaintext before processing the query, to the same effect.
Our modifications increase the transfer cost from a single aggregated result to linear in the number of server sets, although the X-MS protocol sends fewer scalar values than CA-MS by the aggregation factor.
We evaluate our similarity search on the ChEMBL database (Mendez et al., 2019; Gaulton et al., 2012). With more than 2 million compounds, it is one of the largest public chemical databases. The ChEMBL database contains compounds in the SMILES format. We use the RDKit library (RDKit, online) to compute MACCS fingerprints from this format.
Figure 6 shows the computation and communication cost of running our protocols. We ran experiments with at most 256k compounds 5 times and larger experiments 3 times, and report the average cost. The standard errors of the mean are very small, so error bars are not visible.
For the X-MS protocol, we aggregate batches of PSM results into single scalars. As each ciphertext holds 32k scalars, the aggregated response fits in one ciphertext. The communication cost of CA-MS is 1 ciphertext per 32k compounds. The cost of the X-MS multiplications is insignificant compared to the cost of computing similarity; hence, the computation cost of both protocols is very close.
The client computation is less than a second, and the transfer cost of searching 2 million compounds is 12 MB for X-MS and 378 MB for CA-MS, so these protocols can be run on a thin client. The server, however, needs high computational power, as searching 2 million compounds requires 3.5 hours of single-core computation. The closest system to ours is that of Shimizu et al. (Shimizu et al., 2015). They report that their custom protocol requires 167 seconds of server computation, 172 seconds of client computation, and 265 MB of data to compute the number of matching compounds in a 1.3 million compound dataset, yielding a higher ratio of bandwidth consumed per compound and an increased computational load on the client compared to our protocols. Moreover, they use the curve secp192k1, which provides less than 100 bits of security, while we offer full 128-bit security. Their server privacy depends on a probabilistic differential-privacy approach; our CA-MS protocol provides cryptographic privacy guarantees. Finally, the X-MS protocol provides better privacy than CA-MS, and thus than Shimizu et al., despite not fully aggregating.
10.2. Peer-to-Peer Document Search
To implement peer-to-peer document search, we represent queries and documents by the set of keywords they contain. A single document, represented by the keyword set Y, is of interest to the querier if it contains all query keywords X, i.e., X ⊆ Y. This functionality can be implemented with the full matching (F-PSM) variant.
The client and the server must agree on how keywords are represented as set elements; we use hash functions for this conversion. As the domain of searchable keywords is too large for the bit-vector representation of the small-domain variants, we construct F-PSM from the large-domain PSI protocol. There are two sources of false-positive keyword matches in this approach.
False Positive of Mapping Words. The plaintext modulus t determines the input domain. Since t is small, multiple words can be mapped to the same element. This can lead to F-PSM claiming a match even though one of the colliding keywords is absent from the server's set. Since t impacts the multiplicative depth of our HE schemes, choosing a large value is impractical.
Instead of directly increasing t to reduce the false-positive rate, the client hashes each query keyword with h hash functions and encodes the h digests as separate scalar values, which reduces the per-keyword false-positive rate to roughly t^(−h). When running PSI-process, the server knows the corresponding hash function for each scalar value and hashes its keywords accordingly. Afterward, the server runs the F-PSM protocol on all PSI outputs together; F-PSM reports a match only if all hashed keywords are present in the document.
Using multiple hashes to reduce false positives does not impact privacy, as it is straightforward to simulate a query with h hashes using h F-PSM queries. This modification increases the computation cost and the number of scalar values in the query by a factor of h. However, there is no concrete change in the communication cost, as the client can still pack all scalars into one ciphertext.
False Positive of F-PSM. The F-PSM protocol itself has a small false-positive rate (see Section 6) caused by its internal randomness. An easy way to reduce this rate is to run σ instances of F-PSM with different randomness on each document and reveal all responses. This repetition reduces the false-positive rate of a single PSM response exponentially in σ, while increasing the computation cost by a factor of σ.
Aggregation. We explore two aggregation policies: existential aggregation (X-MS), to determine whether at least one document in the collection matches; and cardinality (CA-MS), to determine how many documents match. Since the multiplicative depth of F-PSM is very low, we can fully reduce the X-MS output to one encrypted scalar. For the CA-MS variant, we still use the shuffling variant for its lower multiplicative depth.
We use the homomorphic parameters for CA-MS and