1 Introduction
The ability to elicit information through automated scanning of personal texts has significant economic and societal value. Machine learning (ML) models for classification of text such as emails and SMS messages can be used to infer whether the author is depressed [42], suicidal [38], a terrorist threat [1], or whether the email is a spam message [2, 44]. Other valuable applications of text message classification include user profiling for tailored advertising [30], detection of hate speech [6], and detection of cyberbullying [46]. Some of the above are integrated in parental control applications (footnote 1: https://www.bark.us/, https://kidbridge.com/, https://www.webwatcher.com/) that monitor text messages on the phones of children and alert their parents when content related to drug use, sexting, suicide etc. is detected. Despite these clear benefits, giving applications access to one’s personal text messages and emails can easily lead to (un)intentional privacy violations.
In this paper, we propose the first privacy-preserving (PP) solution for text classification that is provably secure. To the best of our knowledge, there are no existing Differential Privacy (DP) or Secure Multiparty Computation (SMC) based solutions for PP feature extraction and classification of unstructured texts; the only existing method is based on Homomorphic Encryption (HE) and takes 19 minutes to classify a tweet [14] while leaking information about the text being classified. In our SMC-based solution, there are two parties, nicknamed Alice and Bob (see Fig. 1). Bob has a trained ML model that can automatically classify texts. Our secure text classification protocol allows a personal text written by Alice to be classified with Bob’s ML model in such a way that Bob does not learn anything about Alice’s text and Alice does not learn anything about Bob’s model. Our solution relies on PP protocols for feature extraction from text and PP machine learning model scoring, which we propose in this paper.
We perform end-to-end experiments with an application for PP detection of hate speech against women and immigrants in text messages. In this use case, Bob has a trained logistic regression (LR) or AdaBoost model that flags hateful texts based on the occurrence of particular words. LR models on word $n$-grams have been observed to perform comparably to more complex CNN and LSTM model architectures for hate speech detection [33]. Using our protocols, Bob can label Alice’s texts as hateful or not without learning which words occur in Alice’s texts, and Alice does not learn which words are in Bob’s hate speech lexicon, nor how these words are used in the classification process. Moreover, classification is done in seconds, which is two orders of magnitude better than the existing HE solution, despite the fact that we use over 20 times more features and do not leak any information about Alice’s text to the model owner (Bob). The solution based on HE leaks which words in the text are present in Bob’s lexicon [14].
We build our protocols using a privacy-preserving machine learning (PPML) framework based on SMC developed by us. All the building blocks are available and can be composed with each other or with new protocols added to the framework. (Footnote 2: A link to the code repository is omitted to respect the double-blind review process. It will be added in the final version of the paper.) While most of our building blocks already exist in the literature, the main contribution of this work consists of the careful choice of the ML techniques, feature engineering, and algorithmic and implementation optimizations to enable practical PP text classification. Additionally, we provide security definitions and proofs for our proposed protocols.
2 Preliminaries
We consider honest-but-curious adversaries, as is common in SMC-based PPML (see e.g. [18, 20]). An honest-but-curious adversary follows the instructions of the protocol, but tries to gather additional information. Secure protocols prevent the latter.
We perform SMC using additive secret sharing to do computations modulo an integer $q$. A value $x$ is secret shared over $\mathbb{Z}_q = \{0, 1, \ldots, q-1\}$ between parties Alice and Bob by picking $x_A, x_B \in \mathbb{Z}_q$ uniformly at random subject to the constraint that $x = x_A + x_B \bmod q$, and then revealing $x_A$ to Alice and $x_B$ to Bob. We denote this secret sharing by $[\![x]\!]_q$, which can be thought of as a shorthand for $(x_A, x_B)$. Secret-sharing-based SMC works by first having the parties split their respective inputs in secret shares and send some of these shares to each other. Naturally, these inputs have to be mapped appropriately to $\mathbb{Z}_q$. Next, Alice and Bob represent the function they want to compute securely as a circuit consisting of addition and multiplication gates. Alice and Bob will perform secure additions and multiplications, gate by gate, over the shares until the desired outcome is obtained. The final result can be recovered by combining the final shares, and disclosed as intended, i.e. to one of the parties or to both. It is also possible to keep the final result distributed over shares.
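The sharing and reconstruction steps described above can be illustrated with a minimal Python sketch. The function names `share` and `reconstruct` and the modulus value are assumptions made for this example only; they are not taken from the framework used in the paper.

```python
import secrets

def share(x, q):
    """Split x into two additive shares modulo q (one for Alice, one for Bob)."""
    x_a = secrets.randbelow(q)   # Alice's share is uniformly random in Z_q
    x_b = (x - x_a) % q          # Bob's share is fixed by x = x_a + x_b mod q
    return x_a, x_b

def reconstruct(x_a, x_b, q):
    """Recombine both shares to recover the secret."""
    return (x_a + x_b) % q

q = 2**16                        # illustrative modulus, not the paper's value
x_a, x_b = share(1234, q)
print(reconstruct(x_a, x_b, q))  # 1234
```

Each share on its own is uniformly distributed over $\mathbb{Z}_q$, so neither party learns anything about the secret from its own share.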
In SMC-based text classification, as illustrated in Fig. 1, Alice’s input is a personal text $x$ and Bob’s input is an ML model $\mathcal{M}$ for text classification. The function that they want to compute securely is $f(x, \mathcal{M}) = \mathcal{M}(x)$, i.e. the class label of $x$ when classified by $\mathcal{M}$. To this end, Alice splits the text in secret shares while Bob splits the ML model in secret shares. Both parties engage in a protocol in which they send some of the input shares to each other, do local computations on the shares, and repeat this process in an iterative fashion over shares of intermediate results (Step 1). At the end of the joint computations, Alice sends her share of the computed class label to Bob (Step 2), who combines it with his share to learn the classification result (Step 3). As mentioned above, the protocol for Step 1 involves representing the function $f$ as a circuit of addition and multiplication gates.
Given two secret sharings $[\![x]\!]_q$ and $[\![y]\!]_q$, Alice and Bob can locally compute a secret sharing $[\![z]\!]_q$ corresponding to $z = x + y$ or $z = x - y$ by simply adding/subtracting their local shares of $x$ and $y$ modulo $q$. Given a constant $c$, they can also locally compute a secret sharing $[\![z]\!]_q$ corresponding to $z = cx$ or $z = x + c$: in the former case Alice and Bob just multiply their local shares of $x$ by $c$; in the latter case Alice adds $c$ to her share of $x$ while Bob keeps his original share. These local operations will be denoted by $[\![z]\!]_q \leftarrow [\![x]\!]_q + [\![y]\!]_q$, $[\![z]\!]_q \leftarrow [\![x]\!]_q - [\![y]\!]_q$, $[\![z]\!]_q \leftarrow c[\![x]\!]_q$ and $[\![z]\!]_q \leftarrow [\![x]\!]_q + c$, respectively. To allow for efficient secure multiplication of values via operations on their secret shares (denoted by $[\![z]\!]_q \leftarrow [\![x]\!]_q [\![y]\!]_q$), we use a trusted initializer that pre-distributes correlated randomness to the parties participating in the protocol before the start of Step 1 in Fig. 1. (Footnote 3: This technique for secure multiplication was originally proposed by Beaver [7] and is regularly used to enable very efficient solutions both in the context of PPML [19, 16, 31, 18] as well as in other applications, e.g., [43, 26, 25, 36, 45, 17].) The initializer is not involved in any other part of the execution and does not learn any data from the parties. This can be straightforwardly extended to efficiently perform secure multiplication of secret shared matrices. The protocol for secure multiplication of secret shared matrices is denoted by $\pi_{DMM}$ and for the special case of inner-product computation by $\pi_{IP}$. Details about the (matrix) multiplication protocol can be found in [18]. We note that if a trusted initializer is not available or desired, Alice and Bob can engage in pre-computations to securely emulate the role of the trusted initializer, at the cost of introducing computational assumptions in the protocol [18].
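As an illustration of the local operations and of multiplication with pre-distributed correlated randomness, the following Python sketch simulates both parties and the trusted initializer in a single process. It is only a sketch under stated assumptions: the modulus `Q` and all function names are hypothetical, a sharing is represented as a pair (Alice's share, Bob's share), and the two opened values are computed directly rather than exchanged over a network.

```python
import secrets

Q = 2**16  # hypothetical modulus q, chosen only for illustration

def share(x):
    """Additively secret share x over Z_Q as (Alice's share, Bob's share)."""
    x_a = secrets.randbelow(Q)
    return x_a, (x - x_a) % Q

def reveal(s):
    """Open a sharing by adding both shares modulo Q."""
    return (s[0] + s[1]) % Q

def add(x, y):          # [[x]] + [[y]]: each party adds its own shares
    return (x[0] + y[0]) % Q, (x[1] + y[1]) % Q

def sub(x, y):          # [[x]] - [[y]]: each party subtracts its own shares
    return (x[0] - y[0]) % Q, (x[1] - y[1]) % Q

def mul_const(c, x):    # c[[x]]: both parties scale their share by c
    return (c * x[0]) % Q, (c * x[1]) % Q

def add_const(x, c):    # [[x]] + c: only Alice adds c to her share
    return (x[0] + c) % Q, x[1]

def trusted_initializer():
    """Pre-distribute sharings of random u, v and of their product w = u*v."""
    u, v = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(u), share(v), share(u * v % Q)

def secure_mul(x, y):
    """Multiply two shared values using one pre-distributed triple."""
    u, v, w = trusted_initializer()
    d = reveal(sub(x, u))   # x - u is opened; the random u masks x
    e = reveal(sub(y, v))   # y - v is opened; the random v masks y
    # x*y = (d + u)(e + v) = w + d*v + e*u + d*e
    z = add(w, add(mul_const(d, v), mul_const(e, u)))
    return add_const(z, d * e % Q)

print(reveal(secure_mul(share(7), share(9))))  # 63
```

Only the masked differences `d` and `e` are ever opened, which is why the multiplication reveals nothing about `x` or `y` as long as the triple is used once.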
3 Secure text classification
Our general protocol for PP text classification relies on several building blocks that are used together to accomplish Step 1 in Fig. 1: a secure equality test, a secure comparison test, private feature extraction, secure protocols for converting between secret sharing modulo 2 and modulo $q$, and private classification protocols. Several of these building blocks have been proposed in the past. However, to the best of our knowledge, this is the very first time they are combined in order to achieve efficient text classification with provable security.
We assume that Alice has a personal text message, and that Bob has an LR or AdaBoost classifier that is trained on unigrams and bigrams as features. Alice constructs the set $A = \{a_1, a_2, \ldots, a_m\}$ of unigrams and bigrams occurring in her message, and Bob constructs the set $B = \{b_1, b_2, \ldots, b_n\}$ of unigrams and bigrams that occur as features in his ML model. We assume that all $a_i$ and $b_j$ are in the form of bit strings. To achieve this, Alice and Bob convert each unigram and bigram on their end to a number $s$ using SHA-224 [40], strictly for its ability to map the same inputs to the same outputs in a pseudorandom manner. Next, Alice and Bob map each $s$ on their end to a number between $0$ and $2^l - 1$, i.e. a bit string of length $l$, using a random function in the universal hash family proposed by Carter and Wegman [11]. (Footnote 4: The hash function is defined as $h(s) = ((a \cdot s + b) \bmod p) \bmod 2^l$, where $p$ is a prime and $a$ and $b$ are random numbers less than $p$. In our experiments, $p$, $a$, and $b$ are fixed constants.) In the remainder we use the term “word” to refer to a unigram or bigram, and we refer to the set $B$ as Bob’s lexicon.
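The mapping from words to bit strings can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the prime `P`, the coefficients `A_COEF` and `B_COEF`, and the whitespace tokenization in `grams` are assumptions made for this example. The bit length of 17 is the value reported later in the paper for unigrams and bigrams.

```python
import hashlib

P = 2**61 - 1        # hypothetical prime p (a Mersenne prime), not the paper's value
A_COEF = 972_123     # hypothetical random coefficient a < p
B_COEF = 52_097      # hypothetical random coefficient b < p
L_BITS = 17          # bit-string length l

def grams(text):
    """Return the set of lowercase word unigrams and bigrams of a text."""
    tokens = text.lower().split()
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return unigrams | bigrams

def word_to_bits(word, l=L_BITS):
    """Map a word to an l-bit string: SHA-224 first, then h(s) = ((a*s + b) mod p) mod 2^l."""
    digest = hashlib.sha224(word.encode("utf-8")).digest()
    s = int.from_bytes(digest, "big")            # SHA-224 output read as an integer
    h = ((A_COEF * s + B_COEF) % P) % (2**l)     # Carter-Wegman style hash
    return format(h, f"0{l}b")

bit_strings = {word_to_bits(w) for w in grams("You are not welcome here")}
print(len(bit_strings) <= 9)  # True: 5 unigrams + 4 bigrams, possibly fewer after collisions
```

Because both parties apply the same deterministic mapping, a word in Alice's message and the same word in Bob's lexicon always produce identical bit strings.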
Below we outline the protocols for PP text classification. A correctness and security analysis of the protocols is provided as an appendix. In the description of the protocols in this paper, we assume that Bob needs to learn the result of the classification, i.e. the class label, at the end of the computations. It is important to note that the protocols described below can be straightforwardly adjusted to a scenario where Alice instead of Bob has to learn the class label, or even to a scenario where neither Alice nor Bob should learn what the class label is and instead it should be revealed to a third party or kept in a secret sharing form. All these scenarios might be relevant use cases of PP text classification, depending on the specific application at hand.
3.1 Cryptographic building blocks
Secure Equality Test:
At the start of the secure equality test protocol, Alice and Bob have secret shares of two bit strings $x = x_l \cdots x_1$ and $y = y_l \cdots y_1$ of length $l$. $x$ corresponds to a word from Alice’s message and $y$ corresponds to a feature from Bob’s model. The bit strings $x$ and $y$ are secret shared over $\mathbb{Z}_2$. Alice and Bob follow the protocol to determine whether $x = y$. The protocol outputs a secret sharing of 1 if $x = y$ and of 0 otherwise.
Protocol $\pi_{EQ}$:
For $i = 1, \ldots, l$, Alice and Bob locally compute $[\![r_i]\!]_2 \leftarrow [\![x_i]\!]_2 + [\![y_i]\!]_2 + 1$.
Alice and Bob use secure multiplication to compute a secret sharing of $z = r_1 \cdot r_2 \cdot \ldots \cdot r_l$. If $x = y$, then $r_i = 1$ for all bit positions $i$, hence $z = 1$; otherwise some $r_i = 0$ and therefore $z = 0$. The result is the secret sharing $[\![z]\!]_2$, which is the desired output of the protocol.
This protocol for equality test is folklore in the field of SMC. The multiplications can be organized as a binary tree with the result of the multiplication at the root of the tree. In this way, the presented protocol has $\lceil \log_2 l \rceil$ rounds. While there are equality test protocols that have a constant number of rounds, the constant is prohibitively large for the parameters used in our implementation.
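The equality test can be simulated in a few lines of Python. The sketch below assumes that shares over $\mathbb{Z}_2$ are represented as pairs of bits (so addition is XOR and multiplication is AND), and it simulates the trusted initializer and both parties in one process; all function names are hypothetical.

```python
import secrets

def share_bit(b):
    """Secret share a bit over Z_2 as (Alice's share, Bob's share)."""
    a = secrets.randbelow(2)
    return a, b ^ a

def share_bits(bits):
    """Secret share every bit of a bit string such as '10110'."""
    return [share_bit(int(b)) for b in bits]

def reveal_bit(s):
    return s[0] ^ s[1]

def and_gate(x, y):
    """Secure multiplication over Z_2 with a pre-distributed triple (u, v, u*v)."""
    u, v = secrets.randbelow(2), secrets.randbelow(2)
    su, sv, sw = share_bit(u), share_bit(v), share_bit(u & v)
    d = reveal_bit((x[0] ^ su[0], x[1] ^ su[1]))   # opened x + u
    e = reveal_bit((y[0] ^ sv[0], y[1] ^ sv[1]))   # opened y + v
    z_a = sw[0] ^ (d & sv[0]) ^ (e & su[0]) ^ (d & e)
    z_b = sw[1] ^ (d & sv[1]) ^ (e & su[1])
    return z_a, z_b

def secure_equality(xs, ys):
    """Return a sharing of 1 if the shared bit strings are equal, else of 0."""
    # r_i = x_i + y_i + 1 mod 2; only Alice adds the constant 1
    r = [(x[0] ^ y[0] ^ 1, x[1] ^ y[1]) for x, y in zip(xs, ys)]
    while len(r) > 1:                               # one tree level per round
        nxt = [and_gate(r[i], r[i + 1]) for i in range(0, len(r) - 1, 2)]
        if len(r) % 2:
            nxt.append(r[-1])                       # odd element moves up unchanged
        r = nxt
    return r[0]

print(reveal_bit(secure_equality(share_bits("10110"), share_bits("10110"))))  # 1
print(reveal_bit(secure_equality(share_bits("10110"), share_bits("10010"))))  # 0
```

The `while` loop makes the round structure explicit: all multiplications at one level of the tree are independent of each other and can be executed in the same communication round.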
Secure Feature Vector Extraction:
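At the start of the feature extraction protocol, Alice has a set $A = \{a_1, a_2, \ldots, a_m\}$ and Bob has a set $B = \{b_1, b_2, \ldots, b_n\}$. $A$ is a set of bit strings that represent Alice’s text, and $B$ is a set of bit strings that represent Bob’s lexicon. Bob would like to extract words from Alice’s text that appear in his lexicon. At the end of the protocol, Alice and Bob have secret shares of a binary feature vector $x$ which represents what words in Bob’s lexicon appear in Alice’s text. The binary feature vector $x$ of length $n$ is defined as
$$x_j = \begin{cases} 1 & \text{if } b_j \in A \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
Protocol $\pi_{FE}$:
Alice and Bob secret share each $a_i$ ($i = 1, \ldots, m$) and each $b_j$ ($j = 1, \ldots, n$) with each other.
For $j = 1, \ldots, n$: // Computation of secret shares of $x_j$ as defined in Equation (1).
  For $i = 1, \ldots, m$:
    Alice and Bob run the secure equality test protocol $\pi_{EQ}$ to compute secret shares of
$$x_{ij} = \begin{cases} 1 & \text{if } a_i = b_j \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
  Alice and Bob locally compute the secret share $[\![x_j]\!]_2 \leftarrow \sum_{i=1}^{m} [\![x_{ij}]\!]_2$.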
The secure feature vector extraction can be seen as a private set intersection where the intersection is not revealed but shared [12, 29]. Our solution is tailored to be used within our PPML framework (it uses only binary operations, it is secret sharing based, and is based on pre-distributed binary multiplications). In principle, other protocols could be used here. The efficiency of our protocol can be improved by using hashing techniques [41] at the cost of introducing a small probability of error. The improvements due to hashing are asymptotic and for the parameters used in our fastest running protocol these improvements were not noticeable. Thus, we restricted ourselves to the original protocol without hashing and without any probability of failure.
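To make the loop structure of the feature extraction concrete, the following Python sketch computes the same feature vector in the clear, i.e. without any secret sharing. It is meant only as a reference for what the secure protocol outputs; the function name is hypothetical. The sum over $\mathbb{Z}_2$ is written as XOR, which is correct here because the elements of Alice's set are distinct, so at most one equality test per lexicon word can return 1.

```python
def feature_vector(alice_set, bob_lexicon):
    """Plaintext reference for the binary feature vector x of Equation (1)."""
    alice = list(dict.fromkeys(alice_set))   # enforce distinct elements, keep order
    x = []
    for b_j in bob_lexicon:                  # one feature per lexicon entry
        x_j = 0
        for a_i in alice:                    # one equality test per pair (a_i, b_j)
            x_j ^= int(a_i == b_j)           # sum of the x_ij modulo 2
        x.append(x_j)
    return x

print(feature_vector(["101", "011"], ["011", "111", "101"]))  # [1, 0, 1]
```

The nested loops perform $m \cdot n$ equality tests, which is the cost that the bucketing optimization discussed in Section 4.2 aims to reduce.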
Secure Comparison Test:
In our privacy-preserving AdaBoost classifier we will use a secure comparison protocol as a building block. At the start of the secure comparison test protocol, Alice and Bob have secret shares over $\mathbb{Z}_2$ of two bit strings $x = x_l \cdots x_1$ and $y = y_l \cdots y_1$ of length $l$. They run the secure comparison protocol $\pi_{DC}$ of Garay et al. [32] with secret sharings over $\mathbb{Z}_2$ and obtain a secret sharing of 1 if $x \geq y$ and of 0 otherwise.
Secure Conversion between $\mathbb{Z}_q$ and $\mathbb{Z}_2$:
Some of our building blocks perform computations using secret shares over $\mathbb{Z}_2$ (secure equality test, comparison and feature extraction), while the secure inner product works over $\mathbb{Z}_q$ for $q > 2$. In order to be able to integrate these building blocks we need:
A secure bit-decomposition protocol $\pi_{decomp}$ for secure conversion from $\mathbb{Z}_q$ to $\mathbb{Z}_2$: Alice and Bob have as input a secret sharing $[\![x]\!]_q$ and, without learning any information about $x$, they should obtain as output secret sharings $[\![x_i]\!]_2$, where $x_l \cdots x_1$ is the binary representation of $x$. We use the secure bit-decomposition protocol from De Cock et al. [18].
A protocol for secure conversion from $\mathbb{Z}_2$ to $\mathbb{Z}_q$: Alice and Bob have as input a secret sharing $[\![x]\!]_2$ of a bit $x$ and need to obtain a secret sharing $[\![x]\!]_q$ of the binary value over a larger field $\mathbb{Z}_q$ without learning any information about $x$. To this end, we use protocol $\pi_{2toQ}$:
  For the input $[\![x]\!]_2$, let $x_A \in \{0, 1\}$ denote Alice’s share and $x_B \in \{0, 1\}$ denote Bob’s share.
  Alice creates a secret sharing $[\![x_A]\!]_q$ by picking uniformly random shares that sum to $x_A$ and delivers Bob’s share to him, and Bob proceeds similarly to create $[\![x_B]\!]_q$.
  Alice and Bob compute $[\![y]\!]_q \leftarrow [\![x_A]\!]_q [\![x_B]\!]_q$.
  The output is computed as $[\![z]\!]_q \leftarrow [\![x_A]\!]_q + [\![x_B]\!]_q - 2[\![y]\!]_q$.
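The conversion from $\mathbb{Z}_2$ to $\mathbb{Z}_q$ rests on the identity $x_A \oplus x_B = x_A + x_B - 2 x_A x_B$ for bits. The following Python sketch simulates the four steps in one process; the modulus `Q`, the function names, and the in-process trusted initializer inside `secure_mul` are assumptions made for illustration.

```python
import secrets

Q = 2**16  # hypothetical modulus q > 2

def share(x):
    """Additively secret share x over Z_Q."""
    a = secrets.randbelow(Q)
    return a, (x - a) % Q

def reveal(s):
    return (s[0] + s[1]) % Q

def secure_mul(x, y):
    """Secure multiplication over Z_Q with a pre-distributed triple (u, v, u*v)."""
    u, v = secrets.randbelow(Q), secrets.randbelow(Q)
    su, sv, sw = share(u), share(v), share(u * v % Q)
    d = (x[0] - su[0] + x[1] - su[1]) % Q    # opened x - u
    e = (y[0] - sv[0] + y[1] - sv[1]) % Q    # opened y - v
    z_a = (sw[0] + d * sv[0] + e * su[0] + d * e) % Q
    z_b = (sw[1] + d * sv[1] + e * su[1]) % Q
    return z_a, z_b

def convert_2_to_q(x_a, x_b):
    """Turn the Z_2 sharing (x_a, x_b) of a bit into a Z_Q sharing of the same bit."""
    s_a = share(x_a)             # created by Alice from her bit share
    s_b = share(x_b)             # created by Bob from his bit share
    y = secure_mul(s_a, s_b)     # sharing of x_a * x_b
    # z = x_a + x_b - 2 * x_a * x_b, which equals x_a XOR x_b for bits
    return ((s_a[0] + s_b[0] - 2 * y[0]) % Q,
            (s_a[1] + s_b[1] - 2 * y[1]) % Q)

for x_a in (0, 1):
    for x_b in (0, 1):
        print(x_a ^ x_b, reveal(convert_2_to_q(x_a, x_b)))  # both columns agree
```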

Secure Logistic Regression (LR) Classification:
At the start of the secure LR classification protocol, Bob has a trained LR model $\mathcal{M}$ that requires a feature vector $x$ of length $n$ as its input, and produces a label $\mathcal{M}(x)$ as its output. Alice and Bob have secret shares of the feature vector $x$ which represents what words in Bob’s lexicon appear in Alice’s text. At the end of the protocol, Bob gets the result of the classification $\mathcal{M}(x)$. We use an existing protocol $\pi_{LR}$ for secure classification with LR models [18]. (Footnote 5: In our case the result of the classification is disclosed to Bob (the party that owns the model) instead of Alice (who has the original input to be classified) as in [18]; however, it is trivial to modify their protocol so that the final secret share is opened towards Bob instead of Alice. Note also that in our case, the feature vector $x$ that is used for the classification is already secret shared between Alice and Bob, while in their protocol Alice holds the feature vector, which is then secret shared in the first step of the protocol. This modification is also trivial and does not affect the security of the protocol.)
Secure AdaBoost Classification:
The setting is the same as above, but the model is an AdaBoost ensemble of decision stumps instead of an LR model. While efficient solutions for secure classification with tree ensembles were previously known [31], we can take advantage of specific facts about our use case to obtain a more efficient solution. In more detail, in our use case: (1) all the decision trees have depth 1 (i.e., decision stumps); (2) each feature $x_j$ is binary and therefore when it is used in a decision node, the left and right children correspond exactly to $x_j = 0$ and $x_j = 1$; (3) the output class is binary; (4) the feature values were extracted in a PP way and are secret shared so that no party alone knows their values. We can use the above facts in order to perform the AdaBoost classification by computing two inner products and then comparing their values.
Protocol $\pi_{AB}$:
Alice and Bob hold secret sharings $[\![x_j]\!]_q$ of each of the $n$ binary features $x_j$. Bob holds the trained AdaBoost model which consists of two weighted probability vectors $y = (y_{1,0}, y_{1,1}, \ldots, y_{n,0}, y_{n,1})$ and $z = (z_{1,0}, z_{1,1}, \ldots, z_{n,0}, z_{n,1})$. For the $i$-th decision stump: $y_{i,k}$ is the weighted probability (i.e., a probability multiplied by the weight of the $i$-th decision stump) that the model assigns to the output class being 0 if $x_i = k$, and $z_{i,k}$ is defined similarly for the output class 1 (see Fig. 2).
Bob secret shares the elements of $y$ and $z$, and Alice and Bob locally compute secret sharings $[\![w]\!]_q$ of the vector $w = (1 - x_1, x_1, 1 - x_2, x_2, \ldots, 1 - x_n, x_n)$.
Using the secure inner product protocol $\pi_{IP}$, Alice and Bob compute secret sharings of the inner product $p_0$ between $y$ and $w$, and of the inner product $p_1$ between $z$ and $w$. $p_0$ and $p_1$ are the aggregated votes for class label 0 and 1 respectively.
Alice and Bob use $\pi_{decomp}$ to compute bitwise secret sharings of $p_0$ and $p_1$ over $\mathbb{Z}_2$.
Alice and Bob use $\pi_{DC}$ to compare $p_1$ and $p_0$, getting as output a secret sharing of the output class $c$, which is then opened towards Bob.
To the best of our knowledge, this is the most efficient provably secure protocol for binary classification over binary input features with an ensemble of decision stumps.
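The reduction of AdaBoost classification with decision stumps to two inner products and one comparison can be checked in the clear with a short Python sketch. The function name, the integer weights (standing in for fixed-point encoded weighted probabilities), and the tie-breaking rule that outputs class 1 when both votes are equal are assumptions made for this example.

```python
def adaboost_predict(x, y, z):
    """Plaintext reference: classify binary features x with stump weights y and z.

    y and z hold two entries per stump: the weighted vote for class 0 (y) or
    class 1 (z) when the stump's feature equals 0, followed by the vote when it
    equals 1.
    """
    # Selector vector: for each feature, (1 - x_j, x_j) picks the matching branch
    w = [v for x_j in x for v in (1 - x_j, x_j)]
    p0 = sum(a * b for a, b in zip(y, w))   # aggregated vote for class 0
    p1 = sum(a * b for a, b in zip(z, w))   # aggregated vote for class 1
    return int(p1 >= p0)                    # assumed tie-breaking: class 1 on equality

y = [3, 1, 2, 4]   # hypothetical votes for class 0
z = [1, 3, 2, 0]   # hypothetical votes for class 1
print(adaboost_predict([1, 0], y, z))  # 1  (p0 = 3, p1 = 5)
print(adaboost_predict([0, 1], y, z))  # 0  (p0 = 7, p1 = 1)
```

In the secure protocol the same arithmetic is carried out on secret shares, so neither the selector vector nor the two aggregated votes are revealed to either party.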
3.2 Privacy-preserving classification of personal text messages
We now present our novel protocols for PP text classification. They result from combining the cryptographic building blocks we introduced previously. The PP protocol for classifying the text using a logistic regression model works as follows:
Protocol $\pi_{TC-LR}$:
Alice and Bob execute the secure feature extraction protocol $\pi_{FE}$ with input sets $A$ and $B$ in order to obtain the secret shares $[\![x_j]\!]_2$ of the feature vector $x$.
They run the protocol $\pi_{2toQ}$ to obtain shares $[\![x_j]\!]_q$ over $\mathbb{Z}_q$.
Alice and Bob run the secure logistic regression classification protocol $\pi_{LR}$ in order to get the result of the classification. The LR model $\mathcal{M}$ is given as input to $\pi_{LR}$ by Bob, and the secret shared feature vector $[\![x]\!]_q$ by both of them. Bob gets the result of the classification $\mathcal{M}(x)$.
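For reference, the functionality that the LR-based protocol computes can be written in the clear as follows. This Python sketch makes several assumptions that are not stated in the text: the LR decision is taken with the usual 0.5 threshold on the sigmoid, which is equivalent to testing whether the linear score is non-negative, and the weights and bias are hypothetical integers standing in for fixed-point encoded coefficients.

```python
def classify_lr(alice_words, lexicon, weights, bias):
    """Plaintext reference for text classification with an LR model."""
    # Steps 1 and 2: binary feature vector (same bits, whether held mod 2 or mod q)
    x = [int(b_j in alice_words) for b_j in lexicon]
    # Step 3: linear score; sigmoid(score) >= 0.5 exactly when score >= 0
    score = bias + sum(w_j * x_j for w_j, x_j in zip(weights, x))
    return int(score >= 0)

lexicon = ["b", "c"]      # hypothetical lexicon
weights = [3, -5]         # hypothetical coefficients
print(classify_lr({"a", "b"}, lexicon, weights, bias=-1))  # 1  (score = 2)
print(classify_lr({"c"}, lexicon, weights, bias=-1))       # 0  (score = -6)
```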
The privacy-preserving protocol for classifying the text using AdaBoost works as follows:
Protocol $\pi_{TC-AB}$:
Alice and Bob execute the secure feature extraction protocol $\pi_{FE}$ with input sets $A$ and $B$ in order to obtain the secret shares $[\![x_j]\!]_2$ of the feature vector $x$.
They run the protocol $\pi_{2toQ}$ to obtain shares $[\![x_j]\!]_q$ over $\mathbb{Z}_q$.
Alice and Bob run the secure AdaBoost classification protocol $\pi_{AB}$ to obtain the result of the classification. The secret shared feature vector $[\![x]\!]_q$ is given as input to $\pi_{AB}$ by both of them, and the two weighted probability vectors $y$ and $z$ that constitute the model are specified by Bob. Bob gets the output class $c$.
Detailed proofs of security are presented in the appendix.
4 Experimental results
We evaluate the proposed protocols in a use case for the detection of hate speech in short text messages, using data from [6]. The corpus consists of 10,000 tweets, 60% of which are annotated as hate speech against women or immigrants. We convert all characters to lowercase, and turn each tweet into a set of word unigrams and bigrams. There are 29,853 distinct unigrams and 93,629 distinct bigrams in the dataset, making for a total of 123,482 features.
Accuracy results for a variety of models trained to classify a tweet as hate speech vs. non-hate speech are presented in Table 1. The models are evaluated using 5-fold cross-validation over the entire corpus of 10,000 tweets. The top rows in Table 1 correspond to tree ensemble models consisting of 50, 200, and 500 decision stumps respectively; the root of each stump corresponds to a feature. The bottom rows contain results for an LR model trained on 50, 200, and 500 features (preselected based on information gain), and an LR model trained on all features. We ran experiments for feature sets consisting of unigrams and bigrams, as well as for feature sets consisting of unigrams only, observing that the inclusion of bigrams leads to a small improvement in accuracy. Note that designing a model to obtain the highest possible accuracy is not the focus of this paper. Instead, our goal is to demonstrate that PP text classification based on SMC is feasible in practice.
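Table 1: Accuracy (Acc) and running time in seconds for feature extraction (Extr), classification (Class), and the overall process (Tot), for unigrams only (Uni) and for unigrams plus bigrams (Uni+Bi).
| Model | Uni Acc | Uni Extr (s) | Uni Class (s) | Uni Tot (s) | Uni+Bi Acc | Uni+Bi Extr (s) | Uni+Bi Class (s) | Uni+Bi Tot (s) |
| Ada; 50 trees; depth 1 | 71.6% | 0.8 | 6.4 | 7.2 | 73.3% | 1.5 | 6.6 | 8.1 |
| Ada; 200 trees; depth 1 | 73.0% | 2.8 | 6.4 | 9.2 | 74.2% | 9.4 | 6.6 | 16.0 |
| Ada; 500 trees; depth 1 | 73.9% | 6.6 | 6.7 | 13.3 | 74.4% | 21.6 | 6.7 | 28.3 |
| Logistic regression (50 feat.) | 72.4% | 0.8 | 3.7 | 4.5 | 73.8% | 1.5 | 3.8 | 5.3 |
| Logistic regression (200 feat.) | 73.3% | 2.8 | 3.7 | 6.5 | 73.7% | 9.4 | 3.8 | 13.2 |
| Logistic regression (500 feat.) | 73.4% | 6.6 | 3.8 | 10.4 | 74.2% | 21.6 | 4.1 | 25.7 |
| Logistic regression (all feat.) | 73.1% | 318.0 | 6.1 | 324.1 | 73.8% | 5,371.9 | 24.9 | 5,396.8 |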
We implemented the protocols from Section 3 in Java and ran experiments on AWS c5.9xlarge machines with 36 vCPUs, 72.0 GiB Memory. Each of the parties ran on separate machines (connected with a Gigabit Ethernet network), which means that the results in Table 1 cover communication time in addition to computation time. Each runtime experiment was repeated 3 times and average results are reported. In Table 1 we report the time (in sec) needed for converting a tweet into a feature vector (Extr), for classification of the feature vector (Class), and for the overall process (Tot).
4.1 Analysis
The best running time was obtained using unigrams, 50 features and logistic regression (4.5s) with an accuracy of 72.4%. The highest accuracy (74.4%) was obtained by using unigrams and bigrams, 500 features and AdaBoost with a running time equal to 28.3s. From these results, it is clear that feature engineering plays a major role in optimizing privacy-preserving machine learning solutions based on SMC. We managed to reduce the running time from 5,396.8s (logistic regression, unigrams and bigrams, all 123,482 features being used) to 5.3s (logistic regression, unigrams and bigrams, 50 features) without any loss in accuracy and to 4.5s (logistic regression, unigrams only, 50 features) with a small loss.
4.2 Optimizing the computational and communication complexities
The feature extraction protocol requires $m \cdot n$ secure equality tests of bit strings. The equality test relies on secure multiplication, which is the more expensive operation. To reduce the number of required equality tests, Alice and Bob can each first map their bit strings to buckets $A_1, \ldots, A_k$ and $B_1, \ldots, B_k$ respectively, so that bit strings from each $A_i$ need to only be compared with bit strings from $B_i$. Each bit string $a_i$ and $b_j$ is hashed and the first $t$ bits of the hash output are used to define the bucket number corresponding to that bit string, using a total of $k = 2^t$ buckets. In order not to leak how many elements are mapped to each bucket (which can leak some information about the probability distribution of the elements, as the hash function is known by everyone), each bucket has a fixed number of elements ($s_1$ for Bob’s buckets and $s_2$ for Alice’s buckets) and the empty spots in the buckets are filled up with dummy elements. The feature extraction protocol now requires $k \cdot s_1 \cdot s_2$ equality tests, which can be substantially smaller than $m \cdot n$. When using bucketization, the feature vector of length $n$ from (1) is expanded to a feature vector of length $k \cdot s_1$, containing the original features as well as the dummy features that Bob created to fill up his buckets. These dummy features do not have any effect on the accuracy of the classification because Bob’s model does not take them into account: the trees with dummy features in an AdaBoost model have 0 weight for both class labels, and the dummy features’ coefficients in an LR model are always 0.
The size of the buckets has to be chosen sufficiently large to avoid overflow. The choice depends directly on the number of buckets $k$ (which is kept constant for Alice and Bob) and the number of elements to be placed in the buckets, i.e. $n$ elements on Bob’s side and $m$ elements on Alice’s side. While for hash functions coming from a 2-universal family of hash functions the computation of these probabilities is relatively straightforward, the same is not true for more complicated hash functions [41]. In that case, numerical simulations are needed in order to bound the required probability.
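The bucketing step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: SHA-224 is used as the bucketing hash only for convenience, the dummy markers are hypothetical placeholders (in a real run Alice's and Bob's dummy elements would have to be bit strings that never match each other or any real element), and the helper `equality_tests` simply counts the comparisons made when every pair of matching buckets is compared element by element.

```python
import hashlib

def bucketize(bit_strings, t, size, dummy):
    """Place bit strings into 2**t buckets with exactly `size` slots each."""
    buckets = [[] for _ in range(2**t)]
    for s in bit_strings:
        digest = hashlib.sha224(s.encode("ascii")).digest()
        index = int.from_bytes(digest, "big") >> (224 - t)   # first t bits of the hash
        buckets[index].append(s)
    for bucket in buckets:
        if len(bucket) > size:
            raise OverflowError("bucket overflow: choose a larger bucket size")
        bucket.extend([dummy] * (size - len(bucket)))        # pad with dummy elements
    return buckets

def equality_tests(t, bob_size, alice_size):
    """Number of equality tests when each bucket pair is compared in full."""
    return 2**t * bob_size * alice_size

bob_buckets = bucketize(["0101", "1100", "0011"], t=2, size=3, dummy="bob-dummy")
print(len(bob_buckets), {len(b) for b in bob_buckets})  # 4 {3}
```

Padding every bucket to the same size hides how many real elements fall into each bucket, at the price of extra equality tests on dummy elements.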
The effect of using buckets is more significant for large values of $m$ and $n$. In our case, after performing feature engineering for reducing the number of elements in each set, in the best case, we end up with inputs for which there is no significant difference between the original protocol (without buckets) and the protocol that uses buckets. If the performance of these two cases is comparable, one is better off using the version without buckets, since there will be no probability of information being leaked due to bucket overflow.
Another way we could possibly improve the communication and computation complexities of the protocol is by reducing the number of bits used to represent each feature albeit at the cost of increasing the probability of collisions (different features being mapped into the same bit strings). We used 13 bits for representing unigrams and 17 bits for representing unigrams and bigrams. We did not observe any collisions.
5 Related work
The interest in privacy-preserving machine learning (PPML) has grown substantially over the last decade. The best-known results in PPML are based on differential privacy (DP), a technique that relies on adding noise to answers, to prevent an adversary from learning information about any particular individual in the dataset from revealed aggregate statistics [28]. While DP in an ML setting aims at protecting the privacy of individuals in the training dataset, our focus is on protecting the privacy of new user data that is classified with proprietary ML models. To this end, we use Secure Multiparty Computation (SMC) [15], a technique in cryptography that has successfully been applied to various ML tasks with structured data (see e.g. [13, 18, 20] and references therein).
To the best of our knowledge, there are no existing DP or SMC based solutions for PP feature extraction and classification of unstructured texts. Defenses against authorship attribution attacks that fulfill DP in text classification have been proposed [48]. These methods rely on distortion of term frequency vectors and result in loss of accuracy. In this paper we address a different challenge: we assume that Bob knows Alice, so no authorship obfuscation is needed. Instead, we want to process Alice’s text with Bob’s classifier, without Bob learning what Alice wrote, and without accuracy loss. To the best of our knowledge, Costantino et al. [14] were the first to propose PP feature extraction from text. In their solution, which is based on homomorphic encryption (HE), Bob learns which of his lexicon’s words are present in Alice’s tweets, and classification of a single tweet with a model with fewer than 20 features takes 19 minutes. Our solution does not leak any information about Alice’s words to Bob, and classification is done in seconds, even for a model with 500 features.
Below we present existing work that is related to some of the building blocks we use in our PP text classification protocol (see Section 3.1).
Private equality tests have been proposed in the literature based on several different flavors [3]. They can be based on Yao Gates, Homomorphic Encryption, and generic SMC [47]. In our case, we have chosen a simple protocol that depends solely on additions and multiplications over a binary field. While different (and possibly more efficient) comparison protocols could be used instead, they would either require additional computational assumptions or present a marginal improvement in performance for the parameters used here.
Our private feature extraction can be seen as a particular case of private set intersection (PSI). PSI is the problem of securely computing the intersection of two sets without leaking any information except (possibly) the result, such as identifying the intersection of the set of words in a user’s text message with the hate speech lexicon used by the classifier. Several paradigms have been proposed to realize PSI functionality, including a naive hashing solution, server-aided PSI, and PSI based on oblivious transfer extension. For a complete survey, we refer to Pinkas et al. [41]. In our protocol for PP text classification, we implement private feature extraction by a straightforward application of our equality test protocol. While more efficient protocols could be obtained by using sophisticated hashing techniques, we have decided to stick with our direct solution since it has no probability of failure and works well for the input sizes needed in our problem. For larger input sizes, a more sophisticated protocol would be a better choice [41].
We use two protocols for the secure classification of feature vectors: an existing protocol for secure classification with LR models [18]; and a novel secure AdaBoost classification protocol. The logistic regression protocol uses solely additions and multiplications over a finite field. The secure AdaBoost classification protocol is a novel optimized protocol that uses solely decision trees of depth one, binary features and a binary output. All these characteristics were used in order to speed up the resulting protocol. The final secure AdaBoost classification protocol uses only two secure inner products and one secure comparison.
6 Conclusion
In this paper we have presented the first provably secure method for privacy-preserving (PP) classification of unstructured text, building on SMC protocols for secure equality testing of strings and secure feature extraction from text. We have provided an analysis of the correctness and security of all protocols. An implementation of the protocols in Java, run on AWS machines, allowed us to classify text messages securely within seconds. It is important to note that this run time (1) includes both secure feature extraction and secure classification of the extracted feature vector; (2) includes both computation and communication costs, as the parties involved in the protocol were run on separate machines; (3) is two orders of magnitude better than the only other existing solution, which is based on HE. Our results show that in order to make PP text classification practical, one needs to pay close attention not only to the underlying cryptographic protocols but also to the underlying ML algorithms. ML algorithms that would be a clear choice when used in the clear might not be useful at all when transferred to the SMC domain. One has to optimize these ML algorithms having in mind their use within SMC protocols. Our results provide the first evidence that provably secure PP text classification is feasible in practice.
References
 [1] Peter Ray Allison. Tracking terrorists online might invade your privacy. BBC, http://www.bbc.com/future/story/20170808trackingterroristsonlinemightinvadeyourprivacy, 2017.
 [2] Tiago A. Almeida, José María G. Hidalgo, and Akebo Yamakami. Contributions to the study of SMS spam filtering: new collection and results. In Proc. of the 11th ACM Symposium on Document Engineering, pages 259–262, 2011.
 [3] Nuttapong Attrapadung, Goichiro Hanaoka, Shinsaku Kiyomoto, Tomoaki Mimoto, and Jacob C. N. Schuldt. A taxonomy of secure two-party comparison protocols and efficient constructions. In 15th Annual Conference on Privacy, Security and Trust (PST), 2017.
 [4] Boaz Barak, Ran Canetti, Jesper Buus Nielsen, and Rafael Pass. Universally composable protocols with relaxed setup assumptions. In FOCS 2004, pages 186–195, 2004.
 [5] Paulo S. L. M. Barreto, Bernardo David, Rafael Dowsley, Kirill Morozov, and Anderson C. A. Nascimento. A framework for efficient adaptively secure composable oblivious transfer in the ROM. Cryptology ePrint Archive, Report 2017/993, 2017. http://eprint.iacr.org/2017/993.
 [6] Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Rangel, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proc. of the 13th International Workshop on Semantic Evaluation (SemEval-2019). ACL, 2019.
 [7] Donald Beaver. Commodity-based cryptography (extended abstract). In STOC 1997, pages 446–455, 1997.
 [8] Ran Canetti. Universally composable security: A new paradigm for cryptographic protocols. In FOCS 2001, pages 136–145, 2001.
 [9] Ran Canetti and Marc Fischlin. Universally composable commitments. In Crypto 2001, pages 19–40, 2001.
 [10] Ran Canetti, Yehuda Lindell, Rafail Ostrovsky, and Amit Sahai. Universally composable two-party and multi-party secure computation. In STOC 2002, pages 494–503, 2002.
 [11] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.
 [12] Michele Ciampi and Claudio Orlandi. Combining private set-intersection with secure two-party computation. In SCN 2018, pages 464–482, 2018.
 [13] Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2):28–34, 2002.
 [14] Gianpiero Costantino, Antonio La Marra, Fabio Martinelli, Andrea Saracino, and Mina Sheikhalishahi. Privacy-preserving text mining as a service. In 2017 IEEE Symposium on Computers and Communications (ISCC), pages 890–897, 2017.
 [15] Ronald Cramer, Ivan Damgård, and Jesper Buus Nielsen. Secure Multiparty Computation and Secret Sharing. Cambridge University Press, 2015.
 [16] Bernardo David, Rafael Dowsley, Raj Katti, and Anderson CA Nascimento. Efficient unconditionally secure comparison and privacy preserving machine learning classification protocols. In International Conference on Provable Security, pages 354–367. Springer, 2015.
 [17] Bernardo David, Rafael Dowsley, Jeroen van de Graaf, Davidson Marques, Anderson C. A. Nascimento, and Adriana C. B. Pinto. Unconditionally secure, universally composable privacy preserving linear algebra. IEEE Transactions on Information Forensics and Security, 11(1):59–73, 2016.

 [18] Martine De Cock, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento, Wing-Sea Poon, and Stacey Truex. Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation. IEEE Transactions on Dependable and Secure Computing, 16(2):217–230, 2019.
 [19] Martine De Cock, Rafael Dowsley, Anderson C. A. Nascimento, and Stacey C. Newman. Fast, privacy preserving linear regression over distributed datasets based on pre-distributed data. In 8th ACM Workshop on Artificial Intelligence and Security (AISec), pages 3–14, 2015.
 [20] Sebastiaan de Hoogh, Berry Schoenmakers, Ping Chen, and Harm op den Akker. Practical secure decision tree learning in a teletreatment application. In International Conference on Financial Cryptography and Data Security, pages 179–194. Springer, 2014.
 [21] Nico Döttling, Daniel Kraschewski, and Jörn Müller-Quade. Unconditional and composable security using a single stateful tamper-proof hardware token. In TCC 2011, pages 164–181, 2011.
 [22] Rafael Dowsley. Cryptography Based on Correlated Data: Foundations and Practice. PhD thesis, Karlsruhe Institute of Technology, Germany, 2016.
 [23] Rafael Dowsley, Jörn Müller-Quade, and Anderson C. A. Nascimento. On the possibility of universally composable commitments based on noisy channels. In SBSEG 2008, pages 103–114, Gramado, Brazil, September 1–5, 2008.
 [24] Rafael Dowsley, Jörn Müller-Quade, and Tobias Nilges. Weakening the isolation assumption of tamper-proof hardware tokens. In ICITS 2015, pages 197–213, 2015.
 [25] Rafael Dowsley, Jörn Müller-Quade, Akira Otsuka, Goichiro Hanaoka, Hideki Imai, and Anderson C. A. Nascimento. Universally composable and statistically secure verifiable secret sharing scheme based on pre-distributed data. IEICE Transactions, 94-A(2):725–734, 2011.
 [26] Rafael Dowsley, Jeroen Van De Graaf, Davidson Marques, and Anderson CA Nascimento. A two-party protocol with trusted initializer for computing the inner product. In International Workshop on Information Security Applications, pages 337–350. Springer, 2010.
 [27] Rafael Dowsley, Jeroen van de Graaf, Jörn Müller-Quade, and Anderson C. A. Nascimento. On the composability of statistically secure bit commitments. Journal of Internet Technology, 14(3):509–516, 2013.
 [28] Cynthia Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
 [29] Brett Hemenway Falk, Daniel Noble, and Rafail Ostrovsky. Private set intersection with linear communication from general assumptions. Cryptology ePrint Archive, Report 2018/238, 2018. https://eprint.iacr.org/2018/238.
 [30] Golnoosh Farnadi, Geetha Sitaraman, Shanu Sushmita, Fabio Celli, Michal Kosinski, David Stillwell, Sergio Davalos, Marie-Francine Moens, and Martine De Cock. Computational personality recognition in social media. User Modeling and User-Adapted Interaction, 26(2-3):109–142, 2016.
 [31] Kyle Fritchman, Keerthanaa Saminathan, Rafael Dowsley, Tyler Hughes, Martine De Cock, Anderson Nascimento, and Ankur Teredesai. Privacy-preserving scoring of tree ensembles: A novel framework for AI in healthcare. In Proc. of 2018 IEEE International Conference on Big Data, pages 2412–2421, 2018.
 [32] Juan A. Garay, Berry Schoenmakers, and José Villegas. Practical and secure solutions for integer comparison. In PKC 2007, pages 330–342, 2007.
 [33] Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. All you need is “love”: Evading hate-speech detection. In Proc. of the 11th ACM Workshop on Artificial Intelligence and Security (AISec), 2018.
 [34] Dennis Hofheinz and Jörn Müller-Quade. Universally composable commitments using random oracles. In TCC 2004, pages 58–76, 2004.
 [35] Dennis Hofheinz, Jörn Müller-Quade, and Dominique Unruh. Universally composable zero-knowledge arguments and commitments from signature cards. In MoraviaCrypt 2005, 2005.
 [36] Yuval Ishai, Eyal Kushilevitz, Sigurd Meldgaard, Claudio Orlandi, and Anat Paskin-Cherniavsky. On the power of correlated randomness in secure computation. In Theory of Cryptography, pages 600–620. Springer, 2013.
 [37] Jonathan Katz. Universally composable multi-party computation using tamper-proof hardware. In Eurocrypt 2007, pages 115–128, 2007.
 [38] Bridianne O’Dea, Stephen Wan, Philip J. Batterham, Alison L. Calear, Cecile Paris, and Helen Christensen. Detecting suicidality on Twitter. Internet Interventions, 2(2):183–188, 2015.
 [39] Chris Peikert, Vinod Vaikuntanathan, and Brent Waters. A framework for efficient and composable oblivious transfer. In Crypto 2008, pages 554–571, 2008.
 [40] Wouter Penard and Tim van Werkhoven. On the secure hash algorithm family. In Cryptography in Context, pages 1–18. 2008.
 [41] Benny Pinkas, Thomas Schneider, and Michael Zohner. Scalable private set intersection based on OT extension. ACM Transactions on Privacy and Security (TOPS), 21(2):7, 2018.
 [42] Andrew G. Reece, Andrew J. Reagan, Katharina L.M. Lix, Peter Sheridan Dodds, Christopher M. Danforth, and Ellen J. Langer. Forecasting the onset and course of mental illness with Twitter data. Scientific Reports, 7(1):13006, 2017.
 [43] Ronald L. Rivest. Unconditionally secure commitment and oblivious transfer schemes using private channels and a trusted initializer. Preprint available at http://people.csail.mit.edu/rivest/Rivest commitment.pdf, 1999.
 [44] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian approach to filtering junk email. In Learning for Text Categorization: Papers from the 1998 Workshop, volume 62, pages 98–105, 1998.
 [45] Rafael Tonicelli, Anderson C. A. Nascimento, Rafael Dowsley, Jörn Müller-Quade, Hideki Imai, Goichiro Hanaoka, and Akira Otsuka. Information-theoretically secure oblivious polynomial evaluation in the commodity-based model. International Journal of Information Security, 14(1):73–84, 2015.
 [46] Cynthia Van Hee, Gilles Jacobs, Chris Emmery, Bart Desmet, Els Lefever, Ben Verhoeven, Guy De Pauw, Walter Daelemans, and Véronique Hoste. Automatic detection of cyberbullying in social media text. PloS one, 13(10):e0203794, 2018.
 [47] Thijs Veugen, Frank Blom, Sebastiaan JA de Hoogh, and Zekeriya Erkin. Secure comparison protocols in the semi-honest model. IEEE Journal of Selected Topics in Signal Processing, 9(7):1217–1228, 2015.
 [48] Benjamin Weggenmann and Florian Kerschbaum. SynTF: Synthetic and differentially private term frequency vectors for privacy-preserving text mining. In 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 305–314, 2018.
Appendix A Correctness and Security Analysis of Protocols
A.1 Security Model
The gold standard model for proving the security of cryptographic protocols nowadays is the Universal Composability (UC) framework [8], and it is the security model that we use in this work. Protocols that are proven UC-secure enjoy strong security guarantees and can be arbitrarily composed without compromising security. In short, it is the most adequate model to use when the protocols need to be executed in complex environments such as the Internet, and it additionally allows a modular design of bigger protocols. In this work, protocols with two parties, Alice and Bob, are considered, and in the following we present an overview of the UC framework for this setting. We refer interested readers to the book of Cramer et al. [15] for more details and the most general definitions.
Apart from the protocol participants, Alice and Bob, there are also an adversary $\mathcal{A}$, an ideal world adversary $\mathcal{S}$ (also known as the simulator) and an environment $\mathcal{Z}$ (which captures everything that happens outside of the instance of the protocol that is being analyzed, and therefore is the one giving the inputs to and getting the outputs from the protocol). All these entities are assumed to be interactive Turing machines. The network is assumed to be under adversarial control and therefore $\mathcal{A}$ is the one that delivers the messages between Alice and Bob. In addition to controlling the network scheduling, $\mathcal{A}$ can also corrupt Alice or Bob, in which case it gains total control over the corrupted party and learns its complete state. For defining the security of the protocol, an ideal functionality $\mathcal{F}$ is defined, which captures the idealized version of what the protocol is supposed to achieve and communicates directly with Alice and Bob to receive the inputs and deliver the outputs of the protocol (in the ideal world, that is all that Alice and Bob do). Then, to prove the security of the protocol $\pi$, we show that for every possible adversary $\mathcal{A}$ there exists a simulator $\mathcal{S}$ such that no environment $\mathcal{Z}$ can distinguish between a real world execution with Alice, Bob and the adversary $\mathcal{A}$ running the protocol $\pi$ and the ideal world execution with the ideal functionality $\mathcal{F}$, the simulator $\mathcal{S}$ and the dummy versions of Alice and Bob that just forward the inputs and outputs between $\mathcal{Z}$ and $\mathcal{F}$. Formally:
Definition A.1 ([8]) A protocol $\pi$ UC-realizes an ideal functionality $\mathcal{F}$ if, for every possible adversary $\mathcal{A}$, there exists a simulator $\mathcal{S}$ such that, for every possible environment $\mathcal{Z}$, the view of the environment $\mathcal{Z}$ in the real world execution with $\mathcal{A}$, Alice and Bob executing the protocol $\pi$ (with security parameter $\lambda$) is computationally indistinguishable from the view of $\mathcal{Z}$ in the ideal world execution with the functionality $\mathcal{F}$, the simulator $\mathcal{S}$ and the dummy Alice and Bob, where the probability distribution is taken over the randomness used by all entities.
Adversarial Model: We consider honest-but-curious adversaries. Honest-but-curious adversaries follow the protocol instructions correctly, but try to learn additional information. We only consider static adversaries, for which the set of corrupted parties is chosen before the start of the protocol execution and does not change. A version of the UC theorem for the case of honest-but-curious adversaries is given in Theorem 4.20 of Cramer et al. [15].
Setup Assumption: It is a well-known fact that secure two-party computation (and also secure multi-party computation) can only achieve UC-security using a setup assumption [9, 10]. Multiple setup assumptions were used previously to achieve UC-security for secure computation protocols, including: the availability of a common reference string [9, 10, 39], the availability of a public-key infrastructure [4], the random oracle model [34, 5], the existence of noisy channels between the parties [23, 27], and the availability of signature cards [35] or tamper-proof hardware [37, 21, 24]. In this work the commodity-based model [7] is used as the setup assumption. In this model there exists a trusted initializer that pre-distributes correlated randomness to Alice and Bob during a setup phase. This setup phase is run before the protocol execution (and in fact can be performed even before Alice and Bob get to know their inputs), and the trusted initializer does not participate at any other point of the protocol. The commodity-based model was used in many previous works, e.g., [43, 26, 25, 36, 45, 19, 16, 17, 31, 18]. The trusted initializer is modeled by the ideal functionality $\mathcal{F}_{\mathsf{TI}}^{\mathcal{D}}$ described in Figure 3.
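Functionality $\mathcal{F}_{\mathsf{TI}}^{\mathcal{D}}$ is parametrized by an algorithm $\mathcal{D}$. Upon initialization run $(D_A, D_B) \leftarrow \mathcal{D}$ and deliver $D_A$ to Alice and $D_B$ to Bob.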
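To illustrate how pre-distributed correlated randomness can be used, the following minimal sketch multiplies two additively secret-shared values with a multiplication triple handed out by a simulated trusted initializer. It is a generic illustration of the commodity-based model, not the paper's Java implementation; the modulus `Q` and all function names are assumptions made for this example, and both parties are simulated in a single process.

```python
import secrets

Q = 2**31 - 1  # hypothetical ring size q, chosen only for this sketch


def share(value, q=Q):
    """Split value into two additive shares that sum to value modulo q."""
    alice = secrets.randbelow(q)
    bob = (value - alice) % q
    return alice, bob


def reconstruct(alice_share, bob_share, q=Q):
    """Open a secret sharing by adding the two shares modulo q."""
    return (alice_share + bob_share) % q


def trusted_initializer(q=Q):
    """Sample a triple (u, v, w = u*v) and return Alice's and Bob's shares of it."""
    u = secrets.randbelow(q)
    v = secrets.randbelow(q)
    w = (u * v) % q
    (u_a, u_b), (v_a, v_b), (w_a, w_b) = share(u, q), share(v, q), share(w, q)
    return (u_a, v_a, w_a), (u_b, v_b, w_b)


def multiply_shared(x_shares, y_shares, q=Q):
    """Return additive shares of x*y given additive shares of x and y."""
    (x_a, x_b), (y_a, y_b) = x_shares, y_shares
    (u_a, v_a, w_a), (u_b, v_b, w_b) = trusted_initializer(q)
    # The parties open the masked values d = x - u and e = y - v.
    d = reconstruct((x_a - u_a) % q, (x_b - u_b) % q, q)
    e = reconstruct((y_a - v_a) % q, (y_b - v_b) % q, q)
    # x*y = w + d*v + e*u + d*e; the public term d*e is added by Alice only.
    z_a = (w_a + d * v_a + e * u_a + d * e) % q
    z_b = (w_b + d * v_b + e * u_b) % q
    return z_a, z_b
```

For example, `reconstruct(*multiply_shared(share(6), share(7)))` evaluates to 42. The opened values `d` and `e` are masked by the uniformly random `u` and `v`, which is why they reveal nothing about the inputs in this sketch.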
Simplifications: The simulation strategy in our proofs is in fact very simple: all the computations are performed using secret sharings and all the protocol messages look uniformly random from the point of view of the receiver, with the single exception of the openings of the secret sharings. Nevertheless, the messages that open a secret sharing can be straightforwardly simulated using the outputs of the respective functionalities. In the ideal world, the simulator $\mathcal{S}$ has the leverage of being the one responsible for simulating all the ideal functionalities other than the one whose security is being analyzed (including the trusted initializer functionality $\mathcal{F}_{\mathsf{TI}}^{\mathcal{D}}$), and it can easily use this fact to perform a perfect simulation. For this reason the real and ideal worlds are indistinguishable for any environment $\mathcal{Z}$, and the protocols achieve perfect security.
The messages of the ideal functionalities are formally public delayed outputs, i.e., first the simulator $\mathcal{S}$ is asked whether it allows the message to be delivered (this is due to the fact that in the real world the adversary controls the scheduling of the network), and the message is only delivered when $\mathcal{S}$ agrees. Formally, every instance also has a session identifier. We omit this information from the descriptions for the sake of readability.
Security of the Building Blocks: The protocol $\pi_{\mathsf{DMM}}$ for secure distributed matrix multiplication UC-realizes the distributed matrix multiplication functionality $\mathcal{F}_{\mathsf{DMM}}$ described in Figure 4 [22, 18]. The protocol $\pi_{\mathsf{DC}}$ for secure comparison UC-realizes the functionality $\mathcal{F}_{\mathsf{DC}}$ described in Figure 5 [32, 18]. The protocol $\pi_{\mathsf{decomp}}$ for secure bit-decomposition UC-realizes the functionality $\mathcal{F}_{\mathsf{decomp}}$ described in Figure 6 [18]. The LR classification protocol $\pi_{\mathsf{LR}}$ UC-realizes the functionality $\mathcal{F}_{\mathsf{LR}}$ described in Figure 7 [18].
Functionality $\mathcal{F}_{\mathsf{DMM}}$ is executed with Alice and Bob and is parametrized by the size $q$ of the ring $\mathbb{Z}_q$ and the dimensions $(i, j)$ and $(j, k)$ of the matrices.
Input: Upon receiving a message from Alice/Bob with her/his shares of $[\![X]\!]_q$ and $[\![Y]\!]_q$, verify if the share of $X$ is in $\mathbb{Z}_q^{i \times j}$ and the share of $Y$ is in $\mathbb{Z}_q^{j \times k}$. If it is not, abort. Otherwise, record the shares, ignore any subsequent message from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both Alice and Bob, reconstruct $X$ and $Y$ from the shares, compute $XY$ and create a secret sharing $[\![XY]\!]_q$. Before the delivery of the output shares, a corrupt party may fix its share of the output to any constant value. In both cases the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Functionality $\mathcal{F}_{\mathsf{DC}}$ is parametrized by the bit-length $\ell$ of the values being compared.
Input: Upon receiving a message from Alice/Bob with her/his shares of $[\![x_i]\!]_2$ and $[\![y_i]\!]_2$ for all $i \in \{1, \ldots, \ell\}$, record the shares, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct $x$ and $y$ from the bitwise shares. If $x \geq y$, then create and distribute to Alice and Bob the secret sharing $[\![1]\!]_2$; otherwise the secret sharing $[\![0]\!]_2$. Before the delivery of the output shares, a corrupt party may fix its share of the output to any constant value. In both cases the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Functionality $\mathcal{F}_{\mathsf{decomp}}$ is parametrized by the bit-length $\ell$ of the value $x$ being converted from an additive secret sharing $[\![x]\!]_q$ in $\mathbb{Z}_q$ to additive bitwise secret sharings $[\![x_i]\!]_2$ in $\mathbb{Z}_2$ such that $x = x_\ell \cdots x_1$.
Input: Upon receiving a message from Alice or Bob with her/his share of $[\![x]\!]_q$, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of both shares, reconstruct $x$, compute its bitwise representation $x_\ell \cdots x_1$, and for $i \in \{1, \ldots, \ell\}$ distribute new secret sharings $[\![x_i]\!]_2$ of the bit $x_i$. Before the output delivery, the corrupt party may fix its shares of the outputs to any constant values. The shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraints.
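As an illustration of what the bit-decomposition functionality computes, the sketch below plays the role of the ideal functionality in the clear for honest parties: it reconstructs the value from two additive shares, extracts its bits, and re-shares each bit over $\mathbb{Z}_2$. The function name and the assumption that the ring size equals $2^\ell$ are choices made for this example; the handling of corrupted parties is omitted.

```python
import secrets


def ideal_bit_decomposition(share_alice, share_bob, ell):
    """Ideal-world sketch: turn additive shares mod 2**ell into XOR shares of each bit."""
    q = 2 ** ell  # assumption for this sketch: the ring size is 2**ell
    x = (share_alice + share_bob) % q
    # Bits are listed from least significant (index 0) to most significant.
    bits = [(x >> i) & 1 for i in range(ell)]
    # Alice's bit shares are uniformly random; Bob's are fixed by correctness.
    alice_bits = [secrets.randbelow(2) for _ in range(ell)]
    bob_bits = [bit ^ a for bit, a in zip(bits, alice_bits)]
    return alice_bits, bob_bits
```

With shares 9 and 4 and `ell = 4`, the shared value is 13, so XOR-ing the two returned lists position by position gives `[1, 0, 1, 1]` (least significant bit first).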
Functionality $\mathcal{F}_{\mathsf{LR}}$ computes the classification according to a logistic regression model with the threshold value set to 0.5. The input feature vector $\mathbf{x}$ is secret shared between Alice and Bob.
Input: Upon receiving the weight vector $\mathbf{w}$, the intercept value $b$ and his shares of the elements of $\mathbf{x}$ from Bob, or her shares of the elements of $\mathbf{x}$ from Alice, store the information, ignore any subsequent message from that party, and inform the other party about the receipt.
Output: Upon getting the inputs from both parties, reconstruct the feature vector $\mathbf{x}$, compute the class prediction of the logistic regression model by comparing the logistic function of $\langle \mathbf{w}, \mathbf{x} \rangle + b$ with the threshold 0.5, and output it to Bob.
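A threshold of 0.5 on the logistic function is equivalent to checking the sign of the linear score, which is why no exponentiation is needed to obtain the class. The sketch below shows this in the clear; the function names are hypothetical, and mapping a score of exactly zero to class 1 is an assumption of this example rather than something stated in the text.

```python
import math


def logistic(score):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-score))


def lr_predict(weights, intercept, features):
    """Predict the class from the sign of the linear score <w, x> + b."""
    score = sum(w * x for w, x in zip(weights, features)) + intercept
    # Assumption for this sketch: a score of exactly 0 is mapped to class 1.
    return 1 if score >= 0 else 0


def lr_predict_via_logistic(weights, intercept, features):
    """Same prediction, computed by thresholding the logistic function at 0.5."""
    score = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1 if logistic(score) >= 0.5 else 0
```

Both functions agree on inputs such as `lr_predict([2, -3, 1], -1, [1, 0, 1])`, where the linear score is positive.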
The correctness of the equality test protocol $\pi_{\mathsf{EQ}}$ follows from the fact that in the case that $x = y$, all $r_i$'s will be equal to 1 and therefore $z$ will also be 1. If $x \neq y$, then for at least one value $i$ we have that $r_i = 0$, and therefore $z = 0$. For the simulation, $\mathcal{S}$ executes an internal copy of $\mathcal{A}$ interacting with an instance of $\pi_{\mathsf{EQ}}$ in which the uncorrupted parties use dummy inputs. Note that all the messages that $\mathcal{A}$ receives look uniformly random to it. Since the share multiplication protocol $\pi_{\mathsf{DMM}}$ is substituted by $\mathcal{F}_{\mathsf{DMM}}$ using the UC composition theorem, and $\mathcal{S}$ is the one responsible for simulating $\mathcal{F}_{\mathsf{DMM}}$ in the ideal world, $\mathcal{S}$ can leverage this fact in order to extract the share that any corrupted party has of the value $r_i$; let the extracted value of the corrupted party be denoted by $r_{i,C}$. The simulator then picks random values $x_{i,C}, y_{i,C} \in \{0, 1\}$ such that $x_{i,C} + y_{i,C} + 1 = r_{i,C} \bmod 2$ and submits these values to $\mathcal{F}_{\mathsf{EQ}}$ as being the shares of the corrupted party for $x_i$ and $y_i$ (note that the result of $\mathcal{F}_{\mathsf{EQ}}$ only depends on the values of $x_i + y_i \bmod 2$). $\mathcal{S}$ is also able to fix the output share of the corrupted party in $\mathcal{F}_{\mathsf{EQ}}$ so that it matches the one in the instance of $\pi_{\mathsf{EQ}}$. This is a perfect simulation strategy: no environment $\mathcal{Z}$ can distinguish the ideal and real worlds, and therefore $\pi_{\mathsf{EQ}}$ UC-realizes $\mathcal{F}_{\mathsf{EQ}}$.
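The correctness argument can be checked in the clear with a small sketch. It assumes, for illustration only, that the per-position value is computed as $x_i + y_i + 1 \bmod 2$ (which is 1 exactly when the two bits agree) and that the result is the product of these values over all positions; in a secure execution the additions would be done locally on shares and the product with secure multiplications. The function names are hypothetical.

```python
from itertools import product


def equality_bit(x_bits, y_bits):
    """Return 1 if the two bit strings are equal and 0 otherwise, using only
    additions and multiplications modulo 2."""
    result = 1
    for x_i, y_i in zip(x_bits, y_bits):
        agree = (x_i + y_i + 1) % 2  # 1 exactly when x_i == y_i
        result = (result * agree) % 2
    return result


def check_all_pairs(ell):
    """Exhaustively compare equality_bit with plain equality for bit-length ell."""
    for x_bits in product([0, 1], repeat=ell):
        for y_bits in product([0, 1], repeat=ell):
            if equality_bit(x_bits, y_bits) != int(x_bits == y_bits):
                return False
    return True
```

Calling `check_all_pairs(4)` goes through all 256 pairs of 4-bit strings and confirms that a single disagreeing position forces the product to 0.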
The correctness of the secure feature extraction protocol $\pi_{\mathsf{FE}}$ follows directly from the fact that each $x_{ij}$ is equal to $1$ if, and only if, $a_i = b_j$, and therefore $x_j$ is equal to $1$ if, and only if, $b_j$ is equal to some element of $A$. In the ideal world, the simulator $\mathcal{S}$ runs internally a copy of $\mathcal{A}$ and an execution of $\pi_{\mathsf{FE}}$ with dummy inputs for the uncorrupted parties. All the messages from the uncorrupted parties look uniformly random from $\mathcal{A}$'s point of view, and therefore the simulation is perfect. $\mathcal{S}$ uses the leverage of being responsible for simulating $\mathcal{F}_{\mathsf{EQ}}$ ($\pi_{\mathsf{EQ}}$ is substituted by $\mathcal{F}_{\mathsf{EQ}}$ using the UC composition theorem) in order to extract the inputs of any corrupted party and forward them to $\mathcal{F}_{\mathsf{FE}}$. No environment $\mathcal{Z}$ can distinguish the ideal world from the real one, and thus $\pi_{\mathsf{FE}}$ UC-realizes $\mathcal{F}_{\mathsf{FE}}$.
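The relation between pairwise equality tests and the binary feature vector can be sketched in the clear as follows. The sketch assumes that Alice's words form a set (no duplicates), so that at most one of her words can match a given dictionary word and the matches for that word can simply be summed modulo 2. The function name, the use of strings instead of bit-encoded words, and this way of aggregating the matches are assumptions of the example.

```python
def extract_features(alice_set, bob_dictionary):
    """Return the binary feature vector: entry j is 1 iff Bob's j-th word occurs in Alice's set."""
    alice_items = list(alice_set)
    features = []
    for b_j in bob_dictionary:
        # One equality test per pair (a_i, b_j).
        matches = [1 if a_i == b_j else 0 for a_i in alice_items]
        # Alice's items are distinct, so at most one match exists and the sum mod 2 is 0 or 1.
        features.append(sum(matches) % 2)
    return features
```

For instance, with Alice holding `{"hate you", "you are"}` and Bob's dictionary `["you are", "go away", "hate you"]`, the result is `[1, 0, 1]`.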
In the case of the conversion protocol $\pi_{\mathsf{2toQ}}$, the correctness of the protocol execution follows straightforwardly: since $z = x_A + x_B - 2 x_A x_B$, $z$ is such that $z = x_A \oplus x_B$ for all possible values $x_A, x_B \in \{0, 1\}$. As for the security, the simulator $\mathcal{S}$ runs internally a copy of the adversary $\mathcal{A}$ and simulates to it an execution of the protocol $\pi_{\mathsf{2toQ}}$ using dummy inputs for the uncorrupted parties. As all the messages from the uncorrupted parties look uniformly random from the adversary's point of view, the simulation is perfect. The simulator can use the fact that it is the one simulating the multiplication functionality $\mathcal{F}_{\mathsf{DMM}}$ (the secret sharing multiplication $\pi_{\mathsf{DMM}}$ is substituted by $\mathcal{F}_{\mathsf{DMM}}$ using the UC composition theorem) in order to extract the share of any corrupted party and fix the input to/output from $\mathcal{F}_{\mathsf{2toQ}}$ appropriately, so that no environment $\mathcal{Z}$ can distinguish the real and ideal worlds. Hence $\pi_{\mathsf{2toQ}}$ UC-realizes $\mathcal{F}_{\mathsf{2toQ}}$.
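The arithmetic fact behind converting a bit shared over $\mathbb{Z}_2$ into a sharing over a larger modulus is that the XOR of two bits can be written with one multiplication as $x_A + x_B - 2 x_A x_B$. The sketch below checks this identity exhaustively; the function name and the example modulus are assumptions, and the secure protocol would evaluate the same expression on secret sharings instead of on clear values.

```python
def xor_as_arithmetic(x_a, x_b, q):
    """Compute x_a XOR x_b for bits x_a, x_b using arithmetic modulo q."""
    return (x_a + x_b - 2 * x_a * x_b) % q


def identity_holds(q):
    """Check the identity for all four combinations of input bits."""
    return all(
        xor_as_arithmetic(x_a, x_b, q) == (x_a ^ x_b)
        for x_a in (0, 1)
        for x_b in (0, 1)
    )
```

The only case in which plain addition of the two bit shares fails is $x_A = x_B = 1$, where the sum is 2; the term $-2 x_A x_B$ corrects exactly this case.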
The AdaBoost classification protocol $\pi_{\mathsf{AB}}$ is trivially correct for the case of binary features and output class, and decision stumps. In the simulation, $\mathcal{S}$ runs an internal copy of $\mathcal{A}$ interacting with a simulated instance of $\pi_{\mathsf{AB}}$ that uses dummy inputs for the uncorrupted parties. $\pi_{\mathsf{DMM}}$ is substituted by $\mathcal{F}_{\mathsf{DMM}}$ using the UC composition theorem. $\mathcal{S}$ uses the leverage of simulating $\mathcal{F}_{\mathsf{DMM}}$ in order to extract the shares of the feature vector $\mathbf{x}$ belonging to a corrupted party, as well as the weighted probability vectors $\mathbf{y}$ and $\mathbf{z}$ if Bob is corrupted. $\mathcal{S}$ can then give these extracted inputs to $\mathcal{F}_{\mathsf{AB}}$. No environment $\mathcal{Z}$ can distinguish the real and ideal worlds since the simulation is perfect, and thus $\pi_{\mathsf{AB}}$ UC-realizes $\mathcal{F}_{\mathsf{AB}}$.
Functionality $\mathcal{F}_{\mathsf{EQ}}$ is parametrized by the bit-length $\ell$ of the values being compared.
Input: Upon receiving a message from Alice/Bob with her/his shares of $[\![x_i]\!]_2$ and $[\![y_i]\!]_2$ for all $i \in \{1, \ldots, \ell\}$, record the shares, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct $x$ and $y$ from the bitwise shares. If $x = y$, then create and distribute to Alice and Bob the secret sharing $[\![1]\!]_2$; otherwise the secret sharing $[\![0]\!]_2$. Before the delivery of the output shares, a corrupt party may fix its share of the output to any constant value. In both cases the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Functionality $\mathcal{F}_{\mathsf{FE}}$ is parametrized by the sizes $m$ of Alice's set and $n$ of Bob's set, and the bit-length $\ell$ of the elements.
Input: Upon receiving a message from Alice with her set $A = \{a_1, \ldots, a_m\}$ or from Bob with his set $B = \{b_1, \ldots, b_n\}$, record the set, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, define the binary feature vector $\mathbf{x}$ of length $n$ by setting each element $x_j$ to $1$ if $b_j \in A$, and to $0$ otherwise. Then create and distribute to Alice and Bob the secret sharings $[\![x_j]\!]_2$. Before the delivery of the output shares, a corrupt party may fix its share of the output to any constant value. In both cases the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Functionality $\mathcal{F}_{\mathsf{2toQ}}$ is parametrized by the size $q$ of the field $\mathbb{Z}_q$.
Input: Upon receiving a message from Alice/Bob with her/his share of $[\![x]\!]_2$, record the share, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon receipt of the inputs from both parties, reconstruct $x$, then create and distribute to Alice and Bob the secret sharing $[\![x]\!]_q$. Before the delivery of the output shares, a corrupt party may fix its share of the output to any constant value. In both cases the shares of the uncorrupted parties are then created by picking uniformly random values subject to the correctness constraint.
Functionality $\mathcal{F}_{\mathsf{AB}}$ computes the classification according to AdaBoost with multiple decision stumps. All the features are binary and the output class is also binary. The input feature vector $\mathbf{x}$ is secret shared between Alice and Bob. The model specified by Bob can be expressed in a simplified way by two weighted probability vectors $\mathbf{y} = (y_{1,0}, y_{1,1}, \ldots, y_{n,0}, y_{n,1})$ and $\mathbf{z} = (z_{1,0}, z_{1,1}, \ldots, z_{n,0}, z_{n,1})$. For the $i$-th decision stump: $y_{i,k}$ is the weighted probability (i.e., a probability multiplied by the weight of the $i$-th decision stump) that the model assigns to the output class being 0 if $x_i = k$, and $z_{i,k}$ is defined similarly for the output class 1.
Input: Upon receiving the vectors $\mathbf{y}$ and $\mathbf{z}$ and his shares of the elements of the feature vector $\mathbf{x}$ from Bob, or her shares of the elements of $\mathbf{x}$ from Alice, store the information, ignore any subsequent message from that party, and inform the other party about the receipt.
Output: Upon getting the inputs from both parties, reconstruct the feature vector $\mathbf{x}$ and let $v_0 = \sum_i y_{i,x_i}$ and $v_1 = \sum_i z_{i,x_i}$ be the total weighted probabilities assigned to the output classes 0 and 1, respectively. If $v_1 > v_0$, output the class prediction 1 to Bob; otherwise output 0.
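The computation performed by an AdaBoost model over decision stumps on binary features can be sketched in the clear as a pair of sums followed by one comparison. In the sketch, `class0_weights[i][k]` and `class1_weights[i][k]` are hypothetical names for the weighted probabilities that the i-th stump assigns to class 0 and class 1 when feature i equals k; breaking ties in favor of class 0 is an assumption of this example.

```python
def adaboost_predict(features, class0_weights, class1_weights):
    """Predict a binary class from binary features and per-stump weighted probabilities."""
    score_0 = sum(class0_weights[i][x_i] for i, x_i in enumerate(features))
    score_1 = sum(class1_weights[i][x_i] for i, x_i in enumerate(features))
    # Assumption for this sketch: class 1 is predicted only on a strictly larger score.
    return 1 if score_1 > score_0 else 0
```

With `class0_weights = [(3, 1), (2, 2)]` and `class1_weights = [(1, 3), (2, 2)]`, the feature vector `[1, 0]` gives scores 3 and 5 and is assigned class 1, while `[0, 0]` gives scores 5 and 3 and is assigned class 0. Since each stump only selects one of two precomputed values, the secure version needs no divisions or exponentiations, only multiplications by shared bits and a final comparison.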
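Security of the Privacy-Preserving Text Classification Solutions:
The protocol $\pi_{\mathsf{TC\text{-}LR}}$ simply executes sequentially the protocols $\pi_{\mathsf{FE}}$, $\pi_{\mathsf{2toQ}}$ and $\pi_{\mathsf{LR}}$. Given that these protocols UC-realize $\mathcal{F}_{\mathsf{FE}}$, $\mathcal{F}_{\mathsf{2toQ}}$ and $\mathcal{F}_{\mathsf{LR}}$, respectively, they can be substituted by the functionalities using the UC composition theorem. Note that the sequential composition of those functionalities trivially performs the same computation as $\mathcal{F}_{\mathsf{TC\text{-}LR}}$, and no information other than the output of the classification is revealed (all the intermediate values are kept as secret sharings). In the ideal world $\mathcal{S}$ simulates an internal copy of the adversary $\mathcal{A}$ running $\pi_{\mathsf{TC\text{-}LR}}$ and using dummy inputs for the uncorrupted parties. The simulator can easily extract all the information (from the corrupted parties) that it needs to provide to $\mathcal{F}_{\mathsf{TC\text{-}LR}}$ by using the leverage of being responsible for simulating $\mathcal{F}_{\mathsf{FE}}$, $\mathcal{F}_{\mathsf{2toQ}}$ and $\mathcal{F}_{\mathsf{LR}}$ in the ideal world. Therefore no environment $\mathcal{Z}$ can distinguish the real world from the ideal world, and $\pi_{\mathsf{TC\text{-}LR}}$ UC-realizes $\mathcal{F}_{\mathsf{TC\text{-}LR}}$.
Similarly, the protocol $\pi_{\mathsf{TC\text{-}AB}}$ just runs sequentially the protocols $\pi_{\mathsf{FE}}$, $\pi_{\mathsf{2toQ}}$ and $\pi_{\mathsf{AB}}$, which can be substituted by $\mathcal{F}_{\mathsf{FE}}$, $\mathcal{F}_{\mathsf{2toQ}}$ and $\mathcal{F}_{\mathsf{AB}}$ using the UC composition theorem. The result of the computation is trivially the same as in $\mathcal{F}_{\mathsf{TC\text{-}AB}}$, and no additional information is revealed. $\mathcal{S}$ runs internally a copy of $\mathcal{A}$ interacting with a simulated instance of $\pi_{\mathsf{TC\text{-}AB}}$ (using dummy inputs for the uncorrupted parties) and can easily extract from the corrupted parties all the information that it must provide to $\mathcal{F}_{\mathsf{TC\text{-}AB}}$ by using the leverage of being responsible for simulating $\mathcal{F}_{\mathsf{FE}}$, $\mathcal{F}_{\mathsf{2toQ}}$ and $\mathcal{F}_{\mathsf{AB}}$ in the ideal world. No environment $\mathcal{Z}$ can distinguish the real and ideal worlds, and therefore $\pi_{\mathsf{TC\text{-}AB}}$ UC-realizes $\mathcal{F}_{\mathsf{TC\text{-}AB}}$.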
Functionality $\mathcal{F}_{\mathsf{TC\text{-}LR}}$ computes the privacy-preserving text classification according to a logistic regression model with the threshold value set to 0.5. It is parametrized by the sizes $m$ of Alice's set and $n$ of Bob's set, and the bit-length $\ell$ of the elements.
Input: Upon receiving a message from Alice with her set $A = \{a_1, \ldots, a_m\}$ or from Bob with his set $B = \{b_1, \ldots, b_n\}$, the weight vector $\mathbf{w}$ and the intercept value $b$, record the values, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon getting the inputs from both parties, define the feature vector $\mathbf{x}$ of length $n$ as follows: $x_j = 1$ if $b_j \in A$; and $x_j = 0$ otherwise. Compute the class prediction of the logistic regression model on $\mathbf{x}$, by comparing the logistic function of its linear score with the threshold 0.5, and output it to Bob.
Functionality $\mathcal{F}_{\mathsf{TC\text{-}AB}}$ computes the privacy-preserving text classification according to AdaBoost with multiple decision stumps. It is parametrized by the sizes $m$ of Alice's set and $n$ of Bob's set, and the bit-length $\ell$ of the elements. All the features are binary and the output class is also binary. The model specified by Bob can be expressed in a simplified way by two weighted probability vectors $\mathbf{y} = (y_{1,0}, y_{1,1}, \ldots, y_{n,0}, y_{n,1})$ and $\mathbf{z} = (z_{1,0}, z_{1,1}, \ldots, z_{n,0}, z_{n,1})$. For the $i$-th decision stump: $y_{i,k}$ is the weighted probability (i.e., a probability multiplied by the weight of the $i$-th decision stump) that the model assigns to the output class being 0 if the feature $x_i = k$, and $z_{i,k}$ is defined similarly for the output class 1.
Input: Upon receiving a message from Alice with her set $A = \{a_1, \ldots, a_m\}$ or from Bob with his set $B = \{b_1, \ldots, b_n\}$, $\mathbf{y}$ and $\mathbf{z}$, record the values, ignore any subsequent messages from that party and inform the other party about the receipt.
Output: Upon getting the inputs from both parties, define the feature vector $\mathbf{x}$ of length $n$ as follows: $x_j = 1$ if $b_j \in A$; and $x_j = 0$ otherwise. Let $v_0 = \sum_i y_{i,x_i}$ and $v_1 = \sum_i z_{i,x_i}$ be the total weighted probabilities assigned to the output classes 0 and 1, respectively. If $v_1 > v_0$, output the class prediction 1 to Bob; otherwise output 0.