Private Text Classification

by Leif W. Hanlen, et al.

Confidential text corpora exist in many forms, but do not allow arbitrary sharing. We explore how to use such private corpora using privacy preserving text analytics. We construct typical text processing applications using appropriate privacy preservation techniques (including homomorphic encryption, Rademacher operators and secure computation). We set out the preliminary materials from Rademacher operators for binary classifiers, and then construct basic text processing approaches to match those binary classifiers.




1 Motivation

Private text data, with confidential content, is difficult to “open”. Privacy requirements in text data are difficult to guarantee due to the interdependencies of text and grammar. Although Natural Language Processing (NLP) nominally operates on numerically encoded text, NLP exploits the structure of text and not merely a sequence of integer codes (Hirschberg & Manning, 2015).

Research work in the space of Information Retrieval (IR) has acknowledged the need to preserve privacy (Si et al., 2014; Oard, 2015). Mechanisms to accommodate sharing of text corpora have essentially reduced to licensing requirements that allow full access under limited conditions of (re-)use, since sharing (raw) text data essentially allows a human to read (and reproduce) the data (Dankar & Emam, 2013; Thomson, 2004; Ji et al., 2014; McDonald & Kelly, 2012).

1.1 Machine Learning from Private Text Data

The (typical) components of the text document that might be subject to privacy concerns (names of persons, places, drug-names, disease-names) are likely to be the components most interesting for text processing — and most damaging to algorithm performance if altered. In this work, we consider a different approach: we apply encryption techniques to text which allow learning without viewing the raw data — thereby applying machine learning without the need to share (or read) text data to learn from private text corpora and classify private text.

This work does not avoid the need for ethical approval and research permission: encryption and privacy preserving techniques cannot overcome ethical, legislative, or contractual requirements on what may (or may not) be done with data. We are interested in allowing groups to ethically interact with data, where raw data sharing would not be desired (or possible).

1.1.1 Text in Health Is a Special Case

Although open sharing is generally accepted as a good principle in health (Verhulst et al., 2014; Dunn et al., 2012; Estrin & Sim, 2010; Veitch & Barbour, 2010), privacy concerns may overwhelm implied scientific benefit (Thomson, 2004; Vogel, 2011; Sayogo & Pardo, 2012). The need to address privacy while supporting research-use (as well as non-research use) of health data has been observed (McKeon et al., 2013). To support confidentiality, the British Medical Journal (Table 1 of Hrynaszkiewicz et al., 2010) recommends not publishing verbatim responses or transcriptions of clinical discussions, which is exactly the sort of data that text mining systems require (Suominen et al., 2014, 2015). The work of (Jin, 2007) suggests some approaches for privacy-preserving health analytics, and reviews several privacy-preserving techniques, although most are numerically focused.

The risk of compromising privacy by being able to conceal the identifiers remains regardless of recent advances in automated de-identification algorithms for health text. Algorithms for automated de-identification of health text have been evaluated to reach F1 scores from 81 to 99 per cent in English, French, Japanese, and Swedish (Dalianis & Velupillai, 2010; Morita et al., 2013; Chazard et al., 2013; Grouin & Neveol, 2014; Kayaalp et al., 2014; Meystre et al., 2014).

F1 is a performance measure that takes values between 0 and 1; the larger the value, the better the performance. It is defined as the harmonic mean of precision and recall, that is,

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where precision refers to the proportion of correctly identified words for de-identification to all de-identified words and recall refers to the proportion of correctly identified words for de-identification to all words that should have been de-identified.

However, approximately 90 per cent of the residual identifiers left behind by either these algorithms or human coders can be concealed by applying additional computation methods (Carrell et al., 2013).

The risk of revealing identifiers becomes even more alarming after record linkage of different shared corpora. For example, in the USA, Washington is one of 33 states that share or sell anonymized patient records. For US$50, anyone can purchase a patient-level health corpus that contains all hospitalisations that occurred in this state in 2011, without patient names or addresses, but with full patient demographics, diagnoses, procedures, attending physician, hospital, a summary of charges, and how the bill was paid. Linking these de-identified health records with public newspapers from the same year from Washington State leads, 43 per cent of the time, to revealing the patient’s name and sometimes even her/his street address (Sweeney, 2015).

As expected, patients are concerned about the potential of health data sharing and linkage to result in data misuse and compromised privacy (Simon et al., 2009). However, they are also enthusiastic about their capacity to improve the quality and safety of health care through giving their informed consent to sharing some or all of their own health records for purposes of (medical) science in general or some specific research project (Shaw et al., 2016).

Our approach addresses precisely these problems in finding and “hiding” sensitive text. It allows machine learning algorithms to use all encrypted data, but not all raw text.

2 Background

We assume all participants are Honest But Curious (HBC) (Paverd et al., 2014). We limit the need for “trusted” intermediaries (Dwork & Roth, 2014), and where such intermediaries are used, we restrict (by aggregation and secure computing) the information they may receive.

2.1 Text Processing

We consider the typical linear so-called “1-best” pipeline as outlined in (Johnson, 2014). This could be extended to parallel, iterative, or network approaches. We shall ignore feature engineering and simply presume a (large) number of categorical variables. A particular example of text processing pipeline output is outlined in Figure 1 of (Hirschberg & Manning, 2015).

In training, the text processing pipeline also uses labelled text segments, which may be document-, sentence- or word-labels (e.g. for sentiment analysis) or some combination of text fragments (such as used by the Browser Rapid Annotation Tool (BRAT) for collaborative text annotation (Stenetorp et al., 2012)). In each case, we may represent the features as numeric labels (where the “tags” are converted into a dictionary) and a series of numeric values. We shall be interested in binary values, such as might result from a “1-hot” encoding. These become our observations, and also labels.
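The conversion of tags into a dictionary and a binary “1-hot” vector can be sketched as follows. This is a minimal illustration only: the tag names and the helpers `build_dictionary` and `one_hot` are hypothetical and are not part of the pipeline described here.

```python
def build_dictionary(tags):
    """Map each distinct tag to an integer index (sorted for reproducibility)."""
    return {tag: idx for idx, tag in enumerate(sorted(set(tags)))}


def one_hot(tag, dictionary):
    """Return a binary vector with a single 1 at the index of `tag`."""
    vec = [0] * len(dictionary)
    vec[dictionary[tag]] = 1
    return vec


# Hypothetical word-level tags, e.g. exported from an annotation tool.
tags = ["DRUG", "PERSON", "DISEASE", "PERSON"]
dictionary = build_dictionary(tags)
features = [one_hot(t, dictionary) for t in tags]
```

Each row of `features` is a binary observation; a binary label can be encoded in the same way.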

A differential-privacy approach is not suitable for this data type: adding “noise” to the encoded text will either render the “new” text meaningless, or be overcome by treating the noise as spelling or grammatical errors. We use the approach of (Shannon, 1949): to improve independent secret systems by concatenating them. In this case, we first encrypt the numeric features and labels (using a Paillier homomorphic encryption system (Paillier, 1999; Damgård & Jurik, 2001)). This makes direct interpretation difficult. Second, we attempt to address the data dependencies by applying irreversible aggregation to the numeric data so as to hide many of the implied dependencies between observations. Finally, we wrap the learning process in a secure learning approach, to further reduce the capacity of an inquisitive user to discover the underlying labels. This reflects the well-known fact that data dependencies must be accounted for in training, validation, and testing of machine learning methods in order to produce reliable performance estimates (Suominen et al., 2008; Pahikkala et al., 2012).

Figure 1: Text processing pipeline with encryption; pre-processing is encapsulated in the dashed arrow.

2.2 Partial Homomorphic Encryption

The Paillier encryption scheme (Paillier, 1999) (and later generalisations (Damgård & Jurik, 2001)) is a public-private key encryption scheme. We alter the notation of (Franz, 2011) (note, this is different to (Djatmiko et al., 2014)) where an integer $x$ is encrypted as $[\![x]\!]$ and the decrypt operation is $\langle\cdot\rangle$. In other words,

$$\langle [\![x]\!] \rangle = x. \qquad (1)$$

The operations $[\![\cdot]\!]$ and $\langle\cdot\rangle$ in Definition (1) are public key encryptions: users can encrypt data (and perform computations) using a common public key; however, only the user with the corresponding private key can extract the clear data.

The main operations for Paillier homomorphic encryption are the operators $\oplus$ and $\otimes$. They are defined for two integers $x_1, x_2 < n$, where $n = pq$ is a constant for the particular encryption and the parameters $p$ and $q$ are two large primes, as follows:

$$[\![x_1]\!] \oplus [\![x_2]\!] = [\![x_1]\!]\,[\![x_2]\!] \bmod n^2 = [\![x_1 + x_2]\!], \qquad (2)$$

$$a \otimes [\![x_1]\!] = [\![x_1]\!]^{a} \bmod n^2 = [\![a\,x_1]\!], \qquad (3)$$

where $a$ is an un-encrypted real-valued scalar.

Equations (2) and (3) give us the ability to sum encrypted values in the encrypted domain (and consequently decrypt the result) and to multiply encrypted values with un-encrypted scalars. Note that the result (3) does not apply when $a$ is encrypted. For more advanced operations (such as multiplying encrypted values) we use secure computation.
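The two homomorphic operations can be illustrated with a toy Paillier implementation. The sketch below is for illustration only: it assumes the common simplification of using the generator n + 1, uses primes that are far too small to be secure, restricts scalars to integers, and the function names (`keygen`, `encrypt`, `decrypt`, `h_add`, `h_scale`) are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def keygen(p, q):
    """Build a toy Paillier key pair from two primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for the generator g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt the integer m (taken mod n) under the public modulus n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, private_key, c):
    """Recover the plaintext (mod n) using the private key."""
    lam, mu = private_key
    return (pow(c, lam, n * n) - 1) // n * mu % n


def h_add(n, c1, c2):
    """Homomorphic addition: multiply the ciphertexts modulo n^2."""
    return c1 * c2 % (n * n)


def h_scale(n, c, a):
    """Multiply an encrypted value by a clear integer scalar a: exponentiate."""
    return pow(c, a % n, n * n)


n, private_key = keygen(1009, 1013)
c_sum = h_add(n, encrypt(n, 17), encrypt(n, 25))
c_scaled = h_scale(n, encrypt(n, 17), 3)
print(decrypt(n, private_key, c_sum), decrypt(n, private_key, c_scaled))  # prints: 42 51
```

Neither `h_add` nor `h_scale` touches the private key, so both can be run by any party that holds only the public modulus.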

2.3 Secure Computations

We use results from (Franz, 2011; From & Jakobsen, 2006) to provide several protocols for secure computation between two parties. Work from (Clifton et al., 2002) provides mechanisms for multiple parties (i.e., more than two). We shall assume that one party, the public key holder, operates on encrypted data, and that the other party, the private key holder, can decrypt data. Neither party should be able to discern the numerical values. These protocols comprise the following three steps:

  1. an obfuscation step by the public key holder,

  2. a transfer step to the private key holder, who decrypts and then performs the calculation on clear data and returns an encrypted result to the public key holder (as the result is obfuscated, the private key holder learns nothing from this operation, even though it is performed on clear data), and

  3. the public key holder then removes the original obfuscation.

Other work extends (Franz, 2011) to linear algebra for homomorphic analysis.
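The three steps (obfuscate, transfer, remove obfuscation) can be sketched for the secure multiplication of two encrypted values. This is an illustrative sketch under stated assumptions, not the protocol text of the cited works: it uses additive blinding, a toy Paillier scheme with insecure key sizes, and both parties run in one process. All names (`enc`, `dec`, `h_add`, `h_scale`, `secure_multiply`) are hypothetical.

```python
import math
import random

# Toy Paillier parameters (tiny primes, illustration only).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def enc(m):
    """Public-key encryption of the integer m (mod N)."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2


def dec(c):
    """Decryption; only the private key holder can call this."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def h_add(c1, c2):
    return c1 * c2 % N2


def h_scale(c, a):
    return pow(c, a % N, N2)


def private_key_holder_multiply(c_x, c_y):
    """Step 2: decrypt the blinded operands, multiply in the clear, re-encrypt."""
    return enc(dec(c_x) * dec(c_y))


def secure_multiply(c_a, c_b):
    """Run by the public key holder, who never sees a or b in the clear."""
    r_a, r_b = random.randrange(N), random.randrange(N)
    # Step 1: obfuscation by additive blinding of both operands.
    blinded_a = h_add(c_a, enc(r_a))
    blinded_b = h_add(c_b, enc(r_b))
    # Step 2: transfer to the private key holder.
    c_prod = private_key_holder_multiply(blinded_a, blinded_b)
    # Step 3: remove the obfuscation, using
    # a*b = (a + r_a)(b + r_b) - r_a*b - r_b*a - r_a*r_b (mod N).
    c_prod = h_add(c_prod, h_scale(c_b, -r_a))
    c_prod = h_add(c_prod, h_scale(c_a, -r_b))
    return h_add(c_prod, enc(-r_a * r_b))


product = dec(secure_multiply(enc(6), enc(7)))
```

The private key holder only ever decrypts the blinded values `a + r_a` and `b + r_b`, which are uniformly distributed modulo `N` and so reveal nothing about `a` or `b`.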

We now recall the work of (Nock et al., 2015) and (Patrini et al., 2016) to present relevant parts of the aggregation techniques. This presents learners on specially aggregated data sets where the data set could be in a single location.

2.3.1 Single (Complete) Data Set

We will first consider the data set as a single (coherent) source. That is, all data is held by a single organisation.

Definition 1 ((Numeric) Supervised Learning Space).

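Given a set of $m$ examples $S = \{(x_i, y_i),\ i = 1, \ldots, m\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ are observations, $\mathcal{X}$ is the domain, and $y_i \in \{-1, 1\}$ are binary labels. We are concerned with a (binary) linear classifier $\theta \in \Theta$ for fixed $\Theta \subseteq \mathbb{R}^d$. The label of an observation $x$ is given by

$$\mathrm{label}(x) = \mathrm{sign}(\theta^\top x) \in \{-1, 1\}.$$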

The irreversible aggregation is based on Rademacher observations (rados) as defined below:

Definition 2 (Rado).

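cf. Definition 1 (Nock et al., 2015)

Let $\Sigma_m = \{-1, 1\}^m$. Then, given a set of $m$ examples $S = \{(x_i, y_i)\}_{i=1}^{m}$ with binary labels $y_i \in \{-1, 1\}$, and for any $\sigma \in \Sigma_m$ with $\sigma = [\sigma_1, \ldots, \sigma_m]^\top$, the Rademacher observation $\pi_\sigma$ with signature $\sigma$ is

$$\pi_\sigma = \frac{1}{2} \sum_{i=1}^{m} (\sigma_i + y_i)\, x_i. \qquad (4)$$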
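A Rademacher observation can be computed directly from its definition in (Nock et al., 2015): half the sum, over examples, of (signature entry plus label) times the observation. The sketch below is a minimal illustration on made-up data; the function name `rado` and the example values are hypothetical.

```python
import itertools


def rado(xs, ys, sigma):
    """Rademacher observation: 0.5 * sum_i (sigma_i + y_i) * x_i."""
    d = len(xs[0])
    pi = [0.0] * d
    for x, y, s in zip(xs, ys, sigma):
        w = 0.5 * (s + y)  # equals y when s == y, and 0 otherwise
        for k in range(d):
            pi[k] += w * x[k]
    return pi


# Made-up binary observations and labels in {-1, 1}.
xs = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
ys = [1, -1, 1]

# One rado per signature in {-1, 1}^m; here m = 3, so 8 rados.
all_rados = {s: rado(xs, ys, s) for s in itertools.product((-1, 1), repeat=len(ys))}
```

Only the examples whose signature entry agrees with the label contribute, so each rado is a sum over a subset of the examples; this is what makes the aggregation hard to reverse.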


2.3.2 Multiple Data Sets

This case is described in Figure 1 (Patrini et al., 2016). We do not assume that entities are linked: different text corpora are held by different parties, and no entity resolution is performed.

Definition 3 (BB-Rado).

cf. Definition 1 (Patrini et al., 2016) Consider . Let concatenate zeros to such that . For any , labels and , the -basic block rado for is


2.4 Encrypted Rados

The encryption process occurs after securely processing the text documents at the private location of . Using her/his private key, then encrypts the features, and these are then aggregated. The aggregation occurs blind to , and may be performed by an honest-but-curious intermediary . The rados are generated privately at . Once generated, the rados can be used by other honest-but-curious external parties .

Figure 2 outlines the encryption steps, using secure mathematical operations, and denotes the two parties: the private key holder and an intermediary. The intermediary can “see” the encrypted features and the encrypted labels, and knows the choices of rado vectors (i.e., it knows the values of the rado signatures). It would be possible to operate with the signatures also encrypted.

We re-write Equation (4) below with the encrypted values made explicit. Corresponding secure mathematical operations are also shown. We use the notation to denote a series of homomorphic addition operations ie. . We will use as an abuse of notation, to denote “has the meaning of” rather than equality, as follows:


The resulting “Equation” (6) shows the formation of the (encrypted) rado. The additions and (unencrypted scalar) multiplications must all be translated to the appropriate homomorphic addition and multiplication operations.

The output is an encoded rado, based on any numerical field, that we will refer to as . We outline the procedure to build the rado in Protocol 1: 1 Encrypted Radomacher.

  at peer
Protocol 1 Encrypted Radomacher

2.4.1 Multi-party Case

The case for multiple parties requires the use of a function that appends zeros onto a vector. In the encrypted domain, this may be represented as appending an encrypted scalar (zero) to the encrypted vector. As above, Equation (5) can be re-written in the encrypted domain.

Figure 2: Feature encryption pipeline, showing encryption links and (dashed) knowledge at with private key, and intermediary .

2.5 Learning, Using Encrypted Rados

2.5.1 Unencrypted Single-party Case

Recall the learner for rados (in the unencrypted case) is given by (Nock et al., 2015). We will use the equivalent (exponential) learner for rados as follows:

Lemma 1 (Rado Learning).

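(cf. Lemma 2 (Nock et al., 2015)) For any classifier $\theta$ and example set $S$ of size $m$, and the set of rados $\{\pi_\sigma : \sigma \in \Sigma_m\}$, where $\pi_\sigma$ is the rado with signature $\sigma$ and $\Sigma_m = \{-1, 1\}^m$, minimising the loss

$$F_{\exp}(S, \theta, \Sigma_m) = \frac{1}{2^m} \sum_{\sigma \in \Sigma_m} \exp\left(-\theta^\top \pi_\sigma\right)$$

is equivalent to minimising the standard logistic loss $F_{\log}(S, \theta) = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i\, \theta^\top x_i)\right)$.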
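The equivalence can be checked numerically on a small example. The sketch below assumes the form of the result in (Nock et al., 2015), namely that the logistic loss equals log 2 plus 1/m times the logarithm of the average exponential rado loss over all 2^m signatures. The data, the classifier and the function names are made up for illustration.

```python
import itertools
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rado(xs, ys, sigma):
    """Rademacher observation: 0.5 * sum_i (sigma_i + y_i) * x_i."""
    pi = [0.0] * len(xs[0])
    for x, y, s in zip(xs, ys, sigma):
        for k, xk in enumerate(x):
            pi[k] += 0.5 * (s + y) * xk
    return pi


def logistic_loss(xs, ys, theta):
    """Average logistic loss over the examples."""
    return sum(math.log1p(math.exp(-y * dot(theta, x))) for x, y in zip(xs, ys)) / len(xs)


def exp_rado_loss(xs, ys, theta):
    """Average exponential loss over the rados of all 2^m signatures."""
    m = len(xs)
    total = sum(
        math.exp(-dot(theta, rado(xs, ys, s)))
        for s in itertools.product((-1, 1), repeat=m)
    )
    return total / 2 ** m


xs = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
ys = [1, -1, 1]
theta = [0.3, -0.7, 0.2]

lhs = logistic_loss(xs, ys, theta)
rhs = math.log(2) + math.log(exp_rado_loss(xs, ys, theta)) / len(xs)
```

The two sides agree up to floating-point error because the sum over signatures factorises into a product of one term per example; the two losses therefore share the same minimiser.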

The supervised learning (optimisation) is written as

Problem 1 (Minimise Exponential Loss).

The optimal classifier is given by solving




and is a regularising term (a.k.a. regulariser).

2.5.2 Secure Single-party Case

The exponential in Equation (9) can be computed securely using the protocol outlined in (Yu et al., 2011). The logarithm can be performed using Algorithm 1 (Djatmiko et al., 2016). We perform a gradient descent to solve Problem 1.
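In the clear (unencrypted) domain, gradient descent on a rado loss can be sketched as below. The objective is an assumption made for illustration: the logarithm of the average exponential rado loss plus a ridge regulariser with a hypothetical weight `lam`; the regulariser used in Problem 1 may differ. In the secure setting each exponential, logarithm and product would be replaced by its secure counterpart.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(theta, rados, lam):
    """log of the mean of exp(-theta . pi_j), plus lam * ||theta||^2 (assumed form)."""
    total = sum(math.exp(-dot(theta, pi)) for pi in rados)
    return math.log(total / len(rados)) + lam * dot(theta, theta)


def gradient(theta, rados, lam):
    """Gradient: minus the softmax-weighted mean of the rados, plus 2 * lam * theta."""
    weights = [math.exp(-dot(theta, pi)) for pi in rados]
    total = sum(weights)
    grad = [2.0 * lam * t for t in theta]
    for w, pi in zip(weights, rados):
        for k, p in enumerate(pi):
            grad[k] -= (w / total) * p
    return grad


def gradient_descent(rados, lam=0.1, step=0.1, iters=200):
    """Plain fixed-step gradient descent starting from the zero vector."""
    theta = [0.0] * len(rados[0])
    for _ in range(iters):
        g = gradient(theta, rados, lam)
        theta = [t - step * gk for t, gk in zip(theta, g)]
    return theta


# Made-up rados for illustration.
rados = [[2.0, 0.0, 0.0], [2.0, 1.0, 1.0], [1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
theta = gradient_descent(rados)
```

The fixed step size and iteration count are arbitrary choices for this small example, not tuned values.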

Recall Problem 1. Note that the gradient of , with respect to , is


We note that .

Using our abuse of notation we have


2.5.3 Unencrypted, Multi-party Case

The proof of Theorem 3 (Patrini et al., 2016) shows that the mean square loss can be used, over — that is, on the limited sample sets — by using a modified mean loss as given in Definition 4 as follows:

Definition 4 (BB-Rado loss).

cf. Definition 2 .(Patrini et al., 2016) and Theorem 3 (Patrini et al., 2016) The -loss for the classifier is


where expectation

and variance

are computed with respect to the uniform sampling of in . If the matrix is positive definite, it can be defined as a weighted diagonal matrix


where accounts for (lack of) confidence in certain columns of .

“[M]inimizing the Ridge regularized square loss over examples is equivalent to minimizing a regularized version of the M-loss, over the complete set of all rados.” (Patrini et al., 2016)

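The optimal classifier is given by the simple closed-form expression of Theorem 6 (Patrini et al., 2016). Namely,

$$\theta^\ast = \left( B B^\top + \dim_c(B)\, \Gamma \right)^{-1} B \mathbf{1}, \qquad (14)$$

where $B$ is stacked (column-wise) rados, $\dim_c(B)$ is the number of columns of $B$, and $\Gamma$ is the (positive definite) regularisation matrix. The procedure for building the rados and solving for $\theta^\ast$ is given in (Patrini et al., 2016).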

To solve (14), Hall et al. (2011, revised 2013) recommend an iterative Schur approach (Guo & Higham, 2006). Faster approaches (with fewer multiplications) are achieved by higher order algorithms. An outline and review are given in (Rajagopalan, 1996; Soleymani, 2012). The inverse may be found using linear algebra built on secure multiplication.
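Iterative inversion suits the encrypted setting because it needs only additions and multiplications, which are the operations secure protocols provide. As an illustration, the sketch below uses the Newton-Schulz iteration X ← X(2I − AX) in the clear; this is a stand-in chosen for simplicity and is not necessarily the exact scheme of the cited works. The function names and the starting guess are assumptions.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def newton_schulz_inverse(a, iters=30):
    """Approximate the inverse of a square matrix using only sums and products."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    # Starting guess A^T / (||A||_1 * ||A||_inf), which guarantees convergence.
    x = [[a[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        ax = matmul(a, x)
        correction = [
            [(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)] for i in range(n)
        ]
        x = matmul(x, correction)
    return x


a = [[4.0, 1.0], [2.0, 3.0]]
a_inv = newton_schulz_inverse(a)
```

Each iteration costs two matrix products, so the number of secure multiplications grows linearly with the iteration count; higher order schemes trade more products per step for fewer steps.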

2.5.4 De-risking the Coordinator

The notation of (Patrini et al., 2016) suggests a central coordinator with access to all vectors; we avoid this by returning to Definition 4. Let


and then


The sums in Equations (15) and (16) are over the appropriate rados. However, these rados may be calculated by their peers, so the sums may be broken into per-peer summations, where we consider disjoint sets such that .

Definition 5 (BB-rado per peer).

Consider peers with , where each peer has distinct rados drawn from , and the rados are distinct in . For each peer , we have a expectation and variance defined as




Each peer can calculate and independently.

Although the mean may be calculated centrally, it is preferable to use secure multi-party addition to achieve the same result. This reduces the scope for the coordinator to access (encrypted, aggregated) vectors, and (instead) only access noisy aggregates of the data.
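Secure multi-party addition can be sketched with the ring-based secure sum described in (Clifton et al., 2002): the first party masks its value with a random offset, every other party adds its own value to the running total, and the first party removes the mask at the end. The sketch below simulates all parties in one process; the function name `secure_sum` and the modulus are hypothetical choices.

```python
import random


def secure_sum(values, modulus):
    """Ring-based secure summation of one integer per party, modulo `modulus`.

    Party 0 starts the ring with a random mask, so each later party only
    ever sees a masked partial sum rather than any individual value.
    """
    mask = random.randrange(modulus)
    running = (mask + values[0]) % modulus
    for v in values[1:]:
        running = (running + v) % modulus  # what the next party receives
    return (running - mask) % modulus  # party 0 removes its mask


# Made-up per-peer contributions; the modulus must exceed the true sum.
peer_values = [3, 5, 9]
total = secure_sum(peer_values, modulus=1000)
mean = total / len(peer_values)
```

For vectors, the same procedure is applied element-wise. The scheme assumes honest-but-curious parties that do not collude: two neighbours of a party on the ring could otherwise recover its value by subtraction.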

3 Putting the Bricks together

The algorithm incorporates the secure multi-party summation work of (Clifton et al., 2002), to prevent the coordinator from obtaining the rados directly. This adds a third layer of obfuscation to the data (encryption, aggregation, blind addition), which means that at the coordinator (who can decrypt the data) the data remains protected by the aggregation to rados and the blind addition of the vectors.

0:  peers , coordinator
0:  encrypted classifier at
0:  encrypted local classifier at peer
0:  binary feature vector at { the features available at each peer }  
at coordinator :
  generate Paillier public key & secret key as a pair
  send public key to all peers  
at each peer independently:  
{run local text labelling on document set}  
{ document set; dictionary; data and labels }
   binary vector of observations
  send to
  encrypt data and labels
  build rado’s from encrypted data using public key
   {elementwise homomorphic addition}  
at coordinator :
   SM.Add() {Sec.Add using }
  send mean value to all peers  
at each peer independently:
   {elementwise homomorphic addition}  
at coordinator :
   SM.Add() {Sec.add using }
   S.Inv() {Sec.inversion}
   S.MatProd(,) {Sec.mult}
  for  to  do
     {local classifier for peer}
     send to peer
  end for
Protocol 2 Classifier for Secure Text with Central Coordinator
Figure 3: Communication architecture for multiple peers, common coordinator. The coordinator sends a common dictionary and public key to all peers. Each peer has different data components, with some common elements (cyan). Each peer has encrypted and aggregated its local data (blue server icons). The blue servers correspond to the “intermediary” in Figure 2. The encryption key is generated by the coordinator. Dashed arrows denote information transfers between participants, whilst solid arrows denote local transformations (at the respective participant).

In Figure 3 we have outlined the encrypted pipeline, which combines Figure 1 with the inverse proposed in (Patrini et al., 2016), using Protocol 2 (Classifier for Secure Text with Central Coordinator).

At each peer , we now wish to classify a particular observation vector . Nominally, we would calculate


However, each peer only has a subset of features . We note that the label is determined only by the sign of a scalar, and hence, we can break the inner product into an inner product of local features and remote features as follows:


The local component of Equation (21) may be calculated at peer . If we denote the local classifier result as , then we may write


The summation in Equation (25) is the sum of all (local) calculated classifier results on the sub-components of the vector . The result of Equation (25) shows that the remote classification results may be treated as offsets for the local result — that is, the remote inner products act as corrections to the local result. However, this requires that every peer share linking information about the observation . To avoid this, we replace the summation in Equation (25) with an equivalent rado as follows:


In the homomorphic encrypted case, the local inner product can be calculated by keeping the encrypted classifier vector in its encrypted form, and treating the elements of as unencrypted scalars. Finally, the summation may be achieved using multi-party secure addition, as outlined in (Clifton et al., 2002).
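The local inner product of an encrypted classifier with a clear binary feature vector can be sketched as follows. This is an illustration under stated assumptions: a toy Paillier scheme with insecure key sizes, a hypothetical fixed-point encoding (`SCALE`) to map real-valued weights to integers, and a single process standing in for both the peer and the key holder.

```python
import math
import random

# Toy Paillier parameters (tiny primes, illustration only).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)
SCALE = 1000  # hypothetical fixed-point scale for real-valued weights


def enc(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2


def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N


def encode(v):
    """Map a real weight to an integer mod N; negatives wrap around."""
    return round(v * SCALE) % N


def decode(m):
    """Inverse of `encode`; the upper half of the range represents negatives."""
    return (m - N if m > N // 2 else m) / SCALE


def encrypted_inner_product(enc_theta, x):
    """Inner product of encrypted weights with clear integer features."""
    acc = enc(0)
    for c, xk in zip(enc_theta, x):
        # Scalar multiplication is exponentiation; addition is multiplication.
        acc = acc * pow(c, xk % N, N2) % N2
    return acc


theta = [0.5, -1.25, 0.75]                    # made-up classifier weights
enc_theta = [enc(encode(t)) for t in theta]   # what the peer holds
x = [1, 1, 0]                                 # clear binary features at the peer

score = decode(dec(encrypted_inner_product(enc_theta, x)))
label = 1 if score >= 0 else -1
```

The peer never decrypts anything: it only combines ciphertexts, and the encrypted score can then enter the multi-party secure addition before a single decryption of the final sum.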

0:  coordinator with public-secret key pair
0:  common extra rado at
0:  binary feature vector at { the features available at each peer }
0:  encrypted local classifier at each peer
0:   label from classification  
at peer :
   S.local.innerProd() {Local.scalarproduct}  
at each other peer : {The set of peers may be chosen by }
   S.innerProd() {Sec.innerproduct}  
at peer :
   SM.Add() {Sec.add using scalar }
  send to  
at :
  send to peer  
at :
Protocol 3 Local Classify for Secure Text
| Algorithm | Peers | Rados | Grad. descent | Misclassification | Run time (s) |
|---|---|---|---|---|---|
| LogisticRegression (baseline) | 1 | plain | plain | 0.049 | 0.13 |
| radoBoost | 1 | plain | plain | 0.12 | 1.4 |
| radoLearn using rados from radoBoost | 1 | plain | plain | 0.20 | 0.069 |
| radoLearn | 4 | plain | plain | 0.089 | 0.12 |
| radoLearn | 4 | encrypted | plain | 0.10 | 75 |
| radoLearn | 4 | encrypted | encrypted | 0.085 | 87 |

Table 1: Results using numerical regression, and trivial text analytics. Encryption does not impact the accuracy of the results, but does dramatically reduce computation speed.

3.1 Usage Scenario for Multi-party, Private Text

In this scenario we outline the key procedure for learning from distributed private text corpora, and then classifying a locally private corpus. We shall use names to illuminate actors. Alice and Bob each have private text collections. Alice would like to classify her text by using a combination of patterns learnt from her own data and from Bob’s. Cate provides coordination for Alice and Bob (Cate plays no role in the learning, but is needed as a coordinator). Together, Alice, Bob and Cate follow Protocol 2 to establish a learned feature set. As Alice and Bob may have different feature sets, Cate separates Alice’s appropriate feature vector, and sets the remaining features to zero. Cate then coordinates Alice and Bob through Protocol 3.

4 Preliminary results

Using the Ionosphere data set from the UCI repository (chosen to provide a significant number of numeric features, agnostic of text input), we compare basic analytics under various privacy constraints. For comparison, (Zhou & Jiang, 2004) has reported misclassification rates of 0.109, 0.112 and 0.096 using various neural network approaches. In our case, we have trialled standard linear regression against multiple peers with rados, and with all calculations performed in the encrypted domain. The results are shown in Table 1.

   S.MatProd() {Sec.mult}
   S.Inv() {Sec.inversion }
  return First row of
Protocol 4 Secure Rado Solver

5 Conclusion

We have outlined a protocol to provide secure text analytics, by combining standard features with numeric computation and secure linear algebra with obfuscated addition. Our result may also be used for numeric, un-trusted coordinators. Whilst not guaranteeing security, the protocol addresses common issues with sharing text data – namely visibility of identifiable information.


  • Carrell et al. (2013) Carrell, D, Malin, B, Aberdeen, J, Bayer, S, Clark, C, Wellner, B, and Hirschman, L. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text. Journal of the American Medical Informatics Association, 20(2):342–348, 2013.
  • Chazard et al. (2013) Chazard, E, Mouret, C, Ficheur, G, Schaffar, A, Beuscart, J B, and Beuscart, R. Proposal and evaluation of FASDIM, a Fast And Simple De-Identification Method for unstructured free-text clinical records. International Journal of Medical Informatics, 83(4):303–312, 2013.
  • Clifton et al. (2002) Clifton, Chris, Kantarcioglu, Murat, Vaidya, Jaideep, Lin, Xiaodong, and Zhu, Michael Y. Tools for privacy preserving distributed data mining. SIGKDD Explorations, 4(2), 2002.
  • Dalianis & Velupillai (2010) Dalianis, H and Velupillai, S. De-identifying Swedish clinical text — refinement of a gold standard and experiments with Conditional Random Fields. Journal of Biomedical Semantics, 1(1):6, 2010.
  • Damgård & Jurik (2001) Damgård, Ivan and Jurik, Mats. A generalisation, a simplification and some applications of Paillier’s probabilistic public-key system. In PKC ’01 Proceedings of the 4th International Workshop on Practice and Theory in Public Key Cryptography: Public Key Cryptography, pp. 119–136, 2001.
  • Dankar & Emam (2013) Dankar, Fida K. and Emam, Khaled El. Practicing differential privacy in health care: A review. Transactions on Data Privacy, 5:53–67, 2013.
  • Djatmiko et al. (2014) Djatmiko, Mentari, Friedman, Arik, Boreli, Roksana, Lawrence, Felix, Thorne, Brian, and Hardy, Stephen. Secure evaluation protocol for personalized medicine. In Workshop on Genome Privacy, 2014.
  • Djatmiko et al. (2016) Djatmiko, Mentari, Hardy, Stephen, Henecka, Wilko, Ott, Max, Smith, Guillaume, and Thorne, Brian. (confidential) N1Analytics: Distributed scoring with data confidentiality. whitepaper, January 2016.
  • Dunn et al. (2012) Dunn, Adam G., Day, Richard O., Mandl, Kenneth D., and Coiera, Enrico. Learning from hackers: Open-source clinical trials. Science Translational Medicine, 4(132):132cm5, May 2012.
  • Dwork & Roth (2014) Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. In Foundations and Trends in Theoretical Computer Science, volume 9, pp. 211–407. now, 2014.
  • Estrin & Sim (2010) Estrin, Deborah and Sim, Ida. Open mhealth architecture: An engine for health care innovation. Science, 330(6005):759–760, November 2010.
  • Franz (2011) Franz, Martin. Secure Computations on Non-Integer Values. PhD thesis, Technische Universität Darmstadt, 2011.
  • From & Jakobsen (2006) From, Strange L. and Jakobsen, Thomas. Secure multi-party computation on integers. Master’s thesis, University of Aarhus, 2006.
  • Grouin & Neveol (2014) Grouin, C and Neveol, A. De-identification of clinical notes in French: towards a protocol for reference corpus development. Journal of Biomedical Informatics, 50:151–161, 2014.
  • Guo & Higham (2006) Guo, Chun-Hua and Higham, Nicholas J. A Schur-Newton method for the matrix p’th root and its inverse. Technical report, Manchester Institute for Mathematical Sciences, October 2006.
  • Hall et al. (2011, revised 2013) Hall, Rob, Fienberg, Stephen E., and Nardi, Yuval. Secure multiple linear regression based on homomorphic encryption. Journal of Official Statistics, 27(4):669–691, 2011 (revised 2013).
  • Hirschberg & Manning (2015) Hirschberg, Julia and Manning, Christopher D. Advances in natural language processing. Science, 349(6245):261–266, July 2015.
  • Hrynaszkiewicz et al. (2010) Hrynaszkiewicz, Iain, Norton, Melissa L, Vickers, Andrew J, and Altman, Douglas G. Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers. British Medical Journal, 340, January 2010.
  • Ji et al. (2014) Ji, Zhanglong, Jiang, Xiaoqian, Wang, Shuang, Xiong, Li, and Ohno-Machado, Lucila. Differentially private distributed logistic regression using private and public data. BMC Medical Genomics, 7:S14, 2014.
  • Jin (2007) Jin, Huidong (Warren). Practical issues on privacy-preserving health data mining. In Working Notes of PAKDD’07 Industrial Track, 2007.
  • Johnson (2014) Johnson, Mark. Beyond the 1-best pipeline. Presentation slides, at NICTA-NLP workshop 2014, September 2014.
  • Kayaalp et al. (2014) Kayaalp, M, Browne, A C, Callaghan, F M, Dodd, Z A, Divita, G, Ozturk, S, and McDonald, C J. The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them. Journal of the American Medical Informatics Association, 21(3):423–431, 2014.
  • McDonald & Kelly (2012) McDonald, Diane and Kelly, Ursula. The value and benefit of text mining to UK further and higher education. Digital infrastructure. available online, 2012.
  • McKeon et al. (2013) McKeon, Simon, Alexander, Elizabeth, Brodaty, Henry, Ferris, Bill, Frazer, Ian, and Little, Melissa. Strategic review of health and medical research – better health through research. Technical report, Australian Government, Department of Health and Ageing, April 2013.
  • Meystre et al. (2014) Meystre, S M, Ferrández, Ó, Friedlin, F J, South, B R, Shen, S, and Samore, M H. Text de-identification for privacy protection: A study of its impact on clinical text information content. Journal of Biomedical Informatics, 50(Supplement C):142–150, 2014. Special Issue on Informatics Methods in Medical Privacy.
  • Morita et al. (2013) Morita, M, Kano, Y, Ohkuma, T, Miyabe, M, and Aramaki, E. Overview of the NTCIR-10 MedNLP task. In Proceedings of the 10th NTCIR Conference, Tokyo, Japan, 2013. NTCIR.
  • Nock et al. (2015) Nock, Richard, Patrini, Giorgio, and Friedman, Arik. Rademacher observations, private data, and boosting. J Machine Learning Research, 37, 2015.
  • Oard (2015) Oard, Douglas. Keynote presentation: Beyond information retrieval: When and how not to find things. In CLEF, 2015.
  • Pahikkala et al. (2012) Pahikkala, T., Suominen, Hanna Jasmine, and Boberg, J. Efficient cross-validation for kernelized least-squares regression with sparse basis expansions. Machine Learning, 87(3):381–407, 2012.
  • Paillier (1999) Paillier, Pascal. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, pp. 223–238. Springer-Verlag, 1999.
  • Patrini et al. (2016) Patrini, Giorgio, Nock, Richard, Hardy, Stephen, and Caetano, Tiberio. Fast learning from distributed datasets without entity matching. Technical report, Data61, March 2016.
  • Paverd et al. (2014) Paverd, Andrew, Martin, Andrew, and Brown, Ian. Modelling and automatically analysing privacy properties for honest-but-curious adversaries. Tech. Report, 2014.
  • Rajagopalan (1996) Rajagopalan, Jayasiree. An Iterative Algorithm for Inversion of Matrices. PhD thesis, Concordia University, Montreal, Canada, September 1996.
  • Sayogo & Pardo (2012) Sayogo, Djoko Sigit and Pardo, Theresa A. Exploring the motive for data publication in open data initiative: Linking intention to action. In System Science (HICSS), pp. 2623–2632, Hawaii, USA, January 2012.
  • Shannon (1949) Shannon, Claude E. Communication theory of secrecy systems. The Bell System Technical Journal, 28(4):656–715, May 1949.
  • Shaw et al. (2016) Shaw, D M, Gross, J V, and Erren, T C. Data donation after death: A proposal to prevent the waste of medical research data. EMBO Reports, 17(1):14–17, 2016.
  • Si et al. (2014) Si, Luo, Yang, Grace Hui, Zhang, Sicong, and Cen, Lei (eds.). Proceeding of the 1st International Workshop on Privacy-Preserving IR: When Information Retrieval Meets Privacy and Security, 2014.
  • Simon et al. (2009) Simon, R S, Evans, S J, Benjamin, A, Delano, D, and Bates, W D. Patients’ attitudes toward electronic health information exchange: Qualitative study. Journal of Medical Internet Research, 11, 2009.
  • Soleymani (2012) Soleymani, Fazlollah. A rapid numerical algorithm to compute matrix inversion. International Journal of Mathematics and Mathematical Sciences, 2012, 2012.
  • Stenetorp et al. (2012) Stenetorp, Pontus, Pyysalo, Sampo, Topić, Goran, Ohta, Tomoko, Ananiadou, Sophia, and Tsujii, Jun’ichi. brat: a web-based tool for nlp-assisted text annotation. In Proceedings of the Demonstrations Session at EACL, 2012.
  • Suominen et al. (2014) Suominen, Hanna, Johnson, Maree, Zhou, Liyuan, Sanchez, Paula, Sirel, Raul, Basilakis, Jim, Hanlen, Leif, Estival, Dominique, Dawson, Linda, and Kelly, Barbara. Capturing patient information at nursing shift changes: methodological evaluation of speech recognition and information extraction. J. Am. Med. Info. Assoc., October 2014.
  • Suominen et al. (2008) Suominen, Hanna J., Pahikkala, T., and Salakoski, Tapio. Critical points in assessing learning performance via cross-validation. In Proceedings of the 2nd International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2008), pp. 9–22, 2008.
  • Suominen et al. (2015) Suominen, Hanna J., Zhou, Liyuan, Hanlen, Leif W., and Ferraro, Gabriela. Benchmarking clinical speech recognition and information extraction: New data, methods, and evaluations. JMIR Med Inform, 2(3), April 2015.
  • Sweeney (2015) Sweeney, L. Only you, your doctor, and many others may know. Technology Science, 2015092903, 2015.
  • Thomson (2004) Thomson, Colin. The regulation of health information privacy in Australia. Technical report, National Health and Medical Research Council, 2004.
  • Veitch & Barbour (2010) Veitch, Emma and Barbour, Virginia. Innovations for global health equity: beyond open access towards open data. MEDICC Rev, 12(3):48, Jul 2010.
  • Verhulst et al. (2014) Verhulst, Stefaan, Noveck, Beth Simone, Caplan, Robyn, Brown, Kristy, and Paz, Claudia. The open data era in health and social care: A blueprint for the National Health Service (NHS England) to develop a research and learning programme for the open data era in health and social care. online, May 2014.
  • Vogel (2011) Vogel, Lauren. The secret’s in: open data is a foreign concept in Canada. Canadian Medical Association Journal, 183(7):E375–6, Apr 2011. doi: 10.1503/cmaj.109-3837.
  • Yu et al. (2011) Yu, Ching-Hua, Chow, Sherman S. M., Chung, Kai-Min, and Liu, Feng-Hao. Efficient secure two-party exponentiation. In CT-RSA, pp. 17–32, San Francisco, CA, February 2011.
  • Zhou & Jiang (2004) Zhou, Zhi-Hua and Jiang, Yuan. NeC4.5: Neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, June 2004.