Single-Component Privacy Guarantees in Helper Data Systems and Sparse Coding

07/15/2019 ∙ by Behrooz Razeghi, et al.

We investigate the privacy of two approaches to (biometric) template protection: Helper Data Systems and Sparse Ternary Coding with Ambiguization. In particular, we focus on a privacy property that is often overlooked, namely how much leakage exists about one specific binary property of one component of the feature vector. This property is e.g. the sign or an indicator that a threshold is exceeded. We provide evidence that both approaches are able to protect such sensitive binary variables, and discuss how system parameters need to be set.







I Introduction

I-A Privacy-Preserving Storage of Biometric Enrollment Data

Biometric data such as fingerprints and irises cannot be treated as a secret. After all, we leave latent fingerprints on many objects that we touch, and high-resolution photos of faces reveal a lot about our irises. Nonetheless, person authentication based on biometrics is still possible, provided that the verifier performs good liveness detection. In spite of the not-really-secret nature of biometric data there are very good reasons to treat them as confidential. Storing biometric databases in unprotected form would lead to various privacy issues. In this paper we focus on one particular privacy problem: some biometric data reveal medical conditions.

The protection of this kind of data must be as good as the protection of passwords. The attacker model in the case of password storage states that the adversary is an insider, i.e., somebody who has access to cryptographic keys. Furthermore, the standard use case considered in most of the literature dictates that the biometric prover does not have to type long keys or to present a smartcard. This combination of attacker model and use case implies that simply encrypting the confidential data is not an option. The typical solution for passwords is to apply a one-way function and to store the hash of each password. However, this solution does not work for noisy data such as biometrics; one bit flip in the input of the hash function causes on average 50% bit flips in the output.
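The sensitivity of hash functions to input noise can be illustrated with a minimal sketch. It assumes SHA-256 as the one-way function and a made-up 16-byte string standing in for a template; neither choice comes from the paper.

```python
import hashlib

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Hypothetical 16-byte "template"; any fixed byte string would do.
template = bytes(range(16))
# Flip a single bit (lowest bit of the first byte) to mimic measurement noise.
noisy = bytes([template[0] ^ 0x01]) + template[1:]

d = hamming_distance(hashlib.sha256(template).digest(),
                     hashlib.sha256(noisy).digest())
print(f"{d} of 256 hash bits differ after a 1-bit input change")
```

The two digests typically differ in roughly half of their 256 bits, so a verifier cannot match a noisy measurement against a stored hash.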

Several techniques have been developed for securely storing noisy credentials, also known as template protection, in the above given context: (i) Helper Data Systems (HDS), also known as fuzzy commitment, secure sketch, fuzzy extractor [1, 2, 3, 4]; (ii) Locality Sensitive Hash (LSH) functions [5, 6]; (iii) homomorphic encryption [7, 8]; and most recently (iv) Sparse Coding with Ambiguization (SCA) [9, 10, 11, 12].

I-B Comparison of Template Protection Techniques

The LSH approach is fast but does not give clear privacy guarantees. Homomorphic encryption has excellent privacy, but is computationally expensive. In this paper we will not consider the LSH and homomorphic encryption approaches.

The HDS approach is the oldest and is well studied. Nevertheless, the narrow privacy question of protecting one specific aspect of the biometric, which is relevant for the above mentioned medical condition, has not been studied in detail.

The aim of this paper is to compare the privacy properties of the HDS and the SCA approach, in particular the ‘medical condition’ aspect. Here it is important to note that previous work on SCA has focused only on the inability of an adversary to reconstruct the full biometric from the enrollment data; that is not the property we will be looking at in the current paper. Mostly, in the literature, the protection of a whole vector is considered. However, often it is the projection of this vector onto some fixed direction that is relevant. The range in which the projection lies can be privacy-sensitive, e.g. the sign of the projection or whether it is far away from average.

I-C Contributions

We concentrate on one component of a to-be-protected random vector, in particular a binary property of that component, which is either the sign or an ‘extremeness’ indicator that checks if the component exceeds some threshold. We investigate how much information about this binary property leaks through the enrollment data.


  • In quantizing HDSs high leakage can occur if a bad parameter choice is made. The best choice is to take an even number of quantization intervals, and to subdivide them into two helper data intervals; then there is zero leakage about the sign and the ‘extremeness’.

  • The Code Offset Method causes negligible leakage.

  • In the SCA mechanism, leakage about the ‘extremeness’ bit can be made negligibly small by setting the ambiguization noise level larger than the Hamming weight of the sparse ternary representation. This noise has little impact on the performance of the authentication system, since it gets removed in case of a genuine user’s verification measurement.

II Preliminaries

II-A Notation and Terminology

Vectors and matrices are denoted by boldface lower-case and upper-case letters, respectively. When there is no distinction between a scalar (), vector (), or matrix (), we write . The notation denotes the -th column of . An enrollment measurement will be written as a vector , and the verification measurement as . When a distinction needs to be made between a random variable (RV) and its numerical value, the RV is written in capitals, and the value in lowercase. Expectation over is denoted as . The Shannon entropy of a discrete RV $X$ is denoted as $H(X)$ and is defined as $H(X) = -\sum_x p_x \log p_x$. The conditional entropy of $X$ given $Y$ is written as $H(X|Y)$ and is defined as $H(X|Y) = \mathbb{E}_y H(X|Y=y)$. The mutual information between $X$ and $Y$ is $I(X;Y) = H(X) - H(X|Y)$. For $p\in[0,1]$ the binary entropy function is defined as $h(p) = -p\log p - (1-p)\log(1-p)$. We use the notation . The superscript $T$ stands for the transpose and $\dagger$ for pseudo-inverse. The logarithm ‘log’ has base 2. Bitwise XOR is denoted as $\oplus$. The Heaviside step function is written as .
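The information-theoretic quantities defined above can be evaluated numerically for small discrete distributions. The helper names below are illustrative and not taken from the paper; the mutual information is computed through the equivalent identity I(X;Y) = H(X) + H(Y) - H(X,Y).

```python
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a probability mass function given as a list."""
    return -sum(p * log2(p) for p in pmf if p > 0)

def binary_entropy(p):
    """Binary entropy function h(p) in bits."""
    return entropy([p, 1 - p])

def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    pxy = [p for row in joint for p in row]     # flattened joint pmf
    return entropy(px) + entropy(py) - entropy(pxy)

print(binary_entropy(0.5))                              # fair bit
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent bits
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # identical bits
```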

We will work with the following authentication setting. The Verifier owns an enrolment database of public data for a set of users . In the verification phase he is presented with a vector and a user label ; his task is to decide if is consistent with .

Fig. 1: Data flow in: (a) generic Helper Data System and (b) general Sparse Coding with Ambiguization mechanism.

II-B Zero Leakage Helper Data Systems

A HDS in its most general form is shown in Fig. 1(a). The Gen procedure takes as input a measurement $X$. Gen outputs a secret $S$ and public Helper Data $W$; the measurement, the secret and the helper data can each be a scalar, vector, or matrix, in general. The helper data is stored in a memory that can be accessed by the adversary. In the reproduction phase, a fresh measurement $Y$ is obtained. Typically $Y$ is close to $X$ but not identical. The Rec procedure takes $Y$ and $W$ as input. It outputs $\hat{S}$, an estimate of $S$. If $Y$ is sufficiently close to $X$, then $\hat{S}=S$. The ‘Zero Leakage’ (ZL) property is defined as $I(S;W)=0$, i.e., the helper data reveals nothing about the secret. Obviously, $W$ has to leak about $X$, since $W$ is a function of $X$. When $S=X$ the HDS is referred to as a Secure Sketch. When $S$ given $W$ has a uniform distribution, the HDS is called a Fuzzy Extractor. If $X$ is a continuum variable, the first step in the signal processing is discretisation. For this purpose, a special ZLHDS has been designed [13, 14, 15] which reduces quantisation errors. The distribution of $X$ needs to be known. After discretisation the Code Offset Method can be applied (Section II-C).

The discretising ZLHDS is shown in Fig. 2. Consider a source $X$, with $X\in\mathbb{R}$. Let $F$ be the cumulative distribution. The $x$-axis is divided into $J$ quantisation intervals $A_s$ corresponding to the extracted secret $s\in\{0,\ldots,J-1\}$. The distribution of $S$ is not necessarily uniform. Let $q_s$ be the left boundary of the interval $A_s$. Let $p_s$ denote the probabilities of the $s$-values. Then $q_s = F^{-1}\big(\sum_{t<s} p_t\big)$. Each $s$-interval is equiprobably divided into $m$ sub-intervals (depicted as grayscales in Fig. 2); the helper data $w\in\{0,\ldots,m-1\}$ is defined as the index of the sub-interval in which the enrollment $x$ lies. The $s$ and $w$ are computed from $x$ as follows:

$$ s = \max\{ s' : q_{s'} \le x \}, \qquad w = \left\lfloor m\,\frac{F(x) - F(q_s)}{p_s} \right\rfloor. $$

In the limit $m\to\infty$, the helper data can be seen as a quantile within the $s$-interval. It holds that $I(S;W)=0$.
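The discretisation step can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: a standard Gaussian source, equiprobable quantisation intervals, and hypothetical names (`zlhds_gen`, `num_intervals`, `num_sub`). The secret is the index of the quantisation interval and the helper data is the index of the equiprobable sub-interval inside it.

```python
from collections import Counter
from math import floor
from statistics import NormalDist

def zlhds_gen(x, num_intervals, num_sub, dist=NormalDist()):
    """Return (secret s, helper data w) for a scalar enrollment value x.

    Assumes equiprobable quantisation intervals, so in the quantile domain
    u = F(x) each interval has width 1/num_intervals.
    """
    u = dist.cdf(x)
    t = min(u * num_intervals, num_intervals - 1e-12)  # guard against u == 1.0
    s = floor(t)                                       # quantisation interval index
    w = min(floor((t - s) * num_sub), num_sub - 1)     # sub-interval index within it
    return s, w

# Evaluate on an evenly spaced grid of quantiles: every (s, w) pair occurs
# equally often, so observing w says nothing about s.
grid = [NormalDist().inv_cdf((i + 0.5) / 800) for i in range(800)]
counts = Counter(zlhds_gen(x, 4, 2) for x in grid)
print(sorted(counts.items()))
```

With four intervals and two sub-intervals, each of the eight (s, w) cells receives the same number of grid points, which is the zero-leakage property in empirical form.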

Fig. 2: Example of the Zero Leakage discretising HDS with four quantisation intervals . The discrete helper data is indicated as grayscales.

II-C The Code Offset Method [16, 1]

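Consider $X\in\{0,1\}^n$. Consider a linear binary error-correcting code with syndrome function $\mathrm{Syn}:\{0,1\}^n\to\{0,1\}^{n-k}$ and syndrome decoding function $\mathrm{SynDec}:\{0,1\}^{n-k}\to\{0,1\}^n$, where $k$ is the message length. The Code Offset Method in its simplest form can be used as a Secure Sketch for any distribution of $X$. With $y$ denoting the noisy verification measurement, the helper data and the reconstruction are defined as follows:

$$ w = \mathrm{Syn}\,x, \qquad \hat{x} = y \oplus \mathrm{SynDec}(w \oplus \mathrm{Syn}\,y). $$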
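A syndrome-based Code Offset Method can be sketched with a toy code. The example below assumes the Hamming(7,4) code, which corrects a single bit error; the code choice and the function names are illustrative only. The helper data is the syndrome of the enrolled string, and reconstruction XORs the noisy string with the decoded error pattern.

```python
def syn(x):
    """Syndrome of a 7-bit word under Hamming(7,4), as an integer in 0..7.

    Column j (1-based) of the parity-check matrix is the binary form of j,
    so the syndrome is the XOR of the positions that hold a 1.
    """
    s = 0
    for pos, bit in enumerate(x, start=1):
        if bit:
            s ^= pos
    return s

def syn_dec(s):
    """Lowest-weight error pattern that has syndrome s (at most one bit set)."""
    e = [0] * 7
    if s:
        e[s - 1] = 1
    return e

def gen(x):
    """Enrollment: the helper data is the syndrome of x."""
    return syn(x)

def rec(x_noisy, w):
    """Reconstruction: estimate the error from the syndrome difference."""
    e = syn_dec(w ^ syn(x_noisy))
    return [a ^ b for a, b in zip(x_noisy, e)]

x = [1, 0, 1, 1, 0, 0, 1]
w = gen(x)
x_noisy = list(x)
x_noisy[4] ^= 1                 # one bit error at verification time
print(rec(x_noisy, w) == x)
```

Because the syndrome is linear, the XOR of the stored and the fresh syndrome equals the syndrome of the error pattern alone, which is what the decoder needs.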


II-D Sparse Ternary Coding (STC) [17, 10]

The encoder is a mapping , where may be smaller, equal to, or larger than . The is a sparse encoding of ; the number of nonzero entries is , which is called the sparsity level. The encoder first applies a projection matrix and then element-wise thresholding , where is a parameter (see Fig. 3(a)).


The threshold is tuned to get the desired sparsity level . The decoder produces an estimator for as .
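The encoding step can be sketched as below, assuming the usual ternary thresholding that maps a projected value to +1 above the threshold, -1 below its negative, and 0 otherwise. The names `stc_encode` and `threshold_for_sparsity` are hypothetical, and the identity projection in the usage lines is chosen only to keep the example readable.

```python
def ternary_threshold(t, lam):
    """Map a real value to {-1, 0, +1} using threshold lam >= 0."""
    if t > lam:
        return 1
    if t < -lam:
        return -1
    return 0

def project(x, W):
    """Matrix-vector product W x, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def stc_encode(x, W, lam):
    """Sparse ternary code of x: projection followed by thresholding."""
    return [ternary_threshold(t, lam) for t in project(x, W)]

def threshold_for_sparsity(projected, k):
    """Threshold that keeps the k largest magnitudes (assumes distinct magnitudes)."""
    mags = sorted((abs(t) for t in projected), reverse=True)
    return mags[k]

x = [0.2, -1.5, 0.9, -0.3]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(stc_encode(x, identity, 0.5))
lam = threshold_for_sparsity(project(x, identity), 1)
print(stc_encode(x, identity, lam))
```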

II-E Sparse Binary Coding (SBC)

Here the thresholding function is (Fig. 3(b)). Given a (raw) feature vector SBC generates a binary vector .

II-F Binary Coding (BC)

This is the component-wise sign operation (Fig. 3(c)) excluding zeros. Given a (raw) feature vector the BC simply generates a binary vector .

II-G Sparse Coding with Ambiguization (SCA) [9, 10]

Given a sparse ternary vector, the ambiguization mechanism turns randomly chosen zero components of that vector into a random $\pm 1$. The resulting ternary vector is stored as enrolment data (see Fig. 4), together with the projection matrix. The randomly added nonzero components make it prohibitively difficult to reconstruct the original measurement from the stored vector, while still allowing a verifier to check if a verification measurement is consistent with it.
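The ambiguization step can be sketched in a few lines. The function name `ambiguize`, the parameter `n_noise` (number of zero positions to overwrite), and the example vector are illustrative assumptions, not taken from the paper.

```python
import random

def ambiguize(a, n_noise, rng):
    """Return a copy of the ternary vector a in which n_noise randomly chosen
    zero entries are replaced by a random +1 or -1."""
    zeros = [i for i, v in enumerate(a) if v == 0]
    chosen = rng.sample(zeros, n_noise)   # requires n_noise <= number of zeros
    u = list(a)
    for i in chosen:
        u[i] = rng.choice((-1, 1))
    return u

a = [1, 0, 0, -1, 0, 0, 0, 1]            # sparse ternary code with 3 nonzeros
u = ambiguize(a, 3, random.Random(0))     # seeded only for reproducibility
print(u)
```

The original nonzero entries are left untouched, so an observer of the stored vector cannot tell which nonzero positions carry information and which are noise.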

Fig. 3: (a) Ternary thresholding; (b) Binary thresholding; (c) Binarisation.

III Problem formulation

We formally capture the ‘medical condition’ issue as follows. We model the existence of the privacy-sensitive medical condition as a binary function of one of the components of the enrollment measurement . That is, we say that is the quantity that should not leak, for some .

We will work with two choices for the function that seem to make sense in our context: (a) (see Fig. 3(b)), and (b) (see Fig. 3(c)). We will assume that the index is not known to the legitimate parties at the time of enrollment. Otherwise, there is a trivial solution.

We will consider only distributions that are symmetric around zero, i.e., probability densities that are even functions. Furthermore we work in the ‘perfect enrollment’ model, which states that there is no measurement noise at enrollment time. In this way we are erring on the side of caution, overestimating the leakage.

IV Results for Helper Data Systems

We zoom in on the relevant component . We introduce shorthand notation and . We write for the cumulative distribution function of .


Fig. 4: Enrolment phase of the SCA scheme.

IV-A Leakage from the Quantising HDS

First we look at the sign variable $V$.

Theorem 1.

Let the distribution of be an even function. Then and


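When $J$ is even, it holds that $\Pr[V=1|W=w]=1/2$ for any $w$. When $J$ is odd and $m$ is even, each sub-interval of the middle quantisation interval lies entirely on one side of zero, so for any $w$ the conditional sign probabilities are unbalanced. When $J$ is odd and $m$ is odd, the above situation holds for all $w$ except the one special value of $w$ in the middle, for which $\Pr[V=1|W=w]=1/2$. ∎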
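The three parity cases can be checked numerically. The sketch below makes an assumption that the text does not state: the quantisation intervals are equiprobable, so that in the quantile domain every sub-interval has the same width. The sign boundary then sits at quantile 1/2, and the leakage about the sign follows from how much of each sub-interval lies above it. The function name and parameters (`J` intervals, `m` sub-intervals) are illustrative.

```python
from fractions import Fraction
from math import log2

def h(p):
    """Binary entropy in bits."""
    p = float(p)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def sign_leakage(J, m):
    """Mutual information (bits) between the sign bit and the helper data,
    for a symmetric source with J equiprobable intervals and m sub-intervals."""
    half = Fraction(1, 2)
    width = Fraction(1, J * m)
    cond_entropy = 0.0
    for w in range(m):
        mass_pos = Fraction(0)                  # P(sign positive AND helper == w)
        for s in range(J):
            lo = (s * m + w) * width
            hi = lo + width
            mass_pos += max(Fraction(0), hi - max(lo, half))
        p_pos = mass_pos * m                    # divide by P(helper == w) = 1/m
        cond_entropy += h(p_pos) / m
    return 1.0 - cond_entropy                   # the sign bit has 1 bit of entropy

for J, m in [(4, 2), (3, 2), (3, 3), (5, 2)]:
    print(J, m, sign_leakage(J, m))
```

An even number of intervals gives zero leakage for every m, while an odd number of intervals leaks, and the leakage shrinks as the number of intervals grows.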

Next we look at the threshold indicator . The entropy of is given by . For symmetric this reduces to .

Theorem 2 below gives an expression for the entropy of given the helper data, in the regime where the threshold lies in the outermost -region (the rightmost grayscale band in Fig. 2), i.e., the medical condition is rare.

Theorem 2.

Let be a symmetric pdf. Let . Then


For it is certain that . This gives . Due to symmetry this equals . Since is a binary RV we have . Finally we use . ∎

Remark: For $m=2$ we see that $H(Z|W)=H(Z)$, i.e., there is no leakage.
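The remark can be verified numerically in the regime of Theorem 2. The sketch assumes equiprobable quantisation intervals and parametrises the threshold by the one-sided tail probability `tail`, meaning the probability that the component lies above the threshold (equal, by symmetry, to the probability that it lies below minus the threshold). These names and the parametrisation are assumptions made for illustration.

```python
from math import log2

def h(p):
    """Binary entropy in bits, with clamping against rounding errors."""
    p = min(max(p, 0.0), 1.0)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def threshold_leakage(tail, J, m):
    """Mutual information (bits) between the extremeness bit and the helper data,
    for J equiprobable intervals with m sub-intervals each."""
    width = 1.0 / (J * m)
    cond_entropy = 0.0
    for w in range(m):
        mass = 0.0                                       # P(extreme AND helper == w)
        for s in range(J):
            lo = (s * m + w) * width
            hi = lo + width
            mass += max(0.0, min(hi, tail) - lo)         # overlap with left tail
            mass += max(0.0, hi - max(lo, 1.0 - tail))   # overlap with right tail
        cond_entropy += h(mass * m) / m                  # P(helper == w) = 1/m
    return h(2 * tail) - cond_entropy

print(threshold_leakage(0.05, 4, 2))   # two sub-intervals
print(threshold_leakage(0.05, 4, 4))   # four sub-intervals
```

With two sub-intervals the left tail falls in one helper value and the right tail in the other, so both helper values carry the same conditional probability and nothing leaks. With four sub-intervals the two middle helper values rule out the extreme event, which creates leakage.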

It is interesting to note that in the HDS with even J and m=2, the helper data leaks absolutely nothing about V and Z.

One may wonder if $m>2$ is still worth considering. It should be noted that the noise resilience of the HDS improves when the number of subdivisions is increased. Therefore we cannot exclude that setting $m>2$ can be a good design choice. Theorem 3 below gives a leakage result for the limiting case $m\to\infty$, i.e. the continuum helper data.

Theorem 3.

Let be an even function. Let .


We compute . Since is uniform on this evaluates to .

Case . For it is certain that . For all other we have .

Case . For we have (left tail and right tail). For all other the probability is . ∎

The relative leakage is plotted in Fig. 5. The dependence on looks strange, but three special points can be understood. (i) For the continuum helper data completely reveals ; (ii) For the helper data contains no information about crossing the threshold. That information is contained in , and the ZLHDS has been designed not to leak anything about ; (iii) At the intermediate value a special symmetry occurs between the left region and the right region . Due to this symmetry the conditional distribution looks the same for every value of .

Furthermore the leakage is a decreasing function of , since at large the two tail regions and have less influence.

Fig. 5: Normalised leakage as a function of . From top to bottom .

IV-B Leakage from the Code Offset Method

Even if the quantising HDS does not reveal , the helper data from the Code Offset Method may still do. Typically the sign gets turned into a bit value somewhere in the binary string that serves as input to the COM.

We use notation as in Section II-C. Consider uniform $X$, and helper data $w$ equal to the syndrome of $x$. At fixed $w$, there are $2^k$ strings $x$ that are consistent with $w$ (with $k$ the message length), and all of them have equal probability. The marginal distribution for one component of $x$ is uniform; hence there is no leakage about the sign bit.
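The counting argument can be checked exhaustively on a toy code. The sketch assumes the Hamming(7,4) code (an illustrative choice, not the code used in the paper), groups all 7-bit strings by their syndrome, and counts how often each bit position equals 1 within each group.

```python
from itertools import product

def syn(x):
    """Hamming(7,4) syndrome: XOR of the 1-based positions that hold a 1."""
    s = 0
    for pos, bit in enumerate(x, start=1):
        if bit:
            s ^= pos
    return s

# Group all 2^7 strings by helper data (syndrome).
cosets = {}
for x in product((0, 1), repeat=7):
    cosets.setdefault(syn(x), []).append(x)

# For each syndrome, count the ones in every bit position.
ones_per_position = {w: [sum(col) for col in zip(*words)]
                     for w, words in cosets.items()}
for w in sorted(ones_per_position):
    print(w, len(cosets[w]), ones_per_position[w])
```

Each of the 8 syndromes is consistent with 16 strings, and within each group every bit position is 1 in exactly half of them, so a uniform source leaks nothing about any single bit.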

The above reasoning no longer applies when $X$ is non-uniform. Then the strings compatible with $w$ are not equiprobable. However, the number of strings summed over in the computation of the marginal is exponentially large; for most ‘normal-looking’ distributions of $X$ the marginal of one component will still be close to uniform.

Finally we briefly depart from the perfect enrollment setting and investigate the effect of measurement noise at enrollment time on the privacy properties of the COM. Suppose there exists a ‘true’ biometric , which is shielded from our view by enrollment noise . The enrollment measurement yields . Then the relevant privacy question is how much is leaking about (parts of) , as opposed to .

Proposition 1.

Let the enrollment noise be bitwise iid Bernoulli noise with bit error rate . Let be the row weight of the error correcting code. Let . Then


. We use . In a good code the redundancy is just slightly larger than the entropy of the syndrome . We estimate the entropy of by setting it close to the Gallager bound [18], , where is the bit error probability of binary symmetric channels concatenated together, . ∎

For reasonable values of the row weight, we see that the leakage is very small, approximately ; and this is leakage about the whole vector .
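The effect of the row weight can be made concrete with the standard closed form for the bit error rate of several binary symmetric channels in cascade. The sketch does not reproduce the bound of Proposition 1; it only evaluates how far a single syndrome bit of the noise is from uniform. The bit error rate 0.1 and the row weights below are hypothetical values chosen for illustration.

```python
from math import log2

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def concatenated_ber(beta, r):
    """Bit error rate of r binary symmetric channels, each with error rate beta,
    in cascade; equivalently the probability that the XOR of r noise bits is 1."""
    return 0.5 * (1.0 - (1.0 - 2.0 * beta) ** r)

def per_bit_gap(beta, r):
    """How far (in bits) one parity of r noise bits is from a uniform bit."""
    return 1.0 - h(concatenated_ber(beta, r))

for r in (1, 5, 10, 20):
    print(r, concatenated_ber(0.1, r), per_bit_gap(0.1, r))
```

The gap shrinks exponentially with the row weight, which is in line with the statement that the leakage is very small for reasonable row weights.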

V Results for Sparse Coding with Ambiguisation

Fig. 6: The schematic block diagram of a physical verification system.

V-A Method

We consider users. Each user has a measurement vector ; it consists of components which are modeled as zero-mean unit-variance Gaussian variables. The enrolment of the vectors is as described in Section II-G. The public data for user is .

Fig. 7: Error probabilities for different sparsity ratios and measurement noise ; setting , .

We consider a verification vector , which is allegedly from user . It is either a noisy version of the enrolled or completely unrelated to it (but drawn from the same distribution). The former case is referred to as Hypothesis , the latter as . We write , with Gaussian noise . The verification procedure works as follows. The verifier computes . If this inner product exceeds a threshold, then he decides on , otherwise . The expression is essentially an STC ‘enrollment’ of without ambiguization noise. Taking the inner product with is meant to remove the ambiguization noise from (we call this ‘purification’), and it results in a similarity score.
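The verification score can be sketched on a hand-made example. Everything concrete here is an assumption for illustration: the projection is the identity, the threshold is 1, and the vectors are small made-up numbers. The stored vector `u` is the ternary code of `x` with three zero positions overwritten by ambiguization noise.

```python
def ternary_threshold(t, lam):
    """Map a real value to {-1, 0, +1} using threshold lam."""
    if t > lam:
        return 1
    if t < -lam:
        return -1
    return 0

def stc(y, lam):
    """Sparse ternary code of y, assuming an identity projection."""
    return [ternary_threshold(t, lam) for t in y]

def score(y, u, lam):
    """Inner product of the clean ternary code of y with the stored vector u."""
    return sum(a * b for a, b in zip(stc(y, lam), u))

lam = 1.0
x = [2.0, -1.8, 0.1, 0.2, -0.1, 0.3, 1.9, -0.05]         # enrolled measurement
u = [1, -1, 1, -1, 1, 0, 1, 0]                           # stc(x) plus noise at indices 2, 3, 4
y_genuine = [1.9, -1.7, 0.2, 0.1, -0.2, 0.2, 2.1, 0.0]   # x plus small noise
y_impostor = [0.1, 1.5, -1.2, 0.3, 0.0, -1.4, 0.2, 1.1]  # unrelated vector

print(score(y_genuine, u, lam), score(y_impostor, u, lam))
```

For the genuine measurement, the zeros of its own ternary code cancel the ambiguization noise in `u`, so the score equals the number of matching nonzero entries. For the unrelated vector the contributions have random signs and the score stays low.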

The goal of the SCA mechanism is to prevent recovery of and the corresponding from the enrolment data while enabling a verifier to check if is consistent with enrolment data . We can characterize the performance of SCA in terms of:

(i) preservation of mutual information between and in the authorized case, i.e. , whilst in the unauthorized case . The same holds for a function of , i.e., and .

(ii) Reconstruction error: We investigate the error probabilities


Let , . Ideally it should hold that and . The latter expression is for random independent of , with the same distribution.

V-B Performance results

Fig. 8: Normalized Information Leakage as a function of ambiguization ratio . . (a) ; (b) equals the PCA transform of the matrix .

We consider a database of random vectors (individuals) with dimensionality , which are generated from the distribution . We then generate the noisy version of with two different noise variances and . We consider square matrix , i.e., .

We look at the error probability averaged over the users and the components, .

Fig. 7 shows the averaged error probability as a function of the ambiguization ratio , for fixed , and different sparsity ratios and measurement noise variances . Several things are worth noting.

  • For the un-enrolled case (random ), the error probability in guessing the privacy bit increases as a function of the ambiguization ratio. This is as expected.

  • In the genuine user case the situation is more complex; the ambiguization noise interferes with the measurement noise.

  • There is a clear gap between the genuine user case and the un-enrolled case. The low (and in some plots nearly constant) error rate for genuine users demonstrates that the ‘purification’ correctly removes the ambiguization noise. Note that the False Negative probability for the overall user matching is much lower than the single-component reconstruction error.

Furthermore we compute the leakage . Fig. 8(a) shows what happens when is set to the trivial value . The leakage decreases from 100% to zero with increasing ambiguization. The curve seems to consist of three ambiguization ratio regimes, with piecewise linear behaviour: , and . At the moment we are not able to explain this behaviour.

Fig. 8(b) shows what happens when a less trivial matrix is used, namely the PCA transform matrix of the matrix . This same is used for all users. We again observe piecewise linear behaviour with the same three intervals. However, the middle piece is no longer constant but increasing. More importantly, the leakage is reduced by orders of magnitude. Fig. 8(b) also shows the leakage from Sparse Binary Coding with Ambiguization; it is slightly smaller than for the ternary case.

V-C Non-square projection matrix

We briefly discuss the case , i.e. the number of random projections is larger than the dimension of . The adversary is confronted with an ambiguized ternary vector , and from it has to guess the by guessing which locations in contain the ambiguization noise. When is larger than , the adversary may be able to distinguish between wrong guesses and the correct guess, as follows. For a wrong guess it will typically hold that is far away from , while on the other hand it holds that . From the correct an estimator for is then obtained as . Hence, information-theoretically speaking, there is no privacy protection. However, the amount of effort in going through all the possible guesses scales as , which is huge. The security is computational, not information-theoretic.

VI Discussion

For quantizing HDSs we have established that there is a clearly identifiable optimal choice for protecting the $V$ and $Z$ bits: taking the number of quantization intervals $J$ to be even, and setting $m=2$. However, for noise tolerance it is advantageous to set $m$ as large as possible. The $Z$-leakage result for $m\to\infty$ (Fig. 5) has some caveats. It is nice that a minimum exists at and , but unfortunately the operational meaning of the threshold is not really well defined. A small shift of the threshold has little impact on the concept “this variable is abnormally far from zero”, but has a large effect in Fig. 5. It is left as a topic for future work to study this further.

There are no such subtleties for the Code Offset Method. We think we can safely conclude that the COM has only negligible leakage.

For the SCA approach we have established that there is a clear gap between how much you know about if you do and do not have access to a matching verification measurement . (Not having such access means trying to reconstruct from the public data.) This is visible as a gap (Fig. 7) in the error probability for reconstructing , and as low mutual information in Fig. 8(b). Determining the leakage about sign() is left for future work. Other topics for future work are further experimentation with different choices of the projection matrix and understanding the piecewise linear shape of the leakage curve.


  • [1] A. Juels and M. Wattenberg, “A fuzzy commitment scheme,” in ACM Conference on Computer and Communications Security (CCS) 1999, 1999, pp. 28–36.
  • [2] J.-P. Linnartz and P. Tuyls, “New shielding functions to enhance privacy and prevent misuse of biometric templates,” in Audio- and Video-Based Biometric Person Authentication.   Springer, 2003.
  • [3] Y. Dodis, M. Reyzin, and A. Smith, “Fuzzy Extractors: How to generate strong keys from biometrics and other noisy data,” in Eurocrypt 2004, ser. LNCS, vol. 3027.   Springer-Verlag, 2004, pp. 523–540.
  • [4] Y. Dodis, R. Ostrovsky, L. Reyzin, and A. Smith, “Fuzzy Extractors: how to generate strong keys from biometrics and other noisy data,” SIAM J. Comput., vol. 38, no. 1, pp. 97–139, 2008.
  • [5] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing.   ACM, 1998, pp. 604–613.
  • [6] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual symposium on Computational geometry.   ACM, 2004, pp. 253–262.
  • [7] R. L. Lagendijk, Z. Erkin, and M. Barni, “Encrypted signal processing for privacy protection: Conveying the utility of homomorphic encryption and multiparty computation,” IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 82–105, 2012.
  • [8] C. Aguilar-Melchor, S. Fau, C. Fontaine, G. Gogniat, and R. Sirdey, “Recent advances in homomorphic encryption: A possible future for signal processing in the encrypted domain,” IEEE Signal Processing Magazine, vol. 30, no. 2, pp. 108–117, 2013.
  • [9] B. Razeghi, S. Voloshynovskiy, D. Kostadinov, and O. Taran, “Privacy preserving identification using sparse approximation with ambiguization,” in IEEE International Workshop on Information Forensics and Security (WIFS), Rennes, France, December 2017, pp. 1–6.
  • [10] B. Razeghi and S. Voloshynovskiy, “Privacy-preserving outsourced media search using secure sparse ternary codes,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, April 2018, pp. 1–5.
  • [11] B. Razeghi, S. Voloshynovskiy, S. Ferdowsi, and D. Kostadinov, “Privacy-preserving identification via layered sparse code design: Distributed servers and multiple access authorization,” in 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, September 2018.
  • [12] S. Rezaeifar, B. Razeghi, O. Taran, T. Holotyak, and S. Voloshynovskiy, “Reconstruction of privacy-sensitive data from protected templates,” in IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, September 2019.
  • [13] E. Verbitskiy, P. Tuyls, C. Obi, B. Schoenmakers, and B. Škorić, “Key extraction from general nondiscrete signals,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 269–279, 2010.
  • [14] J. de Groot, B. Škorić, N. de Vreede, and J. Linnartz, “Quantization in Zero Leakage Helper Data Schemes,” EURASIP Journal on Advances in Signal Processing, 2016, 2016:54.
  • [15] T. Stanko, F. Andini, and B. Škorić, “Optimized quantization in Zero Leakage Helper Data Systems,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 8, pp. 1957–1966, 2017.
  • [16] C. Bennett, G. Brassard, C. Crépeau, and M. Skubiszewska, “Practical quantum oblivious transfer,” in CRYPTO, 1991, pp. 351–366.
  • [17] S. Ferdowsi, S. Voloshynovskiy, D. Kostadinov, and T. Holotyak, “Sparse ternary codes for similarity search have higher coding gain than dense binary codes,” in IEEE Int. Symp. on Inf. Theory (ISIT), 2017.
  • [18] R. Gallager, Low Density Parity Check Codes.   MIT Press, 1963.