Privacy-Preserving Identification via Layered Sparse Code Design: Distributed Servers and Multiple Access Authorization

by Behrooz Razeghi et al.

We propose a new computationally efficient privacy-preserving identification framework based on layered sparse coding. The key idea of the proposed framework is sparsifying transform learning with ambiguization, which consists of a trained linear map, a component-wise nonlinearity, and privacy amplification. We introduce a practical identification framework that consists of two phases: public and private identification. The public, untrusted server provides a fast search service based on the sparse privacy-protected codebook stored at its side. The private, trusted server or the local client application performs a refined, accurate similarity search using the results of the public search and the layered sparse codebooks stored at its side. The private search is performed in the decoded domain, and the accuracy of the private search is chosen based on the authorization level of the client. The efficiency of the proposed method lies in the low computational complexity of encoding, decoding, "encryption" (ambiguization) and "decryption" (purification), as well as the low storage complexity of the codebooks.






I Introduction

Privacy-preserving identification is of great importance for the growing number of applications that require fast and accurate identification. Third parties are assumed to perform the expected services but are curious about the content of the queried data. These applications include, but are not limited to, the Internet of Things (IoT), biometrics, and clinical reports.

In this work, we propose a new distributed framework for privacy-preserving identification based on successive refinement. The successive refinement of information was first studied for the classic source coding problem [1]. The performance of this problem is characterized by rate-distortion theory; the objective is to achieve the rate-distortion bound at each stage. In [2], the authors proposed the Sparse Ternary Coding (STC) scheme for fast search in large-scale identification problems. The theoretical properties of the STC scheme are studied in [3]. Inspired by the successive refinement of information problem, the authors of [12] proposed a multi-layer network that successively generates sparse ternary codes, which closely achieve the Shannon lower bound of the distortion-rate function.

I-A Our Contribution

In this paper, we propose a new framework for multi-stage identification using successive refinement with sparse ternary codes at each layer of the privacy-preserving identification. The proposed privacy-preserving mechanism is based on ambiguization, i.e., the addition of noise to the zero components (co-support) of the sparse data representation in the transform domain. We demonstrate that the security of this scheme does not rely on the secrecy of the transform. Accordingly, we develop a distributed search framework (Fig. 1) with granular access to the results of the search, granted based on the level of authorization expressed in the knowledge of the codebooks and vote refinement. We demonstrate that identification based on the compressed STC representation is a good first stage for fast public identification, while authorized private users can benefit from refined results, enjoying accurate upgrades in the reconstructed real space at low computational complexity. To the best of our knowledge, the proposed scheme is among the first based on successive refinement with sparse ternary coding, bridging the gap to the theoretical performance limits.

Fig. 1: General block diagram of the proposed framework.

I-B Notation

Matrices and vectors are denoted by boldface upper-case (X) and lower-case (x) letters, respectively. We use the same notation for a random vector and its realization; the difference should be clear from the context. x(i) denotes the i-th entry of vector x. For a matrix X, x_j denotes the j-th column of X. The superscript (·)† stands for the pseudo-inverse and (·)^T stands for the transpose. We use the notation [N] for the set {1, 2, ..., N}.

I-C Outline of the Paper

The remainder of the paper is organized as follows. In Sec. II, the problem formulation is introduced. Then, in Sec. III we present our framework. We provide the performance analysis in Sec. IV. Finally, conclusions are drawn in Sec. V.

II Problem Formulation

Suppose that an owner has a collection of N raw vectors in the database X = [x_1, ..., x_N], where each raw vector x ∈ R^n is a random vector with distribution p(x) and bounded variance. In general, the input data might be raw or based on extracted features such as those from (aggregated) local descriptors [4, 5, 6], the top layer of a neural network [7], or the latent space of auto-encoders [8]. The user has a query y which is a noisy version of x, i.e., y = x + z, where we assume z is a Gaussian noise vector with distribution N(0, σ_z² I_n). The user is interested in some information about the subset of the k-NN (or k-ANN) of y. The owner subcontracts the similarity search to an entity called the server. The clients and data owner attempt to protect their data from analysis at the (public) server side, which is assumed to be honest but curious.
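As a small, purely illustrative sketch of this setup (sizes, seeds, and noise levels are assumptions, not the paper's experimental values), the exact k-NN search in the original domain, which the framework below approximates privately, can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

n, N, k = 64, 1000, 5          # feature dimension, database size, list size
sigma_x, sigma_z = 1.0, 0.5    # data and query-noise standard deviations

X = rng.normal(0.0, sigma_x, size=(N, n))      # owner's database of raw vectors
x_true = X[42]                                  # the enrolled item being queried
y = x_true + rng.normal(0.0, sigma_z, size=n)   # noisy query y = x + z

# Exact k-NN in the original domain (the non-private baseline).
dists = np.linalg.norm(X - y, axis=1)
knn = np.argsort(dists)[:k]
assert knn[0] == 42   # at this SNR the true item is the nearest neighbor
```

The rest of the paper replaces this direct search, which would expose both the database and the query, with searches over protected sparse codes.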

Fig. 2: Successive sparse codebooks generation scheme.

III Proposed Framework

III-A Framework Overview

Our framework is composed of the following five steps:

III-A1 Preparation at Owner Side

The owner generates one public sparse codebook plus J private sparse codebooks from the media data that he owns (Fig. 2). The public codebook is sent to the public storage server (e.g., Google Sites) and the private sparse codebooks are sent to the private server storage (e.g., "Friend" sites). The public sparse codebook is generated using the learned sparsifying transform followed by an element-wise nonlinearity and privacy amplification. The private codebooks are generated by the successive refinement encoder explained below.

III-A2 Indexing at Server Sides

The public and private servers index the received sparse codes.

III-A3 Querying at Client Side

The client generates a sparse code from his query data using the same transformation scheme used for public search (Fig. 3). Then, the client sends the sparse code of his query data to the public server and his original domain query to the private server.

III-A4 Initial Search at Public Server Side

The server runs a similarity search to identify the sparse codes that are most similar to the query (Fig. 3). The public list, which consists of indices of the most similar codes, is sent to the private server.

III-A5 Multi-layer List Refinement at Private Server

The private server looks at his first layer codebook and decodes (reconstructs) the sparse codes that are within the public list. Then he runs a similarity search using the received query and the decoded sparse codes, i.e., the similarity is computed in the original domain. This similarity search results in the first private list, which is accessible to the authorized users at level 1. Next, the private server uses his second layer codebook and decodes the sparse codes with indices within the initial private list. The second private list is hereby computed using similarity search between the received query and superposition of the decoded sparse codes of this layer and the previous layer. This list is accessible to the authorized users at level 2. Analogously, the private lists are refined successively by running the similarity search between the query and the superposition of the decoded sparse codes of each layer and all previous layers (Fig. 4).

Fig. 3: Public identification scheme.

III-B Layered Sparse Coding

III-B1 Principal Element

The core of our coding paradigm is as follows:

Encoder: This is defined by a mapping g: R^n → A^m, where A = {−1, 0, +1}. Given the (raw) feature vector x, the encoder generates the sparse code s = g(x) with dimensionality m and rate R, therefore s ∈ A^m.

Indeed, our encoder is based on the sparsifying transform learning model [9] followed by a nonlinear thresholding function that constrains the alphabet of the codes. This model suggests that a feature vector x is approximately sparsifiable using a transform W, that is, Wx = s + e, where s is sparse, i.e., ||s||_0 ≤ k, and e is the representation error of the feature vector, or residual, in the transform domain. The sparse coding problem for this model is a direct constrained projection problem. This sparse approximation is as follows:

ŝ = argmin_s ||Wx − s||_2²  subject to  ||s||_0 ≤ k.
The above direct problem has a closed-form solution for either of the two important regularizers, ℓ0 or ℓ1. Analogous to [10], we consider the ℓ0-"norm" as our sparsity-inducing penalty. In this case, the solution is obtained exactly by hard-thresholding the projection Wx, keeping the k entries of largest magnitude and setting the remaining low-magnitude entries to zero. For this purpose, we define the intermediate vector u = Wx and denote by τ_k the k-th largest magnitude amongst the set {|u(1)|, ..., |u(m)|}. The closed-form solution is then achieved by applying to u a hard-thresholding operator H_k, defined as H_k(u(i)) = u(i) if |u(i)| ≥ τ_k, and 0 otherwise. Now we impose an extra constraint on the alphabet of our codes by applying the ternary hash mapping ψ to H_k(u) as:

s = ψ(H_k(Wx)) = sign(H_k(Wx)),

where sign(·) denotes the element-wise sign function. The bit rate of this code can be formulated as a function of the sparsity level k and the code length m. We denote by g(·) the encoder in general; therefore, a codeword with block-length m and rate R is denoted by s = g(x).
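A minimal sketch of this encoder (hard thresholding followed by the sign nonlinearity); the random matrix here merely stands in for the learned transform:

```python
import numpy as np

def stc_encode(x, W, k):
    """Sparse ternary encoding: project, keep the signs of the k
    largest-magnitude coefficients, zero out everything else."""
    u = W @ x                          # projection into the transform domain
    tau = np.sort(np.abs(u))[-k]       # tau_k: the k-th largest magnitude
    return np.where(np.abs(u) >= tau, np.sign(u), 0.0)

rng = np.random.default_rng(1)
n = m = 16
k = 4
W = rng.normal(size=(m, n)) / np.sqrt(n)   # stand-in for the learned map
s = stc_encode(rng.normal(size=n), W, k)
# s has exactly k non-zero entries, each in {-1, +1}
```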

In general, we have a joint learning problem that can be formulated as:

(Ŵ, Ŝ) = argmin_{W ∈ C_W, S ∈ C_S} ||WX − S||_F² + λ Ω(W) + μ Φ(S),

where λ and μ are regularization parameters, Ω(·) and Φ(·) are regularizers, and C_W and C_S are the constraints on the linear mapper W and the sparse (but not ternarized) code matrix S, respectively. The algorithm for the above problem alternates between solving for S (the sparse coding step) and for W (the transform update step), while the other variable is kept fixed. Finally, the ternarized sparse codebook is obtained as sign(S), which consists of the sparse codewords s_i = g(x_i), i ∈ [N].
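One simple instance of this alternating scheme is sketched below, assuming (for illustration only) an orthonormal constraint on W so that the transform update has a closed-form orthogonal Procrustes solution; the actual constraints and regularizers used in the paper's formulation may differ:

```python
import numpy as np

def hard_threshold(U, k):
    # Keep the k largest-magnitude entries per column, zero the rest.
    S = np.zeros_like(U)
    idx = np.argsort(np.abs(U), axis=0)[-k:, :]
    np.put_along_axis(S, idx, np.take_along_axis(U, idx, axis=0), axis=0)
    return S

def learn_transform(X, k, iters=30, seed=0):
    """Alternate between sparse coding (hard-thresholding WX) and a
    transform update (orthogonal Procrustes), an illustrative variant
    of sparsifying transform learning."""
    n = X.shape[0]
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(iters):
        S = hard_threshold(W @ X, k)           # sparse coding step
        U, _, Vt = np.linalg.svd(S @ X.T)      # transform update step:
        W = U @ Vt                             # argmin over orthonormal W of ||WX - S||_F
    return W, S

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 200))
W, S = learn_transform(X, k=4)
rel_err = np.linalg.norm(W @ X - hard_threshold(W @ X, 4)) / np.linalg.norm(W @ X)
```

With the orthonormal constraint, each sub-problem is solved exactly, so the objective is non-increasing over the iterations.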

Decoder: This is a mapping from the code space back to the feature space. Based on the code s generated at the encoder, the decoder produces the reconstruction x̂ = W† s. That is, our decoding is simply a pseudo-inverse operation.

Fig. 4: Private multiple access identification scheme.

III-B2 Overall Scheme

In [11], the authors studied the reconstruction performance of STC and scaled STC based on the distortion-rate function. It is shown that for relatively small rates the ternarized sparsified codes almost achieve the Shannon distortion-rate function for i.i.d. Gaussian distributed data. In [12], the authors extended the concept of STC to multi-layer STC, a codebook-free scheme that successively refines the reconstruction of the residuals of previous layers. Based on the results in [11] and [12], we formulate our layered sparse coding, which provides a multiple-access privacy-preserving identification scheme. To this end, we first generate our first codebook S_1 with block-length m_1 and rate R_1 as S_1 = g_1(X). Next, we reconstruct the data as X̂_1 = W_1† S_1. This provides the residual E_1 = X − X̂_1. Now, we encode the residual of the first layer to generate the second codebook S_2 with block-length m_2 and rate R_2 as S_2 = g_2(E_1). The reconstructed data and the residual of the second layer are obtained as X̂_2 = X̂_1 + W_2† S_2 and E_2 = E_1 − W_2† S_2, respectively. In the same way, the layered sparse coding scheme, initialized with E_0 = X, can be formulated as:

S_j = g_j(E_{j−1}),   E_j = E_{j−1} − W_j† S_j,   j = 1, ..., J.

Note that X̂_1 → X̂_2 → ... → X̂_J forms a Markov chain. The algorithm successively refines the original database X over J (asymptotically large) stages, such that X̂_J → X.
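The layer recursion can be sketched as follows. Random matrices stand in for the learned per-layer transforms, reconstruction uses the pseudo-inverse as in the decoder above, and a per-layer least-squares scalar gain stands in for the scaled STC of [11]; all of these choices are illustrative:

```python
import numpy as np

def encode_layer(E, W, k):
    # Per-column sparse ternary code of the residual matrix E.
    U = W @ E
    S = np.zeros_like(U)
    idx = np.argsort(np.abs(U), axis=0)[-k:, :]
    np.put_along_axis(S, idx, np.sign(np.take_along_axis(U, idx, axis=0)), axis=0)
    return S

rng = np.random.default_rng(3)
n, N, k, J = 32, 100, 6, 4
X = rng.normal(size=(n, N))

E = X.copy()                                   # E_0 = X
errors = []
for j in range(J):
    W = rng.normal(size=(n, n)) / np.sqrt(n)   # stand-in for the layer-j map
    S = encode_layer(E, W, k)                  # S_j = g_j(E_{j-1})
    D = np.linalg.pinv(W) @ S                  # decoded contribution of layer j
    alpha = np.sum(E * D) / np.sum(D * D)      # least-squares gain per layer
    E = E - alpha * D                          # residual passed to layer j+1
    errors.append(np.linalg.norm(E) / np.linalg.norm(X))
# errors is non-increasing: each layer refines the previous reconstruction
```

The optimal scalar gain guarantees that the residual norm never grows from one layer to the next, mirroring the successive-refinement behavior described above.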

III-C Privacy Amplification Scheme

The core idea of our privacy amplification scheme is to increase the overall entropy of our sparse codes by adding randomness to them. To this end, let U and V be two subspaces such that S = U + V, where S is the space of m-dimensional sparse codes. So, every vector s ∈ S has at least one expression as s = u + v with u ∈ U and v ∈ V. If we have U ∩ V = {0}, then every vector has a unique such expression and we write S = U ⊕ V; S is then called the direct sum of U and V. Now, let U be the space spanned by the non-zero components (support) of s and V be the space spanned by the zero components (co-support) of s. The idea of our ambiguization scheme is to set the ambiguization noise n such that n ∈ V. Furthermore, since S is an inner product space and U ⊥ V, S is the orthogonal direct sum of U and V. It is clear that the ambiguized code is s + n with ⟨s, n⟩ = 0. For more details about the performance of the ambiguization scheme we refer the reader to [10].

III-C1 Owner’s Privacy Amplification

Based on our definition, the data owner simply adds random samples drawn from the code alphabet to the zero components of his sparse codebook and sends the ambiguized sparse codebook to the public server. We denote by k_n the sparsity level of the ambiguization noise at the public server; note that k + k_n ≤ m. Furthermore, the owner may send only a fraction of his sparse codes to the public server. In [11], the authors analyzed this scheme in more detail. In general, the public ambiguized sparse codebook is generated as A(S_1), with block-length m_1 and rate R_1, where A(·) is an ambiguization function, which consists of randomness addition as well as codeword subspace selection.
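A sketch of the owner-side ambiguization step: random ternary samples are placed on the co-support (zero components) of each codeword, so the added noise is orthogonal to the code itself. The function name and noise count are illustrative:

```python
import numpy as np

def ambiguize(s, n_noise, rng):
    """Add n_noise random ternary samples to the zero components of the
    sparse ternary code s; the support of s itself is left untouched."""
    s_pub = s.copy()
    zeros = np.flatnonzero(s == 0)              # the co-support of s
    pick = rng.choice(zeros, size=n_noise, replace=False)
    s_pub[pick] = rng.choice([-1, 1], size=n_noise)
    return s_pub

rng = np.random.default_rng(4)
m = 32
s = np.zeros(m, dtype=int)
s[[1, 5, 9, 20]] = [1, -1, 1, -1]               # a 4-sparse ternary code
s_pub = ambiguize(s, n_noise=10, rng=rng)

assert np.all(s_pub[s != 0] == s[s != 0])       # original support preserved
assert np.count_nonzero(s_pub) == 14            # 4 code + 10 noise components
assert int(s @ (s_pub - s)) == 0                # noise orthogonal to the code
```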

III-C2 Client’s Privacy Amplification

In order to prevent reconstruction of exact information about the client’s interests at the public server side, the client ambiguizes his sparse code by adding random samples drawn from the code alphabet to the zero components of his query. The resulting ambiguized code is called the public query.

Fig. 5: The relation between the probability of correct identification and: a) sparsity ratio, b) encoding rate.

III-D Algorithm

III-D1 Preparation at Owner Side

The owner generates offline the sparse codebook S_1 using the trained linear map W_1 followed by the element-wise nonlinear thresholding operator, i.e., S_1 = g_1(X). Then, the owner performs privacy amplification on this codebook to generate the public sparse codebook A(S_1) with block-length m_1 and rate R_1. This ambiguized codebook is outsourced to the public server storage. Next, the owner successively generates the sparse codebooks S_2, ..., S_J from the first sparse codebook S_1. Therefore, the database is encoded at total rate R_1 + R_2 + ... + R_J. The private sparse codebooks are outsourced to the private server storage. The block diagram of codebook generation is illustrated in Fig. 2.

III-D2 Indexing at Server Sides

The public and private servers index the received sparse codes, e.g., as in [2].

III-D3 Querying at Client Side

The client generates the sparse codeword from its query y, using the shared public trained linear map followed by the element-wise nonlinear operator, therefore obtaining g_1(y). Then, the client ambiguizes his code by adding random samples from the code alphabet to its zero components. The ambiguized public query is sent to the public server. The client also sends his original-domain query to the private server. Each client has a pre-defined authorization level at the private server.

III-D4 Initial Search at Public Server Side

The public server seeks all ANNs within a given radius of the query in order to produce an initial public list of possible candidates, based on a similarity measure defined in the code space. Next, the public server sends the initial list to the private server.

One can use different similarity measures. However, due to its many interesting properties, we consider new similarity and dissimilarity measures based on the support intersection of the sparse codewords [3, 13]. To this end, we decompose the sparse codes a and b into positive and negative parts as a = a⁺ − a⁻ and b = b⁺ − b⁻, where a⁺ and b⁺ correspond to the positive components and a⁻ and b⁻ correspond to the negative components. The similarity score between a and b is defined as:

Sim(a, b) = 1ᵀ(a⁺ ∘ b⁺) + 1ᵀ(a⁻ ∘ b⁻),

and the dissimilarity score between a and b is defined as:

Dis(a, b) = 1ᵀ(a⁺ ∘ b⁻) + 1ᵀ(a⁻ ∘ b⁺),

where ∘ is the Hadamard product. For more details about the theoretical aspects of the considered similarity measure, we refer the reader to [13].

The public list is composed of the indices whose similarity score is higher than a threshold and whose dissimilarity score is below a threshold. Another option is to define a normalized similarity score combining the two measures; the public list is then composed of the indices with the largest normalized scores. Finally, the public server sends the public list back to the private server. The public server can either fix the threshold or fix the number of similar elements.
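The support-based scores can be sketched as follows; the normalization shown (the difference of the two scores divided by the sparsity level) is one plausible choice, not necessarily the exact definition used in the paper:

```python
import numpy as np

def sim_dis(a, b):
    """Similarity/dissimilarity via support intersection of ternary codes:
    matched signs count toward similarity, opposite signs toward dissimilarity."""
    ap, an = (a > 0).astype(int), (a < 0).astype(int)   # positive/negative parts
    bp, bn = (b > 0).astype(int), (b < 0).astype(int)
    sim = int(np.sum(ap * bp) + np.sum(an * bn))        # Hadamard products
    dis = int(np.sum(ap * bn) + np.sum(an * bp))
    return sim, dis

a = np.array([1, -1, 0, 1,  0, -1])
b = np.array([1,  1, 0, 1, -1, -1])
sim, dis = sim_dis(a, b)        # sim = 3 (positions 0, 3, 5), dis = 1 (position 1)
norm_sim = (sim - dis) / np.count_nonzero(a)            # illustrative normalization
```

Because the scores touch only the non-zero positions, they can be computed very cheaply on sparse representations.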

III-D5 Multiple Access List Refinement at Private Server

The private server receives the public list and considers its first-layer codebook S_1, which is the clean, full-length version of the public codebook. Next, the private server reconstructs the codewords with indices reported on the public list. It then computes a distance measure between the private query and the reconstructed sparse codewords in the original signal domain. This produces the first private list. Next, it reconstructs the corresponding codewords of the second-layer codebook S_2. The second private list is obtained by ranking the distances between the query and the superposition of the decoded codewords of the first two layers. In the same way, at the j-th layer the private list is obtained by ranking the distances between the query and the superposition of the decoded codewords of layers 1 through j.
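Putting the pieces together, a sketch of the private-server refinement loop; the per-layer reconstructions, list sizes, and names are illustrative stand-ins for the stored quantities:

```python
import numpy as np

def refine(query, public_list, layer_recons, list_sizes):
    """Successively refine the candidate list: at level j, rank the surviving
    candidates by distance between the query and the superposition of their
    decoded representations from layers 1..j, then keep list_sizes[j] of them."""
    candidates = np.array(public_list)
    superpos = np.zeros((layer_recons[0].shape[0], len(candidates)))
    lists = []
    for j, Xhat in enumerate(layer_recons):
        superpos = superpos + Xhat[:, candidates]       # add layer-j decodings
        d = np.linalg.norm(superpos - query[:, None], axis=0)
        order = np.argsort(d)[: list_sizes[j]]
        candidates = candidates[order]                  # level-j private list
        superpos = superpos[:, order]
        lists.append(candidates.copy())
    return lists

rng = np.random.default_rng(5)
n, N = 8, 20
X = rng.normal(size=(n, N))
# Two mock layers whose sum approximates X with decreasing error.
layer_recons = [0.7 * X, 0.3 * X + 0.05 * rng.normal(size=(n, N))]
query = X[:, 7] + 0.1 * rng.normal(size=n)
lists = refine(query, public_list=list(range(N)),
               layer_recons=layer_recons, list_sizes=[5, 1])
```

Each authorization level simply corresponds to how many layers of this loop a client is allowed to consume.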

Fig. 6: Comparison between the probability of correct identification at the public server and private server.

IV Performance Analysis

In this section we analyze the performance of our method in terms of the probability of correct identification as well as the privacy leakage. To this end, we consider a database of random vectors with dimensionality n, generated from an i.i.d. Gaussian distribution. We then generate the noisy versions of the data at three different signal-to-noise ratios (SNRs). We consider a square sparsifying transform, i.e., m = n. Moreover, the sparsity level of the public sparse codewords and that of the public query code are taken to be the same.

In Fig. 5, we depict the probability of correctly identifying the true query in the public list as a function of the sparsity ratio and the encoding rate. The red, blue and black solid lines show the performance of our method in the case where we impose no privacy amplification on the stored public database and the client’s query. Next, we ambiguize our sparse public codebook by adding random samples in the co-support of the public codewords. Finally, we complete our scenario by considering the privacy protection of the query as well as the owner’s database, i.e., we also ambiguize the codes by adding samples in the co-support of the public query codewords.

In Fig. 6, we compare the probability of correct identification at the public and private servers. We fix the privacy amplification parameters of the public codebook and the client’s query, perform the fast public search in the transform domain, and send the public list back to the private server. The results demonstrate high performance with just a one-layer similarity search in the original domain.

Fig. 7: The relation between normalized similarity and: a) sparsity ratio, b) encoding rate.

Based on the similarity measures defined in (4) and (5), in Fig. 7 we illustrate the relation between the normalized similarity and the sparsity ratio and encoding rate. As shown, at sparsity ratios (rates) close to zero the similarity measure grows much faster than the dissimilarity measure, such that we obtain the maximal normalized similarity at relatively small sparsity ratios (rates). However, beyond a certain level the dissimilarity measure grows faster than the similarity measure.

In [10] and [11], we defined privacy measures in terms of ‘reconstruction leakage’ and ‘clustering leakage’. Based on the results in [10] and [11], the curious public server cannot perform clustering of the stored public database. Moreover, unauthorized clients cannot infer the structure of the database. In order to address the reconstruction leakage of the proposed privacy-preserving identification scheme, consider the mutual information between the original random sequence, the reconstructed random sequence at the public server, and the reconstructed random sequences at the private server. Using the data-processing inequality and the Markovity of these random sequences, we have:

I(X; X̂_pub) ≤ I(X; X̂_1) ≤ I(X; X̂_1, ..., X̂_J).
In Fig. 8, we illustrate the distortion-rate behavior at the public and private servers, which captures the ‘reconstruction’ leakage in these scenarios. Fig. 8(a) depicts the reconstruction leakage for three different ambiguization levels of the public sparse codebook and compares them with the Shannon lower bound. In Fig. 8(b), we illustrate the reconstruction performance at the private server and compare it with the Shannon lower bound. This plot also depicts the accuracy of the private lists for different authorization levels. Note that the illustrated results are obtained without any optimal rate allocation across our codebooks. By utilizing optimal rate allocation and multi-level quantization we can closely approach the Shannon lower bound; that is beyond the scope of this paper.

V Conclusion

We have proposed a novel distributed privacy-preserving identification framework based on layered sparse codes with ambiguization and granular access to the results of identification. The initial fast search is performed on the public server and the refined searches are performed on the distributed private server(s). The accuracy of the private search depends on the authorization level of the client. The results demonstrate the performance of the proposed scheme in terms of the probability of correct identification as well as the privacy leakage measures.

Fig. 8: Distortion-rate behavior at the a) public server, b) private server.


  • [1] W. H. Equitz and T. M. Cover, “Successive refinement of information,” IEEE Transactions on Information Theory, vol. 37, pp. 269–275, 1991.
  • [2] S. Ferdowsi, S. Voloshynovskiy, D. Kostadinov, and T. Holotyak, “Fast content identification in high-dimensional feature spaces using sparse ternary codes,” in IEEE Int. Work. on Inf. Forensics and Security (WIFS), 2016, pp. 1–6.
  • [3] ——, “Sparse ternary codes for similarity search have higher coding gain than dense binary codes,” in IEEE Int. Symp. on Inf. Theory (ISIT), 2017.
  • [4] H. Jégou, M. Douze, and C. Schmid, “On the burstiness of visual elements,” in IEEE Conf. on Comp. Vision and Pattern Recog. (CVPR), 2009, pp. 1169–1176.
  • [5] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in IEEE Conf. on Comp. Vision and Pattern Recog. (CVPR), 2007, pp. 1–8.
  • [6] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in IEEE Conf. on Comp. Vision and Pattern Recog. (CVPR), 2010, pp. 3304–3311.
  • [7] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in Europ. Conf. on Comp. Vision (ECCV), 2014, pp. 584–599.
  • [8] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), 2014.
  • [9] S. Ravishankar and Y. Bresler, “Learning sparsifying transforms,” IEEE Trans. on Signal Processing, vol. 61, no. 5, pp. 1072–1086, 2013.
  • [10] B. Razeghi, S. Voloshynovskiy, D. Kostadinov, and O. Taran, “Privacy preserving identification using sparse approximation with ambiguization,” in IEEE International Workshop on Information Forensics and Security (WIFS), Rennes, France, December 2017, pp. 1–6.
  • [11] B. Razeghi and S. Voloshynovskiy, “Privacy-preserving outsourced media search using secure sparse ternary codes,” in IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), Calgary, Canada, April 2018, pp. 1992–1996.
  • [12] S. Ferdowsi, S. Voloshynovskiy, and D. Kostadinov, “A multi-layer network based on sparse ternary codes for universal vector compression,” ArXiv e-prints, Oct 2017.
  • [13] D. Kostadinov and S. Voloshynovskiy, “Learning non-linear transform with discriminative and minimum information loss priors,” 2018.