Prive-HD: Privacy-Preserved Hyperdimensional Computing

05/14/2020 ∙ Behnam Khaleghi, et al. ∙ University of California, San Diego

The privacy of data is a major challenge in machine learning, as a trained model may expose sensitive information of the underlying dataset. Moreover, the limited computation capability and capacity of edge devices have made cloud-hosted inference inevitable. Sending private information to remote servers makes the privacy of inference also vulnerable because of susceptible communication channels or even untrustworthy hosts. In this paper, we target privacy-preserving training and inference of brain-inspired Hyperdimensional (HD) computing, an emerging learning paradigm that is gaining traction due to its lightweight computation and robustness, which are particularly appealing for edge devices with tight constraints. Indeed, despite its promising attributes, HD computing has virtually no privacy due to its reversible computation. We present an accuracy-privacy trade-off method through meticulous quantization and pruning of hypervectors, the building blocks of HD, to realize a differentially private model as well as to obfuscate the information sent for cloud-hosted inference. Finally, we show how the proposed techniques can also be leveraged for efficient hardware implementation.

I Introduction

The efficacy of machine learning solutions in performing various tasks has made them ubiquitous in different application domains. The performance of these models generally scales with the size of the training dataset, so machine learning models utilize copious proprietary and/or crowdsourced data, e.g., medical images. In this sense, different privacy concerns arise. The first issue is model exposure [abadi2016deep]. Obscurity is not a guaranteed approach to privacy, as the parameters of a model (e.g., the weights in the context of neural networks) might be leaked through inspection. Therefore, even in the presence of an adversary with full knowledge of the trained model parameters, the model should not reveal information about the constituent records.

Second, the increasing complexity of machine learning models, on the one hand, and the limited computation and capacity of edge devices, especially in the IoT domain with extreme constraints, on the other hand, have made offloading computation to the cloud indispensable [teerapittayanon2017distributed, li2018learning]. An immediate drawback of cloud-based inference is compromising client data privacy. The communication channel is not only susceptible to attacks, but an untrusted cloud itself may also expose the data to third-party agencies or exploit it for its own benefit. Therefore, transferring the least amount of information while achieving maximal accuracy is of utmost importance. A traditional approach to deal with such privacy concerns is employing secure multi-party computation that leverages homomorphic encryption, whereby the device encrypts the data and the host performs computation on the ciphertext [gilad2016cryptonets]. These techniques, however, impose a prohibitive computation cost on edge devices.

Previous work on machine learning, particularly on deep neural networks, has generally taken two approaches to preserve the privacy of training (the model) or inference. For privacy-preserving training, the well-known concept of differential privacy is incorporated into the training [dwork2006calibrating, mcsherry2009differentially]. Differential privacy, often regarded as the standard notion of guaranteed privacy, applies a carefully chosen noise distribution to make the response of a query over a database (in our context, the model trained on a dataset) sufficiently randomized that individual records remain indistinguishable while the query result stays fairly accurate. Perturbing partially processed information, e.g., the output of a convolution layer in neural networks, before offloading it to a remote server is another line of privacy-preserving work [wang2018not, osia2020hybrid, mireshghallah2020shredder] that targets inference privacy. Essentially, it degrades the mutual information of the conveyed data. This approach degrades prediction accuracy and requires (re)training the neural network to compensate for the injected noise [wang2018not], or analogously learning the parameters of a noise that the network can tolerate [mireshghallah2020shredder, mireshghallah2020principled], which is not always feasible, e.g., when the model is inaccessible.

In this paper, for the first time, we scrutinize Hyperdimensional (HD) computing from a privacy perspective. HD is a novel, efficient learning paradigm that imitates brain functionality in cognitive tasks, in the sense that the human brain computes with patterns of neural activity rather than scalar values [kanerva2009hyperdimensional, schmuck2019hardware, mitrokhin2019learning, neubertintroduction]. These patterns and the underlying computations can be realized by points and lightweight operations in a hyperdimensional space, i.e., by hypervectors of about 10,000 dimensions. Similar to other statistical mechanisms, the privacy of HD might be preserved by noise injection, where, formally, the granted privacy is directly proportional to the amount of introduced noise and inversely proportional to the sensitivity of the mechanism. Nonetheless, as a query hypervector (HD's raw output) has thousands of multi-bit dimensions, the sensitivity of the HD model can be extremely large, which requires a tremendous amount of noise to guarantee differential privacy and significantly reduces accuracy. Similarly, the magnitude of each output dimension is large, and so is the intensity of the noise required to disguise the transferred information for inference. Therefore, we require more prudent approaches to augment HD with differentially private training as well as to blur the information of offloaded inference.

Our main contributions are as follows. We show the privacy breach of HD and introduce techniques, including well-devised hypervector (query and/or class) quantization and dimension pruning, to reduce the sensitivity and, consequently, the noise required to achieve a differentially private HD model. We also target inference privacy by showing how quantizing the query hypervector during inference can achieve good prediction accuracy as well as multifaceted power efficiency while significantly degrading the Peak Signal-to-Noise Ratio (PSNR) of reconstructed inputs (i.e., diminishing the useful transferred information). Finally, we propose an approximate hardware implementation that benefits from the aforementioned innovations for further performance and power efficiency.

II Preliminary

II-A Hyperdimensional Computing

Encoding is the first and major operation involved in both training and inference of HD. Assume that an input vector (an image, a voice sample, etc.) comprises $n$ dimensions (elements or features). Thus, each input can be represented as in (1). The $f_i$'s are the elements of the input, where each feature takes a value among $L$ quantized levels ranging from $v_1$ to $v_L$. In a black and white image, there are only two feature levels ($L = 2$), namely 0 and 1.

$\mathcal{V} = \langle f_1, f_2, \ldots, f_n \rangle, \quad f_i \in \{v_1, v_2, \ldots, v_L\}$   (1)

Varied HD encoding techniques with different accuracy-performance trade-offs have been proposed [kanerva2009hyperdimensional, imani2019framework]. Equation (2) shows two analogous encodings that yield accuracies similar to or better than the state of the art [imani2019framework].

$\mathcal{H} = \sum_{i=1}^{n} f_i \cdot \mathcal{B}_i$   (2a)
$\mathcal{H} = \sum_{i=1}^{n} \mathcal{L}_{f_i} \cdot \mathcal{B}_i$   (2b)

The $\mathcal{B}_i$'s are randomly chosen, hence nearly orthogonal, bipolar base hypervectors of dimension $D_{hv} \simeq 10{,}000$ that retain the spatial or temporal location of features in an input. That is, $\mathcal{B}_i \in \{-1, +1\}^{D_{hv}}$ and $\rho(\mathcal{B}_i, \mathcal{B}_j) \simeq 0$ for $i \neq j$, where $\rho$ denotes the cosine similarity: $\rho(\mathcal{B}_i, \mathcal{B}_j) = \frac{\mathcal{B}_i \cdot \mathcal{B}_j}{\|\mathcal{B}_i\| \, \|\mathcal{B}_j\|}$. Evidently, there are $n$ fixed base/location hypervectors per input (one per feature). The only difference between the encodings in (2a) and (2b) is that in (2a) the scalar value of each input feature (mapped/quantized to the nearest level in $\{v_1, \ldots, v_L\}$) is directly multiplied into the corresponding base hypervector $\mathcal{B}_i$. In (2b), however, there is a level hypervector of the same length ($D_{hv}$) associated with each of the feature values. Thus, for feature $i$ of the input, instead of multiplying the scalar $f_i$ by the location hypervector $\mathcal{B}_i$, the associated level hypervector $\mathcal{L}_{f_i}$ is combined with $\mathcal{B}_i$. As both vectors are binary (bipolar), this product reduces to dimension-wise XNOR operations. To maintain closeness between features (so that closeness in the original feature values is preserved), $\mathcal{L}_1$ and $\mathcal{L}_L$ are entirely orthogonal, and each $\mathcal{L}_{l+1}$ is obtained by flipping a fixed number of randomly chosen bits of $\mathcal{L}_l$.
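The Python sketch below illustrates the two encodings. It is a minimal sketch under our own assumptions: the helper names, the 10,000-dimension default, and the per-step flip count used to build the level hypervectors are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

D_HV = 10_000          # hypervector dimensionality (assumed, as in the paper's examples)
rng = np.random.default_rng(0)

def make_base_hvs(n_features):
    """Random bipolar base/location hypervectors, one per input feature."""
    return rng.choice([-1, 1], size=(n_features, D_HV))

def make_level_hvs(n_levels):
    """Level hypervectors: L_1 is random; each next level flips a fixed number
    of randomly chosen bits so that L_1 and L_L end up nearly orthogonal."""
    levels = np.empty((n_levels, D_HV), dtype=np.int8)
    levels[0] = rng.choice([-1, 1], size=D_HV)
    flips_per_step = D_HV // (2 * (n_levels - 1))   # illustrative choice
    for l in range(1, n_levels):
        levels[l] = levels[l - 1]
        idx = rng.choice(D_HV, size=flips_per_step, replace=False)
        levels[l, idx] *= -1
    return levels

def encode_2a(features, base_hvs):
    """Encoding of Eq. (2a): scalar feature value times its base hypervector, bundled."""
    return features @ base_hvs            # shape: (D_HV,)

def encode_2b(feature_levels, base_hvs, level_hvs):
    """Encoding of Eq. (2b): bind (element-wise multiply, i.e., XNOR in binary
    representation) each feature's level hypervector with its base hypervector,
    then bundle (add)."""
    bound = level_hvs[feature_levels] * base_hvs
    return bound.sum(axis=0)              # shape: (D_HV,)
```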

Training of HD is simple. After generating the encoding hypervector $\mathcal{H}_j$ of each input belonging to class/label $\ell$, the class hypervector $\mathcal{C}_\ell$ can be obtained by bundling (adding) all such $\mathcal{H}_j$'s. Assuming there are $J_\ell$ inputs having label $\ell$:

$\mathcal{C}_\ell = \sum_{j=1}^{J_\ell} \mathcal{H}_j$   (3)

Inference of HD has a two-step procedure. The first step encodes the input (similar to encoding during training) to produce a query hypervector $\mathcal{Q}$. Thereafter, the similarity of $\mathcal{Q}$ and all class hypervectors is obtained to find the class with the highest similarity:

$\rho(\mathcal{Q}, \mathcal{C}_\ell) = \dfrac{\mathcal{Q} \cdot \mathcal{C}_\ell}{\|\mathcal{Q}\| \, \|\mathcal{C}_\ell\|}$   (4)

Note that $\|\mathcal{Q}\|$ is a repeating factor when comparing $\mathcal{Q}$ with all classes, so it can be discarded. The factor $\|\mathcal{C}_\ell\|$ is also constant for a class, so it only needs to be calculated once.

Retraining can boost the accuracy of the HD model by discarding mispredicted queries from the corresponding mispredicted classes and adding them to the right class. Retraining examines whether the model correctly returns the label $\ell$ for an encoded query $\mathcal{Q}$. If the model mispredicts it as label $\ell'$, the model is updated as follows:

$\mathcal{C}_\ell = \mathcal{C}_\ell + \mathcal{Q}, \qquad \mathcal{C}_{\ell'} = \mathcal{C}_{\ell'} - \mathcal{Q}$   (5)
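A minimal sketch of training, inference, and retraining under Equations (3)-(5) follows. The function names and shapes are our own assumptions (not the paper's API), and the sketch assumes hypervectors produced by the hypothetical `encode_2a` helper above.

```python
import numpy as np

def train(encoded, labels, n_classes, d_hv=10_000):
    """Eq. (3): bundle the encoded hypervectors of each class."""
    classes = np.zeros((n_classes, d_hv))
    for h, y in zip(encoded, labels):
        classes[y] += h
    return classes

def predict(query, classes):
    """Eq. (4): cosine similarity; ||Q|| is common to all classes and dropped."""
    norms = np.linalg.norm(classes, axis=1)
    return int(np.argmax(classes @ query / norms))

def retrain_epoch(classes, encoded, labels):
    """Eq. (5): move mispredicted queries from the wrong class to the right one."""
    for q, y in zip(encoded, labels):
        y_hat = predict(q, classes)
        if y_hat != y:
            classes[y] += q
            classes[y_hat] -= q
    return classes
```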

II-B Differential Privacy

Differential privacy targets the indistinguishability of a mechanism (or algorithm), meaning whether observing the output of an algorithm, i.e., the result of its computation, may disclose the computed data. Consider the classical example of a sum query $f(k) = \sum_{i=1}^{k} x_i$ over a database, with the $x_i$'s being the first $k$ rows and $x_i \in \{0, 1\}$, i.e., the value of each record is either 0 or 1. Although the function does not directly reveal the value of an arbitrary record $x_k$, it can be readily obtained by two requests as $x_k = f(k) - f(k-1)$. Speaking formally, a randomized algorithm $\mathcal{A}$ is $\varepsilon$-indistinguishable or $\varepsilon$-differentially private if for any inputs $\mathcal{D}_1$ and $\mathcal{D}_2$ that differ in one entry (a.k.a. adjacent inputs) and any set of outputs $S$ of $\mathcal{A}$, the following holds:

$\Pr[\mathcal{A}(\mathcal{D}_1) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{A}(\mathcal{D}_2) \in S]$   (6)

This definition guarantees that observing $\mathcal{A}(\mathcal{D}_1)$ instead of $\mathcal{A}(\mathcal{D}_2)$ scales up the probability of any event by no more than $e^{\varepsilon}$. Evidently, smaller values of the non-negative $\varepsilon$ provide stronger guaranteed privacy. Dwork et al. have shown that $\varepsilon$-differential privacy can be ensured by adding Laplace noise of scale $\mathrm{Lap}(\Delta_1/\varepsilon)$ to the output of the algorithm [dwork2006calibrating]. $\Delta_1$, defined as the $\ell_1$ norm in Equation (7), denotes the sensitivity of the algorithm, which represents the amount of change in the mechanism's output caused by changing one of its arguments, e.g., the inclusion/exclusion of an input in training.

$\Delta_1 = \max_{\mathcal{D}_1, \mathcal{D}_2} \|\mathcal{A}(\mathcal{D}_1) - \mathcal{A}(\mathcal{D}_2)\|_1$   (7)

Dwork et al. have also introduced a more amiable $\delta$-approximate $\varepsilon$-indistinguishable, i.e., $(\varepsilon, \delta)$, privacy guarantee, which allows the $\varepsilon$-privacy to be broken with probability $\delta$ [dwork2006our]:

$\mathcal{A}(\mathcal{D}) = f(\mathcal{D}) + \mathcal{N}(0, \Delta_2^2 \sigma^2)$   (8)

$\mathcal{N}(0, \Delta_2^2 \sigma^2)$ is Gaussian noise with mean zero and standard deviation $\Delta_2 \sigma$. Both $f(\mathcal{D})$ and the noise have $D_{hv}$ dimensions, i.e., output class hypervectors of $D_{hv}$ dimensions. Here, $\Delta_2$ is the $\ell_2$-norm sensitivity, which relaxes the amount of additive noise compared to $\Delta_1$. The mechanism meets $(\varepsilon, \delta)$-privacy if $\sigma \geq \sqrt{2\ln(1.25/\delta)}/\varepsilon$ [abadi2016deep]. Achieving a small $\varepsilon$ for a given $\delta$ requires a larger $\sigma$, which by (8) translates to larger noise.
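As a small illustration of the standard Gaussian-mechanism bound above, the snippet below computes the noise multiplier for example privacy parameters; the numbers are illustrative, not the paper's reported settings.

```python
import math

def gaussian_sigma(epsilon, delta):
    """Noise multiplier of the standard Gaussian mechanism:
    sigma >= sqrt(2 ln(1.25/delta)) / epsilon (actual noise std is Delta_2 * sigma)."""
    return math.sqrt(2 * math.log(1.25 / delta)) / epsilon

# Example values; a tighter accounting such as the moments accountant of
# [abadi2016deep] can yield smaller multipliers.
print(gaussian_sigma(epsilon=1.0, delta=1e-5))   # ~4.84
```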

III Proposed Method: Prive-HD

III-A Privacy Breach of HD

Fig. 1: Encoding presented in Equation (2a).

In contrast to deep neural networks, which comprise non-linear operations that somewhat cover up the details of the raw input, HD operations are fairly reversible, leaving it essentially zero privacy. That is, the input can be reconstructed from the encoded hypervector. Consider the encoding of Equation (2a), which is also illustrated in Fig. 1. Multiplying each side of the equation by the base hypervector $\mathcal{B}_k$, for each dimension $j$ gives:

$\mathcal{H}^{(j)} \mathcal{B}_k^{(j)} = f_k \, \mathcal{B}_k^{(j)} \mathcal{B}_k^{(j)} + \sum_{i \neq k} f_i \, \mathcal{B}_i^{(j)} \mathcal{B}_k^{(j)}$   (9)

$\mathcal{B}_k^{(j)} \in \{-1, +1\}$, so $\mathcal{B}_k^{(j)} \mathcal{B}_k^{(j)} = 1$. Summing all dimensions together yields:

$\mathcal{H} \cdot \mathcal{B}_k = f_k \cdot D_{hv} + \sum_{i \neq k} f_i \, (\mathcal{B}_i \cdot \mathcal{B}_k)$   (10)

As the base hypervectors are nearly orthogonal and, especially, $D_{hv}$ is large, the second term on the right side of Equation (10) is negligible. It means that every feature can be retrieved back by $f_k \simeq (\mathcal{H} \cdot \mathcal{B}_k)/D_{hv}$. Note that without loss of generality we assumed the features are used as-is, i.e., not normalized or quantized. Indeed, we are retrieving the features ($f_k$'s), which might or might not be the exact raw elements. Also, although we showed the reversibility of the encoding in (2a), the derivation can easily be adjusted to the other HD encodings. Fig. 2 shows the reconstructed inputs of MNIST samples obtained by using Equation (10) to recover each of the pixels, one by one.
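A sketch of this reconstruction attack is shown below; it reuses the hypothetical `make_base_hvs`/`encode_2a` helpers defined earlier, and the feature count in the usage comment is illustrative.

```python
import numpy as np

def reconstruct(encoded, base_hvs):
    """Eq. (10): project the encoded hypervector onto each base hypervector and
    divide by the dimensionality; cross terms vanish for near-orthogonal bases."""
    d_hv = base_hvs.shape[1]
    return (base_hvs @ encoded) / d_hv

# Illustrative round trip (assumes the helpers from the encoding sketch):
# base = make_base_hvs(n_features=784)
# h = encode_2a(pixels, base)          # offloaded/observed hypervector
# recovered = reconstruct(h, base)     # close to the original pixel values
```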

That being said, an encoded hypervector sent for cloud-hosted inference can be inspected to reconstruct the original input. This reversibility also breaches the privacy of the HD model. According to the definition of differential privacy, consider two datasets that differ by one input. If we subtract all class hypervectors of the models trained over the two datasets, the result (difference) will be exactly the encoded vector of the missing input (remember from Equation (3) that class hypervectors are simply created by adding the encoded hypervectors of the associated inputs). The encoded hypervector, hence, can be decoded back to obtain the missing input.

Fig. 2: Original and retrieved handwritten digits.

III-B Differentially Private HD Training

Let $\mathcal{M}_1$ and $\mathcal{M}_2$ be models trained with the encoding of Equation (2a) over datasets that differ in a single datum (input), present in the first dataset but not in the second. The outputs (i.e., class hypervectors) of $\mathcal{M}_1$ and $\mathcal{M}_2$ thus differ in the inclusion of a single $D_{hv}$-dimensional encoded vector $\mathcal{H}$ that is missing from a particular class of $\mathcal{M}_2$; the other class hypervectors will be the same. Each bipolar base hypervector constituting the encoding (see Equation (2) or Fig. 1) is random and identically distributed, hence, according to the central limit theorem, each dimension of $\mathcal{H}$ is approximately normally distributed with mean $\mu = 0$ and variance $\sigma_{\mathcal{H}}^2$ proportional to $n$, i.e., the number of vectors building $\mathcal{H}$. For the $\ell_1$ norm, the absolute value of the encoded dimensions matters. Since each dimension of $\mathcal{H}$ has a normal distribution, the mean of the corresponding folded (absolute) distribution is:

$\mathbb{E}\big[|\mathcal{H}^{(j)}|\big] = \sigma_{\mathcal{H}} \sqrt{2/\pi}$   (11)

The $\ell_1$ sensitivity will therefore be $\Delta_1 = D_{hv} \cdot \sigma_{\mathcal{H}} \sqrt{2/\pi}$. For the $\ell_2$ sensitivity we indeed deal with a squared Gaussian (chi-squared) distribution with one degree of freedom, thus:

$\Delta_2 = \Big(\sum_{j=1}^{D_{hv}} \mathbb{E}\big[(\mathcal{H}^{(j)})^2\big]\Big)^{1/2} = \sqrt{D_{hv}} \cdot \sigma_{\mathcal{H}}$   (12)

Note that the mean of the chi-squared distribution with one degree of freedom is equal to the variance ($\sigma_{\mathcal{H}}^2$) of the original distribution of $\mathcal{H}^{(j)}$. Both Equation (11) and (12) imply a large noise to guarantee privacy. For instance, for a modest 200-feature input, the sensitivity is on the order of $\sqrt{200 \cdot D_{hv}}$, while a proportional noise would annihilate the model accuracy. In the following, we articulate the proposed techniques to shrink the variance of the required noise. In the rest of the paper, we only target Gaussian noise, i.e., $(\varepsilon, \delta)$ privacy, since in our case it needs a weaker noise.
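The snippet below evaluates Equations (11)-(12) numerically; it is a sketch under the stated assumption that each encoded dimension is roughly $\mathcal{N}(0, n)$, and the example feature count is illustrative rather than a reported figure.

```python
import math

def hd_sensitivities(n_features, d_hv=10_000):
    """L1/L2 sensitivity estimates of a non-quantized (2a) encoding, following
    Eqs. (11)-(12) under the assumption that each encoded dimension ~ N(0, n)."""
    sigma_h = math.sqrt(n_features)
    delta_1 = d_hv * sigma_h * math.sqrt(2 / math.pi)
    delta_2 = math.sqrt(d_hv) * sigma_h
    return delta_1, delta_2

# Illustrative numbers for a 200-feature input:
d1, d2 = hd_sensitivities(200)
```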

III-B1 Model Pruning

An immediate observation from Equation (12) is that reducing the number of hypervector dimensions mollifies the sensitivity and, hence, the required noise. Not all dimensions of a class hypervector have the same impact on prediction. Remember from Equation (4) that prediction is realized by a normalized dot-product between the encoded query and the class hypervectors. Intuitively, we may prune out the close-to-zero class elements, as their element-wise multiplication with query elements leads to less-effectual results. Notice that this concept (i.e., discarding a major portion of the weights without significant accuracy loss) does not readily hold for deep neural networks, as the impact of small weights might be amplified by large activations of previous layers. In HD, however, information is uniformly distributed over the dimensions of the query hypervector, so overlooking some of the query's information (the dimensions corresponding to the discarded less-effectual dimensions of the class hypervectors) should not cause unbearable accuracy loss.

Fig. 3: Impact of increasing (left) and reducing (right) effectual dimensions.

We demonstrate model pruning with an example in Fig. 3 (which belongs to a speech recognition dataset). In Fig. 3(a), after training the model, we remove all dimensions of a certain class hypervector. Then we incrementally add (return) its dimensions, starting from the less-effectual ones. That is, we first restore the dimensions with (absolute) values close to zero. Then we perform a similarity check (i.e., prediction of a certain query hypervector via normalized dot-product) to figure out what portion of the original dot-product value is retrieved. As can be seen in the same figure, the first 6,000 close-to-zero dimensions retrieve only 20% of the information required for a fully confident prediction. This is because of the uniform distribution of information in the encoded query hypervector: the pruned dimensions do not correspond to vital information of the queries. Fig. 3(b) further clarifies our observation. Pruning the less-effectual dimensions slightly reduces the prediction information of both class A (the correct class, with an initial total of 1.0) and class B (an incorrect class). As more effectual dimensions of the classes are pruned, the slope of information loss plunges. It is worth noting that in this example the ranks of classes A and B are retained.

Fig. 4: Retraining to recover accuracy loss.

We augment model pruning with the retraining explained in Equation (5) to partially recover the information of the pruned dimensions in the remaining ones. For this, we first nullify a chosen number of the close-to-zero dimensions of the trained model, which perpetually remain zero. Therefore, during the encoding of query hypervectors, we no longer need to obtain the corresponding indexes of the queries (note that operations are dimension-wise), which translates to reduced sensitivity. Thereafter, we repeatedly iterate over the training dataset and apply Equation (5) to update the classes involved in mispredictions. Fig. 4 shows that 1-2 iterations are sufficient to achieve the maximum accuracy (the last iteration simply shows the maximum of the previous ones). In lower dimensions, decreasing the number of levels ($L$ in Equation (1), denoted by L in the legend) achieves slightly higher accuracy, as hypervectors lose the capacity to embrace fine-grained details. A pruning-plus-retraining sketch follows below.
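This is a minimal sketch of the pruning step, assuming the hypothetical `retrain_epoch` helper from the earlier sketch; the pruning criterion (aggregate magnitude across classes) and the pruning count are our own illustrative choices.

```python
import numpy as np

def prune_model(classes, n_prune):
    """Zero out the n_prune class dimensions with the smallest aggregate
    magnitude and return the indices of the dimensions that survive."""
    importance = np.abs(classes).sum(axis=0)          # per-dimension magnitude
    pruned = np.argsort(importance)[:n_prune]         # least-effectual dimensions
    classes[:, pruned] = 0
    keep = np.setdiff1d(np.arange(classes.shape[1]), pruned)
    return classes, keep

# After pruning, queries only need the surviving dimensions, and a couple of
# retraining passes recover most of the lost accuracy:
# classes, keep = prune_model(classes, n_prune=4000)
# for _ in range(2):
#     classes = retrain_epoch(classes, encoded_train, labels_train)
```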

III-B2 Encoding Quantization

Previous work on HD computing has introduced the concept of model quantization for compression and energy efficiency, where both the encoding and the class hypervectors are quantized at the cost of significant accuracy loss [salamat2019f5]. We, however, only target quantizing the encoding hypervectors, since the sensitivity is merely determined by the norm of the encoding. Equation (13) shows the 1-bit quantization of the encoding in (2a). The original scalar-vector products, as well as the accumulation, are performed in full precision, and only the final hypervector is quantized. The resultant class hypervectors will also be non-binary (albeit with reduced dimension values).

$\mathcal{H}_{\pm 1} = \operatorname{sign}\Big(\sum_{i=1}^{n} f_i \cdot \mathcal{B}_i\Big)$   (13)
Fig. 5: Accuracy-sensitivity trade-off of encoding quantization.

Fig. 5 shows the impact of quantizing the encoded hypervectors on the accuracy and the sensitivity of the same speech recognition model trained with such encoding. At 10,000 dimensions, the bipolar (i.e., sign) quantization achieves 93.1% accuracy, whereas it is 88.1% in previous work [salamat2019f5]. This improvement comes from the fact that we do not quantize the class hypervectors. We then leveraged the aforementioned pruning approach to employ quantization and pruning simultaneously, as demonstrated in Fig. 5(a). With pruned dimensions, the 2-bit (ternary) quantization achieves 90.3% accuracy, which is only 3% below the full-precision, full-dimension baseline. It should be noted that the small oscillations at specific dimensions, e.g., lower accuracy at 5,000 dimensions compared to 4,000 dimensions with bipolar quantization, are due to the randomness of the initial hypervectors and the non-orthogonality that shows up in smaller spaces.

Fig. 5(b) shows the sensitivities of the corresponding models. After quantizing, the number of features $n$ (see Equation (12)) does not matter anymore. The sensitivity of a quantized model can be formulated as follows:

$\Delta_2 = \Big(\sum_{v} (p_v \cdot D_{hv}) \cdot v^2\Big)^{1/2}$   (14)

$p_v$ shows the probability of value $v$ (e.g., $+1$) in the quantized encoded hypervector, so $p_v \cdot D_{hv}$ is the total occurrence of $v$ in the quantized encoded hypervector. The rest is simply the definition of the $\ell_2$ norm. As the hypervectors are randomly generated and i.i.d., the distribution of quantized values is uniform. That is, in the bipolar quantization, roughly half of the encoded dimensions are $+1$ (or $-1$). We therefore also exploited a biased quantization that gives more weight to $0$ in the ternary quantization, dubbed 'ternary (biased)' in Fig. 5(b). Essentially, the biased quantization adjusts the quantization thresholds so that the probability of $0$ exceeds that of $\pm 1$, which reduces the sensitivity accordingly. Combining quantization and pruning, we could shrink the sensitivity to a small fraction of its original value for the speech recognition model with 617-feature inputs. In Section IV we will examine the impact of adding such noise on the model accuracy for varied privacy budgets.
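The sketch below illustrates bipolar and (biased) ternary quantization together with an empirical evaluation of Equation (14); the `zero_band` threshold heuristic is our own illustrative assumption, not the paper's calibration.

```python
import numpy as np

def quantize_bipolar(h):
    """1-bit (bipolar) quantization of an encoded hypervector."""
    return np.where(h >= 0, 1, -1).astype(np.int8)

def quantize_ternary(h, zero_band=0.5):
    """Ternary {-1, 0, +1} quantization; widening zero_band biases more
    dimensions toward 0, lowering the L2 sensitivity of Eq. (14)."""
    t = zero_band * np.std(h)
    return np.where(h > t, 1, np.where(h < -t, -1, 0)).astype(np.int8)

def l2_sensitivity(q):
    """Eq. (14) evaluated empirically on one quantized hypervector."""
    return float(np.sqrt(np.sum(q.astype(np.float64) ** 2)))
```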

III-C Inference Privacy

Building upon the multi-layer structure of machine learning models, IoT devices mostly rely on performing the primary computations (e.g., feature extraction) on the edge (or an edge server) and offload the decision-making final layers to the cloud [teerapittayanon2017distributed, li2018learning]. To tackle the privacy challenges of offloaded inference, previous work on DNN-based inference generally injects noise into the offloaded computation. This necessitates either retraining the model to tolerate the injected noise distribution [wang2018not] or, analogously, learning the parameters of a noise that maximally perturbs the information with preferably small impact on accuracy [mireshghallah2020shredder, mireshghallah2020principled].

In Section III-A we demonstrated how the original feature vector can be reconstructed from the encoded hypervectors. Inspired by the encoding quantization technique explained in the previous section, we introduce a turnkey technique to obfuscate the conveyed information without manipulating or even accessing the model. Indeed, we observed that quantizing down to 1 bit (bipolar), even in the presence of model pruning, could yield acceptable accuracy. As shown in Fig. 5(a), 1-bit quantization incurred only 0.25% accuracy loss. Those models, however, were trained by accumulating quantized encoding hypervectors. Intuitively, we expect that performing inference with quantized query hypervectors but on full-precision classes (class hypervectors generated from non-quantized encoding hypervectors) should give the same or better accuracy, as quantizing is nothing but degrading information. In other words, previously we checked the similarity of a degraded query against classes also built from degraded information, whereas now we check the similarity of a degraded query against information-rich classes.

Fig. 6: Impact of inference quantization and dimension masking on PSNR and accuracy.

Therefore, instead of sending the raw data, we propose to perform the lightweight encoding on the edge and quantize the encoded vector before offloading it to the remote host. We call this inference quantization to distinguish it from encoding quantization, as inference quantization targets a full-precision model. In addition, we also nullify a specific portion of the encoded dimensions, i.e., mask them out to zero, to further obfuscate the information. Remember that our technique does not need to modify or access the trained model.
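A minimal sketch of the edge-side preparation of the offloaded query is given below; the masking ratio, the random choice of masked dimensions, and the function name are illustrative assumptions.

```python
import numpy as np

def prepare_offload(h, mask_ratio=0.5, rng=np.random.default_rng()):
    """Edge-side preparation of the offloaded query: bipolar-quantize the
    encoded hypervector and zero out a subset of its dimensions.
    The cloud-side model (full-precision class hypervectors) is untouched."""
    q = np.where(h >= 0, 1, -1).astype(np.int8)
    n_mask = int(mask_ratio * q.size)
    q[rng.choice(q.size, size=n_mask, replace=False)] = 0
    return q
```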

Fig. 6 shows the impact of 1-bit inference quantization on the speech recognition model. When only the offloaded information (i.e., the query hypervector with 10,000 dimensions) is quantized, the prediction accuracy is 92.8%, which is merely 0.5% lower than the full-precision baseline. By masking out 5,000 dimensions, the accuracy is still above 91%, while the reconstructed image becomes blurry. Whereas the image reconstructed from a typical encoded hypervector has a PSNR of 23.6 dB, with our technique it shrinks to 13.1 dB.

III-D Hardware Optimization

Fig. 7: Principal blocks of FPGA implementation.
Fig. 8: Investigating the optimal privacy budget, dimensions, and impact of data size in the benchmark models.

The simple bit-level operations involved in the proposed techniques and the dimension-wise parallelism of the computation make FPGAs a highly efficient platform to accelerate privacy-aware HD computing [imani2019sparsehd, salamat2019f5]. We devise efficient implementations to further improve performance and power. We adopt the encoding of Equation (2b), as it provides better optimization opportunities.

For the 1-bit bipolar quantization, a basic approach is adding up all bits of the same dimension, followed by a final sign/threshold operation. This is equivalent to a majority operation between '$-1$'s and '$+1$'s. Note that we can represent $-1$ by 0 and $+1$ by 1 in hardware, as it does not change the underlying logic. We shrink this majority by approximating it with partial majorities. As shown in Fig. 7(a), we use 6-input look-up tables (LUT-6) to obtain the majority of every six bits (out of the $n$ binary elements making up a certain dimension). In case an LUT has an equal number of 0 and 1 inputs, it breaks the tie in a predetermined (random) way. We could repeat this majority staging further, but that would degrade accuracy. Thus, we use majority LUTs only in the first stage, so the next stages form a typical adder-tree [imani2019sparsehd]. This approach is not exact; however, in practice it imposes negligible accuracy loss due to the inherent error tolerance of HD, especially since the approximation is confined to the first stage. The total number of LUT-6s will be:

$\#\mathrm{LUT\text{-}6} \simeq \underbrace{\tfrac{n}{6}}_{\text{majority stage}} + \#\mathrm{LUT}\big(\text{adder-tree over } \tfrac{n}{6} \text{ values}\big)$   (15)

which is less than what an exact adder-tree implementation requires.
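The following software model of the approximate first stage can be used to gauge its effect on accuracy before committing to hardware; the grouping of six bits and the tie-break convention mirror the description above, while everything else (names, leftover handling) is an illustrative assumption.

```python
import numpy as np

def approx_popcount_sign(bits):
    """Software model of the approximate first stage: take the majority of
    every group of six bipolar bits (one LUT-6 each), then add the group
    votes exactly and take the sign. Ties inside a group resolve to +1."""
    n = len(bits) - len(bits) % 6
    groups = bits[:n].reshape(-1, 6)
    votes = np.sign(groups.sum(axis=1))
    votes[votes == 0] = 1                      # predetermined tie-break
    total = votes.sum() + bits[n:].sum()       # leftover bits added exactly
    return 1 if total >= 0 else -1
```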

For the ternary quantization, we first note that each dimension can be $-1$, $0$, or $+1$, so it requires two bits. The minimum (maximum) of adding three such dimensions is therefore $-3$ ($+3$), which requires three bits, while a typical addition of three 2-bit values requires four bits. Thus, as shown in Fig. 7(b), we can pass three numbers (dimensions) to three LUT-6s to produce the 3-bit output. Instead of using an exact adder-tree to sum up the resultant 3-bit values, we use a saturated adder-tree in which the intermediate adders maintain a bit-width of three by truncating the least-significant bit of their output. In a similar fashion to Equation (15), we can show that this technique saves 33.3% of the LUT-6s compared to using an exact adder-tree to sum up the ternary values.

IV Results

IV-A Differentially Private Training

We evaluate the privacy metrics of the proposed techniques by training three models on different categories: the same speech recognition dataset (ISOLET) [Isolet] used throughout the paper, the MNIST handwritten digits dataset, and the Caltech web faces dataset (FACE) [griffin2007caltech]. The goal of the training evaluation is to find the minimum $\varepsilon$ with an affordable impact on accuracy. Similar to [abadi2016deep], we fix the $\delta$ parameter of $(\varepsilon, \delta)$ privacy to a small constant (which is reasonable given the sizes of our datasets). Accordingly, for a particular $\varepsilon$, we can obtain the factor $\sigma$ of the required Gaussian noise (see Equation (8)) from [abadi2016deep]. We iterate over different values of $\varepsilon$ to find the minimum while the prediction accuracy remains acceptable.

Fig. 8(a)-(c) shows the obtained $\varepsilon$ for each training model and the corresponding accuracy. For instance, for the FACE model (Fig. 8(b)), the budget labeled eps1 gives an accuracy within 1.4% of the non-private full-precision model, while, as shown in the same figure, slightly reducing $\varepsilon$ further causes significant accuracy loss. This figure also reveals where the minimum $\varepsilon$ is obtained. For each $\varepsilon$, using the proposed pruning and ternary quantization, we reduce the dimensionality to decrease the sensitivity. At each dimensionality, we inject Gaussian noise with standard deviation $\Delta_2 \sigma$, with $\sigma$ obtainable from $(\varepsilon, \delta)$, e.g., 4.75 for the demanded budget. The sensitivity $\Delta_2$ of the different quantization schemes and dimensions has already been discussed and shown in Fig. 5. When the model has a large number of dimensions, its primary accuracy is better, but it also has a higher sensitivity ($\Delta_2 \propto \sqrt{D_{hv}}$). Thus, there is a trade-off between dimension reduction to decrease the sensitivity (hence, the noise) and the inherent accuracy degradation associated with dimension reduction itself. For the FACE model, we see that the optimal number of dimensions to yield the minimum $\varepsilon$ is 7,000. It should be noted that although there is no prior work on HD privacy (and few works on DNN training privacy) for a head-to-head comparison, we could obtain a single-digit $\varepsilon$ for the MNIST dataset with 1% accuracy loss (with 5,000 ternary dimensions), which is comparable to the differentially private DNN training over MNIST in [abadi2016deep] that achieved a similar $\varepsilon$ at the cost of some accuracy loss. In addition, differentially private DNN training requires a very large number of training epochs, where the per-epoch training time also increases (e.g., in [abadi2016deep]), while we readily apply the noise once after building up all class hypervectors. We also do not retrain the noisy model, as that would violate the concept of differential privacy.
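The one-shot noise application can be sketched as below, reusing the hypothetical helpers above; it is an illustration of the mechanism described in this section, not the authors' exact pipeline.

```python
import numpy as np

def privatize_model(classes, delta_2, sigma, rng=np.random.default_rng()):
    """Add Gaussian noise N(0, (delta_2 * sigma)^2) to every dimension of every
    class hypervector once, after training; no retraining afterwards."""
    noise = rng.normal(0.0, delta_2 * sigma, size=classes.shape)
    return classes + noise
```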

Fig. 8(d) shows the impact of the training data size on the accuracy of the differentially private FACE model. Obviously, increasing the number of training inputs enhances the model accuracy. This is due to the fact that, because of the quantization of the encoded hypervectors, the class vectors made by bundling them have smaller values, so the magnitude of the induced noise becomes comparable to the class values. As more data is trained, the variance of the class dimensions also increases, which can better bury the same amount of noise. This can be considered a vital insight for privacy-preserving HD training.

IV-B Privacy-Aware Inference

Fig. 9: Impact of inference quantization (left) and dimension masking on accuracy and MSE.

Here we show results similar to Fig. 6 for HD models trained on the different datasets. Fig. 9(a) shows the impact of bipolar quantization of the encoding hypervectors on the prediction accuracy. As discussed in Section III-C, here we merely quantize the encoded hypervectors (to be offloaded to the cloud for inference) while the class hypervectors remain intact. Without pruning the dimensions, the accuracy of ISOLET, FACE, and MNIST degrades by 0.85% on average, while the mean squared error of the reconstructed input increases substantially compared to data reconstructed (decoded) from the conventional encoding. Since the ISOLET and FACE datasets consist of extracted features (rather than raw data), we cannot visualize them, but from Fig. 9(b) we can observe that ISOLET gives an MSE similar to MNIST (for which the visualized data can be seen in Fig. 6), while the FACE dataset leads to even higher errors.

In conjunction with quantizing the offloaded inference, as discussed before, we can also prune some of the encoded dimensions to further obfuscate the information. We can see that in the ISOLET and FACE models, discarding up to 6,000 dimensions leads to a minor accuracy degradation while the increase in their information loss (i.e., increased MSE) is considerable. In the case of MNIST, however, the accuracy loss is abrupt and does not allow for aggressive pruning. Nevertheless, even pruning 1,000 of its dimensions (together with quantization) reduces the PSNR to 15 dB, meaning that reconstruction from our encoding is highly lossy.

IV-C FPGA Implementation

We implemented the HD inference using the proposed encoding with the optimizations detailed in Section III-D. We implemented a pipelined architecture with the building blocks shown in Fig. 7(a), as in inference we only used binary (bipolar) quantization. We used a hand-crafted design in Verilog HDL with Xilinx primitives to enable an efficient implementation of the cascaded LUT chains. Except for the proposed approximate adders, the rest of the implementation follows an architecture similar to [imani2019sparsehd]. Table I compares the results of Prive-HD on a Xilinx Kintex-7 FPGA KC705 evaluation kit versus software implementations on a Raspberry Pi 3 embedded processor and an NVIDIA GeForce GTX 1080 Ti GPU. Throughput denotes the number of inputs processed per second, and energy indicates the energy (in Joules) of processing a single input. All benchmarks have the same number of dimensions across the different platforms. For the FPGA, we assumed that all data resides in the off-chip DRAM; otherwise the latency would be affected, but the throughput remains intact as the off-chip latency is hidden by the computation pipeline. Thanks to the massive bit-level parallelism of the FPGA with relatively low power consumption (about 7 W obtained via the Xilinx Power Estimator, compared to the power of the Raspberry Pi measured with a Hioki 3334 power meter and that of the GPU obtained through the NVIDIA system management interface), the average inference throughput of Prive-HD is substantially higher than that of both the Raspberry Pi and the GPU, and Prive-HD likewise improves the energy per input compared to both platforms.

Raspberry Pi GPU Prive-HD (FPGA)
Throughput Energy Throughput Energy Throughput Energy
ISOLET
FACE
MNIST
TABLE I: Comparing Prive-HD on FPGA versus Raspberry Pi and GPU

V Conclusion

In this paper, we disclosed the privacy breach of hyperdimensional computing and presented a privacy-preserving training scheme that quantizes the encoded hypervectors involved in training and reduces their dimensionality, which together enable differential privacy by reducing the required amount of noise. We also showed that we can leverage the same quantization approach, in conjunction with nullifying particular elements of the encoded hypervectors, to obfuscate the information transferred for inference over an untrustworthy cloud (or link). We further proposed hardware optimizations for efficient implementation of the quantization schemes, essentially using approximate cascaded majority operations. Our training technique addresses the discussed challenges of HD privacy and achieves a single-digit privacy budget. Our proposed inference technique, which can be readily employed with an already trained HD model, reduces the PSNR of an image dataset to below 15 dB with an affordable impact on accuracy. Eventually, we implemented the proposed encoding on an FPGA platform, achieving speed-up and energy efficiency over an optimized GPU implementation.

Acknowledgements

This work was supported in part by CRISP, one of six centers in JUMP, an SRC program sponsored by DARPA, in part by SRC-Global Research Collaboration grant, and also NSF grants #1527034, #1730158, #1826967, and #1911095.

References