I Introduction
With the rise of data science and machine learning, as well as the Internet of Things (IoT), the amount of data produced on a daily basis has increased to a level we are barely able to handle [gubbi2013internet]. As the amount of data that needs to be processed is often significantly larger than small-scale and battery-powered devices can handle, many of these devices are forced to connect to the internet in order to process the data in the cloud. Deep Neural Networks (DNNs) are used for complex classification tasks, such as image classification [deng2009imagenet], sentiment analysis (text analysis) or even entertainment [meltem]. However, the complexity of DNN algorithms makes them impractical for real-world applications such as classification tasks on battery-powered devices. Engineers often face a trade-off between energy efficiency and the achieved classification accuracy. Therefore, we need to create lightweight classifiers, which can perform inference on small-scale devices.
Brain-inspired Hyperdimensional (HD) computing [kanerva2009hyperdimensional] has been proposed as a lightweight learning algorithm and methodology. The principles governing HD computing are based on the fact that the brain computes with patterns of neural activity which are not directly associated with numbers [kanerva2009hyperdimensional]. Machine learning algorithms based on brain-inspired HD computing imitate cognition by exploiting statistical properties of very high-dimensional vector spaces. The first step in HD computing is to map each data point into a high-dimensional space (e.g., 10,000 dimensions). During training, HD computing linearly combines the encoded hypervectors to create a hypervector representing each class. During inference, classification is done by calculating the cosine similarity between the encoded query hypervector and all class hypervectors. The algorithm then predicts the class with the highest similarity score. In the case of multiple classes with high similarity, the algorithm is also suited to expressing the confidence in the correctness of a prediction.
Many publications on brain-inspired HD computing argue that for most practical applications, HD computing has to be trained and tested using floating-point, or at least integer, values [morris2019comphd, imani2019sparsehd]. Binarized HD computing models have provided low classification accuracies, often too low for practical applications. A recently published algorithm, called QuantHD [QuantHD], introduced a method that significantly improves the classification accuracies of binarized and ternarized models. Nevertheless, there still exists a large gap between the classification accuracy of non-binarized and binarized HD computing classifiers. Also, such methods increase the required training time and are unstable, as they tend to get stuck in local minima during training. In this paper, we propose a new method which reduces this classification accuracy gap by between a third and a half whilst simultaneously improving energy efficiency during training by 60% on average. It also makes the training more stable by introducing randomness. We call this technique QubitHD, as it is based on the principle of information being stored in a quantum bit (qubit) before its measurement. The floating-point values represent the quantum state, while the binarized values represent the quantum state after a measurement has been performed.
The main contributions of the paper are the following:

We decrease the gap in classification accuracy between binarized and non-binarized state-of-the-art HD computing-based ML algorithms by 38.8% on average.

We decrease the convergence time by 30 to 50%. Introducing randomness into the algorithm prevents it from getting stuck in small local minima and pushes it to move quickly towards the optimal value. The slow convergence reported by the authors of [QuantHD] was caused precisely by this lack of randomness.

We stop the algorithm from getting stuck in local minima during training, by introducing randomness in the convergence process.

QubitHD performs the similarity check by calculating the Hamming distance between the hypervectors instead of calculating the costly cosine similarity.

We implemented the algorithm on GPU and FPGA, which accelerates training and inference. We also evaluated several classification problems, including human activity, face and text recognition. In terms of energy efficiency and speed, the FPGA implementation of QubitHD provides on average a 56× energy-efficiency improvement and an 8× speedup during training, as compared to state-of-the-art HD computing algorithms [ISLPED]. For comparison, the authors of [QuantHD] only achieve a 34.1× energy-efficiency improvement and a 4.1× speedup during training against the same state-of-the-art HD computing algorithms. When comparing QubitHD with multilayer perceptron (MLP) and binarized neural network (BNN) classifiers, we observe that QubitHD can provide 56× and 52× faster computing in training and testing respectively, while providing similar classification accuracies (see Table III).
II Hyperdimensional Computing
The applications of brain-inspired HD computing in machine learning are diverse. In this publication, we focus only on supervised classification tasks, but a recent publication indicated that HD computing-based ML algorithms can be applied to clustering and semi-supervised learning as well [SemiHD]. The basis of QubitHD is described in Figure 2. The core difference to QuantHD is the binarization step that is discussed in Section III. The non-binarized algorithm with retraining consists of the following steps:
II-A Encoding
The training dataset is preprocessed by converting all datapoints into very high-dimensional vectors (hypervectors). We used hypervectors of length $D = 10{,}000$ in this paper, as it is the standard baseline for all HD computing-based machine learning algorithms. As explained in [QuantHD], the original data is assumed to have $n$ features: $F = \{f_1, f_2, \ldots, f_n\}$. The objective is to encode each feature of each datapoint into a hypervector of dimension $D$. Each feature vector "memorizes" the value and position of the relevant feature. In order to take into account the position of each feature, we use a set of randomly generated base hypervectors $\{B_1, B_2, \ldots, B_n\}$, where $n$ is the total number of features in each data point. Since the base hypervectors are generated uniformly at random (with equal probability for $-1$ and $+1$), they are all nearly mutually orthogonal [kanerva2000random]. The cosine product between hypervectors ranges between $-1$ and $1$. The expected cosine product of independent and randomly generated base hypervectors is $0$, whereas the variance is $1/D$ (random walk), which is small for $D = 10{,}000$. Thereby, the hypervectors are almost orthogonal. This is true only when the number of randomly generated base hypervectors is significantly smaller than the dimension $D$ of the entire vector space. For comparison of the binarized hypervectors, we use the Hamming distance. Therefore, $\delta(B_i, B_j) \approx D/2$ for $i \neq j$. Here $\delta$ is the Hamming distance similarity between the two binarized base hypervectors.
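The near-orthogonality claim can be checked numerically. The following is a minimal sketch, assuming bipolar hypervectors with entries in {-1, +1} and a dimensionality of 10,000; the helper names (`random_hypervector`, `cosine`, `hamming`) are hypothetical and not taken from the paper. It also illustrates that, for bipolar vectors, the cosine similarity is fully determined by the Hamming distance.

```python
import math
import random


def random_hypervector(dim, rng):
    """Draw a bipolar hypervector with independent, equally likely -1/+1 entries."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming(a, b):
    """Number of positions in which two hypervectors differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed, for reproducibility of the illustration only
D = 10_000              # assumed dimensionality

u = random_hypervector(D, rng)
v = random_hypervector(D, rng)

cos_uv = cosine(u, v)   # close to 0, with standard deviation 1 / sqrt(D) = 0.01
ham_uv = hamming(u, v)  # close to D / 2

# For bipolar vectors: dot = D - 2 * hamming, hence cos = 1 - 2 * hamming / D.
cos_from_hamming = 1 - 2 * ham_uv / D
```

Because the cosine follows directly from the Hamming distance in this setting, ranking classes by smallest Hamming distance and by largest cosine similarity gives the same order for binarized hypervectors.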
We also distinguish the actual value of each feature with different hypervectors. One way of doing so is the following. We find the minimum and maximum value of the feature across the entire dataset, and generate two distinct (and random) base hypervectors representing those two values. Every feature value in between the minimum and the maximum is associated with a hypervector that blends the minimum and maximum base hypervectors in proportion to its position between them. Feature values that are close to the minimum are highly correlated with the base hypervector corresponding to the minimum value. Feature values exactly in the middle between the minimum and the maximum are 50% correlated with both base hypervectors. The equivalent principle applies to feature values close to the maximum value.
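Finally, we add all of the results for all the features:
$$H = \sum_{i=1}^{n} B_i \oplus L_i \tag{1}$$
where $\oplus$ denotes the XOR operation, $B_i$ is the base hypervector for the position of feature $i$, and $L_i$ is the hypervector associated with the value of feature $f_i$.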
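The encoding described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: hypervector entries are taken to be binary (0/1) so that XOR applies literally (under a bipolar -1/+1 convention the same binding is an element-wise product), a single value range `[lo, hi]` is shared by all features, and the value hypervector is built by taking a proportional share of components from the maximum hypervector. All function names are hypothetical.

```python
import random


def random_binary_hv(dim, rng):
    """Random binary hypervector with equally likely 0/1 entries."""
    return [rng.randint(0, 1) for _ in range(dim)]


def level_hv(value, lo, hi, hv_min, hv_max):
    """Blend hv_min and hv_max in proportion to where value lies in [lo, hi]."""
    t = (value - lo) / (hi - lo)
    cut = round(t * len(hv_min))  # number of components taken from hv_max
    return hv_max[:cut] + hv_min[cut:]


def encode(features, bases, lo, hi, hv_min, hv_max):
    """Bind each value hypervector to its position hypervector (XOR) and sum."""
    dim = len(hv_min)
    encoded = [0] * dim
    for value, base in zip(features, bases):
        level = level_hv(value, lo, hi, hv_min, hv_max)
        for j in range(dim):
            encoded[j] += base[j] ^ level[j]
    return encoded


rng = random.Random(1)
D = 1000  # reduced dimensionality to keep the example fast
n_features = 4
bases = [random_binary_hv(D, rng) for _ in range(n_features)]
hv_min = random_binary_hv(D, rng)
hv_max = random_binary_hv(D, rng)

H = encode([0.0, 0.25, 0.5, 1.0], bases, 0.0, 1.0, hv_min, hv_max)
```

With this construction, a value at the minimum reproduces `hv_min` exactly, a value at the maximum reproduces `hv_max`, and values in between share components with both.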
II-B Initial training
The first training round is performed by summing up all hypervectors pertaining to the same class. That is, we add up all hypervectors with the same label. This method is called one-shot learning and is, at the moment, the most widespread way of using HD computing in machine learning [ISLPED, Mitra_HD_1, Mitra_HD_2, rahimi2016hyperdimensional, ISSCC]. We now have one matrix of size $K \times D$, where $K$ is the number of existing classes and $D$ is the length of the hypervectors.
II-C Retraining
The classification accuracy of the current model during inference is low [QuantHD]. For this reason, we have to do retraining. As displayed in Figure 3, retraining is done in the following way. We go through the entire dataset of encoded datapoints and test whether they are correctly classified by the current model. For every misclassified datapoint, we have to make additional improvements to the model. Let us assume that a datapoint with the correct label $i$ was incorrectly classified as $j$. We now add the misclassified hypervector to row $i$ of the model (to make them more similar). We also subtract it from row $j$, which corresponds to the incorrectly predicted class (to make them more distinct). In order to decrease the convergence time and the noise, it is common practice to introduce a learning rate in this step, as illustrated in Figure 3a [imani2019adapthd]. This process is repeated several times.
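The initial training and retraining steps can be sketched in a few lines. This is a minimal sketch, assuming encoded datapoints are plain Python lists and labels are integer class indices; the names `initial_training`, `retrain`, `learning_rate` and `iterations` are hypothetical, and the toy data is for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(model, hv):
    """Index of the class hypervector with the largest cosine similarity."""
    return max(range(len(model)), key=lambda c: cosine(model[c], hv))


def initial_training(encoded, labels, n_classes):
    """One-shot learning: sum all encoded hypervectors of the same class."""
    dim = len(encoded[0])
    model = [[0.0] * dim for _ in range(n_classes)]
    for hv, label in zip(encoded, labels):
        for j, h in enumerate(hv):
            model[label][j] += h
    return model


def retrain(model, encoded, labels, learning_rate=1.0, iterations=5):
    """Add each misclassified hypervector to the correct row, subtract from the wrong one."""
    for _ in range(iterations):
        for hv, label in zip(encoded, labels):
            guess = predict(model, hv)
            if guess != label:
                for j, h in enumerate(hv):
                    model[label][j] += learning_rate * h
                    model[guess][j] -= learning_rate * h
    return model


# Toy data: four encoded datapoints in two classes.
encoded = [[1, 1, -1, -1], [1, 1, 1, -1], [-1, -1, 1, 1], [-1, 1, 1, 1]]
labels = [0, 0, 1, 1]

model = retrain(initial_training(encoded, labels, n_classes=2), encoded, labels)
predictions = [predict(model, hv) for hv in encoded]
```

On this toy data the one-shot model already classifies every training point correctly, so the retraining loop leaves it unchanged; on real data the loop only touches the two rows involved in each misclassification.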
II-D Inference
During inference, we predict the class to which a datapoint belongs. The datapoint is encoded as described in Section II-A, and then compared to all the class hypervectors. The algorithm then predicts the class with the largest cosine similarity.
II-E Binarization
So far, the algorithm we described produces a trained model with non-binarized elements stored as floating-point values. Many existing HD computing methods [rahimi2017high, imani2017exploring, rahimi2017hyperdimensional2] binarize the class hypervectors to eliminate the costly cosine operations used for the associative search. Binary hypervectors do not provide sufficiently high classification accuracies on many (if not most) real-world applications. The usual way of binarizing class hypervectors is making all positive values equal to $+1$ and all negative values equal to $-1$. This method suffers from a significant loss of information about the trained model. To the best of our knowledge, [QuantHD] was the first publication demonstrating a method of achieving high classification accuracy while using a binarized (or quantized) HD model. Instead of just "blindly" binarizing the class hypervectors after every retraining iteration, QuantHD trains the model in a way that is optimized for binarized hypervectors. That is, during every single retraining iteration, an additional binarized model is created. Doing so requires no additional computational power, as the binary representation of numbers in usual computer architectures reserves the first bit for the sign ($0$ stands for positive, $1$ for negative). The QuantHD algorithm retrains on the predictions of the binarized model, while updating the non-binarized model as described in subsection II-C and Figure 3. After several iterations, the binarized model achieves very high classification accuracies, significantly higher than they would be without binary-optimized retraining.
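As an illustration of why binarized models are cheap to query, the sketch below binarizes class hypervectors by sign, packs them into integers, and compares them with XOR followed by a population count. It is a sketch under assumptions: zero entries are mapped to -1 (the text does not specify this case), and all function names are hypothetical.

```python
def binarize_sign(class_hv):
    """Deterministic binarization: +1 for positive entries, -1 otherwise."""
    return [1 if x > 0 else -1 for x in class_hv]


def pack_bits(bipolar):
    """Pack a bipolar vector into an int; bit j is set when entry j is +1."""
    word = 0
    for j, x in enumerate(bipolar):
        if x > 0:
            word |= 1 << j
    return word


def hamming_distance(word_a, word_b):
    """Count differing bits with XOR followed by a population count."""
    return bin(word_a ^ word_b).count("1")


def predict_binary(binary_model, query_word):
    """Class whose packed hypervector has the smallest Hamming distance to the query."""
    return min(range(len(binary_model)),
               key=lambda c: hamming_distance(binary_model[c], query_word))


# Toy non-binarized model with two classes and four dimensions.
model = [[2.5, -0.5, 3.0, -4.0], [-1.0, 2.0, -3.5, 0.5]]
binary_model = [pack_bits(binarize_sign(c)) for c in model]

query = pack_bits([1, -1, 1, 1])
predicted = predict_binary(binary_model, query)
```

The associative search then needs only bitwise operations per class, instead of the multiplications, additions and square roots required by the cosine similarity.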
III Stochastic binarization
Motivation: In the field of quantum information theory, a quantum bit (or qubit) is the fundamental unit of quantum information. A qubit is the quantum equivalent of the classical binary bit. It is a two-level quantum-mechanical system with two possible basis states. An example of this is the spin of a particle in a magnetic field, in which the two states can be taken as spin up and spin down. In a classical system, the bit would have to be either in one state or the other (either logical $0$ or $1$). A quantum system can be in any superposition of those two states. When measuring the $z$ component of the spin of a particle, which (in normalized units) can have any value in the range $[-1, 1]$, the quantum state is going to collapse into either the spin-up or the spin-down state, with a probability directly related to the $z$ component before the measurement.
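This leaves us with a very interesting property: there exists a "true" z-component of the state, which is equal to the projection of the quantum state on the z-axis. This gave us the following idea: why not use the quantum measurement technique for binarizing the HD model? Doing so would allow us to have a binarized model whose expected value would be equal to the non-binarized model. In other words, $\mathbb{E}[\text{binarized model}] = \text{non-binarized model}$. The QuantHD algorithm uses the following (very simple) binarization function:
$$\operatorname{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{otherwise} \end{cases} \tag{2}$$
We propose using the following method instead:
$$q(x) = \begin{cases} +1 & \text{if } x > b \\ -1 & \text{if } x < -b \\ +1 \text{ with probability } \tfrac{1}{2}\left(1 + \tfrac{x}{b}\right), \; -1 \text{ otherwise} & \text{if } |x| \le b \end{cases} \tag{3}$$
where $b$ is the cutoff value defined as a fixed fraction of the standard deviation $\sigma$ of the data. It is discussed in greater detail in subsection IV-B. The advantage of doing so is the fact that the expected value of $q(x)$ for $|x| \le b$ is proportional to $x$: $\mathbb{E}[q(x)] = x/b$.
IV Proposed QubitHD algorithm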
Motivation
The QuantHD algorithm still leaves us with a significant gap between the maximum classification accuracy of the floating-point model and the binarized one. Also, the QuantHD retraining method described in [QuantHD] tends to get "stuck" in local minima. Further, their algorithm almost doubles the average convergence time, which in turn increases the energy consumption during training.
To summarize, here are the main problems with the QuantHD algorithm, which the QubitHD algorithm can either solve or improve:

There still exists a significant gap between the accuracy of the binarized model and that of the non-binarized model.

The algorithm can sometimes get stuck in local minima, which makes it unreliable.

The convergence of the QuantHD algorithm is almost twice as slow as that of other state-of-the-art HD computing algorithms with retraining.
IV-A Overview of the QubitHD algorithm
In this section, we present the QubitHD algorithm. It enables efficient binarization of the HD model with only a minor impact on the classification accuracy. The algorithm is based on QuantHD and consists of four main steps:

Binarized retraining: The retraining steps compensate for the classification accuracy loss during the previous step. Binarization through Equation (3) is performed on the model after every single retraining iteration. This ensures a fast convergence towards a consistently high classification accuracy of the ML model.
IV-B Framework of the QubitHD algorithm
In this section, we present the QubitHD algorithm. It enables efficient binarization of the HD model with minor impact on the classification accuracy. The algorithm is based on QuantHD and consists of four main steps:
2) Initial training: QubitHD trains the class hypervectors by summing all the encoded datapoints corresponding to the same class, as seen in Figure 3a.
It is evident from Figure 3a that every accumulated hypervector represents one class. As explained in [QuantHD], in an application with $K$ classes, the initially trained HD model contains $K$ non-binarized hypervectors, one per class.
3) Stochastic binarization: This part is the main change with respect to the QuantHD algorithm. A given class hypervector is created by summing together random hypervectors of the type $\{-1, +1\}^D$. Every element in a class hypervector (of class $i$) is the product of a "random walk". In other words, its distribution follows a binomial distribution with a probability mass function (pmf):
$$P(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k} \tag{4}$$
where $p = 1/2$ (as we have equal probabilities for $-1$ and $+1$), $n$ is the number of randomly summed hypervectors for class $i$, and $k$ indexes the possible values the element can take. Note that $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$.
For $\pm 1$ steps, the expected value of each element is zero, while the standard deviation is $\sigma = \sqrt{n}$.
Assuming that the number of hypervectors corresponding to every class in the dataset is large enough, the normal distribution $\mathcal{N}(\mu, \sigma^2)$ is a good approximation for modeling the binomial distribution.
In previous publications [QuantHD], the way of binarizing a model was described by Equation 2. We instead propose using Equation 3, shown in Figure 4. Implementing this change requires almost no additional resources (the random flips have to be performed only once per retraining round), but leads to significant improvements in terms of accuracy, reliability, speed and energy efficiency. The accuracy improvement is due to the fact that the expected value of this stochastically binarized model is equal to the non-binarized model. The reliability and speed improvements are due to the fact that the model quickly "jumps" out of local minima, as opposed to getting stuck for several iterations. The energy consumption during training depends on the number of retraining iterations, which is significantly reduced.
As we have just demonstrated, the encoded data we are processing is (approximately) normally distributed. In order to be able to use Equation 3 for the binarization process, we have to define a "cutoff" value $b$. That is, everything above $b$ will become $+1$, and everything below $-b$ will become $-1$. Only values between $-b$ and $b$ will be better approximated through Equation 3. In most cases, $b$ has to be smaller than the standard deviation $\sigma$ of the data distribution. If we used $b = \sigma$, our model would become almost completely random, as most of the values are contained in the interval $[-\sigma, \sigma]$ (68% to be precise).
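The 68% figure can be verified directly on the random-walk model. The sketch below is an illustration under the assumption that each class-hypervector element is a sum of `n` independent, equally likely -1/+1 steps; the function name `mass_within` is hypothetical. It computes the exact probability mass inside a symmetric interval and compares it with the normal approximation.

```python
import math


def mass_within(n, cutoff):
    """Exact P(|S| <= cutoff) for S = sum of n independent, equally likely -1/+1 steps."""
    total = 0
    for m in range(n + 1):  # m = number of +1 steps, so S = 2m - n
        if abs(2 * m - n) <= cutoff:
            total += math.comb(n, m)
    return total / 2 ** n


n = 10_000             # assumed number of summed hypervectors
sigma = math.sqrt(n)   # standard deviation of the sum of n +/-1 steps

within_one_sigma = mass_within(n, sigma)          # fraction randomized if cutoff = sigma
within_half_sigma = mass_within(n, 0.5 * sigma)   # fraction randomized if cutoff = sigma / 2

# Normal approximation: P(|Z| <= 1) for a standard normal variable Z.
normal_one_sigma = math.erf(1 / math.sqrt(2))
```

Choosing a cutoff equal to one standard deviation would therefore subject roughly two thirds of all elements to a random decision, which is why a smaller fraction of the standard deviation is the more sensible choice.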
The reason why the stochastic binarization works better than the sign binarization lies in the fact that the expected value of the binarized model is equal (up to the scaling by the cutoff) to the actual values in the non-binarized model, with the exception of values below $-b$ and above $b$. Empirically, we also noticed that the randomness of the stochastic binarization prevents the algorithm from getting stuck in local minima during training, which reduces the convergence time by 50% on average.
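The behaviour described above can be sketched as follows. This is one possible reading of the stochastic binarization, assuming outputs in {-1, +1} and a probability of +1 that grows linearly from 0 at the negative cutoff to 1 at the positive cutoff; the names `binarize_stochastic`, `expected_value` and `cutoff` are hypothetical.

```python
import random


def binarize_stochastic(x, cutoff, rng):
    """Return +1 or -1; values inside (-cutoff, cutoff) are decided at random."""
    if x >= cutoff:
        return 1
    if x <= -cutoff:
        return -1
    p_plus = 0.5 * (1.0 + x / cutoff)  # probability of +1, linear in x
    return 1 if rng.random() < p_plus else -1


def expected_value(x, cutoff):
    """Analytic mean of binarize_stochastic for a single element."""
    clipped = max(-cutoff, min(cutoff, x))
    return clipped / cutoff


rng = random.Random(42)  # fixed seed, for reproducibility of the illustration only
cutoff = 2.0
x = 0.5

samples = [binarize_stochastic(x, cutoff, rng) for _ in range(20_000)]
empirical_mean = sum(samples) / len(samples)  # close to x / cutoff = 0.25
```

A deterministic sign function would map `x = 0.5` to +1 every time, discarding the information that the value is small; the stochastic variant preserves it on average, since the mean of the binarized element is proportional to the original value.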
What if we use an encoding method that allows the hypervectors of the encoded datapoints to take floating-point values, as opposed to $\pm 1$ values only? In this case (due to randomness), it is to be assumed that we are sampling from an approximately uniform distribution. This assumption is only reasonable if the encoding scheme is also based on random base hypervectors. The resulting probability distribution will not be a discrete binomial distribution, but rather a slightly modified version of the continuous Irwin-Hall distribution [hall1927distribution], given by the probability density function (pdf):
(5) 
where $n$ is the number of uniform intervals summed together. Just as with the binomial distribution, the Irwin-Hall distribution converges towards a normal distribution for large $n$. This was to be expected because of the central limit theorem. Therefore, even in the case of floating-point numbers in the encoding, we can use the same QubitHD technique.
V Possible FPGA Implementation
It is known that HD computing-based machine learning algorithms can be implemented on a wide range of hardware platforms, such as CPUs, GPUs and FPGAs. As most of the training and all of the inference rely on bitwise operations, it was proposed in [QuantHD] that FPGAs would be a suitable candidate for efficient hardware acceleration. The same hardware can be used for implementing both the QuantHD and the QubitHD algorithm. This is also one of the major advantages of QubitHD, as it does not require costly hardware upgrades from previous models.
VI Evaluation
VI-A Experimental Setup
The training and inference of the algorithm were implemented and verified using Verilog, and the code was synthesized on the Kintex-7 FPGA KC705 Evaluation Kit. The Vivado XPower tool has been used to estimate the device power. Additionally, for testing purposes, all parts of the QubitHD algorithm have been implemented on a CPU. We also implemented the algorithm on an embedded device (Raspberry Pi 3) with an ARM Cortex-A53 CPU. For the purposes of making an accurate and fair comparison, we use the following FPGA-implemented algorithms as baselines:
The QuantHD algorithm from [QuantHD], on which QubitHD is based

Other state-of-the-art HD computing-based machine learning algorithms [ISLPED, imani2019adapthd, morris2019comphd]

A multilayer perceptron (MLP) [sharma2016high] (see Table III)

A binary neural network (BNN) [umuroglu2017finn] (see Table III)
To show the advantage of QubitHD over the previous QuantHD algorithm [QuantHD], we used the datasets summarized in Table I. They range from small datasets like UCIHAR and ISOLET (frequently used in IoT devices, for which QubitHD is specially created) to larger datasets like face recognition.
Dataset | Features | Classes | Data Size | Train Size | Test Size | Description
ISOLET | 617 | 26 | 19MB | 6,238 | 1,559 | Voice recognition [Isolet]
UCIHAR | 561 | 12 | 10MB | 6,213 | 1,554 | Activity recognition (mobile) [anguita2012human]
MNIST | 784 | 10 | 220MB | 60,000 | 10,000 | Handwritten digits [lecun2010mnist]
FACE | 608 | 2 | 1.3GB | 522,441 | 2,494 | Face recognition [kim2017orchard]
EXTRA | 225 | 4 | 140MB | 146,869 | 16,343 | Phone position recognition [vaizman2017recognizing]
VI-B Accuracy
The baseline HD model provides high classification accuracy when using non-binarized hypervectors for classification. The problem, however, is that retraining and inference with non-binary class hypervectors are very costly, as they require calculating cosine similarities. (Footnote 1: The cosine similarity of vectors $A$ and $B$ is calculated as $\cos(A, B) = \frac{A \cdot B}{|A|\,|B|}$, where $A \cdot B$ is the dot product and $|A|$ is the absolute value of the vector.) That is, for integer or floating-point elements, a large number of basic arithmetic operations needs to be performed at every step. These are costly and impractical on small-scale and battery-powered devices.
Similarly, during inference, the associative search between query and trained model requires the calculation of the costly cosine similarities. To address this issue, many HD computing-based machine learning algorithms binarize their models [ISLPED]. That way, the cosine similarity is replaced by a simple Hamming distance similarity check. The key problem with this approach is that it leads to a significant decrease in classification accuracy, as shown in Table II.
The authors of [QuantHD] already showed the existence of a partial solution to this problem, which involves retraining the non-binarized model while simultaneously updating the binarized model. We already listed the problems with this model in Section IV. What especially motivated us to create the more stable and reliable QubitHD algorithm is the fact that the retraining of the QuantHD algorithm is unstable and unreliable. After extensively testing the QubitHD algorithm, we conclude that, on average, it closes the gap in classification accuracy by 38.8% as compared to the baseline HD computing-based machine learning algorithms in [ISLPED] using the QuantHD framework (see Table II).
Additionally, we observe that the accuracies of the QubitHD algorithm, using a binary model, are 1.2% and 60% higher than the classification accuracies of the baseline HD computing-based algorithm using non-quantized and binary models, respectively.
VI-C Training Efficiency
The training efficiency of HD-based algorithms is characterised by an initial training of the model and subsequent retraining.

Algorithms of this type all consume the same energy during the generation of the initial training model.

The significant cost is the retraining: compared to the non-binarized model, QuantHD uses fewer operations when calculating the hypervector similarities (step 4 in Figure 3).

No complex cosine similarity has to be computed, as calculating the Hamming distance is sufficient to determine whether a classification was correct.

The improvement of QubitHD lies in the faster convergence to a high classification accuracy, which is 30 to 50% faster than in QuantHD and proportionally decreases the energy consumption after the initial training.

The QubitHD modification has a dual benefit. It makes it possible to save energy and time during training, whilst achieving the same or better classification accuracies during testing, depending on whether the goal is rapid convergence or high classification accuracy.
VI-D Inference Efficiency
Compared to QuantHD, there is no gain or loss in execution time or energy efficiency, since the models behave identically once they are trained. We therefore report the same 44× energy-efficiency improvement and 5× speedup as compared to the non-binarized HD algorithm.
VI-E QubitHD comparison with MLP and BNN
QubitHD is a classifier intended to run on low-powered devices, designed specifically with low energy consumption and fast, efficient execution in mind. As such, we set out to compare QubitHD not only to QuantHD, but also to other non-HD lightweight classifiers. In our analysis we compared the accuracy and efficiency of QubitHD with state-of-the-art lightweight classifiers, including the Multi-Layer Perceptron (MLP) and the Binarized Neural Network (BNN). For MLP and BNN, we aimed at using the same metric as employed in [umuroglu2017finn], with a small modification in the input and output layers in order to run different applications. The results, presented in Table III, indicate that QubitHD, while having classification accuracies similar to those of very lightweight BNN and MLP classifiers, drastically reduces CPU usage during training and execution time during inference. In particular, compared to MLPs, QubitHD uses 12× less CPU time during training and is 84× faster during inference on FPGAs. In comparison with BNNs, QubitHD uses a factor of 101 less CPU time during training and is still 20× faster during inference.
VII Conclusion
Machine learning algorithms based on brain-inspired Hyperdimensional (HD) computing imitate cognition by exploiting statistical properties of very high-dimensional vector spaces. They are a promising solution for energy-efficient classification tasks. A weakness of existing HD computing-based ML algorithms is the fact that they have to be binarized to achieve very high energy efficiency. At the same time, binarized models reach lower classification accuracies. In order to solve the problem of the trade-off between energy efficiency and classification accuracy, we propose the QubitHD algorithm. With QubitHD, it is possible to use binarized HD computing-based ML algorithms which provide virtually the same classification accuracies as their non-binarized counterparts. The algorithm is inspired by stochastic quantum state measurement techniques. The improvement of QubitHD is twofold: it is reflected in the quicker convergence and in the higher and more stable classification accuracy achieved, as compared to QuantHD.
Our main contributions are:

The FPGA implementation of QubitHD provides on average a 65% improvement in terms of energy efficiency, and a 95% improvement in terms of training time, as compared to the most recent state-of-the-art HD computing-based machine learning algorithm QuantHD [QuantHD].

When compared with state-of-the-art low-cost classifiers like Binarized Neural Networks (BNN) and Multi-Layer Perceptrons (MLP), QubitHD offers a similar classification accuracy whilst reducing training time and allowing for faster inference when testing.

QubitHD decreases the classification accuracy gap between state-of-the-art binarized and non-binarized HD models by almost half.

QubitHD converges faster on average during training, thus significantly decreasing the energy consumption.
References
Appendix A: Quantum bit (Qubit) measurements
In the field of quantum information theory, a quantum bit (or qubit) is the fundamental unit of quantum information. A qubit is the quantum equivalent of the classical binary bit. It is a two-level quantum-mechanical system with two possible basis states (usually corresponding to two distinct energy levels). An example of this is the spin of a particle in a magnetic field, in which the two states can be taken as spin up and spin down. Another example could be an atom which can be in the ground state (with low energy) and in the excited state (with a higher energy level). In a classical system, the bit would have to be either in one state or the other (either logical $0$ or $1$). A quantum system can be in a superposition of those two states. Let us define two possible states of the system as $|0\rangle$ and $|1\rangle$. An arbitrary quantum state can be in a superposition of these two states: $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where $\alpha, \beta \in \mathbb{C}$. For the sake of completeness, let us also define the counterpart of $|\psi\rangle$, which is $\langle\psi| = (|\psi\rangle)^{\dagger}$. $|\psi\rangle$ is called the ket-notation, while $\langle\psi|$ is called the bra-notation. As a quantum state is directly related to the probability of finding the system in a certain state after measurement, every quantum state has to be normalized. That is, $\langle\psi|\psi\rangle = |\alpha|^2 + |\beta|^2 = 1$.
It might seem as if there were four degrees of freedom in $|\psi\rangle$, since there are two complex numbers with two degrees of freedom each. However, one degree of freedom is removed by the normalization constraint $|\alpha|^2 + |\beta|^2 = 1$, and another is removed by the fact that the arbitrary overall phase of the quantum state does not matter, as it has no influence on any physical observables in the one-qubit case. Hence, we can represent the state of the quantum bit as a point on a sphere of radius $1$. This is the so-called Bloch sphere, displayed in Figure 7.
For simplicity, we introduce new coordinates for representing the quantum state:
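$$|\psi\rangle = \cos\left(\frac{\theta}{2}\right)|0\rangle + e^{i\varphi}\sin\left(\frac{\theta}{2}\right)|1\rangle \tag{6}$$
Here $\theta$ and $\varphi$ are the polar coordinates.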
While an ordinary (classical) bit can store only one "piece of information", a quantum state is much more powerful, as it can be in any state with $0 \le \theta \le \pi$ and $0 \le \varphi < 2\pi$. The problem with quantum states is the fact that we cannot directly measure or observe which state they are in. Whenever we attempt to observe which state a qubit is in, it is going to collapse into a (classical) bit state. In the Bloch representation, whenever we attempt to measure the z-component of the qubit, it will either collapse into the $|0\rangle$ state or into the $|1\rangle$ state. The probability of it collapsing into the $|0\rangle$ state is $\cos^2(\theta/2)$. Analogously, the probability of it collapsing into the $|1\rangle$ state is $\sin^2(\theta/2)$.
This leaves us with a very interesting property: there exists a "true" z-component of the state, which is equal to the projection of the quantum state on the z-axis: $z = \cos\theta$. This gave us the following idea: why not use the quantum measurement technique for binarizing the HD model? Doing so would allow us to have a binarized model whose expected value would be equal to the non-binarized model. In other words, $\mathbb{E}[\text{binarized model}] = \text{non-binarized model}$. This is the reason why we use Eq. 3 for binarization, as opposed to Eq. 2.
It is similar to the property we observe when measuring a quantum state in the $z$ direction. The probability of seeing one of the two possible measurement outcomes ($|0\rangle$ and $|1\rangle$) is determined by the projection of the quantum state on the z-axis, which we cannot measure directly.
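The link between the measurement statistics and the z-projection can be made explicit with a short derivation. It is a sketch under two assumptions: the standard Bloch-sphere parametrization with polar angle $\theta$, and the convention (introduced here for illustration) that the outcome $|0\rangle$ is assigned the value $+1$ and the outcome $|1\rangle$ the value $-1$.

```latex
\begin{align}
  P(|0\rangle) &= \cos^2\!\left(\tfrac{\theta}{2}\right) = \tfrac{1}{2}\,(1 + \cos\theta), \\
  P(|1\rangle) &= \sin^2\!\left(\tfrac{\theta}{2}\right) = \tfrac{1}{2}\,(1 - \cos\theta), \\
  \mathbb{E}[\text{outcome}] &= (+1)\,P(|0\rangle) + (-1)\,P(|1\rangle)
    = \cos^2\!\left(\tfrac{\theta}{2}\right) - \sin^2\!\left(\tfrac{\theta}{2}\right)
    = \cos\theta = z .
\end{align}
```

Although a single measurement only ever returns $+1$ or $-1$, its expected value therefore recovers the continuous z-component. A stochastic binarization that outputs $+1$ with probability $\tfrac{1}{2}(1 + z)$ has the same property, with a suitably normalized model element playing the role of $z$.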