SHEARer: Highly-Efficient Hyperdimensional Computing by Software-Hardware Enabled Multifold Approximation

07/20/2020 ∙ by Behnam Khaleghi, et al. ∙ University of California, San Diego

Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on the evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, accompanied by a so-called bundling procedure that simply adds up the hypervectors to realize the encoding hypervector. Although the operations of HD are highly parallelizable, the massive number of operations hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization to improve the performance and energy consumption of HD computing. We gain insight from a prudent scheme of approximating the hypervectors that, thanks to the inherent error resiliency of HD, has minimal impact on accuracy while providing high prospects for hardware optimization. In contrast to previous works that generate the encoding hypervectors in full precision and then quantize them ex post, we compute the encoding hypervectors in an approximate manner that saves a significant amount of resources yet affords high accuracy. We also propose a novel FPGA implementation that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. The FPGA implementation of SHEARer achieves an average throughput boost of 104,904x (15.7x) and energy savings of up to 56,044x (301x) compared to state-of-the-art encoding methods implemented on a Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.


1. Introduction

Networked sensors with native computing power – otherwise known as the “internet of things” (IoT) – are a rapidly growing source of data. Applications based on IoT devices typically use machine learning (ML) algorithms to generate useful insights from data. While modern machine learning techniques – in particular deep neural networks (DNNs) – can produce state-of-the-art results, they often entail substantial memory and compute requirements which may exceed the resources available on light-weight edge devices. Thus, there is a pressing need to develop novel machine learning techniques which provide accuracy and flexibility while meeting the tight resource constraints imposed by edge-sensing devices.

Hyperdimensional computing – HD for short – is an emerging paradigm for machine learning based on evidence from the neuroscience community that the brain “computes” on high-dimensional, distributed representations of data (kanerva2009hyperdimensional; masse2009olfactory; turner2008olfactory; wilson2013early; olshausen2004sparse). In HD, the primitive units of computation are high-dimensional vectors of length $D$ sampled randomly from the uniform distribution over the binary cube $\{-1,+1\}^D$. Typical values of $D$ are in the range 5,000–10,000. Because of their high dimensionality, any randomly chosen pair of points will be approximately orthogonal (that is, their inner product will be approximately zero). A useful consequence of this is that sets can be encoded simply by summing (or “bundling”) together their constituent vectors: for any collection of vectors $\{v_1, \ldots, v_k\}$, their element-wise sum $s = \sum_{i=1}^{k} v_i$ is, in expectation, closer to each of $v_1, \ldots, v_k$ than to any other randomly chosen vector in the space.
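A tiny numpy demonstration (ours, not the authors' code) of the two facts used above: randomly drawn bipolar hypervectors are nearly orthogonal, and a bundled sum remains far more similar to its constituents than to an unrelated hypervector. The dimensionality and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                  # hypervector dimensionality

a, b = rng.choice([-1, 1], size=(2, D))     # two random bipolar hypervectors
print(a @ b / D)                            # ~0: approximately orthogonal

vs = rng.choice([-1, 1], size=(5, D))       # a small set to bundle
s = vs.sum(axis=0)                          # element-wise sum ("bundling")
unrelated = rng.choice([-1, 1], size=D)
print(vs @ s / D)                           # each constituent: clearly positive (~1)
print(unrelated @ s / D)                    # an unrelated vector: ~0
```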

Given HD representations of data, this provides a simple classification scheme: we simply take the data points corresponding to a particular class and superimpose them into a single representation for the set. Then, given a new piece of data for which the correct class label is unknown, we compute its similarity with the hypervectors representing each class and return the label corresponding to the most similar one. More formally, suppose we are given a set of labeled data $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ corresponds to an observation in low-dimensional space and $y_i \in \{1, \ldots, K\}$ is a categorical variable indicating the class to which a particular $x_i$ belongs. In general, HD classification proceeds by generating a set of $K$ “class hypervectors” which represent the training data corresponding to each class. Then, given a piece of data for which we do not know the correct label – the “query” – we simply compute the similarity between the query and each class hypervector and return the label corresponding to the most similar. This process is illustrated in Figure 1.

Suppose we wish to generate the class hypervector corresponding to some class $\ell$. The prototype can be generated simply by superimposing (also called “bundling” in the literature) the HD-encoded representations of the training data belonging to that particular class (plate1995holographic; kanerva2009hyperdimensional):

(1)  $C_\ell = \sum_{i \,:\, y_i = \ell} \mathcal{H}(x_i)$

where $\mathcal{H}$ is some encoding function which maps a low-dimensional signal to a binary HD representation. Then, given some piece of “query” data $q$ for which we do not know the correct label, we simply return the predicted label as:

(2)  $\hat{y} = \arg\max_{\ell} \; \delta\big(\mathcal{H}(q),\, C_\ell\big)$

where $\delta$ is an appropriate similarity metric. Common choices for $\delta$ include the inner product/cosine similarity – appropriate for integer- or real-valued encoding schemes – and the Hamming distance – appropriate for binary HD representations. This phase is commonly referred to in the literature as “associative search”. Despite the simplicity of this “learning” scheme, HD computing has been successfully applied to a number of practical problems in the literature, ranging from optimizing the performance of web browsers (wan2012web) to DNA sequence alignment (kim2020genie; imani2018hdna), bio-signal processing (rahimi2018efficient; fatemeh2020embc), robotics (mitrokhin2019learning; neubert2019introduction), and privacy-preserving federated learning (imani2019framework; behnam2020hd).
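The scheme of Equations (1) and (2) fits in a few lines of Python. In this sketch the `encode` callback stands in for the encoding function of Section 2.1, labels are assumed to be 0-indexed integers, and cosine similarity is used as the metric $\delta$; all names are illustrative.

```python
import numpy as np

def train_class_hypervectors(X, y, encode, num_classes):
    """Eq. (1): bundle the encodings of all training points of each class."""
    D = encode(X[0]).shape[0]
    C = np.zeros((num_classes, D))
    for x, label in zip(X, y):
        C[label] += encode(x)
    return C

def predict(q, C, encode):
    """Eq. (2): return the label of the most similar class hypervector
    (cosine similarity here; Hamming distance for binary representations)."""
    h = encode(q)
    sims = (C @ h) / (np.linalg.norm(C, axis=1) * np.linalg.norm(h) + 1e-12)
    return int(np.argmax(sims))
```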

The primary appeal of HD computing lies in its amenability to implementation in modern hardware accelerators. Because the HD representations (e.g., the class hypervectors $C_\ell$ and encodings $\mathcal{H}(x)$) are simply long vectors of low-precision (often Boolean) elements, they can be processed extremely efficiently on highly parallel platforms like GPUs, FPGAs, and PIM architectures. The principal challenge of HD computing – and the focus of this paper – lies in designing good encoding schemes which (1) represent the data in a format suitable for learning and (2) are efficient to implement in hardware. In general, the encoding phase is the most expensive stage in the HD learning pipeline – in some cases taking substantially longer than training or prediction (imani2019bric). Existing encoding methods generate the encoding hypervectors in full integer precision and only then quantize them ex post (e.g., to binary). While this accelerates the associative search phase, it does not address encoding, which is the primary source of inefficiency.

In this work, we propose novel techniques to compute the encodings in an approximate manner that saves a substantial amount of resources with an insignificant impact on accuracy. Of independent interest is our novel FPGA implementation that achieves striking performance through massive parallelism with low power consumption. Approximate encodings require models to be trained in a similarly approximate fashion, so we also develop a software emulation that enables users to train the desired HD models. Our software framework enables users to explore the trade-off between the degree of approximation, accuracy, and resource utilization (hence power consumption) by generating a pre-compiled library that correlates approximation schemes with FPGA resource utilization and power consumption. We show our procedure leads to a performance improvement of 104,904× (15.7×) and energy savings of up to 56,044× (301×) compared to state-of-the-art encoding methods implemented on a Raspberry Pi 3 (GeForce GTX 1080 Ti).

Figure 1. Encoding and training in HD.

2. Background and Motivation

2.1. HD Encoding Algorithms

The literature has proposed a number of encoding methods for the multitude of data types which arise in practical learning settings. We here focus on a method from (plate1995holographic; kanerva2009hyperdimensional; rahimi2016robust) which we refer to as “ID-vector” based encoding. This encoding method is widely used (see for instance: (imani2017voicehd; rahimi2016robust; imani2019sparsehd; rahimi2018efficient)) and works well on both discrete and continuous data. We focus the discussion on continuous data as discrete data is a simple extension.

Suppose we wish to encode some set of vectors $\{x_i\}$, where each $x_i$ is supported on some compact subset of $\mathbb{R}^n$. To begin, we first quantize the domain of each feature into a set of $L$ discrete values $\{q_1, \ldots, q_L\}$ and assign each a codeword $\vec{V}_j \in \{-1,+1\}^D$. To preserve the ordinal relationship between the quantizer bins (the $q_j$), we wish the similarity between the codewords to be inversely proportional to the distance between the corresponding quantization bins; e.g., $\delta(\vec{V}_i, \vec{V}_j)$ should decrease as $|i - j|$ grows. To enforce this property we generate the codeword corresponding to the minimal quantizer bin, $\vec{V}_1$, by sampling randomly from $\{-1,+1\}^D$. The codeword for the second bin is generated by flipping a fixed number of random coordinates in $\vec{V}_1$ (chosen so that roughly $D/2$ coordinates are flipped in total from the first to the last bin). The codeword for the third bin is generated analogously from $\vec{V}_2$, and so on. Thus, the codewords for the minimal and maximal bins are approximately orthogonal, and $\delta(\vec{V}_i, \vec{V}_j)$ decays as $|i - j|$ increases. This scheme is appropriate for quantizers with linearly spaced bins – however, it can be extended to variable bin-width quantizers.

To complete the description of encoding, let $\vec{V}(x_i)$ be a function which returns the appropriate codeword for a component $x_i$ of $x$. Then encoding proceeds as follows:

(3)  $\mathcal{H}(x) = \sum_{i=1}^{n} \vec{ID}_i \otimes \vec{V}(x_i)$

where $\vec{ID}_i$ is a “position hypervector” which encodes the index $i$ of the feature value and $\otimes$ is a “binding” operation which is typically taken to be XOR.
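A sketch of this ID-level encoder over $\{0,1\}$ hypervectors (the hardware-friendly representation where $-1$ is stored as 0 and binding is XOR). The linear quantizer, the per-step flip count of roughly $D/2(L-1)$, and all names are illustrative assumptions consistent with the description above, not the authors' exact implementation.

```python
import numpy as np

def make_level_hypervectors(L, D, rng):
    """L correlated codewords: consecutive levels differ in ~D/(2(L-1)) bits,
    so the first and last levels end up approximately orthogonal."""
    flips_per_step = D // (2 * (L - 1))      # one common choice (assumption)
    levels = [rng.integers(0, 2, size=D, dtype=np.uint8)]
    for _ in range(L - 1):
        nxt = levels[-1].copy()
        idx = rng.choice(D, size=flips_per_step, replace=False)
        nxt[idx] ^= 1
        levels.append(nxt)
    return np.stack(levels)                  # shape (L, D)

def encode(x, levels, ids, lo, hi):
    """Eq. (3): H(x) = sum_i ID_i XOR V(x_i), with a simple linear quantizer."""
    L, D = levels.shape
    q = np.clip(((x - lo) / (hi - lo) * (L - 1)).astype(int), 0, L - 1)
    return np.sum(ids ^ levels[q], axis=0)   # integer-valued encoding of length D

rng = np.random.default_rng(0)
n, L, D = 617, 16, 2048
ids = rng.integers(0, 2, size=(n, D), dtype=np.uint8)   # one ID per feature index
levels = make_level_hypervectors(L, D, rng)
h = encode(rng.random(n), levels, ids, lo=0.0, hi=1.0)
```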

2.2. Motivation

While the basic operations of HD are simple, they are numerous due to its high-dimensional nature. Prior work has proposed varied algorithmic and hardware innovations to tackle the computational challenges of HD. Acceleration in hardware has typically focused on FPGAs (salamat2019f5; imani2019quanthd; schmuck2019hardware) or ASIC-like accelerators (imani2019binary; datta2019programmable). FPGA-based implementations provide a high degree of parallelism and bit-level granularity of operations that significantly improve the performance and effective utilization of resources. Furthermore, FPGAs are advantageous over more specialized ASICs as they allow easy customization of model parameters such as the lengths of the hypervectors ($D$) and input vectors ($n$), along with the number of quantization levels. This flexibility is important as learning applications are heterogeneous in practice. Accordingly, we here focus on an FPGA-based implementation but emphasize that our techniques are generic and can be integrated with ASIC- (imani2019binary) and processor-based (datta2019programmable) implementations.

Figure 2. (a) Adder-tree and (b) counter-based implementation of popcount. +⃝ denotes add operation.

As noted in the preceding section, the element-wise sum is a critical operation in the encoding pipeline. Thus, popcount operations play a critical role in determining the efficiency of HD computing. Figure 2(a) shows a popular tree-based implementation of popcount that adds $n$ binary bits (note that we can represent the $-1$s by 0s in the hardware). Each six-input look-up table (LUT-6) of conventional FPGAs consists of two LUT-5s. Hence, we can implement the first stage of the tree using $\frac{n}{3}$ three-input one-bit adders, each fitting in a single LUT-6 (one LUT-5 for the sum and one for the carry). Each subsequent stage comprises two-port $k$-bit adders, where $k$ increases by one at each stage while the number of adders per stage decreases by a factor of two. A $k$-bit adder requires $k$ LUT-6s. Thus, the number of LUT-6s for an $n$-input popcount can be formulated as Equation (4).

(4)  $\mathrm{LUT}_{\mathrm{exact}}(n) = \dfrac{n}{3} + \displaystyle\sum_{i=2}^{\log_2\frac{n}{3}+1} i \cdot \dfrac{n}{3 \cdot 2^{\,i-1}} \;\simeq\; \dfrac{4}{3}\,n$
Figure 3. Our proposed approximate encoding techniques. MAJ and +⃝ denote majority and addition, respectively.

HD operations can be parallelized at the granularity of a single coordinate in each hypervector: all dimensions of the encoding hypervector and of the associative search can be computed in parallel. Nonetheless, Equation (4) reveals that the popcount module for a popular benchmark dataset (isolet), with $n = 617$ features per input, requires about 820 LUTs. This limits a mid-size low-power FPGA with 50K LUTs (xilinx7) to generating only about 60 encoding dimensions per cycle (out of the $D$ dimensions, e.g., 2,560 for this benchmark).
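To make the resource arithmetic concrete, the short Python sketch below reconstructs Equation (4) with per-stage rounding (the rounding policy and function name are ours, not the authors'); it lands near the ~820-LUT figure quoted for the 617-input isolet popcount and the ~60 dimensions/cycle bound on a 50K-LUT device.

```python
import math

def lut_exact(n):
    """Reconstructed Eq. (4): stage 1 uses ~n/3 LUT-6 (3-bit -> 2-bit adders);
    stage i >= 2 uses ~n/(3*2^(i-1)) adders of i bits, at i LUT-6 each."""
    total = math.ceil(n / 3)              # stage 1
    adders, width = math.ceil(n / 6), 2   # stage 2 pairs up the 2-bit partial sums
    while adders >= 1:
        total += adders * width
        if adders == 1:
            break
        adders, width = math.ceil(adders / 2), width + 1
    return total

print(lut_exact(617))             # on the order of 820 LUT-6 (rounding-dependent)
print(50_000 // lut_exact(617))   # ~60 encoding dimensions per cycle on 50K LUTs
```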

To save resources, (schmuck2019hardware) and (imani2019binary) suggest using counters to implement the popcount for each dimension of the encoding, as shown in Figure 2(b). Although this seems more compact, in practice it is less efficient than an adder-tree implementation: the counter-based implementation needs about $\log_2 n$ LUTs per dimension with a per-dimension latency of $n$ cycles, while adder-trees require $\simeq\frac{4}{3}n$ LUTs per dimension with a per-dimension throughput of one cycle, so for a given amount of resources, the conventional adder-tree is more performance-efficient.

Works in (salamat2019f5) and (imani2019quanthd) quantize the dimensions of the encoding and class hypervectors, which eliminates the DSP modules (or large numbers of cascaded LUTs) that are conventionally used for the associative search stage: through quantization, the inner product for cosine similarity is replaced by popcount operations (in the case of binary quantization) or by lower-bit multiplications. The resulting improvement is minor because the quantization is applied after full-bit-width encoding. Furthermore, the multipliers of the associative search stage have input widths of only a few bits – on the order of $\log_2 n$ from the encoding dimensions and a comparable width from the class dimensions – so each one needs roughly the product of the two bit-widths in LUTs. Pessimistically assuming bit-widths of up to 16, an extreme binary quantization can eliminate about 256 LUTs per multiplication. However, the savings are again modest at best in practice on the benchmark dataset mentioned previously. Therefore, in this paper, we target the popcount portion, which contributes the more significant part of the resources. Indeed, ex-post quantization of the encoding hypervectors is orthogonal to our technique and can be applied for further improvement.

3. Proposed Method: SHEAR

3.1. Approximate Encoding

In the previous section, we explained prior work that applies quantization after obtaining the encoding hypervector in full bit-width. As noted there, while this approach is simple, it only accelerates the associative search phase and does not improve encoding, which is often the principal bottleneck. Because the HD representation of data entails substantial redundancy and the information is uniformly distributed over a large number of bits, it is robust to bit-level errors: flipping 10% of hypervector bits shows virtually zero accuracy drop, while a 30% bit-error rate impairs the accuracy by a mere 4% (imani2017exploring). We leverage this resilience to improve resource utilization through approximate encoding, as shown in Figure 3. In the following, we discuss each technique in greater detail and estimate its resource usage.

(1) Local majority. From Equation (4) we can see that the number of resources (in terms of LUT-6) of the exact adder-tree for encoding each dimension depends linearly on the number of data features, $n$. We therefore aim to reduce the number of inputs to the primary adder-tree by sub-sampling with the majority function, so as to shrink the tree inputs while (approximately) extracting the information contained in the input. Note that, here, ‘inputs’ are the binary dimensions of the level hypervectors (see Figure 1 and Figure 2). As shown in Figure 3(a), each LUT-6 is configured to return the majority of its six input bits. When three out of six inputs are 1 (a tie), we break the tie by designating all LUTs that perform the majority function for a specific encoding dimension to deterministically output 0 or 1. We specify this choice randomly for every dimension (i.e., for an entire adder-tree), but it remains fixed for a model during training and inference. We choose groups of six bits because a single LUT-6 can vote over up to six inputs. Using smaller majority groups diminishes the resource saving, especially since taking the majorities adds extra LUTs. Moreover, following the Shannon decomposition, implementing a $(k+1)$-input LUT requires two $k$-input LUTs (and a two-input multiplexer); thus, the number of LUTs for majority groups larger than six inputs grows exponentially.

There are $\frac{n}{6}$ MAJ LUTs in the first stage of Figure 3(a), hence the number of inputs of the subsequent adder-tree reduces to $\frac{n}{6}$. From Equation (4) we also know that an $m$-input adder-tree requires $\simeq\frac{4}{3}m$ LUT-6s. Thus, the design of Figure 3(a) consumes:

(5)  $\mathrm{LUT}_{\mathrm{MAJ}}(n) = \dfrac{n}{6} + \mathrm{LUT}_{\mathrm{exact}}\!\Big(\dfrac{n}{6}\Big) \simeq \dfrac{n}{6} + \dfrac{4}{3}\cdot\dfrac{n}{6} = \dfrac{7}{18}\,n$

This uses about 71% fewer LUT resources than an exact adder-tree.

In (imani2019quanthd), the authors report an average accuracy loss of 1.6% from post-hoc quantizing the encodings to binary. Thus, one might think of repeating the majority function in the subsequent stages to directly obtain one-bit encoding dimensions. Using local majority functions is efficient, but it degrades the encoding quality because majority is not associative; in particular, the MAJ LUTs add another layer of approximation by breaking ties. Thus, a so-called all-MAJ tree causes considerable accuracy loss. Therefore, in our cascaded-MAJ design in Figure 3(b), we limit the MAJ stages to the first two. Our cascaded-MAJ utilizes:

(6)  $\mathrm{LUT}_{\mathrm{MAJ\text{-}2}}(n) = \dfrac{n}{6} + \dfrac{n}{36} + \mathrm{LUT}_{\mathrm{exact}}\!\Big(\dfrac{n}{36}\Big) \simeq \dfrac{n}{6} + \dfrac{n}{36} + \dfrac{4}{3}\cdot\dfrac{n}{36} \approx 0.23\,n$

which saves 82.6% of resources compared to exact encoding. We emphasize that a cascaded all-MAJ popcount needs $\simeq\frac{n}{5}$ LUTs, which saves 85.0% of LUTs; so the two-stage MAJ implementation, with 82.6% resource saving, is nearly optimal, because the first two stages of the exact tree consume the most resources.

(2) Input over-feeding. In Figure 2(a) we can observe that each LUT-5 pair of the first stage computes the two-bit sum of three input bits. Since only three (out of five) inputs are used, these LUTs are left underutilized. With one more input, the output range would be [0, 4], which requires three bits to represent, so we cannot exactly add more than three bits using two LUT-5s. However, instead of using the LUT-5s to carry out regular addition, we can supply a pair of LUT-5s with five inputs to perform a quantized/truncated addition. For actual sums (of the five bits) of 0 or 1, the LUT-5 pair produces 00 (zero); for 2 or 3 it produces 01 (one); and for 4 or 5 it produces 10 (two). That is, one LUT-5 computes the carry-out of the five bits and the other the middle bit of the three-bit sum, which together form the quantized two-bit result. To ensure that the synthesis tool infers a single LUT-6 for each pair, we can directly instantiate LUT primitives. As a LUT-6 comprises a LUT-5 pair (with shared inputs), the number of resources of Figure 3(c) is:

(7)  $\mathrm{LUT}_{\mathrm{of}}(n) = \dfrac{n}{5} + \displaystyle\sum_{i=2}^{\log_2\frac{n}{5}+1} i \cdot \dfrac{n}{5 \cdot 2^{\,i-1}} \;\simeq\; \dfrac{4}{5}\,n = \dfrac{3}{5}\,\mathrm{LUT}_{\mathrm{exact}}(n)$

The first stage encompasses $\frac{n}{5}$ LUT-6s, and each subsequent stage contains $k$-bit adders whose count decreases by $2\times$ at each stage. The total number of LUTs is therefore reduced to $\frac{3}{5}$ of the exact tree (the inverse of the ratio by which the first-stage inputs are over-used). The saving is smaller than with the local-majority approach, but we expect higher accuracy due to the intuitively milder approximation imposed.

(3) Truncated nodes. Out of the $\simeq\frac{4}{3}n$ LUTs used in an exact adder-tree, $\simeq n$ (75%) are used in the intermediate adder units. More precisely, following Equation (4), stages 1–4 of the adder-tree contribute 25%, 25%, 18.75%, and 12.5% of the total resources, respectively. Note that, although the number of adder units halves at each stage, the size of each one increases linearly. We avoid this blowup of adder sizes by truncating the least significant bit (LSB) of each adder. As demonstrated in Figure 3(d), the LSB of the second stage (which is supposed to produce a three-bit output) is discarded. Thus, instead of using two LUT-6s to compute the full three-bit sum $s_2 s_1 s_0$ of two two-bit operands, we can use two LUT-5s (equivalent to one LUT-6) to obtain only $s_2 s_1$, where one LUT-5 computes $s_2$ and the other produces $s_1$ from the four input bits of the two operands. Truncating the output of the second stage consequently decreases the output bit-width of the third stage by one bit, as its inputs have become two bits wide. Thus, we can apply the same LSB truncation to the third stage and implement it using two LUT-5s as well. We can apply the same procedure to all consecutive nodes and implement each of them with only two LUT-5s. The output of the first stage is already two bits, so we do not modify its original implementation.

We apply truncation only to the first stages – in particular because, from Equation (4), we can see that the first five stages contribute about 90% of the adder-tree resources – otherwise the decay in accuracy becomes too severe. Equation (8) characterizes the resource usage of an adder-tree in which the first $m$ stages are implemented using the 2-bit adders shown in Figure 3(d) (counting stage one, which keeps its original exact implementation).

(8)  $\mathrm{LUT}_{\mathrm{tr}}(n, m) = \dfrac{n}{3} + \displaystyle\sum_{i=2}^{m} \dfrac{n}{3 \cdot 2^{\,i-1}} + \displaystyle\sum_{i=m+1}^{\log_2\frac{n}{3}+1} (i - m + 1)\cdot\dfrac{n}{3 \cdot 2^{\,i-1}}$

We can see that for $m = 1$ – i.e., when none of the intermediate stages are truncated – the equation reduces to Equation (4), which equals the resources of an exact adder-tree. Setting $m$ to 2, 3, and 4 achieves 25%, 37.5%, and 43.75% resource saving, respectively.
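The sketch below collects rough LUT-count estimators for Equations (4)–(8) as reconstructed above (continuous counts, no per-stage rounding; all names are ours). It is only meant to show that the savings ratios land within a couple of percent of those quoted in the text and in Table 3; the synthesized counts in Table 2 differ by a few percent more, as the paper itself notes.

```python
def lut_exact(n):
    """Eq. (4), continuous form: n/3 + sum_i i*n/(3*2^(i-1)) ~= 4n/3."""
    total, adders, width = n / 3, n / 6, 2
    while adders >= 1:
        total += adders * width
        adders, width = adders / 2, width + 1
    return total

def lut_maj(n):        # Eq. (5): one 6-input majority stage, then an exact tree
    return n / 6 + lut_exact(n / 6)

def lut_maj2(n):       # Eq. (6): two cascaded majority stages
    return n / 6 + n / 36 + lut_exact(n / 36)

def lut_overfeed(n):   # Eq. (7): 5-input quantizing first stage -> 3/5 of exact
    return 3 / 5 * lut_exact(n)

def lut_truncated(n, m):
    """Eq. (8): stages 2..m are 2-bit truncated nodes (1 LUT-6 per adder);
    later stages grow again, but starting from 2-bit operands."""
    total, adders, stage = n / 3, n / 6, 2      # stage 1 stays exact
    while adders >= 1:
        total += adders * (1 if stage <= m else (stage - m + 1))
        adders, stage = adders / 2, stage + 1
    return total

for name, f in [("exact", lut_exact), ("MAJ", lut_maj), ("MAJ-2", lut_maj2),
                ("over-feed", lut_overfeed),
                ("trunc-3", lambda n: lut_truncated(n, 3)),
                ("trunc-4", lambda n: lut_truncated(n, 4))]:
    est = f(512)
    print(f"{name:9s} {est:6.0f} LUTs  ({1 - est / lut_exact(512):.1%} saving)")
```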

3.2. SHEAR Architecture

Recall from Figure 1 that the HD encoding procedure needs to convert all input features to the equivalent level hypervectors, bind them with the associated ID hypervectors, and bundle (e.g., sum) the resulting hypervectors to generate the final encoding. FPGAs, however, contain limited logic resources as well as on-chip SRAM-based memory blocks (a.k.a. BRAMs) to provide high performance with affordable power. Previous work therefore breaks this step down into multiple cycles, whereby each cycle processes $D_c$ of the $D$ dimensions (salamat2019f5; imani2019sparsehd; imani2019fach). When processing dimensions $k$ to $k + D_c - 1$, those architectures fetch the same dimensions of all $L$ level hypervectors. Each of the $D_c$ adder-trees is augmented with an $L$-to-1 multiplexer on every one of its $n$ input ports, where the multiplexers of the $j$-th adder-tree are connected to the $j$-th fetched dimension of the level hypervectors, and the (quantized) value of the associated feature selects the right level dimension to pass. The advantage of such architectures is that only $L \times D_c$ bits need to be fetched at each cycle. However, they require $n \times D_c$ multiplexers. For a modest $L = 16$, which translates to 16-input multiplexers occupying four LUTs each, the total number of LUTs used for multiplexers will be $4\,n\,D_c$, while the (exact) adder-trees occupy $\frac{4}{3}\,n\,D_c$ (in Equation (4) we showed that an $n$-input exact adder-tree uses $\simeq\frac{4}{3}n$ LUTs). This means that the augmented multiplexers occupy $3\times$ the LUTs of the adder area. In our approximate encoding, this ratio would be even larger as we trim the exact adder. Thus, a multiplexer-based implementation overshadows the gain of approximating the adders, as we would need to preserve the copious multiplexers.

Figure 4. SHEAR datapath abstract.

To address this issue, we propose a novel FPGA implementation that relies on on-chip memories rather than adding extra resources. Figure 4 illustrates an overview of the SHEAR FPGA architecture. At each cycle, we partially process $n_c$ (out of $n$) input features. Our implementation is BRAM-oriented, so each (quantized) feature translates to the address from which the corresponding level hypervector can be read. This entails a dedicated memory-block group for each pair of the $n_c$ features currently being processed. The number of BRAMs in a group is $\lceil L \cdot D / S_{\mathrm{BRAM}} \rceil$, as a group must hold all $L$ level hypervectors of length $D$ bits (a capacity of $L \cdot D$ bits, with $S_{\mathrm{BRAM}}$ denoting the capacity of one BRAM). Therefore, the number of features that can be partially processed in a cycle is limited to $n_c = 2 \cdot \lfloor N_{\mathrm{BRAM}} / \lceil L \cdot D / S_{\mathrm{BRAM}} \rceil \rfloor$. The coefficient 2 arises because the BRAMs have two ports from which we can read independently (that is why, in Figure 4, two features share the same BRAM group). The address translator – “level to address” in Figure 4 – activates only the right BRAM and row of the group, so the other BRAMs do not dissipate dynamic power. Depending on its configuration, each memory block can deliver up to 64 bits per port, as indicated in the figure. Certainly, we could double the number of dimensions processed per cycle by doubling the size of the memory groups, but then $n_c$ – the number of features that can be processed per cycle – halves.

Each fetched level-hypervector bit is XORed with the corresponding bit of the ID (position) hypervector. As detailed in Section 2.1, each feature index is associated with an ID hypervector, which is a randomly chosen (but fixed) hypervector of length $D$. We would thus require additional BRAM blocks to store the $n$ ID hypervectors, which further limits the number of features that can be processed in a cycle due to BRAM shortage. To resolve this, we only store a single ID hypervector (the seed ID) and generate the other ones by rotating the seed ID, i.e., the ID of index $i$ can be obtained by rotating the ID of index 1 (the seed ID) by $i - 1$ positions. This does not affect the HD accuracy, as the resulting ID hypervectors are still i.i.d. and approximately orthogonal. For the first feature, we need to read 64 bits of the seed ID, while for each subsequent feature we need just one more bit, as each ID shares all but one of these bits with its predecessor. Therefore, we need a data-width of $64 + n_c - 1$ for the ID memory, meaning that we need several memory blocks holding copies of the seed ID hypervector. Thus, although the seed ID fits in a single BRAM, the required data-width demands more memory blocks. However, this is still significantly smaller than storing all $n$ different IDs in BRAM blocks, which either releases BRAMs for processing more features or allows power-gating the unused BRAMs. Moreover, using the seed-ID BRAMs also saves dynamic power, as only $64 + n_c - 1$ bits are read per cycle (compared to $64 \cdot n_c$ when storing all-different IDs). It is also noteworthy that at each cycle the first 64 bits read from the ID memory are passed to the first feature of the $n_c$ features currently being processed (i.e., feature 1, $n_c$ + 1, $2 n_c$ + 1, and so on over successive cycles). Similarly, bits 2 to 65 of the fetched ID are passed to the second feature, and so on. Thus, the routing from the ID BRAMs to the processing logic is fixed.

After XORing the fetched level hypervectors with the ID hypervectors, each of the 64 approximate adder-trees adds up $n_c$ binary bits, so the input size of all adders is $n_c$. Since the result is only the sum over the first $n_c$ features, SHEAR utilizes a buffer to store these partial sums. In the next cycle, the procedure repeats for the next group of features, i.e., features $n_c + 1$ to $2 n_c$. Therefore, SHEAR produces 64 encoding dimensions every $\lceil n / n_c \rceil$ cycles, hence the entire encoding hypervector is generated in $\lceil n / n_c \rceil \cdot D / 64$ cycles.

To make these numbers tangible: in the Xilinx FPGAs we use for the experiments, each BRAM provides a 64-bit read port and stores 36 Kb. We also noticed that 16 level hypervectors give the same accuracy as having more, so we set $L = 16$. We also select the hypervector lengths to be multiples of 512. Taking the previously mentioned speech recognition benchmark (isolet) as an example, we observed that $D = 2{,}560$ provides acceptable accuracy (see Section 4 for more details). For this benchmark we thus need a group size of $\lceil 16 \times 2{,}560 / 36\,\mathrm{Kb} \rceil = 2$ BRAMs, where each group can cover two input features. The FPGA we use has a total of 445 BRAMs, which can make at most 222 groups, capable of processing 444 features per cycle. Therefore, we divide the 617 input features of the benchmark into two repeating cycles, using 310 BRAMs (155 BRAM groups) to process the first 310 features in the first cycle and the remaining 307 features in the second cycle, generating 64 encoding dimensions per two cycles. All 64 adder-trees have 1-bit inputs of size 310. The entire encoding thus takes $2 \times 2{,}560 / 64 = 80$ cycles. Note that reading from on-chip BRAMs has just one cycle of latency, and the off-chip memory latency is hidden in the computation pipeline.
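A small calculator reconstructed from the sizing arithmetic above (BRAM capacity, port width, and BRAM count follow the Kintex-7 device described in Section 4; the function name and rounding are ours). It reproduces the isolet figures (2 BRAMs per group, up to 444 features per cycle, 80 cycles per encoding) and the 288-cycle figure reported for the face dataset in Section 4.

```python
import math

def shear_sizing(n_features, D, L=16, bram_bits=36 * 1024, bram_width=64,
                 total_brams=445):
    brams_per_group = math.ceil(L * D / bram_bits)       # one group serves 2 features
    max_groups = total_brams // brams_per_group
    feats_per_cycle = 2 * max_groups                     # two read ports per BRAM
    passes = math.ceil(n_features / feats_per_cycle)     # cycles per 64-dim chunk
    cycles = passes * (D // bram_width)                  # whole encoding hypervector
    return brams_per_group, feats_per_cycle, cycles

print(shear_sizing(617, 2560))   # isolet: (2, 444, 80)
print(shear_sizing(608, 6144))   # face:   3 BRAMs/group -> 288 cycles (cf. Section 4)
```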

3.3. Software Layer

Because of the approximation, the output of encoding and hence the class hypervectors differ from those obtained with exact encoding. Therefore we also need to train the model using the same approximate encoding(s), as the associative search only looks for the similarity (rather than exactness) of an approximately encoded hypervector with the trained class hypervectors – which are made up by bundling a manifold of encoding hypervectors. Our FPGA implementation is tailored for inference, so we carry out the training step on a CPU. We developed an efficient SIMD-vectorized Python implementation to emulate the exact and the proposed encoding techniques in software. The emulation of the proposed techniques is straightforward. For instance, for the local majority approximation (Figure 3(a)), instead of adding up all hypervectors, we divide them into groups of six hypervectors, add up the six hypervectors of each group, and check whether each resulting dimension is larger than three. We also break the ties in software by generating a constant vector dictating how the ties of each dimension should be resolved; this acts as the MAJ LUTs of the first stage. Thereafter, we simply add up all these temporary hypervectors to realize the subsequent exact adders. This guarantees that the software output matches the approximate hardware's, while we also achieve a fast implementation by avoiding unnecessary imitation of the hardware implementation.
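Below is a sketch of that emulation for the one-stage local majority (the other schemes can be emulated analogously). The grouping, the tie-break vector, and the zero-padding of a last partial group are implementation choices of this sketch, not necessarily the authors' exact code.

```python
import numpy as np

def encode_maj(bound, tie_break):
    """bound: (n, D) binary hypervectors (level XOR ID); tie_break: (D,) 0/1 vector
    that plays the role of the per-dimension tie-breaking of the MAJ LUTs."""
    n, D = bound.shape
    pad = (-n) % 6                               # pad so the groups of six divide evenly
    padded = np.vstack([bound, np.zeros((pad, D), dtype=bound.dtype)])
    groups = padded.reshape(-1, 6, D).sum(axis=1)          # per-group popcounts
    votes = np.where(groups == 3, tie_break,               # tie: use the fixed vector
                     (groups > 3).astype(np.uint8))        # otherwise plain majority
    return votes.sum(axis=0)                     # exact adder-tree over the group votes
```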

In addition to that is the dataset’s attribute, , , epochs

(number of training epochs) are the other variables of our software implementation.

is the learning rate of HD. As explained in Section 1, HD bundles all encoding hypervectors belonging to the same-label data to create the initial class hypervectors. In the subsequent epochs iterations, HD updates the class hypervectors by observing if the model correctly predicts the training data. If the model mispredicts an encoded query of label as class , HD updates as shown by Equation (9). If learning rate is not provided, SHEAR finds the best through bisectioning for a certain number of iterations.

(9)
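A minimal sketch of the update in Equation (9); variable names are illustrative.

```python
import numpy as np

def update_on_misprediction(C, h, y, y_hat, alpha):
    """C: (K, D) class hypervectors, h: (D,) encoded query, y: true label,
    y_hat: predicted label, alpha: learning rate."""
    if y_hat != y:
        C[y] += alpha * h          # pull the correct class toward the query
        C[y_hat] -= alpha * h      # push the mispredicted class away from it
    return C
```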

We supply the software implementation of SHEAR with the number of BRAM and LUT resources of the target FPGA so that it can estimate the architectural parameters according to Section 3.2, using the resource utilization formulated in Section 3.1. We have also implemented the exact and approximate adder-trees of different input sizes and interpolated their measured power consumption – which is linear with respect to the adder size – for different average activities of the adders' primary inputs. We then calculate the average signal activity observed by the adders according to the values of the temporarily generated binding hypervectors (level XOR ID). We similarly estimate the toggle rate of the BRAMs according to the consecutive bits read from them. As alluded to earlier, we do not replicate the hardware implementation in software; we just need to determine which BRAM group each fetched level hypervector belongs to (based on the index of the feature), so that we can keep track of toggle rates. Using this signal information together with an offline activity-to-power look-up table, along with the instantiated resource counts calculated as mentioned, SHEAR estimates, during training, the power consumption of an application targeted for a specific device.

4. Experimental Results

(1) General Setup. We have implemented the SHEAR architecture using the Vivado High-Level Synthesis Design Suite on a Xilinx Kintex-7 FPGA KC705 Evaluation Kit, which embraces an XC7K325T device with 203,800 LUT-6s and 445 36 Kb BRAM memory blocks that we use in 64-bit-wide configuration. By pipelining the adder-tree stages we achieve a clock frequency of 200 MHz. We compare the performance and energy results with a high-end NVIDIA GeForce GTX 1080 Ti GPU and a Raspberry Pi 3 embedded processor. We optimize the CUDA implementation by packing the hypervectors into 32-bit integers, so a single logical XOR operation binds 32 dimensions. We use speech (isolet), activity (ucihar), and hand-written digit (lecun1998mnist) recognition, as well as a face detection dataset (griffin2007caltech), as our benchmarks. Table 1 summarizes the length of the hypervectors and the associated accuracy of each dataset in the baseline exact mode. For a fair comparison, we first obtained the accuracies using a large hypervector length, then decreased it until the accuracies remained within 0.5% of the original values. This avoids over-saturated hypervectors, so the accuracy impact of approximation manifests more clearly.

Parameter \ Benchmark  speech  activity  face  digit
Input features (n)  617  561  608  784
Hypervector length (D)  2,560  3,072  6,144  2,048
Baseline accuracy 93.18% 93.91% 95.47% 89.07%
Table 1. Baseline implementation results.
exact MAJ MAJ-2 over-feed truncate
Synthesis 638 183 116 383 340
Equation 675 195 116 405 343
Error 5.8% 6.6% 0.0% 5.7% 0.9%
Table 2. LUT count for a 512-input adder-tree.

(2) Resource Utilization. To validate the efficiency of the proposed approximation techniques, in addition to the holistic high-level performance and energy comparisons, we examine them by synthesizing a 512-input adder-tree. Table 2 presents the LUT utilization of the adder implemented in exact and approximate modes. MAJ, MAJ-2, over-feed, and truncate refer to the designs of Figure 3(a)–(d). It can be seen that our equations in Section 3.1 have a modest average error of 3.8%. In particular, they over-estimate the LUT count of both the exact and the approximate adders, so the estimated resource savings remain close to the predicted values. For instance, the synthesis results indicate that MAJ (MAJ-2) saves 71.3% (81.8%) of LUTs, which is very close to the predicted 71.1% (82.8%).

exact MAJ MAJ-2 over-feed trunc-3 trunc-4
speech 93.2% 0.7% 2.3% 0.8% 0.9% 1.9%
activity 93.9% 0.8% 1.2% 1.3% 1.1% 1.0%
face 95.5% 1.8% 3.3% 1.7% 1.6% 1.9%
digit 89.1% 0.8% 0.3% 1.7% 0.1% 0.1%
average 1.0% 1.8% 1.4% 0.9% 1.2%
LUT saving 0 71.1% 82.8% 40.0% 37.5% 43.8%
Table 3. Accuracies of the SHEAR approximate encodings relative to the exact encoding.

(3) Accuracy. Table 3 summarizes the accuracies of the proposed encodings relative to the exact encoding. The LUT saving, which is dataset-independent, is shown again for comparison. “trunc-3” and “trunc-4” stand for the truncated encoding (Figure 3(d)) where three and four intermediate stages are truncated, respectively. Overall, MAJ encoding (the one-stage local majority shown in Figure 3(a)) achieves acceptable accuracy with significant resource saving, though it is not always the most accurate. For instance, in the face detection benchmark, the over-feed and three-stage truncated encodings offer slightly better accuracy. More interestingly, on the digit recognition dataset, trunc-4 shows a negligible accuracy drop while trunc-3 even improves the accuracy by 0.1%. This can stem from the fact that emulating the hardware approximation in SHEAR's software layer takes a long time for the digit dataset, so we limited the software to trying five different learning rates ($\alpha$) and repeating the entire training only five times with fewer epochs, so the result might be slightly skewed. For the other datasets we conducted the training 25 times, each with 50 epochs, to average out the variance of the results.

Figure 5. Throughput of SHEAR versus Raspberry Pi 3 and Nvidia GTX 1080 Ti. Y-axis is logarithmic scale.

(4) Performance. Figure 5 compares the throughput of the SHEAR FPGA implementation with the Raspberry Pi and the Nvidia GPU. The SHEAR implementation is BRAM-bound, so all the exact and approximate implementations yield the same performance. In Section 3.2 we elaborated that the speech dataset requires two cycles per 64 encoding dimensions. We can similarly show that the activity and digit datasets also need two cycles per 64 dimensions, while face requires three cycles, as its level hypervectors are larger ($D = 6{,}144$) and occupy more BRAMs. In the worst case, SHEAR improves the throughput by 58,333× and 6.7× compared to the Raspberry Pi and GPU implementations, respectively. On average, SHEAR provides a throughput improvement of 104,904× and 15.7× over the Raspberry Pi and GPU, respectively. The substantial improvements arise from the fact that SHEAR adds up a massive number of bits (e.g., 25,000) per cycle while also performing the binding (XOR operations) on the fly. The Raspberry Pi, in contrast, executes sequentially, and its cache cannot fit all the class hypervectors with non-binary dimensions. Note that we assume the dataset is available in the off-chip memory (DRAM) of the FPGA. Otherwise, although the per-sample latency would be affected, the throughput remains the same, as the off-chip memory latency is hidden in the computation cycles.

Figure 6. Energy consumption (Joules) of SHEAR, Raspberry Pi, and GPU for 10 million inferences. Y-axis is logarithmic.

(5) Energy Consumption. Figure 6 compares the energy consumption of the exact and approximate SHEAR implementations with the Raspberry Pi and GPU. We have scaled the energy to 10 million inferences for the sake of illustration (the Y-axis is logarithmic). We used a Hioki 3334 power meter and the NVIDIA system management interface to measure the power consumption of the Raspberry Pi and GPU, respectively, and the Xilinx Power Estimator (XPE) to estimate the FPGA power consumption. The average power of the Raspberry Pi for all datasets hovers around 3.10 Watts, while it is about 120 Watts for the GPU. In the FPGA implementation, the power shows more variation, as the number of active LUTs and BRAMs differs between applications; e.g., the face dataset with two-stage majority encoding (MAJ-2) consumes 3.11 Watts, while the digit recognition dataset in the exact mode consumes 10.80 Watts. The smaller power consumption of face is mainly because of smaller off-chip data transfer, as face has the largest hypervector length and takes 288 cycles to process an entire input, while digit takes only 64 cycles. On average, SHEAR's exact encoding decreases the energy consumption by 45,988× and 247× (average over all datasets) compared to the Raspberry Pi and GPU implementations, respectively. The MAJ-2 encoding of SHEAR consumes the least energy, reducing the energy consumption by 56,044× and 301× compared to the Raspberry Pi and GPU, respectively. Note that the power improvement of the approximate encodings is not proportional to their resource (LUT) utilization, as the BRAM power remains the same for all encodings.

5. Conclusion

In this paper, we leveraged the intrinsic error resiliency of HD computing to develop different approximate encodings with varied accuracy and resource utilization attributes. With a modest accuracy drop, our approximate encoding reduces the LUT utilization by 71.1%. By effectively utilizing the on-chip BRAMs of the FPGA, we also proposed a highly efficient implementation that outperforms an optimized GPU implementation by over 15×, and surpasses the Raspberry Pi 3 by over five orders of magnitude. Our FPGA implementation also consumes moderate power: a minimum of 3.11 Watts for the face detection dataset using approximate encoding, and a maximum of 10.8 Watts on the digit recognition dataset when using exact encoding. Finally, our implementation reduces the energy consumption by 247× (45,988×) compared to the GPU (Raspberry Pi) in exact encoding, which further improves by a factor of about 1.2× when using approximate encoding.

Acknowledgements

This work was supported in part by CRISP, one of six centers in JUMP, an SRC program sponsored by DARPA, in part by SRC Global Research Collaboration (GRC) grant, DARPA HyDDENN grant, and NSF grants #1911095 and #2003279.

References