In-memory hyperdimensional computing

06/04/2019 ∙ by Geethan Karunaratne, et al. ∙ IBM, ETH Zurich

Hyperdimensional computing (HDC) is an emerging computing framework that takes inspiration from attributes of neuronal circuits such as hyperdimensionality, fully distributed holographic representation, and (pseudo)randomness. When employed for machine learning tasks such as learning and classification, HDC involves the manipulation and comparison of large patterns within memory. Moreover, a key attribute of HDC is its robustness to the imperfections associated with the computational substrates on which it is implemented. It is therefore particularly amenable to emerging non-von Neumann paradigms such as in-memory computing, where the physical attributes of nanoscale memristive devices are exploited to perform computation in place. Here, we present a complete in-memory HDC system that achieves a near-optimum trade-off between design complexity and classification accuracy based on three prototypical HDC-related learning tasks, namely, language classification, news classification, and hand gesture recognition from electromyography signals. Accuracies comparable to those of software implementations are demonstrated experimentally using 760,000 phase-change memory devices performing analog in-memory computing.


I Introduction

When designing biological computing systems, nature decided to trade accuracy for efficiency. Hence, one viable solution for the continuous reduction of energy-per-operation is to adopt computational approaches that are inherently robust to uncertainty. Hyperdimensional computing (HDC) is recognized as one such framework, based on the observation that key aspects of human memory, perception, and cognition can be explained by the mathematical properties of hyperdimensional spaces, and that a powerful system of computing can be built on the rich algebra of hypervectors Kanerva2009 . Groups, rings, and fields over hypervectors become the underlying computing structures, with permutations, mappings, and inverses as primitive computing operations. Hypervectors are defined as d-dimensional (where d ≥ 1,000), (pseudo)random vectors with independent and identically distributed (i.i.d.) components Kanerva98SDM . When the dimensionality is in the thousands, there exists a large number of quasiorthogonal hypervectors. This allows HDC to combine such hypervectors into new hypervectors using well-defined vector space operations, defined such that the resulting hypervector is unique and of the same dimension.
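This quasiorthogonality is easy to verify numerically. The following minimal sketch (our own illustration, not part of the original experiments; only the dimensionality d = 10,000 is taken from this work) shows that randomly drawn dense binary hypervectors disagree on close to half of their components, i.e., their normalized Hamming distance concentrates around 0.5:

```python
import numpy as np

d = 10000                          # hypervector dimensionality used in this work
rng = np.random.default_rng(0)

# Draw 100 dense binary hypervectors with i.i.d. components.
H = rng.integers(0, 2, size=(100, d), dtype=np.uint8)

# Pairwise normalized Hamming distances between distinct hypervectors.
dists = [np.mean(H[i] != H[j]) for i in range(100) for j in range(i + 1, 100)]
print(min(dists), max(dists))      # both very close to 0.5: quasiorthogonal
```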

In contrast to other brain-inspired approaches such as deep learning Y2015lecunNature , in which learning is computationally much more demanding than subsequent classification, learning in HDC is fast and computationally balanced with respect to classification because it reuses the same operators. The training algorithm in HDC works in one or a few shots, i.e., object categories are learned from one or a few examples and in a single pass over the training data, as opposed to many iterations. HDC also produces transparent representations owing to its well-defined set of arithmetic operations with distributivity and invertibility. It requires far fewer operations to achieve the same functionality than other approaches such as support vector machines, the k-nearest neighbors algorithm, multi-layer perceptrons, and convolutional neural networks Y2017rahimiTCAS ; Y2019rahimiProcIEEE .

HDC is intrinsically robust to failures, defects, variations, and noise of computing fabrics on which it is implemented Y2016rahimiISLPED ; Y2016liIEDM ; Y2018wuISSCC . Symbols represented with hypervectors begin with i.i.d. components, and are combined by nearly i.i.d.-preserving operations. This implies that failure in a component of a hypervector is not “contagious.” At the same time, failures in a subset of components are compensated for by the holographic nature of the data representation, i.e., the error-free components can still provide a useful representation that is similar enough to the original hypervector.

The manipulation of large patterns stored in memory and the inherent robustness make HDC particularly well suited for emerging computing paradigms such as in-memory computing or computational memory based on emerging nanoscale resistive memory or memristive devices Y2013yangNatureNano ; Y2017sebastianNatComm ; Y2018zidanNatureElectronics ; Y2018ielminiNatureElectronics . In one such work, a 3D vertical resistive random-access memory (ReRAM) device was used to perform individual operations for HDC Y2016liIEDM ; Y2017liVLSI . In another work, a carbon nanotube field-effect transistor-based logic layer was integrated with ReRAMs, improving efficiency further Y2018wuISSCC . However, these prototypes have been limited in multiple aspects: a small 32-bit datapath demands heavy time-multiplexing of hypervectors Y2018wuISSCC ; they can store only a fraction of the HDC models due to the availability of only 256 ReRAM cells Y2016liIEDM ; and they do not allow any reprogrammability, as they are restricted to one application Y2016liIEDM or a binary classification task Y2018wuISSCC .

In this paper, we propose a complete in-memory HDC system and present large-scale mixed hardware/software experimental demonstrations of in-memory HDC with up to 49 d = 10,000-dimensional hypervectors encoded in 760,000 phase-change memory (PCM) devices performing analog in-memory computing. Our experiments achieve accuracies comparable to the software baselines. Furthermore, a system-level study demonstrates over 6x end-to-end reductions in energy with in-memory HDC compared to a dedicated digital CMOS implementation.

II The concept of in-memory HDC

Figure 1: The concept of in-memory HDC. A schematic illustration of the concept of in-memory HDC shows the essential steps associated with HDC (left) and how they are realized using in-memory computing (right). An item memory (IM) stores h d-dimensional basis hypervectors that correspond to the symbols associated with a classification problem. During learning, based on a labelled training dataset, an encoder performs dimensionality-preserving mathematical manipulations on the basis hypervectors to produce c d-dimensional prototype hypervectors that are stored in an associative memory (AM). During classification, the same encoder generates a query hypervector based on a test example. Subsequently, an associative memory search is performed between the query hypervector and the elements of the AM to determine the class to which the test example belongs. In in-memory HDC, both the IM and the AM are mapped onto crossbar arrays of memristive devices. The mathematical operations associated with encoding and associative memory search are performed in place by exploiting in-memory read logic and dot-product operations, respectively. A dimensionality of d = 10,000 is used.

When HDC is used for learning and classification, first, a set of i.i.d., hence quasiorthogonal, hypervectors, referred to as basis hypervectors, is selected to represent each symbol associated with a dataset. For example, if the task is to classify an unknown text into the corresponding language, the symbols could be the letters of the alphabet. The basis hypervectors stay fixed throughout the computation. Assuming that there are h symbols, the set of the h d-dimensional basis hypervectors is referred to as the item memory (IM) (see Fig. 1). Basis hypervectors serve as the basis from which further representations are made by applying a well-defined set of component-wise operations: addition of binary hypervectors is defined as the component-wise majority, multiplication (⊕) is defined as the component-wise exclusive-OR (or equivalently as the component-wise exclusive-NOR), and finally permutation (ρ) is defined as a pseudo-random shuffling of the coordinates. Applied on dense binary hypervectors, where each component has equal probability of being zero or one BSC96 , all these operations produce a d-bit hypervector, resulting in a closed system.
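These component-wise operations can be summarized in a short software sketch (an illustrative numpy model of the operations just described; the function names are our own and the hardware described later realizes these primitives differently):

```python
import numpy as np

def bundle(hvs):
    """Addition: component-wise majority over a list of binary hypervectors."""
    hvs = np.asarray(hvs)
    return (2 * hvs.sum(axis=0) >= len(hvs)).astype(np.uint8)

def bind(a, b):
    """Multiplication: component-wise XOR (exclusive-NOR works equally well)."""
    return np.bitwise_xor(a, b)

def permute(a, shift=1):
    """Permutation: modeled here as a circular shift of the coordinates."""
    return np.roll(a, shift)
```

All three functions return a d-bit hypervector, so the representation stays closed under these operations.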

Subsequently, during the learning phase, the basis hypervectors in the IM are combined with the component-wise operations inside an encoder to compute, for instance, an n-gram hypervector representing an object of interest, and n-gram hypervectors from the same category of objects are added up to produce a prototype hypervector representing the entire class of that category. In the language example, the encoder would receive input text associated with a known language and would generate a prototype hypervector corresponding to that language. In this case, n determines the smallest number of symbols (letters in the example) that are combined while performing an n-gram encoding operation. The overall encoding operation results in c d-dimensional prototype hypervectors (referred to as the associative memory (AM)), assuming there are c classes. When the encoder receives n consecutive symbols, s_1, s_2, …, s_n, it produces an n-gram hypervector, G, given by:

G(s_1, s_2, …, s_n) = B(s_1) ⊙ ρ(B(s_2)) ⊙ ρ^2(B(s_3)) ⊙ … ⊙ ρ^{n-1}(B(s_n)),   (1)

where B(s_i) corresponds to the associated basis hypervector for symbol s_i. The ⊙ operator denotes the exclusive-NOR, and ρ denotes a pseudo-random permutation operation, e.g., a circular shift by 1 bit. The encoder then bundles several such n-gram hypervectors from the training data using component-wise addition followed by a binarization (majority function) to produce a prototype hypervector for the given class.
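A compact software model of this encoder (our own sketch following Equation (1) and the bundling rule above, with exclusive-NOR as the binding operation and ρ modeled as a circular shift; the symbol set and n-gram size mirror the language task described in Methods) could look as follows:

```python
import numpy as np

def ngram_hv(basis, symbols):
    """Equation (1): XNOR of progressively permuted basis hypervectors."""
    g = basis[symbols[0]].copy()
    for j, s in enumerate(symbols[1:], start=1):
        rotated = np.roll(basis[s], j)        # rho^j applied to B(s_{j+1})
        g = 1 - np.bitwise_xor(g, rotated)    # exclusive-NOR
    return g

def train_prototype(basis, sequence, n):
    """Bundle all n-grams of a labelled sequence into one prototype hypervector."""
    d = next(iter(basis.values())).size
    acc = np.zeros(d, dtype=np.int64)
    n_grams = len(sequence) - n + 1
    for i in range(n_grams):
        acc += ngram_hv(basis, sequence[i:i + n])
    return (2 * acc >= n_grams).astype(np.uint8)          # majority binarization

# Example: 27-symbol item memory (letters plus whitespace), d = 10,000, 4-grams.
rng = np.random.default_rng(1)
basis = {c: rng.integers(0, 2, 10000, dtype=np.uint8)
         for c in "abcdefghijklmnopqrstuvwxyz "}
proto = train_prototype(basis, "the quick brown fox jumps over the lazy dog", n=4)
```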

When inference or classification is performed, a query hypervector Q (e.g., from a text of unknown language) is generated in exactly the same way as the prototype hypervectors. Subsequently, the query hypervector is compared with the prototype hypervectors inside the AM to make the appropriate classification. Equation (2) defines how a query hypervector Q is compared against each of the c prototype hypervectors P_i to find the predicted class with maximum similarity. This AM search operation can, for example, be performed by calculating the inverse Hamming distance:

class = argmax_{i ∈ {1, …, c}} sim(P_i, Q)   (2)

One key observation is that the two main operations presented above, namely, the encoding and AM search, are about manipulating and comparing large patterns within the memory itself. Both IM and AM (after learning) represent permanent hypervectors stored in the memory. As a lookup operation, different input symbols activate the corresponding stored patterns in the IM that are then combined inside or around memory with simple local operations to produce another pattern for comparison in AM. These component-wise arithmetic operations on patterns allow a high degree of parallelism as each hypervector element needs to communicate with only a local component or its immediate neighbors. This highly memory-centric aspect of HDC is the key motivation for the in-memory computing implementation proposed in this work.

The essential idea of in-memory HDC is to store the elements of both the IM and the AM as the conductance values of nanoscale memristive devices Y2011chuaAPA ; Y2015wongNatureNano organized in crossbar arrays (see Fig. 1). The IM of h rows and d columns is stored in the first crossbar, where each basis hypervector is stored on a single row. To perform operations between the basis hypervectors for the n-gram encoding, an in-memory read logic primitive is employed. Unlike the vast majority of reported in-memory logic operations Y2010borghettiNature ; Y2014kvatinskyTCAS , the proposed in-memory read logic is non-stateful, and this obviates the need for very high write endurance of the memristive devices. Additional peripheral circuitry is used to implement the remaining permutations and component-wise additions needed in the encoder. The AM of c rows and d columns is implemented in the second crossbar, where each prototype hypervector is stored on a single row. During supervised learning, each prototype hypervector output by the first crossbar gets programmed into a certain row of the AM based on the provided label. During inference, the query hypervector output by the first crossbar is input as voltages on the wordline driver to perform the associative memory search using an in-memory dot-product primitive.

This design ideally fits the memory-centric architecture of HDC because it allows the main computations on the IM and AM to be performed within the memory units themselves with a high degree of parallelism. Furthermore, the IM and AM are programmed only once while training on a specific dataset, and the two types of in-memory computations that are employed involve just read operations. Therefore, non-volatile memristive devices are very well suited for implementing the IM and AM, and only binary conductance states are required. In this work, we used PCM technology Y2010wongProcIEEE ; Y2016burrJETCAS , which operates by switching a phase-change material between amorphous (high-resistivity) and crystalline (low-resistivity) phases to implement binary data storage (see Methods). PCM has also been successfully employed in novel computing paradigms such as neuromorphic computing Y2011kuzumNanoLetters ; Y2016tumaNatNano ; Y2018boybatNatComm ; Y2018sebastianJAP and computational memory Y2013wrightAFM ; Y2017sebastianNatComm ; Y2018legalloNatureElectronics ; Y2018ielminiNatureElectronics , which makes it a good candidate for realizing the in-memory HDC system.

In the remainder of the paper, we elaborate on the detailed designs of the associative memory and the encoder, and finally propose a complete in-memory HDC system that achieves a near-optimum trade-off between design complexity and output accuracy. The system performance is validated through experiments using a prototype PCM chip fabricated in 90 nm CMOS technology (see Methods), and a complete system-level design implemented using 65 nm CMOS technology is presented.

III The associative memory search module

Classification involves an AM search between the prototype hypervectors and the query hypervector using a suitable similarity metric in Equation (2), such as the inverse Hamming distance (invHamm). Using the associativity of addition operations, this similarity computation can be decomposed into the addition of two dot-product terms, as shown in Equation (3),

invHamm(P_i, Q) = P_i · Q + P̄_i · Q̄,   (3)

where P̄_i and Q̄ denote the logical complements of P_i and Q. Since the operations associated with HDC ensure that both the query and prototype hypervectors have an almost equal number of zeros and ones, the dot product (dotp) can also serve as a viable similarity metric.
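The decomposition can be checked directly in a few lines (an illustrative numpy sketch in our notation, not the on-chip implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10000
P = rng.integers(0, 2, d)                      # a prototype hypervector (0/1)
Q = rng.integers(0, 2, d)                      # a query hypervector (0/1)

inv_hamm = d - np.count_nonzero(P != Q)        # number of matching components
two_dotp = P @ Q + (1 - P) @ (1 - Q)           # Equation (3): sum of two dot products
assert inv_hamm == two_dotp

# Because dense hypervectors are nearly balanced, the first term alone
# (the dotp metric) ranks the classes almost identically at half the hardware cost.
dotp = P @ Q
```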

Figure 2: Associative memory search. a Schematic illustration of the AM search architecture to compute the invHamm similarity metric. Two PCM crossbar arrays, storing the prototype hypervectors and their complements, are employed. b Schematic illustration of the coarse-grained randomization strategy employed to counter the variations associated with the crystalline PCM state. c Results of the classification task show that the experimental on-chip accuracy results compare favorably with the 10-partition simulation results and the software baseline for both similarity metrics on the three datasets.

To compute the invHamm similarity metric, two PCM crossbar arrays, one storing the prototype hypervectors and one their complements, are used as shown in Fig. 2a. The prototype hypervectors P_i are programmed into one of the crossbar arrays as conductance states: binary '1' elements are programmed as crystalline states and binary '0' elements are programmed as amorphous states. The complementary hypervectors are programmed in a similar manner into the second crossbar array. The query hypervector and its complement are applied as voltage values along the wordlines of the respective crossbars. In accordance with Kirchhoff's current law, the total current on a bitline equals the dot product between the query hypervector and the prototype hypervector stored along that bitline. The results of these in-memory dot-product operations from the two arrays are added in a pairwise manner using digital adder circuitry in the periphery and are subsequently input to a winner-take-all (WTA) circuit, which outputs a '1' only on the bitline corresponding to the class with the maximum similarity value. When the dotp similarity metric is considered, only the crossbar storing the (non-complemented) prototype hypervectors is used and the array of adders in the periphery is eliminated, resulting in reduced hardware complexity.
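The in-memory dot product underlying this search can be modeled numerically in a few lines (a simplified sketch under idealized assumptions: binary conductances, no device variability, and perfectly linear current summation; the conductance and voltage values are illustrative and taken from the orders of magnitude quoted in Methods):

```python
import numpy as np

rng = np.random.default_rng(3)
d, c = 10000, 22                      # hypervector dimensionality, number of classes
G_MAX, G_MIN = 20e-6, 0.0             # binary conductance levels in siemens (illustrative)
V_READ = 0.3                          # read voltage in volts (see Methods)

P = rng.integers(0, 2, (c, d))        # prototype hypervectors, one per class
G = np.where(P.T == 1, G_MAX, G_MIN)  # crossbar: d wordlines x c bitlines

Q = rng.integers(0, 2, d)             # query hypervector
v = Q * V_READ                        # query applied as wordline voltages

I = v @ G                             # Kirchhoff summation: one current per bitline
predicted_class = int(np.argmax(I))   # winner-take-all over the bitline currents
```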

Experiments were performed using a prototype PCM chip to evaluate the effectiveness of the proposed implementation on three common HDC benchmarks: language classification, news classification, and hand gesture recognition from electromyography (EMG) signals (see Methods). In those experiments, the prototype hypervectors (and their complements) are learned beforehand in software and are then programmed into the PCM devices on the chip. Inference is then performed with a software encoder using Equation (3) for the associative memory search, in which all multiplication operations are performed in the analog domain (by exploiting Ohm's law) on-chip and the remaining operations are implemented in software (see Methods).

However, it was found that, when a naive mapping of the prototype hypervectors to the array is used, the chip-level variability associated with the crystalline state detrimentally affects the AM search operation. To address this issue, we employed a coarse-grained randomization strategy, where the idea is to segment the prototype hypervectors and to place the resulting segments spatially distributed across the crossbar array (see Fig. 2b). This helps all the components of the prototype hypervectors to uniformly mitigate long-range variations. The proposed strategy involves dividing the crossbar array into f equal-sized partitions and storing one segment of each of the prototype hypervectors per partition. Here, f is called the 'partition factor' and it controls the granularity associated with the randomization. To match the segments of the prototype hypervectors, the query vector is also split into f equal-sized subvectors, which are input sequentially to the wordline drivers of the crossbar.

A statistical model that captures the spatio-temporal conductivity variations was used to evaluate the effectiveness of the coarse-grained randomized partitioning method. Simulations were carried out for partition factors of 1, 2, and 10 for the two similarity metrics, dotp and invHamm, as shown in Figure 2c. These results indicate that the classification accuracy increases with the number of partitions. For instance, for language classification, the accuracy improves from 82.5% to 96% with dotp by randomizing with a partition factor of 10 instead of 1. The experimental on-chip accuracy (obtained with a partition factor of 10) is close to the 10-partition simulation result and the software baseline for both similarity metrics on all three datasets. When the two similarity metrics are compared, invHamm provides slightly better accuracy for the same partition size, at the expense of almost doubled area and energy consumption. Therefore, for low-power applications, a good trade-off is the use of the dotp similarity metric with a partition factor of 10.

IV The n-gram encoding module

Figure 3: In-memory n-gram encoding based on 2 minterms. a The basis hypervectors and their complements are mapped onto two crossbar arrays. Through a sequence of in-memory logical operations, the approximated n-gram of Equation (5) is generated. b Classification results on the language (using n = 4) and news (using n = 5) datasets show the performance of the 2-minterm approximation compared with the all-minterm approach.

In this section, we will focus on the design of the n-gram encoding module. As described in Section II, one of the key operations associated with the encoder is the calculation of the n-gram hypervector given by Equation (1). In order to find in-memory hardware-friendly operations, Equation (1) is re-written as the component-wise summation of 2^{n-1} minterms given by Equation (4),

G(s_1, …, s_n) = ∑_{k=1}^{2^{n-1}} ⋀_{j=1}^{n} ρ^{j-1}(Z_{k,j}),   (4)

where ⋀ denotes the component-wise logical AND and Z_{k,j} ∈ {B(s_j), B̄(s_j)}, with an even number of complemented terms appearing in each minterm. Here, j is the item hypervector index within an n-gram and k is used to index the minterms.

The representation given by Equation (4) can be mapped onto memristive crossbar arrays, where the bitwise AND (⋀) function is naturally supported through analog scalar multiplication. However, the number of minterms (2^{n-1}) rises exponentially with the n-gram size, making the hardware computations costly. Therefore, it is desirable to reduce the number of minterms and to use a fixed number of minterms independent of n.

It can be shown that when n is even, there exists a 2-minterm approximation to Equation (4) given by

G(s_1, …, s_n) ≈ (⋀_{j=1}^{n} ρ^{j-1}(B(s_j))) ∨ (⋀_{j=1}^{n} ρ^{j-1}(B̄(s_j))).   (5)

We used this 2-minterm-based approximation for in-memory HDC. A schematic illustration of the corresponding n-gram encoding system is presented in Fig. 3a. The basis hypervectors are programmed on one of the crossbars and their complement vectors on the second. The component-wise logical AND operation between two hypervectors in Equation (5) is realized in-memory by applying one of the hypervectors as the gate control lines of the crossbar while selecting the wordline of the second hypervector. The result of the AND function from the crossbar is passed through an array of sense amplifiers to convert the analog values to binary values. The binary result is then stored in the minterm buffer, whose output is fed back as the gate controls after a single-component shift to the right (to the left in the complementary crossbar). This implements the permutation operation in the n-gram encoding as shown in Equation (5). By performing these operations n times, it is possible to generate the n-gram (the details are presented in the Methods section).
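In software, the 2-minterm approximation of Equation (5) and its relation to the exact encoding of Equation (1) can be sketched as follows (our own illustrative numpy model, with ρ again modeled as a circular shift):

```python
import numpy as np

def ngram_exact(basis, symbols):
    """Equation (1): chained XNOR of progressively permuted basis hypervectors."""
    g = basis[symbols[0]].copy()
    for j, s in enumerate(symbols[1:], start=1):
        g = 1 - np.bitwise_xor(g, np.roll(basis[s], j))
    return g

def ngram_2minterm(basis, symbols):
    """Equation (5): AND of permuted hypervectors OR AND of their complements."""
    m1 = np.ones_like(basis[symbols[0]])
    m2 = np.ones_like(basis[symbols[0]])
    for j, s in enumerate(symbols):
        m1 &= np.roll(basis[s], j)            # minterm over B(s_j)
        m2 &= np.roll(1 - basis[s], j)        # minterm over the complements
    return m1 | m2

rng = np.random.default_rng(4)
basis = {c: rng.integers(0, 2, 10000) for c in "abcd"}
exact = ngram_exact(basis, "abcd")            # n = 4 (even)
approx = ngram_2minterm(basis, "abcd")
# Every component set by the 2-minterm approximation is also set in the exact n-gram.
assert np.all(exact[approx == 1] == 1)
```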

To test the effectiveness of the encoding scheme with in-memory computing, simulations were carried out using the PCM statistical model. The training was performed in software with the same encoding technique that is used thereafter for inference, and both the encoder and the AM were implemented with modeled PCM crossbars for inference. The simulations were performed only on the language and news classification datasets, because for the EMG dataset the hypervectors used for the n-gram encoding are generated by a spatial encoding process and cannot be mapped entirely onto a fixed IM of reasonable size. From the results presented in Fig. 3b, it is clear that the all-minterm approach to encoding provides the best classification accuracy in most configurations of the AM, as expected. However, the 2-minterm-based encoding method yields a stable and, in some cases, particularly on the language dataset, similar accuracy level to that of the all-minterm approach, while significantly reducing the hardware complexity. The 2-minterm approximation also appears to provide satisfactory results when n is odd according to the experiments conducted, even though the second minterm in Equation (5) shows up in Equation (4) only when n is even.

V The complete in-memory HDC system

Figure 4: The complete in-memory HDC system. a Schematic of the architecture showing the 2-minterm encoder and the associative memory search engine employing the dotp metric. b Classification accuracy results on the news and language datasets where both the encoding and the associative memory search are performed in software, simulated using the PCM model, and experimentally realized on the chip.

In this section the complete HDC system and the associated experimental results are presented. The proposed architecture comprises the 2-minterm encoder and the dotp similarity metric with a partition factor of 10, as this provides the best trade-off between classification accuracy and hardware complexity. As shown in Figure 4a, the proposed architecture has three PCM crossbar arrays: two having h rows and d columns, storing the IM and the complementary IM, and one storing the c prototype hypervectors of the AM in the partitioned form described in Section III, with d = 10,000.

The system includes several peripheral circuits: an index buffer, a minterm buffer, and a bundler reside inside the encoder, whereas the AM search module contains a sum buffer and a comparator circuit. The index buffer is located at the input of the IM to keep the indices of the symbols in the sequence and to feed them into the crossbar rows. The bundler accumulates the n-gram hypervectors to produce a sum hypervector. Once a threshold is applied to the sum hypervector, the result is a prototype hypervector at training time or a query hypervector at inference time. During inference, one segment of the query hypervector at the output buffer of the encoder is fed at a time to the AM through an array of multiplexers, so that only the corresponding partition is activated in the AM. Depending on the partition that is selected, the relevant gates are activated through a controller sitting inside the AM search module. Finally, the results in the sum buffer are sent through WTA circuitry to find the index of the maximum, which provides the prediction.

                                               All-CMOS                         PCM crossbar based
                                               Encoder   AM search   Total      Encoder   AM search   Total
Energy
Average energy per query (nJ)                  1470      1110        2580       382       9.44        391.8
Improvement over all-CMOS                      -         -           -          3.86x     117.5x      6.61x
Exclusive modules avg. energy per query (nJ)   1130      1100        2240       39.40     3.30        42.70
Improvement over all-CMOS                      -         -           -          28.74x    334.62x     52.39x
Area
Total area (mm^2)                              4.77      2.99        7.76       1.36      0.68        2.04
Improvement over all-CMOS                      -         -           -          3.52x     4.38x       3.81x
Exclusive modules area (mm^2)                  3.53      2.38        5.91       0.11      0.075       0.18
Improvement over all-CMOS                      -         -           -          32.25x    31.94x      32.13x
Table 1: Performance comparison between a dedicated all-CMOS implementation and in-memory HDC with PCM crossbars

We conducted experiments on the prototype PCM chip in which all components of both the IM and the AM were stored on hardware PCM devices. In this full-chip experiment, training was performed in software, and measurements for both the IM crossbars and the AM crossbar at each step of the HD computing algorithm were taken from the PCM prototype chip for inference (see Methods). Figure 4b summarizes the accuracy results obtained with software, with the PCM statistical model, and with the on-chip experiment for the language and news classification benchmarks. Compared with the previous experiment, where only the AM was realized on-chip, the full-chip experiment shows a similar accuracy level, indicating the minimal effect on accuracy of porting the IM into PCM devices with in-memory n-gram encoding. Furthermore, the accuracy level reported in this experiment is close to that reported with software for the same parametric configuration of the HD inference model.

Finally, to benchmark the performance of the system in terms of energy consumption, the digital submodules in the system-level architecture (marked with dotted boundaries in Figure 4) that fall outside the PCM crossbar arrays were synthesized using 65 nm CMOS technology. The synthesis results of these modules can be combined with the performance characteristics of the PCM crossbar arrays to arrive at figures of merit such as energy, area, and throughput of the full system (see Methods). Furthermore, for comparison against the figures of merit of the PCM crossbar based architecture proposed in this paper, the PCM crossbar sections were also implemented with distributed CMOS standard-cell registers, together with the associated multiplier and adder-tree logic for the AM and binding logic for the IM, to construct a complete all-CMOS HD processor. A comparison of the performance of the all-CMOS approach versus the PCM crossbar based approach is presented in Table 1. As seen in the table, a 6.61x improvement in total energy efficiency and a 3.81x reduction in area are obtained with the introduction of the PCM crossbar modules. The encoder's energy expense for processing a query reduces by a factor of 3.86 with the PCM crossbar implementation, whereas that of the AM search module reduces by a factor of 117.5. However, these efficiency factors are partially masked by the CMOS peripheral circuitry that is common to both implementations, specifically that in the encoder module, which accounts for the majority of its energy consumption. When the peripheral circuits are ignored and only the parts of the design that are exclusive to each approach are directly compared with each other, 28.74x and 334.62x energy savings and 32.25x and 31.94x area savings are obtained for the encoder and the AM search module, respectively. It remains future work to investigate how the peripheral modules can be designed more energy-efficiently so that the overall system efficiency can be improved further.

VI Conclusion

In summary, we proposed the concept of in-memory HDC, where nanoscale memristive devices organized in crossbar arrays are used to store the hypervectors associated with the IM and AM. The main computations of the HDC algorithm are performed in-memory with logical and dot-product operations on the memristive devices. Due to the inherent robustness of HDC to errors, it was possible to approximate the mathematical operations associated with HDC to make them suitable for hardware implementation, and to use analog in-memory computing, without significantly degrading the output accuracy. Hardware/software experiments using a prototype PCM chip delivered accuracies comparable to software baselines on language and news classification benchmarks with 10,000-dimensional hypervectors, making this work the largest experimental demonstration of HDC with memristive hardware to date. A comparative study performed against a system-level design implemented using 65 nm CMOS technology showed that the in-memory HDC approach could result in over 6x end-to-end savings in energy. By designing more energy-efficient peripheral circuits and with the potential of scaling PCM devices to nanoscale dimensions Y2011xiongScience , these gains could increase several fold. The in-memory HDC concept is also applicable to other types of memristive devices based on ionic drift Y2010waserNT and magnetoresistance Y2015kentNatureNanotechnology . Future work will focus on taking in-memory HDC beyond learning and classification to perform advanced cognitive tasks, along with data compression and retrieval on dense storage devices, as well as on building more power-efficient peripheral hardware to harness the best of in-memory computing.

Acknowledgments

This work was supported in part by the European Research Council through the European Union’s Horizon 2020 Research and Innovation Program under Grant 682675 and in part by the European Union’s Horizon 2020 Research and Innovation Program through the project MNEMOSENE under Grant 780215. We would like to thank Evangelos Eleftheriou for managerial support.

Methods

PCM-based hardware platform

The experimental hardware platform is built around a prototype phase-change memory (PCM) chip that contains PCM cells that are based on doped-GeSbTe (d-GST) and are integrated into the prototype chip in 90 nm CMOS baseline technology. In addition to the PCM cells, the prototype chip integrates the circuitry for cell addressing, on-chip ADC for cell readout, and voltage- or current-mode cell programming. The experimental platform comprises the following main units:

  • a high-performance analog-front-end (AFE) board that contains the digital-to-analog converters (DACs) along with discrete electronics, such as power supplies, voltage, and current reference sources,

  • an FPGA board that implements the data acquisition and the digital logic to interface with the PCM device under test and with all the electronics of the AFE board, and

  • a second FPGA board with an embedded processor and Ethernet connection that implements the overall system control and data management as well as the interface with the host computer.

The prototype chip Y2010closeIEDM contains 3 million PCM cells and the CMOS circuitry to address, program, and read out any of these 3 million cells. In the PCM devices used for experimentation, two 240 nm-wide access transistors are used in parallel per PCM element (cell size is 50 F²). The PCM array is organized as a matrix of 512 word lines (WL) and 2048 bit lines (BL). The PCM cells were integrated into the chip in 90 nm CMOS technology using the key-hole process Y2007breitwischVLSI . The bottom electrode has a radius of approximately 20 nm and a length of approximately 65 nm. The phase-change material is approximately 100 nm thick and extends to the top electrode, whose radius is approximately 100 nm. The selection of one PCM cell is done by serially addressing a WL and a BL. The addresses are decoded and they then drive the WL driver and the BL multiplexer. The selected cell can be programmed by forcing a current through the BL with a voltage-controlled current source; it can also be read by an 8-bit on-chip ADC. For reading a PCM cell, the selected BL is biased to a constant voltage of 300 mV by a voltage regulator using a voltage generated by an off-chip DAC. The sensed current is integrated by a capacitor, and the resulting voltage is then digitized by the on-chip 8-bit cyclic ADC. The total time of one read is 1 μs. For programming a PCM cell, a voltage generated off-chip is converted on-chip into a programming current. This current is then mirrored into the selected BL for the desired duration of the programming pulse. The pulse used to program the PCM to the amorphous state (RESET) is a box-type rectangular pulse with a duration of 400 ns and an amplitude of 450 μA. The pulse used to program the PCM to the crystalline state (SET) is a ramp-down pulse with a total duration of approximately 12 μs. The access-device gate voltage (WL voltage) is kept high at 2.75 V during the programming pulses.

Datasets to evaluate in-memory HDC

We target three highly relevant learning and classification tasks to evaluate the proposed in-memory HDC architecture. The following three tasks are used to benchmark the performance of in-memory HDC in terms of classification accuracy.

  1. Language classification: In this task, HDC is applied to classify raw text composed of Latin characters into its respective language. The training texts are taken from the Wortschatz Corpora language_trainset , where large numbers of sentences (about a million bytes of text) are available for 22 European languages. Another independent dataset, the Europarl Parallel Corpus language_testset , with 1,000 sentences per language, is used as the test dataset for the classification. The former database is used for training the 22 prototype hypervectors, one for each language, while the latter is used to run inference on the trained HDC model. For the subsequent simulations and experiments with the language dataset, we use dimensionality d = 10,000 and n-gram size n = 4.

    We use an item memory (IM) of 27 symbols, representing the 26 letters of the Latin alphabet plus the whitespace character. Training is performed using the entire training dataset, containing a labeled text per language. For inference, a query is composed of a single sentence of the test dataset; hence, in total, 1,000 queries per language are used.

  2. News classification: The news dataset comprises a database of Reuters news articles, subjected to a lightweight pre-processing step, covering 8 different news genres NewsDataset . The pre-processing step removes frequent "stop" words and words with fewer than 3 letters HD_news . The training set has 5,400+ documents, while the test set contains 2,100+ documents. For the subsequent simulations and experiments with the news dataset, we use dimensionality d = 10,000 and n-gram size n = 5. Similar to the language task, we use an IM of 27 symbols, representing the 26 letters of the Latin alphabet plus the whitespace character. Training is performed using the entire training dataset, where all labeled documents pertaining to the same class are merged into a single text per class. For inference, a query is composed of a single document of the test dataset.

  3. Hand gesture recognition from Electromyography (EMG) signals:

    In this task, we focus on the use of HDC in a smart prosthetic application, namely hand gesture recognition from a stream of EMG signals. A database emgdataset that provides EMG samples recorded from four channels covering the forearm muscles is used for this benchmark. The data of each channel are quantized into 22 intensity levels of electric potential. The sampling frequency of the EMG signal is 500 Hz.

    A label is provided for each time sample. The label varies from 1 to 5 corresponding to five classes of performed gestures. This dataset is used to train an HDC model to detect hand gestures of a single subject. For training on EMG dataset, a spatial encoding scheme is first employed to fuse data from the four channels so the IM has four symbols, and it is paired with a continuous item memory to jointly map the intensity levels per channel. The spatial encoding creates one hypervector per time sample.

    Then, a temporal encoding step is performed, whereby n consecutive spatially encoded hypervectors are combined into an n-gram. For the subsequent simulations and experiments with the EMG dataset, we use dimensionality d = 10,000 and n-gram size n = 5. Training and inference are performed using the same EMG channel signals from the same subject, but on non-overlapping sections of the recording. The recording used for training contains 1,280 time samples after down-sampling by a factor of 175. For inference, 780 queries are generated from the rest of the recording, where each query contains 5 time samples captured with the same down-sampling factor.

Dataset     IM # symbols    IM dimensionality    AM dimensionality    AM # classes
Language    27              10,000               10,000               22
News        27              10,000               10,000               8
EMG         4               10,000               10,000               5
Table 2: Dimensions of IM and AM for the different tasks

Table 2 provides details on the dimensions of the IM and AM for the different tasks. For the EMG dataset, the hypervectors for the encoding operation are drawn by binding items from a pair of an IM and a continuous IM. In a hardware implementation of in-memory HDC, the IM and AM may be distributed across multiple narrower crossbars in case electrical/physical limitations arise.

Coarse grained randomization

The programming methodology followed to achieve the coarse-grained randomized partitioning in the memristive crossbar for the AM search is explained in the following steps. First, we split all prototype hypervectors P_1, P_2, …, P_c into f subvectors of equal length, where f is the partition factor. For example, the subvectors from the prototype hypervector of the first class are denoted as P_1^1, P_1^2, …, P_1^f. Then, the crossbar array is divided into f equal-sized partitions A_1, A_2, …, A_f, each containing d/f rows and c columns. A random permutation e of the numbers 1 to c is then selected. Next, the first subvector from each class (P_1^1, P_2^1, …, P_c^1) is programmed into the first partition such that each subvector fits onto a column of the crossbar partition. The order in which the subvectors are programmed into the columns of the partition is determined by the previously selected random permutation e. The above steps are repeated to program all the remaining partitions.

The methodology followed in feeding the query vectors during inference is detailed in the following steps. First, we split the query hypervector Q into f subvectors Q_1, Q_2, …, Q_f of equal length. Then, the component values of subvector Q_i are translated into voltage levels and applied to the wordline drivers of the crossbar array, and the bitlines corresponding to partition A_i are enabled. The partial dot products are then collected into the respective class-wise destinations in the sum buffer through the A/D converters at the end of the partition of the array. The above procedure is repeated for each partition A_i, i = 1, …, f. The class-wise partial dot products are accumulated and updated in the sum buffer in each iteration. After the f-th iteration, the full dot-product values are available in the sum buffer. The results are then compared against each other using a WTA circuit to find the maximum value, and its index is assigned as the predicted class.
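The mapping and query procedure can be summarized in a short numerical sketch (an idealized software model that ignores device-level non-idealities; the variable names and the noisy-query test are our own):

```python
import numpy as np

rng = np.random.default_rng(5)
d, c, f = 10000, 22, 10                  # dimensionality, classes, partition factor
seg = d // f                             # segment length per partition

P = rng.integers(0, 2, (c, d))           # prototype hypervectors, one per class
e = rng.permutation(c)                   # random column order, shared by all partitions

# Program the crossbar: partition p holds segment p of every class, with the
# columns ordered according to the random permutation e.
crossbar = np.zeros((f, seg, c))
for p in range(f):
    for col, cls in enumerate(e):
        crossbar[p, :, col] = P[cls, p * seg:(p + 1) * seg]

def am_search(Q):
    """Feed the query segment by segment and accumulate partial dot products."""
    sums = np.zeros(c)
    for p in range(f):
        q_seg = Q[p * seg:(p + 1) * seg]
        sums += q_seg @ crossbar[p]       # partial dot products of partition p
    return int(e[np.argmax(sums)])        # undo the column permutation, then WTA

flips = (rng.random(d) < 0.05).astype(np.int64)
query = P[7] ^ flips                      # a noisy copy of the class-7 prototype
assert am_search(query) == 7
```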

Experiments on associative memory search

In order to obtain the prototype hypervectors used for the AM search, training with HDC is first performed in software on the three datasets described in the section "Datasets to evaluate in-memory HDC". For the language and news datasets, XOR-based encoding (see Section II) is used with n-gram sizes of n = 4 and n = 5, respectively. For the EMG dataset, an initial spatial encoding step creates one hypervector per time sample. Then, a temporal encoding step is performed, whereby consecutive spatially encoded hypervectors are combined into an n-gram with XOR-based encoding and n = 5.

Once training is performed, the prototype hypervectors are programmed on the prototype PCM chip. In the experiment conducted with invHamm as the similarity metric, 2·c·d devices on the PCM prototype chip are allocated. Each device in the first half of the address range is programmed with a component of a prototype hypervector P_i, with i = 1, …, c. Devices in the second half of the array are programmed with components of the complementary prototype hypervectors. The exact programming order is determined by the partition factor (f) employed in the coarse-grained randomized partitioning scheme. For the partition factor f = 10 used in the experiment, the devices from the first address up to the (c·d/f)-th address are programmed with the content of the first partition, i.e., the first segment of each of the prototype hypervectors. The second set of addresses is programmed with the content of the second partition, and so on. As the hypervector components are binary, devices mapped to logical 1 components and devices mapped to logical 0 components are programmed to the maximum (approximately 20 μS) and minimum (approximately 0 μS) conductance levels, respectively. The devices are programmed in a single shot (no iterative program-and-verify algorithm is used) with a single RESET/SET pulse for minimum-/maximum-conductance devices.

Once the programming phase is completed, the devices are read once per query hypervector. The query hypervectors are generated using the same software HD encoder used for training. The components of the query hypervector carrying a value of 1 trigger a read (300 mV applied voltage) on the devices storing the corresponding components of the prototype hypervectors. The current values digitized via the ADC are then transferred to the host computer and summed up class-wise in software to obtain the class-wise similarity values. The class with the highest similarity is assigned as the predicted class for the given query. For experiments with dotp as the similarity metric, the devices attributed to the complementary prototype hypervectors are not read when forming the class-wise aggregate.

Experiments on the complete in-memory HDC system

For the experiments concerning the complete in-memory HDC system, training with HDC is first performed in software on the language and news datasets. 2-minterm encoding (Equation (5)) is used with n-gram sizes of n = 4 and n = 5, respectively.

After training is performed, devices are allocated on the PCM chip for storing the IM and the complementary IM, in addition to the devices allocated for the AM. The IM and complementary IM hypervectors are programmed on the PCM devices in a single shot with RESET/SET pulses for logical 0/1 components. The prototype hypervectors of the AM are programmed as described in the section "Experiments on associative memory search", with the exception that the complementary prototype hypervectors are not programmed, since dotp is used as the similarity metric.

During inference, the entire set of IM and complementary IM conductance values is read as a batch once per query. A fixed threshold is applied to the read current values such that the IM and complementary IM data are stored as binary values after sense amplification, to be reused for the 2-minterm query encoding process running in software. Once the encoding is completed using the PCM-retrieved basis hypervectors, the associative memory search is carried out as specified in the section "Experiments on associative memory search", with dotp as the similarity metric.

More details on the 2-minterm encoder

In order to generate an n-gram hypervector in n cycles, the crossbar is operated using the following procedure. During the first cycle, n-gram encoding is initiated by asserting the 'start' signal while choosing the index of the n-th symbol, s_n. This enables all the gate lines in both crossbar arrays and activates the wordline corresponding to B(s_n). The currents released onto the bitlines and passed through the sense amplifiers should ideally match the logic levels of B(s_n) in the first array and B̄(s_n) in the second array. The two 'minterm buffers' downstream of the sense amplifier arrays register the two hypervectors by the end of the first cycle. During the subsequent j-th cycles (j = 2, …, n), the gate lines are driven by the right-shifted version of the incumbent values in the minterm buffers, effectively implementing the permutation, while the row decoders are fed with symbol s_{n-j+1}; the left shift is used for the second crossbar. This ensures that the output currents on the bitlines correspond to the component-wise logical AND between the permuted minterm buffer values and the next basis hypervector (its complement for the second array). The value stored in the left-side minterm buffer at the end of the j-th cycle is given by ⋀_{i=n-j+1}^{n} ρ^{i-(n-j+1)}(B(s_i)). The product of the complementary hypervectors is stored in the right-side minterm buffer. At the end of the n-th cycle, the two minterms of Equation (5) are available in the minterm buffers. The elements in the minterm buffers are passed to the OR gate array following the minterm buffers (shown in Figure 3), such that the inputs to the array have matching indices from the two minterm vectors. At this point, the output of the OR gate array reflects the desired n-gram hypervector from the 2-minterm n-gram encoding.
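The cycle-by-cycle behaviour of the minterm buffer can be emulated in a few lines (our own sketch of the sequence described above; only the left, non-complemented buffer is shown and ρ is modeled as a circular shift):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 10000, 4
B = [rng.integers(0, 2, d) for _ in range(n)]     # basis hypervectors B(s_1)..B(s_n)

# Cycle 1 loads B(s_n); cycles j = 2..n shift the buffer (one application of the
# permutation) and AND it with the next symbol's basis hypervector, moving
# backwards from s_{n-1} to s_1.
buf = B[n - 1].copy()
for j in range(2, n + 1):
    buf = np.roll(buf, 1) & B[n - j]

# After n cycles the buffer holds the first minterm of Equation (5).
expected = np.ones(d, dtype=B[0].dtype)
for j in range(n):
    expected &= np.roll(B[j], j)
assert np.array_equal(buf, expected)
```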

After n-gram encoding, the generated n-grams are accumulated and binarized. In the hardware implementation, this step is realized inside the bundler module shown in Figure 4. The threshold applied to binarize the sum hypervector elements is given by

threshold = (ℓ − n + 1) · m / 2^n,

where ℓ is the length of the symbol sequence, n is the n-gram size, and m is the number of minterms used for the binding operation in the encoder; for the exact all-minterm encoding with m = 2^{n-1}, this reduces to the simple majority rule (ℓ − n + 1)/2.

Performance, energy estimation and comparison

In order to evaluate and benchmark the energy efficiency of the proposed architecture, a cycle-accurate register transfer level (RTL) model of a complete CMOS design that has a throughput equivalent to that of the proposed in-memory HDC system architecture is developed. A testbench infrastructure is then built to verify the correct behavior of the model. Once the behavior is verified, the RTL model is synthesized in the UMC 65 nm technology node using Synopsys Design Compiler. Due to limitations in the EDA tools used for synthesizing the CMOS-based HDC design, the dimensionality had to be limited to d = 2,000. The post-synthesis netlist is then verified using the same stimulus vectors applied during behavioral simulation. During post-synthesis netlist simulation, the design is clocked at 440 MHz to create a switching activity file in value change dump (VCD) format for inference of 100 language classification queries. The energy estimation for the CMOS modules is then performed by converting the average power values reported by Synopsys PrimeTime, which takes the netlist and the activity file from the previous steps as inputs. A typical operating condition with a voltage of 1.2 V and a temperature of 25 °C is set as the corner for the energy estimation of the CMOS system. Further energy and area results were obtained for d values of 100, 500, and 1,000, in addition to 2,000. The results were then extrapolated to derive the energy and area estimates for dimensionality d = 10,000, to allow a fair comparison with the in-memory HDC system.

The energy/area of the proposed in-memory HDC system architecture is obtained by adding the energy/area of the modules that are common with the full CMOS design described above to the energy/area of the PCM crossbars and the analog/digital peripheral circuits exclusive to the in-memory HDC architecture. Parameters based on the prototype PCM chip in 90 nm technology used in the experiments are taken as the basis for the PCM-exclusive energy/area estimation. The parameters of the sense amplifiers, which are not present in the PCM hardware platform but are present in the proposed in-memory HD encoder, are taken from the 65 nm current-latched sense amplifier presented by Chandoke et al. sense_amp . The parameters used for the PCM-exclusive energy estimation are shown in Table 3.

Common parameters
Parameter                          Value
Read voltage                       0.1 V
Current on conducting devices      1 μA
Unit device area                   0.2 μm^2

Module-specific parameters
Parameter                          Encoder      AM
Readout time                       2.8 ns       100 ns
Active devices per query           145,000      66,000
Energy per sense amplifier read    9.8 fJ       -
Energy per ADC read                -            12 pJ
Table 3: Parameters for PCM-exclusive energy estimation
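As a sanity check on how these parameters combine, the PCM-exclusive AM-search energy in Table 1 can be reproduced approximately from Table 3 (a rough back-of-the-envelope estimate; the 220 A/D conversions per query are our assumption of 22 classes times 10 partitions for the language task and are not listed in Table 3):

```python
# Rough estimate of the PCM-exclusive AM-search energy per query (language task).
V_READ = 0.1            # V, read voltage (Table 3)
I_DEV = 1e-6            # A, current on a conducting device (Table 3)
T_READ = 100e-9         # s, AM readout time (Table 3)
N_DEV = 66_000          # active devices per query (Table 3)
E_ADC = 12e-12          # J, energy per ADC read (Table 3)
N_ADC = 22 * 10         # assumed: classes x partition factor (not in Table 3)

e_devices = N_DEV * V_READ * I_DEV * T_READ      # ~0.66 nJ of device read energy
e_adc = N_ADC * E_ADC                            # ~2.64 nJ of ADC energy
print((e_devices + e_adc) * 1e9, "nJ")           # ~3.3 nJ, matching Table 1
```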

References

  • (1) Kanerva, P. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1, 139–159 (2009).
  • (2) Kanerva, P. Sparse Distributed Memory (The MIT Press, Cambridge, MA, USA, 1988).
  • (3) LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).
  • (4) Rahimi, A. et al. High-dimensional computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular Papers 64, 2508–2521 (2017).
  • (5) Rahimi, A., Kanerva, P., Benini, L. & Rabaey, J. M. Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals. Proceedings of the IEEE 107, 123–143 (2019).
  • (6) Rahimi, A., Kanerva, P. & Rabaey, J. M. A robust and energy-efficient classifier using brain-inspired hyperdimensional computing. In Proceedings of the International Symposium on Low Power Electronics and Design, ISLPED ’16, 64–69 (ACM, New York, NY, USA, 2016).
  • (7) Li, H. et al. Hyperdimensional computing with 3D VRRAM in-memory kernels: Device-architecture co-design for energy-efficient, error-resilient language recognition. In IEEE International Electron Devices Meeting (IEDM) (2016).
  • (8) Wu, T. F. et al. Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: Hyperdimensional computing case study. In International Solid State Circuits Conference (ISSCC), 492–494 (2018).
  • (9) Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nature nanotechnology 8, 13 (2013).
  • (10) Sebastian, A. et al. Temporal correlation detection using computational phase-change memory. Nature Communications 8, 1115 (2017).
  • (11) Zidan, M. A., Strachan, J. P. & Lu, W. D. The future of electronics based on memristive systems. Nature Electronics 1, 22 (2018).
  • (12) Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nature Electronics 1, 333 (2018).
  • (13) Li, H., Wu, T. F., Mitra, S. & Wong, H. S. P. Device-architecture co-design for hyperdimensional computing with 3D vertical resistive switching random access memory (3D VRRAM). In International Symposium on VLSI Technology, Systems and Application (VLSI-TSA), 1–2 (2017).
  • (14) Kanerva, P. Binary spatter-coding of ordered K-tuples. In ICANN'96, Proceedings of the International Conference on Artificial Neural Networks, vol. 1112 of Lecture Notes in Computer Science, 869–873 (Springer, 1996).
  • (15) Chua, L. Resistance switching memories are memristors. Applied Physics A 102, 765–783 (2011).
  • (16) Wong, H.-S. P. & Salahuddin, S. Memory leads the way to better computing. Nature nanotechnology 10, 191 (2015).
  • (17) Borghetti, J. et al. Memristive switches enable stateful logic operations via material implication. Nature 464, 873 (2010).
  • (18) Kvatinsky, S. et al. MAGIC: Memristor-aided logic. IEEE Transactions on Circuits and Systems II: Express Briefs 61, 895–899 (2014).
  • (19) Wong, H.-S. P. et al. Phase change memory. Proceedings of the IEEE 98, 2201–2227 (2010).
  • (20) Burr, G. W. et al. Recent progress in phase-change memory technology. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 146–162 (2016).
  • (21) Kuzum, D., Jeyasingh, R. G., Lee, B. & Wong, H.-S. P. Nanoelectronic programmable synapses based on phase change materials for brain-inspired computing. Nano Letters 12, 2179–2186 (2011).
  • (22) Tuma, T., Pantazi, A., Le Gallo, M., Sebastian, A. & Eleftheriou, E. Stochastic phase-change neurons. Nature Nanotechnology 11, 693 (2016).
  • (23) Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nature communications 9, 2514 (2018).
  • (24) Sebastian, A. et al. Tutorial: Brain-inspired computing using phase-change memory devices. Journal of Applied Physics 124, 111101 (2018).
  • (25) Wright, C. D., Hosseini, P. & Diosdado, J. A. V. Beyond von-Neumann computing with nanoscale phase-change memory devices. Advanced Functional Materials 23, 2248–2254 (2013).
  • (26) Le Gallo, M. et al. Mixed-precision in-memory computing. Nature Electronics 1, 246 (2018).
  • (27) Xiong, F., Liao, A. D., Estrada, D. & Pop, E. Low-power switching of phase-change materials with carbon nanotube electrodes. Science 332, 568–570 (2011).
  • (28) Waser, R. & Aono, M. Nanoionics-based resistive switching memories. In Nanoscience And Technology: A Collection of Reviews from Nature Journals, 158–165 (World Scientific, 2010).
  • (29) Kent, A. D. & Worledge, D. C. A new spin on magnetic memories. Nature nanotechnology 10, 187 (2015).
  • (30) Close, G. et al. Device, circuit and system-level analysis of noise in multi-bit phase-change memory. In IEEE International Electron Devices Meeting (IEDM), 29–5 (2010).
  • (31) Breitwisch, M. et al. Novel lithography-independent pore phase change memory. In IEEE Symposium on VLSI Technology, 100–101 (IEEE, 2007).
  • (32) Quasthoff, U., Richter, M. & Biemann, C. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06) (European Language Resources Association (ELRA), 2006). URL http://www.aclweb.org/anthology/L06-1396.
  • (33) Koehn, P. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, 79–86. AAMT (AAMT, Phuket, Thailand, 2005). URL http://mt-archive.info/MTS-2005-Koehn.pdf.
  • (34) Mimaroglu, D. S. Some Text Datasets. https://www.cs.umb.edu/~smimarog/textmining/datasets/ (2018). [Online; accessed 9-March-2018].
  • (35) Najafabadi, F. R., Rahimi, A., Kanerva, P. & Rabaey, J. M. Hyperdimensional computing for text classification. Design, Automation, and Test in Europe Conference (DATE) (March 2016).
  • (36) Montagna, F., Rahimi, A., Benatti, S., Rossi, D. & Benini, L. PULP-HD: Accelerating brain-inspired high-dimensional computing on a parallel ultra-low power platform. In Proceedings of the 55th Annual Design Automation Conference, DAC ’18, 111:1–111:6 (ACM, New York, NY, USA, 2018).
  • (37) Chandoke, N., Chitkara, N. & Grover, A. Comparative analysis of sense amplifiers for SRAM in 65 nm CMOS technology. In 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 1–7 (2015).