1 Introduction
With the emergence of Internet-of-Things devices, speaker recognition at the edge is desirable because it can enable smart environments, cyber-physical security, and robotic control. However, speaker recognition is now done mostly in the cloud because small devices have constrained resources and battery capacity and cannot run complex models. Adapting to new speakers calls for lightweight and efficient algorithms suitable for on-device and online learning.
Developments in deep neural networks have led to end-to-end speaker recognition systems that achieve high accuracy on noisy and uncontrolled speech data [nagrani2017voxceleb], [chung2018voxceleb2], [kim2021adaptive]. Although neural networks can deal with noisy real-world data, they are expensive to train due to iterative backpropagation using gradient descent. Moreover, the whole network may need to be retrained when adding new speakers. Non-neural-network approaches include the traditional Gaussian Mixture Model–Support Vector Machine [campbell2006support], and the more recent Joint Factor Analysis [kenny2005joint] and i-vector approaches [dehak2009support]. However, training these models usually requires running the Expectation Maximization (EM) algorithm iteratively, which can also be computationally intensive for large datasets.
We propose a speaker-recognition approach based on computing with high-dimensional (HD) vectors, also called “hyperdimensional” computing [kanerva2009hyperdimensional]. By mapping data to nearly orthogonal vectors in a high-dimensional space and computing with simple yet powerful operations on vectors, the intrinsic structure of the data can be revealed in a manner that is effective for classification [wong2018negative, OsipovHyperSeed2021]. HD computing has provided an efficient way to analyze various types of data [kleyko2021survey] and to achieve fast, online, and incremental learning in dealing with text [joshi2016language, AlonsoHyperEmbed2020], multimodal biosignals [rahimi2018efficient, moin2021wearable, zhou2021memory, ge2021seizure], spoken letters [imani2017voicehd], and other data [KleykoSurveyVSA2021Part2]. This work extends the application of HD computing to speaker recognition.
We start by describing the idea of computing with hypervectors and the operations that are used to encode speech (Section 2). The proposed speech encoder aims to capture the pronunciation variations between speakers in a speaker profile hypervector (Section 3). The profile is computed in three steps. First, the formants in a time slice are encoded into a hypervector, to capture the variation of the signal over frequencies. Then hypervectors for consecutive time slices are encoded into an n-gram hypervector, to capture the variation of the signal over time. Finally, the n-gram hypervectors of a speech sample are added together to form a profile hypervector that summarizes the course of the power spectrum over time. Ways to improve this basic algorithm using Generalized Learning Vector Quantization (GLVQ) are also explored.
A metric is introduced to evaluate the efficiency of speaker identification systems in terms of the training energy per bit of information gain. This work achieves highly competitive energy efficiency due to its small number of active parameters during operation and its one-shot learning.
2 Computing with Hypervectors
HD computing originates from Holographic Reduced Representations in the early 1990s [plate1994distributed, plate2003HRRbook]. For speech processing, it provides a formulaic way to encode the frequency and temporal structure of a spectrogram into a fixed-dimensional vector.
“Hypervectors” refer to high-dimensional (on the order of 1,000 dimensions) seed vectors and to vectors made from them with three operations: addition ($+$), multiplication ($\ast$), and permutation ($\rho$). The seed vectors are chosen at random to represent basic entities; they are like atoms from which everything else is built. Here they represent differences of spectral power in adjacent frequency bins of a time slice. We use random bipolar vectors (of $+1$s and $-1$s) as seeds. Addition and multiplication happen coordinate-wise, and permutations reorder (shuffle) hypervector coordinates. The similarity of vectors is measured with the cosine, which equals 0 when the vectors are orthogonal; in a high-dimensional space nearly all pairs of vectors are approximately orthogonal. Computing with vectors is like traditional computing with numbers, except that addition and multiplication now operate on vectors, and no arithmetic operation corresponds to the permutation of coordinates.
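The three operations and the cosine measure can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; the function names, the fixed random seed, and the use of a single shared permutation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
DIM = 1024                       # hypervector dimension used later in the paper

def random_seed_vector(dim=DIM):
    """Random bipolar seed vector of +1s and -1s."""
    return rng.choice(np.array([-1, 1]), size=dim)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = random_seed_vector(), random_seed_vector()
perm = rng.permutation(DIM)      # one fixed random reordering of coordinates

bundled = a + b                  # addition: the sum stays similar to its inputs
bound = a * b                    # multiplication: the product resembles neither input
permuted = a[perm]               # permutation: same coordinates, shuffled order

print(cosine(a, b))              # near 0: independent seeds are nearly orthogonal
print(cosine(bundled, a))        # clearly positive
print(cosine(bound, a))          # near 0
print(cosine(bound * b, a))      # multiplying by b again recovers a
```

Because every coordinate of a bipolar vector squares to 1, multiplication is its own inverse, which is why the last line recovers `a`.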
Unlike most machine-learning methods that require iterative training, HD computing offers one-shot and online learning. Learning happens in a single pass over samples from known speakers. The pass produces a fixed-dimensional profile hypervector (a class prototype) for each speaker. Profiles for test samples are made with the same algorithm and are identified with the most similar speaker profiles. Therefore, the model need not be retrained when speakers are added.
3 Encoding Speech
Learning the statistics of a signal over time is particularly natural and efficient with hypervectors. The proposed encoder aims to capture the unique pronunciation variation between speakers, similar to phone n-gram-based modeling [kohler2001phonetic]. In this approach, a profile hypervector is designed to learn the unique course of the formants over time for each speaker.
In a spectrogram, typically up to 4 formants stand out at any moment of time. Taking a spectrum a slice at a time, for example the upper left plot in Figure 1, a formant can be identified by rising power to the left of it and falling power to the right. Therefore, formants can be located by comparing the power in adjacent bins. A simple local binary pattern (LBP) encoding [burrello2019hyperdimensional] is used here to encode the locations of the formants in a time slice. From the first bin to the second bin, the LBP encoder looks at whether the power increases or remains the same, or decreases, and reports 1 or 0; the same comparison is made for every subsequent pair of adjacent bins. In the VoxCeleb dataset, audio is sampled at 16 kHz, and so the first 40 bins of a power spectrum computed over a 5-ms window cover frequencies from 0 to 8,000 Hz. The power in 40 bins gives rise to 39 differences between neighboring bins, resulting in a 39-bit LBP.
The hypervector $S_t$ for the spectrum at time $t$ summarizes the output of the LBP encoder. It is made from bipolar seed vectors that represent the 0s and the 1s of the LBP. There are a total of 78 seed vectors to choose from, corresponding to power going up or down at each bin. For example, $U_i$ or $D_i$ represents the power in the $(i+1)$th bin being greater or less than that in the $i$th bin. Then, according to the output of the LBP encoder, 39 seed vectors are selected and added together. Finally, the resulting sum vector is transformed to a bipolar vector of $+1$s and $-1$s by thresholding it at zero:
$$S_t = \operatorname{sign}\left(\sum_{i=1}^{39} B_i\right), \qquad B_i \in \{U_i, D_i\}, \tag{1}$$
where
$$\operatorname{sign}(x) = \begin{cases} +1, & x > 0 \\ -1, & x < 0 \end{cases} \tag{2}$$
is applied coordinate-wise (a sum of 39 bipolar values is odd and therefore never zero). Only looking at whether the power is increasing or decreasing has the advantage of not being affected by the power characteristics and the loudness of speech. It is important to note that this encoding yields similar hypervectors for similarly located formants.
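The LBP step and the thresholded sum can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the array names `UP` and `DOWN`, the toy single-peak spectra, and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, N_BINS = 1024, 40
N_DIFFS = N_BINS - 1                     # 39 comparisons between neighboring bins

# 78 bipolar seed vectors: row i of UP / DOWN stands for "power rises or stays
# the same" / "power falls" from bin i to bin i + 1 (names are illustrative).
UP = rng.choice(np.array([-1, 1]), size=(N_DIFFS, DIM))
DOWN = rng.choice(np.array([-1, 1]), size=(N_DIFFS, DIM))

def lbp(power):
    """39-bit local binary pattern: 1 if the power does not decrease, else 0."""
    return (np.diff(power) >= 0).astype(int)

def encode_spectrum(power):
    """Select one seed per comparison, add the 39 seeds, threshold at zero."""
    bits = lbp(power)
    selected = np.where(bits[:, None] == 1, UP, DOWN)   # shape (39, DIM)
    total = selected.sum(axis=0)                        # odd count, so never 0
    return np.sign(total).astype(int)

def peak(center):
    """Toy spectrum with a single formant-like peak at bin `center`."""
    return np.exp(-0.5 * ((np.arange(N_BINS) - center) / 3.0) ** 2)

h10, h11, h30 = (encode_spectrum(peak(c)) for c in (10, 11, 30))
sim_near = h10 @ h11 / DIM               # peaks one bin apart
sim_far = h10 @ h30 / DIM                # peaks twenty bins apart
print(sim_near, sim_far)
```

In this toy setting the two spectra whose peaks are one bin apart share 38 of their 39 selected seeds, so their hypervectors stay close, while the spectrum with a distant peak shares far fewer.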
Hypervectors for individual spectra are combined into spectra over time by encoding them in n-grams and adding them into a profile hypervector. Empirically, we found that trigrams and tetragrams work best. Sampling spectrum slices at 20-ms intervals means that these n-grams represent 60–80 ms of speech, or approximately the length of a phoneme.
A trigram vector is made by permuting the vector for the first spectrum twice, permuting the vector for the second spectrum once, taking the third as is, and multiplying the three vectors coordinate-wise, i.e.:
$$T_t = \rho^2(S_{t-2}) \ast \rho(S_{t-1}) \ast S_t, \tag{3}$$
where $S_t$ is the hypervector for the spectrum at time $t$, $T_t$ is the trigram vector ending at time $t$, $\rho$ denotes permuting the vector once, and $\rho^2$ denotes permuting the vector twice. Permutation was implemented by randomly shuffling the indices of the coordinates of the vector.
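A minimal sketch of the trigram construction is given below. The helper names `permute` and `trigram` and the random stand-in vectors are assumptions; the only elements taken from the text are the fixed random shuffle, the double and single permutations, and the coordinate-wise product.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 1024
PERM = rng.permutation(DIM)              # fixed random shuffle of coordinate indices

def permute(v, times=1):
    """Apply the same fixed permutation `times` times."""
    for _ in range(times):
        v = v[PERM]
    return v

def trigram(s_first, s_second, s_third):
    """Permute the first vector twice, the second once, and multiply all three."""
    return permute(s_first, 2) * permute(s_second, 1) * s_third

# Random bipolar stand-ins for three consecutive spectrum hypervectors.
s = rng.choice(np.array([-1, 1]), size=(3, DIM))
abc = trigram(s[0], s[1], s[2])
cba = trigram(s[2], s[1], s[0])          # same slices in reversed order
print(abc @ cba / DIM)                   # near 0: the order of the slices matters
```

The permutations are what make the encoding order-sensitive: without them the coordinate-wise product would be identical for any ordering of the same three slices.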
Finally, the profile hypervector is formed by summing the trigram vectors over time within an utterance, or across multiple utterances. In the VoxCeleb dataset, utterances have been collected for each speaker in different contexts with different recording quality and background noise. As will become clear in Section 4.5, two kinds of profile hypervectors were generated for each speaker: (1) one for each context/video/subfolder and (2) one across all the speaker’s contexts. For speaker $k$, the context profile hypervector for context $c$ is the sum of all trigrams from all utterances in that context, i.e., $P_{k,c} = \sum_{t \in c} T_t$ with $T_t$ denoting a trigram vector, and the final profile hypervector for speaker $k$ is $P_k = \sum_{c} P_{k,c}$.
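The single-pass accumulation of context and speaker profiles might look like the sketch below. The `(speaker, context, vector)` tuple layout, the function name `build_profiles`, and the random stand-in trigram vectors are assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict

DIM = 1024

def build_profiles(trigram_stream):
    """Accumulate context profiles and speaker profiles in one pass.

    `trigram_stream` yields (speaker, context, trigram_vector) triples; this
    layout is an assumption of the sketch, not a format from the paper.
    """
    context_profiles = defaultdict(lambda: np.zeros(DIM))
    speaker_profiles = defaultdict(lambda: np.zeros(DIM))
    for speaker, context, vec in trigram_stream:
        context_profiles[(speaker, context)] += vec   # sum within one context
        speaker_profiles[speaker] += vec              # sum across all contexts
    return dict(context_profiles), dict(speaker_profiles)

# Toy stream: two speakers, two contexts each, five random trigram vectors per context.
rng = np.random.default_rng(3)
stream = [(spk, ctx, rng.choice(np.array([-1, 1]), size=DIM))
          for spk in ("A", "B") for ctx in ("video1", "video2") for _ in range(5)]
contexts, speakers = build_profiles(stream)
```

Adding a new speaker only adds new entries to the two dictionaries; existing profiles are left untouched, which is the sense in which no retraining is needed.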
4 Experiments
4.1 Experimental Setup
We used the VoxCeleb dataset [nagrani2017voxceleb] to develop and test the algorithm. It consists of speech from 1,251 speakers collected from YouTube videos. Each speaker has a folder, divided into a number of subfolders that contain the audio files. The subfolders come from different videos and are referred to as different “contexts.” The speech files in the subfolders are called “utterances.” Following the same procedure as in [nagrani2017voxceleb], we reserved for testing the one subfolder/context that had the fewest utterances but at least five. Each speaker is trained with 904 seconds of speech and tested with 64 seconds, on average. Top-1, Top-5, and Top-10 accuracies were calculated.
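Top-k identification against stored speaker profiles can be sketched as below, assuming cosine similarity as the matching measure (as described in Section 2). The function names and the synthetic Gaussian profiles are illustrative assumptions, not data from the paper.

```python
import numpy as np

def top_k_speakers(test_profile, speaker_profiles, k):
    """Indices of the k speaker profiles most similar (by cosine) to the test profile."""
    norms = np.linalg.norm(speaker_profiles, axis=1) * np.linalg.norm(test_profile)
    sims = speaker_profiles @ test_profile / norms
    return np.argsort(-sims)[:k]

def top_k_accuracy(test_profiles, true_ids, speaker_profiles, k):
    """Fraction of test profiles whose true speaker is among the k best matches."""
    hits = [true_ids[i] in top_k_speakers(p, speaker_profiles, k)
            for i, p in enumerate(test_profiles)]
    return float(np.mean(hits))

# Synthetic check: each test profile is a noisy copy of one speaker profile.
rng = np.random.default_rng(4)
speaker_profiles = rng.normal(size=(20, 1024))
test_profiles = speaker_profiles + rng.normal(size=(20, 1024))
true_ids = np.arange(20)
acc1 = top_k_accuracy(test_profiles, true_ids, speaker_profiles, k=1)
acc5 = top_k_accuracy(test_profiles, true_ids, speaker_profiles, k=5)
print(acc1, acc5)
```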
4.2 Input Features
Spectrograms are computed using a 5-ms Hann window and a step size of 20 ms. A short window (5 ms vs. the commonly used 25 ms) simplifies the LBP encoding of the formants by smoothing over the multiples of the fundamental frequency and its overtones. Spectrum slices are sampled at 20-ms intervals such that a trigram or a tetragram represents 60–80 ms of speech, approximately the length of a phoneme.
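A possible way to compute such a spectrogram with NumPy is sketched below. The framing details (no padding, symmetric `np.hanning` window, keeping the first 40 of the 41 FFT bins) are assumptions; only the 16-kHz sampling rate, the 5-ms window, and the 20-ms step come from the text.

```python
import numpy as np

SAMPLE_RATE = 16_000                     # Hz, as in VoxCeleb
WIN = SAMPLE_RATE * 5 // 1000            # 5-ms window  -> 80 samples
HOP = SAMPLE_RATE * 20 // 1000           # 20-ms step   -> 320 samples
N_BINS = 40                              # bins kept per time slice

def power_spectrogram(signal):
    """Power spectrum of Hann-windowed 5-ms frames taken every 20 ms."""
    window = np.hanning(WIN)
    starts = range(0, len(signal) - WIN + 1, HOP)
    frames = np.stack([signal[s:s + WIN] * window for s in starts])
    spectrum = np.fft.rfft(frames, axis=1)          # 41 bins spaced 200 Hz apart
    return (np.abs(spectrum) ** 2)[:, :N_BINS]

# One second of a 1-kHz tone: the power should peak in bin 1000 / 200 = 5.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = power_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape, spec.argmax(axis=1)[:3])
```

With an 80-sample window the frequency resolution is 200 Hz per bin, which is coarse enough to smooth over individual harmonics of the fundamental frequency, as the paragraph above describes.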
4.3 Encoder Design
Before training and testing on the entire dataset, a few parameters of the encoder need to be determined, such as the dimension of the hypervectors, which n-gram to use, and the number of frequency bins to encode. The dimension of the hypervectors was chosen to be 1,024. To determine which n-gram to use, unigrams through pentagrams were used in the encoder to train and test on 40 speakers’ data, which is sufficient to indicate the results for the entire dataset. The results suggested that tetragrams and trigrams are comparable and perform better than the others. Therefore, trigrams were chosen for the encoder. Similarly, different numbers of frequency bins (26, 32, and 40) were encoded on 40 speakers’ data, and the results suggested that encoding the full number of bins (40) performs best. For the proposed approach, training and testing on 40 speakers’ data take roughly 5 minutes on an Apple M1 processor, so one can quickly test design parameters.
4.3.1 Weighting
Voice activity detection is usually necessary so that desirable features are extracted only from speech segments. The algorithm so far has treated a period of silence the same as speech. To counter the lack of information in silence or a weak signal, we use the total energy in the spectrum to weight its hypervector before including it in the n-gram. The total energy of a spectrum slice at time $t$ is $E_t = \sum_{i=1}^{40} p_{t,i}$, where $p_{t,i}$ is the power in the $i$th bin. Thus the weighted trigram at time $t$ is
$$T_t = E_{t-2}^{\alpha}\,\rho^2(S_{t-2}) \ast E_{t-1}^{\alpha}\,\rho(S_{t-1}) \ast E_t^{\alpha}\,S_t, \tag{4}$$
where $S_t$ is the hypervector for the spectrum at time $t$, $\rho$ denotes permutation, and the exponent $\alpha$ was determined empirically as described in Section 4.3. We found that weighting with the 0.3 power of energy ($\alpha = 0.3$) works well with trigrams. Figures 1(a) and 1(b) show the correlation matrices of the first 20 speakers’ profile hypervectors before and after applying the weights. It can be seen that the correlations drop after applying the weighting.
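The effect of the energy weighting can be sketched as follows, assuming that the total energy of a slice is the sum of its bin powers and that each of the three spectrum hypervectors in a trigram is scaled by its own slice weight. Since the weights are scalars, the weighted trigram is then the unweighted trigram times the product of the three weights. Function names and the toy spectrogram are hypothetical.

```python
import numpy as np

ALPHA = 0.3                              # exponent reported to work well with trigrams

def slice_weights(power_spectrogram, alpha=ALPHA):
    """Weight of each time slice: total energy over its bins, raised to alpha."""
    energy = power_spectrogram.sum(axis=1)
    return energy ** alpha

def weighted_trigram_scale(weights, t):
    """Scalar multiplying the unweighted trigram that ends at slice t (t >= 2)."""
    return weights[t - 2] * weights[t - 1] * weights[t]

# Toy spectrogram: two silent slices followed by three slices of unit power per bin.
toy = np.vstack([np.zeros((2, 40)), np.ones((3, 40))])
w = slice_weights(toy)
print(weighted_trigram_scale(w, 2))      # includes silent slices, so it contributes nothing
print(weighted_trigram_scale(w, 4))      # three active slices
```

A trigram that overlaps silence is suppressed entirely in this sketch, which mimics what voice activity detection would achieve without a separate detection step.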
4.3.2 Normalized Weights
Performance may be further improved by normalizing the weights to discount large variations in the power of speech segments across different contexts (due to the lack of control over recording conditions in the VoxCeleb dataset). As shown in Figure 3, the average power per frequency bin over an utterance for the same speaker can vary over 15 dB. To avoid favoring contexts with louder voices or even noise, the weight for the hypervector of each spectrum slice is scaled by the ratio between the desired maximum bin power and a speaker’s maximum bin power averaged over the utterance from which the hypervector at time $t$ is computed:
$$\tilde{w}_t = w_t \cdot \frac{P_{\text{target}}}{P_{\max}}, \tag{5}$$
where $w_t$ is the unnormalized weight of the slice at time $t$, and $P_{\text{target}}$ is the targeted maximum bin power to which all utterances are adjusted; it can be an arbitrary constant. We set it to the average of the largest bin power over the first 40 speakers. $P_{\max}$ is the maximum bin power of the utterance being considered and is computed during training. This essentially makes all utterances’ maximum bin power equal, so the weights no longer favor contexts with higher spectral power. Applying this normalizing ratio leads to an increase of 4.3% and 7.3% in the Top-1 and Top-5 accuracies over the weighted case.
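Table 1: Identification accuracies on the entire dataset for the methods described above.
| Method | Weighting applied to | Further processing | Top-1 | Top-5 | Top-10 |
|---|---|---|---|---|---|
| Baseline | none | none | | | |
| Weighting | | none | | | |
| Normalized Weighting | | none | | | |
| Refinement of profile hypervectors | | GLVQ | | | |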
Table 2: Comparison of this work with state-of-the-art non-neural-network and neural-network approaches.
| Method | Top-1 | Top-5 | Active parameters during training^{1} | Stored parameters | Mutual info. | Training method (time) | Classification speed |
|---|---|---|---|---|---|---|---|
| i-vectors/SVM [nagrani2017voxceleb] | | | not reported | 1M^{2} | 4.04 bits | iterative EM+SGD^{3} | not reported |
| i-vectors/PLDA/SVM [nagrani2017voxceleb] | | | not reported | 0.5M^{4} | 5.29 bits | iterative EM+SGD | not reported |
| CNN [nagrani2017voxceleb] | 0.805 | 0.921 | 67M | 67M | 7.57 bits | iterative SGD | not reported |
| ACNN [kim2021adaptive] | 0.855 | 0.953 | 4.69M | 4.69M | 8.20 bits | iterative SGD | not reported |
| This work | | | HD: 1.02k; GLVQ: 2.05k | HD: 1.28M; GLVQ: 1.28M | 3.93 bits | HD: one-shot (128 min); GLVQ: SGD (92 sec) | 5.7 ms per 1-sec test sample |
^{1} Number of parameters updated for a single pass of one data sample.
^{2} Assuming two 400-dimensional vectors [nagrani2017voxceleb] were stored for each speaker’s SVM.
^{3} Generally, the i-vector extractor is trained with the EM algorithm, and the SVM is trained with the SGD algorithm.
^{4} Assuming two 200-dimensional vectors [nagrani2017voxceleb] were stored for each speaker’s SVM.
SVM: Support Vector Machine; PLDA: Probabilistic Linear Discriminant Analysis; CNN: Convolutional Neural Network; ACNN: Adaptive CNN; EM: Expectation Maximization; SGD: Stochastic Gradient Descent.
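The “Mutual Info.” column can be reproduced from Top-1 accuracy under the symmetric-error assumption stated in Section 4.5: all 1,251 speakers are equally likely, each is identified with the same probability, and errors are spread evenly over the other 1,250 speakers. The sketch below is one way to write that estimate; the function name is hypothetical, and the closed form is a reconstruction that matches the tabulated CNN and ACNN values.

```python
import math

def information_gain_bits(top1, n_speakers=1251):
    """Mutual information (bits) between true and predicted speaker id.

    Assumes uniform speakers, a common Top-1 accuracy `top1`, and errors
    spread uniformly over the remaining n_speakers - 1 speakers.
    """
    bits = math.log2(n_speakers)                    # H(Y) for a uniform output
    if top1 > 0:
        bits += top1 * math.log2(top1)
    if top1 < 1:
        bits += (1 - top1) * math.log2((1 - top1) / (n_speakers - 1))
    return bits

print(round(information_gain_bits(0.805), 2))       # CNN row:  7.57
print(round(information_gain_bits(0.855), 2))       # ACNN row: 8.2
```

At chance level (Top-1 accuracy of 1/1251) the estimate is zero bits, and at perfect accuracy it reaches $\log_2 1251 \approx 10.29$ bits.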
4.4 Refinement of profile hypervectors with Learning Vector Quantization
The profile hypervector that sums a speaker’s context hypervectors corresponds to centroid-based classification, which is commonly used in speech and signal processing (e.g., [RasanenMultivariate2015, KleykoTradeoffs2018, GeClassificationReview2020]) due to its simplicity, although it does not guarantee the most accurate classification [RosatoHDDistributed2021, KarlgrenSemantics2021]. Of the various classifiers that can take hypervectors as input (e.g., [RachkovskijClassifiers2007, RahimiBiosignal2016, imani2017voicehd, KleykoDensityEncoding2020, DiaoGLVQHD2021]), we used Generalized Learning Vector Quantization (GLVQ) [SatoGLVQ1995], since it is natural to use context hypervectors to initialize prototypes and then refine them iteratively [DiaoGLVQHD2021]. In each iteration, the classifier uses one misclassified context hypervector from one speaker to update the speaker’s profile vector (“prototype”) as well as the profile vector of the nearest (the most similar) speaker. In this manner, the classification accuracy improves after each iteration.
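A single prototype update of this kind can be sketched as below. The sketch uses the standard GLVQ gradient step with squared Euclidean distance and an identity cost function; the distance measure, the cost function, the learning rate, and the function name are assumptions and may differ from what was used in the experiments.

```python
import numpy as np

def glvq_update(x, w_correct, w_wrong, lr=50.0):
    """One GLVQ step for a sample x of the speaker owning w_correct.

    Uses squared Euclidean distances d1 (to the correct prototype) and d2
    (to the nearest wrong prototype) and descends the relative distance
    difference (d1 - d2) / (d1 + d2). Assumes d1 + d2 > 0.
    """
    d1 = np.sum((x - w_correct) ** 2)
    d2 = np.sum((x - w_wrong) ** 2)
    denom = (d1 + d2) ** 2
    new_correct = w_correct + lr * 4 * d2 / denom * (x - w_correct)  # pull toward x
    new_wrong = w_wrong - lr * 4 * d1 / denom * (x - w_wrong)        # push away from x
    return new_correct, new_wrong

# Toy case: x belongs to the "true" speaker but lies closer to another prototype.
rng = np.random.default_rng(5)
w_true = rng.normal(size=1024)
w_other = rng.normal(size=1024)
x = w_other + 0.5 * rng.normal(size=1024)
new_true, new_other = glvq_update(x, w_true, w_other)

dist = lambda a, b: float(np.sum((a - b) ** 2))
print(dist(x, w_true), dist(x, new_true))     # distance to the correct prototype shrinks
print(dist(x, w_other), dist(x, new_other))   # distance to the wrong prototype grows
```

Only the two prototypes involved in the misclassification change, which is consistent with the small number of active parameters reported for GLVQ in Table 2.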
4.5 Results
Table 1 summarizes the identification accuracies on the entire dataset using the methods described above. The baseline is based on a very simple encoder, and each reasonable feature-engineering step further improves the score. Most importantly, training and testing on the entire dataset require only 1.02k active parameters and take only 135.6 minutes (training: 128 minutes; testing: 7.6 minutes) on a regular CPU-based Linux machine (Intel Xeon® CPU @ 2.40GHz) with a maximum usage of 5 cores during the run. The classification speed is 5.7 ms per 1 second of speech.
The GLVQ classifier with a single prototype per speaker is used to obtain refined profile vectors for speakers from the context hypervectors. After every epoch (i.e., a pass over all training samples) the classification accuracies were evaluated. Figure 4 shows the results for Top-1, Top-5, and Top-10 test accuracies. Epoch 0 corresponds to the accuracy of the centroid-based classification. Top-1 accuracy increased within just two epochs, and the accuracies started to saturate after further epochs. Running GLVQ for 30 epochs took only 92 seconds on a regular CPU-based laptop.
Table 2 compares this work to state-of-the-art non-neural-network and neural-network approaches. Although identification accuracy has been the only metric reported for most methods, the cost that comes with it is also an important factor that cannot be ignored. Therefore, we propose a metric to evaluate the efficiency of a design: the energy to train a model per bit of information gain, i.e.:
$$\eta = \frac{E_{\text{train}}}{I(X;Y)}, \tag{6}$$
where $E_{\text{train}}$ is the energy consumed over the training time $T$, and $I(X;Y)$ is the information gain (in bits) from the speaker identification system.
The information gain $I(X;Y)$ can be estimated from the Top-1 accuracy $p$ as $I(X;Y) = H(Y) - H(Y|X) = \log_2 1251 + p \log_2 p + (1-p)\log_2\frac{1-p}{1250}$, where $X$ and $Y$ are the input speaker id and output speaker id of the identification system, assuming every speaker is equally likely to appear at the input and has the same probability $p$ of being identified at the output. If not identified, a speaker is misclassified as any one of the remaining 1250 speakers with equal likelihood.
Although neural networks achieve about 4 more bits of information gain (roughly twice that of this work), the energy to train the network is much more than 2 times larger, as they use more than 4500 times as many active parameters and train iteratively over an unspecified number of epochs. The i-vector approaches also require iterative training for i-vector extraction and per-speaker SVM training. Due to the limited information reported in other works, we are not able to quantify their efficiency. Considering the relatively few active parameters used during training and the one-shot learning algorithm, we believe that our approach leads to highly energy-efficient systems for speech processing.
5 Discussion
In this work, we have studied the application of a new computing paradigm to the encoding of speech. With a simple encoding scheme and reasonable feature engineering, it has achieved highly competitive efficiency for its information gain. The results obtained so far are based solely on one acoustic feature (formants) and its course over a short time. There are many more acoustic features yet to be considered, such as the pitch and cepstral coefficients. HD computing is especially suited for encoding a combination of features and producing a fixed-dimensional representation for them [KarlgrenSemantics2021]. Therefore, its identification accuracy is expected to keep improving when combined with other acoustic features, with a modest increase in computing time and memory use. This work can help establish simpler, more energy-efficient machine learning for speech processing.
6 Acknowledgements
PCH was supported by NSF ECCS-2147640. PCH, DK, JMP, BAO, and PK were supported in part by DARPA’s AIE (HyDDENN Project). DK, BAO, and PK were also supported in part by AFOSR FA9550-19-1-0241. DK was supported by the EU’s MSCA Fellowship (839179).