
Computing with Hypervectors for Efficient Speaker Identification

We introduce a method for identifying speakers by computing with high-dimensional random vectors. Its strengths are simplicity and speed. With only 1.02k active parameters and a 128-minute pass through the training data, we achieve Top-1 and Top-5 scores of 31. By contrast, CNN models require several million parameters and orders of magnitude more computation for only a 2× gain in discriminative power as measured in mutual information. An additional 92 seconds of training with Generalized Learning Vector Quantization (GLVQ) raises the scores to 48, with classification taking 5.7 ms per 1-second test sample. All processing was done on standard CPU-based machines.


1 Introduction

With the emergence of Internet-of-Things devices, speaker recognition at the edge is desirable: it can enable smart environments, cyber-physical security, and robotic control. However, speaker recognition is currently done mostly in the cloud, because the constrained resources and battery capacity of small devices prevent them from running complex models. Adapting to new speakers calls for lightweight and efficient algorithms suitable for on-device and online learning.

Developments in deep neural networks have led to end-to-end speaker recognition systems that achieve high accuracy on noisy and uncontrolled speech data [nagrani2017voxceleb, chung2018voxceleb2, kim2021adaptive]. Although neural networks can deal with noisy real-world data, they are expensive to train because of iterative back-propagation with gradient descent. Moreover, the whole network may need to be retrained when new speakers are added. Non-neural-network approaches include the traditional Gaussian Mixture Model-Support Vector Machine [campbell2006support], and the more recent Joint Factor Analysis [kenny2005joint] and i-vector approaches [dehak2009support]. However, training these models usually requires running the Expectation Maximization (EM) algorithm iteratively, which can also be computationally intensive for large datasets.

We propose a speaker-recognition approach based on computing with high-dimensional (HD) vectors, also called “Hyperdimensional” [kanerva2009hyperdimensional]. By mapping data to nearly orthogonal vectors in a high-dimensional space and computing with simple yet powerful operations on vectors, the intrinsic structure of the data can be revealed in a manner that is effective for classification [wong2018negative, OsipovHyperSeed2021]. HD computing has provided an efficient way to analyze various types of data [kleyko2021survey] and to achieve fast, online and incremental learning in dealing with text [joshi2016language, AlonsoHyperEmbed2020], multi-modal bio-signals [rahimi2018efficient, moin2021wearable, zhou2021memory, ge2021seizure], classifying spoken letters [imani2017voicehd], and others [KleykoSurveyVSA2021Part2]. This work extends the application of HD computing to speaker recognition.

We start by describing the idea of computing with hypervectors and the operations used to encode speech (Section 2). The proposed speech encoder aims to capture the pronunciation variations between speakers in a speaker profile hypervector (Section 3). The profile is computed in three steps. First, the formants in a time slice are encoded into a hypervector, to capture the variation of the signal over frequencies. Then hypervectors for consecutive time slices are encoded into an n-gram hypervector, to capture the variation of the signal over time. Finally, the n-gram hypervectors of a speech sample are added together to form a profile hypervector that summarizes the course of the power spectrum over time. Ways to improve this basic algorithm with GLVQ are also explored.

A metric is introduced to evaluate the efficiency of speaker identification systems in terms of the training energy per bit of information gain. This work achieves highly competitive energy efficiency owing to its small number of active parameters during operation and to one-shot learning.

2 Computing with Hypervectors

HD computing originates from Holographic Reduced Representations in the early 1990s [plate1994distributed, plate2003HRRbook]. For speech processing, it provides a formulaic way to encode the frequency and temporal structure of a spectrogram into a fixed-dimensional vector.

“Hypervectors” refer to high-dimensional (1,000 dimensions or more) seed vectors and to vectors made from them with three operations: addition, multiplication, and permutation. The seed vectors are chosen at random to represent basic entities—they are like atoms from which everything else is built. Here they represent differences of spectral power in adjacent frequency bins of a time slice. We use random bipolar vectors (of +1s and −1s) as seeds. Addition and multiplication happen coordinate-wise, and permutations reorder (shuffle) hypervector coordinates. The similarity of vectors is measured with the cosine, which equals 0 when the vectors are orthogonal; in a high-dimensional space nearly all pairs of vectors are approximately orthogonal. Computing with vectors is like traditional computing with numbers, except that addition and multiplication now operate on vectors, and no arithmetic operation corresponds to the permutation of coordinates.

Unlike most machine-learning methods that require iterative training, HD computing offers one-shot and online learning. Learning happens in a single pass over samples from known speakers. The pass produces a d-dimensional profile hypervector—a class prototype—for each speaker. Profiles for test samples are made with the same algorithm and are identified with the most similar speaker profiles. Therefore, the model need not be retrained when speakers are added.
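To make these operations concrete, the following is a minimal Python sketch (not the authors' implementation; names such as random_seed_vector, permute, and identify are illustrative) of the vector operations described above: random bipolar seeds, coordinate-wise addition and multiplication, a fixed random permutation, and cosine similarity for matching a test profile against speaker profiles.

```python
import numpy as np

D = 1024                         # hypervector dimension (see Section 4.3)
rng = np.random.default_rng(0)

def random_seed_vector():
    """Random bipolar seed vector of +1s and -1s."""
    return rng.choice([-1, 1], size=D)

PERM = rng.permutation(D)        # one fixed random shuffle of the coordinates

def permute(v, times=1):
    """Apply the fixed random permutation 'times' times."""
    for _ in range(times):
        v = v[PERM]
    return v

def cosine(a, b):
    """Cosine similarity; approximately 0 for unrelated hypervectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(test_profile, speaker_profiles):
    """One-shot classification: return the speaker with the most similar profile."""
    sims = {name: cosine(test_profile, p) for name, p in speaker_profiles.items()}
    return max(sims, key=sims.get)

# A bundle (sum) stays similar to its parts, while random seeds are nearly orthogonal.
a, b = random_seed_vector(), random_seed_vector()
assert cosine(a + b, a) > 0.5
assert abs(cosine(a, b)) < 0.2
```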

3 Encoding Speech

Learning the statistics of a signal over time is particularly natural and efficient with hypervectors. The proposed encoder aims to capture the unique pronunciation variation between speakers, similar to phone n-gram-based modeling [kohler2001phonetic]. In this approach, a profile hypervector is designed to learn each speaker's unique course of the formants over time.

In a spectrogram, typically up to 4 formants stand out at any moment of time. Taking the spectrum one slice at a time, for example the upper left plot in Figure 1, a formant can be identified by rising power to its left and falling power to its right. Therefore, formants can be located by comparing the power in adjacent bins. A simple local binary pattern (LBP) encoding [burrello2019hyperdimensional] is used here to encode the locations of the formants in a time slice. Going from one bin to the next, the LBP encoder reports 1 if the power increases or remains the same and 0 if it decreases. In the VoxCeleb dataset, audio is sampled at 16 kHz, so the first 40 bins of a power spectrum computed over a 5-ms window cover frequencies from 0 to 8,000 Hz. The power in 40 bins gives rise to 39 differences between neighboring bins, resulting in a 39-bit LBP.

The hypervector S_t for the spectrum at time t summarizes the output of the LBP encoder. It is made from bipolar seed vectors that represent the 0s and the 1s of the LBP. There are a total of 78 seed vectors to choose from, corresponding to the power going up or down at each of the 39 bin-to-bin comparisons: a seed U_i or D_i represents the power in the (i+1)-th bin being greater or less than the power in the i-th bin. According to the output of the LBP encoder, 39 seed vectors are selected and added together. Finally, the resulting sum vector is transformed into a bipolar vector of +1s and −1s by thresholding it at zero:

S_t = g(V_1 + V_2 + ... + V_39),   (1)

where V_i is U_i or D_i depending on the i-th bit of the LBP, and

g(x) = +1 if x ≥ 0, −1 if x < 0   (2)

is applied coordinate-wise. (The sum of 39 bipolar values is always odd, so no coordinate is ever exactly zero.) Looking only at whether the power is increasing or decreasing has the advantage of not being affected by the power characteristics and the loudness of the speech. It is important to note that this encoding yields similar hypervectors for similarly located formants.
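The following is a minimal sketch of the LBP encoding of one spectrum slice (Eqs. 1 and 2), assuming 40 frequency bins and hence 39 comparisons and 2 × 39 = 78 seed vectors; the names UP, DOWN, and encode_slice are illustrative and not taken from the authors' code.

```python
import numpy as np

D = 1024
rng = np.random.default_rng(0)
UP   = rng.choice([-1, 1], size=(39, D))   # seeds for "power rises from bin i to bin i+1"
DOWN = rng.choice([-1, 1], size=(39, D))   # seeds for "power falls from bin i to bin i+1"

def encode_slice(power_bins):
    """Encode the power in the first 40 bins of one 5-ms slice into a bipolar hypervector."""
    diffs = np.diff(power_bins[:40])               # 39 neighboring-bin differences
    lbp = diffs >= 0                               # 1 if the power rises or stays the same
    selected = np.where(lbp[:, None], UP, DOWN)    # pick one seed per comparison
    summed = selected.sum(axis=0)                  # add the 39 selected seeds (Eq. 1)
    return np.where(summed >= 0, 1, -1)            # threshold at zero (Eq. 2)
```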

Hypervectors for individual spectra are combined into spectra over time by encoding them in n-grams and adding those into a profile hypervector. Empirically, we found that trigrams and tetragrams work best. Sampling spectrum slices at 20-ms intervals means that these n-grams represent 60–80 ms of speech, or approximately the length of a phoneme.

A trigram vector is made by permuting the vector for the first spectrum twice, permuting the vector for the second spectrum once, taking the third as is, and multiplying the three vectors coordinate-wise, i.e.:

T_t = ρ(ρ(S_{t−2})) ∗ ρ(S_{t−1}) ∗ S_t,   (3)

where ρ(S) denotes permuting the vector once and ρ(ρ(S)) denotes permuting it twice. The permutation was implemented by randomly shuffling the indices of the vector's coordinates.

Finally, the profile hypervector is formed by summing the trigram vectors over time within an utterance, or across multiple utterances. In the VoxCeleb dataset, utterances have been collected for each speaker in different contexts, with different recording quality and background noise. As will become clear in Section 4.5, two kinds of profile hypervectors were generated for each speaker: (1) one for each context/video/subfolder and (2) one across all the speaker's contexts. For speaker s, the context profile hypervector C_{s,k} for context k is the sum of all trigrams from all utterances in that context, and the final profile hypervector for speaker s is the sum of his/her context profile hypervectors, P_s = Σ_k C_{s,k}.
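Below is a sketch of the trigram encoding (Eq. 3) and of the profile accumulation for a single utterance, reusing permute and encode_slice from the earlier sketches; context and speaker profiles are then obtained by summing these utterance-level sums.

```python
import numpy as np

def trigram(s_t2, s_t1, s_t):
    """Trigram hypervector for the slices at times t-2, t-1, and t (Eq. 3)."""
    return permute(s_t2, times=2) * permute(s_t1, times=1) * s_t

def utterance_profile(slices):
    """Sum of trigram vectors over one utterance."""
    hvs = [encode_slice(s) for s in slices]
    acc = np.zeros(len(hvs[0]))
    for t in range(2, len(hvs)):
        acc += trigram(hvs[t - 2], hvs[t - 1], hvs[t])
    return acc
```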

Figure 1: Encoding a spectrum at time t into a hypervector S_t. The spectrum represents 5 ms of speech sampled at 16 kHz.
Figure 2: Correlation matrices of the first 20 speakers' profile hypervectors: (a) without weighting; (b) with weighting.

4 Experiments

4.1 Experimental Setup

We used the VoxCeleb dataset [nagrani2017voxceleb] to develop and test the algorithm. It consists of speech from 1,251 speakers collected from YouTube videos. Each speaker has a folder, divided into a number of subfolders that contain the audio files. The subfolders come from different videos and are referred to as different “contexts.” The speech files in the subfolders are called “utterances.” Following the same procedure as in [nagrani2017voxceleb], for each speaker we reserved for testing the subfolder/context with the fewest utterances but at least five. Each speaker is trained with 904 seconds of speech and tested with 64 seconds, on average. Top-1, Top-5, and Top-10 accuracies were calculated.

4.2 Input Features

Spectrograms are computed using a 5-ms Hann window and a step size of 20 ms. A short window (5 ms vs. the commonly used 25 ms) simplifies the LBP encoding of the formants by smoothing over the multiples of the fundamental frequency and its overtones. Spectrum slices are sampled at 20-ms intervals so that a trigram or a tetragram represents 60–80 ms of speech, approximately the length of a phoneme.
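A short sketch of these input features under the stated settings (16-kHz audio, an 80-sample Hann window, a 320-sample step, and the first 40 bins of the power spectrum); a plain NumPy loop is used because the step is larger than the window.

```python
import numpy as np

FS, WIN, STEP, BINS = 16000, 80, 320, 40   # 5-ms window and 20-ms step, in samples

def power_slices(audio):
    """Yield one 40-bin power spectrum per 20-ms step (bins cover 0-8,000 Hz)."""
    window = np.hanning(WIN)
    for start in range(0, len(audio) - WIN + 1, STEP):
        frame = audio[start:start + WIN] * window
        spectrum = np.fft.rfft(frame)              # 41 bins for an 80-sample frame
        yield (np.abs(spectrum) ** 2)[:BINS]       # keep the first 40 bins
```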

4.3 Encoder Design

Before training and testing on the entire dataset, a few parameters of the encoder need to be determined, such as the dimension of the hypervectors, which n-gram to use, and the number of frequency bins to encode. The dimension of the hypervectors was chosen to be 1,024. To determine which n-gram to use, unigrams through pentagrams were tried in the encoder, training and testing on 40 speakers' data, which is sufficient to indicate the results for the entire dataset. The results suggested that tetragrams and trigrams are comparable and perform better than the others, so trigrams were chosen for the encoder. Similarly, different numbers of frequency bins (26, 32, and 40) were encoded on 40 speakers' data, and the results suggested that encoding the full number of bins (40) performs best. With the proposed approach, training and testing on 40 speakers' data take roughly 5 minutes on an Apple M1 processor, so design parameters can be tested quickly.

4.3.1 Weighting

Voice activity detection is usually necessary so that desirable features are extracted only from speech segments. The algorithm so far treats a period of silence the same as speech. To counter the lack of information in silence or a weak signal, we use the total energy in the spectrum to weight its hypervector before including it in the n-gram. The total energy of a spectrum slice at time t is E_t, the sum of the power over its frequency bins. Thus the weighted trigram at time t is

E_t^α ∗ T_t,   (4)

where the exponent α was determined empirically as described in Section 4.3. We found that weighting with the 0.3 power of the energy works well with trigrams. Figures 2(a) and 2(b) show the correlation matrices of the first 20 speakers' profile hypervectors before and after applying the weights. It can be seen that the correlations drop after applying the weighting.

4.3.2 Normalized Weights

Performance may be further improved by normalizing the weights to discount large variations in the power of speech segments across different contexts (due to the lack of control over recording conditions in the VoxCeleb dataset). As shown in Figure 3, the average power per frequency bin over an utterance can vary by over 15 dB for the same speaker. To avoid favoring contexts with louder voices or even noise, the weight for the hypervector of each spectrum slice is scaled by the ratio between the desired maximum bin power and the speaker's maximum bin power averaged over the utterance from which the hypervector at time t is computed:

r = P_target / P_utt,   (5)

where P_target is the target maximum bin power that all utterances are adjusted to; it can be an arbitrary constant, and we set it to the average of the largest bin power over the first 40 speakers. P_utt is the maximum bin power of the utterance being considered and is computed during training. This essentially makes all utterances' maximum bin power equal, so the weights no longer favor contexts with higher spectral power. Applying this normalizing ratio increases the Top-1 and Top-5 accuracies by 4.3% and 7.3% over the weighted case.
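A sketch of how the weighting (Eq. 4) and the normalizing ratio (Eq. 5) can be combined when accumulating a profile. This is one reading of the text: the per-utterance maximum bin power is taken as the mean of the per-slice maxima, the constant P_target stands for the target maximum bin power, and the ratio simply scales the energy weight; encode_slice and trigram come from the earlier sketches.

```python
import numpy as np

ALPHA = 0.3                                   # exponent found to work well with trigrams

def weighted_profile(slices, P_target):
    """Profile of one utterance with energy weighting and loudness normalization."""
    hvs = [encode_slice(s) for s in slices]
    energies = [s.sum() for s in slices]                 # total energy per slice
    P_utt = np.mean([s.max() for s in slices])           # max bin power, averaged over the utterance
    ratio = P_target / P_utt                             # Eq. 5: equalize utterance loudness
    acc = np.zeros(len(hvs[0]))
    for t in range(2, len(hvs)):
        w = ratio * energies[t] ** ALPHA                 # normalized weight (Eqs. 4-5)
        acc += w * trigram(hvs[t - 2], hvs[t - 1], hvs[t])
    return acc
```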

Figure 3: Average power per utterance for a speaker across multiple contexts from the VoxCeleb dataset. Different colored lines denote different utterances.
Method | Weighting applied to | Further processing | Top-1 | Top-5 | Top-10
Baseline | none | none | | |
Weighting | | none | | |
Normalized Weighting | | none | | |
Refinement of profile hypervectors | | GLVQ | | |

Table 1: Identification accuracy on the VoxCeleb1 dataset using the proposed methods.
Method | Top-1 | Top-5 | Active Parameters during training [1] | Stored Parameters | Mutual Info. | Training Method (Time) | Classification Speed
i-vectors/SVM [nagrani2017voxceleb] | | | not reported | 1M [2] | 4.04 bits | iterative EM+SGD [3] | not reported
i-vectors/PLDA/SVM [nagrani2017voxceleb] | | | not reported | 0.5M [4] | 5.29 bits | iterative EM+SGD | not reported
CNN [nagrani2017voxceleb] | 0.805 | 0.921 | 67M | 67M | 7.57 bits | iterative SGD | not reported
ACNN [kim2021adaptive] | 0.855 | 0.953 | 4.69M | 4.69M | 8.20 bits | iterative SGD | not reported
This work | | | HD: 1.02k; GLVQ: 2.05k | HD: 1.28M; GLVQ: 1.28M | 3.93 bits | HD: one-shot (128 min); GLVQ: SGD (92 sec) | 5.7 ms per 1-sec test sample

[1] Number of parameters updated in a single pass of one data sample.
[2] Assuming two 400-dimensional vectors [nagrani2017voxceleb] were stored for each speaker's SVM.
[3] Generally, the i-vector extractor is trained with the EM algorithm and the SVM with the SGD algorithm.
[4] Assuming two 200-dimensional vectors [nagrani2017voxceleb] were stored for each speaker's SVM.
SVM: Support Vector Machine; PLDA: Probabilistic Linear Discriminant Analysis; CNN: Convolutional Neural Network; ACNN: Adaptive CNN; EM: Expectation Maximization; SGD: Stochastic Gradient Descent.

Table 2: Comparison with prior works on the VoxCeleb1 dataset for speaker identification.

4.4 Refinement of profile hypervectors with Learning Vector Quantization

The profile hypervector that sums a speaker’s context hypervectors corresponds to centroid-based classification, which is commonly used in speech and signal processing (e.g., [RasanenMultivariate2015, KleykoTradeoffs2018, GeClassificationReview2020]) due to its simplicity, although it does not guarantee the most accurate classification  [RosatoHDDistributed2021, KarlgrenSemantics2021]. Of the various classifiers that can take hypervectors as input (e.g., [RachkovskijClassifiers2007, RahimiBiosignal2016, imani2017voicehd, KleykoDensityEncoding2020, DiaoGLVQHD2021]), we used the Generalized Learning Vector Quantization (GLVQ) [SatoGLVQ1995], since it is natural to use context hypervectors to initialize prototypes and then refine them iteratively [DiaoGLVQHD2021]. In each iteration, the classifier uses one misclassified context hypervector from one speaker to update the speaker’s profile vector (“prototype”) as well as the profile vector of the nearest (the most similar) speaker. In this manner, the classification accuracy improves after each iteration.
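For illustration, here is a sketch of one GLVQ update step in the spirit of [SatoGLVQ1995], applied to profile vectors: a misclassified context hypervector pulls its own speaker's prototype toward it and pushes the wrongly matched prototype away. The squared-Euclidean distance, the logistic loss, and the learning rate are illustrative choices, not details taken from the paper.

```python
import numpy as np

def glvq_step(x, prototypes, true_id, lr=0.01):
    """x: context hypervector; prototypes: dict of speaker_id -> float profile vector."""
    d = {s: np.sum((x - p) ** 2) for s, p in prototypes.items()}
    pred = min(d, key=d.get)                        # nearest prototype overall
    if pred == true_id:
        return                                      # only misclassified samples trigger updates
    d1, d2 = d[true_id], d[pred]
    mu = (d1 - d2) / (d1 + d2)                      # GLVQ relative-distance measure
    g = np.exp(-mu) / (1 + np.exp(-mu)) ** 2        # derivative of the logistic loss
    scale = 2.0 / (d1 + d2) ** 2
    prototypes[true_id] += lr * g * scale * d2 * (x - prototypes[true_id])   # pull correct prototype
    prototypes[pred]    -= lr * g * scale * d1 * (x - prototypes[pred])      # push wrong prototype
```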

4.5 Results

Table 1 summarizes the identification accuracies on the entire dataset using the methods described above. The baseline uses a very simple encoder, and each piece of reasonable feature engineering further improves the score. Most importantly, training and testing on the entire dataset require only 1.02k active parameters and take only 135.6 minutes (training: 128 minutes; testing: 7.6 minutes) on a regular CPU-based Linux machine (Intel Xeon® CPU @ 2.40 GHz) using at most 5 cores. The classification speed is 5.7 ms per 1 second of speech.

The GLVQ classifier with a single prototype per speaker is used to obtain refined profile vectors from the context hypervectors. After every epoch (i.e., a pass over all training samples), the classification accuracies were evaluated. Figure 4 shows the results for Top-1, Top-5, and Top-10 test accuracies. Epoch 0 corresponds to the accuracy of the centroid-based classification. The Top-1 accuracy increased markedly within just two epochs, and the accuracies then began to saturate, reaching their final Top-1, Top-5, and Top-10 values. Running GLVQ for 30 epochs took only 92 seconds on a regular CPU-based laptop.

Figure 4: The test accuracy of the GLVQ classifier against the number of training epochs.

Table 2 compares this work to state-of-the-art non-neural-network and neural-network approaches. Although identification accuracy has been the only metric reported for most methods, the cost of achieving it is also an important factor that cannot be ignored. Therefore, we propose a metric to evaluate the efficiency of a design: the energy to train a model per bit of information gain, i.e.:

E_train / I,   (6)

where E_train is the energy consumed during training (on a fixed machine it is proportional to the training time) and I is the information gain (in bits) of the speaker identification system. I can be estimated from the Top-1 accuracy p as the mutual information between the input speaker id and the output speaker id of the identification system, assuming every speaker is equally likely to appear at the input and is correctly identified with probability p; if not identified, a speaker is misclassified as any one of the remaining 1,250 speakers with equal probability.

Although the neural networks achieve about 4 more bits of information gain, the energy to train them is far more than 2× larger: they update more than 4,500 times as many active parameters, iteratively, over an unspecified number of epochs. The i-vector approaches likewise require iterative training, for the i-vector extractor and for the per-speaker SVMs. Because of the limited information reported in other works, we are not able to quantify their efficiency. Considering the relatively few active parameters used during training and the one-shot learning algorithm, we believe that our approach leads to highly energy-efficient systems for speech processing.

5 Discussion

In this work, we have studied the application of a new computing paradigm to the encoding of speech. With a simple encoding scheme and reasonable feature engineering, it achieves highly competitive efficiency for its information gain. The results obtained so far are based on a single acoustic feature (formants) and its course over a short time. Many more acoustic features remain to be considered, such as pitch and cepstral coefficients. HD computing is especially suited for encoding a combination of features into a fixed-dimensional representation [KarlgrenSemantics2021]. Therefore, the identification accuracy is expected to keep improving as other acoustic features are added, with a modest increase in computing time and memory use. This work can help originate a simpler, more energy-efficient machine learning for speech processing.

6 Acknowledgements

PCH was supported by NSF ECCS-2147640. PCH, DK, JMP, BAO, and PK were supported in part by DARPA's AIE program (HyDDENN Project). DK, BAO, and PK were also supported in part by AFOSR FA9550-19-1-0241. DK was supported by the EU's MSCA Fellowship (839179).

References