Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations with Subwords

Word vector representations enable machines to encode human language for spoken language understanding and processing. Confusion2vec, motivated from human speech production and perception, is a word vector representation which encodes ambiguities present in human spoken language in addition to semantics and syntactic information. Confusion2vec provides a robust spoken language representation by considering inherent human language ambiguities. In this paper, we propose a novel word vector space estimation by unsupervised learning on lattices output by an automatic speech recognition (ASR) system. We encode each word in confusion2vec vector space by its constituent subword character n-grams. We show the subword encoding helps better represent the acoustic perceptual ambiguities in human spoken language via information modeled on lattice structured ASR output. The usefulness of the proposed Confusion2vec representation is evaluated using semantic, syntactic and acoustic analogy and word similarity tasks. We also show the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection. The results significantly outperform existing word vector representations when evaluated on erroneous ASR outputs. We demonstrate that Confusion2vec subword modeling eliminates the need for retraining/adapting the natural language understanding models on ASR transcripts.




1 Introduction

Speech is the primary and most natural mode of communication for humans. This also makes it attractive for human-computer interaction, which in turn requires decoding human language to enable spoken language understanding. Human language is a complex construct involving multiple dimensions of information, including semantics and syntax, and it often contains ambiguities that make machine inference of communicative intent, emotions, etc. challenging. Several word vector representations have been proposed in the natural language processing community for effectively describing human language.

Contextual modeling techniques like language modeling, i.e., predicting the next word in the sentence given a window of preceding context, have been shown to model meaningful word representations [2, 19]. Bag-of-words based contextual modeling, where the current word is predicted given both its left and right (local) contexts, has been shown to capture language semantics and syntax [18]. Similarly, predicting local context from the current word, referred to as skip-gram modeling, is shown to better represent semantic and syntactic distances between words [20]. In [21], log bi-linear models combining global word co-occurrence information and local context information, termed global vectors (GloVe), are shown to produce a meaningfully structured vector space. Bi-directional language models are proposed in [22], where internal states of deep neural networks are combined to model complex characteristics of word use and its variance over linguistic contexts. The advantages of bi-directional modeling are further exploited along with self-attention using transformer networks [31] to estimate a representation, termed BERT (Bidirectional Encoder Representations from Transformers), that has shown its utility on a multitude of natural language understanding tasks [6]. Models such as BERT and ELMo estimate word representations that vary depending on the context, whereas context-free representations including GloVe and Word2Vec generate a single representation irrespective of the context.

However, most word vector representations infer knowledge through contextual modeling, and many of the inherent ambiguities present in human language are often unrecognized or ignored. For instance, from the perspective of spoken language, the ambiguities can be associated with how similar the words sound, e.g., the words “see” and “sea” sound acoustically identical but have different meanings. The ambiguities can also be associated with the underlying speech signal itself, due to a wide range of acoustic environments involving noise, overlapped speech, and channel and room characteristics. These ambiguities often project themselves as errors through ASR systems. Most of the existing word vector representations such as word2vec [20, 18], fastText [3], GloVe [21], BERT [6] and ELMo [22] do not account for the ambiguities present in speech signals and thus degrade when processing noisy ASR transcripts.

Confusion2vec was recently proposed to handle representation ambiguity information present in human language [26]. Confusion2vec is estimated by unsupervised skip-gram training on the ASR output lattices and confusion networks. The analysis of inherent acoustic ambiguity information of the embeddings displayed meaningful interactions between the semantic-syntactic subspace and acoustic similarity subspaces. In [27], the usefulness of the Confusion2vec was confirmed on the task of spoken language intent detection. The Confusion2vec representation significantly outperformed typical word embeddings including word2vec and GloVe when evaluated on noisy ASR transcripts by reducing the classification error rate by approximately 20% relative.

Although there have been a few attempts at leveraging information present in word lattices and word confusion networks for several tasks [29, 14, 30, 33, 28, 13], the main downside of these works is that the word representations estimated by such techniques are task dependent and restricted to a particular domain and dataset. Moreover, the availability of most task-specific datasets is limited, and task-specific speech data are expensive to collect. The advantage of Confusion2Vec is that it estimates a generic, task-independent word vector representation via unsupervised learning on lattices or confusion networks generated by an ASR on any speech conversations.

In this paper, we incorporate subwords to represent each word for modeling both the acoustic ambiguity information and the contextual information. Each word is modeled as a sum of its constituent character n-grams. Our motivation behind the use of subwords is the following: (i) they incorporate morphological information by encoding the internal structure of words [3], (ii) the bags of character n-grams of acoustically ambiguous words often have a high overlap, (iii) subwords help model under-represented words more efficiently, thereby leading to more robust estimation with limited available data, which is the case since training Confusion2Vec is restricted to ASR lattice outputs, and (iv) subwords enable representations for out-of-vocabulary words, which are commonplace with end-to-end ASR systems outputting characters.

The rest of the paper is organized as follows: Confusion2vec is introduced in Section 2. The proposed subword modeling is presented in Section 3. Section 4 gives details of the evaluation techniques employed for assessing the word embedding models. The experimental setup and results of various analogy and similarity tasks are presented in Section 5. Section 6 presents the application of the proposed word vector representation to the spoken language intent detection task. Finally, the paper is concluded in Section 7.

Figure 1: Example Confusion Network Output by ASR for the ground-truth phrase “I want to sit”

2 Confusion2Vec

In psycho-acoustics, it is established that humans also relate words with how they sound [1], in addition to semantics and syntax. Inspired by principles of human speech production and perception, we previously proposed Confusion2vec [26]. The core idea is to estimate a hyper-space that not only captures the semantics and syntax of human language, but also augments the vector space with acoustic ambiguity information, i.e., word acoustic similarity information. In other words, word2vec and GloVe can be viewed as subspaces of the confusion2vec vector space.

Several different methodologies are proposed for capturing the ambiguity information. The methodologies are adaptations of skip-gram modeling for word confusion networks or lattice-like structures. Word lattices are directed acyclic weighted graphs of all the word sequences that are likely possible. A confusion network is a specific type of lattice with the constraint that each word sequence passes through each node of the graph. Such lattice-like structures can be derived from machine learning algorithms that output probability measures, for example, an ASR. Figure 1 illustrates a confusion network that can possibly result from a speech recognition system. Unlike the simple sentences typically used for training word embeddings like word2vec, GloVe, BERT, ELMo, etc., the information in the confusion network can be viewed along two dimensions: (i) the contextual dimension, and (ii) the acoustic ambiguity dimension.

More specifically, four different configurations of skip-gram modeling algorithms are proposed in our previous work [26], namely: (i) top-confusion, (ii) intra-confusion, (iii) inter-confusion, and (iv) hybrid model. The top-confusion version considers only the most-probable path of the ASR confusion network and applies the typical skip-gram model on it. The intra-confusion version applies the skip-gram modeling on the acoustic ambiguity dimension of the confusion network and ignores the contextual information, i.e., each ambiguous word alternative is predicted by the other over a pre-defined local context. The inter-confusion version applies the skip-gram modeling on the contextual dimension but over each of the acoustic ambiguous words. The hybrid model is a combination of both the intra and inter-confusion configurations. More information on the training configuration is available in [26]. The present work builds upon this basic Confusion2vec framework.
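As a minimal sketch of the top-confusion configuration described above, the snippet below picks the most probable alternative at each time-step of a confusion network and then forms ordinary skip-gram pairs on that single path. The posterior values, the function names `top_path` and `skipgram_pairs`, and the fixed window are assumptions made for illustration; they are not taken from [26].

```python
# Hypothetical confusion network for the phrase in Figure 1: one list per
# time-step, each holding (word, posterior) alternatives. The posterior
# values are invented for illustration.
confusion_network = [
    [("I", 0.7), ("eye", 0.3)],
    [("want", 0.5), ("wand", 0.2), ("won't", 0.2), ("what", 0.1)],
    [("to", 0.6), ("two", 0.3), ("tees", 0.1)],
    [("sit", 0.4), ("seat", 0.3), ("seed", 0.2), ("eat", 0.1)],
]


def top_path(network):
    """Return the most probable word at every time-step."""
    return [max(slot, key=lambda alt: alt[1])[0] for slot in network]


def skipgram_pairs(words, window):
    """Standard skip-gram (input, target) pairs within a symmetric window."""
    pairs = []
    for t, word in enumerate(words):
        lo, hi = max(0, t - window), min(len(words), t + window + 1)
        pairs.extend((word, words[c]) for c in range(lo, hi) if c != t)
    return pairs


best = top_path(confusion_network)
pairs = skipgram_pairs(best, window=1)
```

With these made-up posteriors the best path is the ground-truth phrase, so the resulting pairs carry contextual information only and no acoustic ambiguity information.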

3 Confusion2Vec 2.0 subword model

Subword encoding of words has been popular in modeling the semantics and syntax of language using word vector representations [3, 6, 22]. The use of subwords is mainly motivated by the fact that subwords incorporate morphological information, which can be helpful, for example, in relating prefixes, suffixes and the word root. In this work, we apply subword representations for encoding the word ambiguity information in human language. We believe there is a compelling case for the use of subwords for representing the acoustic similarities (ambiguities) between words, since similar sounding words often have highly overlapping subword representations. This helps model the level of overlap and robustly estimate the magnitude of acoustic similarity. Moreover, the use of subwords should help in efficiently encoding under-represented words in the language. This is crucial in the case of Confusion2vec because we are restricted to speech data and the corresponding decoded ASR lattices for training, which limits word-word co-occurrence in contrast to typical word vector representations that can be trained on large amounts of easily available plain text data. Another important aspect is the ability to represent out-of-vocabulary words, which are a commonplace occurrence with end-to-end ASR systems outputting character sequences.

In the proposed model, each word is represented as a sum of its constituent n-gram character subwords. This enables the model to infer the internal structure of each word. For example, the word “want” is represented with the vector sum of the following subwords:

<wa, wan, ant, nt>, <wan, want, ant>, <want, want>, <want>

Symbols < and > are used to represent the beginning and end of the word. The n-grams are generated for n=3 up to n=6. It is apparent that an acoustically ambiguous, similar sounding word “wand” has a high degree of overlap with this set of character n-grams.
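The subword decomposition above can be sketched in a few lines. The function name `char_ngrams` and its parameters are our own naming for illustration, and the sketch assumes the fastText-style convention of wrapping the word in `<` and `>` before extracting n-grams for n=3 to n=6.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' for n in [n_min, n_max]."""
    token = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


want = char_ngrams("want")
wand = char_ngrams("wand")

# n-grams that the two acoustically confusable words have in common
shared = sorted(set(want) & set(wand))
print(want)
print(shared)
```

For a four-letter word the single 6-gram coincides with the full bracketed word. The shared n-grams are what ties the representations of “want” and “wand” together before any training on confusion networks takes place.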

In this paper, we consider two modeling variations: (i) inter-confusion, and (ii) intra-confusion versions of confusion2vec with the subword encoding.

3.1 Intra-Confusion Model

The goal of the intra-confusion model is to estimate the inter-word relations between the acoustically ambiguous words that appear in the ASR lattices. For this, we perform skip-gram modeling over the acoustic similarity dimension (see Figure 1) and ignore the contextual dimension of the utterance. The objective of the intra-confusion model is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{a \in A_t} \sum_{\hat{a} \in \hat{A}_t} \log p\left(w_{t,\hat{a}} \mid w_{t,a}\right) \tag{1}$$

where $T$ is the length of the utterance (confusion network) in terms of number of words, $w_{t,a}$ is the word in the confusion network output by the ASR at time-step $t$, and $a$ is the index of the word among the ambiguous alternatives. $A_t$ is the set of indices of all ambiguous words at time-step $t$, $a$ is the index of the current word along the acoustic ambiguity dimension, and $\hat{A}_t = A_t \setminus \{a\}$ is the subset of ambiguous words at time-step $t$ excluding the current word $w_{t,a}$. For example, in Figure 1, for the current word $w_{t,a}$ = “want”, $\hat{A}_t$ indexes {wand, won’t, what}. Additionally, for subword encoding, each input word is represented as:

$$\mathbf{w}_{t,a} = \sum_{g \in G_{w_{t,a}}} \mathbf{z}_g \tag{2}$$

where $G_{w_{t,a}}$ is the set of all character n-grams of the word ranging from n=3 to n=6, together with the word itself, and $\mathbf{z}_g$ is the vector representation of the n-gram subword $g$. A few training samples (input, target) generated for this configuration from the confusion network in Figure 1 are (I, eye), (eye, I), (want, wand), (want, won’t), (won’t, what), (wand, what), etc.
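The intra-confusion training samples listed above can be generated with a short sketch. The alternatives per time-step are read off the examples given in the text for Figure 1; their order inside each slot and the function name `intra_confusion_pairs` are assumptions, and the sketch pairs every alternative with every other one in the same slot rather than sampling a window along the acoustic dimension.

```python
from itertools import permutations

# Alternatives per time-step, following the examples given for Figure 1.
# The order of the words inside each slot is an assumption.
confusion_network = [
    ["I", "eye"],
    ["want", "wand", "won't", "what"],
    ["two", "tees", "to"],
    ["seat", "sit", "seed", "eat"],
]


def intra_confusion_pairs(network):
    """(input, target) pairs between distinct alternatives of the same slot."""
    pairs = []
    for slot in network:
        pairs.extend(permutations(slot, 2))
    return pairs


pairs = intra_confusion_pairs(confusion_network)
```

No pair crosses a time-step boundary, which reflects that this configuration ignores the contextual dimension.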

3.2 Inter-Confusion Model

The aim of the inter-confusion model is to jointly model the contextual co-occurrence information and the acoustic ambiguity co-occurrence information along both axes depicted in the confusion network. Here, the skip-gram modeling is performed over the time context and over all the possible acoustic ambiguities. The objective of the inter-confusion model is to maximize the following log-likelihood:

$$\sum_{t=1}^{T} \sum_{a \in A_t} \sum_{c \in C_t} \sum_{\hat{a} \in A_c} \log p\left(w_{c,\hat{a}} \mid w_{t,a}\right) \tag{3}$$

where, as in equation 1, $T$ is the length of the utterance, $A_t$ is the set of indices of ambiguous alternatives at time-step $t$, and $w_{t,a}$ is the word with index $a$ at time-step $t$. $C_t$ corresponds to the set of indices of nodes of the confusion network, i.e., words around the current word along the time-axis, and $c$ is the current context index. $A_c$ is the set of indices of acoustically ambiguous words at context $c$. For example, for the current word $w_{t,a}$ = “want” in Figure 1, the context words indexed by $A_c$ over $c \in C_t$ are {I, eye, two, tees, to, seat, sit, seed, eat}, and $A_t$ indexes {wand, won’t, what, want}. Note that each input word is subword encoded as in equation 2. A few training samples (input, target) generated for this configuration are (want, I), (want, eye), (want, two), (want, to), (want, tees), (what, I), (what, eye), (what, to), (what, tees), (what, two), (won’t, eye), etc.
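A minimal sketch of how the inter-confusion samples above could be enumerated follows. The slot contents mirror the examples in the text, while the word order inside each slot, the function name `inter_confusion_pairs` and the fixed window of 2 are assumptions made for illustration.

```python
# Alternatives per time-step, following the examples given for Figure 1.
# The order of the words inside each slot is an assumption.
confusion_network = [
    ["I", "eye"],
    ["want", "wand", "won't", "what"],
    ["two", "tees", "to"],
    ["seat", "sit", "seed", "eat"],
]


def inter_confusion_pairs(network, window):
    """Pair every alternative at time-step t with every alternative of each
    neighbouring time-step c inside the window (c != t)."""
    pairs = []
    for t, slot in enumerate(network):
        lo, hi = max(0, t - window), min(len(network), t + window + 1)
        for c in range(lo, hi):
            if c == t:
                continue
            for word in slot:
                pairs.extend((word, ctx) for ctx in network[c])
    return pairs


pairs = inter_confusion_pairs(confusion_network, window=2)
targets_of_want = {tgt for src, tgt in pairs if src == "want"}
```

Each input word is predicted against all alternatives of its neighbouring slots, so a word such as “what” receives the same contexts as “want” even when it is not on the best path.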

3.3 Training Loss and Objective

Negative sampling is employed for training the embedding model. Negative sampling was first introduced for training the word2vec representation [20]. It is a simplification of the Noise Contrastive Estimation objective [9]. Negative sampling for training the embedding can be posed as a set of binary classification problems operating on two classes: presence of signal or absence (noise). In the context of word embeddings, the presence of a context word is treated as the positive class, and the negative class is randomly sampled from the unigram distribution of the vocabulary. The negative sampling loss for the subword model can be expressed using the binary logistic loss as:

$$\log\left(1 + e^{-s(w_i, w_o)}\right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}\left[\log\left(1 + e^{s(w_i, w_k)}\right)\right] \tag{4}$$

where $s(w_i, w_o) = \sum_{g \in G_{w_i}} \mathbf{z}_g^{\top} \mathbf{v}_{w_o}$, $w_i$ is the input word, $w_o$ is the output word, $G_{w_i}$ is the set of n-gram character subwords for the word $w_i$, $\mathbf{z}_g$ is the vector representation of the character n-gram subword $g$, and $\mathbf{v}_{w_o}$ is the output vector representation of the target word $w_o$. $K$ is the number of negative samples to be drawn from the noise distribution $P_n(w)$. The noise distribution is chosen to be the unigram distribution of words in the vocabulary raised to the power $3/4$, as suggested in [20]. Note that for confusion2vec the input word $w_i$ and target word $w_o$ are derived according to equations 1 and 3 for implementing the respective training configurations.
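To make the loss concrete, the following sketch evaluates the binary logistic negative-sampling loss for one (input, target) pair, with the input word scored through the sum of its subword vectors. It is a plain-Python illustration under our own naming (`softplus`, `input_vector`, `negative_sampling_loss`, `noise_distribution`); the expectation over the noise distribution is replaced by a given list of already sampled negatives, and the 0.75 exponent is the value suggested for word2vec in [20].

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def input_vector(subword_vectors):
    """Sum of the n-gram vectors of the input word."""
    return [sum(column) for column in zip(*subword_vectors)]


def negative_sampling_loss(subword_vectors, target_vec, negative_vecs):
    """Logistic loss for one positive target plus the sampled negatives."""
    h = input_vector(subword_vectors)
    loss = softplus(-dot(h, target_vec))            # positive class
    loss += sum(softplus(dot(h, neg)) for neg in negative_vecs)  # noise class
    return loss


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and renormalised."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


# Toy 2-D vectors, chosen by hand for illustration only.
subwords = [[1.0, 0.0], [0.0, 1.0]]
loss_good = negative_sampling_loss(subwords, [10.0, 10.0], [[-10.0, -10.0]])
loss_flat = negative_sampling_loss(subwords, [0.0, 0.0], [])
```

The loss is small when the target scores high and the negatives score low, and equals log 2 for an uninformative zero score with no negatives.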

4 Evaluations

We evaluate the proposed word embeddings along two aspects: first, by assessing the useful, meaningful information embedded in the word vector representation, and second, by applying the representation to a realistic task of spoken language intent detection.

4.1 Analogy and Similarity Tasks

For evaluating the inherent semantic and syntactic knowledge of the word embeddings, we employ two tasks: (i) the semantic-syntactic analogy task, and (ii) the word similarity task. The word analogy task was first proposed in [18] and comprises word pair analogy questions of the form $W_1$ is to $W_2$ as $W_3$ is to $W_4$. The analogy is answered correctly if $\mathrm{vec}(W_2) - \mathrm{vec}(W_1) + \mathrm{vec}(W_3)$ is most similar to $\mathrm{vec}(W_4)$. Another prominent approach is the word similarity task, where the rank-correlation between the cosine similarities of a set of word vector pairs and human annotated word similarity scores is assessed [24]. For the word similarity task, we use the WordSim-353 database [7], consisting of 353 pairs of words annotated with a score from 1 to 10 depending on the magnitude of word similarity as perceived by humans.
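The analogy evaluation can be illustrated with the usual vector-offset method and cosine similarity. The sketch below is an assumption-laden toy: the 2-D vectors are chosen by hand, the function names are ours, and the `top_k` argument only mimics the top-1 and top-2 evaluation settings mentioned with the result tables.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def solve_analogy(vectors, w1, w2, w3, top_k=1):
    """Answer 'w1 is to w2 as w3 is to ?' with the vector-offset method."""
    query = [b - a + c for a, b, c in zip(vectors[w1], vectors[w2], vectors[w3])]
    candidates = [w for w in vectors if w not in (w1, w2, w3)]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:top_k]


# Hypothetical 2-D vectors, chosen by hand for illustration only.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

answer = solve_analogy(toy, "man", "woman", "king")
```

An analogy question counts as correct when the expected fourth word appears among the returned candidates.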

For assessing the word acoustic ambiguity (similarity) information, we conduct the acoustic analogy task, the semantic&syntactic-acoustic analogy task and the acoustic similarity task proposed in [26]. The acoustic analogy task comprises word pair analogies compiled using homophones, which answer questions of the form: $W_1$ sounds similar to $W_2$ as $W_3$ sounds similar to $W_4$. The acoustic analogy task is designed to assess the ambiguity information embedded in the word vector space [26]. The semantic&syntactic-acoustic analogy task is designed to assess semantic, syntactic and acoustic ambiguity information simultaneously. The analogies are formed by replacing certain words by their homophone alternatives in the original semantic and syntactic analogy task [26]. The acoustic word similarity task is analogous to the word similarity task, i.e., it contains word pairs which are rated on their acoustic similarity based on normalized phone edit distances. A value of 1.0 refers to two words sounding identical and 0.0 refers to the word pairs being acoustically dissimilar. More details regarding the evaluation methodologies are available in [26]. The evaluation datasets are made publicly available.
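One plausible reading of an acoustic similarity score based on normalized phone edit distance is sketched below. The normalization by the longer phone sequence, the function names, and the ARPAbet-style transcriptions without stress markers are all assumptions for illustration; the exact definition used for the dataset is given in [26].

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def acoustic_similarity(phones_a, phones_b):
    """1.0 for identical pronunciations, 0.0 for fully dissimilar ones.
    Normalising by the longer sequence is an assumption."""
    longest = max(len(phones_a), len(phones_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(phones_a, phones_b) / longest


# Illustrative phone sequences (stress markers omitted).
see, sea = ["S", "IY"], ["S", "IY"]
want, wand = ["W", "AA", "N", "T"], ["W", "AA", "N", "D"]

sim_homophones = acoustic_similarity(see, sea)
sim_confusable = acoustic_similarity(want, wand)
sim_unrelated = acoustic_similarity(see, want)
```

Homophones such as “see” and “sea” land at the top of the scale, while “want” and “wand” differ in a single phone.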

| Model | Variant | S&S Analogy | Acoustic Analogy | S&S-Acoustic Analogy | Average Analogy Accuracy | Word Similarity | Acoustic Similarity |
|---|---|---|---|---|---|---|---|
| Google W2V [20] | | 61.42% | 0.9% | 16.99% | 26.44% | 0.6893 | -0.3489 |
| In-domain W2V | | 59.17% | 0.6% | 8.15% | 22.64% | 0.4417 | -0.4377 |
| fastText [3] | | 75.93% | 0.46% | 17.40% | 31.26% | 0.7361 | -0.3659 |
| Confusion2Vec 1.0 (word) [26] | C2V-a | 63.97% | 16.92% | 43.34% | 41.41% | 0.5228 | 0.6200 |
| Confusion2Vec 1.0 (word) [26] | C2V-c | 65.45% | 27.33% | 38.29% | 43.69% | 0.5798 | 0.5825 |
| Confusion2Vec 2.0 (subword) | C2V-a | 56.74% | 50.79% | 44.67% | 50.73% | 0.3181 | 0.8108 |
| Confusion2Vec 2.0 (subword) | C2V-c | 56.87% | 51.00% | 44.98% | 50.95% | 0.2893 | 0.8106 |

Table 1: Results: Different proposed models
C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy.
For the analogy tasks: the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [26]). For the similarity tasks: all the correlations (Spearman’s) are statistically significant.

4.2 Spoken Language Intent Classification

We also evaluate the efficacy of the proposed word representation models on the task of spoken language intent classification. A recurrent neural network (RNN) based classifier is employed, with its embedding layer initialized with the proposed word vectors. Classification experiments are conducted by training the recurrent neural network on (i) clean manual transcripts, and (ii) noisy ASR transcripts, with evaluations on both manual and ASR transcripts. The classification error rates of intent detection are used to assess the word vector representations.

5 Analogy & Similarity Tasks

5.1 Database

The Fisher English Training Part 1, Speech (LDC2004S13) and Fisher English Training Part 2, Speech (LDC2005S13) corpora [5] are used for both training the ASR and the confusion2vec 2.0 embeddings. The choice of database is based on [26] for direct comparison purposes. The corpus consists of spontaneous telephonic conversations between 11,972 native English speakers. The speech data amounts to approximately 1,915 hours sampled at 8 kHz. The corpus is divided into 3 parts for training (1,905 hours, 1,871,731 utterances), development (5 hours, 5000 utterances) and test (5 hours, 5000 utterances). Overall, the transcripts contain approximately 20.8 million word tokens and vocabulary size of 42,150.

5.2 Experimental Setup

The experimental setup is kept identical to [26] for direct comparison. A brief description of the setup follows.

5.2.1 Automatic speech recognition

A hybrid HMM-DNN based acoustic model is trained on the train subset of the speech corpus using the KALDI speech recognition toolkit [23]. 40-dimensional mel frequency cepstral coefficient (MFCC) features are extracted along with i-vector features for training the acoustic model. The i-vector features are used to provide speaker and channel characteristics to aid acoustic modeling. The DNN acoustic model comprises 7 layers with P-norm non-linearity (p=2), each with 350 units [35]. The DNN is trained using 5 MFCC frame splices with left and right context of 2 to classify among 7979 Gaussian mixtures with a stochastic gradient descent optimizer. The CMU pronunciation dictionary is used as the word-pronunciation transcription lexicon. A tri-gram language model is trained on the training subset of the Fisher English Speech Corpus. The ASR yields word error rates (WER) of 16.57% and 18.12% on the development and test datasets, respectively. Lattices are derived during ASR decoding with a decoding beam size of 11 and a lattice beam size of 6. The lattices are converted to confusion networks with the minimum Bayes risk criterion [34] for training the confusion2vec embeddings. The resulting confusion networks have a vocabulary size of 41,274 and 69.5 million words, with an average of 3.34 alternative (ambiguous) words for each edge in the graph.

5.2.2 Confusion2Vec 2.0

In order to train the embedding, the most frequent words are sub-sampled with a rejection threshold, as suggested in [20]. Also, a minimum frequency threshold of 5 is set and rarely occurring words are pruned from the vocabulary. The context window sizes for both the acoustic ambiguity and contextual dimensions are uniformly sampled between 1 and 5. The dimension of the word vectors is set to 300. The number of negative samples for negative sampling is chosen to be 64. The learning rate is set to 0.01 and the model is trained for a total of 15 epochs using stochastic gradient descent. All the hyper-parameters are empirically chosen for optimal performance on the development set. We implemented confusion2vec 2.0 by modifying the source code of fastText [3]. We make our source code and trained models publicly available.
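The sub-sampling of frequent words and the uniform sampling of the context window mentioned in this setup can be sketched as follows. The discard rule is the one described for word2vec in [20]; the function names, the example frequencies and the threshold value of 1e-4 used below are assumptions for illustration and not the settings of this work.

```python
import math
import random


def discard_probability(relative_freq, threshold):
    """Sub-sampling rule from [20]: P(discard) = 1 - sqrt(t / f), floored at 0."""
    return max(0.0, 1.0 - math.sqrt(threshold / relative_freq))


def subsample(tokens, relative_freq, threshold, rng):
    """Randomly drop frequent tokens; rare tokens are always kept."""
    return [
        w for w in tokens
        if rng.random() >= discard_probability(relative_freq[w], threshold)
    ]


def sample_window(rng, max_window=5):
    """Context window size drawn uniformly from 1..max_window."""
    return rng.randint(1, max_window)


rng = random.Random(0)
# Hypothetical relative frequencies and threshold, for illustration only.
freq = {"the": 1e-2, "lattice": 1e-6}
kept = subsample(["the", "lattice"] * 50, freq, threshold=1e-4, rng=rng)
windows = [sample_window(rng) for _ in range(100)]
```

With these illustrative numbers a very frequent word is discarded about 90% of the time, whereas a rare word is never discarded.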

| Model | Variant | S&S Analogy | Acoustic Analogy | S&S-Acoustic Analogy | Average Analogy Accuracy | Word Similarity | Acoustic Similarity |
|---|---|---|---|---|---|---|---|
| Google W2V [20] | | 61.42% | 0.9% | 16.99% | 26.44% | 0.6893 | -0.3489 |
| In-domain W2V | | 59.17% | 0.6% | 8.15% | 22.64% | 0.4417 | -0.4377 |
| fastText [3] | | 75.93% | 0.46% | 17.40% | 31.26% | 0.7361 | -0.3659 |
| Confusion2Vec 1.0 (word) [26] | C2V-1 + C2V-a | 67.03% | 25.43% | 40.36% | 44.27% | 0.5102 | 0.7231 |
| Confusion2Vec 1.0 (word) [26] | C2V-1 + C2V-c | 70.84% | 35.25% | 35.18% | 47.09% | 0.5609 | 0.6345 |
| Confusion2Vec 1.0 (word) [26] | C2V-1 + C2V-c (JT) | 65.88% | 49.4% | 41.51% | 52.26% | 0.5379 | 0.7717 |
| Confusion2Vec 2.0 (subword) | fastText + C2V-a | 76.10% | 22.67% | 49.15% | 49.31% | 0.5744 | 0.7577 |
| Confusion2Vec 2.0 (subword) | fastText + C2V-c | 76.16% | 22.56% | 49.12% | 49.12% | 0.5732 | 0.7573 |

Table 2: Results: Different proposed models
C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy.
For the analogy tasks: the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [26]). For the similarity tasks: all the correlations (Spearman’s) are statistically significant.

5.3 Results

Table 1 lists the results in terms of accuracies for the analogy tasks and rank-correlations for the similarity tasks. The first two rows correspond to results with the original word2vec. The Google W2V model is the open source model released by Google, trained on the 100 billion word Google News dataset. We also train an in-domain version of the original word2vec on the Fisher English corpus for fair comparison with the confusion2vec models, referred to as “In-domain W2V” in Table 1. The fastText model employed is the open source model released by Facebook, trained on Wikipedia dumps with a vocabulary size of more than 2.5 million words. The middle two rows of the table correspond to confusion2vec embeddings without subword encoding and are taken directly from [26]. The bottom two rows correspond to the results obtained with subword encoding. Note that confusion2vec 1.0 is initialized on the Google word2vec model for better convergence. The confusion2vec 2.0 model is initialized on the fastText model to maintain compatibility with subword encodings. We normalize the vocabulary for all the experiments, meaning the same vocabulary is used to evaluate the analogy and similarity tasks to allow for fair comparisons.

Comparing the baseline word2vec and fastText embeddings to confusion2vec, we observe that the baseline embeddings perform well on the semantic&syntactic analogy task and provide good positive correlation on the word similarity task, as expected. However, they perform poorly on the acoustic analogy task and the semantic&syntactic-acoustic analogy task, and give a small negative correlation on the acoustic similarity task. All the confusion2vec models perform relatively well on the semantic&syntactic analogy task and the word similarity task, but more importantly, yield high accuracies on the acoustic analogy and semantic&syntactic-acoustic analogy tasks and provide high positive correlation on the acoustic similarity task.

Specifically with Confusion2Vec 2.0, among the analogy tasks, we observe that subword encoding enhances the acoustic ambiguity modeling. For the acoustic analogy task, we find a relative improvement of up to 46.41% over the non-subword counterpart. Moreover, even for the semantic&syntactic-acoustic analogy task, we observe improvements with subword encoding. However, we find a small reduction in performance on the original semantic and syntactic analogy task. Regardless of this small dip, the accuracies remain acceptable in comparison to the in-domain word2vec model. Overall, taking the average accuracy over all the analogy tasks, we obtain an increase of approximately 16.62% relative over the non-subword confusion2vec models.
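As a quick arithmetic check, the average-accuracy gain quoted above appears consistent with comparing the C2V-c rows of Table 1 (43.69% for the word model against 50.95% for the subword model). The helper name `relative_improvement` is ours; which rows were actually compared is an inference from the table, not something the text states.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Average analogy accuracy, C2V-c rows of Table 1 (word vs. subword model).
avg_gain = relative_improvement(43.69, 50.95)
print(round(avg_gain, 2))
```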

Investigating the results for the similarity tasks, we find significant and high correlation of 0.81 for acoustic similarity task with the subword encoding. Again, a small degradation is observed with the word similarity task obtaining a correlation of 0.3181 against the 0.4417 of the in-domain baseline word2vec model. Overall, the results of the analogy and the similarity tasks suggest the subword encoding greatly enhances the ambiguity modeling of confusion2vec.

Figure 2: 2-D plots of selected word vectors portraying semantic, syntactic and acoustic relationships after dimension reduction using PCA. Panels: (a) fastText, (b) Confusion2Vec 2.0: C2V-a.
The blue lines indicate semantic relationships, blue ellipses indicate syntactic relationships, red lines indicate acoustic-semantic/syntactic relations and red ellipses indicate acoustic ambiguity word relations.

5.4 Model Concatenation

Further, the confusion2vec model can be concatenated with other word embedding models to produce a new word vector space that can result in better representations, as seen in [26]. Table 2 lists the results of the concatenated models. For the previous, non-subword version of confusion2vec, the vector models are concatenated with the word2vec model trained on the ASR output transcripts (C2V-1). The choice of using C2V-1 instead of the Google W2V for concatenation was based on empirical findings. In contrast, to maintain compatibility with subword encoding, the confusion2vec 2.0 models are concatenated with fastText models.
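Concatenation of two embedding models amounts to stacking the two vectors of each word. The sketch below assumes plain concatenation without any re-normalization and restricts the result to words present in both models; both choices, the function name and the toy vectors are assumptions for illustration.

```python
def concatenate_embeddings(first, second):
    """Concatenate per-word vectors for words present in both models."""
    shared = sorted(set(first) & set(second))
    return {w: list(first[w]) + list(second[w]) for w in shared}


# Hypothetical low-dimensional vectors, for illustration only.
semantic_model = {"sea": [1.0, 0.0], "see": [0.0, 1.0], "boy": [1.0, 1.0]}
acoustic_model = {"sea": [0.5], "see": [0.25]}

combined = concatenate_embeddings(semantic_model, acoustic_model)
```

The dimensionality of the combined space is the sum of the dimensionalities of the two models, so the semantic-syntactic and acoustic ambiguity subspaces remain separately addressable.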

First, comparing the non-concatenated versions in Table 1 with the concatenated versions in Table 2 for the non-subword models, we observe an improvement of approximately 7.22% relative in average analogy accuracy after concatenation. We do not observe a significant improvement in average analogy accuracy with the subword based models after concatenation. However, we observe different dynamics between the acoustic ambiguity and the semantic and syntactic subspaces. Concatenation results in improved semantic and syntactic evaluations at the expense of a drop in accuracy on the acoustic analogy task. We also note improvements (9.27% relative) in the semantic&syntactic-acoustic analogy task after concatenation, confirming the meaningful co-existence of both ambiguity and semantic-syntactic relations. Moreover, the word similarity task also yields better correlation after concatenation.

Next, comparing confusion2vec 1.0 (non-subword) with the subword version, we observe significant improvements in the semantic&syntactic analogy task (7.51% relative) as well as the semantic&syntactic-acoustic analogy task (21.78% relative). Moreover, the subword models outperform the non-subword version in both similarity tasks. The subword models slightly under-perform in the acoustic analogy task, but more crucially, they significantly outperform the Google W2V and fastText baselines.

Further, the concatenated models can be fine-tuned and optimized to exploit additional gains, as found in [26]. The row corresponding to Confusion2Vec 1.0, C2V-1 + C2V-c (JT), is the best result obtained in [26], which involves 2 passes. Confusion2Vec 2.0 with subword modeling and single-pass training gives comparable performance to the 2-pass approach. Thus, we skip the 2-pass approach with the subword model in favor of ease of training and reproducibility.

5.5 Embedding Visualization

Figure 2 illustrates the word vector spaces of the fastText embeddings and the proposed C2V-a embeddings after dimension reduction using principal component analysis. We observe meaningful interactions between the semantic&syntactic subspace and the acoustic ambiguity subspace. For example, in Figure 2(b), the vectors “boy”-“prince”, “see”-“seeing”, “read”-“write”, “uncle”-“aunt” are similar to the acoustically ambiguous vectors “boy”-“prints”, “sea”-“seeing”, “read”-“write”, “uncle”-“ant”, respectively, which is not the case in Figure 2(a) with fastText embeddings. Such vector relationships can be exploited for downstream spoken language applications by providing crucial acoustic ambiguity information to recover from speech recognition errors. Also note that acoustically ambiguous words such as “prinz”, “prince” and “prints” are found clustered together. Another important observation is that the word “prinz”, which is out-of-vocabulary in English, has an orphaned representation under fastText in Figure 2(a). However, “prinz” finds a meaningful representation on the basis of its acoustic signature in the proposed Confusion2vec model, as seen in Figure 2(b), i.e., “prinz” is clustered together with the acoustically similar words “prince” and “prints”, and the vector “boy”-“prinz” is similar to the vector “boy”-“prince”. The occurrence of out-of-vocabulary words such as “prinz” is commonplace with end-to-end ASR systems that output characters and are prone to errors. Note that out-of-vocabulary words such as “prinz” cannot be represented by typical word embeddings such as word2vec, GloVe, etc., which are hence sub-optimal for use with many end-to-end ASR systems.

6 Spoken Language Intent Detection

In this section, we apply the proposed word vector embedding to the task of spoken language intent detection. Spoken language intent detection is the process of decoding the speaker’s intent in contexts involving voice commands, call routing and other human-computer interactions. Many spoken language technologies use an ASR to convert the speech signal to text, a process prone to errors caused, for example, by varying speakers and noise environments. The erroneous ASR outputs in turn degrade the downstream intent classification. Few efforts have focused on handling the errors of the ASR to make the subsequent intent detection process more robust. These efforts often involve training the intent classification systems on noisy ASR transcripts. The downside of training intent classifiers on ASR transcripts is that the systems are limited by the amount of speech data available. Moreover, varying speech signal conditions and the use of different ASR models make such classifiers non-optimal and less practical. In many scenarios, speech data is not available to enable adaptation on ASR transcripts.

In our previous work [27], we applied the non-subword version of Confusion2vec to the task of spoken language intent detection. We demonstrated that Confusion2vec performs as well as popular word embeddings like word2vec and GloVe on clean manual transcripts, giving comparable classification error rates. More importantly, we illustrated the robustness of the Confusion2vec embeddings when evaluated on noisy ASR transcripts. Confusion2vec gives significantly better accuracies (up to 20% relative improvement) when evaluated on ASR transcripts compared to word2vec and GloVe embeddings and to state-of-the-art models involving more complex neural network intent classification architectures. Moreover, we showed that Confusion2vec suffers the least degradation between clean and ASR transcripts. We also found that Confusion2vec consistently provides the best classification rates even when the intent classifier is trained on ASR transcripts. The experiments indicated that the loss in accuracy between training the intent classifier on clean versus ASR transcripts is reduced to 0.89% from 2.57% absolute. Overall, the results illustrate that Confusion2vec has inherent knowledge of acoustic ambiguity (similarity) word relations, which correlate with ASR errors and which the classifier uses to recover from certain errors more efficiently. In this section, we incorporate the Confusion2vec 2.0 embeddings, with their inherent knowledge of acoustic ambiguity, to allow robust intent classification.

6.1 Database

We conduct experiments on the Airline Travel Information Systems (ATIS) benchmark dataset [12]. The dataset consists of humans making flight related inquiries with an automated answering machine, with audio recorded and its transcripts manually annotated. ATIS consists of 18 intent categories. The dataset is divided into train (4478 samples), development (500 samples) and test (893 samples) subsets, consistent with previous works [27, 11, 8]. For ASR evaluations, the audio recordings are down-sampled from 16 kHz to 8 kHz and then decoded using the ASR setup described in section 5.2.1 using the audio mappings. The ASR achieves a WER of 18.54% on the ATIS test set.
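For reference, the word error rate (WER) quoted above is conventionally the word-level edit distance between the reference and the ASR hypothesis, divided by the number of reference words. The sketch below is a generic illustration for a single utterance, not the scoring tool used in this work; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two substituted words out of five reference words.
print(word_error_rate("show me flights to boston", "show me flight to austin"))
```

A corpus-level WER is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.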

6.2 Experimental Setup

For intent classification we adopt a simple RNN architecture identical to [27], to allow for direct comparison. The architecture of the neural network is intentionally kept simple for effective inference of the efficacy of the proposed embedding word features. The classifier is comprised of an embedding layer followed by a single layer of bi-directional recurrent neural network (RNN) with long short-term memory (LSTM) units which is followed by a linear dense layer with softmax function to output a probability distribution across all the intent categories. The embedding layer is fixed throughout the training except for the randomly initialized embeddings where the embedding is estimated on the in-domain data specific to the task of intent detection.

The intent classification models are trained on the 4478 samples of training subset and the hyper-parameters are tuned on the development set. We choose the best set of hyper-parameters yielding the best results on the development set and then apply it on the unseen held-out test subset of both the manual clean transcripts and the ASR transcripts and report the results. For training we treat each utterance as a single sample (batch size = 1). The hyper-parameter space we experiment are as follows: the hidden dimension size of the LSTM is tuned over , the learning rate over , the dropout is tuned over . The Adam optimizer is employed for optimization and trained for a total of 50 epochs with early stopping when the loss on the development set doesn’t improve for 5 consecutive epochs.
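The stopping rule described above (at most 50 epochs, stop once the development loss has not improved for 5 consecutive epochs) can be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: the function name `early_stopping_epoch` is hypothetical, the development losses are passed in as a precomputed list instead of being produced by a real training loop, and "improve" is taken to mean strictly lower than the best loss seen so far.

```python
def early_stopping_epoch(dev_losses, max_epochs=50, patience=5):
    """Return (last_epoch_run, best_epoch), both 1-indexed.

    dev_losses: list of development-set losses, one per epoch.
    Training stops after `patience` consecutive epochs without a new
    best loss, or after `max_epochs`, whichever comes first.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(dev_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return last_epoch, best_epoch


# Made-up loss curve: the best value is at epoch 3, training stops at epoch 8.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8, 0.6]))
```

In practice the model weights from the best epoch would be kept and used for evaluation on the held-out test subset.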

6.3 Baselines

We include results from several baseline systems for providing comparisons of Confusion2Vec 2.0 with the popular context-free word embeddings, contextual embeddings, popular established NLU systems and the current state-of-the-art.

  1. Context-Free Embeddings: GloVe [21], skip-gram word2vec [20] and fastText [3] word representations are employed. They are referred to as context-free embeddings since the word representations are static irrespective of the context.

  2. ELMo: Peters et al. [22] proposed deep contextualized word representations based on a character-based deep bidirectional language model trained on a large text corpus. The models effectively model the syntax and semantics of the language across varying linguistic contexts. Unlike context-free embeddings, ELMo embeddings have varying representations for each word depending on the word’s context. We employ the original model trained on the 1 Billion Word Benchmark with 93.6 million parameters. For intent classification we add a single bi-directional LSTM layer with attention for multi-task joint intent and slot predictions.

  3. BERT: Devlin et al. [6] introduced BERT bidirectional contextual word representations based on the self-attention mechanism of Transformer models. BERT models make use of masked language modeling and next sentence prediction to model language. Similar to ELMo, the word embeddings are contextual, i.e., they vary according to the context. We employ the “bert-base-uncased” model with 12 layers of 768 dimensions each, trained on BookCorpus and the English Wikipedia corpus. For intent classification we add a single bi-directional LSTM layer with attention for multi-task joint intent and slot predictions.

  4. Joint SLU-LM: Liu and Lane [17] employed joint modeling of the next word prediction along with intent and slot labeling. The unidirectional RNN model updates intent states for each word input and uses it as context for slot labeling and language modeling.

  5. Attn. RNN Joint SLU: Liu and Lane [16] proposed attention based encoder-decoder bidirectional RNN model in a multi-task model for joint intent and slot-filling tasks. A weighted average of the encoder bidirectional LSTM hidden states provides information from parts of the input word sequence which is used together with time aligned encoder hidden state for the decoder to predict the slot labels and intent.

  6. Slot-Gated Attn.: Goo et al. [8] introduced a slot-gated mechanism which introduces additional gate to improve slot and intent prediction performance by leveraging intent context vector for slot filling task.

  7. Self Attn. SLU: Li et al. [15] proposed self-attention model with gate mechanism for joint learning of intent classification and slot filling by utilizing the semantic correlation between slots and intents. The model estimates embeddings augmented with intent information using self attention mechanism which is utilized as a gate for slot filling task.

  8. Joint BERT: Chen et al. [4] proposed to use BERT embeddings for joint modeling of intent and slot-filling. The pre-trained BERT embeddings are fine-tuned for (i) sentence prediction task - intent detection, and (ii) sequence prediction task - slot filling. The Joint BERT model lacks the bi-directional LSTM layer in comparison to the earlier baseline BERT based model.

  9. SF-ID Network: Haihong et al. [10] introduced a bi-directional interrelated model for joint modeling of intent detection and slot-filling. An iteration mechanism is proposed where the SF subnet introduces the intent information to slot-filling task while the ID-subnet applies the slot information to intent detection task. For the task of slot-filling a conditional random field layer is used to derive the final output.

  10. ASR Robust ELMo: Huang and Chen [13] proposed ASR robust contextualized embeddings for intent detection. ELMo embeddings are fine-tuned with a novel loss function which minimizes the cosine distance between the acoustically confused words found in ASR confusion networks. Two techniques based on supervised and unsupervised extraction of word confusions are explored. The fine-tuned contextualized embeddings are then utilized for spoken language intent detection.

| Group | Model | Reference | ASR | Δ |
|---|---|---|---|---|
| Context-Free Embeddings | Random | 2.69 | 10.75 | 8.06 |
| | GloVe [21] | 1.90 | 8.17 | 6.27 |
| | Word2Vec [20] | 2.69 | 8.06 | 5.37 |
| | fastText [3] | 1.90 | 8.40 | 6.50 |
| | Joint SLU-LM [17] † | 1.90 | 9.41 | 7.51 |
| | Attn. RNN Joint SLU [16] † | 1.79 | 8.06 | 6.27 |
| | Slot-Gated Attn. [8] † | 3.92 | 10.64 | 6.72 |
| | Self Attn. SLU [15] † | 2.02 | 9.18 | 7.16 |
| | SF-ID Network [10] † | 3.14 | 10.53 | 7.39 |
| | C2V 1.0 [26] | 2.46 | 6.38 | 3.92 |
| Contextual Embeddings | ELMo [22] † | 1.46 | 7.05 | 5.59 |
| | BERT [6] † | 1.12 | 6.16 | 5.04 |
| | Joint BERT [4] † | 2.46 | 7.73 | 5.27 |
| | ASR Robust ELMo (unsup.) [13] | 3.24 | 5.26 | 2.02 |
| | ASR Robust ELMo (sup.) [13] | 3.46 | 5.03 | 1.57 |
| Proposed Context-Free Embeddings | C2V-c 2.0 | 3.36 | 5.82 | 2.46 |
| | C2V-a 2.0 | 2.46 | 4.37 | 1.91 |
| | fastText + C2V-c 2.0 | 1.79 | 4.70 | 2.91 |
| | fastText + C2V-a 2.0 | 1.90 | 5.04 | 3.14 |

Table 3: Results: Model trained on clean Reference: Classification Error Rates (CER, in %) for Reference and ASR Transcripts.
Δ is the absolute degradation of the model from clean to ASR. C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Table 1 and 2.
† indicates joint modeling of intent and slot-filling.

6.4 Results

In this section, we conduct experiments by training models on (i) clean human annotations and (ii) noisy ASR transcriptions.

6.4.1 Training on Clean Transcripts

Table 3 lists the results of the intent detection in terms of classification error rates (CER). The “Reference” column corresponds to results on human transcribed ATIS audio and the “ASR” column corresponds to the evaluations on the noisy speech recognition transcripts. Firstly, evaluating on the clean Reference transcripts, we observe that confusion2vec 2.0 with subword encoding achieves the third best performance. The best performing confusion2vec 2.0 achieves a CER of 1.79%. Among the different versions of the proposed subword based confusion2vec, we find that the concatenated versions are slightly better. We believe this is because the concatenated models exhibit better semantic and syntactic relations (see Table 1 and 2) compared to the non-concatenated ones. Among the baseline models, contextual embeddings such as BERT and ELMo give the best CER. Note that the proposed confusion2vec embeddings are context-free and are able to outperform other context-free embedding models such as GloVe, word2vec and fastText.

Secondly, evaluating the performance on the noisy ASR transcripts, we find that all the subword based confusion2vec 2.0 models outperform the popular word vector embeddings by a big margin. The subword-confusion2vec gives an improvement of approximately 45.78% relative to the best performing context-free word embeddings. The proposed embeddings also improve over the contextual embeddings including BERT and ELMo (relative improvement of 29.06%). Moreover, the results are also an improvement over the non-subword confusion2vec word vectors (31.50% improvement). Comparisons between the different versions of the proposed confusion2vec show that the intra-confusion configuration yields the lowest CER. The best result with the proposed model outperforms the state-of-the-art (ASR Robust ELMo [13]), reducing the CER by a relative 13.12%. Inspecting the degradation (the drop in performance between the clean and ASR evaluations), we find that all the confusion2vec 2.0 models with subword information undergo low degradation while giving the best CER, thereby re-affirming their robustness to noise in transcripts. This confirms our initial hypothesis that the subword encoding is better able to represent the acoustic ambiguities in the human language.
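The relative improvements quoted in this paragraph follow from the ASR column of Table 3 as (baseline CER − proposed CER) / baseline CER. The short sketch below recomputes them with the best proposed model (C2V-a 2.0, CER 4.37%) as the proposed system; the helper name `relative_improvement` is hypothetical, and the pairing of each percentage with a baseline row is inferred from the table values.

```python
def relative_improvement(baseline_cer, proposed_cer):
    """Relative CER reduction of the proposed system, in percent."""
    return 100.0 * (baseline_cer - proposed_cer) / baseline_cer


# ASR-column CERs (%) taken from Table 3.
proposed_cer = 4.37  # C2V-a 2.0
baselines = {
    "Word2Vec (best context-free baseline)": 8.06,
    "BERT (best contextual baseline)": 6.16,
    "C2V 1.0 (non-subword)": 6.38,
    "ASR Robust ELMo (sup.)": 5.03,
}

for name, cer in baselines.items():
    print(f"{name}: {relative_improvement(cer, proposed_cer):.2f}%")
```

Rounded to two decimals, these reproduce the 45.78%, 29.06%, 31.50% and 13.12% figures stated above.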

6.4.2 Training on Noisy ASR Transcripts

Table 4 presents the results obtained by training models on the ASR transcripts and evaluating them on the ASR transcripts. Here we omit all the joint intent and slot-filling baseline models, since training on ASR transcripts needs an aligned set of slot labels to account for insertion, substitution and deletion errors, which is out of scope for this study. We note that the confusion2vec models give significantly lower CER. The subword based confusion2vec models also provide improvements over the non-subword based confusion2vec model (21.28% improvement). Comparing the results in Table 3 and Table 4, we would like to highlight that the subword-confusion2vec model gives a minimum CER of 4.37% when trained on clean transcripts, which is much better than the CER obtained by popular word embeddings like word2vec, GloVe and fastText even when they are trained on the ASR transcripts (15.15% better relatively). These results indicate that the subword-confusion2vec models can eliminate the need for re-training natural language understanding and processing algorithms on ASR transcripts for robust performance.

| Model | WER % | CER % |
|---|---|---|
| Random | 18.54 | 5.15 |
| GloVe [21] | 18.54 | 6.94 |
| Word2Vec [20] | 18.54 | 5.49 |
| Schumann and Angkititrakul [25] | 10.55 | 5.04* |
| C2V 1.0 | 18.54 | 4.70 |
| C2V-c 2.0 | 18.54 | 4.82 |
| C2V-a 2.0 | 18.54 | 4.26 |
| fastText + C2V-c 2.0 | 18.54 | 3.70 |
| fastText + C2V-a 2.0 | 18.54 | 4.26 |

Table 4: Results: Model trained and evaluated on ASR transcripts.
C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Table 1 and 2.
* We don’t domain-constrain, optimize or re-score our ASR, as in [25].

7 Conclusion

In this paper, we proposed the use of subword encoding to model acoustic ambiguity information and augment word vector representations alongside the semantics and syntax of the language. Each word in the language is represented as a sum of its constituent character n-gram subwords. The advantages of the subwords are confirmed by evaluating the proposed models on various word analogy and word similarity tasks designed to assess the acoustic ambiguity, semantic and syntactic knowledge inherent in the models. Finally, the proposed subword models are applied to the task of spoken language intent detection. The results of the intent classification system suggest that the proposed subword confusion2vec models greatly enhance classification performance when evaluated on noisy ASR transcripts. The results highlight that subword-confusion2vec models are robust and domain-independent and do not need re-training of the classifier on ASR transcripts.

In the future, we plan to model ambiguity information using deep contextual modeling techniques such as BERT. We believe bidirectional information modeling with attention can further enhance ambiguity modeling. On the application side, we plan to implement and assess the effect of using Confusion2vec models for a wide range of natural language understanding and processing applications such as speech translation, dialogue tracking etc.


References

  • [1] J. Aydelott and E. Bates (2004) Effects of acoustic distortion and semantic context on lexical access. Language and cognitive processes 19 (1), pp. 29–56.
  • [2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155.
  • [3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
  • [4] Q. Chen, Z. Zhuo, and W. Wang (2019) Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.
  • [5] C. Cieri, D. Miller, and K. Walker (2004) The fisher corpus: a resource for the next generations of speech-to-text. In LREC, Vol. 4, pp. 69–71.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [7] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing search in context: the concept revisited. In Proceedings of the 10th international conference on World Wide Web, pp. 406–414.
  • [8] C. Goo, G. Gao, Y. Hsu, C. Huo, T. Chen, K. Hsu, and Y. Chen (2018) Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Vol. 2, pp. 753–757.
  • [9] M. U. Gutmann and A. Hyvärinen (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The journal of machine learning research 13 (1), pp. 307–361.
  • [10] E. Haihong, P. Niu, Z. Chen, and M. Song (2019) A novel bi-directional interrelated model for joint intent detection and slot filling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5467–5471.
  • [11] D. Hakkani-Tür, G. Tür, A. Celikyilmaz, Y. Chen, J. Gao, L. Deng, and Y. Wang (2016) Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In Interspeech, pp. 715–719.
  • [12] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
  • [13] C. Huang and Y. Chen (2020) Learning asr-robust contextualized embeddings for spoken language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8009–8013.
  • [14] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister (2016) LatticeRnn: recurrent neural networks over lattices. In Interspeech, pp. 695–699.
  • [15] C. Li, L. Li, and J. Qi (2018) A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3824–3833.
  • [16] B. Liu and I. Lane (2016) Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, pp. 685–689.
  • [17] B. Liu and I. Lane (2016) Joint online spoken language understanding and language modeling with recurrent neural networks. arXiv preprint arXiv:1609.01462.
  • [18] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [19] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
  • [21] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
  • [22] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • [23] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding.
  • [24] T. Schnabel, I. Labutov, D. Mimno, and T. Joachims (2015) Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 298–307.
  • [25] R. Schumann and P. Angkititrakul (2018) Incorporating asr errors with attention-based, jointly trained rnn for intent detection and slot filling. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6059–6063.
  • [26] P. G. Shivakumar and P. Georgiou (2019) Confusion2vec: towards enriching vector space word representations with representational ambiguities. PeerJ Computer Science 5, pp. e195.
  • [27] P. G. Shivakumar, M. Yang, and P. Georgiou (2019) Spoken language intent detection using confusion2vec. arXiv preprint arXiv:1904.03576.
  • [28] M. Sperber, G. Neubig, N. Pham, and A. Waibel (2019) Self-attentional models for lattice inputs. arXiv preprint arXiv:1906.01617.
  • [29] K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
  • [30] Z. Tan, J. Su, B. Wang, Y. Chen, and X. Shi (2018) Lattice-to-sequence attentional neural machine translation models. Neurocomputing 284, pp. 138–147.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • [32] R. Weide (1998) The cmu pronunciation dictionary, release 0.6. Carnegie Mellon University Pittsburgh, PA.
  • [33] F. Xiao, J. Li, H. Zhao, R. Wang, and K. Chen (2019) Lattice-based transformer encoder for neural machine translation. arXiv preprint arXiv:1906.01282.
  • [34] H. Xu, D. Povey, L. Mangu, and J. Zhu (2011) Minimum bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language 25 (4), pp. 802–828.
  • [35] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur (2014) Improving deep neural network acoustic models using generalized maxout networks. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 215–219.