Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

07/21/2018
by   Yi-Chen Chen, et al.

Word embedding or Word2Vec has been successful in offering semantics for text words learned from the context of words. Audio Word2Vec was shown to offer phonetic structures for spoken words (signal segments for words) learned from signals within spoken words. This paper proposes a two-stage framework to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken words. Stage 1 performs phonetic embedding with speaker characteristics disentangled. Stage 2 then performs semantic embedding in addition. We further propose to evaluate the phonetic-and-semantic nature of the audio embeddings obtained in Stage 2 by parallelizing with text embeddings. In general, phonetic structure and semantics inevitably disturb each other. For example, the words "brother" and "sister" are close in semantics but very different in phonetic structure, while the words "brother" and "bother" are the other way around. But phonetic-and-semantic embedding is attractive, as shown in the initial experiments on spoken document retrieval. Not only can spoken documents including the spoken query be retrieved based on the phonetic structures, but spoken documents semantically related to the query yet not including it can also be retrieved based on the semantics.


1 Introduction

Word embedding or Word2Vec [1, 2, 3, 4] has been widely used in the area of natural language processing [5, 6, 7, 8, 9, 10, 11], in which text words are transformed into vector representations of fixed dimensionality [12, 13, 14]. This is because these vector representations carry plenty of semantic information learned from the context of the considered words in the text training corpus. Similarly, Audio Word2Vec has also been proposed in the area of speech signal processing, in which spoken words (signal segments for words, without knowing the underlying words they represent) are transformed into vector representations of fixed dimensionality [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. These vector representations carry the phonetic structures of the spoken words learned from the signals within the spoken words, and have been shown to be useful in spoken term detection, in which the spoken terms are detected simply based on the phonetic structures. Such Audio Word2Vec representations do not carry semantics, because they are learned from individual spoken words only, without considering the context.

Audio Word2Vec was recently extended to Segmental Audio Word2Vec [26], in which an utterance can be automatically segmented into a sequence of spoken words [27, 28, 29, 30] and then transformed into a sequence of vectors of fixed dimensionality by Audio Word2Vec, with the spoken word segmentation and the Audio Word2Vec jointly trained from an audio corpus. In this way, Audio Word2Vec was upgraded from the word level to the utterance level. This offers the opportunity for Audio Word2Vec to include semantic information in addition to phonetic structures, since the context among spoken words in utterances brings semantic information. This is the goal of this work, and this paper reports the first set of results towards such a goal.

In principle, the semantics and phonetic structures in words inevitably disturb each other. For example, the words “brother” and “sister” are close in semantics but very different in phonetic structure, while the words “brother” and “bother” are close in phonetic structure but very different in semantics. This implies that the goal of embedding both phonetic structures and semantics for spoken words is naturally very challenging. Text words can be trained and embedded as vectors carrying plenty of semantics because their phonetic structures are not considered at all. On the other hand, because spoken words are just a different version of representations for text words, it is natural to believe that they do carry some semantic information, except that it is disturbed by the phonetic structures plus some other acoustic factors such as speaker characteristics and background noise [31, 32, 33, 34, 35, 36]. So the goal of embedding spoken words to carry both phonetic structures and semantics is achievable, although definitely hard.

But a nice feature of such embeddings is that they may include both phonetic structures and semantics [37, 38]. A direct application for such phonetic-and-semantic embedding of spoken words is spoken document retrieval [39, 40, 41, 42, 43]. This task is slightly different from spoken term detection, in which spoken terms are simply detected based on their phonetic structures. Here the goal is to retrieve all spoken documents (sets of consecutive utterances) relevant to the spoken query, whether or not they include the query itself. For example, for the spoken query “President Donald Trump”, not only should those documents including the spoken query be retrieved based on the phonetic structures, but those documents including semantically related words such as “White House” and “trade policy”, but not necessarily “President Donald Trump”, should also be retrieved. This is usually referred to as “semantic retrieval”, which can be achieved by the phonetic-and-semantic embedding discussed here.

This paper proposes a two-stage framework of phonetic-and-semantic embedding for spoken words. Stage 1 performs phonetic embedding with speaker characteristics disentangled, using separate phonetic and speaker encoders and a speaker discriminator. Stage 2 then performs semantic embedding in addition. We further propose to evaluate the phonetic-and-semantic nature of the audio embeddings obtained in Stage 2 by parallelizing with text embeddings [44, 45]. Very encouraging results, including those for an application task of spoken document retrieval, were obtained in the initial experiments. The code is released at https://github.com/grtzsohalf/Audio-Phonetic-and-Semantic-Embedding.git.

2 Proposed Approach

The proposed framework of phonetic-and-semantic embedding of spoken words consists of two stages:

Stage 1 - Phonetic embedding with speaker characteristics disentangled.

Stage 2 - Semantic embedding over phonetic embeddings obtained in Stage 1.

In addition, we propose an approach for parallelizing the audio and text embeddings to be used for evaluating the phonetic and semantic information carried by the audio embeddings. These are described in Subsections 2.1, 2.2 and 2.3, respectively.

2.1 Stage 1 - Phonetic Embedding with Speaker Characteristics Disentangled

Figure 1: Phonetic embedding with speaker characteristics disentangled.

A text word with a given phonetic structure corresponds to an infinite number of audio signals with varying acoustic factors such as speaker characteristics, microphone characteristics, background noise, etc. All these acoustic factors are jointly referred to as speaker characteristics here for simplicity, and they obviously disturb the goal of phonetic-and-semantic embedding. So the purpose of Stage 1 is to obtain embeddings for the phonetic structures only, with the speaker characteristics disentangled.

Also, because the training of phonetic-and-semantic embedding is challenging, in the initial effort we slightly simplify the task by assuming all training utterances have been properly segmented into spoken words. Because there exist many approaches for segmenting utterances automatically [26], and automatic segmentation plus phonetic embedding of spoken words has been successfully trained and reported before [26], such an assumption is reasonable here.

We denote the audio corpus as $X = \{x_i\}_{i=1}^{M}$, which consists of $M$ spoken words, each represented as $x_i = (x_{i,1}, x_{i,2}, \dots, x_{i,T_i})$, where $x_{i,t}$ is the acoustic feature vector for the $t$-th frame and $T_i$ is the total number of frames in the spoken word. The goal of Stage 1 is to disentangle the phonetic structure and the speaker characteristics in the acoustic features, and to extract a vector representation for the phonetic structure only.

2.1.1 Autoencoder

As shown in the middle of Figure 1, the sequence of acoustic features $x_i$ is fed into a phonetic encoder $E_p$ and a speaker encoder $E_s$ to obtain a phonetic vector $v_{p_i}$ (in orange) and a speaker vector $v_{s_i}$ (in green). The phonetic and speaker vectors $v_{p_i}$, $v_{s_i}$ are then used by the decoder $Dec$ to reconstruct the acoustic features $\hat{x}_i$. The phonetic vector $v_{p_i}$ will be used in the next stage as the phonetic embedding. The two encoders $E_p$, $E_s$ and the decoder $Dec$ are jointly learned by minimizing the reconstruction loss below:

$$L_r = \sum_{i} \lVert x_i - Dec(v_{p_i}, v_{s_i}) \rVert^2 . \qquad (1)$$

It will be clear below how to make $v_{p_i}$ and $v_{s_i}$ separately encode the phonetic structure and the speaker characteristics.
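To make the structure concrete, the following is a minimal sketch of such a two-encoder autoencoder in PyTorch, assuming 39-dimensional acoustic features and the layer sizes reported in Section 3.2; the decoder design (a GRU conditioned on the concatenated vectors) and all module names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """2-layer GRU that summarizes a spoken word (frames x features) into one vector."""
    def __init__(self, feat_dim=39, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, x):                    # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)                   # h: (num_layers, batch, hidden)
        return h[-1]                         # final state of the top layer

class SeqDecoder(nn.Module):
    """Reconstructs the frame sequence from the concatenated [v_p ; v_s] code."""
    def __init__(self, feat_dim=39, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, code, n_frames):       # code: (batch, 2 * 128) = (batch, 256)
        h0 = code.unsqueeze(0).repeat(2, 1, 1)            # initialize both GRU layers
        steps = code.new_zeros(code.size(0), n_frames, self.out.out_features)
        y, _ = self.rnn(steps, h0)
        return self.out(y)                   # reconstructed frames

enc_p, enc_s, dec = SeqEncoder(), SeqEncoder(), SeqDecoder()

x = torch.randn(8, 50, 39)                   # toy batch: eight 50-frame spoken words
v_p, v_s = enc_p(x), enc_s(x)                # phonetic and speaker vectors
x_hat = dec(torch.cat([v_p, v_s], dim=-1), x.size(1))
L_r = ((x - x_hat) ** 2).sum(-1).mean()      # reconstruction loss as in Eq. (1)
```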

2.1.2 Training Criteria for Speaker Encoder

The speaker encoder training requires speaker information for the spoken words. Assume the spoken word $x_i$ is uttered by speaker $s_i$. When the speaker information is not available, we can simply assume that the spoken words in the same utterance are produced by the same speaker. As shown in the lower part of Figure 1, $E_s$ is learned to minimize the following loss:

$$L_s = \sum_{s_i = s_j} \lVert v_{s_i} - v_{s_j} \rVert^2 + \sum_{s_i \neq s_j} \max\bigl(\lambda - \lVert v_{s_i} - v_{s_j} \rVert^2,\, 0\bigr) . \qquad (2)$$

In other words, if $x_i$ and $x_j$ are uttered by the same speaker ($s_i = s_j$), we want their speaker embeddings $v_{s_i}$ and $v_{s_j}$ to be as close as possible. But if $s_i \neq s_j$, we want the distance between $v_{s_i}$ and $v_{s_j}$ to be larger than a threshold $\lambda$.
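A small sketch of this criterion, looping over all pairs in a batch; the function name is hypothetical, the margin default follows the value of $\lambda$ reported in Section 3.2, and the exact pairing scheme within a mini-batch is an assumption.

```python
import torch

def speaker_loss(v_s, speaker_ids, margin=0.01):
    """v_s: (N, d) speaker vectors; speaker_ids: length-N speaker labels (Eq. (2))."""
    loss = v_s.new_zeros(())
    n = v_s.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = ((v_s[i] - v_s[j]) ** 2).sum()
            if speaker_ids[i] == speaker_ids[j]:
                loss = loss + d2                                  # same speaker: pull together
            else:
                loss = loss + torch.clamp(margin - d2, min=0.0)   # different: push apart
    return loss
```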

2.1.3 Training Criteria for Phonetic Encoder

As shown in the upper right corner of Figure 1, a speaker discriminator $D_s$ takes two phonetic vectors $v_{p_i}$ and $v_{p_j}$ as input and tries to tell whether the two vectors come from the same speaker. The learning target of the phonetic encoder $E_p$ is to “fool” this speaker discriminator, keeping it from discriminating the speaker identity correctly. In this way, only the phonetic structure information is learned in the phonetic vector $v_{p_i}$, while only the speaker characteristics are encoded in the speaker vector $v_{s_i}$. The speaker discriminator $D_s$ learns to maximize $L_d$ in (3), while the phonetic encoder $E_p$ learns to minimize $L_d$,

$$L_d = \sum_{s_i = s_j} D_s(v_{p_i}, v_{p_j}) - \sum_{s_i \neq s_j} D_s(v_{p_i}, v_{p_j}) , \qquad (3)$$

where $D_s(v_{p_i}, v_{p_j})$ is a real number, the scalar output of the discriminator.
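The sketch below shows one way to realize this adversarial criterion, with a pairwise discriminator of the size reported in Section 3.2 and a score-difference objective consistent with the description above; the class and function names and the exact form of the objective are assumptions.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Scores a pair of phonetic vectors; higher should mean 'same speaker'."""
    def __init__(self, dim=128, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a, b):                         # a, b: (N, dim)
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)   # real-valued scores

def pair_scores(disc, v_p, pairs):
    if not pairs:
        return v_p.new_zeros(())
    idx_a = [i for i, _ in pairs]
    idx_b = [j for _, j in pairs]
    return disc(v_p[idx_a], v_p[idx_b]).sum()

def discriminator_objective(disc, v_p, speaker_ids):
    """L_d: same-speaker pair scores minus different-speaker pair scores.
    D_s is trained to maximize this; E_p is trained to minimize it."""
    n = v_p.size(0)
    same = [(i, j) for i in range(n) for j in range(i + 1, n)
            if speaker_ids[i] == speaker_ids[j]]
    diff = [(i, j) for i in range(n) for j in range(i + 1, n)
            if speaker_ids[i] != speaker_ids[j]]
    return pair_scores(disc, v_p, same) - pair_scores(disc, v_p, diff)
```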

2.1.4 Overall Optimization of Stage 1

The optimization procedure of Stage 1 consists of four parts: (1) training $E_p$, $E_s$ and $Dec$ by minimizing $L_r$, (2) training $E_s$ by minimizing $L_s$, (3) training $E_p$ by minimizing $L_d$, and (4) training $D_s$ by maximizing $L_d$. Parts (1)(2)(3) are jointly trained together, and trained iteratively in alternation with part (4) [46].
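One training iteration of this alternating scheme might look like the sketch below, reusing enc_p, enc_s, dec, speaker_loss, PairDiscriminator and discriminator_objective from the sketches above; the unit loss weights, optimizer choice and learning rates are assumptions.

```python
import torch

disc = PairDiscriminator()
opt_model = torch.optim.Adam(
    list(enc_p.parameters()) + list(enc_s.parameters()) + list(dec.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)

x = torch.randn(8, 50, 39)                       # toy batch of spoken words
spk = torch.randint(0, 3, (8,))                  # toy speaker labels

# parts (1)(2)(3): update E_p, E_s and Dec jointly
v_p, v_s = enc_p(x), enc_s(x)
x_hat = dec(torch.cat([v_p, v_s], dim=-1), x.size(1))
L_r = ((x - x_hat) ** 2).sum(-1).mean()
L = L_r + speaker_loss(v_s, spk) + discriminator_objective(disc, v_p, spk)
opt_model.zero_grad()
L.backward()                                     # the encoder side minimizes L_d, "fooling" D_s
opt_model.step()

# part (4): update D_s to maximize L_d (i.e., minimize -L_d)
L_d = discriminator_objective(disc, enc_p(x).detach(), spk)
opt_disc.zero_grad()
(-L_d).backward()
opt_disc.step()
```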

2.2 Stage 2 - Semantic Embedding over Phonetic Embeddings Obtained in Stage 1

Figure 2: Semantic embedding over phonetic embeddings obtained in Stage 1.

As shown in Figure 2, similar to the Word2Vec skip-gram model [1], we use two encoders, a semantic encoder $E_{se}$ and a context encoder $E_c$, to embed the semantics over the phonetic embeddings obtained in Stage 1. On the one hand, given a spoken word $x_i$, we feed its phonetic vector $v_{p_i}$ obtained from Stage 1 into $E_{se}$, as in the middle of Figure 2, producing the semantic embedding $v_{sem_i}$ (in yellow) of the spoken word $x_i$. On the other hand, given the context window size $c$, which is a hyperparameter, if a spoken word $x_j$ is in the context window of $x_i$, then its phonetic vector $v_{p_j}$ is a context vector of $v_{p_i}$. For each context vector $v_{p_j}$ of $v_{p_i}$, we feed it into the context encoder $E_c$ in the upper part of Figure 2, and the output is the context embedding $v_{c_j}$.

Given a pair of phonetic vectors $(v_{p_i}, v_{p_j})$, the training criterion for $E_{se}$ and $E_c$ is to maximize the similarity between $v_{sem_i}$ and $v_{c_j}$ if $x_i$ and $x_j$ are contextual, and to minimize the similarity otherwise. The basic idea is parallel to that of text Word2Vec: two different spoken words having similar context should have similar semantics. Thus if two phonetic embeddings corresponding to two different spoken words have very similar context, they should be close to each other after being projected by the semantic encoder $E_{se}$. The semantic and context encoders $E_{se}$ and $E_c$ learn to minimize the semantic loss $L_{se}$ as follows:

$$L_{se} = -\sum_{(i,j)\ \mathrm{contextual}} \log \sigma\!\left(v_{sem_i} \cdot v_{c_j}\right) \;-\; \sum_{(i,j)\ \mathrm{negative}} \log \sigma\!\left(-\,v_{sem_i} \cdot v_{c_j}\right) . \qquad (4)$$

The sigmoid of the dot product of $v_{sem_i}$ and $v_{c_j}$ is used to evaluate the similarity. With (4), if $x_i$ and $x_j$ are in the same context window, we want $v_{sem_i}$ and $v_{c_j}$ to be as similar as possible. We also use the negative sampling technique, in which only some pairs are randomly sampled as negative examples instead of enumerating all possible negative pairs.
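The sketch below shows this skip-gram-style training over phonetic vectors, with the encoder sizes and negative-sampling count from Section 3.2; the function names, batch layout and the exact negative-sampling scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(dim_in=128, hidden=256, dim_out=128):
    return nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, dim_out))

enc_se, enc_c = mlp(), mlp()                     # semantic and context encoders

def semantic_loss(v_center, v_context, v_negative):
    """v_center: (N, d) phonetic vectors of center words;
       v_context: (N, d) phonetic vectors of words inside their context windows;
       v_negative: (N, K, d) phonetic vectors of K negative samples per center word."""
    s = enc_se(v_center)                                        # semantic embeddings
    c_pos = enc_c(v_context)                                    # context embeddings (positives)
    c_neg = enc_c(v_negative)                                   # context embeddings (negatives)
    pos = F.logsigmoid((s * c_pos).sum(-1))                     # push sigma(dot) toward 1
    neg = F.logsigmoid(-(c_neg @ s.unsqueeze(-1)).squeeze(-1))  # push sigma(dot) toward 0
    return -(pos.mean() + neg.sum(-1).mean())

# toy usage with 128-dim phonetic vectors from Stage 1 and 5 negatives per word
loss = semantic_loss(torch.randn(32, 128), torch.randn(32, 128), torch.randn(32, 5, 128))
```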

2.3 Parallelizing Audio and Text Embeddings for Evaluation Purposes

In this paper we further propose an approach of parallelizing a set of audio embeddings (for spoken words) with a set of text embeddings (for text words) which will be useful in evaluating the phonetic and semantic information carried by these embeddings.

Assume we have the audio embeddings for a set of spoken words, $A = \{a_i\}_{i=1}^{V}$, where $a_i$ is the embedding obtained for the $i$-th spoken word and $V$ is the total number of distinct spoken words in the audio corpus. On the other hand, assume we have the text embeddings $B = \{b_i\}_{i=1}^{V}$, where $b_i$ is the embedding of the $i$-th text word for the $V$ distinct text words. Although the distributions of $A$ and $B$ in their respective spaces are not parallel, that is, a specific dimension in the space for $A$ does not necessarily correspond to a specific dimension in the space for $B$, there should exist some consistent relationship between the two distributions. For example, the relationships among the words {France, Paris, Germany} learned from context should be consistent in some way, regardless of whether they are in text or spoken form. So we try to learn a mapping relation between the two spaces. It will be clear below that such a mapping relation can be used to evaluate the phonetic and semantic information carried by the audio embeddings.

Mini-Batch Cycle Iterative Closest Point (MBC-ICP) [45], proposed previously, is used here as described below. Given the two sets of embeddings mentioned above, $A$ and $B$, they are first projected onto their respective top $k$ principal components by PCA. Let the projected sets of vectors of $A$ and $B$ be $\tilde{A} = \{\tilde{a}_i\}$ and $\tilde{B} = \{\tilde{b}_i\}$ respectively. If $\tilde{A}$ can be mapped to the space of $\tilde{B}$ by an affine transformation, the distributions of $\tilde{A}$ and $\tilde{B}$ would be similar after PCA [45].

Then a pair of transformation matrices, $T_{ab}$ and $T_{ba}$, is learned, where $T_{ab}$ transforms a vector in $\tilde{A}$ to the space of $\tilde{B}$, that is, $\tilde{b}_i \approx T_{ab}\,\tilde{a}_i$, while $T_{ba}$ maps a vector in $\tilde{B}$ to the space of $\tilde{A}$. $T_{ab}$ and $T_{ba}$ are learned iteratively by the algorithm proposed previously [45].

In our evaluation, as mentioned below, labeled pairs of the audio and text embeddings of each word are available, that is, we know $\tilde{a}_i$ and $\tilde{b}_i$ for each word $i$. So we can train the transformation matrices $T_{ab}$ and $T_{ba}$ using the gradient descent method to minimize the following objective function:

$$\sum_{i} \lVert T_{ab}\,\tilde{a}_i - \tilde{b}_i \rVert^2 + \sum_{i} \lVert T_{ba}\,\tilde{b}_i - \tilde{a}_i \rVert^2 + \lambda \sum_{i} \lVert T_{ba} T_{ab}\,\tilde{a}_i - \tilde{a}_i \rVert^2 + \lambda \sum_{i} \lVert T_{ab} T_{ba}\,\tilde{b}_i - \tilde{b}_i \rVert^2 , \qquad (5)$$

where the last two terms in (5) are cycle constraints to ensure that both $\tilde{a}_i$ and $\tilde{b}_i$ remain almost unchanged after being transformed to the other space and back. In this way we say the two sets of embeddings are parallelized.
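A compact sketch of fitting $T_{ab}$ and $T_{ba}$ by plain full-batch gradient descent on the objective in (5); the variable names, step size and iteration count are assumptions, and the mini-batch ICP pairing procedure of [45] is omitted here since labeled pairs are available.

```python
import numpy as np

def fit_mappings(A, B, lam=0.5, lr=1e-2, steps=2000):
    """A: (n, k) PCA-projected audio embeddings; B: (n, k) PCA-projected text
    embeddings; rows are paired by word.  Returns T_ab (audio->text) and T_ba."""
    k = A.shape[1]
    T_ab, T_ba = np.eye(k), np.eye(k)
    n = len(A)
    for _ in range(steps):
        e_ab = A @ T_ab.T - B                    # T_ab a_i should match b_i
        e_ba = B @ T_ba.T - A                    # T_ba b_i should match a_i
        c_a = (A @ T_ab.T) @ T_ba.T - A          # audio -> text space -> back
        c_b = (B @ T_ba.T) @ T_ab.T - B          # text -> audio space -> back
        # gradients of the squared-error objective in Eq. (5), up to a constant factor
        g_ab = e_ab.T @ A + lam * (T_ba.T @ c_a.T @ A) + lam * (c_b.T @ (B @ T_ba.T))
        g_ba = e_ba.T @ B + lam * (T_ab.T @ c_b.T @ B) + lam * (c_a.T @ (A @ T_ab.T))
        T_ab -= lr * g_ab / n
        T_ba -= lr * g_ba / n
    return T_ab, T_ba
```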

3 Experimental Setup

3.1 Dataset

We used LibriSpeech [47] as the audio corpus in the experiments, which is a corpus of read speech in English derived from audiobooks. This corpus contains 1000 hours of speech sampled at 16 kHz uttered by 2484 speakers. We used the “clean” and “others” sets with a total of 960 hours, and extracted 39-dim MFCCs as the acoustic features.

3.2 Model Implementation

In Stage 1, the phonetic encoder $E_p$, the speaker encoder $E_s$ and the decoder $Dec$ were all 2-layer GRUs with hidden layer sizes of 128, 128 and 256, respectively. The speaker discriminator $D_s$ is a fully-connected feedforward network with 2 hidden layers of size 128. The value of the threshold $\lambda$ in (2) was set to 0.01.

In Stage 2, the two encoders $E_{se}$ and $E_c$ were both fully-connected feedforward networks with 2 hidden layers of size 256. The size of the embedding vectors was set to 128. The context window size was 5, and the negative sampling number was 5.

For parallelizing the text and audio embeddings in Subsection 2.3, we projected the embeddings onto their top 100 principal components, so the affine transformation matrices were $100 \times 100$. The mini-batch size was 200, and $\lambda$ in (5) was set to 0.5.

                                 (a)TXT-ph   (b)TXT-(se,1h)   (c)TXT-(se,ph)
1000 pairs   (i)  AUD-ph           0.637         0.124            0.550
             (ii) AUD-(ph-+se)     0.519         0.322            0.750
             (iii)AUD-(ph+se)      0.598         0.339            0.800
3000 pairs   (i)  AUD-ph           0.465         0.028            0.279
             (ii) AUD-(ph-+se)     0.330         0.032            0.254
             (iii)AUD-(ph+se)      0.395         0.033            0.313
5000 pairs   (i)  AUD-ph           0.362         0.012            0.190
             (ii) AUD-(ph-+se)     0.263         0.022            0.173
             (iii)AUD-(ph+se)      0.315         0.023            0.212
Table 1: Top-1 nearest accuracies when parallelizing the different versions of audio and text embeddings for different numbers of pairs of spoken and text words.
                                 (a)TXT-ph   (b)TXT-(se,1h)   (c)TXT-(se,ph)
1000 pairs   (i)  AUD-ph           0.954         0.355            0.898
             (ii) AUD-(ph-+se)     0.897         0.653            0.986
             (iii)AUD-(ph+se)      0.945         0.742            0.994
3000 pairs   (i)  AUD-ph           0.854         0.120            0.654
             (ii) AUD-(ph-+se)     0.758         0.146            0.671
             (iii)AUD-(ph+se)      0.809         0.166            0.752
5000 pairs   (i)  AUD-ph           0.774         0.050            0.518
             (ii) AUD-(ph-+se)     0.658         0.109            0.544
             (iii)AUD-(ph+se)      0.717         0.111            0.607
Table 2: Top-10 nearest accuracies when parallelizing the different versions of audio and text embeddings for different numbers of pairs of spoken and text words.
words    AUD-(ph+se)                             AUD-ph                                   TXT-(se,1h)
owned    own, only, unknown, owner, land,        owns, armed, owen, arm, own,             visited, introduced, lived, related, learned,
         armed, learned, homes, known, alone     only, oughtnt, loaned, ode, owing        discovered, met, called, think, known
didn't   did, sitting, give, doesn't, don't,     giving, bidden, given, getting, being,   don't, can't, wouldn't, doesn't, won't,
         given, hadn't, too, bidden, listen      even, ridden, didnt, deane, givin        i'm, you're, shouldn't, think, want
Table 3: Some examples of top-10 nearest neighbors in AUD-(ph+se) (proposed), AUD-ph (with phonetic structure) and TXT-(se,1h) (with semantics). The words in red are the common words of AUD-(ph+se) and AUD-ph, and the words in bold are the common words of AUD-(ph+se) and TXT-(se,1h).
groundtruth       AUD-(ph+se)   AUD-ph
D_1 ∪ D_2            17.8%       15.6%
D_2 alone             2.8%        1.8%
Table 4: Spoken document retrieval performance (MAP) using two different audio embeddings (AUD-(ph+se) and AUD-ph), with the groundtruth defined as the union of D_1 and D_2 (first row) or D_2 alone (second row).
(a) query | (b) title of a book | (c) chapter | (d) rank | (e) the word with the highest similarity to the query
nations | Myths and Legends of All Nations | Prometheus the Friend of Man | 13/5273 | …and shall marry the king of that country…
Anne | Anne of Green Gables | Mrs. Rachel Lynde Is Surprised | 25/5329 | …why the worthy woman finally concluded…
German | In a German Pension | Story 13: A Blaze | 22/5232 | …through the heavy snow towards the town
castle | Montezuma's Castle and Other Weird Tales | THE STRANGE POWDER… | 3/5141 | …what is its history asked doctor Farrington…
baron | Surprising Adventures of Baron Munchausen | Chapter 22 | 18/5375 | …at the palace and having remained in this situation…
Table 5: Some retrieval examples of chapters in $D_2$ using AUD-(ph+se), showing the advantage of the semantic information in phonetic-and-semantic embeddings. The word in red in each row indicates the word with the highest similarity to the query in the chapter.

4 Experimental Results

4.1 Evaluation by Parallelizing Audio and Text Embeddings

Each text word corresponds to many audio realizations in spoken form. So we first took the average of the audio embeddings for all those realizations to be the audio embedding for the spoken word considered. In this way, each word has a unique representation in either audio or text form.
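A tiny sketch of this averaging step, with a hypothetical data layout (one embedding per spoken-word token plus its word label):

```python
import numpy as np
from collections import defaultdict

def average_by_word(embeddings, word_labels):
    """embeddings: (N, d) array of spoken-word token embeddings;
       word_labels: length-N list giving the word each token realizes."""
    groups = defaultdict(list)
    for vec, word in zip(embeddings, word_labels):
        groups[word].append(vec)
    return {word: np.mean(vecs, axis=0) for word, vecs in groups.items()}
```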

We applied three different versions of audio embedding (AUD) on the top 1000, 3000 and 5000 words with the highest frequencies in LibriSpeech: (i) phonetic embedding only, obtained in Stage 1 in Subsection 2.1 (AUD-ph); (ii) phonetic-and-semantic embedding obtained by Stages 1 and 2 in Subsections 2.1 and 2.2, except with the speaker characteristics not disentangled (AUD-(ph-+se)), i.e., $L_s$ and $L_d$ in (2), (3) not considered; (iii) the complete phonetic-and-semantic embedding as proposed in this paper, including Stages 1 and 2 (AUD-(ph+se)). So this serves as an ablation study.

On the other hand, we also obtained three different types of text embedding (TXT) on the same set of top 1000, 3000 and 5000 words. Type (a) Phonetic Text embedding (TXT-ph) considered precise phonetic structure but not context or semantics at all. This was achieved by a well-trained sequence-to-sequence autoencoder encoding the precise phoneme sequence of a word into a latent embedding. Type (b) Semantic Text embedding considered only context or semantics but not phonetic structure at all, and was obtained by a standard skip-gram model using one-hot representations as the input (TXT-(se,1h)). Type (c) Semantic and Phonetic Text embedding (TXT-(se,ph)) considered context or semantics as well as the precise phonetic structure, obtained by a standard skip-gram model but using the Type (a) Phonetic Text embedding (TXT-ph) as the input. So these three types of text embeddings provided the reference embeddings obtained from text and/or phoneme sequences, not disturbed by audio signals at all.

Now we can perform the transformation from the above three versions of audio embeddings (AUD-ph, AUD-(ph-+se), AUD-(ph+se)) to the above three types of text embeddings (TXT-ph, TXT-(se,1h), TXT-(se,ph)) by parallelizing the embeddings as described in Subsection 2.3. The evaluation metric used for this parallelizing test is the top-k nearest accuracy. If the audio embedding $\tilde{a}_i$ of a word $i$ is transformed to the text space as $T_{ab}\,\tilde{a}_i$, and $T_{ab}\,\tilde{a}_i$ is among the top-k nearest neighbors of the text embedding $\tilde{b}_i$ of the same word, this transformation for word $i$ is top-k-accurate. The top-k nearest accuracy is then the percentage of the words considered which are top-k-accurate.
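A small sketch of this metric; names are hypothetical and Euclidean distance is assumed as the neighborhood measure.

```python
import numpy as np

def top_k_nearest_accuracy(A, B, T_ab, k=10):
    """A: (n, d) audio embeddings, B: (n, d) text embeddings, rows paired by word;
    T_ab maps the audio space to the text space."""
    mapped = A @ T_ab.T                                    # audio embeddings in text space
    dists = np.linalg.norm(mapped[:, None, :] - B[None, :, :], axis=-1)   # (n, n) distances
    ranks = np.argsort(dists, axis=1)[:, :k]               # k nearest text words per audio word
    hits = [i in ranks[i] for i in range(len(A))]          # does word i retrieve itself?
    return float(np.mean(hits))
```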

The results of top-k nearest accuracies for k=1 and 10 are respectively listed in Tables 1 and 2, each for 1000, 3000 and 5000 pairs of spoken and text words.

First look at the top part of Table 1, the top-1 nearest accuracies for 1000 pairs of audio and text embeddings. Since column (a) (TXT-ph) considered precise phonetic structures but not semantics at all, the relatively high accuracies in column (a) for all three versions of audio embedding (i)(ii)(iii) implied that the three versions of audio embedding were all rich in phonetic information. But when the semantics were embedded in (ii)(iii) (AUD-(ph-+se), AUD-(ph+se)), the phonetic structures were inevitably disturbed (0.519, 0.598 vs 0.637). On the other hand, since column (b) (TXT-(se,1h)) considered only semantics but not phonetic structure at all, the relatively lower accuracies implied the audio embeddings did bring a good extent of semantics, except (i) AUD-ph, though obviously weaker than the phonetic information in column (a). Also, the Stage 2 training in rows (ii)(iii) (AUD-(ph-+se), AUD-(ph+se)) gave higher accuracies than row (i) (AUD-ph) (0.322, 0.339 vs 0.124 in column (b)), which implied the Stage 2 training was successful. Column (c) (TXT-(se,ph)) is for the text embedding considering both the semantic and phonetic information, and the relatively high accuracies for the two versions of phonetic-and-semantic audio embedding in rows (ii)(iii) (0.750, 0.800 in column (c)) implied that they carried a good extent of both semantics and phonetic structure. The above is made clearer by the numbers in bold, which are the highest for each row, and the numbers in red, which are the highest for each column. It is also clear that the speaker characteristics disentanglement is helpful, since row (iii) for AUD-(ph+se) was always better than row (ii) for AUD-(ph-+se).

Similar trends can be observed in the other parts of Table 1 for 3000 and 5000 pairs, except that the accuracies were lower, probably because with more pairs the parallelizing transformation became more difficult and less accurate. The only difference is that in these parts column (a) for TXT-ph had the highest accuracies, probably because the goal of semantic embedding for rows (ii)(iii) (AUD-(ph-+se), AUD-(ph+se)) was really difficult, and was disturbed or even dominated by the phonetic structures. Similar trends can be observed in Table 2 for top-10 accuracies, obviously with higher numbers for top-10 as compared to those for top-1 in Table 1.

In Table 3, we list some examples of top-10 nearest neighbors in AUD-(ph+se) (proposed), AUD-ph (with phonetic structure) and TXT-(se,1h) (with semantics). The words in red are the common words for AUD-(ph+se) and AUD-ph, and the words in bold are the common words of AUD-(ph+se) and TXT-(se,1h). For example, the word “owned” has two common semantically related words “learned” and “known” in the top-10 nearest neighbors of AUD-(ph+se) and TXT-(se,1h). The word “owned” also has three common phonetically similar words “armed”, “own” and “only” in the top-10 nearest neighbors of AUD-(ph+se) and AUD-ph. This is even clearer for the function word “didn’t”. These clearly illustrate the phonetic-and-semantic nature of AUD-(ph+se).

4.2 Results of Spoken Document Retrieval

The goal here is to retrieve not only those spoken documents including the spoken query (e.g. “President Donald Trump”) based on the phonetic structures, but also those including words semantically related to the query (e.g. “White House”). Below we show the effectiveness of the phonetic-and-semantic embedding proposed here in this application.

We used the 960 hours of the “clean” and “other” parts of the LibriSpeech dataset as the target archive for retrieval, which consisted of 1478 audio books with 5466 chapters. Each chapter included 1 to 204 utterances, or 5 to 6529 spoken words. In our experiments, the queries were the keywords in the book titles, and the spoken documents were the chapters. We chose 100 queries out of 100 randomly selected book titles, and our goal was to retrieve query-relevant documents. For each query $q$, we defined two sets of query-relevant documents: the first set $D_1$ consisted of chapters which included the query $q$; the second set $D_2$ consisted of chapters whose content did not contain $q$, but which belonged to books whose titles contain $q$ (so we assume these chapters are semantically related to $q$). Obviously $D_1$ and $D_2$ were mutually exclusive, and $D_2$ was the target for semantic retrieval, but could not be retrieved based on the phonetic structures only.

For each query $q$ and each document $d$, the relevance score of $d$ with respect to $q$, $S(q, d)$, is defined as follows:

$$S(q, d) = \max_{w \in d} \; \mathrm{sim}(v_w, v_q) , \qquad (6)$$

where $v_w$ is the audio embedding of a word $w$ in $d$, $v_q$ is the audio embedding of the query, and $\mathrm{sim}(\cdot,\cdot)$ is the similarity between the two embeddings. So (6) indicates the documents were ranked by the minimum distance between a word in $d$ and the query $q$. We used mean average precision (MAP) as the evaluation metric for the spoken document retrieval test.
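A minimal sketch of this ranking; names are hypothetical, and the negative Euclidean distance is assumed as the similarity in (6).

```python
import numpy as np

def relevance(query_vec, doc_word_vecs):
    """query_vec: (d,) audio embedding of the spoken query;
       doc_word_vecs: (n_words, d) audio embeddings of the words in one document."""
    dists = np.linalg.norm(doc_word_vecs - query_vec[None, :], axis=1)
    return -dists.min()                       # score of the closest word; higher = more relevant

def rank_documents(query_vec, docs):
    """docs: list of (doc_id, (n_words, d) array); returns doc ids sorted by relevance."""
    scored = [(doc_id, relevance(query_vec, vecs)) for doc_id, vecs in docs]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda t: -t[1])]
```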

We compared the retrieval results with two versions of audio embedding: AUD-(ph+se) and AUD-ph. The results are listed in Table 4 for two definitions of the groundtruth for the query-relevant documents: the union of $D_1$ and $D_2$, and $D_2$ alone. As can be found from this table, AUD-(ph+se) offered better retrieval performance than AUD-ph in both rows. Note that the chapters in $D_2$ in the second row of the table did not include the query $q$, so they could not be well retrieved using the phonetic embedding alone. That is why the phonetic-and-semantic embedding proposed here can help.

In Table 5, we list some chapters in $D_2$ retrieved using the AUD-(ph+se) embeddings to illustrate the advantage of the phonetic-and-semantic embeddings. In this table, column (a) is the query $q$, column (b) is the title of a book which had chapters in $D_2$, column (c) is a certain chapter $d$ in $D_2$, column (d) is the rank of $d$ out of all chapters whose content didn't contain $q$, and column (e) is a part of the content of $d$ in which the word in red is the word with the highest similarity to $q$. For example, in the first row for the query “nations”, the chapter “Prometheus the Friend of Man” of the book titled “Myths and Legends of All Nations” is in $D_2$. The word “nations” is not in the content of this chapter. However, because the word “king”, semantically related to “nations”, is in the content, this chapter was ranked 13th among all chapters whose content didn't contain the word “nations”. This clearly illustrates how the semantics in the phonetic-and-semantic embeddings can remarkably improve the performance of spoken content retrieval.

5 Conclusions and Future Work

In this paper we propose a framework to embed spoken words into vector representations carrying both the phonetic structure and the semantics of the words. This is intrinsically challenging because the phonetic structure and the semantics of spoken words inevitably disturb each other. But this phonetic-and-semantic nature of the embeddings is desired and attractive, for example in the application task of spoken document retrieval. A parallelizing transformation between the audio and text embeddings is also proposed to evaluate whether such a goal is achieved.

References