A short tutorial for the co-utilization of audio and text data (multi-modal analysis)
Understanding the intention of an utterance is challenging in some prosody-sensitive cases, especially when the utterance is in written form. The main concern is to detect the directivity or rhetoricalness of an utterance and to distinguish the type of question. Since both prosodic and semantic issues inevitably arise, the identification is expected to benefit from observations of the human language processing mechanism. In this paper, we tackle the task with attentive recurrent neural networks that exploit acoustic and textual features, using a manually created speech corpus that contains only syntactically ambiguous utterances which require prosody for disambiguation. We found that co-attention frameworks on audio-text data, namely multi-hop attention and cross-attention, can perform better than previously suggested speech-based and text-aided networks. From this, we infer that understanding the genuine intention of ambiguous utterances involves recognizing the interaction between auditory and linguistic processes.
Inferring the intention of syntactically ambiguous utterances is one of the most challenging issues for spoken language understanding (SLU). If an utterance has an underspecified sentence ender whose role is decided only by the prosody, both the acoustic and textual data of the speech are required for SLU modules (and even humans) to correctly infer the intention, since the pitch sequence, the duration between the words, and the overall tone decide the intention of the utterance. For example, in Seoul Korean, which is wh-in-situ, the following sentence can be interpreted differently depending on the intonation:
(S1) 몇 개 가져가 myech kay kacye-ka
how quantity bring-USE
(a) How many shall I take?
(b) Shall I take some?
(LMLLH%; yes/no Q)
(c) Take some.
where L, M, and H denote the relative pitch of each syllabic block and USE denotes an underspecified sentence ender. Unlike with the English translations, if given only the depunctuated text (usually provided as the output of automatic speech recognition, ASR), language understanding modules may not be able to determine whether the utterance is a statement or a question. Even with punctuation marks, it remains unclear whether the question is yes/no or wh-. Thus, we concluded that introducing prosodic information is indispensable for resolving the syntax-semantic ambiguity, as depicted in Figure 1.
Early studies on speech intention analysis adopt a simple concatenation of acoustic and textual features, where parallel CNNs or RNNs were used to summarize each feature. A recent study includes hierarchical attention networks (HAN) that point out the components which are essential for inferring the genuine answer. In the related area of speech emotion recognition, multi-hop attention (MHA) was introduced to encourage a comprehensive information exchange between textual and acoustic features.
However, since the experiments in the literature utilize speech utterances whose intention or emotion is conveyed with little confusion by either text or speech, few studies have concentrated on the resolution of ambiguous sentences as in (S1). In terms of the prosody-semantics interface, we concluded that the interaction between acoustic and textual information, as in MHA, is required for such cases. We also aim to materialize this philosophy in our co-attentional architecture in the form of cross-attention (CA), which has shown its power in the areas of image-text matching and visuo-linguistic inference.
In this paper, we first introduce the architecture of speech intention classification systems that co-utilize audio and text features, along with a comparison with speech-based and text-aided (self-attentive) RNN models. Next, in the experiment section, the corpus generation process is described, and the results of the co-attentional models are compared with those of the baseline models. Our contributions are as follows:
Applying parallel BRE (P-BRE), multi-hop attention (MHA), and cross-attention (CA) to disambiguate the speech intention of syntactically ambiguous utterances
Analyzing the result in view of experimental linguistics, to match the co-attention frameworks with prosody-semantics interface and biological tendency
Here, we describe how the co-attention frameworks are constructed in terms of speech processing, self-attentive embedding, text-aided analysis, multi-hop attention, and cross-attention, as shown visually in Figure 2. In all models, the input is either speech-only (2.1-2) or audio-text pair (2.3-5). The text-only model is NOT taken into account since the text alone does not help resolve the syntax-semantic ambiguity.
The baseline model utilizes only audio input. Frame-level audio features are fed as the input of a bidirectional long short-term memory (BiLSTM) network, and the final hidden state is fully connected to a multi-layer perceptron (MLP), whose final softmax layer yields the predicted intention as the maximum-probability output. Refer to (1) in Figure 1 for illustration.
Since the audio-only RNN model lacks information about where the core parts of the utterance are, we augmented it with a self-attentive embedding layer as utilized in sentence representation. In brief, a context vector, which has the same width as the hidden layers of the RNN, is jointly trained to assign a weight vector to the hidden layer sequence of the bidirectional RNN. The whole process, as in (2) of Figure 1, implies that the weight is decided by the overall distribution of acoustic features, which closely concerns the syntactic features and may play a crucial role in predicting the intention of the speech.
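The self-attentive embedding described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' Keras implementation: the scoring function (a dot product between tanh-squashed hidden states and the context vector), the dimensions, and the names `attend`, `hidden`, and `context` are all assumptions, and random arrays stand in for the trained BiLSTM outputs and context vector.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(hidden, context):
    """Pool a hidden-state sequence with weights derived from a context vector.

    hidden:  (T, d) array, one hidden state per frame.
    context: (d,) array with the same width as the hidden states.
    Returns the (T,) attention weights and the (d,) weighted sum.
    """
    scores = np.tanh(hidden) @ context  # one scalar score per time step (assumed scoring)
    weights = softmax(scores)           # non-negative weights that sum to 1
    summary = weights @ hidden          # attention-pooled representation
    return weights, summary

rng = np.random.default_rng(0)
hidden = rng.normal(size=(50, 64))  # stand-in for BiLSTM outputs (T=50, d=64)
context = rng.normal(size=64)       # stand-in for the jointly trained context vector
weights, summary = attend(hidden, context)
```

In a trained model the context vector would be a learned parameter, so the weights would concentrate on the frames that matter for the intention rather than on arbitrary ones as here.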
Unlike emotion analysis, where neither textual nor acoustic features necessarily dominate, in intention analysis, obtaining textual information can bring a significant advantage, even when punctuation marks are removed as in our experiment. Here, the text input for the ambiguous sentences is identical (without punctuation marks) for two to four different versions of the speech, but feeding it as an input of separately constructed Audio-BRE may provide supplementary information. The final hidden layer of Audio-BRE-Att is concatenated with that of Text-BRE-Att to make up a new feature layer, as suggested in (3) in Figure 1.
In multi-hop attention (MHA), which was proposed for speech emotion recognition, textual and acoustic features interact by sequentially transmitting information to each other. This is the origin of the expression ‘multi-hop’. Here, we implement hopping from audio to text (4a, MHA-A), and then to the audio again (4b, MHA-AT), as these showed better performance than the further-hopped model in the previous study. This order is also empirically more plausible than the reverse, since the auditory sense encounters the acoustic data first. Hopping is performed by adopting the final representation of each feature as a context vector of the other, as in (4) of Figure 1, where the final outputs of the former and the latter are eventually concatenated.
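The hopping scheme can be illustrated with a small NumPy sketch. It is a simplified reading of the description above, not the published implementation: it assumes both modalities share the hidden width `d`, uses a plain dot-product score, and the choice of which vectors are concatenated at each hop (as well as all variable names) is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(hidden, context):
    # Weighted sum of (T, d) hidden states, weights from dot products with a (d,) context.
    weights = softmax(hidden @ context)
    return weights @ hidden

rng = np.random.default_rng(0)
d = 64
audio_hidden = rng.normal(size=(120, d))  # stand-in for Audio-BRE hidden states
text_hidden = rng.normal(size=(30, d))    # stand-in for Text-BRE hidden states
audio_context = rng.normal(size=d)        # stand-in for the trained audio context vector

# Self-attentive audio summary, the starting point of the hops.
audio_repr = attend(audio_hidden, audio_context)

# Hop 1 (MHA-A): the audio summary acts as the context vector for the text.
text_repr = attend(text_hidden, audio_repr)
mha_a = np.concatenate([audio_repr, text_repr])

# Hop 2 (MHA-AT): the attended text acts as the context vector for the audio again.
audio_repr_2 = attend(audio_hidden, text_repr)
mha_at = np.concatenate([audio_repr_2, text_repr])
```

Either concatenated vector would then be passed to the MLP classifier; the second hop reuses the same audio hidden states and only changes the context that weights them.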
As another co-attention framework, we adopt cross-attention (CA), which fully utilizes the information flow exchanged simultaneously by both acoustic and textual features. In the original paper on image-text matching, image segments are utilized in determining the attention vector for the text, and similarly in reverse. Thus, beyond using the representation of one feature as a context vector for the other’s attention weight, given the prosody-semantics interface, we assumed it more plausible to utilize the final representation of Audio-BRE-Att in making up a weight vector for Text-BRE and vice versa. In this case, self-attentive embedding was not applied to the textual features, in order to reflect the auditory-first nature and to avoid the analysis being performed on audio and text at an equal rate.
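One way to read the cross-attention described above is sketched below in NumPy. This is an interpretation under stated assumptions rather than the authors' code: the audio side uses a self-attended summary while the text side uses its last hidden state (since no self-attention is applied to text), both attentions are computed from these initial representations instead of sequentially, and the dot-product scoring, shared width `d`, and variable names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(hidden, context):
    # Weighted sum of (T, d) hidden states, weights from dot products with a (d,) context.
    weights = softmax(hidden @ context)
    return weights @ hidden

rng = np.random.default_rng(1)
d = 64
audio_hidden = rng.normal(size=(120, d))  # stand-in for Audio-BRE hidden states
text_hidden = rng.normal(size=(30, d))    # stand-in for Text-BRE hidden states
audio_context = rng.normal(size=d)        # stand-in for the trained audio context vector

audio_repr = attend(audio_hidden, audio_context)  # Audio-BRE-Att output (self-attentive)
text_repr = text_hidden[-1]                       # final Text-BRE state, no self-attention

# Each modality is re-weighted using the other's representation.
text_attended = attend(text_hidden, audio_repr)
audio_attended = attend(audio_hidden, text_repr)

ca_features = np.concatenate([audio_attended, text_attended])
```

Unlike the hopping scheme, neither attended vector depends on the other here, which is what makes the exchange simultaneous rather than sequential.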
To create the dataset for the analysis of ambiguous speech which requires prosodic disambiguation, a corpus was constructed that contains about 1.3K scripts, each incorporating two to four ways of pronunciation (and corresponding intentions). Specifically, each sentence (i) starts with a wh-particle, (ii) incorporates a predicate made up of general verbs and pronouns, and (iii) ends with an underspecified sentence ender, so that the overall prosody varies with the intention (and sometimes with a politeness suffix). All the sentences received the consensus of three Korean natives, and the total number of speech utterances reaches 3,552. One male and one female speaker each recorded all the utterances, yielding a dataset of size 7,104. The number of intentions is seven, namely statement, yes/no question, wh-question, rhetorical question, command, request, and rhetorical command. The categorization is slightly modified from a recently distributed Korean corpus to reflect the wh- intervention that matters as in (S1). The specification of the corpus and its detailed generation scheme are published as a separate article.
For acoustic features, the mel spectrogram (MS) and the root mean square energy per frame (RMSE) were obtained with Librosa and concatenated frame-wise. For textual features, character-level embeddings were utilized.
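The frame-wise concatenation can be sketched as follows. To stay self-contained, this example computes the per-frame RMS energy directly in NumPy and uses a random placeholder where the mel spectrogram would go; it does not reproduce Librosa's framing or padding behavior, and the frame length, hop length, number of mel bands, and function names are assumptions.

```python
import numpy as np

def frame_rms(signal, frame_length=2048, hop_length=512):
    """Root mean square energy of each (non-padded) frame of a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_length : i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

def stack_features(mel, rms):
    """Join a (n_mels, T) mel spectrogram and a (T,) RMS track into (T, n_mels + 1)."""
    n = min(mel.shape[1], len(rms))  # guard against off-by-one frame counts
    return np.concatenate([mel[:, :n].T, rms[:n, None]], axis=1)

sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 220 * t)  # one second of a 220 Hz sine as test audio

rms = frame_rms(signal)
mel = np.random.default_rng(0).random((128, len(rms)))  # placeholder for a real mel spectrogram
features = stack_features(mel, rms)  # one row per frame, ready for the BiLSTM
```

Each row of `features` holds the mel bands of one frame followed by that frame's energy, which is the layout a frame-level recurrent model would consume.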
For the character-level features, two types of representation were adopted, namely sparse and dense, as they show the best performance for classification tasks. For sparse vectors, multi-hot encodings of the Korean characters were used. These features are concise and also preserve the property of the syllabic blocks as conjunct forms. For dense features that reflect distributional semantics, a recently disclosed fastText-based word vector dictionary was exploited.
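As an illustration of the sparse representation, a Hangul syllable block can be decomposed into its initial consonant, vowel, and optional final consonant with standard Unicode arithmetic, and each component can set one bit. The 67-dimensional layout (19 initials, 21 vowels, 27 finals), the zero vector for non-Hangul characters, and the function name `multi_hot` are assumptions made for this sketch; the encoding used in the paper may be laid out differently.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3   # last precomposed syllable
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 27  # finals exclude the "no final" case

def multi_hot(char):
    """Return a 67-dim multi-hot vector for one Hangul syllable block."""
    vec = [0] * (N_INITIAL + N_MEDIAL + N_FINAL)
    code = ord(char)
    if not (HANGUL_BASE <= code <= HANGUL_LAST):
        return vec  # assumption: characters outside the syllable range map to zeros
    index = code - HANGUL_BASE
    initial = index // (21 * 28)        # 28 = 27 finals + "no final"
    medial = (index % (21 * 28)) // 28
    final = index % 28                  # 0 means the block has no final consonant
    vec[initial] = 1
    vec[N_INITIAL + medial] = 1
    if final > 0:
        vec[N_INITIAL + N_MEDIAL + final - 1] = 1
    return vec

# The example sentence (S1) without spaces: one vector per syllable block.
encoded = [multi_hot(c) for c in "몇개가져가"]
```

A block with a final consonant, such as 몇, sets three bits, while an open syllable such as 가 sets two, so the block structure stays visible in the vector.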
Considering the head-finality of Korean sentences, we embedded the acoustic and textual features backward from the endpoint. The maximum sequence length was fixed to cover the longest input, where all the features with shorter frame/character length were padded with zeros. Models were implemented with Keras, using the TensorFlow backend. Architecture and hyper-parameter specifications are provided separately online, along with all the implementation code (https://github.com/warnikchow/coaudiotext).
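One plausible reading of embedding "backward from the endpoint" is that each sequence is aligned to the end of a fixed-length array, with zeros filling the unused positions at the front. The sketch below implements that reading in NumPy; the function name `pad_from_end` and the truncation of over-long inputs are assumptions, and the repository linked above remains the reference for the actual procedure.

```python
import numpy as np

def pad_from_end(sequences, max_len=None):
    """Right-align variable-length (T_i, d) sequences in a zero-filled (N, max_len, d) batch."""
    dim = sequences[0].shape[1]
    if max_len is None:
        max_len = max(len(s) for s in sequences)  # cover the longest input
    batch = np.zeros((len(sequences), max_len, dim))
    for i, seq in enumerate(sequences):
        seq = seq[-max_len:]                      # keep the ending if a sequence is too long
        batch[i, max_len - len(seq):] = seq       # last frame lands on the last position
    return batch

short = np.ones((2, 3))        # 2 frames, 3 features
long = 2 * np.ones((4, 3))     # 4 frames, 3 features
batch = pad_from_end([short, long])
```

With this alignment, the sentence-final frames or characters, which carry the sentence ender in a head-final language, always occupy the same positions regardless of utterance length.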
|Model|Audio only: acc. (F1)|Sparse text: acc. (F1)|Dense text: acc. (F1)|Parameters|Training time per epoch|
|---|---|---|---|---|---|
|(1) Audio-BRE|83.9 (0.652)|-|-|116K|65s|
|(2) Audio-BRE-Att|89.3 (0.759)|-|-|190K|67s|
|(3) Para-BRE-Att|-|93.2 (0.919)|92.8 (0.919)|260K|70s|
|(4a) MHA-A|-|93.8 (0.928)|93.5 (0.922)|266K|67s|
|(4b) MHA-AT|-|92.8 (0.909)|91.8 (0.904)|270K|67s|
|(5) CA|-|91.8 (0.884)|93.5 (0.919)|326K|65s|
|(3’) Para-ASR|-|90.0 (0.822)|-|-|-|
|(4a’) MHA-ASR|-|90.2 (0.799)|-|-|-|
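Table 1: Results on the 10% test set, given as accuracy with the F1 score in parentheses. For each feature, the reported model was chosen from the intersection of the 5-best accuracy and 5-best F1 models that were yielded during the first 100 epochs of training.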
Table 1 shows the comparison results obtained with the corpus in Section 3.1. Both the train and test sets in (1-5) use the ground-truth scripts, while for the others, the test set scripts were ASR results. Input materials are either audio alone or audio-text combined, both in the training and the test phase.
Attention matters: First, from (1) and (2), we observed that audio itself incorporates substantial information regarding speech intention, and that physical features such as duration, pitch, tone, and magnitude can help yield semantic understanding via the attention mechanism. This seems to be related to the phenomenon that people often catch the underlying intention of a speech even when they fail to understand all the words. Also, it was shown that attaching the attention layer guarantees stable convergence of the learning curve.
Text matters: Next, as expected, the text-aided models (3-5) far outperform the speech-only ones (1, 2), notwithstanding the bigger trainable parameter set and the computation time. Although the character-level features we utilized do not necessarily represent semantics (which is held at least at the morpheme level), it can be interpreted that the utilization of textual features helps recognize the prosodic prominence within the audio features. It was beyond our expectation that the sparse vectors outperform the dense ones in general (except in CA), which implies that CA takes more advantage of the distributional semantics within the text embedding.
Co-attention framework helps: To be specific on (3-5), we noticed that co-utilizing both audio and text in making up the attention vectors as in (4) MHA or (5) CA shows better performance than a simple concatenation in Para-BRE-Att. Since the studies on speech emotion analysis [17, 18] claim that prosody and semantic cues cooperatively affect inferring the ground truth, we suspect that a similar phenomenon takes place in the case of speech intention. That is, acoustic and textual processing meaningfully benefit from consequent or simultaneous interaction with each other.
Over-stack may bring a collapse: We first guessed that (4b) or (5) would show better performance than (4a) due to the broader or deeper exchange of information between both sources. However, we observed performance degradation there, finding that the inference becomes unstable if too much information is stacked. Although the performance of the models may not be directly linked to the actual human processing mechanism, as a tendency it is assumed that speech intention analysis is affected dominantly by the combination of speech analysis and a speech-aided text analysis (4a, 5), preferably with a smaller contribution of text-aided speech analysis (4b).
Miscellaneous: For a practical analysis, model parameter size and training time per epoch were recorded (Table 1). Taking into account that audio processing itself incorporates huge computation, co-utilizing the textual information seems to bring significant improvement. Also, we performed an additional experiment on ASR results (3’, 4a’), especially for the test utterances. The training was performed with the ground truth, and only the models for the sparse textual features were used. It is notable that both perform competitively with the case of perfect transcription, but the degradation was more significant in the co-attention framework. This implies that the framework utilizing textual information more aggressively is, ironically, more vulnerable to errors. Thus, both accurate ASR and error-compensating text processing are required for the improvement and application of the systems.
We want to claim that unlike emotion recognition, which is often dominated by the voice tone, intention analysis should rely on both acoustic and textual features. To be specific, if the sentence incorporates syntactic ambiguity, the genuine intention cannot be inferred unless the audio and text are both given. This is why we brought in various co-attention frameworks that are roughly supported by a psycholinguistic viewpoint. For example, when we try to understand the meaning of speech, the parts of the brain that deal with understanding the act (e.g., statement, question, command) involve both Wernicke’s area (related to semantics) and Broca’s area (related to linguistic prosody). Not only is their excitation simultaneous, but they also interact with each other, e.g., via the corpus callosum, especially intensively when faced with ambiguous utterances.
As stated previously, utterances with ambiguity are disturbing factors for speech intention understanding, which can mislead the analysis into providing a wrong intent or item. However, actively aggregating both audio and text in analyzing such utterances can help predict the intention more precisely, if a transcription with high accuracy is given. We performed this study in light of the prosody-semantics interface to assure that our approach is meaningful for intriguing problems. In real life, co-attention frameworks can help machines or aphasia patients understand speech. Consequently, the system users or social chatbots might be able to provide a proper response or reaction in free-style or goal-oriented conversations with others.
In this paper, we constructed speech intention recognition systems using co-attentional frameworks inspired by psycholinguistics and the prosody-semantics interface of human language understanding. Multi-hop attention and cross-attention outperformed the conventional speech/attention-based and text-aided models, as shown by the evaluation using the audio-text pairs recorded with manually created scripts. An additional experiment with ASR output was also conducted to examine real-world usage. The implemented systems can help SLU modules correctly infer the intention of syntactically/semantically ambiguous utterances in Seoul Korean and possibly in a multi-lingual manner. Besides, we hope the results provide empirical evidence for uncovering the language processing mechanism of ambiguous utterances.
“Speech intention classification with multimodal deep learning,” in Canadian Conference on Artificial Intelligence. Springer, 2017, pp. 260–271.
Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.
“Sequence-to-sequence autoencoder based korean text error correction using syllable-level multi-hot vector representation,” in Proceedings of HCLT [in Korean], 2018, pp. 661–664.