Full-length answer generation is the task of generating a natural answer from a question and an answer span, usually a fact-based phrase (factoid answer) extracted from relevant knowledge sources such as knowledge bases (KB) or context passages. Such functionality is desirable in conversational agents and dialogue systems that interact naturally with the user over multi-modal interfaces, such as speech and text. Typical task-oriented dialogue systems and chatbots formulate coherent responses from the conversation context with a natural language generation (NLG) module. These modules copy relevant facts from the context while generating new words, maintaining factual accuracy in a coherent fact-based natural response. Recent research [Liu-et-al:0018ijcai, pal-etal-2019-answering] uses a pointer-network to copy words from relevant knowledge sources.
While the task of generating natural responses to text-based questions has been extensively studied, there is little research on natural answer generation from spoken content. Recent research on Spoken Question Answering and listening comprehension tasks [DBLP:journals/corr/abs-1804-00320] extracts an answer span and does not generate a natural answer. This motivates us to propose the task of generating a full-length answer from a spoken question and a textual factoid answer. However, such a task poses significant challenges, as the performance of the system is highly dependent on Automatic Speech Recognizer (ASR) errors. To mitigate the effect of Word Error Rate (WER) on ASR predictions, lists of top-N hypotheses, ASR lattices and confusion networks have been used in various tasks such as dialogue state tracking [DBLP:journals/corr/JagfeldV17, zhong2018global], dialogue act detection and named-entity recognition. These tasks show that models trained using multiple ASR hypotheses outperform those trained on the top-1 ASR hypothesis. While classification and labeling tasks benefit from multiple hypotheses by aggregating the predictions over a list of ASR hypotheses, it is non-trivial to apply the same approach to NLG using pointer-networks. Our proposed system aims to take advantage of multiple time-aligned ASR hypotheses, represented as a confusion network, using a pointer-network to generate full-length answers. To the best of our knowledge, there is no prior work on full-length answer generation from spoken questions. Our overall research contributions are as follows:
We propose a novel task of full-length answer generation from spoken questions. To achieve this, we develop a ConfNet2Seq model which encodes a confusion network and adapts it over a pointer-generator architecture.
We compare the effects of using multiple hypotheses encoded with a confusion network encoder and the best hypothesis encoded with a text encoder.
We publicly release the dataset, comprising the spoken question audio files, the corresponding confusion networks, the factoid answers and the full-length answers.
2 Related Work
Spoken Language Understanding (SLU) has the additional challenge of disambiguating ASR errors, which drastically affect performance. Several methods have been proposed to curb the effects of the WER. Word lattices from ASR were first used by [DBLP:journals/csl/Hakkani-TurBRT06] in place of the ASR top-1 hypothesis for tasks such as named-entity extraction and call classification. Word confusion networks have recently been used by [Ladhak2016LatticeRnnRN] for intent classification in dialogue systems and by [DBLP:journals/corr/JagfeldV17, pal2020modeling] for dialogue state tracking (DST). [DBLP:journals/corr/JagfeldV17] show that a confusion network gives performance comparable to the top-N hypotheses of the ASR, while [pal2020modeling] show that using a confusion network improves performance in both time and accuracy. Another related task in SLU is Spoken Question Answering. Recent work [DBLP:journals/corr/abs-1804-00320] on the SQuAD dataset introduces the task of machine listening comprehension, where the context passages are in audio form. [DBLP:journals/corr/abs-1808-02280] released the Open-Domain Spoken Question Answering Dataset (ODSQA), with more than three thousand questions in Chinese, and used an enhanced word embedding comprising a word embedding and a pinyin-token embedding. [unlu-inproceedings] developed a QA system for spoken lectures that generates an answer span from the video transcription.
Our system generates a full-length answer from a textual factoid answer and a spoken question. We use a pointer-generator architecture over two sequences, i.e., the textual factoid answer sequence and the encoded question sequence produced by the confusion network encoder. In this section, we describe (1) the confusion network encoder and (2) the final model over the spoken question and factoid answer. The full architecture is shown in figure 1.
3.1 Confusion Network Encoder
A confusion network is a weighted directed acyclic graph with one or more parallel arcs between consecutive nodes, where each path goes through all the nodes. Each set of parallel arcs represents time-aligned alternative words, or hypotheses, of the ASR weighted by probability. The probabilities of all parallel arcs between two consecutive nodes sum to 1. A confusion network can be defined formally as a sequence of sets of parallel weighted arcs as:
where is the ASR hypothesis at position , and its associated probability. We use a confusion network encoder to transform a 2-dimensional confusion network into a 1-dimensional sequence of embeddings as described in [masamura-et-al-2018]. Each word
of the confusion network can be encoded by weighting the word embedding by the ASR probability, followed by a non-linear transformation, as:
is a trainable parameter. Each set of parallel arcs can be encoded into a vector by a weighted sum over the words of the parallel-arc set. The weights measure the relevance of each word among the alternative time-aligned hypotheses. The learnt weight distribution for each parallel-arc set is:
where is a trainable parameter. The final encoding of each set of parallel arcs is:
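The encoder steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `encode_arc_set`, `W1` and `w2`, the use of `tanh` as the non-linearity, and the absence of bias terms are all hypothetical choices, and the exact parameterisation in [masamura-et-al-2018] may differ.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encode_arc_set(embeddings, probs, W1, w2):
    """Encode one set of parallel arcs into a single vector (illustrative sketch).

    embeddings: k word embeddings (each of size d), one per alternative word
    probs:      k ASR probabilities for those words (assumed to sum to 1)
    W1:         h x d projection matrix (assumed parameter shape)
    w2:         scoring vector of size h (assumed parameter shape)
    """
    # Weight each embedding by its ASR probability, then apply a non-linearity.
    qs = [[math.tanh(v) for v in matvec(W1, [p * x for x in e])]
          for e, p in zip(embeddings, probs)]
    # Score each alternative and normalise the scores with a softmax.
    scores = [sum(a * b for a, b in zip(w2, q)) for q in qs]
    top = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - top) for s in scores]
    alpha = [x / sum(exps) for x in exps]
    # The arc-set encoding is the relevance-weighted sum of the word encodings.
    encoding = [sum(a * q[i] for a, q in zip(alpha, qs)) for i in range(len(W1))]
    return encoding, alpha

def encode_confusion_network(confnet, W1, w2):
    """confnet: list of (embeddings, probs) pairs, one per position."""
    return [encode_arc_set(e, p, W1, w2)[0] for e, p in confnet]
```

Applied position by position, this turns a confusion network with a variable number of arcs per position into one vector per position, which is the 1-dimensional sequence the downstream bi-LSTM consumes.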
3.2 Full Length Answer Generation from Spoken Questions
We follow a Seq2Seq with pointer-generator architecture, as in [pal-etal-2019-answering], to generate full-length answers from a question and a factoid answer. However, we query with spoken questions instead of textual questions. The confusion network is extracted from spoken questions using a standard ASR. The question is encoded as where is the encoding from the confusion network encoder explained in section 3.1.
The factoid answer is represented as where is the GloVe embedding [Pennington14glove:global] of a word. We encode the sequences using two 3-layered bi-LSTMs which share weights as:
The encoded hidden states of the two encoders are stacked together to produce a single list of source hidden states, . The decoder is initialized with the combined final states of the two encoders as .
The global attention weights are computed on the hidden states of the question and hidden states of the answer, stacked to produce a total of global attention weights. For each source state, , and decoder state, :
where , , , are learnable parameters. The copy mechanism for summarization introduced in [DBLP:journals/corr/SeeLM17] takes advantage of a word distribution over an extended vocabulary comprising source words and vocabulary words. The probability of copying a word from a text sequence is
. To copy words from the confusion network, we compute the global attention weights over each set of parallel-arc encodings. Here, the global attention weights denote a probability distribution over parallel-arc sets instead of words. These attention weights are sampled to select the hidden state representation, , of a set of parallel arcs. The ASR scores form a probability distribution over the set of parallel words at position in the confusion network. These are sampled to select the most likely word from that set of parallel arcs. The final probability of copying a word from the confusion network is the joint probability:
The probability of copying a word from the answer is:
The final probability of a word output at at time by the decoder is as shown in
where is a soft switch for the decoder to generate words or copy words from the source. is the probability of generating a word from the vocabulary. These parameters are computed as described in [DBLP:journals/corr/SeeLM17].
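A small numeric sketch may help make the mixture of generation and copying concrete. It is written under explicit assumptions: the function name `final_distribution` and all argument names are hypothetical, the soft switch is taken to weight generation (as in [DBLP:journals/corr/SeeLM17]), and the copy terms are computed as expectations (attention weight times ASR probability, summed over positions) rather than by the sampling procedure described above.

```python
from collections import defaultdict

def final_distribution(p_gen, p_vocab, attn_question, confnet, attn_answer, answer_tokens):
    """Mix generation and copy probabilities into one word distribution (sketch).

    p_gen:         soft switch in [0, 1]; weight given to generating from the vocabulary
    p_vocab:       dict word -> probability of generating it from the vocabulary
    attn_question: attention weight for each parallel-arc set of the confusion network
    confnet:       list of parallel-arc sets, each a list of (word, asr_probability)
    attn_answer:   attention weight for each factoid-answer token
    answer_tokens: the factoid-answer tokens
    Assumes attn_question and attn_answer together sum to 1.
    """
    dist = defaultdict(float)
    # Generation: scale the vocabulary distribution by the soft switch.
    for word, prob in p_vocab.items():
        dist[word] += p_gen * prob
    # Copy from the confusion network: attention on the arc set times ASR score of the word.
    for weight, arcs in zip(attn_question, confnet):
        for word, asr_prob in arcs:
            dist[word] += (1.0 - p_gen) * weight * asr_prob
    # Copy from the factoid answer: attention on every position holding the word.
    for weight, word in zip(attn_answer, answer_tokens):
        dist[word] += (1.0 - p_gen) * weight
    return dict(dist)

example = final_distribution(
    p_gen=0.4,
    p_vocab={"the": 0.5, "was": 0.5},
    attn_question=[0.3, 0.2],
    confnet=[[("conan", 0.6), ("counting", 0.4)], [("the", 1.0)]],
    attn_answer=[0.5],
    answer_tokens=["conan"],
)
```

Because every arc set carries ASR scores that sum to 1 and the attention weights over both sources sum to 1, the result remains a valid probability distribution over the extended vocabulary.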
To generate data for our task, we use samples from the full-length answer generation dataset introduced in [pal-etal-2019-answering], where each sample consists of a question, a factoid answer and a full-length answer. The samples in the dataset were chosen from SQuAD and HarvestingQA. Each sample in our dataset is also a 3-tuple in which is a spoken-form question, is a text-form factoid answer and is the text-form full-length natural answer. samples were randomly selected as the training set, as the development set and as the test set. We also extracted samples from the NewsQA dataset and samples from Freebase to evaluate our system on cross-domain datasets. Code and dataset are available at: https://github.com/kolk/ConfnetPointerGenBaseline
We used Google text-to-speech to generate the spoken utterances of the questions. Google Voice en-US-Standard-B was used to generate spoken questions in a male voice and Google Voice en-US-Wavenet-C was used to generate spoken questions in a female voice. All samples are in US-accented English. The ASR lattice was extracted using Kaldi ASR [Povey_ASRU2011] and converted to a confusion network for compact representation using SRILM [Stolcke02srilm--]. We used the pre-trained ASpIRE Chain Model, which has been trained on Fisher English, to transcribe the spoken questions and extract the ASR lattices. The training dataset has a WER of and the test set has a WER of on the best hypothesis of the ASR, while for the cross-dataset evaluation test sets, NewsQA has a WER of and Freebase has a WER of .
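For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used for the figures above; the function name `word_error_rate` is hypothetical, and the two strings are the gold question and top hypothesis from the sample in the results section, with the trailing question mark dropped.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion of a reference word
                curr[j - 1] + 1,                       # insertion of a hypothesis word
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

gold = "what was the title of the sequel to conan the barbarian"
top1 = "what was the title of the sequels are counting the barbarian"
wer = word_error_rate(gold, top1)  # 3 substitutions over 11 reference words
```

On this pair the hypothesis differs from the reference in three words (sequel/sequels, to/are, conan/counting), giving a WER of 3/11.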
5 Experiments and Results
We built our system over OpenNMT-py [opennmt]. We used a batch size of 32, a dropout rate of 0.5, an RNN size of 512 and 10000 decay steps. The maximum number of parallel arcs in the confusion network and the maximum sentence length are set to 20 and 50 respectively. The confusion network contains noise and interjections such as *DELETE*, [noise], [laughter], uh and oh, which degrade system performance. To mitigate the effect of such noise, we remove a whole set of parallel arcs if all of its arcs are noise or interjection words. As shown in table 1, the pruned confusion network, named clean confnet, marginally outperforms the unpruned confusion network system on the SQuAD/HarvestingQA dataset. We also compare the system with a model trained on the best hypothesis extracted from the ASR lattice using Kaldi. Here, the confusion network encoder is replaced with a text encoder which shares weights with the factoid answer encoder.
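The pruning rule described above can be sketched as a simple filter. This is an illustrative sketch rather than the released implementation: the function name `prune_confusion_network` and the representation of a confusion network as a list of (word, probability) lists are assumptions, and the noise set contains only the tokens named in the text.

```python
# Noise and interjection tokens named in the text; the full set used in practice may be larger.
NOISE_TOKENS = {"*DELETE*", "[noise]", "[laughter]", "uh", "oh"}

def prune_confusion_network(confnet, noise_tokens=NOISE_TOKENS):
    """Drop every parallel-arc set whose arcs are all noise or interjection words.

    confnet: list of parallel-arc sets, each a list of (word, asr_probability) pairs.
    Sets that contain at least one real word are kept unchanged, noise arcs included.
    """
    return [arcs for arcs in confnet
            if not all(word in noise_tokens for word, _ in arcs)]

noisy = [
    [("what", 0.9), ("uh", 0.1)],           # kept: contains a real word
    [("*DELETE*", 0.6), ("[noise]", 0.4)],  # removed: every arc is noise
    [("title", 1.0)],                       # kept
]
clean = prune_confusion_network(noisy)
```

Note that the rule removes whole positions only; a noise arc that competes with a real word stays in place, so the ASR probabilities within each remaining set still sum to 1.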
As shown in table 1, on the SQuAD/HarvestingQA dataset the Best-ASR-Hypothesis model outperforms the clean confusion network model by a 5% margin in BLEU score and a 2% margin in ROUGE-L score. To assess cross-domain generalizability, we also perform cross-dataset evaluation by evaluating our models on samples of a KB-based dataset (Freebase) and samples of a machine comprehension dataset (NewsQA). The clean confusion network model marginally outperforms the best-hypothesis model in ROUGE scores for cross-dataset evaluation and gives comparable results on BLEU scores. This suggests that the confusion network system generalizes better to cross-domain noisy data and is less sensitive to noise introduced by new domains and noisy input signals than the Best-ASR-Hypothesis model. A plausible reason for this could be that the confusion network model is itself trained on a closed set of hypotheses, whereas the Best-ASR-Hypothesis model makes simplifying assumptions about the input signal. A compelling extension to the confusion network model is to adapt the copy attention over all the time-aligned hypotheses of the confusion network input. This would allow the confusion network model to copy among the top-N words at any given time-step of the confusion network, instead of an erroneous word with the highest ASR score.
An example of results on a SQuAD/HarvestingQA test sample is as follows:
Gold Question: what was the title of the sequel to conan the barbarian ?
Top-Hypothesis: what was the title of the sequels are counting the barbarian
Factoid Answer: conan the destroyer
Full-length Answer: the title of the sequel to conan the barbarian was conan the destroyer
Clean Confnet Model prediction: the title of the sequels to the barbarian was conan the destroyer
Best-Hypothesis Model prediction: the title of the sequels are counting the barbarian
We propose the task of generating full-length natural answers from spoken questions and factoid answers. We generated a dataset consisting of triples (spoken question, factoid answer, full-length answer) and extracted confusion networks from the questions. We used a pointer-network over ASR graphs (confusion networks) and show that it gives results comparable to the model trained on the best hypothesis. Our system achieves a BLEU score of 55.92% and a ROUGE-L score of 76.78% on the SQuAD/HarvestingQA dataset. We perform cross-dataset evaluation to obtain a BLEU score of 42.89% and a ROUGE-L score of 66.39% on Freebase, and a BLEU score of 56.86% and a ROUGE-L score of 73.12% on the NewsQA dataset.