Machine comprehension (MC) on text has improved substantially since the advent of large-scale self-supervised pre-trained language models such as BERT and GPT. Instead of learning MC tasks from scratch, these models are first pre-trained on a large unannotated corpus to learn self-supervised representations of general language and are then fine-tuned on the downstream MC dataset. This training scheme helps the models tackle MC tasks and achieve results comparable to human performance on the SQuAD datasets [22, 21].
Previous work indicated that MC on spoken content is much more difficult than on text, because speech recognition errors have a catastrophic impact on MC. On the other hand, end-to-end models for spoken language tasks, such as spoken language understanding (SLU) and speech-to-speech translation, are promising for avoiding the ASR error propagation problem, although for now the performance of most end-to-end models still falls short of that of the corresponding natural language task.
In this work, we propose SpeechBERT, a pre-trainable model of generic representations for speech and text tasks. By combining a pre-trained language model with a spoken audio encoder, our end-to-end model can circumvent the negative impact of cascading separately trained ASR and QA models. First, we pre-train SpeechBERT on both a text corpus and speech audio so that the model can extract useful semantic features from both speech and text content. When fine-tuning on the MC task, the whole model can then be jointly optimized without the error bottleneck caused by ASR. As the first work on end-to-end spoken question answering (SQA), our model achieves results close to the performance of cascaded ASR and QA models. Although room for improvement remains, this is a solid first step towards end-to-end SQA.
2 Related Works
2.1 Speech Segment Embedding
In a series of investigations, researchers have sought to extract semantic embeddings from speech feature segments given predefined segment boundaries. Speech2Vec used a sequence-to-sequence network over speech features to imitate the skip-gram or CBOW training of Word2Vec, helping the model extract more semantic features. An unsupervised segmentation method was also proposed to jointly learn to segment spoken words and extract speech embeddings. After learning speech segment embeddings, subsequent works [2, 5] tried to map them to regular word embeddings, as in bilingual embedding mapping, in either supervised (using paired seeds) or unsupervised (using generative adversarial networks) fashion. Promising as these works sound, the mapping quality so far falls well below that of the bilingual case: the accuracy of the mapping between speech and text embeddings is at best around 25%, even with the aid of oracle boundaries. This indicates the difficulty of disentangling semantic information from noisy speech signals without a supervised ASR model.
We also use the speech segment embedding concept in this work; the method is introduced in Section 3.2.2. However, because the baseline methods for SQA tasks already include a supervised ASR model as a front end, we do not impractically pursue a fully unsupervised method for speech content extraction. Instead, in the pre-training stage we use labels indicating exactly which word each speech segment represents.
2.2 End-to-end Model for Spoken Language Tasks
Conventional methods for spoken language tasks need ASR as a front-end module that distills the semantic information in the speech signal into plain text. The ASR output is then treated as natural language data and fed into regular NLP models for downstream tasks. An end-to-end model instead aims to tackle the whole task from speech-level features without a cascaded ASR model. End-to-end models have several benefits: 1) they directly optimize the metric of the final task, instead of optimizing separate targets for the ASR and NLP models; 2) they avoid the error propagation caused by the ASR bottleneck; 3) exposing speech information directly to the downstream model can help it capture useful information that does not appear in text transcripts.
Regular spoken language understanding (SLU) tasks such as intent classification and slot filling have recently been explored with end-to-end methods [1, 10, 3, 24] more widely than spoken question answering (SQA) tasks. However, the two tasks differ in difficulty. SLU is a sentence-level classification problem that fills slots from pre-defined classes by extracting local information from a short utterance; once the literal meaning of the utterance is extracted, the SLU model is not far from making a correct classification. Compared to SLU, the inputs of the SQA task are much longer spoken paragraphs. Besides understanding the literal meaning, the SQA model must first organize global information, because sophisticated reasoning over the paragraph is required to answer the questions. Fine-grained information is also needed to predict the exact position of the answer span within a very long context; we therefore solve it with pointer networks rather than classification models. These are the reasons why SQA is a harder problem than SLU.
Based on the BERT model, we extend the BERT architecture with a speech segment encoder, which serves as an alternative to the ASR model. Instead of recognizing words, the speech segment encoder aims to directly find good speech representations that can be fed into the BERT model, making it possible to process text and speech in a shared BERT model. The model architecture and training process are illustrated in Figure 1.
3.1 BERT for Text Pre-training
BERT is a multi-layer Transformer model. For the text part, given a token sequence $T = (t_1, \dots, t_n)$, we represent the tokens with embedding vectors $E_{tok} = (e_1, \dots, e_n)$. We then add positional embeddings and sentence segment embeddings to obtain the input representation $E = E_{tok} + E_{pos} + E_{seg}$, which is fed into the multi-layer Transformer. At the output layer of BERT, the output features are used for two tasks: masked language modeling (MLM) and next sentence prediction (NSP). MLM randomly replaces 15% of the vectors in $E$ with a special mask token vector and predicts the masked tokens at the corresponding positions of the output features. NSP predicts whether the tokens with different sentence segment embeddings come from successive sentences. However, some recent studies [14, 13, 30, 16] have indicated that NSP does not improve performance but instead hurts it, so we removed this part and trained only MLM in our setting.
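The 15% masking rule described above can be sketched as follows. This is a simplified illustration, not the exact implementation: the mask id, the ignore index, and the uniform replace-with-mask policy are assumptions (BERT's full recipe also sometimes keeps or randomizes selected tokens).

```python
import random

MASK_ID = 103          # hypothetical id of the [MASK] token
MASK_RATE = 0.15       # fraction of positions replaced for MLM

def mask_tokens(token_ids, rng):
    """Replace ~15% of positions with MASK_ID; return masked ids and targets.

    Targets hold the original id at each masked position and -1 (ignored)
    elsewhere, so the MLM loss is computed only on masked positions.
    """
    masked, targets = [], []
    for tid in token_ids:
        if rng.random() < MASK_RATE:
            masked.append(MASK_ID)
            targets.append(tid)      # predict the original token here
        else:
            masked.append(tid)
            targets.append(-1)       # position not scored
    return masked, targets

rng = random.Random(0)
ids = [5, 17, 42, 7, 99, 13, 28, 64]
masked, targets = mask_tokens(ids, rng)
```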
3.2 Speech Segment Encoder Pre-training
For the speech part, we have a speech feature sequence $X = (x_1, \dots, x_m)$, where $m$ denotes the number of acoustic feature frames. The speech feature sequence is segmented into audio segments as described in Section 3.2.1. Given the word boundaries, we encode each segment to obtain a speech version of the word vectors $S = (s_1, \dots, s_k)$, where $k$ denotes the number of segments. The encoding method is described in Section 3.2.2.
3.2.1 Speech Segmentation
Segmentation for Training Stage: To effectively extract semantic features from speech signals, we segment the Mel-frequency cepstral coefficient (MFCC) sequences according to predefined boundaries obtained by forced alignment with an off-the-shelf ASR model.
Segmentation for Testing Stage: At the testing stage, we cannot access the ground-truth transcripts to run forced alignment, so we use the ASR model to obtain a pseudo-label word sequence and run forced alignment on it. Even when the ASR results contain wrong words, the boundaries found by forced alignment usually correspond to some other true words.
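Given the frame-level boundaries from forced alignment, the slicing step itself is straightforward; a minimal sketch follows, where the representation of boundaries as (start, end) frame-index pairs is an assumption.

```python
def segment_features(mfcc, boundaries):
    """Slice a frame-level feature sequence into word segments.

    mfcc: list of per-frame feature vectors (length = number of frames).
    boundaries: list of (start_frame, end_frame) pairs from forced
    alignment, end exclusive; one pair per aligned word.
    """
    return [mfcc[s:e] for s, e in boundaries]

# Toy example: 10 frames of 3-dim features, three aligned "words".
frames = [[float(i)] * 3 for i in range(10)]
segments = segment_features(frames, [(0, 3), (3, 7), (7, 10)])
```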
3.2.2 Phonetic-Semantic Joint Embedding
After obtaining the speech feature segments, we use an RNN sequence-to-sequence autoencoder to encode each segment into a phonetic embedding that captures the phonetic information of the acoustic word. The autoencoder training procedure makes audio segments with similar phonetic content cluster together. However, because these embeddings must serve as inputs to the BERT model, simply fitting pure phonetic information without considering the semantic relations between acoustic words is not desirable. Hence, for words that are not out-of-vocabulary, we use the labels of the acoustic words to retrieve primary word vectors from the word embedding layer of BERT. For each paired audio segment and word, we add a loss term computed as the L1 distance between the two paired vectors. In this way, the autoencoder learns to fit the BERT input distribution of semantic word embeddings while keeping enough acoustic information to reconstruct the original MFCC features. This regularization helps the model learn a joint embedding space for both text and speech, extracting semantic-level features directly from speech.
To make the concept clear, we list the loss terms to optimize. Given an audio segment $x = (x_1, \dots, x_T)$ as input features, the RNN encoder encodes it as a vector $z$. The RNN decoder then maps $z$ to the output sequence $y = (y_1, \dots, y_T)$. The encoder-decoder network is trained to minimize the reconstruction error:

$$L_{recon} = \sum_{t=1}^{T} \lVert x_t - y_t \rVert^2.$$

At the same time, the vector $z$ is constrained by an L1-distance loss term:

$$L_{emb} = \lVert z - \mathrm{Emb}(w) \rVert_1,$$

where $w$ is the token label behind the audio segment and $\mathrm{Emb}(\cdot)$ is the embedding layer of BERT.
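A minimal NumPy sketch of the two loss terms above, assuming the encoder output z, the decoder reconstruction y, and the BERT embedding of the word label are already computed; the function and variable names here are hypothetical.

```python
import numpy as np

def joint_embedding_losses(x, y, z, bert_word_emb):
    """Compute the two training objectives for one audio segment.

    x: (T, d) input MFCC segment; y: (T, d) decoder reconstruction;
    z: (h,) encoder output vector; bert_word_emb: (h,) BERT embedding
    of the word label behind the segment.
    """
    # Reconstruction error: summed squared L2 distance over frames.
    l_recon = float(np.sum((x - y) ** 2))
    # L1 constraint pulling z toward the BERT word embedding.
    l_emb = float(np.sum(np.abs(z - bert_word_emb)))
    return l_recon, l_emb

x = np.ones((4, 3))
y = np.zeros((4, 3))
z = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])
l_recon, l_emb = joint_embedding_losses(x, y, z, w)
# l_recon = 12.0 (12 elements, each off by 1); l_emb = 0.5 + 1.5 = 2.0
```

In training the total objective would be a weighted sum of the two terms, so the encoder stays phonetically faithful while being pulled into BERT's embedding space.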
3.3 Joint MLM Pre-training on Speech and Text Corpora
After the speech segment encoder is trained, we pre-train SpeechBERT on both the text and speech corpora, then fine-tune it on the downstream QA task.
To denoise the audio representations and to allow BERT to be fed not only discrete text embeddings but also continuous speech embeddings, we jointly optimize the MLM loss for both speech and text. The training target of the text MLM has been described in Section 3.1. For the speech part, after obtaining the representations of the audio segments, we likewise randomly replace 15% of the vectors with the mask token vector, as in Section 3.1. In our supervised setting, we can simply predict which tokens lie behind the masked speech segments, as in MLM.
Since the performance reported in another end-to-end SLU work did not improve much when the speech feature extraction layers were unfrozen, we freeze the speech encoder network during MLM training to speed up the whole procedure. The word embedding layer, however, is left unfrozen so that it can adapt to coexist with the speech embeddings.
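The speech-side masking can be sketched as below. The 15% rate and the use of the [MASK] token's embedding as the replacement vector follow the description above, while the function and variable names are hypothetical.

```python
import random

def mask_segment_embeddings(seg_embs, mask_emb, rng, rate=0.15):
    """Replace ~15% of speech segment embeddings with the mask embedding.

    seg_embs: list of continuous segment vectors from the (frozen)
    speech encoder; mask_emb: the [MASK] token's embedding vector.
    Returns the masked sequence and the indices whose word labels the
    model must predict (supervised setting).
    """
    out, masked_positions = [], []
    for i, vec in enumerate(seg_embs):
        if rng.random() < rate:
            out.append(mask_emb)          # continuous vector replaced
            masked_positions.append(i)    # predict this segment's word
        else:
            out.append(vec)
    return out, masked_positions

rng = random.Random(1)
embs = [[float(i)] * 4 for i in range(20)]
mask = [0.0] * 4
masked, pos = mask_segment_embeddings(embs, mask, rng)
```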
3.4 Fine-tuning on Question Answering
After MLM pre-training, the model is fine-tuned on the downstream QA task to minimize the loss for predicting the correct start/end positions of the answer span, as originally proposed for BERT. Introducing a start vector $S$ and an end vector $E$, we compute the dot product of $S$ with each final hidden vector $T_i$ from BERT. The dot products are softmax-normalized over all words in the sentence to obtain the probability of word $i$ being the start position. End-position prediction follows the same procedure with $E$.
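The start-position probability computation can be sketched in plain Python as below (end positions use the end vector E identically); this is an illustration of the standard BERT span head, not the authors' exact code.

```python
import math

def span_start_probs(hidden, start_vec):
    """Softmax over dot products of a start vector with each hidden state.

    hidden: list of final hidden vectors T_i from BERT;
    start_vec: the learned start vector S.
    """
    scores = [sum(s * h for s, h in zip(start_vec, t)) for t in hidden]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: three 2-dim hidden vectors.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
probs = span_start_probs(hidden, [1.0, 1.0])  # scores: 1, 1, 4
```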
4 Experimental Setup
4.1 Dataset

We trained our SpeechBERT model on the Spoken SQuAD dataset, which contains all paragraphs as audio files while all questions remain in plain text, as in the original SQuAD dataset. It also provides SQuAD-format ASR transcripts, with 37,111 question-answer pairs in the training set and 5,351 in the testing set. The dataset is smaller than the official SQuAD dataset because Spoken SQuAD removed questions whose answers cannot be found in the ASR transcripts. The audio word boundaries are obtained by forced alignment with Kaldi, using the ground-truth text for the training set and the ASR transcripts for the testing set. To fine-tune on Spoken SQuAD, we mapped the correct start/end points of each answer span from the paragraph text in the original SQuAD training set to audio segments. The answer segments in the testing set can likewise be located using the ASR transcripts provided by Spoken SQuAD.
4.2 Model Settings
4.2.1 Speech Segment Encoder
For speech segment encoder pre-training, we used a bidirectional LSTM as the encoder and a unidirectional LSTM as the decoder, both with input size 39 (MFCC dimension) and hidden size 768 (BERT embedding dimension). A two-layer fully-connected network is added on top of the encoder output so that the encoder can transform the encoded information into the BERT embedding space. We trained this encoder-decoder network directly on the audio from the Spoken SQuAD training set.
4.2.2 BERT Model
We used a PyTorch implementation of BERT (https://github.com/huggingface/transformers) to build our BERT model in the 12-layer bert-base-uncased setting. For compatibility between text and speech embeddings, we did not use the WordPiece tokenizer to process the text. Instead, we randomly initialized a new embedding layer with a vocabulary counted from the dataset. The official pre-trained weights are loaded into our BERT model for all weights other than the embedding layer. To prepare the new embedding layer before joint text-and-speech training, we trained the MLM task on the text part of the Spoken SQuAD training set for three epochs. For joint text-and-speech training, we directly used the Spoken SQuAD training set, feeding both the text part and the audio part into the BERT model.
5 Experimental Results
Table 1: EM/F1 scores on the Spoken SQuAD testing set.

| Trained on plain text | GT text EM | GT text F1 | ASR trans. EM | ASR trans. F1 |
| Mnemonic Reader | 64.00 | 73.35 | 40.36 | 52.87 |
| BERT w/o WordPiece | 72.24 | 82.59 | 48.71 | 66.27 |

| Trained on audio | EM | F1 |
| w/ GT segment, w/o MLM | 47.90 | 61.97 |
The comparison between our end-to-end model and ASR + QA models is shown in Table 1. The ASR + QA models were trained and tested on ASR transcripts from the Spoken SQuAD dataset, while our SpeechBERT was trained and tested on audio files. The F1 and Exact Match (EM) scores of our model are competitive with most of the previous methods, although it still does not outperform BERT trained on ASR transcripts. Considering the difficulty faced by an end-to-end SQA model, which must handle noisy speech features and extract semantic information within a single model, the results are promising enough to show the potential of end-to-end approaches.
5.1 Ablation Studies
5.1.1 Improvement by LM pre-training
To evaluate the contribution of cross-modal language model pre-training, we fine-tuned on SQA directly from the text pre-trained weights, without the speech-text joint MLM pre-training. As expected, the F1 and EM scores drop by about 1.7 points, showing the benefit of joint MLM pre-training before fine-tuning.
5.1.2 Quality of Segmentation
We wondered whether performance is limited by the quality of the word boundaries found by forced alignment on ASR transcripts, which have a 22.73% WER as reported for Spoken SQuAD. To see whether segmentation quality is the performance bottleneck, we tested our model on the Spoken SQuAD testing set using the ground-truth-text forced alignment used at training time, which should be more accurate than forced alignment on ASR transcripts. However, performance improves by only 1.3 to 1.5 points in F1 and EM. This shows that boundary quality is not the main cause of the lower performance.
5.2 Error Analysis
5.2.1 Out-of-Vocabulary Words

Although out-of-vocabulary (OOV) words are not an issue for spoken audio, we found that OOVs in the text questions hurt performance. As mentioned above, to let SpeechBERT process cross-modal input consistently in the same units for both speech and text, we discarded the WordPiece tokenizer and used the same vocabulary set as our SpeechBERT model. However, this modification prevents the model from using WordPiece to handle named entities, which is crucial for answering correctly. To test this conjecture, we trained a BERT model with the same settings on the transcripts of the Spoken SQuAD training set, but with our new vocabulary set. Consistent with our hypothesis, the F1 and EM scores dropped by 2 to 7 points on both the Spoken SQuAD testing set and the SQuAD dev set.
5.2.2 Comparison under Different WER
Although our SpeechBERT model does not yet outperform BERT trained on ASR transcripts, we can investigate whether SpeechBERT beats BERT on questions with a higher recognition word error rate (WER). We split the questions into groups by WER and examined each group. We define the "EM score ratio" as the number of questions with EM = 0 divided by the number of questions with EM = 1; a higher ratio means that more questions are answered incorrectly relative to those answered correctly. We computed this ratio for both SpeechBERT and BERT; the results are shown in Figure 3. BERT clearly tends to have a lower ratio when WER is low and a higher ratio when WER is high, while SpeechBERT does not show this tendency and can still correctly answer questions with extremely high WER.
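The "EM score ratio" per WER group, as defined above, can be computed as in this sketch; the bucketing scheme (upper-edge buckets) is an assumption, since the paper does not specify its group boundaries.

```python
def em_ratio_by_wer(examples, bucket_edges):
    """Group questions by WER bucket and compute #(EM=0) / #(EM=1).

    examples: list of (wer, em) pairs with em in {0, 1};
    bucket_edges: ascending upper edges of WER buckets, e.g. [0.1, 0.3, 1.0].
    Returns one ratio per bucket (None if the bucket has no EM=1 question).
    """
    counts = [[0, 0] for _ in bucket_edges]   # [EM=0, EM=1] per bucket
    for wer, em in examples:
        for b, edge in enumerate(bucket_edges):
            if wer <= edge:
                counts[b][em] += 1
                break
    return [c0 / c1 if c1 else None for c0, c1 in counts]

# Toy data: (WER, EM) pairs split into two buckets.
data = [(0.05, 1), (0.05, 1), (0.08, 0), (0.5, 0), (0.5, 0), (0.6, 1)]
ratios = em_ratio_by_wer(data, [0.1, 1.0])
# low-WER bucket: 1/2 = 0.5; high-WER bucket: 2/1 = 2.0
```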
6 Discussion and Future Work
Though we achieved reasonable performance on the SQA task, there is still much room for future research. The first challenge is the use of word boundaries. Although it is reasonable to use an off-the-shelf ASR model as a segmenter in a supervised setting, it would be far more desirable if the boundaries could be provided by the end-to-end model itself. In conventional SLU tasks, it is possible to extract information from frame-level speech features for classification tasks such as slot filling. However, using frame-level speech features is an enormous challenge for an SQA model, which needs a pointer network to predict positions directly over very long frame sequences. In this work, we chose an easier setting that focuses on embedding learning and language model pre-training to solve SQA with pre-computed word boundaries. One possible way to integrate segmentation into our approach is simply to divide the audio by voice intensity. Alternatively, previous work on simultaneous speech translation has proposed algorithms to learn segmentation strategies that directly maximize the performance of the machine translation system. Joint learning of segmentation and audio embeddings that mutually enhance each other through reinforcement learning is another promising approach. Such methods could be adapted to the text-and-speech cross-modal language model pre-training of our work in the future.
The second goal for future research is cross-modal language model pre-training with few labels for the speech corpus. While paired data was used in our pre-training stage, semi-supervised or unsupervised methods could leverage much larger unpaired corpora.
7 Conclusions

In this work, we proposed an end-to-end model for spoken question answering. Our model achieves results close to the performance of cascaded ASR and QA models. It is a stepping stone towards solving QA problems by understanding content directly from speech.
- (2017) Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
- (2018) Almost-unsupervised speech recognition with close-to-zero resource based on phonetic structures learned from very small unpaired speech and text data. CoRR abs/1810.12566.
- (2018) Spoken language understanding without speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6189–6193.
- (2018) Speech2Vec: a sequence-to-sequence framework for learning word embeddings from speech. CoRR abs/1803.08976.
- (2018) Unsupervised cross-modal alignment of speech and text embedding spaces. CoRR abs/1805.07467.
- (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087.
- (2013) Distributed representations of words and phrases and their compositionality. In NIPS.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding.
- (2014) Generative adversarial networks.
- (2018) From audio to semantics: approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 720–726.
- (2017) Reinforced mnemonic reader for machine comprehension. CoRR abs/1705.02798.
- (2017) FusionNet: fusing via fully-aware attention with application to machine comprehension. arXiv preprint arXiv:1711.07341.
- (2019) SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529.
- (2019) Cross-lingual language model pretraining. CoRR abs/1901.07291.
- (2018) Spoken SQuAD: a study of mitigating the impact of speech recognition errors on listening comprehension. CoRR abs/1804.00320.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- (2019) Speech model pre-training for end-to-end spoken language understanding.
- (2014) Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 551–556.
- (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
- (2018) Improving language understanding by generative pre-training.
- (2018) Know what you don't know: unanswerable questions for SQuAD. CoRR abs/1806.03822.
- (2016) SQuAD: 100,000+ questions for machine comprehension of text. CoRR abs/1606.05250.
- (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
- (2018) Towards end-to-end spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5754–5758.
- (2017) Attention is all you need. CoRR abs/1706.03762.
- (2015) Pointer networks. arXiv e-prints.
- (2017) Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189–198.
- (2018) Segmental audio word2vec: representing utterances as sequences of vectors with applications in spoken term detection. CoRR abs/1808.02228.
- (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.