Towards Better Understanding of Spontaneous Conversations: Overcoming Automatic Speech Recognition Errors With Intent Recognition

08/21/2019 ∙ by Piotr Żelasko, et al. ∙ Poznan University of Technology

In this paper, we present a method for correcting automatic speech recognition (ASR) errors using a finite state transducer (FST) intent recognition framework. Intent recognition is a powerful technique for dialog flow management in turn-oriented, human-machine dialogs. This technique can also be very useful in the context of human-human dialogs, though it serves a different purpose: extracting key insights from conversations. We argue that currently available intent recognition techniques are not applicable to human-human dialogs due to the complex structure of turn-taking and various disfluencies encountered in spontaneous conversations, exacerbated by speech recognition errors and scarcity of domain-specific labeled data. Without efficient key insight extraction techniques, raw human-human dialog transcripts remain largely unexploited. Our contribution consists of a novel FST for intent indexing and an algorithm for fuzzy intent search over the lattice, a compact graph encoding of the ASR hypotheses. We also develop a pruning strategy to constrain the fuzziness of the FST index search. Extracted intents represent linguistic domain knowledge and help us improve (rescore) the original transcript. We compare our method with a baseline, which uses only the most likely transcript hypothesis (best path), and find an increase in the total number of recognized intents by 25%.




1 Introduction

Spoken language understanding (SLU) consists of identifying and processing the semantically meaningful parts of a dialogue, most often on the transcript produced by an automatic speech recognition (ASR) system Ward (1991). These meaningful parts are referred to as dialog acts and provide structure to the flow of conversation. Examples of dialog acts include statements, opinions, yes-no questions, backchannel utterances, response acknowledgements, etc. Stolcke et al. (2000). The recognition and classification of dialog acts alone is not sufficient for true spoken language understanding. Each dialog act can be instantiated to form an intent, which is an expression of a particular intention. Intents are simply sets of utterances which exemplify the intention of the speaking party to perform a certain action, convey or obtain information, express an opinion, etc. For instance, the intent Account Check can be expressed using examples such as "let me go over your account", "found account under your name", "i have found your account". At the same time, the intent Account Check would be an instance of the dialog act Statement. An important part of intent classification is entity recognition Nadeau and Sekine (2007). An entity is a token which can either be labeled with a proper name, or assigned to a category with well-defined semantics. The former leads to the concept of a named entity, where a token represents some real-world object, such as a location (New York), a person (Lionel Messi), or a brand (Apple). The latter can represent concepts such as money (one hundred dollars), date (first of May), duration (two hours), credit card number, etc.

1.1 Motivation

Spoken language understanding is a challenging task Furui (2002). The problems present in natural language understanding are significantly exacerbated by several factors related to the characteristics of spoken language.

Firstly, each ASR engine introduces a mixture of systematic and stochastic errors which are intrinsic to the procedure of transcription of spoken audio. The quality of transcription, as measured by the popular word error rate (WER), attains the level of 5%-15% WER for high quality ASR systems for English Povey et al. (2016); Han et al. (2017); Xiong et al. (2018); Hahm et al. (2018). The WER highly depends on the evaluation data difficulty and the speed to accuracy ratio. Importantly, errors in the transcription appear stochastically, both in audio segments which carry important semantic information, as well as in inessential parts of the conversation.

Another challenge stems from the fact that the structure of a conversation changes dramatically depending on whether a human perceives the other party as a human or as a machine. When humans are aware that the other party is a machine (as is the case in dialogue chatbot interfaces), they tend to speak in short, well-structured turns following the subject-verb-object (SVO) sentence structure with a minimal use of relative clauses Hill et al. (2015). This structure is virtually nonexistent in spontaneous speech, where speakers allow for a non-linear flow of conversation. This flow is further obscured by constant interruptions from backchannel or cross-talk utterances, repetitions, non-verbal signaling, phatic expressions, linguistic and non-linguistic fillers, restarts, and ungrammatical constructions.

A substantial, yet often neglected, difficulty stems from the fact that most SLU tasks are applied to transcript segments representing a single turn in a typical scripted conversation. However, turn-taking in unscripted human-human conversations is far more haphazard and spontaneous than in scripted dialogues or human-bot conversations. As a result, a single logical turn may span multiple ASR segments and be interwoven with micro-turns of the other party or, conversely, be part of a larger segment containing many logical turns.

ASR transcripts lack punctuation, normalization, true-casing of words, and proper segmentation into phrases as these features are not present in the conversational input Żelasko et al. (2018). These are difficult to correct as the majority of NLP algorithms have been trained and evaluated on text and not on the output of an ASR system. Thus, a simple application of vanilla NLP tools to ASR transcripts seldom produces actionable and useful results.

Furthermore, speech-based interfaces are defined by a set of dimensions, such as domain and vocabulary (retail, finance, entertainment), language (English, Spanish), application (voice search, personal assistant, information retrieval), and environment (mobile, car, home, distant speech recognition). These dimensions make it very challenging to provide low-cost domain adaptation.

Last but not least, production ASR systems impose strict constraints on the additional computation that can be performed. Since we operate in a near real-time environment, this precludes the use of computationally expensive language models which could compensate for some of the ASR errors.

1.2 Contribution

We identify the following as the key contributions of this paper:

  • A discussion of intent recognition in human-human conversations. While significant effort is being directed into human-machine conversation research, most of it is not directly applicable to human-human conversations. We highlight the issues frequently encountered in NLP applications dealing with the latter, and propose a framework for intent recognition aimed at addressing such problems.

  • A novel FST intent index construction with a dedicated pruning algorithm, which allows fuzzy intent matching on lattices. To the best of our knowledge, this is the first work offering an algorithm which performs a fuzzy search of intent phrases in an ASR lattice, as opposed to a linear string. We build on the well-studied FST framework, using composition and sigma-matchers to enable fuzzy matching, and extend it with our own pruning algorithm to make the fuzzy matching behavior correct. We supply the method with several heuristics to select the new best path through the lattice and we confirm their usefulness empirically. Finally, we ensure that the algorithm is efficient and can be used in a real-time processing regime.

  • Domain adaptation of an ASR system in spite of data scarcity issues. Generic ASR systems tend to be lackluster when confronted with specialized jargon, often very specific to a single domain (e.g., healthcare services). Creating a new ASR model for each domain is often impractical due to limited in-domain data availability or long training times. Our method improves the speech recognition, without the need for any re-training, by improving the recognition recall of the anticipated intents, which are the key insight sources in these conversations.

2 Related Work

2.1 Domain Knowledge Modeling for Machine Learning

The power of some of the best conversational assistants lies in domain-dependent human knowledge. Amazon's Alexa is improving with the user-generated data it gathers Kumar et al. (2017). Some of the most common human knowledge base structures used in NLP are word lists such as dictionaries for ASR Bach et al. (2007), sentiment lexicons Augustyniak et al. (2016), and knowledge graphs such as WordNet Maziarz et al. (2016); Miller (1995) and ConceptNet Speer and Havasi (2012). Conceptually, our work is similar to Velikovich et al. (2018); however, they do not allow for fuzzy search through the lattice.

2.2 Word Confusion Networks

Figure 1: Word confusion network for the utterance "just a nonstop flight". ε indicates an empty transition.

A word confusion network (WCN) Mangu et al. (2000) is a highly compact graph representation of possible confusion cases between words in the ASR lattice. The nodes in the network represent words and are weighted by the word's confidence score or its a posteriori probability. Two nodes (words) are connected when they appear at close time points and share a similar pronunciation, which gives reason to suspect that they might be confused in recognition Stolcke (2002); Hakkani-Tur and Riccardi (2003). A WCN may contain empty transitions which introduce paths through the graph that skip a particular word and its alternatives. An example of a WCN is presented in Figure 1. Note that this seemingly small lattice encodes 46 080 possible transcription variants.
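To make the combinatorics concrete, the following minimal sketch treats a WCN as a sequence of slots, each holding competing words with posteriors. The slot contents are invented for illustration and are not those of Figure 1; the `<eps>` token is an assumed stand-in for the empty transition. The number of paths is the product of the slot sizes, which is why even short networks encode many variants.

```python
from math import prod

# Toy WCN: one dict per slot, mapping word -> posterior probability.
# All words and probabilities are illustrative assumptions.
wcn = [
    {"just": 0.7, "trust": 0.2, "<eps>": 0.1},
    {"a": 0.6, "the": 0.3, "<eps>": 0.1},
    {"nonstop": 0.8, "non": 0.2},
    {"flight": 0.9, "fight": 0.1},
]


def count_paths(network):
    """Number of distinct paths: one choice per slot, so sizes multiply."""
    return prod(len(slot) for slot in network)


def best_path(network):
    """Pick the most probable word in every slot and drop empty transitions."""
    words = [max(slot, key=slot.get) for slot in network]
    return [w for w in words if w != "<eps>"]


print(count_paths(wcn))
print(best_path(wcn))
```

The best path computed this way corresponds to the baseline that later sections compare against: it ignores every alternative word, including those that would complete an intent example.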

Various language understanding tasks have been improved in recent years using WCNs: language model learning Gretter and Riccardi (2001), ASR improvements Tur et al. (2002); Hakkani-Tür et al. (2006); Ogawa et al. (2012), classification Cortes et al. (2003); Masumura et al. (2018), word spotting Hori et al. (2007); Zhang et al. (2007), voice search Feng and Bangalore (2009), dialog state tracking Jagfeld and Vu (2017) and named entity extraction Hakkani-Tür et al. (2006); Kurata et al. (2012); Hakkani-Tür et al. (2014). Stiefel and Vu (2017) modified the WCN approach to include part-of-speech information in order to achieve an improvement in semantic quality of recognized speech.

2.3 Finite State Transducers

The finite state transducer (FST) Roche and Schabes (1997); Mohri (2004) is a finite state machine with two memory tapes that maps input symbols to output symbols as it reads from the input tape and writes to the output tape. FSTs are natural building blocks for systems that transform one set of symbols into another due to the robustness of various FST joining operations such as union, concatenation or composition. Composing FST1 and FST2 is performed by running an input through FST1, taking its output tape as the input tape for FST2 and returning the output of FST2 as the output of the composed FST. For a formal definition of the operation and a well-illustrated example, we refer the reader to Argueta and Chiang (2018).
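The operational description of composition above can be sketched in a few lines. This is a deliberately simplified illustration that assumes deterministic, unweighted, epsilon-free transducers stored as plain dictionaries; real toolkits build the composed machine explicitly and handle weights and epsilons. The two toy transducers and their symbols are hypothetical.

```python
def transduce(fst, symbols):
    """Run symbols through a deterministic, epsilon-free transducer.

    fst["arcs"] maps (state, input_symbol) -> (output_symbol, next_state).
    Returns the output tape, or None if the input is not accepted.
    """
    state, output = fst["start"], []
    for symbol in symbols:
        arc = fst["arcs"].get((state, symbol))
        if arc is None:
            return None
        out_symbol, state = arc
        output.append(out_symbol)
    return output if state in fst["finals"] else None


def compose_apply(fst1, fst2, symbols):
    """Feed the output tape of fst1 to fst2, as in the description above."""
    middle = transduce(fst1, symbols)
    return None if middle is None else transduce(fst2, middle)


# Hypothetical example: normalise spelling, then tag acknowledgements.
normalizer = {"start": 0, "finals": {0},
              "arcs": {(0, "okay"): ("ok", 0), (0, "yes"): ("yes", 0)}}
tagger = {"start": 0, "finals": {0},
          "arcs": {(0, "ok"): ("ACK", 0), (0, "yes"): ("ACK", 0)}}

print(compose_apply(normalizer, tagger, ["okay", "yes"]))
```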

Finite state transducers have been widely used in speech recognition Lehr and Shafran (2011); Mohri et al. (2002); Moore et al. (2006), named entity recognition Friburger and Maurel (2004); Gaio and Moncla (2017), morpho-syntactic tagging Roche and Schabes (1995); Forsberg and Hulden (2016); Moeller et al. (2018) or language generation Goyal et al. (2016).

3 Methods

3.1 Automatic Speech Recognition

To transcribe the conversations we use an ASR system built using the Kaldi toolkit Povey et al. (2011) with a TDNN-LSTM acoustic model trained with lattice-free maximum mutual information (LF-MMI) criterion Povey et al. (2016) and a 3-gram language model for utterance decoding. The ASR lattice is converted to a word confusion network (WCN) using minimum Bayes risk (MBR) decoding Xu et al. (2011).

3.2 Domain Knowledge Acquisition - Intent Definition and Discovery

While an in-depth description of tools used in the intent definition process is beyond the scope of this paper, we provide a brief overview to underline the application potential of our algorithm when combined with a sufficient body of domain knowledge. First, let us formalize the notion of intents and intent examples. An intent example is a sequence of words which conveys a particular meaning, e.g., "let me go over your account" or "this is outrageous". An intent is a collection of intent examples conveying a similar meaning, which can be labeled with an intelligible and short description helpful in understanding the conversation.

Some of the intents that we find useful include customer requests (Refund, Purchase Intent), desired actions by the agent (Up-selling, Order Confirmation) or compliance and customer satisfaction risks (Customer Service Complaint, Supervisor Escalation). Defining all examples by hand would be prohibitively expensive and cause intents to have limited recall and precision, as, by virtue of combinatorial complexity of language, each intent needs hundreds of examples. To alleviate this problem we provide annotators with a set of tools, including: fast transcript annotation user interface for initial discovery of intents; an interactive system for semi-automatic generation of examples which recommends synonyms and matches examples on existing transcripts for validation; unsupervised and semi-supervised methods based on sentence similarity and grammatical pattern search for new intent and examples discovery.

In addition, we extend the notion of an example with two concepts that improve the recall of a single example:

  • Blank quota, that defines the number of words that may be found in-between the words of the example and still be acceptable, e.g., "this is very outrageous" becomes a potential match for "this is outrageous" if the blank quota is greater than 0. This allows the annotator to focus on the words that convey the meaning of the phrase and ignore potential filler words.

  • Entity templating, which allows examples to incorporate entities in their definitions. With entity templating, an example "your flight departs __SYSTEM_TIME__" would match both "your flight departs in ten minutes" and "your flight departs tomorrow at seven forty five p m". This relieves the annotator from enumerating millions of possible examples for each entity and facilitates the creation of more specific examples that increase precision. To illustrate, "your item number is" could incorrectly match "your item number is wrong", but "your item number is __SYSTEM_NUMBER__" would not.
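The two extensions can be illustrated on a plain token sequence (the lattice-based version is described in Section 3.3). The sketch below is an assumption-laden simplification: the function name `match_at` is hypothetical, the blank quota is interpreted as a total budget of filler words for the whole example, and an entity is approximated by an explicit set of token tuples rather than a grammar.

```python
def match_at(example, tokens, start, quota, entities):
    """Return the end index if `example` matches `tokens` from `start`, else None.

    example:  list of words and entity placeholders
    quota:    total number of filler words tolerated between example words
    entities: placeholder -> set of token tuples (a stand-in for a grammar)
    """
    def rec(ei, ti, blanks_left):
        if ei == len(example):           # whole example consumed
            return ti
        if ti >= len(tokens):            # ran out of input
            return None
        symbol = example[ei]
        if symbol in entities:           # entity templating
            for value in entities[symbol]:
                if tuple(tokens[ti:ti + len(value)]) == value:
                    end = rec(ei + 1, ti + len(value), blanks_left)
                    if end is not None:
                        return end
        elif tokens[ti] == symbol:       # ordinary word
            end = rec(ei + 1, ti + 1, blanks_left)
            if end is not None:
                return end
        if ei > 0 and blanks_left > 0:   # spend one unit of blank quota
            return rec(ei, ti + 1, blanks_left - 1)
        return None

    return rec(0, start, quota)


numbers = {"__SYSTEM_NUMBER__": {("seven",), ("forty", "five")}}
outrage = "this is outrageous".split()
item = "your item number is __SYSTEM_NUMBER__".split()

print(match_at(outrage, "this is very outrageous".split(), 0, 1, {}))
print(match_at(item, "your item number is forty five".split(), 0, 0, numbers))
```

With a quota of 0 the first call would fail, and "your item number is wrong" is rejected by the templated example because "wrong" is not a value of the placeholder.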

The above methods allow the annotators to create hundreds of intents efficiently, with thousands of examples, allowing millions of distinct potential phrases to be matched. When combined with the ability for customers to configure entities and select a subset of intents that are relevant to their business, this approach produces highly customer-specific repositories of domain knowledge.

3.3 Lattice Rescoring Algorithm

The lattice is an acceptor, where each arc contains a symbol representing a single word in the current hypothesis (see Figure 2). We employ a closed ASR vocabulary assumption and operate on a word-level, rather than character- or phoneme-level, FST. Note that this assumption is not a limitation of our method. Should the ASR have an unlimited vocabulary (as some end-to-end ASR systems do), it is possible to dynamically construct the lattice symbol table and merge it with the symbol table of intent definitions.

Figure 2: Word confusion network representing the lattice.

To perform intent annotation (i.e., to recognize and mark the position of intent instances in the transcript), we first create the FST index of all intent examples. This index is a transducer which maps the alphabet of words (input symbols) onto the alphabet of intents (output symbols). We construct the index in such a way that its composition with the lattice results in another transducer representing the annotated lattice.

Figure 3: An index mapping a single intent example, "tickets for weekend", to intent number 111.

We begin by creating a single FST state which serves as both the initial and the final state and contains a single loop wildcard arc. A wildcard arc accepts any input symbol and transduces it to an empty output symbol. The wildcard arc can be efficiently implemented with special σ-matchers, available in the OpenFST framework Allauzen et al. (2007). Composition with this singleton FST maps every input symbol in the lattice to ε, which denotes the lack of intent annotations. For each intent example, we construct an additional branch (i.e., a set of paths) in the index which maps multiple sequences of words to a set of symbols representing this particular intent example (see Figure 3).

We use three types of symbols: an intent symbol (including begin, continuation and end symbols), an entity symbol (including an entity placeholder symbol), and a null symbol ε.

The intent symbol is the delimiter of the intent annotation and it demarcates the words constituting the intent. The begin and continuation symbols are mapped onto arcs with words as input symbols, and the end symbol is inserted in an additional arc with an ε input symbol after the annotated fragment of text. It is important that the begin symbol does not introduce an extra ε input. Otherwise, the FST composition is inefficient, as it tries to enter this path on every arc in the lattice.
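The symbol layout described above can be visualised as a sequence of (input, output) pairs along one path of the annotated lattice. The sketch below is purely illustrative: the tag spellings `B-`, `C-`, `E-` and the `<eps>` token are assumed notation, and the function `annotate` is hypothetical. It shows that begin and continuation symbols ride on word arcs, while the end symbol sits on an extra arc with an empty input.

```python
EPS = "<eps>"  # assumed spelling of the null symbol


def annotate(tokens, start, end, intent_id):
    """Pair every word with an output symbol for an intent spanning [start, end)."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError("intent span must be non-empty and inside the tokens")
    pairs = [(word, EPS) for word in tokens[:start]]     # outside the intent
    for offset, word in enumerate(tokens[start:end]):
        tag = "B" if offset == 0 else "C"                # begin, then continuation
        pairs.append((word, f"{tag}-{intent_id}"))
    pairs.append((EPS, f"E-{intent_id}"))                # end symbol on an eps-input arc
    pairs += [(word, EPS) for word in tokens[end:]]
    return pairs


for pair in annotate("two tickets for weekend please".split(), 1, 4, 111):
    print(pair)
```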

Figure 4: A simple grammar FST for the non-terminal token __TIME__. Note that both states 2 and 4 are final (indicated by the double circle).

The entity symbol marks the presence of an entity in the intent annotation. Each entity in the intent index is constructed as a non-terminal entity placeholder, which allows using the FST lazy replacement algorithm to enter a separate FST grammar describing a set of possible values for a given entity. We use such a grammar transducer when it is possible to provide a comprehensive list of entity instances. Otherwise, we provide an approximation of this list by running a named entity recognition model on an n-best list (see Figure 4). Finally, the null symbol ε means that either no intent was recognized, or the word spanned by the intent annotation did not contribute to the annotation itself.

Figure 5: The intent index which matches three different intent examples: "cancel account please" with a blank quota of 1, "i apologize" with the synonym "am sorry", and "tickets __SYSTEM_TIME__", where the last token is a special non-terminal token, replaced dynamically during composition with an appropriate grammar FST.

Figure 6: Annotated lattice resulting from composition with replacement using the intent index, before (a) and after (b) pruning. Note that the last word "man" was rescored as "may" in the path due to the recognition of an annotated intent.

This procedure successfully performs exact matching of the transcription to the intent index, when all words present in the current lattice path are also found in the intent example. Unfortunately, this approach is highly impractical when real transcriptions are considered. Sequences of significant words are interwoven with filler phonemes and words; for instance, the utterance "I want uhm to order like um three yyh three tickets" could not be matched with the intent example "I want to order __NUMBER__ tickets".

To overcome this limitation, we adapt the intent index to enable fuzzy matching, so that some number of filler words can be inserted between the words constituting an intent example while still matching the intent annotation. We add wildcard arcs between each of the intent-matching words, which gives the index the capacity to match arbitrary words of the alphabet in those positions. An example of such an index is shown in Figure 5.

The naive implementation allowing for superfluous (non-intent) words appearing between intent-matching words would lead to a significant explosion of the annotation spans. Instead, we employ a post-processing filtering step to prune lattice paths where the number of allowed non-intent words is exceeded. Our filtering step has a computational complexity of O(S + A), where S is the number of states and A is the number of arcs in the non-pruned annotated lattice.

The pruning algorithm is based on the depth-first search (DFS) traversal of the lattice and marks each state in the lattice as either new, visited, or pruned. Only new states are entered and each state is entered at most once. The FST arcs are only marked as either visited or pruned. Each FST state keeps track of whether an intent annotation parsing has begun (i.e., a begin symbol has been encountered but the end symbol has not appeared yet) and how many wildcard words have been matched so far.

The traversal of the lattice is stateful. It starts in a non-matching state and remains in this state until encountering the intent begin symbol. Then the state becomes matching and remains such until encountering the intent end symbol. A state is marked as pruned when the traversal state is matching and the number of wildcard words exceeds the blank quota for the given intent example. Any arc incident with a pruned state is not entered during further traversal, leading to a significant speed-up of lattice processing. After every possible path in the lattice has been either traversed or pruned, all redundant (i.e., pruned or not visited) FST states are removed from the lattice, along with all incident arcs.
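The traversal described above can be sketched as an iterative DFS. This is a simplified illustration rather than the production algorithm: the arc labels `"B"`, `"C"`, `"E"` (begin, continuation, end) and `None` (wildcard or unannotated word) are an assumed encoding, a single global quota replaces per-example quotas, and each state is assumed to have one well-defined matching context, so the context recorded on first entry is reused.

```python
def prune(arcs, start, quota):
    """Return the set of states kept after quota-based pruning.

    arcs: state -> list of (destination, label) with label in
          {"B", "C", "E", None}.
    """
    status = {start: "visited"}      # states absent from the dict are "new"
    context = {start: (False, 0)}    # (inside an intent?, wildcard words so far)
    stack = [start]
    while stack:
        state = stack.pop()
        matching, blanks = context[state]
        for dst, label in arcs.get(state, []):
            if dst in status:        # only new states are entered, at most once
                continue
            if label == "B":
                nxt = (True, 0)
            elif label == "E":
                nxt = (False, 0)
            elif label == "C":
                nxt = (matching, blanks)
            else:                    # wildcard word: counts only inside an intent
                nxt = (matching, blanks + 1 if matching else 0)
            if nxt[0] and nxt[1] > quota:
                status[dst] = "pruned"   # nothing behind this state is explored
                continue
            status[dst] = "visited"
            context[dst] = nxt
            stack.append(dst)
    return {s for s, mark in status.items() if mark == "visited"}


# Toy annotated lattice for "cancel (my) (old) account": the example word
# "account" can be reached directly, after one filler, or after two fillers.
toy = {
    0: [(1, "B")],
    1: [(2, "C"), (3, None)],
    3: [(2, "C"), (4, None)],
    4: [(2, "C")],
    2: [(5, "E")],
}
print(sorted(prune(toy, 0, quota=1)))
```

Because every state is pushed at most once and every arc is inspected at most once, the sketch runs in time linear in the number of states and arcs.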

Intent | Original text | Rescored text
Website Mention (with entity Brand) | just use your regular acme (ok) that calm pay with credit card | just use your regular acme (ok) dot com pay with credit card
Question: Account Lookup | can you looked at my count | can you look at my account
Question: Account Information | i need to recount (mhm sure) number or email addresses | i need your account (mhm sure) number or email address
Call Opening | think of a collie level | thank you for calling
End of Hold | thank you for your patients | thank you for your patience
Refund | work connie the refined | work on the refund
Table 1: Examples of successful lattice rescoring along with the recognized intent. In the first example, the real brand name was replaced with the fictional brand name ACME. Words in parentheses are turns of another speaker.

The annotated lattice is obtained after a final traversal of the lattice which prunes arcs representing unmatched word alternatives. If no intent has been matched on any of the parallel arcs, the traversal retains only the best path hypothesis. Figure 6 presents an example of the annotated lattice before and after pruning.

3.4 Parsing the Annotated Lattice

Despite the significant pruning described in the previous section, the annotated lattice still contains competing variants of the transcript. The next step consists of selecting the "best" variant and traversing all paths in the annotated lattice which correspond to this transcript. The key concept of our method is to guide the selection of the "best" variant by intents rather than word probabilities. We observe that the likelihood of a particular longer sequence of words in the language is lower than the likelihood of a particular shorter sequence of words. Since a priori longer intent examples are less likely to appear in the lattice purely by chance, the presence of a lattice path containing a longer intent example provides strong evidence for that path.

The complete set of heuristics applied sequentially to the annotated lattice in search of the best path is the following: (a) select the path with the longest intent annotation; (b) select the path with the largest number of intent annotations; (c) select the path with the intent annotation with the longest span (i.e., also counting blank words); (d) select the path with the highest original ASR likelihood. The chosen best path is composed with the annotated lattice to produce the annotated lattice with the final variant of the transcript. The output intent annotations are retrieved by traversing every path in the resulting lattice.
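One way to read "applied sequentially" is as successive tie-breakers, which amounts to a lexicographic comparison. The sketch below encodes that reading as a sort key; this interpretation, the data classes, and the toy paths are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    n_words: int   # words of the intent example that were matched
    span: int      # matched words plus blank (filler) words


@dataclass
class Path:
    words: list
    log_likelihood: float          # original ASR score of the path
    intents: list = field(default_factory=list)


def rank_key(path):
    """Heuristics (a)-(d) as a tuple; later entries only break earlier ties."""
    longest = max((i.n_words for i in path.intents), default=0)   # (a)
    widest = max((i.span for i in path.intents), default=0)       # (c)
    return (longest, len(path.intents), widest, path.log_likelihood)


def select_best(paths):
    return max(paths, key=rank_key)


asr_best = Path("thank you for your patients".split(), -1.0)
with_intent = Path("thank you for your patience".split(), -2.0, [Intent(5, 5)])
print(" ".join(select_best([asr_best, with_intent]).words))
```

In this toy case the path with the intent wins even though its ASR likelihood is lower, which is the behaviour the heuristics are designed to produce.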

3.5 Lattice Concatenation

As hinted in Section 1, most NLP tasks are performed on the turn level, which naturally corresponds to the turn-taking patterns in a typical human-machine dialogue. This approach yields good results for chatbot conversational interfaces or information retrieval systems, but for spontaneous human-human dialogues, the demarcation of turns is much more difficult due to the presence of fillers, interjections, ellipsis, backchannels, etc. Thus, we cannot expect that intent examples would align with ASR segments which capture a single speaker turn. We address this issue by concatenating turn-level lattices of all utterances of a person throughout the conversation into a conversation-lattice. This lattice can still be effectively annotated and pruned using the algorithms presented in Section 3.3 to obtain the annotated conversation-lattice.

Unfortunately, the annotated conversation-lattice cannot be parsed in search of the best path using the algorithm presented in Section 3.4, because the computational cost of traversing every path in it is exponential in the number of words. Fortunately, we can exploit the structure of the conversation-lattice to identify the best path. We observe that it is a sequence of segments organized either in series or in parallel. Segments with no intent annotations are series of linear word hypotheses, which branch to parallel word hypotheses whenever an intent annotation is matched (because the original path with no intent annotation is retained in the lattice). The parallel segment ends with the end of the intent annotation. These series and parallel segments can be detected by inspecting the cumulative sum of the difference of out-degree and in-degree of each state in a topologically sorted conversation-lattice. For series regions, this sum will be equal to 0, and greater than 0 in parallel regions. The computational cost of performing this segmentation is O(S + A), i.e., linear in the number of states and arcs in the annotated conversation-lattice. After having performed the segmentation, the partial best path search in parallel segments is resolved using the method presented in Section 3.4.
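The degree-based segmentation can be sketched as follows. The sketch assumes that states are already numbered in topological order and, so that linear regions sum to exactly zero, it counts one virtual arc into the initial state and one out of the final state; that bookkeeping convention is an assumption of this illustration, as is the function name `segment`.

```python
from itertools import accumulate


def segment(n_states, arcs):
    """Label each state 'series' or 'parallel'.

    States 0..n_states-1 are assumed to be in topological order;
    arcs is a list of (source, destination) pairs.
    """
    out_deg = [0] * n_states
    in_deg = [0] * n_states
    for src, dst in arcs:
        out_deg[src] += 1
        in_deg[dst] += 1
    in_deg[0] += 1               # virtual arc entering the initial state
    out_deg[n_states - 1] += 1   # virtual arc leaving the final state
    balance = accumulate(o - i for o, i in zip(out_deg, in_deg))
    return ["parallel" if b > 0 else "series" for b in balance]


# Linear prefix 0-1, a branch at state 1 that merges at state 4, linear suffix 4-5.
toy_arcs = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(segment(6, toy_arcs))
```

The running balance rises at the state where alternatives branch off and returns to zero at the state where they merge, so a single pass over states and arcs is enough to delimit the regions in which the best path search of Section 3.4 has to be run.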

4 Experimental Results

In this section, we present a quantitative analysis of the proposed algorithm. The baseline algorithm annotates only the best ASR hypothesis. We perform the experiments with an intent library comprising 313 intents in total, each of which is expressed using 169 examples on average. The annotations are performed on more than 70 000 US English phone conversations with an average duration of 11 minutes, although some last over an hour. The topics of these conversations span several domains, such as inquiries for account information or instructions, refund requests or service cancellations. Each domain uses a relevant subset of the intent library (typically 100-150 intents are active).

To evaluate the effectiveness of the proposed algorithm, we sampled a dataset of 500 rescored intent annotations found in the lattices in the cancellations and refunds domain. The correctness of the rescoring was judged by two annotators, who labeled 250 examples each. The annotators read the whole conversation transcript and listened to the recording to establish whether the rescoring was meaningful. In cases when a rescored word was technically incorrect (e.g., mistaken tense of a verb), but the rescoring led to the recognition of the correct intent, we labeled the intent annotation as correct. The results are shown in Table 2. Please note that every result above 50% indicates an improvement over the ASR best path recognition, since we correct more ASR errors than we introduce new mistakes.

Intent length | Occurrences | Accuracy [%]
1 | 25 | 32.0
2 | 139 | 39.5
3 | 149 | 63.7
4 | 80 | 76.2
5 | 53 | 94.3
6 | 19 | 94.7
7+ | 35 | 100.0
Table 2: Rescoring accuracy w.r.t. intent length measured on an annotated dataset of 500 rescored intents.

The results confirm our assumptions presented in Section 3.4. The longer the intent annotation, the more likely it is to be correct due to stronger contextuality of the annotation. Intent annotations which span at least three words are more likely to rescore the lattice correctly than to introduce a false positive. These results also lead us to a practical heuristic: an intent annotation which spans only one or two words should not be considered for rescoring. Application of this heuristic results in an estimated accuracy of 77%. We use this heuristic in further experiments. A stricter heuristic would require a span of at least four words, with an accuracy of 87.7%. Calibration of this threshold is helpful when the algorithm is adapted to a downstream task, where a different precision/recall ratio may be required. We present some examples of successful lattice rescoring in Table 1.
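The two accuracy estimates quoted above follow from Table 2 as occurrence-weighted averages over the rows that pass the length threshold. The short sketch below reproduces that arithmetic; the row for "7+" is stored under length 7, and the function name `accuracy_at` is introduced here for illustration.

```python
# (intent length, occurrences, accuracy in %) copied from Table 2.
table2 = [
    (1, 25, 32.0),
    (2, 139, 39.5),
    (3, 149, 63.7),
    (4, 80, 76.2),
    (5, 53, 94.3),
    (6, 19, 94.7),
    (7, 35, 100.0),   # the "7+" row
]


def accuracy_at(min_length):
    """Occurrence-weighted accuracy over intents of at least min_length words."""
    rows = [(n, acc) for length, n, acc in table2 if length >= min_length]
    total = sum(n for n, _ in rows)
    return sum(n * acc for n, acc in rows) / total


for threshold in (1, 3, 4):
    print(threshold, round(accuracy_at(threshold), 1))
```

Raising the threshold trades coverage for precision: the three-word threshold keeps 336 of the 500 sampled annotations, the four-word threshold only 187.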


The proposed algorithm finds 658 549 intents in all conversations, covering 4.1% of all (62 450 768) words, whereas the baseline algorithm finds 526 356 intents, covering 3.3% of all words. Therefore, the method increases intent recognition by 25.1% while rescoring 8.3% of all annotated words (0.34% of all words). Individual intents achieve different improvements, ranging from no improvement up to 1062%; ranked percentile results are presented in Table 3. We see that half of the intents improve by at least 35.7%, while 20% of all intents improve by at least 83.5%.
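The headline figures in this paragraph can be re-derived from the reported counts. The variable names below are introduced only for this check.

```python
proposed_intents = 658_549
baseline_intents = 526_356

# Relative increase in the number of recognized intents, in percent.
relative_increase = (proposed_intents - baseline_intents) / baseline_intents * 100

annotated_share = 4.1   # % of all words covered by intent annotations
rescored_share = 8.3    # % of annotated words that were rescored

# Share of all words that were rescored, in percent.
rescored_of_all = annotated_share * rescored_share / 100

print(f"{relative_increase:.1f}% more intents")
print(f"{rescored_of_all:.2f}% of all words rescored")
```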

Intent classes [%] | Min. improvement [%]
10 | 128.9
20 | 83.5
30 | 62.4
40 | 49.4
50 | 35.7
60 | 28.7
70 | 21.6
80 | 13.7
90 | 2.0
Table 3: Ranked percentiles of improvement in intent recognition. The improvement is determined for each intent class individually. Intent classes are sorted and binned into percentiles; for each bin we report the minimum improvement for intents in the bin.

5 Conclusions

A commonly known limitation of current ASR systems is their inability to recognize long sequences of words precisely. In this paper, we propose a new method of incorporating domain knowledge into automatic speech recognition which alleviates this weakness. Our approach allows performing fast ASR domain adaptation by providing a library of intent examples used for lattice rescoring. The method guides the best lattice path selection process by increasing the probability of intent recognition. At the same time, the method does not rescore paths of unessential turns which do not contain intent examples. As a result, our approach improves the understanding of spontaneous conversations by recognizing semantically important transcription segments while adding minimal computational overhead. Our method is domain agnostic and can be easily adapted to a new domain by providing the library of intent examples expected to appear in it. The increased intent annotation coverage allows us to train more sophisticated models for downstream tasks, opening the prospects of true spoken language understanding.


  • C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri (2007) OpenFst: a general and efficient weighted finite-state transducer library. In International Conference on Implementation and Application of Automata, pp. 11–23. Cited by: §3.3.
  • A. Argueta and D. Chiang (2018) Composing finite state transducers on gpus. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 2697–2705. Cited by: §2.3.
  • Ł. Augustyniak, P. Szymański, T. Kajdanowicz, and W. Tuligłowicz (2016) Comprehensive study on lexicon-based ensemble classification sentiment analysis. Entropy 18 (1). External Links: ISSN 1099-4300 Cited by: §2.1.
  • N. Bach, M. Noamany, I. R. Lane, and T. Schultz (2007) Handling oov words in arabic asr via flexible morphological constraints. In INTERSPEECH, Cited by: §2.1.
  • C. Cortes, P. Haffner, and M. Mohri (2003) Lattice kernels for spoken-dialog classification. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP’03), Vol. 1, pp. I–628. Cited by: §2.2.
  • J. Feng and S. Bangalore (2009) Effects of word confusion networks on voice search. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 238–245. Cited by: §2.2.
  • M. Forsberg and M. Hulden (2016) Learning transducer models for morphological analysis from example inflections. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pp. 42–50. Cited by: §2.3.
  • N. Friburger and D. Maurel (2004) Finite-state transducer cascades to extract named entities in texts. Theoretical Computer Science 313 (1), pp. 93–104. Cited by: §2.3.
  • S. Furui (2002) Recent progress in spontaneous speech recognition and understanding. In Proceedings of 2002 IEEE Workshop on Multimedia Signal Processing, MMSP 2002, External Links: ISBN 0780377133 Cited by: §1.1.
  • M. Gaio and L. Moncla (2017) Extended named entity recognition using finite-state transducers: an application to place names. In The Ninth International Conference on Advanced Geographic Information Systems, Applications, and Services (GEOProcessing 2017), Cited by: §2.3.
  • R. Goyal, M. Dymetman, and E. Gaussier (2016) Natural language generation through character-based rnns with finite-state prior knowledge. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1083–1092. Cited by: §2.3.
  • R. Gretter and G. Riccardi (2001) On-line learning of language models with word error probability distributions. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Vol. 1, pp. 557–560. Cited by: §2.2.
  • S. Hahm, I. Orife, S. Walker, and J. Flaks (2018) The marchex 2018 english conversational telephone speech recognition system. arXiv preprint arXiv:1811.02058. Cited by: §1.1.
  • D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tur (2006) Beyond asr 1-best: using word confusion networks in spoken language understanding. Computer Speech and Language 20 (4), pp. 495–514. External Links: ISSN 0885-2308 Cited by: §2.2.
  • D. Hakkani-Tür, A. Celikyilmaz, L. Heck, G. Tur, and G. Zweig (2014) Probabilistic enrichment of knowledge graph entities for relation detection in conversational understanding. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §2.2.
  • D. Hakkani-Tur and G. Riccardi (2003) A general algorithm for word graph matrix decomposition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP’03), Vol. 1, pp. I–I. Cited by: §2.2.
  • K. J. Han, A. Chandrashekaran, J. Kim, and I. Lane (2017) The capio 2017 conversational speech recognition system. arXiv preprint arXiv:1801.00059. Cited by: §1.1.
  • J. Hill, W. R. Ford, and I. G. Farreras (2015) Real conversations with artificial intelligence: a comparison between human–human online conversations and human–chatbot conversations. Computers in Human Behavior 49, pp. 245–250. Cited by: §1.1.
  • T. Hori, I. L. Hetherington, T. J. Hazen, and J. R. Glass (2007) Open-vocabulary spoken utterance retrieval using confusion networks. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Vol. 4, pp. IV–73. Cited by: §2.2.
  • G. Jagfeld and N. T. Vu (2017) Encoding word confusion networks with recurrent neural networks for dialog state tracking. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, pp. 10–17. Cited by: §2.2.
  • A. Kumar, A. Gupta, J. Chan, S. Tucker, B. Hoffmeister, M. Dreyer, S. Peshterliev, A. Gandhe, D. Filiminov, A. Rastrow, et al. (2017) Just ask: building an architecture for extensible self-service spoken language understanding. arXiv preprint arXiv:1711.00549. Cited by: §2.1.
  • G. Kurata, N. Itoh, M. Nishimura, A. Sethy, and B. Ramabhadran (2012) Leveraging word confusion networks for named entity modeling and detection from conversational telephone speech. Speech Communication 54 (3), pp. 491–502. External Links: ISSN 0167-6393 Cited by: §2.2.
  • M. Lehr and I. Shafran (2011) Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 19 (5), pp. 1360–1367. Cited by: §2.3.
  • L. Mangu, E. Brill, and A. Stolcke (2000) Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech & Language 14 (4), pp. 373–400. Cited by: §2.2.
  • R. Masumura, Y. Ijima, T. Asami, H. Masataki, and R. Higashinaka (2018) Neural confnet classification: fully neural network based spoken utterance classification using word confusion networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6039–6043. Cited by: §2.2.
  • M. Maziarz, M. Piasecki, E. Rudnicka, S. Szpakowicz, and P. Kędzia (2016) PlWordNet 3.0 – a Comprehensive Lexical-Semantic Resource. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, N. Calzolari, Y. Matsumoto, and R. Prasad (Eds.), pp. 2259–2268. Cited by: §2.1.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38, pp. 39–41. Cited by: §2.1.
  • S. Moeller, G. Kazeminejad, A. Cowell, and M. Hulden (2018) A neural morphological analyzer for arapaho verbs learned from a finite state transducer. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pp. 12–20. Cited by: §2.3.
  • M. Mohri, F. Pereira, and M. Riley (2002) Weighted finite-state transducers in speech recognition. Computer Speech & Language 16 (1), pp. 69–88. Cited by: §2.3.
  • M. Mohri (2004) Weighted finite-state transducer algorithms. an overview. In Formal Languages and Applications, pp. 551–563. Cited by: §2.3.
  • D. Moore, J. Dines, M. M. Doss, J. Vepa, O. Cheng, and T. Hain (2006) Juicer: a weighted finite-state transducer speech decoder. In International Workshop on Machine Learning for Multimodal Interaction, pp. 285–296. Cited by: §2.3.
  • D. Nadeau and S. Sekine (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (1), pp. 3–26. Cited by: §1.
  • A. Ogawa, T. Hori, and A. Nakamura (2012) Error type classification and word accuracy estimation using alignment features from word confusion network. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4925–4928. External Links: ISSN 2379-190X Cited by: §2.2.
  • D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding, Cited by: §3.1.
  • D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur (2016) Purely sequence-trained neural networks for asr based on lattice-free mmi. In Interspeech, pp. 2751–2755. Cited by: §1.1, §3.1.
  • E. Roche and Y. Schabes (1995) Deterministic part-of-speech tagging with finite-state transducers. Computational linguistics 21 (2), pp. 227–253. Cited by: §2.3.
  • E. Roche and Y. Schabes (1997) Finite-state language processing. MIT press. Cited by: §2.3.
  • R. Speer and C. Havasi (2012) Representing general relational knowledge in conceptnet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Cited by: §2.1.
  • M. Stiefel and N. T. Vu (2017) Enriching asr lattices with pos tags for dependency parsing. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, pp. 37–47. Cited by: §2.2.
  • A. Stolcke (2002) SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, September, pp. 16–20. Cited by: §2.2.
  • A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, and M. Meteer (2000) Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics. External Links: 0006023, ISBN 089120100561737, ISSN 0891-2017 Cited by: §1.
  • G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür (2002) Improving spoken language understanding using word confusion networks. In Seventh International Conference on Spoken Language Processing, Cited by: §2.2.
  • L. Velikovich, I. Williams, J. Scheiner, P. Aleksic, P. Moreno, and M. Riley (2018) Semantic lattice processing in contextual automatic speech recognition for google assistant. pp. 2222–2226. Cited by: §2.1.
  • W. Ward (1991) Understanding spontaneous speech: the phoenix system. In [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, pp. 365–367. Cited by: §1.
  • W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke (2018) The microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5934–5938. Cited by: §1.1.
  • H. Xu, D. Povey, L. Mangu, and J. Zhu (2011) Minimum bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language 25 (4), pp. 802–828. Cited by: §3.1.
  • P. Żelasko, P. Szymański, J. Mizgajski, A. Szymczak, Y. Carmiel, and N. Dehak (2018) Punctuation prediction model for conversational speech. Proc. Interspeech 2018, pp. 2633–2637. Cited by: §1.1.
  • P. Zhang, J. Shao, Q. Zhao, and Y. Yan (2007) Keyword spotting based on syllable confusion network. In Third International Conference on Natural Computation (ICNC 2007), Vol. 2, pp. 656–659. Cited by: §2.2.