Chatbots have long marked the apex of Artificial Intelligence, standing at the forefront of all major AI revolutions such as human-computer interaction, knowledge engineering, expert systems, Natural Language Processing, Natural Language Understanding, Deep Learning, and many others. Open-domain chatbots, also known as chitchat bots, can mimic human conversations to the greatest extent on topics of almost any kind, and are thus widely used for socialization, entertainment, emotional companionship, and marketing. Earlier generations of open-domain bots, such as Mitsuku (Worswick, 2019) and ELIZA (Weizenbaum, 1966), relied heavily on hand-crafted rules and recursive symbolic evaluations to capture the key elements of human-like conversation. New advances in this field are mostly data-driven, end-to-end systems based on statistical models and neural conversational models (Gao et al., 2018) that aim to achieve human-like conversations through a more scalable and adaptable learning process on free-form and large data sets (Gao et al., 2018), such as MILABOT (Serban et al., 2017), XiaoIce (Zhou et al., 2018), Replika (Fedorenko et al., 2017), Zo (Microsoft, 2019), and Meena (Adiwardana et al., 2020).
Unlike open-domain bots, closed-domain chatbots are designed to transform existing processes that rely on human agents. Their goal is to help users accomplish specific tasks, with typical examples ranging from order placement to customer support; they are therefore also known as task-oriented bots (Gao et al., 2018). Many businesses are excited about the prospect of using closed-domain chatbots to interact directly with their customer base, which comes with many benefits such as cost reduction, zero downtime, and freedom from prejudice. However, there will always be instances where a bot needs a human's input for new scenarios. This could be a customer presenting a problem the bot has never anticipated (Larson et al., 2019), an attempt to respond to an inappropriate input, or even something as simple as incorrect spelling. Under these scenarios, the expected responses from open-domain and closed-domain chatbots can be very different: a successful open-domain bot should be "knowledgeable, humorous and addictive", whereas a closed-domain chatbot ought to be "accurate, reliable and efficient". One main difference is the way unknown questions are handled. A chitchat bot would respond with an adversarial question such as "Why do you ask this?", keep the conversation going, and steer back to the topics under its coverage (Sethi, 2019). A user may find that the chatbot is outsmarting them, but it is not very helpful in solving problems. In contrast, a task-oriented bot is scoped to a specific domain of intents, and should terminate out-of-scope conversations promptly and escalate them to human agents.
This paper presents AVA (A Vanguard Assistant), a task-oriented chatbot that supports phone call agents when they interact with clients on live calls. Traditionally, when phone agents need help, they put client calls on hold and consult experts in a support group. With a chatbot, our goal is to transform the consultation process between phone agents and experts into an end-to-end conversational AI system. Our focus is to significantly reduce operating costs by reducing the call holding time and the need for experts, while transforming our client experience in a way that eventually promotes client self-provisioning in a controlled environment. Understanding intents correctly and escalating irrelevant intents promptly are key to its success.
Recently, the NLP community has made many breakthroughs in context-dependent embeddings and bidirectional language models such as ELMo, OpenAI GPT, BERT, RoBERTa, DistilBERT, XLM, and XLNet (Dai and Le, 2015; Peters et al., 2017; Devlin et al., 2019; Peters et al., 2018; Lample and Conneau, 2019; Howard and Ruder, 2018; Yang et al., 2019; Liu et al., 2019; Tang et al., 2019). In particular, the BERT model (Devlin et al., 2019) has become a new baseline for NLP tasks including sentence classification, question answering, named-entity recognition and many others. To our knowledge, there are few measures that address prediction uncertainties in these sophisticated deep learning structures, or that explain how to reach optimal decisions based on observed uncertainty measures. The off-the-shelf softmax outputs of these models are predictive probabilities, and they are not a valid measure of the confidence in a network's predictions (Ghahramani, 2016; Maddox et al., 2019; Pearce et al., 2018; Shridhar et al., 2019), which is an important concern in real-world applications (Larson et al., 2019).
Our main contribution in this paper is applying advances in Bayesian Deep Learning to quantify uncertainties in BERT intent predictions. Formal methods such as Stochastic Gradient (SG)-MCMC (Li et al., 2016; Rao and Frtunikj, 2018; Welling and Teh, 2011; Park et al., 2018; Maddox et al., 2019; Seedat and Kanan, 2019) and variational inference (Blundell et al., 2015; Ghahramani, 2016; Graves, 2011; Hernández-Lobato and Adams, 2015), which are extensively discussed in the literature, may require modifying the network. Re-implementing the entire BERT model for Bayesian inference is a non-trivial task, so here we take the Monte Carlo Dropout (MCD) approach (Ghahramani, 2016) to approximate variational inference, whereby dropout is performed at both training and test time, using multiple dropout masks. Our dropout experiments are compared with two other approaches (Entropy and Dummy-class), and the final implementation is determined by the trade-off between accuracy and efficiency.
We also investigate the use of BERT as a language model to decipher spelling errors. Most vendor-based chatbot solutions embed an additional layer of service, in which device-dependent error models and N-gram language models (Lin et al., 2012) are used for spell checking and language interpretation. At the representation layer, the Wordpiece model (Schuster and Nakajima, 2012) and the Byte-Pair-Encoding (BPE) model (Gage, 1994; Sennrich et al., 2016) are common techniques for segmenting words into smaller units, so that similarities at the sub-word level can be captured by NLP models and generalized to out-of-vocabulary (OOV) words. Our approach combines both sides: words corrected by the proposed language model are further tokenized by the Wordpiece model to match pre-trained embeddings in BERT learning.
Despite all the advances in chatbots, industries like finance and healthcare are concerned about cyber-security because of the large amount of sensitive information entered during chatbot sessions. Task-oriented bots often require access to critical internal systems and confidential data to finish specific tasks. Therefore, 100% on-premise solutions that enable full customization, monitoring, and smooth integration are preferable to cloud solutions. In this paper, the proposed chatbot is designed using the open-source version of RASA and deployed within our enterprise intranet. Using RASA's conversational design, we hybridize RASA's chitchat module with the proposed task-oriented conversational systems developed in Python, Tensorflow and Pytorch. We believe our approach can provide useful guidance for industries contemplating the adoption of chatbot solutions in their business domains.
|T1 label||T2 label||T3 label||Questions|
|Account Maintenance||Call Authentication||Type 2||Am I allowed to give the client their Social security number?|
|Account Maintenance||Call Authentication||Type 5||Do the web security questions need to be reset by the client if their web access is blocked?|
|Account Maintenance||Web Reset||Type 1||How many security questions are required to be asked to reset a client’s web security questions?|
|Account Permission||Call Authentication||Type 2||How are the web security questions used to authenticate a client?|
|Account Permission||Agent Incapacitated||Type 3||Is it possible to set up Agent Certification for an Incapacitated Person on an Individual Roth 401k?|
|TAX FAQ||Miscellaneous||What is||Do I need my social security number on the 1099MISC form?|
|Transfer of Asset||Unlike registrations||Type 2||Does the client need to provide special documentation if they want to transfer from one account to another account?|
|Transfer of Asset||Brokerage Transfer||Type 3||Is there a list of items that need to be included on a statement to transfer an account?|
|Banking||Add Owner||Type 4||Once a bank has been declined how can we authorize it?|
|Banking||Add/Change/Delete||Type 3||Does a limited agent have authorization to adjust bank info?|
|Irrelevant||-||-||How can we get into an account with only one security question?|
|Irrelevant||-||-||Am I able to use my Roth IRA to set up a margin account?|
|Irrelevant||-||-||What is the best place to learn about Vanguard’s investment philosophy?|
Recent breakthroughs in NLP research are driven by two intertwined directions. Advances in distributed representations, sparked by the success of word embeddings (Mikolov et al., 2010, 2013), character embeddings (Kim et al., 2015; dos Santos and Gatti, 2014; Dos Santos and Zadrozny, 2014), and contextualized word embeddings (Peters et al., 2018; Radford and Sutskever, 2018; Devlin et al., 2019), together with advances in neural network architectures ([…], 2008; Collobert et al., 2011), RNN (Elman, 1990), the Attention Mechanism (Bahdanau et al., 2015), and the Transformer as a seq2seq model with parallelized attention (Vaswani et al., 2017), have defined the new state-of-the-art deep learning models for NLP.
Principled uncertainty estimation in regression (Ermon, 2018; […], 2016) and classification ([…] et al., 2017) is an active area of research with a large volume of work. The theory of Bayesian neural networks (Neal, 1995; MacKay, 1992) provides the tools and techniques to understand model uncertainty, but these techniques come with significant computational costs as they double the number of parameters to be trained. Gal and Ghahramani (Ghahramani, 2016) showed that a neural network with dropout turned on at test time can be interpreted as an approximation to a deep Gaussian process, and that model uncertainty estimates can be obtained from such a network by sampling its predictions multiple times at test time. Non-Bayesian approaches have also been shown to produce reliable uncertainty estimates (B. Lakshminarayanan, 2017); our focus in this paper is on Bayesian approaches. In classification tasks, the uncertainty obtained from multiple sampling at test time is an estimate of the confidence in the predictions, similar to the entropy of the predictions. In this paper, we compare setting the threshold for escalating a query to a human operator using the model uncertainty obtained from a dropout-based chatbot against setting the threshold using the entropy of the predictions. We choose the dropout-based Bayesian approximation because, compared to other Bayesian approaches, it does not require changes to the model architecture, does not add parameters to train, and does not change the training process. We minimize noise in the data by employing spelling correction models before classifying the input. Further, the labels for the user queries are human curated with minimal error. Hence, our focus is on quantifying epistemic uncertainty in AVA rather than aleatoric uncertainty (Kendall and Gal, 2017). We use mixed-integer optimization to find a threshold for human escalation of a user query based on the mean prediction and the uncertainty of the prediction. This optimization step, once again, does not require modifications to the network architecture and can be implemented separately from model training. In other contexts, it might be fruitful to have an integrated escalation option in the neural network (Geifman, 2019), and we leave the trade-offs of an integrated reject option and non-Bayesian approaches for future work.
Besides the approaches mentioned in Section 1, a similar approach to spelling correction is reported in Deep Text Corrector (Atpaino, 2017), which applies a seq2seq model to automatically correct small grammatical errors in conversational written English. Optimal decision threshold learning under uncertainty is studied by Lepora (2016) using reinforcement learning and iterative Bayesian optimization formulations.
3 System Overview and Data Sets
3.1 Overview of the System
Figure 1 illustrates the system overview of AVA. The proposed conversational AI will gradually replace the traditional human-human interactions between phone agents and internal experts, and will eventually allow clients to interact directly with the AI system for self-provisioning. Currently, phone agents interact with the AVA chatbot deployed on Microsoft Teams in our company intranet, and their questions are preprocessed by a Sentence Completion Model (introduced in Section 6) to correct misspellings. Inputs are then classified by an intent classification model (Sections 4 and 5): relevant questions are assigned predicted intent labels, and downstream information retrieval and question answering modules are triggered to extract answers from a document repository. Irrelevant questions are escalated to human experts according to the decision thresholds optimized using the methods introduced in Section 5. This paper only discusses the Intent Classification model and the Sentence Completion model.
3.2 Data for Intent Classification Model
Training data for AVA’s intent classification model is collected, curated, and generated by a dedicated business team from interaction logs between phone agents and the expert team. The whole process took about one year to finish. In total, 22,630 questions are selected and classified into 381 intents, which compose the relevant question set for the intent classification model. Additionally, 17,395 questions are manually synthesized as irrelevant questions, and none of them belongs to any of the aforementioned 381 intents. Each relevant question is hierarchically assigned three labels from Tier 1 to Tier 3. In this hierarchy, there are 5 unique Tier-1 labels, 107 Tier-2 labels, and 381 Tier-3 labels. Our intent classification model is designed to classify relevant input questions into the 381 Tier-3 intents and then trigger downstream models to extract appropriate responses. The five Tier-1 labels and the number of questions included in each label (which sum to the 22,630 relevant questions) are: Account Maintenance (9074), Account Permissions (2961), Transfer of Assets (2838), Banking (4788), Tax FAQ (2969). At the Tier-1 level, the general business issues across intents are very different, but at the Tier-3 level, questions are quite similar to each other, and the differences lie merely in the specific responses. Irrelevant questions, compared to relevant questions, have two main characteristics:
Some questions are relevant to business intents but unsuitable to be processed by conversational AI. For example, in Table 1, the question "How can we get into an account with only one security question?" is related to Call Authentication in Account Permission, but its response needs further human diagnosis to collect more information. These types of questions should be escalated to human experts.
Out of scope questions. For example, questions like "What is the best place to learn about Vanguard’s investment philosophy?" or "What is a hippopotamus?" are totally outside the scope of our training data, but they may still occur in real world interactions.
3.3 Textual Data for Pretrained Embeddings and Sentence Completion Model
Inspired by the progress in computer vision, transfer learning has been very successful in the NLP community and has become common practice. Initializing deep neural networks with pre-trained embeddings and fine-tuning the models on task-specific data is a proven method in multi-task NLP learning. In our approach, besides applying off-the-shelf embeddings from Google BERT and XLNet, we also pre-train BERT embeddings using our company’s proprietary text to capture the special semantic meanings of words in the financial domain. Three types of textual datasets are used for embeddings training:
Sharepoint text: About 3.2 GB of corpora scraped from our company’s internal Sharepoint websites, including web pages, Word documents, PowerPoint slides, PDF documents, and notes from internal CRM systems.
Emails: About 8 GB of customer service emails.
Phone call transcriptions: We use AWS to transcribe 500K client service phone calls, and the transcription text is used for training.
All embeddings are trained in case-insensitive settings. Attention and hidden layer dropout probabilities are set to 0.1, the hidden size is 768, the numbers of attention heads and hidden layers are both set to 12, and the vocabulary size is 32000 using the SentencePiece tokenizer. On an AWS P3.2xlarge instance, each set of embeddings is trained for 1 million iterations, which takes about one week of CPU time to finish. More details about parameter selection for pre-training are available in the GitHub code. The same pre-trained embeddings are used to initialize BERT model training in intent classification, and are also used as language models in sentence completion.
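The pre-training settings stated above can be collected in one place. The sketch below does so in a Python dictionary; the key names follow the public BERT configuration-file convention and are an assumption about how such settings might be stored, not a copy of the configuration used for AVA. The helper `head_size` is hypothetical and only checks that the hidden size divides evenly across the attention heads.

```python
# Pre-training hyperparameters as stated in the text. Key names follow the
# public BERT config convention (an assumption, not the authors' file).
pretraining_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "vocab_size": 32000,
}


def head_size(config):
    """Per-head dimension; the hidden size must divide evenly across heads."""
    if config["hidden_size"] % config["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return config["hidden_size"] // config["num_attention_heads"]
```

With the values above, each of the 12 attention heads works on a 64-dimensional slice of the 768-dimensional hidden state.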
|Model||Performance|
|BERT small + Sharepoint Embeddings||0.944|
|BERT small + Google Embeddings||0.949|
|BERT large + Google Embeddings||0.954|
|XLNet Large + Google Embeddings||0.927|
|LSTM with Attention + Word2Vec||0.913|
|LSTM + Word2Vec||0.892|
|Logistic Regression + TFIDF||0.820|
|Xgboost + TFIDF||0.760|
|Naive Bayes + TFIDF||0.661|
Table 2: Comparison of intent classification performance. BERT and XLNet models were all trained for 30 epochs using batch size 16.
4 Intent Classification Performance on Relevant Questions
Using only relevant questions, we compare various popular model architectures to find the one with the best performance under 5-fold validation. Not surprisingly, BERT models generally perform much better than other models. Large BERT (24-layer, 1024-hidden, 16-heads) shows a slight improvement over small BERT (12-layer, 768-hidden, 12-heads), but is less preferred because of its computational cost. To our surprise, XLNet, a model reported to outperform BERT in multi-task NLP, performs 2 percent lower on our data.
BERT models initialized with proprietary embeddings converge faster than those initialized with off-the-shelf embeddings (Figure 2.a). Embeddings trained on the company’s Sharepoint text perform better than those built on emails and phone-call transcriptions (Figure 2.b). Using a larger batch size (32) enables models to converge faster and leads to better performance.
5 Intent Classification Performance including Irrelevant Questions
We have shown that the BERT model outperforms other models on real datasets that contain only relevant questions. The capability to handle 381 intents simultaneously at 94.5% accuracy makes it an ideal intent classifier candidate for a chatbot. This section describes how we quantify uncertainties in BERT predictions and enable the bot to detect irrelevant questions. Three approaches are compared:
Predictive-entropy: We measure the uncertainty of predictions using Shannon entropy $H_i = -\sum_{k=1}^{K} p_{ik} \log p_{ik}$, where $p_{ik}$ is the prediction probability of the $i$-th sample for the $k$-th class. Here, $p_{ik}$ is the softmax output of the BERT network (B. Lakshminarayanan, 2017). A higher predictive entropy corresponds to a greater degree of uncertainty. An optimally chosen cut-off threshold applied to the entropies should then be able to separate the majority of in-sample questions from irrelevant questions.
Drop-out: We apply Monte Carlo (MC) dropout with 100 Monte Carlo samples. At each inference iteration, a certain percentage of the units is randomly dropped out. This generates random predictions, which are interpreted as samples from a probabilistic distribution (Ghahramani, 2016). Since we do not employ regularization in our network, $\tau^{-1}$ in Eq. 7 of Gal and Ghahramani (Ghahramani, 2016) is effectively zero, and the predictive variance is equal to the sample variance of the stochastic passes. We can then investigate the distributions and interpret model uncertainty through mean probabilities and variances.
Dummy-class: We simply treat escalation questions as a dummy class to distinguish them from original questions. Unlike entropy and dropout, this approach requires retraining of BERT models on the expanded data set including dummy class questions.
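As an illustration of the predictive-entropy approach, the following minimal sketch computes the Shannon entropy of one softmax output and compares it with a cutoff. It assumes the natural logarithm; the helper `should_escalate` and the example probability vectors are hypothetical and are not taken from the AVA implementation.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (natural log) of one softmax probability vector."""
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_escalate(probs, cutoff):
    """Escalate when the entropy reaches the cutoff (hypothetical helper)."""
    return predictive_entropy(probs) >= cutoff


# Toy 4-class outputs: one peaked prediction and one uniform prediction.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
```

A uniform prediction over $K$ classes attains the maximum entropy $\log K$, so with 381 intents the entropy is bounded above by $\log 381 \approx 5.94$ under this convention.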
5.1 Experimental Setup
All results mentioned in this section are obtained using BERT small + Sharepoint embeddings (batch size 16). In the Entropy and Dropout approaches, both relevant and irrelevant questions are split into five folds, and four folds (80%) of the relevant questions are used to train the BERT model. The 20% held-out relevant questions are then split again into five folds: 80% of them (equal to 16% of the entire relevant question set) are combined with four folds of irrelevant questions to learn the optimal decision variables. The learned decision variables are applied to BERT predictions on the remaining 20% (906) of the held-out relevant questions and on the held-out irrelevant questions (4000) to obtain the test performance. In the Dummy-class approach, the BERT model is trained using four folds of relevant questions plus four folds of irrelevant questions, and tested on the same number of test questions as the Entropy and Dropout approaches.
5.2 Optimizing Entropy Decision Threshold
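To find the optimal threshold cutoff $b$, we consider the following Quadratic Mixed-Integer programming problem

$$
\begin{aligned}
\min_{x,\,b}\quad & \sum_{i=1}^{N}\sum_{k=1}^{K+1} \left(x_{ik} - l_{ik}\right)^2 \\
\text{s.t.}\quad & x_{ik} = 0 \ \text{ if } H_i \ge b, \quad k = 1,\dots,K, \\
& x_{ik} = 1 \ \text{ if } H_i \ge b, \quad k = K+1, \\
& x_{ik} \in \{0, 1\}, \\
& \sum_{k=1}^{K+1} x_{ik} = 1, \quad i = 1,\dots,N
\end{aligned}
\tag{1}
$$

to minimize the quadratic loss between the predictive assignments $x_{ik}$ and true labels $l_{ik}$. In (1), $i$ is the sample index, $k$ is the class (intent) index, and $H_i$ is the predictive entropy of sample $i$. $x_{ik}$ is an $N \times (K+1)$ binary matrix, and $l_{ik}$ is also $N \times (K+1)$, where the first $K$ columns are binary values and the last column is a uniform vector $C$, which represents the cost of escalating questions. Normally $C$ is a constant value smaller than 1, which encourages the bot to escalate questions rather than making mistaken predictions. The first and second constraints of (1) force an escalation label when the entropy $H_i \ge b$. The third and fourth constraints restrict $x_{ik}$ to binary variables and ensure that the sum for each sample is 1. Experimental results (Figure 3) indicate that (1) needs more than 5000 escalation questions to learn a stabilized $b$. The value of the escalation cost $C$ has a significant impact on the optimal $b$, and in our implementation $C$ is set to 0.5.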
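For intuition, a single entropy cutoff can also be found by exhaustive search, because the total loss only changes when the cutoff crosses an observed entropy value. The sketch below is a simplified stand-in for the mixed-integer formulation, not the Gurobi model used in the paper. It assumes that a question which is not escalated keeps the classifier's predicted intent, and that the label row of an irrelevant question is all zeros in the intent columns with the escalation cost in the last column. The function names and the toy samples are hypothetical.

```python
import math


def sample_loss(pred_class, true_class, escalated, cost):
    """Squared distance between one assignment row and its label row.

    Rows have K intent columns plus one escalation column. The label row is
    one-hot on true_class (all zeros when true_class is None, i.e. an
    irrelevant question) and carries `cost` in the escalation column.
    """
    if escalated:
        intent_part = 0.0 if true_class is None else 1.0
        return intent_part + (1.0 - cost) ** 2
    if pred_class == true_class:
        intent_part = 0.0
    else:
        intent_part = 1.0 + (0.0 if true_class is None else 1.0)
    return intent_part + cost ** 2


def best_entropy_cutoff(samples, cost=0.5):
    """Try every observed entropy (and 'never escalate') as the cutoff.

    samples: list of (entropy, predicted_class, true_class_or_None).
    Returns (cutoff, total_loss) with the smallest total loss.
    """
    candidates = sorted({e for e, _, _ in samples}) + [math.inf]

    def total(cutoff):
        return sum(sample_loss(p, t, e >= cutoff, cost) for e, p, t in samples)

    return min(((c, total(c)) for c in candidates), key=lambda ct: ct[1])


# Toy data: two correct relevant predictions, one wrong relevant prediction,
# and two irrelevant questions (true class None).
toy_samples = [(0.1, 3, 3), (0.2, 7, 7), (1.5, 2, 9), (1.8, 4, None), (2.0, 1, None)]
```

On the toy data the search escalates the misclassified and irrelevant questions and keeps the two correct predictions. This brute-force search scales with the number of distinct entropy values and is only meant to make the objective concrete.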
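|approach||Entropy||Entropy||Entropy||Entropy||Dropout||Dropout||Dropout||Dropout||Dummy class||Dummy class||Dummy class||Dummy class|
|number of irrelevant questions in training||1000||5000||8000||10000||100||1000||2000||3000||1000||5000||8000||10000|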
|optimal entropy cutoff||2.36||1.13||0.85||0.55||-||-||-||-||-||-||-||-|
|optimal mean prob cutoff||-||-||-||-||0.8172||0.6654||0.7921||0.0459||-||-||-||-|
|optimal std. cutoff||-||-||-||-||0.1533||0.0250||0.0261||0.0132||-||-||-||-|
|mean accuracy in classes||93.7%|
|accuracy of the dummy class||94.5%|
|precision (binary classification)||51.4%||70.2%||74.7%||79.8%||90.7%||68.8%||68.9%||63.7%||81%||95.3%||99.5%||99.6%|
|recall (binary classification)||96.7%||91.3%||88.1%||83.5%||93.9%||82.7%||83.2%||84.7%||99.7%||98.7%||92.6%||86%|
|F1 score (binary classification)||0.671||0.794||0.808||0.816||0.738||0.751||0.754||0.727||0.894||0.967||0.959||0.923|
5.3 Monte Carlo Drop-out
In the BERT model, dropout ratios can be customized at the encoding, decoding, attention, and output layers. A combinatorial search for optimal dropout ratios is computationally challenging. The results reported in this paper are obtained under a simplification in which the same dropout ratio is assigned to all layers and varied jointly. Our MC dropout experiments are conducted as follows:
Change dropout ratios in encoding/decoding/attention/output layer of BERT
Train BERT model on 80% of relevant questions for 10 or 30 epochs
Export and serve the trained model by Tensorflow serving
Repeat inference 100 times on the questions, average the results for each question to obtain mean probabilities and standard deviations, then average the deviations over a set of questions.
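The final step above can be sketched on a toy model. The code below keeps dropout active at inference for a small linear classifier and aggregates 100 stochastic passes into per-class means and sample standard deviations. The toy classifier, its weights, and all function names are hypothetical; in the paper the stochastic passes come from the served BERT model.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(features, weights, drop_ratio, rng):
    """One forward pass of a toy linear classifier with dropout kept ON.

    Each input unit is dropped with probability drop_ratio (must be < 1) and
    the surviving units are rescaled (inverted dropout).
    """
    keep = 1.0 - drop_ratio
    masked = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_stats(features, weights, drop_ratio=0.1, passes=100, seed=0):
    """Per-class mean and sample standard deviation over stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, drop_ratio, rng)
               for _ in range(passes)]
    n_classes = len(weights)
    means = [sum(s[k] for s in samples) / passes for k in range(n_classes)]
    stds = [math.sqrt(sum((s[k] - means[k]) ** 2 for s in samples) / (passes - 1))
            for k in range(n_classes)]
    return means, stds


# Toy input with three features and a three-class weight matrix.
toy_features = [1.0, 2.0, 0.5]
toy_weights = [[0.5, 0.1, 0.0], [0.0, 0.4, 0.2], [0.3, 0.0, 0.6]]
```

Setting `drop_ratio=0.0` removes the randomness, so every pass is identical and the standard deviations collapse to zero, which is a quick sanity check for any MC dropout pipeline.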
From the experimental results illustrated in Figure 4, we draw three conclusions: (1) Epistemic uncertainty estimated by MCD reflects question relevance: when inputs are similar to the training data the uncertainty is low, whereas inputs that differ from the original training data have higher epistemic uncertainty. (2) Converged models (more training epochs) have similar uncertainty and accuracy regardless of the dropout ratio used. (3) The number of epochs and the dropout ratio are important hyper-parameters that have substantial impacts on the uncertainty measure and predictive accuracy, and should be cross-validated in real applications.
We use the mean probabilities and standard deviations obtained from models with dropout ratios set to 10% after 30 epochs of training to learn optimal decision thresholds. Our goal is to optimize a lower bound on the mean predictive probability and an upper bound on the standard deviation, and to designate a question as relevant only when its mean predictive probability is larger than the lower bound and its standard deviation is lower than the upper bound. Optimizing these two bounds on a 381-class problem is much more computationally challenging than learning the entropy threshold, because the number of constraints is proportional to the number of classes. As shown in (2), we introduce two binary indicator variables for the status of the mean-probability and deviation conditions, and the final assignment variable is the logical AND of the two. Solving (2) with more than 10k samples is very slow (shown in Appendix), so we use 1500 original relevant questions and increase the number of irrelevant questions from 100 to 3000. For performance testing, the optimized bounds are applied as decision variables to samples of BERT predictions on test data. The performance of the dropout approach is presented in Table 3 and the Appendix. Our results show that the decision thresholds optimized from (2) with 2000 irrelevant questions give the best F1 score (0.754); we validated this using grid search and confirmed its optimality (shown in Appendix).
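The grid-search validation mentioned above can be sketched as follows. The code scans candidate lower bounds on the mean probability and upper bounds on the standard deviation, and keeps the pair with the highest F1 score. Treating "relevant" as the positive class, the grid values, and the toy samples are all assumptions made for illustration; they are not taken from the paper's appendix.

```python
def f1_score(predicted, actual):
    """F1 for boolean predictions, with True as the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search_thresholds(samples, mean_grid, std_grid):
    """Return (best_f1, lower_bound, upper_bound) over the grid.

    samples: list of (mean_prob, std, is_relevant). A question is called
    relevant only if mean_prob > lower_bound and std < upper_bound.
    """
    actual = [relevant for _, _, relevant in samples]
    best = (-1.0, None, None)
    for lower in mean_grid:
        for upper in std_grid:
            predicted = [m > lower and s < upper for m, s, _ in samples]
            score = f1_score(predicted, actual)
            if score > best[0]:  # strict: the first best pair is kept
                best = (score, lower, upper)
    return best


# Toy MC dropout summaries: (mean probability, standard deviation, relevant?)
toy_summaries = [(0.95, 0.01, True), (0.90, 0.02, True),
                 (0.60, 0.10, False), (0.85, 0.20, False)]
```

Grid search costs one pass over the data per threshold pair, which makes it a convenient cross-check for a solver-based optimum on small grids.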
5.4 Dummy-class Classification
Our third approach is to train a classifier in BERT using both relevant and irrelevant questions. We use a dummy class to represent the 17,395 irrelevant questions, and split the entire data set, including relevant and irrelevant questions, into five folds for training and testing.
The performance of the Dummy-class approach is compared with the Entropy and Dropout approaches in Table 3. Deciding on an optimal number of irrelevant questions for threshold learning is non-trivial, especially for the Entropy and Dummy-class approaches. Dropout does not need as many irrelevant questions as Entropy to learn an optimal threshold, mainly because the number of constraints in (2) is proportional to the number of classes (381), so the number of constraints is large enough to learn a suitable threshold on small samples (to support this conclusion, we present extensive studies in the Appendix on a 5-class classifier using Tier-1 intents). The Dummy-class approach obtains the best performance, but its success assumes that the learned decision boundary generalizes well to any new irrelevant questions, which is often not valid in real applications. In contrast, the Entropy and Dropout approaches only need to treat a binary problem in the optimization and leave the intent classification model intact. The optimization problem for the Entropy approach can be solved much more efficiently, and it is selected as the solution for our final implementation.
It is certainly possible to combine the Dropout and Entropy approaches, for example, by optimizing thresholds on the entropy calculated from the mean of the MC dropout predictions. Furthermore, it is possible that the problem defined in (2) can be simplified by a proper reformulation and solved more efficiently, which we will explore in future work.
6 Sentence Completion using Language Model
We assume that misspelled words are all OOV words, so that we can replace them with [MASK] tokens and use bidirectional language models to predict them. Predicting masked words within sentences is an inherent objective of a pre-trained bidirectional model, and we use the Masked Language Model API in the Transformers package (HuggingFace, 2017) to generate a ranked list of candidate words for each [MASK] position. The sentence completion algorithm is illustrated in Algorithm 1.
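The masking and candidate-selection idea can be sketched without calling a language model. In the code below, the ranked candidate list is passed in as plain input (in practice it would come from the masked language model), and the correction is the candidate closest to the misspelled word by edit distance, with language-model rank breaking ties. This combination rule and all function names are assumptions made for illustration; it is not a reproduction of Algorithm 1.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def mask_oov(tokens, vocabulary):
    """Replace tokens that are not in the vocabulary with [MASK]."""
    return [t if t.lower() in vocabulary else "[MASK]" for t in tokens]


def pick_correction(misspelled, ranked_candidates):
    """Closest candidate by edit distance; earlier LM rank breaks ties."""
    ranked = enumerate(ranked_candidates)
    return min(ranked, key=lambda rc: (edit_distance(misspelled, rc[1]), rc[0]))[1]
```

For the misspelling "acount", both "account" and "amount" are one edit away, so the candidate that the language model ranks higher is chosen. This is where sentence context can help a pure edit-distance corrector.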
6.2 Experimental Setup
For each question, we randomly permute two characters in the longest word, the next longest word, and so on. In this way, we generate one to three synthetic misspellings in each question. We investigate how intent classification accuracy changes on these questions, and how our sentence completion model can prevent the performance change. All models are trained using relevant data (80%) without misspellings and validated on synthetically misspelled test data. Five settings are compared: (1) No correction: classification performance without applying any auto-correction; (2) No LM: auto-corrections made only by word edit distance without using the Masked Language Model; (3) BERT Sharepoint: auto-corrections made by the Masked LM using pre-trained Sharepoint embeddings together with word edit distance; (4) BERT Email: auto-corrections using pre-trained email embeddings together with word edit distance; (5) BERT Google: auto-corrections using pre-trained Google Small uncased embeddings together with word edit distance.
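The synthetic misspelling procedure can be sketched as below. The code swaps two randomly chosen character positions in each of the longest words of a question. Whitespace tokenization, the seeding scheme, and the function names are assumptions; the exact sampling used in the experiments is not specified in the text.

```python
import random


def swap_two_chars(word, rng):
    """Swap two distinct random positions in the word.

    If the two chosen characters happen to be identical, the word is unchanged.
    """
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def add_misspellings(question, n_errors, seed=0):
    """Corrupt the n_errors longest whitespace-separated tokens of a question."""
    rng = random.Random(seed)
    tokens = question.split()
    # Indices of the longest tokens; sorted() is stable, so ties keep word order.
    longest = sorted(range(len(tokens)), key=lambda k: -len(tokens[k]))[:n_errors]
    for k in longest:
        tokens[k] = swap_two_chars(tokens[k], rng)
    return " ".join(tokens)
```

Calling `add_misspellings(question, n)` with `n` from 1 to 3 mirrors the one to three synthetic misspellings per question described above.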
We also need to decide what counts as an OOV word, or equivalently, what should be included in our vocabulary. After experiments, we define our vocabulary as words from four categories: (1) all words in the pre-trained embeddings; (2) all words that appear in the training questions; (3) words that are fully capitalized, because they are likely to be proper nouns, fund tickers or service products; (4) all words that start with numbers, because they can be tax forms or specific products (e.g., 1099b, 401k, etc.). The purpose of including (3) and (4) is to avoid auto-correcting keywords that may represent significant intents. Any word that falls outside these four groups is considered an OOV. In production, we continuously monitor the OOV rate, defined as the ratio of OOV occurrences to total word counts over the most recent 24 hours. When it is higher than 1%, we intervene manually to check the chatbot log data.
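The four vocabulary categories and the OOV rate translate into a short check. The sketch below assumes case-insensitive lookup (consistent with the case-insensitive embeddings) and tokens that are already stripped of punctuation; the function names and the toy vocabularies are hypothetical.

```python
def is_oov(word, embedding_vocab, training_vocab):
    """True if the word falls outside all four vocabulary categories."""
    if word.isupper():       # category 3: fully capitalized words
        return False
    if word[:1].isdigit():   # category 4: words that start with a number
        return False
    lowered = word.lower()   # categories 1 and 2: case-insensitive lookup
    return lowered not in embedding_vocab and lowered not in training_vocab


def oov_rate(words, embedding_vocab, training_vocab):
    """Ratio of OOV occurrences to total word count (0.0 for empty input)."""
    if not words:
        return 0.0
    n_oov = sum(is_oov(w, embedding_vocab, training_vocab) for w in words)
    return n_oov / len(words)


# Toy vocabularies for illustration only.
toy_embedding_vocab = {"reset", "account", "the"}
toy_training_vocab = {"password", "roth"}
```

A monitoring job could apply `oov_rate` to the words logged in the last 24 hours and raise an alert when the result exceeds 0.01.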
We also need to determine two additional parameters: the number of candidate tokens prioritized by the masked language model, and the beam size $B$ in our sentence completion model. In our approach, we set both parameters to the same value, which is benchmarked from 1 to 10k by test sample accuracy. Note that when both parameters are large and there are more than two OOVs, Beam Search becomes very inefficient in Algorithm 1. To simplify this, instead of finding the optimal combination of candidate tokens that maximizes the joint probability, we assume the OOVs are independent and apply a simplified algorithm (shown in Appendix) to each OOV separately. An improved version of the sentence completion algorithm that maximizes the joint probability is left for future research. We have not considered situations in which misspellings are not OOVs. Detecting improper words in a sentence may require evaluating metrics such as Perplexity or Sensibleness and Specificity Average (SSA) (Adiwardana et al., 2020), which is also a goal for future work.
According to the experimental results illustrated in Figure 5, pre-trained embeddings are useful for increasing the robustness of intent prediction on noisy inputs. Domain-specific embeddings contain much richer context-dependent semantics, which helps OOVs get properly corrected and leads to better task-oriented intent classification performance. The benchmark shows that a beam size of $B = 4000$ leads to the best performance for our problem. Based on this, we use Sharepoint embeddings as the language model in our sentence completion module.
The chatbot has been implemented fully inside our company network using open source tools including RASA (Bocklisch et al., 2017), Tensorflow, and Pytorch in a Python environment. All backend models (the Sentence Completion model, the Intent Classification model and others) are deployed as RESTful APIs in AWS Sagemaker. The front-end of the chatbot is launched on Microsoft Teams, powered by Microsoft Botframework and Microsoft Azure directory, and connected to backend APIs in the AWS environment. All our BERT model training, including embeddings pre-training, is based on BERT Tensorflow running on an AWS P3.2xlarge instance. The optimization procedure uses Gurobi 8.1 running on an AWS C5.18xlarge instance. The BERT language model API in the sentence completion model is developed using the Transformers 2.1.1 package on PyTorch 1.2 and Tensorflow 2.0.
During our implementation, we further explore how the intent classification model API can be served in real applications under a limited budget. We gradually reduce the number of attention layers and hidden layers in the original BERT Small model (12 hidden layers, 12 attention heads) to create several smaller models. By halving the number of hidden layers and attention layers, we see a remarkable 100% increase in performance (double the throughput, half the latency) at the cost of only a 1.6% drop in intent classification performance.
Our results demonstrate that optimized uncertainty thresholds applied to BERT model predictions are a promising way to escalate irrelevant questions in a task-oriented chatbot implementation, while the state-of-the-art deep learning architecture provides high accuracy when classifying questions into a large number of intents. Another contribution is the application of BERT embeddings as a language model to automatically correct small spelling errors in noisy inputs, and we show its effectiveness in reducing intent classification errors. The entire end-to-end conversational AI system, including the two machine learning models presented in this paper, is developed using open source tools and deployed as an in-house solution. We believe these discussions provide useful guidance to companies that are motivated to reduce their dependency on vendors by leveraging state-of-the-art open source AI solutions in their business.
We will continue our explorations in this direction, with particular focus on the following issues: (1) Fine-tuning and decision threshold learning are currently two separate parts, and we will explore the possibility of combining them as a new cost function in BERT model optimization. (2) The dropout methodology applied in our paper belongs to the family of approximate inference methods and is a crude approximation to exact posterior learning in parameter space. We are interested in a Bayesian version of BERT, which requires a new architecture based on variational inference using tools like TensorFlow Probability (TFP). (3) Maintaining a chatbot production system requires a complex pipeline to continuously transfer and integrate features from the deployed model into new versions for new business needs, which is uncharted territory for all of us. (4) Hybridizing ”chitchat” bots, which use state-of-the-art progress in deep neural models, with task-oriented machine learning models is important for our preparation of a client self-provisioning service.
We thank our colleagues in Vanguard CAI (ML-DS team and IT team) for their seamless collaboration and support. We thank colleagues in Vanguard Retail Group (IT/Digital, Customer Care) for their pioneering effort collecting and curating all the data used in our approach. We thank Robert Fieldhouse, Sean Carpenter, Ken Reeser and Brain Heckman for the fruitful discussions and experiments.
- Towards a human-like open-domain chatbot. Cited by: §1, §6.2.
- Deep-text-corrector. GitHub. Note: https://github.com/atpaino/deep-text-corrector Cited by: §2.
- Simple and scalable predictive uncertainty estimation using deep ensembles. pp. 6405. Cited by: §2, 1st item.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, Cited by: §2.
- Weight uncertainty in neural network. In Proceedings of the 32nd ICML, pp. 1613–1622. Cited by: §1.
- Rasa: open source language understanding and dialogue management. CoRR abs/1712.05181. Cited by: §7.
- Artificial paranoia. Artificial Intelligence 2 (1), pp. 1–25. Cited by: §1.
- Natural language processing (almost) from scratch. JMLR 12. Cited by: §2.
- A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th ICML, pp. 160–167. Cited by: §2.
- Semi-supervised sequence learning. In Proceedings of Advances in Neural Information Processing Systems 28, pp. 3079–3087. Cited by: §1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the NAACL, Vol 1, pp. 4171–4186. Cited by: AVA: A Financial Service Chatbot based on Deep Bidirectional Transformers, §1, §2.
- . In Proceedings of the 25th International Conference on Computational Linguistics, Dublin, Ireland, pp. 69–78. Cited by: §2.
- Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML’14, pp. II–1818–II–1826. Cited by: §2.
- Finding structure in time. Cognitive Science 14 (2), pp. 179–211. Cited by: §2.
- Accurate uncertainties for deep learning using calibrated regression. Cited by: §2.
- On calibration of modern neural networks. Vol. 70. Cited by: §2.
- Bayesian reinforcement learning. Foundations and Trends in Machine Learning 8 (5-6). Cited by: §2.
- Avoiding echo-responses in a retrieval-based conversation system. Conference on Artificial Intelligence and Natural Language, pp. 91–97. Cited by: §1.
- A new algorithm for data compression. C Users J. 12 (2), pp. 23–38. Cited by: §1.
- Neural approaches to conversational ai. In SIGIR ’18, Cited by: §1, §1.
- SelectiveNet: a deep neural network with an integrated reject option. Cited by: §2.
- Dropout as bayesian approximation: representing model uncertainty in deep learning. Vol. 48, pp. 1050. Cited by: §1, §1, §2, 2nd item.
- Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pp. 2348–2356. Cited by: §1.
- Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd ICML, Vol 37, ICML’15, pp. 1861–1869. Cited by: §1.
- Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the ACL, Melbourne, Australia, pp. 328–339. Cited by: §1.
- Transformers. GitHub. Note: https://github.com/huggingface/transformers Cited by: §6.1.
- What uncertainties do we need in bayesian deep learning for computer vision?. Vol. 30. Cited by: §2.
- Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2741–2749. Cited by: §2.
- Cross-lingual language model pretraining. CoRR abs/1901.07291. Cited by: §1.
- An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 9th EMNLP-IJCNLP, Cited by: §1, §1.
- Threshold learning for optimal decision making. In Advances in Neural Information Processing Systems 29, pp. 3763–3771. Cited by: §2.
- Learning weight uncertainty with stochastic gradient mcmc for shape classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5666–5675. Cited by: §1.
- Syntactic annotations for the google books ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, USA, pp. 169–174. Cited by: §1.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §1.
- A practical bayesian framework for backpropagation networks.. Vol. 4. Cited by: §2.
- A simple baseline for bayesian uncertainty in deep learning. Neural Information Processing Systems (NeurIPS). Cited by: §1, §1.
- Zo. Note: https://www.zo.ai Cited by: §1.
- Recurrent neural network based language model. Vol. 2, pp. 1045–1048. Cited by: §2.
- Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546. Cited by: §2.
- Bayesian learning for neural networks. Ph.D. Thesis. Cited by: §2.
- Sampling-based bayesian inference with gradient uncertainty. CoRR abs/1812.03285. Cited by: §1.
- Uncertainty in neural networks: bayesian ensembling. ArXiv abs/1810.05546. Cited by: §1.
- Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th ACL, Vancouver, Canada, pp. 1756–1765. Cited by: §1.
- Dissecting contextual word embeddings: architecture and representation. CoRR abs/1808.08949. Cited by: §1.
- Deep contextualized word representations. In Proceedings of the 2018 Conference of the NAACL, New Orleans, Louisiana, pp. 2227–2237. Cited by: §1, §2.
- Improving language understanding by generative pre-training. In arxiv, Cited by: §2.
- Deep learning for self-driving cars: chances and challenges. In 2018 IEEE/ACM 1st International Workshop on Software Engineering for AI in Autonomous Systems (SEFAIAS), Los Alamitos, CA, USA, pp. 35–38. Cited by: §1.
- Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. Cited by: §1.
- Towards calibrated and scalable uncertainty representations for neural networks. ArXiv abs/1911.00104. Cited by: §1.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th ACL, Berlin, Germany, pp. 1715–1725. Cited by: §1.
- A deep reinforcement learning chatbot. CoRR. Cited by: §1.
- The state of chatbots in 2019. Cited by: §1.
- A comprehensive guide to bayesian convolutional neural network with variational inference. CoRR abs/1901.02731. Cited by: §1.
- Distilling task-specific knowledge from BERT into simple neural networks. CoRR abs/1903.12136. Cited by: §1.
- Attention is all you need. In NIPS, Cited by: §2.
- ELIZA—a computer program for the study of natural language communication between man and machine. 9 (1), pp. 36–45. Cited by: §1.
- Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, Madison, WI, USA, pp. 681–688. Cited by: §1.
- Mitsuku. Note: http://www.mitsuku.com Cited by: §1.
- XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. Cited by: §1.
- The design and implementation of xiaoice, an empathetic social chatbot. CoRR abs/1812.08989. Cited by: §1.
Appendix A Appendix
All extended materials and source code related to this paper are available at https://github.com/cyberyu/ava. Our repo is composed of two parts: (1) extended materials related to the main paper, and (2) source code scripts. To protect proprietary intellectual property, we cannot share the question dataset or the proprietary embeddings. We use an alternative data set from Larson et al., “An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction”, EMNLP-IJCNLP 2019, to demonstrate the usage of the code.
a.1 Additional Results for the Main Paper
Some extended experimental results on MC dropout and optimization are presented on GitHub.
a.1.1 Histogram of Uncertainties by Dropout Ratios
We compare histograms of standard deviations observed from random samples of predictions. The left side contains histograms generated by 381-class intent models trained for 10 epochs, with the dropout ratio varied from 10 percent to 90 percent. The right side shows histograms generated by 381-class models trained for 30 epochs.
a.1.2 Uncertainty comparison between 381-class vs 5-class
To understand how uncertainties change vs. the number of classes in BERT, we train another intent classifier using only Tier 1 labels. We compare uncertainty and accuracy changes at different dropout rates between the original 381-class problem and the new 5-class problem.
a.1.3 Grid Search For Optimal Threshold on Dropout
Instead of using optimization, we use a grid search to find the optimal combination of the average probability threshold and the standard deviation threshold. The search space is set as a 100 x 100 grid on the space [0,0] to [1,1], where thresholds vary in steps of 0.01 from 0 to 1. Applying the thresholds to the outputs of BERT predictions gives us a classification of relevant vs. irrelevant questions. Using the same combination of test and irrelevant questions, we visualize the F1 score in a contour map shown in the GitHub repo.
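The grid search described here can be sketched as follows. This is a minimal illustration rather than the script in the repository: the function names and the decision rule (flag a question as irrelevant when its mean probability falls below one threshold or its standard deviation rises above the other) are assumptions made for the example. The grid includes both endpoints, so each axis has 101 values.

```python
import numpy as np


def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = irrelevant question)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search_thresholds(mean_prob, std_dev, is_irrelevant, n_points=101):
    """Scan all (mean, std) threshold pairs on [0, 1] x [0, 1].

    Returns the best mean threshold, the best std threshold, the best F1,
    and the full F1 grid (useful for drawing a contour map).
    """
    grid = np.linspace(0.0, 1.0, n_points)
    scores = np.zeros((n_points, n_points))
    for i, t_mean in enumerate(grid):
        for j, t_std in enumerate(grid):
            # Assumed rule: low confidence OR high spread means "irrelevant".
            pred = ((mean_prob < t_mean) | (std_dev > t_std)).astype(int)
            scores[i, j] = f1_score(is_irrelevant, pred)
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return grid[i], grid[j], scores[i, j], scores
```

The returned `scores` array has one row per mean-probability threshold and one column per standard-deviation threshold, so it can be passed directly to a contour plotting routine.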
a.1.4 Optimal Threshold Learning on Dropout 381 classes vs 5 classes
Using the same optimization process mentioned in equation (2) of the main paper, we compare the optimal results (also CPU timing) learned from 381 classes vs. 5 classes.
a.1.5 Simple algorithm for sentence completion model
When multiple OOVs occur in a sentence, a large beam size is needed to find the optimal joint probability, which is computationally expensive. To avoid this burden, we assume that the candidate words for all OOVs are independent, and apply Algorithm 2 to correct the OOVs one by one.
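The one-by-one correction can be sketched as below. This is only an illustration under stated assumptions and is not the paper's Algorithm 2: `top_candidates` is a hypothetical callable standing in for the masked language model, and choosing the candidate with the smallest edit distance to the misspelled word is an assumed matching rule.

```python
from typing import Callable, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_oovs_independently(
    tokens: List[str],
    vocab: Set[str],
    top_candidates: Callable[[List[str], int, int], List[str]],
    m: int = 10,
) -> List[str]:
    """Correct each OOV on its own instead of searching over joint combinations.

    top_candidates(masked_tokens, position, m) is a hypothetical stand-in for
    the masked language model; it returns up to m candidate words for the
    [MASK] at the given position.
    """
    corrected = list(tokens)
    for pos, tok in enumerate(tokens):
        if tok in vocab:
            continue
        # Mask only the current OOV; corrections made earlier stay in context.
        masked = corrected[:pos] + ["[MASK]"] + corrected[pos + 1:]
        candidates = top_candidates(masked, pos, m)
        if candidates:
            # Assumed matching rule: keep the candidate closest to the typo.
            corrected[pos] = min(candidates, key=lambda c: edit_distance(c, tok))
    return corrected
```

Because each OOV is resolved separately, the cost grows linearly with the number of OOVs rather than with the product of the candidate lists.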
a.2 Intent Classification Source Code
a.2.1 BERT embeddings Model Pretraining
The Jupyter notebook for pretraining embeddings is at https://github.com/cyberyu/ava/blob/master/scripts/notebooks/BERT_PRETRAIN_Ava.ipynb. Our script is adapted from Denis Antyukhov’s blog “Pre-training BERT from scratch with cloud TPU”. We set VOC_SIZE to 32000, and use the SentencePiece tokenizer as an approximation of Google’s WordPiece. The learning rate is set to 2e-5, the training batch size to 16, the number of training steps to 1 million, MAX_SEQ_LENGTH to 128, and MASKED_LM_PROB to 0.15.
To ensure the embeddings are trained with the right architecture, please make sure the bert_config.json file referred to in the script has the right numbers of hidden and attention layers.
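As an illustration of what to check in bert_config.json, the sketch below builds a configuration dictionary with a chosen number of hidden layers and attention heads. The key names follow the standard BERT configuration file; the helper name `make_bert_config`, the specific values (6 layers and 6 heads as one example of a halved model, a hidden size of 768), and the divisibility check are assumptions for the example, not the settings used in the paper.

```python
import json


def make_bert_config(num_layers, num_heads, hidden_size=768, vocab_size=32000):
    """Return a BERT-style config dict (hypothetical helper for illustration)."""
    # Each attention head gets hidden_size / num_heads dimensions.
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "attention_probs_dropout_prob": 0.1,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "hidden_size": hidden_size,
        "initializer_range": 0.02,
        "intermediate_size": 4 * hidden_size,
        "max_position_embeddings": 512,
        "num_attention_heads": num_heads,
        "num_hidden_layers": num_layers,
        "type_vocab_size": 2,
        "vocab_size": vocab_size,  # matches VOC_SIZE used for the tokenizer
    }


# Example: a model with half of the 12 layers and 12 heads.
config_text = json.dumps(make_bert_config(num_layers=6, num_heads=6), indent=2)
```

The same file must be used for pretraining and for fine-tuning, since a checkpoint can only be restored into a graph with matching layer counts.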
a.2.2 BERT model training and exporting
The Jupyter notebook for BERT intent classification model training, validation, prediction and exporting is at https://github.com/cyberyu/ava/blob/master/scripts/notebooks/BERT_run_classifier_Ava.ipynb. The main script run_classifier_inmem.py is adapted from the default BERT script run_classifier.py, with a new function serving_input_fn() added. To export the model in the same command once training is finished, the ’--do_export=true’ flag needs to be set, and the trained model will be exported to the directory specified in the ’--export_dir’ flag.
a.2.3 Model Serving API Script
We create a Jupyter notebook to demonstrate how the exported model can be served as an in-memory classifier for intent classification, located at https://github.com/cyberyu/ava/scripts/notebooks/inmemory_intent.ipynb. The script loads the entire BERT graph from the exported directory, keeps it in memory, and provides inference results on new questions. Please notice that in the “getSess()” function, users need to specify the correct exported directory and the correct embeddings vocabulary path.
a.2.4 Model inference with Dropout Sampling
We provide a script that performs Monte Carlo dropout inference using the in-memory classifier. The script assumes three groups of questions are saved in three separate files: training.csv, test.csv, and irrelevant.csv. Users need to specify the number of random samples, and the predicted probabilities are saved as corresponding pickle files. The script is available at https://github.com/cyberyu/ava/scripts/dropout_script.py.
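The core of Monte Carlo dropout inference, repeating a stochastic forward pass and summarizing the samples by their mean and standard deviation, can be sketched as follows. This is not the repository script: `stochastic_predict` is a hypothetical callable standing in for the in-memory BERT classifier with dropout left active, and the toy predictor (dropout applied directly to fixed logits) exists only to make the example runnable.

```python
import numpy as np


def mc_dropout_predict(stochastic_predict, question, n_samples=100):
    """Run n_samples stochastic forward passes for one question.

    Returns the per-class mean probability and per-class standard deviation.
    """
    samples = np.stack([stochastic_predict(question) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)


def make_toy_predictor(logits, drop_rate, seed=0):
    """Toy stand-in for a classifier with dropout enabled at inference time."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float)

    def predict(_question):
        keep = rng.random(logits.shape) >= drop_rate
        z = logits * keep / (1.0 - drop_rate)  # inverted-dropout scaling
        e = np.exp(z - z.max())                # numerically stable softmax
        return e / e.sum()

    return predict


predictor = make_toy_predictor([3.0, 1.0, 0.5], drop_rate=0.1)
mean_probs, std_probs = mc_dropout_predict(predictor, "how do I open an IRA", 50)
predicted_intent = int(np.argmax(mean_probs))
```

The mean probability and standard deviation of the predicted class are the two quantities that the thresholds in the main paper are applied to.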
a.2.5 Visualization of Model Accuracy and Uncertainty
The visualization notebook https://github.com/cyberyu/ava/scripts/notebooks/BERT_dropout_visualization.ipynb uses the output pickle files from the previous script to generate the histogram distribution figures and Figures 4(b) and (c).
a.3 Threshold Optimization Source Code
a.3.1 Threshold for entropy
The optimization script that finds the best threshold for entropy is available at https://github.com/cyberyu/ava/blob/master/scripts/optimization/optimize_entropy_threshold.py. The script requires Python 3.6 and Gurobi 8.1.
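For readers without a Gurobi license, the idea of an entropy threshold can be illustrated with a brute-force scan. The sketch below is a swapped-in simplification and does not reproduce the optimization in the script: it computes the entropy of each predicted distribution and picks the cut that maximizes accuracy under the assumed rule that entropy above the threshold means the question is irrelevant. All function names are hypothetical.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) along the last axis of a probability array."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def best_entropy_threshold(entropy_relevant, entropy_irrelevant):
    """Brute-force scan over observed entropies as candidate thresholds.

    Assumed rule: entropy > threshold means the question is irrelevant.
    Returns the threshold with the highest accuracy and that accuracy.
    """
    entropy_relevant = np.asarray(entropy_relevant, dtype=float)
    entropy_irrelevant = np.asarray(entropy_irrelevant, dtype=float)
    candidates = np.unique(np.concatenate([entropy_relevant, entropy_irrelevant]))
    total = len(entropy_relevant) + len(entropy_irrelevant)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = np.sum(entropy_relevant <= t) + np.sum(entropy_irrelevant > t)
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc
```

A scan like this is only practical for a single threshold; it does not replace a solver when several thresholds must be chosen jointly under constraints.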
a.3.2 Threshold for mean probability and standard deviation
The optimization script that finds the best mean probability threshold and standard deviation threshold is available at https://github.com/cyberyu/ava/blob/master/scripts/optimization/optimize_dropout_thresholds.py.
a.4 Sentence Completion Source Code
The complete Sentence Completion RESTful API code is in https://github.com/cyberyu/ava/scripts/sentence_completion/serve.py. The model depends on the BertForMaskedLM class from the Transformers package (ver 2.1.1) to generate token probabilities. We use transformers-cli (https://huggingface.co/transformers/converting_tensorflow_models.html) to convert our previously pretrained embeddings to PyTorch format. The input parameters for the API are:
Input sentence. The usage can be three cases:
The input sentence can be noisy (containing misspelled words) that require auto-correction. As shown in the example, the input sentence has some misspelled words.
Alternatively, it can also be a masked sentence, in the form of “Does it require [MASK] signature for IRA signup”. [MASK] indicates the word that needs to be predicted. In this case, the predicted words will not be matched back to input words. Every masked word will have a separate output of the top M predicted words. But the main output is still a single completed sentence (because masks can be combined with misspelled words and cause a large search).
Alternatively, the sentence can be a complete sentence, which only needs to be evaluated for its Perplexity score. Notice the score is for the entire sentence. The lower the score, the more usual the sentence is.
Beamsize: This determines how many alternative choices the model needs to explore to complete the sentence. We have three versions of functions, predict_oov_v1, predict_oov_v2 and predict_oov_v3. When there are multiple [MASK] signs in a sentence, and beamsize is larger than 100, v3 function is used as independent correction of multiple OOVs. If beamsize is smaller than 100, v2 is used as joint-probability based correction. If a sentence has only one [MASK] sign, v1 (Algorithm 2 in Appendix) is used.
Customized Vocabulary: The default vocabulary is the encoding vocabulary used when the bidirectional language model was trained. Any word in the sentence that does not occur in the vocabulary will be treated as an OOV, and will be predicted and matched. If you want to avoid predicting unwanted words, you can include them in the customized vocabulary. For multiple words, combine them with “—” and the algorithm will split them into a list. It is possible to turn off the customized vocabulary at runtime by simply passing None as the parameter.
Ignore rule: Sometimes we expect the model to ignore a range of words belonging to specific patterns, for example, all words that are capitalized or all words that start with numbers. These can be specified as ignore rules using regular expressions, so that the matching words are not processed as OOV words. For example, the expression ”[A-Z]+” tells the model to ignore all uppercase words, so it will not treat ‘IRA’ as an OOV even if it is not in the embeddings vocabulary (because the embeddings are lowercased). To turn this function off, use None as the parameter.
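The way the customized vocabulary and the ignore rule interact when deciding which words count as OOV can be sketched as follows. The function `find_oov_positions` and its signature are hypothetical and are not taken from serve.py; treating the ignore rule as a full-word match and lowercasing tokens before the vocabulary lookup are assumptions made for the example.

```python
import re
from typing import Iterable, List, Optional, Set


def find_oov_positions(
    tokens: List[str],
    vocab: Set[str],
    custom_vocab: Optional[Iterable[str]] = None,
    ignore_rule: Optional[str] = None,
) -> List[int]:
    """Return the indices of tokens that should be treated as OOV.

    Passing None for custom_vocab or ignore_rule switches that feature off.
    """
    known = set(vocab) | set(custom_vocab or ())
    pattern = re.compile(ignore_rule) if ignore_rule else None
    oov = []
    for i, tok in enumerate(tokens):
        # Assumption: the rule must match the whole word, e.g. "IRA".
        if pattern is not None and pattern.fullmatch(tok):
            continue
        # The embeddings vocabulary is lowercased, so compare in lowercase too.
        if tok in known or tok.lower() in known:
            continue
        oov.append(i)
    return oov


words = "Does it require signatre for IRA signup".split()
base_vocab = {"does", "it", "require", "for", "signup", "signature"}
oov_with_rule = find_oov_positions(words, base_vocab, ignore_rule="[A-Z]+")
```

With the rule active only the misspelled word is flagged; without it, the uppercase acronym would also be sent to the language model for correction.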
The model returns two values: the completed sentence, and its perplexity score.
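For reference, a perplexity score can be computed from per-token probabilities as the exponential of the average negative log-likelihood. The sketch below shows this standard definition; how serve.py obtains the token probabilities from the masked language model is not described here, so the input list and the function name are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability over the tokens of a sentence."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)
```

A sentence whose tokens all receive high probability scores close to 1, while unlikely tokens push the score up, which is consistent with lower scores indicating more usual sentences.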
a.5 RASA Server Source Code
The proposed chatbot utilizes RASA’s open framework to integrate RASA’s “chitchat” capability with our customized task-oriented models. To achieve this, we set up an additional action endpoint server to handle dialogues that trigger customized actions (sentence completion + intent classification), which are specified in the actions.py file. Dialogue management is handled by RASA’s Core dialogue management models, whose training data is specified in the stories.md file. Accordingly, in the run_core function of the RASA dialogue_model.py file, the agent loads two components: nlu_interpreter and action_endpoint.
The entire RASA project for the chatbot is shared under https://github.com/cyberyu/ava/bot. Please follow the guidance in the README file on GitHub to set up the backend process.
a.6 Microsoft Teams Setup
Our chatbot uses Microsoft Teams as the front-end to connect to the RASA backend. We realize that setting up MS Teams smoothly is a non-trivial task, especially in an enterprise-controlled environment, so we share detailed steps in the GitHub repo.
a.7 Connect MS Teams to RASA
On the RASA side, the main tweak needed to allow the MS Teams connection is in the dialogue_model.py file. The BotFrameworkInput library needs to be imported, and the correct app_id and app_password specified in the MS Teams setup should be used to initialize the RASA InputChannel.