In this work, we examine the problem of incrementally improving deployed QA systems in an industrial setting. We consider the customer care domain of a wireless network provider and focus on answering frequent questions from the long tail of the question distribution Bernstein et al. (2012). In this setting, the most frequent topics are covered by a separate industry-standard chatbot based on rules hand-crafted by dialogue engineers. Our proposed process is based on the augmented cross-industry standard process for data mining Sonntag (2008) (augmented CRISP data mining cycle). In particular, we are interested in methods for improving a model after its deployment through re-ranking of the initial ranking results. Beforehand, we follow the steps of the CRISP cycle towards deployment to generate a state-of-the-art baseline QA model. First, we examine existing data (data understanding) and prepare a corpus for training (data preparation). Second, we implement and train a QA pipeline using state-of-the-art open source components (modelling). We perform an evaluation using different amounts of data and different pipeline configurations (evaluation), also to understand the nature of the data and the application (business understanding). Third, we investigate the effectiveness and efficiency of re-ranking in improving our QA pipeline after the deployment phase of CRISP. Adaptivity after deployment is modelled as an (automatic) operationalisation step with external reflection based on, e.g., user feedback. This could be replaced by introspective meta-models that allow the system to enhance itself by metacognition Sonntag (2008). The QA system and the re-ranking approach are evaluated using a separate test set that maps actual user queries from a chat-log to answers of the QA corpus. Sample queries from the evaluation set, each with one correct and one incorrect answer, are shown in Table 1.
With this work, we want to answer the question of whether a deployed QA system that is difficult to adapt and that provides a top-10 ranking of answer candidates can be improved by an additional re-ranking step that corresponds to the operationalisation step of the augmented CRISP cycle. It is also important to know the potential gain and the limitations of such a method that works on top of an existing system. We hypothesise that our proposed re-ranking approach can effectively improve ranking-based QA systems.
|User Query||Correct Answer||Incorrect Answer|
|Bekomme ich bei Vertragsverlängerung ein neues Handy? (Do I get a new phone when extending my contract?)||Ab wann Sie Ihre Rufnummern verlängern können und welche Angebote bei einer Vertragsverlängerung auf Sie warten, sehen Sie in Ihrem persönlichen Kundenportal Mein T-Mobile […] (In your online customer area of T-Mobile, you can see when you can continue your telephone numbers and which offers await you after extending your contract […]) (rank 1)||Suchen Sie ein neues Gerät, das genau Ihre Bedürfnisse und Anforderungen erfüllt? Sie wollen rechtzeitig über Neuerungen informiert werden? […] (You are looking for a new device that satisfies all your requirements? You want to get recent news? […]) (rank 5)|
|tarife ohne bimdung (plans without bimding contract) – misspelled||Wenn Sie bereits ein Handy besitzen und nur eine Simkarte benötigen, haben wir genau das Richtige für Sie: die Klax-SIM. […] (If you already own a phone and all you need is a SIM card, we have exactly the right offer: the Klax-SIM. […]) (rank 3)||Eine Übersicht über unsere aktuellen My Mobile Handytarife inklusive aller wichtigen Details finden Sie auf der Tarifseite. […] (An overview of our current service plans with all important details can be found on our website. […]) (rank 1)|
|Kosten für vertragübernahme (costs for a contract transfer)||Sie können Verträge an andere Personen übergeben, zusammenlegen oder trennen. Die Kosten belaufen sich auf […] Ausführliche Informationen zum Thema finden Sie in den FAQ. (You can transfer, join and split contracts from and to other persons. The costs are […] More detailed information can be found in our FAQ.) (rank 10)||Ein Zukauf von Freiminuten ist nicht möglich und bei unseren aktuellen Tarifen auch nicht notwendig, da Freiminuten hier unlimitiert sind. (You cannot buy additional minutes. However, that’s not required with our plans, because minutes are unlimited.) (rank 1)|
|Kreditkarte (credit card)||Eine Zahlung mittels Kreditkarte ist selbstverständlich bei uns möglich. Sollten Sie Ihre Zahlungsart auf Kreditkarte ändern oder Ihre Daten aktualisieren wollen, können Sie dies direkt über unseren LiveChat veranlassen. […] (Of course, you can pay with your credit card. If you want to change your payment settings to credit card or if you want to update your data, you can do so using our LiveChat. […]) (rank 7)||Die Änderung Ihrer Kreditkartendaten ist zu Ihrer Sicherheit nur telefonisch bei der Serviceline unter […] (For security reasons, you can change your credit card data via phone using our service hotline at […] only) (rank 1)|
|Hallo, ich möchte ein iPhone 7 kaufen (Ratenzahlung). Ich hab schon ein Vertrag (bis 09.2017)..wenn ich das verlängern möchte muss ich die Raten von meine altes Handy weiter zahlen? Lg (Hello, I’d like to buy an iPhone 7 (paying by instalments). I have a contract (till 09/2017)..if I want to extend it, do I need to pay the remaining rates for my old phone? Kind regards)||Ratenzahlungen oder Stundungen bei offenen Rechnungsbeträgen bietet T-Mobile NICHT an […] (T-Mobile does NOT offer payment by instalments or deferred payments for outstanding bill amounts.) (not in top-10)||Mit der Umstellung auf LTE hat sich nichts am Geschwindigkeitsprofil inklusive der erreichbaren Maximalgeschwindigkeiten Ihres aktuellen Tarifes geändert. […] (The transition to LTE (4G) does not affect the maximum data transfer rate of your present service plan.) (rank 1)|
2 Related Work
The broad field of QA includes research ranging from retrieval-based Xue et al. (2008); Das et al. (2016); Minaee and Liu (2017); Dhingra et al. (2018) to generative Serban et al. (2015, 2015), as well as from closed-domain Eric and Manning (2017); Oraby et al. (2017) to open-domain QA Serban et al. (2015); Joshi et al. (2017); Rajpurkar et al. (2016); Chen et al. (2017). We focus on the notion of improving an already deployed system.
For QA dialogues based on structured knowledge representations, this can be achieved by maintaining and adapting the knowledge base Sonntag et al. (2007); Sonntag (2009, 2010). In addition, Sonntag (2008) proposes metacognition models for building self-reflective and adaptive AI systems, e.g., dialogue systems, that improve by introspection. Buck et al. present a method for reformulating user questions: their method automatically adapts user queries with the goal of improving the answer selection of an existing QA model Buck et al. (2017).
Other works suggest humans-in-the-loop for improving QA systems. Savenkov and Agichtein use crowdsourcing for re-ranking retrieved answer candidates in a real-time QA framework Savenkov and Agichtein (2016). In Guardian, crowdworkers prepare a dialogue system based on a certain web API and, after deployment, manage actual conversations with users Huang et al. (2015). EVORUS learns to select answers from multiple chatbots via crowdsourcing Huang et al. (2018). The result is a chatbot ensemble that exceeds the performance of each individual chatbot. Williams et al. present a dialogue architecture that continuously learns from user interaction and feedback Williams et al. (2017).
We propose a re-ranking algorithm similar to Savenkov and Agichtein (2016): we train a similarity model using n-gram based features of QA pairs for improving the answer selection of a retrieval-based QA system.
3 Question Answering System
We implement our question answering system using state-of-the-art open source components. Our pipeline is based on the Rasa natural language understanding (NLU) framework Bocklisch et al. (2017), which offers two standard pipelines for text classification: spacy_sklearn and tensorflow_embedding. The main difference is that spacy_sklearn uses Spacy (https://spacy.io/) for feature extraction with pre-trained word embedding models and Scikit-learn Pedregosa et al. (2011) for text classification. In contrast, the tensorflow_embedding pipeline uses TensorFlow Abadi et al. (2015) as machine learning backend. Figure 1 shows the general structure of both pipelines. We train QA models using both pipelines with the pre-defined set of hyper-parameters. For tensorflow_embedding, we additionally monitor changes in system performance using different epoch configurations (see https://rasa.com/docs/nlu/components/#intent-classifier-tensorflow-embedding). Further, we compare the performance of pipelines with and without a spellchecker and investigate whether model training benefits from additional user examples by training models with the three different versions of our training corpus: no additional samples (kw), samples from 1 user (kw+1u) or samples from 2 users (kw+2u) (see section Corpora). All training conditions are summarized in Table 2. Next, we describe the implementation details of our QA system as shown in Figure 1: the spellchecker module, the subsequent pre-processing and feature encoding, and the text classification. We include descriptions for both pipelines.
|parameters||default||default with epochs|
|training corpus||kw, kw+1u, kw+2u|
Spellchecker. We address the problem of frequent spelling mistakes in user queries by implementing an automated spell-checking and correction module. It is based on a Python port (https://github.com/mammothb/symspellpy) of the SymSpell algorithm (https://github.com/wolfgarbe/SymSpell), initialized with word frequencies for German (German 50k: https://github.com/hermitdave/FrequencyWords). We apply the spellchecker as the first component in our pipeline.
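To illustrate the idea behind a frequency-based spell corrector of this kind, the following is a minimal sketch of a symmetric-delete lookup using only the Python standard library. It is not the symspellpy implementation: the function names, the toy frequency list `FREQ`, the maximum edit distance of 1 and the use of plain Levenshtein distance for verification are all assumptions made for illustration.

```python
from collections import defaultdict


def deletes(word, max_dist=1):
    """Return word plus all strings obtained by deleting up to max_dist characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def levenshtein(a, b):
    """Plain Levenshtein edit distance (row-wise dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def build_index(frequencies, max_dist=1):
    """Map every delete-variant of a dictionary word back to that word."""
    index = defaultdict(set)
    for word in frequencies:
        for variant in deletes(word, max_dist):
            index[variant].add(word)
    return index


def correct(token, frequencies, index, max_dist=1):
    """Return the closest, then most frequent, dictionary word, or the token itself."""
    if token in frequencies:
        return token
    candidates = set()
    for variant in deletes(token, max_dist):
        candidates |= index.get(variant, set())
    # Shared delete-variants only suggest candidates; verify the true distance.
    candidates = {c for c in candidates if levenshtein(token, c) <= max_dist}
    if not candidates:
        return token
    return min(candidates, key=lambda c: (levenshtein(token, c), -frequencies[c]))


# Hypothetical toy frequency list; a real list would hold tens of thousands of words.
FREQ = {"tarife": 120, "ohne": 900, "bindung": 40, "vertrag": 300}
INDEX = build_index(FREQ)
corrected = [correct(t, FREQ, INDEX) for t in "tarife ohne bimdung".split()]
```

With this toy dictionary, the misspelled token from Table 1 is mapped to the nearest known word, while tokens without a close dictionary entry are passed through unchanged.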
Pre-Processing and Feature Encoding. The spacy_sklearn pipeline uses Spacy for pre-processing and feature encoding. Pre-processing includes the generation of a Spacy document and tokenization using their German language model de_core_news_sm (v2.0.0). The feature encoding is obtained via the vector function of the Spacy document that returns the mean word embedding of all tokens in a query. For German, Spacy provides only a simple dense encoding of queries (no proper word embedding model).
The pre-processing step of the tensorflow_embedding pipeline uses a simple whitespace tokenizer for token extraction. The tokens are used for the feature encoding step that is based on Scikit-learn’s CountVectorizer. It returns a bag of words histogram with words being the tokens (1-grams).
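As a rough illustration of this encoding, the sketch below builds a bag-of-words count vector from whitespace tokens using only the standard library. It is a simplified stand-in for Scikit-learn's CountVectorizer, not the pipeline's actual code; the helper names, the lowercasing step and the toy queries are assumptions.

```python
from collections import Counter


def whitespace_tokenize(text):
    """Split on whitespace; lowercasing is an assumption of this sketch."""
    return text.lower().split()


def fit_vocabulary(queries):
    """Assign a column index to every token seen in the training queries."""
    vocab = sorted({tok for q in queries for tok in whitespace_tokenize(q)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def encode(query, vocabulary):
    """Return the 1-gram count vector of a query; unknown tokens are dropped."""
    counts = Counter(whitespace_tokenize(query))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical training queries.
train_queries = [
    "tarife ohne bindung",
    "kosten für vertragsübernahme",
    "tarife tarife kosten",
]
vocabulary = fit_vocabulary(train_queries)
vector = encode("Tarife ohne Kosten ohne Handy", vocabulary)
```

Each position of `vector` counts how often one vocabulary token occurs in the query, so word order is discarded and out-of-vocabulary tokens such as "Handy" do not contribute.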
Text Classification. The spacy_sklearn pipeline relies on Scikit-learn for text classification using a support vector classifier (SVC). The model confidences are used for ranking all answer candidates; the top-10 results are returned.
Text classification for tensorflow_embedding is done using TensorFlow with an implementation of the StarSpace algorithm Wu et al. (2017). This component learns (and later applies) one embedding model for user queries and one for the answer id. It minimizes the distance between embeddings of QA training samples. The distances between a query and all answer ids are used for ranking.
In this work, we include two corpora: one for training the baseline system and another for evaluating the performance of the QA pipeline and our re-ranking approach. In the following, we describe the creation of the training corpus and the structure of the test corpus. Both corpora have been anonymised.
Training Corpus. The customer care department provides answers to common user questions. Each answer is tagged with a variable amount of keywords or key-phrases (, ), in total. We asked students to augment the training corpus with, in total, two additional natural example queries. This process can be scaled by crowdsourcing for an application in productive systems that might include more answers or that requires more sample question per answer or both. The full dataset contains, on average, sample queries per answer totalling in queries overall. For model training, all questions (including keywords) are used as input with the corresponding answer as output. We generated three versions of the training corpus: keywords only (kw, ), keywords with samples from 1 user (kw+1u, ) and keywords with samples from 2 users (kw+2u, ).
The performance of the implemented QA system and of our re-ranking approach is assessed using a separate test corpus. It includes real user requests from a chat-log of T-Mobile Austria, each assigned to at most three suitable answers from the training corpus. The assignment was performed manually by domain experts of the wireless network provider. We use this corpus for estimating the baseline performance of the QA pipeline using different pipeline configurations and different versions of the training corpus. In addition, we use the corpus for evaluating our re-ranking approach via cross-validation: we regard the expert annotations as offline human feedback. The queries in this corpus contain many spelling mistakes. We address this in our QA pipeline by implementing a custom spell-checking component.
4 Baseline Performance Evaluation
We evaluate the baseline model using all training configurations in Table 2 to find a well-performing baseline for our re-ranking experiment. We use the evaluation corpus as reference data and report the top-1 to top-10 accuracies and the mean reciprocal rank for the top-10 results (MRR@10) as performance metrics. For computing the top-n accuracy, we count all queries for which the QA pipeline returns a correct answer at ranks 1 to n and divide the result by the number of test queries. The MRR is computed as the mean of reciprocal ranks over all test queries. The reciprocal rank for one query is defined as $RR = 1/\mathrm{rank}$, where rank is the position of the correct answer: the RR is $1$ if the correct answer is ranked first, $1/2$ if it is at the second rank, and so on. We set RR to zero if the answer is not contained in the top-10 results.
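The two metrics can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the evaluation code used for the reported results: the function names and the toy `results` list are hypothetical, and a query with several annotated answers is scored by the best-ranked correct one.

```python
def rank_of_first_correct(ranked_ids, correct_ids, k=10):
    """Return the 1-based rank of the first correct answer within the top k, else None."""
    for rank, answer_id in enumerate(ranked_ids[:k], start=1):
        if answer_id in correct_ids:
            return rank
    return None


def top_n_accuracy(results, n):
    """Fraction of queries with a correct answer at ranks 1 to n."""
    hits = sum(
        1 for ranked, correct in results
        if rank_of_first_correct(ranked, correct, k=n) is not None
    )
    return hits / len(results)


def mrr_at_k(results, k=10):
    """Mean reciprocal rank; a query contributes 0 if no correct answer is in the top k."""
    total = 0.0
    for ranked, correct in results:
        rank = rank_of_first_correct(ranked, correct, k=k)
        total += 1.0 / rank if rank is not None else 0.0
    return total / len(results)


# Hypothetical toy data: (ranked answer ids, set of correct answer ids) per query.
results = [
    (["a1", "a2", "a3"], {"a1"}),          # correct at rank 1, RR = 1
    (["a4", "a5", "a6"], {"a5", "a9"}),    # correct at rank 2, RR = 1/2
    (["a7", "a8", "a2"], {"a3"}),          # not retrieved, RR = 0
    (["a3", "a1", "a6", "a2"], {"a2"}),    # correct at rank 4, RR = 1/4
]
top1 = top_n_accuracy(results, 1)
top10 = top_n_accuracy(results, 10)
mrr = mrr_at_k(results, 10)
```

For the four toy queries, the reciprocal ranks are 1, 1/2, 0 and 1/4, so the MRR@10 is their mean.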
Results. Figure 2 shows the accuracy and MRR values for all conditions. We only restrict tensorflow_embedding to the default number of epochs which is . At the corpus level, we can observe that the accuracy and the MRR increase when training with additional user annotations for all pipeline configurations. For example, the spacy_sklearn pipeline without spell-checking achieves a top-10 accuracy of and a MRR of when using the kw training corpus with keywords only. Both measures increase to and , respectively, when adding two natural queries for training. In some cases, adding only 1 user query results in slightly better scores. However, the overall trend is that more user annotations yield better results.
In addition, we observe performance improvements for pipelines that use our spell-checking component when compared to the default pipelines that do not make use of it: The spacy_sklearn kw+2u condition performs better, the tensorflow_embedding kw+2u condition performs better, in terms of top-10 accuracy. We can observe similar improvements for the majority of included metrics. Similar to the differentiation by corpus, we can find cases where spell-checking reduces the performance for a particular measure, against the overall trend.
Overall, the tensorflow_embedding pipelines perform considerably better than the spacy_sklearn pipeline irrespective of the remaining parameter configuration: the best performing methods are achieved by the tensorflow_embedding pipeline with spell-checking. Figure 3 sheds more light on this particular setting. It provides performance measures for all corpora and for different number of epochs used for model training. Pipelines that use epochs for training range among the best for all corpora. When adding more natural user annotations, using epochs achieves similar or better scores, in particular concerning the top-10 accuracy and the MRR. Re-ranking the top-10 results can only improve the performance in QA, if the correct answer is among the top-10 results. Therefore, we use the tensorflow_embedding pipeline with spellchecking, epochs and the full training corpus as baseline for evaluating the re-ranking approach.
5 Re-Ranking Approach
Our re-ranking approach compares a user query with the top-10 results of the baseline QA system. In contrast to the initial ranking, our re-ranking takes the content of the answer candidates into account instead of encoding the user query only. Our algorithm compares the text of the recent user query to each result. We include the answer text and the confidence value of the baseline system for computing a similarity estimate. Finally, we re-rank the results by their similarity to the query (see Algorithm 1).
We consider a data-driven similarity function that compares linguistic features of the user query and answer candidates and also takes into account the confidence of the baseline QA system. This similarity estimate shall enhance the baseline by using an extended data and feature space, but without neglecting the learned patterns of the baseline system. The possible improvement in top-1 accuracy is limited by the top-10 accuracy of the baseline system (), because our re-ranking cannot choose from the remaining answers. Figure 4 shows how the re-ranking model is connected to the deployed QA system: it requires access to its in- and outputs for the additional ranking step.
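The re-ranking step itself can be sketched as a sort of the baseline's candidates by a similarity score. The snippet below is a minimal illustration, not Algorithm 1 verbatim: `rerank`, `toy_similarity`, its `weight` parameter and the candidate texts are hypothetical, and the hand-written similarity merely stands in for the learned model.

```python
def rerank(query, candidates, similarity):
    """Re-order candidates by similarity to the query.

    candidates: list of (answer_id, answer_text, baseline_confidence) tuples,
    given in the order returned by the baseline system.
    similarity: callable (query, answer_text, confidence) -> float.
    """
    scored = [
        (similarity(query, text, conf), answer_id)
        for answer_id, text, conf in candidates
    ]
    # Python's sort is stable, so ties keep the baseline order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [answer_id for _, answer_id in scored]


def toy_similarity(query, answer_text, confidence, weight=0.5):
    """Hypothetical stand-in: token overlap plus weighted baseline confidence."""
    q = set(query.lower().split())
    a = set(answer_text.lower().split())
    overlap = len(q & a) / len(q) if q else 0.0
    return overlap + weight * confidence


# Hypothetical top-3 output of a baseline system.
candidates = [
    ("a1", "die änderung ihrer kreditkartendaten ist nur telefonisch möglich", 0.6),
    ("a2", "eine zahlung mittels kreditkarte ist bei uns möglich", 0.3),
    ("a3", "eine übersicht über unsere tarife", 0.1),
]
reranked = rerank("zahlung mit kreditkarte", candidates, toy_similarity)
```

Because the function only permutes the candidates it receives, it can move a correct answer up within the list but can never introduce an answer that the baseline did not retrieve.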
We use a gradient boosted regression tree for learning a similarity function for re-ranking, similar to Savenkov and Agichtein (2016). The features for model training are extracted from pre-processed query-answer pairs. Pre-processing includes tokenization and stemming of query and answer and the extraction of uni-, bi- and tri-grams from both token sequences (we use the default word tokenizer, Snowball stemmer and n-gram extraction of the nltk toolkit Bird et al. (2009)). We include three distance metrics as features: the Jaccard distance, the cosine similarity, and the plain number of n-gram matches between the n-grams of a query and an answer (for Jaccard distance and cosine similarity, we use the implementation found in the following Github gist: gaulinmp/similarity_example.ipynb).
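A feature vector of this kind can be sketched with the standard library alone. The code below is an assumption-laden illustration rather than the implementation referenced in the text: tokenization and stemming are assumed to have happened already, matches are counted over distinct n-grams, and appending the baseline confidence as a final feature is one possible way to include it.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over distinct n-grams; 1.0 if both are empty."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 1.0
    return 1.0 - len(sa & sb) / len(union)


def cosine_similarity(a, b):
    """Cosine of the angle between the n-gram count vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def match_count(a, b):
    """Number of distinct n-grams shared by a and b."""
    return len(set(a) & set(b))


def pair_features(query_tokens, answer_tokens, confidence):
    """Three metrics for each of uni-, bi- and tri-grams, plus the baseline confidence."""
    features = []
    for n in (1, 2, 3):
        q, a = ngrams(query_tokens, n), ngrams(answer_tokens, n)
        features += [jaccard_distance(q, a), cosine_similarity(q, a), match_count(q, a)]
    return features + [confidence]


# Hypothetical pre-processed token sequences.
features = pair_features(
    ["kosten", "für", "vertrag"],
    ["kosten", "für", "vertrag", "übergeben"],
    confidence=0.4,
)
```

The resulting ten-dimensional vector could then be passed to any regression model; short queries yield empty bi- or tri-gram lists, which this sketch maps to maximal distance and zero similarity.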
6 Re-Ranking Performance Evaluation
We compare our data-driven QA system with a version that re-ranks the resulting top-10 candidates using the additional ranking model. We want to answer the question of whether our re-ranking approach can improve the performance of the baseline QA pipeline after deployment. For that, we use the evaluation corpus () for training and evaluating our re-ranking method using 10-fold cross-validation, i.e., 90% of the data is used for training and 10% for testing, with different train-test splits.
The training and testing procedure per data split of the cross-validation is shown in Algorithm 2. For each sample query in the train set , we include the correct answer and one randomly selected negative answer candidate for a balanced model training. We skip a sample, if the correct answer is not contained in the top-10 results: we include of the data (see top-10 accuracy of the baseline QA model in Figure 3). The baseline QA model and the trained re-ranking method are applied to all sample queries in the test set . Considered performance metrics are computed using the re-ranked top-10 . We repeat the cross-validation times to reduce effects introduced by the random selection of negative samples. We report the average metrics from cross-validation folds and the repetitions of the evaluation procedure.
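The construction of balanced training samples described above might look as follows. This is a hedged sketch, not Algorithm 2 itself: the function name, the data layout and the choice to draw the negative from the baseline's top-10 list are assumptions.

```python
import random


def build_training_pairs(train_queries, baseline_top10, rng):
    """Create one positive and one random negative sample per usable query.

    train_queries: list of (query, set_of_correct_answer_ids).
    baseline_top10: callable mapping a query to a list of
        (answer_id, confidence) tuples, best first.
    rng: random.Random instance used to pick the negative sample.
    Returns a list of (query, (answer_id, confidence), label) triples.
    """
    pairs = []
    for query, correct_ids in train_queries:
        top10 = baseline_top10(query)
        positives = [c for c in top10 if c[0] in correct_ids]
        negatives = [c for c in top10 if c[0] not in correct_ids]
        if not positives or not negatives:
            # Skip the query: no correct answer in the top-10 (or nothing to contrast).
            continue
        pairs.append((query, positives[0], 1))
        pairs.append((query, rng.choice(negatives), 0))
    return pairs


# Hypothetical baseline output for two queries.
toy_rankings = {
    "q1": [("a1", 0.9), ("a2", 0.1)],
    "q2": [("a3", 0.8), ("a4", 0.2)],
}
train = [("q1", {"a2"}), ("q2", {"a9"})]
pairs = build_training_pairs(train, lambda q: toy_rankings[q], random.Random(0))
```

In the toy data, the second query is skipped because its annotated answer is missing from the baseline ranking, which mirrors how the usable share of the data is bounded by the baseline's top-10 accuracy.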
Results. The averaged cross-validation results of our evaluation, in terms of top-n accuracies and the MRR@10, are shown in Table 3: the top-1 to top-9 accuracies improve consistently. The relative improvement decreases from for the top-1 accuracy to for the top-9 accuracy. The top-10 accuracy stays constant, because the re-ranking cannot choose from outside the top-10 candidates. The MRR improves from to ().
Our results indicate that the accuracy of the described QA system benefits from our re-ranking approach. Hence, it can be applied to improve the performance of already deployed QA systems that provide a top-10 ranking with confidences as output. However, the performance gain is small, which might have several reasons. For example, we did not integrate spell-checking in our re-ranking method, which proved to be effective in our baseline evaluation. Further, the re-ranking model is based on very simple features. It would be interesting to investigate the impact of more advanced features, or models, on the ranking performance (e.g., word embeddings Mikolov et al. (2013) and deep neural networks for learning similarity functions Das et al. (2016); Minaee and Liu (2017)). Nevertheless, as can be seen in examples 1, 2 and 4 in Table 1, high-ranked but incorrect answers are often meaningful with respect to the query: the setting in our evaluation is overcritical, because we count incorrect but meaningful answers as negative results. A major limitation is that the re-ranking algorithm cannot choose answer candidates beyond the top-10 results. It would be interesting to classify whether an answer is present in the top-10 or not. If not, the algorithm could search outside the top-10 results. Such a meta-model can also be used to estimate weaknesses of the QA model: it can determine topics that regularly fail, for instance, to guide data labelling for a targeted improvement of the model, also known as active learning Settles (2010), and in combination with techniques from semi-supervised learning Dhingra et al. (2018); Chang et al. (2016).
Data labelling and incremental model improvement can be scaled by crowdsourcing. Examples include the parallel supervision of re-ranking results and targeted model improvement by human oracles in an active learning setting. Results from crowd-supervised re-ranking allow us to train improved re-ranking models Savenkov and Agichtein (2016); Huang et al. (2018), but also a meta-model that detects queries which are prone to error. The logs of a deployed chatbot, which contain actual user queries, can be efficiently analysed using such a meta-model to guide the sample selection for costly human data augmentation and creation. An example of a crowdsourcing approach with search logs that could be applied to our QA system and data can be found in Bernstein et al. (2012).
We implemented a simple re-ranking method and showed that it can effectively improve the performance of QA systems after deployment. Our approach includes the top-10 answer candidates and confidences of the initial ranking for selecting better answers. Promising directions for future work include the investigation of more advanced ranking approaches for increasing the performance gain and continuous improvements through crowdsourcing and active learning.
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- (2012) Direct answers for search queries in the long tail. In Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems - CHI ’12, New York, New York, USA, pp. 237.
- (2009) Natural language processing with python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
- (2017-12) Rasa: Open Source Language Understanding and Dialogue Management.
- (2017) Ask the Right Questions: Active Question Reformulation with Reinforcement Learning. CoRR abs/1705.0.
- (2016) Alloy: Clustering with Crowds and Computation. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems - CHI ’16, New York, New York, USA, pp. 3180–3191.
- (2017) Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870–1879.
- (2016) Together we stand: Siamese Networks for Similar Question Retrieval. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, Stroudsburg, PA, USA, pp. 378–387.
- (2018) Simple and Effective Semi-Supervised Question Answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 582–587.
- (2017-08) Key-Value Retrieval Networks for Task-Oriented Dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 37–49.
- (2018-01) Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time.
- (2015) Guardian: A Crowd-Powered Spoken Dialog System for Web APIs. In Third AAAI Conference on Human Computation and Crowdsourcing.
- (2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Volume 1: Long Papers, R. Barzilay and M. Kan (Eds.), pp. 1601–1611.
- (2013-01) Efficient Estimation of Word Representations in Vector Space.
- (2017-08) Automatic Question-Answering Using A Deep Similarity Neural Network.
- (2017) “How May I Help You?”: Modeling Twitter Customer Service Conversations Using Fine-Grained Dialogue Acts. In Proceedings of the 22nd International Conference on Intelligent User Interfaces - IUI ’17, New York, New York, USA, pp. 343–355.
- (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392.
- (2016) CRQA: Crowd-Powered Real-Time Automatic Question Answering System. Fourth AAAI Conference on Human Computation and Crowdsourcing.
- (2015-07) Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models.
- (2015-12) A Survey of Available Corpora for Building Data-Driven Dialogue Systems.
- (2010) Active learning literature survey. University of Wisconsin, Madison 52 (55-66), pp. 11.
- (2007) SmartWeb handheld - multimodal interaction with ontological knowledge bases and semantic web services. In Artifical Intelligence for Human Computing, ICMI 2006 and IJCAI 2007 International Workshops, Banff, Canada, November 3, 2006, Hyderabad, India, January 6, 2007, Revised Seleced and Invited Papers, pp. 272–295.
- (2008-07) On Introspection, Metacognitive Control and Augmented Data Mining Live Cycles.
- (2009) Introspection and adaptable model integration for dialogue-based question answering. In IJCAI, pp. 1549–1554.
- (2010) Ontologies and Adaptivity in Dialogue for Question Answering. First edition, Studies on the Semantic Web, Vol. 4, AKA and IOS Press, Heidelberg.
- (2017) Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, Stroudsburg, PA, USA, pp. 665–677.
- (2017) StarSpace: embed all the things!. CoRR abs/1709.03856.
- (2008) Retrieval models for question and answer archives. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’08, New York, New York, USA, pp. 475.