This paper presents our system, Spoke, for storing and searching Knowledge Base (KB) articles for different organizations. Spoke is available as a SaaS (Software-as-a-Service) product that can be used by any organization for documenting and searching over their internal workplace articles. We start by discussing salient aspects of the problem of KB management as a SaaS product (KB SaaS).
1.1 Knowledge Base Search as a Service
Each organization using Spoke (www.askspoke.com) for KB management creates a private corpus containing articles that are available only to the users from that organization. A user can query Spoke with their workplace queries, and the goal of Spoke is to respond with the right article if such an article already exists inside the KB. Table 1 shows four common domains, sample questions from these domains, and titles of KB articles (body of the article omitted for brevity) that answer these questions.
|KB Domains||Sample Knowledge-seeking Questions||Sample KB titles|
|Information Technology (IT)||“How do I get on the VPN?” “My macbook froze. Help!”||“Connecting to the VPN” “Troubleshooting Macbook”|
|Human Resources (HR)||“Do we support 401k?” “Where is our recruiting rubric?”||“Retirement benefits” “Hiring guidelines”|
|Sales||“Maximum amount I can spend on a client dinner” “Where can I find the Q4 sales numbers?”||“Client Dinner Expenses” “Sales Dashboards”|
|Marketing||“Where is our brand logo?”||“Brand assets”|
Indexing and searching over documents has been studied extensively in the information retrieval literature [Manning et al.2008, Harmon1996]. However, searching over an internal KB is uniquely challenging when compared to the web search and document retrieval tasks studied in academia [Robertson et al.1996], for the following reasons:
Dynamic KB: Real-world KBs are dynamic. During the lifetime of a KB deployment, new KB articles are created and existing articles may be modified; old articles become less relevant or entirely outdated as new articles are created. E.g. an article on Sales Process Outline created in 2017 will become outdated in 2019 as the sales process changes. This poses a challenge for ML-based search systems, which must be designed to quickly unlearn old behavior when new, conflicting information arrives.
Siloed Datasets: The KBs from different organizations are siloed, so information and signals across them cannot be directly combined to train an ML system. This is unlike web search [Yin et al.2016], where millions of query-URL pairs are available.
Article Type: Articles inside an organization can belong to arbitrary domains, each having its own semantics and jargon. E.g. articles from IT domains often contain names of internal servers or printers that are not a part of common knowledge. It is not scalable to inject knowledge and semantics specific to each domain into the search system. Furthermore, articles can take various forms which may not be easily indexed: files (PDFs, Microsoft Word documents, etc.), images, hyperlinks, etc.
Scalable ML: Spoke is deployed in thousands of separate organizations, so it is not feasible to train a separate ML-based search model for each deployment.
1.1.1 Limitations of Internal KB Search over Web Search
In the last few decades, the web search experience has improved significantly using multiple signals, e.g. graph-based signals [Kleinberg1999, Brin and Page1998], anchor text [Chakrabarti et al.1999], web click mining [Joachims et al.2005], etc. However, these signals are not available for internal KB search, which has been restricted to term-match based features. While neural network-based approaches [Gysel et al.2018, Chakravarti et al.2017, Bai et al.2010] show great promise at learning relevance and semantics, they require a large amount of in-domain data, which is not available in the KB SaaS setup because the datasets are siloed. To this end, we have designed our system Spoke to extract more signal by using user interactions.
1.1.2 Feedback-driven search experience in Spoke
We show how organizations typically use Spoke and illustrate the feedback-driven search process. Organizations usually have a few experts who are in charge of helping end users with their questions. These experts are also responsible for creating KB articles in Spoke to answer user questions. Each KB article in Spoke has four user-supplied fields that can be indexed: title, body, keywords, and link. Users issue their questions via conversational media like chat. Spoke responds with one answer, or with no answer when it is not confident in the relevance of any article. If a user expresses unhappiness with the result, Spoke reaches out to the experts, who can then respond to the query either by creating a new KB article (recognizing an information gap) or by responding with an existing KB article that Spoke missed. This process is illustrated in Figure 1. Since predicting user happiness is not the focus of this work, we predict it with a simple system based on regular expressions.
This paper discusses the design of our KB system Spoke, which addresses the challenges of KB SaaS discussed in Sec. 1.1. Our paper makes the following contributions:
We support real-time online learning to rank, i.e. Spoke learns from user and expert feedback in real time.
We use a novel trick to change the scoring function which allows unlearning of old information using a constant amount of new user feedback. This allows the KB search to evolve as old KB articles are updated or deleted and new KB articles are added.
We present a relevance scoring function that explicitly models high-dimensional lexical features (e.g. raw words) in a kernelized form using query similarity functions.
We show that our adaptive system outperforms a strong L2R baseline by up to 41% in offline experiments. Our system is deployed for hundreds of organizations and is continually getting better at returning relevant results.
2 Relevance Scoring with Online L2R
In this section, we will show how we design a relevance scoring function for KB SaaS addressing the challenges listed in Sec. 1.1. We also present an algorithmic overview of KB management in Spoke.
2.1 Formal Problem Definition
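Let $q$ be a query and let $d$ be a KB article. As detailed above, we allow users to provide positive or negative feedback $y \in \{+1, -1\}$ for a query $q$ and document $d$. Let us assume that at time $t$ we get feedback $y_t$ for query $q_t$ and document $d_t$. For document $d$, let us define all the positive feedback queries as:
$$Q_d^{+} = \{\, q_t : d_t = d,\ y_t = +1 \,\} \tag{1}$$
Similarly define $Q_d^{-}$ for negative feedback. Let $Q = \{(Q_d^{+}, Q_d^{-})\}_d$ denote the feedback collected over all documents. Our goal is to design a system that can learn from $Q$ to improve relevance scoring for the organization.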
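As an illustration of the bookkeeping defined above, the following minimal Python sketch groups a stream of feedback events into per-article collections of positively and negatively rated queries. The `Feedback` record, the `group_feedback` helper, and the article id `kb-vpn` are hypothetical names chosen for this example; they are not taken from the Spoke implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Feedback(NamedTuple):
    query: str
    doc_id: str
    label: int  # +1 for positive feedback, -1 for negative feedback


def group_feedback(events):
    """Collect, per article, the queries that received positive / negative feedback."""
    positive = defaultdict(list)
    negative = defaultdict(list)
    for ev in events:
        target = positive if ev.label > 0 else negative
        target[ev.doc_id].append(ev.query)
    return positive, negative


# Hypothetical feedback stream for a single article.
events = [
    Feedback("how do I get on the vpn", "kb-vpn", +1),
    Feedback("vpn not connecting", "kb-vpn", +1),
    Feedback("do we support 401k", "kb-vpn", -1),
]
positive, negative = group_feedback(events)
```

Keeping the two collections separate per article is what later allows positive and negative evidence to be weighted differently.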
2.2 Scoring with Pairwise and Lexical Scores
As a first step, we model the relevance score as a sum of a pairwise match-based score and a lexicalized score:
$$s(q, d) = s_{\text{static}}\big(\phi(q, d)\big) + s_{\text{adapt}}\big(\psi(q), d\big) \tag{2}$$
for appropriate functions $s_{\text{static}}$ and $s_{\text{adapt}}$ and features $\phi$ and $\psi$, which are defined below.
Pairwise match features, $\phi(q, d)$: These features compute the match between query $q$ and document $d$ using different textual match-based feature extractors, e.g. term-based similarity like BM25 [Robertson et al.1996], semantic similarity like a Word2Vec-based [Mikolov et al.2013] dot product, synonym match, etc. The features $\phi(q, d)$ are computed by applying these feature extractors over the query and different textual fields of $d$, like title and body. We use around 50 match features in our system (see Sec. 4.3).
Lexical features, $\psi(q)$: These are raw words or word-based features (e.g. embeddings) that are extracted from queries in the training corpus and are associated with documents $d$. These features allow us to extract associations of specific query words with documents. For example, a document about Tax Forms may have words like W2, 1099, IRS associated with it as lexical features. Lexical features have large dimensionality and hence are vastly more expressive than the match features, and they are crucial to expressing the semantics of the domain in our scoring function.
Furthermore, we use the pairwise match features in a static scoring function $s_{\text{static}}$ that is fixed for all articles (and all organizations using Spoke), and the lexical features in an adaptive scoring function $s_{\text{adapt}}$ that is trained from query feedback for each organization separately. The advantage of this approach is that we can create $s_{\text{static}}$ offline from pre-labeled training examples using state-of-the-art Learning-To-Rank (L2R) techniques [Joachims2002, Burges2010] with only a few hundred examples, while still customizing the overall score for each organization. We will compare our adaptive algorithm to a purely static baseline in Sec. 4. Next we describe how we create the adaptive part of the scoring function.
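To make the split between the static and the adaptive part concrete, here is a minimal Python sketch that assumes a linear static scorer over a handful of match features. The weights, feature values, and function names are hypothetical and chosen only for illustration; the real system uses around 50 match features.

```python
from typing import Sequence


def static_score(weights: Sequence[float], match_features: Sequence[float]) -> float:
    """Linear static scorer, shared by all articles and all organizations."""
    return sum(w * x for w, x in zip(weights, match_features))


def relevance_score(weights, match_features, adaptive_score: float) -> float:
    """Total relevance = static match score + per-organization adaptive score."""
    return static_score(weights, match_features) + adaptive_score


# Hypothetical weights and features (e.g. a BM25 title match, an embedding
# dot product, and a synonym overlap count).
weights = [0.5, 1.0, 0.25]
features = [2.0, 0.5, 4.0]

fresh = relevance_score(weights, features, adaptive_score=0.0)    # no feedback yet
adapted = relevance_score(weights, features, adaptive_score=0.5)  # after some feedback
```

A newly created article starts from the static score alone, and feedback only ever shifts it through the adaptive term.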
2.3 Adapt Lexical Match from Feedback
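The query feedback-based score of each document $d$ is expressed using lexical features $\psi(q)$ as
$$s_{\text{adapt}}\big(\psi(q), d\big) = w_d^{\top}\, \psi(q) \tag{3}$$
Letting $W = \{w_d\}_d$ be the set of parameters over all documents, $W$ can be trained by empirical risk minimization (ERM) over examples $(q_t, d_t, y_t)$:
$$W^{*} = \arg\min_{W} \sum_{t} \ell\big(y_t,\ w_{d_t}^{\top}\psi(q_t)\big) + R(W) \tag{4}$$
where $\ell$ is a loss function and $R$ is a regularizer (e.g. the $\ell_2$ norm). This setup of directly training weights over lexical or word-based features is exemplified by [Radlinski and Joachims2005, Bai et al.2010].
However, rather than representing parameters over lexical features, we express the score in a dual kernelized [Lodhi et al.2002] form (in the SVM literature, often referred to as the Kernel Trick):
$$s_{\text{adapt}}(q, d) = \lambda^{+} A\big(\{\alpha_{q', d}\, K(q, q') : q' \in Q_d^{+}\}\big) - \lambda^{-} A\big(\{\alpha_{q', d}\, K(q, q') : q' \in Q_d^{-}\}\big) \tag{5}$$
where $K$ is an appropriate kernel function representing similarity between two queries, $\alpha_{q', d}$ is the weight of a past feedback query $q'$ for document $d$ (positive feedback queries form $Q_d^{+}$, negative ones $Q_d^{-}$), $A$ is a function that aggregates the query similarity scores, and $\lambda^{+}$ and $\lambda^{-}$ are constants. We justify this choice in Sec. 2.5, where we show that the kernelized representation in Eq. 5 is as expressive as the primal featurized representation in Eq. 3 when $A$ is the $\mathrm{sum}$ function, and that it provides several practical advantages in addition.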
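The kernelized form can be sketched in a few lines of Python. This is a hedged illustration rather than the production scorer: the function and parameter names (`adaptive_score`, `lam_pos`, `lam_neg`) are assumptions, the negative evidence is assumed to be subtracted, and a toy word-overlap (Jaccard) similarity stands in for the real query kernel.

```python
def adaptive_score(query, pos, neg, kernel, aggregate=sum, lam_pos=1.0, lam_neg=1.0):
    """Dual-form adaptive score for one article.

    pos / neg map stored feedback queries to their weights. Each stored query
    contributes weight * similarity(query, stored query); the contributions are
    aggregated separately for positive and negative feedback.
    """
    pos_terms = [w * kernel(query, q) for q, w in pos.items()]
    neg_terms = [w * kernel(query, q) for q, w in neg.items()]
    return lam_pos * aggregate(pos_terms) - lam_neg * aggregate(neg_terms)


def jaccard(a: str, b: str) -> float:
    """Toy query similarity: overlap of word sets (a stand-in for illustration)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical stored feedback for one article.
pos = {"vpn access": 1.0}
neg = {"printer access": 1.0}
score = adaptive_score("vpn access help", pos, neg, kernel=jaccard)
```

Note that no per-word weights are stored anywhere: the article only remembers past queries and their weights, and the similarity function does the rest at query time.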
2.4 Choosing Parameters for Adaptive Scoring
In this section, we show how we select the key parameters of the adaptive score $s_{\text{adapt}}$: the query similarity kernel $K$, the query score aggregator function $A$, and the weights $\alpha$ of past queries in Eq. 6. The hyperparameters $\lambda^{+}$ and $\lambda^{-}$ are tuned on development data.
2.4.1 Choosing the Query Similarity Function
The query similarity function $K$ computes how similar two queries are in their intent. It is a fixed function and is constant across all Spoke deployments. This function can be a kernel function like cosine similarity over Bag-of-Words (BOW), but it can also be a more powerful learned function like a neural network [Bogdanova et al.2015]. In our setup, we pick a simple yet expressive function: cosine similarity with TFIDF representations over unigrams and bigrams.
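A minimal, dependency-free Python sketch of such a query similarity function follows. It assumes whitespace tokenization and a smoothed IDF (one common variant); neither detail is specified in the paper, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(text):
    """Unigrams plus bigrams of a lower-cased, whitespace-tokenized query."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def fit_idf(corpus):
    """Smoothed inverse document frequency over a list of queries."""
    n = len(corpus)
    df = Counter(term for text in corpus for term in set(ngrams(text)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen when fitting the IDF are dropped."""
    tf = Counter(ngrams(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def query_similarity(q1, q2, idf):
    return cosine(tfidf(q1, idf), tfidf(q2, idf))


# Hypothetical past queries used to fit the IDF weights.
corpus = ["how do i get on the vpn", "vpn is not connecting", "where is our brand logo"]
idf = fit_idf(corpus)
related = query_similarity("get on the vpn", "how do i get on the vpn", idf)
unrelated = query_similarity("get on the vpn", "where is our brand logo", idf)
```

Because the TF-IDF weights are non-negative, the similarity always lies between 0 and 1, which matters for the boundedness argument made for the aggregation function.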
2.4.2 Choosing the Aggregation Function
As mentioned in Sec. 2.3 (and described in further detail in Sec. 2.5), if we want to mimic empirical risk minimization, we can choose the aggregator $A$ to be the $\mathrm{sum}$ function. However, from a practical KB design perspective, the function $A$ must satisfy certain constraints, which we discuss below.
Monotonicity: $A$ should monotonically increase with larger values, to ensure that replacing less similar queries with more similar ones increases the score:
$$A(S \cup \{x'\}) \ge A(S \cup \{x\}) \quad \text{for } x' \ge x.$$
Increasing: $A$ should monotonically increase with more positive values, to ensure that adding an irrelevant query does not dilute the total query similarity output:
$$A(S \cup \{x\}) \ge A(S) \quad \text{for } x \ge 0.$$
Bounded Magnitude: Assuming the kernel $K$ is bounded, $A$ must have a bounded magnitude:
$$|A(S)| \le C$$
for some constant $C$. This somewhat surprising constraint is motivated by practical KB maintenance concerns. As discussed in Sec. 1.1, KBs change dynamically over time, as organizations add new KB articles to their Spoke deployment and as old articles get less relevant. We want Spoke to be able to learn to give a higher score to new articles than to old (potentially outdated) articles with a bounded number of mistakes (i.e. pieces of feedback). Eq. 2 shows that each article initially has a fixed static score, to which the adaptive score is added over time. Thus we can guarantee that new articles can get a higher score than old articles with bounded mistakes only if the magnitude of the adaptive score $s_{\text{adapt}}$ is bounded. Boundedness of $s_{\text{adapt}}$ can be guaranteed (Eq. 5) only if $A$ is also bounded.
The boundedness constraint rules out the $\mathrm{sum}$ function (suggested by the ERM approach), since $\mathrm{sum}$ can grow unboundedly. Another reasonable aggregator, the average, does not satisfy constraint 2. We propose to use $\mathrm{TopK}$, which computes the sum of the $k$ highest values from a set: for each new query $q$, we sum the scores of the $k$ most similar queries from the past positive (negative) queries $Q_d^{+}$ ($Q_d^{-}$). This function is bounded by $k$ times the bound on its inputs, and it satisfies all of the above constraints.
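The difference between the three candidate aggregators can be checked with a short Python sketch. The helper names and the sample similarity values are hypothetical; the sketch assumes non-negative similarity scores, as produced by cosine similarity over TF-IDF vectors.

```python
import heapq


def top_k_sum(values, k):
    """Sum of the k largest values (all of them if fewer than k are available)."""
    return sum(heapq.nlargest(k, values))


def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Similarities of a new query to two stored positive queries.
sims = [0.75, 0.5]
# The same set after an unrelated query (similarity 0.0) is stored as well.
sims_with_irrelevant = sims + [0.0]

# Constraint 2: the average is diluted by the irrelevant query, top-k sum is not.
avg_before, avg_after = average(sims), average(sims_with_irrelevant)
topk_before, topk_after = top_k_sum(sims, 3), top_k_sum(sims_with_irrelevant, 3)

# Constraint 3: a plain sum grows with the number of stored queries,
# while the top-k sum stays below k times the largest possible similarity.
many = [0.5] * 100
plain_sum, bounded_sum = sum(many), top_k_sum(many, 3)
```

With 100 stored queries the plain sum reaches 50.0 while the top-3 sum stays at 1.5, so a bounded amount of feedback on a newer article is enough to overtake it.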
2.4.3 Training weights of queries
For implementing the function in Eq. 6, we store the queries from user and expert feedback and train their weights using an online learning algorithm. We choose online learning over the more common batch training [Guo et al.2016, Gysel et al.2018, Bai et al.2010] because learning instantaneously is important for our product to show its utility and win user trust. For scalability, we store at most a fixed number of positive and negative queries for each article. We adopt a version of the Perceptron-style additive update algorithm implemented in the dual space, as described by [Shalev-Shwartz and Singer2007], which amounts to constant updates to query weights (initialized to zero). However, we make the update weight larger for expert feedback ($\delta_e$) than for user feedback ($\delta_u$), i.e. $\delta_e > \delta_u$, based on the practical insight that the relevance opinion of experts is more valuable than the opinion of a user.
Algorithm 1 presents the overall strategy for handling the various events in Spoke: KB creation, searching, and feedback. Note that $\mathbb{1}[\cdot]$ represents the indicator function.
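The weight update described above can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not pin down: the class name `ArticleMemory`, the step sizes `expert_step` and `user_step`, the cap `max_queries`, and the oldest-first eviction policy are all hypothetical, and the update is applied on every piece of feedback.

```python
class ArticleMemory:
    """Per-article store of feedback queries and their dual weights."""

    def __init__(self, max_queries=50):
        self.max_queries = max_queries  # hypothetical cap per feedback type
        self.positive = {}              # query -> weight (starts at zero)
        self.negative = {}

    def update(self, query, label, from_expert, expert_step=2.0, user_step=1.0):
        """Additive, constant-size update; experts move the weight further."""
        step = expert_step if from_expert else user_step
        store = self.positive if label > 0 else self.negative
        store[query] = store.get(query, 0.0) + step
        if len(store) > self.max_queries:
            # Assumed eviction policy: drop the oldest stored query
            # (dicts preserve insertion order).
            del store[next(iter(store))]


memory = ArticleMemory(max_queries=2)
memory.update("vpn setup", +1, from_expert=True)
memory.update("vpn setup", +1, from_expert=False)   # same query: weights add up
memory.update("printer jam", -1, from_expert=True)
```

Because each update touches a single stored query, the cost of learning from one piece of feedback is constant, which is what makes real-time adaptation practical.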
2.5 Why Choose the Kernelized Form?
2.5.1 Expressiveness of the kernelized approach:
The kernelized scoring function in Eq. 5 subsumes the score in Eq. 3 for common feature functions $\psi$ with an appropriate choice of the kernel $K$ and the constants. The reasoning follows from the celebrated Representer Theorem over Reproducing Kernel Hilbert Spaces (RKHS) [Hofmann et al.2008] (a generalization of the kernel trick). Using this theorem, we can assert that having $\psi$ as normalized BOW features in Eq. 3 is equivalent to setting the kernel $K$ to CosineSim, the aggregator $A$ to the $\mathrm{sum}$ function, and the query weights $\alpha$ and constants $\lambda^{+}$ and $\lambda^{-}$ appropriately in Eq. 5.
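The equivalence can be spelled out in a short derivation. The notation below is an assumption made for this sketch: $w_d$ is the primal weight vector of document $d$, $\psi(q)$ is a unit-norm BOW vector, $Q_d = Q_d^{+} \cup Q_d^{-}$ is the set of stored feedback queries for $d$, $y_{q'} \in \{+1, -1\}$ is the feedback label of a stored query $q'$, and $\alpha_{q',d} \ge 0$ is its weight.

```latex
% Assumption: w_d lies in the span of the stored feedback queries, as the
% Representer Theorem guarantees for regularized ERM and as holds by
% construction for additive Perceptron-style updates starting from zero.
\begin{align*}
w_d &= \sum_{q' \in Q_d} \alpha_{q',d}\, y_{q'}\, \psi(q') \\
w_d^{\top}\psi(q)
    &= \sum_{q' \in Q_d} \alpha_{q',d}\, y_{q'}\, \psi(q')^{\top}\psi(q) \\
    &= \sum_{q' \in Q_d^{+}} \alpha_{q',d}\, \mathrm{CosineSim}(q, q')
     \;-\; \sum_{q' \in Q_d^{-}} \alpha_{q',d}\, \mathrm{CosineSim}(q, q')
\end{align*}
% The last step uses \|\psi(q)\| = \|\psi(q')\| = 1, so the dot product of two
% feature vectors equals their cosine similarity.
```

The last line is the kernelized score with cosine similarity as the kernel, the sum as the aggregator, and both constants set to one.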
2.5.2 Practical Advantages of the kernelized approach:
Choosing the kernelized form for implementing the adaptive score $s_{\text{adapt}}$ using query similarity, rather than explicitly storing query features with documents, confers three practical advantages.
It gives an explicit handle over the influence of past queries (and the relative score of old documents vs new documents) by controlling the aggregation function $A$. As detailed in Section 2.4.2, we choose $\mathrm{TopK}$ instead of the $\mathrm{sum}$.
It allows faster deployment of new query features. E.g. consider the scenario where we change the lexical features by adding Word2Vec features to the existing BOW features. In the kernelized approach, we can achieve this without any retraining, simply by replacing the old query similarity function $K$ with a new query similarity function $K'$. In the explicitly featurized approach, we would have to retrain and replace the adaptive models for all client organizations.
3 Related Work
To the best of our knowledge, all the published algorithms on online L2R assume the ability to inject random noise in the results for exploration (e.g. when using Thompson Sampling). In our deployment, we do not have the liberty to do this, since our users see only one result and are quite sensitive to the quality of our results.
Our work is also related to lexicalized approaches to search, which learn a large number of features explicitly based on query terms [Bai et al.2010]. Many recent neural approaches to IR also take terms into account using their embeddings [Gysel et al.2018, Chakravarti et al.2017]. Our work is also closely related to a plethora of question answering work in NLP [Chen et al.2017], some of which uses paraphrases [Fader et al.2013] for question answering, which is akin to our notion of query similarity. However, there is some evidence [Guo et al.2016] to suggest that retrieval should be treated differently from question answering. [Bogdanova et al.2015] adapt the notion of query similarity towards semantically equivalent questions rather than actual paraphrases. One of the works that comes closest to ours is [Radlinski and Joachims2005], who create chains of related queries from web logs and extract lexical features from queries. Readers should also refer to [Avula et al.] for an analysis of using conversational platforms for search. The key features that set our KB SaaS setup and algorithm apart from the related work are that we 1) use online learning, since each organization uses Spoke differently, 2) do not inject noise in our predictions, and 3) allow for the possibility of the KB changing during a deployment.
4 Experiments and Metrics
In this section, we present experiments using our online L2R approach. We describe the datasets and baselines, and show that our online L2R strategy with adaptive learning outperforms competitive baselines.
|Client Id||domain(s)||#KB||#q||avg q per KB|
|1||HR, IT, Safety||10||13||1.4|
|2||IT, HR, Finance, Office||135||192||1.43|
|4||HR, Design, Facilities||58||185||3.21|
|7||HR, Marketing, IT, Office||30||62||2.07|
|8||Business Ops, Legal, HR||206||1352||6.57|
|9||IT, HR, Ops, Product, Customer Support||39||100||2.62|
|10||Marketing, Sales, Product, Data Analysis||38||51||1.42|
|12||Legal, Product, HR, Ops, IT, Education, Engineering||102||385||3.78|
|ClientId||BM25||Static L2R (artificial + client)||Our Algorithm (static + adapt)||F1 %|
4.1 Datasets for Offline Training and Evaluation
We use two kinds of datasets for our experiments.
To train the parameters for the static pairwise match model $s_{\text{static}}$, we (the authors of this paper) created an artificial dataset containing 364 questions matched to 83 KB articles. For each question, there is a single unique KB answer (similar to those shown in Table 1).
We obtain data from real-world feedback from 12 of our clients. These clients were chosen due to their high level of product engagement and the diversity of use cases they cover. Each dataset contains a stream of events generated as a result of users from that organization naturally interacting with an older version of our system. Each event in the stream is timestamped and has one of the following types:
KB creation or update: a KB article is created or updated.
KB deletion: a KB article is deleted.
Query search and feedback: a tuple $(q, d, y)$ where a user searches with a query $q$, our system responds with an article $d$, and the user gives feedback $y$.
Expert feedback: a tuple $(q, d)$ for cases where our system could not answer query $q$ and a domain expert responds with article $d$.
We discard all tuples with negative feedback for offline training and evaluation as they do not provide the ground truth. Table 2 shows information and statistics of our client datasets.
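The event stream and the filtering step can be represented as in the following Python sketch. The `Event` record, its field names, and the string labels for the four event types are hypothetical choices for this illustration, not the schema of the actual logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    timestamp: float
    kind: str                    # "kb_upsert", "kb_delete", "search_feedback", "expert_feedback"
    doc_id: str
    query: Optional[str] = None  # set for the two feedback event types
    label: Optional[int] = None  # +1 / -1, set for "search_feedback" only


def prepare_stream(events):
    """Drop search events with negative feedback and order the rest by time."""
    kept = [e for e in events if not (e.kind == "search_feedback" and e.label == -1)]
    return sorted(kept, key=lambda e: e.timestamp)


# Hypothetical, deliberately unordered stream.
raw = [
    Event(3.0, "search_feedback", "kb-vpn", query="vpn help", label=-1),
    Event(2.0, "search_feedback", "kb-vpn", query="get on vpn", label=+1),
    Event(1.0, "kb_upsert", "kb-vpn"),
    Event(4.0, "expert_feedback", "kb-vpn", query="vpn help"),
]
stream = prepare_stream(raw)
```

Replaying the prepared stream in timestamp order is what lets an offline run mimic the order in which a deployment actually saw articles and feedback.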
|LemmaComparison||Compare lemma in query and text|
|Term Match||Unigram and bigram dot product|
|Synonym Match||Term overlap with synonyms|
|Word2Vec match||IDF-weighted word2vec dot product|
|Acronym match||Query and text acronym overlap|
4.2 Training and Evaluation
We train the offline model $s_{\text{static}}$ only on the artificial dataset. We trained a LambdaMART model [Burges2010] as well as a simple linear RankSVM model [Joachims2002] minimizing a pairwise ranking loss [Burges2010] and found no performance difference. So, for simplicity, we use the linear model for $s_{\text{static}}$. We fine-tune the hyperparameters, namely $\lambda^{+}$ (weight of positive queries), $\lambda^{-}$ (weight of negative queries), $\delta_e$ (weight update for expert feedback), $\delta_u$ (weight update for user feedback), and $\tau$ (the score threshold), on development data from four of our clients, maximizing the total micro F1. We exclude the development data from evaluation.
We use the client datasets for evaluation. For each query, there is at most one correct answer. For each dataset, we simulate real user and expert feedback. We run Algorithm 1, going over the event stream in the order of timestamps. We provide negative or positive feedback using the ground truth. When the system makes a mistake, we reveal the correct KB article only when the event corresponds to expert feedback, as that is the role of experts (Sec. 1.1.2). We evaluate all search algorithms using four metrics: Precision@1, Recall@1, F1@1, and MRR (mean reciprocal rank).
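The four metrics can be computed as in the Python sketch below. The exact definitions are assumptions, since the text does not spell them out: Precision@1 is taken over queries the system chose to answer, Recall@1 and MRR over queries that have a correct article, and an empty ranked list stands for the system declining to answer.

```python
def evaluate(ranked_lists, gold):
    """ranked_lists[i] is the ranked article ids for query i ([] = no answer);
    gold[i] is the correct article id, or None if no correct article exists."""
    answered = sum(1 for r in ranked_lists if r)
    answerable = sum(1 for g in gold if g is not None)
    correct = sum(
        1 for r, g in zip(ranked_lists, gold) if r and g is not None and r[0] == g
    )
    precision = correct / answered if answered else 0.0
    recall = correct / answerable if answerable else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Reciprocal rank of the correct article, 0 when it was not returned at all.
    rr = [
        1.0 / (r.index(g) + 1) if g in r else 0.0
        for r, g in zip(ranked_lists, gold)
        if g is not None
    ]
    mrr = sum(rr) / len(rr) if rr else 0.0
    return {"precision@1": precision, "recall@1": recall, "f1@1": f1, "mrr": mrr}


# Hypothetical results for four queries.
ranked = [["a", "b"], ["b", "a"], [], ["c"]]
gold = ["a", "a", "d", "c"]
metrics = evaluate(ranked, gold)
```

Under these definitions, declining to answer costs recall but not precision, which is why a score threshold trades one against the other.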
4.3 Experimental Comparisons
We compare our online L2R algorithm with two strong baselines: the BM25 algorithm [Robertson et al.1996], which is the de facto standard in publicly available search software like Apache Solr (http://lucene.apache.org/solr), and the static pairwise match baseline (only $s_{\text{static}}$) trained with RankSVM [Joachims2002]. For a fair comparison, we train the baseline model on the artificial data as well as the development data used for tuning our algorithm (Sec. 4.2). Comparing with the $s_{\text{static}}$-only baseline shows how much we can gain from online learning from user feedback.
Features for $s_{\text{static}}$:
We use the match feature templates in Table 4 for the $s_{\text{static}}$ model. For term match, we use modified term frequency representations to normalize for document length [Singhal et al.1996]. For synonyms, we use the PPDB dataset [Ganitkevitch et al.2013]. For word vectors, we use GLOVE embeddings [Mikolov et al.2013]. We compute these features with the query and four textual items obtained from the KB: title, body, keywords, and all (title, body, and keywords concatenated).
Our algorithm setting:
For our algorithm, we set the query similarity kernel $K$ to be the cosine similarity over unigrams and bigrams. We use $\mathrm{TopK}$ as the query similarity aggregator.
Table 3 shows the performance of online learning, static learning, and BM25. Our algorithm outperforms the static baseline by an average 10.4% relative improvement in F1 (and up to 41%); the static baseline in turn vastly outperforms BM25 (17.5% relative improvement in average F1). Notably, the Pearson correlation coefficient between F1 in Table 3 and the average number of queries per KB article is 0.66. This hints that incorporating adaptive learning is more likely to help in cases with more queries per KB article.
5 Conclusion and Future Work
We presented a production system for supporting internal KB search inside organizations that gets continually better at responding to queries using conversational feedback. We leverage online learning and incorporate various practical concerns into the design of our algorithm and scoring function. In the future, we aim to inject neural networks-based features into our online learning setup by using deep learning-based query similarity functions.
- [Avula et al.] Sandeep Avula, Jaime Arguello, Robert Capra, Jordan Dodson, Yuhui Huang, and Filip Radlinski. Embedding search into a conversational platform to support collaborative search. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval.
- [Bai et al.2010] Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. Learning to rank with (a lot of) word features. Inf. Retr., 13(3):291–314, June 2010.
- [Bogdanova et al.2015] Dasha Bogdanova, Cicero dos Santos, Luciano Barbosa, and Bianca Zadrozny. Detecting semantically equivalent questions in online user forums. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 123–131. Association for Computational Linguistics, 2015.
- [Brin and Page1998] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.
- [Burges2010] Christopher J. C. Burges. From ranknet to lambdarank to lambdamart : An overview. Technical report, 2010.
- [Chakrabarti et al.1999] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg. Mining the web’s link structure. Computer, 32(8):60–67, August 1999.
- [Chakravarti et al.2017] Rishav Chakravarti, Jiri Navratil, and Cicero Dos Santos. Improved answer selection with pre-trained word embeddings. 08 2017.
- [Chen et al.2017] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics, 2017.
- [Fader et al.2013] Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1608–1618. Association for Computational Linguistics, 2013.
- [Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. PPDB: The paraphrase database. In HLT-NAACL 2013, 2013.
- [Grotov and de Rijke2016] Artem Grotov and Maarten de Rijke. Online learning to rank for information retrieval: Sigir 2016 tutorial. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, pages 1215–1218, New York, NY, USA, 2016. ACM.
- [Guo et al.2016] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, pages 55–64, New York, NY, USA, 2016. ACM.
- [Gysel et al.2018] Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. Neural vector spaces for unsupervised information retrieval. ACM Trans. Inf. Syst., 36(4):38:1–38:25, June 2018.
- [Harmon1996] D. K. Harmon. Overview of the Third Text Retrieval Conference (TREC-3). DIANE Publishing Company, 1996.
- [Hofmann et al.2008] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.
- [Joachims et al.2005] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’05, pages 154–161, New York, NY, USA, 2005. ACM.
- [Joachims2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 133–142, New York, NY, USA, 2002. ACM.
- [Kleinberg1999] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, September 1999.
- [Langford and Zhang2007] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pages 817–824, USA, 2007. Curran Associates Inc.
- [Lodhi et al.2002] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2, 2002.
- [Manning et al.2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
- [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119, USA, 2013. Curran Associates Inc.
- [Radlinski and Joachims2005] Filip Radlinski and Thorsten Joachims. Query chains: Learning to rank from implicit feedback. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pages 239–248, New York, NY, USA, 2005. ACM.
- [Robertson et al.1996] S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu, and M. Gatford. Okapi at trec-3. pages 109–126, 1996.
- [Shalev-Shwartz and Singer2007] Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Mach. Learn., 69(2-3):115–142, December 2007.
- [Singhal et al.1996] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’96, pages 21–29, New York, NY, USA, 1996. ACM.
- [Yin et al.2016] Dawei Yin, Yuening Hu, Jiliang Tang, Tim Daly, Mianwei Zhou, Hua Ouyang, Jianhui Chen, Changsung Kang, Hongbo Deng, Chikashi Nobata, Jean-Marc Langlois, and Yi Chang. Ranking relevance in Yahoo search. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.