Towards Productionizing Subjective Search Systems

03/31/2020 ∙ by Aaron Feng, et al. ∙ 0

Existing e-commerce search engines typically support search only over objective attributes, such as price and location, leaving the more desirable subjective attributes, such as romantic vibe and work-life balance, unsearchable. We found that this is also the case for Recruit Group, which operates a wide range of online booking and search services, including jobs, travel, housing, bridal, dining, and beauty, where each service is among the biggest in Japan, if not internationally. We present our progress towards productionizing a recent subjective search prototype (OpineDB) developed by Megagon Labs for Recruit Group. Several components within OpineDB are enhanced to satisfy production demands, including adding a BERT language model pre-trained on massive hospitality domain review corpora. We also found that the challenges of productionizing the system go beyond enhancing the components. In particular, an important requirement in production-quality systems is to instrument a proper way of measuring the search quality, which is extremely tricky when the search results are subjective. This led to the creation of a high-quality benchmark dataset from scratch, involving over 600 queries from user interviews and a collection of more than 120,000 query-entity relevancy judgments. Also, we found that the existing search algorithms do not meet the search quality standard required by production systems. Consequently, we enhanced the ranking model by fine-tuning several search algorithms and combining them under a learning-to-rank framework. The model improves overall precision and achieves 90+% precision on more than half of the benchmark queries, making these queries ready for AB-testing. While some enhancements can be immediately applied to other verticals, our experience reveals that benchmarking and fine-tuning ranking algorithms are specific to each domain and cannot be avoided.




1 Introduction

A subjective search system finds entities that satisfy subjective requests, such as “hotels with clean rooms and close to really good cafes”. Recently, a crowdsourcing task in [opinedb] revealed that users’ search criteria are largely subjective; across 7 common verticals, over 50% and up to 82% of users’ most frequent search criteria are subjective. Existing e-commerce search engines, however, typically allow users to search only over the objective attributes, such as price, rating, or location, and may further allow the results to be filtered over a predefined set of subjective attributes.

We conducted a similar survey for the hospitality domain in the Japanese market and obtained similar results. With these encouraging results, we proceeded to develop a production-quality subjective search system for the hospitality domain. We started with OpineDB [opinedb, voyageur], a subjective search prototype that is developed by Megagon Labs, a research arm of Recruit Group. OpineDB extracts subjective data (e.g., cleanliness of rooms, friendliness of staff etc.) from review text, explicitly models subjective attributes in a schema (e.g., a set of attributes such as cleanliness, staff, breakfast, value for money etc.), and supports free-text querying over the aggregated subjective data (e.g., “find me hotels with really clean rooms for a romantic getaway”). See Figure 1 for an overview.

Figure 1: Overview of OpineDB. Given the subjective query, OpineDB interprets the query into one of the pre-defined subjective attributes (i.e., “cleanliness” in this example) and ranks the entities using features (i.e., a histogram of the different levels of cleanliness) summarized from opinions extracted from reviews.

It turns out that taking OpineDB to a production-level search system is more challenging than expected, as we describe next.

Matching. The quality of OpineDB’s search results is sensitive to its underlying subjective schema. The subjective schema is constructed from reviews and contains the most important subjective attributes that represent what people search for in the target domain. OpineDB generates query results by first matching the request against the schema attributes. It also approximates the attributes in the schema to match the request in case a direct match cannot be determined. At the core of the matching process is a phrase2vec [word2vec] embedding module, which OpineDB uses to map the text of the queries to one of the attributes (e.g., “really clean room” → cleanliness). We found that the results of requests that can be matched well with the schema’s attributes are of significantly higher quality than those of requests that are not directly captured by the schema. To improve the quality of query results when requests do not match well with the schema attributes, we leverage the BERT [bert] language model, which was recently shown to improve the already powerful Google search engine [bert-google]. In particular, a BERT model pre-trained on a large hotel review corpus can retrieve entities based on textual similarity between queries and review text without relying on the subjective schema [rte]. As we discuss in Section 3, the BERT-based method complements OpineDB, and combining them under a learning-to-rank framework yields the best overall search quality.

Benchmarking. The second major challenge was that no benchmark dataset exists for evaluating subjective search systems. Even worse, we could not obtain subjective queries from Recruit Group’s online services; their systems capture only objective searches. Hence, we found it necessary to create a benchmark of carefully curated queries that can represent the real incoming query distribution. For each query in the benchmark, we also need to construct the set of entities that satisfy the subjective aspects of the query. As this task is crowdsourced and our labeling task is subjective by nature, quality control is of utmost importance. Finally, it is expensive to determine the set of entities for each query by labeling all the query-entity pairs, so we need to select a subset of “most valuable” pairs to send to the crowd-workers. We describe in Section 2 our nontrivial crowdsourcing process to construct this benchmark. Currently, our benchmark consists of (1) over 600 subjective queries from crowd-workers and (2) over 40,000 query-entity relevancy labels. This benchmark has enabled us to systematically evaluate different search algorithms with confidence.

Localization. Building the subjective search system requires significant engineering overhaul to provide Japanese language support since Recruit Group consists of many Japanese online services. To adapt OpineDB for the Japanese language, we (1) implemented GiNZA [ginza], a spaCy-like Japanese NLP library, to deploy OpineDB on Japanese reviews/queries and (2) pre-trained the BERT models on a large corpus of Japanese review text [rte].

In what follows, we focus our discussion on the main efforts, which are the construction of the benchmark and the evaluation of the enhanced OpineDB. Section 2 describes our crowdsourcing process for constructing the evaluation benchmark. Section 3 introduces our learning-to-rank framework for combining OpineDB with BERT. We experimentally validate the performance of our enhanced OpineDB in Section 4 and conclude in Section 5.

2 Benchmark Dataset

In this section, we describe how we created the benchmark dataset for evaluating subjective search systems by crowdsourcing. Since we targeted a Japanese hospitality platform, the benchmark dataset consists of two parts: (1) a set of subjective queries for hotels and (2) a label set, i.e., the query-hotel relevancy labels depicting how well the hotel satisfies the query.

2.1 Subjective Queries

To ensure that the benchmark queries are meaningful in the real production setting, we need to ensure that the queries are representative of real subjective requests. To do this, we first collected around 900 hotel reservation dialogues from pairs of workers, where one plays the role of a hotel reservation agent and the other plays the role of a customer seeking to reserve a hotel. Additionally, we interviewed over 500 Japanese crowd-workers and asked them to describe their requests when making hotel reservations. By analyzing these dialogues and interview results, we found that the requests always contain an area that specifies their desired travel destination. An area can be a specific address, a district, or a region. In fact, most Japanese hotel booking applications require customers to specify a location to begin their search.

These user studies eventually yielded 336 queries, where each query is of the form “(query_text, area)”. An example is (hotels with beautiful sea view, Atami). The areas range from small towns such as Ueno to large metropolitan areas that contain multiple prefectures, such as the Kanto region.

To collect even more queries, we crowdsourced again and asked workers to write a tagline summary sentence for Japanese hotel reviews posted on one of the largest online hotel booking sites in Japan. For each tagline summary, we specify the prefecture of the hotel with which the tagline summary is associated. We collected another 328 queries this way. The statistics of the subjective query set are shown in Table 1.

Total # of   Total # of        Query Length
Queries      Distinct Areas    Average   Shortest   Longest
664          79                16.03     4          84
Table 1: Summary Statistics of Query Set.

2.2 Relevancy Labels

The benchmark dataset also contains a set of query-hotel relevancy labels, which indicate whether the hotel is relevant for the associated query. For each query-hotel pair, we ask crowd-workers to examine the hotel’s front page on the booking site, including photos, descriptive text, reviews, etc., and decide within 90 seconds whether the hotel is relevant to the query. However, as this task is inherently subjective and there are many query-hotel pairs, we need to (a) ensure quality control over the labels obtained and (b) prioritize query-hotel candidates for labeling.

Resolving Subjectivity. The relevancy labeling task is highly subjective. For example, a customer may consider one hotel as clean but another customer may consider the same hotel as dirty. Thus, there may be disagreements between different crowd-workers. To make matters worse, workers may make a decision based on different pieces of information (e.g., pictures of hotel rooms or reviews or descriptions).

To improve the quality of the labels we obtain, we instruct the crowd-workers to make each judgment only after they have examined all the information: the hotel description, photos, and hotel reviews. To avoid malicious or random labeling, we require each worker to justify their labels with evidence, which is verified later. We further apply majority voting to ensure that we pick only labels where there is a certain degree of agreement among workers. Each query-hotel pair was labeled by 3 unique workers, and a pair is relevant only if at least 2 out of 3 workers consider the hotel relevant to the query.
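The 2-out-of-3 rule can be illustrated with a minimal Python sketch. The function name `majority_label`, the `min_agree` parameter, and the toy judgments are hypothetical and only serve the illustration; the evidence-verification step is not modeled.

```python
def majority_label(votes, min_agree=2):
    """Return True if at least `min_agree` workers judged the pair relevant."""
    return sum(1 for v in votes if v) >= min_agree

# Hypothetical judgments: (query, hotel) -> votes from 3 unique workers.
judgments = {
    ("quiet place", "hotel_a"): [True, True, False],
    ("quiet place", "hotel_b"): [False, True, False],
}

# One aggregated relevancy label per query-hotel pair.
labels = {pair: majority_label(votes) for pair, votes in judgments.items()}
```

With three votes per pair, every unique label is backed by three judgments, which matches the ratio between the judgment count and the pair count reported for the final dataset.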

Iterative Labeling. Since there is a large number of candidate pairs, it is prohibitively expensive to label all of them. To reduce the labeling cost, we label only entities (i.e., hotels) that are returned by the search algorithms and extend the label set iteratively. By labeling iteratively, we can adapt to the “importance” of each query-hotel pair, which can change over time as we develop the search system. For example, for the precision@10 metric (i.e., precision for the top-10 entities), we first label only the top-10 entities returned by each algorithm. At the very beginning, the most important pairs were the ones returned by the first version of OpineDB. It is also important to understand how much better OpineDB is than a baseline algorithm, so we considered random ordering as the first baseline and labeled its resulting pairs as well. In the next iteration, as we fine-tuned the schema of OpineDB and added more models (BERT and BM25), we also labeled their top-10 entities to evaluate their results. The pairs already labeled in the previous round are not re-labeled. Note that the decisions to fine-tune the schema and to add new models were based on our error analysis of the results of OpineDB, which means that we did not know the importance of those candidates ahead of time. We repeated this process 3 times to obtain the final dataset and models.
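The selection of pairs for one labeling round can be sketched as follows. This is a minimal illustration under assumptions: the function name `select_pairs_to_label`, the nested-dictionary layout of the rankings, and the toy hotel ids are all hypothetical.

```python
def select_pairs_to_label(rankings, labeled, k=10):
    """Collect the (query, hotel) pairs that appear in the top-k of any
    algorithm and have not been labeled in an earlier round.

    rankings: {algorithm: {query: [hotel ids in ranked order]}}
    labeled:  set of (query, hotel) pairs that already have a label
    """
    new_pairs = set()
    for per_query in rankings.values():
        for query, hotels in per_query.items():
            for hotel in hotels[:k]:
                if (query, hotel) not in labeled:
                    new_pairs.add((query, hotel))
    return new_pairs

# Round 1 (toy data, k=2): the first OpineDB version and a random baseline.
round1 = {
    "OpineDB": {"q1": ["h1", "h2", "h3"]},
    "Random": {"q1": ["h3", "h1", "h4"]},
}
labeled = select_pairs_to_label(round1, set(), k=2)

# Round 2: a newly added model; pairs from round 1 are not re-labeled.
round2 = {"BERT": {"q1": ["h2", "h5", "h1"]}}
new_in_round2 = select_pairs_to_label(round2, labeled, k=2)
```

Because each round only adds the pairs that a new or changed algorithm surfaces in its top-k, the labeling cost grows with the number of distinct results rather than with the full query-hotel cross product.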

Eventually, we collected 40,886 unique query-hotel pair labels based on 122,658 judgments of crowd-workers. Among all unique pairs, 23,862 pairs are labeled as relevant. We summarize the statistics of these labels in Table 2.

Round   #Queries   #Pairs   %Pos.   Prec@10   New Tested Models
1       500        6,000    52.6%   0.622     OpineDB, Random
2       500        28,386   60.3%   0.662     OpineDB+, BERT, BM25
3       664        40,886   58.4%   0.755     Rating, Logit, LambdaMART
Table 2: Summary of the Relevancy Label Dataset. See Sections 3 and 4 for the details of the tested models. The Prec@10 column reports the model that produces the best performance at each round (OpineDB, OpineDB+, and Logit, respectively).

3 Ranking Algorithms

In this section, we briefly summarize the ranking algorithms that we implemented and experimented with. Our implementation started with OpineDB as a single ranking algorithm. Intuitively, OpineDB answers subjective queries by mapping the query term to one or more subjective attributes. Each attribute captures an important aspect of the underlying subjective data (e.g., hotel reviews). OpineDB models each attribute as a linguistic domain consisting of all linguistic variations that describe the attribute. For example, the linguistic domain of the attribute “cleanliness” includes phrases such as “really clean room”.

In [opinedb], OpineDB extracts these phrases from the review corpus and summarizes them using an opinion extraction pipeline. Although this pipeline exists for English reviews, building one for Japanese is not trivial due to the lack of mature NLP tools and resources for Japanese. To overcome this challenge, we implemented and open-sourced GiNZA, a Japanese NLP library, for building an extractor based on dependency parsing and pattern matching.

Not surprisingly, the search quality of OpineDB heavily depends on the set of subjective attributes and linguistic domains. The first version of OpineDB achieved a precision@10 of only 62.2% (Table 2), but after we fine-tuned the attribute schema on a development set by adding more attributes and phrases, the precision@10 went up to 66.2% (denoted as OpineDB+, row 2 of Table 2).

While OpineDB uses word2vec [word2vec] to increase the size of linguistic domains, and fine-tuning the schema can further improve the search quality, we noticed in our error analysis that OpineDB struggles with queries whose meaning goes beyond the combination of their words. Based on this observation, we implemented a similarity search algorithm that constructs sentence embeddings using a BERT model. In this method, we train SentencePiece [kudo-richardson-2018-sentencepiece] and BERT models on review text with over 20 million sentences [rte] and fine-tune on about 300k review-reply sentence pairs. For instance, OpineDB always returns relevant entities for queries that are covered by the schema, like “静かな宿” (quiet place), but it does not perform well in cases like “最低限度のサービスで好きにさせてくれる” (minimum service that lets me have my own way). For the latter query, OpineDB wrongly matches hotels whose reviews have a meaning close to “minimum service”, while the BERT-based approach matches semantically similar sentences like “いい意味でほったらかし” (left alone, in a good sense).
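A minimal Python sketch of embedding-based similarity search is given below. It rests on several assumptions that the text does not specify: the embeddings are toy 3-dimensional vectors rather than BERT outputs, similarity is cosine similarity, and a hotel is scored by its best-matching review sentence. The names `cosine` and `rank_hotels` are hypothetical, and the brute-force loop stands in for an index library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_hotels(query_vec, hotel_sentences):
    """Rank hotels by the highest similarity between the query embedding
    and any of the hotel's review-sentence embeddings (an assumed
    aggregation rule). Each hotel needs at least one sentence."""
    scores = {
        hotel: max(cosine(query_vec, sent) for sent in sentences)
        for hotel, sentences in hotel_sentences.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings standing in for sentence vectors from a language model.
query_vec = [1.0, 0.0, 0.2]
hotel_sentences = {
    "hotel_a": [[0.9, 0.1, 0.3], [0.0, 1.0, 0.0]],
    "hotel_b": [[0.1, 0.9, 0.0]],
}
ranking = rank_hotels(query_vec, hotel_sentences)
```

Because the match is made in embedding space rather than through schema attributes, a query and a review sentence can be close even when they share no words.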

With a similar motivation to BERT, we also implemented the following two search algorithms:

  • Rating: This method leverages structured information, such as hotel ratings and fine-grained aspect ratings, available in the hotel booking platform. We compute a sum of the aspect ratings weighted by the word2vec similarity between the subjective query and each aspect.

  • Okapi BM25: BM25 is a variant of the classic TF-IDF ranking algorithm [ir] used in many popular document retrieval systems like Elasticsearch.
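The Rating method can be sketched in Python as a similarity-weighted sum. The aspect names, rating values, and similarity values below are hypothetical; in particular, the query-to-aspect similarities are supplied by hand here instead of being computed from word2vec vectors.

```python
def rating_score(aspect_ratings, aspect_similarity):
    """Sum of aspect ratings, each weighted by the similarity between
    the query and that aspect. Aspects without a similarity get weight 0."""
    return sum(
        aspect_similarity.get(aspect, 0.0) * rating
        for aspect, rating in aspect_ratings.items()
    )

# Hypothetical similarities between the query "really clean room" and each aspect.
similarity = {"cleanliness": 0.8, "location": 0.1}

# Hypothetical aspect ratings of two hotels.
hotels = {
    "hotel_a": {"cleanliness": 4.5, "location": 3.0},
    "hotel_b": {"cleanliness": 3.0, "location": 4.8},
}
scores = {name: rating_score(r, similarity) for name, r in hotels.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The weighting lets the aspect closest to the query dominate, so a hotel with a high cleanliness rating outranks one with a high location rating for a cleanliness query.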

Our experiments reveal that although these algorithms do not outperform OpineDB individually, they complement it: as shown in Table 3, they tend to outperform OpineDB in cases where OpineDB does not perform well. This led to the idea of using learning-to-rank [liu2009learning] to combine these different ranking models and achieve improved search quality.

            OpineDB               BERT                  Rating                BM25
            Precision@10 < 0.5    Precision@10 ≥ 0.5    Precision@10 ≥ 0.5    Precision@10 ≥ 0.5
#queries    63                    25                    27                    18
Table 3: Orthogonality with OpineDB. Out of the 63 queries for which OpineDB has precision@10 < 0.5, we show the number of queries on which each other ranking algorithm performs better (precision@10 ≥ 0.5).
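The counts in Table 3 correspond to a simple per-query comparison, sketched below in Python. The function name `count_rescued`, the data layout, and the toy precision values are hypothetical.

```python
def count_rescued(prec, base="OpineDB", threshold=0.5):
    """Among queries where `base` has precision below `threshold`, count for
    every other algorithm the queries where it reaches `threshold` or more.

    prec: {algorithm: {query: precision@10}}
    Returns (number of weak queries for base, {algorithm: rescued count}).
    """
    weak = {q for q, p in prec[base].items() if p < threshold}
    rescued = {
        alg: sum(1 for q in weak if scores.get(q, 0.0) >= threshold)
        for alg, scores in prec.items()
        if alg != base
    }
    return len(weak), rescued

# Toy per-query precision@10 values.
prec = {
    "OpineDB": {"q1": 0.3, "q2": 0.9, "q3": 0.4},
    "BERT": {"q1": 0.7, "q2": 0.2, "q3": 0.1},
    "BM25": {"q1": 0.6, "q2": 0.8, "q3": 0.5},
}
n_weak, rescued = count_rescued(prec)
```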

Intuitively, learning-to-rank formulates the ranking problem as a supervised learning problem. To combine the base ranking algorithms (OpineDB, BERT, BM25, and Rating), we featurize each query-entity pair by computing the ranking scores returned by each algorithm. We then train a learning-to-rank model using these features on a training set extracted from the benchmark dataset.

Specifically, we consider two learning-to-rank models, Logit and LambdaMART. Logit trains a Logistic Regression binary classifier on the binary relevancy labels and, at prediction time, uses the classifier’s probabilistic output as the ranking score. LambdaMART [lambdamart] is a popular learning-to-rank algorithm with a similar supervised learning setting but with the goal of minimizing the average number of inversions in rankings.
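The Logit idea can be illustrated with a small, dependency-free Python sketch. It is not the implementation used in the experiments: logistic regression is fitted here with plain batch gradient descent, and the feature values, learning rate, and epoch count are hypothetical. Each feature vector holds the scores of the four base algorithms for one query-entity pair, assumed to be scaled to [0, 1].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logit(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            # Prediction error for this pair drives the gradient.
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def rank_score(model, x):
    """Predicted probability of relevance, used as the ranking score."""
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Toy features per pair: [OpineDB, BERT, BM25, Rating] scores.
X = [[0.9, 0.8, 0.7, 0.9], [0.8, 0.3, 0.6, 0.7], [0.2, 0.9, 0.8, 0.6],
     [0.1, 0.2, 0.3, 0.4], [0.3, 0.1, 0.2, 0.5], [0.2, 0.3, 0.1, 0.3]]
y = [1, 1, 1, 0, 0, 0]  # binary relevancy labels from majority voting
model = train_logit(X, y)

# Candidates for a query are ranked by descending predicted probability.
candidates = {"hotel_a": [0.9, 0.9, 0.9, 0.9], "hotel_b": [0.1, 0.1, 0.1, 0.1]}
ranking = sorted(candidates, key=lambda h: rank_score(model, candidates[h]), reverse=True)
```

Treating relevance as a binary classification target ignores the list structure of a ranking, which is the part that LambdaMART models explicitly.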

4 Experiments

We evaluate the search quality of the different search algorithms in this section. Our first result shows that combining the base ranking algorithms (OpineDB, BERT, BM25, and Rating) using learning-to-rank significantly improves the overall search quality. Second, we evaluate the algorithms on two special requirements important to our production setting: (1) top-k sensitivity and (2) precision distribution over the benchmark queries. Top-k sensitivity is important because, for hotel booking, the ranked list only contains hotels with available rooms, so the system is expected to return high-quality top-k results across different k’s. The query precision distribution is important because, when introducing subjective search as new functionality, we need to identify a subset of highly accurate queries for AB-testing so that the online platform can later be confident in fully deploying the system. Our results show that the current system satisfies both requirements and is ready for the next-stage evaluation.

4.1 Experiment Setup

To measure the goodness of a given ranking, we use two very popular metrics: (1) Precision@K and (2) Normalized Discounted Cumulative Gain (NDCG) [ir]. Precision@K is defined as r/K if r out of the top-K results are labeled as relevant. Besides the number of relevant results in the top-K, NDCG also quantifies whether the relevant results are ranked higher than the irrelevant results, by penalizing highly-ranked irrelevant results.
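Both metrics can be sketched in a few lines of Python. The sketch assumes binary relevance labels, the common log2 position discount, and an ideal ordering computed from the labels of the returned list itself; these choices are assumptions for illustration, since the exact NDCG variant is not specified here.

```python
import math

def precision_at_k(relevance, k):
    """Fraction of the top-k results labeled relevant (r / K)."""
    return sum(relevance[:k]) / k

def dcg_at_k(relevance, k):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Binary labels of a ranked list, best-ranked result first.
ranked_labels = [1, 0, 1]
p3 = precision_at_k(ranked_labels, 3)
n3 = ndcg_at_k(ranked_labels, 3)
```

In this toy list, two of three results are relevant, so Precision@3 is 2/3 regardless of order, while NDCG@3 falls below 1 because an irrelevant result sits above a relevant one.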

For the learning-to-rank models, we divide the benchmark dataset by query and use the 250 queries that occur most often in the hotel reservation dialogues for testing and the remaining 414 queries for training. The 250 testing queries are associated with 17,381 unique query-hotel labels, and the training set contains 414 queries and 23,505 unique query-hotel labels. For the fine-tuning of the OpineDB schema, we use the same training-set queries.

We implemented all the ranking models in Python using the libraries shown in Table 4. All the experiments were executed on an AWS c5.9xlarge server with 72GB RAM.

Model     OpineDB   BERT            Rating   BM25            Logit     LambdaMART
Library   GiNZA     Faiss [faiss]   Gensim   Elasticsearch   Sklearn   pyltr
Table 4: Python Libraries for Ranking Models

4.2 Overall Ranking Precision

Table 5 shows the precision@K and NDCG@K of 6 ranking algorithms: OpineDB, BM25, BERT, Rating, Logit, and LambdaMART, all described in Section 3. We choose K = 10 and K = 3 because top-10 is commonly used in web-based search interfaces and top-3 is suitable for applications such as chatbots and mobile apps.

The results show that OpineDB has the highest precision among the 4 base models (BM25 is slightly ahead on NDCG@10 only), and that combining it with the other models under learning-to-rank significantly improves accuracy on all 4 metrics. Overall, the simple Logistic Regression method (Logit) achieves the best scores and improves over OpineDB by about 0.09 on top-10 precision and NDCG, and by about 0.08 on the top-3 scores.

             Prec.@10   Prec.@3   NDCG@10   NDCG@3
OpineDB      0.664      0.693     0.648     0.691
BERT         0.578      0.598     0.562     0.591
Rating       0.581      0.560     0.578     0.557
BM25         0.657      0.667     0.661     0.666
Logit        0.755      0.774     0.742     0.769
LambdaMART   0.741      0.768     0.722     0.763
Table 5: Precisions of Ranking Algorithms.

4.3 Sensitivity Study on Top-K

In a production-level search system, the search results are presented to users, who then usually make a hotel or restaurant reservation. Since the user interface only shows entities that are not fully booked, we need to ensure that the top-K search results are of high quality across multiple values of K. This motivates us to verify whether the precision drops significantly when K increases.

In Figure 2, we can see that the precision of the learning-to-rank models drops faster than that of the other baselines. This is because our training labels were collected for top-10 candidates only, but the learning-to-rank models still maintain a significant improvement for K from 1 to 30. Even when all the top-10 hotels are fully booked, the precision of the search results remains 70%.

Figure 2: Top-K Sensitivity

4.4 Precision Distribution by Query

In a production software setting, there are often many reasons to be careful about introducing new functionality. One way to get the system into use is to start with a small workload. In the subjective search case, this means we can augment an existing search system with support for subjective queries by simply rerouting some queries to our approach. This motivates us to examine for how many queries our approach performs very well, e.g., 9 out of the top-10 results are correct.

Figure 3 shows the precision distribution of the Logit learning-to-rank model. Of the 250 testing queries, 54.3% return 9 or 10 correct results in the top 10, and 64.7% return all correct results in the top 3. When deploying subjective search systems in production, this allows us to contribute to an existing system by rerouting such queries.
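Selecting the queries that are accurate enough to reroute can be sketched as a threshold over per-query precision. The function names, the 0.9 threshold default, and the toy values are hypothetical.

```python
def share_at_least(per_query_precision, threshold):
    """Fraction of queries whose precision reaches `threshold` or more."""
    hits = sum(1 for p in per_query_precision.values() if p >= threshold)
    return hits / len(per_query_precision)

def routable_queries(per_query_precision, threshold=0.9):
    """Queries accurate enough to be rerouted to the subjective search system."""
    return sorted(q for q, p in per_query_precision.items() if p >= threshold)

# Toy precision@10 per test query.
prec_at_10 = {"q1": 1.0, "q2": 0.9, "q3": 0.6, "q4": 0.3}
share = share_at_least(prec_at_10, 0.9)
to_route = routable_queries(prec_at_10)
```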

(a) Precision@10
(b) Precision@3
Figure 3: Precision Distribution by Query of Logit

5 Conclusion

As user-generated data becomes more prevalent, it plays a critical role when users make decisions about products and services. However, existing search systems do not sufficiently support search over subjective data like online reviews. To bridge this gap, we implemented a production-level subjective search system that achieved 70+% overall precision and 90+% precision on more than half of the benchmark queries.

Takeaways. To build such a production search system, several challenges need to be appropriately addressed. First, the search quality of a single search algorithm (e.g., OpineDB, BM25, or BERT) is not enough for production use. Through error analysis, we found that those models complement each other to some degree. We achieved the final precision by combining these algorithms under a learning-to-rank framework. Second, it is nontrivial to evaluate the quality of subjective search since a benchmark of subjective queries does not exist. We constructed a set of over 600 subjective queries from real dialogues, interviews, and review tagline summaries. We also crowdsourced the query-hotel relevancy labels iteratively and applied careful quality control to maximize the value of the collected labels. Last but not least, certain engineering efforts are needed for localization when building the search system.

As for future work, we plan to deploy the system to other popular production scenarios, such as proposing recommendations when users are seeking suggestions in chatbots. We are also expanding the subjective search system to other domains beyond hospitality, including housing and restaurant recommendation.