Search plays a critical role on eCommerce sites, empowering online shoppers to discover and purchase items. Large scale eCommerce platforms such as eBay carry a wide variety of inventory that match a search query, varying in terms of properties or aspects of items in the recall set. These aspects could be retail properties such as item condition, price, shipping options or product-specific characteristics such as features, model variants or specifications. Further, based on individual buying intents, different users may be interested in shopping specific aspects of items for the same search query. To accommodate various buying intents, as open marketplaces, it is critical that search results in modern eCommerce websites showcase the variety and selection of inventory available.
Ranking search results, in general, is focused on determining the ordering of documents based on their relative relevance to maximize their utility. Learning-to-rank models aim to learn the notion of relative document relevance during training, using pairwise or listwise loss functions to learn a scoring function (Li, 2011). However, during inference, most models are limited to pointwise or univariate scoring functions where the documents are scored independent of other items in the set. eCommerce search result ranking, further, is tailored towards surfacing highly relevant and sellable inventory to users. Unlike web search, where the goal is to address an informational need, eCommerce search is focused on empowering users to contrast and compare inventory towards completing a purchase. Typically, many eCommerce sites order items in the recall set by their likelihood of sale (sellability), as predicted by a learning-to-rank model. Irrespective of the loss function used in training the model (pointwise, pairwise, or listwise), inferencing is done in a pointwise fashion, assuming independent document relevance. The ranking order thus produced, specifically in the top results that tend to have the largest influence on shoppers, may not be aligned with the objective of showcasing the selection of inventory available. Further, the rankers could deem items with certain aspects more sellable than items with other aspects due to observed majority buying behavior, which may lead to an over-representation of the former in the top ranked results. While each of the items in the top results may independently be more likely to be sold relative to others, they may be sub-optimal when perceived as a set of top results, specifically in the context of addressing various buying intents. This may lead to a gap between what shoppers expect to see in the top results of a search result page and what is actually shown (impressions).
To that end, it is critical to methodically determine a suitable set of top items to be impressed to shoppers against a given search query.
To further illustrate the problem being addressed, we will use the example query "laptop" in the context of the item aspect condition. In an over-simplified scenario for the purpose of illustration, let’s say that, based on the majority of historic sales, laptops in refurbished condition are more likely to be sold, assuming all other factors are equal. In this case, a pointwise scoring ranker may favor refurbished laptops, since refurbished laptops are independently more likely to be sold. However, historic sales may also include considerable sales of laptops in new and used condition. Although refurbished items may seem more suitable to be shown at the top of the results based on independent likelihood of sale, several minority buying intents may have been left unattended in this scenario. Further, considering a set of top results as perceived by a user, we may be better off impressing certain new and used items as well, in order to showcase the selection of inventory and choice. This raises the questions of what the appropriate distribution of the condition aspect in top laptop results is, and how such a distribution can be enforced while ranking. Since a pointwise scoring ranker does not explicitly influence the distribution of items with respect to condition in this example, we must develop methods that facilitate the same.
To summarize, the pointwise scoring nature of most learning-to-rank models may lead to a mismatch between the item aspect distribution of what shoppers purchase on average with respect to a query (desired distribution) and the distribution of items actually impressed on a search results page (purchase-impression gap). This brings up two primary questions:
How do we determine the desired distribution of aspects for items against a given search query?
If we know the desired distribution, how do we enforce it in top results on a search results page?
Purchase behavior of shoppers over a period of time against a query can serve as a reasonable proxy for the desired distribution of aspects to be impressed. Through the application of navigational features and query reformulations, shoppers eventually buy what they intend to. Although the available inventory at a given instance and the behavior of the point-wise scoring ranker influences the distribution of item aspects impressed on a search results page, distribution of aspects with respect to items purchased by shoppers after issuing a query, can provide a global view of shopper preferences. In this work, we determine the desired distribution of aspects for each query from shopper purchase patterns observed over a period of time.
Once a desired distribution of aspects is established, the ranker must be aware of the other items being placed in the ranked results in order to be able to enforce the distribution. An active area of research, groupwise scoring functions learn to score a fixed-size group of items (Ai et al., 2019). However, implementing groupwise scorers in the production environment of large scale eCommerce marketplaces is challenging, specifically in the context of computational cost and latency. Another approach that has been studied and implemented, especially in the context of diversification of search results, is sequential reranking of top results (Agrawal et al., 2009) (Zhu et al., 2014). Sequential reranking is a greedy approach to reranking results by placing items in a ranked list sequentially from top to bottom, selecting the next candidate to be placed by taking into account the ones that have already been placed in the re-ranked list. To enforce the desired distribution, we implement a sequential reranker of the top k search results produced by a conventional pointwise scoring ranker (henceforth referred to as the production ranker). The aim of the reranker is to select the next candidate item to be appended to the re-ranked list such that the selection minimizes the purchase-impression gap, based on items already added, while ensuring that the selected item is independently sellable. In other words, the reranker trades off between a candidate item’s best-match score produced by the production ranker and the candidate’s potential to minimize the purchase-impression gap on the search results page. This is achieved by modeling the perceived purchase-impression gap at each position using specially constructed features, referred to as aspect-impression-share features (ais features), that take into account the aspects of items placed in higher positions, in conjunction with the best-match score.
In this paper, we present methods developed to address the purchase-impression gap that may be observed on eCommerce sites as a consequence of pointwise scoring functions used by most learning-to-rank models. We obtain a desired distribution of item aspects that should be impressed in the top results for a query by mining historical item aspect level purchase behaviors corresponding to the query. We then present a sequential reranker that reorders the top results from a conventional pointwise ranker, minimizing the purchase-impression gap whilst selecting independently sellable items by employing ais features. Early versions of the reranker with a small set of aspects launched on eBay search showed promising item conversion and user engagement improvements. Offline experiments on randomly sampled search behavior datasets show around 10% reduction in purchase-impression gap while leading to improvements in conversion metrics such as sale rank and mean reciprocal rank of purchased items. The proposed methodology, the onsite implementation of the reranker, and details on experiments and results are presented in the sections that follow.
2. Related Work
In eCommerce search, users either have a specific intent or product in mind, or they issue a broad query to understand the breadth of inventory available. In either case, they end up performing a lot of comparison shopping and identifying tradeoffs that work for them before making a purchase decision. Users perform a joint evaluation of all the candidate products on the search page, through the cart, or by saving items to their wish lists. We review some of the related work that focuses on how users evaluate items when multiple options are presented, and on how diversifying search results satisfies different user needs and helps users make purchase decisions faster. Lichtenstein and Slovic (Lichtenstein and Slovic, 1971) presented early work showing that user decisions differ when choices are presented separately compared to when they are presented together. Similarly, a number of previous studies analyze click behavior on a document and how it is influenced by both rank and other documents in the presentation (Joachims et al., 2017) (Joachims et al., 2007) (Craswell et al., 2008). We also measured and validated the influence of neighborhood on the preference of an item in eCommerce in our earlier work (Indrakanti et al., 2019a). Several studies have also addressed showcasing diverse search results, given that there are different user needs for certain queries. They can be mainly divided into implicit and explicit diversification. Implicit approaches model similarity between documents and try to place documents that are similar to the query and dissimilar to the previously placed documents (Carbonell and Goldstein, 1998). Explicit approaches model certain aspects of the query, such as taxonomy (Agrawal et al., 2009), query reformulations (Radlinski and Dumais, 2006), and external resources (He et al., 2012), and try to place documents that cater to all those aspects. All of these approaches apply heuristics-based utility functions, whereas Zhu et al. addressed search result diversification as a learning problem in which a ranking function is learned for diverse ranking of search results (Zhu et al., 2014). Our work extends these previous approaches to adapt to eCommerce and focuses on effectively modeling different aspects of the query that are correlated with purchases. We also change the problem from optimizing for diversity to showcasing the breadth of inventory that satisfies a purchase distribution measured on user behavior.
Our aim is to close the gap between purchase and impression distributions, measured based on the conversion distribution of search results for a query and the recall distribution of the top $k$ items, where $k$ is a non-zero integer. We introduce a sequential reranker on top of the existing production ranker. It reranks the top $k$ items to minimize the purchase-impression gap with respect to item aspects, using features constructed to capture impression share and historically mined purchase shares. Due to the nature of the sequential ranker, during online inferencing, after positions $1, \dots, j-1$ have been filled, we choose the item for position $j$ so that it has a good best-match score and also aligns with the purchase share distribution. The sequential reranker calculates a bridge_score, which is defined as:
$$\text{bridge\_score} = \sum_{i} w_i f_i$$
where $f_i$ is the $i$-th feature and $w_i$ is the weight associated with that feature for the given query.
The intermediate bridge_score is weight-summed with the best-match ranking score from the production ranker to generate the final ranking score. The bridge_score is weighted by the factor $(1 - \alpha)/\alpha$, where $\alpha$ is a learned parameter that determines the relative importance of the two intermediate scores. For our current model, $\alpha$ is learned globally, but we will be exploring query-wise parameters in the future.
We build query-wise models, so for each query the weights are calculated separately.
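The score combination described above can be sketched as follows. This is a minimal illustration rather than the production implementation: the function names, the dictionary-based feature representation, and the symbol `alpha` for the learned tradeoff parameter are assumptions made here, and the additive form (best-match score with unit weight plus the bridge score scaled by (1 - alpha)/alpha) is one reading of the description.

```python
from typing import Mapping


def bridge_score(features: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of feature values; a feature without a weight contributes 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def final_score(best_match: float, bridge: float, alpha: float) -> float:
    """Combine the production ranker's best-match score with the bridge score.

    Assumed form: best_match + ((1 - alpha) / alpha) * bridge, so that
    alpha = 1 reduces to the best-match score alone.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return best_match + ((1.0 - alpha) / alpha) * bridge
```

Under this reading, smaller values of `alpha` give the bridge score more influence, while `alpha = 1` leaves the production ranking untouched.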
3.1. GMV Shares as weights
The weights for the bridge_score are the aspect-specific Gross Merchandise Volume (GMV) shares, calculated by analysing historical behavioral data for a query. For the items that give a converting signal, we run a data pipeline job that records the values of certain aspects painted on the item (e.g. condition, buying format). We consolidate this data over a period of time to get the GMV share for each aspect as a distribution over its aspect values, summing up to 1. For instance, for the query "iphone", if we observe that users are more likely to interact with new items as opposed to refurbished or old ones in the ratio 5:3:2, then the GMV share distribution will be [0.5, 0.3, 0.2] for the new, refurbished and old aspect values respectively. Since the purchase distribution is different for each query, the weights also differ by query.
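As a rough sketch of how such shares could be derived for a single query and aspect, the snippet below aggregates GMV per aspect value and normalizes it. The input format (pairs of aspect value and GMV amount) and the function name `gmv_shares` are hypothetical and are not taken from the data pipeline described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gmv_shares(purchases: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return each aspect value's share of total GMV; shares sum to 1."""
    totals: Dict[str, float] = defaultdict(float)
    for aspect_value, gmv in purchases:
        totals[aspect_value] += gmv
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no converting signal observed for this query
    return {value: amount / grand_total for value, amount in totals.items()}


# Toy data reproducing the 5:3:2 example for the condition aspect.
shares = gmv_shares([("new", 500.0), ("refurbished", 300.0), ("old", 200.0)])
```

With the toy amounts above, `shares` maps new, refurbished and old to approximately 0.5, 0.3 and 0.2.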
3.2. Aspect Impression Share features
Each aspect value for which we have a GMV share is a candidate aspect-impression-share (ais) feature. An ais feature represents how much a given aspect value differs from the distribution of items placed so far. It is greater than 0 if the feature (or aspect value) is painted on the current item, and 0 otherwise.
For instance, the aspect condition could have 3 ais features: ais_is_new, ais_is_refurbished and ais_is_old. To construct the ais features, we first generate the binary features is_new, is_refurbished and is_old, indicating the condition of the item. For example, if the item is in new condition, then is_new = 1, is_refurbished = 0 and is_old = 0. We then define an intermediate delta feature as described in (Indrakanti et al., 2019b), which captures how diverse the current feature is from the distribution that we have seen before. The following example illustrates ais feature computation for a sample scenario. Let us assume we have already ranked 10 items and are selecting the 11th item to be appended, with 6 of the 10 items being new, 3 old and 1 in refurbished condition. If the current item is new, the intermediate delta features are:
The ais features are defined as : . Therefore the features can be computed to be
4. Onsite Implementation
The sequential ranker is implemented in production and re-ranks the top $k$ results out of the $n$ results that are already produced and ranked by our production ranker. This ranker sequentially reorders results from the production ranker by starting from the highest ranked position and moving lower down the list, each time selecting the next candidate that maximizes the score defined by the re-ranking formula described in a previous section.
Parameters for the sequential re-ranker can be easily changed and customized via a corresponding configuration file called a profile. Via a given profile, we can choose how many top results we want to reorder (the previously mentioned value $k$) as well as how many results we want to consider when choosing item candidates (the previously mentioned value $n$). The only prerequisite is that $k \le n$. Besides choosing $k$ and $n$, we could also customize the parameter $\alpha$ for sets of similar queries or even individual queries, which we plan to explore in the future, and apply different re-ranking formulas for different queries. Figure 1 provides an overview of the sequential reranking architecture.
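The greedy selection loop described in this section can be sketched as below. This is an illustrative outline under stated assumptions, not the production code: `sequential_rerank`, the dictionary item representation and the `score_fn` callback are hypothetical names, and `toy_score` is a stand-in scoring rule used only to make the example run; it is not the ais-based re-ranking formula.

```python
from typing import Any, Callable, Dict, List, Sequence

Item = Dict[str, Any]


def sequential_rerank(
    candidates: Sequence[Item],
    k: int,
    score_fn: Callable[[Item, List[Item]], float],
) -> List[Item]:
    """Greedily fill the top k positions from the candidate pool.

    At each position, every remaining candidate is scored against the items
    already placed, and the highest scoring one is appended. Ties keep the
    production order. Unplaced candidates follow in their original order.
    """
    remaining = list(candidates)
    placed: List[Item] = []
    while remaining and len(placed) < k:
        best = max(range(len(remaining)), key=lambda i: score_fn(remaining[i], placed))
        placed.append(remaining.pop(best))
    return placed + remaining


def toy_score(item: Item, placed: List[Item]) -> float:
    """Stand-in score: best-match plus a bonus for under-shown conditions."""
    same = sum(1 for p in placed if p["condition"] == item["condition"])
    share = same / len(placed) if placed else 0.0
    return item["best_match"] + (1.0 - share)


pool = [
    {"id": 1, "condition": "new", "best_match": 0.9},
    {"id": 2, "condition": "new", "best_match": 0.8},
    {"id": 3, "condition": "used", "best_match": 0.7},
]
reranked = sequential_rerank(pool, k=2, score_fn=toy_score)
```

In this toy run the second position goes to the used item (id 3), because the only item placed so far is new. Each position scores every remaining candidate once, so the loop performs on the order of k times n score evaluations.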
The metric that we want to minimize with the reranker is the gap between the impression distribution of the top $k$ items and the learned GMV share distribution of a query. The purchase-impression gap is formally defined as the mean non-negative difference between a query’s expected GMV share and the impressed recall distribution across all its aspect values. Note that a gap only contributes if it is greater than 0; we do not penalize an aspect value for having more than the expected GMV share.
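One way to read this definition, for a single query and a single aspect, is sketched below. The function name and input format are assumptions made for illustration, and averaging over the aspect values that have a GMV share is an interpretation of "mean" in the definition above.

```python
from collections import Counter
from typing import Mapping, Sequence


def purchase_impression_gap(
    gmv_share: Mapping[str, float],
    impressed_values: Sequence[str],
) -> float:
    """Mean of max(0, expected GMV share - impression share) over aspect values."""
    if not gmv_share:
        return 0.0
    total = len(impressed_values)
    counts = Counter(impressed_values)
    gaps = []
    for value, expected in gmv_share.items():
        impressed = counts[value] / total if total else 0.0
        # Over-represented aspect values are not penalized.
        gaps.append(max(0.0, expected - impressed))
    return sum(gaps) / len(gaps)


# Top 10 results with 6 new, 1 refurbished and 3 old items.
top_items = ["new"] * 6 + ["refurbished"] + ["old"] * 3
gap = purchase_impression_gap({"new": 0.5, "refurbished": 0.3, "old": 0.2}, top_items)
```

Here only refurbished is under-represented (0.3 expected against 0.1 impressed), so the gap is 0.2 averaged over three aspect values, roughly 0.067.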
We also ensure that the conversion metrics like mean reciprocal rank (MRR) are not impacted adversely in the process of re-ranking.
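For reference, a minimal sketch of the mean reciprocal rank of purchased items is given below. It assumes one purchased item per search session and 1-indexed ranks; the function name and input format are illustrative only.

```python
from typing import Sequence


def mean_reciprocal_rank(purchased_ranks: Sequence[int]) -> float:
    """Average of 1 / rank of the purchased item across sessions (ranks start at 1)."""
    if not purchased_ranks:
        return 0.0
    if any(rank < 1 for rank in purchased_ranks):
        raise ValueError("ranks must be 1-indexed positive integers")
    return sum(1.0 / rank for rank in purchased_ranks) / len(purchased_ranks)


# Three sessions whose purchased items were shown at ranks 1, 2 and 4.
mrr = mean_reciprocal_rank([1, 2, 4])
```

A reranking that moves purchased items closer to the top raises this value, which is why it serves as a guardrail while the gap is being reduced.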
To size up our problem, we randomly sampled 3000 queries to look at the purchase-impression gap. We saw that on average, there is a 13% gap between sales and impression for each aspect (like condition) in this dataset for each query. Around 77% of the queries had a Purchase-Impression Gap for the aspects we considered.
We also conducted an A/B test in which we determined the aspect with the highest purchase-impression gap for a query and then boosted items that aligned with the GMV share distribution for that aspect. The test did well, and we saw significant lifts in the Gross Merchandise Value Bought metric, along with other metrics. This feature has now been integrated into our production ranker.
For this experiment, we collected behavioral data over a period of 14 days, comprising over 80,000 query search result sessions with up to the top 50 items for each query. We obtained the GMV share by mining the historical purchase distribution for these queries. We ran the sequential ranker on top of the production ranker to rerank the top 20 items from the top 50 items available. We trained with four different values of $\alpha$: 1 (which resolves to the best-match score from the production ranker), 0.8, 0.5 and 0.2. The results are summarized in Table 1. We see the biggest reduction in gap when $\alpha$ = 0.2, where the bridge score has a much higher weight; however, the MRR lift is very small. Therefore, for our A/B test we chose $\alpha$ = 0.5, which reduces the gap by 8% and also gives an MRR lift of 3.7%. In the future, we will try to learn $\alpha$ at a query level instead of a query-set level.
Most commonly used learning-to-rank models score items independent of others while ranking, irrespective of the pairwise or listwise loss functions used during training. Search on eCommerce sites such as eBay generally employs rankers trained on sellability of items against a query. However, the pointwise nature of scoring can lead to an over-representation of items with certain aspects over others in the top search results, leading to a purchase-impression gap, i.e. a mismatch between what shoppers purchase on average with respect to a query and what is actually shown. The effects of this are even more pronounced on eCommerce sites, where shoppers compare and contrast items to make buying decisions, than in web search. To address such a mismatch, we must develop ranking methods that are aware of the other items being placed in a ranked search result page. Further, such methods must be able to enforce a desired distribution of aspects in the top results. To that end, we developed methods to establish a preferred distribution of aspects to be impressed in the top search results against a query. We implemented a sequential reranker that reorders the top results produced by a pointwise scoring production ranker. The sequential reranker manages a tradeoff between the best-match score, which represents independent sellability of an item, and the bridge score, the item’s potential to minimize the purchase-impression gap.
We mine historical buying patterns with respect to individual search queries to establish a preferred distribution of item aspects for each query. We then apply the linear sequential reranker to rerank the top best-match results from the production ranker. Experiments on randomly sampled validation datasets indicate that the presented methods lead to a significant reduction in average purchase-impression gap measured over the top 20 reranked results. Early versions of this implementation, based on a small selected set of item aspects and launched on eBay search sites, led to statistically significant lifts in conversion and engagement metrics on search result pages.
While the sequential reranker presented in this work is, for the purpose of simplicity, a linear model that manages a straightforward tradeoff between best-match score and bridge score, it can be easily extended to more complex models. For instance, a deep neural network can be trained to learn this tradeoff, as illustrated in (Zhu et al., 2014). A small set of popular global item aspects was used in this work. The set of aspects can be extended to a larger set that includes local item or query-specific aspects to improve the reranker. Further, we employ separate features to represent individual item aspects. The framework can be extended to consume embeddings for items learnt over an item aspect space to facilitate bridge score computation. In summary, we presented an intuitive approach to address an important problem observed on large eCommerce sites such as eBay. We employ insights from historic purchase behavior to debias existing ranking through a simple yet powerful and extensible framework that provides a scalable solution without adding significant latency to site experiences.
- Agrawal et al. (2009). Diversifying search results. In Proceedings of the second ACM international conference on web search and data mining, pp. 5–14.
- Ai et al. (2019). Learning groupwise multivariate scoring functions using deep neural networks. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 85–92.
- Carbonell and Goldstein (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 335–336.
- Craswell et al. (2008). An experimental comparison of click position-bias models. In Proceedings of the 2008 international conference on web search and data mining, pp. 87–94.
- He et al. (2012). Combining implicit and explicit topic representations for result diversification. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 851–860.
- Indrakanti et al. (2019a). Influence of neighborhood on the preference of an item in ecommerce search.
- Indrakanti et al. (2019b). Exploring the effect of an item’s neighborhood on its sellability in ecommerce. arXiv preprint arXiv:1908.03825.
- Joachims et al. (2017). Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, Vol. 51, pp. 4–11.
- Joachims et al. (2007). Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems (TOIS) 25 (2), pp. 7–es.
- Li (2011). A short introduction to learning to rank. IEICE Transactions on Information and Systems 94 (10), pp. 1854–1862.
- Lichtenstein and Slovic (1971). Reversals of preference between bids and choices in gambling decisions. Journal of Experimental Psychology 89 (1), pp. 46.
- Radlinski and Dumais (2006). Improving personalized web search using result diversification. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 691–692.
- Zhu et al. (2014). Learning for search result diversification. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 293–302.