Users tend to browse multiple SERPs to view more products and make comparisons before they make final purchase decisions in product search. From the log of a commercial product search engine, we observe that in about 5% to 15% of search traffic, users browse and click results in the previous pages and purchase items in the later result pages. Users’ clicks can be considered as implicit feedback that indicates their preference in the current query session. Relevance feedback (RF) approaches can be used to extract the relevance topic and re-rank the subsequent SERPs. There has been some research on multi-page search (Jin et al., 2013; Zeng et al., 2018). However, these methods are designed for document retrieval, which has different characteristics from product search. Documents consist of text while products are essentially entities that have many aspects such as brand, color, size and so on. In addition, in contrast to document retrieval, where relevance is a universal evaluation criterion, a product search system is evaluated based on user purchases that depend on both product relevance and customer preferences. In this paper, we study the problem of multi-page product search, where little research has been conducted.
Most previous studies on product search focus on product relevance (Duan et al., 2013; Van Gysel et al., 2016; Wu et al., 2018; Karmaker Santu et al., 2017). Attempts were also made to improve customer satisfaction by diversifying search results (Yu et al., 2014). Recently, Ai et al. (2017) introduced a personalized ranking model which takes the users’ preferences learned from their historical reviews together with the queries as the basis for ranking. However, the personalized model cannot cope with situations such as users who have not logged in while searching and thus cannot be identified, users who have logged in but do not have enough purchase history, and a single account being shared by several family members. In these cases, user purchase records are either unavailable or contain substantial noise. Moreover, users’ long-term behaviors may be less informative about their preferences than short-term behaviors such as interactions with the shown items in a query session. These limitations of existing work on product search motivate us to study short-term feedback for modeling user preferences in a query session, which does not require additional customer information or purchase history, and to compare long-term and short-term context in multi-page product search.
Traditional relevance feedback (RF) methods, which extract expansion terms from feedback documents, are prone to word mismatch problems (Zamani and Croft, 2016). To tackle this problem, we propose an end-to-end context-aware embedding model that can incorporate both long-term and short-term context to predict purchased items. In this way, semantic match and the co-occurrence relationship between clicked and purchased items are both captured in the embeddings. We show the effectiveness of incorporating short-term context against baselines using both long-term context and no context. Also, our model performs better than state-of-the-art word-based RF models by a large margin.
Our contributions can be summarized as follows: (1) we reformulate conventional one-shot ranking to dynamic ranking (i.e., multi-page search) based on user clicks in product search; (2) we introduce different context dependency assumptions and propose a simple yet effective end-to-end embedding model to capture different types of dependency; (3) we investigate different aspects in multi-page product search on real search log data and show the effectiveness of incorporating short-term context and neural embeddings. Our study on multi-page product search indicates that this is a promising direction and worth more attention.
2. Related Work
Product Search. Most previous work treats product search as a one-shot ranking problem, where given a query, static results are shown to users regardless of their interaction with the result lists. Facets of products have been used for product search (Lim et al., 2010; Vandic et al., 2013). Language model based approaches have been studied to support keyword search (Duan et al., 2013). Later, to further solve vocabulary mismatch, models that measure semantic match between queries and products based on reviews have been proposed (Van Gysel et al., 2016; Ai et al., 2017). Other aspects of product search such as popularity, visual preference and diversity have also been studied (Long et al., 2012; Guo et al., 2018; Yu et al., 2014). In terms of labels for training, there are studies on using clicks, purchases, click-rate, add-to-cart ratios and order rates as labels (Wu et al., 2018; Karmaker Santu et al., 2017). In a different approach, Hu et al. (2018) use an online reinforcement learning mechanism to rank products dynamically when users request the next SERPs. However, they update a global ranker given the signal of purchases. In contrast, our model updates SERPs for each individual query based on the clicks collected under the query.
Multi-page Search and Relevance Feedback (RF). Some research has been conducted on multi-page search (Jin et al., 2013; Zeng et al., 2018). These methods are word-based or learning-to-rank based and focus on document retrieval, where relevance plays a different role than in product search. There has been considerable research on RF. However, most RF methods are unsupervised and based on bag-of-words representations, such as Rocchio (Rocchio, 1971) and the Relevance Model (RM3) (Lavrenko and Croft, 2017). Embedding-based RF methods have also been proposed to leverage semantic match (Zamani and Croft, 2016; Bi et al., 2019). Although these RF methods can also be applied in our task, we propose an end-to-end neural model for RF in the context of product search.
3. Context-aware Product Search
We first formulate the task of multi-page product search. Then we discuss different assumptions of context dependency models and propose a context-aware embedding model for the task.
3.1. Problem Formulation
A query session is initiated when a user issues a query to the search engine. (We refer to the series of user behaviors associated with a query as a query session, i.e., a user issues a query, clicks results, paginates, purchases items and finally ends searching with the query.) Let be the set of items on the -th search result page ranked by an initial ranker and denote by the union of . For practical purposes, we let the re-ranking candidate set for page be where and is the set of re-ranked items viewed by the user in the first pages. Given user , query , and the set of clicked items in the first pages as context, the objective is to rank all, if any, purchased items in at the top of the next result page.
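The candidate-set construction above can be illustrated with a short sketch. It assumes the initial ranking is available as an ordered list of item ids and that the candidate pool is capped at a fixed size; the names `rerank_candidates`, `initial_ranking`, `viewed`, and `pool_size` are hypothetical and do not come from the paper.

```python
from typing import Hashable, Iterable, List, Set


def rerank_candidates(
    initial_ranking: Iterable[Hashable],
    viewed: Set[Hashable],
    pool_size: int,
) -> List[Hashable]:
    """Collect up to `pool_size` unviewed items, keeping the initial rank order.

    Assumption: the pool is cut off at a fixed size; the paper's exact
    cutoff rule is not reproduced here.
    """
    candidates: List[Hashable] = []
    for item in initial_ranking:
        if len(candidates) >= pool_size:
            break
        if item not in viewed:  # items already shown to the user are excluded
            candidates.append(item)
    return candidates
```

For example, with an initial ranking `["a", "b", "c", "d", "e"]`, viewed items `{"a", "c"}`, and a pool size of 2, the sketch returns `["b", "d"]`.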
3.2. Context Dependency Models
Figure 1 shows the graphical models for three types of context dependencies: long-term context, short-term context, and long-short-term context. denotes the latent variable of a user’s long-term interest independent of queries, and clicks in the first result pages, i.e., , represent the user’s short-term preference. Purchased items on and after page , i.e., , depend on query and different types of context under different dependency assumptions.
Long-term Context Dependency. Only users’ long-term preferences, usually represented by their historical queries and the corresponding purchased items (denoted as in Figure 1), are used to predict purchased items. Such models can provide personalized search results at the beginning of a query session (Ai et al., 2017). However, this assumption needs user identity and purchase history, which are not always available. Moreover, long-term preferences may not be informative for predicting the final purchase, since a user’s current search intent may differ from that of any of her previous searches and purchases.
Short-term Context Dependency. Under this assumption, given the observed clicks in the first pages () as short-term context, the items purchased in the subsequent result pages () are conditionally independent of the user , as shown in Figure 1. Users with little or no purchase history and users who have not logged in can benefit directly from such a ranking scheme.
Long-short-term Context Dependency. In this model, an unseen item after page is scored according to , which considers both long-term context () and short-term context (). This setting considers more information, but it also has the drawback of requiring users’ identity and purchase history.
In this paper, we focus on non-personalized short-term context and include the other two types of context for comparison.
3.3. Context-aware Embedding Model
We designed a context-aware embedding model (CEM) in which different dependency assumptions can be captured by varying the corresponding coefficients, as shown in Figure 2.
Item Embeddings. We use product titles to represent products since merchants tend to put the most informative, representative text such as the brand, size, color, material and even target customers in product titles. We use the average of title word embeddings of a product as its own embedding (). (Other encoding methods, such as a non-linear projection of average word embeddings or a recurrent neural network, did not perform better than this simple method.) In this way, word representations can be generalized to new items, and we do not need to cope with the cold-start problem.
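As a minimal sketch of this averaging step, the snippet below builds an item vector from a toy word-embedding table. The table values, the helper name `average_embedding`, and the choice to skip out-of-vocabulary words are all assumptions made for illustration; in the model itself the word embeddings are learned end to end.

```python
from typing import Dict, List, Sequence


def average_embedding(words: Sequence[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of the words found in the lookup table.

    Assumption: words missing from the table are skipped.
    """
    vectors = [table[w] for w in words if w in table]
    if not vectors:
        raise ValueError("no known words to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional word embeddings (hypothetical values).
word_table = {
    "red": [1.0, 0.0],
    "cotton": [0.0, 1.0],
    "shirt": [1.0, 1.0],
}
title_vec = average_embedding("red cotton shirt".split(), word_table)
```

Because the representation depends only on words, a product that never appeared in training still receives a vector as long as its title words are in the vocabulary. The same helper could be applied to query words.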
User Embeddings. Each user has a unique representation from a lookup table, which is shared across search sessions and updated by the gradient learned from previous user transactions. In this way, the long-term interest of the user is captured and we use the user embeddings as long-term context in our models.
Query Embeddings. Similar to item embeddings, we use the average embedding of query words as the representation ().
Short-term Context Embeddings. We use the set of clicked items to represent the user preference behind the query. We assume that the order of clicked items does not matter when modeling short-term user preference under the same query. One reason is that a user’s purchase needs are often fixed given her query. Another reason is that the order of user clicks is usually based on the rank of retrieved products, as users examine each result from top to bottom. Similar to Rocchio (1971) and Bi et al. (2019), we represent the relevant set with the centroid of the items in the set, and the average of the item embeddings is used to represent the centroid, denoted as . (We also tried an attention mechanism to weight each clicked item according to the query and represent the user preference with a weighted combination of clicked items. However, this method is not better than combining clicks with equal weights.)
Overall Context Embeddings. We use a convex combination of user, query, and click embeddings as the representation of the overall context (Equation 1).
This overall context is then treated as the basis for predicting purchased items in . When or is set to 0, the corresponding short-term or long-term context does not take effect. In other cases, both types of context are considered. By varying the values of and , we can use Equation 1 to model different types of context dependency and do comparisons.
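The combination can be sketched as follows. The weight names `w_user` and `w_click`, the function names, and the way the query weight is derived as the remainder are assumptions chosen to satisfy the convexity described above; they are not the paper's notation.

```python
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Equal-weight average of a non-empty set of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def overall_context(
    query_vec: Sequence[float],
    user_vec: Sequence[float],
    click_vecs: Sequence[Sequence[float]],
    w_user: float,
    w_click: float,
) -> List[float]:
    """Convex combination of query, user, and click-centroid embeddings.

    Assumption: the query receives the remaining weight 1 - w_user - w_click.
    """
    if w_user < 0 or w_click < 0 or w_user + w_click > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    w_query = 1.0 - w_user - w_click
    # With no clicks, fall back to a zero vector so the term has no effect.
    click_vec = centroid(click_vecs) if click_vecs else [0.0] * len(query_vec)
    return [
        w_query * q + w_user * u + w_click * c
        for q, u, c in zip(query_vec, user_vec, click_vec)
    ]
```

Setting `w_user` to 0 gives a short-term-only context, setting `w_click` to 0 gives a long-term-only context, and setting both to 0 reduces the context to the query embedding.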
Attention Allocation Model for Items. With the overall context collected from the first pages, we further construct an attentive model to re-rank the products in the candidate set . This re-ranking process can be considered as an attention allocation problem. Given the context that indicates the user’s preference and a set of candidate items that have not been shown to the user yet, the item which attracts more user attention will have a higher probability of being purchased. The attention weights then act as the basis for re-ranking. They can be computed as:
where is computed according to Equation 1. This function can also be interpreted as the generative probability of an item in the candidate set given the context . Then the model is trained by maximizing the likelihood of observing conditioning on corresponding in the training set.
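A minimal sketch of such an attention allocation is given below. It assumes the weights are a softmax over dot products between the context vector and each candidate item embedding, and that training minimizes the negative log-likelihood of the purchased items; the function names and the dot-product scoring are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def attention_weights(
    context: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[float]:
    """Softmax over context-item dot products (assumed scoring function)."""
    scores = [sum(c * x for c, x in zip(context, item)) for item in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(weights: Sequence[float], purchased: Sequence[int]) -> float:
    """Loss for one session: minus the log-probabilities of the purchased items."""
    return -sum(math.log(weights[i]) for i in purchased)


def rerank(weights: Sequence[float]) -> List[int]:
    """Candidate indices sorted by decreasing attention weight."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
```

The weights sum to one over the candidate set, so they can be read as a probability distribution over candidates given the context, and sorting by weight yields the re-ranked list.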
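[Table: performance of the compared methods on the “Toys & Games”, “Garden & Outdoor”, and “Cell Phones & Accessories” datasets; the table body was not recovered.] ’ indicates significantly worse than SCEM in a paired Student’s t-test. Note that a difference larger than 3% is approximately significant.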
Datasets. We randomly sampled three category-specific datasets, namely, “Toys & Games”, “Garden & Outdoor”, and “Cell Phones & Accessories”, from the logs of a commercial product search engine spanning ten months between 2017 and 2018. We keep only the query sessions with at least one clicked item on any page before the pages with purchased items. Our datasets include up to a few million query sessions containing several hundred thousand unique queries. The average lengths of product titles in these categories are from 13 to 22 and vocabulary sizes are from 0.2M to 1M.
Evaluation Methodology. The sessions that occurred in the first 34 weeks are used for training, the following 2 weeks for validation and the last 4 weeks for testing. Given that the datasets are static, we can only evaluate the performance of one-shot re-ranking from page given the context collected from the first pages. We experimented on the cases when . (We also experimented with the setting of ; the results show similar trends, and the improvements are larger since there are more clicks available in the first two SERPs.) As in Rocchio (1971), only rank lists of unseen items are evaluated. at cutoff 100, and at 10 are used as the metrics.
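The chronological split can be sketched as below. The function name `split_by_week`, the start date used in the usage note, and the treatment of out-of-range dates are hypothetical; only the 34/2/4-week proportions come from the text.

```python
from datetime import date, timedelta


def split_by_week(
    session_date: date,
    start: date,
    train_weeks: int = 34,
    valid_weeks: int = 2,
    test_weeks: int = 4,
) -> str:
    """Assign a session to train/valid/test by its week offset from `start`."""
    week = (session_date - start).days // 7
    if week < 0 or week >= train_weeks + valid_weeks + test_weeks:
        raise ValueError("session date falls outside the log window")
    if week < train_weeks:
        return "train"
    if week < train_weeks + valid_weeks:
        return "valid"
    return "test"
```

With a hypothetical start of `date(2017, 1, 2)`, a session on the start date lands in the training split, one exactly 34 weeks later in validation, and one 36 weeks later in test. Splitting by time rather than at random keeps every test session later than all training sessions.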
Baselines. We compare our short-term context-aware embedding model (SCEM) with four groups of baselines: retrieval models without using context, long-term, short-term and long-short-term context-aware models. Baselines without using context include the production model (PROD) (a state-of-the-art learning-to-rank model) and models that re-rank the results in the candidate set retrieved by PROD by random shuffling (RAND), popularity (POP), the query likelihood model (QL) (Ponte and Croft, 1998), or the query embedding based model (QEM) (CEM with ). Long-term context-aware baselines are the relevance model (RM3) (Lavrenko and Croft, 2017) applied on the titles of the user’s historical purchased products, denoted as LCRM3, and the long-term context-aware embedding model (LCEM), which sets in CEM. RM3 that considers the clicked items in the current query session as positive feedback serves as the short-term context-aware baseline, denoted as SCRM3. (We also tested the embedding-based relevance model (ERM) (Zamani and Croft, 2016) as an embedding-based baseline. However, it does not perform better than RM3 across different settings, so it was not included.) When both long-term and short-term context are incorporated in CEM, i.e., in Equation 1, the model is referred to as the long-short-term context-aware embedding model (LSCEM).
Query sessions with multiple purchases on different pages were split into sub-sessions, one for each page with a purchase. We trained our models with TensorFlow for 20 epochs with 256 samples in each batch. Based on our validation results, we set to 1 for SCRM3, SCEM, and LSCEM; was set to 0.8 for LCRM3 and 1 for LCEM.
Results. Table 5 shows the performance of different methods on multi-page product search. Among all the methods, SCEM and SCRM3 perform better than all the other baselines that do not use short-term context, including their corresponding retrieval baselines, QEM and QL respectively, and PROD, which considers many additional features. This shows the effectiveness of incorporating short-term context. In contrast, long-term context does not help much when combined with queries alone or together with short-term context. LCRM3 outperforms QL on all the datasets by a small margin; LCEM and LSCEM always perform worse than QEM and SCEM respectively when incorporating long-term context.
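The sub-session split mentioned in the preceding paragraph could look like the sketch below. It assumes a session is stored as two page-indexed dictionaries and that each sub-session uses all clicks from earlier pages as context; the data layout and the name `split_into_subsessions` are hypothetical.

```python
from typing import Dict, List, Tuple


def split_into_subsessions(
    clicks_by_page: Dict[int, List[str]],
    purchases_by_page: Dict[int, List[str]],
) -> List[Tuple[int, List[str], List[str]]]:
    """Return one (page, earlier clicks, purchases on that page) tuple per purchase page.

    Assumption: pages without any earlier click are dropped, mirroring the
    dataset filter that requires a click before the purchase page.
    """
    subsessions = []
    for page in sorted(purchases_by_page):
        purchases = purchases_by_page[page]
        if not purchases:
            continue
        earlier_clicks = [
            item
            for p in sorted(clicks_by_page)
            if p < page
            for item in clicks_by_page[p]
        ]
        if earlier_clicks:
            subsessions.append((page, earlier_clicks, list(purchases)))
    return subsessions
```

For a session with clicks `{1: ["a"], 2: ["b"]}` and purchases `{2: ["x"], 3: ["y"]}`, the sketch yields two sub-sessions: page 2 with context `["a"]`, and page 3 with context `["a", "b"]`.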
QL performs similarly to RAND, which indicates that relevance captured by exact word matching is not the key concern in the rank lists of the production model. Most candidate products are consistent with the query intent but the final purchase depends on users’ preference. Popularity, as an important factor that consumers will consider, can improve the performance upon QL. However, it is still worse than the production model most of the time.
We found that neural embedding methods are more effective than word-based baselines. QEM performs significantly better than QL, sometimes even better than PROD. When considering context, SCEM is much more effective than SCRM3. Neural embeddings capture not only semantic similarity but also co-occurrence of clicked and purchased items, which are more beneficial than exact word match for top retrieved items in product search. In addition, these embeddings also carry the popularity information since items purchased more will get more gradients during training.
5. Conclusion and Future Work
We propose an end-to-end context-aware neural embedding model to represent various context dependency assumptions for predicting purchased items in multi-page product search. Our experimental results indicate that incorporating short-term context is more effective than using long-term context or not using context at all. We also show that our neural context-aware model performs better than state-of-the-art word-based feedback models. Our work indicates that multi-page product search is a promising research topic. For future work, it would be better to evaluate our short-term context re-ranking model online, in an interactive setting where each result page can be re-ranked dynamically. Moreover, other information such as images and price can also be included to extract user preferences from their feedback.
Acknowledgements. This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF IIS-1715095. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References
- Ai et al. (2017). Learning a hierarchical embedding model for personalized product search. In SIGIR’17, pp. 645–654.
- Bi et al. (2019). Iterative relevance feedback for answer passage retrieval with passage-level semantic match. In ECIR’19, pp. 558–572.
- Duan et al. (2013). A probabilistic mixture model for mining and analyzing product search log. In CIKM’13, pp. 2179–2188.
- Guo et al. (2018). Multi-modal preference modeling for product search. In 2018 ACM Multimedia, pp. 1865–1873.
- Hu et al. (2018). Reinforcement learning to rank in e-commerce search engine: formalization, analysis, and application. arXiv preprint arXiv:1803.00710.
- Jin et al. (2013). Interactive exploratory search for multi page search results. In WWW’13, pp. 655–666.
- Karmaker Santu et al. (2017). On application of learning to rank for e-commerce search. In SIGIR’17, pp. 475–484.
- Lavrenko and Croft (2017). Relevance-based language models. In ACM SIGIR Forum, Vol. 51, pp. 260–267.
- Lim et al. (2010). Multi-facet product information search and retrieval using semantically annotated product family ontology. Information Processing & Management 46 (4), pp. 479–493.
- Long et al. (2012). Enhancing product search by best-selling prediction in e-commerce. In CIKM’12, pp. 2479–2482.
- Ponte and Croft (1998). A language modeling approach to information retrieval. In SIGIR’98, pp. 275–281.
- Rocchio (1971). Relevance feedback in information retrieval. The Smart retrieval system-experiments in automatic document processing.
- Van Gysel et al. (2016). Learning latent vector spaces for product search. In CIKM’16, pp. 165–174.
- Vandic et al. (2013). Facet selection algorithms for web product search. In CIKM’13, pp. 2327–2332.
- Wu et al. (2018). Turning clicks into purchases: revenue optimization for product search in e-commerce. In SIGIR’18, pp. 365–374.
- Yu et al. (2014). Latent dirichlet allocation based diversified retrieval for e-commerce search. In WSDM’14, pp. 463–472.
- Zamani and Croft (2016). Embedding-based query language models. In ICTIR’16, pp. 147–156.
- Zeng et al. (2018). Multi page search with reinforcement learning to rank. In SIGIR’18, pp. 175–178.