
Aggregated Customer Engagement Model

by Priya Gupta, et al.

E-commerce websites use machine-learned ranking models to serve shopping results to customers. Typically, the websites log customer search events, which include the query entered and the resulting engagement with the shopping results, such as clicks and purchases. Each customer search event serves as input training data for the models, and the individual customer engagement serves as a signal for customer preference. So a purchased shopping result, for example, is perceived to be more important than one that is not. However, new or under-impressed products do not have enough customer engagement signals and end up at a disadvantage when being ranked alongside popular products. In this paper, we propose a novel method of data curation that aggregates all customer engagements within a day for the same query to use as input training data. This aggregated customer engagement gives the models a complete picture of the relative importance of shopping results. Training models on this aggregated data leads to less reliance on behavioral features, which helps mitigate the cold start problem and boosts relevant new products to top search results. We present the offline and online analyses and results comparing the individual and aggregated customer engagement models trained on e-commerce data.





1. Introduction

Retail e-commerce accounts for billions of US dollars in sales worldwide every year. Optimizing product search is critical for customer satisfaction. This means finding the right products to match the customer’s intent, and ranking them in the order that is most important to the customer. Learning to Rank (LTR) (Liu, 2011) is a common approach to rank search results. It is a supervised learning approach that uses past customer behavior or manual human labels as the signal for product preference.

Product search, unlike web search, poses unique challenges in collecting labeled data. While human annotation of web search results can be useful (Le, 2010), it can lead to misleading labels in product search. Different product facets might have different importance to different customers: price might be a primary factor for some, while brand name might be more important to others. The preferences also change based on product type. In studies where human judges were used to rate the products, there was significant disagreement between the human ratings and true customer engagement signals (Shubhra Kanti Karmaker Santu, 2017; Omar Alonso, 2009).

Using past customer behavior as labels in LTR has similar challenges. For the same query, individual customer engagement (ICE) such as a view, click, add-to-cart, add-to-wishlist, or purchase might differ from customer to customer. Hence, ICE-based data leads to different labels for the same query-product pairs. With the new model, we propose using aggregated customer engagements (ACE) within a day for the same query, across different customer search sessions, as labels. The labels now encode the total number of times all customers engaged with a product for the same query within a day. This aggregation gives an estimate of how important a product is in relation to other products, since customer engagement is our proxy for relevance. This relative importance is the fundamental idea behind the pairwise ranking of the LambdaRank algorithm (Burges).

Customer implicit feedback is known to suffer from position and selection bias (Agarwal, 2019). Position bias occurs because people are more likely to examine the products that are ranked high. Since customers can only interact with the products that are presented to them, the implicit feedback is biased towards products that are selected by the current ranking model. This bias leads to the “rich get richer and poor get poorer” phenomenon. It is aggravated by the ranking model’s heavy dependence on behavioral features: features that directly capture historic customer behavior and memorize customer preferences. While behavioral features help with variance reduction when fitting the ranking model to the customer feedback data, over-reliance on them leads to less generalizability. For new or rarely purchased products that do not have enough customer engagement, a model that predominantly ranks based on behavioral features might rank them poorly. In this paper, we show empirically and theoretically that the ACE model relies less on behavioral features and is better at ranking new or under-impressed products. In the long term, this can mitigate the impact of position bias.

The rest of the paper is organized as follows. In Section 2 we describe the methodology behind the ACE model. Then we share results comparing the ICE and ACE models in Section 3. In Section 4, we dive deep into the theoretical explanation of the efficacy of the ACE model. The final Section 5 details our conclusions and future work.

2. ACE Model

For each customer search event with a query $q$, a list of products is returned in the search results. Each of these products $p_i$ has a feature vector $x_i$ associated with it that the LTR model uses for ranking. Customers may engage with some products in the search results. This engagement is used as labels for the training data. In the case of binary labeling, if a customer engaged with a product, its label $y_i = 1$; otherwise it is zero. In the case of the ICE model, each query $q$, the corresponding products in the search results, and the associated feature vectors and labels form an instance of ICE training data. We use millions of instances to train and test the ICE model.

In the ACE model, we aggregate the labels for the same query-product pairs across the different customer search events within the same day:

$$y^{\mathrm{ACE}}_{q,p} = \sum_{s \in S_{q,p,d}} y^{(s)}_{q,p},$$

where $S_{q,p,d}$ is the set of search events on day $d$ in which product $p$ was shown for query $q$. The motivation behind selecting one day as the aggregation window was to have a window large enough that the aggregation includes several instances of query-product pairs even for infrequent queries, but small enough to avoid folding day-over-day trends in customer preferences into the aggregation. The aggregation of labels can lead to unbounded values of $y^{\mathrm{ACE}}_{q,p}$, particularly for popular query-product pairs that occur tens, hundreds, or even thousands of times in the data. So we bucket these aggregated labels using quantiles and map them onto a finite set of integer labels. For the ACE model, each unique daily query, along with all the corresponding distinct search results, their associated features, and the aggregated labels, forms an instance of ACE training data. Like the ICE model, we use millions of ACE instances to train and test the ACE model.
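As a rough sketch, the daily aggregation and quantile bucketing described above could be implemented as follows. The event schema, function name, and bucket count here are illustrative assumptions, not the paper's production pipeline:

```python
from collections import defaultdict

def curate_ace_data(search_events, num_buckets=5):
    """Aggregate per-session engagement labels into daily ACE labels.

    search_events: iterable of (day, query, product_id, label) tuples,
    one per product impression in a customer search event; label is 1
    if the customer engaged with the product, else 0.
    """
    # Sum engagement for each (day, query, product) triple.
    counts = defaultdict(int)
    for day, query, product, label in search_events:
        counts[(day, query, product)] += label

    # Map the unbounded counts onto a finite set of integer labels
    # using quantile boundaries over the observed counts.
    sorted_counts = sorted(counts.values())
    boundaries = [
        sorted_counts[int(q * (len(sorted_counts) - 1))]
        for q in (i / num_buckets for i in range(1, num_buckets))
    ]

    def bucket(count):
        # Number of quantile boundaries the count exceeds.
        return sum(count > b for b in boundaries)

    return {key: bucket(c) for key, c in counts.items()}
```

Several search events for the same query and day collapse into one ACE instance per product; duplicate quantile boundaries simply merge adjacent buckets in this sketch.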

The capping of labels in ACE data has the consequence of limiting the influence of highly popular query-product pairs on the ranking model. In other words, down-sampling the signals from these frequently impressed query-product pairs leads to a ranking model that relies less on behavioral features. As a result, products that have little engagement, either because they are new or preferred by a minority of customers, have a chance to move to top search results. This is how the ACE model helps alleviate the cold start problem that is common in product search.

3. Results

We trained ICE and ACE models on three data sets from a large e-commerce website. Each set had more than 3M data points. We tested the models offline and ran A/B tests online. With each experimental result, we iterated over the ACE method to improve its efficacy at mitigating the cold start problem without impacting other customer engagement related metrics.

Aggregating customer engagement and collapsing several searches with the same query into one data point leads to a larger percentage of distinct queries and products in the data. For example, in one of our data sets, 86% of the queries in the ACE data were distinct, compared with only 49% in the ICE data. Similarly, 89% of the query-product pairs in the ACE data were distinct, compared with 62% in the ICE data. Greater diversity in the data leads to smaller errors and more generalized models (Trevor Hastie, 2009).

During model assessment, we measure the variance reduction in the labels due to each feature selected by the model. This variance reduction is correlated with feature importance: the larger the variance reduction due to a feature, the more the model relies on it for ranking. We compared the total variance reduction due to all the behavioral features in the ACE and ICE models trained on the three data sets. As is clear from Table 1, the ACE models rely less on behavioral features than the ICE models. Although the difference is small, it is enough to allow more textual and product-newness (days since launch date) related features to be picked up by the ACE model. This helps surface products that are better exact matches with the query even if they are new and have not accumulated enough customer-engagement-related behavioral features.

Dataset ICE Model ACE Model
1 85.34% 80.54%
2 76.8% 76.38%
3 83.23% 80.81%
Table 1. Total variance reduction on model prediction due to all behavioral features picked up by feature selection during model development

In the offline analysis of the models, we measured how effective the ACE model was at surfacing newer products (less than seven days old) in the top 16 search results. When compared to the ICE model, on one data set the ACE model served 30% more new-product impressions, while on another it served 110% more. The big jump from 30% to 110% came from an improvement in the feature selection strategy that was applied to the latter data set.

After incorporating the improvements and learnings from the offline analysis, we ran an A/B test on the e-commerce website in the US. We measured the difference in new-product impressions, clicks, and purchases between the ICE and ACE models. For this online experiment, a ’new product’ was defined as any product launched on the e-commerce website in the past three days. The experiment ran for two weeks, during which the ACE model served 12.4% more new-product impressions than the ICE model when measured over millions of customer search sessions. Because of the increase in new-product impressions, the ACE model led to 10.6% more clicks on and 17.54% more purchases of these new products. These results validated our hypothesis that the ACE model can help with the cold start problem.

4. Theoretical Explanation

The ACE model down-samples the signals from highly impressed products whose behavioral features have enough power to predict the customer action well. Thus, models trained on ACE data focus more on non-behavioral features, which leads to better generalizability.

Given a query $q$, let $P$ denote a randomly selected product presented to the customer, and let $Y$ be the random variable that represents the customer action for that pair. The feature vector of the pair can be transformed and decomposed into behavioral features ($X_b$) and non-behavioral features ($X_n$), which are random elements in their respective feature spaces. For $i \in \{b, n\}$, denote the conditional expectations as

$$f_i(X_i) = \mathbb{E}[Y \mid X_i].$$

Then $f_b(X_b)$ and $f_n(X_n)$ are theoretically the best predictions of the customer action given the knowledge of $X_b$ or $X_n$. We also assume the effects have been orthogonalized and the residuals are uncorrelated:

(1)  $\mathbb{E}\big[(Y - f_b(X_b))(Y - f_n(X_n))\big] = 0.$

Suppose the scoring function behind the ranking takes the form $w f_b(X_b) + (1-w) f_n(X_n)$ with a weight parameter $w$. Writing the residual as $w(Y - f_b(X_b)) + (1-w)(Y - f_n(X_n))$, the expected mean squared error can be represented as

(2)  $\mathbb{E}\big[(Y - w f_b(X_b) - (1-w) f_n(X_n))^2\big] = w^2 \sigma_b^2 + (1-w)^2 \sigma_n^2,$

where $\sigma_b^2 = \mathbb{E}[(Y - f_b(X_b))^2]$ and $\sigma_n^2 = \mathbb{E}[(Y - f_n(X_n))^2]$ represent the variances that cannot be explained by $X_b$ alone or $X_n$ alone, respectively; (2) also uses assumption (1), under which the cross term vanishes. Then, solving the quadratic function of $w$ in (2), the optimal weight equals

(3)  $w^* = \dfrac{\sigma_n^2}{\sigma_b^2 + \sigma_n^2}.$

So the model will put less weight on behavioral features if the non-behavioral features get better at reducing the variance (corresponding to a decrease of $\sigma_n^2$) and/or the behavioral features reduce less variance (corresponding to an increase of $\sigma_b^2$).
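The optimal weight $w^* = \sigma_n^2/(\sigma_b^2 + \sigma_n^2)$ can be sanity-checked against a brute-force grid search over the quadratic error $w^2\sigma_b^2 + (1-w)^2\sigma_n^2$. The residual variance values below are made up purely for illustration:

```python
# With uncorrelated residuals, the expected squared error of the combined
# score w*f_b + (1-w)*f_n reduces to MSE(w) = w^2*sigma_b2 + (1-w)^2*sigma_n2.
# Illustrative (made-up) unexplained variances; behavioral is smaller.
sigma_b2, sigma_n2 = 0.1, 0.4

def mse(w):
    return w ** 2 * sigma_b2 + (1 - w) ** 2 * sigma_n2

w_star = sigma_n2 / (sigma_b2 + sigma_n2)          # closed form: 0.8 here
w_grid = min((i / 10000 for i in range(10001)), key=mse)  # brute force

assert abs(w_grid - w_star) < 1e-3  # grid search agrees with closed form
assert w_star > 0.5  # small sigma_b2 means heavy reliance on behavioral features
```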

We claim that for the current data distribution, ACE will decrease $\sigma_n^2$ and increase $\sigma_b^2$. Consequently, the model will put more weight on non-behavioral features.

Suppose there are $m$ customer search sessions with the query $q$ in the training data. Given labels $Y_1, \ldots, Y_m$, the ACE model makes a new label $Y'$ out of $\sum_{j=1}^{m} Y_j$ for the product $P$ and features $X$. For simplicity, we consider the extreme cut-off situation where we define

(4)  $Y' = \min\Big(\sum_{j=1}^{m} Y_j,\ 1\Big),$

that is, $Y' = 1$ if any of the $m$ sessions produced an engagement and $Y' = 0$ otherwise. Then for a feature $X_i$, if we denote $p(X_i) = \mathbb{P}(Y = 1 \mid X_i)$ and assume the $Y_j$ are independent given $X_i$, then for the new label

(5)  $\mathbb{E}[Y' \mid X_i] = 1 - \big(1 - p(X_i)\big)^m.$

Now the variance that cannot be explained by $X_i$ for the label $Y'$ is $\mathbb{E}[e(X_i)]$, where

(6)  $e(X_i) = \mathrm{Var}(Y' \mid X_i) = \Big(1 - \big(1 - p(X_i)\big)^m\Big)\big(1 - p(X_i)\big)^m.$

The derivation of (6) uses the fact that $Y'$ is binary, so $Y'^2 = Y'$ and $\mathrm{Var}(Y' \mid X_i) = \mathbb{E}[Y' \mid X_i]\big(1 - \mathbb{E}[Y' \mid X_i]\big)$. The corresponding quantity for the original label $Y$ is $p(X_i)\big(1 - p(X_i)\big)$, i.e., the special case $m = 1$.
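Under the extreme cut-off, the aggregated label equals 1 exactly when at least one of the $m$ sessions produced an engagement, so its conditional mean is $1 - (1-p)^m$. A quick Monte Carlo check of this closed form; the values of $p$, $m$, and the trial count are illustrative:

```python
import random

# Simulate m i.i.d. Bernoulli(p) session labels and take their cut-off
# aggregate (1 if any session engaged), then compare the empirical mean
# against the analytic value 1 - (1 - p)^m.
random.seed(0)
p, m, trials = 0.1, 10, 200_000

hits = sum(
    any(random.random() < p for _ in range(m)) for _ in range(trials)
)
empirical = hits / trials
analytic = 1 - (1 - p) ** m  # = 0.6513... for p = 0.1, m = 10

assert abs(empirical - analytic) < 0.01
```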

Figure 1. Plot of the unexplained variance $e$ in the ACE model against the probability of customer action $p$ when aggregating different numbers of customer sessions $m$. The figure shows that as the number of aggregated sessions goes up, the unexplained variance quickly drops to zero as the probability of customer action increases.

The error $e(X_i)$ will be small if $p(X_i)$ is close to $0$ or $1$, that is, when, given $X_i$, we are able to tell with more certainty whether the customer will take action or not. On the other hand, it is maximized when $(1 - p(X_i))^m = 1/2$, i.e., at $p(X_i) = 1/2$ for $m = 1$, corresponding to the case where, given $X_i$, we are still essentially guessing the result. So the plot between $e$ and $p$ looks like a bell curve, and with the effect of ACE, the curve skews to the left as the number of aggregated sessions $m$ increases, as shown in Figure 1.
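The left skew can be confirmed numerically: the unexplained variance of the cut-off aggregated label is $q(1-q)$ with $q = 1-(1-p)^m$. The following sketch locates the peak of this curve for several values of $m$; the grid resolution is an arbitrary choice:

```python
# e(p, m) = (1 - (1-p)^m) * (1-p)^m, the conditional variance of the
# cut-off aggregated label. For m = 1 this is the usual p*(1-p), peaking
# at p = 0.5; as m grows the peak moves toward p = 0 and the variance at
# mid-range p collapses toward zero.

def e(p, m):
    surv = (1 - p) ** m  # probability of no engagement in m sessions
    return (1 - surv) * surv

grid = [i / 1000 for i in range(1001)]

def peak(m):
    # Grid location of the maximum unexplained variance for a given m.
    return max(grid, key=lambda p: e(p, m))

assert abs(peak(1) - 0.5) < 1e-9      # classic Bernoulli peak
assert peak(10) < peak(5) < peak(1)   # curve skews left as m grows
assert e(0.5, 10) < 0.01              # mid-range variance nearly vanishes
assert e(0.05, 10) > e(0.05, 1)       # variance grows near the left end
```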

We observe that currently the majority of the variance of the label is explained by the behavioral features. That is, $\sigma_b^2$, the unexplained variance of the behavioral features, is much smaller than $\sigma_n^2$. In order for $\mathbb{E}[e(X_b)]$ to be small, the majority of the mass of the distribution of $p(X_b)$ needs to accumulate at the two ends of $[0, 1]$. Since the label itself skews to the left (most products won’t receive action), we claim that for behavioral features the majority of the mass of the distribution accumulates at the left end of $[0, 1]$. From Figure 1, we see that the error increases at the left end of $[0, 1]$ as $m$ increases. Thus, the unexplained variance for the behavioral features in ACE becomes larger: $\tilde{\sigma}_b^2 > \sigma_b^2$. On the other hand, relatively more of the mass of $p(X_n)$ concentrates in the middle part of $[0, 1]$, as the unexplained variance of $X_n$ is large. From the figure, we see that in the middle part of $[0, 1]$ the error quickly drops to near zero as $m$ increases. Thus, for the non-behavioral features, the unexplained variance becomes smaller: $\tilde{\sigma}_n^2 < \sigma_n^2$. Hence, the optimal weight on the behavioral features in ACE reduces, since

$$\tilde{w}^* = \frac{\tilde{\sigma}_n^2}{\tilde{\sigma}_b^2 + \tilde{\sigma}_n^2} < \frac{\sigma_n^2}{\sigma_b^2 + \sigma_n^2} = w^*.$$

5. Conclusion

In this paper, we described a new way of processing training data for developing ranking models for e-commerce. We showed that aggregating customer engagement across different search sessions for the same query leads to better ranking of new and under-impressed products. This is because the ACE model relies less on behavioral features, which otherwise tend to dominate the ranking. Popular products have the benefit of position and selection bias, leading to strong behavioral features. With a model that down-samples these customer engagement signals, we can mitigate the cold start problem. Products that are under-impressed, perhaps because they are preferred by a minority of customers, also get a chance to appear in the top search results. We plan to conduct more online experiments with the ACE model to further validate the methodology using empirical results.


References

  • A. Agarwal (2019). A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 5–14.
  • C. J. C. Burges. From RankNet to LambdaRank to LambdaMART: an overview. Microsoft Research Technical Report.
  • J. Le (2010). Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation.
  • T. Liu (2011). Learning to rank for information retrieval. Springer, New York, NY.
  • S. M. Omar Alonso (2009). Relevance criteria for e-commerce: a crowdsourcing-based experimental analysis. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • C. Z. Shubhra Kanti Karmaker Santu (2017). On application of learning to rank for e-commerce search. In SIGIR ’17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 475–484.
  • J. F. Trevor Hastie (2009). The elements of statistical learning (2nd edition). Springer, New York, NY.