Hundreds of millions of people around the world visit their e-mail inboxes daily, mostly to communicate with their contacts, although a significant fraction of time is spent checking utility bills, reading newsletters, and tracking purchases. To monetize this overwhelming amount of traffic, e-mail clients typically show display ads in the form of images alongside native e-mail content. Convincing users to exit the “e-mail mode", characterized by a relentless focus on the task of dealing with their mail, and enter a mode where they are willing to click on ads is a challenging task. Effective personalization and targeting, where the goal is to find the best-matching ads to display for each individual user, is essential to tackling this problem, as ads need to be highly relevant to overcome the user's inclination to focus narrowly on the e-mail task. In addition to financial gains for online businesses, proactively tailoring advertisements to the tastes of each individual consumer also leads to an improved user experience, and can help increase user loyalty and retention.
Inbound e-mails are still insufficiently explored and exploited for the purposes of ad targeting, while arguably representing a treasure trove of monetizable data. A recent study showed that only a small fraction of inbound volume represents human-generated e-mails, and that a substantial share of the remaining machine-generated traffic consists of e-mails related to online shopping. Given that a significant percentage of overall traffic has commercial intent, a popular form of targeted advertising is mail retargeting (MRT), where advertisers target users who previously received e-mails from certain commercial web domains. These e-mails present a strong signal useful for targeting campaigns, as they give a broad picture of each customer's interests and relationship with commercial domains. A recent paper proposed to make use of this vast potential and presented a clustering method to generate MRT rules, showing that such rules are more accurate than the ones generated by human experts.
However, in order to go beyond MRT rules that utilize only the fact that users and e-commerce sites communicated, advertisers require more detailed data, such as purchased product names and prices, which are often part of the e-mail body. E-mail clients have been working with e-commerce and travel domains to standardize how e-mails are formatted, resulting in schemas maintained by the schema.org community (http://schema.org/email, accessed June 2015). With more and more e-commerce sites using standard schemas, e-mail clients can provide more personalized user notifications, such as package tracking (yahoomail.tumblr.com/post/107247385566/track-your-packages-now-in-the-yahoo-mail-app, accessed June 2015) and flight details (venturebeat.com/2014/10/22/yahoo-mail-now-tells-you-about-upcoming-flights-and-events, accessed June 2015). In addition, e-mail receipt extraction brings a monetization opportunity through product advertising to users based on their individual purchase history. Availability of purchase data from multiple commercial e-mail domains puts the e-mail provider in the unique position of being able to build better recommendation systems than those based on any one commercial e-mail domain alone. In particular, unlike e-commerce websites that make recommendations of the type “Customers who bought X also bought Y", e-mail providers can make recommendations of the type “Customers who bought X from vendor A also bought Y from vendor B", allowing much more powerful and effective targeting solutions.
In this paper we tell the story of an end-to-end development of product ads for Yahoo Mail. The effort included developing a product-level purchase prediction algorithm, capable of scaling to millions of users and products. To this end, we propose an approach that embeds products into real-valued, low-dimensional vector space using a neural language model applied to a time series of user purchases. As a result, products with similar contexts (i.e., their surrounding purchases) are mapped to vectors that are nearby in the embedding space. To be able to make meaningful and diverse suggestions about the next product to be purchased, we further cluster the product vectors and model transition probabilities between clusters. The closest products in the embedding space from the most probable clusters are used to form final recommendations for a product.
The product prediction model was trained using a large-scale purchase data set comprising many millions of purchases made by millions of users and involving millions of unique products. The model was evaluated on a held-out month, where we tested the effectiveness of recommendations in terms of yield rate. In addition, we evaluated several baseline approaches, including showing popular products to all users, showing popular products in various user groups (called cohorts, specified by user's gender, age, and location), as well as showing products that are historically frequently purchased after the product a user most recently bought. To mitigate the cold-start problem, popular products in the user's cohort were used as back-fill recommendations for users without earlier purchases.
Empirical results show that the proposed product-level model is able to more accurately predict purchases than the baseline approaches. In the experimental section we also share results of bucket tests on live traffic, where we compared the performance of the baselines in the same ad slot. Our method substantially improves key business metrics, measured in terms of clicks and conversions. Moreover, our product-prediction technique was successfully implemented at large scale, tested in live buckets, and finally launched in production. The system is able to make near-real-time product recommendations with low latency. In Figure 1 we show an example of our product recommender, where an offer from walmart.com is suggested after a related purchase was made on amazon.com (marked in red).
2 Related work
In this section we describe related work in mail-based advertising, as well as neural language models that motivated our product recommendation approach.
2.1 Leveraging e-mail data in advertising
The Web environment provides content publishers with a means to track user behavior in much greater detail than in an offline setting, including capturing the user's registered information and activity logs of the user's clicks, page views, searches, website visits, social activities, and interactions with ads. This allows for targeting of users based on their behavior, typically referred to as ad targeting. With the rise of big data applications and platforms, machine learning approaches are heavily leveraged to automate the ad targeting process. Within the past several years, there has been a plethora of research papers that explored different aspects of online advertising, each with the goal of maximizing the benefits for advertisers, content publishers, and users.
For any machine learning technique, the features used to train a model typically have a critical influence on the performance of the deployed algorithm. Features derived from user events collected by publishers are often used in predicting the user's propensity to click or purchase. However, these features represent only a weak proxy for what publishers and advertisers are actually interested in, namely, the user's purchase intent. On the other hand, commercial e-mails in the form of promotions and purchase receipts convey a strong, very direct purchase-intent signal that can enable advertisers to reach a high-quality audience. According to the Direct Marketing Association's “National Client E-mail Report" (www.powerprodirect.com/statistics, accessed June 2015), an e-mail user has been identified as having substantially more value than the average online customer. Despite these encouraging facts, limited work has been done to analyze the potential of features derived from commercial e-mails. A recent work was among the first attempts to investigate the value of commercial e-mail data. The authors applied Sparse Principal Component Analysis (SPCA) to the counts of received e-mails in order to cluster commercial domains, demonstrating the significant promise of this data source.
Outside of the e-mail domain, information about users' purchase histories is extensively used by e-commerce websites to recommend relevant products to their users. Recommendation systems predict which products a user will most likely be interested in, either by exploiting the purchase behavior of users with similar interests (referred to as collaborative filtering) or by using the user's historical interaction with other products (i.e., context-based recommendation). Unlike these studies, we are not limited to the data from a single website, as purchases extracted from e-mails enable us to gather information from hundreds of different e-commerce websites that can be exploited to learn better product predictions. To the best of our knowledge, this work represents the first study that offers a comprehensive empirical analysis and evaluation of product predictors using e-mail data of such scale and nature.
2.2 Neural language models
In a number of Natural Language Processing (NLP) applications, including information retrieval, part-of-speech tagging, chunking, and many others, specific objectives can all be generalized to the task of assigning a probability value to a sequence of words. To this end, language models have been developed, defining a mathematical model to capture statistical properties of words and the dependencies among them [3, 20]. Traditionally, language model approaches represent each word as a feature vector using a one-hot representation, where a word vector has the same length as the size of the vocabulary, with a 1 in the position that corresponds to the observed word and 0 elsewhere. However, this approach often exhibits significant limitations in practical tasks, suffering from the high dimensionality of the problem and severe data sparsity, resulting in suboptimal performance.
Neural language models have been proposed to address these issues, inducing low-dimensional, distributed embeddings of words by means of neural networks [5, 8, 28]. Such approaches take advantage of the word order in text documents, explicitly modeling the assumption that closer words in the word sequence are statistically more dependent. Historically, inefficient training of the neural network-based models has been an obstacle to their wider applicability, given that the vocabulary size may grow to several millions in practical tasks. However, this issue has been successfully addressed by recent advances in the field, particularly with the development of highly scalable continuous bag-of-words (CBOW) and skip-gram (SG) language models [23, 24] for learning word representations. These powerful, efficient models have shown very promising results in capturing both syntactic and semantic relationships between words in large-scale text corpora, obtaining state-of-the-art results on a plethora of NLP tasks.
3 Proposed approach
In this section we describe the proposed methodology for the task of product recommendations, which leverages information about prior purchases determined from e-mail receipts. To address this task we propose to learn representations of products in a low-dimensional space from historical logs using neural language models. Product recommendation can then be performed in the learned embedding space through a simple nearest-neighbor search.
More specifically, given a set of e-mail receipt logs obtained from users, where each user's log is defined as an uninterrupted sequence of receipts and each e-mail receipt consists of purchased products, our objective is to find a low-dimensional real-valued representation of each product such that similar products lie nearby in the vector space.
We propose several approaches for learning product representations that address the specifics of recommendations from e-mail receipts. We first propose the prod2vec method, which considers all purchased products independently. We then propose the novel bagged-prod2vec method, which takes into account that some products are listed as purchased together in e-mail receipts, resulting in better, more useful product representations. Finally, we present product-to-product and user-to-product recommendation models that make use of the learned representations.
3.1 Low-dimensional product embeddings
prod2vec. The prod2vec model involves learning vector representations of products from e-mail receipt logs by treating a purchase sequence as a “sentence” and products within the sequence as “words”, borrowing the terminology from the NLP domain (see Figure 2 for a graphical representation of the model). More specifically, prod2vec learns product representations using the skip-gram model by maximizing the objective function over the entire set S of e-mail receipt logs, defined as follows,

\mathcal{L} = \sum_{s \in S} \sum_{p_i \in s} \sum_{-c \le j \le c,\, j \ne 0} \log \mathbb{P}(p_{i+j} \mid p_i),  (3.1)

where products from the same e-mail receipt are ordered arbitrarily. The probability \mathbb{P}(p_{i+j} \mid p_i) of observing a neighboring product p_{i+j} given the current product p_i is defined using the soft-max function,
\mathbb{P}(p_{i+j} \mid p_i) = \frac{\exp(v_{p_i}^{\top} v'_{p_{i+j}})}{\sum_{p=1}^{P} \exp(v_{p_i}^{\top} v'_p)},  (3.2)

where v_p and v'_p are the input and output vector representations of product p, c is the length of the context for product sequences, and P is the number of unique products in the vocabulary. From equations (3.1) and (3.2) we see that prod2vec models the context of a product sequence, where products with similar contexts (i.e., with similar neighboring purchases) will have similar vector representations. However, prod2vec does not explicitly take into account that an e-mail receipt may contain multiple products purchased at the same time, which we address by introducing a bagged version of prod2vec described below.
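To make the skip-gram formulation above concrete, here is a minimal sketch that generates (center, context) training pairs from a toy purchase sequence and evaluates the soft-max of (3.2); the product names, toy vectors, and helper names are ours, for illustration only:

```python
import numpy as np

def skipgram_pairs(sequence, c):
    # Generate (center, context) pairs from one purchase sequence,
    # using a symmetric context window of length c (cf. eq. (3.1)).
    pairs = []
    for i, center in enumerate(sequence):
        for j in range(max(0, i - c), min(len(sequence), i + c + 1)):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

def softmax_prob(v_center, V_out, target):
    # P(target | center) under the soft-max of eq. (3.2): v_center is
    # the center product's input vector, V_out stacks the output
    # vectors of all P products in the vocabulary.
    scores = V_out @ v_center
    exp = np.exp(scores - scores.max())  # subtract max for stability
    return exp[target] / exp.sum()
```

For example, `skipgram_pairs(["tent", "stove", "lantern"], 1)` yields the four pairs in which each product predicts its immediate neighbors; in practice the soft-max denominator over millions of products is avoided via negative sampling, as discussed below.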
bagged-prod2vec. In order to account for the fact that multiple products may be purchased at the same time, we propose a modified skip-gram model that introduces a notion of a shopping bag. As depicted in Figure 3, the model operates at the level of e-mail receipts instead of at the level of products. Product vector representations are learned by maximizing a modified objective function over e-mail receipt sequences s, defined as follows,

\mathcal{L} = \sum_{s \in S} \sum_{e_m \in s} \sum_{-n \le j \le n,\, j \ne 0} \sum_{p_{mk} \in e_m} \log \mathbb{P}(e_{m+j} \mid p_{mk}).  (3.3)
The probability \mathbb{P}(e_{m+j} \mid p_{mk}) of observing products from the neighboring e-mail receipt e_{m+j} = (p_{m+j,1}, \ldots, p_{m+j,T_{m+j}}), given the k-th product from the m-th e-mail receipt, reduces to a product of probabilities \mathbb{P}(p_{m+j,1} \mid p_{mk}) \cdots \mathbb{P}(p_{m+j,T_{m+j}} \mid p_{mk}), each defined using the soft-max (3.2). Note that the third sum in (3.3) goes over receipts, so the items from the same e-mail receipt do not predict each other during training. In addition, in order to capture the temporal aspects of product purchases we propose to use a directed language model, where as context we only use future products (i.e., only terms with j > 0). This modification allows us to learn product embeddings capable of predicting future purchases.
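The directed, receipt-level context can be sketched as a pair generator (receipt contents and the function name are illustrative): every product in a receipt predicts the products of the following n receipts, and products within one receipt never predict each other.

```python
def bagged_directed_pairs(receipts, n):
    # Directed bagged-prod2vec context: each product in receipt m
    # predicts the products of the n *future* receipts m+1..m+n.
    # Products within the same receipt never predict each other.
    pairs = []
    for m, receipt in enumerate(receipts):
        for center in receipt:
            for future_receipt in receipts[m + 1 : m + 1 + n]:
                for target in future_receipt:
                    pairs.append((center, target))
    return pairs
```

For instance, with receipts [["a", "b"], ["c"], ["d"]] and n = 1, the emitted pairs are (a, c), (b, c), and (c, d); "a" never predicts "b" because they share a receipt.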
Learning. The models were optimized using stochastic gradient ascent, suitable for large-scale problems. However, computation of the gradients in (3.1) and (3.3) is proportional to the vocabulary size P, which is computationally expensive in practical tasks, as P can easily reach millions of products. As an alternative, we used the negative sampling approach, which significantly reduces the computational complexity.
3.2 Product-to-product predictive models
Having learned low-dimensional product representations, we considered several possibilities for predicting the next product to be purchased.
prod2vec-topK. Given a purchased product, the method calculates cosine similarities to all other products in the vocabulary and recommends the top K most similar products.
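A minimal sketch of this nearest-neighbor lookup (the function name and toy vectors are ours); in practice the query product itself should be excluded from the returned list:

```python
import numpy as np

def topk_similar(query_vec, product_vecs, K):
    # Rank all products by cosine similarity to the query product's
    # vector and return the indices of the K most similar ones.
    norms = np.linalg.norm(product_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = (product_vecs @ query_vec) / np.maximum(norms, 1e-12)
    return list(np.argsort(-sims)[:K])
```

At the scale described here (millions of products), an exhaustive scan like this would typically be replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.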
prod2vec-cluster. To be able to make more diverse recommendations, we considered grouping similar products into clusters and recommending products from a cluster that is related to the cluster of the previously purchased product. We applied the K-means clustering algorithm, implemented on top of the Hadoop distributed system, to group products based on the cosine similarity between their low-dimensional representations. We assume that purchasing a product from any of the C clusters after a purchase from cluster c_i follows a multinomial distribution \mathrm{Mult}(\theta_{i1}, \theta_{i2}, \ldots, \theta_{iC}), where \theta_{ij} is the probability that a purchase from cluster c_i is followed by a purchase from cluster c_j. In order to estimate the parameters \theta_{ij}, for each i and j, we adopted a maximum likelihood approach,

\theta_{ij} = \frac{\#\{\text{purchases from } c_i \text{ followed by a purchase from } c_j\}}{\#\{\text{purchases from } c_i\}}.  (3.4)
In order to recommend a new product given a purchased product p, we first identify which cluster p belongs to (e.g., p \in c_i). Next, we rank all clusters c_j, j = 1, \ldots, C, by the value of \theta_{ij} and consider the top ones as the clusters most related to cluster c_i. Finally, products from the top clusters are sorted by their cosine similarity to p, and we use the top K products as recommendations.
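The two steps above, maximum-likelihood estimation of the cluster transition probabilities and ranking of related clusters, can be sketched as follows (the cluster sequences and function names are illustrative):

```python
import numpy as np

def transition_probs(cluster_sequences, C):
    # theta[i, j]: fraction of purchases from cluster i that were
    # immediately followed by a purchase from cluster j, estimated
    # by maximum likelihood from observed purchase sequences.
    counts = np.zeros((C, C))
    for seq in cluster_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1.0)  # rows with no data stay zero

def top_related_clusters(theta, i, top_n):
    # Rank clusters by their probability of following cluster i.
    return list(np.argsort(-theta[i])[:top_n])
```

Products from the returned clusters would then be sorted by cosine similarity to the purchased product, as described in the text.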
3.3 User-to-product predictive models
In addition to product-to-product predictions [16, 17], most recommendation engines allow user-to-product predictions as well [15, 30]. Recommendations for a user are typically made considering historical purchases and/or interests inferred using other data sources such as the user's online behavior, social contacts, and others. In this section we propose a novel approach to simultaneously learn vector representations of products and users such that, given a user, recommendations can be performed by finding the nearest products in the joint embedding space.
user2vec. The user2vec model simultaneously learns vector representations of products and users by considering the user as a “global context”, motivated by the paragraph2vec algorithm. The architecture of this model is illustrated in Figure 4. The training data set was derived from user purchase sequences s \in S, each comprising a user u_n and that user's purchased products ordered by the time of purchase, (p_{n1}, p_{n2}, \ldots, p_{nU_n}), where U_n denotes the number of items purchased by user u_n. During training, user vectors are updated to predict the products from their e-mail receipts, while product vectors are learned to predict other products in their context. For simplicity of presentation and w.l.o.g., in the following we present the non-bagged version of the language model; however, we note that it is straightforward to extend the presented methodology to the bagged version.
More specifically, objective of user2vec is to maximize the log-likelihood over the set of all purchase sequences,
where c is the length of the context for products in the purchase sequence of the n-th user. The probability \mathbb{P}(p_{ni} \mid p_{n,i-c}, \ldots, p_{n,i+c}, u_n) is defined using a soft-max function,

\mathbb{P}(p_{ni} \mid p_{n,i-c}, \ldots, p_{n,i+c}, u_n) = \frac{\exp(\bar{v}^{\top} v'_{p_{ni}})}{\sum_{p=1}^{P} \exp(\bar{v}^{\top} v'_p)},  (3.6)
where v'_{p_{ni}} is the output vector representation of p_{ni}, and \bar{v} is the averaged vector representation of the product context including the corresponding user u_n, defined as

\bar{v} = \frac{1}{2c+1} \Big( v_{u_n} + \sum_{-c \le j \le c,\, j \ne 0} v_{p_{n,i+j}} \Big),  (3.7)
where v_p is the input vector representation of product p. Similarly, the probability \mathbb{P}(u_n \mid p_{n1}, \ldots, p_{nU_n}) is defined as

\mathbb{P}(u_n \mid p_{n1}, \ldots, p_{nU_n}) = \frac{\exp(\bar{v}_n^{\top} v'_{u_n})}{\sum_{u=1}^{U} \exp(\bar{v}_n^{\top} v'_u)},  (3.8)
where v'_{u_n} is the output vector representation of u_n, and \bar{v}_n is the averaged input vector representation of all the products purchased by user u_n,

\bar{v}_n = \frac{1}{U_n} \sum_{i=1}^{U_n} v_{p_{ni}}.  (3.9)
One of the main advantages of the user2vec model is that product recommendations are specifically tailored to each user based on his or her purchase history. However, a disadvantage is that the model needs to be updated very frequently. Unlike product-to-product recommendations, which may remain relevant for longer time periods, user-to-product recommendations need to change often to account for the most recent user purchases.
4 Experiments
The experimental section is organized as follows. We first describe the data set used in the development of our product-to-product and user-to-product predictors. Next, we present valuable insights regarding the purchase behavior of different age and gender groups in different US states. This information can be leveraged to improve demo- and geo-targeting of user groups. This is followed by a section on the effectiveness of recommending popular products in different user cohorts, defined by age, gender, and US state. Finally, we show comparative results of various baseline recommendation algorithms. We conclude with a description of the system implementation at large scale and the bucket results that preceded our product launch.
4.1 Data set
Our data sets included e-mail receipts sent to users who voluntarily opted in to such studies, where we anonymized user IDs. Message bodies were analyzed by automated systems. Product names and prices were extracted from e-mail messages using an in-house extraction tool.
The training data set, collected for the purpose of developing product prediction models, comprised many millions of purchases made by millions of users across a large number of commercial websites. The product vocabulary included the most frequently purchased products above a minimum price threshold.
More formally, the data set was derived by forming e-mail receipt sequences for each user, where each sequence consists of receipts comprising one or more purchased products, together with the receipt timestamps. Predictions were evaluated on a held-out month of user purchases, formed in the same manner. We measured the effectiveness of recommendations using prediction accuracy. In particular, we measured the number of product purchases that were correctly predicted, divided by the total number of purchases. For all models, the accuracy was measured separately on each day, based on the recommendations calculated using prior days. We set a daily budget on the number of distinct recommendations each algorithm is allowed to make for a user, based on an estimate of the optimal number of ads users can effectively perceive daily.
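Under this protocol, the per-day accuracy computation is a simple ratio; a sketch (names ours), assuming each algorithm emits a budgeted list of distinct recommendations per user per day:

```python
def daily_accuracy(recommendations, purchases):
    # Fraction of the day's purchases that appeared in the user's
    # budgeted recommendation list for that day.
    if not purchases:
        return 0.0
    recommended = set(recommendations)
    hits = sum(1 for product in purchases if product in recommended)
    return hits / len(purchases)
```

Per-model daily accuracy is then the average of this quantity over all users with purchases on that day.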
4.2 Insights from purchase data
To get deeper insights into the behavior of online users based on their demographic background and geographic location, we segregated users into cohorts based on their age, gender, and location, and looked at their purchasing habits. Such information can be very valuable to marketers when considering how and where to spend their campaign budget. We only considered users from the US, and computed statistics for each state separately. First, we were interested in differences between male and female shopping behavior for different age ranges. In addition, we were interested in the following aspects of purchasing behavior: 1) percentage of online users who shop online; 2) average number of products bought; 3) average amount spent per user; and 4) average price of a bought item. Figures 5 and 6 illustrate the results.
In Figure 5, for each gender and age group, we show the percentage of online users who shop online, as well as the average prices of bought items. As expected, we see that the male and female demographics exhibit different shopping patterns. Throughout the different age buckets the percentage of female shoppers is consistently higher than that of males, peaking in the 30-34 age range. On the other hand, the percentage of male shoppers reaches its maximum in the 21-24 bucket, and from there drops steadily. In addition, we observe that male users buy more expensive items on average.
Furthermore, in Figure 6 we show per-state results, where darker color indicates higher values. Note that we only show results for different locations and age ranges, as we did not see significant differences between male and female purchasing behavior across states. First, in Figures 6(a) and 6(b) we show the percentage of online shoppers in each state, for younger (18-20) and medium-age (40-44) populations, where we can see that significant differences exist between the states. In Figures 6(c) and 6(d) we show the average number of purchased products per state. Here, for both age buckets we see that the southern states purchase the smallest number of items. However, there is a significant difference between younger and older populations, as in northern states younger populations purchase more products, while in the northeast and the western states this holds true for the older populations. If we look at Figures 6(e) and 6(f), where we compare the average amount of money spent per capita, we can observe similar patterns, where in states like Alaska and North Dakota younger populations spend more money than peers from other states, and for older populations the states with the highest spending per user are California, Washington, and again Alaska. Interestingly, users from Alaska are the biggest spenders, irrespective of the age or metric used. A similar pattern holds when we consider the average price of a purchased item, shown in Figures 6(g) and 6(h).
4.3 Recommending popular products
In this section we evaluate the predictive properties of popular products. Recommending popular products to the global population is a common baseline due to its strong performance, especially during holidays (e.g., Christmas, Thanksgiving). Such recommendations are intuitive, easy to calculate and implement, and may serve in cold-start scenarios for newly registered users for whom we have limited information.
We considered how far back we need to look when calculating popularity, and how long popular products stay relevant. In Figure 7 we give results for popular products calculated over lookback windows of different lengths in the training data, evaluated over the first several days of the test data. To account for ever-changing user purchase tastes, the results indicate that popular products need to be recalculated frequently, using a relatively short lookback window.
Next, we evaluated the prediction accuracy of popular products computed for different user cohorts, and compared it to the accuracy of globally popular products. In Figure 8 we compare accuracies of popular products in different user cohorts, with a fixed lookback window. We can observe that popular products in gender cohorts give a bigger lift in prediction accuracy than popular products in age or state cohorts. Further, jointly considering age and gender when calculating popular products outperforms popular gender products. Finally, including the geographic dimension contributes a further accuracy lift. Overall, the results indicate that in the cold-start scenario the best choice of popular products to recommend are the ones from the user's (age, gender, location) cohort. The results also suggest that popular products should be recalculated often, to avoid a decrease in accuracy with every passing day.
Table 1: Examples of bagged-prod2vec product-to-product recommendations, one column per query product.

| despicable me | first aid for the usmle step 1 | disney frozen lunch napkins |
| monsters university | usmle step 1 secrets 3e | disneys frozen party 9 square lunchdinner plates |
| the croods | first aid basic sciences 2e | disneys frozen party 9oz hotcold cups |
| turbo | usmle step 1 qbook | disneys frozen party 7x7 square cakedessert plates |
| cloudy with a chance of meatballs | brs physiology | disneys frozen party printed plastic tablecover |
| hotel transylvania | rapid review pathology with student consult | disneys frozen party 7 square cakedessert plates |
| brave | lippincotts microcards microbiology flash cards | disney frozen beverage napkins birthday party supplies |
| the smurfs | first aid cases for the usmle step 2 | disney frozen 9 oz paper cups |
| wreckit ralph | highyield neuroanatomy | frozen invitation and thank you card |
| planes | lange pharmacology flash cards third edition | disneys frozen party treat bags |
Table 2: Example predictions of bagged-prod2vec (left) and bagged-prod2vec-cluster (right), where the clustered model's recommendations are drawn from the top related clusters.

| bagged-prod2vec | bagged-prod2vec-cluster | |
| jaws quick spit antifog 1 ounce | cressi neoprene mask strap | |
| cressi neoprene mask strap | cressi frameless mask | cluster 1 |
| cressi frameless mask | cressi scuba diving snorkeling freediving mask snorkel set | |
| akona 2 mm neoprene low cut socks | akona 2 mm neoprene low cut socks | |
| tilos neoprene fin socks | tilos neoprene fin socks | cluster 2 |
| cressi scuba diving snorkeling freediving mask snorkel set | mares equator 2mm dive boots | |
| mares cruise mesh duffle bag | jaws quick spit antifog 1 ounce | |
| us divers island dry snorkel | aqua sphere kayenne goggle with clear lens clear black regular | cluster 3 |
| us divers trek travel fin | aqua sphere kayenne goggle with clear lens black regular | |
| us divers proflex ii diving fins | nikon coolpix aw120 161 mp wi-fi and waterproof digital camera | |
| mares cruise backpack mesh bag | olympus stylus tg ihs digital camera with 5x optical zoom | cluster 4 |
| water gear fin socks | nikon coolpix aw110 wi fi and waterproof digital camera with gps | |
4.4 Recommending predicted products
In this section we experiment with recommending products to users based on neural language models described in Section 3. Specifically, we compare the following algorithms:
1) prod2vec-topK was trained on the training data set, where product vectors were learned by maximizing the log-likelihood of observing other products from the purchase sequences, as proposed in (3.1). Recommendations for a given product were given by selecting the top K most similar products based on cosine similarity in the resulting vector space.
2) bagged-prod2vec-topK was trained on the same data, where product vectors were learned by maximizing the log-likelihood of observing other products from the e-mail receipt sequences, as proposed in (3.3). Recommendations for a given product were given by selecting the top K most similar products based on cosine similarity in the resulting vector space.
3) bagged-prod2vec-cluster was trained similarly to the bagged-prod2vec model, followed by clustering the product vectors into clusters and calculating transition probabilities between them. After identifying which cluster a purchased product p belongs to (e.g., p \in c_i), we rank all clusters by their transition probabilities with respect to c_i. Then, the products from the top clusters are sorted by cosine similarity to p, and the top ones from each cluster are used as recommendations. Example predictions of bagged-prod2vec-cluster compared to predictions of bagged-prod2vec are shown in Table 2. It can be observed that predictions based on the clustering approach are more diverse.
4) user2vec was trained on a data set in which product vectors and user vectors were learned by maximizing the log-likelihood proposed in (3.5) (see Figure 4). Recommendations for a given user were given by calculating the cosine similarity between the user vector and all product vectors, and retrieving the top nearest products.
5) co-purchase. For each ordered product pair we calculated the frequency with which the second product was purchased immediately after the first. Recommendations for a product were then given by sorting these frequencies and retrieving the top products.
Since a user may have multiple products purchased prior to a given day, the separate predictions need to reach a consensus in order to choose the best products to be shown on that day. To achieve this we propose time-decayed scoring of recommendations, followed by the choice of the top products with the highest scores. More specifically, given the user's products purchased prior to that day along with their timestamps, for each product we retrieve the top K recommendations along with their cosine similarity scores. Next, we calculate a decayed score for every recommended product,
where the decay depends on the difference in days between the current day and the purchase time of the product that led to the recommendation, controlled by a decay factor α. Finally, the decayed scores are sorted in descending order and the top K products are chosen as the recommendations for that day.
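A sketch of this consensus step, assuming the decayed score multiplies each cosine similarity by the decay factor raised to the recommendation's age in days, and sums the scores when several past purchases recommend the same product (the exact functional form of the decay is an assumption here; names are ours):

```python
from collections import defaultdict

def decayed_top_products(candidates, current_day, alpha, K):
    # candidates: (recommended_product, cosine_similarity, purchase_day)
    # triples gathered from the user's prior purchases. Older
    # recommendations are discounted exponentially by alpha per day;
    # scores for the same product accumulate across past purchases.
    scores = defaultdict(float)
    for product, sim, day in candidates:
        scores[product] += sim * alpha ** (current_day - day)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:K]
```

With alpha below 1, a weaker but fresher recommendation can outrank a stronger but stale one, which matches the observation in the results that discounting old predictions improves accuracy.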
Training details. Neural language models were trained on a machine with 96GB of RAM and 24 cores. The dimensionality of the embedding space and the context neighborhood size were fixed across all models, and several negative samples were used in each vector update. Similarly to prior work, the most frequent products and users were subsampled during training. To illustrate the performance of the language models, in Table 1 we give examples of product-to-product recommendations computed using bagged-prod2vec, where we see that the neighboring products are highly relevant to the query product (e.g., for “despicable me" the model retrieved similar cartoons).
Evaluation details. Similarly to how the popular-product accuracy was measured, we assumed a fixed daily budget of distinct product recommendations per user. Predictions for a given day are based on purchases from prior days, and we did not consider updating predictions for that day using purchases that happened during the day itself.
Results. In the first set of experiments we evaluated the performance of prod2vec for different values of the decay factor. In Figure 9 we show prediction accuracy on the test data when looking several days ahead. Initial prod2vec predictions were based on the last user purchase in the training data set. The results show that discounting of old predictions leads to improved recommendation accuracy, with an intermediate decay factor being the optimal choice.
Next, we evaluated different product-to-product and user-to-product methods and compared them to the popular-products approach found in Figure 8 to perform the best, namely recommending popular products computed separately for each (state, age, gender) cohort. Results are summarized in Figure 10. We can observe that all neural-based prediction algorithms outperformed the method that recommends popular products. Further, even though the user2vec model had the best overall accuracy on day 1, its predictive power dropped quickly after 3 days.
On the other hand, variants of the prod2vec model did not exhibit such behavior, and their performance remained steady across the first 7 days. Of all prod2vec models, the bagged-prod2vec-cluster model achieved the best performance, indicating that diversity in predictions leads to better results. Finally, predictions based on co-purchases did not perform as well as the neural language models, supporting the "don't count, predict" claim of Baroni et al., who suggest that simple co-occurrence approaches are suboptimal.
4.5 Bucket results
Following the offline evaluation of different product prediction methods, we conducted additional A/B testing experiments on live Yahoo Mail traffic. We ran two buckets with two different recommendation techniques, each on 5% of Yahoo Mail users. In the first bucket, for users with prior purchases the recommendations were based on the bagged-prod2vec-cluster model, while for users without prior purchases they were based on globally popular products. In the second bucket all users were recommended globally popular products. For the purposes of a fair bucket test, both models were retrained and refreshed at the same frequency.
|Metric||Predicted bucket (5% traffic)||Popular bucket (5% traffic)||Control bucket (5% traffic)|
Both test buckets were evaluated against a control bucket in which product ads were replaced with standard ads from the Yahoo Ad serving platform. All three buckets had a significant number of users with prior purchases.
Evaluation was based on the click-through rate (CTR), measured as the number of clicks on product ads that occurred after the targeted recommendation, divided by the total number of shown (or impressed) product ads. In the control bucket, we computed the same metric for standard ads. For the two product-ad buckets we additionally measured the yield rate (YR), where a conversion refers to an actual purchase of the recommended product. Conversions were attributed to an impression for up to a fixed number of hours after the user clicked on or observed a product ad.
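The CTR computation and the conversion-attribution window can be sketched as below. This is a simplified illustration, not the production pipeline; the function names are ours, and the attribution window length is passed as a parameter since the exact value is left unspecified here.

```python
from datetime import timedelta

def click_through_rate(clicks, impressions):
    """CTR = product-ad clicks after targeted recommendation / impressed product ads."""
    return clicks / impressions if impressions else 0.0

def attribute_conversions(impressions, purchases, window_hours):
    """Count purchases attributable to an impression of the same product.

    impressions: list of (product, shown_at) datetimes
    purchases:   list of (product, bought_at) datetimes
    A purchase counts only if it happens within `window_hours` of the impression.
    """
    window = timedelta(hours=window_hours)
    count = 0
    for product, shown_at in impressions:
        count += sum(1 for p, bought_at in purchases
                     if p == product
                     and timedelta(0) <= bought_at - shown_at <= window)
    return count
```

The yield rate is then a ratio of attributed conversions to ad engagement, computed analogously per bucket.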
The results are presented in Table 3. We can make several observations. First, both the popular and the predicted buckets showed better CTR numbers than the control bucket, indicating that users prefer product ads over standard ads. Second, the prediction bucket achieved slightly better CTR than the popular bucket. Finally, the prediction bucket achieved significantly better YR than the popular bucket. This indicates that users often click on popular products out of curiosity without ending up buying them, whereas clicks on targeted products lead to more purchases as they better capture user interests. Overall, the presented results strongly suggest the benefits of the proposed approach for the task of product recommendations.
In addition, in Figure 11 we show how CTR changes over time. Similarly to our offline experiments, it can be observed that popular recommendations become stale much faster than predicted recommendations, as indicated by the steeper CTR drop. They are also more susceptible to novelty bias following model updates, as seen by the larger increase in CTR after each update. Another useful finding was that updating the models only every few days is not sufficient, confirming the findings from Figure 7.
5 System Deployment
In this section we cover the details of the system implementation that led to the final product deployment.
5.1 Implementation details
Due to product requirements for near-real-time predictions, we chose to use product-to-product recommendations with the bagged-prod2vec-cluster model. The model is regularly updated with the most recent purchase data. The product vectors are stored on the Hadoop Distributed File System (HDFS) and updated via a training procedure implemented on Hadoop (https://hadoop.apache.org, accessed June 2015), where we heavily leverage parallel processing with MapReduce. We form purchase sequences using the newest purchased products extracted from e-mail receipts, and incrementally update the product vectors instead of training them from scratch.
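To make the incremental-update idea concrete, the sketch below performs a single skip-gram update with negative sampling against stored product vectors. It is purely illustrative: the production system runs this at scale via MapReduce, and the function name, vector layout, and learning rate here are our own assumptions.

```python
import math

def sgns_update(vec_in, vec_out, center, context, negatives, lr=0.025):
    """One incremental skip-gram update with negative sampling.

    vec_in, vec_out: dicts mapping product id -> list[float] (input/output vectors)
    center, context: a co-purchased product pair from a new purchase sequence
    negatives:       sampled products treated as non-co-occurring
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    v = vec_in[center]
    grad_v = [0.0] * len(v)
    # positive pair gets label 1, negatives get label 0
    for product, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = vec_out[product]
        g = lr * (label - sigmoid(dot(v, u)))
        for i in range(len(v)):
            grad_v[i] += g * u[i]  # accumulate gradient for the input vector
            u[i] += g * v[i]       # update the output vector in place
    for i in range(len(v)):
        v[i] += grad_v[i]
```

Applying such updates only to pairs drawn from the newest purchase sequences is what allows refreshing the vectors without retraining from scratch.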
Popular products were used as back-fill for users who do not have prior purchase history. Following the experimental results, we recalculated popular products periodically, over a fixed lookback window. More specifically, we extracted the most frequently purchased products for every (state, age, gender) cohort, and showed them in randomized order to users without purchase history.
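The cohort-based back-fill described above amounts to the following. This is a simplified sketch with hypothetical function names, not the production job, which runs over the full purchase logs at the stated refresh cadence.

```python
import random
from collections import Counter, defaultdict

def popular_by_cohort(purchases, top_n):
    """Most frequently purchased products per (state, age, gender) cohort.

    purchases: iterable of (state, age, gender, product) records
    Returns a dict mapping each cohort to its top-N products by frequency.
    """
    counts = defaultdict(Counter)
    for state, age, gender, product in purchases:
        counts[(state, age, gender)][product] += 1
    return {cohort: [p for p, _ in c.most_common(top_n)]
            for cohort, c in counts.items()}

def backfill(cohort, popular, rng=random):
    """Return the cohort's popular products in randomized order,
    for users without purchase history."""
    recs = list(popular.get(cohort, []))
    rng.shuffle(recs)
    return recs
```

Randomizing the order avoids always showing the same top product to every cold-start user in a cohort.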
We implemented a multi-tier architecture in order to productionize the product recommendation capability on Yahoo Mail. For data storage we used a custom, high-performance distributed key-value store (similar to Cassandra, http://cassandra.apache.org/, accessed June 2015), which persists user profiles and the product-to-product prediction model. The user profile store uses the user identifier as a key, and stores multiple columns representing the user's prior purchases as values. User profiles are updated hourly with new products extracted from e-mail receipts. Each purchase record persists in memory with a time-to-live (TTL) of 60 days, after which it is discarded.
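A toy in-memory analogue of the user-profile store with per-record TTL could look as follows. The class and method names are illustrative; the real system is a distributed key-value store, and the TTL is taken as a parameter (60 days in production).

```python
import time

class UserProfileStore:
    """Minimal sketch: user id -> prior purchases, each expiring after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.profiles = {}  # user_id -> list of (product, inserted_at)

    def add_purchase(self, user_id, product, now=None):
        now = time.time() if now is None else now
        self.profiles.setdefault(user_id, []).append((product, now))

    def get_purchases(self, user_id, now=None):
        now = time.time() if now is None else now
        # lazily discard records older than the TTL on read
        fresh = [(p, t) for p, t in self.profiles.get(user_id, [])
                 if now - t < self.ttl]
        self.profiles[user_id] = fresh
        return [p for p, _ in fresh]
```

Expiring individual purchase records, rather than whole profiles, keeps recent purchases available while bounding how far back the recommendations can reach.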
Given the predictions for a certain user, once the user logs into the e-mail client we show a new recommendation after every user action, including clicking on folders, composing an e-mail, and searching the inbox. Recommendations are cycled in the order of their decayed prediction scores, as described in Section 4.4. The product ads are implemented in the so-called "pencil" ad position, just above the first e-mail in the inbox (Figure 1).
6 Conclusions and Future Work
We presented the details of a large-scale product recommendation framework that was launched on Yahoo Mail in the form of product ads. We discussed the recommendation methodology, as well as the high-level implementation details behind our system. To perform recommendations, we described how we employed neural language models capable of learning product embeddings for product-to-product predictions, as well as user embeddings for user-to-product predictions. Several variants of the prediction models were tested offline and the best candidate was chosen for an online bucket test. Following the encouraging bucket test results, we launched the system in production. In our ongoing work, we plan to utilize implicit feedback available through ad views, ad clicks, and conversions to further improve the performance of our product recommendation system.
-  A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, pages 114–122, 2011.
-  J. Alba, J. Lynch, B. Weitz, C. Janiszewski, R. Lutz, A. Sawyer, and S. Wood. Interactive home shopping: consumer, retailer, and manufacturer incentives to participate in electronic marketplaces. The Journal of Marketing, pages 38–53, 1997.
-  R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM press New York, 1999.
-  M. Baroni, G. Dinu, and G. Kruszewski. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 238–247, 2014.
-  Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer, 2006.
-  A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
-  C.-H. Cho and H. J. Cheon. Why do people avoid advertising on the internet? Journal of advertising, 33(4):89–97, 2004.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
-  N. Djuric, V. Radosavljevic, M. Grbovic, and N. Bhamidipati. Hidden conditional random fields with distributed user embeddings for ad targeting. In IEEE International Conference on Data Mining, Dec 2014.
-  N. Djuric, H. Wu, V. Radosavljevic, M. Grbovic, and N. Bhamidipati. Hierarchical neural language models for joint representation of streaming documents and their content. In International World Wide Web Conference (WWW), 2015.
-  D. Essex. Matchmaker, matchmaker. Communications of the ACM, 52(5):16–17, 2009.
-  M. Grbovic, N. Djuric, V. Radosavljevic, and N. Bhamidipati. Search retargeting using directed query embeddings. In International World Wide Web Conference (WWW), 2015.
-  M. Grbovic, G. Halawi, Z. Karnin, and Y. Maarek. How many folders do you really need?: Classifying email into a handful of categories. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 869–878. ACM, 2014.
-  M. Grbovic and S. Vucetic. Generating ad targeting rules using sparse principal component analysis with constraints. In international conference on World Wide Web, pages 283–284, 2014.
-  D. J. Hu, R. Hall, and J. Attenberg. Style in the long tail: discovering unique interests with latent variable models in large scale social e-commerce. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1640–1649. ACM, 2014.
-  J. Katukuri, T. Konik, R. Mukherjee, and S. Kolay. Recommending similar items in large-scale online marketplaces. In Big Data (Big Data), 2014 IEEE International Conference on, pages 868–876. IEEE, 2014.
-  J. Katukuri, R. Mukherjee, and T. Konik. Large-scale recommendations in a dynamic marketplace. In Workshop on Large Scale Recommendation Systems at RecSys, volume 13, 2013.
-  R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In Proceedings of the 31th International Conference on Machine Learning, 2014.
-  R. Kiros, R. S. Zemel, and R. Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. arXiv preprint arXiv:1406.2710, 2014.
-  V. Lavrenko and W. B. Croft. Relevance based language models. In SIGIR, pages 120–127. ACM, 2001.
-  Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.
-  G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, Jan. 2003.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
-  B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. arXiv preprint arXiv:1403.6652, 2014.
-  D. Riecken. Personalized views of personalization. Communications of the ACM, 43(8):27–28, 2000.
-  R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.
-  J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
-  Y. Zhang and M. Pennacchiotti. Predicting purchase behaviors from social media. In WWW, pages 1521–1532, 2013.
-  X. W. Zhao, Y. Guo, Y. He, H. Jiang, Y. Wu, and X. Li. We know what you want to buy: a demographic-based system for product recommendation on microblogs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1935–1944. ACM, 2014.