Modeling Complementary Products and Customer Preferences with Context Knowledge for Online Recommendation

03/16/2019 ∙ by Da Xu, et al. ∙ WALMART LABS

Modeling item complementariness and user preferences from purchase data is essential for learning good representations of products and customers, which empowers the modern personalized recommender system for Walmart's e-commerce platform. The intrinsic complementary relationship among products captures the buy-also-buy patterns and provides a rich source for recommendations. Product complementary patterns, though often reflected by population purchase behaviors, are not separable from customer-specific bias in purchase data. We propose a unified model with a Bayesian network structure that accounts for both factors. At the same time, we merge the contextual knowledge of both products and customers into their representations. We also use dual product embeddings to capture the intrinsic properties of complementariness, such as asymmetry. Separating hyperplane theory sheds light on the geometric interpretation of using the additional embedding. We conduct extensive evaluations of our model before final production, and propose a novel ranking criterion based on product and customer embeddings. Our method compares favorably to existing approaches in various offline and online tests, and case studies demonstrate the advantage and usefulness of the dual product embeddings as well as the user embeddings.




1. Introduction

With tens of millions of products available on Walmart’s e-commerce platform, our recommender system aims to provide customers with efficient and personalized recommendations amid a huge volume of information. Many modern recommender systems provide personalized recommendations based on customers’ explicit and implicit preferences, where content-based (Soboroff and Nicholas, 1999) and collaborative filtering systems (Koren and Bell, 2015) have been widely applied to e-commerce (Linden et al., 2003), online media (Saveski and Mantrach, 2014), social networks (Konstas et al., 2009), etc. In recent years, representation learning methods have quickly gained popularity in the online recommendation literature. Alibaba (Wang et al., 2018) and Pinterest (Ying et al., 2018) have deployed large-scale recommender systems based on their trained product embeddings. YouTube also uses its trained video embeddings as part of the input to a deep neural network (Covington et al., 2016). Different from many other use cases, the platform offers a huge variety of products, and nowadays customers shop on it for all-around demands from electronics to daily groceries, rather than for specific preferences on narrow categories of products. Therefore, understanding the intrinsic relationships among products (Zheng et al., 2009) while taking care of individual customer preferences sets the tone for our work. A topic modelling approach was recently proposed to infer complements and substitutes of products as a link prediction task, using the extracted product relation graphs of Amazon (McAuley et al., 2015).

For our platform, the product complementary relationship characterizes co-purchase patterns. For example, customers who purchase a new TV often purchase HDMI cables next, and then purchase cable adaptors. Here HDMI cables are complementary to the TV, and cable adaptors are complementary to the HDMI cables. By virtue of this example we motivate several properties of the complementary relationship:

  • Asymmetric. HDMI cable is complementary to TV, but not vice versa.

  • Non-transitive. Though HDMI cables complement new TV and cable adaptors complement HDMI cable, cable adaptors are not complementary to TV.

  • Transductive. HDMI cables are also likely to complement other TVs with similar model and brand.

  • Higher-order. The complementary products for (TV, HDMI cable) combo can be different from their individual complements.

Although we would like to design our machine learning algorithm according to the above properties, special attention is needed for the complexity incurred by noise and sparsity issues in the collected data. The former mainly refers to the low signal-to-noise ratio in customers’ purchase sequences, which are our main source for learning product relationships and customer preferences. Notably, it is impossible to directly extract ideal complementary purchase sequences from data, whether we use purchases within a certain time span or consecutive purchases. There is simply no guarantee that the purchased products are all related. We list a customer’s single-day purchases as a sequence for illustration:

{Xbox, games, toothbrush, toothpaste, pencil, notepad}.

Although the pairs (Xbox, games), (toothbrush, toothpaste) and (pencil, notepad) in the sequence are strong signals for modeling the complementary patterns among items, the pairs (games, toothbrush) and (toothpaste, pencil) are noise. To deal with this, we introduce a customer-product interaction term that directly accounts for the noise and, at the same time, achieves the goal of modelling personalized preferences. Consider the set of users (customers) and the set of items (products), a categorical random variable for the user, and a sequence of categorical random variables for the items that represents the consecutive purchases before a given time. If we estimate the conditional probability of the next purchase given the user and this sequence with a softmax classifier and a score function, then following the preceding arguments we may assume that the score function consists of a user-item preference term and an item complementary pattern term, as shown in (1).


The first term accounts for user-item preference (bias) and the second characterizes the strength of the complementary pattern. When the complementary pattern of a purchase sequence is weak, i.e. the complementary pattern term is small, the user-item preference term is enlarged by the model, and vice versa.
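One way to write the decomposition referred to as (1), with every symbol name chosen here for illustration rather than taken from the original notation: let $u$ be the user, $i_{t-1},\dots,i_{t-k}$ the $k$ most recent purchases, and $i_t$ a candidate next item. The score is assumed to be a plain sum of a user-item term $f_{\mathrm{UI}}$ and an item-pattern term $f_{\mathrm{I}}$, fed to a softmax over candidate items.

```latex
S\bigl(i_t \mid u, i_{t-1},\dots,i_{t-k}\bigr)
  = f_{\mathrm{UI}}(i_t, u) + f_{\mathrm{I}}\bigl(i_t;\, i_{t-1},\dots,i_{t-k}\bigr),
\qquad
p\bigl(i_t \mid u, i_{t-1},\dots,i_{t-k}\bigr)
  = \frac{\exp S\bigl(i_t \mid u, i_{t-1},\dots,i_{t-k}\bigr)}
         {\sum_{i'} \exp S\bigl(i' \mid u, i_{t-1},\dots,i_{t-k}\bigr)} .
```

Under this form, a small $f_{\mathrm{I}}$ leaves the observed purchase to be explained mostly by $f_{\mathrm{UI}}$, which matches the trade-off described in the text.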

On the other hand, the sparsity issue is more or less standard for recommender systems, and various techniques have been developed for both content-based and collaborative filtering systems (Papagelis et al., 2005; Huang et al., 2004). Specifically, it has been shown that modelling with contextual information boosts performance in many cases (Melville et al., 2002; Balabanović and Shoham, 1997; Hannon et al., 2010).

Representation learning with shallow embeddings has given rise to several influential works in natural language processing (NLP) and graph representation learning. The skip-gram (SG) and continuous bag of words (CBOW) models (Mikolov et al., 2013) and their variants, including GloVe (Pennington et al., 2014), fastText (Bojanowski et al., 2017) and Doc2vec (paragraph vector) (Le and Mikolov, 2014), have been widely applied to learn word-level and sentence-level embeddings. While classical node embedding methods such as Laplacian eigenmaps (Belkin and Niyogi, 2003) and HOPE (Ou et al., 2016) arise from deterministic matrix factorization, many recent works such as node2vec (Grover and Leskovec, 2016) take a stochastic perspective using random walks. However, we point out that the word and sentence embedding models target semantic and syntactic similarities, while the node embedding models aim at topological similarities. These relationships are all symmetric and mostly transitive, as opposed to the complementary relationship. Furthermore, the transductive property of complementariness requires similar products to have similar complementary products, which suggests that we also learn product similarity explicitly or implicitly.

We propose the novel idea of using dual embeddings for products. While both sets of embeddings are used to model complementariness, product similarities are implicitly represented in one of the embeddings by merging product contextual knowledge. Case studies and empirical results show that the dual product embeddings are capable of capturing the desired properties of complementariness. In Section 3, we provide the model details, our ranking criteria and the geometric interpretation from the separating hyperplane perspective. We conduct intensive offline and online evaluations with embeddings trained for 5 million customers and 2 million items. We significantly improve the metrics in various test settings compared with the current online recommendation model. The customer embeddings, which are designed to learn the personal preferences that deviate from population behavior, are analyzed in downstream supervised and unsupervised learning tasks for user segmentation.

2. Contributions and Related Works

Compared to previously published works on learning embeddings for products/customers or modelling product relationships on Web-scale data, our novel contributions are summarized below.

Model product complementary relationship together with customer preferences - Since product complementary patterns are often entangled with customer preferences, we consider both factors in modelling purchase sequences. The previous work on inferring complements and substitutes with a topic model on Amazon data relies on the extracted graph of product relationships, and therefore customers are not involved (McAuley et al., 2015). The approach proposed by Alibaba first constructs the weighted product co-purchase count graph from purchase records and then applies a node embedding model (Wang et al., 2018). Customer preferences are therefore not taken into consideration after the aggregation. The same argument applies to work on item-based collaborative filtering (Sarwar et al., 2001) and product embedding (Vasile et al., 2016).

Use dual product embedding spaces to model complementariness - A single embedding space may not be able to model the asymmetric property of complementariness, especially when embeddings are treated as projections to a lower-dimensional real inner-product vector space, where inner products are symmetric. Although YouTube’s work takes both user preference and video viewing patterns into consideration, it does not explore complementary relationships (Covington et al., 2016). PinSage, the graph convolutional neural network recommender system of Pinterest, also assumes symmetric relations between its pins (Ying et al., 2018). To the best of our knowledge, we are the first to use dual embedding spaces in modelling product relationships for recommender systems.

Besides the above major contributions, we also propose a fast inference algorithm to deal with the cold-start problem (Lam et al., 2008; Schein et al., 2002). With a large incoming volume of new items from vendors on a daily basis, inferring embeddings for cold-start items without retraining the whole model is crucial for our platform. Different from the solutions for collaborative filtering using matrix factorization methods (Zhou et al., 2011; Bobadilla et al., 2012), we utilize product contextual features. This is in accordance with recent work that finds contextual features play an important role in mitigating the cold-start problem (Saveski and Mantrach, 2014; Gantner et al., 2010).

3. Method

In this section we introduce the technical details of our method. We first list the different components of our data, clarify their relationships, and define the embeddings and score functions in Section 3.1. We put together the whole framework for learning the embeddings under a Bayesian network structure in Section 3.2. We then discuss the best ranking criterion for our embedding approach in Section 3.3. The geometric interpretation of using dual product embeddings is discussed in Section 3.4. Finally we present our fast inference algorithm for cold-start products in Section 3.5.

3.1. Setup

We again consider the set of users and the set of items. The contextual features for items, such as brand, title and description, are treated as words. Labels from the catalog hierarchy tree and discretized continuous features are treated as tokens. Without loss of generality, for each item we concatenate its words and tokens into a single vector; the vocabulary is the whole set of words and tokens. Similarly, each user has a feature vector drawn from the whole set of discretized user features. A complete observation for a user is made up of the first purchase after a given time and the previous consecutive purchases, together with the associated contextual features.

Our main goal is to predict the next purchase according to the user and the most recent purchases. To merge contextual knowledge into item/user representations, we also use these representations to predict contextual features. Our formulation for each target above is similar to that of SG and CBOW:

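  • Next purchase given the user and the recent purchases - Predict the next-to-buy item.

  • Contextual features given the item - Predict items’ contextual features. We adopt the factorization from CBOW, under which the probability of a feature vector factorizes over its individual words and tokens.

  • User features given the user - Predict users’ features. We also factorize this term over the individual features.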

In SG/CBOW models, estimating the conditional probabilities is treated as a multi-class classification problem with a softmax classifier, where the score function gives the similarity of two items according to their embeddings, such as their inner product. We therefore define the embedding parameters for items, users, words and user features, as well as the score functions.

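  • Each item has dual embeddings, which we refer to as the item-in embedding and the item-out embedding. The inner product of the item-in embedding of one item with the item-out embedding of another measures the complementariness of the second item given the first, and swapping the roles of the two items gives the complementariness in the opposite direction.

  • Each user has a user embedding and each item has an item-user context embedding, such that the inner product of the two measures the personalized preference of the user for the item.

  • Each word has a word embedding, such that its inner product with an item-in embedding measures the relatedness between the item and the word.

  • Each discretized user feature has a feature embedding, such that its inner product with a user embedding measures the relatedness between the feature and the user.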

Although the word and user feature embeddings explicitly relate contextual knowledge to item and user embeddings, it remains unclear how to define the score function for the complementary item of a purchase sequence, i.e. how to score a candidate item against several previous purchases at once. A similar problem has also been noted in other embedding-based recommender systems. PinSage has experimented with mean pooling and max pooling as well as using an LSTM (Ying et al., 2018). YouTube uses mean pooling (Covington et al., 2016), and another work from Alibaba proposes using the attention mechanism (Zhou et al., 2018). We choose simple pooling methods over the others for interpretability and speed. Our preliminary experiments show that mean pooling consistently outperforms max pooling, so we settle on the mean pooling shown in (2).


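It is straightforward to see that the user preference score and the mean-pooled complementariness score are realizations of the user-item preference term and the item complementary pattern term mentioned in (1).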
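The score described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function and argument names are hypothetical, and it assumes both terms are plain inner products, with the recent purchases represented by the mean of their item-in embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def score(
    user_emb: Vector,                  # embedding of the user
    item_user_emb: Vector,             # item-user embedding of the candidate item
    recent_item_in: Sequence[Vector],  # item-in embeddings of the recent purchases
    candidate_item_out: Vector,        # item-out embedding of the candidate item
) -> float:
    """User-item preference term plus mean-pooled complementariness term."""
    preference = dot(user_emb, item_user_emb)
    complementariness = dot(mean_pool(recent_item_in), candidate_item_out)
    return preference + complementariness
```

Because the pooled vector is built only from item-in embeddings and the candidate contributes only its item-out embedding, swapping the roles of two items generally changes the score, which is how the asymmetry of complementariness can be represented.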

3.2. Model

According to the target conditional probabilities in the previous section, the joint probability function of the full observation has an equivalent Bayesian network representation (Wainwright et al., 2008) (Figure 1). Under the Bayesian network structure, the log probability function factorizes (decomposes) as shown in (3). Since we do not model the marginal distributions with embeddings, we treat them as constant terms in (3).


In analogy to the SG/CBOW model, each term in (3) can be treated as a multi-class classification problem with a softmax classifier. However, careful scrutiny reveals that such a formulation is not suited to our case. To be concrete, using a softmax classification model for the next-purchase term implies that, given the user and recent purchases, we predict only one next-to-buy complementary item. However, it is very likely that there are several complementary items for the purchase sequence. The same argument holds when modelling item words and user features as a multi-class classification problem. We notice that a similar issue was also raised in fastText (Mikolov et al., 2013).

Figure 1. The Bayesian network representation of the joint probability distribution of a complete observation.

Just like fastText, we treat each term in (3) as a binary logistic classification problem. The semantics of the next-purchase term now become: given the user and recent purchases, is a given item purchased next? Similarly, the item context term now indicates whether or not the item has the given context words. The loss function for each complete observation using binary logistic loss is given in (4).



Recall that the user-item score models user preferences and the mean-pooled item score models item complementary patterns. The terms in the first line of (4) therefore optimize the user embedding and item-user embedding together with the dual item embeddings, according to the observed user purchase sequences. The second line merges contextual knowledge into the item-in embedding, such that items with similar word contexts are learnt to be close to each other. This is analogous to the CBOW version of paragraph vector (Le and Mikolov, 2014): if we treat items as documents, then the inner product of two item-in embeddings measures the contextual similarity between the two items. Furthermore, the transductive property of complementariness is implied at the same time, because if one item strongly complements another and a third item lies close to the latter in the item-in embedding space, then the first item is also likely to complement the third. The same argument applies to the third line, where users with similar features have closer representations in the user embedding space.

We notice that the binary classification loss requires summing over all possible negative instances, which is computationally impractical. We therefore use negative sampling as an approximation for all binary logistic loss terms, with either the frequency-based or the hierarchical softmax negative sampling scheme proposed in (Mikolov et al., 2013). For instance, the first line in (4) is approximated by (5) using a set of negative item samples.
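To make the negative-sampling approximation concrete, the sketch below computes a binary logistic loss for one observed next purchase and a handful of sampled negative items. It is an illustration under assumptions: the function names are hypothetical, the scores are taken as already computed, and the standard word2vec-style form (one positive term plus one term per negative sample) is assumed rather than copied from (5).

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(
    positive_score: float,             # score of the item that was actually purchased next
    negative_scores: Sequence[float],  # scores of the sampled negative items
) -> float:
    """Binary logistic loss: push the positive score up and the negative scores down."""
    loss = -log_sigmoid(positive_score)
    for s in negative_scores:
        loss -= log_sigmoid(-s)
    return loss
```

The cost per observation grows with the number of sampled negatives instead of with the full catalogue size, which is the reason for the approximation.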


3.3. Ranking

In online recommendation we provide top-ranked items for each user based on their recent purchases and/or personal preferences. We use the dual item embeddings, the user embedding and the item-user embedding for ranking candidate items. We experiment with three ranking criteria and evaluate them analytically and numerically.

The first criterion considers only user preference, ranking items according to the user-item preference score. As expected, items previously bought by the user have top ranks. We also observe that popular items are ranked lower. We believe this is because, during training, the item-user embeddings of popular items try to match too many user embeddings and cannot stay close to any single user. Offline and online tests show that recommendation based on this criterion performs poorly, since most customers have more complicated shopping behaviors than repeating their purchases.

The second criterion considers both user preference and the item complementary relationship, ranking items according to the full score function that sums the two terms. Quite unexpectedly, the performance turns out to be mediocre. After intensive case studies we conclude that the reason is as follows. For a purchase sequence with its strong complementary item defined in (6), the user preference term sometimes drives the top recommendation slightly away from the strong complementary item to compensate for user preference.


Through case-control studies, we find that the user preference term is still essential for learning the item-level complementary relationship (see Section 4.3), despite its adverse effect on ranking.

Motivated by our analysis of the above criteria, we propose a user-aggregated criterion inspired by the probabilistic formulation of the next purchase given the purchase sequence alone. Notice that this conditional probability is never directly estimated without users, but under the Bayesian network setting (Figure 1) we can make inference on it, as we show in (7).


The approximation step in (7) is made under the assumption that, when the sequence length is adequately large, the user conditioned on any meaningful purchase sequence approximately follows a uniform distribution over users. As a matter of fact, this conditional distribution gives the probability of a purchase sequence being made by each user. With 2 million items and 5 million users in the training dataset, when the sequence length is adequately large, such as 7 (which we use in our training), it becomes highly unlikely that a specific 7-item sequence is observed for more than a few users. This allows us to approximate the conditional distribution of the user given the sequence by the uniform distribution. Motivated by (7), we propose the user-aggregated criterion given by (8).


Here the user-aggregated term is the average user preference for an item over all users, which can be interpreted as item popularity. The criterion in (8) therefore combines item popularity with item complementariness. With this ranking criterion we observe the best performance in online and offline tests, which is not surprising since item popularity often plays an important role in users’ decision making. We also observe that rescaling the embeddings to unit length gives slightly better performance.
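A user-aggregated ranking of this kind can be sketched as follows. This is only an illustration: the names are hypothetical, and it assumes the criterion is the sum of the average user preference (popularity) and the mean-pooled complementariness, with all embeddings rescaled to unit length first. Since the inner product is linear, averaging the preference over users equals the inner product with the mean user embedding, so the mean is computed once.

```python
import math
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def unit(v: Vector) -> List[float]:
    """Rescale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: Sequence[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_candidates(
    recent_item_in: Sequence[Vector],    # item-in embeddings of the recent purchases
    item_out: Dict[Hashable, Vector],    # candidate id -> item-out embedding
    item_user: Dict[Hashable, Vector],   # candidate id -> item-user embedding
    user_embs: Sequence[Vector],         # embeddings of all users
    top_n: int,
) -> List[Hashable]:
    """Rank candidates by popularity plus complementariness (assumed form)."""
    mean_user = mean([unit(u) for u in user_embs])
    pooled = mean([unit(z) for z in recent_item_in])
    scores = {}
    for item_id, out_vec in item_out.items():
        popularity = dot(mean_user, unit(item_user[item_id]))
        complementariness = dot(pooled, unit(out_vec))
        scores[item_id] = popularity + complementariness
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In a production setting the popularity term depends only on the item and could be precomputed, leaving one inner product per candidate at request time.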

3.4. Geometric Interpretation

Figure 2. Geometric illustration of the separating hyperplane interpretation, for both single-item and two-item purchase sequences. For a given complementary item, individual items and item combos each act as positive or negative samples, where a combo is represented by the centroid of its two items. The item-out embedding of the complementary item is optimized such that positive/negative samples have positive/negative distances to the separating hyperplane characterized by its normal vector and intercept (not shown here).

In this section we give our geometric interpretation of the additional item-out embedding and the user-aggregated criterion. The concept of dual item embeddings is not often seen in the embedding-based recommender system literature, but here it is essential for modelling the intrinsic properties of complementariness. Suppose all embeddings are fixed except the item-out embedding of a given item. According to classical separating hyperplane theory, this vector is the normal of the separating hyperplane for that item with respect to the embeddings of positive and negative purchase sequences. In other words, the hyperplane tries to identify, for the item as a next-to-buy complementary item, which previous purchase sequences are positive and which are not.

Consider the total loss function, i.e. the sum of the per-observation losses in (4) over all observed purchase sequences. In the loss function in (4), item-out embeddings appear only in the first two terms. Using the separability of the total loss, we can collect all terms that involve the item-out embedding of a given item, as we show in (9). For clarity, (9) is written for sequences of length one, i.e. only the most recent purchase is included.


In (9), the first index set contains the item-user pairs from the observed two-item purchase sequences in which the given item is purchased next, and the second contains the other pairs for which the item is used as one of the negative samples in (5). The scalar attached to each pair is the preference of the corresponding user for the item. It then follows that optimizing the item-out embedding in (9) is equivalent to solving a logistic regression problem with the item-out embedding as the regression parameters. The design matrix is constructed from the fixed item-in embeddings. One difference here is that the intercept terms are fixed. Analytically, this means that we use users’ preferences for the item as intercepts when using their purchase sequences to model the complementary relationship of the item to other items. Theoretically, this suggests that the optimized separating hyperplane is not intercept-free. Though we do not explicitly learn the intercept term, we may simply take the population mean of the user preferences as the empirical intercept of the hyperplane, which is also the item popularity described before.

By virtue of logistic regression, the regression parameter vector gives the normal to the optimized separating hyperplane, under the intercept above. Since we first scale all embeddings to unit length in the ranking stage, the user-aggregated ranking criterion is actually the distance of the previously purchased item (represented by its item-in embedding) to the separating hyperplane in the item-in embedding space. We provide a sketch of this concept in Figure 2.

When more than one previous purchase is included, we replace the single item-in embedding in (9) with the mean pooling of the purchase sequence. Geometrically speaking, we now optimize the separating hyperplane with respect to the centroids of the positive and negative purchase sequences in the item-in embedding space. A sketch is also provided in Figure 2. Representing sequences by their centroids helps us capture higher-order complementariness beyond the pair-wise setting. We also provide case studies to illustrate this point in Section 4.4.
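The distance interpretation can be written out explicitly. The symbols below are chosen here for illustration and are not the original notation: $\tilde z_i$ is the item-out embedding of the candidate item $i$, $b_i$ its empirical intercept (the average user preference for $i$), and $\bar z$ the centroid of the item-in embeddings of the purchase sequence. The signed distance of the centroid to the hyperplane $\{x : \langle \tilde z_i, x\rangle + b_i = 0\}$ is

```latex
d_i(\bar z)
  = \frac{\langle \tilde z_i, \bar z \rangle + b_i}{\lVert \tilde z_i \rVert}
  = \langle \tilde z_i, \bar z \rangle + b_i
  \quad \text{when } \lVert \tilde z_i \rVert = 1 .
```

With unit-length normals, ranking candidates by this signed distance has the same form as adding a popularity term to a complementariness term.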

3.5. Item Cold Start Inference

The other advantage of modelling with item contextual features is that we are able to infer the item-in embedding for cold-start items using only the trained context embeddings. For a cold-start item with known contextual features, we infer its item-in embedding according to (10), so that it lies close to similar items in the item-in embedding space. As a consequence, they are more likely to share similar complementary items.


Inferring the item-out embedding for a cold-start item, however, is not of concern. For the recommender system on our platform, the main challenge lies in the cases where users purchase cold-start items or add them to the cart. When that occurs, we should still be capable of recommending complementary items. On the other hand, with 2 million frequent items in the training dataset covering all categories, we almost never have to recommend cold-start items to users. Case studies for cold-start item recommendation are also provided in Section 4.4.
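As a rough illustration of inferring an item-in embedding from context alone, the sketch below averages the trained embeddings of an item's known words and tokens and rescales the result to unit length. This averaging is a deliberately simple stand-in and is not claimed to be the inference rule in (10); the function name and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def infer_item_in_embedding(
    features: Sequence[str],                 # words/tokens of the cold-start item
    word_embs: Dict[str, Sequence[float]],   # trained word/token embeddings
) -> List[float]:
    """Average the embeddings of the known features and rescale to unit length."""
    known = [word_embs[f] for f in features if f in word_embs]
    if not known:
        raise ValueError("no known contextual features for this item")
    mean = [sum(column) / len(known) for column in zip(*known)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean
```

No retraining is involved: only the fixed word embeddings are read, so a new item can be placed near items that share its brand, title words or catalog labels as soon as its metadata is available.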

4. Experiment and Result

To fully evaluate our method and the embeddings, we conduct intensive offline and online tests to gather enough evidence before the final deployment after A/B testing. The purposes of the recommendation experiments are to test our model under different configurations and to compare it with previous online models as well as other strong baseline models. Since there is no ground truth for evaluating the item embeddings, we conduct intensive case studies and show some of them here for demonstration. The user embeddings are explored in downstream supervised and unsupervised learning tasks for various user segmentation (persona) studies.

4.1. Dataset, Training and Testing

Around five million frequent customers with their purchase records are sampled and then split into training and testing datasets with respect to a cutoff date. Preliminary experiments show that excluding suspicious resellers and purchase records during holiday seasons helps improve performance. We sample around two million representative items according to their sales such that all product subcategories are covered.

Construct purchase sequences for training. Our model mostly learns the item complementary relationship from observed purchase sequences, so how they are constructed can affect the quality of the embeddings. The way the sequences are ordered and the size of the sliding window used to extract them can have a non-trivial influence. After many preliminary experiments we find that using transaction time to order purchase sequences is problematic, because customers very often check out with a large cart of related and unrelated items, which results in many ties in the sequence. Add-to-cart time, however, is closer to the actual chronological order in which the items are considered. We therefore reorder the purchase sequences according to add-to-cart time.

In general, when extracting consecutive purchases from a user’s record for learning the next-purchase term, we do not want the window to be either too large or too small. Intuitively, if the window is large the noise may also be large, and if it is small some signals may get lost. We experiment with several reasonable values and use the best-performing one as the sliding window size in the following experiments.
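The sequence construction described above (order by add-to-cart time, then slide a fixed-size window) might look like the following. This is a sketch under assumptions: the function name and the `(add_to_cart_time, item_id)` event layout are hypothetical, and the window size `k` is left as a parameter.

```python
from typing import Hashable, List, Sequence, Tuple

Event = Tuple[float, Hashable]  # (add-to-cart time, item id)


def extract_windows(
    events: Sequence[Event], k: int
) -> List[Tuple[List[Hashable], Hashable]]:
    """Return (previous k items, next item) pairs for one user.

    Events are first ordered by add-to-cart time; Python's sort is stable,
    so events with equal times keep their input order.
    """
    ordered = [item for _, item in sorted(events, key=lambda e: e[0])]
    return [(ordered[t - k:t], ordered[t]) for t in range(k, len(ordered))]
```

A user with fewer than `k + 1` events yields no training pairs under this scheme; padding shorter histories would be a separate design choice.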

(a) Comparison of recommendation performances for a specific anchor item.
(b) Online treatment/control evaluations.
Figure 3. Screen shots of online evaluation results from our online testing system. The upper panel shows recommendations for a specific anchor item and the online performances of our model (blue) and current model (orange). Lower panel gives the overall online evaluation results on day-to-day basis.

Item and user contextual knowledge. Modelling purchase sequences with item and user contextual knowledge is an innovation of our work. For products on our platform, some useful contextual features are listed as follows.

  • Textual features: title, brand and descriptions.

  • Labeled features: catalog hierarchy tree, in which each item has a unique path of {…, department, category, subcategory, …}.

  • Continuous features: price, size, volume, etc.

We implement standard NLP pipelines to preprocess the textual features. The discretized continuous features and the labels from the catalog hierarchy tree are treated as tokens. Although our model is capable of including user context, we do not consider it in the experiments because we wish to evaluate our user embeddings against this user context. Including it in training could cause information leakage.

Training with asynchronous stochastic gradient descent. It has been observed for word2vec and its variants that, although the total number of parameters is huge for shallow embedding models, the number of parameters updated during each training step is actually small, since each observation is only associated with a few parameter vectors. In our implementation we therefore adopt the Hogwild! algorithm, whose efficiency for optimizing sparse separable cost functions has been widely acknowledged. Our implementation is written in C++ and takes only 11 hours to train on our dataset, which consists of 1 billion observations from the 5 million users and 2 million items, on a Linux server with 16 CPU threads and 120 GB of memory.

Offline testing procedure. Our offline testing aims at simulating the actual online recommendation scenarios to evaluate how well the model recommends complementary items. Following the discussion in Section 3.3, we use the user-aggregated criterion for ranking. Different from the training setup, where we use a fixed number of past items to predict the next-to-purchase item, in testing the choice of how much purchase history to use is more subtle. It is very common that customers do not purchase complementary items immediately but after several days. However, if the time lag is too long, previous purchases are less likely to be related to future purchases. Taking the time lag into consideration, we use the purchases from the current day and a fixed number of previous days to predict the purchases, if any, made within a fixed number of future days. We use the average top-K prediction accuracy (APA@K) as the offline evaluation metric.
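A top-K accuracy of this kind could be computed as in the sketch below. The exact definition of APA@K is an assumption here: a test case counts as a hit when at least one of the user's future purchases appears among the top K recommended items, cases without any future purchase are skipped, and the metric is the fraction of hits. The function name is hypothetical.

```python
from typing import Collection, Hashable, Sequence


def apa_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],         # recommendations per test case, best first
    future_purchases: Sequence[Collection[Hashable]],   # items actually bought in the look-ahead window
    k: int,
) -> float:
    """Fraction of test cases with at least one future purchase in the top k."""
    hits = 0
    cases = 0
    for ranked, future in zip(ranked_lists, future_purchases):
        if not future:  # no purchase in the look-ahead window: nothing to predict
            continue
        cases += 1
        if set(ranked[:k]) & set(future):
            hits += 1
    return hits / cases if cases else 0.0
```

Other reasonable definitions exist, for example averaging the per-case fraction of future purchases recovered, and they would give different absolute numbers.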

4.2. Baseline Models for Offline Evaluation

We compare with several other methods that also give recommendations based on item-level relationship.

Information-theory based method. We construct the pointwise mutual information matrix (PMI matrix) for items based on co-purchase counts. For any anchor item we recommend the items with the highest pointwise mutual information. The co-purchase count of two items is the number of user sessions, over all online shopping sessions, in which the same user purchases both items. The PMI matrix for the items is then defined from these counts in the standard way, by comparing the joint co-purchase frequency of two items with the product of their marginal frequencies. The PMI matrix method is only capable of characterizing symmetric relationships among items, but it may capture short-term relations since it considers co-purchases from the same session.
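For reference, a PMI baseline of this type can be sketched as below. It uses the textbook definition, PMI(a, b) = log(p(a, b) / (p(a) p(b))), with probabilities estimated as session frequencies; that estimation choice and the function name are assumptions made here for illustration.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def pmi_matrix(
    sessions: Iterable[Iterable[Hashable]],
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Sparse symmetric PMI values for item pairs bought in the same session."""
    item_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    n_sessions = 0
    for session in sessions:
        items = sorted(set(session))  # each item counted once per session
        n_sessions += 1
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    pmi = {}
    for (a, b), c in pair_counts.items():
        # log( (c / n) / ((count_a / n) * (count_b / n)) )
        value = math.log(c * n_sessions / (item_counts[a] * item_counts[b]))
        pmi[(a, b)] = pmi[(b, a)] = value
    return pmi
```

The value stored for `(a, b)` always equals the value for `(b, a)`, which makes the symmetry limitation of this baseline explicit.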

Graph embedding method. Similar to the method proposed by Alibaba (Wang et al., 2018), we first extract item co-occurrence counts from customer records, i.e. for each ordered pair of items we count the occurrences in which a user purchases the first item and then purchases the second item within a fixed number of future purchases (the same window as in our model). The co-occurrence count can reflect the asymmetric complementary relationship and does not have the same-session constraint. A weighted directed graph is then constructed from the normalized co-occurrence counts. We apply cutoffs on the edge weights so that the graph is not too dense. The node2vec algorithm is then used to learn the node embeddings of the product co-count graph. This method utilizes the topological information of the co-occurrence graph, but the relationship it models is still symmetric, and contextual knowledge cannot be directly added to node2vec.
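The directed co-occurrence counts that feed such a graph can be sketched as follows. The function name is hypothetical, and the choices to skip self-pairs and to count every occurrence (rather than once per user) are assumptions made for this illustration; normalization and edge cutoffs are left out.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def cooccurrence_counts(
    user_sequences: Iterable[List[Hashable]],  # one chronologically ordered purchase list per user
    k: int,                                    # look-ahead window, in number of purchases
) -> Counter:
    """Count ordered pairs (first, second) where second is bought within k purchases after first."""
    counts: Counter = Counter()
    for seq in user_sequences:
        for t, first in enumerate(seq):
            for second in seq[t + 1:t + 1 + k]:
                if second != first:
                    counts[(first, second)] += 1
    return counts
```

Because pairs are ordered, `counts[("tv", "hdmi")]` and `counts[("hdmi", "tv")]` can differ, so the resulting graph is directed even though the embedding learned on top of it may still score pairs symmetrically.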

Product and user embedding method. Prod2Vec and User2Vec learn product and user embeddings from purchase records (Vasile et al., 2016). Prod2Vec, which learns directly from session-based shopping sequences, applies a skip-gram (SG) model similar to word2vec. The User2Vec model, on the other hand, is more inspired by doc2vec (Le and Mikolov, 2014). For recommendation, the items are also ranked according to their cosine similarity with the anchor item’s embedding. Although User2Vec can model user preference, the item relationship is still assumed to be symmetric and contextual knowledge is not considered.
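The cosine-similarity ranking shared by the embedding baselines can be sketched as follows. The embedding values and function names are hypothetical. The point of the example is that cosine similarity is symmetric in its two arguments, so any ranking built on a single embedding per item cannot distinguish the direction of a complementary relationship.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_cosine(embeddings, anchor, k=3):
    """Rank all other items by cosine similarity to the anchor item's embedding."""
    scores = [
        (item, cosine(embeddings[anchor], vec))
        for item, vec in embeddings.items()
        if item != anchor
    ]
    scores.sort(key=lambda t: (-t[1], t[0]))
    return scores[:k]

# Toy single-embedding table (values are made up for illustration).
embeddings = {
    "tv": [0.9, 0.1, 0.0],
    "tv_mount": [0.8, 0.3, 0.1],
    "cake_pan": [0.0, 0.2, 0.9],
}
top_for_tv = rank_by_cosine(embeddings, "tv", k=1)
```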

4.3. Evaluation results

We first report the performance of our model under various offline testing configurations. First we experiment on the different settings mentioned in Section 4.1, with the user and item embedding dimensions held fixed. The results are given in Table 1.

Setting  (7,7)   (7,5)   (7,3)   (5,7)   (5,5)   (5,3)   (3,7)   (3,5)   (3,3)
APA@10   0.2147  0.2112  0.2070  0.2254  0.2205  0.2141  0.2518  0.2482  0.2393
APA@5    0.1962  0.1929  0.1886  0.2020  0.2001  0.1986  0.2336  0.2293  0.2205
Table 1. The offline evaluation results of our method under various test settings.

From Table 1 we see that the results become worse as the first entry of each setting, the number of previous purchase records used, gets larger. As justified before, using too many previous purchase records can bring noise into the recommendation. While increasing the second entry gives better results since users purchase more items, the margin of increase between successive values is not significant. This suggests that users mostly continue purchasing related complementary items shortly after the original purchases.

We then experiment on different dimensions for the user and item embeddings. We observe that the item embedding dimension gives very similar testing results under various settings, so we keep it fixed. However, different user embedding dimensions give slightly different outcomes, as we show in Table 2. We also compare the results with and without item contextual knowledge in Table 2.

User embedding dimension  0       20      40      60      80      100
APA@10                    0.196   0.2602  0.2574  0.2559  0.2533  0.2518

Context included  False   True
APA@10            0.1778  0.2518
Table 2. The evaluation results under different user embedding dimensions and contextual knowledge inclusion.

Model   Ours    PMI     node2vec  prod2vec
APA@10  0.2518  0.0791  0.1103    0.1464
APA@5   0.2336  0.0675  0.0959    0.1321
Table 3. Comparison with baseline models.

We see from Table 2 that including the user preference term and contextual features boosts the performance of our approach. But if we keep increasing the user embedding dimension beyond 20, the best-performing value in Table 2, the recommendation performance drops slightly. It has been widely observed in word embedding models that the embedding size should be neither too large nor too small. A recent work reveals the connection between embedding size and the bias-variance trade-off (Yin and Shen, 2018). From this perspective, we think that a larger user embedding size may over-fit the purchase sequences with user preference.

When comparing with other baseline methods in offline evaluation, we use the default hyper-parameter settings for node2vec and prod2vec/user2vec, with embedding sizes equal to 100. Since there is no official implementation of prod2vec/user2vec, we rely on the word2vec and doc2vec implementations provided by Gensim, an open-source statistical semantics library (Rehrek and Sojka, 2011). The results are provided in Table 3.

Offline evaluations show that our approach outperforms the baseline methods by significant margins, and thus we proceed to online evaluations. Our online testing system monitors not only the overall performance but also the performance of individual item recommendations. We provide a screenshot of the online evaluation outcomes for our model and the current model in Figure 3. The metric is significantly improved in the treatment group with our model.

4.4. Case Study

In this section we randomly sample several recommendations as case studies to demonstrate the advantages of our approach.

From the examples in Figure 4 we see that our approach captures the asymmetric property of complementariness, where the TV mount frame is found complementary to the TV and not vice versa. The examples in Figure 5 show an instance of the higher-order relations learnt by our model. The complementary items for the combo of {cake pan, spatula} are very different from the recommendations for the cake pan alone and the spatula alone (not shown here). In Figure 6 we examine the recommendations for a real-world scenario, where the customer has several unrelated items in the shopping cart. Since we use the average of the item embeddings in the cart as the cart’s embedding, our recommendation is a mixture of the complementary items for the cart. Finally, in Figure 7 we show our analysis for a cold-start product, a new 50” HD LED TV which has never been purchased by any customer before. We infer the item-in embedding for the TV and find that it lies close to those of other HD LED TVs. The complementary items are reasonable, since they are all complementary to the similar TVs.

Figure 4. Top 3 recommendations for TV and TV mount frame.
Figure 5. Top 3 recommendations for round cake pan and the combo of round cake pan with straight spatula.
Figure 6. Top 4 recommendations (second row) for a real-world shopping cart (first row) consisting of unrelated items.
Figure 7. Top 3 similar items (first row) to the cold-start 50” HD LED TV and our top 3 recommendations (second row).
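The asymmetry and the cart-averaging behaviour seen in these case studies can be illustrated with a small sketch. It assumes, for illustration only, that a candidate is scored by the inner product between the query's item-in embedding and the candidate's item-out embedding. It is not the user-aggregated ranking criterion used in production, and all names and embedding values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cart_embedding(item_in, cart):
    """Average the item-in embeddings of the items in the cart."""
    dim = len(item_in[cart[0]])
    return [sum(item_in[i][d] for i in cart) / len(cart) for d in range(dim)]

def recommend_for_cart(item_in, item_out, cart, k=3):
    """Score each candidate's item-out embedding against the cart embedding."""
    query = cart_embedding(item_in, cart)
    scores = [(c, dot(query, vec)) for c, vec in item_out.items() if c not in cart]
    scores.sort(key=lambda t: (-t[1], t[0]))
    return [c for c, _ in scores[:k]]

# Toy dual embeddings (values are made up for illustration).
item_in = {"tv": [1.0, 0.0], "tv_mount": [0.0, 1.0], "hdmi_cable": [0.2, 0.2]}
item_out = {"tv": [0.1, 0.0], "tv_mount": [0.9, 0.1], "hdmi_cable": [0.7, 0.3]}

tv_to_mount = dot(item_in["tv"], item_out["tv_mount"])
mount_to_tv = dot(item_in["tv_mount"], item_out["tv"])
for_tv = recommend_for_cart(item_in, item_out, ["tv"], k=2)
for_mount = recommend_for_cart(item_in, item_out, ["tv_mount"], k=2)
```

With two embeddings per item, the score from the TV to the TV mount differs from the score in the reverse direction, which a single embedding with cosine similarity cannot express.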

4.5. User embedding analysis

Although we do not explicitly use individual user embeddings in our final user-aggregated ranking criterion, they can be applied to various downstream tasks. Here we demonstrate two cases, in which the user embeddings are used in clustering and k-nearest-neighbor learning tasks for user segmentation studies. User segmentation plays an important role in modern e-commerce platforms, where it is directly or indirectly involved in advertising and marketing. We refer to the different user segmentation criteria as personas. For instance, the persona ’new parents’ separates customers into new parents and others. Personas are also user features that we could have included in our model. A business team assigns scores to a subset of customers on each persona.

category       precision  recall  accuracy  ap*
Pet Owner      0.7378     0.9925  0.7343    0.7443
Busy Families  0.7378     0.9925  0.7723    0.7797
New Parents    0.8569     0.9976  0.8553    0.8631
Table 4. Metrics of classification tasks for the KNN classifier using user embeddings. *ap means the average precision score.

Figure 8. The average persona scores of each cluster.

Here we consider three personas: pet owners, busy families and new parents. For each persona, we sample a million customers with a persona score. We first conduct k-nearest-neighbor learning, where customers with a high persona score are labelled as positive and the rest as negative. Around 80% of the customers are randomly selected as training samples. The testing results on the remaining customers are given in Table 4. They show that the user embeddings indeed capture personal preferences, since users who are similar according to our embeddings are more likely to share the same persona.
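The classification experiment can be sketched with a small k-nearest-neighbor classifier. The sketch assumes Euclidean distance and k = 3, neither of which is stated above. The two-dimensional user embeddings, the labels and the fixed train/test split are toy values standing in for the real data.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training users closest to `query` (Euclidean)."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def precision_recall_accuracy(y_true, y_pred):
    """Binary metrics with label 1 as the positive (high persona score) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

# Toy user embeddings; label 1 means a high score on the persona.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.05], [0.1, 0.9], [0.2, 0.8], [0.15, 0.95]]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [[0.95, 0.1], [0.1, 0.85], [0.7, 0.3], [0.3, 0.7], [0.2, 0.9]]
test_y = [1, 0, 1, 0, 1]

pred_y = [knn_predict(train_x, train_y, q) for q in test_x]
precision, recall, accuracy = precision_recall_accuracy(test_y, pred_y)
```

The last test user is a positive example whose embedding sits among the negatives, so it is missed by the classifier and lowers recall without affecting precision.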

We then conduct a KMeans clustering analysis with 10 centers on the 5 million customers who have user embeddings. The cosine distance is used as the dissimilarity measure. For each cluster in the output, we compute the average persona score over all customers in that cluster. The result is shown in Figure 8. We see that the grouping pattern is very strong: clusters 1 and 2 have many pet owners, and clusters 3, 4, 5 and 8 have many busy families as well as new parents, which is very reasonable.
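The clustering analysis can be sketched as below. This is an illustrative variant, not necessarily the implementation used here: it realizes KMeans under cosine distance by normalizing all vectors to unit length and assigning each user to the center with the highest cosine similarity (often called spherical k-means). The initial centers, embeddings and persona scores are toy values.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def spherical_kmeans(vectors, init_centers, iterations=10):
    """KMeans under cosine distance; returns one cluster label per vector."""
    points = [normalize(v) for v in vectors]
    centers = [normalize(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # On unit vectors, the highest dot product is the smallest cosine distance.
        labels = [
            max(range(len(centers)),
                key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[c] = normalize(mean)
    return labels

def average_score_per_cluster(labels, scores):
    """Average persona score of the users assigned to each cluster."""
    sums, counts = {}, {}
    for l, s in zip(labels, scores):
        sums[l] = sums.get(l, 0.0) + s
        counts[l] = counts.get(l, 0) + 1
    return {l: sums[l] / counts[l] for l in sums}

# Toy user embeddings and persona scores (made up for illustration).
user_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
pet_owner_score = [0.9, 0.7, 0.2, 0.0]

labels = spherical_kmeans(user_embeddings, [user_embeddings[0], user_embeddings[2]])
cluster_avg = average_score_per_cluster(labels, pet_owner_score)
```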

5. Conclusion

We propose a novel representation learning framework for modelling product complementary relationships and customer preferences from purchase records with dual product embeddings. Contextual knowledge of both products and customers is simultaneously merged into the embeddings, and a fast inference algorithm is developed for the cold-start challenge. With our user-aggregated ranking method we outperform the baseline models in offline tests and the current model in online tests by significant margins.


  • Balabanović and Shoham (1997) Marko Balabanović and Yoav Shoham. 1997. Fab: content-based, collaborative recommendation. Commun. ACM 40, 3 (1997), 66–72.
  • Belkin and Niyogi (2003) Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15, 6 (2003), 1373–1396.
  • Bobadilla et al. (2012) Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Jesús Bernal. 2012. A collaborative filtering approach to mitigate the new user cold start problem. Knowledge-Based Systems 26 (2012), 225–238.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
  • Gantner et al. (2010) Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars Schmidt-Thieme. 2010. Learning attribute-to-feature mappings for cold-start recommendations. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 176–185.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
  • Hannon et al. (2010) John Hannon, Mike Bennett, and Barry Smyth. 2010. Recommending twitter users to follow using content and collaborative filtering approaches. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 199–206.
  • Huang et al. (2004) Zan Huang, Hsinchun Chen, and Daniel Zeng. 2004. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 116–142.
  • Konstas et al. (2009) Ioannis Konstas, Vassilios Stathopoulos, and Joemon M Jose. 2009. On social networks and collaborative recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, 195–202.
  • Koren and Bell (2015) Yehuda Koren and Robert Bell. 2015. Advances in collaborative filtering. In Recommender systems handbook. Springer, 77–118.
  • Lam et al. (2008) Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong. 2008. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd international conference on Ubiquitous information management and communication. ACM, 208–211.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 1 (2003), 76–80.
  • McAuley et al. (2015) Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
  • Melville et al. (2002) Prem Melville, Raymond J Mooney, and Ramadass Nagarajan. 2002. Content-boosted collaborative filtering for improved recommendations. AAAI/IAAI 23 (2002), 187–192.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • Ou et al. (2016) Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1105–1114.
  • Papagelis et al. (2005) Manos Papagelis, Dimitris Plexousakis, and Themistoklis Kutsuras. 2005. Alleviating the sparsity problem of collaborative filtering using trust inferences. In Trust management. Springer, 224–239.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
  • Rehrek and Sojka (2011) Radim Rehrek and Petr Sojka. 2011. Gensim — Statistical Semantics in Python. (2011).
  • Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295.
  • Saveski and Mantrach (2014) Martin Saveski and Amin Mantrach. 2014. Item cold-start recommendations: learning local collective embeddings. In Proceedings of the 8th ACM Conference on Recommender systems. ACM, 89–96.
  • Schein et al. (2002) Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. 2002. Methods and metrics for cold-start recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 253–260.
  • Soboroff and Nicholas (1999) Ian Soboroff and Charles Nicholas. 1999. Combining content and collaboration in text filtering. In Proceedings of the IJCAI, Vol. 99. sn, 86–91.
  • Vasile et al. (2016) Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-prod2vec: Product embeddings using side-information for recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 225–232.
  • Wainwright et al. (2008) Martin J Wainwright, Michael I Jordan, et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1, 1–2 (2008), 1–305.
  • Wang et al. (2018) Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. arXiv preprint arXiv:1803.02349 (2018).
  • Yin and Shen (2018) Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems. 895–906.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. arXiv preprint arXiv:1806.01973 (2018).
  • Zheng et al. (2009) Jiaqian Zheng, Xiaoyuan Wu, Junyu Niu, and Alvaro Bolivar. 2009. Substitutes or complements: another step forward in recommendations. In Proceedings of the 10th ACM conference on Electronic commerce. ACM, 139–146.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1059–1068.
  • Zhou et al. (2011) Ke Zhou, Shuang-Hong Yang, and Hongyuan Zha. 2011. Functional matrix factorizations for cold-start recommendation. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 315–324.