SWAG: Item Recommendations using Convolutions on Weighted Graphs

11/22/2019 ∙ by Amit Pande, et al. ∙ 15

Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. In this work, we present a Graph Convolutional Network (GCN) algorithm, SWAG (Sample Weight and AGgregate), which combines efficient random walks and graph convolutions on weighted graphs to generate embeddings for nodes (items) that incorporate both graph structure and node feature information such as item descriptions and item images. The three important SWAG operations that enable us to efficiently generate node embeddings based on graph structures are (a) Sampling of the graph to a homogeneous structure, (b) Weighting the sampling, walks and convolution operations, and (c) using AGgregation functions for generating convolutions. The work is an adaptation of GraphSAGE over weighted graphs. We deploy SWAG at Target and train it on a graph of more than 500K products sold online with over 50M edges. Offline and online evaluations reveal the benefit of using a graph-based approach and the benefits of weighting to produce high-quality embeddings and product recommendations.




I Introduction

Convolutional Neural Networks (CNNs) have established state-of-the-art performance on many Computer Vision applications [2]. CNNs consist of a series of parameterized convolutional layers operating locally (around neighboring pixels of an image) to obtain a hierarchy of features about an image. The first layer learns simple edge-oriented detectors. Higher layers build on the lower layers to learn more complex features and objects. The success of CNNs in Computer Vision has inspired efforts to extend the convolutional operation from regular grids (2D images) to graph-structured data [9]. Graphs, such as social networks, word co-occurrence networks, guest purchasing behavior, protein-protein interactions and communication networks, occur naturally in various real-world applications. Analyzing them yields insights into the structure of society, language, and different patterns of communication. In such graphs, a node's neighborhood is variable-sized: each node can have any number of connections to other nodes, unlike a pixel, which has 8 nearest neighbors and 16 second-degree neighbors arranged with a sense of directionality. Generalizing convolution to graph structures should allow models to learn location-invariant features.

The early extension of convolution to graph-structured data [6] is theoretically motivated but not scalable to large graphs, as it incurs quadratic computational complexity in the number of nodes. Moreover, it requires the graph to be completely observed during training (the transductive scenario). Defferrard et al. [8] and Kipf & Welling [16, 17] propose approximations to graph convolutions that are computationally efficient (linear complexity in the number of edges).

Hamilton and Ying [11, 27] extend graph convolution networks to scenarios where the entire graph is not required during training. In other words, the model learns a function over inputs such as node attributes and node neighborhoods that can be applied to any input graph or node in general, making it more suitable for inductive settings. For example, for a retailer like Target, assortments are frequently updated and thousands of items as well as millions of guests are added every few days. It is desirable to train a model once and let it inductively generate powerful embeddings for newer nodes (items or guests) without retraining on the entire dataset. The high-dimensional information about a node's neighborhood (graph structure) as well as the node attributes (other high-dimensional information about a graph) can be efficiently condensed or encoded in the form of graph embeddings using unsupervised graph embedding methods for dimensionality reduction. Such embeddings have demonstrated great performance on a number of tasks including node classification [11, 27, 22, 13], knowledge-base completion [21], semi-supervised learning [26], and link prediction [3]. These node embeddings can then be fed to downstream machine learning systems and aid in tasks such as node classification, clustering, and link prediction. As introduced by Perozzi et al. [22] and Hamilton [11], these methods operate in two discrete steps. First, they sample pairwise relationships from the graph through random walks. Second, they train an aggregation function or an embedding model to learn representations that encode pairwise node similarities.

Recent works have focused on creating an inductive framework for generating node embeddings by leveraging node features (e.g. text attributes, node profile information, node degrees) in order to learn an embedding function from the graph that can be generalized to unseen graphs and nodes. GraphSAGE uses (a) sampling and (b) aggregation operations to generate higher-quality recommendations than comparable deep learning and graph-based alternatives at Pinterest [11, 27].

However, the links between nodes of a graph convey specific information that is not properly captured by existing architectures. The weights between nodes may signify the cost, advantages or popularity of a transition from one node to another. For example, in a graph where each product is a node, the weight between two nodes may represent the probability of co-views or co-purchases, the rate of substitution or the cost of substitution, depending on the application. Traditional product recommendation algorithms such as collaborative filtering [25] use this information to deliver product recommendations. In this work, we incorporate weights into the graph-based algorithm. The resulting algorithm has three components, (a) Sampling, (b) Weighting and (c) AGgregation, and is abbreviated as GraphSWAG or simply SWAG in the paper.

The main contributions of this work are as follows:

  1. In this work, we tune graph sampling and aggregation operations by incorporating the knowledge of edge weights into the procedure. Weights in graph are used for sampling, aggregation as well as generation of random walks and measuring loss.

  2. The proposed framework (SWAG) is used for similar or related product recommendations for a retailer to combine the insights from (a) product or item description (text), (b) item images and (c) purchase behavior (views / add-to-cart / purchases) into a single framework.

  3. The offline as well as online experiments illustrate that such a scheme outperforms image or item attributes based deep learning and unweighted graph based approaches.

This paper is organized as follows. Section II gives an overview of related works. Section III explains the proposed method and the inputs. Section IV gives an overview of the algorithm followed up by experimental results in Section V. Section VI gives conclusions and directions for future work.

II Background & Related Work

Our work builds upon recent advances in the field of Graph Neural Networks (GNNs) or Graph Convolution Networks (GCNs). GCNs are connectionist models that capture the dependence of graphs via message passing between the nodes of graphs. Unlike standard neural networks, graph neural networks retain a state that can represent information from a node's neighborhood with arbitrary depth. The concept of neural networks for graphs was first introduced in [7]. Initial approaches were difficult to train to a fixed point, but recent advances in network architectures, optimization techniques, and parallel computation have addressed computational speed issues. GCNs borrow the idea of image convolutions (with small filters) to allow message passing along local neighbors of a node and significantly speed up model training and convergence. The following properties are helpful when training GCNs for complex data science tasks: 1) graphs are the most typical locally connected structures; 2) the shared weights of GCNs reduce the computational cost compared with traditional spectral graph theory; 3) the multi-layer structure of GCNs allows us to deal with hierarchical patterns, capturing features of various sizes. Bruna et al. [6] developed an initial GCN based on spectral graph theory. Following this work, a number of authors proposed improvements, extensions, and approximations of these spectral convolutions [20, 29, 10, 4], leading to new state-of-the-art results on benchmarks such as node classification, link prediction, as well as web-scale recommendations (e.g., the MovieLens benchmark [20, 11, 27]). These approaches have consistently outperformed techniques based upon matrix factorization or random walks. Hamilton et al. [12], Bronstein et al. [5] and Zhou et al. [28] provide comprehensive surveys of recent advancements.

The inductive approaches such as GraphSAGE and Pin-SAGE [11, 27] derive embeddings as a function of node features and neighbors so that the function is scalable or usable over unseen graphs. Instead of training a distinct embedding vector for each node, a set of aggregator functions is trained. Each aggregator function aggregates information from a different number of hops, or search depth, away from a given node. The approach presented in this work improves on these methods by leveraging graph weights for sampling, aggregation and the unsupervised loss. We use an unsupervised loss function to generate the recommendations / embeddings for millions of online items. It is a highly scalable GCN framework (can operate on billions of nodes) and is based on running local convolutions or aggregations on nodes. For training the model, nodes are selected for the loss function using random walks, and negative sampling is used to select negative examples. Using random walks removes the requirement that the entire adjacency matrix of the graph live in memory.

To the best of our knowledge, weighted graphs have received little attention. The closest graph convolutional neural network (GCNN) approaches that use edge information are G2S [29] and r-GCN [23] for natural language processing. The former uses the edge weights to aggregate the information from neighbors through element-wise multiplication with the state of nodes (see G2S [29] and also equation (8) of [28] for details). However, the edge weights in G2S are learned from the node embeddings through gates (as in a GRU). In contrast, our edge weights are an input to the model. This allows us to incorporate edge information from other sources, such as user browse behavior, in graphs. In the latter (r-GCN), the edge weights are used in regularization only, because r-GCN focuses on link prediction, in which regularization plays an important role.

III Proposed Method

In this section, we describe the technical details of the SWAG algorithm and its implementation for product recommendations. The key computational building block of the algorithm is the notion of localized graph convolutions. To generate the embeddings for a node (item), multiple convolutional modules or aggregators aggregate feature information (item descriptions or visual appearance) from the node and its local graph neighborhood. This approach was first proposed in [11]. However, all the neighbors are treated equally in that approach. [27] proposed using importance sampling to find the important neighbors of a node, but all important nodes are still treated equally. In SWAG, the aggregators use the edge weights to mix neighbors accordingly.

III-A Problem Setup

Target is one of the largest general merchandise retailers in the US, with Target.com consistently ranked as one of the most-visited retail websites. The website serves millions of product recommendations daily to guests.

Our task is to generate high-quality embeddings of items that can be used for nearest-neighbor lookups and subsequently in recommendations. In order to learn these embeddings, we model the shopping behavior of guests at Target as a graph with each node representing an item. In addition to the graph structure, we assume that the items are associated with additional features, i.e. metadata or content information about the item. Each item is associated with rich item descriptions and image features. The learnt embeddings are used for product recommendations.

III-B Generating graph weights

In this work, we try two methods to generate graph weights from behavioral observations of aggregate guest shopping behavior.

III-B1 Jaccard Index

The edges of the graph are weighted according to past customer views, so our graph has weights on all its edges. The weights of the graph are generated from the Jaccard index. To be specific, we calculate the relative frequency of views for each pair of items and then apply an arctan transformation to that relative frequency. For online items $i$ and $j$, the relative frequency is defined as follows:

$$f_{ij} = \frac{N_{i \cap j}}{N_{i \cup j}}$$

where $N_{i \cap j}$ is the number of guests that view both item $i$ and item $j$ in one session and $N_{i \cup j}$ is the number of guests that view either item $i$ or item $j$ in a session. In online retail, the relative view frequency for items $i$ and $j$ is usually very small, and even a small share of common views is already a very large number for a pair of items. We therefore divide the relative frequency by the median frequency in one category and scale it into the weight function $w(i, j)$, which is defined to be

$$w(i, j) = \arctan\left(\frac{f_{ij}}{\operatorname{median}(f)}\right)$$

After the transformation, the weights are closer to a uniform distribution across the edges.
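The Jaccard-based weighting can be illustrated with a minimal sketch. It assumes sessions are given as sets of viewed item ids, that the median is taken over all co-viewed pairs of one category, and that no extra normalizing constant is applied after the arctan; the function name `jaccard_edge_weights` and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import median


def jaccard_edge_weights(sessions):
    """Map each co-viewed item pair (i, j), with i < j, to an edge weight.

    `sessions` is a list of sets of item ids, one set per guest session.
    """
    views = Counter()     # number of sessions in which each item was viewed
    co_views = Counter()  # number of sessions in which both items were viewed
    for items in sessions:
        views.update(items)
        co_views.update(combinations(sorted(items), 2))

    # Jaccard index: sessions with both items / sessions with either item.
    freq = {
        (i, j): both / (views[i] + views[j] - both)
        for (i, j), both in co_views.items()
    }
    scale = median(freq.values())  # median relative frequency (assumed per category)
    # Arctan transform of the median-scaled frequency (assumption: no extra constant).
    return {pair: math.atan(f / scale) for pair, f in freq.items()}


sessions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
weights = jaccard_edge_weights(sessions)
```

In this toy example the pair at the median frequency receives weight arctan(1), and pairs that are co-viewed relatively more often receive larger weights.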

III-B2 Weighted co-occurrences

In another approach, we generate weights not just from co-view counts but also from add-to-cart and, ultimately, bought-together counts. The different activities of guests, such as view / add to cart / purchase of products, are weighted using empirically determined weights. Further, we apply time decay to co-occurrences to capture the recency of items. The weighted co-occurrence of two products $i$ and $j$ is computed over customer sessions $s$ from the highest activity weights of products $i$ and $j$ in session $s$ and the recency of session $s$, a decay factor that depends on how many days back the session occurred.

Finally, we normalize the weights per node and apply an arctan transform to bound the range of the weights.
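One possible reading of the weighted co-occurrence scheme is sketched below. Several details are assumptions made only for illustration and are not taken from the source: the activity weights in `ACTIVITY_WEIGHT`, the exponential decay factor `DECAY`, the product rule used to combine the two per-session weights, and normalization by each node's total co-occurrence.

```python
import math
from collections import defaultdict
from itertools import combinations

ACTIVITY_WEIGHT = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}  # hypothetical values
DECAY = 0.98  # hypothetical per-day decay factor


def weighted_cooccurrence(sessions):
    """`sessions` is a list of (days_ago, {item: [activities]}) tuples."""
    cooc = defaultdict(float)
    for days_ago, activity in sessions:
        recency = DECAY ** days_ago  # assumed exponential time decay
        # Highest activity weight reached by each product in this session.
        top = {item: max(ACTIVITY_WEIGHT[a] for a in acts)
               for item, acts in activity.items()}
        for i, j in combinations(sorted(top), 2):
            cooc[(i, j)] += top[i] * top[j] * recency  # assumed combination rule
    return dict(cooc)


def normalize_per_node(cooc):
    """Scale each directed edge by its source node's total, then apply arctan."""
    totals = defaultdict(float)
    for (i, j), c in cooc.items():
        totals[i] += c
        totals[j] += c
    weights = {}
    for (i, j), c in cooc.items():
        weights[(i, j)] = math.atan(c / totals[i])
        weights[(j, i)] = math.atan(c / totals[j])
    return weights


sessions = [
    (0, {"crib": ["view", "purchase"], "mattress": ["view", "add_to_cart"]}),
    (10, {"crib": ["view"], "mattress": ["view"], "lamp": ["view"]}),
]
cooc = weighted_cooccurrence(sessions)
weights = normalize_per_node(cooc)
```

Under these assumptions, a recent session with a purchase and an add-to-cart contributes far more to an edge than an older session with views only.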

III-C Generating node embeddings

The image embeddings are generated using the pre-trained VGG-16 model [24]. The last fully connected layers are not used; we use the output of the convolutional and max-pool layers, with the last pooling layer applied as average-pool instead of max-pool.

The item embeddings are obtained by training a word embedding model [19] on item attributes and descriptions in our item catalog.

IV Algorithm

In this section, we describe the technical details of the SWAG architecture and training, as well as a GPU pipeline, to efficiently generate embeddings using a trained SWAG model. The model has two main steps - sampling and aggregation. We introduce the notation used in the paper in Table I.

Notation Explanation
$G$ the given graph
$V$ the node set of $G$
$E$ the edge set of $G$
$v$ a certain node of the graph
$e$ a certain edge of the graph
$w(e)$ the weight of edge $e$
$\bar{w}_{u,v}$ the maximum geometric mean of weights along paths connecting $u$ and $v$; $u$ and $v$ may not be neighbors
$N$ or $N(v)$ the neighboring function on $V$ or the neighbors of node $v$
$h_v^k$ the hidden state of node $v$ at the $k$-th layer
$W^k$ the linear parameter of the neural network at depth $k$
$\sigma$ the non-linearity function for the network; ReLU is used for all layers except the last one
$AGG_k$ the aggregator function applied to neighbors at depth $k$; choices are GCN, mean aggregator, LSTM, etc.
$\alpha$ the positive degree on path weights inside the loss function
$\beta$ the positive degree on edge weights inside sampling
$\gamma$ the positive degree on edge weights inside aggregation
TABLE I: Notation in the SWAG algorithm.

IV-A Sampling

Sampling is very important in Graph Convolutional Networks. As opposed to computer vision, where convolutional neural networks can use pixel proximity as a feature, GCNs propagate information guided by the graph structure [28]. Therefore, for any given node, we need to efficiently select its neighbors for convolution. In the SWAG algorithm, the neighbor function $N(v)$ samples a subset of neighbors for any given node $v$ based on the edge weights of its neighbors. In contrast to prior work [11], in which the neighbor function selects neighbors uniformly at random, we select neighbors with probability proportional to $w(e)^{\beta}$, where $w(e)$ is the weight of edge $e$ and $\beta$ is a sampling degree hyper-parameter. In our use case of product recommendations, the larger the weight of the edge, the more likely the corresponding neighbor is to be selected in sampling. When $\beta = 0$, the impact of weights is neutralized. On the other hand, a larger value of $\beta$ implies that only neighbors with large weights will get selected. We formalize the sampling algorithm as follows. Each layer of SWAG can have a distinct number of sampled neighbors, so the algorithm below is applied to each layer of the neural network.

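  Input: Graph $G(V, E)$ and a weight function $w(e)$ for any $e \in E$, a sampling hyper-parameter $\beta$
  Output: Graph with homogeneous number of neighbors.
  for each $v \in V$ do
     $N(v) = \{u_1, \dots, u_n\}$, s.t. $(u_i, v) \in E$;
     $p_i = w(u_i, v)^{\beta} / \sum_{j} w(u_j, v)^{\beta}$ and sample $N(v)$ based on $p_i$
  end for
Algorithm 1 Sampling: SWAG embedding generation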
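The sampling step can be sketched as follows. This is an illustration rather than the production implementation: it assumes neighbors are drawn with replacement so that every node ends up with the same number of sampled neighbors, and the names `sample_neighbors`, `beta` and `sample_size` are hypothetical.

```python
import random


def sample_neighbors(neighbors, beta, sample_size, rng=random):
    """Draw `sample_size` neighbors with probability proportional to weight ** beta.

    `neighbors` maps neighbor id -> positive edge weight.
    """
    ids = list(neighbors)
    scores = [neighbors[u] ** beta for u in ids]
    # Sampling with replacement (an assumption) guarantees a fixed-size
    # neighborhood even for nodes with fewer distinct neighbors.
    return rng.choices(ids, weights=scores, k=sample_size)


rng = random.Random(0)
nbrs = {"a": 0.9, "b": 0.5, "c": 0.1}
uniform = sample_neighbors(nbrs, beta=0.0, sample_size=10, rng=rng)
peaked = sample_neighbors(nbrs, beta=50.0, sample_size=10, rng=rng)
```

With an exponent of zero every neighbor is equally likely, which corresponds to uniform sampling; with a large exponent the draw concentrates on the highest-weight neighbor.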

IV-B Aggregation

After sampling, the selected neighbors need to be aggregated to their corresponding nodes for information clustering. The aggregation step is similar to convolution over nearby pixels in images and has the goal of aggregating information from neighboring nodes. However, a node's neighbors have no particular or natural ordering in graphs, so the aggregator operates over an unordered set of vectors. The mean aggregator, for example, takes an element-wise weighted mean of the vectors in $\{h_u^{k-1}, \forall u \in N(v)\}$. The max-pool operator takes the max of the weighted embeddings, and so on. We formulate the aggregation algorithm as follows.

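  Input: Graph $G(V, E)$; input features $\{x_v, \forall v \in V\}$; depth $K$; weight matrices $W^k, \forall k \in \{1, \dots, K\}$; non-linearity $\sigma$; differentiable aggregator functions $AGG_k, \forall k \in \{1, \dots, K\}$; neighborhood function $N$; edge weight function $w$; aggregation hyper-parameter $\gamma$.
  Output: Vector representations $z_v$ for all $v \in V$
  $h_v^0 \leftarrow x_v, \forall v \in V$
  for each $k = 1, \dots, K$ do
     for each $v \in V$ do
        $h_{N(v)}^k \leftarrow AGG_k(\{w(u, v)^{\gamma} \cdot h_u^{k-1}, \forall u \in N(v)\})$
        $h_v^k \leftarrow \sigma(W^k \cdot \mathrm{CONCAT}(h_v^{k-1}, h_{N(v)}^k))$
     end for
  end for
  $z_v \leftarrow h_v^K, \forall v \in V$
Algorithm 2 Aggregation: SWAG embedding generation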

In aggregation, if there are two sources of input features, as in our case with embeddings from images and text, we can combine them into a single input feature: a linear transformation matrix, which is a trainable parameter, brings the two embeddings into the same dimension, and a nonlinear element-wise function is applied to the result.

The intuition behind the algorithm is that at each iteration, or search depth, nodes aggregate information from their local neighbors, and as this process iterates, nodes incrementally gain more and more information from further reaches of the graph. Unlike prior works [11], the hidden state $h_u^{k-1}$ of a neighbor $u$ is discounted by the edge weight factor $w(u, v)^{\gamma}$ when it is aggregated into the state of node $v$. Our guideline for this multiplicative factor is to incorporate the importance of item-to-item view dependency so that neighbors with higher weights are aggregated more than those with lower ones. The parameter $\gamma$ is neutralized when it is set to zero. For larger values of $\gamma$, the neighbors with higher weights contribute more to the aggregation.

The aggregation function could be one of those in [11]: Mean aggregator, LSTM aggregator, Pooling aggregator, node2vec [10], GCN [16].
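As an illustration of weighted aggregation, the sketch below implements a weighted mean over neighbor states in plain Python. It assumes the weighted mean is normalized by the sum of the exponentiated edge weights, which the text does not spell out; the name `weighted_mean_aggregate` and the argument `gamma` (the aggregation exponent) are hypothetical.

```python
def weighted_mean_aggregate(neighbor_states, edge_weights, gamma):
    """Element-wise mean of neighbor states, each scaled by edge_weight ** gamma."""
    scales = [w ** gamma for w in edge_weights]
    total = sum(scales)  # assumed normalization by the sum of the scales
    dim = len(neighbor_states[0])
    return [
        sum(s * h[d] for s, h in zip(scales, neighbor_states)) / total
        for d in range(dim)
    ]


states = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.9, 0.1]
plain = weighted_mean_aggregate(states, weights, gamma=0.0)
skewed = weighted_mean_aggregate(states, weights, gamma=1.0)
```

With an exponent of zero the result is the ordinary mean of the two states; with an exponent of one the high-weight neighbor dominates the aggregate.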

IV-C Loss function

The sampling and aggregation operations are forward propagation operations, i.e. we assume that the weights and hyper-parameters are already learnt. The model parameters can be learned using standard stochastic gradient descent and back-propagation techniques using the loss function described in this section. In order to learn useful, predictive representations in a fully unsupervised setting, we apply a graph-based loss function to the output representations $z_u, \forall u \in V$, and train the parameters of the aggregator functions $AGG_k, \forall k \in \{1, \dots, K\}$, via stochastic gradient descent. The graph-based loss function encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct:

$$J_G(z_u) = -\bar{w}_{u,v}^{\alpha} \log\left(\sigma(z_u^{\top} z_v)\right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\left(\sigma(-z_u^{\top} z_{v_n})\right) \tag{7}$$

where $v$ is a node that co-occurs near $u$ on a fixed-length random walk, $\sigma$ is the sigmoid function, $P_n$ is a negative sampling distribution, and $Q$ defines the number of negative samples. $\bar{w}_{u,v}$ is the accumulated mean of the weights on the random walk between nodes $u$ and $v$, and $\alpha$ is another hyper-parameter to be tuned, the exponent applied to the weights of random walks. In our implementation, we choose the geometric mean of the weights along the random walk for $\bar{w}_{u,v}$. Other ways of combining edge weights include the arithmetic mean and the maximum of the weights of edges along the path; we leave their exploration to future work.

By adding the weights into the loss function (7), the algorithm focuses more on minimizing the distance between nodes $u$ and $v$ with larger edge weights. Since the SWAG algorithm tries to bring the embeddings of items with larger relative view frequency closer together, the weighted loss function is more useful for our purpose.
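A small numeric sketch of the weighted loss for a single positive pair may help. It assumes the path weight (the geometric mean of edge weights along the walk, raised to the loss exponent) scales only the positive term, and it replaces the expectation over negatives by a sum over the drawn negative samples; the function and variable names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def geometric_mean(weights):
    return math.exp(sum(math.log(w) for w in weights) / len(weights))


def weighted_walk_loss(z_u, z_v, negatives, walk_weights, alpha):
    """Loss for one (u, v) pair co-occurring on a random walk."""
    path_weight = geometric_mean(walk_weights) ** alpha
    positive = -path_weight * log_sigmoid(dot(z_u, z_v))
    # Q * E[...] is estimated by summing over the Q sampled negatives.
    negative = -sum(log_sigmoid(-dot(z_u, z_n)) for z_n in negatives)
    return positive + negative


z_u, z_v, z_neg = [1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]
unweighted = weighted_walk_loss(z_u, z_v, z_neg, [0.25, 1.0], alpha=0.0)
weighted = weighted_walk_loss(z_u, z_v, z_neg, [0.25, 1.0], alpha=2.0)
```

With an exponent of zero the loss reduces to the unweighted form; with a positive exponent, pairs connected through low-weight walks contribute less to the positive term.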

V Experiments

We evaluate the embeddings generated by SWAG by recommending related products to guests when they click on a product and reach the product display page. To recommend related products or items, we select the K nearest neighbors of the query item in the embedding space. The performance on this task is evaluated both online and offline.
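The nearest-neighbor lookup can be sketched with an exact search over a small dictionary of embeddings. Cosine similarity is assumed as the distance measure here, and the item names and vectors are made up for illustration; at catalog scale an approximate method would likely replace the exhaustive scan.

```python
import math


def cosine(a, b):
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return num / den


def top_k_related(query, embeddings, k):
    """Return the k items whose embeddings are most similar to the query item's."""
    scores = [
        (cosine(embeddings[query], vec), item)
        for item, vec in embeddings.items()
        if item != query
    ]
    scores.sort(reverse=True)
    return [item for _, item in scores[:k]]


embeddings = {"tee": [1.0, 0.0], "polo": [0.9, 0.1], "lamp": [0.0, 1.0]}
related = top_k_related("tee", embeddings, k=2)
```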

Fig. 1: Impact of $\beta$ (sampling hyper-parameter) on view rates. Panels: (a) Clothing, (b) Home, (c) Baby, (d) Electronics. The x-axis is the logarithm of $\beta$.

V-A Setup

The loss function of SWAG is unsupervised: it tries to bring neighboring items closer in the embedding space and to bring high-weighted neighbors closer than low-weighted neighbors. In our tasks, we train four models for four distinct categories of merchandise: clothing or apparel, baby, home products and electronics items. We treat these four separately because co-views or co-purchases within one such category are found to be more relevant for guests than those across categories. Moreover, we train four different models because we assume that the role of item embeddings, image embeddings or past guest behavior differs depending on the category. Intuitively, image embeddings may play a larger role in apparel selection than in purchasing an iPhone. The total number of training nodes is close to 500K and the graphs have close to 50M edges. The graph is generated from guests' interactions with the retailer's website (billions of touchpoints).

For offline evaluation, we take past session logs of online guest behavior and evaluate the performance of the embeddings against these past guest sessions. For example, if a guest viewed item A and then viewed items B, C, D, E and F in a past session, we assume A to be the seed item and B/C/D/E/F to be the actual views of the guest. We compare these to the recommendations from the model under consideration and calculate the view rate. View rate is thus defined as the percentage of guests who looked at the top N recommendations (N is typically set to 5, as most guests look at the top 5 recommendations only) and clicked on one of them. However, this simulation is based on past traces of online behavior, and the guests were not actually shown the recommendations.
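The offline view rate can be sketched as below, under the assumption that a session counts as a hit when at least one of the top N recommendations for the seed item appears among the items the guest viewed afterwards. The function name `view_rate` and the toy sessions are hypothetical.

```python
def view_rate(sessions, recommend, n=5):
    """Percentage of sessions in which a top-n recommendation was actually viewed.

    `sessions` is a list of (seed_item, later_viewed_items) pairs and
    `recommend` maps a seed item to a ranked list of recommended items.
    """
    hits = sum(
        1 for seed, viewed in sessions
        if set(recommend(seed)[:n]) & set(viewed)
    )
    return 100.0 * hits / len(sessions)


recs = {"A": ["B", "X", "Y"], "P": ["Q", "R"]}
sessions = [("A", {"B", "C"}), ("P", {"Z"})]
rate = view_rate(sessions, lambda seed: recs[seed], n=5)
```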

We apply the word2vec algorithm to item descriptions and item attributes to generate the embeddings for online items on Target.com in these four categories. For image embeddings, we tried both the VGG-16 and ResNet-50 models pre-trained on ImageNet to generate the image embeddings for online items on Target.com. On evaluation, we settled on the VGG-16 model, as its embeddings performed slightly better than those of ResNet-50 for our task and our product catalog. VGG-16 embeddings are 512-dimensional vectors, while ResNet embeddings are 2048-dimensional vectors. The size of the input embeddings has an inverse relation with the computational speed of the model. The weights are generated using the logic described above by combining the co-view, add-to-cart and purchase behaviors of guests over a 200-day window for each item, weighting for recency and normalizing.

Fig. 2: Impact of $\gamma$ (aggregation hyper-parameter) on view rate. Panels: (a) Clothing, (b) Home, (c) Baby, (d) Electronics. The x-axis is the logarithm of $\gamma$. The best view rates are obtained with non-zero $\gamma$ in each category.
Fig. 3: Impact of $\alpha$ (loss hyper-parameter) on view rate. Panels: (a) Clothing, (b) Home, (c) Baby, (d) Electronics. The x-axis is the logarithm of $\alpha$. The best performance is reached by non-zero values of $\alpha$ for each category.
Fig. 4: View rate and run time comparisons for different sample sizes (numbers of neighbors). Panels: (a) View rate, (b) Computational time. Computational time is measured on two P100 GPU nodes for the clothing category. The x-axis represents the sample size used.

To get the best hyper-parameter set for each configuration (SWAG for a particular category of items with the input being (a) item description, (b) item image, or (c) both), we use the skopt package for tuning (https://scikit-optimize.github.io/). Here the objective is set to maximize the view rate. We tune the hyper-parameters $\alpha$, $\beta$ and $\gamma$ over a uniform logarithmic range. At the low end of the range, the algorithm is effectively the same as the GraphSAGE algorithm in [11]. At the high end, the weights raised to such a large power are significantly reduced. We discuss the impacts of the hyper-parameters $\alpha$, $\beta$ and $\gamma$ below.
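The tuning loop can be illustrated without external dependencies. The sketch below swaps the skopt optimizer for a plain log-uniform random search from the standard library, so it is a stand-in rather than the procedure used in the paper. The search range of 1e-3 to 1e3, the parameter names and the placeholder objective `toy_view_rate` are all assumptions; a real objective would train a model and return its offline view rate.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def random_search(objective, n_trials, low, high, seed=0):
    """Maximize `objective` over three exponents sampled log-uniformly."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_trials):
        params = {name: log_uniform(rng, low, high)
                  for name in ("alpha", "beta", "gamma")}
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_view_rate(alpha, beta, gamma):
    # Placeholder objective with an arbitrary peak; not a real measurement.
    return (-(math.log10(beta) - 1.0) ** 2
            - 0.1 * math.log10(gamma) ** 2
            - 0.01 * math.log10(alpha) ** 2)


best_params, best_score = random_search(toy_view_rate, n_trials=50, low=1e-3, high=1e3)
```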

The training was conducted on GPU nodes with (up to) 377 GB of RAM, 40 CPUs (Xeon 2.2GHz) and two P100 GPUs each.

In our offline evaluation, we observe that the view rates are higher for the clothing and electronics categories and lower for home and baby. These differences are due to market trends and seasonality associated with the past browse events. For example, during the festival season more people shop for electronics than for baby or home items. They do not indicate anything about the performance of the algorithm per se.

V-B Impacts of weight hyper-parameters $\alpha$, $\beta$ and $\gamma$

The impact of the hyper-parameter $\beta$, which is the sampling exponential degree on edge weights, is shown in Figure 1. In Figure 1, we take the average ratio among all training outputs for each value of $\beta$ across the tuned range, and the x-axis is the logarithm of $\beta$. In all categories, increasing $\beta$ significantly improves the view rate. Please note that the impact of the sampling hyper-parameter is illustrated in the figures for SWAG models with item-description embeddings (ID) as the input (for reasons mentioned later). However, the trend was the same for all other inputs.

For the aggregation hyper-parameter $\gamma$, the view ratios are higher for $\gamma$ closer to zero. In Figure 2, we observe a significant dip in view rates for clothing as compared to other categories. Typically, clothing or apparel is a category where guests browse the most, and across multiple categories, before purchasing. Hence, many edges can be spurious or irrelevant (with lower weights). Weighted aggregation seems to improve the performance by lowering the contribution of low-weight neighbors.

Figure 3 plots the impact of the loss hyper-parameter $\alpha$ on view rates across the categories. The impact of the loss degree is not significant: the bars of the ratio have almost the same height for all choices of $\alpha$ in all categories. Still, low values of $\alpha$ have a slightly higher view rate across all categories.

Fig. 5: Impacts of different aggregators in training. Panels: (a) Clothing, (b) Home, (c) Baby, (d) Electronics.
Fig. 6: t-SNE plot of embeddings for items in 2 dimensions.
Fig. 7: Probability density of pairwise cosine similarity for image embeddings, text embeddings, SAGE and SWAG embeddings. Panels: (a) Clothing, (b) Home, (c) Baby, (d) Electronics.

V-C Impacts of size of neighborhood sampling

Figure 4(a) shows the change in view rate as we increase the sample size. In our graphs, each node has more than 100 neighbors. Sampling them leads to homogeneity and also speeds up the computation of embeddings in each epoch. For clothing, as is evident in the figure, the increase in view rate is marginal beyond a sample size of 30. Figure 4(b) shows the computational time required for training the model for different sample sizes. The computational time increases significantly beyond a sample size of 30. Similar behavior was evident for the home category, where the graph size was large. Thus, we choose a sample size close to this threshold for clothing and home. The Electronics and Baby categories have smaller graphs, and hence an optimal trade-off was chosen around a sample size of 50.

Category ID II SAGE SWAG SAGE(+ID) SWAG(+ID) SAGE(+II) SWAG(+II) SAGE(+ID+II) SWAG(+ID+II)
Clothing 16.2 10.0 10.5 10.5 22.4 23.5 16.5 20.2 22.5 23.6
Home 12.0 12.5 5.3 5.3 14.2 16.5 13.2 14.5 14.3 16.5
Electronics 20.5 20.2 7.2 7.2 21.9 25.1 20.5 21.5 22.1 25.2
Baby 12.5 13.5 3.4 3.4 14 14.5 16.8 17.5 17.0 17.6
TABLE II: View rate for different models. ID: Item Description based word2vec embeddings directly used to generate recommendations. II: Item Image based visual embeddings directly used to generate recommendations. SWAG: SWAG algorithm without any node embeddings. SAGE: GraphSAGE without any node embeddings. (+) indicates the node embeddings used as input with the SAGE and/or SWAG model.

V-D Impacts of aggregators

All the aggregators presented in this section are weighted, i.e. the output of each aggregator is weighted by the edge weight raised to the power of the hyper-parameter $\gamma$. The performance of the gcn, swag_mean, LSTM, mean pooling and maximum pooling aggregators is compared in Figure 5. The swag_mean aggregator is the same as graphsage_mean [11] but with weights. A detailed explanation of these aggregators is given in [11]. We find that the swag_mean and mean_pooling aggregators outperform the other aggregators by a narrow margin in each category.

V-E Impact of input node embeddings

Table II gives the view rates of different embeddings for the four categories. ID refers to Item Description based word2vec embeddings directly used to generate recommendations. II refers to Item Image based visual embeddings directly used to generate recommendations. The numbers for SWAG and GraphSAGE are reported with and without the node embeddings used as input. We make some interesting observations from these view rates. First, we observe that item description embeddings perform slightly better than image embeddings for clothing and are almost equal to image embeddings in other categories. This can be attributed to the richness of attribute data as well as imperfections in using direct product images for generating embeddings. The product attributes include important information describing the product and are quite useful. The product images include background colors as well as models wearing an outfit. In future iterations, we plan to segment out the targeted product and use embeddings for regions of attention instead of full image embeddings. We also observe that the SAGE and SWAG models have the same performance in the absence of node embeddings. The computational time required for SWAG(+ID) is significantly less than the time required for the SWAG(+II) and SWAG(+ID+II) variants, yet its performance (view rate) is better than or similar to theirs. For the Baby category, the basic SWAG model has very poor performance, but incorporating node embeddings improves the view rates significantly.

Categories Clothing-5 Clothing-25 Home-5 Home-25 Baby-5 Baby-25 Elec-5 Elec-25
MRR (Swag) 0.083 0.105 0.061 0.077 0.062 0.095 0.081 0.088
MRR (Sage) 0.077 0.085 0.044 0.052 0.048 0.053 0.060 0.068
MRR (CF) 0.075 0.085 0.051 0.050 0.050 0.055 0.061 0.065
MPR (Swag) 0.107 0.146 0.075 0.098 0.072 0.101 0.106 0.110
MPR (Sage) 0.090 0.093 0.060 0.071 0.051 0.068 0.079 0.090
MPR (CF) 0.088 0.091 0.061 0.070 0.051 0.070 0.077 0.091
TABLE III: Values of mean percentile ranking (MPR) and mean reciprocal ranking (MRR) in four categories with 5 or 25 recommended items i.e., Clothing-5 indicates the clothing category with top 5 recommended items. We compare the metric values for SWAG with SAGE and collaborative filtering (CF) algorithms.

V-F MPR and MRR values

Table III gives the mean percentile ranking (MPR) and mean reciprocal ranking (MRR) [14] metric values for the four categories (clothing, home, baby and electronics) with 5 or 25 recommended items. We use two months of customers' real website browsing transactions as the ground truth for our recommendations to calculate the metrics. In Table III, we observe that the SWAG model is uniformly better than the SAGE model, because the SWAG model integrates the transactional view information carried by the edge weights through aggregation. Meanwhile, SAGE and the collaborative filtering recommendation model are almost on the same level in terms of MPR and MRR values. The performance of other baselines presented in the literature, such as GCN and node2vec, was inferior to GraphSAGE in offline tests, so we chose not to run exhaustive tests on them.
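For reference, the sketch below computes mean reciprocal rank in its common form: the reciprocal of the rank of the first relevant item within the top k recommendations, averaged over queries, with zero when no relevant item appears. Whether the paper uses exactly this variant is an assumption, and the function name and toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets, k):
    """Average of 1 / rank of the first relevant item within the top k (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]
mrr = mean_reciprocal_rank(ranked, relevant, k=5)
```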

V-G Embedding similarity distribution

An important indication of the effectiveness of the learned embeddings is that the distances between random pairs of output embeddings are widely distributed. If all items are at about the same distance (i.e., the distances are tightly clustered), then the embedding space does not have enough “resolution” to distinguish between items of different relevance. Figure 7 plots the distribution of cosine similarities between pairs of items using Image, Item, SAGE and SWAG embeddings. SWAG has the most spread-out distribution, indicating the ability to distinguish between items of different relevance and to avoid collisions in approximate K-nearest-neighbor algorithms (such as LSH).
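The kind of check described here can be sketched as follows: sample random pairs of embeddings, compute their cosine similarities, and summarize how spread out the resulting values are. This is a minimal sketch using only the Python standard library; the function names, the use of standard deviation as the spread measure, and the toy vectors are assumptions for illustration and not the procedure behind Figure 7.

```python
import math
import random
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_pair_similarities(
    embeddings: Sequence[Sequence[float]], num_pairs: int, seed: int = 0
) -> List[float]:
    """Cosine similarities for num_pairs randomly chosen distinct item pairs."""
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to form a pair")
    rng = random.Random(seed)
    similarities = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(embeddings)), 2)
        similarities.append(cosine_similarity(embeddings[i], embeddings[j]))
    return similarities


def spread(values: Sequence[float]) -> float:
    """Population standard deviation, used here as a simple spread summary."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


# Toy embeddings: all vectors point the same way, so similarities collapse to ~1.
collapsed = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# Toy embeddings with varied directions give a wider similarity distribution.
varied = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
collapsed_spread = spread(sample_pair_similarities(collapsed, num_pairs=50))
varied_spread = spread(sample_pair_similarities(varied, num_pairs=50))
```

A histogram of the sampled similarities would give a plot of the same kind as Figure 7; a near-zero spread corresponds to the tightly clustered case the text warns about.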

V-H A/B tests

Lastly, we also report on the production A/B test experiments, which compared the performance of SWAG to other deep learning content-based recommender systems at Target on the task of product recommendations.

The entire pipeline involves models built in PySpark, TensorFlow and Keras. The input item description (text) embeddings are generated using a PySpark word2vec model that runs on a Spark cluster sitting on top of HDFS. The PySpark modules run on around 150 executors, with the driver and each executor running on 26 GB of memory. Generation of weights (using the Jaccard index and weighted co-occurrences) is done using a PySpark module and a Python map-reduce module, respectively. The image embeddings are generated using pre-trained Keras models. The SWAG module is built and run using TensorFlow on a cluster of GPU nodes, each node having multiple GPUs and connected to HDFS for convenient access. The TensorFlow version used is 1.10.0. The weights and inputs are refreshed daily and scored by a trained SWAG model to give output embeddings, which are used to generate nearest neighbors. To keep the operational latency on the website low (a few milliseconds), we copy these nearest neighbors to a production server where only a lookup is required in real time. The recommendations are generated daily to reflect changes in the product catalog and trends in guest browse behavior. The hyper-parameters for each SWAG model are tuned offline; this tuning, done using the skopt library, takes a few days for each category.
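To make the Jaccard-index weighting concrete, the sketch below computes a Jaccard weight for every pair of items that co-occur in at least one browsing session. It is a single-machine toy in plain Python, not the PySpark module used in the pipeline, and it assumes that the sets being compared are the sets of sessions in which each item was viewed; the function name and the session data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def jaccard_edge_weights(
    sessions: Iterable[List[str]],
) -> Dict[Tuple[str, str], float]:
    """Jaccard index |S_a & S_b| / |S_a | S_b| for co-occurring item pairs.

    S_x is the set of session indices in which item x appears (an assumption
    about which sets are compared). Pairs that never co-occur get no edge.
    """
    item_sessions: Dict[str, Set[int]] = defaultdict(set)
    for session_id, items in enumerate(sessions):
        for item in items:
            item_sessions[item].add(session_id)

    weights: Dict[Tuple[str, str], float] = {}
    # Quadratic in the number of items: fine for a toy, not for a full catalog.
    for a, b in combinations(sorted(item_sessions), 2):
        shared = len(item_sessions[a] & item_sessions[b])
        if shared == 0:
            continue
        union = len(item_sessions[a] | item_sessions[b])
        weights[(a, b)] = shared / union
    return weights


# Toy sessions with made-up item identifiers.
toy_sessions = [["a", "b"], ["a", "b", "c"], ["a"]]
toy_weights = jaccard_edge_weights(toy_sessions)
```

Here item `a` appears in three sessions and `b` in two of them, so the edge (`a`, `b`) gets weight 2/3, while (`a`, `c`) gets 1/3.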

The metrics of interest are (a) the interaction rate, or view rate, and (b) the conversion rate, or the rate at which clicked items are finally added to cart and checked out. These metrics are measured in real time, based on the tagging of every page and carousel on the website.

We ruled out using raw word2vec based Item-Description embeddings or raw VGG based Item-Image embeddings for the production test after testing the recommendations offline with a group of volunteers. The performance of other baselines presented in the literature, such as GCN and node2vec, was inferior to GraphSAGE in offline tests, so we chose not to include them in the A/B tests. The other deep learning candidates for the online test were (a) visual recommendations (for clothing) based on CNNs provided by an industry vendor; this model uses multiple, more complex CNN models to identify objects of interest in an image and generates similar items, as in the visual search and shopping tools available publicly online, (b) visual + behavioral recommendations (for clothing) based on CNNs feeding into a Siamese network [18], and (c) a variant of the GraphSAGE model with parameters tuned for our graph and item data. We find that SWAG consistently performs better on these metrics than the other deep learning based approaches. With SWAG as the baseline, the relative performance of (a), (b) and (c) was 50%, 65% and 80%, respectively, in terms of both interaction rate and conversion rate.

Fig. 8: Examples of products recommended by different algorithms. The images on the extreme left are randomly chosen products; the top 3 recommendations from the raw algorithms (SWAG, SAGE and CF) are presented next to them.

V-I Visual inspection

We visualize the embedding space by randomly choosing 300 items from clothing and computing their 2D t-SNE coordinates in Figure 6. Proximity in the SWAG embedding corresponds well with similarity of content, and items of the same category are embedded closer to each other in t-SNE space. Women’s leggings are clustered at the bottom left, while women’s athletic tops, shape-wear, socks and men’s dresses each occupy their own separate region of t-SNE space.

Figure 8 illustrates the top few recommendations using each strategy for 2 sample items. We compare the recommendations of the top three algorithms under consideration (SWAG, GraphSAGE and CF) only for a random product from 4 categories: clothing, grocery, accessories and baby. Although all algorithms give good recommendations, the recommendations from SWAG are more nuanced and more similar to the query item. For grocery, the browse data is smaller, so we find that CF does not do as well as the others. CF also tends to give results from categories other than the selected one (juice is recommended for milk, a skirt for a fleece and a DVD for an accessory). This is not the case with GraphSAGE and SWAG. However, we find that the top-3 recommendations produced by SWAG are more relevant than those from GraphSAGE (which shows half and half for milk, and sleeveless recommendations for a fleece with sleeves).

VI Conclusion

We proposed SWAG, a graph convolutional network (GCN) suitable for weighted graphs. SWAG is capable of learning embeddings for nodes in web-scale graphs and is deployed to generate millions of product recommendations at Target. We evaluated SWAG using offline session metrics, embedding distributions, visual inspection and A/B tests, all of which demonstrate substantial improvements in recommendation performance over other deep learning architectures.

Possible areas of improvement include incorporating attention into weighted graphs [1], more exhaustive evaluation of computer vision models to extract better image embeddings [15], and generating graphs based on store and online purchases made by guests. The authors would also like to demonstrate performance on publicly available weighted datasets (and make some datasets public) in the future.


The authors thank Sayon Majumdar and Jacob Portnoy for helpful insights on training and testing SWAG for product recommendations and online tests.


  • [1] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alex Alemi. Watch your step: Learning graph embeddings through attention. In Advances in neural information processing systems, 2018.
  • [2] Md Zahangir Alom, Tarek M Taha, Christopher Yakopcic, Stefan Westberg, Mahmudul Hasan, Brian C Van Esesn, Abdul A S Awwal, and Vijayan K Asari. The history began from alexnet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164, 11 2018.
  • [3] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.
  • [4] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
  • [5] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [6] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. International Conference on Learning Representations (ICLR), CBLS, 2014.
  • [7] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
  • [8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
  • [9] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • [10] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
  • [11] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • [12] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
  • [14] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 8th IEEE International Conference on Data Mining, 2008.
  • [15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [16] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2016.
  • [17] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NeurIPS Workshop on Bayesian Deep Learning (NeurIPS BDL), 2016.
  • [18] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • [19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [20] Federico Monti, Michael Bronstein, and Xavier Bresson. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, pages 3697–3707, 2017.
  • [21] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
  • [22] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
  • [23] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
  • [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
  • [25] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009, 2009.
  • [26] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
  • [27] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In 24TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2018.
  • [28] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
  • [29] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, pages 457–466, 2018.