A critical component of a modern-day e-commerce platform is a user-personalized system for serving recommendations. While there has been extensive academic research on recommendation in the general e-commerce setting, user personalization in the online groceries domain is still nascent. An important characteristic of online grocery shopping is that it is highly personal: customers show regularity in both purchase types and purchase frequency, and exhibit specific preferences for product characteristics, such as brand affinity for milk or price sensitivity for wine.
One important type of grocery recommender system is a within-basket recommender, which suggests grocery items that go well with the items already in a customer’s shopping basket, such as milk with cereal or pasta with pasta sauce. In practice, customers often purchase groceries with a particular intent, such as preparing a recipe or stocking up on daily necessities. A within-basket recommendation engine therefore needs to consider both item-to-item compatibility within a shopping basket and user-to-item affinity, to generate effective product recommendations that are truly user-personalized.
In this paper, we introduce Real-Time Triple2Vec (RTT2Vec), a real-time inference architecture for serving within-basket recommendations. Specifically, we develop a representation learning model for the personalized within-basket recommendation task, and then convert the inference phase of this model into an approximate nearest neighbour (ANN) retrieval task for real-time serving. Further, we discuss some of the scalability trade-offs and engineering challenges in designing a large-scale, deep personalization system for a low-latency production application.
For evaluation, we conduct exhaustive offline experiments on two grocery shopping datasets and observe that our system outperforms the current state-of-the-art models. Our main contributions can be summarized as follows:
We introduce an approximate inference method which transforms the inference phase of a within-basket recommendation system into an Approximate Nearest Neighbour (ANN) embedding retrieval.
We describe a production real-time recommendation system which serves millions of online customers, while maintaining high throughput, low latency, and low memory requirements.
2 Related Work
Collaborative Filtering (CF) based techniques have been widely adopted in academia and industry for both user-item and item-item recommendations. Recently, this approach has been extended to the within-basket recommendation task. The factorization-based models BFM and CBFM consider multiple associations between the user, the target item, and the current user basket to generate within-basket recommendations. Even though these approaches directly optimize for task-specific metrics, they fail to capture non-linear user-item and item-item interactions.
Due to the success of latent representations of words (such as the skip-gram technique [19, 18]) in various NLP applications, representation learning models have been developed across other domains. The word2vec-inspired CoFactor model jointly utilizes Matrix Factorization (MF) and item embeddings to generate recommendations. Item2vec was developed to generate item embeddings on itemsets; using these, item-item associations can be modeled within the same itemset (basket). Prod2vec and bagged-prod2vec utilize the user purchase history to generate product ads recommendations by learning distributed product representations. Another representation learning framework, metapath2vec, uses meta-path-based random walks to generate node embeddings for heterogeneous networks, and can be adapted to learn latent representations on a user-item interaction graph. By leveraging basket and browsing data jointly, BB2vec learns dual vector representations for complementary recommendations. Even though the above skip-gram based approaches are used in wide-ranging applications such as digital advertising and recommendation systems, they fail to jointly optimize for user-item and item-item compatibility.
There has also been significant research on inferring functionally complementary relations for item-item recommendation tasks. These models focus on learning compatibility, complementarity [25, 12, 24], and complementary-similarity [17, 16] relations across items and categories from the co-occurrence of items in user interactions.
In this section, we explain the modeling and engineering aspects of a production within-basket recommendation system. First, we briefly introduce the state-of-the-art representation learning method for within-basket recommendation tasks, triple2vec. Then, we introduce our Real-Time Triple2Vec (RTT2Vec) inference formulation, production algorithm, and system architecture.
Problem Definition: Let U denote the set of users and I the set of items in the dataset. Let B_u ⊆ I denote a basket corresponding to user u ∈ U, where a basket refers to a set of items purchased together. Given a pair (u, B_u), the goal of the within-basket recommendation task is to generate a top-k set of items from I \ B_u that are complementary to the items in B_u and compatible with user u.
3.1 Triple2vec model
We utilize the triple2vec model for generating personalized recommendations. The model employs (user u, item i, item j) triples, denoting two items (i, j) bought by user u in the same basket, and learns a representation h_u for the user and a dual set of embeddings (f_i, g_j) for the item pair (i, j).
The cohesion score for a triple (u, i, j) is defined by Eq. 1 as
s(u, i, j) = h_u · f_i + h_u · g_j + f_i · g_j.   (1)
It captures both user-item compatibility (h_u · f_i, h_u · g_j) as well as item-item complementarity (f_i · g_j). The embeddings are learned by maximizing the co-occurrence log-likelihood of each triple:
log p(i | j, u) + log p(j | i, u) + log p(u | i, j),   (2)
where p(i | j, u) = exp(s(u, i, j)) / Σ_{i'} exp(s(u, i', j)). Similarly, p(j | i, u) and p(u | i, j) can be obtained by interchanging the roles of (i, j) and (u, i) respectively.
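As a concrete illustration, the cohesion score and its softmax normalization can be sketched in a few lines of NumPy. This is a minimal sketch with randomly initialized embeddings and hypothetical dimensions, not the actual training implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                              # embedding size (as in Section 4)
n_users, n_items = 10, 50

H = rng.normal(size=(n_users, d))   # user embeddings h_u
F = rng.normal(size=(n_items, d))   # item embeddings f_i
G = rng.normal(size=(n_items, d))   # dual item embeddings g_j

def cohesion(u, i, j):
    """Cohesion score of Eq. 1: s(u, i, j) = h_u.f_i + h_u.g_j + f_i.g_j."""
    return H[u] @ F[i] + H[u] @ G[j] + F[i] @ G[j]

def p_item_given_user_and_item(u, i):
    """Softmax over candidate items j of the cohesion score, i.e. the
    p(j | i, u) term of the co-occurrence log-likelihood (Eq. 2)."""
    scores = H[u] @ F[i] + (H[u] + F[i]) @ G.T   # s(u, i, j) for all j at once
    e = np.exp(scores - scores.max())            # numerically stable softmax
    return e / e.sum()

probs = p_item_given_user_and_item(u=0, i=3)
```

The vectorized `scores` line groups the terms so the whole candidate set is scored in one matrix product, which foreshadows the dot-product reformulation used for inference in Section 3.2.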
3.2 RTT2Vec: Real-Time Model Inference
Serving a personalized basket-to-item recommendation system is challenging in practice. In conventional production item-item or user-item recommendation systems, model recommendations are precomputed offline via batch computation and cached in a database for static lookup in real-time. This approach cannot be applied to basket-to-item recommendations due to the exponential number of possible shopping baskets. Additionally, model inference time increases with basket size (number of items), making it challenging to perform real-time inference within production latency requirements.
We transform the inference phase of triple2vec (Section 3.1) into a similarity search over dense embedding vectors. For a given user u and anchor item i, the cohesion score (Eq. 1) can be regrouped as shown in Eq. 3:
s(u, i, j) = h_u · f_i + (h_u + f_i) · g_j.   (3)
The first term is independent of the candidate item j, so ranking candidates reduces to a dot product between a query vector (h_u + f_i), which depends only on the inputs u and i, and index vectors g_j, which depend only on j, thus transforming our problem into a similarity search task.
Further, we speed up this similarity search by using an off-the-shelf Approximate Nearest Neighbour (ANN) indexing library, such as FAISS, ANNOY, or NMSLIB [20, 7], to perform approximate dot-product inference efficiently at large scale.
We also observe that model performance improves by interchanging the dual item embeddings and taking the average of the two cohesion scores, as shown in Eq. 4:
s̄(u, i, j) = ½ [ s(u, i, j) + s′(u, i, j) ],   (4)
where s′ denotes the cohesion score computed with the roles of the dual item embeddings (f, g) interchanged.
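This reduction can be checked numerically: ranking candidates by the averaged cohesion score is equivalent to ranking by a single dot product between a query vector built from (u, i) and index vectors built from the dual embeddings of each candidate j. A minimal sketch with random embeddings, using the notation h_u, f, g from Section 3.1:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_items = 8, 100
h_u = rng.normal(size=d)             # user embedding
f = rng.normal(size=(n_items, d))    # item embeddings f
g = rng.normal(size=(n_items, d))    # dual item embeddings g
i = 7                                # anchor item in the basket

# Averaged cohesion score (Eq. 4) for every candidate j:
s1 = h_u @ f[i] + (h_u + f[i]) @ g.T     # (u, i, j) with (f, g)
s2 = h_u @ g[i] + (h_u + g[i]) @ f.T     # dual embeddings interchanged
avg_score = 0.5 * (s1 + s2)

# Equivalent dot-product form: the query depends only on (u, i),
# the index vectors depend only on j.
query = 0.5 * np.concatenate([h_u + f[i], h_u + g[i]])
index = np.concatenate([g, f], axis=1)    # concatenated [g_j ; f_j]
dot_score = index @ query

# The leftover term 0.5*(h_u.f_i + h_u.g_i) shifts all scores equally,
# so the two rankings agree.
assert (np.argsort(-avg_score) == np.argsort(-dot_score)).all()
```

The j-independent constant is exactly why the problem becomes a pure maximum-inner-product search over the concatenated index vectors.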
3.3 RTT2Vec: Production Algorithm
The RTT2Vec algorithm used for generating top-k within-basket recommendations in production consists of three principal tasks: basket-anchor set selection, model inference, and post-processing. These steps are described below in detail:
Basket-anchor set selection: To generate personalized within-basket recommendations, we replace the anchor-item embeddings f_i and g_i with the average embedding of all the items in the shopping basket. This approach works well for smaller baskets, but in practice a typical family’s grocery basket contains dozens of items, and averaging over such large baskets loses information about the individual items. For larger baskets, we therefore deploy a sampling algorithm which randomly selects 50% of the items in the basket as the basket-anchor set.
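A sketch of the basket-anchor selection step follows. The 50% sampling rate comes from the text; the size threshold of 10 items is a hypothetical choice for illustration only:

```python
import numpy as np

def select_anchor_set(basket, max_size=10, rng=None):
    """Return the basket-anchor set: all items for small baskets,
    a random 50% sample for large ones (threshold is illustrative)."""
    rng = rng or np.random.default_rng()
    if len(basket) <= max_size:
        return list(basket)                       # every item is an anchor
    k = max(1, len(basket) // 2)                  # randomly keep 50% of items
    return list(rng.choice(basket, size=k, replace=False))

small = select_anchor_set([1, 2, 3])
large = select_anchor_set(list(range(40)), rng=np.random.default_rng(0))
```

Sampling keeps inference cost linear in half the basket size while retaining information about individual items that a pure average embedding would wash out.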
Model Inference: For each item in the basket-anchor set, we create the query vector using the pre-trained user embedding h_u and the item embeddings f_i and g_i (see Eq. 4). Then, we search the query vector against the Approximate Nearest Neighbour (ANN) index to retrieve the top-k recommendations.
The ANN index is created from the concatenation of the dual item embeddings [g_j; f_j]. The ANN index and embeddings are stored in memory for fast lookup. In practice, inference can be further sped up by performing a batch lookup in the ANN index instead of a sequential lookup for each item in the basket-anchor set.
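The batch lookup can be sketched as follows, with exact NumPy dot-product search standing in for an ANN library (FAISS and NMSLIB expose analogous batch query APIs); sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_items, n_anchors, k = 16, 1000, 5, 20

index = rng.normal(size=(n_items, 2 * d))        # [g_j ; f_j] concatenations
queries = rng.normal(size=(n_anchors, 2 * d))    # one query per anchor item

# One matrix product scores all anchors against all items at once,
# instead of n_anchors sequential lookups.
scores = queries @ index.T                        # shape (n_anchors, n_items)

# Unsorted top-k per anchor, then sort each anchor's candidates by score.
topk = np.argpartition(-scores, k, axis=1)[:, :k]
order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
ranked = np.take_along_axis(topk, order, axis=1)  # ranked top-k per anchor
```

An ANN index replaces the exact `queries @ index.T` pass with an approximate search, trading a small amount of recall for much lower latency at production scale.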
After the top-k recommendations are retrieved for each anchor item in the basket-anchor set, a recommendation aggregator module blends all the recommendations together. The aggregator uses several factors, such as the number of distinct categories in the recommendation set, the individual item scores, taxonomy-based weighting, and business rules, to merge the multiple recommendation sets and filter them down to a top-k recommendation set.
Post-processing: Once the top-k recommendation set is generated, an additional post-processing layer is applied. This layer incorporates diversification of items, removes blacklisted items and categories, utilizes market-basket analysis association rules for taxonomy-based filtering, and applies some business requirements to generate the final top-k recommendations for production serving.
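The aggregation and filtering steps above can be sketched as a simple merge-and-filter pass. Item names, scores, and the max-score merge rule are hypothetical illustrations; the production aggregator also applies taxonomy weighting and further business rules:

```python
def aggregate(per_anchor_recs, blacklist, k):
    """Merge per-anchor {item: score} dicts, drop blacklisted items,
    and return the top-k items by their best score across anchors."""
    merged = {}
    for recs in per_anchor_recs:
        for item, score in recs.items():
            merged[item] = max(merged.get(item, float("-inf")), score)
    kept = [(score, item) for item, score in merged.items()
            if item not in blacklist]
    return [item for score, item in sorted(kept, reverse=True)][:k]

recs = aggregate([{"milk": 0.9, "beer": 0.8}, {"cereal": 0.7, "milk": 0.5}],
                 blacklist={"beer"}, k=2)
```

Taking the best score per item deduplicates recommendations that surface from multiple anchors before the final cut to k items.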
3.4 RTT2Vec: Production System Architecture
In this section, we provide a high-level overview of our production recommendation system, as illustrated in Figure 1. The system is comprised of both offline and online components. The online system consists of a Memcached distributed cache, a streaming system, a real-time inference engine, and a front-end client. The offline system encompasses a data store, a feature store serving all the recommendation engines at Walmart, and an offline model training framework deployed on a cluster of GPUs.
At Walmart Grocery, we deal with a large volume of customer interactions, streaming in at various velocities. We use the Kafka streaming engine to capture real-time customer data without delay and store the data in a Hadoop-based distributed file system. For offline model training, we construct training examples by extracting features from our feature store through Hive and Spark jobs. Then, the training examples are input into an offline deep learning model, which is trained on a GPU cluster, generating user and dual-item embeddings. These embeddings are then stored in an embedding store (distributed cache) to facilitate online retrieval by the real-time inference engine.
The primary goal of deploying a real-time inference engine is to provide personalized recommendations while ensuring very high throughput and a low-latency experience for the customer. The real-time inference engine utilizes an Approximate Nearest Neighbour (ANN) index, constructed from the trained embeddings, and is deployed as a micro-service. This engine interacts with the front-end client to obtain the user and basket context, and generates personalized within-basket recommendations in real-time.
Our experimental evaluation is performed on one public dataset and one proprietary dataset. Both datasets are split into train, validation, and test sets. The public Instacart dataset is already split into prior, train, and test sets. For the Walmart Grocery dataset, the train, validation, and test sets comprise one year, the next 15 days, and the next one month of transactions, respectively.
Walmart: We use a subset of a proprietary online Walmart Grocery  dataset for these experiments. The dataset contains approximately 3.5m users and 90k items with 800m interactions.
Metrics: We evaluate the performance of models with two metrics: Recall@K and NDCG@K. Recall@K measures the fraction of relevant items successfully retrieved when the top-K items are recommended. NDCG@K (Normalized Discounted Cumulative Gain) is a ranking metric which uses position in the recommendation list to measure gain. Metrics are reported at K=20.
For the within-basket recommendation task, given a subset of the basket, the goal is to predict the remaining items in the basket. Let each basket B be split into two sets B_in and B_out, where B_in denotes the subset of items in B used for inference and B_out denotes the remaining set of items in the basket. Let R_K denote the top-K recommendation list generated using B_in. Then:
Recall@K = |R_K ∩ B_out| / |B_out|,
NDCG@K = ( Σ_{i ∈ R_K} 1[i ∈ B_out] / log2(rank_i + 1) ) / IDCG@K,
where rank_i denotes the rank of item i in the recommended list R_K, 1[·] is the indicator function indicating if i ∈ B_out, and IDCG@K is the ideal DCG obtained by placing all relevant items at the top of the list.
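The two metrics can be sketched directly from these definitions, assuming binary relevance and an ideal-DCG normalizer:

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items retrieved in the top-k list."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k list divided by the ideal DCG; rank r is 0-based,
    so the discount for rank r is 1/log2(r + 2)."""
    dcg = sum(1.0 / np.log2(r + 2)
              for r, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
```

For example, a perfect ranking of the relevant items yields NDCG@K = 1, while relevant items pushed down the list are discounted logarithmically by position.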
4.3 Baseline Models
Our system is evaluated against the following models:
ItemPop: The Item Popularity (ItemPop) model selects the top-K items based on their frequency of occurrence in the training set. The same set of items is recommended for every test basket and user.
item2vec: The item2vec  model uses Skip-Gram with Negative Sampling (SGNS) to generate item embeddings on itemsets. We apply this model on within-basket item-sets to learn co-occurrence of items in the same basket.
BB2vec: The BB2vec  model learns vector representations from basket and browsing sessions. For a fair comparison with other models, we have adapted this method to only use basket data and ignore view data.
triple2vec: The state-of-the-art triple2vec model (as described in Section 3.1) employs the Skip-Gram model with Negative Sampling (SGNS), applied over (user, item, item) triples, to generate within-basket recommendations.
Hyperparameters: We use an embedding size of 64 for all skip-gram based techniques, along with the Adam optimizer with an initial learning rate of 1.0 and the noise-contrastive estimation (NCE) approximation of softmax as the loss function. A batch size of 1000 and a maximum of 100 epochs are used to train all skip-gram based models. We use 5 million triples to train on the Instacart dataset and 200 million triples for the Walmart dataset.
We next evaluate our model’s predictive performance and system latency. The models are trained on a cluster of NVIDIA K80 GPU nodes, each with 48 CPU cores. For evaluation and benchmarking, we use an 8-core x86_64 CPU with 2 GHz processors.
Predictive Performance: We compare the performance of our system, RTT2Vec, against the models described in Section 4.3 on the within-basket recommendation task. For each basket in the test set, we use 80% of the items as input and the remaining 20% of items as the relevant items to be predicted. As displayed in Table 1, we observe that our system outperforms all other models on both Instacart and Walmart datasets, improving Recall@20 and NDCG@20 by 9.37% (5.75%) and 21.5% (9.01%) for Instacart (Walmart) datasets when compared to the current state-of-the-art model triple2vec.
Real-Time Latency: Further, we test the real-time latency of our system using the exact and approximate inference methods discussed in Section 3. Figure 2 displays system latency (ms) versus basket size. To perform exact inference based on Eq. 4, we use ND4J; for approximate inference (as discussed in Section 3.2), we test the Faiss, Annoy, and NMSLIB libraries.
ND4J is a highly-optimized scientific computing library for the JVM. Faiss is used for efficient similarity search of dense vectors and can scale to billions of embeddings, Annoy is an approximate nearest neighbour library optimized for memory usage and loading/saving to disk, and NMSLIB is a similarity search library for generic non-metric spaces.
On average, ND4J adds 186.5 ms of latency when performing exact real-time inference. For approximate inference, the Faiss, Annoy, and NMSLIB libraries add 29.3 ms, 538.7 ms, and 16.07 ms of system latency, respectively. Faiss and NMSLIB provide an option to perform batch queries on the index, so their latency is much lower than Annoy’s; both are 6-10 times faster than the exact inference method using ND4J. In practice, we use NMSLIB in our production system as it provides the best overall performance.
5 Conclusion and Future Work
In this paper, we propose a state-of-the-art real-time user-personalized within-basket recommendation system, RTT2Vec, to serve personalized item recommendations at large scale within production latency requirements. To the best of our knowledge, this study is the first description of a large-scale production grocery recommendation system in the industry. Our approach outperforms all baseline models on the evaluation metrics, while respecting low-latency requirements when serving recommendations at scale.
Due to the increasing adoption of online grocery shopping and the associated surge in data size, the training time required for deep embedding models for personalized recommendations is growing. Future work includes investigating the performance trade-offs of different sampling methodologies during model training. We are also exploring additional content and contextual embeddings to further improve model predictions.
- ANNOY library. https://github.com/spotify/annoy, accessed Sep. 2019.
- “The Instacart Online Grocery Shopping Dataset 2017”. https://www.instacart.com/datasets/grocery-shopping-2017, accessed 25 Sep. 2019.
- Eclipse Deeplearning4j Development Team. Deeplearning4j: open-source distributed deep learning for the JVM, Apache Software Foundation License 2.0. http://deeplearning4j.org
- Walmart. “Grocery home shopping”. http://grocery.walmart.com/, accessed Aug. 2019.
- Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2016) Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
- (2013) Engineering efficient and effective non-metric space library. In International Conference on Similarity Search and Applications, pp. 280–293.
- (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144.
- (2015) E-commerce in your inbox: product recommendations at scale. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1809–1818.
- (2008) Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272.
- (2017) Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
- (2019) Complete the look: scene-based complementary product recommendation. In , pp. 10532–10541.
- (2017) Basket-sensitive personalized item recommendation.
- (2016) Factorization meets the item embedding: regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 59–66.
- (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing (1), pp. 76–80.
- (2019) Complementary-similarity learning using quadruplet network. arXiv preprint arXiv:1908.09928.
- (2015) Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- (2016) Non-metric space library (NMSLIB).
- (2018) Inferring complementary products from baskets and browsing sessions. arXiv preprint arXiv:1809.09621.
- (2015) Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4642–4650.
- (2018) Representing and recommending shopping baskets with complementarity, compatibility and loyalty. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1133–1142.
- (2019) Modeling complementary products and customer preferences with context knowledge for online recommendation. CoRR abs/1904.12574.
- (2018) Quality-aware neural complementary item recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 77–85.