"Are you sure?": Preliminary Insights from Scaling Product Comparisons to Multiple Shops

by Patrick John Chia, et al.

Large eCommerce players introduced comparison tables as a new type of recommendations. However, building comparisons at scale without pre-existing training/taxonomy data remains an open challenge, especially within the operational constraints of shops in the long tail. We present preliminary results from building a comparison pipeline designed to scale in a multi-shop scenario: we describe our design choices and run extensive benchmarks on multiple shops to stress-test it. Finally, we run a small user study on property selection and conclude by discussing potential improvements and highlighting the questions that remain to be addressed.




1. Introduction

Online shopping has seen tremendous growth in recent years (Statista Research Department, 2020), and shoppers now face innumerable possibilities, which paradoxically may lead to decreasing satisfaction in their purchase decisions (Scheibehenne et al., 2010). Recommender systems (RSs) have been playing an indispensable role in fighting information overload, and large players (Pichestapong, 2019; Hosanagar et al., 2014) have been mostly responsible for modelling and product innovation (Tsagkias et al., 2020; Bhagat et al., 2018).

Comparison engines (CEs) are a special case of RS, in which a product detail page (PDP) displays alternative choices in a table containing informative product specifications (Fig. 1). Unlike prevalent "More like this" RSs, well-designed comparison tables not only promote products that are relevant, but also intentionally select products which help customers better understand the range of available features. However, as demonstrated by the sub-optimal alternatives in Fig. 1, building a CE is far from trivial even for players with full ownership of the data chain. In this paper, we share preliminary lessons learned when building CEs in a B2B scenario, that is, designing a scalable pipeline that is deployed across multiple shops.

As convincingly argued in (Tagliabue et al., 2020b; Bianchi et al., 2020; Tagliabue et al., 2021), multi-tenant deployments require models to generalize to dozens of different retailers: a successful CE is therefore not only hard to build, but valuable to a wide range of practitioners. On one side are practitioners outside the largest websites, who want to enhance their shop in the face of rising pressure from major players; on the other are multi-tenant SaaS providers, who need to provide AI-based services that scale to a large number of clients. [Footnote 1: As an indication of the relevant SaaS market size, we witnessed Coveo, Algolia, Lucidworks and Bloomreach raising more than 100M USD each from venture funds in the last two years for AI-powered services (Techcrunch; Techcrunch, 2019a, b, 2021).] We summarize our contributions as follows:

  • we are the first, to the best of our knowledge, to detail a pipeline for building a comparison engine designed to be scalable in a multi-tenant scenario;

  • we perform extensive experiments on various data cleaning and augmentation approaches. One of our major practical contributions, in line with what was independently reported by Hao et al. (2020), is questioning the widely held belief that co-occurrence patterns are a sufficient proxy for substitutable products (McAuley et al., 2015);

  • we discuss the importance of diversity in the comparison table and propose a decision process to determine relevant attributes.

Figure 1. Recommendations as comparisons on Amazon.com: the first product (in yellow) is the product in the PDP, and the other three are suggestions. As is clear from items #2 and #3, finding substitutes at scale is anything but a trivial task.


While we do acknowledge that full online testing is needed to answer some outstanding design questions, we supplement our pipeline tests with a user study, allowing for a preliminary comparison between our results and human judgments, as well as guiding future decisions in our roadmap. Practitioners looking to replicate our work are encouraged to check the Appendix for details on our tools, modelling choices and Mechanical Turk setup.

2. Comparison Engine Pipeline

In this section, we present the pipeline architecture of our comparison system. The pipeline is composed of three main stages: a first fast candidate retrieval phase, to narrow down the search space; a candidate refinement phase, to ensure precision and produce the final shortlist of products; lastly, a final selection phase, to determine the information to be displayed in the comparison table. We will first explain the logic of each stage, then detail the experiments performed to benchmark the pipeline. [Footnote 2: Note that due to space constraints, we cite the most relevant literature inline at the most appropriate step.]

Figure 2. An overview of our CE pipeline: (1) product for the current PDP; (2) fast retrieval of candidate substitutes, focusing on recall; (3) refinement of candidate substitutes using binary classification model; (4) selection of important properties; (5) final selection of substitutes to display in comparison table.


                Shop A                                              Shop B
Configuration   P@R=0.7          P@R=0.8          P@R=0.9          P@R=0.7          P@R=0.8          P@R=0.9
Baseline        0.744            0.743            0.682            0.611            0.573            0.539
C=0; S=0        0.759 (0.0235)   0.710 (0.0212)   0.645 (0.0195)   0.734 (0.0324)   0.690 (0.0333)   0.632 (0.0290)
C=0; S=1        0.766 (0.0257)   0.723 (0.0237)   0.659 (0.0189)   0.755 (0.0379)   0.706 (0.0423)   0.643 (0.0382)
C=1; S=0        0.833 (0.0162)   0.802 (0.0226)   0.740 (0.0301)   0.777 (0.0162)   0.732 (0.0131)   0.658 (0.0123)
C=1; S=1        0.842 (0.0150)   0.812 (0.0208)   0.753 (0.0280)   0.789 (0.0189)   0.743 (0.0196)   0.663 (0.0176)

Table 1. Precision@Recall = {0.7, 0.8, 0.9} for various configurations for Shop A and Shop B; P@R=X denotes Precision at Recall of X.

2.1. Fast Retrieval

Candidate retrieval aims to quickly generate potential substitutes given a query product, with a focus on recall (step 2 in Fig. 2): we try to get a more diverse set, knowing false positives will be screened out in later phases. A common practice for fast retrieval in a dense space is using k-NN over an embedding space (Zhao et al., 2018; Covington et al., 2016). Since recent literature (Bianchi et al., 2020, 2021; Tagliabue et al., 2020a) provides extensive evidence on the representational qualities of behavioral embeddings, we train a prod2vec space (Grbovic et al., 2015) by adapting word2vec (Mikolov et al., 2013) to eCommerce: a prod2vec space is just a word2vec space in which the words of a sentence are replaced with the products of a shopping session (Appendix B). After obtaining a prod2vec space, we apply k-NN (based on cosine distance) to retrieve the products closest to the query product as its substitute candidates. Analogous to words in word2vec, products which are distributionally similar (based on historical sessions) are close in the prod2vec space; the candidates retrieved in this step are therefore already biased towards substitutable products.
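To make the retrieval step concrete, the following is a minimal brute-force sketch of k-NN by cosine similarity over a product embedding space. It is an illustration only: the function names, the toy two-dimensional vectors and the value of k are assumptions, and a production system would likely use an approximate nearest-neighbour index over a trained prod2vec space rather than this exhaustive scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_candidates(query_id, embeddings, k=3):
    """Return the k product ids most similar to query_id (exhaustive scan)."""
    query_vec = embeddings[query_id]
    scored = [
        (cosine_similarity(query_vec, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# Toy 2-d "prod2vec" space with hypothetical values.
toy_space = {
    "tv_a": [0.9, 0.1],
    "tv_b": [0.8, 0.2],
    "tv_c": [0.7, 0.4],
    "hdmi_cable": [0.1, 0.9],
}
candidates = retrieve_candidates("tv_a", toy_space, k=2)
```

In this toy space the two other TVs are retrieved ahead of the cable, which mirrors the intuition that products browsed in similar contexts end up close together.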

2.2. Candidate Refinement

Candidates produced by the first stage are passed to the second stage for fine-grained processing. The goal of this stage is to boost precision by filtering out candidates that do not have a matching product type, and to re-rank the remaining ones so that the most comparable ones are at the top of the list (step 3 in Fig. 2). We employ a binary classification model (i.e. given a pair of products, are they substitutes?) built on top of a Siamese Network (Bromley et al., 1993), fed with unsupervised behavioral data.

2.2.1. Unsupervised Behavioural Data

Generating training data for substitute product detection is a well-explored topic in the literature (McAuley et al., 2015; Chen et al., 2020; Zhang et al., 2019). However, our inference is somewhat harder than that of a general substitute classifier where products are sampled from the entire catalog, as our model needs to be able to make subtler distinctions among a selected group of candidates that have been shortlisted by a coarse similarity measure (Section 2.1). To overcome the problem of naive sampling and reduce the noise in behavioral data, we built a three-step process generating the final training set, with free parameters detailed in Appendix B:

  1. We use co-view and co-purchase patterns to obtain positive and negative training examples. Positive examples are obtained from pairs of products which are viewed consecutively (co-view: if I want a TV, I will check several TVs in a row) and negative examples are obtained from products which are purchased consecutively (co-purchase: if I just bought a TV, I am unlikely to buy a second one). To reduce noise, we set a minimum threshold on the number of co-view occurrences and on the number of co-purchase occurrences for a pair to be considered a positive or negative example respectively.

  2. We intuit that substitutable products are a priori visually similar, and utilize this to further reduce noise in the data. Thus, we apply a threshold on the cosine similarity of the image embeddings of pairs to further refine this set of training examples. Given an image vector obtained through a pre-trained VGG16 (Simonyan and Zisserman, 2015), we enforce that positive/negative pairs must have a minimum/maximum cosine similarity. [Footnote 3: While drafting this paper, we realized a similar approach has been recommended independently by Zuo et al. (2020).] We refer to this refinement/cleaning process as C.

  3. We remove pairs which are given both positive and negative labels, then build a graph using the remaining positive pairs and extract disconnected subgraphs as clusters of substitutable products. When generating synthetic pairs, we eliminate clusters according to a size threshold, to reduce the risk of sampling from clusters formed by noisy pairs and, at the same time, to improve the balance of product types in the training data. Taking an existing positive/negative pair from our behavioural logs, we generate synthetic pairs by swapping out one of the products in the original pair with any product found in its substitute cluster, unlike Guo et al. (2020), who sample negative examples from a random disconnected subgraph. We refer to this augmentation process as S.

We emphasize that only behavioral logs and product images are necessary so far: our approach does not assume peculiar meta-data or pre-made taxonomy, nor does our classifier require costly labelling, making the pipeline suitable for multi-shop scaling.
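The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the production implementation: all function names, thresholds and the toy sessions are hypothetical; `image_sim` stands in for the cosine similarity between pre-computed image vectors; and the cluster-size filter is omitted, since its threshold is a free parameter described in Appendix B.

```python
from collections import Counter, defaultdict


def consecutive_pair_counts(sessions):
    """Count unordered pairs of distinct products that are adjacent in a session."""
    counts = Counter()
    for session in sessions:
        for left, right in zip(session, session[1:]):
            if left != right:
                counts[frozenset((left, right))] += 1
    return counts


def label_pairs(view_sessions, purchase_sessions, min_views, min_purchases):
    """Step 1: frequent co-views become positives, frequent co-purchases negatives."""
    views = consecutive_pair_counts(view_sessions)
    purchases = consecutive_pair_counts(purchase_sessions)
    positives = {pair for pair, n in views.items() if n >= min_views}
    negatives = {pair for pair, n in purchases.items() if n >= min_purchases}
    return positives, negatives


def clean_with_images(positives, negatives, image_sim, min_pos_sim, max_neg_sim):
    """Step 2 (C): keep visually similar positives and visually dissimilar negatives."""
    kept_pos = {p for p in positives if image_sim(*sorted(p)) >= min_pos_sim}
    kept_neg = {p for p in negatives if image_sim(*sorted(p)) <= max_neg_sim}
    return kept_pos, kept_neg


def substitute_clusters(positives, negatives):
    """Step 3: drop conflicting pairs, then take connected components of positives."""
    graph = defaultdict(set)
    for pair in positives - negatives:
        a, b = tuple(pair)
        graph[a].add(b)
        graph[b].add(a)
    cluster_of = {}
    for start in list(graph):
        if start in cluster_of:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        for node in component:
            cluster_of[node] = frozenset(component)
    return cluster_of


def synthetic_pairs(pair, cluster_of):
    """Step 3 (S): swap one product of a pair with members of its substitute cluster."""
    a, b = sorted(pair)
    generated = set()
    for keep, swap in ((a, b), (b, a)):
        for other in cluster_of.get(swap, ()):
            if other not in (a, b):
                generated.add(frozenset((keep, other)))
    return generated


# Toy logs (hypothetical): three TVs browsed together, a cable bought after a TV.
view_sessions = [["tv1", "tv2", "tv3"], ["tv1", "tv2"], ["tv2", "tv3"], ["tv1", "cable"]]
purchase_sessions = [["tv1", "cable"], ["tv1", "cable"]]


def toy_image_sim(a, b):
    """Placeholder for image-vector cosine similarity."""
    return 0.9 if a.startswith("tv") and b.startswith("tv") else 0.1


positives, negatives = label_pairs(view_sessions, purchase_sessions, 2, 2)
positives, negatives = clean_with_images(positives, negatives, toy_image_sim, 0.5, 0.5)
cluster_of = substitute_clusters(positives, negatives)
augmented = synthetic_pairs(frozenset(("tv1", "cable")), cluster_of)
```

On the toy logs, the observed negative pair (tv1, cable) yields two synthetic negatives, (tv2, cable) and (tv3, cable), because tv2 and tv3 share a substitute cluster with tv1.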

2.2.2. Binary Classifier: A Siamese Network

We utilize a binary classifier to predict whether two input products are substitutes or not. Products are represented by various dense representations of product features, such as behavioural embeddings and word2vec embeddings for product title, description and category strings (See Appendix B). For full reproducibility, we provide architectural and hyper-parameter details in Appendices B & C.
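As a rough illustration of the Siamese idea, the sketch below applies one shared encoder to both products and scores the pair from the element-wise difference of the two encodings. The single tanh layer, the absolute-difference merge, and all weights are assumptions made for the example; the actual architecture and hyper-parameters are those described in Appendices B & C.

```python
import math


def encode(features, encoder_weights):
    """Shared encoder: one linear layer followed by tanh (hypothetical architecture)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, features)))
        for row in encoder_weights
    ]


def substitute_probability(product_a, product_b, encoder_weights, head_weights, head_bias):
    """Score a product pair; the same encoder weights are used for both inputs."""
    enc_a = encode(product_a, encoder_weights)
    enc_b = encode(product_b, encoder_weights)
    # The absolute difference makes the score symmetric in (a, b).
    diff = [abs(x - y) for x, y in zip(enc_a, enc_b)]
    logit = sum(w * d for w, d in zip(head_weights, diff)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical, untrained weights and dense product representations.
encoder_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
head_weights = [-2.0, -2.0]
head_bias = 1.0
product_a = [0.2, 0.7, 0.1]
product_b = [0.9, -0.4, 0.3]

p_ab = substitute_probability(product_a, product_b, encoder_weights, head_weights, head_bias)
p_ba = substitute_probability(product_b, product_a, encoder_weights, head_weights, head_bias)
p_aa = substitute_probability(product_a, product_a, encoder_weights, head_weights, head_bias)
```

Sharing the encoder means the score does not depend on the order of the two products, which is a natural property for a substitutability judgement.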

2.3. Product and Property Selection

2.3.1. Relevant Property Selection

At this stage, we make the only significant meta-data assumption of the entire pipeline, that is, the target catalog should specify product properties in some structured way. Based on our experience with dozens of deployments, this is not a universal feature, but it is common for verticals with technical products (DIY, electronics, etc.), for which CEs are most useful. Given a mapping from products to their properties (say, from TV to the set ⟨resolution, screen size, …⟩), this stage determines which properties are relevant to shoppers when they are making a purchase decision (step 4 in Fig. 2). By passing the candidates from Section 2.1 to the classifier in Section 2.2, we generate a final list of substitutable products, given an initial query product. For this list, we rank properties based on the weighted sum of three components [Footnote 4: Weights have been determined empirically at first, but see Section 3.2 and our conclusion for potential use of human-in-the-loop inference.], highlighted as important by previous literature (Katukuri et al., 2014; Dong et al., 2020) and domain knowledge:

  1. Query frequency: properties which are important to shoppers tend to appear frequently in shopper-generated content (Bing et al., 2016; Moraes et al., 2020) such as queries (Katukuri et al., 2014). We calculate the query frequency for each property (and its possible values) by mining search logs, and normalize the counts to a common range;

  2. PDP frequency: merchandisers are more likely to explicitly mention important attributes in the PDP. We calculate a normalized count for each attribute by mining product descriptions in the catalog;

  3. Property entropy: it is important, for meaningful comparison, that property values have enough variation, so that comparison tables can help navigate easily the possible dimensions of a catalog. To calculate variety, we measure the entropy of the distribution of property values across the list of substitute products.
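The ranking described by the three components above can be sketched as a weighted sum. In this illustration the normalized frequencies, the equal weights and the toy TV list are all hypothetical, and entropy is measured in bits without further normalization, so in practice the weights would also need to absorb the difference in scale between components.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_properties(query_freq, pdp_freq, substitutes, weights=(1.0, 1.0, 1.0)):
    """Rank properties by a weighted sum of query frequency, PDP frequency and entropy."""
    w_query, w_pdp, w_entropy = weights
    scores = {}
    for prop in query_freq:
        values = [product[prop] for product in substitutes if prop in product]
        variety = entropy(values) if values else 0.0
        scores[prop] = (
            w_query * query_freq[prop]
            + w_pdp * pdp_freq.get(prop, 0.0)
            + w_entropy * variety
        )
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical normalized frequencies and a toy list of substitute TVs.
query_freq = {"resolution": 0.6, "screen_size": 1.0, "color": 0.1}
pdp_freq = {"resolution": 0.8, "screen_size": 0.9, "color": 0.2}
substitutes = [
    {"resolution": "4K", "screen_size": "55", "color": "black"},
    {"resolution": "4K", "screen_size": "55", "color": "black"},
    {"resolution": "HD", "screen_size": "55", "color": "black"},
    {"resolution": "8K", "screen_size": "55", "color": "silver"},
]
ranking = rank_properties(query_freq, pdp_freq, substitutes)
```

Here screen size is the most searched property, but because every substitute has the same value its entropy is zero, and resolution ends up ranked first.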

2.3.2. Final Display Selection

Recent literature (Wu et al., 2019) has highlighted the importance of diversity in RSs. Thus, after determining important product properties, we select the final substitutes per query item (step 5 in Fig. 2), by making two additional calculations: price diversification and representative selection. Given the list of substitutes, we group products into 7 bins based on their log price. [Footnote 5: The mean log-price is used to set the central bin and the standard deviation is used to determine the bin width.] We discard the first and the last bin, as extreme prices can signal a potential mismatch in product category. With 5 bins remaining, we sample one substitute from the same price bin as the query item, and two substitutes from its higher-pricing and lower-pricing neighbor bins. Finally, similarly to Chen and Karger (2006), we employ a greedy approach during sampling, which maximises the information diversity among the final products to be displayed. We represent the property values of each product via one-hot encoding, so that products are represented by a concatenation of their one-hot encoded property vectors. We compute the difference in their information content by Hamming distance, where each property is weighted by the negative exponential of the entropy of the distribution of the property's values. The intuition is that we want to vary properties which are far from quasi-uniform distributions to display products with meaningful variation, thereby giving shoppers a more complete picture of what is available.
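A minimal sketch of the two calculations follows. Several details are assumptions made for illustration: the bin width is taken to be exactly one standard deviation of the log-price, the greedy criterion is the summed distance to the products already shown, and entropy is computed in nats. The weighted mismatch count used here equals the weighted Hamming distance on one-hot vectors up to a constant factor of two, since two different values of one property differ in exactly two one-hot positions.

```python
import math
import statistics
from collections import Counter


def price_bin(price, mean_log, std_log, n_bins=7):
    """Bin index in [0, n_bins - 1]; the central bin is centred on the mean log-price."""
    centre = n_bins // 2
    offset = math.floor((math.log(price) - mean_log) / std_log + 0.5)
    return min(max(centre + offset, 0), n_bins - 1)


def entropy_weights(products, properties):
    """Weight each property by exp(-entropy) of its value distribution."""
    weights = {}
    for prop in properties:
        counts = Counter(product[prop] for product in products)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        weights[prop] = math.exp(-entropy)
    return weights


def weighted_mismatch(a, b, weights):
    """Sum of the weights of the properties on which two products differ."""
    return sum(w for prop, w in weights.items() if a[prop] != b[prop])


def greedy_diverse(query, candidates, weights, n_select):
    """Greedily add the candidate farthest (in total) from everything already shown."""
    shown, pool = [query], list(candidates)
    while pool and len(shown) <= n_select:
        best = max(pool, key=lambda c: sum(weighted_mismatch(c, s, weights) for s in shown))
        shown.append(best)
        pool.remove(best)
    return shown[1:]


# Toy prices (hypothetical) to derive the bin parameters.
toy_prices = [50, 80, 100, 120, 200]
log_prices = [math.log(p) for p in toy_prices]
mean_log = statistics.fmean(log_prices)
std_log = statistics.pstdev(log_prices)

# Toy shoes (hypothetical) for the diversity step.
properties = ["waterproof", "sole", "closure"]
query = {"name": "q", "waterproof": "yes", "sole": "rubber", "closure": "laces"}
candidates = [
    {"name": "p1", "waterproof": "yes", "sole": "rubber", "closure": "laces"},
    {"name": "p2", "waterproof": "no", "sole": "gel", "closure": "laces"},
    {"name": "p3", "waterproof": "no", "sole": "foam", "closure": "velcro"},
]
weights = entropy_weights([query] + candidates, properties)
picked = [p["name"] for p in greedy_diverse(query, candidates, weights, n_select=2)]
```

In a full pipeline, the greedy step would be run within each retained price bin; the sketch keeps the two pieces separate for readability.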

3. Experiments

After having discussed the pipeline design, we report our experiments for the substitute model and the user study performed on property selection.

3.1. Substitute model

We evaluate the effectiveness of training a neural model for substitute classification in an unsupervised manner, by leveraging a manually prepared held-out set for benchmarks (Section 3.1.1). [Footnote 6: Since in a real deployment labels will not be present, a research setting is needed to first validate how well unsupervised training performs on golden data.] Since the objective of the substitute model is to refine the candidates from the initial fast retrieval step (Section 2.1), where candidates are a priori likely, but not guaranteed, to be substitutes, our test set also mimics this distribution. As a baseline, we thus adopt the re-scaled cosine similarity between the image vectors of two products as the confidence score for substitutability. This serves as a simple yet realistic baseline that allows us to quantitatively assess the precision boost afforded by the substitute model.

For our experiments, we consider all configurations of image vectors for cleaning (C) and synthetic augmentation (S) to shed light on their contribution to performance. [Footnote 7: 1 denotes usage/application of a method whereas 0 denotes non-usage.] We run experiments on 3 different seeds and with various combinations of dense product representations (Appendices B & C) as input. For each configuration of C and S, we report average performance across seeds and product representations used, as well as the confidence intervals in plots. We run an extensive set of experiments to acknowledge the varying quality of such representations across catalogs, and to demonstrate the robustness of certain configurations when scaling CEs in a multi-shop scenario.
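For readers unfamiliar with the P@R=X metric reported in Table 1, the sketch below shows one common way to compute it: the best precision achievable over all score thresholds whose recall is at least X. Whether our evaluation code uses exactly this convention is not stated here, so it should be read as an assumption; the scores and labels are toy values.

```python
def precision_at_recall(scores, labels, target_recall):
    """Best precision over all score thresholds whose recall is >= target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best, true_pos = 0.0, 0
    for i, (score, label) in enumerate(ranked):
        true_pos += label
        # A threshold can only be placed after the last item of a run of tied scores.
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie and true_pos / total_pos >= target_recall:
            best = max(best, true_pos / (i + 1))
    return best


# Toy classifier confidences and golden labels (1 = substitutes).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 1]
p_at_07 = precision_at_recall(toy_scores, toy_labels, 0.7)
p_at_09 = precision_at_recall(toy_scores, toy_labels, 0.9)
```

With these toy values, a recall of at least 0.7 is first reached after four ranked pairs (three of them positive), and a recall of at least 0.9 requires accepting all six pairs.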

3.1.1. Dataset

For training and validation, we extract unsupervised co-view and co-purchase data from shopping sessions of two partnering shops, Shop A and Shop B. They are mid-sized shops: Shop A is in the sport apparel industry whereas Shop B is in home improvement. We use a fixed share of all products for training and the remainder for validation. We consider this to be a strict testing regime, as none of the products used in validation and testing are seen in training. For testing, we first obtain a golden mapping of clusters of substitutable products by heuristic matching of categories provided in catalog data and extensive manual filtering. The golden mapping is then used to generate positive and negative test examples. Full descriptive statistics are reported in the Appendix.

We selected shops with catalogs that are of high quality and contain fine-grained category information in order to generate golden mappings which best capture product substitutability. We emphasize that such catalog quality is not guaranteed across shops, which motivates our use of unsupervised data.

3.1.2. Results

We summarize experimental results in Table 1, and plot the Precision-Recall (PR) curves in Fig. 3. For Shop A, when image vectors are not used for cleaning (C=0), the model performs only as well as the baseline. When C=1, we see a significant increase in precision across the higher ranges of recall; on the other hand, synthetic augmentation (S=1) has minimal effect on model performance. Similar trends are observed for Shop B, although the benefit of C=1 is less pronounced. These results demonstrate the effectiveness of using image vectors to clean the otherwise noisy unsupervised co-occurrence data, and validate the effectiveness of the preparation detailed in Section 2.2. However, as evident in the baseline performance of Shop B, caution must still be taken when relying on image vectors: depending on the vertical, visual similarity may not be as strong a proxy for substitutability, and/or the pre-trained models used to generate the image embeddings may not be fine-tuned for products in certain verticals. This opens up interesting avenues for future work, such as self-supervised learning (Zbontar et al., 2021) for niche verticals.

Figure 3. PR Curve for various configurations of C and S, and the Baseline for Shop A and B. Results are the average across seeds and input features, with a confidence interval of +/- 1 SD.

3.2. Property selection

We run an Amazon Mechanical Turk (MTurk) study to get preliminary insights on how well our algorithm matches how shoppers rank properties. Our investigation involves 4 product types that range from well-known products (e.g. running shoes) to increasingly technical ones (e.g. skis), each with 5-8 properties. While agreement with human judgement varies depending on the category, the algorithm seems to pick up at least some qualitatively relevant latent dimension.

3.2.1. Data Collection

We collect pairwise human judgements on property preferences. For each comparison, we present workers with the image of a product and two of its properties and ask them to judge which is more important to them when making a purchase decision. Each Human Intelligence Task (HIT) has 3 comparisons (Fig. 4) in addition to a control task to filter out low quality responses. We collected an average of 30 responses per property pair for this experiment. To collate pairwise human responses, we estimate the underlying ranking using the Bradley-Terry Model (Bradley and Terry, 1952; Maystre, 2015). We compare the estimated ranked list against our algorithm using Rank-biased Overlap (RBO) (Webber et al., 2010) as the measure of agreement.
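The sketch below illustrates the two ingredients of this comparison on toy data: a Bradley-Terry fit via the classic minorise-maximise updates, and the extrapolated RBO for two equal-length rankings. It is a standalone stand-in rather than the code used in the study, which relied on the library cited above; the win counts, the property names and the persistence parameter p=0.9 are hypothetical, and the fit assumes every item wins at least one comparison.

```python
def bradley_terry(items, wins, iterations=200):
    """Estimate Bradley-Terry strengths; wins[(a, b)] counts a preferred over b."""
    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {item: value / norm for item, value in updated.items()}
    return strength


def rbo(ranking_a, ranking_b, p=0.9):
    """Extrapolated rank-biased overlap for two non-empty rankings of equal length."""
    k = len(ranking_a)
    seen_a, seen_b = set(), set()
    weighted_sum, agreement = 0.0, 0.0
    for depth in range(1, k + 1):
        seen_a.add(ranking_a[depth - 1])
        seen_b.add(ranking_b[depth - 1])
        agreement = len(seen_a & seen_b) / depth
        weighted_sum += agreement * p ** depth
    return agreement * p ** k + (1 - p) / p * weighted_sum


# Toy pairwise judgements (hypothetical): 30 responses per property pair.
toy_items = ["weight", "grip", "color"]
toy_wins = {
    ("weight", "grip"): 20, ("grip", "weight"): 10,
    ("weight", "color"): 25, ("color", "weight"): 5,
    ("grip", "color"): 22, ("color", "grip"): 8,
}
strengths = bradley_terry(toy_items, toy_wins)
human_ranking = sorted(toy_items, key=strengths.get, reverse=True)
algorithm_ranking = ["grip", "weight", "color"]
agreement_score = rbo(human_ranking, algorithm_ranking)
```

Because RBO weights agreement at the top of the list more heavily, swapping the first two properties (as in the toy `algorithm_ranking`) is penalized more than swapping the last two would be.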

Figure 4. Example of a HIT task.

3.2.2. Results

The results are summarized in Table 2. Agreement between our algorithm and humans is higher for popular/common products and lower for highly technical ones, which may also reflect a lack of domain-specific knowledge among general MTurk workers. [Footnote 9: Anecdotally, we also solicited feedback from active skiers at Coveo, and found that their experience influenced the properties which they found important.] Interestingly enough, the RBO for Running Shoes is by far the highest. We suspect that this is because Running Shoes lie at the intersection of being both well-known, so that crowd-sourced responses are most reliable, and technical, so that there exists a stronger ranking/ordering of their properties.

Product        RBO
Shirt          0.633
Shorts         0.483
Running Shoe   0.783*
Ski            0.169

Table 2. RBO for human vs our ranking (best marked with *); for random permutation of length 5.

4. Conclusion

We shared insights from building a CE addressing mid-to-large shops in the market long tail, and as such particularly suited for multi-shop deployment. While preliminary, our multi-shop benchmarks confirm the viability of our pipeline, and we look forward to testing it online. Two important areas of improvement are personalization and human-in-the-loop inference. In the current system, all shoppers would receive the same set of candidates, but individual preferences and session intent (Tagliabue et al., 2020a) may be used to further shape the final table.

Finally, of the three ways in which we could use human judgements – qualitative validation, training data and active learning – we just focused on the first. Given the scalability of MTurk, however, we plan on extending human-in-the-loop computation in further iterations of the project.

5. Ethical Considerations

User data has been collected in the process of providing business services to the clients of Coveo: user data is collected and processed in an anonymized fashion, in full compliance with existing legislation (GDPR). In particular, the target dataset uses only anonymous uuids to label sessions and, as such, it does not contain any information that can be linked to individuals. As explained, our MTurk HITs include a task with a pre-defined answer to control for workers answering questions randomly; however, we still compensate workers for their time, even if their answers get discarded from the analysis.

We wish to thank Federico Bianchi, Mattia Pavoni and Andrea Polonioli for comments on a previous draft of this work, and general support with this research project.


References

  • D. Berg, C. Kiran, R. Cledat, S. Goyal, F. Hamad, and V. Tuulos (2019).
  • R. Bhagat, S. Muralidharan, A. Lobzhanidze, and S. Vishwanath (2018) Buy it again: modeling repeat purchase recommendations. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 62–70. ISBN 9781450355520.
  • F. Bianchi, J. Tagliabue, B. Yu, L. Bigon, and C. Greco (2020) Fantastic embeddings and how to align them: zero-shot inference in a multi-shop scenario. ArXiv abs/2007.14906.
  • F. Bianchi, J. Tagliabue, and B. Yu (2021) Query2Prod2Vec: Grounded Word Embeddings for eCommerce. In NAACL-HLT.
  • L. Bing, T. Wong, and W. Lam (2016) Unsupervised extraction of popular product attributes from e-commerce web sites by considering customer reviews. ACM Trans. Internet Technol. 16 (2). ISSN 1533-5399.
  • R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. ISSN 00063444.
  • J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah (1993) Signature verification using a "siamese" time delay neural network. IJPRAI 7 (4), pp. 669–688.
  • H. Chen and D. R. Karger (2006) Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, New York, NY, USA, pp. 429–436. ISBN 1595933697.
  • T. Chen, H. Yin, G. Ye, Z. Huang, Y. Wang, and M. Wang (2020) Try this instead: personalized and interpretable substitute recommendation. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA.
  • B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, M. Hentschel, J. Huang, A. W. Lee, A. Motivala, A. Q. Munir, S. Pelley, P. Povinec, G. Rahn, S. Triantafyllis, and P. Unterbrunner (2016) The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, New York, NY, USA, pp. 215–226. ISBN 9781450335317.
  • X. Dong, X. He, A. Kan, X. Li, Y. Liang, J. Ma, Y. Xu, C. Zhang, T. Zhao, G. B. Saldana, S. Deshpande, A. M. Manduca, J. Ren, S. P. Singh, F. Xiao, H. Chang, G. Karamanolakis, Y. Mao, Y. Wang, C. Faloutsos, A. McCallum, and J. Han (2020) AutoKnow: self-driving knowledge collection for products of thousands of types. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  • M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, and D. Sharp (2015) E-commerce in your inbox: product recommendations at scale. In Proceedings of KDD ’15.
  • M. Guo, N. Yan, X. Cui, S. H. Wu, U. Ahsan, R. West, and K. Al Jadda (2020) Deep learning-based online alternative product recommendations at scale. In Proceedings of The 3rd Workshop on e-Commerce and NLP, Seattle, WA, USA, pp. 19–23.
  • J. Hao, T. Zhao, J. Li, X. L. Dong, C. Faloutsos, Y. Sun, and W. Wang (2020) P-companion: a principled framework for diversified complementary product recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ISBN 9781450368599.
  • K. Hosanagar, D. Fleder, D. Lee, and A. Buja (2014) Will the global village fracture into tribes? recommender systems and their effects on consumer fragmentation. Management Science 60, pp. 805–823.
  • J. Katukuri, T. Könik, R. Mukherjee, and S. Kolay (2014) Recommending similar items in large-scale on line marketplaces.
  • L. Maystre (2015) Choix.
  • J. McAuley, R. Pandey, and J. Leskovec (2015) Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 785–794. ISBN 9781450336642.
  • T. Mikolov, K. Chen, G. S. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
  • F. Moraes, J. Yang, R. Zhang, and V. Murdock (2020) The role of attributes in product quality comparisons. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, CHIIR ’20, New York, NY, USA, pp. 253–262. ISBN 9781450368926.
  • A. Pichestapong (2019).
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv:1908.10084.
  • B. Scheibehenne, R. Greifeneder, and P. M. Todd (2010) Can There Ever Be Too Many Options? A Meta-Analytic Review of Choice Overload. Journal of Consumer Research 37 (3), pp. 409–425. ISSN 0093-5301. https://academic.oup.com/jcr/article-pdf/37/3/409/5173186/37-3-409.pdf
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • Statista Research Department (2020).
  • J. Tagliabue, C. Greco, J. Roy, F. Bianchi, G. Cassani, B. Yu, and P. J. Chia (2021) SIGIR 2021 e-commerce workshop data challenge. In SIGIR eCom 2021.
  • J. Tagliabue, B. Yu, and M. Beaulieu (2020a) How to grow a (product) tree. personalized category suggestions for ecommerce type-ahead. In Companion Proceedings of ACL, New York, NY, USA.
  • J. Tagliabue, B. Yu, and F. Bianchi (2020b) The embeddings that came in from the cold: improving vectors for new and rare products with content-based inference. In Fourteenth ACM Conference on Recommender Systems, RecSys ’20, New York, NY, USA, pp. 577–578. ISBN 9781450375832.
  • Techcrunch (Website).
  • Techcrunch (2019a).
  • Techcrunch (2019b).
  • Techcrunch (2021).
  • M. Tsagkias, T. H. King, S. Kallumadi, V. Murdock, and M. de Rijke (2020) Challenges and research opportunities in ecommerce search and recommendations. In SIGIR Forum, Vol. 54.
  • W. Webber, A. Moffat, and J. Zobel (2010) A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28 (4). ISSN 1046-8188.
  • Q. Wu, Y. Liu, C. Miao, Y. Zhao, L. Guan, and H. Tang (2019) Recent advances in diversified recommendation. arXiv:1905.06589.
  • J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. CoRR abs/2103.03230.
  • S. Zhang, H. Yin, Q. Wang, T. Chen, H. Chen, and Q. V. H. Nguyen (2019) Inferring substitutable products with deep network embedding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4306–4312.
  • X. Zhao, R. Louca, D. Hu, and L. Hong (2018) Learning item-interaction embeddings for user recommendations. ArXiv abs/1812.04407.
  • Z. Zuo, L. Wang, M. Momma, W. Wang, Y. Ni, J. Lin, and Y. Sun (2020) A flexible large-scale similar product identification system in e-commerce. KDD 1st International Workshop on Industrial Recommendation.

Appendix A Implementation Details

We implement our pipeline with Metaflow (Berg et al., 2019), which lets us define it programmatically as a DAG. The pipeline has three core phases (spread across several steps):

  • we dedicate initial steps in the DAG to pull data (such as user sessions, pre-cached embeddings) from various sources like Snowflake and S3, and perform various transformations on the data. Note that many of these steps run in parallel;

  • we launch in parallel our model training (which may in itself contain several steps as outlined in Section 2) with various configurations (e.g. input features). In addition, we are able to dedicate steps which have high resource demands (e.g. GPU) to AWS Batch;

  • we collate the results (e.g. metrics, trained model, model predictions) from each parallel run, and store them as Data Artifacts on S3 for further analysis.
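The fan-out/fan-in shape of these three phases can be sketched with the Python standard library alone. This is not the Metaflow code used in the paper: `load_data`, `train`, `run_pipeline` and the short configuration keys are hypothetical names, the data is a toy stand-in for what would come from Snowflake and S3, and the "training" step only echoes its inputs.

```python
from concurrent.futures import ThreadPoolExecutor

# Feature configurations listed in Appendix C.2; the short keys are hypothetical.
CONFIGS = {
    "dnp": ["description", "name", "prod2vec"],
    "cdn": ["categories", "description", "name"],
    "cdnp": ["categories", "description", "name", "prod2vec"],
}


def load_data():
    # Phase 1: toy stand-in for pulling sessions and cached embeddings.
    return {"sessions": [["p1", "p2"], ["p2", "p3"], ["p1", "p3"]]}


def train(name, features, data):
    # Phase 2: placeholder for model training; it only echoes its inputs.
    return {"name": name, "features": features,
            "n_sessions": len(data["sessions"])}


def run_pipeline(configs):
    data = load_data()
    # Fan out: one training run per configuration, executed in parallel.
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        futures = [pool.submit(train, name, feats, data)
                   for name, feats in configs.items()]
        results = [future.result() for future in futures]
    # Phase 3 (fan in): collate the per-run results for later analysis.
    return {result["name"]: result for result in results}
```

In a workflow framework the same shape would typically be one branch per configuration followed by a join step, with artifacts persisted by the framework rather than returned in memory.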

The adoption of Metaflow on top of our cloud provider (AWS) speeds up development time (since it is the same code running locally and remotely), reduces training time (thanks to parallelism and GPU provisioning) and increases confidence in our experiments (thanks to versioning and full pipeline replayability). The setup we adopt fully decouples writing code from the underlying infrastructure, including data retrieval thanks to the “PaaS-like feeling” of Snowflake (Dageville et al., 2016). Fig. 5 shows the comparison table for a pair of mountain shoes (yellow), as produced by our Metaflow pipeline.

Figure 5. Example of a comparison table for mountain shoes, as prepared by our pipeline.

Appendix B Unsupervised Data and Product Representations

In this section, we provide details and hyper-parameters used in the generation of training data and of dense unsupervised representations for products.

B.1. Data Preparation

  • Co-view and Co-purchase Data: For Shop A, we obtain shopping sessions over a period of 3 months and for Shop B we obtain shopping sessions over a period of 1 month. For co-view pairs, we enforce a minimum count, , and for co-purchase pairs we enforce a minimum count .

  • Cleaning with Image Vectors: For both Shop A and Shop B, we enforce that positive pairs have a cosine similarity and that negative pairs have a cosine similarity .

  • Synthetic Augmentation: Maximum cluster size, , is set to 40.
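As a rough illustration of the first two steps above, the sketch below counts product pairs that co-occur in sessions and keeps those that pass both a minimum count and an image-similarity check. Several things here are assumptions rather than details from the paper: a co-view pair is taken to be any two distinct products in the same session, positive pairs are assumed to require a similarity at or above the threshold, and `min_count` and `min_sim` are hypothetical parameters whose values must be supplied by the caller.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(sessions):
    """Count how many sessions contain each unordered product pair."""
    counts = Counter()
    for session in sessions:
        # Deduplicate within a session so repeated views count once.
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def positive_pairs(sessions, image_vecs, min_count, min_sim):
    """Keep frequent pairs whose image vectors are similar enough."""
    return [
        pair
        for pair, n in sorted(count_pairs(sessions).items())
        if n >= min_count
        and cosine(image_vecs[pair[0]], image_vecs[pair[1]]) >= min_sim
    ]
```

Negative pairs could be filtered the same way with the comparison reversed, again under the assumption that dissimilar images indicate non-substitutes.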

Refer to Tables 3 & 4 for descriptive statistics of session and training data.

| Shop | # Products | # Browse Sessions | # Purchase Sessions |
|---|---|---|---|
| Shop A | 20k | 1.5M | 27k |
| Shop B | 50k | 3M | 12k |

Table 3. Descriptive statistics of session data.

| Shop | Config | Train (Pos/Neg) | Validation (Pos/Neg) |
|---|---|---|---|
| Shop A | C=0; S=0 | 19k/18k | 1.5k/1k |
| | C=0; S=1 | 27k/75k | 2.8k/3.7k |
| | C=1; S=0 | 8k/6k | 0.5k/0.5k |
| | C=1; S=1 | 17k/33k | 1.5k/1.5k |
| Shop B | C=0; S=0 | 50k/20k | 3k/1k |
| | C=0; S=1 | 60k/40k | 6k/5k |
| | C=1; S=0 | 40k/10k | 2.5k/1k |
| | C=1; S=1 | 70k/120k | 7k/7k |

Table 4. Descriptive statistics of training data.

B.2. Unsupervised Product Representations

  • Prod2Vec Embeddings: We train behavioural product embeddings using CBOW with negative sampling (Mikolov et al., 2013), swapping the concept of words in a sentence with products in a browsing session. We follow the best practices of Bianchi et al. (2020), except for a smaller window size, which places more emphasis on co-viewed, and hence more likely substitutable, products. The resulting hyper-parameters are: window = 5, iterations = 30, ns_exponent = 0.75, dimensions = 48.

  • Textual Embeddings: We train Textual Embeddings using CBOW with negative sampling, with product descriptions as our text corpus. We adopt the hyper-parameters: window = 10, iterations = 30, ns_exponent = 0.75, dimensions = 48. We then take the name, description and categories of each product and obtain a dense representation for each metadata field by average-pooling its word representations.

  • Image Embeddings: We prepare Image Embeddings by utilising a pre-trained VGG16 (Simonyan and Zisserman, 2015) network, and apply 7x7 2D-MaxPooling to the final MaxPool layer of VGG16 to obtain a 512-dim representation.
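The two pooling operations mentioned above can be sketched in plain Python. Both helpers are illustrative only: `average_pool` and `global_max_pool` are hypothetical names, skipping out-of-vocabulary tokens and returning a zero vector for empty input are assumptions, and the feature map is represented as nested lists of shape height x width x channels rather than as a tensor from a real VGG16.

```python
def average_pool(tokens, word_vectors, dim):
    """Average the vectors of the known tokens in one metadata field."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # Assumption: fields with no known tokens map to a zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def global_max_pool(feature_map):
    """Max over all spatial positions, leaving one value per channel."""
    n_channels = len(feature_map[0][0])
    return [
        max(pixel[c] for row in feature_map for pixel in row)
        for c in range(n_channels)
    ]
```

Applied to a 7x7x512 feature map, `global_max_pool` returns a 512-dimensional vector, which matches the effect of a 7x7 max-pooling window covering the whole map.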

Appendix C Model Architecture and Training

C.1. Model Architecture

In this section, we provide architectural details on the binary comparison model. At a high level, the model takes two products as inputs and outputs a confidence score indicating whether the two products are substitutes.

First, each product is represented by $n$ embeddings $e_1, \dots, e_n$, each of dimension $d$ and each representing a different type of information or modality. Details on how these embeddings are obtained can be found in Appendix B.

Secondly, the embeddings of a product are fused into a single dense representation by a neural network $f$, which is re-used across all products. We define $f$ as:

$$f(e_1, \dots, e_n) = \mathrm{Norm}\big(D_{fuse}(\mathrm{Concat}(D_{proj}(e_1), \dots, D_{proj}(e_n)))\big)$$

where $D_{proj}$ is a dense re-projection layer (48-dim, ReLU activation), $\mathrm{Concat}$ is the concatenation operation, $D_{fuse}$ is a dense fusion layer (128-dim, ReLU activation) and $\mathrm{Norm}$ refers to the L2-normalization operator.

Lastly, the fused representations of two products, $x_a$ and $x_b$, are passed into a neural network $g$, which produces the confidence score. We define $g$ as:

$$g(x_a, x_b) = \sigma\big(D_{cls}(|x_a - x_b|)\big)$$

That is, we take the element-wise absolute difference (Reimers and Gurevych, 2019) between the two inputs and pass it into a dense classification layer (1-dim) followed by the sigmoid function to produce the binary classification score.
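The forward pass described in this section can be sketched in plain Python with untrained weights. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the re-projection layer is assumed to be shared across modalities, the weight initialisation is arbitrary, and the input dimension is whatever the caller passes in.

```python
import math


def init_layer(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, bias, relu=True):
    """Dense layer: weights @ x + bias, optionally followed by ReLU."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x


def fuse(embeddings, proj, fusion):
    """Re-project each embedding, concatenate, fuse, then L2-normalize."""
    projected = []
    for e in embeddings:
        projected.extend(dense(e, *proj))       # 48-dim, ReLU
    return l2_normalize(dense(projected, *fusion))  # 128-dim, ReLU


def score(fused_a, fused_b, clf):
    """Sigmoid over a 1-dim dense layer on the absolute difference."""
    diff = [abs(a - b) for a, b in zip(fused_a, fused_b)]
    (logit,) = dense(diff, *clf, relu=False)
    return 1.0 / (1.0 + math.exp(-logit))
```

Because the absolute difference is symmetric, the score does not depend on the order in which the two products are passed, which is a convenient property for a substitutability check.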

C.2. Model Training

For all experiments, we use the Adam optimizer with learning rate of , early stopping with a patience of 20 epochs and a batch size of 32. For all experiments, we tested the following configurations of product representations:

  • description, name, prod2vec;

  • categories, description, name;

  • categories, description, name, prod2vec.

The feature set that yielded the best results is [categories, description, name, prod2vec].
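The early-stopping rule used during training can be sketched as follows. The function name is hypothetical, and monitoring validation loss is an assumption, since the monitored quantity is not stated here.

```python
def early_stopping_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without a new best
    loss; if that never happens, it stops at the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch
```

In practice the weights from `best_epoch` would usually be restored before evaluating the model.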