1. Introduction
Online shopping has seen tremendous growth in recent years (Statista Research Department, 2020), and shoppers now face a vast number of possibilities, which paradoxically may decrease their satisfaction with purchase decisions (Scheibehenne et al., 2010). Recommender systems (RSs) play an indispensable role in fighting information overload, and large players (Pichestapong, 2019; Hosanagar et al., 2014) have been mostly responsible for modelling and product innovation (Tsagkias et al., 2020; Bhagat et al., 2018). Comparison engines (CEs) are a special case of RS, in which a product detail page (PDP) displays alternative choices in a table containing informative product specifications (Fig. 1). Unlike the prevalent "More like this" RSs, well-designed comparison tables not only promote relevant products, but also intentionally select products which help customers better understand the range of available features. However – as demonstrated by the sub-optimal alternatives in Fig. 1 – building a CE is far from trivial, even for players with full ownership of the data chain. In this paper, we share preliminary lessons learned while building CEs in a B2B scenario, that is, designing a scalable pipeline that is deployed across multiple shops. As convincingly argued in (Tagliabue et al., 2020b; Bianchi et al., 2020; Tagliabue et al., 2021), multi-tenant deployments require models to generalize to dozens of different retailers: a successful CE is therefore not only hard to build, but valuable to a wide range of practitioners – on one side, practitioners outside of humongous websites, who want to enhance their shop in the face of rising pressure from major players; on the other, multi-tenant SaaS providers who need to offer AI-based services that scale to a large number of clients (as an indication of the relevant SaaS market size, we witnessed Coveo, Algolia, Lucidworks and Bloomreach each raising more than 100M USD from venture funds in the last two years for AI-powered services (Techcrunch; Techcrunch, 2019a, b, 2021)). We summarize our contributions as follows:
- we are the first, to the best of our knowledge, to detail a pipeline for building a comparison engine designed to be scalable in a multi-tenant scenario;
- we perform extensive experiments on various data cleaning and augmentation approaches. One of our major practical contributions – in line with what was independently reported by Hao et al. (2020) – is questioning the widely held belief that co-occurrence patterns are a sufficient proxy for substitutable products (McAuley et al., 2015);
- we discuss the importance of diversity in the comparison table and propose a decision process to determine relevant attributes.

Figure 1. Example of product comparisons on Amazon.com.
While we do acknowledge that full online testing is needed to answer some outstanding design questions, we supplement our pipeline tests with a user study, allowing for a preliminary comparison between our results and human judgments, as well as guiding future decisions in our roadmap. Practitioners looking to replicate our work are encouraged to check the Appendix for details on our tools, modelling choices and Mechanical Turk setup.
2. Comparison Engine Pipeline
In this section, we present the pipeline architecture of our comparison system. The pipeline is composed of three main stages: first, a fast candidate retrieval phase, to narrow down the search space; second, a candidate refinement phase, to ensure precision and produce the final shortlist of products; lastly, a final selection phase, to determine the information to be displayed in the comparison table. We first explain the logic of each stage, then detail the experiments performed to benchmark the pipeline (due to space constraints, we cite the most relevant literature inline at the most appropriate step).

Figure 2. Overview of our CE pipeline.
Table 1. Precision of the substitute classifier at fixed recall levels (P@R) for Shop A and Shop B; values in parentheses indicate variation across seeds and feature configurations (Section 3.1).

| Configuration | Shop A, P@R=0.7 | Shop A, P@R=0.8 | Shop A, P@R=0.9 | Shop B, P@R=0.7 | Shop B, P@R=0.8 | Shop B, P@R=0.9 |
|---|---|---|---|---|---|---|
| Baseline | 0.744 | 0.743 | 0.682 | 0.611 | 0.573 | 0.539 |
| C=0; S=0 | 0.759 (0.0235) | 0.710 (0.0212) | 0.645 (0.0195) | 0.734 (0.0324) | 0.690 (0.0333) | 0.632 (0.0290) |
| C=0; S=1 | 0.766 (0.0257) | 0.723 (0.0237) | 0.659 (0.0189) | 0.755 (0.0379) | 0.706 (0.0423) | 0.643 (0.0382) |
| C=1; S=0 | 0.833 (0.0162) | 0.802 (0.0226) | 0.740 (0.0301) | 0.777 (0.0162) | 0.732 (0.0131) | 0.658 (0.0123) |
| C=1; S=1 | 0.842 (0.0150) | 0.812 (0.0208) | 0.753 (0.0280) | 0.789 (0.0189) | 0.743 (0.0196) | 0.663 (0.0176) |
2.1. Fast Retrieval
Candidate retrieval aims to quickly generate potential substitutes for a query product, with a focus on recall (the retrieval step in Fig. 2): we deliberately retrieve a diverse set, knowing that false positives will be screened out in later phases. A common practice for fast retrieval in a dense space is k-NN search over an embedding space (Zhao et al., 2018; Covington et al., 2016): since recent literature (Bianchi et al., 2020, 2021; Tagliabue et al., 2020a) provides extensive evidence of the representational qualities of behavioral embeddings, we train a prod2vec space (Grbovic et al., 2015) by adapting word2vec (Mikolov et al., 2013) to eCommerce – i.e. a prod2vec space is just a word2vec space where words in a sentence are replaced with products in a shopping session (Appendix B). After obtaining the prod2vec space, we apply k-NN (based on cosine distance) to retrieve the products closest to a query product as its substitute candidates. Analogous to words in word2vec, products which are distributionally similar (based on historical sessions) are close in the prod2vec space; the candidates retrieved in this step are therefore already biased towards substitutable products.
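A minimal sketch of this stage, assuming gensim and scikit-learn are available (session data and identifiers are illustrative; hyper-parameters follow Appendix B):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.neighbors import NearestNeighbors

# Each shopping session is a list of product IDs, playing the role of a sentence.
sessions = [
    ["sku_101", "sku_205", "sku_101", "sku_333"],
    ["sku_205", "sku_333", "sku_412"],
    # ... millions of sessions in a real deployment
]

# prod2vec = word2vec over sessions (CBOW with negative sampling, Appendix B).
model = Word2Vec(
    sentences=sessions,
    vector_size=48,
    window=5,
    sg=0,                # CBOW
    negative=5,
    ns_exponent=0.75,
    epochs=30,
    min_count=1,
)

# Build a cosine-distance k-NN index over the product embeddings.
skus = model.wv.index_to_key
vectors = np.stack([model.wv[s] for s in skus])
index = NearestNeighbors(metric="cosine").fit(vectors)

def retrieve_candidates(query_sku: str, k: int = 50) -> list[str]:
    """Return the k products closest to the query in the prod2vec space."""
    _, idx = index.kneighbors(model.wv[query_sku].reshape(1, -1), n_neighbors=k + 1)
    return [skus[i] for i in idx[0] if skus[i] != query_sku][:k]
```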
2.2. Candidate Refinement
Candidates produced by the first stage are passed to the second stage for fine-grained processing. The goal of this stage is to boost precision by filtering out candidates that do not have a matching product type, and by re-ranking the remaining ones so that the most comparable ones are at the top of the list (the refinement step in Fig. 2). We employ a binary classification model (i.e. given a pair of products, are they substitutes?) built on top of a Siamese network (Bromley et al., 1993) and fed with unsupervised behavioral data.
2.2.1. Unsupervised Behavioural Data
Generating training data for substitute product detection is a well-explored topic in the literature (McAuley et al., 2015; Chen et al., 2020; Zhang et al., 2019). However, our inference task is somewhat harder than that of a general substitute classifier, where products are sampled from the entire catalog: our model needs to make subtler distinctions among a selected group of candidates that have already been shortlisted by a coarse similarity measure (Section 2.1). To overcome the problem of naive sampling and reduce the noise in behavioral data, we built a three-step process generating the final training set, with free parameters governing co-occurrence thresholds, image-similarity thresholds and cluster size (Appendix B):
- We use co-view and co-purchase patterns to obtain positive and negative training examples. Positive examples are obtained from pairs of products which are viewed consecutively (co-view: if I want a TV, I will check several TVs in a row), and negative examples are obtained from products which are purchased consecutively (co-purchase: if I just bought a TV, I am unlikely to buy a second one). To reduce noise, we set a minimum threshold on the number of co-view occurrences and on the number of co-purchase occurrences for a pair to be considered a positive or negative example, respectively.
- We intuit that substitutable products are a priori visually similar, and utilize this to further reduce noise in the data. We apply a threshold on the cosine similarity of the image embeddings of each pair to refine the set of training examples: given image vectors obtained through a pre-trained VGG16 (Simonyan and Zisserman, 2015), we enforce that positive/negative pairs must have a minimum/maximum cosine similarity (while drafting this paper, we realized a similar approach had been recommended independently by Zuo et al. (2020)). We refer to this refinement/cleaning process as C.
- We remove pairs which are given both positive and negative labels, then build a graph from the remaining positive pairs and extract its disconnected subgraphs as clusters of substitutable products. We eliminate clusters above a maximum size (Appendix B) when generating synthetic pairs, to reduce the risk of sampling from clusters formed by noisy pairs and, at the same time, improve the balance of product types in the training data. Taking an existing positive/negative pair from our behavioral logs, we generate synthetic pairs by swapping out one of the products in the original pair with any product found in its substitute cluster – unlike Guo et al. (2020), which samples negative examples from a random disconnected subgraph. We refer to this augmentation process as S (see the sketch after this list).
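A minimal sketch of the cleaning (C) and augmentation (S) steps, assuming co-occurrence counts and image vectors have already been computed (thresholds, helper names and data structures are illustrative; see Appendix B for the actual values):

```python
import networkx as nx
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_training_pairs(co_view, co_purchase, image_vecs,
                         min_view=3, min_buy=2,
                         pos_sim=0.5, neg_sim=0.8, max_cluster=40):
    """co_view / co_purchase map a (sku, sku) pair to its occurrence count."""
    # Step 1: threshold co-occurrence counts.
    pos = {p for p, n in co_view.items() if n >= min_view}
    neg = {p for p, n in co_purchase.items() if n >= min_buy}

    # Step 2 (C): positives must look similar enough, negatives dissimilar enough.
    pos = {(a, b) for a, b in pos if cosine(image_vecs[a], image_vecs[b]) >= pos_sim}
    neg = {(a, b) for a, b in neg if cosine(image_vecs[a], image_vecs[b]) <= neg_sim}

    # Drop contradictory pairs labelled both positive and negative.
    pos, neg = pos - neg, neg - pos

    # Step 3 (S): connected components of the positive graph = substitute clusters.
    graph = nx.Graph(list(pos))
    cluster_of = {}
    for comp in nx.connected_components(graph):
        if len(comp) <= max_cluster:          # skip suspiciously large clusters
            for sku in comp:
                cluster_of[sku] = comp

    # Augment: swap one member of a pair with any product in its cluster.
    synthetic_pos = {(c, b) for a, b in pos for c in cluster_of.get(a, ()) if c != b}
    synthetic_neg = {(c, b) for a, b in neg for c in cluster_of.get(a, ()) if c != b}
    return pos | synthetic_pos, neg | synthetic_neg
```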
We emphasize that only behavioral logs and product images are necessary so far: our approach does not assume any particular meta-data or a pre-made taxonomy, nor does our classifier require costly labelling, making the pipeline suitable for multi-shop scaling.
2.2.2. Binary Classifier: A Siamese Network
We utilize a binary classifier to predict whether two input products are substitutes. Products are represented by various dense representations of product features, such as behavioral embeddings and word2vec embeddings of product title, description and category strings (see Appendix B). For full reproducibility, we provide architectural and hyper-parameter details in Appendices B and C.
2.3. Product and Property Selection
2.3.1. Relevant Property Selection
At this stage, we make the only significant meta-data assumption of the entire pipeline, namely that the target catalog specifies product properties in some structured way – based on our experience with dozens of deployments, this is not a universal feature, but it is common for verticals with technical products (DIY, electronics, etc.), for which CEs are most useful. Given a mapping from products to their properties (say, from TV to the set <resolution, screen size, ...>), this stage determines which properties are relevant to shoppers when they are making a purchase decision (the property-selection step in Fig. 2). By passing the candidates from Section 2.1 to the classifier of Section 2.2, we generate a final list of substitutable products for an initial query product. For this list, we rank properties based on the weighted sum of three components highlighted as important by previous literature (Katukuri et al., 2014; Dong et al., 2020) and domain knowledge (weights have been determined empirically at first, but see Section 3.2 and our conclusion for the potential use of human-in-the-loop inference; a code sketch follows the list):
- Query frequency: properties which are important to shoppers tend to appear frequently in shopper-generated content (Bing et al., 2016; Moraes et al., 2020) such as queries (Katukuri et al., 2014). We calculate the query frequency of each property (and its possible values) by mining search logs, and normalize the counts to the range [0, 1];
- PDP frequency: merchandisers are more likely to explicitly mention important attributes in the PDP. We calculate a normalized count for each attribute by mining product descriptions in the catalog;
- Property entropy: for a meaningful comparison, property values need enough variation, so that comparison tables can help shoppers easily navigate the possible dimensions of a catalog. To quantify variety, we measure the entropy of the distribution of each property's values across the list of substitute products.
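A minimal illustration of the ranking computation (helper names and the example weights are illustrative; the actual weights were tuned empirically):

```python
import math
from collections import Counter

def property_entropy(values: list[str]) -> float:
    """Shannon entropy of a property's value distribution over the substitutes."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_properties(substitutes, query_freq, pdp_freq, weights=(0.4, 0.2, 0.4)):
    """Rank properties by the weighted sum of the three signals.

    substitutes: list of {property -> value} dicts, one per product.
    query_freq / pdp_freq: property -> normalized frequency in [0, 1].
    """
    properties = {p for product in substitutes for p in product}
    scores = {}
    for prop in properties:
        values = [product[prop] for product in substitutes if prop in product]
        scores[prop] = (
            weights[0] * query_freq.get(prop, 0.0)
            + weights[1] * pdp_freq.get(prop, 0.0)
            + weights[2] * property_entropy(values)
        )
    return sorted(scores, key=scores.get, reverse=True)
```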
2.3.2. Final Display Selection
Recent literature (Wu et al., 2019) has highlighted the importance of diversity in RSs. Thus, after determining the important product properties, we select the final substitutes per query item (the display-selection step in Fig. 2) by making two additional calculations: price diversification and representative selection. Given the list of substitutes, we group products into 7 bins based on their log price (the mean log price is used to set the central bin and the standard deviation to determine the bin width).
For representative selection, we employ a greedy approach during sampling which maximises the information diversity among the final products to be displayed. We represent the property values of each product via one-hot encoding, so that a product is represented by the concatenation of its one-hot encoded property vectors. We compute the difference in information content between two products with the Hamming distance, where each property is weighted by the negative exponential of the entropy of the distribution of the property's values. The intuition is that we want to vary properties which are far from quasi-uniform distributions, displaying products with meaningful variation and thereby giving shoppers a more complete picture of what is available.
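A minimal sketch of the greedy selection (helper names, the seeding choice and the exact max-min criterion are illustrative simplifications):

```python
import numpy as np

def greedy_diverse_selection(products, onehots, entropy_per_property,
                             property_slices, k=3):
    """Greedily pick k products maximising weighted-Hamming diversity.

    onehots: (n_products, dim) binary matrix of concatenated one-hot properties.
    property_slices: property -> slice of its columns in the one-hot matrix.
    entropy_per_property: property -> entropy of its value distribution.
    """
    # Per-column weights: exp(-entropy) of the owning property, so that
    # low-entropy (far-from-uniform) properties drive the selection.
    w = np.zeros(onehots.shape[1])
    for prop, sl in property_slices.items():
        w[sl] = np.exp(-entropy_per_property[prop])

    def weighted_hamming(i, j):
        return float(np.sum(w * (onehots[i] != onehots[j])))

    selected = [0]  # seed with the top-ranked substitute
    while len(selected) < min(k, len(products)):
        # Pick the product farthest (in weighted Hamming) from the selected set.
        best = max(
            (i for i in range(len(products)) if i not in selected),
            key=lambda i: min(weighted_hamming(i, j) for j in selected),
        )
        selected.append(best)
    return [products[i] for i in selected]
```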
3. Experiments
Having discussed the pipeline design, we now report our experiments on the substitute model and the user study performed on property selection.
3.1. Substitute model
We evaluate the effectiveness of training a neural model for substitute classification in an unsupervised manner by leveraging a manually prepared held-out set for benchmarks (Section 3.1.1); since in a real deployment labels will not be present, a research setting is needed to first validate how well unsupervised training performs on golden data. Since the objective of the substitute model is to refine the candidates from the initial fast retrieval step (Section 2.1), where candidates are a priori likely, but not guaranteed, to be substitutes, our test set mimics this distribution. As a baseline, we adopt the cosine similarity (re-scaled to [0, 1]) between the image vectors of two products as the confidence score for substitutability. This serves as a simple yet realistic baseline that allows us to quantitatively assess the precision boost afforded by the substitute model.
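A one-line formulation of this baseline (the linear rescaling of cosine similarity from [-1, 1] to [0, 1] is an assumed detail):

```python
import numpy as np

def baseline_score(u: np.ndarray, v: np.ndarray) -> float:
    """Image-based substitutability: cosine similarity rescaled to [0, 1]."""
    cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return 0.5 * (1.0 + cos)
```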
For our experiments, we consider all configurations of image-vector cleaning (C) and synthetic augmentation (S) – where 1 denotes usage of a method and 0 non-usage – to shed light on their contribution to performance. We run experiments with 3 different seeds and with various combinations of dense product representations (Appendix B & C) as input. For each configuration of C and S, we report the average performance across seeds and product representations, as well as confidence intervals in the plots. We run such an extensive set of experiments to acknowledge the varying quality of these representations across catalogs, and to demonstrate the robustness of certain configurations when scaling CEs in a multi-shop scenario.
3.1.1. Dataset
For training and validation, we extract unsupervised co-view and co-purchase data from the shopping sessions of two partnering shops, Shop A and Shop B. Both are mid-sized shops: Shop A is in the sport apparel industry, whereas Shop B is in home improvement. We use a fixed share of all products for training and the remaining ones for validation (full descriptive statistics are reported in Appendix B). We consider this a strict testing regime, as none of the products used in validation and testing are seen in training. For testing, we first obtain a golden mapping of clusters of substitutable products by heuristic matching of the categories provided in catalog data, followed by extensive manual filtering. The golden mapping is then used to generate positive and negative test examples, as explained in Section 2.1. We selected shops whose catalogs are of high quality and contain fine-grained category information, in order to generate golden mappings which best capture product substitutability. We emphasize that such catalog quality is not guaranteed across shops, which motivates our use of unsupervised data.

3.1.2. Results
We summarize the experimental results in Table 1, and plot the Precision-Recall (PR) curves in Fig. 3. For Shop A, when image vectors are not used for cleaning (C=0), the model performs only as well as the baseline. When C=1, we see a significant increase in precision across the higher ranges of recall; on the other hand, synthetic augmentation (S=1) has minimal effect on model performance. Similar trends are observed for Shop B, albeit the benefit of C=1 is less pronounced. These results demonstrate the effectiveness of using image vectors to clean the otherwise noisy unsupervised co-occurrence data, and validate the data preparation detailed in Section 2.2. However, as is evident in the baseline performance for Shop B, caution must still be taken when relying on image vectors: depending on the vertical, visual similarity may not be as strong a proxy for substitutability, and/or the pre-trained models used to generate the image embeddings may not be fine-tuned for products in certain verticals. This opens up interesting avenues for future work, such as self-supervised learning (Zbontar et al., 2021) for niche verticals.

Figure 3. Precision-Recall curves for Shop A and Shop B.
3.2. Property selection
We run an Amazon Mechanical Turk (MTurk) study to get preliminary insights into how well our algorithm matches the way shoppers rank properties. Our investigation involves 4 product types, ranging from well-known products (e.g. running shoes) to increasingly technical ones (e.g. skis), each with 5-8 properties. While agreement with human judgement varies depending on the category, the algorithm seems to pick up at least some qualitatively relevant latent dimensions.
3.2.1. Data Collection
We collect pairwise human judgements on property preferences. For each comparison, we present workers with the image of a product and two of its properties, and ask them to judge which is more important to them when making a purchase decision. Each Human Intelligence Task (HIT) has 3 comparisons (Fig. 4), in addition to a control task to filter out low-quality responses. We collected an average of 30 responses per property pair for this experiment. To collate pairwise human responses, we estimate the underlying ranking using the Bradley-Terry model (Bradley and Terry, 1952; Maystre, 2015). We compare the estimated ranked list against our algorithm's using Rank-Biased Overlap (RBO) (Webber et al., 2010) as the measure of agreement.
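A sketch of this evaluation step, assuming pairwise judgements are stored as (winner, loser) index pairs (the choix call follows its documented pairwise API; the RBO implementation below is a simple truncated variant of the measure of Webber et al. (2010)):

```python
import choix  # pip install choix

def bradley_terry_ranking(n_properties: int, pairwise: list[tuple[int, int]]):
    """Estimate a property ranking from (winner, loser) pairs via Bradley-Terry.

    Uses choix's I-LSR estimator for pairwise comparison data.
    """
    strengths = choix.ilsr_pairwise(n_properties, pairwise, alpha=0.01)
    return sorted(range(n_properties), key=lambda i: strengths[i], reverse=True)

def rbo(list_a, list_b, p: float = 0.9) -> float:
    """Truncated Rank-Biased Overlap between two ranked lists."""
    depth = min(len(list_a), len(list_b))
    score, seen_a, seen_b = 0.0, set(), set()
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * score
```

Calling rbo(human_ranking, algorithm_ranking) on each product type would yield agreement scores in the spirit of Table 2.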
3.2.2. Results
The results are summarized in Table 2. Agreement between our algorithm and humans is higher for popular/common products and lower for highly technical ones, which may also reflect a lack of domain-specific knowledge among general MTurk workers (anecdotally, we also solicited feedback from active skiers at Coveo, and found that their experience influenced the properties they deemed important). Interestingly enough, the RBO for Running Shoes is by far the highest. We suspect that this is because running shoes lie at the intersection of being both well-known, whereby crowd-sourced responses are most reliable, and technical, such that there exists a stronger ordering of their properties.
Table 2. Agreement (RBO) between the algorithmic and the human-estimated property rankings.

| Product | RBO |
|---|---|
| Shirt | 0.633 |
| Shorts | 0.483 |
| Running Shoe | 0.783 |
| Ski | 0.169 |
4. Conclusion
We shared insights from building a CE addressing large-to-mid shops in the market long tail, and as such particularly suited for multi-shop deployment. While preliminary, our multi-shop benchmarks confirm the viability of our pipeline, and we look forward to testing it online. Two important areas of improvement are personalization and human-in-the-loop inference. In the current system, all shoppers receive the same set of candidates, but individual preferences and session intent (Tagliabue et al., 2020a) may be used to further shape the final table.
Finally, of the three ways in which we could use human judgements – qualitative validation, training data and active learning – we focused only on the first. Given the scalability of MTurk, however, we plan on extending human-in-the-loop computation in further iterations of the project.
5. Ethical Considerations
User data has been collected in the process of providing business services to the clients of Coveo: it is collected and processed in an anonymized fashion, in full compliance with existing legislation (GDPR). In particular, the target dataset uses only anonymous uuids to label sessions and, as such, contains no information that can be linked to individuals. As explained, our MTurk HITs include a task with a pre-defined answer to control for workers answering questions randomly; however, we still compensate workers for their time, even if their answers are discarded from the analysis.
Acknowledgements.
We wish to thank Federico Bianchi, Mattia Pavoni and Andrea Polonioli for comments on a previous draft of this work, and for general support with this research project.

References
- Berg et al. (2019). Metaflow. (Website.)
- Bhagat et al. (2018). Buy it again: modeling repeat purchase recommendations. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18), New York, NY, USA, pp. 62–70.
- Bianchi et al. (2020). Fantastic embeddings and how to align them: zero-shot inference in a multi-shop scenario. arXiv abs/2007.14906.
- Bianchi et al. (2021). Query2Prod2Vec: grounded word embeddings for eCommerce. In NAACL-HLT.
- Bing et al. (2016). Unsupervised extraction of popular product attributes from e-commerce web sites by considering customer reviews. ACM Trans. Internet Technol. 16(2).
- Bradley and Terry (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), pp. 324–345.
- Bromley et al. (1993). Signature verification using a "siamese" time delay neural network. IJPRAI 7(4), pp. 669–688.
- Chen and Karger (2006). Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), New York, NY, USA, pp. 429–436.
- Chen et al. (2020). Try this instead: personalized and interpretable substitute recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Covington et al. (2016). Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA.
- Dageville et al. (2016). The Snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16), New York, NY, USA, pp. 215–226.
- Dong et al. (2020). AutoKnow: self-driving knowledge collection for products of thousands of types. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
- Grbovic et al. (2015). E-commerce in your inbox: product recommendations at scale. In Proceedings of KDD '15.
- Guo et al. (2020). Deep learning-based online alternative product recommendations at scale. In Proceedings of the 3rd Workshop on e-Commerce and NLP, Seattle, WA, USA, pp. 19–23.
- Hao et al. (2020). P-Companion: a principled framework for diversified complementary product recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
- Hosanagar et al. (2014). Will the global village fracture into tribes? Recommender systems and their effects on consumer fragmentation. Management Science 60, pp. 805–823.
- Katukuri et al. (2014). Recommending similar items in large-scale online marketplaces.
- Maystre (2015). choix. (Software library.)
- McAuley et al. (2015). Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), New York, NY, USA, pp. 785–794.
- Mikolov et al. (2013). Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
- Moraes et al. (2020). The role of attributes in product quality comparisons. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (CHIIR '20), New York, NY, USA, pp. 253–262.
- Pichestapong (2019). (Website.)
- Reimers and Gurevych (2019). Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv 1908.10084.
- Scheibehenne et al. (2010). Can there ever be too many options? A meta-analytic review of choice overload. Journal of Consumer Research 37(3), pp. 409–425.
- Simonyan and Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
- Statista Research Department (2020). (Website.)
- Tagliabue et al. (2021). SIGIR 2021 e-commerce workshop data challenge. In SIGIR eCom 2021.
- Tagliabue et al. (2020a). How to grow a (product) tree: personalized category suggestions for eCommerce type-ahead. In Companion Proceedings of ACL, New York, NY, USA.
- Tagliabue et al. (2020b). The embeddings that came in from the cold: improving vectors for new and rare products with content-based inference. In Fourteenth ACM Conference on Recommender Systems (RecSys '20), New York, NY, USA, pp. 577–578.
- Techcrunch. (Website.)
- Techcrunch (2019a). (Website.)
- Techcrunch (2019b). (Website.)
- Techcrunch (2021). (Website.)
- Tsagkias et al. (2020). Challenges and research opportunities in eCommerce search and recommendations. SIGIR Forum 54.
- Webber et al. (2010). A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28(4).
- Wu et al. (2019). Recent advances in diversified recommendation. arXiv 1905.06589.
- Zbontar et al. (2021). Barlow Twins: self-supervised learning via redundancy reduction. CoRR abs/2103.03230.
- Zhang et al. (2019). Inferring substitutable products with deep network embedding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pp. 4306–4312.
- Zhao et al. (2018). Learning item-interaction embeddings for user recommendations. arXiv abs/1812.04407.
- Zuo et al. (2020). A flexible large-scale similar product identification system in e-commerce. In KDD 1st International Workshop on Industrial Recommendation.
Appendix A Implementation Details
We implement our pipeline leveraging Metaflow (Berg et al., 2019), which allows us to programmatically define the pipeline as a DAG. The pipeline has three core phases (spread across several steps; a minimal skeleton is sketched after the list):
- we dedicate the initial steps of the DAG to pulling data (such as user sessions and pre-cached embeddings) from various sources like Snowflake and S3, and to performing various transformations on the data. Note that many of these steps run in parallel;
- we launch our model training in parallel (which may in itself contain several steps, as outlined in Section 2) with various configurations (e.g. input features). In addition, we are able to dispatch steps with high resource demands (e.g. GPU) to AWS Batch;
- we collate the results (e.g. metrics, trained models, model predictions) from each parallel run, and store them as data artifacts on S3 for further analysis.
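A minimal Metaflow skeleton illustrating this three-phase structure (class, attribute and configuration names are hypothetical, and the data-pulling and training bodies are omitted):

```python
from metaflow import FlowSpec, step, batch

class ComparisonEngineFlow(FlowSpec):
    """Illustrative skeleton of the CE training DAG."""

    @step
    def start(self):
        # Phase 1: pull sessions/embeddings from Snowflake and S3 (omitted).
        self.configs = ["C0_S0", "C0_S1", "C1_S0", "C1_S1"]
        self.next(self.train_model, foreach="configs")  # fan out in parallel

    @batch(gpu=1)  # dispatch resource-heavy training steps to AWS Batch
    @step
    def train_model(self):
        self.config = self.input
        # Phase 2: train the Siamese substitute classifier for this config (omitted).
        self.metrics = {}  # placeholder for evaluation results
        self.next(self.join)

    @step
    def join(self, inputs):
        # Phase 3: collate metrics/artifacts from the parallel runs; Metaflow
        # versions them automatically as data artifacts on S3.
        self.results = {i.config: i.metrics for i in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ComparisonEngineFlow()
```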
The adoption of Metaflow on top of our cloud provider (AWS) speeds up development time (since it is the same code running locally and remotely), reduces training time (thanks to parallelism and GPU provisioning) and increases confidence in our experiments (thanks to versioning and full pipeline replayability). The setup we adopt fully decouples writing code from the underlying infrastructure, including data retrieval thanks to the “PaaS-like feeling” of Snowflake (Dageville et al., 2016). Fig. 5 shows the comparison table for a pair of mountain shoes (yellow), as produced by our Metaflow pipeline.

Appendix B Unsupervised Data and Product Representations
In this section, we provide details and hyper-parameters used in the generation of training data and of dense unsupervised representations for products.
B.1. Data Preparation
- Co-view and Co-purchase Data: for Shop A we obtain shopping sessions over a period of 3 months, and for Shop B over a period of 1 month. We enforce a minimum count of co-view occurrences for a pair to become a positive example, and a minimum count of co-purchase occurrences for it to become a negative example.
- Cleaning with Image Vectors: for both Shop A and Shop B, we enforce a minimum cosine similarity for positive pairs and a maximum cosine similarity for negative pairs.
- Synthetic Augmentation: the maximum cluster size is set to 40.
Table 3. Descriptive statistics for the two shops.

| Shop | # Products | # Browse Sessions | # Purchase Sessions |
|---|---|---|---|
| Shop A | 20k | 1.5M | 27k |
| Shop B | 50k | 3M | 12k |
Table 4. Number of training and validation pairs per configuration.

| Shop | Config | Train (Pos/Neg) | Validation (Pos/Neg) |
|---|---|---|---|
| Shop A | C=0; S=0 | 19k/18k | 1.5k/1k |
| Shop A | C=0; S=1 | 27k/75k | 2.8k/3.7k |
| Shop A | C=1; S=0 | 8k/6k | 0.5k/0.5k |
| Shop A | C=1; S=1 | 17k/33k | 1.5k/1.5k |
| Shop B | C=0; S=0 | 50k/20k | 3k/1k |
| Shop B | C=0; S=1 | 60k/40k | 6k/5k |
| Shop B | C=1; S=0 | 40k/10k | 2.5k/1k |
| Shop B | C=1; S=1 | 70k/120k | 7k/7k |
B.2. Unsupervised Product Representations
- Prod2Vec Embeddings: we train behavioural product embeddings using CBOW with negative sampling (Mikolov et al., 2013), swapping the concept of words in a sentence with products in a browsing session. Following the best practices of Bianchi et al. (2020), we adopt the hyper-parameters: window = 5, iterations = 30, ns_exponent = 0.75, dimensions = 48 – the exception being a smaller window size, so that more emphasis is placed on co-viewed, and hence more likely substitutable, products.
- Textual Embeddings: we train textual embeddings using CBOW with negative sampling over product descriptions as our text corpus. We adopt the hyper-parameters: window = 10, iterations = 30, ns_exponent = 0.75, dimensions = 48. We then take the name, description and categories of each product and obtain a dense representation for each meta-data field by applying average pooling over its word representations.
- Image Embeddings: we prepare image embeddings by utilising a pre-trained VGG16 network (Simonyan and Zisserman, 2015), and apply 7x7 2D max-pooling to the final max-pool layer of VGG16 to obtain a 512-dim representation (see the sketch below).
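A sketch of the image-embedding extraction, assuming TensorFlow/Keras (paths and helper names are illustrative):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# Pre-trained VGG16 without the classification head: its final max-pool
# output is a 7x7x512 feature map for a 224x224 input.
backbone = VGG16(weights="imagenet", include_top=False)

def image_embedding(path: str) -> np.ndarray:
    """512-dim product image embedding via max-pooling the last feature map."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    fmap = backbone.predict(x)          # shape (1, 7, 7, 512)
    return fmap.max(axis=(1, 2))[0]     # 7x7 max-pool -> (512,)
```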
Appendix C Model Architecture and Training
C.1. Model Architecture
In this section we provide architectural details on the binary comparison model. At a high level, the model takes two products as inputs and produces a confidence score indicating whether the two products are substitutes.

First, each product $p$ is represented by a set of embeddings $e_1, \dots, e_n$, each representing a different type of information or modality. Details on how these embeddings are obtained can be found in Appendix B.

Second, the embeddings of a product are fused into a single dense representation by a neural network $f$, which is re-used across all products. We define $f$ as:

$f(p) = \mathrm{norm}\big(\psi(\phi_1(e_1) \oplus \dots \oplus \phi_n(e_n))\big)$

where each $\phi_i$ is a dense re-projection layer (48-dim, ReLU activation), $\oplus$ is the concatenation operation, $\psi$ is a dense fusion layer (128-dim, ReLU activation) and $\mathrm{norm}$ is the L2-normalization operator.

Lastly, the fused representations of two products, $u = f(p_1)$ and $v = f(p_2)$, are passed into a neural network $g$, which produces the confidence score. We define $g$ as:

$g(u, v) = \sigma\big(W \lvert u - v \rvert\big)$

That is, we take the element-wise absolute difference (Reimers and Gurevych, 2019) between the two inputs and pass it into a dense classification layer $W$ (1-dim), followed by the sigmoid function $\sigma$, to produce the binary classification score.
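A compact PyTorch sketch of this architecture (dimensions as above; class and variable names are our own):

```python
import torch
import torch.nn as nn

class SiameseSubstituteClassifier(nn.Module):
    """Sketch of the comparison model described above (dims from Appendix C)."""

    def __init__(self, input_dims: list[int]):
        super().__init__()
        # One 48-dim re-projection layer (phi_i) per input embedding type.
        self.projections = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 48), nn.ReLU()) for d in input_dims
        )
        # Fusion layer psi: concatenation -> 128-dim dense.
        self.fusion = nn.Sequential(nn.Linear(48 * len(input_dims), 128), nn.ReLU())
        # g: 1-dim dense layer + sigmoid over |u - v|.
        self.classifier = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())

    def encode(self, embeddings: list[torch.Tensor]) -> torch.Tensor:
        projected = [proj(e) for proj, e in zip(self.projections, embeddings)]
        fused = self.fusion(torch.cat(projected, dim=-1))
        return nn.functional.normalize(fused, p=2, dim=-1)  # L2-normalize

    def forward(self, product_a, product_b):
        u, v = self.encode(product_a), self.encode(product_b)
        return self.classifier(torch.abs(u - v)).squeeze(-1)

# e.g. prod2vec (48-dim) plus name/description/category embeddings (48-dim each)
model = SiameseSubstituteClassifier(input_dims=[48, 48, 48, 48])
```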
C.2. Model Training
For all experiments, we use the Adam optimizer, early stopping with a patience of 20 epochs and a batch size of 32. We tested the following configurations of product representations:
- description, name, prod2vec;
- categories, description, name;
- categories, description, name, prod2vec.
The feature set that yielded the best results is [categories, description, name, prod2vec].