
A Boring-yet-effective Approach for the Product Ranking Task of the Amazon KDD Cup 2022

by   Vitor Jeronymo, et al.

In this work we describe our submission to the product ranking task of the Amazon KDD Cup 2022. We rely on a recipe that proved effective in previous competitions: we focus our efforts on efficiently training and deploying large language models, such as mT5, while reducing task-specific adaptations to a minimum. Despite the simplicity of our approach, our best model was less than 0.004 nDCG@20 below the top submission. As the top 20 teams achieved an nDCG@20 close to 0.90, we argue that more difficult e-commerce evaluation datasets are needed to discriminate between retrieval methods.





1. Introduction

Recent improvements in information retrieval, mainly due to pretrained transformer models, opened up the possibility of improving search in various domains (Jin et al., 2020; MacAvaney et al., 2020; Bi et al., 2020; Choi et al., 2020; Lin et al., 2021; Pradeep et al., 2021; Nogueira et al., 2020a; Ma et al., 2021). Among such domains, e-commerce search receives special attention by the industry as improvements in search quality often lead to increases in revenue.

In this work, we detail our submission to the Amazon KDD Cup 2022, whose goal is to evaluate ranking methods that can be used to improve the customer experience when searching for products.

2. Related Work

Our solution is based on the monoT5 model, which has demonstrated strong effectiveness in passage ranking tasks across different domains. We qualify our method as “boring” since it is well established in the recent IR literature that models with more parameters can outperform smaller ones that use task-specific adaptations. For example, Nogueira et al. (2020b) used monoT5 to achieve state-of-the-art results on the TREC 2004 Robust Track (Voorhees, 2004), while Pradeep et al. (2020) used the same model, finetuned only on MS MARCO, to achieve the best or second-best performance on medical-domain ranking datasets such as Precision Medicine (Roberts et al., 2019) and TREC-COVID (Zhang et al., 2020). In addition, Rosa et al. (Rosa et al., 2021, 2022b) used large versions of monoT5 to reach the state of the art on a legal-domain entailment task in the COLIEE competition (Rabelo et al., 2021; Kim et al., 2022). Furthermore, Rosa et al. (2022a) showed that the 3 billion-parameter variant of monoT5 achieves the state of the art on 12 out of 18 datasets of the Benchmarking-IR (BEIR) suite (Thakur et al., 2021), which consists of datasets from domains such as web, biomedical, scientific, financial, and news.

3. Methodology

In this section, we describe mMonoT5, a multilingual variant of monoT5 (Nogueira et al., 2020b), which is an adaptation of the T5 model (Raffel et al., 2020) for the passage ranking task. We first finetune a multilingual T5 model (Xue et al., 2021) on the mMARCO dataset (Bonifacio et al., 2021), a version of MS MARCO (Bajaj et al., 2018) translated into 9 languages. The model is trained to generate a “yes” or “no” token depending on the relevance of a document to a query.

mMonoT5 uses the following input template:

Query: q Document: d Relevant:

where q represents a query and d represents a document that may or may not be relevant to the given query.

During inference, the model receives the same input prompt and estimates a score s(q, d) that quantifies the relevance of document d to query q by applying a softmax function to the logits of the “yes” and “no” tokens, and then taking the probability of the “yes” token as the final score. That is,

s(q, d) = exp(z_yes) / (exp(z_yes) + exp(z_no)),

where z_yes and z_no are the logits of the “yes” and “no” tokens. After computing all scores for a given query, we rank the documents with respect to their scores.
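The scoring and ranking step can be sketched as follows. This is a minimal illustration, not the submission's code: it assumes the “yes” and “no” logits (z_yes, z_no) have already been read off the model's first decoded position, and the helper names are hypothetical.

```python
import math

def relevance_score(logit_yes: float, logit_no: float) -> float:
    """Softmax over the 'yes'/'no' logits; the probability of 'yes' is the score."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def rank_documents(docs_with_logits):
    """docs_with_logits: iterable of (doc_id, logit_yes, logit_no) tuples.
    Returns (doc_id, score) pairs sorted from most to least relevant."""
    scored = [(doc_id, relevance_score(ly, ln)) for doc_id, ly, ln in docs_with_logits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Subtracting the maximum logit before exponentiating does not change the softmax output but avoids overflow for large logits.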

After finetuning on mMARCO, we further finetuned the model on the training data of tasks 1 and 2 of the competition. We use the Beautiful Soup library to clean any remaining HTML tags that may appear in the product data. Products are presented to the model as the concatenation of the fields product_title, product_description, product_bullet_point, product_brand and product_color_name, joined by whitespace.
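The cleaning and concatenation step can be sketched as below. The field names follow the competition's schema; `product_to_text` is an illustrative helper, not code from the submission.

```python
from bs4 import BeautifulSoup

# Product fields concatenated into the document text, in order.
FIELDS = [
    "product_title",
    "product_description",
    "product_bullet_point",
    "product_brand",
    "product_color_name",
]

def product_to_text(product: dict) -> str:
    """Strip residual HTML from each field and join the fields with whitespace."""
    parts = []
    for field in FIELDS:
        value = product.get(field) or ""
        # BeautifulSoup drops any remaining HTML tags; get_text keeps only the text.
        parts.append(BeautifulSoup(value, "html.parser").get_text(" ", strip=True))
    return " ".join(p for p in parts if p)
```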

During the competition we observed that using the task 2 training data improved the model substantially. Hence, we used the training data of both tasks 1 and 2, mapping the labeled classes to “true” for ‘exact’ and to “false” for all other classes. We use these tokens instead of the “yes” and “no” used by the original mMonoT5. We trained the model for 5 epochs, which takes about 72 hours on a TPU v3, using a batch size of 128 and a maximum sequence length of 512 tokens.
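The label mapping can be sketched as follows, assuming the competition's ESCI labels (‘exact’, ‘substitute’, ‘complement’, ‘irrelevant’); `esci_to_token` is an illustrative name.

```python
def esci_to_token(label: str) -> str:
    """Map an ESCI relevance label to the target token used for finetuning:
    'exact' -> "true"; every other class -> "false"."""
    return "true" if label.strip().lower() == "exact" else "false"
```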

4. Results

We show our results in Table 1. Our best model achieved an nDCG@20 of 0.9012 and 0.9007 on the public and private test sets, respectively, placing us ninth on the leaderboard and only 0.0036 behind the first position.

Initially, we used mMonoT5-base, with 580M parameters, finetuned only on mMARCO, to test the model’s zero-shot capability. This model achieves an nDCG@20 of 0.864. We then further finetuned it on the training data of the competition, which improved the nDCG@20 to 0.89; the 3.7B-parameter version later surpassed this by 0.0112 points. We also tried translating the corpus and queries into English and using monoT5-3B (English-only) finetuned on the competition data, but it did not outperform its multilingual counterpart.

Model                                  Public   Private
monoT5-3B (dataset translated to En)   0.8750   -
mMonoT5-580M (mMARCO only)             0.8640   -
mMonoT5-580M                           0.8900   -
mMonoT5-3.7B (our best submission)     0.9012   0.9007
First place (team www)                 0.9057   0.9043
20th place (team we666)                0.8933   0.8929

Table 1. Main results of the competition.
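For reference, the evaluation metric can be sketched as below. This is a generic nDCG@k over per-position gains, not the competition's exact scoring script; the gain assigned to each ESCI label is left to the caller.

```python
import math

def dcg(gains, k=20):
    """Discounted cumulative gain over the top-k positions (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=20):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal, k)
    return dcg(ranked_gains, k) / denom if denom > 0 else 0.0
```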

5. Conclusion

We described a boring but effective approach based on the multilingual variation of monoT5 that achieved competitive results in the product ranking task of the Amazon KDD Cup 2022.


  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang (2018) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268. Cited by: §3.
  • K. Bi, Q. Ai, and W. B. Croft (2020) A transformer-based embedding model for personalized product search. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1521–1524. Cited by: §1.
  • L. H. Bonifacio, I. Campiotti, R. de Alencar Lotufo, and R. Nogueira (2021) mMARCO: a multilingual version of MS MARCO passage ranking dataset. arXiv preprint arXiv:2108.13897. Cited by: §3.
  • J. I. Choi, S. Kallumadi, B. Mitra, E. Agichtein, and F. Javed (2020) Semantic product search for matching structured product catalogs in e-commerce. arXiv preprint arXiv:2008.08180. Cited by: §1.
  • Q. Jin, C. Tan, M. Chen, M. Yan, S. Huang, N. Zhang, and X. Liu (2020) Alibaba DAMO Academy at TREC Precision Medicine 2020: state-of-the-art evidence retriever for precision medicine with expert-in-the-loop active learning. In TREC. Cited by: §1.
  • M. Kim, J. Rabelo, R. Goebel, M. Yoshioka, Y. Kano, and K. Satoh (2022) COLIEE 2022 summary: methods for legal document retrieval and entailment. Proceedings of the Sixteenth International Workshop on Juris-informatics (JURISIN 2022). Cited by: §2.
  • J. Lin, R. Nogueira, and A. Yates (2021) Pretrained transformers for text ranking: bert and beyond. Synthesis Lectures on Human Language Technologies 14 (4), pp. 1–325. Cited by: §1.
  • Y. Ma, Y. Shao, B. Liu, Y. Liu, M. Zhang, and S. Ma (2021) Retrieving legal cases from a large-scale candidate corpus. Proceedings of the Eighth International Competition on Legal Information Extraction/Entailment, COLIEE2021. Cited by: §1.
  • S. MacAvaney, A. Cohan, and N. Goharian (2020) SLEDGE-Z: a zero-shot baseline for COVID-19 literature search. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4171–4179. Cited by: §1.
  • R. Nogueira, Z. Jiang, K. Cho, and J. Lin (2020a) Navigation-based candidate expansion and pretrained language models for citation recommendation. Scientometrics 125 (3), pp. 3001–3016. Cited by: §1.
  • R. Nogueira, Z. Jiang, R. Pradeep, and J. Lin (2020b) Document ranking with a pretrained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 708–718. Cited by: §2, §3.
  • R. Pradeep, X. Ma, R. Nogueira, and J. Lin (2021) Vera: prediction techniques for reducing harmful misinformation in consumer health search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2066–2070. Cited by: §1.
  • R. Pradeep, X. Ma, X. Zhang, H. Cui, R. Xu, R. Nogueira, and J. Lin (2020) H2oloo at TREC 2020: when all you got is a hammer… deep learning, health misinformation, and precision medicine. In TREC. Cited by: §2.
  • J. Rabelo, R. Goebel, M. Kim, M. Yoshioka, Y. Kano, and K. Satoh (2021) Summary of the competition on legal information extraction/entailment (coliee) 2021. Proceedings of the Eighth International Competition on Legal Information Extraction/Entailment. Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §3.
  • K. Roberts, D. Demner-Fushman, E. Voorhees, W. Hersh, S. Bedrick, A. J. Lazar, and S. Pant (2019) Overview of the TREC 2019 precision medicine track. In Proceedings of the Text REtrieval Conference (TREC). Cited by: §2.
  • G. M. Rosa, L. Bonifacio, V. Jeronymo, H. Abonizio, M. Fadaee, R. Lotufo, and R. Nogueira (2022a) No parameter left behind: how distillation and model size affect zero-shot retrieval. arXiv preprint arXiv:2206.02873. Cited by: §2.
  • G. M. Rosa, L. Bonifacio, V. Jeronymo, H. Abonizio, R. Lotufo, and R. Nogueira (2022b) Billions of parameters are worth more than in-domain training data: a case study in the legal case entailment task. arXiv preprint arXiv:2205.15172. Cited by: §2.
  • G. M. Rosa, R. C. Rodrigues, R. Lotufo, and R. Nogueira (2021) To tune or not to tune? Zero-shot models for legal case entailment. In ICAIL’21: Eighteenth International Conference on Artificial Intelligence and Law, June 21–25, 2021, São Paulo, Brazil. Cited by: §2.
  • N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych (2021) BEIR: a heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663. Cited by: §2.
  • E. M. Voorhees (2004) Overview of the trec 2004 robust track. Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, November 16-19, 2004. Cited by: §2.
  • L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel (2021) mT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934. Cited by: §3.
  • E. Zhang, N. Gupta, R. Nogueira, K. Cho, and J. Lin (2020) Rapidly deploying a neural search engine for the covid-19 open research dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Cited by: §2.