Multilingual Transformers for Product Matching – Experiments and a New Benchmark in Polish

05/31/2022
by Michał Możdżonek, et al.

Product matching is the task of identifying identical products across different data sources. It typically relies on the available product features, which are multimodal (i.e., composed of various data types) and may also be non-homogeneous and incomplete. The paper shows that pre-trained multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on the Web Data Commons training dataset and gold standard for large-scale product matching. The results show that these models perform similarly to the latest solutions tested on this dataset, and in some cases better. Additionally, for research purposes, we prepared a new dataset entirely in Polish, based on offers in selected categories obtained from several online stores. It is the first open dataset for product matching in Polish, and it allows the effectiveness of pre-trained models to be compared. We therefore also report baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish data.
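To make the "textual features" setting concrete, here is a minimal sketch of how two offers with incomplete attributes might be turned into a text pair for a fine-tuned sequence-pair classifier. The field names (`title`, `brand`, `description`), the helper names, and the serialization order are assumptions made for illustration; the abstract does not say which attributes or which serialization the authors used.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical attribute order; the abstract does not list the fields used.
FIELDS: Sequence[str] = ("title", "brand", "description")


def serialize_offer(
    offer: Mapping[str, Optional[str]],
    fields: Sequence[str] = FIELDS,
) -> str:
    """Join the non-empty textual attributes of one offer into a single string."""
    parts = []
    for name in fields:
        value = offer.get(name)  # a missing attribute yields None
        if value is None:
            continue
        value = " ".join(value.split())  # collapse runs of whitespace
        if value:  # skip attributes that are empty after cleaning
            parts.append(value)
    return " ".join(parts)


def make_pair(
    left: Mapping[str, Optional[str]],
    right: Mapping[str, Optional[str]],
) -> Tuple[str, str]:
    """Build the two text segments for a match / no-match classifier."""
    return serialize_offer(left), serialize_offer(right)


# Two offers with incomplete, non-homogeneous attributes.
offer_a = {"title": "Laptop  Lenovo ThinkPad T14", "brand": "Lenovo", "description": None}
offer_b = {"title": 'Lenovo ThinkPad T14 14"', "description": "  "}

pair = make_pair(offer_a, offer_b)
```

The resulting pair could then be passed as two segments to the tokenizer of a multilingual model such as mBERT or XLM-RoBERTa, with a binary label indicating whether the two offers describe the same product. Skipping empty attributes is one simple way to cope with incomplete data; other choices, such as inserting explicit field markers, are equally plausible.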


research
04/03/2020

Testing pre-trained Transformer models for Lithuanian news clustering

A recent introduction of Transformer deep learning architecture made bre...
research
09/13/2023

ProMap: Datasets for Product Mapping in E-commerce

The goal of product mapping is to decide, whether two listings from two ...
research
06/24/2023

Comparison of Pre-trained Language Models for Turkish Address Parsing

Transformer based pre-trained models such as BERT and its variants, whic...
research
02/24/2023

HULAT at SemEval-2023 Task 9: Data augmentation for pre-trained transformers applied to Multilingual Tweet Intimacy Analysis

This paper describes our participation in SemEval-2023 Task 9, Intimacy ...
research
12/24/2021

Spoiler in a Textstack: How Much Can Transformers Help?

This paper presents our research regarding spoiler detection in reviews....
research
08/02/2021

PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors

EuroVoc is a multilingual thesaurus that was built for organizing the le...
research
04/01/2022

Unitail: Detecting, Reading, and Matching in Retail Scene

To make full use of computer vision technology in stores, it is required...
