Consistent Text Categorization using Data Augmentation in e-Commerce

05/09/2023
by Guy Horowitz, et al.

The categorization of massive e-Commerce data is a crucial, well-studied task that is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company, serving multiple applications. At its core, the product categorization model is a text classification model that takes a product title as input and outputs the most suitable category out of thousands of available candidates. Upon closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements substantially changed the model's output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a sub-optimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model's consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.
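
To make the inconsistency described above concrete, the following is a minimal sketch of one way to probe a title classifier for sensitivity to color words. It is an illustration under stated assumptions, not the method used in the paper: the color list, the helper names `color_variants` and `consistency`, and the brittle `toy_classifier` are all hypothetical.

```python
import re
from collections import Counter
from typing import Callable, List

# Hypothetical color vocabulary; a real catalog would supply a much larger list.
COLORS = ["red", "blue", "black", "white", "green"]
_COLOR_RE = re.compile(r"\b(" + "|".join(COLORS) + r")\b", re.IGNORECASE)


def color_variants(title: str) -> List[str]:
    """Return the title plus copies with its first color word swapped."""
    match = _COLOR_RE.search(title)
    if match is None:
        return [title]
    original = match.group(0).lower()
    variants = [title]
    for color in COLORS:
        if color != original:
            variants.append(title[:match.start()] + color + title[match.end():])
    return variants


def consistency(classify: Callable[[str], str], title: str) -> float:
    """Fraction of variants that receive the most common predicted category."""
    predictions = [classify(v) for v in color_variants(title)]
    top_count = Counter(predictions).most_common(1)[0][1]
    return top_count / len(predictions)


def toy_classifier(title: str) -> str:
    """Deliberately brittle stand-in (assumption): 'green' flips the category."""
    return "Garden" if "green" in title.lower() else "Shoes"


if __name__ == "__main__":
    print(consistency(toy_classifier, "Red running shoes size 10"))
```

With the toy classifier, four of the five variants of "Red running shoes size 10" map to the same category, so the score is 0.8; a fully consistent model would score 1.0 on every title.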

research
10/21/2021

Categorizing Items with Short and Noisy Descriptions using Ensembled Transferred Embeddings

Item categorization is a machine learning task which aims at classifying...
research
12/14/2018

Don't Classify, Translate: Multi-Level E-Commerce Product Categorization Via Machine Translation

E-commerce platforms categorize their products into a multi-level taxono...
research
07/05/2022

Block-SCL: Blocking Matters for Supervised Contrastive Learning in Product Matching

Product matching is a fundamental step for the global understanding of c...
research
10/22/2020

Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data

Unsupervised Data Augmentation (UDA) is a semi-supervised technique that...
research
04/23/2021

APRF-Net: Attentive Pseudo-Relevance Feedback Network for Query Categorization

Query categorization is an essential part of query intent understanding ...
research
06/05/2018

Adapting Neural Text Classification for Improved Software Categorization

Software Categorization is the task of organizing software into groups t...
research
06/09/2016

e-Commerce product classification: our participation at cDiscount 2015 challenge

This report describes our participation in the cDiscount 2015 challenge ...
