Pre-training Tasks for User Intent Detection and Embedding Retrieval in E-commerce Search

08/12/2022
by Yiming Qiu, et al.

BERT-style models pre-trained on a general corpus (e.g., Wikipedia) and fine-tuned on a task-specific corpus have recently emerged as breakthrough techniques for many NLP tasks: question answering, text classification, sequence labeling, and so on. However, this technique may not always work, especially in two scenarios: a corpus that contains text very different from the general corpus such as Wikipedia, or a task that learns the spatial distribution of embeddings for a specific purpose (e.g., approximate nearest neighbor search). In this paper, to tackle these two scenarios, which we have encountered in an industrial e-commerce search system, we propose customized and novel pre-training tasks for two critical modules: user intent detection and semantic embedding retrieval. After fine-tuning, the customized pre-trained models, despite being less than 10% the size of BERT, significantly improve over two baseline models: 1) a model with no pre-training and 2) a model fine-tuned from the official pre-trained BERT on the general corpus, on both offline datasets and the online system. We have open sourced our datasets for the sake of reproducibility and future work.
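To make the semantic embedding retrieval setting concrete, here is a minimal, self-contained sketch of the general idea (not the paper's actual model or index): queries and items are encoded into a shared vector space, and retrieval becomes a nearest-neighbor lookup. All names and numbers below are illustrative assumptions; a production system would use a learned encoder and an approximate nearest neighbor index instead of the exact brute-force search shown here.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)

# Stand-in for pre-computed item embeddings from a trained encoder:
# 1000 random unit vectors in a 64-dimensional space.
items = [normalize([random.gauss(0, 1) for _ in range(64)]) for _ in range(1000)]

# Stand-in for a query embedding: a slightly perturbed copy of item 42,
# mimicking a query that is semantically close to that item.
query = normalize([x + 0.05 * random.gauss(0, 1) for x in items[42]])

# Exact top-k retrieval by cosine similarity; at production scale this
# linear scan is what an ANN index would approximate.
scores = [(dot(item, query), idx) for idx, item in enumerate(items)]
top_k = [idx for _, idx in sorted(scores, reverse=True)[:5]]
print(top_k[0])  # item 42 ranks first
```

The sketch shows why the embedding space's spatial distribution matters for this task: retrieval quality depends entirely on semantically related query-item pairs landing close together under the similarity measure, which is the property the paper's customized pre-training tasks target.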


Related research

12/15/2022 · The Effects of In-domain Corpus Size on pre-training BERT
Many prior language modeling efforts have shown that pre-training on an ...

05/25/2021 · NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains
Natural language processing (NLP) tasks (text classification, named enti...

09/13/2021 · Effectiveness of Pre-training for Few-shot Intent Classification
This paper investigates the effectiveness of pre-training for few-shot i...

10/15/2020 · CXP949 at WNUT-2020 Task 2: Extracting Informative COVID-19 Tweets – RoBERTa Ensembles and The Continued Relevance of Handcrafted Features
This paper presents our submission to Task 2 of the Workshop on Noisy Us...

06/06/2022 · Spam Detection Using BERT
Emails and SMSs are the most popular tools in today communications, and ...

08/18/2023 · Differentiable Retrieval Augmentation via Generative Language Modeling for E-commerce Query Intent Classification
Retrieval augmentation, which enhances downstream models by a knowledge ...

08/29/2023 · Multi-party Goal Tracking with LLMs: Comparing Pre-training, Fine-tuning, and Prompt Engineering
This paper evaluates the extent to which current Large Language Models (...
