Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

04/17/2023
by Shengyao Zhuang, et al.

Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text data. Misspelled queries are typically absent from this data, and thus misspelled queries observed at inference time are out-of-distribution with respect to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer utilizes an encoder-decoder architecture where the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens for both the encoder and decoder. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, largely closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
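The bottlenecked encoder-decoder pre-training described above can be pictured with a small PyTorch sketch. The code below is an illustration only, not the authors' implementation: the module names, layer counts, BERT-like dimensions, the use of a [CLS]-style vector as the bottleneck, the shared MLM head, and the omission of positional embeddings are all assumptions made for brevity.

```python
import torch
import torch.nn as nn


class ToRoDerSketch(nn.Module):
    """Bottlenecked masked-language-model pre-training, roughly as described above.
    Positional embeddings are omitted for brevity (an assumption of this sketch)."""

    def __init__(self, vocab_size=30522, d_model=768, n_heads=12,
                 enc_layers=12, dec_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
        # Deep encoder reads the misspelled, partially masked text.
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        # Shallow decoder is forced to lean on the single bottleneck vector.
        self.decoder = nn.TransformerEncoder(dec_layer, dec_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, noisy_masked_ids, clean_masked_ids):
        # Encoder input: misspelled text with some tokens replaced by [MASK].
        enc_hidden = self.encoder(self.tok_emb(noisy_masked_ids))
        bottleneck = enc_hidden[:, :1, :]  # e.g. the [CLS] vector (an assumption)

        # Decoder input: token embeddings of the original text with the
        # misspelled positions masked out, prefixed by the bottleneck embedding.
        dec_input = torch.cat([bottleneck, self.tok_emb(clean_masked_ids)], dim=1)
        dec_hidden = self.decoder(dec_input)[:, 1:, :]  # drop the bottleneck slot

        # Both sides predict the masked tokens of the original text.
        return self.mlm_head(enc_hidden), self.mlm_head(dec_hidden)


# Toy usage: sum the two masked-token losses (ignore index -100 marks
# positions that were not masked and therefore carry no loss).
model = ToRoDerSketch()
noisy = torch.randint(0, 30522, (2, 16))   # misspelled text, [MASK]s applied
clean = torch.randint(0, 30522, (2, 16))   # clean text, misspelled positions masked
labels = torch.full((2, 16), -100)
labels[:, 3] = clean[:, 3]                 # pretend position 3 was masked
enc_logits, dec_logits = model(noisy, clean)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = (loss_fn(enc_logits.reshape(-1, 30522), labels.reshape(-1)) +
        loss_fn(dec_logits.reshape(-1, 30522), labels.reshape(-1)))
```

Keeping the decoder shallow and routing all encoder information through a single bottleneck vector is what pushes the encoder to produce typo-robust sequence representations; after pre-training, only the encoder would be fine-tuned and used for dense retrieval.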
