Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps

11/08/2022 · by Hiroki Iida, et al.

IR models based on pretrained language models significantly outperform lexical approaches such as BM25. In particular, SPLADE, which encodes texts as sparse vectors, is an effective model for practical use because it is robust to out-of-domain datasets. However, SPLADE still struggles with exact matching of words that are low-frequency in the training data. In addition, domain shifts in vocabulary and word frequencies degrade SPLADE's IR performance. Because supervision data are scarce in the target domain, these domain shifts must be addressed without supervision. This paper proposes an unsupervised domain adaptation method that fills the vocabulary and word-frequency gaps. First, we expand the vocabulary and run continual pretraining with a masked language model on a target-domain corpus. Then, we multiply the SPLADE-encoded sparse vectors by inverse document frequency (IDF) weights to account for the importance of documents containing low-frequency words. We conducted experiments on datasets with a large vocabulary gap from the source domain and show that our method outperforms the current state-of-the-art domain adaptation method. Moreover, combined with BM25, our method achieves state-of-the-art results.
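The two steps in the abstract can be pictured with a short sketch, assuming the Hugging Face transformers library and a public SPLADE checkpoint. The checkpoint name, the IDF smoothing variant, the example domain terms, and the placeholder corpus below are illustrative assumptions, not the paper's released code.

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    MODEL = "naver/splade-cocondenser-ensembledistil"  # assumed public checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForMaskedLM.from_pretrained(MODEL)

    # Step 1 (sketch): expand the vocabulary with target-domain terms, then
    # continue masked-language-model pretraining on the target corpus
    # (the MLM training loop itself is omitted here).
    new_terms = ["pharmacokinetics", "bioavailability"]  # hypothetical terms
    tokenizer.add_tokens(new_terms)
    model.resize_token_embeddings(len(tokenizer))

    def splade_encode(text: str) -> torch.Tensor:
        """Encode text into a |V|-dimensional sparse SPLADE vector."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
        # Standard SPLADE pooling: max over positions of log(1 + ReLU(logit)).
        weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
        return weights.max(dim=1).values.squeeze(0)

    def idf_weights(corpus: list[str], vocab_size: int) -> torch.Tensor:
        """Per-token smoothed IDF computed on the target-domain corpus."""
        df = torch.zeros(vocab_size)
        for doc in corpus:
            for token_id in set(tokenizer(doc, truncation=True)["input_ids"]):
                df[token_id] += 1
        n = len(corpus)
        return torch.log((n + 1) / (df + 1)) + 1.0  # assumed smoothing variant

    # Step 2: multiply the SPLADE vector elementwise by the IDF weights so that
    # low-frequency (high-IDF) terms contribute more to the matching score.
    corpus = ["...target-domain documents..."]  # placeholder
    idf = idf_weights(corpus, len(tokenizer))
    doc_vec = splade_encode("a target-domain document") * idf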


Related research

- Unsupervised Domain Adaptation Using Feature Disentanglement And GCNs For Medical Image Classification (06/27/2022): The success of deep learning has set new benchmarks for many medical ima...
- Open-Set Domain Adaptation with Visual-Language Foundation Models (07/30/2023): Unsupervised domain adaptation (UDA) has proven to be very effective in ...
- Dense Retrieval Adaptation using Target Domain Description (07/06/2023): In information retrieval (IR), domain adaptation is the process of adapt...
- Domain Adaptation for Time Series Under Feature and Label Shifts (02/06/2023): The transfer of models trained on labeled datasets in a source domain to...
- Mind the Gap: Enlarging the Domain Gap in Open Set Domain Adaptation (03/08/2020): Unsupervised domain adaptation aims to leverage labeled data from a sour...
- IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization (09/10/2021): We present IndoBERTweet, the first large-scale pretrained model for Indo...
- Cross-Domain Evaluation of a Deep Learning-Based Type Inference System (08/19/2022): Optional type annotations allow for enriching dynamic programming langua...
