Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

10/11/2022
by Odunayo Ogundepo, et al.

Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the Mr. TyDi collection: results show that the WordPiece tokenizer of mBERT provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
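To make the idea concrete, here is a minimal sketch of how subword tokens can create lexical matches that whitespace tokens miss. It is not the authors' implementation: the tiny vocabulary `VOCAB`, the helper names, and the BM25 parameters `k1=0.9` and `b=0.4` are all assumptions chosen for illustration, and a real WordPiece vocabulary (such as the one shipped with mBERT) would be learned from unlabeled text rather than written by hand.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real WordPiece vocabulary is learned from unlabeled text.
VOCAB = {"token", "##ization", "##izer", "##s", "retriev", "##al", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the "##" prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no segmentation exists for this word
        pieces.append(piece)
        start = end
    return pieces

def whitespace_tokenize(text):
    return text.lower().split()

def wordpiece_tokenize(text):
    return [p for w in text.lower().split() for p in wordpiece(w)]

def bm25_scores(query, docs, tokenize, k1=0.9, b=0.4):
    """Score every document against the query with a plain BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["tokenizers", "retrieval"]
query = "tokenization"
ws_scores = bm25_scores(query, docs, whitespace_tokenize)
wp_scores = bm25_scores(query, docs, wordpiece_tokenize)
print(ws_scores, wp_scores)
```

With whitespace tokens, the query "tokenization" shares no term with either document, so both scores are zero. With the toy WordPiece vocabulary, the query and the first document both contain the piece "token", so BM25 can rank that document above the unrelated one.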


