UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

04/15/2021
by Kyoung-Rok Jang, et al.

Neural information retrieval (IR) models are promising mainly because their semantic matching capabilities can ameliorate the well-known synonymy and polysemy problems of word-based symbolic approaches. However, the power of neural models' dense representations comes at the cost of inefficiency, limiting their use to re-ranking. Sparse representations, in contrast, can enhance symbolic or latent-term representations while still exploiting an inverted index for efficiency, making them amenable to symbolic IR techniques that have been refined over decades. To transcend the trade-off between sparse representations (symbolic or latent-term based) and dense representations, we propose an ultra-high dimensional (UHD) representation scheme with directly controllable sparsity. With the high dimensionality, we aim to make the meaning of each dimension less entangled and polysemous than in dense embeddings. The sparsity enables not only efficient vector calculations but also the possibility of attributing individual dimensions to interpretable concepts. Our model, UHD-BERT, maximizes the benefits of UHD sparse representations built on BERT language modeling by adopting a bucketing method: different segments of an embedding (horizontal buckets) or the embeddings from multiple layers of BERT (vertical buckets) can be selected and merged so that diverse linguistic aspects are represented. A further benefit of these highly disentangled (high-dimensional) and efficient (sparse) representations is that the neural approach can be harmonized with well-studied symbolic IR techniques (e.g., inverted index, pseudo-relevance feedback, BM25), enabling a powerful and efficient neuro-symbolic information retrieval system.
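To make the bucketing idea concrete, the sketch below shows one plausible way to realize it: a per-bucket linear projection of a BERT hidden state into a very wide space, followed by top-k (Winner-Take-All style) sparsification so that sparsity is directly controllable, with vertical buckets from different layers concatenated into one UHD vector. The dimensions, the `UHDEncoder` class, and the helper names are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of bucketed UHD sparse encoding. Bucket width,
# k, and all names below are illustrative assumptions.
import torch


def sparsify_top_k(dense: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest activations per vector (controllable sparsity)."""
    values, indices = dense.topk(k, dim=-1)
    sparse = torch.zeros_like(dense)
    sparse.scatter_(-1, indices, values)
    return sparse


class UHDEncoder(torch.nn.Module):
    """Projects one BERT hidden state into one ultra-high dimensional sparse bucket."""

    def __init__(self, hidden_size: int = 768, bucket_dim: int = 81920, k: int = 400):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, bucket_dim)
        self.k = k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # ReLU keeps activations non-negative, so non-zero dimensions behave
        # like term weights that an inverted index can store.
        return sparsify_top_k(torch.relu(self.proj(hidden)), self.k)


def encode_vertical_buckets(layer_states, encoders):
    """Vertical bucketing: encode states from several BERT layers and
    concatenate the sparse buckets into one UHD representation."""
    buckets = [enc(h) for enc, h in zip(encoders, layer_states)]
    return torch.cat(buckets, dim=-1)


# Usage with hypothetical [CLS] states taken from two BERT layers:
encoders = [UHDEncoder(), UHDEncoder()]
query_vec = encode_vertical_buckets([torch.randn(1, 768), torch.randn(1, 768)], encoders)
doc_vec = encode_vertical_buckets([torch.randn(1, 768), torch.randn(1, 768)], encoders)
score = (query_vec * doc_vec).sum(-1)  # sparse dot product, inverted-index friendly
```

Because the resulting vectors are non-negative and mostly zero, each non-zero dimension can be treated like a term posting, which is what would let such a scheme plug into classical inverted-index machinery.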

Related research

06/21/2021 · Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance mode...

12/28/2020 · The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Information Retrieval using dense low-dimensional representations recent...

09/21/2021 · SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
In neural Information Retrieval (IR), ongoing research is directed towar...

07/12/2021 · SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
In neural Information Retrieval, ongoing research is directed towards im...

01/12/2022 · Diagnosing BERT with Retrieval Heuristics
Word embeddings, made widely popular in 2013 with the release of word2ve...

05/06/2022 · Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder
Dense retrievers encode texts and map them in an embedding space using p...

04/12/2020 · Minimizing FLOPs to Learn Efficient Sparse Representations
Deep representation learning has become one of the most widely adopted a...
