Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

10/31/2022
by Zexuan Qiu, et al.
Efficient document retrieval relies heavily on semantic hashing, which learns a binary code for every document and uses Hamming distance to measure the distance between documents. However, existing semantic hashing methods are mostly built on outdated TF-IDF features, which miss much of the important semantic information in documents. Furthermore, the Hamming distance can take only a few integer values, which significantly limits its ability to represent document distances. To address these issues, we propose to leverage BERT embeddings for efficient retrieval based on product quantization, which assigns every document a real-valued codeword from the codebook instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embedding into a probabilistic product quantization module, which outputs the assigned codeword. The refining and quantizing modules can be optimized end to end by minimizing a probabilistic contrastive loss. A method based on mutual information maximization is further proposed to improve the representativeness of the codewords, so that documents can be quantized more accurately. Extensive experiments on three benchmarks demonstrate that the proposed method significantly outperforms current state-of-the-art baselines.
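
To make the contrast between binary codes and product quantization concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the toy codebooks, the softmax over negative squared distances, and the `temperature` parameter are all assumptions chosen for illustration. It shows that a Hamming distance is always an integer count, whereas a product quantizer splits an embedding into segments and assigns each segment a real-valued codeword, either softly (as a probability-weighted mix) or by nearest codeword.

```python
import numpy as np


def hamming_distance(a, b):
    """Count the positions where two binary codes differ (always an integer)."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


def soft_product_quantize(x, codebooks, temperature=1.0):
    """Illustrative soft product quantization (assumed formulation).

    x:         embedding of shape (M * d_sub,)
    codebooks: array of shape (M, K, d_sub), i.e. M codebooks of K codewords
    Returns the soft-quantized vector and the nearest codeword index per segment.
    """
    M, K, d_sub = codebooks.shape
    segments = np.asarray(x, dtype=float).reshape(M, d_sub)
    # Squared Euclidean distance from each segment to each codeword: shape (M, K).
    dists = ((segments[:, None, :] - codebooks) ** 2).sum(axis=-1)
    # Softmax over negative distances gives assignment probabilities per segment.
    logits = -dists / temperature
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Probability-weighted mix of codewords, concatenated across segments.
    soft = np.einsum("mk,mkd->md", probs, codebooks).reshape(-1)
    indices = dists.argmin(axis=1)
    return soft, indices


# Toy example: M=2 codebooks, K=2 codewords each, d_sub=2 (hypothetical values).
toy_codebooks = np.array([[[0.0, 0.0], [1.0, 1.0]],
                          [[0.0, 0.0], [1.0, 1.0]]])
toy_embedding = np.array([0.1, 0.1, 0.9, 0.9])

# A low temperature makes the soft assignment nearly one-hot.
toy_soft, toy_indices = soft_product_quantize(
    toy_embedding, toy_codebooks, temperature=1e-3
)
toy_hamming = hamming_distance([0, 1, 1, 0], [1, 1, 0, 0])  # 2 differing bits
```

In this sketch the stored representation of a document would be the per-segment indices, while the soft output keeps the assignment differentiable, which is one common way to allow end-to-end training of a quantizer.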
