Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval

10/12/2021, by Jingtao Zhan et al., Tsinghua University and Institute of Computing Technology, Chinese Academy of Sciences

Dense Retrieval (DR) has achieved state-of-the-art first-stage ranking effectiveness. However, the efficiency of most existing DR models is limited by the large memory cost of storing dense vectors and the time-consuming nearest neighbor search (NNS) in vector space. Therefore, we present RepCONC, a novel retrieval model that learns discrete Representations via CONstrained Clustering. RepCONC jointly trains dual-encoders and the Product Quantization (PQ) method to learn discrete document representations and enables fast approximate NNS with compact indexes. It models quantization as a constrained clustering process, which requires the document embeddings to be uniformly clustered around the quantization centroids and supports end-to-end optimization of the quantization method and dual-encoders. We theoretically demonstrate the importance of the uniform clustering constraint in RepCONC and derive an efficient approximate solution for constrained clustering by reducing it to an instance of the optimal transport problem. Besides constrained clustering, RepCONC further adopts a vector-based inverted file system (IVF) to support highly efficient vector search on CPUs. Extensive experiments on two popular ad-hoc retrieval benchmarks show that RepCONC achieves better ranking effectiveness than competitive vector quantization baselines under different compression ratio settings. It also substantially outperforms a wide range of existing retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency.

1. Introduction

Dense Retrieval (DR) has become a popular paradigm for first-stage retrieval in ad-hoc retrieval tasks. By embedding queries and documents in a latent vector space with dual-encoders and using nearest neighbor search to retrieve relevant documents, the DR paradigm avoids the vocabulary mismatch problem, which has been a great challenge for traditional bag-of-words (BoW) models (Robertson and Walker, 1994). With end-to-end supervised training, recent works have achieved state-of-the-art ranking performance and significantly outperform BoW models (Lin et al., 2020; Qu et al., 2021; Xiong et al., 2021; Zhan et al., 2020).

Despite the success in improving ranking performance, most existing DR models (Zhan et al., 2021b; Xiong et al., 2021; Qu et al., 2021; Karpukhin et al., 2020) are inefficient in memory usage and computational time. In terms of memory, the embedding index is usually an order of magnitude larger than a BoW index (Zhan et al., 2021a). At runtime, the vectors must be loaded into system memory or even GPU memory, which is both costly and highly limited in size. In terms of time, many existing DR models (Zhan et al., 2021b; Xiong et al., 2021; Qu et al., 2021; Karpukhin et al., 2020) do not use approximate vector search (Jegou et al., 2010; Malkov and Yashunin, 2018). They have to conduct exhaustive search, i.e., compute relevance scores between the submitted query and all documents, which is less time-efficient than BoW models with inverted indexes. As a result, these DR models cannot use CPUs for retrieval due to high latency and have to rely on much more expensive GPUs to accelerate the search (Zhan et al., 2021a, b; Xiong et al., 2021).

A key solution for the efficiency issue of DR models is to learn discrete representations for document embeddings, which can be encoded into compact indexes and enable efficient vector search. Popular methods for learning this kind of discrete representation include Product Quantization (PQ) (Jegou et al., 2010; Ge et al., 2013) and Locality Sensitive Hashing (LSH) (Indyk and Motwani, 1998). However, these methods usually learn discrete representations in an unsupervised way and cannot benefit from supervised signals. Directly adopting these techniques usually hurts ranking effectiveness (Zhan et al., 2021a; Zhang et al., 2021).

Therefore, jointly optimizing dual-encoders and the quantization method with supervised labels is regarded as a promising direction for improving retrieval effectiveness. However, it is inherently challenging because the quantization operation is non-differentiable, so the model cannot be trained in an end-to-end fashion. A number of recent works (e.g., JPQ (Zhan et al., 2021a)) try to solve this problem, but they usually suffer from significant performance loss while improving efficiency. Therefore, we believe it is still an unsolved yet essential problem.

To tackle this problem, we present RepCONC, which stands for learning discrete Representations via CONstrained Clustering (code and models are available at https://github.com/jingtaozhan/RepCONC). It jointly trains the dual-encoders and PQ by modeling quantization as a constrained clustering process. Specifically, constrained clustering involves a clustering loss and a uniform clustering constraint. The clustering loss is introduced to train the discrete codes with the requirement that document embeddings are clustered around the quantization centroids. We also employ a uniform clustering constraint, which requires the vectors to be equally assigned to all quantization centroids. We add the constraint because we find that unconstrained clustering tends to assign vectors to a few major clusters and makes the quantized vectors indistinguishable from each other. Since this constraint leads to a difficult combinatorial optimization problem, we derive an approximate solution by relaxing it to an instance of the optimal transport problem. Besides the two components of constrained clustering, RepCONC further employs a vector-based inverted file system (IVF) (Jegou et al., 2010), which enables efficient non-exhaustive vector search. With these designs, RepCONC can run on either GPU or CPU (except that user queries are still encoded on GPU) and perform vector search in an efficient way.

We conduct experiments on two widely-adopted ad-hoc retrieval benchmarks (Bajaj et al., 2016; Craswell et al., 2020) and compare RepCONC with a wide range of baselines, including both vector compression methods and retrieval models. Experimental results show that: 1) RepCONC significantly outperforms competitive vector compression baselines under compression ratios ranging from tens to hundreds of times. 2) RepCONC substantially outperforms various retrieval baselines in terms of retrieval effectiveness, memory efficiency, and time efficiency. 3) The ablation study demonstrates that constrained clustering is the key to the effectiveness of RepCONC.

2. Related Works

DR represents queries and documents with embeddings and utilizes vector search to retrieve relevant documents. Most existing DR models (Karpukhin et al., 2020; Zhan et al., 2020; Xiong et al., 2021; Zhan et al., 2021b; Qu et al., 2021) share the same BERT-base (Devlin et al., 2019; Liu et al., 2019) architecture and utilize brute-force vector search. They differ in training methods, which can be classified into two categories. One line of research is negative sampling (Huang et al., 2020; Karpukhin et al., 2020; Zhan et al., 2020; Xiong et al., 2021; Zhan et al., 2021b). According to Zhan et al. (2021b), utilizing hard negatives helps improve top ranking performance. The other line is knowledge distillation (Qu et al., 2021; Lin et al., 2020; Hofstätter et al., 2021), which adopts a cross-encoder to generate pseudo labels. This paper uses negative sampling to train RepCONC and leaves training RepCONC with knowledge distillation to future work.

Since the models mentioned above utilize brute-force vector search, they incur very large embedding indexes and have to use costly GPUs to accelerate the search. How to address this efficiency issue has recently attracted researchers' attention, and several studies propose workarounds (Yamada et al., 2021; Zhang et al., 2021; Zhan et al., 2021a). BPR (Yamada et al., 2021) binarizes dense vectors and is evaluated on the OpenQA task; an obvious limitation is that its compression ratio is fixed to 32x. DPQ (Zhang et al., 2021; Chen et al., 2020) utilizes PQ (Jegou et al., 2010) for compression and is designed for word embedding compression and recommendation systems. Most recently, Zhan et al. (2021a) propose JPQ for document ranking and achieve state-of-the-art results. JPQ utilizes fixed discrete codes (Index Assignments) generated by K-Means and only trains the query encoder and PQ Centroid Embeddings. RepCONC differs from JPQ in both its joint learning framework and its efficiency design. Firstly, with the help of constrained clustering, RepCONC is able to optimize the discrete codes (Index Assignments) while JPQ cannot. Secondly, RepCONC additionally employs the inverted file system (IVF) (Jegou et al., 2010) to accelerate search and thus can efficiently retrieve documents on CPUs, while JPQ has to rely on GPUs.

3. Constrained Clustering Model

In this section, we propose RepCONC, which stands for learning discrete Representations via CONstrained Clustering. We firstly introduce the preliminaries of Product Quantization (Jegou et al., 2010), a widely-used vector compression method for approximate nearest neighbor search (ANNS). Then we elaborate our model.

3.1. Revisiting Product Quantization

RepCONC is based on Product Quantization (PQ) (Jegou et al., 2010). For vectors of dimension $D$, PQ defines $M$ sets of embeddings, each of which includes $K$ embeddings of dimension $D/M$. They are called PQ Centroid Embeddings. Formally, let $c_{i,j}$ be the $j$-th centroid embedding from the $i$-th set:

(1) $c_{i,j} \in \mathbb{R}^{D/M}, \quad 1 \le i \le M, \ 1 \le j \le K$

Given a document embedding $d \in \mathbb{R}^{D}$, PQ firstly splits it into $M$ sub-vectors:

(2) $d = d_1, d_2, ..., d_M, \quad d_i \in \mathbb{R}^{D/M}$

Then PQ independently quantizes each sub-vector to the nearest PQ Centroid Embedding. Formally, to quantize a sub-vector $d_i$, PQ selects the centroid index $\varphi_i(d)$ which achieves the minimum quantization error:

(3) $\varphi_i(d) = \arg\min_{j} \| d_i - c_{i,j} \|^2$

Let $\varphi(d)$ be the concatenation of $\varphi_i(d)$:

(4) $\varphi(d) = \varphi_1(d), \varphi_2(d), ..., \varphi_M(d)$

where comma denotes vector concatenation. $\varphi(d)$ is called the Index Assignment of $d$. Along with the PQ Centroid Embeddings, $\varphi(d)$ can reconstruct the quantized document embedding $\hat{d}$ as follows:

(5) $\hat{d} = c_{1,\varphi_1(d)}, c_{2,\varphi_2(d)}, ..., c_{M,\varphi_M(d)}$

PQ improves both memory efficiency and time efficiency. For memory efficiency, PQ does not explicitly store $d$ or $\hat{d}$. Instead, it only stores the PQ Centroid Embeddings $c_{i,j}$ and the Index Assignments $\varphi(d)$. Since $K$ is usually less than or equal to 256, $\varphi(d)$ can be encoded with $M \log_2 K / 8$ bytes, i.e., $M$ bytes when $K = 256$. Therefore, the compression ratio (against 4-byte floating-point vectors) is about $4D/M$. As for time efficiency, PQ enables efficient vector search. Given a query embedding, PQ splits it equally into $M$ sub-vectors and pre-computes the similarities between the sub-vectors and the PQ Centroid Embeddings. Then, PQ efficiently computes the similarity between the query embedding and each document embedding by aggregating the corresponding $M$ pre-computed similarities. Compared with directly computing vector similarity, the speedup ratio is about $D/M$.
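To make the PQ mechanics and the memory/time arithmetic concrete, here is a minimal NumPy sketch of Eqs. (1)-(5). The shapes ($D=768$, $M=48$, $K=256$) match the 64x setting used later in the paper, but the function names and random data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

D, M, K = 768, 48, 256          # embedding dim, number of sub-vector blocks, centroids per block
centroids = np.random.randn(M, K, D // M).astype(np.float32)  # PQ Centroid Embeddings, Eq. (1)

def pq_encode(doc_emb):
    """Quantize one document embedding into M one-byte Index Assignments (Eqs. 2-4)."""
    subs = doc_emb.reshape(M, D // M)                       # Eq. (2): split into sub-vectors
    errors = ((subs[:, None, :] - centroids) ** 2).sum(-1)  # (M, K) quantization errors
    return errors.argmin(axis=1).astype(np.uint8)           # Eq. (3): nearest centroid per block

def pq_decode(codes):
    """Reconstruct the quantized embedding from Index Assignments (Eq. 5)."""
    return np.concatenate([centroids[i, codes[i]] for i in range(M)])

def pq_search(query_emb, all_codes):
    """Score documents with M table lookups instead of a full D-dim inner product."""
    q_subs = query_emb.reshape(M, D // M)
    table = np.einsum('md,mkd->mk', q_subs, centroids)      # pre-computed sub-similarities
    return table[np.arange(M)[:, None], all_codes.T].sum(0) # aggregate M lookups per document

doc_codes = np.stack([pq_encode(np.random.randn(D).astype(np.float32)) for _ in range(1000)])
query = np.random.randn(D).astype(np.float32)
scores = pq_search(query, doc_codes)    # each code uses 48 bytes instead of 3072 (64x smaller)
```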

3.2. Clustering and Representation Learning

Jointly optimizing dual-encoders and PQ parameters is challenging. RepCONC views it as a simultaneous clustering and representation learning problem. To solve it, RepCONC utilizes both the ranking-oriented loss (Zhan et al., 2021a) and a clustering loss for training. The ranking-oriented loss helps learn representations for ranking. The clustering loss, i.e., the mean square error (MSE) between the document embeddings and the quantization centroids, helps cluster document embeddings to centroid embeddings. We illustrate the training workflow in Figure 1.

Figure 1. Training process of RepCONC.

The ranking-oriented loss (Zhan et al., 2021a) replaces the uncompressed document embeddings in the common DR ranking loss functions (Qu et al., 2021; Xiong et al., 2021; Zhan et al., 2021b) with the quantized document embeddings. Therefore, it better evaluates the ranking performance with respect to the current compression parameters. Formally, the ranking-oriented loss is formulated as:

(6) $\ell_r = -\log \frac{\exp(\langle q, \hat{d}^{+} \rangle)}{\exp(\langle q, \hat{d}^{+} \rangle) + \sum_{d^{-}} \exp(\langle q, \hat{d}^{-} \rangle)}$

where $d^{+}$ and $d^{-}$ are relevant and irrelevant documents, respectively, and $\langle \cdot, \cdot \rangle$ denotes the inner product. $\ell_r$ facilitates effective representation learning by encouraging the relevant pairs to be scored higher than the irrelevant pairs.

Although incorporating the ranking-oriented loss helps a recent joint learning work achieve state-of-the-art compression results (Zhan et al., 2021a), we argue that simply relying on this loss is problematic. Since the quantization error in Eq. (3) is not included in $\ell_r$, it may change arbitrarily, and selecting Index Assignments based on Eq. (3) may lead to unexpected behaviors. Zhan et al. (2021a) avoid this pitfall by fixing the Index Assignments. However, fixed Index Assignments lead to sub-optimal ranking performance because they cannot benefit from supervised signals.

Different from Zhan et al. (2021a), RepCONC regards quantization as a clustering problem and introduces the MSE loss $\ell_{MSE}$:

(7) $\ell_{MSE} = \| d - \hat{d} \|^2 = \sum_{i=1}^{M} \| d_i - c_{i,\varphi_i(d)} \|^2$

Minimizing $\ell_{MSE}$ requires the vectors before and after quantization to be close to each other. In this way, the document embeddings are clustered around the centroid embeddings. Combining both $\ell_r$ and $\ell_{MSE}$ helps the model cluster document embeddings based on ranking effectiveness, which is expected to produce better clustering than unsupervised training.

The final loss $\ell$ is a weighted sum of the ranking-oriented loss $\ell_r$ and the MSE loss $\ell_{MSE}$:

(8) $\ell = \ell_r + \lambda \, \ell_{MSE}$

$\lambda$ is a hyper-parameter. If $\lambda$ is too small, the documents are not clustered and the selected Index Assignments become arbitrary. If $\lambda$ is too big, the MSE loss dominates the training process and harms ranking effectiveness. In practice, we find the model is relatively sensitive to $\lambda$, but it becomes more robust and effective with the help of the uniform clustering constraint introduced in the following sections.

Since quantization involves some non-differentiable operations, we explicitly design the gradient back-propagation policy. The gradients of the uncompressed document embeddings are defined as follows:

(9) $\frac{\partial \ell}{\partial d} := \frac{\partial \ell_r}{\partial \hat{d}} + \lambda \frac{\partial \ell_{MSE}}{\partial d}$

As the equation shows, we add the gradient with respect to the quantized document embeddings (the first term). The gradients are further back-propagated to the dual-encoders. As for the PQ Centroid Embeddings, their gradients can be derived with the chain rule. We formally show it as follows:

(10) $\frac{\partial \ell}{\partial c_{i,j}} = \sum_{d:\, \varphi_i(d) = j} \left( \frac{\partial \ell_r}{\partial \hat{d}_i} + \lambda \frac{\partial \ell_{MSE}}{\partial \hat{d}_i} \right) = \sum_{d:\, \varphi_i(d) = j} \left( \frac{\partial \ell_r}{\partial \hat{d}_i} - 2 \lambda \, (d_i - c_{i,j}) \right)$
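To show how Eqs. (6)-(10) fit together in practice, the following is a hedged PyTorch-style sketch of the joint loss: a straight-through trick reproduces the gradient policy of Eq. (9), autograd then yields the centroid gradients of Eq. (10), and an in-batch softmax cross-entropy stands in for the ranking-oriented loss. The shapes, the default $\lambda$, and the in-batch negative scheme are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def repconc_loss(q, doc_embs, codes, centroids, lam=0.1):
    """
    q:         (B, D) query embeddings; the i-th query's positive is the i-th document
    doc_embs:  (B, D) uncompressed document embeddings from the encoder
    codes:     (B, M) long tensor of Index Assignments chosen by constrained clustering
    centroids: (M, K, D // M) trainable PQ Centroid Embeddings
    Other in-batch documents serve as negatives in this simplified sketch.
    """
    B, D = doc_embs.shape
    M = codes.shape[1]
    # Reconstruct quantized embeddings from the centroids (Eq. 5).
    quantized = torch.stack([centroids[i, codes[:, i]] for i in range(M)], dim=1).reshape(B, D)
    # Forward with quantized vectors; the ranking gradient is passed straight through to the
    # uncompressed embeddings (first term of Eq. 9) while centroids also receive it (Eq. 10).
    doc_hat = quantized + doc_embs - doc_embs.detach()
    scores = q @ doc_hat.t()                                     # (B, B) similarity matrix
    ranking_loss = F.cross_entropy(scores, torch.arange(B))      # stands in for Eq. (6)
    mse_loss = ((doc_embs - quantized) ** 2).sum(dim=-1).mean()  # clustering loss, Eq. (7)
    return ranking_loss + lam * mse_loss                         # Eq. (8)
```

In this sketch the MSE term sends gradients to both the encoder output and the centroids, matching the second terms of Eqs. (9) and (10); a full implementation would additionally use the hard negatives described in Section 3.6.2.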

In the following sections, we show how RepCONC selects Index Assignments (instead of Eq. (3)).

3.3. Importance of Uniform Clustering

Figure 2. Illustration of Constrained Clustering. Darker colors in the heatmap indicate higher similarities (smaller distances). With the constraint, the discrete document embeddings are more diverse.

It is non-trivial to simultaneously conduct clustering and representation learning because the two objectives conflict to some extent. While representation learning encourages vectors to be distinguishable, clustering encourages vectors to be identical. In practice, clustering tends to map vectors to several major clusters while some clusters are rarely used or even empty. The problem is exacerbated by Eq. (10), where rarely used centroid embeddings are less likely to be updated and may end up with arbitrary values. The unbalanced clustering distribution affects the distinguishability of the quantized vectors and compromises ranking effectiveness.

We tackle this challenge by imposing a uniform clustering constraint. It requires the document sub-vectors to be equally assigned to all PQ Centroid Embeddings. The learning objective along with the constraint is formally expressed as:

(11) $\min \ \ell = \ell_r + \lambda \, \ell_{MSE} \quad \text{s.t.} \quad \big| \{ d \in \mathcal{D} : \varphi_i(d) = j \} \big| = \frac{|\mathcal{D}|}{K}, \quad \forall\, i, j$

where $\mathcal{D}$ denotes the set of all documents.

We illustrate constrained clustering in Figure 2. As the figure shows, the discrete document embeddings are selected by minimizing the quantization error (maximizing the similarity) given the uniform clustering constraint. Without the constraint, the discrete document embeddings become identical. In the following, we theoretically analyze the importance of uniform clustering.

We introduce several notations for our theoretical analysis. We define $\Omega$ as the set of all possible Index Assignments:

(12) $\Omega = \{ (j_1, j_2, ..., j_M) \mid 1 \le j_i \le K \}, \quad |\Omega| = K^M$

Let $\omega \in \Omega$ be one Index Assignment and $\omega_i$ be the value of its $i$-th entry.

Firstly, we show that maximizing the distinguishability of vectors is equivalent to forcing the vectors to be equally quantized to all possible Index Assignments. Let $\hat{d}$ and $\hat{d}'$ be randomly sampled from all quantized document embeddings. We assume they are independent and identically distributed (i.i.d.). The probability that they are equal satisfies:

(13) $P(\hat{d} = \hat{d}') = \sum_{\omega \in \Omega} P(\varphi(d) = \omega)^2 \ \ge \ \frac{\big( \sum_{\omega \in \Omega} P(\varphi(d) = \omega) \big)^2}{K^M} = \frac{1}{K^M}$

where the AM–GM inequality is used. The equality is achieved if and only if

(14) $P(\varphi(d) = \omega) = \frac{1}{K^M}, \quad \forall\, \omega \in \Omega$

That is to say, quantizing vectors equally to all possible Index Assignments minimizes the collision probability and thus helps representations to be distinguishable.

Next, we show that uniformly clustering sub-vectors is a necessary condition of Eq. (14). Given Eq. (14), the probability that a sub-vector is quantized to a centroid is a constant:

(15) $P(\varphi_i(d) = j) = \sum_{\omega \in \Omega:\, \omega_i = j} P(\varphi(d) = \omega) = \frac{K^{M-1}}{K^M} = \frac{1}{K}, \quad \forall\, i, j$

We further show that if the sub-vectors are independent, uniformly clustering sub-vectors (Eq. (15)) is also a sufficient condition of Eq. (14):

(16) $P(\varphi(d) = \omega) = \prod_{i=1}^{M} P(\varphi_i(d) = \omega_i) = \frac{1}{K^M}, \quad \forall\, \omega \in \Omega$

Although independence among sub-vectors may not hold for practical dual-encoders, we believe constraining quantization with Eq. (15) is still helpful for distinguishing quantized vectors.
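The effect of Eqs. (13)-(15) can be checked numerically. The toy simulation below uses made-up sizes ($M=4$, $K=8$, far smaller than the real configuration) and samples pairs of Index Assignments to compare the collision probability under a uniform versus a skewed per-sub-vector distribution; it is only meant to illustrate the bound, not to reproduce any result in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, n = 4, 8, 200_000          # illustrative sizes and sample count

def collision_prob(cluster_probs):
    """Empirical P(two i.i.d. quantized vectors share all M Index Assignments)."""
    codes = np.stack([rng.choice(K, size=(2, n), p=p) for p in cluster_probs])  # (M, 2, n)
    return np.mean((codes[:, 0, :] == codes[:, 1, :]).all(axis=0))

uniform = [np.full(K, 1 / K)] * M                      # Eq. (15): balanced clustering
skewed = [np.array([0.65] + [0.05] * (K - 1))] * M     # a few dominant clusters per block
print(collision_prob(uniform), 1 / K ** M)             # close to the lower bound 1/K^M
print(collision_prob(skewed))                          # orders of magnitude larger collisions
```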

3.4. Constrained Clustering Optimization

This section shows how to incorporate the uniform clustering constraint during training. In previous works related to joint learning with PQ (Zhan et al., 2021a; Zhang et al., 2021; Chen et al., 2020), the Index Assignments are selected based on Eq. (3). However, this selection rule cannot be applied to RepCONC because of the uniform clustering constraint. Next, we show how RepCONC incorporates the constraint to select Index Assignments during training.

We introduce a posterior distribution $p_{i,j}(d)$, which is the probability that the sub-vector $d_i$ is quantized to the centroid $c_{i,j}$. The Index Assignment $\varphi_i(d)$ is the centroid with the maximum probability:

(17) $\varphi_i(d) = \arg\max_{j} \ p_{i,j}(d)$

For previous works (Zhan et al., 2021a; Zhang et al., 2021; Chen et al., 2020) that use Eq. (3), $p_{i,j}(d)$ can be regarded as being computed solely based on the quantization error. Here, for RepCONC, we compute $p_{i,j}(d)$ by minimizing the quantization error given the uniform clustering constraint:

(18) $\min_{p} \ \sum_{d \in \mathcal{D}} \sum_{j=1}^{K} p_{i,j}(d) \, \| d_i - c_{i,j} \|^2 \quad \text{s.t.} \quad p_{i,j}(d) \in \{0, 1\}, \quad \sum_{j=1}^{K} p_{i,j}(d) = 1, \quad \sum_{d \in \mathcal{D}} p_{i,j}(d) = \frac{|\mathcal{D}|}{K}$

where $\mathcal{D}$ indicates the set of all documents. The first condition constrains $p_{i,j}(d)$ to be binary, the second condition is a natural requirement for a probability distribution, and the third condition is exactly the uniform clustering constraint. Without the third condition, Eqs. (17) and (18) degenerate to Eq. (3), i.e., selecting Index Assignments with minimum quantization error.

Solving Eq. (18) is particularly difficult because it is a combinatorial optimization problem at the scale of millions or even billions of documents. Therefore, we use an approximate solution by relaxing $p_{i,j}(d)$ to be continuous and focusing on uniformly clustering a mini-batch of documents $\mathcal{B}$:

(19) $\min_{p} \ \sum_{d \in \mathcal{B}} \sum_{j=1}^{K} p_{i,j}(d) \, \| d_i - c_{i,j} \|^2 \quad \text{s.t.} \quad p_{i,j}(d) \in [0, 1], \quad \sum_{j=1}^{K} p_{i,j}(d) = 1, \quad \sum_{d \in \mathcal{B}} p_{i,j}(d) = \frac{|\mathcal{B}|}{K}$

Since $\| d_i - c_{i,j} \|^2$ can be regarded as the cost of mapping $d_i$ to $c_{i,j}$, this is an instance of the optimal transport problem and can be solved in polynomial time by linear programming. In our implementation, we use the Sinkhorn-Knopp algorithm (Cuturi, 2013) to efficiently solve Eq. (19).
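As a concrete sketch of solving the relaxed problem in Eq. (19), the following applies a Sinkhorn-Knopp iteration (Cuturi, 2013) to one sub-vector block of a mini-batch. The entropic regularization strength, iteration count, and function name are illustrative assumptions, not the paper's settings.

```python
import torch

def constrained_assign(sub_vecs, centroids, eps=0.05, iters=50):
    """
    Approximately solve Eq. (19) for one sub-vector block i.
    sub_vecs:  (B, D/M) sub-vectors d_i of a mini-batch of documents
    centroids: (K, D/M) PQ Centroid Embeddings c_{i,*} of this block
    Returns hard Index Assignments whose distribution over centroids is roughly uniform.
    """
    cost = torch.cdist(sub_vecs, centroids) ** 2        # (B, K) quantization errors
    cost = cost - cost.min(dim=1, keepdim=True).values  # shift for numerical stability
    P = torch.exp(-cost / eps)                          # entropic relaxation of the transport plan
    B, K = P.shape
    for _ in range(iters):
        P = P / P.sum(dim=1, keepdim=True) / B          # each document distributes mass 1/B
        P = P / P.sum(dim=0, keepdim=True) / K          # each centroid receives mass 1/K (uniform)
    return P.argmax(dim=1)                              # Eq. (17): most likely centroid per document
```

In training, such a routine would replace the unconstrained argmin of Eq. (3) for each of the $M$ sub-vector blocks, while Eq. (3) is still used at inference time (Section 3.6.3).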

3.5. Accelerating Search with IVF

Besides PQ, RepCONC employs the inverted file system (IVF) to accelerate vector search. After quantizing document embeddings, RepCONC uses k-means to generate $N$ clusters, where $N$ is much smaller than the corpus size. Each document embedding belongs to the nearest cluster and is stored in the corresponding inverted list. Given a query embedding, RepCONC selects the $n$ nearest clusters and only ranks the documents in them; the documents in other clusters are ignored. In this way, RepCONC approximately accelerates vector search by a factor of $N/n$.

Note that RepCONC does not include IVF in the joint learning framework and simply applies IVF after training. The clusters are generated in an unsupervised manner. In practice, we find this already yields satisfactory results, and thus training IVF with supervised labels is not explored.

IVF induces only negligible memory overhead and does not harm memory efficiency. For example, on the MS MARCO Passage Ranking dataset (Bajaj et al., 2016), which has about 8.8 million passages, the number of IVF clusters is far smaller than the corpus size, so the additional memory overhead is marginal compared with the compressed index.
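For illustration, the snippet below builds an IVF+PQ index with Faiss (Johnson et al., 2019). Note that Faiss trains both IVF and PQ unsupervisedly here, whereas RepCONC learns the PQ parameters jointly and only adds IVF after training; the cluster count, nprobe, and corpus size are placeholder values, not the paper's configuration.

```python
import numpy as np
import faiss

D, M = 768, 48                    # 48 bytes per document, the 64x compression setting
nlist, nprobe = 1024, 32          # hypothetical IVF settings (clusters built / clusters probed)

doc_embs = np.random.randn(100_000, D).astype('float32')   # stand-in document embeddings

# "IVF1024,PQ48x8": coarse k-means with 1024 inverted lists, then PQ with M=48, 8 bits per code.
index = faiss.index_factory(D, f"IVF{nlist},PQ{M}x8", faiss.METRIC_INNER_PRODUCT)
index.train(doc_embs)             # unsupervised training of IVF clusters and PQ centroids
index.add(doc_embs)

index.nprobe = nprobe             # only documents in the nprobe nearest clusters are ranked
query = np.random.randn(1, D).astype('float32')
scores, doc_ids = index.search(query, 100)                  # approximate top-100 retrieval on CPU
```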

3.6. Training/Inference Details

3.6.1. Warmup with OPQ

In order to accelerate convergence, we warm up the dual-encoders and PQ Centroid Embeddings as follows. We use the open-sourced STAR (Zhan et al., 2021b) model to initialize the dual-encoders; STAR is trained without quantization. Given the document embeddings output by STAR, we use OPQ (Ge et al., 2013), a popular unsupervised PQ variant, to warm up the PQ parameters.

3.6.2. Two-Stage Negative Sampling

Hard negative sampling is shown to be important for retrieval models (Zhan et al., 2021b; Xiong et al., 2021). Following Zhan et al. (2021b), we train RepCONC in two stages. In the first stage, we retrieve static hard negatives using the initialized RepCONC. In the second stage, we use dynamic hard negatives, i.e., the top-ranked irrelevant documents retrieved at each training step. To enable end-to-end retrieval during training in this stage, we fix the Index Assignments and only train the query encoder and the PQ Centroid Embeddings.

3.6.3. Efficient Encoding during Inference

During inference, we use Eq. (3) to quantize document embeddings instead of Eq. (18) or Eq. (19). In this way, we can quantize each document embedding online efficiently. Otherwise, solving Eq. (18) is expensive, and Eq. (19) introduces stochastic noise when batching documents.

4. Experimental Setup

Here we present our experimental settings, including datasets, baselines, and implementation details.

4.1. Datasets and Metrics

We conduct experiments on two large-scale ad-hoc retrieval benchmarks from the TREC 2019 Deep Learning Track (Craswell et al., 2020; Bajaj et al., 2016): passage ranking and document ranking. They have been widely adopted in previous works on neural ranking. The passage ranking task has a corpus of 8.8M passages, 0.5M training queries, 7k development queries (henceforth, MARCO Passage), and 43 test queries (DL Passage). The document ranking task has a corpus of 3.2M documents, 0.4M training queries, 5k development queries (MARCO Doc), and 43 test queries (DL Doc). For both tasks, we report the official metrics (MRR@10 for MARCO Passage, MRR@100 for MARCO Doc, and NDCG@10 for DL Passage/Doc) and R@100 based on the full-corpus retrieval results.

4.2. Baselines

We consider two types of baselines: vector compression methods and retrieval models.

4.2.1. Vector Compression Baselines

Unsupervised methods include PQ (Jegou et al., 2010), ScaNN (Guo et al., 2020), ITQ+LSH (Gong et al., 2012), OPQ (Ge et al., 2013), and OPQ+ScaNN. We use the Faiss library (Johnson et al., 2019) to implement these baselines, except for ScaNN (Guo et al., 2020), which is implemented based on its open-sourced code.

Supervised methods include the recently proposed DPQ (Chen et al., 2020; Zhang et al., 2021) and JPQ (Zhan et al., 2021a), both of which are also based on PQ. We re-implement DPQ since it is originally designed for word embedding compression (Chen et al., 2020) and recommendation systems (Zhang et al., 2021); it uses the same warmup process as RepCONC. JPQ was recently proposed for document ranking and shares the same warmup process. Another compression method, BPR (Yamada et al., 2021), binarizes dense vectors and is thus limited to a fixed compression ratio (32x). As RepCONC already achieves very small performance loss with a 64x compression ratio, we do not implement BPR for comparison.

Table 1. Comparison with different compression methods on the TREC 2019 Deep Learning Track. The compression ratio is set to 64x, i.e., 48 bytes per passage/document. Columns report MRR@10 and R@100 on MARCO Passage, NDCG@10 and R@100 on DL Passage, MRR@100 and R@100 on MARCO Doc, and NDCG@10 and R@100 on DL Doc. Rows cover uncompressed models (ANCE (Xiong et al., 2021), ADORE (Zhan et al., 2021b)), unsupervised compression methods ('Unsup. Compr.': PQ (Jegou et al., 2010), ScaNN (Guo et al., 2020), ITQ+LSH (Gong et al., 2012), OPQ (Ge et al., 2013), OPQ+ScaNN), and supervised compression methods ('Sup. Compr.': DPQ (Chen et al., 2020; Zhang et al., 2021), JPQ (Zhan et al., 2021a), and RepCONC (ours)). */** denotes that RepCONC performs significantly better than the baseline at the 0.05/0.01 level using a two-tailed pairwise t-test. The best compression method in each column is marked in bold.

4.2.2. Retrieval Models

First-stage retrieval models involve BoW models and DR models. BoW models include BM25 (Robertson and Walker, 1994) and its variants, such as DeepCT (Dai and Callan, 2019), HDCT (Dai and Callan, 2020), doc2query (Nogueira et al., 2019b), and docT5query (Nogueira et al., 2019a). DR models include RepBERT (Zhan et al., 2020), ANCE (Xiong et al., 2021), STAR (Zhan et al., 2021b), and ADORE (Zhan et al., 2021b). Their output embeddings are of dimension 768 and are not compressed. All of them utilize negative sampling for training, as RepCONC does. In our experiments related to time efficiency, we use IVF (Jegou et al., 2010) with the same hyperparameters as RepCONC to accelerate ADORE (Zhan et al., 2021b), the most competitive uncompressed DR baseline.

Although several ranking models also conduct end-to-end retrieval, their latency is significantly higher than that of typical first-stage retrievers. Therefore, we classify them as complex end-to-end retrieval models. These models include ColBERT (Khattab and Zaharia, 2020), COIL (Gao et al., 2021), uniCOIL (Lin and Ma, 2021), and DeepImpact (Mallia et al., 2021). (Although uniCOIL (Lin and Ma, 2021) and DeepImpact (Mallia et al., 2021) can leverage inverted indexes like BM25 (Robertson and Walker, 1994), they are much slower, possibly due to a much smaller vocabulary size (30k vs. 500k) and not removing stop words.) Note that for COIL (Gao et al., 2021), the authors uploaded a new model trained with hard negatives to the GitHub repository, which is not included in its paper. We denote it as COIL-Hard.

4.3. Implementation Details

Here are our model settings. We build RepCONC based on huggingface transformers (Wolf et al., 2019) and the Faiss ANNS library (Johnson et al., 2019). The dual-encoders use RoBERTa-base (Liu et al., 2019) as the backbone, and the output embedding dimension is $D = 768$. Embedding similarity is computed with the inner product. For PQ hyper-parameters, $K$ is set to 256, and $M$ is set to 4, 8, 12, 16, 24, 32, and 48 for different compression ratios. The compression ratio equals $4D/M = 3072/M$ since one vector is compressed to $M$ bytes (e.g., $M = 48$ yields a $3072/48 = 64$x compression ratio).

Training settings are as follows. Most training hyper-parameters are kept the same on both datasets except for the batch size, due to the limitation of GPU memory. Following ADORE (Zhan et al., 2021b), training is in two stages. In the first stage, where static hard negatives are used, the optimizer is AdamW (Loshchilov and Hutter, 2017); separate learning rates are used for the encoders and the centroid embeddings; $\lambda$ in Eq. (8) is set to 0.05, 0.07, 0.1, 0.2, or 0.3 depending on the number of sub-vectors $M$; batch sizes are set to 1024 and 256 for passage and document ranking, respectively. In the second stage, where dynamic hard negatives are used, the optimizer is again AdamW (Loshchilov and Hutter, 2017) with separate learning rates for the encoders and the centroid embeddings, and batch sizes are set to 128. For $\ell_r$, we replace Eq. (6) with LambdaLoss (Burges, 2010) for better ranking performance. Training time is about 4 hours for passage ranking and 2 hours for document ranking.

Now we present our hardware settings and details about latency measurement. We use Xeon Gold 5218 CPUs and RTX 3090 GPUs. When training and measuring latency, we use one CPU thread and one GPU. Training time is about 9 hours for passage ranking and 2 days for document ranking on one RTX 3090 GPU. Additional notes about latency measurement are as follows. BoW search and vector search are both conducted on the CPU. For most neural retrieval models, including RepCONC, query encoding is required and is performed on the GPU. In our reranking experiments, the reranking models also run on the GPU.

5. Experiments

We empirically evaluate RepCONC to address the following three research questions:

  • RQ1: Can RepCONC substantially compress the index without significantly hurting retrieval effectiveness?

  • RQ2: How does RepCONC perform compared with other retrieval models?

  • RQ3: How does constrained clustering contribute to the effectiveness of RepCONC?

5.1. Comparison with Compression Methods

Figure 5. Comparison with compression methods on (a) MS MARCO Passage and (b) MS MARCO Document. Up and right is better.

This section compares RepCONC with vector compression baselines to answer RQ1. We compare them in two settings: a fixed 64x compression ratio and varying compression ratios ranging from 64x to 768x.

Ranking performance under a fixed 64x compression ratio is presented in Table 1. Even though the index is compressed by 64 times, RepCONC outperforms ANCE (Xiong et al., 2021) and almost matches ADORE (Zhan et al., 2021b), the state-of-the-art DR model trained by negative sampling. Compared with unsupervised compression methods, RepCONC exhibits significant performance gains, demonstrating the importance of joint learning. As for supervised compression baselines, RepCONC significantly outperforms DPQ (Zhang et al., 2021; Chen et al., 2020) and also outperforms the recently proposed state-of-the-art JPQ model (Zhan et al., 2021a) on most metrics. JPQ cannot train Index Assignments: it uses K-Means to generate them and fixes them during training. Our proposed RepCONC, in contrast, is able to update the Index Assignments during training, and the results demonstrate its effectiveness.

Ranking performance under different compression ratios is plotted in Figure 5. The advantage of RepCONC is more significant when larger compression ratios are used. For example, its MRR score is more than twice that of JPQ when the compression ratio is 768x. We believe this is because RepCONC is able to generate high-quality Index Assignments specifically for ranking effectiveness, which becomes more important when fewer bytes are used. In contrast, JPQ uses K-Means to produce task-blind Index Assignments, which compromises ranking performance.

5.2. Comparison with Retrieval Models

This section compares RepCONC with various retrieval models to address RQ2. We firstly compare it with first-stage retrievers, including BoW models and DR models. Then we compare it with complex (slow) end-to-end retrievers.

5.2.1. Comparison with First-Stage Retrievers

Figure 8. Comparison with first-stage retrieval models in terms of the effectiveness-memory trade-off on (a) MS MARCO Passage and (b) MS MARCO Document. Up and right is better. The x-axis indicates the average number of passages/documents stored in 1 kilobyte.

Figure 8 summarizes the effectiveness-memory tradeoff. As the figure shows, although DR models are much more effective than BoW models, they incur severe memory inefficiency. By jointly training the dual-encoders and the quantization method, RepCONC substantially improves the memory efficiency of DR while still being very effective in ranking. It outperforms RepBERT (Zhan et al., 2020) and ANCE (Xiong et al., 2021) in effectiveness, and is almost as effective as ADORE (Zhan et al., 2021b), the state-of-the-art DR model trained by negative sampling. Compared with BoW models, it can build a much smaller index, especially on the document dataset, where the text is much longer than in the passage dataset. For example, on the document ranking task, it can build a 100x smaller index than BM25 while still being equally effective.

Figure 11. Comparison with first-stage retrieval models in terms of the effectiveness-latency trade-off on (a) MS MARCO Passage and (b) MS MARCO Document. Up and right is better. The search is performed on CPU with one thread. QPS stands for 'query per second'.

Figure 11 summarizes the effectiveness-latency tradeoff. To verify that RepCONC is more time-efficient than existing uncompressed DR models, we arm the state-of-the-art uncompressed DR model, ADORE, with the same IVF method (Jegou et al., 2010) as RepCONC employs. As the figure shows, both RepCONC-IVF and ADORE-IVF substantially outperform BoW models with the help of IVF acceleration. Most importantly, RepCONC-IVF outperforms ADORE-IVF, especially at large QPS settings. This is because PQ already provides RepCONC with about 15x speedup compared with brute-force dense retrieval. Therefore, ADORE is more dependent on IVF than RepCONC and has to sacrifice more effectiveness for acceleration. The results demonstrate the time efficiency of RepCONC.

5.2.2. Comparison with Complex End-to-End Retrievers

Figure 12. Comparison with complex (slow) end-to-end retrieval models in terms of effectiveness-latency tradeoff on MS MARCO Passage Ranking. The search is performed on CPU with one thread. Up and right is better. QPS stands for ‘query per second’.

This section compares RepCONC with some complex (slow) end-to-end neural retrieval models. These models achieve better ranking performance at much higher query latency because of their complex model architectures. For a fair comparison, we add a reranking stage to RepCONC and compare them in terms of the effectiveness-latency tradeoff. The reranking models are the MonoBERT and DuoT5 models open-sourced by the pygaggle library (https://github.com/castorini/pygaggle). MonoBERT firstly reranks the top passages retrieved by RepCONC-IVF. Then DuoT5 further reranks the highest-ranked passages output by MonoBERT, as sketched below. We tune the IVF speedup ratio, the MonoBERT reranking depth, and the DuoT5 reranking depth to evaluate ranking performance at different query latencies. Note that query encoding and reranking are performed on the GPU while the search is performed on the CPU with one thread.
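As an illustration of this retrieve-then-rerank cascade, here is a schematic Python sketch; `retrieve`, `mono_score`, and `duo_score` are hypothetical stand-ins for RepCONC-IVF retrieval and the MonoBERT/DuoT5 scorers, and the default depths are placeholders rather than the tuned values.

```python
from typing import Callable, List

def rerank_cascade(
    query: str,
    retrieve: Callable[[str, int], List[str]],    # first-stage retriever (e.g., RepCONC-IVF)
    mono_score: Callable[[str, str], float],      # pointwise reranker (MonoBERT-like)
    duo_score: Callable[[str, str, str], float],  # pairwise reranker (DuoT5-like)
    mono_depth: int = 1000,
    duo_depth: int = 50,
) -> List[str]:
    """Retrieve candidates, rerank them pointwise, then refine the top with a pairwise model."""
    candidates = retrieve(query, mono_depth)
    # Stage 1: pointwise reranking of all retrieved candidates.
    mono_ranked = sorted(candidates, key=lambda p: mono_score(query, p), reverse=True)
    top, rest = mono_ranked[:duo_depth], mono_ranked[duo_depth:]
    # Stage 2: pairwise reranking; aggregate pairwise preferences into one score per passage.
    duo_scores = {p: sum(duo_score(query, p, q) for q in top if q != p) for p in top}
    return sorted(top, key=lambda p: duo_scores[p], reverse=True) + rest
```

Shrinking mono_depth and duo_depth trades effectiveness for latency, which is how the different operating points in Figure 12 would be obtained under this sketch.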

Ranking performance is summarized in Figure 12. We can see that RepCONC-IVF+Rerank substantially outperforms all baselines in terms of both effectiveness and time efficiency. In fact, RepCONC is also more memory-efficient than these baselines. The index size of RepCONC is less than 0.5GB, while COIL (Gao et al., 2021), ColBERT (Khattab and Zaharia, 2020), uniCOIL (Lin and Ma, 2021), and DeepImpact (Mallia et al., 2021) consume 60GB, 162GB, 1.3GB, and 1.5GB, respectively, for storing indexes. Therefore, RepCONC's efficient and effective retrieval is highly beneficial to second-stage reranking and helps the two-stage pipeline achieve much better ranking performance than complex end-to-end retrieval models.

5.3. Ablation Study

Table 2. Ablation study on the MS MARCO Passage Ranking dataset, reporting MRR@10 at 16 and 48 bytes per passage (BPP). Baselines: DPQ (Zhang et al., 2021; Chen et al., 2020) and JPQ (Zhan et al., 2021a). RepCONC is built up incrementally from OPQ (Ge et al., 2013) by adding the clustering loss ('+ Clustering'), the uniform clustering constraint ('+ Constraint'), and dynamic hard negatives ('+ Dynamic Neg').

This section conducts an ablation study to answer RQ3. We summarize the results in Table 2.

As the results show, the clustering objective helps RepCONC outperform DPQ (Zhang et al., 2021; Chen et al., 2020). Although DPQ also utilizes a similar MSE loss, the gradients of this loss are only back-propagated to the PQ centroids. Therefore, DPQ updates Index Assignments with a trick similar to Batch K-Means (Bottou and Bengio, 1995) instead of clustering. However, Batch K-Means is shown to converge slowly (Sculley, 2010). Besides, the target distribution of document embeddings also changes during training, which makes convergence even harder.

With the help of the uniform clustering constraint, RepCONC outperforms the state-of-the-art JPQ method (Zhan et al., 2021a), which uses fixed Index Assignments generated by OPQ (Ge et al., 2013). This demonstrates that simply adding a clustering loss is risky for retrieval effectiveness and that the constraint helps tackle this problem by keeping the quantized vectors distinguishable. The ranking performance is further improved by employing dynamic hard negatives (Zhan et al., 2021b).

Figure 13. Cluster distribution on MS MARCO Passage Ranking. Clusters are sorted by assignment frequency. Distributions are averaged across different sub-vector blocks. RepCONC-Con indicates RepCONC trained without the constraint.

To further verify that the constraint helps produce balanced clustering results, we plot in Figure 13 the frequency with which each cluster is assigned. Without the constraint, RepCONC-Con generates an unbalanced clustering distribution. With the help of the constraint, the distribution is much more balanced and is similar to that of OPQ (Ge et al., 2013), which uses K-Means for clustering. The distribution is not perfectly uniform because we do not apply the constraint during inference, as discussed in Section 3.6.3.

6. Conclusions

To address the efficiency issues of brute-force DR models, we present RepCONC, which learns discrete representations by modeling quantization as constrained clustering in a joint learning process. The clustering objective requires the document embeddings to be clustered around the quantization centroids and facilitates joint optimization of the PQ parameters and dual-encoders. To tackle the risk that clustering assigns vectors to only a few major centroids and results in indistinguishable quantized vectors, we introduce a uniform clustering constraint that enforces the vectors to be equally quantized to all possible centroids during training. The constrained assignment problem is approximately solved as an instance of the optimal transport problem. In addition to constrained clustering, RepCONC employs the inverted file system (IVF) to enable efficient vector search on CPUs. We conduct experiments on two widely-adopted ad-hoc retrieval benchmarks. Experimental results show that RepCONC significantly outperforms competitive quantization baselines and substantially improves the memory efficiency and time efficiency of DR. It also substantially outperforms various retrieval models in terms of retrieval effectiveness, memory efficiency, and time efficiency. The ablation study demonstrates that constrained clustering is the key to the effectiveness of RepCONC.

References

  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al. (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • L. Bottou and Y. Bengio (1995) Convergence properties of the k-means algorithms. In Advances in Neural Information Processing Systems, pp. 585–592.
  • C. J. Burges (2010) From RankNet to LambdaRank to LambdaMART: an overview. Learning 11 (23-581), pp. 81.
  • T. Chen, L. Li, and Y. Sun (2020) Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pp. 1617–1626.
  • N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees (2020) Overview of the TREC 2019 deep learning track. In Text REtrieval Conference (TREC).
  • M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26, pp. 2292–2300.
  • Z. Dai and J. Callan (2020) Context-aware document term weighting for ad-hoc search. In Proceedings of The Web Conference 2020.
  • Z. Dai and J. Callan (2019) Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
  • L. Gao, Z. Dai, and J. Callan (2021) COIL: revisit exact lexical match in information retrieval with contextualized inverted list. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3030–3042.
  • T. Ge, K. He, Q. Ke, and J. Sun (2013) Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (4), pp. 744–755.
  • Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin (2012) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929.
  • R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar (2020) Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896.
  • S. Hofstätter, S. Lin, J. Yang, J. Lin, and A. Hanbury (2021) Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, pp. 113–122.
  • J. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, and L. Yang (2020) Embedding-based retrieval in Facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2553–2561.
  • P. Indyk and R. Motwani (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613.
  • H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • O. Khattab and M. Zaharia (2020) ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • J. Lin and X. Ma (2021) A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques. arXiv preprint arXiv:2106.14807.
  • S. Lin, J. Yang, and J. Lin (2020) Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Y. A. Malkov and D. A. Yashunin (2018) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (4), pp. 824–836.
  • A. Mallia, O. Khattab, T. Suel, and N. Tonellotto (2021) Learning passage impacts for inverted indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21.
  • R. Nogueira, J. Lin, and A. Epistemic (2019a) From doc2query to docTTTTTquery. Online preprint.
  • R. Nogueira, W. Yang, J. Lin, and K. Cho (2019b) Document expansion by query prediction. arXiv preprint arXiv:1904.08375.
  • Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang (2021) RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5835–5847.
  • S. E. Robertson and S. Walker (1994) Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94, pp. 232–241.
  • D. Sculley (2010) Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk (2021) Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.
  • I. Yamada, A. Asai, and H. Hajishirzi (2021) Efficient passage retrieval with hashing for open-domain question answering. In ACL.
  • J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, and S. Ma (2021a) Jointly optimizing query encoder and product quantization to improve retrieval performance. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management.
  • J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, and S. Ma (2021b) Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, pp. 1503–1512.
  • J. Zhan, J. Mao, Y. Liu, M. Zhang, and S. Ma (2020) RepBERT: contextualized text embeddings for first-stage retrieval. arXiv preprint arXiv:2006.15498.
  • H. Zhang, H. Shen, Y. Qiu, Y. Jiang, S. Wang, S. Xu, Y. Xiao, B. Long, and W. Yang (2021) Joint learning of deep retrieval model and product quantization based embedding index. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, pp. 1718–1722.