
CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos

Current dense retrievers are not robust to out-of-domain and outlier queries, i.e. their effectiveness on these queries is much poorer than what one would expect. In this paper, we consider a specific instance of such queries: queries that contain typos. We show that a small character level perturbation in queries (as caused by typos) highly impacts the effectiveness of dense retrievers. We then demonstrate that the root cause of this resides in the input tokenization strategy employed by BERT. In BERT, tokenization is performed using the WordPiece tokenizer and we show that a token with a typo will significantly change the token distributions obtained after tokenization. This distribution change translates to changes in the input embeddings passed to the BERT-based query encoder of dense retrievers. We then turn our attention to devising dense retriever methods that are robust to such queries with typos, while still being as performant as previous methods on queries without typos. For this, we use CharacterBERT as the backbone encoder and an efficient yet effective training method, called Self-Teaching (ST), that distills knowledge from queries without typos into the queries with typos. Experimental results show that CharacterBERT in combination with ST achieves significantly higher effectiveness on queries with typos compared to previous methods. Along with these results and the open-sourced implementation of the methods, we also provide a new passage retrieval dataset consisting of real-world queries with typos and associated relevance assessments on the MS MARCO corpus, thus supporting the research community in the investigation of effective and robust dense retrievers. Code, experimental results and dataset are made available at https://github.com/ielab/CharacterBERT-DR.

1. Introduction

Neural ranking models have been shown to provide remarkable effectiveness improvements compared to traditional information retrieval (IR) methods (Lin et al., 2021b). These ranking models rely on heavily pre-trained deep language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Notably, increasing effort has been devoted to developing effective dense retrievers (DRs) (Karpukhin et al., 2020; Xiong et al., 2020; Gao et al., 2021a; Zhan et al., 2020; Khattab and Zaharia, 2020; Gao and Callan, 2021a, b; Zhan et al., 2021; Ren et al., 2021a, b; Qu et al., 2021). DRs use BERT to encode both queries and documents into low dimensional dense vectors. Relevance between queries and documents is estimated by similarity matching methods (e.g., cosine similarity) between the dense vectors. Prior work has shown that DRs are much more effective than traditional bag-of-words retrieval methods. Arguably, this is due to DRs’ ability to address the vocabulary mismatch problem: i.e., when queries and the corresponding relevant documents use different but semantically equivalent terms.

Recent studies however have also highlighted problems with current DRs that relate to their robustness to out-of-distribution queries, that is, queries that have uncommon traits compared to those that form the bulk of data used at training and for which DRs obtain unexpectedly low effectiveness compared to in-distribution queries. For example, Sciavolino et al. (Sciavolino et al., 2021) have shown DRs have poor effectiveness on queries that contain entities, while Arabzadeh et al. (Arabzadeh et al., 2021) showed poor effectiveness on queries with uncommon terms or complex queries. Another type of queries for which DRs are not robust, and that are the focus of this paper, are queries that contain typos (Zhuang and Zuccon, 2021; Penha et al., 2021; Wu et al., 2021; Arabzadeh et al., 2021). Zhuang and Zuccon (Zhuang and Zuccon, 2021) systematically simulated a wide range of typos in English queries and found that just injecting one character-level typo into MS MARCO queries dramatically decreases the effectiveness of DRs. Interestingly, they also found that DRs are more sensitive to typos in queries than bag-of-words methods (BM25). They further suggested that queries containing typos give rise to a specific type of vocabulary mismatch and questioned the ability of DRs to deal with this problem.

Figure 1. Our CharacterBERT + Self-Teaching training approach.

In this paper, we revisit the DRs’ lack of robustness on queries with typos by providing an in-depth analysis of this problem, including unveiling why this occurs, and a concrete solution for addressing this problem in DRs. Specifically, we demonstrate that DRs are not robust to typos in queries because of the limitations of BERT’s WordPiece tokenizer (Wu et al., 2016): a token with a typo will significantly change the token distributions obtained after tokenization – this dramatically changes the token embeddings used as input to the DRs’ query encoder, in turn resulting in a very different query embedding than the original one (obtained if typos were not present). For example, the word “information” is mapped to a single token by the BERT tokenizer. Thus, the input passed to the DRs’ query encoder for this word is a single embedding. However, if this word is misspelled into “infromation” it is then split by the tokenizer into four sub-tokens: [‘in’, ‘fr’, ‘oma’, ‘tion’]. As a consequence, the input passed to the DR’s query encoder for this word is composed of four token embeddings (also with additional position embeddings), which are very different from the single embedding for “information”. Then, the encoding produced by the DRs’ query encoder for the query “infromation” is understandably different from that for “information” – and empirically this is shown to significantly affect effectiveness.
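To make the effect concrete, the WordPiece behaviour can be inspected directly with the Hugging Face tokenizer. The snippet below is a minimal sketch; the exact sub-tokens produced for a misspelling depend on the vocabulary of the checkpoint used.

```python
from transformers import BertTokenizer

# Standard BERT WordPiece tokenizer (bert-base-uncased vocabulary).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A correctly spelled word is typically mapped to a single token...
print(tokenizer.tokenize("information"))    # e.g. ['information']

# ...while its misspelled variant is split into several sub-word pieces, so the
# query encoder receives a very different sequence of input embeddings.
print(tokenizer.tokenize("infromation"))    # e.g. ['in', '##fr', '##oma', '##tion'] (vocabulary-dependent)
```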

To overcome this robustness issue when answering queries with typos, and based on the above observation, we create a CharacterBERT (El Boukkouri et al., 2020) based DR query and passage encoder. CharacterBERT is a variant of BERT in which the WordPiece tokenizer is replaced by a Character-CNN (Peters et al., 2018; Jozefowicz et al., 2016) module that constructs token embeddings from the tokens’ characters. Empirically, we show that simply replacing BERT-based DR encoders with CharacterBERT-based ones can improve the robustness of the DRs on queries with typos (while not hurting the effectiveness on queries without typos). However, this simple strategy is still not enough to bridge the gap in effectiveness between queries with and without typos: effectiveness differences are still considerably large. To further improve the effectiveness of DRs on queries with typos, we also propose a novel adversarial training method for DRs called Self-Teaching (ST), illustrated in Figure 1. Our ST method takes inspiration from methods in Knowledge Distillation (KD) (Hinton et al., 2015). ST distills the model’s knowledge constructed when training on queries without typos into the model’s knowledge for queries with typos. The fundamental difference between standard KD training and our proposed ST training is that the student model in our ST training is at the same time also the teacher model. This gives our ST training a key advantage: there is no need to have a well trained teacher model in advance or to jointly train two different student and teacher models, thus resulting in savings of computational resources.

Our empirical evaluation, which builds upon Zhuang and Zuccon’s synthetic typo query generation (Zhuang and Zuccon, 2021), shows that dense retrievers trained with CharacterBERT + ST significantly outperform baseline DRs trained with standard BERT and an augmentation-based training method (proposed in (Zhuang and Zuccon, 2021)). In addition to the synthetic typo query evaluation, we also compile a new dataset, DL-typo, with human relevance judgements and characterised by the availability of query pairs (without and with typos), where queries with typos are not synthetically produced but are real queries mined from a large search engine query log. This new dataset contributes real-world data related to queries with typos that researchers can use to investigate the robustness of dense retrievers.

Novel Contributions. We make the following contributions:

  1. We provide a thorough analysis of the behaviour of BERT-based dense retrievers on queries with typos. We then identify the reasons behind this behaviour, and in particular the loss in effectiveness.

  2. We propose to use CharacterBERT as the backbone of the dense retrievers’ bi-encoder in a bid to improve effectiveness on queries that contain typos.

  3. We propose a new Self-Teaching adversarial training regime that distills the score distribution of queries that do not contain typos into that of the corresponding queries with typos: this is aimed at further improving the effectiveness of dense retrievers on queries with typos.

  4. We create a new dataset based on MS MARCO passage ranking data that contains 60 query pairs and corresponding relevance assessments. Each pair is constituted by a real query with typos extracted from a large search engine query log and the corresponding query where the typos have been corrected.

  5. We thoroughly evaluate the proposed methods on an array of datasets for passage ranking, including datasets where typos are synthetically generated in a controlled manner, and our dataset with real queries with typos. The experimental results suggest that our CharacterBERT-based dense retriever with Self-Teaching outperforms existing dense retriever methods for dealing with queries with typos.

  6. We further compare the effectiveness of the proposed methods for dense retrievers against state-of-the-art spelling correction methods (overlooked in previous work on dense retrievers). These results help contextualise the proposed dense retriever solutions, tease out their advantages and limitations compared to alternative pipelines with spell-checkers, and chart directions for future work.

2. Related Works

Effectiveness of Dense Retrievers. Pre-trained transformer-based language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019) from NLP have been adapted to IR tasks such as passage retrieval and have demonstrated much higher effectiveness than traditional bag-of-words methods (e.g., BM25) (Lin et al., 2021b). Among the retrieval methods that exploit PLMs, dense retrievers have attracted the most attention. Karpukhin et al. (2020) first demonstrated that simply leveraging hard negatives sampled from BM25 top retrieved passages is sufficient for learning a fairly effective BERT-based DR. Follow-up work has focused on designing more complex hard negative mining strategies to further improve effectiveness (Xiong et al., 2020; Zhan et al., 2021; Qu et al., 2021). Xiong et al. (2020) proposed to sample hard negatives from an asynchronously updated Approximate Nearest Neighbor (ANN) index, giving rise to the ANCE model. Similar to ANCE, Zhan et al. (2021) suggested that static hard negative sampling is risky and leads to learning a suboptimal DR. They then proposed to dynamically update the query encoder so that hard negatives are sampled according to the latest query embeddings: this is the ADORE method. Training dense retrievers with hard negatives is however computationally expensive. To maximise the GPUs’ computational power, Qu et al. (2021) proposed to leverage negative samples across different GPUs, thus greatly enlarging the number of negative samples per training query while maintaining training efficiency as GPU memory locality is exploited: this is the training pipeline of the RocketQA model. The availability of such a large number of negative samples significantly improves ranking effectiveness. However, this training pipeline still requires a large number of GPUs and memory. As an alternative, Gao et al. (2021b) introduced a gradient caching technique that allows the RocketQA training pipeline to be run within a single GPU. In this paper, we adopt the hard negative training approach of Karpukhin et al. (2020) to efficiently train an effective dense retriever, and leave the application of our methods on more complex dense retrievers training pipelines to future work.

Another technique widely used to improve the effectiveness of dense retrievers is Knowledge Distillation (KD) (Hofstätter et al., 2020; Lin et al., 2020, 2021; Ren et al., 2021b; Izacard and Grave, 2020), where a DR is obtained by learning from a stronger teacher ranker. Lin et al. (2020, 2021) showed that an effective single vector-based dense retriever can be learnt from the late-interaction ColBERT teacher model. Others have suggested distilling knowledge from expensive but more effective cross-encoder-based rerankers to yield effective bi-encoder-based DRs (Hofstätter et al., 2020; Izacard and Grave, 2020). On the other hand, Ren et al. (2021b) showed that a bi-encoder-based student retriever and a cross-encoder-based re-ranker teacher model can be trained jointly. Our proposed Self-Teaching training is also inspired by KD but is fundamentally different from KD because it is specifically designed for improving the robustness of dense retrievers to queries with typos and it does not require a specific teacher model, in the sense that the student model itself is also the teacher model.

Robustness of Dense Retrievers. While generally effective, recent research has shown dense retrievers can exhibit poor effectiveness in specific circumstances, thus detracting from their robustness. An analysis of the MS MARCO passage ranking leaderboard results has identified that popular DRs perform poorly on queries with uncommon terms and on complex queries (i.e. those that “require interpretation […] beyond the immediate meaning of the terms” (Arabzadeh et al., 2021) in the queries). Coincidentally, Mackie et al. (2021) introduced a framework for identifying hard queries that challenge DRs. Sciavolino et al. (2021) found that DRs also perform poorly, and worse than traditional bag-of-words methods, on queries that contain entities. On the other hand, Zhuang and Zuccon (2021) investigated the effectiveness of DRs on queries with typos. We build directly upon their findings, and thus we detail their work further. They proposed a synthetic typo generation framework that, given a query without typos, produces a corresponding query containing a realistic typo generated according to a set of rules derived from common types of misspellings found in search logs for English queries. Using these queries they evaluated the drop in effectiveness caused when dense retrievers have to answer a query with typos, rather than the corresponding original (typo-free) query. They found that BERT-based DRs are extremely sensitive to typos: the injection of a single character-level typo into queries dramatically decreases their effectiveness, rendering dense retrievers not robust to typos in queries. We further highlight that research in NLP has also shown that BERT-based methods are not robust to typos (Sun et al., 2020), although this research has not considered DRs and the retrieval task.

While previous research has highlighted limitations in terms of robustness, little work has been done towards strengthening dense retrievers and addressing the underlying problems – noteworthy exceptions are described next. Chen et al. (2021) tackled the robustness of DRs on entity queries and proposed a contrastive training mechanism to learn a DR by imitating the behaviour of a sparse retriever; the results from this are further fused with those from a standard DR to achieve satisfactory effectiveness on both general queries and queries with entities. A drawback of their method is that it requires large quantities of training data and computational resources. The aforementioned study of Zhuang and Zuccon (2021) has investigated another type of queries that challenges the robustness of DRs: queries that contain typos. To address this, they devised an augmentation-based adversarial training method for dense retrievers (called typo-aware training) which, with a set probability, transforms training queries that do not contain typos into queries with typos. Their results show that DRs trained with their augmentation technique are more robust to queries with typos; furthermore, these same models do not show losses in effectiveness when used on queries that do not contain typos. That work however had two limitations: (i) it did not provide insights on why DRs are not effective on queries with typos, and (ii) it only evaluated methods on realistic but synthetically generated typos in queries. Our work builds upon this and it contributes the following new knowledge and resources: (1) an in-depth analysis of the behaviour of BERT-based DRs on queries with typos, (2) insights into why DRs are sensitive to queries with typos, (3) a new training method to further improve the robustness of DRs on typo queries, (4) a new dataset with human relevance judgements and real queries with typos to evaluate the robustness of DRs.

Diff    1      2        3        4      5     6    7    Total
Count   803.9  2,926.6  2,431.6  712.4  91.5  8.2  0.8  6,975
Table 1. Average distribution of tokenization differences for 10 replicas of the MS MARCO dev dataset.

3. Impact of typos on tokenization and query embeddings

In this section we investigate the impact of a typo in a query on dense retrievers, in terms of both representation and effectiveness: this provides the motivations for investigating methods to make dense retrievers robust to typos and for the proposed use of CharacterBERT and ST, along with a clear understanding of why dense retrievers are not effective when queries have typos.

For analysing this problem we rely on the synthetic typo generation process proposed by Zhuang and Zuccon (2021) where, for all queries in MS MARCO dev, typos are randomly inserted in a query token using 5 typo generators: RandInsert, RandDelete, RandSub, SwapNeighbor, and SwapAdjacent (see Section 2.1 of Zhuang and Zuccon (2021) for details about these typo generators). Note the method only inserts one typo per query, i.e. only one word in the query contains a typo, and only words with at least 3 characters and not contained in a standard stopword list are considered as candidates for typo-insertion (because of these constraints, 5 queries out of 6,980 have been discarded). This procedure then generates a dataset based on the MS MARCO dev query set which contains 6,975 pairs of queries: each pair is composed of the original MS MARCO query and the corresponding query with a typo. We repeat this typo generation process 10 times, so that results are less likely to be influenced by the specific tokens chosen for typo-insertion and the type of typos. Results presented next are averaged across the 10 replicas.
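The following is a simplified sketch of what such character-level typo generators might look like (the actual generators of Zhuang and Zuccon (2021) additionally model the keyboard layout for some of the generators and exclude stopwords, which is omitted here):

```python
import random
import string

def rand_insert(word: str) -> str:
    """Insert a random character at a random position."""
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice(string.ascii_lowercase) + word[i:]

def rand_delete(word: str) -> str:
    """Delete one randomly chosen character."""
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def swap_neighbor(word: str) -> str:
    """Swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def inject_typo(query: str) -> str:
    """Apply one randomly picked generator to one random word with >= 3 characters."""
    words = query.split()
    candidates = [j for j, w in enumerate(words) if len(w) >= 3]
    if not candidates:
        return query
    j = random.choice(candidates)
    words[j] = random.choice([rand_insert, rand_delete, swap_neighbor])(words[j])
    return " ".join(words)

print(inject_typo("what is information retrieval"))
```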

We first investigate the consequence a typo in the query has on the behaviour of the BERT WordPiece tokenization. Table 1 reports how many tokens are different between each pair of original query and query with typo. So, for instance, on average across the 10 replicas of this synthetic dataset, for 2,926.6 queries the BERT WordPiece tokenizer produces tokenizations of the queries with typos that differ from the original queries’ tokenizations by 2 tokens: we call this a tokenization difference of 2. In other words, the tokenization for the query with typo has 2 tokens that are different compared to the tokens obtained from the tokenization of the original query. In practice, this often also implies that the tokenizations of queries with typos have more tokens than the corresponding tokenization of the original queries without typos, but this is not always the case. For example, if the original query contains “apple” and the corresponding query with typo has changed this word into “apply”, then the BERT WordPiece tokenizer will output one token for each of these words (i.e. their tokenizations have the same length), but the obtained tokens are different and thus the two queries have a tokenization difference of 1.
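One possible way of computing this tokenization difference for a query pair is sketched below, under the assumption that the difference is counted as the number of token pieces of the typo query that are not matched by token pieces of the original query; the exact counting used in the paper may differ in minor details.

```python
from collections import Counter
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenization_difference(original: str, with_typo: str) -> int:
    """Count WordPiece tokens of the typo query not matched by the original query."""
    orig_tokens = Counter(tokenizer.tokenize(original))
    typo_tokens = Counter(tokenizer.tokenize(with_typo))
    return sum((typo_tokens - orig_tokens).values())

print(tokenization_difference("what is information retrieval",
                              "what is infromation retrieval"))
```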

From the results in Table 1 we note that the majority of queries with typos result in tokenization differences of 2 or 3 tokens. In some rare cases, a typo in a query can even result in a tokenization difference of up to 6 or 7 tokens – importantly, however, no query pair leads to a tokenization difference of zero. Given that the BERT WordPiece tokenizer produces tokenizations for the original MS MARCO dev queries that contain on average 7 tokens, it is self-evident that even a small tokenization difference corresponds to substantially different queries being then encoded by the DRs: a difference of 2–3 tokens corresponds to 29%–43% of the tokens being different. Thus, it is not surprising that in practice BERT-based DRs that rely on the BERT WordPiece tokenizer are extremely sensitive to typos in queries, as these typos can largely change the input tokens passed to the DRs query encoders.

Next we investigate the differences in the representations produced by dense retrievers for original queries and queries with typos, and how these differences in representations lead to differences (and specifically drops) in effectiveness. To measure differences in representations, we measure the cosine similarity between the DR encoding of the original query and the DR encoding of the corresponding query with typos: the smaller the cosine, the higher the difference. We call this encoding similarity. To measure differences in effectiveness, we measure the MRR drop rate ($\Delta_{MRR}$), i.e., the average rate at which the RR score decreases between an original query $q$ and its corresponding query with typos $q'$: $\Delta_{MRR} = \frac{RR(q) - RR(q')}{RR(q)}$. For these experiments, we use as representative method our baseline DR model (labelled StandardBERT-DR) trained on the MS MARCO training data with BM25-based hard negative sampling (see Section 5.1 for details on this baseline). Other DRs give rise to similar observations and are omitted here for clarity.
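Both quantities can be computed as sketched below; this is illustrative only, with the per-query reciprocal ranks and the query encodings assumed to come from whichever dense retriever is being analysed.

```python
import numpy as np

def encoding_similarity(q_vec: np.ndarray, q_typo_vec: np.ndarray) -> float:
    """Cosine similarity between the encodings of the original and the typo query."""
    return float(np.dot(q_vec, q_typo_vec)
                 / (np.linalg.norm(q_vec) * np.linalg.norm(q_typo_vec)))

def mrr_drop_rate(rr_original: list[float], rr_typo: list[float]) -> float:
    """Average rate at which the reciprocal rank drops once the typo is injected."""
    drops = [(o - t) / o for o, t in zip(rr_original, rr_typo) if o > 0]
    return sum(drops) / len(drops)
```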

Results are reported in Figure 2 (solid lines), where differences in representations and differences in effectiveness are analysed across values of tokenization difference (ignore the dashed lines – these are analysed in Section 6). First, we observe that values of $\Delta_{MRR}$ are considerably large, i.e., queries with typos are less effective than the corresponding queries without typos. This is in line with previous work (Zhuang and Zuccon, 2021; Penha et al., 2021; Wu et al., 2021) and occurs regardless of the value of tokenization difference. However, as tokenization difference increases, so does $\Delta_{MRR}$: queries with typos for which their tokenized version largely differs from that of the corresponding original query lead to larger losses in effectiveness. We also note that, while $\Delta_{MRR}$ and tokenization difference grow hand-in-hand, encoding similarity behaves in exactly the opposite way: it decreases as tokenization difference increases. In other words: the less the injected typo affects the BERT WordPiece tokenizer, the smaller the tokenization difference, and thus the more similar the encodings produced for the original query and the query with typo are, resulting in smaller differences (losses) in effectiveness between the original and the typo query – and vice-versa when the injected typos have a high effect on the BERT WordPiece tokenizer. These observations provide us with the following intuition and the motivation for the methods we propose in Section 4: a typo-robust dense retriever needs to produce query encodings that are invariant to the presence of typos in queries.

4. Typo-robust dense retrieval with CharacterBERT and Self-teaching

Section 3 outlined the intuition that typo-robust dense retrievers require tokenization and query encoding mechanisms that are not sensitive to typos: in particular, the query encodings should be invariant to the presence or absence of typos in the query (i.e., the encoding of the query with typos should be the same as, or have minimal differences compared to, the encoding of the corresponding query without typos). Next, we build upon this intuition and propose to use CharacterBERT, which uses a typo insensitive tokenization approach, as encoder in DRs; in addition we propose an efficient Self-Teaching training approach for DRs that specifically aims to render the query encoder invariant to the presence of typos in the query.

4.1. Typo-Robust DRs with CharacterBERT

The CharacterBERT model uses a Character-CNN module to construct the word embeddings that are then passed as input to BERT (El Boukkouri et al., 2020), thus dropping the need for the WordPiece tokenizer used in standard BERT models. The architecture outlined in Figure 1 shows that any word passed as input to CharacterBERT is first split into its characters. Each character is then used to retrieve the corresponding character embedding from a character embedding matrix; the character embeddings are then passed as input to a stack of CNN layers. The outputs of the CNN layers are then aggregated and projected into a single vector representation for each token. These new token representations can serve as context-independent word embeddings which can therefore be combined with position and segment embeddings before being fed into BERT.
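A much-simplified PyTorch sketch of the Character-CNN idea follows. It is illustrative only: the actual CharacterBERT module uses several convolution widths, highway layers and a final projection, and its hyper-parameters differ from the toy values assumed here.

```python
import torch
import torch.nn as nn

class ToyCharacterCNN(nn.Module):
    """Map a word, given as a sequence of character ids, to a single word embedding."""
    def __init__(self, num_chars: int = 262, char_dim: int = 16,
                 num_filters: int = 128, out_dim: int = 768):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        # Convolution over the character sequence captures sub-word patterns.
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=3, padding=1)
        self.proj = nn.Linear(num_filters, out_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, max_word_length)
        x = self.char_emb(char_ids).transpose(1, 2)      # (batch, char_dim, length)
        x = torch.relu(self.conv(x)).max(dim=2).values   # max-pool over characters
        return self.proj(x)                              # one vector per word

word_vec = ToyCharacterCNN()(torch.randint(0, 262, (1, 12)))
print(word_vec.shape)  # torch.Size([1, 768])
```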

The key difference between BERT and CharacterBERT is that, instead of using a set of pre-trained token embeddings obtained from the WordPiece tokenizer, CharacterBERT produces a single embedding for any input word without relying on any pre-defined vocabulary (as WordPiece instead does). This is particularly desirable for our task since with CharacterBERT a query word with a typo will then be represented by a single embedding. This is in contrast to the standard BERT models that rely on the WordPiece tokenizer, where a query word with a typo will often be represented by multiple token embeddings. Thus, a dense retriever encoder implemented with CharacterBERT may be more robust to typos in queries. This expectation also derives from the fact that CharacterBERT has been found to be more robust than BERT to out-of-vocabulary words across a variety of NLP tasks, in particular in specialized domains (El Boukkouri et al., 2020). However, no previous work has investigated how CharacterBERT could help tackle the problem of robustness of dense retrievers when typos are in queries (and no previous work has used CharacterBERT to build a neural ranker).

We propose to adapt CharacterBERT to obtain a typo-robust dense retriever. Specifically, given a traditional bi-encoder DR architecture, we modify it so as to use the [CLS] token embedding output from CharacterBERT to encode both queries and passages into a single vector each, as opposed to relying on BERT for this:

(1)   $\hat{q} = \text{CharacterBERT}_{[CLS]}(q), \qquad \hat{p} = \text{CharacterBERT}_{[CLS]}(p)$

where $\hat{q}$ and $\hat{p}$ are the encoded versions of query $q$ and passage $p$. Note that the reliance on the bi-encoder architecture means passages can be encoded offline at indexing and at query time only the query requires encoding, thus retaining the efficiency advantage that characterises DRs. The ranking results are then constructed by Approximate Nearest Neighbor (ANN) search, where the similarity $s(q, p)$ between query and passage is defined using the dot product:

(2)   $s(q, p) = \hat{q} \cdot \hat{p}$

The proposed method aims to address the impact of typos contained in queries. We do this by encoding queries and documents using CharacterBERT. An alternative approach would have been to only encode queries using CharacterBERT, as queries are the items that are likely to contain typos, and maintain the regular BERT encoder for encoding passages. This renders the architecture, to all intents and purposes, an ‘untied’ bi-encoder, i.e. a bi-encoder where the query and the document encoder do not share parameters – and in our case are in fact different encoding methods. However, in our initial experiments, we found that this ‘untied’ architecture performs significantly worse than just using a single CharacterBERT to encode both queries and passages. Our hypothesis for this behaviour is that CharacterBERT may tend to generate query encodings that are very different to those of the regular BERT, making it harder to train both encoders to generate similar vectors for relevant query-passage pairs.
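A hedged sketch of the tied bi-encoder scoring follows, assuming a hypothetical `encode` callable that maps a list of texts to their [CLS] vectors; in practice the passage vectors are pre-computed offline and searched with an ANN index.

```python
import torch

def score(query: str, passages: list[str], encode) -> torch.Tensor:
    """Dot-product relevance scores between one query and a set of passages.
    `encode` is assumed to map a list of texts to an (n, dim) matrix of [CLS] vectors."""
    q_vec = encode([query])       # (1, dim) -- computed at query time
    p_vecs = encode(passages)     # (n, dim) -- in practice encoded offline at indexing
    return q_vec @ p_vecs.T       # (1, n) similarity scores, as in Eq. (2)
```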

4.2. Knowledge Distillation with Self-Teaching

Knowledge distillation (KD) has become a common strategy for training effective dense retrievers: when using KD, a student DR model can learn from a stronger multi-vector DR teacher (like ColBERT) (Lin et al., 2020) or from a cross-encoder teacher (Hofstätter et al., 2020; Lin et al., 2021; Ren et al., 2021b). However, these KD training methods often require large computational resources because either a well trained teacher model with fixed parameters is needed, or joint training of both student and teacher models is required. In contrast, we design a novel KD training approach to learn typo-robust DRs that goes beyond this computational limitation: in our method, the model itself is both the student and the teacher, and thus only one model is optimized during training. We call this approach Self-Teaching (ST).

The design of our Self-Teaching method is based on the following intuition. A key observation emerges from the findings in Figure 2 and previous works (Zhuang and Zuccon, 2021; Penha et al., 2021; Wu et al., 2021): dense retrievers have much higher effectiveness on a query without typos than on its corresponding version with typos. This implies that dense retrievers can effectively “understand” and model (encode) the query, thus performing a better job in terms of relevance estimation, if the query itself does not contain typos. Hence, to improve the effectiveness of dense retrievers on queries with typos we can inject into the processing of such queries the knowledge the dense retriever has learnt from the corresponding queries without typos.

Specifically, the proposed ST training applies one of many transformations to each query $q$ in the training query set $Q$. Such transformations mimic the common typos that users make in queries: for each query a transformation is randomly picked (in our experiments, each transformation has a uniform probability of being picked; a different distribution may be chosen to mimic the popularity of types of typos, e.g., as observed from query logs for the specific search service) from the set of available transformations and applied to a random word (with constraints on the selection of such word) in the query, to generate a query with a typo $q'$. In terms of transformations, in our experiments we use the typo generators described in Section 3 and also used in previous work (Zhuang and Zuccon, 2021). We note realistic typo generations are language dependent; those we consider are realistic for English. The pair $(q, q')$ shares the same list of candidate passages $\mathcal{P}_q$ that are associated with the original query $q$. Then, the similarity score between $q'$ and any candidate passage $p$ can be computed as $s(q', p)$; similarly $s(q, p)$ can be computed for the original query $q$. These scores are then normalised via Softmax:

(3)   $P(p \mid q) = \frac{\exp(s(q, p))}{\sum_{p_i \in \mathcal{P}_q} \exp(s(q, p_i))}, \qquad P(p \mid q') = \frac{\exp(s(q', p))}{\sum_{p_i \in \mathcal{P}_q} \exp(s(q', p_i))}$

The goal of our ST training is to minimise the difference between the score distribution obtained from the query with the typo ($P(\cdot \mid q')$) and the score distribution obtained from the corresponding query without typo ($P(\cdot \mid q)$, the original query). For this, we minimize the KL-divergence loss:

(4)   $\mathcal{L}_{KL} = \mathrm{KL}\big(\mathrm{sg}(P(\cdot \mid q)) \,\|\, P(\cdot \mid q')\big)$

This KL-divergence loss itself does not provide any relevance signal: it only reduces the score difference obtained by the original query and the query with the typo. Thus, we also provide ground-truth relevance labels to supervise the DR learning process so that it can learn the relevance relationships between queries and passages. Specifically, the candidate passage set $\mathcal{P}_q$ of each query for the MS MARCO dataset (which is used in this paper to train the dense retrievers) contains on average one positive passage $p^+$ and a set of negative passages $\mathcal{P}^-_q$; these labelled passages are used to train with a supervised contrastive cross-entropy loss:

(5)   $\mathcal{L}_{CE} = -\log \frac{\exp(s(q, p^+))}{\exp(s(q, p^+)) + \sum_{p^- \in \mathcal{P}^-_q} \exp(s(q, p^-))}$

Finally, we combine the KL loss and the supervised contrastive loss to form our ST loss $\mathcal{L}_{ST}$, where the stop-gradient function $\mathrm{sg}(\cdot)$ is used to stop the gradients from the score distribution computed for the original query (without typos) in the KL loss. We find that this gives rise to a more stable training and yields slightly higher overall effectiveness.
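A minimal PyTorch sketch of the combined objective is given below. It is illustrative only: it assumes the original and the typo query are scored over the same candidate passage list, that the positive passage sits at index 0, and that the two terms are simply summed; the authors' actual training code is available in their repository.

```python
import torch
import torch.nn.functional as F

def self_teaching_loss(scores_orig: torch.Tensor,
                       scores_typo: torch.Tensor) -> torch.Tensor:
    """scores_orig / scores_typo: (batch, n_candidates) dot-product scores of the
    original query and its typo variant over the same candidate passages;
    the positive passage is assumed to be at index 0 of each candidate list."""
    # Supervised contrastive cross-entropy on the original query (cf. Eq. 5).
    labels = torch.zeros(scores_orig.size(0), dtype=torch.long,
                         device=scores_orig.device)
    ce = F.cross_entropy(scores_orig, labels)

    # KL divergence pulling the typo-query score distribution towards the
    # stop-gradient (detached) distribution of the original query (cf. Eq. 4).
    log_p_typo = F.log_softmax(scores_typo, dim=-1)
    p_orig = F.softmax(scores_orig, dim=-1).detach()   # sg(.)
    kl = F.kl_div(log_p_typo, p_orig, reduction="batchmean")

    return ce + kl
```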

5. Experimental settings

5.1. Baselines and Implementations

The goal of the methods proposed in this paper, namely CharacterBERT-based encoders and Self-Teaching training, is to produce dense retrievers that are as effective as standard dense retrievers on queries without typos, but are far more effective than these on queries that contain typos. To evaluate whether this goal is achieved, we then compare the proposed methods against popular and effective DRs that do not explicitly tackle the problem of queries with typos: ANCE (Xiong et al., 2020) and TCT-ColBERTv2 (Lin et al., 2021). The ANCE method is characterised by hard negatives sampled in an asynchronously iterative way as the DR is trained, combined with a corresponding update of the ANN index. The TCT-ColBERTv2 method is characterised by learning from a strong ColBERT teacher model. Both methods are expensive to train and require high computational resources. Thus, we directly use the corresponding implementations, model checkpoints and pre-built ANN indexes from Pyserini (Lin et al., 2021a); Pyserini is also used to produce baseline BM25 results on all datasets, for completeness.

We highlight that our proposed CharacterBERT and ST methods can be applied on top of any DR, and thus also in combination with the selected benchmark methods ANCE and TCT-ColBERTv2: this would make these benchmark models robust to queries with typos. However, in our experiments we do not do this because of the aforementioned high training costs associated with these DRs – we leave these comparisons to future work.

Finally, we note that ANCE and TCT-ColBERTv2 have a number of carefully designed training “tricks”, specifically aimed at gaining maximum effectiveness on standard (without typo) queries. Because of this, we also implement a vanilla standard BERT dense retriever (StandardBERT-DR), which to a large extent resembles the DPR approach (Karpukhin et al., 2020) (but performs better overall). StandardBERT-DR is then directly comparable to our CharacterBERT dense retriever (CharacterBERT-DR), which is trained in a similar manner, the two architectures differing only in their use of BERT and the WordPiece tokenizer (StandardBERT-DR) vs. the CharacterBERT model (CharacterBERT-DR). For both models, we follow the hard negative training of DPR, use the Tevatron DR training toolkit (Gao et al., 2022) to train from scratch, and use the Asyncval toolkit (Zhuang and Zuccon, 2022) to validate model checkpoints during training. For each query in the MS MARCO training set, we randomly sample 7 hard negative passages from the top 200 passages retrieved by BM25 and one positive passage from the qrels. We set the batch size to 16 and apply in-batch negative sampling to each training sample in the batch, resulting in 127 negatives per training sample (the 7 hard negatives plus the 15 × 8 = 120 passages from the other samples in the batch). We train with the AdamW optimizer, a 5e-6 learning rate and a linear learning rate schedule for 150,000 updates. The training of a model can be finished in about a day on a single Tesla V100 32G GPU. In terms of pre-trained language models, we use the BERT-base-uncased checkpoint provided by the Huggingface transformers library (Wolf et al., 2020) for StandardBERT-DR, and the CharacterBERT checkpoint pre-trained on general domain data obtained from the original CharacterBERT repository (https://github.com/helboukkouri/character-bert) for CharacterBERT-DR.
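As an illustration of the negative sampling described above, the sketch below builds one training sample; the `qrels` and `bm25_top200` inputs are hypothetical stand-ins for the MS MARCO relevance labels and the BM25 candidate lists, and the actual training uses the Tevatron toolkit.

```python
import random

def build_training_sample(query_id: str, qrels: dict, bm25_top200: dict,
                          num_hard_negatives: int = 7) -> dict:
    """One training sample: a query, one positive from the qrels, and hard negatives
    sampled from the BM25 top-200 (excluding known positives)."""
    positives = set(qrels[query_id])
    candidates = [p for p in bm25_top200[query_id] if p not in positives]
    return {
        "query": query_id,
        "positive": random.choice(list(positives)),
        "negatives": random.sample(candidates, num_hard_negatives),
    }

# With a batch of 16 such samples and in-batch negatives, each query is contrasted
# against its 7 hard negatives plus the 15 x 8 = 120 passages of the other samples,
# i.e. 127 negatives per training sample.
```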

In addition, we also compare our methods against the only other method that has been proposed in the literature to deal with queries with typos in the context of dense retrievers, namely the typos-aware training approach (Zhuang and Zuccon, 2021), which we label with + Aug to indicate that it represents a data augmentation method. This method uses realistic typo generators to augment training queries and train standard BERT-based DRs to be typo-robust. Thus, in the experiments, we train StandardBERT-DR and CharacterBERT-DR under the following conditions:

  • StandardBERT-DR+Aug: the original method by  Zhuang and Zuccon (2021);

  • CharacterBERT-DR+Aug: this relies on our proposed CharacterBERT based dense retriever, and uses the augmentation-based typo-aware training approach by  Zhuang and Zuccon (2021);

  • StandardBERT-DR+ST: the BERT-based dense retriever is trained using our proposed Self-Teaching method: thus it is similar to StandardBERT-DR+Aug, where the augmentation-based typo-aware training approach is replaced by Self-Teaching;

  • CharacterBERT-DR+ST: this relies on our proposed CharacterBERT based dense retriever and our Self-Teaching method.

5.2. Datasets and Evaluation

Datasets. We employ four datasets to evaluate the proposed methods. These are the dev query set of the MS MARCO v1 passage ranking dataset (Nguyen et al., 2016), the TREC Deep Learning Track Passage Retrieval Task 2019 (Craswell et al., 2020a) (DL 2019) and 2020 (Craswell et al., 2020b) (DL 2020), and a dataset we compile for this paper (DL-typo). All datasets use the 8.8 million passages released with MS MARCO, and differ in terms of the queries used.

The dev query set of MS MARCO contains 6,980 queries, each with on average one relevant passage per query. DL2019 and DL2020 instead contain 43 and 54 judged queries (from the MS MARCO dev set) but with more complete relevance assessments: on average 215.3 and 210.9 judged passages respectively (they may be relevant or not). For DL2019 and 2020, relevance judgements are graded and range from 0 (not relevant) to 3 (highly relevant). Relevance label 1 indicates passages on-topic but not relevant and hence we conflate these passages to label 0 when computing binary metrics (e.g., MAP, MRR), as per standard practice with these datasets. For these three datasets, we apply the synthetic typo query generation process (Zhuang and Zuccon, 2021) outlined in Section 3: for each query in a dataset, we apply one of a set of transformations that introduce a typo in the query, and to this query with typo we assign the same relevance assessments of the corresponding query without typo. We repeat the typo generation process 10 times for each dataset (i.e. each original query gives rise to 10 variations with typos) to evaluate average effectiveness of models on typo queries, across different types of typos and different query words being affected by typos.

#Queries   Average number of assessed passages per query, w.r.t. relevance label
           0 (Irrelevant)   1 (Related)   2 (Relevant)   3 (Perfect)   Total
60         37.75            11.50         9.67           4.60          63.52
Table 2. Statistics of our DL-typo dataset.

We also compile a purposely built dataset, called DL-typo, to analyse effectiveness on real queries with typos. For this, we sample pairs of queries from a public large-scale query spelling correction corpus released by Hagen et al. (2017): each pair is composed of a query with typo and its corrected version (by human annotators). These queries with typos come from the anonymized AOL query log (Pass et al., 2006). From this corpus we evenly sample 60 queries with typos (and the corresponding corrected queries) across four common typo types, i.e., 15 RandInsert, 15 RandDelete, 15 SwapNeighbor and 15 RandSub (including 3 SwapAdjacent).

We then contribute relevance assessments against passages in MS MARCO. To decide which passages to assess for relevance, we issue the 60 queries with typos and their corresponding corrected queries (hence 120 queries in total) to 7 retrieval models (BM25, StandardBERT-DR, StandardBERT-DR+Aug, StandardBERT-DR+ST, CharacterBERT-DR, CharacterBERT-DR+Aug, CharacterBERT-DR+ST) and retrieve passages from MS MARCO. We then pool the top 10 retrieved passages for each run and exhaustively judge the relevance of the passages in the pool according to the TREC DL judging guidelines (https://trec.nist.gov/data/deep2020.html) using Relevation (Koopman and Zuccon, 2014). Table 2 reports the key statistics of our DL-typo dataset.
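Pooling of this kind can be sketched as follows; run files are assumed, for illustration, to map each query id to a ranked list of passage ids.

```python
def build_judgment_pool(runs: list[dict], depth: int = 10) -> dict:
    """Union of the top-`depth` passages retrieved by each run, per query."""
    pool = {}
    for run in runs:                          # one dict per retrieval model
        for qid, ranked_passages in run.items():
            pool.setdefault(qid, set()).update(ranked_passages[:depth])
    return pool
```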

The training queries of the MS MARCO dataset are used for training all the models in this paper: there are about 0.5 million queries in this training set, and these were logged by the MS Bing search engine; the training dataset has on average one relevant passage per query (i.e., one positive label per query).

Evaluation Metrics.

Following common practice, to evaluate the ranking effectiveness of the dense retrievers we use the metrics originally used by the creators of each dataset. For the MS MARCO dev queries dataset, these are Mean Reciprocal Rank at top 10 (MRR@10) and Recall at 1,000 (R@1000). For TREC DL 2019 and 2020, these are Normalized Discounted Cumulative Gain at 10 (nDCG@10), Mean Reciprocal Rank at 1,000 (MRR) and Mean Average Precision (MAP). For these datasets, queries with typos are generated 10 times for each original query without typos: thus, for the evaluation on queries with typos, we report the metrics averaged across the repeated experiments. For our DL-typo dataset, since we follow the TREC DL standard assessment practice, we also use the same evaluation metrics used in TREC DL. Repetitions of queries with typos are not performed for this dataset because its queries are not synthetically generated. Statistically significant differences between methods’ results are detected using a two-tailed paired t-test with Bonferroni correction.
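The significance testing can be reproduced along these lines; per-query scores of the two systems are assumed to be aligned lists, and the Bonferroni correction is applied by scaling the p-value by the number of comparisons performed.

```python
from scipy.stats import ttest_rel

def compare_systems(scores_a: list[float], scores_b: list[float],
                    num_comparisons: int = 1) -> float:
    """Two-tailed paired t-test between per-query scores of two systems,
    with a simple Bonferroni correction for multiple comparisons."""
    _, p_value = ttest_rel(scores_a, scores_b)
    return min(1.0, p_value * num_comparisons)
```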

Queries Methods MS MARCO TREC DL 2019 TREC DL 2020
MRR@10 R@1000 nDCG@10 MRR MAP nDCG@10 MRR MAP
Without Typos a) BM25 .187 .857 .497 .685 .290 .487 .659 .287
b) ANCE (Xiong et al., 2020) .330 .959 .645 .837 .371 .641 .790 .403
c) TCT-ColBERTv2 (Lin et al., 2021) .358 .969 .720 .887 .447 .689 .839 .475
d) StandardBERT-DR .325 .953 .608 .719 .353 .633 .798 .407
e) CharacterBERT-DR .327 .950 .609 .772 .340 .586 .744 .379
f) StandardBERT-DR+Aug (Zhuang and Zuccon, 2021) .325 .951 .620 .803 .347 .629 .794 .398
g) StandardBERT-DR+ST .331 .949 .616 .345 .631 .781 .393
h) CharacterBERT-DR+Aug .331 .949 .634 .343 .612 .843 .398
i) CharacterBERT-DR+ST .325 .950 .643 .845 .340 .606 .827 .390
With Typos j) BM25 .095 .611 .256 .340 .147 .291 .410 .168
k) ANCE (Xiong et al., 2020) .200 .803 .448 .643 .247 .461 .603 .278
l) TCT-ColBERTv2 (Lin et al., 2021) .199 .806 .449 .597 .256 .471 .615 .302
m) StandardBERT-DR .136 .688 .298 .427 .167 .323 .459 .194
n) CharacterBERT-DR .159 .724 .355 .506 .189 .326 .484 .203
o) StandardBERT-DR+Aug  (Zhuang and Zuccon, 2021) .215 .841 .434 .582 .235 .466 .630 .280
p) StandardBERT-DR+ST .228 .856 .443 .615 .241 .481 .629 .284
q) CharacterBERT-DR+Aug .251 .877 .498 .688 .264 .486 .683 .300
r) CharacterBERT-DR+ST .263 .894 .519 .706 .268 .514 .722 .314
Table 3. Results obtained on queries without typos and queries for which typos are obtained synthetically. We repeat the typo generation procedure 10 times; averages are computed first per seed query, and then across queries (distributions for statistical significance computation are formed from the first average). Methods statistically significantly better than others are indicated by superscripts.
Figure 2. Analysis of MRR drop rate ($\Delta_{MRR}$) and encoding similarity for different values of tokenization difference. Shaded area is the standard deviation across queries in the 10 typo variations. Tokenization differences of 6 and 7 are discarded from this analysis as these bins contain too few samples (and thus large variance), although they still fit the trends observed for smaller tokenization difference values.

6. Results

In this section, we first report the results of our proposed methods against baselines on the datasets where synthetic typo generation was used. We then present the results obtained on our DL-typo dataset, which reflect model performance on real user queries with typos. Finally, we compare our approach with an alternative search engine architecture that involves the use of spell-checkers in the query pre-processing steps to identify and correct typos: this pipeline is what many production search engines currently rely on for dealing with typos in queries.

6.1. Results on Queries with Synthetic Typos

The main results on MS MARCO, TREC DL 2019 and 2020 for both queries with and without typos are reported in Table 3.

First we consider methods that do not deal with typos in queries (runs a-d in Table 3) and their effectiveness on queries without typos. We observe that all dense retrievers outperform the bag-of-words baseline (BM25), as expected. ANCE and TCT-ColBERTv2 are more effective than our vanilla StandardBERT-DR: these results are expected as StandardBERT-DR uses a simple hard negative sampling practice while the other two DRs use much more sophisticated, and computationally expensive, training strategies. Even so, StandardBERT-DR achieves MRR@10 and R@1000 similar to ANCE on MS MARCO dev queries (no statistically significant difference), and higher MAP and MRR on TREC DL 2020 (improvements are however not significant).

We now compare StandardBERT-DR and CharacterBERT-DR: these two models are trained with exactly the same settings and negative sampling, and the only difference is the encoder used to encode queries and passages (the results from ANCE and TCT-ColBERTv2, instead, are not directly comparable because of the more sophisticated training regimes used by these methods, and are only provided for contextualisation purposes) – we argue that the encoder of CharacterBERT-DR is more suitable to deal with typos in queries. The two dense retrievers have very similar effectiveness across different metrics and datasets, with the only statistically significant difference obtained on nDCG@10 for TREC DL 2020. These results suggest that, on queries without typos, using CharacterBERT as the core of the dense retrievers’ encoders tends to deliver effectiveness similar to the use of BERT. It is expected, although not empirically proven here, that the adaptation of CharacterBERT to the same training regimes of ANCE and TCT-ColBERTv2 would lead to effectiveness similar to these more advanced methods: we leave confirming this hypothesis to future work.

We then turn our attention to the results obtained on queries without typos when training with methods that attempt to make DRs more robust on queries with typos – namely runs f-i, which are based on StandardBERT-DR or CharacterBERT-DR and the two considered training methods, Aug (Zhuang and Zuccon, 2021) and ST (proposed here). We find that these training strategies, although directed at improving effectiveness on queries with typos, often lead to higher effectiveness on queries without typos, regardless of the encoder type used (however, no difference – be it a gain or a loss – is statistically significant). For example, both CharacterBERT-DR+Aug and CharacterBERT-DR+ST achieve higher nDCG@10 and MRR than CharacterBERT-DR on TREC DL 2019 and 2020. The same trend is observed on TREC DL 2019 when analysing the results for StandardBERT-DR.

Next, we consider queries with typos. The methods that are not explicitly designed to tackle queries with typos, i.e. runs j-m in Table 3, return results that are dramatically lower than those of their counterpart queries without typos (runs a-d). This finding agrees with previous work (Zhuang and Zuccon, 2021), and we stress that it applies also to complex and otherwise highly performing dense retrievers like ANCE and TCT-ColBERTv2. We further note that the use of CharacterBERT alone does not address the problem of robustness to typos: CharacterBERT-DR does perform significantly better than StandardBERT-DR on MS MARCO dev queries with typos, but its overall effectiveness remains poor.

On the other hand, dense retrievers trained using methods that explicitly address robustness on queries with typos (runs o-r) showcase improved effectiveness. Among those, DRs that use our ST method always perform better than those that use the previously proposed Aug method (except for MRR for StandardBERT-DR+ST on TREC DL 2020 – difference not significant). Improvements provided on MS MARCO dev queries by our ST compared to Aug for StandardBERT-DR for all metrics and for CharacterBERT-DR for R@1000 are statistically significant; other differences between the two methods across other datasets are at times large, but not significant. We further note that when either of these two forms of training is employed, CharacterBERT-DR always shows higher effectiveness than StandardBERT-DR on queries with typos. This finding supports our intuition presented in Section 3. To further analyse this aspect, in Figure 2 we also plot the MRR drop rate ($\Delta_{MRR}$) and the encoding similarity of CharacterBERT-DR+ST (dashed lines) with respect to tokenization difference; recall that the solid lines refer to StandardBERT-DR (without any specific training tackling queries with typos) and were analysed in Section 3. Interestingly, our CharacterBERT-DR+ST shows the exact opposite trend to StandardBERT-DR: for queries with typos that resulted in larger tokenization differences (i.e. a larger number of tokens differs from those obtained by the corresponding query without typos) the $\Delta_{MRR}$ of CharacterBERT-DR+ST decreases, while the encoding similarity between the encoded original queries and the encoded queries with typos increases. This analysis further demonstrates that our method is effective at narrowing the gap (in effectiveness and representation) between the queries with and without typos.

A note on query latency and model size. Intuitively, CharacterBERT should result in higher query latency than BERT as it uses Character-CNNs to construct token embeddings, while BERT just uses a lookup table. However, in our experiments, the query latency on GPU of CharacterBERT is just about one millisecond higher than that of BERT. This is due to the fact that the number of token embeddings constructed by the Character-CNNs is usually smaller than the number of token embeddings constructed by the WordPiece tokenizer. Thus, for the same query, the self-attention computation in the BERT transformer layers is often less time consuming for CharacterBERT than for BERT. Another side effect of the Character-CNNs is that CharacterBERT has fewer model parameters than BERT (105M vs 110M) as it does not need to store the token embeddings of BERT’s WordPiece vocabulary: it only has 262 character embeddings and some extra parameters for the CNNs.

Queries Methods DL-typo
nDCG@10 MRR MAP
Without Typos a) BM25 .527 .633 .298
b) ANCE (Xiong et al., 2020) .606 .770 .480
c) TCT-ColBERTv2 (Lin et al., 2021) .650 .890 .565
d) StandardBERT-DR .722 .833 .565
e) CharacterBERT-DR .716 .855 .538
f) StandardBERT-DR+Aug (Zhuang and Zuccon, 2021) .737 .840 .594
g) StandardBERT-DR+ST .725 .827 .583
h) CharacterBERT-DR+Aug .713 .821 .539
i) CharacterBERT-DR+ST .706 .793 .539
With Typos j) BM25 .212 .203 .104
k) ANCE (Xiong et al., 2020) .340 .508 .245
l) TCT-ColBERTv2 (Lin et al., 2021) .310 .421 .221
m) StandardBERT-DR .283 .371 .167
n) CharacterBERT-DR .297 .386 .224
o) StandardBERT-DR+Aug  (Zhuang and Zuccon, 2021) .408 .502 .283
p) StandardBERT-DR+ST .433 .531 .301
q) CharacterBERT-DR+Aug .443 .598 .326
r) CharacterBERT-DR+ST .473 .615 .348
Table 4. DL-typo results. Methods statistically significantly better than others are indicated by superscripts.
Figure 3. How nDCG@10 changes according to the relative frequency of typos in the query set.
Queries Method MS MARCO DL-typo
MRR@10 R@1000 nDCG@10 MRR MAP
Without Typos a) pyspellchecker -> StandardBERT-DR .276 .888 .700 .811 .550
b) MSspellchecker -> StandardBERT-DR .324 .951 .719 .833 .563
c) pyspellchecker -> CharacterBERT-DR .279 .887 .703 .824 .527
d) MSspellchecker -> CharacterBERT-DR .326 .948 .715 .855 .538
e) CharacterBERT-DR+ST .325 .950 .706 .793 .539
With Typos f) pyspellchecker -> StandardBERT-DR .231 .819 .475 .562 .340
g) MSspellchecker -> StandardBERT-DR .303 .920 .716 .833 .559
h) pyspellchecker -> CharacterBERT-DR .234 .821 .462 .573 .339
i) MSspellchecker -> CharacterBERT-DR .305 .930 .714 .855 .539
j) CharacterBERT-DR+ST .263 .894 .473 .615 .348
Table 5. Comparison between CharacterBERT-DR+ST and pipelines that involve spell-checkers. Methods statistically significantly better than others are indicated by superscripts.

6.2. Results on Queries with Real Typos

The results presented so far are obtained with synthetic typo generation. Although this synthetic evaluation has been used in previous works (Zhuang and Zuccon, 2021; Sun et al., 2020; Penha et al., 2021; Wu et al., 2021), synthetically generated typos in queries may differ from those encountered in practice. Our DL-typo dataset allows us to investigate effectiveness on real user queries with typos. Results obtained on DL-typo are reported in Table 4. Note that runs based on StandardBERT-DR and CharacterBERT-DR (i.e., runs d-i and m-r) were pooled to form our dataset (and thus fully assessed at least up to rank 10), while ANCE and TCT-ColBERTv2 were not pooled: thus their evaluation is affected by a larger number of unjudged passages, and direct comparison with the other methods may be unfair (about 20%–30% of the passages in the top-10 of ANCE and TCT-ColBERT are unjudged). Overall, results on DL-typo for StandardBERT-DR and CharacterBERT-DR show similar trends to those on the other considered datasets (Table 3). In particular, on queries without typos, training with Aug or ST delivers results similar to standard training; however they do improve results on queries with typos – and our proposed CharacterBERT-DR+ST achieves the best effectiveness across all metrics for queries with typos. While CharacterBERT-DR based models show slightly worse effectiveness than StandardBERT-DR for queries without typos, we do not find statistically significant differences, except for MAP for StandardBERT-DR+Aug. We further study, on this dataset, what relative percentage of queries with typos needs to be present in a query set for CharacterBERT-DR+ST to be preferable over StandardBERT-DR. This analysis is shown in Figure 3: already when only a small fraction of the overall queries contains typos, CharacterBERT-DR+ST shows higher effectiveness than StandardBERT-DR, and this difference becomes statistically significant as the proportion of queries with typos grows. We note that several previous studies have observed a high rate of typos in search engine query logs (Nordlie, 1999; Spink et al., 2001; Wang et al., 2003; Wilbur et al., 2006; Hagen et al., 2017) and fixing those typos could be very expensive (https://unbxd.com/blog/site-search-and-common-misspellings/). Our CharacterBERT-DR+ST can be a promising solution for this problem.

6.3. CharacterBERT-DR+ST vs Spell-checkers

Next, we compare CharacterBERT-DR+ST with a common pipeline used in search engines for dealing with typos: the pre-processing of queries with spell-checker systems. For this, we employ pyspellchecker (https://github.com/barrust/pyspellchecker), which implements a rule-based spell checking algorithm, and the state-of-the-art Microsoft Bing spell checking API (MSspellchecker, https://docs.microsoft.com/en-us/azure/cognitive-services/bing-spell-check/overview), which leverages deep learning and statistical machine translation, along with a large amount of crawled data and search engine interactions, to provide accurate and contextual corrections. Unlike CharacterBERT-DR+ST, which tackles typos in queries in an end-to-end manner, the use of spell-checkers introduces an extra step: the spelling correction algorithm is first applied to the query to identify and correct possible typos, and the corrected query is then issued to the retrieval model. Empirical comparison is reported with respect to MS MARCO dev queries and DL-typo.
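Such a spell-checker pipeline can be sketched with the pyspellchecker library as follows; `dense_retriever.search` is a hypothetical stand-in for whichever retrieval model consumes the corrected query.

```python
from spellchecker import SpellChecker   # pip install pyspellchecker

spell = SpellChecker()

def correct_query(query: str) -> str:
    """Replace each word flagged as unknown by the spell-checker with its top correction."""
    words = query.split()
    misspelled = spell.unknown(words)
    return " ".join((spell.correction(w) or w) if w in misspelled else w
                    for w in words)

corrected = correct_query("what is infromation retrieval")
# results = dense_retriever.search(corrected)   # hypothetical downstream retriever
print(corrected)
```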

Empirical results are reported in Table 5. When queries do not contain typos and the MS MARCO dataset is considered, CharacterBERT-DR+ST and the MSspellchecker have similar effectiveness, and both are significantly better than pyspellchecker. In particular, on these queries pyspellchecker fails because it incorrectly identifies typos in queries where there are none, thus hurting the effectiveness of the dense retriever. The methods show no statistically significant differences on the queries without typos in DL-typo. When queries with typos are considered, MSspellchecker has much higher effectiveness than the other methods. However, we note that MSspellchecker is the most expensive method among those considered in terms of both training and inference, and is likely fine-tuned on a significantly larger amount of data and labels than the other considered methods. Our CharacterBERT-DR+ST exhibits higher effectiveness than pyspellchecker on both datasets (statistically significant on MS MARCO), with the only exception that its nDCG@10 is slightly worse than that of pyspellchecker -> StandardBERT-DR on DL-typo (not statistically significant). This suggests that our end-to-end method is better than a pipeline with a rule-based spell-checker: it achieves similar, if not higher, effectiveness at a lower computational cost. Moreover, an end-to-end system such as CharacterBERT-DR+ST, as opposed to a pipeline that chains a spell-checker and a separate dense retriever, presents a key engineering advantage: a simpler, less complex system, with fewer components that need to be maintained, monitored and updated.

7. Conclusions

We extensively analysed the behaviour of BERT-based dense retrievers on queries that contain typos and identified the underlying reasons for which these methods are not robust to such queries. Specifically, we showed that BERT's WordPiece tokenizer can dramatically change the input token distributions of the query encoder in the presence of typos in the query, thereby negatively impacting the downstream search task. To overcome this issue, we proposed to use CharacterBERT in combination with a Self-Teaching (ST) adversarial training strategy for learning dense retrievers that are embedding-invariant w.r.t. queries with and without typos. Our empirical results demonstrated that dense retrievers trained with the proposed CharacterBERT and ST are much more robust to typos in queries, while not deteriorating effectiveness on queries without typos. In addition, our method shows the potential to replace traditional spell-checkers used in retrieval pipelines, thus simplifying the retrieval system and lowering engineering and maintenance costs. However, we also observe that gaps remain compared to more sophisticated, heavily trained deep learning based spell-checkers, highlighting that further improvements are required before an end-to-end dense retriever can replace search engine pipelines that use these more complex spell-checkers. Furthermore, applying the proposed CharacterBERT and ST to more advanced training approaches such as TCT-ColBERTv2 (not investigated here) may well lead to even stronger typo-robust dense retrievers. Code, experimental results and the DL-typo dataset are publicly available at https://github.com/ielab/CharacterBERT-DR.

Acknowledgements.
This research is partially funded by the Grain Research and Development Corporation project AgAsk (UOQ2003-009RTX). We thank Ahmed Mourad, Harry Scells and Shuai Wang from UQ ielab, and Yun Wen, for helping assemble the DL-typo dataset.

References

  • N. Arabzadeh, B. Mitra, and E. Bagheri (2021) MS marco chameleons: challenging the ms marco leaderboard with extremely obstinate queries. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4426–4435. Cited by: §1, §2.
  • X. Chen, K. Lakhotia, B. Oğuz, A. Gupta, P. Lewis, S. Peshterliev, Y. Mehdad, S. Gupta, and W. Yih (2021) Salient phrase aware dense retrieval: can a dense retriever imitate a sparse one?. arXiv preprint arXiv:2110.06918. Cited by: §2.
  • N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees (2020a) Overview of the TREC 2019 deep learning track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (NIST Special Publication), Cited by: §5.2.
  • N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees (2020b) Overview of the TREC 2020 deep learning track. In Proceedings of the Twenty-Ninth Text REtrieval Conference (NIST Special Publication), Cited by: §5.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2.
  • H. El Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, and J. Tsujii (2020) CharacterBERT: reconciling ELMo and BERT for word-level open-vocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6903–6915. Cited by: §1, §4.1, §4.1.
  • L. Gao and J. Callan (2021a) Condenser: a pre-training architecture for dense retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pp. 981–993. Cited by: §1.
  • L. Gao and J. Callan (2021b) Unsupervised corpus aware language model pre-training for dense passage retrieval. arXiv preprint arXiv:2108.05540. Cited by: §1.
  • L. Gao, Z. Dai, T. Chen, Z. Fan, B. Van Durme, and J. Callan (2021a) Complementing lexical retrieval with semantic residual embedding. In 43rd European Conference on IR Research, ECIR 2021, Cited by: §1.
  • L. Gao, X. Ma, J. J. Lin, and J. Callan (2022) Tevatron: an efficient and flexible toolkit for dense retrieval. ArXiv abs/2203.05765. Cited by: §5.1.
  • L. Gao, Y. Zhang, J. Han, and J. Callan (2021b) Scaling deep contrastive learning batch size under memory limited setup. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pp. 316–321. Cited by: §2.
  • M. Hagen, M. Potthast, M. Gohsen, A. Rathgeber, and B. Stein (2017) A large-scale query spelling correction corpus. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1261–1264. Cited by: §5.2, §6.2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
  • S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, and A. Hanbury (2020) Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666. Cited by: §2, §4.2.
  • G. Izacard and E. Grave (2020) Distilling knowledge from reader to retriever for question answering. arXiv preprint arXiv:2012.04584. Cited by: §2.
  • R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410. Cited by: §1.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6769–6781. Cited by: §1, §2, §5.1.
  • O. Khattab and M. Zaharia (2020) Colbert: efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48. Cited by: §1.
  • B. Koopman and G. Zuccon (2014) Relevation! an open source system for information retrieval relevance assessment. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 1243–1244. Cited by: §5.2.
  • J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, and R. Nogueira (2021a) Pyserini: a python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2356–2362. Cited by: §5.1.
  • J. Lin, R. Nogueira, and A. Yates (2021b) Pretrained transformers for text ranking: bert and beyond. Synthesis Lectures on Human Language Technologies 14 (4), pp. 1–325. Cited by: §1, §2.
  • S. Lin, J. Yang, and J. Lin (2020) Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386. Cited by: §2, §4.2.
  • S. Lin, J. Yang, and J. Lin (2021) In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), Online, pp. 163–173. Cited by: §2, §4.2, §5.1, Table 3, Table 4.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §2.
  • I. Mackie, J. Dalton, and A. Yates (2021) How deep is your learning: the dl-hard annotated deep learning dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2335–2341. External Links: ISBN 9781450380379 Cited by: §2.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS marco: a human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (CEUR Workshop Proceedings, Vol. 1773), Cited by: §5.2.
  • R. Nordlie (1999) “User revealment”—a comparison of initial queries and ensuing question development in online searching and in human reference interactions. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 11–18. Cited by: §6.2.
  • G. Pass, A. Chowdhury, and C. Torgeson (2006) A picture of search. In Proceedings of the 1st international conference on Scalable information systems, Cited by: §5.2.
  • G. Penha, A. Câmara, and C. Hauff (2021) Evaluating the robustness of retrieval pipelines with query variation generators. arXiv preprint arXiv:2111.13057. Cited by: §1, §3, §4.2, §6.2.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Cited by: §1.
  • Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang (2021) RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5835–5847. Cited by: §1, §2.
  • R. Ren, S. Lv, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, H. Wang, and J. Wen (2021a) PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2173–2183. Cited by: §1.
  • R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, H. Wang, and J. Wen (2021b) RocketQAv2: a joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2825–2835. Cited by: §1, §2, §4.2.
  • C. Sciavolino, Z. Zhong, J. Lee, and D. Chen (2021) Simple entity-centric questions challenge dense retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6138–6148. Cited by: §1, §2.
  • A. Spink, D. Wolfram, M. B. Jansen, and T. Saracevic (2001) Searching the web: the public and their queries. Journal of the American society for information science and technology 52 (3), pp. 226–234. Cited by: §6.2.
  • L. Sun, K. Hashimoto, W. Yin, A. Asai, J. Li, P. Yu, and C. Xiong (2020) Adv-bert: bert is not robust on misspellings! generating nature adversarial samples on bert. arXiv preprint arXiv:2003.04985. Cited by: §2, §6.2.
  • P. Wang, M. W. Berry, and Y. Yang (2003) Mining longitudinal web queries: trends and patterns. Journal of the american Society for Information Science and technology 54 (8), pp. 743–758. Cited by: §6.2.
  • W. J. Wilbur, W. Kim, and N. Xie (2006) Spelling correction in the pubmed search engine. Information retrieval 9 (5), pp. 543–564. Cited by: §6.2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. Cited by: §5.1.
  • C. Wu, R. Zhang, J. Guo, Y. Fan, and X. Cheng (2021) Are neural ranking models robust?. arXiv preprint arXiv:2108.05018. Cited by: §1, §3, §4.2, §6.2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk (2020) Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations, Cited by: §1, §2, §5.1, Table 3, Table 4.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32. Cited by: §2.
  • J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, and S. Ma (2021) Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Cited by: §1, §2.
  • J. Zhan, J. Mao, Y. Liu, M. Zhang, and S. Ma (2020) RepBERT: contextualized text embeddings for first-stage retrieval. arXiv preprint arXiv:2006.15498. Cited by: §1.
  • S. Zhuang and G. Zuccon (2021) Dealing with typos for bert-based passage retrieval and ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2836–2842. Cited by: §1, §1, §2, §2, §3, §3, §4.2, §4.2, 1st item, 2nd item, §5.1, §5.2, Table 3, §6.1, §6.1, §6.2, Table 4, footnote 1.
  • S. Zhuang and G. Zuccon (2022) Asyncval: a toolkit for asynchronously validating dense retriever checkpoints during training. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Cited by: §5.1.